Voice recognition is a hot topic; I am personally fascinated, sometimes staggered, by these technologies. The critical business actors investing in top-notch trendy technologies such as the IoT (Internet of Things) are ready to spend on it. The trend is that most engineers are trained on technologies that no longer exist, for jobs that have not been created yet. Thinking outside the box, we could observe that the evil-minded developer is ready to use this too: voice recognition software has reportedly been widely used by intelligence agencies (if not, that is just a side effect of the movie industry), likely to stop the “bad guy” who entices a lyrical return to the stone age for his place in history. And if disaster does occur, the stakeholders will debate the ethics at length, to strengthen the evidence of their intellectual honesty. Voice recognition is also a market where some investors are waiting for the inflexion point to invest: once a software idea is proposed, the “proprietary” source code is continuously developed until another, “less proprietary”, software beats the charts. That is the point of this article: I am going to introduce the app I created two days ago. That being said, my purpose is neither to defend a “political” opinion about this technology nor to prove my intellectual autonomy. I simply share this article in the hope that it will be useful, without any warranty except this one: knowledge must be free, even if time is money and every work deserves fair remuneration. That is where education and open source start.
What is an STT/TTS engine?
STT means Speech To Text: an STT engine transforms speech into text, while a TTS engine does the opposite and transforms text into speech. I will not debate which technologies are the most effective for which usage; we will stay focused on web technologies and innovation. For a webpage, I have tested annyang (by TalAter). Although its development was thought to be arrested, the project is still active, and it effectively fulfills the STT purpose in Google Chrome. It is a JavaScript library created to wire voice commands to Google Chrome’s speech API (STT). If you wish to look at the existing web STT engines, there is a small JavaScript library listing here (5 voice control JavaScript libraries for developers).
Speech Recognition API browser compatibility today

IoT
Another global project, based mainly on IoT-compliant systems (Raspbian, etc.), is called Jasper. This AIO (all-in-one) includes multiple STT and TTS (text-to-speech) engines, listed below.
STT
- Pocketsphinx is an open-source speech decoder by the CMU Sphinx project. It is fast and designed to work well on embedded systems (like the Raspberry Pi). Unfortunately, the recognition rate is not the best and it has a lot of dependencies. On the other hand, recognition is performed offline, i.e. you do not need an active internet connection to use it. It is the right choice if you are cautious with your personal data.
- Google STT is the speech-to-text system by Google. If you have an Android smartphone, you might already be familiar with it, because it’s basically the same engine that performs recognition if you say OK, Google. It can only transcribe a limited amount of speech a day and needs an active internet connection.
- AT&T STT is a speech decoder by the telecommunications company AT&T. Like Google Speech, it also performs decoding online and thus needs an active internet connection.
- Wit.ai STT relies on the wit.ai cloud services and uses crowdsourcing to train speech recognition algorithms. Like you’d expect from a cloud service, you also need an active internet connection.
- Julius is a high-performance open source speech recognition engine. It does not need an active internet connection. Please note that you will need to train your own acoustic model, which is a very complex task that we do not provide support for. Regular users are most likely better suited with one of the other STT engines listed here.
TTS
- eSpeak is a compact open-source speech synthesizer for many platforms. Speech synthesis is done offline, but most voices can sound very “robotic”.
- Festival uses the Festival Speech Synthesis System, an open-source speech synthesizer developed by the Centre for Speech Technology Research at the University of Edinburgh. Like eSpeak, it also synthesizes speech offline.
- Flite uses CMU Flite (festival-lite), a lightweight and fast synthesis engine that was primarily designed for small embedded machines. It synthesizes speech offline, so no internet connection is required.
- SVOX Pico TTS was the Text-to-Speech engine used in Android 1.6 “Donut”. It’s an open-source small footprint application and also works offline. The quality is rather good compared to eSpeak and Festival.
- Google TTS uses the same Text-to-Speech API which is also used by newer Android devices. The Synthesis itself is done on Google’s servers, so that you need an active internet connection and also can’t expect a lot of privacy if you use this.
- Ivona TTS uses Amazon’s Ivona Speech Cloud service, which is used in the Kindle Fire. Speech synthesis is done online, so an active internet connection is required, and Amazon has access to everything Jasper says to you.
- MaryTTS is an open-source TTS system written in Java. You need to set up your own MaryTTS server and configure Jasper to use it. Because the server can be hosted on the same machine that runs Jasper, you do not need internet access.
- Mac OS X TTS only works if you are running Jasper on a Mac. It then uses the say command in macOS to synthesize speech.
To install Jasper on an OrangePi running the CTRL+F agency OS:
apt-get install jasper
API VOCALR
The idea was to develop something new, but not from scratch, for lack of time, so I decided to use the annyang library for the STT part.
The JavaScript library annyang is available on GitHub (some links are available at the end of this article); the supported languages are the following:
- Afrikaans: af
- Basque: eu
- Bulgarian: bg
- Catalan: ca
- Arabic (Egypt): ar-EG
- Arabic (Jordan): ar-JO
- Arabic (Kuwait): ar-KW
- Arabic (Lebanon): ar-LB
- Arabic (Qatar): ar-QA
- Arabic (UAE): ar-AE
- Arabic (Morocco): ar-MA
- Arabic (Iraq): ar-IQ
- Arabic (Algeria): ar-DZ
- Arabic (Bahrain): ar-BH
- Arabic (Libya): ar-LY
- Arabic (Oman): ar-OM
- Arabic (Saudi Arabia): ar-SA
- Arabic (Tunisia): ar-TN
- Arabic (Yemen): ar-YE
- Czech: cs
- Dutch: nl-NL
- English (Australia): en-AU
- English (Canada): en-CA
- English (India): en-IN
- English (New Zealand): en-NZ
- English (South Africa): en-ZA
- English (UK): en-GB
- English (US): en-US
- Finnish: fi
- French: fr-FR
- Galician: gl
- German: de-DE
- Hebrew: he
- Hungarian: hu
- Icelandic: is
- Italian: it-IT
- Indonesian: id
- Japanese: ja
- Korean: ko
- Latin: la
- Mandarin Chinese: zh-CN
- Chinese (Traditional, Taiwan): zh-TW
- Chinese (Simplified, China): zh-CN
- Chinese (Simplified, Hong Kong): zh-HK
- Yue Chinese (Traditional, Hong Kong): zh-yue
- Malaysian: ms-MY
- Norwegian: no-NO
- Polish: pl
- Pig Latin: xx-piglatin
- Portuguese: pt-PT
- Portuguese (Brazil): pt-BR
- Romanian: ro-RO
- Russian: ru
- Serbian: sr-SP
- Slovak: sk
- Spanish (Argentina): es-AR
- Spanish (Bolivia): es-BO
- Spanish (Chile): es-CL
- Spanish (Colombia): es-CO
- Spanish (Costa Rica): es-CR
- Spanish (Dominican Republic): es-DO
- Spanish (Ecuador): es-EC
- Spanish (El Salvador): es-SV
- Spanish (Guatemala): es-GT
- Spanish (Honduras): es-HN
- Spanish (Mexico): es-MX
- Spanish (Nicaragua): es-NI
- Spanish (Panama): es-PA
- Spanish (Paraguay): es-PY
- Spanish (Peru): es-PE
- Spanish (Puerto Rico): es-PR
- Spanish (Spain): es-ES
- Spanish (US): es-US
- Spanish (Uruguay): es-UY
- Spanish (Venezuela): es-VE
- Swedish: sv-SE
- Turkish: tr
- Zulu: zu
It is possible to test/work/develop/design the PoC with it: the library uses WebKit’s speech API (webkitSpeechRecognition) included in Google Chrome (server-side based recognition). If you look on GitHub, some ports of the webkitSpeechRecognition function have been made for local usage with PocketSphinx (a C program transformed into JavaScript using Emscripten). This is not actually a W3C standard yet (read further info about the draft written in 2014). Mozilla developers/hackers are trying to get this working locally, using grammar functions and their own engine (with Gecko); the two projects may converge someday.
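To make the underlying mechanism concrete, here is a minimal sketch of how the raw webkitSpeechRecognition API that annyang wraps can be driven. This is not annyang’s source: the guard makes it a no-op outside a supporting browser, and bestTranscript is a hypothetical helper name introduced here for illustration.

```javascript
// Feature-detect the Web Speech API constructor (Chrome exposes the
// prefixed webkitSpeechRecognition; the guard keeps this browser-only).
var SpeechRecognitionCtor =
  (typeof window !== 'undefined') &&
  (window.SpeechRecognition || window.webkitSpeechRecognition);

// Pull the top transcript alternative out of the latest recognition result.
function bestTranscript(event) {
  var last = event.results[event.results.length - 1];
  return last[0].transcript;
}

if (SpeechRecognitionCtor) {
  var recognition = new SpeechRecognitionCtor();
  recognition.lang = 'fr-FR';       // same locale idea as annyang.setLanguage
  recognition.continuous = false;   // stop after one utterance
  recognition.onresult = function (event) {
    console.log('Heard: ' + bestTranscript(event));
  };
  recognition.start();
}
```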
The design
API VOCALR has been designed to be MVC (Model-View-Controller) compliant; the Node.js app uses middleware functions, should you wish to hook into the source code for monitoring purposes. The following is the project tree (without the node_modules folder).
.
├── app
│   ├── controllers
│   │   ├── biorhythm.server.controller.js
│   │   ├── index.server.controller.js
│   │   └── prenoms.server.controller.js
│   ├── routes
│   │   ├── biorhythm.server.routes.js
│   │   ├── index.server.routes.js
│   │   └── prenoms.server.routes.js
│   └── views
│       ├── biorhythm.ejs
│       ├── index.ejs
│       └── prenoms.ejs
├── config
│   └── express.js
├── license.txt
├── package.json
├── public
│   ├── css
│   │   ├── bootstrap.min.css
│   │   ├── bootstrap-theme.min.css
│   │   └── main.css
│   ├── img
│   │   ├── mini_icon_say.png
│   │   ├── mosaic1.png
│   │   └── palette.png
│   ├── js
│   │   ├── annyang.js
│   │   ├── annyang.min.js
│   │   ├── bootstrap.min.js
│   │   ├── highlight.pack.js
│   │   ├── jquery.min.js
│   │   ├── mespeak_config.json
│   │   └── mespeak.js
│   └── voices
│       ├── ca.json
│       ├── cs.json
│       ├── de.json
│       ├── el.json
│       ├── en
│       │   ├── en.json
│       │   ├── en-n.json
│       │   ├── en-rp.json
│       │   ├── en-sc.json
│       │   ├── en-us.json
│       │   └── en-wm.json
│       ├── eo.json
│       ├── es.json
│       ├── es-la.json
│       ├── fi.json
│       ├── fr.json
│       ├── hu.json
│       ├── it.json
│       ├── kn.json
│       ├── la.json
│       ├── lv.json
│       ├── nl.json
│       ├── pl.json
│       ├── pt.json
│       ├── pt-pt.json
│       ├── ro.json
│       ├── sk.json
│       ├── sv.json
│       ├── tr.json
│       └── zh.json
│       └── zh-yue.json
├── README.md
├── server.js
├── tree.txt
└── vocal.siteweb.tld.conf

89 directories, 387 files
As we can see, Express.js with the EJS templating engine is used; the Bootstrap CSS package is included too, for training purposes.
API and testing
Some APIs were added to the project for research/scientific purposes. They all work similarly: they retrieve JSON-encoded content with a jQuery cross-domain request. The Prenoms API PHP source code is available on my repo.
Example :
Cross domain request with jQuery.
function crossdomain(text) {
    // element, result and remotejsonresult are metasyntactic variables.
    var element = document.getElementById('result');
    $.ajax({
        url: "https://siteweb.tld/api?var=" + text + "&callback=?",
        dataType: "jsonp",
        jsonp: "callback",
        success: function(data) {
            element.innerHTML = '<b>the text var in cross domain :</b> ' + data.remotejsonresult;
        }
    });
}
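For context on why the callback=? parameter is there: a JSONP endpoint does not return raw JSON but the payload wrapped in a call to the client-supplied callback name, which the browser then executes as a script. A minimal sketch of a hypothetical server-side counterpart (jsonpWrap is an illustrative name, not part of the Prenoms API), written in JavaScript for illustration:

```javascript
// Build a JSONP response body: the JSON payload wrapped in a call to
// the callback name that jQuery generated and sent as ?callback=...
function jsonpWrap(callbackName, payload) {
  return callbackName + '(' + JSON.stringify(payload) + ');';
}

// e.g. for ?callback=jQuery123&var=hello the server would send back:
// jQuery123({"remotejsonresult":"hello"});
```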
AnnyangJS
Load this script in the head, to be executed at localhost:serverport (works with an internet connection).
<script>
// loading of the voice recognition script, annyang.js or annyang.min.js
var annyangScript = document.createElement('script');
if (/localhost/.exec(window.location)) {
    annyangScript.src = "//vocal.ctrlfagency.com/js/annyang.js"
} else {
    annyangScript.src = "//vocal.ctrlfagency.com/js/annyang.min.js"
}
document.write(annyangScript.outerHTML)
</script>
Force HTTPS redirection :
// condition that forces the page to load over https
if (window.location.protocol != "https:")
    window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
Define the hello function after annyang has loaded. The hello function, as programmed below, uses the jQuery scrollTo/slideDown functions to display the content of a p or div element, selected by its class.
"use strict"; // On s'assure que annyang est chargé avec succès if (annyang) { // on définit la commande hello. var hello = function() { $(".hello").slideDown("slow"); $(".voice_instructions_after").slideDown("slow"); scrollTo("#section_hello"); };
The script below registers the voice recognition command bonjour (monsieur); the string “monsieur” is made optional by the parentheses. The call annyang.debug(); enables debugging of the script (hit F12 to open Google Chrome’s JavaScript console). The function annyang.setLanguage('fr-FR'); sets the language of your STT engine (enter the right locale). Lastly, the scrollTo function’s options are set.
    // load the commands
    var commands = {
        'bonjour (monsieur)': hello
    };

    // enable debug mode, which we can then watch in the console (F12)
    annyang.debug();

    // add the response commands
    annyang.addCommands(commands);

    // set the language for voice recognition (English is the default)
    annyang.setLanguage('fr-FR');

    // start listening mode; on failure, the unsupported error message is shown
    annyang.start();

    // set the scrollTo speed
    var scrollTo = function(identifier, speed) {
        $('html, body').animate({
            scrollTop: $(identifier).offset().top
        }, speed || 1000);
    }
}
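Beyond optional words in parentheses, annyang’s command syntax also supports named parameters (:name) and greedy splats (*term), which are passed to the callback as arguments. A hedged sketch of these patterns, with the callbacks factored into plain helper functions (greet and search are illustrative names, not part of the app):

```javascript
// Pure helpers so the command logic is testable outside the browser.
function greet(name) {
  return 'bonjour ' + name + ' !';
}
function search(term) {
  return 'recherche: ' + term;
}

// annyang only exists in the browser; guard so this is a no-op elsewhere.
if (typeof annyang !== 'undefined' && annyang) {
  annyang.addCommands({
    'bonjour :name': function (name) { console.log(greet(name)); },   // named parameter
    'cherche *term': function (term) { console.log(search(term)); },  // greedy splat
    'merci (beaucoup)': function () { console.log('de rien !'); }     // optional word
  });
}
```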
The relevant HTML:
<div id="section_hello"> <p><em>Essayez : </em></p> <p class="voice_instructions">Dites "bonjour!"</p> <p class="hello">salut!</p> </div>
The relevant CSS:
#section_hello {
    padding: 3% 10%;
    font-size: 20px;
    line-height: 30px;
    font-weight: 400;
    background-color: #2c699e;
    color: #fff;
}

p.voice_instructions {
    background: url('../img/mini_icon_say.png') no-repeat 0 6px;
    padding-left: 24px;
}

p.hello {
    display: none;
    font-size: 2.5em;
    font-weight: 600;
}
meSpeak.js
The meSpeak.js library is included in the API VOCALR project. It provides a voice synthesizer for TTS (text-to-speech). As described in its about section, meSpeak.js (modulary enhanced speak.js) is a client-side JavaScript text-to-speech library based on the speak.js project. meSpeak.js adds support for WebKit and Safari and introduces loadable voice modules. There is also no more need for an embedding HTML element.
Usage :
meSpeak.speak( text [, { option1: value1, option2: value2 .. } [, callback ]] );
Example :
<script type="text/javascript" src="js/mespeak.js"></script> <script type="text/javascript"> meSpeak.loadConfig("js/mespeak_config.json"); meSpeak.loadVoice("voices/fr.json"); meSpeak.speak("bonjour ça va ? "); </script>
meSpeak.js needs to be adjusted with the amplitude, pitch and voice options for a prettier voice reading. The meSpeak.stop() function stops the text reading while it is in progress.
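As a sketch of such an adjustment: amplitude, pitch, speed and variant are options accepted by the options object of meSpeak.speak. The numeric values below are illustrative starting points to be tuned by ear, and frenchVoiceOptions is a hypothetical helper name, not part of meSpeak:

```javascript
// Build an illustrative options object for meSpeak.speak.
function frenchVoiceOptions(slow) {
  return {
    amplitude: 100,          // volume
    pitch: 55,               // base pitch
    speed: slow ? 130 : 175, // approximate words per minute
    variant: 'f2'            // alternative voice variant
  };
}

// meSpeak only exists in the browser; guard so this is a no-op elsewhere.
if (typeof meSpeak !== 'undefined') {
  meSpeak.speak('bonjour, ça va ?', frenchVoiceOptions(true));
}
```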
Useful links
Once you have taken the time to read this article and test the app introduced, feel free to look for further details below.
- API VOCALR – the app presented in this article, coded by myself.
- Prenoms API – one of the APIs used in VOCALR.
- JASPER.IO (documentation) – the project that proposes voice recognition features for your IoT devices.
- Firefox et l’api WebSpeech – a French article on the state of the WebSpeech feature in Firefox.
- Firefox and the WebSpeech API – the original article about the Firefox WebSpeech API (by Chris David Mills).
- Speech Recognition API – the built-in SpeechRecognition API feature status.
- Article – the 5 voice control JavaScript libraries.
- JSpeech Grammar Format – W3C documentation.
- AnnyangJS – the TalAter JS library.
- Offline speech recognition – PocketSphinx shim (for Google Chrome) and speak-easy synthesis (for Firefox).