
Voice recognition is a hot topic; I am personally fascinated, sometimes staggered, by these technologies. The major business players investing in trendy technologies such as IoT (the Internet of Things) are ready to spend heavily on it. The trend is that most engineers are trained on technologies that no longer exist, for jobs that have not been created yet. Thinking outside the box, one could observe that ill-intentioned developers are ready to exploit it too: voice recognition software has reportedly long been used by intelligence agencies (if not, that is just a side effect of the movie industry), supposedly to stop the "bad guy" who would drag us back to the stone age for his place in history. And if disaster did occur, the stakeholders would debate the ethics at length, to demonstrate their intellectual honesty. Voice recognition is also a market where some investors are waiting for the inflexion point: once a software idea is proposed, the proprietary source code is developed continuously until another, less proprietary, product tops the charts. That is the point of this article: I am going to introduce the app I created two days ago. That being said, my purpose is neither to defend a political opinion about this technology nor to prove my intellectual autonomy. I simply share this article in the hope it will be useful, without any warranty. My one conviction is that knowledge must be free, even though time spent is money and every piece of work deserves fair remuneration; that is where education and open source begin.

What is an STT/TTS engine?

STT means Speech To Text: an STT engine transforms speech into text, while a TTS (Text To Speech) engine does the opposite and transforms text into speech. I will not debate which technologies are the most effective for which usage; we will stay focused on web technologies and innovation. For a webpage I have tested annyang (by TalAter). Although some consider its development stalled, the project is still active, and it handles STT effectively in Google Chrome. It is a JavaScript library created to wire voice commands to Google Chrome's speech API (STT). If you wish to look at existing web STT engines, there is a small JavaScript library listing here (5 voice control javascript libraries for developers).

Speech Recognition API Browser’s compatibility for today

IOT

Another global project, aimed mainly at IoT platforms (Raspbian, etc.), is called Jasper. This AIO (all-in-one) solution includes multiple STT and TTS (text-to-speech) engines, listed below.

STT

  • Pocketsphinx is an open-source speech decoder from the CMU Sphinx project. It is fast and designed to work well on embedded systems (like the Raspberry Pi). Unfortunately, the recognition rate is not the best and it has a lot of dependencies. On the other hand, recognition is performed offline, i.e. you don't need an active internet connection to use it. It is the right choice if you are cautious with your personal data.
  • Google STT is the speech-to-text system by Google. If you have an Android smartphone, you might already be familiar with it, because it’s basically the same engine that performs recognition if you say OK, Google. It can only transcribe a limited amount of speech a day and needs an active internet connection.
  • AT&T STT is a speech decoder by the telecommunications company AT&T. Like Google Speech, it also performs decoding online and thus needs an active internet connection.
  • Wit.ai STT relies on the wit.ai cloud services and uses crowdsourcing to train speech recognition algorithms. Like you’d expect from a cloud service, you also need an active internet connection.
  • Julius is a high-performance open source speech recognition engine. It does not need an active internet connection. Please note that you will need to train your own acoustic model, which is a very complex task that we do not provide support for. Regular users are most likely better suited with one of the other STT engines listed here.

TTS

  • eSpeak is a compact open-source speech synthesizer for many platforms. Speech synthesis is done offline, but most voices can sound very “robotic”.
  • Festival uses the Festival Speech Synthesis System, an open source speech synthesizer developed by the Centre for Speech Technology Research at the University of Edinburgh. Like eSpeak, it also synthesizes speech offline.
  • Flite uses CMU Flite (festival-lite), a lightweight and fast synthesis engine that was primarily designed for small embedded machines. It synthesizes speech offline, so no internet connection is required.
  • SVOX Pico TTS was the Text-to-Speech engine used in Android 1.6 “Donut”. It’s an open-source small footprint application and also works offline. The quality is rather good compared to eSpeak and Festival.
  • Google TTS uses the same Text-to-Speech API which is also used by newer Android devices. The Synthesis itself is done on Google’s servers, so that you need an active internet connection and also can’t expect a lot of privacy if you use this.
  • Ivona TTS uses Amazon's Ivona Speech Cloud service, which is used in the Kindle Fire. Speech synthesis is done online, so an active internet connection is required and Amazon has access to everything Jasper says to you.
  • MaryTTS is an open-source TTS system written in Java. You need to set up your own MaryTTS server and configure Jasper to use it. Because the server can be hosted on the same machine that runs Jasper, you do not need internet access.
  • Mac OS X TTS only works if you're running Jasper on a Mac. It then uses the say command in macOS to synthesize speech.

To install Jasper on an OrangePi (CTRL+F agency OS):

apt-get install jasper

API VOCALR

The idea was to develop something new, but not from scratch for lack of time, so I decided to use the annyang library for the STT part.

The annyang JavaScript library is available on GitHub (some links are given at the end of this article). The supported languages are the following:

  • Afrikaans af
  • Basque eu
  • Bulgarian bg
  • Catalan ca
  • Arabic (Egypt) ar-EG
  • Arabic (Jordan) ar-JO
  • Arabic (Kuwait) ar-KW
  • Arabic (Lebanon) ar-LB
  • Arabic (Qatar) ar-QA
  • Arabic (UAE) ar-AE
  • Arabic (Morocco) ar-MA
  • Arabic (Iraq) ar-IQ
  • Arabic (Algeria) ar-DZ
  • Arabic (Bahrain) ar-BH
  • Arabic (Libya) ar-LY
  • Arabic (Oman) ar-OM
  • Arabic (Saudi Arabia) ar-SA
  • Arabic (Tunisia) ar-TN
  • Arabic (Yemen) ar-YE
  • Czech cs
  • Dutch nl-NL
  • English (Australia) en-AU
  • English (Canada) en-CA
  • English (India) en-IN
  • English (New Zealand) en-NZ
  • English (South Africa) en-ZA
  • English (UK) en-GB
  • English (US) en-US
  • Finnish fi
  • French fr-FR
  • Galician gl
  • German de-DE
  • Hebrew he
  • Hungarian hu
  • Icelandic is
  • Italian it-IT
  • Indonesian id
  • Japanese ja
  • Korean ko
  • Latin la
  • Mandarin Chinese zh-CN
  • Traditional Taiwan zh-TW
  • Simplified China zh-CN
  • Simplified Hong Kong zh-HK
  • Yue Chinese (Traditional Hong Kong) zh-yue
  • Malaysian ms-MY
  • Norwegian no-NO
  • Polish pl
  • Pig Latin xx-piglatin
  • Portuguese pt-PT
  • Portuguese (Brazil) pt-BR
  • Romanian ro-RO
  • Russian ru
  • Serbian sr-SP
  • Slovak sk
  • Spanish (Argentina) es-AR
  • Spanish (Bolivia) es-BO
  • Spanish (Chile) es-CL
  • Spanish (Colombia) es-CO
  • Spanish (Costa Rica) es-CR
  • Spanish (Dominican Republic) es-DO
  • Spanish (Ecuador) es-EC
  • Spanish (El Salvador) es-SV
  • Spanish (Guatemala) es-GT
  • Spanish (Honduras) es-HN
  • Spanish (Mexico) es-MX
  • Spanish (Nicaragua) es-NI
  • Spanish (Panama) es-PA
  • Spanish (Paraguay) es-PY
  • Spanish (Peru) es-PE
  • Spanish (Puerto Rico) es-PR
  • Spanish (Spain) es-ES
  • Spanish (US) es-US
  • Spanish (Uruguay) es-UY
  • Spanish (Venezuela) es-VE
  • Swedish sv-SE
  • Turkish tr
  • Zulu zu

It is possible to test, work, develop and design a PoC: the library uses WebKit's speech API (webkitSpeechRecognition) included in Google Chrome (a server-side based recognition). If you look on GitHub, some ports of the webkitSpeechRecognition function have been made for local usage with PocketSphinx (a C program transformed into JavaScript using Emscripten). The API is not actually a W3C standard yet (read the draft written in 2014 for further information). Mozilla developers/hackers are trying to get this working locally, using grammar functions and their own engine (with Gecko); the two projects may converge some day.
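Since webkitSpeechRecognition is vendor-prefixed and still a draft, feature detection is the usual first step. A minimal sketch (the helper name is mine, not part of any library):

```javascript
// Return the Speech Recognition constructor if the environment exposes
// one (standard or webkit-prefixed name), or null when unsupported.
function getSpeechRecognition(globalObj) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// In Chrome you would then use it roughly like this:
// var Ctor = getSpeechRecognition(window);
// if (Ctor) {
//   var recognition = new Ctor();
//   recognition.lang = 'en-US';
//   recognition.onresult = function (e) {
//     console.log(e.results[0][0].transcript);
//   };
//   recognition.start();
// }
```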

The design

API VOCALR has been designed to be MVC (Model-View-Controller) compliant; the Node.js app uses middleware functions, so you can hook into the source code for monitoring purposes. The following tree omits the node_modules folder.
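To illustrate what hooking a middleware for monitoring means, here is a framework-free sketch of the middleware-chain pattern Express uses; all names here are illustrative, not taken from the project's source:

```javascript
// Minimal middleware chain, mimicking how Express calls middleware in
// order, each one invoking next() to pass control on. Illustrative only.
function runMiddleware(middlewares, req) {
  var i = 0;
  function next() {
    var mw = middlewares[i++];
    if (mw) mw(req, next);
  }
  next();
  return req;
}

// A monitoring hook records what it sees, then passes control on:
var log = [];
function monitor(req, next) {
  log.push(req.url); // e.g. forward to a monitoring backend instead
  next();
}
function handler(req, next) {
  req.handled = true; // the "real" route handler
}

runMiddleware([monitor, handler], { url: '/prenoms' });
```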

.
├── app
│   ├── controllers
│   │   ├── biorhythm.server.controller.js
│   │   ├── index.server.controller.js
│   │   └── prenoms.server.controller.js
│   ├── routes
│   │   ├── biorhythm.server.routes.js
│   │   ├── index.server.routes.js
│   │   └── prenoms.server.routes.js
│   └── views
│       ├── biorhythm.ejs
│       ├── index.ejs
│       └── prenoms.ejs
├── config
│   └── express.js
├── license.txt
├── package.json
├── public
│   ├── css
│   │   ├── bootstrap.min.css
│   │   ├── bootstrap-theme.min.css
│   │   └── main.css
│   ├── img
│   │   ├── mini_icon_say.png
│   │   ├── mosaic1.png
│   │   └── palette.png
│   ├── js
│   │   ├── annyang.js
│   │   ├── annyang.min.js
│   │   ├── bootstrap.min.js
│   │   ├── highlight.pack.js
│   │   ├── jquery.min.js
│   │   ├── mespeak_config.json
│   │   └── mespeak.js
│   └── voices
│       ├── ca.json
│       ├── cs.json
│       ├── de.json
│       ├── el.json
│       ├── en
│       │   ├── en.json
│       │   ├── en-n.json
│       │   ├── en-rp.json
│       │   ├── en-sc.json
│       │   ├── en-us.json
│       │   └── en-wm.json
│       ├── eo.json
│       ├── es.json
│       ├── es-la.json
│       ├── fi.json
│       ├── fr.json
│       ├── hu.json
│       ├── it.json
│       ├── kn.json
│       ├── la.json
│       ├── lv.json
│       ├── nl.json
│       ├── pl.json
│       ├── pt.json
│       ├── pt-pt.json
│       ├── ro.json
│       ├── sk.json
│       ├── sv.json
│       ├── tr.json
│       ├── zh.json
│       └── zh-yue.json
├── README.md
├── server.js
├── tree.txt
└── vocal.siteweb.tld.conf

89 directories, 387 files

As we can see, Express.js is used with the EJS templating engine; the bootstrap.css package is included too, for training purposes.

Api and testing

Some APIs were added to the project for research/scientific purposes. They all work similarly: each retrieves JSON-encoded content with a jQuery cross-domain request. The PHP source code of the Prenoms API is available on my repo.

Example :

Cross domain request with jQuery.

function crossdomain(text)
{
    // element, 'result' and 'remotejsonresult' are metasyntactic names.
    var element = document.getElementById('result');
    $.ajax({
        url: "https://siteweb.tld/api?var=" + encodeURIComponent(text) + "&callback=?",
        dataType: "jsonp",
        jsonp: "callback",
        success: function(data)
        {
            element.innerHTML = '<b>the text var in cross domain :</b> ' + data.remotejsonresult;
        }
    });
}
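For context, the dataType: "jsonp" trick works because the server wraps its JSON payload in a call to the callback function name jQuery generates. A sketch of the server side of that contract (names illustrative):

```javascript
// What a JSONP endpoint returns: the JSON payload wrapped in a call to
// the client-supplied callback name, served as executable JavaScript.
function jsonpResponse(callbackName, payload) {
  return callbackName + '(' + JSON.stringify(payload) + ');';
}

// For ?callback=jQuery123 the response body would look like:
// jQuery123({"remotejsonresult":"..."});
```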

AnnyangJS

Load this script into the head, to be executed on localhost:serverport (it requires an internet connection).

<script>
    // load the annyang voice recognition script (annyang.js or annyang.min.js)
    var annyangScript = document.createElement('script');
    if (/localhost/.exec(window.location)) {
        annyangScript.src = "//vocal.ctrlfagency.com/js/annyang.js";
    } else {
        annyangScript.src = "//vocal.ctrlfagency.com/js/annyang.min.js";
    }
    document.write(annyangScript.outerHTML);
</script>

Force HTTPS redirection:

// force the page to load over https
if (window.location.protocol != "https:")
    window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);

Define the hello function after annyang has loaded. The hello function, as programmed below, uses jQuery's slideDown and a scrollTo helper to display the content of a p or div element, selected by its class.

"use strict";
// make sure annyang loaded successfully
if (annyang) {
    // define the hello command
    var hello = function() {
        $(".hello").slideDown("slow");
        $(".voice_instructions_after").slideDown("slow");
        scrollTo("#section_hello");
    };

The script below registers the voice command bonjour (monsieur); the parentheses make the string "monsieur" optional. The call annyang.debug() enables debugging output (open Google Chrome's JavaScript console with F12). The function annyang.setLanguage('fr-FR') sets the language of your STT engine (enter the right locale). Lastly, the scrollTo function's options are set.

    // load the commands
    var commands = {
        'bonjour (monsieur)': hello
    };

    // enable debug mode, visible in the console (F12)
    annyang.debug();

    // register the commands
    annyang.addCommands(commands);

    // set the voice recognition language (English is the default)
    annyang.setLanguage('fr-FR');

    // start listening; on failure the unsupported error message is shown
    annyang.start();

    // set the scrollTo speed
    var scrollTo = function(identifier, speed) {
        $('html, body').animate({
            scrollTop: $(identifier).offset().top
        }, speed || 1000);
    };
} // end if (annyang)
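To make the optional (monsieur) syntax concrete, here is a rough sketch of how such a phrase can be compiled into a regular expression. This mimics the idea behind annyang's matching; it is not its actual implementation:

```javascript
// Compile a command phrase with optional words, e.g. "bonjour (monsieur)",
// into a regular expression. Sketch only; annyang's real code differs.
function commandToRegExp(phrase) {
  var pattern = phrase
    .replace(/[-{}[\]+?.,\\^$|#]/g, '\\$&')   // escape regex specials
    .replace(/\s*\((.*?)\)/g, '(?:\\s+$1)?'); // "(word)" becomes optional
  return new RegExp('^' + pattern + '$', 'i');
}

var re = commandToRegExp('bonjour (monsieur)');
re.test('bonjour');          // matches
re.test('bonjour monsieur'); // matches
re.test('bonjour madame');   // does not match
```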

The relevant HTML:

<div id="section_hello">
	  <p><em>Essayez : </em></p>
	  <p class="voice_instructions">Dites "bonjour!"</p>
	  <p class="hello">salut!</p>
</div>

The relevant CSS:

#section_hello {
  padding: 3% 10%;
  font-size: 20px;
  line-height: 30px;
  font-weight: 400;
  background-color: #2c699e;
  color: #fff;
}
p.voice_instructions {
  background: url('../img/mini_icon_say.png') no-repeat 0 6px;
  padding-left: 24px;
}
p.hello {
  display: none;
  font-size: 2.5em;
  font-weight: 600;
}

meSpeak.js

The meSpeak.js library is included in the API VOCALR project. It provides a voice synthesizer for TTS (text to speech). As described in its about section, meSpeak.js (modulary enhanced speak.js) is a client-side JavaScript text-to-speech library based on the speak.js project. meSpeak.js adds support for WebKit and Safari and introduces loadable voice modules. There is also no more need for an embedding HTML element.

Usage :
meSpeak.speak( text [, { option1: value1, option2: value2 .. } [, callback ]] );

Example :

<script type="text/javascript" src="js/mespeak.js"></script>
  <script type="text/javascript">
    meSpeak.loadConfig("js/mespeak_config.json");
    meSpeak.loadVoice("voices/fr.json");
    meSpeak.speak("bonjour ça va ? ");
  </script>

meSpeak.js needs to be tuned with the amplitude, pitch and voice options for a prettier reading voice. The meSpeak.stop() function stops the reading in progress.
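A sketch of tuning those options; the values below are illustrative, so check meSpeak's documentation for the exact ranges and defaults:

```javascript
// Illustrative tuning values; meSpeak's documented options include
// amplitude, pitch, speed and wordgap. Assumes mespeak.js and a French
// voice are already loaded, as in the example above.
var voiceOptions = {
  amplitude: 110, // loudness (default 100)
  pitch: 45,      // base pitch (default 50)
  speed: 160,     // words per minute (default 175)
  wordgap: 2      // extra pause between words
};

if (typeof meSpeak !== 'undefined') {
  meSpeak.speak("bonjour ça va ?", voiceOptions);
  // meSpeak.stop(); // interrupts the current reading
}
```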

Useful links

Once you have taken the time to read this article and test the app it introduces, feel free to look at the further details below.
