Anoop Hallur
What is it?
Why do we need it?
How do we do it?
Current Practices
More!
1. Machine understands what you speak
2. What you don't speak
3. Other sounds too
Does not deal with ultrasonic sound
Only human-audible sounds are under study for now
Don't you want that in your app too?
You have a large data set of radio or other recordings and want to generate transcripts of them (to analyze them)
Many more applications.
Pure JS
window['webkitSpeechRecognition'] // defined only in a browser that supports it
var recognition = new webkitSpeechRecognition();
recognition.onstart = function () {...};
recognition.onerror = function () {...};
recognition.onend = function () {...};
recognition.onresult = function () {...};
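As a rough sketch of what the four handlers above might contain: the helper name `createRecognizer`, its `scope` argument and the `onTranscript` callback are made up for illustration and are not part of the browser API. The function returns null where `webkitSpeechRecognition` is missing, so it can be loaded safely anywhere.

```javascript
// Sketch only: createRecognizer, scope and onTranscript are hypothetical names.
// `scope` is the global object (window in a browser); passing it in lets the
// same code be exercised outside a browser with a stand-in constructor.
function createRecognizer(scope, onTranscript) {
  var Ctor = scope && scope.webkitSpeechRecognition;
  if (!Ctor) {
    return null; // speech recognition is not available here
  }
  var recognition = new Ctor();
  recognition.lang = 'en-US';          // language to recognize
  recognition.continuous = false;      // stop after one utterance
  recognition.interimResults = false;  // report final results only

  recognition.onstart = function () { console.log('listening'); };
  recognition.onerror = function (event) { console.log('error: ' + event.error); };
  recognition.onend = function () { console.log('stopped'); };
  recognition.onresult = function (event) {
    // Each result is a ranked list of alternatives; take the top one
    // of the most recent result.
    var last = event.results[event.results.length - 1];
    onTranscript(last[0].transcript);
  };
  return recognition;
}

// Usage in a page: start listening only when the API exists.
if (typeof window !== 'undefined') {
  var recognizer = createRecognizer(window, function (text) {
    console.log('heard: ' + text);
  });
  if (recognizer) {
    recognizer.start();
  }
}
```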
1. It works only if a human is speaking in person to the device
2. Works only in Chrome
3. Does not work offline
You can still use the Google Speech API and get what you want
Use it from your server
Upload
HTTP POST: www.google.com/speech-api/v2/recognize
Audio Data with its format info
API Key
More on API Keys
Limits: presently 50 requests per day
Did you notice? en_US. Not all languages are covered yet
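A minimal sketch of how the upload request could be put together on the server. Only the host and path come from the slide. The `https` scheme, the query parameter names (`output`, `lang`, `key`), the FLAC content type and the helper name `buildRecognizeRequest` are assumptions. The function describes the request without sending it, so no key or network access is needed to try it.

```javascript
// Sketch only: builds a description of the POST request, does not send it.
// ASSUMPTIONS (not on the slides): the https scheme, the query parameter
// names output/lang/key, and the "audio/x-flac; rate=..." content type.
function buildRecognizeRequest(apiKey, sampleRateHz, language) {
  var query = [
    'output=json',
    'lang=' + encodeURIComponent(language),
    'key=' + encodeURIComponent(apiKey)
  ].join('&');
  return {
    method: 'POST',
    url: 'https://www.google.com/speech-api/v2/recognize?' + query,
    // The audio format info travels in the Content-Type header;
    // the audio bytes themselves would be the request body.
    headers: { 'Content-Type': 'audio/x-flac; rate=' + sampleRateHz }
  };
}

// 'YOUR_API_KEY' is a placeholder, not a real key.
var request = buildRecognizeRequest('YOUR_API_KEY', 16000, 'en-US');
console.log(request.method + ' ' + request.url);
```

The returned object can be handed to whichever HTTP client the server already uses, with the recorded audio as the body.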
What should the API do?
You have to find which of these two made the sound
We have not modelled the cat and dog sounds sufficiently
Let's do it!
Speech is also data and can be treated like text data (only as an analogy)
The problem reduces to a classification problem
It can be solved efficiently by any one of several machine learning techniques
https://github.com/anooprh/PyOhio-Prsesentation/tree/master/catDogGame
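To make the classifier idea concrete, here is a small sketch using a nearest-centroid classifier, picked only because it is short. It is not the code in the repository above. The two-number feature vectors are invented for illustration; a real system would first extract features from the audio.

```javascript
// Sketch only: nearest-centroid classification over made-up feature vectors.
// This is an illustrative technique, not the method used in the linked repo.

// Component-wise mean of a list of equal-length vectors.
function mean(vectors) {
  var dim = vectors[0].length;
  var sum = new Array(dim).fill(0);
  vectors.forEach(function (v) {
    for (var i = 0; i < dim; i++) { sum[i] += v[i]; }
  });
  return sum.map(function (s) { return s / vectors.length; });
}

// Squared Euclidean distance between two vectors.
function squaredDistance(a, b) {
  var total = 0;
  for (var i = 0; i < a.length; i++) {
    total += (a[i] - b[i]) * (a[i] - b[i]);
  }
  return total;
}

// Training: store one centroid (average feature vector) per label.
function train(examplesByLabel) {
  var centroids = {};
  Object.keys(examplesByLabel).forEach(function (label) {
    centroids[label] = mean(examplesByLabel[label]);
  });
  return centroids;
}

// Prediction: pick the label whose centroid is closest to the input.
function classify(centroids, features) {
  var best = null;
  var bestDistance = Infinity;
  Object.keys(centroids).forEach(function (label) {
    var d = squaredDistance(centroids[label], features);
    if (d < bestDistance) {
      bestDistance = d;
      best = label;
    }
  });
  return best;
}

// Hypothetical features for each recording (values are invented).
var model = train({
  cat: [[0.9, 0.2], [0.8, 0.3]],
  dog: [[0.2, 0.7], [0.3, 0.9]]
});
console.log(classify(model, [0.85, 0.25]));
```

The same `train` and `classify` functions work unchanged with ten labels instead of two, which is the shape of the single digit recognizer.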
A single digit recognizer
Why more challenging --> Prediction has to be 1 of 10 values
Why do this --> A live demo
Example 1: 2 voices (2 acoustic models), 1 sound (1 language model)
Example 2: 1 voice (1 acoustic model), 2 sounds (2 language models)
What we want to recognize boils down to these two parameters
Acoustic and Language Models
In English : large number of speakers and 44 phonemes (basic sounds)
All recognizers have different formats and specifications
It's a mess!
Finite State Grammar
FSG_BEGIN
NUM_STATES 5
START_STATE 0
FINAL_STATE 4
TRANSITION 0 1 0.9 ONCE
TRANSITION 0 0 0.01 ONCE
TRANSITION 1 2 0.9 UPON
TRANSITION 1 1 0.01 UPON
...
...
FSG_END
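To show how the grammar file above is laid out, here is a small sketch that reads it into a plain object. It assumes each TRANSITION line has the form "from-state to-state probability word", as in the four lines shown. The function name `parseFsg` and the field names are made up. Lines it does not recognize, including the elided "..." lines, are skipped rather than guessed at.

```javascript
// Sketch only: parseFsg and its field names are hypothetical.
// Assumes "TRANSITION <from> <to> <probability> <word>" as on the slide.
function parseFsg(text) {
  var fsg = { numStates: null, startState: null, finalState: null, transitions: [] };
  text.split('\n').forEach(function (rawLine) {
    var parts = rawLine.trim().split(/\s+/);
    switch (parts[0]) {
      case 'NUM_STATES':
        fsg.numStates = parseInt(parts[1], 10);
        break;
      case 'START_STATE':
        fsg.startState = parseInt(parts[1], 10);
        break;
      case 'FINAL_STATE':
        fsg.finalState = parseInt(parts[1], 10);
        break;
      case 'TRANSITION':
        fsg.transitions.push({
          from: parseInt(parts[1], 10),
          to: parseInt(parts[2], 10),
          probability: parseFloat(parts[3]),
          word: parts[4] || null
        });
        break;
      default:
        // FSG_BEGIN, FSG_END, blank lines and elided lines are ignored.
        break;
    }
  });
  return fsg;
}

// Only the lines visible on the slide; the elided part is left out.
var sample = [
  'FSG_BEGIN',
  'NUM_STATES 5',
  'START_STATE 0',
  'FINAL_STATE 4',
  'TRANSITION 0 1 0.9 ONCE',
  'TRANSITION 0 0 0.01 ONCE',
  'TRANSITION 1 2 0.9 UPON',
  'TRANSITION 1 1 0.01 UPON',
  'FSG_END'
].join('\n');
var grammar = parseFsg(sample);
console.log(grammar.transitions.length + ' transitions read');
```

Read this way, each word has a forward transition with high probability and a self-loop with low probability on the same state.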
pocketsphinx_continuous \
    -hmm /usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -fsg princess-minus-START_.fsg \
    -dict HUB-CAPS4.5000.DIC
End pointing - determining where the word ends
Background noise
Puffing / breathing sound
Many more
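As an illustration of end pointing, one simple approach is to split the signal into frames and keep the span between the first and last frame whose average energy exceeds a threshold. This is a sketch of that approach only, not what any particular recognizer does. The function name `findEndpoints`, the frame size and the threshold are all made-up choices, and background noise or breathing sounds can easily fool it.

```javascript
// Sketch only: energy-threshold end pointing with hypothetical parameters.
// samples: array of amplitudes; frameSize: samples per frame;
// threshold: minimum mean squared amplitude for a frame to count as sound.
function findEndpoints(samples, frameSize, threshold) {
  var first = -1;
  var last = -1;
  for (var start = 0; start + frameSize <= samples.length; start += frameSize) {
    // Mean squared amplitude of this frame.
    var energy = 0;
    for (var i = start; i < start + frameSize; i++) {
      energy += samples[i] * samples[i];
    }
    energy /= frameSize;

    if (energy > threshold) {
      if (first === -1) {
        first = start;          // first loud frame begins here
      }
      last = start + frameSize; // exclusive end of the latest loud frame
    }
  }
  // null means no frame was loud enough to count as sound.
  return first === -1 ? null : { start: first, end: last };
}

// Silence, a short burst, then silence again (invented values).
var signal = [0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0, 0, 0, 0];
var span = findEndpoints(signal, 4, 0.01);
console.log(span);
```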