Jacek Jarmulak
How to use Voicegain with Twilio Media Streams
Updated: Oct 2, 2020

Voicegain adds grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.
The differences between Voicegain speech recognition and the Twilio TwiML <Gather> verb are:
Voicegain supports grammars with semantic tags (GRXML or JSGF), while <Gather> is a large-vocabulary recognizer that just returns text, and
Voicegain is significantly cheaper (we will describe the price difference in an upcoming blog post).
When using Voicegain with Twilio, your application logic will need to handle callback requests from both Twilio and Voicegain.
Each recognition will involve two main steps described below:
Initiating Speech Recognition with Voicegain
This is done by invoking the Voicegain async recognition API: /asr/recognize/async
Below is an example of the payload needed to start a new recognition session:
{
  "sessions": [{
    "asyncMode": "REAL-TIME",
    "callback": { "uri": "https://my.host/my-reco-result-callback" }
  }],
  "audio": {
    "source": { "stream": { "protocol": "TWIML" } },
    "format": "PCMU",
    "rate": 8000
  },
  "settings": {
    "asr": {
      "startInputTimers": false,
      "noInputTimeout": 5000,
      "incompleteTimeout": 5000,
      "completeTimeout": 1000,
      "grammars": [{
        "type": "GRXML",
        "name": "my-grammar",
        "fromUrl": {
          "url": "https://my.host/grammars/my-grammar.grxml"
        }
      }]
    }
  }
}
Some notes about the content of the request:
startInputTimers set to false tells the ASR to delay starting the recognition timers - they will be started later, when the question prompt finishes playing
TWIML is set as the streaming protocol, with the audio format set to PCMU (u-law) and the sample rate to 8 kHz
the asr settings include the three standard timeouts used in grammar-based recognition - no-input, complete, and incomplete
the grammar is a GRXML grammar loaded from an external URL
If successful, this request will return the WebSocket URL in the audio.stream.websocketUrl field. This value will be used in making the TwiML request.
Note: if the grammar specifies DTMF, the Voicegain recognizer will also recognize DTMF signals included in the audio sent from the Twilio platform.
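To make this step concrete, here is a minimal sketch of the request in Python using the requests library. The base URL and the Bearer token authorization header are assumptions - substitute the endpoint and credentials from your Voicegain account.

import requests

VOICEGAIN_JWT = "your-voicegain-jwt"  # assumed: your Voicegain API token
RECOGNIZE_URL = "https://api.voicegain.ai/v1/asr/recognize/async"  # assumed base URL

payload = {
    "sessions": [{
        "asyncMode": "REAL-TIME",
        "callback": {"uri": "https://my.host/my-reco-result-callback"}
    }],
    "audio": {
        "source": {"stream": {"protocol": "TWIML"}},
        "format": "PCMU",
        "rate": 8000
    },
    "settings": {
        "asr": {
            "startInputTimers": False,
            "noInputTimeout": 5000,
            "incompleteTimeout": 5000,
            "completeTimeout": 1000,
            "grammars": [{
                "type": "GRXML",
                "name": "my-grammar",
                "fromUrl": {"url": "https://my.host/grammars/my-grammar.grxml"}
            }]
        }
    }
}

response = requests.post(
    RECOGNIZE_URL,
    json=payload,
    headers={"Authorization": f"Bearer {VOICEGAIN_JWT}"},
)
response.raise_for_status()

# WebSocket URL to plug into the TwiML <Stream> verb,
# per the audio.stream.websocketUrl field described above
websocket_url = response.json()["audio"]["stream"]["websocketUrl"]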
TwiML <Connect><Stream> request
After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:
<Response>
  <Say voice="woman" language="en-US">Let me confirm</Say>
  <Connect>
    <Stream url="wss://api.ascalon.ai/v1/0/socket/7b39-sd55-s25-aab">
      <Parameter name="prompt01" value="https://my.host/rcrdng/abc"/>
      <Parameter name="prompt02" value="$235.00"/>
      <Parameter name="prompt03" value="clip:correct ?"/>
      <Parameter name="voice" value="benjamin"/>
      <Parameter name="bargeIn" value="enable"/>
    </Stream>
  </Connect>
  <Redirect method="POST">https://my.host/twilio/cb</Redirect>
</Response>
Some notes about the content of the TwiML request:
the WebSocket URL is the one returned from the Voicegain /asr/recognize/async request
more than one question prompt is supported - they will be played one after another
three types of prompts are supported, as illustrated by the parameters above: (prompt01) a recording retrieved from a URL, (prompt02) a TTS prompt (several voices are available), and (prompt03) a 'clip:' prompt generated using the Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
bargeIn is enabled - prompt playback will stop as soon as the caller starts speaking
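If your application uses the twilio Python helper library, the TwiML above can be generated rather than hand-written. Below is a rough sketch under that assumption; websocket_url is the value obtained from the Voicegain session in the previous step, and the parameter values mirror the example above.

from twilio.twiml.voice_response import VoiceResponse, Connect

def build_confirmation_twiml(websocket_url: str) -> str:
    # websocket_url is the audio.stream.websocketUrl value returned by Voicegain
    response = VoiceResponse()
    response.say("Let me confirm", voice="woman", language="en-US")

    connect = Connect()
    stream = connect.stream(url=websocket_url)
    # <Parameter> elements pass prompt and playback settings to Voicegain
    stream.parameter(name="prompt01", value="https://my.host/rcrdng/abc")
    stream.parameter(name="prompt02", value="$235.00")
    stream.parameter(name="prompt03", value="clip:correct ?")
    stream.parameter(name="voice", value="benjamin")
    stream.parameter(name="bargeIn", value="enable")
    response.append(connect)

    # Twilio fetches this URL after the <Connect><Stream> ends
    response.redirect("https://my.host/twilio/cb", method="POST")
    return str(response)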
Returned Recognition Response
Below is an example response from the recognition. This response is from the built-in phone grammar.
{
  "session": {
    "sessionId": "0-0kf8ue1zb05u0iy45cin3i6uqmo9",
    "startInputTimers": false,
    "asyncMode": "REAL-TIME"
  },
  "result": {
    "status": "MATCH",
    "lastEvent": "RECOGNITION-COMPLETE",
    "alternatives": [{
      "utterance": "9 7 2 5 1 8 zero 8 6 3",
      "confidence": 0.9805818200111389,
      "grammar": "phone",
      "semanticTags": { "input": "speech", "phone": "9725180863" }
    }],
    "final": true
  },
  "phase": "DONE"
}
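Voicegain delivers this response with an HTTP POST to the callback URI provided when the session was started. As a rough sketch, a Flask handler for that callback could look like the following; the field names follow the example above, while the handler functions and the non-MATCH handling are placeholders for your application logic.

from flask import Flask, request

app = Flask(__name__)

def handle_match(utterance, tags):
    # application-specific, e.g. confirm the captured phone number to the caller
    print("Recognized:", utterance, tags)

def handle_no_match(status):
    # application-specific, e.g. re-prompt the caller
    print("Recognition did not match, status:", status)

@app.route("/my-reco-result-callback", methods=["POST"])
def reco_result_callback():
    # Voicegain POSTs the recognition result as JSON
    result = request.get_json()["result"]
    if result["status"] == "MATCH":
        best = result["alternatives"][0]
        # semanticTags holds the values filled in by the grammar,
        # e.g. {"input": "speech", "phone": "9725180863"}
        handle_match(best["utterance"], best.get("semanticTags", {}))
    else:
        # statuses other than MATCH (e.g. no-input, no-match) - assumed here
        handle_no_match(result["status"])
    return "", 204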