Large vocabulary transcription for Twilio developers

In our previous post we described how Voicegain provides grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

Starting with release 1.16.0 of the Voicegain Platform and API, it is possible to use Voicegain speech-to-text for speech transcription (without grammars) to achieve functionality similar to TwiML <Gather>.

The reasons we think it will be attractive to Twilio users are:

  • lower cost per speech-to-text capture
  • higher accuracy for customers who choose Acoustic Model customization
  • access to all speech-to-text hypotheses in word-tree output mode

Using Voicegain as an alternative to <Gather> involves steps similar to those for grammar-based recognition. They are described below.

Initiating Speech Transcription with Voicegain

This is done by invoking the Voicegain async transcribe API: /asr/transcribe/async

Below is an example of the payload needed to start a new transcription session:

 {
   "sessions": [{
     "asyncMode": "REAL-TIME",
     "callback": { "uri": "" },
     "content": { "full": ["transcript"] }
   }],
   "audio": {
     "source": { "stream": { "protocol": "TWIML" } },
     "format": "PCMU",
     "rate": 8000
   },
   "settings": {
     "asr": {
       "startInputTimers": "false",
       "confidenceThreshold": 0.25,
       "noInputTimeout": 5000,
       "completeTimeout": 1000
     }
   }
 }

Some notes about the content of the request:

  • we request that the callback return the transcript in text form - other options include words (individual words with confidences) and word-tree (words organized in a tree of recognition hypotheses)
  • startInputTimers tells ASR to delay start of timers - they will be started later when the question prompt finishes playing
  • TWIML is set as the streaming protocol with the format set to PCMU (u-law) and sample rate of 8kHz
  • asr settings include the two timeouts used in transcription - no-input, and complete timeouts.
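
The request described above could be sent from Python roughly as follows. This is a minimal sketch using only the standard library. The base URL, the bearer-token Authorization header, and the function names are assumptions made for illustration and should be checked against the Voicegain API documentation.

```python
import json
import urllib.request

# Assumption: placeholder base URL, not a value taken from this post.
VOICEGAIN_BASE_URL = "https://api.voicegain.ai/v1"


def build_transcribe_request(callback_uri: str) -> dict:
    """Build the /asr/transcribe/async payload shown in the example above."""
    return {
        "sessions": [{
            "asyncMode": "REAL-TIME",
            "callback": {"uri": callback_uri},
            "content": {"full": ["transcript"]},
        }],
        "audio": {
            "source": {"stream": {"protocol": "TWIML"}},
            "format": "PCMU",
            "rate": 8000,
        },
        "settings": {
            "asr": {
                # Kept as a string, exactly as in the example payload.
                "startInputTimers": "false",
                "confidenceThreshold": 0.25,
                "noInputTimeout": 5000,
                "completeTimeout": 1000,
            }
        },
    }


def start_transcription(jwt: str, callback_uri: str) -> dict:
    """POST the payload and return the parsed JSON response."""
    request = urllib.request.Request(
        VOICEGAIN_BASE_URL + "/asr/transcribe/async",
        data=json.dumps(build_transcribe_request(callback_uri)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + jwt,  # assumed auth scheme
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping payload construction separate from the HTTP call makes it easy to inspect or log the request body before a session is started.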

This request, if successful, returns the websocket URL in the response. This value is then used in the TwiML request.

Note that DTMF detection is currently not possible in transcribe mode. Please let us know if this is something that would be critical to your use case.

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:

 <Response>
   <Say voice="woman" language="en-US">Let me confirm</Say>
   <Connect>
     <Stream url="wss://" >
       <Parameter name="prompt01" value=""/>
       <Parameter name="prompt02" value="$235.00"/>
       <Parameter name="prompt03" value="clip:correct ?"/>
       <Parameter name="voice" value="benjamin"/>
       <Parameter name="bargeIn" value="enable"/>
     </Stream>
   </Connect>
   <Redirect method="POST"></Redirect>
 </Response>

Some notes about the content of the TwiML request:

  • the websocket URL is the one returned from Voicegain /asr/transcribe/async request
  • more than one question prompt is supported - they will be played one after another
  • three types of prompts are supported: 01) a recording retrieved from a URL, 02) a TTS prompt (several voices are available), 03) a 'clip:' prompt generated using Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
  • bargeIn is enabled - prompt playback will stop as soon as caller starts speaking
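
A TwiML document of this shape can also be generated programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name and its arguments are hypothetical, and only the parameter names that appear in the example above (prompt01, prompt02, ..., voice, bargeIn) are used.

```python
import xml.etree.ElementTree as ET


def build_connect_stream_twiml(websocket_url, prompts, redirect_url,
                               voice="benjamin", barge_in=True):
    """Return a TwiML string that connects the call audio to a websocket URL."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})
    # Prompts are named prompt01, prompt02, ... in the order they are given.
    for index, prompt in enumerate(prompts, start=1):
        ET.SubElement(stream, "Parameter",
                      {"name": f"prompt{index:02d}", "value": prompt})
    ET.SubElement(stream, "Parameter", {"name": "voice", "value": voice})
    if barge_in:
        # Only the "enable" value appears in the example, so nothing is
        # emitted when barge-in is not requested.
        ET.SubElement(stream, "Parameter",
                      {"name": "bargeIn", "value": "enable"})
    # The verb that follows <Connect>, as in the example above.
    redirect = ET.SubElement(response, "Redirect", {"method": "POST"})
    redirect.text = redirect_url
    return ET.tostring(response, encoding="unicode")
```

Building the document with an XML library rather than string concatenation ensures that prompt text containing characters such as & or < is escaped correctly.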

Returned Transcription Response

Below is an example of the transcription response in the case where "content" : {"full" : ["transcript"] } was requested.

   "sessionId": "0-0kfir69zu1bjkp8fuuh4o8ufxmwz",
   "asyncMode": "REAL-TIME"},
   "status": "MATCH",
   "final": true,
   "transcript": "mine phone is unknown yet"},
 "responseType": "AsyncResultFull",
   "phase": "DONE",
   "audioStartTime": 0,
   "audioEndTime": 5010,
   "audioDuration": 5010,
   "clockStartTime": 1601069052776,
   "clockEndTime": 1601069058605,
   "xRealTime": 1.16}

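A callback handler typically needs only the status, final, and transcript fields from such a response. The sketch below, which is an illustration rather than Voicegain's own client code, searches the JSON recursively for those field names instead of assuming a particular nesting. Both function names are hypothetical.

```python
import json


def find_field(node, name):
    """Return the first value stored under `name` anywhere in nested JSON."""
    if isinstance(node, dict):
        if name in node:
            return node[name]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_field(child, name)
        if found is not None:
            return found
    return None


def extract_transcript(callback_body: str):
    """Return the transcript of a final MATCH result, otherwise None."""
    payload = json.loads(callback_body)
    is_match = find_field(payload, "status") == "MATCH"
    is_final = find_field(payload, "final") is True
    if is_match and is_final:
        return find_field(payload, "transcript")
    return None
```

The returned text can then be used by the application that serves the <Redirect> URL to decide which TwiML to send next.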
