How to use Voicegain with Twilio Media Streams

Voicegain adds grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

Voicegain speech recognition differs from Twilio TwiML <Gather> in two ways:

  1. Voicegain supports grammars with semantic tags (GRXML or JSGF), while <Gather> is a large-vocabulary recognizer that returns only text, and
  2. Voicegain is significantly cheaper (we will describe the price difference in an upcoming blog post).

When using Voicegain with Twilio, your application logic will need to handle callback requests from both Twilio and Voicegain.

Each recognition involves the two main steps described below.

Initiating Speech Recognition with Voicegain

This is done by invoking the Voicegain async recognition API: /asr/recognize/async

Below is an example of the payload needed to start a new recognition session:

Some notes about the content of the request:

  • startInputTimers tells the ASR to delay starting the input timers - they will be started later, when the question prompt finishes playing
  • TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8 kHz
  • asr settings include the three standard timeouts used in grammar-based recognition: no-input, complete, and incomplete
  • grammar is set to a GRXML grammar loaded from an external URL
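
To make the notes above concrete, here is a minimal Python sketch of how such a request body might be assembled. Only startInputTimers, TWIML, PCMU, the 8 kHz rate, the three timeouts, and the GRXML-from-URL grammar come from the notes; the exact key names, the nesting, the timeout values, and the grammar URL are assumptions made for illustration and should be checked against the Voicegain API documentation.

```python
import json


def build_recognize_payload(grammar_url: str) -> dict:
    """Build a body for /asr/recognize/async (key names are illustrative)."""
    return {
        # Named in the notes above: False is assumed to delay the input
        # timers until the question prompt has finished playing.
        "startInputTimers": False,
        "audio": {
            # Assumed layout: TWIML streaming protocol, PCMU (u-law), 8 kHz.
            "source": {"stream": {"protocol": "TWIML"}},
            "format": "PCMU",
            "rate": 8000,
        },
        "settings": {
            "asr": {
                # Assumed key names; the values are example milliseconds.
                "noInputTimeout": 5000,
                "completeTimeout": 1000,
                "incompleteTimeout": 3000,
                # GRXML grammar loaded from an external URL.
                "grammars": [
                    {"type": "GRXML", "fromUrl": {"url": grammar_url}}
                ],
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical grammar location, for demonstration only.
    payload = build_recognize_payload("https://example.com/grammars/phone.grxml")
    print(json.dumps(payload, indent=2))
```

Keeping the payload in a small builder function makes it easy to swap the grammar URL for each question in a dialog.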

This request, if successful, returns the websocket URL in its response. This value is then used in the TwiML request.

Note that if the grammar is set up to recognize DTMF, the Voicegain recognizer will also recognize DTMF signals included in the audio sent from the Twilio platform.
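
As a rough illustration of invoking /asr/recognize/async, the Python sketch below prepares the HTTP POST using only the standard library. The base URL, the bearer-token authentication, and the function name are assumptions, not details taken from this post; only the /asr/recognize/async path is.

```python
import json
import urllib.request


def build_async_recognize_request(
    payload: dict,
    token: str,
    base_url: str = "https://api.voicegain.ai/v1",  # assumed base URL
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the async recognition endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/asr/recognize/async",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumption: the API accepts a bearer token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

The prepared request can be passed to urllib.request.urlopen, and the JSON response parsed to read the websocket URL. The name of the field holding that URL is not shown here, so look it up in the actual response.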

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:

Some notes about the content of the TwiML request:

  • the websocket URL is the one returned from the Voicegain /asr/recognize/async request
  • more than one question prompt is supported - they will be played one after another
  • three types of prompts are supported: 1) a recording retrieved from a URL, 2) a TTS prompt (several voices are available), 3) a 'clip:' prompt generated using the Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
  • bargeIn is enabled - prompt playback will stop as soon as the caller starts speaking
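
The Python sketch below shows one way an application might generate such a TwiML document. The <Response>, <Connect>, <Stream>, and <Parameter> elements are standard TwiML, and bargeIn is named in the notes above. The prompt parameter names (prompt_01, prompt_02, ...) and the sample prompt values are hypothetical, so substitute the names Voicegain actually expects.

```python
import xml.etree.ElementTree as ET


def build_connect_stream_twiml(websocket_url, prompts, barge_in=True):
    """Return a TwiML string that streams call audio to websocket_url."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", url=websocket_url)
    # One <Parameter> per prompt, listed in the order they should play.
    # The prompt_NN naming is an assumption made for this sketch.
    for index, prompt in enumerate(prompts, start=1):
        ET.SubElement(stream, "Parameter",
                      name=f"prompt_{index:02d}", value=prompt)
    # bargeIn: stop prompt playback as soon as the caller starts speaking.
    ET.SubElement(stream, "Parameter",
                  name="bargeIn", value="true" if barge_in else "false")
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical websocket URL and prompts, for demonstration only.
    print(build_connect_stream_twiml(
        "wss://example.com/voicegain-session",
        ["https://example.com/prompts/question.wav", "clip:1234"],
    ))
```

Building the XML with ElementTree rather than string concatenation ensures that URLs containing characters such as & are escaped correctly in the attributes.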

Returned Recognition Response

Below is an example response from the recognition. This response is from the built-in phone grammar.
