Jacek Jarmulak

Voicegain offers large vocabulary transcription to Twilio developers

Updated: Oct 12, 2020

In our previous post we described how Voicegain provides grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

Starting from release 1.16.0 of the Voicegain Platform and API, it is possible to use Voicegain speech-to-text for speech transcription (without grammars) to achieve functionality similar to TwiML <Gather>.

The reasons we think it will be attractive to Twilio users are:

  • lower cost per speech-to-text capture

  • higher accuracy for customers who choose Acoustic Model customization

  • access to all speech-to-text hypotheses in word-tree output mode

Using Voicegain as an alternative to <Gather> involves steps similar to those for grammar-based recognition - they are listed below.

Initiating Speech Transcription with Voicegain

This is done by invoking the Voicegain async transcribe API: /asr/transcribe/async

Below is an example of the payload needed to start a new transcription session:

  {
    "sessions": [{
      "asyncMode": "REAL-TIME",
      "callback": { "uri": "https://my.host/my-reco-result-callback" },
      "content": { "full": ["transcript"] }
    }],
    "audio": {
      "source": { "stream": { "protocol": "TWIML" } },
      "format": "PCMU",
      "rate": 8000
    },
    "settings": {
      "asr": {
        "startInputTimers": "false",
        "confidenceThreshold": 0.25,
        "noInputTimeout": 5000,
        "completeTimeout": 1000
      }
    }
  }

Some notes about the content of the request:

  • we are requesting that the callback return the transcript in text form - other options are possible, such as words (individual words with confidences) and word-tree (words organized in a tree of recognition hypotheses)

  • startInputTimers tells ASR to delay start of timers - they will be started later when the question prompt finishes playing

  • TWIML is set as the streaming protocol with the format set to PCMU (u-law) and sample rate of 8kHz

  • asr settings include the two timeouts used in transcription - no-input, and complete timeouts.

This request, if successful, will return the websocket URL in the audio.stream.websocketUrl field. This value will be used in making the TwiML request.

Note that in transcribe mode DTMF detection is currently not possible. Please let us know if this is something that would be critical to your use case.
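
The request and the extraction of the websocket URL can be sketched in Python as follows. This is a minimal illustration rather than an official client: the function names, the `base_url` and `token` arguments, and the use of a bearer token in the `Authorization` header are assumptions not taken from this post. Only the payload fields, the /asr/transcribe/async path, and the audio.stream.websocketUrl field come from the description above.

```python
import json
import urllib.request


def build_transcribe_request(callback_uri, no_input_ms=5000, complete_ms=1000):
    """Build the session payload shown above as a Python dict."""
    return {
        "sessions": [{
            "asyncMode": "REAL-TIME",
            "callback": {"uri": callback_uri},
            "content": {"full": ["transcript"]},
        }],
        "audio": {
            "source": {"stream": {"protocol": "TWIML"}},
            "format": "PCMU",
            "rate": 8000,
        },
        "settings": {
            "asr": {
                "startInputTimers": "false",
                "confidenceThreshold": 0.25,
                "noInputTimeout": no_input_ms,
                "completeTimeout": complete_ms,
            }
        },
    }


def extract_websocket_url(response_body):
    """Read the audio.stream.websocketUrl field from the decoded response."""
    return response_body["audio"]["stream"]["websocketUrl"]


def start_transcription(base_url, token, callback_uri):
    """POST the payload and return the websocket URL.

    Assumption: base_url and bearer-token authentication are placeholders;
    check your account documentation for the real values.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/asr/transcribe/async",
        data=json.dumps(build_transcribe_request(callback_uri)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_websocket_url(json.load(response))
```

A call such as `start_transcription(base_url, token, "https://my.host/my-reco-result-callback")` would then yield the URL to place in the TwiML request.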

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:

  <?xml version="1.0" encoding="UTF-8"?>
  <Response>
    <Say voice="woman" language="en-US">Let me confirm</Say>
    <Connect>
      <Stream url="wss://api.ascalon.ai/v1/0/socket/7b39-sd55-s25-aab">
        <Parameter name="prompt01" value="https://my.host/rcrdng/abc"/>
        <Parameter name="prompt02" value="$235.00"/>
        <Parameter name="prompt03" value="clip:correct ?"/>
        <Parameter name="voice" value="benjamin"/>
        <Parameter name="bargeIn" value="enable"/>
      </Stream>
    </Connect>
    <Redirect method="POST">https://my.host/twilio/cb</Redirect>
  </Response>

Some notes about the content of the TwiML request:

  • the websocket URL is the one returned by the Voicegain /asr/transcribe/async request

  • more than one question prompt is supported - they will be played one after another

  • three types of prompts are supported: 01) recording retrieved from a URL, 02) TTS prompt (several voices are available), 03) 'clip:' prompt generated using Voicegain Prompt Manager which supports dynamic concatenation of prerecorded prompts

  • bargeIn is enabled - prompt playback will stop as soon as caller starts speaking
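
As an illustration of the notes above, the TwiML document could be generated in Python with the standard library. This is a sketch under assumptions: `build_twiml` and its arguments are hypothetical names, and the fixed <Say> text simply mirrors the example. The parameter names (prompt01, prompt02, ..., voice, bargeIn) are the ones used in the request above.

```python
import xml.etree.ElementTree as ET


def build_twiml(websocket_url, prompts, redirect_url, voice="benjamin"):
    """Return a TwiML document that streams the call audio to websocket_url.

    prompts is an ordered list of prompt values (recording URL, TTS text,
    or 'clip:' prompt); they are numbered prompt01, prompt02, ...
    """
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", {"voice": "woman", "language": "en-US"})
    say.text = "Let me confirm"

    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})
    for index, prompt in enumerate(prompts, start=1):
        ET.SubElement(stream, "Parameter",
                      {"name": "prompt%02d" % index, "value": prompt})
    ET.SubElement(stream, "Parameter", {"name": "voice", "value": voice})
    ET.SubElement(stream, "Parameter", {"name": "bargeIn", "value": "enable"})

    # Twilio continues with this URL once the stream ends.
    redirect = ET.SubElement(response, "Redirect", {"method": "POST"})
    redirect.text = redirect_url

    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Building the markup through an XML library rather than string concatenation means that prompt text containing characters such as & or < is escaped automatically.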

Returned Transcription Response

Below is an example response from the transcription for the case where "content" : {"full" : ["transcript"] } .

    "sessionId": "0-0kfir69zu1bjkp8fuuh4o8ufxmwz",
    "asyncMode": "REAL-TIME"},
    "status": "MATCH",
    "lastEvent": "RECOGNITION-COMPLETE",
    "final": true,
    "transcript": "mine phone is unknown yet"},
  "responseType": "AsyncResultFull",
    "phase": "DONE",
    "audioStartTime": 0,
    "audioEndTime": 5010,
    "audioDuration": 5010,
    "clockStartTime": 1601069052776,
    "clockEndTime": 1601069058605,
    "xRealTime": 1.16}

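A callback handler could pull the transcript out of such a response along the following lines. This is a hedged sketch: `find_field` and `transcript_from_callback` are hypothetical helpers, and because the exact nesting of the response objects is not spelled out here, the code looks fields up by name anywhere in the decoded JSON instead of assuming fixed paths. Only the field names status, final, and transcript and the value MATCH come from the example above.

```python
import json


def find_field(node, key):
    """Depth-first search for the first value stored under key."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_field(child, key)
        if found is not None:
            return found
    return None


def transcript_from_callback(body):
    """Return the transcript for a final MATCH result, otherwise None.

    body is the raw JSON text (str or bytes) posted to the callback URI.
    """
    data = json.loads(body)
    is_match = find_field(data, "status") == "MATCH"
    is_final = find_field(data, "final") is True
    if is_match and is_final:
        return find_field(data, "transcript")
    return None
```

Returning None for anything other than a final MATCH lets the calling code decide whether to re-prompt the caller or fall back to another flow.
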