
Large vocabulary transcription for Twilio developers

In our previous post, we described how Voicegain provides grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

Starting with release 1.16.0 of the Voicegain Platform and API, it is possible to use Voicegain speech-to-text for speech transcription (without grammars) to achieve functionality similar to TwiML <Gather>.

We think this will be attractive to Twilio users for the following reasons:

  • lower cost per speech-to-text capture
  • higher accuracy for customers who choose Acoustic Model customization
  • access to all speech-to-text hypotheses in word-tree output mode

Using Voicegain as an alternative to <Gather> involves steps similar to those for grammar-based recognition. They are listed below.

Initiating Speech Transcription with Voicegain

This is done by invoking the Voicegain async transcribe API: /asr/transcribe/async

Below is an example of the payload needed to start a new transcription session:

{
 "sessions": [{
   "asyncMode": "REAL-TIME",
   "callback": { "uri" : "https://my.host/my-reco-result-callback" },
   "content" : {"full" : ["transcript"] }
 }],
 "audio": {
   "source": { "stream": { "protocol": "TWIML" } },
   "format": "PCMU",
   "rate": 8000
 },
 "settings": {
   "asr": {
     "startInputTimers" : "false",
     "confidenceThreshold" : 0.25,
     "noInputTimeout": 5000,
     "completeTimeout": 1000
   }
 }
}

Some notes about the content of the request:

  • we request that the callback return the transcript in text form; other options are words (individual words with confidences) and word-tree (words organized in a tree of recognition hypotheses)
  • startInputTimers tells the ASR to delay starting the timers; they will be started later, when the question prompt finishes playing
  • TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8 kHz
  • the asr settings include the two timeouts used in transcription: the no-input and complete timeouts
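
The request body above can also be assembled programmatically. The following is a minimal Python sketch, not a Voicegain reference client: the helper name build_transcribe_request and its parameters are hypothetical, and only the field names and values are taken from the example payload.

```python
import json


def build_transcribe_request(callback_uri, no_input_timeout=5000,
                             complete_timeout=1000):
    """Return the /asr/transcribe/async request body as a dict.

    Hypothetical helper: field names and default values mirror the
    example payload shown in this post.
    """
    return {
        "sessions": [{
            "asyncMode": "REAL-TIME",
            "callback": {"uri": callback_uri},
            # Ask for the plain-text transcript in the callback.
            "content": {"full": ["transcript"]},
        }],
        "audio": {
            "source": {"stream": {"protocol": "TWIML"}},
            "format": "PCMU",
            "rate": 8000,
        },
        "settings": {
            "asr": {
                # Kept as the string "false", exactly as in the example.
                "startInputTimers": "false",
                "confidenceThreshold": 0.25,
                "noInputTimeout": no_input_timeout,
                "completeTimeout": complete_timeout,
            }
        },
    }


body = build_transcribe_request("https://my.host/my-reco-result-callback")
print(json.dumps(body, indent=1))
```

Keeping the payload in one function makes it easy to vary the callback URI or the two timeouts per call without touching the rest of the structure.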

If successful, this request returns the WebSocket URL in the audio.stream.websocketUrl field. This value is then used in the TwiML request.
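
As a rough illustration, sending the request and reading the WebSocket URL might look like the Python sketch below. Several things here are assumptions rather than facts from this post: the api_base argument (the root URL of your Voicegain API), the bearer-token Authorization header, and both function names. The sample dictionary contains only the one field named above; a real response will carry more.

```python
import json
import urllib.request


def start_transcription(api_base, token, body):
    """POST the request body and return the parsed JSON response.

    Assumptions: api_base is the root URL of your Voicegain API and the
    API accepts a bearer token; neither is specified in this post.
    """
    request = urllib.request.Request(
        api_base.rstrip("/") + "/asr/transcribe/async",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def websocket_url(response):
    """Read audio.stream.websocketUrl from a parsed response dict."""
    try:
        return response["audio"]["stream"]["websocketUrl"]
    except (KeyError, TypeError):
        raise ValueError("response has no audio.stream.websocketUrl")


# Illustrative response fragment, reduced to the single field used here.
sample = {
    "audio": {
        "stream": {
            "websocketUrl": "wss://api.ascalon.ai/v1/0/socket/7b39-sd55-s25-aab"
        }
    }
}
print(websocket_url(sample))
```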

Note that DTMF detection is currently not possible in transcribe mode. Please let us know if this is critical to your use case.

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done with the following TwiML request:

<Response>
 <Say voice="woman" language="en-US">Let me confirm</Say>
 <Connect>
   <Stream url="wss://api.ascalon.ai/v1/0/socket/7b39-sd55-s25-aab" >
     <Parameter name="prompt01" value="https://my.host/rcrdng/abc"/>
     <Parameter name="prompt02" value="$235.00"/>
     <Parameter name="prompt03" value="clip:correct ?"/>
     <Parameter name="voice" value="benjamin"/>
     <Parameter name="bargeIn" value="enable"/>
   </Stream>
 </Connect>
 <Redirect method="POST">https://my.host/twilio/cb</Redirect>
</Response>

Some notes about the content of the TwiML request:

  • the WebSocket URL is the one returned by the Voicegain /asr/transcribe/async request
  • more than one question prompt is supported; the prompts are played one after another
  • three types of prompts are supported: 01) a recording retrieved from a URL, 02) a TTS prompt (several voices are available), 03) a 'clip:' prompt generated using the Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
  • bargeIn is enabled, so prompt playback stops as soon as the caller starts speaking
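
The TwiML above can be generated rather than written by hand. The sketch below uses only Python's standard xml.etree.ElementTree module; the function name build_stream_twiml and its arguments are hypothetical, while the element names, attributes, and parameter values are copied from the example.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(websocket_url, parameters, redirect_url, intro=None):
    """Build a <Response> with <Connect><Stream> and a trailing <Redirect>.

    Hypothetical helper. `parameters` is a list of (name, value) pairs
    that become <Parameter> elements in the given order.
    """
    response = ET.Element("Response")
    if intro:
        say = ET.SubElement(
            response, "Say", {"voice": "woman", "language": "en-US"})
        say.text = intro
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})
    for name, value in parameters:
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})
    redirect = ET.SubElement(response, "Redirect", {"method": "POST"})
    redirect.text = redirect_url
    return ET.tostring(response, encoding="unicode")


twiml = build_stream_twiml(
    "wss://api.ascalon.ai/v1/0/socket/7b39-sd55-s25-aab",
    [
        ("prompt01", "https://my.host/rcrdng/abc"),  # recording from a URL
        ("prompt02", "$235.00"),                     # TTS prompt
        ("prompt03", "clip:correct ?"),              # Prompt Manager clip
        ("voice", "benjamin"),
        ("bargeIn", "enable"),
    ],
    "https://my.host/twilio/cb",
    intro="Let me confirm",
)
print(twiml)
```

Passing the prompts as an ordered list keeps the playback order explicit, since the prompts are played one after another.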

Returned Transcription Response

Below is an example response from the transcription for the case where "content" : {"full" : ["transcript"] }.

{
 "session":{
   "sessionId": "0-0kfir69zu1bjkp8fuuh4o8ufxmwz",
   "asyncMode": "REAL-TIME"},
 "result":{
   "status": "MATCH",
   "lastEvent": "RECOGNITION-COMPLETE",
   "final": true,
   "transcript": "mine phone is unknown yet"},
 "responseType": "AsyncResultFull",
 "progress":{
   "phase": "DONE",
   "audioStartTime": 0,
   "audioEndTime": 5010,
   "audioDuration": 5010,
   "clockStartTime": 1601069052776,
   "clockEndTime": 1601069058605,
   "xRealTime": 1.16}
}
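
A callback handler would typically pull the transcript out of this JSON. The Python sketch below shows one way to do that; the function names are hypothetical, and treating only a final result with status MATCH as usable is an assumption, since this post does not list the other status values. The second helper is based on an observation about the sample numbers, not on documentation: the wall-clock time divided by audioDuration comes to about 1.16, which is consistent with the xRealTime value shown.

```python
import json

# The example callback payload from this post.
SAMPLE = """
{
 "session": {
   "sessionId": "0-0kfir69zu1bjkp8fuuh4o8ufxmwz",
   "asyncMode": "REAL-TIME"},
 "result": {
   "status": "MATCH",
   "lastEvent": "RECOGNITION-COMPLETE",
   "final": true,
   "transcript": "mine phone is unknown yet"},
 "responseType": "AsyncResultFull",
 "progress": {
   "phase": "DONE",
   "audioStartTime": 0,
   "audioEndTime": 5010,
   "audioDuration": 5010,
   "clockStartTime": 1601069052776,
   "clockEndTime": 1601069058605,
   "xRealTime": 1.16}
}
"""


def final_transcript(payload):
    """Return the transcript for a final MATCH result, else None.

    Assumption: any other status means there is no usable transcript.
    """
    result = payload.get("result", {})
    if result.get("status") == "MATCH" and result.get("final"):
        return result.get("transcript")
    return None


def clock_to_audio_ratio(payload):
    """Wall-clock duration divided by audio duration.

    Inferred from the sample values only; not a documented formula.
    """
    progress = payload["progress"]
    elapsed = progress["clockEndTime"] - progress["clockStartTime"]
    return elapsed / progress["audioDuration"]


payload = json.loads(SAMPLE)
print(final_transcript(payload))
print(round(clock_to_audio_ratio(payload), 2))
```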

