Large vocabulary transcription for Twilio developers

In our previous post, we described how Voicegain provides grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

Starting with release 1.16.0 of the Voicegain Platform and API, it is possible to use Voicegain speech-to-text for speech transcription (without grammars) to achieve functionality similar to TwiML <Gather>.

The reasons we think it will be attractive to Twilio users are:

lower cost per speech-to-text capture
higher accuracy for customers who choose Acoustic Model customization
access to all speech-to-text hypotheses in word-tree output mode

Using Voicegain as an alternative to <Gather> involves steps similar to those for grammar-based recognition; they are described below.

Initiating Speech Transcription with Voicegain

This is done by invoking the Voicegain async transcribe API: /asr/transcribe/async

Below is an example of the payload needed to start a new transcription session:

Some notes about the content of the request:

we request that the callback return the transcript in text form; other options are words (individual words with confidences) and word-tree (words organized in a tree of recognition hypotheses)
startInputTimers tells the ASR to delay starting its input timers; they are started later, once the question prompt finishes playing
TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8 kHz
the asr settings include the two timeouts used in transcription: the no-input timeout and the complete timeout
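The notes above can be turned into a rough sketch of a request body. Only the values TWIML, PCMU, 8 kHz, startInputTimers, and "content": {"full": ["transcript"]} come from this post. The nesting of the keys, the names noInputTimeout and completeTimeout, the callback layout, the false value for startInputTimers, and the timeout values are assumptions made for illustration, so check them against the Voicegain API reference before use.

```python
import json


def build_transcribe_request(callback_uri, no_input_timeout_ms=5000,
                             complete_timeout_ms=2000):
    """Assemble a body for /asr/transcribe/async.

    The key layout below is an assumption pieced together from the notes
    in this post; it is not copied from the Voicegain API reference.
    """
    return {
        "sessions": [
            {
                # Assumed callback layout; the URI is supplied by the caller.
                "callback": {"uri": callback_uri},
                # From the post: ask for the full transcript in text form.
                "content": {"full": ["transcript"]},
            }
        ],
        "audio": {
            # From the post: TWIML streaming protocol, PCMU (u-law), 8 kHz.
            "source": {"stream": {"protocol": "TWIML"}},
            "format": "PCMU",
            "rate": 8000,
        },
        "settings": {
            "asr": {
                # Assumed placement and value: delay the timers until the
                # question prompt has finished playing.
                "startInputTimers": False,
                # Assumed key names and placeholder values for the two timeouts.
                "noInputTimeout": no_input_timeout_ms,
                "completeTimeout": complete_timeout_ms,
            }
        },
    }


if __name__ == "__main__":
    body = build_transcribe_request("https://example.com/voicegain/callback")
    print(json.dumps(body, indent=2))
```

Keeping the body in a small builder function makes it easy to swap the "content" value for the words or word-tree output once the exact field values are confirmed.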

If successful, this request returns the WebSocket URL in the audio.stream.websocketUrl field. This value is then used in the TwiML request.
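As a minimal sketch, the URL can be pulled out of the parsed JSON response as shown here. The field path audio.stream.websocketUrl is the one named above; the helper name extract_websocket_url and the sample response are hypothetical.

```python
def extract_websocket_url(response_body):
    """Return audio.stream.websocketUrl from a parsed JSON response.

    Raises ValueError if the field is missing or is not a ws:// or wss:// URL.
    """
    try:
        url = response_body["audio"]["stream"]["websocketUrl"]
    except (KeyError, TypeError) as exc:
        raise ValueError(
            "response does not contain audio.stream.websocketUrl") from exc
    if not isinstance(url, str) or not url.startswith(("ws://", "wss://")):
        raise ValueError("websocketUrl is not a WebSocket URL: %r" % (url,))
    return url


if __name__ == "__main__":
    # Hypothetical response, reduced to the one field this post mentions.
    sample = {"audio": {"stream": {"websocketUrl": "wss://example.invalid/stream/abc"}}}
    print(extract_websocket_url(sample))
```

Failing early on a missing or malformed URL avoids handing Twilio a TwiML document that cannot connect.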

Note that DTMF detection is currently not possible in transcribe mode. Please let us know if this would be critical to your use case.

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:

Some notes about the content of the TwiML request:

the WebSocket URL is the one returned from the Voicegain /asr/transcribe/async request
more than one question prompt is supported - they will be played one after another
three types of prompts are supported: 1) a recording retrieved from a URL, 2) a TTS prompt (several voices are available), 3) a 'clip:' prompt generated using Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
bargeIn is enabled - prompt playback will stop as soon as the caller starts speaking
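A TwiML document along the lines of the notes above could be generated as in the following sketch. The <Response><Connect><Stream url="..."> structure and the <Parameter> child elements are standard Twilio TwiML. The parameter names prompt01, prompt02, ... and the idea of passing bargeIn as a <Parameter> are assumptions made for illustration, not confirmed Voicegain names.

```python
import xml.etree.ElementTree as ET


def build_connect_stream_twiml(websocket_url, prompts, barge_in=True):
    """Build a <Connect><Stream> TwiML document as a string.

    websocket_url: the URL returned by /asr/transcribe/async.
    prompts: prompt values in playback order (recording URL, TTS text,
             or a 'clip:' reference).
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})
    # Hypothetical naming scheme: one numbered parameter per prompt.
    for index, prompt in enumerate(prompts, start=1):
        ET.SubElement(stream, "Parameter",
                      {"name": "prompt%02d" % index, "value": prompt})
    # Assumed: barge-in is passed as a custom stream parameter.
    ET.SubElement(stream, "Parameter",
                  {"name": "bargeIn", "value": "true" if barge_in else "false"})
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(build_connect_stream_twiml(
        "wss://example.invalid/stream/abc",
        ["https://example.com/prompts/question.wav", "clip:account-question"],
    ))
```

Building the document with an XML library rather than string concatenation ensures that prompt text containing characters such as & or < is escaped correctly.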

Returned Transcription Response

Below is an example response from the transcription for the case where the request specified "content": {"full": ["transcript"]}.
