• Jacek Jarmulak

Voicegain Speech-to-Text integrates with Twilio Media Streams

Updated: a day ago


Build Speech IVRs, Voice Bots

Voicegain has launched an extension to the Voicegain /asr/recognize API that supports Twilio Media Streams via TwiML <Connect><Stream>. With this launch, developers using Twilio's Programmable Voice get an accurate, affordable, and easy-to-use ASR/speech-to-text platform for building speech IVRs or voice bots.


Key Features of Twilio Media Streams support

Voicegain Twilio Media Streams support gives developers the following features:

  1. Grammar Support: Developers can now write speech IVR applications that use grammars. Many traditional VoiceXML IVRs are built using grammars. However, until now Twilio TwiML did not support the use of speech grammars, as the <Gather> command supports only text capture. This made it hard to migrate existing VoiceXML IVR applications to the Twilio platform. Mapping of text to semantic meaning had to be done separately, and a large vocabulary recognizer was more likely to return spurious recognitions. Voicegain solves these problems by supporting both GRXML and JSGF speech grammars at the core speech-to-text (ASR) engine level. This delivers higher accuracy than an ASR that uses a large vocabulary language model to recognize text and then applies grammars to the recognized text.

  2. 90% Savings on ASR Licensing Costs: Build low-cost speech IVR applications. A big advantage of the Twilio Programmable Voice platform for developers has been its affordable pricing. However, that was not necessarily true for existing ASR options such as <Gather>, which is priced at 2 cents per 15 seconds (with a 15-second minimum), or 8 cents per minute. With Voicegain the price is 1.25 cents per minute, measured in 1-second increments. Taking the billing increment into account, we are 90% cheaper.

  3. Better Timeout Support: Voicegain supports configurable no-input, complete, and incomplete timeouts. Because the grammar is integrated with the recognizer, Voicegain ASR can deliver an accurate complete-timeout response. This is not possible with the <Gather> command, for which the only way to tell that the caller has stopped speaking is a sufficiently long pause.

  4. Simplified Dynamic Prompt Playback: To make <Connect><Stream> as easy to use as possible, we support passing prompts when invoking <Stream>. Prompts can be provided either as text or as URLs. If a prompt is provided as text, Voicegain will either use TTS or perform dynamic concatenation of prerecorded prompts. A prompt manager for such prerecorded prompts is provided as part of the Voicegain Web Portal. Configurable barge-in is supported for the prompts.

  5. Fine-Tune and Test Grammars: The Voicegain Web Portal includes a tool for reviewing and fine-tuning grammars. The tool also supports regression tests, so you never have to deploy changed grammars without knowing how well they are going to perform.
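
The pricing comparison in item 2 can be checked with a little arithmetic. The sketch below uses only the rates quoted above. It assumes that <Gather> bills in whole 15-second blocks beyond the 15-second minimum (the text states only the minimum), and the function names are ours, chosen for illustration.

```python
import math

# Rates quoted in the text above.
GATHER_CENTS_PER_BLOCK = 2.0        # 2 cents per 15 seconds
GATHER_BLOCK_SECONDS = 15           # 15-second minimum; ASSUMED to also be the billing increment
VOICEGAIN_CENTS_PER_MINUTE = 1.25   # billed in 1-second increments


def gather_cost_cents(seconds: float) -> float:
    """Cost of one <Gather> recognition, rounded up to whole 15-second blocks."""
    blocks = max(1, math.ceil(seconds / GATHER_BLOCK_SECONDS))
    return blocks * GATHER_CENTS_PER_BLOCK


def voicegain_cost_cents(seconds: float) -> float:
    """Cost of one Voicegain recognition, rounded up to whole seconds."""
    billed_seconds = math.ceil(seconds)
    return billed_seconds * VOICEGAIN_CENTS_PER_MINUTE / 60


def savings_percent(seconds: float) -> float:
    """Relative saving of Voicegain over <Gather> for one utterance."""
    gather = gather_cost_cents(seconds)
    return 100 * (gather - voicegain_cost_cents(seconds)) / gather


if __name__ == "__main__":
    for duration in (3, 5, 15, 20, 60):
        print(f"{duration:>3}s  gather={gather_cost_cents(duration):.2f}c  "
              f"voicegain={voicegain_cost_cents(duration):.3f}c  "
              f"saving={savings_percent(duration):.1f}%")
```

Under these assumptions, a 5-second answer costs 2 cents with <Gather> versus about 0.1 cents with Voicegain, while a full 60 seconds of speech costs 8 cents versus 1.25 cents. The short answers typical of IVR dialogs therefore see the largest savings.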


How Twilio Media Streams works with Voicegain


TwiML <Stream> requires a websocket URL, which can be obtained by invoking the Voicegain /asr/recognize/async API. The grammar to be used in the recognition has to be provided when invoking this API. The websocket URL is returned in the response.


In addition to the wss URL, Custom Parameters within the <Connect><Stream> command are used to pass information about the question prompt that Voicegain will play to the caller. This can be either text or a URL to a service that will provide the audio.
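
As a rough illustration of the two paragraphs above, the sketch below builds the TwiML for <Connect><Stream> with Python's standard library. The <Parameter> child element is how Twilio passes Custom Parameters on a <Stream>. The parameter name `promptText` and the example URL are placeholders we made up for illustration, not documented Voicegain names. In a real application the websocket URL would come from the /asr/recognize/async response.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_connect_stream_twiml(wss_url: str, custom_params: Dict[str, str]) -> str:
    """Return a TwiML document that connects the call to a websocket stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": wss_url})
    # Each Custom Parameter becomes a <Parameter name="..." value="..."/> child.
    for name, value in custom_params.items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    twiml = build_connect_stream_twiml(
        "wss://example.invalid/placeholder-session",   # placeholder, not a real endpoint
        {"promptText": "Please say your account number."},  # hypothetical parameter name
    )
    print(twiml)
```

Building the document with an XML library rather than string formatting means that prompt text containing characters such as & or quotes is escaped correctly.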

Once <Connect><Stream> has been invoked, the Voicegain platform takes over:

  • The prompt is played via the back channel of <Stream>.

  • As soon as the caller starts speaking, prompt playback is stopped (if it was still playing), exactly like in <Gather>.

  • The spoken words are recognized using the grammar. The recognition result is then provided as a callback from the Voicegain platform. In case of no-input or no-match, an appropriate callback will also be made.

  • The <Stream> connection is stopped and the TwiML application continues with the next command.

BTW, we do support DTMF input as an alternative to speech input.
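
How an application reacts to the callback is up to the developer. A minimal sketch of one possible policy follows. The payload field `status` and its values `MATCH`, `NOINPUT`, and `NOMATCH` are hypothetical stand-ins for whatever the Voicegain callback actually contains, and the retry limit is arbitrary.

```python
from typing import Any, Dict


def next_action(callback: Dict[str, Any], attempts: int, max_attempts: int = 3) -> str:
    """Decide what the IVR should do after one recognition attempt.

    `callback` is a HYPOTHETICAL payload shape: {"status": "MATCH" | "NOINPUT" | "NOMATCH"}.
    `attempts` counts how many times this question has been asked so far.
    """
    status = callback.get("status")
    if status == "MATCH":
        return "proceed"      # the grammar matched; continue the dialog
    if status in ("NOINPUT", "NOMATCH") and attempts < max_attempts:
        return "reprompt"     # ask the same question again
    return "fallback"         # too many failures or an unknown status


if __name__ == "__main__":
    print(next_action({"status": "MATCH"}, attempts=1))
    print(next_action({"status": "NOINPUT"}, attempts=1))
    print(next_action({"status": "NOMATCH"}, attempts=3))
```

The returned action would then be translated into the next TwiML command, for example another <Connect><Stream> for a reprompt.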


[UPDATE: you can see more details of how to use Voicegain with Twilio Media Streams in this new Blog post.]


Other features of the Voicegain Platform

1. On-Premise/Edge Support: While Voicegain APIs are available as a cloud PaaS service, Voicegain also supports on-premise/edge deployment. Voicegain can be deployed as a containerized service on a single-node Kubernetes cluster or on a multi-node high-availability Kubernetes cluster (on your GPU hardware or in your VPC).


2. Acoustic Model Customization: This makes it possible to achieve very high accuracy, beyond what is possible with out-of-the-box recognizers. The grammar tuning and regression tool mentioned earlier can be used to collect training data for acoustic model customization.


More Features Coming

On our near-term roadmap for Twilio users we have several more features:

  • Advanced Answering Machine Detection (AMD): will be invoked using <Connect><Stream> and will provide very accurate answering machine detection using speech recognition.

  • Large vocabulary language model: captures just the spoken words (no grammars are used), so you can integrate with any NLU engine of your choice. We think it will be attractive because of its lower cost compared to <Gather>.

  • Real-time agent assist: we are combining our real-time speech recognition with speech analytics to deliver an API for building real-time agent assist and monitoring applications.

You can sign up to try our platform. We are offering 600 minutes of free monthly use of the platform. If you have questions about integration with Twilio, send us a note at support@voicegain.ai.


Twilio, TwiML, and Twilio Programmable Voice are registered trademarks of Twilio, Inc.
