Voicegain Speech-to-Text integrates with Twilio Media Streams
Updated: Oct 7
Build IVRs, Voice Bots, Real-time Agent Assist
Voicegain has launched an extension to the Voicegain /asr/recognize API that supports Twilio Media Streams via TwiML <Connect><Stream>. With this launch, developers using Twilio's Programmable Voice get an accurate, affordable, and easy-to-use ASR for building grammar-based bots and speech IVRs.
Update: Voicegain also announced that its large-vocabulary transcription (/asr/transcribe API) integrates with Twilio Media Streams. Developers may use this to voice-enable a chatbot developed on any bot platform or to develop a real-time agent assist application.
Key Features of Twilio Media Streams support
Voicegain Twilio Media Streams support gives developers the following features:
Grammar Support for bots & IVRs: Developers can now write voice bots or IVRs that use grammars. Grammars can improve recognition accuracy and simplify bot development by constraining the speech-to-text engine, and many traditional VoiceXML IVRs are already built on them. Until now, Twilio TwiML did not support speech grammars, because the <Gather> command supports only text capture. This made it hard to build simple bots or to migrate existing VoiceXML IVR applications to the Twilio platform: mapping text to semantic meaning had to be done separately, and a large-vocabulary recognizer was more likely to return spurious recognitions. Voicegain solves these problems by supporting both GRXML and JSGF speech grammars at the core speech-to-text (ASR) engine level. This delivers higher accuracy than an ASR that uses a large-vocabulary language model to recognize text and then applies grammars to the recognized text.
90% Savings on ASR Licensing costs: A big advantage of the Twilio Programmable Voice platform for developers has been its affordable pricing. However, that has not necessarily been true of existing ASR options like <Gather>, which is priced at 8 cents/minute (with a 15-second minimum). With Voicegain, the ASR/STT price is 1.25 cents/minute, measured in 1-second increments. The per-minute rate alone is about 84% lower, and once the billing increment is included, savings on the short utterances typical of IVR interactions reach about 90%.
Better Timeout Support: Voicegain supports configurable no-input, complete, and incomplete timeouts. Because the grammar is integrated with the recognizer, Voicegain ASR can deliver an accurate complete-timeout response. This is not possible with the <Gather> command, for which the only way to tell whether the caller has stopped speaking is a long enough pause.
Simplified dynamic prompt playback: To make <Connect><Stream> as easy to use as possible, we support passing prompts when invoking <Stream>. Prompts can be provided either as text or as URLs. If a prompt is provided as text, Voicegain will either use TTS or perform dynamic concatenation of prerecorded prompts. A prompt manager for such prerecorded prompts is provided as part of the Voicegain Web Portal. Configurable barge-in is supported for the prompts.
Grammar fine-tuning and testing: The Voicegain Web Portal includes a tool for reviewing and fine-tuning grammars. The tool also supports regression tests, so you never have to deploy a changed grammar without knowing how well it is going to perform.
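The pricing comparison in the feature list above can be checked with a little arithmetic. The sketch below uses only the rates quoted in this post; how <Gather> is billed beyond its 15-second minimum is not stated here, so per-second billing after the minimum is an assumption, and the function names are made up for illustration.

```python
import math

# Rates quoted in this post.
GATHER_CENTS_PER_MIN = 8.0
GATHER_MIN_SECONDS = 15
VOICEGAIN_CENTS_PER_MIN = 1.25


def gather_cost_cents(seconds: float) -> float:
    """Cost of one <Gather> recognition.

    Assumption: at least 15 seconds are billed, then whole seconds.
    """
    billed = max(math.ceil(seconds), GATHER_MIN_SECONDS)
    return billed * GATHER_CENTS_PER_MIN / 60


def voicegain_cost_cents(seconds: float) -> float:
    """Cost of one Voicegain recognition, billed in 1-second increments."""
    billed = math.ceil(seconds)
    return billed * VOICEGAIN_CENTS_PER_MIN / 60


def savings_pct(seconds: float) -> float:
    """Percentage saved by Voicegain relative to <Gather> for one utterance."""
    gather = gather_cost_cents(seconds)
    voicegain = voicegain_cost_cents(seconds)
    return 100 * (1 - voicegain / gather)
```

Under these assumptions, savings_pct(60) comes to about 84.4 and savings_pct(5) to about 94.8, so the shorter the caller's answer, the larger the saving.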
How Twilio Media Streams works with Voicegain
TwiML <Stream> requires a WebSocket URL. This URL can be obtained by invoking the Voicegain /asr/recognize/async API. The grammar to be used in the recognition has to be provided in that request, and the WebSocket URL is returned in the response.
In addition to the wss URL, Custom Parameters within the <Connect><Stream> command are used to pass information about the question prompt that Voicegain will play to the caller. The prompt can be either text or a URL of a service that provides the audio.
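As a rough illustration of the TwiML side, the sketch below builds a <Connect><Stream> document with Custom Parameters using only the Python standard library. The WebSocket URL would come from the /asr/recognize/async response; the URL shown and the parameter name "prompt" are placeholders, not the names Voicegain actually expects.

```python
import xml.etree.ElementTree as ET
from typing import Mapping


def build_connect_stream_twiml(websocket_url: str,
                               params: Mapping[str, str]) -> str:
    """Return TwiML that connects the call to a media stream.

    Each entry in `params` becomes a <Parameter> child of <Stream>,
    which is how TwiML carries Custom Parameters.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})
    for name, value in params.items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})
    return ET.tostring(response, encoding="unicode")


# Placeholder values: the real URL is returned by /asr/recognize/async,
# and "prompt" is a hypothetical parameter name.
twiml = build_connect_stream_twiml(
    "wss://example.invalid/hypothetical-session",
    {"prompt": "What is your account number?"},
)
```

The resulting string would be returned from the webhook that Twilio calls for the next TwiML instruction.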
Once <Connect><Stream> has been invoked, the Voicegain platform takes over. It:
Plays the prompt via the back channel of <Stream>.
Stops the prompt playback as soon as the caller starts speaking (if it was still playing), exactly as in <Gather>.
Recognizes the spoken words using the grammar. The recognition result is then provided as a callback from the Voicegain platform. In case of no-input or no-match, an appropriate callback is also made.
Stops the <Stream> connection, after which the TwiML application continues with the next command.
BTW, we also support DTMF input as an alternative to speech input.
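On the application side, the callback described above has three outcomes to handle: a match, no-input, and no-match. The sketch below shows one way an application might choose its next step. The callback format is not described in this post, so the field names "status" and "attempt", the status values, and the retry limit are all hypothetical.

```python
# Hypothetical retry limit chosen for this example.
MAX_ATTEMPTS = 3


def next_action(callback: dict) -> str:
    """Pick the next step of the call flow from a recognition callback.

    Assumed (hypothetical) payload fields:
      status  -- "MATCH", "NOINPUT" or "NOMATCH"
      attempt -- how many times the question has been asked so far
    """
    status = callback.get("status")
    if status == "MATCH":
        # The grammar matched, so the application can move on.
        return "continue"
    if status in ("NOINPUT", "NOMATCH"):
        attempt = callback.get("attempt", 1)
        # Ask again until the retry limit is reached, then hand off.
        return "reprompt" if attempt < MAX_ATTEMPTS else "transfer"
    # Unknown or missing status: fail safe by handing off to an agent.
    return "transfer"
```

Keeping this decision in a small pure function makes it easy to test separately from the webhook that returns the next TwiML.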
[UPDATE: you can see more details of how to use Voicegain with Twilio Media Streams in this new Blog post.]
Other features of the Voicegain Platform
1. On-Premise/Edge Support: While Voicegain APIs are available as a cloud PaaS service, Voicegain also supports OnPrem/Edge deployment. Voicegain can be deployed as a containerized service on a single-node Kubernetes cluster or on a multi-node high-availability Kubernetes cluster (on your GPU hardware or in your VPC).
2. Acoustic model customization: This makes it possible to achieve very high accuracy, beyond what is possible with out-of-the-box recognizers. The grammar tuning and regression tool mentioned earlier can be used to collect training data for acoustic model customization.
More Features Coming
On our near-term roadmap for Twilio users we have several more features:
Advanced Answering Machine Detection (AMD): This will be invoked using <Connect><Stream> and will provide very accurate answering machine detection using speech recognition.
Large-vocabulary language model: This will simply capture the spoken words (no grammars are used) so that you can integrate with any NLU engine of your choice. We think it will be attractive because of the lower cost compared to <Gather>.
Real-time agent assist: We are combining our real-time speech recognition with speech analytics to deliver an API for building real-time agent assist and monitoring applications.
You can sign up to try our platform. We are offering 600 minutes of free monthly use of the platform. If you have questions about integration with Twilio, send us a note at firstname.lastname@example.org.
Twilio, TwiML and Twilio Programmable Voice are registered trademarks of Twilio, Inc.