Invoke our ASR with Speech Grammars or use large vocabulary transcription for Voice Bots, IVRs, real-time captioning and more.
On a broad benchmark, our accuracy of 89% is on par with the very best
Talk to us in English, Spanish, German, Portuguese, Korean Hindi (more coming)
Tested on compute instances on Google, AWS, Azure, IBM & Oracle
Integrates with VXML, FreeSWITCH and other platforms that talk MRCP
Most enterprise IT organizations have mature telephony based IVR applications that serve as the “front door” for all voice based customer support calls. These applications use a combination of touchtone (DTMF) and speech to interact with callers. They have been carefully designed, developed and tuned over the years.
The objectives of any IVR are two fold 1) Automate simple routine queries (like balance inquiry, payment status, etc) and 2) Authenticate and intelligently route calls that require live support to the appropriate agent.
IT organizations across industry verticals like financial services, travel, media, telecom, retail or health-care have a small staff of in-house or outsourced IVR developers to maintain these applications. While enterprises have been focused on scaling and upgrading their digital support channels (like chat and email), IVR applications have largely remained un-touched for years.
As CIOs and CDOs (Chief Digital Officers) embark on strategic initiatives to migrate enterprise workloads to the Cloud, one "niche" workload on this list is the IVR. However migrating IVRs "as-is" to the cloud is tricky. The languages, protocols and platforms that these telephony based IVRs were built on is from the early 2000s and are approaching obsolescence. Also while they support directed dialogs with limited customer spoken utterances, they are not a good fit for conversational bot interactions.
So IT organizations are faced with a Catch 22 situation. On one-hand, it is cumbersome to maintain these IVR workloads. On the other hand, the rationale to migrate existing platforms "as-is" to modern cloud infrastructure is questionable. Why bear the trouble and expense if IVRs are eventually are going to be replaced by conversational bots?
So there is a real need to modernize these IVRs as part of their cloud migration strategy.
Traditionally speech IVR applications ran on on-premise Contact Center telephony platforms. Companies like Avaya, Nortel, Cisco, Intervoice, Genesys and Aspect dominated the vendor landscape. In the early to mid-2000s, these vendors worked collaboratively as part of the W3C consortium to develop VoiceXML, an open vendor agnostic language for speech-enabled IVR applications.
VoiceXML enabled developers to build interactive voice dialogs and provided a standard way to interact with an automatic speech recognizer (ASR). This was done using a telephony based protocol called MRCP. The standard also provided a method to define speech grammars called SRGS and a format called GRXML.
The architecture and supporting jargon/terminology around VoiceXML borrowed heavily from the web world. The VoiceXML platform was referred to as a “Voice browser” that could “render VoiceXML pages” just like how a web browser could render HTML pages. Most contact center platforms provided visual IDEs to help build and maintain these interactive call flows. Some also automated the generation of the VoiceXML pages. The IDE generated code that could run on application server (like Apache Tomcat) which in turn generated VoiceXML pages that were sent to a VoiceXML platform over standard HTTP. The application server was also responsible for making web-services requests to enterprise database resources that were required for the IVR interaction; for e.g. billing/payment systems or CRM systems.
Also most ASRs from the late 90s and early 2000s were based on Hidden Markov Models and Gaussian Mixture models. They mainly supported grammar-based recognition - which meant that as a Speech IVR developer you had to anticipate all possible utterances that a user could say in response to a question/prompt. There were some options to build open-ended statistical language models but these were tricky and required careful selection of the training corpus.
Why modernize now?
While VoiceXML worked well in the past, it is a niche and outdated language. The last release of VoiceXML 2.1 was back in 2007!! That is more than a decade ago.
And a lot has changed in the web world since then. VoiceXML was developed at a time when JSP (Java Server Pages) was widely used. So it was before JSON, YAML, RESTful APIs & AJAX.
For enterprises, it is expensive to maintain a dedicated staff - whether in-house or outsourced - with niche skills in technologies like VoiceXML and MRCP.
Enterprises should ideally be able to run IVR app like any other modern web application. Most enterprise web apps are built on programming languages like Python, Node.JS that are popular with web developers. They are containerized using docker and orchestrated using Kubernetes.
It would be ideal for an enterprise IT organization for its IVR app to be built on similar programming languages so that it can be supported or maintained just like other applications in the IT portfolio.
In addition to the obsolescence of VoiceXML, the speech recognition engine (ASR) that was deployed in the early 2000s has also become outdated. Modern speech-to-text engines are built on Deep Neural Networks that run on powerful GPU infrastructure. They offer amazing accuracy and allow the use of a very large vocabulary - which is what is needed for bot like conversational experience. Also modern NLU engines allow you to easily extract intents from the transcribed text.
So if an enterprise wants to offer a voice bot that supports an open conversational experience, they need to move to a modern DNN based Speech-to-Text platform that can integrate with such NLU engines.
At Voicegain, we recommend that an enterprise first modernize the underlying infrastructure while retaining the existing IVR application logic. This is a great first step. It allows an enterprise to continue serving existing users while taking a step towards providing a more conversational user experience.
We suggest that the existing call flow logic - which is typically maintained using visual IDEs of contact center platforms - get rewritten (ideally with the help of automated tools) into a modern programming language like Python or Node.Js.
Instead of generating legacy VoiceXML pages, enterprises should use web friendly data representation languages like JSON or YAML to interact with modern RESTful Speech-to-Text APIs using web callbacks.
How Voicegain supports IVR App modernization?
At Voicegain, we provide a modern Voice AI platform that includes
Voicegain is developing tools to automatically convert VoiceXML to equivalent JSON/YAML representation that talks to our callback APIs.
How is this a "future proof" architecture for an enterprise?
The Voicegain platform is capable of large vocabulary transcription which is a requirement for NLU based Voice Bots. This will be the way customers interact with enterprises in the future.
We allow developers to switch between grammar based recognition and large vocabulary recognition at each and every turn of the dialog; or you could simultaneously use both to achieve more flexibility.
Our Telephony Bot APIs can also integrate with Bot Frameworks like Google Dialog Flow, .
We are inviting enterprise web developers for a free trial of our platform.
Voicegain STT platform has supported MRCP (Media Resource Control Protocol) for a long time now. Our ASR can be accessed using MRCP and we support both grammar-based recognition (e.g. GRXML) and large-vocabulary transcription. MRCP is a communication protocol designed to connect telephony based IVRs and Voice Bots with speech recognizers (ASR) and speech synthesizers (TTS).
Previously we tested connecting to Voicegain using MRCP from VXML platforms like Dialogic PowerMedia XMS or Aspect Prophecy. We had not tested connecting from FreeSWITCH, a popular open source telephony platform, using its MRCP plugin mod_unimrcp.
We are pleased to announce that Voicegain platform works out-of-the box with mod_unimrcp, the MRCP plugin for FreeSWITCH. However, getting the mod_unimrcp plugin to work on FreeSWITCH is not particularly trivial. Here are some pointers to help those who would like to use mod_unimrcp with our platform.
There are currently 2 options to do this. We plan to add a third option very soon
Also, the current TTS option accessible over MRCP are not great. Our focus has been on the use of prerecorded prompts for IVRs and Voice Bots. We plan to shortly allow developers to access the Google or Amazon TTS.
mod_unimrcp does not get built by default when you build FreeSWITCH from source. To get it built you need to enable it in build/modules.conf.in by uncommenting this line: #asr_tts/mod_unimrcp
After the build, before starting FreeSWITCH you will need to:
Here is an example MRCP v2 profile for connecting to Voicegain MRCP:
Here are some additional notes about the configuration file:
Here is an example of how to play a question prompt and to invoke the ASR via mod_unimrcp to recognize a spoken phone number:
What this example does is:
The result of the recognition is a string in XML format (NLSML). You will need to parse it to get the utterance and any semantic interpretations. NLSML result also contains confidence.
The normal command "play_and_detect_speech" holds onto ASR session until the end of the call - this makes subsequent recognitions more responsive, but you are paying for the MRCP session. You can also use this command "play_and_detect_speech_close_asr" to release ASR session immediately after recognition.
If you have any questions about the use of Voicegain ASR via MRCP please contact us at: firstname.lastname@example.org
On our roadmap we have a mod_voicegain plugin for FreeSWITCH which will bypass the need for mod_unimrcp and unimrcp server and will be talking from FreeSWITCH directly to the Voicegain ASR using gRPC.