• Arun Santhebennur

Modernize your VoiceXML IVR into conversational Voice Bots

Updated: 5 days ago



The urgent need for IVR platform modernization

Most enterprise IT organizations have mature telephony based IVR applications that serve as the “front door” for all voice based customer support calls. These applications use a combination of touchtone (DTMF) and speech to interact with callers. They have been carefully designed, developed and tuned over the years.


The objectives of any IVR are two fold 1) Automate simple routine queries (like balance inquiry, payment status, etc) and 2) Authenticate and intelligently route calls that require live support to the appropriate agent.


IT organizations across industry verticals like financial services, travel, media, telecom, retail or health-care have a small staff of in-house or outsourced IVR developers to maintain these applications. While enterprises have been focused on scaling and upgrading their digital support channels (like chat and email), IVR applications have largely remained un-touched for years.


As CIOs and CDOs (Chief Digital Officers) embark on strategic initiatives to migrate enterprise workloads to the Cloud, one "niche" workload on this list is the IVR. However migrating IVRs "as-is" to the cloud is tricky. The languages, protocols and platforms that these telephony based IVRs were built on is from the early 2000s and are approaching obsolescence. Also while they support directed dialogs with limited customer spoken utterances, they are not a good fit for conversational bot interactions.


So IT organizations are faced with a Catch 22 situation. On one-hand, it is cumbersome to maintain these IVR workloads. On the other hand, the rationale to migrate existing platforms "as-is" to modern cloud infrastructure is questionable. Why bear the trouble and expense if IVRs are eventually are going to be replaced by conversational bots?


So there is a real need to modernize these IVRs as part of their cloud migration strategy.


A brief look at the underlying infrastructure of these IVR applications

Traditionally speech IVR applications ran on on-premise Contact Center telephony platforms. Companies like Avaya, Nortel, Cisco, Intervoice, Genesys and Aspect dominated the vendor landscape. In the early to mid-2000s, these vendors worked collaboratively as part of the W3C consortium to develop VoiceXML, an open vendor agnostic language for speech-enabled IVR applications.


VoiceXML enabled developers to build interactive voice dialogs and provided a standard way to interact with an automatic speech recognizer (ASR). This was done using a telephony based protocol called MRCP. The standard also provided a method to define speech grammars called SRGS and a format called GRXML.


The architecture and supporting jargon/terminology around VoiceXML borrowed heavily from the web world. The VoiceXML platform was referred to as a “Voice browser” that could “render VoiceXML pages” just like how a web browser could render HTML pages. Most contact center platforms provided visual IDEs to help build and maintain these interactive call flows. Some also automated the generation of the VoiceXML pages. The IDE generated code that could run on application server (like Apache Tomcat) which in turn generated VoiceXML pages that were sent to a VoiceXML platform over standard HTTP. The application server was also responsible for making web-services requests to enterprise database resources that were required for the IVR interaction; for e.g. billing/payment systems or CRM systems.


Also most ASRs from the late 90s and early 2000s were based on Hidden Markov Models and Gaussian Mixture models. They mainly supported grammar-based recognition - which meant that as a Speech IVR developer you had to anticipate all possible utterances that a user could say in response to a question/prompt. There were some options to build open-ended statistical language models but these were tricky and required careful selection of the training corpus.


Why modernize now? While VoiceXML worked well in the past, it is a niche and outdated language. The last release of VoiceXML 2.1 was back in 2007!! That is more than a decade ago.

And a lot has changed in the web world since then. VoiceXML was developed at a time when JSP (Java Server Pages) was widely used. So it was before JSON, YAML, RESTful APIs & AJAX.


For enterprises, it is expensive to maintain a dedicated staff - whether in-house or outsourced - with niche skills in technologies like VoiceXML and MRCP.


Enterprises should ideally be able to run IVR app like any other modern web application. Most enterprise web apps are built on programming languages like Python, Node.JS that are popular with web developers. They are containerized using docker and orchestrated using Kubernetes.


It would be ideal for an enterprise IT organization for its IVR app to be built on similar programming languages so that it can be supported or maintained just like other applications in the IT portfolio.


In addition to the obsolescence of VoiceXML, the speech recognition engine (ASR) that was deployed in the early 2000s has also become outdated. Modern speech-to-text engines are built on Deep Neural Networks that run on powerful GPU infrastructure. They offer amazing accuracy and allow the use of a very large vocabulary - which is what is needed for bot like conversational experience. Also modern NLU engines allow you to easily extract intents from the transcribed text.


So if an enterprise wants to offer a voice bot that supports an open conversational experience, they need to move to a modern DNN based Speech-to-Text platform that can integrate with such NLU engines.


Our recipe for IVR App modernization



At Voicegain, we recommend that an enterprise first modernize the underlying infrastructure while retaining the existing IVR application logic. This is a great first step. It allows an enterprise to continue serving existing users while taking a step towards providing a more conversational user experience.


How can an enterprise modernize its legacy IVR app?

We suggest that the existing call flow logic - which is typically maintained using visual IDEs of contact center platforms - get converted (ideally with the help of automated tools) into a modern programming language like Python or Node.Js.


Instead of generating legacy VoiceXML pages, enterprises should use web friendly data representation languages like JSON or YAML to interact with modern RESTful Speech-to-Text APIs using web callbacks.


How Voicegain supports IVR App modernization?

At Voicegain, we provide a modern Voice AI platform that includes

  1. A modern DNN based speech recognizer accessible using RESTful APIs

  2. Ability to interface directly with telephone calls delivered over SIP/RTP

  3. JSON style callback APIs to replace functionality of a VoiceXML

  4. Ability to deploy on your VPC/Private Cloud or use as a cloud service

  5. Fully feature compatible with legacy standards (support SRGS grammars, universals)

  6. Training of the underlying acoustic model & language models to get high recognition accuracy

We also provide tools to automatically convert VoiceXML to equivalent JSON/YAML representation that talks to our callback APIs.


How is this a "future proof" architecture for an enterprise?

The Voicegain platform is capable of large vocabulary transcription which is a requirement for NLU based Voice Bots. This will be the way customers interact with enterprises in the future.


We allow developers to switch between grammar based recognition and large vocabulary recognition at each and every turn of the dialog; or you could simultaneously use both to achieve more flexibility.


We also integrate with DNN based NLU Engines like RASA and Google Dialogflow.


We are inviting enterprise web developers for a free trial of our platform.



Contact Us