• Arun Santhebennur

How can web developers upgrade legacy IVRs into conversational bots?

Updated: Apr 3



Why do IVRs need a facelift?

Most IT organizations have a portfolio of mature IVR applications that act as a “front door” for all customer support phone calls. These applications have been carefully designed and tuned over the years to ensure that (a) routine queries are contained in the IVR and (b) more complex transactions that require live human assistance are routed efficiently to the agent best suited to handle them. IT organizations, whether in banking, insurance, airlines, telecommunications or healthcare, have a small staff of in-house or outsourced IVR developers who maintain these applications. But the infrastructure these applications were built on dates from the early 2000s and is nearing obsolescence; it may not be supported for much longer. And while enterprises have been investing in digital channels like chat and email, the importance of phone calls in customer support is not going to diminish. It is time for back-end web developers to take control and modernize the IVR using current programming frameworks and the latest developments in AI and machine learning.


What development platforms were these speech-enabled IVRs built on?

Traditionally, IVR applications were developed on platforms provided by telephony switch vendors like Avaya, Nortel, Genesys, Aspect, etc. This changed in the early and mid-2000s when the big telephony vendors worked collaboratively within the W3C to develop VoiceXML, an open, vendor-agnostic language for speech-enabled IVR applications. VoiceXML enabled developers to build interactive voice dialogs; it was accompanied by a standard protocol for talking to a speech recognizer (MRCP) and a standard format for defining speech grammars (SRGS). The architecture and terminology around VoiceXML borrowed heavily from the web world. The VoiceXML platform was referred to as a “voice browser” that could “render VoiceXML pages” just as a web browser renders HTML pages. Companies also built visual IDEs to help IVR developers design and develop interactive call flows and speech grammars. Some also automated the generation of the VoiceXML pages. Most enterprises built the application logic in a higher-level programming language like Java and ran it on an application server (like Apache Tomcat), which in turn sent VoiceXML pages to the VoiceXML platform over standard HTTP. The application server was also responsible for making web-service requests to the enterprise back-end systems required for the IVR interaction, whether billing/payment systems or troubleshooting services (e.g. at cable companies).
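
To make that architecture concrete, here is a minimal sketch of the application server's job: generate a VoiceXML page that the voice browser fetches over HTTP. It is written in Python for brevity (the applications described above were typically Java). The function name `build_menu_page`, the form and field names, and the URL are hypothetical; only the VoiceXML element names come from the W3C specification.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_menu_page(submit_url, choices):
    """Return a VoiceXML 2.1 page that asks one question and posts the answer."""
    vxml = ET.Element("vxml", {"version": "2.1", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "main_menu"})

    # A <field> collects one piece of input from the caller.
    field = ET.SubElement(form, "field", {"name": "choice"})
    prompt = ET.SubElement(field, "prompt")
    prompt.text = "Would you like " + " or ".join(choices) + "?"

    # Inline SRGS grammar: the recognizer only listens for these phrases.
    grammar = ET.SubElement(
        field, "grammar", {"version": "1.0", "root": "choice", "mode": "voice"}
    )
    rule = ET.SubElement(grammar, "rule", {"id": "choice"})
    one_of = ET.SubElement(rule, "one-of")
    for phrase in choices:
        ET.SubElement(one_of, "item").text = phrase

    # Once the field is filled, post the result back to the application server.
    block = ET.SubElement(form, "block")
    ET.SubElement(
        block, "submit", {"next": submit_url, "namelist": "choice", "method": "post"}
    )
    return ET.tostring(vxml, encoding="unicode")


page = build_menu_page("http://app.example.com/ivr/next", ["billing", "support"])
```

Every turn of the call is one round trip of this kind: the voice browser renders the page, the caller speaks, and the result is submitted to the server, which replies with the next page.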


Why modernize now?

While VoiceXML worked in the past, it is now a niche and outdated language. The last release, VoiceXML 2.1, dates back to 2007, more than a decade ago. VoiceXML was developed at a time when JSP (JavaServer Pages) was widely used, before today's dynamic web applications were built and before JSON, AJAX and RESTful APIs became mainstream.


For enterprises, it is not easy to maintain a staff of VoiceXML developers, whether in-house or outsourced. Web developers are unlikely to learn VoiceXML. In fact, if you still have a VoiceXML developer, they are probably multi-skilled and also know another language like Java, Python or JavaScript (Node.js). Net-net, most IT organizations would prefer to build and maintain these IVR apps with a modern web stack like Django or Node.js.


In addition to the obsolescence of VoiceXML, the speech recognition engine (ASR) that was deployed in the early 2000s has also become outdated. ASRs from the late 1990s and early 2000s relied on traditional Hidden Markov Models and Gaussian mixture models, which limited recognition accuracy. These speech recognizers largely supported grammar-based recognition, which meant that as a developer you had to anticipate every utterance a user might say in response to a question. There were some options for building open-ended statistical language models, but these were tricky and required careful selection of the training corpus. Also, developers had limited access to some key features of the recognizer; for example, they could not customize the acoustic model to account for accents or application-specific words to improve accuracy. Net-net, recognition performance was fairly limited.
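
The grammar constraint is easy to illustrate. The sketch below is a toy, not a recognizer: it works on text rather than audio, and the grammar and the names `PAY_BILL_GRAMMAR`, `expand` and `in_grammar` are made up for this example. It does show the core limitation, though: anything the developer did not anticipate is simply out of grammar.

```python
from itertools import product

# Each slot lists its alternatives; "" marks an optional slot.
PAY_BILL_GRAMMAR = [
    ["", "i want to", "i would like to"],
    ["pay"],
    ["my", "the"],
    ["bill", "balance"],
]


def expand(grammar):
    """Enumerate every phrase the grammar accepts."""
    phrases = set()
    for combo in product(*grammar):
        phrases.add(" ".join(part for part in combo if part))
    return phrases


def in_grammar(utterance, grammar):
    """Return True only if the utterance is one of the anticipated phrases."""
    normalized = " ".join(utterance.lower().split())
    return normalized in expand(grammar)


accepted = in_grammar("I want to pay my bill", PAY_BILL_GRAMMAR)
rejected = in_grammar("I need to settle my bill", PAY_BILL_GRAMMAR)
```

Here `accepted` is true and `rejected` is false, even though both callers want the same thing. Covering the second phrasing means going back and extending the grammar by hand.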


However, modern speech-to-text engines leverage advances in deep neural networks (DNNs) that run on modern GPUs. DNNs power conversational assistants like Alexa and Google Now, and they offer highly accurate speech recognition.


Another big challenge is that these speech-enabled IVRs are prohibitively expensive to maintain. Enterprises pay for licenses based on peak capacity/utilization. In essence, you are reserving a “seat” for the entire year even if you use it for only a few minutes.
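
A quick back-of-the-envelope calculation shows why peak-based licensing hurts. The traffic profile below is invented purely for illustration, and `port_utilization` is a hypothetical helper, not part of any vendor's tooling.

```python
def port_utilization(concurrent_calls):
    """Return (ports_to_license, average_utilization) for a traffic profile.

    Licenses must cover the busiest interval, so the port count is the peak;
    utilization is how much of that licensed capacity is used on average.
    """
    peak = max(concurrent_calls)
    average = sum(concurrent_calls) / len(concurrent_calls)
    return peak, average / peak


# Hypothetical concurrent-call counts sampled across one business day.
sample_day = [2, 2, 5, 20, 40, 20, 5, 2]
ports, utilization = port_utilization(sample_day)
```

With this made-up profile you would license 40 ports to cover the midday spike, yet on average only 30% of that capacity is in use.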




Voicegain, a modern speech-to-text platform for voice interactions

At Voicegain, we provide a modern DNN-based speech recognizer that:

  1. Interfaces with an audio stream delivered using RTP/SIP

  2. Provides a modern RESTful API for applications; application logic may be written in a modern programming language (like Python or Node.js)

  3. Is deployable at the edge or invokable as a cloud service

  4. Is fully feature-compatible with legacy VoiceXML platforms (supports SRGS grammars and universals)

  5. Recognizes a customer’s utterance even if it is not in the grammar

  6. Allows developers to customize the underlying acoustic and language models and access deep underlying features like multiple

We are inviting enterprise web developers for a free trial of our platform.


The path forward

Today, users expect their IVRs to be conversational and to perform on par with assistants like Alexa and Google Now.


That being said, organizations have already made significant investments in the design and development of these applications. Ideally, they want a way to preserve those design investments while modernizing the underlying platforms.


Voicegain provides web developers with three flexible options to upgrade their legacy IVR:

  1. Keep the VoiceXML application as is, but replace only the ASR with our DNN-based speech recognizer.

  2. Retain the design of your IVR application, but rewrite the logic in a programming language of your choice while using Voicegain APIs to replace both the VoiceXML platform and the ASR.

  3. If you have already developed a text-based chatbot using modern NLU-based software, you can use Voicegain to voice-enable your chatbot. Essentially, Voicegain can serve as the mouth and ears of your chatbot.
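
To give a feel for the second option, here is a minimal sketch of an existing call-flow design carried over into plain Python. It deliberately does not use Voicegain's API (or any other): it assumes only that some speech-to-text service hands your code the recognized text for each caller turn. The call flow, the `Turn` type and `next_turn` are hypothetical.

```python
import re
from dataclasses import dataclass

# The call-flow design, kept as data: each state has a prompt and keyword routes.
CALL_FLOW = {
    "main_menu": {
        "prompt": "Would you like billing or support?",
        "routes": {"billing": "billing", "support": "agent"},
    },
    "billing": {
        "prompt": "Say balance to hear your balance, or agent to speak to someone.",
        "routes": {"balance": "read_balance", "agent": "agent"},
    },
    "read_balance": {"prompt": "We will text you your balance. Goodbye.", "routes": {}},
    "agent": {"prompt": "Connecting you to an agent.", "routes": {}},
}


@dataclass
class Turn:
    state: str   # state the call is in after this turn
    prompt: str  # what to say to the caller next


def next_turn(state, recognized_text):
    """Pick the next state from the text returned by the speech recognizer."""
    words = re.findall(r"[a-z']+", recognized_text.lower())
    for keyword, target in CALL_FLOW[state]["routes"].items():
        if keyword in words:
            return Turn(target, CALL_FLOW[target]["prompt"])
    # No route matched: stay in the same state and re-prompt.
    return Turn(state, "Sorry, I did not catch that. " + CALL_FLOW[state]["prompt"])


turn = next_turn("main_menu", "I have a question about billing, please")
```

Because the recognizer returns free-form text instead of a grammar match, the routing can start as simple keyword spotting like this and later be swapped for an NLU model without touching the call-flow table.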
