Building Voice Bots: Should you always use an NLU engine?

Businesses of all sizes are looking to develop Voicebots to automate customer service calls or voice-based sales interactions. These bots may be voice versions of existing Chatbots or exclusively voice-based bots. While Chatbots automate routine transactions over the web, many users like being able to use voice (via an app or a phone) when it is convenient.

A voice bot dialog consists of multiple interactions where a single interaction typically involves 3 steps:

  1. A caller/customer's spoken utterance is converted into text
  2. Intent is extracted from the transcribed text
  3. Next step of the conversation is determined based on the intent extracted and the current state/context of the conversation.
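The three steps above can be sketched as a single function that handles one interaction. This is only a minimal illustration: `run_turn`, `DialogState`, and the three stub functions are hypothetical names, and the stubs stand in for a real speech-to-text engine, intent extractor, and dialog manager.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DialogState:
    """Conversation context carried from one interaction to the next."""
    step: str = "start"
    slots: dict = field(default_factory=dict)


def run_turn(audio: bytes,
             state: DialogState,
             transcribe: Callable[[bytes], str],
             extract_intent: Callable[[str], str],
             next_step: Callable[[str, DialogState], str]) -> str:
    """Run one interaction and return the bot's next prompt."""
    text = transcribe(audio)           # Step 1: speech to text
    intent = extract_intent(text)      # Step 2: text to intent
    return next_step(intent, state)    # Step 3: intent + context to next step


# Stubs for illustration only (assumptions, not real engines).
def fake_transcribe(audio: bytes) -> str:
    # Pretend the audio bytes already contain the spoken words.
    return audio.decode("utf-8")


def keyword_intent(text: str) -> str:
    return "check_balance" if "balance" in text.lower() else "unknown"


def simple_flow(intent: str, state: DialogState) -> str:
    if intent == "check_balance":
        state.step = "ask_account"
        return "Which account would you like the balance for?"
    return "Sorry, I did not get that. How can I help?"
```

Passing the three steps in as functions keeps the turn logic the same whether Step 2 is backed by an NLU engine or by a grammar.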

For the first step, developers use a Speech-to-Text platform to transcribe the spoken utterance into text. Automatic Speech Recognition (ASR) is another term for the same type of software.

When it comes to extracting intent from the customer utterance, developers typically use an NLU (Natural Language Understanding) engine. This is understandable because they would like to reuse the dialog flow or conversation turns programmed in their Chatbot app for their Voicebot.

A second option is to use Speech Grammars which match the spoken utterance and assign meaning (intent) to it. This option is not in vogue these days but Speech Grammars have been successfully used in telephony IVR systems that supported speech interaction using ASR.

This article explores both approaches to building Voicebots.

The NLU approach

Most developers today use the NLU approach as the default option for Steps 2 and 3. Popular NLU engines include Google Dialogflow, Microsoft LUIS, and Amazon Lex, and increasingly open source frameworks like Rasa.

An NLU Engine helps developers configure different intents that match training phrases, specify input and output contexts that are associated with these intents, and define actions that drive the conversation turns. This method of development is very powerful and expressive. It allows you to build bots that are truly conversational. If you use NLU to build a Chatbot, you can generally reuse its application logic for a Voicebot.
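To make the terms concrete, here is a toy sketch of intents with training phrases and input/output contexts. It is not how any of the engines named above work internally: real NLU engines use trained statistical models, while this stand-in scores word overlap. All names (`Intent`, `match_intent`, the pizza intents, the 0.5 threshold) are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Intent:
    name: str
    training_phrases: List[str]
    input_context: Optional[str] = None    # context that must be active
    output_context: Optional[str] = None   # context activated after a match


def tokens(text: str) -> set:
    return set(text.lower().replace("?", "").split())


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap; a crude stand-in for a trained model."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def match_intent(text: str, active_context: Optional[str],
                 intents: List[Intent], threshold: float = 0.5) -> Optional[Intent]:
    best, best_score = None, 0.0
    for intent in intents:
        # An intent with an input context is only eligible while it is active.
        if intent.input_context is not None and intent.input_context != active_context:
            continue
        score = max(similarity(text, p) for p in intent.training_phrases)
        if score > best_score:
            best, best_score = intent, score
    return best if best_score >= threshold else None


INTENTS = [
    Intent("order_pizza", ["i want to order a pizza", "order pizza"],
           output_context="awaiting_size"),
    Intent("choose_size", ["large", "small please", "a large one"],
           input_context="awaiting_size"),
]
```

With these definitions, "large" only matches `choose_size` after `order_pizza` has activated the `awaiting_size` context, which is the kind of turn-to-turn dependency that contexts are meant to express.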

But it has a significant drawback: you need to hire highly skilled natural language developers. Designing new intents and handling input and output contexts, entities, etc. is not easy. Because it requires skilled developers, building a bot with NLU is expensive, and maintaining it is costly too. For example, if you want to add new skills to the bot that are beyond its initial set of capabilities, modifying the contexts is not an easy process.

Net-net, the NLU approach is a really good fit if (a) you want to develop a sophisticated bot that can support a truly conversational experience, (b) you are able to hire and engage skilled NLP developers, and (c) you have an adequate budget to develop such bots.

The Speech Grammar approach

One approach that was used in the past and seems to have been forgotten these days is the use of Speech Grammars. Grammars have been used extensively to build traditional telephony-based speech IVRs for over 20 years, but most NLP and web developers are not aware of them.

A Speech Grammar provides either a list of all utterances that can be recognized or, more commonly, a set of rules that can generate the utterances that can be recognized. Such a grammar combines two functions:

  1. it provides a language model that guides the speech-to-text engine in evaluating the hypotheses, and
  2. it can attach semantic meaning to the recognized text utterances.  

The second function is achieved by attaching tags to the rules in the grammars. Some tag formats support complex expressions, which can be evaluated across grammars that have many nested rules. These tags allow the developer to essentially code intent extraction right into the grammar.
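The idea of rules that generate utterances, with tags carrying the meaning, can be illustrated with a toy grammar written as plain Python data. This is a simplified sketch and not a real grammar format: the rule names, phrases, and tag keys are invented, and it covers only the second function (attaching meaning). Inside a real recognizer the grammar would also act as the language model.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# A rule is a list of alternatives; each alternative is (words, tags).
Rule = List[Tuple[str, Dict[str, str]]]

POLITE: Rule = [("", {}), ("please", {}), ("i would like to", {})]
ACTION: Rule = [
    ("check the balance of", {"intent": "check_balance"}),
    ("show the balance for", {"intent": "check_balance"}),
    ("close", {"intent": "close_account"}),
]
ACCOUNT: Rule = [
    ("my checking account", {"account": "checking"}),
    ("my savings account", {"account": "savings"}),
]


def expand(rules: List[Rule]) -> Dict[str, Dict[str, str]]:
    """Generate every utterance the rule sequence allows, with merged tags."""
    table = {}
    for combo in product(*rules):
        words = " ".join(w for w, _ in combo if w)
        tags: Dict[str, str] = {}
        for _, t in combo:
            tags.update(t)
        table[words] = tags
    return table


UTTERANCES = expand([POLITE, ACTION, ACCOUNT])


def interpret(text: str) -> Optional[Dict[str, str]]:
    """Return the tags for an in-grammar utterance, or None if out of grammar."""
    return UTTERANCES.get(" ".join(text.lower().split()))
```

Three small rules already generate 18 utterances, and every one of them arrives with its intent and account filled in, so no separate intent-extraction step is needed.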

Step 3, the dialog/conversation flow management, can then be implemented in any backend programming language, such as Java, Python or Node.js. Developers who are on a budget and are looking to build a simple bot with just a few intents should strongly consider grammars as an alternative to NLU.
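For a bot with just a few intents, the flow management in Step 3 can be as small as a lookup table keyed by the current state and the recognized intent. The sketch below is one possible shape for it in Python; the states, intents, and prompts are hypothetical.

```python
from typing import Dict, Tuple

# (current state, intent) -> (next state, prompt to play)
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("start", "check_balance"): ("ask_account", "Which account?"),
    ("start", "close_account"): ("confirm_close", "Are you sure you want to close it?"),
    ("ask_account", "account_given"): ("done", "Here is your balance."),
    ("confirm_close", "yes"): ("done", "Your account has been closed."),
    ("confirm_close", "no"): ("start", "Okay. What else can I do for you?"),
}

FALLBACK_PROMPT = "Sorry, I did not understand. Please try again."


def advance(state: str, intent: str) -> Tuple[str, str]:
    """Return (next_state, prompt); stay in place on an unexpected intent."""
    return TRANSITIONS.get((state, intent), (state, FALLBACK_PROMPT))
```

Adding a new skill to such a bot means adding rows to the table and the matching rules to the grammar.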

NLU and Speech Grammar compared

Advantages of NLU
  • NLU can be applied to text that has been written as well as text coming from a speech-to-text engine. In principle, this allows the same application logic to be used for both a Chatbot and a Voicebot. Speech Grammars are not good at ignoring input text that does not match the grammar rules. This makes Speech Grammars not directly applicable to Chatbots, though ways have been devised to allow Speech Grammars to do "fuzzy matching".
  • A well-trained NLU engine can capture correct intents in more complex situations than a Speech Grammar. Note, however, that some NLU techniques could be used to automatically generate grammars with tags that come close to matching NLU performance.
Advantages of Grammars
  • NLU intent recognition may suffer if the speech-to-text conversion is not 100% correct. We have seen reports of combined Speech-to-Text+NLU accuracy being very low (down to just 70%) in some use cases. Speech Grammars, on the other hand, are used as a language model while evaluating speech hypotheses. This allows the recognizer to still deliver correct intents even when the spoken phrase does not match the grammar exactly - the recognition result will have lower confidence but will still be usable.
  • Speech grammars are simple to build and use. Also, there is no need to integrate an NLU system with a Speech-to-Text system. All the work can be performed by the Speech-to-Text engine.
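One common way to make use of a lower-confidence grammar result is to confirm it with the caller instead of discarding it. The sketch below assumes the recognizer returns an intent together with a confidence between 0 and 1; the function name and the two thresholds are illustrative assumptions that would need tuning for a real application.

```python
from typing import Optional, Tuple


def decide(intent: str, confidence: float,
           accept_at: float = 0.8,
           confirm_at: float = 0.5) -> Tuple[str, Optional[str]]:
    """Map a recognition result to an action for the dialog flow."""
    if confidence >= accept_at:
        return ("accept", intent)      # act on the intent directly
    if confidence >= confirm_at:
        return ("confirm", intent)     # usable, but ask "Did you say ...?"
    return ("reprompt", None)          # too uncertain; ask the caller again
```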

Our Recommendation

Voicegain is one of the few Speech-to-Text or ASR engines that supports both approaches.

Developers can easily integrate Voicegain's large vocabulary speech-to-text (Transcribe API) with any popular NLU engine. One advantage that we have here is the ability to output multiple hypotheses when using the word-tree output mode. This allows NLU intent matching to be run on each of the different speech hypotheses, with the goal of determining whether there is an NLU consensus in spite of differing speech-to-text output. This approach can deliver higher accuracy.
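The consensus idea can be sketched as a simple vote over the intents returned for each hypothesis. The example below treats the hypotheses as a plain list of strings and does not model the actual word-tree output format; `consensus_intent`, the keyword classifier standing in for an NLU engine, and the 0.6 threshold are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional


def consensus_intent(hypotheses: List[str],
                     classify: Callable[[str], Optional[str]],
                     min_share: float = 0.6) -> Optional[str]:
    """Return the intent that enough hypotheses agree on, else None."""
    if not hypotheses:
        return None
    votes = Counter(classify(h) for h in hypotheses)
    votes.pop(None, None)              # hypotheses with no intent do not vote
    if not votes:
        return None
    intent, count = votes.most_common(1)[0]
    return intent if count / len(hypotheses) >= min_share else None


def keyword_classifier(text: str) -> Optional[str]:
    """Stand-in for a call to an NLU engine."""
    words = text.lower().split()
    if "bill" in words:
        return "pay_bill"
    if "balance" in words:
        return "check_balance"
    return None
```

For example, if the hypotheses are "pay my bill", "pay my bell", and "pay the bill", two of the three map to `pay_bill`, so the intent survives the misrecognized word in the second hypothesis.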

We also provide our Recognize API and RTC Callback APIs; both of these support speech grammars. Developers may code the application flow/dialog of the Voicebot in any backend programming language, such as Java, Python or Node.js. We have extensive support for telephony protocols like SIP/RTP, and we support WebRTC.

Most other STT engines - including Microsoft, Amazon and Google - do not support grammars. This may have something to do with the fact that they are also trying to promote their NLU engines for chatbot applications.

If you are building a Voicebot and you'd like to have a discussion on which approach suits you, do not hesitate to get in touch with us. You can email us at
