Jacek Jarmulak

Combining grammar-based and large vocabulary speech recognition

Updated: Apr 15

In this blog post we present a unique feature of the Voicegain speech-to-text platform: it combines grammars with large vocabulary models in a single recognition session, giving developers an efficient and convenient way to achieve high recognition accuracy.


Two Types of Speech Recognition


Speech recognition (ASR) systems generally can be divided into two types:


Large Vocabulary Continuous Speech Recognition.

This type of recognizer is generally used for transcription, where the vocabulary is very broad and the length of the speech audio is unlimited (except for practical limits, e.g. resource-related ones). Typical components and processing steps of such a system are illustrated below:

The working of such a system is as follows: (a) The audio signal is processed into features. (b) The features are fed into an acoustic model processor. The processor converts data from the acoustic realm to a text/linguistic realm or to some other intermediate realm (e.g. audio embeddings). The output values may be phonemes, letters, word pieces, audio embeddings, etc., presented as vectors of probabilities. (c) These vectors are then passed to a search/optimization component. Search uses the language model to decide which hypotheses formed from the output of the previous stage are most likely to be the correct textual interpretation of the input speech audio.


The language models used may take a variety of forms. Two of the many possibilities are: (a) ARPA language models, which are n-gram based, and (b) neural network language models, where a neural network (e.g., an RNN) is trained to represent a language model. Some language models can also incorporate a decoder part if the acoustic model output is encoded (e.g. if it is represented by acoustic embeddings).


Because the vocabulary of this type of recognizer is large, it is prone to misrecognitions. This is particularly the case for short utterances that do not provide enough context for the language model to sufficiently constrain the hypotheses. An example would be misrecognizing “card” as “car” if that is the only word that is said and the speaker has a specific accent.
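The effect of missing context can be illustrated with a toy search step. The sketch below is not how any production recognizer is implemented; the acoustic probabilities and the bigram language model values are made-up numbers, chosen only to show how the language model tips the decision when context is available.

```python
import math

# Hypothetical acoustic probabilities for an accented speaker saying "card":
# the final "d" is weak, so "car" scores slightly higher acoustically.
ACOUSTIC = {"car": 0.55, "card": 0.45}

# Hypothetical bigram language model P(word | previous word).
# "<s>" marks the start of the utterance, i.e. no context at all.
BIGRAM = {
    ("<s>", "car"): 0.5,
    ("<s>", "card"): 0.5,
    ("credit", "car"): 0.01,
    ("credit", "card"): 0.99,
}


def best_word(previous_word):
    """Pick the word with the highest combined acoustic + language model score."""
    scores = {
        word: math.log(p_acoustic) + math.log(BIGRAM[(previous_word, word)])
        for word, p_acoustic in ACOUSTIC.items()
    }
    return max(scores, key=scores.get)


print(best_word("<s>"))     # car  (no context: the acoustic score decides)
print(best_word("credit"))  # card (context: the language model corrects it)
```

With a one-word utterance the language model has nothing to work with, so the slightly better acoustic match wins even though it is wrong.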


Cloud speech-to-text offerings from the big cloud providers (Google, Amazon, and Microsoft) are all examples of Large Vocabulary ASRs.


Grammar-Based Speech Recognition.

In such a system, the Voice Bot/IVR developer uses a context-free grammar to define the set of possible utterances that can be recognized. The grammars are typically defined using the SRGS (Speech Recognition Grammar Specification) standard, in either ABNF or GRXML form. Other grammar formats in use are JSGF (JSpeech Grammar Format) and GSL (the Nuance Grammar Specification Language).


Components and processing steps of a typical speech recognition system that uses such grammars are illustrated below:

In this system the evaluation of the output from acoustic model processing is done by a search/optimizer that uses the rules contained in the grammar to decide which hypotheses are acceptable. Only the utterances that can be generated from the grammar may be output.


If an utterance outside of the grammar is spoken and presented to the recognizer, it may still be recognized, but with low confidence. If the confidence is below a set threshold, a NOMATCH will be returned.
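The thresholding behaviour can be sketched in a few lines. This is a simplified illustration, assuming the search has already produced one confidence value per in-grammar utterance; the function name, the score values, and the default threshold are hypothetical.

```python
NOMATCH = "NOMATCH"


def recognize(grammar_scores, confidence_threshold=0.5):
    """Return the best in-grammar utterance, or NOMATCH if confidence is too low.

    grammar_scores maps each utterance the grammar can generate to the
    confidence assigned by the search (hypothetical values in this sketch).
    """
    if not grammar_scores:
        return NOMATCH
    best = max(grammar_scores, key=grammar_scores.get)
    if grammar_scores[best] < confidence_threshold:
        return NOMATCH
    return best


# The caller said "card": one grammar path matches with high confidence.
print(recognize({"card": 0.92, "cash": 0.05}))

# The caller said something out of grammar: the closest path scores low.
print(recognize({"card": 0.31, "cash": 0.12}))
```

Lowering the threshold accepts more out-of-grammar speech as (possibly wrong) matches; raising it returns NOMATCH more often for valid but poorly pronounced utterances.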


The obvious disadvantage of using such a recognizer is that it will not recognize utterances outside the scope of grammar. Such utterances are called Out-of-Grammar utterances. However, a big advantage with this approach is that it will be less prone to misrecognition when an utterance that is spoken has been anticipated and is included in the grammar.


An additional advantage of using a grammar-based recognizer is that most grammars allow for insertion of semantic tags, which allow the grammar to not only define an utterance but also the semantic interpretation of that utterance.
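As a rough illustration of what semantic tags buy the application, the sketch below models a grammar as a mapping from utterances to tags. The phrases and the lookup function are hypothetical; a real grammar would express the same idea with rules and tag annotations rather than a flat table.

```python
# Hypothetical grammar: several phrasings share one semantic interpretation.
GRAMMAR = {
    "card": "card",
    "credit card": "card",
    "by card": "card",
    "cash": "cash",
    "in cash": "cash",
}


def interpret(utterance):
    """Return the semantic tag for an in-grammar utterance, else None."""
    return GRAMMAR.get(utterance.strip().lower())


print(interpret("Credit card"))  # card
print(interpret("in cash"))      # cash
print(interpret("bitcoin"))      # None (out of grammar)
```

The application only needs to branch on the tag, not on every wording the caller might use.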


Examples of such grammar-based speech recognition systems would be speech-to-text offerings like Nuance ASR or Lumenvox ASR.


Combining grammar-based and large vocabulary recognition


Clearly, both types of speech recognition systems have advantages and disadvantages. It is therefore natural to expect that a combination of the two could offer the advantages of both while avoiding some of the disadvantages.


Approach using a combination of existing ASRs


A simple approach would be to combine two different speech recognition systems. One would need to create two speech recognition sessions and split the incoming audio stream so that each session is fed a copy of incoming audio. Those two sessions would process the audio separately and would output separate results that would then need to be combined. This is illustrated below:


Disadvantages of using two ASR sessions


The setup as presented above has several disadvantages:

  1. It introduces complexity in the streaming of the audio to the recognizer. An additional proxy-like component needs to be added that splits the audio stream and feeds it to two separate ASR systems.

  2. Combining the results also requires a new, separate component. This is not necessarily trivial, because the two disconnected ASR systems end-point differently, so their results will arrive at different times.

  3. Extra compute resources will be needed to support running two separate ASR systems instead of just one.

  4. Another disadvantage is having to pay double the license fee as each ASR will have to have a separate session license.


Voicegain approach


The Voicegain platform provides a speech recognition system that combines both types of speech recognition to benefit from the advantages of both. Our system is illustrated in the figure below:

In this system the processing up to the output of the acoustic model is essentially identical to the processing done in the systems depicted in the first two figures of this post. After that step, however, Voicegain includes a novel Search/Optimization module that uses both the grammar and the large vocabulary language model to generate the final recognition results. End-pointing is performed in a way that is similar to a grammar-based recognizer, as that seems to make the most sense given the use case (but this can be modified). The final recognition result will comprise n-best results from the grammar-based recognition, if the grammar did MATCH, and one or more hypotheses from the large vocabulary recognition.


The application developer can decide how to use the recognition result. For example, the confidence value may be used to determine whether the grammar-based result or the large vocabulary result should be used at a given point in the application.
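One possible decision rule is sketched below. The shape of the `result` dictionary (the `grammar` and `transcribe` keys and the `confidence`, `interpretation`, and `utterance` fields) is an assumption made for this illustration and is not the actual Voicegain response schema; the threshold value is likewise arbitrary.

```python
def choose_result(result, min_grammar_confidence=0.6):
    """Prefer a confident grammar match; otherwise fall back to the transcript.

    `result` is a hypothetical, simplified combined recognition result.
    Returns a (source, value) tuple.
    """
    grammar_alts = result.get("grammar", [])
    if grammar_alts:
        best = max(grammar_alts, key=lambda alt: alt["confidence"])
        if best["confidence"] >= min_grammar_confidence:
            return ("grammar", best["interpretation"])

    transcripts = result.get("transcribe", [])
    if transcripts:
        best = max(transcripts, key=lambda alt: alt["confidence"])
        return ("transcribe", best["utterance"])

    return ("none", None)


confident = {
    "grammar": [{"interpretation": "card", "confidence": 0.91}],
    "transcribe": [{"utterance": "card", "confidence": 0.70}],
}
out_of_grammar = {
    "grammar": [],
    "transcribe": [{"utterance": "can I pay with a check", "confidence": 0.82}],
}
print(choose_result(confident))
print(choose_result(out_of_grammar))
```

A dialog could, for instance, route the large vocabulary fallback to an NLU component instead of immediately re-prompting the caller.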


With Voicegain’s release 1.22.0, this feature is Generally Available as part of our Recognize API.


An example request using our /asr/recognize/async API looks like this:


{
    "sessions": [{
        "asyncMode": "REAL-TIME",
        "websocket": {
            "additionalEvents": ["INPUT-STARTED"]
        }
    }],
    "audio": {
        "source": {
            "stream": {
                "protocol": "WEBSOCKET"
            }
        },
        "format": "L16",
        "rate": 8000,
        "channels": "mono",
        "capture": false
    },
    "settings": {
        "asr": {
            "confidenceThreshold": 0.0001,
            "noInputTimeout": 6000,
            "completeTimeout": 1000,
            "incompleteTimeout": 2000,
            "sensitivity": 0.0,
            "grammars": [{
                "type": "JJSGF",
                "parameters": {
                    "tag-format": "semantics/1.0-literals"
                },
                "grammar": "pizza",
                "public": {
                    "root": "<card> {card} | <cash> {cash} | <pickup> {pickup} | <delivery> {delivery}"
                },
                "rules": {
                    "<card>": "card",
                    "<cash>": "cash",
                    "<pickup>": "pickup",
                    "<delivery>": "delivery"
                }
            }, {
                "type": "BUILT-IN",
                "name": "transcribe"
            }]
        }
    }
}

As you can see, there is just one definition for the incoming audio stream. The grammars section of settings.asr contains two grammar definitions:

  • one is a standard JSGF grammar with literal tag format semantics,

  • the other is actually not a grammar but a command to turn on large vocabulary transcription for this session {type:BUILT-IN, name:transcribe}
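For grammars that are simple word lists, the grammars array can be generated rather than written by hand. The sketch below only builds the JSON structure shown in the request above; the helper names are hypothetical, it assumes every word is a single token usable as a rule name, and it does not call the API.

```python
import json


def build_grammar(name, words):
    """Build a grammar entry with one rule and one literal tag per word."""
    root = " | ".join(f"<{w}> {{{w}}}" for w in words)
    return {
        "type": "JJSGF",
        "parameters": {"tag-format": "semantics/1.0-literals"},
        "grammar": name,
        "public": {"root": root},
        "rules": {f"<{w}>": w for w in words},
    }


def build_asr_settings(name, words):
    """Combine the word-list grammar with the built-in transcribe entry."""
    return {
        "asr": {
            "grammars": [
                build_grammar(name, words),
                {"type": "BUILT-IN", "name": "transcribe"},
            ]
        }
    }


settings = build_asr_settings("pizza", ["card", "cash", "pickup", "delivery"])
print(json.dumps(settings, indent=4))
```

The remaining request fields (sessions, audio, timeouts, and thresholds) would be added to the payload as in the example request.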

MRCP Use Case

In addition to being available in our STT API and Telephone Bot API, simultaneous grammar-based and large vocabulary recognition is supported via the MRCP interface. For example, from VXML you can pass both a GRXML grammar and the builtin:speech/transcribe grammar, and you will receive both a GRXML result and a large vocabulary result.

If you are building an Intelligent Voice Assistant, Voice Bot, Speech IVR Application, or any other application that could benefit from this feature, please contact us via email (info@voicegain.ai) to engage in a more in-depth discussion.
