4 ways in which the Voicegain product offering differs from other choices in the speech-to-text market.
Players in enterprise speech-to-text
The current enterprise speech-to-text market can be divided into 3 distinct groups of players. Note that we are focusing here on speech-to-text platforms rather than complete end-user products (so we do not include consumer products like Dragon NaturallySpeaking, etc.)
The old ASRs - for example Nuance (and every speech company that Nuance acquired over the years) and Lumenvox. These speech-to-text engines go back to the late 1990s and early 2000s. They were built using technology relying on Gaussian Mixture Models and Hidden Markov Models, and they require an on-premises install.
Established Cloud Speech-to-Text services - like Google, AWS, Microsoft Azure, and IBM. Some of these also began with recognizers built using Gaussian Mixture Models and Hidden Markov Models, but around 2012 started transitioning to recognizers using Deep Neural Network (DNN) models for speech recognition.
New players - these are new companies going back to about 2015, when Nvidia made it possible for pretty much anyone to train DNNs on its new GPUs. Many small companies arose that built their own speech-to-text engines, either from scratch or on open-source foundations. Now, 5 years later, many of them are entering the speech-to-text market with mature products that deliver high recognition accuracy.
Where does Voicegain fit here?
We consider ourselves one of the new players, as we started working on our own DNN-based speech-to-text engine at the end of 2016. However, we have been working with old-style ASRs since 2006 and as a result we knew their limitations very well. That is what motivated us to develop an ASR of our own.
We are also very familiar with employing ASRs in real-world, large-volume applications, so we know which features the users of ASRs want - whether they are developers who build the applications or the IT personnel who have to host and maintain them. All of this guided the decisions we made when developing our speech-to-text platform.
So how is Voicegain product different?
Below we list what we think are the 4 key differentiators of our speech-to-text platform compared to the competition. Note that the competitive field is pretty broad, so we consider a particular feature a differentiator if it is not a common feature in the market.
1) Edge Deployment
By Edge Deployment we mean a deployment on customer premises or close to the customer's edge router, where the deployment is fully orchestrated and managed from the Cloud (for more information see our blog post about the Benefits of Edge Deployment). The built-in orchestration and management makes it essentially different from the old ASRs, which were also deployed on-premises, but from RPM packages and the like, and required support contracts to deploy them successfully and to maintain them over time.
We think that Edge Deployment is critical for a speech-to-text platform that aims to replace the old ASRs in many of their applications.
2) Acoustic Model Customization
Over the years of working with ASRs, we noticed cases where the ASR would show consistently higher error rates. Usually this was related to IVR calls coming from customers in regions of the country with distinct accents.
In some of our use cases so far, the ability to customize acoustic models has allowed us to reduce the Word Error Rate (WER) very significantly (e.g. from 8% to 3%).
We are currently working on a rigorous experiment where we are customizing our model to support Irish English. We plan to report in detail on the results in April.
3) Targeted support for IVR
The Voicegain speech-to-text platform was developed specifically with IVR use cases in mind. Currently the platform supports the following 3 IVR use cases, and we are working on adding conversational NLU later this year.
a) ASR with support for legacy IVR Standards
In order to make our speech-to-text engine an attractive replacement for old ASRs, we implemented support for legacy standards like MRCP and GRXML. That support is not a mere add-on - simply tacking a Web API onto the back of an MRCP server - but is more integral: our core speech-to-text engine directly interprets a superset of MRCP protocol commands.
We also support GRXML and JSGF grammars - via MRCP, in IVR callbacks, and over Web API.
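As an illustration of the grammar formats mentioned above, here is a minimal JSGF grammar for a yes/no prompt - a generic example of the standard JSGF format, not one taken from the Voicegain documentation:

```jsgf
#JSGF V1.0;
grammar yesno;

// The public rule is the entry point for recognition.
public <answer> = <yes> | <no>;

<yes> = yes | yeah | correct | that is right;
<no>  = no | nope | that is wrong;
```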
When used with grammars, a big advantage of the Voicegain recognizer is that at its core it is a large-vocabulary recognizer. Grammars are used to constrain the recognized utterances to facilitate semantic mapping, but the recognizer can also recognize out-of-grammar utterances, which opens new possibilities for IVR tuning.
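To sketch what out-of-grammar handling can look like from the application side, consider the small Python example below. The result fields (`grammar_match`, `confidence`, `transcript`) are illustrative assumptions, not the actual Voicegain API:

```python
def interpret_result(result: dict) -> dict:
    """Decide how an IVR app might act on a recognition result.

    `result` is a hypothetical payload with:
      - 'grammar_match': semantic tags if an in-grammar utterance matched, else None
      - 'confidence':    confidence of the grammar match (0.0 - 1.0)
      - 'transcript':    large-vocabulary hypothesis, available even out-of-grammar
    """
    if result.get("grammar_match") and result.get("confidence", 0.0) >= 0.5:
        # Normal path: hand the semantic interpretation to the dialog logic.
        return {"action": "handle", "semantics": result["grammar_match"]}
    # Out-of-grammar: the large-vocabulary transcript is still available,
    # so log it for grammar tuning instead of discarding the utterance.
    return {"action": "reprompt", "oog_transcript": result.get("transcript")}
```

The point of the sketch is the second branch: with a grammar-only recognizer the out-of-grammar utterance is simply lost, while here the transcript can be collected to improve the grammar over time.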
b) Webhook IVR Support (without VXML)
Flow-based IVR systems have traditionally been built using two approaches: (i) having the dialog interactions interpreted on a VXML platform (a VXML browser), or (ii) using webhooks that invoke application logic running on standard web back-end platforms (examples of the latter are the offerings of Twilio, Plivo, or Tropo).
Our platform supports webhook-style IVRs. Incoming calls can be interfaced via standard telephony SIP/RTP, and the IVR dialog can be directed from any platform that implements webhooks (e.g. Node.js, Django).
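To make the webhook style concrete, below is a minimal Python sketch of a single dialog-turn handler. The event and command field names (`type`, `say`, `listen`, `transfer`) are hypothetical, chosen for illustration rather than taken from our API; in practice such a function would sit behind a route in any web framework (Flask, Django, etc.):

```python
def ivr_webhook(event: dict) -> dict:
    """Handle one dialog turn of a webhook-style IVR.

    Receives an event describing what happened on the call and returns
    a command telling the platform what to do next.  Both payload
    shapes are illustrative, not the actual Voicegain callback schema.
    """
    if event.get("type") == "call_started":
        # Greet the caller and start listening against a menu grammar.
        return {"say": "Welcome. Please say sales or support.",
                "listen": {"grammar": "menu"}}
    if event.get("type") == "recognition":
        choice = (event.get("utterance") or "").lower()
        if "sales" in choice:
            return {"transfer": "+15550100"}
        if "support" in choice:
            return {"transfer": "+15550101"}
        # Nothing matched: reprompt and listen again.
        return {"say": "Sorry, I did not get that.",
                "listen": {"grammar": "menu"}}
    # Any other event (e.g. caller hung up): end our side of the call.
    return {"hangup": True}
```

Because each turn is just an HTTP request/response cycle, the dialog logic lives entirely on the developer's own back end - no VXML browser is involved.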
c) Enabling IVRs that use chatbot back-end
Many companies have invested significant effort into building their own text-based chatbots rather than using products like Google Dialogflow. What the Voicegain platform provides is an easy way to deploy the existing chatbot logic on a telephony speech channel. This takes advantage of our platform's webhook IVR support and can feed real-time text (including multiple alternatives) to the chatbot platform. We also provide audio output either via TTS or prerecorded clips.
4) End-to-end support for Real-Time Continuous Speech-to-Text
Because IVR has always been our focus, we built our Acoustic Models to support low-latency, real-time speech-to-text (both continuous large-vocabulary recognition and recognition with context-free grammars). We also focused on convenient ways to stream audio into our speech-to-text platform and to consume the generated transcript.
One of our products is Live Transcribe, which allows for real-time transcription (with just a few seconds of delay) that is then broadcast over websockets and can be consumed on the web clients we provide. This opens the possibility of live speaker transcription, with use cases that include conferences, lectures, etc., making these events easier for hearing-impaired audience members to participate in.
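As a sketch of how a client might consume such a websocket transcript stream, the Python function below folds incremental messages into a display buffer. The JSON message format (`{"final": ..., "text": ...}`) is an assumption for illustration - interim hypotheses overwrite the last displayed line until a final result freezes it, which is the usual pattern for live captioning:

```python
import json

def apply_transcript_message(display: list, message: str) -> list:
    """Fold one incremental transcript message into the displayed lines.

    `message` is a JSON string with a hypothetical shape:
        {"final": bool, "text": str}
    Interim results replace the last (still-changing) line; once a
    final result arrives, the next message starts a new line.
    """
    msg = json.loads(message)
    entry = {"text": msg["text"], "final": msg["final"]}
    if display and not display[-1]["final"]:
        display[-1] = entry   # update the in-progress hypothesis
    else:
        display.append(entry)  # previous line was final; start a new one
    return display
```

In a real client, each `message` would arrive from the websocket connection (for example via a websocket client library) and the updated `display` list would be re-rendered on screen.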