Speech to Text Engine

Deep Neural Networks at the core of Voicegain Speech-to-Text technology

The Speech to Text engine has been built from the ground up by the Voicegain R&D team, with Deep Neural Networks as the core enabling technology. Our core team has been working with speech recognition technology for more than a decade, and we have automated over 500M calls for Fortune 500 enterprise customers. For the Voicegain Speech to Text Engine, we did not merely repackage existing academic or open-source projects. Rather, we used the latest DNN research to build a custom speech recognition pipeline, which gives us full control over all of its aspects.


Accuracy varies depending on the type of speech audio, but it is in line with offerings from the big 3 (Google, Amazon, Microsoft). We have published details of our offline recognizer benchmarks in this blog post. Real-time accuracy is lower by a few percent, again depending on the type of speech audio data.

Out-of-the-box accuracy for recognition using grammars is about the same as other commercial ASR engines. This accuracy can be significantly increased with custom acoustic models. It is worth noting that, as far as we know, ours is the only DNN ASR that uses grammars directly in the process of speech recognition, instead of overlaying grammars on top of a result from a large-vocabulary recognizer. This gives you a usable N-Best result.
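The grammars in question are standard context-free speech grammars. As an illustration only (consult the Voicegain documentation for the exact set of supported grammar features), a minimal W3C SRGS (GRXML) grammar for a yes/no prompt looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0"
         mode="voice" root="yesno">
  <!-- Single public rule matching a few yes/no variants -->
  <rule id="yesno" scope="public">
    <one-of>
      <item>yes</item>
      <item>yeah</item>
      <item>no</item>
      <item>nope</item>
    </one-of>
  </rule>
</grammar>
```

Because the recognizer consumes the grammar directly, each N-Best alternative corresponds to an actual path through rules such as the one above.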

You can easily test accuracy on your domain by signing up for our Free Tier. You will be able to upload your audio from a web browser and check the transcript accuracy.

Currently, we offer acoustic models for two languages: English and Spanish.

Voicegain platform supports customization of models used in Speech-to-Text

The Voicegain Speech to Text Engine uses two types of models:
1) an Acoustic Model and 2) a Language Model. To improve recognition accuracy for domain-specific words, unique speaker accents, or pronunciations, we provide APIs that help modify both types of models.

Language model customization requires a simple upload of domain-specific corpus and vocabulary files.

However, the biggest impact comes from customizing the Acoustic Model that is represented in the DNN. This requires accurately transcribed audio files. They do not have to be time-aligned, so the effort to prepare them is relatively low.

With custom Acoustic and Language Models, some of our customers have achieved accuracy of about 98% for real-time transcription. In other use cases, the word error rate (WER) was cut to three-fifths of its original value.
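To make the second claim concrete, here is a small sketch of the arithmetic; the 10% and 6% figures are hypothetical numbers chosen purely for illustration:

```python
def remaining_wer_fraction(baseline_wer_pct: float, custom_wer_pct: float) -> float:
    """Fraction of the baseline word error rate that remains after
    model customization (lower is better)."""
    return custom_wer_pct / baseline_wer_pct

# Hypothetical illustration: a baseline WER of 10% reduced to 6%
# leaves three-fifths (0.6) of the original error.
print(remaining_wer_fraction(10.0, 6.0))  # → 0.6
```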

Clients get exclusive access to their custom models. To incorporate a custom model as part of its platform, Voicegain may enter into a separate licensing agreement with the client.





Ease of use - Voicegain in the Cloud

Voicegain Cloud is the implementation of the Voicegain platform in the cloud. Clients can use our Speech-to-Text Web APIs, MRCP-based IVR (including tuning tools), and live or offline transcription functionality on Voicegain Cloud.


Services in the cloud are available in two pricing tiers.

  • Premium: on-demand Speech-to-Text, best suited for critical real-time recognition

  • Economy: pre-emptible Speech-to-Text, suited for less critical use cases such as offline recognition

Full control - Voicegain Platform deployed at the Edge

Voicegain Edge is the implementation of the Voicegain platform on the client's infrastructure. By installing Voicegain at the edge, clients get full control over their data and data security, as well as a lower price.

Clients need to provide a dedicated Kubernetes cluster on servers with Nvidia CUDA GPUs, plus document and object storage. Alternatively, the Voicegain stack can be deployed into your AWS instances.

Voicegain apps are auto-deployed into the Kubernetes Cluster and managed using the Cloud Web Console.


Access the Voicegain Platform via a RESTful Web API

The Voicegain Web API supports both:

  • Large vocabulary transcription — real-time, semi-real-time, and offline

  • Speech recognition using context-free grammars (e.g. GRXML) 


Speech audio input can be submitted to Voicegain: (a) in-line with the web request, (b) via a URL from which Voicegain retrieves the audio, (c) streamed over a WebSocket, e.g. from a browser, or (d) streamed from a provided Java utility.
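A minimal sketch of option (a), in-line submission, assembling a JSON request body with the audio embedded as base64. The field names used here ("audio", "source", "inline", "data", "rate") are assumptions for illustration only, not the documented Voicegain schema; consult the actual API reference:

```python
import base64
import json

def build_inline_stt_request(audio_bytes: bytes, sample_rate_hz: int = 16000) -> str:
    """Serialize a JSON request body carrying the audio in-line as base64.

    The field names are hypothetical placeholders, not the documented
    Voicegain schema -- check the real API reference before use.
    """
    body = {
        "audio": {
            "source": {
                "inline": {"data": base64.b64encode(audio_bytes).decode("ascii")}
            },
            "rate": sample_rate_hz,
        }
    }
    return json.dumps(body)

# Stand-in bytes; in practice this would be the contents of a WAV file.
payload = build_inline_stt_request(b"RIFF....WAVE")
```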


Recognition results are made available via: (a) HTTP response, (b) polling, (c) webhook callback, (d) WebSockets, or (e) message queues.
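Option (b), polling, can be sketched as a generic loop. Here `fetch_result` is a stand-in for an HTTP GET against the recognition session's result URL (an assumption for illustration; the real endpoint and response shape are defined by the Voicegain API reference):

```python
import time

def poll_for_result(fetch_result, interval_s: float = 1.0, timeout_s: float = 60.0):
    """Call fetch_result() until it returns a non-None final result.

    fetch_result stands in for an HTTP GET of the session's result URL;
    it should return None while recognition is still in progress.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = fetch_result()
        if result is not None:
            return result
        time.sleep(interval_s)
    raise TimeoutError("recognition result not ready before timeout")
```

The same loop works for semi-real-time and offline jobs; only the timeout needs adjusting.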


Additional APIs are available for language model construction, acoustic model training, broadcast WebSockets, data upload, the IVR tuning tool, etc.

Manage Voicegain resources via a Web Portal

Voicegain Cloud Web Console is provided to facilitate the following tasks: 

  • Transcription of audio files, incl. transcript review

  • Managing broadcast WebSockets for real-time transcription

  • Management of Voicegain Edge Kubernetes clusters  

  • Configuration of the Speech to Text Engine  

  • Language Model and Pronunciation model customization 

  • Tuning & Regression testing tools 

  • User and Role management 

  • API documentation and support 

  • Billing 

  • Training of Custom Acoustic DNN models 

  • Metrics Dashboard powered by Grafana 

  • Application Logs


Upcoming features

  • Additional speech-to-text languages beyond English and Spanish

  • Acoustic Model training on the Edge, first via an API and later via a Web UI, enabling customers to do their own model training

  • Voicegain Edge Appliance