Speech to Text Engine
DEEP NEURAL NETWORKS
The Speech to Text engine has been built from the ground up by the Voicegain R&D team utilizing Deep Neural Networks as the core enabling technology. Our core team has been working with speech recognition technology for more than a decade, and we have automated over 500 million calls for Fortune 500 enterprise customers. For the Voicegain Speech to Text Engine, we did not merely repackage existing academic or open-source projects. Rather, we used the latest DNN research to build a custom speech recognition pipeline, which gives us full control over all of its aspects.
Accuracy varies depending on the type of speech audio but is in line with offerings from the big 3 (Google, Amazon, Microsoft). We have published details of our offline recognizer benchmarks in this blog post. Real-time accuracy is lower by a few percent, again depending on the type of speech audio data.
Out-of-the-box accuracy for recognition using grammars is about the same as other commercial ASR engines. This accuracy can be significantly increased with custom acoustic models. It is worth noting that, as far as we know, ours is the only DNN ASR that uses grammars directly in the process of speech recognition, instead of overlaying grammars on top of a result from a large-vocabulary recognizer. This gives you a usable N-best result.
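To illustrate why grammar-constrained N-best output is usable, consider the sketch below. The field names and values are illustrative assumptions, not the actual Voicegain response schema: the point is that every alternative is a valid grammar path, so lower-ranked entries can still drive application logic.

```python
# Illustrative sketch only: field names and values are assumptions,
# not the actual Voicegain API response schema.
n_best = [
    {"utterance": "pay my bill", "confidence": 0.94},
    {"utterance": "play my bill", "confidence": 0.41},
]

# Because the grammar constrains recognition directly, each alternative
# already matches a grammar path, so an application can safely fall back
# to lower-ranked entries, e.g. to offer a confirmation prompt.
best = max(n_best, key=lambda alt: alt["confidence"])
print(best["utterance"])  # -> pay my bill
```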
You can easily test accuracy on your domain by signing up for our Free Tier. You will be able to upload your audio from a web browser and check the transcript accuracy.
Currently, we offer acoustic models for two languages: English and Spanish.
The Voicegain Speech to Text Engine uses two types of models: 1) an Acoustic Model and 2) a Language Model. To improve recognition accuracy for domain-specific words, unique speaker accents, or pronunciations, we provide APIs that help modify both types of models.
Language model customization requires a simple upload of domain-specific corpus and vocabulary files.
However, the biggest impact comes from customizing the Acoustic Model, which is represented in the DNN. This requires accurately transcribed audio files. They do not have to be time-aligned, so the effort to prepare them is relatively low.
With custom Acoustic and Language Models, some of our customers have achieved accuracy of about 98% for real-time transcription. In other use cases, the Word Error Rate (WER) was cut to 3/5ths of its original value.
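As a refresher on the metric behind these numbers, word error rate is the word-level edit distance between a hypothesis transcript and a reference, divided by the reference length. The sketch below implements that standard definition and shows what a cut to 3/5ths of a baseline WER means numerically; the figures are made up for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> WER of 25%.
example = wer("please pay my bill", "please play my bill")

# E.g. a baseline WER of 10% cut to 3/5ths becomes 6% (illustrative).
baseline = 0.10
improved = baseline * 3 / 5
```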
Clients get exclusive access to their custom models. To incorporate a custom model as part of its platform, Voicegain may enter into a separate licensing agreement with the client.
Voicegain Cloud is the implementation of the Voicegain platform in the cloud. Clients can use our Speech-to-Text Web APIs, MRCP-based IVR (including tuning tools), and live or offline transcription functionality on Voicegain Cloud.
Services in the cloud are available in two pricing tiers.
Premium On Demand Speech-to-Text: best suited for critical real-time recognition
Economy Pre-emptible Speech-to-Text: suited for less critical use cases such as offline recognition
Voicegain Edge is the implementation of the Voicegain platform on the client's infrastructure. By installing Voicegain on the edge, clients get full control of their data and data security, as well as a lower price.
Clients need to provide a dedicated Kubernetes cluster on servers with Nvidia CUDA GPUs, plus document and object storage. Alternatively, the Voicegain stack can be deployed into your AWS instances.
Voicegain apps are auto-deployed into the Kubernetes cluster and managed using the Cloud Web Console.
RESTFUL WEB API
The Voicegain Web API supports both:
Large vocabulary transcription — real-time, semi-real-time, and offline
Speech recognition using context-free grammars (e.g. GRXML)
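GRXML grammars follow the W3C SRGS specification and are plain XML, so they are straightforward to generate and validate programmatically. Below is a minimal yes/no grammar being checked with Python's standard XML parser; the grammar content is an illustrative example, not taken from the Voicegain documentation.

```python
import xml.etree.ElementTree as ET

# A minimal SRGS/GRXML grammar (illustrative example, not from
# the Voicegain docs). The namespace is the standard SRGS 1.0 one.
grxml = """<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="yesno">
  <rule id="yesno" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>"""

# Parse and sanity-check before submitting the grammar anywhere.
# (Encode first: ElementTree rejects str input that carries an
# XML encoding declaration.)
root = ET.fromstring(grxml.encode("utf-8"))
ns = "{http://www.w3.org/2001/06/grammar}"
rules = [r.get("id") for r in root.iter(f"{ns}rule")]
print(rules)  # -> ['yesno']
```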
Speech audio input can be submitted to Voicegain: (a) in-line with the web request, (b) via a URL that Voicegain can retrieve from, (c) streamed via web-socket, e.g. from a browser, or (d) streamed from a provided Java utility.
Recognition results are made available via: (a) HTTP response, (b) polling, (c) webhook callback, (d) web-sockets, or (e) message queues.
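The polling pattern in (b) can be sketched as follows. Here `get_status` stands in for a real HTTP call to a session-status endpoint, and the `{"status": ..., "transcript": ...}` shape is an assumption for illustration, not the actual Voicegain response schema; a simulated backend takes the place of the network call.

```python
import time

def poll_for_result(get_status, interval_s=0.01, timeout_s=5.0):
    """Poll a status function until the transcript is ready.

    `get_status` is a stand-in for an HTTP GET against a session
    status endpoint; the response shape used here is an illustrative
    assumption, not the actual Voicegain API schema.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = get_status()
        if resp["status"] == "done":
            return resp["transcript"]
        time.sleep(interval_s)
    raise TimeoutError("transcription did not finish in time")

# Simulated backend: reports "done" on the third poll.
calls = {"n": 0}
def fake_status():
    calls["n"] += 1
    if calls["n"] < 3:
        return {"status": "processing", "transcript": None}
    return {"status": "done", "transcript": "hello world"}

result = poll_for_result(fake_status)
print(result)  # -> hello world
```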
Additional APIs are available for language model construction, acoustic model training, broadcast web-sockets, data upload, the IVR tuning tool, etc.
Voicegain Cloud Web Console is provided to facilitate the following tasks:
Transcription of audio files, including transcript review
Managing broadcast web-sockets for real-time transcription
Management of Voicegain Edge Kubernetes clusters
Configuration of the Speech to Text Engine
Language Model and Pronunciation model customization
Tuning & Regression testing tools
User and Role management
API documentation and support
Training of Custom Acoustic DNN models
Metrics Dashboard powered by Grafana
Additional speech-to-text languages beyond English and Spanish
Acoustic Model Training on Edge using an API and, later, a Web UI - customers will be able to do their own model training
Voicegain Edge Appliance