Speech-to-Text (STT) APIs enable developers to embed automatic transcription into any voice-enabled app. Our APIs are built on top of highly accurate, trainable deep learning ASR models, and we support both batch and streaming use cases.
Invoke our STT APIs using our highly scalable cloud service, or deploy a containerized version of Voicegain in your VPC or datacenter. Our APIs can convert audio/video files in batch, or real-time media streams, into text, and we support 40+ audio formats.
On a broad benchmark, our accuracy of 89% is on par with the very best.
Talk to us in English, Spanish, German, Portuguese, or Korean (more coming).
Tested on compute instances on Google Cloud, AWS, Azure, IBM Cloud & Oracle Cloud.
Integrates with Twilio, Genesys, FreeSWITCH, and other CCaaS and CPaaS platforms.
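To make the batch API concrete, here is a minimal sketch of a transcription request in Python. The endpoint path, request fields, and response shape are illustrative assumptions rather than the exact Voicegain contract; the actual parameters are documented in our API reference.

```python
# A minimal sketch of a batch transcription request against a REST STT API.
# NOTE: the endpoint path, field names, and response shape below are assumptions
# for illustration -- consult the Voicegain API reference for the real contract.
import base64
import requests

API_ROOT = "https://api.voicegain.ai/v1"  # assumed base URL
TOKEN = "YOUR_API_TOKEN"                  # issued via the Voicegain Console

with open("call-recording.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"{API_ROOT}/asr/transcribe",         # hypothetical endpoint path
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"audio": {"source": {"inline": {"data": audio_b64}}}},  # assumed field names
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # transcript plus metadata; exact shape depends on the real API
```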
Today we are really excited to announce the launch of Voicegain Whisper, an optimized version of OpenAI's Whisper speech recognition/ASR model that can be accessed using Voicegain APIs. The same APIs currently process over 60 million minutes of audio every month for leading enterprises in the US, including Samsung, Aetna, and several Fortune 100 enterprises. Generative AI developers now have access to a well-tested, accurate, affordable, and accessible transcription API. They can integrate the Voicegain Whisper APIs with LLMs like GPT-3.5 and GPT-4 (from OpenAI), PaLM 2 (from Google), Claude (from Anthropic), Llama 2 (open source from Meta), and their own private LLMs to power generative conversational AI apps. OpenAI has open-sourced several versions of the Whisper model. With today's release, Voicegain supports Whisper-medium, Whisper-small, and Whisper-base. Voicegain now supports transcription in the over 99 languages that Whisper supports.
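As a rough sketch of such an integration, the snippet below feeds an STT transcript into OpenAI's chat API to summarize a call. The `get_transcript` helper is a hypothetical placeholder for a Voicegain Whisper API call, not part of any shipped SDK.

```python
# Sketch: pipe an STT transcript into an LLM for summarization.
# `get_transcript` is a hypothetical placeholder for a Voicegain Whisper API call.
from openai import OpenAI

def get_transcript(audio_path: str) -> str:
    """Hypothetical helper: call the Voicegain Whisper API, return plain text."""
    raise NotImplementedError("replace with a real Voicegain Whisper API call")

client = OpenAI()  # reads OPENAI_API_KEY from the environment
transcript = get_transcript("support-call.wav")

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Summarize this contact-center call."},
        {"role": "user", "content": transcript},
    ],
)
print(completion.choices[0].message.content)
```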
There are four main reasons for developers to use Voicegain Whisper over other offerings:
While developers can use Voicegain Whisper on our multi-tenant cloud offering, a big differentiator for Voicegain is our support for the Edge. The Voicegain platform has been architected and designed for single-tenant private cloud and datacenter deployment. In addition to the core deep-learning-based Speech-to-Text model, our platform includes our REST API services, logging and monitoring systems, auto-scaling, and offline task and queue management. Today the same APIs enable Voicegain to process over 60 million minutes a month. We bring this practical, real-world experience of running AI models at scale to our developer community.
Since the Voicegain platform is deployed on Kubernetes clusters, it is well suited for modern AI SaaS product companies and enterprises that want to integrate with their private LLMs.
At Voicegain, we have optimized Whisper for higher throughput. As a result, we are able to offer access to the Whisper model at a price that is 40% lower than what OpenAI offers.
Voicegain also offers critical features for contact centers and meetings. Our APIs support two-channel stereo audio, which is common in contact center recording systems. Word-level timestamps are another important feature of our API, needed to map audio to text. Enhanced diarization, a feature of the Voicegain models that is required for contact center and meeting use cases, will soon be made available on Whisper as well.
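To illustrate how these features fit together, the sketch below shows a request that enables both, and how word-level output could be mapped back onto the audio timeline. The field names (`channels`, `wordTimestamps`, `startMs`, and so on) are assumptions chosen for readability, not the exact Voicegain parameters.

```python
# Sketch: two-channel (stereo) transcription with word-level timestamps.
# All field names here are illustrative assumptions, not the exact Voicegain API.
request_settings = {
    "audio": {"channels": ["agent", "caller"]},  # assumed: one transcript per channel
    "transcription": {"wordTimestamps": True},   # assumed flag name
}

# Word-level results could then be mapped onto the audio timeline, e.g.:
words = [
    {"word": "hello", "startMs": 120, "endMs": 480, "channel": "agent"},
    {"word": "thanks", "startMs": 900, "endMs": 1310, "channel": "caller"},
]
for w in words:
    print(f'{w["channel"]:>6} {w["startMs"] / 1000:6.2f}s  {w["word"]}')
```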
We also offer premium support and uptime SLAs for our multi-tenant cloud offering. These APIs today process over 60 million minutes of audio every month for our enterprise and startup customers.
OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model is based on an encoder-decoder Transformer architecture and has shown significant performance improvements over previous models because it was trained on a variety of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.
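For reference, the open-source model can be run locally with OpenAI's `whisper` Python package. This illustrates the underlying model itself; the Voicegain APIs wrap an optimized, server-side build of it.

```python
# Running the open-source Whisper model locally with OpenAI's `whisper` package.
# pip install -U openai-whisper   (ffmpeg must also be on the PATH)
import whisper

model = whisper.load_model("base")      # "small" and "medium" are also available
result = model.transcribe("audio.mp3")  # language is auto-detected by default
print(result["text"])
```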
Any developer - whether you are a one-person startup or a large enterprise - can access the Voicegain Whisper model by signing up for a free developer account. We offer 15,000 minutes of free credits when you sign up today, which should allow you to build and test your app. Here is a link to get started on Voicegain Console, our developer-focused web application. Here is also a link to our GitHub.
There are two ways to select Voicegain Whisper. The first is to configure the settings in Voicegain Console, our developer-focused UI. The second is to configure Whisper as the model in the API settings, as sketched below. If you would like more information or have any questions, please drop us an email at support@voicegain.ai.
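The second method could look like the following sketch; the `settings.asr.model` field and the model-name strings are illustrative assumptions, so please confirm the exact parameter names in the Voicegain API documentation.

```python
# Sketch: selecting a Whisper model via API settings.
# The "settings.asr.model" field and value strings are assumptions for
# illustration; check the Voicegain API docs for the exact parameters.
import requests

payload = {
    "settings": {"asr": {"model": "whisper-base"}},  # or "whisper-small" / "whisper-medium"
    "audio": {"source": {"fromUrl": {"url": "https://example.com/meeting.wav"}}},
}
resp = requests.post(
    "https://api.voicegain.ai/v1/asr/transcribe",    # assumed endpoint, as above
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},
    json=payload,
    timeout=60,
)
print(resp.status_code, resp.json())
```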
Since June 2020, Voicegain has published benchmarks on the accuracy of its Speech-to-Text engine relative to big tech ASR/Speech-to-Text engines like Amazon, Google, IBM, and Microsoft.
The benchmark dataset for this comparison is a third-party dataset published by an independent party; it includes a wide variety of audio data – audiobooks, YouTube videos, podcasts, phone conversations, Zoom meetings, and more.
Here are links to some of the benchmarks that we have published.
1. Link to June 2020 Accuracy Benchmark
2. Link to Sep 2020 Accuracy Benchmark
3. Link to June 2021 Accuracy Benchmark
4. Link to Oct 2021 Accuracy Benchmark
5. Link to June 2022 Accuracy Benchmark
Through this process, we have gained insights into what it takes to deliver high accuracy for a specific use case.
We are now introducing an industry-first relative Speech-to-Text accuracy benchmark to our clients. By "relative", we mean that Voicegain's accuracy (measured by Word Error Rate) is compared with that of the big tech player the client is comparing us to. Voicegain provides an SLA that its accuracy vis-à-vis this big tech player will be practically on par.
We follow a four-step process to calculate the relative accuracy SLA:
In partnership with the client, Voicegain selects a benchmark audio dataset that is representative of the actual data the client will process. Usually this is a randomized selection of the client's audio. We also recommend that clients retain their own independent benchmark dataset, not shared with Voicegain, to validate our results.
Voicegain partners with industry-leading manual AI labeling companies to generate a human transcript of this benchmark dataset that is 99% accurate. We refer to this as the golden reference.
On this benchmark dataset, Voicegain provides scripts that enable clients to run a Word Error Rate (WER) comparison between the Voicegain platform and any of the industry-leading ASR providers the client is comparing us to.
Currently, Voicegain calculates the following two (2) KPIs (a sketch of the computation follows the list below):
a. Median Word Error Rate: the median WER across all the audio files in the benchmark dataset, for both ASRs.
b. Fourth Quartile Word Error Rate: after ordering the audio files in the benchmark dataset by increasing WER on the big tech ASR, we compute and compare the average WER of the fourth quartile for both Voicegain and the big tech ASR.
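Here is a minimal sketch of these two KPIs in Python, using the open-source `jiwer` package to score each file. The data layout and function names are our own illustration, not Voicegain's actual benchmark scripts.

```python
# Sketch: computing the two relative-accuracy KPIs with the open-source `jiwer`
# package. Layout and names are illustrative, not Voicegain's actual scripts.
# pip install jiwer
from statistics import median
from jiwer import wer

def per_file_wer(references: list[str], hypotheses: list[str]) -> list[float]:
    """WER for each (golden reference, ASR hypothesis) pair in the benchmark set."""
    return [wer(ref, hyp) for ref, hyp in zip(references, hypotheses)]

def kpis(wers_voicegain: list[float], wers_bigtech: list[float]) -> dict:
    # KPI a: median WER across all files, for each ASR.
    result = {"median": (median(wers_voicegain), median(wers_bigtech))}
    # KPI b: order files by the big tech ASR's WER, then average the hardest
    # (fourth) quartile of files for both engines.
    order = sorted(range(len(wers_bigtech)), key=lambda i: wers_bigtech[i])
    q4 = order[3 * len(order) // 4:]
    result["q4_mean"] = (
        sum(wers_voicegain[i] for i in q4) / len(q4),
        sum(wers_bigtech[i] for i in q4) / len(q4),
    )
    return result
```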
We contractually guarantee that Voicegain's accuracy on the above two KPIs, relative to the other ASR, will be within a threshold that is acceptable to the client.
Voicegain measures this accuracy SLA twice in the first year of the contract and once annually from the second year onwards.
If Voicegain does not meet the terms of the relative accuracy SLA, we will train the underlying acoustic model until it does, and we will take on the expenses associated with labeling and training. Voicegain guarantees that it will meet the accuracy SLA within 90 days of the date of measurement.
1. Click here for instructions to access our live demo site.
2. If you are building a cool voice app and looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.
3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.
Interested in customizing the ASR or deploying Voicegain on your infrastructure?