Our Blog

News, Insights, sample code & more!

Benchmark
2025 Speech-to-Text Accuracy Benchmark for 8 kHz Call Center Audio Files

Voicegain is releasing the results of its 2025 STT accuracy benchmark on an internally curated dataset of forty (40) call center audio files. This benchmark compares the accuracy of Voicegain's in-house STT models with that of the big cloud providers and with Voicegain's implementation of OpenAI's Whisper.

In past years, we published benchmarks that compared the accuracy of our in-house STT models against those of the big cloud providers. Here is the accuracy benchmark release from 2022, the first release from 2021, and our second release from 2021. However, the dataset we used for those comparisons was a publicly available benchmark dataset published on Medium, and it included a wide variety of audio files drawn from meetings, podcasts and telephony conversations.

Since 2023, Voicegain has focused on training and improving the accuracy of its in-house Speech-to-Text AI models on call center audio data. The benchmark we are releasing today is based on a Voicegain-curated dataset of 40 audio files. These 40 files come from 8 different customers in different industry verticals. For example, two customers are in consumer technology products, two are in health insurance, and one each is in telecom, retail, manufacturing and consumer services. We did this to track how well the underlying acoustic models are trained on a variety of call center interactions.

Why a separate benchmark for Call Center Audio Data?

In general, call center audio data has the following characteristics:

  1. Narrowband: Most telephony systems used in call centers encode the audio in a limited-bandwidth 8 kHz format. Unless AI models are trained on such audio, the recognition accuracy can be limited.
  2. Noisy data: There is significant background noise and over-talk in call center audio recordings.
  3. Accents: Call center agents work in different international locations. Even the end customers in the US have different accents. So the STT engine needs to be tuned to different accents.

Results of our Benchmark:

How was the accuracy of the engines calculated? We first created a golden (human-labeled) transcript for each of the 40 files and then calculated the Word Error Rate (WER) of each Speech-to-Text AI model included in the benchmark. The accuracy shown below is 1 - WER, expressed as a percentage.
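As a rough illustration of the metric, the sketch below computes WER as the word-level edit distance between the golden transcript and the engine output, divided by the number of words in the golden transcript. The function names and the simple lowercase, whitespace-based tokenization are assumptions made for this example; the text normalization and per-file aggregation actually used in the benchmark are not described in this post.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_percent(reference: str, hypothesis: str) -> float:
    """Accuracy as defined in this post: (1 - WER) in percent."""
    return 100.0 * (1.0 - word_error_rate(reference, hypothesis))
```

For instance, a four-word golden transcript with one misrecognized word gives a WER of 0.25 and an accuracy of 75%.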

Most Accurate - Amazon AWS came out on top with an accuracy of 87.67%

Least Accurate - Google Video was the acoustic model least well trained for our 8 kHz audio dataset. Its accuracy was 68.38%

Most Accurate Voicegain Model - Voicegain-Whisper-Large-V3 is the most accurate model that Voicegain provides. Its accuracy was 86.17%

Accuracy of our in-house Voicegain Omega Model - 85.09%. While this is slightly lower than Whisper-Large and AWS, it has two big advantages: the model is optimized for on-premise/private cloud deployment, and it can be trained further on client audio data to reach a higher accuracy.

Custom Acoustic Model Training

One very important consideration for prospective customers is that while this benchmark is on the 40 files in this curated list, the actual results for their use-case may vary. The accuracy numbers shown above can be considered as a good starting point. With custom acoustic model training, the actual accuracy for a production use-case can be much higher.

Private Cloud/On-Premise Deployment

There is also another important consideration for customers that want to deploy a Speech-to-Text model in their VPC or Datacenter. In addition to accuracy, the actual size of the model is very important. It is in this context that Voicegain Omega shines.

Additional Result of our Streaming Speech-to-Text

We also found that Voicegain Kappa, our streaming STT engine, has an accuracy that is very close to that of Voicegain Omega. The accuracy of Voicegain Kappa is less than 1% lower than that of Voicegain Omega.

Reproducing this Benchmark

If you are an enterprise that would like to reproduce this benchmark, please contact us over email (support@voicegain.ai). Please use your business email and share your full contact details. We would first need to qualify you, sign an NDA and then we can share the PII-redacted version of these audio call recordings.

Read more → 
CPaaS
Large vocabulary transcription for Twilio developers

In our previous post we described how Voicegain is providing grammar-based speech recognition to Twilio Programmable Voice platform via the Twilio Media Stream Feature.

Starting from release 1.16.0 of the Voicegain Platform and API, it is possible to use Voicegain speech-to-text for speech transcription (without grammars) to achieve functionality similar to TwiML <Gather>.

The reasons we think it will be attractive to Twilio users are:

  • lower cost per speech-to-text capture
  • higher accuracy for customers who choose Acoustic Model customization
  • access to all speech-to-text hypotheses in word-tree output mode

Using Voicegain as an alternative to <Gather> involves steps similar to those for grammar-based recognition; these are listed below.

Initiating Speech Transcription with Voicegain

This is done by invoking Voicegain async transcribe API: /asr/transcribe/async

Below is an example of the payload needed to start a new transcription session:


Some notes about the content of the request:

  • we are requesting the callback to return the transcript in text form - other options are possible, like words (individual words with confidences) and word-tree (words organized in a tree of recognition hypotheses)
  • startInputTimers tells ASR to delay the start of timers - they will be started later when the question prompt finishes playing
  • TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8 kHz
  • asr settings include the two timeouts used in transcription - the no-input and complete timeouts

This request, if successful, will return the websocket url in the audio.stream.websocketUrl field. This value will be used in making a TwiML request.
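A request along the lines described in the notes above might be assembled as follows. This is only a sketch: the endpoint, the `"content": {"full": ["transcript"]}` setting, `startInputTimers`, the TWIML protocol, PCMU format, 8 kHz rate and the `audio.stream.websocketUrl` response field come from this post, while every other field name, the nesting, and the timeout values are assumptions that should be checked against the Voicegain API reference.

```python
def build_transcribe_request(callback_url: str) -> dict:
    """Body for POST /asr/transcribe/async (structure partly assumed)."""
    return {
        "sessions": [{
            "asyncMode": "REAL-TIME",             # assumed field and value
            "callback": {"uri": callback_url},    # assumed field name
            "content": {"full": ["transcript"]},  # text-form transcript
        }],
        "audio": {
            # Nesting of the protocol setting is an assumption.
            "source": {"stream": {"protocol": "TWIML"}},
            "format": "PCMU",  # u-law
            "rate": 8000,      # 8 kHz
        },
        "settings": {
            "asr": {
                "startInputTimers": False,  # timers start after the prompt
                "noInputTimeout": 5000,     # assumed name, value in ms
                "completeTimeout": 2000,    # assumed name, value in ms
            }
        },
    }


def websocket_url(response: dict) -> str:
    """Read audio.stream.websocketUrl from a successful response."""
    return response["audio"]["stream"]["websocketUrl"]


if __name__ == "__main__":
    import json
    # example.com is a placeholder callback address.
    print(json.dumps(build_transcribe_request("https://example.com/cb"), indent=2))
```

Keeping the payload builder separate from the HTTP call makes it easy to inspect or log the request before it is sent with whichever HTTP client your application already uses.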

Note that in transcribe mode DTMF detection is currently not possible. Please let us know if this is something that would be critical to your use case.

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:



Some notes about the content of the TwiML request:

  • the websocket URL is the one returned from Voicegain /asr/transcribe/async request
  • more than one question prompt is supported - they will be played one after another
  • three types of prompts are supported: 01) recording retrieved from a URL, 02) TTS prompt (several voices are available), 03) 'clip:' prompt generated using Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
  • bargeIn is enabled - prompt playback will stop as soon as caller starts speaking
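The TwiML described above can be sketched with Python's standard XML library. The `<Response>`, `<Connect>`, `<Stream url=...>` and `<Parameter name=... value=...>` elements are standard TwiML; the parameter names used in the usage example (`prompt01`, `prompt02`, `bargeIn`) and their value formats are hypothetical stand-ins for the Custom Parameters Voicegain expects, so treat them as assumptions.

```python
import xml.etree.ElementTree as ET


def build_connect_stream_twiml(websocket_url: str, params: dict) -> str:
    """Return TwiML that connects the call audio to the given websocket."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})
    # Each custom parameter becomes a <Parameter> child of <Stream>.
    for name, value in params.items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    twiml = build_connect_stream_twiml(
        "wss://example.invalid/voicegain-session",  # placeholder URL
        {
            "prompt01": "https://example.com/question.wav",  # hypothetical
            "prompt02": "clip:thank-you",                    # hypothetical
            "bargeIn": "true",                               # hypothetical
        },
    )
    print(twiml)
```

Building the document with an XML library rather than string concatenation ensures that URLs containing characters such as `&` are escaped correctly.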

Returned Transcription Response

Below is an example response from the transcription in the case where "content" : {"full" : ["transcript"]}.



Read more → 
Use Cases
Live Transcription Example

We want to share a short video showing live transcription in action at CBC. This one is using our baseline Acoustic Model. No customizations were made, no hints used. This video gives an idea of what latency is achievable with real-time transcription.


Automated real-time transcription is a great solution for accommodating people who are hearing impaired when no sign-language interpreter is available. It can be used, e.g., at churches to transcribe sermons, at conventions and meetings to transcribe talks, at educational institutions (schools, universities) to live transcribe lessons and lectures, etc.

Voicegain Platform provides a complete stack to support live transcription:

  • Utility for audio capture at source
  • Cloud based or On-Prem transcription engine and API
  • Web portal for controlling multiple simultaneous live transcriptions
  • Web-based viewer app to enable following the transcription on any device with web browser. This app can also be embedded into any web page.

Very high accuracy - above that provided by Google, Amazon, and Microsoft Cloud speech-to-text - can be achieved through Acoustic Model customization.

Read more → 
CPaaS
How to use Voicegain with Twilio Media Streams

Voicegain adds grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

The differences between Voicegain speech recognition and Twilio TwiML <Gather> are:

  1. Voicegain supports grammars with semantic tags (GRXML or JSGF) while <Gather> is a large vocabulary recognizer that just returns text, and
  2. Voicegain is significantly cheaper (we will describe the price difference in an upcoming blog post).

When using Voicegain with Twilio, your application logic will need to handle callback requests from both Twilio and Voicegain.

Each recognition will involve two main steps described below:

Initiating Speech Recognition with Voicegain

This is done by invoking Voicegain async recognition API: /asr/recognize/async

Below is an example of the payload needed to start a new recognition session:

Some notes about the content of the request:

  • startInputTimers tells ASR to delay the start of timers - they will be started later when the question prompt finishes playing
  • TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8 kHz
  • asr settings include the three standard timeouts used in grammar-based recognition - the no-input, complete, and incomplete timeouts
  • grammar is set to a GRXML grammar loaded from an external URL

This request, if successful, will return the websocket url in the audio.stream.websocketUrl field. This value will be used in making a TwiML request.

Note, if the grammar is specified to recognize DTMF, the Voicegain recognizer will recognize DTMF signals included in the audio sent from Twilio Platform.

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:


Some notes about the content of the TwiML request:

  • the websocket URL is the one returned from Voicegain /asr/recognize/async request
  • more than one question prompt is supported - they will be played one after another
  • three types of prompts are supported: 01) recording retrieved from a URL, 02) TTS prompt (several voices are available), 03) 'clip:' prompt generated using Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
  • bargeIn is enabled - prompt playback will stop as soon as caller starts speaking

Returned Recognition Response

Below is an example response from the recognition. This response is from the built-in phone grammar.


Read more → 
Benchmark
Speech-to-Text Accuracy Benchmark Revisited

Some of the feedback that we received regarding the previously published benchmark data, see here and here, concerned the fact that the Jason Kincaid data set contained some audio that produced terrible WER across all recognizers, and that in practice no one would use automated speech recognition on such files. That is true. In our opinion, there are very few use cases where a WER worse than 20%, i.e. where on average 1 in every 5 words is recognized incorrectly, is acceptable.

New Methodology

For this blog post, we have removed from the reported set those benchmark files for which none of the recognizers tested could deliver a WER of 20% or less. This criterion resulted in the removal of 10 files - 9 from the Jason Kincaid set of 44 and 1 file from the rev.ai set of 20. The files removed fall into 3 categories:

  • recordings of meetings - 4 files (this amounts to half of the meeting recordings in the original set),
  • telephone conversations - 4 files (4 out of 11 phone conversations in the original set),
  • multi-presenter, very animated podcasts - 2 files (there were a lot of other podcasts in the set that did meet the cut-off).
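The selection rule can be sketched as a simple filter: a file stays in the reported set only if at least one recognizer reaches a WER of 20% or less on it. The function name, the data layout, and the sample numbers below are made up for illustration and are not benchmark results.

```python
def split_by_best_wer(wer_by_file: dict, max_wer: float = 20.0):
    """Split files into (kept, removed) by the best WER any recognizer achieved.

    wer_by_file maps a file name to {recognizer name: WER in percent}.
    """
    kept, removed = {}, {}
    for name, scores in wer_by_file.items():
        best = min(scores.values())  # lowest WER across all recognizers
        target = kept if best <= max_wer else removed
        target[name] = scores
    return kept, removed


if __name__ == "__main__":
    # Illustrative values only.
    sample = {
        "podcast_a.wav": {"engine_1": 12.5, "engine_2": 14.0},
        "meeting_b.wav": {"engine_1": 31.0, "engine_2": 27.4},
        "phone_c.wav": {"engine_1": 22.0, "engine_2": 19.9},
    }
    kept, removed = split_by_best_wer(sample)
    print(sorted(kept), sorted(removed))
```

Note that a file is kept even if most recognizers exceed the threshold, as long as one of them does not.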

The results

As you can see, the Voicegain and Amazon recognizers are very evenly matched, with average WER differing by only 0.02%; the same holds for the Google Enhanced and Microsoft recognizers, with the WER difference being only 0.04%. The WER of Google Standard is about twice that of the other recognizers.

Read more → 
Benchmark
Speech-to-Text Accuracy Benchmark - September 2020

[UPDATE - October 31st, 2021: Current benchmark results from the end of October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced. Our pricing is now 0.95 cents/minute]


[UPDATE: For results reported using slightly different methodology see our new blog post.]


This is a continuation of the blog post from June where we reported the previous speech-to-text accuracy results. We encourage you to read it first, as it sets up the context needed to better understand the significance of benchmarking for speech-to-text.

Apart from that background intro, the key differences from the previous post are:

  • We have improved our recognizer and we are now essentially tied with Amazon
  • We added another set of benchmark files - 20 files published by rev.ai. Please reference the data linked here when trying to reproduce this benchmark.

Here are the results.


Comparison to the June benchmark on 44 files.


Less than 3 months have passed since the previous test, so it is not surprising to see no improvement in the Google and Amazon recognizers.


The Voicegain recognizer has now overtaken Amazon by a hair's breadth in average accuracy, although Amazon's median accuracy on this data set is slightly above Voicegain's.


The Microsoft recognizer has improved during this time period - on the 44 benchmark files it is now on average better than Google Enhanced (in the chart we retained the ordering from the June test). The single bad outlier in the Google Enhanced results does not alone account for Microsoft's better average WER on this data set.


Google Standard is still very bad and we will likely stop reporting on it in detail in our future comparisons.


Results from the benchmark on 20 new files.

The audio from the 20-file rev.ai test is not as challenging as some of the files in the 44-file benchmark set. Consequently the results are on average better but the ranking of the recognizers does not change.


As you can see in this chart, on this data set the Voicegain recognizer is marginally better than Amazon. It has lower WER on 13 out of 20 test files and it beats Amazon in the mean and median values. On this data set Google Enhanced beats Microsoft.


Combined results on 44+20 files

Finally, here are the combined results for all the 64 benchmark files we tested.


On the combined benchmark Voicegain beats Amazon both in average and median WER, although the median advantage is not as big as on the 20 file rev.ai set. [Note that as of 2/10/21 Voicegain WER is now 16.46|14.26]

What we would like to point out is that when comparing Google Enhanced to Microsoft, one wins if we compare the average WER while the other has a better median WER value. This highlights that the results vary a lot depending on what specific audio file is being compared.


Conclusions

These results show that choosing the best recognizer for a given application should be done only after thorough testing. Performance of the recognizers varies a lot depending on the audio data and acoustic environment. Moreover, the prices vary significantly. We encourage you to try the Voicegain Speech-to-Text engine for your application. It might be a better fit for your application. Even if the accuracy is a couple of points behind the two top players, you might still want to consider Voicegain because:

  • Our acoustic models can be customized to your specific speech audio and this can reduce the word error rates below the best out-of-the-box options - see our Improved Accuracy from Acoustic Model Training blog post.
  • If the accuracy difference is small, Voicegain might still make sense given the lower price.
  • We are continuously training our recognizer and it is only a matter of time before we catch up.

Read more → 
Developers
Voicegain Speech-to-Text integrates with Twilio Media Streams

Voicegain launched an extension to the Voicegain /asr/recognize API that supports Twilio Media Streams via TwiML <Connect><Stream>. With this launch, developers using Twilio's Programmable Voice get an accurate, affordable, and easy-to-use ASR for building Voice Bots and Speech IVRs.

Update: Voicegain also announced that its large vocabulary transcription (/asr/transcribe API) integrates with Twilio Media Streams. Developers may use this to voice enable a chat bot developed on any bot platform or develop a real-time agent assist application.

Key Features of Twilio Media Streams support

Voicegain Twilio Media Streams support gives developers the following features:

  1. Grammar Support for bots & IVRs: Developers can now write voice bots or IVRs that use grammars. Use of grammars can improve recognition accuracy and simplify bot development by constraining the speech-to-text engine. Also, many traditional VoiceXML IVRs are built using grammars. Until now, Twilio TwiML did not support speech grammars, as the <Gather> command supports only text capture. This made it hard to build simple bots or to migrate existing VoiceXML IVR applications to the Twilio platform. Mapping of text to semantic meaning had to be done separately, and a large vocabulary recognizer was more likely to return spurious recognitions. Voicegain solves these problems by supporting both GRXML and JSGF speech grammars at the core speech-to-text (ASR) engine level. This delivers higher accuracy compared to an ASR that uses a large vocabulary language model to recognize text and then applies grammars to the recognized text.
  2. 90% Savings on ASR Licensing costs: A big advantage of the Twilio Programmable Voice platform for developers has been its affordable pricing. However, that was not necessarily true for existing ASR options like <Gather>, which is priced at 8 cents/minute (with a 15 second minimum). With Voicegain the ASR/STT price is 1.25 cents/minute, measured in 1 second increments. If you include the billing increment, developers get 90% cost savings.
  3. Better Timeout Support: Voicegain supports configurable no-input, complete, and incomplete timeouts. Because the grammar is integrated with the recognizer, Voicegain ASR is able to deliver an accurate complete timeout response. This is not possible with the <Gather> command, for which the only way to tell that the caller has stopped speaking is a large enough pause.
  4. Simplified dynamic prompt playback: To make <Connect><Stream> as easy to use as possible, we support passing prompts when invoking <Stream>. Prompts can be provided either as text or as URLs. If provided as text, Voicegain will either use TTS or perform dynamic concatenation of prerecorded prompts. A prompt manager for such prerecorded prompts is provided as part of the Voicegain Web Portal. Configurable barge-in is supported for the prompts.
  5. Fine-tune and test grammars: The Voicegain Web Portal includes a tool for reviewing and fine-tuning grammars. The tool also supports regression tests. With this functionality you will never have to deploy grammars without knowing how well they are going to perform after changes.
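To see how the billing increment affects the savings figure in item 2, here is a small worked calculation. It uses the two prices quoted above and assumes that the 15 second minimum means any shorter capture is billed as 15 seconds and that Voicegain rounds each capture up to the next whole second; both billing interpretations are assumptions made for this sketch.

```python
import math

GATHER_CENTS_PER_MIN = 8.0       # price quoted above for <Gather>
GATHER_MIN_SECONDS = 15          # quoted minimum; billing rule assumed
VOICEGAIN_CENTS_PER_MIN = 1.25   # price quoted above for Voicegain


def gather_cost_cents(seconds: float) -> float:
    """Cost of one capture when anything under 15 s is billed as 15 s."""
    billed = max(seconds, GATHER_MIN_SECONDS)
    return GATHER_CENTS_PER_MIN * billed / 60.0


def voicegain_cost_cents(seconds: float) -> float:
    """Cost of one capture billed in 1 second increments (rounded up)."""
    billed = math.ceil(seconds)
    return VOICEGAIN_CENTS_PER_MIN * billed / 60.0


def savings_fraction(seconds: float) -> float:
    """Fraction of the <Gather> cost saved for a capture of this length."""
    return 1.0 - voicegain_cost_cents(seconds) / gather_cost_cents(seconds)
```

On the per-minute rate alone the saving is 1 - 1.25/8, about 84%. Under the assumptions above, a 5 second capture costs 2 cents with <Gather> and roughly 0.1 cents with Voicegain, a saving of about 95%, which is how short utterances push the overall figure toward 90%.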


How Twilio Media Streams works with Voicegain


TwiML <Stream> requires a websocket URL. This URL can be obtained by invoking the Voicegain /asr/recognize/async API. When invoking this API, the grammar to be used in the recognition has to be provided. The websocket URL will be returned in the response.


In addition to the wss URL, Custom Parameters within the <Connect><Stream> command are used to pass information about the question prompt to be played to the caller by Voicegain. This can be text or a URL of a service that will provide the audio.

Once <Connect><Stream> has been invoked, the Voicegain platform takes over:

  • Plays the prompt via the back channel of <Stream>
  • As soon as the caller starts speaking, the prompt playback is stopped (if it was still playing), exactly like in <Gather>
  • Spoken words are recognized using the grammar. The recognition result is then provided as a callback from the Voicegain Platform. In case of no-input or no-match, an appropriate callback will also be made.
  • <Stream> connection is stopped and the TwiML application will continue with a next command.

BTW, we also support DTMF input as an alternative to speech input.

[UPDATE: you can see more details of how to use Voicegain with Twilio Media Streams in this new Blog post.]

Other features of the Voicegain Platform

1. On Premise Edge Support: While Voicegain APIs are available as a cloud PaaS service, Voicegain also supports OnPrem/Edge deployment. Voicegain can be deployed as a containerized service on a single node Kubernetes cluster, or onto multi-node high-availability Kubernetes cluster (on your GPU hardware or your VPC).

2. Acoustic model customization: This makes it possible to achieve very high accuracy, beyond what is possible with out-of-the-box recognizers. The grammar tuning and regression tool mentioned earlier can be used to collect training data for acoustic model customization.

More Features Coming

On our near-term roadmap for Twilio users we have several more features:

  • Advanced Answering Machine Detection (AMD) -- will be invoked using <Connect><Stream> and will provide very accurate answering machine detection using speech recognition.
  • Large vocabulary language model to just capture the spoken words (no grammars are used) and integrate with any NLU Engine of your choice. We think it will be attractive because of the lower cost compared to <Gather>.
  • Real-time agent assist - we are combining our real-time speech recognition with speech analytics to deliver an API that will allow for building real-time agent assist and monitoring applications.

You can sign up to try our platform. We are offering 600 minutes of free monthly use of the platform. If you have questions about integration with Twilio, send us a note at support@voicegain.ai.

Twilio, TwiML and Twilio Programmable Voice are registered trademarks of Twilio, Inc.

Read more → 