Speech-to-Text Accuracy Benchmark

It has been another 6 months since we published our last speech recognition accuracy benchmark. Back then, the results were as follows (from the most accurate to the least): Microsoft, then Amazon closely followed by Voicegain, then the new Google latest_long, and Google Enhanced last.

While the order has remained the same as in the last benchmark, three companies - Amazon, Voicegain and Microsoft - showed significant improvement.

Since the last benchmark, we at Voicegain have invested in more training, mainly on lectures conducted over Zoom and in a live setting. Training on this type of data resulted in a further increase in the accuracy of our model. We are currently in the middle of a further round of training focused on call center conversations.

As far as the other recognizers are concerned:

Microsoft and Amazon both improved by about the same amount.
Google recognizers did not improve. Actually, the WER numbers for them are worse than in June.

Methodology

We repeated the test using a methodology similar to the one used before: we used 44 files from the Jason Kincaid data set and 20 files published by rev.ai, and removed all files on which none of the recognizers could achieve a Word Error Rate (WER) lower than 25%.

This time again, only one file was that difficult. It was a poor-quality phone interview (Byron Smith Interview 111416 - YouTube) with a WER of 25.48%.

We publish this because we want any third party - any ASR vendor, developer or analyst - to be able to reproduce these results.
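For readers who want to reproduce the filtering step, here is a minimal sketch in Python. It is not the code used for this benchmark: the function names are hypothetical, WER is computed as a plain word-level edit distance, and the only text normalization is lowercasing. A real benchmark would usually also normalize punctuation, numbers and similar details before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def drop_too_difficult(wer_by_file, threshold=0.25):
    """Keep only files on which at least one recognizer scored below the threshold.

    wer_by_file maps a file name to a dict of {recognizer name: WER as a fraction}.
    """
    return {
        name: wers
        for name, wers in wer_by_file.items()
        if min(wers.values()) < threshold
    }


# Illustrative, made-up numbers; not data from the benchmark.
example = {
    "easy_file": {"recognizer_a": 0.05, "recognizer_b": 0.30},
    "hard_file": {"recognizer_a": 0.2548, "recognizer_b": 0.31},
}
kept = drop_too_difficult(example)  # keeps only "easy_file"
```

With this rule a file is dropped only when every recognizer is at or above the threshold, so one recognizer doing well on a file is enough to keep it in the set.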

The Results

You can see box plots with the results above. The chart also reports the average and median Word Error Rate (WER).

Only 3 recognizers have improved in the last 6 months.

Amazon by 0.68% in the median and 0.40% in the average
Voicegain by 0.47% in the median and 0.45% in the average
Microsoft by 0.33% in the median and 0.25% in the average
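The improvement figures above can be derived from per-file WER values along the following lines. This is a sketch under two assumptions that the post does not state explicitly: that improvements are absolute differences in WER (percentage points) between two benchmark runs, and that both runs use the same set of files. The function names and the numbers are hypothetical.

```python
from statistics import mean, median


def summarize(wers):
    """Average and median of a list of per-file WER values (in percent)."""
    return {"average": mean(wers), "median": median(wers)}


def improvement(previous, current):
    """Drop in WER between two runs; positive values mean the recognizer improved."""
    before, after = summarize(previous), summarize(current)
    return {key: before[key] - after[key] for key in before}


# Made-up per-file WER values in percent, for illustration only.
previous_run = [5.0, 10.0, 15.0, 30.0]
current_run = [4.0, 10.0, 14.0, 28.0]
gain = improvement(previous_run, current_run)
print(gain)
```

Reporting both numbers is useful because a few very hard files can move the average a lot while leaving the median almost unchanged.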

Detailed data from this benchmark indicates that Amazon is better than Voicegain on audio files with WER below the median and worse on audio files with WER above the median. Otherwise, AWS and Voicegain are very closely matched. However, we have also run a client-specific benchmark where it was the other way around: Amazon was slightly better than Voicegain on audio files with WER above the median, but Voicegain was better on audio files with WER below the median. Net-net, it really depends on the type of audio files, but overall, our results indicate that Voicegain is very close to AWS.
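A comparison of two recognizers on easier versus harder files can be sketched as below. The post does not say which median is used for the split, so this example assumes that a file's difficulty is the mean of the two recognizers' WER on it and that the split is at the median of that value. All names and numbers are hypothetical.

```python
from statistics import mean, median


def compare_by_difficulty(wer_a, wer_b):
    """Average WER gap (A minus B) on files below and above the median difficulty.

    wer_a and wer_b map the same file names to WER values in percent.
    A negative gap means recognizer A has the lower WER on that group.
    """
    difficulty = {name: (wer_a[name] + wer_b[name]) / 2 for name in wer_a}
    cutoff = median(list(difficulty.values()))
    below = [name for name in difficulty if difficulty[name] <= cutoff]
    above = [name for name in difficulty if difficulty[name] > cutoff]

    def average_gap(files):
        return mean([wer_a[name] - wer_b[name] for name in files])

    return {"below_median": average_gap(below), "above_median": average_gap(above)}


# Made-up numbers for illustration only.
wer_a = {"f1": 2.0, "f2": 4.0, "f3": 20.0, "f4": 30.0}
wer_b = {"f1": 3.0, "f2": 5.0, "f3": 18.0, "f4": 27.0}
gaps = compare_by_difficulty(wer_a, wer_b)
print(gaps)
```

In this made-up example recognizer A is ahead on the easier half and behind on the harder half, which is the kind of pattern described above. The sketch needs at least two files with different difficulty values, otherwise one of the groups is empty.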

Best Recognizer

Let's look at the number of files on which each recognizer was the best one.

Microsoft was best on 36 out of the 63 files.
Amazon was best on 15 files.
Voicegain was best on 9 files.
Google latest-long was best on just 1 file.
Google Video Enhanced was best on 2 files. These were the 2 easiest files, on which Google got 0.82% and 1.52% WER: one was Sherlock Holmes from Librivox and the other was The Art of War by Sun Tzu, also a Librivox audiobook.
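Counting how often each recognizer comes out on top is straightforward once per-file WER values are available. The sketch below uses hypothetical names and made-up numbers; note that it breaks ties in favor of whichever recognizer is listed first, which may differ from how ties were handled in this benchmark.

```python
from collections import Counter


def count_best(wer_by_file):
    """Count the files on which each recognizer has the lowest WER.

    wer_by_file maps a file name to a dict of {recognizer name: WER}.
    On a tie, the recognizer listed first in the inner dict wins.
    """
    wins = Counter()
    for wers in wer_by_file.values():
        best = min(wers, key=wers.get)
        wins[best] += 1
    return wins


# Made-up numbers for illustration only.
results = {
    "file_1": {"recognizer_a": 3.1, "recognizer_b": 4.0, "recognizer_c": 5.2},
    "file_2": {"recognizer_a": 9.5, "recognizer_b": 8.7, "recognizer_c": 9.9},
    "file_3": {"recognizer_a": 1.2, "recognizer_b": 1.9, "recognizer_c": 2.4},
}
wins = count_best(results)
print(wins.most_common())
```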

Improvements over time

We have now run the same benchmark 5 times, so we can draw charts showing how each of the recognizers has improved over the last 2 years and 3 months. (Note that for Google the latest 2 results are from the latest-long model; the other Google results are from Video Enhanced.)

You can clearly see that Voicegain and Amazon started quite a bit behind Google and Microsoft but have since caught up.

Google seems to have the longest development cycles, with very little improvement from Sept. 2021 until about half a year ago. Microsoft, on the other hand, releases an improved recognizer every 6 months. Our improved releases are even more frequent than that.

As you can see, the field is very close and you get different results on different files (the average and median do not paint the whole picture). As always, we invite you to review our apps, sign up and test our accuracy with your data.

Out-of-the-box accuracy is not everything

When you have to select speech recognition/ASR software, there are other factors beyond out-of-the-box recognition accuracy. These factors are, for example:

Ability to customize the Acoustic Model - The Voicegain model may be trained on your audio data - we have several blog posts describing both research and real use-case model customization. The improvements can vary from several percent in more generic cases to over 50% in some specific cases, in particular for voicebots.
Ease of integration - Many Speech-to-Text providers offer limited APIs, especially for developers building applications that require interfacing with telephony or on-premise contact center platforms.
Price - Voicegain is 60%-75% less expensive compared to other Speech-to-Text/ASR software providers while offering almost comparable accuracy. This makes it affordable to transcribe and analyze speech in large volumes.
Support for On-Premise/Edge Deployment - The cloud Speech-to-Text service providers offer limited support to deploy their speech-to-text software in client data-centers or on the private clouds of other providers. On the other hand, Voicegain can be installed on any Kubernetes cluster - whether managed by a large cloud provider or by the client.

Take Voicegain for a test drive!

1. Click here for instructions to access our live demo site.

2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.

3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.
