Speech-to-Text Accuracy Benchmark

[UPDATE - October 31st, 2021: Current benchmark results from the end of October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced. Our pricing is now 0.95 cents/minute.]

[UPDATE: For results reported using slightly different methodology see our new blog post.]

This is a continuation of the blog post from June where we reported the previous speech-to-text accuracy results. We encourage you to read it first, as it sets up a context to better understand the significance of benchmarking for speech-to-text.

Apart from that background intro, the key differences from the previous post are:

We have improved our recognizer and we are now essentially tied with Amazon.
We added another set of benchmark files: 20 files published by rev.ai. Please reference the data linked here when trying to reproduce this benchmark.

Here are the results.

Comparison to the June benchmark on 44 files.


Less than 3 months have passed since the previous test, so it is not surprising to see no improvement in the Google and Amazon recognizers.

The Voicegain recognizer has now overtaken Amazon by a hair's breadth in average accuracy, although Amazon's median accuracy on this data set is slightly above Voicegain's.

The Microsoft recognizer has improved during this time period: on the 44 benchmark files it is now on average better than Google Enhanced (in the chart we retained the ordering from the June test). The single bad outlier in the Google Enhanced results does not alone account for the better average WER of Microsoft on this data set.
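One way to check a claim like this is to drop the worst file from the recognizer that has the outlier and see whether its mean WER still trails the other recognizer. The sketch below uses made-up per-file WER values (they are not the benchmark data), and the names `recognizer_a` and `recognizer_b` are placeholders.

```python
from statistics import mean

def mean_without_worst(wers):
    """Mean WER after dropping the single highest (worst) value."""
    return mean(sorted(wers)[:-1])

# Hypothetical per-file WER values in percent; NOT the benchmark data.
recognizer_a = [11.0, 13.0, 15.0, 17.0, 48.0]  # contains one bad outlier
recognizer_b = [9.0, 11.0, 13.0, 15.0, 17.0]

a_full = mean(recognizer_a)
b_full = mean(recognizer_b)
a_trim = mean_without_worst(recognizer_a)

# The outlier "explains" the gap only if removing it closes the gap.
outlier_explains_gap = a_full > b_full and a_trim <= b_full

print(f"A mean: {a_full:.1f}, A mean without worst file: {a_trim:.1f}, B mean: {b_full:.1f}")
print("Outlier alone explains the gap:", outlier_explains_gap)
```

With these illustrative numbers, A is still behind B after its worst file is removed, which is the situation described above.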

Google Standard is still very bad and we will likely stop reporting on it in detail in our future comparisons.

Results from the benchmark on 20 new files.

The audio from the 20-file rev.ai test is not as challenging as some of the files in the 44-file benchmark set. Consequently, the results are on average better, but the ranking of the recognizers does not change.

As you can see in this chart, on this data set the Voicegain recognizer is marginally better than Amazon. It has lower WER on 13 out of 20 test files and it beats Amazon in the mean and median values. On this data set Google Enhanced beats Microsoft.
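For readers who want to reproduce this kind of comparison, here is a minimal sketch of how WER and the per-file win count could be computed. WER is taken in its conventional form: word-level edit distance divided by the number of reference words. The text normalization (lowercasing and whitespace splitting) is an assumption; this post does not describe the exact scoring rules used for the benchmark, and the helper names `wer` and `compare` are illustrative.

```python
from statistics import mean, median

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()   # assumed normalization
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def compare(wers_a, wers_b):
    """Count files where A has strictly lower WER; report mean and median for both."""
    wins_a = sum(a < b for a, b in zip(wers_a, wers_b))
    return {
        "wins_a": wins_a,
        "mean": (mean(wers_a), mean(wers_b)),
        "median": (median(wers_a), median(wers_b)),
    }
```

Usage would be to call `wer` once per file and recognizer, then pass the two lists of per-file values, in the same file order, to `compare`.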

Combined results on 44+20 files

Finally, here are the combined results for all 64 benchmark files we tested.


On the combined benchmark Voicegain beats Amazon in both average and median WER, although the median advantage is not as big as on the 20-file rev.ai set. [Note that as of 2/10/21 Voicegain WER is now 16.46|14.26]
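A note for anyone combining per-set results themselves: the combined mean can be derived from the per-set means (weighted by file count), but the combined median cannot be derived from the per-set medians and has to be recomputed over the pooled per-file values. The sketch below illustrates this with made-up numbers; `set_a` and `set_b` merely stand in for the two benchmark sets.

```python
from statistics import mean, median

# Hypothetical per-file WER values in percent; NOT the benchmark data.
set_a = [8.0, 12.0, 20.0, 40.0]  # stands in for the 44-file set
set_b = [6.0, 10.0]              # stands in for the 20-file set

pooled = set_a + set_b

# The pooled mean equals the size-weighted average of the per-set means.
weighted = (len(set_a) * mean(set_a) + len(set_b) * mean(set_b)) / len(pooled)

# The pooled median has no such shortcut; averaging the per-set medians
# generally gives a different number.
pooled_median = median(pooled)
naive_median = (median(set_a) + median(set_b)) / 2

print(f"pooled mean {mean(pooled):.1f} vs weighted mean {weighted:.1f}")
print(f"pooled median {pooled_median:.1f} vs averaged medians {naive_median:.1f}")
```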

Note that when comparing Google Enhanced to Microsoft, one wins on average WER while the other has the better median WER. This highlights that the results vary a lot depending on which specific audio file is being compared.
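To see how the mean and the median can rank two recognizers differently, consider the small illustration below. The numbers are invented for the example and are not taken from the benchmark; `recognizer_x` and `recognizer_y` are placeholder names.

```python
from statistics import mean, median

# Hypothetical per-file WER values in percent; NOT the benchmark data.
recognizer_x = [10.0, 11.0, 12.0, 13.0, 44.0]  # better on most files, one very bad file
recognizer_y = [14.0, 15.0, 16.0, 17.0, 18.0]  # consistent, slightly worse on most files

x_mean, x_median = mean(recognizer_x), median(recognizer_x)
y_mean, y_median = mean(recognizer_y), median(recognizer_y)

# Lower WER is better for both statistics.
mean_winner = "x" if x_mean < y_mean else "y"
median_winner = "x" if x_median < y_median else "y"

print(f"x: mean {x_mean:.1f}, median {x_median:.1f}")
print(f"y: mean {y_mean:.1f}, median {y_median:.1f}")
print(f"mean winner: {mean_winner}, median winner: {median_winner}")
```

Here one difficult file is enough to cost `recognizer_x` the mean comparison even though it is more accurate on four of the five files, which is why looking at both statistics (and ideally the per-file values) is worthwhile.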

Conclusions

These results show that the best recognizer for a given application should be chosen only after thorough testing. Performance of the recognizers varies a lot depending on the audio data and acoustic environment. Moreover, the prices vary significantly. We encourage you to try the Voicegain Speech-to-Text engine; it might be a better fit for your application. Even if the accuracy is a couple of points behind the two top players, you might still want to consider Voicegain because:

Our acoustic models can be customized to your specific speech audio and this can reduce the word error rates below the best out-of-the-box options - see our Improved Accuracy from Acoustic Model Training blog post.
If the accuracy difference is small, Voicegain might still make sense given the lower price.
We are continuously training our recognizer and it is only a matter of time before we catch up.

