Our Blog

News, Insights, sample code & more!

Benchmark
2025 Speech-to-Text Accuracy Benchmark for 8 kHz Call Center Audio Files

Voicegain is releasing the results of its 2025 STT accuracy benchmark on an internally curated dataset of forty(40) call center audio files. This benchmark compares the accuracy of Voicegain's in-house STT models with that of the big cloud providers and also Voicegain's implementation of OpenAI's Whisper.

In the years past, we had published benchmarks that compared the accuracy of our in-house STT models against those of the big cloud providers. Here is the accuracy benchmark release in 2022 and the first release in 2021 and our second release in 2021. However the datasets we compared our STT models was a publicly available benchmark dataset that was on Medium and it included a wide variety of audio files - drawn from meetings, podcasts and telephony conversations.

Since 2023, Voicegain has focused on training and improving the accuracy of its in house Speech-to-Text AI models call center audio data. The benchmark we are releasing today is based on a Voicegain curated dataset of 40 audio files. These 40 files are from 8 different customers and from different industry verticals. For example two calls are consumer technology products, two are health insurance and one each in telecom, retail, manufacturing and consumer services. We did this to track how well the underlying acoustic models are trained on a variety of call center interactions.

Why a separate benchmark for Call Center Audio Data ?

In general Call Center audio data has the following characteristics

  1. Narrowband: Most telephony systems used in call center encode the audio in a limited bandwidth 8 kHz format. Unless AI models are trained on such audio, the recognition accuracy can be limited.
  2. Noisy data: There is significant background noise and over-talk in call center audio recordings.
  3. Accents: Call Center agents work in different international locations. Even the end customers in the US have different accents. So the STT engine needs to be tuned to different accents.

Results of our Benchmark:

How was the accuracy of the engines calculated? We first created a golden transcript (human labeled) for each of the 40 files and calculated the Word Error Rate (WER) of each of the Speech-to-Text AI models that are included in the benchmark. The accuracy that is shown below is 1 - WER in percentage terms.

Accuracy Benchmark of different STT engines on Curate 8 kHz call center benchmark

Most Accurate - Amazon AWS came out on top with an accuracy of 87.67%

Least Accurate - Google Video was the least trained acoustic model on our 8 kHz audio dataset. The accuracy was 68.38%

Most Accurate Voicegain Model - Voicegain-Whisper-Large-V3 is the most accurate model that Voicegain provides. Its accuracy was 86.17%

Accuracy of our inhouse Voicegain Omega Model - 85.09%. While this is slightly lower than Whisper-Large and AWS, it has two big advantages. The model is optimized for on-premise/pvt cloud deployment and it can further be trained on client audio data to get an accuracy that is higher.

Custom Acoustic Model Training

One very important consideration for prospective customers is that while this benchmark is on the 40 files in this curated list, the actual results for their use-case may vary. The accuracy numbers shown above can be considered as a good starting point. With custom acoustic model training, the actual accuracy for a production use-case can be much higher.

Private Cloud/On-Premise Deployment

There is also another important consideration for customers that want to deploy a Speech-to-Text model in their VPC or Datacenter. In addition to accuracy, the actual size of the model is very important. It is in this context that Voicegain Omega shines.

Additional Result of our Streaming Speech-to-Text

We also found that Voicegain Kappa - our Streaming STT engine has an accuracy that is very close to the accuracy of Voicegain Omega. The accuracy of Voicegain Kappa is less than 1% lower than Voicegain Omega.

Reproducing this Benchmark

If you are an enterprise that would like to reproduce this benchmark, please contact us over email (support@voicegain.ai). Please use your business email and share your full contact details. We would first need to qualify you, sign an NDA and then we can share the PII-redacted version of these audio call recordings.

Read more → 
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Four ways to use Voicegain Speech-to-Text with Telnyx
CPaaS
Four ways to use Voicegain Speech-to-Text with Telnyx

This blog post will describe 4 ways you can use Telnyx with the Voicegain's Deep Neural Network based Speech-to-Text/ASR platform.

#1: Real-time Transcription and Speech-Analytics

For developers looking to get the raw text/transcript, the Voicegain STT API supports real time transcription of streamed audio from Telnyx.

For conversational AI applications that need NLU tags like sentiment, named entities, intents and keywords in the submitted audio, Voicegain's real-time Speech Analytics API provides those metrics in addition to the transcript.

While both the STT API and the Speech Analytics API support multiple methods to stream audio, Voicegain recommends RTP streaming as the primary method with Telnyx. Developers can stream either 1-channel or 2-channel RTP (the two channels are tied together which is important for some Speech Analytics features).

You can use Telnyx Call Control API to fork the call audio and send it to Voicegain. Call Control API allows you to send either inbound (rx) or outbound (tx) audio or both, this is done using the fork_start command. You can find a complete example of a code needed for real-time transcription of a call here: platform/examples/telnyx/call_control_fork_of_bridged_call at master · voicegain/platform (github.com)

Applications of real-time transcription and speech analytics include live agent assist in contact centers, extraction of insights for sales calls conducted over telephony, and meeting analytics.

#2: Voice Bot or IVR using Voicegain Telephony Bot API

If you want to build a Voice Bot or an IVR application that handles calls coming over Telnyx then we suggest using Voicegain Telephony Bot API - this is a callback API similar in style to Twilio's TwiML. This API handles speech-to-text, DTMF digits and also plays prompts (TTS, pre-recorded, or a combination).

Your calls are transferred from Telnyx to Voicegain using a simple SIP INVITE. The SIP INVITE is accomplished using Telnyx Call Control Dial command. You can find a complete example how to do this here: platform/telnyx-dial-outbound-lambda.py at master · voicegain/platform (github.com)

Voicegain Telephony Bot API allows you to build two types of applications:

  • Voice Bot applications using either your own Bot framework or using frameworks like RASA or Google Dialog flow for the bot logic. The "ear" and "mouth" of the bot is provided by Voicegain. This blog shows how a Voice Bot using RASA can be constructed: Easy How-To: Build a Voicebot using Voicegain, RASA, and AWS Lambda.
  • Alternatively, you can build more traditional IVRs using call flows and grammars. You can either program them directly using Telephony Bot API by implementing appropriate callbacks. Alternatively, we provide a simple script that allows you to specify the entire IVR application in a declarative way in a YAML file. You can find a complete example how to do this on our github: platform/declarative-ivr at master · voicegain/platform (github.com)

#3: Use Voicegain STT API as needed in your Call Control App

If your application has only a limited need for speech recognition, you can invoke Voicegain STT API only as needed. Every time you need speech recognition you simply start a new ASR session with Voicegain either in transcribe (large vocabulary transcription) or recognize (grammar-based recognition) mode. The session will return an RTP ip:port to which you can fork your Telnyx audio. You can receive speech-to-text results either over a websocket or via a callback. After you a done with the transcription/recognition session you stop the Telnyx audio fork.

An example application that could be built like that is a voice controlled voicemail retrieval application where Voicegain recognize API is used in a continuous mode and listens to commands like play, stop, next, etc.

#4: Build your own Voice Bot using Long-Session STT API

Finally, you could use Voicegain Long-Session API (planned to be released later in 2021). This API allows you to establish single long session that takes an ongoing stream of inbound audio from Telnyx (via fork command). Once the session is established you can issue commands for transcription or recognition. They would return results upon finding a speech endpoint or when you explicitly stop them. After processing the results you could issue additional commands on the same Voicegain session.

In addition to returning recognition results, Long-Session STT API returns important events, like e.g. start-of-speech that allows you to implement proper barge-in behavior.

Using this API you could build your own Voice Bot just like the Voice Bots from #2, but you could have more control over your Telnyx session, e.g. you could use conference commands.

Read more → 
Speech Analytics Comparison: NER Capabilities & Accuracy
Benchmark
Speech Analytics Comparison: NER Capabilities & Accuracy

This post is the first in a series of posts that compares the performance of Voicegain Speech Analytics against Google and Amazon. This post compares the capabilities and accuracy of recognition/extraction of Named Entities. The Google APIs used for comparison were those under Cloud Natural Language and the Amazon APIs were under AWS Comprehend.

Named Entity Recognition (NER) or extraction of Named Entities is a one of the features of the Voicegain Speech Analytics API. Named Entities Recognition locates and classifies named entities in unstructured text that may be obtained e.g. from the transcription of the audio files. Although there is a lot of overlap between Google, Amazon and Voicegain with respect to the classification categories, there are also some significant differences which are summarized below.

Supported NER Categories



The full spreadsheet linked here shows the named entities extracted by the Voicegain Speech Analytics API and it compares them to the named entity categories available in Google and Amazon Comprehend APIs. Amazon has two NER API: Entity, and PII Entity.

If you look at the spreadsheet you will see that Amazon non-PII Entity API offers little granularity in the named entity categories.  For example, it groups a lot of numerical named entities into single QUANTITY category. It groups dates and time (of day) into a single category DATE. On the other hand then PII Entity API has a lot of fine categories related items typically PII-redacted, but it misses a lot of other common entity categories.

Google API seems to cover the usual categories but misses some entities used in call-center application, e.g. CC, SNN, EMAIL>

A category that Voicegain does not support is OTHER. This category  which is available in Google and Amazon requires additional application logic to interpret the string that it matches.

Accuracy Comparison

We have tested all 4 APIs on a set of call center calls.

The overall results show that Voicegain and Amazon non-PII PAI detect similar named entities (with the caveat that Amazon NER categories are less specific). Compared to these two, Google NER API misses more entities, but it also marks many additional words falling into the OTHER categories (which is generally is not very useful, at least not when analyzing call center calls.  

Looking at the Amazon PII Entities we noticed that:

  • was good on NAME, BANK_ACCOUNT_NUMBER
  • EMAIL and PHONE worked mostly OK, but had some strange false positives
  • CREDIT_DEBIT_NUMBER had false positives (e.g. from phone) or partial matches
  • DATE_TIME was not picking all phrases that the description said this category should recognize
  • ADDRESS was working with mixed success - sometimes not picking clear address text or recognizing only part of it
  • EXPIRY_DATE had many false positives - combinations of 4 digits that clearly were not valid expiry dates

Where Voicegain has a matching entity category for AWS PII Entity it performed same or better.As you see it is difficult to summarize the results because the entities are not directly comparable. If you want to know how Voicegain NER will perform on your data we suggest you test the Voicegain Speech Analytics API which includes NER, keyword, phrase detection, sentiment analysis, etc.

For testing, you have two options:

  1. You can create a free developer account on the Voicegain Platform. Here is how you can sign up. Once you sign up, please use the Transcribe+ feature. If you have any questions, please email us at support@voicegain.ai
  2. You can also use the beta version of our Speech Analytics app and upload your 2 channel audio recording. To get access, please email us at support@voicegain.ai
Read more → 
SIP INVITE Voicegain from Twilio, SignalWire, Telnyx CPaaS
CPaaS
SIP INVITE Voicegain from Twilio, SignalWire, Telnyx CPaaS

Voicegain Telephony Bot API allows developers to use Voicegain Speech-to-Text to build Voice Bots or programmable speech IVR using a simple callback API. With latest Voicegain Platform release 1.21.0 it is now possible to establish SIP sessions to Voicegain Telephony Bot API using a simple SIP Invite.


Before release 1.21.0, the only way for voice app developers to use the Voicegain Telephony Bot API was to call the application using phone numbers that were purchased from Voicegain (via the Web Console). However, we have always wanted to allow clients to bring their own carrier or CPaaS platform and this release allows developers to do just that.


At Voicegain our focus is on offering our ASR/Speech Recognition functionality and our full featured Speech-to-Text APIs. We understand developers rely on their CPaaS platforms for a whole host of important features - messaging, emails, conferencing and international coverage. Now, it is possible to integrate Voicegain Telephony Bot API with any CPaaS that supports SIP Invite. You can combine powerful and affordable Speech Recognition features of the Voicegain Platform with the comprehensive  API features of these CPaaS platforms


We have already tested SIP Invite extensively on Twilio, SignalWire, and Telnyx platforms. Other similar platforms should also work without issues. We will report any additional platforms that we have explicitly tested in the future.


How SIP INVITE works with Twilio & SignalWire

On Twilio and SignalWire platforms is trivial to establish SIP session to Voicegain. The only thing needed is the <Dial><Sip> command from TwiML or LaML, for example:



Some notes about the above example:

  • The SIP URI user name is a unique random identifier assigned on Voicegain Platform to each Telephony Bot Application.
  • After the SIP connection gets established, the application prompts and speech recognition will be under control of Voicegain Platform based on commands passed using our Telephony Bot API
  • Once Voicegain `disconnect` command is issued, the control of the application flow will be returned back to the host platform (i.e. Twilio, SignalWire or any other CPaaS platform).
  • It is possible to pass custom headers to Voicegain during SIP Invite - this way it is possible to associate host sessions with Voicegain sessions.
  • It is possible to make multiple <Dial><Sip> requests to Voicegain from host application during a single host session.

On our github you can find sample code showing how to dial a outbound call and then bridge it to Voicegain SIP:

What about Telnyx

On Telnyx we tested SIP INVITE using the Telnyx Call Control API. The only functional difference from Twilio and SIgnalWire is that on Telnyx you cannot choose TCP as SIP transport (only UDP is supported).

Here is a sample Python code showing how to dial Voicegain SIP:


The complete code for an AWS Lambda function that dials a number using Telnyx and then bridges it to Voicegain SIP is available here: platform/telnyx-dial-outbound-lambda.py at master · voicegain/platform (github.com)


What can I build with the Telephony Bot API?

Our Telephony Bot API is a callback API in similar fashion as TwiML or LaML. The main difference is that it is based on JSON and our functionality is focused on Speech Recognition. You can read more about it in our blog post announcing release of that API back in August.


On out Github you can find an example of a Node.js function on AWS Lambda that demonstrates how to interface Voicegain Telephony Bot API with a RASA NLU bot: platform/examples/voicebot-lambda-vg-rasa at master · voicegain/platform (github.com)


You can also check out our sample python function code on AWS Lambda which shows how to implement more traditional (VoiceXML like) IVRs with the use of Speech grammars on top of our Telephony Bot API: platform/declarative-ivr at master · voicegain/platform (github.com)

Read more → 
How to signup for a developer account and start using Voicegain
Developers
How to signup for a developer account and start using Voicegain

Here are all the steps needed to signup for a developer account on the Voicegain Platform. Once you have the account you can access the Web Console and you can find all the info on how to use the Web Console and the APIs on our Zendesk Knowledge Base .

1. Start at console.voicegain.ai/signup  

2. Enter your name and email. If you wish you can check the Terms of Service and/or Privacy Policy.

3. On the next page let us know how you learned about Voicegain, how you wan to use Voicegain, and accept Terms of Service.


5. After you click Next, Voicegain will send you an email with the link to the next step. If you do not get the email, please check a Junk Mail folder, and if it is not there, please follow instruction on the page shown below.



6. Once you get the email, click on the Set Password button.


7. You will be directed to a web page where you can set your Voicegain password.


8. After you click (Re)set Password you will be directed to the login page where you can enter your login credentials.



9. On the next page click the right arrow icon next to "Cloud Web Console"


10. This will take you to the home page of the Voicegain Web Console. You can follow the mini tutorial that is available on the home page.


11. Help articles are available under the question mark (?) menu. There also you will find our helpdesk support link. Note, some of the support articles are available only to logged in users while others are public.



Read more → 
Test Voicegain realtime Speech-to-Text from your browser
Developers
Test Voicegain realtime Speech-to-Text from your browser

You can now test the accuracy of both our realtime and offline speech-to-text  by visiting our demo page.

Read out paragraphs of your favorite book, give a speech that inspires, mimic your favorite actor or just play a podcast or YouTube video!

Health check for the demo

  1. We currently support Chrome and Edge browser only.
  2. Please ensure that your CPU utilization is not too high (<50%) and your internet bandwidth is reasonable (10 Mbps both directions).
  3. Ensure that your microphone is not being used by another program like Zoom, Teams, Skype or Webex.

If you are noticing delays in real-time transcription results, they are  likely because of resource issues on your computer.

Real-time Transcription

Simply click on the microphone icon to get started. You can either speak or stream audio into your microphone from your browser for a full minute.

You can also play back the audio to make sure that it was indeed streamed to us accurately.

Offline Transcription

Click on the upload recording icon to get started. You can upload up a mono or stereo recorded file - wav or FLAC - that is up to 15MB in size. If you need to transcribe a larger file, please sign up for a free account.

Drop us an email (support@voicegain.ai) if you have any comments.

Read more → 
Speech-to-Text Accuracy Benchmark - June 2021
Benchmark
Speech-to-Text Accuracy Benchmark - June 2021

[UPDATE - October 31st, 2021:  Current benchmark results from end October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced.]

It has been over 8 months since we published our last speech recognition accuracy benchmark (described  here). Back then the results were as follows (from most accurate to least): Microsoft and Google Enhanced (close 2nd), then Voicegain and Amazon (also close 4th) and then, far behind, Google Standard.

Methodology

We have repeated the test using the same methodology as before: take 44 files from the Jason Kincaid data set and 20 files published by rev.ai and remove all files where the best recognizer could not achieve a Word Error Rate (WER) lower than 20%. Last time we removed 10 files, but this time as the recognizers improved only 8 files had their WER higher than 20%.

The files removed fall into 3 categories:

  • recordings of meetings - 3 files (3 out of 7 meeting recordings in the original set),
  • telephone conversations - 3 files (3 out of 11 phone phone conversations in the original set),
  • multi-presenter, very animated podcasts - 2 files (there were a lot of other podcasts in the set that did meet the cut off).

Some of our customers told us that they previously used IBM Watson, so we decided to add also it to the test.

Results

In the new test, as you can see in the results chart above, the order has changed: Amazon has leap-frogged everyone by increasing its median accuracy by over 3% to just 10.02%, it is now in the pole position. Microsoft, Google Enhanced  and Google Standard performed at approximately the same level. The Voicegain recognizer improved by about 2%.  The newly tested IBM Watson is better than Google Standard, but lags the other recognizers.

Voicegain is tied with Google Enhanced

New results put Voicegain recognizer very close to Google enhanced:

  1. Average WER of Voicegain is just 0.66% behind Google, while median WER is just 0.63% behind. To put it in context -  Voicegain makes one additional mistake every 155 words compared to Google Enhanced.
  2. Voicegain was actually marginally better than Google Enhanced on the min error, 1st quartile, 3rd quartile, and max error.
  3. Overall Voicegain was better on 20 files while Google was better on 36 files.

However the results for a use case depends on the specific audio - for some of them Voicegain will perform slightly better and for some Google may perform marginally better. As always, we invite you to review our apps, sign-up and test our accuracy with your  data.

What about Open Source recognizers

We have looked at both the Mozilla DeepSpeech and Kaldi projects. We ran our complete benchmark on Mozilla DeepSpeech and found that it significantly trails behind Google Standard recognizer. Out of 64 audio files, Mozilla was better than Google Standard on only 5 files and tied on 1. It was worse on the remaining 58 files. Median WER was 15.63% worse for Mozilla compared to Google Standard. The lowest WER of 9.66% for Mozilla DeepSpeech was on audio from Librivox "The Art of War by Sun Tzu". For comparison, Voicegain achieves 3.45% WER on that file.

Regarding Kaldi we have not benchmarked it yet, but from the research published online it looks like Kaldi trails Google Standard too, at least when used with its standard ASpIRE and LibriSpeech models.

Out-of-the-box accuracy is not everything

When you have to select speech recognition/ASR software, there are other factors beyond out-of-the-box recognition accuracy. These factors are, for example:

  • Ability to customize the Acoustic Model - Voicegain model may be trained on your audio data - we have demonstrated improvement in accuracy of 7-10%. In fact for one of our customers with adequate training data and good quality audio we were able achieve a WER of 0.5% (99.5% accuracy)
  • Ease of integration - Many Speech-to-Text providers offer limited APIs especially for developers building applications that require interfacing with  telephony or on-premise contact center platforms.
  • Price - Voicegain is 60%-75% less expensive compared to other Speech-to-Text/ASR software providers while offering almost comparable accuracy. This makes it affordable to transcribe and analyze speech in large volumes.
  • Support for On-Premise/Edge Deployment - The cloud Speech-to-Text service providers offer limited support to deploy their speech-to-text software in client data-centers or on the private clouds of other providers. On the other hand, Voicegain can be installed on any Kubernetes cluster - whether managed by a large cloud provider or by the client.

Take Voicegain for a test drive!

1. Click here for instructions to access our live demo site.


2. If you are building a cool voice app and you are looking to test our APIs, click hereto sign up for a developer account  and receive $50 in free credits


3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Sign up for an app today
* No credit card required.

Enterprise

Interested in customizing the ASR or deploying Voicegain on your infrastructure?

Contact Us → 
Voicegain - Speech-to-Text
Under Your Control