Voicegain is releasing the results of its 2025 STT accuracy benchmark on an internally curated dataset of forty (40) call center audio files. This benchmark compares the accuracy of Voicegain's in-house STT models with that of the big cloud providers, as well as Voicegain's implementation of OpenAI's Whisper.
In past years, we published benchmarks that compared the accuracy of our in-house STT models against those of the big cloud providers. Here are the accuracy benchmark released in 2022, the first release in 2021, and our second release in 2021. However, the dataset we compared our STT models against was a publicly available benchmark dataset published on Medium, and it included a wide variety of audio files drawn from meetings, podcasts and telephony conversations.
Since 2023, Voicegain has focused on training and improving the accuracy of its in-house Speech-to-Text AI models on call center audio data. The benchmark we are releasing today is based on a Voicegain-curated dataset of 40 audio files. These 40 files come from 8 different customers across different industry verticals. For example, two calls are from consumer technology products, two are from health insurance, and one each is from telecom, retail, manufacturing and consumer services. We did this to track how well the underlying acoustic models are trained on a variety of call center interactions.
In general, call center audio data has the following characteristics:
How was the accuracy of the engines calculated? We first created a golden transcript (human-labeled) for each of the 40 files and calculated the Word Error Rate (WER) of each of the Speech-to-Text AI models included in the benchmark. The accuracy shown below is 1 - WER, in percentage terms.
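As an illustration of the metric (this is a minimal sketch, not Voicegain's actual scoring code), word-level WER can be computed as an edit distance between the golden transcript and the ASR output:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance over words (substitutions,
    # insertions, and deletions all cost 1)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def accuracy_pct(reference: str, hypothesis: str) -> float:
    """Accuracy as reported in this benchmark: (1 - WER) in percent."""
    return round((1 - wer(reference, hypothesis)) * 100, 2)
```

For example, a hypothesis that drops one word out of a six-word reference has a WER of 1/6, i.e., roughly 83.33% accuracy.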
Most Accurate - Amazon AWS came out on top with an accuracy of 87.67%
Least Accurate - Google Video was the least trained acoustic model on our 8 kHz audio dataset. The accuracy was 68.38%
Most Accurate Voicegain Model - Voicegain-Whisper-Large-V3 is the most accurate model that Voicegain provides. Its accuracy was 86.17%
Accuracy of our in-house Voicegain Omega Model - 85.09%. While this is slightly lower than Whisper-Large and AWS, it has two big advantages: the model is optimized for on-premise/private cloud deployment, and it can be further trained on client audio data to achieve even higher accuracy.
One very important consideration for prospective customers: while this benchmark is based on the 40 files in this curated list, actual results for their use case may vary. The accuracy numbers shown above are a good starting point; with custom acoustic model training, the actual accuracy for a production use case can be much higher.
There is also another important consideration for customers that want to deploy a Speech-to-Text model in their VPC or Datacenter. In addition to accuracy, the actual size of the model is very important. It is in this context that Voicegain Omega shines.
We also found that Voicegain Kappa, our streaming STT engine, has accuracy very close to that of Voicegain Omega: less than 1% lower.
If you are an enterprise that would like to reproduce this benchmark, please contact us over email (support@voicegain.ai). Please use your business email and share your full contact details. We would first need to qualify you, sign an NDA and then we can share the PII-redacted version of these audio call recordings.
Since June 2020, Voicegain has published benchmarks on the accuracy of its Speech-to-Text relative to big tech ASRs/Speech-to-Text engines like Amazon, Google, IBM and Microsoft.
The benchmark dataset for this comparison has been a publicly available third-party dataset, and it includes a wide variety of audio data - audiobooks, YouTube videos, podcasts, phone conversations, Zoom meetings and more.
Here is a link to some of the benchmarks that we have published.
1. Link to June 2020 Accuracy Benchmark
2. Link to Sep 2020 Accuracy Benchmark
3. Link to June 2021 Accuracy Benchmark
4. Link to Oct 2021 Accuracy Benchmark
5. Link to June 2022 Accuracy Benchmark
Through this process, we have gained insights into what it takes to deliver high accuracy for a specific use case.
We are now introducing an industry-first relative Speech-to-Text accuracy benchmark for our clients. By "relative", we mean that Voicegain's accuracy (measured by Word Error Rate) is compared with the big tech player that the client is comparing us to. Voicegain will provide an SLA that its accuracy vis-à-vis this big tech player will be practically on par.
We follow a 4-step process to calculate the relative accuracy SLA:
In partnership with the client, Voicegain selects a benchmark audio dataset that is representative of the actual data the client will process. Usually this is a randomized selection of client audio. We also recommend that clients retain their own independent benchmark dataset, not shared with Voicegain, to validate our results.
Voicegain partners with industry-leading manual AI labeling companies to generate a human-labeled transcript of this benchmark dataset that is 99% accurate. We refer to this as the golden reference.
On this benchmark dataset, Voicegain provides scripts that enable clients to run a Word Error Rate (WER) comparison between the Voicegain platform and any one of the industry-leading ASR providers the client is comparing us to.
Currently Voicegain calculates the following two (2) KPIs:
a. Median Word Error Rate: the median WER across all the audio files in the benchmark dataset, for both ASRs.
b. Fourth Quartile Word Error Rate: after ordering the audio files in the benchmark dataset by increasing WER on the Big Tech ASR, we compute and compare the average WER of the fourth quartile for both Voicegain and the Big Tech ASR.
We contractually guarantee that Voicegain's accuracy on the above two KPIs, relative to the other ASR, will be within a threshold that is acceptable to the client.
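As a minimal sketch of how these two KPIs could be computed from per-file WER lists (illustrative code, not the actual SLA scripts Voicegain provides):

```python
from statistics import median

def accuracy_kpis(wer_voicegain: list, wer_bigtech: list) -> tuple:
    """Compute the two relative-accuracy KPIs over per-file WER lists.

    wer_voicegain / wer_bigtech: WER per audio file, in the same file order.
    """
    n = len(wer_bigtech)
    # KPI 1: median WER for each ASR across all files
    kpi1 = {"voicegain": median(wer_voicegain),
            "bigtech": median(wer_bigtech)}
    # KPI 2: sort files by the Big Tech ASR's WER (ascending), then average
    # each engine's WER over the hardest files (the fourth quartile)
    order = sorted(range(n), key=lambda i: wer_bigtech[i])
    q4 = order[-(n // 4):] if n >= 4 else order
    kpi2 = {"voicegain": sum(wer_voicegain[i] for i in q4) / len(q4),
            "bigtech": sum(wer_bigtech[i] for i in q4) / len(q4)}
    return kpi1, kpi2
```

Ranking the fourth quartile by the Big Tech ASR's WER (rather than each engine's own) keeps the comparison on the same set of hard files for both engines.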
Voicegain measures this accuracy SLA twice in the first year of the contract and once annually from the second year onwards.
If Voicegain does not meet the terms of the relative accuracy SLA, we will train the underlying acoustic model until it does, taking on the expenses associated with labeling and training. Voicegain guarantees that it will meet the accuracy SLA within 90 days of the date of measurement.
1. Click here for instructions to access our live demo site.
2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits
3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.
The Twilio platform supports encrypted call recordings. Here is Twilio's documentation on how to set up encryption for recordings on their platform.
Voicegain platform supports direct intake of encrypted recordings from the Twilio platform.
The overall diagram of how all of the components work together is as follows:
Below we describe how to configure a setup that automatically submits encrypted recordings from Twilio to Voicegain transcription as soon as those recordings are completed.
Voicegain requires a Private Key in PKCS#8 format to decrypt Twilio recordings. Twilio's documentation describes how to generate a Private Key in that format.
Once you have the key, you need to upload it via Voicegain Web Console to the Context that you will be using for transcription. This can be done via Settings -> API Security -> Auth Configuration. You need to choose Type: Twilio Encrypted Recording.
We will handle Twilio recording callbacks using an AWS Lambda function, but you can use an equivalent from a different cloud platform, or your own service that handles HTTPS callbacks.
A sample AWS Lambda function in Python is available on Voicegain Github: platform/AWS-lambda-for-encrypted-recordings.py at master · voicegain/platform (github.com)
You will need to modify that function before it can be used.
First you need to enter the following parameters:
The Lambda function receives the callback from Twilio, parses the relevant info from it, and then submits a request to the Voicegain STT API for OFFLINE transcription. If you want, you can modify the body of the request in the Lambda function code. For example, the GitHub sample submits the results of transcription to be viewable in the Web Console (Portal), but you will likely want to change that so that the results are delivered via a callback to your own HTTPS endpoint (a comment in the sample indicates where the change would need to be made).
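The flow can be sketched roughly as below. Note that the endpoint URL and JSON field names here are illustrative assumptions, not the exact Voicegain API schema; the GitHub sample linked above contains the authoritative request format.

```python
import json
from urllib.parse import parse_qs

# Assumed endpoint for illustration only; see the voicegain/platform
# GitHub sample for the real URL and request body.
VOICEGAIN_API = "https://api.voicegain.ai/v1/asr/transcribe/async"

def build_transcription_request(twilio_params: dict, callback_url: str) -> dict:
    """Map the relevant fields of a Twilio recording-status callback to an
    offline (async) transcription request body (field names are illustrative)."""
    return {
        "audio": {
            # Twilio sends the recording location in the RecordingUrl parameter
            "source": {"fromUrl": {"url": twilio_params["RecordingUrl"]}},
        },
        "sessions": [{
            "asyncMode": "OFF-LINE",
            # deliver results to your own HTTPS endpoint instead of the Portal
            "callback": {"uri": callback_url},
        }],
    }

def lambda_handler(event, context):
    """AWS Lambda entry point: parse the Twilio callback, build the request."""
    # Twilio posts callbacks as application/x-www-form-urlencoded
    params = {k: v[0] for k, v in parse_qs(event["body"]).items()}
    body = build_transcription_request(params, "https://example.com/results")
    # here you would POST `body` to VOICEGAIN_API with an
    # Authorization: Bearer <JWT> header (e.g. using urllib.request)
    return {"statusCode": 200, "body": json.dumps({"submitted": True})}
```

The pure `build_transcription_request` function keeps the request-body mapping separate from the Lambda plumbing, which makes it easy to adjust the body as described above.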
You can also make other changes to the body of the request as needed. For the complete spec of the Voicegain Transcribe API see here.
Here is simple Python code that can be used to make an outbound Twilio call which will be recorded and then submitted for transcription.
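A sketch of such a script is below. The Twilio Python SDK call (`Client.calls.create` with `record` and `recording_status_callback`) is real, but the credentials, phone numbers, TwiML URL, and callback URL are placeholders you would replace with your own:

```python
def build_call_kwargs(to_number: str, from_number: str, callback_url: str) -> dict:
    """Arguments for Twilio's Client.calls.create: record the call and
    notify the callback endpoint when the recording is completed."""
    return {
        "to": to_number,
        "from_": from_number,
        # TwiML instructions to execute when the call is answered (placeholder)
        "url": "http://demo.twilio.com/docs/voice.xml",
        "record": True,
        "recording_status_callback": callback_url,
        "recording_status_callback_event": ["completed"],
    }

if __name__ == "__main__":
    # requires `pip install twilio`; credentials below are placeholders
    from twilio.rest import Client
    client = Client("ACXXXXXXXXXXXXXXXX", "your_auth_token")
    call = client.calls.create(**build_call_kwargs(
        "+15551230001", "+15551230002",
        "https://your-lambda-endpoint.amazonaws.com/recording-done"))
    print(call.sid)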
Notice that:
It has been over 7 months since we published our last speech recognition accuracy benchmark. Back then the results were as follows (from most accurate to least): Microsoft and Amazon (close 2nd), then Voicegain and Google Enhanced, and then, far behind, IBM Watson and Google Standard.
Since then we have obtained more training data and added additional features to our training process. This resulted in a further increase in the accuracy of our model.
As far as the other recognizers are concerned:
We have decided to no longer report on Google Standard and IBM Watson accuracy, which were always far behind in accuracy.
We repeated the test using a methodology similar to before: we used 44 files from the Jason Kincaid data set and 20 files published by rev.ai, and removed all files on which none of the recognizers could achieve a Word Error Rate (WER) lower than 25%.
This time only one file was that difficult. It was a bad quality phone interview (Byron Smith Interview 111416 - YouTube).
You can see boxplots with the results above. The chart also reports the average and median Word Error Rate (WER).
All of the recognizers have improved (Google Video Enhanced model stayed much the same but Google now has a new recognizer that is better).
Google latest-long, Voicegain, and Amazon are now very close together, while Microsoft is better by about 1%.
Let's look at the number of files on which each recognizer was the best one.
Note: the numbers do not add up to 63 because there were a few files where two recognizers had identical results (to two decimal places).
We now have done the same benchmark 4 times so we can draw charts showing how each of the recognizers has improved over the last 1 year and 9 months. (Note for Google the latest result is from latest-long model, other Google results are from video enhanced.)
You can clearly see that Voicegain and Amazon started quite a bit behind Google and Microsoft but have since caught up.
Google seems to have the longest development cycles, with very little improvement from September 2021 until very recently. Microsoft, on the other hand, releases an improved recognizer every 6 months. Our improved releases are even more frequent than that.
As you can see the field is very close and you get different results on different files (the average and median do not paint the whole picture). As always, we invite you to review our apps, sign-up and test our accuracy with your data.
When you have to select speech recognition/ASR software, there are other factors beyond out-of-the-box recognition accuracy. These factors are, for example:
Today, we are really excited to announce the launch of Voicegain Transcribe, an AI based transcription assistant for both in-person and web meetings. With Transcribe, users can focus on their meetings and leave the note taking to us.
Transcribe can also be used to convert streaming and recorded audio from video events, webinars, podcasts and lectures into text.
Voicegain Transcribe is an app accessible from Chrome or Edge Browser and is powered by Voicegain's highly accurate speech recognition platform. Our out-of-the-box accuracy of 89% is on par with the very best.
Currently there are 3 main ways you can use Voicegain Transcribe:
If you join meetings directly from your Chrome or Edge browser (without any downloads or plug-ins), then you can use this feature to send audio to Voicegain. Examples of meeting platforms include Google Meet, BlueJeans, Webex and Zoom.
On a Windows device, browser sharing also works with client desktop apps like Zoom and Microsoft Teams. On a Mac/Apple device, browser sharing does not support desktop apps.
Voicegain offers a downloadable Windows client app that is installed on the user's computer. This app accesses Zoom Local Recordings and automatically uploads them for transcription to Voicegain Transcribe.
Zoom has two types of recordings - Local Recordings and Cloud Recordings. This app is for Local Recordings - where the recording is stored on the hard disk of the user's computer. To learn more about Zoom local recording click here.
Zoom also allows a separate audio file for each participant's recording. The Voicegain app supports uploading these individual participant audio files so that speaker labels are accurately assigned in the transcript.
Users may also upload pre-recorded audio files of their meetings, podcasts, and calls and generate the transcript. We support over 40 different formats, including mp3, mp4, wav, aac and ogg. Voicegain supports speaker diarization, so we can separate speakers even on a single-channel audio recording.
Currently we support English and Spanish. More languages are on our roadmap: German, Portuguese, and Hindi.
Users can organize their meeting recordings and audio files into different projects. A project is like a workspace or a folder.
Users can save the voice signatures of meeting participants so that speaker labels can be accurately assigned.
Voicegain can also extract meeting action items and positive and negative sentiment.
Users can also mask - in both text and audio - any personally identifiable information.
We are adding a feature where Voicegain Transcribe can join any meeting by having the user just enter the meeting url and inviting Voicegain Transcribe.
We are also adding a Chrome extension that will make it much easier to record and transcribe web meetings.
By signing up today, you will be on our forever Free Plan, which makes you eligible for 120 minutes of free Meeting Transcription every month. Once you are satisfied with our accuracy and user experience, you can easily upgrade to paid plans.
If you have any questions, please email us at support@voicegain.ai
[Updated: 5/27/2022]
In addition to the current support for English, Spanish, Hindi, and German languages in its Speech-to-Text platform, Voicegain is releasing support for many new languages over the next couple of months.
You can access these languages right now from the Web Console, via our Transcribe App, or via the API.
Upon request, we will make these languages available for your testing; generally, they can be available within hours of receiving a request. Please contact us at support@voicegain.ai.
The Alpha early access models differ from full-featured production models in the following ways:
As alpha models are being trained on additional data, their accuracy will improve. We are also working on punctuation, capitalization, and formatting of each of those models.
We will update this post as soon as these languages are available in the Alpha early access program.
Since our language models are created exclusively with End-to-End Deep Learning, we can perform transfer learning from one language to another, and quickly support new languages and dialects to better meet your use case. Don’t see your language listed below? Contact us at support@voicegain.ai, as new languages and dialects are released frequently.
This is a Case Study of training the acoustic model of Deep learning based Speech-to-Text/ASR engine for a Voice Bot that could take orders for Indian Food.
The client approached Voicegain as they experienced very low accuracy of speech recognition for a specific telephony based voice bot for food ordering.
The voice bot had to recognize Indian food dishes with acceptable accuracy, so that the dialog could be conducted in a natural conversational manner rather than having to fall back to rigid call flows, e.g., enumerating through a list.
The spoken responses would be provided by speakers of South Asian Indian origin. This meant that, in addition to having to recognize unique names, the accent would be a problem too.
The out-of-the-box accuracy of Voicegain and other prominent ASR engines was considered too low. Our accuracy was particularly low because our training datasets did not include any examples of Indian dish names spoken with heavy Indian accents.
With the use of Hints, the results improved significantly and we achieved accuracy of over 30%. However, 30% was still far from good enough.
Voicegain first collected relevant training data (audio and transcripts) and trained the acoustic model of our deep learning based ASR. We have had good success with it in the past, in particular with our latest DNN architecture, see e.g. post about recognition of UK postcodes.
We used a third-party data generation service to initially collect over 11,000 samples of Indian food utterances - 75 utterances per participant. The quality varied widely, but that is good because we think it reflected well the quality of audio that would be encountered in a real application. Later we collected an additional 4,600 samples.
We trained two models:
For each, we first trained on the 10k set, collected the benchmark results, and then trained on the additional 5k of data.
We randomly selected 12 sets of 75 utterances (894 total, after some bad recordings were removed) as the benchmark set and used the remaining 10k+ for training. We plan to share a link to the test data set here in a few days.
We compared our accuracy against Google and Amazon AWS, both before and after training; the results are presented in the chart below. The accuracy presented here is the accuracy of recognizing the whole dish name correctly: if one word of several in a dish name was mis-recognized, it was counted as a failure to recognize the dish name. We applied the same methodology if one extra word was recognized, except for additional words that can easily be ignored, e.g., "a", "the", etc. We also allowed for reasonable variances in spelling that would not introduce ambiguity; e.g., "biriyani" was considered a match for "biryani".
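The dish-level scoring described above can be sketched as follows. The ignorable-word list and spelling-variant map here are illustrative, not the exact lists used in the benchmark:

```python
# Filler words that may be ignored when comparing hypothesis to reference
IGNORABLE = {"a", "an", "the"}
# Illustrative spelling variants treated as equivalent (the real list was larger)
VARIANTS = {"biriyani": "biryani"}

def normalize(name: str) -> list:
    """Lowercase, map allowed spelling variants, drop ignorable words."""
    words = [VARIANTS.get(w, w) for w in name.lower().split()]
    return [w for w in words if w not in IGNORABLE]

def dish_name_correct(reference: str, hypothesis: str) -> bool:
    """A dish name counts as recognized only if every word matches after
    normalization; any other extra, missing, or wrong word is a failure."""
    return normalize(reference) == normalize(hypothesis)

def dish_accuracy(pairs: list) -> float:
    """Percent of (reference, hypothesis) pairs recognized correctly."""
    correct = sum(dish_name_correct(r, h) for r, h in pairs)
    return 100.0 * correct / len(pairs)
```

Under this scoring, "the chicken biriyani" matches the reference "chicken biryani", while a single mis-recognized word anywhere in the name fails the whole dish.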
Note that the tests on the Voicegain recognizer were run with various audio encodings:
Also, the AWS test was done in offline mode (which generally delivers better accuracy), while Google and Voicegain tests were done in streaming (real-time) mode.
We did a similar set of tests with the use of hints (we did not include AWS because our test script did not support AWS hints at that time).
This shows that huge benefits can be achieved by targeted model training for speech recognition. For this domain, which was new to our model, we increased accuracy by over 75 percentage points (10.18% to 86.24%) as a result of training.
As you can see, after training we exceeded Google's Speech-to-Text accuracy by over 45 percentage points (86.24% vs 40.38%) when no hints were used. With the use of hints, we were better than Google STT by about 26 percentage points (87.58% vs 61.30%).
We examined cases where mistakes were still made and they fell into 3 broad categories:
We think the first type of problem can be overcome by training on additional data, which is what we are planning to do, hoping to eventually get accuracy close to 85% (for L16 16 kHz audio). The second type could potentially be resolved by post-processing in the application logic if we return the dB values of the recognized words.
If your speech application also suffers from low accuracy, and using hints or text-based language models is not working well enough, then acoustic model training could be the answer. Send us an email at info@voicegain.ai and we can discuss a project to show how a Voicegain-trained model can achieve the best accuracy on your domain.
Interested in customizing the ASR or deploying Voicegain on your infrastructure?