Our Blog

News, Insights, sample code & more!

ASR
Going beyond Accuracy: Key STT API Features for Contact Center Voice AI Apps

This article is for companies building Voice AI Apps targeting the Contact Center. It outlines the key technical features, beyond accuracy, that are important when evaluating an OEM Speech-to-Text (STT) API. Most analyses focus on accuracy and metrics like word error rate (WER) benchmarks. While accuracy is very important, there are other technical features that are equally important for Contact Center AI apps.

Introduction

There are multiple use cases for Voice AI Apps in the Contact Center. Some of the common use cases are 1) AI Voicebot or Voice Agent, 2) Real-time Agent Assist, and 3) Post-Call Speech Analytics.

This article focuses on the third use case, Post-Call Speech Analytics. This use case relies on batch STT APIs, while the first two require streaming transcription. A Speech Analytics App supports the Quality Assurance and Agent-Performance management process. This article is intended for Product Managers and Engineering leads building AI Voice Apps that target the QA, Coaching, and Agent Performance management process in the call center. Companies building such apps include 1) CCaaS Vendors adding AI features, 2) Enterprise IT or Call Center BPO Digital organizations building an in-house Speech Analytics App, and 3) Call Center Voice AI Startups.

1. Accurate Speaker Diarization

Very often, call-center audio recordings are only available in mono. And even if the audio recording is in 2-channel/stereo, it could include multiple voices in a single channel. For example, the Agent channel can include IVR prompts and hold music recordings in addition to the Agent voice. Hence a very important criterion for an OEM Speech-to-Text vendor is that they provide accurate speaker diarization.

We recommend testing various Speech-to-Text vendors with a good sample set of mono audio files. Select files that are representative of what will be used in production and calculate the Diarization Error Rate (DER). Here is a useful link that outlines the technical aspects of understanding and measuring speaker diarization.
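If you want to run such a test yourself, a minimal sketch of the DER calculation using the open-source pyannote.metrics package could look like the following; the reference and hypothesis segments are made-up placeholders, and in practice they would come from your ground-truth labels and the vendor's diarization output.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Ground-truth ("reference") speaker segments for one mono test file.
reference = Annotation()
reference[Segment(0.0, 12.5)] = "agent"
reference[Segment(12.5, 30.0)] = "caller"
reference[Segment(30.0, 41.0)] = "agent"

# Speaker segments returned by the STT vendor ("hypothesis").
hypothesis = Annotation()
hypothesis[Segment(0.0, 13.2)] = "spk_0"
hypothesis[Segment(13.2, 29.1)] = "spk_1"
hypothesis[Segment(29.1, 41.0)] = "spk_0"

# DER = (missed speech + false alarms + speaker confusion) / total reference speech.
metric = DiarizationErrorRate()
print(f"DER: {metric(reference, hypothesis):.2%}")
```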

2. Accurate PII/Named Entity Redaction and PCI Compliance

A very common requirement of Voice AI Apps is to redact PII – Personally Identifiable Information. PII redaction is a post-processing step that a Speech-to-Text API vendor needs to perform. It involves accurately identifying entities like names, email addresses, phone numbers, and mailing addresses, and then redacting them in both text and audio. In addition, there are PCI (Payment Card Industry) specific named entities like credit card numbers, 3-digit PINs, and expiry dates. Successful PII and PCI redaction requires post-processing algorithms that accurately identify a wide range of PII entities and hold up across a wide range of test scenarios, including scenarios where there are errors in user input and errors in speech recognition.

There is another important capability related to PCI/PII redaction. Very often, PII/PCI entities are spread across multiple turns in a conversation between an Agent and a Caller. It is important that the post-processing algorithm of the OEM Speech-to-Text vendor can process both channels simultaneously when looking for these named entities.
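To illustrate why this matters, here is a simplified sketch (not the algorithm any particular vendor uses) that merges the Agent and Caller turns by timestamp before scanning for card numbers, so digits split across turns are still caught; the turn format and regex are assumptions made for the example.

```python
import re
from typing import Dict, List

# Hypothetical turn format: channel, start time in seconds, and transcript text.
turns: List[Dict] = [
    {"channel": "agent",  "start": 101.2, "text": "Can I get the card number please?"},
    {"channel": "caller", "start": 104.8, "text": "Sure, it is 4111 1111"},
    {"channel": "agent",  "start": 108.0, "text": "Go ahead"},
    {"channel": "caller", "start": 109.5, "text": "1111 1111"},
]

DIGIT_GROUP = re.compile(r"(?:\d[ -]?){4,}")  # runs of 4+ digits, possibly spaced

def redact_split_card_numbers(turns: List[Dict]) -> List[Dict]:
    """Merge both channels by time, then check whether the digit runs found in
    nearby turns together form a 13-19 digit card number; if so, redact each run."""
    ordered = sorted(turns, key=lambda t: t["start"])
    hits = [(i, m) for i, t in enumerate(ordered) for m in DIGIT_GROUP.finditer(t["text"])]
    digits = "".join(re.sub(r"\D", "", m.group()) for _, m in hits)
    if 13 <= len(digits) <= 19:
        # Replace from the last match to the first so earlier offsets stay valid.
        for i, m in reversed(hits):
            t = ordered[i]
            t["text"] = t["text"][:m.start()] + "[REDACTED]" + t["text"][m.end():]
    return ordered

for t in redact_split_card_numbers(turns):
    print(f'{t["channel"]:>6}: {t["text"]}')
```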

3. Language Detection

A Call Center audio recording could start off in one language and then switch to another. The Speech-to-Text API should be able to detect the language and then perform the transcription.

4. Hints/Keyword Boosting

There will always be words that are not accurately transcribed even by the most accurate Speech-to-Text model. The API should support Hints or Keyword Boosting, where words that are consistently misrecognized get replaced by the correctly transcribed word. This is especially applicable to names of companies, products, and industry-specific terminology.
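As a simple illustration of the replacement idea (real hint/boosting support in an STT engine is applied during decoding and is more sophisticated than a text substitution), a post-processing pass might look like the sketch below; the misrecognition-to-correction pairs are invented for the example.

```python
import re

# Hypothetical corrections observed for a given customer: consistently
# misrecognized form -> the company / product term it should be.
HINTS = {
    "voice gain": "Voicegain",
    "sip invite": "SIP INVITE",
    "c cass": "CCaaS",
}

def apply_hints(transcript: str, hints: dict) -> str:
    """Replace known misrecognitions with the correct terms, case-insensitively."""
    for wrong, right in hints.items():
        transcript = re.sub(re.escape(wrong), right, transcript, flags=re.IGNORECASE)
    return transcript

print(apply_hints("we connected the voice gain engine over sip invite", HINTS))
# -> "we connected the Voicegain engine over SIP INVITE"
```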

5. Sentiment and Emotion

There are AI models that measure sentiment and emotion, and these models can be incorporated in the post-processing stage of transcription to enhance the Speech-to-Text API. Sentiment is extracted from the text of the transcript while Emotion is computed from the tone of the audio. A well-designed API should return Sentiment and Emotion throughout the interaction between the Agent and Caller. It should effectively compute the overall sentiment of the call by weighting the “ending sentiment” appropriately.
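The article does not prescribe a specific weighting scheme, but one simple way to give the "ending sentiment" more influence is a weighted average over per-segment sentiment scores, as in this sketch:

```python
from typing import List

def overall_sentiment(segment_scores: List[float], ending_weight: float = 3.0) -> float:
    """Combine per-segment sentiment scores (-1.0 .. +1.0) into one call-level score,
    giving later segments progressively more weight so the ending dominates."""
    n = len(segment_scores)
    if n == 0:
        return 0.0
    # Weight grows linearly from 1.0 at the start of the call to `ending_weight` at the end.
    weights = [1.0 + (ending_weight - 1.0) * i / max(n - 1, 1) for i in range(n)]
    return sum(w * s for w, s in zip(weights, segment_scores)) / sum(weights)

# A call that starts negative but is resolved positively scores mildly positive overall.
print(round(overall_sentiment([-0.6, -0.4, 0.1, 0.5, 0.8]), 2))
```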

6. Talk-Listen Ratios, Overtalk and Other Incidents

While measuring the quality of an Agent-Caller conversation, there are a few important audio-related metrics that are tracked in a call center. These include Talk-Listen Ratios, overtalk incidents, and excessive silence and hold.
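All of these can be derived from the per-channel speech segments (word or utterance timings) that a Speech-to-Text API returns. The sketch below shows the basic computation; the segment data is a made-up example.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of continuous speech

def total_speech(segments: List[Segment]) -> float:
    return sum(end - start for start, end in segments)

def overlap(a: List[Segment], b: List[Segment]) -> float:
    """Total time both parties speak at once (overtalk)."""
    return sum(
        max(0.0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )

# Made-up speech segments for a 60-second call.
agent = [(0.0, 8.0), (20.0, 35.0), (50.0, 58.0)]
caller = [(8.5, 19.0), (33.0, 48.0)]
call_duration = 60.0

agent_talk, caller_talk = total_speech(agent), total_speech(caller)
overtalk = overlap(agent, caller)
silence = call_duration - (agent_talk + caller_talk - overtalk)

print(f"Talk-listen ratio (agent:caller): {agent_talk / caller_talk:.2f}")
print(f"Overtalk: {overtalk:.1f}s, silence/hold: {silence:.1f}s")
```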

7. Other Optional LLM-Powered Features

There are other LLM-powered features like computation of the QA Score and a summary of the conversation. However, these features are built by the developer of the AI Voice App by integrating the output of the Speech-to-Text API with the APIs offered by the LLM of the developer's choice.

Read more → 
CPaaS
SIP INVITE Voicegain from Twilio, SignalWire, Telnyx CPaaS

The Voicegain Telephony Bot API allows developers to use Voicegain Speech-to-Text to build Voice Bots or programmable speech IVRs using a simple callback API. With the latest Voicegain Platform release 1.21.0, it is now possible to establish SIP sessions to the Voicegain Telephony Bot API using a simple SIP INVITE.


Before release 1.21.0, the only way for voice app developers to use the Voicegain Telephony Bot API was to call the application using phone numbers purchased from Voicegain (via the Web Console). However, we have always wanted to allow clients to bring their own carrier or CPaaS platform, and this release allows developers to do just that.


At Voicegain our focus is on offering our ASR/Speech Recognition functionality and our full-featured Speech-to-Text APIs. We understand developers rely on their CPaaS platforms for a whole host of important features - messaging, emails, conferencing, and international coverage. Now it is possible to integrate the Voicegain Telephony Bot API with any CPaaS that supports SIP INVITE. You can combine the powerful and affordable Speech Recognition features of the Voicegain Platform with the comprehensive API features of these CPaaS platforms.


We have already tested SIP Invite extensively on Twilio, SignalWire, and Telnyx platforms. Other similar platforms should also work without issues. We will report any additional platforms that we have explicitly tested in the future.


How SIP INVITE works with Twilio & SignalWire

On the Twilio and SignalWire platforms it is trivial to establish a SIP session to Voicegain. The only thing needed is the <Dial><Sip> verb from TwiML or LaML, for example:
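A minimal sketch using the Twilio Python helper library to generate the `<Dial><Sip>` TwiML; the SIP URI below is a placeholder - substitute the identifier assigned to your Telephony Bot Application and the Voicegain SIP host shown in the Web Console.

```python
from twilio.twiml.voice_response import VoiceResponse, Dial

# Placeholder SIP URI: the user part is the unique identifier assigned to your
# Telephony Bot Application, and the host is the Voicegain SIP endpoint shown
# in the Web Console.
VOICEGAIN_SIP_URI = "sip:<your-bot-identifier>@<voicegain-sip-host>"

response = VoiceResponse()
dial = Dial()
dial.sip(VOICEGAIN_SIP_URI)
response.append(dial)

print(response)  # emits <Response><Dial><Sip>sip:...</Sip></Dial></Response>
```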



Some notes about the above example:

  • The SIP URI user name is a unique random identifier assigned on Voicegain Platform to each Telephony Bot Application.
  • After the SIP connection gets established, the application prompts and speech recognition will be under the control of the Voicegain Platform, based on commands passed using our Telephony Bot API.
  • Once the Voicegain `disconnect` command is issued, control of the application flow is returned to the host platform (i.e. Twilio, SignalWire, or any other CPaaS platform).
  • It is possible to pass custom headers to Voicegain during SIP Invite - this way it is possible to associate host sessions with Voicegain sessions.
  • It is possible to make multiple <Dial><Sip> requests to Voicegain from host application during a single host session.

On our GitHub you can find sample code showing how to dial an outbound call and then bridge it to Voicegain SIP:

What about Telnyx?

On Telnyx we tested SIP INVITE using the Telnyx Call Control API. The only functional difference from Twilio and SignalWire is that on Telnyx you cannot choose TCP as the SIP transport (only UDP is supported).

Here is sample Python code showing how to dial Voicegain SIP:
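A minimal standalone sketch of the dial request against the Telnyx Call Control API is shown below; the API key, connection ID, caller ID number, and SIP URI are placeholders, and the complete, tested version is linked underneath.

```python
import requests

TELNYX_API_KEY = "<your-telnyx-api-key>"                              # placeholder
CONNECTION_ID = "<your-call-control-connection-id>"                   # placeholder
VOICEGAIN_SIP_URI = "sip:<your-bot-identifier>@<voicegain-sip-host>"  # placeholder

# Telnyx Call Control: create (dial) a new call leg to the Voicegain SIP endpoint.
resp = requests.post(
    "https://api.telnyx.com/v2/calls",
    headers={"Authorization": f"Bearer {TELNYX_API_KEY}"},
    json={
        "connection_id": CONNECTION_ID,
        "to": VOICEGAIN_SIP_URI,
        "from": "+15551234567",  # placeholder caller ID number
        # Note: per the article, only UDP is available as the SIP transport on Telnyx.
    },
)
resp.raise_for_status()
print(resp.json()["data"]["call_control_id"])
```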


The complete code for an AWS Lambda function that dials a number using Telnyx and then bridges it to Voicegain SIP is available here: platform/telnyx-dial-outbound-lambda.py at master · voicegain/platform (github.com)


What can I build with the Telephony Bot API?

Our Telephony Bot API is a callback API in a similar fashion to TwiML or LaML. The main difference is that it is based on JSON and our functionality is focused on Speech Recognition. You can read more about it in our blog post announcing the release of that API back in August.


On our GitHub you can find an example of a Node.js function on AWS Lambda that demonstrates how to interface the Voicegain Telephony Bot API with a RASA NLU bot: platform/examples/voicebot-lambda-vg-rasa at master · voicegain/platform (github.com)


You can also check out our sample Python function code on AWS Lambda, which shows how to implement more traditional (VoiceXML-like) IVRs with the use of Speech grammars on top of our Telephony Bot API: platform/declarative-ivr at master · voicegain/platform (github.com)

Read more → 
Developers
How to signup for a developer account and start using Voicegain

Here are all the steps needed to sign up for a developer account on the Voicegain Platform. Once you have the account, you can access the Web Console, and you can find all the info on how to use the Web Console and the APIs in our Zendesk Knowledge Base.

1. Start at console.voicegain.ai/signup  

2. Enter your name and email. If you wish, you can review the Terms of Service and/or Privacy Policy.

3. On the next page, let us know how you learned about Voicegain and how you want to use Voicegain, and accept the Terms of Service.


5. After you click Next, Voicegain will send you an email with a link to the next step. If you do not get the email, please check your Junk Mail folder, and if it is not there, please follow the instructions on the page shown below.



6. Once you get the email, click on the Set Password button.


7. You will be directed to a web page where you can set your Voicegain password.


8. After you click (Re)set Password you will be directed to the login page where you can enter your login credentials.



9. On the next page, click the right arrow icon next to "Cloud Web Console".


10. This will take you to the home page of the Voicegain Web Console. You can follow the mini tutorial that is available on the home page.


11. Help articles are available under the question mark (?) menu. There you will also find our helpdesk support link. Note that some of the support articles are available only to logged-in users while others are public.



Read more → 
Developers
Test Voicegain realtime Speech-to-Text from your browser

You can now test the accuracy of both our realtime and offline speech-to-text by visiting our demo page.

Read out paragraphs of your favorite book, give a speech that inspires, mimic your favorite actor or just play a podcast or YouTube video!

Health check for the demo

  1. We currently support the Chrome and Edge browsers only.
  2. Please ensure that your CPU utilization is not too high (<50%) and that your internet bandwidth is reasonable (10 Mbps in both directions).
  3. Ensure that your microphone is not being used by another program like Zoom, Teams, Skype or Webex.

If you are noticing delays in real-time transcription results, they are likely because of resource issues on your computer.

Real-time Transcription

Simply click on the microphone icon to get started. You can either speak or stream audio into your microphone from your browser for a full minute.

You can also play back the audio to make sure that it was indeed streamed to us accurately.

Offline Transcription

Click on the upload recording icon to get started. You can upload a mono or stereo recording - WAV or FLAC - that is up to 15MB in size. If you need to transcribe a larger file, please sign up for a free account.

Drop us an email (support@voicegain.ai) if you have any comments.

Read more → 
Benchmark
Speech-to-Text Accuracy Benchmark - June 2021

[UPDATE - October 31st, 2021:  Current benchmark results from end October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced.]

It has been over 8 months since we published our last speech recognition accuracy benchmark (described here). Back then the results were as follows (from most accurate to least): Microsoft, with Google Enhanced a close 2nd, then Voicegain, with Amazon a close 4th, and then, far behind, Google Standard.

Methodology

We repeated the test using the same methodology as before: take 44 files from the Jason Kincaid data set and 20 files published by rev.ai, and remove all files on which the best recognizer could not achieve a Word Error Rate (WER) lower than 20%. Last time we removed 10 files, but this time, as the recognizers have improved, only 8 files had a WER higher than 20%.

The files removed fall into 3 categories:

  • recordings of meetings - 3 files (3 out of 7 meeting recordings in the original set),
  • telephone conversations - 3 files (3 out of 11 phone conversations in the original set),
  • multi-presenter, very animated podcasts - 2 files (there were a lot of other podcasts in the set that did meet the cutoff).
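For reference, per-file WER and the 20% cutoff can be computed along these lines with the open-source jiwer package; the file name and transcripts below are placeholders, and in the actual benchmark the cutoff is applied based on the best recognizer's WER on each file.

```python
from jiwer import wer

# Placeholder data: reference transcript and one recognizer's hypothesis per file.
references = {"file_01": "thank you for calling how can i help you today"}
hypotheses = {"file_01": "thank you for calling how can help you today"}

# Per-file WER for this recognizer.
results = {name: wer(ref, hypotheses[name]) for name, ref in references.items()}

# Keep only files below the 20% WER cutoff (in the benchmark the cutoff is
# applied using the best recognizer's WER on each file).
kept = {name: w for name, w in results.items() if w < 0.20}
for name, w in sorted(kept.items(), key=lambda kv: kv[1]):
    print(f"{name}: WER = {w:.2%}")
```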

Some of our customers told us that they previously used IBM Watson, so we decided to add it to the test as well.

Results

In the new test, as you can see in the results chart above, the order has changed: Amazon has leap-frogged everyone, reducing its median WER by over 3% to just 10.02%, and is now in the pole position. Microsoft, Google Enhanced, and Google Standard performed at approximately the same level as before. The Voicegain recognizer improved by about 2%. The newly tested IBM Watson is better than Google Standard, but lags behind the other recognizers.

Voicegain is tied with Google Enhanced

The new results put the Voicegain recognizer very close to Google Enhanced:

  1. Average WER of Voicegain is just 0.66% behind Google, while median WER is just 0.63% behind. To put it in context, Voicegain makes one additional mistake every 155 words compared to Google Enhanced.
  2. Voicegain was actually marginally better than Google Enhanced on the min error, 1st quartile, 3rd quartile, and max error.
  3. Overall Voicegain was better on 20 files while Google was better on 36 files.

However, the results for a given use case depend on the specific audio - for some files Voicegain will perform slightly better and for some Google may perform marginally better. As always, we invite you to review our apps, sign up, and test our accuracy with your data.

What about Open Source recognizers?

We have looked at both the Mozilla DeepSpeech and Kaldi projects. We ran our complete benchmark on Mozilla DeepSpeech and found that it significantly trails behind Google Standard recognizer. Out of 64 audio files, Mozilla was better than Google Standard on only 5 files and tied on 1. It was worse on the remaining 58 files. Median WER was 15.63% worse for Mozilla compared to Google Standard. The lowest WER of 9.66% for Mozilla DeepSpeech was on audio from Librivox "The Art of War by Sun Tzu". For comparison, Voicegain achieves 3.45% WER on that file.

We have not yet benchmarked Kaldi, but from the research published online it looks like Kaldi trails Google Standard too, at least when used with its standard ASpIRE and LibriSpeech models.

Out-of-the-box accuracy is not everything

When you have to select speech recognition/ASR software, there are other factors beyond out-of-the-box recognition accuracy. These include, for example:

  • Ability to customize the Acoustic Model - the Voicegain model may be trained on your audio data - we have demonstrated improvements in accuracy of 7-10%. In fact, for one of our customers with adequate training data and good quality audio we were able to achieve a WER of 0.5% (99.5% accuracy).
  • Ease of integration - Many Speech-to-Text providers offer limited APIs, especially for developers building applications that require interfacing with telephony or on-premise contact center platforms.
  • Price - Voicegain is 60%-75% less expensive compared to other Speech-to-Text/ASR software providers while offering almost comparable accuracy. This makes it affordable to transcribe and analyze speech in large volumes.
  • Support for On-Premise/Edge Deployment - The cloud Speech-to-Text service providers offer limited support to deploy their speech-to-text software in client data-centers or on the private clouds of other providers. On the other hand, Voicegain can be installed on any Kubernetes cluster - whether managed by a large cloud provider or by the client.

Take Voicegain for a test drive!

1. Click here for instructions to access our live demo site.


2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.


3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.

Read more → 
Languages
Voicegain offers Automatic Speech Recognition in German

We are pleased to announce the availability of German speech recognition on the Voicegain Platform. It is the third language that Voicegain supports, after English and Spanish.

The speech recognition accuracy of the German model depends on the type of speech audio. In general, we are only a few percent behind the accuracy offered by the Speech-to-Text engines from Amazon or Google. The advantage of our recognizer is its significantly lower price, as well as the ability to train custom acoustic models. Custom models can achieve higher accuracy than Amazon or Google. We encourage you to use our Web Console and/or API to test the real-life performance on your own data.

Of course, the Voicegain Platform also offers other advantages, such as support for Edge (on-prem) deployment and an extensive API with many options for out-of-the-box integration with, for example, telephony environments.

Currently, our Speech-to-Text API is fully functional with the German model. Some of the Speech Analytics API features are not yet available for German, e.g., Named Entity Recognition or Sentiment/Mood Detection.

The German model is initially available only in the version that supports offline transcription. The real-time version of the model will be available in the near future.

To tell the API that you want to use the German acoustic model, you only need to select it in the Context settings. German models have 'de' in the name, e.g., VoiceGain-ol-de:1

If you would like to use German speech-to-text, please send us an email at support@voicegain.ai and we will enable it for your account. If your application requires a real-time model, please let us know as well.

Read more → 
Languages
Voicegain offers German Speech-to-Text

We are pleased to announce availability of German Speech-to-Text on the Voicegain Platform. It is the third language that Voicegain supports after English and Spanish.

The recognition accuracy of the German model depends on the type of speech audio. Generally, we are just a few percent behind the accuracy offered by the Speech-to-Text engines of the larger players (Amazon, Google, etc.). The advantages of our recognizer are its affordability, the ability to train customized acoustic models, and the option to deploy it in your datacenter or VPC. Custom models can have accuracy higher than that of Amazon or Google. We also offer extensive support for integrating with telephony.

We encourage you to sign up for a developer account and use our Web Console and/or our APIs to test the real-life performance on your own data.

Currently, our Speech-to-Text API supports the German Model for off-line transcription. A real-time/streaming version of the Model will be available in the near future.

To use the German Acoustic Model in Voicegain Web Console, select "de" under Languages in the Speech Recognition settings.

Read more → 
Sign up for an app today
* No credit card required.

Enterprise

Interested in customizing the ASR or deploying Voicegain on your infrastructure?

Contact Us → 
Voicegain - Speech-to-Text
Under Your Control