This article is for companies building Voice AI Apps targeting the Contact Center. It outlines the key technical features, beyond accuracy, that matter when evaluating an OEM Speech-to-Text (STT) API. Most analyses focus on accuracy and on benchmarks of word error rate (WER). While accuracy is very important, other technical features are equally important for contact center AI apps.
There are multiple use cases for Voice AI Apps in the Contact Center. Some of the common ones are 1) AI Voicebot or Voice Agent, 2) Real-time Agent Assist, and 3) Post-Call Speech Analytics.
This article is focused on the third use case, Post-Call Speech Analytics. This use case relies on batch STT APIs, while the first two require streaming transcription. A Speech Analytics App supports the Quality Assurance and Agent Performance management process. This article is intended for Product Managers and Engineering leads involved in building AI Voice Apps that target the QA, Coaching and Agent Performance management process in the call center. Companies building such apps could include 1) CCaaS Vendors adding AI features, 2) Enterprise IT or Call Center BPO Digital organizations building an in-house Speech Analytics App, and 3) Call Center Voice AI Startups.
Very often, call-center audio recordings are only available in mono. And even if the audio recording is in 2-channel/stereo, it could include multiple voices in a single channel. For example, the Agent channel can include IVR prompts and hold music recordings in addition to the Agent voice. Hence a very important criterion for an OEM Speech-to-Text vendor is that they provide accurate speaker diarization.
We would recommend doing a test of various speech-to-text vendors with a good sample set of mono audio files. Select files that are going to be used in production and calculate the Diarization Error Rate. Here is a useful link that outlines the technical aspects of understanding and measuring speaker diarization.
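As a rough illustration of what such a test measures, here is a minimal sketch of a frame-based Diarization Error Rate. It assumes the hypothesis speaker labels have already been mapped to the reference labels, that there is no overlapping speech, and that no forgiveness collar is applied; established scoring tools handle all three. The names `Segment` and `diarization_error_rate` are illustrative and are not part of any vendor API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str


def speaker_at(segments, t):
    """Return the speaker active at time t, or None for non-speech."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.speaker
    return None


def diarization_error_rate(reference, hypothesis, step=0.01):
    """Frame-based DER = (missed + false alarm + confusion) / reference speech.

    Assumption: hypothesis labels are already mapped to reference labels.
    """
    end_time = max(seg.end for seg in reference + hypothesis)
    n_frames = int(round(end_time / step))
    missed = false_alarm = confusion = ref_speech = 0
    for i in range(n_frames):
        t = (i + 0.5) * step          # sample the middle of each frame
        ref = speaker_at(reference, t)
        hyp = speaker_at(hypothesis, t)
        if ref is not None:
            ref_speech += 1
            if hyp is None:
                missed += 1           # speech the system did not detect
            elif hyp != ref:
                confusion += 1        # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1          # non-speech labeled as speech
    if ref_speech == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / ref_speech


reference = [Segment(0.0, 10.0, "agent"), Segment(10.0, 20.0, "caller")]
hypothesis = [Segment(0.0, 12.0, "agent"), Segment(12.0, 20.0, "caller")]
der = diarization_error_rate(reference, hypothesis)
```

In this example the system attributes two seconds of caller speech to the agent, so 2 of the 20 seconds of reference speech are counted as speaker confusion.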
A very common requirement of Voice AI Apps is to redact PII (Personally Identifiable Information). PII Redaction is a post-processing step that a Speech-to-Text API vendor needs to perform. It involves accurate identification of entities like name, email address, phone number and mailing address, and their subsequent redaction in both text and audio. In addition, there are PCI (Payment Card Industry) specific named entities like Credit Card number, 3-digit PIN and expiry date. Successful PII and PCI redaction requires post-processing algorithms that accurately identify a wide range of PII entities across a wide range of test scenarios, including scenarios where there are errors in user input and errors in speech recognition.
There is another important capability related to PCI/PII redaction. Very often PII/PCI entities are present across multiple turns in a conversation between an Agent and Caller. It is important that the post-processing algorithm of the OEM Speech-to-Text vendor is able to process both channels simultaneously when looking for these named entities.
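To make the multi-turn problem concrete, here is a small sketch of text-only redaction over a time-ordered list of turns merged from both channels. It is a simplified heuristic under stated assumptions, not how any particular vendor implements redaction: the regular expressions, the `[CARD]` and `[EMAIL]` placeholders, and the function names are all illustrative, and spoken-out digits ("four one one one") or audio redaction are not handled.

```python
import re

# Illustrative patterns only; production redaction needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")   # 13 to 16 digits
DIGIT_GROUP_RE = re.compile(r"\d[\d -]*\d|\d")


def redact_text(text):
    """Redact emails and card-like digit runs inside a single turn."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub("[CARD]", text)


def redact_turns(turns, min_digits=13, max_digits=16):
    """Redact time-ordered (speaker, text) turns merged from both channels.

    Digits spoken by one speaker over several turns are treated as one run;
    digit-free turns from the other speaker (e.g. "okay") do not end the run.
    """
    out = [redact_text(text) for _, text in turns]
    run, total, run_speaker = [], 0, None

    def flush():
        # Redact the whole run if its digits add up to a card-length number.
        if min_digits <= total <= max_digits:
            for i in run:
                out[i] = DIGIT_GROUP_RE.sub("[CARD]", out[i])

    for i, (speaker, _) in enumerate(turns):
        n = sum(ch.isdigit() for ch in out[i])
        if n and run_speaker in (None, speaker):
            run.append(i)
            total += n
            run_speaker = speaker
        elif n == 0 and speaker != run_speaker:
            continue  # the other party's filler does not end the run
        else:
            flush()
            run, total, run_speaker = ([i], n, speaker) if n else ([], 0, None)
    flush()
    return out


turns = [
    ("agent", "Can I have your card number?"),
    ("caller", "4111 1111"),
    ("agent", "okay"),
    ("caller", "1111 1111"),
    ("agent", "Thank you"),
]
redacted = redact_turns(turns)
```

Neither caller turn looks like a card number on its own; only by following the conversation across both channels does the 16-digit run become visible.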
A Call Center audio recording could start off in one language and then switch to another. The Speech-to-Text API should be able to detect the language and then perform the transcription.
There will always be words that are not accurately transcribed by even the most accurate Speech-to-Text model. The API should include support for Hints or Keyword Boosting where words that are consistently misrecognized can get replaced by the correctly transcribed word. This is especially applicable for names of companies, products and industry specific terminology.
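The replacement behaviour described above can be approximated on the client side. The sketch below assumes a hand-maintained map from misrecognized phrases to their correct form; it is not the API of any Speech-to-Text vendor, and `apply_corrections` and the sample entries are hypothetical.

```python
import re


def apply_corrections(transcript, corrections):
    """Replace known misrecognitions (case-insensitive, whole phrases only)."""
    # Longest phrases first, so multi-word entries win over shorter ones.
    for wrong in sorted(corrections, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function is used as the replacement so the text is inserted literally.
        transcript = pattern.sub(lambda m, right=corrections[wrong]: right, transcript)
    return transcript


corrections = {"voice gain": "Voicegain", "sea pass": "CPaaS"}
fixed = apply_corrections("We use voice gain with a sea pass provider.", corrections)
```

A vendor-side Hints or Keyword Boosting feature is usually preferable where available, because it can influence recognition itself rather than patching the text afterwards.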
There are AI models that measure sentiment and emotion, and these models can be incorporated in the post-processing stage of transcription to enhance the Speech-to-Text API. Sentiment is extracted from the text of the transcript while Emotion is computed from the tone of the audio. A well-designed API should return Sentiment and Emotion throughout the interaction between the Agent and Caller. It should effectively compute the overall sentiment of the call by weighting the “ending sentiment” appropriately.
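One simple way to weight the ending sentiment is a weighted mean in which the last part of the call counts more. The sketch below assumes per-segment sentiment scores in the range -1 to 1; the `ending_fraction` and `ending_weight` values are arbitrary illustrations, not values used by any specific API.

```python
def overall_sentiment(scores, ending_fraction=0.25, ending_weight=3.0):
    """Weighted mean of per-segment scores; the final segments count more."""
    if not scores:
        raise ValueError("no sentiment scores")
    n = len(scores)
    n_ending = max(1, round(n * ending_fraction))   # at least the last segment
    weights = [1.0] * (n - n_ending) + [ending_weight] * n_ending
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# A call that starts badly but ends well.
scores = [-1.0, -1.0, -1.0, 1.0]
plain_mean = sum(scores) / len(scores)
weighted = overall_sentiment(scores)
```

Here the plain mean is -0.5, while the weighted score is 0.0, which reflects that the caller left the conversation in a better mood than they started.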
While measuring the quality of an Agent-Caller conversation, there are a few important audio-related metrics that are tracked in a call center. These include Talk-Listen Ratios, overtalk incidents and excessive silence and hold.
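These metrics can be derived from the speech segment timestamps returned with a transcript. The following sketch assumes each channel is given as a list of (start, end) times in seconds with no overlaps within a channel; the function names and the returned keys are illustrative. Hold time is not separated from silence here.

```python
def total_duration(segments):
    """Sum of segment lengths in seconds."""
    return sum(end - start for start, end in segments)


def overlap_duration(a, b):
    """Total time during which a segment of a overlaps a segment of b."""
    return sum(
        max(0.0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def call_metrics(agent, caller, call_length):
    """Compute talk-listen ratio, overtalk and silence for one call."""
    agent_talk = total_duration(agent)
    caller_talk = total_duration(caller)
    overtalk = overlap_duration(agent, caller)
    # Time when at least one party speaks is the union of both channels.
    silence = call_length - (agent_talk + caller_talk - overtalk)
    ratio = agent_talk / caller_talk if caller_talk else float("inf")
    return {
        "talk_listen_ratio": ratio,
        "overtalk_seconds": overtalk,
        "silence_seconds": silence,
    }


agent = [(0.0, 10.0), (20.0, 30.0)]
caller = [(8.0, 18.0)]
metrics = call_metrics(agent, caller, call_length=35.0)
```

In this example the agent talks for 20 seconds and the caller for 10, they overlap for 2 seconds, and 7 of the 35 seconds contain no speech at all.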
There are other LLM-powered features, like computation of the QA Score and the summary of the conversation. However, these features are built by the developer of the AI Voice App by integrating the output of the Speech-to-Text API with the APIs offered by the LLM of the developer’s choice.
Voicegain Telephony Bot API allows developers to use Voicegain Speech-to-Text to build Voice Bots or programmable speech IVR using a simple callback API. With the latest Voicegain Platform release, 1.21.0, it is now possible to establish SIP sessions to Voicegain Telephony Bot API using a simple SIP Invite.
Before release 1.21.0, the only way for voice app developers to use the Voicegain Telephony Bot API was to call the application using phone numbers that were purchased from Voicegain (via the Web Console). However, we have always wanted to allow clients to bring their own carrier or CPaaS platform and this release allows developers to do just that.
At Voicegain our focus is on offering our ASR/Speech Recognition functionality and our full-featured Speech-to-Text APIs. We understand that developers rely on their CPaaS platforms for a whole host of important features: messaging, emails, conferencing and international coverage. Now, it is possible to integrate Voicegain Telephony Bot API with any CPaaS that supports SIP Invite. You can combine the powerful and affordable Speech Recognition features of the Voicegain Platform with the comprehensive API features of these CPaaS platforms.
We have already tested SIP Invite extensively on Twilio, SignalWire, and Telnyx platforms. Other similar platforms should also work without issues. We will report any additional platforms that we have explicitly tested in the future.
On the Twilio and SignalWire platforms it is trivial to establish a SIP session to Voicegain. The only thing needed is the <Dial><Sip> command from TwiML or LaML, for example:
Some notes about the above example:
On our GitHub you can find sample code showing how to dial an outbound call and then bridge it to Voicegain SIP:
On Telnyx we tested SIP INVITE using the Telnyx Call Control API. The only functional difference from Twilio and SignalWire is that on Telnyx you cannot choose TCP as the SIP transport (only UDP is supported).
Here is sample Python code showing how to dial Voicegain SIP:
The complete code for an AWS Lambda function that dials a number using Telnyx and then bridges it to Voicegain SIP is available here: platform/telnyx-dial-outbound-lambda.py at master · voicegain/platform (github.com)
Our Telephony Bot API is a callback API, similar in fashion to TwiML or LaML. The main difference is that it is based on JSON and our functionality is focused on Speech Recognition. You can read more about it in our blog post announcing the release of that API back in August.
On our GitHub you can find an example of a Node.js function on AWS Lambda that demonstrates how to interface Voicegain Telephony Bot API with a RASA NLU bot: platform/examples/voicebot-lambda-vg-rasa at master · voicegain/platform (github.com)
You can also check out our sample python function code on AWS Lambda which shows how to implement more traditional (VoiceXML like) IVRs with the use of Speech grammars on top of our Telephony Bot API: platform/declarative-ivr at master · voicegain/platform (github.com)
Here are all the steps needed to sign up for a developer account on the Voicegain Platform. Once you have the account you can access the Web Console, and you can find all the info on how to use the Web Console and the APIs on our Zendesk Knowledge Base.
1. Start at console.voicegain.ai/signup
2. Enter your name and email. If you wish you can check the Terms of Service and/or Privacy Policy.
3. On the next page let us know how you learned about Voicegain and how you want to use Voicegain, and accept the Terms of Service.
5. After you click Next, Voicegain will send you an email with the link to the next step. If you do not get the email, please check your Junk Mail folder, and if it is not there, please follow the instructions on the page shown below.
6. Once you get the email, click on the Set Password button.
7. You will be directed to a web page where you can set your Voicegain password.
8. After you click (Re)set Password you will be directed to the login page where you can enter your login credentials.
9. On the next page click the right arrow icon next to "Cloud Web Console"
10. This will take you to the home page of the Voicegain Web Console. You can follow the mini tutorial that is available on the home page.
11. Help articles are available under the question mark (?) menu. There also you will find our helpdesk support link. Note, some of the support articles are available only to logged in users while others are public.
You can now test the accuracy of both our realtime and offline speech-to-text by visiting our demo page.
Read out paragraphs of your favorite book, give a speech that inspires, mimic your favorite actor or just play a podcast or YouTube video!
If you are noticing delays in real-time transcription results, they are likely because of resource issues on your computer.
Simply click on the microphone icon to get started. You can either speak or stream audio into your microphone from your browser for a full minute.
You can also play back the audio to make sure that it was indeed streamed to us accurately.
Click on the upload recording icon to get started. You can upload a mono or stereo recorded file (WAV or FLAC) that is up to 15MB in size. If you need to transcribe a larger file, please sign up for a free account.
Drop us an email (support@voicegain.ai) if you have any comments.
[UPDATE - October 31st, 2021: Current benchmark results from end October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced.]
It has been over 8 months since we published our last speech recognition accuracy benchmark (described here). Back then the results were as follows (from most accurate to least): Microsoft and Google Enhanced (close 2nd), then Voicegain and Amazon (also close 4th) and then, far behind, Google Standard.
We have repeated the test using the same methodology as before: take 44 files from the Jason Kincaid data set and 20 files published by rev.ai and remove all files where the best recognizer could not achieve a Word Error Rate (WER) lower than 20%. Last time we removed 10 files, but this time as the recognizers improved only 8 files had their WER higher than 20%.
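For readers who want to reproduce this kind of filtering on their own data, here is a minimal sketch of a word-level WER calculation and of the 20% cut-off described above. It assumes transcripts are already normalized (punctuation removed, numbers written consistently); real benchmarks need that normalization step. The function names and the sample file names and values are illustrative and are not taken from our benchmark.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def keep_usable_files(wer_by_file, threshold=0.20):
    """Keep files on which the best recognizer achieved a WER below threshold."""
    return [
        name for name, wers in wer_by_file.items()
        if min(wers.values()) < threshold
    ]


wer = word_error_rate("the quick brown fox", "the quick brown box")
kept = keep_usable_files({
    "a.wav": {"recognizer_x": 0.10, "recognizer_y": 0.30},
    "b.wav": {"recognizer_x": 0.25, "recognizer_y": 0.40},
})
```

One substituted word out of four gives a WER of 25%, and only the file on which at least one recognizer stays below 20% is kept.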
The files removed fall into 3 categories:
Some of our customers told us that they previously used IBM Watson, so we decided to also add it to the test.
In the new test, as you can see in the results chart above, the order has changed: Amazon has leap-frogged everyone by improving its median accuracy by over 3%, to a median WER of just 10.02%, and is now in pole position. Microsoft, Google Enhanced and Google Standard performed at approximately the same level as before. The Voicegain recognizer improved by about 2%. The newly tested IBM Watson is better than Google Standard, but lags the other recognizers.
The new results put the Voicegain recognizer very close to Google Enhanced:
However, the results for a given use case depend on the specific audio: for some files Voicegain will perform slightly better and for others Google may perform marginally better. As always, we invite you to review our apps, sign up and test our accuracy with your data.
We have looked at both the Mozilla DeepSpeech and Kaldi projects. We ran our complete benchmark on Mozilla DeepSpeech and found that it significantly trails behind Google Standard recognizer. Out of 64 audio files, Mozilla was better than Google Standard on only 5 files and tied on 1. It was worse on the remaining 58 files. Median WER was 15.63% worse for Mozilla compared to Google Standard. The lowest WER of 9.66% for Mozilla DeepSpeech was on audio from Librivox "The Art of War by Sun Tzu". For comparison, Voicegain achieves 3.45% WER on that file.
Regarding Kaldi we have not benchmarked it yet, but from the research published online it looks like Kaldi trails Google Standard too, at least when used with its standard ASpIRE and LibriSpeech models.
When you have to select speech recognition/ASR software, there are other factors beyond out-of-the-box recognition accuracy. These factors are, for example:
1. Click here for instructions to access our live demo site.
2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.
3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.
We are pleased to announce the availability of German speech recognition on the Voicegain Platform. It is the third language that Voicegain supports, after English and Spanish.
The recognition accuracy of the German model depends on the type of speech audio. In general, we are only a few percent behind the accuracy offered by the Speech-to-Text engines of Amazon or Google. The advantages of our recognizer are its significantly lower price and the ability to train custom acoustic models. Custom models can achieve higher accuracy than Amazon or Google. We recommend that you use our Web Console and/or API to test the actual performance on your own data.
Of course, the Voicegain Platform also offers other advantages, such as support for Edge deployment (on-prem) and an extensive API with many options for out-of-the-box integration into, for example, telephony environments.
Currently, our Speech-to-Text API is fully functional with the German model. Some of the Speech Analytics API features are not yet available for German, e.g. Named Entity Recognition or Sentiment/Mood Detection.
The German model is initially available only in the version that supports offline transcription. The real-time version of the model will be available in the near future.
To tell the API that you want to use the German acoustic model, you only need to select it in the context settings. German models have 'de' in their name, e.g. VoiceGain-ol-de: 1
If you would like to use German speech recognition, please send us an email at support@voicegain.ai. We will enable it for your account. If your application requires a real-time model, please let us know as well.
We are pleased to announce availability of German Speech-to-Text on the Voicegain Platform. It is the third language that Voicegain supports after English and Spanish.
The recognition accuracy of the German model depends on the type of speech audio. Generally, we are just a few percent behind the accuracy offered by the Speech-to-Text engines of the larger players (Amazon, Google, etc.). The advantages of our recognizer are its affordability, the ability to train customized acoustic models, and the option to deploy it in your datacenter or VPC. Custom models can have accuracy higher than that of Amazon or Google. We also offer extensive support for integrating with telephony.
We encourage you to sign up for a developer account and use our Web Console and/or our APIs to test the real-life performance on your own data.
Currently, our Speech-to-Text API supports the German Model for off-line transcription only. A Real-time/Streaming version of the Model will be available in the near future.
To use the German Acoustic Model in Voicegain Web Console, select "de" under Languages in the Speech Recognition settings.
Interested in customizing the ASR or deploying Voicegain on your infrastructure?