This article is for companies building Voice AI Apps targeting the Contact Center. It outlines the key technical features, beyond accuracy, that are important when evaluating an OEM Speech-to-Text (STT) API. Most analyses focus on accuracy and metrics such as word error rate (WER) benchmarks. While accuracy is very important, there are other technical features that are equally important for contact center AI apps.
There are multiple use-cases for Voice AI Apps in the Contact Center. Some of the common use cases are 1) AI Voicebot or Voice Agent, 2) Real-time Agent Assist, and 3) Post-Call Speech Analytics.
This article is focused on the third use-case, Post-Call Speech Analytics. This use-case relies on batch STT APIs, while the first two require streaming transcription. A Speech Analytics App supports the Quality Assurance and Agent-Performance management process. This article is intended for Product Managers and Engineering leads building such AI Voice Apps that target QA, Coaching and Agent Performance management in the call center. Companies building such apps could include 1) CCaaS Vendors adding AI features, 2) Enterprise IT or Call Center BPO Digital organizations building an in-house Speech Analytics App, and 3) Call Center Voice AI Startups.
Very often, call-center audio recordings are only available in mono. Even when the recording is 2-channel/stereo, a single channel can include multiple voices. For example, the Agent channel can include IVR prompts and hold music in addition to the Agent's voice. Hence a very important criterion for an OEM Speech-to-Text vendor is that they provide accurate speaker diarization.
We recommend testing the various speech-to-text vendors on a good sample set of mono audio files. Select files that are representative of what will be used in production and calculate the Diarization Error Rate (DER). Here is a useful link that outlines the technical aspects of understanding and measuring speaker diarization.
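To make this concrete, here is a minimal sketch of a DER calculation for a single file using the open-source pyannote.metrics package; the segment boundaries and speaker labels are made-up illustrations, and each vendor's output would first need to be converted into this speaker-labeled segment form.

```python
# Minimal sketch: scoring a vendor's diarization output against a hand-labeled
# reference using pyannote.metrics. All times and labels are illustrative.
from pyannote.core import Segment, Annotation
from pyannote.metrics.diarization import DiarizationErrorRate

# Hand-labeled ground truth for one mono call
reference = Annotation()
reference[Segment(0.0, 12.5)] = "agent"
reference[Segment(12.5, 30.0)] = "caller"
reference[Segment(30.0, 41.0)] = "agent"

# Diarization returned by the STT vendor, converted to the same format
hypothesis = Annotation()
hypothesis[Segment(0.0, 13.1)] = "spk0"
hypothesis[Segment(13.1, 29.2)] = "spk1"
hypothesis[Segment(29.2, 41.0)] = "spk0"

der = DiarizationErrorRate()
print(f"DER: {der(reference, hypothesis):.3f}")  # lower is better
```

Averaging the DER over the whole sample set gives a single number that is easy to compare across vendors.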
A very common requirement of Voice AI Apps is to redact PII – Personally Identifiable Information. PII redaction is a post-processing step that a Speech-to-Text API vendor needs to perform. It involves accurately identifying entities like names, email addresses, phone numbers and mailing addresses, and then redacting them in both the text and the audio. In addition, there are PCI (Payment Card Industry) specific named entities like credit card numbers, 3-digit PINs and expiry dates. Successful PII and PCI redaction requires post-processing algorithms that accurately identify a wide range of PII entities across a wide range of test scenarios, including scenarios with errors in user input and errors in speech recognition.
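Purely as an illustration of the kind of pattern matching involved (a vendor's production redaction pipeline is far more sophisticated, and the patterns below are hypothetical, not any vendor's):

```python
import re

# Illustrative-only patterns; real redaction must also handle spoken-form numbers
# ("four two four two..."), ASR errors, and many more entity types.
PATTERNS = {
    "CC":    re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[ -]?)?(?:\(\d{3}\)|\d{3})[ -]?\d{3}[ -]?\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

print(redact("My card is 4242 4242 4242 4242 and my email is jane@example.com"))
```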
There is another important capability related to PCI/PII redaction. Very often PII/PCI entities are spread across multiple turns in a conversation between an Agent and Caller. It is important that the post-processing algorithm of the OEM Speech-to-Text vendor can process both channels simultaneously when looking for these named entities.
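The sketch below illustrates the idea, not any vendor's actual algorithm: turns from both channels are merged by start time so that a card number read out across several turns forms one contiguous text span before entity detection runs.

```python
# Hypothetical turn structure: (start_time_sec, channel, text)
agent_turns  = [(0.0, "agent", "can you read me the first eight digits"),
                (9.0, "agent", "and now the last eight")]
caller_turns = [(4.0, "caller", "4 2 4 2 4 2 4 2"),
                (12.0, "caller", "4 2 4 2 4 2 4 2")]

# Merge both channels into a single time-ordered transcript so entities
# split across turns can still be detected as one span.
merged = sorted(agent_turns + caller_turns, key=lambda turn: turn[0])
combined_text = " ".join(text for _, _, text in merged)
print(combined_text)
# A scan over combined_text can now catch the full card number.
```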
A call center audio recording could start off in one language and then switch to another. The Speech-to-Text API should be able to detect the language and then perform the transcription.
There will always be words that are not accurately transcribed, even by the most accurate Speech-to-Text model. The API should include support for Hints or Keyword Boosting so that words that are consistently misrecognized get replaced by the correct transcription. This is especially applicable to names of companies, products and industry-specific terminology.
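Vendor hint or boosting features act at decode time; purely to illustrate the effect, a naive post-correction pass over a finished transcript might look like the following (the correction list is hypothetical):

```python
import re

# Hypothetical list of terms the model consistently gets wrong -> correct form
CORRECTIONS = {
    "voice gain": "Voicegain",
    "sequel server": "SQL Server",
    "accu weather": "AccuWeather",
}

def apply_hints(transcript: str) -> str:
    for wrong, right in CORRECTIONS.items():
        transcript = re.sub(re.escape(wrong), right, transcript, flags=re.IGNORECASE)
    return transcript

print(apply_hints("we use voice gain together with sequel server"))
```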
There are AI models that measure sentiment and emotion, and these models can be incorporated into the post-processing stage of transcription to enhance the Speech-to-Text API. Sentiment is extracted from the text of the transcript, while Emotion is computed from the tone of the audio. A well-designed API should return Sentiment and Emotion throughout the interaction between the Agent and Caller. It should also effectively compute the overall sentiment of the call by weighting the “ending sentiment” appropriately.
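One simple weighting scheme, shown here only as an illustration and not as any vendor's actual formula, is to weight each segment's sentiment score by how late in the call it occurs:

```python
# Each tuple: (segment start as a fraction of call duration, sentiment in [-1, 1])
segments = [(0.05, -0.6), (0.40, -0.2), (0.75, 0.3), (0.95, 0.8)]

# Linearly increasing weights so the end of the call counts the most.
weights = [0.5 + position for position, _ in segments]
overall = sum(w * score for w, (_, score) in zip(weights, segments)) / sum(weights)
print(f"overall sentiment: {overall:+.2f}")
```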
While measuring the quality of an Agent-Caller conversation, there are a few important audio-related metrics that are tracked in a call center. These include Talk-Listen Ratios, overtalk incidents and excessive silence and hold.
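Given per-channel speech segments with timestamps, these metrics are straightforward to derive; a minimal sketch (with made-up segment times) follows:

```python
# Speech segments per channel as (start_sec, end_sec); values are illustrative.
agent  = [(0.0, 10.0), (25.0, 40.0), (55.0, 70.0)]
caller = [(10.0, 24.0), (38.0, 54.0)]
call_duration = 80.0

def total(segments):
    return sum(end - start for start, end in segments)

def overlap(a, b):
    # Total seconds where both parties speak at the same time (overtalk).
    return sum(max(0.0, min(e1, e2) - max(s1, s2)) for s1, e1 in a for s2, e2 in b)

talk_listen = total(agent) / total(caller)
overtalk = overlap(agent, caller)
silence = call_duration - (total(agent) + total(caller) - overtalk)

print(f"talk/listen ratio: {talk_listen:.2f}")
print(f"overtalk: {overtalk:.1f}s, silence/hold: {silence:.1f}s")
```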
There are other LLM-powered features like computation of the QA Score and the summary of the conversation. However, these features are built by the developer of the AI Voice App by integrating the output of the Speech-to-Text API with the APIs offered by the LLM of the developer's choice.
The Voicegain Speech-to-Text platform has for a while already supported many Twilio-related features, such as:
Release 1.26.0 of the Voicegain platform finally offers full 2-channel support for Twilio Media Streams. This enables real-time transcription of both the inbound and outbound channels at the same time.
The Twilio <Stream> command takes a websocket URL parameter as the target to which the selected channels are streamed, for example:
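A rough sketch of generating such TwiML with the Twilio Python helper library (the wss URL is a placeholder; the real one is obtained as described next):

```python
# Sketch: generate TwiML that streams both call channels to a websocket.
# The wss URL below is a placeholder; a real one is returned when the
# Voicegain real-time transcription session is started.
from twilio.twiml.voice_response import VoiceResponse, Start

response = VoiceResponse()
start = Start()
start.stream(url="wss://example.voicegain.ai/socket-placeholder",
             track="both_tracks")  # stream the inbound and outbound channel
response.append(start)
response.say("This call is being transcribed.")

print(response)  # prints the <Response><Start><Stream .../></Start>... TwiML
```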
The wss URL can be obtained by starting a new Voicegain real-time transcription session using the https://api.voicegain.ai/v1/asr/transcribe/async API. The session part of the request may look like this (notice that two sessions are started and each will be fed a different channel, left/right, of the audio stream):
We also need to tell Voicegain to take input in TWIML protocol in stereo:
Notice that we can also enable audio capture, which will give us a stereo recording of the call once the session is complete.
In the response to starting the Voicegain session we get 3 websocket URLs:
On our github we provide example python code that starts a simple outbound Twilio phone call and then transcribes both the inbound and outbound audio in real time.
The sample code illustrates an outbound calling example, which is somewhat simpler because there are no callbacks involved. In the case of an inbound call, the request to Voicegain would have to be made from the Twilio callback function that gets invoked when a new call comes in; otherwise, the rest of the code would be very similar to our github example.
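As a rough illustration of the outbound flow (this is a simplified sketch, not the github example itself; the credentials, phone numbers and wss URL are placeholders):

```python
# Sketch of an outbound Twilio call whose audio is forked to a previously
# started Voicegain session. All identifiers below are placeholders.
from twilio.rest import Client

ACCOUNT_SID = "ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
AUTH_TOKEN = "your_auth_token"
VOICEGAIN_WSS = "wss://example.voicegain.ai/socket-placeholder"  # from session start

twiml = f"""
<Response>
  <Start>
    <Stream url="{VOICEGAIN_WSS}" track="both_tracks"/>
  </Start>
  <Say>Connecting you now.</Say>
  <Pause length="30"/>
</Response>
"""

client = Client(ACCOUNT_SID, AUTH_TOKEN)
call = client.calls.create(to="+15551234567", from_="+15557654321", twiml=twiml)
print(call.sid)
```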
Some of these are already listed on the Twilio Media Streams page:
We will be testing the <Stream> functionality on the LaML command language provided by the SignalWire platform, which is very similar to Twilio TwiML. We will update our blog with the results of those tests.
We are also working on a real-time version of our Speech Analytics API. Once complete, all Speech Analytics functionality will be available in real time to users of the Twilio and SignalWire platforms.
1. Click here for instructions to access our live demo site.
2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.
3. If you want to use Voicegain as your own AI Transcription Assistant for meetings, click here.
We are excited to announce a new Speech-to-Text (STT) API that works with AudioCodes VoiceAI Connect*. AudioCodes VoiceAI Connect (VAIC) enables enterprises to connect a bot framework and speech services, such as text-to-speech (TTS) and speech-to-text (STT), to the enterprises’ voice and telephony channels to power Voice Bots, conversational IVRs and Agent Assist use-cases.
With this new API, enterprises and NLU/Conversational AI platform companies can leverage the capabilities of AudioCodes VAIC with Voicegain as the ASR or STT engine for their contact center AI initiatives.
The two main use-cases in Contact Centers are (1) building Voice Bots (or voice-enabling a chatbot) and (2) building real-time Agent Assist.
While AudioCodes supports Cloud STT options from the large players (Microsoft, Google and Amazon), introducing Voicegain as an additional ASR option offers three key benefits to prospective customers. These benefits can be summarized as the 3 As - Accuracy, Affordability and Accessibility.
To achieve very high STT accuracy, companies now understand the importance of training the underlying acoustic models on application-specific audio data. While it is necessary to have reasonable out-of-the-box accuracy, building voice bots or extracting high quality analytics from voice data requires more than that. Voicegain offers a full-fledged training data pipeline and easy-to-use APIs to help speed up the building of custom acoustic models. We have demonstrated significant reductions in Word Error Rate even with a few hundred hours of client-specific audio files.
Because AudioCodes VAIC makes it very easy to switch between various STT services, you can easily compare the performance of Voicegain STT to any of the other STT providers supported on AudioCodes.
Voicegain offers disruptive pricing compared to the big 3 STT providers at essentially the same out-of-the-box accuracy. Our pricing is 40%-75% lower than the big 3 Cloud Speech-to-Text providers. This is especially important for real-time analytics (real-time agent assist) use case in contact centers as the audio/transcription volumes are very large. In addition to APIs, we also provide a white-label reference UI that contact centers can use to reduce the cost and time-to-market associated with deploying AI apps.
In addition to accessing STT as a cloud service, Voicegain can be deployed onto a Kubernetes cluster in a client's datacenter or in a dedicated VPC with any of the major cloud providers. This addresses applications where compliance, privacy and data control concerns prevent use of STT engines on public cloud infrastructure.
Connecting AudioCodes VAIC to Voicegain is done in 3 simple steps. They are:
1) Add Voicegain as the ASR/STT provider on VAIC. This is done through an API (provided by AudioCodes). In this step, you will need to enter a JWT token from the Voicegain web console for authentication (instructions provided below).
2) Enter the websocket entry URL for Voicegain ASR on VAIC. You can get this URL from the Voicegain Web Console (instructions provided below).
3) Configure the Speech Recognition engine settings. This includes picking the right model and setting the correct timeout and model sensitivity. This is done on the Voicegain Web Console (sign-up instructions provided below).
Please contact your AudioCodes customer success contact for Steps 1 & 2.
You will need to sign up for a developer account on the Voicegain Web Console. Voicegain offers an open developer platform and there is no need to enter your credit card. We provide 300 minutes of free Speech-to-Text API access every month, so you can test our APIs and check out our accuracy.
After you sign up, please go to Settings -> API Security. The JWT Token required for Step 1 and the API entry URL required for Step 2 are provided there.
You will also need to pick the right acoustic model and set the complete timeout and sensitivity as specified in Step 3. Please navigate to Settings -> Speech Recognition -> ASR Transcription settings.
If you have any questions, please email us at support@voicegain.ai.
* VoiceAI Connect is a product and trademark owned by AudioCodes.
The entire Voicegain Speech-to-Text/ASR platform and all the associated products - ranging from Web Speech-to-Text (STT) APIs, Speech Analytics APIs and Telephony Bot APIs to the MRCP ASR engine and our logging and monitoring framework - can be deployed on "the Edge".
By "Edge" we mean that the core Deep Neural Network based AI models that convert speech/audio into text run exclusively on hardware deployed in a client datacenter. Or after this announcement they can also run on a compute instance in a Virtual Private Cloud. In either case, the Voicegain platform is "orchestrated" using the Voicegain Console which is a web application that is deployed on Voicegain cloud.
On the Edge, the Voicegain platform is deployed as containers on a Kubernetes cluster. Voicegain can also be accessed as a Cloud service for clients that do not want to manage either server hardware or VPC compute instances.
The Voicegain platform has always been deployable in a simple and automated way on hardware in a datacenter. Our clients procure compatible Intel Xeon based servers with Nvidia GPUs, and they are able to install the entire Voicegain platform in a few clicks from the Cloud Portal (see these videos for a demonstration).
You can read about the advantages of this Datacenter type of Edge Deployment in our previous blog post. To summarize, these advantages are:
Now these benefits are also available to enterprise clients that use a Virtual Private Cloud on AWS to run a portion of their enterprise workloads.
Many enterprises have migrated workloads to AWS Cloud infrastructure in order to benefit from the scale, flexibility and ease of maintenance. While moving these workloads to the Cloud, these enterprises largely prefer the private cloud offerings of AWS, e.g., VPC network isolation, Site-to-Site VPN and dedicated compute instances. For these enterprises, ideally any new workload should be capable of running inside their AWS VPC. In particular, if an enterprise already has dedicated AWS compute instances or hosts, it can realize all 4 of the above advantages of Edge Deployment by deploying into its dedicated AWS infrastructure.
Recently, anticipating the interest of some of our customers, we performed extensive tests of a complete deployment of our platform into AWS. Because the Voicegain platform is Kubernetes based, there are essentially only two differences from deployment onto local on-premise hardware:
Otherwise, the core of the deployment process is essentially identical between on-premise hardware and an AWS VPC.
You can read the details involved in the AWS deployment process on Voicegain's github page.
If you are a developer building something that requires you to add or embed Speech-to-Text functionality (Transcription, Voice Bot or Speech Analytics in Contact Centers, analyzing meetings or sales calls, etc.), we invite you to give Voicegain a try. You can start by signing up for a developer account and use our Free tier. You can also email us at info@voicegain.ai.
This blog post describes how SignalWire developers should integrate with Voicegain Speech-to-Text/ASR based on the application that they are building.
Voicegain offers a highly accurate Speech-to-Text/ASR option on SignalWire. Voicegain is very disruptively priced and one of the main benefits is that it allows developers to customize the underlying acoustic models to achieve very high accuracy for specific accents or industry verticals.
SignalWire developers can fork audio to Voicegain using the <Stream> instruction in LaML. The <Stream> instruction makes it possible to send raw audio streams from a running phone call over WebSockets, in near real time, to a specified URL.
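A sketch of such a LaML document (the wss URL is a placeholder for the one returned when a Voicegain real-time session is started; LaML mirrors TwiML here):

```python
# Sketch: LaML that forks the call audio to a Voicegain websocket.
VOICEGAIN_WSS = "wss://example.voicegain.ai/socket-placeholder"  # placeholder

laml = f"""
<Response>
  <Start>
    <Stream url="{VOICEGAIN_WSS}"/>
  </Start>
  <Say>Your call may be transcribed for quality purposes.</Say>
</Response>
"""
print(laml)
```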
Developers looking to just get the raw text/transcript may use the Voicegain STT API to get real time transcription of streamed audio from SignalWire.
For developers that need NLU tags like sentiment, named entities, intents and keywords in addition to the transcript, Voicegain's Speech Analytics API provides those metrics in addition to the transcript.
Applications of real-time transcription and speech analytics include live agent assist in contact centers, extraction of insights for sales calls conducted over telephony, and meeting analytics.
If you want to build a Voice Bot or a directed dialog Speech IVR application that handles calls coming over SignalWire then we suggest using Voicegain Telephony Bot API. This is a web callback API similar to LaML and has instructions or commands specifically helpful in building IVRs or Voice Bots. This API handles speech-to-text, DTMF digits and also plays prompts (TTS, pre-recorded, or a combination).
Your calls are transferred from SignalWire to a Voicegain provided SIP endpoint (based on FreeSWITCH) using a simple SIP INVITE.
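In LaML terms this can be a simple <Dial><Sip>; the SIP URI below is a placeholder for the endpoint provisioned for your Voicegain account:

```python
# Sketch: hand the inbound SignalWire call over to a Voicegain SIP endpoint.
# The SIP URI is a placeholder for the endpoint provisioned for your account.
laml = """
<Response>
  <Dial>
    <Sip>sip:your-bot-id@sip.example.voicegain.ai</Sip>
  </Dial>
</Response>
"""
print(laml)
```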
Voicegain Telephony Bot API allows you to build two types of applications:
If your application has only a limited need for speech recognition, you can invoke the Voicegain STT API only as needed. Every time you need speech recognition, you can simply start a new ASR session with Voicegain in either transcribe (large vocabulary transcription) or recognize (grammar-based recognition) mode, and use the LaML <Stream> command to fork the call audio to that session.
An example application that could be built this way is a voice-controlled voicemail retrieval or dictation application, where the Voicegain recognize API is used in continuous mode and listens for commands like play, stop, next, etc.
In addition to SignalWire, Voicegain also offers integrations with FreeSWITCH using the built-in MRCP plug-in and a separate module for real-time transcription.
If you are a developer on SignalWire and would like to build an app that requires Speech-to-Text/ASR, you can sign up for a developer account using the instructions provided here.
This blog post describes 4 ways you can use Telnyx with Voicegain's Deep Neural Network based Speech-to-Text/ASR platform.
For developers looking to get the raw text/transcript, the Voicegain STT API supports real time transcription of streamed audio from Telnyx.
For conversational AI applications that need NLU tags like sentiment, named entities, intents and keywords in the submitted audio, Voicegain's real-time Speech Analytics API provides those metrics in addition to the transcript.
While both the STT API and the Speech Analytics API support multiple methods to stream audio, Voicegain recommends RTP streaming as the primary method with Telnyx. Developers can stream either 1-channel or 2-channel RTP (the two channels are tied together which is important for some Speech Analytics features).
You can use the Telnyx Call Control API to fork the call audio and send it to Voicegain. The Call Control API allows you to send inbound (rx) audio, outbound (tx) audio, or both; this is done using the fork_start command. You can find a complete example of the code needed for real-time transcription of a call here: platform/examples/telnyx/call_control_fork_of_bridged_call at master · voicegain/platform (github.com)
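A minimal sketch of issuing fork_start against the Telnyx REST API (the API key, call_control_id and RTP destinations are placeholders, and the linked github example is the complete, working version):

```python
# Sketch: fork the live call audio to the RTP destination(s) returned when the
# Voicegain ASR session was started. All identifiers below are placeholders,
# and the exact request shape should be checked against the Telnyx API reference.
import requests

TELNYX_API_KEY = "KEYxxxxxxxxxxxxxxxx"          # placeholder
call_control_id = "v2:placeholder-call-id"      # from the Telnyx call webhook
rx_target = "udp:192.0.2.10:7000"               # inbound leg -> Voicegain RTP port
tx_target = "udp:192.0.2.10:7002"               # outbound leg -> Voicegain RTP port

resp = requests.post(
    f"https://api.telnyx.com/v2/calls/{call_control_id}/actions/fork_start",
    headers={"Authorization": f"Bearer {TELNYX_API_KEY}"},
    json={"rx": rx_target, "tx": tx_target},
)
resp.raise_for_status()
```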
Applications of real-time transcription and speech analytics include live agent assist in contact centers, extraction of insights for sales calls conducted over telephony, and meeting analytics.
If you want to build a Voice Bot or an IVR application that handles calls coming over Telnyx then we suggest using Voicegain Telephony Bot API - this is a callback API similar in style to Twilio's TwiML. This API handles speech-to-text, DTMF digits and also plays prompts (TTS, pre-recorded, or a combination).
Your calls are transferred from Telnyx to Voicegain using a simple SIP INVITE. The SIP INVITE is accomplished using the Telnyx Call Control Dial command. You can find a complete example of how to do this here: platform/telnyx-dial-outbound-lambda.py at master · voicegain/platform (github.com)
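A minimal sketch of the Dial command as a single REST call (the connection_id, SIP URI and caller number are placeholders; the linked github example shows the full implementation):

```python
# Sketch: SIP INVITE the call to Voicegain using the Telnyx Dial command.
import requests

TELNYX_API_KEY = "KEYxxxxxxxxxxxxxxxx"  # placeholder

resp = requests.post(
    "https://api.telnyx.com/v2/calls",
    headers={"Authorization": f"Bearer {TELNYX_API_KEY}"},
    json={
        "connection_id": "your-call-control-app-id",       # placeholder Telnyx app
        "to": "sip:your-bot-id@sip.example.voicegain.ai",   # placeholder Voicegain SIP URI
        "from": "+15551234567",                             # placeholder caller id
    },
)
resp.raise_for_status()
```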
Voicegain Telephony Bot API allows you to build two types of applications:
If your application has only a limited need for speech recognition, you can invoke the Voicegain STT API only as needed. Every time you need speech recognition, you simply start a new ASR session with Voicegain in either transcribe (large vocabulary transcription) or recognize (grammar-based recognition) mode. The session will return an RTP ip:port to which you can fork your Telnyx audio. You can receive speech-to-text results either over a websocket or via a callback. After you are done with the transcription/recognition session, you stop the Telnyx audio fork.
An example application that could be built like that is a voice controlled voicemail retrieval application where Voicegain recognize API is used in a continuous mode and listens to commands like play, stop, next, etc.
Finally, you could use the Voicegain Long-Session API (planned for release later in 2021). This API allows you to establish a single long session that takes an ongoing stream of inbound audio from Telnyx (via the fork command). Once the session is established, you can issue commands for transcription or recognition. They return results upon finding a speech endpoint or when you explicitly stop them. After processing the results, you can issue additional commands on the same Voicegain session.
In addition to returning recognition results, the Long-Session STT API returns important events, such as start-of-speech, which allow you to implement proper barge-in behavior.
Using this API you could build your own Voice Bot just like the Voice Bots from #2, but you would have more control over your Telnyx session, e.g., you could use conference commands.
This post is the first in a series of posts that compares the performance of Voicegain Speech Analytics against Google and Amazon. This post compares the capabilities and accuracy of recognition/extraction of Named Entities. The Google APIs used for comparison were those under Cloud Natural Language and the Amazon APIs were under AWS Comprehend.
Named Entity Recognition (NER), or extraction of Named Entities, is one of the features of the Voicegain Speech Analytics API. Named Entity Recognition locates and classifies named entities in unstructured text that may be obtained, for example, from the transcription of audio files. Although there is a lot of overlap between Google, Amazon and Voicegain with respect to the classification categories, there are also some significant differences, which are summarized below.
The full spreadsheet linked here shows the named entities extracted by the Voicegain Speech Analytics API and compares them to the named entity categories available in the Google and Amazon Comprehend APIs. Amazon has two NER APIs: Entity and PII Entity.
If you look at the spreadsheet, you will see that the Amazon non-PII Entity API offers little granularity in its named entity categories. For example, it groups many numerical named entities into a single QUANTITY category, and it groups dates and time (of day) into a single DATE category. On the other hand, the PII Entity API has many fine-grained categories for items typically redacted as PII, but it misses a lot of other common entity categories.
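For reference, the two Amazon Comprehend APIs compared here can be called via boto3 as in this minimal sketch (the region and sample text are arbitrary):

```python
# Minimal sketch of calling the two AWS Comprehend NER APIs compared above.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
text = "My name is Jane Doe and my card number is 4242 4242 4242 4242."

entities = comprehend.detect_entities(Text=text, LanguageCode="en")
pii_entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")

print([(e["Type"], e["Text"]) for e in entities["Entities"]])
print([(e["Type"], e["Score"]) for e in pii_entities["Entities"]])
```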
The Google API seems to cover the usual categories but misses some entities used in call-center applications, e.g. CC, SSN, EMAIL.
A category that Voicegain does not support is OTHER. This category, which is available in Google and Amazon, requires additional application logic to interpret the string that it matches.
We have tested all 4 APIs on a set of call center calls.
The overall results show that Voicegain and the Amazon non-PII API detect similar named entities (with the caveat that the Amazon NER categories are less specific). Compared to these two, the Google NER API misses more entities, but it also marks many additional words as falling into the OTHER category, which is generally not very useful, at least not when analyzing call center calls.
Looking at the Amazon PII Entities we noticed that:
Where Voicegain has an entity category matching an AWS PII Entity category, it performed the same or better. As you can see, it is difficult to summarize the results because the entities are not directly comparable. If you want to know how Voicegain NER will perform on your data, we suggest you test the Voicegain Speech Analytics API, which includes NER, keyword and phrase detection, sentiment analysis, etc.
For testing, you have two options: