Our Blog

News, Insights, sample code & more!

ASR
Going beyond Accuracy: Key STT API Features for Contact Center Voice AI Apps

This article is for companies building Voice AI Apps targeting the Contact Center. It outlines the key technical features, beyond accuracy, that are important while evaluating an OEM Speech-to-Text (STT) API. Usually, most analyses focus on the importance of accuracy and metrics like benchmarks of word error rates (WER). While accuracy is very important, there are other technical features that are equally important for contact center AI apps.

Introduction

There are multiple use-cases for Voice AI Apps in the ContactCenter. Some of the common use cases are 1) AI Voicebot or Voice Agent 2) Real-time Agent Assist 3)Post Call Speech Analytics.

This article is focused on the third use-case which is Post-Call Speech Analytics. This use-case relies on batch STT APIs while the first two use-cases require streaming transcription. This Speech Analytics App helps the Quality Assurance and Agent-Performance management process. This article is intended for Product Managers and Engineering leads involved in building such AI Voice Apps that target the QA, Coaching and Agent Performance management process in the call center. Companies building such apps could include 1) CCaaS Vendors adding AI features, 2) Enterprise IT or Call Center BPO Digital organizations building an in-house Speech Analytics App 3) Call Center Voice AI Startups

1. Accurate Speaker Diarization

Very often, call-center audio recordings are only available in mono. And even if the audio recording is in 2-channel/stereo, it could include multiple voices in a single channel. For example, the Agent channel can include IVR prompts and hold music recordings in addition to the Agent voice. Hence a very important criterion for an OEM Speech-to-Text vendor is that they provide accurate speaker diarization.

We would recommend doing a test of various speech-to-text vendors with a good sample set of mono audio files. Select files that are going to be used in production and calculate the Diarization Error Rate. Here is a useful link that outlines the technical aspects of  understanding and measuring speaker diarization.

2. Accurate PII/Named Entity Redaction and PCI Compliance

A very common requirement of Voice AI Apps is to redact PII – which stands for Personally Identifiable Information. PII Redaction is a post-processing step that a Speech-to-Text API vendor needs to perform. It involves accurate identification of entities like name, email address, phone number and mailing addresses and subsequent redaction both in text and audio. In addition, there are PCI – Payment Card Industry – specific named entities like Credit Card number, 3-digit PIN and expiry dates. Successful PII and PCI redaction requires post-processing algorithms to accurately identify a wide range of PII entities and cover a wide range of test scenarios. These test scenarios need to cover scenarios where there errors in user input and errors in speech recognition too.

There is another important capability related to PCI/PII redaction. Very often PII/PCI entities are present across multiple turns in a conversation between an Agent and Caller. It is important that the post-processing algorithm of the OEM Speech-to-Text vendor is able to process both channels simultaneously when looking for these named entities.

3. Language Detection

A Call Center audio recording could start off in one languageand then switch to another. The Speech-to-Text API should be able to detect language and then perform the transcription.

4. Hints/Keyword Boosting

There will always be words that are not accurately transcribed by even the most accurate Speech-to-Text model. The API should include support for Hints or Keyword Boosting where words that are consistently misrecognized can get replaced by the correctly transcribed word. This is especially applicable for names of companies, products and industry specific terminology.

5. Sentiment and Emotion

There are AI models that measure sentiment and emotion, and these models can be incorporated in the post-processing stage of transcription to enhance the Speech-to-Text API. Sentiment is extracted from the text of the transcript while Emotion is computed from the tone of the audio. A well-designed API should return Sentiment and Emotion throughout the interaction between the Agent and Caller. It should effectively compute the overall sentiment of the call by weighting the “ending sentiment” appropriately.

6. Talk-Listen Ratios, Overtalk and Other Incidents

While measuring the quality of an Agent-Caller conversation, there are a few important audio-related metrics that are tracked in a call center.  These include Talk-Listen Ratios, overtalk incidents and excessive silence and hold.

7. Other Optional LLM-Powered Features

There are other LLM-powered features like computation of theQA Score and the summary of the conversation. However, these are features are builtby the developer of the AI Voice App by integrating the output of the Speech-to-TextAPI with the APIs offered by the LLM of the developer’s choice.

Read more → 
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Access Voicegain ASR from FreeSWITCH using mod_unimrcp
Developers
Access Voicegain ASR from FreeSWITCH using mod_unimrcp

Voicegain STT platform has supported MRCP (Media Resource Control Protocol) for a long time now. Our ASR can be accessed using MRCP and we support both grammar-based recognition (e.g. GRXML) and large-vocabulary transcription. MRCP is a communication protocol designed to connect telephony based IVRs and Voice Bots with speech recognizers (ASR) and speech synthesizers (TTS).

Previously we tested connecting to Voicegain using MRCP from VXML platforms like Dialogic PowerMedia XMS or Aspect Prophecy. We had not tested connecting from FreeSWITCH, a popular open source telephony platform, using its MRCP plugin mod_unimrcp.

We are pleased to announce that Voicegain platform works out-of-the box with mod_unimrcp, the MRCP plugin for FreeSWITCH. However, getting the mod_unimrcp plugin to work on FreeSWITCH is not particularly trivial. Here are some pointers to help those who would like to use mod_unimrcp with our platform.


Deploying Voicegain unimrcp server

There are currently 2 options to do this. We plan to add a third option very soon  

  1. For production deployments of Speech IVRs and Voice Bots on FreeSWITCH, we recommend an Edge Deployment of the Voicegain platform. This will deploy our unimrcp server that can communicate with a locally deployed FreeSWITCH using MRCP.
  2. To use our Cloud ASR, you will need to download a MRCP IVR Proxy. This proxy can be downloaded from the Voicegain Web Console. You will download a tar file that has the definition of a docker compose that you can then run on your docker server. This will deploy our preconfigured unimrcp server with a proxy for connecting to Voicegain Cloud Speech-to-Text engine .
  3. (Coming soon) We plan to implement a voicegain_asr plugin that can be deployed on a standard unimrcp server. The plugin will talk to our ASR in the cloud using gRPC.

Also, the current TTS option accessible over MRCP are not great. Our focus has been on the use of prerecorded prompts for IVRs and Voice Bots. We plan to shortly allow developers to access the Google or Amazon TTS.


Configuring FreeSWITCH for mod_unimrcp

mod_unimrcp does not get built by default when you build FreeSWITCH from source. To get it built you need to enable it in build/modules.conf.in by uncommenting this line: #asr_tts/mod_unimrcp


After the build, before starting FreeSWITCH you will need to:

  • Add <load module="mod_unimrcp"/> to autoload_configs/modules.conf.xml(you can put it in <!-- ASR /TTS --> section because that is where it logically belongs)
  • Create mrcp_profile for voicegain (see below)
  • Modify content of autoload_config/unimrcp.conf.xmlIf you want to use both ASR and TTS via Voicegain MRCP, you will need to point both default-asr-profile and default-tts-profile to the voicegain1-mrcp2 profile you will create in mrcp_profiles folder.

Here is an example MRCP v2 profile for connecting to Voicegain MRCP:

Here are some additional notes about the configuration file:

  • It is important that the port range used by the Unimrcp Client:<param name="rtp-port-min" value="4000"/><param name="rtp-port-max" value="5000"/>is accessible from outside, otherwise, the TTS via MRCP will not work. Also, these ports may not overlap with the UDP ports used by FreeSWITCH.
  • In some setups the "auto" values of :<param name="client-ip" value="auto"/> and<param name="rtp-ip" value="auto"/>may not work and you will have to manually specify the external IP.

How to use mod_unimrcp

Here is an example of how to play a question prompt and to invoke the ASR via mod_unimrcp to recognize a spoken phone number:


session:execute("set", "tts_engine=unimrcp:voicegain1-mrcp2");
session:execute("set", "tts_voice=Catherine");
session:execute("play_and_detect_speech", 
"say:What is your phone number detect:unimrcp {start-input-timers=false,define-grammar=true,no-input-timeout=5000}builtin:grammar/phone")

asrResult = session:getVariable("detect_speech_result");

test

What this example does is:

  • tells FS which tts_egine to use
  • sets the TTS voice - currently ignored
  • plays a question prompt using the specified TTS and launches the recognition
  • retrieves the result of the speech recognition

The result of the recognition is a string in XML format (NLSML). You will need to parse it to get the utterance and any semantic interpretations. NLSML result also contains confidence.  


The normal command "play_and_detect_speech" holds onto ASR session until the end of the call - this makes subsequent recognitions more responsive, but you are paying for the MRCP session. You can also use this command "play_and_detect_speech_close_asr" to release ASR session immediately after recognition.


If you have any questions about the use of Voicegain ASR via MRCP please contact us at: support@voicegain.ai


Coming Soon

On our roadmap we have a mod_voicegain plugin for FreeSWITCH which will bypass the need for mod_unimrcp and unimrcp server and will be talking from FreeSWITCH directly to the Voicegain ASR using gRPC.

Read more → 
Implementing Real-time Agent Assist with Voicegain
Use Cases
Implementing Real-time Agent Assist with Voicegain

As pandemic forces Contact Centers to operate with work-from-home agents, managers are increasingly looking to real-time speech analytics to drive improvements in agent efficiency (via reduction in AHT) and effectiveness (improvements in FCR, NPS) and achieve 100% compliance.

Before the pandemic, Contact Center managers relied on a combination of in person supervision and speech analytics of recorded calls to drive improvements in agent efficiency and effectiveness.

However the pandemic has upended everything. It has forced contact centers to support work-from-home agents from multiple locations.  Team Leads who "walked the floor" and monitored  and assisted agents in realtime are not available any more. The offline Speech Analytics process - which is still available remotely - is limited and manual. A Call Coach or a QA Analyst coaches an agent manually using a sample 1-2% of the calls that have been transcribed and analyzed.

There is a now an urgent need to monitor and support agents real-time and provide them all tools and support that they had while they worked in their offices.

Real-time Agent Assist is the use of Artificial Intelligence - more specifically Speech Recognition and Natural Language Processing - to help agents real-time during the call in the following ways.

  1. Agents can be presented with knowledge-base articles and next-best actions from intents that are extracted from the transcribed text
  2. Using NLU algorithms and intents extracted, you can now summarize the call automatically and realize savings on disposition/wrap time
  3. Supervisors can monitor sentiment real-time

Real-time Agent Assist can reduce AHT by 30 seconds to 1 minute, improve FCR by 3-5% and improve NPS/CSAT.

What does it take to implement Real-time Agent Assist?

Real-time agent assist involves the realtime transcription of the Agent and Caller Interaction and extracting keywords, insights and intents from the transcribed text and make it available in a user-friendly manner to both the Agents and also the team-leads and supervisors.

There are 4 key steps involved:

  1. Audio Capture: The first step is to stream the two channels of audio (i.e agent and caller streams) from the Contact Center Platform that the client is using (whether premise based or cloud based). Voicegain supports a variety of protocols to stream audio. We have described them here and here.   We have integrated with premise-based major contact center platforms like Avaya, Cisco and Genesys. We have also integrated with Media Stream APIs from programmable CCaaS platforms like Twilio and SignalWire.
  2. Transcription: The next step in the process is to transcribe the audio streams into text . Voicegain offers Transcription APIs to convert the audio into text realtime. We can stream the text realtime (using web-sockets or gRPC) so that it can be easily integrated into any NLU Engine.
  3. NLU/Text Analytics:  In this step, the NLU engine extracts the intents from the transcribed text. These intents are trained in an earlier phase using phrases and sentences. Voicegain integrates with leading NLU Engines like RASA, Google Dialogflow, Amazon Lex and Salesforce Einstein.
  4. Integration with Agent Desktop: The last and final step is to integrate the results of the NLU with the Agent Desktop.

At Voicegain, we make it really easy to develop real-time agent assist applications . Sign up to test the accuracy of our real-time model.

Read more → 
Easy Speech IVR for Outbound Calling using Voicegain and Twilio
Contact Center
Easy Speech IVR for Outbound Calling using Voicegain and Twilio

Outbound IVRs on Voicegain

Voicegain platform makes it easy to build IVRs for simple outbound calling applications like: surveys (Voice-of-Customer, political, etc), reminders (e.g. appointments, payments due), notifications (e.g. school closure, water boil notice), and so on.

Voicegain allows developers to use the outbound calling features of CPaaS platforms like Twilio or SignalWire with the speech recognition and IVR features of the Voicegain platform. All you need is this simple piece of code to make an outbound call using Twilio and connect it to Voicegain for IVR.


Defining IVRs in declarative way

Voicegain provides a full featured Telephone Bot API. It is a webhook/callback style API that can be used in similar way you would use Twilio's TwiML. You can read more about it here

However, in this post, we describe an even simpler method to build IVRs. We allow developers to specify the Outbound IVR call flow definitions in a simple YAML format. We also provide a python script that can be easily deployed on AWS Lambda or on your web-server to interpret this YAML file. The complete code with examples can be found on our github. It is under MIT license so you can modify the main interpreter script to your liking. You might want to do it e.g. to make calls to external webservices that your IVR needs.

In this YAML format, an IVR question would be defined as follows:


As you can see, this is a pretty easy way to define an IVR question. Notice also that we provide a built-in handling for the NOINPUT and NOMATCH re-prompts, as well as the logic for confirmations. This greatly reduces the the clutter in the specification as those flow scenarios do not have to be handled explicitly.

The questions support either use of grammars to map responses to semantic meaning, or they can alternatively simply capture the response using a large vocabulary transcription.

Prompts are played using TTS or can be concatenated from prerecorded clips.

Wait, there is more.

Because this is built on top of Voicegain Telephone Bot API it comes with full API access to the IVR call session. You can obtain details, including all the events and responses, of the complete session using the API. This includes the 2-channel recording plus also full transcription of both channels and also Speech Analytics features.

You can also examine the details of the session from the Voicegain Console and listen to the audio. This helps in testing the application before it gets deployed.  




If you have questions about building this type of IVRs running on Voicegain platform, please contact us at support@voicegain.ai

Read more → 
Voicegain Speech Recognition for Voice Picking for Warehouses
Use Cases
Voicegain Speech Recognition for Voice Picking for Warehouses

Among the various speech-to-text APIs that Voicegain provides is a speech recognition API that uses grammars and supports continuous recognition. This API is ideally suitable for use in warehouse Voice Picking applications. Warehouse Management Systems can embed Voicegain APIs to offer Voice Picking as part of their feature set.

Here are more details of that specific API:

  • Audio input - supports streaming of audio via websockets for very easy integration with web based or Android/iOS applications (gRPC support is in beta)
  • Results of recognition are available via websocket or http callbacks in JSON format. Sending recognition results over websockets is a recent addition and it makes building web based voice picking applications much easier.
  • Supports grammar based recognition - better suited for a well defined set of commands compared to large-vocabulary speech-to-text. Has higher accuracy, better noise rejection, better handling of various accents, etc. Using grammas provides a benefit of fast end-pointing - the recognizer knows that the command has been completely uttered and there is no additional timeout needed to determine end-of-speech.We support a variant of JSGF grammar format which is very intuitive and easy to use.
  • Supports continuous recognition - multiple commands can be recognized in a single http session. Continuous recognition allows for the commands to be spaced closer together and allows for natural correction of misrecognitions by simple repetition.

In addition to that Voicegain Speech-to-Text platform provides additional benefits for Voice Picking applications:

  • Acoustic/language model is customizable - this allows for very high recognition accuracy for specific domains
  • Web-based tools available for reviewing utterance recognitions. These tools allow for tuning of grammars and for collection of utterances for model training.

Together this allows for your Voice Picking application to continually learn and improve.

Our APIs are available in the Cloud but can also be hosted at the Edge (on-prem) which can increase reliability and reduce the already low latencies.

If you would like to test our API and see how they would fit in your warehouse applications you can start with the fully functional example web app that we have made available on github: platform/examples/command-grammar-web-app at master · voicegain/platform (github.com)

If you have any question please email us at info@voicegain.ai. You can also sign-up for a free account on Voicegain Platform via our Web Console at: https://console.voicegain.ai/signup  

Read more → 
Streaming Audio data from Contact Center platforms to enable Generative AI Voice apps like Realtime Agent-Assist and AI Co-Pilot
Developers
Streaming Audio data from Contact Center platforms to enable Generative AI Voice apps like Realtime Agent-Assist and AI Co-Pilot

This article outlines various options for how developers and builders of real-time Gen AI voice applications  in contact center should design and architect access to streaming audio data from IP-based Contact Centers systems. These Contact Center systems can be premise-based contact center platforms like Avaya, Cisco, Genesys or CCaaS platforms like Five9, Genesys Cloud, NICE CXOne and Aircall.

Use Case for Realtime Generative AI Voice for Contact Center

One of the main use cases for Realtime Generative Voice AI in a contact center is Realtime Agent Assist (RTAA) or a generative AI Co-Pilot. The first step for any such realtime application is to stream audio from Contact Center platforms to a streaming Speech-to-Text model and get the speaker separated transcript. This transcript in turn can be integrated with an LLM for real-time sentiment analysis, QA automation agent assist, summarization and other real-time AI use cases in the contact center. 

Voicegain's inhouse Kappa model is one such streaming speech-to-text model. The real-time transcript is made available by Voicegain over websockets.

Architecture Options to get Real-time Audio data

Overall there are 3 main approaches to get access to real-time audio streams

  • Voicegain SIP Media Stream B2BUA (For On-Premise Systems)
  • SIPREC from the SBC (Under Development)
  • Programmable Integration (leveraging APIs provided by CCaaS platforms )

The details of each of those approaches are described below

SIP Media Stream B2BUA

Most on-premise contact center platforms, like Avaya, Genesys and Cisco do not provide programmatic access to the media streams. Instead they all offer the ability to transfer a call to a SIP destination/URI. This is in turn can be provided by the Voicegain SIP Media Stream B2BUA. In other words, the Voicegain SIP Media Stream B2BUA can accept a call from such a SIP INVITE.  

More details of the SIP Media Stream B2BUA can be found here

SIPREC from Session Border Controller (currently in Beta)

Most enterprise premise-based Contact Center platforms include a network element called the Session Border Controller (SBC). The SBCs can be thought of as a SIP-aware firewall that is architected "in front" of a premise-based IP Contact Center. SBCs support the forking of audio streams using a protocol called SIPREC and this has been used over the years by active/compliant call recording vendors like NICE and Verint.

With SIPREC, an SBC essentially provides a mirror or fork of the real-time RTP stream from the telephone call. This can be sent to Voicegain's SIPREC Server (currently in beta).

Voicegain has a beta version of a SIPREC interface has been tested with the following platforms:

  • Avaya Enterprise SBC
  • Ribbon/Sonus SBC
  • Broadsoft SIPREC sipua
  • Cisco Cisco Unified Border Element (CUBE)
  • Metaswitch SIPREC sipua - The minimal version of Metaswitch that supports SIPREC is 9.0.10
  • Oracle SBC SIPREC - SelectiveCall Recording SIPREC (oracle.com)
  • Twilio TwiML <Siprec>

Voicegain can capture relevant call metadata in addition to obtaining the audio (the metadata capture functionality may differ in capabilities depending on the client platform).

Voicegain platform can be configured to automatically launch transcription and speech-analytics as soon as the new SIPREC session gets established.

SIPREC support is available both in the Cloud and the Edge (OnPrem) deployments of the Voicegain Platform.

SIPREC is an Enterprise feature of the Voicegain platform and is not included in the base package. Please contact support@voicegain.ai or submit a Zendesk ticket for more information about SIPREC and if you would like to use it with your existing Voicegain account.

Programmable Integration with CCaaS real-time audio streaming APIs

Some CCaaS  platforms, in particular the modern one provide APIs to get programmatic access to the real-time audio stream. In many of them such a capability was added specifically to simplify integration with Cloud Speech-to-Text services.

Examples of such CCaaS platforms are :

  • Five9 VoiceStream
  • Genesys Audiohook
  • Avaya DMCC (which is part of Avaya Aura® Application Enablement (AE) Services) to open RTP streams with the content of the call
  • Use Extended Media Forking (XMF) provided by Cisco Unified Communications Gateway Services

Voicegain Platform integrates with the APIs multiple protocols that allow for flexible programmable integration:

  • websockets - sending binary audio data over websocket is supported. In addition to binary data, message protocols used in Twilio and SignalWire for audio streaming over websocket are also supported. (If required, we can easily add support for additional message protocols.)
  • gRPC - binary audio data may also be sent using gRPC protocol. Note, that this capability is currently in beta.
  • plain RTP. Voicegain also supports plain RTP. The IP/port/encoding negotiation, however, has to be done using our HTTP API. We do not support RTCP nor RTSP. The HTTP API is very simple and we have already had some of our customers integrate this type of plain RTP streaming using XMF within the Cisco UC environment.    

All those protocols support uLaw, aLaw, and Linear 16-bit encoding in either 8- or 16kHz sample rate.

Contact us to discuss or brainstorm!

If you are building a voice Gen AI application and you would like to discuss getting access to realtime audio data, please contact us at support@voicegain.ai

Read more → 
PII Text and Audio Redaction now available in Speech Analytics API
Speech Analytics
PII Text and Audio Redaction now available in Speech Analytics API

Our latest release (1.24.0) expands Voicegain Speech Analytics and Transcription API with ability to redact sensitive data both in transcript and in audio. This allows our customers to be compliant with standards like HIPAA, GDPR, CCPA, PCI or PIPEDA.

Any of the following types of Named Entities can be redacted in transcript text and/or the audio file.

  • ADDRESS - Postal address.
  • CARDINAL - Numerals that do not fall under another type.
  • CC - Credit Card
  • DATE - Absolute or relative dates or periods.
  • EMAIL - (coming soon) Email address
  • EVENT - Named hurricanes, battles, wars, sports events, etc.
  • FAC - Buildings, airports, highways, bridges, etc.
  • GPE - Countries, cities, states.
  • NORP - Nationalities or religious or political groups.
  • MONEY - Monetary values, including unit.
  • ORDINAL - "first", "second", etc.
  • ORG - Companies, agencies, institutions, etc.
  • PERCENT - Percentage, including "%".
  • PERSON - People, including fictional.
  • PHONE - (coming soon) Phone number.
  • QUANTITY - Measurements, as of weight or distance.
  • SSN - Social Security number
  • TIME - Named documents made into laws.
  • ZIP - (coming soon) Zip Code (if not part of an Address)

In the audio they are replaced with silence and in the transcript they are replaced with a string specified when making the API request.

This feature is supported both in Cloud and on the Edge (on-prem).

Two typical use cases are:

  • Enable redaction as part of normal processing, of e.g. call center calls
  • Do a bulk processing of previously underacted audio in storage to achieve compliance. Combined with low per minute price of Voicegain APIs, this allows our customers to cost effectively process large qualities of audio data.  


Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Sign up for an app today
* No credit card required.

Enterprise

Interested in customizing the ASR or deploying Voicegain on your infrastructure?

Contact Us → 
Voicegain - Speech-to-Text
Under Your Control