Our Blog

News, Insights, sample code & more!

Enterprise
Announcing Voicegain Casey, a Generative AI Voice Agent for Health Plan and TPA Call Centers

Voicegain is excited to announce the launch of Voicegain Casey, a payer focused AI Voice Agent that transforms the end-to-end call center experience with the power of generative AI. Voicegain Casey is a software suite of the following three Voice AI SaaS applications that helps a health plan or TPA call center improve operational efficiency and increase the CSAT and NPS (Net Promoter Score):

A. Voicegain Casey - Suite of Generative AI-Powered SaaS Applications

1. AI Voice Agent:

The AI Voice Agent replaces a touch-tone IVR with a modern LLM-powered human-like conversational voice experience. The AI Voice Agent can answer all calls that are received at a Health Plan or TPA Call center. It engages callers in a natural conversation and automates routine telephone calls like Claims Status, benefits inquiries and eligibility verifications. There is a very compelling business case to automate Provider phone calls in Health Plan and TPA call centers. Voicegain Casey has been specifically designed and developed for this goal. The AI Voice Assistant is also trained to perform HIPAA Validation and triaging of calls. So if the AI has not been trained to answer a specific question, it routes the call to the call center for live assistance.

2. CCaaS Integrated AI Co-Pilot : 

Voicegain AI Co-Pilot is a browser extension that runs as a browser side-panel of Call Center Agent's CRM. This Co-Pilot is integrated with the Contact Center/CCaaS platform used in the Call Center. When a call transferred by the AI Voice Agent is eventually answered by a Live Agent, all the information collected by the AI Voice Assistant is presented as a "Screen-Pop" on the Desktop of the Live Agent (also referred to as CTI). This CTI/Screen pop feature ensures that the front-line call center staff can continue the conversation from where the AI Voice Agent left off. In addition to this Screen-Pop, the AI Co-Pilot also guides the front-line call center staff in real-time by listening, transcribing and analyzing the conversation and providing real-time guidance . The AI Co-Pilot also generates a summary of the conversation within five seconds of the completion of the call. This automated summarization easily saves 1-2 mins of wrap-up time or after call work which is very common in these health plan and TPA call centers.

3. AI QA & Coach:

Voicegain AI QA & Coach is a browser-based AI SaaS application that is used by Team-leaders, QA Call Coaches/Analysts and Operations Managers in a call center. This AI SaaS app records, transcribes and analyzes the entire conversation. It measure the sentiment of the callers and computes the QA score. Voicegain uses the latest open-source reasoning LLMs (like LLAMA 3, Gemma) and closed-source reasoning models like o-3 from Open AI. With the power of modern reasoning models, almost the entire QA score-card (approximately 80% of the questions) can be easily answered using AI. This SaaS App also provides a database of all whole-call-recordings of the entire conversation of the customer - which includes the AI Voice Assistant part, the transfer to the specific Call Center queue and eventually the entire conversation between the Live Agent and the Caller.

B. Integrations

Voicegain Casey requires the following 3 key integrations to help with automation and real-time assistance.

1. Contact Center Platform/CCaaS Platform

Voicegain Casey integrates with modern CCaaS platforms. Current Integrations include Aircall, Five9 and Genesys Cloud. Planned integrations include Ringcentral, NICE CXOne and Dialpad.

2. CRM Software

Voicegain Casey integrates with the CRM software of the Health plan or the TPA. This can be an off-the-shelf CRM like Zendesk or Salesforce. It can also be a proprietary/homegrown CRM. As long as the CRM is a browser-based SaaS application, this should not be an issue. Voicegain Casey AI Co-Pilot is a browser-extension that is installed in the side-panel of the same browser tab as the CRM. At the end of the call, the summary of the call is automatically generated and available on the browser extension within 5 seconds of the end of the call.

3. Eligibility & Claims

Voicegain Casey needs access to the member eligibility and claims data.

C. Demo and Additional Information

For further information on Voicegain Casey, including a demo, please visit this link

D. Give us a shout!

If you would like to understand Voicegain Casey in more detail or if you would prefer a detailed product demo over a Zoom video call, please do not hesitate to send us an email. You can reach us at sales@voicegain.ai or support@voicegain.ai

Read more → 
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Voicegain releases Telephony Bot APIs for IVRs and Voice Bots
Voice Bot
Voicegain releases Telephony Bot APIs for IVRs and Voice Bots

Update Dec 2020: We have renamed RTC Callback APIs to Telephony Bot APIs to better reflect how developers can use these APIs  - which is build Voice Bots, IVRs.


If you have wanted to voice enable your Chatbot or build your own Telephony based Voice Bot or a Speech-enabled IVR, Voicegain has built an API that is really cool  - Release 1.12.0 of Voicegain Speech-to-Text Platform now includes Telephony Bot APIs (formerly called RTC Callback APIs in the past).

Voicegain Telephony Bot APIs enables any NLU/Bot Framework to easily integrate with PSTN/telephony infrastructure using either (a) SIP INVITE of Voicegain platform from a CPaaS platform of your choice or (b) purchasing a phone number directly from Voicegain portal and pointing it to your Bot. You can then use these callback style APIs to (i) play prompts (ii) recognize speech utterances or DTMF digits (iii) allow for barge-in and several other exciting features. We offer sample code that will help you easily integrate a Bot Framework of your choice to our Telephony Bot APIs.


If you do not have a Bot Framework, thats okay too. You can write the logic in any backend programming language (Python, Java or Node.JS) that can serialize responses in a JSON format and interact with our Callback style APIs.  Voicegain also offers a declarative YAML format to define the call flow and you can host this YAML file logic and interact with these APIs. Developers can also code and deploy the application logic in a server-less computing environment like Amazon Lambda.


Many enterprises - in banking, financial services , health care, telecom and retail  - are stuck with legacy telephony based IVRs that are approaching obsolescence.

Voicegain's Telephony Bot APIs provide a great future-proof upgrade path for such enterprises. Since these APIs are based on web callbacks, they can interact with any backend programming language. So any backend web developer can design, build and maintain such apps.


Why should you use Telephony Bot APIs?

With Telephony Bot APIs, integration becomes much simpler for developers.

1) You can SIP INVITE the Voicegain Speech-to-Text/ASR platform to a SIP/RTP session for as long as is needed. We support SIP integration with CPaaS platforms like Twilio, Signalwire and Telnyx. We also support CCaaS platforms like Genesys, Cisco and Avaya.

2) We also support direct phone number ordering and SIP Trunks from the Voicegain Web Console. More integrations will be added soon.

Telephony Bot APIs are based on web callbacks where the actual program/ implementation is on the Client side and the Voicegain Telephony Bot APIs  define the Requests and Responses. The meaning of Requests and Responses is reversed w.r.t what you would see in a normal Web API:

  • Responses provide the commands, while
  • Requests provide the outcome of those commands.

Illustrated example of Telephony Bot API in action

Below is an example of a simple phone call interaction which is controlled by Telephony Bot API. The sequence diagram shows 4 callbacks during a toy survey call:

  • Req 1: Phone Call arrived
  • Resp 1: Say: "Welcome"
  • Req 2: Done saying "Welcome"
  • Resp 2: Ask: "Are you happy", bind reply to happy var
  • Req 3: Caller's answer was "yes", happy=YES
  • Resp 3: Disconnect
  • Req 4: Disconnected
  • Resp 4: We are done


Currently supported actions

Telephony Bot API supports 4 types of actions:

  • output: say something - TTS with a choice of 8 different voices is supported
  • input: ask question - both speech input and DTMF are supported. For speech input you can use GRXML, JSGF or built-in grammars
  • transfer: transfer a call to a phone destination
  • disconnect: end the call

Wait, there is more

Each call can be recorded (two channel recording) and then transcribed. The recording and the transcript can be accessed from the portal as well as via the API.

Roadmap

Features coming soon:

  • record Callback action - you can use it to implement voicemail or record other types of messages
  • transfer to a sip destination
  • input - allow choice of large vocabulary speech-to-text in addition to grammars - use the captured text in your NLU
  • answer call at a sip address - instead of a phone number
  • WebRTC support
  • outbound dialing

Read more → 
Python SDK Available
Developers
Python SDK Available

As of August 5th, 2020, programming in Python against Voicegain Speech-to-Text (STT) API got even easier with the release of official voicegain-speech package to  Python Package Index (PyPI) repository.


The SDK package is available at: https://pypi.org/project/voicegain-speech/

The SDK source code is available at: https://github.com/voicegain/python-sdk


This package wraps Voicegain Speech-to-Text Web API. A preview of the API spec can be found at: https://www.voicegain.ai/api

Full API spec documentation is available at: https://console.voicegain.ai/api-documentation


The core APIs are for Speech-to-Text, either transcription or recognition (further described below).Other available APIs include:

  • RTC Callback APIs which in addition to speech-to-text allow for control of RTC session (e.g., a telephone call).
  • Websocket APIs for managing broadcast websockets used in real-time transcription.
  • Language Model creation and manipulation APIs.
  • Data upload APIs that help in certain STT use scenarios.
  • Training Set APIs - for use in preparing data for acoustic model training.
  • GREG APIs - for working with ASR and Grammar tuning tool - GREG.

Transcribe API

/asr/transcribeThe Transcribe API allows you to submit audio and receive the transcribed text word-for-word from the STT engine. This API uses our Large Vocabulary language model and supports long form audio in async mode.

The API can, e.g., be used to transcribe audio data - whether it is podcasts, voicemails, call recordings, etc. In real-time streaming mode it can, e.g., be used for building voice-bots (your the application will have to provide NLU capabilities to determine intent from the transcribed text).

The result of transcription can be returned in four formats:

  • Transcript - Contains the complete text of transcription
  • Words - Intermediate results will contain new words, with timing and confidences, since the previous intermediate result. The final result will contain complete transcription.
  • Word-Tree - Contains a tree of all feasible alternatives. Use this when integrating with NL postprocessing to determine the final utterance and its meaning.
  • Captions - Intermediate results will be suitable to use as captions (this feature is in beta).

Recognize API

/asr/recognizeThis API should be used if you want to constrain STT recognition results to the speech-grammar that is submitted along with the audio (grammars are used in place of the large vocabulary language model).

While having to provide grammars is an extra step (compared to Transcribe API), they can simplify the development of applications since the semantic meaning can be extracted along with the text.

Another advantage of using grammars is that they can ignore words in the utterance that are outside of grammar - still delivering recognition although with lower confidence.

Voicegain supports grammars in the JSGF and GRXML formats – both grammar standards used by enterprises in IVRs since early 2000s.The recognize API only supports short form audio - no more than 60 seconds.


Read more → 
CORS Support Added in 1.9.0
Developers
CORS Support Added in 1.9.0

We have recently added support  for CORS (Cross Origin Resource Sharing) in our APIs. This was in response to our customers asking for it in order to enable them building Speech-to-Text web applications with minimal effort. By making web API requests to Voicegain Speech API directly from their web clients the application can be simpler and more efficient.

Examples of simple applications that our customers are implementing this way are: microphone input capture and transcription (e.g. to capture and transcribe meeting notes), or offline-audio file transcription.

Users have full control, via security settings, over which Origins should be allowed to make the CORS requests.

Read more → 
Competitive Advantage of Custom Acoustic Models
Model Training
Competitive Advantage of Custom Acoustic Models

There is no doubt that there is a lot of value in the datasets that are used to train AI models. That is one of the reasons why Google offers their Speech-to-Text service at two price points, one with 'data logging' and and one without, see table below.



However at Voicegain, our speech-to-text platform does not capture or use any customer data (while still being able to offer low ASR pricing).

Moreover, Voicegain platform enables our customers to use their data to train their own dedicated & custom Acoustic Models. As result, our customers benefit in two ways:

  • The accuracy of these custom acoustic model(s) is several % higher compared to our base models.
  • Custom models are licensed exclusively to the clients and are not shared with anyone (neither Voicegain, nor any other Voicegain customers), so this higher accuracy translates directly into competitive advantage.

By retaining ownership of the data and the custom acoustic models, our customers benefit from higher ASR accuracy in general, and higher accuracy than their potential competitors in particular.

Read more → 
How AI powered Speech can boost Contact Center BPO topline?
Insights
How AI powered Speech can boost Contact Center BPO topline?

Senior leadership teams at most global contact center outsourcers are constantly under pressure. They need to have a laser like focus on key metrics, SLAs and people to manage their businesses. They are increasingly managing a global distributed business that is both labor intensive and technology intensive. And they have to do all of this with increasingly tight margins.

Despite being measured on metrics like CSAT and NPS, a lot of the value that an outsourcer delivers to its clients is often hard to quantify. And too often the price realized by the outsourcer does not capture the value and quality an outsourcer provides.

Two Ideas to pivot into high value SaaS offerings

In this article I would like to propose two new innovative ideas that can help Contact Center BPOs pivot into new SaaS (Software-as-a-Service) revenues.

  1. CX Speech Insights Service: Develop a new branded realtime CX insights service based on speech analytics powered by deep learning.
  2. CX Speech Automation Service: Build new voice self-service applications that can automate some of the common customer care scenarios.

Both these offerings can be offered to the clients using a Software-as-a-Service (SaaS) based business model in conjunction with the traditional agent side of the business.


Both these SaaS offerings leverage some of the key strengths that BPOs have: Deep domain expertise, in depth understanding of customer issues and technology infrastructure that leverages both

1. CX Speech Insights Service

Contact centers have a treasure trove of audio data. Every day associates are handling thousands of calls across a wide variety of topics. While outsourcers use legacy speech analytics vendors, the traditional use has been to analyze a sample of calls to assist in the Quality Assurance function. Net-net, it is viewed as a cost center both for the outsourcers and their clients.

However there is a massive untapped opportunity to mine and extract insights from such audio data for uses well beyond quality assurance. Such insights may be relevant to stakeholders in Product and Marketing teams of the clients. This can open up new non-traditional product and marketing budgets for BPOs.

2. CX Speech Automation Service

Outsourcers have an in-depth deeper understanding of current topics that customers are calling about. They have unique and current insights into which categories of calls are actually driving volumes. With the right tools, methodologies and personnel, outsourcers can build and offer new innovative speech self service applications that may automate parts of calls. With the right technologies, outsourcers can move seamlessly between agent assisted calls and automated self-service interactions.

The Foundation: Deep Neural Networks & custom acoustic models

The foundation for these SaaS offerings are modern Deep Neural Network (DNN) based Speech to Text platforms.

The old speech to text were technologies were based on traditional statistical models (called HMMs and GMMs). They were limited in their ability to train on specific industry jargons and accents. But a DNN based platform has the following advantages

  1. A DNN based platform can be easily trained to recognize unique words/jargon, accents and noisy backgrounds. Training the models increases the quality of recognition and makes it accurate enough to deliver real value to client stakeholders.
  2. A industry or customer specific acoustic model has the potential to create intellectual property for the BPO.
  3. A DNN platform can be used equally well both in the up front automation part and in the analytics and notification service. There are benefits from using the same platform for both offerings.

For more info, please contact us at info@voicegain.ai.


Read more → 
Speech-to-Text Accuracy Benchmark - June 2020 Results
Benchmark
Speech-to-Text Accuracy Benchmark - June 2020 Results

[UPDATE - October 31st, 2021:  Current benchmark results from end October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced.]

"What is the accuracy of your recognizer?"

That is the question that we are frequently asked by our potential customers. Often we answer "that depends" and we get a feeling that the other side thinks "must be really bad if they do not give a straight answer". However, "that depends" is really the right answer. Accuracy of automated speech recognition (ASR) depends on the audio in many ways and the effect is not small. Basically, accuracy can be all over the place depending on factors like:

  • Does the speech follow proper grammar or is the speaker making things up as they are saying it. Prepared speeches will have better, i.e. lower WER (word error rate) scores compared to unscripted speech.
  • What is the subject of the speech. Rare and obscure words or word combinations, like e.g. people or other names, will make life difficult for the NLM (natural language model).
  • Are there more than one speakers? Are they constantly switching over or even talk over one another.
  • Is there music in the background - very common for youtube productions.
  • Is there background noise? What is the type of noise?
  • Are parts of the speech audio unusually slow or fast?
  • Is there room reverb or echo in the recording?
  • Is the recording volume very low. Are there variations in the recording volume (e.g. recorder placed on one edge of a very long table)
  • Is the recording quality bad, e.g., due to a codec or insane archival compression levels.
  • etc. etc.

Testing / Benchmarking Speech-to-Text Accuracy

Because the accuracy or Word Error Rate questions are somewhat meaningless without specifying the type of speech audio, it is important to do testing when choosing a speech recognizer. As a test set, one would choose a set of audio files, that accurately represent the spectrum of the speech that will be encountered by the recognizer in the expected use cases.  For each speech audio file from the set one would obtain a gold/reference transcript that is 100% accurate. After that, things can be automated -- transcribe each file on the recognizers being evaluated, compute WER against the reference for each of the generated transcripts, and collate the results. The combined results will present a clear picture of how the recognizers perform on the specific speech audio that we care about. If you are going to repeat this process often, e.g., to evaluate new candidates on the recognizer marker, it is good to standardize the test set, basically creating a repeatable benchmark that can be referenced in the future.

Our benchmark

The benchmark results that we are presenting here are somewhat different than the use-case driven tests or benchmarks. Because we are building a general recognizer for an unspecified use case, we intentionally decided to use a very broad set of audio files.  Rather than collecting the test files ourselves, we decided to use the data set described in "Which Automatic Transcription Service is the Most Accurate? — 2018" from September 2018 by Jason Kincaid. The article presents a comparison of Speech Recognizers from various companies using a set of 48 YouTube videos (taking 5 minutes of audio from each of the videos). By the time we decided to do a retest of Jason's benchmark, 4 videos were no longer accessible, so our benchmark presented here uses data from only 44 videos.

We compared the results presented by Jason to the results from the big 3 - Google, Amazon, and Microsoft - recognizers as of June 2020. Of course, we also included our Voicegain recognizer, because we wanted to see how we stacked against those. All the tested recognizers use Deep Neural Networks. The Voicegain speech recognizer ran on the Google Cloud Platform using Nvidia T4 GPUs. All recognizers were run with default settings and no hints nor user language models were used.

It is important to mention that none of the benchmark files are included in the training set that Voicegain uses. Neither is other audio from the speakers from the benchmark files, nor the same content but spoken by other speakers.

So what are the results? Who has the best recognizer?

Again, the best recognizer is not the right question, because it all depends on your actual speech audio it is used on. But the key results from testing on the 44 files are as follows:

  • Every recognizer has improved. The biggest improvement in median WER was by Microsoft Speech to Text.
  • The best recognizer in our data set was Google Speech to Text - Enhanced (video), but the new Microsoft Speech to Text is very close second.
  • Taking price into consideration, Microsoft might be declared Best Buy
  • Voicegain recognizer is definitely Best Value.
  • Google Speech to Text - Standard, although somewhat improved, is still clearly the worst performing on the data set.    
  • The single bad data point for Google Enhanced (video) is real. We ran repeated test on the file and got the same result. The old Google Enhanced recognizer did not have problems with that file.

How does the Voicegain recognizer stack up?

Here are our thoughts and some details:

  • Up until October 2019 the training set we were using to train our recognizer was relatively unchanged. Moreover, our training set was heavily biased towards some categories of speech audio. You can see that in the chart, e.g., by the fact that our best results were better than old Amazon Transcribe but our worst results were quite a bit more worse than Amazon Transcribe.
  • Based on the first results from the benchmark we analyzed what kind of audio gave us trouble, and collected data with the particular characteristics but sourced very broadly (to avoid training to  benchmark) to make our recognizer more robust.  That effort paid off and you can see that now the Voicegain recognizer WER spread is much tighter and overall is now very close to new Amazon Transcribe.
  • Overall Voicegain is the most improved recognizer. Just over 6 months ago we were just better than Google Standard, but now we are closing on Amazon Transcribe. This is result of both changes to the Neural Network architecture and a large increase in the training data set hours.
  • If you look into the details, Voicegain recognizer was better than new Amazon on 11 out of 44 files, better than Google Video on 5 files, and better than Microsoft also on 5 out of 44 files.
  • If you consider the price, we think that Voicegain presents a great value. We have talked to customers who were not doing large scale transcription due to large cost of the 3 big platforms and our low pricing suddenly made new uses of transcription viable.

We welcome anyone to test our platform and see how it performs on speech audio types that matter for your use cases.

Any software that can help me in testing recognizers?

We have Open Sourced the key component of our benchmark suite, the transcribe_compare python utility. It is available here: https://github.com/voicegain/transcription-compare under MIT license.

It is useful for automatic benchmarking but it can also output data to an html file which can be viewed in a web browser. We use it often this way to do a manual review of the transcription errors or differences in errors between two recognizers or recognizer versions.

How can I test drive Voicegain?

If you are building an app that requires transcription, sign up today for a developer account and get $50 in free credits  (~5000 minutes of platform use). You can check out our accuracy add test our APIs. Instructions to sign up for a developer account are provided here.

3. If you want to make Voicegain your own AI Transcription Assistant, click here. You can take Voicegain to meetings, webinars, talks, lectures and more.

We expect to catch up soon

We are still in the middle of extensive data collection effort and the training is not over yet. We are seeing continuing improvement in our recognizer, with the new improved versions of the acoustic model deployed to production about twice a month. We will report updated benchmark results on our blog in a few months.

User-Customized Acoustic Model

We have another blog post planned that is going to quantify the benefit one can expect from using additional user data to train the acoustic model used in the recognizer. We have selected a large data set with a very specific English accent that currently has higher WER. We will report on the impact on WER of training on such a data set. We will quantify the improvement based on the size of the data set and the duration of training.

Voicegain provides easy to use tools that allow users to build their own custom acoustic models. This upcoming post will provide a clear insight as to what improvements to expect and how much data is needed to make a difference in reducing WER.

References

Contact Us

If you have any questions regarding this article or our platform and recognizer you can contact us at info@voicegain.ai


Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Category 1
This is some text inside of a div block.
by Jacek Jarmulak • 10 min read

Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.

Read more → 
Sign up for an app today
* No credit card required.

Enterprise

Interested in customizing the ASR or deploying Voicegain on your infrastructure?

Contact Us → 
Voicegain - Speech-to-Text
Under Your Control