Our Blog

News, Insights, sample code & more!

ASR,Benchmark
Speech-to-Text Accuracy Benchmark - June 2022

It has been over 7 months since we published our last speech recognition accuracy benchmark. Back then the results were as follows (from most accurate to least): Microsoft and Amazon (close 2nd), then Voicegain and Google Enhanced, and then, far behind, IBM Watson and Google Standard.

Since then we have obtained more training data and added additional features to our training process. This resulted in a further increase in the accuracy of our model.

As far as the other recognizers are concerned:

  • Microsoft and Amazon both improved, with Microsoft improving a lot on the more difficult files from the benchmark set
  • Google has released a new model, "latest-long", which is quite a bit better than Google's previous best model, Video Enhanced. The accuracy of Video Enhanced stayed pretty much unchanged.

We have decided to no longer report on Google Standard and IBM Watson, which were always far behind in accuracy.


Methodology

We repeated the test using a methodology similar to before: we used 44 files from the Jason Kincaid data set and 20 files published by rev.ai, and removed all files where none of the recognizers could achieve a Word Error Rate (WER) lower than 25%.
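For readers new to the metric, here is a minimal sketch of how WER is commonly computed: the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The function names and the lowercase/whitespace normalization are assumptions for illustration; this is not the scoring script used for the benchmark, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def keep_file(wers_by_recognizer: dict, threshold: float = 0.25) -> bool:
    """Keep a file only if at least one recognizer scored below the threshold."""
    return min(wers_by_recognizer.values()) < threshold
```

With this definition, a file is dropped from the set only when every recognizer scores 25% WER or worse on it, which matches the filtering rule described above.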

This time only one file was that difficult. It was a bad quality phone interview (Byron Smith Interview 111416 - YouTube).

The Results

You can see boxplots with the results above. The chart also reports the average and median Word Error Rate (WER).

All of the recognizers have improved (the Google Video Enhanced model stayed much the same, but Google now has a new recognizer that is better).

Google latest-long, Voicegain, and Amazon are now very close together, while Microsoft is better by about 1%.

Best Recognizer

Let's look at the number of files on which each recognizer was the best one.

  • Microsoft was best on 35 out of the 63 files
  • Amazon was best on 15 files (note that in the October 2021 benchmark Amazon was best on 29 files).
  • Voicegain was close behind Amazon by being best on 12 audio files
  • Google latest-long was best on 4
  • Google Video Enhanced wins a participation trophy by being best on 1 file, which was a very easy "The Art of War by Sun Tzu Full" Librivox Audiobook - WER of 1.79%

Note that the numbers do not add up to 63 because there were a few files where two recognizers had identical results (to two decimal places).
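A minimal sketch of how such a tally can be produced, including the tie handling that makes the counts exceed the number of files. The data layout (file name mapped to per-recognizer WER in percent) and the function name are assumptions for illustration, not the benchmark's actual tooling.

```python
from collections import Counter


def count_best(results: dict) -> Counter:
    """Count, per recognizer, the files on which it had the lowest WER.

    `results` maps file name -> {recognizer name: WER in percent}.
    WERs are compared after rounding to two decimal places, so a tie
    credits every recognizer that shares the best score.
    """
    wins = Counter()
    for wers in results.values():
        best = min(round(wer, 2) for wer in wers.values())
        for name, wer in wers.items():
            if round(wer, 2) == best:
                wins[name] += 1
    return wins
```

Because a tied file is credited to each recognizer that reached the best score, the per-recognizer counts can sum to more than the number of files.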

Improvements over time

We have now run the same benchmark 4 times, so we can draw charts showing how each of the recognizers has improved over the last 1 year and 9 months. (Note that for Google the latest result is from the latest-long model; the other Google results are from Video Enhanced.)

You can clearly see that Voicegain and Amazon started quite a bit behind Google and Microsoft but have since caught up.

Google seems to have the longest development cycles, with very little improvement from Sept. 2021 until very recently. Microsoft, on the other hand, releases an improved recognizer every 6 months. Our improved releases are even more frequent than that.

As you can see, the field is very close and you get different results on different files (the average and median do not paint the whole picture). As always, we invite you to review our apps, sign up, and test our accuracy with your data.

Out-of-the-box accuracy is not everything

When you have to select speech recognition/ASR software, there are other factors beyond out-of-the-box recognition accuracy. These factors are, for example:

  • Ability to customize the Acoustic Model - The Voicegain model may be trained on your audio data - we have several blog posts describing both research and real use-case model customization. The improvements can vary from several percent in more generic cases to over 50% in some specific cases, in particular for voicebots.
  • Ease of integration - Many Speech-to-Text providers offer limited APIs, especially for developers building applications that require interfacing with telephony or on-premise contact center platforms.
  • Price - Voicegain is 60%-75% less expensive compared to other Speech-to-Text/ASR software providers while offering almost comparable accuracy. This makes it affordable to transcribe and analyze speech in large volumes.
  • Support for On-Premise/Edge Deployment - The cloud Speech-to-Text service providers offer limited support to deploy their speech-to-text software in client data-centers or on the private clouds of other providers. On the other hand, Voicegain can be installed on any Kubernetes cluster - whether managed by a large cloud provider or by the client.

Take Voicegain for a test drive!

1. Click here for instructions to access our live demo site.

2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.

3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.

Read more → 
Benchmark
Transcription
Announcing Voicegain Transcribe, an AI based assistant for transcription of meetings

Today, we are really excited to announce the launch of Voicegain Transcribe. Transcribe is an AI based transcription assistant for recording and transcription of in-person and web meetings, live video events and webinars. Our goal is to empower users to focus on their meetings/events and leave the note taking to us.

Voicegain Transcribe is built on top of our highly accurate deep-learning-based ASR. It is powered by the same Speech-to-Text APIs that all our developer/platform customers use today. Our out-of-the-box accuracy of 89% is on par with the very best. 


Currently there are 3 main ways you can use Voicegain Transcribe:

Voicegain Transcribe, an app to record and transcribe meetings, live video and webinars, is now available
Screenshot of Voicegain Transcribe on first time log-in

1. Transcribe Web Meeting (using browser sharing of audio)

Users can use our browser sharing feature to record & transcribe audio that is playing on any tab on a Chrome or Edge browser. Any meeting platform that allows a browser based client is supported. Some prominent meeting platforms include Google Meet, BlueJeans, Webex and Zoom.

If users are on a Windows-based laptop/desktop, browser sharing also supports capturing audio from the desktop client app of the meeting platform (e.g. Zoom or Microsoft Teams). On macOS, sharing audio from a desktop app with Voicegain Transcribe is not supported.

2. Microphone Capture

This allows users to record and transcribe anything that is captured by the microphone on their laptop/desktop. Users can turn on microphone capture for an in-person meeting, lecture or event. They can also just let a web meeting or event play on their speaker and have the microphone capture what is being played.

3. Upload Audio Recordings

Users may also upload pre-recorded audio files of their meetings, podcasts or calls and generate a transcript. We support over 40 different formats (including mp3, mp4, wav, aac and ogg). Voicegain supports speaker diarization, so we can separate speakers even on a single-channel audio recording.

Languages Supported

Currently we support English and Spanish. More languages are on our roadmap: German, Portuguese and Hindi.

Advanced Features

Voicegain Transcribe also supports the following advanced Features.

a. Projects

Users can organize their meeting recordings and audio files into different projects. A project is like a workspace or a folder.

b. Named Entities & Keywords

Users can highlight named entities (dates, currency, addresses, email id) in their meeting transcript.

c. PII Redaction

Users can also mask - in both text and audio - any personally identifiable information.

Coming Soon

We are adding close integration with the Zoom meeting platform. With this, we can capture the actual speaker labels directly from Zoom. This will address errors related to diarization.

We are also adding a Chrome extension that will make it much easier to record and transcribe web meetings.

Get Started for Free today!

By signing up today, you will be placed on our forever Free Plan, which makes you eligible for 120 minutes of free meeting transcription every month. Once you are satisfied with our accuracy and our user experience, you can easily upgrade to a Paid Plan.

If you have any questions, please email us at support.transcribe@voicegain.ai

Read more → 
Transcription
Languages
New Languages Available in Voicegain Speech-to-Text

[Updated: 5/27/2022]

In addition to the current support for English, Spanish, Hindi, and German languages in its Speech-to-Text platform, Voicegain is releasing support for many new languages over the next couple of months.

Languages Generally Available right now
  • English - blend of mainly US, UK, Irish accents - includes punctuation and digit/time/currency formatting support
  • Spanish - focused on Latin America accents - includes punctuation and digit/time/currency formatting support
  • Hindi
  • German

You can access these languages right now from the Web Console, via our Transcribe App, or via the API.

Languages available right now in Alpha early access program

Upon request we will make these languages available for your testing. Generally, they can be available within hours from receiving a request. Please contact us at support@voicegain.ai

  • Portuguese - blend of European and Brazilian Portuguese
  • Polish
  • Korean
  • Dutch
  • Ukrainian
What does "Alpha early access" mean?

The Alpha early access models differ from full-featured production models in the following ways:

  • They are not good at rejecting background noise, music, etc.
  • The vocabulary may be limited - they may not be good at recognizing names of products, people, places, etc. Generally the vocabulary is the core everyday vocabulary of a given language.
  • They will not be good at recognizing heavy or unusual accents.
  • Punctuation and capitalization are not available.
  • Formatting of digits, time, dates, currencies is not available.
  • For languages not using Latin alphabet, there could be occasional glitches in the characters in the transcript.
  • Initially most of those models are available in offline/batch mode only. We are working on training the real-time/streaming models.

As alpha models are being trained on additional data, their accuracy will improve. We are also working on punctuation, capitalization, and formatting of each of those models.

Languages that will be available in the first half of June
  • French - blend of European French (Metropolitan) and Canadian French (Quebec)

We will update this post as soon as these languages are available in the Alpha early access program.

Languages available by end of June
  • Arabic
  • Italian
Do not see a language that you need?

Since our language models are created exclusively with End-to-End Deep Learning, we can perform transfer learning from one language to another, and quickly support new languages and dialects to better meet your use case. Don’t see your language listed above? Contact us at support@voicegain.ai, as new languages and dialects are released frequently.


Read more → 
Languages
Model Training
Acoustic Model Training delivers big gains in ASR Accuracy

This is a case study of training the acoustic model of a deep learning based Speech-to-Text/ASR engine for a Voice Bot that takes orders for Indian food.

The Problem

The client approached Voicegain as they experienced very low accuracy of speech recognition for a specific telephony based voice bot for food ordering.

The voice bot had to recognize Indian food dishes with acceptable accuracy, so that the dialog could be conducted in a natural conversational manner rather than having to fall back on rigid call flows such as enumerating through a list.

The spoken responses would be provided by speakers of South Asian Indian origin. This meant that in addition to having to recognize unique names, the accent would be a problem too.

The out-of-the-box accuracy of Voicegain and other prominent ASR engines was considered too low. Our accuracy was particularly low because our training datasets did not have any examples of Indian dish names spoken with heavy Indian accents.

With the use of Hints, the results improved significantly and we achieved an accuracy of over 30%. However, 30% was far from being good enough.

The Approach

Voicegain first collected relevant training data (audio and transcripts) and trained the acoustic model of our deep learning based ASR. We have had good success with it in the past, in particular with our latest DNN architecture, see e.g. post about recognition of UK postcodes.

We used a third party data generation service to initially collect over 11,000 samples of Indian food utterances - 75 utterances per participant. The quality varied widely, but that is good because we think it reflected well the quality of the audio that would be encountered in a real application. Later we collected an additional 4600 samples.

We trained two models:

  • A "balanced" model - where the Food Dish training data was combined with our complete training set to train the model.
  • A "focused" model - where the Food Dish data was combined with just a small subset of our other training data set.

We also first trained on the 10k set, collected the benchmark results, and then trained on the additional 5k data.

We randomly selected 12 sets of 75 utterances (total 894 after some bad recordings were removed) for a benchmark set and used the remaining 10k+ for training. We plan to share a link to the test data set here in a few days.

The Results - A 75% improvement in accuracy!

We compared our accuracy against Google and Amazon AWS both before and after training and the results are presented in a chart below. The accuracy presented here is the accuracy of recognizing the whole dish name correctly. If one word of several in a dish name was mis-recognized, then it was counted as a failure to recognize the dish name. We applied the same methodology if one extra word was recognized, except for additional words that can easily be ignored, e.g., "a", "the", etc. We also allowed for reasonable variances in spelling that would not introduce ambiguity, e.g. "biryani" was considered a match to "biriyani".
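The scoring rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark script: the set of ignorable words beyond "a" and "the" and the variant table beyond the "biriyani" example are assumptions, as are all function names.

```python
# Words that may be added without counting as an error.
# Assumption: the post names "a" and "the"; "an" is added for illustration.
IGNORABLE = {"a", "an", "the"}

# Spelling variants mapped to one canonical form.
# Only the biryani/biriyani pair comes from the post.
SPELLING_VARIANTS = {"biriyani": "biryani"}


def normalize(text: str) -> list:
    """Lowercase, canonicalize spelling variants, drop ignorable words."""
    words = [SPELLING_VARIANTS.get(w, w) for w in text.lower().split()]
    return [w for w in words if w not in IGNORABLE]


def dish_match(expected: str, recognized: str) -> bool:
    """True only if the whole dish name is recognized correctly."""
    return normalize(expected) == normalize(recognized)


def dish_accuracy(pairs) -> float:
    """Fraction of (expected, recognized) pairs that match as whole names."""
    pairs = list(pairs)
    return sum(dish_match(e, r) for e, r in pairs) / len(pairs)
```

Under this rule a single wrong or extra content word fails the whole utterance, which is why the numbers here are lower than a word-level metric would suggest.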

Note that the tests on the Voicegain recognizer were run with various audio encodings:

  • PCMU 8kHz - telephony-quality audio
  • L16 16kHz - closer to the audio quality you would expect from most WebRTC applications; delivers better accuracy

Also, the AWS test was done in offline mode (which generally delivers better accuracy), while Google and Voicegain tests were done in streaming (real-time) mode.

We did a similar set of tests with the use of hints (we did not include AWS because our test script did not support AWS hints at that time).



This shows that huge benefits can be achieved by targeted model training for speech recognition. For this domain, which was new to our model, training increased accuracy by over 75 percentage points (from 10.18% to 86.24%).

As you can see, after training we exceeded the Speech-to-Text accuracy of Google by over 45 percentage points (86.24% vs 40.38%) when no hints were used. With the use of hints we were better than Google STT by about 26 percentage points (87.58% vs 61.30%).

We examined cases where mistakes were still made and they fell into 3 broad categories:

  • Recordings missing an end part of the last word. That is because the stop record button was pressed while the last word was still being spoken. The recorded part of the last word is generally recognized ok, e.g., instead of "curry" we recognize "cu". (We plan to manually review the benchmarks set and modify the expected values according to what is being said and then recompute the accuracy numbers.)
  • Really bad quality recordings - where the volume of the audio is barely over the background noise level. In this case we usually missed some words or parts of words. This also explains why the hints do not give more improvement - there are no sufficient quality partial hypotheses that the hints could boost.
  • Loud background speech noise. In this case we usually recognized additional words beyond what was expected.

We think the first type of problem can be overcome by training on additional data, and that is what we are planning to do, hoping to eventually get accuracy close to 85% (for L16 16kHz audio). The second type could potentially be resolved by post-processing in the application logic if we return the dB values of the recognized words.

Interested?

If your speech application also suffers from low accuracy and using hints or text-based language models is not working well enough, then acoustic model training could be the answer. Send us an email at info@voicegain.ai and we can discuss doing a project to show how a Voicegain-trained model can achieve the best accuracy on your domain.

Read more → 
Model Training
Benchmark
Getting high Speech Recognition Accuracy on Alphanumeric Sequences: A Case Study with UK Zip Codes

It is common knowledge among AI/ML developers working with speech recognizers and ASR software that getting high accuracy on sequences of alphanumerics in real-world applications is a very difficult task. Examples of alphanumeric sequences are serial numbers of various products, policy numbers, case numbers or postcodes (e.g. UK and Canadian).

Some reasons why ASRs have a hard time recognizing alphanumerics are:

  • some letters sound very similar, e.g. P and B, T and D
  • A and 8 sound very similar
  • combinations of letters and digits sound like words, e.g. "E Z" sounds like "easy", "B 9" sounds like "benign", etc.

Another reason why the overall accuracy is poor is simply that errors compound: the longer the sequence, the more likely it is that at least one symbol will be misrecognized, making the whole sequence wrong. If the accuracy of a single symbol is 90%, then the accuracy of a number consisting of 6 symbols will be only 53% (assuming that the errors are independent). Because of that, major recognizers deliver poor results on alphanumerics. In our interactions with customers and prospects, we have consistently heard about the challenges they have encountered in getting good accuracy on alphanumeric sequences. Some of them post-process the large vocabulary results, in particular if a set of hypotheses is returned. We used such approaches back when we built IVR systems at Resolvity and had to use a 3rd party ASR. In fact, we were awarded a patent for one such postprocessing approach.
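The compounding effect mentioned above is easy to check with a quick calculation. The sketch below assumes, as the text does, that per-symbol errors are independent; the function names are illustrative only.

```python
def sequence_accuracy(symbol_accuracy: float, length: int) -> float:
    """Probability that every symbol in the sequence is recognized correctly,
    assuming independent errors."""
    return symbol_accuracy ** length


def required_symbol_accuracy(target_sequence_accuracy: float, length: int) -> float:
    """Per-symbol accuracy needed to reach a target whole-sequence accuracy,
    under the same independence assumption."""
    return target_sequence_accuracy ** (1.0 / length)


# 90% per symbol over 6 symbols gives about 53% for the whole sequence.
print(f"{sequence_accuracy(0.90, 6):.2%}")
```

Read the other way round, reaching 90% on whole 6-symbol sequences would require roughly 98% accuracy on each individual symbol.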

Case Study: British Postcodes

While working on a project aiming to improve recognition of UK postcodes, we collected over 9000 sample recordings of various people speaking randomly selected valid UK postcodes. About 1/3 of the speakers had a British accent, while the rest had a variety of other accents, e.g. Indian, Chinese, Nigerian, etc.

Out of that data set we reserved some for testing. The results reported here are from a 250 postcode test set (we will soon provide a link to this test set on our Github). As of the date of this blog post, Google Speech-to-Text achieved only 43% accuracy and Amazon 58% on this test set.

At Voicegain we use two approaches to help us achieve high accuracy on alphanumerics: (a) training the recognizer on realistic data sets containing sample alphanumeric sequences, (b) using grammars to constrain the possible recognitions. In a specific scenario, we can use one or the other or even both approaches.

Here is a summary of the results that we achieved on the UK postcodes set.


Improving Recognition with Acoustic Model Training

We used the data set described above in our most recent training round for our English Model and have achieved significant improvement in accuracy when testing on a set of 250 UK postcodes which were not used in training.

  • For unconstrained large vocabulary recognition the accuracy improved from 51.60% to 63.60% (a gain of 12%). The training helped both the acoustic part of our model (e.g. letters which were skipped in the base recognizer because they were not enunciated well enough were picked after training - 8 was recognized correctly instead of H, etc.) and the language part of our model (e.g. correctly recognizing "two" instead of "to" because of the context)
  • For grammar-based recognition (more about it in the section below) the accuracy improved from  79.31% to 84.03% (a gain of 4.72%). Because in grammar based recognition the language model is fully defined by the grammar the gain here was from being able to distinguish more acoustic nuances between various letters and numbers (e.g. someone's long R is no longer recognized as "A R", "L P" is now correctly recognized instead of "A P", etc).

Improving Recognition with the use of Grammars

The Voicegain DNN recognizer has the ability to use grammars for speech recognition, a somewhat unique feature among modern speech recognizers. We support the GRXML and JSGF grammar formats. Grammars are used during the search - they are not merely applied to the result of the large vocabulary recognition - which gives us the best possible results. (BTW, we can also combine grammar-based recognition with large vocabulary recognition, see this blog post for more details.)

For UK postcode recognition we defined a grammar which captures all ways in which valid UK postcodes can be said. You can see the exact grammar that we used here.


Grammar based UK postcode recognition gives significantly better results than large vocabulary recognition.

  • On our base model, before training, the difference was 27.71% (79.31% vs 51.60%)
  • On the trained model the difference was smaller, but still very large 20.43%  (84.03% vs 63.60%)
  • Compared to Amazon recognizer we were 25.62% better after training (84.03% vs 58.40%)
What if the possible set of alphanumeric sequences cannot be defined using a grammar?

We have come across scenarios where the alphanumeric sequences are difficult to define exhaustively using grammars, e.g. some Serial Numbers. In those cases our recognizer supports the following approach:

  • Define a grammar that matches a superset of valid sequences,
  • Use a lookup table to match against a known list of valid and likely-to-occur sequences. For example, if these are serial numbers and the application deals with warranty registration, we can narrow down the set of possible serial numbers that we may have to recognize.
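The two-step idea above can be illustrated on the application side with a rough sketch. Note the assumptions: in the Voicegain recognizer the grammar is applied during the search, whereas this example only filters an ordered list of text hypotheses after the fact. The serial number format, the lookup table contents, and the function name are all hypothetical.

```python
import re
from typing import List, Optional

# Hypothetical superset format: two letters followed by six digits.
SERIAL_PATTERN = re.compile(r"[A-Z]{2}\d{6}")

# Hypothetical lookup table of serial numbers likely to be registered.
KNOWN_SERIALS = {"AB123456", "CD654321"}


def pick_serial(hypotheses: List[str]) -> Optional[str]:
    """Return the first hypothesis (best first) that both fits the
    superset pattern and appears in the lookup table, else None."""
    for hyp in hypotheses:
        candidate = hyp.replace(" ", "").upper()
        if SERIAL_PATTERN.fullmatch(candidate) and candidate in KNOWN_SERIALS:
            return candidate
    return None
```

For example, `pick_serial(["A B 1 2 3 4 5 7", "a b 1 2 3 4 5 6"])` skips the first hypothesis, which fits the pattern but is not a known serial number, and returns the second one.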

Want to test your alphanumeric use case?

We are always ready to help prospective customers with solving their challenges with speech recognition. If your current recognizer does not deliver satisfactory results recognizing sequences of alphanumerics, start a conversation over email at info@voicegain.ai. We are always interested in accuracy.

Read more → 
Benchmark
Voice Bot
Voicegain as a single ASR for both Speech IVRs & Voice Bots

This post highlights how Voicegain's deep learning based ASR supports both speech-enabled IVRs and conversational Voice Bots.

This can help Enterprise IT organizations simplify their transition from directed dialog telephony IVR to a modern conversational Voice Bot.

This is because of a very important feature of Voicegain. Voicegain's ASR can be accessed in two ways:

1) MRCP ASR for Speech IVR - the traditional way: Voicegain ASR can be invoked over MRCP from a VoiceXML IVR application developed using Speech grammars. Voicegain is a "drop-in" replacement for the ASR used in most of these IVRs.

2) Speech-to-Text/ASR for Bots - the modern way: Voicegain offers APIs that integrate with (a) SIP telephony or CPaaS platforms and (b) Bot Frameworks that present a REST endpoint. Examples of bot frameworks supported include Google Dialogflow, RASA and Azure Bot Service.

Directed Dialog IVRs are not going away any time soon!

When it comes to voice self service, enterprises understand that they would need to maintain and operate traditional Speech IVRs for many years.

This is because existing users have been trained over the years and have become proficient with these speech-enabled IVRs. They would prefer not having to learn a new user interface like Voice Bots if they can avoid it. Also, enterprises have made substantial investments in developing these IVRs and they would like to continue to support them as long as they generate adequate usage.

However, a growing "digital-native" segment of customers demands Alexa-like conversational experiences, which provide a much better user experience than IVRs. This is driving substantial interest among enterprises in developing Voice Bots as a long-term replacement for IVRs.

Net-net, even as enterprises develop new conversational Voice Bots for the long term, in the near term they will need to support and operate these IVRs.

Bots & IVRs use different ASRs, protocols and App tech stacks

ASR: While both Voice Bots & IVRs require ASR/Speech-to-Text, the ASRs that support conversational voice bots are different from the ASRs used in directed dialog IVRs. The ASRs that support IVRs are based on HMMs (Hidden Markov Models), and the apps use speech grammars when invoking the ASR. On the other hand, voice bots work with large vocabulary deep learning based STT models.

Protocol: The communication protocols between the ASR & the app are also very different. An IVR App, usually written in VoiceXML, communicates with the ASR over MRCP; modern Bot Frameworks communicate with ASRs over modern web-based protocols like WebSockets and gRPC.

App Stack: The app logic of a directed dialog IVR is built on a VoiceXML-compliant application IDE. Popular products in this space include Avaya Aura Experience Portal (AAEP), Cisco Voice Portal (CVP) and Genesys Voice Portal or Genesys Engage. This article explores this in more detail.

On the other hand, modern Voice Bots require Bot Frameworks like Google Dialogflow, Kore.ai, RASA, AWS Lex and others. They use modern NLU technology to extract intent from transcribed text. Bot Frameworks also offer sophisticated dialog management to dynamically determine conversation turns. They also allow integration with other enterprise systems like CRM and Billing.

When it comes to Voice Bots, most enterprises want to "voice-enable" the chatbot interaction logic, which is also developed on the same Bot Framework, and then integrate it with telephony - that is, use a phone number to "dial" the chatbot and interact using Speech-to-Text and Text-to-Speech.

The Solution: Use Voicegain ASR to support both IVRs & Bots

The Voicegain platform is the first and currently the only ASR/Speech-to-Text platform in the market that can support both a directed dialog Speech IVR and a conversational Voice Bot using a single acoustic and language model.

Cloud Speech-to-Text APIs from Google, Amazon and Microsoft support large vocabulary speech recognition and can support voice bots. However, they cannot be a "drop-in" replacement for the MRCP ASR functionality in a directed dialog IVR.

And traditional MRCP ASRs that supported directed dialog IVRs (e.g. Nuance, Lumenvox, etc.) do not support large vocabulary transcription.

Integration with Bot Frameworks and Telephony

Voicegain offers Telephony Bot APIs to support bot developers by providing the "mouth" and the "ear" of the Bot.

These APIs are callback-style APIs that an enterprise can use along with a Bot Framework of its choice.

In addition to the actual ASR, Voicegain also embeds a telephony/PSTN interface. There are 3 possibilities:

1. Integration with modern CPaaS platforms like Twilio, SignalWire and Telnyx: With such an integration, callers can "dial and talk" to their chatbots over a phone number.

2. SIP INVITE from CCaaS or CPaaS Platform: The Bot Developer can transfer the call control to Voicegain using a SIP INVITE. After the call has been transferred, the Bot Framework can interact using above mentioned APIs. At the end of the bot interaction, you can end the Bot session and continue the live conversation on the CCaaS/CPaaS platform.

3. Voicegain embedded CPaaS: Voicegain has also embedded the Amazon Chime CPaaS, so developers can purchase a phone number and start building their voice bot in a matter of minutes.

Essentially, by using Telephony Bot APIs alongside any Bot Framework, an Enterprise can have a Bot Framework and an ASR that serve all 3 self-service mediums - Chatbots, Voicebots and Directed Dialog IVRs.

To explore this idea further, please send us an email at info@voicegain.ai

Read more → 
Voice Bot