It has been another 6 months since we published our last speech recognition accuracy benchmark. Back then, the results were as follows (from most accurate to least accurate): Microsoft, then Amazon closely followed by Voicegain, then the new Google latest_long, with Google Enhanced last.
While the order has remained the same as in the last benchmark, three companies - Amazon, Voicegain and Microsoft - showed significant improvement.
Since the last benchmark, we at Voicegain invested in more training data - mainly lectures conducted over Zoom and in live settings. Training on this type of data resulted in a further increase in the accuracy of our model. We are currently in the middle of a further round of training, this time focused on call center conversations.
As far as the other recognizers are concerned:
We have repeated the test using a similar methodology as before: we used 44 files from the Jason Kincaid data set and 20 files published by rev.ai, and removed all files on which none of the recognizers could achieve a Word Error Rate (WER) lower than 25%.
This time again only one file was that difficult. It was a bad-quality phone interview (Byron Smith Interview 111416 - YouTube) with a WER of 25.48%.
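For anyone reproducing the benchmark, WER can be computed with the open-source jiwer package. Below is a minimal sketch of the per-file filter described above; the reference text, engine names and hypotheses are placeholders, not our actual benchmark data or tooling.

```python
# Minimal sketch: compute per-file WER with jiwer (pip install jiwer)
# and keep only files where at least one engine beats the 25% cutoff.
import jiwer

WER_CUTOFF = 0.25

def keep_file(reference: str, hypotheses: dict[str, str]) -> bool:
    """Keep a file only if at least one recognizer achieves WER < 25%."""
    return any(jiwer.wer(reference, hyp) < WER_CUTOFF
               for hyp in hypotheses.values())

# Placeholder example: one benchmark file, two engines.
reference = "the quick brown fox jumps over the lazy dog"
hypotheses = {
    "engine_a": "the quick brown fox jumps over a lazy dog",
    "engine_b": "quick brown fox jumped over the lazy dog",
}
print({name: round(jiwer.wer(reference, hyp), 4)
       for name, hyp in hypotheses.items()})
print("keep:", keep_file(reference, hypotheses))
```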
We publish this because we want any third party - any ASR vendor, developer or analyst - to be able to reproduce these results.
You can see box-plots with the results above. The chart also reports the average and median Word Error Rate (WER).
Only 3 recognizers have improved in the last 6 months.
Detailed data from this benchmark indicates that Amazon is better than Voicegain on audio files with WER below the median and worse on audio files with WER above the median. Otherwise, AWS and Voicegain are very closely matched. However, we have also run a client-specific benchmark where it was the other way around - Amazon was slightly better than Voicegain on audio files with WER above the median, but Voicegain was better on audio files with WER below the median. Net-net, it really depends on the type of audio files, but overall, our results indicate that Voicegain is very close to AWS.
Let's look at the number of files on which each recognizer was the best one.
We have now done the same benchmark 5 times, so we can draw charts showing how each of the recognizers has improved over the last 2 years and 3 months. (Note: for Google, the latest 2 results are from the latest_long model; the other Google results are from video enhanced.)
You can clearly see that Voicegain and Amazon started quite a bit behind Google and Microsoft but have since caught up.
Google seems to have the longest development cycles, with very little improvement from Sept. 2021 until about half a year ago. Microsoft, on the other hand, releases an improved recognizer every 6 months. Our improved releases are even more frequent than that.
As you can see, the field is very close and you get different results on different files (the average and median do not paint the whole picture). As always, we invite you to review our apps, sign-up and test our accuracy with your data.
When you have to select speech recognition/ASR software, there are other factors beyond out-of-the-box recognition accuracy. These factors are, for example:
1. Click here for instructions to access our live demo site.
2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits
3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.
LLMs like ChatGPT and Bard are taking the world by storm! An LLM like ChatGPT is remarkably good not just at understanding language but also at processing it and acquiring knowledge from its content. The outcome is almost eerie: once these LLMs acquire knowledge, they are able to answer, very accurately, questions that in the past seemed to require human judgement.
A big use-case for LLMs is in the analysis of business meetings - both internal (between employees) and external (e.g., conversations with customers, vendors, etc.).
In the past few years, companies have primarily been using cloud-based Revenue/Sales Intelligence and Meeting AI solutions to transcribe business conversations. With all of these cloud-based solutions, the meeting transcript is usually stored in the vendor's cloud. Once the transcript is generated, NLU models built by these vendors and included as part of their SaaS web app are used to extract insights - questions and sales blockers in sales conversations, meeting action items, risks, etc.
Essentially, these NLU models - most of which predate the LLMs - were able to summarize and to extract topics, keywords and phrases. Enterprises did not mind using the vendor's cloud infrastructure to store the transcripts, as what this NLU could do seemed pretty harmless.
However, the LLMs take this to a whole different level. Once the LLMs are provided the transcripts - or, more specifically, "embeddings" of the transcripts - they are able to acquire knowledge of what is actually taking place and answer extremely insightful questions.
At Voicegain, we used an open-source vector database to generate and store embeddings of a single month of our daily scrum meeting transcripts, and we submitted the relevant excerpts to the ChatGPT API.
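For readers who want to try something similar, here is a minimal sketch of that workflow, assuming the open-source Chroma vector database and OpenAI's 2023-era Python SDK. The transcript snippets, collection name and question are placeholders - this illustrates the approach, not our production pipeline.

```python
# Sketch: index transcript chunks in a local vector store, retrieve the
# most relevant ones for a question, and let GPT-3.5 answer from them.
import chromadb
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

# 1. Index transcript chunks (Chroma embeds them with its default model).
collection = chromadb.Client().create_collection(name="scrum_transcripts")
collection.add(
    documents=["2023-02-01: discussed renewal terms with customer X ...",
               "2023-02-02: interviewed two candidates for the backend role ..."],
    ids=["day1", "day2"],
)

# 2. Retrieve the chunks most relevant to the question.
question = "Did the company hire new employees?"
hits = collection.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

# 3. Ask GPT-3.5 to answer using only the retrieved context.
reply = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "Answer only from these meeting transcript excerpts:\n" + context},
        {"role": "user", "content": question},
    ],
)
print(reply["choices"][0]["message"]["content"])
```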
We were able to get answers to the following questions:
1. Provide a summary of the contract with <Largest Customer Name>.
2. What is the progress on <Key Initiative>?
3. Did the Company hire new employees?
4. Did the Company discuss any trade secrets?
5. What is the team's opinion on Mongodb Atlas vs Google Firestore?
6. What new products is the Company planning to develop?
7. Which Cloud provider is the Company using?
8. What is the progress on a key initiative?
9. Are employees happy working in the company?
10. Is the team fighting fires?
ChatGPT's responses to the above questions were amazingly, even eerily, accurate. For Question 4, it indicated that it did not want to answer the question. And when it did not have adequate information (e.g., Question 9), it said so in its response.
At Voicegain, we have always been big proponents of keeping Voice AI on the Edge. We have written about it in the past.
Meeting transcripts in any business are a veritable gold mine of information. With the power of LLMs, they can now be queried very easily to provide amazing insights. But if these transcripts are stored in another vendor's cloud, they have the potential to expose a business's most proprietary and confidential information to third parties.
Hence it is extremely critical that these transcripts are stored only in private infrastructure (behind the firewall). Enterprise IT needs to make sure this happens in order to safeguard proprietary and confidential information.
Voicegain offers Transcribe, an enterprise-ready option for Meeting AI. Transcribe can be deployed on bare metal in a datacenter or in a private cloud. You can read more about it here.
On March 1st, OpenAI announced that developers could now access the Whisper Speech-to-Text model via easy-to-use APIs. OpenAI also released APIs for GPT-3.5, the LLM behind the buzzy ChatGPT product.
Since Whisper's initial release in October 2022, it has been a big draw for developers. A highly accurate open-source ASR is extremely compelling. Whisper has been trained on 680,000 hours of audio data, which is much more than most models are trained on. Here is a link to their GitHub.
However, there were two major limitations. 1) Running Whisper requires expensive, memory-intensive GPU-based compute options (see below). 2) A company still had to invest in an engineering team that could test, run and support the model in a production environment.
By taking over the responsibility of hosting this model and making it accessible via easy-to-use APIs, OpenAI addresses both of the above limitations.
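For reference, transcribing a file through the hosted Whisper API takes only a few lines with OpenAI's 2023-era Python SDK; the file name here is a placeholder.

```python
# Sketch: send an audio file to the hosted Whisper model ("whisper-1").
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

with open("meeting.mp3", "rb") as audio_file:  # placeholder file
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript["text"])
```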
This article highlights some of the key strengths and limitations of using Whisper - whether you use OpenAI's APIs or host the model on your own.
In our benchmark tests, the Whisper models demonstrated high accuracy across a widely diverse range of audio datasets - meetings, classroom lectures, YouTube videos and call center audio. We benchmarked Whisper-base, Whisper-small and Whisper-medium.
The median Word Error Rate (WER) for Whisper-medium was 12.5% for meeting audio and 17.5% for call center audio. This was indeed better than the WERs of other large players like AWS, Azure and Google. However, here is an interesting fact - it is possible to match and even exceed Whisper's accuracy with custom models, i.e., models that are trained on our clients' data.
Please contact us via email (support@voicegain.ai) if you would like to review these accuracy benchmarks.
Whisper's pricing of $0.006/min is much lower than the Speech-to-Text offerings of some of the other large cloud players - roughly a 75% discount to Google Speech-to-Text and AWS Transcribe (both priced at about $0.024/min as of the date of this post). However, there are a few caveats to this pricing, which are outlined in the Limitations section below.
Also significant was that OpenAI announced the ChatGPT APIs together with the Whisper APIs. Developers building Voice AI apps can now combine the power of the Whisper Speech-to-Text models with the GPT-3.5 LLM (the underlying model that the ChatGPT APIs give access to) to build really cool apps - whether for meetings or the call center.
Whisper currently does not support apps that require real-time/streaming transcription, which is relevant to both the call center and the meetings use-case. While there are some hacks and work-arounds, they are not practical for a production deployment.
The throughput of the Whisper models - both the small and the medium models - is quite low. Our ML engineers tested the Whisper models on popular NVIDIA GPU-based compute instances available in the public clouds (AWS, GCP and Microsoft Azure). Net-net, we determined that while developers would not have to pay for software licensing, the cloud infrastructure costs would be substantial: running Whisper so that it performs well costs in the range of $0.10 - $0.15/hour.
In addition to this infrastructure cost, the larger expense of running Whisper on the Edge (On-Premise + Private Cloud) is that it requires a dedicated back-end Engineering & DevOps team to run the model in a cost-effective manner.
As of the publication of this post, Whisper does not have a multi-channel audio API. So if your application involves audio with multiple speakers, Whisper's effective price-per-minute = number of channels * $0.006 - e.g., $0.012/min for 2-channel call center audio. For both the meetings and call center use-cases, this pricing can become prohibitive.
This release of Whisper is missing some key features that developers would need. The three important missing features we noticed are Diarization (speaker separation), Time-stamps and PII Redaction.
At Voicegain, we have built deep-learning-based Speech-to-Text/ASR models that match the accuracy of models from the large players. For over 3 years now, developers have been using our Speech-to-Text APIs to build and launch successful products. Our focus has been on voice developers that need high accuracy (achieved by training custom acoustic models) and deployment in private infrastructure at an affordable price. We provide an accuracy SLA where we guarantee that a custom model trained on your data will be at least as accurate as the most popular options, including OpenAI's Whisper.
We also have models that are trained specifically on call center audio - so if you are looking for a call center focused model we can provide higher accuracy than Whisper.
While Whisper is a worthy competitor (from, of course, a much larger company with 100x our resources), as developers we welcome the innovation that OpenAI is unleashing in this market. By adding ChatGPT APIs to our Speech-to-Text, we are planning to broaden our API offerings to the developer community.
To sign up for a developer account on Voicegain with free credits, click here.
Like Voicegain Transcribe, there are other cloud-based Meeting AI and AI note-taking solutions that work with video meeting platforms like Zoom and Microsoft Teams. However, they do not meet the requirements of privacy-sensitive enterprise customers in financial services, healthcare, manufacturing, high-tech and other industry verticals. Data privacy and control concerns mean that these customers want to deploy an AI-based meeting assistant in their private infrastructure, behind their corporate firewall.
Voicegain Transcribe has been designed and developed for the on-premise datacenter or Virtual Private Cloud use-case. Voicegain has already deployed it at a large global Fortune 50 company, making it one of the first truly on-premise/private-cloud AI Meeting Assistant solutions in the market.
The key features of Voicegain Transcribe are:
Zoom Local Recordings are recordings of your meetings that are saved on your computer's local file system and not in Zoom's cloud. This ensures that confidential and privacy-sensitive recorded audio and video content stays within the enterprise and is not accessible to Zoom.
Voicegain offers a Windows desktop app (an app for macOS is on the roadmap) that accesses these Zoom recordings and submits them for transcription and NLU.
The other major advantage of Zoom Local Recordings is that Zoom supports recording a separate audio track for each participant - a feature that is not yet available in its Cloud Recording (as of Feb 2023). Voicegain Transcribe with Zoom Local Recordings can therefore assign speaker labels with 100% accuracy.
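To illustrate why separate tracks make speaker labeling trivial, here is a hedged sketch that maps each per-participant audio file to a speaker name. The folder layout and file-naming convention below are assumptions about Zoom's local recording format (they can vary by client version), not Voicegain's actual code.

```python
# Sketch: with one audio track per participant, speaker attribution
# reduces to reading the participant name off each file name -
# no diarization model is needed.
from pathlib import Path

def speaker_tracks(recording_dir: str) -> dict[str, Path]:
    """Map participant name -> audio track for one recorded meeting."""
    tracks = {}
    # Assumption: per-participant files live in an "Audio Record"
    # subfolder and are prefixed with "audio".
    for f in Path(recording_dir, "Audio Record").glob("*.m4a"):
        name = f.stem.removeprefix("audio").strip() or f.stem
        tracks[name] = f
    return tracks

meeting = r"C:\Users\me\Documents\Zoom\2023-02-01 Standup"  # placeholder
for speaker, track in speaker_tracks(meeting).items():
    print(f"{speaker}: transcribe {track} with this speaker label")
```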
There are vendors that offer Meeting Assistants that join the meeting from the cloud and record it. With this approach, however, the Meeting Assistant has access only to a blended/merged mono audio file that includes the audio of all the participants. So the Meeting AI solution has to "diarize" the meeting audio - an inherently difficult problem to solve. Even state-of-the-art diarization/speaker-separation models are only 83-85% accurate.
For any Meeting AI solution to extract meaningful insights, the accuracy of the underlying transcription is extremely important. If the Speech-to-Text is not accurate, then even the best NLU algorithm or the largest language model cannot deliver valuable and accurate analytics.
Voicegain can train the underlying Speech-to-Text models to accurately transcribe different accents, customer-specific words and the specific acoustic environment.
Voicegain integrates with enterprise SSO solutions using SAML. Voicegain also integrates with internal email systems to simplify user-management tasks like sign-up, password resets, changes, adds and deletes.
All the meeting audio, transcripts and NLU-based analytics are stored in enterprise-controlled NoSQL and SQL databases. Enterprises can either use in-house staff to maintain/administer these databases and storage, or use a managed database option like MongoDB Atlas or managed PostgreSQL from a cloud provider like Azure, AWS or GCP.
If you are looking for a Meeting AI solution that can be deployed fully behind your corporate firewall or in your own Private Cloud infrastructure, then Voicegain Transcribe is the perfect fit for your needs.
Have questions? We would love to hear from you. Send us an email - sales@voicegain.ai or support@voicegain.ai - and we will be happy to offer more details.
We are really excited to announce the launch of Zoom Meeting Assistant for Local Recordings. This is immediately available to all users of Voicegain Transcribe that have a Windows device. The Zoom Meeting Assistant can be installed on computers that have Windows 10 or Windows 11 as the OS.
What are local recordings? Zoom offers two ways to record a meeting. 1) Cloud Recording: Zoom users may save the recording of the meeting in Zoom's cloud. 2) Local Recording: the meeting recording is saved locally on the Zoom user's computer, in the default Zoom folder on the file system. Zoom processes the recording and makes it available in this folder a few minutes after the meeting is complete.
Below is a screenshot of how a Zoom user can initiate a local recording.
There are four big benefits of using Local Recordings
To use the Voicegain Zoom Meeting Assistant, there are just two requirements:
1. Users should first sign up for a Voicegain Transcribe account. Voicegain offers a forever-free plan (up to 2 hours of transcription per month) and users can sign up using this link. You can learn more about Voicegain Transcribe here.
2. They should have a computer with Windows 10 or 11 as the OS.
This Windows app can be downloaded from the "Apps" page in Voicegain Transcribe. Once the app is installed, users will be able to access it on their Windows taskbar (or tray). All they need to do is log into the Voicegain Transcribe app from the Meeting Assistant by entering their Transcribe user-id and password.
Once the Meeting Assistant app is logged into Voicegain Transcribe, it does two things:
1. It constantly scans the Zoom folder for any new local recordings of meetings. As soon as it finds such a recording, it submits/uploads it to Voicegain Transcribe for transcription, summarization and extraction of Key Items (Actions, Issues, Sales Blockers, Questions, Risks etc.). A simplified sketch of this scanning approach is shown after this list.
2. It can also join any Zoom meeting as the user's AI Assistant. This feature works whether the user is the Host of the Zoom meeting or just a Participant. By joining the meeting, the Meeting Assistant is able to collect information on all the participants in the meeting.
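As mentioned in item 1 above, the core of this behavior is a folder scan. Below is a simplified illustration of the idea; the folder path, polling interval and upload_to_transcribe() function are hypothetical, not Voicegain's actual implementation.

```python
# Sketch: poll the default Zoom recording folder and hand any new
# recording to an upload function (placeholder) for transcription.
import time
from pathlib import Path

ZOOM_DIR = Path.home() / "Documents" / "Zoom"  # typical location on Windows

def upload_to_transcribe(recording: Path) -> None:
    """Placeholder for the actual upload/transcription call."""
    print(f"uploading {recording} for transcription ...")

seen: set[Path] = set()
while True:
    # Assumption: Zoom creates one subfolder per meeting, audio as .m4a.
    for rec in ZOOM_DIR.glob("*/*.m4a"):
        if rec not in seen:
            seen.add(rec)
            upload_to_transcribe(rec)
    time.sleep(30)  # poll every 30 seconds
```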
While the current Meeting Assistant App works only for Windows users, Voicegain has native apps for Mac, Android and iPhone as part of its product roadmap.
Send us an email at support@voicegain.ai if you have any questions.
Enterprises are increasingly looking to mine the treasure trove of insights from voice conversations using AI. These conversations take place daily on video meeting platforms like Zoom, Google Meet and Microsoft Teams and over telephony in the contact center (which take place on CCaaS or on-premise contact center telephony platforms).
Voice AI or Conversational AI refers to converting the audio of these conversations into text using speech recognition/ASR technology, and mining the transcribed text for analytics and insights using NLU. In addition, AI can be used to detect sentiment, energy and emotion in both the audio and the text. The insights from NLU include the extraction of key items from meetings - semantically matching phrases associated with things like action items, issues, sales blockers, agenda items, etc.
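As one illustration of how such semantic matching can work, here is a generic sketch using the open-source sentence-transformers package - it is not any particular vendor's method, and the prototype phrases and threshold are illustrative choices.

```python
# Sketch: embed each transcript sentence and tag it with any key-item
# category whose prototype phrase it is semantically close to.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

PROTOTYPES = {
    "action_item": "someone agrees to do a task by a deadline",
    "sales_blocker": "a reason the customer cannot buy right now",
    "question": "someone asks for information or clarification",
}
proto_emb = {label: model.encode(text, convert_to_tensor=True)
             for label, text in PROTOTYPES.items()}

def tag_sentence(sentence: str, threshold: float = 0.4) -> list[str]:
    emb = model.encode(sentence, convert_to_tensor=True)
    return [label for label, p in proto_emb.items()
            if util.cos_sim(emb, p).item() >= threshold]

print(tag_sentence("I'll send the revised proposal by Friday."))
# e.g., may print ["action_item"]
```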
Over the last few years, the conversational AI space has seen many players launch highly successful products and scale their businesses. However, most of the popular Voice AI options available in the market are multi-tenant SaaS offerings, deployed in a large public cloud like Amazon, Google or Microsoft. At first glance, this makes sense. Most enterprise software apps that automate workflows in functional areas like Sales and Marketing (CRM), HR, Finance/Accounting or Customer Service are architected as multi-tenant SaaS offerings. The move to cloud has been a secular trend for business applications, and Voice AI has followed this path.
However, at Voicegain we firmly believe that a different approach is required for a large segment of the market. We propose that an Edge architecture using a single-tenant model is the way to go for Voice AI apps.
By Edge, we mean the following:
1) The AI models for Speech Recognition/Speech-to-Text and NLU run on the customer's single-tenant infrastructure - whether bare metal in a datacenter or a dedicated VPC with a cloud provider.
2) The Conversational AI app - usually a browser-based application that uses these AI models - is also deployed completely behind the firewall.
We believe that the advantages of an Edge/On-Prem architecture for Conversational/Voice AI are driven by the following four factors:
Very often, conversations in meetings and call centers are sensitive from a business perspective. Enterprise customers in many verticals (Financial Services, Health Care, Defense, etc.) are not comfortable storing the recordings and transcripts of these conversations on the SaaS vendor's cloud infrastructure. Think about highly proprietary information like product strategy, the status of key deals, bugs and vulnerabilities in software, or even a sensitive financial discussion ahead of a public company's earnings release. Many countries also impose strict data-residency requirements from a legal/compliance standpoint. This makes the Edge (On-Premises/VPC) architecture very compelling.
Unlike pure workflow-based SaaS applications, Voice AI apps include deep-learning-based AI models - Speech-to-Text and NLU. To extract the right analytics, it is critical that these AI models - especially the acoustic models in the speech-recognition/speech-to-text engine - are trained on client-specific audio data. This is because each customer use case has unique audio characteristics that limit the accuracy of an out-of-the-box multi-tenant model. These unique audio characteristics relate to:
1. Industry jargon - acronyms, technical terms
2. Unique accents
3. Names of brands, products, and people
4. Acoustic environment and any other type of audio.
However, most AI SaaS vendors today use a single model to serve all their customers. This results in sub-optimal speech recognition/transcription, which in turn results in sub-optimal NLU.
For real-time Voice AI apps - e.g., in the call center - there is an architectural advantage to having the AI models on the same LAN as the audio sources.
For many enterprises, SaaS Conversational AI apps are inexpensive to get started but they get very expensive at scale.
Voicegain offers an Edge deployment where both the core platform and a web app like Voicegain Transcribe can operate completely on our clients' infrastructure. Both can be placed "behind an enterprise firewall".
Most importantly Voicegain offers a training toolkit and pipeline for customers to build and train custom acoustic models that power these Voice AI apps.
If you have any questions or would like to discuss this in more detail, please contact our support team over email (support@voicegain.ai).
Interested in customizing the ASR or deploying Voicegain on your infrastructure?