LLMs like ChatGPT and Bard are taking the world by storm! An LLM like ChatGPT is remarkably good not just at understanding language but also at processing that content and acquiring knowledge from it. The outcome is almost eerie, even scary: once these LLMs acquire knowledge, they can very accurately answer questions that in the past seemed to require human judgement.
A big use-case for LLMs is the analysis of business meetings - both internal (between employees) and external (e.g., conversations with customers, vendors, etc).
In the past few years, companies have primarily been using cloud-based Revenue/Sales Intelligence and Meeting AI solutions to transcribe business conversations. With all of these cloud-based solutions, the meeting transcript is usually stored in the vendor's cloud. Once the transcript is generated, NLU models built by these vendors and included as part of their SaaS web apps were used to extract insights - questions and sales blockers in sales conversations, meeting action items, risks, etc.
Essentially these NLU models - most of which predate the LLMs - were able to summarize and extract topics, keywords and phrases. Enterprises did not mind using the vendor's cloud infrastructure to store the transcripts, as what these NLU models could do seemed pretty harmless.
However, the LLMs take this to a whole different level. Once the LLMs are provided the transcripts - or, more specifically, "embeddings" of the transcripts - they are able to acquire knowledge of what actually took place and answer extremely insightful questions.
At Voicegain, we just used an open-source vector database to generate embeddings of a single month of our daily scrum meeting transcripts and submitted them to the ChatGPT API.
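Here is a minimal sketch of that workflow, assuming Chroma as the open-source vector database and the openai Python package (the specific database, file names and prompts are illustrative, not a description of our internal tooling):

```python
# Sketch: index meeting transcripts in a local vector DB (Chroma), then
# answer questions with the ChatGPT API using the retrieved context.
# Assumes `pip install chromadb openai`; transcript contents are placeholders.
import openai
import chromadb

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

# 1. Index the transcripts (Chroma computes embeddings locally by default).
client = chromadb.Client()
collection = client.create_collection(name="scrum_transcripts")
transcripts = {"2023-03-01": "...", "2023-03-02": "..."}  # date -> transcript
collection.add(ids=list(transcripts.keys()), documents=list(transcripts.values()))

# 2. Retrieve the transcripts most relevant to a question.
question = "What is the team's opinion on MongoDB Atlas vs Google Firestore?"
results = collection.query(query_texts=[question], n_results=3)
context = "\n\n".join(results["documents"][0])

# 3. Ask ChatGPT, grounding the answer in the retrieved context.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only these meeting transcripts:\n\n" + context},
        {"role": "user", "content": question},
    ],
)
print(response["choices"][0]["message"]["content"])
```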
We were able to get answers to the following questions:
1. Provide a summary of the contract with <Largest Customer Name>.
2. What is the progress on <Key Initiative>?
3. Did the Company hire new employees?
4. Did the Company discuss any trade secrets?
5. What is the team's opinion on MongoDB Atlas vs Google Firestore?
6. What new products is the Company planning to develop?
7. Which Cloud provider is the Company using?
8. What is the progress on a key initiative?
9. Are employees happy working in the company?
10. Is the team fighting fires?
ChatGPT's responses to the above questions were amazingly, even eerily, accurate. For Question 4, it did indicate that it did not want to answer the question. And when it did not have adequate information (e.g., Question 9), it indicated that in its response.
At Voicegain, we have always been big proponents of keeping Voice AI on the Edge. We have written about it in the past.
Meeting transcripts in any business are a veritable gold mine of information. Now, with the power of LLMs, they can be queried very easily to provide amazing insights. But if these transcripts are stored in another vendor's cloud, there is the potential to expose a business's most proprietary and confidential information to 3rd parties.
Hence it is extremely critical that these transcripts are stored only in private infrastructure (behind the firewall), and it is up to Enterprise IT to make sure this happens in order to safeguard proprietary and confidential information.
Voicegain offers Transcribe, an enterprise-ready option for Meeting AI. Transcribe can be deployed in a datacenter on bare metal or in a private cloud. You can read more about it here.
On March 1st, OpenAI announced that developers could now access the Whisper Speech-to-Text model via easy-to-use APIs. OpenAI also released APIs for GPT-3.5, the LLM behind the buzzy ChatGPT product.
Since Whisper's initial release in October 2022, it has been a big draw for developers: a highly accurate open-source ASR is extremely compelling. Whisper was trained on 680,000 hours of audio data, which is much more than most models are trained on. Here is a link to their GitHub.
However, there were two major limitations: 1) running Whisper requires expensive, memory-intensive GPU-based compute options (see below); and 2) a company still had to invest in an engineering team that could test, run and support the model in a production environment.
By taking on the responsibility of hosting this model and making it accessible via easy-to-use APIs, OpenAI addresses both of the above limitations.
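For example, transcribing a file through the hosted API takes just a few lines with the openai Python package (v0.27 or later; the file name here is a placeholder):

```python
# Sketch: transcribe an audio file with OpenAI's hosted Whisper API.
# Assumes `pip install openai` and a local recording named meeting.mp3.
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

with open("meeting.mp3", "rb") as audio_file:
    result = openai.Audio.transcribe(model="whisper-1", file=audio_file)

print(result["text"])  # the full transcript as plain text
```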
This article highlights some of the key strengths and limitations of using Whisper - whether via OpenAI's APIs or hosting it on your own.
In our benchmark tests, the Whisper models demonstrated high accuracy across a widely diverse range of audio datasets - meetings, classroom lectures, YouTube videos and call center audio. We benchmarked Whisper-base, Whisper-small and Whisper-medium.
The median Word Error Rate (WER) for Whisper-medium was 12.5% for meeting audio and 17.5% for call center audio. This was indeed better than the WERs of other large players like AWS, Azure and Google. However, here is an interesting fact: it is possible to match and even exceed Whisper's accuracy with custom models, i.e., models trained on our clients' data.
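For context, WER is the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference transcript, divided by the number of reference words. A quick way to compute it is the open-source jiwer package (shown here purely as an illustration, not the tooling behind our benchmarks):

```python
# WER = (substitutions + deletions + insertions) / reference word count.
# Assumes `pip install jiwer`; the sentences are made-up examples.
from jiwer import wer

reference = "move the meeting to thursday at three pm"
hypothesis = "move the meeting to tuesday at three"

# 1 substitution (thursday -> tuesday) + 1 deletion (pm) over 8 words
print(wer(reference, hypothesis))  # 0.25
```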
Please contact us via email (support@voicegain.ai) if you would like to review these accuracy benchmarks.
Whisper's pricing of $0.006/min is much lower than the Speech-to-Text offerings of some of the other large cloud players. It amounts to a 75% discount to Google Speech-to-Text and AWS Transcribe (i.e., $0.006/min vs. roughly $0.024/min, based on pricing as of the date of this post). However, there are a few caveats to this pricing, which are outlined in the Limitations section below.
Also significant: OpenAI announced the release of the ChatGPT APIs together with the Whisper APIs. Developers building Voice AI apps can now combine the power of the Whisper Speech-to-Text models with the GPT-3.5 LLM (the underlying model that the ChatGPT APIs give access to) to build really cool apps - whether for meetings or the call center. A sketch of such a pipeline is shown below.
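This example chains the two APIs to turn a call recording into a list of action items (file name and prompts are illustrative placeholders):

```python
# Sketch: Whisper transcription piped into GPT-3.5 for call analysis.
# Assumes `pip install openai` (v0.27+) and a local call_recording.mp3.
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

# Step 1: Speech-to-Text with the hosted Whisper model.
with open("call_recording.mp3", "rb") as f:
    transcript = openai.Audio.transcribe(model="whisper-1", file=f)["text"]

# Step 2: feed the transcript to GPT-3.5 for analysis.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You analyze call center transcripts."},
        {"role": "user", "content": "List the action items in this call:\n\n" + transcript},
    ],
)
print(completion["choices"][0]["message"]["content"])
```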
Whisper currently does not support apps that require real-time/streaming transcription, which is relevant to both the call center and the meetings use-case. While there are some hacks and work-arounds, they are not practical for a production deployment.
The throughput of the Whisper models - both the small and medium models - is quite low. Our ML engineers tested the Whisper models on popular NVIDIA GPU-based compute instances available in the public clouds (AWS, GCP and Microsoft Azure). Net-net, we determined that while developers would not have to pay for software licensing, the cloud infrastructure costs would be substantial: running Whisper so that it performs well costs in the range of $0.10 - $0.15/hour.
In addition to this infrastructure cost, the larger expense of running Whisper on the Edge (On-Premise + Private Cloud) is that it requires a dedicated back-end Engineering & DevOps team to run the model in a cost-effective manner.
As of the publication of this post, Whisper does not have a multi-channel audio API. So if your application involves audio with multiple speakers on separate channels, Whisper's effective price-per-min = Number of channels * $0.006; e.g., a 2-channel call center recording costs 2 * $0.006 = $0.012/min. For both the meetings and call center use-cases, this pricing can become prohibitive.
This release of Whisper is also missing some key features that developers need. The three most important ones we noticed are diarization (speaker separation), time-stamps and PII redaction.
At Voicegain, we have built deep-learning-based Speech-to-Text/ASR models that match the accuracy of models from the large players. For over 3 years now, developers have been using our Speech-to-Text APIs to build and launch successful products. Our focus has been on voice developers that need high accuracy (achieved by training custom acoustic models) and deployment in private infrastructure at an affordable price. We provide an accuracy SLA: we guarantee that a custom model trained on your data will be at least as accurate as the most popular options, including OpenAI's Whisper.
We also have models trained specifically on call center audio - so if you are looking for a call-center-focused model, we can provide higher accuracy than Whisper.
While Whisper is a worthy competitor (from, of course, a much larger company with 100x our resources), as developers ourselves we welcome the innovation that OpenAI is unleashing in this market. By adding ChatGPT APIs to our Speech-to-Text APIs, we are planning to broaden our API offerings to the developer community.
To sign up for a developer account on Voicegain with free credits, click here.
Interested in customizing the ASR or deploying Voicegain on your infrastructure?