Selecting a Speech-to-Text API for your SaaS app is not a slam dunk!
Updated: Jan 5
Developers building voice-enabled SaaS applications that embed Speech-to-Text or Transcription APIs as part of their product have multiple vendors to choose from.
However, the decision to pick the right Speech-to-Text platform or API is rather involved. This writeup outlines three types of vendors and the three key criteria (summarized as the 3 As - Accuracy, Affordability and Accessibility) to consider while making that choice.
Most voice-enabled SaaS apps that incorporate Speech-to-Text APIs broadly fall into two categories: 1) Analytics and 2) Automation. Analytics SaaS apps usually involve transcription of a conversation - either real-time or offline - between two or more people, and subsequent mining of the transcript using NLU to extract analytics like keywords, insights, topics and summaries. E.g., an app that transcribes and summarizes important meetings. On the other hand, Automation SaaS apps involve the real-time interaction of a person with a Bot or Voice Assistant. E.g., a voice bot that automates the ordering of food/pizza at a quick service restaurant.
Whether you are developing an analytics app or an automation app, you have the following vendor choices.
The Vendor Landscape
There are 3 distinct types of vendors:
Big Three Cloud Providers
Large Enterprise ASR Platforms
Pure-play Speech-to-Text/Voice AI Startups
1. Big Three Cloud Providers
For most developers, the first set of choices is the Speech-to-Text APIs from the big cloud companies - Google, Amazon and Microsoft. These big players offer Speech-to-Text APIs as part of their portfolio of Cloud AI & ML services. They are offered alongside other AI/ML APIs like NLU & Image Recognition APIs. The strategy for the Big Cloud providers is to sell their entire stack - from cloud infrastructure to APIs and even products.
However, the Cloud service providers may compete directly with the developers they look to serve. E.g., Amazon Connect directly competes with Contact Center platforms that are hosted on AWS. Google Dialogflow directly competes with other NLU startups that may be looking to build and offer Voice Bots and Voice Assistants to enterprises.
2. Large Enterprise ASR Platforms
Other than the big 3, Nuance and IBM Watson are large companies that have a rich history of providing Automatic Speech Recognition (ASR). Of the two, Nuance is better known and has been a dominant player both in the enterprise call center market with its Nuance ASR engine and in the medical transcription space with its Dragon offering. IBM has a long history of fundamental speech recognition research, and IBM Watson Speech-to-Text is their developer-oriented offering.
3. Pure-play Speech-to-Text/Voice AI Startups
Voicegain.ai, our company, plays alongside other startup companies like Deepgram that target SaaS developers with their best-of-breed DNN-based speech-to-text. Since these startups are specialized providers, they are focused on beating the big cloud providers and legacy players with respect to price, performance and ease of use.
Key Criteria - Accuracy, Affordability, Accessibility
The key criteria while picking an ASR or Speech-to-Text platform are the 3 As - Accuracy, Affordability and Accessibility.
1. Accuracy - Establish Target & Baseline accuracy
The first and most important criterion for any Speech-to-Text platform is recognition accuracy. However, accuracy is a tricky metric to assess and measure. There is no 'one-size-fits-all' approach to accuracy. We have shared our thoughts & benchmarks here. While Voicegain matches or exceeds the "out-of-the-box" transcription accuracy of most of the larger players, we suggest that you do additional diligence before making a choice. The audio datasets used in these benchmarks may or may not be similar to the use case or context for which the developer intends to use the API.
While accuracy is usually measured using Word Error Rate (WER), it is important to note that this metric too has limitations. For a SaaS app, getting some important and critical words right may be even more important than just a low overall WER.
That being said, it is important for developers to establish and calculate a quick baseline "out-of-the-box" accuracy for their application with their audio datasets.
At Voicegain, we have open sourced tools to benchmark our performance against the very best in business. We strongly recommend that developers & ML Engineers calculate a benchmark baseline accuracy for their vendor choices using a statistically significant volume of audio datasets for their application.
From a developer perspective, a baseline accuracy measure will provide insights into how closely your datasets match the datasets that the vendors' underlying STT models have been trained on.
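To make the baseline measurement concrete, here is a minimal sketch of how WER and a simple critical-word check could be computed for one reference/hypothesis pair. This is not the Voicegain benchmark tool; the function names, the sample sentences and the critical-word list are all illustrative assumptions, and the critical-word check is a plain bag-of-words comparison rather than an aligned one.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def critical_word_recall(reference: str, hypothesis: str,
                         critical_words: set) -> float:
    """Share of critical-word occurrences in the reference that the
    hypothesis also contains (bag-of-words, not position-aligned)."""
    wanted = {w.lower() for w in critical_words}
    ref_counts = Counter(w for w in reference.lower().split() if w in wanted)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in wanted)
    total = sum(ref_counts.values())
    if total == 0:
        return 1.0  # nothing critical was said, so nothing was missed
    found = sum(min(n, hyp_counts[w]) for w, n in ref_counts.items())
    return found / total


# Hypothetical food-ordering utterance and an imperfect transcript of it.
reference = "i would like a large pepperoni pizza with extra cheese"
hypothesis = "i would like a large pepper only pizza with extra cheese"

print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
print(f"Critical-word recall: "
      f"{critical_word_recall(reference, hypothesis, {'large', 'pepperoni', 'cheese'}):.2f}")
```

In this made-up example the overall WER is 0.20, which looks respectable, yet one of the three words the ordering bot actually depends on is lost. Averaging both numbers over a representative set of your own recordings gives a baseline you can compare across vendors.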
Here is a set of important factors that may affect your "out-of-the-box" accuracy:
Length of audio: Does your application involve audio data that is composed of short words/phrases or full sentences? Bots involve the use of short words and phrases, while analytics apps involve transcription of long sentences.
Industry jargon: Does your audio data have industry specific jargon and terms that are not part of normal vocabulary?
Audio quality - 8 kHz or 16 kHz: Is your audio telephony audio sampled at 8 kHz, or is it 16 kHz audio captured in a meeting platform like Zoom or Webex? Does the vendor have models that are tuned to 8 kHz and 16 kHz?
Separate channels: If there are multiple speakers, are you able to provide separate channels for each speaker to the Speech-to-Text engine? Accuracy could be higher if you are able to.
Background noise: Does your audio have a lot of background noise - e.g., news playing in the background or cross talk in a call center context? If so, how "sensitive" is the Speech-to-Text engine to such background noise?
Accents: Does your application support speakers with different accents?
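Two of the factors above, sample rate and channel layout, can be checked directly from your audio files before you run any benchmark. The sketch below uses Python's standard wave module and assumes uncompressed PCM WAV input; the function name and the returned field names are illustrative choices, not part of any vendor API.

```python
import wave


def describe_wav(path: str) -> dict:
    """Report the properties of a PCM WAV file that matter when
    choosing between 8 kHz and 16 kHz models and channel setups."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate_hz": sample_rate,
        "channels": channels,
        "duration_s": frames / sample_rate,
        # 8 kHz is the usual telephony rate mentioned above.
        "telephony_rate": sample_rate == 8000,
        # Two or more channels may mean one speaker per channel,
        # but that has to be confirmed with whoever recorded the audio.
        "multi_channel": channels > 1,
    }
```

Running this over a sample of your recordings tells you quickly whether you need a vendor with telephony-tuned models and whether per-speaker channels are available to send to the engine.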
Developers also need to establish a "Target" accuracy that their SaaS application or product requires. Usually Product Managers determine this based on their needs.
It is possible to bridge the gap between the Target Accuracy and the Baseline "out-of-the-box" accuracy. While it is outside the scope of this post, here is an overview of some ways in which developers can improve upon the Baseline accuracy.
Acoustic Model Training: Voicegain allows developers to train the underlying acoustic model. This is the best way to address issues related to accents and background noise. Here is a link to some results we have demonstrated with model training. Of the larger players, currently only Microsoft and IBM allow for acoustic model customization, whereas most players only allow customizing the language model (described below).
Language Model: Customizing the language model is usually the fastest and easiest way to boost the accuracy of the Speech-to-Text engine - especially for things like product names. This is accomplished in a few different ways. Some platforms allow developers to pass hints along with a recognition request while others allow you to load an entire corpus as a domain specific language model.
Speech Grammars: We have written extensively about Speech Grammars here, here and here, and about how they simplify development of Voice Bots and Assistants. They boost accuracy of specific entities like zip codes, addresses, dates, etc. They also improve recognition of short phrases like "card", "cash", etc., usually with better end results than using the hints mentioned above. While Speech Grammars were commonly used in the past for building telephony-based Speech-enabled IVRs (which relied on Speech-to-Text platforms based on HMMs and GMMs), most modern back-end developers are not familiar with the use of Grammars.
However, not all Speech-to-Text platforms support all of these options.
To summarize, the choice may not be as simple as picking the one with the best "out-of-the-box" accuracy. It could in fact be a platform that provides the most convenient and least expensive path to bridge the gap between Target and Baseline accuracy.
2. Affordability
The second most important factor after accuracy is price. Most SaaS products are disruptively priced. It is not uncommon for a SaaS product to be sold at 'tens of dollars' ($35-100) per user per month. It is critical that Speech-to-Text API costs make up as small a fraction of the SaaS price as possible. The price directly impacts the gross margin of the SaaS application, a critical financial metric/KPI that SaaS companies care deeply about.
In addition to the top-line usage-based price for the platform, it is also important to understand what the minimum billable time and billing increment are for each interaction. Many of the large Cloud providers have very high minimum billable times - 12 or 18 seconds. This makes them very expensive for Voice Bots or Voice Assistants, where individual utterances are often only a few seconds long.
Another cost-related aspect is the price for transcribing multi-channel audio, where only one speaker is active at a time. Does the platform charge for transcribing silence on the inactive channel?
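The effect of minimum billable time and billing increments is easiest to see with a little arithmetic. The sketch below assumes a common billing scheme (round each interaction up to the increment, then apply the minimum); actual vendor rules may differ, and the per-minute price, usage volume and SaaS price are made-up numbers for illustration only.

```python
import math


def billed_seconds(duration_s: float, minimum_s: float,
                   increment_s: float) -> float:
    """Round a duration up to the next billing increment,
    then apply the minimum billable time."""
    rounded = math.ceil(duration_s / increment_s) * increment_s
    return max(rounded, minimum_s)


def stt_share_of_price(billed_minutes_per_user: float,
                       price_per_minute: float,
                       saas_price_per_user: float) -> float:
    """Fraction of the monthly per-user SaaS price spent on Speech-to-Text."""
    return billed_minutes_per_user * price_per_minute / saas_price_per_user


# A 3-second bot utterance under two hypothetical billing policies.
utterance_s = 3
with_high_minimum = billed_seconds(utterance_s, minimum_s=18, increment_s=1)
with_low_minimum = billed_seconds(utterance_s, minimum_s=1, increment_s=1)
print(with_high_minimum, with_low_minimum)

# Hypothetical: 600 billed minutes per user per month at $0.01 per minute,
# for a SaaS product sold at $50 per user per month.
share = stt_share_of_price(600, 0.01, 50)
print(f"Speech-to-Text share of SaaS price: {share:.0%}")
```

With an 18-second minimum, the 3-second utterance is billed as 18 seconds, six times what a per-second policy would charge at the same list price. In the second calculation, Speech-to-Text consumes about 12% of the per-user price before any other cost of goods sold.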
3. Accessibility - Ease/Simplicity of Integration
The last (but not the least!) important criterion is accessibility - in other words, how simple and easy it is to integrate the Speech-to-Text platform with the SaaS Application.
This ease of integration becomes even more important if the SaaS Application streams audio in real time to the Speech-to-Text platform. Another important criterion for real-time streaming is latency - which is the time to receive recognition results from the platform. For a Bot or Voice Assistant, it is important to get API latency down to 500 milliseconds or lower. Also, reliable and fast end-of-speech detection is crucial in those scenarios for natural dialog turn taking.
At Voicegain, we support multiple options - ranging from TCP-based methods like gRPC and Websockets to telephony/UDP protocols like SIP/RTP, MRCP and SIPREC.
The choice made by the developer depends on the following factors:
The actual backend programming language or web framework that the SaaS app is built on (i.e., the libraries that they support).
The developers' familiarity or past experience with certain protocols.
For apps that are accessed over traditional telephony (PSTN), integration with modern telephony platforms becomes really important (CCaaS, On Premise Contact Center or CPaaS Platforms like Twilio & SignalWire). At Voicegain, we integrate with most prominent Cloud and Premise-based contact center platforms. We also allow you to use our JSON Callback based APIs with any platform that supports SIP INVITE.
In conclusion, selecting the right Speech-to-Text or ASR platform for a SaaS application requires diligence; it is by no means a slam dunk!!
We would really dig having a conversation with you about this. We are always keen to know what you are building and how we can help you make the right choice - even if it is not us. Connect with us on LinkedIn, give us a shout!! Or email us at firstname.lastname@example.org.