Blog | Speech-to-Text Platform

Contact Center

Voicegain Acquires TrampolineAI to deliver End-to-End Contact Center AI for Healthcare Payers

Arun Santhebennur

January 7, 2026

New unified platform combines AI voice agent automation with Real-time agent assistance and Auto QA, enabling healthcare payers to reduce average handle time (AHT) and improve first contact resolution (FCR) in their call centers.

IRVING, Texas and SAN FRANCISCO, Jan. 7, 2026 /PRNewswire-PRWeb/ -- Voicegain, a leader in AI Voice Agents and Infrastructure, today announced the acquisition of TrampolineAI, a venture-backed healthcare payer-focused Contact Center AI company whose products support thousands of member interactions. The acquisition unifies Voicegain's AI Voice Agent automation with TrampolineAI's real-time agent assistance and Auto QA capabilities, enabling healthcare payers to optimize their entire contact center operation, from fully automated interactions to AI-enhanced human agent support.

Healthcare payer contact centers face mounting pressure to reduce costs while improving member experience. The reasons range from CMS pressure and Medicaid redeterminations to Medicare AEP volume and staffing shortages. The challenge lies in balancing automation for routine inquiries with personalized support for complex interactions. The combined Voicegain and TrampolineAI platform addresses this challenge with a solution that spans the full spectrum of contact center needs: automating high-volume routine calls while empowering human agents with real-time intelligence for interactions that require specialized attention.

"We're seeing strong demand from healthcare payers for a production-ready Voice AI platform. TrampolineAI brings deep payer contact center expertise and deployments at scale, accelerating our mission at Voicegain." — Arun Santhebennur

Over the past two years, Voicegain has scaled Casey, an AI Voice Agent purpose-built for health plans, TPAs, utilization management, and other healthcare payer businesses. Casey answers and triages member and provider calls in health insurance payer call centers. After performing HIPAA validation, Casey automates routine caller intents related to claims, eligibility, coverage/benefits, and prior authorization. For calls requiring live assistance, Casey transfers the interaction context via screen pop to human agents.

TrampolineAI has developed a payer-focused Generative AI suite of contact center products—Assist, Analyze, and Auto QA—designed to enhance human agent efficiency and effectiveness. The platform analyzes conversations between members and agents in real-time, leveraging real-time transcription and Gen AI models. It provides real-time answers by scanning plan documents such as Summary of Benefits and Coverage (SBCs) and Summary Plan Descriptions (SPDs), fills agent checklists automatically, and generates payer-optimized interaction summaries. Since its founding, TrampolineAI has established deployments with leading TPAs and health plans, processing hundreds of thousands of member interactions.

"Our mission at Voicegain is to enable businesses to deploy private, mission-critical Voice AI at scale," said Arun Santhebennur, Co-founder and CEO of Voicegain. "As we enter 2026, we are seeing strong demand from healthcare payers for a comprehensive, production-ready Voice AI platform. The TrampolineAI team brings deep expertise in healthcare payer operations and contact center technology, and their solutions are already deployed at scale across multiple payer environments."

Through this acquisition, Voicegain expands the Casey platform with purpose-built capabilities for payer contact centers, including AI-assisted agent workflows, real-time sentiment analysis, and automated quality monitoring. TrampolineAI customers gain access to Voicegain's AI Voice Agents, enterprise-grade Voice AI infrastructure including real-time and batch transcription, and large-scale deployment capabilities, while continuing to receive uninterrupted service.

"We founded TrampolineAI to address the significant administrative cost challenges healthcare payers face by deploying Generative Voice AI in production environments at scale," said Mike Bourke, Founder and CEO of TrampolineAI. "Joining Voicegain allows us to accelerate that mission with their enterprise-grade infrastructure, engineering capabilities, and established customer base in the healthcare payer market. Together, we can deliver a truly comprehensive solution that serves the full range of contact center needs."

A TPA deploying TrampolineAI noted the platform's immediate impact, stating that the data and insights surfaced by the application were fantastic, allowing the organization to see trends and issues immediately across all incoming calls.

The combined platform positions Voicegain to deliver a complete contact center solution spanning IVA call automation, real-time transcription and agent assist, Medicare and Medicaid compliant automated QA, and next-generation analytics with native LLM analysis capabilities. Integration work is already in progress, and customers will begin seeing benefits of the combined platform in Q1 2026.

Following the acquisition, TrampolineAI founding team members Mike Bourke and Jason Fama have joined Voicegain's Advisory Board, where they will provide strategic guidance on product development and AI innovation for healthcare payer applications.

The terms of the acquisition were not disclosed.

About Voicegain

Voicegain offers healthcare payer-focused AI Voice Agents and a private Voice AI platform that enables enterprises to build, deploy, and scale voice-driven applications. Voicegain Casey is designed specifically for healthcare payers, supporting automated and assisted customer service interactions with enterprise-grade security, scalability, and compliance. For more information, visit voicegain.ai.

About TrampolineAI

TrampolineAI is a venture-backed voice AI company focused on healthcare payer solutions. The company applies Generative Voice AI to contact centers to improve operational efficiency, member experience, and compliance through real-time agent assist, sentiment analysis, and automated quality assurance technologies. For more information, visit trampolineai.com.

Media Contact:

Arun Santhebennur

Co-founder & CEO, Voicegain

press@voicegain.ai

Media Contact

Arun Santhebennur, Voicegain, 1 9725180863 701, arun@voicegain.ai, https://www.voicegain.ai

SOURCE Voicegain

Model Training

Competitive Advantage of Custom Acoustic Models

Jacek Jarmulak

June 30, 2020

There is no doubt that there is a lot of value in the datasets that are used to train AI models. That is one of the reasons why Google offers their Speech-to-Text service at two price points, one with 'data logging' and one without (see table below).

However, at Voicegain, our speech-to-text platform does not capture or use any customer data (while still being able to offer low ASR pricing).

Moreover, the Voicegain platform enables our customers to use their data to train their own dedicated, custom Acoustic Models. As a result, our customers benefit in two ways:

The accuracy of these custom acoustic models is several percent higher compared to our base models.
Custom models are licensed exclusively to the clients and are not shared with anyone (neither Voicegain nor any other Voicegain customers), so this higher accuracy translates directly into competitive advantage.

By retaining ownership of the data and the custom acoustic models, our customers benefit from higher ASR accuracy in general, and higher accuracy than their potential competitors in particular.

Insights

How AI powered Speech can boost Contact Center BPO topline?

Arun Santhebennur

June 27, 2020

Senior leadership teams at most global contact center outsourcers are constantly under pressure. They need to have a laser-like focus on key metrics, SLAs and people to manage their businesses. They are increasingly managing a globally distributed business that is both labor intensive and technology intensive. And they have to do all of this with increasingly tight margins.

Despite being measured on metrics like CSAT and NPS, a lot of the value that an outsourcer delivers to its clients is often hard to quantify. And too often the price realized by the outsourcer does not capture the value and quality an outsourcer provides.

Two Ideas to pivot into high value SaaS offerings

In this article I would like to propose two new innovative ideas that can help Contact Center BPOs pivot into new SaaS (Software-as-a-Service) revenues.

CX Speech Insights Service: Develop a new branded realtime CX insights service based on speech analytics powered by deep learning.
CX Speech Automation Service: Build new voice self-service applications that can automate some of the common customer care scenarios.

Both these offerings can be offered to the clients using a Software-as-a-Service (SaaS) based business model in conjunction with the traditional agent side of the business.

Both these SaaS offerings leverage some of the key strengths that BPOs have: deep domain expertise, in-depth understanding of customer issues, and technology infrastructure that leverages both.

1. CX Speech Insights Service

Contact centers have a treasure trove of audio data. Every day associates are handling thousands of calls across a wide variety of topics. While outsourcers use legacy speech analytics vendors, the traditional use has been to analyze a sample of calls to assist in the Quality Assurance function. Net-net, it is viewed as a cost center both for the outsourcers and their clients.

However there is a massive untapped opportunity to mine and extract insights from such audio data for uses well beyond quality assurance. Such insights may be relevant to stakeholders in Product and Marketing teams of the clients. This can open up new non-traditional product and marketing budgets for BPOs.

2. CX Speech Automation Service

Outsourcers have an in-depth understanding of the topics that customers are currently calling about. They have unique and current insights into which categories of calls are actually driving volumes. With the right tools, methodologies and personnel, outsourcers can build and offer new innovative speech self-service applications that may automate parts of calls. With the right technologies, outsourcers can move seamlessly between agent-assisted calls and automated self-service interactions.

The Foundation: Deep Neural Networks & custom acoustic models

The foundation for these SaaS offerings is a modern Deep Neural Network (DNN) based Speech-to-Text platform.

The old speech-to-text technologies were based on traditional statistical models (called HMMs and GMMs). They were limited in their ability to train on specific industry jargon and accents. But a DNN based platform has the following advantages:

A DNN based platform can be easily trained to recognize unique words/jargon, accents and noisy backgrounds. Training the models increases the quality of recognition and makes it accurate enough to deliver real value to client stakeholders.
An industry or customer specific acoustic model has the potential to create intellectual property for the BPO.
A DNN platform can be used equally well both in the up front automation part and in the analytics and notification service. There are benefits from using the same platform for both offerings.

For more info, please contact us at info@voicegain.ai.

Benchmark

Speech-to-Text Accuracy Benchmark - June 2020 Results

Jacek Jarmulak

June 25, 2020

[UPDATE - October 31st, 2021: Current benchmark results from end October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced.]

"What is the accuracy of your recognizer?"

That is the question that we are frequently asked by our potential customers. Often we answer "that depends" and we get a feeling that the other side thinks "must be really bad if they do not give a straight answer". However, "that depends" is really the right answer. Accuracy of automated speech recognition (ASR) depends on the audio in many ways and the effect is not small. Basically, accuracy can be all over the place depending on factors like:

Does the speech follow proper grammar or is the speaker making things up as they go? Prepared speeches will have better, i.e. lower, WER (word error rate) scores compared to unscripted speech.
What is the subject of the speech? Rare and obscure words or word combinations, e.g. names of people, will make life difficult for the NLM (natural language model).
Is there more than one speaker? Are they constantly switching or even talking over one another?
Is there music in the background? This is very common for YouTube productions.
Is there background noise? What is the type of noise?
Are parts of the speech audio unusually slow or fast?
Is there room reverb or echo in the recording?
Is the recording volume very low? Are there variations in the recording volume (e.g. recorder placed on one edge of a very long table)?
Is the recording quality bad, e.g., due to a codec or insane archival compression levels?
etc. etc.

Testing / Benchmarking Speech-to-Text Accuracy

Because accuracy or Word Error Rate questions are somewhat meaningless without specifying the type of speech audio, it is important to do testing when choosing a speech recognizer. As a test set, one would choose a set of audio files that accurately represent the spectrum of the speech that will be encountered by the recognizer in the expected use cases. For each speech audio file from the set one would obtain a gold/reference transcript that is 100% accurate. After that, things can be automated -- transcribe each file on the recognizers being evaluated, compute WER against the reference for each of the generated transcripts, and collate the results. The combined results will present a clear picture of how the recognizers perform on the specific speech audio that we care about. If you are going to repeat this process often, e.g., to evaluate new candidates on the recognizer market, it is good to standardize the test set, basically creating a repeatable benchmark that can be referenced in the future.
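The automated part of that process can be sketched in a few lines of Python. This is a minimal illustration, not the code used for our benchmark: it assumes transcripts are compared after simple lowercasing and whitespace splitting (real tools normalize punctuation, numbers, etc. more carefully), and the names word_error_rate and collate are made up for this example.

```python
from statistics import median


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    # Assumption: lowercase + whitespace split is enough normalization for a sketch.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def collate(references: dict[str, str], transcripts: dict[str, str]) -> dict[str, float]:
    """Compute per-file WER for one recognizer and summarize it."""
    per_file = [word_error_rate(ref, transcripts[name]) for name, ref in references.items()]
    return {"median": median(per_file), "worst": max(per_file)}
```

Running collate once per recognizer over the same reference set gives numbers that can be compared side by side. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.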

Our benchmark

The benchmark results that we are presenting here are somewhat different than the use-case driven tests or benchmarks. Because we are building a general recognizer for an unspecified use case, we intentionally decided to use a very broad set of audio files. Rather than collecting the test files ourselves, we decided to use the data set described in "Which Automatic Transcription Service is the Most Accurate? — 2018" from September 2018 by Jason Kincaid. The article presents a comparison of Speech Recognizers from various companies using a set of 48 YouTube videos (taking 5 minutes of audio from each of the videos). By the time we decided to do a retest of Jason's benchmark, 4 videos were no longer accessible, so our benchmark presented here uses data from only 44 videos.

We compared the results presented by Jason to the results from the big 3 - Google, Amazon, and Microsoft - recognizers as of June 2020. Of course, we also included our Voicegain recognizer, because we wanted to see how we stacked up against those. All the tested recognizers use Deep Neural Networks. The Voicegain speech recognizer ran on the Google Cloud Platform using Nvidia T4 GPUs. All recognizers were run with default settings, and no hints or user language models were used.

It is important to mention that none of the benchmark files are included in the training set that Voicegain uses. Neither is other audio from the speakers from the benchmark files, nor the same content but spoken by other speakers.

So what are the results? Who has the best recognizer?

Again, the best recognizer is not the right question, because it all depends on the actual speech audio it is used on. But the key results from testing on the 44 files are as follows:

Every recognizer has improved. The biggest improvement in median WER was by Microsoft Speech to Text.
The best recognizer in our data set was Google Speech to Text - Enhanced (video), but the new Microsoft Speech to Text is a very close second.
Taking price into consideration, Microsoft might be declared Best Buy.
The Voicegain recognizer is definitely Best Value.
Google Speech to Text - Standard, although somewhat improved, is still clearly the worst performing on the data set.
The single bad data point for Google Enhanced (video) is real. We ran repeated tests on the file and got the same result. The old Google Enhanced recognizer did not have problems with that file.

How does the Voicegain recognizer stack up?

Here are our thoughts and some details:

Up until October 2019 the training set we were using to train our recognizer was relatively unchanged. Moreover, our training set was heavily biased towards some categories of speech audio. You can see that in the chart, e.g., by the fact that our best results were better than old Amazon Transcribe but our worst results were quite a bit worse than Amazon Transcribe.
Based on the first results from the benchmark we analyzed what kind of audio gave us trouble, and collected data with the particular characteristics but sourced very broadly (to avoid training to benchmark) to make our recognizer more robust. That effort paid off and you can see that now the Voicegain recognizer WER spread is much tighter and overall is now very close to new Amazon Transcribe.
Overall Voicegain is the most improved recognizer. Just over 6 months ago we were just better than Google Standard, but now we are closing in on Amazon Transcribe. This is the result of both changes to the Neural Network architecture and a large increase in the training data set hours.
If you look into the details, the Voicegain recognizer was better than new Amazon on 11 out of 44 files, better than Google Video on 5 files, and better than Microsoft also on 5 out of 44 files.
If you consider the price, we think that Voicegain presents a great value. We have talked to customers who were not doing large scale transcription due to the high cost of the 3 big platforms, and our low pricing suddenly made new uses of transcription viable.

We welcome anyone to test our platform and see how it performs on speech audio types that matter for your use cases.

Any software that can help me in testing recognizers?

We have Open Sourced the key component of our benchmark suite, the transcribe_compare python utility. It is available here: https://github.com/voicegain/transcription-compare under MIT license.

It is useful for automatic benchmarking but it can also output data to an html file which can be viewed in a web browser. We use it often this way to do a manual review of the transcription errors or differences in errors between two recognizers or recognizer versions.

How can I test drive Voicegain?

If you are building an app that requires transcription, sign up today for a developer account and get $50 in free credits (~5000 minutes of platform use). You can check out our accuracy and test our APIs. Instructions to sign up for a developer account are provided here.

If you want to make Voicegain your own AI Transcription Assistant, click here. You can take Voicegain to meetings, webinars, talks, lectures and more.

We expect to catch up soon

We are still in the middle of an extensive data collection effort and the training is not over yet. We are seeing continuing improvement in our recognizer, with new improved versions of the acoustic model deployed to production about twice a month. We will report updated benchmark results on our blog in a few months.

User-Customized Acoustic Model

We have another blog post planned that is going to quantify the benefit one can expect from using additional user data to train the acoustic model used in the recognizer. We have selected a large data set with a very specific English accent that currently has higher WER. We will report on the impact on WER of training on such a data set. We will quantify the improvement based on the size of the data set and the duration of training.

Voicegain provides easy to use tools that allow users to build their own custom acoustic models. This upcoming post will provide a clear insight as to what improvements to expect and how much data is needed to make a difference in reducing WER.

References

The original benchmark article with the description of the data set.
Detailed results for all 44 files.
Google Speech-to-Text pricing. Billed in 15-second increments.
Amazon Transcribe pricing. Billed in one-second increments, with a minimum per-request charge of 15 seconds.
Microsoft Speech-to-Text pricing. And here are the relevant FAQs.
Voicegain Pricing. Billed in 1 second increments.
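The billing increments listed above matter most for short audio. The sketch below shows the arithmetic; it assumes that partial increments are rounded up (the usual convention, but check each provider's pricing page), and billed_seconds is a name made up for this example.

```python
import math


def billed_seconds(duration_s: float, increment_s: int, minimum_s: int = 0) -> int:
    """Seconds charged for one request under increment-based billing.

    Assumption: a partial increment is rounded up to a full one.
    """
    rounded = math.ceil(duration_s / increment_s) * increment_s
    return max(rounded, minimum_s)


# A 7-second utterance under the three schemes described in the references:
fifteen_second_increments = billed_seconds(7, increment_s=15)           # 15
one_second_with_minimum = billed_seconds(7, increment_s=1, minimum_s=15)  # 15
one_second_increments = billed_seconds(7, increment_s=1)                # 7
```

For long recordings the difference is negligible, but for short utterances (typical of IVR use) 15-second increments or minimums can roughly double the billed time.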

Contact Us

If you have any questions regarding this article or our platform and recognizer, you can contact us at info@voicegain.ai.

Use Cases

Transcription for Live Streamed Event - an example

Jacek Jarmulak

June 24, 2020

The video below shows an example of Voicegain Live Transcribe used to provide transcription for an event streamed over video.

Here are some details about this particular setup:

the video part is streamed using BoxCast
the audio for transcription is tapped live at the source on site
audio is streamed to Voicegain Cloud for processing using a small Java client running on a Raspberry Pi computer
the audio client was downloaded pre-configured from the Voicegain portal and reads audio directly from a USB audio device plugged into the Raspberry Pi
speech is transcribed in the Cloud using Voicegain semi-real-time mode which delivers results in about 30 seconds (the real-time mode delivers results with less than 1 second delay)
the transcription output goes via a delay component that allows us to dial in the precise delay to match the streaming video delay - in this case the delay was 35.5 seconds
the transcribed words are sent to a Web Client over websocket - each word is sent with the set delay
the words are displayed with the gray font shade corresponding to the confidence in the words and the gap proportional to the gap between the spoken words
the Acoustic Model used here has been custom trained with an additional 200+ hours of audio from this particular speaker
custom training data consisted simply of previously transcribed speeches by the speaker that were readily available on the website
we are also using a custom Language Model (on top of the base NLM) that was created from a user-provided corpus
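The delay component and the confidence-based shading from the list above can be illustrated with a short Python sketch. It is not the code running in this setup: it assumes the delay is measured from the time each word was spoken, and the names TimedWord, DelayLine and gray_shade, as well as the exact gray scale, are invented for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    spoken_at: float   # seconds on the audio clock when the word was spoken
    confidence: float  # recognizer confidence, 0.0 .. 1.0


class DelayLine:
    """Holds recognized words until a fixed delay has passed since they were spoken."""

    def __init__(self, delay_s: float) -> None:
        self.delay_s = delay_s
        self._pending: deque[TimedWord] = deque()

    def push(self, word: TimedWord) -> None:
        # Words are assumed to arrive in spoken order.
        self._pending.append(word)

    def release(self, now: float) -> list[TimedWord]:
        """Return, in order, every word whose display time has arrived."""
        due = []
        while self._pending and self._pending[0].spoken_at + self.delay_s <= now:
            due.append(self._pending.popleft())
        return due


def gray_shade(confidence: float) -> str:
    """Map confidence to a CSS gray: 1.0 is black, 0.0 is light gray."""
    clamped = min(max(confidence, 0.0), 1.0)
    level = round((1.0 - clamped) * 200)  # capped at 200 so faint words stay readable
    return f"#{level:02x}{level:02x}{level:02x}"
```

With delay_s set to 35.5, a word spoken at t = 0.0 is released once the clock reaches 35.5, which is how the transcript is kept in step with the delayed video. Because each word keeps its own spoken_at time, the gaps between displayed words follow the gaps between spoken words.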

Insights

Key Differentiators

Jacek Jarmulak

March 30, 2020

The current enterprise speech-to-text market can be divided into 3 distinct groups of players. Note that we are focusing here on speech-to-text platforms rather than complete end-user products (so we do not include consumer products like Dragon NaturallySpeaking, etc.).

The old ASRs - for example Nuance (and every speech company that Nuance acquired over the years) and Lumenvox. These speech-to-text engines go back to the late 1990s and early 2000s. They were built using technology relying on Gaussian Mixture Models and Hidden Markov Models. They require an on-prem install.
Established Cloud Speech-to-Text services - like Google, AWS, Microsoft Azure, IBM. Some of these also began with recognizers built using Gaussian Mixture Models and Hidden Markov Models, but by 2012 started transitioning to recognizers using Deep Neural Network models for speech recognition.
New players - these are new companies going back to about 2015. That is when Nvidia made it possible for pretty much anyone to train DNNs on Nvidia's new GPUs. A lot of small companies arose which built their own speech-to-text engines either from scratch or using open-source foundations. Now, 5 years later, many of them are entering the speech-to-text market with mature products and delivering high recognition accuracy.

Where does Voicegain fit here?

We consider ourselves one of the new players, as we started working on our own DNN-based speech-to-text engine at the end of 2016. However, we have been working with old style ASRs since 2006 and as a result we knew their limitations very well. That was what motivated us to develop ASRs of our own.

We are also very familiar with employing ASRs in real-world large volume applications, so we know which features the users of ASRs want - be it developers who build the applications, or IT personnel who have to host and maintain them.

All of this guided us in decisions we made when developing our speech-to-text platform.

So how is Voicegain product different?

Below we list what we think are 4 key differentiators of our speech-to-text platform compared to competition. Note that the competitive field is pretty broad, and we consider a particular feature a differentiator if it is not a common feature in the market.

1) Edge Deployment

By Edge Deployment we mean a deployment on customer premises (datacenter) or in a VPC. Moreover, the deployment is fully orchestrated and managed from the Cloud (for more information see our blog post about Benefits of Edge Deployment). The aspect of orchestration and built-in management makes it essentially different from the old ASRs, which were also deployed on-prem and required Support Contracts to deploy them successfully and to maintain them over time.

We think that Edge Deployment is critical for a speech-to-text platform which is to replace many of the old ASRs in their applications.

2) Acoustic Model Customization

Over the years when working with ASRs we noticed that there were cases where the ASR would show consistently higher error rates. Usually, this was related to IVR calls coming from customers in regions of the country with distinct accents.

In some of our use cases so far, the ability to customize models has allowed us to reduce WER very significantly (e.g. from 8% WER to 3%).

We are currently working on a rigorous experiment where we are customizing our model to support Irish English. We plan to report in detail on the results in April.

3) Targeted support for IVR

The Voicegain speech-to-text platform was developed specifically with IVR use cases in mind. Currently the platform supports the following 3 IVR use cases, and we are working on adding conversational NLU later this year.

a) ASR with support for legacy IVR Standards

In order to make our speech-to-text engine an attractive solution for replacement of old ASRs, we implemented it to support legacy standards like MRCP and GRXML. That support is not a mere add-on, simply tacking a Web API onto the back of an MRCP server, but is more integral - our core speech-to-text engine directly interprets a superset of MRCP protocol commands.

We also support GRXML and JSGF grammars - via MRCP, in IVR callbacks, and over Web API.

When used with grammars, a big advantage of the Voicegain recognizer is that at the core it is a large vocabulary recognizer. Grammars are used to constrain the recognized utterances to facilitate semantic mapping, but the recognizer can also recognize Out-of-Grammar utterances, which opens new possibilities for IVR tuning.

b) Web-hook IVR Support (without VXML)

Flow-based IVR systems have traditionally been built using two approaches - (i) either having the dialog interactions interpreted on a VXML platform (VXML browser), or (ii) using webhooks invoking application logic running on standard web back-end platforms (examples of the latter are offerings of e.g. Twilio, Plivo, or Tropo).

Our platform supports webhook style IVRs. Incoming calls can be interfaced via standard telephony SIP/RTP, and the IVR dialog can be directed from any platform that implements web-hooks (e.g. Node.js, Django)
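To give a feel for the webhook style, here is a minimal Python sketch of the application-logic side. It is only an illustration of the pattern: the JSON field names (state, utterance, prompt, hangup), the DIALOG table and the handle_webhook function are hypothetical and do not describe the actual Voicegain callback format.

```python
import json

# Hypothetical dialog definition: state -> prompt to play and accepted answers.
DIALOG = {
    "start": {"prompt": "Say 'billing' or 'support'.",
              "next": {"billing": "billing", "support": "support"}},
    "billing": {"prompt": "Connecting you to billing.", "next": {}},
    "support": {"prompt": "Connecting you to support.", "next": {}},
}


def handle_webhook(body: bytes) -> bytes:
    """Decide the next IVR step from one webhook request body.

    Assumption: the IVR platform POSTs JSON with the current dialog state and
    the recognized utterance, and expects JSON describing what to do next.
    """
    event = json.loads(body)
    state = event.get("state", "start")
    utterance = (event.get("utterance") or "").strip().lower()
    target = DIALOG[state]["next"].get(utterance)
    if target is None:
        # First request of the call, or an out-of-grammar answer: replay the prompt.
        reply = {"state": state, "prompt": DIALOG[state]["prompt"], "hangup": False}
    else:
        # Terminal states (no further choices) end the dialog after their prompt.
        reply = {"state": target, "prompt": DIALOG[target]["prompt"],
                 "hangup": not DIALOG[target]["next"]}
    return json.dumps(reply).encode("utf-8")
```

Because handle_webhook only maps a request body to a response body, it can be mounted behind any web framework route, which is the main appeal of the webhook approach over a VXML browser.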

c) Enabling IVRs that use chatbot back-end

Many companies have invested significant effort into building their own text-based chatbots rather than using products like Google Dialogflow. What the Voicegain platform provides is an easy way to deploy the existing chatbot logic on a telephony speech channel. This takes advantage of our platform's webhook-ivr IVR support and can feed real-time text (including multiple alternatives) to a chatbot platform. We also provide audio output either via TTS or prerecorded clips.

4) End-to-end support for Real-Time Continuous Speech-to-Text

Because IVR has always been our focus, we built our Acoustic Models to support low latency real-time speech-to-text (both continuous large vocabulary and with context-free grammars). We also focused on convenient ways to stream audio into our speech-to-text platform, and to consume the generated transcript.

One of our products is Live Transcribe, which allows for real-time transcription (with just a few seconds of delay) that is then broadcast over websockets and can be consumed on provided web clients. This opens the possibility of live speaker transcription, with use cases that may include conferences, lectures, etc., making these events easier for hearing-impaired audience members to participate in.

Developers

"Hello World" Example

Jacek Jarmulak

March 19, 2020

In this post we show in three steps what is needed to run your first transcription using the Voicegain API.

We assume that you have already signed up for a Voicegain account and logged into the portal.

Step 1: Create new Context

The main reason to create a new Context is to establish a new authentication realm. Access to each Context can be separately controlled, so it is easy to disable access to a certain Context without affecting other Contexts.

Contexts are also used for specifying default ASR settings.

You can create a new Context from the Context Dash

Step 2: Generate Authentication token

Voicegain APIs use JWT (JSON Web Tokens) to identify and authenticate the account making the request. In order to make API requests you need to generate a JWT which can easily be done from the portal.

Step 3: Run the curl command

Below is the complete input and output from the curl command that submits a Web API request to the Voicegain Synchronous Speech-to-Text API: https://api.voicegain.ai/v1/asr/transcribe

In this case, the audio to be transcribed was retrieved from a URL. Audio can alternatively be submitted in-line (within the request).

Note that synchronous transcription has an audio length limit of 60 seconds. Longer audio requires use of the asynchronous transcription API.
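For readers who prefer Python to curl, the same kind of request can be sketched with the standard library. Treat this as an approximation rather than a reference: the endpoint is the one named above, but the Bearer authorization scheme and the JSON body layout (audio / source / fromUrl / url) are assumptions that should be checked against the API documentation, and build_transcribe_request is a name made up for this example.

```python
import json
import urllib.request

# Endpoint named in this post for synchronous transcription.
TRANSCRIBE_URL = "https://api.voicegain.ai/v1/asr/transcribe"


def build_transcribe_request(jwt: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) a synchronous transcription request.

    Assumptions, to be verified against the API docs:
      - the JWT from Step 2 is passed as a Bearer token
      - audio fetched from a URL is described as audio.source.fromUrl.url
    """
    body = {"audio": {"source": {"fromUrl": {"url": audio_url}}}}
    return urllib.request.Request(
        TRANSCRIBE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request. Keeping construction separate from sending makes it easy to inspect the headers and body before spending any credits.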

For asynchronous transcription requests it is possible to stream the audio, e.g. via websocket. You can see some of Voicegain API documentation at: https://www.voicegain.ai/api