Our Blog

News, Insights, sample code & more!

ASR
Going beyond Accuracy: Key STT API Features for Contact Center Voice AI Apps

This article is for companies building Voice AI Apps targeting the Contact Center. It outlines the key technical features, beyond accuracy, that are important when evaluating an OEM Speech-to-Text (STT) API. Most analyses focus on the importance of accuracy and on metrics like word error rate (WER) benchmarks. While accuracy is very important, there are other technical features that are equally important for contact center AI apps.

Introduction

There are multiple use-cases for Voice AI Apps in the Contact Center. Some of the common use cases are: 1) AI Voicebots or Voice Agents, 2) Real-time Agent Assist, and 3) Post-Call Speech Analytics.

This article is focused on the third use-case, Post-Call Speech Analytics, which relies on batch STT APIs, while the first two use-cases require streaming transcription. A Speech Analytics App helps the Quality Assurance and Agent-Performance management process. This article is intended for Product Managers and Engineering leads building AI Voice Apps that target the QA, Coaching and Agent Performance management process in the call center. Companies building such apps include 1) CCaaS vendors adding AI features, 2) Enterprise IT or Call Center BPO Digital organizations building an in-house Speech Analytics App, and 3) Call Center Voice AI startups.

1. Accurate Speaker Diarization

Very often, call-center audio recordings are available only in mono. And even when the recording is 2-channel/stereo, a single channel can include multiple voices. For example, the Agent channel can include IVR prompts and hold-music recordings in addition to the Agent's voice. Hence a very important criterion when selecting an OEM Speech-to-Text vendor is that they provide accurate speaker diarization.

We recommend testing various speech-to-text vendors with a good sample set of mono audio files. Select files representative of what will be used in production and calculate the Diarization Error Rate (DER). Here is a useful link that outlines the technical aspects of understanding and measuring speaker diarization.
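
As a concrete starting point, here is a minimal sketch of computing DER with the open-source pyannote.metrics library (the reference and hypothesis turns below are made-up examples):

    # pip install pyannote.core pyannote.metrics
    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    # Ground-truth speaker turns, e.g. from human annotation of a mono file.
    reference = Annotation()
    reference[Segment(0.0, 4.5)] = "agent"
    reference[Segment(4.5, 9.0)] = "caller"

    # Speaker turns produced by the STT vendor's diarization.
    hypothesis = Annotation()
    hypothesis[Segment(0.0, 4.0)] = "spk_0"
    hypothesis[Segment(4.0, 9.0)] = "spk_1"

    der = DiarizationErrorRate()
    print(f"DER: {der(reference, hypothesis):.2%}")  # lower is better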

2. Accurate PII/Named Entity Redaction and PCI Compliance

A very common requirement of Voice AI Apps is to redact PII - Personally Identifiable Information. PII redaction is a post-processing step that a Speech-to-Text API vendor needs to perform. It involves accurately identifying entities like names, email addresses, phone numbers and mailing addresses, and subsequently redacting them both in text and audio. In addition, there are PCI (Payment Card Industry) specific named entities like credit card numbers, 3-digit CVV codes and expiry dates. Successful PII and PCI redaction requires post-processing algorithms that accurately identify a wide range of PII entities and cover a wide range of test scenarios, including scenarios with errors in user input and errors in speech recognition.
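
To make the text-side step concrete, below is a minimal sketch of pattern-based redaction on a transcript. Real systems combine trained NER models with audio redaction; the patterns here are deliberately simple:

    import re

    # Simple patterns for a few PII/PCI entity types; production systems use
    # trained NER models and far more test scenarios than shown here.
    PATTERNS = {
        "CC": re.compile(r"\b(?:\d[ -]?){13,16}\b"),            # card number
        "PHONE": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),  # US phone
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    }

    def redact(text: str) -> str:
        for tag, pattern in PATTERNS.items():
            text = pattern.sub(f"[{tag} REDACTED]", text)
        return text

    print(redact("my card is 4111 1111 2222 3344, email jo@example.com"))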

There is another important capability related to PCI/PII redaction. Very often, PII/PCI entities span multiple turns in a conversation between an Agent and a Caller. It is important that the OEM Speech-to-Text vendor's post-processing algorithm can process both channels simultaneously when looking for these named entities, as in the sketch below.
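
A sketch of the cross-channel idea, using hypothetical turn data: both channels are merged into one time-ordered stream, so a card number read out across several caller turns (with agent interjections in between) is seen as one digit sequence:

    import re

    # Hypothetical per-channel turns as (start_sec, text); in a real pipeline
    # these come from the STT API's per-channel results.
    agent_turns = [(12.0, "can you read me your card number"),
                   (20.0, "go ahead")]
    caller_turns = [(15.0, "sure it is four one one one one one one one"),
                    (22.0, "two two two two three three four four")]

    DIGITS = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
              "four": "4", "five": "5", "six": "6", "seven": "7",
              "eight": "8", "nine": "9"}

    # Merge channels into one time-ordered conversation before scanning.
    merged = sorted(agent_turns + caller_turns)
    stream = "".join(DIGITS.get(w, " ")
                     for _, text in merged for w in text.split())

    # A run of 16 digits (possibly interrupted by non-digit words) is a
    # candidate credit-card number spanning multiple turns.
    if re.search(r"(?:\d\s*){15}\d", stream):
        print("candidate card number spans multiple turns - redact it")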

3. Language Detection

A Call Center audio recording could start off in one language and then switch to another. The Speech-to-Text API should be able to detect the language and then perform the transcription.
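
For illustration only, such a request might look like the sketch below. The field names here are hypothetical and not any particular vendor's schema:

    import requests

    # Hypothetical request: supply candidate languages and let the API detect
    # which one is spoken before transcribing. Field names are illustrative.
    payload = {
        "audio": {"source": {"fromUrl": "https://example.com/call.wav"}},
        "settings": {"asr": {"languageDetect": True,
                             "candidateLanguages": ["en-US", "es-US"]}},
    }
    response = requests.post("https://api.example.com/v1/stt/transcribe",
                             json=payload)
    print(response.json().get("detectedLanguage"))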

4. Hints/Keyword Boosting

There will always be words that are not accurately transcribed, even by the most accurate Speech-to-Text model. The API should include support for Hints or Keyword Boosting, where words that are consistently misrecognized can get replaced by the correctly transcribed word. This is especially applicable to names of companies, products and industry-specific terminology.
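
For illustration, a hints payload might look like the sketch below; the "hints" and "boost" field names are hypothetical, so check the specific vendor's schema:

    import requests

    # Hypothetical keyword-boosting request; exact field names vary by vendor.
    payload = {
        "audio": {"source": {"fromUrl": "https://example.com/call.wav"}},
        "settings": {"asr": {"hints": [
            {"phrase": "Voicegain", "boost": 10},  # company name
            {"phrase": "SIPREC", "boost": 8},      # industry jargon
        ]}},
    }
    response = requests.post("https://api.example.com/v1/stt/transcribe",
                             json=payload)
    print(response.json())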

5. Sentiment and Emotion

There are AI models that measure sentiment and emotion, and these models can be incorporated in the post-processing stage of transcription to enhance the Speech-to-Text API. Sentiment is extracted from the text of the transcript, while Emotion is computed from the tone of the audio. A well-designed API should return Sentiment and Emotion throughout the interaction between the Agent and Caller, and it should compute the overall sentiment of the call by weighting the "ending sentiment" appropriately.
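
One simple way to weight the "ending sentiment", sketched under the assumption that the API returns per-segment sentiment values in the range -1.0 to +1.0:

    def overall_sentiment(segments, ending_fraction=0.2, ending_weight=3.0):
        """segments: list of (time_sec, sentiment) tuples in call order.
        Segments in the final ending_fraction of the call get a higher
        weight, since how a call ends matters most for the overall score."""
        if not segments:
            return 0.0
        cutoff = segments[-1][0] * (1.0 - ending_fraction)
        num = den = 0.0
        for t, s in segments:
            w = ending_weight if t >= cutoff else 1.0
            num += w * s
            den += w
        return num / den

    # A call that starts badly but ends well scores mildly positive overall.
    print(overall_sentiment([(10, -0.6), (60, -0.2), (110, 0.5), (115, 0.8)]))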

6. Talk-Listen Ratios, Overtalk and Other Incidents

When measuring the quality of an Agent-Caller conversation, there are a few important audio-related metrics that call centers track. These include talk-listen ratios, overtalk incidents, and excessive silence and hold time.
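
These metrics are straightforward to compute from per-channel speech segments. A sketch, assuming (start, end) speech intervals in seconds for each channel:

    def overlap(a, b):
        """Seconds of overlap between two (start, end) intervals."""
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    agent = [(0.0, 4.0), (10.0, 15.0)]   # agent speech segments
    caller = [(4.5, 9.5), (14.0, 20.0)]  # caller speech segments

    agent_talk = sum(end - start for start, end in agent)
    caller_talk = sum(end - start for start, end in caller)
    print("talk-listen ratio (agent:caller):",
          round(agent_talk / caller_talk, 2))

    # Overtalk: time when both channels are speaking at once.
    overtalk = sum(overlap(a, c) for a in agent for c in caller)
    print("overtalk seconds:", overtalk)

    # Silence: total call time when neither channel is speaking.
    call_end = max(agent[-1][1], caller[-1][1])
    silence, cursor = 0.0, 0.0
    for start, end in sorted(agent + caller):
        silence += max(0.0, start - cursor)
        cursor = max(cursor, end)
    silence += max(0.0, call_end - cursor)
    print("silence seconds:", silence)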

7. Other Optional LLM-Powered Features

There are other LLM-powered features, like computation of the QA Score and a summary of the conversation. However, these features are built by the developer of the AI Voice App by integrating the output of the Speech-to-Text API with the APIs offered by the LLM of the developer's choice.

Languages
Voicegain offers Spanish Speech-to-Text

Last week we announced that Spanish Speech-to-Text capability would be available from Voicegain in March. We are pleased to announce today that we completed training of the Spanish Neural Network Model earlier than expected, and Spanish Speech-to-Text was released last Saturday (2/20) as part of our Release 1.24.0.

We completed work on the Spanish model from start to finish in exactly 3 weeks - we started working on it February 3rd. Such fast progress was possible because of our extensive experience with customization of Neural Network models for speech recognition, and because we have developed advanced tools and proven techniques that make speech-to-text model development and training fast.

The recognition accuracy of the model depends on the type of speech audio. For most benchmark files, our Spanish model's accuracy is just a few percent behind that of the Google or Amazon recognizers. The advantages of our recognizer are the significantly lower price plus the ability to train customized acoustic models; custom models can achieve accuracy higher than that of Amazon or Google. We encourage you to use our Web Console and/or API to test real-life performance on your own data. BTW, we are focusing this speech-to-text model on Latin American Spanish.

Of course, the Voicegain platform offers other advantages too, like support for Edge (on-prem) deployments and an extensive API with many options for out-of-the-box integration into e.g. telephony environments.

Currently, the Speech-to-Text API is fully functional with the Spanish Model. Some of the Speech Analytics API functions are not yet available for Spanish, e.g. Named Entity Recognition and Sentiment/Mood detection.

Initially the Spanish Model is available only in the version that supports off-line transcription. A real-time version of the Model will be available in the near future.

To tell the API that you want to use the Spanish Acoustic Model, all you need to do is choose it in the Context settings. Spanish models have 'es' in the name, e.g. VoiceGain-ol-es:1. A minimal sketch of such a request is shown below (the exact settings field for the model choice is in the API reference):
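
    import requests

    # Select a Spanish acoustic model by name ('es' in the name); the exact
    # field for the model choice is in the Context settings documentation.
    payload = {
        "audio": {"source": {"fromUrl": "https://example.com/llamada.wav"}},
        "settings": {"asr": {"acousticModel": "VoiceGain-ol-es:1"}},
    }
    response = requests.post("https://api.voicegain.ai/v1/asr/transcribe/async",
                             json=payload,
                             headers={"Authorization": "Bearer <JWT>"})
    print(response.json())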

Telephony
Unique feature: RTP streaming support

The Voicegain speech-to-text platform has supported RTP streaming from the very beginning. One of our first applications, several years ago, was live transcription with the ffmpeg utility used to capture audio from a device and stream it to the Voicegain platform over RTP. Over time we added more robust protocols and RTP was rarely used. However, recently in one of our deployments we came across a use case where RTP streaming allowed our customer to integrate in a very straightforward way with a call-center telephony stack.

The Voicegain platform does support more advanced streaming protocols for call-center use, like SIPREC or SIP/RTP (SIP Invite). However, in this particular case we were able to stream from a Cisco CUBE directly to Voicegain using plain RTP. Upon receiving an incoming call, a script is triggered which uses HTTP to establish a new Voicegain transcription session. The session response returns the ip:port parameters of the RTP receiver specific to the session, and these are passed to the CUBE to establish a direct RTP connection.

RTP used like this provides no authentication or security, which makes it generally unsuitable for use over the Internet. However, in this particular use case our customer benefits from the fact that the entire Voicegain stack can be deployed on-prem. Because it is on the same isolated network as the CUBE, there are no issues with security and/or packet loss.

An example

You can visit our github to see a Python code example which shows how to establish the speech-to-text session, how to point the RTP sender to the receiver endpoint, and how to receive real-time transcription results via a websocket.

The command to establish the session is as simple as this (sketched below in Python; the field names are indicative, and the github example has the exact request):


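    import requests

    # Field names are indicative; the github example has the exact schema.
    payload = {
        "sessions": [{
            "asyncMode": "REAL-TIME",
            # Results will be pushed back over an ad-hoc websocket.
            "websocket": {"adHoc": True, "minimumDelay": 0},
        }],
        # Ask the platform to open an RTP receiver for this session.
        "audio": {"input": {"stream": {"protocol": "RTP"}}},
    }
    response = requests.post("https://api.voicegain.ai/v1/asr/transcribe/async",
                             json=payload,
                             headers={"Authorization": "Bearer <JWT>"})
    session = response.json()
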
The audio section defines the RTP streaming part, and the websocket section defines how the results will be sent back over a websocket.

The response looks like this (an illustrative shape - the actual response contains more fields):

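    # Illustrative response shape; the real response contains more fields.
    example_response = {
        "sessions": [{
            "sessionId": "0-abc123",
            "websocket": {"url": "wss://api.voicegain.ai/v1/ws/abc123"},
        }],
        "audio": {"stream": {"ip": "10.0.0.12", "port": 49170}},
    }

    # stream.ip and stream.port are handed to the RTP sender, e.g. ffmpeg:
    #   ffmpeg -re -i call.wav -f rtp rtp://10.0.0.12:49170
    ip = example_response["audio"]["stream"]["ip"]
    port = example_response["audio"]["stream"]["port"]
    print(f"point the RTP sender at {ip}:{port}")
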
In the github example, stream.ip and stream.port are passed to ffmpeg, which is used as the RTP streaming client. The example further illustrates how to process the messages with incremental transcription results sent in real time over the websocket.

Speech Analytics
Voicegain Speech Analytics API Generally Available

Voicegain has released its Speech Analytics (SA) API, which supports a variety of analytics tasks performed on audio or on the transcript of that audio. The features supported by the Voicegain SA API were chosen to support our main target use case, which is processing Call Center calls.


Things that Speech Analytics can do now (from release 1.22.0)

The current release supports offline Speech Analytics. The data that can be obtained through the Speech Analytics API is listed below.

Note that here we do not include things that can also be obtained from our Transcribe API, like the transcript, decibel values, audiozones, etc. These, however, will be accessible from the Speech Analytics API response.

Per channel analytics:

  • gender - likely gender of the speaker based on the voice characteristics. Currently either "male" or "female".
  • emotion - both totals over the entire call and a list of values computed at multiple places in the transcript (see the illustrative item after this list). Each item will contain: (1) sentiment - from -1.0 (mad/angry) to +1.0 (happy/satisfied); (2) mood - a map with estimated values (range 0.0 to 1.0) for the following moods: "neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"; (3) location - start and end in msec and the index of the word
  • Named Entities recognized in the call. This will be a list with the entity type and the location in the call. The supported NER values are: CARDINAL - numerals that do not fall under another type; DATE - absolute or relative dates or periods; EVENT - named hurricanes, battles, wars, sports events, etc.; FAC - buildings, airports, highways, bridges, etc.; GPE - countries, cities, states; NORP - nationalities or religious or political groups; MONEY - monetary values, including unit; ORDINAL - "first", "second", etc.; ORG - companies, agencies, institutions, etc.; PERCENT - percentage, including "%"; PERSON - people, including fictional; QUANTITY - measurements, as of weight or distance; TIME - times smaller than a day
  • keywords - list of keywords or keyword groups recognized in the call. Keywords to be recognized can easily be configured from examples.
  • profanity - this is essentially a predefined keyword group
  • talk metrics - things like maximum and average talk streak, talk rate, energy
  • overtalk metrics - overtalk happens when a speaker starts speaking while the other speaker is already speaking.
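
For illustration, a single emotion item with the fields described above might look like this (the field names are indicative, not the exact response schema):

    emotion_item = {
        "sentiment": -0.4,  # -1.0 (mad/angry) .. +1.0 (happy/satisfied)
        "mood": {"neutral": 0.20, "calm": 0.05, "happy": 0.00, "sad": 0.10,
                 "angry": 0.50, "fearful": 0.05, "disgust": 0.05,
                 "surprised": 0.05},
        "location": {"startMsec": 12300, "endMsec": 15800, "wordIndex": 42},
    }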

Global analytics:

  • silence metrics - defined as time when neither channel is speaking. Note: only the Agent is assumed to be in control of the speaking time. This is a simplification, but it is difficult to determine if any silence was caused by the caller and was unavoidable.
  • word cloud frequencies - smart word cloud data with stop words removed and word variations collapsed before computing frequencies

Speech Analytics features coming soon

Real-time Speech Analytics will be available in the near future. Soon we also plan to release Score Card support for Speech Analytics.

Per channel analytics coming soon:

  • Two additional named entities: CC - Credit Card, SSN - Social Security Number
  • age - estimated age of the speaker based on the voice characteristics. Three possible values: "young-adult", "senior", "unknown"
  • phrases - list of phrases or phrase groups recognized in the call. These are identified using NLU algorithms - essentially the same as used for identifying NLU intents. Phrases to be recognized can be configured from examples.
  • pitch statistics will be added to talk metrics

Additionally, we will soon support PII redaction of any named entity from either transcript or audio.

Supported audio types

Speech Analytics API supports the following types of audio input:

  • 2-channel (stereo) audio, as typically found in call centers, where the Caller's voice is recorded in one channel and the Agent's voice in the other. Some metrics, e.g. overtalk, can only be computed if the input audio is of this type.
  • 1-channel audio with two speakers - for this audio type, diarization will be performed to separate the two speakers. The per-channel analytics will be performed after diarization. Overtalk metrics are not available for this case.

You can see the API specification here.

ASR
Combining grammar-based and large vocabulary speech recognition

In this blog post we present a unique feature of the Voicegain speech-to-text platform that efficiently combines the use of grammars with large vocabulary models, giving developers the ability to achieve high recognition accuracy in a very efficient and convenient way.

Two Types of Speech Recognition

Speech recognition (ASR) systems can generally be divided into two types:


Large Vocabulary Continuous Speech Recognition

This type of recognizer is generally used for transcription, where the vocabulary is very broad and the length of the speech audio is unlimited (except for practical, e.g. resource-related, limits). Typical components and processing steps of such a system are illustrated below:

The working of such a system is as follows: (a) The audio signal is processed into features. (b) The features are fed into an acoustic model processor, which converts data from the acoustic realm to the text/linguistic realm or some other intermediate realm (e.g. audio embeddings). The output values may be phonemes, letters, word pieces, audio embeddings, etc., presented as vectors of probabilities. (c) These vectors are then passed to a search/optimization component, which uses the language model to decide which hypotheses formed from the output of the previous stage are most likely to be the correct textual interpretation of the input speech audio.


The Language Models used may take a variety of forms. Two of the many possible manifestations are: (a) ARPA language models, which are n-gram based, and (b) neural network language models, where a neural network (e.g., an RNN) is trained to represent a language model. Some language models can also incorporate a decoder, if the acoustic model output is encoded (e.g. represented by acoustic embeddings).


Because the vocabulary of this type of recognizer is large, it is prone to misrecognitions. This is particularly the case for short utterances that do not provide enough context for the language model to sufficiently constrain the hypotheses. An example would be misrecognizing "card" as "car" if that is the only word said and the speaker has a specific accent.


Cloud speech-to-text offerings from the big cloud providers - Google, Amazon, and Microsoft - are all examples of Large Vocabulary ASRs.


Grammar-Based Speech Recognition

In such a system, the Voice Bot/IVR developer uses a context-free grammar to define the set of possible utterances that can be recognized. Grammars are typically defined using the SRGS (Speech Recognition Grammar Specification) standard - either the ABNF or the GRXML form. Other grammar types in use are JSGF (JSpeech Grammar Format) and GSL (Nuance Grammar Specification Language).


Components and processing steps of a typical speech recognition system that uses such grammars are illustrated below:

In this system, the evaluation of the output from acoustic model processing is done by a search/optimizer that uses the rules contained in the grammar to decide which hypotheses are acceptable. Only utterances that can be generated from the grammar may be output.


If an utterance outside of the grammar is spoken, it may still be recognized, but with low confidence. If the confidence is below a set threshold, a NOMATCH will be returned.


The obvious disadvantage of such a recognizer is that it will not recognize utterances outside the scope of the grammar. Such utterances are called Out-of-Grammar utterances. The big advantage of this approach is that it is less prone to misrecognition when the spoken utterance has been anticipated and included in the grammar.


An additional advantage of using a grammar-based recognizer is that most grammars allow for the insertion of semantic tags, which let the grammar define not only an utterance but also the semantic interpretation of that utterance.


Examples of such grammar-based speech recognition systems would be speech-to-text offerings like Nuance ASR or Lumenvox ASR.


Combining grammar-based and large vocabulary recognition


Clearly, both types of speech recognition systems have advantages and disadvantages. It hence seems natural that a combination of the two could have the advantages of both while avoiding some of the disadvantages.


Approach using a combination of existing ASRs


A simple approach would be to combine two different speech recognition systems. One would need to create two speech recognition sessions and split the incoming audio stream so that each session is fed a copy of the incoming audio. The two sessions would process the audio separately and would output separate results that would then need to be combined. This is illustrated below:


Disadvantages of using two ASR sessions


The setup as presented above has several disadvantages:

  1. It introduces complexity in the streaming of the audio to the recognizer. An additional proxy-like component needs to be added that splits the audio stream and feeds it to two separate ASR systems.
  2. Combining the results also requires a new separate component. This is not necessarily trivial, because the different end-pointing of the two disconnected ASR systems means that the results will arrive at different times.
  3. Extra compute resources will be needed to support running two separate ASR systems instead of just one.
  4. Another disadvantage is having to pay double the license fee, as each ASR will need a separate session license.


Voicegain approach


The Voicegain platform provides a speech recognition system that combines both types of speech recognition to benefit from the advantages of both. Our system is illustrated in the figure below:

In this system, the processing up to the output of the acoustic model is essentially identical to the processing done in the systems depicted in the first two figures of this post. After that step, however, Voicegain includes a novel search/optimization module that uses both the grammar and the large vocabulary language model to generate the final recognition results. End-pointing is performed in a way similar to a grammar-based recognizer, as that seems to make the most sense for the use case (but this can be modified). The final recognition result will comprise the n-best results from the grammar-based recognition, if the grammar did MATCH, and one or more hypotheses from the large vocabulary recognition.


The application developer may make their own decisions as to how to use the recognition result. For example, the confidence value may be used to determine whether the grammar-based result or the large vocabulary result should be used at a given point in the application.
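
For example, a simple decision rule might look like the following sketch (the result field names are indicative):

    def pick_result(grammar_alts, transcribe_alts, threshold=0.75):
        """Prefer the grammar MATCH when its confidence clears the threshold
        (its semantic tags can drive the dialog directly); otherwise fall
        back to the large vocabulary hypothesis, e.g. for NLU intents."""
        if grammar_alts and grammar_alts[0]["confidence"] >= threshold:
            return "grammar", grammar_alts[0]
        return "large-vocabulary", transcribe_alts[0]

    kind, best = pick_result(
        [{"utterance": "yes", "semantics": "yes", "confidence": 0.91}],
        [{"utterance": "yes please", "confidence": 0.88}],
    )
    print(kind, best["utterance"])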


With Voicegain's release 1.22.0, this feature is Generally Available as part of our Recognize API.


An example request using our /asr/recognize/async API looks like the following sketch (field names are indicative; see the API reference for the exact schema):

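    import requests

    # Field names are indicative; see the Recognize API reference.
    payload = {
        "audio": {"source": {"fromUrl": "https://example.com/utterance.wav"}},
        "settings": {"asr": {"grammar": [
            # 1) a standard JSGF grammar with literal tag format semantics
            {"type": "JSGF",
             "jsgf": "#JSGF V1.0; grammar main; "
                     "public <main> = (yes | yeah) {yes} | (no | nope) {no};"},
            # 2) not a grammar, but a switch that turns on large vocabulary
            #    transcription for this session
            {"type": "BUILT-IN", "name": "transcribe"},
        ]}},
    }
    response = requests.post("https://api.voicegain.ai/v1/asr/recognize/async",
                             json=payload,
                             headers={"Authorization": "Bearer <JWT>"})
    print(response.json())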

As you can see, there is just one definition for the incoming audio stream. The grammar section of settings.asr contains two grammar definitions:

  • one is a standard JSGF grammar with literal tag format semantics,
  • the other is actually not a grammar but a command to turn on large vocabulary transcription for this session: {type: BUILT-IN, name: transcribe}

MRCP Use Case

In addition to being available in our STT API and Telephone Bot API, the ability to support both grammar-based and large vocabulary recognition at the same time is supported via the MRCP interface. For example, from VXML you can pass both a GRXML grammar and the builtin:speech/transcribe grammar, and you will receive both the GRXML result and the large vocabulary result.

If you are building an Intelligent Voice Assistant, Voice Bot, Speech IVR application or any other application that could benefit from this feature, please contact us (email info@voicegain.ai) to engage in a more in-depth discussion.


Voice Bot
Modernize your VoiceXML IVR into Conversational Voice Bots

The urgent need for IVR platform modernization

Most enterprise IT organizations have mature telephony-based IVR applications that serve as the "front door" for all voice-based customer support calls. These applications use a combination of touchtone (DTMF) and speech to interact with callers, and they have been carefully designed, developed and tuned over the years.


The objectives of any IVR are twofold: 1) automate simple routine queries (like balance inquiry, payment status, etc.), and 2) authenticate and intelligently route calls that require live support to the appropriate agent.


IT organizations across industry verticals like financial services, travel, media, telecom, retail and healthcare have a small staff of in-house or outsourced IVR developers to maintain these applications. While enterprises have been focused on scaling and upgrading their digital support channels (like chat and email), IVR applications have largely remained untouched for years.


As CIOs and CDOs (Chief Digital Officers) embark on strategic initiatives to migrate enterprise workloads to the cloud, one "niche" workload on this list is the IVR. However, migrating IVRs "as-is" to the cloud is tricky. The languages, protocols and platforms that these telephony-based IVRs were built on date from the early 2000s and are approaching obsolescence. And while they support directed dialogs with limited customer spoken utterances, they are not a good fit for conversational bot interactions.


So IT organizations are faced with a Catch-22. On one hand, it is cumbersome to maintain these IVR workloads. On the other hand, the rationale for migrating existing platforms "as-is" to modern cloud infrastructure is questionable. Why bear the trouble and expense if IVRs are eventually going to be replaced by conversational bots?


So there is a real need to modernize these IVRs as part of their cloud migration strategy.


A brief look at the underlying infrastructure of these IVR applications

Traditionally, speech IVR applications ran on on-premise contact center telephony platforms. Companies like Avaya, Nortel, Cisco, Intervoice, Genesys and Aspect dominated the vendor landscape. In the early-to-mid 2000s, these vendors worked collaboratively as part of the W3C consortium to develop VoiceXML, an open, vendor-agnostic language for speech-enabled IVR applications.


VoiceXML enabled developers to build interactive voice dialogs and provided a standard way to interact with an automatic speech recognizer (ASR), using a telephony-based protocol called MRCP. A companion standard called SRGS provided a method to define speech grammars, with GRXML as its XML format.


The architecture and supporting jargon/terminology around VoiceXML borrowed heavily from the web world. The VoiceXML platform was referred to as a "voice browser" that could "render VoiceXML pages" just like a web browser renders HTML pages. Most contact center platforms provided visual IDEs to help build and maintain these interactive call flows; some also automated the generation of the VoiceXML pages. The IDE generated code that could run on an application server (like Apache Tomcat), which in turn generated VoiceXML pages that were sent to a VoiceXML platform over standard HTTP. The application server was also responsible for making web-service requests to the enterprise database resources required for the IVR interaction, e.g. billing/payment systems or CRM systems.


Also, most ASRs from the late 90s and early 2000s were based on Hidden Markov Models and Gaussian Mixture Models. They mainly supported grammar-based recognition, which meant that as a speech IVR developer you had to anticipate all possible utterances a user could say in response to a question/prompt. There were some options to build open-ended statistical language models, but these were tricky and required careful selection of the training corpus.

Why modernize now?

While VoiceXML worked well in the past, it is a niche and outdated language. The last release, VoiceXML 2.1, was back in 2007!! That is more than a decade ago.

And a lot has changed in the web world since then. VoiceXML was developed at a time when JSP (Java Server Pages) was widely used - before JSON, YAML, RESTful APIs and AJAX.


For enterprises, it is expensive to maintain a dedicated staff - whether in-house or outsourced - with niche skills in technologies like VoiceXML and MRCP.


Enterprises should ideally be able to run an IVR app like any other modern web application. Most enterprise web apps are built in programming languages like Python and Node.js that are popular with web developers. They are containerized using Docker and orchestrated using Kubernetes.


It would be ideal for an enterprise IT organization if its IVR app were built on similar programming languages, so that it can be supported and maintained just like other applications in the IT portfolio.


In addition to the obsolescence of VoiceXML, the speech recognition engines (ASRs) deployed in the early 2000s have also become outdated. Modern speech-to-text engines are built on Deep Neural Networks that run on powerful GPU infrastructure. They offer amazing accuracy and allow the use of a very large vocabulary - which is what is needed for a bot-like conversational experience. Modern NLU engines also allow you to easily extract intents from the transcribed text.


So if an enterprise wants to offer a voice bot that supports an open conversational experience, it needs to move to a modern DNN-based Speech-to-Text platform that can integrate with such NLU engines.


Our recipe for IVR App modernization



At Voicegain, we recommend that an enterprise first modernize the underlying infrastructure while retaining the existing IVR application logic. This is a great first step. It allows an enterprise to continue serving existing users while taking a step towards providing a more conversational user experience.

How can an enterprise modernize its legacy IVR app?

We suggest that the existing call flow logic - typically maintained in the visual IDEs of contact center platforms - be rewritten (ideally with the help of automated tools) in a modern programming language like Python or Node.js.

Instead of generating legacy VoiceXML pages, enterprises should use web-friendly data representation languages like JSON or YAML to interact with modern RESTful Speech-to-Text APIs using web callbacks.
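
As an illustration of this pattern, the sketch below (the endpoint and JSON fields are hypothetical, not the actual Telephony Bot API schema) shows an app endpoint that receives a POSTed recognition result and answers with JSON describing the next dialog turn, instead of a VoiceXML page:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/ivr/callback", methods=["POST"])
    def next_turn():
        # The platform POSTs the caller's recognized utterance here.
        event = request.get_json()
        if "balance" in event.get("utterance", ""):
            return jsonify({"prompt": "Your balance is 42 dollars.",
                            "expectInput": False})
        return jsonify({"prompt": "How can I help you today?",
                        "expectInput": True})

    if __name__ == "__main__":
        app.run(port=8080)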

How does Voicegain support IVR app modernization?

At Voicegain, we provide a modern Voice AI platform that includes

  1. A modern DNN based speech recognizer accessible using RESTful APIs
  2. Ability to interface directly with telephone calls delivered over SIP/RTP
  3. JSON-style callback APIs to replace the functionality of a VoiceXML platform
  4. Ability to deploy on your VPC/Private Cloud or use as a cloud service
  5. Full feature compatibility with legacy standards (support for SRGS grammars, universals)
  6. Training of the underlying acoustic and language models to get high recognition accuracy

Voicegain is developing tools to automatically convert VoiceXML to an equivalent JSON/YAML representation that talks to our callback APIs.


How is this a "future proof" architecture for an enterprise?

The Voicegain platform is capable of large vocabulary transcription, which is a requirement for NLU-based Voice Bots. This will be the way customers interact with enterprises in the future.


We allow developers to switch between grammar-based recognition and large vocabulary recognition at each and every turn of the dialog; or you can use both simultaneously to achieve more flexibility.


Our Telephony Bot APIs can also integrate with Bot Frameworks like Google Dialogflow.


We are inviting enterprise web developers to try our platform with a free trial.






Enterprise
Why Voice AI is critical for enterprises in a post-Covid world

Digital Transformation efforts in most enterprises have only gained pace as a result of the pandemic. The maxim going around in corporate circles in 2020 (and very likely to continue in 2021) is that the coronavirus was the real Chief Digital Officer (CDO) for most enterprises!! CIOs, CTOs and CDOs today have stronger and bolder mandates to fundamentally alter the economics of their businesses.

They are increasingly being asked by their CEOs to make big bets and take on initiatives that can "materially" transform the underlying economics of their businesses.

A significant area of focus for digital enterprises is what is being referred to as "Practical AI": how businesses can use AI and ML in a practical yet fundamental manner to transform themselves. Enterprises in different industries - financial services, travel, telecommunications, media and retail - are realizing that investing in strong AI & ML capabilities in their teams is critical to their post-pandemic digital future. In many Fortune 1000 companies, businesses are 'insourcing' and aggressively hiring AI & ML teams to gain competitive advantage, even as they outsource maintenance of legacy back-end systems.

And one of the most practical AI applications in the enterprise is Voice AI - the use of AI & ML on voice conversations within the enterprise.

Why will Voice remain significant & relevant for the Enterprise?

Despite the proliferation of digital channels like chat/text messaging, email and social, higher-value sales conversations, meetings, and involved customer service discussions are conducted predominantly over voice. Speaking is not just more efficient than typing, it is also more engaging!! The human touch of voice is something that we as humans will always value. Voice is here to stay, and its enduring significance is as immutable as the laws of gravity!

So what is changing in the world of voice? The underlying plumbing is transforming. Voice conversations traditionally took place over legacy telephony networks; they are quickly moving to meeting platforms like Zoom, Microsoft Teams and Webex, so a voice-only conversation is being replaced by a richer voice & video conversation conducted over the internet.

The barriers historically associated with voice - the cost and complexity of voice infrastructure - have been eliminated by technologies like WebRTC, 4G/5G and cloud computing. For consumers, the cost of making a voice call is now effectively zero - just the cost of their WiFi or 4G/5G bandwidth (as consumers use free mobile apps like FaceTime, Skype and WhatsApp).

What is Voice AI? And why is it exciting?

Voice AI is highly accurate Speech-to-Text and NLU built on specialized, customizable (trainable) Deep Neural Networks running on GPUs.

What is unique about Deep Neural Networks is that the underlying Speech-to-Text and NLU models can be trained - easily and affordably - on enterprise-specific datasets. You can leverage the enterprise's lexicon and corpus - both voice & text. So instead of a 'one-size-fits-all' approach, each enterprise can have its own Voice AI infrastructure, trained on its product names, industry jargon, employee & customer names, unique accents, etc. Once it is trained, there are two big applications: 1) Voice AI for Automation and 2) Voice AI for Analytics.

Voice AI for Automation

Enterprises can build Voice Bots to intelligently respond to contact requests from their prospects and customers anytime, anywhere. Voice Bots may also be used to respond to internal employee queries in a service/help desk context. The automation use-case is one that has really accelerated during the pandemic: bots can help businesses deal with the massive disruption caused by everyone - in sales, customer success and service - working from home. McKinsey has written about automation using AI.

Voice AI for Analytics

Voice AI also makes it possible for businesses to transcribe 100% of their voice conversations and subsequently mine the text for sentiment and analytics/insights.

With Voice AI, businesses can ensure that their frontline sales staff pitch the core value proposition, benefits, and product and service features in a consistent and compelling manner. This can be a massive boost to sales teams, as they can improve conversion ratios and forecast pipeline more accurately.

Voice AI can also ensure that customer success and service personnel are provided with tailored/customized insights that improve not just their efficiency (metrics like AHT in the contact center) but also effectiveness measures like CSAT and NPS scores.

At Voicegain, we are passionate about helping enterprises, small and mid-size businesses, entrepreneurs and startups with their Voice AI efforts. Our mission is to build the world's most open, developer-friendly Voice AI platform. Be a part of our mission by signing up here. You can transcribe your calls/meetings, try out our APIs, build amazing telephony bots and more!

About the Author:

Arun Santhebennur is the Co-founder & CEO of Voicegain. To have a more in-depth conversation, please connect with Arun on LinkedIn or send us an email.
