• Arun Santhebennur

Why pay for speech recognition software when it is not listening?

Updated: Aug 12


Most companies that use contact centers to handle phone calls have on-premise IVR systems that act as a front end for all calls. While some IVRs still support only touchtone responses (i.e., callers press digits), most systems today allow callers to speak their responses. To provide this capability, traditional IVR systems integrate with automated speech recognition (ASR) software using a protocol called MRCP (Media Resource Control Protocol).


What does an IVR System do? And how does it do it?

An IVR system is designed with three main objectives: 1) identify and authenticate the caller, 2) automate the call (if it is a simple or routine inquiry), and 3) route the call to the right person or queue for live assistance. An IVR does this through its application logic, also referred to as the call flow of the IVR system. So how does this logic work? In other words, what is the “anatomy” of a call flow?


Any IVR interaction with the caller is fundamentally a repetition of the following 3 steps.

  1. The IVR system plays a question prompt

  2. The caller provides a response either using digits (touchtone) or speech

  3. The IVR application logic processes the response and determines the next prompt; the flow then loops back to Step 1 and repeats

The above steps are repeated until the caller either hangs up or gets forwarded to a live agent for further support.
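The three-step loop can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `CALL_FLOW` table, the prompts and the `run_call` function are hypothetical names invented for this example, and a list of strings stands in for whatever the touchtone or speech front end would actually return.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    prompt: str                       # question played in Step 1
    next_by_response: Dict[str, str]  # Step 3: caller response -> next node name


# Hypothetical two-node call flow; "agent" means transfer to a live agent.
CALL_FLOW = {
    "main": Node("Say 'billing' or 'agent'.",
                 {"billing": "billing", "agent": "agent"}),
    "billing": Node("Say 'balance', 'main menu' or 'agent'.",
                    {"balance": "billing", "main menu": "main", "agent": "agent"}),
}


def run_call(responses: List[str]) -> Tuple[List[Tuple[str, str]], str]:
    """Replay caller responses through the call flow.

    Returns the (prompt, response) turns and the final state:
    "agent" (transferred) or "hangup" (caller stopped responding).
    """
    node_name = "main"
    turns = []
    remaining = iter(responses)
    while node_name != "agent":
        node = CALL_FLOW[node_name]
        # Step 1: the IVR plays node.prompt.
        # Step 2: the caller responds (digits or speech); None models a hang-up.
        response = next(remaining, None)
        if response is None:
            return turns, "hangup"
        turns.append((node.prompt, response))
        # Step 3: pick the next prompt; an unrecognized response repeats the node.
        node_name = node.next_by_response.get(response, node_name)
    return turns, node_name


turns, outcome = run_call(["billing", "balance", "agent"])
print(len(turns), outcome)
```

Only the line that collects `response` corresponds to Step 2; everything else in the loop is prompt playback or application logic.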


When does an IVR actually use speech recognition software?

It is important to understand that the speech recognition software is actively engaged only in Step 2. In Step 1, speech recognition is not really required; what is needed is a speech detection component that listens to see whether the caller has “barged in” (i.e., interrupted the prompt). That component does not process caller utterances or convert them to text.


The time taken by Step 2 as a percentage of the overall call varies depending on the type of call. For the very common billing calls, Step 2 takes less than one-fifth of the duration of a call. In other words, the speech recognizer is actively used for less than 20% of the duration of a live call.
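As a rough illustration of how that share could be measured, the sketch below sums the time spent in each step over the turns of a call. The function name `asr_active_share` and the per-turn durations are made up for this example; they are not measurements from a real call.

```python
from typing import List, Tuple


def asr_active_share(turns: List[Tuple[float, float, float]]) -> float:
    """Fraction of call time spent in Step 2 (caller responding).

    Each turn is (prompt_seconds, response_seconds, processing_seconds),
    i.e. the time spent in Steps 1, 2 and 3 of that turn.
    """
    total = sum(prompt + response + processing
                for prompt, response, processing in turns)
    listening = sum(response for _, response, _ in turns)
    return listening / total


# Hypothetical billing call: long prompts, short spoken answers.
billing_call = [(12.0, 2.0, 1.0), (15.0, 3.0, 1.0), (20.0, 4.0, 2.0)]
print(f"Recognizer active for {asr_active_share(billing_call):.0%} of the call")
```

With these invented durations the recognizer is active for 9 of 60 seconds, which is in line with the "less than one-fifth" figure for billing calls.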


How is an on-premise IVR system sized and purchased?

The next topic to understand is how an on-premise IVR system is sized. The capacity of an IVR system is based on the maximum number of simultaneous calls that it can handle. The metric used to quantify that number is the port. The longer the IVR interaction and the higher the peak number of calls that the IVR system needs to handle, the more ports a business needs to provision and purchase. While we don’t go into the details of how this is calculated, it is done using what is referred to as the Erlang formula.
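For readers who want to see the shape of that calculation, here is a minimal sketch. It assumes the Erlang B variant of the formula, which is commonly used for trunk and port sizing; the article does not say which variant a given vendor applies. The function names, the 1% blocking target and the traffic figures are assumptions for illustration only.

```python
def erlang_b(traffic_erlangs: float, ports: int) -> float:
    """Blocking probability for the offered traffic on `ports` ports (Erlang B).

    Uses the standard recurrence B(0) = 1, B(n) = A*B(n-1) / (n + A*B(n-1)).
    """
    blocking = 1.0
    for n in range(1, ports + 1):
        blocking = traffic_erlangs * blocking / (n + traffic_erlangs * blocking)
    return blocking


def ports_needed(peak_calls_per_hour: float, avg_ivr_seconds: float,
                 max_blocking: float = 0.01) -> int:
    """Smallest port count that keeps blocking at or below `max_blocking`."""
    # Offered traffic in Erlangs = arrival rate x average holding time.
    traffic = peak_calls_per_hour * avg_ivr_seconds / 3600.0
    ports = 1
    while erlang_b(traffic, ports) > max_blocking:
        ports += 1
    return ports


# Hypothetical peak hour: 600 calls, 60 vs. 120 seconds in the IVR.
print(ports_needed(600, 60), ports_needed(600, 120))
```

Both inputs named in the paragraph above appear here: a longer IVR interaction or a higher peak call rate increases the offered traffic, and with it the number of ports.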


Why is pricing speech recognition software based on ports misleading?

What is really important to understand is that, just like the rest of the IVR system (i.e., telephony trunks and equipment), which is sized and licensed based on ports (or ‘peak’ call traffic), the speech recognition software is also marketed, licensed, and sold based on the number of ports. We have already shown that the ASR is actively used for less than 20% of the call interaction. Yet businesses do not see this explicitly, because the software is included in a bundle with the rest of the system, which is sold on a per-port basis. In fact, to provide a technical justification, many speech recognition vendors artificially tie up the port for the entire duration of the call.


Net-net, customers are paying for the use of the speech recognition software for the duration of the entire call - even when the recognizer is not actively listening.


What makes this worse?

There is yet another factor that compounds this problem. For most businesses, call arrival patterns in the contact center are highly seasonal. Some businesses, e.g., retail, have extremely high call volumes during the holiday season. Other businesses have to plan for high call volumes around key events, promotions, renewal seasons, etc.


As a result, most businesses invest significant dollars in maintaining an IVR infrastructure sized for peak capacity, while average use is much lower. In our experience, businesses purchase 2.5 to 3 times the number of IVR ports that they need on average. So, net-net, the utilization of an IVR port is between 33% and 40%. Because the recognizer is actively listening for less than 20% of each call, the utilization of a speech recognizer port is lower still, somewhere between 5% and 8%.
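The arithmetic behind these figures can be checked with a short sketch. The function name `utilization` is made up for this example, and the 20% active share is treated as an upper bound taken from the billing-call figure earlier in the article.

```python
def utilization(overprovision_factor: float, asr_active_share: float = 0.20):
    """Return (IVR port utilization, ASR port utilization).

    overprovision_factor: ports purchased / ports needed on average.
    asr_active_share: fraction of a call in which the recognizer is listening.
    """
    ivr_port = 1.0 / overprovision_factor
    asr_port = ivr_port * asr_active_share
    return ivr_port, asr_port


for factor in (2.5, 3.0):
    ivr, asr = utilization(factor)
    print(f"{factor}x ports purchased: IVR port {ivr:.0%}, ASR port {asr:.1%}")
```

At the 20% upper bound this gives roughly 6.7% to 8% for the recognizer port; an active share nearer 15% at 3x over-provisioning gives the 5% end of the range.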

Try Voicegain for free

The Voicegain Speech-to-Text platform offers a different licensing model for businesses that use the Voicegain MRCP ASR as a replacement for their on-premise ASR. It is based on minutes of actual usage, i.e., the time when the speech recognition software is actually listening and converting speech to text. The platform also has a separate voice activity detector, covered by a much lower session price, for the time when the platform listens for a barge-in but does not actively convert a caller utterance to text.


For a free trial, please click here. You can also read about some of our key differentiators here.


