Why pay for speech recognition software when it is not listening?
Updated: Jun 15
Most companies that run contact centers to answer phone calls have IVR systems that act as a front-end for all the calls. While some IVRs still only allow touchtone responses (where callers press digits), most systems today allow callers to speak their responses. To provide this capability, these IVR systems integrate with speech recognition software (also called automatic speech recognition, or ASR).
What does an IVR System do? And how does it do it?
An IVR system is designed with three main objectives: 1) identify and authenticate the caller, 2) automate the call (if it is a simple/routine inquiry), and 3) route the call to the right person for live assistance. An IVR accomplishes these objectives through its application logic, also referred to as the call flow. So what does this logic look like? In other words, what is the “anatomy” of a call flow?
Any IVR interaction with the caller is fundamentally a repetition of the following 3 steps.
1. The IVR system plays a question prompt
2. The caller provides a response either using digits (touchtone) or speech
3. The IVR application logic processes the response and determines the next prompt; go back to Step 1
The above steps are repeated until the caller either hangs up or gets forwarded to a live agent for further support.
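The three-step loop can be sketched in a few lines of Python. This is only an illustration of the prompt, response, logic cycle described above, not how any particular IVR platform is built; the `Step` class, the `run_call_flow` function, and the sample prompts are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class Step:
    """One node of a hypothetical call flow."""
    prompt: str
    # Maps the caller's response to the name of the next step.
    # Returning None means the call leaves the IVR (e.g. transfer to an agent).
    next_step: Callable[[str], Optional[str]]


def run_call_flow(flow: Dict[str, Step], start: str,
                  responses: Iterator[str]) -> List[str]:
    """Repeat prompt -> response -> logic until the flow ends or the caller hangs up."""
    transcript: List[str] = []
    current: Optional[str] = start
    while current is not None:
        step = flow[current]
        transcript.append(f"IVR: {step.prompt}")      # Step 1: play the prompt
        response = next(responses, None)              # Step 2: caller responds
        if response is None:                          # no response: caller hung up
            transcript.append("CALLER: <hangs up>")
            break
        transcript.append(f"CALLER: {response}")
        current = step.next_step(response)            # Step 3: pick the next prompt
    return transcript


# Illustrative two-step flow (assumed prompts, not from a real system).
flow = {
    "main": Step("Are you calling about billing or something else?",
                 lambda r: "account" if r == "billing" else None),
    "account": Step("Please say or enter your account number.",
                    lambda r: None),
}

transcript = run_call_flow(flow, "main", iter(["billing", "12345"]))
for line in transcript:
    print(line)
```

Note that the recognizer would only be needed inside Step 2 of this loop; the prompt playback and the routing logic run without it.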
When does an IVR use Speech recognition software?
It is really important to understand that the speech recognition software is actively engaged only in Step 2. Even in Step 1, speech recognition is not required; rather, a speech detection component is only listening to determine if the caller has “barged-in” (or interrupted the prompt). It is actually not processing caller utterances and converting them to text.
The percentage of time taken by Step 2 varies with the type of call. For the very common billing calls, Step 2 accounts for less than one-fifth of the duration of a call. In other words, the speech recognizer is actively being used for less than 20% of the duration of a live call.
How is an IVR system sized and purchased?
The next topic to understand is how an IVR system is sized. The capacity of an IVR system is based on the maximum number of simultaneous calls that it can handle. The metric used to quantify that number is a port. The longer the IVR interaction and the higher the peak number of calls that an IVR system needs to handle, the more ports a business needs to provision and purchase. While we don’t go into the details of how this is calculated, it is done using what is referred to as the Erlang formula.
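For readers who want a feel for the calculation, here is a minimal sketch. The article only says "the Erlang formula"; this example assumes the Erlang B variant (callers who find no free port are blocked rather than queued), which is a common choice for sizing trunks and ports. The function names, the 1% blocking target, and the sample traffic figures are illustrative assumptions.

```python
def erlang_b(offered_load: float, ports: int) -> float:
    """Blocking probability for `ports` ports at `offered_load` erlangs (Erlang B).

    Uses the standard recurrence B(A, 0) = 1,
    B(A, n) = A * B(A, n-1) / (n + A * B(A, n-1)).
    """
    blocking = 1.0
    for n in range(1, ports + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def ports_needed(peak_calls_per_hour: float, avg_ivr_seconds: float,
                 max_blocking: float = 0.01) -> int:
    """Smallest port count whose blocking probability meets the target."""
    # Offered load in erlangs = arrival rate x average time a call holds a port.
    offered_load = peak_calls_per_hour * avg_ivr_seconds / 3600.0
    ports = 1
    while erlang_b(offered_load, ports) > max_blocking:
        ports += 1
    return ports


# Assumed example traffic: 600 calls in the peak hour.
short_calls = ports_needed(600, avg_ivr_seconds=60)
long_calls = ports_needed(600, avg_ivr_seconds=120)
print(short_calls, long_calls)
```

Doubling the average IVR interaction time doubles the offered load, so the second figure comes out higher than the first, which matches the point that longer interactions and higher peaks both drive up the port count.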
Why is pricing speech recognition software based on ports so misleading?
What is really important to understand is that the rest of the IVR system (i.e., telephony trunks and equipment) is sized and licensed based on ports (or ‘peak’ call traffic), and the speech recognition software is, very misleadingly, marketed, licensed and sold based on the number of ports as well. We have already shown that the speech-to-text software is actively used for less than 20% of the call interaction. Yet businesses do not notice this, because the software is included in a bundle with the rest of the system, which is sold on a per-port basis. In fact, to provide a technical justification, many speech recognition vendors artificially tie up the port over the entire duration of the call.
Net-net, customers are paying for the use of the speech recognition software for the duration of the entire call - even when the recognizer is not actively listening.
What makes this worse?
There is yet another factor that compounds this problem further. There is a lot of seasonality in the call arrival pattern in contact centers for most businesses. Some businesses, e.g., retail, have extremely high call volumes during the holiday seasons. Other businesses have to plan for high call volumes around key events, promotions, a renewal season etc.
As a result, most businesses invest significant dollars in maintaining an infrastructure for peak capacity while the average use is much lower. In our experience, businesses purchase 2.5 to 3 times the number of IVR ports that they need on average. So net-net, the utilization of an IVR port is between 33% and 40%. Because the recognizer is active for less than 20% of each call, the utilization of a speech recognizer port is lower still, somewhere between 5% and 8%.
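The arithmetic behind these figures is simple enough to check. The sketch below assumes that port utilization is just the inverse of the over-provisioning factor and that recognizer utilization is that figure multiplied by the share of the call in which the recognizer is active; the function names are made up for this example.

```python
def port_utilization(overprovision_factor: float) -> float:
    """Average share of purchased IVR ports that is actually in use."""
    return 1.0 / overprovision_factor


def recognizer_utilization(overprovision_factor: float,
                           active_fraction: float) -> float:
    """Average share of purchased recognizer capacity that is actively listening."""
    return active_fraction / overprovision_factor


# 2.5x and 3x over-provisioning, with the recognizer active for 20% of a call
# (the upper bound given above).
for factor in (2.5, 3.0):
    print(f"{factor}x ports: IVR port {port_utilization(factor):.1%}, "
          f"recognizer {recognizer_utilization(factor, 0.20):.1%}")
```

At the 20% upper bound this gives roughly 8% and 6.7% for the recognizer. The 5% lower end corresponds to a smaller active share; for instance, an assumed 15% active share at 3x over-provisioning works out to 5%.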
Try Voicegain for free
The Voicegain Speech-to-Text platform offers a different licensing model for businesses. It is based on minutes of actual usage, i.e., when the speech recognition software is actually listening and converting speech to text. The platform has a separate voice activity detector, included in the much lower session price, for when the platform listens for a barge-in and does not actively convert a caller utterance to text.