Streaming real-time audio to Voicegain Speech-to-Text
Many applications of speech-to-text (STT) require that the conversion from audio to text happen in real time. These applications could be voice bots, speech IVR, live captioning of events or talks, real-time transcription of meetings, real-time speech analytics of sales calls, or live agent assistance in a contact center.
An important question for developers looking to integrate real-time STT into their apps is the choice of the protocol and/or mechanism to stream real-time audio to the STT platform. While some STT vendors offer just one method, at Voicegain we offer multiple choices that developers can select from. In this post, we explore all these methods in detail so that a developer can choose the right one for their specific use case.
Some of the factors that may guide the specific choice are:
How the audio stream is made available to the app - your application may already be receiving the audio stream in a particular manner and format.
The type of application and its requirements for latency and network resiliency.
Related to the above - the quality of the network between the app and the STT platform.
At Voicegain we currently offer five different methods/protocols to support streaming to our STT platform. The first three are TCP based methods and the last two methods are UDP based.
TCP based methods are generally a good choice when the network is reliable, for example, when there is MPLS connectivity between the app and the STT platform, or when the app is co-located with the STT platform in the same cloud provider or datacenter.
UDP based methods might be a better choice if the application is using the public internet to connect to the STT platform.
1. WebSockets
Using WebSockets is a simple and popular option to stream audio to Voicegain for speech recognition. WebSockets have been around for a while and most web programming languages have libraries that support them. This option may be the easiest way to get started. The Voicegain API uses binary WebSockets, and we have some simple examples to get you started.
However, there is a drawback. WebSockets were not inherently designed to stream audio efficiently and reliably. They were designed for text messages, where upon any issue a WebSocket could easily be restarted. If you are using WebSockets over a network (say, the internet) that is subject to congestion, your app may have trouble maintaining a consistently good connection. In our experience, WebSockets work fine for short audio such as brief notes, commands, etc.
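As a rough illustration of what streaming over a binary WebSocket involves on the client side, the sketch below splits raw PCM audio into fixed-duration frames and hands each one to a send function at roughly real-time pace. The function names, the 16 kHz / 16-bit mono format, and the 20 ms frame length are assumptions made for illustration, not Voicegain requirements; `send` stands in for whatever binary-send method your WebSocket library provides.

```python
import time
from typing import Callable, Iterator


def frame_size_bytes(sample_rate_hz: int, bytes_per_sample: int,
                     channels: int, frame_ms: int) -> int:
    """Number of bytes in one frame of uncompressed PCM audio."""
    return sample_rate_hz * bytes_per_sample * channels * frame_ms // 1000


def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


def stream_audio(pcm: bytes, send: Callable[[bytes], None],
                 frame_ms: int = 20, sample_rate_hz: int = 16000,
                 realtime: bool = True) -> int:
    """Send 16-bit mono PCM frame by frame; return the number of frames sent."""
    size = frame_size_bytes(sample_rate_hz, 2, 1, frame_ms)
    sent = 0
    for frame in iter_frames(pcm, size):
        send(frame)  # e.g. a WebSocket client's binary send method (assumed)
        sent += 1
        if realtime:
            time.sleep(frame_ms / 1000)  # pace the frames like a live source
    return sent
```

With these assumed settings each frame is 640 bytes, so 1500 bytes of audio would go out as two full frames followed by a 220-byte remainder.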
2. HTTP 1.1 with Chunked transfer encoding
Voicegain also supports streaming over HTTP 1.1 using chunked transfer encoding. This allows you to send raw audio data of unknown size, which is generally the case for streaming audio. Voicegain supports both pull and push scenarios - we can fetch the audio from a URL that you provide, or the application can submit the audio to a URL that we provide. To use this method, your programming language should have libraries that support chunked transfer encoding over HTTP; some of the older or simpler HTTP libraries do not support it.
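To make the mechanism concrete, here is a minimal sketch of the wire framing that HTTP 1.1 chunked transfer encoding applies to a body of unknown length: each chunk is preceded by its size in hexadecimal and the body ends with a zero-length chunk. The function name `encode_chunked` is ours, and the sketch shows only the body framing, not the request headers or anything specific to the Voicegain API.

```python
from typing import Iterable, Iterator


def encode_chunked(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Wrap each piece of audio in HTTP/1.1 chunked transfer framing."""
    for chunk in chunks:
        if not chunk:
            continue  # a zero-length chunk would end the body early
        # <size in hex> CRLF <data> CRLF
        yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    yield b"0\r\n\r\n"  # last-chunk marker, no trailers
```

Many HTTP client libraries apply this framing automatically when they are given a streaming body, so in practice you rarely need to write it by hand.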
3. gRPC
gRPC builds on top of the HTTP/2 protocol, which was designed to support long-running bi-directional connections. Moreover, gRPC uses Protocol Buffers, which are a more efficient data serialization format than the JSON that is commonly used in RESTful HTTP APIs. Both these aspects of gRPC allow audio data to be efficiently sent over the same connection that is also used for sending commands and receiving results.
With gRPC, client-side libraries can easily be generated for multiple languages, like Java, C#, C++, Go, Python, Node.js, etc. The generated client code contains stubs for use by gRPC clients to call the methods defined by the service.
Using gRPC, clients can invoke the Voicegain STT APIs like a local object whose methods expose the APIs. This method is a fast, efficient, and low-latency way to stream audio to Voicegain and receive recognition responses. The responses are sent over the same connection back from the server to client - this removes the need for polling or callbacks to get the results when using HTTP.
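In Python, generated gRPC stubs for streaming methods accept an iterator of request messages, so the client side of a streaming call is often written as a generator. The sketch below shows that shape only. `StreamRequest` is a hypothetical stand-in for a generated protobuf message, and the "configuration first, then audio" ordering is an assumption for illustration, not the documented Voicegain message layout.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StreamRequest:
    """Hypothetical stand-in for a generated protobuf request message."""
    config: Optional[dict] = None  # assumed: sent once, in the first message
    audio: Optional[bytes] = None  # assumed: sent in every later message


def request_stream(config: dict,
                   frames: Iterable[bytes]) -> Iterator[StreamRequest]:
    """Yield one configuration message followed by one message per audio frame."""
    yield StreamRequest(config=config)
    for frame in frames:
        yield StreamRequest(audio=frame)
```

A generator like this would be passed to the streaming method on the generated stub, and the responses would be read from the iterator that call returns.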
gRPC is great when used from back-end code or from Android. It is not a plug-and-play solution when used from web browsers; using it there requires some extra steps.
4. RTP protocol with Voicegain extensions
The first three methods described above are TCP based methods. They work great for audio streaming as long as the connection has no or minimal packet loss. Packet loss causes significant delays and jitter in TCP connections. This may be fine if audio does not have to be processed in true real time and can be buffered. If real-time behavior is important and the network is known to be unreliable, UDP is preferred to TCP for audio streaming. With UDP, packet loss will manifest itself as audio dropouts, but that may be preferable to the excessive pauses and jitter seen with TCP.
RTP is a standard protocol for audio streaming over UDP. However, RTP by itself is generally not sufficient and is normally used with the accompanying RTP Control Protocol (RTCP). Voicegain has implemented its own variation of RTCP that can be used to control RTP audio streams sent to the recognizer.
Currently, the only way to stream audio using RTP to the Voicegain platform is to use our proprietary Audio Sender Java library. We also provide an Audio Sender Daemon that is capable of reading data directly from audio devices and streaming it to Voicegain for real-time transcription.
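The trade-off can be seen in a toy timeline. The sketch below is a deliberately simplified model, not a network simulator: it assumes a single lost packet, a fixed one-way delay, and a fixed retransmission delay, and it ignores congestion control. All of the numbers used with it are made up for illustration.

```python
from typing import List, Optional


def udp_delivery_ms(n: int, interval_ms: int, delay_ms: int,
                    lost: int) -> List[Optional[int]]:
    """Arrival time of each packet over UDP; None marks the dropped packet."""
    return [None if i == lost else i * interval_ms + delay_ms
            for i in range(n)]


def tcp_delivery_ms(n: int, interval_ms: int, delay_ms: int,
                    lost: int, retransmit_after_ms: int) -> List[int]:
    """Time each packet reaches the application over TCP.

    TCP delivers bytes in order, so packets queued behind the lost one
    wait until its retransmission arrives.
    """
    times = []
    ready = 0  # earliest time the next in-order packet can be delivered
    for i in range(n):
        arrival = i * interval_ms + delay_ms
        if i == lost:
            arrival += retransmit_after_ms  # assumed fixed recovery delay
        ready = max(ready, arrival)
        times.append(ready)
    return times
```

For five packets sent 20 ms apart with a 50 ms one-way delay, losing the second packet gives UDP arrivals of 50, (missing), 90, 110, and 130 ms. With an assumed 200 ms retransmission delay, the TCP application sees the first packet at 50 ms and the remaining four all at 270 ms: one short dropout in the first case, a long stall followed by a burst in the second.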
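For readers unfamiliar with RTP, the sketch below builds the standard 12-byte fixed header defined in RFC 3550 and prepends it to each audio payload. It illustrates the standard packet layout only and does not implement Voicegain's control-protocol variation. The payload type 0 (PCMU, 8 kHz) and the 160 samples per packet (20 ms) are assumptions chosen for the example.

```python
import struct
from typing import Iterable, Iterator


def rtp_header(seq: int, timestamp: int, ssrc: int,
               payload_type: int = 0, marker: bool = False) -> bytes:
    """Build the 12-byte fixed RTP header (RFC 3550), no CSRCs or extension."""
    byte0 = 2 << 6  # version 2, padding=0, extension=0, CSRC count=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF,            # 16-bit sequence number
                       timestamp & 0xFFFFFFFF,  # 32-bit media timestamp
                       ssrc & 0xFFFFFFFF)       # 32-bit stream identifier


def rtp_packets(frames: Iterable[bytes], ssrc: int,
                samples_per_frame: int = 160) -> Iterator[bytes]:
    """Yield RTP packets for a sequence of audio payloads.

    Sequence numbers and timestamps start at 0 here for readability;
    real senders should start both from random values.
    """
    for i, payload in enumerate(frames):
        header = rtp_header(i, i * samples_per_frame, ssrc, marker=(i == 0))
        yield header + payload
```

Each packet produced this way would be sent as a single UDP datagram. The sequence number lets the receiver detect loss and reordering, and the timestamp lets it place the audio correctly on the timeline even when packets are missing.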
5. SIP/RTP
If you are looking to invoke speech-to-text as part of a telephony application, e.g. in a contact center, Voicegain offers RTC Callback APIs. You can read more about them here. With these APIs you can invite the Voicegain platform into a SIP session and start sending audio to the Voicegain Speech-to-Text engine. Once the audio stream is established, you can issue recognize requests and receive recognition responses using our HTTP callback API. You can write the logic of your application in any programming language of your choice - all that is needed is the ability to handle HTTP requests and send responses.
The Voicegain platform in this scenario essentially acts as a 'mouth' and an 'ear' to the entire conversation, which happens over SIP/RTP. The application can invoke JSON commands over HTTP that play prompts and convert caller speech into text through the entire duration of the call over a single session. You can also record the entire conversation if the call is transferred to a live agent and transcribe it into text.
Additional protocols coming soon
Voicegain is looking to add support for WebRTC soon.