Jacek Jarmulak

Two-Channel Support for Twilio Media Streams


The Voicegain Speech-to-Text platform has supported many Twilio features for a while, including:

  • <Connect> <Stream> for speech-enabled IVR / Voicebot applications

  • SIP INVITE - for integrating the Voicegain Callback API into Twilio-originated calls - also mainly focused on IVR / Voicebot applications

  • SIPREC - for either real-time speech-to-text or offline speech-to-text and speech analytics

  • plain media <Stream> - but so far only in 1-channel applications, with a focus on offering an alternative to <Gather>

Release 1.26.0 of the Voicegain platform adds full 2-channel support for Twilio Media Streams. This enables real-time transcription of the inbound and outbound channels at the same time.


How does it work?


The Twilio <Stream> command takes a websocket URL parameter as the target to which the selected channels are streamed, for example:


<Response>
    <Say voice="woman" language="en-US">connected</Say>
    <Start>
        <Stream name="Test-stream-001" url="wss://api.ascalon.ai/v1/0/plain/0-0kmatf40e06m2e0fmcgkizs7sswr" track="both_tracks" />
    </Start>
    <Say voice="woman" language="en-US">start talking</Say>
</Response>
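Because the websocket URL is different for every session, the TwiML usually has to be generated rather than hard-coded. The following is a minimal sketch of one way to do that with the Python standard library. The helper name build_stream_twiml and its parameters are hypothetical and are not taken from the Voicegain or Twilio libraries.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(ws_url: str, stream_name: str = "Test-stream-001") -> str:
    """Return TwiML that streams both call tracks to ws_url.

    Hypothetical helper: mirrors the hand-written TwiML example,
    with the websocket URL passed in at call time.
    """
    response = ET.Element("Response")

    say_connected = ET.SubElement(
        response, "Say", {"voice": "woman", "language": "en-US"}
    )
    say_connected.text = "connected"

    # <Start><Stream .../></Start> forks the audio without blocking the call flow.
    start = ET.SubElement(response, "Start")
    ET.SubElement(
        start,
        "Stream",
        {"name": stream_name, "url": ws_url, "track": "both_tracks"},
    )

    say_talk = ET.SubElement(
        response, "Say", {"voice": "woman", "language": "en-US"}
    )
    say_talk.text = "start talking"

    # ElementTree escapes attribute values, so the URL cannot break the markup.
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    print(build_stream_twiml("wss://example.invalid/placeholder"))
```

Building the document with an XML library rather than string formatting guarantees matching open and close tags, which is easy to get wrong by hand.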

The wss URL can be obtained by starting a new Voicegain real-time transcription session using the https://api.voicegain.ai/v1/asr/transcribe/async API. The sessions part of the request may look like this (notice that two sessions are started, and each is fed a different channel, left or right, of the audio stream):


  "sessions": [
    {
      "asyncMode": "REAL-TIME",
      "audioChannelSelector": "left",
      "websocket": {
        "adHoc": true,
        "useSTOMP": false,
        "minimumDelay": 0
      },
      "content": {
        "incremental": ["words"],
        "full": []
      }
    },
    {
      "asyncMode": "REAL-TIME",
      "audioChannelSelector": "right",
      "websocket": {
        "adHoc": true,
        "useSTOMP": false,
        "minimumDelay": 0
      },
      "content": {
        "incremental": ["words"],
        "full": []
      }
    }
  ],

We also need to tell Voicegain to expect stereo input in the TWIML protocol:


  "audio": {
    "source": { "stream": { "protocol": "TWIML" } },
    "format": "PCMU",
    "channels" : "stereo",
    "rate": 8000,
    "capture": true
  },

Notice that we can also enable audio capture, which gives us a stereo recording of the call once the session is complete.
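The two fragments above can be assembled into a single request body. The sketch below only builds and prints that body, using the field names shown in this post. The function names are hypothetical, the flags are written as Python booleans (an assumption about the type the API expects), and the HTTP POST and its authentication are deliberately left out.

```python
import json


def build_session(channel: str) -> dict:
    """One real-time session fed by a single channel ("left" or "right")."""
    return {
        "asyncMode": "REAL-TIME",
        "audioChannelSelector": channel,
        # Assumption: boolean flags are sent as JSON true/false.
        "websocket": {"adHoc": True, "useSTOMP": False, "minimumDelay": 0},
        "content": {"incremental": ["words"], "full": []},
    }


def build_transcribe_request(capture: bool = True) -> dict:
    """Body for the /asr/transcribe/async request described in this post."""
    return {
        # Two sessions, so each side of the call is transcribed separately.
        "sessions": [build_session("left"), build_session("right")],
        "audio": {
            "source": {"stream": {"protocol": "TWIML"}},
            "format": "PCMU",
            "channels": "stereo",
            "rate": 8000,
            # When enabled, a stereo recording is kept after the session ends.
            "capture": capture,
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_transcribe_request(), indent=2))
```

Generating both sessions from one helper keeps the left and right configurations identical except for the channel selector.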


The response to the Voicegain session start request contains 3 websocket URLs:

  • one for the inbound audio - this one we pass to Twilio TwiML <Stream> command

  • two for receiving transcription results in real time, one per session - individual messages look like, e.g., {"utt": "one", "conf": 0.4047, "start": 440}
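Since results for the two channels arrive on separate websockets, a client typically tags each message with its channel and then interleaves them. The sketch below does that for messages of the shape shown above. The Word class and both function names are hypothetical, and it assumes that "start" is an offset (probably in milliseconds) measured on the same clock for both channels.

```python
import json
from dataclasses import dataclass


@dataclass
class Word:
    channel: str       # "left" or "right", as chosen in the session request
    text: str          # the "utt" field
    confidence: float  # the "conf" field
    start: int         # the "start" field; unit assumed to be milliseconds


def parse_word_message(raw: str, channel: str) -> Word:
    """Parse one result message such as {"utt": "one", "conf": 0.4047, "start": 440}."""
    msg = json.loads(raw)
    return Word(
        channel=channel,
        text=msg["utt"],
        confidence=float(msg["conf"]),
        start=int(msg["start"]),
    )


def merge_by_start(*streams):
    """Interleave words from several channels in order of their start offset."""
    return sorted((w for s in streams for w in s), key=lambda w: w.start)


if __name__ == "__main__":
    left = [parse_word_message('{"utt": "one", "conf": 0.4047, "start": 440}', "left")]
    right = [parse_word_message('{"utt": "hello", "conf": 0.9, "start": 120}', "right")]
    for word in merge_by_start(left, right):
        print(f"{word.start:>6} {word.channel:<5} {word.text}")
```

In a live application the two lists would be filled by two websocket readers, one per results URL. The second message in the usage example is made up for illustration.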

Example code

On our GitHub we provide example Python code that starts a simple outbound Twilio phone call and then transcribes both the inbound and outbound audio in real time.


The sample code illustrates outbound calling, which is somewhat simpler because no callbacks are involved. For an inbound call, the request to Voicegain would have to be made from your Twilio callback function that gets invoked when a new call comes in; otherwise, the rest of the code would be very similar to our GitHub example.


Use Cases

Some of these are already listed on the Twilio Media Streams page:

  • real-time transcription

  • NLU - e.g. detect and respond to events during the call

  • automated Knowledge-Base lookup

  • sentiment analysis - use the transcribed text to determine sentiment during the call

Coming Soon

We will be testing the <Stream> functionality with the LaML command language provided by the SignalWire platform, which is very similar to Twilio TwiML. We will update our blog with the results of those tests.


We are also working on a real-time version of our Speech Analytics API. Once it is complete, all Speech Analytics functionality will be available in real time to users of the Twilio and SignalWire platforms.
