Voice Bot

How to build a Voicebot using Voicegain, Twilio, RASA, and AWS Lambda

You can find the complete code (minus the RASA logic, which you will have to supply yourself) in our GitHub repository.

What does it do?

The setup allows you to call a phone number and then interact with a Voicebot that uses RASA as the dialog logic engine.

How does it work?

The Components

  • Twilio Programmable Voice - We configure a Twilio phone number to point to a TwiML App that has the AWS Lambda function as the callback URL.
  • AWS Lambda function - a single Node.js function with an API Gateway trigger (simple HTTP API type).
  • Voicegain STT API - we use the /asr/transcribe/async API with audio input via a websocket stream and output via a callback. The callback goes to the same AWS Lambda function, but the Voicegain callback is a POST while the Twilio callback is a GET.
  • RASA - dialog logic is provided by a RASA NLU dialog server, which is accessible over the RestInput API.
  • AWS S3 for storing the transcription results at each dialog turn.
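
Because Twilio and Voicegain both call the same Lambda function, the handler has to tell the two callbacks apart by HTTP method. The following is a minimal sketch of that dispatch, assuming the API Gateway HTTP API payload format 2.0, where the method is found at `event.requestContext.http.method`. The names `routeRequest`, `handleTwilioTurn`, and `handleVoicegainResult` are hypothetical placeholders and are not taken from the repository.

```javascript
// Sketch only: routes an API Gateway HTTP API (payload format 2.0) event
// by HTTP method. Both handlers are hypothetical stubs.

function handleTwilioTurn(event) {
  // Twilio calls back with GET; a real handler would return TwiML for the next turn.
  return {
    statusCode: 200,
    headers: { "Content-Type": "text/xml" },
    body: "<Response/>",
  };
}

function handleVoicegainResult(event) {
  // Voicegain calls back with POST; a real handler would store the transcript.
  return { statusCode: 200, body: "" };
}

function routeRequest(event) {
  const method = event?.requestContext?.http?.method;
  if (method === "GET") return handleTwilioTurn(event);
  if (method === "POST") return handleVoicegainResult(event);
  return { statusCode: 405, body: "Method Not Allowed" };
}

// Export this function as the Lambda handler in a real deployment.
const handler = async (event) => routeRequest(event);
```

Keeping the routing in a plain function such as `routeRequest` makes it possible to exercise the dispatch locally with hand-built event objects before deploying.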

November 2021 Update: We do not recommend S3 and AWS Lambda for a production setup. A more up-to-date review of the various options for building a Voice Bot is available here. You should consider replacing the functionality of S3 and AWS Lambda with a web server that is able to maintain state, such as one built with Node.js or Python Flask.
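
To illustrate what maintaining state in the web server could look like, here is a minimal sketch of an in-memory store for per-call transcription results, taking over the role that S3 plays in the Lambda setup. The class name `TurnStore`, its methods, and the idea of keying by a session id are assumptions made for this example; a real server would also need expiry and cleanup for abandoned calls.

```javascript
// Sketch only: holds the latest transcript per call in process memory.
class TurnStore {
  constructor() {
    this.results = new Map(); // sessionId -> transcript text
  }

  // Called when the ASR callback arrives with a transcription result.
  save(sessionId, transcript) {
    this.results.set(sessionId, transcript);
  }

  // Called on the next telephony callback. Returns the transcript once and
  // removes it, so a stale result is never reused on a later turn.
  take(sessionId) {
    const transcript = this.results.get(sessionId);
    this.results.delete(sessionId);
    return transcript;
  }
}
```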

The Steps

The sequence diagram is provided below. Basically, the sequence of operations is as follows:

  1. Call a Twilio phone number
  2. Twilio makes an initial callback to the Lambda function
  3. Lambda function sends "Hi" to RASA and RASA responds with the initial dialog prompt
  4. Lambda function calls Voicegain to start an async transcription session. Voicegain responds with the URL of a websocket for audio streaming
  5. Lambda function responds to Twilio with a TwiML command <Connect><Stream> to open a Media Stream to Voicegain. The command also contains the text of the question prompt.
  6. Voicegain uses TTS to generate an audio prompt from the text of the RASA question and streams it via websocket to Twilio for playback
  7. The Caller hears the prompt and says something in response
  8. Twilio streams caller audio to Voicegain ASR for speech recognition
  9. Voicegain ASR transcribes the speech to text and makes a callback to the Lambda function with the transcription result
  10. Lambda function stores the transcription result in S3
  11. Voicegain closes the websocket session with Twilio
  12. Twilio notices the end of the session with ASR and makes a callback to the Lambda function to find out what to do next
  13. Lambda function retrieves the recognition result from S3 and passes it to RASA
  14. RASA processes the answer and generates the next question in the dialog
  15. The next turn continues from step 4
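
Steps 3 and 5, which repeat on every later turn, can be sketched as a few pure helper functions. This is an illustration under stated assumptions, not the code from the repository. It assumes RASA's REST channel, which accepts a JSON body with `sender` and `message` fields at `/webhooks/rest/webhook` and returns an array of messages that may carry a `text` field. It also assumes the prompt text is passed along in a Twilio `<Parameter>` element named `prompt`; that parameter name and all helper names are hypothetical.

```javascript
// Sketch only: helper functions for one dialog turn.

// JSON body for RASA's REST channel (POST /webhooks/rest/webhook).
function buildRasaRequest(sender, message) {
  return JSON.stringify({ sender, message });
}

// RASA may reply with several messages; join their text into one prompt.
function joinRasaReplies(replies) {
  return replies
    .filter((r) => typeof r.text === "string")
    .map((r) => r.text)
    .join(" ");
}

// Escape characters that are special inside XML attribute values.
function escapeXml(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// TwiML that connects the call audio to the given websocket URL.
// The "prompt" parameter name is an assumption made for illustration.
function buildConnectStreamTwiml(websocketUrl, promptText) {
  return (
    "<Response><Connect>" +
    `<Stream url="${escapeXml(websocketUrl)}">` +
    `<Parameter name="prompt" value="${escapeXml(promptText)}"/>` +
    "</Stream></Connect></Response>"
  );
}
```

In a turn, the Lambda function would post the output of `buildRasaRequest` to RASA, pass the parsed reply through `joinRasaReplies`, and return the result of `buildConnectStreamTwiml` to Twilio together with the websocket URL obtained from Voicegain in step 4.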


