• Jacek Jarmulak

How to build a Voicebot using Voicegain, Twilio, RASA, and AWS Lambda


The complete code is available in our GitHub repository. The RASA logic is not included, so you will have to supply your own.


What does it do?

The setup allows you to call a phone number and then interact with a Voicebot that uses RASA as the dialog logic engine.


How does it work?


The Components

  • Twilio Programmable Voice - We configure a Twilio phone number to point to a TwiML App that has the AWS Lambda function as the callback URL.

  • AWS Lambda function - a single Node.js function with an API Gateway trigger (simple HTTP API type).

  • Voicegain STT API - we use the /asr/transcribe/async API with audio input via a websocket stream and output via a callback. The callback goes to the same AWS Lambda function as the Twilio callback; the two can be told apart because the Voicegain callback is a POST while the Twilio callback is a GET.

  • RASA - dialog logic is provided by a RASA NLU dialog server, which is accessible over the RestInput API.

  • AWS S3 - stores the transcription results at each dialog turn.
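
Because one Lambda function serves both Twilio and Voicegain, it has to decide who is calling before doing anything else. The following is a minimal sketch of that dispatch, not the code from our repository. It assumes the API Gateway HTTP API delivers events in payload format 2.0 (where the method is found at requestContext.http.method); the names classifyCallback, handleTwilioCallback, and handleVoicegainCallback are hypothetical placeholders.

```javascript
// Minimal sketch: one Lambda entry point that serves both callers.
// Assumption: API Gateway HTTP API with payload format 2.0.

// Decide which service is calling, based only on the HTTP method.
function classifyCallback(event) {
  const method = (event?.requestContext?.http?.method ?? "").toUpperCase();
  if (method === "GET") return "twilio";     // Twilio callbacks arrive as GET
  if (method === "POST") return "voicegain"; // Voicegain callbacks arrive as POST
  return "unknown";
}

// Hypothetical placeholder: would build and return TwiML for the next turn.
async function handleTwilioCallback(event) {
  return {
    statusCode: 200,
    headers: { "Content-Type": "text/xml" },
    body: "<Response/>",
  };
}

// Hypothetical placeholder: would store the transcription result in S3.
async function handleVoicegainCallback(event) {
  return { statusCode: 200, body: "" };
}

// Lambda handler: route the request to the matching placeholder.
const handler = async (event) => {
  switch (classifyCallback(event)) {
    case "twilio":
      return handleTwilioCallback(event);
    case "voicegain":
      return handleVoicegainCallback(event);
    default:
      return { statusCode: 405, body: "Method Not Allowed" };
  }
};

// In a deployed function the handler would be exported,
// for example: exports.handler = handler;
```

Dispatching on the method alone keeps the setup to a single function and a single URL. If the two callbacks ever needed the same method, separate routes would be the more robust choice.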

The Steps

The sequence diagram is provided below. Basically, the sequence of operations is as follows:

  1. Call a Twilio phone number

  2. Twilio makes an initial callback to the Lambda function

  3. The Lambda function sends "Hi" to RASA and RASA responds with the initial dialog prompt

  4. The Lambda function calls Voicegain to start an async transcription session. Voicegain responds with the URL of a websocket for audio streaming

  5. The Lambda function responds to Twilio with a TwiML command <Connect><Stream> to open a Media Stream to Voicegain. The command also contains the text of the question prompt.

  6. Voicegain uses TTS to generate an audio prompt from the text of the RASA question and streams it via the websocket to Twilio for playback

  7. The Caller hears the prompt and says something in response

  8. Twilio streams caller audio to Voicegain ASR for speech recognition

  9. Voicegain ASR transcribes the speech to text and makes a callback to the Lambda function with the transcription result

  10. Lambda function stores the transcription result in S3

  11. Voicegain closes the websocket session with Twilio

  12. Twilio notices the end of the session with ASR and makes a callback to the Lambda function to find out what to do next

  13. The Lambda function retrieves the recognition result from S3 and passes it to RASA.

  14. RASA processes the answer and generates next question in the dialogue

  15. The next turn continues in the same way from step 4.
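
To make the hand-off between RASA and Twilio more concrete, here is a minimal sketch of the data shaping in steps 3, 5, and 13, not the code from our repository. It assumes RASA is reached over its REST channel (POST /webhooks/rest/webhook with a sender and message, returning an array of messages) and that the prompt text travels to Voicegain as a custom <Parameter> inside <Stream>. The parameter name "prompt" and all function names are hypothetical, and the network calls themselves are left out.

```javascript
// Minimal sketch of one dialog turn's data shaping (no network calls).

// Escape text for safe use inside an XML attribute value.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// JSON body for RASA's REST channel (POST /webhooks/rest/webhook).
// `sender` identifies the conversation, e.g. the Twilio call id.
function buildRasaMessage(sender, text) {
  return { sender, message: text };
}

// The REST channel answers with an array of messages; keep the ones
// that carry text and join them into a single prompt.
function promptFromRasa(messages) {
  return messages
    .filter((m) => typeof m.text === "string")
    .map((m) => m.text)
    .join(" ");
}

// TwiML that tells Twilio to open a Media Stream to the websocket URL
// returned by Voicegain. Assumption: the prompt is passed as a custom
// parameter named "prompt" (hypothetical name).
function buildStreamTwiml(websocketUrl, promptText) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(websocketUrl)}">`,
    `      <Parameter name="prompt" value="${escapeXml(promptText)}"/>`,
    "    </Stream>",
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

On the first callback the message sent to RASA would be "Hi"; on later turns it would be the transcript retrieved from S3. Escaping matters here because both the websocket URL and the RASA prompt can contain characters such as & or quotes that would otherwise break the TwiML.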


