Jacek Jarmulak

Voicegain Speech Analytics API Generally Available


Voicegain has released its Speech Analytics (SA) API, which supports a variety of analytics tasks performed on audio or on the transcript of that audio. The features supported by the Voicegain SA API were chosen to support our main target use case: processing Call Center calls.


Things that Speech Analytics can do now (as of release 1.22.0)

The current release supports offline Speech Analytics. The data that can be obtained through Speech Analytics API is listed below.


Note that we do not list here data that can also be obtained from our Transcribe API, such as the transcript, decibel values, audiozones, etc. These are, however, accessible from the Speech Analytics API response.


Per channel analytics:

  • gender - likely gender of the speaker based on the voice characteristics. Currently either "male" or "female".

  • emotion - both totals over the entire call and a list of values computed at multiple places in the transcript. Each item will contain: (1) sentiment - from -1.0 (mad/angry) to +1.0 (happy/satisfied); (2) mood - a map with estimated values (range 0.0 to 1.0) for the following moods: "neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"; (3) location - start and end in msec and the index of the word

  • Named Entities recognized in the call. This will be a list with the entity type and the location in the call. The supported NER types are: CARDINAL - numerals that do not fall under another type; DATE - absolute or relative dates or periods; EVENT - named hurricanes, battles, wars, sports events, etc.; FAC - buildings, airports, highways, bridges, etc.; GPE - countries, cities, states; NORP - nationalities or religious or political groups; MONEY - monetary values, including unit; ORDINAL - "first", "second", etc.; ORG - companies, agencies, institutions, etc.; PERCENT - percentage, including "%"; PERSON - people, including fictional; QUANTITY - measurements, as of weight or distance; TIME - times smaller than a day.

  • keywords - list of keywords or keyword groups recognized in the call. Keywords to be recognized can easily be configured from examples.

  • profanity - this is essentially a predefined keyword group

  • talk metrics - things like maximum and average talk streak, talk rate, energy

  • overtalk metrics - overtalk occurs when one speaker starts speaking while the other speaker is still speaking.
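To illustrate the overtalk metric above: conceptually it amounts to measuring the overlap between the two channels' speech intervals. The sketch below is a hypothetical simplification for illustration, not Voicegain's actual implementation; the segment format is an assumption:

```python
# Sketch: total overtalk duration (msec) from per-channel speech segments.
# Each segment is a (start_msec, end_msec) interval of detected speech.
# Illustrative simplification only - not Voicegain's implementation.

def overtalk_msec(agent_segments, caller_segments):
    """Sum the overlap between the two channels' speech intervals."""
    total = 0
    for a_start, a_end in agent_segments:
        for c_start, c_end in caller_segments:
            overlap = min(a_end, c_end) - max(a_start, c_start)
            if overlap > 0:
                total += overlap
    return total

agent = [(0, 2000), (3000, 5000)]
caller = [(1500, 3500)]
print(overtalk_msec(agent, caller))  # 500 + 500 = 1000 msec of overtalk
```

This also makes clear why overtalk requires per-channel (or diarized) audio: without knowing which speaker produced which interval, the overlap cannot be computed.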

Global analytics:

  • silence metrics - defined as time when neither channel is speaking. Note: only the Agent is assumed to be in control of the speaking time. This is a simplification, but it is difficult to determine if any silence was caused by the caller and was unavoidable.

  • word cloud frequencies - smart word cloud data with stop words removed and word variations collapsed before computing frequencies
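The word-cloud preprocessing described above (stop-word removal and collapsing of word variations before counting) can be pictured roughly as follows. The stop-word list and the naive suffix stripping here are purely illustrative stand-ins for whatever Voicegain uses internally:

```python
from collections import Counter

# Sketch of "smart" word-cloud frequencies: drop stop words and collapse
# simple word variations before counting. The stop-word list and the
# suffix-stripping rule are illustrative assumptions only.

STOP_WORDS = {"the", "a", "an", "i", "you", "to", "is", "and", "of", "about"}

def collapse(word):
    # Extremely naive variation collapsing, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def word_cloud_frequencies(transcript):
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    words = [collapse(w) for w in words if w and w not in STOP_WORDS]
    return Counter(words)

freqs = word_cloud_frequencies("I called about the billing. Calls and billing issues.")
print(freqs)  # "called"/"calls" collapse to "call", "billing" to "bill"
```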

Speech Analytics features coming soon

Real-time Speech Analytics will be available in the near future. Soon we also plan to release Score Card support for Speech Analytics.


Per channel analytics coming soon:

  • Two additional named entities: CC - Credit Card, SSN - Social Security number

  • age - estimated age of the speaker based on the voice characteristics. Three possible values: "young-adult" "senior" "unknown"

  • phrases - list of phrases or phrase groups recognized in the call. These are identified using NLU algorithms - essentially the same as those used for identifying NLU intents. Phrases to be recognized can be configured from examples.

  • pitch statistics will be added to talk metrics

Additionally, we will soon support PII redaction of any Named Entities from either transcript or audio.
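Transcript-side PII redaction can be pictured as replacing entity spans with their type labels. The entity structure below (type plus character offsets) is an assumption for illustration; the actual Speech Analytics API response format may differ:

```python
# Sketch: redact named-entity spans from a transcript. The entity structure
# (type, start/end character offsets) is assumed for illustration and is not
# necessarily the actual Speech Analytics API response format.

def redact(transcript, entities, types_to_redact):
    """Replace each entity of a redacted type with [TYPE]."""
    # Work right-to-left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["type"] in types_to_redact:
            transcript = (
                transcript[: ent["start"]]
                + "[" + ent["type"] + "]"
                + transcript[ent["end"]:]
            )
    return transcript

text = "My card number is 4111 1111 1111 1111 and my name is Jane Doe."
entities = [
    {"type": "CC", "start": 18, "end": 37},
    {"type": "PERSON", "start": 53, "end": 61},
]
print(redact(text, entities, {"CC", "PERSON"}))
# My card number is [CC] and my name is [PERSON].
```

Audio-side redaction would analogously use the msec locations of the entities to silence or bleep the corresponding audio spans.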


Supported audio types

Speech Analytics API supports the following types of audio input:

  • 2-channel (stereo) audio, as typically found in call centers, where the Caller voice is recorded in one channel and the Agent voice is recorded in the other channel. Some metrics, such as overtalk, can only be computed if the input audio is of this type.

  • 1-channel audio with two speakers - for this audio type, diarization will be performed to separate the two speakers. The per-channel analytics will be performed after diarization. Overtalk metrics are not available in this case.
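A request to the SA API for stereo call-center audio might look roughly like the sketch below. The field names (audio source URL, channel-to-role mapping) are assumptions made for illustration; consult the API specification for the actual request schema:

```python
import json

# Hypothetical request body for offline Speech Analytics on stereo audio.
# All field names here (audio.source.fromUrl, settings.channels, role labels)
# are assumptions for illustration - see the API specification for the
# actual schema.

def build_sa_request(audio_url, agent_channel="left"):
    other = "right" if agent_channel == "left" else "left"
    return {
        "audio": {"source": {"fromUrl": audio_url}},
        "settings": {
            # Map channels to speaker roles so per-channel metrics
            # (talk, overtalk, sentiment, etc.) are attributed correctly.
            "channels": {agent_channel: "agent", other: "caller"}
        },
    }

request_body = build_sa_request("https://example.com/call-recording.wav")
print(json.dumps(request_body, indent=2))
```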

You can see the API specification here.
