“Hey Siri, call me an ambulance”, you gasp at your phone.
“You would like me to call you ‘Anne Ambulance’?”
You stare at your phone in disbelief. But your phone doesn’t understand your expressions, because most dialogue systems first run your speech through an automatic speech recognition system and then work only on the resulting text.
While the particular exchange I just described can easily be corrected using the verbal content of the utterance alone, a dynamic response system would benefit from a better understanding of context drawn from the user’s non-verbal behavioural cues. Specifically, in this work we try to answer the question: how can we improve the decision making of a dynamic spoken dialogue system by incorporating the rich information in the non-verbal behaviour of the user?
This was one of the first demos I built in my role as a Senior Research Engineer at the Multicomp Lab, Carnegie Mellon University. The project was a collaboration between multiple groups within the Yahoo InMind Project.
My specific contribution here was to build a real-time module that estimates a user’s confusion or surprise from a video of their face, and to integrate it into a dynamic spoken dialogue system that resided on the user’s mobile phone. The integration started with receiving the video from a video server that routed the feed from the phone’s front-facing camera. I then extracted the facial action units using Multisense, the platform we built within our group, and estimated confusion and surprise as a linear combination of the appropriate action units. These estimates were then broadcast back to the response system running on the mobile phone.
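To make the "linear combination of action units" step concrete, here is a minimal Python sketch. It is not the Multisense code: the action unit names, the weights, and the function names are all assumptions chosen for illustration (raised brows, widened eyes and a dropped jaw are commonly associated with surprise, a lowered brow with confusion). The actual coefficients used in the project are not given here.

```python
from typing import Dict

# Hypothetical weights, for illustration only. Keys are FACS action unit
# labels; intensities are assumed to be normalised to the range [0, 1].
SURPRISE_WEIGHTS: Dict[str, float] = {
    "AU01": 0.25,  # inner brow raiser
    "AU02": 0.25,  # outer brow raiser
    "AU05": 0.25,  # upper lid raiser
    "AU26": 0.25,  # jaw drop
}
CONFUSION_WEIGHTS: Dict[str, float] = {
    "AU04": 0.5,  # brow lowerer
    "AU07": 0.5,  # lid tightener
}


def linear_score(action_units: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of action unit intensities, clamped to [0, 1]."""
    score = sum(w * action_units.get(name, 0.0) for name, w in weights.items())
    return min(1.0, max(0.0, score))


def estimate_states(action_units: Dict[str, float]) -> Dict[str, float]:
    """Map one frame's action unit intensities to confusion and surprise estimates."""
    return {
        "confusion": linear_score(action_units, CONFUSION_WEIGHTS),
        "surprise": linear_score(action_units, SURPRISE_WEIGHTS),
    }
```

Calling `estimate_states` once per video frame yields a pair of scores that can be sent on to the dialogue system. In practice the weights would need to be tuned or learned from data rather than set by hand as they are here.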
I went with ZeroMQ for the messaging backbone and used a publish-subscribe (pub-sub) model for the communication, so that multiple consumers could receive the processed behaviour information published from my end.
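The pub-sub arrangement can be sketched with pyzmq, the Python bindings for ZeroMQ. This is an illustration under assumptions rather than the project's code: the topic name, port, JSON message layout, and function names are all hypothetical. The `zmq` import sits inside the socket functions only so that the encoding helpers can be used without pyzmq installed.

```python
import json
import time
from typing import Dict, Iterable, Iterator, List

TOPIC = b"behaviour"  # hypothetical topic name


def encode_estimate(estimates: Dict[str, float], timestamp: float) -> List[bytes]:
    """Build a two-frame message: the topic, then a JSON payload."""
    payload = json.dumps({"t": timestamp, **estimates}, sort_keys=True)
    return [TOPIC, payload.encode("utf-8")]


def decode_estimate(frames: List[bytes]) -> Dict[str, float]:
    """Inverse of encode_estimate; drops the topic frame."""
    _topic, payload = frames
    return json.loads(payload.decode("utf-8"))


def publish(stream: Iterable[Dict[str, float]], endpoint: str = "tcp://*:5556") -> None:
    """Bind a PUB socket and send every estimate to all connected subscribers."""
    import zmq  # requires pyzmq

    sock = zmq.Context.instance().socket(zmq.PUB)
    sock.bind(endpoint)
    try:
        for estimates in stream:
            sock.send_multipart(encode_estimate(estimates, time.time()))
    finally:
        sock.close()


def subscribe(endpoint: str = "tcp://localhost:5556") -> Iterator[Dict[str, float]]:
    """Connect a SUB socket, filter on TOPIC, and yield decoded estimates."""
    import zmq  # requires pyzmq

    sock = zmq.Context.instance().socket(zmq.SUB)
    sock.connect(endpoint)
    sock.setsockopt(zmq.SUBSCRIBE, TOPIC)
    try:
        while True:
            yield decode_estimate(sock.recv_multipart())
    finally:
        sock.close()
```

Because the publisher binds and the subscribers connect, any number of consumers can attach to the same endpoint without the publisher knowing about them. One caveat of this pattern is that a PUB socket does not queue messages for subscribers that have not connected yet, so a late-joining consumer only sees estimates sent after it subscribes.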