Saturday, March 22, 2014

Voice Interaction

I can't wait for the straight-out-of-Iron Man "Jarvis" digital assistant we can communicate with in a human fashion.
But this involves very sophisticated and advanced AI, Speech-to-text (a.k.a. Speech Recognition) and Text-to-speech technologies.
Some parts of this technology stack can already be built today, while other parts (like speech recognition and text-to-speech) still need to improve.

Here are some thoughts and findings about the architecture and constraints of such a system.


Architecture


The idea is to process speech into activity, be it an activity within the system or an outward one (like a spoken response or a physical action).
I believe a basic architecture for voice-based interaction is as follows:
Fig. 1. Basic Architecture



There are basically 4 important components in this architecture:
  1. Speech-to-text (STT): This component processes the audio input. Performance is key here, where performance means both speed and accuracy. The output can however be simplified (rather than fully accurate), as long as it is enough to trigger the right commands and end results. For example, if the next component doesn't need interjections, there's no need for the STT to recognize them.
  2. Text-to-command (TTC): This component will process text and translate into commands.
  3. Command Interpreter (CI): Commands are processed here. They can either lead to results to be later processed (like text to be spoken afterwards) or immediate actions (like changing state of interpreter, parameters, etc.).
  4. Results Interpreter (RI): Certain commands will result in further activity. One example is Text-to-speech where the results of some commands could be the text to be spoken. There could actually be more than one RI to process all the results from the CI (e.g. one to turn on the heat and one to acknowledge the order).
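The four components above can be sketched as a minimal pipeline. All class and function names below are my own illustration, not an existing API; each stage is a stub standing in for a real implementation:

```python
# Minimal sketch of the four-stage pipeline: STT -> TTC -> CI -> RI.
# All names are illustrative; no real speech library is assumed.

def speech_to_text(audio: bytes) -> str:
    """STT: turn raw audio into a (possibly simplified) transcript."""
    return "turn lights off"  # stand-in for a real recognizer

def text_to_command(text: str) -> tuple:
    """TTC: map a transcript onto a command the CI understands."""
    if "lights" in text:
        return ("turn_lights", "off" if "off" in text else "on")
    return ("say", "Sorry, I did not understand.")

def interpret_command(command: tuple) -> list:
    """CI: execute the command; return results for the RI (may be empty)."""
    name, arg = command
    if name == "turn_lights":
        return [("act", f"lights:{arg}"), ("speak", "Done.")]
    return [("speak", arg)]

def interpret_results(results: list) -> list:
    """RI: turn CI results into real-world activity (speech, automation)."""
    return [f"RI -> {kind}: {payload}" for kind, payload in results]

# End-to-end run:
print(interpret_results(interpret_command(text_to_command(speech_to_text(b"...")))))
```

Note how the CI here returns two results, handled by what could be two distinct RIs (one for the automation, one for the spoken acknowledgment).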

Benefits


The main benefit of this architecture is the separation of concerns.
  • STT deals only with processing audio.
  • TTC deals only with processing text and creating the appropriate commands from it. The output of STT is very difficult to abstract completely into generic input for TTC, so assumptions will likely be made that couple the TTC somewhat to a given STT solution. It is however possible to extract this correlation into a sub-component of STT to comply with an already implemented TTC, and vice versa.
  • CI processes commands and might return further commands for RI. TTC will need to know the details of those commands so that CI can understand them.
  • RI deals with commands to trigger activity in real world (be it speech or any automation).

For CI and RI, this architecture allows the creation and sharing of commands between systems. It's then possible to re-use already implemented commands, stored and shared in libraries/repositories.


Response Time


The performance of the system is defined as a measure of its speed and accuracy.
The response time of the entire system is proportional to the speed of each component.
Each component's response time is made of two elements:
  1. Time to receive its input (noted with a prime, e.g. A')
  2. Time to process that input (the plain letter, e.g. A).
For example, the response time of STT is A'+A, where A' is the time to receive the audio and A the time to process it into text.

According to research in [1], response times vary based on, among other things, the type of the input (question):

Question Type                 | Response Time
yes-answers                   | 0-100ms
abandoned                     | 0-200ms
yes-no-question               | 0-220ms
no-answers                    | 0-220ms
affirmative non-yes answers   | 0-200ms
statement non-opinion         | 0-220ms
statement-opinions            | 0-250ms
negative non-no answers       | 0-300ms


With X the time spent through STT, TTC and CI (X = A'+A+B'+B+C'+C) and Y the additional time spent in RI (Y = D'+D):

\[ ResponseTime = \begin{cases} X & \mbox{if no results from commands}\\
X+Y & \mbox{otherwise.}
\end{cases} \]

Based on [1], $ResponseTime$ needs to comply with the following requirement:
\[ 0 \lt ResponseTime \lt 300ms \]

This can help figure out the performance requirements of the system's constituent components.
I suspect B' to C might use the least amount of time, with A', A, D' and D holding the highest values.
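A short sketch makes this budget concrete. The millisecond figures below are made-up placeholders to illustrate the arithmetic, not measurements:

```python
# Response-time budget under the A'/A .. D'/D notation from the text.
# All numbers are illustrative placeholders, not measured values.

budget_ms = 300  # upper bound suggested by [1]

stages = {
    "A' (receive audio)":   80,
    "A  (STT processing)":  90,
    "B' (receive text)":     1,
    "B  (TTC processing)":   5,
    "C' (receive command)":  1,
    "C  (CI processing)":   10,
}
ri_stages = {
    "D' (receive results)":  5,
    "D  (RI processing)":   70,
}

x = sum(stages.values())     # commands with no results: X
y = sum(ri_stages.values())  # extra cost when an RI runs: Y

print(f"X = {x} ms, X+Y = {x + y} ms, within budget: {x + y < budget_ms}")
```

Even with generous guesses for the cheap middle stages, most of the budget goes to the audio ends of the pipeline, which matches the suspicion above.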


Commands


Commands can be grouped in 2 distinct sets:
  1. "Set" commands (i.e. without further activity): no results will be returned after processing commands (or very marginal results). This is likely to be from command changing the state of the system (like "don't acknowledge anymore", or "shutdown").
  2. "Reactive" commands (i.e. with further activity): those commands will yield further activity. One example is when the system is asked to report on its state. The CI will collect the data and output the text to be spoken. A Text-to-speech component (acting as a Result Interpreter) will speak the text.

Command Interpreter


Right now I see 3 types of commands the CI needs to handle:

  1. "Say": the end result will be text to be spoken by a TTS technology. It can either be a question or an answer. Typical human interaction relies heavily on speech (especially since we can't see "Jarvis", so no gesture or physical communications).
  2. "Ask": this is a special case in the "Say" realm where the interpreter might setup the next interaction as dependent on the current one. This is the case where the system needs more details, more information, or a confirmation (or rejection) before proceeding.
  3. "Do": commands in this category typically accomplish something. There are 2 possible behaviors for such commands:
    1. "Set": this is something internal. It might change the state of an interpreter.
    2. "Act": triggers actions outside the interpreter realm. This would be interaction with other software or hardware.
Here are some snippets of pseudo-commands TTC and CI could be dealing with:
  • Set command: set(acknowledge, off)
  • Do command followed by Ask: walk(10 miles) --> set state to "walk:10 miles" + ask("which direction?", direction) --> "North" --> direction("North")
  • Do command: turn_lights(off)
  • Do command: shutdown()
  • Say and Act commands combined: say(search("Marvin Minsky"))
  • Say and Act commands combined: say(what_is("raccoon")) 
  • Say and Act commands combined: say(where_is("brother"))
  • Act commands combined: show(where_is("brother"))
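A hypothetical CI handling these types could look like the following sketch (the command kinds, payload format and class name are all invented for illustration):

```python
# Sketch of a CI dispatching the command types above (Say / Ask / Do-Set / Do-Act).
# Names and payload formats are illustrative only.

class CommandInterpreter:
    def __init__(self):
        self.state = {}      # mutated by "Set"-style Do commands
        self.pending = None  # question awaiting an answer (Ask)

    def handle(self, kind: str, payload: str) -> list:
        """Return results for the RI; an empty list means a pure Set command."""
        if kind == "say":
            return [("speak", payload)]
        if kind == "ask":
            self.pending = payload  # next input answers this question
            return [("speak", payload)]
        if kind == "do_set":        # internal: change interpreter state
            key, value = payload.split("=")
            self.state[key] = value
            return []
        if kind == "do_act":        # external: other software or hardware
            return [("act", payload)]
        raise ValueError(f"unknown command kind: {kind}")

ci = CommandInterpreter()
ci.handle("do_set", "acknowledge=off")       # Set: no results for the RI
ci.handle("ask", "which direction?")         # Ask: spoken, and sets pending state
```

The `pending` field is how an Ask makes the next interaction dependent on the current one: the TTC can route the next utterance to the pending question instead of treating it as a fresh command.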

Considerations


It's expected that sometimes reactions (spoken responses or actions) start before the end of the input.
To get a better experience out of the system, it's preferred that the input be streamed. Most of today's speech-to-something solutions wait for a distinct pause to bundle the audio and send it for processing.
In [1], BACKWARD cases are interesting: responses are way faster than in other cases because the questions are about something already mentioned. A feedback loop from the CI to upstream components (STT and TTC) might help speed things up (e.g. tweaking probabilities of upcoming words based on current state) and decrease response time.
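Such a feedback loop could be sketched as follows. The boosting scheme is entirely invented for illustration; real recognizers expose different tuning knobs (grammars, language-model weights, etc.):

```python
# Hypothetical feedback loop: the CI publishes context words so the STT
# can boost their recognition probability. The scoring scheme is invented
# for illustration only.

context_boost: dict[str, float] = {}

def ci_feedback(expected_words: list[str], boost: float = 2.0) -> None:
    """CI tells upstream components which words are likely to come next."""
    for word in expected_words:
        context_boost[word] = boost

def rescore(hypotheses: dict[str, float]) -> str:
    """STT picks the hypothesis whose words get the largest boosted score."""
    def score(text: str) -> float:
        return hypotheses[text] * max(
            (context_boost.get(w, 1.0) for w in text.split()), default=1.0
        )
    return max(hypotheses, key=score)

# After asking "which direction?", the CI expects a compass answer:
ci_feedback(["north", "south", "east", "west"])
print(rescore({"north": 0.4, "fourth": 0.5}))  # context tips it to "north"
```

Without the feedback, the acoustically stronger "fourth" would win; with it, the contextually expected "north" does, which is exactly the BACKWARD effect observed in [1].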


References


[1] Timing responses to questions in dialogue
[2] An Architecture for Scalable, Universal Speech Recognition
