Voice-Based Computing with Digital Assistants

It’s been a long time coming, but it looks like the era of voice-driven computing has finally arrived.

Powered by the latest advancements in artificial intelligence and deep learning, the new generation of smart digital assistants and chatbots is clearly one of the hottest developments in the tech industry. Not only are these assistants driving big announcements from vendors such as Google, Microsoft, Amazon, Facebook and Apple, they’re expected to enable even bigger changes long term.

In fact, as the technology improves and people become more accustomed to speaking to their devices, digital assistants are poised to change not only how we interact with and think about technology, but even the types of devices, applications and services that we purchase and use. The changes won’t happen overnight, but the rise of these voice-driven digital helpers portends some truly revolutionary developments in the tech world.

Fine and good, you say, but what about the here and now? Short term, expect to see a lot of effort geared towards improving the accuracy and reliability of our interactions with these assistants. We’ve all either made or heard jokes about the “creative” interpretations of various requests that Siri and other digital assistants have made. While they may seem funny at first, these types of experiences quickly sour people on voice-driven interactions. In fact, many people who tried these assistants early on stopped using them because of those bad first experiences.

To overcome this, vendors are spending a lot of time fine-tuning various parts of the interaction chain, from initial speech recognition to server-based analysis. For example, on some devices, companies are able to leverage enhanced microphones and audio signal processing algorithms. As with so many things in life, speech recognition often suffers from a garbage-in, garbage-out phenomenon. In other words, the quality and “cleanliness” of the audio signal being processed can have a big impact on the accuracy of the recognition. The more work that can be done to pre-process and filter the audio before it’s analyzed, the better. (FYI, this is also true for image-based recognition: image processing engines on today’s advanced smartphones are increasingly being used to “clean up” photos and optimize them for recognition.)
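To make the pre-processing idea concrete, here is a minimal sketch in Python of the kind of cleanup that might happen before recognition. It is an illustration only, not any vendor's actual pipeline: the function names, the three steps chosen (centering the signal, gating out low-level noise, normalizing the volume) and the default threshold are all assumptions made for the example.

```python
def remove_dc_offset(samples):
    """Subtract the mean so the signal is centered on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def noise_gate(samples, threshold):
    """Zero out samples whose magnitude falls below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def normalize_peak(samples):
    """Scale the signal so its largest magnitude is 1.0 (silence is left alone)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    return [s / peak for s in samples]


def preprocess(samples, gate_threshold=0.05):
    """Hypothetical cleanup chain: center, gate, then normalize."""
    centered = remove_dc_offset(samples)
    gated = noise_gate(centered, gate_threshold)
    return normalize_peak(gated)


# A short made-up snippet: a constant offset, some faint hiss, one loud swing.
raw = [0.25, 0.28125, 0.21875, 0.75, -0.25]
clean = preprocess(raw)
```

Real devices use far more sophisticated techniques (multi-microphone beamforming, echo cancellation and so on), but the principle is the same: the cleaner the signal handed to the recognizer, the better its chances.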

The real heavy lifting occurs on the back end, however, where enormous cloud-based data centers are typically employed to interpret the audio and provide the appropriate response. This is where huge advancements in pattern-based deep learning algorithms are helping to improve not only the accuracy of recognition, but also, more importantly, the relevance of the response.

Essentially, the servers in these data centers quickly compare the results of the incoming audio snippets to enormous databases of keywords, key phrases and portions of spoken words known as phonemes, in order to find matches. In many cases, individual words are then combined into a phrase, and that combined phrase is then compared to yet another database to find more matches. Ultimately, the entirety of what was said is pieced together, and then more work is done to provide an appropriate response to the request.
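As a rough illustration of that two-stage lookup, the Python sketch below matches a stream of phonemes against a tiny word table, then matches the resulting words against a phrase table. Everything in it is hypothetical: the two tables, the ARPAbet-style phoneme symbols, the intent name and the greedy longest-match strategy are assumptions chosen to keep the example short. Production systems rely on statistical models rather than exact dictionary hits.

```python
# Hypothetical lookup tables, invented for this example.
PHONEMES_TO_WORD = {
    ("W", "AH", "T"): "what",
    ("T", "AY", "M"): "time",
    ("IH", "Z"): "is",
    ("IH", "T"): "it",
}

PHRASE_TO_INTENT = {
    ("what", "time", "is", "it"): "tell_time",
}


def phonemes_to_words(phonemes, lexicon, max_len=6):
    """Greedily match the longest known phoneme run at each position."""
    words = []
    i = 0
    while i < len(phonemes):
        longest = min(max_len, len(phonemes) - i)
        for length in range(longest, 0, -1):
            candidate = tuple(phonemes[i:i + length])
            if candidate in lexicon:
                words.append(lexicon[candidate])
                i += length
                break
        else:
            # No word starts here; skip this phoneme and keep going.
            i += 1
    return words


def words_to_intent(words, phrase_table):
    """Look up the whole phrase; returns None when nothing matches."""
    return phrase_table.get(tuple(words))


heard = ["W", "AH", "T", "T", "AY", "M", "IH", "Z", "IH", "T"]
words = phonemes_to_words(heard, PHONEMES_TO_WORD)
intent = words_to_intent(words, PHRASE_TO_INTENT)
```

Here `words` ends up as the four words of the question and `intent` as the made-up label `tell_time`, which a later stage would turn into an actual response.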

Improvements in accuracy will come from a combination of increasing the size and level of detail in the various databases, along with advancing the speed and filtering techniques of the pattern matching algorithms at the heart of these artificial intelligence engines. In addition, vendors are just beginning to leverage the increased number of sensors available in smartphones and other voice input devices to provide better context about where a person is located or what that person is doing, which in turn improves the appropriateness of the response.

For example, asking what the temperature is in a particular location typically provides the outside temperature, but if you actually wanted to know the temperature inside a room or building, you would have to combine the temperature from a sensor along with the original request to generate a more accurate response.
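The temperature example can be sketched in a few lines of Python. This is only a toy model of the idea: the `Context` fields, the `answer_temperature` function and the wording of the replies are invented for illustration, and a real assistant would have to infer whether the user is indoors from sensor data rather than being told.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """Hypothetical snapshot of what the device knows about the user's situation."""
    is_indoors: bool
    indoor_temp_c: Optional[float]  # reading from a local sensor, if one exists
    outdoor_temp_c: float           # value that would come from a weather service


def answer_temperature(ctx: Context) -> str:
    """Prefer the room sensor when the user is indoors and a reading is available."""
    if ctx.is_indoors and ctx.indoor_temp_c is not None:
        return f"It is {ctx.indoor_temp_c:.1f} C in this room."
    return f"It is {ctx.outdoor_temp_c:.1f} C outside."


at_home = Context(is_indoors=True, indoor_temp_c=21.5, outdoor_temp_c=4.0)
on_walk = Context(is_indoors=False, indoor_temp_c=None, outdoor_temp_c=4.0)

reply_home = answer_temperature(at_home)
reply_walk = answer_temperature(on_walk)
```

The same spoken request produces two different answers, and the only thing that changed is the context supplied alongside it.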

Though subtle, these kinds of contextual cues can greatly improve the appropriateness of a digital assistant’s response. These kinds of efforts will also be essential to help drive the next stage of digital assistance: proactive suggestions, information, and advice. Up until this point, much of the voice-based computing work described has occurred only in reaction to a user’s requests. By having a better sense of context, a smart digital assistant can actually start providing information even before it’s been asked.

Context can come not only from sensors, but also from awareness of, for example, information we’ve been searching for, documents we’ve been working on and much more.

Now, if implemented poorly, these proactive efforts could quickly become even more annoying than the sometimes laughable reactive responses to our voice requests. But if the concept is done well, these kinds of efforts can and will turn digital assistants into very beneficial helpers that could drive voice-based computing into a new age.

Published by

Bob O'Donnell

Bob O’Donnell is the president and chief analyst of TECHnalysis Research, LLC, a technology consulting and market research firm that provides strategic consulting and market research services to the technology industry and professional financial community. You can follow him on Twitter @bobodtech.
