The ability to have a smartphone respond to things you say has captivated people since the first demos of Siri on an iPhone over 7 years ago. Even the thought of an intelligent response to a spoken request was so science fiction-like that people were willing to forgive some pretty high levels of inaccuracy—at least for a little while.
Thankfully, things progressed on the voice-based computing and personal assistant front with the successful launch of Amazon’s Alexa-powered Echo smart speakers, and the Google Assistant found on Android devices, as well as Google (now Nest) Home smart speakers. All of a sudden, devices were accurately responding to our simple commands and providing us with an entirely new way of interacting with both our devices and the vast troves of information available on the web.
The accuracy of those improved digital assistants came with a hidden cost, however, as the recent revelations about recordings made by Amazon Alexa-based devices have laid bare. Our personal information, or even complete conversations from within the privacy of our homes, was being uploaded to the cloud for other systems, or even people, to analyze, interpret, and respond to. Essentially, the computing power and AI capabilities needed to respond to our requests, or to properly interpret what we meant, required the enormous computing resources of cloud-based data centers, full of powerful servers running large, complicated neural network models.
Different companies used different resources for different reasons, but regardless, in order to get access to the power of voice-based digital assistants, you had to be willing to give up some degree of privacy, no matter which one you used. It was a classic trade-off of convenience versus confidentiality. Until now.
As Google demonstrated at their recent I/O developer conference, they now have the ability to run the Google Assistant almost entirely on the smartphone itself. The implications of this are enormous, not just from a privacy perspective (although that’s certainly huge), but from a performance and responsiveness angle as well. While connections to LTE networks and the cloud are certainly fast, they can’t compete with local computing resources. As a result, Google reported up to a 10x gain in responsiveness to spoken commands.
In the real world, that translates not only to faster answers, but also to a significantly more intuitive means of interacting with the assistant, one that more closely mimics what it’s like to speak with another human being. Plus, the ability to run natural language recognition models locally on the smartphone opens up the possibility of longer, multi-part conversations. Instead of consisting of awkward silences and stilted responses, as they typically do now, these multi-turn conversations can take on a more natural, real-time flow. While this may sound subtle, the real-world experience shifts from something you have to endure to something you enjoy doing, and that can translate to significant increases in usage and improvements in overall engagement.
In addition, as hinted at earlier, the impact on privacy can be profound. Instead of having to upload your verbal input to the cloud, it can be analyzed, interpreted, and reacted to on the device, keeping your personal data private, as it should be. As Google pointed out, it is using a technique called federated learning, in which what the device learns from your data (rather than the raw data itself) is sent to the cloud in an anonymized form, where it is combined with contributions from other users to improve the accuracy of Google’s models. Once those models are improved, they can be sent back down to the local device, so that the overall accuracy and effectiveness of the on-device AI improves over time.
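To make the federated learning idea a bit more concrete, here is a minimal sketch of one common formulation, federated averaging, written in Python. It is a toy illustration and not Google's implementation: the one-weight model, the function names (`local_update`, `federated_average`, `run_round`), the learning rate, and the sample data are all assumptions made for the example. Each simulated device trains on its own data and shares only its updated weight, and a server averages those weights into a new shared model.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def local_update(weights: List[float], examples: List[Example],
                 lr: float = 0.05) -> List[float]:
    """Train a toy model y = w * x on one device's private data.

    Only the updated weight leaves this function; the raw examples do not.
    """
    w = weights[0]
    for x, y in examples:
        error = w * x - y
        w -= lr * error * x  # gradient step for squared error
    return [w]


def federated_average(client_weights: List[List[float]],
                      client_sizes: List[int]) -> List[float]:
    """Combine device weights, weighting each by its number of examples."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(cw[i] * size for cw, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def run_round(global_weights: List[float],
              clients: List[List[Example]]) -> List[float]:
    """One round: send the model out, train locally, average the results."""
    updates = [local_update(list(global_weights), data) for data in clients]
    sizes = [len(data) for data in clients]
    return federated_average(updates, sizes)


# Hypothetical data: two devices whose private examples both follow y = 2x.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
model = [0.0]
for _ in range(20):
    model = run_round(model, clients)
print(model)  # the shared weight moves toward 2.0
```

In this sketch the server never sees an individual `(x, y)` pair, only the weights each device reports. Production systems typically add further protections on top of this basic loop, which are not shown here.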
Given what a huge improvement this is over cloud-based assistants, it’s more than reasonable to wonder why it wasn’t done before. The main reason is that the algorithms and datasets necessary to do this work used to be enormous and could only run on the large amounts of computing infrastructure available in the cloud. In addition, in order to create its models in the first place, Google needed a large body of data to build models that can accurately respond to people’s requests. Recently, however, Google has been able to shrink its models down to a size that can run comfortably even on lower-end Android devices with relatively limited storage.
On the smartphone hardware side, not only have we seen the continued Moore’s law-driven increases in computing power that we’ve enjoyed on computing devices for nearly 50 years, but companies like Qualcomm have brought AI-specific accelerator hardware into a larger body of mainstream smartphones. Inside most of the company’s Snapdragon series of chips is the little-known Hexagon DSP (digital signal processor), a component that is ideally suited to run the kinds of AI-based models necessary to enable on-device voice assistants (as well as computational photography and other cool computer vision-based applications). Qualcomm has worked alongside Google to develop a set of software hooks for neural networks, the Android Neural Networks API (NNAPI), that allow these models to run faster and with more power efficiency on devices that include the necessary hardware. (To be clear, AI algorithms can and do run on other hardware inside smartphones, including both the CPU and GPU, but they can run more efficiently on devices that have the extra hardware capabilities.)
The net-net of all these developments is a decidedly large step forward in consumer-facing AI. In fact, Google is calling this Assistant 2.0 and is expected to make it available this fall with the release of the upcoming Q version of Android. It will incorporate not just the voice-based enhancements, but computer vision AI applications, via the smartphone camera and Google Lens, that can be done on device as well.
Even with these important advances, many people may view this next-generation assistant as a more subtle improvement than the technologies might suggest. The reality is that many of the steps necessary to take us from the frustrating early days of voice-based digital assistants to the truly useful, contextually intelligent helpers that we’re all still hoping for are going to be difficult to notice on their own, or even in simple combinations. Achieving the science fiction-style interactions that the first voice-based command tools seemed to promise is going to be a long, difficult path. As with the growth of children, the day-to-day changes are easy to miss, but after a few years, the advancements are, and will be, unmistakable.