The Evolution of Smart Speakers

on August 22, 2017

For a relatively nascent product category, smart speakers like Amazon Echo and Google Home are already seeing a huge influx of attention from both consumers and potential competitors eager to enter the market. Apple has announced the HomePod and numerous other vendors have either unveiled or are heavily rumored to be working on versions of their own.

Harman Kardon (in conjunction with Microsoft), GE Lighting and Lenovo have announced products in the US, while Alibaba and Xiaomi, among others, have said they will be bringing products out in China. In addition, Facebook is rumored to be building a screen-equipped smart speaker called Gizmo.

One obvious question after hearing about all the new entrants is, how can they all survive? The short answer, of course, is they won’t. Nevertheless, expect to see a lot of jockeying, marketing and positioning over the next year or two because it’s still very early days in the world of AI-powered and personal assistant-driven smart speakers.

Yes, Amazon has built an impressive and commanding presence with the Echo line, but Echos and all other current smart speakers have many limitations that frustrate existing users. Thankfully, technology improvements are coming that will let competitors differentiate themselves in ways that reduce that frustration and increase consumers’ satisfaction with smart speakers.

Part of the work involves the overall architecture of the devices and how they interact with cloud-based services. For example, one of the critical capabilities that many users want is the ability to accurately recognize the different individuals who speak to the device, so that responses can be customized for different members of a household. To achieve this as quickly and accurately as possible, it doesn’t make sense to send the audio signal to the cloud and then wait for the response. Even with superfast network connections, the inevitable delays make interactions with the device feel somewhat awkward.

The same problem exists when you try to move beyond the simple single-query requests that most people are making to their smart speakers today (“Alexa, play music by horn bands” or “Alexa, what is the capital of Iceland?”). In order to have naturally flowing, multi-question or multi-statement conversations, the delays (or latency) have to be dramatically reduced.
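To make the latency point concrete, here is a rough back-of-the-envelope sketch in Python. The per-turn delay figures are purely illustrative assumptions, not measurements from any actual product, and the function name is hypothetical.

```python
def conversation_wait_ms(turns, per_turn_delay_ms):
    """Total time a user spends waiting across a multi-turn conversation."""
    return turns * per_turn_delay_ms


# Assumed, illustrative per-turn delays (NOT measured values).
CLOUD_DELAY_MS = 800  # hypothetical: network round trip plus cloud processing
LOCAL_DELAY_MS = 150  # hypothetical: on-device processing only

for turns in (1, 5):
    cloud = conversation_wait_ms(turns, CLOUD_DELAY_MS)
    local = conversation_wait_ms(turns, LOCAL_DELAY_MS)
    print(f"{turns} turn(s): cloud {cloud} ms of waiting, local {local} ms")
```

Whatever the real numbers turn out to be, the waiting adds up with every turn, which is why a delay that is tolerable for a single query can make a longer back-and-forth feel stilted.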

The obvious answer to the problem is to do more of the recognition and response work locally on the device and not rely on a cloud-based network connection to do so. In fact, this is a great example of the larger trend of edge computing, where we are seeing devices or applications that used to rely solely on big data centers in the cloud start to do more of the computational work on their own.

That’s part of the reason you’re starting to see companies like Qualcomm and Intel, among others, develop chips that are designed to enable more powerful local computing work on devices like smart speakers. The ability to learn and then recognize different individuals, for example, is something that the DSP (digital signal processor) component of new chips from these vendors can do.
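As a loose illustration of what "learn and then recognize different individuals" can mean in practice, the Python sketch below compares a numeric "voiceprint" for an incoming utterance against voiceprints stored for each household member. This is only one common approach (cosine similarity over feature vectors), not a description of how any particular vendor's DSP works. The vectors, names and threshold are all made-up assumptions; a real system would derive such vectors from audio.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(voiceprint, enrolled, threshold=0.8):
    """Return the best-matching enrolled name, or None if nobody is close enough."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(voiceprint, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical enrolled household members with toy 3-dimensional voiceprints.
enrolled = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}

print(identify_speaker([0.85, 0.15, 0.35], enrolled))  # close to alice's vector
print(identify_speaker([-0.5, 0.2, -0.7], enrolled))   # matches nobody
```

Because the comparison is just a handful of multiplications per enrolled person, it is the kind of work that can plausibly run on the device itself rather than waiting on a round trip to the cloud.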

Another technological challenge facing current generation products is recognition accuracy. Everyone who has used a smart speaker or a digital assistant on another device has had the experience of not being understood. Sometimes that’s due to how the question or command is phrased, but it’s often due to background noises, accents, intonation or other factors that essentially end up providing an imperfect audio signal to the cloud-based recognition engine. Again, more local audio signal processing can often improve the audio signal to be sent, thereby enhancing overall recognition.

Going further, most of the AI-based learning algorithms used to recognize and accurately respond to speech will likely need to be run in very large, compute-intensive cloud data centers. However, it is becoming increasingly possible to start doing pattern recognition of common phrases (a form of inferencing, the second key aspect of machine learning and AI) locally, given the right kind of computing engines and hardware architectures. It may be a long time before all that kind of work can be done within smart speakers and other edge devices, but even doing some speech recognition on the device should enable higher accuracy and longer conversations. In short, a much better user experience.
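The split described here, with common phrases handled on the device and everything else sent to the cloud, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it matches already-transcribed text rather than raw audio, the command table is invented, and `send_to_cloud` is a hypothetical stand-in rather than any real service's API.

```python
import difflib

# Hypothetical table of common phrases the device can handle by itself.
LOCAL_COMMANDS = {
    "stop": "stop_playback",
    "volume up": "raise_volume",
    "volume down": "lower_volume",
    "what time is it": "tell_time",
}


def send_to_cloud(utterance):
    """Hypothetical stand-in for a request to a cloud recognition service."""
    return f"cloud:{utterance}"


def handle(utterance, cutoff=0.8):
    """Answer locally when the phrase closely matches a known command."""
    text = utterance.lower().strip()
    # Fuzzy match so small recognition slips still resolve on the device.
    matches = difflib.get_close_matches(text, list(LOCAL_COMMANDS), n=1, cutoff=cutoff)
    if matches:
        return f"local:{LOCAL_COMMANDS[matches[0]]}"
    return send_to_cloud(text)


print(handle("Volume up"))
print(handle("volume upp"))
print(handle("What is the capital of Iceland"))
```

The first two requests resolve on the device, while the open-ended question falls through to the cloud. The cutoff controls the trade-off: set it too low and the device starts guessing at requests it should have passed along.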

As new entrants try to differentiate their products in an increasingly crowded space, the ability to offer some key tech-based improvements is going to be essential. Clearly there’s a great deal of momentum behind the smart speaker phenomenon, but it’s going to take these kinds of performance improvements to move them beyond idle curiosities and into truly useful, everyday kinds of tools.