Voice Control Will Disrupt Living Room Electronics

Speculating on what Apple will do next seems to be a routine in high-tech journalism and social media now. The latest and greatest rumor is that Apple will develop an HDTV set. I wrote back in September that Apple should build a TV given the lousy experience and Apple’s ability to fix big user challenges. What hasn’t been talked about a lot is why voice command and control makes so much sense in home electronics and why it will dominate the living room. It’s all about the content.

History of U.S. TV Content


For many growing up in the U.S., there were 4-5 stations on TV: ABC, NBC, CBS, PBS and an independent UHF channel. If you ever wanted to know what was on, you just looked in the daily newspaper that was dropped off every morning on the front porch. Then around the early ’80s cable started rolling out and TV moved to around 10-20 channels and included ESPN, MTV, CNN, and HBO. The next step was an explosion in channels brought by analog cable, digital cable and satellite. My cable company, Time Warner, offers 512 different channels. Add that to the unlimited number of over-the-top “channels” or titles available on services like Netflix and Boxee, and you can easily see the challenge.

The Consumer Problem

With an unlimited number of things to watch, record, and interact with, finding what you want to watch becomes a huge issue. Paper guides are worthless and integrated TV guides from the cable or satellite boxes are slow and cumbersome. Given the flat and long-tail characteristic of choices, multi-variate and unstructured “search” is the answer to find the right content. That is, directories aren’t the answer. The question then becomes, what’s the best way to search?

The Right Kind of Search

If search is the answer, what kind of search? The answer lies in how people would want to find something. Consumers have many ways they look for things.

Some like to do surgical searching where they know exactly what they want. They ask for “The Matrix Revolutions.” Others have a concept or idea of what they are looking for but not the exact title: “find the car movie with Will Ferrell and John C. Reilly” and back come a few movies like Step Brothers and Talladega Nights. Others may search by an unlimited number of “mental genres”, or those which are created by the user. They may ask for “all Emmy Award-winning movies between 2005 and 2010”. You get the point: the consumer is best served with answers to natural language search, and then the call to action is to get that person to the content immediately.
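To make the idea of multi-variate search a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Title` record, the three-entry hand-made `CATALOG`, the tags, and the `search` function are hypothetical, and a real service would query a much richer entertainment database. It only shows how several loose criteria (cast, theme, year range) can be combined and ranked instead of walking a directory.

```python
from dataclasses import dataclass, field


@dataclass
class Title:
    name: str
    year: int
    cast: set = field(default_factory=set)
    tags: set = field(default_factory=set)


# Tiny hand-made catalog, for illustration only.
CATALOG = [
    Title("The Matrix Revolutions", 2003, {"Keanu Reeves"}, {"sci-fi"}),
    Title("Talladega Nights", 2006, {"Will Ferrell", "John C. Reilly"}, {"comedy", "car"}),
    Title("Step Brothers", 2008, {"Will Ferrell", "John C. Reilly"}, {"comedy"}),
]


def search(catalog, cast=(), tags=(), years=None):
    """Rank titles by how many of the requested criteria they satisfy."""
    results = []
    for title in catalog:
        # A year range, when given, is a hard filter.
        if years and not (years[0] <= title.year <= years[1]):
            continue
        # Cast members and tags are soft criteria that add to the score.
        score = sum(c in title.cast for c in cast) + sum(t in title.tags for t in tags)
        if score > 0 or (not cast and not tags):
            results.append((score, title.name))
    # Stable sort: best score first, catalog order breaks ties.
    results.sort(key=lambda pair: -pair[0])
    return [name for _, name in results]


# "find the car movie with Will Ferrell and John C. Reilly"
print(search(CATALOG, cast=["Will Ferrell", "John C. Reilly"], tags=["car"]))
```

In this sketch the vague "car movie" request still brings back both Will Ferrell and John C. Reilly titles, with the one tagged "car" ranked first, which is the behavior described above.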

Natural Language Voice Search and Control

The answer to the content search challenge is natural language voice search and control. That’s a mouthful, but basically, tell the TV what you want to watch and it guides you there from thousands of entry points. Two popular implementations exist today for voice search. There are others, like Dragon NaturallySpeaking, but those are niche commercial plays.

Microsoft Kinect

Microsoft has done more to enhance the living room than any other company, including Apple, Roku, Boxee and Sony. Microsoft is a leader in IPTV and the innovation leader in entertainment game consoles. With Kinect, a user can use Bing to search and find content. It works well in specific circumstances and at certain points in the experience, but it needs a lot of improvement. Bing needs to find content anywhere in the menu structure, not just at the top level. It also needs to improve upon its ability to work well in a living room full of viewers. Its beamforming is awesome but needs to get better to the point that it serves as a virtual remote.

Finally, it needs to support natural language search and the ability to narrow down the choices. I have full confidence that they will add these features, but a big question is the hardware. The hardware is seven years old. Software gymnastics and offloading some processing to the Kinect module have been brilliant, but at some point, hardware runs out of gas.

Apple Siri

While certainly not the first to bring voice command and dictation to phones, Apple was the first to bring natural language to the phone. The problem with the current Siri is that it’s not connected to an entertainment database, its logic isn’t there to narrow down choices, and it isn’t connected to a TV so that once you find what you are looking for you can immediately switch the TV.

As I wrote in September (before the iPhone 4S and Siri), Apple “could master controlling the TV’s content via voice primarily.” If Apple were to build a TV, they could hypothetically leverage iPhones, iPads, and iPods to improve the voice results. While Kinect has a full microphone array and operates best at 6-8 feet, an iPhone microphone could be 6 inches away and would certainly help with the “who owns the remote” problem and with voice recognition. Even better would be if multiple iOS devices could leverage each other’s sensors. That would be powerful.

While I am skeptical of driving voice control and cognition from the cloud, Apple, if they built a TV, could do more local processing and increase the speed of results. Anyone who has ever used Siri extensively knows what I am talking about here. The first few times Siri for TV fails to bring back results or says “system unavailable”, it gets shelved and never gets used again by many in the household. Part of the entertainment database needs to be local until the cloud can be 99% accurate.
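A local-first lookup with a cloud fallback could look roughly like the Python sketch below. This is only an illustration under stated assumptions: `find_title`, `local_index`, and `cloud_lookup` are hypothetical names, the local index is a plain dictionary, and the cloud service is modeled as any callable that raises `ConnectionError` when it is unreachable. It says nothing about how Siri or any real product is built.

```python
def find_title(query, local_index, cloud_lookup):
    """Answer from the on-device index when possible, else try the cloud.

    local_index:  dict mapping a normalized query to a result (hypothetical).
    cloud_lookup: callable that returns a result or raises ConnectionError
                  (hypothetical stand-in for a remote service).
    Returns a (result, source) pair.
    """
    key = query.strip().lower()
    # Fast path: no network round trip at all.
    if key in local_index:
        return local_index[key], "local"
    try:
        return cloud_lookup(key), "cloud"
    except ConnectionError:
        # The caller can show a helpful fallback instead of a bare failure.
        return None, "unavailable"
```

The design point is simply that popular titles held locally keep working, and keep answering quickly, even when the network or the service does not.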

What about Sony, Samsung, LG, and Toshiba?

I believe that all major CE manufacturers are working on advanced HCI techniques to control CE devices with voice and air gestures. The big question is, do they have the IP and time to “perfect” the interface before Apple and Microsoft dominate the space? There are two parts to natural language control: the “what did they say”, and the “what did they mean”. Apple licenses the first part from Nuance but the back end is Siri. Competitors could license the Nuance front end, but would need to buy or build the “what did they mean” part.
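The split between the two parts can be sketched in a few lines of Python. This is a toy under loud assumptions: `recognize` is a hypothetical stand-in for a licensed speech engine and simply normalizes text that has already been transcribed, while `interpret` and its two regular-expression `PATTERNS` are made up for illustration and are far simpler than any real "what did they mean" back end.

```python
import re


def recognize(audio):
    """Stage 1, "what did they say".

    Stand-in for a licensed speech engine: here the "audio" is already
    text, so it is only lowercased and whitespace-normalized.
    """
    return " ".join(audio.lower().split())


# Hypothetical phrase patterns; a real system would need far more than this.
PATTERNS = [
    (re.compile(r"^(?:watch|play) (?P<title>.+)$"), "play"),
    (re.compile(r"^find (?:movies|shows) with (?P<actor>.+)$"), "search"),
]


def interpret(transcript):
    """Stage 2, "what did they mean": map a transcript to a structured intent."""
    for pattern, action in PATTERNS:
        match = pattern.match(transcript)
        if match:
            return {"action": action, **match.groupdict()}
    return {"action": "unknown", "text": transcript}


print(interpret(recognize("Watch  The Matrix Revolutions")))
```

Even in this toy, the first stage could be swapped for a licensed engine without touching the second, which is why the second stage is the part a competitor would have to buy or build.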

Now that HDTV sales are slowing down, it is even harder to differentiate between HDTVs. Consumers haven’t been willing to spend more for 3D but have been willing to spend more for LED and Smart TV. Once every HDTV is LED, 3D and “smart”, the key differentiator could become voice and air gestures. If Sony, Samsung, LG and Toshiba aren’t prepared, their world could change dramatically and Microsoft and Apple could have the edge.

Metro Could Drive Voice and Air Gesture UI

Last week, I attended Microsoft’s BUILD conference in Anaheim, where, among other things, Windows 8 details were rolled out to the Microsoft ecosystem. One of the most talked-about items was the Metro User Interface (UI), the end-user face for the future of Windows. Over the last few days, I have been thinking about the implications of Metro on user interfaces beyond the obvious physical touch and gestures. I believe Metro UI has as much to do with voice control and air gestures as it does with physical touch.


Voice Control

Voice command and control has been a part of Windows for many generations. So why do I think Metro has anything to do with enabling widespread voice use in the future, and why do I think people would actually use this version? It’s actually quite simple. First, only a few voice command and control implementations and usage scenarios have been successful, and they all adopt a similar methodology and all come from the same company. Microsoft Auto voice solutions have found their way into Ford and Lincoln automobiles, branded SYNC, and drivers are actually using them. Fiat uses MS Auto technology as well. Microsoft Kinect offers a very accurate implementation for the living room using some amazing audio beamforming algorithms and a four-microphone hardware array.


None of these implementations would be successful without establishing an in-context and limited dictionary. Let’s use Kinect as an example. Kinect allows you to “say what you see” on the TV screen, limiting the dictionary of words it has to recognize. That is key. Pattern matching is a lot easier when you are matching hundreds of objects versus 100K. Windows 8 Metro UI limits what users see on the screen, compared with previous versions of Windows, making that voice pattern matching all the easier. One final, interesting clue comes with the developer tablets distributed at BUILD. The tablets had dual microphones, which greatly assist with audio beamforming.
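The “say what you see” idea can be illustrated with a short Python sketch. It assumes the recognizer hands back a possibly misheard word as text and that the screen exposes its visible labels as a list; `match_on_screen` is a hypothetical helper, and the fuzzy matching uses the standard library’s `difflib.get_close_matches` rather than anything Kinect actually does.

```python
import difflib


def match_on_screen(heard, visible_labels, cutoff=0.6):
    """Match a (possibly misheard) word against only the labels on screen.

    Returns the best-matching label, or None if nothing is close enough.
    """
    # Compare case-insensitively but return the label as displayed.
    lowered = {label.lower(): label for label in visible_labels}
    hits = difflib.get_close_matches(heard.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


# Only four labels are visible, so a sloppy recognition still resolves.
tiles = ["Games", "Music", "Video", "Settings"]
print(match_on_screen("vidio", tiles))  # -> Video
```

With a handful of candidates, a slightly wrong transcription still has one obvious nearest neighbor. Against a vocabulary of 100K words, the same misheard input would have many plausible neighbors and the match would be far less reliable.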

Air Gestures

Air gestures are essentially what Kinect users do with their hands and arms instead of using the Xbox controller. When players want to click on a “tile” in the Xbox environment, they place their hand in the air, hover over the tile for a few seconds, and it is selected. Kinect uses a camera array and an IR sensor to detect what your “limbs” are doing and associates it with a tile location on the screen. Note that no more than 8 tiles are shown on the screen at one time, increasing user accuracy.
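The hover-to-select behavior described above can be sketched as a simple dwell timer. This Python example is an assumption-heavy illustration, not Kinect’s algorithm: the hand position is taken to arrive as `(time, x, y)` samples already mapped to screen coordinates, tiles are plain rectangles, and the 2-second `dwell_seconds` default is a made-up value standing in for “a few seconds”.

```python
def tile_at(x, y, tiles):
    """Return the name of the tile containing point (x, y), or None."""
    for name, (left, top, right, bottom) in tiles.items():
        if left <= x < right and top <= y < bottom:
            return name
    return None


def dwell_select(samples, tiles, dwell_seconds=2.0):
    """Return the first tile the hand hovers over continuously for dwell_seconds.

    samples: iterable of (time_in_seconds, x, y) hand positions in time order.
    tiles:   dict mapping a tile name to (left, top, right, bottom).
    """
    current, since = None, None
    for t, x, y in samples:
        tile = tile_at(x, y, tiles)
        if tile != current:
            # The hand moved to a different tile (or off all tiles): restart the timer.
            current, since = tile, t
        elif tile is not None and t - since >= dwell_seconds:
            return tile
    return None


tiles = {"Games": (0, 0, 100, 100), "Video": (100, 0, 200, 100)}
samples = [(0.0, 50, 50), (1.0, 150, 50), (2.0, 150, 50), (3.0, 160, 40)]
print(dwell_select(samples, tiles))  # -> Video
```

Large tiles help here for the same reason a small dictionary helps voice: with only a few big targets, small tracking errors rarely move the hand into a different tile and reset the timer.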


Hypothetically, air gestures on Metro could take a few forms, and they could be guided by form factor. In “stand-up” environments with large displays, they would take an approach similar to Kinect’s. In the future, displays will be everywhere in the house and air gestures would be used when touching the display just isn’t convenient or desired. I would like this functionality today in my kitchen as I am cooking. I have food all over my hands and I want to turn the cookbook page or even start up Pandora. I can’t touch the display, so I’d much rather do a very accurate air gesture.

In desk environments, I’d like to ditch the trackpad and mouse and just use my physical hand as the gesture device. It’s a virtual trackpad or gesture mouse. I use all the standard Metro gestures on a flat surface, and a camera tracks exactly what my hand is doing and translates that into a physical touch gesture.


Microsoft introduced Metro as the next-generation user interface primarily for physical touch gestures and secondarily for keyboard and mouse. Metro changes the interface from a navigation-centric environment with hundreds of elements on the screen to content-first with a very clean interface. Large tiles replace multitudes of icons and applets, and the number of words, or dictionary, is drastically reduced. Sure, this is great for physical touch, but it also significantly improves the ability to support voice control and even air gestures. Microsoft is a leader in voice and air gesture with MS Auto and Kinect, and certainly could enable this in Windows 8 for the right user environments.

Follow @PatrickMoorhead on Twitter and on Google+.