This article is exclusively for subscribers to the Think.Tank.
Microsoft launched Kinect back in November 2010 in a move to change the man-to-machine interface between consumers and their living room content. While incredibly risky, the gamble paid off: Kinect became the fastest-selling consumer device ever. After analyzing the usage models and technology for a few months following the Kinect launch, I saw the potential and predicted that at least all DMAs would have the capability.
The Kinect launch sent shock waves through the industry because the titans of the living room like Sony, Samsung, and Toshiba hadn't come close to duplicating, let alone leading with, voice and air-gesture techniques. With Samsung and LG announcing future TVs with this capability at CES, Microsoft's living room interaction strategy has officially been affirmed at CES and, most importantly, by the CE industry.
Samsung launched what it calls "Smart Interaction," which lets users control and interact with their HDTVs. Smart Interaction lets the user control the TV with voice, with air gestures, and passively with their face. The voice and air gestures operate much as Microsoft's do, with pre-defined gestures for different interactions. For instance, users can select an item by grabbing it, the equivalent of clicking an icon with a remote. Facial recognition essentially "logs you in" to your profile as a PC would, loading your personal TV settings and bringing up the virtual remote.
A Step Further Than Microsoft?
Samsung has one-upped Microsoft on at least one public indicator: its application development model. Samsung has broadly opened its APIs via an SDK, which could pull in tens of thousands of developers. If this gains traction, we could see platforms fighting over the number of apps the same way Apple initially trumped everyone in smartphones. The initial iPhone lure was its design, but also the hundreds of thousands of apps that were developed. That made Google Android look very weak until it caught up, still makes BlackBerry and Windows Phone appear weaker, and arguably dealt the death blow to HP's webOS. I believe Microsoft is gearing up for a major "opening" of the Kinect ecosystem in the Windows 8 timeframe, where Windows 8 Metro apps can run inside the Kinect environment.
Challenges for Samsung and LG
Advanced HCI like voice and air-gesture control is a monumental undertaking and risk. Changing anything that stands between a CE user and the content is risky: if it isn't perfect, and I mean perfect, users will stop using it. Look at version 1 of Apple's Siri. Everyone who bought the phone tried it, and most stopped using it because it wasn't reliable or consistent. Microsoft Kinect has many contingencies for working well, including standing in a specific "zone" to get air gestures to register correctly. Voice control works only in certain modes, not across all interactions.
The fallback Apple has is that users don't have to use Siri; it's an option, and it can be very personal in that most people use Siri when others aren't looking or listening. The Kinect fallback is a painful one: you wasted that cool-looking $149 peripheral. Similarly, Samsung "Smart Interaction" users can fall back to the remote, and most will initially, until the interface is perfected.
There are meaningful differences in the consumer audiences of Siri, Kinect, and Samsung "Smart Interaction". I argue that Siri and Kinect users are "pathfinders" and "explorers" in that they enjoy the challenge of trying new things. The traditional HDTV buyer doesn't want any pathfinding or exploring; they want to watch content, and if they're feeling adventurous, they'll go out on a limb and check sports scores. This means Samsung's customers won't tolerate anything that doesn't just work, and they won't give credit for a "good try" or a Siri-style beta product.
One often-overlooked challenge in this space is content, or rather the amount of content you can actually control with voice and air gestures. Over-the-top services like Netflix and Hulu are fine if the app is resident in the TV, but what about the cable or satellite box most living rooms have? What if you want to PVR something or play specific content saved on it? This is solvable if the TV has a perfect channel guide for the STB and service provider, plus IR-blasting capabilities to talk to it. That approach didn't work out too well for Google TV V1, its end users, or its partners.
This is the Future, Embrace It
The CE industry won't get this right initially with a broad base of consumers, but that won't kill the interaction model. Hardware and software developers will keep improving it until it truly becomes natural, consistent, and reliable. At some point in the very near future, most consumers will be able to control their HDTVs with voice and air gestures. Many won't want to, particularly the tech-phobic and late adopters.
In terms of industry investment, the positive part is that other devices like phones, tablets, PCs, and even washing machines leverage the same interactions and technologies, so there is a lot of shared investment and shared risk. The biggest question is: will any company other than Microsoft lead the future of the living room? Your move, Apple.
Speculating on what Apple will do next seems to have become routine in high-tech journalism and social media. The latest and greatest rumor is that Apple will develop an HDTV set. I wrote back in September that Apple should build a TV, given the lousy current experience and Apple's ability to fix big user challenges. What hasn't been discussed much is why voice command and control makes so much sense in home electronics and why it will dominate the living room. It's all about the content.
History of U.S. TV Content
For many growing up in the U.S., there were four or five stations on TV: ABC, NBC, CBS, PBS, and an independent UHF channel. If you wanted to know what was on, you looked in the daily newspaper dropped on the front porch every morning. Then, around the early '80s, cable started rolling out, TV moved to around 10-20 channels, and ESPN, MTV, CNN, and HBO arrived. The next step was an explosion in channels brought by analog cable, digital cable, and satellite. My cable company, Time Warner, offers 512 different channels. Add the unlimited number of over-the-top "channels" or titles available on Netflix and Boxee, and you can easily see the challenge.
The Consumer Problem
With an unlimited number of things to watch, record, and interact with, finding what you want becomes a huge issue. Paper guides are worthless, and the integrated guides from cable or satellite boxes are slow and cumbersome. Given the flat, long-tail character of the choices, multivariate and unstructured "search" is the answer to finding the right content; directories are not. The question then becomes: what's the best way to search?
The Right Kind of Search
If search is the answer, what kind of search? The answer lies in how people want to find things, and consumers look for things in many ways.
Some do surgical searching, where they know exactly what they want. They ask for "The Matrix Revolutions." Others have a concept or idea of what they are looking for but not the specifics: "find the car movie with Will Ferrell and John C. Reilly," and back come a few movies like Step Brothers and Talladega Nights. Still others search by an unlimited number of "mental genres" created by the user. They may ask for "all Emmy Award-winning movies between 2005 and 2010." You get the point; the consumer is best served by answers to natural language search, with a call to action that gets that person to the content immediately.
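These three query styles all resolve to filters over the same flat content metadata. Here is a minimal sketch of that idea in Python; the catalog entries, field names, and `search` function are my own illustration (a real system would first parse the spoken query into these filters):

```python
# Hypothetical mini-catalog; a real entertainment database has millions of rows.
CATALOG = [
    {"title": "Step Brothers", "cast": {"Will Ferrell", "John C. Reilly"}, "year": 2008},
    {"title": "Talladega Nights", "cast": {"Will Ferrell", "John C. Reilly"}, "year": 2006},
    {"title": "The Matrix Revolutions", "cast": {"Keanu Reeves"}, "year": 2003},
]

def search(title=None, actors=(), year_range=None):
    """Return titles matching every supplied filter (facet)."""
    results = []
    for item in CATALOG:
        if title and item["title"].lower() != title.lower():
            continue  # surgical search: exact title
        if not set(actors) <= item["cast"]:
            continue  # conceptual search: all named actors must appear
        if year_range and not (year_range[0] <= item["year"] <= year_range[1]):
            continue  # "mental genre" search: user-defined constraints
        results.append(item["title"])
    return results

# Surgical: "The Matrix Revolutions"
print(search(title="The Matrix Revolutions"))
# Conceptual: "the movies with Will Ferrell and John C. Reilly"
print(search(actors=["Will Ferrell", "John C. Reilly"]))
```

The point of the sketch is that directories force one fixed hierarchy, while search composes any combination of facets the viewer happens to have in mind.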
Natural Language Voice Search and Control
The answer to the content search challenge is natural language voice search and control. That's a mouthful, but basically: tell the TV what you want to watch, and it guides you there from thousands of entry points. Two popular implementations of voice search exist today. There are others, like Dragon NaturallySpeaking, but those are niche commercial plays.
Microsoft has done more to enhance the living room than any other company, including Apple, Roku, Boxee, and Sony. Microsoft is a leader in IPTV and the innovation leader in entertainment game consoles. With Kinect, a user can use Bing to search for and find content. It works well in specific circumstances and at certain points in the experience, but it needs a lot of improvement. Bing needs to find content anywhere in the menu structure, not just at the top level. It also needs to work better in a living room full of viewers. Its beamforming is awesome but needs to improve to the point that it serves as a virtual remote.
Finally, it needs to support natural language search and the ability to narrow down choices. I have full confidence that Microsoft will add these features, but a big question is the hardware, which is seven years old. Software gymnastics and offloading some processing to the Kinect module have been brilliant, but at some point hardware runs out of gas.
While certainly not the first to bring voice command and dictation to phones, Apple was the first to bring natural language to the phone. The problem with the current Siri is that it's not connected to an entertainment database, its logic can't narrow down choices, and it isn't connected to a TV, so once you find what you're looking for you can't immediately switch the TV to it.
As I wrote in September (before the iPhone 4S and Siri), Apple "could master controlling the TV's content via voice primarily." If Apple were to build a TV, it could hypothetically leverage iPhones, iPads, and iPods to improve voice results. While Kinect has a full microphone array and operates best at 6-8 feet, an iPhone microphone could be six inches away, which would certainly help with the "who owns the remote" problem and with voice recognition. Even better would be multiple iOS devices leveraging each other's sensors. That would be powerful.
While I am skeptical of driving voice control and cognition from the cloud, Apple, if it built a TV, could do more local processing and increase the speed of results. Anyone who has used Siri extensively knows what I am talking about. The first few times Siri for TV fails to bring back results or says "system unavailable," it gets shelved and never used again by many in the household. Part of the entertainment database needs to be local until the cloud can be 99% accurate.
What about Sony, Samsung, LG, and Toshiba?
I believe all major CE manufacturers are working on advanced HCI techniques to control CE devices with voice and air gestures. The big question is: do they have the IP and the time to "perfect" the interface before Apple and Microsoft dominate the space? There are two parts to natural language control: the "what did they say" and the "what did they mean." Apple licenses the first part from Nuance, but the back end is Siri. Competitors could license the Nuance front end but would need to buy or build the "what did they mean" part.
Now that HDTV sales are slowing, it is even harder to differentiate between HDTVs. Consumers haven't been willing to spend more for 3D but have been willing to spend more for LED and Smart TV. Once every HDTV is LED, 3D, and "smart," the key differentiator could become voice and air gestures. If Sony, Samsung, LG, and Toshiba aren't prepared, their world could change dramatically, and Microsoft and Apple could have the edge.
Last week, I attended Microsoft’s BUILD conference in Anaheim, where, among other things, Windows 8 details were rolled out to the Microsoft ecosystem. One of the most talked-about items was the Metro User Interface (UI), the end user face for the future of Windows. The last few days, I have been thinking about the implications of Metro on user interfaces beyond the obvious physical touch and gestures. I believe Metro UI has as much to do with voice control and air gestures as it does with physical touch.
Voice command and control has been a part of Windows for many generations. So why do I think Metro has anything to do with enabling widespread voice use in the future, and why do I think people will actually use this version? It's quite simple. First, only a few voice command and control implementations and usage scenarios have been successful; they all adopt a similar methodology, and they all come from the same company. Microsoft Auto voice solutions have found their way into Ford and Lincoln automobiles, branded SYNC, and drivers actually use them. Fiat uses MS Auto technology as well. Microsoft Kinect delivers a very accurate implementation for the living room using some amazing audio beamforming algorithms and a four-microphone hardware array.
None of these implementations would be successful without an in-context, limited dictionary. Take Kinect as an example. Kinect lets you "say what you see" on the TV screen, limiting the dictionary of words it must recognize. That is key: pattern matching is a lot easier when you are matching hundreds of objects rather than 100,000. Windows 8 Metro UI limits what users see on the screen compared with previous versions of Windows, making voice pattern matching all the easier. One final, interesting clue comes from the developer tablets distributed at BUILD: they had dual microphones, which greatly assists audio beamforming.
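The "say what you see" idea can be sketched in a few lines of Python. This is assumed logic, not Kinect's actual code: the recognizer's transcript only has to be matched against the handful of labels currently on screen, so even a fuzzy match over a tiny dictionary is reliable.

```python
import difflib

def match_command(transcript, on_screen_labels, cutoff=0.6):
    """Return the on-screen label closest to the transcript, or None."""
    # With only a handful of candidates, simple fuzzy matching is enough;
    # an open 100K-word vocabulary would make this approach hopeless.
    lowered = [label.lower() for label in on_screen_labels]
    matches = difflib.get_close_matches(transcript.lower(), lowered,
                                        n=1, cutoff=cutoff)
    if not matches:
        return None  # nothing on screen sounds like what was said
    # Map the match back to the original-cased label.
    return on_screen_labels[lowered.index(matches[0])]

# Hypothetical labels visible on a Metro-style screen.
labels = ["Play", "Pause", "Netflix", "Video Library", "Settings"]
print(match_command("netflicks", labels))  # tolerates a slightly garbled transcript
```

The design point is that shrinking the visible vocabulary, exactly what Metro's clean tile layout does, is what makes this kind of matching robust in a noisy living room.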
Air gestures are essentially what Kinect users do with their hands and arms instead of using the Xbox controller. When players want to click a "tile" in the Xbox environment, they place a hand in the air, hover over the tile for a few seconds, and the system selects it. Kinect uses a camera array and an IR sensor to detect what your limbs are doing and associates them with a tile location on the screen. Note that no more than eight tiles are shown on screen at one time, increasing user accuracy.
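The hover-to-select mechanic above amounts to dwell-time selection over a coarse tile grid. Here is a simplified sketch; the grid dimensions, frame rate, and dwell threshold are illustrative assumptions, not Kinect's actual parameters:

```python
DWELL_FRAMES = 45                        # assumed: ~1.5 s of hovering at 30 fps
GRID_COLS, TILE_W, TILE_H = 4, 320, 270  # assumed: 4x2 grid, i.e. 8 tiles max

def tile_at(x, y):
    """Map a screen-space hand position to a tile index (row-major)."""
    return (y // TILE_H) * GRID_COLS + (x // TILE_W)

def run_dwell(hand_positions):
    """Feed per-frame hand positions; return the selected tile, or None."""
    current, held = None, 0
    for x, y in hand_positions:
        tile = tile_at(x, y)
        if tile == current:
            held += 1
            if held >= DWELL_FRAMES:
                return tile              # hovered long enough: select the tile
        else:
            current, held = tile, 1      # hand moved to a new tile: restart timer
    return None

# Hand hovers over one spot for 60 frames (~2 s): the selection fires.
print(run_dwell([(400, 300)] * 60))
```

Coarse tiles are what make this workable: the bigger the target, the more jitter in the tracked hand position the dwell timer can absorb, which is why capping the screen at eight tiles helps accuracy so much.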
Hypothetically, air gestures on Metro could take a few forms, guided by form factor. In "stand-up" environments with large displays, they would take an approach similar to Kinect's. In the future, displays will be everywhere in the house, and air gestures would be used when touching the display just isn't convenient or desired. I would like this functionality today in my kitchen as I cook: I have food all over my hands and want to turn the cookbook page or start up Pandora. I can't touch the display, so I'd much rather make a very accurate air gesture.
In desk environments, I'd like to ditch the trackpad and mouse and just use my hand as the gesture device: a virtual trackpad or gesture mouse. I make all the standard Metro gestures on a flat surface, a camera tracks exactly what my hand is doing, and it translates that into a physical touch gesture.
Microsoft introduced Metro as the next-generation user interface, primarily for physical touch gestures and secondarily for keyboard and mouse. Metro changes the interface from a navigation-centric environment with hundreds of elements on the screen to a content-first one with a very clean interface. Large tiles replace multitudes of icons and applets, and the number of words, or the dictionary, is drastically reduced. This is great for physical touch, but it also significantly improves the potential for voice control and even air gestures. Microsoft is a leader in voice and air gestures with MS Auto and Kinect, and it certainly could enable this in Windows 8 for the right user environments.