Since Apple launched the Siri app on the iPhone 4S last fall, there has been a widespread assumption that Siri’s voice-driven semantic search might soon find its way to other Apple products. At the top of everyone’s list was the still notional Apple television, bolstered by the belief that Steve Jobs’s deathbed claim to have “cracked” TV was based on the development of a voice interface.
Don’t get too excited. I think Siri will continue to improve on the iPhone and might well migrate to the iPad, but it’s not likely to go anywhere beyond these handheld devices for some time to come. Both the technology and the psychology have to be right, and both are far from ready.
Siri on the iPhone is a big step forward, but it is very far from perfect. Mostly it understands me; sometimes it doesn’t. Sometimes it has a useful answer to a question; sometimes it doesn’t. It’s a lot better than any previous voice/natural language effort, but I still rely on the keyboard or other touch interface elements most of the time. Actually, the iPhone makes a natural Siri development platform for Apple because even iPhone users are inured to mobile phones that fall well short of perfection. Calls drop, voice quality is often awful, and messages arrive hours after they were sent. So we’re prepared to put up with a personal assistant who doesn’t always understand us. Apple, with its sharp focus on user experience, will be reluctant to push Siri into territory where customers may be disappointed by the performance.
Our expectations for television and cars, the logical targets for voice control, are much higher than for mobile phones. At the same time, making voice control work there is much harder for engineering reasons. Cars are actually the easier challenge. Apple has avoided the automotive market, but others are in the game and Microsoft is the clear leader, especially with its partnership with Ford.
Natural language understanding is a big computer science challenge for voice systems, but there are also considerable audio engineering issues to solve. Speech recognition requires a high-quality audio signal, and the more free-form the speech, the better the audio has to be. An airline reservation system can understand me over a poor cellphone connection (most of the time) largely because the vocabulary and syntax of airline reservations are very constrained. But a Siri-like system is supposed to understand anything.
Siri on the iPhone works as well as it does because the phone starts with a decent microphone system that is close to the speaker and filters out extraneous noise. Cars are a pretty good environment as well. Voice systems usually are activated by pressing a button on the steering wheel that can also mute the audio system. There are lots of good places to put microphone arrays close to the driver. And while the sounds of driving create a lot of ambient noise, it is of the predictable sort that noise-cancellation systems handle well. I expect to see car systems get a lot better, but I don’t see Apple becoming a player. Apple likes to be top dog, and that would not be the case in a relationship with auto makers, who are quite insistent that car buyers are their customers, not those of third-party vendors. (Microsoft may do the software and Nuance the speech recognition, but Sync is a Ford product through and through.)
The living room is far tougher, but here too Microsoft may well have the edge, this time because of Kinect sensor technology. Pure voice control of a television is extremely difficult. Unlike in a car, you don’t know where the speaker is going to be, so you need a sophisticated microphone array that can find and focus on the speaker, who might be 10 feet away. Such systems exist, but they are mostly still in the lab and, at least initially, are likely to be quite expensive.
You also need the equivalent of a push-to-talk button, or the voice recognition system is going to be saddled with the near-impossible task of hearing anything over the sound of its own audio. Here’s where Kinect might come in very handy. Its ability to recognize gestures and to combine gestures with speech might yield a much better interface, much faster than voice alone. This, plus an enormous research investment in speech and natural language understanding, which admittedly has yet to yield much in the way of products, might give Microsoft a considerable edge in the battle of the living room.
Of course, the big TV challenge for Apple, Microsoft, or anyone else is striking the deals needed with content owners to permit a viewing experience that unifies internet video with cable and broadcast TV. Difficult as the technical issues are, this business challenge may prove tougher to crack.