Since Apple launched the Siri app on the iPhone 4S last fall, there has been a widespread assumption that Siri’s voice-driven semantic search might soon find its way to other Apple products. At the top of everyone’s list was the still notional Apple television, bolstered by the belief that Steve Jobs’s deathbed claim to have “cracked” TV was based on the development of a voice interface.
Don’t get too excited. I think Siri will continue to improve on the iPhone and might well migrate to the iPad, but it’s not likely to go anywhere beyond these handheld devices for some time to come. Both the technology and the psychology have to be right, and both are far from ready.
Siri on the iPhone is a big step forward, but it is very far from perfect. Mostly it understands me; sometimes it doesn’t. Sometimes it has a useful answer to a question; sometimes it doesn’t. It’s a lot better than any previous voice/natural language effort, but I still rely on the keyboard or other touch interface elements most of the time. Actually, the iPhone makes a natural Siri development platform for Apple because even iPhone users are inured to mobile phones that fall well short of perfection. For example, calls drop, voice quality is often awful, and messages arrive hours after they were sent. So we’re prepared to put up with a personal assistant who doesn’t always understand us. Apple, with its sharp focus on user experience, will be reluctant to push Siri into territory where customers may be disappointed by the performance.
Our expectations for television and cars, the logical targets for voice control, are much higher than for mobile phones. At the same time, making voice control work there is much harder for engineering reasons. Cars are actually the easier challenge. Apple has avoided the automotive market, but others are in the game and Microsoft is the clear leader, especially through its partnership with Ford.
Natural language understanding is a big computer science challenge for voice systems, but there are also considerable audio engineering issues to solve. Speech recognition requires a high-quality audio signal, and the more free-form the speech, the better the audio has to be. An airline reservation system can understand me over a poor cellphone connection (most of the time) largely because the vocabulary and syntax of airline reservations are very constrained. But a Siri-like system is supposed to understand anything.
Siri on the iPhone works as well as it does because the phone starts with a decent microphone system that is close to the speaker and filters out extraneous noise. Cars are a pretty good environment as well. Voice systems usually are activated by pressing a button on the steering wheel that can also mute the audio system. There are lots of good places to put microphone arrays close to the driver. And while the sounds of driving create a lot of ambient noise, it is of the predictable sort that noise-cancellation systems handle well. I expect to see car systems get a lot better, but I don’t see Apple becoming a player. Apple likes to be top dog, and that would not be the case in a relationship with auto makers, who are quite insistent that car buyers are their customers, not those of third-party vendors. (Microsoft may do the software and Nuance the speech recognition, but Sync is a Ford product through and through.)
The living room is far tougher, but here too Microsoft may well have the edge, this time because of Kinect sensor technology. Pure voice control of a television is extremely difficult. Unlike in a car, you don’t know where the speaker is going to be, so you need a sophisticated microphone array that can find and focus on the speaker, who might be 10 feet away. Such systems exist, but they are mostly still in the lab and, at least initially, are likely to be quite expensive.
You also need the equivalent of a push-to-talk button, or the voice recognition system is going to be saddled with the near-impossible task of hearing anything over the sound of its own audio. Here’s where Kinect might come in very handy. Its ability to recognize gestures and to combine gestures with speech might yield a much better interface, much faster than voice alone. This, plus an enormous research investment in speech and natural language understanding, which admittedly has yet to yield much in the way of products, might give Microsoft a considerable edge in the battle of the living room.
Of course, the big TV challenge for Apple, Microsoft, or anyone else is striking the deals needed with content owners to permit a viewing experience that unifies internet video with cable and broadcast TV. Difficult as the technical issues are, this business challenge may prove tougher to crack.
9 thoughts on “Why Siri Won’t Go Beyond the iPhone–For Now”
The biggest reason Siri won’t go beyond the iPhone is that it must first support many more languages and then be attached to localized services and specialized search engines in different countries. That will be a massive undertaking.
When that is finished we’ll likely see it spread first to the Mac where a persistent internet connection is far more likely and then to Notebook and iPad form factors as they add more direct control of the computer function or bake it into the processor.
Localization is indeed a huge problem since, as shockme points out, it is not a simple matter of substituting one language model for another.
I doubt that we’ll see Siri on the Mac, though. Except for niche handsfree markets, there has never been much market for voice control, or even dictation, on laptops and desktops. It turns out to be really hard to beat the efficiency of the keyboard+mouse interface.
I think you are spot on with the issues facing voice recognition. We need to focus on improving audio signal quality and developing systems that are complemented by gesture recognition in order to be most effective. When looking at in-car voice control, it’s important to note that we naturally have higher expectations for automotive technology because of the safety issues involved. So, in order for it to become mainstream, it will have to be flawless, because we need to ensure it will limit driver distraction rather than add to it while creating a positive user experience. To take voice recognition to the next level and meet safety expectations, I believe a hybrid approach is required, one that uses both the cloud and on-board processing. This hybrid approach should address simple command and control on-board while handling the more complex tasks (like getting the latest data on a restaurant rating) in the cloud.
Just a point, but the TV should be able to filter out its own sound. That still leaves other sound sources, but it deals with the worst one.
The problem is that the TV audio becomes part of the room’s ambient noise, which is going to get fed back into the mic. You need to find a way to reduce that sound without losing too much of the desired signal. You could mute the sound when the system senses someone speaking to it (that’s how car systems work), but TV viewers are going to be annoyed if the audio cuts out whenever someone just wants to check scheduling. It’s not an impossible problem, but it is a hard one.
Maybe you’re thinking too small, maybe you should think smaller. What if a nano is bluetooth/wifi connected to the device. What if that nano has a microphone? And, what if that nano was your watch?
Now that would be pretty sweet, wouldn’t it?
Now think bigger: what if the TV remote is the mic for Siri? Or the phone is Bluetooth/Wi-Fi connected to the TV? Now the phone is a remote with Siri, connecting to your TV, which is also using Siri. There are a number of easy, creative workarounds that are all revolutionary. I wouldn’t say it won’t just yet.
The fundamental idea is to get rid of the remote.
Is that what Steve Jobs said he was fundamentally focused on? I thought Siri was supposed to be an enhancement, not a replacement.