Though the iPhone 4S appears nearly identical to the current iPhone 4, it is, as my colleague Tim Bajarin points out, a revolutionary device because of its voice-based Siri interface. For the past 20 years, we humans have learned to point and click, but this has never been a natural way to interact with our environment. Touch and speech, on the other hand, have been around since we were living in caves.

“Speech is no longer an add-on,” says Vladimir Sejnoha, chief technical officer of Nuance, probably the world’s leading speech technology company. “It is a fundamental building block when designing the next generation of user interfaces.”
Sejnoha is faithful to the code of omerta that Apple imposes on its vendors. Although Nuance has supplied technology both to Apple and to Siri before Apple acquired the startup in 2010, he declined to discuss Nuance’s role in the iPhone 4S: “We have a great relationship with Apple. We license technology to them for a number of products. I am not able to go into greater detail. But we are very excited by what they have done. It’s a huge validation of the maturity of the speech market.”
But Sejnoha made no effort to hide his enthusiasm for the Siri approach. “It allows you to find functionality or content that is not even visible,” he says. “It provides a new dimension to smartphone interfaces, which have been sophisticated but shrunken-down desktop metaphors.”
It has been a long, hard slog for speech to become a core user interface technology. It took a good thirty years, from the late 1960s to the late 1990s, for speech recognition—the ability to turn spoken words into text—to become practical. “Speech recognition is not completely solved,” says Sejnoha. “We have made great strides over the generations and the environment has changed in our favor. We now have connected systems that can send data through the cloud and update the speech models on devices.”
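To make that “connected systems” point concrete, here is a rough sketch, my own and not any vendor’s actual mechanism, of a device asking a server whether a newer speech model is available and pulling it down. The endpoint URL and file layout are hypothetical.

```python
# Rough sketch of a cloud-updated speech model, as Sejnoha describes.
# The manifest URL and file names are hypothetical, for illustration only.
import json
import urllib.request
from pathlib import Path

MODEL_DIR = Path("speech_models")
MANIFEST_URL = "https://example.com/speech/manifest.json"  # hypothetical endpoint

def update_speech_model() -> bool:
    """Download a newer speech model if the server advertises one."""
    with urllib.request.urlopen(MANIFEST_URL) as resp:
        manifest = json.load(resp)  # e.g. {"version": 42, "url": "https://.../model.bin"}

    version_file = MODEL_DIR / "version.txt"
    local_version = int(version_file.read_text()) if version_file.exists() else 0
    if manifest["version"] <= local_version:
        return False  # already up to date

    MODEL_DIR.mkdir(exist_ok=True)
    urllib.request.urlretrieve(manifest["url"], MODEL_DIR / "model.bin")
    version_file.write_text(str(manifest["version"]))
    return True
```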
Recognition alone is a necessary but hardly sufficient tool for building a speech interface. For years, speech input systems have let users do little—sometimes nothing—more than speak menu commands. This made speech very useful in situations where hands-free operation was desirable or necessary, but left speech as a poor second choice where point-and-click or touch controls were available.
The big change embodied by Siri is the marriage of speech recognition with advanced natural language processing. This artificial intelligence, which required advances in the underlying algorithms as well as leaps in processing power on both mobile devices and the servers that share the workload, allows software to understand not just words but the intentions behind them. “Set up an appointment with Scott Forstall for 3 pm next Wednesday” requires a program to integrate calendar, contact list, and email apps, create and send an invitation, and come back with an appropriate spoken response.
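To make the idea concrete, here is a toy sketch, mine rather than Apple’s or Nuance’s implementation, of the natural-language step: taking the recognized text and turning it into a structured intent that calendar and contacts code could act on. The MeetingIntent class and the single regex rule are purely illustrative.

```python
# Toy intent parser: once speech recognition has produced text, a natural-language
# layer must extract the user's intent and its parameters before any app can act.
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class MeetingIntent:
    person: str
    time: str
    day: str

def parse_meeting_request(utterance: str) -> Optional[MeetingIntent]:
    """Rule-based parser for a single intent type (illustration only)."""
    pattern = r"set up an appointment with (?P<person>.+?) for (?P<time>[\w: ]+?) next (?P<day>\w+)"
    match = re.search(pattern, utterance.lower())
    if not match:
        return None
    return MeetingIntent(
        person=match["person"].title(),
        time=match["time"].strip(),
        day=match["day"].title(),
    )

intent = parse_meeting_request(
    "Set up an appointment with Scott Forstall for 3 pm next Wednesday"
)
print(intent)  # MeetingIntent(person='Scott Forstall', time='3 pm', day='Wednesday')
```

A production assistant replaces the single regex with statistical language understanding, but the output, a structured intent that the phone’s calendar, contacts, and mail code can act on, is the same idea.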
Sejnoha sees Siri in the iPhone as just a beginning. “Lots of handset OEMs are working on it,” he says. “There is a deep need for differentiation in Android and Apple will only light a fire under that. Our model is to work closely with customers and build unique systems tailored to their visions.” And while a speech interface can drive search, it can also become an alternative to it: “One consequence of using natural language in the user interface is direct access to information. We can figure out what you are looking for and take you directly there. You don’t always have to go through a traditional search portal. It will change some business models.”
Nor do the opportunities stop at handsets. “Speech is a big theme for in-car apps because that is a hands-busy, eyes-busy environment,” Sejnoha says. “All the automotive OEMs are working on next-generation connected systems. The industry is undergoing revolutionary change.”
The health care market is another hot spot. “Natural language is taking center stage in health care,” Sejnoha says. “We are mining data and using the results to populate electronic health records.” Nuance recently signed a deal with IBM to provide technology for a speech front-end to the health care implementation of its Watson question-answering system.
The key to the next breakthroughs in speech technology, Sejnoha says, is making effective use of the vast amount of speech data that now exists, a challenge that has also attracted Nuance competitors Google and Microsoft. “Most algorithms use machine learning and are very data-hungry,” he says. “No one knows yet what to do with tens of thousands of hours of speech data. The race to do that is on. We are doing fundamental research and have a relationship with IBM Research as well. It requires a broad array of techniques to model speech in a robust way, to learn the long tail statistically, and to build techniques that can benefit from large amounts of data. It’s a very exciting time.”
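As a toy illustration of the “long tail” Sejnoha describes, my own sketch rather than Nuance’s methodology, counting word frequencies over transcripts shows how much of a vocabulary appears only rarely, which is why these algorithms are so data-hungry.

```python
# Illustration only: most distinct words in a transcript corpus occur just a few
# times, so the long tail can only be learned statistically from very large corpora.
from collections import Counter

def long_tail_share(transcripts, rare_threshold=2):
    """Fraction of the vocabulary appearing no more than `rare_threshold` times."""
    counts = Counter(word for line in transcripts for word in line.lower().split())
    rare = sum(1 for c in counts.values() if c <= rare_threshold)
    return rare / len(counts) if counts else 0.0

sample = [
    "set up an appointment with scott for three",
    "remind me to call the garage about the car",
    "what is the weather next wednesday",
]
print(f"{long_tail_share(sample):.0%} of this toy vocabulary is rare")
```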
I have always loved the Apple speech recognition demos at WWDC, and I’m looking forward to trying Siri on my iPhone.
Way back when, we were working with Apple’s speech recognition software for an automated system. One of the people in the company was from Nigeria and, while his English was impeccable, his accent was extremely thick. The old speech recognition software of the 1990s would just look confused whenever he spoke.
I’ll be curious to see Siri’s “learning ability” with things like this.