The Voice UI

I hear the phrase “voice is the new UI” often during meetings around Silicon Valley. This is nothing new. I’ve been involved in many industry discussions over the past 15 years where the “voice as the user-interface” vision has been well articulated. Science fiction stories have long portrayed humans interacting with machines via voice and, to the astonishment of the audience, the machines talk back. Consumer technology is unquestionably headed in this direction. We will have to explain to future generations what it was like to live in a world where we couldn’t operate our electronics with our voice and our electronics were not smart enough to understand us.

If we step back and take a look at a general theme in consumer technology today, we notice a pattern emerging – the elimination of friction. The success of messaging apps as platforms all over the world is based on the simple premise of eliminating friction. The move to contactless mobile payments is a move to eliminate friction. Google with Now on Tap is moving in that direction, as is Apple with Proactive in iOS 9. The examples are countless and the trend is clear. Convenience trumps nearly everything in consumer electronics and things that eliminate friction are convenient.

Being able to pay with my smartphone or smartwatch is convenient and eliminates friction. Amazon’s brilliant idea for one-click purchases was to eliminate friction, making it easier and faster for me to buy things from Amazon. Voice as a user-interface layer eliminates friction for many tasks that are possible by typing on my smart device; for such small interactions, voice is often much more convenient. To text my wife a short message I could pull out my phone, use Touch ID to log in, pull up iMessage, click on her contact info, and start typing. Or I could lift my Apple Watch and say “Hey Siri, text Jen I’ll be home in 30 minutes.” If we believe we are on the grand path to eliminate as much friction as possible from the world of technology, we have to believe voice truly is the new UI. Honestly, it can’t get here fast enough.

I look at this in two ways: the first world viewpoint and the third world viewpoint.

First World Problems

I was having a discussion with a family member about the future and he said “I want to be able to talk to my oven and tell it to turn on to 450 degrees.” Voice as a UI layer applies to all kinds of household appliances. “Refrigerator, how much milk do I have left?” Or “how many eggs do I have?” “Do I have everything I need to make waffles?” In this vision and many like it, the appliances talk back, making sure we get what we need. Your refrigerator may tell you that you need more eggs and ask if they should be added to your grocery list. Once your shopping list is complete, you can send the request to have everything you need delivered by the end of day.

Interestingly, Amazon’s Echo is presenting this vision and trying to make it mainstream via a singular household appliance. If you have never seen this video on the Amazon Echo, I recommend it. I’ve yet to try an Amazon Echo but one is on the way. This product is a great example of the potential of voice UI and what can happen when more and more of our appliances become “smart.” When we watch videos like this or have experiences of our own where we use voice to control and interact with appliances, we conclude this is the direction we are heading. The challenge is all the innovation surrounding this vision that still needs to happen.

Echo is great, but it’s only one product. While Amazon is touting integration with smart home products so you can control them through Echo, most appliance companies will be slow to adopt any standard and integrate with a product like Echo. It’s more likely, at least in the beginning, that each appliance manufacturer will want to build the smarts into their appliances rather than work through an aggregator. This is a debate I hear frequently in industry circles.

It is true voice recognition has come a long way. However, we still need the artificial intelligence layers in the cloud to mature well beyond where they are today. The cost of components and sensors needs to come down as well before we can see this expand to everyday appliances at price points the masses can afford. As much as I want this vision to become a reality sooner rather than later, it seems we still have a bit of a wait ahead.

Third World Problems

As interesting as voice is as an interaction layer to most of us in the developed world, it may evolve to become central to those in the third world, particularly with things like smartphones. One of the primary problems, besides economics, in connecting the next billion humans to the internet is a lack of technical literacy and often a lack of literacy altogether. There are massive pockets of humans who live in villages with maybe one TV and radio. This raises an interesting question: how would they use a smartphone even if they could afford one and the data plan attached to it? This is where things like voice as a user-interface may provide a solution.

Commercialization for this specific use case is still a long way off, and voice as UI will have to become fully mature and established in developed markets first. But if we can bring natural interfaces like voice to the masses and include the ability to understand the many languages and dialects spoken today, we could be one step closer to connecting the next billion and the several billion after that to the internet.

Published by

Ben Bajarin

Ben Bajarin is a Principal Analyst and the head of primary research at Creative Strategies, Inc., an industry analysis, market intelligence and research firm located in Silicon Valley. His primary focus is consumer technology and market trend research and he is responsible for studying over 30 countries.

12 thoughts on “The Voice UI”

  1. I think there’s a bit of a confusion between being all-knowing and being able to talk, and also, a bit of a practical issue.
    In the waffles example, the banter about the batter (ah !) is secondary to the fridge actually knowing what’s inside of it, and then there’s a second step about something somewhere being able to check ingredients (vs which recipe ? It also has to be aware of the recipe I use, not pull up the first one off a random site). Once that info is available, most of the problem is solved, the UI is frosting on the cake. I think the story is more about a well-informed free-form UI than about voice recognition ?
    The practical issue is that voice works well… in a quiet environment, if you don’t fumble your requests, and for stuff that’s expected. I just got a new phone for my elderly mom, and figured OK Google would be a better UI than the Big Launcher she’s been relying on. I had to roll that back: there’s always someone talking in the background, or she mumbles/hesitates/gets mixed up, or OKG gets it wrong for some reason. I’m starting to wonder if holograms are not required for Voice to be usable as the main UI to an AI, because speaking to the void w/ no feedback is unpleasant, unintuitive, and inefficient.

    1. Yes, my focus was more on where we are headed with a natural voice driven engagement with our electronics which will include elements of AI to interact back with us.

      There are all sorts of points around where we are today vs. where we will be and all the criticisms of the voice recognition and AI markets today are valid. But this will get solved. As computing power increases both locally and in the cloud, as sensors get better, as a machine learns my voice vs. the voice or sounds of others and can filter other sounds out and focus on me, etc., we will move closer to this reality.

      1. To echo the point, I do think that there is a question about whether voice is the best interaction model for all situations. The technology that underpins any improvement in voice UI isn’t exclusive to voice UI, and may underpin other kinds of UX and UI models that work better for some scenarios, not that different input and output modes are mutually exclusive of course.

        1. I agree that voice is not the best for all situations, but on the other hand, the QWERTY keyboard as an interaction model is so bad on mobile that it’s not even funny anymore.

          If somebody could come up with a one handed text entry method that may require a bit of learning, but is fast once you get the hang of it, then maybe free text queries might become quite good. The Japanese “flick” text entry method that most people here use is actually pretty close (I unfortunately haven’t taken the time to learn it, but I know I should).

          1. In no way endorsing qwerty on mobile! Though I’ve found the swipe keyboard to be rather effective (have been able to write entire essays on it). On a tangent, touchscreens have been tremendous for Chinese input. Up until the mobile phone, writing Chinese on a computer was a nightmare learning experience full of friction points.

          2. I’m slowly but surely falling in love with Minuum (I think it’s on iOS too). It’s Swype on sdioretS -the opposite of Steroids- : even smaller, even smarter about… divining… what your fat fingers intend to write on a measly line of squished letters.
            Takes a bit of getting used to (it is very different) and trust-building (it can’t possibly.. yes, it did !). Couple of weeks later, anything else seems clunky.

      2. I just remembered that one of the earliest consumer smart objects was that plastic rabbit with lights that spoke news, weather updates, and various other items. Ah, Nabaztag, my brother had one. v2 did a bit of voice recog, both flashed lights (unread mail…) and spoke.
        I find it telling that it was a dedicated object, actually a “pet”, not some incorporeal entity.

  2. Life without friction, is it worth living?

    Card Slide. Beep.
    $18.96. Is it OK?
    Yes, it is OK. What can I say.

  3. A recent report on Engadget suggested that Amazon will offer Echo voice tech to other manufacturers. In the context of this article, it would mean that the UI may become an externally sourceable component, independent of the operating system and the device. Moreover, since it will be hard, if not impossible to patent the “look and feel” of a voice UI (although internal algorithms may be patentable), it may have significant implications on the branding of operating systems. It might lead to the commoditisation of the UI.

    What I find interesting is that given the pretty good reviews of Amazon Echo and the rapid improvement of Siri, it now seems that good voice recognition is within the reach of at least the large tech companies, regardless of previous prowess in AI. Amazon, by offering Echo voice to other companies, is essentially making it the new AWS. This could result in even small startups using sophisticated voice technology, and if Google/Microsoft follow suit as they did with Google Cloud and Azure, the costs might be dramatically reduced.

    It’s interesting to contemplate what will define ease-of-use and what will provide differentiation of voice UIs when this happens.
