This is What You’re Missing about Vocal Computing

On Christmas morning, as my mom and I rushed around my kitchen making final preparations, a third voice would occasionally interject into our conversations. Sitting at my counter was Alexa, helping me through the process by answering questions, setting timers and even flipping on holiday music at our request.

I’ve been living with Alexa for roughly two years now and have grown accustomed to our constant banter. But for my mom, it’s still a novel experience. When my mom speaks to Alexa, she might recognize she’s speaking to a computer, but she probably doesn’t consider that the computer is actually thousands of miles away, and she probably doesn’t realize that the way we talked to that computer on Christmas morning is the new face of computing.

The graphical user interface (GUI) wasn’t an entirely new idea when Xerox introduced it in 1981, and Apple’s Macintosh computer popularized it to the masses in 1984. A GUI didn’t represent a new technical way of computing, but it was a crucial evolution in how we interact with computers. Think of the impact the GUI had on how we used computers and what we used computers for. Think of how it changed our conception of computing.

The smartphone was created in the 1990s but it wasn’t until 2007, with the advent of Apple’s iPhone, that smartphones reached an important inflection point in consumer adoption. Today, 75 percent of U.S. households own a smartphone, according to research from the Consumer Technology Association (CTA).

The touchscreen interface represented the next paradigm shift in computing, ushering in a new way of thinking about computing and bringing into existence new applications.

Smartphone computing shares an important heritage and legacy with the GUI introduced in the early 1980s. If you’re old enough to remember computing before GUIs, can you imagine computing on a smartphone using command prompts? GUIs in the era of desktops improved computing. It was the transformation to a graphical interface that ultimately launched the smartphone era of apps.

Vocal computing will do the same thing for the future of computing. Vocal computing isn’t perfect. Alexa isn’t always certain what I’m asking. Google Home doesn’t always provide an answer. Siri can’t always help my sons when they ask complex questions. Like a first date, we are still learning each other.

Software layers and form factors change our computing experience. We’ve seen this throughout the history of computing – from the earliest mainframes to the computers we call phones and carry in our pockets. In all the same ways, vocal computing is just an extension of what we already know – it’s a more natural and intuitive interface.

Let’s not overlook just how transformative this new interface can be. Imagine someday computing on our bikes, in our cars, while we are walking or lying in bed. With voice, every environment can be touched by computing.

Published by

Shawn DuBravac

Shawn DuBravac is chief economist of the Consumer Technology Association and the author of “Digital Destiny: How the New Age of Data Will Transform the Way We Live, Work, and Communicate.”

34 thoughts on “This is What You’re Missing about Vocal Computing”

  1. I am OK with the discussion that voice will be very helpful. I am however not OK with the blanket statement that voice is more natural and intuitive than a touch UI.

    You can train rats to pull levers to get their food. Can rats talk?

    One year old toddlers can’t yet speak, but can express themselves through crying, pulling, touching, pointing.

    Also, if the most natural and intuitive interface is guaranteed to win over all others, without regard for expressiveness, clarity and a myriad of other factors, as this article seems to assume, then most likely, the winner will be touch sans letters or words.

    In reality, the usability and convenience of a user interface depends on a range of factors. Usability depends a lot on the capability of your brain and your physical capabilities, your training (can you read?), your familiarity with the interface (Do you remember shortcuts? Do you remember where the buttons are?), and the complexity of the task (Do you really want to control an Excel spreadsheet with voice? Do you prefer to play Chess with voice?).

    Imagine playing Mario Go with a voice controller (you can still keep the screen if you want). Now that would be interesting.

    I don’t think it helps to make blanket assumptions.

    1. Sadly I can only give this reply a single upvote.

      Voice input is good for no-look interactions. Voice output is good for low information, linear replies. A screen is essential for any high bandwidth output, or any output where we might want to process it in a non-linear way. Touch or mouse input is essential for anything that would be hard to express verbally. And some kind of keyboard, onscreen or physical, is essential for any linguistic input longer than a sentence or two.

        1. Either. Both. Human short term memory is too crappy for a long verbal response to be retained properly. And verbal output is useless for conveying nonverbal information. Alexa can tell you the weather is cold and wet, but it can’t show you the weather radar. Nor will it do you much good to be told the weather forecasts for the next five days, since you’ll be mixing them up and losing track of what was said long before it gets to the fifth day out.

          1. I disagree. Spoken word (like theatre) CAN be more expressive than a written word in books.

          2. Not talking about expressiveness or performativity. Build me a computer that can act in a play and we can reopen the topic. Computer Voice interface is about conveying information, not storytelling or performativity. And while humans are great at following the details of an aurally told story, we suck at retaining more than a few bits of information in short term memory. With visual information, we can go back and look at it again. With audio, you can’t rehear it unless you ask for it to be repeated. Which would get tedious really fast.

          3. Not act in a play, listen to a play. If I said a “podcast” instead of a theatre, would it be clearer what I mean? Also, my understanding is that Chinese is based on pictograms or pictures as opposed to literal English, this may be a reason that some languages are better expressed in a written form.

    2. “You can train rats to pull levers to get their food. Can rats talk?”

      Unfortunately that’s the problem with rats! Oh…you mean the two legged variety.

      Anyhow, completely agree with voice as an option. I’m not for chatty machines.
      “To accept this answer please press 1….”

    3. Voice is an option for a limited subset of user-computer interactions. The GUI is much more accessible to humans within a wider range of cognitive capability (babies to adults).

      Voice requires a level of cognitive thinking that frankly slows me down. To interact with iOS/MacOS via Siri, I have to think, formulate the question, then think some more about how to vocalise my thoughts to the device.

      1. I think that is an important point.

        In real life, we often find meeting and talking in person, or even talking over the phone to be a very efficient way to communicate. Many people misunderstand this and conclude that verbal communication itself is inherently efficient, but I consider this a fallacy.

        Instead, what really makes verbal communication efficient is the flexibility and interactiveness. When you are talking, the listener is free to ask questions to confirm your intention, or to provide verbal cues to communicate that they understood what the speaker meant. Even if the speaker doesn’t clearly give his intentions the first time, he can repeat himself for clarity, or he can answer questions the listener might bring up. All this combined is what makes real-life verbal communication work, without extensive cognitive thinking up front.

        My opinion is that verbal communication is actually inherently inefficient, but interactiveness more than makes up for it. Therefore verbal communication without interactiveness, which is what we have on our devices today, tends to be quite terrible.

        1. Plus verbal communication is often combined with a) visual cues and b) intonation. My “Bonjour” can be cordial, friendly, doubtful, scathing, aggressive…

        2. As an anecdote, I’ve tried to push Voice for my elderly and declining parents who never quite got into Computers (my dad is old enough that he had a secretary until retirement then built a house from the ground up – and got the reams of hand-calculated mass, structure, and whatnot to prove it ;-p; my mom was stay-at-home and is more into people and arts-and-crafts).

          They still have that foreign relationship with computers, afraid to try stuff out, doing by rote rather than understanding How Things Work…

          Utter failure with voice. I put it down to lack of feedback and the need for a clearly enunciated query, well thought out beforehand. Even dictating texts isn’t working out for them.

    4. This is a fairly reductionist presentation about voice UI. It certainly can introduce a new level of computer interaction. I think in an effort to counter the negativism of detractors of voice, the author has gone a little overboard. Must be the days we live in.

      Voice will require a higher level of understanding than touch or even a PC with a GUI. Both are passive for the computer. It just sits there and doesn’t do anything until you press the button—literal or figurative. Voice requires the computer to be perceptive and would have to parse out what I mean. Heck my wife of 30 years still has trouble understanding me when I ask something.

      I brought this up before, but the auto insurance commercial has two people using the exact same words with two completely opposite meanings.

      https://www.youtube.com/watch?v=ultPAIkFoRw

      Even with touch UI people are trying to introduce nuances to the interface based on things like force with only limited success. Except for more basic questions, that’s the difficulty voice faces. Right now, even with touch I keep yelling at my device saying “I didn’t do that!!!!”

      So voice opens up a whole new realm of misunderstandings!

      Joe

      1. This commercial is actually a great example of how emotions can be expressed better by voice nuances than by words: in the two situations shown, the same words sound completely different. That gives voting points to the voice UI as a matter of communications, doesn’t it?

        1. Except it isn’t just voice, it is also visual cues. This is requiring a perception beyond just aural.

          Joe

          1. Arguably though, voice may be just enough to decipher the situation based on an audio context only. If there is a mistake, voice UI may clarify. AI training is based on trial and error.

          2. Don’t get me wrong. I am not opposed to a voice UI nor do I think it is pointless/useless. But it isn’t the ultimate or superior to touch or anything else. It will be a welcome _layer_. But we have enough problems as humans misunderstanding each other, even when talking on the phone; I don’t see how a computer will fare any better.

            Which brings up the question in my mind, is natural really superior? Why do we want a “natural” UI? And how natural do we want to go? The robot in Hitchhiker’s Guide? What do we really mean by “natural”?

            Joe

          3. Voice is better in some situations than text because the emotional content is more easily derived from it I think. I am still figuring out what situations it is better for. I agree – the more senses you can put to use, the better understanding is.

          4. I agree. I hate, really hate, touch screens. Their saving grace for me is that I like big screens. Something’s got to give…

            But why do I hate them? It’s paradoxical, but touch screens deprive me of the sense of touch. It’s binary, there’s no texture (or only two textures, air and glass).

            Same with sound. Words alone have less texture than when they are accompanied by inflections, body language, eye movement, and other non-verbal communication.

            Even the written word is mentally dictated to us depending on whether it’s Shakespeare or a computer manual. That too gives texture.

          5. Have you ever thought what “I am touched” means? Metaphorically. You can be “touched” by words “not less” than physically and Shakespeare is a good example of that.

          6. Yes, of course. There’s also auditory versus visual listeners. The auditory ones “mentally narrate” to themselves when they read; for the visual ones it’s more like watching a ticker tape go by. I can envision content impacting that mode.

          1. Certainly, this is the reason. Also some words are more emotional than others and on the other hand there are some words that are empty.

    5. Your argument is extremely flawed.
      Voice is as natural as touch when it comes to human interaction. We talk to each other more often than we touch a computer, so to assume that touch in any way will be the main interface is false.

      What you need to understand is that none of our current computers, software and interfaces are designed around voice interaction; a smartphone with voice interaction isn’t that different from a PC with a touch screen.

      The Amazon Echo and Google Home are probably the first computers to have been truly designed around voice interaction, but the applications and set of instructions are still very limited, in the same way as the first touch computers in the age of mouse and keyboard were limited.

      Voice is the best form of interaction depending on circumstances and instruction, but for this we need to design a computer totally around the voice interaction, and not a touch computer with a voice interaction capability

      But before too long all our computers, applications and instructions will be designed around these two forms of interaction. One will not be more dominant nor better than the other; they will complement each other.

      1. Your point is exactly the same as mine. Voice and visual/touch will complement each other. That is how human beings communicate with each other today.

        The issue I have with this article is that a) it assumes that voice is more natural and intuitive than visual/touch, and b) it implies that voice will dominate over visual/touch by comparing this to the transition from command line UIs to GUIs.

  2. I think the great advantage for voice UI is in being a listener to the conversation between people. Right now you need to press a button to have the voice UI listen, but if it listens all the time it picks up the important nuances and can provide a context for the conversation by infrequent requests/responses. This is true ambient computing as it should be, I think.
