When Computers Can See

We are watching the early phases of one of the most significant advancements in computing. We are on the cusp of teaching computers to hear us and to see us and the world around us. Computer vision has become a buzzword for companies like Intel, NVIDIA, Qualcomm, and AMD because it places immense demands on today’s processors and GPUs. One of the most important long-term observations I’ve come to of late is that we simply do not have enough processing power to handle our ambitions for computer vision. We are nowhere close to achieving this vision; it could be ten years before we have the kind of processing power to truly give a computer eyes and ears and the real-time intelligence to go along with them. This is the opening inning of a long game, but that game could be the pinnacle of them all.

This trend first became clear as companies set out to build autonomous cars. Cars were not yet capable of seeing, so companies had to train their backend networks to identify a human, other cars on the road, street signs, curbs, and so on. Once the network was trained, the computer/car gained some vision capabilities. It took mountains of data to train these systems, many person-hours labeling the images, and far more computer-hours processing the data and training the backend network.
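To make the label-then-train loop concrete, here is a minimal sketch in Python using PyTorch and torchvision. The folder layout ("road_images/" with one sub-folder per human-applied label such as pedestrian, car, street_sign) and the tiny five-epoch loop are illustrative assumptions, not any carmaker’s actual pipeline, which would involve far larger datasets, detection rather than simple classification, and large GPU clusters.

```python
# Minimal supervised-training sketch: humans label images by sorting them
# into class folders; the network learns to map pixels to those labels.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# Hypothetical directory: road_images/pedestrian/, road_images/car/, ...
train_data = datasets.ImageFolder("road_images/", transform=preprocess)
loader = DataLoader(train_data, batch_size=32, shuffle=True)

# Start from a network pre-trained on ImageNet and retrain its final layer
# for the road-specific classes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_data.classes))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # real systems train far longer, on far more data
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```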

Cars are one example, and there are others in verticals like oil and gas, public safety, and even medicine, but such examples are few and far between. Most computers can’t see, and even those that can are extremely expensive and niche. So the real question is how do we get to a point where all computers can see? I believe it starts with Augmented Reality.

Smartphone cameras have the most potential to assist in teaching computers to see. They already do, to a degree, given the volume of photos consumers take and post to services like Facebook, Instagram, Google Photos, and Apple Photos. These systems can use the data captured by the user to help train a network in specific ways. Facebook and Instagram may be able to offer me localized ads if the images I post that day suggest I’m in Boston. Google and Apple can use the photos stored in their photos apps to train networks more broadly for places, faces, foods, and anything else sitting on a user’s device. Some may be concerned about the privacy aspect of this, but there are many ways, both on device and in the cloud, this can be done privately. Training networks with still photos was, and is, the first step. But with Augmented Reality there will be an opportunity to teach computers to see in real time. This is no easy task.
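As a rough illustration of how a photo service can turn a still image into searchable tags (places, foods, objects), the sketch below runs an off-the-shelf classifier over a single photo. The model choice and file name are assumptions for the example, not a description of Google’s or Apple’s actual systems.

```python
# Sketch: auto-tagging a user's photo with a pre-trained image classifier.
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()  # the resize/normalize steps this model expects

image = Image.open("vacation_photo.jpg").convert("RGB")  # hypothetical file
with torch.no_grad():
    scores = model(preprocess(image).unsqueeze(0)).softmax(dim=1)[0]

# Keep the top few labels as tags the photos app could index for search.
top = scores.topk(3)
tags = [weights.meta["categories"][i] for i in top.indices.tolist()]
print(tags, top.values.tolist())
```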

Overall, Augmented Reality is going to usher in a new era of app development. I like to refer to it as mobile 2.0 or even the App Store 2.0. We are going to see a gold rush of innovation by software developers around Apple’s ARKit, which will re-ignite the software developer landscape. But AR also brings the ability for a computer to capture visual data in real time, and to do so with hundreds of millions of people gathering a tremendous amount of data.
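Seeing in real time means running a vision model on every camera frame before the next one arrives, which is exactly where today’s hardware strains. The sketch below uses a laptop webcam via OpenCV and a small mobile-class model as a stand-in; a real AR phone app would use the device’s camera APIs and a model optimized for on-device inference.

```python
# Sketch: classify live camera frames and report per-frame latency.
import time
import cv2
import torch
from torchvision import models

weights = models.MobileNet_V3_Small_Weights.DEFAULT
model = models.mobilenet_v3_small(weights=weights).eval()
preprocess = weights.transforms()

cap = cv2.VideoCapture(0)  # default camera; a phone app would use camera APIs
while True:
    ok, frame = cap.read()
    if not ok:
        break
    start = time.time()
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    tensor = preprocess(torch.from_numpy(rgb).permute(2, 0, 1))
    with torch.no_grad():
        label_idx = model(tensor.unsqueeze(0)).argmax(dim=1).item()
    print(weights.meta["categories"][label_idx],
          f"{(time.time() - start) * 1000:.0f} ms per frame")
cap.release()
```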

In machine learning, data is the gold: not just any data, but good data, the right kind of data. As companies create AR apps and leverage the real-time capture of data to help train their networks for the services they offer or plan to offer, they will be working with some of the best data they can get their hands on.

The challenge many will have over the next year is coming up with a grand concept for how to implement computer vision elegantly in their app or service. Perhaps apps that look at my food and give me calorie counts can also offer ways to make better food choices. Or apps that show me what clothes will look like on me can help me find the best deals on the Internet. Outside of enterprise and gaming, I expect retail to be an early mover on augmented reality and the benefits to its industry of giving computers eyes. This will work for retailers both with the devices we carry and in their stores. As they embrace more camera sensors in their stores and train these computers to spot patterns, it will help them evolve and compete in the e-commerce era.

While we are just in the beginning phases, it is worth repeating that we don’t have anywhere near the hardware capabilities needed to pull off this vision. We need innovation in camera sensors, software, on-device and data-center CPUs, GPUs, memory, and sensors, bandwidth in and out of the home, and many other things. It took the technology industry nearly a decade to get where we are today in speech recognition, and it still isn’t perfect. It could take another decade to truly give computers eyes, and even longer for true AI to become a part of the system.

This is a big trend and one we will look back on with gratitude that we got to see it play out. It’s also affirming to know the tech industry has many more years of innovation ahead and hence great job security for all of us who work in tech.

Published by Ben Bajarin

Ben Bajarin is a Principal Analyst and the head of primary research at Creative Strategies, Inc., an industry analysis, market intelligence, and research firm located in Silicon Valley. His primary focus is consumer technology and market trend research, and he is responsible for studying over 30 countries.
