(Tech.Pinions: Today’s Daily piece, “Adventures in Machine Intelligence” was an Insider post we originally published on December 12th, 2016. We post it today as an example of the daily content for our Insider subscribers. You can subscribe, yearly or monthly, at the page found here)
While I tend to stay away from high-performance computing and data center analysis, I’ve taken up the effort to better understand the soup-to-nuts solutions being developed for machine learning, everything from chipset architectures to software and network modeling. Luckily, a large number of our clients have assets in these areas, so engaging in discussions to help me better understand the dynamics has been straightforward. I’m not going to claim to be an expert, but my technical background, as well as staying current in semiconductor industry analysis, is proving quite helpful. I’d like to share a few basics I find quite interesting.
A great deal of the work up to this point has been around data collection. Large amounts of data on specific subject matter, or around specific categories of data, are the key to machine learning. Simply put, data is the building block of machine learning and intelligence. Interestingly, and somewhat contrary to some opinions, it is not the person or company with the most data who is best positioned but the one with the right data. A key part of this analysis about where we go in machine intelligence, and how that translates to smart computers (AI), needs to be grounded in collecting the right data. So, fundamentally, the starting point is data, and the right data at that.
Lots of companies have been gathering data. Google has been gathering data from searches, world mapping, and more. Microsoft has been gathering enterprise data, and Facebook gathers social data. There are a lot of companies gathering data, but many are still in the early stages of turning their backend data collection efforts into smart machines. In fact, very little of the technology we use is smart. By smart I mean something that is truly predictive and can anticipate human needs. We have a tremendously long way to go in making our machines truly smart. In a recent conversation with some semiconductor architects of machine learning silicon, I asked them if we can pick a point in the history of the personal computer and liken it to where we are today in machine learning. Their answer? No later than the early IBM PCs. This was from folks who have been in the silicon industry for a very long time. The context for this discussion was how much silicon still needs to advance for machine intelligence and AI to truly start to mature. So it is worth noting their comparison to the early IBM days comes with the knowledge that the early IBM PCs ran Intel 8088 processors with roughly 29,000 transistors. Today, we have architectures with more than 10 billion transistors.
After being convinced we still have a tremendous amount of innovation ahead in semiconductors to get where we need to be in machine learning and AI, I started looking into what is happening today. The next step is to understand how to train a network, or how to teach a computer to be smart(er). I stated above it all starts with data: good data, and the right data. Some of the most common examples of network training today are around computer vision. We are teaching computers to identify all kinds of things by throwing terabytes of data at them and teaching them a dog is a dog, a cat is a cat, a car is a car, etc. Training a network is not arbitrary; it is calculated and intentional. The reason is that network models have to be built/programmed before they can be trained. Leaning on decades of work on machine learning, many programs exist to train a network in some of the more common fields that work with large data. Medicine, autonomous vehicles, agriculture, astrophysics, oil and gas, and several others are areas where people have been focused on creating these network models. Many hours of hard work and hard science go into building these network models so data can be collected and fed to the machine so it can learn. Companies playing in this field today are picking their battles in areas where big money is to be made with these training models.
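To make the "models are built, then trained on labeled data" point concrete, here is a minimal toy sketch of supervised training: a hand-rolled perceptron separating two invented clusters of feature vectors, standing in (very loosely) for the dog/cat images discussed above. Everything in it, the data, the learning rate, and the epoch count, is made up for illustration.

```python
def train_perceptron(examples, epochs=20, lr=0.1):
    """Train a toy linear classifier. Each example is ((f1, f2), label)
    with label +1 or -1. Returns learned weights (w1, w2, bias)."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (f1, f2), label in examples:
            # Prediction: sign of the weighted sum of the features.
            pred = 1 if (w1 * f1 + w2 * f2 + b) > 0 else -1
            if pred != label:
                # Misclassified: nudge the weights toward the correct label.
                w1 += lr * label * f1
                w2 += lr * label * f2
                b += lr * label
    return w1, w2, b

# Invented toy data: "dogs" (+1) cluster high, "cats" (-1) cluster low.
data = [((2.0, 2.5), 1), ((2.2, 1.9), 1), ((0.3, 0.4), -1), ((0.5, 0.2), -1)]
w1, w2, b = train_perceptron(data)
correct = sum(
    1 for (f1, f2), y in data
    if (1 if w1 * f1 + w2 * f2 + b > 0 else -1) == y
)
print(correct)  # all four training points classified correctly
```

Real computer-vision networks replace the two hand-picked features with millions of learned ones, which is where the terabytes of data and the long training times come in.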
What is fascinating is how long it takes to train a network. With a modern-day CPU/GPU and machine learning software, a network can be trained in as little as a few hours, depending on the data set. To train a network on what a dog is, with roughly two terabytes of data, could take 3-4 hours. However, there are many cases where the data sets are so large it could take several weeks to a month to train a computer on one single thing. This again underscores how far we still have to go in silicon. I’m reminded of early demonstrations of Microsoft Office running on Pentium chipsets, where the demo shone because Excel could process a massive spreadsheet in 30 minutes or less. Today, it is nearly instantaneous. Someday, training a network will be nearly instant, as will its ability to query that data and yield insight or a response. Instant and in real time is the holy grail, but we are many years away.
Knowing how early a stage we are at makes it hard to count any company out at this point. But it does emphasize how key collecting the right data is. Companies are right now setting the stage, getting the right data they need to carve out value in the future with AI. What is fascinating is how deep learning algorithms are helping networks learn faster with less data. Expert consensus holds that having the largest data sets is not necessarily a guarantee of winning in the future. Because specific network models have to be built, the emphasis falls on the “right data” philosophy of collection.
What this means is that companies and services can benefit from months or years of the right kind of specific data and still train a network model. Even companies that are starting today and just beginning to gather data have a chance in this future of leveraging machine intelligence for themselves and their customers, if the data is good.
With some context of where we need to go, silicon architectures (CPU, GPU, FPGA, custom ASICs, as well as memory) are all key to advancing technology in the data center for more efficient and capable backend systems for machine intelligence. But all are still governed by science, and we have a relatively good idea of what is possible and when. That is why we know it will still be many, many years and, hopefully, a few new breakthroughs before we get even close to where we need to be for our intelligent computer future.
19 thoughts on “Adventures in Machine Intelligence”
I have been very interested in the distinction between data quantity and data quality since my brief experience with genomics and bioinformatics 15 years ago, and your emphasis on “right data” makes a lot of sense to me.
My question is, what is “right data”, and who do you think is best positioned to get access to this? Do you think that the stealthy methods that the Internet advertising giants use give them the “right data”, or do you think that companies that focus on getting opt-in and collecting data on specific domains are better positioned (I am specifically interested in health related data)?
To answer your question about “the right data”: in machine learning, data is normally broken down into “features”, and the job of the features is to predict a target closely. For example, say you want to predict home prices in a given neighborhood. Square footage, number of rooms, and number of floors would be good features for that. But if you decided to include, say, the distance from the home to the moon as a feature (I exaggerate), it would not be a good feature, i.e. “bad data”. Some features can also be redundant, for example if you give the area of the house in both m^2 and ft^2.
This is a very simple example. In reality, the decision about which features are good for describing a prediction model and which are not may not be that obvious.
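That house-price example can be sketched in a few lines. The numbers below are invented; the point is that an informative feature (square footage) yields a clean fit, while a redundant copy of the same feature in different units adds nothing new.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Invented toy data: square footage vs. price (in thousands).
footage = [1000, 1500, 2000, 2500]
price = [200, 300, 400, 500]
a, b = fit_line(footage, price)
print(round(a, 3))  # slope: each extra square foot adds about 0.2 (thousand)

# A redundant feature: the same area in m^2 carries no new information.
# It is just the footage column rescaled, so the fit is equivalent.
m2 = [f * 0.0929 for f in footage]
a2, b2 = fit_line(m2, price)
print(round(a2 * 0.0929, 3))  # rescaled slope matches the original
```

A distance-to-the-moon feature would be constant across houses, so its deviation terms would all be zero and it could contribute nothing to the prediction.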
I agree that the example you give is relevant to one aspect of “right data”. For clarity, I think it would also be beneficial to provide an example of bad data. For example, having a lot of data on my web browsing habits and learning that I am interested in personal computers, but not having my actual purchase history (I actually very rarely purchase new PCs, and I’m still on 2011 models), would be an incomplete data set if your aim is to provide relevant ads efficiently. In this case, one could say that the data set lacks what actually predicts the target (whether I want to buy a PC or not). For this reason, my opinion is that Amazon should have a vastly more valuable data set than Google (but Amazon is not an advertising company yet).
The other aspect of “right data” that I was originally thinking about when I wrote the above comment a month ago, was that the data has to be relevant to your objectives. If you want to make money from healthcare, you need healthcare data. Analyzing web browsing habits, where you travel to everyday, how much you sleep, etc. may give you some clues about your health, but ultimately the data from medical checkups will be vital. This is undeniably the most relevant data. However, not every company will be given access to this. Companies will probably need to have a very good privacy and security track record, and may also be required to share their results with the broader scientific community. It is likely that they will be audited. They may also need to support a wide range of platforms so that patient data will benefit everyone and not only the rich 15% of the population. Getting access to the “right data” will not be trivial.
I think there are many aspects to “right data”, and for the Internet companies that are accustomed to unrestricted collecting of everything there is to know about your Internet habits, getting the “right data” outside of the Internet may pose significant problems.
I agree. There are features with a weaker correlation to the target (such as browsing history) and features with a stronger correlation to the target (purchase history), and depending on which one you choose, your prediction of the target will have a bigger or smaller error accordingly.
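That weak-versus-strong distinction is exactly what a Pearson correlation measures. A minimal sketch, with invented numbers chosen so that one feature tracks the target and the other barely does:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: target = whether a purchase happened (0/1).
purchases      = [0, 1, 0, 1, 1, 0]
past_purchases = [0, 2, 0, 3, 1, 0]   # strong signal: tracks the target
pages_browsed  = [9, 2, 5, 4, 6, 1]   # weak signal: noisy, mixed

strong = pearson(past_purchases, purchases)
weak = pearson(pages_browsed, purchases)
print(strong > abs(weak))  # True: purchase history correlates far better
```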
How to get access to the “right data” is a separate story. I heard that Google Translate used the publicly available transcripts of the (bilingual) Canadian parliament to train its French-to-English translator. Getting private medical records would be a challenge, of course. But there may be some expired records available.
The differential privacy aspect is exactly what I’m interested in.
What if, by refraining from gobbling up personal information now, Apple is positioning itself to have preferential access to medical data? What if medical institutions decide that neither Google nor Facebook is trustworthy enough to give patient data, but that Apple is? Given how seriously privacy is taken in healthcare, and the regulatory environment there, this may give Apple preferential access to the most vital and “right data” there is.
Of course, Apple also has issues related to only having access to the wealthy 15%. Apple-only solutions will not benefit the majority of taxpayers, and it might be pressured by health agencies to open up its health platform.
No private company’s voluntary policies will ever be safe enough for privacy. PR is PR, talk is talk, and one day’s commitment can be undone by a change in ownership, management, or priorities…
Privacy (and, probably, data security) can only be ensured by a) regulations b) penalties.
Apple, for example, sells trackers to shops, has reactivated tracking on the sly during OS updates, has distributed malware in its official App Store, has told its tech support “geniuses” to deny perfectly documented malware infections, and has had (and still has) its fair share of security bugs, leaks, exploits, and 0-days… certainly not worthy of my medical data.
That’s what I’m saying. We might start seeing regulations and penalties, and not everybody will be able to comply immediately.
I am not sure that Apple is “refraining from gobbling up personal information”. They add noise to the data, so extracting individual data from the anonymized and randomized dataset is difficult, but I believe they do use the data in their research.
And what prevents others from doing the same with their health data? Adding noise to data is not that difficult. And Apple participates in all sorts of standards committees; I don’t think it would mind sharing its best practices with regulators.
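For what it’s worth, the “add noise” idea has a standard form: the Laplace mechanism from the differential-privacy literature. A minimal sketch, with an invented set of records and an arbitrary epsilon; a counting query changes by at most 1 per individual (sensitivity 1), so noise of scale 1/epsilon suffices:

```python
import math
import random

def laplace_noise(scale):
    """Sample from a Laplace(0, scale) distribution via inverse transform."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon):
    """Answer 'how many records match?' with epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Invented toy records: (age, has_condition). The true count below is 3.
records = [(34, True), (29, False), (41, True), (52, True), (47, False)]
random.seed(0)  # fixed seed so the sketch is reproducible
noisy = private_count(records, lambda r: r[1], epsilon=0.5)
print(round(noisy, 2))  # the true count of 3, randomized by Laplace noise
```

Smaller epsilon means more noise and more privacy; averaged over many queries the noise cancels out, which is why the aggregate statistics remain useful for research.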
Another concern I have is that what is considered health data is a bit blurry right now. How is your heart rate measured by a wearable different from the one left on a treadmill at the gym? The fact that it is anonymized doesn’t prevent correlating the data with the time of your visit to the gym. On the other hand, the list of songs, and the times you listen to them, stored on your iPhone can be indicative of your mood, which can also be considered a health metric.
Technically, Google will not have any problem using differential privacy (assuming that this approach works at all). In fact, technical capability does not seem to be much of an issue when it comes to AI-related stuff. The issue is with business models. Can Google monetise it without tying it to your personal information and advertising?
As for what I mean about health data, I separate it from fitness data. I am talking about the stuff that you get when you take a medical checkup, or your disease history. This is much more sensitive.
It is not immediately obvious to me how to combine differential privacy with precise personal ad targeting. Maybe ad targeting needs to be sacrificed to preserve individual privacy. With web searches failing to provide relevant data and CPC costs rising, better targeting with medical data, even with added noise, may be a carrot for advertisers.
Also, I don’t think privacy of medical data is black and white. Maybe the integrity of the data can be preserved while keeping it private, by substituting rare, critical diseases with common, non-critical ones, or by uncovering only the parts of it critical to the application. Who sees the ad is also important.
What I am getting at is that while there may be some tradeoffs between data privacy and advertising revenue, this problem can likely be solved somewhere between the technical and business sides.
Until the last decade or two, advertising was mostly done without ad networks knowing much about you. Advertising can and does work without compromising people’s privacy. Targeting does improve ad click-through rates, but we still won’t click display ads 99% of the time.
What often gets overlooked is that if you look at Google’s finances, display ads on network sites are doing terribly, and search is what is providing the growth. Display ads are the ones that need to track you and know, for example, whether you’re pregnant or not. Search ads can work pretty well by just analysing your queries and showing you diaper ads when you’re explicitly searching for them. AI targeting does not really benefit the majority of the ads Google makes money from.
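A toy sketch of why search ads need no personal data: match on the query alone. The ad inventory and brand names here are invented for illustration.

```python
# Invented ad inventory: keyword -> ad copy. No user profile involved.
ADS = {
    "diapers": "Huggies -- 20% off today",
    "winter tires": "TireShop -- free fitting",
}

def match_ad(query):
    """Return the first ad whose keyword appears in the query, else None."""
    q = query.lower()
    for keyword, ad in ADS.items():
        if keyword in q:
            return ad
    return None

print(match_ad("best diapers for newborns"))  # matched from the query alone
print(match_ad("cat videos"))                  # no match, no ad shown
```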
I suspect that Google could easily stop collecting user data outside of search and stop using personal data for display ads, but still easily grow double digits for the next few years.
Just think: are you clicking more ads than, say, 10 years ago? I’m definitely not.
So my conclusion is that ad revenue and data privacy can co-exist, and have done so for the vast majority of history. Even the free internet thrived before rampant data collection and was arguably in better shape than it is today.
But we don’t want to go back to the times when there was no personal information available for ad networks to target ads, do we? I personally dislike display ads (I have derived experimentally that ads appear after I visit a corresponding website). And I think display ads are passé; customers can be better served by movies or videos with sponsor-related content embedded in them. For example, you watch “Mad Men” and the liquor they drink all the time is Canadian Club, so Canadian Club must be a sponsor of the show, I guess.
But I can see a lot of usefulness if Google Home, for example, gives me health advice based on my personal data, or Google Assistant offers me a day’s schedule. So I am not against divulging personal data to ad-revenue-based companies; I just think I am entitled to know how it is used, what benefit I get, and how my personal data is secured. Then I can decide what to share and what not.
I’m going to ignore them 99% of the time anyway, so personally, I don’t care whether ads are targeted or not. I’d certainly appreciate it if they weren’t creepy.
Of course, if you were advertising your goods, you would prefer targeted ads. However, internet ads have not increased total ad spending but merely shifted ad budgets from magazines and newspapers to digital. If targeted ads went away, most likely the same ad spending would simply go to non-targeted ones.
We see that total ad spend has remained remarkably constant over the last century, despite innovations in targeting. Money doesn’t lie: this is the true value of advertising. In other words, targeting has not yet made advertising any more valuable.
I would argue targeting has made advertising worse in most cases: it has fostered tracking scripts that slow down websites and annoy users, and contributed to clickbait and the pay-per-click model, which is a race to the bottom. That model doesn’t work well (although there are exceptions), and it is the reason the newspaper industry is finding it very difficult to transition to the digital world.
Interesting data, thank you. What it tells me is that:
1. Internet advertising does not directly compete with TV and radio advertising, but with newspaper advertising.
2. The growth rate of a new medium flattens over time, so in the coming years (following the radio example) we may see internet advertising settle at about its current market share.
So it looks to me like if internet advertising wants to grow, it needs to take share from radio and TV. The value of radio and TV seems to be in broadcasting the same ad to many people at the same time.
If we were to invent a new type of internet broadcasting device, it would help the cause of internet advertising. If this device streamed personalized content without the need to switch channels, even better. I think this is where home voice devices like Home or Echo are going.
A model that seems to work very well is sponsor reads in podcasts. I first encountered that at 5by5 with the original Talk Show. It’s a kind of throwback to early radio. Essentially, it elevates the ad to the level of content and presents it within the flow of the content. The ad is priced per placement, not per click; the advertiser is paying for access to an audience (as it should be). There is also curation of the ads by the podcast, which is a good thing. I don’t mind ads at all if they’re done well and don’t annoy me. But most internet ads are not done well; they are annoying banners that load dozens of tracking scripts.
I imagine this model could work very well on a voice-driven device. “Hey Alexa, what’s the weather like?” Then Alexa gives you the weather forecast and tells you it might be time to think about winter tires, and there’s a great deal at a shop not far from you. Done correctly, this could be useful. Done poorly, and it’ll cause people to not buy the device.
Even if Internet advertising did take share away from TV and radio, there’s probably only 100% more to earn. Five years of double digit growth and that’s it. It will cease to be a growth industry pretty soon, at least in the US, unless it can grow total ad spending.
Or silos. The same way banks are not using their business insight for their own trading, newspapers are not letting their ad departments influence their editorial content, and politicians aren’t letting PAC donors influence their politics.
It works extremely well. On the PR level ;-p
I’d say the difference between treadmill data and phone data is that the treadmill data is a) anonymous b) transient c) unlinked to anything else. The phone data is none of those.