Tech Could Learn from the Election: Big Data Rules

XKCD cartoon

I have wanted to write about the growing importance of big data and analytics for a while, but this is a tech site and I did not want to get it embroiled in the political tempests. But now that the votes have been counted, we have a stunning demonstration of the power of data, as suggested by the XKCD cartoon above.

In the closing days of the campaign, New York Times blogger Nate Silver emerged as a lightning rod. He drew fire from both Republicans and traditional pundits of all stripes for his insistence that, despite the closeness of the polls both nationally and in key states, President Obama was an overwhelming favorite to win re-election by a narrow margin in the popular vote and a much larger one in the electoral college. He was, of course, dead on.

People who are now hailing Silver as the greatest political pundit ever are completely missing the point, because Silver’s approach was about as far as you can get from the seat-of-the-pants, anecdotal methods of pundits. He went with the data.

Silver’s best friend was something known to mathematicians and statisticians as the law of large numbers. What this theorem, first developed in the 18th century, says is that as you take many samples of some phenomenon, say the percentage of voters supporting Obama in Ohio, the results of these samples will cluster around the true value. (The law’s cousin, the central limit theorem, allows much greater specificity about the nature of that clustering, but is also subject to much more stringent restrictions.)
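To make that concrete, here is a minimal sketch in Python. The “true” level of support is an invented number used purely for illustration; the point is only that as the number of simulated respondents grows, the observed share settles ever closer to the underlying value.

```python
# Illustrative sketch of the law of large numbers (the 52% figure is made up).
# As we draw more and more simulated "respondents", the observed share of
# supporters converges on the true underlying share.
import random

random.seed(42)
TRUE_SUPPORT = 0.52  # assumed true share of supporters, for illustration only

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    supporters = sum(random.random() < TRUE_SUPPORT for _ in range(n))
    print(f"n = {n:>9,}: observed share = {supporters / n:.4f}")
```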

Pundits kept focusing on individual polls and the fact that the difference fell within the margin of error.* Silver understood that as you used more and more polls, the probable error of the aggregated result shrank. In other words (and very roughly) if 10 polls each show Candidate A with a one point lead, you can be reasonably confident that A is in the lead even though the error of each poll is plus or minus three points.
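A rough back-of-the-envelope calculation shows why. The sketch below assumes ten hypothetical polls of 1,000 respondents each and treats them as independent simple random samples (which real polls are not), but it captures the basic effect: pooling shrinks the error.

```python
# Rough sketch: why averaging many polls beats reading any one of them.
# Assumes ten independent polls of 1,000 respondents each (hypothetical sizes).
import math

p, n_per_poll, n_polls = 0.51, 1000, 10

se_single = math.sqrt(p * (1 - p) / n_per_poll)              # one poll on its own
se_pooled = math.sqrt(p * (1 - p) / (n_per_poll * n_polls))  # all ten pooled together

print(f"one poll:  +/- {1.96 * se_single * 100:.1f} points at 95% confidence")
print(f"ten polls: +/- {1.96 * se_pooled * 100:.1f} points at 95% confidence")
```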

Silver also used a model, a tool that gave him a way to quantify predictions. Now a model necessarily involves making a lot of assumptions. For example, Silver’s model (which he has been reasonably transparent about, though he has never released its detailed specification) assigned a fairly heavy weight to key economic indicators early in the campaign, but reduced the weight as time went on, based on the assumption that new data have less effect as election day nears. It also involves weighting the influence of polls based on their track record and “house effect” (a tendency to favor one party or candidate relative to other polls). He then used the model to run thousands of daily simulations of possible outcomes, a sort of Monte Carlo method. His probability of victory on any given day was simply the percentage of simulations in which a candidate emerged as the winner.
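To give a flavor of the Monte Carlo step, and only that, here is a toy sketch. The state win probabilities and the “safe” electoral-vote totals are invented, and this is emphatically not Silver’s model; it just shows the general idea of simulating many elections and counting how often one side clears 270 electoral votes.

```python
# Toy Monte Carlo sketch of the approach described above -- NOT Silver's model.
# All probabilities and electoral-vote counts are made-up numbers for illustration.
import random

random.seed(0)

SAFE_EV = {"A": 217, "B": 226}   # electoral votes assumed already locked up (invented)
SWING = {                        # state: (electoral votes, assumed P(candidate A wins))
    "Ohio": (18, 0.75), "Florida": (29, 0.50), "Virginia": (13, 0.70),
    "Colorado": (9, 0.65), "Iowa": (6, 0.80), "Nevada": (6, 0.85),
    "New Hampshire": (4, 0.75), "Wisconsin": (10, 0.90),
}

def simulate_once():
    """One simulated election: flip a weighted coin in every swing state."""
    ev_a = SAFE_EV["A"]
    for ev, p_win in SWING.values():
        if random.random() < p_win:
            ev_a += ev
    return ev_a

RUNS = 10_000
wins = sum(simulate_once() >= 270 for _ in range(RUNS))
print(f"Candidate A wins {wins / RUNS:.1%} of {RUNS:,} simulated elections")
```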

This sort of analysis has only recently become possible. First, we didn’t have the raw data. There were fewer polls, and greater lags between the collection of data and its release. Second, until the recent massive increases in cheap and available computing power, doing thousands of daily runs of a model of any complexity was impossible. A similar phenomenon lies behind the increased accuracy of weather forecasts, including the extremely accurate predictions of the course and effects of Hurricane Sandy. Weather forecasters use supercomputers for their simulations because the models are far more complex, but the techniques and the benefits are much the same.

Too bad we don’t have more data-driven analysis in tech. Of course, there’s the big problem that a lot of the necessary data just isn’t available. Only Apple, Amazon, and Samsung know exactly how many of which products they sell, and they are not inclined to share the information. Still, there are analysts who make the most of the data. Two who come to mind are Horace Dediu of Asymco, who keeps tabs on the handset business, and Mary Meeker of Kleiner Perkins, who provides infrequent but deep data dives. We could badly use more data and less posturing.

*–Poll margin of error is one of the most misunderstood concepts around. First of all, the term should be abandoned. The correct concept is a confidence interval; what you are saying when you claim a margin of error of plus or minus three points is that some percentage of the time (the confidence level, typically 95% in polling) the actual result will be within three points of the reported value. Pollsters should act more like engineers and surround their point values with error bars. Second, the size of the confidence interval is purely a function of the sample size and says nothing whatever about how well a poll is executed. So a poll with poorly worded questions and a badly drawn but large sample will have a tighter confidence interval than a much better-done poll with a smaller sample.
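For the curious, the arithmetic behind that second point is simple. The calculation below assumes a simple random sample and the standard 95% confidence level, and it knows nothing about question wording or sample quality; only the sample size matters.

```python
# The reported "margin of error" depends only on sample size,
# not on how well the poll was conducted.
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the 95% confidence interval for a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (500, 1_000, 3_000, 10_000):
    print(f"sample of {n:>6,}: +/- {margin_of_error(n) * 100:.1f} points")
```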

The Logistic Principle, Not the Law of Large Numbers, Will Slow Apple’s Growth

I cringe whenever I see a reference to the “law of large numbers,” knowing that it is almost certain to be incorrect. Blogger Dr. Drang has already done an excellent takedown of the New York Times’ James B. Stewart for a silly column in which he warned that Apple may be “running up against the law of large numbers.”

A generalized logistic curve

The reason I go overboard on this subject is that the law of large numbers is a very important concept that we cannot afford to lose. It has nothing whatever to do with growth. What it actually says is that as a large number of samples of a random variable are taken from a population, the mean of the samples approaches the expected value of the population. In other (and simplified) terms, the larger your sample the better your estimate of the actual value. This theorem and its more sophisticated cousin, the central limit theorem, are the basis of all sampling, polling, and inferential statistics.

So what do we call the principle that the growth rate of things tends to slow as they get larger? The idea is kind of obvious, which may be why it doesn’t have a name. But then, the law of large numbers, the real one, is somewhat obvious, too. All the best theorems are.

I propose we call it the logistic principle. The name comes from the logistic function, which models how many things grow in the real world. Growth starts out slow, then accelerates rapidly. But at some point, the growth hits some sort of constraint, usually because a resource begins to grow scarce. This causes a sharp slowdown in growth. (See the graph above.) For populations, the constraint is usually food. In the case of Apple, the constraint might be that everyone who wants and can afford an iPad or iPhone already has one (and I think Apple is probably still a long way from the point of inflection).
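For anyone who wants to see the shape rather than take it on faith, here is a minimal sketch of the curve. The parameter values are arbitrary, chosen only to make the S shape visible.

```python
# Minimal sketch of a logistic curve: slow start, rapid middle, flattening near the ceiling.
# All parameter values are arbitrary, for illustration only.
import math

def logistic(t, ceiling=100.0, rate=1.0, midpoint=0.0):
    """Logistic growth toward a ceiling (the constraint described above)."""
    return ceiling / (1 + math.exp(-rate * (t - midpoint)))

for t in range(-6, 7, 2):
    value = logistic(t)
    print(f"t = {t:>3}: {value:6.1f}  {'#' * int(value / 2)}")
```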

At least the law of large numbers exists, even if it doesn’t mean what most people think it does. The law of averages, which is also used to predict the end of a run of success like Apple’s, is completely fanciful and, in fact, is known to statisticians as the gambler’s fallacy.