Can Big Data Make Us Healthier?

Photo of Mobile Health Summit exhibition

I spent some time this week at the Mobile Health Summit, an annual Washington event featuring the latest in mobile health-related technologies. The exhibition hall was filled with sensor-based devices that can track blood pressure, body weight, blood glucose, pill-taking behavior, and just about any other facet of human life. There was even a $199 sled from AliveCor that turns an iPhone into an electrocardiogram.

Nearly all of these devices feature either Wi-Fi or cellular wireless capability, making them part of an ever-growing machine-to-machine network, also known as the Internet of Things. There is little doubt that they are becoming an important part of individualized treatment that can help keep us healthier, albeit at a sometimes creepy loss of privacy. (One company was showing connected motion sensors that could alert a caregiver if they didn’t sense you moving about your room when you were supposed to be up and about. I can see the usefulness, but still find the idea disturbing.)

To be truly useful, the data from these sensors should feed into the patient’s medical record in a way that gives a health care provider a big-picture idea of what is going on. Infrastructure providers, including Verizon, AT&T, and Qualcomm, are building systems that can consolidate data from a variety of sensor sources.

But the question in my mind is whether we can go beyond individual medicine and use the staggering mass of data that will be produced by our quantified futures to improve health in general. The practice of medicine remains, in many ways, stunningly unscientific. Treatments are often selected without solid statistical knowledge of outcomes because data is hard to come by. Many decisions are based more on instinct and custom. What studies do exist too often reach sweeping conclusions on the basis of painfully small numbers of patients involved.

I have no doubt that researchers could gain tremendous insight into medicine, particularly what does and does not work to keep us healthy, if they could use big data approaches to study treatments and outcomes from an aggregation of the information that is starting to flow. However, many challenges–technical, business, and regulatory–have to be met before this can happen.

Today, what data does exist is likely to be stored in completely disconnected silos. Changes in technology, insurance company practice, and government regulations are forcing the adoption of electronic medical records (EMR) at a rapid rate, but EMR systems often cannot talk to each other. If you land in the hospital, you will be very lucky if its records system can communicate directly with your doctor’s. The government’s Centers for Medicare and Medicaid Services, the payment agent for two massive programs, has a vast collection of data on treatments and outcomes hidden away in an assortment of mutually incompatible legacy databases. (CMS has launched a modernization program mandated by Obamacare, but it could take years to bear fruit.)

There are many obstacles in the way of turning a big collection of individual medical records into useful big data. An obvious one is privacy. Medical records are about as sensitive as personal data gets, and we have to make sure that the identity of individuals is not exposed when the information is aggregated. Extensive protections are already in place, most significantly, in the U.S., the Health Insurance Portability and Accountability Act (HIPAA). Some experts, notably Jane Yakowitz Bambauer, fear that excessive concern with making sure that data remain anonymous threatens to cripple valuable research.

There are also major issues in making sense of the data. If you are researching outcomes, it does help to be able to find all the patients with the condition you are studying. But the metadata accompanying today’s medical records, often designed more for the needs of insurance companies and other payers than for doctors, can make identifying the relevant data hard. “People are carefully coding the financial side, but that provides very little help on the clinical side,” says Dr. David Delaney, chief medical officer for SAP Healthcare. A new system called ICD-10 is in use in much of the developed world, but won’t be fully implemented in the U.S. for another two years.

Venture capitalist Vinod Khosla, writing for Fortune, argues that data analytics will eventually replace 80% of what doctors now do. Fortunately for his prediction, Khosla does not put a time on “eventually.” I have no doubt the time will come, but given the myriad difficulties, I suspect it will take a lot longer than any of us would like.






Tech Could Learn from the Election: Big Data Rules

XKCD cartoon

I have wanted to write about the growing importance of big data and analytics for a while, but this is a tech site and I did not want to get it embroiled in the political tempests. But now that the votes have been counted, we have a stunning demonstration of the power of data, as suggested by the XKCD cartoon above.

In the closing days of the campaign, New York Times blogger Nate Silver emerged as a lightning rod. He drew fire from both Republicans and traditional pundits of all stripes for his insistence that, despite the closeness of the polls both nationally and in key states, President Obama was an overwhelming favorite to win re-election by a narrow margin in the popular vote and a much larger one in the electoral college. He was, of course, dead on.

People who are now hailing Silver as the greatest political pundit ever are completely missing the point, because Silver’s approach was about as far as you can get from the seat-of-the-pants, anecdotal methods of pundits. He went with the data.

Silver’s best friend was something known to mathematicians and statisticians as the law of large numbers. What this theorem, first developed in the 18th century, says is that as you take many samples of some phenomenon, say the percentage of voters supporting Obama in Ohio, the results of these tests will cluster around the true value. (The law’s cousin, the central limit theorem, allows much greater specificity about the nature of that clustering, but also is subject to much more stringent restrictions.)

Pundits kept focusing on individual polls and the fact that the difference fell within the margin of error.* Silver understood that as you used more and more polls, the probable error of the aggregated result shrank. In other words (and very roughly) if 10 polls each show Candidate A with a one point lead, you can be reasonably confident that A is in the lead even though the error of each poll is plus or minus three points.
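
The shrinking error can be put in rough numbers. The sketch below is not Silver's method; it assumes the polls are independent, equally precise, and free of shared bias, in which case the margin of error of their average falls with the square root of the number of polls. The function name `aggregated_margin` is made up for illustration.

```python
import math

def aggregated_margin(per_poll_margin, n_polls):
    """Margin of error of the average of n independent, equally precise polls.

    Assumes no shared bias between polls, which real polls rarely satisfy.
    """
    return per_poll_margin / math.sqrt(n_polls)

# Each poll is taken to be good to plus or minus three points on its own.
for n in (1, 5, 10, 20):
    print(f"{n:>2} polls: +/-{aggregated_margin(3.0, n):.2f} points")
```

Under those assumptions, ten polls at plus or minus three points each average out to an error of a bit under one point. Real polls share methods and blind spots, so the true gain is smaller than this idealized figure.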

Silver also used a model, a tool that gave him a way to quantify predictions. A model necessarily involves making a lot of assumptions. For example, Silver’s model (which he has been reasonably transparent about, though he has never released its detailed specification) assigned a fairly heavy weight to key economic indicators early in the campaign, but reduced the weight as time went on, based on the assumption that new data have less effect as election day nears. It also weighted the influence of polls based on their track record and “house effect” (a tendency to favor one party or candidate relative to other polls). He then used the model to run thousands of daily simulations of possible outcomes, a sort of Monte Carlo method. His probability of victory on any given day was simply the percentage of simulations in which a candidate emerged as the winner.
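
To make the simulation idea concrete, here is a toy Monte Carlo sketch. It is emphatically not Silver's model, whose specification was never published. The states, electoral votes, leads, and uncertainty figures are all invented, and the function names are hypothetical. It shows only how a win probability falls out as the share of simulated elections a candidate wins.

```python
import random

# Invented inputs, not real 2012 data: electoral votes, candidate A's
# polling-average lead in points, and the uncertainty of that lead
# (standard deviation, in points).
STATES = {
    "Safe A": (200, 12.0, 3.0),
    "Safe B": (190, -12.0, 3.0),
    "Swing 1": (60, 2.0, 3.0),
    "Swing 2": (50, 1.0, 3.0),
    "Swing 3": (38, -1.0, 3.0),
}
VOTES_TO_WIN = 270  # majority of the 538 electoral votes above

def simulate_once(states, rng):
    """Draw one possible election and return candidate A's electoral votes."""
    # One shared national swing per run, so state errors move together.
    national_swing = rng.gauss(0.0, 2.0)
    total = 0
    for votes, lead, sd in states.values():
        if lead + national_swing + rng.gauss(0.0, sd) > 0:
            total += votes
    return total

def win_probability(states, runs=10_000, seed=0):
    """Fraction of simulated elections in which candidate A reaches 270."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    wins = sum(simulate_once(states, rng) >= VOTES_TO_WIN for _ in range(runs))
    return wins / runs

print(f"P(A wins) = {win_probability(STATES):.1%}")
```

The shared national swing is one simple way to keep state results from being treated as independent coin flips. A real model would also have to estimate the leads and uncertainties from weighted polls rather than take them as given.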

This sort of analysis has only recently become possible. First, we didn’t have the raw data. There were fewer polls, and greater lags between the collection of data and its release. Second, until the recent massive increases in cheap and available computing power, doing thousands of daily runs of a model of any complexity was impossible. A similar phenomenon lies behind the increased accuracy of weather forecasts, including the extremely accurate predictions of the course and effects of Hurricane Sandy. Weather forecasters use supercomputers for their simulations because the models are far more complex, but the techniques and the benefits are much the same.

Too bad we don’t have more data-driven analysis in tech. Of course, there’s the big problem that a lot of the necessary data just isn’t available. Only Apple, Amazon, and Samsung know exactly how many of which products they sell, and they are not inclined to share the information. Still, there are analysts who make the most of the data. Two who come to mind are Horace Dediu of Asymco, who keeps tabs on the handset business, and Mary Meeker of Kleiner Perkins, who provides infrequent but deep data dives. We could badly use more data and less posturing.

*–Poll margin of error is one of the most misunderstood concepts around. First of all, the term should be abandoned. The correct concept is a confidence interval; what you are saying when you claim a margin of error is plus or minus three points is that some percentage of the time (the confidence level, typically 95% in polling) the actual result will be within three points of the reported value. Pollsters should act more like engineers and surround their point values with error bars. Second, the size of the confidence interval is purely a function of the sample size and says nothing whatever about how well a poll is executed. So a poll with poorly put questions and a badly drawn but large sample will have a tighter confidence interval than a much better done poll with a smaller sample.
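
For readers who want to see the arithmetic, the sketch below computes the error bars using the textbook normal approximation for a simple random sample. That is an assumption on my part; actual pollsters may adjust for weighting and survey design. The name `confidence_interval` is made up for illustration. Strictly, the reported share enters the formula too, but near 50% it barely moves the result, which is why sample size dominates.

```python
import math

Z_95 = 1.96  # standard normal quantile for a 95% confidence level

def confidence_interval(share, sample_size, z=Z_95):
    """Return (low, high) for a reported share (0 to 1) from a simple random sample."""
    half_width = z * math.sqrt(share * (1 - share) / sample_size)
    return share - half_width, share + half_width

# The interval narrows as the sample grows, whatever the quality of the poll.
for n in (400, 1000, 2500):
    low, high = confidence_interval(0.50, n)
    print(f"n={n:>4}: 50% reported, interval {low:.1%} to {high:.1%}")
```

A sample of about 1,000 gives the familiar margin of roughly three points. Nothing in the calculation knows whether the questions were well put or the sample well drawn.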

Big Data, Price Discrimination, and Markets

At GigaOm, Derrick Harris has an interesting take on how data analytics are allowing New York landlords to extract maximum rents. It’s a good piece, but I think it just scratches the surface of what is going to become an increasingly important debate about the ethics of big data (where concern until now has focused primarily on privacy issues).

Discrimination in admission prices
A form of price discrimination

Price discrimination is a technical (and value-neutral) term in economics. It refers to sellers charging different prices for a good or service based on factors other than the cost of providing it. In the past, price discrimination was difficult, both because it was hard, as a practical matter, to charge different prices to different customers and because sellers lacked the information they needed to determine a profit-maximizing individual price. Airlines have used discriminatory pricing since deregulation and have gotten increasingly good at it.

But there are a lot of problems inherent in price discrimination. For one thing, it is inherently a distortion of free markets. Efficient markets theory, to the extent that anyone still believes it, assumes that all participants in a market, buyers and sellers, have equal access to the information that goes into price-setting. Price discrimination, at least as practiced in the real world, depends on the seller having information not available to the buyer. Car dealers could engage in price discrimination because only they knew what the wholesale price of the car really was and what prices were on comparable sales to other customers (a power eroded by the web). Airlines and hotels have lots of information about available seats or rooms, marginal costs, and expected demand that lets them vary prices profitably.

The growing ability to collect and analyze vast amounts of data, plus the trend to online sales that allow customized price quotes not possible in brick-and-mortar stores, is bound to produce a lot more price discrimination. Is this necessarily bad for consumers? That’s not clear, although there definitely will be winners and losers. It is also sure to produce growing calls for regulating the practice.