Can You Be Identified from Anonymous Data? It’s Not So Simple

For the past several years, a highly technical but very important debate has raged among privacy experts: How easy is it to identify an individual from a collection of data that supposedly lacks personally identifiable information? Those who say that it is relatively easy argue for greater restrictions on the release of data and more stringent efforts to anonymize it. Their opponents argue that worrying too much about the risk of “re-identification” deprives researchers of valuable data in fields such as epidemiology.

[Photo: Daniel Barth-Jones (Columbia Univ.)]

A centerpiece of the debate is a 1997 incident in which Latanya Sweeney, then an MIT graduate student and now a computer scientist at Harvard, identified the medical records of Massachusetts Governor William Weld from information publicly available in a state insurance database. The incident led to important changes in privacy rules for medical information, especially under the Health Insurance Portability and Accountability Act (HIPAA), and 15 years later it is still influencing the debate over data privacy.

But a new draft paper by Daniel C. Barth-Jones of Columbia’s Mailman School of Public Health suggests the Weld re-identification may have been a fluke that depended on Weld’s prominence as a public figure who suffered a highly publicized medical incident. For ordinary people, the paper argues, the risk of such an identification is much lower.

The facts of the Weld case are not in dispute. On May 18, 1996, Weld collapsed at a public event. He was diagnosed with the flu and released after a brief hospitalization. With knowledge of Weld’s date of birth and zip code, Sweeney was able to locate Weld’s records in the Massachusetts Group Insurance Commission (GIC) database and verify them with a cross-reference to Cambridge voter-registration records.

Barth-Jones points out that this is not at all the same as finding the records of an arbitrary individual. For one thing, the date and place of Weld’s hospitalization were public information. But his bigger concern is with the use of voter registration data to confirm that the GIC records, which had no names associated with them, were truly Weld’s. The danger of a false positive arises from the possibility that other individuals in the zip code share the same date of birth but are not registered to vote in Cambridge, and therefore do not appear on the voter list. A match that looks unique on the voter list may not be unique in the actual population. In fact, Barth-Jones concludes that there was only a 62% to 66% probability that the re-identification of Weld through the voter list was correct. Nor is there a better list to check against: the U.S. has no universal listing of citizens by name, let alone by name and address. Voter lists are about as good as it gets.
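The false-positive argument can be made concrete with a rough calculation. The sketch below is not Barth-Jones’s method and does not reproduce his 62% to 66% estimate; it is a simplified model with assumed inputs. It assumes the number of unlisted residents who share the target’s birth date and sex follows a Poisson distribution, and that the anonymous record is equally likely to belong to the listed match or to any of those unlisted look-alikes. The function name, parameters, and example numbers are all hypothetical.

```python
import math


def match_confidence(unlisted_residents: int, share_prob: float) -> float:
    """Expected chance that a unique voter-list match is the right person.

    Simplifying assumptions (not from the article):
      * K, the number of unlisted residents sharing the target's
        birth date and sex, is Poisson with mean
        lam = unlisted_residents * share_prob.
      * The record is equally likely to belong to the listed match
        or to any of the K unlisted look-alikes.
    Under these assumptions the answer is
        E[1 / (1 + K)] = (1 - exp(-lam)) / lam.
    """
    lam = unlisted_residents * share_prob
    if lam == 0:
        return 1.0  # nobody is missing from the list, so the match is certain
    return -math.expm1(-lam) / lam  # -expm1(-lam) equals 1 - exp(-lam)


if __name__ == "__main__":
    # Hypothetical inputs: 30,000 residents missing from the voter list,
    # birth dates spread evenly over 70 years, two sexes.
    share_prob = 1 / (2 * 365 * 70)
    print(f"{match_confidence(30_000, share_prob):.2f}")
```

The point of the sketch is the direction of the effect: the more residents are missing from the reference list, the less a seemingly unique match can be trusted.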

This is not merely an academic dispute. Barth-Jones credits Sweeney’s 1997 paper with bringing about needed limitations on the amount of medical data made public under HIPAA. For example, published records may now contain only the first three digits of a zip code rather than all five, and only the year of birth rather than the full date. This means a much larger number of individuals will share the same basic information, making successful identification far more difficult.
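The effect of coarsening these fields can be seen in a toy example. The sketch below uses six made-up records (the zip codes and birth dates are invented for illustration, not taken from any real data set) and compares the smallest group of records sharing the same key before and after generalization. The helper names are hypothetical.

```python
from collections import Counter


def generalize(record):
    """Coarsen (five-digit zip, 'YYYY-MM-DD') to (three-digit zip, 'YYYY')."""
    zip5, dob = record
    return (zip5[:3], dob[:4])


def smallest_group(keys):
    """Size of the smallest group of records that share the same key."""
    return min(Counter(keys).values())


# Made-up records for illustration only.
records = [
    ("02138", "1950-04-12"),
    ("02139", "1950-02-11"),
    ("02138", "1950-11-02"),
    ("02141", "1962-03-09"),
    ("02139", "1962-08-21"),
    ("02140", "1962-12-25"),
]

# With full zip code and full birth date, every record is unique.
print(smallest_group(records))  # 1

# With three-digit zip and birth year only, each record hides among three.
print(smallest_group(generalize(r) for r in records))  # 3
```

A record that is one of a kind can be matched against an outside list; a record that looks identical to several others cannot be pinned to a single person by these fields alone.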

Privacy advocates, however, sometimes continue to argue as if these changes had never been made and as if the Weld re-identification were a typical event. For example, Paul Ohm of the University of Colorado Law School argues, “Thus, easy, cheap, powerful re-identification will cause significant harm that is difficult to avoid. Faced with these daunting new challenges, regulators must find new ways to measure the risk to privacy in different contexts.” Barth-Jones maintains that re-identification is nowhere near as cheap and easy as Ohm suggests.

And there is a cost to further restrictions on the availability of data. In a paper entitled “The Tragedy of the Data Commons,” Jane Yakowitz Bambauer of the Brooklyn Law School, who has collaborated with Barth-Jones, writes: “[A]nonymized data is crucial to beneficial social research, and constitutes a public resource–a commons–under threat of depletion…. [S]ince current privacy policies overtax valuable research without reducing any realistic risks, law should provide a safe harbor for the dissemination of research data.”