Open Data’s dirty secret: poor quality and potential for abuse

This week the government released data on more than 4.4 million medical payment records in line with its Open Payments initiative, and it’s already come in for some strong criticism from the media.

It’s a reminder that open data, while free, is usually far from perfect.

The controversial release, which is required under the Affordable Care Act, has been criticized by the likes of Forbes, NPR and The Wall Street Journal as vague and confusing. The records show more than $3.5 billion in payments made to doctors by device companies and pharmaceutical manufacturers, yet there are two main issues with the data.

First, it offers no context. The records don’t show if the payments represent legitimate financial relationships or conflicts of interest, says Fierce Health IT. Second, one-third of the payment records submitted last year have been omitted due to problems with the data that could lead to mistaken identification, according to NPR. Moreover, ProPublica reports that 64 percent of the current data doesn’t specify which hospital or doctor received the money.

“The release could be viewed two ways: as a detailed view of the underbelly of U.S. medicine, or a flawed, sloppy release of partial information that will confuse rather than elevate understanding,” suggests Politico in an opinion piece.

In reality, it’s probably somewhere between the two. There might be quality problems with the data, but ProPublica’s article still reveals some noteworthy insights into public health spending trends.

This also isn’t the first time an open data release has been criticized. When Germany’s open data portal came online, it was quickly slammed by critics who realized they had to specifically request many of the data sets – and were sometimes charged for the privilege. Australia’s effort fared little better, being described as “patchy and transitional” by the country’s own Information Commissioner. Meanwhile, India’s open data is looking somewhat sparse, according to DNA India’s Shyamanuja Das, who moaned it “has just about 115 datasets; what’s worse, all those datasets are from only 11 government department/agencies.”

These incidents highlight two fundamental flaws with open data. The first is quality: raw data is often incomplete or inaccessible, and therefore unusable.

The second flaw is the potential for misuse, as illustrated by the recent release of historical trip and fare data from New York City taxis. A Freedom of Information request made by blogger Chris Whong yielded details of 173 million trips made by the yellow cabs, including the driver’s ID, GPS coordinates of both pick-up and drop-off locations, trip times and passenger numbers.

But it didn’t take long for someone to decipher the drivers’ IDs. Then someone else showed how it was possible to match celebrities’ journeys with their drivers. That led to the discovery that some of those people were picked up outside the city’s strip joints.
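How the IDs were deciphered isn’t spelled out here, but one general weakness is easy to illustrate. If an identifier comes from a small set of possible values and is simply replaced by an unsalted hash, anyone can hash every possible value and look the original up. The Python sketch below assumes a hypothetical ID format (digit, letter, digit, digit) and MD5 as the hash; both are assumptions for illustration, not details taken from the taxi release.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def pseudonymize(driver_id: str) -> str:
    """Replace an ID with its unsalted MD5 digest (a weak form of anonymization)."""
    return hashlib.md5(driver_id.encode("ascii")).hexdigest()


def build_lookup() -> dict[str, str]:
    """Hash every ID in the assumed format: digit, letter, digit, digit."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[pseudonymize(candidate)] = candidate
    return table


lookup = build_lookup()                 # only 10 * 26 * 10 * 10 candidates to hash
released_value = pseudonymize("7B42")   # hypothetical value in a published record
recovered = lookup[released_value]      # reversed with a plain dictionary lookup
print(len(lookup), recovered)
```

With so few candidates, the whole table takes a fraction of a second to build, which is why hashing alone offers little protection for short, structured identifiers.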

Clearly the potential for misuse applies to almost every data set, no matter what efforts are made to anonymize it. As analyst Alistair Croll pointed out back in 2012, “Big Data is our generation’s civil rights issue, and we don’t know it.”

Maybe we will find out the hard way.

photo credit: European Parliament via photopin cc