AOL just unwittingly released private, personally-identifiable data for 650,000 of its subscribers when it posted a large chunk of its search logs (20 million queries, actually) to its research website as a service to the scientific community.
Although AOL anonymized the user IDs, the search queries often include information that makes it easy to associate them with a person. The query data include social security numbers, credit card numbers, porn queries, evidence of intent to engage in criminal activities, etc.
AOL has since removed the data, but it’s spreading like wildfire over the internet on mirrors and torrents. I was able to retrieve a complete copy of it (2 gigabytes, uncompressed) in about an hour.
As a scientist who does research that would really benefit from data like this, I can tell you: this is big. Big and dirty.
Ethically speaking, should we, as researchers, ignore that this data exists, or deal with it pragmatically as an unfortunate accident?
On one hand this is extremely useful and compelling data for a host of social and computer sciences; on the other, it is an unequivocally criminal violation of ethical standards.
Given the Google subpoena, big brother NSA, and the ethical debates about scientific research this story is provoking in mass media, this feels like a watershed moment.
No one can ever create a ‘clean’ version of this data since it could always be traced back to the original, identifiable information.
Here’s a possible scenario:
Most scientists will hesitate to research it, but some rebels will and no doubt find interesting, at-first-unpublishable, results. Sooner or later, something will get published, and then the floodgates will open. Because something can’t be unethical if everyone is doing it.