As Jeff Kelly pointed out, there’s big money being invested in big data. And John Fritz has warned us that with big data comes big responsibility. But what’s driving this level of investment, and what’s putting so much power into the hands of data scientists, is big expectations. Big data applications are being used for everything from cancer research to advertising to predicting the necessity of hospitalization. Customers are recognizing the value of data and demanding that companies use it to better serve them.
But is big data being set up to fail by unrealistic expectations? Here are a few of the limitations of big data.
1. The Need to Ask the Right Questions
One of the best talks from the O’Reilly Strata Summit on the Business of Data was J.C. Herz’s talk “Analytics and Witch-Doctoring: Why Executives Succumb to the Black Box Mentality.” You can watch the first few minutes of the presentation on the O’Reilly site, but the gist of it is: executives keep coming to Herz’s e-commerce analytics company and asking for “analytics,” without a clear idea of what they want these analytics systems to actually do. The money line: “Even tarot requires a question.”
And that’s part of the problem. We may be accumulating vast amounts of data, and have more processing power than ever, but we still have human limitations in terms of what sorts of questions to ask and what to search for.
2. The Need to Search for the Right Things
One of my predictions last year was “Predictive analytics will be applied to more business processes, regardless of whether it helps.” I guess what I was really predicting here was that we’d see some areas where predictive analytics failed, or at least didn’t help much, but was still being used.
I can’t say that I really saw this, but Ralph Losey’s piece on the accuracy of e-discovery software comes close. I’ve often linked to John Markoff’s New York Times piece on how e-discovery tools from companies like Autonomy and Clearwell are replacing armies of lawyers and paralegals as evidence of the ways that information technology is displacing human workers.
Losey mostly focuses on the trouble with keyword search, rather than a more holistic approach that includes machine learning and predictive analytics. “When large data sets are involved, no human is smart enough to guess the right keywords,” he writes.
3. The Need for Human Verification
But even with the more advanced predictive analytics processing models in place, there’s still a big problem with these e-discovery systems: no one really knows how accurate they are, because the only way to determine their accuracy is to do a human review. And human reviews are also inaccurate. E-discovery persists because it’s the “least bad” option. This is an analytics problem that will carry over into other fields.
Big data is supposed to liberate us from the need to do sampling, because we can just analyze an entire data set. But if we want to do any sort of quality assurance on machine accuracy we have to resort to sampling, except in cases where the entire data set can be checked by a human.
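To make the sampling point concrete, here is a minimal sketch of how a quality check on machine accuracy might look. It is an illustration, not anyone's actual e-discovery workflow: the names `machine_labels`, `human_review` and `estimate_accuracy` are hypothetical, and the sketch treats the human reviewer as ground truth even though, as noted above, human reviews are also inaccurate. The Wilson score interval is one common way to put error bars on a sampled proportion.

```python
import math
import random


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def estimate_accuracy(machine_labels, human_review, sample_size, seed=0):
    """Estimate machine accuracy from a random, human-reviewed sample.

    machine_labels: dict mapping document id -> label assigned by the machine
    human_review:   callable taking a document id and returning the human's
                    label (hypothetical; assumed correct for this sketch)
    """
    rng = random.Random(seed)
    # Draw a simple random sample of document ids without replacement.
    sample_ids = rng.sample(sorted(machine_labels), sample_size)
    correct = sum(machine_labels[i] == human_review(i) for i in sample_ids)
    return correct / sample_size, wilson_interval(correct, sample_size)
```

Reviewing 100 sampled documents and agreeing with the machine on 90 of them gives an estimated accuracy of 90 percent, with an interval of roughly 83 to 94 percent. The entire data set was processed by the machine, but our confidence in the result still rests on a sample.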
There are further uncertainties involved. How do you know for certain that the calculations are complete? That the data collected from an environmental sensor is accurate?
Some models are testable through simple outcomes. Click-throughs and conversions improve or they don’t. Employee turnover decreases or it doesn’t. Once a model is finely tuned, it absolutely can produce results and can be easily verified by humans. But what happens when something unpredictable happens?
4. Black Swans
The idea of a “black swan event” comes from Nassim Taleb. He named it after the discovery of black swans in Australia, before which it was assumed that all swans were white. But the story of the Thanksgiving Day turkey might be the best example. Every day for almost a year the turkey is fed and well kept. Based on past behavior, the turkey can predict that its future will be more of the same. It has a perfectly functioning model. Until the day the turkey is slaughtered by its own keepers. It only takes that one outlier to ruin the entire model.
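The turkey's reasoning can be sketched in a few lines. This is my own illustration rather than anything from Taleb: it uses Laplace's rule of succession, a textbook way of estimating the chance of one more success after an unbroken run, and it assumes 364 days of feeding to stand in for "almost a year."

```python
from fractions import Fraction


def rule_of_succession(successes, trials):
    """Laplace's estimate of the probability that the next trial succeeds."""
    return Fraction(successes + 1, trials + 2)


# Assumption for illustration: the turkey has been fed on each of 364 days.
days_fed = 364
confidence = rule_of_succession(days_fed, days_fed)
print(f"Estimated chance of being fed tomorrow: {float(confidence):.4f}")
```

The estimate comes out at 365/366, or about 99.7 percent, and it is at its highest on the morning before Thanksgiving. Nothing in the turkey's data hints at the one event that matters.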
One of the most intriguing areas for predictive analytics is law enforcement. Earlier this year Slate covered the rise of predictive policing, which, it is admitted, will only work for fairly predictable crimes like burglary and car theft. But what happens when there’s an outlier, like the riots in London last summer? A predictive policing system may have been able to predict increased crime in certain neighborhoods during summer months, but it couldn’t have prepared London police for what actually happened. Fortunately, it doesn’t sound like police departments are going to over-rely on this sort of data modeling.
But what happens to your business analytics model when a volcano in Iceland disrupts travel across the European continent? That’s a rare event, but it’s one that can seriously disrupt nearly every aspect of some businesses.
5. The Non-Quantifiable
In a blog post earlier this year Greg Borenstein contrasted the enthusiasm for big data at O’Reilly Foo Camp 2011 with the very human way that O’Reilly actually operates as a company:
O’Reilly’s process relies almost exclusively on human traits that aren’t represented in data or reproduced in a model: the trust between two peers that allows them to talk about a crazy idea that one of them is thinking about taking on outside of work, the ability to tell who’s highly respected in a field by the tone of voice people use when mentioning a name, the gut instinct of an experienced industry visionary for what will happen next, etc.
Sites like Technorati and Klout attempt to quantify these sorts of relationships and turn them into something that can be processed and analyzed, but human relationships remain difficult to quantify and categorize. Another difficulty here is the “decline effect,” which Jonah Lehrer has written about. Simply stated, there’s a tendency for support for scientific claims to decrease over time. Behavioral science tends to be particularly vulnerable. This is likely due in part to publication bias, but also in part to the inability to anticipate and control for all variables that relate to human behavior. And this same inability to anticipate variables makes quantification and modeling of human behavior very difficult.
There is seemingly no area in which we can’t benefit from analyzing data: law enforcement, business, politics, economics, sports, even sex… and yet there is a real risk in overestimating our ability to use data to make accurate predictions. Big data is constrained by human limitations. One thing to watch in 2012 is the expectation, held by everyone from CEOs to politicians to individuals, that big data will produce results. Big data can solve big problems, but we must always remember not to put too much trust in our ability to predict the future.