New Nielsen Study Shows 75.6% of All Statistics are Fake

The headlines in the media are filled with the latest stats.  Stats sell.  The stats are often quoted from the latest reports.  People then parrot them around like they’re fact when most of them are complete bullsh*t.  People throw them around at cocktail parties.  Often when they do I throw out my favorite statistic:  73.6% of all statistics are made up.  I say it deadpan.  Often some people will look at me like, “really?”  “It’s true. Nielsen just released the number last month.”

No.  It’s irony.

Or as Mark Twain popularized the quote most often attributed to Benjamin Disraeli, Prime Minister of Great Britain, “there are three kinds of lies: lies, damned lies and statistics.”  The quote is meant to highlight the deceptive but persuasive power of numbers.

So, where is this all coming from, Mark?  What are you on about?  Anyone with a great deal of experience in dealing with numbers knows to be careful about the seduction of them.  I’m writing this post to make sure you’re all on that same playing field.

Here’s how I learned my lesson:

I started my life as a consultant.  Fortunately I was mostly a technology consultant, which meant that I coded computers, designed databases and planned system integration projects.  OK, yes.  It was originally COBOL and DB2 – so what? But for my sins I got an MBA and did “strategy” consulting.  One of our core tasks was “market analysis,” which consisted of: market sizing, market forecasts, competitive analysis and then instructing customers on which direction to take.

It’s strange to me to think that customers with years of experience would ever listen to twenty-something smarties from great MBA programs who have never worked in their industry before – but that’s a different story.  Numbers are important.  I’d rather make decisions with uncertain numbers than no numbers.  But you have to understand how to interpret your numbers.

In 1999 I was in Japan doing a strategy project for the board of directors of Sony.  We were looking at all sorts of strategic decisions that Sony was considering, which required analysis and data on broadband networks, Internet portals and mobile handsets/networks.  I was leading the analysis with a team of 14 people: 12 Japanese, 1 German and 1 Turk.  I was the only one whose Japanese was limited to just a sushi menu.

I was in the midst of sizing the mobile handset markets in 3 regions: US, Europe and Asia.  I had reports from Gartner Group, Yankee Group, IDC, Goldman Sachs, Morgan Stanley and a couple of others.  I had to read each report, synthesize it and then come up with our best estimate of the markets going forward.  In data analysis you want to look for “primary” research, which means going to the person who initially gathered the data.

But all of the data projections were so different that I decided to call some of the research companies and ask how they derived their data.  I got the analyst who wrote one of the reports on the phone and asked how he got his projections.  He must have been about 24.  He said, literally, I sh*t you not, “well, my report was due and I didn’t have much time.  My boss told me to look at the growth rate average over the past 3 years and increase it by 2% because mobile penetration is increasing.”  There you go.  As scientific as that.

I called another agency.  They were more scientific.  They had interviewed telecom operators, handset manufacturers and corporate buyers.  They had come up with a CAGR (compound annual growth rate) that was 3% higher than the other report, which in a few years makes a huge difference.  I grilled the analyst a bit. I said, “So you interviewed the people to get a plausible story line and then just did a simple estimation of the numbers going forward?”

“Yes. Pretty much.”

Me, sarcastically, “And you had to show higher growth because nobody buys reports that just show that next year the same thing is going to happen that happened last year?”  Her, “um, basically.”

“For real?” “Well, yeah, we know it’s going to grow faster but nobody can be sure by how much.”  Me, “And I suppose you don’t have a degree in econometrics or statistics?”  Her, “No.”
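To see why a few points of CAGR matter so much, here is a minimal sketch of the compounding arithmetic.  The starting market size and both growth rates are made-up numbers for illustration (they are not from any of the reports), and I’m assuming “3% higher” means three percentage points.

```python
def project(base: float, annual_growth: float, years: int) -> float:
    """Compound a starting value forward at a fixed annual growth rate."""
    return base * (1 + annual_growth) ** years


# Hypothetical inputs: a market indexed to 100 and two CAGRs
# that differ by three percentage points.
base_size = 100.0
low_cagr, high_cagr = 0.20, 0.23

for years in (1, 3, 5, 10):
    low = project(base_size, low_cagr, years)
    high = project(base_size, high_cagr, years)
    gap = high / low - 1  # how much bigger the optimistic forecast is
    print(f"{years:>2} years: {low:8.1f} vs {high:8.1f}  (gap {gap:.1%})")
```

With these assumed rates the two forecasts start out close and drift apart every year, so the same “small” disagreement turns into a very different market size by the end of a five or ten year plan.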

I know it sounds like I’m making this sh*t up but I’m not.  I told this story to every consultant I knew at the time.  Nobody was surprised.  I wish it ended there.

The problem of amplification:

The problem got worse as the data flowed out to the “bulge bracket” investment banks.  They, too, were staffed with super smart twenty somethings.  But these people went to slightly better schools (Harvard, Stanford, Wharton, University of Chicago) and got slightly better grades.  They took the data from the analysts.  So did the super bright consultants at McKinsey, Bain and BCG.  We all took that data as the basis for our reports.

Then the data got amplified.  The bankers and consultants weren’t paid to do too much primary research.  So they took 3 reports, read them, put them into their own spreadsheet, made fancier graphs, had professional PowerPoint departments make killer pages and then at the bottom of the graph they typed, “Research Company Data and Consulting Company Analysis” (fill in brand names) or some derivative.  But you couldn’t just publish exactly what Gartner Group had said so these reports ended up slightly amplified in message.

Even more so with journalists.  I’m not picking on them.  They were as hoodwinked as everybody was.  They got the data feed either from the research company or from the investment bank.  And if anybody can’t publish something saying “just in, next year looks like a repeat of last year” it’s a newspaper.  So you end up with superlative amplification.  “Mobile penetration set to double next year reaching all time highs,” “venture capital market set to implode next year – more than 70% of firms may disappear” or “drug use in California growing at an alarming rate.”  We buy headlines.  Unless it’s a major publication there’s no time to fact check data in a report.  And even then …

The problem of skewing results:

Amplification is one thing.  It’s taking flawed data and making it more extreme.  But what worries me much more is skewed data.  It is very common for firms (from small ones to prestigious ones) to take data and use it conveniently to make the point that they want to make.  I have seen this so many times I consider it routine, which is why I question ALL data that I read.

How is it skewed?  There are so many ways to present data to tell the story you want that I can’t even list every way data is skewed.  Here are some examples:

- You ask a small sample set so that the data isn’t statistically significant.  This is often naivete rather than malice.

- You ask a group that is biased.  For example, you ask a group of prisoners what they think of the penal system, you ask college students what they think about the drinking age or you ask a group of your existing customers what they think about your product rather than people who cancelled their subscription.  This type of statistical error is known as “selection bias.”

- Also common, you look at a large data set of questions asked about consumer preferences.  You pick out the answers that support your findings and leave the ones that don’t out of your report.  This is an “error of omission.”

- You change the specific words that were asked in the survey when you write up the results, subtly changing the meaning for the person reading your conclusions.  Small changes in wording can totally change the way that the reader interprets the results.

- Also common is that the survey itself asks questions in a way that leads the responder to a specific answer.

- There is malicious data, such as on Yelp, where a competitor might type in bad reviews to bring you down, or maliciously positive data, like on the Salesforce.com AppExchange, where you get your friends to rate your app 5 out of 5 so you can drive your score up.

That doesn’t happen? “I’m shocked, shocked to find that gambling is going on here.”  We all know it happens.  As my MBA statistics professor used to say, “seek disconfirming evidence.”  That always stuck with me.
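On the small sample point in the list above, a rough way to check a survey is to estimate its margin of error.  This is a minimal sketch using the standard normal approximation for a proportion at roughly 95% confidence.  It assumes a simple random sample, and the sample sizes are made up for illustration.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    p: observed share answering "yes" (0.5 is the worst case)
    n: number of respondents
    z: z-score for the confidence level (1.96 is roughly 95%)
    """
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical sample sizes, worst-case split of 50/50.
for n in (25, 100, 1000):
    moe = margin_of_error(0.5, n)
    print(f"n={n:>4}: 50% plus or minus {moe:.1%}")
```

A bigger sample only shrinks the random error.  It does nothing for a biased group: ask 10,000 of your happiest customers and you get a very precise wrong answer.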

Believing your own hype:

And this data subtly sinks into the psyche of your company.  It becomes folklore.  13% of GDP is construction – the largest industry.  40% of costs are labor, 40% are materials and 20% are overheads.  23% of all costs are inefficient.  18% of all errors come from people using the wrong documents. 0.8 hours are spent every day by workers searching for documents.

It’s important to quantify the value of your product or service.  I encourage it.

You’ll do your best to market the benefits ethically while still emphasizing your strong points.  Every investment banker I know is “number 1” in something.  They just define their category tightly enough that they win it.  And then they market the F out of that result.  That’s OK.  With no numbers as proof points few people will buy your products.

Obviously try to derive data that is as accurate as possible.  And be careful that you don’t spin the numbers for so long and so hard that you can’t separate out marketing estimates from reality.  Continually seek the truth in the form of better customer surveys, more insightful market analyses and more accurate ROI calculations.  And be careful not to believe your own hype.  It can happen.  Being the number one investment bank in a greatly reduced data set shouldn’t stop you from wanting to broaden the definition of “number 1” next year.

Here’s how to interpret data:

In the end make sure you’re suspicious of all data.  Ask yourself the obvious questions:

- who did the primary research on this analysis?

- who paid them? Nobody does this stuff for free.  You’re either paid up front (“sponsored research”) or you’re paid on the back end by clients buying research reports.

- what motives might these people have had?

- who was in the sample set? how big was it? was it inclusive enough?

- and the important thing about data for me … I ingest it religiously.  I use it as one source for figuring out my version of the truth.  And then I triangulate.  I look for more sources if I want a truer picture.  I always try to think to myself, “what would the opposing side of this data analysis use to argue its weaknesses?”

Statistics aren’t evil.  They’re just a bit like the weather – hard to really predict.

And as they say about economists and weathermen – they’re the only two jobs you can keep while being wrong nearly 100% of the time.