Klout and Hadoop, the Pros and Cons

Everybody wants to improve their Klout score, a value that represents a user’s influence across their social network. And while Klout describes to its users how the number is calculated, few people understand how the platform behind the score really works. To lend some insight, Dave Mariani, vice president of engineering at Klout, joined John Furrier and Jeff Kelly at The Cube, broadcasting during Hadoop Summit 2012 in San Jose, Calif. (full video below).

Mariani explained how Hadoop’s distributed file system has enabled his start-up not just to process data, but also to store it cheaply. Hadoop is horizontally scalable, meaning that an organization that wants more capacity or faster processing can add machines to its Hadoop cluster without changing anything in the underlying software.

Hadoop lets small companies wrestle with huge amounts of data. Klout prefers to work with Hadoop inside its own hosted data center, but for organizations lacking the resources that Klout has at its disposal, Hadoop can run on top of Amazon EC2. “It’s very inexpensive and very easy out of the gate to get scale,” Mariani said. “We can’t do what we’re doing without Hadoop. We’re out of business without that infrastructure.”

But Mariani wasn’t shy about naming what he sees as Hadoop’s current limitations, and what he would like to see from the open-source framework going forward. In a nutshell, platforms like Hadoop (or HBase and Hive, for that matter) lack robust business intelligence capabilities. “You still need schemas on the unstructured data to get the most out of it,” Mariani said.
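To make the point about schemas concrete, here is a minimal sketch of applying a schema at read time to raw, loosely structured records. Everything in it is an assumption for illustration: the Signal type, its field names (user_id, network, score_delta) and the sample lines are hypothetical and are not drawn from Klout's actual data or pipeline.

```python
import json
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical schema for one social 'signal' (illustrative fields only)."""
    user_id: str
    network: str
    score_delta: float


def read_with_schema(raw_lines):
    """Parse raw text lines into typed Signal records, skipping lines that do not fit."""
    parsed = []
    for line in raw_lines:
        try:
            obj = json.loads(line)
            parsed.append(
                Signal(
                    user_id=str(obj["user_id"]),
                    network=str(obj["network"]),
                    score_delta=float(obj["score_delta"]),
                )
            )
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, a missing field, or a wrong type: the line stays
            # in raw storage but is left out of the structured view.
            continue
    return parsed


# Made-up sample data: two well-formed lines, one missing fields, one not JSON.
raw = [
    '{"user_id": "u1", "network": "twitter", "score_delta": 0.5}',
    '{"user_id": "u2", "network": "facebook", "score_delta": "1.25"}',
    '{"user_id": "u3"}',
    'not json at all',
]
signals = read_with_schema(raw)
total_delta = sum(s.score_delta for s in signals)
```

Once the records have a shape, questions such as "total score change per network" become simple aggregations, which is roughly what a business intelligence layer needs from the data.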

A company like Klout, which collects a billion “signals” from its registered users every day, craves real-time business intelligence to develop better social media analytics, which should ultimately lead to more satisfied customers and larger profits. The problem is that Hadoop is a batch processing system that struggles in the “real-time world,” Mariani said. As a result, he is waiting for developers to create analytical engines that run on top of Hadoop and let it perform interactive queries.

In the meantime, Klout turns to SQL Server Analysis Services to conduct that sought-after business intelligence. But Mariani would love to see this functionality available in Hadoop. “If you think about what makes Hadoop so great, when you store a piece of data — let’s just say it’s a file — it appears virtually to you as a file…but that actually is distributed across as many nodes as you have in the cluster…So when I do a query…it’s a massive parallel table scan across all these individual hard disks that are out there that I get to take advantage of…So that’s what I want to do with [business intelligence]…versus trying to pipe it and load it into something else.”
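The parallel table scan Mariani describes can be illustrated with a toy model. This is only a sketch under stated assumptions: it uses threads on one machine to stand in for cluster nodes, and the function names (split_into_blocks, scan_block, parallel_scan) and the sample records are hypothetical. It is not Hadoop's API and not Klout's code.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, num_nodes):
    """Spread records round-robin over num_nodes simulated nodes."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def scan_block(block, predicate):
    """Scan one node's share of the data and count the matching records."""
    return sum(1 for record in block if predicate(record))


def parallel_scan(records, num_nodes, predicate):
    """Scan every block concurrently, then combine the partial counts."""
    blocks = split_into_blocks(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_counts = list(pool.map(lambda b: scan_block(b, predicate), blocks))
    return sum(partial_counts)


# Made-up records: even-numbered ones are tagged "twitter", odd ones "facebook".
signals = [
    {"user": f"user{i % 5}", "network": "twitter" if i % 2 == 0 else "facebook"}
    for i in range(1000)
]
twitter_count = parallel_scan(
    signals, num_nodes=4, predicate=lambda s: s["network"] == "twitter"
)
```

The query sees one logical data set, while the work is divided among however many nodes hold a piece of it. Adding nodes shrinks each node's share, which is the property Mariani wants business intelligence queries to use directly instead of loading the data into a separate system.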