Top 5 Open Source Projects in Big Data – Breaking Analysis

Big Data is a booming area that is receiving more widespread attention, especially since technology research company Gartner has projected that Big Data will drive $34 billion in IT spending in 2013. Abhishek Mehta, founder of Tresata, joined Kristin Feledy on the Morning NewsDesk Show to give his perspective on what’s happening in Big Data.

Abhi noted that a number of open source projects are emerging in Big Data and shared his top five: Trevni, Spark, D3, Impala and HCatalog. He gave a brief rundown of why each project matters and explained that he groups them into “pockets” rather than ranking them by impact or popularity. Calling them “tools for data scientists,” he described them as infrastructure tools that make emerging Big Data technologies more relevant.

“These Open Source projects by themselves are making Big Data a lot more easy to consume by the business user,” he said.

The most important thing about Trevni is that it’s Doug’s project, he half-joked, referring to Doug Cutting, the founder of several open source projects including Hadoop and Lucene. “Everything Doug seems to touch turns to gold,” said Abhi. Trevni is a columnar file format for HDFS, modeled on Google’s Dremel engine and designed for very fast data retrieval.
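As a rough illustration of why a columnar layout can speed up retrieval, here is a minimal sketch in plain Python. It does not use Trevni or its API; the sample records, field names and the `to_columns` helper are hypothetical and exist only to contrast a row layout with a column layout.

```python
# Hypothetical records; the field names are made up for illustration.
rows = [
    {"user": "a", "country": "US", "amount": 10.0},
    {"user": "b", "country": "DE", "amount": 4.5},
    {"user": "c", "country": "US", "amount": 7.25},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per field (a columnar layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record has to be visited to total a single field.
row_total = sum(record["amount"] for record in rows)

# Columnar layout: only the "amount" column is read; other fields are skipped.
column_total = sum(columns["amount"])
```

Both totals are the same, but the columnar version touches only the values it needs, which is the general idea behind column-oriented file formats.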

Spark is a relatively new tool, which Abhi described as an extremely fast cluster computing system that supports both iterative machine learning algorithms and interactive data management.

At the more mature end of these projects is D3, which has been around for a while, and Abhi believes it is the most interesting visualization platform. With its cross-hardware platform capability and a constantly growing library of visualizations, it addresses the most important data science question: how to communicate results.

“I think Impala is going to rapidly become the universal data access mechanism going forth,” said Abhi, adding that it’s powered on the back end by Trevni. He said he agrees with Cloudera that Impala makes existing visualization tools extremely performant on Hadoop.

Rounding out Abhi’s top five list is HCatalog, which he said is probably the least talked about project but still the most needed in Big Data. HCatalog is an open source metadata management framework being developed by Hortonworks. Abhi declared, “A metadata management framework that is fundamentally open and works across all of your data in HDFS is much needed, and that’s the problem HCatalog solves.”

Abhi also gave an honorable mention to Google Spanner, which he compared to Hadoop. The white paper on Google Spanner was released in December, and Abhi said it’s the first global, consistent database that solves two big challenges in Open Source and data management frameworks. See the entire segment with Kristin Feledy and Abhishek Mehta on the Morning NewsDesk Show.