Top 5 Open Source Projects in Big Data – Breaking Analysis

Big Data is a booming area receiving ever more widespread attention, especially since technology research firm Gartner projected that Big Data will drive $34 billion in IT spending in 2013. Abhishek Mehta, founder of Tresata, joined Kristin Feledy on the Morning NewsDesk Show to give his perspective on what’s happening in Big Data.

Abhi noted that a number of open source projects are emerging in Big Data and shared his top five list: Trevni, Spark, D3, Impala and HCatalog. He gave a brief rundown of why these projects are important, explaining that he would group them into “pockets” rather than rank them by impact or popularity. Calling them “tools for data scientists,” he elaborated that these are infrastructure tools that make emerging Big Data technologies more practical to use.

“These Open Source projects by themselves are making Big Data a lot more easy to consume by the business user,” he said. The most important thing about Trevni, he half-joked, is that it’s Doug’s project, referring to Doug Cutting, creator of several open source projects including Hadoop and Lucene. “Everything Doug seems to touch turns to gold,” said Abhi. Trevni is a columnar file format for HDFS, modeled on Google’s Dremel engine and designed for very fast data retrieval.

Spark is a relatively new tool that Abhi described as an extremely fast cluster computing system, one that supports not only iterative machine learning workloads but also interactive data analysis.
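Spark’s speed on iterative workloads comes from keeping the working dataset cached in memory, so each pass of an algorithm avoids re-reading data from disk. As a rough illustration of that pattern, here is a minimal sketch in plain Python (not Spark’s actual API) of an iterative computation, simple gradient descent for a linear fit, where the data is loaded once and then scanned repeatedly; the toy dataset and learning rate are made up for this example:

```python
# Toy dataset, loaded (and "cached") once; each point lies on y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0   # model parameters for y = w*x + b
lr = 0.05         # learning rate

for _ in range(500):
    # One map-and-reduce pass over the in-memory data per iteration --
    # the kind of repeated scan Spark makes cheap by caching the dataset.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / len(points)
    grad_b = sum(2 * (w * x + b - y) for x, y in points) / len(points)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # prints 2.0 1.0
```

In Spark the same loop would operate on a cached RDD with `map` and `reduce` steps distributed across the cluster; the point here is only the shape of the computation: many fast passes over data held in memory.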

On the more mature end of these projects is D3, which has been around for a while but which Abhi believes is the most interesting visualization platform. Cross-platform and backed by a constantly growing library of visualizations, it addresses one of the most important questions in data science: how to communicate results.

“I think Impala is going to rapidly become the universal data access mechanism going forward,” said Abhi, adding that it’s powered on the back-end by Trevni. He said he agrees with Cloudera that Impala makes existing visualization tools extremely performant on Hadoop.

Rounding out Abhi’s top five list is HCatalog, which he said is probably the least talked-about project on the list, yet among the most needed in Big Data. HCatalog is an open source metadata management framework being developed by Hortonworks. Abhi declared, “A metadata management framework that is fundamentally open and works across all of your data in HDFS is much needed, and that’s the problem HCatalog solves.”

Abhi also gave an honorable mention to Google Spanner, which he compared to Hadoop. The white paper on Google Spanner was released in December, and Abhi said it is the first globally consistent database, solving two big challenges facing open source data management frameworks. See the entire segment with Kristin Feledy and Abhishek Mehta on the Morning NewsDesk Show.





  1. Nice summary, thanks! One thing, though: if you list true Open Source projects such as HCatalog then Apache Drill [1] should be listed as well ;)

  2. All this news about big data is exciting, indeed.  My only concern?  It could kill art as we know it, because algorithms will now dictate what you should be making/selling based on the user data you have.  Now, that’s an originality killer, don’t you think?

  3. @seventhman Good point, but only if we let it, no?

  4. Informative article Molly. One other open source technology to look at is HPCC Systems from LexisNexis, a data-intensive supercomputing platform for processing and solving big data analytical problems. Their open source Machine Learning Library and Matrix processing algorithms assist data scientists and developers with business intelligence and predictive analytics. Its integration with Hadoop, R and Pentaho extends further capabilities providing a complete solution for data ingestion, processing and delivery. In fact, executing HPCC Systems commands within R helps ease the burden of memory limitations with just R alone. More at
