UPDATED 14:34 EDT / FEBRUARY 04 2013

Top 5 Open Source Projects in Big Data – Breaking Analysis

by Molly Sassmann

Big Data is a booming area that is receiving more widespread attention, especially since technology research company Gartner has projected that Big Data will drive $34 billion in IT spending in 2013. Abhishek Mehta, founder of Tresata, joined Kristin Feledy on the Morning NewsDesk Show to give his perspective on what’s happening in Big Data.

Abhi noted that there are a number of emerging open source projects in Big Data and shared his top five list: Trevni, Spark, D3, Impala and HCatalog. He gave a brief rundown of why these projects are important and explained that he would group them into “pockets” rather than rank them by impact or popularity. Calling them “tools for data scientists,” he elaborated that these are infrastructure tools that make the use of emerging Big Data technologies more relevant.

“These Open Source projects by themselves are making Big Data a lot more easy to consume by the business user,” he said. The most important thing about Trevni is that it’s Doug’s project, he half-joked, referring to Doug Cutting, the founder of several open source projects including Hadoop and Lucene. “Everything Doug seems to touch turns to gold,” said Abhi. Trevni is a columnar file format for HDFS, modeled on Google’s Dremel engine and designed for very fast data retrieval.
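As a rough illustration of why a columnar layout can speed up retrieval, the sketch below compares summing one field from row-oriented records with summing the same field from a column-oriented copy. It is not Trevni's actual format or API: the sample records and the `to_columns` helper are hypothetical, and plain Python lists stand in for blocks on disk.

```python
# Illustrative only: in-memory lists stand in for on-disk storage.
rows = [
    {"user": "a", "country": "US", "amount": 10.0},
    {"user": "b", "country": "DE", "amount": 7.5},
    {"user": "c", "country": "US", "amount": 2.5},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: each whole record is visited to read a single field.
row_total = sum(record["amount"] for record in rows)

# Columnar layout: only the "amount" column is read; the other
# fields are never touched.
col_total = sum(columns["amount"])
```

Both totals are the same; the difference is how much unrelated data a query has to read along the way, which matters most when records have many fields and a query needs only a few.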

Spark is a relatively new tool that Abhi described as an extremely fast cluster computing system, one that supports not only iterative machine learning algorithms but also interactive data management.

On the more mature end of these projects is D3, which has been around for a while; Abhi believes it is the most interesting visualization platform. Because it works across hardware platforms and has a constantly growing library of visualizations, it is able to answer the most important data science question: how to communicate results.

“I think Impala is going to rapidly become the universal data access mechanism going forth,” said Abhi, adding that it’s powered on the back end by Trevni. He agreed with Cloudera that Impala makes existing visualization tools extremely performant on Hadoop.

Rounding out Abhi’s top five list is HCatalog, which he said is probably the least talked about project, but still the most needed thing in Big Data. HCatalog is an open source metadata management framework that is being developed by Hortonworks. Abhi declared, “A metadata management framework that is fundamentally open and works across all of your data in HDFS is much needed, and that’s the problem HCatalog solves.”

Abhi also gave an honorable mention to Google Spanner, which he compared to Hadoop. The white paper on Google Spanner was released in December, and Abhi said it’s the first global, consistent database that solves two big challenges in Open Source and data management frameworks. See the entire segment with Kristin Feledy and Abhishek Mehta on the Morning NewsDesk Show.
