UPDATED 13:24 EDT / NOVEMBER 29 2012

Big Data Up to 100X Faster – Researchers Crank Up the Speed Dial

Researchers at the University of California, Berkeley are working on a project that aims to shift Big Data into a much higher gear. The goal is to cut the search and analysis times involved in working with unstructured data. The open-source, next-generation data analytics stack includes components known as “Shark”, “Spark”, and “Mesos” and was introduced by Michael Franklin and Matei Zaharia at the 2012 AWS re:Invent conference.

The claims are huge: up to 100 times faster than Hadoop on some tasks. The real-time database analysis tools are based on Apache’s Hive platform and feature improvements in machine learning and low-latency search functions. The stack is not a modified Hadoop but a completely separate codebase optimized for low latency, and it can also load data from Hadoop input sources. Shark is built on Spark, an open source, distributed, fault-tolerant, in-memory analytics system that can be installed on the same cluster as Hadoop.

In particular, Shark is fully compatible with Hive and supports HiveQL, Hive data formats, and user-defined functions. In addition, Shark can be used to query data in HDFS, HBase, and Amazon S3.

The possible applications are widespread and could address real-world Big Data problems where speed is a significant requirement. As data sets grow larger and adoption becomes more widespread, both faster processing and better analysis are becoming requirements. Some may be wondering how Shark compares to Cloudera’s Impala. While the two share some common goals, the differences are significant and bear review. A Quora post, “Apache Hadoop: How does Impala compare to Shark?”, lays out a number of these differences for the reader to consider.

Ben Lorica, Chief Data Scientist at O’Reilly Media (https://twitter.com/bigdata), shared the top eight reasons he likes Spark, which he calls a “key part of my big data toolkit”:

  • Hadoop integration
  • The Spark interactive Shell
  • The Spark Analytic Suite
  • Resilient Distributed Datasets (RDDs)
  • Distributed Operators
  • Once you get past the learning curve … iterative programs
  • It’s already used in production
  • The Spark codebase is small, extensible, and hackable.
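
To give a feel for the RDD item in the list above, here is a minimal sketch in plain Python. It is a toy illustration, not Spark’s API: the class ToyDataset and its methods are hypothetical names made up for this example. It assumes only the general RDD idea that transformations are recorded lazily as a lineage, that results can optionally be kept in memory, and that lost results can be rebuilt by replaying the lineage.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Hypothetical, single-machine stand-in for the RDD idea (not Spark code)."""

    def __init__(self, source: Callable[[], Iterable], lineage: str):
        self._source = source              # recipe for (re)computing the records
        self.lineage = lineage             # readable description of that recipe
        self._keep = False                 # True once cache() has been requested
        self._cache: Optional[List] = None
        self.computations = 0              # how many times the recipe actually ran

    @classmethod
    def from_list(cls, items: Iterable) -> "ToyDataset":
        data = list(items)
        return cls(lambda: iter(data), "from_list")

    def map(self, fn: Callable) -> "ToyDataset":
        # Nothing runs here; we only record how to derive the new dataset.
        return ToyDataset(lambda: (fn(x) for x in self._records()),
                          self.lineage + " -> map")

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (x for x in self._records() if pred(x)),
                          self.lineage + " -> filter")

    def cache(self) -> "ToyDataset":
        # Ask for the results to be kept in memory after the first computation.
        self._keep = True
        return self

    def evict(self) -> None:
        # Simulate losing the in-memory copy (for example, a failed node).
        self._cache = None

    def _records(self):
        if self._cache is not None:
            return iter(self._cache)       # served from memory
        self.computations += 1
        records = list(self._source())     # replay the lineage
        if self._keep:
            self._cache = records
        return iter(records)

    def collect(self) -> List:
        return list(self._records())


if __name__ == "__main__":
    squares = (ToyDataset.from_list(range(10))
               .map(lambda x: x * x)
               .filter(lambda x: x % 2 == 0)
               .cache())
    print(squares.lineage)
    print(squares.collect())   # first call computes and caches
    print(squares.collect())   # second call is served from memory
    squares.evict()
    print(squares.collect())   # rebuilt by replaying the lineage
```

In this toy version, repeated queries against a cached dataset skip recomputation, which is the general reason an in-memory design can help iterative and interactive workloads. A real system would also partition the data across machines, which this sketch does not attempt.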

If you want to learn more about Shark and Spark, the AMPLab team is offering a tutorial at the 2013 Strata Conference in Santa Clara, titled “An Introduction to the Berkeley Data Analytics Stack (BDAS) Featuring Spark and Shark”.

Shark is developed in the UC Berkeley AMPLab. The research and development are supported in part by an NSF CISE Expeditions award; by gifts from Google, SAP, Amazon Web Services, Blue Goji, Cisco, Cloudera, Ericsson, General Electric, Hewlett Packard, Huawei, Intel, MarkLogic, Microsoft, NetApp, Oracle, Quanta, Splunk, and VMware; and by DARPA. All the software and documentation are available at https://amplab.cs.berkeley.edu/

