Researchers at the University of California, Berkeley are working on a project that looks to shift Big Data analysis into a much higher gear. The project aims to cut the search and analysis times involved in working with unstructured data. The open-source, next-generation data analytics stack includes components known as “Shark”, “Spark”, and “Mesos”, which were introduced by Michael Franklin and Matei Zaharia at the 2012 AWS re:Invent conference.
The claims are huge: 100 times faster than Hadoop on some tasks. The real-time database analysis tools are based on the Apache Hive platform and feature improvements in machine learning and low-latency search functions. It is not a modified Hadoop but a completely separate codebase optimized for low latency, and it can load data from Hadoop input sources. Shark is a component of Spark, an open-source, distributed, fault-tolerant, in-memory analytics system that can be installed on the same cluster as Hadoop.
In particular, Shark is fully compatible with Hive and supports HiveQL, Hive data formats, and user-defined functions. In addition, Shark can be used to query data in HDFS, HBase, and Amazon S3.
The possible applications are widespread and could address real-world Big Data problems where speed is critical. As data sets grow larger and adoption becomes more widespread, both faster and better analysis are becoming requirements. Some may be wondering how Shark compares to Cloudera’s Impala: while there are some common goals, the differences are significant and bear review. A Quora post, “Apache Hadoop: How does Impala compare to Shark?”, lays out a number of these differences for the reader to consider.
Ben Lorica, Chief Data Scientist at O’Reilly Media (https://twitter.com/bigdata), shared the top eight reasons he likes Spark, a “key part of my big data toolkit”:
- Hadoop integration
- The Spark interactive Shell
- The Spark Analytic Suite
- Resilient Distributed Datasets (RDDs)
- Distributed Operators
- Once you get past the learning curve … iterative programs
- It’s already used in production
- The Spark codebase is small, extensible, and hackable.
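To give a feel for two items on that list, resilient distributed datasets and the operators applied to them, here is a toy, single-process sketch in Python. It is not the Spark API: the `ToyRDD` class and its `from_list`, `evict`, and `lineage` names are hypothetical, invented for illustration. It only mimics the general idea that transformations are recorded lazily as a recipe, that results can be kept in memory, and that a lost in-memory copy can be rebuilt from that recipe.

```python
class ToyRDD:
    """Hypothetical, single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute, lineage):
        self._compute = compute        # recipe that can rebuild the data
        self.lineage = lineage         # human-readable record of the recipe
        self._keep_in_memory = False
        self._cached = None

    @classmethod
    def from_list(cls, items):
        items = list(items)
        return cls(lambda: iter(items), "source")

    def _iterate(self):
        # Serve from memory if a cached copy exists, otherwise recompute.
        if self._cached is not None:
            return iter(self._cached)
        data = list(self._compute())
        if self._keep_in_memory:
            self._cached = data
        return iter(data)

    def map(self, fn):
        # Lazy: nothing runs until collect() is called.
        return ToyRDD(lambda: (fn(x) for x in self._iterate()),
                      self.lineage + " -> map")

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)),
                      self.lineage + " -> filter")

    def cache(self):
        self._keep_in_memory = True
        return self

    def evict(self):
        # Simulate losing the in-memory copy (e.g. a failed node).
        self._cached = None

    def collect(self):
        return list(self._iterate())


numbers = ToyRDD.from_list(range(10))
evens_squared = (numbers.filter(lambda n: n % 2 == 0)
                        .map(lambda n: n * n)
                        .cache())

first = evens_squared.collect()    # computed, then kept in memory
evens_squared.evict()              # pretend the cached copy was lost
second = evens_squared.collect()   # rebuilt from the recorded recipe
```

In this sketch the second `collect()` returns the same values as the first even after the cached copy is dropped, because the object still holds the steps needed to recompute them. A real system spreads the data and the work across a cluster, which this toy does not attempt.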
If you want to learn more about Shark/Spark, the AMPLab team is offering a tutorial at the 2013 Strata Conference in Santa Clara. Titled “An Introduction to the Berkeley Data Analytics Stack (BDAS) Featuring Spark and Shark”, the tutorial will provide an introduction to the stack.
Shark is developed in the UC Berkeley AMPLab. The research and development is supported in part by an NSF CISE Expeditions award, gifts from Google, SAP, Amazon Web Services, Blue Goji, Cisco, Cloudera, Ericsson, General Electric, Hewlett Packard, Huawei, Intel, MarkLogic, Microsoft, NetApp, Oracle, Quanta, Splunk, VMware, and by DARPA. All the software and documentation are available at https://amplab.cs.berkeley.edu/