UPDATED 09:30 EDT / JANUARY 28 2015

Move over, Storm! LinkedIn’s Samza becomes full-fledged member of the Hadoop ecosystem

Four months after Storm, the first true stream processing engine for Hadoop, graduated from incubation at the Apache Software Foundation, Samza has followed suit to become a top-level project. The technology provides a potentially more attractive option for analyzing real-time data that requires stateful handling, such as aggregations and joins.

It’s possible to process that kind of fast-moving information using Storm, which has been around longer and boasts a more well-rounded set of capabilities, but the engine was designed for a very specific purpose that doesn’t accommodate every streaming workload equally well.

Storm traces its roots back several years to a startup called BackType Inc., which needed a way to quickly aggregate and analyze interactions on Twitter. To keep up with the pace of social media, the company designed the framework to minimize the delay involved in ingesting data, which meant reducing the size of the payload traveling across the processing pipeline as much as possible.

Because of that design decision, Storm greatly limits the amount of context that it’s possible to store with any particular snippet of data. Keeping anything beyond a few metrics requires offloading the details to a remote database, which necessitates issuing a query and waiting for the results to flow back through the network every time the information is required.
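To make that pattern concrete, here is a minimal sketch of what it looks like in a 2014-era Storm bolt. None of it comes from Storm’s own codebase or from BackType: the EnrichmentBolt, Profile and ProfileClient names, the field names and the connectToProfileStore() helper are invented for illustration, and only the bolt lifecycle methods are standard Storm API.

    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Hypothetical bolt that enriches events with user profiles kept in a remote store.
    public class EnrichmentBolt extends BaseRichBolt {

        /** Invented stand-ins for a client to an external database such as Redis or Cassandra. */
        public interface Profile { String region(); }
        public interface ProfileClient { Profile get(String userId); }

        private OutputCollector collector;
        private transient ProfileClient profiles;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.profiles = connectToProfileStore(); // open the remote connection per worker
        }

        @Override
        public void execute(Tuple tuple) {
            String userId = tuple.getStringByField("userId");

            // The bolt holds no context of its own, so every tuple pays a network
            // round trip to the remote store before it can be enriched and emitted.
            Profile profile = profiles.get(userId);

            collector.emit(tuple, new Values(userId, profile.region(), tuple.getValueByField("event")));
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("userId", "region", "event"));
        }

        // Placeholder: a real topology would construct an actual database client here.
        protected ProfileClient connectToProfileStore() {
            throw new UnsupportedOperationException("wire up a real remote store client");
        }
    }

The lookup inside execute() is the delay the article is describing: it happens once per message, and its latency is set by the network and the remote database rather than by the stream processor itself.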

That can add up to painful delays in applications that regularly perform relatively complex operations like grouping multiple messages into a single unit or joining a stream against a database table to add context. The limitation can make Storm unwieldy for some advanced use cases, which are growing more numerous by the day as organizations increasingly seek to make better use of their real-time data.

In contrast, Samza is specifically designed for those kinds of workloads. The LinkedIn-developed engine takes the opposite approach to Storm’s, attaching a local key-value store to every task so that context can be generated and fetched quickly. That setup maintains processing speed even when a large amount of collateral information is involved. It’s also built for reliability, replicating the contents of each embedded store across the cluster to ensure that no gaps are created in the real-time workflow when a machine becomes unavailable.
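For comparison, here is a minimal sketch of that model using Samza’s public task API. The PageViewCounterTask class, the page-view-counts store name and the per-page counting logic are invented for illustration; only the StreamTask, InitableTask and KeyValueStore interfaces are Samza’s own, and the store itself would be declared in the job configuration.

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    // Hypothetical task that keeps a running count per page in task-local state.
    public class PageViewCounterTask implements StreamTask, InitableTask {

        private KeyValueStore<String, Integer> counts;

        @Override
        @SuppressWarnings("unchecked")
        public void init(Config config, TaskContext context) {
            // The store name must match the "stores.page-view-counts.*" entries in the job config.
            counts = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
        }

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            String pageId = (String) envelope.getKey();

            // Context lives next to the task: reading and updating it is a local
            // operation, with no network round trip to an external database.
            Integer current = counts.get(pageId);
            counts.put(pageId, current == null ? 1 : current + 1);
        }
    }

Replication is handled in configuration rather than in code: declaring a Kafka changelog for the store (a stores.page-view-counts.changelog entry) is what lets Samza rebuild the state on another machine if this one fails.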

Reliability is a major theme with Samza. In addition to keeping state properly synced, the engine splits workloads across single-threaded processes that can each handle multiple tasks, rather than assigning one thread to every job the way Storm does. That’s a subtle but important difference that simplifies provisioning and reduces interference among tasks running on the same machine.

The graduation of Samza to a top-level project noticeably expands the usefulness of Hadoop, nudging the platform a step closer to enterprise readiness. With the stream processing framework now a full-fledged member of the ecosystem, it’s poised to start attracting serious contributions from vendors.

