UPDATED 12:29 EST / SEPTEMBER 11 2015

Spark 1.5 puts the pedal to the metal on in-memory analytics

Apache Spark expanded its lead as the preferred candidate for the open-source community’s new flagship analytics engine this week with the release of a landmark update that substantially improves processing speeds across every supported workload type. Much of that gain comes from an overhaul of the underlying execution engine, an effort known in the community as Project Tungsten, that has been in the works for several quarters.

Like most of the other leading analytics technologies developed under the umbrella of the Apache Software Foundation, Spark is written mainly in Scala and runs on the Java Virtual Machine, whose managed runtime removes the need for programmers to worry about the nuances of how their code is executed. The project’s backers have given up some of that convenience to squeeze more performance out of the underlying hardware.

Spark now circumvents the standard Java mechanism for managing data in memory in favor of its own specialized binary format, which saves space and reduces the overhead the runtime’s garbage collector spends figuring out which pieces of data can be deleted once they’re no longer needed. But that still doesn’t fully accommodate every workload, which is why the engine now generates optimized low-level code itself for some of its more advanced components instead of leaving execution entirely to the virtual machine.
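The gist of that specialized format can be sketched in a few lines. The code below is a toy illustration, not Spark’s actual implementation: it packs records into one flat binary buffer with fixed offsets, so there is no per-record object for a garbage collector to track, and fields are read back by offset arithmetic.

```python
import struct

# One record: a 64-bit integer key plus a 64-bit float value,
# laid out back to back in a single contiguous buffer.
RECORD = struct.Struct("<qd")  # 16 bytes per record

def pack_records(rows):
    """Encode (key, value) pairs into one flat binary buffer."""
    buf = bytearray(RECORD.size * len(rows))
    for i, (key, value) in enumerate(rows):
        RECORD.pack_into(buf, i * RECORD.size, key, value)
    return bytes(buf)

def unpack_record(buf, i):
    """Random access to record i by offset -- no per-row objects."""
    return RECORD.unpack_from(buf, i * RECORD.size)

rows = [(1, 0.5), (2, 1.25), (3, -3.0)]
buf = pack_records(rows)
print(unpack_record(buf, 1))  # (2, 1.25)
```

Compared with a list of boxed objects, the buffer’s size is exactly 16 bytes per record, which is the kind of space saving and collector relief the new format is after.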

Standing out in particular are the data management functions that Spark borrows from the world of relational databases, implemented in a dedicated component that lets business analysts carry out analytics using familiar structured queries. As an added bonus, the new release makes it possible to visualize the execution paths of those queries in order to identify ways to improve response times.
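An execution path of this kind is essentially a tree of operators. The sketch below is purely illustrative (Spark’s real visualization lives in its web UI, and the operator names here are hypothetical): it builds a tiny query plan and renders it in an explain-style indented listing.

```python
# A minimal operator tree, printed the way explain-style output
# typically looks: each child indented under its parent.
class PlanNode:
    def __init__(self, name, *children):
        self.name = name
        self.children = children

    def explain(self, depth=0):
        """Return the plan as indented text lines, root first."""
        lines = ["  " * depth + "+- " + self.name]
        for child in self.children:
            lines.extend(child.explain(depth + 1))
        return lines

# Hypothetical plan: scan a table, filter it, then aggregate.
plan = PlanNode("Aggregate [sum(amount)]",
                PlanNode("Filter [region = 'EU']",
                         PlanNode("Scan [orders]")))
print("\n".join(plan.explain()))
```

Reading such a tree from the bottom up shows where time goes, which is what makes visualizing it useful for tuning response times.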

Spark 1.5 also targets a more mathematically oriented audience with expanded support for the R statistical modeling language, which is likewise aimed at letting users employ syntax they already know. Except instead of structured queries, the integration is geared toward creating machine learning algorithms like the kind used in recommendation systems and several other popular use cases for the engine.
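To make the recommendation use case concrete, here is a toy sketch of the underlying idea (this is not Spark’s MLlib API, and the titles and ratings are made up): items are scored against each other by cosine similarity over their user-rating vectors, and the closest item becomes the recommendation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical data: each item's ratings from three users.
ratings = {
    "film_a": [5, 4, 0],
    "film_b": [4, 5, 1],
    "film_c": [0, 1, 5],
}

def most_similar(item):
    """Recommend the item whose rating pattern is closest."""
    others = (i for i in ratings if i != item)
    return max(others, key=lambda i: cosine(ratings[item], ratings[i]))

print(most_similar("film_a"))  # film_b
```

Production systems run this kind of computation over millions of vectors, which is exactly the scale an engine like Spark is built to distribute.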

Another fast-rising application for Spark that often goes hand in hand with machine learning is stream processing, which is also receiving a boost in the form of reliability improvements and a new throttling feature meant to prevent clusters from ingesting more data than they can handle. That’s useful for dealing with sudden input spikes that can potentially compromise the service levels of a deployment if left unchecked.
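The throttling idea can be sketched in miniature. The class below is an assumption-laden illustration, not Spark’s implementation: it caps how many records are admitted per batch interval, so a sudden spike queues up instead of flooding the cluster.

```python
# Toy per-batch rate limiter: admit at most max_per_batch records,
# defer the rest so they wait for a later batch interval.
class RateLimiter:
    def __init__(self, max_per_batch):
        self.max_per_batch = max_per_batch

    def admit(self, incoming):
        """Split a burst into what fits this batch and what must wait."""
        taken = incoming[: self.max_per_batch]
        deferred = incoming[self.max_per_batch :]
        return taken, deferred

limiter = RateLimiter(max_per_batch=100)
spike = list(range(250))  # sudden input spike
taken, deferred = limiter.admit(spike)
print(len(taken), len(deferred))  # 100 150
```

The deferred records are simply processed in later intervals, trading a little latency during spikes for predictable service levels.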

But as big an improvement as the update represents, it’s still only the tip of the iceberg of what’s to come now that IBM Corp. has committed a billion dollars and several thousand engineers to accelerating the development of Spark. One of the first additions in the pipeline is a library called SystemML, derived from IBM’s Watson work, that automatically optimizes machine learning algorithms for fast execution.

