

Apache Spark expanded its lead as the preferred candidate for the open-source community’s new flagship analytics engine this week with the release of a landmark update that drastically improves processing speeds across every supported workload type. Much of that increase is due to an overhaul of the underlying execution scheme that has been in the works for several quarters.
Like most of the other leading analytics technologies developed under the umbrella of the Apache Software Foundation, Spark is written mainly in Java, which comes with an abstraction layer that removes the need for the programmer to worry about the nuances of how their code is executed. The project’s backers have given up some of that convenience to squeeze more performance out of the underlying hardware.
Spark now circumvents the native Java mechanism for managing data in memory, using its own specialized format that saves space and reduces the overhead the abstraction layer expends on figuring out which data is no longer needed and can be deleted. But that still doesn’t fully accommodate every workload, which is why the engine takes over code execution entirely for some of its more advanced components.
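To see why a specialized in-memory format helps, here is a minimal pure-Python sketch using the standard `struct` module. It is illustrative only and not Spark’s actual layout: a fixed-width packed row holds the same values in far fewer bytes than boxed objects, and packed bytes create no extra work for the garbage collector. The row schema here is hypothetical.

```python
import struct
import sys

# Hypothetical row: (id: int64, score: float64, flag: int8).
# "<" means little-endian with no alignment padding, so the row
# is a predictable 8 + 8 + 1 = 17 bytes.
ROW_FORMAT = "<qdb"

def pack_row(row_id, score, flag):
    """Serialize a row into a fixed-width byte string."""
    return struct.pack(ROW_FORMAT, row_id, score, flag)

def unpack_row(data):
    """Deserialize the packed bytes back into a tuple."""
    return struct.unpack(ROW_FORMAT, data)

packed = pack_row(42, 0.99, 1)
print(len(packed))         # 17 bytes of payload
print(unpack_row(packed))  # (42, 0.99, 1)

# The equivalent tuple of boxed Python objects costs much more memory,
# and each object must be tracked by the garbage collector.
boxed = (42, 0.99, 1)
print(sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed))
```

The same principle, applied at cluster scale, is what lets the engine sidestep much of the bookkeeping described above.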
Standing out in particular are the data management functions that Spark borrows from the world of relational databases, which are implemented in a dedicated component that allows business analysts to carry out analytics using familiar structured queries. As an added bonus, the new release makes it possible to visualize the execution paths of those queries in order to identify ways to improve response times.
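The workflow this enables, running a structured query and then inspecting its execution plan to find ways to speed it up, can be tried in miniature without a Spark cluster. The sketch below uses SQLite’s `EXPLAIN QUERY PLAN` purely as a runnable stand-in for the idea; the table and query are invented for illustration.

```python
import sqlite3

# Build a tiny in-memory table to query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 75.0)])

query = "SELECT region, SUM(amount) FROM sales GROUP BY region"

# Each row of the plan describes one step of the execution path.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)

# Adding an index changes the plan -- exactly the kind of insight
# a plan visualization is meant to surface.
conn.execute("CREATE INDEX idx_region ON sales(region)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```

Spark’s new visualization presents the same information graphically in its web UI rather than as text rows.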
Spark 1.5 also targets a more mathematically oriented audience with the addition of expanded support for the R statistical modelling language, which is likewise aimed at enabling users to employ syntax they already know. Except instead of structured queries, the integration aims to enable the creation of machine learning algorithms like the kind used in recommendation systems and several other popular use cases for the engine.
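For readers unfamiliar with what a recommendation algorithm actually does, here is a toy item-to-item sketch in plain Python. It is a deliberately simplified stand-in: Spark’s machine learning libraries implement far more scalable techniques, and the ratings data here is invented.

```python
from math import sqrt

# Toy user -> item -> rating data, invented for illustration.
ratings = {
    "alice": {"matrix": 5.0, "inception": 4.0},
    "bob":   {"matrix": 4.0, "inception": 5.0, "up": 2.0},
    "carol": {"up": 5.0, "inception": 1.0},
}

def cosine(item_a, item_b):
    """Cosine similarity between two items over users who rated both."""
    common = [u for u in ratings
              if item_a in ratings[u] and item_b in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_a] * ratings[u][item_b] for u in common)
    na = sqrt(sum(ratings[u][item_a] ** 2 for u in common))
    nb = sqrt(sum(ratings[u][item_b] ** 2 for u in common))
    return dot / (na * nb)

def recommend(user):
    """Rank unseen items by similarity to the items the user rated."""
    seen = ratings[user]
    candidates = {i for u in ratings for i in ratings[u]} - set(seen)
    return sorted(candidates,
                  key=lambda i: sum(cosine(i, s) * r
                                    for s, r in seen.items()),
                  reverse=True)

print(recommend("alice"))  # → ['up']
```

The point of the R integration is that statisticians can express this kind of logic in the syntax they already use, while Spark handles distributing the computation.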
Another fast-rising application for Spark that often goes hand in hand with machine learning is stream processing, which is also receiving a boost in the form of reliability improvements and a new throttling feature meant to prevent clusters from ingesting more data than they can handle. That’s useful for dealing with sudden input spikes that can potentially compromise the service levels of a deployment if left unchecked.
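The general idea behind such throttling can be sketched with a classic token bucket: records are accepted only as fast as tokens refill, so a sudden burst is rejected (or queued) rather than overwhelming the consumer. This is a generic illustration of the technique, not Spark code; in Spark 1.5 the feature is exposed as configuration (e.g. the `spark.streaming.backpressure.enabled` setting) rather than something users implement themselves.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative sketch)."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # burst ceiling
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_consume(self, n=1):
        """Accept n records if tokens are available, else reject."""
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# Simulate a burst of 1000 records against a small bucket: only
# roughly the burst capacity gets through; the rest are rejected.
bucket = TokenBucket(rate=100, capacity=10)
accepted = sum(bucket.try_consume() for _ in range(1000))
print(accepted)
```

A streaming engine applies the same cap at the ingestion boundary, which is what keeps an input spike from compromising service levels downstream.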
But as big an improvement as the update represents, it’s still only the tip of the iceberg of what’s to come now that IBM Corp. has allocated a billion dollars and several thousand engineers to accelerating the development of Spark. One of the first additions in the pipeline is a library called SystemML that is derived from Watson and automatically optimizes machine learning algorithms for fast execution.