

Apache Spark expanded its lead as the preferred candidate for the open-source community’s new flagship analytics engine this week with the release of a landmark update that drastically improves processing speeds for every supported workload type. Much of that increase is due to an overhaul of the underlying operating scheme that has been in the works for several quarters.
Like most of the other leading analytics technologies developed under the umbrella of the Apache Software Foundation, Spark runs on the Java Virtual Machine (it is written mainly in Scala), an abstraction layer that removes the need for programmers to worry about the nuances of how their code is executed. The project’s backers have given up some of that convenience to squeeze more performance out of the underlying hardware.
Spark now bypasses the native Java mechanism for managing data in memory in favor of its own specialized binary format, which saves space and reduces the overhead that the abstraction layer spends on garbage collection, that is, on working out which objects are no longer needed and when they can be deleted. But that still doesn’t fully accommodate every workload, which is why the engine takes over code execution entirely for some of its more advanced components.
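To make the idea of a specialized in-memory format more concrete, here is a minimal sketch in plain Python. It is an analogy only, not Spark's actual format or code: the record layout (`user_id`, `score`) and the helper names `pack_rows`, `read_row` and `sum_scores` are hypothetical. It shows records being packed into one contiguous buffer instead of being kept as many small managed objects.

```python
import struct

# Hypothetical fixed-width record: (user_id: int64, score: float64).
# "<qd" means little-endian, one 8-byte integer and one 8-byte float.
RECORD = struct.Struct("<qd")


def pack_rows(rows):
    """Pack (int, float) pairs into a single contiguous byte buffer."""
    buf = bytearray(RECORD.size * len(rows))
    for i, (user_id, score) in enumerate(rows):
        RECORD.pack_into(buf, i * RECORD.size, user_id, score)
    return buf


def read_row(buf, index):
    """Decode one record directly from its offset in the buffer."""
    return RECORD.unpack_from(buf, index * RECORD.size)


def sum_scores(buf):
    """Scan the buffer and add up the score field of every record."""
    total = 0.0
    for _user_id, score in RECORD.iter_unpack(buf):
        total += score
    return total


rows = [(1, 0.5), (2, 1.5), (3, 2.0)]
buf = pack_rows(rows)
```

Because every record lives at a known offset in a single buffer, the runtime has one object to track rather than one per record, which is the kind of saving the paragraph above describes.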
Standing out in particular are the data management functions that Spark borrows from the world of relational databases, which are implemented in a dedicated component that allows business analysts to carry out analytics using familiar structured queries. As an added bonus, the new release makes it possible to visualize the execution paths of those queries to identify ways to improve response times.
Spark 1.5 also targets a more mathematically oriented audience with the addition of expanded support for the R statistical modelling language, which is likewise aimed at enabling users to employ syntax they already know. Instead of structured queries, however, the integration aims to enable the creation of machine learning algorithms like the kind used in recommendation systems and several other popular use cases for the engine.
Another fast-rising application for Spark that often goes hand in hand with machine learning is stream processing, which is also receiving a boost in the form of reliability improvements and a new throttling feature meant to prevent clusters from ingesting more data than they can handle. That’s useful for dealing with sudden input spikes that can potentially compromise the service levels of a deployment if left unchecked.
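The throttling idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Spark's own rate-control algorithm: the function `next_rate_limit` and its parameters are hypothetical, and the rule it applies (cap ingestion at the rate the cluster actually sustained in the last batch) is a deliberately basic stand-in. In Spark Streaming itself, the feature is switched on through the `spark.streaming.backpressure.enabled` setting.

```python
def next_rate_limit(current_limit, batch_records, processing_secs,
                    batch_interval_secs, min_limit=100.0):
    """Return a new maximum ingest rate in records per second.

    current_limit       -- the limit applied to the last batch
    batch_records       -- how many records the last batch contained
    processing_secs     -- how long the cluster took to process it
    batch_interval_secs -- how often a new batch arrives
    min_limit           -- floor so ingestion never stops completely
    """
    if batch_records == 0 or processing_secs <= 0:
        # Nothing to learn from an empty or unmeasured batch.
        return current_limit

    # Rate the cluster actually sustained on the last batch.
    processing_rate = batch_records / processing_secs

    if processing_secs > batch_interval_secs:
        # Falling behind: never raise the limit, and cap it at what
        # the cluster has shown it can handle.
        return max(min_limit, min(current_limit, processing_rate))

    # Keeping up: allow ingestion up to the sustained rate.
    return max(min_limit, processing_rate)
```

For example, a batch of 2,000 records that takes four seconds to process against a two-second interval would cut a limit of 1,000 records per second to 500, so a sudden spike is absorbed by slowing intake rather than by letting a backlog build up.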
But as big an improvement as the update represents, it’s still only the tip of the iceberg of what’s to come now that IBM Corp. has allocated a billion dollars and several thousand engineers to accelerating the development of Spark. One of the first additions in the pipeline is a library called SystemML that is derived from Watson and automatically optimizes machine learning algorithms for fast execution.