UPDATED 09:02 EST / FEBRUARY 23 2016

NEWS

Alluxio, the in-memory store for Apache Spark, hits version 1.0

As the Hadoop File System continues to lose traction among Spark adopters, new and more sophisticated storage frameworks are starting to take its place. One of the most popular options is the open-source Alluxio (previously known as Tachyon), which is moving under the wing of a dedicated foundation this morning on the occasion of its first major release reaching general availability.

The launch is the culmination of a three-year development effort that began with the work of a single doctoral candidate at UC Berkeley and has since drawn support from some of the biggest names in the technology world. Haoyuan Li witnessed the rise of Spark firsthand during his studies at the university’s AMPLab, where the analytics engine had gotten its start in 2010, and identified a bottleneck that was holding back early implementation attempts: the handful of data stores that could effectively support in-memory processing at the time all relied on replication for fault tolerance.

Under that model, the records in a Spark cluster are copied across multiple servers to ensure that they can still be accessed if a node malfunctions. The approach remains the preferred method of maintaining the reliability of the analytics engine to this day, even as the amount of information that organizations process grows at an accelerating rate. As a result, more and more bandwidth is spent replicating data, which leaves less for other tasks and ultimately impedes processing. Li foresaw the challenge and devised an alternative fault-tolerance technique that would go on to form the basis of Alluxio.

The platform registers every change made to a record, from the moment it’s ingested by Spark, in a special log that is kept readily accessible at all times. Should the server that hosts the file fail during analysis, Alluxio can have another machine pick up the slack, redo all the calculations that were performed in the run-up to the malfunction and continue from there as if nothing had happened. The mechanism takes advantage of the fact that processing power is much more abundant than bandwidth in the enterprise to drastically improve cluster performance.
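The recovery idea described above, replaying logged changes rather than restoring a replicated copy, can be illustrated with a small sketch. This is not Alluxio's API: the names `LineageLog`, `Worker` and `apply_step` are hypothetical, and the sketch assumes the originally ingested input stays available somewhere durable so it can be reprocessed.

```python
from typing import Callable, Dict, List

Step = Callable[[List[int]], List[int]]


class LineageLog:
    """Hypothetical log: for each record, the input and the steps applied to it."""

    def __init__(self) -> None:
        self.sources: Dict[str, List[int]] = {}
        self.steps: Dict[str, List[Step]] = {}

    def ingest(self, name: str, data: List[int]) -> None:
        # Assumption: the ingested input is kept durably and can be re-read.
        self.sources[name] = list(data)
        self.steps[name] = []

    def record(self, name: str, step: Step) -> None:
        # Register one change made to the record.
        self.steps[name].append(step)

    def recompute(self, name: str) -> List[int]:
        # Redo every logged step, in order, starting from the original input.
        data = list(self.sources[name])
        for step in self.steps[name]:
            data = step(data)
        return data


class Worker:
    """Hypothetical server holding records in memory only, with no replicas."""

    def __init__(self) -> None:
        self.memory: Dict[str, List[int]] = {}


def apply_step(log: LineageLog, worker: Worker, name: str, step: Step) -> None:
    # Apply the change in memory and log it so it can be replayed later.
    worker.memory[name] = step(worker.memory[name])
    log.record(name, step)


# Example run: two changes are applied, then the hosting server is lost.
log = LineageLog()
primary = Worker()
log.ingest("readings", [1, 2, 3])
primary.memory["readings"] = [1, 2, 3]
apply_step(log, primary, "readings", lambda xs: [x * 2 for x in xs])
apply_step(log, primary, "readings", lambda xs: [x + 1 for x in xs])

primary.memory.clear()  # simulate the server failing mid-analysis

# Another machine rebuilds the record by redoing the logged calculations.
standby = Worker()
standby.memory["readings"] = log.recompute("readings")
```

In this sketch no copy of the record ever crosses the network during normal operation; the cost of a failure is extra computation on the standby machine, which reflects the trade-off of processing power for bandwidth described above.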

Banking giant Barclays PLC claims that its data scientists were able to reduce the duration of certain analyses from hours to minutes using Alluxio. The framework also enables developers to work faster by hiding the complexity of its internals behind a programming interface that makes it relatively straightforward to control the flow of information. Records may be imported into memory from a variety of third-party systems and automatically moved to disk for permanent storage after processing is complete.

Alluxio can handle the latter task by itself or hand the analyzed data off to conventional storage systems such as GlusterFS and OpenStack Swift. The framework also integrates with a number of open-source execution engines to accommodate organizations whose needs may not be fully met by Spark.
