Alluxio, the in-memory store for Apache Spark, hits version 1.0
As the Hadoop Distributed File System (HDFS) continues to lose traction among Spark adopters, newer and more sophisticated storage frameworks are starting to take its place. One of the most popular options is the open-source Alluxio (previously known as Tachyon), which is moving under the wing of a dedicated foundation this morning on the occasion of its first major release reaching general availability.
The launch is the culmination of a three-year development effort that began with the work of a single doctoral candidate at UC Berkeley and has since been supported by some of the biggest names in the technology world. Haoyuan Li witnessed the rise of Spark firsthand during his studies at the university’s AMPLab, where the analytics engine had gotten its start in 2010, and identified a bottleneck that was holding back early implementation attempts: The handful of data stores that were able to effectively support in-memory processing at the time all relied on replication for fault tolerance.
The records in a Spark cluster would be copied across multiple servers to ensure that they could still be accessed if a node malfunctioned. The approach remains the preferred method of maintaining the reliability of the analytics engine to this very day, even as the amount of information that organizations are processing grows at an accelerating rate. As a result, more and more bandwidth is used replicating data, which leaves less for other tasks and ultimately impedes processing. Li foresaw the challenge and devised an alternative fault-tolerance technique that would go on to form the basis of Alluxio.
The platform registers every change made to a record, from the moment it’s ingested by Spark, in a special log that is kept readily accessible at all times. Should the server that hosts the file fail during analysis, Alluxio can have another machine pick up the slack, redo all the calculations that were performed in the run-up to the malfunction and continue from there as if nothing had happened. The mechanism takes advantage of the fact that processing power is much more abundant than bandwidth in the enterprise to drastically improve cluster performance.
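The recovery idea described above can be illustrated with a minimal Python sketch. It is not Alluxio's API or implementation: the `LineageLog` and `Worker` classes and every method name are hypothetical, and the sketch assumes that the originally ingested value stays available and that each recorded transformation is deterministic, so replaying the log reproduces the lost result.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LineageLog:
    """Hypothetical change log: per record, the ingested value and the
    ordered list of transformations applied to it since ingestion."""
    sources: Dict[str, int] = field(default_factory=dict)
    steps: Dict[str, List[Callable[[int], int]]] = field(default_factory=dict)

    def ingest(self, key: str, value: int) -> None:
        self.sources[key] = value
        self.steps[key] = []

    def record(self, key: str, fn: Callable[[int], int]) -> None:
        self.steps[key].append(fn)

    def recompute(self, key: str) -> int:
        # Replay every logged transformation, in order, from the source value.
        value = self.sources[key]
        for fn in self.steps[key]:
            value = fn(value)
        return value


class Worker:
    """Hypothetical node that keeps results in memory only (no replicas)."""

    def __init__(self, log: LineageLog) -> None:
        self.log = log
        self.memory: Dict[str, int] = {}

    def ingest(self, key: str, value: int) -> None:
        self.log.ingest(key, value)
        self.memory[key] = value

    def apply(self, key: str, fn: Callable[[int], int]) -> None:
        self.log.record(key, fn)           # log the change first
        self.memory[key] = fn(self.memory[key])

    def recover(self, key: str) -> int:
        # Rebuild a record this node never held by replaying the log.
        self.memory[key] = self.log.recompute(key)
        return self.memory[key]


def double(x: int) -> int:
    return x * 2


def add_three(x: int) -> int:
    return x + 3


def demo() -> tuple:
    log = LineageLog()
    primary = Worker(log)
    primary.ingest("r1", 5)
    primary.apply("r1", double)
    primary.apply("r1", add_three)
    before_failure = primary.memory["r1"]

    # Simulate the primary failing: a fresh worker has empty memory
    # and only the shared log to work from.
    replacement = Worker(log)
    after_recovery = replacement.recover("r1")
    return before_failure, after_recovery
```

In this sketch the only thing shared between machines is the small log of changes rather than full copies of the data, which is the trade the article describes: spend extra computation on recovery to avoid spending bandwidth on replication.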
Banking giant Barclays PLC claims that its data scientists were able to reduce the duration of certain analyses from hours to minutes using Alluxio. The framework enables developers to work faster as well by hiding the complexity of its internals behind a programming interface that makes it relatively straightforward to control the flow of information. Records may be imported into memory from a variety of third-party systems and automatically moved to disk for permanent storage after processing is complete.
Alluxio can handle the latter task by itself or hand off the analyzed data to conventional storage systems such as GlusterFS and OpenStack Swift. The framework also provides integration with a number of open-source execution engines to accommodate organizations whose needs may not be fully met by Spark.