UPDATED 08:00 EST / JULY 09 2024

Alluxio says it can achieve 97% GPU utilization across a distributed filesystem

Alluxio Inc., which sells a high-performance open-source distributed filesystem, announced a set of enhancements that optimize the use of costly graphics processing units, along with improvements that it says make its storage performance competitive with storage subsystems optimized for high-performance computing, or HPC.

Alluxio Enterprise AI Version 3.2 also adds a Python interface and improved cache management features. The company said these help organizations better utilize artificial intelligence training and inferencing infrastructure.

Alluxio provides a single control point for businesses to manage data-intensive workloads across diverse infrastructures. Last fall, it overhauled its architecture to focus on AI training. “Since then we’ve seen that people want to train on resources that are separate from their primary data lake,” said Adit Madan, director of product. “We’re giving people the flexibility to use GPUs anywhere.”

The new release uses a unified namespace, intelligent caching and data management to maximize GPU utilization even when the data is remote. The company said that, combined with improvements in cache management and selective filtering, Alluxio Enterprise AI 3.2 matches HPC storage performance on existing data lakes, as measured by the popular MLPerf benchmark suite.

“We’re at 75% of the limits of the hardware and able to drive 10 gigabits-per-second throughput,” Madan said. “This is as good as the best HPC storage subsystems out there.”

Alluxio said it’s addressing a common dilemma of balancing spending on GPUs and storage when needs aren’t always predictable. “This is a more dynamic split because you can use Alluxio to dial performance on your storage subsystem up and down,” Madan said.

97% GPU utilization

Combined with enhanced input/output performance, Alluxio said its platform achieves up to 10-gigabits-per-second throughput and 200,000 input/output operations per second on a single client, with the capacity to scale up to hundreds of clients. That results in more than 97% GPU utilization on a single node with eight Nvidia Corp. A100 GPUs, well ahead of the 50% to 60% rates that Madan said are more common. New checkpoint read/write support optimizes training to minimize GPU idle time.
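To make the utilization figures concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a simplified model in which each training step is GPU compute time plus time spent stalled waiting on data, with no overlap between the two. The function names and the 100-millisecond step time are hypothetical illustrations, not Alluxio measurements.

```python
def gpu_utilization(compute_ms: float, io_stall_ms: float) -> float:
    """Fraction of each step the GPU spends computing (simplified model)."""
    total = compute_ms + io_stall_ms
    if total <= 0:
        raise ValueError("step time must be positive")
    return compute_ms / total


def max_stall_for_target(compute_ms: float, target: float) -> float:
    """Largest I/O stall per step that still meets the target utilization."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    # Solve compute / (compute + stall) = target for stall.
    return compute_ms * (1 - target) / target


# Hypothetical step: 100 ms of compute, 80 ms waiting on storage.
baseline = gpu_utilization(100.0, 80.0)

# Stall budget per step if the goal is 97% utilization.
budget_ms = max_stall_for_target(100.0, 0.97)

print(f"baseline utilization: {baseline:.1%}")
print(f"stall budget at 97%: {budget_ms:.2f} ms per step")
```

Under these assumptions, a step that waits 80 milliseconds on storage lands at about 56% utilization, inside the range Madan describes as common, while reaching 97% leaves a budget of roughly 3 milliseconds of stall per step.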

Several tenants can share the same caching and performance features in multi-tenant environments. “You are gaining performance without losing flexibility,” Madan said.

Version 3.2 also introduces the Alluxio Python FileSystem application programming interface to simplify integration with Python applications. The API is compatible with Filesystem Spec, or fsspec, a widely used Python library intended to abstract and unify access to various file system backends. That means Python frameworks like Ray can use it to access local and remote storage systems.
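The general idea of a caching layer behind a uniform file interface can be sketched in a few lines of plain Python. This is neither Alluxio's API nor fsspec itself: the `CachedFileSystem` class, the `fake_s3_reader` function and the bucket path are hypothetical stand-ins, and only the method name `cat` is borrowed from fsspec's convention for reading a whole file.

```python
from typing import Callable, Dict
from urllib.parse import urlparse


class CachedFileSystem:
    """Hypothetical read-through cache in front of scheme-specific readers."""

    def __init__(self, readers: Dict[str, Callable[[str], bytes]]):
        self._readers = readers            # URL scheme -> backend read function
        self._cache: Dict[str, bytes] = {}
        self.backend_reads = 0             # counts trips to the backing store

    def cat(self, url: str) -> bytes:
        """Return the full contents of url, hitting the backend only once."""
        if url in self._cache:
            return self._cache[url]
        scheme = urlparse(url).scheme or "file"
        if scheme not in self._readers:
            raise ValueError(f"no backend registered for scheme {scheme!r}")
        data = self._readers[scheme](url)
        self.backend_reads += 1
        self._cache[url] = data
        return data


# In-memory stand-in for a remote object store (assumption for illustration).
_fake_store = {"s3://bucket/train/part-0": b"sample-bytes"}


def fake_s3_reader(url: str) -> bytes:
    return _fake_store[url]


fs = CachedFileSystem({"s3": fake_s3_reader})
first = fs.cat("s3://bucket/train/part-0")   # fetched from the backend
second = fs.cat("s3://bucket/train/part-0")  # served from the cache
print(first == second, fs.backend_reads)
```

The calling code sees one interface regardless of where the bytes live, and repeat reads never leave the cache. A production system would also need eviction, invalidation and size limits, which this sketch leaves out.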

The bottom line for Python users is that “they can start getting benefits of Alluxio caching without having a dedicated person to manage the Alluxio system,” Madan said. “There is no longer the need for a dedicated set of services managed by the client.”

Image: Pixabay
