UPDATED 16:08 EST / NOVEMBER 29 2023

New Amazon SageMaker HyperPod can train AI models across thousands of chips

Amazon Web Services Inc. today launched Amazon SageMaker HyperPod, a new offering that enables developers to train foundation models across thousands of chips.

The offering, which made its debut at AWS re:Invent 2023, could create more competition for Google LLC’s TPU Pods. Those are clusters of up to 4,096 artificial intelligence chips that are available through the search giant’s public cloud. The chips are coordinated by a recently detailed software system, Cloud TPU Multislice Training, that automates infrastructure maintenance tasks to save time for developers.

Large-scale AI training 

Foundation models are often too complex to be trained using a single AI chip. As a result, they have to be split across multiple processors, which is a technically complex undertaking. The task requires highly specialized skills and can take weeks or months depending on the amount of hardware involved. 

SageMaker HyperPod, the new offering AWS debuted today, provides access to on-demand AI training clusters. Developers can provision a cluster through a combination of point-and-click commands and relatively simple scripts, which is significantly faster than manually configuring infrastructure. AWS says that SageMaker HyperPod reduces the amount of time required to train foundation models by up to 40%.
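
As a rough illustration of the scripted route, the sketch below uses the AWS SDK for Python (boto3) to request a small HyperPod cluster. The cluster name, instance type, role ARN and lifecycle-script location are placeholder assumptions, and the exact request shape may differ from what AWS ships, so treat it as a sketch rather than a working recipe.

```python
# Hypothetical sketch: request a small HyperPod cluster with the AWS SDK for Python.
# Names, ARNs and the lifecycle-script location are placeholder assumptions.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

response = sagemaker.create_cluster(
    ClusterName="demo-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "training-nodes",
            "InstanceType": "ml.trn1.32xlarge",   # Trainium-based training instances
            "InstanceCount": 16,
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://demo-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",       # setup script run when each node starts
            },
        }
    ],
)
print(response["ClusterArn"])
```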

Customers can equip an AI training cluster with Nvidia Corp. graphics cards or chips from AWS’ internally developed Trainium processor series. The newest addition to the Trainium series made its debut on Tuesday during the second day of re:Invent. According to AWS, the new chip can train AI models up to four times faster than its predecessor with up to double the energy efficiency.

At the same time, the latest Trainium chip retains its predecessor’s hardware-accelerated stochastic rounding feature. AI models ingest information in the form of floating-point numbers, which encode fractional values. Those values are often rounded up or down, which facilitates faster processing but decreases the data’s accuracy. The stochastic rounding feature in Trainium chips sacrifices less accuracy than traditional round-to-nearest methods because it rounds up or down at random, with probabilities weighted so that the errors cancel out on average.
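
To illustrate the general idea behind stochastic rounding, not Trainium’s hardware implementation, the short Python sketch below contrasts round-to-nearest with stochastic rounding on a toy integer-rounding task; the values and loop count are arbitrary assumptions.

```python
import random

def round_nearest(x: float) -> int:
    """Traditional rounding: always snap to the closest integer."""
    return int(x + 0.5)

def round_stochastic(x: float) -> int:
    """Round up with probability equal to the fractional part, so the
    expected value of the rounded result equals the original value."""
    lower = int(x)
    frac = x - lower
    return lower + (1 if random.random() < frac else 0)

# Accumulating a small value many times shows the difference: round-to-nearest
# discards the increment entirely, while stochastic rounding preserves it on average.
values = [0.3] * 1000
print(sum(round_nearest(v) for v in values))     # 0
print(sum(round_stochastic(v) for v in values))  # close to 300
```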

Reliable training 

SageMaker HyperPod automates not only the process of setting up AI training clusters but also their maintenance. When one of the instances in a customer’s cluster goes offline, built-in automation tries to repair it. If the troubleshooting attempt fails, SageMaker HyperPod swaps out the malfunctioning node for a new one.

Foundation models take weeks or months to train in some cases. If an outage takes the underlying AI infrastructure offline, developers have to restart the training from scratch, which can lead to significant project delays. To avoid such situations, SageMaker HyperPod periodically saves AI models during a training session and provides the ability to resume the session from the most recent snapshot.
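
The snippet below is a generic illustration of that checkpoint-and-resume pattern, not HyperPod’s actual mechanism; the file names, storage location and checkpoint interval are arbitrary assumptions.

```python
import glob
import os
import pickle

CHECKPOINT_DIR = "checkpoints"   # arbitrary location for this illustration

def save_checkpoint(step: int, model_state: dict) -> None:
    """Persist training state so an interrupted run can pick up where it left off."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, f"step_{step:08d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "model_state": model_state}, f)

def load_latest_checkpoint():
    """Return the most recent snapshot, or None if no checkpoint exists yet."""
    files = sorted(glob.glob(os.path.join(CHECKPOINT_DIR, "step_*.pkl")))
    if not files:
        return None
    with open(files[-1], "rb") as f:
        return pickle.load(f)

# Resume from the latest checkpoint if one exists, saving every 100 steps thereafter.
checkpoint = load_latest_checkpoint()
start_step = checkpoint["step"] + 1 if checkpoint else 0
for step in range(start_step, 1_000):
    model_state = {"weights": step}   # stand-in for real model parameters
    if step % 100 == 0:
        save_checkpoint(step, model_state)
```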

“You can now use SageMaker HyperPod to train FMs for weeks or even months while SageMaker actively monitors the cluster health and provides automated node and job resiliency by replacing faulty nodes and resuming model training from a checkpoint,” Antje Barth, the principal developer advocate for generative AI at AWS, detailed in a blog post. 

Simplified software workflows 

Each SageMaker HyperPod cluster comes preconfigured with a set of AWS-developed distributed training libraries. The libraries automatically spread a developer’s model across the chips in the cluster. According to AWS, they also split the data on which that model is being trained into smaller, more manageable pieces.
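
As a simplified, hypothetical illustration of the data-splitting half of that job, not AWS’s library code, each worker can be handed an interleaved slice of the dataset so that every chip trains on a different, roughly equal portion.

```python
def shard_dataset(samples: list, num_workers: int, worker_rank: int) -> list:
    """Hand each worker an interleaved slice so every chip sees a different,
    roughly equal-sized subset of the training data."""
    return samples[worker_rank::num_workers]

dataset = list(range(10))
for rank in range(4):
    print(rank, shard_dataset(dataset, num_workers=4, worker_rank=rank))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6]
# 3 [3, 7]
```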

Distributing an AI model across a large number of chips is normally a complicated process because it involves technically challenging performance optimization work.

When different components of a neural network run on different chips, they must regularly exchange data with one another to coordinate their work. This movement of data consumes a large amount of processing power and can slow down training, so developers must distribute their AI models in a way that avoids unnecessary movement of data between the underlying chips.
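
A toy example makes the point: if a model’s layers are assigned to chips contiguously, activations cross a device boundary only once, whereas an interleaved assignment forces a transfer at nearly every layer. The function name and layer counts below are illustrative assumptions.

```python
def cross_device_transfers(placement: list[int]) -> int:
    """Count the layer boundaries where activations must move between chips."""
    return sum(1 for a, b in zip(placement, placement[1:]) if a != b)

# Eight layers split across two chips.
contiguous = [0, 0, 0, 0, 1, 1, 1, 1]    # first half on chip 0, second half on chip 1
interleaved = [0, 1, 0, 1, 0, 1, 0, 1]   # alternating layers between the two chips

print(cross_device_transfers(contiguous))    # 1 transfer per forward pass
print(cross_device_transfers(interleaved))   # 7 transfers per forward pass
```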

The distributed training libraries that AWS provides with SageMaker HyperPod reduce the amount of manual work involved in this process. However, customers with advanced requirements have the option to use their own distributed training code. AWS also provides the ability to equip an AI training cluster with other software components such as debugging tools.

SageMaker HyperPod is generally available today in AWS’ Ohio, Northern Virginia, Oregon, Singapore, Sydney, Tokyo, Frankfurt, Ireland and Stockholm cloud regions. 

Image: AWS
