Amazon SageMaker HyperPod cooks up recipes and flexible training plans to accelerate AI development
Amazon Web Services Inc. isn’t stopping at transforming Amazon SageMaker into a unified platform for artificial intelligence development tools. It’s equally determined to make it easier for builders to access the underlying infrastructure needed to train their next-generation AI models.
Today at AWS re:Invent 2024, the company announced some key updates to Amazon SageMaker HyperPod. The changes are intended to help developers get started on AI model training faster and, through more flexible deployment options, shave weeks off the time it takes to finish those training jobs. They'll also help developers reduce the costs associated with model training.
In addition, Amazon said it’s helping SageMaker customers to discover, deploy and make use of various third-party generative AI and machine learning development tools offered by AWS partners, such as Deepchecks Inc., Fiddler Labs Inc. and Comet ML Inc.
AI recipes accelerate training jobs
AWS unveiled SageMaker HyperPod at last year's edition of re:Invent, describing it as an infrastructure offering that provides developers with access to on-demand compute clusters for AI training. With SageMaker HyperPod, users can quickly provision clusters of graphics processing units or other AI accelerators through point-and-click commands or relatively simple scripts, getting started much faster than if they were to configure the clusters manually.
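For a sense of what that scripted path can look like, here's a minimal sketch that provisions a cluster through the SageMaker CreateCluster API in boto3. The cluster name, instance group, lifecycle-script location and IAM role are all placeholders, and the exact request fields should be verified against the current boto3 documentation:

```python
import boto3

# A minimal sketch of provisioning a HyperPod cluster with boto3.
# Every name, count and ARN below is a placeholder.
sm = boto3.client("sagemaker")

response = sm.create_cluster(
    ClusterName="demo-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p5.48xlarge",  # GPU accelerator instances
            "InstanceCount": 4,
            # Lifecycle scripts run on each node as it joins the cluster.
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/hyperpod-lifecycle/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        }
    ],
)
print(response["ClusterArn"])
```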
The company is trying to position SageMaker HyperPod as the infrastructure platform of choice for AI training. It says developers need a specialized platform because model training is a difficult process that demands deep expertise in managing the underlying clusters and writing specialized code to distribute those models across many accelerators.
SageMaker HyperPod eases much of that complexity, and with today’s updates the process is getting easier than ever, AWS said. One of the main innovations is the new “recipes” that allow customers to quickly customize popular, publicly available models like Llama and Mistral for specific use cases, based on their internal data.
The training recipes are meant to make it easier for users to get started, without being bogged down by tasks such as defining parameters and benchmarking performance. AWS is offering more than 30 recipes at launch, for models such as Llama 3.2 90B, Llama 3.1 405B and Mixtral 8x22B.
According to AWS, the recipes can help customers get going much faster, by automatically loading training datasets, applying distributed training techniques and automating other aspects of the process. The company reckons its recipes, available now in the SageMaker GitHub repository, can eliminate weeks of iterative evaluation and testing that would normally be required to get started in training AI.
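As a rough illustration of how a recipe might be launched, here's a hedged sketch using the SageMaker Python SDK's PyTorch estimator. The training_recipe and recipe_overrides parameters reflect the recipes documentation, but the recipe path, role and dataset URI are placeholders, and the override keys are illustrative rather than definitive:

```python
from sagemaker.pytorch import PyTorch

# A sketch of launching a HyperPod recipe via the SageMaker Python SDK.
# The recipe path is a placeholder; real names live in the
# aws/sagemaker-hyperpod-recipes GitHub repository. The SDK is expected
# to resolve an appropriate training image for the chosen recipe.
estimator = PyTorch(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.p5.48xlarge",
    instance_count=2,
    training_recipe="fine-tuning/llama/example_recipe",   # placeholder recipe
    # recipe_overrides tweaks recipe defaults; keys here are illustrative.
    recipe_overrides={
        "exp_manager": {"exp_dir": "/opt/ml/model"},
    },
)

# Point the job at a fine-tuning dataset in S3 (placeholder URI).
estimator.fit(inputs={"train": "s3://my-bucket/datasets/train/"})
```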
“This is going to be a game changer,” Swami Sivasubramanian, vice president of AI and data at AWS, said at his re:Invent keynote today.
Flexible model training plans
In another update, AWS is also making it easier to plan and manage the underlying compute capacity requirements for AI training jobs. With the new flexible training plans, customers can simply specify their budget, desired completion date and the maximum amount of compute resources required for a job, and SageMaker HyperPod will automatically reserve the capacity, set up the necessary clusters and then deploy everything as needed.
If the requested resources can't meet the specified completion date or budget, SageMaker HyperPod will automatically suggest alternative plans of action, such as extending the date range, adding more compute or conducting the training job in another AWS region. Once the customer approves a plan, the infrastructure is provisioned automatically just before it's required, with the right number of instances to get the job done on the customer's timeline.
Once again, it saves developers time, and more importantly it helps reduce the uncertainty that comes when customers need to acquire large clusters of GPUs to complete AI development tasks.
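In API terms, the workflow might look something like the sketch below, which uses the SearchTrainingPlanOfferings and CreateTrainingPlan actions announced alongside the feature. The parameter names and response shapes here are assumptions and should be checked against the current boto3 documentation:

```python
import boto3
from datetime import datetime, timedelta

sm = boto3.client("sagemaker")

# Search for capacity offerings that fit a deadline and instance budget.
# Field names below are assumptions based on the announced APIs.
offerings = sm.search_training_plan_offerings(
    InstanceType="ml.p5.48xlarge",
    InstanceCount=8,
    TargetResources=["hyperpod-cluster"],
    StartTimeAfter=datetime.utcnow(),
    EndTimeBefore=datetime.utcnow() + timedelta(days=21),  # completion date
    DurationHours=96,  # total compute time needed
)

# Reserve the first matching offering; HyperPod provisions the clusters
# automatically just before the reserved window begins.
offering_id = offerings["TrainingPlanOfferings"][0]["TrainingPlanOfferingId"]
plan = sm.create_training_plan(
    TrainingPlanName="llama-finetune-plan",  # placeholder name
    TrainingPlanOfferingId=offering_id,
)
print(plan["TrainingPlanArn"])
```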
AWS said an AI startup called Hippocratic AI is already using its flexible training plans to speed up its training processes, and has noticed a fourfold improvement in the time it takes to get its newest models up to scratch.
Priority resource allocation
In a final update to SageMaker HyperPod, AWS is introducing new task governance features that promise to give users more control over task prioritization and resource allocation.
The new capabilities make it possible for customers to maximize accelerator utilization for specific model training, fine-tuning and inference jobs, reducing overall development costs by up to 40% in some cases, the company said. With a few clicks, users can define priorities for each task and set limits on compute resources. Once those priorities and limits are established, SageMaker HyperPod will allocate the relevant resources automatically, managing task queues so that higher-priority work is always done first.
This ensures the most critical training jobs take precedence at all times: if a team has an urgent job that needed doing yesterday, SageMaker HyperPod can automatically free up underutilized resources from non-urgent jobs already underway, pausing them so the critical task finishes as quickly as possible.
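To make that preemption behavior concrete, here is a purely conceptual Python sketch, not the HyperPod API, showing how a priority-aware scheduler can pause lower-priority work to make room for an urgent task:

```python
import heapq
from itertools import count

# Conceptual illustration only -- this is not the HyperPod API. It models
# how a priority-aware scheduler can pause lower-priority jobs when a
# higher-priority task arrives and the accelerators are fully allocated.
class Scheduler:
    def __init__(self, total_gpus):
        self.free_gpus = total_gpus
        self._tiebreak = count()
        self.running = []  # min-heap: lowest-priority job sits at the top

    def submit(self, name, priority, gpus):
        # Pause the lowest-priority running jobs until the new task fits.
        while (self.free_gpus < gpus and self.running
               and self.running[0][0] < priority):
            _, _, paused_name, paused_gpus = heapq.heappop(self.running)
            self.free_gpus += paused_gpus
            print(f"pausing {paused_name} (freed {paused_gpus} GPUs)")
        if self.free_gpus >= gpus:
            self.free_gpus -= gpus
            heapq.heappush(self.running,
                           (priority, next(self._tiebreak), name, gpus))
            print(f"running {name}")
        else:
            print(f"queued {name}: not enough capacity to preempt")

sched = Scheduler(total_gpus=8)
sched.submit("nightly-finetune", priority=1, gpus=8)
sched.submit("urgent-eval", priority=10, gpus=4)  # preempts the batch job
```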
SageMaker integrations
The updates announced today weren't all about HyperPod. AWS is also making it easier to integrate third-party AI development tools with SageMaker.
Previously, such integrations were time-consuming, involving numerous steps around monitoring, compliance, data access, resource provisioning and so on. That work is now being automated for select AI applications. According to AWS, the integrations will make it trivial for users to discover, deploy and use AI developer applications from the likes of Comet, Deepchecks, Fiddler and other companies, directly within SageMaker.
One advantage of deploying such apps within SageMaker is that there’s no need to move data outside of a secure AWS environment, thanks to the platform’s integration with multiple kinds of data stores.
“With today’s announcements, we’re offering customers the most performant and cost-efficient model development infrastructure possible to help them accelerate the pace at which they deploy generative AI workloads into production,” said Baskar Sridharan, vice president of AI/ML Services and Infrastructure at AWS.
With reporting from Robert Hof
Image: SiliconANGLE/Microsoft Designer