AWS details Project Rainier AI compute cluster with hundreds of thousands of chips
Amazon Web Services Inc. today detailed Project Rainier, a compute cluster powered by hundreds of thousands of its custom AWS Trainium2 chips.
The company is using the system to support the artificial intelligence development efforts of Anthropic PBC. AWS parent Amazon.com Inc. has invested $8 billion in the OpenAI rival since last September. A few weeks ago, Anthropic disclosed that it will help the cloud giant enhance the Trainium chip line.
The Trainium2 is powered by eight so-called NeuronCores, each of which comprises four compute modules. One of the modules is a so-called GPSIMD engine optimized to run custom AI operators: highly specialized, low-level code snippets that machine learning teams write to boost the performance of their neural networks.
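To make the idea concrete, here is a minimal C sketch of the kind of optimization a custom operator typically performs. It is purely illustrative and does not use the AWS Neuron SDK: fusing two elementwise steps into one kernel halves the number of passes over memory.

```c
#include <math.h>
#include <stddef.h>

/* Illustrative only, not AWS Neuron SDK code. A "custom operator"
 * often fuses several elementwise steps into one pass over memory,
 * the kind of low-level kernel a GPSIMD engine is built to run.   */

/* Unfused: two passes over the data, two round trips to memory.   */
void scale_then_gelu(const float *x, float *out, size_t n, float scale) {
    for (size_t i = 0; i < n; i++) out[i] = x[i] * scale;
    for (size_t i = 0; i < n; i++)
        out[i] = 0.5f * out[i] * (1.0f + tanhf(0.7978845608f *
                 (out[i] + 0.044715f * out[i] * out[i] * out[i])));
}

/* Fused custom operator: one pass, one read and one write per element. */
void scale_gelu_fused(const float *x, float *out, size_t n, float scale) {
    for (size_t i = 0; i < n; i++) {
        float v = x[i] * scale;
        out[i] = 0.5f * v * (1.0f + tanhf(0.7978845608f *
                 (v + 0.044715f * v * v * v)));
    }
}
```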
The eight NeuronCores are supported by 96 gibibytes of HBM, a high-bandwidth memory variety that is considerably faster than standard DRAM. The Trainium2 moves data between its HBM pool and NeuronCores at up to 2.8 terabits per second. The faster information reaches the part of the chip where it will be processed, the sooner calculations can begin.
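For a rough sense of scale, the sketch below takes the figures exactly as stated here (96 gibibytes, 2.8 terabits per second) and estimates how long one full sweep over the HBM pool would take. The numbers are the ones quoted above; the calculation is only illustrative.

```c
#include <stdio.h>

int main(void) {
    double hbm_bytes  = 96.0 * 1024 * 1024 * 1024; /* 96 GiB HBM pool      */
    double bw_bits_s  = 2.8e12;                    /* stated 2.8 Tb/s      */
    double bw_bytes_s = bw_bits_s / 8.0;
    /* ~0.29 s to stream the entire pool once at the stated rate */
    printf("Full HBM sweep: %.2f s\n", hbm_bytes / bw_bytes_s);
    return 0;
}
```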
The hundreds of thousands of Trainium2 chips in Project Rainier are organized into so-called Trn2 UltraServers, internally developed machines that AWS detailed today alongside the compute cluster. Each one includes 64 Trainium2 chips that can provide 332 petaflops of aggregate performance when running sparse FP8 operations. FP8 is an eight-bit floating-point format that trades precision for speed, while sparsity lets the chips skip calculations involving zero values.
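Taking AWS's stated totals at face value, a quick back-of-the-envelope division puts each Trainium2 at roughly 5.2 petaflops of sparse FP8; the per-chip figure is derived here, not quoted by AWS.

```c
#include <stdio.h>

int main(void) {
    double server_pflops = 332.0; /* stated sparse-FP8 total per UltraServer */
    int chips = 64;               /* Trainium2 chips per Trn2 UltraServer    */
    /* 332 / 64 = 5.1875 petaflops of sparse FP8 per chip */
    printf("Per chip: %.2f petaflops (sparse FP8)\n", server_pflops / chips);
    return 0;
}
```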
AWS didn’t deploy the servers that make up Project Rainier in a single data center, as is the usual practice. Instead, the cloud giant spread the machines across multiple locations. That approach simplifies logistical tasks such as sourcing enough electricity to power the cluster.
The benefits of spreading out hardware across multiple facilities historically came at a cost: increased latency. The greater the distance between the servers in a cluster, the more time it takes data to travel between them. Because AI clusters regularly shuffle information among their servers, this latency increase can significantly slow down processing.
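To put rough numbers on that, the sketch below estimates round-trip propagation delay over optical fiber for a few hypothetical distances between facilities. The roughly 200,000 kilometers per second figure for light in fiber is a standard approximation, not an AWS number, and real networks add switching overhead on top.

```c
#include <stdio.h>

int main(void) {
    /* Light in optical fiber travels at roughly two-thirds the vacuum
     * speed of light, about 200,000 km/s (standard approximation).    */
    double fiber_km_per_s = 200000.0;
    double distances_km[] = {0.1, 10.0, 100.0};
    for (int i = 0; i < 3; i++) {
        double rtt_ms = 2.0 * distances_km[i] / fiber_km_per_s * 1000.0;
        printf("%7.1f km between facilities -> ~%.3f ms round trip\n",
               distances_km[i], rtt_ms);
    }
    /* Every extra 100 km adds ~1 ms per round trip, and distributed
     * training makes many such round trips per step.                 */
    return 0;
}
```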
AWS addressed that limitation with an internally developed technology called the Elastic Fabric Adapter. It’s a network interface that speeds up the flow of data between the company’s AI chips.
Moving information between two disparate servers involves numerous computing operations. Some of those operations are normally carried out by the servers’ operating system. AWS’ Elastic Fabric Adapter bypasses the operating system, which allows network traffic to reach its destination faster.
Under the hood, the device processes traffic with the help of an open-source networking framework called libfabric. The software lends itself to powering not only AI models but also other demanding applications such as scientific simulations.
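As a rough illustration of how software reaches the adapter through that framework, here is a minimal libfabric sketch that queries for an "efa" provider and a reliable-datagram endpoint from user space. It is a generic libfabric usage pattern, not AWS-published code, and error handling is trimmed for brevity.

```c
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void) {
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info  = NULL;

    hints->ep_attr->type = FI_EP_RDM;              /* reliable datagram   */
    hints->caps = FI_MSG;                          /* basic send/receive  */
    hints->fabric_attr->prov_name = strdup("efa"); /* ask for EFA         */

    /* fi_getinfo discovers matching providers entirely in user space */
    int ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "no EFA provider found (%d)\n", ret);
        return 1;
    }
    printf("provider: %s, fabric: %s\n",
           info->fabric_attr->prov_name, info->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```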
AWS expects to complete the construction of Project Rainier next year. When it comes online, the system will be one of the world’s largest compute clusters for training AI models. AWS said that it will provide more than five times the performance of the system that Anthropic has been using until now to develop its language models.
The announcement of Project Rainier today comes about a year after AWS disclosed plans to build another large-scale AI cluster.
Project Ceiba, as the other system is called, runs on Nvidia Corp. silicon rather than Trainium2 processors. The original plan was to equip the supercomputer with 16,384 of the chipmaker’s GH200 superchips. Last March, AWS switched to a configuration with 20,736 Blackwell B200 chips that is expected to provide six times as much performance.
Project Ceiba will support Nvidia’s internal engineering efforts. The chipmaker plans to use the system for projects spanning areas such as language model research, biology and autonomous driving.
Image: AWS