AWS details Project Rainier AI compute cluster with hundreds of thousands of chips
Amazon Web Services Inc. today detailed Project Rainier, a compute cluster powered by hundreds of thousands of its custom AWS Trainium2 chips.
The company is using the system to support the artificial intelligence development efforts of Anthropic PBC. AWS parent Amazon.com Inc. has invested $8 billion in the OpenAI rival since last September. A few weeks ago, Anthropic disclosed that it will help the cloud giant enhance the Trainium chip line.
The Trainium2 is powered by eight so-called NeuronCores, each of which comprises four compute modules. One of the modules is a so-called GPSIMD engine optimized to run custom AI operators: highly specialized, low-level code snippets that machine learning teams write to boost the performance of their neural networks.
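To make the idea concrete, here is a minimal C sketch of the kind of optimization a custom operator typically performs. It is purely illustrative and does not use the AWS Neuron SDK: fusing two elementwise steps into one kernel halves the number of passes over memory.

```c
#include <math.h>
#include <stddef.h>

/* Illustrative only, not AWS Neuron SDK code. A "custom operator"
 * often fuses several elementwise steps into one pass over memory,
 * the kind of low-level kernel a GPSIMD engine is built to run.   */

/* Unfused: two passes over the data, two round trips to memory.   */
void scale_then_gelu(const float *x, float *out, size_t n, float scale) {
    for (size_t i = 0; i < n; i++) out[i] = x[i] * scale;
    for (size_t i = 0; i < n; i++)
        out[i] = 0.5f * out[i] * (1.0f + tanhf(0.7978845608f *
                 (out[i] + 0.044715f * out[i] * out[i] * out[i])));
}

/* Fused custom operator: one pass, one read and one write per element. */
void scale_gelu_fused(const float *x, float *out, size_t n, float scale) {
    for (size_t i = 0; i < n; i++) {
        float v = x[i] * scale;
        out[i] = 0.5f * v * (1.0f + tanhf(0.7978845608f *
                 (v + 0.044715f * v * v * v)));
    }
}
```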
The eight NeuronCores are supported by 96 gibibytes of HBM, a high-bandwidth memory variety that is considerably faster than standard DRAM. The Trainium2 moves data between its HBM pool and NeuronCores at up to 2.8 terabits per second. The faster information reaches the part of the chip where it will be processed, the sooner calculations can begin.
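For a rough sense of scale, the sketch below takes the figures exactly as stated here (96 gibibytes, 2.8 terabits per second) and estimates how long one full sweep over the HBM pool would take. The numbers are the ones quoted above; the calculation is only illustrative.

```c
#include <stdio.h>

int main(void) {
    double hbm_bytes  = 96.0 * 1024 * 1024 * 1024; /* 96 GiB HBM pool      */
    double bw_bits_s  = 2.8e12;                    /* stated 2.8 Tb/s      */
    double bw_bytes_s = bw_bits_s / 8.0;
    /* ~0.29 s to stream the entire pool once at the stated rate */
    printf("Full HBM sweep: %.2f s\n", hbm_bytes / bw_bytes_s);
    return 0;
}
```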
The hundreds of thousands of Trainium2 chips in Project Rainier are organized into so-called Trn2 UltraServers, internally developed machines that AWS detailed today alongside the compute cluster. Each one includes 64 Trainium2 chips that can provide 332 petaflops of aggregate performance when running sparse FP8 operations. FP8 is an eight-bit floating-point format that trades precision for speed, while sparsity lets the chips skip calculations involving zero values.
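Taking AWS's stated totals at face value, a quick back-of-the-envelope division puts each Trainium2 at roughly 5.2 petaflops of sparse FP8; the per-chip figure is derived here, not quoted by AWS.

```c
#include <stdio.h>

int main(void) {
    double server_pflops = 332.0; /* stated sparse-FP8 total per UltraServer */
    int chips = 64;               /* Trainium2 chips per Trn2 UltraServer    */
    /* 332 / 64 = 5.1875 petaflops of sparse FP8 per chip */
    printf("Per chip: %.2f petaflops (sparse FP8)\n", server_pflops / chips);
    return 0;
}
```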
AWS didn’t deploy the servers that make up Project Rainier in a single data center, as is the usual practice. Instead, the cloud giant spread the machines across multiple locations. That approach simplifies logistical tasks such as sourcing enough electricity to power the cluster.
The benefits of spreading out hardware across multiple facilities historically came at a cost: increased latency. The greater the distance between the servers in a cluster, the more time it takes data to travel between them. Because AI clusters regularly shuffle information among their servers, this latency increase can significantly slow down processing.
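To put rough numbers on that, the sketch below estimates round-trip propagation delay over optical fiber for a few hypothetical distances between facilities. The roughly 200,000 kilometers per second figure for light in fiber is a standard approximation, not an AWS number, and real networks add switching overhead on top.

```c
#include <stdio.h>

int main(void) {
    /* Light in optical fiber travels at roughly two-thirds the vacuum
     * speed of light, about 200,000 km/s (standard approximation).    */
    double fiber_km_per_s = 200000.0;
    double distances_km[] = {0.1, 10.0, 100.0};
    for (int i = 0; i < 3; i++) {
        double rtt_ms = 2.0 * distances_km[i] / fiber_km_per_s * 1000.0;
        printf("%7.1f km between facilities -> ~%.3f ms round trip\n",
               distances_km[i], rtt_ms);
    }
    /* Every extra 100 km adds ~1 ms per round trip, and distributed
     * training makes many such round trips per step.                 */
    return 0;
}
```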
AWS addressed that limitation with an internally developed technology called the Elastic Fabric Adapter. It’s a network interface that speeds up the flow of data between the company’s AI chips.
Moving information between two disparate servers involves numerous computing operations. Some of those operations are normally carried out by the servers’ operating system. AWS’ Elastic Fabric Adapter bypasses the operating system, which allows network traffic to reach its destination faster.
Under the hood, the device processes traffic with the help of an open-source networking framework called libfabric. The software lends itself to powering not only AI models but also other demanding applications such as scientific simulations.
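As a rough illustration of how software reaches the adapter through that framework, here is a minimal libfabric sketch that queries for an "efa" provider and a reliable-datagram endpoint from user space. It is a generic libfabric usage pattern, not AWS-published code, and error handling is trimmed for brevity.

```c
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void) {
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info  = NULL;

    hints->ep_attr->type = FI_EP_RDM;              /* reliable datagram   */
    hints->caps = FI_MSG;                          /* basic send/receive  */
    hints->fabric_attr->prov_name = strdup("efa"); /* ask for EFA         */

    /* fi_getinfo discovers matching providers entirely in user space */
    int ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "no EFA provider found (%d)\n", ret);
        return 1;
    }
    printf("provider: %s, fabric: %s\n",
           info->fabric_attr->prov_name, info->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```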
AWS expects to complete the construction of Project Rainier next year. When it comes online, the system will be one of the world’s largest compute clusters for training AI models. AWS said that it will provide more than five times the performance of the system that Anthropic has been using until now to develop its language models.
The announcement of Project Rainier today comes about a year after AWS disclosed plans to build another large-scale AI cluster.
Project Ceiba, as the other system is called, runs on Nvidia Corp. silicon rather than Trainium2 processors. The original plan was to equip the supercomputer with 16,384 of the chipmaker’s GH200 superchips. Last March, AWS switched to a configuration with 20,736 Blackwell B200 chips that is expected to provide six times as much performance.
Project Ceiba will support Nvidia’s internal engineering efforts. The chipmaker plans to use the system for projects spanning areas such as language model research, biology and autonomous driving.
Image: AWS