UPDATED 14:10 EST / MAY 10 2023


Google boosts AI training with A3 virtual machines powered by Nvidia’s H100 GPUs

Google Cloud is expanding its portfolio of virtual machines for training and running artificial intelligence and machine learning models with the launch of its A3 supercomputers.

Announced at Google I/O today, the Google Compute Engine A3 supercomputers are purpose-built VMs designed to train and serve the most demanding AI models, including those driving advances in generative AI, the company said.

State-of-the-art AI and machine learning require massive amounts of computational power delivered by infrastructure that’s tailor-made for the purpose, Google Director of Product Management Roy Kim and Group Product Manager Chris Kleban explained in a co-authored blog post. With its A3 supercomputers, Google Cloud is pairing Nvidia Corp.’s new H100 graphics processing units with its own networking advancements, ensuring customers can access the most powerful GPUs for AI workloads, Kim and Kleban said.

A single A3 supercomputer VM is powered by eight H100 GPUs built on Nvidia’s Hopper architecture, delivering three times faster compute than its predecessor chip, the A100. It also offers 3.6 terabytes per second of bisectional bandwidth across those GPUs via NVSwitch and NVLink 4.0, plus integration with Intel Corp.’s 4th Gen Xeon Scalable processors to offload administrative tasks.
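A quick back-of-the-envelope check shows how the 3.6 TB/s figure follows from the per-GPU interconnect spec. This is a sketch that assumes Nvidia's published 900 GB/s of aggregate NVLink bandwidth per H100, a number the article itself does not state:

```python
# Sanity check on the quoted 3.6 TB/s bisectional bandwidth,
# assuming each H100 exposes 900 GB/s of aggregate NVLink 4.0
# bandwidth (Nvidia's published Hopper spec, not from the article).

NVLINK_BW_PER_GPU_GBPS = 900  # GB/s of NVLink bandwidth per H100
GPUS_PER_VM = 8

# Bisection bandwidth: split the 8 GPUs into two halves of 4.
# With NVSwitch providing full bandwidth, the traffic crossing the
# cut is limited by the 4 GPUs on one side of it.
bisection_gbps = (GPUS_PER_VM // 2) * NVLINK_BW_PER_GPU_GBPS
print(bisection_gbps / 1000, "TB/s")  # → 3.6 TB/s
```

The assumed per-GPU figure lines up exactly with the 3.6 TB/s Google quotes, suggesting the bandwidth is a straightforward bisection over the NVSwitch fabric.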

The A3 supercomputer is the first GPU instance to leverage Google’s custom-designed Intel Infrastructure Processing Units, which accelerate GPU-to-GPU data transfers by bypassing the CPU host. According to Google, this increases network bandwidth by up to 10 times over its previous-generation A2 VMs.

The instances also use Google’s intelligent Jupiter data center networking fabric, which can scale to 26,000 interconnected GPUs and provide up to 26 exaflops of AI performance. As a result, Google said, the A3 VMs will considerably reduce the time and cost of training large machine learning models. Moreover, when companies transition from training to serving their models, the A3 VMs can deliver a 30-times boost in inference performance compared with the A2 VMs.
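The two 26s in that claim are consistent with each other. As a rough check, assuming the cluster-scale figure is a simple sum over GPUs (which Google does not state explicitly), the implied per-GPU throughput works out as follows:

```python
# Rough check: what per-GPU throughput does "26 exaflops across
# 26,000 GPUs" imply? Assumes the cluster figure is a simple sum
# over GPUs, which is an assumption, not a claim from the article.

total_exaflops = 26
gpu_count = 26_000

per_gpu_petaflops = total_exaflops * 1000 / gpu_count
print(per_gpu_petaflops, "PFLOPS per GPU")  # → 1.0 PFLOPS per GPU
```

Roughly 1 petaflop per GPU is in the same ballpark as the H100's published tensor-core throughput with sparsity, which suggests the headline number is a peak aggregate rather than a sustained training figure.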

Beyond raw performance, Google Cloud is offering flexible deployment options. Customers can deploy the A3 VMs on Google Cloud’s Vertex AI platform for building machine learning models on fully managed infrastructure that’s purpose-built for high-performance training. Vertex AI was recently updated with new generative AI capabilities, expanding its support for large language model development.

Alternatively, customers that wish to architect their own customized software stack can deploy the A3 supercomputers on Google Compute Engine or Google Kubernetes Engine, the company said. That will allow teams to train and serve advanced foundation models while benefiting from autoscaling, workload orchestration and automatic updates.

Image: Google Cloud
