UPDATED 08:00 EDT / APRIL 09 2024


Google Cloud’s AI Hypercomputer cloud infrastructure gets new GPUs, TPUs, optimized storage and more

Google Cloud is revamping its AI Hypercomputer architecture with significant enhancements across the board to support rising demand for generative artificial intelligence applications that are becoming increasingly pervasive in enterprise workloads.

At Google Cloud Next ’24 today, the company announced updates to almost every layer of the AI Hypercomputer cloud architecture. Among the most significant are new virtual machines powered by Nvidia Corp.’s most advanced graphics processing units. In addition, it unveiled enhancements to its storage infrastructure for AI workloads and to the underlying software for running AI models, as well as more flexible consumption options through its Dynamic Workload Scheduler service.

The updates were announced by Mark Lohmeyer, vice president and general manager of Compute and ML Infrastructure at Google Cloud. He explained that generative AI has gone from almost nowhere just a couple of years ago to becoming widespread across a wide range of enterprise applications encompassing text, code, videos, images, voice, music and more, placing incredible strains on the underlying compute, networking and storage infrastructure that supports it.

AI performance-optimized hardware

To support the increasingly powerful generative AI models being adopted across the enterprise today, Google Cloud has announced the general availability of what it says is its most powerful and scalable tensor processing unit to date. It’s called the TPU v5p, and it has been designed with a single purpose in mind – to train and run the most demanding generative AI models.

TPU v5p is built to deliver enormous computing power, with a single pod containing 8,960 chips running in unison, more than twice as many as a TPU v4 pod. According to Lohmeyer, the TPU v5p delivers some impressive performance gains, with twice as many floating point operations per second and three times more high-bandwidth memory on a per-chip basis, resulting in vastly improved overall throughput.
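As a back-of-envelope illustration of how those per-chip figures compound at pod scale, the sketch below multiplies the chip-count ratio by the per-chip ratios quoted above. The TPU v4 pod size of 4,096 chips is an assumption taken from Google's public TPU documentation rather than from this announcement, and the results are theoretical peak ratios, not measured throughput.

```python
# Figures quoted in the article.
V5P_CHIPS_PER_POD = 8_960
PER_CHIP_FLOPS_RATIO = 2.0  # "twice as many floating point operations per second"
PER_CHIP_HBM_RATIO = 3.0    # "three times more high-bandwidth memory"

# Assumption: TPU v4 pod size, not stated in the article.
V4_CHIPS_PER_POD = 4_096


def pod_ratio(per_chip_ratio: float) -> float:
    """Scale a per-chip v5p/v4 ratio by the ratio of chips per pod."""
    return (V5P_CHIPS_PER_POD / V4_CHIPS_PER_POD) * per_chip_ratio


if __name__ == "__main__":
    print(f"Chips per pod: {V5P_CHIPS_PER_POD / V4_CHIPS_PER_POD:.2f}x")
    print(f"Peak pod FLOPS: {pod_ratio(PER_CHIP_FLOPS_RATIO):.2f}x")
    print(f"Pod HBM capacity: {pod_ratio(PER_CHIP_HBM_RATIO):.2f}x")
```

Under these assumptions a full v5p pod would offer roughly 4.4 times the peak compute and 6.6 times the memory of a full v4 pod, although real workloads rarely scale linearly.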

To enable customers to train and serve AI models running on large-scale TPU clusters, Google is adding support for the TPU v5p virtual machines on Google Kubernetes Engine, its cloud-hosted service for running software containers.

As an alternative, customers can also use the latest hardware from Nvidia to train their generative AI models on Google Cloud. Besides its TPU family, Google Cloud is providing access to Nvidia’s H100 GPUs through its new A3 family of VMs. The A3 Mega VM will become generally available next month, and one of its main advantages will be support for “confidential computing,” which refers to techniques that can protect the most sensitive data from unauthorized access even while it’s being processed. This is a key development, Lohmeyer said, as it will provide a way for generative AI models to access data that was previously deemed too risky for them to process.

“Character.AI is using Google Cloud’s Tensor Processor Units and A3 VMs running on Nvidia’s H100 Tensor Core GPUs to train and infer LLMs faster and more efficiently,” said Character Technologies Inc. Chief Executive Noam Shazeer. “The optionality of GPUs and TPUs running on the powerful AI-first infrastructure makes Google Cloud our obvious choice as we scale to deliver new features and capabilities to millions of users.”

More exciting, perhaps, is what Google Cloud has in store for later in the year. Though it hasn’t said when, the company confirmed that it’s planning to bring Nvidia’s recently announced but not yet released Blackwell GPUs to its AI Hypercomputer architecture. Lohmeyer said the Blackwell GPUs will be made available in two configurations, with VMs powered by both the HGX B200 and GB200 NVL72 GPUs. The former is designed for the most demanding AI workloads, while the latter is expected to support a new era of real-time large language model inference and massive-scale training for trillion-parameter scale models.

Optimized storage infrastructure

More powerful compute is just one part of the infrastructure equation when it comes to supporting advanced generative AI workloads. Enterprises also need access to more capable storage systems that keep their data as close as possible to the compute instances that process it. The idea is that lower latency allows models to be trained faster. With today’s updates, Google Cloud claims its storage systems are now among the best in the business, with improvements that maximize GPU and TPU utilization, resulting in superior energy efficiency and cost optimization.

Today’s updates include the general availability of Cloud Storage FUSE, a new file-based interface for Google Cloud Storage that enables AI and machine learning applications to tap into file-based access to its cloud storage resources. According to Google Cloud, GCS FUSE delivers an increase in training throughput of 2.9 times compared with its existing storage systems, with model serving performance showing a 2.2-times improvement.
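To illustrate what file-based access means in practice, here is a minimal sketch that reads training shards using nothing but ordinary file operations. It assumes a bucket has already been mounted as a local directory. The mount path, the `*.bin` shard naming and the `iter_shards` helper are all hypothetical, chosen for illustration rather than taken from Google's documentation.

```python
from pathlib import Path
from typing import Iterator, Tuple

# Hypothetical mount point for a bucket exposed through a file-based interface.
EXAMPLE_MOUNT_DIR = "/mnt/gcs/training-data"


def iter_shards(mount_dir: str, pattern: str = "*.bin") -> Iterator[Tuple[str, bytes]]:
    """Yield (file name, contents) for each matching shard, in name order.

    Only standard file calls are used, so the same code works on a local
    disk or on any directory that a storage service presents as files.
    """
    for path in sorted(Path(mount_dir).glob(pattern)):
        yield path.name, path.read_bytes()


if __name__ == "__main__":
    for name, data in iter_shards(EXAMPLE_MOUNT_DIR):
        print(f"{name}: {len(data)} bytes")
```

The point of the design is that training code written against a local file system does not need a storage-specific client library to read the same data from object storage.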

Other enhancements include support for caching, now in preview, within Parallelstore, a high-performance parallel file system that’s optimized for AI and high-performance computing workloads. With its caching capabilities, Parallelstore enables up to 3.9 times faster training and 3.7 times higher training throughput compared with traditional data loaders.

The company also announced AI-focused optimizations to the Filestore service, which is a network file system that enables entire clusters of GPUs and TPUs to simultaneously access the same data.

Lastly, there’s the new Hyperdisk ML block storage service, available now in preview. With this, Google Cloud claims it can accelerate model load times by up to 12 times compared with alternative services.

Open AI software updates

A third part of the generative AI equation is the open-source software that’s used to support many of these models, and Google Cloud hasn’t ignored that either. It’s offering a range of updates across its software stack that it says will help simplify developer experiences and improve performance and cost efficiencies.

The software updates include the debut of MaxDiffusion, a new high-performance and scalable reference implementation for “diffusion models” that generate images. In addition, the company announced a range of new open models available now in MaxText, such as Gemma, GPT3, Llama 2 and Mistral.

The MaxDiffusion and MaxText models are built on a high-performance numerical computing framework called JAX, which is integrated with the OpenXLA compiler to optimize numerical functions and improve model performance. The idea is that these components ensure the most effective implementation of these models, so developers can focus on the math.

In addition, Google announced support for the latest version of the popular PyTorch AI framework, PyTorch/XLA 2.3, which will debut later this month.

Lastly, the company unveiled a new LLM inference engine called JetStream. It’s an open-source offering that’s throughput- and memory-optimized for AI accelerators such as Google Cloud’s TPUs. According to Lohmeyer, it will provide three times higher performance per dollar on Gemma 7B and other open AI models.

“As customers bring their AI workloads to production, there’s an increasing demand for a cost-efficient inference stack that delivers high performance,” he explained. “JetStream helps with this need and offers support for models trained with both JAX and PyTorch/XLA, and includes optimizations for popular open models such as Llama 2 and Gemma.”

Flexible resource consumption

The final ingredient for running generative AI on Google’s cloud stack is the Dynamic Workload Scheduler, which delivers resource management and job scheduling capabilities to developers. The main idea is that it improves access to AI computing capacity while providing tools to optimize spending on these resources.

With today’s update, Dynamic Workload Scheduler now provides two starting modes – flex start mode for enhanced obtainability with optimized economics, and calendar mode, for more predictable job start times and durations. Both modes are now available in preview.

According to Lohmeyer, flex start jobs will be queued to run as soon as possible, based on resource availability. This will make it easier for developers to access the TPU and GPU resources they need for workloads with more flexible start times. As for calendar mode, this provides short-term reserved access to AI compute resources including TPUs and GPUs. Users will be able to reserve co-located GPUs for a period of up to 14 days, up to eight weeks in advance. Reservations will be confirmed, and the capacity will become available on the requested start date.
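The calendar mode limits described above can be expressed as a simple date check. The sketch below is illustrative only: `is_valid_calendar_request` is a hypothetical helper, not part of any Google Cloud API, and it assumes that both limits are inclusive and measured in whole days.

```python
from datetime import date, timedelta

MAX_DURATION = timedelta(days=14)   # "a period of up to 14 days"
MAX_LEAD_TIME = timedelta(weeks=8)  # "up to eight weeks in advance"


def is_valid_calendar_request(today: date, start: date, end: date) -> bool:
    """Return True if a hypothetical reservation fits the quoted limits."""
    # Reject reservations that start in the past or have no positive length.
    if start < today or end <= start:
        return False
    within_duration = (end - start) <= MAX_DURATION
    within_lead_time = (start - today) <= MAX_LEAD_TIME
    return within_duration and within_lead_time


if __name__ == "__main__":
    today = date(2024, 4, 9)
    # A 14-day reservation starting about three weeks out.
    print(is_valid_calendar_request(today, date(2024, 5, 1), date(2024, 5, 15)))
    # A 15-day reservation exceeds the maximum duration.
    print(is_valid_calendar_request(today, date(2024, 5, 1), date(2024, 5, 16)))
```

How the actual service treats boundary cases, such as a request made exactly eight weeks ahead, is not specified in the announcement.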

“Dynamic Workload Scheduler improved on-demand GPU obtainability by 80%, accelerating experiment iteration for our researchers,” said Alex Hays, a software engineer at Two Sigma Inc. “Leveraging the built-in Kueue and GKE integration, we were able to take advantage of new GPU capacity in Dynamic Workload Scheduler quickly and save months of development work.”

Image: rorozoa/Freepik
