Google Cloud updates Kubernetes Engine with support for trillion-parameter AI models
As generative artificial intelligence models continue to grow in size to as many as 2 trillion parameters, the need for compute and storage for large language models is following suit.
Today Google Cloud announced that it has upgraded its Kubernetes Engine’s capacity in anticipation of even larger models, with support for 65,000-node clusters, up from its current support for 15,000-node clusters. This capacity will provide the size and compute power needed to handle even the world’s most complex and resource-hungry AI workloads.
Training these multitrillion-parameter models on AI accelerators already requires clusters that exceed 10,000 nodes. Parameters are the variables within AI models that control how they behave and the predictions they can make. Having more of them can improve a model’s ability to make accurate predictions. They’re similar to knobs or switches that the model developer can adjust to enhance its performance or accuracy.
“Fundamentally, these large-scale LLMs keep getting bigger from companies around the world and require very large clusters to operate efficiently,” Drew Bradstock, senior product director for Kubernetes and serverless at Google Cloud, told SiliconANGLE in an exclusive interview. “It’s not just they require large clusters. They require clusters that are reliable, scalable and can handle the challenges these large LLM training workloads actually encounter.”
GKE, or Google Kubernetes Engine, is a managed Kubernetes service from Google that reduces the effort of running container environments. GKE automatically adds and removes hardware resources, such as specialized AI chips or graphics processing units, as workload requirements change. It also handles Kubernetes updates for users and supervises other maintenance tasks.
The new 65,000-node cluster can manage AI models spread across 250,000 tensor processing units, specialized AI processors designed to accelerate machine learning and generative AI workloads. Bradstock said this is a fivefold increase over GKE’s previous single-cluster benchmark of 50,000 TPU chips.
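The scale figures quoted so far can be checked with a few lines of arithmetic. The following is only an illustrative sketch in Python: the constant and function names are made up for this example, and the only inputs are the node and TPU counts reported in the article.

```python
# Figures as reported in the article; all names here are illustrative.
OLD_MAX_NODES = 15_000        # previous GKE cluster limit
NEW_MAX_NODES = 65_000        # newly announced cluster limit
OLD_TPU_BENCHMARK = 50_000    # previous single-cluster TPU benchmark
NEW_TPU_CAPACITY = 250_000    # TPUs a 65,000-node cluster can manage


def scale_factor(new: int, old: int) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


node_factor = scale_factor(NEW_MAX_NODES, OLD_MAX_NODES)
tpu_factor = scale_factor(NEW_TPU_CAPACITY, OLD_TPU_BENCHMARK)

print(f"Node limit grew about {node_factor:.2f}x")
print(f"TPU capacity grew {tpu_factor:.1f}x")
```

The node limit grows a little more than fourfold, while the TPU figure matches the fivefold increase Bradstock describes.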
That greatly improves reliability and efficiency of running large-scale AI workloads. According to Bradstock, the increased scale is important for both large-scale AI training and inference as Kubernetes allows users to handle hardware-based failures without worrying about downtime. It also leads to faster job completion times, as the extra capacity can be used to run more iterations of models in a shorter timeframe.
To make this achievement possible, Bradstock said Google Cloud is transitioning GKE from the open-source etcd, a distributed key-value store, to a more robust system based on Spanner, Google’s distributed database. This will allow GKE clusters to handle virtually unlimited scale and provide improved latency.
Google also overhauled the GKE infrastructure so that it scales significantly faster, allowing customers to meet demand more quickly. GKE can also run five jobs in a single cluster, each matching the scale of Google Cloud’s previous record for training LLMs.
Bradstock said the need for these upgrades is being driven by customer interest, the popularity of AI on the platform and the rapid growth of AI across the industry. Google customers, including leading frontier AI model developers such as Anthropic PBC, have been taking advantage of GKE’s cluster capabilities to train their models.
“GKE’s new support for larger clusters provides the scale we need to accelerate our pace of AI innovation,” said James Bradbury, head of compute at Anthropic.
Over the past year, Bradstock said, there has been a 900% increase in the use of TPUs and graphics processing units on GKE, and that is on top of a substantial number already in use. “This is being driven by that rapid growth of AI,” he said. “With AI accounting for the majority of Kubernetes Engine’s usage going forward.”
Image: Pixabay