VMware and Nvidia team to deliver virtualized AI workloads
VMware Inc. and Nvidia Corp. said today they are teaming up to hasten development of enterprise artificial intelligence applications.
New releases of the virtualization giant’s vSphere 7 server virtualization and vSAN 7 storage virtualization products will run applications requiring AI-ready infrastructure with improved security and simplified operations. Specifically, VMware and Nvidia said they’ll deliver a software stack that enables customers to develop new applications as well as modernize existing applications and infrastructure using Nvidia hardware.
Updates to the jointly developed AI-Ready Enterprise platform include certification of VMware vSphere 7 Update 2 for Nvidia AI Enterprise, which is described as a cloud-native collection of optimized AI applications and frameworks. The combination enables up to 20 times faster performance of Nvidia graphics processing unit-based workloads on top of VMware virtual machines than was previously possible.
“The performance of vSphere is virtually indistinguishable from bare metal,” said Justin Boitano, general manager of enterprise and edge computing at Nvidia. “You can manage under one control plane with no silos.”
The vSphere update also adds support for the Nvidia A100 and Nvidia A40 Tensor Core GPUs on Nvidia-certified Systems, which include the company’s HGX and EGX server platforms. This enables customers to add an AI-specific platform to their existing virtualized environment instead of running AI workloads separately.
“AI is a full-stack computing problem, but it has been kind of bespoke to this point with a do-it-yourself approach to set up and manage,” Boitano said. “This lets us use the VMware tools that already exist for AI with the full performance under vSphere.”
Better sharing and workload portability
The integration will enable VMware customers to take advantage of features in the latest generation of Nvidia GPUs such as multi-instance GPU, which allows GPU cycles to be shared across multiple users. Workloads running on those GPUs can now be moved with VMware’s vSphere vMotion and load-balanced with vSphere Distributed Resource Scheduler.
The latter enables users to move applications across nodes in a common cluster or to distribute them live as workload demands change, said Lee Caswell, vice president of the cloud platform business unit at VMware. Another feature supports Nvidia multi-instance GPUs to permit a single GPU to be shared across as many as seven VMs with fault isolation to prevent downtime.
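As a rough illustration of that seven-VM sharing limit, the following minimal sketch places VMs onto GPUs first-fit, one GPU instance per VM. It is not VMware or Nvidia code: the names `MAX_INSTANCES_PER_GPU` and `place_vms` are hypothetical, it assumes every instance is the same size, and in a real deployment placement would be handled by vSphere Distributed Resource Scheduler rather than a script like this.

```python
from typing import Dict, List

# Assumption: each GPU can be split into at most seven isolated instances,
# matching the "as many as seven VMs" figure quoted in the article.
MAX_INSTANCES_PER_GPU = 7


def place_vms(vm_names: List[str], gpu_count: int) -> Dict[int, List[str]]:
    """Assign each VM one GPU instance, filling GPUs in order (first fit)."""
    placement: Dict[int, List[str]] = {gpu: [] for gpu in range(gpu_count)}
    for vm in vm_names:
        for gpu in range(gpu_count):
            if len(placement[gpu]) < MAX_INSTANCES_PER_GPU:
                placement[gpu].append(vm)
                break
        else:
            # Every GPU already hosts seven VMs, so this VM cannot be placed.
            raise RuntimeError(f"no free GPU instance for {vm}")
    return placement
```

Calling `place_vms` with nine VM names and two GPUs would fill the first GPU with seven VMs and put the remaining two on the second; asking for an eighth VM on a single GPU raises an error.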
In addition, Nvidia has certified a library of AI and data science applications and frameworks, cloud-native deployment tools and Nvidia infrastructure optimization libraries that it calls Nvidia AI Enterprise for use with vSphere. “If a new company is starting on the journey to AI, we’ve found they can spend over 80 weeks to curate the data, train the model, develop it and build a computer vision pipeline to the factory floor,” Boitano said. Nvidia’s pre-trained models and Transfer Learning Toolkit, which can be used to extract learned features from an existing neural network model to a new one, “can reduce that to 8 weeks,” he said.
Separate from the Nvidia partnership, VMware said it’s including VMware NSX Advanced Load Balancer Essentials as part of vSphere with its Tanzu application modernization suite. That enables VMware-supported multicloud load balancing for Kubernetes clusters and a path to the full capabilities of the NSX Advanced Load Balancer Enterprise Edition.
Kubernetes is the popular orchestrator for the portable, modular software platforms called containers. VSphere with Tanzu includes a refreshed supervisor with the latest Kubernetes 1.19 release that features enhancements to simplify upgrades and improve stability.
Hyperconverged features for vSAN
The vSAN storage virtualization layer, which VMware says is now used by more than 30,000 customers, is being updated with enhanced HCI Mesh. That’s a software-based form of hyperconverged infrastructure that enables organizations to unite islands of storage into a single virtual resource.
Update 2 is particularly aimed at customers looking to increase resource efficiency beyond their existing vSAN environment by enabling compute-only or non-HCI clusters to remotely use storage from a vSAN cluster within the data center, thus enabling compute and storage to be scaled independently.
“This has always been the knock on HCI: How do I know the next node is the right mix of compute and capacity?” Caswell said. “This will allow customers to flex and share that capacity across servers, enabling even individual blade servers to access vSAN storage directly.” HCI Mesh “breaks scalability limitations,” he said. “You can tap into any excess capacity you have.”
Update 2 of vSAN 7 also adds new capabilities to better support various physical topologies, including integrated distributed resource scheduler awareness of stretched cluster configurations. That enables more consistent failback performance along with vSAN file services support for stretched clusters and two-node clusters.
“If you move compute to a different location you’ve typically got performance problems,” Caswell said. “Enhanced stretched clusters retain that colocation of compute and storage, even on failover events.” Performance is also improved with support for remote direct memory access, which allows hosts to access each other’s memory without CPU intervention.
On the security front, vSphere 7 Update 2 introduces Confidential Containers for vSphere Pods, which use an Advanced Micro Devices Inc. hardware feature that encrypts all CPU register contents when a VM stops running. There’s also a new vSphere Native Key Provider that delivers basic key management server capabilities, making it easier for customers to enable encryption and advanced security features out of the box.
In distributed hybrid cloud and new edge environments, Caswell said, “we can have local air-gapped remote offices run independently of central key management. That avoids cost and complexity of external key management services.”
The new vSAN 7 also supports vSphere Proactive High Availability, which moves application state and stored data to another host before degraded hardware fails, avoiding data loss. Enhanced data durability reduces downtime and data loss for unplanned outages such as multiple disk failures.
All updates are available immediately.