UPDATED 09:00 EDT / MAY 28 2025

INFRA

Atlas Cloud optimizes AI inference service to boost GPU throughput

Cloud infrastructure startup Atlas Cloud today launched a highly optimized artificial intelligence inference service that it says dramatically reduces the computational requirements of even the most demanding AI workloads.

The new service, called Atlas Inference, is designed to provide companies with a more cost-effective and simpler environment in which they can deploy and run their large language models.

Atlas Cloud is the creator of a cloud-based infrastructure platform built specifically for AI workloads. It provides low-cost, on-demand access to clusters of up to 5,000 graphics processing units for both AI training and inference. Customers can choose from a selection of GPU types, and the platform is serverless, so they don’t have to configure their clusters or carry out maintenance work.

The new Atlas Inference service is based on the open-source SGLang inference engine. The company says it maximizes GPU efficiency by processing more tokens with fewer computational resources, and claims it can deliver 2.1 times the throughput of equivalent AI inference services from the likes of Amazon Web Services Inc. and Nvidia Corp.
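Because SGLang is open source, the general serving setup is easy to reproduce. Below is a minimal sketch using SGLang’s offline engine API; the model path, tensor-parallel degree and sampling parameters are illustrative assumptions, not Atlas Cloud’s configuration.

```python
# Minimal sketch of serving a model on the open-source SGLang engine,
# the project Atlas Inference is built on. The model path, tensor-parallel
# degree and sampling parameters are illustrative, not Atlas Cloud settings.
import sglang as sgl

if __name__ == "__main__":
    # Shard the model across 2 GPUs with tensor parallelism (assumption).
    engine = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct", tp_size=2)

    prompts = ["Explain prefill/decode disaggregation in one sentence."]
    outputs = engine.generate(prompts, {"temperature": 0.7, "max_new_tokens": 128})
    for out in outputs:
        print(out["text"])

    engine.shutdown()
```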

When running heavyweight, tensor-parallel AI systems, Atlas Inference can reportedly deliver equal or superior throughput while using 50% fewer GPUs. It features real-time load balancing that evenly distributes tokens across nodes and reduces latency spikes on overloaded ones, which the company says keeps performance stable even under heavy load. In its tests, the service maintained sub-five-second first-token latency and 100-millisecond inter-token latency across more than 10,000 concurrent sessions.
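Those two figures correspond to the standard time-to-first-token (TTFT) and inter-token latency (ITL) metrics. The sketch below shows how a client might verify such numbers against a streaming, OpenAI-compatible endpoint of the kind SGLang exposes; the base URL and model name are placeholders, not Atlas Cloud endpoints.

```python
# Sketch: measuring time-to-first-token (TTFT) and inter-token latency (ITL)
# against a streaming OpenAI-compatible endpoint, such as the one SGLang
# serves. The base URL and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
arrivals = []
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Summarize tensor parallelism."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        arrivals.append(time.perf_counter())  # timestamp each token chunk

ttft = arrivals[0] - start                            # first-token latency
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
itl = sum(gaps) / len(gaps) if gaps else 0.0          # mean inter-token gap
print(f"TTFT: {ttft:.2f}s, mean ITL: {itl * 1000:.0f}ms")
```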

The company adds that a 12-node Atlas Inference cluster outperformed DeepSeek Ltd.’s reference implementation of its DeepSeek V3 model while using only two-thirds of the computational capacity, cutting operational expenses by 80% in the process.

Atlas Cloud says this was made possible by four separate innovations. They include a “prefill/decode disaggregation” technique that separates compute-intensive operations from memory-bound processes to boost efficiency. There’s also “DeepExpert Parallelism,” which uses load balancing to increase GPU utilization across the entire cluster. Other innovations include Atlas Cloud’s proprietary two-batch overlap technology, which boosts throughput by enabling larger token batches, and the use of “DisposableTensor memory models,” which help prevent system crashes.
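Atlas Cloud hasn’t published implementation details, but the idea behind prefill/decode disaggregation is straightforward: route the compute-bound prompt-processing phase and the memory-bound token-generation phase to separate worker pools so neither starves the other. The following is a deliberately simplified, hypothetical sketch of that hand-off:

```python
# Deliberately simplified, hypothetical sketch of prefill/decode
# disaggregation: requests pass through a compute-bound prefill pool, then
# their KV-cache handle moves to a memory-bound decode pool. All names are
# illustrative; Atlas Cloud has not published its implementation.
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: str
    kv_cache: object = None                  # handle produced by prefill
    generated: list = field(default_factory=list)

prefill_queue: Queue = Queue()               # served by compute-optimized workers
decode_queue: Queue = Queue()                # served by memory-bandwidth-bound workers

def prefill_step(run_prefill):
    req = prefill_queue.get()
    req.kv_cache = run_prefill(req.prompt)   # one large, compute-heavy pass
    decode_queue.put(req)                    # hand off for token generation

def decode_step(next_token):
    req = decode_queue.get()
    tok = next_token(req.kv_cache)           # one small, memory-bound step
    req.generated.append(tok)
    if tok != "<eos>":
        decode_queue.put(req)                # re-queue until generation ends
```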

Another advantage of Atlas Inference is its linear scaling across nodes: the platform expands and contracts GPU clusters automatically in real time, which helps optimize infrastructure costs.
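The company hasn’t detailed its scaling logic, but a throughput-driven autoscaler of the kind it describes might look like the sketch below. The utilization thresholds are assumptions for illustration; the capacity constant borrows the per-node throughput figures the company cites later in this article.

```python
# Hypothetical sketch of throughput-driven scaling: grow or shrink the node
# pool so per-node token throughput stays in a target band. The thresholds
# are assumptions; the capacity constant uses the per-node figures the
# company cites (54,500 input + 22,500 output tokens/sec).
TOKENS_PER_NODE = 54_500 + 22_500            # assumed sustainable tokens/sec/node

def desired_nodes(current_nodes: int, cluster_tokens_per_sec: float) -> int:
    utilization = cluster_tokens_per_sec / (current_nodes * TOKENS_PER_NODE)
    if utilization > 0.85:                             # near saturation: add a node
        return current_nodes + 1
    if utilization < 0.40 and current_nodes > 1:       # underused: shed a node
        return current_nodes - 1
    return current_nodes

# A 12-node cluster pushing 700,000 tokens/sec stays at 12 nodes:
print(desired_nodes(12, 700_000))
```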

Atlas Cloud Chief Executive Jerry Tang said the company wants to change the economics of AI deployment to make it more profitable for enterprises. He explained that many companies can barely break even at the moment, while others run their AI applications and services at a loss, because of sky-high computational costs.

“Our platform’s ability to process 54,500 input tokens and 22,500 output tokens per second per node means businesses can finally make high-volume LLM services profitable,” Tang said. “I believe this will have a significant ripple effect throughout the industry. We’re surpassing industry standards set by hyperscalers by delivering superior throughput with fewer resources.”
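Taken at face value, those per-node numbers make the unit economics easy to sanity-check. In the back-of-the-envelope sketch below, the hourly node cost and per-million-token prices are illustrative assumptions, not Atlas Cloud pricing:

```python
# Back-of-the-envelope check of the per-node figures Tang cites. The node
# cost and per-million-token prices are illustrative assumptions, not
# Atlas Cloud pricing.
INPUT_TPS, OUTPUT_TPS = 54_500, 22_500    # tokens/sec per node, per Atlas Cloud
NODE_COST_PER_HOUR = 20.0                 # assumed fully loaded cost, USD
PRICE_IN, PRICE_OUT = 0.50, 1.50          # assumed USD per million tokens

secs = 3600
revenue = (INPUT_TPS * secs / 1e6) * PRICE_IN + (OUTPUT_TPS * secs / 1e6) * PRICE_OUT
print(f"revenue per node-hour: ${revenue:.2f} vs. cost ${NODE_COST_PER_HOUR:.2f}")
# -> revenue per node-hour: $219.60 vs. cost $20.00 (at full, sustained load)
```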

The startup says Atlas Inference is compatible with any type of GPU hardware and supports any kind of AI model. It’s available starting today via the company’s cloud-based servers, and can also be run on customers’ on-premises servers.

Image: SiliconANGLE/Dreamina
