UPDATED 14:00 EST / NOVEMBER 18 2024

Cerebras Systems upgrades its inference service with record performance on Meta’s largest LLM

Cerebras Systems Inc., an ambitious artificial intelligence computing startup and rival chipmaker to Nvidia Corp., said today that its cloud-based large language model inference service can run Meta Platforms Inc.’s largest model at almost 1,000 tokens per second.

Inference is the process by which a trained AI model applies what it has learned to new data, drawing conclusions and making “inferences,” or predictions, about what will come next. It’s essentially the decision-making portion of AI deployment. The number of tokens a model can generate per second determines how quickly it can think and respond.
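
To make the metric concrete, here is a minimal sketch, in Python, of how tokens-per-second throughput translates into response time. The 200-token reply length and the 20-tokens-per-second baseline are illustrative assumptions, not Cerebras figures.

# A minimal sketch of what tokens-per-second throughput means in practice.

def reply_latency_seconds(reply_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply of a given length at a given throughput."""
    return reply_tokens / tokens_per_second

# A 200-token answer at the roughly 1,000 tokens per second cited above:
print(reply_latency_seconds(200, 1000))  # 0.2 seconds
# The same answer at an assumed 20 tokens per second:
print(reply_latency_seconds(200, 20))    # 10 seconds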

The company launched its inference service in August, claiming it could run up to 16 times faster than comparable cloud-based services that use Nvidia’s most powerful graphics processing units. According to Cerebras, it delivers 2,100 tokens per second running Llama 3.1 70B.

The current system can deliver the much more complex Meta Llama 3.1 405B at a lightning-fast 969 tokens per second. According to Cerebras, that puts the super-giant frontier model on par with ultra-small models, which can run at near-instant speeds.

The 405B means the model has 405 billion parameters, the variables that configure how an AI model processes data. The larger the number of parameters, the more capable the model generally is of producing accurate, high-quality results. But bigger, more complex models also require far more computation per token, which makes them slower and can hurt the user experience.
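
A rough calculation illustrates the scaling problem: on conventional hardware, generating each token requires reading the model’s weights from memory, so parameter count translates directly into work per token. The 16-bit weight precision below is an assumption for illustration, and real deployments use batching and caching tricks this sketch ignores.

# Back-of-the-envelope arithmetic for why parameter count slows inference.
# Assumes 16-bit (2-byte) weights; real deployments vary.

params = 405e9       # Llama 3.1 405B parameter count
bytes_per_param = 2  # FP16/BF16, an illustrative assumption

weight_bytes = params * bytes_per_param
print(f"{weight_bytes / 1e9:.0f} GB of weights")  # 810 GB

# Sustaining 969 tokens per second means streaming those weights
# roughly 969 times each second (ignoring batching and caching):
print(f"{weight_bytes * 969 / 1e12:.0f} TB/s effective bandwidth")  # ~785 TB/s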

“We can now run 405B faster than GPUs can run Llama 1B,” James Wang, director of product marketing at Cerebras, told SiliconANGLE in an interview. “Llama 1B, the tiniest Llama, runs at 550 tokens per second on the fastest solution measured so far, and we can take a model 405 times larger and run it at twice the speed.”
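
Taking the quoted figures at face value, the arithmetic behind Wang’s comparison is straightforward; the snippet below simply restates the numbers above, not new measurements.

# Ratios implied by the figures quoted above, not new measurements.
cerebras_405b = 969   # tokens per second, Cerebras on Llama 3.1 405B
fastest_gpu_1b = 550  # tokens per second, cited fastest GPU result on Llama 1B

print(cerebras_405b / fastest_gpu_1b)  # ~1.76, roughly "twice the speed"
print(405e9 / 1e9)                     # 405.0: the parameter-count gap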

The company achieves these speeds thanks to the specialized architecture of its custom silicon and the software stack it has built for AI and high-performance computing, or HPC, workloads. Over the past year, the company has made headlines with claims that its dinner-plate-sized chips are not only more powerful than those minted by Nvidia, but also more cost-efficient for AI training and, now, inference.

According to the company, these speeds allow the open-source Llama 3.1 405B to run more than 10 times faster than closed-source frontier models such as OpenAI’s GPT-4o and Anthropic PBC’s Claude 3.5 Sonnet.

For a real-world comparison, consider text-query latency, or the time it takes the 405-billion-parameter model to come back with an answer to a query. On the fastest GPU system benchmarked, that takes about five seconds; Cerebras says it can respond in 0.07 seconds, which to a human is almost instantaneous. For voice applications, which require responses in less than 100 milliseconds to feel like natural human conversation, GPUs can take 700 milliseconds, while Cerebras takes less than 10 milliseconds.
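
Restated as simple arithmetic, those latency figures imply roughly a 70-fold gap in both scenarios:

# The latency figures quoted above and the speedups they imply.
text_gpu_ms, text_cerebras_ms = 5_000, 70
voice_gpu_ms, voice_cerebras_ms = 700, 10
voice_threshold_ms = 100  # cited limit for natural-feeling conversation

print(text_gpu_ms / text_cerebras_ms)    # ~71x faster on text queries
print(voice_gpu_ms / voice_cerebras_ms)  # 70x faster for voice
# Only the Cerebras voice figure fits under the 100-millisecond budget:
print(voice_cerebras_ms < voice_threshold_ms < voice_gpu_ms)  # True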

Pricing for Llama 3.1 405B will be $6 per million input tokens and $12 per million output tokens, around 25% cheaper than comparable offerings from Amazon Web Services Inc., Google LLC and Microsoft Corp.’s Azure. Trials are available to customers today, and the company intends to make the service generally available in the first quarter of 2025.
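
As a quick sketch of what that pricing means per request, here’s a minimal cost calculation; the prompt and reply sizes are hypothetical, chosen only for illustration.

# A minimal cost sketch using the per-token prices quoted above.
PRICE_PER_M_INPUT = 6.00    # dollars per million input tokens
PRICE_PER_M_OUTPUT = 12.00  # dollars per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted Llama 3.1 405B rates."""
    return (input_tokens / 1e6) * PRICE_PER_M_INPUT \
         + (output_tokens / 1e6) * PRICE_PER_M_OUTPUT

# A hypothetical 2,000-token prompt with a 500-token answer:
print(f"${request_cost(2_000, 500):.4f}")  # $0.0180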

Cerebras customers include generative AI video research company Tavus Inc.; LiveKit, a company that provides the edge network that powers OpenAI’s voice mode capabilities; and multinational pharmaceutical company GlaxoSmithKline PLC.

“Using Llama models running on Cerebras Inference, GSK is developing innovative AI applications, such as intelligent research agents, that will fundamentally improve the productivity of our researchers and drug discovery process,” said Kim Branson, senior vice president of AI and machine learning at GSK.

Alongside assisting with drug discovery, Cerebras announced that it achieved world-record performance in molecular dynamics simulation using its powerful hardware. According to the company, it has broken world records by running simulations 700x faster than Frontier, the world’s first and fastest exascale supercomputer. It even exceeded the performance of Anton 3, a custom supercomputer purpose-built for molecular dynamics.

These supercomputers are fundamental to discovering new materials that might become the next heat shielding for rockets, protective structures for nuclear reactors or specialized proteins for medicine. The ability to predict, with vast computing power, how those materials will behave far into the future makes such simulations extremely valuable to scientists and engineers.

Using Cerebras hardware, scientists were able to simulate atomic activity forward in time at 1.2 million timesteps per second, a first in molecular dynamics history, according to the company. The Frontier supercomputer averages 1,700 timesteps per second using 37,888 GPUs, and Anton 3’s custom hardware reaches 980,000 using 512 custom chips.
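
For reference, the speedups implied by those simulation rates line up with the company’s claims:

# The simulation rates quoted above and the speedups they imply.
cerebras_steps = 1_200_000  # timesteps per second on Cerebras hardware
frontier_steps = 1_700      # timesteps per second on Frontier (37,888 GPUs)
anton3_steps = 980_000      # timesteps per second on Anton 3 (512 chips)

print(cerebras_steps / frontier_steps)  # ~706x, matching the "700x" claim
print(cerebras_steps / anton3_steps)    # ~1.22x faster than Anton 3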
