Cerebras Systems upgrades its inference service with record performance for Meta’s largest LLM
Cerebras Systems Inc., an ambitious artificial intelligence computing startup and rival chipmaker to Nvidia Corp., said today that its cloud-based AI large language model inference service can run Meta Platforms Inc.’s largest model at almost 1,000 tokens per second.
Inference is the process by which a trained AI model applies what it has learned to new data to draw conclusions and make “inferences,” or predictions, about what will come next. It’s essentially the decision-making portion of AI deployment. The number of tokens a model can process per second determines how quickly it can think and respond.
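To make the tokens-per-second figure concrete, here is a minimal sketch of how throughput might be measured from a streaming response. The stream iterable is a hypothetical stand-in for whatever streaming API a given service exposes, not Cerebras’ actual interface:

```python
import time

def tokens_per_second(stream):
    """Count tokens from a streaming response and report throughput."""
    start = time.perf_counter()
    count = 0
    for _token in stream:  # each item is one generated token
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed

# Dummy stream standing in for a real API's token generator.
print(f"{tokens_per_second(iter(['tok'] * 100_000)):,.0f} tokens/sec")
```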
The company launched its inference service in August, claiming it could run up to 16 times faster than comparable cloud-based services that use Nvidia’s most powerful graphics processing units. According to Cerebras, the service delivers 2,100 tokens per second running Llama 3.1 70B.
The current system can deliver the much more complex Meta Llama 3.1 405B at a lightning-fast 969 tokens per second. According to Cerebras, this puts the super-giant frontier model on par with ultra-small models, which respond almost instantly.
The “405B” means the model has 405 billion parameters, the variables that determine how an AI model processes data. The larger the number of parameters, the more capable the model can be of producing accurate, high-quality results. Bigger, more complex models also require far more computation per generated token, which makes them slower, and that can hurt the user experience.
“We can now run 405B faster than GPUs can run Llama 1B,” James Wang, director of product marketing at Cerebras, told SiliconANGLE in an interview. “Llama 1B, the tiniest Llama, runs at 550 tokens per second on the fastest solution measured so far, and we can take a model 405 times larger and run it at twice the speed.”
The company achieves these speeds thanks to the specialized architecture built into its custom silicon and the software stack it designed for AI and high-performance computing, or HPC, workloads. Cerebras has made headlines over the past year with claims that its dinner-plate-sized chips are not only more powerful than those made by Nvidia, but also more cost-efficient for AI training and, now, inference.
According to the company, these speeds allow the open-source Llama 3.1 405B to run more than 10 times faster than closed-source frontier models such as OpenAI’s GPT-4o and Anthropic PBC’s Claude 3.5 Sonnet.
For a real-world comparison, consider text-query latency, or the time it takes the 405-billion-parameter model to come back with an answer: on the fastest GPU system benchmarked, it is about five seconds. Cerebras says it can respond in 0.07 seconds, which to a human is almost instantaneous. For voice applications, which require responses in under 100 milliseconds to feel like natural human conversation, GPUs can take 700 milliseconds; Cerebras takes less than 10.
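Worked out from those quoted figures, the implied speedups are straightforward arithmetic (a quick check of the numbers above, not an independent benchmark):

```python
# Ratios implied by the latency figures quoted above.
gpu_text, cerebras_text = 5.0, 0.07      # seconds per text query
gpu_voice, cerebras_voice = 0.70, 0.01   # seconds per voice response

print(f"Text query: ~{gpu_text / cerebras_text:.0f}x faster")    # ~71x
print(f"Voice:      ~{gpu_voice / cerebras_voice:.0f}x faster")  # ~70x
print(f"Per token at 969 tok/s: {1000 / 969:.2f} ms")            # ~1.03 ms
```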
Pricing for Llama 3.1 405B will be $6 per million input tokens and $12 per million output tokens, around 25% cheaper than comparable offerings from Amazon Web Services Inc., Google LLC and Microsoft Corp.’s Azure. Trials of the service are available to customers today, and the company intends to make it generally available in the first quarter of 2025.
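At those rates, estimating a bill is simple arithmetic. The workload numbers in this sketch are illustrative assumptions, not Cerebras figures:

```python
INPUT_RATE = 6.0 / 1_000_000    # dollars per input token
OUTPUT_RATE = 12.0 / 1_000_000  # dollars per output token

def cost(input_tokens: int, output_tokens: int) -> float:
    """Total cost in dollars at the quoted Llama 3.1 405B rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical monthly workload: 50M input tokens, 10M output tokens.
print(f"${cost(50_000_000, 10_000_000):,.2f}")  # $420.00
```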
Cerebras customers include generative AI video research company Tavus Inc.; LiveKit, which provides the edge network that powers OpenAI’s voice mode capabilities; and multinational pharmaceutical company GlaxoSmithKline PLC.
“Using Llama models running on Cerebras Inference, GSK is developing innovative AI applications, such as intelligent research agents, that will fundamentally improve the productivity of our researchers and drug discovery process,” said Kim Branson, senior vice president of AI and machine learning at GSK.
Alongside assisting with drug discovery, Cerebras announced that it achieved world-record performance in molecular dynamics simulation using its hardware. According to the company, it ran simulations 700 times faster than Frontier, the world’s first and fastest exascale supercomputer, and even exceeded the performance of Anton 3, a custom supercomputer purpose-built for molecular dynamics.
These supercomputers are fundamental for discovering new materials that might become the next heat shielding for rockets sent into space, protective structures for nuclear reactors or specialized proteins for medicine. The ability to use vast computing power to predict how those materials will behave far into the future makes the machines very valuable to scientists and engineers.
Using Cerebras hardware, scientists were able to simulate atomic activity far into the future at 1.2 million timesteps per second, a first in molecular dynamics history, according to the company. The Frontier supercomputer averages 1,700 timesteps per second using 37,888 GPUs, and Anton 3’s custom hardware reaches 980,000 using 512 custom chips.
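Those throughput figures are also where the record claims come from; the ratios fall out directly from the numbers quoted above:

```python
# Timesteps per second, as quoted above.
cerebras = 1_200_000
frontier = 1_700       # across 37,888 GPUs
anton3 = 980_000       # across 512 custom chips

print(f"vs. Frontier: ~{cerebras / frontier:.0f}x")  # ~706x, the "700x" claim
print(f"vs. Anton 3:  ~{cerebras / anton3:.1f}x")    # ~1.2x
```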
Photo: Cerebras Systems