

Meta Platforms Inc. today introduced Llama 3.3 70B, the latest addition to its Llama line of open-source large language models.
The new model delivers output quality comparable to that of Llama 3.1 405B, the most advanced LLM in the series, while using a fraction of the hardware. The result is a significant drop in infrastructure expenses: Meta says that Llama 3.3 70B generates prompt responses nearly five times more cost-efficiently.
The model is based on an optimized version of the Transformer architecture, the neural network design that underpins most cutting-edge LLMs. When analyzing a set of data points, Transformer-based models use a so-called attention mechanism to determine which data points are most relevant to the task at hand. Meta replaced the default attention mechanism with an improved implementation that lowers inference costs.
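The announcement doesn't name the technique, but Meta's published Llama model documentation points to grouped-query attention, in which several query heads share a single key-value head to shrink the memory footprint at inference time. The PyTorch sketch below illustrates the general idea only; the dimensions, weights and function name are illustrative assumptions, not Meta's implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Toy grouped-query attention: n_q_heads query heads share
    n_kv_heads key/value heads, cutting the KV cache size.
    All dimensions here are illustrative, not Meta's configuration."""
    batch, seq, dim = x.shape
    head_dim = dim // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per KV head

    q = (x @ wq).view(batch, seq, n_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)

    # Repeat each KV head so every query head has a matching key/value.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    attn = F.softmax(scores, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(batch, seq, dim)

# Example: 8 query heads backed by only 2 key-value heads.
dim, n_q, n_kv = 64, 8, 2
head_dim = dim // n_q
x = torch.randn(1, 10, dim)
wq = torch.randn(dim, n_q * head_dim)
wk = torch.randn(dim, n_kv * head_dim)
wv = torch.randn(dim, n_kv * head_dim)
out = grouped_query_attention(x, wq, wk, wv, n_q, n_kv)
print(out.shape)  # torch.Size([1, 10, 64])
```

Because the key-value tensors are four times smaller than in standard multi-head attention here, the cache a server must hold per request shrinks accordingly, which is where the inference savings come from.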
The company’s engineers trained Llama 3.3 70B on a cluster of H100-80GB chips from Nvidia Corp. The chips’ TDP, or thermal design power, a rating of the maximum power a processor is designed to draw, was set to the 700-watt maximum. Meta says that the LLM took 39.3 million graphics card-hours to train.
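Those two figures imply a rough upper bound on the training run's energy use, sketched below. The calculation assumes every GPU drew its full 700-watt TDP for the entire run, which real workloads rarely sustain, so the true figure would be lower.

```python
# Back-of-the-envelope energy estimate from Meta's reported figures.
# Assumes sustained draw at the full 700 W TDP, an upper bound.
gpu_hours = 39.3e6   # graphics card-hours reported by Meta
tdp_watts = 700      # H100-80GB power cap used for the run

energy_gwh = gpu_hours * tdp_watts / 1e9  # watt-hours -> gigawatt-hours
print(f"~{energy_gwh:.1f} GWh upper bound")  # ~27.5 GWh
```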
The training dataset includes about 15 trillion tokens, units of data that each correspond to a few characters of text. Meta used information from the public web, as well as more than 25 million synthetic examples. Those are AI-generated data points created specifically for LLM development purposes.
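For a sense of what a token is in practice, the snippet below counts tokens with Hugging Face's transformers library. The model identifier is Llama 3.3's Hugging Face listing, and access to it is gated, so this assumes credentials; any compatible tokenizer can stand in.

```python
from transformers import AutoTokenizer

# Llama 3.3's gated Hugging Face listing; requires accepting Meta's
# license. Any compatible tokenizer works for illustration.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

text = "Meta trained Llama 3.3 70B on roughly 15 trillion tokens."
ids = tok.encode(text)
print(len(ids), "tokens for", len(text), "characters")
```

English text typically works out to roughly three to four characters per token, which is why token counts are smaller than character counts.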
After Meta completed the initial training process, it refined Llama 3.3 70B with several methods.
One of the techniques the company used is known as supervised fine-tuning. It involves providing a freshly trained LLM with additional labeled datasets that it didn’t access during the initial training. Those datasets pair prompts with reference answers, which makes it easier for the LLM to find useful patterns.
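In code terms, supervised fine-tuning typically means penalizing the model when its next-token predictions diverge from the reference answer, while leaving the prompt tokens out of the loss. Below is a minimal PyTorch sketch of that loss under those standard assumptions; it is not Meta's pipeline, and the stand-in model is purely illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Supervised fine-tuning loss: cross-entropy on the reference
    response tokens only, with prompt tokens masked out."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(input_ids)  # (batch, seq, vocab)

    # Predict token t+1 from tokens up to t.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask the prompt portion so only response tokens drive the update.
    shift_labels[:, : prompt_ids.size(1) - 1] = -100

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Stand-in model: any causal LM mapping token IDs to logits works here.
vocab = 100
toy = torch.nn.Sequential(torch.nn.Embedding(vocab, 32),
                          torch.nn.Linear(32, vocab))
loss = sft_loss(toy, torch.randint(0, vocab, (1, 5)),
                torch.randint(0, vocab, (1, 7)))
print(loss.item())
```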
Meta also used another AI method known as reinforcement learning from human feedback, or RLHF. While an LLM is being trained this way, it receives pointers from an algorithm on how to improve the quality of its output. RLHF combines those automatically generated pointers with feedback from humans: annotators rank sample outputs, and their preferences shape the reward signal that guides the model.
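The component that turns human rankings into an automatic signal is a reward model trained on preference pairs: annotators pick the better of two responses, and the reward model learns to score the preferred one higher. The sketch below shows the standard pairwise (Bradley-Terry) loss for that step, a common formulation rather than a detail Meta has published for this release; the stand-in reward model is illustrative.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise reward-model loss: push the score of the human-preferred
    response above the score of the rejected one (Bradley-Terry)."""
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar scores
    r_rejected = reward_model(rejected_ids)  # (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Stand-in reward model: mean-pools embeddings into a scalar score.
vocab = 100
emb = torch.nn.Embedding(vocab, 32)
head = torch.nn.Linear(32, 1)
toy_rm = lambda ids: head(emb(ids).mean(dim=1)).squeeze(-1)

loss = preference_loss(toy_rm,
                       torch.randint(0, vocab, (4, 6)),   # preferred
                       torch.randint(0, vocab, (4, 6)))   # rejected
print(loss.item())
```

Once trained, the reward model scores the LLM's candidate outputs during reinforcement learning, standing in for a human judge at scale.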
After completing the development process, Meta compared Llama 3.3 70B with Llama 3.1 405B using 10 AI benchmarks. Llama 3.3 70B trailed its larger namesake by less than 2% in six of the tests and achieved higher scores in three. It also mostly outperformed OpenAI’s GPT-4o.
According to Meta, processing 1 million input tokens with Llama 3.1 405B costs $1, while generating 1 million output tokens requires $1.80 worth of compute capacity. Llama 3.3 70B can manage the same tasks with 10 cents’ and 40 cents’ worth of infrastructure, respectively.
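The quick check below reproduces the savings those list prices imply. The blended figure depends on a workload's input-to-output mix, so the roughly fivefold headline claim lines up with the output side of the ledger.

```python
# Per-million-token prices cited by Meta (U.S. dollars).
llama_405b = {"input": 1.00, "output": 1.80}
llama_70b = {"input": 0.10, "output": 0.40}

for kind in ("input", "output"):
    ratio = llama_405b[kind] / llama_70b[kind]
    print(f"{kind}: {ratio:.1f}x cheaper")  # input: 10.0x, output: 4.5x
```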
Meta has made Llama 3.3 70B’s weights and code available on Hugging Face.