Google LLC today introduced a new large language model, Gemini 2.5 Flash-Lite, that can process prompts faster and more cost-efficiently than its predecessor.
The algorithm is rolling out as part of a broader update to the company’s flagship Gemini 2.5 LLM series. The two existing models in the lineup, Gemini 2.5 Flash and Gemini 2.5 Pro, have moved from preview to general availability. The former algorithm also received several pricing changes.
Gemini 2.5 made its original debut in March. The LLMs in the series are based on a mixture-of-experts architecture, which means that they each comprise multiple neural networks. When a user submits a prompt, Gemini 2.5 activates only a subset of those neural networks rather than all of them, which lowers hardware usage.
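The routing idea behind a mixture-of-experts architecture can be sketched in a few lines. The expert count, gating function and top-1 routing below are illustrative assumptions for demonstration, not details of Google's actual implementation:

```python
# Minimal mixture-of-experts routing sketch. A gate picks one expert
# per token, so only a fraction of the model's parameters run on any
# given input. (Toy example; not Google's implementation.)

def make_expert(expert_id):
    """A stand-in 'expert' network: here just a labeled function."""
    def expert(token):
        return f"expert-{expert_id} processed {token!r}"
    return expert

class MixtureOfExperts:
    def __init__(self, num_experts):
        self.experts = [make_expert(i) for i in range(num_experts)]

    def route(self, token):
        """Gate: activate one expert instead of running all of them."""
        # Toy deterministic gate; real models learn the gating weights.
        chosen = sum(map(ord, token)) % len(self.experts)
        return self.experts[chosen](token)

moe = MixtureOfExperts(num_experts=4)
print(moe.route("hello"))  # only one of the four experts runs
```

The efficiency gain comes from the gate: compute scales with the experts activated per token, not with the total parameter count.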
The LLM series is the first that Google trained using its internally developed TPUv5p AI chip. According to the company, the training process involved multiple server clusters that each contained 8,960 TPUv5p chips. Google’s researchers equipped the clusters with new software that can automatically mitigate some technical issues.
Gemini 2.5 models are multimodal with support for up to 1 million tokens per prompt. Google describes the flagship algorithm in the series, Gemini 2.5 Pro, as its most capable LLM to date. During internal tests, it outperformed OpenAI’s o3-mini across a range of math and coding benchmarks.
Gemini 2.5 Flash, the model that moved into general availability today together with Gemini 2.5 Pro, trades off some performance for efficiency. It responds to prompts faster and incurs lower inference costs. Gemini 2.5 Flash-Lite, the model that Google debuted today, is even more efficient and is positioned as the new entry point to the LLM series.
“2.5 Flash Lite has all-around higher quality than 2.0 Flash-Lite on coding, math, science, reasoning and multimodal benchmarks,” Tulsee Doshi, senior director of product management for Gemini, detailed in a blog post. “It excels at high-volume, latency-sensitive tasks like translation and classification, with lower latency than 2.0 Flash-Lite and 2.0 Flash on a broad sample of prompts.”
Gemini 2.5 Flash-Lite is billed at a rate of 10 cents per 1 million input tokens when developers submit prompts that contain text, images or video. That’s less than one-10th the cost of Gemini 2.5 Pro. The price per million tokens of output, in turn, is 40 cents compared with $10 for Gemini 2.5 Pro.
Google is changing the pricing of its mid-range Gemini 2.5 Flash model as part of the update. The company will now charge 30 cents per 1 million input tokens and $2.50 per 1 million output tokens compared with 15 cents and $3.50, respectively, before. Additionally, there is no longer separate pricing for tokens that the model processes in “thinking mode.” The mode allows the LLM to boost output quality by increasing the amount of time and compute resources that it uses to generate prompt responses.
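The per-token rates above translate into request costs as follows. This is a reader's back-of-the-envelope sketch using only the rates reported in this article (Gemini 2.5 Pro is omitted because its input rate isn't stated here); it is not an official Google pricing calculator:

```python
# Dollars per 1 million tokens, (input, output), as reported above.
RATES = {
    "gemini-2.5-flash-lite": (0.10, 0.40),
    "gemini-2.5-flash": (0.30, 2.50),
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in dollars for one request at the listed per-token rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

# A request with 10,000 input tokens and 1,000 output tokens:
print(f"{request_cost('gemini-2.5-flash-lite', 10_000, 1_000):.6f}")  # prints 0.001400
print(f"{request_cost('gemini-2.5-flash', 10_000, 1_000):.6f}")       # prints 0.005500
```

At these rates, the same request costs roughly four times as much on Gemini 2.5 Flash as on Flash-Lite, which is the trade-off the new model targets for high-volume workloads.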