

If you go to ChatGPT.com, choose the o4-mini model from the drop-down menu and enter a prompt, you’ll see a message you’ve probably never seen before.
“Thinking,” the chatbot responds as several seconds pass. It then presents a summary of the question you just asked and the process it used to formulate an answer. A list of sources may appear along with the search criteria that the model created to research the answer. After several more seconds, the response appears on your screen, along with an account of how much time was spent generating it.
That may seem like a lot of unnecessary detail to answer a simple question, but it’s part of the appeal of reasoning models, a new form of generative artificial intelligence that’s rippling throughout the AI landscape. O4-mini and other reasoning models explore multiple avenues of thought to reach a conclusion. Documenting the approach they take is part of what makes reasoning models so useful, and so effective at solving complex problems.
Gartner’s Carlsson: “When you talk about machines outperforming humans, you’re often talking about reinforcement learning.” Photo: SiliconANGLE
OpenAI’s first reasoning model, dubbed o1, debuted in December. It was followed in rapid succession by the o3 and o4-mini models, the most recent of which debuted on April 16 (there was no o2). Other reasoning models include DeepSeek’s R1, Anthropic PBC’s Claude 3.7 Sonnet, Google LLC’s Gemini 2.0 Flash Thinking, Alibaba Group Holding Ltd.’s QwQ and Mistral AI SAS’s Codestral.
Microsoft Corp. last month added reasoning agents to its Copilot Studio, describing them as able to “tackle complex, multi-step research at work — delivering insights with greater quality and accuracy than previously possible.”
Though they look much like the ChatGPT large language model from the outside, reasoning models behave quite differently under the hood. Conventional LLMs are trained with “next-token prediction” on vast corpora of text: they learn statistical patterns in language and generate the text that is statistically most likely to constitute a response.
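To make that concrete, here’s a toy Python sketch of next-token prediction. It uses simple bigram counts rather than a neural network, and the corpus is invented, but the objective is the same one LLMs are trained on: predict the statistically most likely next token.

```python
from collections import Counter, defaultdict

# Toy next-token predictor: count which word follows which in a tiny
# corpus, then greedily emit the most likely continuation. Production
# LLMs do this with neural networks over subword tokens, but the
# training objective is the same.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def generate(token: str, steps: int = 4) -> str:
    out = [token]
    for _ in range(steps):
        if token not in follows:
            break
        # Greedy decoding: always pick the most frequent next token.
        token = follows[token].most_common(1)[0][0]
        out.append(token)
    return " ".join(out)

print(generate("the"))  # -> "the cat sat on the"
```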
That makes them good at addressing problems that require broad general knowledge, but it limits them to the information in their training data. Because their answers are generated from probabilities rather than rigid rules, they’re prone to occasional wildly inaccurate responses known as hallucinations, and they’re poor at solving problems that require multistep logic. They’re notoriously bad at math, too.
Reasoning models produce chains of intermediate steps, breaking problems into sub-problems and applying logical inference to compose an answer. They typically consult external sources for guidance and may try multiple paths to reach the best results. Although they’re more computationally intensive than LLMs and require more specialized training, they produce better results, are less prone to hallucinations, and are easier to audit because they show their work.
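A rough Python sketch of that test-time exploration might look like the following. Everything here is a stand-in: propose_path and score are hypothetical placeholders for the model’s sampling process and for a learned verifier, not any vendor’s actual implementation.

```python
import random

# Sketch of multi-path reasoning: sample several candidate solution
# paths, score each with a verifier and keep the best. Both functions
# below are made-up stand-ins, not a real model's internals.

def propose_path(problem: str, rng: random.Random) -> dict:
    """Stand-in for the model sampling one chain of reasoning steps."""
    steps = rng.sample(["restate", "decompose", "compute", "check", "search"], k=3)
    return {"steps": steps, "answer": f"candidate answer for: {problem}"}

def score(path: dict) -> int:
    """Stand-in verifier: favor paths that decompose the problem and self-check."""
    return ("decompose" in path["steps"]) + ("check" in path["steps"])

def solve(problem: str, n_candidates: int = 8, seed: int = 0) -> dict:
    rng = random.Random(seed)
    candidates = [propose_path(problem, rng) for _ in range(n_candidates)]
    return max(candidates, key=score)  # keep the highest-scoring path

best = solve("What is 17 * 24?")
print(best["steps"], "->", best["answer"])
```

The same “generate many candidates, keep the best” pattern, at vastly larger scale, is behind the improved math scores described below.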
ISG’s Menninger: “Math is a perfect application of reasoning.” Photo: ISG
“They aren’t reasoning under my definition, but they’re capable of producing output that’s similar to what a reasoning human being would produce under the same circumstances,” said Peter Wang, founding chief technology officer at Anaconda Inc., an open-source development platform for the Python language.
Many experts believe reasoning models are the future of generative AI because they’re better at handling complexity and less inclined to make mistakes. They excel particularly at solving mathematical problems.
In introducing a preview of o1 last September, OpenAI reported the model ranked in the 89th percentile on a test of competitive programming questions and placed among the top 500 students in the U.S. in a qualifier for the USA Math Olympiad. It also exceeded human Ph.D.-level accuracy on a benchmark of science problems. On the American Invitational Mathematics Examination, o1 scored 74%, compared with 12% for GPT-4o, and as high as 93% when allowed to generate a large number of candidate solutions for evaluation.
“Math is a perfect application of reasoning,” said David Menninger, executive director of technology research at Information Services Group, Inc., a technology research and advisory group. “You explore something, go down a path, take what you learn and go down another path.”
“If you asked a traditional LLM what’s the best way to get a job, it will do a quick pass and give you the most common answers,” said Kjell Carlsson, an analytics and AI analyst at Gartner Inc. “A reasoning model will look at different ways of finding a job, think about it and perhaps do an internet search. There’s more stuff happening in the background.”
Reasoning models also require more work in the training stages than traditional LLMs. Initial LLM training is self-supervised: the model processes large amounts of unlabeled text and learns to predict each word from its context, with the text itself supplying the labels. Supervised learning may be applied later, in the fine-tuning stage.
In contrast, reasoning models typically involve supervision from the start, using one of two methods: chain-of-thought training or reinforcement learning.
Chain-of-thought training doesn’t just pair questions with answers; it also captures the thought process that connects them. The model learns to generate a series of intermediate reasoning steps before arriving at a final answer, helping it break complex tasks into more manageable parts. Human supervision is usually required, although a “teacher” model may be used to generate the intermediate steps and filter them for quality.
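Schematically, a chain-of-thought training record might look like the toy example below. The field names and the example are invented for illustration, and real fine-tuning datasets vary by lab, but they share the key property: the supervised target contains the intermediate steps, not just the final answer.

```python
# Illustrative chain-of-thought training record. The field names and
# the example are made up; real fine-tuning datasets vary by lab.
example = {
    "question": "A shirt costs $20 and is discounted 15%. What is the sale price?",
    "reasoning": [
        "The discount is 15% of $20.",
        "15% of 20 is 0.15 * 20 = 3.",
        "Subtract the discount: 20 - 3 = 17.",
    ],
    "answer": "$17",
}

# The supervised target the model learns to reproduce: steps first, answer last.
target = "\n".join(example["reasoning"]) + f"\nAnswer: {example['answer']}"
print(target)
```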
“It has a dialogue with itself, following different paths and asking where it should probe more,” said Anaconda’s Wang. “It’s like shining a flashlight through a dark cave with all of these interesting tunnels and finding little mini-caves it can explore.”
Anaconda’s Wang: Reasoning models learn by “shining a flashlight through a dark cave… and finding little mini-caves it can explore.” Photo: LinkedIn
Which answer is best may be determined by examples provided during training, by a ranking algorithm or by a decision tree whose weightings indicate better or worse answers.
Reinforcement learning is a well-established machine learning practice in which the model learns by interacting with an environment, making decisions, and receiving rewards or penalties based on its actions. It’s essentially training by experience with reinforcement provided either by human feedback or algorithms that define success or failure.
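At its core, the reinforcement-learning loop is simple. The Python sketch below uses a classic multi-armed bandit, far simpler than anything used to train reasoning models, but it shows the cycle the paragraph above describes: act, receive a reward, update.

```python
import random

# Epsilon-greedy bandit: the agent tries actions, receives rewards and
# updates its value estimates from experience. The reward probabilities
# are arbitrary stand-ins the agent must discover by trial.
rng = random.Random(42)
true_reward_prob = [0.2, 0.5, 0.8]  # hidden from the agent
estimates = [0.0, 0.0, 0.0]         # the agent's learned value per action
counts = [0, 0, 0]
epsilon = 0.1                       # fraction of the time spent exploring

for _ in range(10_000):
    if rng.random() < epsilon:
        action = rng.randrange(3)                 # explore a random action
    else:
        action = estimates.index(max(estimates))  # exploit the best-known action
    reward = 1.0 if rng.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    # Incremental running mean of observed rewards for this action.
    estimates[action] += (reward - estimates[action]) / counts[action]

print([round(e, 2) for e in estimates])  # converges toward [0.2, 0.5, 0.8]
```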
Reinforcement learning mimics the human learning process, but because computers are so much faster than humans, it can yield impressive results. “When you talk about machines outperforming humans, you’re often talking about reinforcement learning,” said Gartner’s Carlsson.
“Most big machine learning models have been trained on a set of data that gives largely the same responses to the same inputs,” said Andy Piper, vice president of engineering at Diffblue Ltd., a maker of software testing tools. “Reinforcement learning learns as it goes along. It can provide results in large spaces where it hasn’t seen the variables before.”
The bottom line is that reasoning systems can be expected to provide better results for many problems, particularly complex ones. “LLMs predict the next thing while reasoning is more analogous to thinking,” said ISG’s Menninger. “It breaks down the problem and tries different alternatives.”
That ability will take on greater significance as agents proliferate. They need to act autonomously and make decisions governed by goals rather than rules, demanding a level of reasoning that LLMs don’t have. “If you’re going to use agents, you’re going to need reasoning,” Menninger said.
Real-world examples of reasoning models are rare, but some agentic systems now in production have reasoning attributes. United Parcel Service Inc. uses an agentic system called On-Road Integrated Optimization and Navigation, or Orion, to make driver routes as efficient as possible. Orion functions autonomously, continuously analyzing real-time data such as traffic conditions, weather patterns and package volumes to adjust routes dynamically. UPS says the software has reduced total miles driven by 100 million, yielded $300 million in cost savings and cut carbon emissions by 100,000 metric tons.
Levi Strauss & Co. uses agentic reasoning for demand forecasting, combining historical sales, social media sentiment, weather patterns and economic signals to predict demand across regions and product lines. Automated inventory optimization systems complement this by adjusting stock levels in real time based on sales data. Models can trigger restocks or redistribute goods dynamically to improve turnover and minimize shortages and overstocks. AI-driven production planning recommends output levels based on forecasted demand and optimizes material use through more efficient fabric cutting. AI-assisted pricing models analyze market conditions and competitor activity to suggest the best timing and depth for promotions.
Does that make reasoning systems the next logical step in the evolution of generative AI? Some experts think so.
“Will everything become a reasoning system? Absolutely,” said Jeremy Gaerke, chief technology officer at Pantomath Inc., maker of a data observability platform.
Others see the technologies evolving in parallel. For all their intelligence, reasoning models are slower and incur more computational overhead. LLMs excel at summarizing large volumes of information and cranking out routine documents. For those uses, reasoning models are overkill.
“It’s a different use case,” said Gartner’s Carlsson. “LLMs are useful for creating conversational text or automating processes. In many of those cases, reasoning models are too slow and costly.”
You can think of it this way: LLMs are like a drive-through — handy when you just need fries and a factoid. Reasoning models are a chic bistro where the chef walks you through every course. Regardless of your choice, you aren’t going to go hungry.