Research paper suggests OpenAI’s GPT-4 may have become less accurate
A new research paper suggests that GPT-4, OpenAI LP’s most advanced artificial intelligence model, may have become less effective at performing certain tasks.
Ars Technica reported the paper’s findings today. The paper, originally published July 18, was authored by three researchers from Stanford University and the University of California at Berkeley. Following its release, a number of AI experts expressed doubts about whether GPT-4 has in fact become less accurate.
The researchers behind the paper evaluated GPT-4’s reasoning capabilities by asking it to perform a set of tasks twice: once in March and again three months later, in June. They then compared the results of the two experiments.
One subset of the tasks that the researchers gave GPT-4 required the AI to solve math problems. In March, it successfully solved 97.6% of the problems with which it was presented. By June, that percentage had plummeted to 2.4%.
The paper’s authors believe that the decline may have been caused by “drifts of chain-of-thoughts’ effects.”
When the researchers asked GPT-4 to solve math problems, they used a method known as chain-of-thought prompting to interact with the model. They didn’t simply ask the model to provide an answer, but also requested that it provide a step-by-step breakdown of its thought process. This method has been shown to improve the accuracy of language models.
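The article doesn’t reproduce the study’s exact prompts, but the difference between a plain question and a chain-of-thought prompt can be sketched with two hypothetical examples (the wording below is illustrative, not taken from the paper):

```python
# Illustrative only: the study's actual prompt wording is not reproduced
# in the article, so both strings below are hypothetical examples.

# A direct prompt asks only for the final answer.
direct_prompt = "Is 17077 a prime number? Answer yes or no."

# A chain-of-thought prompt asks the model to reason step by step first,
# an approach that has been shown to improve accuracy on reasoning tasks.
cot_prompt = (
    "Is 17077 a prime number? Think step by step: describe the checks "
    "you perform, then state the final answer as yes or no."
)

print(direct_prompt)
print(cot_prompt)
```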
The researchers believe the change they observed in GPT-4 accuracy may be related to the chain-of-thought prompts. In one test, they entered a chain-of-thought prompt that asked the model to determine whether the number 17,077 is a prime number. GPT-4 provided the correct answer in March along with a step-by-step breakdown of its thought process, but three months later, it output an incorrect answer and didn’t share a breakdown.
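For reference, 17,077 is indeed prime. That isn’t something a reader has to take on faith: a short trial-division check, unrelated to the study’s own methodology, confirms it.

```python
import math

def is_prime(n: int) -> bool:
    """Return True if n is prime, using simple trial division."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    # Only odd divisors up to sqrt(n) need to be checked.
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # True: 17,077 has no divisor up to its square root
```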
The researchers also tested GPT-4’s accuracy with other types of tasks. One subset of the tests they used required the model to write software code. The percentage of questions GPT-4 answered with “directly executable” code, or code that can be run without any modifications, dropped by more than 40% between March and June.
Some AI experts have expressed doubts about the paper’s findings. Arvind Narayanan, a computer science professor at Princeton University, pointed out that the fact that the code generated by GPT-4 couldn’t be run immediately doesn’t necessarily mean it was less correct. In some cases, the code couldn’t be run because GPT-4 also included explanatory prose in its responses.
Prominent software engineer Simon Willison echoed that view. “A decent portion of their criticism involves whether or not code output is wrapped in Markdown backticks or not,” Willison told Ars Technica. Markdown backticks are used to format software code.
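Neither the paper’s evaluation harness nor Willison’s examples are reproduced in the article, but a minimal Python sketch illustrates the objection: if a “directly executable” check simply tries to run the raw model response, a Markdown fence alone can make otherwise correct code fail, even though stripping the backticks leaves working code.

```python
# A minimal sketch (not the paper's actual evaluation harness) of why
# Markdown fences matter: the raw response fails to run as-is, but the
# same code runs once the backtick fence is stripped.
import re

model_response = """```python
def add(a, b):
    return a + b

print(add(2, 3))
```"""

def runs_directly(text: str) -> bool:
    """Return True if the text executes as Python without modification."""
    try:
        exec(compile(text, "<response>", "exec"), {})
        return True
    except Exception:
        return False

# Strip a leading and trailing Markdown code fence, if present.
stripped = re.sub(r"^```[a-zA-Z]*\n|```$", "", model_response.strip())

print(runs_directly(model_response))  # False: the fence is not valid Python
print(runs_directly(stripped))        # True (running the code also prints 5)
```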
Logan Kilpatrick, OpenAI’s head of developer relations, tweeted that “the team is aware of the reported regressions and looking into it.” Peter Welinder, the AI startup’s vice president of product and partnerships, earlier stated that “no, we haven’t made GPT-4 dumber. Quite the opposite.”
The authors of this week’s paper about the accuracy of GPT-4 also evaluated GPT-3.5, an earlier OpenAI model with more limited capabilities. They found that the latter model’s accuracy didn’t decrease but rather increased between March and June. The accuracy with which GPT-3.5 solved math problems jumped from 7.4% to 86.8% in three months.