OpenAI says GPT-5 hallucinates less — what does the data say?

OpenAI has officially launched GPT-5, promising a faster and more capable AI model to power ChatGPT.

The AI company boasts state-of-the-art performance across math, coding, writing, and health advice. OpenAI proudly shared that GPT-5's hallucination rates have decreased compared to earlier models.

Specifically, GPT-5 makes incorrect claims in 9.6 percent of responses, compared to 12.9 percent for GPT-4o. According to the GPT-5 system card, that works out to a hallucination rate roughly 26 percent lower than GPT-4o's. In addition, GPT-5 produced 44 percent fewer responses with "at least one major factual error."
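For anyone who wants to sanity-check that figure, here is a minimal back-of-the-envelope calculation using only the two rates reported above; it simply computes the relative reduction and is not taken from OpenAI's materials:

```python
# Hallucination rates reported in the GPT-5 system card (percent)
gpt5_rate = 9.6
gpt4o_rate = 12.9

# Relative reduction of GPT-5's rate versus GPT-4o's
relative_reduction = (gpt4o_rate - gpt5_rate) / gpt4o_rate * 100
print(f"{relative_reduction:.1f}% lower")  # ~25.6%, which rounds to the reported 26%
```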

That's real progress, but it also means roughly one in 10 responses from GPT-5 could contain hallucinations. That's concerning, especially since OpenAI touted healthcare as a promising use case for the new model.


How GPT-5 reduces hallucinations

Hallucinations are a pesky problem for AI researchers. Large language models (LLMs) are trained to generate the next probable word, guided by the massive amounts of data they're trained on. This means LLMs can sometimes confidently generate a sentence that is inaccurate or pure gibberish. One might assume that as models improve through better data, training, and computing power, the hallucination rate decreases. But OpenAI's launch of its reasoning models o3 and o4-mini showed a troubling trend that even its own researchers couldn't entirely explain: they hallucinated more than previous models o1, GPT-4o, and GPT-4.5. Some researchers argue that hallucinations are an inherent feature of LLMs rather than a bug that can be resolved.


That said, GPT-5 hallucinates less than previous models, according to its system card. OpenAI evaluated GPT-5 and a version of GPT-5 with additional reasoning power, called GPT-5-thinking, against its reasoning model o3 and its more traditional model GPT-4o. A significant part of evaluating hallucination rates is giving models access to the web. Generally speaking, models are more accurate when they can source their answers from accurate data online rather than relying solely on their training data (more on that below). Here are the hallucination rates when the models are given web-browsing access:

  • GPT-5: 9.6 percent

  • GPT-5-thinking: 4.5 percent

  • o3: 12.7 percent

  • GPT-4o: 12.9 percent

In the system card, OpenAI also evaluated various versions of GPT-5 with more open-ended and complex prompts. Here, GPT-5 with reasoning power hallucinated significantly less than the previous reasoning models o3 and o4-mini. Reasoning models are said to be more accurate and less prone to hallucination because they apply more computing power to solving a question, which is why o3 and o4-mini's hallucination rates were somewhat baffling.

Overall, GPT-5 does pretty well when it's connected to the web. But the results from another evaluation tell a different story. OpenAI tested GPT-5 on its in-house benchmark, SimpleQA. This test is a collection of "fact-seeking questions with short answers that measures model accuracy for attempted answers," per the system card's description. For this evaluation, GPT-5 didn't have web access, and it shows. In this test, the hallucination rates were way higher.

  • GPT-5 main: 47 percent

  • GPT-5-thinking: 40 percent

  • o3: 46 percent

  • GPT-4o: 52 percent

GPT-5 with thinking was marginally better than o3, while the standard GPT-5 hallucinated one percentage point more than o3 and five percentage points less than GPT-4o. To be fair, hallucination rates on the SimpleQA evaluation are high across all models, but that's not much consolation. Users without web search face a much higher risk of hallucinations and inaccuracies. So if you're using ChatGPT for something really important, make sure it's searching the web. Or you could just search the web yourself.
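To make the kind of evaluation described above more concrete, here is a minimal, hypothetical sketch of how a SimpleQA-style accuracy check could be scored. The question list, the `ask_model` function, and the grading rule are placeholders for illustration, not OpenAI's actual harness:

```python
# Hypothetical SimpleQA-style scoring sketch: short fact questions with known answers.
# `ask_model` stands in for whatever API call returns the model's short answer.

questions = [
    {"question": "What year was the Eiffel Tower completed?", "answer": "1889"},
    {"question": "Who wrote 'Pride and Prejudice'?", "answer": "Jane Austen"},
]

def ask_model(question: str) -> str:
    raise NotImplementedError("Replace with a real model call.")

attempted = 0
correct = 0
for item in questions:
    reply = ask_model(item["question"])
    if reply.strip().lower() in ("i don't know", "not sure"):
        continue  # abstentions don't count as attempted answers
    attempted += 1
    if item["answer"].lower() in reply.lower():
        correct += 1

# The "hallucination rate" in this sketch is the share of attempted answers that were wrong.
if attempted:
    print(f"Incorrect on {100 * (1 - correct / attempted):.1f}% of attempted answers")
```

The key design point this illustrates is that the metric only counts attempted answers, so a model that declines to answer more often can score better even if it knows no more facts.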

It didn't take long for users to find GPT-5 hallucinations

Despite the reported overall lower rates of inaccuracies, one of the launch demos contained an embarrassing blunder. Beth Barnes, founder and CEO of AI research nonprofit METR, spotted an inaccuracy in the demo of GPT-5 explaining how planes work. GPT-5 cited a common misconception related to the Bernoulli effect, which describes how air flows around airplane wings, Barnes said. Without getting into the technicalities of aerodynamics, GPT-5's interpretation is wrong.
