OpenAI says GPT-5 hallucinates less — what does the data say?

OpenAI has officially launched GPT-5, promising a faster and more capable AI model to power ChatGPT.

The AI company boasts state-of-the-art performance across math, coding, writing, and health advice. OpenAI proudly shared that GPT-5's hallucination rates have decreased compared to earlier models.

Specifically, GPT-5 makes incorrect claims 9.6 percent of the time, compared to 12.9 percent for GPT-4o. According to the GPT-5 system card, the new model's hallucination rate is 26 percent lower than GPT-4o's. In addition, GPT-5 had 44 percent fewer responses with "at least one major factual error."
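For context, that 26 percent figure follows directly from the two headline rates. Here's a quick back-of-the-envelope check (a minimal sketch; the snippet and variable names are ours, not OpenAI's):

```python
# Relative reduction in hallucination rate from GPT-4o to GPT-5,
# using the two rates quoted above.
gpt4o_rate = 12.9  # percent of responses with incorrect claims
gpt5_rate = 9.6

reduction = (gpt4o_rate - gpt5_rate) / gpt4o_rate
print(f"Relative reduction: {reduction:.1%}")  # -> 25.6%, i.e. roughly 26 percent
```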

While that's real progress, it also means roughly one in 10 GPT-5 responses could still contain a hallucination. That's concerning, especially since OpenAI touted healthcare as a promising use case for the new model.


How GPT-5 reduces hallucinations

Hallucinations are a pesky problem for AI researchers. Large language models (LLMs) are trained on massive amounts of data to generate the next probable word, which means they can confidently produce sentences that are inaccurate or pure gibberish. One might assume that as models improve through better data, training, and computing power, the hallucination rate would decrease. But OpenAI's launch of its reasoning models o3 and o4-mini revealed a troubling trend that even its own researchers couldn't entirely explain: they hallucinated more than the earlier models o1, GPT-4o, and GPT-4.5. Some researchers argue that hallucinations are an inherent feature of LLMs, not a bug that can be resolved.
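To make "generate the next probable word" concrete, here is a toy sketch of next-token sampling. The vocabulary, probabilities, and prompt are invented for illustration; a real LLM works over tens of thousands of tokens with learned probabilities, but the mechanism is the same: sample whatever is likely, with no built-in fact check, which is why a fluent model can be fluently wrong.

```python
import random

# Toy next-token distribution for the prompt "The capital of Australia is"
# (probabilities are made up for illustration; a real model learns these).
next_token_probs = {
    "Canberra": 0.55,   # correct
    "Sydney": 0.35,     # plausible but wrong: a "hallucination" when sampled
    "Melbourne": 0.08,
    "Perth": 0.02,
}

# Sample the next token in proportion to its probability, as an LLM does.
token = random.choices(
    list(next_token_probs), weights=list(next_token_probs.values())
)[0]
print(token)  # ~35% of the time this confidently prints "Sydney"
```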

That said, GPT-5 hallucinates less than previous models, according to its system card. OpenAI evaluated GPT-5 and a version of the model with additional reasoning power, called GPT-5-thinking, against its reasoning model o3 and the more traditional GPT-4o. A significant part of evaluating hallucination rates is giving models access to the web. Generally speaking, models are more accurate when they can source their answers from accurate data online instead of relying solely on their training data (more on that below). Here are the hallucination rates when the models are given web-browsing access:

  • GPT-5: 9.6 percent

  • GPT-5-thinking: 4.5 percent

  • o3: 12.7 percent

  • GPT-4o: 12.9 percent

In the system card, OpenAI also evaluated various versions of GPT-5 on more open-ended and complex prompts. Here, GPT-5 with reasoning power hallucinated significantly less than the previous reasoning models o3 and o4-mini. Reasoning models are said to be more accurate and less prone to hallucination because they apply more computing power to solving a question, which is why o3 and o4-mini's hallucination rates were somewhat baffling.

Overall, GPT-5 does pretty well when it's connected to the web. But the results of another evaluation tell a different story. OpenAI tested GPT-5 on its in-house benchmark, SimpleQA, a collection of "fact-seeking questions with short answers that measures model accuracy for attempted answers," per the system card's description. For this evaluation, GPT-5 didn't have web access, and it shows: the hallucination rates were far higher.

  • GPT-5 main: 47 percent

  • GPT-5-thinking: 40 percent

  • o3: 46 percent

  • GPT-4o: 52 percent

GPT-5-thinking beat o3 by six percentage points, while the standard GPT-5 hallucinated one percentage point more than o3 and a few points less than GPT-4o. To be fair, SimpleQA hallucination rates are high across all models, but that's not much consolation. Without web search, users face a much higher risk of hallucinations and inaccuracies. So if you're using ChatGPT for something really important, make sure it's searching the web. Or you could just search the web yourself.
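The gap between browsing and no-browsing performance is easy to quantify from the numbers above. A small sketch, tabulating only the figures already quoted from the system card (the code itself is ours, not OpenAI's):

```python
# Hallucination rates (percent) with web browsing vs. on SimpleQA (no web),
# as quoted above from the GPT-5 system card.
rates = {
    "GPT-5": (9.6, 47),
    "GPT-5-thinking": (4.5, 40),
    "o3": (12.7, 46),
    "GPT-4o": (12.9, 52),
}

for model, (with_web, no_web) in rates.items():
    ratio = no_web / with_web
    print(f"{model:>15}: {with_web:>4}% with web, {no_web}% on SimpleQA "
          f"({ratio:.1f}x higher offline)")
# Every model hallucinates roughly 4-9x more often without web access.
```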

It didn't take long for users to find GPT-5 hallucinations

Despite the reported improvement in accuracy, one of the launch demos contained an embarrassing blunder. Beth Barnes, founder and CEO of the AI research nonprofit METR, spotted an inaccuracy in GPT-5's demo explanation of how planes work. GPT-5 repeated a common misconception related to the Bernoulli Effect, which describes how air flows over airplane wings, Barnes said. Without getting into the technicalities of aerodynamics, GPT-5's explanation was wrong.
