OpenAI’s New AI Models Face Increased Hallucination Challenges

OpenAI’s latest AI models, o3 and o4-mini, exhibit a troubling trend of heightened hallucinations, raising concerns about their reliability and usefulness. Hallucination, the tendency of AI models to generate and present misinformation or baseless claims as fact, remains one of the most difficult problems in all of AI.

Recent testing on OpenAI’s in-house benchmark, PersonQA, yielded concerning statistics about these models’ hallucination rates. The o3 model hallucinated in response to 33% of the questions posed, a figure disconcertingly high compared with its predecessors. For reference, the previous reasoning models, o1 and o3-mini, had hallucination rates of 16% and 14.8%, respectively. The o4-mini model fared even worse, with a hallucination rate of 48%.
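For readers unfamiliar with how such figures are derived, a hallucination rate on a question-answering benchmark is simply the share of answers judged to contain fabricated claims. The short Python sketch below illustrates that arithmetic only; the sample data and the "hallucinated" flag are hypothetical stand-ins, not OpenAI's actual PersonQA grading pipeline.

def hallucination_rate(results: list[dict]) -> float:
    """Fraction of graded answers flagged as containing fabricated claims."""
    if not results:
        return 0.0
    flagged = sum(1 for r in results if r["hallucinated"])
    return flagged / len(results)

# Hypothetical example: 33 flagged answers out of 100 questions -> 33%.
sample = [{"hallucinated": i < 33} for i in range(100)]
print(f"{hallucination_rate(sample):.0%}")  # prints 33%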

Together, these findings indicate that even as AI models expand their ability to reason, their rate of hallucination is increasing. Despite the cutting-edge technology involved, OpenAI has not yet determined why this trend continues to worsen. The organization acknowledges that more research is needed to understand why these inaccuracies are growing in frequency and visibility.

Kian Katanforoosh, CEO of the upskilling startup Workera, noted that o3 is prone to producing broken website links in the content it generates, leaving users with broken or nonexistent citations when they try to look up more information. Independent testing by the research lab Transluce surfaced another troubling discovery: o3 has a propensity to invent steps it claims to have taken in producing its responses, which further erodes trust in its output.

OpenAI acknowledges the dual nature of its new models, stating that they “make more accurate claims as well as more inaccurate/hallucinated claims.” This caveat highlights the challenge of building AI systems that are both more capable and consistently accurate.

The implications of these hallucinations extend beyond spreading misinformation and serving up inaccurate citations. Sarah Schwettmann, Transluce’s co-founder, cautioned that mistakes like these could make AI models less useful across the board. As users increasingly rely on AI for information and assistance, hallucinations pose a risk to these systems’ effectiveness and credibility.

Even as o3 and o4-mini stumble on hallucinations, one model stands out. GPT-4o, equipped with web search capabilities, achieves 90% accuracy on SimpleQA. This striking gap raises the question: what is driving the differences in performance between these models?
