OpenAI’s o3 AI Model Falls Short of Initial Benchmark Claims

OpenAI recently released its o3 AI model, generating considerable excitement and speculation, particularly around its performance on the FrontierMath benchmark. OpenAI originally claimed that o3 could correctly answer more than a quarter of the questions on the benchmark, but independent testing by Epoch AI found that the public version of o3 scored only around 10%, well below the figure OpenAI cited.

The gap between OpenAI's claims and o3's measured performance calls into question the accuracy of the company's earlier statements. Epoch AI, the research institute that created FrontierMath, published its independent benchmark results shortly after the public release of o3. The findings suggest that the o3 model available to users is less capable than previously implied; notably, both o3-mini-high and o4-mini outperformed o3 on the same benchmark.

OpenAI's chief research officer, Mark Chen, responded to the discrepancy. He noted that the publicly released versions of o3 are weaker than the model OpenAI benchmarked internally, and suggested that the difference in scores might stem from differences in testing setups or in the amount of computing power used during evaluation.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time computing, or because those results were run on a different subset of FrontierMath,” – Epoch AI.

Epoch's testing used an updated version of FrontierMath, which may differ in subtle ways from the version OpenAI used for its own evaluations. That raises broader questions about the validity and reliability of the benchmarks used to measure AI performance, and about how representative any single reported score really is.
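
To make the point concrete, here is a minimal, purely illustrative sketch (hypothetical difficulty numbers, not real FrontierMath data) of how the same model can report very different accuracies depending on which problem subset is evaluated and how much test-time compute, modeled here simply as attempts per problem, is allowed:

```python
# Illustrative only: how subset choice and test-time compute shift a
# benchmark score. All numbers below are made up for demonstration.

import random

random.seed(0)

def solve(per_attempt_failure: float, attempts: int) -> bool:
    """Return True if any of `attempts` tries solves the problem.

    `per_attempt_failure` is the chance a single attempt fails; allowing
    more attempts (more test-time compute) raises the success rate.
    """
    return any(random.random() > per_attempt_failure for _ in range(attempts))

# Two hypothetical problem subsets with different difficulty mixes.
easier_subset = [0.7] * 50 + [0.95] * 50
harder_subset = [0.85] * 50 + [0.98] * 50

def accuracy(subset, attempts):
    return sum(solve(d, attempts) for d in subset) / len(subset)

print("easier subset, 1 attempt :", accuracy(easier_subset, 1))
print("easier subset, 8 attempts:", accuracy(easier_subset, 8))
print("harder subset, 1 attempt :", accuracy(harder_subset, 1))
```

Under these toy assumptions, the "same" model lands at noticeably different scores, which is the kind of gap Epoch says could separate its results from OpenAI's internal ones.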

When OpenAI first announced o3 in December, it published benchmark results whose lower-bound scores were consistent with what Epoch later observed. Chen said that in internal tests, with the most aggressive test-time compute settings, o3 achieved upwards of 25% accuracy on FrontierMath. That performance, however, came from a not-yet-released version of the model; the ARC Prize Foundation, which tested a pre-release build, has said that version differs significantly from the one now publicly available.

“Today, all offerings out there have less than 2% [on FrontierMath],” – Mark Chen, chief research officer at OpenAI.

Despite these setbacks, OpenAI remains committed to improving its offerings. The company plans to introduce a more powerful variant of the o3 model, dubbed o3-pro, in the coming weeks. This new version aims to address performance deficiencies and restore confidence in OpenAI’s capabilities.
