xAI Faces Scrutiny Over Grok 3 Benchmark Results Amidst OpenAI Criticism

xAI has come under fire following the release of a graph detailing the performance of its latest AI model, Grok 3, on the AIME 2025 test. The graph, which showed scores for Grok 3 Reasoning Beta and Grok 3 mini Reasoning, drew accusations from OpenAI employees that the results were misleading. The debate unfolded on X, the social media platform, where Igor Babuschkin, co-founder of xAI, defended the company's findings.

The contention centers on AIME 2025, a challenging set of math questions from a recent invitational mathematics exam. Such tests are often used to evaluate AI models' mathematical capabilities. The dispute turns on how the scores were measured. Consensus@64, or "cons@64," gives a model 64 attempts at each problem and takes the most frequent answer as its final response, which tends to raise benchmark scores. At "@1," meaning the first score a model gets on the benchmark, Grok 3's results reportedly fell below those of OpenAI's o3-mini-high model. Leaving o3-mini-high's cons@64 score off xAI's graph reportedly made it appear as though Grok 3 outperformed its competitors.
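To make the scoring difference concrete, here is a minimal Python sketch of consensus voting next to single-attempt scoring (often written "@1"). It is an illustration only, not the evaluation code used by xAI or OpenAI: the function names, the answer strings, and the four-attempt sample data are all made up for the example.

```python
from collections import Counter


def consensus_answer(attempts):
    """Return the most frequent answer among a model's sampled attempts."""
    if not attempts:
        raise ValueError("need at least one attempt")
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(attempts).most_common(1)[0][0]


def score(problems, pick):
    """Fraction of problems where pick(attempts) equals the reference answer."""
    correct = sum(1 for attempts, reference in problems if pick(attempts) == reference)
    return correct / len(problems)


# Hypothetical data: (sampled answers, reference answer) per problem.
# A real cons@64 run would hold 64 sampled answers per problem, not 4.
problems = [
    (["204", "204", "113", "204"], "204"),
    (["17", "70", "70", "70"], "70"),
    (["5", "8", "5", "9"], "8"),
]

first_attempt = score(problems, lambda attempts: attempts[0])  # "@1" style
consensus = score(problems, consensus_answer)                  # "cons@N" style

print(f"first attempt: {first_attempt:.2f}, consensus: {consensus:.2f}")
```

On this toy data the consensus score is higher than the first-attempt score because voting recovers the second problem, where the first sample happened to be wrong. Note that consensus needs every sample to be generated, so the higher score comes at a proportionally higher compute cost.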

OpenAI employees pointed out that xAI's graph did not include o3-mini-high's AIME 2025 score at "cons@64," casting doubt on the fairness of the comparison. Even so, xAI maintains that Grok 3 Reasoning Beta and Grok 3 mini Reasoning outperformed OpenAI's leading model on AIME 2025. Nathan Lambert, an AI researcher, emphasized the importance of considering the computational and monetary cost a model incurs to reach its best score.

In defense of xAI's approach, Igor Babuschkin insists that the company's results accurately reflect Grok 3's capabilities. However, some experts have questioned the validity of using AIME as a benchmark for AI performance, given its narrow focus on math skills. Meanwhile, xAI continues to advertise Grok 3 as the "world's smartest AI," further fueling the debate over its actual proficiency.

"Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda (I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-high-pass@”””1””” deserves more scrutiny.)"

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex)
