xAI’s Grok 3 Sparks Debate Over AI Benchmarks  

A public clash between OpenAI and Elon Musk's xAI has sparked controversy over AI benchmarking transparency, particularly regarding the Grok 3 model.

The controversy began when xAI published a graph of Grok 3's performance on the American Invitational Mathematics Examination (AIME 2025), a test featuring challenging math problems.

OpenAI employees quickly pointed out that xAI had omitted critical information, particularly the “consensus@64” (cons@64) score for OpenAI’s o3-mini-high model, potentially misleading the public.

What’s Cons@64? 

Consensus@64 (cons@64) is a benchmarking method that gives a model 64 attempts per question and takes the most frequent answer as the final response. Because the model gets many tries, the approach tends to inflate accuracy, so scores appear higher than single-attempt performance.
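
To make the two metrics concrete, here is a minimal Python sketch of how a cons@64 score can be computed alongside a standard @1 score. The `sample_answer` function and the toy model are hypothetical stand-ins for a single sampled model response; this is a generic illustration of majority-vote scoring, not the evaluation code used by xAI or OpenAI.

```python
import random
from collections import Counter

def consensus_answer(attempts):
    """cons@k: take the most frequent answer across k sampled attempts (majority vote)."""
    return Counter(attempts).most_common(1)[0][0]

def score(problems, sample_answer, k=64):
    """Grade a model under both metrics on (question, correct_answer) pairs.

    @1      -> only the first sampled attempt is graded.
    cons@k  -> the majority answer across k attempts is graded.
    `sample_answer(question)` is a hypothetical stand-in for one sampled model response.
    """
    at1 = consk = 0
    for question, correct in problems:
        attempts = [sample_answer(question) for _ in range(k)]
        at1 += attempts[0] == correct
        consk += consensus_answer(attempts) == correct
    n = len(problems)
    return {"@1": at1 / n, f"cons@{k}": consk / n}

# Toy illustration: a "model" that answers correctly 40% of the time and otherwise
# guesses one of a few wrong answers. Majority voting across 64 samples usually
# recovers the correct answer, so cons@64 scores far higher than @1 here.
if __name__ == "__main__":
    random.seed(0)
    problems = [(f"q{i}", "42") for i in range(100)]
    def toy_model(question):
        return "42" if random.random() < 0.4 else random.choice(["7", "13", "99"])
    print(score(problems, toy_model))
```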

By leaving out the cons@64 results, xAI’s graph made it appear that Grok 3 had beaten OpenAI’s model, when in fact o3-mini-high scored higher once cons@64 was taken into account.

Grok 3’s variants, including Grok 3 Reasoning Beta and Grok 3 mini-Reasoning, scored lower than o3-mini-high on the standard “@1” metric, which counts only the first attempt each model makes at solving the problems.

Even Grok 3 Reasoning Beta trailed OpenAI’s o1 model running at medium computing. Despite these results, Elon Musk’s AI company, xAI, continued to market Grok 3 as the “world’s smartest AI.”

AI Model Benchmarks Debate 

AIME 2025, the test used to evaluate Grok 3, is an advanced set of math problems often used to gauge an AI model’s mathematical ability. Some experts question its validity as a measure of overall model performance, yet AIME remains common in AI performance testing.

xAI’s exclusion of the cons@64 score prompted OpenAI employees to criticize the graph as misleading. Some experts emphasized that performance metrics alone don’t tell the full story of an AI model’s capabilities, drawing attention to a larger issue: the need for clear and transparent AI benchmarks.

One major concern, raised by AI researcher Nathan Lambert, is the lack of transparency surrounding the computational and financial cost of achieving the best scores.

“The computational and monetary cost it took for each model to achieve its highest score is often hidden,” Lambert said, adding that AI benchmarks never really reveal the full picture of a model’s performance.

Currently, the information provided by benchmarks like AIME 2025 is generally incomplete, and understanding the costs and resources involved is essential to properly estimate a model’s value.

Final Thoughts 

The debate shows a growing need for transparent, standardized AI benchmarking practices. AIME 2025 and other tests measure a model’s ability on a specific collection of tasks, but they do not convey a general sense of what it can or cannot accomplish. For fair comparisons, benchmarks must expose not just performance data but also the computing resources behind those results.

The xAI-OpenAI conflict over Grok 3 highlights the complexity of AI benchmarking. Ensuring transparency in performance data and associated costs will be key to fair AI evaluations. The disagreement over Grok 3’s claims reminds everyone that fair comparisons require careful consideration and honest reporting.
