Recent discussions surrounding OpenAI’s latest AI model reveal inconsistencies between internal claims and independent evaluations. The o3 model, initially touted as achieving over 25% accuracy on the FrontierMath benchmark, now shows a significant performance gap in third-party testing.
Independent analysis by Epoch AI found the publicly released version of o3 scored approximately 10% on FrontierMath—far below OpenAI’s earlier assertions. While the company had highlighted a 25% success rate during the model’s December 2024 unveiling, researchers note this figure likely represented an optimized internal prototype rather than the production model.
Technical staff at OpenAI clarified that the production model prioritizes real-world efficiency over benchmark performance. Wenda Zhou explained during a recent livestream: “We’ve optimized o3 for faster response times and cost-effectiveness, which may result in benchmark disparities compared to research-focused versions.”
The situation highlights broader challenges in AI benchmarking practices:
- Varying evaluation methodologies between organizations
- Differences between research prototypes and production systems
- Potential conflicts of interest in company-reported metrics
This incident follows similar controversies across the industry:
- Epoch AI faced criticism for delayed disclosure of OpenAI funding ties
- xAI’s Grok 3 faced accusations of misleading benchmark claims
- Meta recently acknowledged discrepancies between promoted and released model versions
As OpenAI prepares to launch its o3-pro model, the industry faces growing calls for standardized evaluation protocols and increased transparency in AI performance reporting.