Recent developments in AI benchmarking have divided the conversation in the tech community. Much of the discussion centers on how various models stack up against each other in retro video games such as Pokémon. The ongoing faceoff between Anthropic’s Claude 3.7 Sonnet and Google’s Gemini has generated considerable buzz, and it highlights many of the complexities and challenges of evaluating new AI capabilities with gaming benchmarks.
Claude 3.7 Sonnet scored 62.3% on SWE-bench Verified, a benchmark designed to measure real-world coding ability, which is a solid result for a generative AI. With a custom scaffold of Anthropic's own design, the score rose to 70.3%. That is impressive in principle, but it also means Claude's potential is only fully realized with careful, purpose-built implementation rather than out-of-the-box use.
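For readers unfamiliar with the term, a "scaffold" generally refers to the harness around the model rather than the model itself. The sketch below is a purely hypothetical illustration of that idea, assuming a retry loop that feeds test failures back to the model; the function names (`run_model`, `run_tests`, `Task`) are invented for illustration and are not Anthropic's actual setup.

```python
# Hypothetical sketch of a benchmark "scaffold": instead of a single model call,
# the harness lets the model iterate against test feedback. All names here are
# illustrative placeholders, not Anthropic's implementation.
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str     # SWE-bench-style issue description
    repo_path: str  # repository checkout the patch should apply to


def run_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError


def run_tests(repo_path: str, patch: str) -> tuple[bool, str]:
    """Placeholder: apply the patch, run the task's tests, return (passed, log)."""
    raise NotImplementedError


def solve_with_scaffold(task: Task, max_attempts: int = 3) -> str | None:
    """Give the model several attempts, feeding test failures back into the prompt."""
    prompt = task.prompt
    for _ in range(max_attempts):
        patch = run_model(prompt)
        passed, log = run_tests(task.repo_path, patch)
        if passed:
            return patch
        prompt = f"{task.prompt}\n\nPrevious attempt failed:\n{log}\nTry again."
    return None  # counted as a failure after exhausting attempts
```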
Google’s Gemini model has performed notably well in simulated environments, particularly in gaming scenarios. One developer maintains an actively updated Gemini stream and built a custom minimap that helps the model recognize relevant features in the Pokémon game world, such as which trees are cuttable, which speeds up its gameplay considerably. On a recent Twitch stream, Gemini cruised past Lavender Town with no issues, showing off its stronger navigation, while Claude has been stuck at Mount Moon since late February.
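To make the minimap idea concrete, here is a minimal, hypothetical sketch of the kind of augmentation described above: the harness reads tile data from the emulator and renders a small text map labeling tiles the agent cares about, such as cuttable trees. The tile IDs and function name are assumptions for illustration, not the streamer's actual code.

```python
# Hypothetical minimap builder: converts raw tile IDs into a compact ASCII map
# the model can read in its context window. IDs below are invented examples.
CUTTABLE_TREE_ID = 0x3D      # assumed tile ID for a cuttable tree
WALKABLE_IDS = {0x00, 0x01}  # assumed tile IDs for walkable ground


def build_minimap(tiles: list[list[int]]) -> str:
    """Render a labeled ASCII minimap from a 2D grid of tile IDs."""
    legend = {"C": "cuttable tree", ".": "walkable", "#": "blocked"}
    rows = []
    for row in tiles:
        chars = []
        for tile in row:
            if tile == CUTTABLE_TREE_ID:
                chars.append("C")
            elif tile in WALKABLE_IDS:
                chars.append(".")
            else:
                chars.append("#")
        rows.append("".join(chars))
    header = "Legend: " + ", ".join(f"{k}={v}" for k, v in legend.items())
    return header + "\n" + "\n".join(rows)


# Example: a 3x3 patch of the map with one cuttable tree in the middle.
print(build_minimap([[0x00, 0x10, 0x00],
                     [0x00, 0x3D, 0x00],
                     [0x00, 0x00, 0x00]]))
```

The point of such an overlay is that the model no longer has to infer game-state details from raw screenshots alone, which is exactly why hand-built harnesses like this complicate head-to-head comparisons.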
The recent back-and-forth between these models highlights the shortcomings of AI benchmarks, especially in gaming contexts. Custom and non-standard implementations can obscure a model’s actual strengths and weaknesses and make comparisons misleading. A post on X (formerly Twitter) boasted that Gemini was outperforming Claude in the first three Pokémon games, which triggered an even larger debate about establishing AI benchmarking standards.
The post announcing Gemini’s performance was among the most talked about, drawing 119 live views, and many observers considered it underrated given the significance of the developments it covered. The implications extend far beyond gaming performance: they raise critical questions about how we ought to assess AI models and whether benchmarks really reflect what they can do.