The fall of the ARC benchmark shows how relentless advances in AI optimization are redefining tests once considered pinnacles of difficulty, marking a significant turning point in artificial intelligence progress. The result is contested, however: many analysts now argue that benchmarks like ARC, originally designed to distinguish genuine learning from mere memorization, are increasingly becoming just another target for optimization rather than true measures of intelligence.
For years, the Abstraction and Reasoning Corpus, later renamed ARC-AGI, stood out as a rigorous challenge, designed to evaluate an AI's capacity for fluid reasoning and abstract thinking rather than rote data recall. It was meant to serve as a litmus test for systems that could demonstrate a more human-like understanding of novel problems. Recent developments, however, suggest this original purpose is being overshadowed by newer, more powerful approaches capable of overwhelming even the most carefully crafted benchmarks.
The software company Poetiq has recently announced that, based on their latest systems, the original ARC-AGI-1 benchmark has essentially been 'solved.' In their detailed report (https://poetiq.ai/posts/arcagi_announcement/), they claim their models—built on cutting-edge architectures like OpenAI's GPT-5.1 and Google's Gemini 3—have achieved perfect or near-perfect performance on the initial dataset. Most strikingly, their systems have surpassed average human scores of approximately 60% on the more challenging ARC-AGI-2 set, which was designed to push the limits of abstraction and reasoning.
Poetiq's approach combines large language models with custom integration techniques in an iterative cycle of proposal, evaluation, self-audit, and refinement, much as a skilled human might tinker with and improve a solution before settling on a final answer.
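The cycle above can be sketched in a few lines. This is only an illustrative skeleton, not Poetiq's actual code: `call_model` stands in for any LLM API, and the prompt wording and stopping condition are invented for the example.

```python
# Hypothetical sketch of a propose / evaluate / self-audit / refine loop.
# `call_model` is a placeholder for any LLM client, not a real API.
from typing import Callable, List, Tuple

Grid = List[List[int]]

def solve_task(
    call_model: Callable[[str], str],
    demos: List[Tuple[Grid, Grid]],
    test_input: Grid,
    max_rounds: int = 5,
) -> str:
    """Repeatedly propose an answer, have the model audit it against the
    demonstration pairs, and feed failures back as critique."""
    feedback = ""
    answer = ""
    for _ in range(max_rounds):
        prompt = f"Demos: {demos}\nTest input: {test_input}\n{feedback}"
        answer = call_model(prompt)                         # propose
        critique = call_model(                              # self-audit
            f"Check this answer against the demos and list any errors:\n{answer}"
        )
        if "no errors" in critique.lower():                 # evaluate
            break
        feedback = f"Previous attempt failed: {critique}"   # refine
    return answer
```

The loop terminates either when the model's own audit reports no errors or when the round budget is exhausted, which is one simple way to trade accuracy against inference cost.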
The former 'North Star' of AI benchmarks, introduced by François Chollet in 2019, was designed to emphasize 'skill acquisition efficiency': measuring how well an AI could learn new tasks rather than simply memorize vast amounts of data. For a long time, progress was slow; language models excelled at many benchmarks but struggled with these colorful grid puzzles, underscoring how hard true abstraction is.
However, this dynamic changed dramatically with the advent of specialized reasoning techniques and the increasing use of methods like Test-Time Training (TTT). A pivotal moment occurred in December 2024, when OpenAI's o3-preview system unexpectedly achieved over 75% accuracy on ARC-AGI-1. This marked a shift from viewing ARC as a pure measure of AI thinking to treating it as an optimization problem that could be solved through reinforcement learning and search algorithms, with labs tuning their models specifically to master ARC’s unique logical structure.
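To see what treating ARC as an optimization problem means in miniature, consider a toy solver that searches a tiny library of grid transformations for one that reproduces every demonstration pair, then applies it to the test input. The three candidate transformations here are invented for the example; real systems search vastly larger program spaces.

```python
# Toy illustration of ARC-as-search: fit a transformation to the
# demonstration pairs, then apply it to the test input.
from typing import Callable, List, Tuple

Grid = List[List[int]]

def flip_h(g: Grid) -> Grid:
    """Mirror each row left-to-right."""
    return [row[::-1] for row in g]

def flip_v(g: Grid) -> Grid:
    """Mirror the grid top-to-bottom."""
    return g[::-1]

def rotate_180(g: Grid) -> Grid:
    """Rotate the grid by 180 degrees."""
    return [row[::-1] for row in g[::-1]]

CANDIDATES: List[Callable[[Grid], Grid]] = [flip_h, flip_v, rotate_180]

def fit_and_apply(demos: List[Tuple[Grid, Grid]], test_input: Grid) -> Grid:
    """Return the output of the first candidate that reproduces every
    demonstration pair; raise if none fits."""
    for fn in CANDIDATES:
        if all(fn(inp) == out for inp, out in demos):
            return fn(test_input)
    raise ValueError("no candidate transformation fits the demos")

# Example: the single demo pair shows a horizontal flip.
demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
print(fit_and_apply(demos, [[5, 6]]))  # → [[6, 5]]
```

Techniques like test-time training go a step further: instead of picking from a fixed library, they fine-tune or search at inference time on the task's own demonstration pairs.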
Notably, efficiency is improving as well. Poetiq's system, driven by open-weight models like GPT-OSS-120B (https://the-decoder.com/openai-releases-its-first-open-weight-language-models-since-gpt-2-with-gpt-oss/), reaches over 40% accuracy on ARC-AGI-1 at less than a cent per task. This may mark the end of an era in which solving such benchmarks demanded massive compute, especially with non-LLM approaches such as the Tiny Recursive Model also showing promising results.
Yet there are important caveats. These high scores, while impressive, are mostly limited to public datasets. Poetiq's own analysis shows a significant performance dip when models are tested on semi-private datasets, which are withheld from public benchmarks precisely to guard against data contamination: models inadvertently memorizing publicly available tasks included in their training data, so that they appear to reason when they are actually recalling.
This raises a fundamental question: can we truly claim to evaluate general intelligence if the evaluation data may already be part of a model's training corpus? Poetiq anticipates similar issues for its own systems on datasets like ARC-AGI-1, since scores tend to drop on data the models have not seen during training. Interestingly, the more tightly controlled ARC-AGI-2 set appears somewhat more resistant to this problem; Poetiq states its models were never directly trained on those tasks.
Meanwhile, AI thought leaders like Chollet are closely observing these trends, seeing them as part of a broader strategic shift. He interprets recent successes—like the impressive reasoning abilities of models such as o3—as evidence that the traditional paradigm of scaling models through sheer data and size may be reaching its limits. Instead, the new frontier involves 'test-time adaptation,' where models dynamically adjust at runtime, employing techniques similar to program synthesis or chain-of-thought reasoning, to tailor their responses to specific problems.
Chollet believes this reflects a fundamental truth about intelligence: it is a process of continual adaptation rather than a static repository of knowledge. While solving ARC has historically been viewed as a crucial step toward artificial general intelligence (AGI), current models still lack a genuine understanding of the world, performing well only in constrained environments. However, the success in solving these benchmarks—initially created to push research in reasoning—demonstrates their utility in fostering rapid innovation.
It's clear that once these benchmarks are effectively 'solved,' their role shifts from measuring progress to serving as catalysts for new research directions. With ARC-AGI-1 and even the tougher ARC-AGI-2 falling to sophisticated optimization strategies, the industry is moving toward more adaptable, context-aware systems, an evolution that might bring us closer to true 'fluid intelligence' or might simply sharpen the ongoing debate about what constitutes genuine understanding in AI.
Looking ahead, Chollet mentions future benchmarks like ARC-AGI-3, which aims to incorporate interactive environments to evaluate models' 'agency'—their capacity to act and make decisions within complex scenarios (https://the-decoder.com/richard-sutton-says-the-ai-industry-has-lost-its-way-by-ignoring-core-principles-of-intelligence/). Poetiq has already shared its code and results publicly on GitHub (https://github.com/poetiq-ai/poetiq-arc-agi-solver?tab=readme-ov-file), signaling a new era of open competition and collaborative progress in developing reasoning systems that push the boundaries of what AI can achieve.