ARC-AGI 3 - Gaming with Reason

Jye Sawtell-Rickson · June 11, 2025

ARC-AGI 2 was just released in March, but we’re already getting an announcement of the upcoming ARC-AGI 3, a benchmark said to be the next step on the way to AGI. The expected release date is early 2026, with beta testing about to begin.

This post will start with what ARC-AGI 3 is and what makes it unique compared to previous benchmarks. Next we’ll talk about existing agents that solve these sorts of problems and sketch out a template for a potential ARC-AGI 3 solution, wrapping up with a discussion of where this work could go.

What is it?

In short, it’s a collection of unique game environments in the ARC-AGI style, with 100 held out for private testing. They’re created with the intent to “force understanding through exploration”. Similar to ARC-AGI 1 and 2, it is solvable with only core knowledge priors: basic math and geometry, agentness and objectness.

Differences from ARC-AGI 2

ARC-AGI 3 is easier than ARC-AGI 2 in a way, because you’re given ~infinite examples of how to solve the task instead of just 2 or 3 samples. For context, ARC-AGI 1 would likely be solvable by current techniques if you had, say, 1M samples per problem. In fact, RE-ARC and some other generators provide pseudo-datasets which give you just that: many samples of a specific kind of problem to train on. Models that train on these datasets can get very good performance on the tasks they’re based on. But of course what’s going to matter is efficiency.
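To make the generator idea concrete, here’s a minimal sketch in the spirit of RE-ARC, assuming a toy rule (mirror the grid left-to-right) as a stand-in for a real ARC task rule:

```python
import numpy as np

def generate_samples(rule, n_samples=1000, grid_size=(5, 5), n_colors=10, seed=0):
    """Generate many (input, output) pairs for one fixed task rule,
    in the spirit of RE-ARC: same underlying rule, fresh random inputs."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        grid = rng.integers(0, n_colors, size=grid_size)
        yield grid, rule(grid)

# Toy rule standing in for a real ARC transformation.
mirror = lambda g: g[:, ::-1]

for x, y in generate_samples(mirror, n_samples=2):
    print(x, "->", y, sep="\n")
```

A model trained on thousands of such pairs can master this one rule, which is exactly the efficiency concern: humans need only a couple of examples.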

What other features set ARC-AGI 3 apart from other gaming benchmarks?

The Uniqueness of ARC-AGI 3

Games as benchmarks have been around for a long time; I’ve even written about them before on this blog. To be more precise, pixel-based game benchmarks have been around for some time too (e.g. various Gymnasium games). To be even more precise, pixel-based multi-game benchmarks have been around for some time (e.g. Atari). So on the surface this isn’t a big change, but the ARC Prize Foundation would argue otherwise.

The key features of ARC-AGI-3:

  • A well-defined human baseline and evaluation mechanism: bringing together researchers with clear targets.
  • A private test set with no leakage: preventing the test-set contamination that is common with public benchmarks.
  • Significant diversity in games: requiring agents to adapt to new scenarios.
  • Efficiency-based metrics: the amount of compute (or $) you need to solve each task impacts the overall score.
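The exact scoring formula hasn’t been published, so purely as a hypothetical illustration, an efficiency-weighted metric might discount each solved task by the compute it consumed:

```python
def efficiency_weighted_score(results, budget):
    """Hypothetical metric (not the official one): each solved task earns
    less credit the more compute (or dollars) it consumed, relative to a
    per-task budget. `results` holds (solved, cost) pairs."""
    score = 0.0
    for solved, cost in results:
        if solved:
            score += max(0.0, 1.0 - cost / budget)  # full credit at zero cost
    return score / len(results)

# Two tasks solved (one cheaply, one expensively), one unsolved.
print(efficiency_weighted_score([(True, 2.0), (True, 9.0), (False, 10.0)], budget=10.0))
```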

With that said, the benchmark does seem to be the same as learning any game, just with a focus on efficiency, something akin to meta-learning. For example, could a DQN algorithm solve every game in ARC-AGI 3 given enough time? Maybe, maybe not. As we’ll see, modern game-playing algorithms are impressive, but there’s a chance these new games are sufficiently complex, and require sufficient reasoning, that the algorithms won’t converge to good solutions.

The focus on efficiency reminds me a lot of the classic exploration-exploitation trade-off problem: the multi-armed bandit. Hopefully we’ll see just as much deep thought go into finding efficient algorithms for exploiting the ARC-AGI 3 games.
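As a refresher, here’s a compact UCB1 bandit implementation, the textbook formalisation of that trade-off (arm payoffs below are made-up Bernoulli probabilities):

```python
import math
import random

def ucb1(payoffs, n_rounds=10_000):
    """UCB1: pull the arm with the highest mean-plus-confidence-bonus.
    The bonus shrinks as an arm is sampled more, shifting the algorithm
    from exploration to exploitation over time."""
    n_arms = len(payoffs)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total = 0.0
    for t in range(1, n_rounds + 1):
        if t <= n_arms:  # play each arm once to initialise
            arm = t - 1
        else:
            arm = max(range(n_arms), key=lambda a:
                      sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < payoffs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total

counts, total = ucb1([0.2, 0.5, 0.8])
print(counts, total)  # pulls should concentrate on the 0.8 arm
```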

Classic Multi-Game Benchmarks

Given that multi-game benchmarks have been around for a long time, it’s worth a quick recap. Here are some of the key ones:

  • Atari 2600 ALE (2012): a collection of hundreds of Atari environments selected as unique and challenging. This immediately sounds very similar to ARC-AGI 3, but the key difference will likely come in the diversity and difficulty of the ARC-AGI 3 challenges.
  • General Videogame AI (2014): uses the video game description language (VGDL) to create two-dimensional, arcade-style, grid-based physics games. “Researchers would develop their agents without knowing what games they will be playing, and after submitting their agents to the competition all agents are evaluated using an unseen set of games.” Over ten years ago, we had a similar concept with simpler games.
  • ProcGen (2019): 16 procedurally generated environments designed to test sample efficiency and generalisation. Procedural generation forces a degree of generalisation, but diversity is still limited.
  • Animal-AI (2019): features 900 diverse tasks designed to test animal-like cognition. The biggest gap here is likely the difficulty of the reasoning in these tasks: they’re relatively simple, though perception is harder given the 3D first-person view.
  • MiniHack (2021): 17 pre-defined environments, but easily extensible. Similar to the other benchmarks here, the tasks lack diversity compared to what’s expected in ARC-AGI 3.
  • XLand (2023): a vast set of 3D adaptation problems.

Finally, it’s worth mentioning Gymnasium, a unified API that lets researchers work with many of these benchmarks through a single interface, encouraging even further diversity in the algorithms they build.
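For example, the same reset/step loop works across everything Gymnasium exposes, from CartPole to the Atari/ALE games (the latter need the ale-py package installed):

```python
import gymnasium as gym

# The identical loop works for any registered environment.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
done = False
while not done:
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```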

In summary, the sort of benchmark that ARC-AGI 3 is aiming to be has existed for over ten years in various forms. Some environments have specifically tested generalisation to unseen games (e.g. GVGAI, ProcGen), a key goal of ARC-AGI 3. For the benchmark to be truly interesting, it will have to deliver on its claims of task diversity and difficulty, as these are the areas where I can really see it setting itself apart from the above.

Previous SoTA Agents

Given there is a host of existing environments for this type of benchmark, there are also many agents that have been developed to play these games. Any effective agents from past work will be a reasonable starting point for this benchmark.

Key agents developed for multi-game benchmarks:

  • MuZero (2020): learns a latent dynamics model without access to the environment’s rules. There’s a loose analogy to LPN here, though it isn’t exact: MuZero learns a per-game latent dynamics model, predicting the next state in latent space, while LPN encodes programs in latent space and searches for the best program. Could an agent instead encode agents in a latent space and search for the best agent? The agent becomes a latent variable for a generalised controller. (A minimal sketch of MuZero’s decomposition follows this list.)
  • Agent57 (DeepMind, 2020): the first system to achieve above-human performance across all 57 Atari games. It uses a meta-controller that computes a mixture of long- and short-term intrinsic motivation to explore and learn a family of policies.
  • GATO (2022): designed for breadth, not depth, meaning it can do many tasks but none of them especially well. Its architecture is a large transformer trained on a multi-modal dataset, including Atari. This makes GATO both very general (many tasks) and not general at all, given it was hyper-tuned to a specific set of games.
  • Multi-game Decision Transformers (2022): where traditional deep reinforcement learning agents (DQN, SimPLe, Dreamer, etc.) are trained to optimise decisions for maximal return, the decision transformer learns from both low- and high-quality experience by conditioning on the return achieved. Trained on roughly 1B gameplay experiences.
  • Adaptive Agent (2023): a model built to be a generalist agent in a complex world. The goal was to create “agents that can zero-shot generalise to tasks from the test set”. It uses modern techniques like attention and an optimised choice of games for the agents to play, based on what improves validation score.
  • There are a variety of LLM-based agents (e.g. Voyager) which leverage large pre-trained models as priors for choosing the correct actions in environments.
  • POET/MCC: open-ended methods that co-evolve environments and agents, generating their own training curricula.
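To ground the MuZero point above, here’s a minimal sketch of its three-network decomposition (representation h, dynamics g, prediction heads), with made-up sizes and PyTorch as an assumed framework; the real system additionally plans over these latents with MCTS:

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """MuZero-style decomposition: h encodes observations into a latent state,
    g rolls the latent forward given an action, and prediction heads output
    policy/value. The environment's true rules are never needed."""
    def __init__(self, obs_dim=64, n_actions=4, latent_dim=32):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())  # representation
        self.g = nn.Sequential(nn.Linear(latent_dim + n_actions, latent_dim), nn.ReLU())  # dynamics
        self.policy = nn.Linear(latent_dim, n_actions)  # prediction heads
        self.value = nn.Linear(latent_dim, 1)
        self.n_actions = n_actions

    def forward(self, obs, actions):
        """Encode an observation, then unroll the latent through a sequence
        of actions, predicting policy logits and value at each step."""
        s = self.h(obs)
        outputs = []
        for a in actions:
            a_onehot = nn.functional.one_hot(a, self.n_actions).float()
            s = self.g(torch.cat([s, a_onehot], dim=-1))
            outputs.append((self.policy(s), self.value(s)))
        return outputs

model = LatentWorldModel()
obs = torch.randn(1, 64)
plan = [torch.tensor([1]), torch.tensor([3])]
print([value.shape for _, value in model(obs, plan)])
```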

Firstly, it seems I’m very biased: nearly all of the above models are from Google/DeepMind. Next, the models either leverage a lot of pre-training data (e.g. the LLM-based ones) or are sample-inefficient, requiring a lot of experience. So in general, these models are not what ARC-AGI 3 is looking for. According to Chollet’s definition of intelligence, they lack skill-acquisition efficiency and are thus not intelligent agents.

So what should we expect to see?

Potential Solutions

I very much expect to see early success from models with strong in-built priors behaving ‘unintelligently’. This was the case with ARC-AGI 1, which started with brute-force search over DSLs before moving to the current SOTA, which leverages search over LLMs’ massive priors on function generation, with some solutions also applying fine-tuning.
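That search-over-generated-functions approach is simple to sketch: sample candidate programs, keep any that reproduce all the demonstration pairs, and apply the winner to the test input. Here propose_programs is a stand-in for the LLM call, which in real pipelines would return sampled Python functions:

```python
def solve_arc_task(train_pairs, test_input, propose_programs, n_candidates=100):
    """Search over candidate programs: accept the first one that maps every
    demonstration input to its output, then run it on the test input."""
    for program in propose_programs(train_pairs, n_candidates):
        try:
            if all(program(x) == y for x, y in train_pairs):
                return program(test_input)
        except Exception:
            continue  # ill-formed candidates are simply skipped
    return None  # no candidate fit the demonstrations

# Stub proposer: in the SOTA pipelines this is an LLM sampling functions.
def propose_programs(train_pairs, n):
    yield lambda g: [row[::-1] for row in g]  # mirror left-right
    yield lambda g: g[::-1]                   # flip top-bottom

train = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
print(solve_arc_task(train, [[5, 6]], propose_programs))  # -> [[6, 5]]
```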

In many ways, these solutions were not in the ‘spirit’ of the competition, but they have continually proven to be the best known approach. Based on the details shared so far, it’s likely that priors will play some part in ARC-AGI 3 too. Each game is given a name, e.g. “locksmith”, which suggests that as humans, with our wide priors, we see some similarity to previous problems or games. While the agent isn’t given the game title, an understanding of the locking/unlocking mechanics found in other games could well help it succeed in this game, for example. This is akin to the ‘gravity’ problem in ARC-AGI 1, where there’s an expectation that objects tend downwards. It’s hard to avoid priors when the puzzles are human-generated!

Something like a variant of GATO fine-tuned to play these specific types of games might do well. Indeed, the GATO models range from 79M up to ~1B parameters, making them a reasonable candidate for the benchmark. We shall have to see.

Do We Even Need Generalists?

ARC-AGI challenges continue to push towards this idea of strong generality, or intelligence. Before exhausting all our efforts down this path, we can also take a step back and ask ourselves: is this really necessary?

In terms of solving games, there is no super-strong generalist today. Training a bunch of hyper-specific agents works really well, and is potentially more efficient than any generalist agent. The same holds in other domains, where we train specific models to solve, say, product-recommendation tasks.

LLMs are a strong counterexample, where we pushed for generality and produced something amazing. From that point we can also make the model less general through fine-tuning. But there are supposedly emergent properties which come only from training a general model on a massive dataset. And maybe there are things we can’t achieve without general training; for example, the finding that training on math data significantly enhances coding ability (the skill of logic).

On the other hand, we do see systems like ‘simulacra’ that simulate multiple agents and their interactions. Many of the latest research-assistant systems consist of multiple agents with specific roles. Looking outside AI, this is also largely how our society works: many ‘less intelligent’ people working together (ensembling) can achieve amazing things through the collective.

Maybe we just need to find the right scaffolding for agents to interact and thrive, rather than produce the most generalist intelligence.

Conclusion

ARC-AGI 3 looks to be a fun challenge with all the difficulties of the previous challenges but a host of prior art for people to get started with. It will be interesting to see the initial performance of SoTA models out of the gate and the solutions that people come up with.
