o3 is Coming

Jye Sawtell-Rickson · December 24, 2024

Merry Christmas from OpenAI: “o3 is coming, and it’s better than all your models!”

As part of their annual announcements, OpenAI shared their latest model: o3 (skipping o2 to avoid a clash with the big UK telco of the same name). TL;DR: it seems we got the step-change people were doubting would come. Of course, nothing is published and nothing is available, but they did share some key results.

I’ve long been interested in Chollet’s On the Measure of Intelligence and the associated ARC challenge. Just a few weeks before this announcement, the ARC Prize competition concluded with some nice results, but nothing astounding. o3 has suddenly come in and taken the prize, by a big margin.

Chollet himself has done a nice write-up of the experiments they ran. Most notably, o3 achieves an accuracy of 88% in a high-compute, tuned configuration, vs. o1’s 32% and the previous SOTA’s 60%.

But what did it cost? A cool US$3,500 per task. Yes, per task. So for the 100 tasks of the semi-private evaluation set, that’s a casual US$350K.

You thought o1 was pricey? o3 breaks all the pricing benchmarks too.

Notably, they also tested a low-compute version, which comes in at $20 per task. That’s a lot more reasonable, and at least within an order of magnitude of the cost of a human.
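For a sense of scale, here’s a quick back-of-the-envelope in Python using the per-task figures above (the ~$5/task human baseline is my own rough assumption for illustration):

```python
# Rough cost comparison over the 100-task semi-private evaluation set.
# The o3 figures are the ones quoted above; the human baseline is an
# assumption for illustration only.
TASKS = 100

cost_per_task = {
    "o3 (high compute)": 3_500,
    "o3 (low compute)": 20,
    "human (assumed)": 5,
}

for name, cost in cost_per_task.items():
    print(f"{name:>18}: ${cost:>6,}/task -> ${cost * TASKS:>8,} total")
```

Even the low-compute setting is a few times the assumed human baseline, but it’s within striking distance; the high-compute setting is not.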

What this means

The key question: is this AGI? No. At least according to Chollet. Based on their tests, o3 still makes silly mistakes a human would not. Empirically, it’s not the same kind of intelligence as human intelligence. So there are still steps to go; we’re not quite there yet, but we might be getting awfully close.

With o3 coming soon, should we give up other avenues? Probably not. o3 is still limited on its own. It’s expensive and inefficient, it still needs to integrate effectively with tools (e.g. for coding), it needs to be able to evolve and to understand long context windows, and so on.

That said, o3’s architecture is a big step up. To quote Chollet:

o3’s improvement over the GPT series proves that architecture is everything

But o3 is still just the beginning. There’s still a lot of interesting research to be done.

Time to play?

It will likely be six months before it’s released to the public (based on past release history), so don’t expect access soon (unless you’re a safety researcher).

Even o1 (non-preview) is not yet available to many users.

Other Tidbits

Some choice quotes from Chollet’s write-up:

You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

The ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute.

To adapt to novelty, you need two things. First, you need knowledge – a set of reusable functions or programs to draw upon. LLMs have more than enough of that. Second, you need the ability to recombine these functions into a brand new program when facing a new task – a program that models the task at hand. Program synthesis. LLMs have long lacked this feature. The o series of models fixes that.

…o3’s core mechanism appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search. In the case of o3, the search is presumably guided by some kind of evaluator model.
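To make that concrete, here’s a deliberately simplified Python sketch of what evaluator-guided search over chains of thought could look like. I’ve used a fixed-width beam search for brevity rather than MCTS, and every name here (generator, evaluator) is hypothetical; nothing about o3’s actual implementation is public.

```python
import heapq

def cot_search(task, generator, evaluator, beam_width=4, max_depth=8):
    """Hypothetical beam search over chains of thought (CoTs).

    `generator(task, cot)` proposes candidate next reasoning steps;
    `evaluator(task, cot)` scores a partial CoT for how promising it is.
    Both stand in for models whose details are not public.
    """
    beam = [("", 0.0)]  # (partial chain of thought, score)
    for _ in range(max_depth):
        candidates = []
        for cot, _ in beam:
            for step in generator(task, cot):
                new_cot = cot + "\n" + step
                candidates.append((new_cot, evaluator(task, new_cot)))
        # Keep only the chains the evaluator judges most promising.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    # Return the best complete chain of thought found.
    return max(beam, key=lambda c: c[1])[0]
```

A real system would presumably be far more sophisticated (the AlphaZero analogy suggests tree search with backup rather than a fixed-width beam), but the shape is the same: generate candidate reasoning steps, score them with a learned evaluator, keep the best.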

There are however two significant differences between what’s happening here and what I meant when I previously described “deep learning-guided program search” as the best path to get to AGI. Crucially, the programs generated by o3 are natural language instructions (to be “executed” by a LLM) rather than executable symbolic programs. This means two things. First, that they cannot make contact with reality via execution and direct evaluation on the task – instead, they must be evaluated for fitness via another model, and the evaluation, lacking such grounding, might go wrong when operating out of distribution. Second, the system cannot autonomously acquire the ability to generate and evaluate these programs (the way a system like AlphaZero can learn to play a board game on its own.) Instead, it is reliant on expert-labeled, human-generated CoT data.
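The grounding gap Chollet describes can be shown in a few lines. A symbolic program can be checked against reality by simply running it on the task’s demonstration pairs; a natural-language program has to be “run” by one model and scored by another. A toy contrast (all helpers hypothetical):

```python
def fitness_symbolic(program, examples):
    # Grounded: run the candidate program on each demonstration input and
    # check its output directly. Execution gives a hard yes/no signal.
    return all(program(inp) == out for inp, out in examples)

def fitness_natural_language(instructions, test_input, llm, judge):
    # Ungrounded: an LLM "executes" the instructions and a judge model
    # scores the result. With no ground truth to execute against, the
    # score is only as good as the judge, which may drift off-distribution.
    output = llm(instructions, test_input)
    return judge(instructions, test_input, output)
```

This is also why, as the quote notes, the system can’t bootstrap itself the way AlphaZero does: AlphaZero’s games provide an exact win/loss signal to learn from, whereas here the training signal has to come from expert-labeled, human-generated CoT data.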
