Coding World Models with LLMs

Jye Sawtell-Rickson · June 22, 2025

Recently I ran an experiment in which I tried to get LLMs to output explicit world models in the form of Python code, which could then be used as the core of an MCTS policy for control in basic gymnasium environments. This differs from typical world-model learning, where the model would be something like a parameterised neural network. The reason to care about something like this is on-the-fly reasoning, such as in ARC-AGI-3, where agents must act in a completely new environment.

Coded World Models with LLMs

The idea behind the approach is that, given an LLM’s pre-trained knowledge, it should be able to accurately model the physics of various simulations, and do even better when fed example transitions. Overall, the project wasn’t very successful: small models failed to produce reasonable code once things got a little tricky (e.g. calculating angular velocity changes from translational forces). For example, this is code output by Gemma3:4b (with constants tuned). It introduces the concept of friction, which may be unnecessary for a good-enough model, but overall it has the right idea, using trigonometry to calculate changes to the velocity and adjusting speed differently based on the action.

def get_next_state(observation: list, action: int) -> list:
    """Returns the next observation given the current observation and action."""
    cart_position, cart_velocity, pole_angle, pole_angular_velocity = observation

    # Constants (tuned for the specific game)
    dt = -3.72e-05 # Time step
    pole_length = 885.53
    gravity = -217921501.39
    friction = 1.60

    # Calculate new cart position and velocity
    new_cart_position = cart_position + cart_velocity * dt

    # Calculate new cart velocity - more robust damping
    new_cart_velocity = cart_velocity + (gravity * math.sin(pole_angle) / pole_length) * dt
    new_cart_velocity = new_cart_velocity * friction

    # Apply action
    if action == 0:  # Push left
        new_cart_velocity -= 0.5
    elif action == 1:  # Push right
        new_cart_velocity += 0.5

    # Calculate new pole angle and angular velocity - improved damping
    new_pole_angle = pole_angle + pole_angular_velocity * dt
    new_pole_angular_velocity = pole_angular_velocity * friction * (1 - abs(new_cart_velocity) / 1.0) # Damp angular velocity

    return [new_cart_position, new_cart_velocity, new_pole_angle, new_pole_angular_velocity]
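For context on how such a coded model is meant to be used: below is a minimal sketch of a random-rollout planner (a crude stand-in for full MCTS) that scores each action by how long simulated rollouts survive under the model. The helper names, rollout depth and CartPole failure thresholds are my own illustrative choices rather than anything from the experiment, and it assumes a get_next_state like the one above (plus its missing import math).

import random

def rollout(state: list, first_action: int, depth: int = 20) -> int:
    """Simulate one random rollout under the coded model; return steps survived."""
    state = get_next_state(state, first_action)
    for step in range(depth):
        cart_position, _, pole_angle, _ = state
        # Standard CartPole failure bounds (illustrative).
        if abs(pole_angle) > 0.2095 or abs(cart_position) > 2.4:
            return step
        state = get_next_state(state, random.choice([0, 1]))
    return depth

def choose_action(state: list, n_rollouts: int = 30) -> int:
    """Pick the action whose random rollouts survive longest on average."""
    scores = [
        sum(rollout(state, action) for _ in range(n_rollouts)) / n_rollouts
        for action in (0, 1)
    ]
    return int(scores[1] > scores[0])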

Graphical Models

It did get me thinking, though: as humans, we don’t typically come up with detailed world models when solving these sorts of tasks (at least I couldn’t recite one back). I couldn’t tell you the equation for the angular velocity of the pole, yet I can succeed after just one or two trials at the game. So do I have a model? What is it?

I believe I do have a structured model, and it’s something like this:

  • Pushing left makes the cart go a little bit left and the pole drift a bit to the right; and the same for the other side.

In code, that’s something like:
def get_next_state(observation: list, action: int) -> list:
    """Returns the next observation given the current observation and action."""
    cart_position, cart_velocity, pole_angle, pole_angular_velocity = observation

    dt = 1
    a_bit = 0.1
    a_bit_less = 0.05
    new_cart_velocity = cart_velocity 
    new_pole_angular_velocity = pole_angular_velocity

    if action == 0:  # Push left: cart drifts left, pole tips right
        new_cart_velocity -= a_bit
        new_pole_angular_velocity += a_bit_less
    else:  # Push right: cart drifts right, pole tips left
        new_cart_velocity += a_bit
        new_pole_angular_velocity -= a_bit_less

    new_cart_position = cart_position + new_cart_velocity * dt
    new_pole_angle = pole_angle + new_pole_angular_velocity * dt

    return [new_cart_position, new_cart_velocity, new_pole_angle, new_pole_angular_velocity]

This leads me to a control policy:

  • If the pole is falling left, push the cart left to counteract; and the same for the other side.
  • If the cart is drifting too far left, let the pole hang to the right for a bit before resuming the first rule; and the same for the other side.

With this model, I’m able to tackle the problem fairly well. It’s a simple model, and could be represented by a small control chart. It could be made even simpler if you just want to perform well in the short term: push right if the pole angle is greater than zero, else push left.
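As a rough sketch of that simplest version (assuming gymnasium’s CartPole observation layout, with action 1 pushing right):

def simple_policy(observation: list) -> int:
    """Push toward the side the pole is leaning: right (1) if leaning right, else left (0)."""
    _, _, pole_angle, _ = observation
    return 1 if pole_angle > 0 else 0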

So how can we get an LLM to produce this simplified world model, instead of the more complex one we saw before? Maybe just through better prompting. I’ll be playing around with it; wish me luck!
