Machine Learning Solver¶
This notebook is built using mathy_envs and a modified version of @nikhilbarhate99's wonderful PPO-Pytorch script.
While solving math problems with heuristics is interpretable and reliable, it can be a significant engineering task to design combinations of rules and heuristics that handle all the various tree forms user input questions can take.
Rather than invest engineering time in writing heuristics, we can use machine learning to train a model that selects which actions to take to find an optimal path to a solution. This is more robust than random action selection, and it makes solving many problems trivial once the model is trained.
Let's look together at how mathy_envs can be used with Proximal Policy Optimization (PPO) in PyTorch to train a problem-solving model that can then be used to demonstrate solving problems step-by-step.
!pip install "mathy_envs>=0.12.1" torch
Overview¶
Before we get started, let's review what mathy envs are and how they work. Mathy envs are reinforcement learning environments for manipulating math trees with a rules system.
- Each mathy_envs environment generates math problem texts and determines if the current expression is "solved" or not
- Users/Models interact with the environments by playing "episodes" where they solve problems given a set of rules and environment-specific logic
- Depending on the context, the outputs are either used as inputs to a training model or as an output demonstration for an end-user
We will use reinforcement learning to train a model capable of solving problems generated by the mathy_envs library.
Specifically, we choose the PolySimplify environment, which generates polynomial simplification problems of controllable complexity and implements the logic to determine when they're solved.
For the machine learning portion, we choose Proximal Policy Optimization (PPO), an on-policy policy-gradient algorithm.
Before we get into the machine learning parts, let's quickly get a taste of the basics of our environments.
Mathy envs implements a base environment interface, which is wrapped in a set of classes exposed for the gym/gymnasium libraries.
import gymnasium as gym
import numpy as np
import torch
from mathy_envs import MathyEnv
from mathy_envs.gym import MathyGymEnv
# Environment difficulty level (options: 'easy', 'normal', 'hard')
env_difficulty = "easy"
# Environment names to train on, based on environment types and difficulty
env_types = [
"poly",
# "poly-blockers",
# "poly-combine",
# "poly-commute",
# "poly-grouping",
# "poly-like-terms-haystack",
# "binomial",
# "complex",
]
env_names = [f"mathy-{t}-{env_difficulty}-v0" for t in env_types]
env: MathyGymEnv = gym.make(env_names[0])
base_env: MathyEnv = env.unwrapped.mathy
print(f"Environment: {base_env.get_env_namespace()}")
print(f"Num Actions: {base_env.action_size}")
print(f"Rules : {[e.name for e in base_env.rules]}")
Environment: mathy.polynomials.simplify
Num Actions: 896
Rules : ['Constant Arithmetic', 'Commutative Swap', 'Distributive Multiply', 'Distributive Factoring', 'Associative Group', 'Variable Multiplication', 'Restate Subtraction']
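Before moving on to the machine learning pieces, here's a minimal sketch of playing a single episode with randomly chosen valid actions, reusing the env created above. It assumes (as the ActorCritic.act method later in this notebook does) that the last action_space.n entries of each observation form a mask of the currently valid actions.
# A minimal sketch: one episode with uniformly random valid actions.
# Assumption: the final `action_space.n` observation entries are the valid-action mask.
state, _ = env.reset()
done = False
total_reward = 0.0
steps = 0
while not done:
    mask = np.asarray(state)[-env.action_space.n :]
    valid_actions = np.flatnonzero(mask)
    action = int(np.random.choice(valid_actions))
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    steps += 1
    done = terminated or truncated
print(f"Random agent finished after {steps} steps with total reward {total_reward:.2f}")
Random action selection rarely earns a positive reward here, which is exactly the gap the trained policy is meant to close.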
Proximal Policy Optimization¶
Proximal Policy Optimization (PPO) is a reinforcement learning approach that balances simplicity of implementation and sample efficiency. Developed by John Schulman and colleagues at OpenAI, PPO is designed to be more stable and reliable than earlier policy gradient methods, thanks to its novel objective function that moderates the policy updates.
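For reference, the clipped surrogate objective at the core of PPO, which the update method below implements (alongside a value-function loss and an entropy bonus), is:
$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$
where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clip parameter (eps_clip below).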
PPO uses a buffer of trajectories called a rollout buffer to store key elements like states, actions, and rewards generated by interacting with the environments. These buffered trajectories are used when updating the actor-critic neural networks during training.
We'll use PPO here to train an agent that can solve polynomial simplification problems in a step-by-step manner.
First, let's set some variables that can be changed later for experimentation.
# Learning rate for the actor network
lr_actor = 0.0003
# Learning rate for the critic network
lr_critic = 0.001
# Discount factor for future rewards
gamma = 0.99
# Number of epochs to update the policy
K_epochs = 80
# Clip parameter for PPO, used in policy update
eps_clip = 0.2
# Random seed setting (0 = no random seed)
random_seed = 1337
# Device to run the training on (CPU or CUDA)
device = torch.device("cpu" if not torch.cuda.is_available() else "cuda:0")
# Dimension of the hidden layer in the critic network
critic_hidden_dim = 64
# Where to save the model
checkpoint_path = "ppo.pth"
# Whether or not to use masked action selection. This makes the problems significantly easier when
# true because the action space is sparse with most possible actions being invalid. When false, the
# agent must learn to avoid invalid actions itself, making the problems much more challenging given
# the action space on the order of hundreds or thousands of possible actions for each state.
use_masked_actions = True
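To make the effect of masking concrete, here's a tiny illustration with made-up numbers: probabilities for invalid actions are zeroed out and the remainder is renormalized, which is exactly what the actor does later in this notebook.
# Hypothetical example: 6 actions, where only actions 1 and 4 are valid in this state.
probs = torch.tensor([0.10, 0.30, 0.05, 0.25, 0.20, 0.10])
mask = torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
masked = probs * mask            # zero out invalid actions
masked = masked / masked.sum()   # renormalize over the valid ones
print(masked)  # the valid actions get ~0.6 and ~0.4; everything else is 0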
Rollout Buffer¶
The Rollout Buffer in PPO stores the agent's trajectory during its interaction with the environment over a single policy iteration. This includes actions, states, rewards, log probabilities of the actions under the current policy, and state values. When the policy is updated, the buffer is cleared.
class RolloutBuffer:
def __init__(self):
self.actions = []
self.states = []
self.logprobs = []
self.rewards = []
self.state_values = []
self.is_terminals = []
def clear(self):
del self.actions[:]
del self.states[:]
del self.logprobs[:]
del self.rewards[:]
del self.state_values[:]
del self.is_terminals[:]
Actor-Critic Module¶
The Actor-Critic module in PPO is the core of the policy learning mechanism and has two key components:
- the actor, which is responsible for choosing actions based on the current state.
- the critic, which evaluates actor actions by estimating the value of the state. This is used to help steer the actor toward higher-value actions.
The dual structure allows for more efficient and stable learning by combining the strengths of both policy-based and value-based approaches in reinforcement learning.
import torch.nn as nn
from torch.distributions import Categorical
class ActorCritic(nn.Module):
def __init__(self, state_dim, action_dim):
super(ActorCritic, self).__init__()
self.device = device
self.action_dim = action_dim
# Actor network
self.actor = nn.Sequential(
nn.Linear(state_dim, critic_hidden_dim),
nn.Tanh(),
nn.Linear(critic_hidden_dim, critic_hidden_dim),
nn.Tanh(),
nn.Linear(critic_hidden_dim, action_dim),
nn.Softmax(dim=-1),
)
# Critic network
self.critic = nn.Sequential(
nn.Linear(state_dim, critic_hidden_dim),
nn.Tanh(),
nn.Linear(critic_hidden_dim, critic_hidden_dim),
nn.Tanh(),
nn.Linear(critic_hidden_dim, 1),
)
def act(self, state):
action_probs = self.actor(state)
if use_masked_actions:
mask = state[-action_probs.shape[0] :]
action_probs = action_probs * mask
action_probs = action_probs / torch.sum(action_probs)
dist = Categorical(action_probs)
action = dist.sample()
action_logprob = dist.log_prob(action)
state_val = self.critic(state)
return action.detach(), action_logprob.detach(), state_val.detach()
def evaluate(self, state, action):
action_probs = self.actor(state)
dist = Categorical(action_probs)
action_logprobs = dist.log_prob(action)
dist_entropy = dist.entropy()
state_values = self.critic(state)
return action_logprobs, state_values, dist_entropy
Algorithm¶
The PPO class handles policy updates, action selection, experience buffer management, and model saving/loading.
It initializes two ActorCritic models: one for the current policy and another as a reference to the old policy. This structure is crucial for implementing PPO's clipped surrogate objective function, which moderates the policy updates for stability.
import torch
import torch.nn as nn
class PPO:
def __init__(self, state_dim: int, action_dim: int):
self.device = device
self.gamma = gamma
self.eps_clip = eps_clip
self.K_epochs = K_epochs
self.buffer = RolloutBuffer()
self.policy = ActorCritic(state_dim, action_dim).to(device)
self.optimizer = torch.optim.Adam(
[
{"params": self.policy.actor.parameters(), "lr": lr_actor},
{"params": self.policy.critic.parameters(), "lr": lr_critic},
]
)
self.policy_old = ActorCritic(state_dim, action_dim).to(device)
self.policy_old.load_state_dict(self.policy.state_dict())
self.MseLoss = nn.MSELoss()
def select_action(self, state):
state = torch.FloatTensor(state).to(self.device)
with torch.no_grad():
action, action_logprob, state_val = self.policy_old.act(state)
self.buffer.states.append(state)
self.buffer.actions.append(action)
self.buffer.logprobs.append(action_logprob)
self.buffer.state_values.append(state_val)
return action.item()
def update(self):
# Monte Carlo estimate of returns
rewards = []
discounted_reward = 0
for reward, is_terminal in zip(
reversed(self.buffer.rewards), reversed(self.buffer.is_terminals)
):
if is_terminal:
discounted_reward = 0
discounted_reward = reward + (self.gamma * discounted_reward)
rewards.insert(0, discounted_reward)
# Normalizing the rewards
rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device)
rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-7)
# convert list to tensor
old_states = (
torch.squeeze(torch.stack(self.buffer.states, dim=0))
.detach()
.to(self.device)
)
old_actions = (
torch.squeeze(torch.stack(self.buffer.actions, dim=0))
.detach()
.to(self.device)
)
old_logprobs = (
torch.squeeze(torch.stack(self.buffer.logprobs, dim=0))
.detach()
.to(self.device)
)
old_state_values = (
torch.squeeze(torch.stack(self.buffer.state_values, dim=0))
.detach()
.to(self.device)
)
# calculate advantages
advantages = rewards.detach() - old_state_values.detach()
# Optimize policy for K epochs
for _ in range(self.K_epochs):
# Evaluating old actions and values
logprobs, state_values, dist_entropy = self.policy.evaluate(
old_states, old_actions
)
# match state_values tensor dimensions with rewards tensor
state_values = torch.squeeze(state_values)
# Finding the ratio (pi_theta / pi_theta__old)
ratios = torch.exp(logprobs - old_logprobs.detach())
# Finding Surrogate Loss
surr1 = ratios * advantages
surr2 = (
torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
)
# final loss of clipped objective PPO
loss = (
-torch.min(surr1, surr2)
+ 0.5 * self.MseLoss(state_values, rewards)
- 0.01 * dist_entropy
)
# take gradient step
self.optimizer.zero_grad()
loss.mean().backward()
self.optimizer.step()
# Copy new weights into old policy
self.policy_old.load_state_dict(self.policy.state_dict())
# clear buffer
self.buffer.clear()
def save(self, checkpoint_path):
torch.save(self.policy_old.state_dict(), checkpoint_path)
def load(self, checkpoint_path):
checkpoint = torch.load(checkpoint_path, map_location=self.device)
self.policy_old.load_state_dict(checkpoint)
self.policy.load_state_dict(checkpoint)
Training¶
The training loop improves the agent by having it interact with its environment and learn from the outcomes.
The loop picks actions with the current policy, collects rewards and next states, and updates the policy at regular intervals.
During training, we print periodic updates showing how many episodes have finished, how many steps have been taken, and the average reward the agent earns.
def train(checkpoint_path: str, max_steps: int = 1_000_000):
print(
f"Device set to: {torch.cuda.get_device_name(device) if torch.cuda.is_available() else 'CPU'}"
)
print(f"Training environments: {', '.join(env_names)}")
max_ep_len = 50 # Max timesteps in one episode
print_freq = 20_000 # Frequency for printing average reward
save_model_freq = int(1e5) # Model saving frequency
update_timestep = max_ep_len * 4 # update policy every n timesteps
envs = [
gym.make(name, invalid_action_response="raise", verbose=False)
for name in env_names
]
env = envs[0] # Select an environment
# Initialize the PPO agent
ppo_agent = PPO(env.observation_space.shape[0], env.action_space.n)
# Training variables
time_step = 0
i_episode = 0
total_reward = 0
# Training loop
while time_step <= max_steps:
state, _ = env.reset()
episode_reward = 0
for t in range(1, max_ep_len + 1):
action = ppo_agent.select_action(state)
state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
ppo_agent.buffer.rewards.append(reward)
ppo_agent.buffer.is_terminals.append(done)
time_step += 1
episode_reward += reward
if time_step % update_timestep == 0:
ppo_agent.update()
# Print average reward
if time_step % print_freq == 0:
avg_reward = total_reward / i_episode if i_episode > 0 else 0
print(
f"Episode: {i_episode} \t Timestep: {time_step} \t Average Reward: {avg_reward:.2f}"
)
# Save model
if time_step % save_model_freq == 0:
print(f"Saving model at timestep {time_step}")
ppo_agent.save(checkpoint_path)
if done:
break
total_reward += episode_reward
i_episode += 1
print("Training completed.")
Now we're ready to train our model. Training a fully capable agent in this environment can take a million or more steps, which takes a while to complete.
For our purposes, a few hundred thousand steps is a good number: it's quicker, and you can still see the agent start to learn how to solve problems in that time.
Any reward value above 0.0 almost always indicates a correct solution within the number of steps allowed by the environment. Perfect scores are generally around 1.5 for most environments and max out at about 2.0 for environments that only take a few steps.
train(checkpoint_path, 300_000)
Device set to: NVIDIA GeForce RTX 3090
Training environments: mathy-poly-easy-v0
Episode: 1672 Timestep: 20000 Average Reward: -1.25
Episode: 3752 Timestep: 40000 Average Reward: -0.54
Episode: 6370 Timestep: 60000 Average Reward: 0.02
Episode: 9146 Timestep: 80000 Average Reward: 0.34
Episode: 12116 Timestep: 100000 Average Reward: 0.54
Saving model at timestep 100000
Episode: 15078 Timestep: 120000 Average Reward: 0.66
Episode: 18083 Timestep: 140000 Average Reward: 0.75
Episode: 21243 Timestep: 160000 Average Reward: 0.81
Episode: 24630 Timestep: 180000 Average Reward: 0.88
Episode: 27946 Timestep: 200000 Average Reward: 0.92
Saving model at timestep 200000
Episode: 31427 Timestep: 220000 Average Reward: 0.96
Episode: 34634 Timestep: 240000 Average Reward: 0.99
Episode: 38175 Timestep: 260000 Average Reward: 1.02
Episode: 41833 Timestep: 280000 Average Reward: 1.05
Episode: 45453 Timestep: 300000 Average Reward: 1.07
Saving model at timestep 300000
Training completed.
Evaluation¶
The test function evaluates the performance of our trained agent. It loads the trained model, uses it to complete a number of test problems, then prints the average reward alongside the problem outputs.
def test(checkpoint_path: str):
envs = [
gym.make(name, invalid_action_response="raise", verbose=True)
for name in env_names
]
assert len(envs) > 0, "No environments found"
env = envs[0]
ppo_agent = PPO(env.observation_space.shape[0], env.action_space.n)
print(f"\nloading network from : {checkpoint_path}\n", flush=True)
ppo_agent.load(checkpoint_path)
total_test_episodes = 10 # total num of testing episodes
test_running_reward = 0
for ep in range(1, total_test_episodes + 1):
env = np.random.choice(envs)
ep_reward = 0
state, _ = env.reset()
done = False
while not done:
action = ppo_agent.select_action(state)
state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
ep_reward += reward
test_running_reward += ep_reward
emoji = "✅" if ep_reward >= 0.0 else "🔴"
print(f"[{ep}]{emoji} Reward: {round(ep_reward, 2)}")
ep_reward = 0
avg_test_reward = test_running_reward / total_test_episodes
print(f"Average test reward: {round(avg_test_reward, 2)}")
Having trained a model and written the test function, we can finally see the results of our hard work.
Our tiny model (< 1MB) is able to solve our polynomial simplification problems somewhat consistently.
With more training, the given agent configuration can reach near-perfect accuracy on this task.
test(checkpoint_path)
006 | -- cs -- df ag -- -- | 16 | 0.0 | initial-state(-1) | (6j^2 + 2j^2 + 2q + 5r^3)
loading network from : ppo.pth
005 | -- cs -- -- ag -- -- | 16 | 0.0 | initial-state(-1) | 6y + 8m^2 + 12y + 10m^2
007 | -- cs -- df ag -- -- | 15 | -0.01 | commutative swap(9) | 6y + 12y + 8m^2 + 10m^2
008 | ca cs dm df ag -- -- | 14 | 0.01 | distributive factoring(3) | (6 + 12) * y + 8m^2 + 10m^2
004 | -- cs -- df ag -- -- | 13 | 0.01 | constant arithmetic(1) | 18y + 8m^2 + 10m^2
004 | ca cs dm -- -- -- -- | 12 | 0.01 | distributive factoring(9) | 18y + (8 + 10) * ^2
001 | -- cs -- -- -- -- -- | 11 | 1.4 | constant arithmetic(5) | 18y + 18^2
[1]✅ Reward: 1.42
005 | -- cs -- -- ag -- -- | 16 | 0.0 | initial-state(-1) | (3c^4 + 5o^3 + c^4) + 8c^2
006 | -- cs -- df ag -- -- | 15 | -0.01 | commutative swap(11) | 3c^4 + c^4 + 5o^3 + 8c^2
006 | ca cs dm -- ag -- -- | 14 | 0.01 | distributive factoring(5) | (3 + 1) * c^4 + 5o^3 + 8c^2
003 | -- cs -- -- ag -- -- | 13 | 1.7 | constant arithmetic(1) | 4c^4 + 5o^3 + 8c^2
[2]✅ Reward: 1.67
007 | -- cs -- df ag -- -- | 16 | 0.0 | initial-state(-1) | 7o^2 + (o^2 + g + 9g)
007 | ca cs dm df ag -- -- | 15 | 0.01 | distributive factoring(5) | (7 + 1) * o^2 + (g + 9g)
008 | ca cs dm -- -- -- -- | 14 | 0.01 | distributive factoring(9) | (7 + 1) * o^2 + (1 + 9) * g
005 | ca cs dm -- -- -- -- | 13 | 0.01 | constant arithmetic(1) | 8o^2 + (1 + 9) * g
001 | -- cs -- -- -- -- -- | 12 | 1.5 | constant arithmetic(7) | 8o^2 + 10g
[3]✅ Reward: 1.53
005 | -- cs -- -- ag -- -- | 16 | 0.0 | initial-state(-1) | 7v + (10v^4 + v) + 7v^4
007 | -- cs -- df ag -- -- | 15 | -0.01 | commutative swap(9) | 7v + (v + 10v^4) + 7v^4
007 | ca cs dm df ag -- -- | 14 | 0.01 | distributive factoring(11) | 7v + v + (10 + 7) * v^4
004 | -- cs -- df ag -- -- | 13 | 0.01 | constant arithmetic(7) | 7v + v + 17v^4
005 | ca cs dm -- -- -- -- | 12 | 0.01 | distributive factoring(3) | (7 + 1) * v + 17v^4
001 | -- cs -- -- -- -- -- | 11 | 1.4 | constant arithmetic(1) | 8v + 17v^4
[4]✅ Reward: 1.42
006 | -- cs -- df ag -- -- | 16 | 0.0 | initial-state(-1) | (9u^3 + 10r + 3r + 8u^3)
007 | ca cs dm -- ag -- -- | 15 | 0.01 | distributive factoring(9) | 9u^3 + (10 + 3) * r + 8u^3
008 | ca cs dm df ag -- -- | 14 | -0.01 | commutative swap(5) | (10 + 3) * r + 9u^3 + 8u^3
004 | -- cs -- df ag -- -- | 13 | 0.01 | constant arithmetic(1) | 13r + 9u^3 + 8u^3
004 | ca cs dm -- -- -- -- | 12 | 0.01 | distributive factoring(9) | 13r + (9 + 8) * u^3
001 | -- cs -- -- -- -- -- | 11 | 1.4 | constant arithmetic(5) | 13r + 17u^3
[5]✅ Reward: 1.42
003 | -- cs -- -- ag -- -- | 16 | 0.0 | initial-state(-1) | (5p^3 + 2y + 8p^3)
004 | -- cs -- df ag -- -- | 15 | -0.01 | commutative swap(5) | 2y + 5p^3 + 8p^3
004 | ca cs dm -- -- -- -- | 14 | 0.01 | distributive factoring(9) | 2y + (5 + 8) * p^3
001 | -- cs -- -- -- -- -- | 13 | 1.7 | constant arithmetic(5) | 2y + 13p^3
[6]✅ Reward: 1.67
009 | -- cs -- -- ag -- -- | 20 | 0.0 | initial-state(-1) | (4a + 4f + 5f^2 + 8a + 12z + 7f)
009 | -- cs -- -- ag -- -- | 19 | -0.01 | commutative swap(7) | 4a + 5f^2 + 4f + 8a + 12z + 7f
009 | -- cs -- -- ag -- -- | 18 | -0.01 | commutative swap(9) | 4a + 4f + 5f^2 + 8a + 12z + 7f
009 | -- cs -- -- ag -- -- | 17 | -0.04 | commutative swap(7) | 4a + 5f^2 + 4f + 8a + 12z + 7f
009 | -- cs -- -- ag -- -- | 16 | -0.04 | commutative swap(9) | 4a + 4f + 5f^2 + 8a + 12z + 7f
009 | -- cs -- -- ag -- -- | 15 | -0.06 | commutative swap(7) | 4a + 5f^2 + 4f + 8a + 12z + 7f
009 | -- cs -- -- ag -- -- | 14 | -0.01 | associative group(9) | 4a + 5f^2 + (4f + 8a) + 12z + 7f
009 | -- cs -- -- ag -- -- | 13 | -0.01 | associative group(9) | 4a + 5f^2 + (4f + 8a + 12z) + 7f
009 | -- cs -- -- ag -- -- | 12 | -0.01 | associative group(9) | 4a + 5f^2 + (4f + 8a + 12z + 7f)
009 | -- cs -- -- ag -- -- | 11 | -0.01 | commutative swap(9) | 4a + (4f + 8a + 12z + 7f) + 5f^2
009 | -- cs -- -- ag -- -- | 10 | -0.01 | commutative swap(11) | 4a + (4f + 12z + 8a + 7f) + 5f^2
009 | -- cs -- -- ag -- -- | 09 | -0.04 | commutative swap(11) | 4a + (4f + 8a + 12z + 7f) + 5f^2
009 | -- cs -- -- ag -- -- | 08 | -0.04 | commutative swap(11) | 4a + (4f + 12z + 8a + 7f) + 5f^2
009 | -- cs -- -- ag -- -- | 07 | -0.06 | commutative swap(11) | 4a + (4f + 8a + 12z + 7f) + 5f^2
009 | -- cs -- -- ag -- -- | 06 | -0.01 | commutative swap(15) | 4a + (4f + 8a + 7f + 12z) + 5f^2
010 | -- cs -- df ag -- -- | 05 | -0.01 | commutative swap(11) | 4a + (4f + 7f + 8a + 12z) + 5f^2
010 | -- cs -- df ag -- -- | 04 | -0.01 | commutative swap(3) | 4f + 7f + 8a + 12z + 4a + 5f^2
011 | ca cs dm -- ag -- -- | 03 | 0.01 | distributive factoring(3) | (4 + 7) * f + 8a + 12z + 4a + 5f^2
007 | -- cs -- -- ag -- -- | 02 | 0.01 | constant arithmetic(1) | 11f + 8a + 12z + 4a + 5f^2
007 | -- cs -- -- ag -- -- | 01 | -0.01 | associative group(3) | 11f + (8a + 12z) + 4a + 5f^2
007 | -- cs -- -- ag -- -- | 00 | -1.0 | associative group(3) | 11f + (8a + 12z + 4a) + 5f^2
[7]🔴 Reward: -1.37
002 | -- cs -- df -- -- -- | 08 | 0.0 | initial-state(-1) | (12k^4 + 4k^4)
003 | ca cs dm -- -- -- -- | 07 | 0.01 | distributive factoring(5) | (12 + 4) * k^4
002 | -- cs -- df -- -- -- | 06 | -0.01 | distributive multiply(3) | 12k^4 + 4k^4
003 | ca cs dm -- -- -- -- | 05 | -0.04 | distributive factoring(5) | (12 + 4) * k^4
000 | -- -- -- -- -- -- -- | 04 | 1.2 | constant arithmetic(1) | 16k^4
[8]✅ Reward: 1.21
005 | -- cs -- -- ag -- -- | 16 | 0.0 | initial-state(-1) | 12x + (8x^2 + 1x + x^2)
007 | -- cs -- df ag -- -- | 15 | -0.01 | commutative swap(9) | 12x + (1x + 8x^2 + x^2)
008 | ca cs dm df ag -- -- | 14 | 0.01 | distributive factoring(3) | (12 + 1) * x + (8x^2 + x^2)
004 | -- cs -- df ag -- -- | 13 | 0.01 | constant arithmetic(1) | 13x + (8x^2 + x^2)
004 | ca cs dm -- -- -- -- | 12 | 0.01 | distributive factoring(9) | 13x + (8 + 1) * x^2
001 | -- cs -- -- -- -- -- | 11 | 1.4 | constant arithmetic(5) | 13x + 9x^2
[9]✅ Reward: 1.42
006 | -- cs -- df ag -- -- | 12 | 0.0 | initial-state(-1) | c + (8c + 6c^2 + 6o)
005 | -- cs -- -- ag -- -- | 11 | -0.01 | commutative swap(1) | 8c + 6c^2 + 6o + c
005 | -- cs -- -- ag -- -- | 10 | -0.01 | commutative swap(9) | 8c + 6o + 6c^2 + c
005 | -- cs -- -- ag -- -- | 09 | -0.04 | commutative swap(7) | 8c + 6c^2 + 6o + c
005 | -- cs -- -- ag -- -- | 08 | -0.01 | associative group(9) | 8c + 6c^2 + (6o + c)
005 | -- cs -- -- ag -- -- | 07 | -0.01 | commutative swap(9) | 8c + (6o + c) + 6c^2
006 | -- cs -- df ag -- -- | 06 | -0.01 | commutative swap(7) | 8c + (c + 6o) + 6c^2
007 | ca cs dm -- ag -- -- | 05 | 0.01 | distributive factoring(3) | (8 + 1) * c + 6o + 6c^2
003 | -- cs -- -- ag -- -- | 04 | 1.1 | constant arithmetic(1) | 9c + 6o + 6c^2
[10]✅ Reward: 1.04
Average test reward: 1.14
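As a closing sanity check on the "tiny model" claim above, this sketch counts the parameters of a freshly constructed ActorCritic with the same state and action sizes as our environment:
# Rough model-size estimate (a sketch; reuses env_names and ActorCritic from above).
size_env = gym.make(env_names[0])
probe = ActorCritic(size_env.observation_space.shape[0], size_env.action_space.n)
n_params = sum(p.numel() for p in probe.parameters())
print(f"{n_params:,} parameters (~{n_params * 4 / 1e6:.2f} MB as float32)")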