
Trackmania 2020 AI

2023/09/28

What is this about?

I’ve messed around with LLMs a lot over the past few months and decided to take a break from that. I wanted to learn more about Reinforcement Learning (RL). I’d been thinking about it for a while but didn’t have an interesting game to try it on. Then I saw a video on YouTube about creating an AI for Trackmania using an odd combination of computer vision and machine learning. It inspired me to try my hand at it as well, but I wanted to do it purely with vision, kind of like a human would play. So this post is about my journey to a basic AI driver for Trackmania 2020, where the AI has only the screen and the keyboard as its I/O.

Why RL?

Supervised/Self-supervised learning has seen a lot of success recently and is currently applicable to many more problems than Reinforcement Learning. In the context of actors in a virtual world like a computer game, supervised learning is often referred to as imitation learning: the actor is not trying to be creative, it’s just trying to imitate the training data as closely as it can. So achieving an optimal policy for given inputs is pretty much out of the question unless your training data is strictly optimal, and if you have an unlimited amount of optimal training data, you probably already have a solved problem.

The mindset for Reinforcement Learning is slightly different: we’re learning to do things better than we did when generating the training data. There might be no training data at all in the traditional sense; the agent could be dropped into an environment with the task of figuring everything out on its own. You iterate on the improvement and, hopefully, you end up with an actor that’s as good as a human or better.

You can divide RL algorithms into two main groups based on how they’re trained: off-policy and on-policy. Long story short, off-policy algorithms can be trained on data gathered with any policy, similar to supervised learning. It could be humans, random actions, other AI actors, or anything else. On-policy algorithms have to use the policy of the learning actor itself, which usually means training happens online. I’m sticking to off-policy Q-learning-based algorithms in this post. They’re much easier to iterate on: collect data once, retrain as many times as you need, then fine-tune online if needed.
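
To make the off-policy idea concrete, the key ingredient is a replay buffer that stores transitions regardless of who generated them. Here’s a minimal sketch; the class and its fields are my own naming for illustration, not from my actual code.

import random
from collections import deque

# Minimal replay buffer sketch: transitions can come from any policy
# (human play, random actions, an older agent) and be reused for training.
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)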

How does RL work?

In supervised learning, you usually have pairs of inputs and desired outputs. If you wanted to train an AI actor this way, you’d feed it those pairs. For example: “Here’s a screenshot, and these are the keys a human would press”. You give the ML model the inputs, it tries its best to guess the outputs, and then you calculate the loss from the difference between the guess and the desired outputs. Update the model weights and repeat. This approach doesn’t even try to answer the question “What’s optimal?”.
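
To make the contrast concrete, a single imitation-learning step might look roughly like this in PyTorch. The tiny model, the 9 key combinations, and the random batch are all placeholders of my own, not anything from an actual implementation.

import torch
import torch.nn as nn

# Hypothetical imitation learning step: match the human's key presses.
model = nn.Sequential(nn.Flatten(), nn.Linear(112 * 112, 9))  # 9 key combos, placeholder model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

screenshots = torch.rand(32, 1, 112, 112)    # batch of frames
human_actions = torch.randint(0, 9, (32,))   # keys the human pressed (the "labels")

logits = model(screenshots)
loss = loss_fn(logits, human_actions)        # difference between guess and desired output
optimizer.zero_grad()
loss.backward()
optimizer.step()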

If we want to find the optimal solution with RL, we need a metric to optimize. In RL that metric is usually called the reward. Training is usually divided into episodes, and the agent tries to maximize the cumulative reward over the entire episode. Ideally, in the context of a racing game, that would just be the inverse of the lap time, the negative lap time, or something like that. The problem is that the AI actor wouldn’t get any feedback before completing an entire lap, and what are the chances that it will do so by pressing random keys? That’s why you usually need a denser, more continuous reward function.
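
For illustration, a denser reward could be something like progress along the track at each step. The exact shape below is my own assumption for the sake of the example, not taken from any particular implementation.

# Hypothetical dense reward: progress along the track since the last step,
# plus a bonus for actually reaching the finish line.
def shaped_reward(progress_now, progress_prev, finished):
    reward = progress_now - progress_prev   # meters gained this step
    if finished:
        reward += 100.0                     # bonus for finishing the lap
    return reward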

Anyway, the core idea is that the AI actor is given a state and tries to learn a policy that maximizes the cumulative reward. For example, the AI gets a screenshot with a curve ahead, so it knows the optimal action is to brake and then turn, because other actions would lead to lower rewards. I started with the vanilla Deep Q-Network (DQN) algorithm from the paper “Playing Atari with Deep Reinforcement Learning” and then kept adding extensions from other research papers.

Deep Q-Learning and DQN

So we’ve determined the goal of our agent: to always choose the action that leads to the maximal cumulative reward. How would an AI agent learn that? We’ve also established that the AI actor should receive the image from the game as input and output the keyboard keys to press. We don’t have any labeled data, we just have the current state as input. More formally, the Markov property holds for our setup, so we can model the problem as a Markov decision process (MDP). In short, the decision only depends on the current state, it doesn’t matter how we got there, and based on that state we need to learn to pick the action that gets us the maximal cumulative reward. Solving the MDP exactly would mean evaluating the value of every state transition (state-action pair) and choosing the action that leads to the highest cumulative reward, which is rarely feasible. In Trackmania, for example, there’s practically an infinite number of states, or at least more than we could ever evaluate in our lifetime.

That’s where Q-Learning can help us out. Instead of evaluating all state-action pairs, we explore the space of state-action pairs and learn a function that estimates the value of each action in a given state. We predict the value of taking each action from the current state, pick an action, see what reward we actually get, and correct our estimate. The function that estimates the cumulative reward for each action in a given state is the Q-function.

\( \Large Q(s, a)=r(s, a) + \gamma \max \limits_{a'} Q(s',a') \)

In the formula, s is the current state, a is the action, \( \gamma \) is the discount factor, and s' is the next state. r(s, a) is the reward for the state transition, the only thing we actually observe, and thus the feedback that iteratively improves the learned estimate. So the formula tells us to pick the action that leads to the highest cumulative reward. The discount factor is usually something like 0.99 and keeps the Q function finite. Imagine your Trackmania AI driving in a loop forever: it could accumulate infinite rewards. The Q function depends on the Q function for the next state, which depends on the Q function for the state after that, you get the idea. With a discount factor below one, the value converges in the limit even if you do infinite laps. Realistically we won’t get accurate estimates very far into the future anyway.
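
To see why the discount keeps things finite: with a constant reward \( r \) per step, the discounted sum is a geometric series, \( \sum_{t=0}^{\infty} \gamma^t r = \frac{r}{1-\gamma} \), so with \( \gamma = 0.99 \) the cumulative reward is bounded by \( 100r \) even on an endless loop.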

Deep Q-Learning estimates the Q function with a deep neural network. Who saw that coming? Now that we’re talking about neural networks, we gotta talk about loss functions. We know the input already, the state, so what’s the loss function? From the paper linked earlier:

\( \Large L_i(\theta_i)=\mathbb{E}_{s,a\sim \rho(\cdot)}[(y_i - Q(s,a;\theta_i))^2] \)

\( \Large y_i=\mathbb{E}_{s'\sim \epsilon} [r + \gamma \max _{a'} Q(s',a';\theta_{i-1})|s,a] \)

The first equation is just the mean squared error loss (MSELoss in PyTorch). I use SmoothL1Loss in PyTorch instead; some people use HuberLoss. It depends on the application, but I wanted an L1-style loss because squared loss is a bit too sensitive to outliers.
The second equation is the target: the recursive Q function we just looked at, built from the observed reward and the estimate for the next state.
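
Put together, one training step might compute the target and the loss roughly like this in PyTorch. The function signature and variable names are my assumptions, not a copy of my actual training code.

import torch
import torch.nn.functional as F

# Rough sketch of one DQN loss computation with a target network.
# q_net and target_net are assumed to map states to per-action Q-values.
def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q-value of the action that was actually taken in each state.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the (outdated) target network.
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    # SmoothL1Loss is less sensitive to outliers than plain MSE.
    return F.smooth_l1_loss(q_values, targets)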

Coming from supervised learning this seems strange: the estimate and the target both depend on the Q function that we’re trying to learn. If you want to truly understand what’s going on, look into the Bellman equation and perhaps work through the Q-Learning algorithm on a small grid. The short explanation is that it’s an iterative temporal-difference learning algorithm: it converges toward a solution step by step instead of relying on ground-truth labels like supervised learning. I won’t try to explain it further because I’m not sure I’d do a great job, it’s a bit too mathy for me.
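
If you do want to build that intuition, the tabular version on a tiny grid is only a few lines. This is a generic sketch of the textbook update rule, nothing Trackmania-specific:

import numpy as np

# Textbook tabular Q-Learning update on a tiny grid world (sketch).
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    # Nudge the current estimate toward the bootstrapped target.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])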

Not only is this hard to wrap your head around, it’s not smooth sailing for the computer either. DQN training is very unstable, and probably even looking at it funny will make it not converge towards an optimal solution. Lots of extensions and additional techniques have been developed to improve sample efficiency and make training more stable, so many in fact that there is an entire paper about combining them: “Rainbow: Combining Improvements in Deep Reinforcement Learning”. So far I’ve only used the DDQN and Dueling DQN extensions. It’s so easy to get the most minor detail wrong, and on top of that, you have no idea whether everything around the algorithm itself is correct or how fast it should converge.
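
Of those extensions, DDQN is the smallest change: the online network picks the next action and the target network evaluates it. A rough sketch of just the target computation, with names of my own choosing:

import torch

# Double DQN target (sketch): select the best next action with the online
# network, but evaluate its value with the outdated target network.
def ddqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1.0 - dones)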

Finally, there’s the structure of the NN model. I used PyTorch, like almost everyone creating ML models these days. I started with the vanilla DQN like I mentioned and then added the Dueling DQN extension to the model. The structure was very similar to the one in the Playing Atari paper I linked earlier: four convolutional layers followed by four fully connected layers in each head. In Dueling DQN the network has two heads: one estimates the state value (the cumulative reward when always choosing the best action), the other estimates the advantage of each action, with the best action having zero advantage and the others negative. For stability you also keep a second, somewhat outdated copy of the model to calculate the targets for the loss; training has been found to be much more stable this way.

# conv base
self.conv1 = nn.Conv2d(frame_stack, 32, kernel_size=7, stride=2, padding=3)
self.bn1 = nn.BatchNorm2d(32)
self.conv2 = nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2)
self.bn2 = nn.BatchNorm2d(64)
self.conv3 = nn.Conv2d(64, 64, kernel_size=5, stride=2, padding=2)
self.bn3 = nn.BatchNorm2d(64)
self.conv4 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
self.bn4 = nn.BatchNorm2d(64)

# state value branch
self.v_fc1 = nn.Linear(state_dim + final_conv_size, num_hidden)
self.v_bn1 = nn.BatchNorm1d(num_hidden)
self.v_fc2 = nn.Linear(num_hidden, num_hidden)
self.v_bn2 = nn.BatchNorm1d(num_hidden)
self.v_fc3 = nn.Linear(num_hidden, num_hidden)
self.v_bn3 = nn.BatchNorm1d(num_hidden)
self.v_fc4 = nn.Linear(num_hidden, 1)

# advantage branch
self.a_fc1 = nn.Linear(state_dim + final_conv_size, num_hidden)
self.a_bn1 = nn.BatchNorm1d(num_hidden)
self.a_fc2 = nn.Linear(num_hidden, num_hidden)
self.a_bn2 = nn.BatchNorm1d(num_hidden)
self.a_fc3 = nn.Linear(num_hidden, num_hidden)
self.a_bn3 = nn.BatchNorm1d(num_hidden)
self.a_fc4 = nn.Linear(num_hidden, num_actions)

I know torch.nn.Sequential exists, but I didn’t use it.
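
For completeness, the two heads get combined into Q-values in the forward pass roughly like this. It’s a sketch of the general dueling recipe with max-normalization (so the best action ends up with zero advantage), not my exact forward() code:

# Sketch of combining the dueling heads into Q-values (max-normalized,
# so the best action ends up with zero advantage and the rest negative).
def combine_dueling(value, advantage):
    # value: (batch, 1), advantage: (batch, num_actions)
    return value + advantage - advantage.max(dim=1, keepdim=True).values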

The results

The code is here. The AI runs at only 10 FPS and sees 4 stacked 112x112 screenshots of the game screen. My workflow was to collect data into a Python pickle and then figure out the hyperparameters and implementation details by reusing that data, a huge benefit of an off-policy algorithm.
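
The collection side of that workflow is nothing fancy; something along these lines, with the file name and transition format purely illustrative:

import pickle

# Illustrative sketch of the offline workflow: dump gathered transitions
# once, then reload them for as many training runs as needed.
def save_transitions(transitions, path="replay_data.pkl"):
    with open(path, "wb") as f:
        pickle.dump(transitions, f)

def load_transitions(path="replay_data.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)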

So, how did the implementation actually go?

Day 1 was like this. This was just the vanilla DQN, using only the current frame. At least it found the finish line.

Day 2 brought some improvements, but it was still pretty dumb even on the simplest track.

That’s where I got stuck for a while. I tried lots of different things and nothing seemed to improve it. Whenever I had a new idea I would try it out, but nothing made a significant difference.
The problem with Trackmania 2020 is that, unlike Trackmania Nations Forever, there aren’t good tools around it for setting up automated AI training. To make things worse, I’m running it on Linux with Proton and it used to crash a lot. I wanted to use gamescope to at least sandbox the game so I could use my computer while it was collecting data, but I couldn’t get it to work with 2 Nvidia GPUs. The only other alternative I know of would’ve been to literally hack the game and hook code into the game engine, but that’s just hard to do. Long story short, it was really tedious and time-consuming.

I was still able to improve it. About two weeks later, this is the end result for now. The track is much longer and more complex; the first versions definitely would not have completed it. It has only been trained for about 3 hours, and I’d guess it might converge at around 30 hours, to give an idea of how much further this could be pushed. It would be fun to see, but I really want to use my computer.

Here’s a different sample of what the AI sees. It also sees the 3 frames before the current one, but you get the idea.

Edit:
Some further improvements, because not training it further really bothered me. I added a penalty for crashing.

Lessons learned

Overall I learned a lot about reinforcement learning. For some reason I had no idea that off-policy RL algorithms existed; I always assumed RL had to be trained from scratch in an online setting. This means RL algorithms could potentially be used in situations similar to those where supervised learning is used.
I also didn’t know it was possible to train a fully vision-based system in just a couple of hours on a mid-range consumer GPU. GPT-style LLMs take a significant part of a data center and months to train, so I assumed I’d need at least tens of hours of data before getting any results. With the data pre-recorded, it would only take about 30 minutes to get the results I’ve shown, and that’s on a single GPU.
Overall, RL is probably still a bit too finicky to see much real-world use, but it certainly has the potential to solve problems that supervised or self-supervised algorithms won’t be able to solve efficiently.