Reinforcement Learning

My personal favorite :)

Remember when Google's DeepMind beat the world champion in Go (and later top players in StarCraft)? Yeah, reinforcement learning is how that happened. Instead of learning from pre-labeled data, a reinforcement learning agent explores its environment and figures out for itself what works well and what does not. Sort of like how humans learn! So, what are the essential parts of a reinforcement learning problem?


At a high level, reinforcement learning problems follow a general pattern:

  1. Given a state, the agent will choose an action based on a learning strategy.

  2. The agent will execute the action, and the environment will simulate the results.

  3. The environment emits a new state and reward for executing the action.

  4. The agent learns from this experience, and adjusts its strategy accordingly.

  5. Go back to step 1, until the agent learns a good policy.
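The loop above can be sketched in a few lines of Python. The `env` and `agent` objects and their method names (`reset`, `step`, `choose_action`, `learn`) are hypothetical placeholders for whatever environment and learning strategy you build:

```python
# One episode of the generic RL loop. `env` and `agent` are placeholders
# for an environment and a learning strategy you implement yourself.
def run_episode(env, agent):
    state = env.reset()                      # get an initial state
    done = False
    total_reward = 0
    while not done:
        action = agent.choose_action(state)  # step 1: pick an action
        next_state, reward, done = env.step(action)     # steps 2-3: simulate
        agent.learn(state, action, reward, next_state)  # step 4: update
        state = next_state                   # step 5: repeat
        total_reward += reward
    return total_reward
```

Run this for many episodes, and the agent's total reward should trend upward as its strategy improves.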

After this process, the agent should be able to perform well in this specific environment, having learned how to do things all on its own! Note that this makes reinforcement learning its own paradigm, separate from both supervised and unsupervised learning: there are no labeled examples, only rewards.


Environment

Building an environment is key to any reinforcement learning problem. This is the area where your agent explores and learns how to operate. The key pieces the environment needs to provide are:

  • Action space: all the possible actions an agent can take

  • State space: representations of places the agent can be

  • Reward function: determining how the agent gets feedback (reinforcement) on its actions

  • Simulation: the environment needs to be able to simulate actions

As an example, take Pong:

  • Action space: move left or move right

  • State space: positions of the ball, direction of the ball (left or right), position of opponent

  • Reward function: +1 if you win the point, -1 if you lose the point. (can you design a better reward function? maybe)

  • Simulation: making the Pong game behave as expected.
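One way to package those four requirements is a small environment class. This is only a sketch; the method names here are an assumption, not a standard interface:

```python
# A minimal environment interface covering the four requirements above.
class Environment:
    def actions(self, state):
        # Action space: the actions available from `state`.
        raise NotImplementedError

    def reset(self):
        # State space: return an initial state.
        raise NotImplementedError

    def step(self, state, action):
        # Simulation + reward function: apply `action` in `state` and
        # return (next_state, reward, done).
        raise NotImplementedError
```

A concrete game like Pong would subclass this and implement the ball and paddle physics inside `step`.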

Learning Strategy

This is the algorithm that determines how the agent learns: how it reacts to rewards from the environment, and how it chooses its actions. There are many modern reinforcement learning strategies (PPO, Actor-Critic, TRPO, etc.), but the most intuitive and classic one is called Q-Learning.


For this example, we'll define the environment as a 2D grid where the agent can move up, down, left, or right. On some cells, there will be an item, either candy or poop. If the agent lands on a cell with candy, it gets +1 reward, and if it steps on poop, it gets -1 reward.

Example environment

Let's say the agent is currently in state s. A good representation of the state is probably a 2D list of the board, plus the agent's position. The four actions it can take are A = {up, down, left, right}. To choose an action, the agent uses its own estimates of how much reward each action will give, and takes the action with the highest estimated reward. This is very intuitive, because obviously you'd want to take the action you think is best! So if Q(s, up) = 1, Q(s, down) = 0.5, Q(s, right) = 0, and Q(s, left) = -0.9, we'd choose to go up. Q is the value estimator function: it takes in a state and an action and returns an estimated value. That's why it's called Q-learning!
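Here is that greedy choice as a tiny sketch, using the hypothetical Q values above and a plain dict to hold Q:

```python
ACTIONS = ["up", "down", "left", "right"]

def greedy_action(q, state):
    # Take the action with the highest estimated value Q(state, action).
    return max(ACTIONS, key=lambda a: q[(state, a)])

q = {("s", "up"): 1.0, ("s", "down"): 0.5,
     ("s", "right"): 0.0, ("s", "left"): -0.9}
# greedy_action(q, "s") returns "up"
```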

How do you know the Q function?

Good question. In our current example, an intuitive way to store the function is a 2D table, where entry T[i][j] is the estimated value of being in row i and column j. What does that mean? It means that T[i][j] estimates the reward you can expect from that position, or in other words, the value of the best action you can take from position (i, j). So using the Q values from above, we'd get T[i][j] = 1 (since the best action from this position is to go up).

Sample 2D Tabular Q Function

In larger or continuous state spaces (for example, Pong, where the paddle and ball positions can take many, many more values), the Q function can be estimated using your choice of ML algorithm (deep learning, KNN, logistic regression, etc.).

How is the Q function learned?

Let's say the agent was in state s (position i, j), took action a, and received reward r from the environment. What does that experience tell us? Which cell's value should we update, and by how much? If you're interested in the exact details, you should look at the Bellman equations :)
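If you do look them up, the update rule the Bellman equations lead to (for the tabular case) nudges Q(s, a) toward the reward plus the discounted value of the best next action. The learning rate and discount factor below are common but arbitrary choices:

```python
ALPHA = 0.1   # learning rate: how far to move toward the new target
GAMMA = 0.9   # discount factor (see Discounted Rewards below)

def q_update(q, state, action, reward, next_state, actions):
    # Target: immediate reward plus discounted value of the best next action.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
```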

If you're using an ML algorithm to estimate the value, you can use the experience (s, a, r, s'), where s' is the new state, as a training example and do normal ML training on it.


Policy

A policy is a function from states to actions: it tells the agent which action to take from each state. In Q-learning, the policy is derived from the Q function, and the agent takes the highest-valued action in its current state.
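As a sketch, turning a Q table into a full policy is one line per state (the dict-based Q representation is an assumption carried over from earlier):

```python
def policy_from_q(q, states, actions):
    # Map every state to its highest-valued action under Q
    # (unseen state-action pairs default to a value of 0).
    return {s: max(actions, key=lambda a: q.get((s, a), 0.0)) for s in states}
```

While Q is still mostly zeros, ties are broken by list order, so the resulting policy behaves close to arbitrarily.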

Note that at the beginning when the Q function doesn't have good estimates, the policy will be more like a random policy. The goal is to get to the optimal policy!

Discounted Rewards

There are a few details that I glossed over. For one, typical reinforcement learning has Q estimate the expected discounted reward. For example, getting a candy in one action is preferable to getting a candy in two actions. This is simulated by having a discount factor with which to weight rewards (usually chosen to be 0.9, 0.95, or 0.99). So getting a candy in two actions might give a reward of 0.9, whereas getting it in one action would give 1.
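The discounted return of a reward sequence is just a weighted sum, with a reward t steps away weighted by the discount factor raised to the power t:

```python
def discounted_return(rewards, gamma=0.9):
    # Reward received t steps in the future is weighted by gamma ** t.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Candy immediately vs. candy one step later:
# discounted_return([1])    -> 1.0
# discounted_return([0, 1]) -> 0.9   (the example above, with gamma = 0.9)
```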

Using RL

If you wish to use RL in your TP, I'd recommend doing the following:

  1. Choose a simple environment, and implement it. This could be something like a maze or pong.

  2. Make sure your environment has an interface to simulate moves, and returns a new state and a reward, based on a reward function that you create.

  3. Create an agent that can interact with the environment, and write Q-value functions and policy functions.

  4. Try to solve the environment, and adjust the agent until it works well!
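Putting those four steps together, here is a compact sketch of tabular Q-learning on the candy-and-poop grid from earlier. The grid layout, hyperparameters, and the epsilon-greedy exploration rule are all illustrative assumptions:

```python
import random

# 'C' = candy (+1), 'P' = poop (-1), '.' = empty. An episode ends on an item.
GRID = [".C.",
        "...",
        ".P."]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(state, action):
    # Simulation: move, clamp to the board, return (next_state, reward, done).
    r, c = state
    dr, dc = ACTIONS[action]
    nr = min(max(r + dr, 0), len(GRID) - 1)
    nc = min(max(c + dc, 0), len(GRID[0]) - 1)
    cell = GRID[nr][nc]
    reward = {"C": 1, "P": -1}.get(cell, 0)
    return (nr, nc), reward, cell in "CP"

def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state, done = (1, 1), False        # start in the center
        while not done:
            if rng.random() < EPSILON:     # explore occasionally...
                action = rng.choice(list(ACTIONS))
            else:                          # ...otherwise act greedily
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q.get((nxt, a), 0.0)
                                             for a in ACTIONS)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
            state = nxt
    return q

q = train()
# From the center, going up (toward the candy) should end up looking best.
```

The occasional random action matters: without it, the agent can get stuck repeating the first action that ever looked decent.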

If you want to jump right in and not have to implement an entire environment, you may consider using OpenAI Gym. It has several environments to work with, and in particular, the FrozenLake environment is a good starting point.

If you're interested in learning more about RL, I'd recommend reading about AlphaGo and AlphaGo Zero by DeepMind, and also checking out OpenAI's blog posts.