Date: May 4, 2020
Introduction to Reinforcement Learning and Q-Learning with Flappy Bird
Reinforcement learning is an exciting branch of artificial intelligence that trains algorithms using a system of rewards and punishments. It's the kind of algorithm you would use to create a bot that learns to play video games through trial and error.
In this article, I will try to demystify reinforcement learning by showing how one goes about building a Flappy Bird AI using Q-Learning.
Part 1. Understanding Agent-Environment interactions
Before we actually dive into the algorithm that drives the bot's "intelligence", we first need to understand how the AI, also known as the agent, interacts with its environment.
Most reinforcement learning games are built on top of the Markov Chain model, and it's really important to understand this concept before we move on to reinforcement learning, because it serves as the foundation of the entire AI algorithm.
A Markov Chain.
In essence, a Markov chain describes the continuous relationship between an agent and its environment. The agent (i.e. the AI) takes in a state and a reward from the environment, and based on these two variables, the agent chooses the optimal action. Then, the environment returns a new state and a new reward to the agent based on the action chosen. This process loops continuously, and this is exactly how the agent and the environment interact with each other. It might not sound intuitive at first, but let's see how this applies in the context of Flappy Bird to get a better understanding.
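To make this loop concrete, here is a minimal Python sketch of the interaction cycle. The FlappyBirdEnv and Agent classes below are toy stand-ins used only to illustrate the shape of the loop, not the actual classes from this article's implementation:

import random

class FlappyBirdEnv:
    """Toy stand-in for the real environment."""
    def reset(self):
        return (5, 7)                                # an initial state: (horizontal, vertical) distance
    def step(self, action):
        new_state = (random.randint(-2, 14), random.randint(0, 14))
        done = random.random() < 0.05                # pretend the bird occasionally crashes
        reward = -1000 if done else 15               # the reward values used later in this article
        return new_state, reward, done

class Agent:
    """Toy agent that acts randomly."""
    def choose_action(self, state, reward):
        # A real agent would use the state (and the reward it just received)
        # to pick the best action; this toy version just picks at random.
        return random.choice([0, 1])                 # 0 = don't flap, 1 = flap

env, agent = FlappyBirdEnv(), Agent()
state, reward, done = env.reset(), 0, False
while not done:
    action = agent.choose_action(state, reward)      # the agent acts on the state and reward
    state, reward, done = env.step(action)           # the environment returns a new state and reward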
Flappy Bird Environment built using Markov Chains
Before we go ahead and build a Flappy Bird AI, we need to build our own Flappy Bird Environment. The code used in this article was built with the PyGame library. Let's see how we can build a Markov Chain environment.
States and Observations
We start by initializing our Flappy Bird environment (practically speaking, the environment will be a Python class) with an initial state (S_t), composed of different observations.
In our custom Flappy Bird environment, we defined two observations per state: the bird's horizontal and vertical distances to the lower pipe. This state, composed of the two observations, provides the agent with enough information to learn and optimize its actions later on. Equipped with the state, the agent will choose one of two possible actions: to flap or not to flap. The chosen action gets fed back to the environment, which will give either a reward or a penalty to the agent, as well as a new state composed of new observations.
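As a rough sketch, computing those two observations might look something like this. The parameter names and the grid size used to scale pixel distances down to small integers are assumptions for illustration, not the exact code from the article's environment:

def get_state(bird_x, bird_y, pipe_x, pipe_bottom_y, grid=40):
    # Horizontal and vertical distance from the bird to the lower pipe, scaled
    # down and rounded to integers so the number of possible states stays small.
    horizontal_distance = round((pipe_x - bird_x) / grid)
    vertical_distance = round((pipe_bottom_y - bird_y) / grid)
    return (horizontal_distance, vertical_distance)

# Example: bird at (50, 300), lower pipe at x=130 with its opening at y=420
state = get_state(bird_x=50, bird_y=300, pipe_x=130, pipe_bottom_y=420)   # -> (2, 3)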
Take another look at the Markov Chain diagram above in order to better internalize the logic that I've just explained.
Rewards
The key to making a "smart" agent lies in the rewards. Since the goal of Flappy Bird is to obtain the highest score possible, we adjust our rewards to encourage the AI to stay alive as long as possible.
In our case, for each good action that the bot chooses (i.e. it stays alive), the agent gets rewarded with 15 points, but if it chooses a bad action (i.e. hitting the green pipe and ending the game), it gets a negative reward, a penalty of -1000 points. Note that these are constants we defined ourselves; we can change them to any values we want, but 15 and -1000 gave back solid results after experimenting with different values. The agent chooses a new action for every new frame that we see (every time the game gets refreshed), so a new reward is passed to the agent for every frame.
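In code, the reward logic can be as simple as the sketch below; crashed_into_pipe is a hypothetical flag standing in for the actual PyGame collision check:

REWARD_ALIVE = 15      # reward for surviving another frame
PENALTY_DEAD = -1000   # penalty for hitting a pipe and ending the game

def compute_reward(crashed_into_pipe):
    # Called once per frame, right after the collision check.
    return PENALTY_DEAD if crashed_into_pipe else REWARD_ALIVE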
Hopefully, this first part was clear enough. We are going to jump into Q-Learning, which is the algorithm that the agent uses to learn and improve at the game!
Part 2. Q-Learning
There are quite a few complex algorithms out there for reinforcement learning, but I'll be explaining basic Q-Learning, a technique that is aimed at choosing the best action for a given state.
There is a formal notation behind it, but it won't mean much to most people just yet (we'll get to the actual update formula in Step 2 below).
Essentially, the goal of the Q-Learning algorithm is to maximize the value function Q, where Q stands for the "quality" of an action taken in a given state. The quality of an action in a given state is determined by the rewards used to provide the reinforcement. In the end, for any finite Markov decision process (FMDP), Q-Learning can find an optimal set of actions, starting from an initial state, that maximizes the expected value of the total reward over all successive steps. In other words, Q-Learning is able to look into the future in order to maximize its total reward rather than just the immediate reward. This is what makes Q-Learning so powerful.
Let's break down how this applies practically to building our Flappy Bird AI.
Step 1. Build a Q-table
Firstly, we need to build a Q-table, which simply consists of all possible Q-values. Let's see what Flappy Bird's q-table looks like.
On the left, each row represents a possible state in the game. Remember that in our Flappy Bird environment, we have 2 observations per state. The total number of states in our q-table is simply the total number of possible observation combinations. For this environment, the horizontal distances vary between -2 and 14, while the vertical distances vary between 0 and 14. Since the values are rounded to integers, there are 17 x 15 possible states, which is 255.
The columns represent all possible actions in game. In our case, there are two possible actions, to flap or not to flap.
The end result is a Q-table of dimensions 255 by 2, which, relatively speaking, is pretty small compared to the Q-tables of other games.
We start by initializing our Q-table with uniform random values. Our hope is to fully populate this q-table with accurate values so that we know the "quality" of every action given a state, ultimately allowing the agent to make the optimal decisions.
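A minimal sketch of that initialization with NumPy could look like this. The table shape, the value range, and the index shift for negative horizontal distances are assumptions for illustration (the article only tells us there are 255 states and 2 actions):

import numpy as np

N_HORIZONTAL = 17    # horizontal distance buckets: -2 to 14
N_VERTICAL = 15      # vertical distance buckets: 0 to 14
N_ACTIONS = 2        # 0 = don't flap, 1 = flap

# 17 x 15 = 255 states, each with 2 q-values, filled with uniform random numbers.
q_table = np.random.uniform(low=-5, high=0, size=(N_HORIZONTAL, N_VERTICAL, N_ACTIONS))

# A state is a (horizontal, vertical) tuple; shift the horizontal part so that
# the range -2..14 maps onto array indices 0..16.
def to_index(state):
    horizontal, vertical = state
    return (horizontal + 2, vertical)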
Step 2. Q-Learning
Now that we have initialized our q-table, we can use our Q-Learning algorithm to update our q-table.
An agent chooses its actions using its q-table. For a given state, the best action is the one with the highest q-value. Since the table was initially filled with random values, the agent's first actions will be random as well.
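Concretely, continuing the NumPy sketch from Step 1, picking an action is just an argmax over the two q-values stored for the current state:

state = (4, 7)                                       # example (horizontal, vertical) observation
action = int(np.argmax(q_table[to_index(state)]))    # 0 = don't flap, 1 = flap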
Then, the q-value gets updated using the following formula, where α is the learning rate and γ is the discount factor:

Q_new(s_t, a_t) = (1 - α) · Q(s_t, a_t) + α · (r_t + γ · max_a' Q(s_{t+1}, a'))

which, in code, is:
new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
The learning rate and discount factor are fixed parameters that do not change during training; they are tuned by trial and error.
The learning rate controls how quickly the model adapts to the problem, and typically lies between 0.0 and 1.0.
The discount factor is a measure of how much we care about future rewards relative to immediate rewards. It is usually fairly high (because we put greater importance on long-term gains) and also lies between 0 and 1.
We know the value of current_q (obtained from the q-table), and we know the reward, which is either 15 or -1000 depending on whether or not the bird stays alive. We can also obtain max_future_q from the q-table, except that this time we look up the state at t+1 and take the maximum Q-value of the two possible actions.
This is why we say that in Q-Learning, we're not just interested in the present. We also want to look into the future.
Finally, we update the current_q of that action given the state with the new_q:
q_table[state + (action,)] = new_q
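Putting the whole update step together, one frame of learning could look roughly like this. The LEARNING_RATE and DISCOUNT values are illustrative, and q_table and to_index come from the sketch in Step 1:

LEARNING_RATE = 0.1    # α, illustrative value
DISCOUNT = 0.95        # γ, illustrative value

def update_q(state, action, reward, new_state):
    # state and new_state are raw (horizontal, vertical) observations
    idx, new_idx = to_index(state), to_index(new_state)
    current_q = q_table[idx + (action,)]
    max_future_q = np.max(q_table[new_idx])    # best q-value available at time t+1
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
    q_table[idx + (action,)] = new_q           # write the updated value back into the table

# Example: the bird flapped (action 1) in state (6, 7), survived, and ended up in state (5, 8)
update_q(state=(6, 7), action=1, reward=15, new_state=(5, 8))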
Step 3. Repeat
The process of Q-Learning and updating the q-table is repeated until the agent has explored all possible states with all possible actions, allowing it to make extremely smart decisions. One thing to note is that whenever the agent dies, the game simply resets, but the q-table doesn't, which is why the agent is able to make smarter and smarter decisions over time.
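Tying everything together, a training loop could be sketched as below. The game resets every episode while the q-table keeps accumulating knowledge; FlappyBirdEnv is the toy stand-in from Part 1, and q_table, to_index and update_q come from the sketches above:

env = FlappyBirdEnv()
EPISODES = 10_000      # illustrative number of games to play

for episode in range(EPISODES):
    state, done = env.reset(), False                       # the game itself resets every episode...
    while not done:
        action = int(np.argmax(q_table[to_index(state)]))  # best known action for this state
        new_state, reward, done = env.step(action)
        update_q(state, action, reward, new_state)         # ...but the q-table is never reset
        state = new_state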
Conclusion
A simple Q-Learning approach performs relatively well in simple games such as Flappy Bird, but in more complex games, this approach won't work well. This is because in Q-Learning, all states must be explored for the AI to make good decisions; if a state has not been seen before, the agent will just make a random decision. Other reinforcement learning algorithms such as Deep Q-Learning are much more powerful because their agents are still able to make smart choices in states they have never seen.
That's it for this article, I hope you enjoyed it!
Steven Gong