In this project we investigate DQN performance in multi-level games, with Qbert as a test case. We show that Qbert is a high-risk game in which the ε-greedy exploration step imposes a glass ceiling on the score. We then present a specific solution for Qbert and propose a potential solution to the general problem using multiple agents with automatic clustering.

 

The project

In a nutshell

Background

Reinforcement learning is a framework in which an agent interacts with an environment by taking actions that change the environment's state and yield a reward. At time t the agent observes state s(t), reacts according to its policy π with an action a(t) = π(s(t)), and receives a reward r(t). The goal of the agent is to maximize the discounted cumulative reward; this is known as the infinite-horizon problem.
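In standard notation, the objective from time t is the discounted return, with discount factor 0 ≤ γ < 1:

G_t \;=\; \sum_{k=0}^{\infty} \gamma^{\,k}\, r(t+k)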

Deep Q-Network (DQN) is an algorithm that approximates the optimal Q function with a Convolutional Neural Network (CNN) by minimizing the expected Temporal Difference (TD) error of the optimal Bellman equation.
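Written out in standard DQN notation, with online parameters θ and a periodically copied target network θ⁻, the loss being minimized is:

L(\theta) \;=\; \mathbb{E}_{(s,a,r,s')}\!\left[\Big(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\Big)^{2}\right]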

The tuples (s(t), a(t), r(t), s(t+1)) are collected from the agent's experience and stored in the Experience Replay (ER) buffer, so the network is trained on mini-batches sampled from stored past experience rather than only on the most recent transition.
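As an illustration (a minimal sketch, not the project's code), such a buffer only needs to store transitions and sample uniform mini-batches:

import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store (s, a, r, s_next, done) tuples, sample mini-batches uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)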

Things we learned

Starting this project with very little knowledge about machine learning, we got to know the world of machine learning research, and reinforcement learning in particular. Through practical experiments and by modifying the well-known DQN algorithm, we learned about the different ingredients of reinforcement learning and about related fields such as clustering methods. In addition, this project gave us a glimpse of the research process.


Qbert

Introduction

DQN is a reinforcement learning algorithm known to achieve state-of-the-art results on many Atari games, as published by DeepMind. Although the results often surpass human experts, it was observed that in games with multiple levels, like Qbert, where the main difference between levels is how the screen is displayed (e.g. its colors), learning is rather fast at first but reaches saturation after a few levels. This happens even though the agent successfully learns to play the first few levels, and has thus virtually "solved" the game and only needs to transfer that knowledge to the higher levels.

In this project, the Atari game Qbert was used as a test case for multi-level games, and most of the work was done on Qbert.


Qbert is an Atari game in which the player needs to "touch" all 21 bricks of a pyramid. Along the way, the player can lose lives by encountering a monster or falling off the edge of the pyramid. The player has 4 lives, and once all the bricks have been "touched" it moves on to the next level, which has more complicated monsters.

Risk in Qbert: We show, both by probability analysis and by simulations, that the ε-greedy policy in Qbert leads to a score limit. Since the game has many risky states in which a single action can cost a life, the ε-greedy policy limits even an agent with an optimal policy.
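For reference, the ε-greedy rule itself is simple (a generic sketch in Python, not the project's code): with probability ε the agent ignores its learned Q-values and acts uniformly at random, which in Qbert can mean jumping off the pyramid.

import random

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon, otherwise the greedy (argmax-Q) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # exploration: may be a fatal move
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation: greedy move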

Probability analysis: The analysis assumes a uniform distribution over the player's position and no monsters, which makes it conservative. The player has 6 possible actions. Of the pyramid's 21 bricks, 8 have no fatal action, 6 have 1 fatal action, 5 have 2 fatal actions and 2 have 3 fatal actions; we call these action counts fatal0, fatal1, fatal2 and fatal3, with edge0, edge1, edge2 and edge3 as the corresponding groups of bricks. Taking the parameters from the DQN paper, ε is annealed from 1 to 0.1 during the learning phase and set to 0.05 for evaluation. With ε = 0.05, we have the following probability of death due to a random action:
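Spelling the calculation out from the numbers above (a random action occurs with probability ε and, under the uniform-position assumption, is fatal with probability equal to the total number of fatal actions divided by 21 bricks times 6 actions):

P(\text{death per step}) \;=\; \varepsilon \cdot \frac{8\cdot0 + 6\cdot1 + 5\cdot2 + 2\cdot3}{21\cdot6} \;=\; 0.05 \cdot \frac{22}{126} \;\approx\; 0.05 \cdot 0.175 \;\approx\; 0.0087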


In practice, this implies a death due to a random action roughly every 114 steps when playing with ε = 0.05. This project is not the first to use a lower value of ε: the well-known Double DQN, presented in the Double-DQN paper and achieving high scores, anneals ε down to 0.01 during learning and uses 0.001 for evaluation.

This video shows an example of a terminal state caused by a random action:


To show the effect of the ε-greedy policy, we trained multiple agents; each agent learns and plays a single level of Qbert and stops learning once it passes the level without losing lives. Evaluations were run both with ε = 0 and with ε = 0.05. Between these two values, the score gap in evaluation reaches as high as 160% in favor of ε = 0.

The following video shows a single game of 2 agents playing Qbert with the exact same network. The only difference between the agents is the value of ε, which illustrates the effect of ε.


Suggested method

Our solution suggests using multiple agents together with the KMeans algorithm, enhanced by spatio-temporal clustering. Using spatio-temporal clustering increases the accuracy, as reported in other papers.

Each cluster is learned and played by a different agent. This way, each cluster/level is learned individually, and the exploration rate can also be controlled individually according to the agent's performance. Thus, the exploration-exploitation dilemma is addressed by exploiting more when we are confident of achieving good results and exploring more when we are unsure of the results.

The algorithm, which is a modification of the original DQN algorithm, is described below. It is meant to take the concept of experiment 3 and generalize it.
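As a rough illustration only (this is not the project's actual algorithm; the clustering, environment and learning step are replaced by toy stand-ins), the main loop routes every state to the agent that owns its cluster and adapts each agent's ε separately:

import random
import numpy as np

N_ACTIONS = 6     # Qbert's action set
N_CLUSTERS = 4    # illustrative number of state clusters / agents

class ClusterAgent:
    """Stand-in for a per-cluster DQN agent with its own exploration rate."""
    def __init__(self, n_actions, epsilon=1.0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.q = np.zeros(n_actions)   # placeholder for this agent's Q-network

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)   # explore
        return int(np.argmax(self.q))                 # exploit

    def learn(self, action, reward):
        # placeholder update; a real agent would sample its replay buffer and do a DQN step
        self.q[action] += 0.1 * (reward - self.q[action])

def assign_cluster(state):
    # stand-in for the spatio-temporal KMeans assignment of a state to a cluster
    return hash(bytes(state)) % N_CLUSTERS

agents = [ClusterAgent(N_ACTIONS) for _ in range(N_CLUSTERS)]

# toy interaction loop: each state is routed to the agent that owns its cluster,
# and exploration is adapted per cluster according to that agent's own performance
for step in range(1000):
    state = np.random.randint(0, 256, size=8, dtype=np.uint8)   # fake observation
    k = assign_cluster(state)
    action = agents[k].act()
    reward = random.random()                                    # fake reward
    agents[k].learn(action, reward)
    agents[k].epsilon = 0.0 if agents[k].q.max() > 0.9 else 0.05

In the real method, assign_cluster would be the spatio-temporal KMeans model trained on state representations, and each ClusterAgent would be a full DQN with its own network and experience replay.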


The previous algorithm (vanilla DQN) has only a single agent, so the exploration rate (epsilon) is global and shared by every possible state at a given time (step). The addition of multiple agents allows us to control the exploration-exploitation ratio differently for different clusters of states. For example, if we have played and learned a certain state many times, then because of DQN convergence the probability that we are choosing the optimal action is high, and we would therefore like to exploit rather than explore (lower epsilon) in that specific state. Assuming that similar states are also close in time and visited similarly often, clustering the state space into clusters of similar states lets us control the exploration-exploitation ratio per cluster.
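One simple way to realize such per-cluster control (an illustrative rule, not necessarily the one used in the project) is to decay each cluster's ε with the number of visits to that cluster:

def cluster_epsilon(visit_count, eps_start=1.0, eps_min=0.05, decay=0.999):
    """More visits to a cluster -> more confidence in its Q-values -> less exploration."""
    return max(eps_min, eps_start * decay ** visit_count)

# a rarely visited cluster keeps exploring, a heavily visited one mostly exploits
print(cluster_epsilon(100))    # ~0.905
print(cluster_epsilon(5000))   # 0.05 (the floor)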

Also, dividing the state space among several agents helps prevent deterioration on early states, since it removes the dependency between states that are similar in representation but actually far apart in time, making it easier to distinguish between them.

This block diagram explains the flow of the method:


Experiments

 

Multiple agents

In this experiment we created several agents (7 to be exact) and made sure that every agent learns and plays a different level. The objective of this simulation was to eliminate the dependency between states, so that there is no confusion between levels while learning.

We set up the experiment by generating 7 agents. For each new state, we computed the current level and sent the state to the corresponding agent. Evaluations were run with both ε = 0 and ε = 0.05 to allow a comparison.
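As a sketch of this routing (illustrative code; get_level stands in for the project's actual level detection, which is not shown here):

N_LEVELS = 7

# one learner per level; plain dicts stand in for full DQN agents here
agents = [{"level": i, "transitions": []} for i in range(N_LEVELS)]

def get_level(state):
    # stand-in for the real level detection (e.g. from the score or the pyramid's colors)
    return state["level"]

def dispatch(state, transition):
    """Store each transition with the agent responsible for the state's level."""
    level = min(get_level(state), N_LEVELS - 1)   # levels beyond the 7th go to the last agent
    agents[level]["transitions"].append(transition)

# usage: a transition observed on level 2 is handled by agent #2
dispatch({"level": 2}, ("s", "a", 25, "s_next"))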

As shown below, the results were poor, so we concluded that the main problem was reaching a terminal state due to random actions.

As we showed, the probability that a random move in Qbert is fatal is about 0.175, so with ε = 0.05 the agent makes a fatal action, solely due to ε, roughly every 115 moves on average.

Simulation results:


Multiple agents with dynamic epsilon control

Similar to the previous experiment, here we also used a different agent for every level. The difference was that once an agent passes its level without reaching a terminal state, its network is immediately frozen and its epsilon is lowered to zero (instead of the usual 0.05), both in evaluation mode and while learning. We also split the agent each time a new level was reached, meaning that each new agent was initialized with the previous agent's weights. We compared the results with those of the multiple agents experiment to get a reliable reference.
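A minimal sketch of this rule (illustrative code, not the project's implementation): once an agent clears its level without losing a life it is frozen and stops exploring, and each new level's agent starts from a copy of the previous agent's weights.

import copy

class LevelAgent:
    """Stand-in for a per-level DQN agent; only the fields relevant to the rule are shown."""
    def __init__(self, weights, epsilon=0.05):
        self.weights = weights     # the agent's network parameters
        self.epsilon = epsilon
        self.frozen = False        # frozen agents no longer perform learning updates

def on_level_cleared(agent, lives_lost):
    # passed the level without losing a life: stop learning and stop exploring
    if lives_lost == 0:
        agent.frozen = True
        agent.epsilon = 0.0

def on_new_level(prev_agent):
    # "split": the next level's agent is initialized with the previous agent's weights
    return LevelAgent(copy.deepcopy(prev_agent.weights), epsilon=0.05)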

As shown below, the results of this simulation are much higher than the others, which indicates that in certain scenarios (Qbert among them) it is much better to keep reducing the exploration rate, but this must be done with respect to the "experience" already gained.

Simulation results:


Using our suggested method

In this experiment, we implemented the method we suggested, which includes the spatio-temporal KMeans clustering and the epsilon manipulation. The preliminary results are poor, but it is possible that in future research, with parameter tuning and slight changes to the algorithm, this method will score better than the multiple agents with dynamic epsilon control experiment.

Simulation results:


Part of the proof of concept for this experiment was to cluster activations collected from a vanilla DQN simulation. The following image shows the results of this clustering on a t-SNE map, where it is visible that the accuracy is high. We observed similarly high accuracy, around 90%, in this automatic clustering experiment as well.
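As an illustration of this proof of concept (a sketch assuming the activations were already collected into an array; the file name below is hypothetical), the clustering and the t-SNE projection can be produced with scikit-learn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# one row per visited state, e.g. the last hidden layer of the DQN (hypothetical file name)
activations = np.load("dqn_activations.npy")

# cluster the states; the number of clusters is a tunable parameter
labels = KMeans(n_clusters=7, n_init=10).fit_predict(activations)

# 2-D projection for visualization only; t-SNE is not used for the clustering itself
embedding = TSNE(n_components=2).fit_transform(activations)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=3, cmap="tab10")
plt.title("KMeans clusters of DQN activations (t-SNE projection)")
plt.show()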



Relevant links