Deep Q-Learning Series (DQN)

The DQN series of reinforcement learning algorithms learns with Deep Q-Networks. The family includes Deep Q-learning (DQN), Double Deep Q-learning (Double DQN), Dueling Deep Q-learning (Dueling DQN), Prioritized Double DQN, Noisy DQN, Bootstrapped DQN and so on, each of which improves on the naive DQN from a different perspective. We will highlight the first three algorithms in this post.

1. Value-based Approximation Methods

      This post will focus on value-based approximation methods for learning a policy, which differ from tabular methods such as Monte Carlo and temporal-difference learning.
      Reasons for replacing tabular methods with value function approximation (VFA): the memory needed for large tables; the time and data needed to fill the tables accurately; states encountered may never have been seen before; VFA is also applicable to partially observable problems, in which the full state is not available to the agent. (Thus we need to generalize to unseen states.)
      Core idea: value approximation methods generalize better than tabular methods.
      In tabular methods, the value function is represented by a look-up table, where each state has a corresponding entry, V(s), or each state-action pair has an entry, Q(s,a). However, this approach does not generalize well to problems with very large state and/or action spaces, such as backgammon with 10^{20} states, Computer Go with 10^{170} states, or helicopter control with a continuous state space. There are too many states and/or actions to store in memory, and it is too slow to learn the value of each state individually. The solution is to estimate the value function with function approximation, so as to generalize from seen states to unseen states.

    \[\begin{array}{l}\hat v\left( {s,{\bf{w}}} \right) \approx {v_\pi }(s)\\\hat q\left( {s,a,{\bf{w}}} \right) \approx {q_\pi }(s,a)\\\pi \left( {a|s,{\bf{w}}} \right) \approx \pi \left( {a|s} \right)\end{array}\]

      Here {\bf{w}} is usually referred to as the parameters or weights of our function approximator.
      However, there are different structures for building this approximator, as shown below.
      There are many possible function approximators, e.g. linear combinations of features, neural networks, decision trees, nearest neighbors, Fourier/wavelet bases, etc. However, only differentiable function approximators such as neural networks are considered in this post, since they are the core of Deep Q-learning using Deep Q-Networks (DQN).
      Tips:
      (1) Requirements for function approximation methods to be applied to reinforcement learning problems: learning must be able to occur online, while the agent interacts with its environment or with a model of its environment, so the method needs to learn efficiently from incrementally acquired data; it must also be able to handle nonstationary target functions (target functions that change over time).
      (2) Stochastic Gradient Descent (SGD) is the idea behind most approximate learning. Most algorithms share an update rule similar to the traditional gradient update rule, as below:
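
      For example, the standard stochastic (semi-)gradient update for state-value prediction, written in the notation above, is:

    \[{\bf{w}} \leftarrow {\bf{w}} + \alpha \left[ {{v_\pi }(s) - \hat v\left( {s,{\bf{w}}} \right)} \right]{\nabla _{\bf{w}}}\hat v\left( {s,{\bf{w}}} \right)\]

      In practice the unknown target {v_\pi}(s) is replaced by an estimate, such as the Monte Carlo return or the TD target.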

2. Deep Learning Approximation

      The difficulty of handcrafting feature sets makes linear value-function approximation impractical for many problems, which has led to the use of non-linear function approximators, especially neural networks and deep learning approaches, in recent years.
      The DQN algorithm extends Q-learning from the tabular case to the large-scale case, using a neural network to approximate Q values, as shown in the figure below.
      Various algorithms based on the deep learning framework, and their relative performance, are shown below:
      Okay, so now let's get to today's main topic, DQN. We will start with a brief overview of Q-learning and then extend it to Deep Q-learning.

3. Q-learning Recap

      In the traditional supervised learning scheme, we have input data and corresponding labels to feed to various parameterized models, and we use an optimization algorithm to minimize the difference (error) between the true labels and the predicted labels in order to learn the model parameters; this is essentially a "learn by error" process. In the reinforcement learning scheme, however, the agent interacts with the environment, and the data generated are actions, states and rewards. With no explicit input data and labels, how can we use the well-developed machinery of supervised learning to solve the reinforcement learning problem, whose goal is to find the best policy?
      The first issue is how to evaluate an agent. We use the state-value function to evaluate how good or bad it is for the agent to be in a specific state s, denoted V(s), which is the total expected return the agent gets starting from state s; and the action-value function to evaluate how good or bad it is to take a specific action a in state s, denoted Q(s,a), which is the total expected return after taking action a in state s.
      Trying to use the available methods from supervised learning, we have data but no labels, so how about forming a label ourselves?
      Actually, we can estimate the current V(s) by taking a step forward and seeing what happens (we get a reward r and end up in a new state s'). That means we can estimate the current V(s) by r+V(s'). Since we are not completely sure about what will happen next, we apply a discount factor \gamma to the next estimate, so we use r+\gamma\times V(s') to estimate V(s), i.e. we take r+\gamma\times V(s') as the target label fed to our model for training. Similarly, we can use the next time-step Q value to estimate the current Q value, using r+\gamma\times Q(s',a') to estimate Q(s,a), or even use the next time-step optimal value to estimate the current optimal value, using r+\gamma\times \max_{a'} Q^*(s',a') to estimate Q^*(s,a).
      There are three approaches to solving the above problem:
      (1) Analytic methods, which solve the N\times N system of Bellman equations directly.
      (2) Q-learning, which builds a look-up table to record the Q(s,a) values obtained during each interaction.
      (3) Deep Q-learning, which uses a neural network to approximate the Q(s,a) function.
      Q-learning is an algorithm that estimates Q values during the training iterations; it updates the Q value of each state-action pair in the look-up table using the formula below:

    \[Q\left( {{S_t},{A_t}} \right) \leftarrow Q\left( {{S_t},{A_t}} \right) + \alpha \left[ {{R_{t + 1}} + \gamma {{\max }_a}Q\left( {{S_{t + 1}},a} \right) - Q\left( {{S_t},{A_t}} \right)} \right]\]
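
      As a concrete illustration, below is a minimal tabular Q-learning sketch in Python. It assumes a generic Gym-style environment object env with the classic reset()/step() API and hashable (discrete) states; the hyperparameters are placeholders rather than tuned values.

    import random
    from collections import defaultdict

    def tabular_q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Q-table: maps (state, action) pairs to estimated action values, default 0.
        Q = defaultdict(float)
        n_actions = env.action_space.n
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                # Epsilon-greedy behaviour policy.
                if random.random() < epsilon:
                    action = random.randrange(n_actions)
                else:
                    action = max(range(n_actions), key=lambda a: Q[(state, a)])
                next_state, reward, done, _ = env.step(action)
                # Q-learning update: bootstrap from the best next-state action value.
                best_next = max(Q[(next_state, a)] for a in range(n_actions))
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q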

4. From Q-learning to Deep Q-learning

      Why Deep Q-learning?
      Compared with tabular Q-learning, the amount of memory required to save and update the table increases as the number of states increases, and the amount of time required to explore each state to build the required Q-table becomes unrealistic.
      Challenges:
      There are several problems or challenges in the Q-learning process, namely the non-stationary (unstable) target, highly correlated training data, and maximization bias. We will elaborate on them separately in the following sections.
      Key Points:
      The key process in DQN is essentially a supervised learning process: minimize the error between the target Q values and the estimated Q values, and learn the corresponding weights {\bf{w}}.

4.1 Pseudocode

      The Nature-version pseudocode can be described step by step as follows:
      (1). Preprocess the game screen (state s) and feed it to the DQN, which returns the Q values of all possible actions in that state.
      (2). Select an action using the \epsilon-greedy policy: with probability \epsilon, select a random action a, and with probability 1-\epsilon, select the action that has the maximum Q value, namely,

    \[a = \arg {\max _a}(Q(s,a,w))\]

      (3). After selecting the action a, perform it in state s, reach a new state s' and receive a reward r. The next state s' is the preprocessed image of the next game screen.
      (4). Store this transition in the replay buffer as \left( {s,a,r,s'} \right)
      (5). Sample a random batch of transitions from the replay buffer.
      (6). Compute the loss, i.e. the squared difference between the target Q and the predicted Q:

    \[Loss = {\left( {r + \gamma {{\max }_a}\hat Q\left( {s',a;{w^ - }} \right) - \hat Q\left( {s,a;w} \right)} \right)^2}\]

      (7). Perform gradient descent with respect to the actual (online) network parameters w to minimize the above loss.
      (8). After every k steps, copy the actual network weights w to the target network weight w^-.
      (9). Repeat steps 2-8 for M episodes.
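
      A minimal code sketch of the loop above is given below. This is not the full Nature DQN: the small dense network, the hyperparameters, and the classic Gym-style reset()/step() API (reset returning the state, step returning a 4-tuple) are simplifying assumptions.

    import random
    from collections import deque

    import numpy as np
    import tensorflow as tf

    def build_q_network(state_dim, n_actions):
        # A small fully connected Q-network; the Nature DQN uses a convolutional network instead.
        return tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(state_dim,)),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(n_actions)  # one Q value per action
        ])

    def train_dqn(env, episodes=200, gamma=0.99, epsilon=0.1, batch_size=32, sync_every=500):
        state_dim, n_actions = env.observation_space.shape[0], env.action_space.n
        q_net = build_q_network(state_dim, n_actions)
        target_net = build_q_network(state_dim, n_actions)
        target_net.set_weights(q_net.get_weights())            # initial sync: w^- <- w
        q_net.compile(optimizer='adam', loss='mse')
        buffer, step = deque(maxlen=10000), 0

        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # (2) epsilon-greedy action selection
                if random.random() < epsilon:
                    action = random.randrange(n_actions)
                else:
                    action = int(np.argmax(q_net.predict(state[None], verbose=0)[0]))
                # (3)-(4) act in the environment and store the transition
                next_state, reward, done, _ = env.step(action)
                buffer.append((state, action, reward, next_state, done))
                state, step = next_state, step + 1

                if len(buffer) >= batch_size:
                    # (5)-(6) sample a batch and build targets from the frozen target network
                    batch = random.sample(buffer, batch_size)
                    s, a, r, s2, d = map(np.array, zip(*batch))
                    q_next = target_net.predict(s2, verbose=0).max(axis=1)
                    targets = q_net.predict(s, verbose=0)
                    targets[np.arange(batch_size), a] = r + gamma * q_next * (1 - d)
                    # (7) one gradient-descent step on the squared error
                    q_net.train_on_batch(s, targets)
                # (8) every sync_every steps, copy the online weights into the target network
                if step % sync_every == 0:
                    target_net.set_weights(q_net.get_weights())
        return q_net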

4.2 Tricks

4.2.1 Experience Replay

      The key idea of experience replay is that we train our deep Q network with transitions sampled from the replay buffer instead of training only on the most recent transition.
      A neural network can overfit to correlated experience, but this can be reduced by randomly selecting a batch of experiences from the replay buffer; uniform sampling is typically used. The experience replay buffer behaves like a queue rather than a list: it stores only a fixed number of recent experiences/transitions, and when new information comes in, the oldest is deleted.
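
      Such a buffer can be sketched as a fixed-capacity deque with uniform sampling; this is an illustrative minimal version, not the exact implementation from the paper:

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-capacity buffer: the oldest transitions are dropped as new ones arrive."""
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Uniform random sampling breaks the correlation between consecutive transitions.
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)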

4.2.2 Target Network

      As we saw in the Deep Q-learning discussion above, when we want to calculate the TD error (aka the loss), we calculate the difference between the TD target and the current Q value (the estimate of Q).

    \[\Delta w = \alpha \left[ {\left( {R + \gamma {{\max }_a}\hat Q\left( {s',a,w} \right)} \right) - \hat Q\left( {s,a,w} \right)} \right]{\nabla _w}\hat Q\left( {s,a,w} \right)\]

      But we don’t have any idea of the real TD target. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.

    \[Q\left( {s,a} \right) = r\left( {s,a} \right) + \gamma {\max _a}Q\left( {s',a} \right)\]

      However, the problem is that we are using the same parameters (weights) for estimating both the target and the Q value. As a consequence, there is a big correlation between the TD target and the parameters w we are changing.
      This means that at every step of training our Q values shift, but the target value shifts as well. We are getting closer to our target, but the target is also moving; it is like chasing a moving target. This leads to big oscillations in training.
      There is a good and intuitive example that explains this point, from post ….: it is as if you were a cowboy (the Q estimation) who wants to catch the cow (the Q-target); you must get closer (reduce the error).
      At each time step you try to approach the cow, but the cow also moves at each time step (because you use the same parameters). This leads to a very strange chasing path, i.e. big oscillations in training.
      Instead, we can use the idea of fixed Q-targets introduced by DeepMind:
      (1) Use a separate network with fixed parameters (call them {w^ - }) for estimating the TD target.
      (2) At every \tau steps, we copy the parameters from our DQN network to update the target network.

    \[\Delta w = \alpha \left[ {\left( {R + \gamma {{\max }_a}\hat Q\left( {s',a,{w^ - }} \right)} \right) - \hat Q\left( {s,a,w} \right)} \right]{\nabla _w}\hat Q\left( {s,a,w} \right)\]

          At every \tau steps: {w^ - } \leftarrow w
      The loss function is:

    \[Loss = {\left( {R + \gamma {{\max }_a}\hat Q\left( {s',a;{w^ - }} \right) - \hat Q\left( {s,a;w} \right)} \right)^2}\]

      Notice that the parameters of the target Q are w^- instead of w. The actual (online) Q network is used for predicting Q values and learns the correct weights w using gradient descent. The target network is frozen for several time steps, and then its weights w^- are updated by copying the weights from the actual Q network. Freezing the target network for a while and then updating its weights with the actual Q-network weights stabilizes the training.
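
      In code, the target network is simply a second copy of the model whose weights are overwritten every \tau steps. A small tf.keras sketch (the tiny stand-in network and the \tau value are placeholders):

    import tensorflow as tf

    # A stand-in online network; in practice this is the DQN defined for the task.
    q_net = tf.keras.Sequential([tf.keras.layers.Dense(32, activation='relu', input_shape=(4,)),
                                 tf.keras.layers.Dense(2)])

    # The target network is a structural clone whose weights are only refreshed periodically.
    target_net = tf.keras.models.clone_model(q_net)
    target_net.set_weights(q_net.get_weights())   # initial sync: w^- <- w

    def maybe_sync_target(step, tau=10000):
        # Between syncs the target network stays frozen, which keeps the TD target stable.
        if step % tau == 0:
            target_net.set_weights(q_net.get_weights())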

4.2.3 Other Details

      Clipping Rewards
      Reward assignment varies across games. In some games we assign rewards such as +1, -1 and 0 for a win, a loss and nothing, respectively, while in other games the rewards may be +100 and +50 for two different actions. To keep the scale comparable across games, rewards are clipped to the range [-1, +1].
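
      In code this is a one-line clip of the raw environment reward (a trivial sketch):

    import numpy as np

    def clip_reward(raw_reward):
        # Clip the raw game score so that a single loss scale works across different games.
        return float(np.clip(raw_reward, -1.0, 1.0))
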
      Huber Loss Function
      In reinforcement learning, both the input and the target change constantly during training, which makes the training unstable.
      To minimize this error, we will use the Huber loss. The Huber loss acts like the mean squared error when the error is small, but like the mean absolute error when the error is large – this makes it more robust to outliers when the estimates of Q are very noisy. We calculate this over a batch of transitions, B, sampled from the replay memory:

    \[L = \frac{1}{{\left| B \right|}}\sum\limits_{\left( {s,a,s',r} \right) \in B} {L\left( \delta  \right)} ,\quad {\rm{where}}\;L\left( \delta  \right) = \left\{ {\begin{array}{*{20}{l}}{\frac{1}{2}{\delta ^2}}&{{\rm{for}}\;\left| \delta  \right| \le 1}\\{\left| \delta  \right| - \frac{1}{2}}&{{\rm{otherwise}}}\end{array}} \right.\]

      The Huber loss function graph is as follow:
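
      A direct NumPy transcription of this piecewise loss (with the threshold \delta fixed to 1, as in the formula above) could look like the following; tf.keras also ships an equivalent built-in, tf.keras.losses.Huber.

    import numpy as np

    def huber_loss(td_errors, delta=1.0):
        # Quadratic for small errors, linear for large ones, averaged over the batch.
        abs_err = np.abs(td_errors)
        quadratic = 0.5 * td_errors ** 2
        linear = delta * (abs_err - 0.5 * delta)
        return np.where(abs_err <= delta, quadratic, linear).mean()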

4.3 Code implementation

      Version 1: a simple Keras DQN implementation; see the reference below.

      Version 2: the Breakout problem.

      Visualization of the network:

4.4 DQN Summary

      The replay buffer and target network introduced above improve the performance a lot. The effects of the replay buffer and the target network, according to the tests reported in the Nature DQN paper, are shown below:

5. Double DQN

      Value Overestimation
      One problem in the DQN algorithm is that the agent tends to overestimate the Q function value, due to the max in the formula used to set targets:

    \[Q\left( {s,a} \right) \to r + \gamma {\max _a}Q\left( {s',a} \right)\]

      To demonstrate this problem, let's imagine the following situation. For one particular state there is a set of actions, all of which have the same true Q value. However, the estimates are inherently noisy and differ from the true value. Because of the max in the formula, the action with the highest positive error is selected, and this value is then propagated further to other states. This leads to a positive bias, i.e. value overestimation, which severely impacts the stability of the learning algorithm.
      Here is another example: consider a single state s where the true Q values of all actions equal 0, but the estimated Q values are distributed some above and some below zero. Taking the maximum of these estimates (which is greater than zero) to update the Q function leads to an overestimation of Q values.
      A similar example from post ….: DQN tends to overestimate Q values because of the max operator in the Q-learning equation, which uses the same values for both selecting and evaluating an action. Suppose that in state s there are five actions a_1 to a_5, and that a_3 is the best action. When we estimate the Q values of these actions in state s, the estimates will contain some noise and differ from the actual values. Due to this noise, action a_2 may get a higher estimated value than the optimal action a_3. If we then select the best action as the one with the maximum estimated value, we end up selecting the suboptimal action a_2 instead of the optimal action a_3.
      The above problem can be solved by using two separate Q functions, each learning independently: one Q function is used to select an action while the other is used to evaluate it. This can be realized simply by tweaking the target function of DQN.
      Recall the target function for DQN:

    \[y_i^{DQN} = r + \gamma {\max _{a'}}Q\left( {s',a';{\theta ^ - }} \right)\]

      Just modify the target function to formulate the Double DQN target:

    \[y_i^{DoubleDQN} = r + \gamma Q\left( {s',\arg {{\max }_{a'}}Q\left( {s',a';\theta } \right);{\theta ^ - }} \right)\]

      In the above equation we have two Q functions, each with different weights. The online Q function with weights \theta is used to select the action, and the target Q function with weights \theta^- is used to evaluate it. (In the original Double Q-learning algorithm, the roles of the two Q functions are swapped randomly between updates, as described below.)
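
      The difference between the two targets is easy to see in code. A small NumPy sketch, where q_online_next and q_target_next are assumed to be arrays of shape (batch, n_actions) holding Q(s', .) under the online and target parameters (terminal-state masking omitted for brevity):

    import numpy as np

    def dqn_targets(rewards, q_target_next, gamma=0.99):
        # Standard DQN: the target network both selects and evaluates the next action.
        return rewards + gamma * q_target_next.max(axis=1)

    def double_dqn_targets(rewards, q_online_next, q_target_next, gamma=0.99):
        # Double DQN: the online network selects the action, the target network evaluates it.
        best_actions = q_online_next.argmax(axis=1)
        return rewards + gamma * q_target_next[np.arange(len(rewards)), best_actions]
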
      Comparison with the Double Q-learning algorithm:
      In the original Double Q-learning algorithm, two value functions are learned by assigning each experience randomly to update one of the two value functions, such that there are two sets of weights, \theta and \theta'. For each update, one set of weights is used to determine the greedy policy and the other to determine its value. In Q-learning, we are still estimating the value of the greedy policy according to the current values.
      The complete pseudocode for Double DQN is as follow:
      Estimation errors of any kind can induce an upward bias, regardless of whether these errors are due to environmental noise, function approximation, non-stationarity, or any other source.
      This is the same maximization-bias problem met in tabular Double Q-learning, as illustrated in the cases above.
      Double DQN handles the problem of the overestimation of Q values.
      Double DQN does not always improve raw performance, but it substantially benefits the stability of learning, and this improved stability translates directly into the ability to learn more complicated tasks.
      Implementation
      Version 1: a simple Keras ‘CartPole-v1’ implementation.

      Version 2: a more complicated ‘Breakout’ implementation.

6. Dueling DQN

6.1 Dueling Architecture
      (This part is summarized from the Dueling DQN paper.)
      Two separate estimators are used: one for the state-value function and one for the state-dependent action advantage function.
      The dueling network architecture, explicitly separates the representation of state values and (state-dependent) action advantages. The dueling architecture consists of two streams that represent the value and advantage functions, while sharing a common convolutional feature learning module. The two streams are combined via a special aggregating layer to produce an estimate of the state-action value function Q as shown in the below Figure. This dueling network should be understood as a single Q network with two streams that replaces the popular single-stream Q network in existing algorithms such as Deep Q-Networks (DQN). The dueling network automatically produces separate estimates of the state value function and advantage function, without any extra supervision.
      The figure below: a popular single-stream Q-network (top) and the dueling Q-network (bottom). The dueling network has two streams that separately estimate the (scalar) state value and the advantages for each action; the green output module combines them. Both networks output Q values for each action.
      Intuitively, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action for each state. This is particularly useful in states where its actions do not affect the environment in any relevant way. The dueling architecture can more quickly identify the correct action during policy evaluation as redundant or similar actions are added to the learning problem.
      A more detailed explanation is given by … (to be cited) as follows:

6.2 Formulation
      The key insight behind the dueling architecture is that for many states, it is unnecessary to estimate the value of each action choice.
      The lower layers of the dueling network are convolutional, as in the original DQN. However, instead of following the convolutional layers with a single sequence of fully connected layers, the dueling network uses two sequences (or streams) of fully connected layers. The streams are constructed so that they can provide separate estimates of the value and advantage functions. Finally, the two streams are combined to produce a single output Q function. The output of the network is a set of Q values, one for each action.
      Action-value function: {Q^{\rm{\pi }}}\left( {s,a} \right) = {V^{\rm{\pi }}}\left( s \right) + {A^{\rm{\pi }}}\left( {s,a} \right)
      The State-value Function: {V^{\rm{\pi }}}\left( s \right) = {\mathbb{E}_{a \sim {\rm{\pi }}(s)}}\left[ {{Q^{\rm{\pi }}}\left( {s,a} \right)} \right]
      The expectation of the advantage function is zero: {\mathbb{E}_{a \sim {\rm{\pi }}(s)}}\left[ {{A^{\rm{\pi }}}\left( {s,a} \right)} \right] = 0
      For a deterministic policy, {a^ * } = \arg {\max _{a' \in \mathcal{A}}}Q\left( {s,a'} \right) , thus Q\left( {s,{a^*}} \right) = V\left( s \right) and A\left( {s,{a^*}} \right) = 0
      The aggregating module is as follow:

    \[Q\left( {s,a;\theta ,\alpha ,\beta } \right) = V\left( {s;\theta ,\beta } \right) + A\left( {s,a;\theta ,\alpha } \right)\]

      where V\left( {s;\theta ,\beta } \right) is a scalar outputted by one stream of fully-connected layers, and A\left( {s,a;\theta ,\alpha } \right) is a \left| \mathcal{A} \right|-dimensional vector. \theta denotes the parameters of the convolutional layers, while \alpha and \beta are the parameters of the two streams of fully-connected layers.
      A problem arises: the above equation is unidentifiable, in the sense that given Q we cannot recover V and A uniquely. For example, we can add a constant to V(s;\theta,\beta) and subtract the same constant from A(s,a;\theta,\alpha) without changing Q.
      To address this identifiability issue, the advantage function estimator can be forced to have zero advantage at the chosen action:

    \[Q\left( {s,a;\theta ,\alpha ,\beta } \right) = V\left( {s;\theta ,\beta } \right) + \left( {A\left( {s,a;\theta ,\alpha } \right) - \mathop {\max }\limits_{a' \in \mathcal{A}} A\left( {s,a';\theta ,\alpha } \right)} \right)\]

      An alternative module replaces the max operator with an average over the advantages:

    \[Q\left( {s,a;\theta ,\alpha ,\beta } \right) = V\left( {s;\theta ,\beta } \right) + \left( {A\left( {s,a;\theta ,\alpha } \right) - \frac{1}{{\left| \mathcal{A} \right|}}\sum\limits_{a'} {A\left( {s,a';\theta ,\alpha } \right)} } \right)\]
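
      A minimal tf.keras sketch of this aggregating module, assuming Atari-style 84x84x4 inputs and a placeholder num_actions; only the two-stream head and the mean-subtracted combination are the point here:

    import tensorflow as tf
    from tensorflow.keras import layers

    num_actions = 4  # placeholder, depends on the game

    inputs = layers.Input(shape=(84, 84, 4))
    x = layers.Conv2D(32, 8, strides=4, activation='relu')(inputs)
    x = layers.Conv2D(64, 4, strides=2, activation='relu')(x)
    x = layers.Conv2D(64, 3, strides=1, activation='relu')(x)
    x = layers.Flatten()(x)

    # Value stream: a single scalar V(s; theta, beta).
    value = layers.Dense(512, activation='relu')(x)
    value = layers.Dense(1)(value)

    # Advantage stream: one value A(s, a; theta, alpha) per action.
    adv = layers.Dense(512, activation='relu')(x)
    adv = layers.Dense(num_actions)(adv)

    # Aggregating layer: Q = V + (A - mean_a A), which resolves the identifiability issue.
    q_values = layers.Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([value, adv])

    dueling_dqn = tf.keras.Model(inputs=inputs, outputs=q_values)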

6.3 Pseudocode

6.4 Code implementation

      The epsilon parameter changes according to the curve below.
      Code implementation:

      You can also see this reference: https://github.com/fg91/Deep-Q-Learning/blob/master/DQN.ipynb

7. Summary

     No more words; the whole DQN series is summarized in the single graph below:

Reference

1. RL — DQN Deep Q-network.
2. Human-level control through deep reinforcement learning.
3. Dueling Network Architectures for Deep Reinforcement Learning.
4. Deep Reinforcement Learning with Double Q-learning.
5. Playing Atari with Deep Reinforcement Learning.
