Deep Deterministic Policy Gradient (DDPG)

In this post we summarize the deep deterministic policy gradient (DDPG) algorithm, see how it works in continuous action spaces, and see how Deep Q-Learning and the Deterministic Policy Gradient algorithm are combined. The post is organized as follows:
     1. DDPG overview: the goal of DDPG and the problems with DQN;
     2. Formulation of DDPG, its derivation and solution;
     3. DDPG implementation.

1. DDPG Overview

     DDPG is an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. DDPG combines the deterministic policy gradient (DPG) and Deep Q-learning (DQN): it concurrently learns a Q-function and a policy, using off-policy data and the Bellman equation to learn the Q-function, and using the Q-function to learn the policy.
     Problem of DQN:
     DQN solves problems with high-dimensional observation spaces, but it can only handle discrete, low-dimensional action spaces. Many tasks of interest, such as physical control tasks, have continuous (real-valued) and high-dimensional action spaces.
      Any way to solve the ‘infinite action’ problem in DQN?
       One alternative is to discretize the action space, but the number of discrete actions grows exponentially with the number of action dimensions. Discretization is therefore ill-suited to tasks that require fine-grained control: a fine discretization leads to an explosion in the number of discrete actions (too large), while a coarse one throws away information about the structure of the action domain (not precise), which may be essential for solving many problems.
       Conclusion: DQN focuses on tasks with high-dimensional observation spaces, while DDPG focuses on tasks with high-dimensional (continuous) action spaces.
     Goal of DDPG:
     As noted above, DQN handles high-dimensional observation spaces but only discrete, low-dimensional action spaces. It cannot be straightforwardly applied to continuous domains since it relies on finding the action that maximizes the action-value function, which in the continuous-valued case requires an iterative optimization process at every step. (Discretizing the action space either causes an explosion in the number of discrete actions or throws away information about the structure of the action domain, as discussed above.)
     In Q-learning, if you know the optimal action-value function {Q^*}\left( {s,a} \right), then in any given state, the optimal action {a^*} can be found by solving

    \[{a^*}(s) = \arg \mathop {\max }\limits_a {Q^*}\left( {s,a} \right)\]

     For discrete actions, the argmax can be solved directly by computing the Q-value of each action and comparing them (this also immediately gives the action which maximizes the Q-value). But when the action space is continuous, we cannot exhaustively enumerate the actions, and solving the optimization problem is highly non-trivial. Using a normal optimization algorithm would make calculating \mathop {\max_a } {Q^*}\left( {s,a} \right) a painfully expensive subroutine, since it would need to be run every time the agent wants to take an action in the environment; this is unacceptable.
     Because the action space is continuous, the function {Q^*}\left( {s,a} \right) is presumed to be differentiable with respect to the action argument. This allows us to set up an efficient, gradient-based learning rule for a policy \mu (s) which exploits that fact. Then, instead of running an expensive optimization subroutine each time to compute \mathop {\max_a } {Q^*}\left( {s,a} \right), we can approximate it with {\max _a}{Q^*}\left( {s,a} \right) \approx Q\left( {s,\mu (s)} \right).
     Idea behind DDPG: use a separate neural network (the actor) to approximate the max calculation.
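     A minimal sketch of this idea follows (PyTorch assumed; the Actor class, layer sizes, and act_limit scaling are illustrative choices, not prescribed by the paper):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s) -> a, used in place of argmax_a Q(s, a)."""
    def __init__(self, obs_dim, act_dim, act_limit):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.act_limit = act_limit  # scale to the environment's action range

    def forward(self, obs):
        return self.act_limit * self.net(obs)

# max_a Q(s, a) is then approximated by Q(s, mu(s)) -- no search over actions is needed.
```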

Recap:
     DQN made it feasible to learn with large, non-linear function approximators, which was generally believed to be difficult and unstable, by introducing two innovations:
     •  The network is trained off-policy with samples from a replay buffer to minimize correlations between samples;
     •  The network is trained with a target Q network to give consistent targets during temporal difference backups.
     DDPG reuses the replay buffer and target network ideas; meanwhile, it incorporates the batch normalization trick, which does not exist in the original DQN, to deal with input observations of different scales and thus generalize across environments. According to the DDPG paper, it can learn competitive policies for all of the investigated tasks with low-dimensional observations, and for some tasks with high-dimensional observations such as pixels, but not all of them.

Key Features:
     •  Compared with the DPG algorithm, DDPG is inspired by the success of DQN, allowing the use of deep neural network function approximators to learn online in large state and action spaces.
     •  Introduces batch normalization to normalize each dimension of the observations across the samples in a minibatch to unit mean and variance, which helps the network learn effectively and makes it easier to find hyper-parameters that generalize across environments with different scales of state values.
     •  Solves the issue of exploration in continuous action spaces by adding noise sampled from a noise process, namely the Ornstein-Uhlenbeck process, to the actor policy. (The OpenAI Spinning Up documentation mentions that uncorrelated Gaussian noise works well too.)
     •  Introduces the replay buffer, as in DQN, to handle the problem that samples generated by exploring sequentially in an environment are not independently and identically distributed, which neural network training assumes.
     •  Introduces target networks for both the actor and the critic to improve the stability of learning. With "soft" target updates, a copy of the actor and critic networks is created and used for calculating the target values, rather than directly copying the weights.

Quick Facts:
     •  DDPG is an off-policy algorithm.
     •  DDPG extends DPG to Deep RL.
     •  DDPG can only be used for environments with continuous action spaces.
     •  DDPG is off-policy and uses an actor-critic architecture for continuous action spaces.
     •  DDPG can be thought of as being deep Q-learning for continuous action spaces.
     •  DQN focuses on solving the tasks of high-dimensional observation spaces, while DDPG focuses on solving the tasks of high-dimensional action spaces.

2. Formulation

     (Summarized from DDPG original paper)
     Assumptions:
     •  An agent interacting with the environment in discrete timesteps;
     •  Actions are real-valued {a_t} \in {\mathbb{R}^N};
     •  The environment is fully observed, so the state can be taken to be the current observation: {s_t} = \left( {{x_1},{a_1}, \cdots ,{a_{t - 1}},{x_t}} \right) = {x_t}.
     Agent’s Goal: learn a policy which maximizes the expected return from the start distribution:

    \[J = {\mathbb{E}_{{r_i},{s_i} \sim E,{a_i} \sim {\rm{\pi }}}}\left[ {{R_1}} \right]\]

     Action-value function: the expected return after taking an action {a_t} in state {s_t} and thereafter following policy \pi:

    \[{Q^{\rm{\pi }}}\left( {{s_t},{a_t}} \right) = {\mathbb{E}_{{r_{i \ge t}},{s_{i > t}} \sim E,{a_{i > t}} \sim {\rm{\pi }}}}\left[ {{R_t}\left| {{s_t},{a_t}} \right.} \right]\]

     If the target policy is deterministic, we can describe it as a function \mu :\mathcal{S} \to \mathcal{A}.
     Ignoring the inner expectation (the policy is deterministic), we get:

    \[{Q^\mu }\left( {{s_t},{a_t}} \right) = {\mathbb{E}_{{r_t},{s_{t + 1}} \sim E}}\left[ {r\left( {{s_t},{a_t}} \right) + \gamma {Q^\mu }\left( {{s_{t + 1}},\mu ({s_{t + 1}})} \right)} \right]\]

     The expectation above depends only on the environment, meaning that it is possible to learn {Q^\mu } off-policy, using transitions which are generated from a different stochastic behavior policy \beta.
     There are two common ways of learning off-policy: using a replay buffer or using importance sampling.
     Traditional Q-learning, a commonly used off-policy algorithm, simply uses the greedy policy, regardless of whether Q comes from the current policy or not:

    \[\mu (s) = \arg {\max _a}Q\left( {s,a} \right)\]

     In function approximation methods like DQN, considering function approximators (such as the neural network in DQN) parameterized by {\theta ^Q}, we optimize by minimizing the loss:

    \[L\left( {{\theta ^Q}} \right) = {\mathbb{E}_{{s_t} \sim {\rho ^\beta },{a_t} \sim \beta ,{r_t} \sim E}}\left[ {{{\left( {Q\left( {{s_t},{a_t}\left| {{\theta ^Q}} \right.} \right) - {y_t}} \right)}^2}} \right]\]

     where

    \[{y_t} = r\left( {{s_t},{a_t}} \right) + \gamma Q\left( {{s_{t + 1}},\mu ({s_{t + 1}})\left| {{\theta ^Q}} \right.} \right)\]

     However, it is difficult to solve the above large, non-linear optimization problem. (Original paper: "The use of large, non-linear function approximators for learning value or action-value functions has often been avoided in the past since theoretical performance guarantees are impossible, and practically learning tends to be unstable.") That is why the replay buffer and a separate target network for calculating y_t are introduced.
     Conclusion: DQN addresses the problem of learning with large, non-linear function approximators by introducing the replay buffer and the target network; however, the problem of the agent taking continuous actions is solved with ideas from the DPG algorithm.

DPG algorithm recap:
     The DPG algorithm maintains a parameterized actor function \mu \left( {s\left| {{\theta ^\mu }} \right.} \right) which specifies the current policy by deterministically mapping states to a specific action. The critic Q\left( {s,a} \right) is learned using the Bellman equation, as in Q-learning. The actor is updated by applying the chain rule to the expected return from the start distribution J with respect to the actor parameters:

    \[\begin{array}{l}{\nabla _{{\theta ^\mu }}}J \approx {\mathbb{E}_{{s_t} \sim {\rho ^\beta }}}\left[ {{\nabla _{{\theta ^\mu }}}Q\left( {s,a\left| {{\theta ^Q}} \right.} \right)\left| {_{s = {s_t},a = \mu \left( {s\left| {{\theta ^\mu }} \right.} \right)}} \right.} \right]\\\;\;\;\;\;\;\;\;\;\; = {\mathbb{E}_{{s_t} \sim {\rho ^\beta }}}\left[ {{\nabla _a}Q\left( {s,a\left| {{\theta ^Q}} \right.} \right)\left| {_{s = {s_t},a = \mu ({s_t})}} \right.{\nabla _{{\theta ^\mu }}}\mu \left( {s\left| {{\theta ^\mu }} \right.} \right)\left| {_{s = {s_t}}} \right.} \right]\end{array}\]
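     Below is a sketch of this chain-rule update (PyTorch assumed; actor and critic are illustrative modules). In practice most implementations obtain the same gradient by letting autograd differentiate Q(s, \mu(s)) directly:

```python
import torch

def dpg_actor_gradient(actor, critic, obs):
    """Ascent direction for theta^mu: grad_a Q(s, a)|_{a=mu(s)} pushed through grad_theta mu(s)."""
    actions = actor(obs)                                             # a = mu(s | theta^mu)
    q = critic(obs, actions).sum()                                   # Q(s, a | theta^Q)
    grad_a = torch.autograd.grad(q, actions, retain_graph=True)[0]   # grad_a Q at a = mu(s)
    actions.backward(grad_a)                                         # chain rule through the actor
    return [p.grad for p in actor.parameters()]

# Equivalent in practice: (-critic(obs, actor(obs))).mean().backward() and an actor-only optimizer step.
```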

     As with Q-learning, introducing non-linear function approximators means that convergence is no longer guaranteed. However, such approximators appear essential in order to learn and generalize on large state spaces.
     NFQCA uses the same update rules as DPG but with neural network function approximators, and uses batch learning for stability, which is intractable for large networks.
     DDPG uses ideas from DQN to solve this stability problem.
     Another problem: when training a neural network as in supervised learning, the training samples are assumed to be independently and identically distributed, which is violated by samples generated sequentially while exploring an environment.
     The replay buffer addresses both issues: it lets neural networks learn from sequentially generated samples, and it makes learning the large, non-linear approximation problem feasible.

Key Points:
     Replay buffer for de-correlating samples:
     As in DQN, the replay buffer is a finite-sized cache \mathcal{R}. Transitions are sampled from the environment according to the exploration policy, and the tuple (s_t,a_t,r_t,s_{t+1}) is stored in the replay buffer, with new samples replacing the oldest samples once the buffer is full. At each timestep the actor and critic are updated by sampling a minibatch uniformly from the buffer, thus enabling the agent to learn from a set of uncorrelated transitions.
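     A minimal replay buffer sketch (capacity, batch size, and names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite-size cache of transitions; uniform sampling de-correlates minibatches."""
    def __init__(self, capacity=int(1e6)):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are dropped automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```
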
     Target network for learning stability:
     The target network is a little different from that in DQN: it is modified for actor-critic and uses 'soft' target updates rather than directly copying the weights. DDPG creates a copy of the actor and critic networks, namely the actor target network \mu '\left( {s\left| {{\theta ^{\mu '}}} \right.} \right) and the critic target network Q'\left( {s,a\left| {{\theta ^{Q'}}} \right.} \right), for calculating the target values. The weights of these target networks are updated by having them slowly track the learned networks: \theta ' \leftarrow \tau \theta + (1 - \tau )\theta ' with \tau \ll 1. Using both {\mu '} and {Q'} is required to obtain stable targets y_t so that the critic can be trained consistently without divergence, which improves the stability of learning.
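     A sketch of the soft update rule \theta ' \leftarrow \tau \theta + (1 - \tau )\theta ' (PyTorch assumed; applied to both the actor and the critic target networks):

```python
import torch

def soft_update(target_net, main_net, tau=0.001):
    """Slowly track the learned network: theta' <- tau*theta + (1 - tau)*theta'."""
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), main_net.parameters()):
            p_targ.data.mul_(1.0 - tau)
            p_targ.data.add_(tau * p.data)
```
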
     Batch normalization to improve generalization:
     Why is batch normalization proposed? (To address the generalization problem caused by differing physical units and scales.) For different tasks, different components of the observation may have different physical units (for example, positions versus velocities), and the ranges may vary across environments. This can make it difficult for the network to learn effectively and may make it difficult to find hyper-parameters which generalize across environments with different scales of state values.
     In the low-dimensional case, batch normalization is used on the state input, on all layers of the \mu network, and on all layers of the Q network prior to the action input.
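     A hedged sketch of this normalization placement for the critic (PyTorch assumed; the class name and layer sizes are illustrative): batch normalization is applied to the state input and the layers before the action enters, and not afterwards.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): BatchNorm on the state path only, up to where the action is concatenated."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.obs_bn = nn.BatchNorm1d(obs_dim)     # normalize raw observations
        self.fc1 = nn.Linear(obs_dim, 400)
        self.bn1 = nn.BatchNorm1d(400)            # normalize layers prior to the action input
        self.fc2 = nn.Linear(400 + act_dim, 300)  # the action enters here; no BatchNorm afterwards
        self.out = nn.Linear(300, 1)

    def forward(self, obs, act):
        x = torch.relu(self.bn1(self.fc1(self.obs_bn(obs))))
        x = torch.relu(self.fc2(torch.cat([x, act], dim=1)))
        return self.out(x)
```
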
     Exploration vs. Exploitation:
     An exploration policy \mu ' is constructed by adding noise sampled from a noise process \mathcal{N} to the actor policy:

    \[\mu '({s_t}) = \mu ({s_t}\left| {\theta _t^\mu } \right.) + \mathcal{N}\]
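     A common implementation sketch of the Ornstein-Uhlenbeck process used for \mathcal{N} (the theta, sigma, and dt values are typical choices, not mandated by the paper):

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1)."""
    def __init__(self, act_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.ones(act_dim) * mu

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

# Acting with exploration: a_t = mu(s_t | theta_t^mu) + noise.sample()
```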

3. Pseudocode and Code Implementation

     This section is a modified summary drawn from the OpenAI Spinning Up documentation, which provides a much clearer and excellent explanation of how DDPG works.
     3.1 The Q-Learning Side of DDPG:
     Start from the Bellman equation describing the optimal action-value function {Q^*}\left( {s,a} \right):

    \[{Q^*}\left( {s,a} \right) = {\mathbb{E}_{s' \sim P}}\left[ {r\left( {s,a} \right) + \gamma \mathop {\max }\limits_{a'} {Q^*}\left( {s',a'} \right)} \right]\]

     where {s' \sim P} is shorthand for saying that the next state, s', is sampled by the environment from a distribution P\left( { \cdot \left| {s,a} \right.} \right).
     The Bellman equation is used to learn an approximator to {Q^*}\left( {s,a} \right) . Suppose the approximator is a neural network {Q_\phi }\left( {s,a} \right) , with parameters \phi , and that we have collected a set \mathcal{D} of transitions {(s,a,r,s',d)} (where d indicates whether state s' is terminal). We can set up a mean-squared Bellman error (MSBE) function, which tells us roughly how closely {Q_\phi } comes to satisfying the Bellman equation:

    \[L\left( {\phi ,D} \right) = \mathop \mathbb{E}\limits_{(s,a,r,s',d) \sim D} \left[ {{{\left( {{Q_\phi }\left( {s,a} \right) - \left( {r + \gamma (1 - d)\mathop {\max }\limits_{a'} {Q_\phi }\left( {s',a'} \right)} \right)} \right)}^2}} \right]\]

     Q-learning algorithms for function approximators, such as DQN (and all its variants) and DDPG, are largely based on minimizing this MSBE loss function. There are two main tricks employed by all of them which are worth describing, and then a specific detail for DDPG.
     Replay Buffers: In order for the algorithm to have stable behavior, the replay buffer should be large enough to contain a wide range of experiences, but it may not always be good to keep everything. If you only use the very-most recent data, you will overfit to that and things will break; if you use too much experience, you may slow down your learning. This may take some tuning to get right. (Tips: We’ve mentioned that DDPG is an off-policy algorithm: this is as good a point as any to highlight why and how. Observe that the replay buffer should contain old experiences, even though they might have been obtained using an outdated policy. Why are we able to use these at all? The reason is that the Bellman equation doesn’t care which transition tuples are used, or how the actions were selected, or what happens after a given transition, because the optimal Q-function should satisfy the Bellman equation for all possible transitions. So any transitions that we’ve ever experienced are fair game when trying to fit a Q-function approximator via MSBE minimization.)
     Target Networks: In the Q-learning algorithms, the target is defined as

    \[{r + \gamma (1 - d)\mathop {\max }\limits_{a'} {Q_\phi }\left( {s',a'} \right)}\]

     When we minimize the MSBE loss, we are trying to make the Q-function be more like this target. Problematically, the target depends on the same parameters we are trying to train: \phi. This makes MSBE minimization unstable. The solution is to use a set of parameters which comes close to \phi, but with a time delay; that is to say, a second network, called the target network, which lags the first. The parameters of the target network are denoted \phi_{\rm targ}.
     In DQN-based algorithms, the target network is just copied over from the main network every some-fixed-number of steps. In DDPG-style algorithms, the target network is updated once per main network update by polyak averaging:

    \[{\phi _{{\rm{targ}}}} \leftarrow \rho {\phi _{{\rm{targ}}}} + (1 - \rho )\phi \]

     where \rho is a hyperparameter between 0 and 1 (usually close to 1).
     DDPG Detail: Calculating the Max Over Actions in the Target. As mentioned earlier, computing the maximum over actions in the target is a challenge in continuous action spaces. DDPG deals with this by using a target policy network to compute an action which approximately maximizes {Q_{{\phi _{{\rm{targ}}}}}}. The target policy network is found the same way as the target Q-function: by polyak averaging the policy parameters over the course of training.
     Putting it all together, Q-learning in DDPG is performed by minimizing the following MSBE loss with stochastic gradient descent:

    \[L\left( {\phi ,D} \right) = \mathop \mathbb{E}\limits_{(s,a,r,s',d) \sim D} \left[ {{{\left( {{Q_\phi }\left( {s,a} \right) - \left( {r + \gamma (1 - d){Q_{{\phi _{{\rm{targ}}}}}}\left( {s',{\mu _{{\theta _{{\rm{targ}}}}}}(s')} \right)} \right)} \right)}^2}} \right]\]

where {{\mu _{{\theta _{{\rm{targ}}}}}}} is the target policy.
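     As a hedged sketch (PyTorch assumed; function and variable names are illustrative), the critic update that minimizes this loss might look like:

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, critic_targ, actor_targ, batch, gamma=0.99):
    """MSBE loss with target actor/critic; batch tensors: (s, a, r, s', d)."""
    s, a, r, s_next, d = batch
    with torch.no_grad():                                    # targets are held fixed
        a_next = actor_targ(s_next)                          # mu_targ(s')
        q_targ = critic_targ(s_next, a_next).squeeze(-1)     # Q_targ(s', mu_targ(s'))
        y = r + gamma * (1.0 - d) * q_targ                   # Bellman target
    q = critic(s, a).squeeze(-1)
    return F.mse_loss(q, y)
```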

     3.2 The Policy Learning Side of DDPG:
     The policy learning side of DDPG aims to learn a deterministic policy {\mu _\theta }\left( s \right) which gives the action that maximizes {{Q_\phi }\left( {s,a} \right)}. Because the action space is continuous, and we assume the Q-function is differentiable with respect to the action, we can just perform gradient ascent (with respect to the policy parameters only) to solve \mathop {\max }\limits_\theta {\mathbb{E}_{s \sim D}}\left[ {{Q_\phi }\left( {s,{\mu _\theta }\left( s \right)} \right)} \right].
     Note that the Q-function parameters are treated as constants here.
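     A minimal sketch of this policy objective (reusing the illustrative actor/critic names from above); minimizing the negative is gradient ascent on Q:

```python
def policy_loss(critic, actor, states):
    """-E[Q(s, mu(s))]: only the actor's parameters are updated against this loss."""
    return -critic(states, actor(states)).mean()

# Usage sketch: actor_opt.zero_grad(); policy_loss(critic, actor, s).backward(); actor_opt.step()
# Only the actor optimizer steps, so the Q-function parameters act as constants.
```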

     3.3 Exploration & Exploitation:
     DDPG trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make DDPG policies explore better, we add noise to their actions at training time. The authors of the original DDPG paper recommended time-correlated OU noise, but more recent results suggest that uncorrelated, mean-zero Gaussian noise works perfectly well. Since the latter is simpler, it is preferred.
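     A hedged sketch of acting with Gaussian exploration noise, assuming the actor returns a NumPy action and act_limit bounds the action range (the noise scale is illustrative):

```python
import numpy as np

def act_with_noise(actor, obs, act_limit, sigma=0.1):
    a = actor(obs)                              # deterministic action mu(s), assumed NumPy array
    a = a + sigma * np.random.randn(*a.shape)   # uncorrelated, mean-zero Gaussian noise
    return np.clip(a, -act_limit, act_limit)    # keep the action inside valid bounds
```
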
     Version 1
     The OpenAI Spinning Up DDPG pseudocode is as follows:
   The corresponding code is modified from the original OpenAI Spinning Up repository, but encapsulates all modules into a single file to facilitate implementation and understanding:
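     The single-file code itself is not reproduced here; as a rough orientation only, one DDPG update step, reusing the illustrative helpers sketched earlier (ReplayBuffer, soft_update, critic_loss, policy_loss), might look like:

```python
def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, buffer, batch_size=64, tau=0.001):
    """One gradient step for the critic and the actor, then soft target updates."""
    batch = buffer.sample(batch_size)   # assumed to already return tensors

    # 1. Critic step: minimize the MSBE against the target networks.
    critic_opt.zero_grad()
    critic_loss(critic, critic_targ, actor_targ, batch).backward()
    critic_opt.step()

    # 2. Actor step: gradient ascent on Q(s, mu(s)) via the negated policy loss.
    states = batch[0]
    actor_opt.zero_grad()
    policy_loss(critic, actor, states).backward()
    actor_opt.step()

    # 3. Slowly update both target networks.
    soft_update(actor_targ, actor, tau)
    soft_update(critic_targ, critic, tau)
```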

The result is as follows:

     Version 2
     I drew a diagram to illustrate the iteration process:
   DDPG algorithm pseudocode from the original DDPG paper and a corresponding minimal implementation:

Code implementation

The result is as follows:

4. Reference

     1. Continuous Control With Deep Reinforcement Learning, Lillicrap et al. 2016
     2. OpenAI Spinning Up Documentation, DDPG
