Twin Delayed DDPG (TD3) Walkthrough

In this post, we start with a brief introduction to the TD3 algorithm and why it was proposed, which leads to the problem of overestimation bias, how it is handled in actor-critic methods, and some performance-enhancement tricks that address variance. The post also includes two implementations that show how TD3 improves performance in continuous action spaces, one from OpenAI Spinning Up and the other from a Udemy course. (Parts of this post are taken directly from these references to facilitate understanding.)

This post is organized as follows:
1. Problem Formulation
2. Reducing Overestimation Bias
3. Addressing Variance
4. Pseudocode and Implementation
5. References

TD3 Overview

     Twin Delayed Deep Deterministic Policy Gradient (TD3) is an off-policy algorithm that can only be used in environments with continuous action spaces, and it gains improved performance over DDPG by introducing a few tricks on the critic side. The goal of TD3 is to address the overestimation bias that arises in actor-critic methods for continuous action spaces due to function approximation error. In DDPG, a common failure mode is that the learned Q-function begins to dramatically overestimate Q-values, which then leads the policy to break, because it exploits the errors in the Q-function.

1. Problem Formulation

     The agent interacts with the environment at each discrete time step t: in state s\in \mathcal{S}, the agent selects an action a\in \mathcal{A} according to its policy \pi:\mathcal{S} \to \mathcal{A}, receives a scalar reward r, and the environment transitions to state s'. In this interaction process, the return is defined as the discounted sum of rewards from time t after taking action a in state s:

    \[{R_t} = \sum\nolimits_{i = t}^T {{\gamma ^{i - t}}r\left( {{s_i},{a_i}} \right)} \]

     The agent’s objective is to find the optimal policy \pi_\phi, with parameters \phi, which maximizes the expected return:

    \[J\left( \phi \right) = {{\mathbb{E}}_{{s_i} \sim {p_{\rm{\pi }}},{a_i} \sim {\rm{\pi }}}}\left[ {{R_0}} \right]\]

     So the problem of finding the optimal policy turns out to be an optimization problem that maximizes J\left( \phi \right), namely,

    \[\mathop {\max }\limits_\phi J\left( \phi \right)\]

   For continuous control, since J\left( \phi \right) is differentiable with respect to \phi, the parameterized policy \pi_\phi can be updated with gradient ascent by taking the gradient of the expected return {\nabla _\phi }J\left( \phi \right).
     In the actor-critic setting, the policy, namely the actor, can be updated through the deterministic policy gradient theorem (refer to the DPG section):

    \[{\nabla _\phi }J\left( \phi \right){\rm{ = }}{{\mathbb{E}}_{s \sim {p_{\rm{\pi }}}}}\left[ {{\nabla _a}{Q^{\rm{\pi }}}\left( {s,a} \right)\left| {_{a = {\rm{\pi }}(s)}{\nabla _\phi }{{\rm{\pi }}_\phi }(s)} \right.} \right]\]

where Q^{\rm{\pi }}\left( {s,a} \right), the expected return when performing action a in state s and following \pi afterwards, is known as the critic or the value function:

    \[{Q^{\rm{\pi }}}\left( {s,a} \right) = {{\mathbb{E}}_{{s_i} \sim {p_{\rm{\pi }}},{a_i} \sim {\rm{\pi }}}}\left[ {{R_t}\left| {s,a} \right.} \right]\]
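     As a concrete illustration (not the paper's code), here is a minimal PyTorch sketch of this actor update; the `actor`, `critic`, and `actor_optimizer` objects are hypothetical names assumed to be defined elsewhere:

```python
import torch

def actor_update(actor, critic, actor_optimizer, states):
    # J(phi) is estimated by the mean Q-value of the actor's own actions on a
    # batch of states; maximizing it is the same as minimizing its negative.
    actor_loss = -critic(states, actor(states)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()   # autograd applies grad_a Q * grad_phi pi (chain rule)
    actor_optimizer.step()
```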

     The above is the common recipe for solving the maximization problem; what remains is the calculation of the value function Q^{\rm{\pi }}, which depends on the size of the state space.
     In a small state space, using Q-learning, the value function Q^\pi can be learned with temporal difference learning based on the Bellman equation, which is a fundamental relationship between the value of a state-action pair (s, a) and the value of the subsequent state-action pair (s', a'):

    \[{Q^{\rm{\pi }}}\left( {s,a} \right) = r + \gamma {{\mathbb{E}}_{s',a'}}\left[ {{Q^{\rm{\pi }}}\left( {s',a'} \right)} \right],\;{\kern 1pt} a' \sim {\rm{\pi }}(s')\]
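     For intuition, here is a toy sketch of such a temporal difference update toward the target r + \gamma Q(s', a'), matching the equation above; the table sizes and learning rate are hypothetical:

```python
import numpy as np

n_states, n_actions = 5, 3        # toy problem sizes (hypothetical)
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99          # learning rate and discount factor

def td_update(s, a, r, s_next, a_next):
    # Move Q(s, a) one step toward the Bellman target r + gamma * Q(s', a'),
    # where a' is the action the policy takes in s'.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```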

     For a large state space, the value can be estimated with a differentiable function approximator {Q_\theta }\left( {s,a} \right) with parameters \theta, which is the neural network approximator in deep Q-learning. The network is updated using temporal difference learning with a secondary frozen target network Q_{\theta'}(s, a) to maintain a fixed objective y over multiple updates:

    \[y = r + \gamma {Q_{\theta '}}\left( {s',a'} \right),\;{\kern 1pt} a' \sim {{\rm{\pi }}_{\phi '}}(s')\]

  The actions are selected from a target actor network \pi_{\phi'}.
     The target network is either updated periodically to exactly match the weights of the current network, or updated by some proportion \tau at each time step according to \theta ' \leftarrow \tau \theta + (1 - \tau )\theta '.
     By minimizing the difference between the action value and its target, i.e. the loss below, with a deep learning framework, we can finally obtain the parameters \theta.

    \[Loss = {\left( {{Q_\theta }\left( {s,a} \right) - y} \right)^2}\]
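     A minimal PyTorch sketch of this target computation and loss; `critic`, `critic_target`, `actor_target`, and `critic_optimizer` are hypothetical names, not code from any of the references:

```python
import torch
import torch.nn.functional as F

def critic_update(critic, critic_target, actor_target, critic_optimizer,
                  s, a, r, s_next, gamma=0.99):
    # y = r + gamma * Q_theta'(s', a'), with a' ~ pi_phi'(s'); the frozen
    # target networks keep this objective fixed over multiple updates.
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(s_next, a_next)
    # Loss = (Q_theta(s, a) - y)^2, averaged over the mini-batch.
    loss = F.mse_loss(critic(s, a), y)
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()
```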

     Till now everything goes well, so what is the problem? The answer is overestimation bias, or maximization bias, as depicted in Double Q-learning.

2. Reducing Overestimation Bias

     What’s overestimation bias?
     (From TD3 paper) Overestimation bias is a property of Q-learning in which the maximization of a noisy value estimate induces a consistent overestimation. In a function approximation setting, this noise is unavoidable given the imprecision of the estimator. This inaccuracy is further exaggerated by the nature of temporal difference learning, in which an estimate of the value function is updated using the estimate of a subsequent state. This means using an imprecise estimate within each update will lead to an accumulation of error. Due to overestimation bias, this accumulated error can cause arbitrarily bad states to be estimated as high value, resulting in suboptimal policy updates and divergent behavior.
     In Q-learning with discrete actions, the value estimate is updated with a greedy target y = r + \gamma {\max _{a'}}Q\left( {s',a'} \right). When Q\left( {s',a'} \right) is corrupted by estimation error, the maximization operation may select a wrong action, causing an overestimation bias which is then propagated through the Bellman equation. The overestimation bias is obvious in the discrete-action setting compared with the actor-critic setting, where the policy is updated via gradient descent; Section 4.1 of the TD3 paper proves that overestimation bias also exists in the actor-critic setting.
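     A quick NumPy experiment makes this mechanism concrete: even when the individual value estimates are unbiased (zero-mean noise around the true values), the maximum over them is biased upward. The numbers below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000
true_q = np.zeros(n_actions)                  # every action truly has value 0

# Noisy but unbiased estimates of Q(s', a'), one row per trial.
noisy_q = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))
greedy_target = noisy_q.max(axis=1)           # the max_{a'} Q(s', a') term

print(greedy_target.mean())   # roughly 1.54, not 0: a consistent overestimation
```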
     How to reduce the effects of overestimation bias?
     For Q-learning, Double Q-learning was proposed; its greedy update uses two independent estimators, each of which is used to update the other. (Double Q-learning setting)
     For DQN, Double DQN was proposed: it uses the target network as one of the value estimates and obtains a policy by greedy maximization of the current value network rather than the target network. (Double DQN setting)

    \[y_t^{{\rm{DoubleDQN}}} = {r_{t + 1}} + \gamma Q\left( {{s_{t + 1}},\arg {{\max }_a}Q\left( {{s_{t + 1}},a;{\theta _t}} \right),\theta _t^ - } \right)\]
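     A minimal sketch of this Double DQN target in PyTorch, assuming hypothetical `q_net` (current network) and `q_target` (target network) modules that map a batch of states to one Q-value per discrete action:

```python
import torch

def double_dqn_target(q_net, q_target, r, s_next, gamma=0.99):
    with torch.no_grad():
        # Select the greedy action with the current value network ...
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        # ... but evaluate it with the (frozen) target network.
        q_next = q_target(s_next).gather(1, a_star)
    return r + gamma * q_next       # r is expected to have shape (batch, 1)
```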

     In the actor-critic setting, the TD3 authors first consider an analogous update to Double DQN, using the current policy rather than the target policy in the learning target: (actor-critic Double DQN setting)

    \[y = r + \gamma {Q_{\theta '}}\left( {s',{{\rm{\pi }}_\phi }(s')} \right)\]

     However, with the slow-changing policy in actor-critic methods, the current and target networks are too similar to provide an independent estimate, and this offers little improvement. Instead, the original Double Q-learning formulation can be used, with a pair of actors \left( {{{\rm{\pi }}_{{\phi _1}}},{{\rm{\pi }}_{{\phi _2}}}} \right) and critics \left( {{Q_{{\theta _1}}},{Q_{{\theta _2}}}} \right), where {{{\rm{\pi }}_{{\phi _1}}}} is optimized with respect to {{Q_{{\theta _1}}}} and {{{\rm{\pi }}_{{\phi _2}}}} with respect to {{Q_{{\theta _2}}}}: (actor-critic Double Q-learning setting)

    \[\begin{array}{l}{y_1} = r + \gamma {Q_{{{\theta '}_2}}}\left( {s',{{\rm{\pi }}_{{\phi _1}}}(s')} \right)\\{y_2} = r + \gamma {Q_{{{\theta '}_1}}}\left( {s',{{\rm{\pi }}_{{\phi _2}}}(s')} \right)\end{array}\]

     However, it is again shown that this actor-critic Double Q-learning suffers from a similar overestimation as DDPG. Why? Because the two critics are not entirely independent (independence is required in Double Q-learning), due to the use of the opposite critic in the learning targets and the use of the same replay buffer. The Clipped Double Q-learning algorithm is therefore proposed: taking the minimum of the two estimates {{Q_{{\theta _1}}}} and {{Q_{{\theta _2}}}}, the target is updated by:

    \[{y_1} = r + \gamma \mathop {\min }\limits_{i = 1,2} {Q_{{{\theta '}_i}}}\left( {s',{{\rm{\pi }}_{{\phi _1}}}(s')} \right)\]

     Effects: the above update rule may induce an underestimation bias, but the value of the underestimated actions will not be explicitly propagated through the policy update.
     For an overview of how this Clipped Double Q-learning target update rule is derived, refer to the original TD3 paper.
     Takeaway: TD3 follows the Double Q-learning idea: it learns two Q-functions instead of one (hence “twin”) and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
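     A minimal sketch of this clipped target, with hypothetical `actor_target`, `critic1_target`, and `critic2_target` networks:

```python
import torch

def clipped_double_q_target(actor_target, critic1_target, critic2_target,
                            r, s_next, gamma=0.99):
    with torch.no_grad():
        a_next = actor_target(s_next)
        q1 = critic1_target(s_next, a_next)
        q2 = critic2_target(s_next, a_next)
        # The element-wise minimum of the two estimates fends off
        # overestimation, at the cost of a possible underestimation bias.
        return r + gamma * torch.min(q1, q2)
```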

3. Addressing Variance

     Three approaches are proposed to address the variance directly, namely, target networks, delayed policy updates, and a regularization strategy.
     (1) Target Networks and Delayed Policy Updates
     The paper shows that using a stable target reduces the growth of error, and that the divergence occurring without target networks is the result of policy updates made with a high-variance value estimate: value estimates diverge through overestimation when the policy is poor, and the policy becomes poor if the value estimate itself is inaccurate.
     (From TD3 paper) If target networks can be used to reduce the error over multiple updates, and policy updates on high-error states cause divergent behavior, then the policy network should be updated at a lower frequency than the value network, to first minimize error before introducing a policy update. We propose delaying policy updates until the value error is as small as possible. The target networks are updated slowly by:

    \[\theta ' \leftarrow \tau \theta + (1 - \tau )\theta '\]
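     In code this slow target update takes only a few lines; here is a sketch that works for any pair of PyTorch networks (tau = 0.005 is the value used in the TD3 paper):

```python
import torch

def soft_update(target_net, net, tau=0.005):
    # theta' <- tau * theta + (1 - tau) * theta', applied parameter-wise.
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```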

     (2) Target Policy Smoothing Regularization
     Deterministic policies can overfit to narrow peaks in the value estimate. The intuition is that similar actions should have similar values, so we fit the value of a small area around the target action:

    \[y = r + \gamma {\mathbb{E}_\epsilon }\left[ {{Q_{\theta '}}\left( {s',{{\rm{\pi }}_{\phi '}}(s') + \epsilon } \right)} \right]\]

     In practice, the expectation over actions can be approximated by adding a small amount of random noise to the target policy and averaging over mini-batches, so the target update is modified as:

    \[\begin{array}{l}y = r + \gamma {Q_{\theta '}}\left( {s',{{\rm{\pi }}_{\phi '}}(s') + \epsilon } \right),\\\epsilon \sim clip\left( {N(0,\sigma ), - c,c} \right)\end{array}\]
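     A sketch of this smoothed target action; sigma = 0.2 and c = 0.5 follow the TD3 paper's defaults, and actions are assumed to lie in [-max_action, max_action]:

```python
import torch

def smoothed_target_action(actor_target, s_next, sigma=0.2, c=0.5,
                           max_action=1.0):
    with torch.no_grad():
        a_next = actor_target(s_next)
        # Clipped Gaussian noise keeps the perturbed action close to the
        # original target action.
        noise = (torch.randn_like(a_next) * sigma).clamp(-c, c)
        return (a_next + noise).clamp(-max_action, max_action)
```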

     (From OpenAI Spinning Up Doc) Target policy smoothing essentially serves as a regularizer for the algorithm. It addresses a particular failure mode that can happen in DDPG: if the Q-function approximator develops an incorrect sharp peak for some actions, the policy will quickly exploit that peak and then have brittle or incorrect behavior. This can be averted by smoothing out the Q-function over similar actions, which target policy smoothing is designed to do.
     Takeaway: TD3 updates the policy (and target networks) less frequently than the Q-function. It adds noise to the target action, to make it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.

4. Pseudocode and Implementation

Version 1: Main process of learning (selected and summarized from the OpenAI Spinning Up explanation).
     Step 1: Taking an action (with target policy smoothing), get a'(s').
     Actions used to form the Q-learning target are based on the target policy {\mu _{{\theta _{{\rm{targ}}}}}}, but with clipped noise added on each dimension of the action. After adding the clipped noise, the target action is then clipped to lie in the valid action range (all valid actions a satisfy {a_{Low}} \le a \le {a_{High}}). The target actions are thus:

    \[a'(s') = {\rm{clip}}\left( {{\mu _{{\theta _{{\rm{targ}}}}}}\left( {s'} \right) + {\rm{clip}}\left( {\varepsilon , - c,c} \right),{a_{Low}},{a_{High}}} \right),\;\;\varepsilon \sim \;N(0,\sigma )\]

     Step 2: Clipped double-Q learning
     Both Q-functions use a single target, calculated using whichever of the two Q-functions gives the smaller target value:

    \[y(r,s',d) = r + \gamma (1 - d)\mathop {\min }\limits_{i = 1,2} {Q_{{\theta _{i,{\rm{targ}}}}}}\left( {s',a'(s')} \right),\]

and then both are learned by regressing to this target:

    \[\begin{array}{l}L\left( {{\phi _1},D} \right) = \mathop \mathbb{E}\limits_{(s,a,r,s',d) \sim D} \left[ {{{\left( {{Q_{{\phi _1}}}\left( {s,a} \right) - y(r,s',d)} \right)}^2}} \right]\\L\left( {{\phi _2},D} \right) = \mathop \mathbb{E}\limits_{(s,a,r,s',d) \sim D} \left[ {{{\left( {{Q_{{\phi _2}}}\left( {s,a} \right) - y(r,s',d)} \right)}^2}} \right]\end{array}\]

     Using the smaller Q-value for the target, and regressing towards that, helps fend off overestimation in the Q-function.
     Step 3: Policy learning
     The policy is learned just by maximizing {{Q_{{\phi _1}}}} (note that Spinning Up's notation is the reverse of the TD3 paper's: here \phi denotes the Q-function parameters and \theta the policy parameters):

    \[\mathop {\max }\limits_\theta \mathop \mathbb{E}\limits_{s \sim D} \left[ {{Q_{{\phi _1}}}\left( {s,{\mu _\theta }(s)} \right)} \right]\]
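     Putting the three steps together, one TD3 training update might look like the following sketch (PyTorch). The container `ac`, its attribute names, and the optimizers are assumptions for illustration, not Spinning Up's actual code:

```python
import torch
import torch.nn.functional as F

def td3_update(ac, batch, update_policy, gamma=0.99, sigma=0.2, c=0.5,
               act_limit=1.0, tau=0.005):
    """One TD3 update. `ac` is a hypothetical container holding actor, q1, q2,
    their target copies (suffix `_targ`) and the two optimizers."""
    s, a, r, s2, d = batch                  # d is the done flag (0.0 or 1.0)

    # Step 1: target action with clipped noise, clipped to the action range.
    with torch.no_grad():
        noise = (torch.randn_like(a) * sigma).clamp(-c, c)
        a2 = (ac.actor_targ(s2) + noise).clamp(-act_limit, act_limit)
        # Step 2: clipped double-Q target shared by both critics.
        y = r + gamma * (1 - d) * torch.min(ac.q1_targ(s2, a2),
                                            ac.q2_targ(s2, a2))

    # Regress both critics to the shared target.
    q_loss = F.mse_loss(ac.q1(s, a), y) + F.mse_loss(ac.q2(s, a), y)
    ac.q_optimizer.zero_grad()
    q_loss.backward()
    ac.q_optimizer.step()

    # Step 3 (delayed): update the policy by maximizing Q_phi1, then
    # Polyak-average every target network.
    if update_policy:
        pi_loss = -ac.q1(s, ac.actor(s)).mean()
        ac.pi_optimizer.zero_grad()
        pi_loss.backward()
        ac.pi_optimizer.step()
        with torch.no_grad():
            for targ, net in [(ac.actor_targ, ac.actor),
                              (ac.q1_targ, ac.q1), (ac.q2_targ, ac.q2)]:
                for p_t, p in zip(targ.parameters(), net.parameters()):
                    p_t.mul_(1 - tau).add_(tau * p)
```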

     The corresponding pseudocode is as follows:

   The following code implementation is modified from the original OpenAI Spinning Up GitHub repository so that it is much easier to understand, in one single file:
     Implementation Version 1:

     The results are as follows:

Version 2:
     In this section we will walk through the other version of the implementation, modified from the Udemy course Deep Reinforcement Learning 2.0 and following the original TD3 paper. The corresponding pseudocode is shown as follows:

Procedure for implementing this algorithm according to the original paper:
Initialization
     Step 1: Initialize the experience replay memory, with a size of 20,000; it will be populated with each new transition.
     Step 2: Build one neural network for the actor model and one neural network for the actor target.
     Step 3: Build two neural networks for the two critic models and two neural networks for the two critic targets.

Training Process – run full episodes, with the first 10,000 actions played randomly and subsequent actions played by the actor model (a sketch of this outer interaction loop is given after the step list below). Then repeat the following steps:
     Step 4: Sample a batch of transitions (s, s', a, r) from the memory. Then, for each element of the batch:
     Step 5: From the next state s', the actor target plays the next action a'.
     Step 6: Add Gaussian noise to this next action a' and clamp it to the range of values supported by the environment.
     Step 7: The two critic targets each take the couple (s', a') as input and return two Q-values, {Q_{t1}}\left( {s',a'} \right) and {Q_{t2}}\left( {s',a'} \right), as outputs.
     Step 8: Keep the minimum of these two Q-values, \min \left( {{Q_{t1}},{Q_{t2}}} \right). It represents the approximated value of the next state.
     Step 9: Get the final target of the two critic models: {Q_t} = r + \gamma \min \left( {{Q_{t1}},{Q_{t2}}} \right), where \gamma is the discount factor.
     Step 10: The two critic models each take the couple (s, a) as input and return two Q-values, Q_1(s, a) and Q_2(s, a), as outputs.
     Step 11: Compute the loss coming from the two critic models: Critic\_Loss = MSE\_Loss\left( {{Q_1}\left( {s,a} \right),{Q_t}} \right) + MSE\_Loss\left( {{Q_2}\left( {s,a} \right),{Q_t}} \right)
     Step 12: Backpropagate this critic loss and update the parameters of the two critic models with an SGD optimizer.
     Step 13: Once every two iterations, update the actor model by performing gradient ascent on the output of the first critic model:

    \[{\nabla _\phi }J\left( \phi \right){\rm{ = }}{N^{ - 1}}\sum {{\nabla _a}{Q_{{\theta _1}}}\left( {s,a} \right)\left| {_{a = {{\rm{\pi }}_\phi }(s)}{\nabla _\phi }{{\rm{\pi }}_\phi }(s)} \right.} \]

          where \phi and \theta_1 represent the weights of the actor and the first critic, respectively.
     Step 14: Still once every two iterations, update the weights of the actor target by Polyak averaging:

    \[\phi ' \leftarrow \tau \phi + (1 - \tau )\phi '\]

     Step 15: Still once every two iterations, update the weights of the two critic targets by Polyak averaging:

    \[{{\theta '}_i} \leftarrow \tau {\theta _i} + (1 - \tau ){{\theta '}_i}\]
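     To place these steps in context, here is a sketch of the outer interaction loop referred to above (classic Gym API). The `env`, `actor`, `replay_buffer`, and `train_iteration` objects are hypothetical, and the small exploration noise added to the actor's actions is a detail from the original TD3 implementation rather than the step list above.

```python
import numpy as np

def run_training(env, actor, replay_buffer, train_iteration,
                 max_timesteps=500_000, start_timesteps=10_000, expl_noise=0.1):
    s, done = env.reset(), False
    for t in range(max_timesteps):
        # The first 10,000 actions are played randomly; afterwards the actor
        # model plays, with small Gaussian exploration noise added.
        if t < start_timesteps:
            a = env.action_space.sample()
        else:
            a = actor.select_action(np.array(s))
            a = (a + np.random.normal(0, expl_noise, size=a.shape)).clip(
                env.action_space.low, env.action_space.high)
        s2, r, done, _ = env.step(a)
        replay_buffer.add((s, s2, a, r))   # populate the memory (cf. Step 1)
        s = env.reset() if done else s2
        if t >= start_timesteps:
            train_iteration(t)             # one update covering Steps 4-15
```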

     I drew this process in the figure below to illustrate the iteration process more intuitively.
   We can also refer to the original diagram (from reference 3), shown below, for understanding.

Version 2:
     The full code implementation is as follows; three environments are tested in this code, and all of them work well.

     The iteration process for the Ant environment is as follows:
   
     The final rendered results are shown as follows:

5. References

  1. Fujimoto, S., van Hoof, H., and Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018 (the original TD3 paper).
  2. OpenAI Spinning Up Documentation: Twin Delayed DDPG (TD3).
  3. Deep Reinforcement Learning 2.0 (Udemy course).
