Asynchronous Advantage Actor Critic (A3C)

Policy-based methods directly learn a parameterized policy. Value-based methods learn the values of actions and then select actions based on their estimated action values. In policy-based methods a value function may still be used to learn the policy parameters, but it is not required for action selection. Methods that learn approximations to both the policy and a value function are often called actor–critic methods, where ‘actor’ refers to the learned policy and ‘critic’ refers to the learned value function, usually a state-value function.

1. Policy Gradient Recap

      In the previous policy gradient overview post we saw how basic policy-based methods such as REINFORCE and Vanilla Policy Gradient work, so it is natural to extend this family of algorithms to the actor-critic setting. We start with a short recap.
To maximize the performance measure J\left( \theta \right) with respect to \theta, we can use the stochastic gradient-ascent approach

    \[\begin{aligned} \nabla J\left( \theta \right) &= \mathbb{E}_{\pi}\left[ \sum\limits_a \pi\left( a \mid S_t;\theta \right) q_{\pi}\left( S_t,a \right) \frac{\nabla \pi\left( a \mid S_t;\theta \right)}{\pi\left( a \mid S_t;\theta \right)} \right]\\ &= \mathbb{E}_{\pi}\left[ q_{\pi}\left( S_t,A_t \right) \frac{\nabla \pi\left( A_t \mid S_t;\theta \right)}{\pi\left( A_t \mid S_t;\theta \right)} \right] \quad \left( \text{replacing } a \text{ by the sample } A_t \sim \pi \right)\\ &= \mathbb{E}_{\pi}\left[ G_t \frac{\nabla \pi\left( A_t \mid S_t;\theta \right)}{\pi\left( A_t \mid S_t;\theta \right)} \right] \quad \left( \text{because } \mathbb{E}_{\pi}\left[ G_t \mid S_t,A_t \right] = q_{\pi}\left( S_t,A_t \right) \right) \end{aligned}\]

      The REINFORCE update is

    \[{\theta _{t + 1}} = {\theta _t} + \alpha {G_t}\frac{{\nabla {\rm{\pi }}\left( {{A_t}\left| {{S_t};\theta } \right.} \right)}}{{{\rm{\pi }}\left( {{A_t}\left| {{S_t};\theta } \right.} \right)}}\]

      The general discounted case:

    \[{\theta _{t + 1}} = {\theta _t} + \alpha {\gamma ^t}{G_t}\nabla \ln {\rm{\pi }}\left( {{A_t}\left| {{S_t};\theta } \right.} \right)\]
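      As a concrete illustration, here is a minimal sketch of a single-episode REINFORCE update in PyTorch. The network, environment handle, and hyperparameters are illustrative assumptions, not part of the original text; a classic Gym-style API is assumed.

```python
import torch
from torch.distributions import Categorical

# policy_net: any torch.nn.Module mapping a state tensor to action logits (assumed)
# optimizer:  e.g. torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_episode(env, policy_net, optimizer, gamma=0.99):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))          # log pi(A_t | S_t; theta)
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # Monte Carlo return G_t for every time step.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Gradient ascent on sum_t G_t * grad log pi(A_t|S_t; theta), implemented as
    # gradient descent on the negated objective. The gamma^t factor from the
    # discounted objective is dropped here, as is common in practice.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```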

      The corresponding pseudocode for REINFORCE is as follows:
    As a Monte Carlo method, REINFORCE can have high variance and therefore learn slowly. This can be improved by generalizing the policy gradient theorem to compare the action value against an arbitrary baseline b(s):

    \[\nabla J\left( \theta \right) \propto \sum\limits_s {\mu \left( s \right)} \sum\limits_a {\left( {{q_{\rm{\pi }}}\left( {s,a} \right) - b\left( s \right)} \right)} \nabla {\rm{\pi }}\left( {a\left| {s;\theta } \right.} \right)\]

      The baseline can be any function or random variable that does not vary with the action a. The update rule for REINFORCE with a general baseline then becomes:

    \[{\theta _{t + 1}} = {\theta _t} + \alpha \left( {{G_t} - b\left( {{S_t}} \right)} \right)\frac{{\nabla {\rm{\pi }}\left( {{A_t}\left| {{S_t};\theta } \right.} \right)}}{{{\rm{\pi }}\left( {{A_t}\left| {{S_t};\theta } \right.} \right)}}\]

      Introducing a baseline leaves the gradient estimate unbiased but can reduce its variance. One natural choice for the baseline is an estimate of the state value \hat v\left( {{S_t};{\bf{w}}} \right), where {\bf{w}} \in {\mathbb{R}^m} is a weight vector learned with the usual value-function approximation methods.
The pseudocode for REINFORCE with baseline, using such a learned state-value function as the baseline, is as follows:
      Tips: the REINFORCE-with-baseline method learns both a policy and a state-value function.
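      A minimal sketch of how such a learned state-value baseline might be added on top of the REINFORCE update above. The value network, the second optimizer, and the tensor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# states:    tensor of visited states S_0..S_{T-1} for one episode
# log_probs: stacked log pi(A_t | S_t; theta) from the episode (with grad)
# returns:   tensor of Monte Carlo returns G_t (as computed above)
# value_net: torch.nn.Module mapping a state to a scalar v_hat(S_t; w) (assumed)

def reinforce_with_baseline_update(states, log_probs, returns,
                                   policy_opt, value_opt, value_net):
    values = value_net(states).squeeze(-1)        # v_hat(S_t; w)
    deltas = returns - values.detach()            # G_t - b(S_t), baseline held fixed

    # Actor: gradient ascent on (G_t - b(S_t)) * grad log pi(A_t | S_t; theta)
    policy_loss = -(log_probs * deltas).sum()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Baseline: move v_hat(S_t; w) toward the observed return G_t
    value_loss = F.mse_loss(values, returns)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```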

2. Actor-Critic Methods

      (Reference from Rich Sutton) Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states), but only as a baseline for the state whose estimate is being updated.
      REINFORCE with baseline is still a Monte Carlo method, so it is unbiased but has high variance and must wait until the end of an episode to compute the return; temporal-difference methods can mitigate these problems to some extent. We take the one-step actor-critic and Q actor-critic methods as examples.

2.1 One-step Actor-Critic

      The one-step actor-critic method replaces the full return of REINFORCE with the one-step return (and uses a learned state-value function as the baseline) as follows:

    \[\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \left( G_{t:t+1} - \hat v\left( S_t;\mathbf{w} \right) \right) \frac{\nabla \pi\left( A_t \mid S_t;\theta_t \right)}{\pi\left( A_t \mid S_t;\theta_t \right)}\\ &= \theta_t + \alpha \left( R_{t+1} + \gamma \hat v\left( S_{t+1};\mathbf{w} \right) - \hat v\left( S_t;\mathbf{w} \right) \right) \frac{\nabla \pi\left( A_t \mid S_t;\theta_t \right)}{\pi\left( A_t \mid S_t;\theta_t \right)}\\ &= \theta_t + \alpha \delta_t \frac{\nabla \pi\left( A_t \mid S_t;\theta_t \right)}{\pi\left( A_t \mid S_t;\theta_t \right)} \end{aligned}\]

      The corresponding pseudocode:
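      Beyond the pseudocode, a minimal PyTorch-style sketch of this per-step update; the network and optimizer names are illustrative assumptions, and a_log_prob is assumed to carry the policy network's computation graph.

```python
import torch
import torch.nn.functional as F

def one_step_ac_update(s, a_log_prob, r, s_next, done,
                       value_net, policy_opt, value_opt, gamma=0.99):
    """One actor-critic update using the TD error delta_t as the critic signal."""
    v_s = value_net(s).squeeze(-1)                  # v_hat(S_t; w)
    with torch.no_grad():
        v_next = torch.zeros(()) if done else value_net(s_next).squeeze(-1)
        target = r + gamma * v_next                 # R_{t+1} + gamma * v_hat(S_{t+1}; w)

    delta = (target - v_s).detach()                 # TD error delta_t

    # Actor: theta <- theta + alpha * delta_t * grad log pi(A_t | S_t; theta)
    policy_loss = -delta * a_log_prob
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Critic: move v_hat(S_t; w) toward the one-step target, i.e. a squared TD loss
    value_loss = F.mse_loss(v_s, target)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```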

2.2 Q Actor-Critic

      Consider the gradient of the performance measure J\left( \theta \right) with respect to \theta:

    \[{\nabla _\theta }J\left( \theta \right) = {\mathbb{E}_\tau }\left[ {\sum\limits_{t = 0}^{T - 1} {{\nabla _\theta }\log {{\rm{\pi }}_\theta }\left( {{a_t}\left| {{s_t}} \right.} \right){G_t}} } \right]\]

      Decompose the expectation as follows (this is an alternative derivation of the actor-critic method, rather than the one based on causality and temporal structure in the Berkeley CS294 lecture):

    \[{\nabla _\theta }J\left( \theta \right) = \mathbb{E}_{s_0,a_0, \ldots ,s_t,a_t}\left[ \sum\limits_{t = 0}^{T - 1} \nabla_\theta \log \pi_\theta\left( a_t \mid s_t \right)\, \mathbb{E}_{r_{t + 1},s_{t + 1}, \ldots ,s_T,a_T}\left[ G_t \right] \right]\]

      By definition, the inner expectation is just the Q value

    \[{\mathbb{E}_{{r_{t + 1}},{s_{t + 1}}, \ldots ,{s_T},{a_T}}}\left[ {{G_t}} \right] = Q\left( {{s_t},{a_t}} \right)\]

Plugging this in, we can rewrite the gradient as:

    \[\begin{aligned} \nabla_\theta J\left( \theta \right) &= \mathbb{E}_{s_0,a_0, \ldots ,s_t,a_t}\left[ \sum\limits_{t = 0}^{T - 1} \nabla_\theta \log \pi_\theta\left( a_t \mid s_t \right) Q\left( s_t,a_t \right) \right]\\ &= \mathbb{E}_{\tau}\left[ \sum\limits_{t = 0}^{T - 1} \nabla_\theta \log \pi_\theta\left( a_t \mid s_t \right) Q\left( s_t,a_t \right) \right] \end{aligned}\]

      When we parameterize the above Q value with a neural network, we obtain the Q actor-critic algorithm below.
      Using a similar derivation as for Q actor-critic, with different forms of the policy gradient, we can obtain a family of algorithms, as summarized below:
      Quick Reference: Actor-Critic Core

    \[J\left( \theta \right) = {\mathbb{E}_{\rm{\pi }}}\left[ {\sum\limits_{t = 0}^T {R\left( {{s_t},{a_t};\theta } \right)} } \right] = \sum\limits_\tau {P\left( {\tau ;\theta } \right)R\left( \tau \right)} \]

    \[\begin{aligned} \mathop {\max }\limits_\theta J\left( \theta \right) &= \mathop {\max }\limits_\theta \sum\limits_\tau P\left( \tau ;\theta \right)R\left( \tau \right)\\ \nabla_\theta J\left( \theta \right) &= \nabla_\theta \sum\limits_\tau P\left( \tau ;\theta \right)R\left( \tau \right)\\ &= \sum\limits_\tau \nabla_\theta P\left( \tau ;\theta \right)R\left( \tau \right)\\ &= \sum\limits_\tau \frac{P\left( \tau ;\theta \right)}{P\left( \tau ;\theta \right)} \nabla_\theta P\left( \tau ;\theta \right)R\left( \tau \right)\\ &= \sum\limits_\tau P\left( \tau ;\theta \right) \frac{\nabla_\theta P\left( \tau ;\theta \right)}{P\left( \tau ;\theta \right)} R\left( \tau \right)\\ &= \sum\limits_\tau P\left( \tau ;\theta \right) \nabla_\theta \log P\left( \tau ;\theta \right) R\left( \tau \right) \end{aligned}\]

      The gradient can be estimated by sampling m trajectories under the current policy:

    \[{\nabla _\theta }J\left( \theta \right) = \frac{1}{m}\sum\limits_{i = 1}^m {{\nabla _\theta }\log P\left( {{\tau ^i};\theta } \right)R\left( {{\tau ^i}} \right)} \]

      Considering

    \[P\left( {\tau ;\theta } \right) = \prod\limits_{t = 0}^T {P\left( {{s_{t + 1}}\left| {{s_t},{a_t}} \right.} \right){{\rm{\pi }}_\theta }\left( {{a_t}\left| {{s_t}} \right.} \right)} \]

    \[\begin{aligned} \nabla_\theta \log P\left( \tau ;\theta \right) &= \nabla_\theta \log \left[ \prod\limits_{t = 0}^T P\left( s_{t + 1} \mid s_t,a_t \right) \pi_\theta\left( a_t \mid s_t \right) \right]\\ &= \nabla_\theta \left[ \sum\limits_{t = 0}^T \log P\left( s_{t + 1} \mid s_t,a_t \right) + \sum\limits_{t = 0}^T \log \pi_\theta\left( a_t \mid s_t \right) \right]\\ &= \nabla_\theta \sum\limits_{t = 0}^T \log P\left( s_{t + 1} \mid s_t,a_t \right) + \nabla_\theta \sum\limits_{t = 0}^T \log \pi_\theta\left( a_t \mid s_t \right)\\ &= \nabla_\theta \sum\limits_{t = 0}^T \log \pi_\theta\left( a_t \mid s_t \right) \end{aligned}\]

      Calculate {\nabla _\theta }J\left( \theta \right) as

    \[{\nabla _\theta }J\left( \theta \right) = \frac{1}{m}\sum\limits_{i = 1}^m {{\nabla _\theta }\log P\left( {{\tau ^i};\theta } \right)R\left( {{\tau ^i}} \right)} = \frac{1}{m}\sum\limits_{i = 1}^m {\sum\limits_{t = 0}^T {{\nabla _\theta }\log {{\rm{\pi }}_\theta }\left( {{a_t}\left| {{s_t}} \right.} \right)} R\left( {{\tau ^i}} \right)} \]

2.3 Code Implementation
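      A minimal, self-contained sketch of a single Q actor-critic update, with the Q network as the critic (updated SARSA-style) and the policy network as the actor. All names and the discrete-action assumption are illustrative; this is not the original implementation.

```python
import torch
import torch.nn.functional as F

def q_actor_critic_update(s, a, a_log_prob, r, s_next, a_next, done,
                          q_net, policy_opt, q_opt, gamma=0.99):
    """One Q actor-critic step; q_net(s) is assumed to output a vector of Q values."""
    q_sa = q_net(s)[a]                                  # Q_w(s_t, a_t)

    # Actor: ascend grad log pi(a_t | s_t; theta) * Q_w(s_t, a_t)
    policy_loss = -a_log_prob * q_sa.detach()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Critic: one-step SARSA target r + gamma * Q_w(s_{t+1}, a_{t+1})
    with torch.no_grad():
        q_next = 0.0 if done else q_net(s_next)[a_next]
        target = r + gamma * q_next
    q_loss = F.mse_loss(q_sa, torch.as_tensor(target, dtype=q_sa.dtype))
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()
```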

Result:

3. Advantage Actor-Critic

      The state-value function is a natural choice of baseline. Using the V function as the baseline, we subtract the V value from the Q value and obtain the advantage:

    \[A\left( {{s_t},{a_t}} \right) = {Q_w}\left( {{s_t},{a_t}} \right) - {V_v}\left( {{s_t}} \right)\]

      It would be inefficient to maintain two neural networks, one for the Q value and one for the V value, so we use the relationship between Q and V given by the Bellman equation:

    \[Q\left( {{s_t},{a_t}} \right) = \mathbb{E}\left[ {{r_{t + 1}} + \gamma V\left( {{s_{t + 1}}} \right)} \right]\]

      Rewrite the advantage as:

    \[A\left( {{s_t},{a_t}} \right) = {r_{t + 1}} + \gamma {V_v}\left( {{s_{t + 1}}} \right) - {V_v}\left( {{s_t}} \right)\]

      Using {r_{t + 1}} + \gamma {V_v}\left( {{s_{t + 1}}} \right) as the target loses one time step of accuracy, but is more efficient because it needs only a single neural network for the V function. The policy gradient can then be written as follows:

    \[\begin{aligned} \nabla_\theta J\left( \theta \right) &\approx \sum\limits_{t = 0}^{T - 1} \nabla_\theta \log \pi_\theta\left( a_t \mid s_t \right)\left( r_{t + 1} + \gamma V_v\left( s_{t + 1} \right) - V_v\left( s_t \right) \right)\\ &= \sum\limits_{t = 0}^{T - 1} \nabla_\theta \log \pi_\theta\left( a_t \mid s_t \right) A\left( s_t,a_t \right) \end{aligned}\]

      This is the gradient used in the Advantage Actor-Critic update rule.
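      In code, this advantage estimate is simply the TD error of a single value network; a minimal sketch (names are illustrative):

```python
import torch

def td_advantage(value_net, s, r, s_next, done, gamma=0.99):
    """A(s_t, a_t) ~ r_{t+1} + gamma * V_v(s_{t+1}) - V_v(s_t), with one V network."""
    with torch.no_grad():
        v_s = value_net(s).squeeze(-1)
        v_next = torch.zeros(()) if done else value_net(s_next).squeeze(-1)
    return r + gamma * v_next - v_s
```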

      There are two main variants of the advantage actor-critic: the synchronous Advantage Actor-Critic (A2C) and the Asynchronous Advantage Actor-Critic (A3C). Because of the asynchronous nature of A3C, some workers (copies of the agent) will be acting with an older version of the parameters, so the aggregated update is not optimal. A2C avoids this by updating the global network synchronously: it waits until all workers have finished their segment of experience and computed their gradients, averages those gradients, and then updates the global network. A new segment of experience is then started with all parallel actors sharing the same new parameters.
      Pseudocode
The pseudocode for the advantage actor-critic is as follows:
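      As a complement to the pseudocode, a rough sketch of one advantage actor-critic update over a segment of experience, assuming the actor and critic are trained with a single optimizer; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def a2c_segment_update(log_probs, values, rewards, last_value, done,
                       optimizer, gamma=0.99, value_coef=0.5):
    """One update over a segment collected by the current policy.

    log_probs:  list of log pi(a_t | s_t) tensors (with grad) for the segment
    values:     list of V_v(s_t) tensors (with grad) for the segment
    rewards:    list of scalar rewards r_{t+1}
    last_value: bootstrap value V_v(s_T) for the state after the segment
    """
    # n-step bootstrapped returns: R = r_{t+1} + gamma * R, seeded with V_v(s_T)
    R = 0.0 if done else float(last_value)
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)

    log_probs = torch.stack(log_probs)
    values = torch.stack(values).squeeze(-1)
    advantages = returns - values.detach()          # A(s_t, a_t) estimates

    policy_loss = -(log_probs * advantages).sum()   # actor term
    value_loss = F.mse_loss(values, returns)        # critic term
    loss = policy_loss + value_coef * value_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```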
      Implementation 1

      Result

      Implementation 2

      Result

      Differences between actor-critic and advantage actor-critic
      Actor-critic is not a single algorithm; it is better viewed as a family of related techniques. These techniques are based on the policy gradient theorem and train some form of value estimate to plug into the update rule as a lower-variance replacement for the end-of-episode returns. They all perform “bootstrapping” by using some form of value prediction.
      Advantage actor-critic specifically uses an estimate of the advantage function {A^{\rm{\pi }}}\left( {{s_t},{a_t}} \right) = {Q^{\rm{\pi }}}\left( {{s_t},{a_t}} \right) - {V^{\rm{\pi }}}\left( {{s_t}} \right) for its bootstrapping, whereas ‘actor-critic’ without the ‘advantage’ qualifier is not specific: the critic could be a trained V(s) function, some estimate of Q(s,a), or a variety of other things.

4. Asynchronous Advantage Actor Critic

      A3C runs asynchronously on a single machine with a standard multi-core CPU, rather than relying on specialized hardware such as GPUs or massively distributed architectures.
      (Selected from the A3C paper) The aim of A3C is to train deep neural network policies reliably and without large resource requirements. First, it uses asynchronous actor-learners running on multiple CPU threads of a single machine. Second, multiple actor-learners run in parallel and explore different parts of the environment with different exploration policies to maximize diversity. By running different exploration policies in different threads, the overall changes made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than those of a single agent applying online updates. Hence, A3C does not use a replay memory and relies on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm (this is why A3C does not need a replay buffer). (Tip: the drawbacks of a replay buffer are that it uses more memory and computation per real interaction, and that it requires off-policy learning algorithms that can update from data generated by an older policy.) Third, the benefits are a reduction in training time that is roughly linear in the number of parallel actor-learners, and the ability to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way, rather than relying on experience replay to stabilize learning.
      A3C maintains a policy {\rm{\pi }}\left( {{a_t}\left| {{s_t}} \right.;\theta } \right) and an estimate of the value function V\left( {{s_t};{\theta _v}} \right). It operates in the forward view (as opposed to the more common backward view used by techniques such as eligibility traces in Rich Sutton's book) and uses the same mix of n-step returns to update both the policy and the value function. The policy and the value function are updated after every t_{\rm max} actions or when a terminal state is reached. The update performed by the algorithm can be seen as {\nabla _{\theta '}}\log {\rm{\pi }}\left( {{a_t}\left| {{s_t}} \right.;\theta '} \right)A\left( {{s_t},{a_t};\theta ,{\theta _v}} \right), where A\left( {{s_t},{a_t};\theta ,{\theta _v}} \right) is an estimate of the advantage function given by \sum\nolimits_{i = 0}^{k - 1} {{\gamma ^i}{r_{t + i}}} + {\gamma ^k}V\left( {{s_{t + k}};{\theta _v}} \right) - V\left( {{s_t};{\theta _v}} \right), where k can vary from state to state and is upper-bounded by {t_{\max }}. The pseudocode for the algorithm is as follows:
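      A heavily simplified sketch of what one actor-learner thread does, using Python threads and a shared PyTorch model. The two-headed model, the shared optimizer, and the Gym-style environment API are assumptions for illustration; this is not the reference implementation.

```python
import torch
from torch.distributions import Categorical

def a3c_worker(env, shared_model, local_model, optimizer,
               gamma=0.99, t_max=5, max_updates=10000):
    """One actor-learner thread (sketch). `optimizer` is assumed to be built over
    shared_model.parameters(); `local_model(state)` is assumed to return
    (action_logits, state_value)."""
    state = env.reset()
    for _ in range(max_updates):
        local_model.load_state_dict(shared_model.state_dict())   # theta', theta_v' <- theta, theta_v
        log_probs, values, rewards, done = [], [], [], False
        for _ in range(t_max):                                   # act for at most t_max steps
            logits, value = local_model(torch.as_tensor(state, dtype=torch.float32))
            dist = Categorical(logits=logits)
            action = dist.sample()
            state, reward, done, _ = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            values.append(value.squeeze())
            rewards.append(reward)
            if done:
                state = env.reset()
                break

        # R = 0 for a terminal state, otherwise bootstrap with V(s_t; theta_v')
        with torch.no_grad():
            R = 0.0 if done else local_model(
                torch.as_tensor(state, dtype=torch.float32))[1].item()

        loss = 0.0
        for t in reversed(range(len(rewards))):                  # accumulate losses backwards
            R = rewards[t] + gamma * R                           # n-step return
            advantage = R - values[t]
            loss = loss + 0.5 * advantage.pow(2)                 # value-function term
            loss = loss - log_probs[t] * advantage.detach()      # policy term

        local_model.zero_grad()
        loss.backward()
        # asynchronous update: hand the local gradients to the shared parameters
        for lp, sp in zip(local_model.parameters(), shared_model.parameters()):
            sp.grad = lp.grad
        optimizer.step()
```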
      Adding the entropy of the policy \pi to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. The gradient of the full objective function including the entropy regularization term with respect to the policy parameters takes the form:

    \[{\nabla _{\theta '}}\log {\rm{\pi }}\left( {{a_t}\left| {{s_t}} \right.;\theta '} \right)\left( {{R_t} - V\left( {{s_t};{\theta _v}} \right)} \right) + \beta {\nabla _{\theta '}}H\left( {{\rm{\pi }}\left( { \cdot \left| {{s_t}} \right.;\theta '} \right)} \right)\]

where H is the entropy. The hyperparameter \beta controls the strength of the entropy regularization term.
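      For discrete actions the entropy term is cheap to add; a short sketch of the per-step policy loss with the entropy bonus, where beta and the distribution object are the only ingredients beyond those used above (names are illustrative):

```python
import torch
from torch.distributions import Categorical

def policy_loss_with_entropy(logits, action, advantage, beta=0.01):
    """-log pi(a_t | s_t) * advantage - beta * H(pi(.|s_t)), to be minimized."""
    dist = Categorical(logits=logits)
    log_prob = dist.log_prob(action)
    entropy = dist.entropy()                 # H(pi(.|s_t; theta'))
    return -(log_prob * advantage + beta * entropy)
```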

      Implementation

      Result
