Policy Gradient Methods Overview

Policy gradient methods have several advantages: better convergence properties, effectiveness in high-dimensional or continuous action spaces, and the ability to learn stochastic policies.

1. Policy-based Methods

      In value-based approaches, in order to learn a policy we first approximate the optimal state-value function or action-value function with parameters \theta,

    \[\begin{aligned} V_\theta(s) &\approx V^{\pi}(s)\\ Q_\theta(s,a) &\approx Q^{\pi}(s,a) \end{aligned}\]

then use V_\theta or Q_\theta to extract a policy, e.g. by acting \epsilon-greedily with respect to Q_\theta.
      Why not learn the policy directly? Policy-based methods parameterize the policy itself:

    \[\pi_\theta(a \mid s) = \mathbb{P}\left[ a \mid s; \theta \right]\]

      After a short derivation, the agent's goal of maximizing the total accumulated reward can be expressed directly in terms of the policy parameters. We can then use different approaches to optimize the policy: gradient-free methods such as hill climbing, the simplex method, genetic algorithms, the cross-entropy method and evolution strategies, or gradient-based methods such as REINFORCE, Vanilla Policy Gradient (VPG) and Natural Policy Gradient (NPG).
      Here we focus on the basic and widely used gradient-based algorithms, REINFORCE and VPG, to present the main ideas, the training procedure and how they are applied to practical reinforcement learning problems.
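      To make the parameterization \pi_\theta(a \mid s) introduced above concrete, here is a minimal sketch of a linear-softmax policy over a discrete action space; the feature dimension, number of actions and random state features below are illustrative assumptions, not part of any particular algorithm:

```python
import numpy as np

# Minimal sketch: a directly parameterized policy pi_theta(a|s) as a
# linear-softmax distribution over a discrete action space.
# The dimensions and the random "state features" are illustrative assumptions.
def softmax_policy(theta, state_features):
    """Return the action probabilities pi_theta(. | s) for one state."""
    logits = theta @ state_features            # one logit per action
    logits -= logits.max()                     # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

num_actions, feature_dim = 3, 4
theta = np.zeros((num_actions, feature_dim))   # policy parameters theta
s = np.random.randn(feature_dim)               # example state feature vector
print(softmax_policy(theta, s))                # uniform probabilities at initialization
```

      Sampling an action from these probabilities directly yields a stochastic policy, with no intermediate value function.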

2. Policy Gradient

      The objective function of a reinforcement learning algorithm is:

    \[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \sum_t \gamma^t r(s_t, a_t) \right] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau \approx \frac{1}{N}\sum_{i = 1}^N \sum_{t = 1}^T \gamma^t r(s_{i,t}, a_{i,t})\]

      This performance measure J(\theta) can be approximated by sampling trajectories and averaging their discounted returns. The goal of the agent, i.e. of the learned policy, is to maximize J(\theta):

    \[\theta^* = \mathop{\arg\max}\limits_\theta J(\theta)\]
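      The sample-based approximation of J(\theta) can be sketched in a few lines; sample_trajectory below is a hypothetical placeholder for one environment rollout under \pi_\theta and is not defined in the text above:

```python
import numpy as np

# Sketch of the Monte-Carlo approximation of J(theta): sample N trajectories
# under pi_theta and average their discounted returns.
# `sample_trajectory` is a hypothetical rollout function that returns the list
# of rewards r(s_t, a_t) collected in one episode.
def estimate_objective(sample_trajectory, num_trajectories=100, gamma=0.99):
    returns = []
    for _ in range(num_trajectories):
        rewards = sample_trajectory()
        discounted = sum(gamma ** t * r for t, r in enumerate(rewards))
        returns.append(discounted)
    return np.mean(returns)   # J(theta) ~ (1/N) sum_i sum_t gamma^t r(s_{i,t}, a_{i,t})
```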

      Note that: 

    \[\begin{aligned} \pi_\theta(\tau) &= \pi_\theta(s_1, a_1, \cdots, s_T, a_T) = p(s_1)\prod_{t = 1}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\\ \log \pi_\theta(\tau) &= \log p(s_1) + \sum_{t = 1}^T \left[ \log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \right] \end{aligned}\]

    \[\begin{aligned} \nabla_\theta J(\theta) &= \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau\\ &= \int \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} r(\tau)\, d\tau\\ &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau\\ &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(\tau)\, r(\tau) \right] \end{aligned}\]

      The gradient of J\left( \theta \right) with respect to \theta can be written as:

    \[\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(\tau)\, r(\tau) \right]\\ &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \nabla_\theta \left( \log p(s_1) + \sum_{t = 1}^T \left( \log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \right) \right) r(\tau) \right]\\ &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \left( \sum_{t = 1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) r(\tau) \right]\\ &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ \left( \sum_{t = 1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \left( \sum_{t = 1}^T \gamma^t r(s_t, a_t) \right) \right]\\ &\approx \frac{1}{N}\sum_{i = 1}^N \left( \sum_{t = 1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t = 1}^T \gamma^t r(s_{i,t}, a_{i,t}) \right) \end{aligned}\]

      Since the action taken at time t cannot affect the rewards received before time t, the formula above is equivalent to:

    \[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i = 1}^N \sum_{t = 1}^T \left[ \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t' = t}^T \gamma^{t'} r(s_{i,t'}, a_{i,t'}) \right) \right] = \frac{1}{N}\sum_{i = 1}^N \sum_{t = 1}^T \left[ \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t} \right]\]

      The term \sum_{t' = t}^T \gamma^{t'} r(s_{i,t'}, a_{i,t'}) is just the Q function (the reward-to-go) by definition. Meanwhile, by starting the sum at t' = t instead of at the beginning of the trajectory, the number of reward terms is reduced, which reduces the variance of the estimate.
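      As a small computational sketch, the reward-to-go term \hat{Q}_{i,t} for one trajectory can be computed with a single backward pass; discounting is kept relative to the start of the trajectory (\gamma^{t'}) to match the formula above, although many implementations discount relative to t instead:

```python
import numpy as np

# Sketch: reward-to-go Q_hat_t = sum_{t'=t}^T gamma^{t'} * r_{t'} for every t
# of a single trajectory, computed as a backward (suffix) sum.
def rewards_to_go(rewards, gamma=0.99):
    T = len(rewards)
    q_hat = np.zeros(T)
    suffix_sum = 0.0
    for t in reversed(range(T)):
        suffix_sum += gamma ** t * rewards[t]   # add gamma^t * r_t to the suffix sum
        q_hat[t] = suffix_sum
    return q_hat

print(rewards_to_go([1.0, 0.0, 1.0], gamma=0.9))  # [1.81, 0.81, 0.81]
```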
      Furthermore, a baseline b is introduced to further reduce the variance.

    \[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i = 1}^N \sum_{t = 1}^T \left[ \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t' = t}^T \gamma^{t'} r(s_{i,t'}, a_{i,t'}) - b \right) \right]\]

      A natural choice for b is the value function V(s), which gives

    \[A = \sum_{t' = t}^T \gamma^{t'} r(s_{i,t'}, a_{i,t'}) - b = Q - V\]

where A is the advantage function, measuring how much better it is for the agent to take action a in state s than the average. An analytical solution for the optimal baseline b is

    \[b = \frac{\mathbb{E}\left[ \left( \nabla_\theta \log \pi_\theta(\tau) \right)^2 r(\tau) \right]}{\mathbb{E}\left[ \left( \nabla_\theta \log \pi_\theta(\tau) \right)^2 \right]}\]
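      For completeness, a rough sketch of estimating this optimal baseline from samples is given below; the formula is applied per parameter component, and the arrays grad_log_probs and trajectory_returns are assumed to have been precomputed from N sampled trajectories:

```python
import numpy as np

# Sketch: per-parameter estimate of the optimal baseline
#   b = E[(grad_theta log pi_theta(tau))^2 * r(tau)] / E[(grad_theta log pi_theta(tau))^2]
# grad_log_probs: shape (N, dim_theta), grad_theta log pi_theta(tau_i) for each trajectory.
# trajectory_returns: shape (N,), the total return r(tau_i) of each trajectory.
def optimal_baseline(grad_log_probs, trajectory_returns):
    g_squared = grad_log_probs ** 2
    numerator = (g_squared * trajectory_returns[:, None]).mean(axis=0)
    denominator = g_squared.mean(axis=0) + 1e-8   # avoid division by zero
    return numerator / denominator                # one baseline value per parameter
```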

      Tips:
      The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.

3. REINFORCE

      The REINFORCE algorithm is also called the Monte-Carlo Policy Gradient algorithm. It samples multiple trajectories following the policy \pi_\theta and updates \theta using the gradient estimated from these samples. The update rule is as follows:

    \[\begin{aligned} \nabla_\theta J(\theta) &\approx \frac{1}{N}\sum_{i = 1}^N \sum_{t = 1}^T \left[ \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t' = t}^T \gamma^{t'} r(s_{i,t'}, a_{i,t'}) \right) \right]\\ &= \frac{1}{N}\sum_{i = 1}^N \sum_{t = 1}^T \left[ \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, G_{i,t} \right] \end{aligned}\]

      The pseudocode is as follows:
      Code implementation:
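      The original implementation is not reproduced here; below is a minimal REINFORCE sketch in PyTorch, where the CartPole-v1 environment, network size, learning rate and number of episodes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import gymnasium as gym

# Minimal REINFORCE (Monte-Carlo policy gradient) sketch on CartPole-v1.
# Hyperparameters and architecture are illustrative choices.
env = gym.make("CartPole-v1")
obs_dim, n_actions = env.observation_space.shape[0], env.action_space.n
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    # sample one trajectory following pi_theta
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Monte-Carlo return G_t from each step (discounted relative to t)
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # gradient ascent on sum_t log pi(a_t|s_t) * G_t (minimize the negative)
    loss = -(torch.stack(log_probs) * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```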

4. Vanilla Policy Gradient

   The REINFORCE algorithm, which uses G_{i,t} for gradient estimation, often suffers from high variance across episodes. As shown in the section above, subtracting a baseline b(s) from G_{i,t} helps to reduce this variance, as long as the baseline does not depend on the action a.

    \[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i = 1}^N \sum_{t = 1}^T \left[ \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( G_{i,t} - b(s_{i,t}) \right) \right]\]

      The term \left( G_{i,t} - b(s_{i,t}) \right) measures how much better the agent did after time step t than the baseline b(s_{i,t}) and is called the advantage. It can be estimated from the sampled trajectory i:

    \[\hat{A}_t = G_{i,t} - b(s_{i,t})\]

      Subtracting a baseline does not bias the gradient estimate but can greatly reduce its variance. In this way, the REINFORCE algorithm becomes the Vanilla Policy Gradient algorithm, whose pseudocode is as follows:
      A natural choice is to take b(s_t) to be the value function V(s_t). We then define the advantage function as A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s) and use an estimate \hat{V}(s_t, \mathbf{w}) in place of the true state values. The weight vector \mathbf{w} of the baseline (state-value) function and the policy parameters \theta can be learned simultaneously from the Monte-Carlo trajectory samples.
      Code implementation:
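      Likewise, the original implementation is not included here; this is a minimal Vanilla Policy Gradient sketch with a learned state-value baseline, again using PyTorch and CartPole-v1 as illustrative assumptions, in which the policy parameters \theta and the baseline weights \mathbf{w} are updated from the same Monte-Carlo samples:

```python
import torch
import torch.nn as nn
import gymnasium as gym

# Minimal Vanilla Policy Gradient sketch with a learned state-value baseline.
# Environment, architecture and learning rates are illustrative choices.
env = gym.make("CartPole-v1")
obs_dim, n_actions = env.observation_space.shape[0], env.action_space.n
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
value_opt = torch.optim.Adam(value_fn.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    # sample one trajectory following pi_theta
    obs, _ = env.reset()
    states, log_probs, rewards, done = [], [], [], False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs_t))
        action = dist.sample()
        states.append(obs_t)
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Monte-Carlo return G_t from each step
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # advantage estimate A_hat_t = G_t - V(s_t); detach so the policy update
    # does not backpropagate through the baseline network
    values = value_fn(torch.stack(states)).squeeze(-1)
    advantages = returns - values.detach()

    # policy update: ascend sum_t log pi(a_t|s_t) * A_hat_t
    policy_loss = -(torch.stack(log_probs) * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # baseline (state-value) update: regress V(s_t; w) toward G_t
    value_loss = ((values - returns) ** 2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```

      Detaching the value estimates when forming the advantage keeps the baseline from being trained by the policy loss; it is updated only by its own regression loss.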

      Compared with actor-critic methods, the key difference is whether the value function is used for policy evaluation (as a critic) or only as a baseline.

5. Summary

      A single diagram summarizing the basic policy gradient methods and their relations is shown below:
