Trust Region Policy Optimization (TRPO)

TRPO can be viewed as a combination of the natural policy gradient, a line search strategy, and the monotonic improvement theorem. It updates the policy by taking the largest step that improves performance while satisfying a constraint on how far the new policy is allowed to move from the old one. The constraint is expressed in terms of the KL divergence, a measure of (something like, but not exactly) distance between probability distributions. We will walk through the derivation and details in the following sections.

1. Overview

    Just keep in mind: TRPO = NPG + line search + monotonic improvement theorem!

2. Formulation

STEP 1: Problem Setup
      There are two ways to write the expected reward of a policy:
      Approach 1: Expected Discounted Reward
      (1) The objective function to optimize:

    \[\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum\limits_{t = 0}^\infty \gamma^t r(s_t)\right]\]

      where s_0 \sim \rho_0(s_0),\; a_t \sim \pi(a_t|s_t),\; s_{t+1} \sim P(s_{t+1}|s_t, a_t).
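      As a quick sanity check, this objective can be estimated by Monte Carlo: sample trajectories under \pi and average their discounted returns. A minimal Python sketch of this idea (my own illustration, assuming the reward sequences have already been collected under \pi):

    import numpy as np

    def discounted_return(rewards, gamma=0.99):
        # sum_t gamma^t * r_t for a single sampled trajectory
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    def estimate_eta(reward_trajectories, gamma=0.99):
        # Monte Carlo estimate of eta(pi): average discounted return over
        # trajectories sampled with s0 ~ rho0 and a_t ~ pi(.|s_t).
        return np.mean([discounted_return(r, gamma) for r in reward_trajectories])

    # toy example: two reward sequences sampled under some policy pi
    print(estimate_eta([[1.0, 0.0, 1.0], [0.5, 0.5]], gamma=0.9))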
      Can we express the expected return of another policy in terms of the advantage over the original policy?
      (2) Yes. It can be shown that such an expression exists and leads to a guaranteed increase in performance.
      Approach 2: Compute the reward of a policy using another policy

    \[\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde\pi}\left[\sum\limits_{t = 0}^\infty \gamma^t A_\pi(s_t, a_t)\right] = \eta(\pi) + \sum\limits_s \rho_{\tilde\pi}(s) \sum\limits_a \tilde\pi(a|s)\, A_\pi(s, a)\]

      For a new policy \tilde\pi, \eta(\tilde\pi) can be viewed as the expected return of \tilde\pi expressed in terms of the advantage over the old policy \pi.
      You might wonder why the advantage is used. Intuitively, it measures how good the new policy is relative to the average performance of the old policy. \eta of the new policy can be rewritten in the following form, where \rho_{\tilde\pi} is the (unnormalized) discounted visitation frequency and s_0 \sim \rho_0:

    \[\begin{array}{rl}\eta(\tilde\pi) &= \eta(\pi) + \sum\limits_{t = 0}^\infty \sum\limits_s P(s_t = s \mid \tilde\pi) \sum\limits_a \tilde\pi(a|s)\, \gamma^t A_\pi(s, a)\\ &= \eta(\pi) + \sum\limits_s \sum\limits_{t = 0}^\infty \gamma^t P(s_t = s \mid \tilde\pi) \sum\limits_a \tilde\pi(a|s)\, A_\pi(s, a)\\ &= \eta(\pi) + \sum\limits_s \rho_{\tilde\pi}(s) \sum\limits_a \tilde\pi(a|s)\, A_\pi(s, a)\end{array}\]

    \[\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots\]
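      For intuition, the discounted visitation frequency can be estimated from sampled state sequences by accumulating \gamma^t for each visited state. A small Python sketch (my own, not from the original post):

    from collections import defaultdict

    def estimate_rho(state_trajectories, gamma=0.99):
        # rho_pi(s) ~= average over trajectories of sum_t gamma^t * 1[s_t = s]
        # (unnormalized: summed over states the weights approach 1/(1-gamma), not 1)
        rho = defaultdict(float)
        for states in state_trajectories:
            for t, s in enumerate(states):
                rho[s] += (gamma ** t) / len(state_trajectories)
        return dict(rho)

    # toy example with integer-labeled states
    print(estimate_rho([[0, 1, 1, 2], [0, 2, 2]], gamma=0.9))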

      The above formula is hard to optimize directly because \rho_{\tilde\pi} depends heavily on the new policy \tilde\pi, which we do not know yet. However, we can approximate \eta(\tilde\pi) with L_\pi(\tilde\pi).
      (3) Make an approximation to \eta(\tilde\pi) to remove the dependency on the discounted visitation frequencies under the new policy.
      The local approximation:

    \[\begin{array}{l}L_\pi(\tilde\pi) = \eta(\pi) + \sum\limits_s \rho_\pi(s) \sum\limits_a \tilde\pi(a|s)\, A_\pi(s, a)\\ L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0})\\ \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}\end{array}\]

      Note that L_\pi uses the visitation frequency \rho_\pi rather than \rho_{\tilde\pi}; that is, we replace \rho_{\tilde\pi} with \rho_\pi and ignore the change in state visitation frequency caused by the change in policy. To put it in simple terms, we assume that the state visitation frequencies are not too different for the new and the old policy. The two identities above show that L matches \eta to first order around \theta_0, so a small enough step that improves L also improves \eta, but the gradient alone does not tell us how big a step we can safely take.
      Kakade and Langford (2002) proposed a policy update method called conservative policy iteration (CPI), shown as follows:

    \[\pi_{new}(a|s) = (1 - \alpha)\, \pi_{old}(a|s) + \alpha\, \pi'(a|s)\]

where \pi_{new} is the new policy, \pi_{old} is the old policy, and \pi' is the policy that maximizes L_{\pi_{old}}, namely \pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi').
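      As a toy illustration of this mixture update (my own sketch, not from the original post), mixing two categorical action distributions state by state looks like this:

    import numpy as np

    def mixture_policy(pi_old, pi_prime, alpha):
        # pi_new(a|s) = (1 - alpha) * pi_old(a|s) + alpha * pi_prime(a|s)
        # pi_old, pi_prime: arrays of shape [num_states, num_actions] whose rows
        # sum to 1; the mixture rows then also sum to 1.
        return (1.0 - alpha) * pi_old + alpha * pi_prime

    pi_old = np.array([[0.9, 0.1], [0.5, 0.5]])
    pi_prime = np.array([[0.2, 0.8], [0.7, 0.3]])
    print(mixture_policy(pi_old, pi_prime, alpha=0.1))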
      Kakade and Langford derived a guaranteed-improvement bound for this mixture update; Schulman et al. generalized it from mixture policies to general stochastic policies in the TRPO paper (let's call it Theorem 1):

    \[\eta(\tilde\pi) \ge L_\pi(\tilde\pi) - C\, D_{KL}^{\max}(\pi, \tilde\pi)\]

      where C = \frac{4\epsilon\gamma}{(1-\gamma)^2} and \epsilon = \max_{s,a}\left|A_\pi(s,a)\right|.
      C is the penalty coefficient, and D_{KL}^{\max} denotes the maximum over states of the KL divergence between the two policies. The KL divergence originates in information theory and measures the information lost when one distribution is used to approximate another. Simply put, you can view it as how different the two policies, \pi and \tilde\pi, are.
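      To get a feel for the size of this penalty (a worked example of my own, not from the paper): with \gamma = 0.99 and \epsilon = 1,

    \[C = \frac{4 \times 1 \times 0.99}{(1 - 0.99)^2} = \frac{3.96}{0.0001} = 39600\]

      so even a small KL divergence gets multiplied by a huge coefficient, which is why the theoretically recommended penalty leads to very conservative updates (see the discussion of step sizes below).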
      (4) Find the lower bound that guarantees improvement using MM (minorization-maximization)
      The above inequality implies that the expected long-term reward \eta improves monotonically as long as we maximize the right-hand side. Why? Let's define the right-hand side of the inequality as M_i (Schulman et al., 2015):

    \[M_i(\pi) = L_{\pi_i}(\pi) - C\, D_{KL}^{\max}(\pi_i, \pi)\]

      We can then prove the following inequality.

    \[\begin{array}{l}\eta(\pi_{i+1}) \ge M_i(\pi_{i+1})\\ \eta(\pi_i) = M_i(\pi_i)\end{array}\]

      then,

    \[\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)\]

      Therefore, if \pi_{i+1} maximizes M_i, then M_i(\pi_{i+1}) \ge M_i(\pi_i), so the right-hand side is non-negative and \eta cannot decrease. The complex problem we are trying to solve thus boils down to maximizing M_i. Namely,

    \[\mathop{\mathrm{maximize}}\limits_\theta \left[L_{\theta_{old}}(\theta) - C\, D_{KL}^{\max}(\theta_{old}, \theta)\right]\]

      The following graph visually illustrates the approximation of \eta with L:
    More explanation from the references (…) helps with understanding:
      a) In practice, if the penalty coefficient is included in the objective function, the step size will be very small, leading to long training times. Consequently, a constraint on the KL divergence is used to allow a larger step size while still guaranteeing robust performance.
      b) Having the penalty coefficient C in the preceding equation causes the step size to be very small, which in turn slows down the updates. So we impose a constraint on the KL divergence between the old policy and the new policy, known as the trust region constraint, which helps us find a workable step size.
      c) In practice, if we use the penalty coefficient C recommended by the theory above, the step sizes would be very small. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint, and to use the average KL instead of the maximum KL (a heuristic approximation):

    \[\begin{array}{l}\mathop{\mathrm{maximize}}\limits_\theta \; L_{\theta_{old}}(\theta)\\ \mathrm{subject\ to}\;\; D_{KL}^{\max}(\theta_{old}, \theta) \le \delta\end{array}\]

      The KL divergence constraint is imposed on every state in the state space, and the maximum over states must stay below a small number \delta. Unfortunately, this problem is impractical to solve because it places one constraint on every state, and there may be infinitely many states. The paper proposes a heuristic approximation that uses the expected KL divergence over visited states instead of the maximum KL divergence (KL divergence weighted by the state visitation frequency \rho):

    \[\bar D_{KL}^{\rho}(\theta_1, \theta_2) \doteq \mathbb{E}_{s \sim \rho}\left[D_{KL}\big(\pi_{\theta_1}(\cdot|s)\,\|\,\pi_{\theta_2}(\cdot|s)\big)\right]\]

      So now, we can rewrite our preceding objective function with the average KL divergence constraint as:

    \[\begin{array}{l}\mathop{\mathrm{maximize}}\limits_\theta \; L_{\theta_{old}}(\theta)\\ \mathrm{subject\ to}\;\; \bar D_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta\end{array}\]

      Now, the objective function becomes the following when we expand the first line:

    \[\begin{array}{l}\mathop{\mathrm{maximize}}\limits_\theta \sum\limits_s \rho_{\theta_{old}}(s) \sum\limits_a \pi_\theta(a|s)\, A_{\theta_{old}}(s, a)\\ \mathrm{subject\ to}\;\; \bar D_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta\end{array}\]

      (5) Sample-based estimation
      Now replace the sum over states \sum\limits_s \rho_{\theta_{old}}(s) with the expectation \mathbb{E}_{s \sim \rho_{\theta_{old}}}, and replace the sum over actions with an importance sampling estimator:

    \[\sum\limits_a \pi_\theta(a|s)\, A_{\theta_{old}}(s, a) = \mathbb{E}_{a \sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)}\, A_{\theta_{old}}(s, a)\right]\]

      We can rewrite the above as the final objective function (choosing the sampling distribution q = \pi_{\theta_{old}}):

    \[\begin{array}{l}\mathop{\mathrm{maximize}}\limits_\theta \; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\, A_{\theta_{old}}(s, a)\right]\\ \mathrm{subject\ to}\;\; \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\right] \le \delta\end{array}\]

      The objective function is also called a "surrogate" objective function because it contains a probability ratio between the current policy and the new policy. TRPO addresses a problem seen in methods such as DDPG: performance does not improve monotonically. The region that satisfies the constraint is called the trust region. As long as the policy change is reasonably small, the approximation stays close to the true objective. By choosing new policy parameters that maximize this expectation subject to the KL divergence constraint, a lower bound on the expected long-term reward \eta is maintained. This also implies that you don't need to worry too much about the step size with TRPO.
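      To make the sample-based estimator concrete, here is a minimal NumPy sketch (my own illustration, assuming discrete actions and that the old-policy log-probabilities and advantages were stored when the data was collected):

    import numpy as np

    def surrogate_and_kl(new_logp, old_logp, advantages, new_probs, old_probs):
        # Surrogate objective: mean over samples of ratio * advantage, where
        # ratio = pi_theta(a|s) / pi_theta_old(a|s) = exp(new_logp - old_logp)
        # for the actions that were actually taken.
        ratio = np.exp(new_logp - old_logp)
        surrogate = np.mean(ratio * advantages)

        # Mean KL(pi_old || pi_theta) over sampled states, for categorical
        # policies given as full probability vectors per state.
        kl = np.mean(np.sum(old_probs * (np.log(old_probs) - np.log(new_probs)), axis=1))
        return surrogate, kl

    # toy example: 2 sampled (s, a) pairs, 3 discrete actions
    old_probs = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])
    new_probs = np.array([[0.25, 0.45, 0.3], [0.55, 0.35, 0.1]])
    old_logp = np.log(np.array([0.5, 0.6]))   # prob of the taken action under pi_old
    new_logp = np.log(np.array([0.45, 0.55])) # prob of the taken action under pi_theta
    adv = np.array([1.0, -0.5])
    print(surrogate_and_kl(new_logp, old_logp, adv, new_probs, old_probs))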
STEP 2: Problem Solving
      (6) Using line-search (Approximation, Fisher matrix, Conjugate gradient)
      The loss function is defined as

    \[L(\pi) = \mathbb{E}_{\tau \sim \pi_k}\left[\sum\limits_{t = 0}^\infty \gamma^t \frac{\pi(a_t|s_t)}{\pi_k(a_t|s_t)}\, A^{\pi_k}(s_t, a_t)\right]\]

      As mentioned before, the lower-bound function M should be easy to optimize. In fact, we approximate L and the mean KL divergence with a Taylor series: first order for L and second order for the KL divergence:

    \[\begin{array}{l}L_{\theta_k}(\theta) \approx L_{\theta_k}(\theta_k) + g^T(\theta - \theta_k) + \ldots\\ \bar D_{KL}(\theta \,\|\, \theta_k) \approx \bar D_{KL}(\theta_k \,\|\, \theta_k) + \nabla_\theta \bar D_{KL}(\theta \,\|\, \theta_k)\big|_{\theta_k}(\theta - \theta_k) + \frac{1}{2}(\theta - \theta_k)^T H\, (\theta - \theta_k)\end{array}\]

      After removing the terms that are zero or constant (the zeroth- and first-order terms of the KL expansion vanish at \theta = \theta_k, and L_{\theta_k}(\theta_k) is a constant that does not affect the optimization):

    \[\begin{array}{ll}L_{\theta_k}(\theta) \approx g^T(\theta - \theta_k), & g \doteq \nabla_\theta L_{\theta_k}(\theta)\big|_{\theta_k}\\ \bar D_{KL}(\theta \,\|\, \theta_k) \approx \frac{1}{2}(\theta - \theta_k)^T H\, (\theta - \theta_k), & H \doteq \nabla_\theta^2 \bar D_{KL}(\theta \,\|\, \theta_k)\big|_{\theta_k}\end{array}\]

      where g is the policy gradient and H is the Hessian of the average KL divergence, which coincides with the Fisher information matrix (FIM):

    \[H = \nabla^2 f = \begin{pmatrix}\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}\\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}\end{pmatrix}\]
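      In practice H is never built explicitly; implementations compute Hessian-vector products Hv directly by differentiating the average KL divergence twice (this matters for the conjugate gradient step below). A minimal PyTorch sketch of that trick, using a toy linear categorical policy of my own (not the original post's code):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    states = torch.randn(8, 4)                 # batch of 8 states, 4 features
    W = torch.randn(4, 3, requires_grad=True)  # linear policy with 3 discrete actions

    old_logits = (states @ W).detach()         # old policy is frozen
    new_logits = states @ W                    # current policy (same parameters here)

    def mean_kl(old_logits, new_logits):
        # E_s[ KL( pi_old(.|s) || pi_theta(.|s) ) ] for categorical policies
        old_logp = F.log_softmax(old_logits, dim=-1)
        new_logp = F.log_softmax(new_logits, dim=-1)
        return (old_logp.exp() * (old_logp - new_logp)).sum(dim=-1).mean()

    def hessian_vector_product(kl, param, v, damping=0.1):
        # Compute (H + damping * I) v without forming H, via double backprop.
        # The damping term is a common practical stabilizer, not part of the theory.
        grad = torch.autograd.grad(kl, param, create_graph=True)[0].reshape(-1)
        hv = torch.autograd.grad((grad * v).sum(), param)[0].reshape(-1)
        return hv + damping * v

    v = torch.randn(W.numel())
    print(hessian_vector_product(mean_kl(old_logits, new_logits), W, v).shape)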

      The optimization problem becomes:

    \[\begin{array}{l}\theta_{k+1} = \mathop{\arg\max}\limits_\theta \; g^T(\theta - \theta_k)\\ \mathrm{s.t.}\;\; \frac{1}{2}(\theta - \theta_k)^T H\, (\theta - \theta_k) \le \delta\end{array}\]

      This is a linear objective with a quadratic constraint, which can be solved analytically; the solution is:

    \[\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^T H^{-1} g}}\, H^{-1} g\]

      However, the FIM H and its inverse are very expensive to compute. TRPO instead estimates x \approx H^{-1} g by solving the following linear system for x:

    \[\begin{array}{l}x_k \approx \hat H_k^{-1} \hat g_k\\ \hat H_k x_k \approx \hat g_k\end{array}\]


      It turns out we can solve this linear system with the conjugate gradient method, which is computationally much cheaper than inverting H. Conjugate gradient is similar to gradient descent, but for a quadratic problem it finds the exact solution in at most N iterations, where N is the number of parameters in the model; in practice it is typically truncated after a small number of iterations.
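      A compact NumPy sketch of the conjugate gradient solver and the resulting update step (my own illustration with a small explicit H; in TRPO the matrix-vector product would be the Hessian-vector product of the average KL shown earlier):

    import numpy as np

    def conjugate_gradient(matvec, g, iters=10, tol=1e-10):
        # Solve H x = g approximately, given only a function that computes H @ v.
        x = np.zeros_like(g)
        r = g.copy()          # residual g - H x (x = 0 initially)
        p = r.copy()
        rs_old = r @ r
        for _ in range(iters):
            Hp = matvec(p)
            alpha = rs_old / (p @ Hp)
            x += alpha * p
            r -= alpha * Hp
            rs_new = r @ r
            if rs_new < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    # Toy example with an explicit symmetric positive-definite H.
    H = np.array([[2.0, 0.3], [0.3, 1.0]])
    g = np.array([1.0, 0.5])
    delta = 0.01

    x = conjugate_gradient(lambda v: H @ v, g)    # x ~= H^{-1} g
    step = np.sqrt(2 * delta / (x @ H @ x)) * x   # natural gradient step
    theta_new = np.zeros(2) + step                # theta_k taken to be 0 here
    print(x, step, theta_new)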

3. Pseudocode

      OpenAI Spinning Up version:
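      The original post embeds the TRPO pseudocode figure from the Spinning Up documentation. A paraphrase of that loop (my wording, not the exact figure) is sketched below as Python comments:

    # TRPO outer loop (paraphrased from the Spinning Up description):
    #
    # for k = 0, 1, 2, ...:
    #     1. Collect a set of trajectories D_k by running the current policy pi_k.
    #     2. Compute rewards-to-go and advantage estimates A_t (e.g., with GAE),
    #        based on the current value function V_k.
    #     3. Estimate the policy gradient g_k of the surrogate objective.
    #     4. Use conjugate gradient to compute x_k ~= H_k^{-1} g_k, where H_k is the
    #        Hessian of the sample-average KL divergence.
    #     5. Update the policy by backtracking line search: try steps
    #        alpha^j * sqrt(2*delta / (x_k^T H_k x_k)) * x_k for j = 0, 1, 2, ...
    #        and accept the first one that improves the sample surrogate loss and
    #        satisfies the sample KL constraint.
    #     6. Fit the value function by regression on mean-squared error against the
    #        rewards-to-go.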

4. Implementation

      This code implementation is modified from the original repository; everything is put into a single file to make it easier to run and understand.
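      One core piece of such an implementation is the backtracking line search that enforces both improvement and the KL constraint. A minimal sketch of the idea (my own, with toy stand-in functions rather than the post's actual code):

    import numpy as np

    def line_search(theta_old, full_step, surrogate, mean_kl, delta,
                    backtrack_coeff=0.8, max_backtracks=10):
        # Shrink the proposed step until the surrogate objective improves and the
        # (sampled) KL constraint is satisfied, as in TRPO's update rule.
        old_value = surrogate(theta_old)
        for j in range(max_backtracks):
            theta_new = theta_old + (backtrack_coeff ** j) * full_step
            if surrogate(theta_new) > old_value and mean_kl(theta_old, theta_new) <= delta:
                return theta_new
        return theta_old  # reject the update if no acceptable step is found

    # Toy stand-ins (hypothetical): a quadratic "surrogate" and a quadratic "KL".
    def toy_surrogate(theta):
        return -np.sum((theta - 1.0) ** 2)

    def toy_mean_kl(theta_a, theta_b):
        return 0.5 * np.sum((theta_a - theta_b) ** 2)

    theta = np.zeros(3)
    step = np.array([2.0, 2.0, 2.0])   # deliberately too large a proposed step
    print(line_search(theta, step, toy_surrogate, toy_mean_kl, delta=0.5))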

5. Summary

      Again, no more words, just a single graph as a summary below:

6. References

(1). Trust Region Policy Optimization, Schulman et al., 2015 (arXiv:1502.05477)
(2). OpenAI Spinning Up documentation: TRPO
(3). RL: Proximal Policy Optimization (PPO) Explained
(4). RL: The Math Behind TRPO and PPO
(5). Introduction to Various Reinforcement Learning Algorithms, Part II (TRPO, PPO)
(6). RL: Trust Region Policy Optimization (TRPO) Explained
