Natural Policy Gradient

Natural gradient shares common ideas with high-level policy-based methods such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). The basic idea of the natural policy gradient is to use the curvature information of the policy’s distribution over actions in the weight update. This post explains in detail how it is formulated and applied.

1. Minorize-Maximization (MM)

      The MM algorithm is not a single algorithm but a strategy for constructing optimization algorithms. The main idea of MM (minorize-maximization) is, intuitively, that for a maximization problem we first find an approximate lower bound of the original objective as a surrogate objective, then maximize that lower bound so as to improve the original objective (and vice versa for minimization problems). In minimization, MM stands for Majorize–Minimize; in maximization, it stands for Minorize–Maximize. The EM algorithm can be thought of as a special case of MM.
      The explanation of MM (Reference: Wikipedia):
      Take the minorize-maximization version as an example. Let f(θ) be the concave objective function to be maximized. At step m (m = 0, 1, ...) of the algorithm, the constructed function g(θ|θ_m) is a surrogate function at θ_m if

    \[ g(\theta \mid \theta_m) \le f(\theta) \quad \text{for all } \theta, \qquad g(\theta_m \mid \theta_m) = f(\theta_m) \]

      Then, instead of f(θ), we can maximize g(θ|θ_m); in the next iteration, let

    \[ \theta_{m+1} = \mathop{\arg\max}_{\theta}\, g(\theta \mid \theta_m) \]

      This iteration guarantees that f(θ_m) converges to a local optimum or a saddle point as m goes to infinity.
      Now, we get

    \[ f(\theta_{m+1}) \ge g(\theta_{m+1} \mid \theta_m) \ge g(\theta_m \mid \theta_m) = f(\theta_m) \]

      The iteration process and the surrogate functions are shown in the figure below.
      The benefit of MM is that, by substituting a simple optimization problem for a difficult one, it can turn a non-differentiable problem into a smooth one, or linearize an optimization problem.
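      To make the update rule concrete, below is a minimal Python sketch of minorize-maximization. The objective f(θ) = cos(θ) and its quadratic minorizer are illustrative choices of mine (valid because f''(θ) ≥ -1), not taken from any particular reference.

```python
import numpy as np

# Minimal MM sketch (illustrative): maximize f(theta) = cos(theta).
# Since f''(theta) = -cos(theta) >= -1, the quadratic
#   g(theta | theta_m) = cos(theta_m) - sin(theta_m) * (theta - theta_m)
#                        - 0.5 * (theta - theta_m) ** 2
# satisfies g <= f everywhere and g(theta_m | theta_m) = f(theta_m).
def mm_maximize(theta, steps=20):
    for m in range(steps):
        # argmax_theta g(theta | theta_m) in closed form:
        # g'(theta) = -sin(theta_m) - (theta - theta_m) = 0
        theta = theta - np.sin(theta)
        print(f"m={m}: theta={theta:.6f}  f(theta)={np.cos(theta):.6f}")
    return theta

mm_maximize(2.0)  # f(theta_m) increases monotonically toward a local maximum
```

      Each printed f(θ_m) is no smaller than the previous one, which is exactly the chain of inequalities above.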

2. Fisher Information Matrix

      Suppose we use a model parameterized by a vector θ to model a distribution p(x|θ); the parameter θ can be learned by maximizing the likelihood p(x|θ) with respect to θ. To measure the goodness of this estimate of θ, we define the gradient of the log-likelihood of p(x|θ) as the score function:

    \[ s(\theta) = \nabla_\theta \log p(x \mid \theta) \]

      One important property of the score function is that its expected value with respect to the model is zero:

    \[\begin{aligned}\mathbb{E}_{p(x\mid\theta)}\left[ s(\theta) \right] &= \mathbb{E}_{p(x\mid\theta)}\left[ \nabla_\theta \log p(x\mid\theta) \right]\\ &= \int \nabla_\theta \log p(x\mid\theta)\, p(x\mid\theta)\, dx\\ &= \int \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)}\, p(x\mid\theta)\, dx\\ &= \int \nabla_\theta p(x\mid\theta)\, dx\\ &= \nabla_\theta \int p(x\mid\theta)\, dx\\ &= \nabla_\theta 1\\ &= 0\end{aligned}\]

      Fisher Information is defined as the covariance of the score function.
      Since the expected score is zero, this covariance is an uncertainty measure around the expected estimate, telling us how certain we are about the estimate:

    \[ \mathbb{E}_{p(x\mid\theta)}\left[ \left( s(\theta) - 0 \right)\left( s(\theta) - 0 \right)^{\mathrm{T}} \right] \]

      Thus, for a parameter vector θ, the Fisher Information takes matrix form; it is called the Fisher Information Matrix and denoted:

    \[ \mathbb{F} = \mathbb{E}_{p(x\mid\theta)}\left[ \nabla_\theta \log p(x\mid\theta)\, \nabla_\theta \log p(x\mid\theta)^{\mathrm{T}} \right] \]

      In practice, the expectation above is analytically intractable, so it is replaced by an average over samples X = {x_1, x_2, ..., x_N} drawn from the empirical distribution q̂(x). The resulting F is called the Empirical Fisher:

    \[ \mathbb{F} = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p(x_i\mid\theta)\, \nabla_\theta \log p(x_i\mid\theta)^{\mathrm{T}} \]
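      As a sketch, the snippet below computes the empirical Fisher for a univariate Gaussian N(μ, σ²) with θ = (μ, σ). The closed-form score expressions are standard; the parameter values and sample size are arbitrary choices for illustration. It also checks numerically that the expected score is approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: samples from N(mu, sigma^2), with theta = (mu, sigma)
mu, sigma, N = 0.5, 2.0, 100_000
x = rng.normal(mu, sigma, size=N)

# score s(theta) = grad_theta log p(x | theta), one row per sample
score = np.stack([(x - mu) / sigma**2,                       # d log p / d mu
                  ((x - mu)**2 - sigma**2) / sigma**3], 1)   # d log p / d sigma

print(score.mean(axis=0))      # expected score: close to [0, 0]

F = score.T @ score / N        # empirical Fisher: (1/N) * sum_i s_i s_i^T
print(F)                       # close to diag(1/sigma^2, 2/sigma^2) = diag(0.25, 0.5)
```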


Fisher and Hessian
      Property: the negative expected Hessian of the log-likelihood is equal to the Fisher Information Matrix F.
      Proof:

    \[\begin{aligned}\mathrm{H}_{\log p(x\mid\theta)} &= \mathrm{Jacobian}\left( \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)} \right)\\ &= \frac{\mathrm{H}_{p(x\mid\theta)}\, p(x\mid\theta) - \nabla_\theta p(x\mid\theta)\, \nabla_\theta p(x\mid\theta)^{\mathrm{T}}}{p(x\mid\theta)\, p(x\mid\theta)}\\ &= \frac{\mathrm{H}_{p(x\mid\theta)}\, p(x\mid\theta)}{p(x\mid\theta)\, p(x\mid\theta)} - \frac{\nabla_\theta p(x\mid\theta)\, \nabla_\theta p(x\mid\theta)^{\mathrm{T}}}{p(x\mid\theta)\, p(x\mid\theta)}\\ &= \frac{\mathrm{H}_{p(x\mid\theta)}}{p(x\mid\theta)} - \left( \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)} \right)\left( \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)} \right)^{\mathrm{T}}\end{aligned}\]

      Then the expectation of the Hessian is

    \[\begin{aligned}\mathbb{E}_{p(x\mid\theta)}\left[ \mathrm{H}_{\log p(x\mid\theta)} \right] &= \mathbb{E}_{p(x\mid\theta)}\left[ \frac{\mathrm{H}_{p(x\mid\theta)}}{p(x\mid\theta)} \right] - \mathbb{E}_{p(x\mid\theta)}\left[ \left( \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)} \right)\left( \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)} \right)^{\mathrm{T}} \right]\\ &= \int \frac{\mathrm{H}_{p(x\mid\theta)}}{p(x\mid\theta)}\, p(x\mid\theta)\, dx - \mathbb{E}_{p(x\mid\theta)}\left[ \nabla_\theta \log p(x\mid\theta)\, \nabla_\theta \log p(x\mid\theta)^{\mathrm{T}} \right]\\ &= \mathrm{H}_{\int p(x\mid\theta)\, dx} - \mathbb{F}\\ &= \mathrm{H}_{1} - \mathbb{F}\\ &= -\mathbb{F}\end{aligned}\]

      namely,

    \[ \mathbb{F} = -\mathbb{E}_{p(x\mid\theta)}\left[ \mathrm{H}_{\log p(x\mid\theta)} \right] \]
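      This identity is easy to check numerically. The sketch below reuses the Gaussian example from above (again an illustrative assumption) and compares the negative average Hessian of the log-likelihood with the known Fisher matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative Gaussian: theta = (mu, sigma)
mu, sigma, N = 0.5, 2.0, 100_000
x = rng.normal(mu, sigma, size=N)

# per-sample entries of the Hessian of log p(x | mu, sigma)
h_mm = np.full(N, -1.0 / sigma**2)                     # d^2 / dmu^2
h_ms = -2.0 * (x - mu) / sigma**3                      # d^2 / dmu dsigma
h_ss = 1.0 / sigma**2 - 3.0 * (x - mu)**2 / sigma**4   # d^2 / dsigma^2

negH = -np.array([[h_mm.mean(), h_ms.mean()],
                  [h_ms.mean(), h_ss.mean()]])
print(negH)   # close to the Fisher diag(1/sigma^2, 2/sigma^2) = diag(0.25, 0.5)
```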

3. Fisher Information Matrix and KL-divergence

      KL divergence matches our notion of what a distance between probability distributions should look like: it measures differences directly in terms of the probability density functions, that is, differences in density value across the points over which the distributions are defined.
      The relationship between the Fisher Information Matrix and KL-divergence is that the FIM defines the local curvature in distribution space, for which KL-divergence is the metric. That is, the FIM F is the Hessian of the KL-divergence between two distributions p(x|θ) and p(x|θ'), taken with respect to θ' and evaluated at θ' = θ.
      This can be proved as follows:

    \[\begin{aligned}\mathrm{KL}\left[ p(x\mid\theta) \,\|\, p(x\mid\theta') \right] &= \mathbb{E}_{p(x\mid\theta)}\left[ \log p(x\mid\theta) \right] - \mathbb{E}_{p(x\mid\theta)}\left[ \log p(x\mid\theta') \right]\\ \nabla_{\theta'} \mathrm{KL}\left[ p(x\mid\theta) \,\|\, p(x\mid\theta') \right] &= \nabla_{\theta'} \mathbb{E}_{p(x\mid\theta)}\left[ \log p(x\mid\theta) \right] - \nabla_{\theta'} \mathbb{E}_{p(x\mid\theta)}\left[ \log p(x\mid\theta') \right]\\ &= -\mathbb{E}_{p(x\mid\theta)}\left[ \nabla_{\theta'} \log p(x\mid\theta') \right]\\ &= -\int p(x\mid\theta)\, \nabla_{\theta'} \log p(x\mid\theta')\, dx\\ \nabla^2_{\theta'} \mathrm{KL}\left[ p(x\mid\theta) \,\|\, p(x\mid\theta') \right] &= -\int p(x\mid\theta)\, \nabla^2_{\theta'} \log p(x\mid\theta')\, dx\\ \mathrm{H}_{\mathrm{KL}\left[ p(x\mid\theta) \| p(x\mid\theta') \right]} &= -\int p(x\mid\theta)\, \left. \nabla^2_{\theta'} \log p(x\mid\theta') \right|_{\theta'=\theta}\, dx\\ &= -\int p(x\mid\theta)\, \mathrm{H}_{\log p(x\mid\theta)}\, dx\\ &= -\mathbb{E}_{p(x\mid\theta)}\left[ \mathrm{H}_{\log p(x\mid\theta)} \right]\\ &= \mathbb{F}\end{aligned}\]

      Since we know how to evaluate the Hessian of the KL-divergence, we can now derive the steepest descent direction in distribution space.
      Using a second-order Taylor expansion for a small step d (writing p(x|θ) as p_θ for short),

    \[\begin{aligned}\mathrm{KL}\left[ p_\theta \,\|\, p_{\theta+d} \right] &\approx \mathrm{KL}\left[ p_\theta \,\|\, p_\theta \right] + \left( \left. \nabla_{\theta'} \mathrm{KL}\left[ p_\theta \,\|\, p_{\theta'} \right] \right|_{\theta'=\theta} \right)^{\mathrm{T}} d + \frac{1}{2}\, d^{\mathrm{T}} \mathbb{F}\, d\\ &= \mathrm{KL}\left[ p_\theta \,\|\, p_\theta \right] - \mathbb{E}_{p(x\mid\theta)}\left[ \nabla_\theta \log p(x\mid\theta) \right]^{\mathrm{T}} d + \frac{1}{2}\, d^{\mathrm{T}} \mathbb{F}\, d\end{aligned}\]

      The KL-divergence between a distribution and itself is 0,

    \[ \mathrm{KL}\left[ p_\theta \,\|\, p_\theta \right] = 0 \]

and the gradient of the KL-divergence at θ' = θ is also 0, since the expected score is zero (see Section 2). So we get:

    \[ \mathrm{KL}\left[ p_\theta \,\|\, p_{\theta+d} \right] \approx \frac{1}{2}\, d^{\mathrm{T}} \mathbb{F}\, d \]
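      This quadratic approximation can also be verified numerically. The sketch below uses the closed-form KL divergence between two univariate Gaussians and the closed-form Fisher diag(1/σ², 2/σ²) for θ = (μ, σ); the parameter values and the small step d are illustrative choices.

```python
import numpy as np

# closed-form KL between N(mu1, s1^2) and N(mu2, s2^2)
def kl_gauss(mu1, s1, mu2, s2):
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu, sigma = 0.5, 2.0
F = np.diag([1 / sigma**2, 2 / sigma**2])   # Fisher of theta = (mu, sigma)

d = np.array([0.01, -0.02])                 # a small parameter step
kl_exact = kl_gauss(mu, sigma, mu + d[0], sigma + d[1])
kl_quad = 0.5 * d @ F @ d
print(kl_exact, kl_quad)                    # agree to leading order in d
```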

      To take the steepest descent step in distribution space, namely to find the update vector d that decreases the loss the most while moving a fixed distance in KL-divergence, we solve this constrained minimization problem:

    \[ d^* = \mathop{\arg\min}_{d\ \mathrm{s.t.}\ \mathrm{KL}\left[ p_\theta \,\|\, p_{\theta+d} \right] = c} \mathcal{L}(\theta + d) \]

where c is a constant that limits the step size.
      To solve this constrained minimization problem, we convert it to an unconstrained one with a Lagrange multiplier λ and, using the Taylor expansions of L(θ + d) and of the KL term, rewrite it as

    \[\begin{aligned}d^* &= \mathop{\arg\min}_{d}\ \mathcal{L}(\theta + d) + \lambda\left( \mathrm{KL}\left[ p_\theta \,\|\, p_{\theta+d} \right] - c \right)\\ &\approx \mathop{\arg\min}_{d}\ \mathcal{L}(\theta) + \nabla_\theta \mathcal{L}(\theta)^{\mathrm{T}} d + \frac{1}{2}\,\lambda\, d^{\mathrm{T}} \mathbb{F}\, d - \lambda c\end{aligned}\]

      This minimization problem can be solved by setting its derivative with respect to d to zero:

    \[\begin{aligned}0 &= \frac{\partial}{\partial d}\left[ \mathcal{L}(\theta) + \nabla_\theta \mathcal{L}(\theta)^{\mathrm{T}} d + \frac{1}{2}\,\lambda\, d^{\mathrm{T}} \mathbb{F}\, d - \lambda c \right]\\ &= \nabla_\theta \mathcal{L}(\theta) + \lambda\, \mathbb{F}\, d\end{aligned}\]

      Then we get

    \[ d = -\frac{1}{\lambda}\, \mathbb{F}^{-1} \nabla_\theta \mathcal{L}(\theta) \]

      According to the formula above, the optimal descent direction for L(θ) in distribution space is the opposite of the gradient direction, rescaled by the inverse Fisher F^{-1} to account for the local curvature of distribution space.
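      Here is a minimal sketch of a descent loop that applies this update, fitting the mean and standard deviation of a Gaussian by maximum likelihood. The toy model, the learning rate (playing the role of 1/λ), and the step count are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.5, size=10_000)    # target: mu = 3.0, sigma = 1.5

mu, sigma, lr = 0.0, 1.0, 0.5               # lr plays the role of 1/lambda
for _ in range(100):
    # gradient of the negative log-likelihood loss L(theta)
    grad = np.array([-(data - mu).mean() / sigma**2,
                     -(((data - mu)**2 - sigma**2) / sigma**3).mean()])
    F = np.diag([1 / sigma**2, 2 / sigma**2])   # closed-form Fisher
    d = -lr * np.linalg.solve(F, grad)          # d = -(1/lambda) F^{-1} grad L
    mu, sigma = mu + d[0], sigma + d[1]

print(mu, sigma)   # approaches (3.0, 1.5)
```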

4. Natural Policy Gradient

      The Natural Gradient is defined as

    \[\begin{aligned}\tilde{\nabla}_\theta \mathcal{L}(\theta) &\doteq \mathbb{E}_{\mathbf{z}}\left[ \nabla_\theta \log p_\theta(\mathbf{z})\, \nabla_\theta \log p_\theta(\mathbf{z})^{\mathrm{T}} \right]^{-1} \nabla_\theta \mathcal{L}(\theta)\\ &= \mathbb{F}^{-1}\, \nabla_\theta \mathcal{L}(\theta)\end{aligned}\]

      The “natural” bit comes from the Fisher term: the expected value, taken over z, of the outer product of the gradient of the log-probability with itself. We take that whole object, the Fisher Information Matrix, and multiply our loss gradient by its inverse.
      The natural gradient provides curvature information and a way to directly control the movement of your model in predicted-distribution space, as distinct from the movement of your model in loss space.
      More explanations of Natural Gradient (reference…).
      Standard Gradient Descent:
      Let’s say we have a neural network parameterized by some vector of parameters. We want to adjust the parameters of this network so that its output changes in some way; in most cases, we have a loss function that tells the network how its output should change.
      Using backpropagation, we calculate the derivative of the loss with respect to each parameter. These derivatives form the direction in which updating the parameters produces the biggest change in the loss, known as the gradient. We then adjust the parameters along the gradient, by a small distance, in order to train the network.
      A slight problem appears in how we define “a small distance”.

      However, defining “distance” in terms of how much the parameters are adjusted isn’t always correct. To visualize this, let’s look at two pairs of Gaussian distributions.
      In both pairs, the mean is changed from -1 to 1, a distance of two. However, it is clear that the first distribution changed far more than the second. This leads to a key insight: our gradient measures how much the output is affected by changing a parameter, but this effect must be seen in context: a shift of +2 in the first distribution means far more than a shift of +2 in the second.
      What the natural gradient does is redefine the “small distance” by which we update our parameters. Not all parameters are equal: rather than treating a change in every parameter equally, we scale each parameter’s change according to how much it affects the network’s entire output distribution.
      How do we do that?
      First off, let’s define a new form of distance based on KL divergence, a measure of how much our new distribution differs from our old one. We do this by defining a metric matrix, which allows us to measure the length of a vector according to a custom geometry.
      For a network with 5 parameters, our metric matrix is 5×5. To compute the distance of a change in parameters Δ under a metric M, we use the standard metric-induced quadratic form:
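    \[ \left\| \Delta \right\|_M^2 = \Delta^{\mathrm{T}} M\, \Delta \]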

      If our metric matrix is the identity matrix, this distance is the same as the Euclidean distance. Most of the time, however, our metric won’t be the identity. Having a metric allows our measurement of distance to account for relationships between the various parameters. As it turns out, we can use the Fisher Information Matrix as the metric, and it will measure the distance of Δ in terms of KL divergence: as shown in Section 3, the Fisher Information Matrix is the second derivative (Hessian) of the KL divergence of our network with itself.
      Now we’ve got a metric matrix that measures distance according to KL divergence when given a change in parameters.
      With that, we can calculate how our standard gradient should be scaled, matching the result derived in Section 3:
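    \[ \tilde{\nabla}_\theta \mathcal{L} = \mathbb{F}^{-1}\, \nabla_\theta \mathcal{L} \]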

      Notice that if the Fisher is the identity matrix, the natural gradient is equivalent to the standard gradient. This makes sense: using the identity as the metric results in Euclidean distance, which is what standard gradient descent uses.
      Finally, I want to note that computing the natural gradient may be computationally intensive. In implementations of TRPO, the conjugate gradient algorithm is used to solve for natural_grad in “fisher * natural_grad = standard_grad”, avoiding an explicit matrix inverse.
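      As a sketch of that idea, the conjugate gradient routine below solves F x = g using only Fisher-vector products. The function names and the explicit toy matrix are my own; real implementations compute the Fisher-vector product with automatic differentiation instead of forming F explicitly.

```python
import numpy as np

# Conjugate gradient: solve F x = g for x, given only a function `fvp`
# that computes Fisher-vector products F @ v (F symmetric positive definite).
def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - F x (x starts at 0)
    p = r.copy()                 # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# toy usage: an explicit SPD matrix stands in for the Fisher
F = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
natural_grad = conjugate_gradient(lambda v: F @ v, g)
print(natural_grad, np.linalg.solve(F, g))   # the two should match
```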

Pseudocode

Implementation

Results

5. Summary

      I will end the post with a summary graph below showing the connections between the different parts of the natural gradient.
