Natural Policy Gradient

The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.
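As a concrete illustration of that idea, here is a minimal sketch of one vanilla policy-gradient step in PyTorch. The names `policy_net`, `obs`, `actions`, and `returns` are assumptions for the example: `policy_net` maps observations to action logits, and `returns` holds the return observed after each action.

```python
import torch
from torch.distributions import Categorical

def policy_gradient_step(policy_net, optimizer, obs, actions, returns):
    """One update that pushes up log-probabilities of actions in
    proportion to the return that followed them."""
    logits = policy_net(obs)                    # (batch, n_actions)
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)          # log pi(a|s) for the taken actions
    # Maximizing E[log pi(a|s) * R] raises the probability of high-return
    # actions and lowers that of low-return ones; we minimize the negative.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the raw return is usually replaced with an advantage estimate (return minus a baseline) to reduce variance, but the direction of the update is the same.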

Trust Region Policy Optimization (TRPO)

TRPO updates the policy by taking the largest step that improves performance while keeping the new and old policies close, as measured by the KL divergence between their action distributions. This is different from normal policy gradient methods, which keep new and old policies close in parameter space. But even seemingly small differences in parameter space can produce very large differences in performance, so a single bad step can collapse the policy. This makes it dangerous to use large step sizes with vanilla policy gradients, which in turn hurts their sample efficiency. TRPO avoids this kind of collapse and tends to improve performance quickly and monotonically.
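To make the trust-region idea concrete, below is a simplified sketch of the safeguard TRPO relies on: a candidate parameter step is accepted only if the new policy stays within a KL-divergence budget of the old one and improves the surrogate objective; otherwise the step is shrunk or rejected. The helper names and the discrete-action Categorical policy are assumptions for illustration, and full TRPO additionally computes the step direction with a conjugate-gradient solve against the Fisher information matrix, which is omitted here.

```python
import torch
from torch.distributions import Categorical
from torch.distributions.kl import kl_divergence

def get_flat_params(net):
    return torch.cat([p.data.view(-1) for p in net.parameters()])

def set_flat_params(net, flat):
    idx = 0
    for p in net.parameters():
        n = p.numel()
        p.data.copy_(flat[idx:idx + n].view_as(p))
        idx += n

def trust_region_step(policy_net, obs, actions, advantages, full_step,
                      max_kl=0.01, backtrack_coeff=0.8, max_backtracks=10):
    """Backtracking line search: shrink the proposed step until the new
    policy both improves the surrogate objective and respects the KL limit."""
    old_params = get_flat_params(policy_net)
    with torch.no_grad():
        old_dist = Categorical(logits=policy_net(obs))
        old_log_probs = old_dist.log_prob(actions)
        old_surrogate = advantages.mean()          # ratio is 1 at the old policy

    for i in range(max_backtracks):
        set_flat_params(policy_net, old_params + (backtrack_coeff ** i) * full_step)
        with torch.no_grad():
            new_dist = Categorical(logits=policy_net(obs))
            ratio = torch.exp(new_dist.log_prob(actions) - old_log_probs)
            surrogate = (ratio * advantages).mean()
            kl = kl_divergence(old_dist, new_dist).mean()
        if kl <= max_kl and surrogate > old_surrogate:
            return True                            # step accepted inside the trust region
    set_flat_params(policy_net, old_params)        # no safe step found; keep the old policy
    return False
```

The KL check is what prevents the single destructive step described above: even if a large step looks good on the surrogate objective, it is rejected when the new action distribution drifts too far from the old one.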

Policy Gradient Methods Overview

Policy gradient methods offer several advantages: better convergence properties, effectiveness in high-dimensional or continuous action spaces, and the ability to learn stochastic policies, as illustrated in the sketch below.
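As a small illustration of the last two points, here is a sketch of a stochastic policy for a continuous action space: a diagonal-Gaussian policy that outputs a mean per action dimension and samples actions from it. The network size and the state-independent log-standard-deviation are assumptions for the example.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy for continuous actions."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned, state-independent

    def forward(self, obs):
        mean = self.mean_net(obs)
        return Normal(mean, self.log_std.exp())

    def act(self, obs):
        dist = self(obs)
        action = dist.sample()                              # stochastic action
        return action, dist.log_prob(action).sum(-1)        # log pi(a|s) over action dims
```

Because actions are sampled rather than chosen greedily, the policy is stochastic by construction, and the log-probabilities it returns are exactly what a policy-gradient loss like the one sketched earlier needs.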