Natural Policy Gradient

The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.
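This push-up/push-down idea can be sketched with a minimal REINFORCE-style update on a toy 3-armed bandit. Everything here (the reward values, the learning rate, the softmax policy) is a hypothetical illustration, not code from the post:

```python
# Minimal sketch of the policy-gradient idea on a toy 3-armed bandit,
# using plain NumPy. All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                       # one logit per action
true_rewards = np.array([0.2, 0.5, 0.9])  # hypothetical mean returns
alpha = 0.1                               # learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = true_rewards[a] + rng.normal(0, 0.1)  # noisy sampled return
    # grad of log pi(a | theta) for a softmax policy: one_hot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta += alpha * r * grad_logp            # scale step by the return

print(softmax(theta))  # the highest-return arm should dominate
```

Actions that draw higher returns get their log-probability pushed up harder, so over many updates the policy concentrates on the best arm.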

Trust Region Policy Optimization (TRPO)

Unlike vanilla policy gradient, which keeps new and old policies close in parameter space, TRPO constrains how far the new policy's action distribution may move from the old one. Even seemingly small differences in parameter space can produce very large differences in performance, so a single bad step can collapse the policy. This makes large step sizes dangerous for vanilla policy gradients and hurts their sample efficiency. TRPO avoids this kind of collapse, and tends to improve performance quickly and monotonically.
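The mismatch between parameter distance and policy distance can be seen in a tiny example (hypothetical numbers, not TRPO itself): two parameter steps of identical Euclidean size can change the action distribution by wildly different amounts, which is why TRPO bounds the KL divergence between old and new policies rather than the step size in parameter space:

```python
# Two equally sized parameter steps, very different policy changes.
# Illustrative sketch with made-up logits; not an implementation of TRPO.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def kl(p, q):
    """KL divergence D_KL(p || q) between two categorical distributions."""
    return float(np.sum(p * np.log(p / q)))

old_logits = np.array([5.0, 0.0])   # near-deterministic 2-action policy
step_a = np.array([-3.0, 3.0])      # both steps have Euclidean norm ~4.24
step_b = np.array([3.0, -3.0])

p = softmax(old_logits)
kl_a = kl(p, softmax(old_logits + step_a))  # flips the preferred action
kl_b = kl(p, softmax(old_logits + step_b))  # barely changes the policy

print(kl_a, kl_b)  # kl_a is large, kl_b is tiny
```

Measuring the step in distribution space (KL) catches the dangerous update that the parameter-space norm cannot distinguish from the harmless one.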

Prioritized Double DQN

This post is largely a summary of the original PER paper, with more detailed and illustrative explanations. We will go through the key ideas and implementation of…

Deep Q-Learning Series (DQN)

The DQN family of reinforcement learning algorithms learns using Deep Q-Networks. These algorithms include Deep Q-learning (DQN for short), Double Deep Q-learning (Double…

Temporal Difference Learning from Scratch

This post highlights basic temporal-difference learning theory and the algorithms that contribute much to more advanced topics like Deep Q-Learning (DQN), double DQN,…

Asynchronous Advantage Actor Critic (A3C)

Policy-based methods directly learn a parameterized policy. Value-based methods learn the values of actions and then select actions based on their estimated action values. In policy-based…

Policy Gradient Methods Overview

Policy gradient methods offer several advantages: better convergence properties, effectiveness in high-dimensional or continuous action spaces, and the ability to learn stochastic policies. 1….

Twin Delayed DDPG (TD3) Walkthrough

This post is organized as follows: 1. Problem Formulation; 2. Reducing Overestimation Bias; 3. Addressing Variance; 4. Pseudocode and Implementation; 5. References. TD3 Overview: Twin Delayed Deep Deterministic Policy…

Deep Deterministic Policy Gradient (DDPG)

Deep Deterministic Policy Gradient (DDPG) is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
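The coupling between the two learned functions can be sketched on a single off-policy transition. The linear critic and actor below are hypothetical stand-ins (real DDPG uses neural networks, target networks, and a replay buffer); the point is only the shape of the Bellman target and the actor's gradient direction:

```python
# Minimal sketch of DDPG's two coupled updates on one transition.
# Linear Q and mu are illustrative assumptions, not the real architecture.
import numpy as np

gamma = 0.99

# Hypothetical critic Q(s, a) = w . [s, a] and actor mu(s) = k * s
w = np.array([0.5, 1.0])
k = 0.3

def Q(s, a):
    return float(w @ np.array([s, a]))

def mu(s):
    return k * s

# One off-policy transition (s, a, r, s') from a replay buffer:
s, a, r, s2 = 0.2, -0.1, 1.0, 0.4

# Bellman target: bootstrap with the actor's action at the next state.
y = r + gamma * Q(s2, mu(s2))
critic_error = Q(s, a) - y     # the critic regresses Q(s, a) toward y

# Actor update direction: ascend dQ/da at a = mu(s), chain-ruled into k.
dQ_da = w[1]                   # Q is linear in a here
dk = dQ_da * s                 # since d mu(s) / dk = s

print(y, critic_error, dk)
```

The critic is fit to the Bellman target `y`, and the actor is nudged along `dk`, the direction that increases the critic's value of its own actions; this is the sense in which the Q-function is used to learn the policy.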