Temporal Difference Learning from Scratch

This post highlights basic temporal difference (TD) learning theory and algorithms, which contribute much to more advanced topics such as Deep Q-learning (DQN), double DQN, dueling DQN, and the family of actor-critic methods. It starts from the formulation of TD learning and then shows how it is applied in practice.
      The post assumes that you have basic knowledge of reinforcement learning and of Monte Carlo methods for solving RL problems.

1. Temporal Difference Learning Overview

      TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

2. Fundamental Topics

      As with most basic reinforcement learning techniques, we start with prediction and then turn to control based on that prediction.

2.1 TD Prediction

      (Reference Rich Sutton) Both TD and Monte Carlo methods use experience to solve the prediction problem. Given some experience following a policy {{\rm{\pi }}}, both methods update their estimate V of {{v_{\rm{\pi }}}} for the non-terminal states S_t occurring in that experience. Monte Carlo methods wait until the return following the visit is known, then use that return as a target for V\left( {{S_t}} \right). A simple every-visit Monte Carlo method suitable for nonstationary environments is

    \[V\left( {{S_t}} \right) \leftarrow V\left( {{S_t}} \right) + \alpha \left[ {{G_t} - V\left( {{S_t}} \right)} \right]\]

where G_t is the actual return following time t, and \alpha is a constant step-size parameter; this update is known as constant-\alpha MC. Monte Carlo methods must wait until the end of the episode to determine the increment to V\left( {{S_t}} \right) (only then is G_t known), whereas TD methods need to wait only until the next time step. At time t + 1 they immediately form a target and make a useful update using the observed reward R_{t+1} and the estimate V\left( {{S_{t + 1}}} \right). The simplest TD method makes the update

    \[V\left( {{S_t}} \right) \leftarrow V\left( {{S_t}} \right) + \alpha \left[ {{R_{t + 1}} + \gamma V\left( {{S_{t + 1}}} \right) - V\left( {{S_t}} \right)} \right]\]

      Thus, the target for the Monte Carlo update is G_t, whereas the target for the TD update is {{R_{t + 1}} + \gamma V\left( {{S_{t + 1}}} \right)}.
      Using TD learning to evaluate a policy \pi proceeds as shown below.
       TD Prediction Algorithm (a minimal code sketch follows these steps):
      (1) Initialize V(s) to 0 or to some arbitrary values
      (2) Begin the episode; for every step in the episode, perform an action A in state S, receive a reward R, and move to the next state S'
      (3) Update the value of the previous state using the TD update rule
      (4) Repeat steps (2) and (3) until the terminal state is reached
      Attention: in SARSA and Q-learning, we learn Q\left( {s,a} \right) rather than V\left( s \right).
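      A minimal sketch of tabular TD(0) prediction in Python is given below. The Gym-style environment interface (reset() returning a state, step(action) returning (next_state, reward, done, info)) and the policy callable are assumptions made for illustration, not code from the original post.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (a sketch, assuming a Gym-style env)."""
    V = defaultdict(float)                    # step (1): V(s) initialised to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)            # step (2): act according to the policy being evaluated
            next_state, reward, done, _ = env.step(action)
            target = reward + gamma * V[next_state] * (not done)   # TD target R + gamma * V(S')
            V[state] += alpha * (target - V[state])                # step (3): TD(0) update
            state = next_state                # step (4): repeat until the terminal state
    return V
```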
The MC method uses returns averaged over many episodes to estimate the state-value function, while the TD method uses the observed reward and the current estimate of the next state's value to update the value of the preceding state.
      The above TD method, TD(0), bases its update in part on an existing estimate, so we say that it is a bootstrapping method: it can learn from incomplete sequences by bootstrapping.

    \[\begin{array}{ll} {v_{\rm{\pi }}}\left( s \right) \buildrel\textstyle.\over= {\mathbb{E}_{\rm{\pi }}}\left[ {{G_t}\left| {{S_t} = s} \right.} \right] & \\ \phantom{{v_{\rm{\pi }}}\left( s \right)} = {\mathbb{E}_{\rm{\pi }}}\left[ {{R_{t + 1}} + \gamma {G_{t + 1}}\left| {{S_t} = s} \right.} \right] & \left( {\text{MC target}} \right)\\ \phantom{{v_{\rm{\pi }}}\left( s \right)} = {\mathbb{E}_{\rm{\pi }}}\left[ {{R_{t + 1}} + \gamma {v_{\rm{\pi }}}\left( {{S_{t + 1}}} \right)\left| {{S_t} = s} \right.} \right] & \left( {\text{DP target}} \right)\\ V\left( {{S_t}} \right) \leftarrow V\left( {{S_t}} \right) + \alpha \left[ {{R_{t + 1}} + \gamma V\left( {{S_{t + 1}}} \right) - V\left( {{S_t}} \right)} \right] & \left( {\text{TD target}} \right) \end{array}\]

      The Monte Carlo target is an estimate because the expected value is not known; a sample return is used in place of the real expected return.
      The DP target is an estimate not because of the expected values, which are assumed to be completely provided by a model of the environment, but because {{v_{\rm{\pi }}}\left( {{S_{t + 1}}} \right)} is not known and the current estimate, {V\left( {{S_{t + 1}}} \right)}, is used instead. The TD target is an estimate for both reasons: it samples the expected values and it uses the current estimate V instead of the true v_\pi . Thus, TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.
      In the TD(0) backup diagram, the value estimate for the state node at the top is updated on the basis of the one sample transition from it to the immediately following state: we sample one transition starting at s and use the current estimate of the next state's value to construct the backup target.
      The estimate of {q_{\rm{\pi }}}\left( {s,a} \right) is formed in the same way; see the sections on on-policy SARSA and off-policy Q-learning below for details:

    \[\begin{array}{l} Q\left( {{S_t},{A_t}} \right) \leftarrow Q\left( {{S_t},{A_t}} \right) + \alpha \left[ {{R_{t + 1}} + \gamma Q\left( {{S_{t + 1}},{A_{t + 1}}} \right) - Q\left( {{S_t},{A_t}} \right)} \right]\\ Q\left( {{S_t},{A_t}} \right) \leftarrow Q\left( {{S_t},{A_t}} \right) + \alpha \left[ {{R_{t + 1}} + \gamma \mathop {\max }\limits_a Q\left( {{S_{t + 1}},a} \right) - Q\left( {{S_t},{A_t}} \right)} \right] \end{array}\]

TD Error:
      The TD error is the difference between the estimated value of {{S_t}} and the better estimate {{R_{t + 1}} + \gamma V\left( {{S_{t + 1}}} \right)}:

    \[{\delta _t} \buildrel\textstyle.\over= {R_{t + 1}} + \gamma V\left( {{S_{t + 1}}} \right) - V\left( {{S_t}} \right)\]

      Because the TD error depends on the next state and next reward, it is not actually available until one time step later; that is, {\delta _t} is the error in V\left( {{S_t}} \right), available only at time t + 1.
      Explanation by example (TODO):
       With step size \alpha = 0.5, the running estimate {q_k} is updated incrementally from the observed rewards {R_k} as follows:

    \[\begin{array}{l} {q_1} = 0,\;{R_1} = 0\\ {R_2} = 4,\;{q_2} = 0 + 0.5 \times \left( {4 - 0} \right) = 2\\ {q_3} = {q_2} + 0.5 \times \left( {{R_3} - {q_2}} \right) = 2 + 0.5 \times \left( {4 - 2} \right) = 3\\ {q_4} = {q_3} + 0.5 \times \left( {{R_4} - {q_3}} \right) = 3 + 0.5 \times \left( {4 - 3} \right) = 3.5\\ \ldots \\ {q_{15}} = {q_{14}} + 0.5 \times \left( {{R_{15}} - {q_{14}}} \right) = 1.5 + 0.5 \times \left( {4 - 1.5} \right) = 2.75\\ {q_{16}} = {q_{15}} + 0.5 \times \left( {{R_{16}} - {q_{15}}} \right) = 2.75 + 0.5 \times \left( {0 - 2.75} \right) = 1.375 \end{array}\]

      Advantages of TD methods:
      First, compared with DP, TD methods do not require a model of the environment, of its reward and next-state probability distributions. Second, compared with MC, they are naturally implemented in an online, fully incremental fashion: they need to wait only until the next time step, which matters in applications with long episodes. Third, TD methods are much less susceptible to the problems caused by having to ignore or discount episodes on which experimental actions are taken, which can greatly slow learning in MC methods.
      Tips:
      Both TD and Monte Carlo methods converge asymptotically to the correct predictions; which method converges faster remains an open theoretical question. In practice, however, TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.

2.2 Exploration and Optimality

      TD prediction is used to evaluate the state-value function, while TD control is used to improve and optimize the value function. As with Monte Carlo methods, TD control needs to trade off exploration and exploitation, which results in two kinds of methods: on-policy and off-policy.
      The value of a particular state can be predicted, but to find the best policy (control), some exploration has to be made.
Explore with the \varepsilon-greedy method
We need to balance exploration of suboptimal actions with exploitation of the optimal, or greedy, action. One naive strategy is to take a random action with small probability and take the greedy action the rest of the time. This type of exploration strategy is called an \varepsilon-greedy policy. Mathematically, an \varepsilon-greedy policy with respect to the state-action value {Q^{\rm{\pi }}}\left( {s,a} \right) takes the following form:
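One standard way to write it (filling in the formula this sentence refers to; this is the usual textbook form rather than the original post's own notation) is

    \[\pi \left( {a\left| s \right.} \right) = \left\{ {\begin{array}{ll} {1 - \varepsilon + \frac{\varepsilon }{{\left| \mathcal{A} \right|}}}&{\text{if } a = \arg {\max _{a'}}{Q^{\rm{\pi }}}\left( {s,a'} \right)}\\ {\frac{\varepsilon }{{\left| \mathcal{A} \right|}}}&{\text{otherwise}} \end{array}} \right.\]

where \left| \mathcal{A} \right| is the number of actions.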
       Monotonic \varepsilon-greedy Policy Improvement
       Let \pi_i be an \varepsilon-greedy policy. Then the \varepsilon-greedy policy with respect to Q^{\pi_i}, denoted \pi_{i+1}, is a monotonic improvement on \pi_i; in other words, {V^{{{\rm{\pi }}_{i + 1}}}} \ge {V^{{{\rm{\pi }}_i}}}. (Drawn from the Stanford CS234 lecture notes.)
       Greedy in the Limit of Infinite Exploration
      The \varepsilon-greedy strategy above is a naive way to balance exploration of new actions with exploitation of current knowledge; we can refine this balance by introducing a class of exploration strategies that allows us to make convergence guarantees about our algorithms. This class of strategies is called Greedy in the Limit of Infinite Exploration (GLIE).
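      Concretely, GLIE imposes two conditions (this is the standard definition, e.g. as stated in the CS234 notes, restated here rather than quoted from the original post):

    \[\begin{array}{l} \lim\limits_{k \to \infty } {N_k}\left( {s,a} \right) = \infty \quad \text{(every state-action pair is visited infinitely often)}\\ \lim\limits_{k \to \infty } {\pi _k}\left( {a\left| s \right.} \right) = \mathbf{1}\left[ {a = \arg {\max _{a'}}{Q_k}\left( {s,a'} \right)} \right]\quad \text{(the policy becomes greedy in the limit)} \end{array}\]

      For example, an \varepsilon-greedy policy with {\varepsilon _k} = 1/k satisfies GLIE.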

3. On-policy SARSA

      SARSA learns an action-value function rather than a state-value function: we learn the values of all state–action pairs, and SARSA control is an example of generalized policy iteration (GPI) with TD learning.
      For an on-policy method we must estimate {q_{\rm{\pi }}}\left( {s,a} \right) for the current behavior policy \pi and for all states s and actions a. This can be done using essentially the same TD method described above for learning v_{\pi}. Recall that an episode consists of an alternating sequence of states and state–action pairs: \ldots ,{S_t},{A_t},{R_{t + 1}},{S_{t + 1}},{A_{t + 1}},{R_{t + 2}},{S_{t + 2}},{A_{t + 2}}, \ldots
      In the previous section we considered transitions from state to state and learned the values of states. Here we consider transitions from state–action pair to state–action pair and learn the values of state–action pairs, which is what policy control needs. The theorems assuring the convergence of state values under TD(0) also apply to the corresponding algorithm for action values:

    \[Q\left( {{S_t},{A_t}} \right) \leftarrow Q\left( {{S_t},{A_t}} \right) + \alpha \left[ {{R_{t + 1}} + \gamma Q\left( {{S_{t + 1}},{A_{t + 1}}} \right) - Q\left( {{S_t},{A_t}} \right)} \right]\]

      This update is done after every transition from a nonterminal state {{S_t}}. If {{S_{t + 1}}} is terminal, then {Q\left( {{S_{t + 1}},{A_{t + 1}}} \right)} is defined as zero. This rule uses every element of the quintuple of events,\left( {{S_t},{A_t},{R_{t + 1}},{S_{t + 1}},{A_{t + 1}}} \right), that make up a transition from one state–action pair to the next.
      SARSA gets its name from the parts of the trajectory used in the update equation. To update the Q-value at state-action pair (s, a), we need the reward, next state and next action, thereby using the values \left( {s,a,r,s',a'} \right). SARSA is an on-policy method because the actions a and a' used in the update equation are both derived from the policy that is being followed at the time of the update.
      As in all on-policy methods, we continually estimate q_{\pi} for the behavior policy \pi, and at the same time change \pi toward greediness with respect to q_{\pi}. The general form of the SARSA control algorithm is as follows (a minimal code sketch follows the steps):
      Algorithm description:
      (1) Initialize the Q values to some arbitrary values.
      (2) Select an action by the \varepsilon-greedy policy (\varepsilon > 0) and move from one state to the next.
      (3) Update the Q value of the previous state using the update rule Q\left( {s,a} \right) \leftarrow Q\left( {s,a} \right) + \alpha \left( {r + \gamma Q\left( {s',a'} \right) - Q\left( {s,a} \right)} \right), where a' is the action selected by an \varepsilon-greedy policy (\varepsilon > 0).
      (4) Repeat steps (2) and (3) until convergence.
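      A minimal tabular SARSA sketch in Python is shown below. The Gym-style environment interface, the epsilon_greedy helper, and the hyperparameter values are illustrative assumptions, not the post's original code.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, epsilon):
    """Greedy action w.r.t. Q with probability 1 - epsilon, otherwise a random action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def sarsa(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular SARSA, assuming a classic Gym-style env:
    reset() -> state, step(a) -> (next_state, reward, done, info)."""
    n_actions = env.action_space.n
    Q = defaultdict(float)                          # Q(s, a) initialised to 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            # SARSA target uses the action actually selected by the behaviour policy
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```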
      The convergence of SARSA
      SARSA converges with probability 1 to an optimal policy and action-value function as long as all state–action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy.
      SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, i.e. Q\left( {s,a} \right) \to {q_*}\left( {s,a} \right), if the following two conditions hold:
1. The sequence of policies {\pi _t} is GLIE
2. The step sizes {\alpha _t} satisfy the Robbins-Monro conditions, i.e.

    \[\begin{array}{l} \sum\limits_{t = 1}^\infty {{\alpha _t} = \infty } \\ \sum\limits_{t = 1}^\infty {\alpha _t^2 < \infty } \end{array}\]

      Case 1: Windy Gridworld
      Description: refer to Sutton and Barto's book (Example 6.5).

      Code implementation:
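      The original implementation is not reproduced here; below is a minimal sketch of the windy gridworld environment following the description in Sutton and Barto, with a class name and Gym-like interface chosen to plug into the sarsa sketch above.

```python
from types import SimpleNamespace

class WindyGridworld:
    """7x10 gridworld with an upward wind per column (Sutton & Barto, Example 6.5)."""
    WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]         # upward wind strength of each column
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

    def __init__(self):
        self.action_space = SimpleNamespace(n=len(self.MOVES))
        self.start, self.goal = (3, 0), (3, 7)

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, a):
        row, col = self.pos
        d_row, d_col = self.MOVES[a]
        new_row = min(max(row + d_row - self.WIND[col], 0), 6)   # wind pushes the agent upward
        new_col = min(max(col + d_col, 0), 9)
        self.pos = (new_row, new_col)
        done = self.pos == self.goal
        return self.pos, -1.0, done, {}                           # -1 per step until the goal

# Usage with the SARSA sketch above:
# Q = sarsa(WindyGridworld(), num_episodes=200, alpha=0.5, gamma=1.0, epsilon=0.1)
```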

      Episode length over time and episode reward over time can then be plotted to monitor learning progress.

      Case 2: Taxi problem code implementation:
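      The original implementation is omitted; assuming the classic Gym API (where step returns a 4-tuple), the sarsa sketch above can be applied to Gym's built-in Taxi environment, for example:

```python
import gym

env = gym.make("Taxi-v3")     # 500 discrete states, 6 actions
Q = sarsa(env, num_episodes=5000, alpha=0.4, gamma=0.99, epsilon=0.1)
```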

      Summary
      The flowchart of SARSA is shown below:

4. Off-policy Q-learning

      Q-learning is defined by:

    \[Q\left( {{S_t},{A_t}} \right) \leftarrow Q\left( {{S_t},{A_t}} \right) + \alpha \left[ {{R_{t + 1}} + \gamma \mathop {\max }\limits_a Q\left( {{S_{t + 1}},a} \right) - Q\left( {{S_t},{A_t}} \right)} \right]\]

      The learned action-value function, Q, directly approximates q_*, the optimal action-value function, independent of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs.
      Q-learning takes a maximum over the actions at the next state; this action is not necessarily the one the current behavior policy would select. A sketch of the Q-learning algorithm is given below:
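      The referenced box is not reproduced here; instead, here is a minimal tabular Q-learning sketch in Python (the Gym-style interface and hyperparameter values are illustrative assumptions):

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning, assuming a classic Gym-style env (step returns a 4-tuple)."""
    n_actions = env.action_space.n
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # target policy is greedy: bootstrap from max_a Q(S', a)
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            Q[(state, action)] += alpha * (reward + gamma * best_next * (not done) - Q[(state, action)])
            state = next_state
    return Q
```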
      Case study 1: Cliff Walking.
      State: grid position of the agent;
      Action: movement left, right, up, down;
      Reward: -1 on every transition except those into "The Cliff" region, which incur a reward of -100 and send the agent back to the start.
      Refer to Sutton and Barto's book (Example 6.6) for a more detailed description.


      Code for the problem using Q-learning is as follows:
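      The original code is not shown here; assuming the classic Gym API and its built-in CliffWalking-v0 environment, the q_learning sketch above could be applied like this:

```python
import gym

env = gym.make("CliffWalking-v0")   # 4x12 grid, -1 per step, -100 for stepping into the cliff
Q = q_learning(env, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1)

# Extract the greedy policy from the learned Q-values
greedy_policy = {s: max(range(env.action_space.n), key=lambda a: Q[(s, a)])
                 for s in range(env.observation_space.n)}
```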

      Results of the above code implementation can then be plotted.

      A special point to highlight is that the above code relies on a Gym environment; however, we can also build the environment ourselves, as below:
(C2M2)
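      For example, a minimal self-built version of the cliff-walking environment (the class name and interface are my own, chosen to match the Gym-style sketches above; the post's original code is not shown) might look like this:

```python
from types import SimpleNamespace

class CliffWalkingEnv:
    """4x12 gridworld: start at (3, 0), goal at (3, 11), cliff along the bottom row between them."""
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

    def __init__(self):
        self.action_space = SimpleNamespace(n=len(self.MOVES))
        self.start, self.goal = (3, 0), (3, 11)

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, a):
        d_row, d_col = self.MOVES[a]
        row = min(max(self.pos[0] + d_row, 0), 3)
        col = min(max(self.pos[1] + d_col, 0), 11)
        if row == 3 and 1 <= col <= 10:           # stepped into the cliff
            self.pos = self.start                 # sent back to the start with a large penalty
            return self.pos, -100.0, False, {}
        self.pos = (row, col)
        return self.pos, -1.0, self.pos == self.goal, {}
```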

      Case Study 2: Taxi Problem
      There are 500 states in total: 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.
      Code implementation:
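      Again the original code is omitted; under the same assumptions as before, the q_learning sketch can be reused directly:

```python
import gym

env = gym.make("Taxi-v3")
Q = q_learning(env, num_episodes=5000, alpha=0.4, gamma=0.99, epsilon=0.1)
```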

      Compare Q-learning with SARSA
      In Q-learning, we take actions using an epsilon-greedy policy but use the maximum over next actions to update the Q value, while in SARSA we take actions using the epsilon-greedy policy and also pick the action used in the update with the epsilon-greedy policy.
      Q-learning is off-policy because it updates its Q-values using the Q-value of the next state s' and the greedy action a'. It estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy is followed, despite the fact that it is not following a greedy policy. SARSA is on-policy since it updates its Q-values using the Q-value of the next state s' and the current policy's action a'. It estimates the return for state-action pairs assuming the current policy continues to be followed. The distinction disappears if the current policy is a greedy policy.
      Summary
      The algorithm flow is as follows:

5. Off-policy Expected SARSA

      Rather than sampling from \pi to pick an action a', we can instead compute the expected value of Q at s'. This way, the estimate of how good s' is does not fluctuate as it would when sampling an action from a distribution, but remains steady around the "average" outcome for that state.
      The corresponding algorithm follows the same template as SARSA; the update rule in Expected SARSA is

    \[\begin{array}{l} Q\left( {{S_t},{A_t}} \right) \leftarrow Q\left( {{S_t},{A_t}} \right) + \alpha \left[ {{R_{t + 1}} + \gamma \,{\mathbb{E}_{\rm{\pi }}}\left[ {Q\left( {{S_{t + 1}},{A_{t + 1}}} \right)\left| {{S_{t + 1}}} \right.} \right] - Q\left( {{S_t},{A_t}} \right)} \right]\\ \phantom{Q\left( {{S_t},{A_t}} \right)} \leftarrow Q\left( {{S_t},{A_t}} \right) + \alpha \left[ {{R_{t + 1}} + \gamma \sum\limits_{a'} {{\rm{\pi }}\left( {a'\left| {{S_{t + 1}}} \right.} \right)Q\left( {{S_{t + 1}},a'} \right)} - Q\left( {{S_t},{A_t}} \right)} \right] \end{array}\]

      A calculation illustration:
      Code implementation (C2M3):
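      The referenced implementation is not shown here; below is a minimal tabular Expected SARSA sketch with an epsilon-greedy policy (same assumed Gym-style interface and hyperparameters as the earlier sketches):

```python
import random
from collections import defaultdict

def expected_sarsa(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Expected SARSA: bootstrap from the expectation of Q(S', .) under the epsilon-greedy policy."""
    n_actions = env.action_space.n
    Q = defaultdict(float)

    def expected_q(state):
        # E_pi[Q(state, A)] under epsilon-greedy: every action gets epsilon/|A|,
        # and the greedy action gets the remaining 1 - epsilon on top of that.
        values = [Q[(state, a)] for a in range(n_actions)]
        return (epsilon / n_actions) * sum(values) + (1.0 - epsilon) * max(values)

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            target = reward + gamma * expected_q(next_state) * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```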

6. Summary

      So far, we have gone through the temporal difference learning approaches and seen how the Bellman equation is used to form these on-policy and off-policy algorithms. Understanding these basic algorithms is important for decomposing the more advanced methods covered later.
