Dynamic Programming in Reinforcement Learning

Dynamic Programming (DP) plays an important role in traditional approaches to solving reinforcement learning problems with known dynamics. This post focuses on the more classical view of the basic reinforcement learning solutions, mostly in the tabular case.

We take an overview of the dynamic programming problem, followed by the solutions of DP problems, namely policy search, policy iteration and value iteration, along with the basic processes of policy evaluation and policy improvement. To facilitate understanding, both theory and code examples are presented.

     1. Dynamic Programming Overview

       The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense.
Dynamic Programming (DP) is a method for solving complex problems by breaking them down into subproblems, solving those subproblems, and combining their solutions to solve the overall problem.
DP is a very general solution method for problems that have two properties. The first is “optimal substructure”, where the principle of optimality applies (discussed later): we can solve the overall problem by breaking it down into two or more pieces, solving each piece, and combining the optimal solutions to the pieces into the optimal solution to the overall problem.
E.g. if your problem is to get from some point in space to the nearest wall, you will try to figure out the shortest path to the wall. DP tells us that the shortest path to the wall is the shortest path to a midpoint between you and the wall combined with the shortest path from that midpoint to the wall.
The second property is “overlapping subproblems”, which means that the subproblems that occur once will occur again and again. We broke down the problem of getting to the wall into first getting to the midpoint and then getting to the wall; the solution to the subproblem of getting from the midpoint to the wall can then be reused when solving how to get from another point to the wall. Solutions can be cached and reused.
Markov Decision Processes satisfy both properties. The Bellman equation gives us the recursive decomposition (the first property): it tells us how to break the optimal value function into two pieces, the optimal behavior for one step followed by the optimal behavior after that step.
       What gives us the overlapping sub problems property in MDPs is the value function. It’s like a cache of the good information we figured out about the MDP.

2. Solving DP Problems

Three methods are commonly used to solve DP problems with known dynamics/models: policy search, policy iteration (characterized by policy evaluation intertwined with policy improvement) and value iteration, as illustrated in the following figure:
       As most of the solutions are based on policy evaluation and policy improvement, I will introduce them first regardless of the order in the figure.

2.1 Policy Evaluation

In policy evaluation, we are given an MDP and a policy \pi and try to evaluate that policy. We can do so through iterative application of the Bellman expectation equation; later, the Bellman optimality equation will be used for control.
The question is how to compute the state-value function v_{\pi} for an arbitrary policy \pi. This is called policy evaluation in the DP literature, or the prediction problem.

\[\begin{aligned}
v_{\pi}(s) &\doteq \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right] \\
&= \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
&= \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s \right] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_{\pi}(s') \big]
\end{aligned}\]

If the environment’s dynamics are completely known, then the above equation is a system of \left| S \right| simultaneous linear equations in \left| S \right| unknowns (the v_{\pi}(s), s \in S). The analytical solution can be computationally expensive, so an iterative method is most suitable. Consider a sequence of approximate value functions v_0, v_1, v_2, \cdots, each mapping S^{+} to \mathbb{R} (the real numbers). The initial approximation, v_0, is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive approximation is obtained by using the Bellman equation for v_{\pi} as an update rule:

\[\begin{aligned}
v_{k+1}(s) &\doteq \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s \right] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_k(s') \big]
\end{aligned}\]

for all s \in S.
       Expected update: To produce each successive approximation, {v_{k{\rm{ + 1}}}} from {{v_k}}, iterative policy evaluation applies the same operation to each state s: it replaces the old value of s with a new value obtained from the old values of the successor states of s, and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated. Each iteration of iterative policy evaluation updates the value of every state once to produce the new approximate value function {v_{k{\rm{ + 1}}}}. All the updates done in DP algorithms are called expected updates because they are based on an expectation over all possible next states rather than on a sample next state.
There are two versions of iterative policy evaluation, as shown below.
The one-array, in-place version:
The corresponding Bellman backup diagram is as follows:
In formula form, it is:

\[\begin{aligned}
v_{k+1}(s) &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a \, v_k(s') \Big) \\
\mathbf{v}^{k+1} &= \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} \mathbf{v}^{k}
\end{aligned}\]

\mathcal{R}_s^a is the expected immediate reward for taking action a in state s, and \mathcal{P}_{ss'}^a is the probability of moving from s to s' under action a. The second, matrix-form equation also admits an analytical solution, \mathbf{v}^{\pi} = (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi}.
We define the value function at the next iteration by plugging the previous iteration’s values into the leaves of the backup diagram and backing them up to compute a single new value at the root. In other words, v_{k+1} is obtained by a one-step look-ahead from the value function of the last iteration, v_k.
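To make the matrix form concrete, here is a minimal numpy sketch. The 3-state MDP below (its rewards R_pi and transition matrix P_pi) is made up purely for illustration; the snippet computes v_\pi both analytically, by solving (I - \gamma \mathcal{P}^{\pi}) \mathbf{v} = \mathcal{R}^{\pi}, and iteratively, by repeatedly applying the backup \mathbf{v}^{k+1} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} \mathbf{v}^{k}.

```python
import numpy as np

# Toy 3-state MDP under a fixed policy (numbers are made up for illustration):
# R_pi[s] = expected immediate reward, P_pi[s, s'] = transition probability.
gamma = 0.9
R_pi = np.array([1.0, 0.0, -1.0])
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])

# Analytical solution: v = (I - gamma * P_pi)^{-1} R_pi
v_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Iterative backups: v^{k+1} = R_pi + gamma * P_pi v^k
v = np.zeros(3)
for _ in range(1000):
    v = R_pi + gamma * P_pi @ v

print(v_exact)
print(v)  # the two solutions agree up to numerical tolerance
```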
       The following is an example to illustrate the calculation process.

The calculation process is as follows:
       When k=1,
              -1.0 = 0.25*(-1+0) + 0.25*(-1+0) + 0.25*(-1+0) + 0.25*(-1+0)
       When k=2,
              -1.7 = 0.25*(-1+0) + 0.25*(-1-1) + 0.25*(-1-1) + 0.25*(-1-1)
              -2.0 = 0.25*(-1-1) + 0.25*(-1-1.0) + 0.25*(-1-1.0) + 0.25*(-1-1.0)
When k=3, the same one-step backup is applied again to the values from k=2, and so on until convergence.
A more detailed view, drawn from another author, is:

A code implementation for policy evaluation is as follows:
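The original code listing is not reproduced here; below is a minimal sketch of synchronous (two-array) iterative policy evaluation for the 4×4 gridworld discussed above, assuming the standard setup of two terminal corner states, a reward of −1 per step, an equiprobable random policy and no discounting. The in-place variant would simply update v directly instead of writing into a copy.

```python
import numpy as np

GRID = 4
TERMINALS = {0, GRID * GRID - 1}               # top-left and bottom-right corners
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def step(state, action):
    """Deterministic gridworld dynamics: bumping into a wall leaves the state unchanged."""
    if state in TERMINALS:
        return state, 0.0
    r, c = divmod(state, GRID)
    dr, dc = action
    nr = min(max(r + dr, 0), GRID - 1)
    nc = min(max(c + dc, 0), GRID - 1)
    return nr * GRID + nc, -1.0


def policy_evaluation(theta=1e-4, gamma=1.0):
    v = np.zeros(GRID * GRID)
    while True:
        delta = 0.0
        v_new = v.copy()                       # two-array (synchronous) version
        for s in range(GRID * GRID):
            if s in TERMINALS:
                continue
            total = 0.0
            for a in ACTIONS:                  # random policy: probability 1/4 each
                s2, r = step(s, a)
                total += 0.25 * (r + gamma * v[s2])
            v_new[s] = total
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:
            return v.reshape(GRID, GRID)


print(np.round(policy_evaluation(), 1))
```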

       Take the Gridworld Example from “Nuts & Bolts of Reinforcement Learning: Model Based Planning using Dynamic Programming” 

       Similarly, for all non-terminal states, v_{1}(s) = -1.
For the terminal states no further reward follows, so their value stays at 0: {v_k}(1) = {v_k}(16) = 0 for all k. So v_1 for the random policy is given by:
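Reconstructing the table from the statements above (states numbered 1–16 row by row, with states 1 and 16 terminal):

\[ v_1 = \begin{pmatrix} 0 & -1 & -1 & -1 \\ -1 & -1 & -1 & -1 \\ -1 & -1 & -1 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix} \]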

       For all the remaining states, i.e., 2, 5, 12 and 15, v_2 can be calculated as follows:
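For example, for state 2, which lies next to the terminal state 1, one of the four moves lands on a state whose v_1 value is 0, while the other three (including bumping into the wall and staying in state 2) land on states with value −1, so

\[ v_2(2) = 0.25\,(-1 + 0) + 0.25\,(-1 - 1) + 0.25\,(-1 - 1) + 0.25\,(-1 - 1) = -1.75 \approx -1.7 \]

and the same holds for states 5, 12 and 15 by symmetry.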

2.2 Policy Improvement

       Policy Improvement: the process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy.

\[\begin{aligned}
\pi'(s) &\doteq \mathop{\arg\max}\limits_a \, q_{\pi}(s, a) \\
&= \mathop{\arg\max}\limits_a \, \mathbb{E}\left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a \right] \\
&= \mathop{\arg\max}\limits_a \, \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_{\pi}(s') \big]
\end{aligned}\]


The corresponding algorithm is:
A code implementation for the policy improvement algorithm is as follows.
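The post’s original listing is not shown; below is a sketch of greedy policy improvement, assuming the gridworld helpers (GRID, TERMINALS, ACTIONS, step) from the policy-evaluation sketch above are available.

```python
import numpy as np


def policy_improvement(v, gamma=1.0):
    """Return a deterministic policy that is greedy with respect to v."""
    policy = np.zeros(GRID * GRID, dtype=int)
    for s in range(GRID * GRID):
        if s in TERMINALS:
            continue
        q = []
        for a in ACTIONS:                      # one-step look-ahead
            s2, r = step(s, a)
            q.append(r + gamma * v[s2])
        policy[s] = int(np.argmax(q))          # index into ACTIONS
    return policy
```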

2.3 Policy Iteration

       The reason for computing the value function for a policy is to help find better policies.
Because a finite MDP has only a finite number of (deterministic) policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations. Policy iteration is shown in the following box.

Combining the policy evaluation and policy improvement processes above, a code implementation for the policy iteration algorithm is as follows.
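Again as a sketch reusing the same gridworld helpers and the policy_improvement function above: policy iteration alternates full evaluation of the current deterministic policy with greedy improvement until the policy stops changing.

```python
import numpy as np


def evaluate_policy(policy, gamma=1.0, theta=1e-4):
    """Iterative evaluation of a deterministic policy (in-place updates)."""
    v = np.zeros(GRID * GRID)
    while True:
        delta = 0.0
        for s in range(GRID * GRID):
            if s in TERMINALS:
                continue
            s2, r = step(s, ACTIONS[policy[s]])
            new_v = r + gamma * v[s2]
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v                        # in-place (one-array) update
        if delta < theta:
            return v


def policy_iteration(gamma=1.0):
    policy = np.zeros(GRID * GRID, dtype=int)   # start from an arbitrary policy
    while True:
        v = evaluate_policy(policy, gamma)
        new_policy = policy_improvement(v, gamma)
        if np.array_equal(new_policy, policy):  # policy stable -> optimal
            return new_policy, v
        policy = new_policy
```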

2.4 Value Iteration

       One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set.
       The policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration.
       Different views to see the value iteration:
       (1) Value iteration is obtained simply by turning the Bellman optimality equation into an update rule.
       (2) The value iteration update is identical to the policy evaluation update except that it requires the maximum to be taken over all actions.
       Pseudocode for value iteration is as below:
Code implementation:
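A sketch of value iteration for the same gridworld (reusing the helpers above): the evaluation sweep and the greedy maximization are merged into a single update, and the optimal policy is read off at the end.

```python
import numpy as np


def value_iteration(gamma=1.0, theta=1e-4):
    v = np.zeros(GRID * GRID)
    while True:
        delta = 0.0
        for s in range(GRID * GRID):
            if s in TERMINALS:
                continue
            # One sweep of "evaluation" that immediately takes the max over actions.
            q = [r + gamma * v[s2] for s2, r in (step(s, a) for a in ACTIONS)]
            best = max(q)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # Extract the greedy (optimal) policy from the converged values.
    return policy_improvement(v, gamma), v
```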

2.5 Policy Search

       Pseudocode:
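The original pseudocode is not reproduced here. Assuming that “policy search” in this context means brute-force enumeration of all deterministic policies, evaluating each and keeping the best, a sketch looks like the following; it is only feasible when |A|^{|S|} is tiny, and the comparison criterion (sum of state values) is a simplification chosen for illustration.

```python
import itertools
import numpy as np


def brute_force_policy_search(n_states, n_actions, evaluate, gamma=1.0):
    """evaluate(policy) should return the state-value vector of that policy."""
    best_policy, best_score = None, -np.inf
    # Enumerate every deterministic policy: one action index per state.
    for assignment in itertools.product(range(n_actions), repeat=n_states):
        policy = np.array(assignment, dtype=int)
        v = evaluate(policy)
        score = v.sum()                 # crude criterion: total value over states
        if score > best_score:
            best_score, best_policy = score, policy
    return best_policy
```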

3. Examples

3.1 Frozen Lake

Problem description:
           S: the starting position
           F: the frozen lake where you can walk
H: holes, which you have to avoid
         G: the goal
Code implementation:
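The original listing is not shown; the following sketch runs value iteration on gym’s FrozenLake-v1 environment, assuming the installed gym version exposes the tabular model env.unwrapped.P[s][a] as a list of (prob, next_state, reward, done) tuples (gymnasium users would import gymnasium instead).

```python
import numpy as np
import gym  # assuming the classic toy-text FrozenLake environment

env = gym.make("FrozenLake-v1")
P = env.unwrapped.P                      # tabular model: P[s][a] -> [(prob, s', r, done), ...]
n_states = env.observation_space.n
n_actions = env.action_space.n
gamma, theta = 0.99, 1e-8

# Value iteration using the known model (no env.step interaction needed).
v = np.zeros(n_states)
while True:
    delta = 0.0
    for s in range(n_states):
        q = [sum(p * (r + gamma * v[s2] * (not done))
                 for p, s2, r, done in P[s][a])
             for a in range(n_actions)]
        best = max(q)
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    if delta < theta:
        break

# Greedy policy extraction from the converged value function.
policy = np.array([
    np.argmax([sum(p * (r + gamma * v[s2] * (not done))
                   for p, s2, r, done in P[s][a])
               for a in range(n_actions)])
    for s in range(n_states)
])

print(v.reshape(4, 4).round(3))
print(policy.reshape(4, 4))
```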

       The result for the above implementation is:
       

4. Summary

Both value iteration and policy iteration assume that the agent knows the MDP model of the world (i.e., the agent knows the state-transition and reward probability functions). Therefore, they can be used by the agent to plan its actions offline, given knowledge about the environment, before interacting with it.
In value iteration we do not alternate between separate evaluation and improvement phases: the value function is updated directly, and once it is optimal, the greedy policy derived from it is also optimal. In policy iteration, by contrast, each iteration performs a full policy evaluation followed by policy improvement, and every iteration gives us a different v(s) and a corresponding policy.
