Prioritized Double DQN

This post is mainly summarized from the original PER paper, but with more detailed and illustrative explanations. We will go through the key ideas and implementation of Prioritized Double DQN.

1. Prioritized Experience Replay

       Online reinforcement learning (RL) agents incrementally update their parameters (of the policy, value function or model) while they observe a stream of experience. In their simplest form, they discard incoming data immediately, after a single update. Two issues with this are (a) strongly correlated updates that break the i.i.d. assumption of many popular stochastic gradient-based algorithms, and (b) the rapid forgetting of possibly rare experiences that would be useful later on.
       Experience replay addresses both of these issues: with experience stored in a replay memory, it becomes possible to break the temporal correlations by mixing more and less recent experience for the updates, and rare experience will be used for more than just a single update. DQN used a large sliding-window replay memory and sampled from it uniformly at random. In general, experience replay can reduce the amount of experience required to learn and replace it with more computation and more memory, which are often cheaper resources than the RL agent's interactions with its environment.
       In traditional DQN, experience transitions are sampled uniformly from the replay buffer during experience replay. This approach simply replays transitions at the same frequency at which they were originally experienced, regardless of their significance. Prioritized experience replay was developed to replay important transitions more frequently, and therefore to learn more efficiently.
       Prioritized Experience Replay focuses on how prioritizing which transitions are replayed can make experience replay more efficient and effective than if all transitions are replayed uniformly. The key idea is that an RL agent can learn more effectively from some transitions than from others. Transitions may be more or less surprising, redundant, or task-relevant. Some transitions may not be immediately useful to the agent, but might become so when the agent's competence increases.
       Experience Replay liberates online learning agents from processing transitions in the exact order they are experienced. Prioritized replay further liberates agents from considering transitions with the same frequency that they are experienced.
       Prioritized Experience Replay proposes to more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. However, this prioritization can lead to a loss of diversity, which we alleviate with stochastic prioritization, and introduce bias, which is corrected with importance sampling.

2. Prioritizing with TD-error

       TD-errors have also been used as a prioritization mechanism for determining where to focus resources, for example when choosing where to explore or which features to select.
       To measure the importance of each transition, one idealized criterion would be the amount the RL agent can learn from the transition in its current state. Since this quantity is not directly accessible, a reasonable proxy is the magnitude of a transition's TD error \delta, which indicates how 'surprising' or unexpected the transition is: specifically, how far the value is from its next-step bootstrap estimate. This is particularly suitable for incremental, online RL algorithms, such as SARSA or Q-learning, that already compute the TD error and update the parameters in proportion to \delta.
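       For one-step Q-learning, for example, the TD error of a transition \left( {S,A,R,S'} \right) takes the familiar form

    \[\delta = R + \gamma \mathop {\max }\limits_a Q\left( {S',a} \right) - Q\left( {S,A} \right)\]

      where \gamma is the discount factor; a large \left| \delta \right| means the observed outcome differed substantially from the current estimate.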

       In “greedy TD-error prioritization”, the transition with the largest absolute TD error is replayed from the memory. A Q-learning update is applied to this transition, which updates the weights in proportion to the TD error. New transitions arrive without a known TD-error, so we put them at maximal priority in order to guarantee that all experience is seen at least once.

       Prioritized sampling, as the name implies, will weigh the samples so that “important” ones are drawn more frequently for training.
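       As an illustration of how these priorities might be computed in practice, the sketch below (a minimal NumPy example; q_net and target_net stand in for whatever function approximators are used and are assumed to map a batch of states to an array of Q-values) computes the Double DQN TD errors for a batch of transitions. Their magnitudes become the priorities, while brand-new transitions are simply stored with the current maximum priority.

```python
import numpy as np

def double_dqn_td_errors(q_net, target_net, states, actions, rewards,
                         next_states, dones, gamma=0.99):
    """Double DQN TD error: the online net selects the next action,
    the target net evaluates it."""
    batch = np.arange(len(actions))
    q_sa = q_net(states)[batch, actions]                # Q(s, a) under the online net
    best_next = np.argmax(q_net(next_states), axis=1)   # argmax_a' Q(s', a'), online net
    q_next = target_net(next_states)[batch, best_next]  # evaluated by the target net
    targets = rewards + gamma * (1.0 - dones) * q_next  # bootstrap target (0 if terminal)
    return targets - q_sa                               # delta for each transition

# Priorities are derived from |delta|; transitions that have never been replayed
# are stored with the current maximum priority so that every experience is seen
# at least once.
```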

3. Stochastic Prioritization

       Greedy TD-error prioritization has several issues. To avoid expensive sweeps over the entire replay memory, TD errors are only updated for the transitions that are replayed; as a consequence, transitions that have a low TD error on first visit may not be replayed for a long time. It is also sensitive to noise spikes. Finally, greedy prioritization focuses on a small subset of the experience: transitions with initially high error get replayed frequently.
       To overcome these issues, we introduce a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling. We ensure that the probability of being sampled is monotonic in a transition’s priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, we define the probability of sampling transition i as

    \[P\left( i \right) = \frac{{p_i^\alpha }}{{\sum\nolimits_k {p_k^\alpha } }}\]

      where {p_i} > 0 is the priority of transition i. The exponent \alpha determines how much prioritization is used, with \alpha=0 corresponding to the uniform case.
       Proportional prioritization

    \[{p_i} = \left| {{\delta _i}} \right| + \varepsilon \]

      where \varepsilon is a small positive constant that prevents the edge-case of transitions not being revisited once their error is zero.
       Rank-based prioritization

    \[{p_i} = \frac{1}{{{\rm{rank}}(i)}}\]

      where {{\rm{rank}}(i)} is the rank of transition i when the replay memory is sorted according to \left| {{\delta _i}} \right|. In this case, P becomes a power-law distribution with exponent \alpha. Both distributions are monotonic in \left| {{\delta _i}} \right|, but the latter is likely to be more robust, as it is insensitive to outliers.
       Priority is translated into the probability of being chosen for replay. A sample i is picked during experience replay with a probability determined by the following formula (here with \alpha = 1):

    \[{P_i} = \frac{{{p_i}}}{{\sum\nolimits_k {{p_k}} }}\]
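       A minimal NumPy sketch of the two prioritization schemes and the resulting sampling probabilities (td_errors here is an assumed array of the stored transitions' TD errors):

```python
import numpy as np

def proportional_priorities(td_errors, eps=0.01):
    # p_i = |delta_i| + eps
    return np.abs(td_errors) + eps

def rank_based_priorities(td_errors):
    # p_i = 1 / rank(i), where rank 1 is the largest |delta_i|
    order = np.argsort(-np.abs(td_errors))
    ranks = np.empty(len(td_errors), dtype=np.float64)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=0.6):
    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling
    scaled = np.asarray(priorities) ** alpha
    return scaled / scaled.sum()

# Example: draw a minibatch of 2 indices according to P(i)
p = proportional_priorities(np.array([0.5, 2.0, 0.1, 1.0]))
probs = sampling_probabilities(p, alpha=0.6)
batch_idx = np.random.choice(len(probs), size=2, p=probs)
```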

4. Bias Correction

       The bias is corrected with importance sampling.
       The estimation of the expected value with stochastic updates relies on those updates corresponding to the same distribution as its expectation. Prioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to (even if the policy and state distribution are fixed). We can correct this bias by using importance-sampling (IS) weights

    \[{w_i} = {\left( {\frac{1}{N} \cdot \frac{1}{{P\left( i \right)}}} \right)^\beta }\]

      that fully compensate for the non-uniform probabilities P\left( i \right) if \beta = 1. These weights can be folded into the Q-learning update by using {w_i}{\delta _i} instead of {\delta _i} (this is thus weighted IS, not ordinary IS). For stability reasons, we always normalize the weights by 1/{\max _i}{w_i} so that they only scale the update downwards.
       Some further explanation of the bias correction in prioritized experience replay:
       After calculating {P_i}, at each time step we draw a batch of samples from this probability distribution and train our network on it. But we still have a problem here: with normal experience replay (a deque), we use a stochastic update rule, so the way we sample the experiences must match the underlying distribution they came from.
       With normal (deque) experience replay, we select our experiences uniformly at random. There is no bias, because each experience has the same chance of being taken, so we can update our weights normally. But because we use priority sampling, purely random sampling is abandoned, and we introduce a bias toward high-priority samples (they have more chances of being selected).
       If we updated our weights in the usual way, we would risk over-fitting our agent. Samples with high priority are likely to be used for training many times compared with low-priority experiences (this is the bias). As a consequence, we would update our weights using only the small portion of experiences that we consider really interesting.
       That is why we use importance-sampling (IS) weights, which adjust the update by down-weighting the frequently seen samples to correct the bias.
       The IS weights of high-priority samples are small, so each of their updates makes only a small adjustment (the network will see these experiences many times anyway), whereas low-priority samples receive weights close to 1 and thus an essentially full update.
       The exponent \beta controls how strongly these importance-sampling weights correct the bias. In practice, \beta is annealed from an initial value up to 1 over the duration of training, because the correction matters most at the end of learning, when the Q-values begin to converge: the unbiased nature of the updates is most important near convergence.
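       A minimal sketch of this correction (the linear annealing schedule and the initial value \beta = 0.4 are illustrative choices; the paper anneals \beta to 1 by the end of training):

```python
import numpy as np

def beta_by_step(step, beta0=0.4, total_steps=100_000):
    # Anneal beta linearly from beta0 up to 1 over the course of training.
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

def importance_weights(sample_probs, buffer_size, beta):
    # w_i = (1/N * 1/P(i))^beta, normalized by max_i w_i so that
    # the weights only ever scale the update downwards.
    w = (buffer_size * np.asarray(sample_probs)) ** (-beta)
    return w / w.max()

# The weighted errors w_i * delta_i (rather than the raw delta_i) then drive the
# gradient step, e.g. loss = mean(w * delta ** 2) for a squared TD-error objective.
```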

5. Efficient Implementation

       To scale to large memory sizes N, the paper uses a binary heap data structure for the priority queue, for which finding the maximum-priority transition when sampling is O(1) and updating priorities (with the new TD error after a learning step) is O(log N).
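       As a small illustration with Python's standard library (heapq implements a min-heap, so priorities are negated; in-place priority updates additionally require tracking each item's position in the heap, which is omitted here), pushing is O(log N) and peeking at the maximum-priority item is O(1):

```python
import heapq

heap = []                                    # min-heap of (-priority, transition_id)
for tid, priority in [(0, 1.2), (1, 0.3), (2, 2.5)]:
    heapq.heappush(heap, (-priority, tid))   # O(log N) insertion

max_priority, max_tid = -heap[0][0], heap[0][1]  # O(1) peek at the max-priority transition
print(max_priority, max_tid)                     # 2.5 2
```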
       Complexity analysis
       How do we store the experience and effectively sample from it?
(Source: https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-and-prioritized-experience-replay/)
       A naive implementation would be to keep all samples in an array sorted according to their priorities. A random number s, 0 \le s \le \sum\nolimits_k {{p_k}}, would be picked, and the array would be walked left to right, summing up the priorities of the elements until the running sum exceeds s; the element where this happens is chosen. This selects a sample with the desired probability distribution, but it has terrible efficiency: O(n log n) for insertion and update and O(n) for sampling.
A first important observation is that we don't actually have to store the array sorted. An unsorted array would do just as well: elements with higher priorities are still picked with higher probability.
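       A minimal sketch of this naive scheme over an unsorted priority array:

```python
import random

def sample_index(priorities):
    """Pick an index with probability proportional to its priority by walking
    the (unsorted) array until the running sum exceeds s."""
    s = random.uniform(0, sum(priorities))
    running = 0.0
    for i, p in enumerate(priorities):
        running += p
        if running >= s:
            return i
    return len(priorities) - 1  # guard against floating-point round-off
```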

       This removes the need for sorting, improving the algorithm to O(1) for insertion and update.
But the O(n) for sampling is still too high. We can further improve the algorithm by using a different data structure. We can store our samples in an unsorted sum tree, a binary tree data structure where each parent's value is the sum of its children. The samples themselves are stored in the leaf nodes.
       Updating a leaf node involves propagating the value difference up the tree, which takes O(log n). Sampling follows the same thought process as the array case, but now achieves O(log n). For a value s, 0 \le s \le \sum\nolimits_k {{p_k}}, we use the following retrieval algorithm:
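       A minimal array-backed sum-tree sketch along these lines (not the original post's exact code; the flat-array layout is one common choice, with the leaves holding the priorities):

```python
import numpy as np

class SumTree:
    """Binary tree stored in a flat array: every internal node holds the sum of
    its two children, and the leaves hold the priorities of the stored transitions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes followed by leaves
        self.data = [None] * capacity            # transitions, aligned with the leaves
        self.write = 0                           # next leaf to overwrite (ring buffer)

    def total(self):
        return self.tree[0]                      # root = sum of all priorities

    def add(self, priority, transition):
        idx = self.write + self.capacity - 1     # leaf index in the flat array
        self.data[self.write] = transition
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, idx, priority):
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:                          # propagate the difference up to the root
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def get(self, s):
        """Descend from the root: go left if s fits into the left subtree, otherwise
        subtract the left sum and go right; returns (tree index, priority, data)."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):      # stop once idx is a leaf
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]
```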

       The following picture illustrates sampling from such a tree with s = 24:

       For implementation details, see the original PER paper.
       The corresponding pseudocode is sketched below:
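       The paper's pseudocode is not copied verbatim here; the sketch below paraphrases its structure (Double DQN with prioritized replay) in Python-flavoured pseudocode, with env, memory, the networks, and the helper functions treated as assumed interfaces rather than real APIs:

```python
# Paraphrased training loop (not runnable as-is: env, memory, q_net, target_net
# and the helpers stand in for a concrete implementation).
state = env.reset()
for t in range(1, T + 1):
    action = epsilon_greedy(q_net, state)
    next_state, reward, done = env.step(action)
    memory.add((state, action, reward, next_state, done))   # stored with max priority

    if t % replay_period == 0:
        idxs, batch, probs = memory.sample(batch_size)       # sample ~ P(i) = p_i^a / sum_k p_k^a
        weights = importance_weights(probs, len(memory), beta_by_step(t))
        deltas = double_dqn_td_errors(q_net, target_net, *batch)
        memory.update_priorities(idxs, np.abs(deltas))       # p_i <- |delta_i|
        train_step(q_net, batch, weights * deltas)           # weighted gradient step

    if t % target_update_period == 0:
        target_net = copy(q_net)                             # periodically sync target net

    state = env.reset() if done else next_state
```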

6. Code

       Simple Cartpole Version:
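       The blog's original CartPole code is not reproduced here. As a hedged sketch of the piece that differs from vanilla DQN, a proportional prioritized replay memory built on the SumTree above might look like this (hyperparameter values and the stratified-sampling scheme are illustrative choices):

```python
import numpy as np

class PrioritizedMemory:
    """Proportional prioritized replay built on the SumTree sketch above."""

    def __init__(self, capacity, alpha=0.6, eps=0.01):
        self.tree = SumTree(capacity)
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps
        self.max_priority = 1.0                  # new transitions enter at max priority

    def _priority(self, td_error):
        return (abs(td_error) + self.eps) ** self.alpha

    def add(self, transition):
        self.tree.add(self.max_priority, transition)

    def sample(self, batch_size, beta=0.4):
        idxs, transitions, priorities = [], [], []
        segment = self.tree.total() / batch_size          # stratified: one draw per segment
        for i in range(batch_size):
            s = np.random.uniform(segment * i, segment * (i + 1))
            idx, p, data = self.tree.get(s)
            idxs.append(idx)
            priorities.append(p)
            transitions.append(data)
        probs = np.array(priorities) / self.tree.total()  # P(i) = p_i^alpha / sum_k p_k^alpha
        weights = (self.capacity * probs) ** (-beta)      # capacity used as N for simplicity
        weights /= weights.max()                          # only ever scale updates downwards
        return idxs, transitions, weights

    def update(self, idxs, td_errors):
        for idx, err in zip(idxs, td_errors):
            p = self._priority(err)
            self.max_priority = max(self.max_priority, p)
            self.tree.update(idx, p)
```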

       Detailed Version:

7. Summary

       To summarize, the following graph shows how the use of experience has developed, as a refresher on how the critical component, prioritized experience replay, comes about.
