Prioritized Double DQN

This post is mainly summarized from the original PER paper, but with more detailed and illustrative explanations. We will go through the key ideas and implementation of Prioritized Double DQN.

1. Prioritized Experience Replay

       Online reinforcement learning (RL) agents incrementally update their parameters (of the policy, value function or model) while they observe a stream of experience. In their simplest form, they discard incoming data immediately, after a single update. Two issues with this are (a) strongly correlated updates that break the i.i.d. assumption of many popular stochastic gradient-based algorithms, and (b) the rapid forgetting of possibly rare experiences that would be useful later on.
       Experience replay addresses both of these issues: with experience stored in a replay memory, it becomes possible to break the temporal correlations by mixing more and less recent experience for the updates, and rare experience will be used for more than just a single update. DQN used a large sliding-window replay memory and sampled from it uniformly at random. In general, experience replay can reduce the amount of experience required to learn and replace it with more computation and more memory, which are often cheaper resources than the RL agent's interactions with its environment.
       In traditional DQN, experience transitions are sampled uniformly from the replay buffer during experience replay. This approach simply replays transitions at the same frequency at which they were originally experienced, regardless of their significance. Prioritized experience replay was developed to replay important transitions more frequently, and therefore to learn more efficiently.
       Prioritized Experience Replay focuses on how prioritizing which transitions are replayed can make experience replay more efficient and effective than if all transitions are replayed uniformly. The key idea is that an RL agent can learn more effectively from some transitions than from others. Transitions may be more or less surprising, redundant, or task-relevant. Some transitions may not be immediately useful to the agent, but might become so when the agent's competence increases.
       Experience Replay liberates online learning agents from processing transitions in the exact order they are experienced. Prioritized replay further liberates agents from considering transitions with the same frequency that they are experienced.
       Prioritized Experience Replay proposes to more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. However, this prioritization can lead to a loss of diversity, which we alleviate with stochastic prioritization, and introduce bias, which is corrected with importance sampling.

2. Prioritizing with TD-error

       TD-errors have also been used as a prioritization mechanism for determining where to focus resources, for example when choosing where to explore or which features to select.
       To measure the importance of each transition, one idealized criterion would be the amount the RL agent can learn from the transition in its current state. Since this quantity is not directly accessible, a reasonable proxy is the magnitude of a transition's TD error \delta, which indicates how 'surprising' or unexpected the transition is: specifically, how far the value is from its next-step bootstrap estimate. This is particularly suitable for incremental, online RL algorithms, such as SARSA or Q-learning, that already compute the TD error and update the parameters in proportion to \delta.
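       For one-step Q-learning, for example, the TD error of a transition \left( {S,A,R,S'} \right) takes the familiar form

    \[\delta = R + \gamma \mathop {\max }\limits_a Q\left( {S',a} \right) - Q\left( {S,A} \right)\]

      where \gamma is the discount factor; a large \left| \delta \right| means the observed outcome differed substantially from the current estimate.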

       In “greedy TD-error prioritization”, the transition with the largest absolute TD error is replayed from the memory. A Q-learning update is applied to this transition, which updates the weights in proportion to the TD error. New transitions arrive without a known TD-error, so we put them at maximal priority in order to guarantee that all experience is seen at least once.

       Prioritized sampling, as the name implies, will weigh the samples so that “important” ones are drawn more frequently for training.
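       As an illustration of how these priorities might be computed in practice, the sketch below (a minimal NumPy example; q_net and target_net stand in for whatever function approximators are used and are assumed to map a batch of states to an array of Q-values) computes the Double DQN TD errors for a batch of transitions. Their magnitudes become the priorities, while brand-new transitions are simply stored with the current maximum priority.

```python
import numpy as np

def double_dqn_td_errors(q_net, target_net, states, actions, rewards,
                         next_states, dones, gamma=0.99):
    """Double DQN TD error: the online net selects the next action,
    the target net evaluates it."""
    batch = np.arange(len(actions))
    q_sa = q_net(states)[batch, actions]                # Q(s, a) under the online net
    best_next = np.argmax(q_net(next_states), axis=1)   # argmax_a' Q(s', a'), online net
    q_next = target_net(next_states)[batch, best_next]  # evaluated by the target net
    targets = rewards + gamma * (1.0 - dones) * q_next  # bootstrap target (0 if terminal)
    return targets - q_sa                               # delta for each transition

# Priorities are derived from |delta|; transitions that have never been replayed
# are stored with the current maximum priority so that every experience is seen
# at least once.
```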

3. Stochastic Prioritization

       Greedy TD-error prioritization has several issues. To avoid expensive sweeps over the entire replay memory, TD errors are only updated for the transitions that are replayed; as a consequence, transitions that have a low TD error on first visit may not be replayed for a long time. It is also sensitive to noise spikes. Finally, greedy prioritization focuses on a small subset of the experience: transitions with initially high error get replayed frequently.
       To overcome these issues, we introduce a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling. We ensure that the probability of being sampled is monotonic in a transition’s priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, we define the probability of sampling transition i as

    \[P\left( i \right) = \frac{{p_i^\alpha }}{{\sum\nolimits_k {p_k^\alpha } }}\]

      where {p_i} > 0 is the priority of transition i. The exponent \alpha determines how much prioritization is used, with \alpha=0 corresponding to the uniform case.
       Proportional prioritization

    \[{p_i} = \left| {{\delta _i}} \right| + \varepsilon \]

      where \varepsilon is a small positive constant that prevents the edge-case of transitions not being revisited once their error is zero.
       Rank-based prioritization

    \[{p_i} = \frac{1}{{{\rm{rank}}(i)}}\]

      where {{\rm{rank}}(i)} is the rank of transition i when the replay memory is sorted according to \left| {{\delta _i}} \right|. In this case, P becomes a power-law distribution with exponent \alpha. Both distributions are monotonic in \left| {{\delta _i}} \right|, but the latter is likely to be more robust, as it is insensitive to outliers.
       Priority is translated into the probability of being chosen for replay. A sample i is picked during experience replay with a probability determined by the following formula (here with \alpha = 1):

    \[{P_i} = \frac{{{p_i}}}{{\sum\nolimits_k {{p_k}} }}\]
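       A minimal NumPy sketch of the two prioritization schemes and the resulting sampling probabilities (td_errors here is an assumed array of the stored transitions' TD errors):

```python
import numpy as np

def proportional_priorities(td_errors, eps=0.01):
    # p_i = |delta_i| + eps
    return np.abs(td_errors) + eps

def rank_based_priorities(td_errors):
    # p_i = 1 / rank(i), where rank 1 is the largest |delta_i|
    order = np.argsort(-np.abs(td_errors))
    ranks = np.empty(len(td_errors), dtype=np.float64)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=0.6):
    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling
    scaled = np.asarray(priorities) ** alpha
    return scaled / scaled.sum()

# Example: draw a minibatch of 2 indices according to P(i)
p = proportional_priorities(np.array([0.5, 2.0, 0.1, 1.0]))
probs = sampling_probabilities(p, alpha=0.6)
batch_idx = np.random.choice(len(probs), size=2, p=probs)
```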

4. Bias Correction

       The bias is corrected with importance sampling.
       The estimation of the expected value with stochastic updates relies on those updates corresponding to the same distribution as its expectation. Prioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to (even if the policy and state distribution are fixed). We can correct this bias by using importance-sampling (IS) weights

    \[{w_i} = {\left( {\frac{1}{N} \cdot \frac{1}{{P\left( i \right)}}} \right)^\beta }\]

      that fully compensate for the non-uniform probabilities P\left( i \right) if \beta = 1. These weights can be folded into the Q-learning update by using {w_i}{\delta _i} instead of {\delta _i} (this is thus weighted IS, not ordinary IS). For stability reasons, we always normalize the weights by 1/{\max _i}{w_i} so that they only scale the update downwards.
       Some further explanation of the bias correction in prioritized experience replay:
       After calculating {P_i}, at each time step we draw a batch of samples from this probability distribution and train our network on it. But we still have a problem here: with normal experience replay (a deque), we use a stochastic update rule, so the way we sample the experiences must match the underlying distribution they came from.
       With normal (deque) experience replay, we select our experiences uniformly at random. There is no bias, because each experience has the same chance of being taken, so we can update our weights normally. But because we use priority sampling, purely random sampling is abandoned, and we introduce a bias toward high-priority samples (they have more chances of being selected).
       If we updated our weights in the usual way, we would risk over-fitting our agent. Samples with high priority are likely to be used for training many times compared with low-priority experiences (this is the bias). As a consequence, we would update our weights using only the small portion of experiences that we consider really interesting.
       That is why we use importance-sampling (IS) weights, which adjust the update by down-weighting the frequently seen samples to correct the bias.
       The IS weights of high-priority samples are small, so each of their updates makes only a small adjustment (the network will see these experiences many times anyway), whereas low-priority samples receive weights close to 1 and thus an essentially full update.
       The exponent \beta controls how strongly these importance-sampling weights correct the bias. In practice, \beta is annealed from an initial value up to 1 over the duration of training, because the correction matters most at the end of learning, when the Q-values begin to converge: the unbiased nature of the updates is most important near convergence.
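       A minimal sketch of this correction (the linear annealing schedule and the initial value \beta = 0.4 are illustrative choices; the paper anneals \beta to 1 by the end of training):

```python
import numpy as np

def beta_by_step(step, beta0=0.4, total_steps=100_000):
    # Anneal beta linearly from beta0 up to 1 over the course of training.
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

def importance_weights(sample_probs, buffer_size, beta):
    # w_i = (1/N * 1/P(i))^beta, normalized by max_i w_i so that
    # the weights only ever scale the update downwards.
    w = (buffer_size * np.asarray(sample_probs)) ** (-beta)
    return w / w.max()

# The weighted errors w_i * delta_i (rather than the raw delta_i) then drive the
# gradient step, e.g. loss = mean(w * delta ** 2) for a squared TD-error objective.
```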

5. Efficient Implementation

       To scale to large memory sizes N, the paper uses a binary heap data structure for the priority queue, for which finding the maximum-priority transition when sampling is O(1) and updating priorities (with the new TD error after a learning step) is O(log N).
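       As a small illustration with Python's standard library (heapq implements a min-heap, so priorities are negated; in-place priority updates additionally require tracking each item's position in the heap, which is omitted here), pushing is O(log N) and peeking at the maximum-priority item is O(1):

```python
import heapq

heap = []                                    # min-heap of (-priority, transition_id)
for tid, priority in [(0, 1.2), (1, 0.3), (2, 2.5)]:
    heapq.heappush(heap, (-priority, tid))   # O(log N) insertion

max_priority, max_tid = -heap[0][0], heap[0][1]  # O(1) peek at the max-priority transition
print(max_priority, max_tid)                     # 2.5 2
```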
       Complexity analysis
       How do we store the experience and effectively sample from it?
(Source: https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-and-prioritized-experience-replay/)
       A naive implementation would be to keep all samples in an array sorted according to their priorities. A random number s, 0 \le s \le \sum\nolimits_k {{p_k}}, would be picked, and the array would be walked left to right, summing up the priorities of the elements until the running sum exceeds s; the element where this happens is chosen. This selects a sample with the desired probability distribution, but it has terrible efficiency: O(n log n) for insertion and update and O(n) for sampling.
A first important observation is that we don't actually have to store the array sorted. An unsorted array would do just as well: elements with higher priorities are still picked with higher probability.
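       A minimal sketch of this naive scheme over an unsorted priority array:

```python
import random

def sample_index(priorities):
    """Pick an index with probability proportional to its priority by walking
    the (unsorted) array until the running sum exceeds s."""
    s = random.uniform(0, sum(priorities))
    running = 0.0
    for i, p in enumerate(priorities):
        running += p
        if running >= s:
            return i
    return len(priorities) - 1  # guard against floating-point round-off
```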

       This removes the need for sorting, improving the algorithm to O(1) for insertion and update.
But the O(n) for sampling is still too high. We can further improve the algorithm by using a different data structure. We can store our samples in an unsorted sum tree, a binary tree data structure where each parent's value is the sum of its children. The samples themselves are stored in the leaf nodes.
       Updating a leaf node involves propagating the value difference up the tree, which takes O(log n). Sampling follows the same thought process as the array case, but now achieves O(log n). For a value s, 0 \le s \le \sum\nolimits_k {{p_k}}, we use the following retrieval algorithm:
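       A minimal array-backed sum-tree sketch along these lines (not the original post's exact code; the flat-array layout is one common choice, with the leaves holding the priorities):

```python
import numpy as np

class SumTree:
    """Binary tree stored in a flat array: every internal node holds the sum of
    its two children, and the leaves hold the priorities of the stored transitions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes followed by leaves
        self.data = [None] * capacity            # transitions, aligned with the leaves
        self.write = 0                           # next leaf to overwrite (ring buffer)

    def total(self):
        return self.tree[0]                      # root = sum of all priorities

    def add(self, priority, transition):
        idx = self.write + self.capacity - 1     # leaf index in the flat array
        self.data[self.write] = transition
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, idx, priority):
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:                          # propagate the difference up to the root
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def get(self, s):
        """Descend from the root: go left if s fits into the left subtree, otherwise
        subtract the left sum and go right; returns (tree index, priority, data)."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):      # stop once idx is a leaf
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]
```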

       The following picture illustrates sampling from such a tree with s = 24:

       For implementation details, see the original PER paper.
       The corresponding pseudocode is sketched below:
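       The paper's pseudocode is not copied verbatim here; the sketch below paraphrases its structure (Double DQN with prioritized replay) in Python-flavoured pseudocode, with env, memory, the networks, and the helper functions treated as assumed interfaces rather than real APIs:

```python
# Paraphrased training loop (not runnable as-is: env, memory, q_net, target_net
# and the helpers stand in for a concrete implementation).
state = env.reset()
for t in range(1, T + 1):
    action = epsilon_greedy(q_net, state)
    next_state, reward, done = env.step(action)
    memory.add((state, action, reward, next_state, done))   # stored with max priority

    if t % replay_period == 0:
        idxs, batch, probs = memory.sample(batch_size)       # sample ~ P(i) = p_i^a / sum_k p_k^a
        weights = importance_weights(probs, len(memory), beta_by_step(t))
        deltas = double_dqn_td_errors(q_net, target_net, *batch)
        memory.update_priorities(idxs, np.abs(deltas))       # p_i <- |delta_i|
        train_step(q_net, batch, weights * deltas)           # weighted gradient step

    if t % target_update_period == 0:
        target_net = copy(q_net)                             # periodically sync target net

    state = env.reset() if done else next_state
```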

6. Code

       Simple Cartpole Version:
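       The blog's original CartPole code is not reproduced here. As a hedged sketch of the piece that differs from vanilla DQN, a proportional prioritized replay memory built on the SumTree above might look like this (hyperparameter values and the stratified-sampling scheme are illustrative choices):

```python
import numpy as np

class PrioritizedMemory:
    """Proportional prioritized replay built on the SumTree sketch above."""

    def __init__(self, capacity, alpha=0.6, eps=0.01):
        self.tree = SumTree(capacity)
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps
        self.max_priority = 1.0                  # new transitions enter at max priority

    def _priority(self, td_error):
        return (abs(td_error) + self.eps) ** self.alpha

    def add(self, transition):
        self.tree.add(self.max_priority, transition)

    def sample(self, batch_size, beta=0.4):
        idxs, transitions, priorities = [], [], []
        segment = self.tree.total() / batch_size          # stratified: one draw per segment
        for i in range(batch_size):
            s = np.random.uniform(segment * i, segment * (i + 1))
            idx, p, data = self.tree.get(s)
            idxs.append(idx)
            priorities.append(p)
            transitions.append(data)
        probs = np.array(priorities) / self.tree.total()  # P(i) = p_i^alpha / sum_k p_k^alpha
        weights = (self.capacity * probs) ** (-beta)      # capacity used as N for simplicity
        weights /= weights.max()                          # only ever scale updates downwards
        return idxs, transitions, weights

    def update(self, idxs, td_errors):
        for idx, err in zip(idxs, td_errors):
            p = self._priority(err)
            self.max_priority = max(self.max_priority, p)
            self.tree.update(idx, p)
```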

       Detailed Version:

7. Summary

       To summarize, the following graph shows how the use of experience has developed, as a refresher on how the critical component, prioritized experience replay, comes about.
