The DQN series of reinforcement learning algorithms learn by using Deep Q-Networks. The family includes Deep Q-learning (DQN), Double Deep Q-learning (Double DQN), Dueling Deep Q-learning (Dueling DQN), Prioritized Double DQN, Noisy DQN, Bootstrapped DQN, and so on, each of which improves the naive DQN from a different perspective. We will highlight the first three algorithms in this post.
1. Value-based Approximation Methods
This post will focus on value-based approximation methods for learning a policy, which differ from tabular methods like Monte Carlo and temporal-difference learning.
Reasons for replacing tabular methods with value function approximation (VFA) methods: the memory needed for large tables; the time and data needed to fill the tables accurately; states encountered may never have been seen before (thus we need to generalize to unseen states); and VFA is also applicable to partially observable problems, in which the full state is not available to the agent.
Core: value approximation methods generalize better than tabular methods.
In tabular methods, the value function is represented by a look-up table, where each state has a corresponding entry $V(s)$, or each state-action pair has an entry $Q(s, a)$. However, this approach does not generalize well to problems with very large state and/or action spaces. Consider large-scale reinforcement learning problems such as backgammon (about $10^{20}$ states), computer Go (about $10^{170}$ states), or helicopter control (a continuous state space): there are too many states and/or actions to store in memory, and it is too slow to learn the value of each state individually. The solution is to estimate the value function with function approximation, so as to generalize from seen states to unseen states.
Here $\theta$ is usually referred to as the parameters or weights of our function approximator.
However, there exist different structures to build this approximator, as below.
Meanwhile, there are many kinds of function approximators, e.g. linear combinations of features, neural networks, decision trees, nearest neighbors, Fourier/wavelet bases, etc. However, this post considers differentiable function approximators, in particular neural networks, which are the core of Deep Q-learning using Deep Q-Networks (DQN).
Tips:
(1) Requirements for function approximation methods applied to reinforcement learning problems: learning must be able to occur online, while the agent interacts with its environment or with a model of its environment, so the method needs to learn efficiently from incrementally acquired data; and it must be able to handle non-stationary target functions (target functions that change over time).
(2) Stochastic Gradient Descent (SGD) is the idea behind most approximate learning. Most algorithms share an update rule similar to the traditional gradient update rule, as below:
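In Sutton and Barto's notation, with $U_t$ the training target and $\hat{v}(S_t, \theta)$ the approximate value function, this update takes the form:

$$\theta_{t+1} = \theta_t + \alpha \big[ U_t - \hat{v}(S_t, \theta_t) \big] \nabla_{\theta} \hat{v}(S_t, \theta_t)$$

where $\alpha$ is the step size.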
2. Deep Learning Approximation
The difficulty of handcrafting feature sets makes value function approximators like linear functions impractical for many problems, which leads to the use of non-linear function approximators, especially neural networks and deep learning approaches in recent years.
The DQN algorithm extends Q-learning from the tabular case to the large-scale case, using different approaches to approximate Q-values, as shown in the figure below.
Various algorithms based on the deep learning framework, and their relative performance, are as follows:
Okay, so now let's begin with today's protagonist, DQN. We will start with a brief overview of Q-learning, then extend it to Deep Q-learning.
3. Q-learning Recap
In the traditional supervised learning scheme, we have input data and corresponding labels to feed various models with parameters, then use an optimization algorithm to minimize the difference (error) between the true labels and the predicted labels in order to obtain the model parameters; this is essentially a "learn by error" process. The reinforcement learning scheme, in contrast, involves an agent interacting with an environment, and the data generated are actions, states and rewards. With no explicit input data and labels, how can we use the well-developed techniques of supervised learning to solve the reinforcement learning problem, whose goal is to find the best policy?
The first issue is how to evaluate an agent. We use the state-value function to evaluate how good or bad it is for an agent to be in a specific state $s$, denoted by $V(s)$, which refers to the total expected return the agent gets starting from state $s$; and the action-value function to evaluate how good or bad it is for an agent to take a specific action $a$ in state $s$, denoted by $Q(s, a)$, which refers to the total expected return after taking action $a$ in state $s$.
Trying to use the available methods from supervised learning, we have data but no labels. How about forming a label?
Actually, we can estimate the current $V(s)$ or $Q(s, a)$ by taking a step forward and seeing what happens (we get a reward $r$ and end up in a new state $s'$). That means we can estimate the current value from the next step; since we are not so sure about what will happen next, we apply a discount factor $\gamma$ to the next estimate. Thus we use $r + \gamma V(s')$ to estimate $V(s)$, i.e. we take $r + \gamma V(s')$ as the target label fed to our model for training. Similarly, we can use the next time-step action value to estimate the current one, using $r + \gamma Q(s', a')$ to estimate $Q(s, a)$, or even the next time-step best value to estimate the current best value, using $r + \gamma \max_{a'} Q(s', a')$ to estimate $Q(s, a)$.
There are three approaches to solve the above problem.
(1) Analytic methods, by directly solving the (Bellman) equations.
(2) Q-learning, by building a look-up table to record the $Q(s, a)$ values obtained during each interaction.
(3) Deep Q-learning, by using a neural network to approximate the $Q$ function.
Q-learning is an algorithm that estimates $Q$ values during the training iterations; it updates the Q-value of each state-action pair by extracting the corresponding values from the look-up table, using the formula below:
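For reference, the standard tabular Q-learning update, with learning rate $\alpha$ and discount factor $\gamma$, is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$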
4. From Q-learning to Deep Q-learning
Why Deep Q-learning?
Compared with tabular Q-learning, the amount of memory required to save and update the table grows as the number of states increases, and the amount of time required to explore each state to build the required Q-table becomes unrealistic.
Challenges:
There are several problems or challenges in the Q-learning process, namely the non-stationary (unstable) target, highly correlated training data, and maximization bias. We will elaborate on them separately in the following sections.
Key Points:
The key process in DQN is actually a supervised learning process: it minimizes the error between the target values and the estimated values, computing the corresponding network weights $\theta$.
4.1 Pseudocode
Nature version pseudocode:
Description as follows:
(1) Preprocess the game screen (state $s$) and feed it to the DQN, which returns the Q values of all possible actions in that state.
(2) Select an action using the $\epsilon$-greedy policy: with probability $\epsilon$, select a random action, and with probability $1-\epsilon$, select the action with the maximum Q value, namely $a = \arg\max_{a} Q(s, a; \theta)$.
(3) After selecting the action $a$, perform it in state $s$, reach a new state $s'$ and receive a reward $r$. The next state $s'$ is the preprocessed image of the next game screen.
(4) Store this transition in the replay buffer as $(s, a, r, s')$.
(5) Sample random minibatches of transitions from the replay buffer.
(6) Calculate the loss: the squared difference between the target Q and the predicted Q (see the expression after this list).
(7) Perform gradient descent with respect to the actual (online) network parameters $\theta$ to minimize this loss.
(8) After every $C$ steps, copy the actual network weights $\theta$ to the target network weights $\theta^-$.
(9) Repeat steps 2-8 for $M$ episodes.
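In the Nature-DQN form (online weights $\theta$, target weights $\theta^-$), the per-transition loss computed in step (6) is:

$$L(\theta) = \big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \big)^2$$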
4.2 Tricks
4.2.1 Experience Replay
The key idea of experience replay is that we train our deep Q network with transitions sampled from the replay buffer, instead of training only on the most recent transitions.
A neural network can overfit to correlated experience, but this can be reduced by randomly sampling batches of experiences from the replay buffer; uniform sampling is commonly used. The replay buffer should be thought of as a queue rather than a list: it stores only a fixed number of recent experiences/transitions, and when new information comes in, the oldest is deleted.
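As a minimal sketch, here is a hypothetical ReplayBuffer helper mirroring the deque-based memory used in the Keras code later in this post:

```python
import random
from collections import deque

class ReplayBuffer(object):
    """Fixed-capacity FIFO buffer: the oldest transition is dropped when full."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)
```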
4.2.2 Target Network
We saw in the Deep Q-learning article that, when we want to calculate the TD error (aka the loss), we calculate the difference between the TD target and the current Q value (our estimation of Q).
But we don't know the real TD target; we need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action in that state plus the discounted highest Q value of the next state.
However, the problem is that we are using the same parameters (weights) for estimating the target and the Q value. As a consequence, there is a big correlation between the TD target and the parameters $\theta$ we are changing.
Therefore, at every step of training, our Q values shift, but the target value also shifts. We are getting closer to our target while the target keeps moving; it's like chasing a moving target! This leads to big oscillations in training.
I found a good and intuitive example explaining this point from post …: it's like a cowboy (the Q estimation) trying to catch a cow (the Q-target); you must get closer (reduce the error).
At each time step, you try to approach the cow, but the cow also moves at each time step (because you use the same parameters). This leads to a very strange chasing path (big oscillations in training).
Instead, we can use the idea of fixed Q-targets introduced by DeepMind:
(1) Use a separate network with fixed parameters (let's call them $\theta^-$) for estimating the TD target.
(2) Every $C$ steps, copy the parameters from our DQN network to the target network.
At every $C$ steps: $\theta^- \leftarrow \theta$.
The loss function is:
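In the form used by the Nature DQN paper, with transitions sampled uniformly from the replay memory $D$:

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta_i) \big)^2 \Big]$$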
Notice that the parameter of the target Q is $\theta^-$ instead of $\theta$. The actual Q network is used for predicting Q values and learns the correct weights $\theta$ using gradient descent. The target network is frozen for several time steps, and then its weights are updated by copying the weights from the actual Q network. Freezing the target network for a while and then updating its weights with the actual Q network weights stabilizes training.
4.2.3 Other Details
Clipping Rewards
Reward scales vary across games. In some games, we assign rewards such as +1, -1 and 0 for winning, losing and nothing, respectively, but in other games the rewards may be +100 and +50 for two different actions. To avoid this problem, rewards are clipped to -1 and +1.
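A one-line sketch of this clipping (equivalent to the clip_reward helper in the Dueling DQN code at the end of this post):

```python
import numpy as np

def clip_reward(reward):
    # Any positive reward becomes +1, any negative reward -1, zero stays 0.
    return float(np.sign(reward))
```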
Huber Loss Function
In reinforcement learning, both the input and the target change constantly during the process, which makes training unstable.
To minimize this error, we will use the Huber loss. The Huber loss acts like the mean squared error when the error is small, but like the mean absolute error when the error is large – this makes it more robust to outliers when the estimates of Q are very noisy. We calculate this over a batch of transitions, B, sampled from the replay memory:
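With $\delta = Q(s, a) - \big(r + \gamma \max_{a} Q(s', a)\big)$ denoting the TD error, the batch loss is:

$$\mathcal{L} = \frac{1}{|B|} \sum_{(s, a, s', r) \in B} \mathcal{L}(\delta), \qquad \mathcal{L}(\delta) = \begin{cases} \frac{1}{2} \delta^2 & \text{for } |\delta| \le 1, \\ |\delta| - \frac{1}{2} & \text{otherwise.} \end{cases}$$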
The graph of the Huber loss function is as follows:
4.3 Code implementation
Version 1: a simple Keras DQN implementation; see the reference below.
```python
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

EPISODE = 1000

class DQNAgent(object):
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        return model

    def memorize(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)

if __name__ == '__main__':
    env = gym.make("CartPole-v1")
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    agent = DQNAgent(state_size, action_size)
    # agent.load('./save/cartpole-dqn.h5')
    done = False
    batch_size = 32

    for e in range(EPISODE):
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        for t in range(500):
            # env.render()
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            reward = reward if not done else -10
            next_state = np.reshape(next_state, [1, state_size])
            agent.memorize(state, action, reward, next_state, done)
            state = next_state
            if done:
                print('episode: {}/{}, score: {}, e:{:.2}'.format(e, EPISODE, t, agent.epsilon))
                break
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)
        # if e % 10 == 0:
        #     agent.save('./save/cartpole-dqn.h5')
```
Version 2: the Breakout problem
```python
import gym
import tensorflow as tf
import numpy as np
import itertools
import random
import os
import sys
from collections import deque, namedtuple

env = gym.make('Breakout-v0')
state = env.reset()  # (210, 160, 3)
VALID_ACTIONS = [0, 1, 2, 3]

class StateProcessor(object):
    def __init__(self):
        with tf.variable_scope("state_processor"):
            self.input_state = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8)
            self.output = tf.image.rgb_to_grayscale(self.input_state)  # (210, 160, 1)
            self.output = tf.image.crop_to_bounding_box(self.output, 34, 0, 160, 160)  # (160, 160, 1)
            self.output = tf.image.resize_images(
                self.output, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)  # (84, 84, 1)
            self.output = tf.squeeze(self.output)  # shape: (84, 84)

    def process(self, sess, state):
        return sess.run(self.output, {self.input_state: state})

class Estimator(object):
    def __init__(self, scope="estimator"):
        self.scope = scope
        with tf.variable_scope(scope):
            self._build_model()

    def _build_model(self):
        self.X_pl = tf.placeholder(shape=[None, 84, 84, 4], dtype=tf.uint8, name="X")  # input
        self.y_pl = tf.placeholder(shape=[None], dtype=tf.float32, name="y")  # TD target
        self.actions_pl = tf.placeholder(shape=[None], dtype=tf.int32, name="actions")

        X = tf.to_float(self.X_pl) / 255.0  # (?, 84, 84, 4)
        batch_size = tf.shape(self.X_pl)[0]

        # Three convolutional layers
        conv1 = tf.contrib.layers.conv2d(X, 32, 8, 4, activation_fn=tf.nn.relu)      # (?, 21, 21, 32)
        conv2 = tf.contrib.layers.conv2d(conv1, 64, 4, 2, activation_fn=tf.nn.relu)  # (?, 11, 11, 64)
        conv3 = tf.contrib.layers.conv2d(conv2, 64, 3, 1, activation_fn=tf.nn.relu)  # (?, 11, 11, 64)

        # Fully connected layers
        flattened = tf.contrib.layers.flatten(conv3)  # (?, 7744)
        fc1 = tf.contrib.layers.fully_connected(flattened, 512)  # (?, 512)
        self.predictions = tf.contrib.layers.fully_connected(fc1, len(VALID_ACTIONS))  # (?, 4)

        # Get the predictions for the chosen actions only
        gather_indices = tf.range(batch_size) * tf.shape(self.predictions)[1] + self.actions_pl
        self.action_predictions = tf.gather(tf.reshape(self.predictions, [-1]), gather_indices)

        # Calculate the loss
        self.losses = tf.squared_difference(self.y_pl, self.action_predictions)
        self.loss = tf.reduce_mean(self.losses)

        # Optimizer parameters from the original paper
        self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6)
        self.train_op = self.optimizer.minimize(self.loss, global_step=tf.contrib.framework.get_global_step())

    def predict(self, sess, s):
        return sess.run(self.predictions, {self.X_pl: s})

    def update(self, sess, s, a, y):
        feed_dict = {self.X_pl: s, self.y_pl: y, self.actions_pl: a}
        global_step, _, loss = sess.run(
            [tf.contrib.framework.get_global_step(), self.train_op, self.loss], feed_dict)
        return loss

def copy_model_parameters(sess, estimator1, estimator2):
    e1_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator1.scope)]
    e1_params = sorted(e1_params, key=lambda v: v.name)
    e2_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator2.scope)]
    e2_params = sorted(e2_params, key=lambda v: v.name)
    updated_ops = []
    for e1_v, e2_v in zip(e1_params, e2_params):
        op = e2_v.assign(e1_v)
        updated_ops.append(op)
    sess.run(updated_ops)

def make_epsilon_greedy_policy(estimator, nA):
    def policy_fn(sess, observation, epsilon):
        A = np.ones(nA, dtype=float) * epsilon / nA
        q_values = estimator.predict(sess, np.expand_dims(observation, 0))[0]
        best_action = np.argmax(q_values)
        A[best_action] += (1.0 - epsilon)
        return A
    return policy_fn

def deep_q_learning(sess, env, q_estimator, target_estimator, state_processor, num_episodes,
                    replay_memory_size, replay_memory_init_size,
                    update_target_estimator_every=10000, discount_factor=0.99,
                    epsilon_start=1.0, epsilon_end=0.1, epsilon_decay_steps=500000,
                    batch_size=32):
    Transition = namedtuple('Transition', ['state', 'action', 'reward', 'next_state', 'done'])
    replay_memory = []  # replay memory
    total_t = sess.run(tf.contrib.framework.get_global_step())  # Get the current time step
    epsilons = np.linspace(epsilon_start, epsilon_end, epsilon_decay_steps)  # The epsilon decay schedule
    policy = make_epsilon_greedy_policy(q_estimator, len(VALID_ACTIONS))  # The policy followed

    # Populate the replay memory with initial experience
    state = env.reset()
    state = state_processor.process(sess, state)
    state = np.stack([state] * 4, axis=2)
    for i in range(replay_memory_init_size):
        action_probs = policy(sess, state, epsilons[min(total_t, epsilon_decay_steps - 1)])  # max epsilon value
        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
        next_state, reward, done, _ = env.step(VALID_ACTIONS[action])
        next_state = state_processor.process(sess, next_state)
        next_state = np.append(state[:, :, 1:], np.expand_dims(next_state, 2), axis=2)
        replay_memory.append(Transition(state, action, reward, next_state, done))
        if done:
            state = env.reset()
            state = state_processor.process(sess, state)
            state = np.stack([state] * 4, axis=2)
        else:
            state = next_state

    episodes_rewards = []  # Rewards of different episodes
    for i_episode in range(num_episodes):
        state = env.reset()
        state = state_processor.process(sess, state)
        state = np.stack([state] * 4, axis=2)
        loss = None
        episode_reward = 0

        for t in itertools.count():
            epsilon = epsilons[min(total_t, epsilon_decay_steps - 1)]
            if total_t % update_target_estimator_every == 0:
                copy_model_parameters(sess, q_estimator, target_estimator)
                print("\nCopied model parameters to target network")
            print("\rStep {} ({}) @ Episode {}/{}, loss: {}".format(
                t, total_t, i_episode + 1, num_episodes, loss), end="")
            sys.stdout.flush()

            # Take a step
            action_probs = policy(sess, state, epsilon)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(VALID_ACTIONS[action])
            next_state = state_processor.process(sess, next_state)
            next_state = np.append(state[:, :, 1:], np.expand_dims(next_state, 2), axis=2)

            # If our replay memory is full, pop the first element
            if len(replay_memory) == replay_memory_size:
                replay_memory.pop(0)
            # Save transition to replay memory
            replay_memory.append(Transition(state, action, reward, next_state, done))
            episode_reward += reward

            # Sample a minibatch from the replay memory
            samples = random.sample(replay_memory, batch_size)
            states_batch, action_batch, reward_batch, next_state_batch, done_batch = map(np.array, zip(*samples))

            # Calculate q values and targets
            q_value_next = q_estimator.predict(sess, next_state_batch)
            targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * \
                discount_factor * np.amax(q_value_next, axis=1)

            # Perform gradient descent update
            states_batch = np.array(states_batch)
            loss = q_estimator.update(sess, states_batch, action_batch, targets_batch)

            if done:
                episodes_rewards.append(episode_reward)
                print("Episode:{}, Episode_reward:{}\r".format(i_episode, episode_reward))
                break
            state = next_state
            total_t += 1

tf.reset_default_graph()
global_step = tf.Variable(0, name='global_step', trainable=False)
q_estimator = Estimator(scope='q')
target_estimator = Estimator(scope='target_q')
state_processor = StateProcessor()  # State processor

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    deep_q_learning(sess, env,
                    q_estimator=q_estimator,
                    target_estimator=target_estimator,
                    state_processor=state_processor,
                    num_episodes=1000,
                    replay_memory_size=500000,
                    replay_memory_init_size=50000,
                    update_target_estimator_every=1000,
                    epsilon_start=1.0,
                    epsilon_end=0.1,
                    epsilon_decay_steps=5000,
                    discount_factor=0.99,
                    batch_size=32)
```
Visualization of the network:
4.5 DQN Summary
The replay buffer and target network introduced above improve performance a lot. The effects of the replay buffer and target network, according to the ablation tests in the Nature DQN paper, are as below:
5. Double DQN
Value Overestimation
One problem in the DQN algorithm is that the agent tends to overestimate the Q function, due to the max in the formula used to set targets (written here with target-network weights $\theta^-$):
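$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$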
To demonstrate this problem, let's imagine the following situation. For one particular state there is a set of actions, all of which have the same true Q value. But the estimates are inherently noisy and differ from the true value. Because of the max in the formula, the action with the highest positive error is selected, and this value is subsequently propagated further to other states. This leads to a positive bias, value overestimation, which severely impacts the stability of the learning algorithm.
Here is another example: consider a single state where the true Q values of all actions equal 0, but the estimated values are distributed somewhat above and below zero. Taking the maximum of these estimates (which is obviously bigger than zero) to update the function leads to overestimation of values.
Another similar example from post …: DQN tends to overestimate Q values because of the max operator in the Q-learning equation. The max operator uses the same values for both selecting and evaluating an action. Suppose in state $s$ there exist five actions $a_1$ to $a_5$, and $a_3$ is the best action. When we estimate the Q values of all these actions in state $s$, the estimates will have some noise and differ from the actual values. Due to this noise, an action such as $a_2$ may get a higher estimated value than the optimal action $a_3$. If we then select the best action as the one with the maximum Q value, we end up selecting the suboptimal action $a_2$ instead of the optimal action $a_3$.
The above problem can be solved by using two separate Q functions, each learning independently. One Q function is used to select an action, while the other is used to evaluate that action. This can be realized just by tweaking the target function of DQN.
Recall the target function for DQN:
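$$y^{\text{DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$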
Just modify the target function to formulate the Double DQN target:
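$$y^{\text{DoubleDQN}} = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\, \theta^-\big)$$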
In the above equation, we have two Q functions, each with different weights. The Q function with weights $\theta$ is used to select the action, and the other Q function with weights $\theta^-$ is used to evaluate the action. Of course, the roles of these two Q functions can also be switched.
Compared with the double Q-learning algorithm:
In the original double Q-learning algorithm, two value functions are learned by randomly assigning each experience to update one of the two value functions, so that there are two sets of weights, $\theta$ and $\theta'$. For each update, one set of weights is used to determine the greedy policy and the other to determine its value. In standard Q-learning, by contrast, we estimate the value of the greedy policy according to the current values themselves.
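In Sutton and Barto's formulation, for example, the update applied to $Q_1$ (chosen with probability 0.5; otherwise the roles of $Q_1$ and $Q_2$ are swapped) is:

$$Q_1(s, a) \leftarrow Q_1(s, a) + \alpha \Big[ r + \gamma\, Q_2\big(s', \arg\max_{a'} Q_1(s', a')\big) - Q_1(s, a) \Big]$$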
The complete pseudocode for Double DQN is as follows:
Estimation errors of any kind can induce an upward bias, regardless of whether these errors are due to environmental noise, function approximation, non-stationarity, or any other source.
This is the same maximization-bias problem encountered in double Q-learning, as introduced in the following cases:
Double DQN handles the problem of Q-value overestimation.
Double DQN does not always improve performance, but it substantially benefits the stability of learning, and this improved stability translates directly into the ability to learn much more complicated tasks.
Implementation
Version 1: an easy Keras 'CartPole-v1' implementation
```python
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras import backend as K
import tensorflow as tf

EPISODES = 5000

class DDQNAgent(object):
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.99
        self.learning_rate = 0.001
        self.model = self._build_model()
        self.target_model = self._build_model()
        self.update_target_model()

    def _huber_loss(self, y_true, y_pred, clip_delta=1.0):
        '''Huber loss for Q Learning'''
        error = y_true - y_pred
        cond = K.abs(error) <= clip_delta
        squared_loss = 0.5 * K.square(error)
        quadratic_loss = 0.5 * K.square(clip_delta) + clip_delta * (K.abs(error) - clip_delta)
        return K.mean(tf.where(cond, squared_loss, quadratic_loss))

    def _build_model(self):
        # Neural net for the Deep Q-learning model
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss=self._huber_loss, optimizer=Adam(lr=self.learning_rate))
        return model

    def update_target_model(self):
        # Copy weights from model to target_model
        self.target_model.set_weights(self.model.get_weights())

    def memorize(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:  # fixed typo: was 'doen'
            target = self.model.predict(state)
            if done:
                target[0][action] = reward
            else:
                t = self.target_model.predict(next_state)[0]
                target[0][action] = reward + self.gamma * np.amax(t)
            self.model.fit(state, target, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)

if __name__ == '__main__':
    env = gym.make('CartPole-v1')
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    agent = DDQNAgent(state_size, action_size)
    # agent.load("./save/cartpole-ddqn.h5")
    done = False
    batch_size = 32

    for e in range(EPISODES):
        state = env.reset()  # <class 'numpy.ndarray'>; (4,)
        state = np.reshape(state, [1, state_size])
        for t in range(500):
            # env.render()
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            x, x_dot, theta, theta_dot = next_state
            r1 = (env.x_threshold - abs(x)) / env.x_threshold - 0.8
            r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5
            reward = r1 + r2
            next_state = np.reshape(next_state, [1, state_size])
            agent.memorize(state, action, reward, next_state, done)
            state = next_state
            if done:
                agent.update_target_model()
                print('episode: {}/{}, score:{}, e:{:.2}'.format(e, EPISODES, t, agent.epsilon))
                break
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)
        # if e % 10 == 0:
        #     agent.save('./save/cartpole-ddqn.h5')
```
Version 2: a more complicated 'Breakout' implementation
```python
import gym
from gym.wrappers import Monitor
import itertools
import numpy as np
import tensorflow as tf
from collections import deque, namedtuple
import random

class StateProcessor(object):
    """Processes a raw Atari image. Resizes it and converts it to grayscale."""
    def __init__(self):
        # Build the Tensorflow graph
        with tf.variable_scope('state_processor'):
            self.input_state = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8)
            self.output = tf.image.rgb_to_grayscale(self.input_state)
            self.output = tf.image.crop_to_bounding_box(self.output, 34, 0, 160, 160)
            self.output = tf.image.resize_images(
                self.output, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
            self.output = tf.squeeze(self.output)

    def process(self, sess, state):
        """
        Args:
            sess: A Tensorflow session object
            state: A [210, 160, 3] Atari RGB State
        Returns:
            A processed [84, 84] state representing grayscale values.
        """
        return sess.run(self.output, feed_dict={self.input_state: state})

class Estimator(object):
    """Q-Value Estimator neural network.
    This network is used for both the Q-Network and the Target Network."""
    def __init__(self, scope='estimator'):
        self.scope = scope
        with tf.variable_scope(scope):
            # Build the graph
            self._build_model()

    def _build_model(self):
        """Builds the Tensorflow graph."""
        # The input are 4 grayscale frames of shape 84, 84 each
        self.X_pl = tf.placeholder(shape=[None, 84, 84, 4], dtype=tf.uint8, name='X')
        # The TD target value
        self.y_pl = tf.placeholder(shape=[None], dtype=tf.float32, name='y')
        # Integer id of which action was selected
        self.actions_pl = tf.placeholder(shape=[None], dtype=tf.int32, name='actions')

        X = tf.to_float(self.X_pl) / 255.0
        batch_size = tf.shape(self.X_pl)[0]

        # Three convolutional layers
        conv1 = tf.contrib.layers.conv2d(X, 32, 8, 4, activation_fn=tf.nn.relu)
        conv2 = tf.contrib.layers.conv2d(conv1, 64, 4, 2, activation_fn=tf.nn.relu)
        conv3 = tf.contrib.layers.conv2d(conv2, 64, 3, 1, activation_fn=tf.nn.relu)

        # Fully connected layers
        flattened = tf.contrib.layers.flatten(conv3)
        fc1 = tf.contrib.layers.fully_connected(flattened, 512)
        self.predictions = tf.contrib.layers.fully_connected(fc1, len(VALID_ACTIONS))

        # Get the predictions for the chosen actions only
        gather_indices = tf.range(batch_size) * tf.shape(self.predictions)[1] + self.actions_pl
        self.action_predictions = tf.gather(tf.reshape(self.predictions, [-1]), gather_indices)

        # Calculate the loss
        self.losses = tf.squared_difference(self.y_pl, self.action_predictions)
        self.loss = tf.reduce_mean(self.losses)

        # Optimizer parameters from the original paper
        self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6)
        self.train_op = self.optimizer.minimize(self.loss, global_step=tf.contrib.framework.get_global_step())

    def predict(self, sess, s):
        """Predicts action values.
        Args:
            sess: Tensorflow session
            s: State input of shape [batch_size, 4, 84, 84, 1]
        Returns:
            Tensor of shape [batch_size, NUM_VALID_ACTIONS] containing the estimated action values.
        """
        return sess.run(self.predictions, {self.X_pl: s})

    def update(self, sess, s, a, y):
        """Updates the estimator towards the given targets.
        Args:
            sess: Tensorflow session object
            s: State input of shape [batch_size, 4, 84, 84, 1]
            a: Chosen actions of shape [batch_size]
            y: Targets of shape [batch_size]
        Returns:
            The calculated loss on the batch.
        """
        feed_dict = {self.X_pl: s, self.y_pl: y, self.actions_pl: a}
        global_step, _, loss = sess.run(
            [tf.contrib.framework.get_global_step(), self.train_op, self.loss], feed_dict)
        return loss

class ModelParametersCopier(object):
    """Copies the model parameters of one estimator to another.
    Args:
        estimator1: Estimator to copy the parameters from
        estimator2: Estimator to copy the parameters to
    """
    def __init__(self, estimator1, estimator2):
        e1_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator1.scope)]
        e1_params = sorted(e1_params, key=lambda v: v.name)
        e2_params = [t for t in tf.trainable_variables() if t.name.startswith(estimator2.scope)]
        e2_params = sorted(e2_params, key=lambda v: v.name)
        self.update_ops = []
        for e1_v, e2_v in zip(e1_params, e2_params):
            op = e2_v.assign(e1_v)
            self.update_ops.append(op)

    def make(self, sess):
        sess.run(self.update_ops)

def make_epsilon_greedy_policy(estimator, nA):
    """Creates an epsilon-greedy policy based on a given Q-function approximator and epsilon.
    Args:
        estimator: An estimator that returns q values for a given state
        nA: Number of actions in the environment.
    Returns:
        A function that takes (sess, observation, epsilon) as arguments and returns
        the probabilities for each action in the form of a numpy array of length nA.
    """
    def policy_fn(sess, observation, epsilon):
        A = np.ones(nA, dtype=float) * epsilon / nA
        q_values = estimator.predict(sess, np.expand_dims(observation, 0))[0]
        best_action = np.argmax(q_values)
        A[best_action] += (1.0 - epsilon)
        return A
    return policy_fn

def double_deep_q_learning(sess, env, q_estimator, target_estimator, state_processor,
                           num_episodes, replay_memory_size=500000,
                           replay_memory_init_size=50000, update_target_estimator_every=10000,
                           discount_factor=0.99, epsilon_start=1.0, epsilon_end=0.1,
                           epsilon_decay_steps=500000, batch_size=32, record_video_every=50):
    """Q-Learning algorithm for off-policy TD control using Function Approximation.
    Finds the optimal greedy policy while following an epsilon-greedy policy.
    Args:
        sess: Tensorflow Session object
        env: OpenAI environment
        q_estimator: Estimator object used for the q values
        target_estimator: Estimator object used for the targets
        state_processor: A StateProcessor object
        num_episodes: Number of episodes to run for
        replay_memory_size: Size of the replay memory
        replay_memory_init_size: Number of random experiences to sample when initializing the replay memory.
        update_target_estimator_every: Copy parameters from the Q estimator to the target estimator every N steps
        discount_factor: Gamma discount factor
        epsilon_start: Chance to sample a random action when taking an action.
            Epsilon is decayed over time and this is the start value
        epsilon_end: The final minimum value of epsilon after decaying is done
        epsilon_decay_steps: Number of steps to decay epsilon over
        batch_size: Size of batches to sample from the replay memory
        record_video_every: Record a video every N episodes
    """
    Transition = namedtuple('Transition', ['state', 'action', 'reward', 'next_state', 'done'])
    replay_memory = []
    total_t = sess.run(tf.contrib.framework.get_global_step())  # Get the current time step
    epsilons = np.linspace(epsilon_start, epsilon_end, epsilon_decay_steps)  # The epsilon decay schedule
    policy = make_epsilon_greedy_policy(q_estimator, len(VALID_ACTIONS))  # The policy to follow

    print("Populating replay memory...")
    state = env.reset()
    state = state_processor.process(sess, state)
    state = np.stack([state] * 4, axis=2)
    for i in range(replay_memory_init_size):
        action_probs = policy(sess, state, epsilons[total_t])
        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
        next_state, reward, done, _ = env.step(VALID_ACTIONS[action])
        next_state = state_processor.process(sess, next_state)
        next_state = np.append(state[:, :, 1:], np.expand_dims(next_state, 2), axis=2)
        replay_memory.append(Transition(state, action, reward, next_state, done))
        if done:
            state = env.reset()
            state = state_processor.process(sess, state)
            state = np.stack([state] * 4, axis=2)
        else:
            state = next_state

    estimator_copy = ModelParametersCopier(q_estimator, target_estimator)
    # Record videos
    env = Monitor(env, directory='./ddqn_results',
                  video_callable=lambda count: count % record_video_every == 0, resume=True)

    for i_episode in range(num_episodes):
        state = env.reset()  # Reset the environment
        state = state_processor.process(sess, state)
        state = np.stack([state] * 4, axis=2)
        loss = None
        episode_reward = 0

        for t in itertools.count():
            # One step in the environment
            epsilon = epsilons[min(total_t, epsilon_decay_steps - 1)]  # Epsilon for this time step
            if total_t % update_target_estimator_every == 0:
                # Update the target estimator
                estimator_copy.make(sess)
                print("\nCopied model parameters to target network.")
            # Print out which step we're on, useful for debugging.
            print("\rStep {} ({}) @ Episode {}/{}, loss: {}".format(
                t, total_t, i_episode + 1, num_episodes, loss), end="")

            # Take a step
            action_probs = policy(sess, state, epsilon)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(VALID_ACTIONS[action])
            next_state = state_processor.process(sess, next_state)
            next_state = np.append(state[:, :, 1:], np.expand_dims(next_state, 2), axis=2)

            # If our replay memory is full, pop the first element
            if len(replay_memory) == replay_memory_size:
                replay_memory.pop(0)
            # Save transition to replay memory
            replay_memory.append(Transition(state, action, reward, next_state, done))
            episode_reward += reward

            # Sample a minibatch from the replay memory
            samples = random.sample(replay_memory, batch_size)
            states_batch, action_batch, reward_batch, next_states_batch, done_batch = map(np.array, zip(*samples))

            # Calculate q values and targets: the online network selects the best
            # next action, the target network evaluates it (Double DQN)
            q_values_next = q_estimator.predict(sess, next_states_batch)
            best_actions = np.argmax(q_values_next, axis=1)
            q_values_next_target = target_estimator.predict(sess, next_states_batch)
            targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * discount_factor * \
                q_values_next_target[np.arange(batch_size), best_actions]

            # Perform gradient descent update
            states_batch = np.array(states_batch)
            loss = q_estimator.update(sess, states_batch, action_batch, targets_batch)

            if done:
                break
            state = next_state
            total_t += 1

        print("Episode Reward: {} \r".format(episode_reward))

if __name__ == '__main__':
    env = gym.envs.make("Breakout-v0")
    # Atari Actions: 0 (noop), 1 (fire), 2 (left) and 3 (right) are valid actions
    VALID_ACTIONS = [0, 1, 2, 3]

    tf.reset_default_graph()
    # Create a global step variable
    global_step = tf.Variable(0, name='global_step', trainable=False)
    q_estimator = Estimator(scope='q')
    target_estimator = Estimator(scope='target_q')
    state_processor = StateProcessor()

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        double_deep_q_learning(sess, env,
                               q_estimator=q_estimator,
                               target_estimator=target_estimator,
                               state_processor=state_processor,
                               num_episodes=10,
                               replay_memory_size=500000,
                               replay_memory_init_size=50000,
                               update_target_estimator_every=10000,
                               epsilon_start=1.0,
                               epsilon_end=0.1,
                               epsilon_decay_steps=500000,
                               discount_factor=0.99,
                               batch_size=32)
```
6. Dueling DQN
6.1 Dueling Architecture
(This part is summarized from the Dueling DQN paper.)
Two separate estimators: one for the state value function and one for the state-dependent action advantage function.
The dueling network architecture explicitly separates the representation of state values and (state-dependent) action advantages. It consists of two streams that represent the value and advantage functions, while sharing a common convolutional feature-learning module. The two streams are combined via a special aggregating layer to produce an estimate of the state-action value function Q, as shown in the figure below. This dueling network should be understood as a single Q network with two streams that replaces the popular single-stream Q network in existing algorithms such as Deep Q-Networks (DQN). The dueling network automatically produces separate estimates of the state value function and advantage function, without any extra supervision.
The figure below: a popular single-stream Q-network (top) and the dueling Q-network (bottom). The dueling network has two streams to separately estimate the (scalar) state value and the advantages for each action; the green output module combines them. Both networks output Q-values for each action.
Intuitively, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action in each state. This is particularly useful in states where actions do not affect the environment in any relevant way. The dueling architecture can also more quickly identify the correct action during policy evaluation as redundant or similar actions are added to the learning problem.
A more detailed explanation is given by … (to be cited) as follows:
6.2 Formulation
The key insight behind our new architecture is that for many states, it is unnecessary to estimate the value of each action choice.
Consider the dueling network versus the single-stream Q network. The lower layers of the dueling network are convolutional, as in the original DQN. However, instead of following the convolutional layers with a single sequence of fully connected layers, we use two sequences (or streams) of fully connected layers. The streams are constructed such that they can provide separate estimates of the value and advantage functions. Finally, the two streams are combined to produce a single output Q function. The output of the network is a set of Q values, one for each action.
Action-value function:
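$$Q^{\pi}(s, a) = \mathbb{E}\big[ R_t \mid s_t = s,\, a_t = a,\, \pi \big]$$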
The State-value Function:
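$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)}\big[ Q^{\pi}(s, a) \big]$$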
The advantage function is defined as $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$, and its expectation under the policy equals zero: $\mathbb{E}_{a \sim \pi(s)}\big[ A^{\pi}(s, a) \big] = 0$.
For a deterministic policy, $a^* = \arg\max_{a' \in \mathcal{A}} Q(s, a')$, thus $Q(s, a^*) = V(s)$ and $A(s, a^*) = 0$.
The aggregating module is as follows:
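$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$$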
where $V(s; \theta, \beta)$ is a scalar output by one stream of fully connected layers, and $A(s, a; \theta, \alpha)$ is an $|\mathcal{A}|$-dimensional vector. $\theta$ denotes the parameters of the convolutional layers, while $\alpha$ and $\beta$ are the parameters of the two streams of fully connected layers.
A problem arises: the above equation is unidentifiable in the sense that, given $Q$, we cannot recover $V$ and $A$ uniquely. For example, add a constant to $V(s; \theta, \beta)$ and subtract the same constant from $A(s, a; \theta, \alpha)$.
To address this issue of identifiability, one can force the advantage function estimator to have zero advantage at the chosen action:
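$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \max_{a' \in \mathcal{A}} A(s, a'; \theta, \alpha) \Big)$$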
An alternative module replaces the max operator with an average:
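$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \Big)$$

This mean-subtraction form is what the Dueling DQN code below implements via tf.reduce_mean over the advantage stream.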
6.3 Pseudocode
6.4 Code implementation
The epsilon parameter changes according to the curve below.
Code implementation:
import os import random import gym import tensorflow as tf import numpy as np import imageio from skimage.transform import resize # from skimage.transform import resize class FrameProcessor(object): """Resizes and converts RGB Atari frames to grayscale""" def __init__(self, frame_height=84, frame_width=84): """ Args: frame_height: Integer, Height of a frame of an Atari game frame_width: Integer, Width of a frame of an Atari game """ self.frame_height = frame_height self.frame_width = frame_width self.frame = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8) self.processed = tf.image.rgb_to_grayscale(self.frame) self.processed = tf.image.crop_to_bounding_box(self.processed, 34, 0, 160, 160) self.processed = tf.image.resize_images(self.processed, [self.frame_height, self.frame_width], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR) def __call__(self, sess, frame): """ Args: sess: A Tensorflow session object frame: A (210, 160, 3) frame of an Atari game in RGB Returns: A processed (84, 84, 1) frame in grayscale """ return sess.run(self.processed, feed_dict={self.frame:frame}) class DQN(object): """Implements a Deep Q Network""" def __init__(self, n_actions, hidden=124, learning_rate=0.00001, frame_height=84, frame_width=84, agent_history_length=4): """ Args: n_actions: Integer, number of possible actions hidden: Integer, Number of filters in the final convolutional layer. This is different from the DeepMind implementation learning_rate: Float, Learning rate for the Adam optimizer frame_height: Integer, Height of a frame of an Atari game frame_width: Integer, Width of a frame of an Atari game agent_history_length: Integer, Number of frames stacked together to create a state """ self.n_actions = n_actions self.hidden = hidden self.learning_rate = learning_rate self.frame_height = frame_height self.frame_width = frame_width self.agent_history_length = agent_history_length self.input = tf.placeholder(shape=[None, self.frame_height, self.frame_width, self.agent_history_length], dtype=tf.float32) # Normalizing the input self.inputscaled = self.input / 255 # Convolutional layers self.conv1 = tf.layers.conv2d(inputs=self.inputscaled, filters=32, kernel_size=[8, 8], strides=4, kernel_initializer=tf.variance_scaling_initializer(scale=2), padding='valid', activation=tf.nn.relu, use_bias=False, name='conv1') self.conv2 = tf.layers.conv2d(inputs=self.conv1, filters=64, kernel_size=[4,4], strides=2, kernel_initializer=tf.variance_scaling_initializer(scale=2), padding='valid', activation=tf.nn.relu, use_bias=False, name='conv2') self.conv3 = tf.layers.conv2d(inputs=self.conv2, filters=64, kernel_size=[3,3], strides=1, kernel_initializer=tf.variance_scaling_initializer(scale=2), padding='valid', activation=tf.nn.relu, use_bias=False, name='conv3') self.conv4 = tf.layers.conv2d(inputs=self.conv3, filters=hidden, kernel_size=[7,7], strides=1, kernel_initializer=tf.variance_scaling_initializer(scale=2), padding='valid', activation=tf.nn.relu, use_bias=False, name='conv4') #Splitting into value and advantage streams self.value_stream, self.advantage_stream = tf.split(self.conv4, 2, 3) self.value_stream = tf.layers.flatten(self.value_stream) self.advantage_stream = tf.layers.flatten(self.advantage_stream) self.advantage = tf.layers.dense(inputs=self.advantage_stream, units=self.n_actions, kernel_initializer=tf.variance_scaling_initializer(scale=2), name='advantage') self.value = tf.layers.dense(inputs=self.value_stream, units=1, kernel_initializer=tf.variance_scaling_initializer(scale=2), name='value') # Combing 
value and advantage into Q-values self.q_values = self.value + tf.subtract(self.advantage, tf.reduce_mean(self.advantage, axis=1, keepdims=True)) self.best_action = tf.argmax(self.q_values, 1) # targetQ according to Bellman equation: # Q = r + gamma*max Q', calculated in the function learn() self.target_q = tf.placeholder(shape=[None], dtype=tf.float32) # Action that was performed self.action = tf.placeholder(shape=[None], dtype=tf.int32) # Q value of the action that was performed self.Q = tf.reduce_sum(tf.multiply(self.q_values, tf.one_hot(self.action, self.n_actions, dtype=tf.float32)), axis=1) # Parameters updates self.loss = tf.reduce_mean(tf.losses.huber_loss(labels=self.target_q, predictions=self.Q)) self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate) self.update = self.optimizer.minimize(self.loss) class ExplorationExploitationScheduler(object): """Determines an action according to an epsilon greedy strategy with annealing epsilon""" def __init__(self, DQN, n_actions, eps_initial=1, eps_final=0.1, eps_final_frame=0.01, eps_evaluation=0.0, eps_annealing_frames=1000000, replay_memory_start_size=50000, max_frames=25000000): """ Args: DQN: A DQN object n_actions: Integer, number of possible actions eps_initial: Float, Exploration probability for the first replay_memory_start_size frames eps_final: Float, Exploration probability after replay_memory_start_size + eps_annealing_frames frames eps_final_frame: Float, Exploration probability after max_frames frames eps_evaluation: Float, Exploration probability during evaluation eps_annealing_frames: Int, Number of frames over which the exploration probabilty is annealed from eps_initial to eps_final replay_memory_start_size: Integer, Number of frames during which the agent only explores max_frames: Integer, Total number of frames shown to the agent """ self.n_actions = n_actions self.eps_initial = eps_initial self.eps_final = eps_final self.eps_final_frame = eps_final_frame self.eps_evaluation = eps_evaluation self.eps_annealing_frames = eps_annealing_frames self.replay_memory_start_size = replay_memory_start_size self.max_frames = max_frames # Slopes and intercepts for exploration decrease self.slope = -(self.eps_initial - eps_final) / self.eps_annealing_frames self.intercept = self.eps_initial - self.slope*self.replay_memory_start_size self.slope2 = -(self.eps_final - self.eps_final_frame)/(self.max_frames - self.eps_annealing_frames - self.replay_memory_start_size) self.intercept_2 = self.eps_final_frame - self.slope2*self.max_frames self.DQN = DQN def get_action(self, sess, frame_number, state, evaluation=False): """ Args: session: A tensorflow session object frame_number: Integer, number of the current frame state: A (84, 84, 4) sequence of frames of an Atari game in grayscale evaluation: A boolean saying whether the agent is being evaluated Returns: An integer between 0 and n_actions - 1 determining the action the agent perfoms next """ if evaluation: eps = self.eps_evaluation elif frame_number < self.replay_memory_start_size: eps = self.eps_initial elif frame_number >= self.replay_memory_start_size and \ frame_number < self.replay_memory_start_size + self.eps_annealing_frames: eps = self.slope*frame_number + self.intercept elif frame_number >= self.replay_memory_start_size + self.eps_annealing_frames: eps = self.slope2*frame_number + self.intercept_2 if np.random.rand(1) < eps: return np.random.randint(0, self.n_actions) return sess.run(self.DQN.best_action, feed_dict={self.DQN.input:[state]})[0] class 
ReplayMemory(object): """Replay Memory that stores the last size=1,000,000 transitions""" def __init__(self, size=1000000, frame_height=84, frame_width=84, agent_history_length=4, batch_size=32): """ Args: size: Integer, Number of stored transitions frame_height: Integer, Height of a frame of an Atari game frame_width: Integer, Width of a frame of an Atari game agent_history_length: Integer, Number of frames stacked together to create a state batch_size: Integer, Number if transitions returned in a minibatch """ self.size = size self.frame_height = frame_height self.frame_width = frame_width self.agent_history_length = agent_history_length self.batch_size = batch_size self.count = 0 self.current = 0 # Pre-allocate memory self.actions = np.empty(self.size, dtype=np.int32) self.rewards = np.empty(self.size, dtype=np.float32) self.frames = np.empty((self.size, self.frame_height, self.frame_width),dtype=np.uint8) self.terminal_flags = np.empty(self.size, dtype=np.bool) # Pre-allocate memory for the states and new_states in a minibatch self.states = np.empty((self.batch_size, self.agent_history_length, self.frame_height, self.frame_width),dtype=np.uint8) self.new_states = np.empty((self.batch_size, self.agent_history_length, self.frame_height, self.frame_width), dtype=np.uint8) self.indices = np.empty(self.batch_size, dtype=np.int32) # batch element index def add_experience(self, action, frame, reward, terminal): """ Args: action: An integer between 0 and env.action_space.n - 1 determining the action the agent perfomed frame: A (84, 84, 1) frame of an Atari game in grayscale reward: A float determining the reward the agend received for performing an action terminal: A bool stating whether the episode terminated """ if frame.shape != (self.frame_height, self.frame_width): raise ValueError('Dimension of frame is wrong') self.actions[self.current] = action self.frames[self.current, ...] = frame ### TODO self.rewards[self.current] = reward self.terminal_flags[self.current] = terminal self.count = max(self.count, self.current+1) self.current = (self.current + 1) % self.size def _get_state(self, index): if self.count is 0: raise ValueError('The replay memory is empty!') if index < self.agent_history_length - 1: raise ValueError('Index must be min 3') return self.frames[index-self.agent_history_length+1:index+1, ...] 
def _get_valid_indices(self): for i in range(self.batch_size): while True: index = random.randint(self.agent_history_length, self.count-1) if index < self.agent_history_length: continue if index >= self.current and index - self.agent_history_length <=self.current: continue if self.terminal_flags[index - self.agent_history_length:index].any(): continue break self.indices[i] = index def get_minibatch(self): """ Returns a minibatch of self.batch_size = 32 transitions """ if self.count < self.agent_history_length: raise ValueError('Not enough memories to get a minibatch') self._get_valid_indices() for i, idx in enumerate(self.indices): self.states[i] = self._get_state(idx - 1) self.new_states[i] = self._get_state(idx) return np.transpose(self.states, axes=(0, 2, 3, 1)), self.actions[self.indices], self.rewards[self.indices],\ np.transpose(self.new_states, axes=(0, 2, 3, 1)), self.terminal_flags[self.indices] def learn(sess, replay_memory, main_dqn, target_dqn, batch_size, gamma): """ Args: session: A tensorflow sesson object replay_memory: A ReplayMemory object main_dqn: A DQN object target_dqn: A DQN object batch_size: Integer, Batch size gamma: Float, discount factor for the Bellman equation Returns: loss: The loss of the minibatch, for tensorboard Draws a minibatch from the replay memory, calculates the target Q-value that the prediction Q-value is regressed to. Then a parameter update is performed on the main DQN. """ # Draw a minibatch from the replay memory states, actions, rewards, new_states, terminal_flags = replay_memory.get_minibatch() # The main network estimates which action is best (in the next # state s', new_states is passed!) for every transition in the minibatch arg_q_max = sess.run(main_dqn.best_action, feed_dict={main_dqn.input: new_states}) # The target network estimates the Q-values (in the next state s', new_states is passed!) # for every transition in the minibatch q_vals = sess.run(target_dqn.q_values, feed_dict={target_dqn.input: new_states}) double_q = q_vals[range(batch_size), arg_q_max] # Bellman equation. 
    # Bellman equation. Multiplication with (1 - terminal_flags) makes sure
    # that if the game is over, targetQ = rewards
    target_q = rewards + (gamma * double_q * (1 - terminal_flags))
    # Gradient descent step to update the parameters of the main network
    loss, _ = sess.run([main_dqn.loss, main_dqn.update],
                       feed_dict={main_dqn.input: states,
                                  main_dqn.target_q: target_q,
                                  main_dqn.action: actions})
    return loss

class TargetNetworkUpdater(object):
    """Copies the parameters of the main DQN to the target DQN"""
    def __init__(self, main_dqn_vars, target_dqn_vars):
        """
        Args:
            main_dqn_vars: A list of tensorflow variables belonging to the main DQN network
            target_dqn_vars: A list of tensorflow variables belonging to the target DQN network
        """
        self.main_dqn_vars = main_dqn_vars
        self.target_dqn_vars = target_dqn_vars

    def _update_target_vars(self):
        update_ops = []
        for i, var in enumerate(self.main_dqn_vars):
            copy_op = self.target_dqn_vars[i].assign(var.value())
            update_ops.append(copy_op)
        return update_ops

    def __call__(self, sess):
        """
        Args:
            sess: A Tensorflow session object
        Assigns the values of the parameters of the main network to the
        parameters of the target network
        """
        update_ops = self._update_target_vars()
        for copy_op in update_ops:
            sess.run(copy_op)

def generate_gif(frame_number, frames_for_gif, reward, path):
    for idx, frame_idx in enumerate(frames_for_gif):
        frames_for_gif[idx] = resize(frame_idx, (420, 320, 3),
                                     preserve_range=True, order=0).astype(np.uint8)
    imageio.mimsave(f'{path}ATARI_frame_{frame_number}_reward_{reward}.gif',
                    frames_for_gif, duration=1/30)

class Atari(object):
    """Wrapper for the environment provided by gym"""
    def __init__(self, envName, no_op_steps=10, agent_history_length=4):
        self.env = gym.make(envName)
        self.process_frame = FrameProcessor()
        self.state = None
        self.last_lives = 0
        self.no_op_steps = no_op_steps
        self.agent_history_length = agent_history_length

    def reset(self, sess, evaluation=False):
        """
        Args:
            sess: A Tensorflow session object
            evaluation: A boolean saying whether the agent is evaluating or training
        Resets the environment and stacks four frames on top of each other
        to create the first state
        """
        frame = self.env.reset()
        self.last_lives = 0
        # Set to true so that the agent starts with a 'FIRE' action when evaluating
        terminal_life_lost = True
        if evaluation:
            for _ in range(random.randint(1, self.no_op_steps)):
                frame, _, _, _ = self.env.step(1)  # Action 'Fire'
        processed_frame = self.process_frame(sess, frame)
        self.state = np.repeat(processed_frame, self.agent_history_length, axis=2)
        return terminal_life_lost

    def step(self, sess, action):
        """
        Args:
            sess: A Tensorflow session object
            action: Integer, action the agent performs
        Performs an action and observes the reward and terminal state from
        the environment
        """
        new_frame, reward, terminal, info = self.env.step(action)
        if info['ale.lives'] < self.last_lives:
            terminal_life_lost = True
        else:
            terminal_life_lost = terminal
        self.last_lives = info['ale.lives']
        processed_new_frame = self.process_frame(sess, new_frame)
        new_state = np.append(self.state[:, :, 1:], processed_new_frame, axis=2)
        self.state = new_state
        return processed_new_frame, reward, terminal, terminal_life_lost, new_frame

def clip_reward(reward):
    if reward > 0:
        return 1
    elif reward == 0:
        return 0
    else:
        return -1

tf.reset_default_graph()

# Control parameters
env_name = 'BreakoutDeterministic-v4'
max_episode_length = 18000   # Equivalent of 5 minutes of gameplay at 60 frames per second
eval_frequency = 200000      # Number of frames the agent sees between evaluations
eval_steps = 10000           # Number of frames for one evaluation
netw_update_freq = 10000     # Number of chosen actions between updating the target network.
                             # According to Mnih et al. 2015 this is measured in the number of
                             # parameter updates (every four actions); however, in the
                             # DeepMind code it is clearly measured in the number
                             # of actions the agent chooses
discount_factor = 0.99       # gamma in the Bellman equation
replay_memory_start_size = 50000  # Number of completely random actions
                                  # before the agent starts learning
max_frames = 30000000        # Total number of frames the agent sees
memory_size = 1000000        # Number of transitions stored in the replay memory
no_op_steps = 10             # Number of 'NOOP' or 'FIRE' actions at the beginning
                             # of an evaluation episode
update_freq = 4              # Every four actions a gradient descent step is performed
hidden = 1024                # Number of filters in the final convolutional layer. The output
                             # has the shape (1,1,1024), which is split into two streams. Both
                             # the advantage stream and value stream have the shape
                             # (1,1,512). This is slightly different from the original
                             # implementation, but tests with the environment Pong
                             # have shown that this way the score increases more quickly
learning_rate = 0.00001      # Set to 0.00025 in Pong for quicker results
batch_size = 32              # Batch size

path = 'output/'
summaries = 'summaries'
runid = 'run_1'
os.makedirs(path, exist_ok=True)
os.makedirs(os.path.join(summaries, runid), exist_ok=True)
summ_writer = tf.summary.FileWriter(os.path.join(summaries, runid))

atari = Atari(env_name, no_op_steps)

# main DQN and target DQN networks:
with tf.variable_scope('mainDQN'):
    main_dqn = DQN(atari.env.action_space.n, hidden, learning_rate)
with tf.variable_scope('targetDQN'):
    target_dqn = DQN(atari.env.action_space.n, hidden)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

main_dqn_vars = tf.trainable_variables(scope='mainDQN')
target_dqn_vars = tf.trainable_variables(scope='targetDQN')

#### Tensorboard
layer_ids = ['conv1', 'conv2', 'conv3', 'conv4', 'denseAdvantage',
             'denseAdvantageBias', 'denseValue', 'denseValueBias']

# Scalar summaries for tensorboard: loss, average reward and evaluation score
with tf.name_scope('Performance'):
    loss_pl = tf.placeholder(tf.float32, shape=None, name='loss_summary')
    loss_summary = tf.summary.scalar('loss', loss_pl)
    reward_pl = tf.placeholder(tf.float32, shape=None, name='reward_summary')
    reward_summary = tf.summary.scalar('reward', reward_pl)
    eval_score_pl = tf.placeholder(tf.float32, shape=None, name='evaluation_summary')
    eval_score_summary = tf.summary.scalar('evaluation_score', eval_score_pl)

performance_summaries = tf.summary.merge([loss_summary, reward_summary])

# Histogram summaries for tensorboard: parameters
with tf.name_scope('Parameters'):
    all_param_summaries = []
    for i, Id in enumerate(layer_ids):
        with tf.name_scope('mainDQN/'):
            main_dqn_kernel = tf.summary.histogram(Id, tf.reshape(main_dqn_vars[i], shape=[-1]))
            all_param_summaries.extend([main_dqn_kernel])
param_summaries = tf.summary.merge(all_param_summaries)

def train():
    """Contains the training and evaluation loops"""
    replay_memory = ReplayMemory(size=memory_size, batch_size=batch_size)  # (★)
    update_networks = TargetNetworkUpdater(main_dqn_vars, target_dqn_vars)
    explore_exploit_sched = ExplorationExploitationScheduler(
        main_dqn, atari.env.action_space.n,
        replay_memory_start_size=replay_memory_start_size,
        max_frames=max_frames)

    with tf.Session() as sess:
        sess.run(init)

        frame_number = 0
        rewards = []
        loss_list = []

        while frame_number < max_frames:
            ####### Training #######
            epoch_frame = 0
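            # Schedule of the inner loop below: learning only starts once the
            # replay memory holds replay_memory_start_size = 50,000 transitions;
            # after that, one gradient step is taken every update_freq = 4
            # actions, and the target network is synchronized every
            # netw_update_freq = 10,000 actions.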
            while epoch_frame < eval_frequency:
                terminal_life_lost = atari.reset(sess)
                episode_reward_sum = 0
                for _ in range(max_episode_length):
                    action = explore_exploit_sched.get_action(sess, frame_number, atari.state)  # (4★)
                    processed_new_frame, reward, terminal, terminal_life_lost, _ = atari.step(sess, action)  # (5★)
                    frame_number += 1
                    epoch_frame += 1
                    episode_reward_sum += reward

                    clipped_reward = clip_reward(reward)  # Clip the reward

                    # (7★) Store transition in the replay memory
                    replay_memory.add_experience(action=action,
                                                 frame=processed_new_frame[:, :, 0],
                                                 reward=clipped_reward,
                                                 terminal=terminal_life_lost)

                    if frame_number % update_freq == 0 and frame_number > replay_memory_start_size:
                        # (8★)
                        loss = learn(sess, replay_memory, main_dqn, target_dqn,
                                     batch_size, gamma=discount_factor)
                        loss_list.append(loss)
                    if frame_number % netw_update_freq == 0 and frame_number > replay_memory_start_size:
                        update_networks(sess)  # (9★)
                        print('\rTarget network parameters updated')

                    if terminal:
                        terminal = False
                        break

                rewards.append(episode_reward_sum)

                # Output the progress:
                if len(rewards) % 10 == 0:
                    summ = sess.run(performance_summaries,
                                    feed_dict={loss_pl: np.mean(loss_list),
                                               reward_pl: np.mean(rewards[-100:])})
                    summ_writer.add_summary(summ, frame_number)
                    loss_list = []

                    summ_param = sess.run(param_summaries)
                    summ_writer.add_summary(summ_param, frame_number)

                    print(len(rewards), frame_number, np.mean(rewards[-100:]))
                    with open('rewards.dat', 'a') as reward_file:
                        print(len(rewards), frame_number, np.mean(rewards[-100:]), file=reward_file)

            ###### Evaluation ######
            terminal = True
            gif = True
            frames_for_gif = []
            eval_rewards = []
            evaluate_frame_number = 0

            for _ in range(eval_steps):
                if terminal:
                    terminal_life_lost = atari.reset(sess, evaluation=True)
                    episode_reward_sum = 0
                    terminal = False

                action = 1 if terminal_life_lost else \
                    explore_exploit_sched.get_action(sess, frame_number, atari.state, evaluation=True)
                processed_new_frame, reward, terminal, terminal_life_lost, new_frame = atari.step(sess, action)
                evaluate_frame_number += 1
                episode_reward_sum += reward

                if gif:
                    frames_for_gif.append(new_frame)
                if terminal:
                    eval_rewards.append(episode_reward_sum)
                    gif = False  # Only record a gif of the first evaluation game

            print('Evaluation score:\n', np.mean(eval_rewards))
            try:
                generate_gif(frame_number, frames_for_gif, eval_rewards[0], path)
            except IndexError:
                print('No evaluation game finished')

            # Save the network parameters
            saver.save(sess, path + 'model', global_step=frame_number)
            frames_for_gif = []

            # Show the evaluation score in tensorboard
            summ = sess.run(eval_score_summary, feed_dict={eval_score_pl: np.mean(eval_rewards)})
            summ_writer.add_summary(summ, frame_number)
            with open('rewardsEval.dat', 'a') as eval_reward_file:
                print(frame_number, np.mean(eval_rewards), file=eval_reward_file)

train()
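To see the Double DQN target from learn() in isolation, here is a minimal NumPy sketch. The minibatch of three transitions and the two Q-value arrays are toy numbers invented for illustration; only the indexing and the Bellman arithmetic match the training code above.

import numpy as np

# Toy minibatch of 3 transitions (hypothetical values, for illustration only)
rewards        = np.array([1.0, 0.0, -1.0], dtype=np.float32)
terminal_flags = np.array([0, 0, 1], dtype=np.float32)  # last transition ends the episode
gamma          = 0.99

# Q-values for the next states s': one row per transition, one column per action
q_main   = np.array([[0.5, 2.0], [1.0, 0.3], [0.2, 0.1]])  # online network: selects the action
q_target = np.array([[0.4, 1.5], [0.9, 0.2], [0.3, 0.0]])  # target network: evaluates it

arg_q_max = q_main.argmax(axis=1)              # a* = argmax_a Q_main(s', a)
double_q  = q_target[np.arange(3), arg_q_max]  # Q_target(s', a*)
target_q  = rewards + gamma * double_q * (1 - terminal_flags)
print(target_q)  # [2.485, 0.891, -1.0]: the terminal transition keeps only its reward

For the terminal transition the bootstrap term is zeroed out, exactly as the (1 - terminal_flags) factor does in learn().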
The complete, runnable notebook this code comes from is available here: https://github.com/fg91/Deep-Q-Learning/blob/master/DQN.ipynb
7. Summary
No more words; the DQN series is summarized in a single graph below.
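In equations, with $\theta$ the online-network weights and $\theta^-$ the target-network weights, the core of each variant covered in this post is:

\begin{aligned}
\text{DQN:} \quad & y = r + \gamma \max_{a'} Q(s', a'; \theta^-) \\
\text{Double DQN:} \quad & y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^-\big) \\
\text{Dueling DQN:} \quad & Q(s, a; \theta) = V(s; \theta) + \Big( A(s, a; \theta) - \tfrac{1}{|\mathcal{A}|} \textstyle\sum_{a'} A(s, a'; \theta) \Big)
\end{aligned}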
References
1. RL — DQN Deep Q-network.
2. Mnih et al., Human-level control through deep reinforcement learning, Nature, 2015.
3. Wang et al., Dueling Network Architectures for Deep Reinforcement Learning, 2016.
4. van Hasselt et al., Deep Reinforcement Learning with Double Q-learning, 2016.
5. Mnih et al., Playing Atari with Deep Reinforcement Learning, 2013.