Policy-based methods directly learn a parameterized policy. Value-based methods learn the values of actions and then select actions based on their estimated action values. In policy-based methods, a value function may still be used to learn the policy parameter, but it is not required for action selection. Methods that learn approximations to both policy and value functions are often called actor-critic methods, where 'actor' refers to the learned policy and 'critic' refers to the learned value function, usually a state-value function.
1. Policy Gradient Recap
In the previous policy gradient overview post, we saw how basic policy-based methods such as REINFORCE and Vanilla Policy Gradient work, so it is natural to extend this family of algorithms to the actor-critic setting. We start with a short recap.
To maximize the performance measure $J(\theta)$ with respect to the policy parameter $\theta$, we can use the stochastic gradient-ascent approach

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$$

The REINFORCE update is

$$\theta_{t+1} = \theta_t + \alpha\, G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} = \theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t \mid S_t, \theta_t)$$

For the general discounted case:

$$\theta_{t+1} = \theta_t + \alpha\, \gamma^t\, G_t\, \nabla \ln \pi(A_t \mid S_t, \theta_t)$$
The corresponding pseudocode for REINFORCE is as follows:
As a Monte Carlo method, REINFORCE can suffer from high variance and therefore slow learning. This can be improved by generalizing the policy gradient theorem to include a comparison of the action value with an arbitrary baseline $b(s)$:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \big(q_\pi(s,a) - b(s)\big)\, \nabla \pi(a \mid s, \theta)$$

The baseline can be any function, even a random variable, as long as it does not vary with the action $a$. The update rule for REINFORCE with a general baseline is then:

$$\theta_{t+1} = \theta_t + \alpha\, \big(G_t - b(S_t)\big)\, \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$$

Introducing a baseline leaves the update unbiased but can significantly reduce its variance. One natural choice for the baseline is an estimate of the state value, $\hat{v}(S_t, \mathbf{w})$, where $\mathbf{w}$ is a weight vector learned with the usual value-function approximation methods.
The pseudocode for REINFORCE with baseline, using such a learned state-value function as the baseline:

Tip: the REINFORCE-with-baseline method learns both a policy and a state-value function.
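To make the recap concrete, here is a minimal PyTorch-style sketch of one REINFORCE-with-baseline update for a finished episode. The network handles (`policy_opt`, `value_opt`) and the episode buffers are assumptions made for illustration, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def reinforce_with_baseline_update(policy_opt, value_opt,
                                   log_probs, state_values, rewards, gamma=0.99):
    """One REINFORCE-with-baseline update from a single finished episode.

    log_probs    : list of log pi(A_t|S_t) tensors collected during the episode
    state_values : list of v_hat(S_t, w) tensors from the baseline network
    rewards      : list of scalar rewards R_{t+1}
    """
    # Discounted returns G_t for every time step (Monte Carlo, no bootstrapping).
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    log_probs = torch.stack(log_probs)
    values = torch.cat(state_values).squeeze(-1)

    # delta_t = G_t - v_hat(S_t, w): the baseline shifts the target but not the expectation.
    delta = returns - values.detach()

    policy_loss = -(log_probs * delta).sum()   # gradient ascent on J(theta)
    value_loss = F.mse_loss(values, returns)   # fit the baseline toward the returns

    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()
```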
2. Actor-Critic Methods
(Reference from Rich Sutton) Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states), but only as a baseline for the state whose estimate is being updated.
REINFORCE with baseline is still a Monte Carlo method: it is unbiased but has high variance, and it must wait until the end of the episode to compute the return. Temporal-difference methods can alleviate these problems to some extent. We take the one-step Actor-Critic and Q Actor-Critic methods as examples.
2.1 One-step Actor-Critic
The one-step actor-critic method replaces the full return of REINFORCE with the one-step return (and uses a learned state-value function as the baseline) as follows:

$$\theta_{t+1} = \theta_t + \alpha\, \big(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)\, \nabla \ln \pi(A_t \mid S_t, \theta_t)$$
The corresponding pseudocode:
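In code, a single step of this update might look like the following minimal PyTorch-style sketch; the function and variable names are illustrative assumptions, and the TD error drives both the critic and the actor.

```python
import torch

def one_step_actor_critic_update(critic, actor_opt, critic_opt,
                                 s, a_log_prob, r, s_next, done, gamma=0.99, I=1.0):
    """One transition (S, A, R, S') -> one actor update and one critic update."""
    v_s = critic(s)                                        # v_hat(S, w)
    with torch.no_grad():
        v_next = torch.zeros(1) if done else critic(s_next)

    # One-step TD error replaces the Monte Carlo return G_t used by REINFORCE.
    delta = r + gamma * v_next - v_s

    critic_loss = delta.pow(2).mean()                      # move v_hat(S, w) toward R + gamma * v_hat(S', w)
    actor_loss = -(I * delta.detach().squeeze() * a_log_prob)  # theta <- theta + alpha * I * delta * grad log pi(A|S)

    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    return gamma * I                                       # discount accumulator for the next step
```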
2.2 Q Actor-Critic
Consider the gradient of the performance measure with respect to the policy parameter $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

Decompose the expectation into an outer expectation over states and actions and an inner expectation over the return (this is the other common form of the actor-critic derivation, rather than the one based on causality and temporal structure in the Berkeley CS294 lecture):

$$\nabla_\theta J(\theta) = \mathbb{E}_{s_t, a_t}\!\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\; \mathbb{E}\big[G_t \mid s_t, a_t\big]\Big]$$

By definition, the second (inner) expectation term above is just the Q value:

$$Q(s_t, a_t) = \mathbb{E}\big[G_t \mid s_t, a_t\big]$$

Plugging that in, we can rewrite the update equation as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s_t, a_t}\!\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q(s_t, a_t)\big]$$

Parameterizing this Q value with a neural network gives the Q Actor-Critic algorithm as follows.
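As a hedged illustration of what such a parameterization could look like, the sketch below uses a small network for $Q_w(s, a)$ over a discrete action space and a SARSA-style TD target for the critic; the class, function names, and target choice are assumptions, not a fixed part of the algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QCritic(nn.Module):
    """Q_w(s, .) for a discrete action space: one output per action."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def q_actor_critic_losses(q_net, log_prob, s, a, r, s_next, a_next, done, gamma=0.99):
    """Return (actor_loss, critic_loss) for a single transition."""
    q_sa = q_net(s)[a]
    with torch.no_grad():
        q_next = 0.0 if done else q_net(s_next)[a_next]
        td_target = torch.as_tensor(r + gamma * q_next, dtype=torch.float32)

    critic_loss = F.mse_loss(q_sa, td_target)
    # Policy gradient with Q_w(s, a) in place of the sampled return G_t.
    actor_loss = -log_prob * q_sa.detach()
    return actor_loss, critic_loss
```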
Using a similar derivation as Q Actor-Critic with different policy gradient forms, we can get a series of algorithms as below:
Quick Reference: Actor-Critic Core

Estimate the gradient by sampling multiple trajectories:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}^\pi(s_{i,t}, a_{i,t})$$

Considering a fitted state-value function $\hat{V}^\pi_\phi(s)$ as the critic, calculate the advantage as

$$\hat{A}^\pi(s_{i,t}, a_{i,t}) = r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1}) - \hat{V}^\pi_\phi(s_{i,t})$$
2.3 Code Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
import torch.optim as optim
import gym


class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        self.affine = nn.Linear(8, 128)
        self.action_layer = nn.Linear(128, 4)
        self.value_layer = nn.Linear(128, 1)

        self.log_probs = []
        self.state_values = []
        self.rewards = []

    def forward(self, state):
        state = torch.from_numpy(state).float()
        state = F.relu(self.affine(state))
        state_value = self.value_layer(state)

        action_probs = F.softmax(self.action_layer(state))
        action_distributions = Categorical(action_probs)
        action = action_distributions.sample()

        self.log_probs.append(action_distributions.log_prob(action))
        self.state_values.append(state_value)
        return action.item()

    def calculate_loss(self, gamma=0.99):
        rewards = []
        dis_reward = 0
        for reward in self.rewards[::-1]:
            dis_reward = reward + gamma * dis_reward
            rewards.insert(0, dis_reward)

        rewards = torch.tensor(rewards)
        rewards = (rewards - rewards.mean()) / (rewards.std())

        loss = 0
        for log_prob, value, reward in zip(self.log_probs, self.state_values, rewards):
            advantage = reward - value.item()
            action_loss = -log_prob * advantage
            value_loss = F.smooth_l1_loss(value, reward)
            loss += (action_loss + value_loss)
        return loss

    def clear_memory(self):
        del self.log_probs[:]
        del self.state_values[:]
        del self.rewards[:]


def train():
    render = False
    gamma = 0.99
    lr = 0.02
    betas = (0.9, 0.999)
    random_seed = 543

    torch.manual_seed(random_seed)
    env = gym.make('LunarLander-v2')
    env.seed(random_seed)

    policy = ActorCritic()
    optimizer = optim.Adam(policy.parameters(), lr=lr, betas=betas)

    running_reward = 0
    for i_episode in range(0, 10000):
        state = env.reset()
        for t in range(10000):
            action = policy(state)
            state, reward, done, _ = env.step(action)
            policy.rewards.append(reward)
            running_reward += reward
            if render and i_episode > 1000:
                env.render()
            if done:
                break

        optimizer.zero_grad()
        loss = policy.calculate_loss(gamma)
        loss.backward()
        optimizer.step()
        policy.clear_memory()

        if i_episode % 5 == 0:
            running_reward = running_reward / 5
            print('Episode {}\tlength: {}\treward: {}'.format(i_episode, t, running_reward))
            running_reward = 0


if __name__ == '__main__':
    train()
```
Result:
3. Advantage Actor-Critic
The state-value function is a natural choice for the baseline function. Using $V(s)$ as the baseline, we subtract the state value from the action value and obtain the advantage:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

It would be inefficient to build two separate neural networks for both the $Q$ value and the $V$ value. Instead, we can use the relationship between $Q$ and $V$ given by the Bellman equation:

$$Q(s_t, a_t) = \mathbb{E}\big[r_{t+1} + \gamma V(s_{t+1})\big]$$

Rewrite the advantage as:

$$A(s_t, a_t) \approx r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

Using this estimate loses one time step of accuracy, but it is more efficient because only one neural network (for the $V$ function) is needed. The update equation can then be written as:

$$\nabla_\theta J(\theta) \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)\big)\big]$$

This is the gradient used in the Advantage Actor-Critic update rule.
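A minimal sketch of this estimator in PyTorch, assuming a single critic network `v_net` that approximates $V(s)$ and batched rollout tensors (the names are illustrative, not from a specific library):

```python
import torch

def advantage_actor_critic_losses(v_net, log_probs, states, rewards, next_states, dones, gamma=0.99):
    """Batched A2C-style losses using the TD-error advantage
    A(s_t, a_t) ~= r_{t+1} + gamma * V(s_{t+1}) - V(s_t)."""
    values = v_net(states).squeeze(-1)                       # V(s_t)
    with torch.no_grad():
        next_values = v_net(next_states).squeeze(-1)         # V(s_{t+1})
        td_targets = rewards + gamma * next_values * (1.0 - dones)

    advantages = td_targets - values                         # one network supplies both terms
    actor_loss = -(log_probs * advantages.detach()).mean()   # do not backprop the actor loss through V
    critic_loss = advantages.pow(2).mean()                   # regress V(s_t) toward the TD target
    return actor_loss, critic_loss
```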
There are two main variants of the Advantage Actor-Critic: the (synchronous) Advantage Actor-Critic (A2C) and the Asynchronous Advantage Actor-Critic (A3C). Because of the asynchronous nature of A3C, some workers (copies of the agent) play with an older version of the parameters, so the aggregated update is not optimal. In A2C, we instead update the global network synchronously: we wait until all workers have finished their segment of experience and computed their gradients, average them, and then update the global parameters. Afterwards, a new segment of experience is started with all parallel actors sharing the same new parameters.
Pseudocode
The pseudocode for Advantage Actor-Critic is as follows:
Implementation 1
```python
import numpy as np
import torch
import gym
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable

# hyperparameters
hidden_size = 256
learning_rate = 3e-4
gamma = 0.99
num_steps = 300
max_episodes = 3000


class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_actions, hidden_size):
        super(ActorCritic, self).__init__()
        self.num_actions = num_actions
        self.critic_linear1 = nn.Linear(num_inputs, hidden_size)
        self.critic_linear2 = nn.Linear(hidden_size, 1)
        self.actor_linear1 = nn.Linear(num_inputs, hidden_size)
        self.actor_linear2 = nn.Linear(hidden_size, num_actions)

    def forward(self, state):
        state = Variable(torch.from_numpy(state).float().unsqueeze(0))
        value = F.relu(self.critic_linear1(state))
        value = self.critic_linear2(value)

        policy_dist = F.relu(self.actor_linear1(state))
        policy_dist = F.softmax(self.actor_linear2(policy_dist), dim=1)
        return value, policy_dist


def a2c(env):
    num_inputs = env.observation_space.shape[0]
    num_outputs = env.action_space.n

    actor_critic = ActorCritic(num_inputs, num_outputs, hidden_size)
    ac_optimizer = optim.Adam(actor_critic.parameters(), lr=learning_rate)

    all_lengths = []
    average_lengths = []
    all_rewards = []
    entropy_term = 0

    for episode in range(max_episodes):
        log_probs = []
        values = []
        rewards = []

        state = env.reset()
        for steps in range(num_steps):
            value, policy_dist = actor_critic.forward(state)
            value = value.detach().numpy()[0, 0]
            dist = policy_dist.detach().numpy()

            action = np.random.choice(num_outputs, p=np.squeeze(dist))
            log_prob = torch.log(policy_dist.squeeze(0)[action])
            entropy = -np.sum(np.mean(dist) * np.log(dist))
            new_state, reward, done, _ = env.step(action)

            rewards.append(reward)
            values.append(value)
            log_probs.append(log_prob)
            entropy_term += entropy
            state = new_state

            if done or steps == num_steps - 1:
                Qval, _ = actor_critic.forward(new_state)
                Qval = Qval.detach().numpy()[0, 0]
                all_rewards.append(np.sum(rewards))
                all_lengths.append(steps)
                average_lengths.append(np.mean(all_lengths[-10:]))
                if episode % 10 == 0:
                    print("Episode: {}, reward: {}, total length: {}, average length: {}".format(
                        episode, np.sum(rewards), steps, average_lengths[-1]))
                break

        # compute Q values
        QVals = np.zeros_like(values)
        for t in reversed(range(len(rewards))):
            Qval = rewards[t] + gamma * Qval
            QVals[t] = Qval

        # update actor critic
        values = torch.FloatTensor(values)
        QVals = torch.FloatTensor(QVals)
        log_probs = torch.stack(log_probs)

        advantage = QVals - values
        actor_loss = (-log_probs * advantage).mean()
        critic_loss = 0.5 * advantage.pow(2).mean()
        ac_loss = actor_loss + critic_loss + 0.001 * entropy_term

        ac_optimizer.zero_grad()
        ac_loss.backward()
        ac_optimizer.step()


if __name__ == '__main__':
    env = gym.make('CartPole-v0')
    a2c(env)
```
Result
Implementation 2
```python
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import torch.multiprocessing as mp
import time
import numpy as np

n_train_processes = 3
learning_rate = 0.0002
update_interval = 5
gamma = 0.98
max_train_steps = 60000
print_interval = update_interval * 100


class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(4, 256)
        self.fc_pi = nn.Linear(256, 2)
        self.fc_v = nn.Linear(256, 1)

    def pi(self, x, softmax_dim=1):
        x = F.relu(self.fc1(x))
        x = self.fc_pi(x)
        prob = F.softmax(x, dim=softmax_dim)
        return prob

    def v(self, x):
        x = F.relu(self.fc1(x))
        v = self.fc_v(x)
        return v


def worker(worker_id, master_end, worker_end):
    master_end.close()  # Forbid worker to use the master end for messaging
    env = gym.make('CartPole-v1')
    env.seed(worker_id)

    while True:
        cmd, data = worker_end.recv()
        if cmd == 'step':
            ob, reward, done, info = env.step(data)
            if done:
                ob = env.reset()
            worker_end.send((ob, reward, done, info))
        elif cmd == 'reset':
            ob = env.reset()
            worker_end.send(ob)
        elif cmd == 'reset_task':
            ob = env.reset_task()
            worker_end.send(ob)
        elif cmd == 'close':
            worker_end.close()
            break
        elif cmd == 'get_spaces':
            worker_end.send((env.observation_space, env.action_space))
        else:
            raise NotImplementedError


class ParallelEnv(object):
    def __init__(self, n_train_processes):
        self.envs_num = n_train_processes
        self.waiting = False
        self.closed = False
        self.workers = list()

        master_ends, worker_ends = zip(*[mp.Pipe() for _ in range(self.envs_num)])
        self.master_ends, self.worker_ends = master_ends, worker_ends

        for worker_id, (master_end, worker_end) in enumerate(zip(master_ends, worker_ends)):
            p = mp.Process(target=worker, args=(worker_id, master_end, worker_end))
            p.daemon = True
            p.start()
            self.workers.append(p)

        # Forbid master to use the worker end for messaging
        for worker_end in worker_ends:
            worker_end.close()

    def step_async(self, actions):
        for master_end, action in zip(self.master_ends, actions):
            master_end.send(('step', action))
        self.waiting = True

    def step_wait(self):
        results = [master_end.recv() for master_end in self.master_ends]
        self.waiting = False
        obs, rews, dones, infos = zip(*results)
        return np.stack(obs), np.stack(rews), np.stack(dones), infos

    def reset(self):
        for master_end in self.master_ends:
            master_end.send(('reset', None))
        return np.stack([master_end.recv() for master_end in self.master_ends])

    def step(self, actions):
        self.step_async(actions)
        return self.step_wait()

    def close(self):  # For clean up resources
        if self.closed:
            return
        if self.waiting:
            [master_end.recv() for master_end in self.master_ends]
        for master_end in self.master_ends:
            master_end.send(('close', None))
        for worker in self.workers:
            worker.join()
        self.closed = True


def test(step_idx, model):
    env = gym.make('CartPole-v1')
    score = 0.0
    done = False
    num_test = 10

    for _ in range(num_test):
        s = env.reset()
        while not done:
            prob = model.pi(torch.from_numpy(s).float(), softmax_dim=0)
            a = Categorical(prob).sample().numpy()
            s_prime, r, done, info = env.step(a)
            s = s_prime
            score += r
        done = False
    print(f"Step # :{step_idx}, avg score : {score/num_test:.1f}")

    env.close()


def compute_target(v_final, r_lst, mask_lst):
    G = v_final.reshape(-1)
    td_target = list()

    for r, mask in zip(r_lst[::-1], mask_lst[::-1]):
        G = r + gamma * G * mask
        td_target.append(G)

    return torch.tensor(td_target[::-1]).float()


if __name__ == '__main__':
    envs = ParallelEnv(n_train_processes)

    model = ActorCritic()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    step_idx = 0
    s = envs.reset()
    while step_idx < max_train_steps:
        s_lst, a_lst, r_lst, mask_lst = list(), list(), list(), list()
        for _ in range(update_interval):
            prob = model.pi(torch.from_numpy(s).float())
            a = Categorical(prob).sample().numpy()
            s_prime, r, done, info = envs.step(a)

            s_lst.append(s)
            a_lst.append(a)
            r_lst.append(r/100.0)
            mask_lst.append(1 - done)

            s = s_prime
            step_idx += 1

        s_final = torch.from_numpy(s_prime).float()
        v_final = model.v(s_final).detach().clone().numpy()
        td_target = compute_target(v_final, r_lst, mask_lst)

        td_target_vec = td_target.reshape(-1)
        s_vec = torch.tensor(s_lst).float().reshape(-1, 4)  # 4 == Dimension of state
        a_vec = torch.tensor(a_lst).reshape(-1).unsqueeze(1)
        advantage = td_target_vec - model.v(s_vec).reshape(-1)

        pi = model.pi(s_vec, softmax_dim=1)
        pi_a = pi.gather(1, a_vec).reshape(-1)
        loss = -(torch.log(pi_a) * advantage.detach()).mean() + \
            F.smooth_l1_loss(model.v(s_vec).reshape(-1), td_target_vec)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step_idx % print_interval == 0:
            test(step_idx, model)

    envs.close()
```
Result
Differences between Actor-critic and Advantage Actor-critic
Actor-Critic is not a single algorithm; it should be viewed as a family of related techniques. These techniques are based on the policy gradient theorem and train some form of value estimate to plug into the update rule as a lower-variance replacement for the return computed at the end of an episode. They all perform "bootstrapping" by using some kind of value prediction.

Advantage Actor-Critic specifically uses an estimate of the advantage function for its bootstrapping, whereas "actor-critic" without the "advantage" qualifier is not specific: the critic could be a trained action-value function, some other estimate of the return, or a variety of other things.
4. Asynchronous Advantage Actor Critic
A3C runs asynchronously on a single machine with a standard multi-core CPU, rather than following the traditional approach of training on specialized hardware such as GPUs or on massively distributed architectures.
(Selected from the A3C paper) The aim of A3C is to train deep neural network policies reliably and without large resource requirements. First, it runs asynchronous actor-learners as multiple CPU threads on a single machine. Second, these actor-learners run in parallel and explore different parts of the environment with different exploration policies to maximize diversity. Because the parallel actor-learners apply online updates under different exploration policies, the overall changes made to the parameters are likely to be less correlated in time than the updates of a single agent. Hence, A3C does not use a replay memory and instead relies on parallel actors with different exploration policies to perform the stabilizing role that experience replay plays in DQN training (this is why A3C needs no replay buffer). (Tip: replay-buffer drawbacks include more memory and computation per real interaction, and the need for off-policy learning algorithms that can update from data generated by an older policy.) Third, the benefits are a reduction in training time that is roughly linear in the number of parallel actor-learners, and the ability to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way, without relying on experience replay to stabilize learning.
A3C maintains a policy $\pi(a_t \mid s_t; \theta)$ and an estimate of the value function $V(s_t; \theta_v)$. It operates in the forward view (as opposed to the more common backward view used by techniques such as eligibility traces in Sutton's book) and uses the same mix of $n$-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{\max}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')\, A(s_t, a_t; \theta, \theta_v)$, where $A(s_t, a_t; \theta, \theta_v)$ is an estimate of the advantage function given by $\sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v)$, and $k$ can vary from state to state and is upper-bounded by $t_{\max}$. The pseudocode for the algorithm is as follows:
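As a rough illustration (not the paper's pseudocode), the $k$-step targets used in this advantage estimate can be computed backwards from the bootstrap value; the function and variable names below are assumptions for illustration.

```python
def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """Compute R_t = r_t + gamma * r_{t+1} + ... + gamma^{k-1} * r_{t+k-1} + gamma^k * V(s_{t+k})
    for every step of a rollout, working backwards from the bootstrap value."""
    R = bootstrap_value                  # V(s_{t_max}; theta_v), or 0.0 at a terminal state
    targets = []
    for r in reversed(rewards):          # rewards collected over at most t_max steps
        R = r + gamma * R
        targets.append(R)
    targets.reverse()
    return targets                       # advantage at step t is targets[t] - V(s_t; theta_v)
```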
Adding the entropy of the policy $\pi$ to the objective function improves exploration by discouraging premature convergence to suboptimal deterministic policies. The gradient of the full objective function, including the entropy regularization term, with respect to the policy parameters takes the form

$$\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')\,\big(R_t - V(s_t; \theta_v)\big) + \beta\, \nabla_{\theta'} H\big(\pi(s_t; \theta')\big)$$

where $H$ is the entropy and the hyperparameter $\beta$ controls the strength of the entropy regularization term.
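With a `Categorical` policy head in PyTorch, the entropy bonus can be folded into the actor loss roughly as below; the function name, the tensor shapes, and the choice `beta=0.01` are assumptions for illustration.

```python
import torch
from torch.distributions import Categorical

def policy_loss_with_entropy(logits, actions, advantages, beta=0.01):
    """Policy-gradient loss with an entropy bonus.

    logits     : [T, n_actions] raw policy outputs for a rollout
    actions    : [T] sampled actions
    advantages : [T] advantage estimates R_t - V(s_t; theta_v), treated as constants
    """
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    entropy = dist.entropy().mean()                     # H(pi(s_t; theta)), averaged over the rollout

    # Minimizing this loss is gradient ascent on log pi * A + beta * H.
    return -(log_probs * advantages.detach()).mean() - beta * entropy
```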
Implementation
```python
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import torch.multiprocessing as mp
import time

n_train_processes = 3
learning_rate = 0.0002
update_interval = 5
gamma = 0.98
max_train_ep = 300
max_test_ep = 400


class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(4, 256)
        self.fc_pi = nn.Linear(256, 2)
        self.fc_v = nn.Linear(256, 1)

    def pi(self, x, softmax_dim=0):
        x = F.relu(self.fc1(x))
        x = self.fc_pi(x)
        prob = F.softmax(x, dim=softmax_dim)
        return prob

    def v(self, x):
        x = F.relu(self.fc1(x))
        v = self.fc_v(x)
        return v


def train(global_model, rank):
    local_model = ActorCritic()
    local_model.load_state_dict(global_model.state_dict())

    optimizer = optim.Adam(global_model.parameters(), lr=learning_rate)

    env = gym.make('CartPole-v1')

    for n_epi in range(max_train_ep):
        done = False
        s = env.reset()
        while not done:
            s_lst, a_lst, r_lst = [], [], []
            for t in range(update_interval):
                prob = local_model.pi(torch.from_numpy(s).float())
                m = Categorical(prob)
                a = m.sample().item()
                s_prime, r, done, info = env.step(a)

                s_lst.append(s)
                a_lst.append([a])
                r_lst.append(r/100.0)

                s = s_prime
                if done:
                    break

            s_final = torch.tensor(s_prime, dtype=torch.float)
            R = 0.0 if done else local_model.v(s_final).item()
            td_target_lst = []
            for reward in r_lst[::-1]:
                R = gamma * R + reward
                td_target_lst.append([R])
            td_target_lst.reverse()

            s_batch, a_batch, td_target = torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
                torch.tensor(td_target_lst)
            advantage = td_target - local_model.v(s_batch)

            pi = local_model.pi(s_batch, softmax_dim=1)
            pi_a = pi.gather(1, a_batch)
            loss = -torch.log(pi_a) * advantage.detach() + \
                F.smooth_l1_loss(local_model.v(s_batch), td_target.detach())

            optimizer.zero_grad()
            loss.mean().backward()
            for global_param, local_param in zip(global_model.parameters(), local_model.parameters()):
                global_param._grad = local_param.grad
            optimizer.step()
            local_model.load_state_dict(global_model.state_dict())

    env.close()
    print("Training process {} reached maximum episode.".format(rank))


def test(global_model):
    env = gym.make('CartPole-v1')
    score = 0.0
    print_interval = 20

    for n_epi in range(max_test_ep):
        done = False
        s = env.reset()
        while not done:
            prob = global_model.pi(torch.from_numpy(s).float())
            a = Categorical(prob).sample().item()
            s_prime, r, done, info = env.step(a)
            s = s_prime
            score += r

        if n_epi % print_interval == 0 and n_epi != 0:
            print("# of episode:{}, avg score: {:.1f}".format(n_epi, score/print_interval))
            score = 0.0
            time.sleep(1)
    env.close()


if __name__ == '__main__':
    global_model = ActorCritic()
    global_model.share_memory()

    processes = []
    for rank in range(n_train_processes + 1):  # + 1 for test process
        if rank == 0:
            p = mp.Process(target=test, args=(global_model,))
        else:
            p = mp.Process(target=train, args=(global_model, rank,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```
Result