
Deep Reinforcement Learning with Keras: Policy Network and DQN Implementations

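Reinforcement learning has two important families of methods: Policy Gradients and Q-learning. Policy Gradients methods directly predict which action to take in a given state, while Q-learning methods predict the expected value (the Q-value) of every action in that state. In general, Q-learning is only suitable when the action space has a small number of discrete values, whereas Policy Gradients methods also suit continuous action spaces. Combined with deep learning, the two algorithms become the Policy Network and the DQN (Deep Q-learning Network).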

Environment

  • Python 3.6
  • Tensorflow-gpu 1.8.0
  • Keras 2.2.2
  • Gym 0.10.8

Gym

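Gym is a toolkit released by OpenAI for developing and comparing reinforcement learning algorithms. With it, an AI agent can be trained on many tasks, such as walking, running, and playing a variety of games. This demo uses the small Cart-Pole game.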

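The rules are simple: a pole stands upright on a cart, and the cart has to move left and right to keep the pole vertical. The game ends if the pole tilts too far from vertical (15° according to the environment description; the CartPole-v0 source code uses a 12° threshold). The cart must also stay within a fixed range (2.4 units from the center on either side).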

Cart-Pole:

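The Cart-Pole world consists of a cart that moves along a horizontal axis and a pole attached to the cart. At every time step you can observe the cart's position (x) and velocity (x_dot) and the pole's angle (theta) and angular velocity (theta_dot). These make up the observable state of the world. In any state the cart has only two possible actions: move left or move right. In other words, Cart-Pole's state space has four continuous dimensions, and its action space has one dimension with two discrete values.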
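The state and the end-of-game rule described above can be written down in a few lines of plain Python. This is only a sketch, not Gym's implementation: `CartPoleState`, `is_terminal` and the `angle_limit_deg` parameter are names made up for illustration, and the angle limit is passed in explicitly rather than assumed. The mapping 0 = left, 1 = right follows Gym's CartPole convention.

```python
import math
from typing import NamedTuple


class CartPoleState(NamedTuple):
    """The four continuous values observed at every time step."""
    x: float          # cart position
    x_dot: float      # cart velocity
    theta: float      # pole angle in radians (0 means upright)
    theta_dot: float  # pole angular velocity


# The action space: one dimension with two discrete values.
LEFT, RIGHT = 0, 1


def is_terminal(state, angle_limit_deg, x_limit=2.4):
    """Return True if the pole has tilted too far or the cart left the track."""
    too_tilted = abs(state.theta) > math.radians(angle_limit_deg)
    out_of_range = abs(state.x) > x_limit
    return too_tilted or out_of_range


upright = CartPoleState(x=0.0, x_dot=0.0, theta=0.0, theta_dot=0.0)
print(is_terminal(upright, angle_limit_deg=15))  # False
```

Gym additionally ends a CartPole-v0 episode after 200 steps, which this sketch leaves out.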

First, install gym:

pip install gym

A first try with gym:

# -*- coding: utf-8 -*-

import gym
import numpy as np


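def try_gym():
    # Create a CartPole environment with gym.
    # The environment takes an action and returns the observation after the
    # action, the reward, and whether the game has ended.
    env = gym.make('CartPole-v0')
    # Reset the environment
    env.reset()

    # Number of games played
    random_episodes = 0
    # Total reward of the current game
    reward_sum = 0
    count = 0
    while random_episodes < 10:
        # Render the game
        env.render()
        # Pick a random action, i.e. move left or move right,
        # then receive the feedback from executing it
        observation, reward, done, _ = env.step(np.random.randint(0, 2))
        reward_sum += reward
        count += 1
        # When the game ends, print the total reward and reset the game
        if done:
            random_episodes += 1
            print("Reward for this episode was: {}, turns was: {}".format(reward_sum, count))
            reward_sum = 0
            count = 0
            env.reset()

    # Close the render window
    env.close()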


if __name__ == '__main__':
    try_gym()

The script prints the total reward and the number of steps of each game from start to finish. The output looks like this:

Reward for this episode was: 20.0, turns was: 20
Reward for this episode was: 26.0, turns was: 26
Reward for this episode was: 18.0, turns was: 18
Reward for this episode was: 25.0, turns was: 25
Reward for this episode was: 25.0, turns was: 25
Reward for this episode was: 23.0, turns was: 23
Reward for this episode was: 29.0, turns was: 29
Reward for this episode was: 17.0, turns was: 17
Reward for this episode was: 13.0, turns was: 13
Reward for this episode was: 27.0, turns was: 27

If you are using Anaconda 3, the following error may occur:

    raise NotImplementedError('abstract')

NotImplementedError: abstract

This is caused by pyglet; switch to version 1.2.4:

pip uninstall pyglet
pip install pyglet==1.2.4

Policy Network

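The Policy Gradient method proposed by R. Sutton in 2000 is a classic way to learn continuous control policies in RL. It represents the policy at each step with a probability distribution π_θ(s_t | θ^π) and samples from that distribution at each step to obtain the action to take: a_t ∼ π_θ(s_t | θ^π). Generating an action is therefore inherently a random process, and the policy that is eventually learned is also a stochastic policy.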
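For the two-action case used in this post, sampling from a stochastic policy reduces to a biased coin flip. The sketch below assumes the policy outputs a single number, the probability of action 1; the helper name `sample_action` and the value 0.8 are hypothetical.

```python
import random


def sample_action(prob_right, rng=random):
    """Sample a binary action from a Bernoulli policy.

    prob_right is the policy's probability of choosing action 1;
    action 0 is chosen with probability 1 - prob_right.
    """
    return 1 if rng.random() < prob_right else 0


# Hypothetical policy output for one state.
rng = random.Random(0)
actions = [sample_action(0.8, rng) for _ in range(10000)]
print(sum(actions) / len(actions))  # close to 0.8, but not exactly 0.8
```

Even when one action is clearly preferred, the other is still tried occasionally, which is how a stochastic policy explores.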

The Policy Network is a typical Monte Carlo method: it learns from the discounted rewards once an episode has finished. The procedure is as follows:

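(1) Build a neural network whose input is the observation and whose output is the probability of action=1.
(2) At the end of an episode (the game is won or lost), reset the env so that the observation returns to its initial state. On each pass through the loop, feed in the observation and get a probability p0 as output. Choose an action according to p0, pass it to the environment, and receive a new observation and a reward. Record [observation, action, reward] as training data.
(3) The reward is a number greater than 0. Collect the rewards obtained by the actions above over the whole episode into a sequence, then compute discount_reward.
(4) Once a batch of episodes has been collected, perform a gradient descent update. The loss has two parts: first compute the cross-entropy loss of the action with binary_crossentropy, then multiply it by discount_reward to get the final loss.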
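Steps (3) and (4) can be checked by hand on a tiny example. The following is a plain-Python sketch under simple assumptions (one episode, population standard deviation as in NumPy's default); the helper names `discounted_returns`, `normalize` and `weighted_bce` are made up for illustration and are not part of Keras.

```python
import math


def discounted_returns(rewards, gamma):
    """Accumulate rewards backwards: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    cumulative = 0.0
    for t in reversed(range(len(rewards))):
        cumulative = cumulative * gamma + rewards[t]
        returns[t] = cumulative
    return returns


def normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def weighted_bce(action, prob, weight, eps=1e-7):
    """Binary cross-entropy of one (action, prob) pair, scaled by a weight."""
    prob = min(max(prob, eps), 1.0 - eps)
    bce = -(action * math.log(prob) + (1 - action) * math.log(1.0 - prob))
    return bce * weight


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.95)
print(returns)  # approximately [2.8525, 1.95, 1.0]

weights = normalize(returns)
losses = [weighted_bce(a, 0.6, w) for a, w in zip([1, 0, 1], weights)]
print(losses)
```

After normalization, early steps of a long episode get positive weights and late steps get negative ones, so minimizing the weighted loss raises the probability of actions followed by above-average returns and lowers it for the rest.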

The Policy Network implemented with Keras is shown below:

# -*- coding: utf-8 -*-
import os
import gym
import numpy as np

from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K


class PG:
    def __init__(self):
        self.model = self.build_model()
        if os.path.exists('pg.h5'):
            self.model.load_weights('pg.h5')

        self.env = gym.make('CartPole-v0')
        self.gamma = 0.95

    def build_model(self):
        """Basic network architecture."""
        inputs = Input(shape=(4,), name='ob_input')
        x = Dense(16, activation='relu')(inputs)
        x = Dense(16, activation='relu')(x)
        x = Dense(1, activation='sigmoid')(x)

        model = Model(inputs=inputs, outputs=x)

        return model

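    def loss(self, y_true, y_pred):
        """Loss function.
        Arguments:
            y_true: (action, discounted reward) pairs
            y_pred: predicted probability of action 1

        Returns:
            loss: cross-entropy weighted by the discounted reward
        """
        action_pred = y_pred
        action_true, discount_episode_reward = y_true[:, 0], y_true[:, 1]
        # Binary cross-entropy between the action taken and the predicted probability
        action_true = K.reshape(action_true, (-1, 1))
        loss = K.binary_crossentropy(action_true, action_pred)
        # Multiply by discount_reward; reshape to (-1, 1) so the product is
        # element-wise instead of broadcasting to a (batch, batch) matrix
        loss = loss * K.reshape(discount_episode_reward, (-1, 1))

        return loss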

    def discount_reward(self, rewards):
        """Discount reward
        Arguments:
            rewards: the rewards of one episode
        """
        # Accumulate the discounted reward of one episode, walking backwards in time
        discount_rewards = np.zeros_like(rewards, dtype=np.float32)
        cumulative = 0.
        for i in reversed(range(len(rewards))):
            cumulative = cumulative * self.gamma + rewards[i]
            discount_rewards[i] = cumulative

        # Normalization, which helps control the variance of the gradient
        discount_rewards -= np.mean(discount_rewards)
        # True division (not floor division) keeps the fractional part
        discount_rewards /= np.std(discount_rewards)

        return list(discount_rewards)

    def train(self, episode, batch):
        """Train the model.
        Arguments:
            episode: number of games to play
            batch: number of episodes per batch; gradients are updated once per batch

        Returns:
            history: training record
        """
        self.model.compile(loss=self.loss, optimizer=Adam(lr=0.01))

        history = {'episode': [], 'Batch_reward': [], 'Episode_reward': [], 'Loss': []}

        episode_reward = 0
        states = []
        actions = []
        rewards = []
        discount_rewards = []

        for i in range(episode):
            observation = self.env.reset()
            erewards = []

            while True:
                x = observation.reshape(-1, 4)
                prob = self.model.predict(x)[0][0]
                # Sample the action according to the predicted probability
                action = np.random.choice(np.array(range(2)), size=1, p=[1 - prob, prob])[0]
                observation, reward, done, _ = self.env.step(action)
                # Record the data produced during the episode
                states.append(x[0])
                actions.append(action)
                erewards.append(reward)
                rewards.append(reward)

                if done:
                    # Compute the discounted rewards once the episode has ended
                    discount_rewards.extend(self.discount_reward(erewards))
                    break
            # After collecting a batch of episodes, use their data to update the model
            if i != 0 and i % batch == 0:
                batch_reward = sum(rewards)
                episode_reward = batch_reward / batch
                # X is the states; y holds the actions and discount_rewards,
                # which are combined with the predicted prob to compute the loss
                X = np.array(states)
                y = np.array(list(zip(actions, discount_rewards)))

                loss = self.model.train_on_batch(X, y)
    
                history['episode'].append(i)
                history['Batch_reward'].append(batch_reward)
                history['Episode_reward'].append(episode_reward)
                history['Loss'].append(loss)

                print('Episode: {} | Batch reward: {} | Episode reward: {} | loss: {:.3f}'.format(i, batch_reward, episode_reward, loss))

                episode_reward = 0
                states = []
                actions = []
                rewards = []
                discount_rewards = []

        # Save under the same file name that __init__ tries to load
        self.model.save_weights('pg.h5')

        return history

    def play(self):
        """Play the game with the trained model."""
        observation = self.env.reset()

        count = 0
        reward_sum = 0
        random_episodes = 0

        while random_episodes < 10:
            self.env.render()

            x = observation.reshape(-1, 4)
            prob = self.model.predict(x)[0][0]
            action = 1 if prob > 0.5 else 0
            observation, reward, done, _ = self.env.step(action)

            count += 1
            reward_sum += reward

            if done:
                print("Reward for this episode was: {}, turns was: {}".format(reward_sum, count))
                random_episodes += 1
                reward_sum = 0
                count = 0
                observation = self.env.reset()


if __name__ == '__main__':
    model = PG()
    history = model.train(5000, 5)
    model.play()

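The training and test results are shown below. As the number of training episodes grows, the reward the Policy Network obtains in the game keeps increasing, and the magnitude of the Loss shrinks (from about 0.33 to about 0.23 in the log below). After 5000 training episodes the model is tested: unlike random play, the Policy Network reaches a reward of 200. Because the game also ends once the reward reaches 200, the Policy Network can be said to have solved the problem.
In my experiments, however, training the Policy Network was not stable. The initialization of the model parameters had a large effect on the result, and several attempts were needed. Sometimes the reward converged for a while and then dropped quickly again, showing a periodic pattern; the instability of the training process can also be seen in the figure.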

Episode: 5 | Batch reward: 120.0 | Episode reward: 24.0 | loss: -0.325
Episode: 10 | Batch reward: 67.0 | Episode reward: 13.4 | loss: -0.300
Episode: 15 | Batch reward: 128.0 | Episode reward: 25.6 | loss: -0.326
Episode: 20 | Batch reward: 117.0 | Episode reward: 23.4 | loss: -0.332
Episode: 25 | Batch reward: 122.0 | Episode reward: 24.4 | loss: -0.330
Episode: 30 | Batch reward: 97.0 | Episode reward: 19.4 | loss: -0.339
Episode: 35 | Batch reward: 120.0 | Episode reward: 24.0 | loss: -0.331
......

Episode: 4960 | Batch reward: 973.0 | Episode reward: 194.6 | loss: -0.228
Episode: 4965 | Batch reward: 1000.0 | Episode reward: 200.0 | loss: -0.224
Episode: 4970 | Batch reward: 881.0 | Episode reward: 176.2 | loss: -0.238
Episode: 4975 | Batch reward: 1000.0 | Episode reward: 200.0 | loss: -0.213
Episode: 4980 | Batch reward: 974.0 | Episode reward: 194.8 | loss: -0.229
Episode: 4985 | Batch reward: 862.0 | Episode reward: 172.4 | loss: -0.235
Episode: 4990 | Batch reward: 914.0 | Episode reward: 182.8 | loss: -0.233
Episode: 4995 | Batch reward: 737.0 | Episode reward: 147.4 | loss: -0.254

Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 200.0, turns was: 200

[Figure: Policy Network training results]

DQN

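DQN is a typical temporal-difference method. Unlike the Policy Network, DQN learns from the data at step n and step n+1, so its variance is lower than that of Monte Carlo methods. The commonly used DQN algorithm is the Nature DQN proposed in 2015, which is used as the example here.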

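The original DQN uses a single network both to select actions and to compute the target Q-values. Nature DQN uses two networks: the main network selects actions and has its parameters updated, and the target network computes the target Q-values. The two networks have exactly the same structure. The target network's parameters are not updated iteratively; instead they are copied from the main network at regular intervals (a delayed update), which reduces the correlation between the target Q-values and the current Q-values. Apart from using a second network of the same structure to compute the target Q-values, Nature DQN is essentially identical to DQN.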
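To make the delayed update concrete, here is a toy sketch in plain Python. The two "networks" are just lists of numbers and the gradient step is faked, so everything here (the weight values, `SYNC_EVERY`, the helper `td_target`) is an illustrative assumption rather than the Keras code of this post.

```python
import copy


def td_target(reward, next_q_values, done, gamma=0.95):
    """One-step target for the action taken.

    next_q_values are the target network's Q-values for the next state.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Stand-ins for the two networks' parameters (hypothetical numbers).
main_weights = [0.1, -0.3]
target_weights = copy.deepcopy(main_weights)

SYNC_EVERY = 20  # copy interval, counted in training steps
for step in range(1, 51):
    # Pretend gradient step: only the main network changes.
    main_weights = [w + 0.01 for w in main_weights]
    if step % SYNC_EVERY == 0:
        # Delayed update: the target network jumps to the main network's weights.
        target_weights = copy.deepcopy(main_weights)

# After 50 steps the target network still holds the weights from step 40.
print(main_weights, target_weights)
print(td_target(1.0, [0.5, 2.0], done=True))  # 1.0
```

Between two copies the targets are computed from frozen parameters, so they do not shift with every gradient step of the main network.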

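The Nature DQN procedure is as follows:
(1) Build two neural networks, a main network and a target network. Both take the observation as input and output the Q-value of each action.
(2) At the end of an episode (the game is won or lost), reset the env so that the observation returns to its initial state. Choose an action with the ε-greedy rule. Given the chosen action, obtain the new next_observation, the reward, and the game status. Put [observation, action, reward, next_observation, done] into the experience pool. The pool has a fixed capacity, so old data is discarded.
(3) Randomly sample a batch of data from the experience pool and compute the Q-values of observation as Q_target. For samples where done is False, compute discount_reward from reward and next_observation, then write discount_reward into Q_target; for samples where done is True, only the reward is used.
(4) Perform one gradient descent update for every action taken, using MSE as the loss function. Note that, unlike the Policy Network, parameters are updated not at the end of each game but at every step while the game is being played.
(5) After each batch, update the parameter epsilon. The epsilon of ε-greedy keeps shrinking, which means the randomness keeps decreasing.
(6) Every fixed number of steps, copy the parameters from the main network to the target network.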
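Two of the mechanisms in this list, the bounded experience pool of step (2) and the shrinking epsilon of step (5), are easy to try in isolation. The sketch below uses only the standard library; the pool capacity of 5 and the dummy transitions are chosen for illustration, while the epsilon values (1.0, 0.995, 0.01) are the ones used in the DQN class of this post.

```python
import random
from collections import deque

# Experience pool with a fixed capacity: appending to a full deque
# silently drops the oldest transition.
memory = deque(maxlen=5)
for step in range(8):
    # (observation, action, reward, next_observation, done) with dummy values
    memory.append((step, 0, 1.0, step + 1, False))
print([item[0] for item in memory])  # [3, 4, 5, 6, 7]

# Uniformly sample a mini-batch without replacement.
batch = random.sample(list(memory), 3)

# Multiplicative epsilon decay with a floor.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
updates = 0
while epsilon >= epsilon_min:
    epsilon *= epsilon_decay
    updates += 1
print(updates)  # 919 updates until epsilon drops below 0.01
```

Because epsilon is decayed once per training step rather than once per episode, exploration falls to its minimum after roughly nine hundred steps, which is only a few dozen episodes early in training.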

The Nature DQN implemented with Keras is shown below:

# -*- coding: utf-8 -*-
import os
import gym
import random
import numpy as np

from collections import deque

from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K


class DQN:
    def __init__(self):
        self.model = self.build_model()
        self.target_model = self.build_model()
        self.update_target_model()

        if os.path.exists('dqn.h5'):
            self.model.load_weights('dqn.h5')

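        # Experience pool
        self.memory_buffer = deque(maxlen=2000)
        # Discount rate for Q-values, used to compute the discounted future reward
        self.gamma = 0.95
        # Probability of taking a random action in ε-greedy selection
        self.epsilon = 1.0
        # Decay rate of epsilon
        self.epsilon_decay = 0.995
        # Minimum probability of random exploration
        self.epsilon_min = 0.01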

        self.env = gym.make('CartPole-v0')

    def build_model(self):
        """Basic network architecture."""
        inputs = Input(shape=(4,))
        x = Dense(16, activation='relu')(inputs)
        x = Dense(16, activation='relu')(x)
        x = Dense(2, activation='linear')(x)

        model = Model(inputs=inputs, outputs=x)

        return model

    def update_target_model(self):
        """Copy the main network's weights to target_model."""
        self.target_model.set_weights(self.model.get_weights())

    def egreedy_action(self, state):
        """Choose an action with the ε-greedy rule.

        Arguments:
            state: state

        Returns:
            action: action
        """
        if np.random.rand() <= self.epsilon:
            return random.randint(0, 1)
        else:
            q_values = self.model.predict(state)[0]
            return np.argmax(q_values)

    def remember(self, state, action, reward, next_state, done):
        """Add a transition to the experience pool.

        Arguments:
            state: state
            action: action
            reward: reward
            next_state: next state
            done: flag marking the end of the game
        """
        item = (state, action, reward, next_state, done)
        self.memory_buffer.append(item)

    def update_epsilon(self):
        """Decay epsilon until it falls below epsilon_min."""
        if self.epsilon >= self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def process_batch(self, batch):
        """Build a training batch.

        Arguments:
            batch: batch size

        Returns:
            X: states
            y: [Q_value1, Q_value2]
        """
        # Randomly sample a batch from the experience pool
        data = random.sample(self.memory_buffer, batch)
        # Build Q_target.
        states = np.array([d[0] for d in data])
        next_states = np.array([d[3] for d in data])

        y = self.model.predict(states)
        q = self.target_model.predict(next_states)

        for i, (_, action, reward, _, done) in enumerate(data):
            target = reward
            if not done:
                target += self.gamma * np.amax(q[i])
            y[i][action] = target

        return states, y


    def train(self, episode, batch):
        """Train the model.
        Arguments:
            episode: number of games to play
            batch: batch size

        Returns:
            history: training record
        """
        self.model.compile(loss='mse', optimizer=Adam(1e-3))

        history = {'episode': [], 'Episode_reward': [], 'Loss': []}

        count = 0
        for i in range(episode):
            observation = self.env.reset()
            reward_sum = 0
            loss = np.inf
            done = False

            while not done:
                # Choose an action with the ε-greedy rule.
                x = observation.reshape(-1, 4)
                action = self.egreedy_action(x)
                observation, reward, done, _ = self.env.step(action)
                # Add the transition to the experience pool.
                reward_sum += reward
                self.remember(x[0], action, reward, observation, done)

                if len(self.memory_buffer) > batch:
                    # Train on a sampled batch
                    X, y = self.process_batch(batch)
                    loss = self.model.train_on_batch(X, y)

                    count += 1
                    # Decay the ε-greedy epsilon.
                    self.update_epsilon()

                    # Update target_model every fixed number of training steps
                    if count != 0 and count % 20 == 0:
                        self.update_target_model()

            if i % 5 == 0:
                history['episode'].append(i)
                history['Episode_reward'].append(reward_sum)
                history['Loss'].append(loss)
    
                print('Episode: {} | Episode reward: {} | loss: {:.3f} | e:{:.2f}'.format(i, reward_sum, loss, self.epsilon))

        self.model.save_weights('dqn.h5')

        return history

    def play(self):
        """Play the game with the trained model."""
        observation = self.env.reset()

        count = 0
        reward_sum = 0
        random_episodes = 0

        while random_episodes < 10:
            self.env.render()

            x = observation.reshape(-1, 4)
            q_values = self.model.predict(x)[0]
            action = np.argmax(q_values)
            observation, reward, done, _ = self.env.step(action)

            count += 1
            reward_sum += reward

            if done:
                print("Reward for this episode was: {}, turns was: {}".format(reward_sum, count))
                random_episodes += 1
                reward_sum = 0
                count = 0
                observation = self.env.reset()

        self.env.close()


if __name__ == '__main__':
    model = DQN()
    history = model.train(500, 32)
    model.play()

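The training and test results are shown below. As the number of training episodes grows, the reward the DQN obtains in the game keeps increasing and the Loss keeps decreasing. Tested after 500 training episodes with batch=32, DQN also performs well; with further training it should be able to match the Policy Network.
Compared with the Policy Network, DQN training is somewhat more stable, but DQN has a problem: convergence of the Q-network is not guaranteed. In other words, we may not end up with converged Q-network parameters, which can leave the trained model performing poorly, so here too repeated attempts are needed to pick the best model.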

Episode: 0 | Episode reward: 11.0 | loss: inf | e:1.00
Episode: 5 | Episode reward: 23.0 | loss: 0.816 | e:0.67
Episode: 10 | Episode reward: 18.0 | loss: 2.684 | e:0.46
Episode: 15 | Episode reward: 11.0 | loss: 3.662 | e:0.34
Episode: 20 | Episode reward: 16.0 | loss: 2.702 | e:0.23
Episode: 25 | Episode reward: 10.0 | loss: 4.092 | e:0.18
Episode: 30 | Episode reward: 12.0 | loss: 3.734 | e:0.13
...
Episode: 460 | Episode reward: 111.0 | loss: 6.325 | e:0.01
Episode: 465 | Episode reward: 180.0 | loss: 0.046 | e:0.01
Episode: 470 | Episode reward: 141.0 | loss: 0.136 | e:0.01
Episode: 475 | Episode reward: 169.0 | loss: 0.110 | e:0.01
Episode: 480 | Episode reward: 200.0 | loss: 0.095 | e:0.01
Episode: 485 | Episode reward: 200.0 | loss: 0.024 | e:0.01
Episode: 490 | Episode reward: 200.0 | loss: 0.066 | e:0.01
Episode: 495 | Episode reward: 146.0 | loss: 0.022 | e:0.01

Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 196.0, turns was: 196
Reward for this episode was: 198.0, turns was: 198
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 199.0, turns was: 199
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 193.0, turns was: 193
Reward for this episode was: 200.0, turns was: 200
Reward for this episode was: 189.0, turns was: 189
Reward for this episode was: 200.0, turns was: 200

[Figure: DQN training results]

Comparison

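(1) The Policy Network can handle continuous actions, whereas DQN can only handle discrete problems by enumerating the actions; continuous actions have to be discretized first.

(2) The Policy Network chooses an action at random according to the output action probabilities, whereas DQN chooses an action with the ε-greedy rule.

(3) DQN updates one reward at a time, so the current reward is related only to the neighbouring step; the Policy Network stores all the rewards of an episode, corrects them by discounting, normalizes them, and then updates.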
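The two action-selection rules compared above differ in where the randomness comes from, which a short plain-Python sketch can show side by side. The function names and the example numbers are hypothetical; `random.Random.choices` is the standard-library weighted sampler.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """DQN style: random action with probability epsilon, else the argmax of Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sample_from_policy(probs, rng=random):
    """Policy Network style: draw an action in proportion to its probability."""
    return rng.choices(list(range(len(probs))), weights=probs, k=1)[0]


rng = random.Random(0)
# With epsilon = 0 the choice is fully greedy and ignores how close the Q-values are.
print(epsilon_greedy([0.49, 0.51], epsilon=0.0, rng=rng))  # 1
# Sampling keeps picking action 0 about 49% of the time.
picks = [sample_from_policy([0.49, 0.51], rng) for _ in range(10000)]
print(picks.count(0) / len(picks))
```

With ε-greedy the amount of exploration is set by a schedule on epsilon, while with a stochastic policy it is set by the network's own output probabilities.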

Reproduction without permission is prohibited: 微梦 - 邪少个人博客 » Deep Reinforcement Learning with Keras: Policy Network and DQN Implementations