Playing Atari with DQN

Background

The key concepts are as follows.

The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value \(Q^*(s',a')\) of the sequence \(s'\) at the next time-step was known for all possible actions \(a'\), then the optimal strategy is to select \(a'\) maximising the expected value of \(r+\gamma Q^*(s',a')\),

\[Q^*(s,a) = \mathbb{E}_{s'\sim \mathcal{E}} \left[ r+\gamma \max_{a'}Q^*(s',a') \mid s,a \right]\]

The basic idea behind many reinforcement learning algorithms is to estimate the action-value function, by using the Bellman equation as an iterative update, \(Q_{i+1}(s,a) = \mathbb{E}[r+\gamma \max_{a'}Q_i(s',a') \mid s,a]\). Such value iteration algorithms converge to the optimal action-value function, \(Q_i \rightarrow Q^*\) as \(i \rightarrow \infty\). In practice, this basic approach is totally impractical, because the action-value function is estimated separately for each sequence, without any generalisation.
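The update above can be made concrete with a tiny tabular example. The sketch below runs the iterative Bellman update on an assumed two-state, two-action toy MDP (the `transitions` table and `gamma` are purely illustrative), showing \(Q_i \rightarrow Q^*\) numerically.

```python
import numpy as np

# Minimal tabular sketch of the iterative Bellman update
# Q_{i+1}(s,a) <- E[r + gamma * max_a' Q_i(s',a')].
# The tiny deterministic MDP below is purely illustrative.
n_states, n_actions = 2, 2
gamma = 0.9

# transitions[s][a] = (next_state, reward); an assumed toy dynamics model.
transitions = {
    0: {0: (0, 0.0), 1: (1, 1.0)},
    1: {0: (0, 0.0), 1: (1, 1.0)},
}

Q = np.zeros((n_states, n_actions))
for i in range(100):  # Q_i -> Q* as i -> infinity
    Q_next = np.zeros_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            s_next, r = transitions[s][a]
            Q_next[s, a] = r + gamma * Q[s_next].max()
    Q = Q_next

print(Q)  # converged optimal action values for the toy MDP
```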

Instead, it is common to use a function approximator to estimate the action-value function \(Q(s,a;\theta) \approx Q^*(s,a)\). We refer to a neural network function approximator with weights \(\theta\) as a Q-network. A Q-network can be trained by minimising a sequence of loss functions \(L_i(\theta_i)\) that changes at each iteration \(i\),

\[L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\left[(y_i - Q(s,a;\theta_i))^2 \right],\]

where \(y_i = \mathbb{E}_{s'\sim \mathcal{E}}\left[r + \gamma \max_{a'}Q(s',a';\theta_{i-1}) \mid s,a\right]\) is the target for iteration \(i\) and \(\rho(s,a)\) is a probability distribution over sequences \(s\) and actions \(a\) that we refer to as the behaviour distribution. The parameters from the previous iteration \(\theta_{i-1}\) are held fixed when optimising the loss function \(L_i(\theta_i)\).

Tip

Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins.

The target is held fixed while each iteration is being computed, but the parameters are updated immediately after that iteration's computation finishes.
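As a concrete illustration of the loss above and of the tip that the targets depend on the network weights, here is a hedged PyTorch-style sketch of one gradient step. It assumes `q_net` is some `nn.Module` mapping states to per-action values and that a batch of transitions has already been sampled from the behaviour distribution; all names are illustrative, and terminal-state handling is omitted for brevity.

```python
import copy
import torch
import torch.nn.functional as F

def dqn_loss_step(q_net, optimizer, states, actions, rewards, next_states, gamma=0.99):
    # theta_{i-1}: a frozen snapshot of the previous parameters, used only for the target y_i.
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)

    with torch.no_grad():
        # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1})
        y = rewards + gamma * target_net(next_states).max(dim=1).values

    # Q(s, a; theta_i) for the actions actually taken in the batch.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.mse_loss(q_sa, y)  # (y_i - Q(s, a; theta_i))^2 averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Taking a fresh copy of the parameters for the target at every step mirrors the note above: the target is fixed within one iteration, then picks up the new weights immediately afterwards.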

Deep Reinforcement Learning

The input is raw image pixels, so the authors preprocess the sequence, mapping the state \(s\) into a fixed-size representation \(\phi(s)\). However, for time-series signals such as load curves, where the state is just the value at a single time step, this kind of preprocessing presumably does not apply.
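As a rough sketch of what \(\phi\) might look like in code: the exact crop and down-sampling choices below are assumptions, kept dependency-free rather than reproducing the paper's pipeline exactly, but the overall shape (grayscale, reduce to \(84\times84\), stack the last 4 frames) follows the description above.

```python
from collections import deque
import numpy as np

def to_grayscale(frame_rgb):
    # frame_rgb: (210, 160, 3) uint8 Atari frame; standard luminance weights.
    return (frame_rgb @ np.array([0.299, 0.587, 0.114])).astype(np.float32)

def preprocess(frame_rgb):
    gray = to_grayscale(frame_rgb)   # (210, 160)
    cropped = gray[26:194, :]        # drop the score area -> (168, 160), an assumed crop
    small = cropped[::2, ::2]        # crude 2x subsample -> (84, 80)
    out = np.zeros((84, 84), dtype=np.float32)
    out[:, :80] = small              # pad to 84x84 for simplicity
    return out / 255.0

class FrameStack:
    """Keeps the last 4 preprocessed frames and returns phi(s) with shape (84, 84, 4)."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame_rgb):
        f = preprocess(frame_rgb)
        for _ in range(self.frames.maxlen):
            self.frames.append(f)
        return np.stack(self.frames, axis=-1)

    def step(self, frame_rgb):
        self.frames.append(preprocess(frame_rgb))
        return np.stack(self.frames, axis=-1)
```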

Figure: DQN network architecture.

The input to the neural network is an \(84\times 84\times 4\) image produced by \(\phi\).

The first hidden layer convolves 16 \(8\times8\) filters with stride 4 with the input image and applies a rectifier nonlinearity (ReLU activation function). \(\rightarrow 20\times20\times16\)

The second hidden layer convolves 32 \(4\times4\) filters with stride 2, again followed by a rectifier nonlinearity. \(\rightarrow 9\times9\times32\)

The final hidden layer is fully-connected and consists of 256 rectifier units. \(\rightarrow 256\)

The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered. \(\rightarrow 4\sim18\)
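The architecture described above can be written down as a short PyTorch sketch (channels-first \(4\times84\times84\) input in PyTorch convention; `n_actions` is a constructor argument since the number of valid actions varies between 4 and 18 per game):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Sketch of the network described above: input 4 x 84 x 84."""
    def __init__(self, n_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 16, kernel_size=8, stride=4)   # -> 16 x 20 x 20
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)  # -> 32 x 9 x 9
        self.fc = nn.Linear(32 * 9 * 9, 256)                     # 256 rectifier units
        self.out = nn.Linear(256, n_actions)                     # one linear output per valid action

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.fc(x.flatten(start_dim=1)))
        return self.out(x)

# Shape check on a batch of one stacked-frame input.
q = DQN(n_actions=4)(torch.zeros(1, 4, 84, 84))
print(q.shape)  # torch.Size([1, 4])
```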

Tip

The first two convolutional layers output feature maps. The first hidden layer produces 16 feature maps, and the ReLU is applied element-wise to every pixel, giving \(20\times20\times16=6400\) activations in total. But we do not say the layer has 6400 ReLU units; these per-pixel activations are not units in the same sense as the 256 fully-connected rectifier units.