CS285 Lecture 8: Deep RL with Q-Functions

Summary

Q learning with:

  • replay buffer
  • target networks
  • double Q-learning
  • continuous actions

Putting these together, we get DDPG:

  1. take some action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$, add it to $\mathcal{B}$; the actions come from the current policy $\mu_\theta$ plus some exploration noise.
  2. sample a mini-batch $(s_j, a_j, s'_j, r_j)$ from $\mathcal{B}$ uniformly
  3. compute $y_j = r_j + \gamma Q_{\phi'}(s'_j, \mu_{\theta'}(s'_j))$ using the target networks $Q_{\phi'}$ and $\mu_{\theta'}$
  4. $\phi \leftarrow \phi - \alpha \sum_j \frac{\partial Q_\phi(s_j, a_j)}{\partial \phi}\big(Q_\phi(s_j, a_j) - y_j\big)$
  5. $\theta \leftarrow \theta + \beta \sum_j \frac{\partial \mu_\theta(s_j)}{\partial \theta} \frac{\partial Q_\phi(s_j, a)}{\partial a}\Big|_{a = \mu_\theta(s_j)}$
  6. update $\phi'$ and $\theta'$ (e.g., with Polyak averaging)

Replay Buffer

In online q-learning we have:

  1. take an action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$
  2. $y_i = r_i + \gamma \max_{a'} Q_\phi(s'_i, a')$
  3. $\phi \leftarrow \phi - \alpha \frac{\partial Q_\phi(s_i, a_i)}{\partial \phi}\big(Q_\phi(s_i, a_i) - y_i\big)$

Combining steps 2 and 3, we get:

  1. take an action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$
  2. $\phi \leftarrow \phi - \alpha \frac{\partial Q_\phi(s_i, a_i)}{\partial \phi}\big(Q_\phi(s_i, a_i) - \underbrace{[r_i + \gamma \max_{a'} Q_\phi(s'_i, a')]}_{\text{target value}}\big)$
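
To make this concrete, here is a minimal sketch of this single-transition update in PyTorch; the network size, learning rate, and `done`-flag handling are illustrative assumptions, not from the lecture:

```python
import torch
import torch.nn as nn

# Illustrative sizes; any discrete-action environment would define these.
obs_dim, n_actions, gamma, lr = 4, 2, 0.99, 1e-3

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.SGD(q_net.parameters(), lr=lr)

def online_q_update(s, a, r, s_next, done):
    """One online Q-learning step on a single transition (s, a, r, s')."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # Target value: r + gamma * max_a' Q(s', a'); no gradient flows through it.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q_net(s_next).max()
    q_sa = q_net(s)[a]                    # Q_phi(s_i, a_i)
    loss = 0.5 * (q_sa - y) ** 2          # its gradient is dQ/dphi * (Q - y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```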

Because the samples are collected sequentially, consecutive transitions are similar and strongly correlated, and the target value keeps changing as $\phi$ is updated.

synchronous/asynchronous parallel Q-learning
One fix is to use synchronous or asynchronous parallel Q-learning, with multiple workers collecting transitions in parallel, as in actor-critic.

replay buffer
Using a replay buffer $\mathcal{B}$ to supply samples, we now have:

  1. sample a batch $(s_i, a_i, s'_i, r_i)$ from $\mathcal{B}$
  2. $\phi \leftarrow \phi - \alpha \sum_i \frac{\partial Q_\phi(s_i, a_i)}{\partial \phi}\big(Q_\phi(s_i, a_i) - [r_i + \gamma \max_{a'} Q_\phi(s'_i, a')]\big)$

This way the samples are no longer correlated, and because each batch contains multiple samples, the gradient estimate has lower variance.
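
A minimal sketch of a uniform replay buffer and the batched update, again with illustrative shapes and hyperparameters:

```python
import random
from collections import deque
import torch

class ReplayBuffer:
    """FIFO buffer B storing transitions (s, a, r, s', done)."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(s_next, dtype=torch.float32),
                torch.as_tensor(done, dtype=torch.float32))

def batched_q_update(q_net, opt, buffer, batch_size=32, gamma=0.99):
    """One gradient step on a uniformly sampled batch from B."""
    s, a, r, s_next, done = buffer.sample(batch_size)
    with torch.no_grad():                                  # no gradient through the target
        y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_phi(s_i, a_i)
    loss = ((q_sa - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```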

In the end, we get the full Q-learning algorithm with a replay buffer: periodically collect transitions with some (e.g., epsilon-greedy) policy and add them to $\mathcal{B}$, then repeat the batched update above $K$ times before collecting more data.

A large replay buffer helps to stabilize training.

Target Network & Predict Network

In q-learning, when we need to update $\phi$, we'll need to calculate:

$$\phi \leftarrow \phi - \alpha \sum_i \frac{\partial Q_\phi(s_i, a_i)}{\partial \phi}\big(Q_\phi(s_i, a_i) - \underbrace{[r_i + \gamma \max_{a'} Q_\phi(s'_i, a')]}_{\text{no gradient through target value}}\big)$$

The target value changes at every step, so the network is hard to converge; we compute the target with a separate, more slowly changing target network instead.

If we take $K = 1$, then we'll have DQN:

  1. take some action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$, add it to $\mathcal{B}$; the actions are generated using $Q_\phi$, e.g., epsilon-greedy.
  2. sample a mini-batch $(s_j, a_j, s'_j, r_j)$ from $\mathcal{B}$ uniformly
  3. compute $y_j = r_j + \gamma \max_{a'} Q_{\phi'}(s'_j, a')$ using the target network $Q_{\phi'}$
  4. $\phi \leftarrow \phi - \alpha \sum_j \frac{\partial Q_\phi(s_j, a_j)}{\partial \phi}\big(Q_\phi(s_j, a_j) - y_j\big)$
  5. update $\phi'$: copy $\phi$ every $N$ steps

With periodic hard copies, the gradient step taken right after updating $\phi'$ uses a target that is only one step old, while the step taken just before the next copy uses the oldest target. To make this lag more uniform across steps, we can update $\phi'$ with Polyak averaging, i.e., an exponential moving average of the online network's parameters.

When updating $\phi'$, we now do:

$$\phi' \leftarrow \tau \phi' + (1 - \tau)\phi$$

with $\tau$ close to 1 (e.g., $\tau = 0.999$).

For a more general view, Q-learning with a replay buffer and a target network is as follows:

  1. save target network parameters: $\phi' \leftarrow \phi$
  2. collect a dataset $\{(s_i, a_i, s'_i, r_i)\}$ using some policy and add it to $\mathcal{B}$
  3. sample a batch $(s_i, a_i, s'_i, r_i)$ from $\mathcal{B}$
  4. $\phi \leftarrow \phi - \alpha \sum_i \frac{\partial Q_\phi(s_i, a_i)}{\partial \phi}\big(Q_\phi(s_i, a_i) - [r_i + \gamma \max_{a'} Q_{\phi'}(s'_i, a')]\big)$

Steps 3-4 are repeated $K$ times per data-collection step, and steps 2-4 are repeated $N$ times before $\phi'$ is updated again.
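
Below is a minimal PyTorch sketch of the inner update using a target network, together with both ways of updating $\phi'$ (periodic hard copy as in DQN, or Polyak averaging); the architecture and hyperparameters are assumptions for illustration:

```python
import copy
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, tau = 4, 2, 0.99, 0.999    # illustrative values

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target = copy.deepcopy(q_net)                       # phi' starts as a copy of phi
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s_next, done):
    """One gradient step on phi using targets from the frozen network phi'."""
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_sa - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

def hard_update():
    """DQN-style: copy phi into phi' every N steps."""
    q_target.load_state_dict(q_net.state_dict())

def polyak_update():
    """phi' <- tau * phi' + (1 - tau) * phi, applied after every step."""
    with torch.no_grad():
        for p_t, p in zip(q_target.parameters(), q_net.parameters()):
            p_t.mul_(tau).add_((1 - tau) * p)
```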

Improve Q-learning

Problem: Overestimation of Q Values

Q-learning tends to overestimate the Q value, because we calculate $y_j$ as follows:

$$y_j = r_j + \gamma \underbrace{\max_{a'} Q_{\phi'}(s'_j, a')}_{\text{here's the problem}}$$

Because $E[\max(X_1, X_2)] \geq \max(E[X_1], E[X_2])$, taking a max over noisy estimates $Q_{\phi'}(s', a')$ yields a value that is too high in expectation, so Q-learning overestimates the true Q value.
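
A quick numerical illustration of this bias: suppose the true $Q(s', a')$ is zero for every action, and our estimates are just zero-mean noise. The max over the noisy estimates is still positive on average:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise_std = 10, 100_000, 1.0

# True Q(s', a') = 0 for every action; estimates are true value + Gaussian noise.
noisy_q = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

print("max of true values       :", 0.0)
print("E[max of noisy estimates]:", noisy_q.max(axis=1).mean())
# Prints roughly 1.54 for 10 actions with unit Gaussian noise: the max over
# noisy estimates is biased upward even though every true value is 0.
```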

Solution: Double Q-learning

Note that we select the action $a'$ using $Q_{\phi'}(s', a')$, and we also estimate its value using the same $Q_{\phi'}$, so:

$$\max_{a'} Q_{\phi'}(s', a') = \underbrace{Q_{\phi'}}_{\text{value also from } Q_{\phi'}}\big(s', \underbrace{\arg\max_{a'} Q_{\phi'}(s', a')}_{\text{action selected from } Q_{\phi'}}\big)$$

If $Q_{\phi'}(s', a')$ is noisy, selecting and evaluating with the same network compounds the noise: the action that looks best is often the one whose noise happens to be largest. Thus we use two separate networks, one to choose the action and one to evaluate its value:

$$\begin{aligned} Q_{\phi_A}(s, a) &\leftarrow r + \gamma Q_{\phi_B}\big(s', \arg\max_{a'} Q_{\phi_A}(s', a')\big) \\ Q_{\phi_B}(s, a) &\leftarrow r + \gamma Q_{\phi_A}\big(s', \arg\max_{a'} Q_{\phi_B}(s', a')\big) \end{aligned}$$

In practice, we already have a target network and a prediction network, so we can just use them as $\phi_A$ and $\phi_B$.

So the double Q-learning target becomes:

$$y = r + \gamma Q_{\phi'}\big(s', \arg\max_{a'} Q_\phi(s', a')\big)$$

Note that we use the prediction network $Q_\phi$ to choose the action, and the target network $Q_{\phi'}$ to evaluate its value.

Overall, double Q-learning helps a lot in practice, and it has essentially no downsides.
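
A sketch of computing the double Q-learning target with the prediction and target networks; `q_net` and `q_target` are the same kind of networks assumed in the earlier sketches:

```python
import torch

def double_q_target(q_net, q_target, r, s_next, done, gamma=0.99):
    """y = r + gamma * Q_phi'(s', argmax_a' Q_phi(s', a'))."""
    with torch.no_grad():
        # Choose the action with the prediction network Q_phi ...
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        # ... but evaluate it with the target network Q_phi'.
        next_q = q_target(s_next).gather(1, a_star).squeeze(1)
        return r + gamma * (1 - done) * next_q
```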

Multiple Steps

We calculate the Q-learning target as:

$$y_{j,t} = r_{j,t} + \gamma \max_{a_{j,t+1}} Q_{\phi'}(s_{j,t+1}, a_{j,t+1})$$

If $Q_{\phi'}$ is bad (e.g., early in training), the reward $r_{j,t}$ is the only term in the target that carries useful signal; once $Q_{\phi'}$ is good, it dominates the target instead. So when $Q_{\phi'}$ is poor at first, learning can be very slow. Summing rewards over multiple steps propagates reward information faster and mitigates this problem.

As in actor-critic, we now use an $N$-step target:

$$y_{j,t} = \sum_{t'=t}^{t+N-1} \gamma^{t'-t} r_{j,t'} + \gamma^N \max_{a_{j,t+N}} Q_{\phi'}(s_{j,t+N}, a_{j,t+N})$$

But this is only correct on-policy: the intermediate actions must come from the current policy.

In practice, people often just ignore this restriction and apply $N$-step targets to off-policy data anyway; it usually still works well.
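
A small helper sketch for computing the $N$-step target from a stored trajectory segment (the function and its arguments are illustrative, not from the lecture):

```python
import numpy as np

def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """y_t = sum_{t'=t}^{t+N-1} gamma^(t'-t) * r_t' + gamma^N * max_a Q_phi'(s_{t+N}, a).

    `rewards` holds r_t, ..., r_{t+N-1}; `bootstrap_value` is max_a Q_phi'(s_{t+N}, a).
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)            # 1, gamma, ..., gamma^(N-1)
    return float(np.dot(discounts, rewards) + gamma ** n * bootstrap_value)

# e.g., N = 3 steps of reward followed by a bootstrapped value estimate
y = n_step_target([1.0, 0.0, 2.0], bootstrap_value=5.0, gamma=0.9)  # 1 + 0 + 0.81*2 + 0.729*5
```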

Continuous Actions

Finding $\arg\max_{a'} Q_{\phi'}(s', a')$ is easy when actions are discrete, but when actions are continuous, the inner maximization becomes expensive.

Approximate the Max by Sampling

We can approximate $\arg\max_{a'} Q_{\phi'}(s', a')$ with simple stochastic optimization (Monte Carlo-style sampling):

$$\max_a Q(s, a) \approx \max\{Q(s, a_1), Q(s, a_2), \dots, Q(s, a_N)\}$$

where $(a_1, a_2, \dots, a_N)$ are sampled from some distribution (e.g., uniformly over the action space).

Smarter sampling schemes such as the cross-entropy method (CEM) or CMA-ES work well for action spaces of up to about 40 dimensions.
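
A sketch of the simplest sampling-based approximation (uniform random shooting) for a continuous-action critic; the two-argument critic `q_net(s, a)` and the action bounds are assumptions:

```python
import torch

def sampled_max_q(q_net, s, action_low, action_high, n_samples=64):
    """Approximate max_a Q(s, a) by evaluating N uniformly sampled actions.

    s: 1-D state tensor; action_low/action_high: 1-D bound tensors of size action_dim.
    """
    action_dim = action_low.shape[0]
    # Sample a_1, ..., a_N uniformly from the action box.
    a = action_low + (action_high - action_low) * torch.rand(n_samples, action_dim)
    s_rep = s.unsqueeze(0).expand(n_samples, -1)      # repeat the state for each action
    q_vals = q_net(s_rep, a).squeeze(-1)              # Q(s, a_1), ..., Q(s, a_N)
    best = q_vals.argmax()
    return q_vals[best], a[best]                      # approximate max and argmax
```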

Choose Easily Maximizable Q-functions

At the cost of reducing the representational capacity of the Q-function, we can make $\arg\max_a Q$ and $\max_a Q$ trivial to compute; for example, we can use NAF (Normalized Advantage Functions):

$$Q_\phi(s, a) = -\frac{1}{2}\big(a - \mu_\phi(s)\big)^T P_\phi(s) \big(a - \mu_\phi(s)\big) + V_\phi(s)$$

and, since $P_\phi(s)$ is positive definite, we get:

$$\begin{aligned} \arg\max_a Q_\phi(s, a) &= \mu_\phi(s) \\ \max_a Q_\phi(s, a) &= V_\phi(s) \end{aligned}$$
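
A sketch of evaluating a NAF-style Q-function, assuming hypothetical network heads that output $\mu_\phi(s)$, a factor $L$ with $P_\phi(s) = LL^T$ (which keeps $P$ positive semi-definite), and $V_\phi(s)$:

```python
import torch

def naf_q_value(mu, L, v, a):
    """Q(s, a) = -1/2 (a - mu(s))^T P(s) (a - mu(s)) + V(s), with P = L L^T.

    mu: (B, d) action mean, L: (B, d, d) factor, v: (B,) value,
    a:  (B, d) action at which to evaluate Q.
    """
    P = L @ L.transpose(1, 2)                      # positive semi-definite P(s)
    diff = (a - mu).unsqueeze(-1)                  # (B, d, 1)
    quad = (diff.transpose(1, 2) @ P @ diff).squeeze(-1).squeeze(-1)
    return -0.5 * quad + v

# argmax_a Q(s, a) is just mu(s), and max_a Q(s, a) is just v(s): no search needed.
```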

DDPG: learn an approximate maximizer

Train a network $\mu_\theta(s)$ so that $\mu_\theta(s) \approx \arg\max_a Q_\phi(s, a)$. We can do this by maximizing $Q_\phi(s, \mu_\theta(s))$ with respect to $\theta$, backpropagating through the critic via the chain rule $\frac{dQ_\phi}{d\theta} = \frac{\partial \mu_\theta(s)}{\partial \theta}\frac{\partial Q_\phi(s, a)}{\partial a}$.

DDPG:

  1. take some action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$, add it to $\mathcal{B}$; the actions come from the current policy $\mu_\theta$ plus some exploration noise.
  2. sample a mini-batch $(s_j, a_j, s'_j, r_j)$ from $\mathcal{B}$ uniformly
  3. compute $y_j = r_j + \gamma Q_{\phi'}(s'_j, \mu_{\theta'}(s'_j))$ using the target networks $Q_{\phi'}$ and $\mu_{\theta'}$
  4. $\phi \leftarrow \phi - \alpha \sum_j \frac{\partial Q_\phi(s_j, a_j)}{\partial \phi}\big(Q_\phi(s_j, a_j) - y_j\big)$
  5. $\theta \leftarrow \theta + \beta \sum_j \frac{\partial \mu_\theta(s_j)}{\partial \theta} \frac{\partial Q_\phi(s_j, a)}{\partial a}\Big|_{a = \mu_\theta(s_j)}$
  6. update $\phi'$ and $\theta'$ (e.g., with Polyak averaging)
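
A minimal PyTorch sketch of steps 3-6, assuming an actor network `mu`, a critic `q` that takes the concatenated state and action, and target copies of both; shapes and hyperparameters are illustrative:

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 8, 2, 0.99, 0.999      # illustrative values

mu = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
mu_targ, q_targ = copy.deepcopy(mu), copy.deepcopy(q)
mu_opt = torch.optim.Adam(mu.parameters(), lr=1e-4)
q_opt = torch.optim.Adam(q.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done):
    # 3. targets y_j from the target critic and target actor
    with torch.no_grad():
        q_next = q_targ(torch.cat([s_next, mu_targ(s_next)], dim=1)).squeeze(1)
        y = r + gamma * (1 - done) * q_next
    # 4. critic step: minimize (Q_phi(s, a) - y)^2
    q_sa = q(torch.cat([s, a], dim=1)).squeeze(1)
    q_loss = ((q_sa - y) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # 5. actor step: ascend Q_phi(s, mu_theta(s)) by descending its negative
    mu_loss = -q(torch.cat([s, mu(s)], dim=1)).mean()
    mu_opt.zero_grad(); mu_loss.backward(); mu_opt.step()
    # 6. Polyak-average the target networks
    with torch.no_grad():
        for targ, src in [(q_targ, q), (mu_targ, mu)]:
            for p_t, p in zip(targ.parameters(), src.parameters()):
                p_t.mul_(tau).add_((1 - tau) * p)
```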

Implementation Tips

  • Adam as the optimizer works well in practice.
  • Remember to run with several different random seeds, since results can vary a lot between seeds.
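
For instance, a small setup sketch along these lines (the seed list, architecture, and learning rate are arbitrary placeholders):

```python
import random
import numpy as np
import torch
import torch.nn as nn

def set_seed(seed):
    """Make a run reproducible; repeat the whole experiment for several seeds."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

for seed in [0, 1, 2]:                     # results can vary a lot between seeds
    set_seed(seed)
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)   # Adam tends to work well
    # ... run training with this seed and compare learning curves across seeds
```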