cs285-lec7-value function

Summary

We don’t need to learn a policy explicitly; instead, we use the policy that chooses the action with the maximum Q-value at the current state.

For example, Q-learning, covered below in both its batch (fitted Q-iteration) and online forms.

Online Q-learning:

  1. take an action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$
  2. $y_i = r_i + \gamma \max_{a'} Q_\phi(s'_i, a')$
  3. $\phi \leftarrow \phi - \alpha \frac{\partial Q_\phi(s_i, a_i)}{\partial \phi}(Q_\phi(s_i, a_i) - y_i)$

This method is not guaranteed to converge.

Policy iteration algorithm

Since $A^\pi(s_t, a_t)$ denotes how much better $a_t$ is than the average action according to $\pi$, $\argmax_{a_t} A^\pi(s_t, a_t)$ means we take the best action from $s_t$ if we then follow $\pi$.

Choosing the best action from $s_t$, the new policy will be:

$$\pi'(a_t|s_t) = \begin{cases} 1, & \text{if } a_t = \argmax_{a_t} A^\pi(s_t, a_t) \\ 0, & \text{otherwise} \end{cases}$$

so we have:

  1. evaluate $A^\pi(s, a)$
  2. set $\pi \leftarrow \pi'$

Recall from actor-critic that $A^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$, so now we need to evaluate $V^\pi(s_t)$.

Tabular reinforcement learning

Assume we know the transition probabilities $p(s'|s,a)$, and that $s_t, a_t$ are both discrete. Then we can do tabular reinforcement learning: store $V^\pi(s_t)$ in a table and update it at every step by:

$$V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma E_{s' \sim p(s'|s, \pi(s))}[V^\pi(s')]$$
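
As a concrete illustration of this tabular backup (and of the policy improvement step from the previous section), here is a minimal NumPy sketch; the transition tensor `P`, reward table `R`, and deterministic policy array `pi` are hypothetical inputs assumed to be known:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Repeat V(s) <- r(s, pi(s)) + gamma * E_{s'~p(s'|s,pi(s))}[V(s')] until convergence.

    P:  (S, A, S) array with P[s, a, s'] = p(s'|s, a)   (assumed known)
    R:  (S, A) array with R[s, a] = r(s, a)
    pi: (S,) int array giving the deterministic action pi(s)
    """
    S = P.shape[0]
    states = np.arange(S)
    r_pi = R[states, pi]               # r(s, pi(s)) for every state
    P_pi = P[states, pi]               # p(s'|s, pi(s)), shape (S, S)
    V = np.zeros(S)
    for _ in range(max_iters):
        V_new = r_pi + gamma * P_pi @ V          # Bellman backup for the fixed policy
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V

def greedy_improvement(P, R, V, gamma=0.99):
    """pi'(s) = argmax_a [r(s, a) + gamma * E[V(s')]], the policy improvement step."""
    Q = R + gamma * np.einsum("ijk,k->ij", P, V)   # Q(s, a) table
    return np.argmax(Q, axis=1)
```

Alternating `policy_evaluation` and `greedy_improvement` is exactly the two-step policy iteration loop above.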

To simplify the above, we can use $Q$ in place of $V$, so the procedure will be:

  1. construct the Q-value table: $Q(s,a) = r(s,a) + \gamma E[V(s')]$
  2. $V(s) \leftarrow \max_a Q(s,a)$

Notice that step 2 uses the Q-values to update the value function, so by combining the two steps we can get rid of the value function entirely. The procedure becomes:

  1. update the Q-value table: $Q(s,a) \leftarrow r(s,a) + \gamma E[V(s')]$, where $E[V(s')]$ is the expected value of the next time step
  2. substitute $V(s') = \max_{a'} Q(s', a')$, i.e. the value is just the max of the Q-values, so the combined update is $Q(s,a) \leftarrow r(s,a) + \gamma E_{s'}[\max_{a'} Q(s', a')]$
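
Combining the two steps gives tabular Q-iteration; here is a minimal sketch using the same hypothetical `P` and `R` tables as in the sketch above:

```python
import numpy as np

def q_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Repeat Q(s,a) <- r(s,a) + gamma * E_{s'}[max_{a'} Q(s',a')]; no value table is kept."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        V = Q.max(axis=1)                                # max_{a'} Q(s',a') plays the role of V(s')
        Q_new = R + gamma * np.einsum("ijk,k->ij", P, V)
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q
```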

Fitted Q-iteration

When the number of states is large, it is impossible to construct a table, so we use a neural network instead.

We train a neural network that maps a state to an estimated value, $V_\phi : \mathcal{S} \rightarrow \mathbb{R}$, and optimize it with:

$$\mathcal{L}(\phi) = \frac{1}{2}\|V_\phi(s) - \max_a Q^\pi(s,a)\|^2$$

Note: sometimes the gradient might be too large, so we may switch to the Huber loss or use gradient clipping.
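
In PyTorch this could look like the snippet below (a sketch; the network size, batch, and clipping threshold are illustrative assumptions). `smooth_l1_loss` is PyTorch's Huber loss with threshold 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical value network and data, only to make the snippet self-contained.
V_phi = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(V_phi.parameters(), lr=1e-3)
states = torch.randn(32, 4)        # batch of states
targets = torch.randn(32, 1)       # max_a Q(s, a) targets, computed elsewhere

loss = F.smooth_l1_loss(V_phi(states), targets)  # Huber loss: quadratic near zero, linear in the tails
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(V_phi.parameters(), max_norm=10.0)  # gradient clipping
optimizer.step()
```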

The training procedure is:

  1. set $y_i \leftarrow \max_{a_i}\big(r(s_i, a_i) + \gamma E[V_\phi(s'_i)]\big)$
  2. set $\phi \leftarrow \argmin_\phi \frac{1}{2}\sum_i\|V_\phi(s_i) - y_i\|^2$

Notice we still need to know the transition probabilities to compute $E_{s' \sim p(s'|s,a)}$ in step 1. To avoid this, we approximate $E[V_\phi(s'_i)] \approx \max_{a'} Q_\phi(s'_i, a')$ and fit $Q_\phi$ instead.

This Q-learning approach works with off-policy samples and uses only one network, with no need for a policy gradient. But it is not guaranteed to converge.

The full fitted Q-iteration algorithm is:

  1. collect a dataset $\{(s_i, a_i, s'_i, r_i)\}$ using some policy
  2. set $y_i \leftarrow r_i + \gamma \max_{a'} Q_\phi(s'_i, a')$
  3. set $\phi \leftarrow \argmin_\phi \frac{1}{2}\sum_i\|Q_\phi(s_i, a_i) - y_i\|^2$, repeating steps 2–3 several times before collecting more data
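
A minimal PyTorch sketch of this loop, assuming a small discrete action space and a random placeholder dataset of off-policy transitions (network size, dataset, and hyperparameters are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

gamma, n_actions, obs_dim = 0.99, 4, 8
Q_phi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(Q_phi.parameters(), lr=1e-3)

# Placeholder off-policy dataset of transitions (s_i, a_i, s'_i, r_i), collected by any policy.
N = 1024
s = torch.randn(N, obs_dim)
a = torch.randint(n_actions, (N,))
s_next = torch.randn(N, obs_dim)
r = torch.randn(N)

for k in range(100):                               # several gradient steps on a fixed dataset
    with torch.no_grad():                          # the target y_i is held fixed, not differentiated
        y = r + gamma * Q_phi(s_next).max(dim=1).values
    q_sa = Q_phi(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_phi(s_i, a_i)
    loss = 0.5 * (q_sa - y).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice steps 2–3 would alternate with collecting new data (step 1).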

Q-learning

The online Q-learning algorithm is:

  1. take an action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$
  2. $y_i = r_i + \gamma \max_{a'} Q_\phi(s'_i, a')$
  3. $\phi \leftarrow \phi - \alpha \frac{\partial Q_\phi(s_i, a_i)}{\partial \phi}\underbrace{(Q_\phi(s_i, a_i) - y_i)}_{\text{TD error}}$
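
A single update step could look like the sketch below (network and hyperparameters are illustrative). The target $y_i$ also depends on $\phi$ but is not differentiated through; the TD error only multiplies the gradient of $Q_\phi(s_i, a_i)$, so this is not true gradient descent, which is one reason convergence is not guaranteed.

```python
import torch
import torch.nn as nn

gamma, alpha, n_actions, obs_dim = 0.99, 1e-3, 4, 8
Q_phi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

# One observed transition (s_i, a_i, s'_i, r_i); random placeholders here.
s_i, a_i = torch.randn(obs_dim), torch.tensor(2)
s_next, r_i = torch.randn(obs_dim), torch.tensor(1.0)

with torch.no_grad():
    y_i = r_i + gamma * Q_phi(s_next).max()        # step 2: the target, held fixed

q_sa = Q_phi(s_i)[a_i]                             # Q_phi(s_i, a_i)
td_error = (q_sa - y_i).detach()                   # (Q_phi(s_i, a_i) - y_i), treated as a constant
q_sa.backward()                                    # fills p.grad with dQ_phi(s_i, a_i)/dphi
with torch.no_grad():
    for p in Q_phi.parameters():                   # step 3: phi <- phi - alpha * dQ/dphi * TD error
        p -= alpha * td_error * p.grad
        p.grad = None
```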

Exploration and exploitation

If the actions we choose at the very beginning are bad, and we only ever take the action that maximizes the Q-value, we might get stuck in a bad loop.
To solve this, we need to take some exploratory actions.
epsilon-greedy
We take the action with the highest Q-value with probability $1-\epsilon$, that is:

$$\pi'(a_t|s_t) = \begin{cases} 1-\epsilon, & \text{if } a_t = \argmax_{a_t} Q_\phi(s_t, a_t) \\ \epsilon / (|\mathcal{A}|-1), & \text{otherwise} \end{cases}$$

So when the policy is good we can use a smaller $\epsilon$, and when the policy is bad a larger $\epsilon$.

Usually we start with a large $\epsilon$ and gradually decrease it.
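
A small sketch of epsilon-greedy action selection with a linear decay schedule (the schedule constants are arbitrary examples). Note this common implementation re-samples uniformly over all actions with probability $\epsilon$, a slight variant of the $\epsilon/(|\mathcal{A}|-1)$ formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With prob. epsilon pick a uniformly random action, otherwise the argmax action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))    # explore
    return int(np.argmax(q_values))                # exploit

def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```
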
Boltzmann exploration
Select actions in proportion to some positive transformation of the Q-values; the most popular transformation is $\exp$. With this scheme, the best actions are chosen most frequently while other actions still get explored.

$$\pi(a_t|s_t) \propto \exp(Q_\phi(s_t, a_t))$$
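
A small sketch of sampling from this distribution; a temperature parameter is added here as an extra knob (temperature 1 recovers the formula above exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(q_values, temperature=1.0):
    """Sample a_t with probability proportional to exp(Q(s_t, a_t) / temperature)."""
    logits = np.asarray(q_values, dtype=np.float64) / temperature
    logits -= logits.max()                         # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))
```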

Implement

Questions

What is dynamic programming here? Is it different from the DP used in coding contests?

Does Q-learning have high variance? Why?