cs285-lec7-value function

Summary

We don’t need to learn a policy explicitly; instead, we use the policy that chooses the action with the maximum Q-value at the current state.

For example, Q-learning, covered below in both its batch (fitted Q-iteration) and online forms.

Online Q-learning:

  1. take an action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$
  2. $y_i = r_i + \gamma \max_{a'} Q_\phi(s'_i, a')$
  3. $\phi \leftarrow \phi - \alpha \frac{\partial Q_\phi(s_i, a_i)}{\partial \phi}(Q_\phi(s_i, a_i) - y_i)$

This method is not guaranteed to converge.

Policy iteration algorithm

Since $A^\pi(s_t, a_t)$ denotes how much better $a_t$ is than the average action according to $\pi$, $\argmax_{a_t} A^\pi(s_t, a_t)$ means we take the best action from $s_t$ if we then follow $\pi$.

Choosing the best action from $s_t$, the new policy will be:

$$\pi'(a_t|s_t) = \begin{cases} 1, & \text{if } a_t = \argmax_{a_t} A^\pi(s_t, a_t) \\ 0, & \text{otherwise} \end{cases}$$

so we have:

  1. evaluate $A^\pi(s, a)$
  2. set $\pi \leftarrow \pi'$

Recall from actor-critic that $A^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$, so now we need to evaluate $V^\pi(s_t)$.

Tabular reinforcement learning

Assume we know the transition probabilities $p(s'|s,a)$, and that $s_t, a_t$ are both discrete. Then we can do tabular reinforcement learning: store $V^\pi(s_t)$ in a table and update it at every step by:

$$V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma E_{s' \sim p(s'|s, \pi(s))}[V^\pi(s')]$$
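
As a concrete illustration of this tabular backup (and of the policy improvement step from the previous section), here is a minimal NumPy sketch; the transition tensor `P`, reward table `R`, and deterministic policy array `pi` are hypothetical inputs assumed to be known:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Repeat V(s) <- r(s, pi(s)) + gamma * E_{s'~p(s'|s,pi(s))}[V(s')] until convergence.

    P:  (S, A, S) array with P[s, a, s'] = p(s'|s, a)   (assumed known)
    R:  (S, A) array with R[s, a] = r(s, a)
    pi: (S,) int array giving the deterministic action pi(s)
    """
    S = P.shape[0]
    states = np.arange(S)
    r_pi = R[states, pi]               # r(s, pi(s)) for every state
    P_pi = P[states, pi]               # p(s'|s, pi(s)), shape (S, S)
    V = np.zeros(S)
    for _ in range(max_iters):
        V_new = r_pi + gamma * P_pi @ V          # Bellman backup for the fixed policy
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V

def greedy_improvement(P, R, V, gamma=0.99):
    """pi'(s) = argmax_a [r(s, a) + gamma * E[V(s')]], the policy improvement step."""
    Q = R + gamma * np.einsum("ijk,k->ij", P, V)   # Q(s, a) table
    return np.argmax(Q, axis=1)
```

Alternating `policy_evaluation` and `greedy_improvement` is exactly the two-step policy iteration loop above.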

To simplify the above, we can use $Q$ in place of $V$, so the procedure will be:

  1. construct the Q-value table: $Q(s,a) = r(s,a) + \gamma E[V(s')]$
  2. $V(s) \leftarrow \max_a Q(s,a)$

Notice that step 2 uses the Q-values to update the value function, so by combining the two steps we can get rid of the value function entirely. The procedure becomes:

  1. update the Q-value table: $Q(s,a) \leftarrow r(s,a) + \gamma E[V(s')]$, where $E[V(s')]$ is the expected value of the next time step
  2. substitute $V(s') = \max_{a'} Q(s', a')$, i.e. the value is just the max of the Q-values, so the combined update is $Q(s,a) \leftarrow r(s,a) + \gamma E_{s'}[\max_{a'} Q(s', a')]$
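
Combining the two steps gives tabular Q-iteration; here is a minimal sketch using the same hypothetical `P` and `R` tables as in the sketch above:

```python
import numpy as np

def q_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Repeat Q(s,a) <- r(s,a) + gamma * E_{s'}[max_{a'} Q(s',a')]; no value table is kept."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        V = Q.max(axis=1)                                # max_{a'} Q(s',a') plays the role of V(s')
        Q_new = R + gamma * np.einsum("ijk,k->ij", P, V)
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q
```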

Fitted Q-iteration

When the number of states is large, it is impossible to construct a table, so we use a neural network instead.

We train a neural network that maps a state to an estimated value, $V_\phi : \mathcal{S} \rightarrow \mathbb{R}$, and optimize it with:

$$\mathcal{L}(\phi) = \frac{1}{2}\|V_\phi(s) - \max_a Q^\pi(s,a)\|^2$$

Note: sometimes the gradient might be too large, so we may switch to the Huber loss or use gradient clipping.
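
In PyTorch this could look like the snippet below (a sketch; the network size, batch, and clipping threshold are illustrative assumptions). `smooth_l1_loss` is PyTorch's Huber loss with threshold 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical value network and data, only to make the snippet self-contained.
V_phi = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(V_phi.parameters(), lr=1e-3)
states = torch.randn(32, 4)        # batch of states
targets = torch.randn(32, 1)       # max_a Q(s, a) targets, computed elsewhere

loss = F.smooth_l1_loss(V_phi(states), targets)  # Huber loss: quadratic near zero, linear in the tails
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(V_phi.parameters(), max_norm=10.0)  # gradient clipping
optimizer.step()
```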

The training procedure is:

  1. set $y_i \leftarrow \max_{a_i}\big(r(s_i, a_i) + \gamma E[V_\phi(s'_i)]\big)$
  2. set $\phi \leftarrow \argmin_\phi \frac{1}{2}\sum_i\|V_\phi(s_i) - y_i\|^2$

Notice we still need to know the transition probabilities to compute $E_{s' \sim p(s'|s,a)}$ in step 1. To avoid this, we approximate $E[V_\phi(s'_i)] \approx \max_{a'} Q_\phi(s'_i, a')$ and fit $Q_\phi$ instead.

This Q-learning approach works with off-policy samples and uses only one network, with no need for a policy gradient. But it is not guaranteed to converge.

The full fitted Q-iteration algorithm is:

  1. collect a dataset $\{(s_i, a_i, s'_i, r_i)\}$ using some policy
  2. set $y_i \leftarrow r_i + \gamma \max_{a'} Q_\phi(s'_i, a')$
  3. set $\phi \leftarrow \argmin_\phi \frac{1}{2}\sum_i\|Q_\phi(s_i, a_i) - y_i\|^2$, repeating steps 2–3 several times before collecting more data
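
A minimal PyTorch sketch of this loop, assuming a small discrete action space and a random placeholder dataset of off-policy transitions (network size, dataset, and hyperparameters are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

gamma, n_actions, obs_dim = 0.99, 4, 8
Q_phi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(Q_phi.parameters(), lr=1e-3)

# Placeholder off-policy dataset of transitions (s_i, a_i, s'_i, r_i), collected by any policy.
N = 1024
s = torch.randn(N, obs_dim)
a = torch.randint(n_actions, (N,))
s_next = torch.randn(N, obs_dim)
r = torch.randn(N)

for k in range(100):                               # several gradient steps on a fixed dataset
    with torch.no_grad():                          # the target y_i is held fixed, not differentiated
        y = r + gamma * Q_phi(s_next).max(dim=1).values
    q_sa = Q_phi(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_phi(s_i, a_i)
    loss = 0.5 * (q_sa - y).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice steps 2–3 would alternate with collecting new data (step 1).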

Q-learning

The online Q-learning algorithm is:

  1. take an action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$
  2. $y_i = r_i + \gamma \max_{a'} Q_\phi(s'_i, a')$
  3. $\phi \leftarrow \phi - \alpha \frac{\partial Q_\phi(s_i, a_i)}{\partial \phi}\underbrace{(Q_\phi(s_i, a_i) - y_i)}_{\text{TD error}}$
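
A single update step could look like the sketch below (network and hyperparameters are illustrative). The target $y_i$ also depends on $\phi$ but is not differentiated through; the TD error only multiplies the gradient of $Q_\phi(s_i, a_i)$, so this is not true gradient descent, which is one reason convergence is not guaranteed.

```python
import torch
import torch.nn as nn

gamma, alpha, n_actions, obs_dim = 0.99, 1e-3, 4, 8
Q_phi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

# One observed transition (s_i, a_i, s'_i, r_i); random placeholders here.
s_i, a_i = torch.randn(obs_dim), torch.tensor(2)
s_next, r_i = torch.randn(obs_dim), torch.tensor(1.0)

with torch.no_grad():
    y_i = r_i + gamma * Q_phi(s_next).max()        # step 2: the target, held fixed

q_sa = Q_phi(s_i)[a_i]                             # Q_phi(s_i, a_i)
td_error = (q_sa - y_i).detach()                   # (Q_phi(s_i, a_i) - y_i), treated as a constant
q_sa.backward()                                    # fills p.grad with dQ_phi(s_i, a_i)/dphi
with torch.no_grad():
    for p in Q_phi.parameters():                   # step 3: phi <- phi - alpha * dQ/dphi * TD error
        p -= alpha * td_error * p.grad
        p.grad = None
```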

Exploration and exploitation

If the actions we choose at the very beginning are bad, and we only ever take the action that maximizes the Q-value, we might get stuck in a bad loop.
To solve this, we need to take some exploratory actions.
epsilon-greedy
We take the action with the highest Q-value with probability $1-\epsilon$, that is:

$$\pi'(a_t|s_t) = \begin{cases} 1-\epsilon, & \text{if } a_t = \argmax_{a_t} Q_\phi(s_t, a_t) \\ \epsilon / (|\mathcal{A}|-1), & \text{otherwise} \end{cases}$$

So when the policy is good we can use a smaller $\epsilon$, and when the policy is bad a larger $\epsilon$.

Usually we start with a large $\epsilon$ and gradually decrease it.
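
A small sketch of epsilon-greedy action selection with a linear decay schedule (the schedule constants are arbitrary examples). Note this common implementation re-samples uniformly over all actions with probability $\epsilon$, a slight variant of the $\epsilon/(|\mathcal{A}|-1)$ formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With prob. epsilon pick a uniformly random action, otherwise the argmax action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))    # explore
    return int(np.argmax(q_values))                # exploit

def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```
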
Boltzmann exploration
Select actions in proportion to some positive transformation of the Q-values; the most popular transformation is $\exp$. With this scheme, the best actions are chosen most frequently while other actions still get explored.

$$\pi(a_t|s_t) \propto \exp(Q_\phi(s_t, a_t))$$
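
A small sketch of sampling from this distribution; a temperature parameter is added here as an extra knob (temperature 1 recovers the formula above exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(q_values, temperature=1.0):
    """Sample a_t with probability proportional to exp(Q(s_t, a_t) / temperature)."""
    logits = np.asarray(q_values, dtype=np.float64) / temperature
    logits -= logits.max()                         # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))
```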

Implement

Questions

What is dynamic programming here? Is it different from the DP used in coding contests?

Does Q-learning have high variance? Why?