cs285-lec6-actor critic

Summary

  1. sample $\{ s_i, a_i \}$ from $\pi_\theta(a|s)$ (run the current policy to collect data)
  2. fit $\hat{V}_\phi(s)$ to the bootstrapped targets $\{ (s_{i, t}, \underbrace{r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})}_{y_{i,t}}) \}$
  3. evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)$
  4. $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \nabla_\theta \log \pi_\theta(a_i | s_i) \hat{A}^\pi(s_i, a_i)$
  5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Recap: policy gradient

The policy gradient estimator is:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i = 1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \hat{Q}_{i,t}$$

notes: $\hat{Q}_{i,t}$ is the reward-to-go, the estimated expected reward if the agent takes $a_{i,t}$ at $s_{i,t}$, that is:

$$\hat{Q}_{i,t} \approx \sum_{t' = t}^T r(s_{i,t'}, a_{i,t'})$$
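
As a quick illustration, here is a minimal NumPy sketch of this single-sample reward-to-go computation (the function name and array layout are my own, not from the lecture):

```python
import numpy as np

def reward_to_go(rewards):
    """Given per-step rewards r_t of one trajectory,
    return Q_hat[t] = sum_{t'=t}^{T} r_{t'} for every t."""
    q_hat = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    # Accumulate from the end of the trajectory backwards.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        q_hat[t] = running
    return q_hat

# Example: a length-4 trajectory.
print(reward_to_go(np.array([1.0, 0.0, 2.0, 1.0])))  # [4. 3. 3. 1.]
```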

A better estimate than this single-sample one is the expectation:

$$\hat{Q}_{i,t} \approx \sum_{t' = t}^T E_{\pi_\theta}[r(s_{t'}, a_{t'})|s_t, a_t]$$

Using the expected estimate of $\hat{Q}_{i,t}$ instead of a single sample also reduces variance.

To further reduce variance, subtract a baseline from $Q$. This baseline can be the average of $Q$ over the batch:

$$b_t = \frac{1}{N}\sum_i Q(s_{i,t}, a_{i,t})$$

So the gradient estimator becomes:

$$\begin{aligned} \nabla_\theta J(\theta) &\approx \frac{1}{N}\sum_{i = 1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \hat{Q}_{i,t} \\ &\approx \frac{1}{N}\sum_{i = 1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \Big( \sum_{t'=t}^T r(s_{i, t'}, a_{i, t'}) - b_t \Big) \end{aligned}$$
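
A minimal PyTorch sketch of this baselined estimator, written as a surrogate loss whose gradient matches the expression above (tensor names and shapes are my assumptions):

```python
import torch

def policy_gradient_loss(log_probs, q_hat):
    """log_probs: [N, T] values of log pi_theta(a_{i,t}|s_{i,t}); q_hat: [N, T] reward-to-go.
    Minimizing this surrogate ascends the policy-gradient objective
    (the mean over N*T rescales the gradient by 1/T, which only changes the step size)."""
    # Per-timestep baseline b_t: average reward-to-go across the batch.
    baseline = q_hat.mean(dim=0, keepdim=True)        # shape [1, T]
    advantages = (q_hat - baseline).detach()          # no gradient through the weights
    return -(log_probs * advantages).mean()
```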

Because the policy gradient is computed from single-sample returns, it is unbiased but has high variance.

To deal with this high variance, we introduce Actor-Critic.

Terminology Definition

Consider the state-action value function, i.e. the total expected reward from taking $a_t$ at $s_t$:

$$Q^\pi(s_t, a_t) = \sum_{t' = t}^T E_{\pi_\theta}[r(s_{t'}, a_{t'})|s_t, a_t]$$

And the state value function, i.e. the expected total reward from $s_t$, averaged over actions:

$$V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t|s_t)}[Q^\pi(s_t, a_t)]$$

We can then use the advantage $A^\pi$ to denote how much better $a_t$ is than the average action under $\pi$:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

And the gradient estimator becomes:

$$\begin{aligned} \nabla_\theta J(\theta) &\approx \frac{1}{N}\sum_{i = 1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \hat{Q}_{i,t} \\ &\approx \frac{1}{N}\sum_{i = 1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) A^\pi(s_{i,t}, a_{i,t}) \end{aligned}$$

which one to fit?

We now have three candidates to fit: $Q$, $V$, or $A$.

For $Q$, we can rewrite it as:

$$\begin{aligned} Q^\pi(s_t, a_t) &= \sum_{t' = t}^T E_{\pi_\theta}[r(s_{t'}, a_{t'})|s_t, a_t] \\ &= r(s_t, a_t) + \sum_{t' = t+1}^T E_{\pi_\theta}[r(s_{t'}, a_{t'})|s_t, a_t] \\ &= r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)}[V^\pi(s_{t+1})] \\ &\approx r(s_t, a_t) + V^\pi(s_{t+1}) \end{aligned}$$

Then we get:

$$\begin{aligned} A^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t) \\ &\approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t) \end{aligned}$$

So we choose to fit $V^\pi(s)$: it depends only on the state, so it is the easiest of the three to learn, and both $Q$ and $A$ can be approximated from it.

policy evaluation: fit $V^\pi(s)$

To evaluate the policy, we can use a Monte Carlo estimate of the expected value, just as the policy gradient does:

$$\begin{aligned} V^\pi(s_t) &= E_{a_t \sim \pi_\theta(a_t|s_t)}[Q^\pi(s_t, a_t)] \\ &\approx \sum_{t'=t}^T r(s_{t'}, a_{t'}) \\ &\approx \frac{1}{N}\sum_{n=1}^N\sum_{t'=t}^T r(s_{t'}, a_{t'}) \end{aligned}$$

Ideally, we would average over many rollouts starting from the same $s_t$, but we usually cannot reset the simulator to an arbitrary state; in general we can only start from the initial state.

However, if we fit $V^\pi(s_t)$ with a neural network on single-trajectory targets, it will still generalize across nearby states (generalization in ML).

training set: $\{ (s_{i, t}, \underbrace{\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})}_{y_{i,t}} ) \}$, i.e. a single-sample estimate of $V^\pi(s_{i,t})$.
This $y_{i,t}$ is a Monte Carlo approximation; the ideal target would be:

$$\begin{aligned} y_{i,t} &= V^\pi(s_{i,t}) = \sum_{t' = t}^T E_{\pi_\theta}[r(s_{t'}, a_{t'})|s_{i,t}] \\ &\approx r(s_{i,t}, a_{i,t}) + \sum_{t' = t+1}^T E_{\pi_\theta}[r(s_{t'}, a_{t'})|s_{i,t}, a_{i,t}] \\ &= r(s_{i,t}, a_{i,t}) + E_{s_{t+1} \sim p(s_{t+1}|s_{i,t}, a_{i,t})}[V^\pi(s_{t+1})] \\ &\approx r(s_{i,t}, a_{i,t}) + V^\pi(s_{i,t+1}) \end{aligned}$$

Since we don't know $V^\pi(s_{i,t+1})$, we plug in the previously fitted network $\hat{V}^\pi_\phi(s_{i,t+1})$.

Thus the training set is: $\{ (s_{i, t}, \underbrace{r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})}_{y_{i,t}}) \}$

We then optimize with a supervised regression loss:

$$\mathcal{L}(\phi) = \frac{1}{2}\sum_i \big\| \hat{V}^\pi_\phi(s_i) - y_i \big\|^2$$

This "bootstrapped" target has lower variance, because it replaces the sampled future return with the estimate $\hat{V}^\pi_\phi(s_{i,t+1})$, and higher bias, because $\hat{V}^\pi_\phi$ might be incorrect.

discount factors

If $T$ (the episode length) is $\infty$, $\hat{V}^\pi_\phi(s)$ can become infinitely large in many cases, so we introduce a discount factor $\gamma$ into $y_{i,t}$:

$$y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})$$
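
A minimal PyTorch sketch of this critic regression step, assuming a value network `v_net` and batched tensors (the names are illustrative, not from the lecture):

```python
import torch
import torch.nn.functional as F

def critic_loss(v_net, states, rewards, next_states, gamma=0.99):
    """Bootstrapped regression target y = r + gamma * V(s'); fit V(s) to it."""
    with torch.no_grad():
        # The target is treated as a constant: no gradient flows through V(s').
        targets = rewards + gamma * v_net(next_states).squeeze(-1)
    values = v_net(states).squeeze(-1)
    return 0.5 * F.mse_loss(values, targets)
```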

actor-critic algorithm

batch actor-critic algorithm (with discount factor):

  1. sample $\{ s_i, a_i \}$ from $\pi_\theta(a|s)$ (run the current policy to collect data)
  2. fit $\hat{V}_\phi(s)$ to sampled reward sums.
  3. evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)$
  4. $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \nabla_\theta \log \pi_\theta(a_i | s_i) \hat{A}^\pi(s_i, a_i)$
  5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
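
A sketch of steps 2-5 as a single update function, assuming an `actor` network that returns a `torch.distributions` object, a `critic` network that returns $V(s)$, and pre-collected flat tensors of transitions; this is an illustration of the loop above, not reference code:

```python
import torch

def batch_actor_critic_update(actor, critic, actor_opt, critic_opt,
                              states, actions, rewards, next_states, gamma=0.99):
    """One batch actor-critic update over flattened transitions (s, a, r, s')."""
    # 2. fit the critic to bootstrapped targets y = r + gamma * V(s')
    with torch.no_grad():
        targets = rewards + gamma * critic(next_states).squeeze(-1)
    v_loss = 0.5 * (critic(states).squeeze(-1) - targets).pow(2).mean()
    critic_opt.zero_grad()
    v_loss.backward()
    critic_opt.step()

    # 3. evaluate advantages A = r + gamma * V(s') - V(s)
    with torch.no_grad():
        advantages = rewards + gamma * critic(next_states).squeeze(-1) - critic(states).squeeze(-1)

    # 4-5. policy gradient step weighted by the advantage
    log_probs = actor(states).log_prob(actions)   # actor(s) is assumed to return a distribution
    pi_loss = -(log_probs * advantages).mean()
    actor_opt.zero_grad()
    pi_loss.backward()
    actor_opt.step()
```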

online actor-critic:

  1. take action $a \sim \pi_\theta(a|s)$, get $(s, a, s', r)$, store it in $\mathcal{R}$
  2. sample a batch $\{s_i, a_i, r_i, s_i'\}$ from the buffer $\mathcal{R}$
  3. update $\hat{Q}^\pi_\phi$ using targets $y_i = r_i + \gamma \hat{Q}^\pi_\phi(s_i', a_i')$ with $a_i' \sim \pi_\theta(a'|s_i')$, for each $s_i, a_i$
  4. $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \nabla_\theta \log \pi_\theta(a_i^\pi | s_i) \hat{Q}^\pi_\phi(s_i, a_i^\pi)$ with $a_i^\pi \sim \pi_\theta(a|s_i)$
  5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

In this online, replay-buffer version we need to fit $Q(s,a)$ instead of $V(s)$: the actions stored in the buffer were taken by older policies, so value targets computed from them would not reflect the current policy. By fitting $Q(s,a)$, we can form the targets and the policy-gradient term with actions freshly sampled from the current policy ($a_i'$ and $a_i^\pi$ above).
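
A sketch of the corresponding losses, assuming `actor(s)` returns a `torch.distributions` object and `q_net(s, a)` returns one scalar per row; all names are illustrative:

```python
import torch

def off_policy_ac_losses(actor, q_net, states, actions, rewards, next_states, gamma=0.99):
    """Critic and actor losses for the replay-buffer actor-critic above."""
    # Critic target: y = r + gamma * Q(s', a') with a' drawn from the *current* policy.
    with torch.no_grad():
        next_actions = actor(next_states).sample()
        targets = rewards + gamma * q_net(next_states, next_actions).squeeze(-1)
    q_loss = 0.5 * (q_net(states, actions).squeeze(-1) - targets).pow(2).mean()

    # Actor term: reweight log pi(a^pi | s) by Q(s, a^pi), with a^pi also from the current policy.
    dist = actor(states)
    new_actions = dist.sample()
    pi_loss = -(dist.log_prob(new_actions)
                * q_net(states, new_actions).squeeze(-1).detach()).mean()
    return q_loss, pi_loss
```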

Two separate networks for actor and critic
In deep learning, we can train two separate networks for the value function and the policy. This is simple and stable, but the downside is that the actor and critic share no features.

Or we could train a shared trunk with two different heads.
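
A minimal PyTorch sketch of such a shared trunk with a policy head and a value head (layer sizes, the discrete-action assumption, and all names are illustrative):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One shared trunk; a policy head producing action logits and a value head producing V(s)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # logits for a discrete policy
        self.value_head = nn.Linear(hidden, 1)           # scalar state value

    def forward(self, obs):
        features = self.trunk(obs)
        dist = torch.distributions.Categorical(logits=self.policy_head(features))
        value = self.value_head(features).squeeze(-1)
        return dist, value
```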

how to generate a batch?

In deep learning we usually want to apply the update $\hat{V}^\pi_\phi(s) \leftarrow r + \gamma \hat{V}^\pi_\phi(s')$ on a batch of data rather than a single transition. So how do we generate the batch?

synchronized parallel actor-critic
run multiple simulators at the same time; this can be too expensive, since a batch size of $n$ requires running $n$ simulators
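
A minimal sketch of this synchronized scheme, assuming a list of classic Gym-style environments (`step` returning observation, reward, done, info; `reset` returning an observation) and a `policy` callable; all names are mine:

```python
def collect_synchronized_batch(envs, observations, policy):
    """Step n simulators in lockstep with the same (current) policy to form one batch
    of transitions; every environment waits for the others before the next policy update."""
    batch = []
    next_observations = []
    for env, obs in zip(envs, observations):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)   # classic Gym-style interface (assumed)
        batch.append((obs, action, reward, next_obs, done))
        next_observations.append(env.reset() if done else next_obs)
    return batch, next_observations
```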

asynchronous parallel actor-critic
each worker pulls the latest parameters, collects data, and applies its update on its own schedule. The downside is that some updates are computed with slightly stale policies, but this is usually acceptable because those policies are not too old.

off-policy actor-critic

use a replay buffer, as in the online algorithm above

Eligibility traces & n-step returns

We can view actor-critic as computing $\hat{A}^\pi(s_{i,t}, a_{i,t}) = r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1}) - \hat{V}^\pi_\phi(s_{i,t})$ using only the reward at the current step, while the (baselined) policy gradient uses rewards from all remaining steps: $\hat{A}^\pi(s_{i,t}, a_{i,t}) = \sum_{t' = t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i, t'}) - \hat{V}^\pi_\phi(s_{i,t})$.

As mentioned above, actor-critic has lower variance but higher bias (whenever the fitted value is wrong, which is very common), while the Monte Carlo policy gradient is unbiased but has high variance (because it relies on single-sample returns).

So we can look for a middle ground between the two, using rewards from the next $n$ steps to compute $\hat{A}^\pi(s_t, a_t)$:

$$\hat{A}^\pi_n(s_t, a_t) = \sum_{t' = t}^{t+n} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_{t}) + \gamma^n \hat{V}^\pi_\phi(s_{t+n})$$
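
A sketch of this n-step estimator for one trajectory, using the common convention of summing the first $n$ rewards and bootstrapping with $\gamma^n \hat{V}(s_{t+n})$; array names are mine, and episodes are assumed to terminate at $T$ (no bootstrap past the end):

```python
import numpy as np

def n_step_advantages(rewards, values, gamma=0.99, n=5):
    """n-step advantage estimates along one trajectory.
    rewards[t] = r(s_t, a_t) and values[t] = V_hat(s_t), both of length T."""
    T = len(rewards)
    advantages = np.zeros(T)
    for t in range(T):
        end = min(t + n, T)                        # truncate at the end of the trajectory
        discounts = gamma ** np.arange(end - t)
        returns = np.sum(discounts * rewards[t:end])
        if end < T:                                # bootstrap with V_hat(s_{t+n}) if not terminal
            returns += gamma ** (end - t) * values[end]
        advantages[t] = returns - values[t]
    return advantages
```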

Moreover, we can take a weighted combination of all n-step returns, with exponentially decaying weights $w_n \propto \lambda^{n-1}$; this is Generalized Advantage Estimation (GAE):

$$\begin{aligned} \hat{A}^\pi_{GAE}(s_t, a_t) &= \sum_{n=1}^\infty w_n \hat{A}^\pi_n(s_t, a_t) \\ &= \sum_{t'=t}^\infty (\gamma \lambda)^{t'-t} \underbrace{\Big(r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})\Big)}_{\delta_{t'}} \end{aligned}$$

The $\gamma \lambda$ factor plays a role similar to a discount factor: smaller $\lambda$ down-weights distant rewards and reduces variance (at the cost of more bias).
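
A sketch of the standard backward-recursion implementation of GAE, using $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$; array names are mine, and `values` is assumed to include a final entry $\hat{V}(s_T)$:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    rewards has length T; values has length T+1 (includes V_hat(s_T))."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # The recursion A_t = delta_t + gamma * lambda * A_{t+1}, computed backwards.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```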

Implementation

The implementation idea is simple: use an actor_net that takes a state as input and outputs an action, and a critic_net that takes a state and outputs a single value (dim = 1).
To optimize actor_net, maximize the critic-derived score for the sampled obs and action (i.e. the action's log-probability weighted by the critic's advantage estimate).
To optimize critic_net, minimize the regression loss between critic_net's outputs for the states and the actual returns observed from the environment.
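
A minimal PyTorch sketch of this idea; the network sizes (4-dimensional observation, 2 discrete actions) and all names are purely illustrative:

```python
import torch
import torch.nn as nn

# actor_net: state -> action logits; critic_net: state -> scalar value (dim = 1).
actor_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
critic_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

def policy_loss(states, actions, advantages):
    # Maximize log pi(a|s) weighted by the critic-derived advantage (held constant here).
    dist = torch.distributions.Categorical(logits=actor_net(states))
    return -(dist.log_prob(actions) * advantages.detach()).mean()

def value_loss(states, returns):
    # Regress critic_net(s) toward the returns / bootstrapped targets from the environment.
    return 0.5 * (critic_net(states).squeeze(-1) - returns).pow(2).mean()
```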