

Aug 16, 2025, 1 min read

Reward-to-go

The sum of rewards from the current time step to the end of the trajectory. Learned from Spinning Up, which presents the reward-to-go as the target for learning the value function; that value function is then used to compute the Generalized Advantage Estimate.
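For context, the GAE paper (backlinked below) defines the advantage estimate in terms of TD residuals of that learned value function $V$:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$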

  • https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#id16

    $\hat{R}_t = \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1})$

    Where is the discount factor? I think in practice a discount factor $\gamma$ is actually used, i.e. $\hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t} R(s_{t'}, a_{t'}, s_{t'+1})$ (see the sketch below).
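A minimal sketch of computing the discounted reward-to-go using the standard backward recursion $R_t = r_t + \gamma R_{t+1}$. This is a generic illustration, not Spinning Up's actual implementation; the function name and the `gamma` default are my own.

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go at each timestep.

    R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'},
    computed backward via R_t = r_t + gamma * R_{t+1}.
    """
    rtg = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Constant reward of 1 over 4 steps with gamma = 0.5:
# [1 + 0.5 + 0.25 + 0.125, 1 + 0.5 + 0.25, 1 + 0.5, 1]
print(reward_to_go([1.0, 1.0, 1.0, 1.0], gamma=0.5))  # [1.875 1.75 1.5 1.]
```

Setting `gamma=1.0` recovers the undiscounted sum from the Spinning Up formula above.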


Backlinks

  • High-Dimensional Continuous Control Using Generalized Advantage Estimation
