Return
- The return $G_t$ is the total discounted reward from time-step $t$: $G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- The discount $\gamma \in [0, 1]$ is the present value of future rewards
- The value of receiving reward $R$ after $k + 1$ time-steps is $\gamma^k R$
- This values immediate reward above delayed reward:
- $\gamma$ close to 0 leads to "myopic" evaluation
- $\gamma$ close to 1 leads to "far-sighted" evaluation (see the sketch below)
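To make the myopic vs. far-sighted point concrete, here's a minimal sketch in Python (the reward sequence and the `discounted_return` helper are my own, just for illustration):

```python
# Sketch: compute G_0 = sum_k gamma^k * R_{k+1} for a fixed reward list.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]  # a single delayed reward of 10, three steps away

print(discounted_return(rewards, gamma=0.1))   # 0.01  -> "myopic": delayed reward barely counts
print(discounted_return(rewards, gamma=0.99))  # ~9.70 -> "far-sighted": delayed reward keeps most of its value
```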
Closely related to the Value Function.
The word "Expected"
I was having a lot of trouble understanding why it is "expected return", and not just return. I mean, how do you embed this idea of expected in the math? Update: ahh, see Expected Value.
The value function gives the expected return: the higher the value, the greater the expected return. The value function depends on the policy you choose. The better the policy, the greater the values of the value function, because we get greater expected returns (i.e. greater total rewards).
But this doesn’t answer the question, why expected return, not just return?
- Because the environment can be stochastic. So we care about finding the policy that is best on average, not based on a single instance, because any single return might vary (see the sketch below)
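A toy sketch of why "on average" matters (the environment below is made up: each step of a fixed policy succeeds with probability 0.8 and pays reward 1). A single rollout's return bounces around, but averaging many rollouts recovers the expected return:

```python
import random

def run_episode():
    # One rollout of a fixed policy in a stochastic environment:
    # each of 10 steps succeeds with probability 0.8 and pays reward 1.
    g = 0.0
    for _ in range(10):
        if random.random() < 0.8:
            g += 1.0
    return g

print(run_episode())  # a single return: varies, e.g. 7.0 one run, 9.0 the next
print(sum(run_episode() for _ in range(10_000)) / 10_000)  # ~8.0, the expected return of this policy
```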
Well, how do we even determine the value function? There are two ways of solving for it.
If you solve it iteratively (using dynamic programming), you start off with random values, so your expected return estimates are going to be wrong. Then, over time, as you experience rewards, your expected values change.
Thanks to each reward you observe, the Value Function becomes more and more accurate (see the sketch below).
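Here's roughly what that looks like on a made-up 4-state chain (a sketch of iterative policy evaluation; the MDP, the reward of 1 per step, and γ = 0.9 are all assumptions for illustration):

```python
import random

GAMMA = 0.9
N = 4                                    # states 0..3; state 3 is terminal
V = [random.random() for _ in range(N)]  # start off with random values
V[N - 1] = 0.0                           # terminal state is worth 0

for sweep in range(100):
    for s in range(N - 1):
        # Fixed policy: move from s to s+1 and receive reward 1.
        # Bellman backup: V(s) <- r + gamma * V(s').
        V[s] = 1.0 + GAMMA * V[s + 1]

print(V)  # ~[2.71, 1.9, 1.0, 0.0]; each sweep makes the estimates more accurate
```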
Expected return = what you expect to get in rewards, cumulatively.
OH I know, the return is given after stepping through the rewards, right?
So expected return is like the expected total rewards.
It's like if you play Flappy Bird: your return is not actually, say, 1000 points unless you play the game until you score 1000 points (where you get a reward of 1 for each pipe you pass through; I'm disregarding the Discount Factor here). But you expect the return to be > 1000 if you have a policy that beats Flappy Bird (so your expected return would be a very big number because you can't lose!!).
- Also, I think because rewards can follow distributions (they are not fixed values), you want to use the expected value instead of a single realized value.
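For example (a made-up distribution, just to illustrate): if a reward is 0 or 10 with equal probability, a single realized reward is never 5, but the expected value is a stable 5.

```python
import random

outcomes = [(0.0, 0.5), (10.0, 0.5)]  # (reward, probability) pairs

expected_reward = sum(r * p for r, p in outcomes)  # E[R] = 5.0, always
sampled_reward = random.choices(
    [r for r, _ in outcomes],
    weights=[p for _, p in outcomes],
)[0]                                               # 0.0 or 10.0, varies per run

print(expected_reward)  # 5.0
print(sampled_reward)   # 0.0 or 10.0 depending on the draw
```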