Reward
A reward signal defines the goal of a Reinforcement Learning problem.
- A reward is a scalar feedback signal
- Indicates how well the agent is doing at step $t$
- The agent’s job is to maximize cumulative reward
Reinforcement learning is based on the reward hypothesis.
Reward Hypothesis
All goals can be described by the maximisation of expected cumulative reward.
Danger
Don’t use the reward to impart prior knowledge to the agent about how to achieve the real goal (ex: rewarding a chess-playing agent for taking pieces), because the agent might learn to do that without achieving the real goal (winning the game).
The agent’s goal is to maximize the cumulative reward in the long run, defined as the expected return.
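Concretely, in standard Sutton & Barto notation (assuming a discounted setting, which the note doesn’t state explicitly), the return from step $t$ is

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1,
$$

and the agent’s objective is to maximize $\mathbb{E}[G_t]$.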
In the Multi-Armed Bandit context, I see the reward being written as $R_t$, the reward received after selecting action $A_t$.
The reward function is generally defined as $R_a(s, s')$.
It is the reward function that determines the immediate reward received by the agent for a transition from $s$ to $s'$ under action $a$. For example, see Value Iteration.
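As a concrete example, here is what $R_a(s, s')$ might look like for a tiny gridworld, the kind of input Value Iteration consumes. The states, action names, and values below are all hypothetical, just to make the signature concrete:

```python
# Hypothetical 1-D gridworld: states 0..3, where state 3 is the goal.
# reward(s, a, s_next) plays the role of R_a(s, s') above.
def reward(s, a, s_next):
    if s_next == 3 and s != 3:
        return 1.0    # reaching the goal pays off
    return -0.01      # small step cost elsewhere, to encourage short paths

# Value Iteration would back this up via R_a(s, s') + gamma * V(s').
print(reward(2, "right", 3))   # 1.0
print(reward(1, "left", 0))    # -0.01
```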
Thoughts on Reward
Is reward enough?
Manually designing reward functions is tedious and does not allow for a general-purpose robot. This is why there have been advances in Imitation Learning, which doesn’t require RL, such as Diffusion Policy.
Confusion
I remember this being defined as a confusing topic:
- When are we supposed to receive the reward? As soon as we arrive in a new state, or at the action level? In the Bellman backup, the reward comes right after choosing an action.
So actually, I am no longer confused. I got confused after going over Lecture 2 from Stanford CS234, 42:36.
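For reference, the Bellman (optimality) backup used in Value Iteration, written with the reward on the state-action pair (some texts instead put $R_a(s, s')$ inside the sum, as above):

$$
V_{k+1}(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s') \Big]
$$

The reward term is attached to the chosen action, before the expectation over next states, which matches the "right after choosing an action" reading.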
Reward Engineering
Since I never got any hands-on experience, I never really had to think about this problem. But now that I am starting to work on the AWS DeepRacer, I do: https://medium.com/@BonsaiAI/deep-reinforcement-learning-models-tips-tricks-for-writing-reward-functions-a84fe525e8e0
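To make that concrete, here is a minimal sketch of a DeepRacer reward function, based on the commonly cited "follow the center line" example. The `params` keys used here (`track_width`, `distance_from_center`, `all_wheels_on_track`) are the DeepRacer input parameters as I recall them, so verify against the console docs before reusing this:

```python
def reward_function(params):
    """Sketch: reward the car for staying close to the center line."""
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]
    all_wheels_on_track = params["all_wheels_on_track"]

    if not all_wheels_on_track:
        return 1e-3  # near-zero reward when any wheel is off the track

    # Reward bands: the closer to the center line, the higher the reward.
    if distance_from_center <= 0.1 * track_width:
        reward = 1.0
    elif distance_from_center <= 0.25 * track_width:
        reward = 0.5
    elif distance_from_center <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3  # probably about to go off track

    return float(reward)
```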
If you don’t design your reward carefully, the agent might do some really stupid things. For instance, if you design a cleaning robot and give it a reward every time it sucks up dirt:
- Then it can just dump the dirt back out and suck it up again (a form of reward hacking)
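A toy sketch of why that reward is exploitable (everything here is hypothetical, just to make the failure mode concrete): rewarding the act of sucking lets a dump-and-resuck loop earn unbounded reward, while rewarding net dirt removed does not.

```python
def reward_per_suck(events):
    """Naive reward: +1 every time dirt is sucked up, no matter where it came from."""
    return sum(1 for e in events if e == "suck")

def reward_net_dirt_removed(initial_dirt, remaining_dirt):
    """Better-aligned reward: only dirt actually removed from the environment counts."""
    return initial_dirt - remaining_dirt

# A policy that dumps the same dirt and sucks it up again, forever:
exploit = ["suck", "dump", "suck", "dump", "suck"]
print(reward_per_suck(exploit))                                   # 3, and it grows without bound
print(reward_net_dirt_removed(initial_dirt=5, remaining_dirt=4))  # 1, no matter how often it re-sucks
```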
Stack Exchange thread: https://ai.stackexchange.com/questions/22851/what-are-some-best-practices-when-trying-to-design-a-reward-function
Personal Reflections
Maybe the reason humans behave differently is that each of us is born with a biologically different reward function.