Reward

A reward signal defines the goal of a Reinforcement Learning problem.

  • A reward is a scalar feedback signal
  • Indicates how well the agent is doing at step $t$
  • The agent’s job is to maximize cumulative reward

Reinforcement learning is based on the reward hypothesis.

Reward Hypothesis

All goals can be described by the maximisation of expected cumulative reward.

Danger

Don’t use the reward to impart prior knowledge to the agent about how to achieve the real goal (e.g., rewarding a chess-playing agent for taking pieces), because the agent may then maximize that proxy without ever achieving the real goal (winning the game).

The agent’s goal is to maximize the cumulative reward in the long run, defined as the expected return.
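Concretely, in the standard Sutton & Barto notation with discount factor $\gamma$, the return at time $t$ is the discounted sum of all future rewards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1$$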

In the Multi-Armed Bandit context, I see the reward being written as $R_t$.
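In Sutton & Barto's bandit notation, for example, the value of an arm is the expected reward given that the arm is selected:

$$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$$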

The reward function is generally defined as

$$R_i : S \times A \times S \to \mathbb{R}$$

It is the reward function that determines the immediate reward received by agent $i$ for a transition from $s$ to $s'$. For example, see Value Iteration.
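A minimal sketch of this in code (a made-up 1-D gridworld; the grid size, actions, and reward values are all my own toy assumptions):

```python
# Toy reward function R(s, a, s') for a 1-D gridworld.
# States are integers 0..4; state 4 is the goal. Actions: -1 (left), +1 (right).
# All specifics here (grid size, reward values) are illustrative assumptions.

GOAL = 4

def reward(s: int, a: int, s_next: int) -> float:
    """Immediate reward for transitioning from s to s_next via action a."""
    if s_next == GOAL:
        return 1.0   # reaching the goal pays off
    return -0.01     # small living cost to encourage short paths

# Example: stepping right from state 3 lands in the goal state.
print(reward(3, +1, 4))   # -> 1.0
print(reward(1, -1, 0))   # -> -0.01
```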

Confusion

I remember finding this a confusing topic: when are we supposed to receive the reward? As soon as we arrive in the new state, or at the action level? In the Bellman backup, the reward comes right after choosing an action.

So actually, I am no longer confused; this cleared up after going over Lecture 2 from Stanford CS234, 42:36.
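Writing out the Bellman backup makes the timing explicit: the reward is attached to the transition caused by the action, and the discounted value of the landing state is added on top:

$$V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]$$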

Reward Engineering

Since I never got any hands-on experience, I never really had to think about this problem. But now that I am starting to work on the AWS DeepRacer, I have to. A useful reference: https://medium.com/@BonsaiAI/deep-reinforcement-learning-models-tips-tricks-for-writing-reward-functions-a84fe525e8e0
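As a starting point, here is a minimal center-line-following reward function in the DeepRacer style. It uses the `track_width` and `distance_from_center` keys of the `params` dict that the AWS examples use; the marker thresholds are my own choices:

```python
def reward_function(params):
    """Reward the agent for staying close to the center line."""
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Markers at increasing distances from the center line.
    marker_1 = 0.10 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.50 * track_width

    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely about to go off track

    return float(reward)

# Example call with a hand-made params dict:
print(reward_function({'track_width': 1.0, 'distance_from_center': 0.2}))  # -> 0.5
```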

If you don’t design your reward carefully, the agent might do some really stupid things. For instance, if you design a cleaning robot and give it a reward every time it sucks up dirt:

  • Then it can just dump the dirt back out and suck it up again, forever (a classic example of a badly designed reward; see the sketch below)
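A toy illustration of why that reward is exploitable (the actions and reward numbers here are made-up assumptions, not any real robot API):

```python
# Toy model: the robot gets +1 for each unit of dirt sucked up,
# and dumping dirt back out costs nothing.

def naive_reward(action: str) -> float:
    return 1.0 if action == "suck" else 0.0

# The degenerate policy: dump, suck, dump, suck, ...
total = 0.0
for step in range(10):
    action = "suck" if step % 2 else "dump"
    total += naive_reward(action)

print(total)  # -> 5.0 reward, and the room never gets any cleaner
```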

Stack Exchange thread: https://ai.stackexchange.com/questions/22851/what-are-some-best-practices-when-trying-to-design-a-reward-function

Personal Reflections

Maybe the reason humans behave differently is that each of us is born with a biologically different reward function.