A reward signal defines the goal of a Reinforcement Learning problem.

  • A reward is a scalar feedback signal
  • Indicates how well the agent is doing at step $t$
  • The agent’s job is to maximize cumulative reward

Reinforcement learning is based on the reward hypothesis.

Reward Hypothesis

All goals can be described by the maximisation of expected cumulative reward.


Don’t use the reward to impart prior knowledge to the agent about how to achieve the real goal (e.g. rewarding a chess-playing agent for taking pieces). The agent might find a way to collect that reward without achieving the real goal, such as taking pieces even at the cost of losing the game.

The agent’s goal is to maximize the expected return, where the return is the cumulative (possibly discounted) reward collected in the long run.
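
As a minimal sketch of what "cumulative reward" means here: assuming a discount factor `gamma` (not mentioned above; `gamma = 1` gives the plain sum), the return of an episode can be computed from its reward sequence as below. The helper name `discounted_return` is made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_k gamma**k * rewards[k] for one episode's reward sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = r + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The expected return is then the average of this quantity over the episodes the agent's policy can generate.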

In the Multi-Armed Bandit context, I see the reward being written as .

The reward function is generally defined as

$$R_i : S \times A \times S \to \mathbb{R}$$

It is the reward function that determines the immediate reward received by agent $i$ for a transition from $(s, a)$ to $s'$. For example, see Value Iteration.


I remember finding this topic confusing: when are we supposed to receive the reward? As soon as we arrive in a new state, or at the action level? In the Bellman backup, the reward comes right after choosing an action.
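
To make the timing concrete, here is a sketch of a Bellman optimality backup on a made-up two-state MDP. The transition table `P`, the state and action names, and the discount `GAMMA` are all hypothetical. I am assuming the reward depends on the state, the action, and the next state, so it sits inside the sum over next states: it is collected on the transition, after the action has been chosen.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical toy MDP: P[s][a] is a list of (probability, next_state, reward),
# where `reward` is the reward for that particular (s, a, next_state) transition.
P = {
    "A": {"stay": [(1.0, "A", 0.0)],
          "go":   [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)],
          "go":   [(1.0, "A", 0.0)]},
}


def bellman_backup(V, s):
    """One backup: max over actions of E[reward + GAMMA * V(next_state)]."""
    return max(
        sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
        for outcomes in P[s].values()
    )


V = {"A": 0.0, "B": 0.0}
for _ in range(200):  # repeated sweeps = value iteration
    V = {s: bellman_backup(V, s) for s in P}
print(V)
```

If the reward only depended on the state being entered, the same code would work unchanged; the tuple's reward entry would just be identical for every action leading to that state.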

So actually, I am no longer confused. I got confused after going over Lecture 2 from Stanford CS234, 42:36.

Reward Engineering

Since I never got any hands-on experience, I never really had to think about this problem. But now that I am starting to work on the AWS DeepRacer, I have to.

If you don’t design your reward carefully, the agent might do some really stupid things. For instance, if you design a cleaning robot and give it a reward every time it sucks up dirt:

  • Then it can just dump the dirt and suck it up again, over and over (bad example)
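
A toy sketch of this failure mode. Everything in it (the dynamics in `step`, the action names, and the two reward functions) is invented for illustration. It compares a reward paid per unit of dirt sucked up with a reward paid for the floor actually getting cleaner.

```python
def step(floor, bag, action):
    """Hypothetical dynamics: 'suck' moves one unit floor -> bag,
    'dump' empties the bag back onto the floor."""
    if action == "suck" and floor > 0:
        return floor - 1, bag + 1, 1  # third value: units sucked this step
    if action == "dump":
        return floor + bag, 0, 0
    return floor, bag, 0


def naive_reward(before, after, sucked):
    return sucked  # paid for every unit of dirt sucked up


def net_clean_reward(before, after, sucked):
    return before - after  # paid only when the floor gets cleaner


def run(actions, reward_fn, floor=2):
    """Return (total reward, dirt left on the floor) for an action sequence."""
    bag, total = 0, 0
    for a in actions:
        before = floor
        floor, bag, sucked = step(floor, bag, a)
        total += reward_fn(before, floor, sucked)
    return total, floor


honest = ["suck", "suck"]
exploit = ["suck", "suck", "dump"] * 3

for name, fn in [("naive", naive_reward), ("net clean", net_clean_reward)]:
    print(name, "honest:", run(honest, fn), "exploit:", run(exploit, fn))
```

Under the naive reward the dump-and-suck loop earns more than honest cleaning while leaving the floor dirty; under the net-cleanliness reward the loop earns nothing, because dumping is penalized by exactly what re-sucking pays.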

Stack Exchange thread:

Personal Reflections

Maybe the reason humans behave differently is that each of us is born biologically with a different reward function.