Reinforcement Learning

Reinforcement Learning (RL) was invented as a way to model and solve problems of decision making under uncertainty.

Cool links to projects I found:

Learning RL

Why RL?

What makes reinforcement learning different from other machine learning paradigms?

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non i.i.d data)
  • Agent’s actions affect the subsequent data it receives

RL is like a one-size-fits-all solution: the same framework of states, actions, and rewards can be applied to very different decision-making problems.


It seems, however, that RL often doesn’t work yet. RL is sample-inefficient: it can require millions of samples before it learns anything useful.

Deep RL is popular because it’s the only area in ML where it’s socially acceptable to train on the test set. I don’t care, so that’s great news for me!

I have this thought that RL sits at the intersection of many disciplines, and it feels so fascinating. Similar to my thoughts on Planning.

Sequential Decision Making

Goal: select actions to maximize total future reward, which we call the expected return (formalized right after this list).

  • Actions may have long term consequences
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward
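The usual way to make “total future reward” precise is the discounted return (standard textbook definition; the discount factor γ is something I’m adding here, not from these notes):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad 0 \le \gamma \le 1
```

The goal is then to maximize the expected value of G_t, and γ < 1 is exactly what makes trading immediate reward for long-term reward a meaningful choice.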

The main categories of RL algorithms are value-based and policy-based methods.
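Roughly, sketching with standard textbook definitions rather than anything derived in these notes: value-based methods learn a value function and read the policy off it greedily, while policy-based methods parameterize the policy directly and follow the gradient of the expected return.

```latex
% Value-based: learn Q(s, a), then act greedily with respect to it
\pi(s) = \arg\max_{a} Q(s, a)

% Policy-based: parameterize \pi_\theta and ascend the policy gradient (REINFORCE form)
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\, \nabla_\theta \log \pi_\theta(a \mid s)\, G_t \,\right]
```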


RL vs. Planning

They are different problem setups. In planning, we are already told the setup of the game in advance.

In reinforcement learning, the environment is initially unknown; the agent interacts with the environment and improves its policy from that experience.

This is in contrast with planning, where a model of the environment is known and the agent performs computations with its model rather than interacting with the real environment.

Incremental Implementation

This form occurs frequently throughout RL: a new estimate is obtained by moving the old estimate a step toward a target.
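The general form I believe this refers to (the standard one from Sutton & Barto’s Incremental Implementation section) is:

```latex
\text{NewEstimate} \;\leftarrow\; \text{OldEstimate} \;+\; \text{StepSize}\,\bigl[\text{Target} - \text{OldEstimate}\bigr]
% e.g. the incremental sample average of rewards:
Q_{n+1} = Q_n + \tfrac{1}{n}\bigl[R_n - Q_n\bigr]
```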

Reinforcement Learning Algorithms


The foundation of RL is a process called Generalized Policy Iteration (GPI): policy evaluation and policy improvement interacting with each other, and almost all RL methods can be described as instances of GPI.

In the model-based setting, you could in principle use exhaustive search, but it is better to optimize with Dynamic Programming in Reinforcement Learning (see the sketch below).
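A minimal sketch of GPI as dynamic-programming policy iteration, assuming a tiny made-up tabular MDP whose transition probabilities and rewards are fully known (every name and number below is illustrative, not from these notes):

```python
import numpy as np

# Hypothetical tiny MDP with known dynamics (the model-based setting).
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy
V = np.zeros(n_states)

for _ in range(100):  # generalized policy iteration: evaluate, then improve
    # Policy evaluation: repeatedly apply the Bellman expectation backup for V^pi.
    for _ in range(100):
        V = R[np.arange(n_states), policy] + gamma * (P[np.arange(n_states), policy] @ V)
    # Policy improvement: one-step lookahead, then act greedily w.r.t. V.
    Q = R + gamma * (P @ V)              # Q[s, a]
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                            # policy stable -> optimal for this model
    policy = new_policy

print("greedy policy:", policy)
```

Exhaustive search would enumerate all policies instead; the DP version reuses the value function between improvements, which is the whole point of GPI.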

Online vs Offline Updates

  • Online: We modify the value function during the episode
  • Offline: We only modify the value function after the episode has ended
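A hedged sketch of the difference, using TD(0) as the online case and every-visit Monte Carlo as the offline case. The environment object, its reset()/step() interface, the random behavior policy, and the step size alpha are all assumptions for illustration:

```python
import random
from collections import defaultdict

alpha, gamma = 0.1, 0.99
V = defaultdict(float)  # tabular state-value estimates

def run_episode_online(env, actions):
    """Online (TD(0)): the value function is updated after every single step."""
    s, done = env.reset(), False
    while not done:
        a = random.choice(actions)                 # placeholder behavior policy
        s2, r, done = env.step(a)                  # assumed interface: (next_state, reward, done)
        target = r + (0.0 if done else gamma * V[s2])
        V[s] += alpha * (target - V[s])            # the incremental update form again
        s = s2

def run_episode_offline(env, actions):
    """Offline (Monte Carlo): collect the whole episode, update only at the end."""
    s, done, trajectory = env.reset(), False, []
    while not done:
        a = random.choice(actions)
        s2, r, done = env.step(a)
        trajectory.append((s, r))
        s = s2
    G = 0.0
    for s, r in reversed(trajectory):              # returns computed backwards
        G = r + gamma * G
        V[s] += alpha * (G - V[s])
```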

Evolutionary Algorithms vs RL

See Page 8. Evolutionary methods ignore much of the useful structure of the RL problem: they do not use the fact that the policy they are searching for is a function from states to actions.

General Topics

Questions I have

  1. So with this value function and the state representation, does it encapsulate time? I think it should, yes: say you win more points the faster you run, then that is captured in the state representation.