Reinforcement Learning (RL) was invented as a way to model and solve problems of decision making under uncertainty.
- Tips and tricks: https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html
- Benchmarking: https://github.com/automl/CARL
- https://github.com/deepmind/open_spiel -> A DeepMind collection of environments and algorithms for RL in games; seems more like a complement to OpenAI Gym for multi-agent/game-theoretic settings than a direct alternative
Cool links to projects I found:
- https://github.com/eleurent/highway-env -> Look at the README!!
- The default is
- https://github.com/bulletphysics/bullet3/ (alternative to MUJOCO?)
- https://github.com/DLR-RM/rl-baselines3-zoo (I sent this to Soham, but I don’t know how good it is)
- Honestly I feel like Foundations of Deep RL by Pieter Abbeel is all you need for a good introduction
- Course by David Silver (LEGENDARY) on YouTube
- Stanford CS234 on YouTube
- Spinning Up by OpenAI
- Sutton Book Solutions here
- Also practice on GitHub, using your fork
- Extra Practice Problems, see here
- Super helpful blog by Lilian Weng, OpenAI Lead for applied research
What makes reinforcement learning different from other machine learning paradigms?
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non i.i.d data)
- Agent’s actions affect the subsequent data it receives
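The bullets above can be sketched as a toy agent-environment loop. The environment, states, and rewards here are made up purely for illustration (not a real Gym env):

```python
import random

class ToyEnv:
    """Toy environment: reach state 3 from state 0 by moving right (+1) or left (-1)."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # The agent's actions determine what data it sees next (sequential, non-i.i.d.).
        self.state = max(0, min(3, self.state + action))
        done = self.state == 3
        reward = 1.0 if done else 0.0  # reward is delayed until the goal is reached
        return self.state, reward, done

env = ToyEnv()
state, total_reward, done = 0, 0.0, False
while not done:
    action = random.choice([-1, 1])        # no supervisor: only a reward signal
    state, reward, done = env.step(action)
    total_reward += reward
print(total_reward)  # 1.0, received only at the very end
```

Note that the agent gets zero feedback for every step except the last, which is exactly the delayed-reward problem.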
RL is like a one-size-fits-all solution. In practice, however, it often doesn’t work yet: RL is sample inefficient, typically requiring millions of samples before it can learn something.
Deep RL is popular because it’s the only area in ML where it’s socially acceptable to train on the test set. I don’t care, so that’s great news for me!
I have this thought that RL sits at the intersection of many disciplines, and it feels so fascinating. Similar to my thoughts on Planning.
Sequential Decision Making
- Actions may have long term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward
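The "sacrifice immediate reward" point falls out of the discounted return. A quick illustrative calculation (reward sequences and gamma chosen arbitrarily):

```python
def discounted_return(rewards, gamma=0.9):
    # G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    return sum(gamma**t * r for t, r in enumerate(rewards))

greedy  = [1, 0, 0, 0]    # grab the small immediate reward
patient = [0, 0, 0, 10]   # sacrifice now for a larger delayed reward
print(discounted_return(greedy))   # 1.0
print(discounted_return(patient))  # 0.9**3 * 10, approximately 7.29
```

Even with discounting shrinking the delayed reward, the patient sequence still wins, so the optimal agent waits.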
The main categories of RL algorithms are Value-Based vs. Policy-Based Methods.
- Markov Decision Process
- Reinforcement Learning Terminology
- Agent-Environment Interface
- RL Agent
- Bootstrapping and Sampling
- Model-Based vs. Model-Free RL
- Evaluation and Control
RL vs. Planning
They are different problem setups. In planning, we are told the setup of the game in advance.
In reinforcement learning, the environment is initially unknown, the agent interacts with the environment and the agent improves its policy.
This is in contrast with planning, where a model of the environment is known and the agent improves by performing computations with its model rather than by interacting.
This form occurs frequently throughout RL, where
Reinforcement Learning Algorithms
The foundation of RL is a process called Generalized Policy Iteration (GPI), and almost all RL methods can be described as instances of GPI.
In the model-based setting, you could in principle use exhaustive search, but it’s far more efficient to optimize with Dynamic Programming in Reinforcement Learning.
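A minimal sketch of GPI as policy iteration on a tiny hand-made chain MDP (the environment, constants, and helper names are invented for illustration): evaluation backs up values under the current policy with DP, improvement acts greedily, and the two alternate until the policy stabilizes.

```python
# Tiny deterministic MDP: states 0..3, actions move left (-1) or right (+1),
# state 3 is terminal and entering it pays +1.
GAMMA = 0.9
STATES, ACTIONS = range(4), (-1, +1)

def step(s, a):
    ns = max(0, min(3, s + a))
    return ns, (1.0 if ns == 3 and s != 3 else 0.0)

def evaluate(policy, V, sweeps=50):
    # Policy evaluation: DP backups of V under the current policy.
    for _ in range(sweeps):
        for s in STATES:
            if s == 3:
                continue  # terminal state keeps value 0
            ns, r = step(s, policy[s])
            V[s] = r + GAMMA * V[ns]
    return V

def improve(V):
    # Policy improvement: act greedily with respect to V.
    return {s: max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
            for s in STATES}

V = {s: 0.0 for s in STATES}
policy = {s: -1 for s in STATES}   # start from a bad "always move left" policy
for _ in range(5):                 # GPI: evaluation and improvement alternate
    V = evaluate(policy, V)
    policy = improve(V)
print({s: policy[s] for s in range(3)})  # greedy policy moves right: {0: 1, 1: 1, 2: 1}
```

The interesting part is that neither loop needs to run to convergence before handing off to the other; that interleaving is what "generalized" refers to.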
- Value Function, introduced with Dynamic Programming in Reinforcement Learning
- Model-Free Policy Evaluation
- Model-Free Control
- Value Function Approximation
- Policy Gradient Methods
- Eligibility Trace
Online vs Offline Updates
- Online: we modify the value function during the episode
- Offline: we only modify the value function after the episode has ended
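The difference shows up when a state is revisited within an episode. A sketch with a made-up recorded episode and TD(0)-style updates (all names and numbers are illustrative assumptions):

```python
ALPHA, GAMMA = 0.5, 1.0
# A fixed recorded episode: (state, reward received on leaving it, next state).
# State "B" is visited twice, which is where online and offline diverge.
episode = [("A", 0.0, "B"), ("B", 1.0, "B"), ("B", 1.0, "terminal")]

# Online: update V during the episode, immediately after each step.
V = {"A": 0.0, "B": 0.0, "terminal": 0.0}
for s, r, ns in episode:
    V[s] += ALPHA * (r + GAMMA * V[ns] - V[s])   # later steps see the updated V
print(V)   # {'A': 0.0, 'B': 0.75, 'terminal': 0.0}

# Offline: accumulate the increments, apply them only after the episode ends.
V2 = {"A": 0.0, "B": 0.0, "terminal": 0.0}
deltas = {s: 0.0 for s in V2}
for s, r, ns in episode:
    deltas[s] += ALPHA * (r + GAMMA * V2[ns] - V2[s])  # V2 frozen all episode
for s in V2:
    V2[s] += deltas[s]
print(V2)  # {'A': 0.0, 'B': 1.0, 'terminal': 0.0}
```

Online, the second update to "B" sees its already-bumped value, so the increments differ; offline, every increment is computed against the values from the start of the episode.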
Evolutionary Algorithms vs RL
See Page 8. Evolutionary methods ignore much of the useful structure of the RL problem: they do not use the fact that the policy they’re searching for is a function from states to actions.
- Searching Algorithms
- Knowledge Engineering
- Exploration and Exploitation
- Imitation Learning
Questions I have
- So with this value function, does the state representation encapsulate time? I think it should: for example, if you win more points the faster you run, then elapsed time has to be captured in the state representation.