Used for Reinforcement Learning
Stable Baselines3 (SB3)
I believe I used this for my Poker AI.
https://stable-baselines3.readthedocs.io/en/master/
List of Algorithms https://stable-baselines3.readthedocs.io/en/master/guide/algos.html
This is example code for PPO.
Notice the reset method: usually reset() also returns an info dict (the Gymnasium API returns (obs, info)). An environment whose reset returns only the observation is kind of incorrect, and you need to modify the environment so it works with SB3.
How things work under the hood
https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html
SB3 networks are separated into two main parts (see figure below):
- A features extractor (usually shared between actor and critic when applicable, to save computation) whose role is to extract features (i.e. convert to a feature vector) from high-dimensional observations, for instance, a CNN that extracts features from images. This is the features_extractor_class parameter. You can change the default parameters of that features extractor by passing a features_extractor_kwargs parameter.
- A (fully-connected) network that maps the features to actions/value. Its architecture is controlled by the net_arch parameter.