AlphaGo

Why do you have both a policy network and a value network?

The policy network gives you a distribution over strong moves, so the search only has to consider a handful of candidates at each node.

The value network estimates how good the current position is, so leaf positions can be evaluated without playing every game out to the end.
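A minimal sketch of the two heads sharing one conv trunk, in the AlphaGo Zero style (the 2016 AlphaGo trained them as separate networks). This is my own illustration, not the lecture's code; the layer widths and the 17 input feature planes are assumptions:

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Toy two-headed network: one trunk, a policy head and a value head."""
    def __init__(self, board_size=19, in_planes=17, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        # Policy head: logits over board_size**2 moves, +1 for pass.
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board_size * board_size, board_size * board_size + 1),
        )
        # Value head: scalar in [-1, 1] estimating the expected game outcome.
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board_size * board_size, 1), nn.Tanh(),
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)
```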

In AlphaGo, they actually use Monte Carlo targets to learn the value network: each position is regressed toward the final outcome z of its game, with no bootstrapping.
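A hedged sketch of that Monte Carlo target, reusing the hypothetical PolicyValueNet above; the random tensors stand in for real self-play data:

```python
import torch
import torch.nn.functional as F

net = PolicyValueNet()
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

positions = torch.randn(32, 17, 19, 19)            # fake batch of board states
z = (torch.rand(32, 1) < 0.5).float() * 2 - 1      # final outcomes in {+1, -1},
                                                   # from the player-to-move's view

_, v = net(positions)
loss = F.mse_loss(v, z)   # pure Monte Carlo target: regress on the game result
opt.zero_grad()
loss.backward()
opt.step()
```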

Important details:

  • “The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting.”
    • This is because successive positions within a single game are highly correlated. So they make sure that each training position comes from a different game (as sketched below)
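A small sketch of that de-correlation step; `games` and its `"positions"` / `"outcome"` fields are hypothetical stand-ins for however the game records are actually stored:

```python
import random

def build_value_dataset(games):
    """Sample exactly one (position, outcome) pair per game."""
    dataset = []
    for game in games:
        # Positions within one game are near-duplicates of each other, so
        # keeping them all would let the net memorize game identity.
        pos = random.choice(game["positions"])
        dataset.append((pos, game["outcome"]))
    return dataset
```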

Lineage (CS231n 2025 Lec 17, slides 36–40)

DeepMind’s game-playing line, in order, with what each entry actually changed:

| When | System | What changed |
| --- | --- | --- |
| Jan 2016 | AlphaGo | Imitation (expert games) + Monte Carlo tree search + RL self-play. Beat Lee Sedol, who later cited the loss as a reason for retiring from professional play. |
| Oct 2017 | AlphaGo Zero | Dropped human imitation entirely; pure self-play from random init. Beat the Lee Sedol version 100-0. (Ke Jie was beaten in May 2017 by the intermediate AlphaGo Master.) |
| Dec 2018 | AlphaZero | One architecture generalized to Go, Chess, and Shogi. |
| Nov 2019 | MuZero | Planning with a learned model; game rules no longer given to the agent. |

Sibling lineage at DeepMind / OpenAI on partial-information, long-horizon, real-time games: AlphaStar (StarCraft II; Vinyals et al., Nature 2019) and OpenAI Five (Dota 2, April 2019). These are cited in the same lecture as evidence that scaling RL works beyond perfect-information board games.