Depth-Limited Solving
You actually need to understand how Subgame Solving works. It took me three reads to realize that the depth-limit idea and generating a set of opponent strategies are only a small part of it.
“Depth-limited solving makes nested solving feasible even in the early game, so it is possible to play without acting according to a precomputed strategy or using action translation.” (Brown et al., 2018, p. 5)
You use the blueprint strategy, which approximates a Nash Equilibrium, to calculate the Counterfactual Values.
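For reference, the usual CFR-style definition of a counterfactual value (standard notation from the CFR literature, not necessarily the paper's): the counterfactual value of an infoset $I$ for player $i$ under strategy profile $\sigma$ is

$$
v_i^{\sigma}(I) = \sum_{h \in I} \pi_{-i}^{\sigma}(h) \sum_{z \in Z} \pi^{\sigma}(h, z)\, u_i(z)
$$

where $\pi_{-i}^{\sigma}(h)$ is the probability of reaching $h$ if player $i$ plays to reach it, $\pi^{\sigma}(h, z)$ is the probability of going from $h$ to terminal history $z$, and $u_i(z)$ is player $i$'s payoff at $z$.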
This is actually a pretty simple algorithm at a high level:
- Do Subgame Solving with current set of policies
- Calculate a Best Response
- Add BR to leaf-policies
- Repeat
Basically, calculate a best response strategy a bunch of times, and hope to converge to something.
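Here is a minimal Python sketch of that loop as I understand it. `solve_subgame` and `compute_best_response` are hypothetical helpers (e.g., a depth-limited CFR run and an approximate best-response calculation), so this is pseudocode-level, not the paper's implementation:

```python
def depth_limited_solve(subgame, blueprint, num_iterations):
    """Grow the set of opponent continuation policies at the depth-limit leaves.

    `solve_subgame` and `compute_best_response` are hypothetical helpers:
    the first solves the depth-limited subgame given the current set of
    opponent leaf policies, the second computes an opponent best response
    to the resulting strategy.
    """
    leaf_policies = [blueprint]          # start from the blueprint strategy
    our_strategy = None
    for _ in range(num_iterations):
        # Solve the subgame: at each leaf, the opponent may pick any
        # continuation policy currently in `leaf_policies`.
        our_strategy = solve_subgame(subgame, leaf_policies)
        # Compute an (approximate) opponent best response to that strategy.
        best_response = compute_best_response(subgame, our_strategy)
        # Add the best response to the leaf-policy set and repeat.
        leaf_policies.append(best_response)
    return our_strategy
```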
Creating opponent policies
However, we don’t know the values at the leaf nodes of the depth-limited subgame, so we use some sort of expected value?
We first have a blueprint strategy, which is an estimated equilibrium strategy profile calculated using standard methods like MCCFR. We want to do real-time search on top of this.
The first part is coming up with a set of opponent strategies that respond well to our strategy. There are 2 ways we can do this:
- Bias Approach
- You bias the blueprint strategy (e.g., toward particular actions) to generate the opponent policies (see the sketch after this list)
- Generative Approach
- You generate a set of best-response opponent profiles to the blueprint strategy, which is apparently equivalent to solving an MDP (I don’t fully understand how this works)
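A minimal sketch of what the bias approach could look like, assuming it means skewing the blueprint's action probabilities at each infoset toward a chosen action (the `bias_factor` value and the per-infoset application are my assumptions, not the paper's parameters):

```python
def bias_blueprint(blueprint_probs, biased_action, bias_factor=5.0):
    """Skew one infoset's blueprint action distribution toward one action.

    `blueprint_probs` maps actions to probabilities at a single infoset.
    Applying this at every infoset, once per action of interest, yields a
    small family of distinct opponent policies derived from the blueprint.
    """
    biased = {a: p * (bias_factor if a == biased_action else 1.0)
              for a, p in blueprint_probs.items()}
    total = sum(biased.values())
    return {a: p / total for a, p in biased.items()}

# Example: the blueprint mixes fold/call/raise; build a "raise-biased" opponent.
blueprint = {"fold": 0.2, "call": 0.5, "raise": 0.3}
print(bias_blueprint(blueprint, "raise"))  # raise probability goes up, the rest shrink
```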
We then play out these games and estimate the expected values of the strategies using Monte-Carlo simulations. From these simulations, the optimal strategy for that subgame can be determined by selecting the action that maximizes the expected value.
For example, suppose we have a subgame with two actions, “A” and “B”, and the expected value of each action has been calculated as follows:
- A: 5
- B: 3
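A minimal, self-contained sketch of that computation, using a made-up `simulate` function as a stand-in for playing out the rest of the game (the payoffs are fabricated so the estimates land near 5 and 3); the “Confusion” note below explains why the final argmax step alone is not sound in imperfect-information games:

```python
import random

def estimate_action_values(actions, simulate, num_rollouts=10_000):
    """Estimate each action's expected value by averaging Monte-Carlo rollouts."""
    return {a: sum(simulate(a) for _ in range(num_rollouts)) / num_rollouts
            for a in actions}

# Toy stand-in for simulating the rest of the game after taking an action.
def simulate(action):
    return random.gauss(5.0 if action == "A" else 3.0, 1.0)

values = estimate_action_values(["A", "B"], simulate)
best_action = max(values, key=values.get)  # "A" -- the naive argmax choice
print(values, best_action)
```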
Once the optimal strategy for each subgame has been determined, these strategies can be combined to form a strategy for the entire game. This process can be repeated as needed to further improve the accuracy of the solution.
Danger
The expected value provides a useful measure of the relative value of different actions and can be used to guide the search for an optimal strategy.
Confusion
Why didn’t people come up with real-time search in the first place, then? Just simulate a bunch of games and calculate our expected value.
- NO! You cannot just take the action with the highest expected value. For example, if you play RPS and find that rock gives the highest expected value, playing rock all the time is not optimal.
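A small runnable illustration of that point: against a fixed (uniform) opponent, always playing rock looks fine in expectation, but once the opponent best-responds its value collapses, while the uniform mixed strategy stays safe:

```python
ACTIONS = ["rock", "paper", "scissors"]

def payoff(a, b):
    """Our payoff when we play `a` and the opponent plays `b`."""
    if a == b:
        return 0
    wins = {("rock", "scissors"), ("paper", "rock"), ("scissors", "paper")}
    return 1 if (a, b) in wins else -1

def expected_value(our_probs, opp_probs):
    return sum(our_probs[a] * opp_probs[b] * payoff(a, b)
               for a in ACTIONS for b in ACTIONS)

def best_response_value(our_probs):
    # The opponent picks the pure action that is worst for us.
    return min(expected_value(our_probs, {x: float(x == b) for x in ACTIONS})
               for b in ACTIONS)

uniform = {a: 1 / 3 for a in ACTIONS}
always_rock = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}

print(expected_value(always_rock, uniform))  #  0.0 against a fixed uniform opponent
print(best_response_value(always_rock))      # -1.0 once the opponent adapts
print(best_response_value(uniform))          #  0.0: the mixed strategy is unexploitable
```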
Is there a concept of expected values in Perfect Information games? Yes, there is Expectiminimax. But in a deterministic game, there isn’t really an “expected value”. CORRECTION: in MCTS, you use “randomness” as part of the self-play/rollout algorithm.
Random Notes
In this paper, they refer to histories as states, which does make sense since each history is represented by a particular node in the game tree.
Knowing the expected value of a state is not enough to come up with a good strategy. Take the example of rock-paper-scissors: even though the expected value is 0, you don’t have enough information to decide on a strategy.
The counter to this is to let the opponent choose among a set of strategies at these leaf infosets, and re-evaluate our strategy based on that?
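My reading of that idea as a toy sketch: each leaf infoset carries one value per opponent continuation strategy (rather than a single scalar), and a conservative evaluation assumes the opponent picks the continuation that is worst for us. The policy names and numbers below are made up for illustration:

```python
# One value per opponent continuation policy at each depth-limit leaf (made-up numbers).
leaf_values = {
    "leaf_1": {"opp_blueprint": 2.0, "opp_aggressive": -1.0, "opp_passive": 3.0},
    "leaf_2": {"opp_blueprint": 1.0, "opp_aggressive": 0.5, "opp_passive": -2.0},
}

def pessimistic_leaf_value(values_by_opp_policy):
    # The opponent chooses, at the leaf, whichever continuation hurts us most,
    # so evaluating our strategy safely means taking the minimum over them.
    return min(values_by_opp_policy.values())

for leaf, values in leaf_values.items():
    print(leaf, pessimistic_leaf_value(values))
```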
They talk about safe vs unsafe subgame solving.