Counterfactual Regret Minimization

https://nn.labml.ai/cfr/index.html

CFR is an extension of Regret Matching for extensive-form games. We use CFR to solve Imperfect Information games, i.e. cases where the game is only partially observable, such as Poker.

This is a really good paper to introduce you to CFR, and a tutorial here.

Link to the original paper (Zinkevich et al., 2007, "Regret Minimization in Games with Incomplete Information"):

This is a good blog:

Families of CFR

  • Monte-Carlo CFR
  • CFR-D (CFR-decomposition) (2014)
    • allows β€œsub-linear space costs at the cost of increased computation time”
  • CFR+ (2015) used to solve Heads Up Limit Hold’Em
    • all regrets are constrained to be non-negative
    • Final strategy used is the current strategy at that time (not the average strategy)
    • no sampling is used
  • Discounted CFR (2018)
  • Instant CFR (2019)
  • Deep CFR (2019)


Two papers have been used to produce the article below:

The notation below is taken entirely from the papers above. If you want the original source of truth, refer to the papers themselves. The goal of this article is to provide a more user-friendly introduction, as well as to serve my own self-learning.

Credits to LabML's annotated CFR for the original idea to host this style of notebook. The contents are very similar.

See Poker AI Blog, where we first talk about RPS and regret matching. I think it is helpful to go through that first, but not necessary.

#todo Check-out Lilian Weng’s blog, and Andrej Karpathy’s blog, and see how much notation they use.

Extensive-Form Game with Imperfect Information

Before we dive into the CFR algorithm, we first need to formally define the game we are trying to solve. A basic understanding of the notation behind functions, relations, and sets is needed to grasp what follows (if rusty, check out a quick cheatsheet here).

No-Limit Texas Hold’Em Poker is a finite extensive game with imperfect information:

  • Finite because we can guarantee that the game will terminate
  • Extensive because there are multiple decisions made in a single game, unlike Rock Paper Scissors, which is called a normal-form game
  • Imperfect Information because each player can only make partial observations, i.e. they cannot see the other player's cards

A finite extensive game with imperfect information has the following components (a minimal code sketch follows the list):

  1. A finite set of Players
  2. A finite set of Histories
  3. A Player Function
  4. A Chance Function
  5. Information Sets
  6. Utility Function

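To make these six components concrete, here is a minimal sketch of what they might look like as a Python interface. All names here (ExtensiveFormGame, CHANCE, etc.) are my own illustration, not notation from the papers.

```python
from typing import Hashable, Sequence

Action = str
History = tuple[Action, ...]   # a sequence of actions taken from the root
Player = int                   # e.g. 0 and 1 for Heads-Up Poker
CHANCE = -1                    # stand-in for the "chance player" c

class ExtensiveFormGame:
    players: Sequence[Player]                                   # 1. finite set of players N
    def is_terminal(self, h: History) -> bool: ...              # 2. is h a terminal history (h in Z)?
    def actions(self, h: History) -> Sequence[Action]: ...      # 2. A(h): legal actions after non-terminal h
    def player(self, h: History) -> Player: ...                 # 3. player function P(h), possibly CHANCE
    def chance_prob(self, h: History, a: Action) -> float: ...  # 4. f_c(a | h) when P(h) = CHANCE
    def info_set(self, h: History) -> Hashable: ...             # 5. key of the information set containing h
    def utility(self, h: History, i: Player) -> float: ...      # 6. u_i(h) for terminal h
```
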
Player

An imperfect information extensive game (IIEG) has a finite set $N$ of players.

  • We denote a particular player by $i$, where $i \in N$
  • For Heads-Up Poker, there are only two players, so $N = \{1, 2\}$

History

An IIEG has a finite set $H$ of possible histories of actions.

  • Each history $h \in H$ denotes a sequence of legal actions
  • $Z \subseteq H$ is the set of terminal histories, i.e. histories after which we have reached the end of a game
  • $A(h) = \{a : (h, a) \in H\}$ is the set of actions available after a non-terminal history $h$ #todo you have both $A(h)$ and $A(I)$, which is confusing

Player Function

A player function $P : H \setminus Z \to N \cup \{c\}$ assigns each nonterminal history to an element of $N \cup \{c\}$, which tells you whose turn it is

  • When $P(h) \in N$, the function returns the player who has to take an action after non-terminal history $h$
  • When $P(h) = c$, then chance determines the action taken after history $h$
    • This is when, for example, it is time to draw a new card. It is not up to the player to decide

Chance Function

  • A function $f_c$ that assigns a probability measure $f_c(\cdot \mid h)$ over $A(h)$ for each history $h$ where $P(h) = c$. In Poker, this simply assigns an equal probability to each card of the remaining deck that could get drawn.

Information Set

  • An information partition $\mathcal{I}_i$ of $\{h \in H : P(h) = i\}$ for each player $i$, with the property that $A(h) = A(h')$ whenever $h$ and $h'$ are members of the same member of the partition
    • A partition here just means that player $i$'s decision histories are split into disjoint groups; each group is one information set
    • Each information set $I \in \mathcal{I}_i$ for player $i$ contains a set of histories that look identical to player $i$
      • Note that we will often just write $A(I)$ for the (shared) action set $A(h)$ of any $h \in I$

Utility

A utility function $u_i : Z \to \mathbb{R}$ for each player $i$

  • The function returns how much player $i$ wins/loses at the end of a game, given a terminal history $z \in Z$
  • $\Delta_{u,i} = \max_z u_i(z) - \min_z u_i(z)$ is the range of utilities to player $i$
  • When $N = \{1, 2\}$ and $u_1 = -u_2$, we have a zero-sum extensive game (such as in the case of Heads-Up Poker)

Some Clarifications:

  • When you are initially dealt two cards, that is part of the history. Then, your opponent gets dealt two cards; however, you won't be able to see that part of the history.

Strategy and Equilibria

Now that we have formally defined the notation for the extensive game, let us define strategies to play this game:

Strategy

  • A strategy $\sigma_i$ of player $i$ is a function that assigns a distribution over $A(I)$ for each information set $I \in \mathcal{I}_i$
    • As player $i$, we follow our strategy $\sigma_i$ to decide which action to pick throughout the game
    • $\sigma_i(I)(a)$ represents the probability of choosing action $a$ given information set $I$, for player $i$
    • Two types of strategies:
      • Pure strategy (deterministic): chooses a single action with probability 1.
      • Mixed strategy (stochastic): at least two actions are played with positive probability
  • $\Sigma_i$ is the set of all strategies for player $i$

Parallel: If you've dabbled/are an expert in reinforcement learning, a strategy $\sigma$ corresponds to the policy $\pi$ that an agent follows, where $\pi(a \mid s)$ represents the probability of choosing action $a$ given that the agent is in state $s$.

However, in Game Theory, we use $\pi$ to denote reach probability (see more below).

Strategy Profile

  • A strategy profile consists of a strategy for each player
  • refers to all the strategies in strategy profile excluding

Reach Probability of History

This is a really crucial idea that one must understand before moving on.

  • $\pi^\sigma(h)$ is the reach probability of history $h$ (the probability of $h$ occurring) if all players choose to play according to strategy profile $\sigma$. In other words, it is the product of each player's contribution: $\pi^\sigma(h) = \prod_{i \in N \cup \{c\}} \pi_i^\sigma(h)$, where $\pi_i^\sigma(h)$ is the product of the probabilities of the actions taken by player $i$ (or by chance $c$) along $h$.
    • We also write $\pi_{-i}^\sigma(h)$ for the product of every contribution except player $i$'s own (chance included); we will need this for counterfactual values later.
    • Example (Rock Paper Scissors): say both players play $R$, $P$, $S$ with probability $1/3$ each. After player 1 plays Rock, the history is $h = (R)$ and $\pi^\sigma(h) = \pi_1^\sigma(h) \cdot \pi_2^\sigma(h) = \tfrac{1}{3} \cdot 1 = \tfrac{1}{3}$. Player 2 hasn't played at this point, so their contribution is just 1.
    • For the terminal history $h = (R, R)$, we get $\pi^\sigma(h) = \tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$ (see the code sketch at the end of this subsection).

In short, $\pi^\sigma(h)$ is telling you the probability that you end up in history $h$ when everyone follows $\sigma$.

The probability of reaching an information set $I$ under strategy profile $\sigma$ is given by $\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$.

  • Notice that the above is a sum, and not a product, because multiple histories can belong to the same information set, so we add up the probability of encountering each of them.

We also define $\pi^\sigma(h, z)$, the reach probability of a terminal history $z$ given the current history $h$, under strategy profile $\sigma$: it is the product of the action probabilities along the path from $h$ to $z$, so that $\pi^\sigma(z) = \pi^\sigma(h)\,\pi^\sigma(h, z)$ for any prefix $h$ of $z$.
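
Here is a tiny runnable sketch of the RPS example above, assuming a toy encoding in which player 1 acts first and player 2 acts second (the encoding is my own, purely for illustration):

```python
# Both players mix uniformly over Rock, Paper, Scissors.
sigma = {
    1: {"R": 1/3, "P": 1/3, "S": 1/3},   # player 1's strategy
    2: {"R": 1/3, "P": 1/3, "S": 1/3},   # player 2's strategy
}
ACTOR = {0: 1, 1: 2}   # action index 0 belongs to player 1, index 1 to player 2

def reach_prob(history, player=None):
    """pi^sigma(h); if `player` is given, only that player's contribution pi_i^sigma(h)."""
    prob = 1.0
    for idx, action in enumerate(history):
        actor = ACTOR[idx]
        if player is None or actor == player:
            prob *= sigma[actor][action]
    return prob

print(reach_prob(("R",)))             # 1/3: only player 1 has acted so far
print(reach_prob(("R",), player=2))   # 1.0: player 2 has not acted, so their contribution is 1
print(reach_prob(("R", "R")))         # 1/9: the two players' contributions multiply
```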

Expected Payoff

The utility function we covered gives us some payoff for player at terminal history . How can we measure our average payoff as we play the game more and more? That is the idea of an expected payoff.

The expected payoff for player $i$ is given by $$u_i(\sigma) = \sum_{z \in Z} u_i(z)\, \pi^\sigma(z)$$

This is simply a single value that tells you how good your strategy profile $\sigma$ is for player $i$. That is, if everyone plays according to $\sigma$, how much money do you expect to win or lose? Our goal is to reach a Nash Equilibrium, where neither player can improve this value by unilaterally changing their own strategy.

  • Intuitively, we are simply taking a weighted average: we multiply our payoff at each terminal history $z$ by how likely we are to reach that history using strategy profile $\sigma$, which is given by $\pi^\sigma(z)$. Note that $\sum_{z \in Z} \pi^\sigma(z) = 1$.
  • The expected payoff is different depending on what strategy profile $\sigma$ we use (see the short worked example below)

We will also sometimes use the notation $u_i(\sigma_i, \sigma_{-i})$, which is the expected payoff for player $i$ if player $i$ plays $\sigma_i$ and all other players play according to the strategy profile $\sigma_{-i}$.

  • We use this expanded form because we will also add a time superscript, as in $\sigma_i^t$, so that we can combine strategies from two different times, for example $u_i(\sigma_i^*, \sigma_{-i}^t)$

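To make the formula concrete, here is a small worked example in Rock Paper Scissors (the payoff encoding is my own: +1 for a win, -1 for a loss, 0 for a tie):

```python
from itertools import product

# u_1(sigma) = sum over terminal histories z of u_1(z) * pi^sigma(z)
sigma = {
    1: {"R": 1.0, "P": 0.0, "S": 0.0},   # player 1 always plays Rock
    2: {"R": 1/3, "P": 1/3, "S": 1/3},   # player 2 mixes uniformly
}
BEATS = {("R", "S"), ("P", "R"), ("S", "P")}   # (winner, loser) pairs

def u1(z):
    """Payoff u_1(z) for player 1 at terminal history z = (a1, a2)."""
    a1, a2 = z
    return 0 if a1 == a2 else (1 if (a1, a2) in BEATS else -1)

def reach(z):
    """pi^sigma(z) for a terminal RPS history."""
    return sigma[1][z[0]] * sigma[2][z[1]]

print(sum(u1(z) * reach(z) for z in product("RPS", repeat=2)))   # 0.0: always-Rock breaks even here
```
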
We can also talk about the expected payoff at a particular nonterminal history $h$ (this is also sometimes referred to as just the expected value at history $h$). That quantity is also called the counterfactual value at nonterminal history $h$; in CFR, however, instead of individual histories we are interested in getting the value of an information set $I$.

  • I am quite confused by why this equation is the way it is, so you should see below for counterfactual value.

Serendipity: This is quite similar to the value functions we see in Reinforcement Learning, where we do a Monte-Carlo rollout to sample the rewards.

Some more notes about discrepancies

Note: Notice that we originally defined the domain of the utility function $u_i$ as the set of terminal histories $Z$, but now we also apply it to a strategy profile. This is just an overload of notation: $u_i(\sigma)$ is the expectation of $u_i(z)$ over the terminal histories reached under $\sigma$.

Nash Equilibrium

Our goal is to find the Nash Equilibrium. This is a strategy profile in which no players can improve by deviating from their strategies.

Formally, in a two-player extensive game, a Nash Equilibrium is a strategy profile $\sigma = (\sigma_1, \sigma_2)$ where the following two inequalities are satisfied: $$u_1(\sigma) \ge \max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2) \qquad \text{and} \qquad u_2(\sigma) \ge \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')$$

  • For instance, the first inequality states that if we take any strategy $\sigma_1'$ in the set $\Sigma_1$ of all strategies for player 1, player 1's expected payoff cannot improve by switching to it while player 2 keeps playing $\sigma_2$

An approximation of a Nash Equilibrium, called an $\epsilon$-Nash Equilibrium, is a strategy profile where $$u_1(\sigma) + \epsilon \ge \max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2) \qquad \text{and} \qquad u_2(\sigma) + \epsilon \ge \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')$$

Best Response

Given a strategy profile $\sigma$, we define player $i$'s best response as $$b_i(\sigma_{-i}) = \arg\max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i})$$ In other words, player $i$'s best response is a strategy that maximizes their expected payoff assuming all other players play according to $\sigma_{-i}$.

Exploitability

Exploitability is a measure of how close $\sigma$ is to an equilibrium: roughly, how much expected payoff best-responding opponents could gain against $\sigma$ (in a two-player zero-sum game, $\epsilon_\sigma = u_1(b_1(\sigma_2), \sigma_2) + u_2(\sigma_1, b_2(\sigma_1))$, which is exactly $0$ at a Nash Equilibrium). This will be important for measuring how close the strategy profile we generate is to the Nash Equilibrium.

Regret

As we will come to understand shortly, CFR is based on the very powerful yet simple idea of minimizing regret over time. Let us first understand what regret is about.

Regret is a measure of how much one regrets not having chosen an action. It is the difference between the utility/reward of that action and the action we actually chose, with respect to the fixed choices of other players.

The overall average regret of player $i$ at time $T$ is $$R_i^T = \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^{T} \Big( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \Big)$$

  • where $\sigma^t$ is the strategy profile used on round $t$ (so $\sigma_i^t$ is the strategy used by player $i$ on round $t$)
  • $u_i(\sigma_i^*, \sigma_{-i}^t)$ is the utility of a strategy profile with player $i$ playing $\sigma_i^*$ and everyone else playing $\sigma_{-i}^t$
  • Think of $\sigma_i^*$ as the "optimal" (best-in-hindsight) strategy, just like the Optimal Value Function in RL

Notice, however, that we need to find the $\sigma_i^*$ which maximizes this difference in utilities. This is the difficulty: we don't know it unless we loop over all possible strategies, which is computationally intractable. CFR comes up with a tractable way to bound and minimize this regret.

We also define the average strategy $\bar{\sigma}_i^T$ for player $i$ from time $1$ to $T$. This is the final strategy that will be used. For each information set $I \in \mathcal{I}_i$, for each $a \in A(I)$, we define $$\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma^t(I)(a)}{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)}$$ So for a given player $i$, we want a regret-minimizing algorithm such that $R_i^T \to 0$ as $T \to \infty$ (a sketch of how the average strategy is accumulated in code follows).
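
Here is a minimal sketch, with made-up numbers, of how this average is typically accumulated in code: each time an information set is visited, we add the current strategy weighted by the acting player's own reach probability $\pi_i^{\sigma^t}(I)$, and normalize at the end.

```python
from collections import defaultdict

strategy_sum = defaultdict(lambda: defaultdict(float))   # I -> a -> running numerator

def accumulate(info_set, strategy, my_reach):
    """Add pi_i^{sigma^t}(I) * sigma^t(I)(a) for every action a at this information set."""
    for a, p in strategy.items():
        strategy_sum[info_set][a] += my_reach * p

def average_strategy(info_set):
    """Normalize the running sums to obtain sigma_bar_i^T(I)."""
    total = sum(strategy_sum[info_set].values())
    return {a: s / total for a, s in strategy_sum[info_set].items()}

# e.g. two iterations that both reach the (hypothetical) information set "K|pb":
accumulate("K|pb", {"p": 0.5, "b": 0.5}, my_reach=1.0)
accumulate("K|pb", {"p": 0.2, "b": 0.8}, my_reach=1.0)
print(average_strategy("K|pb"))   # {'p': 0.35, 'b': 0.65}
```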

Connection between Regret, Average Strategy and Nash Equilibrium

There is this connection between regret $R_i^T$, average strategies $\bar{\sigma}^T$, and Nash Equilibrium, which is extremely important and going to help us unlock the power of CFR!

Theorem 1: In a two-player zero-sum game at time $T$, if both players' average overall regret satisfies $R_i^T < \epsilon$, then $\bar{\sigma}^T$ is a $2\epsilon$-equilibrium.

#todo Try to prove this. You can do it

Counterfactual Regret Minimization

So the question is, how can we figure out this set of strategies such that regret is minimized? The key idea is to decompose this overall regret into a set of additive regret terms, which can be minimized independently. These individual regret terms are called counterfactual regrets, and they are defined on individual information sets $I$.

Counterfactual?

In game theory, a counterfactual is a hypothetical scenario that considers what would have happened if a player had made a different decision at some point in the game. In CFR, we consider these hypothetical situations.

We are now finally getting into the heart of the counterfactual regret minimization (CFR) algorithm.

Let $I$ be an information set of player $i$, and let $Z_I$ be the subset of all terminal histories $z$ where a prefix of the history is in the information set $I$. For $z \in Z_I$, let $z[I]$ be that prefix.

  • So basically $z[I]$ converts the information set back to the full-history view, where we can see both players' actions.

We first define the counterfactual value as $$v_i(\sigma, I) = \sum_{z \in Z_I} \pi_{-i}^{\sigma}(z[I])\, \pi^{\sigma}(z[I], z)\, u_i(z)$$

  • $\pi_{-i}^{\sigma}(z[I])$ is the reach probability of history $z[I]$ counting only the opponents' (and chance's) contributions, and $\pi^{\sigma}(z[I], z)$ is the probability of playing from $z[I]$ to the terminal history $z$
  • $u_i(z)$ is the payoff for player $i$ at terminal history $z$
  • Remember that we can only compute the utility for terminal histories, therefore $z$ must be a terminal history

This is pretty much the same idea as the [[notes/Counterfactual Regret Minimization#Expected Payoff|Expected Payoff]] which we just talked about above. Think of this as the expected value of a particular information set $I$. The greater $v_i(\sigma, I)$ is, the more money we expect to make at this information set.
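
As a sanity check on the formula, here is a tiny worked example with completely made-up numbers, just to show the arithmetic of $v_i(\sigma, I)$:

```python
# Each tuple is (pi_{-i}(z[I]), pi(z[I], z), u_i(z)) for one terminal history z in Z_I.
terminals = [
    (0.25, 0.5,  2.0),
    (0.25, 0.5, -1.0),
    (0.10, 1.0,  3.0),
]
v = sum(opp_reach * cont_prob * payoff for opp_reach, cont_prob, payoff in terminals)
print(v)   # 0.25*0.5*2 - 0.25*0.5*1 + 0.1*1*3 = 0.425
```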

We use counterfactual values to update counterfactual regrets.

The immediate counterfactual regret is given by $$R_{i,\text{imm}}^T(I) = \frac{1}{T} \max_{a \in A(I)} \sum_{t=1}^{T} \Big( v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I) \Big)$$ where

  • In other words, the immediate counterfactual regret of an information set is the (time-averaged) counterfactual regret of the best single action at that information set
  • $\sigma|_{I \to a}$ is the strategy profile identical to $\sigma$, except that player $i$ always chooses action $a$ at information set $I$

The reason we are interested in the immediate counterfactual regret is because of this key insight that is going to be powering CFR.

Theorem 2: $$R_i^T \le \sum_{I \in \mathcal{I}_i} R_{i,\text{imm}}^{T,+}(I)$$

  • Note that $R_{i,\text{imm}}^{T,+}(I) = \max\big(R_{i,\text{imm}}^T(I), 0\big)$, i.e. only positive immediate counterfactual regrets count

This tells us that the overall regret is bounded by the sum of the (positive) immediate counterfactual regrets. Thus, if we want to minimize our overall regret $R_i^T$, we can simply minimize our immediate counterfactual regret at every information set. And because of Theorem 1, we know that minimizing $R_i^T$ allows us to converge to a $2\epsilon$-Nash Equilibrium!

We now have all the pieces of the puzzle. How can we update our strategy profile such that the cumulative counterfactual regret is minimized over time?

We use the following update rule, which applies Blackwell's Approachability Theorem to the counterfactual regrets (called Regret Matching): $$\sigma_i^{T+1}(I)(a) = \begin{cases} \dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')} & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0 \\[1ex] \dfrac{1}{|A(I)|} & \text{otherwise} \end{cases}$$

  • where $R_i^T(I, a) = \frac{1}{T} \sum_{t=1}^{T} \big( v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I) \big)$ and $R_i^{T,+}(I, a) = \max\big(R_i^T(I, a), 0\big)$
  • In other words, actions are selected in proportion to the amount of positive counterfactual regret for not playing that action. If no actions have any positive counterfactual regret, then an action is selected uniformly at random (see the code sketch after these notes).
  • Since we cannot calculate , we use in practice
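
A minimal sketch of this update in code, assuming we keep a table of cumulative counterfactual regrets per information set (the action names are hypothetical):

```python
def regret_matching(cumulative_regret):
    """Map {action: cumulative counterfactual regret} to the next strategy sigma^{T+1}(I)."""
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: r / total for a, r in positive.items()}
    # no action has positive regret: fall back to the uniform strategy
    return {a: 1.0 / len(cumulative_regret) for a in cumulative_regret}

print(regret_matching({"fold": -1.0, "call": 2.0, "raise": 6.0}))
# {'fold': 0.0, 'call': 0.25, 'raise': 0.75}
```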

If a player selects their strategy according to the equation above, then we can guarantee the following.

Theorem 3: $$R_{i,\text{imm}}^T(I) \le \frac{\Delta_{u,i} \sqrt{|A_i|}}{\sqrt{T}}$$ where $|A_i| = \max_{h : P(h) = i} |A(h)|$. Combined with Theorem 2, this gives $R_i^T \le \Delta_{u,i}\, |\mathcal{I}_i|\, \sqrt{|A_i|} / \sqrt{T}$, so the overall regret goes to $0$ as $T \to \infty$.

Intuition

Okay, I've explained a lot of the above, but how this translates into code is not super intuitive. The pseudocode I referenced was from a CFR tutorial; the original paper does not have anything like that.

Pseudocode for vanilla CFR.

Essentially, you initialize your strategy profile to a uniform probability of choosing each action at every information set.

Then you start iterating the algorithm via self-play: counterfactual values are computed at the terminal histories and propagated back up the game tree, weighted by the reach probabilities determined by the current strategy.

The strategy profile for the next iteration is then selected in proportion to the positive cumulative regret, using Regret Matching.
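
To make this concrete, here is a compact sketch of the whole loop for Kuhn poker (three cards, a single betting round), written in the style of the tutorial's recursive walk rather than the original paper's pseudocode. The history encoding ("p"/"b"), the function names, and the iteration count are all my own choices for this sketch.

```python
import itertools
from collections import defaultdict

ACTIONS = ["p", "b"]                                  # pass (check/fold) and bet (bet/call)
regret_sum = defaultdict(lambda: [0.0, 0.0])          # info set -> cumulative counterfactual regrets
strategy_sum = defaultdict(lambda: [0.0, 0.0])        # info set -> numerator of the average strategy

def current_strategy(info_set):
    """Regret matching on the cumulative regrets stored for this information set."""
    positive = [max(r, 0.0) for r in regret_sum[info_set]]
    total = sum(positive)
    return [r / total for r in positive] if total > 0 else [0.5, 0.5]

def is_terminal(history):
    return history in ("pp", "bp", "bb", "pbp", "pbb")

def payoff(cards, history):
    """Utility for the player who would act next at this terminal history."""
    player = len(history) % 2
    opponent = 1 - player
    if history.endswith("p") and history != "pp":
        return 1                                      # the opponent folded to a bet
    stake = 1 if history == "pp" else 2               # showdown after check-check, or after a call
    return stake if cards[player] > cards[opponent] else -stake

def cfr(cards, history, reach_me, reach_other):
    """Returns the node value for the player to act; accumulates regrets and strategies."""
    if is_terminal(history):
        return payoff(cards, history)
    player = len(history) % 2
    info_set = str(cards[player]) + history           # own card + public action sequence
    strategy = current_strategy(info_set)
    # accumulate the average strategy, weighted by the acting player's own reach probability
    for a in range(2):
        strategy_sum[info_set][a] += reach_me * strategy[a]
    action_value = [0.0, 0.0]
    node_value = 0.0
    for a in range(2):
        # zero-sum: the child's value (for the opponent) is the negative of ours
        action_value[a] = -cfr(cards, history + ACTIONS[a], reach_other, reach_me * strategy[a])
        node_value += strategy[a] * action_value[a]
    # counterfactual regrets are weighted by the opponent's reach probability
    for a in range(2):
        regret_sum[info_set][a] += reach_other * (action_value[a] - node_value)
    return node_value

def train(iterations=10_000):
    deals = list(itertools.permutations([1, 2, 3], 2))   # all 6 ways to deal 2 of the 3 cards
    for _ in range(iterations):
        for cards in deals:   # vanilla CFR: enumerate every chance outcome (the 1/6 weight cancels)
            cfr(cards, "", 1.0, 1.0)

def average_strategy(info_set):
    total = sum(strategy_sum[info_set])
    return [s / total for s in strategy_sum[info_set]]

train()
# Player 2 holding the King and facing a bet ("3" + "b") should learn to (almost) always call:
print(average_strategy("3b"))
```

Note that this sketch enumerates all six deals on every iteration; sampling a single deal instead gives chance-sampling MCCFR, which is discussed next.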

Monte-Carlo CFR

With Vanilla CFR, each iteration requires us to traverse an entire game tree since we update for every single information set and every action on each iteration.

However, this is simply too slow for large games such as No-Limit Heads-Up Texas Hold'Em Poker. MCCFR can help speed up convergence to an approximate Nash Equilibrium.

With MCCFR, we avoid traversing the entire game tree on each iteration while still having the immediate counterfactual regrets be unchanged in expectation. We do this by restricting the terminal histories that we consider on each iteration.

Let $\mathcal{Q} = \{Q_1, \dots, Q_r\}$ be a set of subsets of the terminal histories (i.e. $Q_j \subseteq Z$), such that their union spans the set (i.e. $\bigcup_{j=1}^{r} Q_j = Z$).

Each of these subsets $Q_j$ is called a block.

On each iteration, we will sample one of these blocks and only consider that block. Let $q_j > 0$ be the probability of considering $Q_j$ on the current iteration (where $\sum_{j=1}^{r} q_j = 1$).

The probability of considering terminal history $z$ on the current iteration is given by $q(z) = \sum_{j : z \in Q_j} q_j$. Then, the sampled counterfactual value when updating block $Q_j$ is $$\tilde{v}_i(\sigma, I \mid j) = \sum_{z \in Q_j \cap Z_I} \frac{1}{q(z)}\, \pi_{-i}^{\sigma}(z[I])\, \pi^{\sigma}(z[I], z)\, u_i(z)$$

The paper proves that the expectation of the sampled counterfactual value (over the choice of block) is equal to the true counterfactual value: $\mathbb{E}_{j \sim q}\big[\tilde{v}_i(\sigma, I \mid j)\big] = v_i(\sigma, I)$.

So in the immediate counterfactual regret, we can use the sampled regret $\tilde{v}_i(\sigma^t|_{I \to a}, I \mid j) - \tilde{v}_i(\sigma^t, I \mid j)$ in place of the exact one (a small numerical check of the unbiasedness claim follows).
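
Here is a tiny numerical check of that claim, with completely made-up numbers. Each $w(z)$ below stands for the whole product $\pi_{-i}^{\sigma}(z[I])\, \pi^{\sigma}(z[I], z)\, u_i(z)$ for one terminal history $z \in Z_I$:

```python
w = {"z1": 0.4, "z2": -0.1, "z3": 0.25}        # contribution of each z in Z_I to v_i(sigma, I)
blocks = {"Q1": ["z1", "z2"], "Q2": ["z3"]}    # blocks covering Z_I
q_block = {"Q1": 0.7, "Q2": 0.3}               # probability of sampling each block

# q(z): total probability that z is considered (here each z sits in exactly one block)
q = {z: q_block[j] for j, zs in blocks.items() for z in zs}

v_true = sum(w.values())
v_sampled = {j: sum(w[z] / q[z] for z in zs) for j, zs in blocks.items()}
v_expected = sum(q_block[j] * v_sampled[j] for j in blocks)

print(v_true, v_expected)   # both ~0.55: the importance weight 1/q(z) makes the estimate unbiased
```

Chance sampling, as in the Kuhn sketch above if we sampled one deal per iteration instead of enumerating all six, is exactly this scheme with one block per chance outcome and $q_j = 1/6$.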

Note to self

I am still having trouble going from the math equations to a full on algorithm that does this.

There is also a further speedup to this that I am not aware of.

Old notes

Explanation of notations:

  • $\sigma$ is a strategy profile which consists of a strategy $\sigma_i$ for each player
    • $\sigma_i^t$ is the strategy for player $i$ at time $t$ (it assigns a probability distribution over $A(I)$ for each $I \in \mathcal{I}_i$)
  • $A$ is the set of all game actions
  • $I$ is an Information Set
    • The Information Set is the information that a player can see. Distinction between State and (Information) Set
  • $A(I)$ is the set of legal actions for information set $I$
  • A history $h$ is the sequence of actions from the root of the game
  • $\pi^\sigma(h)$ is the reach probability of history $h$ if all players choose to play according to strategy profile $\sigma$
    • This is where "chance sampling" comes into play, because we consider reach probabilities to more efficiently train our algorithm

More notation:

  • $Z$ denotes the set of all terminal game histories (sequences from root to leaf)

  • $u_i(z)$ is the utility of player $i$ at terminal history $z$

  • The counterfactual value at non-terminal history $h$ is $$v_i(\sigma, h) = \sum_{z \in Z,\ h \sqsubset z} \pi_{-i}^{\sigma}(h)\, \pi^{\sigma}(h, z)\, u_i(z)$$

We use counterfactual values to update counterfactual regrets.

The counterfactual regret of not taking action $a$ at history $h$ is $$r(h, a) = v_i(\sigma|_{I \to a}, h) - v_i(\sigma, h)$$

  • where $\sigma|_{I \to a}$ is the profile $\sigma$ except that at information set $I$, action $a$ is always taken

The counterfactual regret of not taking action $a$ at information set $I$ is just the sum over the histories in $I$: $$r(I, a) = \sum_{h \in I} r(h, a)$$

  • You know the counterfactual values since this is a self-play algorithm
  • $r_i^t(I, a)$ is the regret of player $i$ at time $t$ for not taking action $a$ at information set $I$ (it measures how much player $i$ would rather have played action $a$ at information set $I$)
  • The cumulative counterfactual regret is $R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)$

We use the cumulative counterfactual regret to update the current strategy profile via Regret Matching:

  • where $$\sigma_i^{T+1}(I)(a) = \begin{cases} \dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')} & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0 \\[1ex] \dfrac{1}{|A(I)|} & \text{otherwise} \end{cases}$$ and $R_i^{T,+}(I, a) = \max\big(R_i^T(I, a), 0\big)$

In games with larger state spaces → for more complex games, where there are quintillions of information sets, the game tree is too large to iterate over until the mixed strategies converge.

The key idea is that we can approximate and merge information sets, e.g. by using imperfect recall. See Game Abstraction.

Theoretical Analysis (Regret Bounds)

Notes for me explaining the poker AI

My thoughts: on each iteration, CFR nudges each player's strategy toward a best response to the opponent's current strategy (via regret matching). Over time, the average strategy converges to the Nash Equilibrium.