Counterfactual Regret Minimization
https://nn.labml.ai/cfr/index.html
CFR is an extension of Regret Matching for extensiveform games. We use CFR to solve Imperfect Information games, i.e. cases where the game is only partially observable, such as Poker.
This is a really good paper to introduce your to CFR, and a tutorial here.
Link to original paper:
 http://poker.cs.ualberta.ca/publications/NIPS07cfr.pdf
 https://papers.nips.cc/paper/2007/file/08d98638c6fcd194a4b1e6992063e944Paper.pdf
This is good blog:
 https://int8.io/counterfactualregretminimizationforpokerai/
 You should look at MCCFR http://mlanctot.info/files/papers/nips09mccfr.pdf
 They explain CFR in there. I think you should understand the theory behind the regret bounds, even if it is going to take a bit of time. Because I donβt want to be a scam. I want to truly understand what I am doing
Families of CFR
 MonteCarlo CFR
 CFRD (CFRdecomposition) (2014)
 allows βsublinear space costs at the cost of increased computation timeβ
 CFR+ (2015) used to solve Heads Up Limit HoldβEm
 all regrets are constrained to be nonnegative
 Final strategy used is the current strategy at that time (not the average strategy)
 no sampling is used
 Discounted CFR(2018)
 Instant CFR (2019)
 Deep CFR (2020?)
Connection between Regret, average strategies, and Nash Equilibrium
In a zerosum game, if $R_{iβ{1,2}}β€Ο΅$, then $Ο_{T}$ is a $2Ο΅$ equilibrium.
Two papers have been used to produce the article below:
 Regret Minimization in Games with Incomplete Information: The original paper that came up with the counterfactual regret minimization algorithm.
 Monte Carlo Sampling for Regret Minimization in Extensive Games: introduces Monte Carlo Counterfactual Regret Minimization (MCCFR), where we sample from the game tree and estimate the regrets.
The notation written below is entirely taken from the papers above. If you want the original source of truth to produce this article, refer to the original papers. The goal of this article is to produce a more userfriendly introduction, as well as for my own selflearning.
Credits to LabMLβs annotated CFR for the original idea to host a this style of notebook. Contents are very similar.
See Poker AI Blog, where we first talk about RPS and regret matching. I think it is helpful to go through that first, but not necessary.
#todo Checkout Lilian Wengβs blog, and Andrej Karpathyβs blog, and see how much notation they use.
ExtensiveForm Game with Imperfect Information
Before we dive into the CFR algorithm, we need to first formally the game we are trying to solve. A basic understanding of notation behind Functions, Relations and Sets is needed to grasp the notation below (if rusty, check out a quick cheatsheet here).
NoLimit Texas HoldβEm Poker is a finite extensive game with imperfect information:
 Finite because we can guarantee that the game will terminate
 Extensive because there are multiple decisions made in a single game, unlike Rock Paper Scissors, which is called a normalform game
 Imperfect Information because each player can only make partial observations, i.e they cannot see the other playerβs cards
A finite extensive game with imperfect information has the following components):
 A finite set $N$ of Players
 A finite set $H$ of Histories
 A Player Function
 A Chance Function
 Information Sets
 Utility Function
Player
An imperfect information extensive game (IIEG) has a finite set $N$ of players.
 We denote a particular player by $i$, where $iβN$
 For HeadsUp Poker, there are only two players, so $N={0,1}$
History
An IIEG has finite set $H$ of possible histories of actions.
 Each history $hβH$ denotes a sequence of legal actions
 $Z$ is the set of terminal histories, $ZβH$, i.e. when we have reached the end of a game
 $A(h)={a:(h,a)βH}$ is the set of actions available after a nonterminal history $hβH$#todo you have both $h$ and $(h,a)$ which is confusing
Player Function
A player function $P:HβZβNβͺ{c}$ assigns each nonterminal history $hβHβZ$ to an element of $Nβͺ{c}$, which tells you whose turn it is
 When $P(h)βZ$, the function returns the player who has to take an action after nonterminal history $h$
 When $P(h)=c$, then chance determines the action taken after history $h$
 This is when for example, it is time to draw a new card. It is not up to the player to decide
Chance Function
 A function $f_{c}:HβR$ that assigns a probability measure $f_{c}(aβ£h)$ for all $aβA(h)$, for each history $h$ where $P(h)=c$. In Poker, this simply assigns an equal probability for each card of the remaining deck that gets drawn.
Information Set
 An information partition $I_{i}$ of ${hβH:P(h)=i}$ for each player $iβN$, where $A(h)=$A(hβ)$whenever$h$and$hβ$ are members of the same partition
 #gapinknowledge information partition is confusing, what is this??
 Each information set $I_{i}βI_{i}$ for player $i$ contains a set of histories that look identical to player $i$
 Note that we will often just write $I$ instead of $I_{i}$
Utility
A utility function $u_{i}:ZβR$ for each player $iβN$
 The function $u_{i}(z)$ returns how much player $i$ wins/loses at the end of a game given a terminal history $zβZ$
 $Ξ_{u,i}=max_{z}(z)βmin_{z}u_{i}(z)$ is the range of utilities to player $i$
 when $u_{1}=βu_{2}$, and $N={1,2}$, we have a zerosum extensive game (such as in the case of HeadsUp Poker)
Some Clarifications:
 when you are initially dealt two cards, that is part of the history. Then, your opponent gets dealt two cards. However, you wonβt be able to see that information about the history.
Strategy and Equilibria
Now that we have formally defined the notation for the extensive game, let us define strategies to play this game:
Strategy
 A strategy $Ο_{i}$ of player $i$ is a function that assigns a distribution over $A(I)$ for each $IβI_{i}$
 As player $i$, we follow our strategy $Ο_{i}$ to decide which action to pick throughout the game
 $Ο_{i}(I,a)$ represents the probability of choosing action $aβA(I)$ given information set $I$ for player $i$
 Two types of strategies:
 Pure strategy (deterministic): Chooses a single action with probability 1.
 Mixed strategy ($Ο$): At least two actions played with positive probability
 $β_{i}$ is the set of all strategies for player $i$
Parallel: If youβve dabbled/are an expert in reinforcement learning, strategy corresponds to the policy $Ο$ that an agent follows, where $Ο(aβ£s)$ represents the probability of choosing action $a$ given that the agent is in state $s$.
However, in Game Theory, we use $Ο$ to denote reach probability (see more below).
Strategy Profile
 A strategy profile $Ο$ consists of a strategy for each player $Ο_{1},β¦,Ο_{n}$
 $Ο_{βi}$ refers to all the strategies in strategy profile $Ο$ excluding $Ο_{i}$
Reach Probability of History
This is a really crucial idea that one must understand before moving on.
 $Ο_{Ο}(h)$ is the reach probability of history $h$ (probability of $h$ occurring) if all players choose to play according to strategy $Ο$ $Ο_{Ο}(h)=Ξ_{iβNβͺ{c}}Ο_{i}(h)$ In other words, itβs the probability of each player multiplied together. But shouldnβt it be the contribution from both players? So draw out Rock papers scissors and the probabilities. Example: $Ο_{1}=[31β,31β,31β]$, $Ο_{2}=[31β,31β,31β]$, where the actions are $[Rock,Paper,Scissors]$ If each player plays RPS with 1/3 probability, then $Ο([R])=1/3$ Player 2 doesnβt play at this point, so how do you compute $Ο_{2}(h)$?
 Well, the probability is just 1
so its actually Letβs decompose $h=h_{1}h_{2}h_{3}β¦h_{n}$ $Ο_{Ο}(h)=Ο_{1}(h_{1})Ο_{2}(h_{1}h_{2})Ο_{1}(h_{1}h_{2}h_{3})β¦$
So say itβs Rock Rock. From the perspective We actually only know $Ο(I,a)$ like what
This is telling the probability that you are to end up in history $h$.
The probability of reach an information set $I$ under strategy profile $Ο$ given by $Ο_{Ο}(I)=β_{hβI}Ο_{Ο}(h)$
 Notice that the above is a sum, and not a product, because multiple histories can be the same InfoSet, so we increase our chances of encountering a particular information set by summing the probabilities.
We also define $Ο_{Ο}(h,z)$, the reach probability of a terminal history $z$ given the current history $h$ under strategy profile $Ο$: $Ο_{Ο}(h,z)={Ο_{Ο}(h)Ο_{Ο}(z)β0βhβzotherwiseβ$
Expected Payoff
The utility function we covered gives us some payoff $u_{i}(z)$ for player $i$ at terminal history $zβZ$. How can we measure our average payoff as we play the game more and more? That is the idea of an expected payoff.
The expected payoff for player $i$ is given by $u_{i}(Ο)=β_{hβZ}u_{i}(h)Ο_{Ο}(h)$
This is simply a single value that tells you how good your strategy profile $Ο$. That is, if you play according to $Ο$, how much money do you expect to win or lose? our goal is to reach a Nash Equilibrium, so $u_{i}(Ο)=0$.
 Intuitively, we are simply taking a weighted average, multiplying our payoff $u_{i}(h)$ of each terminal history $hβZ$ by how likely we are to reach that history using strategy profile $Ο$, given by $Ο_{Ο}(h)$. Note that $β_{hβZ}Ο_{Ο}(h)=1$.
 The expected payoff is different depending on what strategy profile $Ο$
We will also sometimes use the notation $u_{i}(Ο_{i},Ο_{βi})$, which is the expected payoff for player $i$ if all players play according to the strategy profile $Ο=β¨Ο_{i},Ο_{βi}β©$.
 We use this expanded form because we will also add a superscript $t$ in \sigma^\textcolor{orange}{t}_i, so you can obtain the utility at two different times, for example $u_{i}(Ο_{i},Ο_{βi})$
We can also talk about expected payoff $u_{i}(Ο,h)$ at a particular nonterminal history $h$ (this is also sometimes referred to as just expected value at a particular history $h$) $v_{i}(Ο,h)=u_{i}(Ο,h)=β_{zβZ,hβz}Ο_{βi}(h)Ο_{i}(h,z)u_{i}(z)$ We also call the above counterfactual value at nonterminal history $h$, but instead of histories, we are interested in getting the value for an information set $I$.
$v_{i}(Ο,I)=β_{zβZ_{I}}Ο_{βi}(z[I])Ο_{Ο}(z[I],z)u_{i}(z)$
 I am quite confused by why this equation is the way it is, so you should see below for counterfactual value.
Serendipity: This is quite similar to the value functions we see in Reinforcement Learning, where we do a MonteCarlo rollout to sample the rewards.
Some more notes about discrepancies
Note: Notice that we were defining the domain of utility function $u_{i}$ as the set of terminal histories $Z$, but now we define it as a strategy profile.#todo Explain discrepancy.
Nash Equilibrium
Our goal is to find the Nash Equilibrium. This is a strategy profile in which no players can improve by deviating from their strategies.
Formally, in a twoplayer extensive game, a Nash Equilibrium is a strategy profile $Ο$ where the two inequalities are satisfied: ${u_{1}(Ο)β₯max_{Ο_{1}ββ_{1}}u_{1}(Ο_{1},Ο_{2})u_{2}(Ο)β₯max_{Ο_{2}ββ_{2}}u_{2}(Ο_{1},Ο_{2})β$
 For instance, the first equation states for take any strategy $Ο_{1}$ for player $1$ in the set of all strategies $β_{1}$ for player $1$, the expected payoff $u_{1}(Ο)$ for player $1$ in this strategy profile $Ο$ cannot improve
An approximation of a Nash Equilibrium, called $Ο΅$Nash Equilibrium, is a strategy profile where ${u_{1}(Ο)+Ο΅β₯max_{Ο_{1}ββ_{1}}u_{1}(Ο_{1},Ο_{2})u_{2}(Ο)+Ο΅β₯max_{Ο_{2}ββ_{2}}u_{2}(Ο_{1},Ο_{2})β$
Best Response
Given a strategy profile $Ο$, we define a player $i$βs best response $b_{i}$ as $b_{i}(Ο_{βi})=max_{Ο_{i}ββ_{i}}u_{i}(Ο_{i},Ο_{βi})$ In other words, player $i$βs best response is a strategy $Ο_{i}$ that maximizes their expected payoff assuming all other players play according to $Ο$.
Exploitability
Exploitability $Ο΅_{Ο}$ is a measure of how close $Ο$ is to an equilibrium $Ο΅_{Ο}=b_{1}(Ο_{2})+b_{2}(Ο_{1})$ This will be important to measure how close the strategy profile $Ο$ we generate is to the Nash Equilibrium.
Regret
As we will come to understand shortly, CFR is based on the very powerful yet simple idea of minimizing regret over time. Let us first understand what regret is about.
Regret is a measure of how much one regrets not having chosen an action. It is the difference between the utility/reward of that action and the action we actually chose, with respect to the fixed choices of other players.
The overall average regret of player $i$ at time $T$ is $R_{i}=T1βmax_{Ο_{i}ββ_{i}}β_{t=1}(u_{i}(Ο_{i},Ο_{βi})βu_{i}(Ο_{t}))$
 where $Ο_{i}$ is the strategy used by player $i$ on round $t$
 $u_{i}(Ο_{i},Ο_{βi})$ is utility of a strategy profile with $Ο_{i}$ and $Ο_{βi}$
 Think of $Ο_{β}$ as the βoptimalβ strategy profile, just like Optimal Value Function in RL
Notice, however, that we need to find $Ο_{i}ββ_{i}$ which maximizes this difference in utilities. This is the difficultyβ¦ we donβt know this $Ο_{i}$ unless we loop over all possible strategies, which is computationally intractable. CFR will comes up with a solution to represent a tractable regret.
 Reinforcement Learning has a similar idea when trying to find the Optimal Policy
We also define the average strategy $Ο_{i}$ for player $i$ from time $1$ to $T$. This is the final strategy would be used. For each information set $IβI_{i}$, for each $aβA(I)$, we define $Ο_{i}(aβ£I)=β_{t=1}Ο_{i}(I)β_{t=1}Ο_{i}(I)Ο_{t}(aβ£I)β$ So for a given player $i$, we want a regret minimizing algorithm such that $R_{i}β0$ as $Tββ$.
Connection between Regret, Average Strategy and Nash Equilibrium
There is this connection between regret $R$, average strategies $Ο$, and Nash Equilibrium, which is extremely important and going to help us unlock the power of CFR!
Theorem 1: In a zerosum game, if $R_{iβ{1,2}}β€Ο΅$, then $Ο_{T}$ is a $2Ο΅$ equilibrium.
#todo Try to prove this. You can do it
Counterfactual Regret Minimization
So the question is, how can we figure out this set of strategies such that regret is minimized? The key idea is to decompose this overall regret into a set of additive regret terms, which can be minimized independently. These individual regret terms are called counterfactual regret, and defined on an individual information set $I$.
Counterfactual?
In game theory, a counterfactual is a hypothetical scenario that considers what would have happened if a player had made a different decision at some point in the game. In CFR, we consider these hypothetical situations.
We are now finally getting into the heart of the counterfactual regret minimization (CFR) algorithm.
Let $I$ be an information set of player $i$, and let $Z_{I}$ be the subset of all terminal histories where a prefix of the history is in the information set $I$. for $zβZ_{I}$ let $z[I]$ be that prefix.
 So basically $Z[I]$ is converting the information set $I$ to the history version where we can see both players actions.
We first define the counterfactual value $v_{i}(Ο,I)$ as $v_{i}(Ο,I)=β_{zβZ_{I}}Ο_{βi}(z[I])Ο_{Ο}(z[I],z)u_{i}(z)$
 $Ο_{βi}(z[I])$ is the reach probability of history $z[I]$
 $Ο_{Ο}(z[I],z)$
 $u_{i}(z)$ is the payoff for player $i$ at terminal history $zβZ_{I}$
 Remember that we can only compute the utility for terminal histories, therefore $zβZ_{1}$ must be a terminal history
This is pretty much the same idea as [[notes/Counterfactual Regret Minimization#Expected PayoffExpected Payoff]] which we just talked about above. Think of this as the Expected Value of a particular information set $I$. The greater $v_{i}(Ο,I)$ is, the greater we expect the money we make at this information set.
 Similar to the idea of Value Function for RL
We use counterfactual values to update counterfactual regrets.
The immediate counterfactual regret is given by $R_{i,imm}(I)=max_{aβA(I)}R_{i,imm}(I,a)$ where $R_{i,imm}(I,a)=T1ββ_{t=1}(v_{i}(Ο_{(Iβa)},I)βv_{i}(Ο_{t},I))$
 In other words, the immediate counterfactual regret $R_{i,imm}(I)$ is the maximum immediate counterfactual regret out of all possible actions $aβA(I)$
 $Ο_{(Iβa)}$ is the strategy profile identical to $Ο$, except that player $i$ always chooses action $a$ at information set $I$
The reason we are interested in the immediate counterfactual regret is because of this key insight that is going to be powering CFR.
Theorem 2: $R_{i}β€β_{IβI_{i}}R_{i,imm}(I)$
 Note that $x_{+}=max(x,0)$
This tells us that the overall regret is bounded by the sum of counterfactual regret. Thus, if we want to minimize our overall regret $R_{i}$, we can simply minimize our cumulative immediate counterfactual regret. And because of theorem 1, we know that minimizing $R_{i}$ allows us to converge to a $Ο΅$Nash Equilibrium!
We now have all the pieces of the puzzle. How can we update our strategy profile $Ο$ such that the cumulative counterfactual regret is minimized over time?
We use the following update rule which applies Blackwellβs Approachability Theorem to regret (called Regret Matching): $Ο_{i}(I,a)=β©β¨β§ββ_{aβA(I)}R_{i}(I,a)R_{i}(I,a)ββ£A(I)β£1βββ_{aβA(I)}R_{i}(I,a)>0otherwiseβ$
 where $R_{i}(I,a)=max(R_{i}(I,a),0)$
 In other words, actions are selected in proportion to the amount of positive counterfactual regret for not playing that action. If no actions have any positive counterfactual regret, then the action is selected randomly.
 Since we cannot calculate $R_{i}$, we use $R_{imm}(I,a)$ in practice
If a player $i$ selects uses a strategy $Ο_{i}$ according to the equation above, then we can guarantee the following.
Theorem 3: $R_{i}β€R_{i,imm}(I)β€Ξ_{u,i}Tββ£A_{i}β£ββ$
Intuition
Okay, Iβve explained a lot of the above, but then how this translates in code is not super intuitive. The pseudocode I referenced was from a CFR tutorial, the original paper does not have anything like that.
Pseudocode for vanilla CFR.
Essentially, you initialize your strategy profile to a random uniform probability of choosing an action for each Information Set.
Then, you can start iterating the algorithm, updating the counterfactual value by selfplay, the value updates are propagated through the terminal histories, and weighted by this reach probability (determined by our current strategy).
The strategy profile is then selected proportional the regret using Regret Matching.
MonteCarlo CFR
With Vanilla CFR, each iteration requires us to traverse an entire game tree since we update $Ο_{i}(I,a)$ for every single information set $I$ and every action $a$ on each iteration.
However, this is simply too slow with large games such as NoLimit Heads Up Texas HoldβEm Poker. MCCFR can help the speedup convergence to an approximate Nash Equilibrium.
With MCCFR, we avoid traversing the entire game tree on each iteration while still having the immediate counterfactual regrets be unchanged in expectation. We do this by restricting the terminal histories that we consider on each iteration.
Let $Q={Q_{1},β¦,Q_{r}}$ be a set of subsets of terminal histories $Z$ (i.e. $Q_{j}βZ$), such that their union spans the set $Z$ (i.e. ($Q_{1}βͺβ―βͺQ_{r})=Z$.
Each of these subsets is called a block $Q_{j}$.
On each iteration, we will sample one of these blocks and only consider that block. Let $q_{j}>0$ be the probability of considering $Q_{j}$ for the current iterations (where $β_{j=1}q_{j}=1$).
The probability $q(z)$ of considering terminal history $z$ on the current iteration is given by $q(z)=β_{j:zβQ_{j}}q_{j}$ Then, the sampled counterfactual value when updating block $j$ is $v~(Ο,Iβ£j)=β_{zβZ_{I}β©Q_{j}}q(z)1βu_{i}(z)Ο_{βi}(z[I])Ο_{Ο}(z[I],z)$
The paper proves that the expectations of the sampled counterfactual value is equal to the counterfactual value $E_{jβΌq_{j}}[v~(Ο,Iβ£j)]=v_{i}(Ο,I)$
So in the immediate counterfactual regret, we can use the following $R_{i,imm}(I,a)=T1ββ_{t=1}(v~_{i}(Ο_{(Iβa)},I)βv~_{i}(Ο_{t},I))$
Note to self
I am still having trouble going from the math equations to a full on algorithm that does this.
There is also a speedup to this that I am not aware about.
Old notes
Explanation of notations:
 $Ο$ is a strategy profile which consists of a strategy $Ο_{i}$ for each player $i$
 $Ο_{i}$ is the strategy for player $i$ at time $t$ (assigns probability distributions over $A(I)$)
 $A$ is the set of all game actions
 $I$ is an Information Set
 The is the information that a player can see. Distinction between State and Set
 $A(I)$ is the set of legal actions for information set $I$
 History $h$ is the sequence of actions from the root of the game
 $Ο_{Ο}(h)$ is the reach probability of history $h$ if all players choose to play according to strategy $Ο$
 This is where βchance samplingβ comes into play, because we consider reach probabilities to more efficiently train our algorithm
More notation:

$Z$ denotes the set of all terminal game histories (sequences from root to leaf)

$u_{i}(z)$ is the utility of player $i$ at terminal history $zβZ$

The counterfactual value at nonterminal history $h$ is $v_{i}(Ο,h)=β_{zβZ,hβz}Ο_{βi}(h)Ο_{i}(h,z)u_{i}(z)$
We use counterfactual values to update counterfactual regrets.
The counterfactual regret of not acting action $a$ at history $h$ is $r(h,a)=v_{i}(Ο_{Iβa},h)βv_{i}(Ο,h)$
 where $Ο_{Iβa}$ is the profile $Ο$ except that at $I$, action $a$ is always taken
The counterfactual regret of not taking action $a$ at information set $I$ is just the sum of histories, so $r(I,a)=hβIββr(h,a)$
 You know the counterfactual values since this is a selfplay algorithm
 $r_{i}(I,a)$ is the regret of player $i$ at time $t$ for not taking action $a$ at information set $I$ (measures how much player $i$ would rather play action $a$ at information set $I$)
 The cumulative counterfactual regret is $R_{i}(I,a)=β_{t=1}r_{i}(I,a)$
We use the cumulative counterfactual regret to update the current strategy profile via Regret Matching: $Ο_{i}(I,a)=β©β¨β§ββ_{aβA(I)}R_{i}(I,a)R_{i}(I,a)ββ£A(I)β£1βββ_{aβA(I)}R_{i}(I,a)>0otherwiseβ$
 where $R_{i}(I,a)=max(R_{i}(I,a),0)$
In Games of larger state spaces β For more complex games, where there are quintillions of information sets, this is too large to iterate over for convergence of mixed strategies.
The key idea is that we can approximate information sets, by using imperfect recall. See Game Abstraction
Theoretical Analysis (Regret Bounds)
Notes for me explaining the poker AI
My thoughts: CFR finds the best response to the opponentβs strategy. Over time, this converges to the Nash Equilibrium.