Regret Matching
We use regret matching to solve normal-form games such as Rock-Paper-Scissors (RPS). Blackwell's approachability theorem, when applied to minimizing regret, gives an algorithm known as regret matching. In regret matching, an agent selects its actions at random, with a probability distribution proportional to its positive regrets.
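To make that rule concrete, here is a minimal sketch of computing a regret-matched strategy. It reuses the `regretSum` and `NUM_ACTIONS` names from the training snippet below; `strategySum` is an assumed accumulator that we will use later to average strategies:

```python
import numpy as np

NUM_ACTIONS = 3                      # rock, paper, scissors
regretSum = np.zeros(NUM_ACTIONS)    # cumulative regret for each action
strategySum = np.zeros(NUM_ACTIONS)  # cumulative strategy weights (assumed; used for averaging later)

def getStrategy() -> np.ndarray:
    """Return a mixed strategy proportional to positive cumulative regrets."""
    global strategySum
    strategy = np.maximum(regretSum, 0.0)   # negative regrets get zero probability
    total = strategy.sum()
    if total > 0:
        strategy = strategy / total          # normalize positive regrets into probabilities
    else:
        strategy = np.full(NUM_ACTIONS, 1.0 / NUM_ACTIONS)  # no positive regret: play uniformly
    strategySum += strategy                  # accumulate so we can average strategies later
    return strategy
```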
I implemented regret matching in Python. At a high level, you mainly implement `train`, which breaks down into three steps:
- Get regret-matched mixed-strategy actions (i.e. play a game)
  - The strategy you play is based on the regret you have, computed from `regretSum`
- Compute action utilities (i.e. compute the regrets of those games)
- Accumulate action regrets (i.e. accumulate those regrets) into `regretSum`
```python
def train(iterations: int):
    actionUtility = np.zeros(NUM_ACTIONS)
    for i in range(iterations):
        # 1. Get regret-matched mixed-strategy actions
        strategy = getStrategy()
        myAction = getAction(strategy)
        otherAction = getAction(oppStrategy)

        # 2. Compute action utilities: with actions ordered rock, paper, scissors,
        # the action one step after otherAction beats it (+1), one step before loses (-1)
        actionUtility[otherAction] = 0
        actionUtility[(otherAction + 1) % NUM_ACTIONS] = 1
        actionUtility[(otherAction - 1) % NUM_ACTIONS] = -1

        # 3. Accumulate action regrets: how much better each action would have
        # done than the action actually played
        for a in range(NUM_ACTIONS):
            regretSum[a] += actionUtility[a] - actionUtility[myAction]
```
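The snippet above also relies on two names it does not define. As an assumption for the sake of a runnable sketch, `getAction` can sample from a mixed strategy and `oppStrategy` can be any fixed opponent mixed strategy (the specific numbers here are purely illustrative):

```python
def getAction(strategy: np.ndarray) -> int:
    """Sample an action index according to the given mixed strategy."""
    return int(np.random.choice(NUM_ACTIONS, p=strategy))

# Assumed fixed opponent strategy; any probability vector over the actions works.
oppStrategy = np.array([0.4, 0.3, 0.3])
```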
Regret sums themselves are highly erratic. What converges to a minimal-regret strategy is the average strategy across all iterations. This is why we use `avgStrategy` at the very end when we want the final strategy.
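Here is a sketch of that final step, assuming the `strategySum` accumulator from the earlier `getStrategy` sketch, followed by an example of how the pieces fit together:

```python
def avgStrategy() -> np.ndarray:
    """Normalize the accumulated strategies into the average strategy."""
    total = strategySum.sum()
    if total > 0:
        return strategySum / total
    return np.full(NUM_ACTIONS, 1.0 / NUM_ACTIONS)  # nothing accumulated yet: fall back to uniform

train(100_000)
print(avgStrategy())  # the averaged mixed strategy, which is what you actually play
```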