<!-- indifference, make worst hand indifferent, otherwise would have different worst hand --> <h1 id="game-theory--game-theory-foundation">Game Theory – Game Theory Foundation</h1> <p>Let’s look at some important game theory concepts before we get into actually solving for poker strategies.</p> <h2 id="game-theory-optimal-gto">Game Theory Optimal (GTO)</h2> <p>What does it mean to “solve” a poker game? In the 2-player setting, this means to find a <strong>Nash Equilibrium strategy</strong> (aka GTO strategy) for the game. By definition, if both players are playing this strategy, then neither would want to change to a different strategy, since neither could do better with any other strategy (assuming that the opponent’s strategy stays fixed).</p> <p>To break this down, if players A and B are both playing GTO, then both are doing as well as possible. If player A changes strategy, then player A is doing worse and player B is doing better. Player B could do <em>even better</em> by changing strategy to exploit player A’s new strategy, but then player A could take advantage of this change. If Player B stays put with GTO, then EV is not maximized, but there is no risk of being exploited. In this sense, GTO is an unexploitable strategy that gets a guaranteed minimum EV.</p> <p>With more than 2 players, Nash equilibria are still guaranteed to exist, but algorithms that work to solve 2-player games are not guaranteed to converge to the equilibrium strategy, though the strategies they produce still tend to be strong. 
Even if you did know a Nash equilibrium strategy for your own position in a multiplayer game, other players could either (a) play strategies that weaken your expected value or (b) play their own Nash equilibrium strategies, where the resulting combination of strategies is not guaranteed to be a Nash equilibrium.</p> <p>In practice, even in the 2-player setting, we have to approximate GTO strategies in full-sized poker games. We will go more into the details of what it means to solve a game in section 3.1, “What is Solving?”</p> <p>Intuition for equilibrium in poker can be explained using a simple all-in game where one player must either fold or bet all his chips and the second player must either call or fold if the first player bets all the chips. We’ll refer to these two players as the “all-in player” and the “calling player”. We assume each player starts with 10 big blinds; P1 (the all-in player) puts in a small blind of 0.5 units and P2 (the calling player) puts in a big blind of 1 unit. There are three possible outcomes:</p> <table> <thead> <tr> <th>Scenarios</th> <th>Player 1 (SB)</th> <th>Player 2 (BB)</th> <th>Result</th> </tr> </thead> <tbody> <tr> <td>Case 1</td> <td>Fold</td> <td>–</td> <td>P2 wins 0.5 BB</td> </tr> <tr> <td>Case 2</td> <td>All-in</td> <td>Fold</td> <td>P1 wins 1 BB</td> </tr> <tr> <td>Case 3</td> <td>All-in</td> <td>Call</td> <td>Winner of showdown wins 10 BB (pot size 20 BB)</td> </tr> </tbody> </table> <p>In this situation, the calling player might begin the game with a default strategy of calling a low percentage of hands. An alert all-in player might exploit this by going all-in with a large range of hands.</p> <!-- **ICMIZER of this with EV** --> <p>After seeing the first player go all-in very frequently, the calling player might increase his calling range.</p> <!-- **ICMIZER of this with EV** --> <p>Once the all-in player observes this, it could lead him to reduce his all-in percentage. 
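</p>

<p>This back-and-forth adjustment can be made concrete in code. The sketch below is a toy illustration, not a real poker solver: hands are idealized as independent uniform strengths on [0, 1], strategies as simple “play any hand above a threshold” rules, and EVs are estimated by Monte Carlo over a fixed sample. Alternating best responses on a grid of thresholds settles down in the way described here:</p>

```python
import numpy as np

STACK, SB_POST, BB_POST, POT = 10.0, 0.5, 1.0, 20.0  # in big blinds

# Fixed sample of idealized "hand strengths" so EV estimates are deterministic.
rng = np.random.default_rng(0)
N = 100_000
SB_HAND, BB_HAND = rng.random(N), rng.random(N)

def sb_ev(jam_above, call_above):
    """Small blind's average profit when he jams hands above jam_above
    and the big blind calls hands above call_above."""
    ev = np.full(N, -SB_POST)                   # SB folds, losing his 0.5
    jam = SB_HAND >= jam_above
    ev[jam] = BB_POST                           # jam and BB folds: win 1 BB
    called = jam & (BB_HAND >= call_above)
    ev[called] = np.where(SB_HAND[called] > BB_HAND[called],
                          POT - STACK, -STACK)  # showdown: +10 or -10
    return ev.mean()

GRID = np.linspace(0.0, 1.0, 51)
jam, call = 1.0, 1.0                # both players start maximally tight
for _ in range(10):                 # alternate best responses
    jam = max(GRID, key=lambda t: sb_ev(t, call))
    call = max(GRID, key=lambda t: -sb_ev(jam, t))  # zero-sum: BB minimizes SB's EV
print(f"thresholds after best-response play: jam above {jam:.2f}, call above {call:.2f}")
```

<p>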
Once the all-in range of hands and the calling range stabilize such that neither player can unilaterally change his strategy to increase his profit, then the equilibrium strategies have been reached.</p> <p>A <strong>strategy</strong> in game theory is the set of actions one will take at every decision point. In the all-in game, there is only one decision for each player, so the entire strategy is the range of hands to go all-in with for Player 1 and the range of hands to call with for Player 2. A real strategy in poker would say what to do with every possible hand in every situation. For example, it might say that when you have 98 suited on the button and it’s folded to you, you should raise 2.5x the pot 90% of the time and go all-in 10% of the time.</p> <p>We can use the ICMIZER program to compute the game theory optimal strategies in a 1v1 setting where both players start the hand with 10 big blinds. In this case, the small blind all-in player goes all-in 58% of the time and the big blind calling player calls 37% of the time.</p> <!-- **ICMIZER of this with EV** --> <p>If either player changed those percentages, then their EV would go down! If the calling player called more hands (looser), those hands wouldn’t be profitable. If the calling player called fewer hands (tighter), then he would be folding too much. If the all-in player went looser, those extra hands wouldn’t be profitable, and if he went tighter, then he would be folding too much.</p> <p>Why, intuitively, is the all-in player’s range so much wider than the calling player’s? David Sklansky coined the term “gap concept”, which states that a player needs a better hand to call with than to go all-in with – that difference is the gap. 
The main reasons for this are (a) the all-in player has the initiative and can force a fold, while the calling player can only call, (b) the all-in player is signaling the strength of his hand, and (c) when facing an all-in bet, the pot odds are not especially appealing.</p> <h2 id="exploitation">Exploitation</h2> <p>What if the big blind calling player doesn’t feel comfortable calling with weaker hands like K2s and Q9o and tightens his calling range beyond the equilibrium range of 37%? The game theoretic solution strategy for the other player would not fully take advantage of this opportunity. The <strong>best response strategy</strong> is the one that maximally exploits the opponent by always performing the highest expected value play against their fixed strategy. In general, an exploitative strategy is one that exploits an opponent’s non-equilibrium play. In the above example, an exploitative play could be raising with all hands after seeing the opponent calling with a low percentage of hands. However, this strategy can itself be exploited. When both players are playing a Nash equilibrium, they are playing best response strategies against each other.</p> <!-- **table of EV vs. looser and tighter opponents compared to GTO and possible loss (pg 87 of Modern Poker Theory)** --> <h2 id="normal-form">Normal Form</h2> <p>Normal Form is writing the <strong>strategies</strong> and game <strong>payouts</strong> in matrix form. The Player 1 strategies are in the rows and Player 2 strategies are in the columns. The payouts are written in terms of P1, P2.</p> <h3 id="zero-sum-all-in-poker-game">Zero-Sum All-in Poker Game</h3> <p>We can model the all-in game in normal form as below. Assume that each player looks at his/her hand and settles on an action; the below chart is then the result of those actions, with the first number being Player 1’s <strong>payout</strong> and the second being Player 2’s. 
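</p>

<p>Given an assumed equity for the all-in player when called, the payouts of such a chart can be generated mechanically. A small sketch (the function name and defaults are our own; the 37.5% equity figure matches the JT-offsuit-versus-AK-offsuit example worked through with the tables that follow):</p>

```python
def allin_normal_form(equity, stack=10.0, sb=0.5, bb=1.0):
    """Payouts (P1, P2) in big blinds for each pure-action pair of the
    toy all-in game: P1 may go all-in or fold, P2 may call or fold."""
    pot = 2 * stack
    ev_call = equity * pot - stack     # P1's profit when the jam is called
    return {
        ("all-in", "call"): (ev_call, -ev_call),  # zero-sum showdown
        ("all-in", "fold"): (bb, -bb),            # P1 picks up the big blind
        ("fold", "call"): (-sb, sb),              # P1 surrenders his small blind
        ("fold", "fold"): (-sb, sb),
    }

# A ~37.5% equity underdog, roughly JT offsuit against AK offsuit:
matrix = allin_normal_form(0.375)
print(matrix[("all-in", "call")])   # (-2.5, 2.5)
```

<p>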
In general, normal form matrices show <strong>utilities</strong> for each player in a game, i.e., the value each player assigns to each outcome; in poker settings, these are the payouts from the hand.</p> <p>Note that, e.g., the calling player cannot call when the all-in player folds, but we assume the actions are pre-selected and the payouts still remain the same.</p> <p>In a one-on-one poker game, the sum of the payouts in each box is 0 since whatever one player wins, the other loses; this is called a <strong>zero-sum game</strong> (not including the house commission, aka rake).</p> <table> <thead> <tr> <th>All-in Player/Call Player</th> <th>Call</th> <th>Fold</th> </tr> </thead> <tbody> <tr> <td>All-in</td> <td>EV of all-in, -EV of all-in</td> <td>1, -1</td> </tr> <tr> <td>Fold</td> <td>-0.5, 0.5</td> <td>-0.5, 0.5</td> </tr> </tbody> </table> <p>If Player 1 has JT offsuit and Player 2 has AK offsuit, the numbers are as below. The all-in call scenario has -2.5 for Player 1 and 2.5 for Player 2 because the hand odds are about 37.5% for Player 1 and 62.5% for Player 2, meaning that Player 1’s equity in a \$20 pot is about \$7.50 and Player 2’s equity is about \$12.50, so the net expected profit is -\$2.50 and \$2.50, respectively.</p> <table> <thead> <tr> <th>All-in Player/Call Player</th> <th>Call</th> <th>Fold</th> </tr> </thead> <tbody> <tr> <td>All-in</td> <td>-2.5, 2.5</td> <td>1, -1</td> </tr> <tr> <td>Fold</td> <td>-0.5, 0.5</td> <td>-0.5, 0.5</td> </tr> </tbody> </table> <p>Because in poker the hands are hidden, there would be no way to actually know the all-in/call EV in advance, but we show this to understand how the normal form looks.</p> <h3 id="simple-2-action-game">Simple 2-Action Game</h3> <table> <thead> <tr> <th>P1/P2</th> <th>Action 1</th> <th>Action 2</th> </tr> </thead> <tbody> <tr> <td>Action 1</td> <td>5, 3</td> <td>4, 0</td> </tr> <tr> <td>Action 2</td> <td>3, 2</td> <td>1, -1</td> </tr> </tbody> </table> <p>In the Player 1 Action 1 and 
Player 2 Action 1 slot, we have (5, 3), which represents P1 = 5 and P2 = 3. I.e., if these actions are taken, Player 1 wins 5 units and Player 2 wins 3 units.</p> <h4 id="dominated-strategies">Dominated Strategies</h4> <p>A dominated strategy is one that is strictly worse than an alternative strategy. Let’s find the equilibrium strategies for this game by using <strong>iterated elimination of dominated strategies</strong>.</p> <p>If Player 2 plays Action 1, then Player 1 gets a payout of 5 with Action 1 or 3 with Action 2. Therefore Player 1 prefers Action 1 in this case.</p> <p>If Player 2 plays Action 2, then Player 1 gets a payout of 4 with Action 1 or 1 with Action 2. Therefore Player 1 prefers Action 1 again in this case.</p> <p>This means that whatever Player 2 does, Player 1 prefers Action 1 and therefore can eliminate Action 2 entirely since it would never make sense to play Action 2. We can say Action 1 dominates Action 2 or Action 2 is dominated by Action 1.</p> <p>We can repeat the same process for Player 2. When Player 1 plays Action 1, Player 2 prefers Action 1 (3&gt;0). When Player 1 plays Action 2, Player 2 prefers Action 1 (2&gt;-1). Even though we already established that Player 1 will never play Action 2, Player 2 doesn’t know that so needs to evaluate that scenario.</p> <p>We see that Player 2 will also always play Action 1 and has eliminated Action 2.</p> <p>Therefore we have an <strong>equilibrium</strong> at (5,3) and no player would want to deviate or else they would have a lower payout!</p> <h3 id="3-action-game">3-Action Game</h3> <p>In the Player 1 Action 1 and Player 2 Action 1 slot, we have (10, 2), which represents P1 = 10 and P2 = 2. I.e. 
if these actions are taken, Player 1 wins 10 units and Player 2 wins 2 units.</p> <table> <thead> <tr> <th>P1/P2</th> <th>Action 1</th> <th>Action 2</th> <th>Action 3</th> </tr> </thead> <tbody> <tr> <td>Action 1</td> <td>10, 2</td> <td>8, 1</td> <td>3, -1</td> </tr> <tr> <td>Action 2</td> <td>5, 8</td> <td>4, 0</td> <td>-1, 1</td> </tr> <tr> <td>Action 3</td> <td>7, 3</td> <td>5, -1</td> <td>0, 3</td> </tr> </tbody> </table> <p>Given this table, how can we determine the best actions for each player? Again, P1 is represented by the rows and P2 by the columns.</p> <p>We can see that Player 1’s strategy of Action 1 dominates Actions 2 and 3 because all of the values are strictly higher for Action 1. Regardless of Player 2’s action, Player 1’s Action 1 always has better results than Action 2 or 3.</p> <p>When P2 chooses Action 1, P1 earns 10 with Action 1, 5 with Action 2, and 7 with Action 3. When P2 chooses Action 2, P1 earns 8 with Action 1, 4 with Action 2, and 5 with Action 3. When P2 chooses Action 3, P1 earns 7 with Action 1, 5 with Action 2, and 0 with Action 3.</p> <p>We also see that Action 1 dominates Action 2 for Player 2. Action 1 gets payouts of 2 or 8 or 3 depending on Player 1’s action, while Action 2 gets payouts of 1 or 0 or -1, so Action 1 is always superior.</p> <p>Action 1 <strong>weakly</strong> dominates Action 3 for Player 2. This means that playing Action 1 always yields a payout greater than <strong>or equal</strong> to playing Action 3. In the case that Player 1 plays Action 3, Player 2’s Action 1 and Action 3 both result in a payout of 3 units.</p> <p>We can eliminate strictly dominated strategies and then arrive at the reduced Normal Form game. Recall that Player 1 would never play Actions 2 or 3 because Action 1 is always better. 
Similarly, Player 2 would never play Action 2 because Action 1 is always better.</p> <table> <thead> <tr> <th>P1/P2</th> <th>Action 1</th> <th>Action 3</th> </tr> </thead> <tbody> <tr> <td>Action 1</td> <td>10, 2</td> <td>3, -1</td> </tr> </tbody> </table> <p>In this case, Player 2 prefers to play Action 1 since 2 &gt; -1, so we have a Nash Equilibrium with both players playing Action 1 100% of the time (also known as a <strong>pure strategy</strong>) and the payouts will be 10 to Player 1 and 2 to Player 2. The issue with Player 2’s Action 1 having a tie with Action 3 when Player 1 played Action 3 was resolved because we now know that Player 1 will never actually play that action and when Player 1 plays Action 1, Player 2 will always prefer Action 1 to Action 3.</p> <table> <thead> <tr> <th>P1/P2</th> <th>Action 1</th> </tr> </thead> <tbody> <tr> <td>Action 1</td> <td>10, 2</td> </tr> </tbody> </table> <p>To summarize, Player 1 always plays Action 1 because it dominates Actions 2 and 3. When Player 1 is always playing Action 1, it only makes sense for Player 2 to also play Action 1 since it gives a payoff of 2 compared to payoffs of 1 and -1 with Actions 2 and 3, respectively.</p> <h3 id="tennis-vs-power-rangers">Tennis vs. Power Rangers</h3> <p>In this game, we have two people who are going to watch something together. P1 has a preference to watch tennis and P2 prefers Power Rangers. If they don’t agree, then they won’t watch anything and will have payouts of 0. If they do agree, then the person who gets to watch their preferred show has a higher reward than the other, but both are positive.</p> <table> <thead> <tr> <th>P1/P2</th> <th>Tennis</th> <th>Power Rangers</th> </tr> </thead> <tbody> <tr> <td>Tennis</td> <td>3, 2</td> <td>0, 0</td> </tr> <tr> <td>Power Rangers</td> <td>0, 0</td> <td>2, 3</td> </tr> </tbody> </table> <p>In this case, neither player can eliminate a strategy. 
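</p>

<p>The elimination procedure used above can be automated. A sketch (payoffs as nested lists, rows for P1 and columns for P2; the function name is our own) that repeatedly deletes strictly dominated rows and columns:</p>

```python
def iesds(p1, p2):
    """Iterated elimination of strictly dominated strategies.
    p1[i][j], p2[i][j] are the payouts when P1 plays row i and P2 plays
    column j. Returns the surviving row and column indices."""
    rows, cols = list(range(len(p1))), list(range(len(p1[0])))
    changed = True
    while changed:
        changed = False
        for a in rows[:]:       # is row a strictly dominated by another row?
            if any(all(p1[b][j] > p1[a][j] for j in cols)
                   for b in rows if b != a):
                rows.remove(a)
                changed = True
        for a in cols[:]:       # is column a strictly dominated?
            if any(all(p2[i][b] > p2[i][a] for i in rows)
                   for b in cols if b != a):
                cols.remove(a)
                changed = True
    return rows, cols

# The 3-action game from above:
p1 = [[10, 8, 3], [5, 4, -1], [7, 5, 0]]
p2 = [[2, 1, -1], [8, 0, 1], [3, -1, 3]]
print(iesds(p1, p2))   # ([0], [0]) – only (Action 1, Action 1) survives
```

<p>Running the same function on the Tennis/Power Rangers matrices returns every strategy, matching the observation that neither player can eliminate anything in that game.</p>

<p>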
For Player 1, if Player 2 chooses Tennis then he also prefers Tennis, but if Player 2 chooses Power Rangers, then he prefers Power Rangers as well (both of these outcomes are Nash equilibria). This is intuitive (assuming that the payouts are valid, i.e., that the people really like TV) because there is 0 value in watching nothing but at least some value if both agree to watch one thing. This also shows the Nash equilibrium principle of not being able to benefit from <strong>unilaterally</strong> changing strategies – if both are watching tennis and P2 changes to Power Rangers, that change would reduce his value from 2 to 0!</p> <p>So what is the optimal strategy here? If each player simply picked their preference, then they’d always watch nothing and get 0! If they both always picked their non-preference, then the same thing would happen! If they pre-agreed to either Tennis or Power Rangers, then utilities would increase, but this would never be “fair” to either person.</p> <p>We can calculate the optimal strategies like this:</p> <p>Let’s define:</p> $\text{P(P1 Tennis)} = p$ $\text{P(P1 Power Rangers)} = 1 - p$ <p>These represent the probability that Player 1 would select each of these.</p> <p>If Player 2 chooses Tennis, Player 2 earns $$p*(2) + (1-p)*(0) = 2p$$. The EV is calculated as Player 1’s action probabilities multiplied by Player 2’s payouts from playing Tennis.</p> <p>If Player 2 chooses Power Rangers, Player 2 earns $$p*(0) + (1-p)*(3) = 3 - 3p$$</p> <p>We are trying to find a strategy that involves mixing between both options, a <strong>mixed strategy</strong>. A fundamental rule is that if you are going to play multiple strategies, then the utility value of each must be the same, because otherwise you would just pick one and stick with that.</p> <p>Therefore we can set these values equal to each other, so</p> $2p = 3 - 3p$ $5p = 3$ $p = 3/5$ <p>Therefore Player 1’s strategy is to choose Tennis $$p = 3/5$$ and Power Rangers $$1 - p = 2/5$$. 
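</p>

<p>We can verify the indifference in a couple of lines; a quick sketch using exact fractions to avoid floating-point rounding:</p>

```python
from fractions import Fraction

p = Fraction(3, 5)                  # P(P1 chooses Tennis)

# P2's payoff from each pure action against P1's mix
ev_tennis = p * 2 + (1 - p) * 0
ev_power_rangers = p * 0 + (1 - p) * 3

print(ev_tennis, ev_power_rangers)  # 6/5 6/5 – P2 is indifferent
```

<p>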
This is a mixed strategy equilibrium because there is a probability distribution over which strategy to play.</p> <p>This result comes about because these are the probabilities for P1 that induce P2 to be indifferent between Tennis and Power Rangers.</p> <p>By symmetry, P2’s strategy is to choose Tennis $$2/5$$ and Power Rangers $$3/5$$.</p> <p>This means that each player is choosing his/her preferred program $$3/5$$ of the time, while choosing the other option $$2/5$$ of the time. Let’s see how the final outcomes look.</p> <p>Tennis, Tennis occurs $$3/5 * 2/5 = 6/25$$</p> <p>Power Rangers, Power Rangers occurs $$2/5 * 3/5 = 6/25$$</p> <p>Tennis, Power Rangers occurs $$3/5 * 3/5 = 9/25$$</p> <p>Power Rangers, Tennis occurs $$2/5 * 2/5 = 4/25$$</p> <p>These probabilities are shown below (this is not a normal form matrix because we are showing probabilities and not payouts):</p> <table> <thead> <tr> <th>P1/P2</th> <th>Tennis</th> <th>Power Rangers</th> </tr> </thead> <tbody> <tr> <td>Tennis</td> <td>6/25</td> <td>9/25</td> </tr> <tr> <td>Power Rangers</td> <td>4/25</td> <td>6/25</td> </tr> </tbody> </table> <p>The average payouts to each player are $$6/25 * (3) + 6/25 * (2) = 30/25 = 1.2$$. This would have been higher if they had avoided the 0,0 payouts! Unfortunately $$9/25 + 4/25 = 13/25$$ of the time, the payouts were 0 to each player. Coordinating to watch <em>something</em> rather than so often watching nothing would be a much better solution!</p> <p>What if Player 1 decided to be sneaky and changed his strategy to always choosing Tennis instead of $$3/5$$ Tennis and $$2/5$$ Power Rangers? Remember that there can be no benefit to deviating from a Nash Equilibrium strategy by definition. 
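</p>

<p>This claim is easy to check numerically. A sketch that computes expected payoffs for any pair of mixed strategies in this game (again using exact fractions):</p>

```python
from fractions import Fraction as F

# Payoff matrices: rows = P1's action (Tennis, Power Rangers), cols = P2's.
U1 = [[3, 0], [0, 2]]
U2 = [[2, 0], [0, 3]]

def expected(u, p1_tennis, p2_tennis):
    """Expected payoff of matrix u under independent mixed strategies."""
    probs1 = [p1_tennis, 1 - p1_tennis]
    probs2 = [p2_tennis, 1 - p2_tennis]
    return sum(probs1[i] * probs2[j] * u[i][j]
               for i in range(2) for j in range(2))

eq1, eq2 = F(3, 5), F(2, 5)            # the equilibrium mixes
print(expected(U1, eq1, eq2))          # 6/5 – P1's equilibrium payoff
print(expected(U1, F(1), eq2))         # 6/5 – "always Tennis" gains P1 nothing
print(expected(U2, F(1), eq2))         # 4/5 – but it costs P2
```

<p>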
If he tries this, then we have the following likelihoods, since P1 is never choosing Power Rangers and so the probabilities are determined strictly by P2’s strategy of $$2/5$$ Tennis and $$3/5$$ Power Rangers:</p> <table> <thead> <tr> <th>P1/P2</th> <th>Tennis</th> <th>Power Rangers</th> </tr> </thead> <tbody> <tr> <td>Tennis</td> <td>2/5</td> <td>3/5</td> </tr> <tr> <td>Power Rangers</td> <td>0</td> <td>0</td> </tr> </tbody> </table> <p>The (Tennis, Power Rangers) outcome, which occurs $$3/5$$ of the time, has zero payoff, and the (Tennis, Tennis) outcome gives P1 a payoff of $$2/5 * 3 = 6/5 = 1.2$$. This is the same as the payout he was already getting. Note that deviating from the equilibrium <em>can</em> maintain the same payoff, but cannot improve the payoffs. In the zero-sum case, the opponent also can only do better or equal when a player deviates, but in this case Player 2 actually has a lower payoff of $$2/5 * 2 = 0.8$$ instead of $$1.2$$.</p> <p>However, P2 might catch on to this and then get revenge by pulling the same trick and changing strategy to always selecting Power Rangers, resulting in the following probabilities:</p> <table> <thead> <tr> <th>P1/P2</th> <th>Tennis</th> <th>Power Rangers</th> </tr> </thead> <tbody> <tr> <td>Tennis</td> <td>0</td> <td>1</td> </tr> <tr> <td>Power Rangers</td> <td>0</td> <td>0</td> </tr> </tbody> </table> <p>Now the probability is fully on P1 picking Tennis and P2 picking Power Rangers, and nobody gets anything!</p> <h3 id="rock-paper-scissors">Rock Paper Scissors</h3> <p>Finally, we can also think about this concept in Rock-Paper-Scissors. Let’s define a win as +1, a tie as 0, and a loss as -1. 
The game matrix for the game is shown below in Normal Form:</p> <table> <thead> <tr> <th>P1/P2</th> <th>Rock</th> <th>Paper</th> <th>Scissors</th> </tr> </thead> <tbody> <tr> <td>Rock</td> <td>0, 0</td> <td>-1, 1</td> <td>1, -1</td> </tr> <tr> <td>Paper</td> <td>1, -1</td> <td>0, 0</td> <td>-1, 1</td> </tr> <tr> <td>Scissors</td> <td>-1, 1</td> <td>1, -1</td> <td>0, 0</td> </tr> </tbody> </table> <p>As usual, Player 1 is the row player and Player 2 is the column player. The payouts are written in terms of P1, P2. So for example P1 Paper and P2 Rock corresponds to a reward of +1 for P1 and -1 for P2 since Paper beats Rock.</p> <p>The equilibrium strategy is to play each action with $$1/3$$ probability. We can see this intuitively because if any player played anything other than this distribution, then you could crush them by always playing the strategy that beats the strategy that they most favor. For example, if someone played rock 50%, paper 25%, and scissors 25%, they would be overplaying rock, so you could always play paper and would then win 50% of the time, tie 25% of the time, and lose 25% of the time for an average gain of $$1*0.5 + 0*0.25 + (-1)*0.25 = 0.25$$ each game.</p> <table> <thead> <tr> <th>P1/P2</th> <th>Rock 50%</th> <th>Paper 25%</th> <th>Scissors 25%</th> </tr> </thead> <tbody> <tr> <td>Rock 0%</td> <td>0</td> <td>0</td> <td>0</td> </tr> <tr> <td>Paper 100%</td> <td>0.5*1 = 0.5</td> <td>0.25*0 = 0</td> <td>0.25*(-1) = -0.25</td> </tr> <tr> <td>Scissors 0%</td> <td>0</td> <td>0</td> <td>0</td> </tr> </tbody> </table> <p>We can also work it out mathematically. Let P1 play Rock with probability $$r$$, Paper with probability $$p$$, and Scissors with probability $$s$$. The utility of P2 playing Rock is then $$0*(r) + -1 * (p) + 1 * (s)$$. The utility of P2 playing Paper is $$1 * (r) + 0 * (p) + -1 * (s)$$. 
The utility of P2 playing Scissors is $$-1 * (r) + 1 * (p) + 0 * (s)$$.</p> <p>We can figure out the best strategy with this system of equations (the second equation below is because all probabilities must add up to 1):</p> $\begin{cases} -p + s = r - s = -r + p \\ r + p + s = 1 \end{cases}$ $-p + s = r - s \implies 2s = p + r$ $r - s = - r + p \implies 2r = s + p$ $-p + s = -r + p \implies s + r = 2p$ $r + s + p = 1 \implies r + s = 1 - p$ <p>Setting the last two equations equal, we have:</p> $1 - p = 2p$ $1 = 3p$ $p = 1/3$ <p>Rewriting the final equation:</p> $r + s + p = 1$ $s + p = 1 - r$ <p>Using the above combined with the 2nd equation:</p> $1 - r = 2r$ $1 = 3r$ $1/3 = r$ <p>Writing the probabilities-sum-to-1 equation with the results for $$p$$ and $$r$$:</p> $1/3 + 1/3 + s = 1$ $s = 1/3$ <p>The equilibrium strategy is therefore to play each action with 1/3 probability.</p> <p>If your opponent plays the equilibrium strategy of Rock 1/3, Paper 1/3, Scissors 1/3, then any action you take has EV $$1*(1/3) + 0*(1/3) + (-1)*(1/3) = 0$$. Note that in Rock Paper Scissors, if you play the equilibrium then you can never show a profit because you will always break even, regardless of what your opponent does. In poker, this is not the case.</p> <h2 id="regret">Regret</h2> <p>When I think of regret related to poker, the first thing that comes to mind is often “Wow you should’ve played way more hands in 2010 when poker was so easy”. Often in poker we regret big folds or bluffs or calls that didn’t work out well (even though poker players in general are good at not being very results oriented).</p> <iframe src="https://player.vimeo.com/video/265401201?title=0&amp;byline=0&amp;portrait=0" width="640" height="468" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen=""></iframe> <p>Here we will look at a less sad version, the mathematical concept of regret. Regret is a measure of how well you could have done compared to some alternative. 
Phrased differently, regret measures how much better you would have done by playing some alternative action in a given situation.</p> <p>$$\text{Regret} = \text{u(Alternative Strategy)} - \text{u(Current Strategy)}$$ where $$u$$ represents utility</p> <p>If your current strategy for breakfast is cooking eggs at home, then maybe $$\text{u(Current Home Egg Strategy)} = 5$$. If you have an alternative of eating breakfast at a fancy buffet, then maybe $$\text{u(Alternative Buffet Strategy)} = 9$$, so the regret for not eating at the buffet is $$9 - 5 = 4$$. If your alternative is getting a quick meal from McDonald’s, then you might value $$\text{u(Alternative McDonald's Strategy)} = 2$$, so regret for not eating at McDonald’s is $$2 - 5 = -3$$. We prefer alternative actions with high regret.</p> <p>We can give another example from Rock Paper Scissors:</p> <p>We play rock and opponent plays paper $$\implies \text{u(rock,paper)} = -1$$</p> $\text{Regret(scissors)} = \text{u(scissors,paper)} - \text{u(rock,paper)} = 1-(-1) = 2$ $\text{Regret(paper)} = \text{u(paper,paper)} - \text{u(rock,paper)} = 0-(-1) = 1$ $\text{Regret(rock)} = \text{u(rock,paper)} - \text{u(rock,paper)} = -1-(-1) = 0$ <p>We play scissors and opponent plays paper $$\implies \text{u(scissors,paper)} = 1$$</p> $\text{Regret(scissors)} = \text{u(scissors,paper)} - \text{u(scissors,paper)} = 1-1 = 0$ $\text{Regret(paper)} = \text{u(paper,paper)} - \text{u(scissors,paper)} = 0-1 = -1$ $\text{Regret(rock)} = \text{u(rock,paper)} - \text{u(scissors,paper)} = -1-1 = -2$ <p>We play paper and opponent plays paper $$\implies \text{u(paper,paper)} = 0$$</p> $\text{Regret(scissors)} = \text{u(scissors,paper)} - \text{u(paper,paper)} = 1-0 = 1$ $\text{Regret(paper)} = \text{u(paper,paper)} - \text{u(paper,paper)} = 0-0 = 0$ $\text{Regret(rock)} = \text{u(rock,paper)} - \text{u(paper,paper)} = -1-0 = -1$ <p>Again, we prefer alternative actions with high regret.</p> <p>To generalize for the Rock Paper Scissors case:</p> <ul> <li>The action played always gets a regret of 0 
since the “alternative” is really just that same action</li> <li>When we play a tying action, the alternative losing action gets a regret of -1 and the alternative winning action gets a regret of +1</li> <li>When we play a winning action, the alternative tying action gets a regret of -1 and the alternative losing action gets a regret of -2</li> <li>When we play a losing action, the alternative winning action gets a regret of +2 and the alternative tying action gets a regret of +1</li> </ul> <h3 id="bandits">Bandits</h3> <p>A common way to analyze regret is the multi-armed bandit problem. The setup is a player sitting in front of a multi-armed “bandit” with some number of arms. (Think of this as sitting in front of a bunch of slot machines.) Each time the player pulls an arm, they get some reward, which could be positive or negative. Bandits are a set of problems with repeated decisions and a fixed number of possible actions. This is related to reinforcement learning because the agent (the player) updates its strategy based on the feedback it gets from the environment.</p> <p>Multi-armed bandit problems are a common representation of regret, and an algorithm is called “no-regret” if its total regret grows sublinearly in time, i.e., if its average regret per timestep goes to zero. In the adversarial setting, the opponent chooses the reward. This is the setting that we see in poker games, since the opponent’s actions influence our utility in the game.</p> <p>In the full information setting, the player sees the reward vector of every arm after each play; in the partial setting, the player sees only the reward of the arm actually pulled. Here we will focus on the partial setting.</p> <p>A basic setting initializes each of 10 arms with $$q_*(\text{arm}) = \mathcal{N}(0, 1)$$, so each is initialized with a center point drawn from the Gaussian distribution. 
Each pull of an arm then gets a reward of $$R = \mathcal{N}(q_*(\text{arm}), 1)$$.</p> <p>To clarify, this means each arm is initialized with a value centered around 0 but with some variance, so each will be a bit different. Then from that point, the actual pull of an arm is centered around that new point with some variance, as seen in this figure with a 10-armed bandit from Reinforcement Learning: An Introduction by Sutton and Barto:</p> <p><img src="../assets/section2/gametheory/banditsetup.png" alt="Bandit setup" /></p> <p>In simple terms, each machine has some set value that isn’t completely fixed at that value, but rather varies slightly around it, so a machine with a value of 3 might range from 2.5 to 3.5.</p> <p>Imagine that the goal is to play this game 2000 times with the intention of achieving the highest rewards. We can only learn about the rewards by pulling the arms – we don’t have any information about the distribution behind the scenes. We maintain an average reward per pull for each arm as a guide for which arm to pull in the future.</p> <p><strong>Greedy</strong> The most basic algorithm to score well is to pull each arm once and then forever pull the arm that performed the best in the sampling stage. This could be modified by pulling each arm $$n$$ times and then pulling the best arm, while allowing “best arm” to be updated if the average reward for the current best arm becomes worse than another arm’s. If multiple actions have the same value, then select either the lowest index or one of the ties at random.</p> <p><strong>Epsilon Greedy</strong> $$\epsilon$$-Greedy works similarly to Greedy, but instead of <strong>always</strong> picking the best arm, we use an $$\epsilon$$ value that defines how often we should randomly pick a different arm. We keep track of which arm is the current best arm before each pull according to the average reward per pull, then play that arm $$1-\epsilon$$ of the time and play a random arm $$\epsilon$$ of the time. 
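</p>

<p>A minimal sketch of this loop on the Gaussian testbed described above (setting $$\epsilon = 0$$ recovers pure Greedy; the helper name and seed are our own choices):</p>

```python
import random

def run_bandit(epsilon, arms=10, pulls=1000, seed=0):
    """One epsilon-greedy run on the 10-armed Gaussian testbed;
    returns the average reward per pull."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0, 1) for _ in range(arms)]  # hidden arm values
    q_est = [0.0] * arms                             # running average reward
    n_pulls = [0] * arms
    total = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(arms)                        # explore
        else:
            arm = max(range(arms), key=q_est.__getitem__)    # exploit
        reward = rng.gauss(q_star[arm], 1)                   # noisy payout
        n_pulls[arm] += 1
        q_est[arm] += (reward - q_est[arm]) / n_pulls[arm]   # incremental mean
        total += reward
    return total / pulls

for eps in (0.0, 0.01, 0.1, 0.5):
    print(eps, round(run_bandit(eps), 2))
```

<p>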
For example, if $$\epsilon$$ were 0.1, then we’d pick the currently known best arm 90% of the time and a random arm 10% of the time.</p> <p>The idea of usually picking the best arm and sometimes switching to a random one is the concept of <strong>exploration vs. exploitation</strong>. Think of this in the context of picking a travel destination or picking a restaurant. You are likely to get a very high “reward” by continuing to go to a favorite vacation spot or restaurant, but it’s also useful to explore other options that you could end up preferring. (Note to self: Think about not eating the same exact meals every day.)</p> <p>This idea is also seen in online advertising, where an advertiser might create a few ads for the same product, then show a random one for each user on the site. They could then monitor which ones get the most clicks and therefore which ones tend to work better than the others. Should they greedily “exploit” by using the best-ranked ad, should they continue accumulating data, or should they use some kind of $$\epsilon$$-Greedy method that mostly uses the best ad but also sometimes shows other ones to accumulate more data?</p> <p><strong>Bandit Regret</strong> The goal of the agent playing this game is to get the best reward. This is done by pulling the best arm repeatedly, but which arm is best is of course not known to the agent in advance. 
We can define the average regret as</p> $\text{Regret}_t = \frac{1}{t} \sum_{\tau=1}^t (V^* - Q(a_\tau))$ <p>where $$V^*$$ is the fixed reward from the best action, $$Q(a_\tau)$$ is the reward from selecting arm $$a_\tau$$ at timestep $$\tau$$, and $$t$$ is the total number of timesteps.</p> <p>So at each timestep, we are computing the difference between the reward of the best possible action (which, for the sake of the calculation, we assume we know) and the reward of the action that we actually took, and then taking the average.</p> <p>In other words, this is the average of how much worse we have done than the best possible action over the number of timesteps.</p> <p>So if the best action would give a value of 5 and our rewards on our first 3 pulls were {3, 5, 1}, our regrets would be {5-3, 5-5, 5-1} = {2, 0, 4}, for an average of $$(2+0+4)/3 = 2$$. So maximizing rewards is equivalent to minimizing regret.</p> <p>Note that earlier we said that we prefer alternative actions with high regret, and with regret matching (discussed later) we play actions in proportion to their regrets, while here we are trying to minimize regret. These are two views of the same quantity: high regret on an alternative action is a signal to play that action more in the future, and doing so drives down the average regret of our overall strategy.</p> <p>For values of $$\epsilon = 0$$ (greedy), $$\epsilon = 0.01$$, $$\epsilon = 0.1$$, and $$\epsilon = 0.5$$ and using the setup described above, we did a simulation that averaged 2,000 runs of 1,000 timesteps each.</p> <p><img src="../assets/section2/gametheory/bandits_avg_reward.png" alt="Bandit average reward" /> <em>Average reward plot</em></p> <p>For the average reward plot, we see that the optimal $$\epsilon$$ amongst those used is 0.1, next best is 0.01, then 0, and then 0.5. 
This shows that some exploration is valuable, but too much (0.5) or too little (0) is not optimal.</p> <p><img src="../assets/section2/gametheory/bandits_avg_regret.png" alt="Bandit average regret" /> <em>Average regret plot</em></p> <p>The average regret plot is the inverse of the reward plot, because regret is the best possible reward minus the actual rewards received, and the goal is to minimize the regret.</p> <p><strong>Upper Confidence Bound (UCB)</strong> There are many algorithms for choosing bandit arms. The last one we'll touch on is called Upper Confidence Bound (UCB). The idea is that even though we have some data about the value of an arm, there is still uncertainty around that estimate, so we take the upper bound of a confidence interval around it to determine which arm to pull next.</p> $A_t = \arg\max_{a}\left(Q_t(a) + c\sqrt{\frac{\log{t}}{N_t(a)}}\right)$ <p>The formula outputs the action to take next. The $$Q_t(a)$$ term represents the exploitation: it is the average value so far of the pulls of action $$a$$ up to time $$t$$. The rest of the equation represents the exploration, and $$c$$ is the exploration constant. The $$t$$ inside the logarithm is the total number of pulls completed so far (not just for a particular action) and the denominator $$N_t(a)$$ is the number of pulls of action $$a$$. For actions that have been pulled less, this exploration term is larger (we are more uncertain about the true value of those actions).</p> <p>The Upper Confidence Bound formula effectively selects the action that has the highest estimated value (exploitation) plus the upper confidence bound exploration term.
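</p> <p>As a sketch in code (the handling of untried arms and the value of the constant $$c$$ are our own choices, not from the text):</p>

```python
import math

def ucb_action(q, n, t, c=2.0):
    # q[a]: average reward of arm a so far; n[a]: times arm a was pulled; t: total pulls so far
    best_arm, best_val = None, float('-inf')
    for a in range(len(q)):
        if n[a] == 0:
            return a  # pull every arm at least once before the formula applies
        val = q[a] + c * math.sqrt(math.log(t) / n[a])  # estimate plus exploration bonus
        if val > best_val:
            best_arm, best_val = a, val
    return best_arm
```

With equal pull counts, the arm with the higher average wins; an arm pulled less often gets a larger bonus and can be selected even with a lower average.<p>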
We can think of this as a "maximum expectation" or "optimism in the face of uncertainty" term: we pull the arm with the highest upper confidence bound value, and after each pull that arm's $$Q_t(a)$$ estimate is updated along with the new values of $$t$$ and $$N_t(a)$$.</p> <h3 id="regret-matching">Regret Matching</h3> <p>The concept of regret matching was developed in 2000 by Hart and Mas-Colell. Regret matching means playing a future strategy in proportion to the accumulated regrets from past actions. As we play, we keep track of the accumulated regrets for each action and then play in proportion to those values. For example, if the total regret values are Rock 5, Paper 10, Scissors 5, then the total regret is 20 and we would play Rock 5/20 = 1/4, Paper 10/20 = 1/2, and Scissors 5/20 = 1/4.</p> <p>It makes sense intuitively to prefer actions with higher regrets because they provide higher utility, as shown in the prior section. So why not always play the highest-regret action, or play it with some epsilon? Because playing in proportion to the regrets allows us to keep testing all of the actions, while still more often playing the actions that have the higher chance of being best. It could be that at the beginning, the opponent happened to play Scissors 60% of the time even though their long-run strategy is to play it much less. We wouldn't want to exclusively play Rock in this case; we'd want to keep our strategy more robust.
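</p> <p>The proportional rule from the Rock Paper Scissors example can be sketched as follows (a minimal helper of our own, not code from the tutorial):</p>

```python
def regret_matching_strategy(regrets):
    # Keep only the positive part of each accumulated regret
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No positive regret: fall back to the uniform strategy
    n = len(regrets)
    return [1.0 / n] * n

print(regret_matching_strategy([5, 10, 5]))  # [0.25, 0.5, 0.25]
```

<p>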
It also means that we aren't as predictable.</p> <p>The regret matching algorithm works like this:</p> <ol> <li>Initialize the regret for each action to 0</li> <li>Set the strategy as: $$\text{strategy}_{i} = \begin{cases} \frac{R_{i}^{+}}{\sum_{k=1}^{n}R_{k}^{+}}, &amp; \mbox{if at least one regret is positive} \\ \frac{1}{n}, &amp; \mbox{if no regret is positive} \end{cases}$$ where $$R_{i}^{+} = \max(R_{i}, 0)$$ and $$n$$ is the number of actions</li> <li>Accumulate regrets after each game and update the strategy</li> </ol> <p>Let's consider Player 1 playing a fixed rock paper scissors strategy of Rock 40%, Paper 30%, Scissors 30% and Player 2 playing using regret matching. So Player 1 is playing almost the equilibrium strategy, but a little bit biased in favor of Rock.</p> <p>Let's look at a sequence of plays in this scenario that were generated randomly.</p> <table> <thead> <tr> <th>P1</th> <th>P2</th> <th>New Regrets [R,P,S]</th> <th>New Total Regrets [R,P,S]</th> <th>Strategy [R,P,S]</th> <th>P2 Profits</th> </tr> </thead> <tbody> <tr> <td>S</td> <td>S</td> <td>[1,0,-1]</td> <td>[1,0,-1]</td> <td>[1,0,0]</td> <td>0</td> </tr> <tr> <td>P</td> <td>R</td> <td>[0,1,2]</td> <td>[1,1,1]</td> <td>[1/3, 1/3, 1/3]</td> <td>-1</td> </tr> <tr> <td>S</td> <td>P</td> <td>[2,0,1]</td> <td>[3,1,2]</td> <td>[1/2, 1/6, 1/3]</td> <td>-2</td> </tr> <tr> <td>P</td> <td>R</td> <td>[0,1,2]</td> <td>[3,2,4]</td> <td>[1/3, 2/9, 4/9]</td> <td>-3</td> </tr> <tr> <td>R</td> <td>S</td> <td>[1,2,0]</td> <td>[4,4,4]</td> <td>[1/3,1/3,1/3]</td> <td>-4</td> </tr> <tr> <td>R</td> <td>R</td> <td>[0,1,-1]</td> <td>[4,5,3]</td> <td>[1/3,5/12,1/4]</td> <td>-4</td> </tr> <tr> <td>P</td> <td>P</td> <td>[-1,0,1]</td> <td>[3,5,4]</td> <td>[1/4,5/12,1/3]</td> <td>-4</td> </tr> <tr> <td>S</td> <td>P</td> <td>[2,0,1]</td> <td>[5,5,5]</td> <td>[1/3, 1/3, 1/3]</td> <td>-5</td> </tr> <tr> <td>R</td> <td>R</td> <td>[0,1,-1]</td> <td>[5,6,4]</td> <td>[1/3, 2/5, 4/15]</td> <td>-5</td> </tr> <tr> <td>R</td> <td>P</td> <td>[-1,0,-2]</td> <td>[4,6,2]</td> <td>[1/3,1/2,1/6]</td> <td>-4</td> </tr> </tbody>
</table> <p>In the long run, we know that P2 can win a large amount by always playing Paper to exploit P1's overuse of Rock. The expected value of always playing Paper is $$1*0.4 + 0*0.3 + (-1)*0.3 = 0.1$$ per game, and indeed after 10 games the regret matching strategy has already become biased in favor of playing Paper, as we see in the final row where the Paper strategy is listed as 1/2 or 50%.</p> <p>Depending on the run and how the regrets accumulate, regret matching can figure this out almost immediately or it can take some time. Below are two sample runs of 10,000 games each in this scenario.</p> <p>The plots show the current strategy and average strategy over time for each of rock (green), paper (purple), and scissors (blue). These are on a 0 to 1 scale on the left axis. The black line measures the profit (aka rewards) on the right axis. The horizontal axis is the number of games played. The top plot shows how the algorithm can sometimes "catch on" very fast and almost immediately switch to always playing paper, while the second shows it taking about 1,500 games to figure that out.</p> <p><img src="../assets/section2/gametheory/rps_fast1.png" /></p> <p><img src="../assets/section2/gametheory/rps_slow1.png" /></p> <!-- CFR+ thing? --> <h3 id="regret-in-poker">Regret in Poker</h3> <p>The regret matching method is at the core of selecting actions in the algorithms used to solve poker games. We will go into more detail in the CFR Algorithm section. In brief, each unique state of the game has a regret counter for each action, and the strategy at each game state is determined by regret matching as the regrets get updated.</p>AIPT Section 2.2: Game Theory – Trees in Games (2021-02-03) https://aipokertutorial.com/trees-in-games<h1 id="game-theory--trees-in-games">Game Theory – Trees in Games</h1> <p>Many games can be solved using the minimax algorithm for exploring a tree and determining the best move from each position.
Unfortunately, poker is not one of those games.</p> <h2 id="basic-tree">Basic Tree</h2> <p>Take a look at the game tree below. The circular nodes represent player positions and the lines represent possible actions. The "root" of the tree is the initial state at the top. We have P1 acting first, P2 acting second, and the payoffs at the leaf nodes in the standard P1, P2 format.</p> <p>In a poker game, there might be a chance node at the top that deals cards, followed by player decision nodes, and then terminal nodes at the bottom according to the amounts won in the hand.</p> <!--chance at top, card states, actions, terminal node with utility --> <p><img src="../assets/section2/trees/minimax.png" alt="Minimax tree" /></p> <p>The standard way to solve a game tree like this is using <strong>backward induction</strong>, whereby we start with the leaves (i.e. the payoff nodes at the bottom) of the tree and see which decisions the last player, Player 2 in this case, will make at her decision nodes.</p> <p>Player 2's goal is to minimize Player 1's maximum payoff, which in the zero-sum setting is equivalent to minimizing her own maximum loss or maximizing her own minimum payoff. In the zero-sum setting, this minimax play corresponds to a Nash equilibrium.</p> <p>She picks the right node on the left side (payoff -1 instead of -5) and the left node on the right side (payoff 3 instead of -6).</p> <p>These values are then propagated up the tree, so from Player 1's perspective, the value of going left is 1 and of going right is -3. The other leaf nodes are not considered because Player 2 will never choose those. Player 1 then decides to play left to maximize his payoff.</p> <p><img src="../assets/section2/trees/minimax2.png" alt="Minimax tree solved" /></p> <p>We can see all possible payouts in the table below, where the rows are P1's actions and the columns are P2's full strategies, written as her response to P1 choosing Left / her response to P1 choosing Right (e.g. the Left/Right column means P2 plays Left after a P1 Left and Right after a P1 Right).</p> <table> <thead> <tr> <th>P1/P2</th> <th>Left/Left</th> <th>Left/Right</th> <th>Right/Left</th> <th>Right/Right</th> </tr> </thead> <tbody> <tr> <td>Left</td> <td>5,-5</td> <td>5,-5</td> <td>1,-1</td> <td>1,-1</td> </tr> <tr> <td>Right</td> <td>-3,3</td> <td>6,-6</td> <td>-3,3</td> <td>6,-6</td> </tr> </tbody> </table> <p>Note that Player 1 choosing right <em>could</em> result in a higher payout (6) if Player 2 also chose right, but a rational Player 2 would not do that, so Player 1 maximizes his minimum payoff by choosing left (earning a guaranteed value of 1).</p> <p>By working backwards from the end of a game, we can evaluate each possible sequence of moves and propagate the values up the game tree.</p> <p>A <strong>subgame perfect equilibrium</strong> is a strategy profile that induces a Nash equilibrium in every subgame, i.e., at every decision state in the above game tree. The strategy of P1 choosing Left and P2 choosing Right after Left and Left after Right is a subgame perfect equilibrium.</p> <p>Two main problems arise with minimax and backward induction.</p> <h3 id="problem-1-the-game-is-too-damn-large">Problem 1: The Game is Too Damn Large</h3> <p>In theory, we could use the minimax algorithm to solve games like chess. The problem is that the game and the space of possible actions is HUGE. It's not feasible to evaluate all possibilities. The first level of the tree would need to have every possible action, the next level every possible action from each of those actions, and so on. Even checkers is very large, though smaller games like tic tac toe can be solved with minimax. More sophisticated methods and approximation techniques are used in practice for large games.
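</p> <p>The backward-induction computation on the small example tree above can be sketched as follows (a toy encoding of the example using P1's leaf payoffs; the variable names are our own):</p>

```python
# P1's payoff at each leaf, indexed by (P1's action, P2's response)
tree = {
    'left':  {'left': 5, 'right': 1},
    'right': {'left': -3, 'right': 6},
}

# P2 moves last and minimizes P1's payoff at each of her nodes
p1_value = {a: min(responses.values()) for a, responses in tree.items()}
# p1_value == {'left': 1, 'right': -3}

# P1 then picks the action whose propagated value is highest
best = max(p1_value, key=p1_value.get)
print(best, p1_value[best])  # left 1
```

<p>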
One simple method is to only go down the tree to a depth of "X" and then approximate the value of the states there using some sort of heuristic evaluation.</p> <h3 id="problem-2-perfect-information-vs-imperfect-information">Problem 2: Perfect Information vs. Imperfect Information</h3> <p>What about poker? Real poker games like Texas Hold'em are very large and run into the same problem we have with games like chess, but in addition, poker is an <strong>imperfect information game</strong>, while games like chess and tic tac toe are <strong>perfect information games</strong>. The distinction is that in poker there is hidden information: each player's private cards. In perfect information games, all players see all of the information.</p> <p>With perfect information, each player knows exactly which node/state of the game tree he is in. With imperfect information, there is uncertainty about the state of the game because the other player's cards are unknown.</p> <h2 id="poker-tree">Poker Tree</h2> <p>Below we show the game tree for 1-card poker. In brief, it's a 1v1 game where each player starts with \$2 and antes \$1, leaving a single \$1 bet remaining. We'll go into more details about the game in the next section.</p> <p>The top node is a chance node that "deals" the cards. To make it more readable, only 2 chance outcomes are shown: Player 1 dealt Q with Player 2 dealt J, and Player 1 dealt Q with Player 2 dealt K.</p> <p><img src="../assets/section2/trees/infoset2.png" alt="1-card poker game tree" /> <em>1-card poker game tree from the University of Alberta 2015 paper "Heads-Up Limit Hold'em Poker Is Solved"</em></p> <p>Player 1's initial action is to either bet or pass. If Player 1 bets, Player 2 can call or fold. If Player 1 passes, Player 2 can bet or pass. If Player 1 passed and Player 2 bet, then Player 1 can call or fold.</p> <p>Note the nodes that are circled and connected by a line. This means that they are in the same <strong>information set</strong>.
An information set consists of game states that are equivalent based on the information known to the acting player. For example, in the top information set, Player 1 has a Q in both of the shown states, so his actions will be the same in both even though Player 2 could have either a K or J. The information known to Player 1 is "Card Q, starting action". At the later information set, the information known is "Card Q, I pass, opponent bets". All decisions must be made based only on information known to each player! However, these are actually different true game states.</p> <p>Looking at the information set at the bottom, after Player 1 passes and Player 2 bets, Player 1 has the same information in both cases, but calling when Player 2 has a J means winning 2 and calling when Player 2 has a K means losing 2. The payoffs are completely different! We can refer to these as different "worlds". Player 2 would likewise have equivalent states if additional chance branches were shown in which Player 1 holds the J or K cards.</p> <p>Because of this problem, we can't simply propagate values up the tree as we do in perfect information games. Later in the tutorial, we will discuss CFR (counterfactual regret minimization), which is a way to solve games like poker that can't be solved using minimax.</p> <h2 id="tic-tac-toe-tree">Tic Tac Toe Tree</h2> <!-- **Tree goes here** --> <p>On the tic tac toe tree, from the initial state, there are up to 9 levels of moves. Each subsequent level has fewer possible actions since more spaces on the game board are taken as we go down the tree.
The tree ends either where the game is over because one player has won, or where all the spaces are filled with no winner, resulting in a tie.</p> <p>In tic tac toe, the sequence of actions that led to a given board position is not important: different move orders can reach the same state.</p> <h2 id="tic-tac-toe-python-implementations">Tic Tac Toe Python Implementations</h2> <p>While we're mainly focused in this tutorial on poker and imperfect information games, we take a short detour to look more in-depth at minimax and Monte Carlo Tree Search (MCTS) through the lens of tic tac toe.</p> <h3 id="tic-tac-toe-in-python">Tic Tac Toe in Python</h3> <p>Below we show a basic Python class called Tictactoe. The board is initialized with all 0's and each player is represented by a 1 or -1. Those numbers go into board spaces when the associated player makes a move. The class has 5 functions:</p> <ol> <li>make_move: Enters the player's move onto the board if the space is available and advances the play to the next player</li> <li>new_state_with_move: Same as make_move, but returns a copy of the board with the new move instead of modifying the original board</li> <li>available_moves: Lists the moves that are currently available on the board</li> <li>check_result: Checks every possible winning sequence and returns either the winning player's ID if there is a winner, a 0 if the game has ended in a tie, or None if the game is not over yet</li> <li>repr: Used for printing the board.
Empty slots are represented by their number from 0 to 8, player 1 is represented with 'x', player 2 is represented with 'o', and a line break is added as needed after the first 3 and middle 3 positions.</li> </ol> <p>We also have two simple agent classes:</p> <ol> <li>HumanAgent: Enters a move from 0-8 and the move is placed if it's available, otherwise we ask for the move again</li> <li>RandomAgent: Randomly selects a move from the available moves</li> </ol> <p>Finally, we need a main block to actually run the game.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import copy

class Tictactoe:
    def __init__(self, board=None, acting_player=1):
        self.board = board if board is not None else [0] * 9 #Use a None default so instances don't share one board list
        self.acting_player = acting_player

    def make_move(self, move):
        if move in self.available_moves():
            self.board[move] = self.acting_player
            self.acting_player = 0 - self.acting_player #Players are 1 or -1

    def new_state_with_move(self, move): #Return new ttt state with move, but don't change this state
        if move in self.available_moves():
            board_copy = copy.deepcopy(self.board)
            board_copy[move] = self.acting_player
            return Tictactoe(board_copy, 0 - self.acting_player)

    def available_moves(self):
        return [i for i in range(9) if self.board[i] == 0]

    def check_result(self):
        for (a, b, c) in [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]:
            if self.board[a] == self.board[b] == self.board[c] != 0:
                return self.board[a]
        if self.available_moves() == []:
            return 0 #Tie
        return None #Game not over

    def __repr__(self):
        s = ""
        for i in range(9):
            if self.board[i] == 0:
                s += str(i)
            elif self.board[i] == 1:
                s += 'x'
            elif self.board[i] == -1:
                s += 'o'
            if i == 2 or i == 5:
                s += "\n"
        return s
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class HumanAgent:
    def select_move(self, game_state):
        print('Enter your move (0-8): ')
        move = int(float(input()))
        if move in game_state.available_moves():
            return move
        else:
            print('Invalid move, try again')
            return self.select_move(game_state) #Return the retry's result so the chosen move propagates
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

class RandomAgent:
    def select_move(self, game_state):
        return random.choice(game_state.available_moves())
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if __name__ == "__main__":
    ttt = Tictactoe()
    #ttt = Tictactoe([0,0,-1,0,0,0,1,-1,1]) #Optionally can start from a pre-set game position
    player1 = HumanAgent()
    player2 = RandomAgent()
    moves = 0
    while ttt.available_moves():
        print(ttt)
        print('move', moves)
        print('acting player', ttt.acting_player)
        if moves % 2 == 0:
            move = player1.select_move(ttt)
        else:
            move = player2.select_move(ttt)
        ttt.make_move(move)
        result = ttt.check_result()
        if result == 0:
            print('Draw game')
            break
        elif result == 1:
            print('Player 1 wins')
            break
        elif result == -1:
            print('Player 2 wins')
            break
        moves += 1
</code></pre></div></div> <h3 id="minimax-applied-to-tic-tac-toe">Minimax Applied to Tic Tac Toe</h3> <p>We can apply the minimax algorithm to tic tac toe. Here we use a simplified version of minimax called negamax: in a zero-sum game like tic tac toe, the value of a position to one player is the negative of its value to the other player, so a single maximization with sign flips suffices.</p> <p>We store already evaluated states in a memo dictionary mapping each state to its best move and value. When a state has not been seen before, we check the game result, which will either be the winning player, 0 for a tie, or None if the game is not over yet.</p> <p>If the game is over, then there is no move from this position and the value is simply the result of the game.</p> <p>If the game is not over, we iterate through the available moves and recursively find a value for each possible move.
As each move is evaluated, we store the best move and the value for that move and return the overall best after evaluating each move.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">NegamaxAgent</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">memo</span> <span class="o">=</span> <span class="p">{}</span> <span class="c1">#move, value </span> <span class="k">def</span> <span class="nf">negamax</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">game_state</span><span class="p">):</span> <span class="k">if</span> <span class="n">game_state</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">memo</span><span class="p">:</span> <span class="c1">#already visited this state? 
</span> <span class="n">result</span> <span class="o">=</span> <span class="n">game_state</span><span class="p">.</span><span class="n">check_result</span><span class="p">()</span> <span class="k">if</span> <span class="n">result</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span> <span class="c1">#leaf node or end of search </span> <span class="n">best_move</span> <span class="o">=</span> <span class="bp">None</span> <span class="n">best_val</span> <span class="o">=</span> <span class="n">result</span> <span class="o">*</span> <span class="n">game_state</span><span class="p">.</span><span class="n">acting_player</span> <span class="c1">#return 0 for tie or 1 for maximizing win or -1 for minimizing win </span> <span class="k">else</span><span class="p">:</span> <span class="n">best_val</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s">'-inf'</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">game_state</span><span class="p">.</span><span class="n">available_moves</span><span class="p">():</span> <span class="n">clone_state</span> <span class="o">=</span> <span class="n">copy</span><span class="p">.</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">game_state</span><span class="p">)</span> <span class="n">clone_state</span><span class="p">.</span><span class="n">make_move</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="c1">#makes move and switches to next player </span> <span class="n">_</span><span class="p">,</span> <span class="n">val</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">negamax</span><span class="p">(</span><span class="n">clone_state</span><span class="p">)</span> <span class="n">val</span> <span class="o">*=</span> <span 
class="o">-</span><span class="mi">1</span> <span class="k">if</span> <span class="n">val</span> <span class="o">&gt;</span> <span class="n">best_val</span><span class="p">:</span> <span class="n">best_move</span> <span class="o">=</span> <span class="n">i</span> <span class="n">best_val</span> <span class="o">=</span> <span class="n">val</span> <span class="bp">self</span><span class="p">.</span><span class="n">memo</span><span class="p">[</span><span class="n">game_state</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">best_move</span><span class="p">,</span> <span class="n">best_val</span><span class="p">)</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">memo</span><span class="p">[</span><span class="n">game_state</span><span class="p">]</span> </code></pre></div></div> <h2 id="monte-carlo-tree-search-mcts">Monte Carlo Tree Search (MCTS)</h2> <p>MCTS is a more advanced algorithm that estimates the best move through repeated simulation. This algorithm is used as part of some recent advances in AI poker agents as well as in agents for perfect information games like Go and chess. Monte Carlo methods in general use random sampling for problems that are difficult to solve using other approaches.</p> <h3 id="mcts-background">MCTS Background</h3> <p>MCTS allows us to determine a strong move from a game state without having to expand the entire tree as we did in the minimax algorithm. Also, the MCTS algorithm does not require any domain knowledge, making it very versatile and powerful. However, domain knowledge can be used to improve performance by applying known patterns to the simulation policy rather than using default moves.</p> <p>MCTS is primarily effective in games of perfect information and provides no guarantees for imperfect information games. 
In imperfect information games, MCTS must use determinization, which analyzes the game as if the true world states were known. However, this presents serious problems such as strategy fusion. Suppose that the true setting could be either “World 1” or “World 2” (the equivalent in Kuhn Poker is that we have card Q and our opponent has either card J or K). There could be a case where the maximizing player can guarantee a utility of 1 by, for example, always going right (valid in both World 1 and World 2), but if the player chose to go left, then they would have yet another decision to go right or left after Player 2 acted. If the true setting were “World 1”, then going right from this node would result in a utility of 1 and left would be -1. If the true setting were “World 2”, then going right would result in a utility of -1 and left would be 1. With perfect information (if we knew the actual World 1 or World 2 situation), the player could always guarantee the payout of 1 regardless of the initial action, but with imperfect information, the player risks a utility of -1 by taking the non-guaranteed route.</p> <p>This is a problem that prevents the usual minimax formulation from working properly.</p> <p>Still, there are methods to apply MCTS to imperfect information games like poker. Due to the asymmetry of information in imperfect information games, a separate search tree is used for each player. Information State UCT (IS-UCT) is a multi-player version of Partially Observable UCT, which searches trees over histories of information states instead of the histories of observations and actions of Partially Observable Markov Decision Processes (POMDPs). 
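The World 1/World 2 example can be sketched numerically (a minimal sketch using only the hypothetical payoffs from the description above):

```python
# Payoffs after going left, for each hidden world and each second decision.
# Going right at the first decision always pays 1 in both worlds.
payoff_left = {"World 1": {"right": 1, "left": -1},
               "World 2": {"right": -1, "left": 1}}

# Determinization: solve each world separately as if it were known, then
# average. In each single world the player can pick the good branch after
# going left, so this analysis claims left is worth 1 -- as good as right.
det_value = sum(max(branches.values()) for branches in payoff_left.values()) / 2
print(det_value)  # 1.0

# Reality: one strategy must fix the second decision without knowing the
# world, so the best the player can average after going left is 0,
# while always going right still guarantees 1.
real_value = max(
    sum(payoff_left[w][branch] for w in payoff_left) / 2
    for branch in ("right", "left"))
print(real_value)  # 0.0
```

Determinization overvalues going left (1.0 versus the true 0.0); that gap is exactly the strategy-fusion error described above.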
IS-UCT has not been shown to converge in poker games, but an alternative called Smooth IS-UCT was shown to converge in Kuhn poker; both Smooth IS-UCT and regular IS-UCT produced robust results in Limit Hold’em.</p> <p>Although MCTS does not have theoretical convergence guarantees for multiplayer games, it is well defined for such games, unlike CFR methods (although CFR methods have found strong results). In poker, MCTS has been found to work quickly, but it generally finds a suboptimal (although decent) strategy.</p> <p>In most recent applications, the MCTS algorithm is used as part of poker algorithms to estimate state values, but not on its own to solve the game.</p> <h3 id="mcts-applied-to-tic-tac-toe">MCTS Applied to Tic Tac Toe</h3> <p>Suppose we want to find the best tic tac toe move from some pre-specified state of the game. Let’s go through the algorithm step by step.</p> <p>The MCTSAgent class and its select_move function contain the core of the algorithm. The function begins by setting the root of the game tree as an MCTSNode instance. A node is a decision point in the game tree, so the root node is the beginning of the tree. If we wanted to find the best move from the beginning of the game, this would represent an empty tic tac toe board.</p> <p>Each node is initialized with a game state, a parent node, a move, a set of child nodes, a counter for wins by each player, a counter for rollouts that have gone through this node, and a list of available moves from this node.</p> <p>For some fixed number of rounds, we go through the following steps:</p> <ol> <li> <p>Selection: Start from the root (current game state) and select child nodes until a leaf node (a node that has a potential child from which no simulation has yet been initiated) is reached. 
Child nodes are selected by modeling each selection problem as a multi-armed bandit, using the Upper Confidence Bound (UCB) applied to trees to balance exploitation of moves with high average wins against exploration of moves with few simulations.</p> </li> <li> <p>Expansion: If we can add a child node, then we select a random move from the current game state and create a new node to represent the game with this new move, which becomes a child of the prior node. This is done through the add_random_child function in the MCTSNode class.</p> </li> <li> <p>Simulation: Next we run a random playout from the newly expanded node until the game ends.</p> </li> <li> <p>Backpropagation: From the end of the game, we update each node that was passed through by updating the win counts for each player (one player gets +1 and the other -1, or both get 0 in the case of a tie) and adding 1 to the number of rollouts that have passed through each of the nodes.</p> </li> </ol> <p>After running MCTS, we look at each child node from the root and evaluate its win percentage over all of the simulations. 
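The UCT scoring used in the selection step can be written as a small standalone function (a minimal sketch; the exploration constant c here plays the role of the temperature parameter in the full agent code below):

```python
import math

def uct_score(child_wins, child_rollouts, parent_rollouts, c=2.0):
    # UCB applied to trees: average win rate plus an exploration bonus
    # that shrinks as a child accumulates rollouts.
    win_pct = child_wins / child_rollouts
    exploration = math.sqrt(math.log(parent_rollouts) / child_rollouts)
    return win_pct + c * exploration

# An undersampled child can outscore a well-sampled one with a higher
# win rate, so promising-but-untried moves still get explored.
print(uct_score(6, 10, 100) < uct_score(1, 2, 100))  # True
```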
We then print a list of the moves in order of their winning percentage, along with how many simulations were run for each move.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MCTSNode</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">game_state</span><span class="p">,</span> <span class="n">parent</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">move</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span> <span class="o">=</span> <span class="n">parent</span> <span class="bp">self</span><span class="p">.</span><span class="n">move</span> <span class="o">=</span> <span class="n">move</span> <span class="bp">self</span><span class="p">.</span><span class="n">game_state</span> <span class="o">=</span> <span class="n">game_state</span> <span class="bp">self</span><span class="p">.</span><span class="n">children</span> <span class="o">=</span> <span class="p">[]</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="mi">0</span><span class="p">}</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_rollouts</span> <span class="o">=</span> <span class="mi">0</span> <span class="bp">self</span><span class="p">.</span><span class="n">unvisited_moves</span> <span class="o">=</span> <span class="n">game_state</span><span class="p">.</span><span class="n">available_moves</span><span 
class="p">()</span> <span class="k">def</span> <span class="nf">add_random_child</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="n">move_index</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">unvisited_moves</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1">#inclusive </span> <span class="n">new_move</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">unvisited_moves</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="n">move_index</span><span class="p">)</span> <span class="n">new_node</span> <span class="o">=</span> <span class="n">MCTSNode</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">game_state</span><span class="p">.</span><span class="n">new_state_with_move</span><span class="p">(</span><span class="n">new_move</span><span class="p">),</span> <span class="bp">self</span><span class="p">,</span> <span class="n">new_move</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">children</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_node</span><span class="p">)</span> <span class="k">return</span> <span class="n">new_node</span> <span class="k">def</span> <span class="nf">can_add_child</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">unvisited_moves</span><span 
class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">def</span> <span class="nf">is_terminal</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">game_state</span><span class="p">.</span><span class="n">check_result</span><span class="p">()</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">result</span><span class="p">):</span> <span class="k">if</span> <span class="n">result</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">-=</span> <span class="mi">1</span> <span class="k">elif</span> <span class="n">result</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-=</span> <span class="mi">1</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_rollouts</span> <span class="o">+=</span> <span 
class="mi">1</span> <span class="k">def</span> <span class="nf">winning_frac</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">player</span><span class="p">):</span> <span class="k">return</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="n">player</span><span class="p">])</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">num_rollouts</span><span class="p">)</span> <span class="k">class</span> <span class="nc">MCTSAgent</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">num_rounds</span> <span class="o">=</span> <span class="mi">10000</span><span class="p">,</span> <span class="n">temperature</span> <span class="o">=</span> <span class="mi">2</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_rounds</span> <span class="o">=</span> <span class="n">num_rounds</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">=</span> <span class="n">temperature</span> <span class="k">def</span> <span class="nf">uct_select_child</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">node</span><span class="p">):</span> <span class="n">best_score</span> <span class="o">=</span> <span class="o">-</span><span class="nb">float</span><span class="p">(</span><span class="s">'inf'</span><span class="p">)</span> <span class="n">best_child</span> <span class="o">=</span> <span class="bp">None</span> <span class="n">total_rollouts</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span 
class="n">child</span><span class="p">.</span><span class="n">num_rollouts</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">node</span><span class="p">.</span><span class="n">children</span><span class="p">)</span> <span class="n">log_rollouts</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">total_rollouts</span><span class="p">)</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">node</span><span class="p">.</span><span class="n">children</span><span class="p">:</span> <span class="n">win_pct</span> <span class="o">=</span> <span class="n">child</span><span class="p">.</span><span class="n">winning_frac</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">game_state</span><span class="p">.</span><span class="n">acting_player</span><span class="p">)</span> <span class="n">exploration_factor</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">log_rollouts</span> <span class="o">/</span> <span class="n">child</span><span class="p">.</span><span class="n">num_rollouts</span><span class="p">)</span> <span class="n">uct_score</span> <span class="o">=</span> <span class="n">win_pct</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">*</span> <span class="n">exploration_factor</span> <span class="k">if</span> <span class="n">uct_score</span> <span class="o">&gt;</span> <span class="n">best_score</span><span class="p">:</span> <span class="n">best_score</span> <span class="o">=</span> <span class="n">uct_score</span> <span class="n">best_child</span> <span class="o">=</span> <span class="n">child</span> <span class="k">return</span> <span 
class="n">best_child</span> <span class="k">def</span> <span class="nf">select_move</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">game_state</span><span class="p">):</span> <span class="n">root</span> <span class="o">=</span> <span class="n">MCTSNode</span><span class="p">(</span><span class="n">game_state</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">num_rounds</span><span class="p">):</span> <span class="n">node</span> <span class="o">=</span> <span class="n">root</span> <span class="c1">#selection -- UCT select child until we get to a node that can be expanded </span> <span class="k">while</span> <span class="p">(</span><span class="ow">not</span> <span class="n">node</span><span class="p">.</span><span class="n">can_add_child</span><span class="p">())</span> <span class="ow">and</span> <span class="p">(</span><span class="ow">not</span> <span class="n">node</span><span class="p">.</span><span class="n">is_terminal</span><span class="p">()):</span> <span class="n">node</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">uct_select_child</span><span class="p">(</span><span class="n">node</span><span class="p">)</span> <span class="c1">#expansion -- expand from leaf unless leaf is end of game </span> <span class="k">if</span> <span class="n">node</span><span class="p">.</span><span class="n">can_add_child</span><span class="p">():</span> <span class="n">node</span> <span class="o">=</span> <span class="n">node</span><span class="p">.</span><span class="n">add_random_child</span><span class="p">()</span> <span class="c1">#simulation -- complete a random playout from the newly expanded node </span> <span class="n">gs_temp</span> <span class="o">=</span> <span class="n">copy</span><span 
class="p">.</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">game_state</span><span class="p">)</span> <span class="k">while</span> <span class="n">gs_temp</span><span class="p">.</span><span class="n">check_result</span><span class="p">()</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span> <span class="n">gs_temp</span><span class="p">.</span><span class="n">make_move</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">gs_temp</span><span class="p">.</span><span class="n">available_moves</span><span class="p">()))</span> <span class="c1">#backpropagation -- update all nodes from the selection to leaf stage </span> <span class="k">while</span> <span class="n">node</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">gs_temp</span><span class="p">.</span><span class="n">check_result</span><span class="p">())</span> <span class="n">node</span> <span class="o">=</span> <span class="n">node</span><span class="p">.</span><span class="n">parent</span> <span class="n">scored_moves</span> <span class="o">=</span> <span class="p">[(</span><span class="n">child</span><span class="p">.</span><span class="n">winning_frac</span><span class="p">(</span><span class="n">game_state</span><span class="p">.</span><span class="n">acting_player</span><span class="p">),</span> <span class="n">child</span><span class="p">.</span><span class="n">move</span><span class="p">,</span> <span class="n">child</span><span class="p">.</span><span class="n">num_rollouts</span><span class="p">)</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span 
class="n">root</span><span class="p">.</span><span class="n">children</span><span class="p">]</span> <span class="n">scored_moves</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">key</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">for</span> <span class="n">s</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">scored_moves</span><span class="p">[:</span><span class="mi">10</span><span class="p">]:</span> <span class="k">print</span><span class="p">(</span><span class="s">'%s - %.3f (%d)'</span> <span class="o">%</span> <span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">n</span><span class="p">))</span> <span class="n">best_pct</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="n">best_move</span> <span class="o">=</span> <span class="bp">None</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">root</span><span class="p">.</span><span class="n">children</span><span class="p">:</span> <span class="n">child_pct</span> <span class="o">=</span> <span class="n">child</span><span class="p">.</span><span class="n">winning_frac</span><span class="p">(</span><span class="n">game_state</span><span class="p">.</span><span class="n">acting_player</span><span class="p">)</span> <span class="k">if</span> <span class="n">child_pct</span> <span class="o">&gt;</span> <span class="n">best_pct</span><span class="p">:</span> <span class="n">best_pct</span> <span class="o">=</span> 
<span class="n">child_pct</span> <span class="n">best_move</span> <span class="o">=</span> <span class="n">child</span><span class="p">.</span><span class="n">move</span> <span class="k">print</span><span class="p">(</span><span class="s">'Select move %s with avg val %.3f'</span> <span class="o">%</span> <span class="p">(</span><span class="n">best_move</span><span class="p">,</span> <span class="n">best_pct</span><span class="p">))</span> <span class="k">return</span> <span class="n">best_move</span> </code></pre></div></div> <p>MCTS in general works very effectively to simulate play in game trees and was famously combined with neural networks in AlphaGo by DeepMind to create a Go agent that defeated top human players.</p>Game Theory – Trees in Games Many games can be solved using the minimax algorithm for exploring a tree and determining the best move from each position. Unfortunately, poker is not one of those games.AIPT Section 3.2: Solving Poker – Toy Poker Games2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/toy-poker-games<!-- TODO: 1) Game tree for each decision point in analytical version and full one above 3) Tables to show y and x and y and z relative to each other 4) Deviations e.g. player who never bluffs 5) Compare to rule-based 6) That is, there are $$2^64$$ strategy combinations. (??) 7) Lin alg example 8) More on balancing bluffs Rhode Island Hold'em Results with other poker agents playing worse strategies exploitable More graphics --> <h1 id="solving-poker---toy-poker-games">Solving Poker - Toy Poker Games</h1> <p>We will take a look at solving a very simple toy poker game called Kuhn Poker using multiple techniques that are increasingly efficient. 
Starting with Section 4, we will go into the Counterfactual Regret Minimization (CFR) algorithm that has been the standard in solving imperfect information games since 2007.</p> <h2 id="kuhn-poker">Kuhn Poker</h2> <p><strong>Kuhn Poker</strong> is the most basic poker game with interesting strategic implications. We mentioned it in the Poker Background section and will summarize the rules here as well.</p> <p>The game in its standard form is played with 3 cards {A, K, Q} and 2 players. Each player starts with \$2 and places an ante (i.e., a forced bet before the hand) of \$1, and therefore has \$1 left to bet with. Each player is then dealt 1 card and 1 round of betting ensues.</p> <p><img src="../assets/section3/toygames/deputydots.png" alt="The deputy likes dots" /></p> <p>The rules in bullet form:</p> <ul> <li>2 players</li> <li>3 card deck {A, K, Q}</li> <li>Each starts the hand with \$2</li> <li>Each antes (i.e., makes a forced bet of) \$1 at the start of the hand</li> <li>Each player is dealt 1 card</li> <li>Each has \$1 remaining for betting</li> <li>There is 1 betting round and 1 bet size of \$1</li> <li>The highest card is the best (i.e., A$&gt;$K$&gt;$Q)</li> </ul> <p>Action starts with P1, who can Bet \$1 or Check.</p> <ul> <li>If P1 bets, P2 can either Call or Fold</li> <li>If P1 checks, P2 can either Bet or Check</li> <li>If P2 bets after P1 checks, P1 can then Call or Fold</li> </ul> <p>These outcomes are possible:</p> <ul> <li>If a player folds to a bet, the other player wins the pot of \$2 (profit of \$1)</li> <li>If both players check, the player with the highest card wins the pot of \$2 (profit of \$1)</li> <li>If there is a bet and a call, the player with the highest card wins the pot of \$4 (profit of \$2)</li> </ul> <p>The following are all of the possible full betting sequences.</p> <p>The “History full” column shows the exact betting history with “k” for check, “b” for bet, “c” for call, “f” for fold.</p> <p>The “History short” column uses a condensed format that uses only “b” for betting/calling 
and “p” (pass) for checking/folding, meaning that “b” is used when putting \$1 into the pot and “p” when putting no money into the pot. We reference this shorthand format since we’ll use it when putting the game into code.</p> <table> <thead> <tr> <th>P1</th> <th>P2</th> <th>P1</th> <th>Pot size</th> <th>Result</th> <th>History full</th> <th>History short</th> </tr> </thead> <tbody> <tr> <td>Check</td> <td>Check</td> <td>–</td> <td>\$2</td> <td>High card wins \$1</td> <td>kk</td> <td>pp</td> </tr> <tr> <td>Check</td> <td>Bet \$1</td> <td>Call \$1</td> <td>\$4</td> <td>High card wins \$2</td> <td>kbc</td> <td>pbb</td> </tr> <tr> <td>Check</td> <td>Bet \$1</td> <td>Fold</td> <td>\$2</td> <td>P2 wins \$1</td> <td>kbf</td> <td>pbp</td> </tr> <tr> <td>Bet \$1</td> <td>Call \$1</td> <td>–</td> <td>\$4</td> <td>High card wins \$2</td> <td>bc</td> <td>bb</td> </tr> <tr> <td>Bet \$1</td> <td>Fold</td> <td>–</td> <td>\$2</td> <td>P1 wins \$1</td> <td>bf</td> <td>bp</td> </tr> </tbody> </table> <h2 id="solving-kuhn-poker">Solving Kuhn Poker</h2> <p>We’re going to solve for the GTO solution to this game using 3 methods: an analytical solution, a normal form solution, and then a more efficient extensive form solution. We will then briefly mention game trees and the Counterfactual Regret Minimization (CFR) algorithm, which will be detailed more in Section 4.1.</p> <p>What’s the point of solving such a simple game? We can learn some important poker principles even from this game, although they are most useful for beginner players. We can also see the limitations of these earlier solving methods and therefore why new methods were needed to solve games of even moderate size.</p> <h2 id="analytical-solution">Analytical Solution</h2> <p>There are 4 decision points in this game: P1’s opening action, P2 after P1 bets, P2 after P1 checks, and P1 after checking and P2 betting.</p> <h3 id="defining-the-variables">Defining the variables</h3> <p><strong>P1 initial action</strong></p> <p>Let’s first look at P1’s opening action. 
P1 should never bet the K here because if he bets the K, P2 with Q will always fold (since the lowest card can never win) and P2 with A will always call (since the best card will always win). By always checking the K, P1 can try to induce a bluff from P2 when P2 has the Q and may be able to fold to a bet when P2 has the A.</p> <p>Therefore we assign P1’s strategy:</p> <ul> <li>Bet Q: $$x$$</li> <li>Bet K: $$0$$</li> <li>Bet A: $$y$$</li> </ul> <p><strong>P2 after P1 bet</strong></p> <p>After P1 bets, P2 should always call with the A and always fold the Q as explained above.</p> <p>Therefore we assign P2’s strategy after P1 bet:</p> <ul> <li>Call Q: $$0$$</li> <li>Call K: $$a$$</li> <li>Call A: $$1$$</li> </ul> <p><strong>P2 after P1 check</strong></p> <p>After P1 checks, P2 should never bet with the K for the same reason as P1 should never initially bet with the K.</p> <p>P2 should always bet with the A because it is the best hand and there is no bluff to induce by checking (the hand would simply end and P2 would win, but not have a chance to win more by betting).</p> <p>Therefore we assign P2’s strategy after P1 check:</p> <ul> <li>Bet Q: $$b$$</li> <li>Bet K: $$0$$</li> <li>Bet A: $$1$$</li> </ul> <p><strong>P1 after P1 check and P2 bet</strong></p> <p>This case is similar to P2’s actions after P1’s bet. P1 should never call here with the worst hand (Q) and must always call with the best hand (A).</p> <p>Therefore we assign P1’s strategy after P1 check and P2 bet:</p> <ul> <li>Call Q: $$0$$</li> <li>Call K: $$z$$</li> <li>Call A: $$1$$</li> </ul> <p>So we now have 5 different variables $$x, y, z$$ for P1 and $$a, b$$ for P2 to represent the unknown probabilities.</p> <h3 id="solving-for-the-variables">Solving for the variables</h3> <p><strong>The Indifference Principle</strong></p> <p>When we solve for the analytical game theory optimal strategy, we want to make the opponent indifferent between their options. This means that the opponent cannot exploit our strategy. 
If we deviated from this equilibrium, then since poker is a 2-player zero-sum game, our opponent’s EV could increase at the expense of our own EV.</p> <p><strong>Solving for $$x$$ and $$y$$</strong></p> <p>For P1 opening the action, $$x$$ is his probability of betting with Q (bluffing) and $$y$$ is his probability of betting with A (value betting). We want to make P2 indifferent between calling and folding with the K (since again, Q is always a fold and A is always a call for P2).</p> <p>When P2 has K, P1 holds Q or A with probability $$\frac{1}{2}$$ each.</p> <p>P2’s EV of folding with a K to a bet is $$0$$. (Note that we are defining EV from the current decision point, meaning that money already put into the pot is sunk and not factored in.)</p> <p>P2’s EV of calling with a K to a bet</p> $= (3) * \text{P(P1 has Q and bets with Q)} + (-1) * \text{P(P1 has A and bets with A)}$ $= (3) * \frac{1}{2} * x + (-1) * \frac{1}{2} * y$ <p>Setting the calling and folding EVs equal (because of the indifference principle), we have:</p> $0 = (3) * \frac{1}{2} * x + (-1) * \frac{1}{2} * y$ $y = 3 * x$ <p>That is, P1 should bet the A 3 times more often than bluffing with the Q. This result is parametrized, meaning that there isn’t a fixed number solution, but rather a ratio of how often P1 should value-bet compared to bluff.
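</p>

<p>This indifference condition is easy to verify numerically. Below is a quick sketch in Python (the function name is our own, for illustration) of P2’s EV of calling with a K, confirming that any strategy with $$y = 3 * x$$ makes calling and folding equally good:</p>

```python
# P2's EV of calling a bet while holding the K, measured from the decision
# point: P1 holds Q or A with probability 1/2 each; P2 wins 3 when catching
# a bluff (Q) and loses 1 against a value bet (A).
def ev_call_with_k(x, y):
    """x = P(P1 bets the Q), y = P(P1 bets the A)."""
    return 0.5 * x * 3 + 0.5 * y * (-1)

# Folding has EV 0, so P2 is indifferent exactly when y = 3x.
for x in [0.1, 0.2, 1 / 3]:
    assert abs(ev_call_with_k(x, 3 * x)) < 1e-12
```

<p>If P1 bluffs more often than this ratio allows, calling with the K becomes strictly profitable for P2, and vice versa.</p>

<p>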
For example, if P1 bluffs with the Q 10% of the time, he should value bet with the A 30% of the time.</p> <p><strong>Solving for $$a$$</strong></p> <p>$$a$$ is how often P2 should call with a K facing a bet from P1.</p> <p>P2 should call with probability $$a$$ such that P1 is indifferent between bluffing (betting) and checking with card Q.</p> <p>If P1 checks with card Q, P1 will always fold afterwards if P2 bets (because it is the worst card and can never win), so P1’s EV is 0.</p> $\text{EV P1 check with Q} = 0$ <p>If P1 bets with card Q,</p> $\text{EV P1 bet with Q} = (-1) * \text{P2 has A and always calls/wins} + (-1) * \text{P2 has K and calls/wins} + 2 * \text{P2 has K and folds}$ $= \frac{1}{2} * (-1) + \frac{1}{2} * (a) * (-1) + \frac{1}{2} * (1 - a) * (2)$ $= -\frac{1}{2} - \frac{1}{2} * a + (1 - a)$ $= \frac{1}{2} - \frac{3}{2} * a$ <p>Setting the EVs of betting and checking with Q equal, we have:</p> $0 = \frac{1}{2} - \frac{3}{2} * a$ $\frac{3}{2} * a = \frac{1}{2}$ $a = \frac{1}{3}$ <p>Therefore P2 should call $$\frac{1}{3}$$ of the time with a K when facing a bet from P1.</p> <p><strong>Solving for $$b$$</strong></p> <p>Now to solve for $$b$$, how often P2 should bet with a Q after P1 checks.
The indifference for P1 is only relevant when he has a K, since if he has a Q or A, he will always fold or call, respectively.</p> <p>If P1 checks a K and then folds, then:</p> $\text{EV P1 check with K and then fold to bet} = 0$ <p>If P1 checks and calls, we have:</p> $\text{EV P1 check with K and then call a bet} = (-1) * \text{P(P2 has A and always bets)} + (3) * \text{P(P2 has Q and bets)}$ $= \frac{1}{2} * (-1) + \frac{1}{2} * b * (3)$ <p>Setting the EVs of folding and calling equal, we have:</p> $0 = \frac{1}{2} * (-1) + \frac{1}{2} * b * (3)$ $\frac{1}{2} = \frac{1}{2} * b * (3)$ $3 * b = 1$ $b = \frac{1}{3}$ <p>Therefore P2 should bet $$\frac{1}{3}$$ of the time with a Q after P1 checks.</p> <p><strong>Solving for $$z$$</strong></p> <p>The final case is when P1 checks a K, P2 bets, and P1 must decide how frequently to call so that P2 is indifferent to checking vs. betting (bluffing) with a Q.</p> <p>(Note that $$\mid$$ denotes “given that” and we use the conditional probability formula $$\text{P(A} \mid \text{B)} = \frac{P(A \cap B)}{P(B)}$$, where $$\cap$$ denotes the intersection of the sets – by intersection we just mean that $$A$$ and $$B$$ are both true at the same time, like the middle part of a Venn diagram.)</p> <p>We start with finding the probability that P1 has an A given that P1 has checked and P2 has a Q, meaning that P1 has an A or K.</p> $\text{P(P1 has A | P1 checks A or K)} = \frac{\text{P(P1 has A and checks)}}{\text{P(P1 checks A or K)}}$ <p>We simplify the numerator to P1 having A and checking because there is no intersection between checking a K and having an A.</p> $= \frac{(1-y) * \frac{1}{2}}{(1-y) * \frac{1}{2} + \frac{1}{2}}$ $= \frac{1-y}{2-y}$ $\text{P(P1 has K | P1 checks A or K)} = 1 - \text{P(P1 has A | P1 checks A or K)}$ $= 1 - \frac{1-y}{2-y}$ $= \frac{2-y}{2-y} - \frac{1-y}{2-y}$ $= \frac{1}{2-y}$ <p>If P2 checks his Q, his EV $$= 0$$.</p> <p>If P2 bets (bluffs) with his Q, his EV is:</p> $-1 * \text{P(P1 check A then call A)} - 1
* \text{P(P1 check K then call K)} + 2 * \text{P(P1 check K then fold K)}$ $= -1 * \frac{1-y}{2-y} + -1 * z * \frac{1}{2-y} + 2 * (1-z) * \frac{1}{2-y}$ <p>Setting these equal:</p> $0 = -1 * \frac{1-y}{2-y} + -1 * z * \frac{1}{2-y} + 2 * (1-z) * \frac{1}{2-y}$ $0 = -\frac{1-y}{2-y} - z * \frac{3}{2-y} + \frac{2}{2-y}$ $z * \frac{3}{2-y} = \frac{2}{2-y} - \frac{1-y}{2-y}$ $z = \frac{2}{3} - \frac{1-y}{3}$ $z = \frac{y+1}{3}$ <p>So P1’s calling frequency with a K scales with how often he bets an A. For example, if P1 bets the A 50% of the time ($$y=0.5$$), then $$z = \frac{1.5}{3} = 0.5$$ as well.</p> <h3 id="solution-summary">Solution summary</h3> <p>We now have the following result:</p> <p>P1 initial actions:</p> <p>Bet Q: $$x = \frac{y}{3}$$</p> <p>Bet A: $$y = 3*x$$</p> <p>P2 after P1 bet:</p> <p>Call K: $$a = \frac{1}{3}$$</p> <p>P2 after P1 check:</p> <p>Bet Q: $$b = \frac{1}{3}$$</p> <p>P1 after P1 check and P2 bet:</p> <p>Call K: $$z = \frac{y+1}{3}$$</p> <p>P2 has fixed actions, but P1’s are dependent on the $$y$$ parameter.</p> <h3 id="finding-the-game-value">Finding the game value</h3> <p>We can look at the expected value of every possible deal-out to evaluate the value for $$y$$.
We format these EV calculations as $$\text{P(P1 action)} * \text{P(P2 action)} * \text{P(P1 action, if applicable)} * \text{payoff}$$, all from the perspective of P1.</p> <p><strong>Case 1: P1 A, P2 K</strong></p> <ol> <li>Bet fold:</li> </ol> $y * \frac{2}{3} * 2 = \frac{4*y}{3}$ <ol> <li>Bet call:</li> </ol> $y * \frac{1}{3} * 3 = y$ <ol> <li>Check check:</li> </ol> $(1 - y) * 1 * 2 = 2 * (1 - y)$ <p>Total = $$\frac{4*y}{3} + y + 2 * (1 - y) = \frac{y}{3} + 2$$</p> <p><strong>Case 2: P1 A, P2 Q</strong></p> <ol> <li>Bet fold:</li> </ol> $y * 1 * 2 = 2 * y$ <ol> <li>Check bet call:</li> </ol> $(1 - y) * \frac{1}{3} * 1 * 3 = 3 * \frac{1}{3} * (1 - y)$ <ol> <li>Check check:</li> </ol> $(1 - y) * \frac{2}{3} * 2 = 2 * \frac{2}{3} * (1 - y)$ <p>Total = $$2 * y + (1 - y) + \frac{4}{3} * (1-y) = \frac{1}{3} * (7 - y)$$</p> <p><strong>Case 3: P1 K, P2 A</strong></p> <ol> <li>Check bet call:</li> </ol> $(1) * (1) * \frac{y+1}{3} * (-1) = -\frac{y+1}{3}$ <ol> <li>Check bet fold:</li> </ol> $(1) * (1) * (1 - \frac{y+1}{3}) * (0) = 0$ <p>Total = $$-\frac{y+1}{3}$$</p> <p><strong>Case 4: P1 K, P2 Q</strong></p> <ol> <li>Check check:</li> </ol> $(1) * \frac{2}{3} * 2 = 2 * \frac{2}{3}$ <ol> <li>Check bet call:</li> </ol> $(1) * \frac{1}{3} * \frac{y+1}{3} * 3 = \frac{y+1}{3}$ <ol> <li>Check bet fold:</li> </ol> $(1) * \frac{1}{3} * (1 - \frac{y+1}{3}) * 0 = 0$ <p>Total = $$\frac{4}{3} + \frac{y+1}{3} = \frac{y+5}{3}$$</p> <p><strong>Case 5: P1 Q, P2 A</strong></p> <ol> <li>Bet call:</li> </ol> $\frac{y}{3} * 1 * (-1) = \frac{-y}{3}$ <ol> <li>Check bet fold:</li> </ol> $(1 - \frac{y}{3}) * 1 * 1 * (0) = 0$ <p>Total = $$\frac{-y}{3}$$</p> <p><strong>Case 6: P1 Q, P2 K</strong></p> <ol> <li>Bet call:</li> </ol> $\frac{y}{3} * \frac{1}{3} * (-1) = -\frac{y}{9}$ <ol> <li>Bet fold:</li> </ol> $\frac{y}{3} * \frac{2}{3} * 2 = \frac{4*y}{9}$ <ol> <li>Check check:</li> </ol> $(1-\frac{y}{3}) * 1 * (0) = 0$ <p>Total = $$-\frac{y}{9} + \frac{4*y}{9} = \frac{y}{3}$$</p> <p><strong>Summing up the cases</strong> Since each case is
equally likely based on the initial deal, we can multiply each by $$\frac{1}{6}$$ and then sum them to find the EV of the game. Summing up all cases, we have:</p> <p>Overall total = $$\frac{1}{6} * [\frac{y}{3} + 2 + \frac{1}{3} * (7 - y) - \frac{y+1}{3} + \frac{y+5}{3} - \frac{y}{3} + \frac{y}{3}] = \frac{17}{18}$$</p> <h3 id="main-takeaways">Main takeaways</h3> <p>What does this number $$\frac{17}{18}$$ mean? It says that the expectation of the game from the perspective of Player 1 is $$\frac{17}{18}$$. Since this is $$&lt;1$$, the expected gain for Player 1 is $$\frac{17}{18} - 1 = -\frac{1}{18} \approx -0.0556$$: for each \$1 put into the game, Player 1 is expected to get back only $$\frac{17}{18}$$ and so is expected to lose. Therefore the value of the game for Player 2 is $$+\frac{1}{18} \approx +0.0556$$.</p> <p>Every time that these players play a hand against each other (assuming they play the equilibrium strategies), that will be the outcome on average – meaning P1 will lose about \$5.56 on average per 100 hands and P2 will gain that amount. However, since in practice players rotate between being Player 1 and Player 2, both players will be guaranteed to break even if playing the Nash equilibrium.</p> <p>This indicates the advantage of acting last in poker – seeing what the opponent has done first gives an information advantage. In this game, the players would rotate who acts first for each hand, but the principle of playing more hands with the positional advantage is very important in real poker games and is why good players are much looser in later positions at the table.</p> <p>The expected value is not at all dependent on the $$y$$ variable, which defines how often Player 1 bets his A hands.
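</p>

<p>We can sanity-check this independence directly by coding up the six case totals derived above (a small sketch; the function name is illustrative):</p>

```python
# Expected value of Kuhn Poker for Player 1 (in the payoff convention used
# above), as a function of y = P(P1 bets the A), summing the six equally
# likely deals.
def game_value(y):
    cases = [
        y / 3 + 2,       # Case 1: P1 A, P2 K
        (7 - y) / 3,     # Case 2: P1 A, P2 Q
        -(y + 1) / 3,    # Case 3: P1 K, P2 A
        (y + 5) / 3,     # Case 4: P1 K, P2 Q
        -y / 3,          # Case 5: P1 Q, P2 A
        y / 3,           # Case 6: P1 Q, P2 K
    ]
    return sum(cases) / 6  # each deal has probability 1/6

# The y terms cancel, so the value is 17/18 for any valid y.
for y in (0.0, 0.25, 0.5, 1.0):
    assert abs(game_value(y) - 17 / 18) < 1e-12
```

<p>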
If we assumed that the pot was not a fixed size of \$2 to start the hand, then it would be optimal for P1 to either always bet or always check the A (the math above would change and the result would depend on $$y$$), but we’ll stick with the simple case of the pot always starting at \$2 from the antes.</p> <p>From a poker strategy perspective, the main takeaway is that we can essentially split our hands into:</p> <ol> <li>Strong hands</li> <li>Mid-strength hands</li> <li>Weak hands</li> </ol> <p>Mid-strength hands can win, but don’t want to build the pot. Strong hands generally try to make the pot large with value bets (though they can also be used deceptively). Weak hands want to either give up or be used as bluffs. There is a major polarization effect where strong and weak hands have similarities and mid-strength hands are played passively.</p> <p>Note that this mathematically optimal solution automatically uses bluffs. Bluffs are not -EV bets that are used as “bad plays” to get more credit for value bets later; they are part of an overall optimal strategy.</p> <p>We also see that a major component of poker strategy is “balancing” bluffs. We see that P1 value bets 3 times more than she bluffs. In a real poker setting, you might have a similar strategy, but will have many possible bluff hands in your range to choose from, which means that they can be strategically selected to match the ratio, for example by bluffing with hands that make it less likely that your opponent is strong, while giving up with other weak hands.</p> <p>Finally, there are many cases where the probabilities are 0 or 1. Often, these represent obvious actions where the player is definitely winning or losing so has no incentive to do anything else. For example, the “Queen, bet (after bet)” (i.e. calling a bet with the worst hand) would not make sense because he can’t be winning, and the “Ace, pass (after pass)” (i.e. not betting with the best hand when there’s no action to come) is the reverse, because he must be winning.</p> <p>One interesting case of 0 probability is “King, bet”, because if the first acting player bets with a King, he will certainly be called and lose to an Ace and will certainly force the inferior Queen to fold, therefore this action should never be taken. This illustrates the poker concept of not betting middling hands, since there is a high probability that only better hands will call and you will force worse hands to fold.</p> <h2 id="kuhn-poker-in-normal-form">Kuhn Poker in Normal Form</h2> <p>Analytically solving all but the smallest games is not very feasible – a faster way to compute the strategy for this game is putting it into normal form.</p> <h3 id="information-sets">Information sets</h3> <p>There are 6 possible deals in Kuhn Poker: AK, AQ, KQ, KA, QK, QA.</p> <p>Each player has 2 decision points in the game. Player 1 has the initial action and the action after the sequence of P1 checks –&gt; P2 bets. Player 2 has the second action after Player 1 bets or Player 1 checks.</p> <p>Therefore each player has 12 possible acting states. For player 1 these are (where the first card belongs to Player 1 and the second card belongs to Player 2):</p> <ol> <li>AK acting first</li> <li>AQ acting first</li> <li>KQ acting first</li> <li>KA acting first</li> <li>QK acting first</li> <li>QA acting first</li> <li>AK check, P2 bets, P1 action</li> <li>AQ check, P2 bets, P1 action</li> <li>KQ check, P2 bets, P1 action</li> <li>KA check, P2 bets, P1 action</li> <li>QK check, P2 bets, P1 action</li> <li>QA check, P2 bets, P1 action</li> </ol> <p>However, the state of the game (or world of the game) is not actually known to the players! Each player has 2 decision points that are equivalent from their point of view, even though the true game state is different.
For player 1 these are:</p> <ol> <li>A acting first (combines AK and AQ)</li> <li>K acting first (combines KQ and KA)</li> <li>Q acting first (combines QK and QA)</li> <li>A check, P2 bets, P1 action (combines AK and AQ)</li> <li>K check, P2 bets, P1 action (combines KQ and KA)</li> <li>Q check, P2 bets, P1 action (combines QK and QA)</li> </ol> <p>From Player 1’s perspective, she only knows her own private card and can only make decisions based on knowledge of this card.</p> <p>For example, whether Player 1 is dealt a K and Player 2 a Q, or Player 1 is dealt a K and Player 2 an A, Player 1 faces the same decision: holding a K and starting the betting without knowing what the opponent has.</p> <p>Likewise if Player 2 is dealt a K and is facing a bet, he must make the same action regardless of what the opponent has, because from his perspective he only knows his own card and the action history.</p> <p>We define an information set as the set of information used to make decisions at a particular point in the game. In Kuhn Poker, it is equivalent to the card of the acting player and the history of actions up to that point.</p> <p>When writing game history sequences, we use “k” to define check, “b” for bet, “f” for fold, and “c” for call. So for Player 1 acting first with a K, the information set is “K”. For Player 2 acting second with an A and facing a bet, the information set is “Ab”. For Player 2 acting second with an A and facing a check, the information set is “Ak”. For Player 1 with a K checking and facing a bet from Player 2, the information set is “Kkb”.</p> <p>The shorthand version in the case of Kuhn Poker is to combine “k” and “f” into “p” for pass and to combine “b” and “c” into “b” for bet.
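</p>

<p>Using this shorthand, the information sets can be enumerated in a few lines (a sketch with our own encoding, matching the notation above):</p>

```python
# Each information set is the acting player's private card plus the shorthand
# history of actions seen so far ("p" = pass, "b" = bet).
CARDS = ["Q", "K", "A"]

# P1 acts first (empty history) or after the sequence pass -> bet ("pb").
p1_infosets = [c for c in CARDS] + [c + "pb" for c in CARDS]
# P2 acts after P1 bets ("b") or after P1 passes ("p").
p2_infosets = [c + "b" for c in CARDS] + [c + "p" for c in CARDS]

assert p1_infosets == ["Q", "K", "A", "Qpb", "Kpb", "Apb"]
assert p2_infosets == ["Qb", "Kb", "Ab", "Qp", "Kp", "Ap"]
```

<p>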
Pass indicates putting no money into the pot and bet indicates putting $1 into the pot.</p> <h3 id="writing-kuhn-poker-in-normal-form">Writing Kuhn Poker in Normal Form</h3> <p>Now that we have defined information sets, we see that each player in fact has 2 information sets per card that he can be dealt, which is a total of 6 information sets per player since each can be dealt a card in {Q, K, A}. (If the game were played with a deck of size N, we would have $$2 * \text{N}$$ information sets per player.)</p> <p>Each information set has two possible actions, which are essentially “do not put money in the pot” (check when acting first/facing a check, or fold when facing a bet – we call this pass) and “put in $1” (bet when acting first, or call when facing a bet – we call this bet).</p> <p>The result is that each player has $$2^6 = 64$$ total combinations of pure strategies. Think of this as each player having a switch between pass/bet for each of the 6 information sets that can be on or off and deciding all of these in advance.</p> <p>Here are a few examples of the 64 strategies for Player 1 (randomly selected):</p> <ol> <li>A - bet, Apb - bet, K - bet, Kpb - bet, Q - bet, Qpb - bet</li> <li>A - bet, Apb - bet, K - bet, Kpb - bet, Q - bet, Qpb - pass</li> <li>A - bet, Apb - bet, K - pass, Kpb - bet, Q - bet, Qpb - bet</li> <li>A - bet, Apb - pass, K - bet, Kpb - pass, Q - bet, Qpb - bet</li> <li>A - bet, Apb - pass, K - bet, Kpb - bet, Q - bet, Qpb - bet</li> <li>A - pass, Apb - bet, K - bet, Kpb - bet, Q - pass, Qpb - bet</li> </ol> <p>We can create a $$64 \text{x} 64$$ payoff matrix with every possible strategy for each player on each axis and the payoffs inside.</p> <table> <thead> <tr> <th>P1/P2</th> <th>P2 Strat 1</th> <th>P2 Strat 2</th> <th>…</th> <th>P2 Strat 64</th> </tr> </thead> <tbody> <tr> <td>P1 Strat 1</td> <td>EV(1,1)</td> <td>EV(1,2)</td> <td>…</td> <td>EV(1,64)</td> </tr> <tr> <td>P1 Strat 2</td> <td>EV(2,1)</td> <td>EV(2,2)</td>
<td>…</td> <td>EV(2,64)</td> </tr> <tr> <td>…</td> <td>…</td> <td>…</td> <td>…</td> <td>…</td> </tr> <tr> <td>P1 Strat 64</td> <td>EV(64,1)</td> <td>EV(64,2)</td> <td>…</td> <td>EV(64,64)</td> </tr> </tbody> </table> <p>This matrix has 4096 entries and would be difficult to use for something like iterated elimination of dominated strategies. We turn to linear programming to find a solution.</p> <h3 id="solving-with-linear-programming">Solving with Linear Programming</h3> <p>The general way to solve a game matrix of this size is with linear programming, which is essentially a way to optimize a linear objective subject to linear constraints, which we’ll define below. This kind of setup could be used in a problem like minimizing the cost of food while still meeting objectives like a minimum number of calories and a maximum amount of carbohydrates and sugar.</p> <p>We can define Player 1’s strategy as $$x$$, which is a vector of size 64 corresponding to the probability of playing each strategy. We do the same for Player 2 as $$y$$.</p> <p>We define the payoff matrix as $$A$$ with the payoffs written with respect to Player 1.</p> $A = \quad \begin{bmatrix} EV(1,1) &amp; EV(1,2) &amp; ... &amp; EV(1,64) &amp; \\ EV(2,1) &amp; EV(2,2) &amp; ... &amp; EV(2,64) &amp; \\ ... &amp; ... &amp; ... &amp; ... &amp; \\ EV(64,1) &amp; EV(64,2) &amp; ...
&amp; EV(64,64) &amp; \\ \end{bmatrix}$ <p>We can use payoff matrix $$B$$ for payoffs written with respect to Player 2 – in zero-sum games like poker, $$A = -B$$, so it’s easiest to just use $$A$$.</p> <p>We can also define a constraint matrix for each player:</p> <p>Let P1’s constraint matrix = $$E$$ such that $$Ex = e$$</p> <p>Let P2’s constraint matrix = $$F$$ such that $$Fy = f$$</p> <p>The only constraint we have at this time is that each strategy vector is a probability distribution and so must sum to 1 (just as the probabilities of getting heads (0.5) and tails (0.5) sum to 1), so $$E$$ and $$F$$ will just be row vectors of 1’s and $$e$$ and $$f$$ will equal $$1$$. In effect, this says that each player plays each of his 64 strategies some percentage of the time (some percentages will be 0), and these percentages must add up to 1.</p> <p>In the case of Kuhn Poker, for <strong>step 1</strong> we look at a best response for Player 2 (strategy $$y$$) to a fixed Player 1 (strategy $$x$$) and have the following. Best response means the best possible strategy for Player 2 given Player 1’s fixed strategy.</p> $\max_{y} (x^TB)y$ $\text{Such that: } Fy = f, y \geq 0$ <p>We are looking for the strategy parameters $$y$$ that maximize the payoffs for Player 2. $$x^TB$$ is the transpose of $$x$$ multiplied by $$B$$, so the strategy of Player 1 multiplied by the payoffs to Player 2.
Player 2 then can choose $$y$$ to maximize his payoffs.</p> <p>We substitute $$-A$$ for $$B$$ so we only have to work with the $$A$$ matrix.</p> $= \max_{y} (x^T(-A))y$ <p>We can then remove the minus sign by changing the optimization from maximizing to minimizing.</p> $= \min_{y} (x^T(A))y$ $\text{Such that: } Fy = f, y \geq 0$ <p>In words, this is the expected value of the game from Player 2’s perspective, because the $$x$$ and $$y$$ vectors represent the probability of ending in each state of the payoff matrix and $$B = -A$$ represents the payoff matrix itself. So Player 2 is trying to find a strategy $$y$$ that maximizes the payoff of the game from his perspective against a fixed Player 1 strategy $$x$$.</p> <p>For <strong>step 2</strong>, we look at a best response for Player 1 (strategy $$x$$) to a fixed Player 2 (strategy $$y$$) and have:</p> $\max_{x} x^T(Ay)$ $\text{Such that: } x^TE^T = e^T, x \geq 0$ <p>Note that now Player 1 is trying to maximize this equation and Player 2 is trying to minimize this same thing.</p> <p>For <strong>step 3</strong>, we can combine the above 2 parts and now allow for $$x$$ and $$y$$ to no longer be fixed, which leads to the below minimax equation. In 2-player zero-sum games, the minimax solution is the same as the Nash equilibrium solution. We call this minimax because each player minimizes the maximum payoff possible for the other – since the game is zero-sum, they also minimize their own maximum loss (maximizing their minimum payoff). This is also why the Nash equilibrium strategy in poker can be thought of as a “defensive” strategy: by minimizing the maximum loss, we aren’t trying to maximally exploit.</p> $\min_{y} \max_{x} [x^TAy]$ $\text{Such that: } x^TE^T = e^T, x \geq 0, Fy = f, y \geq 0$ <p>We can solve this with linear programming, but this would involve a huge payoff matrix $$A$$ and length 64 strategy vectors for each player.
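</p>

<p>The full $$64 \text{x} 64$$ LP is unwieldy to write out, but the minimax idea can be illustrated on a tiny zero-sum game. The sketch below uses fictitious play – an iterative method, not the LP described here – on the $$2 \text{x} 2$$ matching-pennies matrix; in 2-player zero-sum games its empirical strategies converge to the minimax mixture, here $$(\frac{1}{2}, \frac{1}{2})$$:</p>

```python
# Matching pennies: payoffs to the row (maximizing) player.
A = [[1, -1], [-1, 1]]

def fictitious_play(payoffs, iters=20000):
    """Each player repeatedly best-responds to the opponent's empirical mixture."""
    rows, cols = len(payoffs), len(payoffs[0])
    row_counts, col_counts = [0] * rows, [0] * cols
    r, c = 0, 0  # arbitrary initial pure strategies
    for _ in range(iters):
        row_counts[r] += 1
        col_counts[c] += 1
        # row player maximizes, column player minimizes (zero-sum)
        r = max(range(rows), key=lambda i: sum(payoffs[i][j] * col_counts[j] for j in range(cols)))
        c = min(range(cols), key=lambda j: sum(payoffs[i][j] * row_counts[i] for i in range(rows)))
    return [n / iters for n in row_counts], [n / iters for n in col_counts]

x_mix, y_mix = fictitious_play(A)
# x_mix and y_mix both approach [0.5, 0.5]
```

<p>CFR, previewed in the next sections, refines this same self-play idea to work efficiently on the extensive form of the game.</p>

<p>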
There is a much more efficient way!</p> <h2 id="solving-by-simplifying-the-matrix">Solving by Simplifying the Matrix</h2> <p>Kuhn Poker is the most basic poker game possible and requires solving a $$64 \text{x} 64$$ matrix. While this is feasible, any reasonably sized poker game would blow up the matrix size.</p> <p>We can improve on this form by considering the structure of the game tree. Rather than just saying that the constraints on the $$x$$ and $$y$$ matrices are that they must sum to 1 as we did above, we can redefine these conditions according to the structure of the game tree.</p> <h3 id="simplified-matrices-for-player-1-with-behavioral-strategies">Simplified Matrices for Player 1 with Behavioral Strategies</h3> <p>Previously we defined $$E = F = \text{Vectors of } 1$$, which is the most basic constraint that all probabilities have to sum to 1.</p> <p>However, we know that some strategic decisions can only be made after certain other decisions have already been made. For example, Player 2’s actions after a Player 1 bet can only be made after Player 1 has first bet!</p> <p>Now we can redefine the $$E$$ constraint as follows for Player 1:</p> <table> <thead> <tr> <th>Infoset/Strategies</th> <th>0</th> <th>A_b</th> <th>A_p</th> <th>A_pb</th> <th>A_pp</th> <th>K_b</th> <th>K_p</th> <th>K_pb</th> <th>K_pp</th> <th>Q_b</th> <th>Q_p</th> <th>Q_pb</th> <th>Q_pp</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>A</td> <td>-1</td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Ap</td> <td> </td> <td> </td> <td>-1</td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>K</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> 
<td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Kp</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> <td>Qp</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td>1</td> <td>1</td> </tr> </tbody> </table> <p>We see that $$E$$ is a $$7 \text{x} 13$$ matrix, representing the root of the game and the 6 information sets vertically and the root of the game and the 12 possible strategies horizontally. The difference now is that we are using <strong>behavioral strategies</strong> instead of <strong>mixed strategies</strong>. Mixed strategies meant specifying a probability of how often to play each of 64 possible pure strategies. Behavioral strategies assign probability distributions over strategies at each information set. Kuhn’s Theorem (the same Kuhn) states that in a game where players may remember all of their previous moves/states of the game available to them, for every mixed strategy there is a behavioral strategy that has an equivalent payoff (i.e. the strategies are equivalent).</p> <p>Within the matrix, the [0,0] entry is a dummy and filled with a 1. Each row has a single -1, which indicates the strategy (or root) that must precede the infoset. For example, the A has a -1 entry at the root (0) and 1 entries for A_b and A_p since the A must precede those strategies. The $$1$$ entries represent strategies that exist from a certain infoset. 
In matrix form we have $$E$$ as below:</p> $E = \quad \begin{bmatrix} 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 1 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 \\ \end{bmatrix}$ <p>$$x$$ is a $$13 \text{x} 1$$ matrix of probabilities to play each strategy.</p> $x = \quad \begin{bmatrix} 1 \\ A_b \\ A_p \\ A_{pb} \\ A_{pp} \\ K_b \\ K_p \\ K_{pb} \\ K_{pp} \\ Q_b \\ Q_p \\ Q_{pb} \\ Q_{pp} \\ \end{bmatrix}$ <p>We have finally that $$e$$ is a $$7 \text{x} 1$$ fixed matrix.</p> $e = \quad \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix}$ <p>So we have overall:</p> $\quad \begin{bmatrix} 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 1 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 \\ \end{bmatrix} \quad \begin{bmatrix} 1 \\ A_b \\ A_p \\ 
A_{pb} \\ A_{pp} \\ K_b \\ K_p \\ K_{pb} \\ K_{pp} \\ Q_b \\ Q_p \\ Q_{pb} \\ Q_{pp} \\ \end{bmatrix} = \quad \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix}$ <h3 id="what-do-the-matrices-mean">What do the matrices mean?</h3> <p>To understand how the matrix multiplication works and why it makes sense, let’s look at each of the 7 multiplications (i.e., each row of $$E$$ multiplied by the column vector of $$x$$ $$=$$ the corresponding row in the $$e$$ column vector).</p> <p><strong>Row 1</strong></p> <p>We have $$1 \text{x} 1 = 1$$. This is a “dummy”.</p> <p><strong>Row 2</strong></p> <p>$$-1 + A_b + A_p = 0$$ $$A_b + A_p = 1$$</p> <p>This is the simple constraint that the probabilities of the initial actions in the game when dealt an A must sum to 1.</p> <p><strong>Row 3</strong> $$-A_p + A_{pb} + A_{pp} = 0$$ $$A_{pb} + A_{pp} = A_p$$</p> <p>The probabilities of Player 1 taking a bet or pass option with an A after initially passing must sum up to the probability of that initial pass $$A_p$$.</p> <p>The following are just repeats of Rows 2 and 3 with the other cards.</p> <p><strong>Row 4</strong></p> $-1 + K_b + K_p = 0$ $K_b + K_p = 1$ <p>The probabilities of Player 1’s initial actions with a K must sum to 1.</p> <p><strong>Row 5</strong></p> $-K_p + K_{pb} + K_{pp} = 0$ $K_{pb} + K_{pp} = K_p$ <p>The probabilities of Player 1 taking a bet or pass option with a K after initially passing must sum up to the probability of that initial pass $$K_p$$.</p> <p><strong>Row 6</strong></p> $-1 + Q_b + Q_p = 0$ $Q_b + Q_p = 1$ <p>The probabilities of Player 1’s initial actions with a Q must sum to 1.</p> <p><strong>Row 7</strong></p> $-Q_p + Q_{pb} + Q_{pp} = 0$ $Q_{pb} + Q_{pp} = Q_p$ <p>The probabilities of Player 1 taking a bet or pass option with a Q after initially passing must sum up to the probability of that initial pass $$Q_p$$.</p> <h3 id="simplified-matrices-for-player-2">Simplified Matrices for Player 2</h3> <p>And $$F$$ works similarly for
Player 2:</p> <table> <thead> <tr> <th>Infoset/Strategies</th> <th>0</th> <th>A_b(ab)</th> <th>A_p(ab)</th> <th>A_b(ap)</th> <th>A_p(ap)</th> <th>K_b(ab)</th> <th>K_p(ab)</th> <th>K_b(ap)</th> <th>K_p(ap)</th> <th>Q_b(ab)</th> <th>Q_p(ab)</th> <th>Q_b(ap)</th> <th>Q_p(ap)</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Ab</td> <td>-1</td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Ap</td> <td>-1</td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Kb</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Kp</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Qb</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> <td>Qp</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> </tr> </tbody> </table> <p>From the equivalent analysis as we did above, now with $$Fy = f$$, each pair of 1’s in the $$F$$ matrix marks the 2 options at an information set node, whose probabilities must sum appropriately.</p> <h3 id="simplified-payoff-matrix">Simplified Payoff Matrix</h3> <p>Now instead of the $$64 \text{x} 64$$ matrix we made before, we can represent the payoff matrix as only $$6 \text{x} 2 \text{ x } 6\text{x}2 = 12 \text{x} 12$$. (It’s actually $$13 \text{x} 13$$ because we use a dummy row and column.)
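</p>

<p>As a quick concreteness check, we can multiply the $$E$$ matrix from above by an example strategy vector and confirm $$Ex = e$$ (plain Python; the strategy numbers are one valid choice from the analytical solution, taking $$y = \frac{1}{2}$$, with each entry being the product of the behavioral probabilities along its sequence):</p>

```python
# Sequence-form constraint matrix E for Player 1, transcribed from above.
E = [
    [1,  0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0],
    [-1, 1,  1, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0],
    [0,  0, -1, 1, 1, 0,  0, 0, 0, 0,  0, 0, 0],
    [-1, 0,  0, 0, 0, 1,  1, 0, 0, 0,  0, 0, 0],
    [0,  0,  0, 0, 0, 0, -1, 1, 1, 0,  0, 0, 0],
    [-1, 0,  0, 0, 0, 0,  0, 0, 0, 1,  1, 0, 0],
    [0,  0,  0, 0, 0, 0,  0, 0, 0, 0, -1, 1, 1],
]
# x = [1, A_b, A_p, A_pb, A_pp, K_b, K_p, K_pb, K_pp, Q_b, Q_p, Q_pb, Q_pp]
# with y = 1/2: bet the A half the time and always call with it after checking;
# never bet the K, call with it z = (y+1)/3 = 1/2; bluff the Q x = y/3 = 1/6.
x = [1, 0.5, 0.5, 0.5, 0.0, 0.0, 1.0, 0.5, 0.5, 1 / 6, 5 / 6, 0.0, 5 / 6]
e = [1, 0, 0, 0, 0, 0, 0]

Ex = [sum(E[i][j] * x[j] for j in range(13)) for i in range(7)]
assert all(abs(Ex[i] - e[i]) < 1e-9 for i in range(7))
```

<p>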
These payoffs are the actual results of the game when these strategies are played from the perspective of Player 1, where the results are in {-2, -1, 1, 2}.</p> <table> <thead> <tr> <th>P1/P2</th> <th>0</th> <th>A_b(ab)</th> <th>A_p(ab)</th> <th>A_b(ap)</th> <th>A_p(ap)</th> <th>K_b(ab)</th> <th>K_p(ab)</th> <th>K_b(ap)</th> <th>K_p(ap)</th> <th>Q_b(ab)</th> <th>Q_p(ab)</th> <th>Q_b(ap)</th> <th>Q_p(ap)</th> </tr> </thead> <tbody> <tr> <td>0</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>A_b</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> <td>A_p</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td> </td> <td> </td> <td> </td> <td>1</td> </tr> <tr> <td>A_pb</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>0</td> </tr> <tr> <td>A_pp</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> </tr> <tr> <td>K_b</td> <td> </td> <td>-2</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> <td>K_p</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> </tr> <tr> <td>K_pb</td> <td> </td> <td> </td> <td> </td> <td>-2</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td> </td> </tr> <tr> <td>K_pp</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> </tr> <tr> <td>Q_b</td> <td> </td> <td>-2</td> <td>1</td> 
<td> </td> <td> </td> <td>-2</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_p</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_pb</td> <td> </td> <td> </td> <td> </td> <td>-2</td> <td> </td> <td> </td> <td> </td> <td>-2</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_pp</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <p>And written in matrix form:</p> $A = \quad \begin{bmatrix} 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 1 &amp; 0 &amp; 0 &amp; 2 &amp; 1 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 1 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 \\ 0 &amp; -2 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 1 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 \\ 0 &amp; 0 &amp; 0 &amp; -2 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 \\ 0 &amp; -2 &amp; 1 &amp; 0 &amp; 0 &amp; -2 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; -2 &amp; 0 &amp; 0 &amp; 0 &amp; -2 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 
&amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ \end{bmatrix}$ <p>We could even further reduce this by eliminating dominated strategies:</p> <table> <thead> <tr> <th>P1/P2</th> <th>0</th> <th>A_b(ab)</th> <th>A_b(ap)</th> <th>A_p(ap)</th> <th>K_b(ab)</th> <th>K_p(ab)</th> <th>K_b(ap)</th> <th>K_p(ap)</th> <th>Q_b(ab)</th> <th>Q_p(ab)</th> <th>Q_b(ap)</th> <th>Q_p(ap)</th> </tr> </thead> <tbody> <tr> <td>0</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>A_b</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> <td>A_p</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td> </td> <td> </td> <td> </td> <td>1</td> </tr> <tr> <td>A_pb</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>0</td> </tr> <tr> <td>K_p</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> </tr> <tr> <td>K_pb</td> <td> </td> <td> </td> <td>-2</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td> </td> </tr> <tr> <td>K_pp</td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> </tr> <tr> <td>Q_b</td> <td> </td> <td>1</td> <td> </td> <td> </td> <td>-2</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_p</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_pp</td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <p>For simplicity, let’s 
stick with the original $$A$$ payoff matrix and see how we can solve for the strategies and value of the game.</p> <h3 id="simplified-linear-program">Simplified Linear Program</h3> <p>Our linear program is now updated as follows. It is the same general form as before, but now our $$E$$ and $$F$$ matrices have constraints based on the game tree and the payoff matrix $$A$$ is smaller, evaluating when player strategies coincide and result in payoffs, rather than looking at every possible set of pure strategic options as we did before:</p> $\min_{y} \max_{x} [x^TAy]$ $\text{Such that: } x^TE^T = e^T, x \geq 0, Fy = f, y \geq 0$ <p>MATLAB code is available to solve this linear program where A, E, e, F, and f are givens and we are trying to solve for x and y. The code also includes variables p and q, which we don’t go into here except for the first value of the p vector, which is the game value.</p> <div class="language-matlab highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">%givens</span> <span class="n">A</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span 
class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span 
class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span 
class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span 
class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">]/</span><span class="mf">6.</span><span class="p">;</span> <span class="n">F</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span 
class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span><span 
class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span><span 
class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">];</span> <span class="n">f</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">;</span><span class="mi">0</span><span class="p">;</span><span class="mi">0</span><span class="p">;</span><span class="mi">0</span><span class="p">;</span><span class="mi">0</span><span class="p">;</span><span class="mi">0</span><span class="p">;</span><span class="mi">0</span><span class="p">];</span> <span class="n">E</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span 
class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span 
class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">;</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">];</span> <span class="n">e</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">;</span><span class="mi">0</span><span class="p">;</span><span class="mi">0</span><span class="p">;</span><span class="mi">0</span><span class="p">;</span><span 
class="mi">0</span><span class="p">;</span><span class="mi">0</span><span class="p">;</span><span class="mi">0</span><span class="p">];</span> <span class="c1">%get dimensions </span> <span class="n">dim_E</span> <span class="o">=</span> <span class="nb">size</span><span class="p">(</span><span class="n">E</span><span class="p">)</span> <span class="n">dim_F</span> <span class="o">=</span> <span class="nb">size</span><span class="p">(</span><span class="n">F</span><span class="p">)</span> <span class="c1">%extend to cover both y and p</span> <span class="n">e_new</span> <span class="o">=</span> <span class="p">[</span><span class="nb">zeros</span><span class="p">(</span><span class="n">dim_F</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span><span class="mi">1</span><span class="p">);</span><span class="n">e</span><span class="p">]</span> <span class="c1">%constraint changes for 2 variables</span> <span class="n">H1</span><span class="o">=</span><span class="p">[</span><span class="o">-</span><span class="n">F</span><span class="p">,</span><span class="nb">zeros</span><span class="p">(</span><span class="n">dim_F</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="n">dim_E</span><span class="p">(</span><span class="mi">1</span><span class="p">))]</span> <span class="n">H2</span><span class="o">=</span><span class="p">[</span><span class="n">A</span><span class="p">,</span><span class="o">-</span><span class="n">E</span><span class="o">'</span><span class="p">]</span> <span class="n">H3</span><span class="o">=</span><span class="nb">zeros</span><span class="p">(</span><span class="n">dim_E</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span><span class="mi">1</span><span class="p">)</span> <span class="c1">%bounds for both </span> <span class="n">lb</span> <span class="o">=</span> <span class="p">[</span><span class="nb">zeros</span><span class="p">(</span><span 
class="n">dim_F</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="mi">1</span><span class="p">);</span><span class="o">-</span><span class="n">inf</span><span class="o">*</span><span class="nb">ones</span><span class="p">(</span><span class="n">dim_E</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="mi">1</span><span class="p">)]</span> <span class="n">ub</span> <span class="o">=</span> <span class="p">[</span><span class="nb">ones</span><span class="p">(</span><span class="n">dim_F</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="mi">1</span><span class="p">);</span><span class="n">inf</span><span class="o">*</span><span class="nb">ones</span><span class="p">(</span><span class="n">dim_E</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="mi">1</span><span class="p">)]</span> <span class="c1">%solve lp problem </span> <span class="p">[</span><span class="n">yp</span><span class="p">,</span><span class="n">fval</span><span class="p">,</span><span class="n">exitflag</span><span class="p">,</span><span class="n">output</span><span class="p">,</span><span class="n">lambda</span><span class="p">]</span><span class="o">=</span><span class="n">linprog</span><span class="p">(</span><span class="n">e_new</span><span class="p">,</span><span class="n">H2</span><span class="p">,</span><span class="n">H3</span><span class="p">,</span><span class="n">H1</span><span class="p">,</span><span class="o">-</span><span class="n">f</span><span class="p">,</span><span class="n">lb</span><span class="p">,</span><span class="n">ub</span><span class="p">);</span> <span class="c1">%get solutions {x, y, p, q} </span> <span class="n">x</span> <span class="o">=</span> <span class="n">lambda</span><span class="o">.</span><span class="n">ineqlin</span> <span class="n">y</span> <span class="o">=</span> <span 
class="n">yp</span><span class="p">(</span><span class="mi">1</span> <span class="p">:</span> <span class="n">dim_F</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span> <span class="n">p</span> <span class="o">=</span> <span class="n">yp</span><span class="p">(</span><span class="n">dim_F</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span> <span class="p">:</span> <span class="n">dim_F</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">dim_E</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span> <span class="n">q</span> <span class="o">=</span> <span class="n">lambda</span><span class="o">.</span><span class="n">eqlin</span> </code></pre></div></div> <p>The output is:</p> <div class="language-matlab highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Optimal</span> <span class="n">solution</span> <span class="n">found</span><span class="o">.</span> <span class="n">x</span> <span class="o">=</span> <span class="mf">1.0000</span> <span class="mf">1.0000</span> <span class="mi">0</span> <span class="mi">0</span> <span class="mi">0</span> <span class="mi">0</span> <span class="mf">1.0000</span> <span class="mf">0.6667</span> <span class="mf">0.3333</span> <span class="mf">0.3333</span> <span class="mf">0.6667</span> <span class="mi">0</span> <span class="mf">0.6667</span> <span class="n">y</span> <span class="o">=</span> <span class="mf">1.0000</span> <span class="mf">1.0000</span> <span class="mi">0</span> <span class="mf">1.0000</span> <span class="mi">0</span> <span class="mf">0.3333</span> <span class="mf">0.6667</span> <span class="o">-</span><span class="mf">0.0000</span> <span class="mf">1.0000</span> <span class="o">-</span><span class="mf">0.0000</span> <span class="mf">1.0000</span> <span class="mf">0.3333</span> <span 
class="mf">0.6667</span> <span class="n">p</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.0556</span> <span class="mf">0.3889</span> <span class="mf">0.1111</span> <span class="o">-</span><span class="mf">0.1111</span> <span class="o">-</span><span class="mf">0.2222</span> <span class="o">-</span><span class="mf">0.3333</span> <span class="o">-</span><span class="mf">0.1667</span> <span class="n">q</span> <span class="o">=</span> <span class="mf">0.1111</span> <span class="o">-</span><span class="mf">0.1111</span> <span class="o">-</span><span class="mf">0.3889</span> <span class="mf">0.2222</span> <span class="o">-</span><span class="mf">0.1111</span> <span class="mf">0.3333</span> <span class="mf">0.1667</span> </code></pre></div></div> <p>The $$x$$ and $$y$$ values are a Nash equilibrium strategy solution for each player (one of many equilibrium solutions), where the values after the first entry in each vector describe the betting strategy for each player’s actions, in the order shown above. The first $$p$$ value shows the value of the game as we had calculated before in the analytical section: -0.0556, i.e., $$-1/18$$.</p> <h2 id="iterative-algorithms">Iterative Algorithms</h2> <p>We have now shown a way to solve games more efficiently based on the structure and ordering of the decision nodes. Using behavioral strategies significantly reduces the size of the game; the game can be expressed in tree form, which leads to algorithms that use self-play to iterate through the game tree.</p> <p>Specifically, CFR (Counterfactual Regret Minimization) has become the foundation of imperfect-information game-solving algorithms. We will go into detail on this in Section 4.1.</p> <p>We can see a visualization of the optimal strategy for 100-card Kuhn Poker, solved using the CFR algorithm.
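</p> <p>At the heart of CFR is the regret-matching update: at each decision point, actions are chosen in proportion to their accumulated positive regret. As a minimal, self-contained sketch (not the tutorial’s own implementation; the game and all names here are illustrative), two regret-matching players self-playing rock-paper-scissors have average strategies that converge toward the uniform equilibrium:</p>

```python
# A hedged sketch of regret matching, the update rule at the core of CFR.
# Each player accumulates, for every action, how much better that action
# would have done than the mixed strategy actually played, then mixes in
# proportion to positive cumulative regret.

ACTIONS = 3  # rock, paper, scissors
PAYOFF = [[0, -1, 1],   # PAYOFF[a][b] = row player's payoff
          [1, 0, -1],
          [-1, 1, 0]]

def strategy(regrets):
    """Mix in proportion to positive cumulative regret (uniform if none)."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def train(iterations):
    # Slightly asymmetric initial regrets so the dynamics are non-trivial.
    regrets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    sums = [[0.0] * ACTIONS, [0.0] * ACTIONS]
    for _ in range(iterations):
        s1, s2 = strategy(regrets[0]), strategy(regrets[1])
        # Expected payoff of each pure action against the opponent's mix.
        u1 = [sum(s2[b] * PAYOFF[a][b] for b in range(ACTIONS)) for a in range(ACTIONS)]
        u2 = [sum(s1[a] * -PAYOFF[a][b] for a in range(ACTIONS)) for b in range(ACTIONS)]
        v1 = sum(s1[a] * u1[a] for a in range(ACTIONS))
        v2 = sum(s2[b] * u2[b] for b in range(ACTIONS))
        for a in range(ACTIONS):
            regrets[0][a] += u1[a] - v1
            regrets[1][a] += u2[a] - v2
            sums[0][a] += s1[a]
            sums[1][a] += s2[a]
    # The *average* strategy over all iterations approximates equilibrium.
    return [[s / iterations for s in row] for row in sums]

avg1, avg2 = train(10000)  # both average strategies approach (1/3, 1/3, 1/3)
```

<p>Kuhn poker adds hidden cards and sequential decisions, so CFR weights these regret updates by counterfactual reach probabilities; Section 4.1 covers the full algorithm.</p> <p>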
The strategy for the following charts was computed using $$10^9$$ iterations of External Sampling CFR (see Section 4.1). The Player 1 Opening Actions chart shows the general strategy of mostly betting good and bad hands while passing medium-strength hands, and also passing the very best hands. The Player 2 Action After Pass chart follows the same logic, although with no passing on the best or worst hands: since this is the final action, a pass simply ends the hand, which would generally mean no chance of winning with poor hands and no chance of earning value with good hands.</p> <p>The Player 2 Action After Bet and Player 1 Action After Pass and Opponent Bet charts are quite similar, as both represent facing a bet with no additional money behind (i.e., no bluffs are possible). The only decision is to call the bet or fold, so naturally we call with our better hands.</p> <p>The horizontal axis represents the hands in order given the situation, and the bars above each hand represent the probability of betting, with fully blue meaning always Pass, fully red meaning always Bet, and some of each meaning a mixture of the two.</p> <p><img src="../assets/section3/toygames/kuhn100.png" alt="Kuhn 100-card solution" title="Kuhn 100-card solution" /></p> <p>We see that this is similar to what we’d expect from learning the 3-card game, but with some additional complexity.</p> <p>The bottom two graphs are not very informative because they only show that when facing the final bet, you should call with better hands and fold worse hands. However, from the Player 1 Opening Actions graph (and with similar principles applied to the Player 2 Action After Pass graph), the idea is that the good hands that we bet are value-bets meant to get called by worse hands. The passing with the very best hands is called slow-playing and the betting with the worst hands is called bluffing.
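</p> <p>The equilibrium bluffing and calling frequencies behind charts like these follow from simple indifference arithmetic: the bettor bluffs just often enough that the opponent is indifferent between calling and folding, and the caller calls just often enough that a bluff has zero expected value. As a hedged sketch (the helper functions below are illustrative, not from the tutorial’s code):</p>

```python
# Standard indifference arithmetic for balanced betting (illustrative helpers).

def bluff_fraction(pot, bet):
    """Fraction of the betting range that should be bluffs so a bluff-catcher
    is indifferent: a call risks `bet` to win `pot + bet`."""
    return bet / (pot + 2 * bet)

def call_frequency(pot, bet):
    """How often the bluff-catcher must call so a bluff (risking `bet` to
    win `pot`) has zero expected value."""
    return pot / (pot + bet)

# Kuhn poker numbers: both players ante 1 (pot of 2) and the bet size is 1.
assert abs(bluff_fraction(2, 1) - 0.25) < 1e-9   # 1 bluff per 3 value bets
assert abs(call_frequency(2, 1) - 2 / 3) < 1e-9  # call two-thirds of the time
```

<p>These ratios are consistent with the 1/3 and 2/3 mixed probabilities that appeared in the LP solution for 3-card Kuhn poker earlier.</p> <p>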
Both deceptive moves give the strategy a balance and the bluff gives a chance to win with hands that otherwise would have no chance of winning. Not betting with middling hands is effective because betting would generally cause worse hands to fold and keep better hands in. The principles are the same as in the 3 card version, but more cards allow for a sharper understanding of the general strategy.</p>AIPT Section 3.1: Solving Poker – What is Solving?2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/what-is-solving<h1 id="solving-poker---what-is-solving">Solving Poker - What is Solving?</h1> <p>What does it mean to solve a poker game? How can we “solve” a game that seemingly relies so much on psychology, reading opponents, and deception?</p> <h2 id="definitions">Definitions</h2> <p>When we talk about a solution to a poker game, we mean playing the game theory optimal (GTO) Nash equilibrium strategy. As discussed in the Game Theory Foundations section, Nash equilibrium is defined as when there is a strategy profile such that no player can unilaterally alter his current strategy to increase his expected utility. The Nash strategies in two-player zero-sum games limit a player’s exploitability at the expense of not exploiting weaker opponents, as each player is minimizing his worst-case expected payoff. This means that in expectation, no strategy can do better against a GTO strategy.</p> <p>This means balancing one’s play and thinking about the range of hands that you play in each situation and how you act with each of those hands. The result is play that in theory is minimizing the worst case outcome and in practice (i.e. in real life as a human who can’t play a true GTO strategy) is minimizing chances of opponents exploiting or taking advantage of leaks and weaknesses. Humans can think about this as how to play if they had to announce their strategy in advance and give it to the opponent.
This seems crazy, but emphasizes the balance of such a strategy where actions attempt to make the opponent indifferent, so even if they know your strategy (they of course don’t know your actual cards), they can’t exploit you.</p> <p>In small toy games, these strategies are relatively easy to find. In 1v1 Limit Texas Hold’em a very close approximation of this strategy <a href="https://science.sciencemag.org/content/347/6218/145">was computed in 2015 at the University of Alberta</a>. In commonly played games like 1v1 No Limit Texas Hold’em and multiplayer No Limit Texas Hold’em, no complete game theory optimal strategies exist…yet. In multiplayer games, the concept is less clear because of the interactions involved with other players and so the solution approach has been to develop agents that perform well, but aren’t necessarily playing some theoretical optimal strategy.</p> <h3 id="measuring-agent-quality">Measuring Agent Quality</h3> <p>We are often working with only approximately optimal strategies when solving games. While small toy games can be small enough to find completely game theory optimal solutions, it’s important to be able to evaluate approximately GTO agents. We go into this further in Section 4.3: Agent Evaluation, but will mention the three main ways of measuring agent quality here as well:</p> <ol> <li> <p>We can look at a given strategy against the “best response” strategy, which is the strategy that maximally exploits the given strategy (i.e. how well can someone do against you if they know your exact strategy). This shows the maximum that your strategy would lose in expectation against such an opponent, which is defined as the exploitability of an agent.
Note that a game is in Nash equilibrium if and only if all players are playing best responses to what the other players are doing.</p> </li> <li> <p>We can look at a strategy against a different computer agent, including other approximate equilibrium agents that were somehow computed differently.</p> </li> <li> <p>We can play the agent against humans, though this will require a lot of human time since many hands are needed to properly assess agent quality given the variance inherent in poker.</p> </li> </ol> <h2 id="why-the-gto-strategy">Why the GTO Strategy?</h2> <p>GTO makes sense as a formalized way to solve a game because it can’t be beaten! But what if you are playing against a very bad player and a non-GTO strategy would be much stronger? In some sense the GTO strategy is not optimal in this situation, but we distinguish between a GTO strategy and an exploitative optimal strategy. By definition, when solving a game it makes sense to use the game theory optimal strategy because the exploitative optimal strategy is far less robust and while it may be more profitable against certain players, could be far less profitable against other players. So while a best response is the actual optimal strategy for a certain situation, it can’t be generalized in the way that GTO can be.</p> <h2 id="solving-methods">Solving Methods</h2> <p>Generally algorithms and commercial programs are built around Counterfactual Regret Minimization (CFR), which is described in Section 4.1. 
In short, it works by playing the game against itself repeatedly and eventually coming to an equilibrium strategy, similarly to how two humans might play each other repeatedly and start off with some default strategy in mind, but then may reach some sort of equilibrium as they re-optimize based on the opponent strategy.</p> <h2 id="the-evolution-of-poker-studying">The Evolution of Poker Studying</h2> <p>Poker studying has evolved over the last decade or so from:</p> <ol> <li>Using simple odds evaluation software that would literally just show the odds of two or more specific hands given a board (or no board for the preflop odds). I remember being especially surprised the first time I saw how close some odds were for hands that seemed so much better than other hands, like AK vs. AJ or AK vs. 98!</li> <li>Software that could run expected value simulations with decision trees. You could provide rule-based strategies at each node in a decision tree and the software would compute which hands were profitable, like calculating which hands could be shoved all-in from the small blind against the big blind if the big blind were calling with a fixed set of hands. This is sort of like the first step in finding an equilibrium, which would entail each player’s strategy getting modified after each subsequent round until an equilibrium was found. But if for example it was known that a specific player or population of players tended to play a certain strategy, there could be a lot of value in finding optimal exploitative counter strategies to them.</li> <li>Solving toy poker games and applying lessons from these games into full games of poker. 
A lot of important lessons about balance and how to split up a hand range into different actions like betting for value and betting as a bluff were derived from looking at solutions to simple toy games and extrapolating them into full poker environments.</li> <li>Using “solver” software that can actually solve for the GTO play given a specific situation. Commercial solver software is arguably the most important development in poker over the last decade. The software lets you specify a game tree, generally with highly abstracted bet options (e.g. ~5 total bet options instead of being able to bet any integer or even any amount down to the cent) and then provides a full solution for that situation, which shows exactly how to play each hand in your range of hands.</li> </ol> <p>For this entire time, statistical software that players can use to analyze their own hands and the population of players that they have played against has also been essentially mandatory for serious players (although some sites now don’t allow saving hand histories and some sites use anonymous names so individual opponents cannot be analyzed). This software is super valuable for “leak finding” by running reports and evaluating one’s own play, both on specific important hands and on trends like “folding too much to bets on the turn card”. The software also has a “heads up display” or HUD that shows up on screen while playing with important statistics about opponents like how often they voluntarily play hands, how often they raise preflop, how often they 3-bet preflop, and many more.</p> <p>As AI continues to improve and computing power increases and as humans study more, AI and humans will both get gradually closer to optimal, but for now, we rely on abstractions and approximations for larger games.</p> <!-- ## Indifference --> <h1 id="useful-poker-math">Useful Poker Math</h1> <p>Standard poker games are played with a 52 card deck with four cards of each value and 13 cards of each suit.
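</p>

<p>As a quick sanity check of those counts, the deck can be enumerated in a few lines (a minimal sketch; the rank and suit labels are arbitrary):</p>

```python
from itertools import product

RANKS = "23456789TJQKA"  # 13 ranks
SUITS = "cdhs"           # 4 suits: clubs, diamonds, hearts, spades

# Build the 52-card deck as rank+suit strings like "As" or "Td".
deck = [r + s for r, s in product(RANKS, SUITS)]

print(len(deck))                            # 52 cards
print(sum(1 for c in deck if c[1] == "s"))  # 13 spades
print(sum(1 for c in deck if c[0] == "A"))  # 4 aces
```

<p>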
The math here is important when playing poker and also makes sense to be familiar with when studying AI poker because it’s important to be able to interpret why solvers do what they do and to make generalizations, not just to try to memorize solver outputs.</p> <h2 id="computing-equities">Computing Equities</h2> <p>Equity is the odds of you winning the pot. If there were no additional cards to come, then the player with the better hand would have 100% equity and the player with the worse hand would have 0% equity since those odds would be fixed. Assuming that we have two players and one has a stronger hand and there are still cards to come, the player with the weaker hand needs a certain number of “outs”, or cards to improve their hand.</p> <p>For example, if the board is A562 and one player has AK and one player has 87, the player with 87 has an open ended straight draw and needs either a 9 (for the 56789 straight) or 4 (for the 45678 straight) to win. Since there are four each of card 9 and card 4, the player has eight total outs to win.</p> <p>We’ve seen 4 + 2 + 2 = 8 total cards so far (four community cards on the board and two hole cards for each player), which means there are 52-8 = 44 cards left in the deck. The player with the straight draw has 8 outs of the 44 cards so 8/44 = 18% chance of winning. This math assumes that the 9 and 4 cards are still in the deck, even though the opponent could have them – we compute this way to simplify the situation and since we have no way of knowing the opponent’s cards.</p> <p>The equity is computed as chance of winning * size of pot, so if the pot was \$X, the straight draw player would have 0.18 * X equity and the other player would have (1-0.18) * X = 0.82 * X equity. So with a \$100 pot, the players would have \$18 and \$82 equity.</p> <p>In this case the player could be certain that completing the straight would result in a win, but sometimes this isn’t the case so you have to be cautious!
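</p>

<p>The outs arithmetic above (8 outs among 44 unseen cards, then equity as a share of the pot) can be sketched with two small helpers; the function names here are just for illustration:</p>

```python
def outs_equity(outs, cards_seen):
    """Chance of hitting one of our outs on the next card,
    treating all unseen cards as live (opponent cards unknown)."""
    return outs / (52 - cards_seen)

def pot_shares(win_prob, pot):
    """Each player's expected share of the pot."""
    return win_prob * pot, (1 - win_prob) * pot

# Open-ended straight draw: 8 outs, 8 cards seen (4 on board + 2 hole cards each).
p = outs_equity(8, 8)      # 8/44, about 0.18
print(pot_shares(p, 100))  # about ($18, $82) in a $100 pot
```

<p>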
For example, a player with a 3 on the A562 board could hit a 4 for a straight, but would then still be losing to a player with 87! Equities can be clearly computed when all cards are known, but in a real game situation, each player has to approximate which cards (outs) are helpful. In the straight example above this is pretty clear, but for example if the board was 456Q and you held A7, you could be somewhat confident that a 3, 8, or A would give you the best hand (9 total outs), but in each case you could still be losing! Someone could have 87 and beat you if the 3 comes, someone could have 79 and beat you if the 8 comes, and someone could have a huge number of things that would beat a single pair of Aces.</p> <p>Briefly, there is also a concept called equity realization, which is the amount of equity a hand is expected to actually realize. An example of this is that in a one on one situation if I have 44 and my opponent has JT offsuit, we are right at about 50% equity each. However, 44 is very difficult to play postflop because unless I hit a 4, there aren’t many good flops for my hand, while JT will frequently have a straight draw (or straight) or at least a pair or might make me fold when high cards come as a bluff.</p> <h2 id="computing-the-expected-value-of-an-allin">Computing the Expected Value of an Allin</h2> <p>A valuable exercise in understanding poker math is to compute the expected value of going allin in a one versus one setting. Let’s once again take a look at the situation of small blind vs. big blind. We’ll use \$1 and \$2 blinds for simplicity. So the small blind posts \$1 and the big blind posts \$2 and now the small blind can either go allin or fold. Let’s assume that both players start the hand with \$40, again for simplicity. Note that if one player started with more than the other, only the smaller stack matters (e.g. 
if the big blind had more money, they would only have to call \$40 to match the bet, so any additional money over the minimum player’s stack size is disregarded).</p> <p>There are 3 scenarios that can occur here:</p> <ol> <li>The small blind folds</li> <li>The small blind goes allin and the big blind folds</li> <li>The small blind goes allin and the big blind calls</li> </ol> <p>Let’s think about the expected value from the perspective of the small blind allin player.</p> <p>Case 1: Small blind folds and the EV is 0. This is because we compute the EV from the point of the allin, so even though the small blind posted a \$1 blind, this is a “sunk” cost and is not used in computations.</p> <p>Case 2: Small blind allin and gains the entire \$3 pot when the big blind folds, so the EV is +\$3.</p> <p>Case 3: When the small blind is allin and the big blind calls and the small blind wins the pot, the win is all of the \$80 chips. When the small blind is allin and loses the pot, the loss is the additional \$39 bet for the allin. Combining these, the EV = (win %) * 80 - 39, since the \$39 goes into the pot whether we win or lose, while we get back the entire \$80 pot only when we win. Equivalently, EV = (win %) * 41 - (lose %) * 39, where \$41 is the net win of the \$80 pot minus our \$39.</p> <p>Putting this all together, we have EV = (opponent fold %) * 3 + (1 - opponent fold %) * ((win %) * 80 - 39)</p> <p>Written more generally, we have EV = (opponent fold %) * (pot size before allin) + (1 - opponent fold %) * ((win %) * (total pot size after allin) - (allin bet size put at risk))</p> <p>Note that fold equity is the name of the equity that you get from the pot when your opponent folds.</p> <h2 id="pot-odds">Pot Odds</h2> <p>Pot odds means the ratio between the size of the pot and the size of a bet. Continuing with the above example, if the small blind goes allin then the big blind sees a pot of \$42 (the small blind’s \$40 + his \$2 big blind) and has to call \$38 more, so their odds are 42:38.
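</p>

<p>The all-in EV above can be sketched as a function. Here EV is measured in net chips from the decision point, so the win term is the pot minus the chips we put in; all names are illustrative:</p>

```python
def shove_ev(fold_pct, win_pct, pot_before, pot_after, risk):
    """Net-chip EV of going all-in, measured from the decision point.

    fold_pct:   how often the opponent folds
    win_pct:    our equity when called
    pot_before: the pot we pick up uncontested (the blinds)
    pot_after:  total pot when the all-in is called
    risk:       chips we put in with the all-in (posted blinds are sunk)
    """
    ev_called = win_pct * (pot_after - risk) - (1 - win_pct) * risk
    return fold_pct * pot_before + (1 - fold_pct) * ev_called

# $1/$2 blinds, $40 stacks: the shove risks $39 more, picks up $3
# uncontested, and plays for an $80 pot when called.
print(shove_ev(fold_pct=0.5, win_pct=0.4, pot_before=3, pot_after=80, risk=39))  # about -2
```

<p>Back to the big blind facing the shove with a pot odds ratio of 42:38. 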
This can be converted to a percentage by taking 38/(38+42) = 38/80 = 47.5%. Note that this is the size of the call divided by the size of the pot if we were to make the call. We can interpret this as the odds needed to win – the big blind needs to have at least 47.5% equity to make a profitable call. (Note that if this was a regular (non all-in) bet then there would be additional action on the hand, which we will discuss in the below Implied Odds section.)</p> <p>Pot odds are super important when deciding whether or not to make a call in poker and it’s important to be able to calculate an approximation on the fly in your head when playing.</p> <p>We use pot odds in conjunction with equity as a factor in deciding whether or not to make a call. In general, if you compute that your equity is higher than your pot odds, then you are making a profitable call. There are important standard pot odds that are useful to memorize – if an opponent bets half the pot, then pot odds are 25%. If an opponent bets 3/4 of the pot, then pot odds are 30%. A bet of full pot gives pot odds of 33% and a bet of 2x the pot gives pot odds of 40%. This intuitively makes sense – when facing lower bets as a percentage of the pot, we need lower odds to make the call because we have less at risk relative to what we can gain. This is something to think about when betting as well – to manipulate the pot odds that your opponent is getting – for example betting larger can protect a made hand against a possible drawing hand and betting smaller with a very strong hand can “suck in” the opponent since they are getting good odds (but could also backfire if the opponent hits a draw cheaply, which is why bets are often made according to conditions on the board and how many draws are possible).</p> <p>Pot odds are especially useful in the context of catching a bluff with a medium-strength hand since that is when our decision is simply to call or fold.
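</p>

<p>Following the definition above, the call size divided by the size of the pot after our call, pot odds can be computed with a one-line helper (names are illustrative):</p>

```python
def pot_odds(call, pot_incl_bet):
    """Equity needed to call: call size over the pot after our call."""
    return call / (pot_incl_bet + call)

print(pot_odds(38, 42))    # 0.475: the big blind needs 47.5% equity vs. the shove
print(pot_odds(0.5, 1.5))  # 0.25: a half-pot bet requires 25% equity
```

<p>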
When we have a very strong or very weak hand, we would possibly be considering more options than just calling and folding, like raising for value or as a bluff.</p> <h2 id="implied-odds">Implied Odds</h2> <p>Pot odds simplify things by considering only the immediate odds of the poker situation, which is valid on a final betting round or when all of the money is in the pot, but not when there is still money and cards to come.</p> <p>Implied odds take into account the chips that you could win in addition to the current pot. The typical example for this is that if someone bets the pot on the flop and you have a flush draw, you’re getting 33% pot odds, while a flush draw with one card to come only has around 18-20% equity. However, when you do make that flush, you can expect on average to win additional money. But beware of reverse implied odds, which refers to losing more than the immediate pot, like if you are drawing to a very small flush, hitting that could be big trouble if an opponent has a higher flush.</p> <p>When you have good implied odds, it can make sense to call without the correct immediate pot odds. If you don’t think you’ll make more money in later betting rounds, then you have low implied odds and it probably doesn’t make sense to call a bet unless you have the correct immediate pot odds.</p> <p>We can compute the additional money needed to win on later streets assuming that we can estimate our equity in the hand. This is the same computation as before for pot odds, but with the denominator adding a variable that represents what we would need to win on later streets.</p> <p>So if the pot is \$100 and the opponent bet \$50, the standard pot odds would be 50/(150+50) = 0.25 (which again, is the call divided by the pot after the bet plus our call).</p> <p>Assuming that it’s possible to win \$x more on later streets, we could compute 50/(150+50+x) = equity.
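</p>

<p>Rearranging that equation gives x = call/equity - (pot + bet + call), which is easy to wrap in a helper (the name is hypothetical):</p>

```python
def extra_needed(call, pot_incl_bet, equity):
    """Chips we must win on later streets for calling now to break even."""
    return max(0.0, call / equity - (pot_incl_bet + call))

print(extra_needed(50, 150, 0.25))  # 0.0: at 25% equity the call is already breakeven
print(extra_needed(50, 150, 0.10))  # about 300: at 10% equity we need ~$300 more later
```

<p>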
Assuming that we have 0.2 equity, that means that we’d have 50/(150+50+x) = 0.2, so 50 = 40 + 0.2x and 50 = x. So if our equity was 0.2, we’d have to win an extra \$50 in later streets to make calling now breakeven. If our equity was 0.25, then we’d have to win \$0 more in later streets (making a call breakeven immediately, with any additional money gained later as positive value). If our equity was only 0.1, then we’d need to win an additional \$300 later, which seems like a lot, but in no limit games with deep stacks, it could make sense to go for unexpected draws to extract a lot of money later!</p> <h2 id="minimum-defense-frequency">Minimum Defense Frequency</h2> <p>Minimum defense frequency is related to pot odds and shows the percentage of hands you should minimally continue with – if you continue with less than this then your opponent can always bluff and be profitable!</p> <p>To calculate minimum defense frequency (MDF), we have (pot size)/(pot size + bet size). Recall that pot odds is (bet size)/(pot size + bet size + call size). Suppose that we are in a pot of \$100 on the river in Texas Hold’em and the opponent bets \$50. This means that our MDF is 100/(100+50) = 0.67. The pot odds are 50/(100+50+50) = 0.25. In words, the MDF means that we should be continuing with at least 67% of hands and the pot odds means that we should have at least 25% equity to call.</p> <p>So if you fold more often than the MDF then your opponent can exploit you by over-bluffing. However, going by MDF means attempting to be unexploitable, but if your opponent is a new player and never bluffing, then that kind of specific knowledge (read) may be more useful. 
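</p>

<p>The MDF formula above can be sketched as a helper (the name is illustrative):</p>

```python
def mdf(pot, bet):
    """Minimum defense frequency: fraction of our range to continue with."""
    return pot / (pot + bet)

print(mdf(100, 50))  # about 0.67: continue with at least two-thirds of hands
```

<p>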
In general, MDF is useful as a theoretical concept to keep in mind when optimizing hand ranges and getting a feel for which hands in your range you should be calling with, but it may not be the right tool in specific scenarios, like exploiting an opponent who bluffs too much or too little, or when it’s very unlikely that an opponent can actually be bluffing and your hand isn’t good enough to beat anything but bluffs.</p> <p>Pot odds, on the other hand, especially on the river (or facing a final bet like an allin), are strictly applicable to the situation: if you determine that your equity is higher than the pot odds then you should call!</p> <p>I co-authored a 2020 paper with Sam Ganzfried called Most Important Fundamental Rule of Poker Strategy, which extrapolates a minimum defense frequency rule from game theoretic strategies in a way that can be interpreted by humans. The rule runs simulations that combine minimum defense frequency with a concept called range advantage, which is how much stronger one player’s range of hands is compared to the other’s. While this can’t be known with certainty, using this approximation in conjunction with MDF can provide a better result than MDF alone.</p> <p>We generated 100,000 one on one simplified poker games and solved them for Nash equilibrium, while computing the optimal defense frequency for each. We then used machine learning regression to find an equation to optimize getting as close as possible to the optimal defense frequency in terms of the minimum defense frequency and range advantage. We found that in a simplified game with only one pot sized bet possible (meaning MDF was fixed, at 0.5 in this case), using range advantage in addition to MDF instead of just MDF leads to a significant reduction in mean squared error, about a 56% reduction.</p> <p>In a 100,000 game dataset with three bet sizes, 0.5 pot, 0.75 pot, and pot, we uncover a rule that would extend to different bet sizes.
Again, we find that using MDF and RA both as features reduces the mean squared error loss by about 56% compared to using MDF only. We found that the best result was using the strategy to call at least 0.904 * MDF - 0.495 * RA + 0.261. Since we wanted to make this formula human interpretable and since we see that the formula is close to MDF - 0.5 * RA + 0.25, we simplify our suggested result with these easier to remember coefficients. Finally, we noticed that the optimal defense frequency only rarely exceeds the MDF, so we found that using min(MDF, MDF - 0.5 * RA + 0.25) improves by about 33% over the models without the truncation. The mean squared error of this approach is 0.0032, about 50% better than linear MDF.</p> <p>This led to the Fundamental Rule of Poker Strategy: Given minimum defense frequency value MDF when facing a certain bet size, and assuming a range advantage of RA, then you should call the bet with a fraction of the hands in your range equal to min(MDF, MDF - 0.5*RA + 0.25), and fold otherwise. When neither player has a range advantage, then RA = 0.5, and the equation simplifies to min(MDF, MDF) = MDF, which makes sense!</p> <p>The rule can make using MDF easier because it can provide more theoretical justification to fold given more information about your opponent and your hand ranges. We give an example in the paper:</p> <p>Suppose we are in a setting where the opponent bets the pot (so MDF = pot/(pot+pot) = 0.5) and we think that the opponent has a range advantage of 0.8, then we would want to call at least min(MDF, MDF - 0.5 * RA + 0.25) = min(MDF, MDF - 0.15) = MDF - 0.15 = 0.5 - 0.15 = 0.35, suggesting that we should call a minimum of 35% of hands rather than 50%. The only downside here is that this removes the theoretical property of the MDF and relies on an estimate of the range advantage.</p> <h2 id="hand-combinations">Hand Combinations</h2> <p>Hand combinations are the math behind how likely hands are in poker.
As we know, there are 52 cards in a deck with 13 cards of each suit and 4 cards of each type (rank). We can compute 52c2 as 52!/(50! * 2!) = 52 * 51/2 = 1326 combinations of 2-card starting hands.</p> <p>Each non-paired hand has 4*4 = 16 combinations of which 12 are offsuit (each card a different suit) and four are suited (both cards the same suit). The four suited combinations come from the four suits in the deck, so one suited combination of each suit. When computing combinations possible given a certain board, we reduce the 4 * 4 multiplication values to account for cards that appear on the board. For example, if the board is J93, we can look at a few combinations:</p> <ol> <li>AK is not on the board so still has 4 * 4 = 16 combinations</li> <li>AJ has one card on the board so now has 4 * 3 = 12 combinations</li> <li>J9 has two cards on the board so now has 3 * 3 = 9 combinations</li> </ol> <p>Finally, there are 4c2 = 4!/(2! * 2!) = 4 * 3/2 = 6 combinations of paired cards. Every card of a certain type that appears on the board reduces the combinations. For example if the board is T52, then there are only 3c2 = 3 combinations of each of those cards instead of the normal 6 combinations. If two of a card are on the board then only one pair combination remains (2c2 = 1).</p> <p>We can use hand combinations to assist in approximating our equity in a hand, which can be used in conjunction with pot odds to make informed decisions, especially on the final betting round.</p> <p>As opponent hands are more narrowly defined towards the end of a hand, it becomes more possible to count the combinations of possible hands that they might have and in turn compute your own equity.</p> <!-- ## ICM -->Solving Poker - What is Solving? What does it mean to solve a poker game?
How can we “solve” a game that seemingly relies so much on psychology, reading opponents, and deception?AIPT Section 4.3: CFR – Agent Evaluation2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/agent-evaluation<h1 id="cfr---agent-evaluation">CFR - Agent Evaluation</h1> <p>Poker results are generally measured in big blinds per 100 hands won. In the research community, the standard measure is milli-big-blinds per hand (or per game), or mbb/g, where one milli-big-blind is 1/1000 of one big blind. This is also used as a measure of exploitability, the expected loss per game against a worst-case opponent. A player who folds every hand will lose 750 mbb/g on average in a heads up match (1000 mbb/g as big blind and 500 as small blind).</p> <h2 id="evaluating-poker-agents">Evaluating Poker Agents</h2> <p>There are three main ways to evaluate poker agents – against a best response opponent, against other poker agents, and against human opponents. Evaluation can be difficult due to the inherent variance in poker, but this can be minimized by playing a very large number of hands and also by playing in a “duplicate” format, where, for example, two agents would play a set of hands and then clear their memories and play the same hands with the cards reversed (only possible when humans are not involved). Further, an agent that performs well in one evaluation metric is not guaranteed to perform well in others.</p> <h3 id="best-response">Best Response</h3> <p>We can find the best response to a poker agent’s strategy analytically by having its opponent always choose the action that maximizes expected value against the agent’s strategy in all game states, given the agent’s strategy. 
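</p>

<p>Given the best-response value against each seat’s strategy, one common convention (an assumption here; reporting conventions vary) is to summarize them as a single exploitability number by averaging:</p>

```python
def exploitability(br_value_vs_p1, br_value_vs_p2):
    """Average best-response winnings (e.g. in mbb/g) against each
    player's strategy; zero for an exact equilibrium, positive otherwise."""
    return (br_value_vs_p1 + br_value_vs_p2) / 2

print(exploitability(0.0, 0.0))    # 0.0: an exact equilibrium
print(exploitability(10.0, 20.0))  # 15.0 mbb/g on average
```

<p>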
We can then compute the ε in ε-Nash equilibrium as a measure of exploitability that gives the lower bound on the exploitability of the agent.</p> <p>The purpose of calculating a best response is to choose actions to maximize our expected value given an opponent’s entire strategy. The expectimax algorithm involves a simple recursive tree walk where the probability of the opponent’s hand is passed forward and the expected value for our states (I – information sets) returned, involving just one pass over the entire game tree.</p> <p>Best response usually requires a full tree traversal, but Johanson et al. showed a general technique that can avoid this in their paper “<a href="http://www.cs.cmu.edu/~waugh/publications/johanson11.pdf">Accelerating Best Response Calculation in Large Extensive Games</a>”, since full tree traversal is often infeasible in very large games.</p> <p>Best response has advantages over comparing strategies by competing them against each other since it is a more theoretical measure and because problems can arise when doing agent vs. agent evaluations, like intransitivities (e.g. Agent A beats Agent B and Agent B beats Agent C but Agent C beats Agent A) and variance.</p> <p>The standard best response method involved “examining each state once to compute the value of every outcome, followed by a pass over the strategy space to determine the optimal counter-strategy”.</p> <p><img src="../assets/section4/evaluation/kuhndiff.png" alt="Kuhn Poker Tree from different perspectives" title="Kuhn Poker Tree from different perspectives" /></p> <p>This figure shows different trees representing Kuhn (1-card) poker. The left tree, titled Game Tree, shows the exact state of the game. The squares are public nodes since bets are made publicly, while the circles are private nodes since player cards are private only to them.
In the two rightmost trees (P1 and P2 Information Set Trees), the opponent chance nodes have only one child each since their chance information is unknown to the other player.</p> <p>Instead of walking the full game tree or even the information set trees, it was shown that we can improve the algorithm by walking only the public tree and visiting each state only once. When we reach a terminal node such as “A,B,X,Y”, this means that player 1 could be in nodes A or B as viewed by player 2 and that player 2 could be in nodes X or Y as viewed by player 1. The algorithm calls for passing forward a vector of reach probabilities of the opponent and chance and recursing back, while choosing the highest valued actions for the iteration player’s perspective and returning the sum of child values for the opponent player, and then at the root, the returned value is the best response to the opponent’s strategy.</p> <p>Michael Johanson’s paper describes techniques for accelerating best response calculations using the structure of information and utilities to avoid a full game tree traversal, allowing the algorithm to compute the worst case performance of non-trivial strategies in large games.</p> <p>The best response algorithm in the below figure takes as inputs the history of the actions, the current player (the algorithm must be run for each player), and a distribution of the opponent reach probabilities (line 2). An additional D variable is set on line 12 to define the opponent’s action distribution. Then for each action possible, if the node does not belong to the current player, then we iterate over each opponent possible cards, find the probability of the player playing those cards, and update the reach probability accordingly. A new variable w[a] is also introduced on line 19 to sum the probabilities over all cards that are taking a certain action. 
Line 21 recurses over the best response function to find the value of taking each action from that node, and then on line 22-23, if that action value is better than the previous value (which is defaulted at negative infinity), then it is assigned as the value for that node.</p> <p>On lines 26-28, if the node is not the current player, then the w values are normalized to define the opponent’s action distribution and the node is assigned a value according to the action distribution and the node action values (i.e. this node’s value is assigned with the normal weights, as opposed to the current player’s node value that is assigned according to the best response method).</p> <p>Finally, on lines 3-6, at terminal points, the opponent distribution is normalized, values are assigned from the perspective of the iteration player, and the expected value payoff is computed as a multiplication between the payoff and the normalized opponent distribution.</p> <p><img src="../assets/section4/evaluation/bestresponse.png" alt="Best Response algorithm" title="Best Response algorithm" /></p> <p>Here’s how the algorithm works in practice in conjunction with CFR:</p> <ol> <li>Pause CFR intermittently</li> <li>Call the best response function (BRF) for each player separately (this player is called the iterating or traversing player)</li> <li>Iterate over all cards and sum all to get overall best responses for each iterating player</li> <li>Pass to BRF: <ul> <li>Player card of iterating player</li> <li>Root starting history</li> <li>Which player is iterating player</li> <li>Vector of uniform reach probabilities of opponent hand possibilities</li> <li>Example in 5 card Kuhn poker: Player card = 3. Opponent vector = [0.25, 0.25, 0.25, 0, 0.25]</li> </ul> </li> <li>If at a terminal node, normalize the vector of the opponent reach probabilities and for each possible opponent hand, add the probability of that hand * the payoff from the iterating player’s perspective.
Then return the expected payoff after going through all possible hands.</li> <li>If not at a terminal node, create the following: <ul> <li>D = [0, 0] to track the opponent’s action distribution</li> <li>V = -inf for the value of the node</li> <li>New opponent reach probabilities, initialized as a copy of the previous ones</li> <li>Util = [0, 0] to track the utility of each action</li> <li>W = [0, 0] to accumulate the opponent’s reach weight for each action</li> </ul> </li> <li>Iterate over the actions: <ul> <li>If the acting player is not the iterating player: <ul> <li>Iterate over all hands of this player</li> <li>Get the strategy of the acting player for each hand based on what CFR has found up to now</li> <li>Update the acting player’s reach probabilities by multiplying them by the strategy</li> <li>W[action] += each of the new reach probabilities for this action</li> </ul> </li> <li>Set the utility of this action to a recursive BRF call with the new history and new opponent reach (only changed if the acting player is not the iterating player)</li> <li>If the acting player is the iterating player and the utility of this action is higher than the current V, then set V = util[this action], since the iterating player will play the best pure strategy</li> </ul> </li> <li>If the acting player is not the iterating player: <ul> <li>D = the normalization of W (i.e., D[a] = W[a] / (W[0] + W[1]))</li> <li>V = D[0] * util[0] + D[1] * util[1]</li> </ul> </li> <li>Return V</li> </ol> <p>Here is an implementation of the best response function in Python for Kuhn Poker:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def brf(self, player_card, history, player_iteration, opp_reach):
    plays = len(history)
    acting_player = plays % 3
    expected_payoff = 0
    if plays &gt;= 3:  # can be terminal
        opponent_dist = np.zeros(len(opp_reach))
        opponent_dist_total = 0
        if history[-1] == 'f' or history[-1] == 'c' or (history[-1] == history[-2] == 'k'):
            for i in range(len(opp_reach)):
                opponent_dist_total += opp_reach[i]  # sum of dist. for normalizing
            for i in range(len(opp_reach)):
                opponent_dist[i] = opp_reach[i] / opponent_dist_total
                payoff = 0
                is_player_card_higher = player_card &gt; i
                if history[-1] == 'f':  # bet fold
                    payoff = 1 if acting_player == player_iteration else -1
                elif history[-1] == 'c':  # bet call
                    payoff = 2 if is_player_card_higher else -2
                elif history[-1] == history[-2] == 'k':  # check check
                    payoff = 1 if is_player_card_higher else -1
                expected_payoff += opponent_dist[i] * payoff
            return expected_payoff

    d = [0, 0]  # opponent action distribution
    new_opp_reach = np.copy(opp_reach)
    v = float('-inf')
    util = [0, 0]
    w = [0, 0]  # sums opponent reach probabilities over all cards taking each action
    for a in range(2):
        if acting_player != player_iteration:
            for i in range(len(opp_reach)):
                infoset = str(i) + history
                if infoset not in self.nodes:
                    self.nodes[infoset] = Node(2)
                strategy = self.nodes[infoset].get_average_strategy()
                new_opp_reach[i] = opp_reach[i] * strategy[a]  # update reach prob
                w[a] += new_opp_reach[i]  # sum weights over all poss. of new reach
        if a == 0:  # passive action: fold facing a bet, otherwise check
            if len(history) != 0 and history[-1] == 'b':
                next_history = history + 'f'
            else:
                next_history = history + 'k'
        else:  # aggressive action: call facing a bet, otherwise bet
            if len(history) != 0 and history[-1] == 'b':
                next_history = history + 'c'
            else:
                next_history = history + 'b'
        util[a] = self.brf(player_card, next_history, player_iteration, new_opp_reach)
        if acting_player == player_iteration and util[a] &gt; v:
            v = util[a]  # this action is better than the previous best action
    if acting_player != player_iteration:
        # D_(-i) = Normalize(w): d is the opponent's action distribution
        d[0] = w[0] / (w[0] + w[1])
        d[1] = w[1] / (w[0] + w[1])
        v = d[0] * util[0] + d[1] * util[1]
    return v
</code></pre></div></div> <h3 id="agent-vs-agent">Agent vs. Agent</h3> <p>Playing agent against agent is a common way to test the abilities of poker programs. It is an empirical method that lets researchers evaluate agents with different characteristics, such as different abstractions.
If a game theory optimal agent exists for a game, then another agent could play against the GTO agent as a measure of quality.</p> <p>Researchers from around the world competed annually at the Annual Computer Poker Competition that began in 2006 and is now part of the Poker Workshop at the annual AAAI Conference on Artificial Intelligence. It has had competitions in limit, no-limit, and later 3 player Kuhn poker (most recently, only no-limit competitions were played). This uses duplicate matches and two winner-determination methods: instant run-off, which eliminates the worst agent in each round, and total bankroll, which gives the win to the agent with the highest bankroll in the event. This means that there are different incentives for agents that play a more defensive vs. more aggressive strategy.</p> <h3 id="human-opponents">Human Opponents</h3> <p>The main issue with playing against human opponents is that win-rates can take approximately one million hands to converge. The 2015 Man vs. Machine competition involved 80,000 hands total against four opponents, which led to disputes over statistical significance of the results.</p> <p>Despite the difficulty of achieving very large hand samples against human opponents, there is still value in these test games, as human experts are capable of quickly analyzing a strategy and the playing statistics of that strategy, so a computer program can be sanity checked and evaluated for unique tendencies by such human experts. Computer programs could also be made available on the Internet to play against many opponents to obtain significant levels of data. In practice, most of the breakthrough agents have only been released to play against select top players since this is considered more significant than playing against random players and perhaps to keep the agents’ strategies more private.</p> <h1 id="cfr---agent-evaluation">CFR - Agent Evaluation</h1> Poker results are generally measured in big blinds per 100 hands won.
In the research community, the standard measure is milli-big-blinds per hand (or per game), or mbb/g, where one milli-big-blind is 1/1000 of one big blind. This is also used as a measure of exploitability, the expected loss per game against a worst-case opponent. A player who folds every hand will lose 750 mbb/g on average in a heads-up match (1,000 mbb/g as the big blind and 500 as the small blind). <h1 id="top-poker-agents---ai-vs-human-competitions">Top Poker Agents - AI vs. Human Competitions</h1> <p>Ever since Garry Kasparov faced off against the Deep Blue IBM supercomputer in 1996, the idea of AI defeating humans has held a special significance. There have been a number of AI vs. human competitions in poker and we will highlight some of the most important ones here.</p> <ol> <li>2007 CPRG’s Polaris Limit Hold’em</li> <li>2008 CPRG’s Polaris 2 Limit Hold’em</li> <li>2015 CMU’s Claudico No Limit Hold’em</li> <li>2017 CMU’s Libratus No Limit Hold’em</li> <li>2019 Facebook’s Pluribus Multiplayer No Limit Hold’em</li> </ol> <h2 id="2007-polaris">2007 Polaris</h2> <p>The first Man vs. Machine poker match took place in 2007, and there was hope that this would get publicity like the Deep Blue match, but it resulted in human players beating CPRG’s Polaris agent, though statistically the skills of the humans and computer were similar. The match was limit hold’em with two professionals playing against Polaris in a duplicate match, where they played a total of 4,000 hands and won $395, or 39.5 small bets. Although variance in limit hold’em variants is relatively low, this seems like a low number of hands.
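</p>

<p>As a quick sanity check on the mbb/g unit, a few lines of arithmetic reproduce the always-fold figure above and show the conversion between big blinds per 100 hands and mbb/g (the helper name is mine, not from the tutorial):</p>

```python
def mbb_per_game(big_blinds_won, hands):
    # milli-big-blinds per game (hand): 1 big blind = 1000 mbb
    return 1000 * big_blinds_won / hands

# Always folding loses 1 BB in the big blind and 0.5 BB in the small blind,
# i.e. 1.5 BB every two hands in a heads-up match:
print(mbb_per_game(-1.5, 2))    # -750.0 mbb/g

# bb/100 and mbb/g differ by a factor of 10:
print(mbb_per_game(10.0, 100))  # 100.0 mbb/g
```

<p>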
This version of Polaris was found to have an exploitability of 275.9 mbb/g.</p> <h2 id="2008-polaris-2">2008 Polaris 2</h2> <p>In the 2008 match against five poker pros in Las Vegas, Polaris came out ahead overall, winning two of the four matches, tying one, and losing the other, with a version of the agent that is now known to be exploitable for 235.3 mbb/g. This version of Polaris was updated to add “an element of learning”, whereby it selected a strategy according to how the opponent was playing instead of using one default strategy. This agent also made use of CFR, and the match was an important victory for the AI side.</p> <h2 id="2015-claudico">2015 Claudico</h2> <p>After a significant break in the action, Man vs. Machine was again played at the Rivers Casino in Pittsburgh in April-May 2015 and included major sponsorship from Microsoft Research. Tuomas Sandholm, a professor at CMU, led the team that created the bot named Claudico, which means “limp” in Latin, named such because the bot “limps”, or calls instead of raising preflop, a strategy very uncommon among expert players. The humans were some of the top in the world – Doug Polk, Dong Kim, Jason Les, and Bjorn Li.</p> <p>Sandholm’s goal is to create the greatest no-limit Texas Hold’em player in the world, which could potentially be used as the ultimate poker training tool. Along with PhD student Noam Brown, he created Tartanian7 (Claudico’s predecessor), which won both categories of the ACPC in 2014, and Baby Tartanian8, which won the total bankroll competition and took 3rd in the instant run-off competition in 2016. (The University of Alberta CPRG’s competition agents were named Hyperborean.) These agents use the standard abstraction paradigm for solving poker games.</p> <p>Sandholm first noted that Claudico has a far more varied approach to bet sizing, one of the most important and skillful aspects of the game. “I think by using one or two bet sizes humans can avoid signaling too much about the strength of their hand,” Sandholm continued.
“But the computer can use a larger number of bet sizes because it knows it isn’t giving away too much because it balances its bets. Another example is limping, by the way. Limping in on the button heads-up no-limit is often considered by humans as a novice thing to do, but the bot will do that”.</p> <p>“It plays poker very differently from how humans play poker. Humans learn from each other how humans play the game, not how it is optimally played. This bot, in contrast, has never seen a human play poker. Instead, it has reasoned from first principles how poker should be played and the conclusions are different from what humans have reached”, said Sandholm. He attributes the power of the agent to his team’s knowledge of computational game theory and optimization algorithms.</p> <p>The bot played 20,000 hands each against four top poker professionals, and the humans ended up winning over $700,000 across the 80,000 No Limit Hold’em hands, which were played at $50/$100 blinds (the players won at about 9.16 big blinds per 100 hands), with each player starting each hand with $20,000 in chips. Players won a share of a $100,000 prize pool, included to incentivize high performance, depending on their individual results. Despite the loss, Sandholm claims that this was a statistical tie because he could not say with 95% or higher certainty that the players were statistically better players (although the win was significant at the 90% level).</p> <p>Doug Polk, perhaps the most well known of the human players, noted that he felt that Claudico was very aggressive, and he recalled that it once made a huge bet into a small pot, something that would essentially never happen in a human game. He felt that Claudico was very strong overall, mostly because of how well balanced it played, meaning that it would take similar actions with a balanced range of hands.
He also noted that there are already problems with bots playing in real-money games online (despite being against the rules), so this is an ongoing concern as bots continue to improve.</p> <p>Some of the unconventional plays included limping around 10% of hands, which it seemed to do profitably, and often betting very small, like 10% of the pot, or very large, like 40 times the pot. These strategies have gotten more prevalent in human games because they can be super effective and tend to make opponents uncomfortable since they aren’t used to playing against them!</p> <h2 id="2017-libratus">2017 Libratus</h2> <p>In 2017, Libratus from CMU overwhelmingly beat top humans in the world. The match began on January 11, 2017: a rematch of the 2015 match, again featuring Professor Tuomas Sandholm’s and Noam Brown’s agent, now named Libratus (Latin for “balance”), facing off against four top poker players. The event lasted 20 days with a prize purse of $200,000 and 120,000 hands to be played (more than in the first match in order to provide more statistical significance). Hands were again mirrored so that between two human players the cards were reversed to reduce variance, and the computer played the mirrored hands from both perspectives. Each hand reset with 200 big blinds per player, and players had access to a history of each day’s hands to review at the end of each day. Sandholm declared that he thinks of poker as the “last frontier within the visible horizon” of solvable games where computers can defeat humans.</p> <p>Libratus is based on CFR+ with a sampled form of regret-based pruning to speed up computation. It was run for “trillions” of iterations over months on 200 nodes of a Carnegie Mellon University supercomputer called Bridges.
Unlike the top Go programs, which learned in part from human games, Libratus did not analyze any human hands, but rather learned from scratch, which could be beneficial when playing against humans since they may be less familiar with its strategy compared to what they are used to seeing. Additionally, Libratus used an end-game solver during the match and took card-removal effects, also known as blockers, into account (opponent hands become less likely because of the cards you are holding), which the previous version did not do; finally, the Libratus team updated the betting translations every night to avoid them being exploited (which happened with the previous version).</p> <p>The human team had been successful in the 2015 match by exploiting bet sizing, betting in between known sizes to “confuse” the bot, and they tried that tactic again in the 2017 competition, but this time even if they did find an exploit, it would be fixed within a few days. Significantly, while the previous version, Claudico, used card abstraction, this version does not. This is thought to be a significant reason for Libratus’s improvement over Claudico, since merging hands together and missing the subtleties between closely related hands could be a strong disadvantage against top players.</p> <p>Libratus ended up winning a massive $1,766,250 in tournament chips over the 20 days and 120,000 hands, or 14.7 big blinds per 100 hands, an excellent winrate. The players, two of whom also played in the first match against Sandholm’s bot, agreed that it had substantially improved in this iteration. They especially noted that the overbets (bets that are larger, sometimes significantly larger, than the pot) were surprising and challenging to combat, and that the agent’s ability to balance and to confuse the humans about which bets were value bets and which were bluffs was especially impressive and is a skill beyond the capabilities of most humans.
They also made the point that it is mentally difficult to play poker for so many hours (about 10 per day) with only limited breaks possible and limited time to study hands at night.</p> <p>What other advantages might poker agents have over humans? A major one is randomization. Computers are much better than humans at playing mixed strategies in terms of both actions and bet sizes, so most humans stick to only a few bet sizes and very approximate randomization.</p> <p>After the match, Sandholm declared that “This is the first time that AI has been able to beat the best humans at Heads-Up No-Limit Texas Hold’em” and “More generally, this shows that the best AI’s ability to do strategic reasoning under imperfect information has surpassed the best humans.” Andrew Ng, a computer scientist at Stanford University, said that this is a “major milestone for AI”, comparable to AI achievements in chess and Go. While this is certainly true, poker is most often played with six to ten players at a table, situations in which humanity still had the upper hand.</p> <p>What does this mean for the future of poker and online games? If an agent is this strong against four of the best 15 players in the world, then it could certainly be extremely successful against typical opponents encountered online. One caveat is that decisions tend to take a large amount of time, and online sites allow only a specified number of seconds per action (along with a regenerating time bank for more difficult decisions), so this strongest version of the agent would probably not quite be ready to play online now.</p> <p>Reputable poker sites have strong anti-bot detection, and while this can be circumvented by manually inputting commands given by bot software, doing so makes it more challenging to scale.
Finally, most poker games take place at six- or nine-player tables (including the even more complex tournament style of poker), which means more players and more complexity, and research has not yet focused on this problem, although even agents like Libratus may succeed with only minor tweaks.</p> <p>There were a number of super interesting hands from this match – here we discuss two that were especially weird!</p> <p>In the first hand, the computer had the 5-3 of clubs and Daniel McAulay had two hearts (his exact hand was not given). Preflop there was a raise by Daniel, a reraise by Libratus, and a 4-bet by Daniel. Normally in this situation an opponent would expect Daniel to have a very, very strong hand and would almost always throw away a hand as weak as the 5-3 of clubs unless perhaps the stack sizes were huge.</p> <p>The flop was K of hearts, Q of hearts, J of clubs, giving Daniel a flush draw and Libratus almost no chance of winning. Both players checked, Libratus probably to give up and Daniel probably to take a “free” card, since the flop was “messy” and betting could result in getting raised, while he would rather see the next card and hopefully hit his flush.</p> <p>Indeed the next card was a heart (exact card not given), which meant that Libratus had nothing and could not win, while Daniel had a very strong hand with a flush. Again both players checked, this time again probably for Libratus to give up, and for Daniel to slow play his very strong hand to try to get Libratus to bluff or hit something.</p> <p>The river card was the 5 of spades, giving Libratus a pair, though it was very unlikely to win given the rest of the board. Even if it were the best hand, it would be extremely unlikely that an opponent would call a bet with a worse hand. Still, Libratus bet, defying conventional poker wisdom, and Daniel made a small raise that he wanted Libratus to read as a likely bluff. Then Libratus went all-in as a complete bluff, knowing that it would lose if called.
Daniel quickly called and won.</p> <p>What do we make of this hand? If we saw a human do this, we’d think that he was absolutely crazy and quite likely a very poor player, but Libratus is actually one of the best agents ever created. Is it possible that this was a very low-probability action that we happened to see? This shows that a top poker player is very unpredictable, and AI is even better at being unpredictable, randomizing, and making these low-probability plays than humans ever will be.</p> <p>Noam Brown, who led the creation of Libratus, said that despite the bot’s skill at playing unexploitable poker, he believes that humans are still superior when it comes to exploiting weaker players, but that bots are gradually improving in this area also.</p> <h2 id="2019-pluribus">2019 Pluribus</h2> <p>Pluribus was released in 2019 by Noam Brown and Tuomas Sandholm as the first agent to play multiplayer poker at a high level. To prove that the agent was indeed very powerful, they recruited strong opponents (they claim that each has won more than \$1 million professionally in poker, and they provided a \$50,000 incentive to get players to play seriously). They used two evaluation methods, both at six-player tables:</p> <ol> <li>Five humans with one copy of Pluribus</li> <li>Five copies of Pluribus with one human</li> </ol> <p>The five-human experiment played 10,000 hands of poker over 12 days, where each day a different combination of five of the 13 players involved would play. After using the AIVAT variance reduction technique, Pluribus won 48 mbb per game +/- 25. The issue here is that each human only had on average about 770 hands and far fewer that directly involved the Pluribus agent, meaning that they would have very little opportunity to adjust to its tendencies.</p> <p>One of the players, Jason Les, who also played against Libratus in 2017, commented “It is an absolute monster bluffer.
I would say it’s a much more efficient bluffer than most humans. And that’s what makes it so difficult to play against. You’re always in a situation with a ton of pressure that the AI is putting on you and you know it’s very likely it could be bluffing here.” That feeling of constant pressure is exactly what playing against a very tough opponent is like, and it is a good sign for Pluribus!</p> <p>The single-human experiment used only two humans, who are perhaps less profitable in recent real-money poker games. Each of them played 5000 hands against five copies of Pluribus; each human got \$2000 for participating, and an extra \$2000 went to the one who performed better. Pluribus won 32 mbb per game +/- 15. It would have been more interesting to have some of the best players play in this format as well, and perhaps even in different formats like with only three players.</p> <p>The paper notes that one strategy that Pluribus uses that humans rarely do is “donk betting”, which means betting out into an opponent who was the one who bet on the previous round. This may be an example of a play that does make sense when playing optimally, but doesn’t make sense when attempting to play “mostly optimally” with a far more simplified strategy. Pluribus did confirm that limping is not a valuable strategy except from the small blind.</p> <h2 id="human-biases">Human Biases</h2> <p>Why can computer agents be superior to expert players? They can certainly store more information, do calculations faster, and act faster, but even on a more even playing field, there are human emotional biases that make things difficult relative to the objectivity of a computer program. Daniel Kahneman, a well-known behavioral economist, has written about how decision makers often react to different frames (how information is presented) in different ways, and suggests that these people would be better off using a risk policy that they apply as a standard broad frame.
Computer agents are essentially already doing this and have none of the framing biases, emotional biases, or loss aversion that trouble most human decision makers and poker players.</p> <p>In Daniel Kahneman’s 2011 book “Thinking, Fast and Slow”, a study by Paul Meehl is cited that shows clearly that statistical predictions made by combining a few scores or ratings according to a rule tend to be much superior to predictions based on subjective impressions of trained professionals. Looking at about 200 comparisons, the algorithms are significantly better in about 60% of cases and the rest are considered ties (and algorithms are generally much cheaper). Examples of studies include credit risk evaluation by banks and the odds of recidivism amongst juvenile offenders. Although poker involves much more than just statistical predictions, this study suggests that despite poker having a reputation as an emotional game where reading others and having a good poker face are important, the reality is that computer poker agents are already better than all but the very best human players. Some specific issues mentioned are:</p> <ul> <li>Experts try to be clever and to think outside of the box instead of sticking to “fundamentals”</li> <li>Humans are inconsistent in making summary judgments of complex information</li> </ul>Top Poker Agents - AI vs. Human Competitions Ever since Garry Kasparov faced off against the Deep Blue IBM supercomputer in 1996, the idea of AI defeating humans has held a special significance. There have been a number of AI vs. human competitions in poker and we will highlight some of the most important ones here.AIPT Section 4.4: CFR – CFR Advances2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/cfr-advances<h1 id="cfr---cfr-advances">CFR - CFR Advances</h1> <p>Now almost 15 years after the CFR algorithm was originally published by the University of Alberta in 2007, there have been a number of significant advances.
Perhaps the most important was the use of Monte Carlo sampling that was discussed in the main CFR section, which allows the algorithm to work significantly faster. In Section 1.2 History of Solving Poker, we touched on some advances like strategy purification (always playing the highest-probability action instead of mixing according to a distribution), decomposition (analyzing different subgames independently), endgame solving (using a finer abstraction to solve the endgame part specifically), and warm starting CFR with a strategy that saves many iterations rather than starting with a uniform random strategy.</p> <p>Here we want to focus specifically on a few of the most important advancements.</p> <h2 id="cfr">CFR+</h2> <p>Oskari Tammelin, an independent researcher, first published a <a href="https://arxiv.org/abs/1407.5042">paper about CFR+</a> in July 2014, which was then popularized by the University of Alberta when they used it to completely solve Heads Up Limit Hold’em poker, one of the major breakthroughs in the field of poker research, <a href="http://webdocs.cs.ualberta.ca/~bowling/papers/15science.pdf">published in Science</a> in January 2015.</p> <p>This was the first time that a full, unabstracted game that is regularly played in casinos was completely solved. While prior improvements to the vanilla CFR algorithm focused on sampling, CFR+ in fact does not rely on any sampling. CFR+ was found to be a good candidate for using compression to address the memory challenges of tabular CFR, since all values are updated at each iteration. Boards and private cards were sorted in order, prior values were used to predict subsequent values, and the errors on the predictions were stored rather than the values themselves.</p> <p><img src="../assets/section4/cfradvances/cfr+.png" alt="CFR+ equation" title="CFR+ equation" /></p> <p>The main enhancement with CFR+ is that regret matching is replaced with the newly created regret matching plus (+).
In short, any time an action’s regret_sum goes negative, we reset it to 0. The regret matching concept remains the same: the R regret matching values are replaced by the new Q regret matching plus (+) values, which are defined above in terms of R. The key change is that each term calculated in the regret matching is always non-negative, so future positive regrets immediately add to these values rather than first cancelling out accumulated negative regret. This means that actions are chosen again soon after “proving themselves useful”, instead of potentially becoming very negative and never having a chance to come back.</p> <p>We can imagine an unlucky situation in poker that results in a large loss: CFR+ essentially says that we should act as if we hadn’t seen this situation before, resetting its regret sum to 0 to give it more chances to be useful. Regular CFR would put this move in a deep hole that could take a very, very long time to climb out of.</p> <p>The Alberta researchers also found, empirically, that the final step in regular CFR versions of computing the average strategy as the Nash equilibrium strategy is not necessary with CFR+ (the exploitability of the current, non-average strategy also approaches zero), and we can simply use the final strategy at the last iteration as the computed solution.
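The reset-at-zero update can be sketched in a few lines of Python (a minimal illustration; the function and variable names are my own, not from the CFR+ paper):

```python
def regret_matching(regrets):
    """Standard regret matching: normalize positive regrets into a strategy."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    if total > 0:
        return [p / total for p in positives]
    return [1.0 / len(regrets)] * len(regrets)  # uniform if no positive regret

def cfr_plus_update(regret_sums, instant_regrets):
    """CFR+ accumulation: clip each running regret sum at zero, so
    accumulated negative regret never buries an action."""
    return [max(q + r, 0.0) for q, r in zip(regret_sums, instant_regrets)]
```

With this update, an action whose regret sum has been clipped to 0 re-enters the strategy as soon as it accrues any positive regret, matching the intuition above.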
They also showed that the Nash approximation ε is at most twice the final strategy’s exploitability.</p> <p>Tammelin and the CPRG at Alberta also proved that CFR+ and CFR have the same regret bounds and that convergence is faster in CFR+ (the initial paper version did not use average strategies, but a weighted average strategy was used in a later paper to show proof of convergence).</p> <p>In Rhode Island Hold’em (a simple poker game with one private card, two communal cards, and three betting rounds), they showed that CFR+ converges faster than CFR using either the current (i.e., final) or the average strategy, with the average strategy converging faster of the two.</p> <p>Why did they base CFR+ on Vanilla CFR and not the faster-converging sampling versions? They found that CFR+ did not work well with sampling because regret matching plus did not work well when sampling noise was present.</p> <h2 id="deep-counterfactual-regret-minimization"><a href="https://arxiv.org/abs/1811.00164">Deep Counterfactual Regret Minimization</a></h2> <p>Deep CFR was published in 2019 by Noam Brown et al. from Facebook AI Research. This was a super important paper that makes use of deep neural networks as an alternative to the abstraction-solving-translation paradigm that was previously standard. The previous method involves abstracting a game to make it smaller and tractable for the tabular (standard) CFR algorithm. Then the solution is translated (mapped) back into the full game, but this translation process can result in problems and exploitable errors. A solution to an abstracted game that only has five bet options may do poorly back in the real game when there are hundreds of possible bets. Certain abstractions can also miss important strategic elements of a game.
For example, if the only bet options are check, half pot, and pot, then important poker tools like very small bets and overbets (bets larger than the pot) will be completely ignored.</p> <p>Deep CFR aims to move on from abstractions by approximating CFR in full poker games, making it the first non-tabular version of CFR to do so. The tabular version was used in the creation of the top agents from 2015-2017 in No Limit Texas Hold’em, including Libratus and DeepStack, which both succeeded in beating top humans. The abstractions used required domain knowledge and were only rough approximations of a true equilibrium, so even though they worked well in practice, this shows the limitations of the abstraction approach.</p> <p>Though reinforcement learning, an area of machine learning where agents learn which actions to take based on rewards earned in their environment, has been around for a long time, only in recent years has the field of deep reinforcement learning developed. In games with a very large action space or a very complex environment, deep neural networks can use function approximation to learn efficiently. We saw this in 2015 when DeepMind built strong agents for many Atari games, and a few years later with AlphaZero, which is capable of excellent performance in zero-sum perfect information games like Go and chess. The approximation techniques with deep neural networks mean that no specific domain knowledge is needed, as was previously the case when building agents in Go and chess – the agent can simply learn on its own with self-play.
However, the problem remains that standard reinforcement learning algorithms are built for games of perfect information and do not work effectively on games like poker.</p> <p>This paper shows that Deep CFR converges to an ε-Nash equilibrium in two-player zero-sum games and is tested in heads-up limit Texas Hold’em, where it is competitive with previous tabular abstraction techniques and with an algorithm called Neural Fictitious Self Play, which was previously the leading function approximation algorithm for imperfect information games. The Neural Fictitious Self Play algorithm combined deep learning function approximation with Fictitious Play to create an AI agent for HULHE. The paper notes that Fictitious Play “has weaker theoretical convergence guarantees than CFR and also converges slower in practice”.</p> <p>While the DeepStack agent did use deep learning to estimate values at a depth limit of a subgame in imperfect information games, the tabular version of CFR was used within the subgames themselves.</p> <p>Earlier work similar to Deep CFR was done by Waugh et al. in 2015 with an algorithm called Regression CFR (RCFR), which “defines a number of features of the infosets in a game and calculates weights to approximate the regrets that a tabular CFR implementation would produce”. This is similar to Deep CFR, but uses hand-made features instead of the neural net automatically learning the features.
It was also only shown to work in toy games with full tree traversals, so it serves as a building block for this new algorithm.</p> <p>The standard CFR method is still applied in this paper, with one small deviation: instead of using standard regret matching, they choose the action with highest counterfactual regret with probability 1, which is found to help regret matching work better with approximation error.</p> <p>This paper uses the Monte Carlo CFR variant called external sampling due to its simplicity and strong performance.</p> <h3 id="deep-cfr-algorithm-details">Deep CFR Algorithm Details</h3> <p>Deep CFR works by approximating CFR “without storing regrets at each game state (information set) in the game by generalizing across similar infosets using function approximation with deep neural networks”.</p> <p><img src="../assets/section4/cfradvances/deepcfralg.png" alt="Deep CFR Algorithms" title="Deep CFR Algorithms" /></p> <p>My implementation is <a href="https://github.com/chisness/aipoker/blob/master/deepcfr_kuhn.py">here</a>.</p> <p>The Deep CFR algorithm works as follows:</p> <ol> <li>Separate advantage networks are initialized for each player. The advantage network V(I,a\|theta_p) uses a deep neural network to approximate the advantage of a particular action at a particular information set, where the advantage is the counterfactual value of taking that action minus the strategy-weighted value of the other actions (i.e., proportional to the regret that tabular CFR would produce).</li> <li>Reservoir-sampled advantage memories are initialized for each player along with a strategy memory. Reservoir sampling means that the memory has a fixed capacity, and once it is full, new samples replace existing entries at random so that the memory remains an unbiased sample of everything seen so far.</li> <li>There are three nested loops – the outer loop goes over CFR iterations t, the next loop alternates the traversing player, and the inner loop goes over traversals k.
Within that loop is the TRAVERSE function, which calls the CFR external sampling algorithm to collect game data.</li> <li>After each set of K traversals, the parameters of the advantage network are trained from scratch given the reservoir-sampled advantage memories by minimizing the mean squared error (MSE) between the predicted advantages from the neural network and the sampled advantages that are in the advantage memory.</li> <li>After all of the CFR iterations, only then is the average strategy trained from the strategy memory – this is what converges to the Nash equilibrium.</li> <li>At the end, the strategy is returned.</li> </ol> <p>In the CFR section, we only discussed CFR and Chance Sampling Monte Carlo CFR in detail, so here we will go through the External Sampling algorithm, titled TRAVERSE in the paper. External sampling samples the actions of the opponent and chance (i.e. all decisions external to the traversing player) and probabilistically converges to an equilibrium.</p> <p>The function takes as input history h (which includes player cards, previous actions in the hand, and the pot size), which player is the traversing player p, and the CFR iteration t. The advantage memory, strategy memory, and regret network parameters are used in the function in the algorithm, but in my implementation are accessible without being direct inputs, which makes the inputs to the function the same as they would be for a regular tabular CFR variant. The game tree traversals alternate between each of the two players and the player currently traversing the tree is called the traversing player.</p> <p>The algorithm then works as follows:</p> <ol> <li>Checks whether the history is terminal (i.e. the end of the hand) and if so returns the payoff, where everything is in terms of the traversing player p. 
So if player p wins then the payoff is positive, and otherwise it’s negative.</li> <li>If the history is a chance node, then it is sampled, for example dealing the cards in the first round of a poker game.</li> <li>When it’s the traversing player’s turn to act, then the strategy is determined from the deep neural network for predicted advantages – the network returns the regret values and then they are converted to a strategy by the standard regret matching procedure. The regret values are intended to be proportional to the regret that would have been produced by tabular CFR – proportional since regret matching is computed using ratios. For each possible action (sampling all actions is useful for reducing variance but could be limiting in games with a very large number of actions), the value of that action is recursively computed by calling the same TRAVERSE function. Then once those values are all determined, the action advantages (known as regret values in prior CFR algorithms) for each action are computed by taking the just computed counterfactual action values and subtracting the average value of all of the other actions (note that some implementations of CFR subtract the value of the node, which would include all actions, not only actions aside from the action being computed). Note that normally in CFR we are multiplying the advantage value by the probability of the opponent playing to that node, but here because we are sampling opponent nodes, the sampling divides by this same value and so we can effectively ignore that part of the equation. For example, if action 1 had v(1) = 3 with probability 0.25 and action 2 had v(2) = 5 with probability 0.15 and action 3 had v(3) = 8 with probability 0.6, then r(I,1) = 3 - 0.15 * 5 - 0.6 * 8 = -2.55. r(I,2) = 5 - 3 * 0.25 - 8 * 0.6 = -0.55. r(I,3) = 8 - 3 * 0.25 - 5 * 0.15 = 6.5. The tuple of the information set, CFR iteration, and the just computed advantages are stored in the advantage memory M_V. 
Finally, the utility of the information set (the strategy-weighted value of all actions) is returned (this seems to be mistakenly omitted from the algorithm as printed).</li> <li>When it’s the opponent player’s turn to act, the strategy is again determined from the deep neural network for predicted advantages using regret matching. Now, instead of inserting the information set, CFR iteration, and advantages into the advantage memory, we substitute the strategy and insert into the strategy memory. Average strategy updates are inserted only on the opponent’s turns so that the average strategy is not biased. Then an action is sampled from the strategy distribution and the TRAVERSE function is recursively called “on policy”.</li> </ol> <p>The core of Deep CFR is that it can approximate the proportion of the regrets at each information set of a game by using the deep network. Theorem 1 in the paper says that with a sufficiently large memory buffer, “with high probability the algorithm will result in average regret being bounded by a constant proportional to the square root of the function approximation error”.</p> <h3 id="experimental-setup">Experimental Setup</h3> <p>The authors of the paper tested the performance of the algorithm in a game called heads-up flop hold’em, which is heads-up limit hold’em with only two rounds (ending after the flop) instead of four rounds (also having a turn and river). They also experimented with Linear CFR, which means weighting each advantage by the iteration number, so that later iterations count for more (this is sensible because the agent is more experienced as the iterations go on).
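The per-action advantage computation in step 3 of the traversal above can be checked with a short sketch that reuses the worked numbers from the text (plain lists stand in for the network outputs):

```python
def action_advantages(action_values, strategy):
    # Advantage of each action: its counterfactual value minus the
    # strategy-weighted value of the *other* actions, as in the text
    # (some CFR implementations instead subtract the full node value)
    n = len(action_values)
    return [action_values[a]
            - sum(strategy[b] * action_values[b] for b in range(n) if b != a)
            for a in range(n)]

# Worked example from the text: v = (3, 5, 8), strategy = (0.25, 0.15, 0.6)
print(action_advantages([3.0, 5.0, 8.0], [0.25, 0.15, 0.6]))
# approximately [-2.55, -0.55, 6.5]
```

The third action, whose value exceeds the weighted value of the alternatives, is the only one with positive advantage, so regret matching would favor it on the next iteration.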
They show good results against NFSP implementations and smaller tabular implementations with lower abstraction sizes, but the algorithm seems to do worse against larger abstractions, though those require more fine-tuning and expertise than the Deep CFR algorithm.</p> <h4 id="neural-network">Neural Network</h4> <p>The main neural network architecture takes as input cards that are run through a card embedder and bets that are input both as a binary for whether or not they occurred in each round and numerically as the proportion of the pot that was bet (e.g. a bet of \$50 into a pot of \$100 would mean bet is True and fraction of the pot is 0.5). It then outputs predicted advantages for each possible action for the value network and logits of the probability distribution over actions for the average strategy network.</p> <!-- Single Deep CFR https://arxiv.org/pdf/1901.07621.pdf -->AIPT Section 6.1: Other Topics – Interpreting Agent Decisions2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/decision-making-lessonsAIPT Section 4.2: CFR – Game Abstractions2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/game-abstractions<h1 id="cfr---game-abstractions">CFR - Game Abstractions</h1> <p>Abstraction is the main method for solving incomplete information games that are too large to be solved in their original form. It is extremely prevalent in solving poker games.</p> <h2 id="game-abstraction-techniques">Game Abstraction Techniques</h2> <p>The Abstraction-Solving-Translation model includes the following steps:</p> <p>First, the game is abstracted to create a smaller game that is strategically similar to the original game. Then the approximate equilibrium is computed in the abstract game, and finally the abstract-game strategy is mapped back to the original game.
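The translation step can be illustrated with a deliberately simple sketch: map an observed real-game bet to the nearest bet size in the abstraction, measured as a fraction of the pot (a minimal illustration; real agents use more careful, often randomized, mappings, and the names here are my own):

```python
def translate_bet(bet, pot, abstract_fractions=(0.0, 0.5, 1.0)):
    # Map a real-game bet onto the nearest abstract action by pot fraction
    # (0.0 = check, 0.5 = half-pot bet, 1.0 = full-pot bet)
    fraction = bet / pot
    return min(abstract_fractions, key=lambda f: abs(f - fraction))

# A $30 bet into a $100 pot is treated as the half-pot abstract action
print(translate_bet(30, 100))  # 0.5
```

This nearest-size rounding is exactly where exploitable errors can creep in: an opponent can size bets to sit between abstract actions so that the translated response is systematically wrong.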
The resulting strategy is a Nash equilibrium approximation in the original game, but it will still be exploitable to some degree since it is only an actual Nash equilibrium in the abstracted smaller game.</p> <p>We need abstraction when game sizes are too large to be solved with current technology, when the original game is too difficult or large to write out in full detail, or when the game is not composed of discrete actions/states in its original form.</p> <p>Strategies for abstract games are defined in the same way as strategies in the main game, but actions removed by the abstraction must be given zero probability: if a betting abstraction allowed only check, bet 1/2 pot, and bet full pot, then in a situation where the pot is \$100, only checking, betting \$50, and betting \$100 would be possible, while all other options would have zero probability.</p> <h2 id="game-size">Game Size</h2> <p>The size of a game is generally computed from two components:</p> <ol> <li>The number of game states, which in poker is generally counted as information sets to account for situations that are equivalent given one player’s information, and for which the player therefore uses the same strategy</li> <li>The number of actions at each of these information sets</li> </ol> <p>The product of these two components is referred to as infoset-actions, which relates to the amount of memory needed to store a strategy, as in the CFR algorithm.
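As a tiny illustration of the infoset-actions measure, we can count it for Kuhn poker, a standard toy game (the enumeration below is my own, not from the text):

```python
from itertools import product

# Kuhn poker: each player holds one of three cards. Betting histories at
# which someone must act: '' and 'pb' (player 1), 'p' and 'b' (player 2),
# where p = pass/check and b = bet.
cards = ["J", "Q", "K"]
decision_histories = ["", "p", "b", "pb"]
actions_per_infoset = 2  # pass/check or bet (fold/call when facing a bet)

infosets = [(card, h) for card, h in product(cards, decision_histories)]
print(len(infosets), len(infosets) * actions_per_infoset)  # 12 infosets, 24 infoset-actions
```

A tabular CFR implementation for Kuhn poker therefore stores regrets and strategy sums for just 24 infoset-action pairs, which is why it fits comfortably on any machine, unlike the hold’em games sized below.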
This is the standard game size measurement in poker.</p> <p>Knowing the size of a game is valuable because it lets us compare the complexity of different games and determine which algorithms might be usable and what level of abstraction might be needed to bring the game down to a tractable complexity.</p> <p>Michael Johanson wrote a paper <a href="https://arxiv.org/pdf/1302.7008.pdf">Measuring the Size of Large No-Limit Poker Games</a> in 2013 that explains these methods in detail.</p> <p>This can be extended by considering the game states from multiple perspectives. The two-player view counts infoset-actions from an external player’s view (one who cannot see private cards). The one-sided view is as described above, with one player’s infoset-actions. The canonical one-player view further reduces the size by including lossless abstractions (states that are strategically identical).</p> <p>CFR stores regrets and strategies at each information set, so CFR is known to run in proportion to the size of the game.</p> <h3 id="limit-holdem-size">Limit Hold’em Size</h3> <p>In limit hold’em, computing the number of infoset-actions is relatively easy because there is only one bet size allowed in each betting round, betting can occur a maximum of four times per round, and the betting actions and information sets within each round are independent of the betting history and stack sizes (assuming stacks large enough to complete all bets).
From a one-player perspective (assuming the 2nd player’s cards are unknown), the number of ways to deal the cards is calculated as:</p> <p>52c2 for the 1st round, then 52c2 * 50c3 for the 2nd round, and so on</p> <p>These calculations would be reduced if we considered lossless abstraction of card combinations.</p> <p>We can calculate the number of information sets by looking at each round and multiplying the card combinations in that round by the possible betting sequences, based on a chart of betting sequences.</p> <h3 id="no-limit-holdem-size">No Limit Hold’em Size</h3> <p>No-limit poker is more of a computational challenge because each betting round depends on prior rounds, since each player’s stack size varies as the hand progresses. Each game depends on two variables: the stack size to start the game and the value of the big blind.</p> <p>Per the game rules, players have the following two betting restrictions:</p> <ul> <li>Minimum bet: max(big blind, current bet size)</li> <li>Maximum bet: stack size (all-in)</li> </ul> <p>The legal actions possible depend on three factors: “amount of money remaining, size of bet faced, and if it’s possible to check (if it’s the first action in a round)”.</p> <p>Each of these factors strictly increases or decreases in a round. The method used to compute the number of infoset-actions in no limit hold’em poker is to “incrementally compute the number of action histories that reach each of these configurations by using dynamic programming”.</p> <p>Michael Johanson’s paper performs these calculations for the standard No Limit Texas Hold’em game used in the ACPC, which uses \$20,000 (200-blind) stacks with \$50-\$100 blinds. Although 200 blinds is fairly normal in poker (though most online games start with 100 blinds), the large stack size in absolute dollar terms means that a much larger number of actions are possible than, for example, 200 blinds in a \$1-\$2 blind setting.
The initial raise in the latter setting is any amount from \$4 to \$400, whereas in the former it is \$200 to \$20,000.</p> <h3 id="comparing-limit-and-no-limit-holdem">Comparing Limit and No Limit Hold’em</h3> <p>Whereas limit hold’em has a 1-sided canonical game size of 1.4x10^13 infoset-actions, no limit \$1-\$2 with \$1000 starting stacks (500 blinds) is 3.12x10^71, \$1-\$2 with \$400 (200 blind) starting stacks is 6.0x10^46, and \$50-\$100 with \$20,000 (200 blind) starting stacks is 2.8x10^160. Not including transpositions, chess has 10^47 game states, checkers has 10^20 game states, and Go has 10^170 states.</p> <p>Although one vs. one limit hold’em has now been solved over a long computation period with a very specialized parallel machine setup, no limit is substantially larger and requires abstraction to make the game small enough to be solved.</p> <p>Johanson recommends finding a game that can be analyzed in unabstracted form but that is still valuable as a research testbed. He suggests that the game should satisfy three properties:</p> <ol> <li> <p>“Unabstracted best response computations are tractable and convenient, so worst case performance of strategies with abstracted betting can be evaluated. One can then evaluate abstraction and translation techniques in isolation from other factors.”</p> </li> <li> <p>“Unabstracted equilibrium computations are tractable and convenient. So we can compute an optimal strategy for the game and measure its performance against agents that use betting abstraction.”</p> </li> <li> <p>“Strategic elements similar to those of NLHE” (in terms of rounds, deck size, 5-card poker hands, and large stack sizes)</p> </li> </ol> <p>Properties (1) and (2) allow us to evaluate agents in the full game, both in terms of (1) best response and (2) performance against the full-game equilibrium.
For property (3), the requirement that the game be solvable on standard personal computers limits the size of game we can use, but meeting the first two properties while making the game as large as possible keeps it as strategically close to NLHE as possible.</p> <h3 id="royal-no-limit-holdem">Royal No Limit Hold’em</h3> <p>Johanson suggests as a potential testbed game No Limit Royal Hold’em with 2 betting rounds, \$20 stack sizes, and \$1-\$2 blinds. The game size is 1.55x10^8 and CFR requires 7GB of RAM for the computation.</p> <p>While full poker games that are commonly played in casinos require more memory than is feasible for today’s computers, Royal No Limit Hold’em is accessible to all, which could make a game of this sort a more even playing field in a competition. Later on this page, we show an experiment with this game to evaluate different betting abstractions, which suggests that unfortunately the game may be too small to work effectively.</p> <p>With more advanced CFR versions that use deep learning rather than only tabular data like the original CFR, it’s likely that a much larger testbed game could be explored now in 2021.</p> <h2 id="types-of-abstractions">Types of Abstractions</h2> <p>The two main ways to create a smaller game from a larger one in poker are to merge information sets together (card abstraction) and to restrict the actions a player can take from a history (action abstraction). These techniques are often combined.</p> <p>A further possibility is to simplify the game itself. This can be done in poker by limiting the maximum number of bets per round, eliminating betting rounds, and eliminating cards.
For example, the variant of Texas Hold’em that we analyze with betting abstractions below is called Royal Texas Hold’em and uses only 20 cards instead of the standard 52 in the deck.</p> <p>Abstractions can be either lossless or lossy. A lossless abstraction respects the original strategic complexity of the game, while reducing the number of game states.</p> <h3 id="lossless-abstraction-and-isomorphisms">Lossless Abstraction and Isomorphisms</h3> <p>With poker, the first step is usually to use lossless abstraction to take advantage of the strategic equivalence of poker hands with regards to their suits. All suits are of the same value, so only how many cards of the same suit a player has is relevant, not the actual type of suit. For example, a player with a starting hand of Jack of spades and Queen of hearts holds a hand of exactly the same quality as Jack of diamonds and Queen of clubs. There are 16 combinations of a Queen and Jack. The 12 offsuit combinations can be reduced to a single abstracted strategy, and the four same-suit combinations are likewise equivalent to a single abstracted strategy. Such abstractions generally reduce the size of poker games by one to two orders of magnitude. Lossless abstraction enabled the solution of Rhode Island Hold’em (by Gilpin and Sandholm from CMU in the mid 2000s), an AI challenge problem with 3.1 billion nodes in the game tree, but generally, lossy abstraction is also needed.</p> <p>This lossless abstraction must be redefined at each betting round, because while the type of suits is not relevant on a per-round basis, future rounds can redefine the value of a hand according to its suits.
Continuing the above example, after a flop of 6h7h8h, the QhJs hand is much superior to the QcJd hand due to now having four hearts (one heart away from a flush).</p> <p>In a Texas Hold’em game, just from the first round alone, we move from C(52,2) * C(50,2) = 1,326 * 1,225 = 1,624,350 combinations of both players’ hole cards to 28,561 combinations by using lossless abstraction.</p> <p>Kevin Waugh showed a <a href="http://www.cs.cmu.edu/~./kwaugh/publications/isomorphism13.pdf">fast and optimal technique</a> to index poker hands that accounts for suit isomorphisms. Isomorphisms are cases where poker hands cannot be strategically distinguished. Using such techniques, we can build lossless abstraction. For example, in Royal Hold’em, where there are 20 cards (four suits and cards Ten, Jack, Queen, King, and Ace), we have the following two-card starting hands:</p> <ul> <li>Order and suitedness matter: 20 * 19 = 380 combinations</li> <li>Order does not matter, suitedness matters: 20c2 = 190 combinations</li> <li>Order and suitedness do not matter: 25 combinations (the 10 unpaired rank combinations each appear as suited and as unsuited, plus 5 pairs)</li> </ul> <p>We can permute the suits and order of the cards within any round however we would like without losing any strategic significance, so Royal Hold’em effectively begins with only 25 information sets for the player acting first in the preflop round. According to Waugh’s paper, it is important that we can construct an indexing function that is efficient to compute, optimal with no holes, has an inverse mapping, and is generalizable to other games.</p> <p>In practice, we store the regrets and strategies for each index, whereby multiple equivalent hands can use the same index.
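</p> <p>As a sanity check on these counts, suit equivalence can be verified by brute force: canonicalize each hand by trying all 24 suit relabelings and keeping the smallest result. This is a sketch with our own naming, not Waugh’s indexing scheme:</p>

```python
from itertools import combinations, permutations

RANKS = "TJQKA"   # Royal deck ranks
SUITS = "cdhs"    # four suits
DECK = [(r, s) for r in RANKS for s in SUITS]  # 20 cards

def canonical(hand):
    """Canonical representative of a hand under suit permutation:
    relabel suits in all 24 ways, sort the cards, keep the minimum."""
    forms = []
    for perm in permutations(SUITS):
        relabel = dict(zip(SUITS, perm))
        forms.append(tuple(sorted((r, relabel[s]) for r, s in hand)))
    return min(forms)

hands = list(combinations(DECK, 2))      # 20c2 = 190 unordered hands
classes = {canonical(h) for h in hands}
print(len(hands), len(classes))          # prints: 190 25
```

<p>A dictionary of canonical forms like this is the naive version of the idea; Waugh’s multiset-colex indexing achieves the same grouping with a perfect, hole-free integer index.</p> <p>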
The indexing procedure works by indexing using the multiset-colex index, whereby we first index rank sets (sets of cards of the same suit), then rank groups (sequences of rank sets), and finally combine them into a hand index (details in the paper linked above).</p> <h3 id="lossy-abstraction">Lossy Abstraction</h3> <p>All other abstractions are lossy and result in some loss of strategic significance. We look at experiments with action abstraction in our work with Royal No Limit Hold’em and with card abstraction in Kuhn Poker.</p> <p>Action abstraction is when players are given fewer actions available than in the original game, that is, a restriction on the players’ strategy space. This is especially useful for games with large numbers of possible actions available, such as NLHE. In no limit poker, the most standard action abstraction is allowing only {fold, call, pot bet, allin bet}. This restricts the first action in a no limit hold’em game with \$20 starting stacks and a \$2 big blind to either {fold, call \$2, raise to \$4, raise to \$20} instead of {fold, call, raise to any amount between and including \$4-\$20}, which results in four total actions possible instead of 19. These types of abstractions are often used when running solver simulations.</p> <p>Card, or information, abstraction occurs by grouping categories of hands into equivalence classes called buckets. The standard method, expected hand strength, works by grouping by the probability of winning at showdown against a random hand by enumerating all possible combinations of community cards and finding the portion of the time the hand wins. For example, one could create five buckets to divide the probabilities into equities from {0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1}. This could lead to some buckets being very small and others very large. Hands also must transition between buckets during play. 
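</p> <p>As a toy illustration of this expected-hand-strength bucketing (the function name is ours), a hand’s enumerated showdown equity maps to a bucket index like so:</p>

```python
def equity_bucket(equity, n_buckets=5):
    """Map a showdown equity in [0, 1] to one of n_buckets equal-width
    buckets, e.g. {0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1} for five."""
    assert 0.0 <= equity <= 1.0
    return min(int(equity * n_buckets), n_buckets - 1)  # equity 1.0 falls in the top bucket

# A hand winning 63% of enumerated runouts lands in bucket 3 of 5
print(equity_bucket(0.63), equity_bucket(0.05), equity_bucket(1.0))  # prints: 3 0 4
```

<p>All hands whose equities fall in the same interval would then share regrets and strategies.</p> <p>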
Buckets can be created automatically, as in the manner just described, or manually; manual bucketing requires expert input and makes it quite difficult to construct policies.</p> <p>Three common ways of bucketing are:</p> <p><strong>Expectation-based:</strong> Developed by Gilpin and Sandholm in 2007. Buckets are created based on the potential to improve given future cards, typically by putting the same number of hands into some fixed number of buckets according to hand strength. A problem is that buckets can be unevenly distributed and that there are often strategic differences in playing draw hands compared to made hands, which would not be differentiated here (for example, the expected hand strength of JT on a 983 flop may be similar to that of A8, but A8 is a mid-strength hand that probably should play passively, while JT is a good semi-bluffing hand with overcards and a straight draw). The standard technique is to give each bucket on each round the same percentage of hands based on a hand strength measure, such as the expectation of hand strength E[HS] or E[HS^2] (which gives more value to hands with higher potential to improve in the future) or some similar metric, each measured as the probability of winning at showdown against a random hand.</p> <p><strong>Potential-aware:</strong> An automated abstraction technique created in 2007 by Gilpin and Sandholm that uses multi-dimensional histograms, where the maximum number of buckets in each round is determined based on the size of the final linear program to solve. This starts in the final round by bucketing hands into sets with k-means clustering; buckets on prior rounds are then derived from the final round based on how often they turn into the final round clusters.
So if there are five clusters on the final round, then there would be a distribution of five transition probabilities on the prior round, and these histograms are then grouped together using a similarity function over their values, such as Euclidean distance. This continues backwards until the first round.</p> <p><strong>Phase-based:</strong> This method solves earlier parts of the game separately from later parts of the game, which allows for finer abstraction in each segment than would be possible in the game as a whole.</p> <p>Additionally, most poker algorithms depend on <strong>perfect recall</strong> in terms of theory to arrive at a Nash equilibrium. This is the assumption that an agent doesn’t forget anything and can rely on prior knowledge. An additional way to abstract, then, is to use <strong>imperfect recall abstractions</strong> (which remove theoretical CFR guarantees), in which the player “forgets” previous observations and uses information based only on specific parts of the history, or even on no history at all. The perfect recall assumption is itself not very realistic, since real human memory does not retain all historical observations and actions the way perfect recall abstractions assume.</p> <p>Imperfect recall therefore places more emphasis on recent information, which can be done by using an abstraction with fewer buckets on earlier rounds or by forgetting all prior rounds except for the current. This is similar to endgame solving – a way to focus resources on the most important betting rounds and decisions, particularly in no limit hold’em.</p> <p>There are three main methods to compare abstraction solutions to poker games: one on one (against either another agent or a human), versus equilibrium, and versus best response. The respective pitfalls of these methods are intransitivities, infeasible computation, and poor correlation with best performance.
Abstractions can also be measured based on their ability to estimate the true value of the game.</p> <h3 id="action-translation">Action Translation</h3> <p>A reverse mapping, also known as action translation, is used to map actions in the original game, where all actions are possible, to an action in the abstracted model. This is necessary because opponents can take actions in the full game that have been removed from the abstracted model. Clever bet sizing can render the most basic mappings highly exploitable, so an intelligent model is needed to handle these situations. The basic model works by mapping an observed action a of the opponent to an action a’ that exists in the abstracted model, and then responding to the action as if the opponent had played a’.</p> <p>Prior to Ganzfried and Sandholm’s solution in 2013 in “<a href="https://www.cs.cmu.edu/~sandholm/reverse%20mapping.ijcai13.pdf">Action Translation in Extensive-Form Games with Large Action Spaces: Axioms, Paradoxes, and the Pseudo-Harmonic Mapping</a>”, most mappings were exploitable and based on heuristics, not theory.</p> <p>Their model works as follows: The opponent bets x, an element of [A, B], where A is the largest bet size in the abstraction that is ≤ x and B is the smallest bet size in the abstraction that is ≥ x, assuming 0 ≤ A &lt; B.</p> <p>The question is where to map (and therefore how to respond to) the bet x: as if it were A or as if it were B.
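</p> <p>Denoting by f the probability of mapping x down to A (formalized next), a minimal sketch of the pseudo-harmonic mapping that this section derives below lets us check the desiderata numerically (the function name is ours):</p>

```python
def pseudo_harmonic(x, A, B):
    """Probability of mapping an observed bet x in [A, B] to the lower
    abstraction bet size A (pseudo-harmonic mapping, bets in pot units)."""
    assert 0 <= A < B and A <= x <= B
    return (B - x) * (1 + A) / ((B - A) * (1 + x))

# Boundary constraints: a bet already in the abstraction maps to itself
assert pseudo_harmonic(2, 2, 10) == 1.0   # f(A) = 1
assert pseudo_harmonic(10, 2, 10) == 0.0  # f(B) = 0

# Monotonicity: as x moves toward B, the probability of mapping to A falls
probs = [pseudo_harmonic(x, 2, 10) for x in (2, 4, 6, 8, 10)]
assert probs == sorted(probs, reverse=True)

# Consistency with the clairvoyance game: blending the calling
# frequencies 1/(1+A) and 1/(1+B) reproduces 1/(1+x) exactly
f = pseudo_harmonic(5, 2, 10)
assert abs(f / (1 + 2) + (1 - f) / (1 + 10) - 1 / (1 + 5)) < 1e-12
```

<p>The last check is exactly the equation solved below for f_{A,B}(x).</p> <p>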
f_{A,B}(x) is the probability that we map x to A, with the goal being to minimize exploitability.</p> <p>The paper gives the following basic desiderata for all action mappings in poker:</p> <ol> <li>Boundary Constraints: If an opponent bets an action in our abstraction, then x should be matched to that bet size with probability 1, so f(A) = 1 and f(B) = 0.</li> <li>Monotonicity: The probability of mapping to A should decrease as x moves closer to B</li> <li>Scale Invariance: Scaling A, B, and x by some multiplicative factor k &gt; 0 does not affect the action mapping</li> <li>Action Robustness: f changes smoothly in x, avoiding any sudden changes that could result in exploitability</li> <li>Boundary Robustness: f changes smoothly with A and B</li> </ol> <p>There are also properties that are justified based on a small toy game called the clairvoyance game, found in the book The Mathematics of Poker. The game works as follows:</p> <ul> <li>Player P2 is given no private cards</li> <li>Player P1 is given a single card drawn from a distribution of half winning and half losing hands</li> <li>Both players start with n chips</li> <li>Both players ante \$0.50, so the starting pot is \$1</li> <li>P1 acts first and can bet any amount from 0 to n</li> <li>P2 responds by calling or folding (no raising is allowed and a bet of 0 simply results in a showdown)</li> </ul> <p>The solution of this game was found to be:</p> <ul> <li>P1 bets n with probability 1 with a winning hand</li> <li>P1 bets n with probability n/(1+n) with a losing hand (otherwise checks, with probability 1/(1+n))</li> <li>P2 calls a bet of size x ∈ [0, n] with probability 1/(1+x)</li> </ul> <p>This motivates a proposed action translation mapping that meets the above desiderata and is consistent with the results of the clairvoyance game:</p> <p>f_{A,B}(x) * 1/(1+A) + (1-f_{A,B}(x))*1/(1+B) = 1/(1+x)</p> <p>Which can be solved to find the mapping, called the pseudo-
harmonic mapping:</p> <p>f_{A,B}(x) = (B-x)(1+A)/((B-A)(1+x))</p> <p>This mapping is the only one consistent with player 2 calling a bet of size x with probability 1/(1+x) for all x ∈ [A, B].</p> <p>This mapping exhibited less exploitability than prior mappings in almost all cases, based on test games such as Leduc Hold’em and Kuhn Poker. In Kuhn Poker, an interesting phenomenon was discovered: fitting an action betting abstraction to a known equilibrium strategy could actually result in the agent being more exploitable. The optimal bet size was found to vary significantly as different stack sizes and mappings were used.</p> <p>This means that the optimal action abstraction to use could vary depending on the action translation mapping used. It may be important to use action abstractions that combine optimal offensive actions used by the agent itself and defensive actions that are used by opponents and are necessary to reduce exploitability. It may be even better to use game specific information in determining the abstraction or to use different mappings at different information sets.</p> <h3 id="card-abstraction-in-kuhn-poker">Card Abstraction in Kuhn Poker</h3> <p>For this coding project, we use a version of Kuhn Poker with a deck of 100 cards, so the game still has the same rules, but the complexity increases because players no longer face such simplistic decisions as they would with a very small deck.
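</p> <p>A uniform bucketing of this 100-card deck can be sketched as follows (a simplified stand-in for the bucketing used in the experiments below; the helper name is ours):</p>

```python
def bucket(card, n_buckets, deck_size=100):
    """Uniformly bucket a card id in [0, deck_size); with 3 buckets the
    groups are {0-32}, {33-65}, {66-99} (the top bucket absorbs the remainder)."""
    width = deck_size // n_buckets
    return min(card // width, n_buckets - 1)

print([bucket(c, 3) for c in (0, 32, 33, 65, 66, 99)])  # prints: [0, 0, 1, 1, 2, 2]
```

<p>Regrets and strategies are then stored per bucket rather than per card.</p> <p>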
Kuhn Poker has only four information sets per card, so it has 12 information sets in standard form (using 3 distinct cards) and 400 information sets in the 100-card version.</p> <p>We compared these versions of CFR:</p> <ol> <li>Chance Sampling (sampling only chance nodes and then running regular CFR)</li> <li>External Sampling (sampling chance nodes and opponent nodes)</li> <li>Vanilla (regular)</li> <li>CFR+ (regular CFR with a modified regret metric that resets regrets to 0 if they become negative)</li> </ol> <p>For the External and Chance Sampling CFR versions, we tested simple bucketing that splits the cards into 3, 10, and 25 buckets. This means that with the 3-bucket abstraction, each group of hands {0-32, 33-65, 66-99} will share a single strategy. In effect, each player’s exact card is hidden even from himself and he knows only which bucket his card is in, although at showdown each player’s true card would be revealed to determine the winner of the hand.</p> <p>The results are shown below. The computed game value is -0.0566 for all versions except the 3-bucket abstraction, which gives -0.0264, meaning that Player 2’s advantage is decreased due to the weakness of the abstraction. Note that ES and CS refer to External Sampling and Chance Sampling and 3B/10B/25B refer to 3 buckets, 10 buckets, and 25 buckets.</p> <p><img src="../assets/section4/abstractions/kuhnabscs.png" alt="Card Abstraction Chance Sampling" title="Card Abstraction Chance Sampling" /></p> <p><img src="../assets/section4/abstractions/kuhnabses.png" alt="Card Abstraction External Sampling" title="Card Abstraction External Sampling" /></p> <p>The figures are quite similar for both algorithms and show that the lower-bucket abstractions converge slightly faster than the 25-bucket abstraction, which in turn converges faster than the unabstracted version.
However, the unabstracted strategy almost immediately becomes less exploitable than the abstracted alternatives, rendering them not very useful, probably because the game is not complex enough. We also see that the exploitability values improve significantly as the number of buckets increases, because with fewer buckets, more hands share the same strategy.</p> <h3 id="bet-abstraction-in-no-limit-royal-holdem">Bet Abstraction in No Limit Royal Hold’em</h3> <p>We look at a variation of No Limit Hold’em called Royal No Limit Hold’em that requires only 7GB of RAM to solve (assuming one byte per infoset to store the behavioral strategy and two 8-byte doubles per infoset to solve the game precisely), which means that it can be used as a testbed for anyone to work on, without requiring access to supercomputers. This game and its use as a testbed were proposed by Michael Johanson in his 2013 paper, “<a href="https://arxiv.org/abs/1302.7008">Measuring the Size of Large No-Limit Poker Games</a>”, and were described above in the Royal No Limit Hold’em section. This testbed concept was important because at the time, during the ACPC, a major factor correlating with the quality of the poker agents was access to powerful supercomputers capable of solving larger game abstractions, which gave an advantage to those with access to such machines.</p> <p>Royal No Limit Hold’em is a simplified version of No Limit Texas Hold’em. We will study the game with 2 players, blinds of \$1 (small blind) and \$2 (big blind), and starting stacks of \$20 per player, which reset after each hand. Instead of a standard 52 card deck, the Royal game discards cards 2 through 9 and uses only the Ten, Jack, Queen, King, and Ace, hence the name Royal. Royal Hold’em has only two betting rounds, not four as in standard Texas Hold’em. There is the preflop round, and then after the flop (the first three community cards) betting round, the Royal Hold’em game ends.
No limit betting is the same as in Texas Hold’em – no maximum number of bets per round and the minimum bet is the big blind or the prior bet/raise size, whichever is larger. The maximum bet size is the amount of chips in front of each player.</p> <p>The strength of hands in Royal Hold’em is, in order from best to worst: Royal flush (the only flush possible), Four of a kind, Full house, Straight, Three of a kind, Two pair, One pair (worse hands are not possible). In normal poker, a hand without a pair would also be possible, as would other flushes.</p> <p>Royal No Limit Hold’em has on the order of 10^8 information sets, compared to the full No Limit Hold’em game on the order of 10^161 information sets.</p> <p>The purpose of experimenting with action abstractions in Royal No Limit Hold’em is to understand which abstractions are most efficient, and since Royal Hold’em has many features in common with Texas Hold’em, it is hoped that these abstraction findings could transfer to researchers studying Texas Hold’em or other incomplete information games.</p> <p>The experiment was run using the ACPC standard protocol, a version of CFR called Pure CFR created by Richard Gibson of the University of Alberta, and Kevin Waugh’s open source card isomorphism code. Pure CFR is a version of CFR that, on each iteration, samples a pure strategy profile (exactly one action assigned probability 1 at each state) rather than working with full mixed strategies.
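</p> <p>The per-state sampling step can be sketched as follows (our illustration of the idea, not Gibson’s implementation): one action is drawn from the regret-matching distribution using only integer arithmetic:</p>

```python
import random

def sample_pure_action(regrets):
    """Draw a single pure action from the regret-matching distribution.
    With integer regrets, the weights and running total stay integers."""
    positive = [max(r, 0) for r in regrets]
    total = sum(positive)
    if total == 0:
        return random.randrange(len(regrets))  # no positive regret: pick uniformly
    pick = random.randrange(total)             # integer in [0, total)
    for action, weight in enumerate(positive):
        if pick < weight:
            return action
        pick -= weight

# With regrets [0, 3, 1], action 1 is drawn about 3/4 of the time
print(sample_pure_action([0, 3, 1]))
```

<p>Sampling pure profiles this way is what allows the stored counters to remain integers.</p> <p>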
Pure CFR has been shown to work faster than Vanilla CFR and is also memory efficient, using only integer values rather than doubles, which is made possible by the pure strategy profiles and the fact that pot sizes and bets in poker are all integer valued.</p> <p>We tested solving the full game and also the game with these action abstractions:</p> <ol> <li>Fold, call, pot, allin (FCPA)</li> <li>Fold, call, half pot, allin (FCHA)</li> <li>Fold, call, minimum, half pot, pot, allin (FCMHPA)</li> <li>Fold, call, minimum, half pot, three-quarters pot, pot, allin (FCMHQTPA)</li> </ol> <p>Since the full game can be solved completely, this enables us to compare our abstracted solutions against the full game solution, which is not otherwise possible in standard poker games due to the size of the full game.</p> <p>Each abstraction and the main game were run for 12 hours over four threads on a 2.6 GHz Intel Core i7 computer with 16 GB of RAM. Since no best response algorithm was used in this experiment, the abstractions were run for a period, tested, and then run longer, repeating until the strategies appeared to have converged based on changes in the match results. The sizes of the strategies of the abstractions from smallest to largest were 5.5 MB, 10.2 MB, 160 MB, and 471 MB. The full game solution is 3.58 GB.</p> <p>We compare our abstractions in two ways:</p> <ol> <li>Against main game equilibrium: Each of the four abstracted solutions plays against the main game solution in the main game</li> <li>Against each other: Each of the four abstracted solutions plays against each other in the main game</li> </ol> <p>After solving the game in the abstracted action space, we must allow for the abstracted player solution to play in the original game, where all actions are allowed.</p> <p>This requires the use of action translation.
As described above, the following translation has been shown to be less exploitable than other options:</p> <p>f_{A,B}(x) = (B-x)(1+A)/((B-A)(1+x))</p> <p>Matches between all combinations of the four abstractions and the full game were played over 10,000,000 hands each in duplicate: 5,000,000 hands were played once and then the same 5,000,000 were played again with the hands swapped, to reduce variance.</p> <p>The following table lists the results from the perspective of the row abstraction in terms of big blinds per 100 hands (i.e., the number of \$2 amounts won per 100 hands):</p> <p><img src="../assets/section4/abstractions/royalabs.png" alt="Bet Abstraction Royal Hold'em" title="Bet Abstraction Royal Hold'em" /></p> <p>We see that the full agent dominates all abstracted agents, as expected. FCPA is slightly superior to the other agents against the full agent. FCPA also beats all of the other abstracted agents, which may be due to the more sophisticated ones becoming overfit to their abstracted games. FCHA loses to all abstractions, possibly because the half pot and all-in bets leave too wide a gap between the two actions.</p> <p>Unfortunately, the relatively small size of the game, intended to provide access to individuals without supercomputers, results in abstracted agents whose first-player preflop strategy virtually always either calls the big blind or goes all-in. While the full game agent plays a more mixed strategy that involves raising to a variety of sizes, a large proportion of these raises are near-all-in amounts, which are strategically essentially equivalent to an all-in bet (e.g., raising to 18 out of 20 chips on the first action).</p> <p>Waugh et al’s <a href="https://poker.cs.ualberta.ca/publications/AAMAS09-abstraction.pdf">work on abstraction pathologies</a> shows that despite finer abstractions tending to perform better each year in the ACPC, these abstractions do not guarantee better results.
However, they do show that when the opponent plays the full game, an agent should improve monotonically as its abstraction grows larger, which we did not see in these results, although the winrates between the different abstractions are within only about 2% of each other. They showed that in abstraction vs. abstraction matches, monotonicity often does not hold, and agents could even become more exploitable. A theory is that providing additional strategies to a player can encourage the player to exploit the limitations of the opponent’s abstraction, resulting in a strategy that is more exploitable by actions that become available to the opponent in the full game.</p> <p>While these results show evidence of FCPA being the best abstraction amongst these choices, it may be necessary to run this experiment on a larger testbed game, one that may not be solvable on personal computers but that also does not require very specialized equipment. For example, even doubling the starting stacks so that each player begins with 20 big blinds instead of 10 would likely increase the strategic complexity significantly, although doing so would also significantly increase the game size, since the first action alone would then have 38 options (fold or raise to any amount between \$4 and \$40) instead of 18 (fold or raise to any amount between \$4 and \$20).</p>
<h1 id="background--history-of-solving-poker">Background – History of Solving Poker</h1> <p>There has been a rich literature of research and algorithms involving games and game theory, with poker specific research starting to grow in the late 1990s and early 2000s, especially after the creation of the Computer Poker Research Group (CPRG) at the University of Alberta in the early 1990s. Real life decision making settings almost always involve imperfect information and uncertainty, so algorithmic advances in poker are exciting with regards to the possible applications in the real world.</p> <p>Research accelerated, partially thanks to the founding of the Annual Computer Poker Competition (ACPC) in 2006, which has led to people from around the world competing to build the best poker agents, which has in turn led to significant research in algorithm development, abstraction, and game solving techniques. In recent years, we have also seen multiple “man vs.
machine” contests to test the latest poker agents against some of the best poker players in the world.</p> <p>After some major results from around 2015-2018, some researchers have moved on to even more complex imperfect information games like Hanabi.</p> <p>Note that it may make sense to read highlights from this section and come back to learn more about the history of certain algorithms when they are mentioned later in the tutorial.</p> <p>Some of the most important highlights from poker research have been:</p> <ol> <li>1998: Opponent Modelling: University of Alberta Computer Poker Research Group creates agent Poki – output (fold, call, raise) based on effective hand strength (Billings et al)</li> <li>2000: Abstraction Methods: Bucketing hands together (also bet sizes) (Shi, Littman)</li> <li>2003: Approximating Game Theoretic Optimal Strategies for full scale poker: Using abstraction on HULHE (Billings et al)</li> <li>2005: Optimal Rhode Island Poker (1 private card, 2 public cards, 3 betting rounds): Sequence Form Linear Programming to produce equilibrium player with only lossless card abstraction (Gilpin, Sandholm)</li> <li>2006: Beginning of the Annual Computer Poker Competition, which became a yearly part of the AAAI conference</li> <li>2007: The iterative Counterfactual Regret Minimization (CFR) algorithm was developed at the University of Alberta which allowed for solving games up to size 10^12 and solving for inexact solutions (Zinkevich et al)</li> <li>2015: Heads-up Limit Hold’em Poker is Solved, the first major poker game solved without abstractions (Bowling et al)</li> <li>2015: Brains vs. 
AI competition, the first major Heads Up No Limit poker competition where top humans defeated CMU’s poker agent</li> <li>2017: DeepStack Expert-Level Artificial Intelligence in No-Limit Poker, likely the first agent that was superior to humans in No Limit Hold’em (University of Alberta)</li> <li>2017: Superhuman AI for heads-up no-limit poker with the Libratus agent that overwhelmingly defeated top humans in a large scale competition (CMU)</li> <li>2019: Deep CFR, the first algorithm to use deep neural networks to implement CFR without the need for abstractions (Facebook)</li> <li>2019: Superhuman AI for multiplayer poker, the first algorithm to defeat top humans in multiplayer poker (Facebook)</li> </ol> <h2 id="early-poker-research-and-theories">Early Poker Research and Theories</h2> <p>The earliest signs of poker research began over 70 years ago. In the Theory of Games and Economic Behavior from 1944, John von Neumann and Oskar Morgenstern used mathematical models to analyze simplified games of poker and showed that bluffing is an essential component of a sound poker strategy, but the games analyzed were very basic.</p> <p>Harold W. Kuhn solved 1-card poker with 3 cards, also known as Kuhn poker, by hand, analytically, in 1950.</p> <p>The first major effort to build a poker program was made in the 1970s by Nicholas Findler. He designed a 5-card draw poker program as a method to study computer models of human cognitive processes in poker. The program was able to learn, but was not considered a strong player, despite 5-card draw being less complex than Texas Hold’em, in part because all cards in 5-card draw are private.</p> <p>In 1984, Mike Caro, now perhaps most famous for his “Book of Poker Tells”, a book that examines poker player behaviors that can give away, or tell, information about their hands, created a one-on-one no limit hold’em program called Orac.
It faced off against strong human opponents, but the matches were not statistically significant and were not well documented.</p> <p>Darse Billings, in his master’s thesis from 1995, adeptly noted that despite poker generally being considered very dependent on “human elements” like psychology and bluffing, perfect poker is based on probabilities and strategic principles, so Billings suggested focusing on the mathematics of poker and game theory before considering the human or rule-type systems.</p> <p>Given this insight, how have “feel” players who rarely consider advanced mathematical principles thrived in poker for decades? Doyle Brunson’s famous book, Super/System: A Course in Power Poker, from 1978, mainly prescribes a style of aggression, notably in cases where most players would be more passive. Billings notes that this advice was based on experience rather than mathematical derivation, but is valid from a mathematical perspective, because opponents are far more likely to make mistakes when faced with a bet or raise than when facing a check or call.</p> <p>The implication here is that despite many players not considering sophisticated mathematical and game theoretical principles, they can, over time, develop plays that make sense mathematically, learned through trial and error and recognition of patterns that work.
However, nowadays, many of the world’s best players do indeed use sophisticated statistical software to analyze their hands and their opponent’s hands, as well as simulation software to run mathematical simulations on common hand situations.</p> <h2 id="history-of-poker-agents">History of Poker Agents</h2> <p>The main types of poker agents have been <strong>knowledge based</strong> (using rules or formulas), <strong>simulation based</strong>, and <strong>game theoretical</strong>.</p> <p>Research in recent years has mainly focused on Texas Hold’em using game theoretical strategies and building agents that play as close as possible to the Nash equilibrium of the full game, while generally using either abstractions or deep learning or other methods for games beyond the size of Heads Up Limit Hold’em.</p> <p>An excellent (though now outdated) paper explaining the state of computer poker, called “<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.380.5127&amp;rep=rep1&amp;type=pdf">Computer poker: A review</a>” was published in 2010 by Jonathan Rubin and Ian Watson of the University of Auckland. Parts of this section follow their review closely for the time period leading up to 2010.</p> <h3 id="knowledge-based-agents">Knowledge Based Agents</h3> <p>Domain knowledge systems can be rule-based, which means that they use if-then logic, incorporating variables such as number of opponents, hand strength, position, and bets to call. Or they can be formula-based, which requires a numeric representation of hand strength and pot odds. The domain knowledge system then ranks the agent’s hand over all possible hands (ignoring the opponent’s hand likelihoods) and can be modified to also include hand potential (i.e., not only the current hand value, but the probabilistic hand value by the end of the hand).</p> <p>These exploitative agents focus on isolating weaknesses in the agent’s opponents. 
They must have an opponent model, generally represented as (fold, call, raise) probability triples together with histograms that bucket the opponent’s possible hands based on observed actions.</p> <p>David Sklansky and Mason Malmuth’s well regarded 1994 book Hold’em Poker for Advanced Players recommended a basic rule-based system for Limit Texas Hold’em, with variations according to betting position and game conditions. Most human players probably operate under some sort of rule-based system whether they are thinking about it or not. This is more obvious late in poker tournaments when most players are either going all-in or folding, so players generally have a rule-based system for which hands they’d go all-in with in which position and with knowledge of opponent actions. For example, with 10 big blinds or fewer, a good rule might be to go all-in with over 50% of hands from the small blind, though the rule could also account for other factors like the opponent in the big blind.</p> <p>In David Sklansky’s 1999 text, The Theory of Poker, Sklansky states his Fundamental Theorem of Poker: “Every time you play a hand differently from the way you would have played it if you could see all your opponents’ cards, they gain; and every time you play your hand the same way you would have played it if you could see all their cards, they lose. Conversely, every time opponents play their hands differently from the way they would have if they could see all your cards, you gain; and every time they play their hands the same way they would have played if they could see all your cards, you lose.”</p> <p>The individual hand vs. individual hand concept was then expanded by Phil Galfond in a 2007 article in Bluff Magazine that pioneered the idea of thinking of opponents’ hands in terms of a range of possible hands, not a particular hand.</p> <p>Quantitative Poker, a web blog, expanded on this and added the levels of range vs. range (level 2), strategy vs. 
strategy (level 3), and finally, strategy vs. inferred strategy distribution, which allows for the idea that the opponent may be employing a strategy drawn from a distribution of possible strategies.</p> <p>In 2001, Furnkranz identified opponent modeling as a neglected area of research within game playing domains and mentioned its importance in a game like poker. The risk with opponent modeling techniques that seek to exploit opponents (exploitative strategies) is that the agent then subjects itself to exploitation, especially if the model of the opponent is invalid or incorrect. Such a model is required for exploitative agents and can be either static, based on predefined opponent characteristics, or adaptable, based additionally on the opponent’s actions during the match. While opponent modeling has shown some promise in poker research, the most successful solutions now tend to use a more game theoretic approach, which limits their exploitability at the expense of maximizing value against weaker opponents. Perhaps once the game theoretic approach is exhausted, improving opponent modelling will be the next frontier.</p> <p>Rule-based systems generally show up preflop, where fixed rules make more sense given the limited possibilities of hand combinations. The University of Alberta Computer Poker Research Group created one of the first poker agents in 1999, called Poki, which took an effective hand strength as input and outputted a (fold, call, raise) probability triple, using a form of basic opponent modelling.</p> <p>Various programs were created in the late 1990s and early 2000s, including r00lbot, which, based on rules from the Hold’em Poker for Advanced Players book, ran 3000 trials simulating the hand to showdown against n random opponent holdings and then made decisions. Poki (and its earlier iteration, Loki) used opponent modelling and a formula-based betting strategy and was successful when tested in Internet Relay Chat (IRC). 
A later version of Poki took first place in the 2008 6-player Limit Hold’em tournament in the ACPC.</p> <p>The problem with such systems is that they require expert rules, and there are too many possible situations to have rules for all of them, which requires merging situations that are not necessarily similar enough. Also, the systems are generally not very scalable, because they cannot maintain enough rules for the game and therefore become exploitable. If you tried writing down rules for how a bot should play poker, it would quickly become overwhelming! That said, this was an exciting first approach and may work well in smaller game settings.</p> <h3 id="simulation-based-agents">Simulation Based Agents</h3> <p>Simulation agents can use a simulation like Monte Carlo Tree Search (MCTS – see Section 2.2 for more details) to predict action and hand rank probabilities for specific opponents. By predicting opponent strategies, the agent can develop its own optimal counter strategy.</p> <p>After the Alberta Computer Poker Research Group’s formula based Loki-1 agent, they created Loki-2, which added a simulation function to “better determine the expected value of check/call and bet/raise decisions”. They called this selective-sampling simulation since there is a selective bias when choosing a random sample at a choice node. Opponent hands were sampled according to a weight table rather than uniformly at random. This means that each possible hand the opponent could have would be assigned a weight (e.g. 55 0.03, 88 0.05, JT 0.08, etc.), allowing the Loki-2 agent to make decisions accordingly. In the late 1990s, the Alberta group recommended this as a general framework for solving games with imperfect information. These changes made the second iteration of this agent earn 0.05 small blinds per hand more on average than the first version.</p> <p>The next version of this agent was called Poki, which was split into a formula based model and a simulation based model. 
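The weight-table idea can be sketched as a weighted equity estimate. All hands, weights, and equities below are hypothetical illustrations, not values from Loki-2:

```python
# A hypothetical weight table: each candidate opponent holding gets a
# weight reflecting how consistent it is with the betting observed so far.
weight_table = {"55": 0.03, "88": 0.05, "JT": 0.08, "AK": 0.20}

# Hypothetical equities of our hand against each candidate holding.
equity_vs = {"55": 0.55, "88": 0.45, "JT": 0.70, "AK": 0.35}

def weighted_equity(weights, equities):
    """Estimate our equity as a weighted average over the opponent's
    candidate holdings, normalized by the total weight."""
    total = sum(weights.values())
    return sum(weights[h] * equities[h] for h in weights) / total

print(round(weighted_equity(weight_table, equity_vs), 3))  # 0.458
```

An agent like Loki-2 could then compare such an estimate against pot odds when evaluating check/call and bet/raise decisions.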
The simulation based model was the more advanced one, using more recent research techniques, but it did not really outperform the formula based model. The simulation model strangely seemed to play overly tight in one on one situations and overly loose in full table situations (strong poker players are usually exceptionally loose in one on one situations, playing almost all hands, and quite tight at full tables). The Alberta team realized that decision making based on the expected holdings of the opponent may not be the best technique.</p> <p>In 2008, a simulation-based agent called AKI Realbot participated in the Annual Computer Poker Competition (ACPC) and took second place. It used an exploitative strategy to play more hands against weak players, which it defined as players who had lost money overall in the prior 500 hands. Most of its gains came from exploiting these weak opponents, a strategy also commonly seen in real life poker, where expert players target weak opponents and try to play as many hands as possible with them.</p> <p>In 2009, Van den Broeck created an agent using MCTS for an NLHE game, using “an offline non-player specific opponent model by learning a regression tree from online poker data, which were used to predict actions and hand rank probabilities for opponents”. The agent worked in no limit poker games and seemed to perform well against basic rule-based opponents.</p> <h3 id="game-theoretic-equilibrium-solutions">Game Theoretic Equilibrium Solutions</h3> <p>We have continued to see interesting applications with games and AI succeeding in the real world, including Deep Blue famously beating Kasparov in chess in 1997, Watson defeating Jennings in Jeopardy in 2011, and most recently AlphaGo defeating Lee Se-dol in Go in 2016. 
These major results, however, differ from the trends in poker solving, since most work in poker now involves equilibrium strategies that are unexploitable, rather than building AI whose goal is to defeat or exploit individual opponents (although AlphaGo was a robust agent that did not target a specific opponent), and also because poker has imperfect information. In recent years we have seen major results using game theoretic approaches, which are now almost exclusively used in poker solving.</p> <p>In this section, we look at the history of linear programming game theoretic solution techniques, while saving the more recent CFR game theoretic techniques for the next section.</p> <p>In 1999, Selby produced equilibrium strategies for the preflop betting round of Texas Hold’em. This was done by effectively assuming that the hand ended after the preflop betting round and that the five community cards would simply be revealed with no further betting. Even though there are 1326 (52 choose 2) possible starting hands in poker, this can be reduced to 169 strategically distinct hands, since preflop only the ranks of the two cards and whether or not they share a suit are relevant.</p> <p>In 2000, Shi and Littman produced a game theoretic agent for a one on one game called Rhode Island Hold’em by using abstractions to bucket together hands and bet sizes. This was a major result because even though the game is a toy game with three betting rounds and one private card per player, this was effectively the start of the game theoretic trend and the trend of using abstractions to solve large games. This agent was able to beat a range of rule-based opponents.</p> <p>In 2003, the research community began to focus on Heads Up Limit Hold’em (HULHE) as a challenge problem, which is when the CPRG at Alberta first attempted to create an approximate solution for the game using abstraction, rather than sequence form linear programming (SFLP). Billings et al. 
created PsOpti, which focused on near-equilibrium solutions for full two player HULHE, a game with 10^18 game states.</p> <p>They knew that the sequence form linear programming techniques of Koller were not sufficient to solve such a large game, so abstractions were needed, which reduced the game to about 10^7 game states by using bucketing abstractions and eliminating betting rounds. They also built a preflop model and multiple postflop models, since they couldn’t create one postflop model given the size of the game. In practice, the same preflop model would always be used, and then a postflop model would be chosen according to what happened preflop.</p> <p>A combination of PsOpti agents was named Hyperborean and won the first AAAI Computer Poker Competition in 2006, although the PsOpti agent was later found not to be a great equilibrium approximation, since it was defeated by a wide margin by later Nash approximation strategies.</p> <p>The combination of preflop and postflop models was later identified as problematic since this does not guarantee a coherent equilibrium strategy (since there may be multiple equilibria). However, in 2009, Waugh showed that separate subgames can be pieced together as long as they are “tied together by a common base strategy”.</p> <p>The Abstraction-Solving-Translation approach requires that a full-scale poker game be modeled by a smaller, solvable game. The abstracted game is then solved to find an approximate Nash equilibrium, which is then translated to choose actions in the full game. This technique is now very common in game theoretic solution techniques in the poker space. In general, SFLP can solve games up to about 10^8 info sets, so most games require more abstraction than possible through lossless (i.e. 
card isomorphism in poker – when different sequences of cards are actually the same strategically) abstraction alone.</p> <p>In 2005, Gilpin and Sandholm, after applying abstractions, used SFLP to solve Rhode Island Hold’em completely, which has 3.94x10^6 info sets after symmetries are removed.</p> <p>In 2006, Gilpin and Sandholm created the GS1 agent to find game theoretic solutions to HULHE. They used a new technique called GameShrink to abstract the game tree automatically and then used offline solving for the preflop and flop betting rounds and real time solving for the turn and river betting rounds. They later improved this with a GS2 iteration that refined the abstraction technique with k-means clustering.</p> <p>The game tree in NLHE can have up to around 10^161 nodes (depending on the stack sizes), significantly larger than in LHE, where it has only about 10^18 nodes. Aggrobot was created in 2006 and is similar to PsOpti, but was made to solve NLHE and adds the use of betting abstractions, which meant that the agent could bet only half pot, full pot, double pot, and all-in. While the abstraction itself is relatively straightforward, translating opponent bets to the best estimate within the abstract game was a challenge, because a poor translation can lead to easy exploitation (e.g. by betting between the abstracted sizes).</p> <p>Miltersen and Sørensen, in 2007, computed Nash equilibrium solutions for two players at the end of a tournament with 8000 chips each and 300-600 blinds, where actions were restricted to all-in or fold only. They showed that this simple all-in or fold strategy approximated Nash equilibrium for the full tournament setting (probably because with only about 13 big blinds, going all-in or folding is generally the best play). 
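As a sketch of why such restricted games reduce to simple thresholds, the calling player’s chip-EV break-even point is just a pot-odds calculation (ignoring tournament payout considerations; the numbers match the 8000-chip, 300/600 setting above, and the function name is our own):

```python
def required_calling_equity(stack, big_blind):
    """Chip-EV equity the big blind needs to call an all-in jam,
    assuming both players start the hand with `stack` chips."""
    to_call = stack - big_blind   # chips the big blind must add to call
    final_pot = 2 * stack         # total pot if the call is made
    return to_call / final_pot

# With 8000-chip stacks and a 600 big blind, the caller needs
# 7400 / 16000 = 46.25% equity against the jamming range.
print(required_calling_equity(8000, 600))  # 0.4625
```

The equilibrium computation then amounts to finding jamming and calling ranges that are mutually consistent with thresholds like this one.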
Tournaments require hands to be evaluated in the context of the stage of the tournament, not just the expected value of chips won on a particular hand. Most poker research has avoided these complicating factors by focusing on cash games, in which each hand is independent of the next, and generally simplifies things even further by resetting stack sizes to the same amount for each hand, rather than carrying over effective stack sizes smaller than the default (which would entail using a more complex agent).</p> <p>In 2006, Gilpin, Hoda, Pena, and Sandholm developed Scalable EGT (Excessive Gap Technique), which was improved in 2007 by Gilpin, Sandholm, and Sorensen. EGT works to solve optimization problems like linear programming, but allows for suboptimality and therefore faster solving. This was used to solve Rhode Island Hold’em with 3x10^9 histories.</p> <p>Also in 2007, Counterfactual Regret was developed by Zinkevich et al., which revolutionized poker research and will be explored in detail in this tutorial.</p> <p>The linear programming (LP) algorithms in 1995 were capable of solving games with around 10^5 nodes in the game tree. This grew in 2003 to 10^6, in 2005 to 10^7, and then in 2006 to 10^10 with the development of EGT. Since 2007, both EGT and CFR have been able to solve games in full with about 10^12 nodes.</p> <h4 id="iterative-algorithms">Iterative Algorithms</h4> <p>Fictitious play involves two players who “continually play against each other and adapt and improve over time”. Each player starts with an arbitrary strategy, repeatedly best-responds to the opponent’s observed play, and keeps updating the strategy, which results in approaching a Nash equilibrium as iterations increase.</p> <p>In 2006, Dudziak used fictitious play to develop an agent called Adam for HULHE with abstractions. 
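Fictitious play can be illustrated on a tiny zero-sum matrix game; matching pennies stands in here for the abstracted poker games actually used, and the empirical action frequencies approach the uniform Nash equilibrium:

```python
import numpy as np

# Matching pennies payoffs from the row player's view: row wins on a match.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def fictitious_play(iters=100_000):
    """Each player repeatedly best-responds to the opponent's empirical
    action frequencies, then adds that action to its own history."""
    counts = [np.ones(2), np.ones(2)]  # one fictitious initial play of each action
    for _ in range(iters):
        x = counts[0] / counts[0].sum()      # row player's empirical mixture
        y = counts[1] / counts[1].sum()      # column player's empirical mixture
        counts[0][np.argmax(A @ y)] += 1     # row best-responds to y
        counts[1][np.argmax(-(x @ A))] += 1  # column best-responds to x
    return counts[0] / counts[0].sum(), counts[1] / counts[1].sum()

x, y = fictitious_play()
print(x, y)  # both close to [0.5, 0.5], the Nash equilibrium
```

In poker agents like Adam the same loop runs over abstracted game strategies rather than a 2x2 payoff matrix, but the convergence argument is the same.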
In 2010, Fellows developed other fictitious play agents called INOT and Fell Omen 2, which respectively took 2nd of 15 and 2nd of 9 in the HULHE instant runoff competitions in 2007 and 2008 in the ACPC.</p> <p>In 2007, the Range of Skill algorithm was developed at Alberta by Zinkevich. This is an iterative procedure that creates a “sequence of agents where each next agent employs a strategy that can beat the strategy of the prior agent by at least some epsilon amount”. The agents would then eventually approach a Nash equilibrium. This technique led to the creation of SmallBot (uses three bets per round) and BigBot (uses four bets per round), which performed well compared to prior agents in HULHE.</p> <h4 id="other-algorithms">Other Algorithms</h4> <p>Case based reasoning effectively stores “cases” of knowledge of previous hands and then, when similar situations come up again, uses those to determine how to play. Agents are trained on hand histories of other agents or players and output the betting decision from the most similar previously encountered case. It would seem difficult to outperform the agents being copied with this technique, but the method shows promise when using data from multiple top agents, since this could make its play more unpredictable and perhaps more robust: if one agent had a particular leak, it would be countered by cases from other agents.</p> <p>Bayesian agents were attempted, but achieved poor results in the ACPC in 2006 and 2007.</p> <p>Major challenges with agents in general are the creation of “accurate opponent models in real time with limited data, dealing with multiple opponents, and dealing with tournaments that have a variable structure”, including changing blinds as the tournament progresses. Only very recently, around 2019, have we started to see solutions for multiple opponents, while tournaments and variable structures have not yet been studied in academia. 
Accurate opponent models may be an interesting future research domain, but recent research has focused heavily on game theoretic solutions.</p> <h3 id="exploitative-strategies">Exploitative Strategies</h3> <p>As we have discussed, a Nash equilibrium strategy is “static, robust, and limits its own exploitability by maximizing a minimum outcome against a perfect opponent”. A true Nash equilibrium cannot be outplayed, but it also will not exploit weaknesses in opponents. Exploiting weaknesses requires opponent modelling, which means deciding actions based on some opponent characteristic, and this opens up the agent itself to exploitability as a consequence.</p> <p>In 2004, the CPRG created HULHE agents Vexbot and BRPlayer, which use game trees to build opponent models from observed hands and generalize those observations to other similar situations at various levels of similarity; if no match is found at the closest similarity level, the agent attempts to find one at the next level. The difference between the two agents is simply that BRPlayer uses five similarity levels and Vexbot uses only three.</p> <p>Robust counter-strategies offer a compromise between exploiting an opponent and minimizing one’s own exploitability. For example, an $$\epsilon$$-safe best response is the strategy, from the set of strategies exploitable for no more than $$\epsilon$$, that maximizes utility against a particular opponent. The Restricted Nash Response (RNR) algorithm takes a target opponent’s strategy and a parameter $$p \in [0, 1]$$ which trades off between minimizing exploitability and exploiting the opponent. CFR is then used to solve the game, fixing the player’s strategy to the input strategy with probability $$p$$ and allowing the player to choose their own actions with probability $$(1 − p)$$. 
When $$p = 0$$, the agent uses equilibrium strategies, and when $$p = 1$$, it attempts to use a fully exploitative strategy.</p> <p>Since Restricted Nash Response can only be implemented if given the opponent’s strategy, Data Biased Response (DBR) is an option to create robust counter-strategies. It works by choosing an abstraction for an opponent model, mapping the real game observations of the opponent into the abstract game, and determining frequency counts of the observed actions. Then, as with the RNR algorithm, a modified game is solved in which one player plays based on the opponent model with a probability determined at each information set (not at the root), and the other player converges to a robust counter-strategy. The Hyperborean agent took 2nd in the two player limit bankroll competition in 2009 using DBR.</p> <p>Another strategy is using Teams of Agents, in which the agent selects which strategy to use from a group of exploitative strategies depending on the opponent, using the UCB1 algorithm for the policy selection procedure. This is described as like having a “coach” that determines which “player” to put in against certain opponents to maximize the chance of winning. Johanson showed that this results in an agent that is able to achieve a greater profit against a set of opponents, compared to the use of a single $$\epsilon$$-Nash equilibrium strategy against the same opponents.</p> <p>Finally, there is frequentist best response, which builds an offline model of an opponent by observing training games, assuming that the opponent’s strategy is fixed and using a default policy for gaps in the observed data, and then plays a best response to this modelled strategy.</p> <p>Ganzfried and Sun explored the idea of opponent exploitation in an early 2017 paper. 
They note that opponent exploitation is crucial in imperfect information games and has much wider use than Nash equilibrium, which is only valid in two-player zero-sum games.</p> <p>They propose opponent modelling based on a Bayesian model. The model starts with an equilibrium strategy and then adjusts to the opponent’s weaknesses (deviations from equilibrium) as more information about his play is revealed: the opponent is assumed to begin with an equilibrium strategy, and the agent plays an approximate best response to that strategy, which is updated as more information about the opponent is learned.</p> <p>They had strong results with this method against weak ACPC agents and trivial opponents that always call or play randomly.</p> <p>One major problem with opponent modelling in poker is that most hand data does not include private hole cards unless they are shown at the end of a hand, which means the hands that are seen tend to be stronger, since, for example, bluffs would not be seen as frequently (players only have to reveal their hole cards if they bet first or called a bet and are winning; otherwise they can throw them into the “muck”). Counter-strategies such as building up a best response model offline against a particular opponent are possible, but may require a very large hand sample on the opponent before there is enough data to effectively learn their strategy. This strategy is also risky because it could lead to significant exploitation against other opponents or an opponent who changes his strategy, as most do. 
Finally, on-the-fly opponent modelling is difficult because sample sizes tend to be quite low, so only very few information sets would be seen even over thousands of hands.</p> <p>The advantage with opponent modelling is that an agent could rotate between playing a game theory optimal strategy and using opponent exploitation, where the exploitation would be used only against perceived weaker opponents with a sufficient amount of data on their play. In fact, this is how many professional human players attempt to play currently.</p> <h3 id="counterfactual-regret-minimization-cfr">Counterfactual Regret Minimization (CFR)</h3> <p>CFR grew out of the Annual Computer Poker Competition that began in 2006 as a way to build a more competitive agent. Counterfactual regret was introduced in the 2007 paper by Zinkevich et al. called Regret Minimization in Games with Incomplete Information and has continued to this day to be the primary computer poker research focus and the basis of the majority of the best poker agents and algorithms that exist.</p> <p>The CPRG developed two successful agents based on CFR: Hyperborean to compete in the ACPC and Polaris to compete in Man vs. Machine poker events. The Man vs. Machine events are covered in the AI vs. Human Competitions Section 5.2.</p> <p>In CFR, the memory needed is linear in the number of info sets (whereas for linear programming it is quadratic). An approximate Nash equilibrium is reached by averaging player strategies over all iterations, and the approximation improves as the number of iterations increases. Since solving games is usually bounded by memory constraints, CFR represents an improvement over sequence form linear programming similar to the improvement sequence form provided over normal form linear programming. CFR is guaranteed to converge in two-player zero-sum games and has worked well experimentally even in games that are not two-player zero-sum. 
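The regret-matching update at the heart of CFR can be sketched in a minimal form. Here a single learner faces a fixed rock-paper-scissors mixture instead of a full poker game, and the updates use exact expected utilities; its average strategy converges to the best response (always paper against a rock-heavy opponent):

```python
import numpy as np

# Rock-paper-scissors payoffs from the learner's point of view.
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])

def regret_matching(opp, iters=1000):
    """Average strategy of a regret-matching learner against a fixed
    opponent mixture `opp`, using expected-utility updates."""
    regrets = np.zeros(3)
    avg = np.zeros(3)
    for _ in range(iters):
        pos = np.maximum(regrets, 0.0)
        # Play proportionally to positive regret; uniform if none is positive.
        strat = pos / pos.sum() if pos.sum() > 0 else np.ones(3) / 3
        avg += strat
        utils = A @ opp                   # expected utility of each pure action
        regrets += utils - strat @ utils  # regret of not having played each action
    return avg / iters

# Against an opponent who over-plays rock, the average strategy
# concentrates on paper, the best response.
print(regret_matching(np.array([0.5, 0.25, 0.25])))
```

In full CFR this update is applied at every information set using counterfactual values; the sketch shows only the core update rule.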
CFR has been used to solve games with as many as 3.8x10^10 information sets in 2012.</p> <h2 id="evolution-in-measuring-agent-quality">Evolution in Measuring Agent Quality</h2> <p>The main ways to measure the quality of a poker agent are (1) against other poker agents, (2) against humans, and (3) against a “best response” agent that always plays the best response, to see how well the agent does in the worst case.</p> <h3 id="measuring-best-response">Measuring Best Response</h3> <p>2-player Limit Texas Hold’em Poker has about 10^18 game states, so until 2011, computing a best response was thought to be intractable, because a full game tree traversal would take 10 years to compute at 3 billion states per second. Strategies could still be computed by using abstract games, but a true best response was out of reach.</p> <p>In 2011, Michael Johanson et al. developed an <a href="http://www.cs.cmu.edu/~waugh/publications/johanson11.pdf">accelerated best response</a> computation that made it possible to measure a Heads Up Limit Hold’em strategy’s approximation quality by efficiently computing its exploitability. This was able to show that exploitability had dropped drastically from 2006, when it stood at about 330 milli big blinds per hand (mbb/h) using a Range of Skill program, to 235 in 2008 and 135 in 2010, and it continued to drop until CFR+ in 2014, which solved HULHE directly without abstraction to an exploitability of less than 1 mbb/h.</p> <p>The paper was important because it was able to evaluate agents that had been used in competitions in terms of their exploitability. 
They showed that the 2010 competition winner was significantly more exploitable than agents that it slightly defeated.</p> <p>Another important insight is that increasing abstraction size in toy domains does not guarantee improvement, but in large games like Texas Hold’em, increasing abstraction size seems to provide a consistent, though diminishing, improvement.</p> <p>Finally, an experiment was done to show that the technique of minimizing exploitability in an abstract game in order to use that strategy in the main game can have some interesting overfitting consequences. The exploitability in the main game tended to decrease (a good thing) initially, but then, as iterations continued to increase, the exploitability in the main game would increase. That is, the strategy improves in the abstracted game while worsening in the main game.</p> <p>The paper concludes that finer abstractions do produce better equilibrium approximations, but better worst-case performance does not always result in better performance in a tournament.</p> <p>In 2013, Johanson et al. created CFR-BR, which computes the best Nash approximation strategy that can be represented in a given abstraction. This additional evaluation method compares the representation power of abstractions by how well they can approximate an unabstracted Nash equilibrium. Using CFR-BR, Johanson showed that imperfect recall abstractions tend to perform better than their perfect recall counterparts, despite the theoretical issues.</p> <p>In 2014, Sandholm and Kroer introduced a framework to give bounds on solution quality for any perfect-recall extensive-form game. It uses a newly created method for mapping abstract strategies to the original game and a new equilibrium refinement for analysis. This resulted in the first general lossy extensive-form game abstraction with bounds. 
It finds a lossless abstraction when one is available and a lossy abstraction when smaller abstractions are desired, and it has since been extended to imperfect recall.</p> <p>In 2017, Lisy and Bowling introduced a new method called Local Best Response to compute approximate lower bounds on the exploitability of a strategy. These can be used even in large NLHE games like those played in the ACPC. The algorithm computes the approximation by looking one action ahead and assuming that players will check until the end of the game after this action. The tests against recent successful ACPC agents were quite surprising: even though the results were quite close when the agents played against each other, they were shown to be extremely exploitable, even more so than if they had simply folded every hand!</p> <h2 id="poker-abstractions">Poker Abstractions</h2> <p>The main ways to create a smaller game from a larger one are to merge information sets together (card abstraction), to restrict the actions a player can take (action abstraction), to use imperfect recall, or a combination of the three. Other possibilities are reducing the number of betting rounds or bets allowed per round, or modifying the game itself, for example to use a smaller deck size or smaller starting chip size. In 2012, Lanctot developed theoretical bounds on bucketing abstraction that show that CFR on abstractions leads to bounded regret in the full game. In 2014, this was expanded by Kroer and Sandholm to give bounds on solution quality for any perfect recall extensive form game. The framework can find abstractions, both lossy and lossless, given a specific bound requirement.</p> <p>In the early 2000s, lossy abstractions were created by hand based on game-specific knowledge, but automated abstraction techniques have since advanced experimentally. 
The main ideas are:</p> <p>1) Using integer programming to optimize the abstraction, usually within one level of the game at a time</p> <p>2) Potential-aware abstraction, where information sets of a player at a given level are based on a probability vector of transition to state buckets at the next level</p> <p>3) Imperfect recall abstraction, such that a player forgets some details that he knew earlier in the game</p> <p>Most abstractions now use numbers 2 and 3 above and divide the game across a supercomputer for equilibrium finding computation.</p> <p>Sandholm states that abstractions tend to be most successful when starting with a game theoretical strategy and modifying it based on the opponent.</p> <h3 id="abstraction-translations">Abstraction Translations</h3> <p>Abstraction translations are mappings from an opponent action to an action within the abstracted framework of the game. The goal of these reverse mappings, or translations, is to minimize the player’s exploitability, while ideally also exploiting the abstraction choices of other players. In the mappings below, A and B are the abstraction’s bet sizes that bracket an observed bet x (with A ≤ x ≤ B), and for the randomized versions, $$f_{A,B}(x)$$ denotes the probability of mapping x to A.</p> <p>Early abstraction translation methods included:</p> <ul> <li>Deterministic arithmetic: Map x to whichever of A or B is closer. This is highly exploitable, for example, by strong hands that bet slightly closer to A and therefore can be responded to as if they were much weaker. This was used in the 2007 ACPC by Tartanian1 and lost to an agent that didn’t even look at its cards!</li> <li>Randomized arithmetic: $$f_{A,B}(x) = \frac{B-x}{B-A}$$. This is exploitable when facing bets close to 1/2 of the pot. This was used by AggroBot in 2006 in the ACPC.</li> <li>Deterministic geometric: Map to A if A/x &gt; x/B and to B otherwise. This was used by Tartanian2 in the 2008 ACPC.</li> <li>Randomized geometric 1: $$f_{A,B}(x) = \frac{A(B-x)}{A(B-x) + x(x-A)}$$. 
This was used by Sartre and Hyperborean in the ACPC.</li> <li>Randomized geometric 2: $$f_{A,B}(x) = \frac{A(B+x)(B-x)}{(B-A)(x^2+AB)}$$ was used by Tartanian4 in the 2010 ACPC.</li> </ul> <p>In 2013, Ganzfried and Sandholm developed the following pseudo-harmonic translation, as explained in the section on Abstractions:</p> <p>$$f_{A,B}(x) = \frac{(B-x)(1+A)}{(B-A)(1+x)}$$</p> <p>The new mapping was tested by “rematching” the agents from the 2012 ACPC, keeping the Tartanian5 agent the same except for revising its mapping. The new mapping performed well, but interestingly it performed worse than the deterministic arithmetic mapping that simply maps the bet to the closest abstraction. However, the deterministic mapping leaves agents open to exploitation that perhaps was not acted upon in prior years, but could be in the future.</p> <h3 id="asymmetric-abstractions">Asymmetric Abstractions</h3> <p>The standard approach when abstracting is to use symmetric abstraction, assuming that all agents distinguish states in the same way. Agents in Texas Hold’em have been shown to perform better in head-to-head competitions and to be less exploitable when using finer-grained abstractions, although there are no theoretical guarantees of this. Until Bard et al.’s <a href="https://webdocs.cs.ualberta.ca/~games/poker/publications/2014-aamas-asymmetric-abstractions.pdf">2014 paper</a>, all research was done by examining only symmetric abstractions.</p> <p>The choice of using asymmetric abstractions could affect both “one on one performance against other agents and exploitability in the unabstracted game”. 
By examining a number of different abstraction combinations in Texas Hold’em, a few main conclusions were drawn:</p> <ul> <li> <p>“With symmetric abstractions, increasing the abstraction size results in improved utility against other agents and improved exploitability”</p> </li> <li> <p>“Smaller abstractions for ourselves, while our opponent uses larger abstractions, tended to result in improved exploitability, but decreased one on one mean utility”</p> </li> <li> <p>“Larger abstractions for ourselves, while our opponent uses smaller abstractions, tends towards our exploitability worsening and our one on one utility improving, leading to the conclusion to want to increase abstractions when creating an agent for a one on one competition”</p> </li> </ul> <p>We see that the goals of minimizing exploitability and increasing one on one utility can be at odds with each other, so the agent designer must make abstraction decisions based on his goals and beliefs about other agents. A non-poker example given is that if worst case outcomes could result in people being injured or killed, the only goal may be to improve worst case performance.</p> <h3 id="abstractions-are-not-necessarily-monotonically-improving">Abstractions are Not Necessarily Monotonically Improving</h3> <p>Although logic would suggest that finer abstractions result in superior agents, Waugh et al. showed in 2009 that this is not necessarily true. However, they did show that if one player uses an abstraction while the other plays in the unabstracted game, then the abstracted player’s strategies do monotonically improve as the abstractions get finer. As the annual poker competitions have advanced, the winners have empirically tended to be the teams that solved the largest abstracted games.</p> <p>An example with Leduc Hold’em is shown, in which a finer card abstraction can result in a more exploitable strategy.
They also tested betting abstraction and again found instances in which exploitability increased as the abstraction became finer. One theory presented is that providing additional strategies to a player can encourage the player to “exploit the limitations of the opponent’s abstraction”, resulting in a strategy that is more exploitable by actions that become available to the opponent in the full game.</p> <h3 id="earth-movers-distance-metric-in-abstraction">Earth Mover’s Distance Metric in Abstraction</h3> <p>In 2014, Ganzfried and Sandholm developed the leading abstraction algorithm for imperfect information games using k-means with the earth mover’s distance metric to cluster similar states together in <a href="https://www.cs.cmu.edu/~sandholm/potential-aware_imperfect-recall.aaai14.pdf">Potential-Aware Imperfect-Recall Abstraction with Earth Mover’s Distance in Imperfect-Information Games</a>.</p> <p>Standard clustering techniques merge information sets together in the abstracted game according to hand strength. A common metric is equity against a uniformly random hand, but this is problematic because hands with similar expected equity can have very different equity distributions. The example given is that KQ suited is very strong against some hands and weak against others, while a pair of sixes has most of its weight in the middle, meaning it is roughly 50% against most hands.</p> <p>Distribution aware abstractions group states together at a given round if their full distributions over future strength are similar, instead of just basing this on the expectation of their strength.
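</p>

<p>The distribution-aware comparison can be made concrete with a one-dimensional earth mover’s distance between equity histograms. Below is a minimal sketch with invented five-bin histograms (the real algorithm clusters millions of hands): two hands can have the same mean equity yet be far apart in distribution.</p>

```python
def emd_1d(p, q):
    """Earth mover's distance between two histograms over the same
    equally spaced, unit-width bins.  In 1-D the minimum cost of
    turning one pile into the other reduces to summing the absolute
    cumulative surplus carried from bin to bin."""
    assert len(p) == len(q) and abs(sum(p) - sum(q)) < 1e-9
    total, carry = 0.0, 0.0
    for pi, qi in zip(p, q):
        carry += pi - qi      # surplus dirt carried to the next bin
        total += abs(carry)   # moving it one bin costs |carry|
    return total

# Invented equity-vs-random histograms (5 bins from 0% to 100% equity):
# a medium pair concentrates mass in the middle, a suited broadway hand
# spreads out -- identical mean equity (50%), very different shapes.
pair_66   = [0.05, 0.15, 0.60, 0.15, 0.05]
suited_kq = [0.20, 0.20, 0.20, 0.20, 0.20]
print(round(emd_1d(pair_66, suited_kq), 6))  # 0.7 despite equal means
```

<p>An expectation-based metric would place these two hands in the same bucket; a distribution-based metric keeps them apart.</p>

<p>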
The goal of the earth mover’s algorithm is to build on this by looking not only at the strength distribution of the final round, but at trajectories of strength over all future rounds.</p> <p>The recommended method for computing distances between histograms of hand strength distributions is the earth mover’s distance, the “minimum cost of turning one pile into the other where cost is assumed to be the amount of dirt moved times the distance by which it moved”. This metric accounts for both the amount and the distance moved, not just the amount.</p> <p>The algorithm incorporates imperfect recall and potential aware abstraction (where the maximum number of buckets in each round is determined by the size of the final linear program to solve) using earth mover’s distance. We can find cases where two very different hands have similar equity distributions on the river but are extremely different on the turn. The potential aware abstraction will observe this difference based on the earth mover’s distance and place them into different buckets on the turn. This is especially valuable when hand strengths can change significantly between rounds – the paper notes that Omaha Hold’em could be a good fit.</p> <h3 id="simultaneous-abstraction-and-equilibrium-finding"><a href="https://www.cs.cmu.edu/~sandholm/simultaneous.ijcai15.pdf">Simultaneous Abstraction and Equilibrium Finding</a></h3> <p>In a 2015 paper, Brown and Sandholm show a method to combine action abstraction and equilibrium finding. An agent can start learning with a coarse abstraction and then add abstracted actions that seem useful, as determined by trying to minimize average overall regret. They showed that this converges to improved equilibrium solutions with no loss in computation time.</p> <h2 id="cfr-extensions">CFR Extensions</h2> <p>In recent years, nearly all poker research has involved optimizations and extensions of counterfactual regret minimization.
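</p>

<p>All of the variants below build on the same regret-matching rule at the heart of CFR: at each information set, play each action with probability proportional to its positive cumulative regret. A minimal sketch of just this rule (not a full CFR traversal):</p>

```python
def regret_matching(cum_regrets):
    """Convert cumulative regrets at an information set into a strategy:
    probability proportional to positive regret; fall back to uniform
    when no action has positive regret."""
    positives = [max(r, 0.0) for r in cum_regrets]
    norm = sum(positives)
    if norm == 0:
        return [1.0 / len(cum_regrets)] * len(cum_regrets)
    return [p / norm for p in positives]

print(regret_matching([10.0, -4.0, 30.0]))  # [0.25, 0.0, 0.75]
print(regret_matching([-1.0, -2.0]))        # uniform: [0.5, 0.5]
```

<p>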
Here we highlight five interesting developments from the past few years.</p> <h3 id="monte-carlo-cfr">Monte Carlo CFR</h3> <p>Marc Lanctot et al. of Alberta expanded the standard vanilla CFR method to include Monte Carlo sampling in 2009. The Monte Carlo methods matter because each vanilla CFR iteration requires a traversal of the entire game tree. Monte Carlo CFR (MCCFR) samples only part of the tree per iteration, so it requires more iterations, but each one tends to be so much faster that overall convergence is significantly quicker.</p> <p>In Lanctot’s thesis, he goes beyond poker and explores other interesting imperfect information games such as Latent Tic Tac Toe, a form of Tic Tac Toe in which players cannot see their opponents’ moves until after each round, and Princess and Monster, a pursuit-evasion game in which two players move around a “dark” grid where they cannot see each other and can only move to points adjacent to their current locations. Scoring is based on catching the princess as fast as possible.</p> <p>The primary game that Lanctot focuses on is called Bluff, where each player rolls private dice and looks at them without showing opponents. Each round, players alternate bidding on the outcome of all dice in play until one calls “bluff”, claiming that the bid is invalid. The loser of the challenge has to remove a number of dice between the amount bid and the actual amount in the game. If a player loses all of his dice, he loses the game.</p> <p>He shows that in these games and in poker, vanilla CFR is essentially always less efficient than Monte Carlo CFR methods.
A method called external sampling also tends to dominate chance sampling in his experiments.</p> <h3 id="cfr">CFR+</h3> <p>In early 2015, the Computer Poker Research Group from the University of Alberta announced that “Heads-up limit hold’em poker is solved”, a huge achievement in the research community and the poker community at large, since this was the first significant imperfect-information game played competitively by humans to be solved. Their article was published in the January 9, 2015 issue of Science and is discussed in detail in Section 4.4 CFR Advances. The team used a new version of CFR called CFR+, which can solve much larger games than prior CFR algorithms and also converges faster than CFR in both poker games and matrix games.</p> <p>The algorithm was originally developed by Oskari Tammelin, an independent researcher from Finland. The result was noticed in the research community and the poker community and was featured in many mainstream publications. Despite LHE’s popularity recently declining in favor of NLHE games, this was a very exciting announcement.</p> <p>The main change in CFR+ relative to CFR is that cumulative regrets are constrained to be non-negative, so an action that has looked bad (one whose cumulative regret would be negative) is chosen again immediately once it starts proving useful, instead of waiting many iterations for its regret to climb back above zero. Also, the final strategy used is the current strategy at that time, not the average strategy (as in vanilla CFR), and no sampling is used. Additionally, advances were made enabling further compression when storing the average strategy and regrets.</p> <p>Previously only perfect information games of this size, like checkers, had been solved, and despite this game being smaller than checkers, it was much more challenging due to the incomplete information.
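</p>

<p>The regret update change in CFR+ fits in a couple of lines; below is a sketch of just the two update rules, not the full algorithm:</p>

```python
def vanilla_update(cum_regret, instant_regret):
    """Vanilla CFR: cumulative regret can go deeply negative, so a
    once-bad action needs many iterations before being played again."""
    return cum_regret + instant_regret

def cfr_plus_update(cum_regret, instant_regret):
    """CFR+: cumulative regret is floored at zero, so an action that
    starts doing well is picked up again immediately."""
    return max(cum_regret + instant_regret, 0.0)

# An action that looks bad for a while and then becomes good:
r_vanilla, r_plus = 0.0, 0.0
for instant in [-5.0, -5.0, -5.0, 4.0]:
    r_vanilla = vanilla_update(r_vanilla, instant)
    r_plus = cfr_plus_update(r_plus, instant)
print(r_vanilla, r_plus)  # -11.0 4.0
```

<p>Under regret matching, the vanilla regret of -11 keeps the action unplayed for several more iterations, while the CFR+ regret of 4 brings it back into the strategy right away.</p>

<p>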
HULHE has 3.16x10^17 game states, which places it as larger than Connect 4 and smaller than checkers. There are 3.19x10^14 decision points, or information sets, where the state is indistinguishable based on the player’s information in the hand (reduced to 1.38x10^13 after removing game symmetries). The main considerations when solving a game of this size are memory and computation power.</p> <p>CFR+ was implemented on 200 computation nodes, each with 24 2.1-GHz AMD cores, 32 GB of RAM, and a 1 TB local hard disk. The solution came after 1,579 iterations in 69 days, using 900 core years of computation and 10.9 TB of disk space (the game without compression would have required 262 TB of space!), reaching an exploitability of 0.986 mbb/g, which required a full traversal of the game tree to determine.</p> <p>Exploitability is the amount less than the game value that the strategy achieves in expectation against a worst-case opponent strategy, and 1 mbb/g was determined to be a good threshold for the game being considered essentially solved. The reasoning is that a player exploiting the strategy at 200 games/hour, 12 hours/day, for 70 years, with a standard deviation of 5 bb/g per game, could still not distinguish his winnings from zero with 95% confidence (1.64 standard deviations), since the threshold works out to $$\frac{1.64 \times 5}{\sqrt{200 \times 12 \times 365 \times 70}} \approx 0.00105 \text{ bb/g} \approx 1 \text{ mbb/g}$$</p> <p>Bowling et al. describe their solution as weakly solved, meaning a strategy that achieves the game-theoretic value from the initial position in reasonable time (whereas ultra-weakly solved means only the game-theoretic value of the initial position is known, and strongly solved means an optimal strategy is determined for all positions).</p> <p>The main findings from the solution are that, as expected, the dealer has a substantial advantage because he acts last, giving him more information to work with while making decisions.
He is expected to win between 87.7 and 89.7 mbb/g.</p> <p>Other interesting characteristics of the strategy: it is rarely (0.06% of the time) good to just call preflop, meaning that it is almost always best to raise or fold. The program also rarely folds preflop as the non-dealer, who has already put in the big blind. Finally, as the dealer, who acts last in rounds after the preflop, the program rarely puts in the capped (4th) bet preflop, perhaps because it will have an advantage from acting last in later rounds, so keeping the pot smaller while less information is known may make sense. While other Nash equilibria could play differently, they would always achieve the same game value.</p> <p>Despite the significant growth in capabilities of solving computer poker games, NLHE games are generally still much too large to be solved unabstracted. NLHE has about 10^71 states, depending on the rules. The Royal NLHE simplified game has about 10^9 game states, solvable by most standard computers.</p> <h3 id="compact-cfr-and-pure-cfr">Compact CFR and Pure CFR</h3> <p>Although most CFR research tilted towards sampling optimization after 2009, Oskari Tammelin also developed the Pure CFR algorithm, which uses pure strategy profiles on the vanilla version of CFR.</p> <p>Pure CFR was described in Richard Gibson’s 2014 thesis. Since all of the utilities in poker are integers, all computations in Pure CFR can be done with integer arithmetic, which is both faster than floating-point arithmetic and allows the cumulative regrets and cumulative strategy profile to be stored as integers, reducing memory costs by 50%.</p> <p>Another CFR version was published in early 2016 by Eric Jackson in a paper called Compact CFR. This version uses follow-the-leader instead of regret matching, assigning the entire strategy probability at a node to the action with highest regret. Compact CFR is not a no-regret algorithm (i.e.
it loses theoretical guarantees), but the average strategy does converge to Nash equilibrium.</p> <p>Since all we need to know in Compact CFR is the action with the highest regret at each information set, the regrets can be stored as offsets from the highest regret, which itself is stored as 0. These offsets can be represented with bucketed unsigned data types, a form of compression that is only slightly worse than uncompressed regret storage. By also taking only the final strategy rather than the average strategy, every action at every information set can be represented by a single byte, whereas vanilla CFR requires 16 bytes and pure external CFR requires 8 bytes, a significant reduction in memory requirements.</p> <h3 id="strategy-purification-and-thresholding"><a href="https://www.cs.cmu.edu/~sandholm/StrategyPurification_AAMAS2012_camera_ready_2.pdf">Strategy Purification and Thresholding</a></h3> <p>In 2012, Ganzfried et al. proposed that instead of solving an abstract game for an equilibrium strategy and translating that strategy into the full game directly, we can first modify the abstract game equilibrium using procedures called purification and thresholding. The main idea is that these approaches provide a “robustness to the solutions against overfitting one’s strategy to one’s lossy abstraction”, and the results “do not always come at the expense of worst-case exploitability”.</p> <p>Purification means that the player always plays his highest-probability action at each information set with probability 1 (rather than sampling an action from his behavioral strategy distribution). This has similarities to the follow-the-leader idea in Compact CFR above. In the case of ties, the tied actions are played with uniform probability.
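</p>

<p>Both procedures are a few lines over a strategy vector; a sketch with a made-up three-action strategy (the fold/call/raise probabilities are illustrative, not taken from any solved game):</p>

```python
def purify(strategy, tol=1e-9):
    """Purification: put all probability on the most likely action,
    splitting uniformly among exact ties."""
    best = max(strategy)
    winners = [1.0 if best - p < tol else 0.0 for p in strategy]
    k = sum(winners)
    return [w / k for w in winners]

def threshold(strategy, eps=0.15):
    """Thresholding: zero out actions below eps, then renormalize the
    remaining probabilities."""
    kept = [p if p >= eps else 0.0 for p in strategy]
    norm = sum(kept)
    return [p / norm for p in kept]

abstract_equilibrium = [0.55, 0.35, 0.10]   # fold / call / raise (made up)
print(purify(abstract_equilibrium))     # [1.0, 0.0, 0.0]
print(threshold(abstract_equilibrium))  # raise pruned, rest renormalized
```

<p>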
Purification is useful because it compensates for the failure of equilibrium finding algorithms to fully converge in the abstract game.</p> <p>Thresholding is a more relaxed approach, which simply eliminates actions with probabilities below a threshold value $$\epsilon$$ and then renormalizes the remaining action probabilities. The logic is that such low-probability actions may be due to noise, or may be played primarily to protect a player from being exploited, which may be an overstated concern against realistic opponents. Humans in practice generally use thresholding when implementing a poker strategy, because it is not practical for humans to properly randomize over so many probabilities, so it makes sense to remove the very lowest ones to simplify.</p> <p>A study using the game Leduc Hold’em showed that for almost all abstractions, purification brought a significant improvement in exploitability. Thresholding was also beneficial, but purification performed better in the cases where it improved exploitability.</p> <p>Identical agents except for using either thresholding or purification were submitted to the ACPC, and the purification agent performed better against all opponents, including the thresholding agent. In terms of worst case exploitability, the least exploitable agent was the one using a thresholding level of 0.15 (meaning all actions below 15% were removed), perhaps because too much thresholding results in too little randomness while no thresholding at all results in overfitting to the abstraction.</p> <h3 id="decomposition"><a href="https://arxiv.org/abs/1303.4441">Decomposition</a></h3> <p>Decomposition, analyzing different subgames independently, has been a well-known principle in perfect information games, but it has been problematic in imperfect information games, where its use has meant abandoning theoretical guarantees. In 2014, Burch, Johanson, and Bowling proposed a technique that does “retain optimality guarantees on the full game”.
In perfect information games, “subgames can be solved independently and the strategy fragments created can be combined to form an optimal strategy for the entire game”.</p> <p>Decomposition can allow large savings in the memory required to solve a game. It also makes it possible to avoid storing the complete strategy, which may be too large, and instead store and recompute subgame strategies as needed.</p> <p>The authors show a method of using summary information about a subgame strategy to generate a new strategy that is no more exploitable than the original strategy. An algorithm called CFR-D, for decomposition, is shown to achieve “sub-linear space costs at the cost of increased computation time”.</p> <h3 id="endgame-solving-in-large-imperfect-information-games"><a href="https://www.cs.cmu.edu/~sandholm/Endgame_AAAI15_workshop_cr_1.pdf">Endgame Solving in Large Imperfect-Information Games</a></h3> <p>While the naive approach simply abstracts a game and uses translation to map back to the full game, more recent approaches tend to divide games into sequential phases, and Ganzfried and Sandholm show how to solve the endgame specifically with a finer abstraction.</p> <p>In 2015, Ganzfried and Sandholm modified the standard CFR abstraction solution method by keeping the initial portion of the game tree and discarding the strategies for the final portion, the endgames. Then in real time, they solve the relevant endgame that has been reached using a linear program, with a greater degree of accuracy than the initial abstract strategy.
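</p>

<p>The entry distribution into an endgame follows from Bayes’ rule applied to the precomputed strategy: the probability of an opponent hand is its prior times the probability that the precomputed strategy reaches the observed betting line with that hand, renormalized. A toy sketch with made-up hands and numbers:</p>

```python
def endgame_distribution(prior, reach):
    """Posterior over opponent private hands entering an endgame:
    P(hand | line) is proportional to P(hand) * P(line | hand), where
    the likelihood is the reach probability of the observed betting
    line under the precomputed strategy."""
    unnorm = {h: prior[h] * reach[h] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# Made-up example: both hands equally likely a priori, but the
# precomputed strategy takes this betting line far more often with AA.
prior = {"AA": 0.5, "72o": 0.5}
reach = {"AA": 0.9, "72o": 0.1}   # P(observed line | hand)
print(endgame_distribution(prior, reach))
```

<p>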
Bayes’ rule is used to compute the distribution of player private information leading into the endgames from the precomputed strategies for the initial part of the game.</p> <p>Another benefit of this method is that “off-tree” problems are solved: cases in which the opponent’s action is not allowed in the abstraction are actually solved exactly in the endgame.</p> <p>The problem with endgame solving is that the Nash equilibrium guarantees are no longer valid. Combining an equilibrium strategy from the early part of the game with one from the end of the game can produce a mismatched pair, since games tend to have multiple distinct equilibria.</p> <p>Although endgame solving can lead to highly exploitable strategies in some games, it is shown to have significant benefits in large imperfect information games, especially games where the endgame is important, which is the case in all poker games, and especially no limit hold’em.</p> <p>This technique showed improved performance against the strongest agents from the 2013 ACPC. Having finer abstractions in the endgame seems intuitively very valuable, since this tends to be where it is possible to narrow down opponent hand ranges and where the largest bets tend to be made, as the pot gradually builds up over the betting rounds.</p> <p>In 2017, Brown and Sandholm advanced the previous methods by using nested endgame solving in their Libratus agent in place of action translation in response to off-tree opponent actions.
This may have made the difference in defeating world class human opponents.</p> <h3 id="warm-starting-and-regret-pruning">Warm Starting and Regret Pruning</h3> <p>Also in 2015, Brown and Sandholm found in <a href="https://www.cs.cmu.edu/~noamb/papers/16-AAAI-Strategy-Based.pdf">Strategy-Based Warm Starting for Regret Minimization Games</a> that it is possible to warm start CFR from a predetermined strategy: with a single full traversal of the game tree, CFR is effectively warm started to as many iterations as it would have taken to reach a strategy profile of the same quality as the input strategies, and the convergence bounds are unchanged. By warm starting, CFR can bypass the early and expensive iterations that visit all nodes, even ones that would later be pruned.</p> <p>Brown and Sandholm developed a regret-based pruning (RBP) method in 2015 to temporarily prune actions with negative regret (for the minimum number of iterations that it would take for the regret to become positive in CFR) in the paper <a href="https://www.cs.cmu.edu/~noamb/papers/15-NIPS-Regret-Based.pdf">Regret-Based Pruning in Extensive-Form Games</a>.</p> <p>This process was shown to speed up CFR, and in 2016 they improved it with a new RBP version that can even reduce the space requirements of CFR over time by completely discarding pruned branches.</p> <h3 id="deep-learning">Deep Learning</h3> <p>Beginning in 2017 with the University of Alberta’s DeepStack, poker algorithms have been relying on deep neural networks as an alternative or complement to game abstractions. Noam Brown et al. published <a href="https://arxiv.org/abs/1811.00164">Deep Counterfactual Regret Minimization</a> in 2018, which was the first algorithm to successfully implement CFR with deep neural networks rather than the standard tabular format.
The network takes as input the exact cards and betting sequence and outputs values proportional to the regrets that tabular CFR would compute. The technique effectively leaves the neural network to do the abstracting, rather than requiring fixed abstractions built into the algorithm. We discuss this result in detail in the CFR Advances section. As computer processing capabilities have improved, using large neural networks has become an increasingly valuable method for approximately solving poker games.</p> <h3 id="superhuman-ai-for-multiplayer-poker">Superhuman AI for multiplayer poker</h3> <p>In 2019, Noam Brown and Tuomas Sandholm of Carnegie Mellon University released this paper with an agent called Pluribus that beat strong human players in six-handed poker. The agent outperformed the humans in two settings: (1) five copies of the agent at the table with one human and (2) five humans at the table with one agent.</p> <p>Because there is not a clear game-theoretic solution to a multiplayer game, the goal was to create an agent that could empirically outperform top human players. This is detailed in the Top Agents and Research section.</p> <h2 id="computer-poker-competitions">Computer Poker Competitions</h2> <h3 id="the-annual-computer-poker-competition">The Annual Computer Poker Competition</h3> <p>In 2006, the University of Alberta and Carnegie Mellon University jointly founded the Annual Computer Poker Competition (ACPC), which since 2012 has been held during the Poker Workshop at the annual Association for the Advancement of Artificial Intelligence (AAAI) conference.</p> <p>The competition has attracted both hobbyists and academics from around the world each year, although one complaint is that the supercomputer access available to academics may not be feasible for hobbyists.
Each match, played in either HUNL or HULHE, consists of 3,000 hands in which each player starts with 200 big blinds, using blind sizes of 50 and 100 and a minimum bet of 1 chip. Matches are played in duplicate: the same cards are given to each agent, then memories are cleared and the same match is played again with the cards dealt to the opposite player.</p> <p>There are two primary competition types. The first is instant run-off, where agents earn a point for each match won and lose a point for each match lost, which favors small wins and equilibrium solutions. The total bankroll competition counts the total winnings of each agent over all of its contests, which favors agents that are more exploitative.</p> <p>The general strategy in recent years has been for teams to develop algorithms that allow larger and larger games to be solved, meaning that finer grained abstractions can be used, which has generally correlated with stronger performance, though such a result is not theoretically guaranteed and counterexamples exist.</p> <p>MIT now even holds a mini-course called MIT Pokerbots in which students are given one month to program autonomous pokerbots to compete against other teams in a tournament. The contest receives heavy sponsorship from trading companies due to the similarities in dealing with imperfect information and making decisions based on probabilities and statistics.</p> <p>CFR and its variants have been the most common approach used in the ACPC recently. CFR was first seen in 2007 with Zinkevich and the University of Alberta CPRG, using imperfect recall abstraction. It was used by 2 of 11 agents in 2012, 5 of 12 in 2013, and 10 of 12 in 2014 (there was no competition in 2015, and the details of competitors in more recent competitions have not been released, except for the winners).
In 2013, 2014, and 2016, the top three agents in the bankroll and instant run-off competitions all used some form of CFR.</p> <p>Alberta’s Hyperborean won every limit hold’em run-off competition from 2006 to 2008 using an imperfect recall abstraction in CFR. In 2009 it was defeated by GGValuta from the University of Bucharest, which also used CFR, with a k-means clustering algorithm for card abstraction.</p> <p>Hyperborean again won in 2009 by fusing separate solutions from independent pieces of the game tree. Hyperborean also won the 2010 no limit run-off event.</p> <p>Hyperborean, Slumbot by Eric Jackson, and Tartanian from Carnegie Mellon have consistently had excellent results since this time.</p> <p>We will now briefly go over some of the winners and their techniques from the most recent competitions.</p> <p>The 2014 winner was the Tartanian7 team from Carnegie Mellon. The program plays an approximate Nash equilibrium strategy that was computed on a supercomputer. They developed a new abstraction algorithm that clusters public flop boards based on how often their previous program grouped private hands together on the flop with different sets of public cards. Within each of the public flop clusters, the algorithm then buckets the flop, turn, and river hands that are possible given one of the public flops in the cluster, using imperfect-recall abstraction. They did not use any abstraction for the preflop round.</p> <p>They based their equilibrium finding algorithm on external sampling MCCFR, sampling one pair of preflop hands per iteration. Postflop, they sample community cards from their public clusters, run MCCFR in parallel, and weight the samples to remove bias.
They also used thresholding and purification, and made the interesting observation that it is valuable to bias towards conservative actions to reduce variance, since higher variance means that an inferior opponent is more likely to win.</p> <p>Slumbot, by Eric Jackson, uses Pure External CFR for equilibrium computation. He breaks the game tree into pieces that are solved separately and uses differing abstraction levels depending on how often the game states are reached. The more common states are given more granularity in both bucketing and possible bet sizes.</p> <p>Hyperborean (for the instant run-off competition), made by the Computer Poker Research Group at the University of Alberta, also uses Pure CFR, imperfect recall, and k-means card bucketing abstraction. Interestingly, they use the final strategy of the algorithm rather than the average strategy, which is the one proven to converge to equilibrium. They use an asymmetric betting abstraction in which the opponent can have more options than the agent, including actions such as minimum-betting.</p> <p>For the total bankroll competition, the program uses three distinct strategies and chooses one according to an algorithm. Two of the strategies are data-biased responses to aggregate data on competitors from 2011/12 and 2013, and the third strategy is similar to the instant run-off strategy, but also separates betting sequences into “important” and “unimportant” parts and creates more buckets for the important sequences. Importance is based on how often these sequences are seen in self-play.</p> <p>In 2016, only the no-limit hold’em total-bankroll and instant run-off competitions took place. The same three teams took the top three places in both competitions.</p> <p>Unfold Poker was trained by a distributed implementation of the Pure CFR algorithm and uses a heuristic to sometimes avoid certain game tree paths.
Certain bet sizes were omitted, and a distance metric that considers features from all postflop betting streets was used to construct the card abstraction on the river. Unfold Poker took 2nd place in the instant run-off event and 3rd in the total bankroll event.</p> <p>Slumbot took 1st in the instant run-off and 2nd in the total bankroll by using a new memory efficient CFR technique called Compact CFR, detailed in the CFR section above.</p> <p>Finally, Carnegie Mellon University’s Baby Tartanian 8 won the total bankroll competition and took 3rd place in the instant run-off event. Baby Tartanian 8’s main new feature was adding pruning to cut down the actions worth considering. They also used feedback from the 2015 man vs. machine competition, in which Tartanian 7 suffered a loss, to improve translations and do better endgame solving.</p> <p>Going forward, the event will feature six player games, which will accelerate research towards games that are commonly played by humans and present a new set of complexities, including whether the optimal approach is to aim for opponent exploitation or to continue on the unexploitable path, despite multiplayer games invalidating theoretical guarantees that hold in two-player zero-sum games. Based on the recent results from Pluribus, approximating equilibrium play via self-play appears to work well even in the multiplayer setting.</p> <p>The last workshop with the ACPC competition took place in 2018 with heads up no limit Texas Hold’em and six player no limit Texas Hold’em. Teams now have a maximum submission size, which helps even the field between teams with differing resources. Although there have been breakthroughs in recent years with Pluribus and other top agents, it could be interesting to see even better multiplayer agents.
However, it does seem that many academic groups have moved on from poker and hobbyists may prefer to use their algorithms commercially or privately, so the future of the ACPC remains to be seen.</p>