<!-- indifference, make worst hand indifferent, otherwise would have different worst hand --> <h1 id="game-theory--game-theory-foundation">Game Theory – Game Theory Foundation</h1> <p>Let’s look at some important game theory concepts before we get into actually solving for poker strategies.</p> <h2 id="game-theory-optimal-gto">Game Theory Optimal (GTO)</h2> <p>What does it mean to “solve” a poker game? In the 2-player setting, this means to find a <strong>Nash Equilibrium strategy</strong> (aka GTO strategy) for the game. By definition, if both players are playing this strategy, then neither would want to change to a different strategy, since neither could do better with any other strategy (assuming that the opponent’s strategy stays fixed).</p> <p>To break this down, if players A and B are both playing GTO, then both are doing as well as possible. If player A changes strategy, then player A is doing worse and player B is doing better. Player B could do <em>even better</em> by changing strategy to exploit player A’s new strategy, but then player A could take advantage of this change. If Player B stays put with GTO, then EV is not maximized, but there is no risk of being exploited. In this sense, GTO is an unexploitable strategy that gets a guaranteed minimum EV.</p> <p>With more than 2 players, equilibrium strategies still exist, but computing one is generally intractable and playing one no longer guarantees a minimum EV, so the best we can do is approximate a strong strategy. In practice, even in the 2-player setting, we have to approximate GTO strategies in full-sized poker games. 
We will go more into the details of what it means to solve a game in section 3.1 “What is Solving”?</p> <p>Intuition for this in poker can be explained using a simple all-in game where one player must either fold or bet all his chips and the second player must either call or fold if the first player bets all the chips. We’ll refer to these two players as the “all-in player” and the “calling player”. We assume each player starts with 10 big blinds. There are three possible outcomes:</p> <table> <thead> <tr> <th>Scenarios</th> <th>Player 1 (SB)</th> <th>Player 2 (BB)</th> <th>Result</th> </tr> </thead> <tbody> <tr> <td>Case 1</td> <td>Fold</td> <td>–</td> <td>P2 wins 0.5 BB</td> </tr> <tr> <td>Case 2</td> <td>All-in</td> <td>Fold</td> <td>P1 wins 1 BB</td> </tr> <tr> <td>Case 3</td> <td>All-in</td> <td>Call</td> <td>Winner of showdown wins 10 BB (pot size 20 BB)</td> </tr> </tbody> </table> <p>In this situation, the calling player might begin the game with a default strategy of calling a low percentage of hands. An alert all-in player might exploit this by going all-in with a large range of hands.</p> <p><strong>ICMIZER of this with EV</strong></p> <p>After seeing the first player go all-in very frequently, the calling player might increase the calling range.</p> <p><strong>ICMIZER of this with EV</strong></p> <p>Once the all-in player observes this, it could lead him to reduce his all-in percentage. Once the all-in range of hands and the calling range stabilize such that neither player can unilaterally change his strategy to increase his profit, then the equilibrium strategies have been reached.</p> <p>A <strong>strategy</strong> in game theory is the set of actions one will take at every decision point. 
In the all-in game, there is only one decision for each player, so the entire strategy is the range of hands to go all-in with for Player 1 and the range of hands to call with for Player 2.</p> <p>We can use the ICMIZER program to compute the game theory optimal strategies in a 1v1 setting where both players start the hand with 10 big blinds. In this case, the small blind all-in player goes all-in 58% of the time and the big blind calling player calls 37% of the time.</p> <p><strong>ICMIZER of this with EV</strong></p> <p>If either player changed those percentages, then their EV would go down! If the calling player called more hands (looser), those extra hands wouldn’t be profitable. If the calling player called fewer hands (tighter), then he would be folding too much. If the all-in player went looser, those extra hands wouldn’t be profitable, and if he went tighter, then he would be folding too much.</p> <p>Why, intuitively, is the all-in player’s range so much wider than the calling player’s? David Sklansky coined the term “gap concept”, which states that a player needs a better hand to call with than to go all-in with – that difference is the gap. The main reasons for this are (a) the all-in player takes the initiative and can win the pot immediately by forcing a fold, while the calling player can only win at showdown, (b) the all-in player is signaling the strength of his hand, and (c) when facing an all-in bet, the pot odds are not especially appealing.</p> <h2 id="exploitation">Exploitation</h2> <p>What if the big blind calling player doesn’t feel comfortable calling with weaker hands like K2s and Q9o and keeps his calling range tighter than the equilibrium range of 37%? The game theoretic solution would not fully take advantage of this opportunity. The <strong>best response strategy</strong> is the one that maximally exploits the opponent by always performing the highest expected value play against their fixed strategy. 
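As a quick illustration with made-up payoff numbers (not output from any solver), a best response to a known fixed strategy can be computed by taking an expectation over the opponent's mixing probabilities and picking the best row:

```python
def best_response(payoffs, opp_strategy):
    """payoffs[i][j]: our payout when we play action i and the opponent plays j.
    opp_strategy[j]: probability the opponent plays action j."""
    evs = [sum(p * q for p, q in zip(row, opp_strategy)) for row in payoffs]
    best = max(range(len(evs)), key=evs.__getitem__)
    return best, evs

# Hypothetical 2-action game: against an opponent mixing 50/50,
# action 0 has EV 4.5 and action 1 has EV 2.0, so action 0 is the best response.
action, evs = best_response([[5, 4], [3, 1]], [0.5, 0.5])
```

Against a truly fixed opponent, this row-by-row expectation is all maximal exploitation amounts to; the risk is that the opponent's strategy may not stay fixed.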
In general, an exploitative strategy is one that exploits an opponent’s non-equilibrium play. In the above example, an exploitative play could be raising with all hands after seeing the opponent calling with a low percentage of hands. However, this strategy can itself be exploited.</p> <p><strong>table of EV vs. looser and tighter opponents compared to GTO and possible loss (pg 87 of Modern Poker Theory)</strong></p> <h2 id="normal-form">Normal Form</h2> <p>Normal Form is writing the <strong>strategies</strong> and game <strong>payouts</strong> in matrix form. The Player 1 strategies are in the rows and Player 2 strategies are in the columns. The payouts are written in terms of P1, P2.</p> <h3 id="zero-sum-all-in-poker-game">Zero-Sum All-in Poker Game</h3> <p>We can model the all-in game in normal form as below. Assume that each player looks at his/her hand and settles on an action; the chart below is then the result of those actions, with the first number being Player 1’s <strong>payout</strong> and the second being Player 2’s. In general, normal form matrices show <strong>utilities</strong> for each player in a game (how much each player values the outcome), but in poker settings, these are the payouts from the hand.</p> <p>Note that, for example, the calling player cannot actually call when the all-in player folds, but we assume the actions are pre-selected and the payouts remain as shown.</p> <p>In a 1v1 poker game, the sum of the payouts in each box is 0 since whatever one player wins, the other loses, which is called a <strong>zero-sum game</strong> (not including the house commission, aka rake).</p> <table> <thead> <tr> <th>All-in Player/Call Player</th> <th>Call</th> <th>Fold</th> </tr> </thead> <tbody> <tr> <td>All-in</td> <td>EV of all-in, -EV of all-in</td> <td>1, -1</td> </tr> <tr> <td>Fold</td> <td>-0.5, 0.5</td> <td>-0.5, 0.5</td> </tr> </tbody> </table> <p>If Player 1 has JT offsuit and Player 2 has AK offsuit, the numbers are as below. 
The all-in call scenario has -2.5 for Player 1 and 2.5 for Player 2 because the hand odds are about 37.5% for Player 1 and 62.5% for Player 2, meaning that Player 1’s equity in a \$20 pot is about \$7.50 and Player 2’s equity is about $12.50, so the net expected profit is -\$2.50 and \$2.50, respectively.</p> <table> <thead> <tr> <th>All-in Player/Call Player</th> <th>Call</th> <th>Fold</th> </tr> </thead> <tbody> <tr> <td>All-in</td> <td>-2.5, 2.5</td> <td>1, -1</td> </tr> <tr> <td>Fold</td> <td>-0.5, 0.5</td> <td>-0.5, 0.5</td> </tr> </tbody> </table> <p>Because in poker the hands are hidden, there would be no way to actually know the all-in/call EV in advance, but we show this to understand how the normal form looks.</p> <h3 id="simple-2-action-game">Simple 2-Action Game</h3> <table> <thead> <tr> <th>P1/2</th> <th>Action 1</th> <th>Action 2</th> </tr> </thead> <tbody> <tr> <td>Action 1</td> <td>5, 3</td> <td>4, 0</td> </tr> <tr> <td>Action 2</td> <td>3, 2</td> <td>1, -1</td> </tr> </tbody> </table> <p>In the Player 1 Action 1 and Player 2 Action 1 slot, we have (5, 3), which represents P1 = 5 and P2 = 3. I.e., if these actions are taken, Player 1 wins 5 units and Player 2 wins 3 units.</p> <h4 id="dominated-strategies">Dominated Strategies</h4> <p>A dominated strategy is one that is strictly worse than an alternative strategy. Let’s find the equilibrium strategies for this game by using <strong>iterated elimination of dominated strategies</strong>.</p> <p>If Player 2 plays Action 1, then Player 1 gets a payout of 5 with Action 1 or 3 with Action 2. Therefore Player 1 prefers Action 1 in this case.</p> <p>If Player 2 plays Action 2, then Player 1 gets a payout of 4 with Action 1 or 1 with Action 2. Therefore Player 1 prefers Action 1 again in this case.</p> <p>This means that whatever Player 2 does, Player 1 prefers Action 1 and therefore can eliminate Action 2 entirely since it would never make sense to play Action 2. 
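The elimination reasoning above can be sketched as a small check, using Player 1's payoffs from this 2-action game:

```python
def strictly_dominates(payoffs, a, b):
    """True if row action a is strictly better than row action b
    against every opponent action."""
    return all(x > y for x, y in zip(payoffs[a], payoffs[b]))

# Player 1's payoffs in the 2-action game: rows are P1's actions,
# columns are P2's actions.
p1_payoffs = [[5, 4], [3, 1]]
# Action 1 (row 0) strictly dominates Action 2 (row 1): 5 > 3 and 4 > 1,
# so Player 1 can eliminate Action 2 entirely.
```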
We can say Action 1 dominates Action 2 or Action 2 is dominated by Action 1.</p> <p>We can repeat the same process for Player 2. When Player 1 plays Action 1, Player 2 prefers Action 1 (3&gt;0). When Player 1 plays Action 2, Player 2 prefers Action 1 (2&gt;-1). Even though we already established that Player 1 will never play Action 2, Player 2 doesn’t know that so needs to evaluate that scenario.</p> <p>We see that Player 2 will also always play Action 1 and has eliminated Action 2.</p> <p>Therefore we have an <strong>equilibrium</strong> at (5,3) and no player would want to deviate or else they would have a lower payout!</p> <h3 id="3-action-game">3-Action Game</h3> <p>In the Player 1 Action 1 and Player 2 Action 1 slot, we have (10, 2), which represents P1 = 10 and P2 = 2. I.e. if these actions are taken, Player 1 wins 10 units and Player 2 wins 2 units.</p> <table> <thead> <tr> <th>P1/2</th> <th>Action 1</th> <th>Action 2</th> <th>Action 3</th> </tr> </thead> <tbody> <tr> <td>Action 1</td> <td>10, 2</td> <td>8, 1</td> <td>3, -1</td> </tr> <tr> <td>Action 2</td> <td>5, 8</td> <td>4, 0</td> <td>-1, 1</td> </tr> <tr> <td>Action 3</td> <td>7, 3</td> <td>5, -1</td> <td>0, 3</td> </tr> </tbody> </table> <p>Given this table, how can we determine the best actions for each player? Again, P1 is represented by the rows and P2 by the columns.</p> <p>We can see that Player 1’s strategy of Action 1 dominates Actions 2 and 3 because all of the values are strictly higher for Action 1. Regardless of Player 2’s action, Player 1’s Action 1 always has better results than Action 2 or 3.</p> <p>When P2 chooses Action 1, P1 earns 10 with Action 1, 5 with Action 2, and 7 with Action 3. When P2 chooses Action 2, P1 earns 8 with Action 1, 4 with Action 2, and 5 with Action 3. When P2 chooses Action 3, P1 earns 3 with Action 1, -1 with Action 2, and 0 with Action 3.</p> <p>We also see that Action 1 dominates Action 2 for Player 2. 
Action 1 gets payouts of 2 or 8 or 3 depending on Player 1’s action, while Action 2 gets payouts of 1 or 0 or -1, so Action 1 is always superior.</p> <p>Action 1 <strong>weakly</strong> dominates Action 3 for Player 2. This means that Action 1 is greater than <strong>or equal</strong> to playing Action 3. In the case that Player 1 plays Action 3, Player 2’s Action 1 and Action 3 both result in a payout of 3 units.</p> <p>We can eliminate strictly dominated strategies and then arrive at the reduced Normal Form game. Recall that Player 1 would never play Actions 2 or 3 because Action 1 is always better. Similarly, Player 2 would never play Action 2 because Action 1 is always better.</p> <table> <thead> <tr> <th>P1/2</th> <th>Action 1</th> <th>Action 3</th> </tr> </thead> <tbody> <tr> <td>Action 1</td> <td>10, 2</td> <td>3, -1</td> </tr> </tbody> </table> <p>In this case, Player 2 prefers to play Action 1 since 2 &gt; -1, so we have a Nash Equilibrium with both players playing Action 1 100% of the time (also known as a <strong>pure strategy</strong>) and the payouts will be 10 to Player 1 and 2 to Player 2. The issue with Player 2’s Action 1 having a tie with Action 3 when Player 1 played Action 3 was resolved because we now know that Player 1 will never actually play that action and when Player 1 plays Action 1, Player 2 will always prefer Action 1 to Action 3.</p> <table> <thead> <tr> <th>P1/2</th> <th>Action 1</th> </tr> </thead> <tbody> <tr> <td>Action 1</td> <td>10, 2</td> </tr> </tbody> </table> <p>To summarize, Player 1 always plays Action 1 because it dominates Actions 2 and 3. When Player 1 is always playing Action 1, it only makes sense for Player 2 to also play Action 1 since it gives a payoff of 2 compared to payoffs of 1 and -1 with Actions 2 and 3, respectively.</p> <h3 id="tennis-vs-power-rangers">Tennis vs. Power Rangers</h3> <p>In this game, we have two people who are going to watch something together. 
P1 has a preference to watch tennis and P2 prefers Power Rangers. If they don’t agree, then they won’t watch anything and will have payouts of 0. If they do agree, then the person who gets to watch their preferred show has a higher reward than the other, but both are positive.</p> <table> <thead> <tr> <th>P1/2</th> <th>Tennis</th> <th>Power Rangers</th> </tr> </thead> <tbody> <tr> <td>Tennis</td> <td>3, 2</td> <td>0, 0</td> </tr> <tr> <td>Power Rangers</td> <td>0, 0</td> <td>2, 3</td> </tr> </tbody> </table> <p>In this case, neither player can eliminate a strategy. For Player 1, if Player 2 chooses Tennis then he also prefers Tennis, but if Player 2 chooses Power Rangers, then he prefers Power Rangers as well (both of these outcomes are Nash equilibria). This is intuitive (if the people really like TV) because there is 0 value in watching nothing but at least some value if both agree to watch one thing. This also shows the Nash equilibrium principle of not being able to benefit from <strong>unilaterally</strong> changing strategies – if both are watching tennis and P2 changes to Power Rangers, that change would reduce value from 2 to 0!</p> <p>So what is the optimal strategy here? If each player simply picked their preference, then they’d always watch nothing and get 0! If they both always picked their non-preference, then the same thing would happen! If they pre-agreed to either Tennis or Power Rangers, then utilities would increase, but this would never be “fair” to one of the two people.</p> <p>We can calculate the optimal strategies like this:</p> <p>Let’s call $$P(P1 Tennis) = p$$ and $$P(P1 Power Rangers) = 1 - p$$. These represent the probability that Player 1 would select each of these.</p> <p>If Player 2 chooses Tennis, Player 2 earns $$p*(2) + (1-p)*(0) = 2p$$. 
The EV is calculated as probabilities of Player 1 multiplied by payouts of Player 2 playing Tennis.</p> <p>If Player 2 chooses Power Rangers, Player 2 earns $$p*(0) + (1-p)*(3) = 3 - 3p$$</p> <p>We are trying to find a strategy that involves mixing between both options, a <strong>mixed strategy</strong>. A fundamental rule is that if you are going to play multiple strategies, then the value of each must be the same. Otherwise you would just pick one and stick with that.</p> <p>Therefore we can set these values equal to each other, so</p> $2p = 3 - 3p$ $5p = 3$ $p = 3/5$ <p>Therefore Player 1’s strategy is to choose Tennis $$p = 3/5$$ and Power Rangers $$1 - p = 2/5$$. This is a mixed strategy equilibrium because there is a probability distribution over which strategy to play.</p> <p>This result comes about because these are the probabilities for P1 that induce P2 to be indifferent between Tennis and Power Rangers.</p> <p>By symmetry, P2’s strategy is to choose Tennis $$2/5$$ and Power Rangers $$3/5$$.</p> <p>This means that each player is choosing his/her chosen program $$3/5$$ of the time, while choosing the other option $$2/5$$ of the time. Let’s see how the final outcomes look.</p> <p>So we have Tennis, Tennis occurring $$3/5 * 2/5 = 6/25$$ Power Rangers, Power Rangers $$2/5 * 3/5 = 6/25$$ Tennis, Power Rangers $$3/5 * 3/5 = 9/25$$ Power Rangers, Tennis $$2/5 * 2/5 = 4/25$$</p> <p>These probabilities are shown below (this is not a normal form matrix because we are showing probabilities and not payouts):</p> <table> <thead> <tr> <th>P1/2</th> <th>Tennis</th> <th>Power Rangers</th> </tr> </thead> <tbody> <tr> <td>Tennis</td> <td>6/25</td> <td>9/25</td> </tr> <tr> <td>Power Rangers</td> <td>4/25</td> <td>6/25</td> </tr> </tbody> </table> <p>The average payouts to each player are $$6/25 * (3) + 6/25 * (2) = 30/25 = 1.2$$. This would have been higher if they had avoided the 0,0 payouts! 
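A sketch verifying the arithmetic above with exact fractions:

```python
from fractions import Fraction as F

p = F(3, 5)          # P1's probability of Tennis (makes P2 indifferent)
q = F(2, 5)          # P2's probability of Tennis (by symmetry)

# P2 is indifferent: EV(Tennis) = 2p equals EV(Power Rangers) = 3(1 - p)
assert 2 * p == 3 * (1 - p)

# Outcome probabilities and each player's average payout
p_tt, p_rr = p * q, (1 - p) * (1 - q)        # 6/25 each
avg_payout = p_tt * 3 + p_rr * 2             # 30/25 = 1.2
miscoordination = 1 - p_tt - p_rr            # 13/25: they watch nothing
```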
Unfortunately $$9/25 + 4/25 = 13/25$$ of the time, the payouts were 0 to each player. Coordinating to watch <em>something</em> rather than so often watching nothing would be a much better solution!</p> <p>What if Player 1 decided to be sneaky and change his strategy to choosing Tennis always instead of 3/5 tennis and 2/5 Power Rangers? Remember that there can be no benefit to deviating from a Nash Equilibrium strategy by definition. If he tries this, then we have the following likelihoods since P1 is never choosing Power Rangers and so the probabilities are determined strictly by P2’s strategy of 2/5 tennis and 3/5 Power Rangers:</p> <table> <thead> <tr> <th>P1/2</th> <th>Tennis</th> <th>Power Rangers</th> </tr> </thead> <tbody> <tr> <td>Tennis</td> <td>2/5</td> <td>3/5</td> </tr> <tr> <td>Power Rangers</td> <td>0</td> <td>0</td> </tr> </tbody> </table> <p>The Tennis/Power Rangers outcome (probability 3/5) has payoffs of 0, and the Tennis/Tennis outcome gives a payoff of $$2/5 * 3 = 6/5 = 1.2$$ for P1. This is the same as the payout he was already getting. Note that deviating from the equilibrium <em>can</em> maintain the same payoff, but cannot improve the payoffs. In the zero-sum case, the opponent also can only do better or equal when a player deviates, but in this case Player 2 actually has a lower payoff of $$2/5 * 2 = 0.8$$ instead of $$1.2$$.</p> <p>However, P2 might catch on to this and then get revenge by pulling the same trick and changing strategy to always selecting Power Rangers, resulting in the following probabilities:</p> <table> <thead> <tr> <th>P1/2</th> <th>Tennis</th> <th>Power Rangers</th> </tr> </thead> <tbody> <tr> <td>Tennis</td> <td>0</td> <td>1</td> </tr> <tr> <td>Power Rangers</td> <td>0</td> <td>0</td> </tr> </tbody> </table> <p>Now the probability is fully on P1 picking Tennis and P2 picking Power Rangers, and nobody gets anything!</p> <h3 id="rock-paper-scissors">Rock Paper Scissors</h3> <p>Finally, we can also think about this concept in Rock-Paper-Scissors. 
Let’s define a win as +1, a tie as 0, and a loss as -1. The game matrix for the game is shown below in Normal Form:</p> <table> <thead> <tr> <th>P1/2</th> <th>Rock</th> <th>Paper</th> <th>Scissors</th> </tr> </thead> <tbody> <tr> <td>Rock</td> <td>0, 0</td> <td>-1, 1</td> <td>1, -1</td> </tr> <tr> <td>Paper</td> <td>1, -1</td> <td>0, 0</td> <td>-1, 1</td> </tr> <tr> <td>Scissors</td> <td>-1, 1</td> <td>1, -1</td> <td>0, 0</td> </tr> </tbody> </table> <p>As usual, Player 1 is the row player and Player 2 is the column player. The payouts are written in terms of P1, P2. So for example P1 Paper and P2 Rock corresponds to a reward of +1 for P1 and -1 for P2 since Paper beats Rock.</p> <p>The equilibrium strategy is to play each action with 1/3 probability. We can see this intuitively because if any player played anything other than this distribution, then you could crush them by always playing the strategy that beats the strategy that they most favor. For example if someone played rock 50%, paper 25%, and scissors 25%, they are overplaying rock, so you could always play paper and then would win 50% of the time, tie 25% of the time, and lose 25% of the time for an average gain of $$1*0.5 + 0*0.25 + (-1)*0.25 = 0.25$$ each game.</p> <table> <thead> <tr> <th>P1/P2</th> <th>Rock 50%</th> <th>Paper 25%</th> <th>Scissors 25%</th> </tr> </thead> <tbody> <tr> <td>Rock 0%</td> <td>0</td> <td>0</td> <td>0</td> </tr> <tr> <td>Paper 100%</td> <td>0.5*1 = 0.5</td> <td>0.25*0 = 0</td> <td>0.25*(-1) = -0.25</td> </tr> <tr> <td>Scissors 0%</td> <td>0</td> <td>0</td> <td>0</td> </tr> </tbody> </table> <p>We can also work it out mathematically. Let P1 play Rock r%, Paper p%, and Scissors s%. The utility of P2 playing Rock is then $$0*(r) + -1 * (p) + 1 * (s)$$. The utility of P2 playing Paper is $$1 * (r) + 0 * (p) + -1 * (s)$$. 
The utility of P2 playing Scissors is $$-1 * (r) + 1 * (p) + 0 * (s)$$.</p> <p>We can figure out the best strategy with this system of equations (the second equation below is because all probabilities must add up to 1):</p> $\begin{cases} -p + s = r - s = -r + p \\ r + p + s = 1 \end{cases}$ $-p + s = r - s \Rightarrow 2s = p + r$ $r - s = -r + p \Rightarrow 2r = s + p$ $-p + s = -r + p \Rightarrow s + r = 2p$ $r + s + p = 1 \Rightarrow r + s = 1 - p$ $1 - p = 2p \Rightarrow 1 = 3p \Rightarrow p = 1/3$ $r + s + p = 1 \Rightarrow s + p = 1 - r$ $1 - r = 2r \Rightarrow 1 = 3r \Rightarrow r = 1/3$ $1/3 + 1/3 + s = 1 \Rightarrow s = 1/3$ <p>The equilibrium strategy is therefore to play each action with 1/3 probability.</p> <p>If your opponent plays the equilibrium strategy of Rock 1/3, Paper 1/3, Scissors 1/3, then every action you take has the same EV: $$1*(1/3) + 0*(1/3) + (-1)*(1/3) = 0$$. Note that in Rock Paper Scissors, if you play equilibrium then you can never show a profit because you will always break even, regardless of what your opponent does. In poker, this is not the case.</p> <h2 id="regret">Regret</h2> <!-- https://vimeo.com/265401201 --> <p>When I think of regret related to poker, the first thing that comes to mind is often “Wow you should’ve played way more hands in 2010 when poker was so easy”. Others may regret big folds or bluffs or calls that didn’t work out well.</p> <p>Here we will look at a less sad version, the mathematical concept of regret. Regret is a measure of how well you could have done compared to some alternative. Phrased differently, it measures how much better some alternative action would have done in the same situation.</p> <p>$$Regret = u(Alternative Strategy) - u(Current Strategy)$$ where $$u$$ represents utility</p> <p>If your current strategy for breakfast is cooking eggs at home, then maybe u(Current Home Egg Strategy) = 5. If you have an alternative of eating breakfast at a fancy buffet, then maybe u(Alternative Buffet Strategy) = 9, so the regret for not eating at the buffet is 9 - 5 = 4. 
If your alternative is getting a quick meal from McDonald’s, then you might value u(Alternative McDonald’s Strategy) = 2, so regret for not eating at McDonald’s is 2 - 5 = -3. We prefer alternative actions with high regret.</p> <p>We can give another example from Rock Paper Scissors:</p> <p>We play rock and opponent plays paper ⇒ u(rock,paper) = -1 Regret(scissors) = u(scissors,paper) - u(rock,paper) = 1-(-1) = 2 Regret(paper) = u(paper,paper) - u(rock,paper) = 0-(-1) = 1 Regret(rock) = u(rock,paper) - u(rock,paper) = -1-(-1) = 0</p> <p>We play scissors and opponent plays paper ⇒ u(scissors,paper) = 1 Regret(scissors) = u(scissors,paper) - u(scissors,paper) = 1-1 = 0 Regret(paper) = u(paper,paper) - u(scissors,paper) = 0-1 = -1 Regret(rock) = u(rock,paper) - u(scissors,paper) = -1-1 = -2</p> <p>We play paper and opponent plays paper ⇒ u(paper,paper) = 0 Regret(scissors) = u(scissors,paper) - u(paper,paper) = 1-0 = 1 Regret(paper) = u(paper,paper) - u(paper,paper) = 0-0 = 0 Regret(rock) = u(rock,paper) - u(paper,paper) = -1-0 = -1</p> <p>Again, we prefer alternative actions with high regret.</p> <p>To generalize for the Rock Paper Scissors case:</p> <ul> <li>The action played always gets a regret of 0 since the “alternative” is really just that same action</li> <li>When we play a tying action, the alternative losing action gets a regret of -1 and the alternative winning action gets a regret of +1</li> <li>When we play a winning action, the alternative tying action gets a regret of -1 and the alternative losing action gets a regret of -2</li> <li>When we play a losing action, the alternative winning action gets a regret of +2 and the alternative tying action gets a regret of +1</li> </ul> <h3 id="regret-matching">Regret Matching</h3> <p>What is the point of these regret values and what can we do with them?</p> <p>Regret matching means playing a strategy in proportion to the accumulated regrets. 
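The regret bookkeeping in the Rock-Paper-Scissors examples above can be sketched as (helper names are hypothetical):

```python
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def u(a, b):
    """Payout to the player choosing a against b: +1 win, 0 tie, -1 loss."""
    if a == b:
        return 0
    return 1 if BEATS[a] == b else -1

def regrets(played, opp):
    """Regret of each alternative action versus what we actually played."""
    return {a: u(a, opp) - u(played, opp) for a in ACTIONS}

# We played rock, opponent played paper:
# scissors would have won (+2), paper would have tied (+1), rock itself 0.
r = regrets("rock", "paper")
```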
As we play, we keep track of the accumulated regrets for each action and then play in proportion to those values. For example, if the total regret values for Rock are 5, Paper 10, Scissors 5, then we have total regrets of 20 and we would play Rock 5/20 = 1/4, Paper 10/20 = 1/2, and Scissors 5/20 = 1/4.</p> <p>It makes sense intuitively to prefer actions with higher regrets because they provide higher utility, as shown in the prior section. So why not just play the highest regret action always? Because playing in proportion to the regrets allows us to keep testing all of the actions, while still more often playing the actions that have the higher chance of being best. It could be that at the beginning, the opponent happened to play Scissors 60% of the time even though their strategy in the long run is to play it much less. We wouldn’t want to exclusively play Rock in this case, we’d want to keep our strategy more robust.</p> <p>The regret matching algorithm works like this:</p> <ol> <li>Initialize regret for each action to 0</li> <li>Set the strategy as: $$\text{strategy}_{i} = \begin{cases} \frac{R_{i}^{+}}{\sum_{k=1}^{n}R_{k}^{+}}, &amp; \mbox{if at least 1 pos regret} \\ \frac{1}{n}, &amp; \mbox{if all regrets negative} \end{cases}$$ where $$R_{i}^{+} = \max(R_{i}, 0)$$ is the positive part of the accumulated regret for action $$i$$</li> <li>Accumulate regrets after each game and update the strategy</li> </ol> <p>Let’s consider Player 1 playing a fixed RPS strategy of Rock 40%, Paper 30%, Scissors 30% and Player 2 playing using regret matching. 
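Step 2 of the algorithm, turning accumulated regrets into a strategy, might be sketched as:

```python
def regret_matching(total_regrets):
    """Play in proportion to positive accumulated regrets;
    fall back to uniform if no regret is positive."""
    positives = [max(r, 0.0) for r in total_regrets]
    total = sum(positives)
    n = len(total_regrets)
    if total > 0:
        return [r / total for r in positives]
    return [1.0 / n] * n

# Matches the text: total regrets [5, 10, 5] give Rock 1/4, Paper 1/2, Scissors 1/4.
strategy = regret_matching([5, 10, 5])
```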
So Player 1 is playing almost the equilibrium strategy, but a little bit biased in favor of Rock.</p> <p>Let’s look at a sequence of plays in this scenario that were generated randomly.</p> <table> <thead> <tr> <th>P1</th> <th>P2</th> <th>New Regrets</th> <th>New Total Regrets</th> <th>Strategy [R,P,S]</th> <th>P2 Profits</th> </tr> </thead> <tbody> <tr> <td>S</td> <td>S</td> <td>[1,0,-1]</td> <td>[1,0,-1]</td> <td>[1,0,0]</td> <td>0</td> </tr> <tr> <td>P</td> <td>R</td> <td>[0,1,2]</td> <td>[1,1,1]</td> <td>[1/3, 1/3, 1/3]</td> <td>-1</td> </tr> <tr> <td>S</td> <td>P</td> <td>[2,0,1]</td> <td>[3,1,2]</td> <td>[1/2, 1/6, 1/3]</td> <td>-2</td> </tr> <tr> <td>P</td> <td>R</td> <td>[0,1,2]</td> <td>[3,2,4]</td> <td>[1/3, 2/9, 4/9]</td> <td>-3</td> </tr> <tr> <td>R</td> <td>S</td> <td>[1,2,0]</td> <td>[4,4,4]</td> <td>[1/3,1/3,1/3]</td> <td>-4</td> </tr> <tr> <td>R</td> <td>R</td> <td>[0,1,-1]</td> <td>[4,5,3]</td> <td>[1/3,5/12,1/4]</td> <td>-4</td> </tr> <tr> <td>P</td> <td>P</td> <td>[-1,0,1]</td> <td>[3,5,4]</td> <td>[1/4,5/12,1/3]</td> <td>-4</td> </tr> <tr> <td>S</td> <td>P</td> <td>[2,0,1]</td> <td>[5,5,5]</td> <td>[1/3, 1/3, 1/3]</td> <td>-5</td> </tr> <tr> <td>R</td> <td>R</td> <td>[0,1,-1]</td> <td>[5,6,4]</td> <td>[1/3, 2/5, 4/15]</td> <td>-5</td> </tr> <tr> <td>R</td> <td>P</td> <td>[-1,0,-2]</td> <td>[4,6,2]</td> <td>[1/3,1/2,1/6]</td> <td>-4</td> </tr> </tbody> </table> <p>In the long-run we know that P2 can win a large amount by always playing Paper to exploit the over-play of Rock by P1. The EV of always playing Paper is $$1*0.4 + 0*0.3 + (-1)*0.3 = 0.1$$ per game and indeed after 10 games, the strategy with regret matching has already become biased in favor of playing Paper as we see in the final row where the Paper strategy is listed as 1/2 or 50%.</p> <p>Depending on the run and how the regrets accumulate, the regret matching can figure this out immediately or it can take some time. 
Here are two sample runs of this scenario, each lasting 10,000 games.</p> <p>The plots show the current strategy and average strategy over time of each of rock (green), paper (purple), and scissors (blue). These are on a 0 to 1 scale on the left axis. The black line measures the profit (aka rewards) on the right axis. The top plot shows how the algorithm can sometimes “catch on” very fast and almost immediately switch to always playing paper, while the second shows it taking about 1,500 games to figure that out.</p> <p><img src="../assets/section2/gametheory/rps_fast1.png" /></p> <p><img src="../assets/section2/gametheory/rps_slow1.png" /></p> <h3 id="regret-in-poker">Regret in Poker</h3> <p>The regret matching algorithm is at the core of selecting actions in the algorithms used to solve poker games. We will go into more detail in the CFR Algorithm section.</p> <h3 id="bandits">Bandits</h3> <p>A common way to analyze regret is the multi-armed bandit problem. The setup is a player sitting in front of a multi-armed “bandit” with some number of arms. (Think of this as sitting in front of a bunch of slot machines.)</p> <p>A basic setting initializes each of 10 arms with $$q_*(\text{arm}) = \mathcal{N}(0, 1)$$, so each is initialized with a center point found from the Gaussian distribution. Each pull of an arm then gets a reward of $$R = \mathcal{N}(q_*(\text{arm}), 1)$$.</p> <p>To clarify, this means each arm gets an initial value centered around 0 but with some variance, so each will be a bit different. 
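A sketch of this two-level sampling (the seed is arbitrary):

```python
import random

rng = random.Random(42)

# Each arm's true value q*(arm) is drawn once from N(0, 1)...
q_star = [rng.gauss(0, 1) for _ in range(10)]

def pull(arm):
    # ...and each pull of that arm returns a noisy reward
    # centered on the arm's true value.
    return rng.gauss(q_star[arm], 1)
```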
Then from that point, the actual pull of an arm is centered around that new point, as seen in this figure with a 10-armed bandit from <em>Reinforcement Learning: An Introduction</em> by Sutton and Barto:</p> <p><img src="../assets/section2/gametheory/banditsetup.png" alt="Bandit setup" /></p> <p>In simple terms, each machine has some set value that isn’t completely fixed at that value, but rather varies around it, so a machine with a value of 3 will usually give rewards between about 2 and 4 (one standard deviation around 3).</p> <p>Imagine that the goal is to play this game 2000 times with the intention to achieve the highest rewards. We can only learn about the rewards by pulling the arms – we don’t have any information about the distribution behind the scenes. We maintain an average reward per pull for each arm as a guide for which arm to pull in the future.</p> <p><strong>Greedy</strong> The most basic algorithm to score well is to pull each arm once and then forever pull the arm that performed the best in the sampling stage.</p> <p><strong>Epsilon Greedy</strong> $$\epsilon$$-Greedy works similarly to Greedy, but instead of <strong>always</strong> picking the best arm, we use an $$\epsilon$$ value that defines how often we should randomly pick a different arm. We keep track of which arm is the current best arm before each pull according to the average reward per pull, then play that arm $$1-\epsilon$$ of the time and play a random arm $$\epsilon$$ of the time.</p> <p>The idea of usually picking the best arm and sometimes switching to a random one is the concept of <strong>exploration vs. exploitation</strong>. Think of this in the context of picking a travel destination or picking a restaurant. You are likely to get a very high “reward” by continuing to go to a favorite vacation spot or restaurant, but it’s also useful to explore other options that you could end up preferring.</p> <p><strong>Bandit Regret</strong> The goal of the agent playing this game is to get the best reward. This is done by pulling the best arm. 
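A minimal sketch of the $$\epsilon$$-greedy loop described above on the Gaussian testbed (an illustration, not the exact experiment behind the plots):

```python
import random

def epsilon_greedy(n_arms=10, steps=1000, eps=0.1, seed=0):
    rng = random.Random(seed)
    q_true = [rng.gauss(0, 1) for _ in range(n_arms)]   # hidden arm values
    q_est = [0.0] * n_arms                              # average reward per arm so far
    counts = [0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=q_est.__getitem__)   # exploit best estimate
        reward = rng.gauss(q_true[arm], 1)
        counts[arm] += 1
        q_est[arm] += (reward - q_est[arm]) / counts[arm]     # incremental average
        total_reward += reward
    return total_reward / steps, max(q_true)

avg_reward, best_value = epsilon_greedy(eps=0.1)
```

Averaging such runs over many seeds and several $$\epsilon$$ values reproduces the kind of comparison shown in the plots below.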
A sensible definition of average regret is 
The last one we’ll touch on is called Upper Confidence Bound (UCB).</p> $A_t = \text{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$ <p>where $$Q_t(a)$$ is the current average reward of arm $$a$$, $$N_t(a)$$ is the number of times arm $$a$$ has been pulled so far, and $$c$$ controls the amount of exploration. Arms that have been pulled only rarely receive a large exploration bonus, so UCB tries uncertain arms while still favoring arms with high estimated value.</p>AIPT Section 2.2: Game Theory – Trees in Games2021-02-03T00:00:00+00:00https://aipokertutorial.com/trees-in-games<h1 id="game-theory--trees-in-games">Game Theory – Trees in Games</h1> <p>Many games can be solved using the minimax algorithm for exploring a tree and determining the best move from each position.</p> <h2 id="basic-tree">Basic Tree</h2> <p>Take a look at the game tree below. The circular nodes represent player positions and the lines represent possible actions. The “root” of the tree is the initial state at the top. We have P1 acting first, P2 acting second, and the payoffs at the leaf nodes in the standard P1, P2 format.</p> <p><img src="../assets/section2/trees/minimax.png" alt="Minimax tree" /></p> <p>The standard way to solve a tree like this is using <strong>backward induction</strong>, whereby we start with the leaves (i.e. the payoff nodes at the bottom) of the tree and see which decisions the last player, Player 2 in this case, will make at her decision nodes.</p> <p>Player 2’s goal is to minimize the maximum payoff of Player 1, which in the zero-sum setting is equivalent to minimizing her own maximum loss or maximizing her own minimum payoff. In the zero-sum setting, this minimax solution coincides with a Nash equilibrium.</p> <p>She picks the right node on the left side (payoff -1 instead of -5) and the left node on the right side (payoff 3 instead of -6).</p> <p>These values are then propagated up the tree so from Player 1’s perspective, the value of going left is 1 and of going right is -3. The other leaf nodes are not considered because Player 2 will never choose those.
Player 1 then decides to play left to maximize his payoff.</p> <p><img src="../assets/section2/trees/minimax2.png" alt="Minimax tree solved" /></p> <p>We can see all possible payouts, where the rows are P1 actions and the columns are P2 strategies, written as (P2’s action after P1 goes Left)/(P2’s action after P1 goes Right). For example, the column Left/Right means P2 responds to Left with Left and responds to Right with Right.</p> <table> <thead> <tr> <th>P1/P2</th> <th>Left/Left</th> <th>Left/Right</th> <th>Right/Left</th> <th>Right/Right</th> </tr> </thead> <tbody> <tr> <td>Left</td> <td>5,-5</td> <td>5,-5</td> <td>1,-1</td> <td>1,-1</td> </tr> <tr> <td>Right</td> <td>-3,3</td> <td>6,-6</td> <td>-3,3</td> <td>6,-6</td> </tr> </tbody> </table> <p>Note that Player 1 choosing right <em>could</em> result in a higher payout (6) if Player 2 then chose right, but a rational Player 2 would not do that, and so the algorithm requires maximizing one’s minimum payoff, which means Player 1 must choose left (earning a guaranteed value of 1).</p> <p>By working backwards from the end of a game, we can evaluate each possible sequence of moves and propagate the values up the game tree.</p> <p><strong>Subgame perfect equilibrium</strong> means that the strategy profile induces a Nash equilibrium in every subgame, i.e. in the game starting from each decision node of the tree above. The strategy of P1 choosing Left and P2 choosing Right after Left and Left after Right is a subgame perfect equilibrium.</p> <p>Two main problems arise with minimax and backward induction.</p> <h3 id="problem-1-the-game-is-too-damn-large">Problem 1: The Game is too Damn Large</h3> <p>In theory, we could use the minimax algorithm to solve games like chess. The problem is that the game and the space of possible actions is HUGE. It’s not feasible to evaluate all possibilities. The first level of the tree would need to have every possible action and then the next level would have every possible action from each of those actions, and so on. Even checkers is very large, though smaller games like tic tac toe can be solved with minimax.
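</p> <p>The backward-induction computation on the example tree above can be sketched in a few lines (payoffs are Player 1’s, taken from the tree; the variable names are ours):</p>

```python
# Player 1's leaf payoffs: after P1 goes left, P2 chooses between leaves worth 5 or 1 to P1;
# after P1 goes right, P2 chooses between leaves worth -3 or 6 to P1.
tree = {"left": [5, 1], "right": [-3, 6]}

# Zero-sum: P2 minimizes P1's payoff at her nodes...
values = {action: min(leaves) for action, leaves in tree.items()}
# ...and P1 maximizes over the propagated values.
best_action = max(values, key=values.get)
print(values, best_action)  # {'left': 1, 'right': -3} left
```

<p>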
More sophisticated methods and approximation techniques are used in practice for large games. One simple method is to only go down the tree to a depth of “X” and then approximate the value of the states there.</p> <h3 id="problem-2-perfect-information-vs-imperfect-information">Problem 2: Perfect Information vs. Imperfect Information</h3> <p>What about poker? Real poker games like Texas Hold’em are very large and run into the same problem we have with games like chess, but in addition, poker is an <strong>imperfect information game</strong> and games like chess and tic tac toe are <strong>perfect information games</strong>. The distinction is that in poker there is hidden information – each player’s private cards. In perfect information games, all players see all of the information.</p> <p>With perfect information, each player knows exactly which node/state of the game tree he is in. With imperfect information, there is uncertainty about the state of the game because the other player’s cards are unknown.</p> <h2 id="poker-tree">Poker Tree</h2> <p>Below we show the game tree for 1-card poker. In brief, it’s a 1v1 game where each player starts with $2 and antes $1, leaving a single $1 bet remaining. We’ll go into more details about the game in the next section.</p> <p>The top node is a chance node that “deals” the cards. To make it more readable, only 2 chance outcomes are shown: Player 1 dealt Q with Player 2 dealt J, and Player 1 dealt Q with Player 2 dealt K.</p> <p><img src="../assets/section2/trees/infoset2.png" alt="1-card poker game tree" /> <em>1-card poker game tree from University of Alberta</em></p> <p>Player 1’s initial action is to either bet or pass. If Player 1 bets, Player 2 can call or fold. If Player 1 passes, Player 2 can bet or pass. If Player 1 passed and Player 2 bet, then Player 1 can call or fold.</p> <p>Note the nodes that are circled and connected by a line. This means that they are in the same <strong>information set</strong>.
An information set consists of equivalent states based on information known to that player. For example, in the top information set, Player 1 has a Q in both of the shown states, so his actions will be the same in both even though Player 2 could have either a K or J. The information known to Player 1 is “Card Q, starting action”. At the later information set, the information known is “Card Q, I pass, opponent bets”. All decisions must be made based only on information known to each player! However, these are actually different true game states.</p> <p>Looking at the information set at the bottom where Player 1 passes and Player 2 bets, Player 1 has the same information in both cases, but calling when Player 2 has a J means winning 2 and calling when Player 2 has a K means losing 2. The payoffs are completely different!</p> <p>Therefore we can’t simply propagate values up the tree as we can do in perfect information games. Later in the tutorial, we will discuss CFR (counterfactual regret minimization), which is a way to solve games like poker that can’t be solved using minimax.</p> <h2 id="tic-tac-toe-tree">Tic Tac Toe Tree</h2> <p><strong>Tree goes here</strong></p> <p>On the tic tac toe tree, from the initial state, there are up to 9 levels of moves. Each subsequent level has fewer possible actions since more spaces on the game board are taken as we go down the tree. 
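</p> <p>A quick upper bound on the size of this tree: with 9 choices for the first move, 8 for the second, and so on, there are at most 9! = 362,880 complete move sequences (the true number of games is smaller because many games end before the board fills):</p>

```python
import math

# Naive upper bound on complete tic tac toe move sequences: 9 * 8 * ... * 1
sequences = math.factorial(9)
print(sequences)  # 362880
```

<p>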
The tree ends either at states where one player has won, or where all the spaces are filled and no one has won, resulting in a tie.</p> <p>In tic tac toe, the sequence of actions leading to a given game state is not important – only the resulting board matters.</p> <h2 id="tic-tac-toe-python-implementations">Tic Tac Toe Python Implementations</h2> <p>While we’re mainly focused in this tutorial on poker and imperfect information games, we take a short detour to look more in-depth at minimax and Monte Carlo Tree Search (MCTS) through the lens of tic tac toe.</p> <h3 id="tic-tac-toe-in-python">Tic Tac Toe in Python</h3> <p>Below we show a basic Python class called Tictactoe. The board is initialized with all 0’s and each player is represented by a 1 or -1. Those numbers go into board spaces when the associated player makes a move. The class has 5 functions:</p> <ol> <li>make_move: Enters the player’s move onto the board if the space is available and advances the play to the next player</li> <li>new_state_with_move: Same as make_move, but returns a new Tictactoe state containing the move, leaving the current state unchanged</li> <li>available_moves: Lists the moves that are currently available on the board</li> <li>check_result: Checks every possible winning sequence and returns either the winning player’s ID if there is a winner, a 0 if the game has ended in a tie, or None if the game is not over yet</li> <li>__repr__: Used for printing the board.
Empty slots are represented by their number from 0 to 8, player 1 is represented with ‘x’, player 2 is represented with ‘o’, and a line break is added as needed after the first 3 and middle 3 positions.</li> </ol> <p>We also have two simple agent classes:</p> <ol> <li>HumanAgent: Enters a move from 0-8 and the move is placed if it’s available, otherwise we ask for the move again</li> <li>RandomAgent: Randomly selects a move from the available moves</li> </ol> <p>Finally, we need a short script to actually run the game.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Tictactoe</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">acting_player</span> <span class="o">=</span> <span class="mi">1</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">board</span> <span class="o">=</span> <span class="n">board</span> <span class="ow">or</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="mi">9</span> <span class="c1">#a mutable default argument would be shared between instances </span> <span class="bp">self</span><span class="p">.</span><span class="n">acting_player</span> <span class="o">=</span> <span class="n">acting_player</span> <span class="k">def</span> <span class="nf">make_move</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">move</span><span class="p">):</span> <span class="k">if</span> <span class="n">move</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">available_moves</span><span class="p">():</span> <span class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">[</span><span class="n">move</span><span class="p">]</span>
<span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">acting_player</span> <span class="bp">self</span><span class="p">.</span><span class="n">acting_player</span> <span class="o">=</span> <span class="mi">0</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">acting_player</span> <span class="c1">#Players are 1 or -1 </span> <span class="k">def</span> <span class="nf">new_state_with_move</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">move</span><span class="p">):</span> <span class="c1">#Return new ttt state with move, but don't change this state </span> <span class="k">if</span> <span class="n">move</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">available_moves</span><span class="p">():</span> <span class="n">board_copy</span> <span class="o">=</span> <span class="n">copy</span><span class="p">.</span><span class="n">deepcopy</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">)</span> <span class="n">board_copy</span><span class="p">[</span><span class="n">move</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">acting_player</span> <span class="k">return</span> <span class="n">Tictactoe</span><span class="p">(</span><span class="n">board_copy</span><span class="p">,</span> <span class="mi">0</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">acting_player</span><span class="p">)</span> <span class="k">def</span> <span class="nf">available_moves</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span 
class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">9</span><span class="p">)</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">]</span> <span class="k">def</span> <span class="nf">check_result</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">for</span> <span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">c</span><span class="p">)</span> <span class="ow">in</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">),(</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">),(</span><span class="mi">6</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">8</span><span class="p">),(</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">6</span><span class="p">),(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">7</span><span class="p">),(</span><span class="mi">2</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">8</span><span class="p">),(</span><span class="mi">0</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">8</span><span class="p">),(</span><span class="mi">2</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">6</span><span class="p">)]:</span> <span class="k">if</span> <span 
class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">==</span> <span class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">[</span><span class="n">b</span><span class="p">]</span> <span class="o">==</span> <span class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">available_moves</span><span class="p">()</span> <span class="o">==</span> <span class="p">[]:</span> <span class="k">return</span> <span class="mi">0</span> <span class="c1">#Tie </span> <span class="k">return</span> <span class="bp">None</span> <span class="c1">#Game not over </span> <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="n">s</span><span class="o">=</span> <span class="s">""</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">9</span><span class="p">):</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="n">s</span><span class="o">+=</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">elif</span> <span 
class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span> <span class="n">s</span><span class="o">+=</span><span class="s">'x'</span> <span class="k">elif</span> <span class="bp">self</span><span class="p">.</span><span class="n">board</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="n">s</span><span class="o">+=</span><span class="s">'o'</span> <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">2</span> <span class="ow">or</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">5</span><span class="p">:</span> <span class="n">s</span> <span class="o">+=</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="k">return</span> <span class="n">s</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HumanAgent</span><span class="p">:</span> <span class="k">def</span> <span class="nf">select_move</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">game_state</span><span class="p">):</span> <span class="k">print</span><span class="p">(</span><span class="s">'Enter your move (0-8): '</span><span class="p">)</span> <span class="n">move</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="nb">input</span><span class="p">()))</span> <span class="c1">#print('move', move) </span> <span class="c1">#print('game state available moves', game_state.available_moves()) </span> <span class="k">if</span> <span 
class="n">move</span> <span class="ow">in</span> <span class="n">game_state</span><span class="p">.</span><span class="n">available_moves</span><span class="p">():</span> <span class="k">return</span> <span class="n">move</span> <span class="k">else</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">'Invalid move, try again'</span><span class="p">)</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">select_move</span><span class="p">(</span><span class="n">game_state</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RandomAgent</span><span class="p">:</span> <span class="k">def</span> <span class="nf">select_move</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">game_state</span><span class="p">):</span> <span class="k">return</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">game_state</span><span class="p">.</span><span class="n">available_moves</span><span class="p">())</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span> <span class="n">ttt</span> <span class="o">=</span> <span class="n">Tictactoe</span><span class="p">()</span> <span class="c1">#ttt = Tictactoe([0,0,-1,0,0,0,1,-1,1]) #Optionally can start from a pre-set game position </span> <span class="n">player1</span> <span class="o">=</span> <span class="n">HumanAgent</span><span class="p">()</span> <span class="n">player2</span> <span class="o">=</span> <span class="n">RandomAgent</span><span class="p">()</span> <span class="n">moves</span> <span
class="o">=</span> <span class="mi">0</span> <span class="k">while</span> <span class="n">ttt</span><span class="p">.</span><span class="n">available_moves</span><span class="p">():</span> <span class="k">print</span><span class="p">(</span><span class="n">ttt</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">'move'</span><span class="p">,</span> <span class="n">moves</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">'acting player'</span><span class="p">,</span> <span class="n">ttt</span><span class="p">.</span><span class="n">acting_player</span><span class="p">)</span> <span class="k">if</span> <span class="n">moves</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="n">move</span> <span class="o">=</span> <span class="n">player1</span><span class="p">.</span><span class="n">select_move</span><span class="p">(</span><span class="n">ttt</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="n">move</span> <span class="o">=</span> <span class="n">player2</span><span class="p">.</span><span class="n">select_move</span><span class="p">(</span><span class="n">ttt</span><span class="p">)</span> <span class="n">ttt</span><span class="p">.</span><span class="n">make_move</span><span class="p">(</span><span class="n">move</span><span class="p">)</span> <span class="k">if</span> <span class="n">ttt</span><span class="p">.</span><span class="n">check_result</span><span class="p">()</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">'Draw game'</span><span class="p">)</span> <span class="k">break</span> <span class="k">elif</span> <span class="n">ttt</span><span class="p">.</span><span class="n">check_result</span><span class="p">()</span> <span 
class="o">==</span> <span class="mi">1</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">'Player 1 wins'</span><span class="p">)</span> <span class="k">break</span> <span class="k">elif</span> <span class="n">ttt</span><span class="p">.</span><span class="n">check_result</span><span class="p">()</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">'Player 2 wins'</span><span class="p">)</span> <span class="k">break</span> <span class="n">moves</span><span class="o">+=</span><span class="mi">1</span> </code></pre></div></div> <h3 id="minimax-applied-to-tic-tac-toe">Minimax Applied to Tic Tac Toe</h3> <p>We can apply the minimax algorithm to tic tac toe. Here we use a simplified version of minimax called negamax, because in a zero-sum game like tic tac toe, the value of a position to one player is the negative of its value to the other player.</p> <p>We store already evaluated states in a memo dictionary mapping each state to its best move and value. When a state has not been seen before, we check the game state, which will either return the winning player, 0 for a tie, or None for “game not yet over”.</p> <p>If the game is over, then there is no move from this position and the value is simply the result of the game.</p> <p>If the game is not over, we iterate through the available moves and recursively find a value for each possible move.
As each move is evaluated, we store the best move and the value for that move and return the overall best after evaluating each move.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class NegamaxAgent:
    def __init__(self):
        self.memo = {} #state key -&gt; (move, value)

    def negamax(self, game_state):
        #Key by board contents and player to move so repeated states are actually found
        #(Tictactoe instances compare by identity, so they don't work as dictionary keys)
        key = (tuple(game_state.board), game_state.acting_player)
        if key not in self.memo: #already visited this state?
            result = game_state.check_result()
            if result is not None: #leaf node or end of search
                best_move = None
                best_val = result * game_state.acting_player #result from the acting player's perspective
            else:
                best_val = float('-inf')
                for i in game_state.available_moves():
                    clone_state = copy.deepcopy(game_state)
                    clone_state.make_move(i) #makes move and switches to next player
                    _, val = self.negamax(clone_state)
                    val *= -1 #the opponent's value is the negation of ours
                    if val &gt; best_val:
                        best_move = i
                        best_val = val
            self.memo[key] = (best_move, best_val)
        return self.memo[key]
</code></pre></div></div> <h3 id="monte-carlo-tree-search-mcts-applied-to-tic-tac-toe">Monte Carlo Tree Search (MCTS) Applied to Tic Tac Toe</h3> <p>MCTS is a more advanced algorithm that estimates the best move through simulation. This algorithm is used as part of some recent advances in AI poker agents.</p> <p>MCTS allows us to determine a strong move from a game state without having to expand the entire tree like we had to do in the minimax algorithm.</p> <p>Let’s consider that we want to find the best tic tac toe move from some state of the game that we can pre-specify. Let’s go through the algorithm step by step.</p> <p>The MCTSAgent class and its select_move function contain the core of the algorithm. The function begins by setting the root of the game tree as an MCTSNode class. A node is a decision point in the game tree, so the root node is the beginning of the tree.
If we wanted to find out the best move from the beginning of the game, this would represent an empty tic tac toe board.</p> <p>Each node is initialized with a game state, a parent node, a move, a list of child nodes, a counter for wins by each player, a counter for rollouts that have gone through this node, and a list of available moves from this node.</p> <p>For some fixed number of rounds, we go through the following steps:</p> <ol> <li> <p>Selection: Starting from the root, while the current node has no unvisited moves left and is not terminal, we descend to the child that looks most promising, trading off each child’s win rate against an exploration bonus (as in UCB).</p> </li> <li> <p>Expansion: If we can add a child node, then we select a random move from the current game state and create a new node to represent the game with this new move, which becomes a child of the prior node. This is done through the add_random_child function in the MCTSNode class.</p> </li> <li> <p>Simulation: Next we run a random playout from the newly expanded node until the game ends.</p> </li> <li> <p>Backpropagation: From the end of the game, we update each node that was passed through by updating the win counts for each player (one player gets +1 and one gets -1, or both get 0 in the case of a tie) and adding 1 to the number of rollouts that have passed through each of the nodes.</p> </li> </ol> <p>After running MCTS, we look at each child node from the root and evaluate its win percentage over all of the simulations.
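</p> <p>During the selection step, a common way to rank children is a UCT score, which is UCB applied to tree search. A sketch (the helper name and the constant $$c = 1.4$$ are assumptions for illustration, not part of the MCTSNode code shown below):</p>

```python
import math

def uct_score(parent_rollouts, child_rollouts, child_wins, c=1.4):
    # Average win rate plus an exploration bonus that shrinks as the child is visited more
    win_rate = child_wins / child_rollouts
    exploration = c * math.sqrt(math.log(parent_rollouts) / child_rollouts)
    return win_rate + exploration
```

<p>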
We then print a list of the moves in order of their winning percentage, along with how many simulations were run for each move.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MCTSNode</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">game_state</span><span class="p">,</span> <span class="n">parent</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">move</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">parent</span> <span class="o">=</span> <span class="n">parent</span> <span class="bp">self</span><span class="p">.</span><span class="n">move</span> <span class="o">=</span> <span class="n">move</span> <span class="bp">self</span><span class="p">.</span><span class="n">game_state</span> <span class="o">=</span> <span class="n">game_state</span> <span class="bp">self</span><span class="p">.</span><span class="n">children</span> <span class="o">=</span> <span class="p">[]</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="mi">0</span><span class="p">}</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_rollouts</span> <span class="o">=</span> <span class="mi">0</span> <span class="bp">self</span><span class="p">.</span><span class="n">unvisited_moves</span> <span class="o">=</span> <span class="n">game_state</span><span class="p">.</span><span class="n">available_moves</span><span 
class="p">()</span> <span class="k">def</span> <span class="nf">add_random_child</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="n">move_index</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">unvisited_moves</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1">#inclusive </span> <span class="n">new_move</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">unvisited_moves</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="n">move_index</span><span class="p">)</span> <span class="n">new_node</span> <span class="o">=</span> <span class="n">MCTSNode</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">game_state</span><span class="p">.</span><span class="n">new_state_with_move</span><span class="p">(</span><span class="n">new_move</span><span class="p">),</span> <span class="bp">self</span><span class="p">,</span> <span class="n">new_move</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">children</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_node</span><span class="p">)</span> <span class="k">return</span> <span class="n">new_node</span> <span class="k">def</span> <span class="nf">can_add_child</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">unvisited_moves</span><span 
class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">def</span> <span class="nf">is_terminal</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">game_state</span><span class="p">.</span><span class="n">check_result</span><span class="p">()</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">result</span><span class="p">):</span> <span class="k">if</span> <span class="n">result</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">-=</span> <span class="mi">1</span> <span class="k">elif</span> <span class="n">result</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-=</span> <span class="mi">1</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_rollouts</span> <span class="o">+=</span> <span 
class="mi">1</span> <span class="k">def</span> <span class="nf">winning_frac</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">player</span><span class="p">):</span> <span class="k">return</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">win_counts</span><span class="p">[</span><span class="n">player</span><span class="p">])</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">num_rollouts</span><span class="p">)</span> <span class="k">class</span> <span class="nc">MCTSAgent</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">num_rounds</span> <span class="o">=</span> <span class="mi">10000</span><span class="p">,</span> <span class="n">temperature</span> <span class="o">=</span> <span class="mi">2</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_rounds</span> <span class="o">=</span> <span class="n">num_rounds</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">=</span> <span class="n">temperature</span> <span class="k">def</span> <span class="nf">uct_select_child</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">node</span><span class="p">):</span> <span class="n">best_score</span> <span class="o">=</span> <span class="o">-</span><span class="nb">float</span><span class="p">(</span><span class="s">'inf'</span><span class="p">)</span> <span class="n">best_child</span> <span class="o">=</span> <span class="bp">None</span> <span class="n">total_rollouts</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span 
class="n">child</span><span class="p">.</span><span class="n">num_rollouts</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">node</span><span class="p">.</span><span class="n">children</span><span class="p">)</span> <span class="n">log_rollouts</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">total_rollouts</span><span class="p">)</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">node</span><span class="p">.</span><span class="n">children</span><span class="p">:</span> <span class="n">win_pct</span> <span class="o">=</span> <span class="n">child</span><span class="p">.</span><span class="n">winning_frac</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">game_state</span><span class="p">.</span><span class="n">acting_player</span><span class="p">)</span> <span class="n">exploration_factor</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">log_rollouts</span> <span class="o">/</span> <span class="n">child</span><span class="p">.</span><span class="n">num_rollouts</span><span class="p">)</span> <span class="n">uct_score</span> <span class="o">=</span> <span class="n">win_pct</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">*</span> <span class="n">exploration_factor</span> <span class="k">if</span> <span class="n">uct_score</span> <span class="o">&gt;</span> <span class="n">best_score</span><span class="p">:</span> <span class="n">best_score</span> <span class="o">=</span> <span class="n">uct_score</span> <span class="n">best_child</span> <span class="o">=</span> <span class="n">child</span> <span class="k">return</span> <span 
class="n">best_child</span> <span class="k">def</span> <span class="nf">select_move</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">game_state</span><span class="p">):</span> <span class="n">root</span> <span class="o">=</span> <span class="n">MCTSNode</span><span class="p">(</span><span class="n">game_state</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">num_rounds</span><span class="p">):</span> <span class="n">node</span> <span class="o">=</span> <span class="n">root</span> <span class="c1">#selection -- UCT select child until we get to a node that can be expanded </span> <span class="k">while</span> <span class="p">(</span><span class="ow">not</span> <span class="n">node</span><span class="p">.</span><span class="n">can_add_child</span><span class="p">())</span> <span class="ow">and</span> <span class="p">(</span><span class="ow">not</span> <span class="n">node</span><span class="p">.</span><span class="n">is_terminal</span><span class="p">()):</span> <span class="n">node</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">uct_select_child</span><span class="p">(</span><span class="n">node</span><span class="p">)</span> <span class="c1">#expansion -- expand from leaf unless leaf is end of game </span> <span class="k">if</span> <span class="n">node</span><span class="p">.</span><span class="n">can_add_child</span><span class="p">():</span> <span class="n">node</span> <span class="o">=</span> <span class="n">node</span><span class="p">.</span><span class="n">add_random_child</span><span class="p">()</span> <span class="c1">#simulation -- complete a random playout from the newly expanded node </span> <span class="n">gs_temp</span> <span class="o">=</span> <span class="n">copy</span><span 
class="p">.</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">game_state</span><span class="p">)</span> <span class="k">while</span> <span class="n">gs_temp</span><span class="p">.</span><span class="n">check_result</span><span class="p">()</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span> <span class="n">gs_temp</span><span class="p">.</span><span class="n">make_move</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">gs_temp</span><span class="p">.</span><span class="n">available_moves</span><span class="p">()))</span> <span class="c1">#backpropagation -- update all nodes from the selection to leaf stage </span> <span class="k">while</span> <span class="n">node</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">gs_temp</span><span class="p">.</span><span class="n">check_result</span><span class="p">())</span> <span class="n">node</span> <span class="o">=</span> <span class="n">node</span><span class="p">.</span><span class="n">parent</span> <span class="n">scored_moves</span> <span class="o">=</span> <span class="p">[(</span><span class="n">child</span><span class="p">.</span><span class="n">winning_frac</span><span class="p">(</span><span class="n">game_state</span><span class="p">.</span><span class="n">acting_player</span><span class="p">),</span> <span class="n">child</span><span class="p">.</span><span class="n">move</span><span class="p">,</span> <span class="n">child</span><span class="p">.</span><span class="n">num_rollouts</span><span class="p">)</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span 
class="n">root</span><span class="p">.</span><span class="n">children</span><span class="p">]</span> <span class="n">scored_moves</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">key</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">for</span> <span class="n">s</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">scored_moves</span><span class="p">[:</span><span class="mi">10</span><span class="p">]:</span> <span class="k">print</span><span class="p">(</span><span class="s">'%s - %.3f (%d)'</span> <span class="o">%</span> <span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">n</span><span class="p">))</span> <span class="n">best_pct</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="n">best_move</span> <span class="o">=</span> <span class="bp">None</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">root</span><span class="p">.</span><span class="n">children</span><span class="p">:</span> <span class="n">child_pct</span> <span class="o">=</span> <span class="n">child</span><span class="p">.</span><span class="n">winning_frac</span><span class="p">(</span><span class="n">game_state</span><span class="p">.</span><span class="n">acting_player</span><span class="p">)</span> <span class="k">if</span> <span class="n">child_pct</span> <span class="o">&gt;</span> <span class="n">best_pct</span><span class="p">:</span> <span class="n">best_pct</span> <span class="o">=</span> 
<span class="n">child_pct</span> <span class="n">best_move</span> <span class="o">=</span> <span class="n">child</span><span class="p">.</span><span class="n">move</span> <span class="k">print</span><span class="p">(</span><span class="s">'Select move %s with avg val %.3f'</span> <span class="o">%</span> <span class="p">(</span><span class="n">best_move</span><span class="p">,</span> <span class="n">best_pct</span><span class="p">))</span> <span class="k">return</span> <span class="n">best_move</span> </code></pre></div></div> <p><em>Game Theory – Trees in Games</em>: Many games can be solved using the minimax algorithm for exploring a tree and determining the best move from each position.</p> <p>AIPT Section 3.2: Solving Poker – Toy Poker Games (2021-02-03, https://aipokertutorial.com/toy-poker-games)</p> <!-- TODO: 1) Game tree for each decision point in analytical version and full one above 3) Tables to show y and x and y and z relative to each other 4) Deviations e.g. player who never bluffs 5) Compare to rule-based 6) That is, there are $$2^64$$ strategy combinations. (??) 7) Lin alg example 8) More on balancing bluffs Rhode Island Hold'em Results with other poker agents playing worse strategies exploitable --> <h1 id="solving-poker---toy-poker-games">Solving Poker - Toy Poker Games</h1> <p>We will take a look at solving a very simple toy poker game called Kuhn Poker.</p> <h2 id="kuhn-poker">Kuhn Poker</h2> <p><strong>Kuhn Poker</strong> is the most basic poker game with interesting strategic implications.</p> <p>The game in its standard form is played with 3 cards {A, K, Q} and 2 players. Each player starts with \$2 and places an ante (i.e., a forced bet before the hand) of \$1, and therefore has \$1 left to bet with.
Each player is then dealt 1 card and 1 round of betting ensues.</p> <p><img src="../assets/section3/toygames/deputydots.png" alt="The deputy likes dots" /></p> <p>The rules in bullet form:</p> <ul> <li>2 players</li> <li>3 card deck {A, K, Q}</li> <li>Each starts the hand with \$2</li> <li>Each antes (i.e., makes a forced bet of) \$1 at the start of the hand</li> <li>Each player is dealt 1 card</li> <li>Each has \$1 remaining for betting</li> <li>There is 1 betting round and 1 bet size of \$1</li> <li>The highest card is the best (i.e., A &gt; K &gt; Q)</li> </ul> <p>Action starts with P1, who can Bet \$1 or Check</p> <ul> <li>If P1 bets, P2 can either Call or Fold</li> <li>If P1 checks, P2 can either Bet or Check</li> <li>If P2 bets after P1 checks, P1 can then Call or Fold</li> </ul> <p>These outcomes are possible:</p> <ul> <li>If a player folds to a bet, the other player wins the pot of \$2 (profit of \$1)</li> <li>If both players check, the player with the highest card wins the pot of \$2 (profit of \$1)</li> <li>If there is a bet and a call, the player with the highest card wins the pot of \$4 (profit of \$2)</li> </ul> <p>The following are all of the possible full sequences.</p> <p>The “History full” column shows the exact betting history, with “k” for check, “b” for bet, “c” for call, and “f” for fold.</p> <p>The “History short” column uses a condensed format with only “b” for betting/calling and “p” (pass) for checking/folding: “b” means putting \$1 into the pot and “p” means putting no money into the pot.
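<p>To make the shorthand concrete, here is a minimal sketch (in Python, with illustrative names) of the five terminal histories and Player 1's profit at each, following the outcomes listed above:</p>

```python
# Payoff to Player 1 (profit in $, antes counted) at each terminal history.
# "p" = pass (check/fold), "b" = bet/call; p1_wins is the showdown result.
def p1_profit(history, p1_wins):
    if history == "bp":            # P1 bets, P2 folds: P1 wins the $2 pot
        return 1
    if history == "pbp":           # P1 checks, P2 bets, P1 folds
        return -1
    if history == "pp":            # both check: showdown for the $2 pot
        return 1 if p1_wins else -1
    if history in ("bb", "pbb"):   # bet and call: showdown for the $4 pot
        return 2 if p1_wins else -2
    raise ValueError("non-terminal history: " + history)
```

<p>For example, <code>p1_profit("pbb", True)</code> returns 2: after a check, a bet, and a call, the high card wins the \$4 pot for a \$2 profit.</p>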
We reference this shorthand format since we’ll use it when putting the game into code.</p> <table> <thead> <tr> <th>P1</th> <th>P2</th> <th>P1</th> <th>Pot size</th> <th>Result</th> <th>History full</th> <th>History short</th> </tr> </thead> <tbody> <tr> <td>Check</td> <td>Check</td> <td>–</td> <td>\$2</td> <td>High card wins \$1</td> <td>kk</td> <td>pp</td> </tr> <tr> <td>Check</td> <td>Bet \$1</td> <td>Call \$1</td> <td>\$4</td> <td>High card wins \$2</td> <td>kbc</td> <td>pbb</td> </tr> <tr> <td>Check</td> <td>Bet \$1</td> <td>Fold</td> <td>\$2</td> <td>P2 wins \$1</td> <td>kbf</td> <td>pbp</td> </tr> <tr> <td>Bet \$1</td> <td>Call \$1</td> <td>–</td> <td>\$4</td> <td>High card wins \$2</td> <td>bc</td> <td>bb</td> </tr> <tr> <td>Bet \$1</td> <td>Fold</td> <td>–</td> <td>\$2</td> <td>P1 wins \$1</td> <td>bf</td> <td>bp</td> </tr> </tbody> </table> <h2 id="solving-kuhn-poker">Solving Kuhn Poker</h2> <p>We’re going to solve for the GTO solution to this game using 3 methods: an analytical solution, a normal-form solution, and a more efficient normal-form solution. We will then briefly mention game trees and the counterfactual regret minimization (CFR) algorithm, which will be detailed more in section 4.1.</p> <p>What’s the point of solving such a simple game? We can learn some important poker principles even from this game, although they are most useful for beginner players. We can also see the limitations of these earlier solving methods, and therefore why new methods were needed to solve games of even moderate size.</p> <h2 id="analytical-solution">Analytical Solution</h2> <p>There are 4 decision points in this game: P1’s opening action, P2 after P1 bets, P2 after P1 checks, and P1 after P1 checks and P2 bets.</p> <h3 id="defining-the-variables">Defining the variables</h3> <p><strong>P1 initial action</strong></p> <p>Let’s first look at P1’s opening action.
P1 should never bet the K card here because if he bets the K, P2 with Q will always fold (since the lowest card can never win) and P2 with A will always call (since the best card will always win). By checking the K always, P1 can try to induce a bluff from P2 when P2 has the Q and may be able to fold to a bet when P2 has the A.</p> <p>Therefore we assign P1’s strategy:</p> <ul> <li>Bet Q: $$x$$</li> <li>Bet K: $$0$$</li> <li>Bet A: $$y$$</li> </ul> <p><strong>P2 after P1 bet</strong></p> <p>After P1 bets, P2 should always call with the A and always fold the Q as explained above.</p> <p>Therefore we assign P2’s strategy after P1 bet:</p> <ul> <li>Call Q: $$0$$</li> <li>Call K: $$a$$</li> <li>Call A: $$1$$</li> </ul> <p><strong>P2 after P1 check</strong></p> <p>After P1 checks, P2 should never bet with the K for the same reason as P1 should never initially bet with the K.</p> <p>P2 should always bet with the A because it is the best hand and there is no bluff to induce by checking (the hand would simply end and P2 would win, but not have a chance to win more by betting).</p> <p>Therefore we assign P2’s strategy after P1 check:</p> <ul> <li>Bet Q: $$b$$</li> <li>Bet K: $$0$$</li> <li>Bet A: $$1$$</li> </ul> <p><strong>P1 after P1 check and P2 bet</strong></p> <p>This case is similar to P2’s actions after P1’s bet. P1 can never call here with the worst hand (Q) and must always call with the best hand (A).</p> <p>Therefore we assign P1’s strategy after P1 check and P2 bet:</p> <ul> <li>Call Q: $$0$$</li> <li>Call K: $$z$$</li> <li>Call A: $$1$$</li> </ul> <p>So we now have 5 different variables $$x, y, z$$ for P1 and $$a, b$$ for P2 to represent the unknown probabilities.</p> <h3 id="solving-for-the-variables">Solving for the variables</h3> <p><strong>The Indifference Principle</strong></p> <p>When we solve for the analytical game theory optimal strategy, we want to make the opponent indifferent. This means that the opponent cannot exploit our strategy. 
If we deviated from this equilibrium, then since poker is a 2-player zero-sum game, our opponent’s EV could increase at the expense of our own.</p> <p><strong>Solving for $$x$$ and $$y$$</strong></p> <p>For P1 opening the action, $$x$$ is his probability of betting with Q (bluffing) and $$y$$ is his probability of betting with A (value betting). We want to make P2 indifferent between calling and folding with the K (since again, Q is always a fold and A is always a call for P2).</p> <p>When P2 has K, P1 has probability $$\frac{1}{2}$$ each of holding a Q or an A.</p> <p>P2’s EV of folding a K to a bet is $$0$$. (Note that we are defining EV from the current decision point, meaning that money already put into the pot is sunk and not factored in.)</p> <p>P2’s EV of calling a bet with a K $$= 3 * \text{P(P1 has Q and bets with Q)} + (-1) * \text{P(P1 has A and bets with A)}$$</p> $= (3) * \frac{1}{2} * x + (-1) * \frac{1}{2} * y$ <p>Setting the calling and folding EVs equal (because of the indifference principle), we have:</p> $0 = (3) * \frac{1}{2} * x + (-1) * \frac{1}{2} * y$ $y = 3 * x$ <p>That is, P1 should value bet the A 3 times more often than he bluffs with the Q. This result is parametrized, meaning that there isn’t a single fixed solution, but rather a ratio of how often P1 should value bet compared to bluff.
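<p>We can sanity-check this indifference numerically. A quick sketch (the function name is illustrative) reproduces the EV formula above for any bluffing frequency $$x$$ paired with $$y = 3x$$:</p>

```python
# P2's EV of calling a bet while holding K, per the formula above:
# win $3 when P1 bluffs a Q (prob 1/2 * x), lose $1 to P1's A (prob 1/2 * y)
def ev_call_with_K(x, y):
    return 3 * 0.5 * x + (-1) * 0.5 * y

# With y = 3x, calling has EV 0, matching the EV of folding
for x in (0.05, 0.1, 0.2):
    assert abs(ev_call_with_K(x, 3 * x)) < 1e-12
```

<p>Any bluffing frequency paired with 3 times as much value betting keeps P2 indifferent.</p>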
For example, if P1 bluffs with the Q 10% of the time, he should value bet with the A 30% of the time.</p> <p><strong>Solving for $$a$$</strong></p> <p>$$a$$ is how often P2 should call with a K when facing a bet from P1.</p> <p>P2 should call with probability $$a$$ such that P1 is indifferent between bluffing (betting) and checking with card Q.</p> <p>If P1 checks with card Q, P1 will always fold afterwards if P2 bets (because it is the worst card and can never win), so P1’s EV is 0.</p> $\text{EV P1 check with Q} = 0$ <p>If P1 bets with card Q,</p> $\text{EV P1 bet with Q} = (-1) * \text{P(P2 has A and always calls)} + (-1) * \text{P(P2 has K and calls)} + 2 * \text{P(P2 has K and folds)}$ $= \frac{1}{2} * (-1) + \frac{1}{2} * (a) * (-1) + \frac{1}{2} * (1 - a) * (2) = -\frac{1}{2} - \frac{1}{2} * a + (1 - a) = \frac{1}{2} - \frac{3}{2} * a$ <p>Setting the EVs of betting and checking with Q equal, we have:</p> $0 = \frac{1}{2} - \frac{3}{2} * a$ $\frac{3}{2} * a = \frac{1}{2}$ $a = \frac{1}{3}$ <p>Therefore P2 should call $$\frac{1}{3}$$ of the time with a K when facing a bet from P1.</p> <p><strong>Solving for $$b$$</strong></p> <p>Now we solve for $$b$$, how often P2 should bet with a Q after P1 checks.
The indifference for P1 is only relevant when he has a K, since with a Q or an A he will always fold or call, respectively.</p> <p>If P1 checks a K and then folds, then:</p> $\text{EV P1 check with K and then fold to bet} = 0$ <p>If P1 checks and calls, we have:</p> $\text{EV P1 check with K and then call a bet} = (-1) * \text{P(P2 has A and always bets)} + (3) * \text{P(P2 has Q and bets)} = \frac{1}{2} * (-1) + \frac{1}{2} * b * (3)$ <p>Setting these EVs equal, we have: $$0 = \frac{1}{2} * (-1) + \frac{1}{2} * b * (3)$$</p> $\frac{1}{2} = \frac{1}{2} * b * (3)$ $3 * b = 1$ $b = \frac{1}{3}$ <p>Therefore P2 should bet $$\frac{1}{3}$$ of the time with a Q after P1 checks.</p> <p><strong>Solving for $$z$$</strong></p> <p>The final case is when P1 checks a K, P2 bets, and P1 must decide how frequently to call so that P2 is indifferent between checking and betting (bluffing) with a Q.</p> <p>(Note that $$|$$ denotes “given that” and we use the conditional probability formula $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$, where $$\cap$$ denotes the intersection of the sets, i.e., the event that $$A$$ and $$B$$ are both true at the same time, like the middle part of a Venn diagram.)</p> <p>We start with finding the probability that P1 has an A given that P1 has checked and P2 has a Q, meaning that P1 has an A or K.</p> $\text{P(P1 has A | P1 checks A or K)} = \frac{\text{P(P1 has A and checks)}}{\text{P(P1 checks A or K)}}$ <p>We can write the numerator as simply P(P1 has A and checks) because holding an A and checking a K cannot both happen, so there is no overlap to account for.</p> $= \frac{(1-y) * \frac{1}{2}}{(1-y) * \frac{1}{2} + \frac{1}{2}}$ $= \frac{1-y}{2-y}$ $\text{P(P1 has K | P1 checks A or K)} = 1 - \text{P(P1 has A | P1 checks A or K)}$ $= 1 - \frac{1-y}{2-y}$ $= \frac{2-y}{2-y} - \frac{1-y}{2-y}$ $= \frac{1}{2-y}$ <p>If P2 checks his Q, his EV $$= 0$$.</p> <p>If P2 bets (bluffs) with his Q, his EV is:</p> $-1 * 
P(P1 check A then call A) - 1 * P(P1 check K then call K) + 2 * P(P1 check K then fold K)$ $= -1 * \frac{1-y}{2-y} + -1 * z * \frac{1}{2-y} + 2 * (1-z) * \frac{1}{2-y}$ <p>Setting these equal:</p> $0 = -1 * \frac{1-y}{2-y} + -1 * z * \frac{1}{2-y} + 2 * (1-z) * \frac{1}{2-y}$ $0 = -\frac{1-y}{2-y} - z * \frac{3}{2-y} + \frac{2}{2-y}$ $z * \frac{3}{2-y} = \frac{2}{2-y} - \frac{1-y}{2-y}$ $z = \frac{2}{3} - \frac{1-y}{3}$ $z = \frac{y+1}{3}$ <p>So P1’s calling frequency with a K scales with how often he bets an A. For example, if P1 bets the A 50% of the time ($$y=0.5$$), then $$z = \frac{1.5}{3} = 0.5$$ as well.</p> <h3 id="solution-summary">Solution summary</h3> <p>We now have the following result:</p> <p>P1 initial actions:</p> <p>Bet Q: $$x = \frac{y}{3}$$</p> <p>Bet A: $$y = 3*x$$</p> <p>P2 after P1 bet:</p> <p>Call K: $$a = \frac{1}{3}$$</p> <p>P2 after P1 check:</p> <p>Bet Q: $$b = \frac{1}{3}$$</p> <p>P1 after P1 check and P2 bet:</p> <p>Call K: $$z = \frac{y+1}{3}$$</p> <p>P2’s actions are fixed, but P1’s depend on the $$y$$ parameter.</p> <h3 id="finding-the-game-value">Finding the game value</h3> <p>We can look at the expected value of every possible deal-out to evaluate the game value as a function of $$y$$.
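<p>Before walking through the deal-outs by hand, here is a brute-force sketch (illustrative names, exact fractions) that enumerates every deal and betting line under the strategies from the solution summary above and sums P1's EV:</p>

```python
from fractions import Fraction as F
from itertools import permutations

def game_value_p1(y):
    """P1's EV (antes sunk) given P1's value-bet frequency y with the A."""
    y = F(y)
    p1_bet  = {'Q': y / 3, 'K': F(0), 'A': y}            # x = y/3
    p1_call = {'Q': F(0), 'K': (y + 1) / 3, 'A': F(1)}   # z = (y+1)/3
    p2_call = {'Q': F(0), 'K': F(1, 3), 'A': F(1)}       # a = 1/3
    p2_bet  = {'Q': F(1, 3), 'K': F(0), 'A': F(1)}       # b = 1/3
    rank = {'Q': 0, 'K': 1, 'A': 2}
    total = F(0)
    for c1, c2 in permutations('AKQ', 2):                 # 6 equally likely deals
        win = rank[c1] > rank[c2]                         # does P1 win a showdown?
        b = p1_bet[c1]
        ev = b * (1 - p2_call[c2]) * 2                    # P1 bets, P2 folds
        ev += b * p2_call[c2] * (3 if win else -1)        # P1 bets, P2 calls
        ev += (1 - b) * (1 - p2_bet[c2]) * (2 if win else 0)             # check, check
        ev += (1 - b) * p2_bet[c2] * p1_call[c1] * (3 if win else -1)    # check, bet, call
        total += F(1, 6) * ev                             # (check, bet, fold has EV 0)
    return total

# The game value comes out the same for any y in [0, 1]
assert game_value_p1(0) == game_value_p1(1) == game_value_p1(F(1, 2))
```

<p>Each branch of the loop mirrors one line of the case analysis that follows.</p>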
We format these EV calculations as $$\text{P1 action} * \text{P2 action} * \text{P1 action if applicable} * \text{EV}$$, all from the perspective of P1.</p> <p><strong>Case 1: P1 A, P2 K</strong></p> <ol> <li>Bet fold:</li> </ol> $y * \frac{2}{3} * 2 = \frac{4*y}{3}$ <ol> <li>Bet call:</li> </ol> $y * \frac{1}{3} * 3 = y$ <ol> <li>Check check:</li> </ol> $(1 - y) * 1 * 2 = 2 * (1 - y)$ <p>Total = $$\frac{4*y}{3} + y + 2 * (1 - y) = \frac{y}{3} + 2$$</p> <p><strong>Case 2: P1 A, P2 Q</strong></p> <ol> <li>Bet fold:</li> </ol> $y * 1 * 2 = 2 * y$ <ol> <li>Check bet call:</li> </ol> $(1 - y) * \frac{1}{3} * 1 * 3 = 3 * \frac{1}{3} * (1 - y)$ <ol> <li>Check check:</li> </ol> $(1 - y) * \frac{2}{3} * 2 = 2 * \frac{2}{3} * (1 - y)$ <p>Total = $$2 * y + (1 - y) + \frac{4}{3} * (1-y) = \frac{1}{3} * (7 - y)$$</p> <p><strong>Case 3: P1 K, P2 A</strong></p> <ol> <li>Check bet call:</li> </ol> $(1) * (1) * \frac{y+1}{3} * (-1) = -\frac{y+1}{3}$ <ol> <li>Check bet fold:</li> </ol> $(1) * (1) * (1 - \frac{y+1}{3}) * (0) = 0$ <p>Total = $$-\frac{y+1}{3}$$</p> <p><strong>Case 4: P1 K, P2 Q</strong></p> <ol> <li>Check check:</li> </ol> $(1) * \frac{2}{3} * 2 = 2 * \frac{2}{3}$ <ol> <li>Check bet call:</li> </ol> $(1) * \frac{1}{3} * \frac{y+1}{3} * 3 = \frac{y+1}{3}$ <ol> <li>Check bet fold:</li> </ol> $(1) * \frac{1}{3} * (1 - \frac{y+1}{3}) * 0 = 0$ <p>Total = $$\frac{4}{3} + \frac{y+1}{3} = \frac{y+5}{3}$$</p> <p><strong>Case 5: P1 Q, P2 A</strong></p> <ol> <li>Bet call:</li> </ol> $\frac{y}{3} * 1 * (-1) = \frac{-y}{3}$ <ol> <li>Check bet fold:</li> </ol> $(1 - \frac{y}{3}) * 1 * 1 * (0) = 0$ <p>Total = $$\frac{-y}{3}$$</p> <p><strong>Case 6: P1 Q, P2 K</strong></p> <ol> <li>Bet call:</li> </ol> $\frac{y}{3} * \frac{1}{3} * (-1) = -\frac{y}{9}$ <ol> <li>Bet fold:</li> </ol> $\frac{y}{3} * \frac{2}{3} * 2 = \frac{4*y}{9}$ <ol> <li>Check check:</li> </ol> $(1-\frac{y}{3}) * 1 * (0) = 0$ <p>Total = $$-\frac{y}{9} + \frac{4*y}{9} = \frac{y}{3}$$</p> <p><strong>Summing up the cases</strong></p> <p>Since each case is equally 
likely based on the initial deal, we can multiply each by $$\frac{1}{6}$$ and then sum them to find the EV of the game. Summing up all cases, we have:</p> <p>Overall total = $$\frac{1}{6} * [\frac{y}{3} + 2 + \frac{1}{3} * (7 - y) + -\frac{y+1}{3} + \frac{y+5}{3} + \frac{-y}{3} + \frac{y}{3}] = \frac{17}{18}$$</p> <h3 id="main-takeaways">Main takeaways</h3> <p>What does this number $$\frac{17}{18}$$ mean? It says that the expectation of the game from the perspective of Player 1 is $$\frac{17}{18}$$. Since this is $$&lt;1$$ (less than the \$1 ante P1 posted), Player 1’s net expectation is $$\frac{17}{18} - 1 = -\frac{1}{18} \approx -0.0556$$ per hand. Therefore the value of the game for Player 2 is $$+\frac{1}{18} \approx +0.0556$$. Every time these players play a hand against each other (assuming they play the equilibrium strategies), that will be the outcome on average, meaning P1 will lose about \$5.56 per 100 hands on average and P2 will gain that amount.</p> <p>This indicates the advantage of acting last in poker: seeing what the opponent has done first gives an information advantage. In this game, the players would rotate who acts first for each hand, but the principle of playing more hands with the positional advantage is very important in real poker games.</p> <p>The expected value does not depend at all on the $$y$$ variable, which defines how often Player 1 bets his A hands. If we assumed that the pot was not a fixed size of \$2 to start the hand, then it would be optimal for P1 to either always bet or always check the A (the math above would change and the result would depend on $$y$$), but we’ll stick with the simple case of the pot always starting at \$2 from the antes.</p> <p>From a poker strategy perspective, the main takeaway is that we can essentially split our hands into:</p> <ol> <li>Strong hands</li> <li>Mid-strength hands</li> <li>Weak hands</li> </ol> <p>Mid-strength hands can win, but don’t want to build the pot. Strong hands generally try to make the pot large with value bets. 
Weak hands want to either give up or be used as bluffs.</p> <p>Note that this mathematically optimal solution automatically uses bluffs. Bluffs are not -EV bets used as “bad plays” to get more credit for value bets later; they are part of an overall optimal strategy.</p> <p>We also see that a major component of poker strategy is “balancing” bluffs. Here P1 value bets 3 times more often than she bluffs. In a real poker setting, you might have a similar strategy, but with many possible bluff hands in your range to choose from, which means they can be strategically selected to match the ratio, for example by bluffing with hands that make it less likely that your opponent is strong, while giving up with other weak hands.</p> <h2 id="kuhn-poker-in-normal-form">Kuhn Poker in Normal Form</h2> <p>Analytically solving all but the smallest games is not very feasible; a faster way to compute the strategy for this game is to put it into normal form.</p> <h3 id="information-sets">Information sets</h3> <p>There are 6 possible deals in Kuhn Poker: AK, AQ, KQ, KA, QK, QA.</p> <p>Each player has 2 decision points in the game. Player 1 has the initial action and the action after the sequence of P1 checks –&gt; P2 bets. Player 2 has the second action after Player 1 bets or Player 1 checks.</p> <p>Therefore each player has 12 possible acting states. For Player 1 these are:</p> <ol> <li>AK acting first</li> <li>AQ acting first</li> <li>KQ acting first</li> <li>KA acting first</li> <li>QK acting first</li> <li>QA acting first</li> <li>AK check, P2 bets, P1 action</li> <li>AQ check, P2 bets, P1 action</li> <li>KQ check, P2 bets, P1 action</li> <li>KA check, P2 bets, P1 action</li> <li>QK check, P2 bets, P1 action</li> <li>QA check, P2 bets, P1 action</li> </ol> <p>However, the full state of the game is not actually known to the players! Each player has pairs of decision points that are equivalent from his point of view, even though the true game state is different. 
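<p>We can verify this collapse in code (a small sketch with illustrative names): 12 acting states, but only 6 distinct situations from P1's point of view.</p>

```python
from itertools import permutations

# P1's 12 acting states: 6 deals x 2 decision points...
deals = list(permutations('AKQ', 2))      # (P1 card, P2 card)
spots = ('acting first', 'after check, P2 bets')
states = [(c1, c2, spot) for c1, c2 in deals for spot in spots]
assert len(states) == 12

# ...but P1 only observes her own card and the betting, so states that
# differ only in P2's hidden card collapse into the same decision point.
visible = {(c1, spot) for c1, c2, spot in states}
assert len(visible) == 6
```
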
For Player 1 these are:</p> <ol> <li>A acting first (combines AK and AQ)</li> <li>K acting first (combines KQ and KA)</li> <li>Q acting first (combines QK and QA)</li> <li>A check, P2 bets, P1 action (combines AK and AQ)</li> <li>K check, P2 bets, P1 action (combines KQ and KA)</li> <li>Q check, P2 bets, P1 action (combines QK and QA)</li> </ol> <p>From Player 1’s perspective, she only knows her own private card and can only make decisions based on knowledge of this card.</p> <p>For example, whether Player 1 is dealt a K with Player 2 holding a Q, or a K with Player 2 holding an A, Player 1 faces the same decision: she holds a K and does not know what the opponent has.</p> <p>Likewise if Player 2 is dealt a K and is facing a bet, he must take the same action regardless of what the opponent has, because from his perspective he only knows his own card.</p> <p>We define an information set as the set of information used to make decisions at a particular point in the game. In Kuhn Poker, it is equivalent to the card of the acting player and the history of actions up to that point.</p> <p>When writing game history sequences, we use “k” for check, “b” for bet, “f” for fold, and “c” for call. So for Player 1 acting first with a K, the information set is “K”. For Player 2 acting second with an A and facing a bet, the information set is “Ab”. For Player 2 acting second with an A and facing a check, the information set is “Ak”. For Player 1 with a K checking and facing a bet from Player 2, the information set is “Kkb”.</p> <p>The shorthand version is to combine “k” and “f” into “p” for pass and to combine “b” and “c” into “b” for bet. 
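<p>To make these labels concrete, here is a small sketch that derives the acting player’s information-set key (own card plus public action history) from a full game state, using the “k”/“b” action letters defined above:</p>

```python
# Sketch: infoset key = acting player's private card + public history.
DEALS = ["AK", "AQ", "KQ", "KA", "QK", "QA"]  # (P1 card, P2 card)

def infoset(deal, history, player):
    """Return the acting player's infoset key, e.g. 'Kkb'."""
    card = deal[0] if player == 1 else deal[1]
    return card + history

# Two different true states map to the same Player 1 infoset:
assert infoset("KQ", "kb", player=1) == infoset("KA", "kb", player=1) == "Kkb"
# Player 2 with an A facing a bet, regardless of Player 1's card:
assert infoset("QA", "b", player=2) == infoset("KA", "b", player=2) == "Ab"
```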
Pass indicates putting no money into the pot and bet indicates putting $1 into the pot.</p> <h3 id="writing-kuhn-poker-in-normal-form">Writing Kuhn Poker in Normal Form</h3> <p>Now that we have defined information sets, we see that each player in fact has 2 information sets per card that he can be dealt, which is a total of 6 information sets per player since each can be dealt a card in {Q, K, A}. (If the game were played with a larger deck size, then we would have $$\text{N} * 2$$ information sets, where N is the deck size.)</p> <p>Each information set has 2 possible actions, which are essentially “do not put money in the pot” (check when acting first/facing a check or fold when facing a bet – we call this pass) and “put in $1” (bet when acting first or call when facing a bet – we call this bet).</p> <p>The result is that each player has $$2^6 = 64$$ total combinations of strategies. Think of this as each player having a switch between pass/bet for each of the 6 information sets that can be on or off and deciding all of these in advance.</p> <p>Here are a few examples of the 64 strategies for Player 1 (randomly selected):</p> <ol> <li>A - bet, Apb - bet, K - bet, Kpb - bet, Q - bet, Qpb - bet</li> <li>A - bet, Apb - bet, K - bet, Kpb - bet, Q - bet, Qpb - pass</li> <li>A - bet, Apb - bet, K - pass, Kpb - bet, Q - bet, Qpb - bet</li> <li>A - bet, Apb - pass, K - bet, Kpb - pass, Q - bet, Qpb - bet</li> <li>A - bet, Apb - pass, K - bet, Kpb - bet, Q - bet, Qpb - bet</li> <li>A - pass, Apb - bet, K - bet, Kpb - bet, Q - pass, Qpb - bet</li> </ol> <p>We can create a $$64 \text{x} 64$$ payoff matrix with every possible strategy for each player on each axis and the payoffs inside.</p> <table> <thead> <tr> <th>P1/P2</th> <th>P2 Strat 1</th> <th>P2 Strat 2</th> <th>…</th> <th>P2 Strat 64</th> </tr> </thead> <tbody> <tr> <td>P1 Strat 1</td> <td>EV(1,1)</td> <td>EV(1,2)</td> <td>…</td> <td>EV(1,64)</td> </tr> <tr> <td>P1 Strat 2</td> <td>EV(2,1)</td> <td>EV(2,2)</td> 
<td>…</td> <td>EV(2,64)</td> </tr> <tr> <td>…</td> <td>…</td> <td>…</td> <td>…</td> <td>…</td> </tr> <tr> <td>P1 Strat 64</td> <td>EV(64,1)</td> <td>EV(64,2)</td> <td>…</td> <td>EV(64,64)</td> </tr> </tbody> </table> <p>This matrix has 4096 entries and would be difficult to use for something like iterated elimination of dominated strategies. We turn to linear programming to find a solution.</p> <h3 id="solving-with-linear-programming">Solving with Linear Programming</h3> <p>The general way to solve a game matrix of this size is with linear programming, which is essentially a way to optimize a linear objective, which we’ll define below. This kind of setup could be used in a problem like minimizing the cost of food while still meeting constraints like a minimum number of calories and a maximum amount of carbohydrates and sugar.</p> <p>We can define Player 1’s strategy as $$x$$, which is a vector of size 64 corresponding to the probability of playing each strategy. We do the same for Player 2 as $$y$$.</p> <p>We define the payoff matrix as $$A$$ with the payoffs written with respect to Player 1.</p> $A = \quad \begin{bmatrix} EV(1,1) &amp; EV(1,2) &amp; ... &amp; EV(1,64) \\ EV(2,1) &amp; EV(2,2) &amp; ... &amp; EV(2,64) \\ ... &amp; ... &amp; ... &amp; ... \\ EV(64,1) &amp; EV(64,2) &amp; ... &amp; EV(64,64) \\ \end{bmatrix}$ <p>We can use payoff matrix $$B$$ for payoffs written with respect to Player 2 – in zero-sum games like poker, $$A = -B$$, so it’s easiest to just use $$A$$.</p> <p>We can also define a constraint matrix for each player:</p> <p>Let P1’s constraint matrix = $$E$$ such that $$Ex = e$$</p> <p>Let P2’s constraint matrix = $$F$$ such that $$Fy = f$$</p> <p>The only constraint we have at this time is that the sum of the strategies is 1 since they are a probability distribution, so $$E$$ and $$F$$ will just be vectors of 1’s and $$e$$ and $$f$$ will equal $$1$$. 
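<p>The counts involved are easy to sanity-check (a sketch; the infoset names match the strategy examples above):</p>

```python
from itertools import product

INFOSETS = ["A", "Apb", "K", "Kpb", "Q", "Qpb"]  # Player 1's 6 infosets
ACTIONS = ("p", "b")  # pass / bet

# A pure strategy fixes one action per infoset: 2**6 = 64 of them.
pure_strategies = list(product(ACTIONS, repeat=len(INFOSETS)))
assert len(pure_strategies) == 64

# A mixed strategy x gives each pure strategy a probability; the Ex = e
# constraint in its simplest form just says these sum to 1 (uniform here).
x = [1 / 64] * 64
assert sum(x) == 1
```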
In effect, this just says that each player has 64 strategies and plays each of them some percentage of the time, and these percentages must add up to 1 since they form a probability distribution.</p> <p>In the case of Kuhn Poker, for <strong>step 1</strong> we look at a best response for Player 2 (strategy $$y$$) to a fixed Player 1 (strategy $$x$$). A best response is the best possible strategy for Player 2 given Player 1’s fixed strategy.</p> <p>$$\max_{y} (x^TB)y$$ $$\text{Such that: } Fy = f, y \geq 0$$</p> <p>We are looking for the strategy parameters $$y$$ that maximize the payoffs for Player 2. $$x^TB$$ is the transpose of $$x$$ multiplied by $$B$$ – the strategy of Player 1 multiplied by the payoffs to Player 2. Player 2 can then choose $$y$$ to maximize his payoffs.</p> $= \max_{y} (x^T(-A))y$ <p>We substitute $$-A$$ for $$B$$ so we only have to work with the $$A$$ matrix.</p> $= \min_{y} (x^T(A))y$ <p>Maximizing $$x^T(-A)y$$ is equivalent to minimizing $$x^TAy$$, so we can work with $$A$$ and change the optimization to a minimization.</p> $\text{Such that: } Fy = f, y \geq 0$ <p>In words, this is the expected value of the game from Player 2’s perspective, because the $$x$$ and $$y$$ vectors represent the probability of ending in each state of the payoff matrix and $$B = -A$$ represents the payoff matrix itself. 
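<p>Because this objective is linear in $$y$$ once $$x$$ is fixed, a best response can always be found among Player 2’s pure strategies. A small sketch with a toy 2x2 zero-sum matrix (hypothetical numbers, not Kuhn Poker’s actual $$A$$):</p>

```python
# Payoffs A are written for Player 1, so Player 2 best-responds to a
# fixed mixed strategy x by minimizing x^T A y; a linear objective is
# minimized at a vertex, i.e. at one of Player 2's pure strategies.
A = [[1, -1],
     [-2, 3]]   # toy zero-sum payoff matrix (not Kuhn Poker's A)
x = [0.5, 0.5]  # fixed Player 1 mixed strategy

# x^T A: Player 2's expected loss for each of his pure strategies (columns)
col_evs = [sum(x[i] * A[i][j] for i in range(2)) for j in range(2)]
best_response = min(range(2), key=lambda j: col_evs[j])
print(col_evs, best_response)  # [-0.5, 1.0] 0
```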
So Player 2 is trying to find a strategy $$y$$ that maximizes the payoff of the game from his perspective against a fixed $$x$$ player 1 strategy.</p> <p>For <strong>step 2</strong>, we look at a best response for Player 1 (strategy x) to a fixed Player 2 (strategy y) and have:</p> <p>$$\max_{x} x^T(Ay)$$ $$\text{Such that: } x^TE^T = e^T, x \geq 0$$</p> <p>Note that now Player 1 is trying to maximize this equation and Player 2 is trying to minimize this same thing.</p> <p>For <strong>step 3</strong>, we can combine the above 2 parts and now allow for $$x$$ and $$y$$ to no longer be fixed, which leads to the below minimax equation. In 2-player zero-sum games, the minimax solution is the same as the Nash equilibrium solution. We call this minimax because each player minimizes the maximum payoff possible for the other – since the game is zero-sum, they also minimize their own maximum loss (maximizing their minimum payoff). This is also why the Nash equilibrium strategy in poker can be thought of as a “defensive” strategy, since by minimizing the maximum loss, we aren’t trying to maximally exploit.</p> $\min_{y} \max_{x} [x^TAy]$ $\text{Such that: } x^TE^T = e^T, x \geq 0, Fy = f, y \geq 0$ <p>We can solve this with linear programming, but this would involve a huge payoff matrix $$A$$ and length 64 strategy vectors for each player. There is a much more efficient way!</p> <h2 id="solving-by-simplifying-the-matrix">Solving by Simplifying the Matrix</h2> <p>Kuhn Poker is the most basic poker game possible and requires solving a $$64 \text{x} 64$$ matrix. While this is feasible, any reasonably sized poker game would blow up the matrix size.</p> <p>We can improve on this form by considering the structure of the game tree. 
Rather than just saying that the constraints on the $$x$$ and $$y$$ matrices are that they must sum to 1, we can redefine these conditions according to the structure of the game tree.</p> <h3 id="simplified-matrices-for-player-1-with-behavioral-strategies">Simplified Matrices for Player 1 with Behavioral Strategies</h3> <p>Previously we defined $$E = F = \text{Vectors of } 1$$, the most basic constraint that all probabilities have to sum to 1.</p> <p>However, we know that some strategic decisions can only be made after certain other decisions have already been made. For example, Player 2’s actions after a Player 1 bet can only be made after Player 1 has first bet!</p> <p>Now we can redefine the $$E$$ constraint as follows for Player 1:</p> <table> <thead> <tr> <th>Infoset/Strategies</th> <th>0</th> <th>A_b</th> <th>A_p</th> <th>A_pb</th> <th>A_pp</th> <th>K_b</th> <th>K_p</th> <th>K_pb</th> <th>K_pp</th> <th>Q_b</th> <th>Q_p</th> <th>Q_pb</th> <th>Q_pp</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>A</td> <td>-1</td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Apb</td> <td> </td> <td> </td> <td>-1</td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>K</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Kpb</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> 
<td>Qpb</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td>1</td> <td>1</td> </tr> </tbody> </table> <p>We see that $$E$$ is a $$7 \text{x} 13$$ matrix, representing the root of the game and the 6 information sets vertically and the root of the game and the 12 possible strategies horizontally. The difference now is that we are using <strong>behavioral strategies</strong> instead of <strong>mixed strategies</strong>. Mixed strategies meant specifying a probability of how often to play each of 64 possible pure strategies. Behavioral strategies assign probability distributions over strategies at each information set. Kuhn’s theorem (the same Kuhn) states that in a game with perfect recall – where players remember all of their previous moves and the information available to them – for every mixed strategy there is a behavioral strategy that has an equivalent payoff (i.e. the strategies are equivalent).</p> <p>Within the matrix, the [0,0] entry is a dummy and filled with a 1. Each row has a single $$-1$$, which indicates the strategy (or root) that must precede the infoset. 
The $$1$$ entries represent strategies that exist from a certain infoset.</p> $E = \quad \begin{bmatrix} 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 1 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 \\ \end{bmatrix}$ <p>$$x$$ is a $$13 \text{x} 1$$ matrix of probabilities to play each strategy.</p> $x = \quad \begin{bmatrix} 1 \\ A_b \\ A_p \\ A_{pb} \\ A_{pp} \\ K_b \\ K_p \\ K_{pb} \\ K_{pp} \\ Q_b \\ Q_p \\ Q_{pb} \\ Q_{pp} \\ \end{bmatrix}$ <p>We have finally that $$e$$ is a $$7 \text{x} 1$$ fixed matrix.</p> $e = \quad \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix}$ <p>So we have overall:</p> $\quad \begin{bmatrix} 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 1 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 1 &amp; 1 \\ \end{bmatrix} \quad 
\begin{bmatrix} 1 \\ A_b \\ A_p \\ A_{pb} \\ A_{pp} \\ K_b \\ K_p \\ K_{pb} \\ K_{pp} \\ Q_b \\ Q_p \\ Q_{pb} \\ Q_{pp} \\ \end{bmatrix} = \quad \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix}$ <h3 id="what-do-the-matrices-mean">What do the matrices mean?</h3> <p>To understand how the matrix multiplication works and why it makes sense, let’s look at each of the 7 multiplications (i.e., each row of $$E$$ multiplied by the column vector $$x$$, which must equal the corresponding row of the $$e$$ column vector).</p> <p><strong>Row 1</strong></p> <p>We have $$1 * 1 = 1$$. This is a “dummy” row.</p> <p><strong>Row 2</strong></p> <p>$$-1 + A_b + A_p = 0$$ $$A_b + A_p = 1$$</p> <p>This is the simple constraint that the probabilities of the initial actions in the game when dealt an A must sum to 1.</p> <p><strong>Row 3</strong></p> <p>$$-A_p + A_{pb} + A_{pp} = 0$$ $$A_{pb} + A_{pp} = A_p$$</p> <p>The probabilities of Player 1 taking a bet or pass option with an A after initially passing must sum up to the probability of that initial pass $$A_p$$.</p> <p>The following are just repeats of Rows 2 and 3 with the other cards.</p> <p><strong>Row 4</strong></p> <p>$$-1 + K_b + K_p = 0$$ $$K_b + K_p = 1$$</p> <p><strong>Row 5</strong></p> <p>$$-K_p + K_{pb} + K_{pp} = 0$$ $$K_{pb} + K_{pp} = K_p$$</p> <p><strong>Row 6</strong></p> <p>$$-1 + Q_b + Q_p = 0$$ $$Q_b + Q_p = 1$$</p> <p><strong>Row 7</strong></p> <p>$$-Q_p + Q_{pb} + Q_{pp} = 0$$ $$Q_{pb} + Q_{pp} = Q_p$$</p> <h3 id="simplified-matrices-for-player-2">Simplified Matrices for Player 2</h3> <p>And $$F$$ works similarly for Player 2:</p> <table> <thead> <tr> <th>Infoset/Strategies</th> <th>0</th> <th>A_b(ab)</th> <th>A_p(ab)</th> <th>A_b(ap)</th> <th>A_p(ap)</th> <th>K_b(ab)</th> <th>K_p(ab)</th> <th>K_b(ap)</th> <th>K_p(ap)</th> <th>Q_b(ab)</th> <th>Q_p(ab)</th> <th>Q_b(ap)</th> <th>Q_p(ap)</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> 
</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Ab</td> <td>-1</td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Ap</td> <td>-1</td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Kb</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Kp</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Qb</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> <td>Qp</td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td>1</td> </tr> </tbody> </table> <p>From the equivalent analysis as we did above, now with $$Fy = f$$, we will see that the probabilities corresponding to each pair of 1’s in the $$F$$ matrix sum to $$1$$, since they are the 2 options at an information set node.</p> <h3 id="simplified-payoff-matrix">Simplified Payoff Matrix</h3> <p>Now instead of the $$64 \text{x} 64$$ matrix we made before, we can represent the payoff matrix as only $$6 \text{x} 2 \text{ x } 6\text{x}2 = 12 \text{x} 12$$. (It’s actually $$13 \text{x} 13$$ because we use a dummy row and column.) 
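<p>Before filling in the payoffs, the constraint system can be checked mechanically. A sketch that enters Player 1’s $$E$$ exactly as tabulated above and verifies $$Ex = e$$ for a sample behavioral strategy (50/50 at every infoset, written in sequence form):</p>

```python
# Player 1's constraint matrix E (7 x 13), copied from the table above.
E = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # dummy root
    [-1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # A:   A_b + A_p = 1
    [0, 0, -1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # Apb: A_pb + A_pp = A_p
    [-1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],  # K
    [0, 0, 0, 0, 0, 0, -1, 1, 1, 0, 0, 0, 0],  # Kpb
    [-1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],  # Q
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 1, 1],  # Qpb
]
e = [1, 0, 0, 0, 0, 0, 0]

# 50/50 behavioral strategy in sequence form: each later sequence's
# probability is its parent sequence's probability times 0.5.
x = [1, 0.5, 0.5, 0.25, 0.25, 0.5, 0.5, 0.25, 0.25, 0.5, 0.5, 0.25, 0.25]

Ex = [sum(E[i][j] * x[j] for j in range(13)) for i in range(7)]
assert Ex == e
```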
These payoffs are the actual results of the game when these strategies are played from the perspective of Player 1, where the results are in {-2, -1, 1, 2}.</p> <table> <thead> <tr> <th>P1/P2</th> <th>0</th> <th>A_b(ab)</th> <th>A_p(ab)</th> <th>A_b(ap)</th> <th>A_p(ap)</th> <th>K_b(ab)</th> <th>K_p(ab)</th> <th>K_b(ap)</th> <th>K_p(ap)</th> <th>Q_b(ab)</th> <th>Q_p(ab)</th> <th>Q_b(ap)</th> <th>Q_p(ap)</th> </tr> </thead> <tbody> <tr> <td>0</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>A_b</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> <td>A_p</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td> </td> <td> </td> <td> </td> <td>1</td> </tr> <tr> <td>A_pb</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>0</td> </tr> <tr> <td>A_pp</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> </tr> <tr> <td>K_b</td> <td> </td> <td>-2</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> <td>K_p</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> </tr> <tr> <td>K_pb</td> <td> </td> <td> </td> <td> </td> <td>-2</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td> </td> </tr> <tr> <td>K_pp</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> </tr> <tr> <td>Q_b</td> <td> </td> <td>-2</td> <td>1</td> 
<td> </td> <td> </td> <td>-2</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_p</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_pb</td> <td> </td> <td> </td> <td> </td> <td>-2</td> <td> </td> <td> </td> <td> </td> <td>-2</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_pp</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> $A = \quad \begin{bmatrix} 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 1 &amp; 0 &amp; 0 &amp; 2 &amp; 1 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 1 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 \\ 0 &amp; -2 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 1 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 1 \\ 0 &amp; 0 &amp; 0 &amp; -2 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 2 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 \\ 0 &amp; -2 &amp; 1 &amp; 0 &amp; 0 &amp; -2 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; -2 &amp; 0 &amp; 0 &amp; 0 &amp; -2 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; -1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\ 
\end{bmatrix}$ <p>We could even further reduce this by eliminating dominated strategies:</p> <table> <thead> <tr> <th>P1/P2</th> <th>0</th> <th>A_b(ab)</th> <th>A_b(ap)</th> <th>A_p(ap)</th> <th>K_b(ab)</th> <th>K_p(ab)</th> <th>K_b(ap)</th> <th>K_p(ap)</th> <th>Q_b(ab)</th> <th>Q_p(ab)</th> <th>Q_b(ap)</th> <th>Q_p(ap)</th> </tr> </thead> <tbody> <tr> <td>0</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>A_b</td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> <td>2</td> <td>1</td> <td> </td> <td> </td> </tr> <tr> <td>A_p</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> <td> </td> <td> </td> <td> </td> <td>1</td> </tr> <tr> <td>A_pb</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td> </td> <td> </td> <td> </td> <td>2</td> <td>0</td> </tr> <tr> <td>K_p</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>1</td> </tr> <tr> <td>K_pb</td> <td> </td> <td> </td> <td>-2</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>2</td> <td> </td> </tr> <tr> <td>K_pp</td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> </tr> <tr> <td>Q_b</td> <td> </td> <td>-2</td> <td> </td> <td> </td> <td>-2</td> <td>1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_p</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Q_pp</td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td>-1</td> <td> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <p>For simplicity, let’s stick with the original $$A$$ 
payoff matrix and see how we can solve for the strategies and value of the game.</p> <h3 id="simplified-linear-program">Simplified Linear Program</h3> <p>Our linear program is now updated as follows. It is the same general form as before, but now our $$E$$ and $$F$$ matrices have constraints based on the game tree, and the payoff matrix $$A$$ is smaller, evaluating when player strategies coincide and result in payoffs, rather than looking at every possible set of strategic options as we did before:</p> $\min_{y} \max_{x} [x^TAy]$ $\text{Such that: } x^TE^T = e^T, x \geq 0, Fy = f, y \geq 0$ <p>MATLAB code is available to solve this linear program: <a href="https://www.dropbox.com/s/na9rta8sqj0zlmb/matlabkuhn.m?dl=0">MATLAB LP code</a></p> <h2 id="iterative-algorithms">Iterative Algorithms</h2> <p>We have now shown a way to solve games more efficiently based on the structure/ordering of the decision nodes, which can be expressed in tree form. Using behavioral strategies significantly reduces the size of the game, and the tree structure leads to algorithms that can use self-play to iterate through the game tree.</p> <p>Specifically, CFR (Counterfactual Regret Minimization) has become the foundation of imperfect-information game solving algorithms. We will go into detail on this in section 4.1.</p>AIPT Section 3.1: Solving Poker – What is Solving?2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/what-is-solving<h1 id="solving-poker---what-is-solving">Solving Poker - What is Solving?</h1> <h2 id="definitions">Definitions</h2> <p>When we talk about a solution to a poker game, we mean playing the game theory optimal strategy. In small toy games, these strategies are relatively easy to find. 
In 1v1 Limit Texas Hold’em a very close approximation of this strategy <a href="https://science.sciencemag.org/content/347/6218/145">was computed in 2015 at the University of Alberta</a>. In commonly played games like 1v1 No Limit Texas Hold’em and multiplayer No Limit Texas Hold’em, no complete game theory optimal strategies exist…yet.</p> <p>Approximations exist even for these larger games. As AI continues to improve, computing power increases, and humans study more, both AI and humans will get gradually closer to optimal, but for now only approximations are available.</p> <h3 id="measuring-closeness-to-gto">Measuring Closeness to GTO</h3> <p>There are two main ways to measure how good a strategy is: 1) We can look at a given strategy against the “best response” strategy, which is the strategy that maximally exploits the given strategy (i.e. how well someone could do against you if they knew your exact strategy), and 2) We can look at a given strategy against an actual game theory optimal strategy.</p> <h2 id="why-the-gto-strategy">Why the GTO Strategy?</h2> <h2 id="solving-methods">Solving Methods</h2> <h2 id="indifference">Indifference</h2> <h2 id="solving-programs">Solving Programs</h2> <p>Solver programs like <a href="https://www.piosolver.com/">PioSOLVER</a> or <a href="https://monkerware.com/solver.html">Monker Solver</a> let users set up a betting tree with user-defined betting abstractions and then solve for optimal solutions within this abstracted game.</p> <p>Solvers can be used to learn important lessons and strategies, but not to memorize full solutions (no one could remember them anyway). Players can make simplifications, such as always taking an action instead of taking it 90% of the time, or learning trends for certain board types and situations. It makes more sense to compile approximate strategies for different situations than to expect to have some grand GTO strategy.</p> <p>The betting abstraction could be something like allowing only the minimum bet, 0.25 pot bet, 0.75 pot bet, and 1.25 pot bet. 
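<p>As a concrete illustration, here is a sketch of how such a pot-fraction abstraction maps to actual bet sizes (the fractions are the ones named above; treating the minimum bet as 1 big blind is an assumption for this example):</p>

```python
def abstract_bet_sizes(pot, min_bet=1.0):
    """Bet sizes allowed under a pot-fraction betting abstraction.

    min_bet = 1.0 (one big blind) is an assumption for this sketch."""
    fractions = [0.25, 0.75, 1.25]  # the pot fractions named in the text
    sizes = {f * pot for f in fractions}
    sizes.add(min_bet)  # always allow the minimum bet
    return sorted(s for s in sizes if s >= min_bet)

print(abstract_bet_sizes(pot=10))  # [1.0, 2.5, 7.5, 12.5]
```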
The self-play algorithm then simulates the game from the tree and finds an equilibrium strategy, i.e. strategies such that neither player can exploit the other. Solvers are very dependent on good user input because abstractions that make the game too large will take too long to solve and abstractions that make the game too small or that are not representative of true optimal strategy could result in poor results.</p> <h3 id="monker-solver-example">Monker Solver Example</h3> <h3 id="gto-strategies-for-sale">“GTO” Strategies for Sale</h3> <h3 id="ev-of-going-allin">EV of going allin</h3> <p>https://poker.stackexchange.com/q/78/88 MDF etc hand combinations</p>Solving Poker - What is Solving?AIPT Section 4.3: CFR – Agent Evaluation2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/agent-evaluationAIPT Section 5.1: Top Poker Agents – AI vs. Human Competitions2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/ai-vs-human-competitionsAIPT Section 4.4: CFR – CFR Advances2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/cfr-advancesAIPT Section 6.1: Other Topics – Interpreting Agent Decisions2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/decision-making-lessonsAIPT Section 4.2: CFR – Game Abstractions2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/game-abstractionsAIPT Section 1.2: Background – History of Solving Poker2021-02-03T00:00:00+00:002021-02-03T00:00:00+00:00https://aipokertutorial.com/history-of-solving-poker