Monte-Carlo Regret Minimization for Adversarial Team Games

Abstract

We study equilibrium approximation in extensive-form adversarial team games, in which two teams of rational players compete in a zero-sum interaction. The suitable solution concept in these settings is the Team-Maxmin Equilibrium with Correlation (TMECor), which naturally arises when the team players play ex-ante correlated strategies. While computing such an equilibrium is APX-hard, recent techniques show that scalability beyond toy instances is possible. However, even compact representations of the team’s strategy space, such as that exploiting Directed Acyclic Graphs (DAGs), have exponential size prohibiting solving large instances. In the present paper, we show that Monte Carlo sampling for regret minimization in adversarial team games can provide an important advancement. In particular, we design a DAG Monte Carlo Counterfactual Regret Minimization algorithm that performs outcome sampling with $O (d)$ time complexity per iteration, where d is the depth of the DAG, and with a convergence rate bound of $O (b^{kd})$ , where b is the branching factor and k is the maximum number of private states in each public state of the team. We empirically evaluate our algorithms with a standard testbed of games, showing their performance when approximating equilibria. We investigate both cases in which only one team is composed of multiple players and cases in which both teams are composed of multiple players. Our empirical results show that our algorithms can provide a rough approximation to equilibrium strategies in all game instances considered, trading off a lower precision in the computed equilibrium for lower resource utilization.

Keywords

Algorithmic game theory · Adversarial team games

1 Introduction

In the last decade, game-theoretic approaches to deal with strategic environments with imperfect information allowed the scientific community to achieve outstanding milestones. Some of the most brilliant examples are games like Texas Hold’em Poker [1 , 18] and Starcraft II [21] for which the researchers were able to develop algorithms achieving superhuman and high-human performances, respectively. Most of the efforts have been spent to tackle games with two competing players (i.e., two-player zero-sum games), while no satisfactory results have been proposed for games with multiple players. One of the reasons for this asymmetry lies in the different complexity of the equilibrium computation problem of the two settings. Indeed, while it is possible to develop scalable and efficient algorithms to approximate optimal strategies (i.e., equilibria) in the case of two-player zero-sum games, under standard complexity conjectures, the computation of equilibria in multi-player scenarios is a much more challenging task that can become intractable in the worst case.

In this work, we study the problem of approximating equilibria for adversarial team games (ATGs), a generalization of two-player zero-sum games in which each player is replaced by a team of agents. We focus on the case in which the team members can coordinate their strategies ex ante, i.e. they can freely communicate before the start of the game, in order to coordinate the choice of actions to be played. This model of coordination, together with the corresponding solution concept – the Team-Maxmin Equilibrium with a Correlation device (TMECor) – was introduced by Celli and Gatti [4]. In the same work, the authors show that approximating a TMECor is APX-hard in the worst case, showing that ATGs are intractable, similar to general multi-player games even if they are a simple extension of two-player zero-sum games. Nevertheless, the investigation of such problems is critical for extending the adoption of game-theoretic algorithms to more complex games like contract Bridge, as well as to a wide range of real-world non-recreational applications, e.g., military defense scenarios.

Related Works. The seminal approaches to computing a TMECor are based on mathematical programming techniques and, more precisely, column generation algorithms. In particular, Celli and Gatti [4] propose an algorithm called Hybrid Column Generation (HCG), which works iteratively. At every iteration, it first computes the equilibrium of a game in a reduced space of the correlated strategies. Then, it verifies whether that strategy profile is an equilibrium for the whole game, and if it is not, HCG enlarges the strategy space by adding joint plans until the equilibrium is obtained. Its bottleneck is the algorithm generating the new plan to enrich the strategy space, which is based on an Integer Linear Program (ILP). Farina et al. [7, 11] improve upon the column generation procedure of HCG and rely on more compact definitions of the correlated strategy space that allow the formulation of more efficient ILP oracles. Despite the successes in enhancing the efficiency of the oracle, such approaches fail to scale up to large games. This has been attributed to the difficulty that column generation approaches have in approximating more complex strategies from a finite set. More recently, Zhang and Sandholm [22] propose a junction tree decomposition of the correlated strategy polytope. Such a technique induces a concise description of the strategy space that can be employed in a linear program to directly find a TMECor. Furthermore, Carminati et al. [3] present a technique for converting an ATG into an equivalent two-player formulation called Team Public Information (TPI) game. The main idea behind the TPI conversion is to explicitly represent one coordinator per team who observes only the information commonly known to the team members. The idea of exploiting common information was similarly pursued by Zhang et al. [24], where the authors show that the underlying decision process faced by the team can be encoded without loss of generality with a Directed Acyclic Graph (DAG) called Team-Belief DAG (TB-DAG), in which the team’s decision points are differentiated based on what information the team has. Both the TPI game and the TB-DAG representation open a wide spectrum of opportunities for solving ATGs that range from the adoption of popular no-regret learning dynamics to the possibility of combining successful techniques previously developed for two-player games, e.g., abstractions and subgame solving [1, 18]. The DAG yields a strategy encoding that is significantly more compact than that provided by TPI and other previous representations. However, its dimension scales exponentially with the information complexity of the game (i.e., the maximum number of private states in a public state of the game), thus making the traversal, or in some cases even the memory representation, of the entire DAG unaffordable. This issue is particularly crucial as it prevents the adoption of tabular regret minimization algorithms in large instances of team games. More recently, several hybrid approaches have been proposed to overcome the challenges faced by algorithms based on pure column-generation and TB-DAG approaches. Zhang et al. [23] propose a column-generation algorithm that employs a TB-DAG-based representation of the decision space to compute the coordinated strategy to add to the set of available strategies. Another direction has been taken by McAleer et al. [17], who combine the traditional column-generation pipeline with an algorithm based on deep reinforcement learning to expand the strategy set available to each team. However, both of the cited approaches face the same scalability limitation brought by the column-generation approach whenever the game to be solved has considerable depth, as is the case of Leduc Poker testbeds. On the other hand, Zhang et al. [25] propose a mixed approach where a column-generation process is used to dynamically extend a TB-DAG representation of the coordinated strategy space, which is then solved for an approximated equilibrium using the no-regret techniques proposed by Zhang et al. [24].

Our Contribution. The aim of this work is to propose an alternative approach to overcome the scalability issues of previous algorithms and to unlock the adoption of regret-minimization techniques in huge ATGs. More precisely, we rely on the TB-DAG representation and propose to overcome the computational limitations arising from the need to traverse the whole DAG by combining no-regret algorithms with Monte-Carlo sampling techniques, starting from the Monte-Carlo CFR (MCCFR) algorithm originally proposed by Lanctot et al. [15]. In order to do so, we must face two different challenges. The first challenge is that both the trivial outcome sampling algorithm as introduced for two-player zero-sum games [15] and the computation of the average strategies of the players (i.e., the approximated TMECor) would require, in the worst case, a complete DAG traversal at each iteration. In the first part of this paper, we show how to define a suitable outcome sampling algorithm that, instead of visiting the whole TB-DAG, passes through a number of nodes that is linear in the depth d of the DAG. The second part of this paper is devoted to the problem of computing average strategies in this setting. We show how to efficiently track the average strategies by means of local updates at the nodes traversed during outcome sampling. Overall, our proposed solution can progressively optimize a correlated strategy by using light iterations based on sampling a single trajectory of play at a time while needing memory storage only for the nodes that have actually been traversed during past play. Additionally, we provide high-probability bounds for the regret and show that our method converges to a TMECor with a regret bound whose dependence on the game parameters is of order $O (b^{kd})$ , where b represents the branching factor of the team game’s tree and k is the information complexity of the game.
Experimental results show that our algorithm is sound and that there is convergence to approximately good strategies even with the low resource requirements of the proposed technique. The main tradeoff is on the approximation quality, which is not as tight as previous techniques, as a consequence of the adoption of sampling techniques.

2 Preliminaries

Adversarial Team Games. ATGs are imperfect information sequential games in which the players are partitioned in two teams. We denote the two teams as ▲ and ▼, respectively. The extensive form of an ATG is commonly represented through a rooted game tree. Nodes in the game tree are used to model decision points in which players (or chance) must take an action, while edges model the available actions that the players can perform. Leaves represent, instead, terminal states in the game, i.e., all those situations in which the game ends. The set of tree decision nodes is denoted as $H = H_{▲} \cup H_{▼} \cup H_{C}$ , where $H_{i}$ with i∈ { ▲ , ▼ , C } denotes the set of nodes in which i acts (C is the chance player).

For any pair of nodes $h, h^{'} \in H$ , we say that h′ is reachable from h, indicated as h ⪯ h′, if there exists a path in the tree connecting h to h′ with the former preceding the latter, while we write h ≺ h′ if h ⪯ h′ and h ≠ h′. Moreover, for each $h \in H$ , $A (h)$ is the set of actions available at h. The set of terminal nodes is denoted as $Z$ . Each $z \in Z$ is associated with a utility value for each player. For each team i ∈ {▲ , ▼}, the function $u_{i} : Z \to ℝ$ assigns team i’s utility value to each terminal node. We restrict our attention to games in which the type of interaction between the two teams is zero-sum, i.e., for all $z \in Z$ , $u_{▲} (z) = - u_{▼} (z)$ . For notational convenience, we consider a single utility $u_{▲}$ that is maximized by the max team ▲, while ▼ is the min team and tries to minimize it. Furthermore, we assume that the probabilities p (· |h) with which chance plays at nodes $h \in H_{C}$ are known to all the players. With a slight abuse of notation, for each $z \in Z$ , we denote with p [z] the chance reach of node z, i.e., the product of chance probabilities in the path from the root of the ATG to z. Partial observability is represented by means of information sets, or, in short, infosets. For each team i, the infosets form a partition of the nodes in $H_{i}$ . The set of i’s infosets is denoted as $I_{i}$ . We assume the ATG to be timeable [13], i.e., no infoset spans multiple layers of the game tree. A pure strategy for team i defines a deterministic choice of an action for each $I \in I_{i}$ . In this work, we will make use of strategies in realization form, which are vectors $x \in \{0, 1\}^{Z}$ such that for any $z \in Z$ , x [z] = 1 if and only if the pure strategy of the team prescribes to play all the team actions in the path from the root of the game tree to z. 
A correlated strategy is a convex combination of pure strategies, and the corresponding realization form is obtained by taking the convex combination of pure strategies’ realization forms. The adoption of correlated strategies allows the team to coordinate their strategies against the opponent. When considering such a correlated strategy space, the relevant solution concept becomes the TMECor. Formally, the TMECor is defined as:

Definition 1 [ε-TMECor]. Let $X$ , $Y$ be the correlated strategy spaces of teams ▲ and ▼, respectively. An ε-TMECor is a pair of strategies $(x^{★}, y^{★}) \in X \times Y$ such that: $\begin{aligned} \max_{x \in X} \sum_{z \in Z} (x [z] - x^{★} [z]) \, y^{★} [z] \, u_{▲} (z) &\leq ε \\ \max_{y \in Y} \sum_{z \in Z} (y [z] - y^{★} [z]) \, x^{★} [z] \, u_{▼} (z) &\leq ε . \end{aligned}$ A TMECor is a 0-TMECor.

Celli and Gatti [4] show that the problem of finding a TMECor is APX-hard even when one of the two teams is composed of a single player.

Team Belief DAG. The TB-DAG [24] is a compact representation of the team’s correlated strategy space that exploits the notion of public information. Formally, two nodes $h, h^{'} \in H$ are indistinguishable to team i if there is some $I \in I_{i}$ that is reachable both from h and h′, i.e., there exist $h^{''}, h^{'''} \in I$ such that $h ⪯ h^{''}$ and $h^{'} ⪯ h^{'''}$ . In addition, h and h′ are in the same public state for team i (i.e., they are characterized by the same public information for all the players in i) if they belong to the same connected component of the graph induced by the indistinguishability relation of team i. For any i ∈ {▲ , ▼}, the TB-DAG representation for team i is constructed as a DAG-shaped decision problem that alternates decision points $B_{i}$ (called beliefs) where i acts and observation points $O_{i}$ where i observes realizations of some stochastic events. Both beliefs $B \in B_{i}$ and observation points $O \in O_{i}$ coincide with a set of nodes $B, O \subseteq H$ that the team cannot distinguish at a specific situation of the game. We write I ∼ B when I ∩ B ≠ ∅. The TB-DAG is constructed recursively in the following way.

The root ∅ is a belief that contains only the root node of the ATG.

For each belief $B \in B_{i}$ , the edges leaving B correspond to the set of prescriptions $A (B) = \{a ∣ a \in \times_{I \sim B} A (I)\}$ , i.e., tuples that assign an action to each infoset that has a nonempty intersection with B. After selecting prescription $a$ at belief B, the decision process moves to the observation point $O = B a \in O_{i}$ , where $O = \bigcup_{\substack{I_{j} \sim B \\ a_{j} \in a}} \{h a_{j} ∣ h \in I_{j} \cap B\} \; \cup \; \{h a ∣ h \in B \cap H_{- i}, a \in A (h)\} .$ If, instead, the belief contains nodes $z \in Z$ , then such a belief corresponds to a leaf of the DAG.

At any observation point $O \in O_{i}$ , the possible observations that the team can receive are the public states that have a nonempty intersection with O. After receiving the observation on the public state, a player moves in the DAG to a belief that coincides with the public state itself.
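The first two construction steps can be illustrated with a minimal Python sketch. It is not the construction code of [24]: the node names, the dictionary-based encoding of infosets, and the helper names `prescriptions` and `observation_point` are all assumptions made for illustration, and child nodes are encoded simply as (node, action) pairs.

```python
from itertools import product

def prescriptions(belief, infosets, actions):
    """Enumerate A(B): one action for every infoset that intersects the belief."""
    touching = sorted(name for name, nodes in infosets.items() if nodes & belief)
    return [dict(zip(touching, combo))
            for combo in product(*(actions[name] for name in touching))]

def observation_point(belief, prescription, infosets, other_moves):
    """Build O = Ba from the prescribed actions and the non-team moves."""
    reached = set()
    for name, action in prescription.items():
        # Team nodes in the belief follow the action prescribed for their infoset.
        reached |= {(h, action) for h in infosets[name] & belief}
    for h, acts in other_moves.items():
        # Nodes of the opponent or of chance contribute all of their children.
        if h in belief:
            reached |= {(h, a) for a in acts}
    return reached

# Hypothetical belief: two team nodes in different infosets plus one opponent node.
belief = {"h1", "h2", "h3"}
infosets = {"I1": {"h1"}, "I2": {"h2"}}
actions = {"I1": ["l", "r"], "I2": ["l", "r"]}
other_moves = {"h3": ["x", "y"]}

all_prescriptions = prescriptions(belief, infosets, actions)
obs = observation_point(belief, {"I1": "l", "I2": "r"}, infosets, other_moves)
```

Because a prescription is a tuple with one entry per intersecting infoset, the number of edges leaving a belief is the product of the action counts of those infosets, which is the source of the exponential dependence on the number of private states.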

For completeness of exposition, we report an example of TB-DAG from [24] in Fig. 1.

Fig. 1

Simple example of adversarial team game from [24] (nodes named for ease of reference), its indistinguishability relation graph, and its TB-DAG.

The precedence relationship on the DAG is inherited from the precedence ordering on ATGs, i.e., for two nodes $S, S^{'} \in B_{i} \cup O_{i}$ , it holds that S ⪯ S′ if h ⪯ h′ for some h ∈ S and h′ ∈ S′. A path in the TB-DAG form of a team i connecting two nodes S ⪯ S′ is a sequence of pairs $(B, a)$ , with $B \in B_{i}$ , $a \in A (B)$ , that can be encountered when going from S to S′. The graphical notation $S \overset{λ}{\to} S^{'}$ is used to indicate that path λ goes from S to S′, and such notation will be used in place of λ to bind its starting and ending points. Similarly, $\emptyset \overset{λ}{\to} B a$ indicates that path λ starts at the root ∅ of the DAG and ends at $B a$ . The set of paths in the TB-DAG of team i from S to S′ is denoted as $Λ_{i} (S, S^{'})$ . Overloading the notation, for an ATG node $h \in H$ , we let $Λ_{i} (S, h) = \bigcup_{S^{'} ∋ h} Λ_{i} (S, S^{'})$ be the set of paths connecting S to any DAG node that contains h.

The appropriate strategy encoding in a TB-DAG is the TB-DAG form. The space of TB-DAG-form strategies is equivalent to the space of correlated strategies; hence, in the worst case, it is not possible to describe the former with a number of constraints polynomial in the game dimension. However, the TB-DAG-form strategy representation is, in practice, significantly more compact than the typical correlated strategy representation. More formally, starting from any team i’s pure strategy, it is possible to derive the corresponding pure TB-DAG-form strategy $x \in \{0, 1\}^{B_{i} \cup O_{i}}$ by taking $x [B] = 1$ if and only if, for any node h ∈ B, the pure strategy prescribes to play all i’s actions in the path from the root of the ATG to h and no B′ ⊃ B satisfies such property. Furthermore, for all $B \in B_{i}$ , $a \in A (B)$ , $x [B a] = 1$ if and only if $x [B] = 1$ and the pure strategy prescribes to play all $a_{j} \in a$ . Given a correlated strategy, the corresponding TB-DAG-form strategy can be obtained by taking the suitable convex combination of pure strategies. We denote the polytope of TB-DAG-form strategies for teams ▲ and ▼ as $X$ and $Y$ , respectively. With a slight abuse of notation, given a TB-DAG-form strategy $x$ and a terminal node $z \in Z$ , we will denote the realization probability of z, i.e., the contribution of $x$ to the probability of reaching z, as follows:

$x [z] = \sum_{B ∋ z} x [B] .$ (1)

Occasionally, we will need to specify prescription probabilities in behavioral form, i.e., we will need to model the probability with which a given prescription is played at a certain belief. Formally, for a belief B and a prescription $a \in A (B)$ , we write such probability as $π (a | B)$ . Given a path λ, overloading the notation, we denote the product of behavioral probabilities along λ as $π (λ) = \prod_{(B, a) \in λ} π (a | B)$ . Additionally, we specify the product of behavioral probabilities on the subpath of λ connecting two nodes S and S′ as

$π (S \overset{λ}{\to} S^{'}) = \prod_{\substack{(B, a) \in λ \\ S ⪯ B a ⪯ S^{'}}} π (a | B) .$ (2)

By definition of the TB-DAG-form polytope, the strategy $x$ originating from a behavioral strategy π is such that $x [S] = \sum_{λ \in Λ_{i} (\emptyset, S)} π (λ)$ for all $S \in B_{i} \cup O_{i}$ .
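As an illustration of this identity, the sum over paths can be accumulated in a single top-down pass instead of enumerating paths explicitly. The sketch below assumes a hypothetical edge-list encoding in which edges leaving a belief carry the behavioral probability $π (a | B)$ and edges leaving an observation point carry weight 1 (the team does not choose the observation); the node names and the function `tb_dag_form` are illustrative only.

```python
from collections import defaultdict

def tb_dag_form(root, topo_order, edges):
    """Return x[S] = sum over root-to-S paths of the product of edge weights."""
    x = defaultdict(float)
    x[root] = 1.0
    for node in topo_order:
        # A node is processed only after all of its parents, so x[node] is final here.
        for child, weight in edges.get(node, []):
            x[child] += x[node] * weight
    return dict(x)

# Hypothetical DAG: belief C is reachable through both prescriptions played at A.
edges = {
    "A": [("O1", 0.25), ("O2", 0.75)],   # pi(a1 | A), pi(a2 | A)
    "O1": [("B", 1.0), ("C", 1.0)],      # observation edges
    "O2": [("C", 1.0)],
    "C": [("O3", 0.5), ("O4", 0.5)],     # pi(. | C)
}
topo_order = ["A", "O1", "O2", "B", "C", "O3", "O4"]
x = tb_dag_form("A", topo_order, edges)
```

In this toy instance the mass of C collects both paths from the root, which is exactly what distinguishes the DAG from a tree-shaped decision problem.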

We introduce the following additional notation, which we use throughout the paper. We denote with b the branching factor of the ATG, i.e., $b = \max_{h \in H} | A (h) |$ , while $k = \max_{i \in \{▲, ▼\}} \max_{B \in B_{i}} | I_{i} (B) |$ , where $I_{i} (B) = \{I \in I_{i} ∣ I \sim B\}$ , is the information complexity of the game. The maximum depth of the DAG is denoted by d. Finally, for a set $S$ , $Δ^{S}$ denotes the probability simplex over the elements of $S$ .

Stochastic Regret Minimization. We make extensive use of the Online Convex Optimization (OCO) tools introduced by Zinkevich [26]. The basic scenario faced in OCO problems encompasses a decision maker that repeatedly interacts with a possibly adversarial environment, choosing at each time t a strategy $x^{t}$ from a compact and convex set $X$ . At every time t, after choosing the strategy, the agent receives a loss vector $ℓ^{t}$ from the environment. The benchmark used to evaluate the performance after T rounds is the regret $R^{T}$ defined as: $R^{T} = \max_{x^{★} \in X} \sum_{t = 1}^{T} (ℓ^{t})^{⊤} (x^{t} - x^{★}) .$ A regret minimizer is any algorithm that is capable of achieving sublinear regret (i.e., $R^{T} = o (T)$ ). In our work, we are interested in regret minimizers that use stochastic loss estimators ${\tilde{ℓ}}^{t}$ . Let ${\tilde{R}}^{T} = \max_{x^{★} \in X} \sum_{t = 1}^{T} ({\tilde{ℓ}}^{t})^{⊤} (x^{t} - x^{★})$ be the cumulative regret that can be inferred from the estimated losses. Farina et al. [10] show that given any loss estimator ${\tilde{ℓ}}^{t}$ such that $E [{\tilde{ℓ}}^{t}] = ℓ^{t}$ , the following holds:

Proposition 2 (Farina et al. [10]). Let $M, \tilde{M} > 0$ be such that for all $t \in [T]$ and for all $x, x^{'} \in X$ , $| (ℓ^{t})^{⊤} (x - x^{'}) | < M$ and $| ({\tilde{ℓ}}^{t})^{⊤} (x - x^{'}) | < \tilde{M}$ . Then, for all δ ∈ (0, 1): $ℙ \left( R^{T} \leq {\tilde{R}}^{T} + (M + \tilde{M}) \sqrt{2 T \log \frac{1}{δ}} \right) \geq 1 - δ .$

3 DAG Monte-Carlo regret minimization

The TMECor can be equivalently formulated as the solution of a bilinear saddle point problem in which ▲ and ▼ respectively maximize and minimize the expected value of u_▲ over the game terminal nodes. Therefore, it is possible to use no-regret algorithms to approximate a TMECor.

Consider the setting in which at each iteration t ∈ [T] teams ▲ and ▼ select strategies $x^{t}$ and $y^{t}$ and receive, respectively, losses $ℓ_{▲}^{t}, ℓ_{▼}^{t} \in ℝ^{Z}$ such that $\forall z \in Z$ , $ℓ_{▲}^{t} [z] = - u_{▲} (z) p [z] y^{t} [z]$ and $ℓ_{▼}^{t} [z] = - u_{▼} (z) p [z] x^{t} [z]$ . The cumulative regrets experienced by ▲ and ▼ are then: $\begin{aligned} R_{▲}^{T} &= \max_{x \in X} \sum_{t = 1}^{T} \sum_{z \in Z} (x [z] - x^{t} [z]) \, u_{▲} (z) \, p [z] \, y^{t} [z] \\ R_{▼}^{T} &= \max_{y \in Y} \sum_{t = 1}^{T} \sum_{z \in Z} (y [z] - y^{t} [z]) \, u_{▼} (z) \, p [z] \, x^{t} [z] . \end{aligned}$ The following proposition formalizes the relationship between the two regrets and the degree of equilibrium approximation.

Proposition 3. The pair of strategies $({\bar{\vec{x}}}^{T}, {\bar{\vec{y}}}^{T})$ , where ${\bar{\vec{x}}}^{T} = \frac{1}{T} \sum_{t = 1}^{T} {\vec{x}}^{t}$ and ${\bar{\vec{y}}}^{T} = \frac{1}{T} \sum_{t = 1}^{T} {\vec{y}}^{t}$ , is an ε-TMECor with $ε \leq \frac{R_{▲}^{T} + R_{▼}^{T}}{T} .$

Proposition 3 is a well-known result in the literature (see for example [8] for a proof). As a direct implication of Proposition 3, in order to compute an approximate TMECor, it is sufficient to formulate a regret-minimizing algorithm for the TB-DAG strategy polytopes $X$ , $Y$ , since $R_{i}^{T} = o (T)$ for each team i guarantees that the approximation error in Proposition 3 approaches 0 as $T \to \infty$ .
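The relationship in Proposition 3 can be checked numerically on a toy instance. The sketch below is only an illustration under simplifying assumptions: it replaces the TB-DAG polytopes with the simplices of a small zero-sum matrix game, uses Regret Matching in self-play, and the payoff matrix `U` and the function names are hypothetical. It computes the largest unilateral deviation gain of the average strategies and compares it with the sum of the regrets divided by T.

```python
def regret_matching(cum_regret):
    """Play proportionally to positive cumulative regret, uniform if none is positive."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    n = len(cum_regret)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n

def self_play(U, T):
    """Run T rounds of self-play; U[i][j] is the max player's utility."""
    n, m = len(U), len(U[0])
    reg_x, reg_y = [0.0] * n, [0.0] * m
    sum_x, sum_y = [0.0] * n, [0.0] * m
    for _ in range(T):
        x, y = regret_matching(reg_x), regret_matching(reg_y)
        gain_x = [sum(U[i][j] * y[j] for j in range(m)) for i in range(n)]
        gain_y = [-sum(U[i][j] * x[i] for i in range(n)) for j in range(m)]
        value = sum(x[i] * gain_x[i] for i in range(n))
        # Regret of an action: its utility minus the utility actually obtained.
        reg_x = [reg_x[i] + gain_x[i] - value for i in range(n)]
        reg_y = [reg_y[j] + gain_y[j] + value for j in range(m)]
        sum_x = [sum_x[i] + x[i] for i in range(n)]
        sum_y = [sum_y[j] + y[j] for j in range(m)]
    x_bar = [s / T for s in sum_x]
    y_bar = [s / T for s in sum_y]
    row = [sum(U[i][j] * y_bar[j] for j in range(m)) for i in range(n)]
    col = [sum(U[i][j] * x_bar[i] for i in range(n)) for j in range(m)]
    value_bar = sum(x_bar[i] * row[i] for i in range(n))
    eps = max(max(row) - value_bar, value_bar - min(col))
    bound = (max(reg_x) + max(reg_y)) / T
    return x_bar, y_bar, eps, bound

U = [[0.0, -1.0, 2.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]]  # hypothetical payoffs
x_bar, y_bar, eps, bound = self_play(U, 2000)
```

On a simplex the maximum in the regret definition is attained at a vertex, which is why the cumulative regret of each player is read off as the largest entry of its regret vector.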

In the following, we first review how to apply Counterfactual Regret Minimization (CFR) [27] to the TB-DAG representation and discuss the main limitations of such an approach. Subsequently, we discuss a possible way of overcoming such limitations and lay the ground for the formulation of a sampling-based version of CFR on the DAG structure.

3.1 Counterfactual regret minimization on a DAG

In the following, we review the CFR algorithm proposed in [22] to operate on a TB-DAG. CFR is a self-play algorithm that performs regret minimization on a sequential decision problem by decomposing the regret into additive terms that are minimized locally at each decision point. For each iteration t ∈ [T], let $x^{t} \in X$ and $y^{t} \in Y$ be the TB-DAG-form strategies originating from behavioral probabilities $π_{▲}^{t}$ and $π_{▼}^{t}$ , respectively. Given such strategies at time t, a counterfactual loss is associated with each belief $B \in B_{▲}$ .

Definition 4 [Counterfactual Loss]. The counterfactual loss $ℓ_{B}^{t} \in ℝ^{A (B)}$ is the vector defined for each belief $B \in B_{▲}$ such that for all prescriptions $a \in A (B)$ : $ℓ_{B}^{t} [a] = - \sum_{z ⪰ B a} u_{▲} (z) \, p [z] \, y^{t} [z] \sum_{λ \in Λ_{▲} (B a, z)} π_{▲}^{t} (λ),$ where $\sum_{λ \in Λ_{▲} (B a, z)} π_{▲}^{t} (λ)$ denotes, intuitively, the probability with which ▲ plays at time t to reach z, conditioned on traversing $B a$ .

Counterfactual losses for ▼ are defined symmetrically. The counterfactual loss $ℓ_{B}^{t}$ is used to determine the local regret $R_{B}^{T}$ associated with each belief B: $R_{B}^{T} = \max_{π \in Δ^{A (B)}} \sum_{t = 1}^{T} (ℓ_{B}^{t})^{⊤} (π^{t} - π) .$ As Zhang et al. [24] show, the TB-DAG can be constructed recursively by applying scaled extension operations [9]. By Proposition 1 in [9], this implies that the cumulative regret $R_{i}^{T}$ experienced by a team i is upper bounded by the sum of cumulative regrets $R_{B}^{T}$ experienced by local regret minimizers instantiated at the belief level. Formally, $R_{i}^{T} \leq \sum_{B \in B_{i}} R_{B}^{T} .$ As a consequence of this fact, it is possible to design a CFR-like algorithm operating directly on the DAG. In particular, at each iteration t ∈ [T], each team i chooses $π_{i}^{t} (\cdot | B)$ for $B \in B_{i}$ as the behavioral prescription probabilities obtained from the Regret Matching (RM) algorithm [12] with regrets cumulated up to time t at belief B. Then, i observes at each belief B the counterfactual losses $ℓ_{B}^{t}$ as defined in Definition 4. The overall CFR-like algorithm obtained by composing Regret Matching (Algorithm 2) with the utility propagation over a TB-DAG is summarized in Algorithm 1.

Algorithm 1 CFR on a TB-DAG
Input: a TB-DAG game
Initialize realization-form strategies $x^{0}$ , $y^{0}$ , $\bar{\vec{x}}$ , $\bar{\vec{y}}$ , all belonging to $ℝ^{Z}$ , and cumulative regrets $r_{B} \in ℝ^{A (B)}$ for all beliefs $B$
for t = 1, …, T do
    $x^{t} \leftarrow \mathrm{PROPAGATE} (x^{t - 1}, y^{t - 1}, u_{▲}, ▲)$
    $y^{t} \leftarrow \mathrm{PROPAGATE} (y^{t - 1}, x^{t - 1}, u_{▼}, ▼)$
    $\bar{\vec{x}} \leftarrow \frac{t - 1}{t} \bar{\vec{x}} + \frac{1}{t} x^{t}$
    $\bar{\vec{y}} \leftarrow \frac{t - 1}{t} \bar{\vec{y}} + \frac{1}{t} y^{t}$
end for
return $\bar{\vec{x}}$ , $\bar{\vec{y}}$
function PROPAGATE( $x_{i}, x_{- i}, u_{i}, i$ )
    for $B \in B_{i}$ do
        Compute loss $ℓ_{B}^{t}$ specified in Def. 4
        $π_{i}^{t + 1} (\cdot | B) \leftarrow \mathrm{REGRETMATCHING} (ℓ_{B}^{t})$
    end for
    return $x_{i}$ in realization form as per Eqs. (1) and (2)
end function

Algorithm 2 Regret Matching
Input: immediate losses $ℓ_{B}^{t}$
State: current strategy $π \in Δ^{A (B)}$ , initialized to uniform; cumulated regrets $r_{B}$ , initialized to 0
for $a \in A (B)$ do
    $r_{B} [a] \leftarrow r_{B} [a] + (ℓ_{B}^{t})^{⊤} π - ℓ_{B}^{t} [a]$
end for
for $a \in A (B)$ do
    $π [a] \propto \begin{cases} r_{B} [a] & \text{if } r_{B} [a] \geq 0 \\ 0 & \text{otherwise} \end{cases}$
end for
if π is identically equal to 0, set it to the uniform strategy
return $π$
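A minimal Python sketch of a local Regret Matching minimizer follows. It is an illustration, not the authors' implementation: the class name `RegretMatcher` and its interface are assumptions, and the regret increment is taken as the expected loss of the current strategy minus the loss of each prescription, in agreement with the local regret $R_{B}^{T}$ defined above for losses.

```python
class RegretMatcher:
    """Local regret minimizer for one belief, operating on loss vectors."""

    def __init__(self, n_actions):
        self.regret = [0.0] * n_actions
        self.strategy = [1.0 / n_actions] * n_actions  # uniform start

    def observe(self, loss):
        """Accumulate regrets for the observed loss and return the next strategy."""
        expected = sum(p * l for p, l in zip(self.strategy, loss))
        # An action gains regret when its loss is below the expected loss.
        self.regret = [r + expected - l for r, l in zip(self.regret, loss)]
        positive = [max(r, 0.0) for r in self.regret]
        total = sum(positive)
        n = len(loss)
        self.strategy = ([p / total for p in positive] if total > 0
                         else [1.0 / n] * n)
        return self.strategy

rm = RegretMatcher(2)
next_strategy = rm.observe([1.0, 0.0])  # the second prescription has lower loss
```

After a single loss vector that penalizes only the first prescription, all probability mass moves to the second one, since it is the only prescription with positive cumulative regret.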

The regret of the algorithm is upper bounded by a polynomial function of the DAG size:

Proposition 5 (Zhang et al. [24]). CFR can be run on a TB-DAG with set of beliefs $B$ and set of edges $E$ with cumulative regret after T rounds $R^{T} = O (| B | \sqrt{T})$ and iteration time $O (| E |)$ .

As mentioned above, the representation of the TB-DAG strategy polytope is significantly more compact than the vanilla encoding of the correlated strategy polytope. This is due to the adoption of a correlation scheme that is local at each public state instead of being trivially defined over the set of pure strategies. However, since prescriptions are defined as tuples assigning an action to each information set with a nonempty intersection with the belief, the TB-DAG representation still requires a number of constraints that scales exponentially with the game’s information complexity. Considering this fact, Proposition 5 has two important implications: in the worst-case scenario (i.e., in games in which the amount of private information that the players have is high, e.g., Bridge) (i) the computational complexity of the single CFR iteration is exponential with respect to the ATG size and (ii) the regret bound has an exponential dependency on the ATG dimension. While (ii) is a drawback inherited from the TMECor’s APX-hardness, the exponential computational complexity of the single iteration is a much more undesirable property. Indeed, in practice, tabular algorithms like CFR and its deterministic variants cannot scale past small games due to the necessity of traversing the whole DAG at each iteration (see experimental results reported in [24]). Another crucial point is the space complexity of the algorithm. In this respect too, tabular procedures are highly inefficient, as they need to keep the whole DAG structure in memory.

Inspired by the literature on two-player zero-sum games [10 , 19], we propose to overcome such limitations by combining the TB-DAG regret minimization framework with Monte-Carlo sampling techniques.

4 Sampling enters the picture: MCCFR-DAG

Algorithm 3 MCCFR-DAG
Input: Exploration strategies $w_{▲}$ , $w_{▼}$
Initialize strategies $x^{0}$ , $y^{0}$ , $\bar{\vec{x}}$ , $\bar{\vec{y}}$
for t = 1, …, T do
    SampleAndPropagate( $w_{▲}, y^{t - 1}, u_{▲}, ▲$ )
    SampleAndPropagate( $w_{▼}, x^{t - 1}, u_{▼}, ▼$ )
    $\bar{\vec{x}} \leftarrow \mathrm{UPDATEAVG} (\bar{\vec{x}}, x^{t})$
    $\bar{\vec{y}} \leftarrow \mathrm{UPDATEAVG} (\bar{\vec{y}}, y^{t})$
end for
return $\bar{\vec{x}}$ , $\bar{\vec{y}}$

As stated above, we intend to exploit the advantages of sampling to pursue two objectives: (i) reducing the computational complexity of the single iteration and (ii) reducing the algorithm’s space complexity. In order to accomplish such a result, we propose to extend to the TB-DAG the Outcome Sampling (OS) approach that was previously proposed for two-player zero-sum games by Lanctot et al. [15]. In a nutshell, the main idea behind OS is to estimate the loss vector at each iteration based on a game rollout collected following the players’ strategies and perform sparse strategy updates starting from such an estimated loss. A high-level overview of the algorithm – that we call DAG Monte Carlo Counterfactual Regret Minimization (MCCFR-DAG) – is provided in Algorithm 3. A detailed analysis on how to use OS in order to estimate the loss vector and then how to perform strategy updates based on the loss observed is presented in Section 5.

Theorem 6 provides the formal guarantees for MCCFR-DAG.

Theorem 6. After T iterations, for each i ∈ {▲ , ▼} and for all δ ∈ (0, 1), MCCFR-DAG achieves with probability greater than 1 - δ the following regret bound: $R_{i}^{T} \leq U_{\max} \left( \sqrt{b^{k}} + (b^{kd} + 1) \sqrt{2 \log \frac{| B_{i} |}{δ}} \right) \sqrt{T},$ where $U_{\max} = \max_{z \in Z} | u_{i} (z) |$ .
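To get a feeling for how the bound of Theorem 6 behaves, the following Python sketch simply evaluates its right-hand side. The function names and the parameter values used in the example are hypothetical and chosen only for illustration; they do not correspond to any game in the experimental testbed.

```python
import math

def mccfr_dag_regret_bound(T, b, k, d, n_beliefs, delta, u_max=1.0):
    """Evaluate the right-hand side of the high-probability bound of Theorem 6."""
    log_term = math.sqrt(2.0 * math.log(n_beliefs / delta))
    return u_max * (math.sqrt(b ** k) + (b ** (k * d) + 1) * log_term) * math.sqrt(T)

def average_regret_bound(T, **game):
    """Bound on R_i^T / T, the quantity that drives the equilibrium gap."""
    return mccfr_dag_regret_bound(T, **game) / T

# Hypothetical game parameters.
game = dict(b=2, k=2, d=3, n_beliefs=50, delta=0.05)
at_T = mccfr_dag_regret_bound(10_000, **game)
at_4T = mccfr_dag_regret_bound(40_000, **game)
```

The bound grows as $\sqrt{T}$ , so the average regret decays as $1 / \sqrt{T}$ , while the constant in front is dominated by the $b^{kd}$ term.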

5 Loss sampling and backpropagation

As mentioned above, the loss sampling algorithm that we adopt in this work is essentially an Outcome Sampling estimator. In the basic version already defined for two-player games, the OS estimator is obtained by sampling a terminal node through a game simulation. However, after sampling the terminal node, the trivial OS procedure would require backpropagating the loss through all the DAG paths that connect the root ∅ to the sampled terminal node. Unfortunately, the number of such paths is, in the worst case, an exponential function of the ATG size, which would cause the time required to complete an iteration to explode. The solution that we propose in this work, instead, backpropagates the loss only through the path of nodes actually traversed when simulating the game, thus reducing the number of visited beliefs to a linear function of the TB-DAG’s depth.

A graphical representation of the two update schemes is given in Fig. 2, while the detailed algorithm for loss sampling and backpropagation is shown in Algorithm 4.

Fig. 2

Example of different loss propagation algorithms on the example game from Fig. 1, where the nodes and edges traversed during the propagation are dotted. As an example, we assume that terminal belief J has been reached through the belief path A → BC → EG → J.

Algorithm 4 SampleAndPropagate
1: Input:
2: Sampling strategy $w$ , Opponent strategy $y^{t}$
3: Utility function u, Team i
4: Sample joint path ${\tilde{λ}}^{t}$ according to Eq. (3)
5: for $B a \in {\tilde{λ}}_{i}^{t}$ do
6: Compute loss ${\tilde{ℓ}}_{B}^{t}$ specified in Eq. (4)
7: $π_{i}^{t + 1} (\cdot | B) \leftarrow REGRETMATCHING ({\tilde{ℓ}}_{B}^{t})$
8: end for

For the sake of illustration, let us consider the problem of sampling the loss vector for team ▲; the loss sampling procedure for ▼ is defined analogously. We take as input a sampling strategy $w_{▲} \in X$ induced by the uniform behavioral strategy ${\bar{π}}_{▲}$, i.e., ${\bar{π}}_{▲} (a | B) = \frac{1}{| A (B) |}$ for all $B \in B_{▲}$ , $a \in A (B)$ . At each iteration t, we perform a game rollout, following strategy $w_{▲}$ for ▲ and strategy $y^{t}$ for ▼. The rollout yields a terminal node ${\tilde{z}}^{t}$ and a joint path ${\tilde{λ}}^{t} = ({\tilde{λ}}_{▲}^{t}, {\tilde{λ}}_{▼}^{t}) \in Λ_{▲} (\emptyset, {\tilde{z}}^{t}) \times Λ_{▼} (\emptyset, {\tilde{z}}^{t})$ . The probability of sampling each joint path is the following:

$ℙ ({\tilde{λ}}^{t} = λ) = {\bar{π}}_{▲} (λ_{▲}) \, π_{▼}^{t} (λ_{▼}) \, p_{c} [z] \quad \forall λ,$ (3)

where z denotes the terminal node at which the joint path λ ends, while the probability of sampling a terminal node $z \in Z$ is

$ℙ ({\tilde{z}}^{t} = z) = p_{c} [z] \, w_{▲} [z] \, y^{t} [z] \quad \forall z \in Z .$

Let us remark that, by linearity of the TB-DAG strategy space, it follows that

$w_{▲} [z] = \sum_{λ \in Λ_{▲} (\emptyset, z)} {\bar{π}}_{▲} (λ) \quad \forall z \in Z .$

The same holds for $y^{t}$ , with $π_{▼}^{t}$ and $Λ_{▼} (\emptyset, z)$ in place of ${\bar{π}}_{▲}$ and $Λ_{▲} (\emptyset, z)$ .
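As a sanity check, the two sampling distributions above can be related by a short derivation. It uses only Eq. (3) and the linearity identity, and writes $p_{c} [z]$ for the chance probability of reaching z (an assumption on notation). Marginalizing the joint-path distribution over all pairs of paths that end in z gives:

$\begin{aligned} ℙ ({\tilde{z}}^{t} = z) &= \sum_{λ_{▲} \in Λ_{▲} (\emptyset, z)} \sum_{λ_{▼} \in Λ_{▼} (\emptyset, z)} {\bar{π}}_{▲} (λ_{▲}) \, π_{▼}^{t} (λ_{▼}) \, p_{c} [z] \\ &= p_{c} [z] \left( \sum_{λ_{▲} \in Λ_{▲} (\emptyset, z)} {\bar{π}}_{▲} (λ_{▲}) \right) \left( \sum_{λ_{▼} \in Λ_{▼} (\emptyset, z)} π_{▼}^{t} (λ_{▼}) \right) \\ &= p_{c} [z] \, w_{▲} [z] \, y^{t} [z] . \end{aligned}$

The second equality holds because the summand factorizes across the two teams, and the third applies the linearity identity to each factor.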

After sampling a joint path and the corresponding terminal node, at each belief $B \in B_{▲}$ and for each prescription $a \in A (B)$ , team ▲ observes a loss:

${\tilde{ℓ}}_{B}^{t} [a] = \begin{cases} \frac{- u_{▲} ({\tilde{z}}^{t})}{| Λ_{▲} (\emptyset, B) |} \frac{π_{▲}^{t} (B a \to {\tilde{λ}}_{▲}^{t} {\tilde{z}}^{t})}{{\bar{π}}_{▲} ({\tilde{λ}}_{▲}^{t})} & \text{if } B a \in {\tilde{λ}}_{▲}^{t}, \\ 0 & \text{otherwise.} \end{cases}$ (4)

Intuitively, for each belief B and prescription $a$ observed during the rollout phase, the loss ${\tilde{ℓ}}_{B}^{t} [a]$ is a function of the utility at the terminal node ${\tilde{z}}^{t}$ and of the probability with which the team plays from $B a$ to ${\tilde{z}}^{t}$ along the collected path ${\tilde{λ}}_{▲}^{t}$ . The factor $\frac{1}{| Λ_{▲} (\emptyset, B) | {\bar{π}}_{▲} ({\tilde{λ}}_{▲}^{t})}$ is, essentially, an importance sampling weight that compensates for the probability with which the path is sampled. The loss is instead zero for all beliefs and prescriptions that were not observed during the sampling phase. We point out that computing the number of paths $| Λ_{▲} (\emptyset, B) |$ connecting ∅ and B would require a complete DAG traversal. However, since the term $\frac{1}{| Λ_{▲} (\emptyset, B) |}$ multiplies the whole loss vector ${\tilde{ℓ}}_{B}^{t}$ , local regret minimizers that exhibit the scale invariance property (e.g., Regret Matching) allow us to safely ignore that factor, thus avoiding, in practice, the need for an upfront DAG preprocessing.
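The scale invariance argument can be illustrated with a minimal sketch of Regret Matching at a single belief. The helper names, the toy loss vectors, and the constant 7 standing in for $| Λ_{▲} (\emptyset, B) |$ are all hypothetical. The sketch runs the same loss sequence with and without a constant positive rescaling and produces the same sequence of strategies.

```python
def regret_matching(cum_regret):
    """Play proportionally to the positive parts of the cumulative regrets."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total <= 0.0:
        # No positive regret yet: fall back to the uniform strategy.
        return [1.0 / len(cum_regret)] * len(cum_regret)
    return [p / total for p in positive]


def run_regret_matching(losses, scale=1.0):
    """Run Regret Matching on loss vectors that are all multiplied by `scale`."""
    cum_regret = [0.0] * len(losses[0])
    strategies = []
    for loss in losses:
        strategy = regret_matching(cum_regret)
        strategies.append(strategy)
        scaled = [scale * value for value in loss]
        expected = sum(s * value for s, value in zip(strategy, scaled))
        # Regret of action a: loss actually incurred minus the loss of a.
        cum_regret = [r + expected - value for r, value in zip(cum_regret, scaled)]
    return strategies


losses = [[1.0, 0.0, 0.5], [0.0, 2.0, 1.0], [0.5, 0.5, 0.0]]  # hypothetical data
plain = run_regret_matching(losses)
rescaled = run_regret_matching(losses, scale=1.0 / 7.0)  # 7 stands in for |Λ(∅, B)|
```

Since every cumulative regret is multiplied by the same positive constant, the normalization inside `regret_matching` cancels it, and `plain` and `rescaled` coincide up to floating-point error.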

The following Lemma shows that ${\tilde{ℓ}}_{B}^{t}$ constitutes an unbiased estimate of CFR-DAG’s loss $ℓ_{B}^{t}$ , $\forall B \in B_{▲}$ .

Lemma 7. For all t ∈ [T], for all $B \in B_{▲}$ and for all $a \in A (B)$ it holds that: $\mathbb{E}_{{\tilde{z}}^{t}, {\tilde{λ}}^{t}} [{\tilde{ℓ}}_{B}^{t} [a]] = ℓ_{B}^{t} [a] .$ Furthermore, for all t ∈ [T], for all $B \in B_{▲}$ and for all $x, x^{'} \in Δ^{A (B)}$ , $({\tilde{ℓ}}_{B}^{t})^{⊤} (x - x^{'}) \leq b^{kd} U_{\max},$ where $U_{\max} = \max_{z \in Z} | u_{▲} (z) |$ .
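The unbiasedness claim can be checked by exact enumeration on a toy instance. The sketch below rests on strong simplifying assumptions that are not part of the paper: the decision problem is a small tree (so $| Λ_{▲} (\emptyset, B) | = 1$ for every belief), there are no opponent or chance nodes, and the exact loss is taken to be the usual counterfactual loss, i.e., minus the expected utility after playing $a$ at B. The names `TREE`, `PI`, and all helper functions are hypothetical.

```python
from math import isclose, prod

# Toy team decision tree: a child is either a belief name or a terminal utility.
TREE = {"A": {"l": "B", "r": 1.0}, "B": {"x": 3.0, "y": -2.0}}
# Current behavioral strategy of the team (hypothetical numbers).
PI = {"A": {"l": 0.7, "r": 0.3}, "B": {"x": 0.25, "y": 0.75}}


def enumerate_paths(belief="A", prefix=()):
    """Yield every root-to-terminal path together with its terminal utility."""
    for action, child in TREE[belief].items():
        path = prefix + ((belief, action),)
        if isinstance(child, str):
            yield from enumerate_paths(child, path)
        else:
            yield path, child


def uniform_prob(path):
    """Probability of sampling `path` under the uniform behavioral strategy."""
    return prod(1.0 / len(TREE[belief]) for belief, _ in path)


def estimated_losses(path, utility):
    """Eq. (4) on a tree: -u * pi(Ba -> z) / pi_bar(path) for each (B, a) on the path."""
    estimates = {}
    for j, (belief, action) in enumerate(path):
        tail = prod(PI[b][a] for b, a in path[j + 1:])
        estimates[(belief, action)] = -utility * tail / uniform_prob(path)
    return estimates


def exact_loss(belief, action):
    """Minus the expected utility obtained after playing `action` at `belief`."""
    def value(node):
        if not isinstance(node, str):
            return node
        return sum(PI[node][a] * value(child) for a, child in TREE[node].items())
    return -value(TREE[belief][action])


# Expectation of the estimator, computed exactly over all sampled paths.
expected = {}
for path, utility in enumerate_paths():
    for key, estimate in estimated_losses(path, utility).items():
        expected[key] = expected.get(key, 0.0) + uniform_prob(path) * estimate
```

On this instance, `expected[(B, a)]` matches `exact_loss(B, a)` for every belief-action pair: the sampling probability in the denominator cancels the probability of drawing the path, which is the mechanism behind the first claim of Lemma 7.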

6 Average strategy computation

The second task that we address in this work concerns the computation of the average strategy. Such a procedure is fundamental for the algorithm, as the theoretical properties exhibited by MCCFR-DAG guarantee convergence to a TMECor only on average (Proposition 3). In this section, we first discuss the challenges of tracking average strategies and present a simple algorithm that, at each iteration, runs in time linear in the number of DAG nodes visited up to that iteration. Then, in order to reduce the computational burden of such a task, we introduce an unbiased estimator for average strategies and analyze its theoretical properties in terms of variance.

For memory efficiency, we focus on the computation of average realizations over terminal nodes $z \in Z$ .

Exact Computation. Previous works that combine Monte-Carlo sampling with model-free learning in sequential decision processes [6, 15] adopt an efficient procedure for average strategy computation based on a weighted behavioral averaging performed locally at each decision node, with weights equal to the reach probabilities of the decision node itself. While this technique is particularly efficient for tree-shaped decision problems, the DAG structure significantly limits its applicability in our scenario. The main issue is that we need to explore all the paths in the DAG reaching a specific node; hence, in the worst case, we could traverse the whole TB-DAG.

Stochastic Estimation. We propose an estimator that updates only along the sampled path ${\tilde{λ}}_{▲}^{t}$ , which makes it especially efficient to implement in combination with the loss update procedure from Section 5. In particular, for each belief B, we store a vector $c_{B}^{t}$ , indexed by the joint actions available at B, which contains unnormalized play probabilities at B. Every time B is present along path ${\tilde{λ}}_{▲}^{t}$ , we perform the update $c_{B}^{t} (a) \leftarrow c_{B}^{t - 1} (a) + \frac{π_{▲}^{t} (\emptyset \to {\tilde{λ}}_{▲}^{t} B a)}{{\bar{π}}_{▲} (\emptyset \to {\tilde{λ}}_{▲}^{t} B)},$ which corresponds to updating the average strategy with the current strategy at the node. The fact that this update happens only when we sample the path ${\tilde{λ}}_{▲}^{t}$ acts as a stochastic weight, which is corrected by the term ${\bar{π}}_{▲} (\emptyset \to {\tilde{λ}}_{▲}^{t} B)$ . At beliefs B that are not visited, we perform no update, i.e., $c_{B}^{t} \leftarrow c_{B}^{t - 1}$ . The following result follows directly from the definition of the estimator:

Proposition 8. For all beliefs B, joint actions $a$ legal at B, and times T, we have: $\mathbb{E} [c_{B}^{T} (a)] = | Λ_{▲} (\emptyset, B) | \sum_{t \leq T} x_{▲}^{t} (B a),$

where, as before, $x^{t}$ is the TB-DAG form of $π^{t}$ . That is, up to scaling by $T | Λ_{▲} (\emptyset, B) |$ , our estimator $c_{B}^{T}$ is an unbiased estimate of the true average strategy at B. Thus, at test time, our algorithm simply plays the behavioral strategy ${\hat{π}}_{▲}^{t}$ defined by ${\hat{π}}_{▲}^{t} (a | B) \propto c_{B}^{t} (a)$ , or plays uniformly at random if $c_{B}^{t} (a) = 0$ for all $a$ at B.
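The estimator can be sketched in a few lines under simplifying assumptions that are ours and not the paper's: the team's decision problem is a small tree without opponent or chance nodes (so that $| Λ_{▲} (\emptyset, B) | = 1$), the current strategy is kept fixed across iterations, and the counter of every action at a visited belief is updated. All names (`TREE`, `sample_path`, `update_counters`, `average_strategy`) are hypothetical.

```python
import random

# Hypothetical toy decision tree: a child is a belief name or None (terminal).
TREE = {"A": {"l": "B", "r": None}, "B": {"x": None, "y": None}}


def sample_path(rng):
    """Sample a root-to-terminal path with the uniform behavioral strategy."""
    path, belief = [], "A"
    while belief is not None:
        action = rng.choice(sorted(TREE[belief]))
        path.append((belief, action))
        belief = TREE[belief][action]
    return path


def update_counters(counters, path, pi, weight=1.0):
    """Add weight * pi(root -> B a) / pi_bar(root -> B) for every belief B on the path."""
    reach_pi, reach_bar = 1.0, 1.0
    for belief, chosen in path:
        for action in TREE[belief]:
            counters[belief][action] += weight * reach_pi * pi[belief][action] / reach_bar
        reach_pi *= pi[belief][chosen]      # current-strategy reach of the next belief
        reach_bar /= len(TREE[belief])      # sampling reach of the next belief


def average_strategy(counters):
    """Normalize the counters; play uniformly where nothing was recorded."""
    result = {}
    for belief, row in counters.items():
        total = sum(row.values())
        result[belief] = {
            a: (c / total if total > 0 else 1.0 / len(row)) for a, c in row.items()
        }
    return result


def new_counters():
    return {belief: {a: 0.0 for a in actions} for belief, actions in TREE.items()}


pi = {"A": {"l": 0.7, "r": 0.3}, "B": {"x": 0.25, "y": 0.75}}  # fixed toy strategy
rng = random.Random(0)
counters = new_counters()
for _ in range(1000):
    update_counters(counters, sample_path(rng), pi)
estimate = average_strategy(counters)
```

Because the strategy is fixed here, normalizing the counters recovers `pi` itself. With a strategy that changes over iterations, the same counters accumulate the reach-weighted sum that appears in Proposition 8.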

Applying an Azuma-Hoeffding analysis similar to that in the proof of Proposition 2 yields the following result.

Proposition 9. Let ${\hat{x}}^{t}$ be the realization form of ${\hat{π}}_{▲}^{t}$ . Then, for any time T, with probability at least 1 - δ, we have ${\left\| {\hat{x}}^{T} - \frac{1}{T} \sum_{t \leq T} x^{t} \right\|}_{1} \leq 2 {dL}^{2} \sqrt{\frac{2}{T} \log \frac{2 | O_{▲} |}{δ}},$ where d is the depth of the game, and L is the number of paths through ▲’s TB-DAG.

Therefore, if we use the estimated average strategies $\hat{x}$ instead of the true averages, our exploitability increases by at most the right-hand side of the above inequality.

This method of approximating the average strategy is not new: for example, it is implemented in the open-source library OpenSpiel [16] for two-player zero-sum games, that is, for tree-form decision problems. However, to our knowledge, its correctness has not been proven, even in the two-player zero-sum setting. In that setting, L and $| O_{▲} |$ are both bounded by the number of sequences of the player’s strategy space, so the bound is polynomial in the size of the game.

7 Experimental evaluation

In this section we evaluate the performance of our proposed averaging scheme in the context of the MCCFR-DAG algorithm. To do so, we compare the MCCFR-DAG algorithm with exact averaging (only on the explored parts of the tree) against the stochastic averaging scheme. The performance is compared in terms of NashConv, i.e., a measure of the exploitability of the current strategy against a best-responding opponent. This metric is related to the distance of the current strategy from a Nash Equilibrium. In line with previous works in the field, the chosen benchmarks are Kuhn poker [14] and Leduc poker [20], where we allow team collusion. In particular, we consider instances with three or four players (with the first two players on the same team and the remaining players forming the opposing team) and vary the parameters governing the game size. Whenever considering game instances, we will use the acronym pKr to indicate Kuhn poker instances, where p is the number of players and r is the number of card ranks in the game. Similarly, the acronym pLbrs indicates Leduc poker instances, where b is the maximum bet in each betting round, r is the number of card ranks, and s is the number of suits. The experiments were run on a server running Ubuntu 22.04 LTS with a 32-core Intel Xeon E5-4610 v2 CPU and 128 GB of memory. The processes running the experiments were limited to 5 GB of memory.
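For readers unfamiliar with the metric, the following minimal sketch computes NashConv in a two-player zero-sum matrix game, where each player's gain from switching to a best response can be read off the payoff matrix. This is only an illustration of the definition: the team-game experiments require best responses over the teams' correlated strategy spaces, which this toy code does not attempt. The function name `nash_conv` and the rock-paper-scissors payoffs are assumptions made for the example.

```python
def nash_conv(payoff, x, y):
    """NashConv of (x, y) in a zero-sum matrix game where the row player maximizes x^T A y."""
    rows, cols = len(payoff), len(payoff[0])
    # Expected payoff of each pure row against y, and of each pure column against x.
    row_values = [sum(payoff[i][j] * y[j] for j in range(cols)) for i in range(rows)]
    col_values = [sum(x[i] * payoff[i][j] for i in range(rows)) for j in range(cols)]
    value = sum(x[i] * row_values[i] for i in range(rows))
    row_gain = max(row_values) - value   # improvement available to the row player
    col_gain = value - min(col_values)   # improvement available to the column player
    return row_gain + col_gain


# Rock-paper-scissors payoffs for the row player (rows and columns: rock, paper, scissors).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

print(nash_conv(RPS, uniform, uniform))      # equilibrium profile
print(nash_conv(RPS, always_rock, uniform))  # exploitable row strategy
```

A profile has NashConv equal to zero exactly when neither side can gain by deviating, so lower values in the plots indicate strategies closer to equilibrium.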

Figures 3 to 6 show the results of running the algorithms with a budget of at most 10⁶ iterations or a time limit of 10⁴ seconds, and plot the NashConv performance against both time and iteration number. To reduce the effect of the stochasticity of both MCCFR and the averaging algorithm used, we show the average performance across 5 runs. We remark that the information complexity of the team game grows with the number of card ranks, and that the theoretical convergence rate of MCCFR-DAG is $O (b^{kd})$ . Therefore, the larger the number of ranks of a poker instance, the wider the team game under consideration and the harder it is to solve for TB-DAG-based algorithms.

Fig. 3

Average NashConv over 5 runs of MCCFR-DAG with exact average strategy and stochastic estimation technique on 3K3 (Figs. 3a and 3b), 3K4 (Figs. 3c and 3d). The light areas around the plots show the 95% confidence bound on the performance.

Fig. 4

Average NashConv over 5 runs of MCCFR-DAG with exact average strategy and stochastic estimation technique on 3K5 (Figs. 4a and 4b), 3K6 (Figs. 4c and 4d), 3K12 (Figs. 4e and 4f). The light areas around the plots show the 95% confidence bound on the performance.

Fig. 5

Average NashConv over 5 runs of MCCFR-DAG with exact average strategy and stochastic estimation technique on 3L133 (Figs. 5a and 5b), 3L153 (Figs. 5c and 5d), 3L223 (Figs. 5e and 5f). The light areas around the plots show the 95% confidence bound on the performance.

Fig. 6

Average NashConv over 5 runs of MCCFR-DAG with exact average strategy and stochastic estimation technique on 4K5 (Figs. 6a and 6b), 4K8 (Figs. 6c and 6d), 4L133 (Figs. 6e and 6f). The light areas around the plots show the 95% confidence bound on the performance.

The experimental results confirm that stochastic averaging enables faster resolution times, while the per-iteration performance remains comparable between the two averaging schemes. In particular, we highlight the results on large game instances such as 3K12 (Figs. 4e and 4f), where the proposed averaging scheme allows considerable time savings (≈10⁶ vs ≈10⁵ iterations in the same amount of time). This translates to comparable performance being reached roughly an order of magnitude sooner.

A similar trend is observed in Leduc poker games (Fig. 5), which have a deeper game tree than the Kuhn poker instances, while the number of ranks is kept lower. In this case, we can also see the impact of the extra variance that stochastic averaging introduces with respect to exact averaging. This causes an initial unstable phase in which the exploitability increases instead of decreasing; the algorithm recovers from it after some iterations.

On the other hand, having two teams play against each other, instead of a team against a single player, does not impact the performance (Fig. 6). This is in line with the intuition that the information complexity and depth challenges affect the intra-team coordination capabilities rather than the adversarial aspect of the game.

Overall, the NashConv results obtained with respect to the time elapsed highlight the well-known difficulties that sampling-based methods have in computing low-exploitability equilibria, as already noted in [15]. While this makes algorithms such as the ones proposed by [23, 25] a better option for equilibrium computation in the games customarily used for benchmarks, the low resources required by the iterative sampling methods proposed in this paper can be useful to further scale the size of the games addressed. This already happened in two-player zero-sum games, where sampling variants of the CFR algorithm have been the backbone of the Libratus [1] and Pluribus [2] poker bots. MCCFR-DAG can therefore be characterized as a low-resource algorithm for computing TMECors in team games.

8 Conclusions

In this paper we proposed Monte Carlo CFR techniques to compute TMECor strategies in large team games more efficiently. To do so, we formalized suitable techniques for loss sampling, backpropagation, and strategy averaging. After proving guarantees on their complexity and convergence, we empirically showed how such techniques can benefit performance when solving team games.

Our contributions can open new paths for scaling solving techniques to larger instances of team games while maintaining a low computational cost per iteration. A particularly interesting research direction would be to further combine these techniques with the recent column-generation and TB-DAG hybrid approach from [25].

Footnotes

Appendix

Unless P=NP [4, ].

In this work we denote with [T] the first T positive integers.

References

1. Brown, Sandholm. Safe and nested subgame solving for imperfect-information games. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 2017.

2. Brown, Sandholm. Superhuman AI for multiplayer poker. Science 365(6456) (2019).

3. Carminati, Cacciamani, Ciccone, Gatti. A marriage between adversarial team games and 2-player games: Enabling abstractions, no-regret learning, and subgame solving. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of ICML’22, pages 2638–2657, 2022.

4. Celli, Gatti. Computational results for extensive-form adversarial team games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

5. Chu, Halpern. On the NP-completeness of finding an optimal strategy in games with common payoffs. International Journal of Game Theory 30 (2001), 99–106.

6. Farina, Sandholm. Model-free online learning in unknown sequential decision making problems and games. In AAAI Conference on Artificial Intelligence, 2021.

7. Farina, Celli, Gatti, Sandholm. Ex ante coordination and collusion in zero-sum multi-player extensive-form games. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, 2018.

8. Farina, Kroer, Sandholm. Online convex optimization for sequential decision processes and extensive-form games. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI’19. AAAI Press, 2019.

9. Farina, Ling C.K., Fang, Sandholm. Efficient regret minimization algorithm for extensive-form correlated equilibrium. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.

10. Farina, Kroer, Sandholm. Stochastic regret minimization in extensive-form games. In Proceedings of the 37th International Conference on Machine Learning, ICML’20, 2020.

11. Farina, Celli, Gatti, Sandholm. Connecting optimal ex-ante collusion in teams to extensive-form correlation: Faster algorithms and positive complexity results. In International Conference on Machine Learning, pages 3164–3173, 2021.

12. Hart, Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5) (2000), 1127–1150.

13. Jakobsen S.K., Sorensen T.B., Conitzer. Timeability of extensive-form games. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS’16, 2016.

14. Kuhn. A simplified two-person poker. Contributions to the Theory of Games 1 (1951), 97–103.

15. Lanctot, Waugh, Zinkevich, Bowling. Monte Carlo sampling for regret minimization in extensive games. Advances in Neural Information Processing Systems 22 (2009).

16. Lanctot, Lockhart, Lespiau J.-B., Zambaldi, Upadhyay, Perolat, Srinivasan, Timbers, Tuyls, Omidshafiei, Hennes, Morrill, Muller, Ewalds, Faulkner, Kramar, Vylder B.D., Saeta, Bradbury, Ding, Borgeaud, Lai, Schrittwieser, Anthony, Hughes, Danihelka, Ryan-Davis. OpenSpiel: A framework for reinforcement learning in games. CoRR, abs/1908.09453, 2019.

17. McAleer, Farina, Zhou, Wang, Yang, Sandholm. Team-PSRO for learning approximate TMECor in large team games via cooperative reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS’23, 2024.

18. Moravčík, Schmid, Burch, Lisý, Morrill, Bard, Davis, Waugh T.K., Johanson, Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337) (2017), 508–513.

19. Schmid, Burch, Lanctot, Moravčík, Kadlec, Bowling. Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2157–2164, 2019.

20. Southey, Bowling, Larson, Piccione, Burch, Billings, Rayner. Bayes’ bluff: Opponent modelling in poker. 2005.

21. Vinyals, Babuschkin, Czarnecki W.M., Mathieu, Dudzik, Chung, Choi D.H., Powell, Ewalds, Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782) (2019), 350–354.

22. Zhang B.H., Sandholm. Team correlated equilibria in zero-sum extensive-form games via tree decompositions. In AAAI, 2022.

23. Zhang B.H., Farina, Celli, Sandholm. Optimal correlated equilibria in general-sum extensive-form games: Fixed-parameter algorithms, hardness, and two-sided column-generation. In Proceedings of the 23rd ACM Conference on Economics and Computation, EC’22, pages 1119–1120, 2022.

24. Zhang B.H., Farina, Sandholm. Team belief DAG: generalizing the sequence form to team games for fast computation of correlated team max-min equilibria via regret minimization. In Proceedings of the 40th International Conference on Machine Learning, ICML’23, 2023.

25. Zhang, An, Zeng D.D. DAG-based column generation for adversarial team games. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, 2024.

26. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.

27. Zinkevich, Johanson, Bowling, Piccione. Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems 20 (2007).