We study the problem of online interaction in general decision making problems, where the objective is not only to find optimal strategies, but also to satisfy certain safety guarantees, expressed in terms of costs accrued. In particular, we focus on the online learning problem in which an agent has to find the optimal solution of a linear objective. Moreover, the agent has to satisfy a linear safety constraint at each round. We propose a theoretical framework to address such problems and present BAN-SOLO, a UCB-like algorithm that, in an online interaction with an unknown environment, attains sublinear regret of order and satisfies a safety constraint with high probability at each iteration. BAN-SOLO provides a general framework that can be applied to any setting in which estimators of the objective and the cost function are available. At its core, it relies on tools from convex duality to manage environment exploration while satisfying the safety constraint imposed by the problem. To show the applicability of our framework, we provide two game theoretical applications: normal-form games and sequential decision-making problems.
In the recent years, improvements in the field of Artificial Intelligence, more in particular in the subjects of Algorithmic Game Theory (AGT) and Reinforcement Learning (RL), made it possible to achieve outstanding results in various applications. These range from recreative applications, e.g., Chess [19], Texas hold’em poker [7], Go [18], to security applications such as anti-poaching patrolling [20]. The increasing number of successful applications of decision making algorithms to real-world tasks raises fundamental concerns in terms of safety. Indeed, when considering critical tasks with humans in the loop, it becomes of upmost importance to avoid the occurrence of undesirable and potentially dangerous behaviour of the algorithms, especially during the learning process.
Recently, there has been a growing attention on the problem of learning satisfying some constraints during the process. A line of work focuses on satisfying the constraints on average. In particular, the decision process is repeated for T rounds and the goal is to satisfy the constraint in the long run. [24] consider an online convex optimization framework with stochastic constraints and adversarial objective. [24] provide a primal-dual algorithm achieving cumulative regret and constraint violation by assuming Slater’s condition. Moreover, [23] provide similar guarantees under a slightly less stringent version of the Slater’s condition. Other works also consider the case in which the constraints are adversarial, providing best-of-both-worlds guarantees [8, 9]. Other works follow an approach more similar to ours and aim at satisfying the constraints at each round. Most of the works focus on the multi-armed bandit model. In particular, they focus on the problem of maximizing the cumulative reward, while guaranteeing that the expected cost of the arm pulled at each round satisfies cost constraints. In this setting, the learner can converge to an optimal solution of the underling optimization problem without violating the constraints with high probability [2, 16]. Some works study more general settings. For instance, Usmanova et al. [21] consider the problem of optimizing a convex function over unknown linear constraints. Finally, some recent works study the constrained problems related to conditioning decision-making towards human strategies [3, 12].
In this work, we propose a theoretical framework to address the problem of learning optimal solution while satisfying a safety constraint. Differently from previous works, we do not focus on a particular application and we provide a general framework that can be applied to a variety of settings. Indeed, the decision-making model that we adopt can capture many popular strategic scenarios, e.g., sequential decision making, multi-armed bandits, partially observable Markov decision processes. We extend the classic online learning model introducing a safety constraint. In our model, our goal is to maximize an unknown linear objective function. Moreover, we model the safety constraint with a linear cost function that can be suitably specified to represent a particular, domain-dependent safety constraint. Our goal is to achieve sublinear regret with respect to the optimal safe strategy, while guaranteeing that the safety constraint is not violated. This introduce the non-trivial problem of exploring cautiously. In particular, we need to explore new strategies without violating the safety constraint.
Contributions
In this paper, we propose a general framework that even in an unknown environment can achieve performances close to the one of the best safe strategy, i.e., sublinear regret, by repeatedly interacting in an online fashion with it. At the same time, our procedure satisfies the safety constraint during the whole learning process with high probability. The algorithm proposed is called BANdit Safe Online Linear Optimization (BAN-SOLO). The analysis of our algorithm is based on two main results. We show that whenever we can design a sequence of feasibility sets that converges to the set of safe strategies, also the reward achieved playing the sequence of optimal strategies in these feasibility sets converges to the reward of the optimal safe strategy. Moreover, we show that whenever we are able to build a confidence region on the safety constraint that converges to the actual safety constraint, we can design a sequence of feasibility sets that converges to the set of safe strategy. Then, during the online interaction with the environment, BAN-SOLO pursues a twofold objective: (i) minimize the regret against the best possible safe strategy and (ii) always select safe strategies. In order to achieve such a goal, we assume that it possible to use the feedback received from the environment to estimate an high confidence region around the environment parameters. Then, BAN-SOLO exploits the concept of convex dual set of such an high confidence region to achieve regret, while guaranteeing safety with high-probability at each iteration. Finally, we conclude showing that the feedback received from the environment can be used to estimate a high confidence region for the environment model in two applications: normal-form games and sequential decision-making problems.
Comparison with previous work
In this section, we highlight the connection of our work with the growing field of online optimization subject to safety constraints. Most of the literature focuses on the problem of satisfying the constraint in the long run and hence allowing large violations in some rounds [8, 24]. Here, we focus on the arguably more challenging task of satisfying the constraint at each round. This problem has been studied in the multi-armed bandit model [2, 16] and in some specific online convex optimization settings [21]. Differently from previous work, here we provide a general meta-algorithm for online linear optimization. Our meta-algorithm works in any setting and it only requires an oracle that outputs a sequence of confidence regions on the safety region that converges to the actual the safety set. As we show in two examples of applications, in many settings this sequence of confidence regions can be usually designed with standard techniques. Under this assumption our meta-algorithm achieves an optimal regret bound satisfying the constraint at each round with high probability. This matches the state-of-the-art results for less general models like the ones in [2, 16].
Preliminaries
In this section we will define the model of interaction that is adopted throughout the rest of the paper, and then we will review some concepts from variational analysis.
Decision making with bandit feedback and costs
We consider a general decision making scenario in which the set of strategies available to the agent is a bounded and convex set . After selecting a strategy x ∈ X, the agent receives a utility u (x) and pays a cost c (x). We focus on the case in which the utility and cost functions are linear functions with unknown parameters y ★ and , respectively, i.e., such that u (x) = 〈x, y ★〉 and . Intuitively, y ★ and are the optimal parameters that our learning algorithm try to learn. We remark that the utility and cost functions are stationary and do not depend on the time step.
At each time instant t ∈ [T],1
1
the agent selects a strategy xt ∈ X and receives a partial feedback ℓt from the environment. We make no assumption on the specific feedback. We only need to assume that this feedback provides sufficiently information to derive some confidence regions, i.e., regions to which the parameters belong with high probability. We remark that this feedback is dependent on the application, and we will provide some examples in Section 6. At each round t ∈ [T], the objective of the agent is to select strategies that maximize her utility, while being safe. The concept of safety that we adopt in this work is expressed in terms of costs. In particular, a strategy is said to be safe if it guarantees that the cost is within an interval C : = (- ∞ , β] ⊂ ℝ. Thus, the set of safe strategies is defined as X★ : = {x ∈ X ∣ c (x) ∈ C}.
The performances of the agent are evaluated in terms of cumulative regretRT which is defined as
We demand that the cumulative regret is sub-linear in T, while , for all t, with high probability. We assume that X ★ is not empty (otherwise the problem is trivially impossible to solve) and that we have a starting set X0 ⊂ X ★ for the safe exploration. In particular, we assume that there exists x ⋄ ∈ X such that for all .
Variational analysis
First we are going to recall the definition of Hausdorff distance which defines a metric between compact sets. We will use extensively this notion throughout the paper. Let d (· , ·) be the Euclidean Distance. Then,
Definition 1. Let A and B be two non-empty subsets of a metric space (M, d). The Hausdorff distance between A and B is defined as
where Aϵ is the ϵ-flattening of the set A defined as
We exploit this definition to define the rate of convergence of a sequence of sets.
Definition 2. We write At→K (t) B when a sequence of sets {At} t∈Naturals converges to B with rate K (t) = o (t), i.e., dH (At, B) ≤ K (t), for all t.
2
We will also use the notion of polar sets of convex polytopes. Namely, we will use the fact that any bounded convex polytope can be defined in two equivalent ways: as a convex hull of a finite set of points () and as the intersection of a finite number of closed half-spaces (). Indeed, starting from the definition a (respectively ) it is always possible to obtain the polar set (indicated with the superscript ∘) in terms of (respectively ).
Lemma 1. [Polar Set [25]] Given any set of p points, {a1, . . . , ap} with ai ∈ ℝn ∀i, if A is the n × p matrix whose ith column is ai, then
Conversely, we have that
assuming the left hand side is bounded.
Additional Notation. We indicate the convergence of a sequence of real numbers {αt} t≥0 to α★ with |αt - α★| ≤ K (t) as αt→K (t) α★. We denote with the epigraph of a function , where .
3
Finally, we use δX (x) to denote a function that is 0 if x ∈ X and +∞ if x ∉ X.
Overview of the sequential decision model.
Structure of the paper
The remaining of the paper is organized as follows. In Section 3, we assume to have an oracle returning at each time t ∈ [T] a feasibility set Xt ⊂ X ★ such that, for all x ∈ Xt, it holds with high probability. Moreover, we assume that the oracle returns increasingly better feasibility sets Xt as the time t progresses, and more experience is collected. Formally, we use Definition 2 of set-convergence and ask that Xt→K (t) X ★. A schematic representation of this model can be found in Fig. 1. We show that when Xt→K (t) X ★ and we know y ★, we obtain a convergence rate of to the optimal value , by playing .
4
Next, in Section 4, we remove the assumption that we know the sequence of feasibility sets Xt. In particular, we will propose a procedure that returns safe feasibility sets Xt, by exploiting a suitable high probability region Wt over the vector . This construction of Xt exploits the polar set of the high probability region Wt. We will prove that such construction generates a set Xt that satisfies the assumptions made in Section 3. In Section 5, we describe our meta-algorithm BAN-SOLO that exploit the previous results to obtain the desired guarantees on the regret and constraint violation. In particular, the algorithm is based on Lin-UCB, that achieves sub-linear regret and does not violate the cost constraint at each round with high probability. The meta-algorithm assumes to have an oracle that builds confidence regions around y ★, and around . In Section 6 we show how to design this oracle for some applications.
Rate of convergence for linear programs
In this section, we prove the following theorem which relates the Hausdorff distance of the approximate feasibility set Xt to X ★ and the difference in the objective of the respective optimization problems.
Theorem 1. Let Xt ⊂ X ★ and Xt→K (t) X ★ for some function K (t) = o (t) and bounded set X ★ ⊂ ℝn. Then, it holds
where .
For the ease of presentation, before proving the theorem we introduce the following functions:
and their epigraphs Et = epi (gt), E★ = epi (g★). Moreover, H+ is the epigraph of the linear function -〈 · , y ★〉.
5
Figure 2 provides a schematic representation of the epigraphs of the function involved.
In Section 3.1 we will give an informal sketch of the proof of Theorem 1, while in Section 3.2 we provide a formal proof of the theorem.
Schematic representation of the epigraphs Et and E★ of the functions gt and g★.
Intuition on the proof of Theorem 1
In this section, we provide a sketch of the proof of Theorem 1 that gives an intuition of the techniques used. The idea of the proof can be divided in two main points.
Relate the Hausdorff distance of Xt and X ★ to the Hausdorff distance of the epigraphs of gt and g★.
Relate the Hausdorff distance of the two epigraphs to the difference in values of their optima.
Intuitively, we can understand why this point of view is convenient. The epigraphs of gt and g★ differ only in terms of the epigraph of the indicator functions of the feasibility sets δXt and δX★. Indeed, the epigraph Et and E★ can be expressed as intersection of the half-space H+ and the epigraph of the indicator functions δXt and δX★, respectively. This observation suggests that that we can provide the relation in (i) by proving that dH (Et, E★) ≈ dH (Xt, X ★). Moreover, we can see that the optimal value of functions gt and g★ can be written as the distance of their epigraphs Et and E★ to the optimal value xt ∈ arg inf gt and x ★ ∈ arg inf g★, when xt and x ★ are thought of as embedded in ℝn +1. This helps solving point (ii) of the proof, that informally states that inf gt - inf g★ ≈ dH (Et, E★). Finally, by combining (i) and (ii), we can conclude that inf gt - inf g★ ≈ dH (Xt, X ★).
Proof of Theorem 1
In this section, we are going to prove all the formal results needed to complete the informal discussion on the proof of Theorem 1. The inequality that relates the distance between the feasibility sets Xt and X ★, and between the epigraphs of gt and g★ is formalized in the following lemma.
Lemma 2. Given a t ∈ [T], and two epigraphs Et and E★, it holds.
Proof. An equivalent definition of the Hausdorff distance between Xt and X★ is:
Similarly, an equivalent definition of the Hausdorff distance between Et, and E★ is:
We start by showing that
Let , where x ∈ Xt and . We show that there exists a such that . This is sufficient to prove Equation (3). We start noticing that since , there exists an such that
where the inequality holds by Equation (1). Moreover, since (x, v) ∈ Et, it holds v ≥ - 〈x, y ★〉. Hence, taking , we have that
where in the inequality we use Cauchy–Schwarz inequality. Hence, and we can conclude that
where in the second inequality we use the triangular inequality. Since the inequality above holds for each z ∈ Et, it follows that Equation (3) holds. Using a similar argument, we can prove the symmetric case showing that
Finally, exploiting Equation (2) we can conclude the proof of the lemma. This result states that we can bound the distance between the epigraphs Et and E★ using the distance between the sets Xt and X★, only suffering an increase in the Hausdorff distance up to a multiplicative factor. Moreover, the following lemma relates the distance in epigraph and the distance in optimal values of the functions.
Lemma 3. Given a t ∈ [T], and two epigraphs Et and , it holds
Proof. We start restating a known equivalent definition of Hausdorff distance. In particular, it is well known that (see, e.g., [17]) for any two sets A and B it holds:
which in equivalent to
Let x (1) ∈ arg inf gt. Then, it is easy to see that it holds inf gt = d (x (1), Et) where, with abuse of notation we denoted with x (1) the vector (x (1), 0). Similarly, let x (2) ∈ arg inf g★. We have that inf g★ = d (x (2), E★). Consider now the following inequalities:
where the first inequality comes from the optimality of x (1) for gt and the second one by Equation (1)1. Similarly, we can prove that
This concludes the proof of the statement. □
This result can be seen as a case of sensitivity analysis of linear programs [14] in which we exploited the Hausdorff distance between their epigraphs. Finally, by using Lemma 2 and Lemma 3 we can easily prove Theorem 1.
By Lemma 2, we know that, for each t ∈ [T], it holds
since by assumption, we have that dH (Xt, X ★) ≤ K (t). Then, from Lemma 3 we have that:
The theorem follows from the fact that and
Exploiting convex duality to deal with uncertainty on
In this section we drop the assumption made in the previous section about the existence of an oracle that provides a sequence of feasibility sets Xt and propose an explicit construction of such region. More precisely, we loosen the assumptions and we assume only to have a high confidence region Wt for the vector . In particular, we assume to have an high confidence region Wt that converges to the singleton with some convergence rate K (t) = o (t), namely . Then, we define the sets Xt and X ★ as polarization of the sets Wt and W★, respectively. Formally:
We remark that the existence of x ⋄ ensures that Xt is non-empty for all t ∈ [T]. As we will show in the following sections, we will use the partial feedback ℓt to estimate by building a sequence of regions Wt in which such parameter lies with high probability. Under these assumptions, we can state the following result.
Theorem 2. Assume that Wt is a bounded polytope so that, with probability at least 1 - δ we have , and that for some function K (t) = o (t). Moreover, let Xt and X ★ be as in Equations (6). Then,
where
and
Overview of the sequential decision model with estimator of the high probability region on .
Proof. In order to show the convergence of the LP solution we need to prove that Xt→R2K (t)/|β|X ★ and then apply Theorem 1 to obtain the statement. Thus, we need to study the relation between the convergence rates of Wt and of Xt. Since the set Wt is assumed to be a bounded polytope, it can be expressed as the convex hull of the set of vertices that define the high probability region for . Formally, , where . We will exploit [11, Lemma 3], which gives bounds on the Hausdorff distance between sets and their corresponding polar sets. Formally it states that:
where R is the diameter of X ★, defined as the radius of the smallest ball centered in zero that contains X ★. Thus, in order to bound the Hausdorff distance between Xt and X ★ we have to study the distance between their polar sets. Given Lemma 1 it is possible to derive an expression for the polar sets Xt, ∘ and X ★, ∘:
where vert (X ∘) is the set of vertices of X ∘. An equivalent definition of the Hausdorff distance between X ★, ∘ and Xt, ∘ is:
Let us consider . The sup can be tackled by considering exclusively the vertices of X ★, ∘. Thus:
where is the set of vertices of the convex set X ★, ∘. Considering that the two sets X ★, ∘ and Xt, ∘ have the vertices of X ∘ in common, we can ignore such vertices as they are characterized by 0 distance. Hence:
This is equivalent to . We can upper bound the distance with the distance computed with respect to the vertices. Hence we have that:
Now we can bound the last term as follows:
since by assumption we had that dH (Y ★, Yt) ≤ K (t) and Y ★≔ {y *}. It follows that:
Similarly, it is possible to derive the bound
Thus:
Now we can plug Equation (8) into Equation (7) and prove that . By using Theorem 1 we can conclude the proof. □
Analysis of the Algorithm
In the previous section, we show how the assumption of Section 3 of having an oracle returning the feasibility set Xt can be substituted by a high probability region Wt for . The last assumption that we need to remove is the knowledge of the parameter . While up to now the results were given for a known y ★, in this section we will provide an algorithm that works using as input only a high probability region Yt. This algorithm achieves sub-linear regret assuming that that and . Moreover, it never violates the safety constraint with high probability. In order to obtain these guarantees, we adopt an approach inspired from linear multi-armed bandit problems [1]. The pseudo code for the BAN-SOLO algorithm is provided in Algorithm 1. It is based on the optimism in face of uncertainty principle, and, thus, it selects a strategy xt ∈ Xt that maximizes the payoff 〈x, y〉 under the assumption that, for every x ∈ Xt, strategy y is an optimistic estimate of y ★ taken from the confidence region Yt, that is, y ∈ Yt maximizes the same player i’s payoff 〈x, y〉 (Line 4). The only assumption is that one can build the high confidence regions Yt and Wt (Line 2). In Section 6 we will provide some applications and show how to build this application-specific high confidence region.
Algorithm 1 BAN-SOLO
1: fort ∈ {1, …, T}
2: Build confidence regions Yt and Wt from past feedback {ℓ 1, …, ℓ t-1}
3: Build linear set Xt by exploiting Wt(Section 4
5: Play strategy xt
6: Observe feedback ℓt from the environment
end for
The following theorem provides a high-probability sub-linear regret guarantee for BAN-SOLO. Moreover, it shows that the algorithm does not violate the constraint with high probability.
Theorem 3. Given a δ > 0, if , , and both and y ★ ∈ Yt holds with probability at least 1 - δ, then we have that
where . Moreover, with probability at least 1 - δ, the cost constraint is not violated at any round.
Proof. We will use the following instantaneous regret decomposition:
which decompose the regret in the regret due to the fact that we have to guarantee safety () and y ★ is unknown (). Now we are going to consider the two terms of the instantaneous regret separately.
Bounding : From Theorem 2 we have that .
Bounding : Define
Then, with probability at least 1 - δ we have that:
where the first inequality follows from the fact that (xt, yt) is a pessimistic estimate of the payoff, the second inequality comes from the Cauchy-Schwartz inequality while the last inequality come from the fact that . D is defined as , which upper bounds ||x||2 for all x ∈ Xt as Xt ⊆ X ★.
Hence to conclude we have that:
which implies that
This concludes the proof. □ Note that in Theorem 3 the parameter K (δ) is written as a function of the parameter δ, because, in general, to guarantee with higher probability (smaller δ) one needs a larger Wt (larger K). Moreover, notice that BAN-SOLO needs to solve a linearly-constrained bilinear optimization problem at each iteration, which can be done efficiently by state-of-the-art solvers. Theorem 3 shows that one can achieve sub-linear regret just by having good (e.g., of order ) estimators for the parameters y ★ and . In the following section, we show applications of BAN-SOLO to common strategic scenarios, like normal-form games and sequential decision making problems. We show that in all these cases, the linearly-constrained bilinear optimization problem can be solved in polynomial-time.
Applications
In this section, we provide two applications of our framework. In particular, we show how to build the confidence regions Yt and Wt in Normal-Form Games and in a sequential decision making problem. As we show Section 5 this is sufficient to apply BAN-SOLO to these problems.
Normal-form games
A normal form game against a stochastic player is the simplest case of a decision making process, in which we have a single initial state, N2 terminal states, where N is the dimension of the action space. Indeed, we have a stochastic adversary that picks an action yt ∈ {e1, …, eN}, with a categorical distribution defined by y ★ ∈ ΔN, where ΔN is the N - 1 dimensional simplex. In turn the agent picks an action xt ∈ {e1, …, eN} and gets a utility of 〈xt, Uyt〉 where U ∈ ℝN ×N is the payoff matrix and a cost of 〈xt, Byt〉 where is the costs matrix. In this setting, the feedback ℓt correspond to the actions yt played by the adversary. Thus by noticing that y1, …, yt are i.i.d. samples from y ★, one we can build a confidence polytope for y ★, through the concentration inequality of [10], that states that for any δ ≤ 3e-4N/5 we have that with probability at least 1 - δ, lies in the set: where is the empirical mean after t turns. Finally, Wt can be defined similarly.
We conclude the section showing how the bilinear optimization problem in Line 4 of Algorithm 1 can be solved in polynomial time in this specific setting. Indeed, since x is always positive the optimal value of y is always .
Sequential decision making
Not in all application we have that the feedback ℓt is completely informative on the whole decision making problem. An example of when this does not happen are sequential decision making problems (SDM) [13]. A SDM against a stochastic adversary, is a special case instance of our model that allows to model imperfect information game such as Poker [7]. In general, SDM model those situations in which the agent and the opponent choose actions sequentially and at the end of the interaction receive a utility and cost values which depend on the terminal state reached. The strategy spaces X and Y can be described by treeplexes [15], which allow us to have a linear structure for the payoffs and costs, as in the case of NFG described above. In particular, the sets X and Y can be defined as the polytopes X≔ {x ∣ Fxx = fvecx} and Y≔ {y ∣ Fyy = fvecy}, where the matrices Fx, Fy and the vectors fvecx, fvecy are specified in order to capture the sequence-form representation [22] of the agent’s and adversary’s decision making process, respectively. Intuitively, each entry of a strategy x ∈ X (respectively y ∈ Y) models the probability with which the agent (respectively the opponent) plays a specific sequence of actions. This formulation allows us to specify the utility and cost functions as the following linear functions: u (x) = 〈x, Uy ★〉 and c (x) = 〈x, Cy ★〉, where y ★ is the strategy played by the opponent and U and C are suitably defined matrices that specify the utility and cost values associated to each terminal state of the sequential interaction. The main aspect that differentiates SDM from NFG lies in the fact that in the case of SDM, at each turn, the feedback ℓt = (yt, ωt) contains information only on those states actually traversed during the sequential interaction. This implies that at each round t, the high-confidence regions Yt and Wt can be updated only on such states. The problem of constructing the sets Yt, Wt is called Opponent Modeling and as shown for instance in [6, Theorem 1], [4, Lemma 2] and [5, Lemma 8] that both Yt and Wt can be obtained from past feedback {yt, ωt} t. Such high-confidence sets are obtained from unbiased estimators , of y ★ and ω ★ and are shaped as and , where, under strict feasibility-like assumptions, . The definition of Yt and Wt can be exploited in order to efficiently solve the bilinear optimization problem in Algorithm 1. In particular, notice that the set Xt is such that which trivially satisfies for some constant K. Furthermore, for each x ∈ X it holds that , which allows us to write the bilinear program in Algorithm 1 as , which is a linear program and thus can be solved efficiently. We remark that, for general sets Xt and Yt, solving the bilinear program is NP-hard while SDM, as well as NFG, constitute a particular case in which we are able to efficiently compute an optimal solution.
Conclusions and future works
In this work, we study the problem of playing safely decision making problems with bandit feedback and linear cost and payoff functions. We first derived a general result on the the sensitivity of a linear optimization problem based on their feasibility set and their Hausdorff distance. Then, we showed how such result can be used in our case, by exploiting the polar set formulation that follows from the safety constraints. Finally, we proposed BAN-SOLO, a Lin-UCB algorithm, which guarantees sub-linear regret and, at the same time, the satisfaction with high probability of the safety constraints during the entire learning process.
There are several direction in which we can extend our model. For instance, it would be interesting to study settings with more than one safety constraint. Moreover, we are interested in studying more general constraints, e.g., constraints on the objective function or general convex constraints on the strategy set.
Footnotes
Acknowledgments
This paper is supported by PNRR-PE-AI FAIR project funded by the NextGeneration EU program.
In this work we let [N] be the first N natural numbers.
With slight abuse of notation, we use At→K (t) b instead of At→K (t) {b} when bis a single element.
is the extended real line that contains ±∞.
Note that it is customary in sensitivity analysis to investigate convergence of the inf of an objective function f. It is possible to trivially obtain our case simply by considering a different objective function g = - f.
Note that H+ is also a half-space.
References
1.
Yasin Abbasi-Yadkori, Dávid Pál, Csaba Szepesvári, Improved algorithms for linear stochastic bandits, Advances in Neural Information Processing Systems24 (2011), 2312–2320.
2.
AmaniS., AlizadehM., ThrampoulidisC. Regret bound for safe gaussian process bandit optimization, In L4DC, pages 158–159, 2020.
3.
Nolan Bard, Jakob FoersterN., Sarath Chandar, Neil Burch, Marc Lanctot, Francis SongH., Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hugheset al.
The hanabi challenge: A new frontier for ai research, Artificial Intelligence280 (2020), 103216.
4.
Martino Bernasconi, Federico Cacciamani, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti, Francesco Trovò Safe learning in tree-form sequential decision making: Handling hard and soft constraints, In International Conference on Machine Learning, pages 1854–1873. PMLR, 2022.
5.
Martino Bernasconi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti, Francesco Trovò, Sequential information design: Learning to persuade in the dark,, Advances in Neural Information Processing Systems35 (2022).
6.
Martino Bernasconi-de-Luca, Federico Cacciamani, Simone Fioravanti, Nicola Gatti, Alberto Marchesi, Francesco Trovò, Exploiting opponents under utility constraints in sequential games,, Advances in Neural Information Processing Systems34 (2021).
7.
Noam Brown, Tuomas Sandholm, Superhuman ai for heads-up no-limit poker: Libratus beats top professionals, Science359 (2018), 418–424.
8.
Matteo Castiglioni, Andrea Celli, Christian Kroer Online learning with knapsacks: the best of both worlds, In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17–23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 2767–2783. PMLR, 2022. URL https://proceedings.mlr.press/v162/castiglioni22a.html
9.
Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Giulia Romano, Nicola Gattiet al. A unifying framework for online optimization with long-term constraints, In Thirty-sixth Conference on Neural Information Processing Systems, pages (2022), 1–30.
10.
Luc Devroye The equivalence of weak, strong and complete convergence in l1 for kernel density estimates, The Annals of Statistics, pages 896–904, 1983.
11.
Lutz Dumbgen, Gunther Walther, Rates of convergence for random approximations of convex sets, Advances in Applied Probability28(2) (1996), 384–393.
12.
Meta Fundamental AI Research Diplomacy Team (FAIR)Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Huet al. Human-level play in the game of diplomacy by combining language models with strategic reasoning, Science378(6624) (2022), 1067–1074.
13.
Gabriele Farina, Robin Schmucker, Tuomas Sandholm, Bandit linear optimization for sequential decision making and extensive-form games, In Proceedings of the AAAI Conference on Artificial Intelligence35 (2021), 5372–5380.
14.
Julia HigleL. and Stein WallaceW., Sensitivity analysis and uncertainty in linear programming, Interfaces33(4) (2003), 53–60.
15.
Samid Hoda, Andrew Gilpin, Javier Pena, Tuomas Sandholm, Smoothing techniques for computing nash equilibria of sequential games, Mathematics of Operations Research35(2) (2010), 494–512.
16.
MoradipariA., ThrampoulidisC., AlizadehM. Stage wise conservative linear bandits, In NeurIPS, pages 11191–11201, 2020.
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Boltonet al. Mastering the game of go without human knowledge, nature550(7676) (2017), 354–359.
19.
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepelet al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play, Science362(6419) (2018), 1140–1144.
20.
Milind Tambe Security and game theory: algorithms, deployed systems, lessons learned, Cambridge University press, 2011.
21.
Ilnura Usmanova, Andreas Krause, Maryam Kamgarpour Safe convex learning under uncertain constraints, In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2106–2114. PMLR, 16–18 Apr. 2019, URL https://proceedings.mlr.press/v89/usmanova19a.html
22.
Bernhard Von Stengel, Efficient computation of behavior strategies, Games and Economic Behavior14(2) (1996), 220–246
23.
Xiaohan Wei, Hao Yu and Michael NeelyJ., Online primal-dual mirror descent under stochastic constraints, Proceedings of the ACM on Measurement and Analysis of Computing Systems4(2) (2020), 1–36.
24.
Hao Yu, Michael Neely, Xiaohan Wei, and Online convex optimization with stochastic constraints, Advances in Neural Information Processing Systems30 (2017).
25.
Günter ZieglerM. Lectures on Polytopes, volume 152, Springer Science & Business Media, 2012.