Abstract
The majority of studies on international conflict escalation use a variety of measures of hostility including the use of force, reciprocity, and the number of fatalities. The use of different measures, however, leads to different empirical results and creates difficulties when testing existing theories of interstate conflict. Furthermore, hostility measures currently used in the conflict literature are ill suited to the task of identifying consistent predictors of international conflict escalation. This article presents a new dyadic latent measure of interstate hostility, created using a Bayesian item-response theory model and conflict data from the Militarized Interstate Dispute (MID) and Phoenix political event datasets. This model (1) provides a more granular, conceptually precise, and validated measure of hostility, which incorporates the uncertainty inherent in the latent variable; and (2) solves the problem of temporal variation in event data using a varying-intercept structure and human-coded data as a benchmark against which biases in machine-coded data are corrected. In addition, this measurement model allows for the systematic evaluation of how existing measures relate to the construct of hostility. The presented model will therefore enhance the ability of researchers to understand factors affecting conflict dynamics, including escalation and de-escalation processes.
Introduction
Despite the existence of a relatively large body of theoretical and empirical research on conflict escalation, there is still no consensus among international relations scholars on why some interstate disputes lead to war while others do not. Part of the explanation for this issue is that the measures currently used in the quantitative interstate conflict literature are ill suited to the task of identifying consistent predictors of conflict escalation. In this article, I provide a theoretically motivated measurement model that enhances researchers’ ability to explain conflict processes. In particular, I define interstate conflict escalation as an increase in the level of hostility between countries involved in a militarized conflict. This approach requires a granular and validated measure of hostility – a variable we cannot observe directly, but manifestations of which we can. In this article, I create a latent measure of interstate hostility by constructing a Bayesian ordinal item-response theory model using conflict events data, including the Dyadic Militarized Interstate Disputes (MID) (Maoz et al., 2019; Palmer et al., 2015) and Phoenix political event datasets (Althaus et al., 2017).
This project makes several contributions to the literature. First, it introduces a more precise and granular measure of hostility, as the model combines the accuracy of the expert-coded MIDs and the granularity of the Phoenix event datasets. By capturing the underlying tension in the relationship between states, this novel measure can help to answer theoretical questions related to conflict dynamics, including both escalation and de-escalation of interstate conflicts. In addition, the model employed in this article allows for the systematic evaluation of how existing measures relate to the construct of hostility. By integrating some of these measures, the model enables one to make inferences about their quality. Moreover, my new measure of hostility incorporates uncertainty inherent in the latent variable, which has been largely ignored by existing measures of conflict.
Second, the presented model addresses the problem of reporting bias in the machine-coded event data through the use of a time-varying intercept structure and the use of human-coded data as a benchmark against which biases in machine-coded data are corrected. The proposed solution can be applied to almost any measurement problem that involves the use of human- and machine-coded event datasets.
Conflict escalation: The need for a new measure
The concept of escalation is central to numerous theories of conflict including explanations for the democratic peace (Maoz & Abdolali, 1989; Senese, 1997; Dixon & Senese, 2002), deterrence theory (Huth & Russett, 1988; Geller, 1990; Brams & Kilgour, 1987), bargaining theory (Fearon, 1994; Schelling, 1960; Snyder & Diesing, 1977), the steps-to-war explanation (Senese & Vasquez, 2008; Vasquez, 1987; Vasquez & Henehan, 2001; Gibler, 1997), and power transition theory (Organski & Kugler, 1981; Lemke & Reed, 1996). Despite its theoretical centrality, relatively few studies have endeavored to empirically measure or analyze escalation, choosing instead to focus on the onset of militarized hostilities. As a result, little is known about what distinguishes the conflicts that escalate to wars from those that stay at the same level of hostility or de-escalate.
One reason for this is a lack of consensus among international conflict scholars on how to properly measure conflict escalation. The consequences of this disagreement were illustrated in a study conducted by Braithwaite & Lemke (2011), who compare a variety of measures in their test of five correlates associated with conflict escalation: regime type, issue at stake, satisfaction with status quo, power preponderance, and joint alliance membership. They use six measures of conflict escalation, which are based on the information from the Militarized Interstate Dispute (MID) dataset and include: reciprocation of hostilities; use of force; mutual use of force; and the number of fatalities with thresholds at 0, 250, and 1,000 battle-deaths. They find that among the five possible causes of escalation, only territory is a consistent predictor of the dependent variable. Their findings underscore the absence of a conceptually precise and validated measure of conflict escalation as a significant barrier to our understanding of escalation processes.
In order to address this issue, I introduce a more granular measure of hostility that can be used as a foundation for more precise analyses of escalation processes, which can be tracked by identifying changes in the level of hostility over time. The idea that escalation is an increase in the level of hostility is not new in the conflict literature. In fact, numerous studies operationalize escalation using the level of hostility in a dispute (Palmer, London & Regan, 2004; Bueno de Mesquita & Lalman, 2008; Schultz, 2001). The majority of existing measures, however, are based on states’ behavior once militarized conflict has already started (and often when the use of force has already occurred), thus ignoring states’ interaction before militarization of the dispute. As a result, we are not aware of the processes that lead from a non-militarized threat or accusation to armed conflict. This can be important, for example, in the case of rivalry, when states perceive each other as enemies and the source of a threat that is likely to be militarized in the future (Thompson, 2001). Hostile interactions short of the use of force can be useful in identifying the reasons why some militarized disputes lead to war, but also why some conflicts become militarized while others stay at the same level or get resolved. Furthermore, current measures of hostility lack the granularity necessary for analyzing precise changes in conflict dynamics. The reason for this is the source of the data. As the current hostility measures are constructed using highly aggregated human-coded datasets, they miss information that is likely to be captured in machine-coded data. Machine-coded datasets, however, tend to be noisy and to have temporal and spatial biases. The model presented here provides a solution for this trade-off between granularity and bias by combining the two types of data, using human-coded data as a benchmark against which biases in machine-coded data are corrected.
Current measures of hostility
MID hostility scale
Maoz (1982) converted the COW ordinal hostility scale into a 14-category scale interval measure of dispute severity, with the threat to blockade as the least severe and war as the most severe actions. In their attempt to construct more precise indicators of dispute severity that would capture low-level conflicts, Diehl & Goertz (2001) came up with their own interval-level 200-point measure of dispute severity, based on the level of hostility and number of fatalities provided by the MID (Jones, Bremer & Singer, 1996) and COW datasets (Sarkees & Wayman, 2010). The distinct feature of this measure is the fact that wars and non-war MIDs are scaled together, thus providing a range of levels of severity even among wars (Diehl & Goertz, 2001). Finally, the concept of crisis severity is also embedded in the ICB dataset, in which crises are described using seven dimensions: (1) source or trigger mechanism, (2) gravity, (3) complexity, (4) intensity, (5) duration, (6) communication pattern, and (7) outcome (Brecher, 1977).
Diehl & Goertz (2001) acknowledge, however, that the existing measures of hostility/dispute severity are often crude and do not allow researchers to make any inferences about relatively small changes in conflict dynamics. The example they discuss is rivalry behavior. As the militarized dispute data provide information about rivalry only at the time of armed conflict, the use of finer-grained data can provide a more precise picture of rivalry (Diehl & Goertz, 2001: 265).
A new measure of hostility
In order to introduce my measure of hostility, I deploy Goertz’s (2006) framework, according to which a concept can be described at theoretical, ontological, and operationalization levels. At the theoretical level, I define hostility as the enmity directed from one state to another and the level of hostility as the intensity of this enmity. For example, a high level of hostility is likely to be reflected in the relatively high frequency and/or intensity of aggressive interactions between states, while low levels of hostility might be reflected in the absence of any conflicts.
At the ontological level, I decompose the concept into its constitutive elements. Given that hostility represents the nature of states’ relationships, it can be observed only through the expressions of this relationship. Therefore, the elements are hostile rhetoric, such as threats to use force, or hostile behavior (e.g. shows of force or attacks). Finally, at the operationalization level or, in other words, data/indicator level, I operationalize hostility as a latent trait manifested through conflict events.
My conceptualization of hostility is thus different from the most widely used measure adopted by the Correlates of War (COW) project. In its common use, scholars deploy the MID data to identify threats, displays, and uses of military force, assuming that each corresponds to a greater degree of hostility. An implicit assumption of the COW hostility scale is that certain types of actions are inherently more hostile, or more reflective of greater degrees of enmity, than others. However, while higher levels of hostility are more likely to be reflected in the use of force, such as an attack or the beginning of war, the relationship between the actual level of hostility and a conflictual action might be context dependent. While the assumption that material actions are more hostile than verbal ones may generally be accurate, there are also contexts in which this is unlikely to be the case. A threat to deploy chemical, biological, or radiological weapons is likely to reflect greater enmity than a display of military force through the deployment of patrol boats in disputed waters, yet the current MID hostility scale would rank the latter action more highly than the former. I make no assumptions about the ranking of conflict actions in developing my measure. More importantly, however, I believe there is variation in the level of hostility among the actions that involve the use of force, and this variation is omitted in the COW’s hostility measure.
To summarize, the assumptions I make in developing my measure of hostility are the following. First, as hostility is a part of an underlying relationship between states, it is a continuous trait, which we can observe only at certain points of time through ‘hostile’ actions, such as a blockade or an attack on one state by another. Second, while hostility is directed from one state to another, it does not need to be reciprocated. In this sense it is dyadic without necessarily being symmetric. In order to capture this asymmetry, I focus on directed dyads. In other words, I allow the level of hostility of the United States to Russia to be different from the level of hostility of Russia to the United States. Third, I assume that hostility is a unidimensional trait, which means it has a single underlying dimension and can be captured using a single measure. For instance, at any given time a dyad can be either more hostile or less hostile. Fourth, hostility is a latent trait, meaning that we cannot observe or measure it directly but we can infer its level based on its manifestations – conflictual behavior. While the idea that disputes are manifestations of states’ hostility is not new in the conflict literature (Klein, Goertz & Diehl, 2006; Zinnes & Muncaster, 1984), this project is among the first that directly incorporates this theoretical assumption into an empirical model. Finally, as hostility is a latent trait, there is uncertainty associated with it. The model of hostility presented here allows researchers to incorporate this uncertainty into their statistical analysis. The model proposed in this article does not look at the effect of hostility on the probability of conflict but rather focuses on the information which states’ conflict behavior or lack thereof provides. 
The goal of the model is to use states’ actions, including those taken in the context of a conflict, in order to infer the level of hostility within a dyad at different points of time and to facilitate the analysis of conflict dynamics.
Machine-coded data and temporal bias
A potential solution to the problem outlined in the previous section is the use of machine-coded political event datasets. One of the most prominent examples is the Phoenix dataset (Althaus et al., 2017), which provides information on states’ cooperative and conflict behavior, including conflicts short of the threshold for militarized interstate disputes. However, using this dataset has several possible limitations. First, machine-coded data are often noisy. For instance, D’Orazio et al. (2016) illustrate that the ability of […]
Figure. The number of material conflicts as reported in the Phoenix dataset
Data variables and description
Model
My model assumes that hostility is a unidimensional trait that can be measured using observed outcomes. I employ an item-response theory (IRT) model, which is a type of latent variable model used to generate estimates of a latent trait of interest (hostility) by combining information from observable items or manifest variables (conflict events). It has been used increasingly in political science, enabling researchers to estimate a number of unobservable concepts (Jackman, 2009; Reuning, Kenwick & Fariss, 2019). Furthermore, this model allows me to directly model the uncertainty associated with the latent trait.
Data
I use the directed dyad-quarter as the unit of analysis. As one of the main goals of this project is to create a more granular measure of hostility that can be used as a measure of conflict escalation and thus is able to capture even the smallest changes in states’ interactions, I focus on the annual quarter instead of the more traditional year-level unit of analysis. My sample consists of politically relevant dyad-quarters from 1950 to 2010. Politically relevant dyads are dyads in which either the states are contiguous or at least one of the states is a major power. I restrict my sample to politically relevant dyads both for computational tractability and following common practice in the international conflict literature, which has shown that this sample is generalizable to the population of dyads (Lemke & Reed, 2001). As a result, I have 648,368 directed dyad-quarter observations and 3,388 directed dyads.
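The sample definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the boolean inputs, and the zero-based quarter index are hypothetical and are not taken from the replication materials.

```python
from typing import List, Tuple


def is_politically_relevant(contiguous: bool, major_a: bool, major_b: bool) -> bool:
    """A dyad is politically relevant if the states are contiguous
    or at least one of them is a major power (definition from the text)."""
    return contiguous or major_a or major_b


def quarter_index(year: int, month: int, start_year: int = 1950) -> int:
    """Zero-based index of the annual quarter containing (year, month).
    The zero-based indexing is an assumption made for this sketch."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return (year - start_year) * 4 + (month - 1) // 3


def directed_dyads(state_a: str, state_b: str) -> List[Tuple[str, str]]:
    """Each undirected pair yields two directed dyads, so hostility
    from A to B can differ from hostility from B to A."""
    return [(state_a, state_b), (state_b, state_a)]


# Hypothetical usage: the last quarter of 2010 in a 1950-2010 panel.
last_quarter = quarter_index(2010, 12)
```

A full panel from 1950 to 2010 would span 244 quarters per directed dyad; the reported observation count is smaller than 3,388 × 244, which is consistent with dyads entering or leaving the sample over time.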
Table III. Combination of fatalities and variables indicating the use of force
The variables from the MID dataset include: Show of force, Alert, Nuclear alert, Mobilization, Border fortification, Border violation, Blockade, Occupation of territory, Seizure, Attack, Clash, Beginning of war, and Joining war. All of these variables except Attack and Clash are dichotomous. Attack and Clash variables are ordinal. As attacks and clashes are more likely to involve fatalities than other conflict actions short of war, I incorporate the fatality level into these variables. The assumption I am making is that attacks and clashes with fatalities reflect higher levels of hostility than the same actions that do not lead to fatalities. As a result, 0 indicates the absence of attack/clash, 1 – attack/clash without fatalities, 2 – attack/clash with missing information on fatalities, and 3 – attack/clash with fatalities (Table III).
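The four-level coding of Attack and Clash described above can be expressed as a small function. This is a minimal sketch: the function name is hypothetical, and representing missing fatality information as None is an assumption made here, not the encoding used in the MID data.

```python
from typing import Optional


def code_attack_or_clash(occurred: bool, fatalities: Optional[int]) -> int:
    """Ordinal coding of an Attack or Clash item.

    0 = no attack/clash
    1 = attack/clash without fatalities
    2 = attack/clash with missing fatality information (None, assumed encoding)
    3 = attack/clash with fatalities
    """
    if not occurred:
        return 0
    if fatalities is None:
        return 2
    return 3 if fatalities > 0 else 1
```

Placing the missing-information category between the no-fatality and fatality categories follows the ordering given in the text.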
Material conflict is a quad category from the Phoenix dataset. Quad categories represent a high level of aggregation of the Conflict and Mediation Event Observations (CAMEO) framework. The Material conflict category includes exhibition of force posture, reduction in relations, coercion, assault, fight, and engagement in unconventional mass violence (Gerner et al., 2002). 0 indicates the absence of the material conflict, 1 indicates one material conflict, 2 corresponds to two material conflicts, and I assigned 3 to the variable if the number of events is equal to or exceeds three. The assumption here is that higher hostility is likely to be reflected in the higher number of material conflicts within the dyad.
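The top-coding of the Material conflict count can be sketched as follows. The function name is hypothetical, and the input is assumed to be the number of Phoenix material-conflict events recorded for one directed dyad-quarter.

```python
def code_material_conflict(event_count: int) -> int:
    """Map a dyad-quarter count of material-conflict events to the
    four-category item: 0, 1, 2, or 3 (three or more events)."""
    if event_count < 0:
        raise ValueError("event_count must be non-negative")
    return min(event_count, 3)
```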
Model
In order to construct a latent measure of hostility, I use an ordinal item-response (O-IRT) model. The goal of this model is to estimate hostility, where
Each item is indexed j = 1,…, J and is observed at the dyad-quarter level, where dyads are indexed i = 1,…,N and time is indexed t = 1,…,T. Each item has two parameters: the ‘item discrimination’ parameter
To account for the temporal bias in the machine-coded dataset variable (Material conflict), I make the following modification to the model. I assume that the difficulty parameter for the human-coded MID variables is constant over time and leave it as
Following Fariss’s (2014) framework, I specify the probability distribution as:
where vj = 1 when indicator j is one of the Phoenix variables and vj = 0 when it is one of the MID variables.
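For orientation, one common way to write an ordinal IRT likelihood with time-varying cut points for the machine-coded items is sketched below. The notation is assumed for illustration and may differ from the article's exact specification: \theta_{it} is the latent hostility of dyad i at time t, \beta_j is the discrimination parameter of item j, \alpha_{jk} are cut points held constant over time, \alpha_{tjk} are cut points allowed to vary by period, and F is the logistic cumulative distribution function.

```latex
P(y_{itj} = k) =
  \left[ F(\alpha_{jk} - \beta_j \theta_{it})
       - F(\alpha_{j,k-1} - \beta_j \theta_{it}) \right]^{(1 - v_j)}
  \times
  \left[ F(\alpha_{tjk} - \beta_j \theta_{it})
       - F(\alpha_{tj,k-1} - \beta_j \theta_{it}) \right]^{v_j}
```

In this form the indicator v_j simply switches between the two bracketed terms: human-coded MID items use the constant cut points, while Phoenix items use the time-varying ones.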
As Model parameters
In order to assess convergence, I look primarily to the
Model output
My final hostility estimate has an average value of 0, with the highest estimate of 1.5 for the China–Taiwan dyad and the lowest estimate of –0.17 for the United Kingdom–Kyrgyzstan dyad. Given the fact that the China–Taiwan dyad has emerged out of conflict, while the United Kingdom and Kyrgyzstan have few within-dyad conflict interactions, this quick examination of the estimates provides some evidence for face validity of my measure.
The contribution of each of the variables in the model is reflected in the difficulty and discrimination parameters (left panel of Figure 2). For a given hostility level, the probability of a conflict event increases as the item difficulty decreases. At the average level of hostility, the probability of a nuclear alert is lower than the probability of any other conflict event, which can be explained by the fact that a nuclear alert is a relatively rare event and we would expect a nuclear alert only at a high level of hostility.
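The relationship between difficulty and event probability can be illustrated with a short numerical sketch. It assumes a logistic link for a dichotomous item, and the parameter values are hypothetical, chosen only to show the direction of the effect rather than taken from the estimated model.

```python
import math


def event_probability(theta: float, difficulty: float, discrimination: float) -> float:
    """Probability of observing a dichotomous conflict event, assuming
    P(event) = logistic(discrimination * theta - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(discrimination * theta - difficulty)))


# Hypothetical items evaluated at the average hostility level (theta = 0).
p_low_difficulty = event_probability(theta=0.0, difficulty=2.0, discrimination=1.0)
p_high_difficulty = event_probability(theta=0.0, difficulty=6.0, discrimination=1.0)
```

Under these assumptions the high-difficulty item is the rarer one at any given hostility level, which mirrors the interpretation of the nuclear alert item in the text.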
The discrimination parameter reflects the extent to which a change in the level of hostility corresponds to the change in each of the manifest variables. For instance, Figure 2, right panel, shows that if the level of hostility within a dyad increases, we are more likely to observe Alert than Attack or Clash. This is important, as it indicates that, for example, Alert is a more hostile action than Seizure. This observation, however, contradicts the MID hostility level scale. It should be noted that while the MID hostility scale is based on theoretical assumptions, the ranking presented here is generated by the conflict data. Thus, the model output suggests that the use of the ordinal scale for measuring hostility and conflict escalation might bias the results as the measure would not necessarily reflect reality.

Figure. Temporal variation in the level of global hostility
Concurrent validity
As a concurrent validity check, I look at variation of the level of global hostility across time. As the number of interstate wars has been decreasing over time (Pinker, 2012), I would expect the level of global hostility to be decreasing as well. I take 1,000 draws from the posterior estimates of
I also look at the levels of hostility in several dyads. For instance, according to Diehl & Goertz (2001), the China–South Korea rivalry is an example of an enduring rivalry with a pattern of decreasing severity over time. Therefore, we would expect hostility in this dyad to be decreasing over time as well. Figure 4 shows that the varying-intercept model (b) is more consistent with our expectation than the fixed-effect model (a). Thus, the varying-intercept model passes the concurrent validity check. The fixed-effect model also shows increasing levels of hostility in the United States–Soviet Union/Russia dyad (Figure 5), which provides additional evidence for the presence of temporal bias in that model. In both cases, however, the varying-intercept model aligns well with what we observe in the real world.
Predictive validity check
In this section, I assess the predictive validity of my hostility measure. If this measure is valid, it should be able to predict the processes to which it is theoretically linked (Trochim, 2006). In particular, I extend Findley, Piazza & Young’s (2012) study on terrorism in international rivalries.
Findley, Piazza & Young (2012) argue that states hostile toward each other are more likely to abet transnational terrorists than other states. Sponsorship of terrorism gives the states involved tactical advantages such as plausible deniability and disproportionate effectiveness. Furthermore, the increasing costs of conventional warfare make terrorism a viable option for states that try to avoid direct confrontation with their rivals (Jenkins, 1975; Conrad, 2011).
In order to test their theory, Findley, Piazza & Young (2012) use two measures of rivalry. The first measure is […]
Figure 4. Temporal variation in the level of hostility in the China–South Korea dyad
Figure 5. Temporal variation in the level of hostility in the United States–Soviet Union/Russia dyad
Table IV. Negative binomial models of transnational terrorist attacks using dyads 1968–2002 (without rivals). Standard errors in parentheses.


I extend Findley, Piazza & Young’s (2012) study by looking at the effect of hostility on the number of terrorist attacks. Based on these theoretical and empirical studies, if my measure of interstate hostility is valid, the level of hostility should be able to reproduce a strong positive effect on the count of transnational terrorist attacks.
The model specification is very similar to the one used by Findley, Piazza & Young (2012). The unit of analysis is the directed dyad-year. The analysis includes only politically relevant directed dyads from 1968 to 2002. Following Findley, Piazza & Young (2012), I use two approaches in estimating the effect of hostility on the dependent variable. In both approaches, the origin country is defined as the nationality of the terrorists. However, in the first approach, the target country is defined as the country in which the terrorist event occurred, while in the second approach, the target country is the nationality of the victims. Finally, I preserve the set of control variables used in Findley, Piazza & Young’s (2012) paper. The small set includes only dyadic variables: rivalry, hostility, joint democracy, contiguity, and capability ratio. In addition to these covariates, the fully specified model includes the history of terrorism, interstate war, Cold War, and civil war in both the origin and target states. First, I reproduce the analysis on the full sample and find that hostility in fact has a significant positive effect on the number of transnational attacks. The results are shown in Online appendix D. Second, I design a harder test for my measure by excluding rivals from the dataset. This test provides a challenge for my measure as it reduces the sample in size and limits it to the dyads that are not engaged in a rivalry and, thus, are likely to be on the lower end of the hostility scale. Moreover, this test can illustrate the utility of my hostility measure over the dichotomous rivalry approaches, showing that states’ interactions are likely to be more nuanced and thus require a fine-grained measure.
Table IV presents the results. The coefficient for interstate hostility is positive and significant across all models, which provides support for my hypothesis, suggesting that dyads with higher levels of hostility are more likely to experience transnational terrorist attacks than other dyads. In addition to demonstrating the predictive validity of my measure, this analysis suggests that the theoretical expectations of Findley, Piazza & Young (2012) extend into important new grounds – I find that there is a link between hostility and transnational terrorism even in non-rivalrous dyads.
Conclusion
Current discussions of interstate conflict escalation are hindered by the lack of a single framework that could be used for studying conflict dynamics. Understanding the process of conflict escalation, however, is important for numerous theories on international conflict. A significant part of the problem is the absence of a valid measure of conflict escalation. In this article, I address this issue by developing a new measure of interstate hostility that can serve as a foundation for measuring escalation. One of the major contributions of this project is a solution to the trade-off between granularity and bias in measuring hostility. For instance, it is well known that while human-coded datasets, such as MID, are often criticized for being too highly aggregated, events data struggle with irregular reporting. The measure presented in this article mitigates these issues through the use of a time-varying intercept model structure and a combination of human-coded and machine-coded data. Furthermore, it incorporates uncertainty in the underlying measure, which has been ignored in virtually every measure of hostility to date. In addition, the presented model suggests the necessity for a re-evaluation of existing measures of dispute severity and hostility in terms of their relation to the concept. Finally, the new measure of interstate hostility not only allows for testing of existing theories, but also can motivate new theories on escalation and de-escalation processes as well as foreign policy in general. For example, the replication of Findley, Piazza & Young’s (2012) study with the hostility measure suggests that the level of interstate hostility can potentially explain states’ sponsorship of transnational terrorism. The fact that this finding holds even for non-rivalrous dyads suggests that Findley, Piazza & Young’s (2012) argument can be extended to a larger set of dyads and illustrates the utility of the finer-grained measure of hostility.
This project constitutes a first step in the analysis of conflict escalation. Moving forward, incorporating more complex dependencies in the international system into the study of escalation seems to be a fruitful area of research. Potential lines of inquiry include the analysis of the impact of the conflict escalation process within one dyad on the conflict/cooperation in other dyads or the relationship between the alliance network structure and probability of the conflict escalation.
Acknowledgements
I am grateful to Michael Kenwick and Christopher Fariss for their guidance. I would also like to thank Glenn Palmer, Douglas Lemke, Burt Monroe, Kevin Reuning, Bruce Desmarais, Zita Oravesz, and James A. Piazza for their help, feedback, and useful comments.
