Abstract
There are situations where the data suggest, or the theory requires, that one estimate the boundary lines that separate regions of observations from regions of no observations. Of particular interest are ceiling or floor lines. For example, many theories use terms such as veto player, constraint, only if, and so on, which suggest ceilings. Ceiling hypotheses have a nonstandard form: they claim that the probability of observing Y is zero for all values of Y greater than the ceiling value Yc for a given value of X. Conversely, ceiling hypotheses make no specific prediction about the value of Y for a given value of X except that it will be less than the ceiling value. Floors work by guaranteeing minimum levels. The article gives numerous examples of theories that imply ceiling or floor hypotheses and numerous examples of data that fit such hypotheses. The article proposes quantile regression as a means of estimating the boundaries of the no-data zone as well as criteria for evaluating the importance of the boundary variable. These techniques are illustrated for ceiling and floor hypotheses relating gross domestic product/capita and democracy.
Introduction
Of constant concern to social scientists is fitting empirical data analysis, for example, statistics, to theories. There has been progress in matching game theoretic models with appropriate statistical methods (e.g., the EITM Project in political science). In this article, we explore a mismatch between theories and statistical data analysis. We show that there exists a large class of theories, models, and hypotheses that postulate “floors” and “ceilings.” To test and evaluate these theories, we need methodologies that can estimate these quantities of theoretical interest.
The examples in the tables below illustrate that a wide range of theoretical language implies that ceilings are what the hypothesis is about. A ceiling is a value Yc for a given value of X that observations rarely if ever exceed. A “glass ceiling” for women means that there are professional levels that are very difficult to attain. Conversely, the ceiling hypothesis makes no specific claim about the exact value of Y in the range [0, Yc] for a given value of X (for convenience in much of our presentation we will assume that all observations lie in [0,1]). 1
“Floors” work in the opposite manner. The value of Yf for a given value of X is the minimum which we will find for that value of X. In other words, virtually all of the observations will lie above the floor, [Yf,1].
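The two definitions can be written compactly. The following is only a notational sketch: writing the ceiling and floor as functions Y_c(x) and Y_f(x) of the value x of X is our shorthand for "the value Yc (or Yf) for a given value of X," and the approximate equality reflects the "rarely if ever" wording.

```latex
\text{Ceiling: } \Pr\bigl(Y > Y_c(x) \mid X = x\bigr) \approx 0,
\qquad
\text{Floor: } \Pr\bigl(Y < Y_f(x) \mid X = x\bigr) \approx 0 .
```

Neither statement says anything about where Y falls inside the permitted range, which is what distinguishes these hypotheses from claims about a conditional mean.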
The core argument of our article is that (1) many theories predict ceiling or floor data patterns, (2) many descriptive scatterplots have ceiling or floor no-data patterns, (3) the quantity of theoretical interest is not a line through the middle of the data but the ceiling or floor line, and (4) the importance of the ceiling or floor is the relative size of the no-observation zone created by the floor or the ceiling.
We work from both directions: we provide examples of theories that predict ceiling or floor patterns, and at the same time we illustrate that data scatterplots with large no-data zones are not uncommon, that is, that ceiling or floor theories would fit or explain these data. In particular, we think that ceiling or floor scatterplots arise quite frequently, particularly in large cross-national studies, as well as in research focusing on institutions (domestic or international).
A variety of theories or causal mechanisms can produce ceiling or floor hypotheses. For example, many theories of institutions invoke them as “constraints” on behavior, which suggests a ceiling effect. We focus in particular on an important class, those hypotheses and theories formulated in terms of necessary and sufficient conditions. It must be stressed that our methodology is not limited to these kinds of hypotheses, but extends to any theory that explicitly or implicitly invokes constraints, floors, ceilings, and so on.
Ceilings and floors often produce what we will call “triangular no-data” patterns or scatterplots. 2 While the zone of no data can take many forms there are good theoretical and empirical reasons to focus on zones that are “triangular” in shape. In particular, necessary and sufficient conditions by definition produce triangular no-data zones. But we shall also see that game theory produces hypotheses about triangular no-data zones as well.
So while we think our methodology is particularly well suited to necessary and sufficient condition hypotheses, it is not limited to them. Nothing in the methodology requires fuzzy logic variables or the use of necessary and sufficient condition language. For example, the researcher might prefer the language of constraints (e.g., veto players) to the logic of necessary and sufficient conditions.
The first half of our article focuses on different ways in which hypotheses about no-data zones can arise. While we focus much of our attention on necessary and/or sufficient condition theories and hypotheses as a core example, this is by no means the only way that theories focus on no-data zones. For example, in our example involving democracy and wealth, we suggest that Przeworski et al.’s (2000) hypothesis about this relationship postulates a floor below which we should see no cases. We also discuss game theoretic models as another large class of examples.
The second half of the article provides our methodological solution. The key idea is to draw a line dividing the zone of data from the zone of no data, that is, the floor or the ceiling. The “importance” of the constraint, floor or ceiling, then is determined by how large this region is. Here we use directly the meaning of constraints: the more important a constraint, the larger the set of options that is eliminated.
There exist statistical methods for estimating boundaries of data. 3 We focus in particular on quantile regression as one such technique. While moderately well known to econometricians (particularly in labor economics) and statisticians, quantile regression has virtually never been used in political science and sociology. In particular, it seems very well suited for drawing the boundaries of triangular data. We briefly discuss the basics of quantile regression and then provide an illustration of its application to the data and debate on the economic requisites of democracy.
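For readers unfamiliar with the technique, the standard textbook formulation of linear quantile regression may help show why it suits boundary estimation. This is the usual check-loss definition, not necessarily the notation the article adopts; the symbols τ, β, and ρ_τ are introduced here for illustration.

```latex
\hat{\beta}(\tau) \;=\; \arg\min_{\beta} \sum_{i=1}^{n} \rho_{\tau}\bigl(y_i - x_i^{\top}\beta\bigr),
\qquad
\rho_{\tau}(u) \;=\; u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr),
\qquad 0 < \tau < 1 .
```

With τ = 0.5 this gives median regression, a line through the middle of the data. Choosing τ close to 1 pushes the fitted line toward the upper edge of the observations (a candidate ceiling), and τ close to 0 pushes it toward the lower edge (a candidate floor).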
To illustrate the theoretical and methodological importance of ceiling and floor hypotheses, along with the importance of no-data zones, we use the 50-year history of the economic and social requisites of democracy. This example has several advantages. (1) The hypothesis relating wealth, gross domestic product (GDP)/capita, or economic development to democracy has been variously formulated as a ceiling or floor hypothesis as well as a linear relationship (e.g., regression models). Each of these three hypotheses predicts a different scatterplot, hence they are quite different. (2) It is a good example because of its core importance in sociology and comparative politics for over 50 years. (3) It illustrates how ceiling and floor hypotheses arise naturally in large-N cross-national research dealing with institutions.
Ceiling or floor hypotheses are about drawing lines that separate regions of data from regions of no data, that is, they focus on the boundaries of the data. We want to know the line that separates the region where we have observations from the region where we find no observations. The potential outcomes approach defines “causal effect” in terms of average treatment effect (ATE). In contrast, focusing on ceilings and floors implies another kind of causal effect in terms of the minimum (floor) or maximum (ceiling) value of Y for a given value of X. We shall see an example below, where the average impact of X on Y is zero (i.e., a flat regression line), but where there is a strong ceiling effect. In sum, we must look for regions of no observations in order to analyze a whole class of important causal effects.
Constraints, Ceilings, and Floors
A wide range of social science theories, hypotheses, and models use the concepts of constraints, barriers, prerequisites, and so on. All of these related concepts propose that there are regions where we should see no data because of the constraints, barriers, and limits set by X on Y. In this section, we look at some more specific ways in which constraints appear in theories and data. We focus on the ideas of floors and ceilings as a specific kind of constraint or barrier. We link these in particular to necessary and sufficient conditions. These by definition are about floors and ceilings in that they specify regions of no data. Thus, there is a sort of hierarchy, a nicely nested set of Russian dolls, in our analysis, which starts out at the highest level with ideas of constraint; the next level down focuses on ceilings and floors as common kinds of constraints; the next level is the particular specification of floors and ceilings in terms of necessary and sufficient conditions.
Geometrically, we focus particular attention on no-data regions of triangular shape. Constraints can potentially be of any shape, but if we think in standard Cartesian coordinate terms, the regions of no data often lie at the corners. This means that they are triangular. Continuous necessary conditions (i.e., fuzzy logic) by definition have empty triangular data zones. Finally, quantile regression is a natural way to draw boundary lines around triangular no-data regions that lie in corners. 4
Hence, it is natural to look for what we call “triangular data or theories.” If constraint, ceiling, and floor hypotheses predict empty zones in the data, then one possible interpretation of triangular data is via ceiling, floor, or constraint theories. As we shall see, game theory models often predict triangular no-data zones. Since economic and game theory models often involve constraints of various sorts, they might naturally generate predictions of no-data in certain regions.
Ceilings and Necessary Conditions
Figure 1 illustrates the connection between triangular no-data zones and necessary conditions. These scatterplots are what we should see if X is a necessary condition for Y. As Ragin’s (2000, 2008) methodology stresses, to find a necessary condition is to find a zone of no data. Another common no-data region is a zero in a 2×2 table, illustrated in Table 1. As the table shows, if X is necessary for Y then we expect a zero in the upper-left cell.

Figure 1. No-data regions: ceilings and necessary conditions.
Table 1. No-Data Regions in 2×2 Tables.
Necessary condition hypotheses are part of causal mechanisms that predict or explain the absence of specific values of Y for given values of X. Necessary condition causal mechanisms produce regions with zero cases and boundaries between zones with data and with no data. This is true for 2×2 tables (e.g., Table 1) as well as for continuous necessary conditions (e.g., Figure 1). The fuzzy logic definition of a necessary condition, X ≥ Y, is a ceiling hypothesis stating that “the value of Y cannot exceed the value of X.” The region above the diagonal should be empty.
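A minimal sketch of how the fuzzy logic condition X ≥ Y could be checked against data: it simply lists the cases that fall above the diagonal. The function name `ceiling_violations`, the `tolerance` parameter, and the membership scores are all hypothetical and chosen for illustration; they are not taken from the article.

```python
def ceiling_violations(pairs, tolerance=0.0):
    """Return the (x, y) pairs with y > x + tolerance, i.e. cases above the diagonal."""
    return [(x, y) for x, y in pairs if y > x + tolerance]


# Hypothetical fuzzy-set membership scores in [0, 1]; NOT data from the article.
scores = [(0.2, 0.1), (0.5, 0.5), (0.9, 0.3), (0.4, 0.6)]

violations = ceiling_violations(scores)
print(f"{len(violations)} of {len(scores)} cases lie above the diagonal: {violations}")
```

A nonzero `tolerance` is one simple way to allow for measurement error near the boundary; whether any slack is appropriate is a substantive decision for the researcher.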
Many important necessary condition hypotheses exist in the literature. The Goertz and Starr anthology (2003) contains a variety of examples. Goertz’s (2003) chapter provides a listing of 150 necessary condition hypotheses from prominent social scientists. Mintz (2003) looks at noncompensatory decision making. Harvey (2003) examines theories of deterrence. Tsebelis (1999) discusses veto players. Cioffi-Revilla and Starr (2003) combine opportunity and willingness with substitutability. The various decision-making theories discussed in that volume (noncompensatory, veto player, and so on) illustrate how the idea of constraints in decision making leads to necessary condition hypotheses and models. Dul et al. (2010) discuss critical success factors in the implementation of new business processes. Hence, there is no shortage of theories and hypotheses that imply ceilings.
Floors and Sufficient Conditions
The basic ideas for ceiling hypotheses carry over to floor hypotheses. We are still delimiting a region of data from a region of no-data. Just as ceiling hypotheses are related to necessary conditions, so floor hypotheses are related to sufficient conditions: “a sufficient causal condition or combination of conditions establishes a floor for the expression of the outcome” (Ragin 2000:237). Instead of X ≥ Y we now have Y ≥ X.
Figure 2 illustrates what a continuous sufficient condition relationship between X and Y looks like using fuzzy logic. It is worth noting that these data fit the sufficient condition hypothesis perfectly. As with necessary conditions, we have a triangular region of data and a triangular region of no data. The diagonal boundary between the two is the constraint that we are interested in.

Figure 2. No-data regions: floors and sufficient conditions.
Table 2 provides a sample of sufficient condition hypotheses. These examples illustrate that sufficient condition hypotheses can come from a variety of theoretical, substantive, and methodological traditions.
Table 2. Sufficient Conditions.
The language and metaphor of floor hypotheses seem to be less rich than those for ceilings. We can draw on a rich body of language to describe ceilings, such as veto players, constraints, possibility, and so on. It appears that our linguistic resources for dealing with floors are more limited. It is not quite clear what conclusion one could or should draw from this. Perhaps the most important is that researchers should devote more attention to this kind of phenomenon. We shall see below that triangular data that fit floor hypotheses are not that uncommon, so at least empirically we need to think about floor hypotheses.
Triangular Theories
Necessary or sufficient condition theories directly produce ceiling and floor hypotheses. There are other theories that explain or predict that all observations are in one zone, and hence that there are regions where no observations occur. Since the no-data zone is often triangular in nature, we can call these “triangular” theories.
Table 3 gives a few examples that we have uncovered in our reading. It is worth noting the presence of game theoretic models (e.g., Acemoglu and Robinson, Bueno de Mesquita, Gartzke, etc.) in this list. While this is beyond the scope of this article, we suggest that a triangular data pattern is one of the more common empirical implications of game theoretic models.
Table 3. Triangular theories.
Figure 3 provides an example of what we have been calling triangular theories taken from Acemoglu and Robinson’s Economic origins of dictatorship and democracy. This is a good example, given our interest in theories relating economic variables with democracy and also a good example of how game theoretic models can easily produce a prediction of triangular data patterns. Acemoglu and Robinson’s theory of democratic consolidation predicts a region determined by the “costs of coup”—a constraint variable—and “inequality” where we should see democratic consolidations; of course, this means that there is also a zone of no democratic consolidations.

Figure 3. Triangular theories: Democratic consolidation or coups?
More generally, the Acemoglu and Robinson book has quite a few examples of where the game theoretic model predicts that we should see data concentrated in various regions. Not all these involve triangular shapes. For example, sometimes the region is rectangular (see Figure 4 below where we discuss such regions using Geddes’s data on economic growth and labor repression). The key point is not necessarily the shape of the region. The key point is that there are lines (or curves) that separate a region of data from one of no-data.

Figure 4. Estimating ceilings: GDP/capita growth and labor repression in higher-income developing countries.
Triangular and Rectangular No-data Scatterplots
The previous section focused on theories that produce ceilings or floors. If the data used to test these hypotheses have the appropriate, usually triangular, form, then the hypothesis is supported by the data. In this section, we work from the data to ceiling or floor hypotheses. If we have access to scatterplots (and sometimes we do), then we can see if a ceiling or floor hypothesis fits the data. It may well be the case that the hypothesis in question is vague and only says that X is positively or negatively related to Y: The nature of that relationship is often not given a specific functional form. Tables 4 and 5 give a sample of research with scatterplots that fit a ceiling (Table 4) or a floor (Table 5). We have not especially searched for these scatterplots; rather, we have discovered them in the normal course of our reading for research, teaching, and even pleasure. We are of course at the mercy of authors and journal editors who publish such figures. While it is not common to publish scatterplots (one recommendation of this article would be to make that much more common), they are not excessively rare either.
Table 4. Ceiling Scatterplots.
Table 5. Floor Scatterplots.
When authors and journals publish scatterplots typically they involve core variables. Most often they contain the dependent variable plotted against a central independent variable. As such, the ceiling and floor hypotheses implicit in the scatterplots listed in Tables 4 and 5 are not marginal but lie at the center of the research agenda.
Wealth and Democracy: A Floor or Ceiling Relationship?
One of the most consistent findings in the comparative politics literature is that democracy is related to wealth, or more generally, levels of economic development. With over 50 years of research on the question behind us, virtually all studies find a statistically significant positive relationship. However, what still remains very open is the nature or form of that relationship. In this section, we survey some (the whole literature is huge) hypotheses and data on the relationship between wealth and democracy, focusing our attention on the causal arrow running from wealth to democracy (there is a less extensive literature on the impact of democracy on economic development and growth). Launched by the famous Lipset article of 1959, the hypothesis that economic development is a cause of democracy has remained a core part of the fields of sociology and comparative politics. During the 1990s, there was renewed interest in this topic; in particular, the Przeworski et al. (2000) and Acemoglu and Robinson (2006) formulations have received much attention. We have 50 years’ worth of discussion and data analysis to draw on, including work by (in alphabetical order) Bollen, Dahl, Diamond, Jackman, and Lipset among other prominent scholars.
In particular, we contrast ceiling, floor, and linear forms that this relationship can take. In doing so, we move back and forth between hypotheses and data patterns. Theories are often an interpretation of data patterns, so it is really impossible to separate the two. This example illustrates our intuition that comparative cross-national studies involving institutions appear to be a substantive area where floor and ceiling hypotheses abound. It also has the advantage of a long history with many theoretical and empirical analyses. Finally, the relationship between wealth (i.e., GDP/capita) and democracy has been expressed in terms of necessary conditions, but also in terms of sufficient conditions.
Virtually everyone cites Lipset’s 1959 article as central to this research tradition. The title of that piece, “Some social requisites of democracy: economic development and political legitimacy,” entails a ceiling view of the relationship. However, as the subsequent literature illustrates, there is little consensus on the nature of the relationship between economic development and democracy. By asking about the requisites of democracy, Lipset suggested that something must be present for democracy to appear or exist. Throughout the decades, various scholars have talked about the relationship in terms of necessary or sufficient conditions. A couple of early examples:
[L]et us pose the key question in slightly different form: What are the necessary and sufficient conditions for maximizing democracy in the real world? (Dahl 1956:64, see also 75)
It has been argued by Max Weber among others that the factors making for democracy in this area are a historically unique concatenation of elements. The basic argument runs that capitalist economic development created the burgher class whose existence was both a catalyst and a necessary condition for democracy. (Lipset 1959:85)
As we have seen above (Table 1), a ceiling hypothesis refers to a zero in a specific cell of a 2×2 table. Descriptions of data often look only at a particular cell. Diamond (1992) gives an example of this: “In accord with Lipset’s thesis and all its extensions, only three low-income countries are democratic” (1992:100). This says that the (0,1) cell—the necessary condition cell—in Table 1 has almost no observations.
The principle of empty cells in 2×2 tables extends naturally to N×N tables, which can be thought of as half-way to a continuous scatterplot. Table 6 illustrates this with data from Diamond (1992). Here we begin to see clearly the triangular nature of the data. In the upper left-hand corner, we have low development and high democracy, with virtually no cases. This triangular no-data pattern fits the hypothesis that economic development is a necessary condition for democracy.
Table 6. Ceiling: Human Development Is Necessary for Democracy, 1990.
a. Freedom House democracy scale.
Source: Diamond 1992.
Are there theories or data that would suggest a floor version of the relationship between wealth and democracy? One way to look for floor hypotheses is to look in scatterplots for a pattern similar to that in Figure 2. Can we find scatterplots between wealth and democracy with such a no-data floor configuration? Figure 5, which we analyze in detail below in the section on statistical methodology, has such a floor pattern. Another way to look for floor hypotheses is to look for the logical language of sufficiency: if X then Y (which contrasts with the language of necessary conditions, Y only if X).

Figure 5. Estimating floors and ceilings: GDP/capita and the level of democracy.
O’Donnell’s oft-cited description of Lipset has this character: “if other countries become as rich as the economically advanced nations, [then] it is highly probable that they will become political democracies” (1973:3). 5 This is in fact not a “description” at all but a very different formulation of the relationship between economic development and democracy. He has transformed Lipset’s original necessary condition hypothesis into a sufficient condition one.
Przeworski et al. also provide an example of a floor theory. While seen as part of the literature on the wealth–democracy relationship, it is really a theory about the absence of transitions from democracy back to authoritarianism in wealthy or economically developed states: “We would thus expect democracies to appear randomly with regard to levels of development, and then to die in the poorer countries and to survive in the wealthier countries. And because every time a dictatorship happened to die in an affluent country democracy would be there to stay, history should gradually accumulate wealthy democracies. Democracy appears exogenously, deus ex machina. It tends to survive if a country is ‘modern’” (Przeworski et al. 2000:89–90). The floor in Figure 5 is the barrier that wealth puts on the downward movement toward authoritarianism. Przeworski et al. stress that what wealth does is to prevent a transition to authoritarianism.
There is a long history of looking at wealth–democracy relationships using large-N statistical techniques, such as probit, logit, event history, latent variable models, and so on. Almost without exception, researchers have found a positive, statistically significant relationship between GDP/capita and democracy. However, there is no literature that systematically compares floor or ceiling relationships with linear ones. So what do all these positive statistical results mean? One needs to ask what is the theoretical entity of interest. In the case of Przeworski et al., we suggest that the floor is really the entity we want to estimate. If we take the requisites language of many scholars seriously we would want to estimate a ceiling.
First, it is important to repeat that the statistical models (1) look for linear relationships (or S-curved ones in the case of probit/logit) and (2) estimate a line through the middle of the data. If one takes the data in Figures 1 and 2 and applies the common statistical models, there will be a clear positive relationship between X and Y. However, as we show below in the case of Geddes’s data, it is quite possible for there to be a clear ceiling and a regression line not different from zero. 6
The key principle to note is that ceiling and floor hypotheses involve a fundamentally different orientation to hypotheses and data analysis: Ceiling and floor hypotheses are about drawing boundary lines between zones of data and zones of no data; they are not about drawing lines through the middle of data.
Most statistical methodologists today see causation and causal models in terms of estimating ATE. Thinking about causal relationships in terms of constraints, floors, and ceilings means that there are other causal effects worth looking at.
Estimating Floors and Ceilings: The Basic Principles
In this section, we outline the basic principles of estimating the floors or ceilings of data and means for evaluating their impact. We focus on ceilings in this section, but the same principles apply to floors. Since ceilings and floors, along with necessary and sufficient conditions, are quite common in qualitative methods, this section provides most of the principles and basic methodology that qualitative scholars need for their own research. It also serves as an intuitive and nontechnical introduction to the material in the next section, which provides statistical techniques and more developed criteria for estimating boundary lines and their importance.
We have seen that there are close ties between theories invoking constraints, ceilings, and necessary conditions. Here we illustrate these linkages in a simple, but real-life, example involving the relationship between labor repression and economic growth.
There is a large qualitative case study literature on the causes of high economic growth that arose in an attempt to analyze the rapid growth of some economies in the 1970s and 1980s. Most obvious were the Asian tiger economies such as South Korea, Singapore, and Taiwan. Many qualitative analysts based their analyses on these countries and argued that their rapid growth rate depended on a disciplined and quiescent labor force and, therefore, on government’s extensive control over labor (labor repression). Repressed labor meant lower labor costs, increased international competitiveness, and so on.
Many of the arguments were about the constraints that free organized labor put on the rate of economic growth (e.g., Deyo 1989; Haggard 1990). This can then be converted into a necessary condition hypothesis: Hypothesis (necessary condition): High levels of labor repression are necessary for high levels of economic growth.
We can frame this in terms of ceilings: Hypothesis (ceiling): There should be no observations in the zone of low labor repression and high economic growth.
Here we see a nice concrete example of the natural relationship between constraints, causal mechanisms, necessary conditions, and ceiling hypotheses.
The first principle of ceiling (or floor) analysis is to ask: Where does the ceiling or necessary condition hypothesis locate the region of no observations?
The simple, qualitative, but very useful test is to examine the data to see if there are any observations in the predicted no-data zone.
Figure 4 reproduces Geddes’s data on all 32 developing countries whose GDP per capita in 1970 was greater than that of South Korea (Geddes 2003:104; we follow Geddes in choosing this set of countries). The ceiling hypothesis states that there should be no observations in the upper-left part of the scatterplot. This is where we would find the low labor repression–high economic growth cases. It turns out that this region is in fact empty, supporting the original propositions of qualitative case study scholars.
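The qualitative test described above amounts to counting observations inside the predicted no-data zone. The sketch below assumes the zone is an upper-left rectangle (repression below `x_max`, growth above `y_min`); the function names, the cutoffs, and the observations are hypothetical placeholders, not Geddes's data.

```python
def in_zone(point, x_max, y_min):
    """True if a point lies strictly inside the upper-left zone: x < x_max and y > y_min."""
    x, y = point
    return x < x_max and y > y_min


def counterexamples(points, x_max, y_min):
    """List the observations that fall inside the predicted no-data zone."""
    return [p for p in points if in_zone(p, x_max, y_min)]


# Hypothetical (labor repression score, GDP/capita growth in percent) pairs; NOT Geddes's data.
observations = [(0.5, 2.1), (1.0, 4.0), (2.5, 8.0), (3.0, 1.5), (4.4, 3.0)]

found = counterexamples(observations, x_max=2.25, y_min=5.0)
print(f"{len(found)} observation(s) in the predicted no-data zone")
```

An empty result is consistent with the ceiling hypothesis for the chosen zone; any cases returned are the counterexamples a researcher would want to inspect individually.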
Geddes (2003:93–94) observes that these scholars drew on evidence from a number of high growth–high repression countries like Singapore, Taiwan, and so on. Geddes makes the point that usually only a limited set of eligible countries was selected for the analysis (e.g., typically only high-growth Asian countries). She argues that there is no relationship between GDP per capita growth and labor repression if one includes a complete (or more complete) sample of countries. Geddes proposes to look at this relationship via a regression analysis. In short, she has transformed a ceiling hypothesis into a linear one: Hypothesis (linear): There is a positive linear relationship between labor repression and economic growth.
Interestingly enough, Geddes’s expectation that there is no such linear relationship is also true. As the regression line in Figure 4 illustrates, there is no linear relationship through the middle of the data (slope is .09 and R² = .003). 7
As we have stressed, ceiling and floor hypotheses look at fundamentally different kinds of causal effects. While there is evidence to support Geddes’s claim that there is no significant average, linear treatment effect of labor repression on economic growth, there is significant evidence for the ceiling hypothesis.
The next key step in the methodology of ceiling hypotheses is to estimate the size of the no-data zone. To make things simple, we consider rectangular regions of no data (in the next section we consider triangular ones). We have “estimated” the ceiling zone by drawing rectangles (“Ceiling zones” in Figure 4). Also, to keep things simple we stick literally to “no-data”; in the next section we relax this to admit a few counterexamples.
A key methodological principle is that one should maximize the size of the no observation region. This is analogous to the least squares principle in drawing a regression line or maximum likelihood principle of estimation. For purposes of illustration in this section, we are requiring that the zone should have a rectangular shape.
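One way the maximization principle might be operationalized, under the rectangle restriction, is a brute-force search over rectangles anchored at the upper-left corner of the scope. This is a sketch under our own assumptions, not the authors' procedure: the function name, the strict-inequality treatment of boundary points, and the example points are all hypothetical.

```python
def largest_upper_left_rectangle(points, x_scope, y_scope):
    """Largest empty rectangle anchored at the upper-left corner of the scope.

    Returns (area, x_right, y_bottom). A point counts as inside a rectangle
    only if it lies strictly left of x_right and strictly above y_bottom.
    """
    x_lo, x_hi = x_scope
    y_lo, y_hi = y_scope
    # Ignore observations that fall outside the chosen scope.
    inside = [(x, y) for x, y in points if x_lo <= x <= x_hi and y_lo <= y <= y_hi]
    # Candidate bottom edges: the scope floor plus every observed y below the scope top.
    candidates = sorted({y_lo} | {y for _, y in inside if y < y_hi})
    best = (0.0, x_lo, y_hi)
    for y_bottom in candidates:
        # The rectangle extends right until it meets the leftmost point above y_bottom.
        blockers = [x for x, y in inside if y > y_bottom]
        x_right = min(blockers + [x_hi])
        area = (x_right - x_lo) * (y_hi - y_bottom)
        if area > best[0]:
            best = (area, x_right, y_bottom)
    return best


# Hypothetical observations on a 0-5 by 0-10 scope; not data from the article.
pts = [(3.0, 9.0), (2.0, 5.0), (4.0, 6.0), (1.0, 1.0)]
print(largest_upper_left_rectangle(pts, x_scope=(0.0, 5.0), y_scope=(0.0, 10.0)))
```

Only bottom edges that pass through an observation (or the scope floor) need to be tried, because lowering the edge between observations enlarges the rectangle without admitting any new point.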
As the dashed lines in Figure 4 illustrate, even restricting ourselves to rectangles means we have choices. We could choose the rectangle A+B or the rectangle A+C. Rectangle A has a GDP/capita growth range of 7–10 percent and the right-hand boundary at a labor repression score of 2.25, thus an area of 3×2.25 = 6.75. Rectangle B is of size 2×2.25 = 4.5, while rectangle C is 3×.6 = 1.8. Hence the A+B ceiling zone is of area 11.25 while the A+C zone is 8.55. Sticking with the rectangle restriction, then we would prefer the A+B ceiling zone. Abandoning the rectangle restriction, we could enlarge the zone by combining zones A, B, and C. We then have a new, larger ceiling zone in the form of an indented rectangle.
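The area arithmetic above can be reproduced in a few lines. The rectangle dimensions are those stated in the text; the dictionary layout and function names are only illustrative.

```python
# Rectangle sizes as stated in the text: (growth extent in percentage points, repression extent).
RECTANGLES = {
    "A": (3.0, 2.25),  # 3 x 2.25
    "B": (2.0, 2.25),  # 2 x 2.25
    "C": (3.0, 0.6),   # 3 x .6
}


def area(name):
    """Area of one named rectangle."""
    height, width = RECTANGLES[name]
    return height * width


def zone_area(names):
    """Total area of a ceiling zone made up of non-overlapping rectangles."""
    return sum(area(n) for n in names)


for zone in (("A", "B"), ("A", "C"), ("A", "B", "C")):
    print("+".join(zone), round(zone_area(zone), 2))
```

The first two totals match the 11.25 and 8.55 reported above; the third is the area of the indented rectangle obtained by combining all three zones.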
The choice of the ceiling boundary can have major implications for the substantive interpretation of the results. For example, drawing the horizontal line at 5 percent growth versus 7 percent growth means that there is more room for economic growth that could be achieved without increasing labor repression. This could have major policy implications; most governments would be very happy with 7 percent growth so there would be less argument for labor repression. Similarly, if we move the vertical line to the right, then it means you have to pay for significantly more labor repression to get high growth.
A natural question is how important are these constraints on economic growth? Are these ceilings and constraints important or minor?
Answering this question leads to the next step in our proposed methodology. The previous steps gave us some idea of the size of the no-data zone. We now need to compare that size to something in order to judge how important such constraints are. To do this, we must first establish what we call the empirical or theoretical scope of the analysis. This is the next critical step in the methodology of ceilings (or floors).
In Figure 4, we need to fix the scope for labor repression as well as for GDP/capita growth. Ideally, the researcher should have good theoretical and/or empirical reasons for setting the scope. However, since explicit scope decisions are rare, we suspect that the most popular option will be to use the maximum and minimum of the empirical data to fix the scope. The labor repression data range from zero to about 5 (the maximum is 4.4, for Iraq). Hence, one might fix the scope of labor repression to be [0,5]. In many cases, there are reasons to think that values significantly higher than those in the data are reasonable (particularly in modest-N settings). Conversely, extreme outliers might suggest using something like the 95th–99th percentiles.
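The two data-driven options for fixing the scope (the full empirical range, or a range trimmed at a high percentile) can be sketched as follows. This is only an illustration: the helper names, the nearest-rank percentile rule, and the scores are our own assumptions, and the scores are hypothetical apart from the maximum of 4.4 mentioned above.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def empirical_scope(values, upper_pct=100):
    """Scope limits from the data; upper_pct=100 gives the plain minimum and maximum."""
    return min(values), nearest_rank_percentile(values, upper_pct)


# Hypothetical labor repression scores (not the Geddes data).
scores = [0.0, 0.4, 0.9, 1.3, 1.8, 2.25, 2.6, 2.85, 3.2, 4.4]
print(empirical_scope(scores))                # full empirical range
print(empirical_scope(scores, upper_pct=90))  # trims the most extreme value
```

With only ten values a 90th percentile is used to show the trimming; with larger data sets the 95th–99th percentiles suggested in the text would play the same role. Rounding the upper limit to a convenient value, as with [0,5], remains a substantive decision.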
The literature that Geddes was reacting to focused on the conditions for “high economic growth.” So in our calculations, this should enter into the determination of the size of the scope. Scope is thus high economic growth, not the whole range, positive and negative, of economic growth. If we look at the usual understanding of “high” economic growth in the post–World War II period (“high” growth would be significantly lower in the 19th century), it ranges from about 4 percent to about 10 percent. To make our calculations easier, we choose the scope of [5,10].
Now that we have the scope limits we can proceed to estimating the importance of the ceiling. The basic principle is simple: The importance of the ceiling is the size of the no-data zone compared with the size of the scope zone, i.e., the ratio of the two.
In Figure 4, if we take the largest rectangle, A+B, then we have an estimated constraint of 11.25/25 = .45, where 25 is the area of the scope ([0,5] for labor repression by [5,10] for growth). We think this constitutes a considerable limit on high economic growth, since the labor repression variable excludes almost half of the scope. We think that it will take much more experience with estimating constraints in this manner to get a feel for what is “large” and what is not. However, as a rough first proposal we think constraints above 15–20 percent would clearly be important.
This example illustrates quite dramatically the difference between statistical procedures that estimate lines through data and our procedure, which estimates lines that separate the region of data from the region of no data. Because Geddes was not looking for regions of no data, she did not see them; once you are looking for them, they jump out at you. Using the data on labor repression and economic growth, we have found support for the hypothesis that labor repression is a strong constraint on high economic growth. In the next section, we abandon our restrictions on rectangular shape and perfect fit. As many of our examples above illustrate, we want to estimate triangular regions of no data, and we typically want to allow a few observations into the region of no data.
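The arithmetic behind these figures can be reproduced directly. The sketch below uses only the rectangle boundaries and scope limits given in the text; the function names are our own.

```python
def rectangle_area(x_low, x_high, y_low, y_high):
    """Area of an axis-aligned rectangle."""
    return (x_high - x_low) * (y_high - y_low)


def importance(zone_area, scope_area):
    """Share of the scope occupied by the no-data zone."""
    return zone_area / scope_area


# x = labor repression score, y = GDP/capita growth in percent.
scope = rectangle_area(0, 5, 5, 10)         # scope [0,5] by [5,10]
zone_a = rectangle_area(0, 2.25, 7, 10)     # rectangle A
zone_b = rectangle_area(0, 2.25, 5, 7)      # rectangle B
zone_c = rectangle_area(2.25, 2.85, 7, 10)  # rectangle C

for label, zone in [("A+B", zone_a + zone_b), ("A+C", zone_a + zone_c)]:
    print(label, round(zone, 2), round(importance(zone, scope), 3))
```

Changing the scope limits in the first call shows how sensitive the importance ratio is to the scope decision.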
In summary, the key steps in the methodology are the following:
Explicitly formulate the constraint as a ceiling (or floor) hypothesis.
Estimate the size of the no-data zone.
Determine the scope and its size.
Calculate the ratio of the no-data zone size to the scope size.
In this section, we have started from more or less clear hypotheses that X is a constraint on Y. It is clear from Figure 4 that one can work backward from the data to hypotheses. The empty zone in a scatterplot can be interpreted as a constraint and/or a necessary condition. Since the relationship between X and Y is often not specified, empty zones can help the researcher think about the causal relationship in terms of constraints. Of course, whether such an interpretation makes sense depends on the empirical and theoretical context.
A Formal, Statistical Methodology for Analyzing Ceilings and Floors
The previous section outlined the basic principles for analyzing ceilings and floors. It would be useful (1) to have more systematic and statistical means for drawing the boundaries, (2) to allow some counterexamples in the no-data zone, and (3) to provide criteria for choosing among alternative boundary lines.
In this section, we introduce quantile regression as a methodology which allows us to systematically draw lines bounding no-data zones. We focus on triangular zones because as we have seen they are probably the most common and simplest kind of geometric shape. Using quantile regression allows us to vary the number of counterexamples that we allow (on average) into the “no-data” zone, which now becomes the “almost no-data” zone.
Once we allow counterexamples into the analysis, we are faced with a fundamental trade-off. On one hand, our principle is to maximize the size of the no-data zone. We can enlarge this by including more and more counterexamples. However, we have an opposing principle, which is that we would like as few counterexamples as possible. We shall propose a formula, a criterion, that balances these competing goals, allowing one to calculate what we call the optimal boundary line (OBL).
This section is therefore more technical: we briefly describe quantile regression and discuss the technical details and logic behind our OBL formula. Readers not interested in the technical details can skip to our analysis of the GDP/capita–democracy relationships. Most of the key points of this section are made in the discussion of that example, and most of the discussion is understandable with the material from the preceding section in hand.
We need a systematic way to allow for some error rate in drawing the line, say, .01, .05, or .10. First, social science data are not perfect: there are conceptual and measurement problems, and so on. Second, one might also consider that no observations is too high a standard. If the zone is “virtually” empty, then one might consider that the ceiling hypothesis is supported by the data. Quantile regression is designed exactly to do this since we can ask for the .99, .95, or .90 quantile regression line. Since quantile regression has almost never been used in sociology and political science (according to our JSTOR search) and rarely in economics (though see Heckman, Ichimura, and Todd 1997; Abadie, Angrist, and Imbens 2002; for nice and relatively nontechnical introductions see Angrist and Pischke 2009; Cade and Noon 2003), it is useful to give a basic description of the technique. 8
Quantile regression was developed in the late 1970s largely by Roger Koenker and colleagues (e.g., Koenker and Bassett 1978). This was a period when statisticians were very interested in developing robust statistical techniques. It was equally motivated by common problems of heteroscedasticity in data and its implications for the estimation of confidence intervals and the like. This literature often mentions an early remark by Mosteller and Tukey (1977) that one could easily investigate estimated changes in things other than the mean of the response variable, and that focusing just on the mean might give an incomplete view of the relationship between the Y and X variables. Of course, that is what we have been arguing here: we are not always so interested in the mean effect of the treatment on Y but rather in the impact of X on the boundary of Y.
The basic idea behind quantile regression is quite simple: instead of focusing on the mean one looks at quantiles. So the quantile regression analogue of least squares regression is a median regression. As such, a quantile regression looks very similar to an ordinary regression:
The conditional quantiles denoted by
When choosing large quantiles, for example, .90 or .95, or small ones, for example, .10 or .05, one estimates lines at the boundaries of the data, top or bottom respectively. This immediately gives us the possibility of allowing some observations into the ceiling or floor zones. If we choose a .95 quantile regression, then on average we will find about 5 of every 100 observations in the no-observation zone. 9
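For reference, the standard bivariate linear quantile regression model from the literature (Koenker and Bassett 1978) can be written as below. The notation is the conventional textbook one and is an assumption on our part; it is not necessarily the notation of the equation referred to in this section.

```latex
Q_Y(\tau \mid x) = \beta_0(\tau) + \beta_1(\tau)\, x, \qquad 0 < \tau < 1,

\bigl(\hat{\beta}_0(\tau), \hat{\beta}_1(\tau)\bigr)
  = \arg\min_{b_0,\, b_1} \sum_{i=1}^{n} \rho_\tau\bigl(y_i - b_0 - b_1 x_i\bigr),
\qquad
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr).
```

For τ = .5 the loss is proportional to the absolute residual, which gives median regression. For τ = .95 a positive residual is weighted 19 times as heavily as a negative one of the same size, which pushes the fitted line toward the top of the data.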
A key insight of the quantile regression methodology is that there may be no relationship between X and Y when looking at the mean treatment effect, but the regression line for the .95 quantile might show an important relationship. The labor repression–high economic growth example we discussed above illustrates this: The regression line is flat but there is a clear no-observation zone, and we find that the importance of labor repression for high economic growth is large. In terms of the equation above,
So while quantile regression was originally developed more as a robust technique for regression (focusing on the median and no distributional assumptions) it has found perhaps its most important applications in areas where boundaries are of key empirical and theoretical importance. For example, a major area of application is ecology, where often one wants to know about the carrying capacity of environments. Cade and Noon in their introduction to quantile regression for ecologists make this argument: “The ecological concept of limiting factors as constraints on organisms often focuses on rates of change in quantiles near the maximum response, when only a subset of limiting factors are measured” (2003:413). This quote uses the terms we have often seen where the focus is on the zone of no observations, such as “constraints” and “limiting factors.” It is perhaps not an accident that five of the six scatterplots Cade and Noon chose to illustrate quantile regression have triangular no-data regions.
As we will see in Tables 7 and 8, one typically estimates a number of quantile regression lines. In part, this is because quantile regression is sensitive to outliers, particularly at extreme percentiles, but also because the researcher may be interested in how the relationship between X and Y changes across percentiles.
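Estimating several quantile lines and counting the observations that fall beyond each can be sketched without any statistical library. The code below is an illustrative brute-force fit, not the software used for Tables 7 and 8: it relies on the known property that, in the bivariate case, some optimal quantile regression line passes through two observations, so it simply tries every pair. The function names and the triangular data are hypothetical.

```python
from itertools import combinations


def check_loss(residual, tau):
    """Koenker-Bassett check loss for a single residual."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_quantile_line(xs, ys, tau):
    """Return (intercept, slope) minimising the total check loss by brute force."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a regression line
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(check_loss(y - intercept - slope * x, tau)
                   for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


def count_above(xs, ys, intercept, slope, tol=1e-9):
    """Number of observations strictly above the line (ceiling counterexamples)."""
    return sum(1 for x, y in zip(xs, ys) if y - intercept - slope * x > tol)


# Hypothetical triangular scatter: y lies between 0 and x.
xs = list(range(1, 21))
ys = [x * ((7 * x) % 11) / 10 for x in xs]

for tau in (0.80, 0.90, 0.95):
    b0, b1 = fit_quantile_line(xs, ys, tau)
    print(tau, round(b0, 3), round(b1, 3), count_above(xs, ys, b0, b1))
```

The pairwise search costs on the order of n cubed operations, so it suits small data sets only; for larger ones a dedicated quantile regression routine would be the natural choice.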
Table 7. Floors: GDP/Capita and Democracy, 1995.
Note: N = 136. “—” indicates no calculation possible because of division by zero.
Table 8. Ceilings: GDP/Capita and Democracy, 1995.
Note: N = 136.
As our example of using quantile regression, we continue with the example of the relationship between wealth and democracy. We borrow some data from Gerring (2007), who looks at GDP/capita and polity democracy scores for 1995, excluding countries with high GDP/capita from oil revenues, for example, oil monarchies.
Przeworski et al. (2000) have provocatively argued that the wealth–democracy relationship is not the one proposed by modernization or endogenous growth models. What wealth does is to prevent democracies from lapsing back into authoritarianism. We can express their proposition in terms of sufficient conditions, hence a hypothesis about floors: Democracy and a high level of GDP/capita are sufficient for no transition to authoritarianism. In this formulation, we have a theory that predicts a floor pattern in the data. Figure 5 shows that in fact we do see a floor pattern (the extreme outlier is Singapore).
The first key principle in using quantile regression for our purposes is to estimate boundary lines for a range of quantiles. Table 7 illustrates this for the floor of the wealth–democracy data, where we have calculated lines for .01–.20 quantiles. This is important because in any given situation we do not know how many counterexamples are best to allow into the floor zone.
It would be useful to have a method for determining which of the various quantile regression lines is the “best” according to some reasonable criteria. In determining the OBL, we have several criteria. In Table 7, we have the following key variables in the columns, where S = scope size 10 :
τ = quantile regression, mean percentage of counterexamples permitted
C = number of counterexamples
Z = size of zone of no observations
Area per counterexample 11 (ACE) = Z/C
Constraint relative to scope (CRS) = Z/S
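The quantities Z, ACE, and CRS defined above can be computed as in the following sketch. Two things are assumptions on our part: that Z for a floor is the area of the scope rectangle lying below the fitted line, and that an undefined ACE (no counterexamples, hence division by zero) is reported as a missing value. The scope limits and the line are hypothetical, and the sketch deliberately stops short of the OBL score itself.

```python
def floor_zone_area(intercept, slope, x_lo, x_hi, y_lo, y_hi):
    """Area of the scope rectangle lying below the line y = intercept + slope * x."""
    def height(x):
        # Height of the zone at x, clipped to the scope.
        return min(max(intercept + slope * x, y_lo), y_hi) - y_lo

    cuts = {x_lo, x_hi}
    if slope != 0:
        for level in (y_lo, y_hi):
            x_cross = (level - intercept) / slope
            if x_lo < x_cross < x_hi:
                cuts.add(x_cross)
    edges = sorted(cuts)
    # height() is linear between consecutive cuts, so the trapezoid rule is exact.
    return sum((b - a) * (height(a) + height(b)) / 2
               for a, b in zip(edges, edges[1:]))


def area_per_counterexample(zone, counterexamples):
    """ACE = Z / C; None when C is zero."""
    return None if counterexamples == 0 else zone / counterexamples


def constraint_relative_to_scope(zone, scope):
    """CRS = Z / S."""
    return zone / scope


# Hypothetical scope: logged GDP/capita in [6, 10], polity score in [-10, 10].
x_lo, x_hi, y_lo, y_hi = 6.0, 10.0, -10.0, 10.0
scope = (x_hi - x_lo) * (y_hi - y_lo)
zone = floor_zone_area(-40.0, 5.0, x_lo, x_hi, y_lo, y_hi)

for c in (0, 4):
    print(c, area_per_counterexample(zone, c), constraint_relative_to_scope(zone, scope))
```

A ceiling zone can be handled the same way by measuring the area above the line instead of below it.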
“Area per counterexample” (ACE) tells us how much empty area we gain per counterexample admitted. “Constraint relative to scope” (CRS) tells us how large the no-observation zone is relative to the whole scope. The third factor we propose including deals with our preference for as few counterexamples as possible. All things being equal we prefer a lower quantile, that is, one with fewer counterexamples: (
The OBL formula allows us to balance the costs of allowing in more counterexamples against the benefits of increasing the exclusion zone. A decision rule would be to take the maximum OBL score to determine the “best boundary line” for a given floor or ceiling.
We think that the OBL formula is quite useful for choosing the best line within a data set or population. However, we eventually want to be able to make some comparisons across studies. One way to do this is to take a fixed standard. For many reasons, an obvious choice is the .95 quantile regression. Using the .95 quantile regression means that there are, on average, 5 percent counterexamples. This means we find a 5 percent error rate acceptable, and reflects the fact that we take measurement error into account. For example, Braumoeller and Goertz (2000) use this standard. Obviously 0.05 is the common standard for type I error in statistical studies. Using the .95 quantile regression line means that we will always have roughly the same proportion of counterexamples (5 percent of all data points) and, therefore, always the same relation between the relative number of counterexamples (in percentages) and their proportion of the zone (1 percent of counterexamples will take 20 percent of the zone). Relative size of the zone will directly correspond with relative size of a specified proportion of counterexamples. This facilitates making comparisons and evaluations across studies because we use a uniform standard.
We think both the OBL and the .95 quantile line have advantages. As we have illustrated, if one is calculating a number of quantile regression lines, by default one is certain to have the .95 line. Since one is calculating many lines, it is not much more work to calculate the OBL as well since all the relevant information is at hand. The advantage of using the .95 is that we have a consistent standard to apply to all studies. The disadvantage is that it may not take into account the particularities of the data, variable scales, and theoretical context (as we shall see below).
Figure 5 illustrates these key lines using the democracy–wealth data. The middle line is the .50 quantile regression line through the data. This is analogous to the least squares line, except that we use the median instead of the mean. This reproduces the common finding in the large-N statistical literature that there is a positive relationship between logged GDP/capita and democracy.
Looking simultaneously at Figure 5 and Table 7, we see that the data in fact depart in some ways from a “nice” triangular shape for the no-data zone. There is a bulge in the data for GDP/capita just below 9 and for polity scores from −10 to about 0. This comes out in the OBL calculations in Table 7. We see that the low quantiles have a pretty large OBL score, making them candidates for the optimal line. The OBL scores then decline because the bulge in the data produces many counterexamples. Once the bulge is passed, the quantile regression lines grab a lot of empty space with almost no counterexamples; we then see the OBL score increasing significantly again around the .15 quantile to almost reach the values of the .02–.03 quantiles.
We have a choice between the optimal line at .03 or one in the range .15 to .20. If we were to mechanically choose the maximum, it would be .03. However, as the Geddes example already illustrated, one must take into account the substantive meanings of the values on the scale, the nature of the scales, and the larger theoretical context. Two arguments might suggest taking the .15 as the line, in spite of the large number of counterexamples. The first is the nature of the polity scale itself. It is quite imbalanced between democracy and authoritarianism. Democracy, in a dichotomous coding, is by convention the range 7–10. This means that out of a range of 21 (i.e., –10 to 10), democracy is only a relatively small part of the whole scale, i.e., 4/21 = .19. The second argument for the .15 line looks at the theoretical context. Przeworski’s central argument was about a floor for democracy. While the .15 line produces many counterexamples, they are located clearly in the authoritarian region; there are virtually no counterexamples in the democracy zone. 14
The key thing is that what we are really interested in is the boundary, not the line through the middle of the data. This boundary implies that if a country has a given GDP/capita level, it is not going to slip below a certain level of democracy. The implication is that it will not transition to levels of democracy–authoritarianism below that floor.
As we noted above (e.g., Table 6), many scholars have observed necessary condition relationships in these data. The corresponding ceiling hypothesis is that high GDP/capita is a necessary condition for democracy. Hence, it is useful to look at the ceiling boundary for the data in Figure 5. The procedure for ceilings is the same as for floors except that one uses the .80–.99 quantiles instead of .01–.20.
Here the data are much better behaved and have a much clearer triangular shape. Unlike the floor data, we are clearly in the zone of democracy in the upper left corner. The OBL scores in Table 8 show once again that we have a choice for the optimal line. The actual maximum OBL value is for the .87 quantile, but we get quite good scores for the .95 quantile. Given that the data and their scales are not problematic for the ceiling, we think that following the .95 rule makes a lot of sense. We have six observations above the ceiling line, which is about 4.4 percent of the whole data set (6 of 136).
As Table 8 reports, the ceiling zone is much smaller than the floor zone (in Table 7). If we consider the scope of all the data, one might be tempted to conclude that the floor is more important than the ceiling. But when we look at the ceiling we are no longer really looking at the scope of all the data, so one needs to take into account the changing nature of the scope.
This ceiling illustrates another key point of boundary line analysis: often we are interested in regions of the scope. In our particular case, scholars have been very interested in high-quality democracy or democracy in general. As we have stressed in our brief literature review, many have thought about the wealth–democracy relationship in terms of necessary conditions. Perhaps the most important and common version of this is that wealth is a necessary condition for democracy (e.g., tested in Braumoeller and Goertz 2000). If this is the proposition of interest, then we limit ourselves (as we did analogously for high growth in the Geddes example) to the 7–10 region of polity scores. Taking the standard .95 line, we have an important constraint at almost 25 percent of scope.
All of a sudden, what was a relatively unimportant ceiling in general becomes a significant one in the context of a specific hypothesis. The data indicate that it is very difficult for a poor country to become a democracy and even more difficult to be a high-quality democracy, that is, polity = 10.
Our discussion of floors and ceilings illustrates that the substantive interpretation of the ceiling and floor zones is critical in many cases. High-quality democracy is only a small region of the polity authoritarianism to democracy scale; it is only one level of a possible 21 levels on the polity scale. But substantively we have a great interest in the causes and consequences of good democracy. This example illustrates how important the definition of the scope is in evaluating ceilings and floors. We think that one of the more novel aspects of our boundary methodology is its explicit inclusion of scope considerations into the calculations.
Our very brief analysis of ceilings and floors in the wealth–democracy relationship illustrates the strength of the quantile regression methodology and the usefulness of asking about regions of no observations. Our brief analysis has produced four important results:
Starting at a logged GDP/capita score of about 7, there is a floor below which countries cannot transition to lower levels of authoritarianism-democracy, a strong floor for wealthy democratic countries.
For intermediate regions of GDP/capita, there is little relationship between wealth and democracy.
Very poor countries are not democracies, that is, modest wealth is a necessary condition for democracy.
Moderately high levels of wealth are necessary for high-quality democracy.
Notice that our looking for no-data zones means that we have potentially a variety of conclusions and results even though it is just a bivariate scatterplot. A typical statistical analysis would estimate the line through all the data and one would have one parameter estimate of interest. Here we see that in fact we have a series of conclusions depending on the region of the data we are looking at. Often these are a combination of very strong results about ceilings or floors, combined with very weak results in areas where the scatterplot looks pretty random. Thus, looking for ceilings or floors is an interesting way to dissect data for strong relationships.
Conclusions
Our focus on ceilings and floors allowed us to integrate many of the disparate findings in the literature relating wealth to democracy. Instead of a set of isolated empirical findings, we have a consistent set of relationships. Instead of looking at one line through the middle of the data, we have seen that there are multiple regions where there are few data points; these correspond to well-known claims.
We have suggested that ceiling–floor hypotheses, theories, and data are not uncommon in political science and sociology. We have also suggested that some fields are more likely to formulate these than others. One area that we think is particularly full of these hypotheses is the wide variety of literatures on the causes and consequences of political institutions. To get a feel for the extent of floor and ceiling hypotheses, we examined a prominent anthology on comparative institutions, Steinmo, Thelen, and Longstreth (1992), which has seven substantive chapters. Three of those chapters clearly deal with ceiling issues: Weir’s chapter “Ideas and the politics of bounded innovation”; Immergut’s (1992) chapter on veto players, which has a triangular theory (figure 3.1); and Rothstein’s (1992) chapter on labor-market institutions, which has some nice rectangular data (table 2.1).
Within the special topic of the causes, or at least correlates, of democracy, Acemoglu and Robinson’s (2006) chapter 3 is quite useful in getting a feel for the prevalence of triangular data. They provide a number of scatterplots of democracy versus various popular independent variables, such as inequality, education, tax revenue, along with GDP/capita. Three of these four variables show a clear triangular relationship with democracy (tax revenue is the exception).
Another area worthy of future work is triangular theories that arise from game theoretic models. Triangular theories seem to arise naturally in game theory settings; this potential linkage needs exploration. More generally, Amartya Sen has stressed that the size of the choice set, in contrast to the actual choice, is critical in understanding development and inequality (1992:51–52).
We have only looked at bivariate relationships involving ceilings and floors. An obvious question is how control and confounding variables fit into this analysis. One of the most important concerns in statistical and causal analysis is confounding variables. How do the floor and ceiling factors interact with other causal variables? While it goes beyond the confines of a single article, it is likely that things will look much different than in traditional statistical analyses. To get a sense of how things can be different, imagine that Figure 5 is a standard time-series cross-sectional data set of democracy and GDP/capita. Very common in such analyses is the inclusion of fixed effects for each country as control or confounding variables. One can imagine tracking a given country’s values over time in Figure 5; the line for some countries might be going up over time, some going down, some constant, and so on (related to the problem of ecological inference). Looking at this from the point of view of ceilings and floors, the key point is that once the country reaches the ceiling or floor it must change direction. Thus, the inclusion of country fixed effects will not affect our analysis of the ceilings or floors (see Goertz 2012 for more on this point). This simple exercise suggests that many intuitions about the role of control or confounding variables may change when the focus shifts from average causal effects to ceilings and floors.
We hope that researchers will begin to look for ceiling and floor effects both in their theories and in their data. Once one is looking for something, the odds of finding it increase dramatically.
Footnotes
Acknowledgments
We thank Jan Box-Steffensmeier, Bear Braumoeller, Rick Doner, Alex Hicks, Gary King, and SMR reviewers for comments on earlier drafts of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
