Speaking Stata: Three commands for data reduction

Abstract

Three new commands for data reduction—cisets, pctilesets, and quantilesets—are introduced with examples. Each command lists a set of results and saves those results to a separate dataset, whether of confidence intervals or of percentiles or quantiles. Immediate applications include graphical displays, illustrated by examples of confidence intervals shown in various ways and of complements and alternatives to box plots.

Keywords

dm0119, cisets, pctilesets, quantilesets, confidence intervals, percentiles, quantiles, quantile plots, box plots, data reduction

1 Introduction

Reducing a dataset to another dataset containing summary or other statistics is an old problem. A century ago, the point was emphasized by Fisher (1925, 6-7): “No human mind is capable of grasping in its entirety the meaning of any considerable quantity of numerical data. We want to be able to express all the relevant information contained in the mass by means of comparatively few numerical values.” Fifty years and more ago, Bevington (1969) and Ehrenberg (1975) used data reduction in the titles of their texts. Many more authors might have done something similar, had reference to data analysis not become the prevailing fashion.

Data reduction has been addressed in Stata by official commands such as collapse, contract, or statsby and by various community-contributed commands. Often, an underlying principle is that a valuable command should do one thing well so that a reduction command is just one step in a sequence that includes other analyses, whether graphical or numeric. This column publicizes three more reduction commands: cisets, pctilesets, and quantilesets. Furthermore, two sibling commands, momentsets (Cox 2024a) and lmomentsets (Cox 2025a), are available on the Statistical Software Components (SSC) Archive and so may be downloaded using the ssc command. Discussion of those sibling commands is intended for a later issue of the Stata Journal.

A sequel to this column is planned that will focus on various plots that start with quantile displays, showing all the data for a single variable or group in order. Examples in this column provide a taste of this approach.

The approach to correlation confidence intervals of Cox (2008) is broadly similar. Type search corrci in Stata to find the latest update to the software files.

The rest of the article gives a more detailed overview of the three commands (section 2), followed by some examples (section 3), a more formal statement of syntax (section 4), and detailing of options allowed (section 5).

2 Overview of cisets, pctilesets, and quantilesets

2.1 Scope

cisets computes confidence interval sets for population means, proportions, variances, standard deviations, geometric means, harmonic means, and centiles. Which results are desired is specified by a subcommand, such as cisets mean or cisets centile. cisets is a wrapper for various official commands, namely, ci (for means, proportions, variances, and standard deviations), ameans (for geometric and harmonic means), and centile. The help files for these commands and the corresponding manual entries should be consulted for more detail on statistical principles and procedures.

pctilesets computes percentile or quantile sets. Typically, the user will specify one or more of the options pctile(), minimum, and maximum to select extremes and any or all of the percentiles for 1, 5, 10, 25 (lower quartile), 50 (median), 75 (upper quartile), 90, 95, and 99% cumulative probability. pctilesets is a wrapper for the official command summarize. The help file for summarize and the corresponding manual entry should be consulted for more detail on statistical principles and procedures.

quantilesets computes percentile or quantile sets. It is a wrapper for the Mata function quantile() added to StataNow 19.5 on November 12, 2025. The help file for quantile() and the corresponding manual entry should be consulted for more detail. Typically, the user will specify one or more probability levels between 0 and 1 inclusive using the required option prob().

There are two flavors to each command.

If a numeric variable list, varlist, is supplied, then results are calculated for one or more variables in that list. This is called the variables syntax.

If one numeric variable, varname, is supplied together with an over(groupvar) option specifying a grouping variable, then results are calculated for that one variable for each distinct value of groupvar. This is called the groups syntax.

The set referred to by each command name is a temporary reduced dataset consisting of results for each variable or group named. Optionally, but also usually, the dataset of results may be saved for later use.

2.2 Variables in each results set

2.2.1 All commands

varname is a string variable holding the name or names of each variable being summarized.

varlabel is a string variable holding the variable label of each variable being summarized. If no variable label has been defined, the value is instead the variable name.

(Groups syntax only) origgvar is a numeric or string variable as specified in the over() option.

(Groups syntax only) groupvar is a string variable holding the name of the group variable specified in the over() option.

(Groups syntax only) gvarlabel is a string variable holding the variable label of the group variable specified in the over() option. If no variable label has been defined, the value is instead the variable name.

(Groups syntax only) group is a numeric variable with value labels describing each distinct value of groupvar. Each such variable has integer values 1 up and value labels derived from the variable specified.

n is a numeric variable holding the number of observations used in the estimate.

2.2.2 cisets and pctilesets only

weights is a string variable that appears if (and only if) weights were specified. It serves as a record of such use.

2.2.3 cisets only

statname is a string variable holding a brief description of the parameter being estimated by a point estimate and a confidence interval. With the subcommand centile, the corresponding percent is shown, either as specified with centile() or as defaulting to 50% (median).

point is a numeric variable holding the point estimate of the parameter being estimated.

(Subcommands means and proportions only) se is a numeric variable holding the standard error reported.

lb is a numeric variable holding the lower bound of the confidence interval estimate.

ub is a numeric variable holding the upper bound of the confidence interval estimate.

level is a numeric variable holding the confidence level used.

options is a string variable that appears if (and only if) other options have been specified. It serves as a record of such option choice.

2.2.4 pctilesets only

Any or all of min, p1, p5, p10, p25, p50, p75, p90, p95, p99, and max are numeric variables holding results for the measure concerned.

2.2.5 quantilesets only

One or more quantile variables named q1, q2, and so forth. Conventions are best explained by example. Suppose the specification was prob(0.25 0.5 0.75). Then there are three resulting quantile variables named q1, q2, and q3, and they have variable labels 0.25 quantile, 0.5 quantile, and 0.75 quantile.

method is a string variable naming the estimation method used. The default method choice is tukey.
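The naming convention just described can be sketched with a call like the following. This is a minimal sketch, not code from the original column: it assumes that quantilesets is installed (and that Stata is recent enough to supply the Mata function quantile()), that the syntax follows the groups syntax and the prob() and saving() options as described in this column, and that the generic filename results is acceptable to overwrite.

```stata
* Sketch only: syntax assumed from the descriptions in this column
sysuse auto, clear

* groups syntax: quartiles and median of mpg for each level of foreign
quantilesets mpg, over(foreign) prob(0.25 0.5 0.75) saving(results, replace)

* inspect the results set: q1, q2, q3 should carry labels naming each probability
use results, clear
describe q1 q2 q3
list group n q1 q2 q3 method
```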

2.3 Why both pctilesets and quantilesets?

pctilesets and quantilesets differ as follows:

As stated, pctilesets is a wrapper for summarize, while quantilesets is a wrapper for Mata function quantile().

pctilesets offers no choice of estimation method, while quantilesets offers several methods. That could be important if you have a strong preference in principle for one method, wish to match a choice made in other software, or wish to compare results obtained with different methods. Often, differences will be trivial but not always, as with small datasets or those with gaps, spikes, or multiple modes.

pctilesets as a wrapper for summarize offers only selected percentiles, including the minimum and maximum. quantilesets allows any probability level between 0 and 1 (inclusive) to be specified.

pctilesets supports aweights and fweights, while quantilesets does not support weights.

2.4 More on quantiles and percentiles

The ideas of quantiles and percentiles have been developed in different if complementary directions. Some notes follow that discuss the main ideas, various uses of those terms, and key Stata commands and what they offer. You may wish to skip or skim, depending on your interests and whether the material is familiar.

As explained, for example, by Cox (2024b), the term quantile has acquired related but distinct meanings. One meaning refers to all the values of a variable sorted or ordered in magnitude, the order statistics, especially when plotted, usually against the quantiles of another variable or as estimated for a candidate fitted distribution. This is the sense behind official commands quantile, qqplot, qnorm, and qchi and behind community-contributed commands such as qplot (Cox 1999, 2005b) (discussed below), multqplot (Cox 2012, 2019), and qqplotg (Cox 2024b). The term percentile is also occasionally used in this sense (for example, Cleveland [1985]).

The related but distinct meaning foremost in both pctilesets and quantilesets is that of summary statistics (or correspondingly, parameter estimates) defined by the fraction or probability of values being lower and thus also by the complementary fraction of values being higher. Such statistics must be calculated or such parameters estimated using a rule or recipe that variously yields either original data values or points between them. The simplest example of such a recipe is that for the median, as given by the middlemost value if the number of values is odd and by the mean of the two middlemost values (the comedians) if the number of values is even. This recipe is explained to mathematical audiences as a convention and to less mathematical audiences as a rule.
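The median recipe can be checked in a few lines using only official commands. This toy do-file sketch uses made-up values 1 to 5 and shows that summarize returns the middlemost value for an odd number of values and the mean of the two middlemost values for an even number.

```stata
* Odd number of values: 1, 2, 3, 4, 5
clear
set obs 5
generate x = _n
summarize x, detail
display r(p50)          // 3, the middlemost value

* Even number of values: 1, 2, 3, 4
drop in 5
summarize x, detail
display r(p50)          // 2.5, the mean of 2 and 3
```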

Much of the rationale for the Mata function quantile() and thus for quantilesets is that several such rules or recipes exist, which usually produce similar but not necessarily identical results. Such varying methods are tied up with different methods for calculating the corresponding cumulative probabilities (in a graphical context often known as plotting positions) because quantiles are obtained in principle by inverting the (cumulative) distribution function. The documentation for quantile() flags Cunnane (1978) and Hyndman and Fan (1996) as key references, but yet more approaches to quantile estimation exist. See, for example, Harrell and Davis (1982) (and correspondingly hdquantile [Cox 2005a] from the SSC Archive) and Ma, Genton, and Parzen (2011).

In official Stata, the commands for estimating particular quantiles include pctile, _pctile, and centile. Other official commands producing quantiles typically rest on one such command. See also Jann’s (2005) mm_quantile() and related Mata functions from his package moremata (downloadable from the SSC Archive).

In practice, the term percentile is typically used in this sense of quantile, as a summary statistic or as a parameter estimate defined by the percent of values lower. Usage has often departed from any implication that there are 99 distinct percentiles for percents 1(1)99 (or 101 if the minimum and maximum are regarded as the 0 and 100% percentiles). Thus, many statistical people would feel no discomfort in regarding, say, the 2.5% and 97.5% points as also being percentiles. For a menagerie of related terms, from tertile onward, see Cox (2016). To the list given in that article may be added pentile = quintile (5 bins implied), decentile = decile (10), hexadecile = suboctile (16), ventile = vigintile (20), and trentile (30).

Yet another meaning of quantile is to refer to the bins, classes, or intervals they delimit, as when the first quartile is the lowest quarter of a distribution. That meaning is not directly implied by any command introduced in this column.

2.5 Strategy

cisets, pctilesets, and quantilesets just list their results by default. Although saving to a permanent dataset is optional, that is the intended key to many useful applications. Either the reduced results dataset is what is needed, or it may be combined using append or merge with other such sets for further analysis.

Thus, the approach is one of providing a building block that may be useful directly or if combined with other building blocks. Flexibility is needed because so many different problems may be of interest, not just comparison of results for different variables or of results for one variable for different groups but also comparison of results for several variables and several groups, and so forth. Many projects call for or at least would benefit from comparison of parameters (say, different kinds of mean), comparison of intervals for different confidence levels, comparison of different methods for estimating intervals or quantiles, and so forth.
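As one illustration of the building-block idea, intervals at two confidence levels might be saved separately and then stacked with append. This is a hedged sketch rather than the author's code: it assumes cisets is installed and accepts the subcommand and options as described in this column, and the filenames ci90 and ci95 are hypothetical.

```stata
* Sketch only: cisets syntax assumed from the descriptions in this column
sysuse auto, clear
cisets mean mpg, over(foreign) level(90) saving(ci90, replace)
cisets mean mpg, over(foreign) level(95) saving(ci95, replace)

* stack the two results sets; the variable level records which is which
use ci90, clear
append using ci95
list group n point lb ub level, sepby(level)
```

Because each results set carries its own record of variable, group, and level, the appended dataset needs no further bookkeeping before graphing or tabulation.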

At first sight, a results set may seem repetitious. With a little experience, you will see that such repetition is often helpful when combining such sets. In any case, you can always ignore what you do not need. Similarly, you can use rename and replace as you wish on the results of these commands.

Graphical and other applications lie downstream of each command, as discussed in section 3 and as will be shown in more detail in a later column. Helper commands include myaxis to sort on some criterion (Cox 2021) and mylabels and nicelabels (Cox 2022) and niceloglabels (Cox 2018, 2025b) for automating axis labels.

2.6 Ignoring inapplicable data

As in the official command ci, variables that are not (0, 1) binary variables are ignored with cisets proportions.

As in the official command ameans, any zero or negative values are ignored with cisets gmean and cisets hmean.

3 Examples

3.1 Using cisets

For first examples, we use cisets to produce confidence intervals for variables in auto.dta. Confidence intervals for means are often desired. We illustrate for groups of data and for different variables. Here and below, we focus on Stata technique only and freely overwrite a generic results dataset, results.dta. In a real research project, it is much more likely that you will be carefully saving different results to different places. More broadly, saving results to a dataset is not at all compulsory, but it is often a good idea.
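Calls of the kind described here might look as follows. This is a minimal sketch under stated assumptions, not a reproduction of the original listing: it assumes cisets is installed and that the subcommand mean and the options over() and saving() behave as described in this column. The choice of mpg, weight, price, and rep78 is purely illustrative.

```stata
* Sketch only: cisets syntax assumed from the descriptions in this column
sysuse auto, clear

* groups syntax: one variable, one interval for each distinct value of rep78
cisets mean mpg, over(rep78) saving(results, replace)

* variables syntax: one interval for each variable named
cisets mean mpg weight price, saving(results, replace)
```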

We could use the same approach for producing confidence interval sets for other summary measures, as already explained in section 2.1. As one variation among many possible, let’s look at medians and keep going all the way to graphic display. The median is the default result for cisets centile. Note the option total, which adds results for all observations with nonmissing values on the variables specified.

The default statistic name, 50 pctile, is a little ugly to my taste for a graph, so I specify an alternative. Otherwise, the code for a display is fairly general, using numeric and string data in the results set. Here spikes are used for intervals and marker symbols for point estimates (figure 1).
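A display along these lines could be sketched as below. The variable names lb, ub, point, group, statname, varlabel, and gvarlabel are those documented for the results set in section 2.2; everything else (the cisets syntax, the replacement text median, the placement of the subtitle) is an assumption made for illustration and need not match the code behind figure 1.

```stata
* Sketch only: cisets syntax assumed from the descriptions in this column
sysuse auto, clear
cisets centile mpg, over(rep78) total saving(results, replace)

use results, clear
replace statname = "median"          // replaces the default statistic name

* how many groups are there, including the extra one added by total?
summarize group, meanonly

twoway rspike lb ub group                                      ///
    || scatter point group,                                    ///
    xlabel(1/`r(max)', valuelabel) xtitle("`=gvarlabel'")      ///
    ytitle("`=varlabel'") subtitle("`=statname'", place(w))    ///
    legend(off)
```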

Figure 1.

Median miles per gallon for various levels of repair record and for all observations, together with 95% confidence intervals

You may not be familiar with the `=exp' syntax documented at help macro. When we invoke a variable name in that way, as in the calls to the xtitle(), ytitle(), and subtitle() options, Stata will use the value of each variable in the first observation, which is fine for our example because the variables concerned are constant by construction. That syntax does not rule out specifying particular observation numbers too.
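The behavior of this syntax is easy to check with official commands alone. The small sketch below uses auto.dta, chosen only for convenience, and shows that a bare variable name is evaluated in the first observation and that an explicit subscript picks out any other observation.

```stata
sysuse auto, clear
display "`=make'"         // value of make in the first observation
display "`=make[1]'"      // the same, with the observation number explicit
display "`=make[_N]'"     // value of make in the last observation
```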

From Stata 19 onward, the code can be abbreviated using the command twoway rpspike, which combines range and point elements. Here is that alternative syntax. Because the graph produced is identical to figure 1, it is not repeated here.

You might prefer a horizontal display. The horizontal option specifies that for twoway rspike or twoway rpspike. Otherwise, for scatter, just reverse the order of variables. Options referring to x and y also need exchanging.

The graph produced is also not shown here to save space.

It is good practice when reporting confidence intervals to tell people about sample or subsample sizes, although that is often not done. Sample size is always saved by cisets as a variable in the results set, which should help. The small questions that then arise are exactly where and how to show sizes on the graph.

The overall range across confidence intervals on the display will be the difference between the maximum upper bound and the minimum lower bound. Here the sample sizes are displayed at one-tenth that range below the smallest value on the display. We add a prefix n = to each sample size. Your choices may differ, which is much of the point: the style here is to customize a display as you wish (figure 2).
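The positioning rule just described can be sketched as follows. The results-set variables n, lb, ub, point, and group are as documented in section 2.2, but the cisets call, the new variable names where and show, and the graph options are illustrative assumptions rather than the code behind figure 2.

```stata
* Sketch only: cisets syntax assumed from the descriptions in this column
sysuse auto, clear
cisets centile mpg, over(rep78) total saving(results, replace)
use results, clear

* overall range: maximum upper bound minus minimum lower bound
summarize ub, meanonly
local max = r(max)
summarize lb, meanonly
local min = r(min)

* put the text one-tenth of that range below the smallest value shown
generate where = `min' - (`max' - `min') / 10
generate show = "n = " + string(n)

twoway rspike lb ub group                                     ///
    || scatter point group                                    ///
    || scatter where group, msymbol(none) mlabel(show)        ///
    mlabposition(0) legend(off)
```

Showing the sizes as marker labels on invisible markers keeps them in the same coordinate system as the intervals, so they move with the data if the results change.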

Figure 2.

Median miles per gallon for various levels of repair record and for all observations, together with 95% confidence intervals and subset sizes. This is similar to figure 1, except that subset sizes have been added.

The alternative call using twoway rpspike should now be easy to work out. Indeed, capped spikes as a popular alternative can be obtained using twoway rcap or (from Stata 19 on) twoway rpcap.

Yet another way to show confidence intervals is to use range bars rather than range spikes (figure 3).

Figure 3.

3.2 How to represent confidence intervals graphically

Let’s back up for a brief discussion of ways to show confidence intervals.

In statistical graphics, showing confidence intervals—or their antecedents and relatives under various names such as error bars—has a history over decades, if not centuries. That history does not seem well documented.

Intervals of ±1 and ±2 probable errors were displayed on a combined histogram and dot or strip plot by Wood and Stratton (1910, 427).^[1] The probable error in their work is in essence the estimated distance between the center of a symmetric distribution, taken to be normal, and each quartile. It was, in terms more usual now, calculated as the standard deviation multiplied by 0.67. For a more precise evaluation of that multiplier, use invnormal(0.75) in Stata. If you know of an earlier graphical use, I would be delighted to learn about it.
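The multiplier can be evaluated directly, as the text suggests. The second line below is an illustrative assumption only: it applies the multiplier to the standard deviation of mpg in auto.dta to show how a probable error would be obtained.

```stata
* distance from the center of a normal distribution to each quartile, in sd units
display %8.6f invnormal(0.75)       // 0.674490

* illustration: probable error as sd times that multiplier
sysuse auto, clear
summarize mpg
display r(sd) * invnormal(0.75)
```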

A similar device with intervals of one probable error on histograms was used by Brunt (1917, 1931) in his text.^[2] Brunt used a multiplier of 0.6745, which is a little more accurate, although for graphical purposes the difference is likely to be immaterial.

In terms of Stata’s own twoway commands, use of something like scatter to show a point estimate as a point or marker symbol is very common but not universal. Some authors argue that the importance of showing estimates as intervals means that a point symbol should be suppressed. More frequently, bars usually starting at 0 are used to show point estimates. When combined with capped or uncapped spikes, as mentioned just below, such plots have been described pejoratively as dynamite, detonator, or plunger plots and are often deplored. Problems with such plots include the facts that comparisons with 0 are often not interesting or useful and that such displays may show too little about the underlying data.

Otherwise, showing point estimates with markers or point symbols using, say, the scatter command and the intervals with capped spikes using something like twoway rcap seems the most common style. Showing markers and uncapped spikes using something like twoway rspike is the next most common. Showing range bars, whether colored or blank, using something like twoway rbar seems less common than either of those. The choice may seem a matter of style or personal preference unless there is evidence that any form is most effective. Wilkinson (1999, 2005) gives examples of all, including range bars. See also Wilkinson (2006) on Pareto dot plots and Wilkinson (2023) more generally.

3.3 Using pctilesets

pctilesets produces sets of results based on any or even all of those percentiles (including the minimum and the maximum) that may be calculated by summarize. Its small but definite advantages over summarize include being able to select which of those percentiles you want and being able to save results to a new dataset. Some of those advantages are shared with various official commands, including tabstat, statsby, and table, but the particular combination in pctilesets of modest functionality and moderate simplicity may have some appeal.

We illustrate with a dataset of discreetly anonymous mathematics marks from Mardia, Kent, and Bibby (1979) and Mardia, Kent, and Taylor (2024). The appearance of a second edition of a text 45 years after the first should encourage all technical authors. Although the data were given as separate variables, reshaping them to a group structure is driven here by what will be easier for producing graphs.

By default, the string variable subject when used for a graph axis will be sorted alphabetically, from algebra to vectors, which is not necessary and would not be especially helpful. myaxis (Cox 2021) is designed for this problem and defines a one-to-one recoding based on a stated criterion, here median marks.

The graph shown now by way of illustration is a variation on a box plot. Beyond the possibilities offered by graph box or graph hbox, researchers often seek something rather different. A box plot customarily starts with a box or boxes, each showing median and lower and upper quartiles. Taste and circumstance may lead to different ideas about what else should be added, either beyond the quartiles or as extra indications, say, markers or lines for means, geometric means, or other summaries. General advice for Stata users seeking alternatives to conventional box plots is to concoct your own favored display using twoway calls (Cox 2009, 2010, 2013). Here we exemplify technique by quartile boxes with medians as markers, whiskers out to 5 and 95% points, and markers for extremes (figure 4).
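A display of this kind can be sketched with twoway calls on a results set. The sketch below substitutes auto.dta for the marks data, which are not reproduced here, and assumes that pctilesets is installed, that pctile() accepts a list of percentiles, and that the results variables are named min, p5, p25, p50, p75, p95, and max as described in section 2.2. The note text and marker choices are illustrative.

```stata
* Sketch only: pctilesets syntax assumed from the descriptions in this column
sysuse auto, clear
pctilesets mpg, over(rep78) pctile(5 25 50 75 95) minimum maximum ///
    saving(results, replace)

use results, clear
summarize group, meanonly

twoway rbar p25 p75 group, barwidth(0.4) fcolor(none)            ///
    || rspike p5 p25 group                                       ///
    || rspike p75 p95 group                                      ///
    || scatter p50 min max group, msymbol(O X X)                 ///
    legend(off) xlabel(1/`r(max)', valuelabel)                   ///
    note("boxes: quartiles; circles: medians; spikes: to 5% and 95% points; crosses: extremes")
```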

Figure 4.

A variant on box plots with conventions explained in a marginal note

The attitude here is very simple: researchers are in charge and can devise their own display for their own purposes. But nonstandard designs certainly need to be explained too.

3.4 Using quantilesets

You would need to use quantilesets rather than pctilesets if you wanted 1) to produce particular quantiles (percentiles, if you prefer that term) that are not possible results of summarize; or 2) to use a calculation method available through Mata function quantile() but not available through summarize.

Those motives naturally could both hold in a particular project.

As an example of motive 1, we will include calculation of octiles as well as other measures previously mentioned. The octiles (strictly, first and seventh octiles) correspond to cumulative probabilities 0.125 and 0.875. The term octile with this statistical meaning appears to have been introduced by McAlister (1879) in a brief but important article, which is better known for introducing lognormal distributions, although without the latter term.^[3]

Octiles were used by Crowe (1933) in an early precedent of the box plot, starting a tradition of frequent use of box plot ideas in geographical and climatological literature that was flagged by Cox and Jones (1981). They were also used by Matthews (1936) and Grove (1956). Many such references were given in a review of Crowe’s work and its influence by Johnston (2019). The homely term eighth was introduced by Tukey (1970, chap. 12) for one version of octiles. Eighths are among the so-called letter values; see Cox (2016) and its references for more detail on those.

Whatever the name, the point of showing an octile range graphically is no more and no less than to show where the central 75% of the distribution lies, just as the point of showing a quartile range (the interval bounded by the quartiles, with length the interquartile range) is just to show where the central 50% lies. Using these particular intervals is inevitably a little arbitrary, for which the main defense is that we also show the entire distribution.

As an example of motive 2, it is sufficient to use the default method named for Tukey, calculating quantiles that correspond to cumulative probabilities (i - 1/3)/(n + 1/3) for rank i and sample size n. This choice goes back at least to Tukey (1962a,b) and is mentioned, if a little indirectly, in Mosteller and Tukey (1977, chap. 5). There is a connection between this rule for plotting positions and advice in Tukey (1977, 496-497) on quite how to work with counted fractions. Hoaglin (1983, 44-49) explained that and gave a more detailed discussion. A key point is that to a good approximation, the median of the distribution of any particular order statistic is at the point where the value of the cumulative distribution function is given by this recipe. See also Baath (2013b,a) for entertaining blog posts on plotting counted fractions and Kerman (2011a,b) for related technical developments, some of Bayesian flavor.
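The plotting positions defined by this recipe are easy to compute with official commands. The sketch below uses n = 63 only because that is the number of years in the first period of the snowfall data discussed in this section. The middle rank, 32, receives probability exactly 0.5, and positions for ranks i and n + 1 - i sum to 1.

```stata
* Plotting positions (i - 1/3)/(n + 1/3) for ranks 1 to 63
clear
set obs 63
generate i = _n
generate double p = (i - 1/3) / (_N + 1/3)
list i p if inlist(i, 1, 32, 63)

* symmetry check: positions for ranks i and n + 1 - i add to 1
assert abs(p + p[_N + 1 - _n] - 1) < 1e-12
```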

For data here, we revisit the leading example given by Parzen (1979) in an article that is the starting point for the intended sequel. He used data on annual snowfall in Buffalo, New York, for 1910 to 1972 (63 years). The original data had been used earlier in PhD theses. Although they were not included in Parzen’s article, they have somehow become widely available since then and frequently used as a sandbox in articles and software. An especially convenient source is Miecznikowski (2019), which reprints the original data together with data for 1973 to 2015 (43 more years).

So we use quantilesets to get the desired quantiles and then merge back with the original data.

We are going to use quantile plots for the main display but with box plots to the left and right of each period, 1910-1972 and 1973-2015. To do that, we need somewhere to put them.

Here and elsewhere, the code was produced in steps. What is presented now is divided into chunks to ease printing and understanding. First come chunks of code for the box plots, including median symbols and octile spikes.

Although any differences are very small, it is appropriate to override the default plotting position. The syntax `=1/3' instructs Stata to calculate 1/3 on the fly. That is preferable to specifying any of 0.33, 0.333, and so on.
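The gain from calculating 1/3 on the fly can be seen with official commands alone. In this small sketch, the first line shows what the macro expands to, and the remaining lines compare each way of writing one third with the result of the division itself.

```stata
display "`=1/3'"                 // what Stata substitutes for the expression
display abs(`=1/3' - 1/3)        // negligible: limited only by numeric precision
display abs(0.33 - 1/3)          // error from truncating to two decimal places
display abs(0.333 - 1/3)         // error from truncating to three decimal places
```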

Figure 5 shows that the distribution has changed between the two periods. The later period shows higher level and spread and a lengthening of the tail of high values. The quartile boxes and octile spikes match that pattern.

Figure 5.

Snowfall at Buffalo, New York, 1910-1972 (data used by Parzen [1979]) and 1973-2015. Superimposed quantile plots are combined with display of medians, quartiles, and octiles for 1910-1972 (left) and 1973-2015 (right).

The focus here is on Stata technique, specifically graphical technique, with no intent to provide a self-contained climatological case study. Two simple disclaimers deserve mention. First, the division at 1972 and 1973 is purely a matter of when Parzen’s data stopped and has no climatological rationale. Indeed, the climate is better thought to be evolving fairly smoothly rather than jumping from one state to another. Second, a serious case study would need to consider a battery of climatological and other physical predictors, not just year of occurrence.

We close our examples with more illustrations of what can be done with the machinery now in use.

The first quantile plot many researchers meet in their statistical education is often a normal quantile plot in which one axis shows what is expected in a sample of the same size from a normal distribution, as standard normal deviates, with units (value - mean) / standard deviation. This plot has many other names, such as normal probability plot, normal scores plot, probit plot, rankit plot, fractile plot, and yet others: some people prefer Gaussian to normal. A normal quantile plot can be useful even if there is no strong expectation that data are, or even in some sense should be, normally distributed. The point was well put by Hills (1974, 28): “It can be useful to plot an observed distribution against the standard Gaussian even though there is no question of it being Gaussian in shape. The motive is that it is easier to study a distribution by comparing it with a standard shape than just by looking at it.” The help file for qplot includes various other quotations and references in the same spirit.

This plot is quite easy with qplot. We need to change what is shown on the x axis and similarly bar widths for the boxes.

Another natural addition is an axis using metric units for the convenience of readers in most countries of the world. In a nutshell, 1 inch = 25.4 millimeters (mm).
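One way to prepare labels for such an axis, using only official commands, is to loop over round numbers in mm and work out where each belongs on the scale in inches. This is a sketch under stated assumptions: the label values 1000(1000)5000 are illustrative, the macro names labels and inches are hypothetical, and the original column may well use a helper command instead.

```stata
* Build a label list: position in inches, text in mm
local labels
foreach mm of numlist 1000(1000)5000 {
    local inches = `mm' / 25.4
    local labels `labels' `inches' "`mm'"
}
display `"`labels'"'
```

The resulting macro could then be passed to an option such as ylabel(`labels', axis(2)) for a plot assigned to a second y axis.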

Figure 6 is the result.

Figure 6.

Snowfall at Buffalo, New York, 1910-1972 (data used by Parzen [1979]) and 1973-2015. Superimposed quantile plots are combined with medians, quartiles, and octiles for 1910-1972 (left) and 1973-2015 (right). Differences from figure 5 are the use of a normal horizontal scale and an extra axis showing labels in mm.

4 Syntax

4.1 Confidence interval sets for various summary statistics

Confidence interval sets for means, normal distribution

Variables syntax

Groups syntax

Confidence interval sets for means, Poisson distribution

Variables syntax

Groups syntax

Confidence interval sets for proportions

Variables syntax

Groups syntax

Confidence interval sets for variances

Variables syntax

Groups syntax

Confidence interval sets for standard deviations

Variables syntax

Groups syntax

Confidence interval sets for geometric means

Variables syntax

Groups syntax

Confidence interval sets for harmonic means

Variables syntax

Groups syntax

Confidence interval sets for centiles

Variables syntax

Groups syntax

aweights are allowed with cisets subcommands means for normal data, gmean, and hmean.

fweights are allowed with cisets subcommands means, proportions, variances, gmean, and hmean.

Weights are not allowed with cisets subcommand centile; see [U] 11.1.6 weight.

4.2 Percentile or quantile sets for selected levels

Variables syntax

Groups syntax

4.3 Quantile sets for selected probability levels

Variables syntax

Groups syntax

5 Options

5.1 All commands

(Variables syntax) inclusive may be specified if you wish to work with several variables together. By default, calculations are made only with observations that have nonmissing values for all variables specified. This option overrides that default selection; hence, for several variables, which observations with nonmissing values are used will be determined separately for each variable. In other jargon, this option triggers casewise deletion, not listwise deletion or complete case analysis. As a convenience for people familiar with that term or with other syntax used to this effect, cw and allobs are allowed as synonyms.
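The difference between the two selections can be previewed with official commands. This sketch uses auto.dta, in which rep78 has missing values and price does not, to count the observations that each selection would use.

```stata
sysuse auto, clear

* default: only observations nonmissing on all variables specified
count if !missing(price, rep78)     // 69

* with inclusive: determined separately for each variable
count if !missing(price)            // 74
count if !missing(rep78)            // 69
```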

(Groups syntax) over(groupvar) must be specified to name the group variable. Distinct groups of observations on groupvar will be used to determine results for the main variable specified.

(Groups syntax) total may be used with over(). It specifies that in addition to output for each group, output be added for all groups combined.

saving(filespec[, replace]) saves the results set to a file as a Stata dataset. The suboption replace must be specified to overwrite an existing dataset.

list_options are any options of list other than noobs that may be specified to tune listing of the results set.

5.2 cisets only

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.

(cisets mean) poisson specifies that the variables are Poisson-distributed counts; exact Poisson confidence intervals will be calculated. By default, confidence intervals for means are calculated based on a normal distribution.

(cisets mean, poisson) exposure(varname) is used only with poisson. You do not need to specify poisson if you specify exposure(); poisson is assumed. varname contains the total exposure (typically a time or an area) during or over which the number of events recorded was observed.
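The options for means carry the same names as those of official ci means, so their effect can be previewed there. The sketch below assumes that correspondence and uses the auto dataset plus a tiny made-up dataset of event counts and exposures; the variable names deaths and personyears are hypothetical.

```stata
sysuse auto, clear

* Normal-based interval for a mean at the 90% level
ci means mpg, level(90)

* Exact Poisson interval for a rate, on made-up counts and exposures
clear
input deaths personyears
3 1200
5 2100
2  900
end
ci means deaths, poisson exposure(personyears)
```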

(cisets proportions) exact, wald, wilson, agresti, and jeffreys specify how binomial confidence intervals are to be calculated. Only one of these options may be specified.

exact is the default and specifies exact (also known in the literature as Clopper-Pearson) binomial confidence intervals.

wald specifies calculation of Wald confidence intervals.

wilson specifies calculation of Wilson confidence intervals.

agresti specifies calculation of Agresti-Coull confidence intervals.

jeffreys specifies calculation of Jeffreys confidence intervals.
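The five methods can be compared side by side with official ci proportions, which uses the same option names. This is a sketch assuming that correspondence and the auto dataset, in which foreign is a 0/1 indicator as binomial intervals require.

```stata
sysuse auto, clear

* Exact (Clopper-Pearson) interval, the default
ci proportions foreign

* The four alternatives, one at a time
ci proportions foreign, wald
ci proportions foreign, wilson
ci proportions foreign, agresti
ci proportions foreign, jeffreys
```

With a proportion not close to 0 or 1 and a moderate sample size, the intervals differ only modestly; differences matter more near the boundaries or in small samples.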

(cisets variances) sd specifies that confidence intervals for standard deviations be calculated. The default is to compute confidence intervals for variances.

(cisets variances) bonett specifies that Bonett confidence intervals be calculated. The default is to compute normal-based confidence intervals, which assume normality for the data.
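The sd and bonett options can be combined, as this sketch with official ci variances shows. It assumes that the options behave as their official namesakes do, uses the auto dataset, and requires Stata 14.1 or later.

```stata
sysuse auto, clear

* Normal-based interval for the variance, then on the standard deviation scale
ci variances mpg
ci variances mpg, sd

* Bonett intervals, which relax the normality assumption
ci variances mpg, bonett
ci variances mpg, sd bonett
```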

(cisets centile) centile(#) specifies the centile or percentile to be reported. The default is to display the 50th centile or percentile (median). Specifying centile(5) requests that the fifth centile be reported.

Only one of the following options may be specified.

cci (conservative confidence interval) forces the confidence limits to fall exactly on sample values. Confidence intervals displayed with the cci option are slightly wider than those with the default.

normal causes the confidence interval to be calculated by using a formula for the standard error of a normal-distribution quantile. The normal option is useful when you want empirical centiles—that is, centiles based on sample order statistics rather than on the mean and standard deviation—and are willing to assume normality.

meansd causes the centile and confidence interval to be calculated based on the sample mean and standard deviation, and it assumes normality.
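These options match those of official centile, so the alternatives can be tried there first. The sketch assumes that correspondence and uses the auto dataset.

```stata
sysuse auto, clear

* Fifth centile with the default confidence interval
centile mpg, centile(5)

* Conservative interval, with limits falling exactly on sample values
centile mpg, centile(5) cci

* Empirical centile with a normal-theory interval
centile mpg, centile(5) normal

* Centile and interval both from the sample mean and standard deviation
centile mpg, centile(5) meansd
```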

5.3 pctilesets only

pctile(numlist) allows any or all of 1, 5, 10, 25, 50, 75, 90, 95, and 99, indicating that such percentiles be included in the results as calculated by summarize. Any other integers between 2 and 98 will be ignored with a warning.

minimum requests that the minimum be included in the results.

maximum requests that the maximum be included in the results.
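The percentiles on offer are exactly those that summarize leaves behind as stored results when its detail option is used. A minimal sketch with the auto dataset, an assumption of the example, shows where such values come from.

```stata
sysuse auto, clear

* detail is needed for the percentiles to be calculated and stored
summarize mpg, detail

* Selected stored results: minimum, some percentiles, and maximum
display r(min) "  " r(p5) "  " r(p25) "  " r(p50) "  " r(p75) "  " r(p95) "  " r(max)
```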

5.4 quantilesets only

prob(numlist) specifies one or more probability levels between 0 and 1 (inclusive) for estimation of quantiles. For example, prob(0.25 0.5 0.75) specifies estimation of the quantiles often known as lower (first) quartile, median, and upper (third) quartile. If not presented in ascending order, levels will be sorted to ascending order anyway.

method(name) requests the use of a particular estimation method. The default is method(tukey). See the help on quantile() for a list of allowed methods. See Hoaglin (1983, 44-49) for much more detail on the Tukey method, which hinges on the use of (rank - 1/3)/(sample size + 1/3) as probability level.
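The Tukey rule can be worked through by hand. The sketch below is not the code behind quantile() or quantilesets: it assumes the auto dataset, and it assumes linear interpolation between neighboring order statistics, which the description above does not spell out. It attaches (rank - 1/3)/(sample size + 1/3) to each ordered value and then inverts that rule for one probability level.

```stata
sysuse auto, clear
sort mpg

* Probability level attached to each ordered value: (rank - 1/3)/(n + 1/3)
generate double pp = (_n - 1/3) / (_N + 1/3)

* Invert the rule for one level: fractional rank, kept within 1 to n
local p = 0.5
local h = min(max(`p' * (_N + 1/3) + 1/3, 1), _N)
local lo = floor(`h')
local hi = ceil(`h')

* Linear interpolation between the two neighboring order statistics (assumed)
scalar q = mpg[`lo'] + (`h' - `lo') * (mpg[`hi'] - mpg[`lo'])
display "estimated quantile at p = `p': " q
```

For p = 0.5, the fractional rank is (n + 1)/2, so the estimate is the usual median; other levels give ranks that generally differ from those used by summarize.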

6 Conclusion

cisets, pctilesets, and quantilesets have been introduced as three new commands for data reduction: cisets for confidence intervals for various summary measures, and pctilesets and quantilesets for those percentiles or quantiles available from summarize or the Mata function quantile(). Their major benefit is that users may save results to a new dataset with further analysis in mind. Here the examples of applications have all been graphical: plots of one or more confidence intervals, of the entire set of quantiles, and of selected summary percentiles or quantiles.

An intended sequel will reverse the emphasis, taking these reduction commands as merely means toward graphical ends. It will focus on the graphical ideas, placing them in a fuller historical, statistical, and scientific context and giving more depth and detail.

7 Programs and supplemental materials

To install the software files as they existed at the time of publication of this article, type

Supplemental Material

sj-txt-1-stj-10.1177_1536867X261450274 - Supplemental material for Speaking Stata: Three commands for data reduction

sj-dta-1-stj-10.1177_1536867X261450274 - Supplemental material for Speaking Stata: Three commands for data reduction

About the author

Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks, postings, FAQs, and programs to the Stata user community. He has also coauthored 16 commands in official Stata. He was an author of several inserts in the Stata Technical Bulletin and is Editor-at-Large of the Stata Journal. His “Speaking Stata” articles on graphics from 2004 to 2013 have been collected as Speaking Stata Graphics (2014, College Station, TX: Stata Press). He is the Editor of Stata Tips, Volumes I and II (2024, also Stata Press).

References

Baath. 2013a. A Bayesian twist on Tukey’s flogs. Publishable Stuff: Rasmus Baath’s Blog. https://www.sumsar.net/blog/2013/09/a-bayesian-twist-on-tukeys-flogs/.

_____. 2013b. Going to plot some proportions? Why not flog ‘em first? Publishable Stuff: Rasmus Baath’s Blog. https://www.sumsar.net/blog/2013/09/going-to-plot-some-proportions/.

Bevington, P. R. 1969. Data Reduction and Error Analysis for the Physical Sciences. New York: McGraw-Hill.

Brunt. 1917. The Combination of Observations. London: Cambridge University Press.

_____. 1931. The Combination of Observations. 2nd ed. London: Cambridge University Press.

Cleveland, W. S. 1985. The Elements of Graphing Data. Monterey, CA: Wadsworth.

Cox, N. J. 1999. gr42: Quantile plots, generalized. Stata Technical Bulletin 51: 16–18. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 113–116. College Station, TX: Stata Press.

_____. 2005a. hdquantile: Stata module for Harrell-Davis estimator of quantiles. Statistical Software Components S449601, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s449601.html.

_____. 2005b. Speaking Stata: The protean quantile plot. Stata Journal 5: 442–460. https://doi.org/10.1177/1536867X0500500312.

_____. 2008. Speaking Stata: Correlation with confidence, or Fisher’s z revisited. Stata Journal 8: 413–439. https://doi.org/10.1177/1536867X0800800307.

_____. 2009. Speaking Stata: Creating and varying box plots. Stata Journal 9: 478–496. https://doi.org/10.1177/1536867X0900900309.

_____. 2010. Speaking Stata: The statsby strategy. Stata Journal 10: 143–151. https://doi.org/10.1177/1536867X1001000112.

_____. 2012. Speaking Stata: Axis practice, or what goes where on a graph. Stata Journal 12: 549–561. https://doi.org/10.1177/1536867X1201200314.

_____. 2013. Speaking Stata: Creating and varying box plots: Correction. Stata Journal 13: 398–400. https://doi.org/10.1177/1536867X1301300214.

_____. 2016. Speaking Stata: Letter values as selected quantiles. Stata Journal 16: 1058–1071. https://doi.org/10.1177/1536867X1601600413.

_____. 2018. Speaking Stata: Logarithmic binning and labeling. Stata Journal 18: 262–286. https://doi.org/10.1177/1536867X1801800116.

_____. 2019. Software Updates: gr42_8: Quantile plots, generalized. Stata Journal 19: 748–751. https://doi.org/10.1177/1536867X19874265.

_____. 2021. Speaking Stata: Ordering or ranking groups of observations. Stata Journal 21: 818–837. https://doi.org/10.1177/1536867X211045582.

_____. 2022. Speaking Stata: Automating axis labels: Nice numbers and transformed scales. Stata Journal 22: 975–995. https://doi.org/10.1177/1536867X221141058.

_____. 2024a. momentsets: Stata module for moment-based measures collected as datasets. Statistical Software Components S459392, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s459392.html.

_____. 2024b. Speaking Stata: Quantile-quantile plots, generalized. Stata Journal 24: 514–534. https://doi.org/10.1177/1536867X241276114.

_____. 2025a. lmomentsets: Stata module for L-moment-based measures collected as datasets. Statistical Software Components S459440, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s459440.html.

_____. 2025b. Software Updates: gr0072_2: Speaking Stata: Logarithmic binning and labeling. Stata Journal 25: 874. https://doi.org/10.1177/1536867X251398323.

Cox, N. J., and Jones. 1981. “Exploratory data analysis”. In Quantitative Geography: A British View, edited by Wrigley and R. J. Bennett, 135–143. London: Routledge and Kegan Paul.

Crowe, P. R. 1933. The analysis of rainfall probability: A graphical method and its application to European data. Scottish Geographical Magazine 49: 73–91. https://doi.org/10.1080/00369223308734882.

Cunnane. 1978. Unbiased plotting positions—a review. Journal of Hydrology 37: 205–222. https://doi.org/10.1016/0022-1694(78)90017-3.

Ehrenberg, A. S. C. 1975. Data Reduction: Analysing and Interpreting Statistical Data. London: Wiley.

Fisher, R. A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.

Grove, A. T. 1956. “Soil erosion in Nigeria”. In Geographical Essays on British Tropical Lands, edited by R. W. Steel and C. A. Fisher, 79–111. London: George Philip.

Harrell, F. E., and C. E. Davis. 1982. A new distribution-free quantile estimator. Biometrika 69: 635–640. https://doi.org/10.2307/2335999.

Hills. 1974. Statistics for Comparative Studies. London: Chapman and Hall.

Hoaglin, D. C. 1983. “Letter values: A set of selected order statistics”. In Understanding Robust and Exploratory Data Analysis, edited by D. C. Hoaglin, Mosteller, and J. W. Tukey, 33–57. New York: Wiley.

Hyndman, R. J., and Fan. 1996. Sample quantiles in statistical packages. American Statistician 50: 361–365. https://doi.org/10.1080/00031305.1996.10473566.

Jann. 2005. moremata: Stata module (Mata) to provide various functions. Statistical Software Components S455001, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s455001.html.

Johnston. 2019. Percy Crowe: A forgotten pioneer quantitative geographer and climatologist. Progress in Physical Geography: Earth and Environment 43: 586–600. https://doi.org/10.1177/0309133319843430.

Kerman. 2011a. A closed-form approximation for the median of the beta distribution. arXiv:1111.0433v1 [math.ST]. https://doi.org/10.48550/arXiv.1111.0433.

_____. 2011b. Neutral noninformative and informative conjugate beta and gamma prior distributions. Electronic Journal of Statistics 5: 1450–1470. https://doi.org/10.1214/11EJS648.

Genton, M. G., and Parzen. 2011. Asymptotic properties of sample quantiles of discrete distributions. Annals of the Institute of Statistical Mathematics 63: 227–243. https://doi.org/10.1007/s10463-008-0215-z.

Mardia, K. V., J. T. Kent, and J. M. Bibby. 1979. Multivariate Analysis. London: Academic Press.

Mardia, K. V., J. T. Kent, and C. C. Taylor. 2024. Multivariate Analysis. 2nd ed. Hoboken, NJ: Wiley.

Matthews, H. A. 1936. A new view of some familiar Indian rainfalls. Scottish Geographical Magazine 52: 84–97. https://doi.org/10.1080/00369223608735013.

McAlister. 1879. The law of the geometric mean. Proceedings of the Royal Society of London 29: 367–376. https://doi.org/10.1098/rspl.1879.0061.

_____. 1881. On the law of the geometric mean in the theory of errors. Quarterly Journal of Pure and Applied Mathematics 17: 175–194.

Miecznikowski, J. C. 2019. The polar refresh; Updating Emanuel Parzen’s Buffalo NY snowfall dataset. Technical Report 1901, Department of Biostatistics, University at Buffalo. https://publichealth.buffalo.edu/content/dam/sphhp/biostatistics/Documents/techreports/UB-Biostatistics-TR1902.pdf.pdf.

Mosteller and J. W. Tukey. 1977. Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.

Parzen. 1979. Nonparametric statistical data modeling. Journal of the American Statistical Association 74: 105–121. https://doi.org/10.2307/2286734.

Tukey, J. W. 1962a. The future of data analysis. Annals of Mathematical Statistics 33: 1–67. https://doi.org/10.1214/aoms/1177704711.

_____. 1962b. Correction notes: Correction to “The future of data analysis”. Annals of Mathematical Statistics 33: 812. https://doi.org/10.1214/aoms/1177704604.

_____. 1970. Exploratory Data Analysis. Vol. 1, limited preliminary ed. Reading, MA: Addison-Wesley.

_____. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.

Wilkinson. 1999. The Grammar of Graphics. New York: Springer. https://doi.org/10.1007/978-1-4757-3100-2.

_____. 2005. The Grammar of Graphics. 2nd ed. New York: Springer. https://doi.org/10.1007/0-387-28695-0.

_____. 2006. Revising the Pareto chart. American Statistician 60: 332–334. https://doi.org/10.1198/000313006X152243.

_____. 2023. “Graphic displays of data”. In APA Handbook of Research Methods in Psychology: Data Analysis and Research Publication. APA Handbooks in Psychology Series, edited by H. Cooper, M. N. Coutanche, L. M. McMullen, A. T. Panter, D. Rindskopf, and K. J. Sher, vol. 3: 77–110. 2nd ed. Washington, DC: American Psychological Association. https://doi.org/10.1037/0000320-004.

Wood, T. B., and F. J. M. Stratton. 1910. The interpretation of statistical results. Journal of Agricultural Science 3: 417–440. https://doi.org/10.1017/S0021859600001210.
