Abstract
Most regional economic databases (e.g., US Economic Census and County Business Patterns [CBP]) have some employment records suppressed and then represented as ranges, in order to guarantee the confidentiality of the data. This article incorporates the implicit temporal relationships between annual employment data over several years into an optimization model designed to estimate suppressed records. This model minimizes (1) the sum of the deviations between the estimates and target values within the corresponding ranges and (2) the sum of the deviations between the estimates and an employment trend curve endogenously determined through absolute-value regression. The 1999–2006 CBP data for Arizona are used to test the model. Two decision-theoretic criteria (Pareto frontier and concordance–discordance analysis) are used to analyze the results, pointing to a specific set of parameters yielding the best estimates.
Introduction
Regional economic databases, such as the Economic Census (EC) and County Business Patterns (CBP) in the United States and the Labour Force Survey in the United Kingdom, provide essential inputs to economic analyses. These databases vary in terms of their geographical detail, down to the state (EC-mining), county (EC and CBP), and place with 2,500+ people (EC-manufacturing), and in terms of their publication frequency (annual for CBP, every five years for EC). They also vary in terms of content and economic sector disaggregation, with employment as the most common data. An additional feature of these databases is that a significant number of data records are suppressed and represented by interval flags.
The issue of economic data suppression has generated a limited literature on methods to estimate suppressed records (e.g., Gardocki and Baj 1985; Sechrist 1986; Kreahling, Smith, and Frumento 1996; Isserman and Westervelt 2006; Zhang and Guldmann 2009). These methods focus on annual cross sections of data. However, all economic databases are longitudinal, and a given record, defined by sector and geography, may be suppressed in some years, but not in others. Accounting for the available time series may improve the estimation. This article develops and implements an optimization methodology for estimating suppressed employment data in a longitudinal database, extending the approach developed by Zhang and Guldmann (2009). The 1999–2006 CBP data for Arizona are used as a case study.
The remainder of the article is organized as follows. The second section presents an overview of CBP data and some of their applications, clarifies and illustrates the data suppression problem, and reviews existing estimation methods. The third section presents the estimation methodology, including the narrowing of the flag intervals and the optimization model. The fourth section presents an application of the methodology to Arizona CBP data, including a search for combinations of model parameters that provide better estimates. The fifth section consists of a discussion of the results and caveats. The sixth section concludes the article and outlines areas for further research.
Statement of the Problem
The US Census Bureau (CB) began to publish CBP data annually in 1964, and complete data sets are available for download from the Census website (http://www.census.gov/econ/cbp/download/index.htm) back to 1986. These data sets provide employment, payroll, and the total number of establishments, by industry and by geography. Numbers of establishments by employment range are also available. The database includes a national file, a state file for each of the fifty states, and a county file for each of the 3,143 counties. The North American Industry Classification System (NAICS) has been used since 1998. It has a hierarchical structure. The grand total in any geography is disaggregated into twenty-one 2-digit sectors, each of which consists of a variable number of 3-digit sectors. This nested disaggregation continues down to the 6-digit level.
CBP data have been extensively used by academic researchers, governmental agencies, and private firms. For instance, CBP data have been used by Isserman (1977) and Guimaraes, Figueiredo, and Woodward (2009) to compute location quotients in order to assess regional economic impacts and to measure industrial agglomeration, and by Stevens and Moore (1980), Miller and Blair (1985), and Henderson (1997) for shift-share, regional input–output, and econometric analyses. They have also been used to evaluate government economic policies, such as enterprise zone programs (Dowall 1996; Moore 2003). Pagoulatos (2004) has combined CBP with other data to assess the environmental impacts of economic development.
To guarantee that individual establishment records cannot be retrieved, records have to be suppressed. Table 1 presents the distribution of suppressed records across the economic and geographic hierarchies of the 2006 CBP. More records are suppressed at higher disaggregation levels. Only 4.0 percent and 31.2 percent of two-digit records are suppressed in the state and county files, respectively, as compared to 41.2 percent and 73.9 percent at the six-digit level. More records are suppressed in smaller geographies. At the national level, twelve of the 2,148 records (0.6 percent) are suppressed. At the state level, 33.4 percent of the 99,082 records are suppressed. At the county level, 68.1 percent of the 2,203,501 records are suppressed. The CB provides a flag (A, B, C, E, … , or M) for each suppressed record. Each flag is associated with an employment interval. For instance, flag A corresponds to the interval [0, 19].
Table 1. Numbers of Suppressed Records and Rates of Suppression in the 2006 County Business Patterns (CBP).
The CB recommends that the midpoints of the flag intervals be used in place of the suppressed employment data. Table 2 illustrates the suppression issue when the midpoint approach is applied to the 2000 CBP data for Arizona (sixteen counties and twenty-one 2-digit sectors). The actual and estimated employments are presented for only four counties (Coconino, Graham, Greenlee, and Pinal) and for sector and county totals. A gray cell indicates a suppressed record. Note that county and sector totals are not suppressed. At least two records must be suppressed in every row and column to avoid possible disclosure of information. For instance, in Coconino County, if the data of only one sector (11 or 99) were suppressed, the missing value could be retrieved by subtracting the sum of all the known data from the county total (38,917). Similarly, if the data for educational services were suppressed for only one county (Graham or Greenlee), the missing value could be retrieved by subtracting the sum of the known data from the sector total (24,486). The midpoint approach creates inconsistencies in county and sector totals, as illustrated in the last row and column in Table 2, where the differences between the computed and actual totals are presented. This inconsistency is more acute in counties with large numbers of suppressed records, for instance, Greenlee (529 or 19.1 percent).
Table 2. Illustration of Data Suppression and the Midpoint Approach: The Case of 2000 Arizona Two-digit Data.
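The two mechanisms described above (midpoint substitution and retrieval by subtraction) can be sketched in a few lines. This is a minimal illustration, not the CB procedure: the employment figures are hypothetical, flags A and F use the intervals quoted in this article, and the interval for flag B is taken from the CBP flag table as an assumption.

```python
# Flag intervals: A and F are quoted in the text; B follows the CBP flag
# table and is treated here as an assumption.
FLAG_INTERVALS = {"A": (0, 19), "B": (20, 99), "F": (500, 999)}


def midpoint(flag):
    """Midpoint of a flag interval, used in place of the suppressed value."""
    lo, hi = FLAG_INTERVALS[flag]
    return (lo + hi) / 2


def total_gap(known, flags, published_total):
    """Computed total (known values plus midpoints) minus the published total."""
    return sum(known) + sum(midpoint(f) for f in flags) - published_total


def recover_single(known, published_total):
    """If only one record were suppressed, subtraction would disclose it."""
    return published_total - sum(known)


# Hypothetical county: three disclosed sectors, two flagged sectors.
gap = total_gap([1200, 850, 430], ["A", "B"], 2600)
print(gap)
```

A nonzero gap is the kind of inconsistency reported in the last row and column of Table 2, and `recover_single` shows why at least two records per row and column have to be suppressed.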
Several other methods have been proposed to estimate suppressed information. Gardocki and Baj (1985) take advantage of the lower suppression rates in large geographies, and use state or national data to estimate suppressed county employment. Weddleton and Olson (1988) test regression and ratio approaches to estimate undisclosed CBP earnings, using available disclosed data. The regressions estimate the relations between industry earnings and number of establishments, and the ratio approaches directly use the earning–establishment ratio, which may be available for larger geographies and is assumed to be stable within industries. The above methods do not consider internal consistency within the geographic and economic hierarchies. The method of indirect standardization (Sechrist 1986) guarantees internal consistency within the economic hierarchy (the sum of disclosed and estimated employment matches upper-level disclosed totals). Kreahling, Smith, and Frumento (1996) propose an approach similar to the indirect standardization but with improved accuracy. They consider both flag range and establishment size information (ESI) to create smaller intervals. Ellison and Glaeser (1997) further consider geographic consistency and use an ad hoc sequential procedure to estimate suppressed information. Isserman and Westervelt (2006) propose an ad hoc approach to provide employment ranges internally consistent with the available information. They claim that these are the smallest possible ranges. Zhang and Guldmann (2009) develop an optimization model that considers flag and establishment information and other constraints within the economic and geographic hierarchies, and generates employment estimates that minimize the sum of weighted squared deviations between the estimates and a set of target values. The approach was tested with 2000 Arizona CBP data.
Method
The optimization model presented in this section extends the static model proposed by Zhang and Guldmann (2009), by incorporating employment trend information and tighter bounds for the suppressed data. Prior to presenting this model, the approach to first tightening the bounds is discussed, following in part the approach of Isserman and Westervelt (2006). Second, actual employment trends are analyzed by industry and county to assess the potential value of incorporating temporal information into the model.
Narrowing the Flagged Employment Intervals
Several pieces of information are used to generate smaller employment ranges, including the suppression flag, the number of establishments by employment range, and the hierarchical relationships across NAICS levels. This process is illustrated with Arizona data.
The narrowing process starts with the flag intervals. There are 2,832 two-digit records in the sixteen counties of Arizona over 1998–2006, 778 of which are suppressed, mostly flagged as A or B. The lower and upper bounds of a flag interval for sector i in county c in year t, as defined by the CB, are noted as $FMIN_{cit}$ and $FMAX_{cit}$. The column “Interval 1” in Table 3 presents the interval size for each flag.
Table 3. Descriptive Statistics for the Size of Flagged and Narrowed Intervals: The Case of Two-digit Arizona Data over 1998–2006.
Note: (a) Interval 1 is the flag interval itself. Interval 2 accounts for additional two-digit level establishment information. Interval 3 further considers information across hierarchical levels.
(b) Computed as (interval 1 size − mean interval 3 size)/interval 1 size, that is, the percentage reduction relative to the flag interval.
ESI, which is never suppressed, allows for narrowing the flag interval. For instance, sector 21 (mining, quarrying, and oil and gas extraction) in county 021 (Pinal) is flagged as F [500, 999] in 2000. However, the ESI indicates a range of [347, 708]. The flag interval [500, 999] can then be narrowed down to [500, 708]. For 23 of the 51,883 Arizona records flagged over 1998–2006 (across all NAICS levels), the lower bound generated by the ESI exceeds the upper bound of the flag. Census documentation indicates that the flag interval is based on mid-March employment, while establishment counts include any establishment with at least one employee during the year (http://www.census.gov/econ/cbp/definitions.htm). Inconsistencies may therefore occur between flag and establishment intervals. The lower ESI bound is thus considered unreliable, and only the upper ESI bound is used to narrow down the flag interval. In Table 3, the “Interval 2” columns provide descriptive statistics for the sizes of these new intervals.
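The narrowing rule just described can be sketched as follows. The size classes are assumed to be the standard CBP establishment size classes (the open-ended 1,000+ class is left out because it has no finite upper bound), and the establishment counts are hypothetical; only the flag F interval and the ESI upper bound of 708 come from the Pinal example.

```python
# Assumed CBP establishment size classes (employees per establishment).
SIZE_CLASSES = [(1, 4), (5, 9), (10, 19), (20, 49), (50, 99),
                (100, 249), (250, 499), (500, 999)]


def esi_bounds(counts):
    """Lower and upper employment implied by establishment counts per class."""
    lo = sum(n * c_lo for n, (c_lo, _) in zip(counts, SIZE_CLASSES))
    hi = sum(n * c_hi for n, (_, c_hi) in zip(counts, SIZE_CLASSES))
    return lo, hi


def narrow(flag_interval, esi_upper):
    """Keep the flag's lower bound; tighten only the upper bound with the ESI."""
    f_lo, f_hi = flag_interval
    return f_lo, min(f_hi, esi_upper)


# Hypothetical establishment counts for one sector/county/year.
print(esi_bounds([3, 2, 1, 1, 0, 1, 0, 0]))
# Pinal mining example from the text: flag F with an ESI upper bound of 708.
print(narrow((500, 999), 708))
```

Ignoring the lower ESI bound in `narrow` mirrors the choice made above to treat that bound as unreliable.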
Finally, information at more disaggregated NAICS levels can be used to further narrow Interval 2. Appendix A illustrates a forward approach to find a narrower interval for sector 21 (mining) in county 21 (Pinal, Arizona) in 2000, considering flag intervals and ESI across all NAICS levels, starting at the six-digit level. In Table 3, “Interval 3” columns provide descriptive statistics for this new set of intervals. For every type of flag, the average interval size has been further reduced. Overall, reductions in interval size range from 27 percent to 44 percent, except for flag H.
Employment Temporal Patterns
Does sector/county employment display a stable trend? To test this hypothesis, several regression models, running from linear to quadratic to cubic, are estimated with actual (non-suppressed) Arizona employment data over 1998–2006. More details are presented in Appendix B. Overall, these regressions suggest that most sectors display a stable employment pattern over time. On average, the linear model accounts for 50.9 percent of employment variations, the quadratic model for 65.6 percent, and the cubic model for 77.3 percent. These patterns support the incorporation of employment time series in the optimization model.
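The comparison of linear, quadratic, and cubic fits can be reproduced in outline with a small least-squares routine. The sketch below uses only the standard library and a hypothetical nine-year employment series; it is not the specification detailed in Appendix B, and the helper names are illustrative.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(t, y, degree):
    """R^2 of an OLS polynomial fit of the given degree (normal equations)."""
    powers = range(degree + 1)
    ata = [[sum(ti ** (p + q) for ti in t) for q in powers] for p in powers]
    aty = [sum(yi * ti ** p for ti, yi in zip(t, y)) for p in powers]
    coef = solve(ata, aty)
    fitted = [sum(c * ti ** p for p, c in enumerate(coef)) for ti in t]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical employment for one sector/county, t = 1 (1998) to 9 (2006).
years = list(range(1, 10))
employment = [410, 432, 425, 460, 471, 455, 490, 512, 530]
for degree in (1, 2, 3):
    print(degree, round(r_squared(years, employment, degree), 3))
```

Because the three models are nested, the share of variation explained cannot decrease as the degree increases, which is the pattern reported in the averages above.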
Regression-constrained Optimization Model
Definitions
The indices are:
c: county;
i: two-digit NAICS sector;
t: time (= 1 if 1998, = 2 if 1999, … , and = 9 if 2006).
The basic decision variable is:
$X_{cit}$: employment estimate for two-digit sector i in county c at time t.
The model parameters are:
M: set of combinations (c, i, t) for which employment is flagged;
$E_{cit}$, $E_{ct}$, $E_{it}$: actual employment, (c, i, t), (c, t), (i, t) ∉ M;
$MIN_{cit}$: minimum employment, (c, i, t) ∊ M;
$MAX_{cit}$: maximum employment, (c, i, t) ∊ M;
$CMIN_{ct}$: minimum county total employment, (c, t) ∊ M;
$CMAX_{ct}$: maximum county total employment, (c, t) ∊ M;
$SMIN_{it}$: minimum sectoral total employment, (i, t) ∊ M;
$SMAX_{it}$: maximum sectoral total employment, (i, t) ∊ M.
The narrow interval [$MIN_{cit}$, $MAX_{cit}$] for a two-digit suppressed record is discussed in the Narrowing the Flagged Employment Intervals section. As county totals and sector totals (across the whole state) may also be suppressed, [$CMIN_{ct}$, $CMAX_{ct}$] and [$SMIN_{it}$, $SMAX_{it}$] represent the narrow intervals for the corresponding records. When county or sector totals are not suppressed, the above intervals collapse onto the actual total employments, $E_{ct}$ and $E_{it}$.
Basic Constraints
The following constraints enforce data consistency in terms of (1) suppressed data intervals, (2) county total employment, and (3) state total employment:
Constraint (1) forces the estimation variable to be zero when the actual employment is available. Constraint (2) guarantees that any estimate must lie within its interval. Constraint (3) guarantees consistency between the sum of sectoral employments in a county and that county total employment. Constraint (4) replaces constraint (3) when the total county employment is flagged. Constraint (5) guarantees consistency between the sum of sectoral employments across all counties and that sector total employment at the state level. Constraint (6) replaces constraint (5) when the total state employment is flagged.
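Read together with the definitions above, constraints (1)–(6) might be written as follows. This is a sketch inferred from the verbal descriptions, not a transcription: the index sets are assumptions, and $E_{cit}$ is taken to be zero for flagged records, in line with the later remark that either $X_{cit} = 0$ or $E_{cit} = 0$.

```latex
\begin{align}
X_{cit} &= 0, && (c,i,t) \notin M \tag{1}\\
MIN_{cit} \le X_{cit} &\le MAX_{cit}, && (c,i,t) \in M \tag{2}\\
\sum_{i} \left( X_{cit} + E_{cit} \right) &= E_{ct}, && (c,t) \notin M \tag{3}\\
CMIN_{ct} \le \sum_{i} \left( X_{cit} + E_{cit} \right) &\le CMAX_{ct}, && (c,t) \in M \tag{4}\\
\sum_{c} \left( X_{cit} + E_{cit} \right) &= E_{it}, && (i,t) \notin M \tag{5}\\
SMIN_{it} \le \sum_{c} \left( X_{cit} + E_{cit} \right) &\le SMAX_{it}, && (i,t) \in M \tag{6}
\end{align}
```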
Estimation Objective Functions
The first objective function, $F_1$, is based on Zhang and Guldmann (2009) and minimizes the sum of the absolute values of the deviations between an estimate and a target location:
where l is a location parameter varying between 0 and 1. For instance, l = 0, 0.5, or 1 indicates that the target location is the lower bound, the midpoint, or the upper bound of [$MIN_{cit}$, $MAX_{cit}$].
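One form of the first objective that matches this description is given below. It is a sketch under the assumption that deviations are unweighted and summed over the flagged records only.

```latex
\begin{equation}
F_1 = \sum_{(c,i,t) \in M}
      \left| X_{cit} - \left[ MIN_{cit} + l \left( MAX_{cit} - MIN_{cit} \right) \right] \right|
\tag{7}
\end{equation}
```

With l = 0.5, the bracketed target is the interval midpoint, so the midpoint approach recommended by the CB appears as a special case of the target location.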
In order to illustrate the second objective function, $F_2$, that involves employment trends, consider the hypothetical employment data pattern in Figure 1, corresponding to a specific county/sector (c, i). The eight time periods (e.g., years) include four periods (

Figure 1. Hypothetical employment regression curve.
In order to avoid the above problem, the OLS estimation could incorporate both the actual data ($E_{cit}$) and the estimation variables ($X_{cit}$), which would be subject to constraints (1)–(6). However, this approach would then completely ignore the deviation function $F_1$, thus creating a single-objective optimization model. It would take the form of a nonlinear program, due to the nonlinearity of the objective function, with the possibility of generating only locally optimal solutions. The alternative is to use an absolute-value deviation regression, which can be easily linearized. Consider the third-order polynomial regression function:
where the regression coefficients (
In equation (9), either $X_{cit} = 0$ or $E_{cit} = 0$, and the unknowns are $X_{cit}$, $a_{ci}$, $b_{ci}$, $d_{ci}$, and $g_{ci}$. The form of the regression curve is endogenously determined. For instance, if a linear curve provides the best fit for both the estimated data ($X_{cit}$) and the actual data ($E_{cit}$), then the output will include $d_{ci} = 0$ and $g_{ci} = 0$.
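A regression function and trend objective consistent with this description can be sketched as below. The symbol $R_{cit}$ for the fitted value, the ordering of the coefficients, and the summation over all (c, i, t) are assumptions made for illustration.

```latex
\begin{align}
R_{cit} &= a_{ci} + b_{ci}\, t + d_{ci}\, t^{2} + g_{ci}\, t^{3} \tag{8}\\
F_2 &= \sum_{c} \sum_{i} \sum_{t}
       \left| X_{cit} + E_{cit} - R_{cit} \right| \tag{9}
\end{align}
```

Since one of $X_{cit}$ and $E_{cit}$ is always zero, the sum $X_{cit} + E_{cit}$ picks up the actual value where it is disclosed and the estimate where it is not.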
Linearizing the Objective Function
In order to convert the objective functions (7) and (9) into linear forms, two sets of deviation variables are introduced: (1) $PD_{cit}$ and $ND_{cit}$, to measure the deviation between an employment estimate and its location target, and (2) $PR_{cit}$ and $NR_{cit}$, to measure the deviation between an employment estimate or the actual employment and the regression curve. The two objective functions are reformulated as:
The deviation constraints are as follows:
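Using the standard device of splitting each absolute value into a positive and a negative deviation, the linearized objectives and deviation constraints might read as follows. This is a sketch inferred from the variable definitions; the numbering (10)–(14) is assumed from the references made elsewhere in the text.

```latex
\begin{align}
F_1 &= \sum_{(c,i,t) \in M} \left( PD_{cit} + ND_{cit} \right) \tag{10}\\
F_2 &= \sum_{c} \sum_{i} \sum_{t} \left( PR_{cit} + NR_{cit} \right) \tag{11}\\
X_{cit} - \left[ MIN_{cit} + l \left( MAX_{cit} - MIN_{cit} \right) \right]
    &= PD_{cit} - ND_{cit}, && (c,i,t) \in M \tag{12}\\
X_{cit} + E_{cit} - \left( a_{ci} + b_{ci}\, t + d_{ci}\, t^{2} + g_{ci}\, t^{3} \right)
    &= PR_{cit} - NR_{cit} \tag{13}\\
PD_{cit},\; ND_{cit},\; PR_{cit},\; NR_{cit} &\ge 0 \tag{14}
\end{align}
```

Because the deviation variables are nonnegative and enter the minimized objective with positive coefficients, at most one of each pair is nonzero at an optimum, so each sum equals the corresponding absolute deviation.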
Minimizing either $F_1$ or $F_2$ is likely to lead to contrasting solutions, as $F_1$ pulls the solution toward a selected target within the intervals and $F_2$ pulls it toward the regression curves. As it is impossible to determine, a priori, which criterion is more important, the two functions are combined into a single function representing a weighted convex combination of $F_1$ and $F_2$, using the weights (1 − w) and w, with 0 ≤ w ≤ 1:
$$F = (1 - w)\,F_1 + w\,F_2 \qquad (15)$$
w = 1 indicates that only temporal trends are considered, and w = 0 completely ignores the temporal trends. The latter implies independent optimization of each cross section, as in the Zhang-Guldmann (2009) model. The final model takes the form of a linear program, minimizing F (equation 15) subject to constraints (1)–(6) and (12)–(14).
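To see how w shifts an estimate between the location target and the trend, it helps to look at a single suppressed record with the trend value held fixed. This is a deliberate simplification: in the model the regression curve is endogenous and the total constraints couple the records, so the function below (with hypothetical names and inputs) only illustrates the trade-off and is not a substitute for the linear program.

```python
def best_estimate(lo, hi, l, w, trend):
    """Minimize (1 - w)*|x - target| + w*|x - trend| over [lo, hi].

    The objective is piecewise linear and convex in x, so a minimizer lies at
    one of two candidates: the location target or the trend value clipped to
    the interval.
    """
    target = lo + l * (hi - lo)
    clipped = min(max(trend, lo), hi)

    def cost(x):
        return (1 - w) * abs(x - target) + w * abs(x - trend)

    return min((target, clipped), key=cost)


# Narrowed interval [500, 708] from the Pinal example, hypothetical trend value.
print(best_estimate(500, 708, l=0.5, w=0.0, trend=680))
print(best_estimate(500, 708, l=0.5, w=0.8, trend=680))
```

In this one-record setting the estimate sits at the target whenever w < 0.5 and moves to the (clipped) trend value once w > 0.5; the gradual shifts seen in the full model come from the constraints and from the curve being fitted jointly with the estimates.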
Application to Arizona
Overview
The methodology presented in the Method section is applied using CBP county data for Arizona. Because an inconsistency was uncovered in the source data between county total and two-digit level information in the 1998 CBP, all further analyses use only the 1999–2006 data. After a short graphical illustration of the estimation results (Estimation Illustration section), the approach to determining the “best” parameters l and w is outlined in the Choosing the Parameters l and w section. The notations L and W are introduced as aliases for l and w, the location and weight parameters.
The 1999–2006 suppressed employment records are estimated for ranges of values for L and W, creating complete data sets. As the CB does not provide any information on its data suppression algorithm, it is impossible to model the exact data generation process that creates the suppressed records. The alternative is to use a Monte Carlo simulation (Kennedy 2003) to replicate the observed suppression patterns by generating several suppressed data sets by artificially suppressing records in the complete data sets. The optimization model then derives employment estimates for different location and weight combinations (l, w). The quality of these estimates is then assessed using different criteria, and dominant parameter combinations (l, w) are proposed, that would generate “best” estimates.
Estimation Illustration
To illustrate the applicability of the model, the parameter combinations (L = W = 0.5) and (L = 0.5, W = 0) are first used to estimate the originally suppressed two-digit Arizona records over 1999–2006. Figure 2 illustrates the improvement due to introducing employment trends, focusing on three sectors/counties. The upper-block diagrams (A) present results when L = 0.5 and W = 0, thus only considering the objective $F_1$. The lower-block diagrams (B) present results when L = W = 0.5, thus weighting $F_1$ and $F_2$ equally. The continuous black curve is derived by OLS with actual employment data, while the absolute-deviation regression curve derived by the optimization appears only on the lower blocks, as a dashed red line. In case 1, the optimization-derived curve closely matches the OLS curve, and the unique estimated value (t = 5) shifts upward from A to B to be closer to the dashed curve, at the interval boundary. In case 2, with two suppressed records, the match between the two curves is less close, because the optimization-derived curve is adjusted to capture the midpoint of the interval in year t = 1. The OLS curve only accounts for the actual employment data and thus ignores the information provided by these intervals. Finally, in case 3, the OLS curve cannot be estimated because there is only one actual employment value. However, this is not an issue for the optimization-derived curve. As expected, the estimates are either at the midpoint or very close to the midpoint when W = 0 (A). However, when W = 0.5, some estimates are pulled away from the midpoints and toward the optimization-derived regression curve.

Figure 2. Estimation illustration: Arizona data.
Choosing the Parameters l and w
Are there specific (l, w) combinations that will generate better estimates? Figure 3 presents the flowchart of the process used for delineating such combinations. Any (L, W) combination generates a set of estimates for the originally suppressed records, and therefore creates a complete data set, which is then randomly suppressed, generating ten suppressed data sets (D1–D10). For each of these data sets, any combination (l, w) generates a set of estimates, the quality of which is then assessed by comparing them with the “known” values in the complete data set, using various criteria. Several combinations (l, w) are tested, and the best (or dominant) ones are selected. However, these “best” (l, w) combinations may depend on the choice of (L, W). Hence, the above process is repeated for different (L, W) combinations. The best (l, w) combinations are compared across the range of (L, W) combinations, in order to identify a core of combinations that produce better estimation results, no matter how the complete data set is created. If this core exists, one could be confident that its combinations would achieve better estimation results.

Figure 3. Assessing combinations (l, w).
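The re-suppression step of this process can be sketched as follows. The sketch draws records uniformly at random, which is an assumption made for brevity: it does not reproduce the observed suppression pattern (for instance by flag or by county), and the function names and toy data are hypothetical.

```python
import random


def suppress(complete, n_suppressed, seed):
    """Return a copy of `complete` with `n_suppressed` records set to None."""
    rng = random.Random(seed)
    hidden = set(rng.sample(sorted(complete), n_suppressed))
    return {key: (None if key in hidden else value)
            for key, value in complete.items()}


def make_samples(complete, n_suppressed, n_samples=10):
    """Generate suppressed data sets D1..Dn from one complete data set."""
    return [suppress(complete, n_suppressed, seed) for seed in range(n_samples)]


# Toy complete data set keyed by (county, sector, year).
complete = {(c, i, t): 100 + 10 * c + i + t
            for c in range(3) for i in range(4) for t in range(2)}
samples = make_samples(complete, n_suppressed=5)
print(len(samples))
```

Fixing the seed per sample keeps the ten data sets reproducible, so that every (l, w) combination is assessed against exactly the same suppressed records.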
Generating Suppressed Data Sets
Twenty-five combinations of
Table 4. Distribution of Two-digit Suppressed Records in Arizona, 1999–2006.
The next step is to apply the optimization model, with different (l, w) combinations, to the 250 data sets $D_{LWv}$. Preliminary tests were carried out for data set
Evaluation of the Estimated Data Sets
The estimation results are compared to the known values of data set $D_{LW}$ to assess the effects of the different parameter combinations. Let $D_{LWvlw}$ be the data set obtained by applying the optimization model with parameters (l, w) to estimate the suppressed data in data set $D_{LWv}$. This section presents the assessment of these estimates. A goodness-of-fit criterion is first defined, which helps identify preferred combinations (l, w) across the (L, W) combinations. Next, Pareto frontier (PF) and concordance–discordance (CD) criteria are proposed and computed. A summary comparison across these criteria identifies core (l, w) values.
Goodness of Fit
The goodness-of-fit criterion (S) represents the sum of the squared differences between estimates and “known” values. In the case of location and weight parameters (l, w), sample v, and complete data set $D_{LW}$, the criterion is given by:
where
The gray cells in Figure 4 represent the ten best combinations (l, w) for any specific data set $D_{LW}$, using the average S criterion. For instance, for the data set $D_{LW}$ estimated with L = W = 0.5, the combination (l = 0.3, w = 0.7) yields the lowest average S over the ten samples. Overall, the best combinations are generated by using

Figure 4. Best combinations of location and regression parameters (l and w) based on the average goodness of fit.
PF Analysis
All the 225 S values for each sample $D_{LWv}$ are next normalized, using the lowest S value. Table 5 presents descriptive statistics for the original and normalized S (NS) in the case L = W = 0.5. For instance, in the case of sample 1, all the 225 S values are divided by the lowest one (144,489). More generally, NS is formulated as:
Table 5. Descriptive Statistics for Original and Normalized S for L = W = 0.5.
The NS values enable direct comparisons of the performance of parameter combinations across samples. The closer NS is to 1, the better the choice, and the best choice corresponds to a value of 1. The best (l, w) combinations are expected to consistently yield close-to-one NS values for all the ten samples of a data set $D_{LW}$.
For each combination (L, W, l, w), there are ten NS values for the ten samples v. The mean value (MNS) may be viewed as a good indicator of the performance of the combination (L, W, l, w): the closer to one, the better. However, the spread of these values, as measured by the standard deviation (SNS), should also be considered. A truly optimal combination (L, W, l, w) would minimize both the mean and the standard deviation.
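The chain from S to NS to the pair (MNS, SNS) can be sketched as below. The inputs are hypothetical, and the use of the sample (n − 1) standard deviation for SNS is an assumption, since the text does not say which version is used.

```python
from statistics import mean, stdev


def goodness_of_fit(estimates, known):
    """S: sum of squared differences between estimates and known values."""
    return sum((estimates[k] - known[k]) ** 2 for k in known)


def normalize(s_by_combo):
    """NS: each S divided by the lowest S obtained for the same sample."""
    best = min(s_by_combo.values())
    return {combo: s / best for combo, s in s_by_combo.items()}


def summarize(ns_values):
    """(MNS, SNS) of one (l, w) combination across the samples."""
    return mean(ns_values), stdev(ns_values)


# Hypothetical S values for two (l, w) combinations within one sample.
ns = normalize({(0.3, 0.7): 200.0, (0.5, 0.0): 500.0})
print(ns)
```

Dividing by the sample minimum puts all samples on a common scale, so that NS values from samples with very different total squared errors can be averaged.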
A PF analysis is conducted to assess the trade-off between MNS and SNS across the ten samples for (L, W, l, w) combinations. Figure 5 presents the scatter plot of MNS and SNS for the 225 data sets created with L = 0.1 and W = 0.9. A PF emerges, pointing to three best combinations of parameters, [l = 0.25, w = 0.8], [l = 0.35, w = 0.8], and [l = 0.45, w = 0.8]. They do not dominate each other, that is, when MNS decreases, SNS increases, and vice versa. Further, as shown in Figure 5, there is a clear cluster of points when w ≠ 0. The estimates produced with w = 0, that is, not considering employment time trends, are clearly inferior to the other estimates, confirming the results in the Goodness of Fit section. Results produced with w = 0.5 and w = 1 are not optimal but acceptable. This pattern holds for the other twenty-four combinations of L and W.

Figure 5. Pareto frontier analysis for L = 0.1, W = 0.9.
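Extracting a Pareto frontier from a set of (MNS, SNS) pairs amounts to keeping the points that no other point beats on both criteria. The sketch below uses hypothetical values chosen only to mimic the pattern described for Figure 5; they are not the study's results.

```python
def pareto_frontier(points):
    """Return the combinations not dominated on (MNS, SNS), both minimized.

    `points` maps a combination label to its (mns, sns) pair.
    """
    def dominates(p, q):
        # p is at least as good as q on both criteria and differs from it.
        return p[0] <= q[0] and p[1] <= q[1] and p != q

    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )


# Hypothetical (MNS, SNS) pairs keyed by (l, w).
points = {
    (0.25, 0.8): (1.05, 0.04),
    (0.35, 0.8): (1.04, 0.05),
    (0.45, 0.8): (1.03, 0.06),
    (0.50, 0.0): (1.60, 0.20),
}
print(pareto_frontier(points))
```

The three w = 0.8 points survive because each improves one criterion at the expense of the other, while the w = 0 point is dominated on both.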
Table 6 lists the points (l, w) on the PFs of all the twenty-five data sets $D_{LW}$, together with the MNS and SNS values. The PF consists of three points for five (L, W) combinations, two points for fifteen combinations, and one point (where both MNS and SNS are minimized simultaneously) for five combinations. The results point to the following ranges:
Table 6. Summary of Pareto Frontier Analysis.
CD Analysis
Figure 4 points to the ranges [0.2–0.55] for l, and [0.6–0.8] for w, or a total of forty combinations for every data set (L, W). For each (l, w) combination, there are 250 S values. These forty combinations are first sorted by average S value from the smallest to the largest, denoted $Y_1$, $Y_2$, … , $Y_{40}$. Given that these Y’s are not normally distributed, the Wilcoxon Signed-Rank test is used for the following hypothesis:
where g = 1→40, j = 2→40, and g < j. The null hypothesis ($H_0$) states that $Y_g$ is larger than $Y_j$ (i.e., $Y_g$ is inferior to $Y_j$). The alternative hypothesis ($H_A$) states the opposite, that is, $Y_g$ dominates $Y_j$. A CD analysis is conducted based on the Wilcoxon Signed-Rank test results.
When making decisions about the best parameter combinations, CD analysis (Tsoukias, Perny, and Vincke 2002) considers both positive and negative reasons. The positive reason is that a combination dominates a number of other combinations, which supports the superiority of this combination. Nevertheless, the same combination may also be dominated by other combinations. A concordance measure is defined as $C_{gj} = 1$ if $Y_g$ dominates $Y_j$; otherwise, $C_{gj} = 0$, where g = 1→40, j = 2→40, and g < j. Similarly, a discordance measure is defined as $D_{gj} = −1$ if $Y_g$ is dominated by $Y_j$; otherwise, $D_{gj} = 0$. Finally, the CD index for combination g is defined as
CD indices are calculated based on the Wilcoxon Signed-Rank test results, at the 1 percent, 5 percent, and 10 percent significance levels, and for each (L, W) combination. Table 7 summarizes the results. At all significance levels, the parameter combinations with
Table 7. Summary of Concordance/Discordance Analyses.
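Given the outcome of the pairwise tests, a CD index can be tallied as in the sketch below. It assumes that the index of a combination is the sum of its concordance and discordance measures over all other combinations, and it takes the set of significant dominance pairs as given rather than running the Wilcoxon Signed-Rank test; the pairs shown are hypothetical.

```python
def cd_indices(n, dominates):
    """CD index per combination: concordance count minus discordance count.

    `dominates` is a set of ordered pairs (g, j) meaning that combination g
    dominates combination j according to the signed-rank test.
    """
    index = [0] * n
    for g, j in dominates:
        index[g] += 1  # concordance: g dominates j
        index[j] -= 1  # discordance: j is dominated by g
    return index


# Hypothetical test outcomes for three combinations (0-based labels).
print(cd_indices(3, {(0, 1), (0, 2), (1, 2)}))
```

A combination that dominates many others and is rarely dominated ends up with a large positive index, which is the property used to rank the forty candidates.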
Evaluation Summary
Table 8 recapitulates the “optimal” parameter combinations delineated in the previous sections. They are consistent with each other and provide a secure basis for parameter selection. Selecting l within [0.25–0.50] and w within [0.7–0.8] provides better estimates than other parameter combinations. However, this recommendation is made for Arizona only, based on Arizona data. The extension to other states is discussed in the Discussion section.
Table 8. Summary of Parameter Analyses.
Discussion
In this section, the focus is on several issues raised by the proposed methodology, namely: (1) what is the computational burden of implementing the optimization model? (2) can the results obtained for Arizona be extended to other states? (3) what is the applicability of the technique to the whole CBP database? (4) how could this method be applied to other regional economic databases?
The optimization model was solved using the General Algebraic Modeling System (GAMS) software installed on a personal computer with 4 GB of random access memory and an Intel Core 2 CPU 6600@2.40 GHz processor. The basic model, for any given (l, w) combination, has been implemented for Arizona over eight years, sixteen counties, and twenty-one 2-digit level sectors. It includes 10,910 single equations organized into fifteen blocks of equations and 13,130 single variables organized into fifteen blocks of variables. The basic run takes, on average, 5.3 seconds. The work space allocated to solve the model is around 35 MB. The above suggests that a basic state model is solvable within a very short time. Of course, the computation time was much larger in this study, because the basic model was solved 56,250 (225 × 250) times. Preparing the data inputs to the GAMS model is a lengthier, though by no means difficult, process. It is outlined in Appendix C.
When extending the research to other states, a similar suppression analysis should be conducted to find the best parameters. As a preliminary analysis for other states, the location parameter l could be obtained by analyzing unsuppressed information. For instance, in Arizona, the optimal range for the location parameter (0.25–0.5) is consistent with unsuppressed data (38.4 percent of Arizona 1998–2006 records across all hierarchies). If these records were suppressed and replaced by the appropriate flag interval, the resulting average location parameter would be 0.31, which is consistent with the positively skewed distributions of firm sizes, as observed in the United States (Axtell 2001) and Portugal (Cabral and Mata 2003). With regard to parameter w, this research confirms that considering employment trends should significantly improve the accuracy of the estimates. But it is unclear whether [0.7, 0.8] is the optimal range for other states. Preliminary testing is recommended.
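The preliminary analysis suggested here can be sketched as follows: each disclosed record is assigned the flag interval it would fall in, and its relative position within that interval is averaged. The intervals for flags A and F are those quoted in this article; those for B, C, and E are taken from the CBP flag table as an assumption, and the employment values are hypothetical.

```python
# Flag intervals (A and F from the text; B, C, and E assumed from the CBP
# flag table). Larger flags are omitted to keep the sketch short.
FLAG_INTERVALS = {"A": (0, 19), "B": (20, 99), "C": (100, 249),
                  "E": (250, 499), "F": (500, 999)}


def flag_for(employment):
    """Flag whose interval contains the given employment value."""
    for flag, (lo, hi) in FLAG_INTERVALS.items():
        if lo <= employment <= hi:
            return flag
    raise ValueError("employment outside the intervals listed in this sketch")


def implied_location(employment):
    """Relative position of a disclosed value within its would-be flag interval."""
    lo, hi = FLAG_INTERVALS[flag_for(employment)]
    return (employment - lo) / (hi - lo)


def average_location(records):
    """Average implied location parameter over disclosed records."""
    return sum(implied_location(e) for e in records) / len(records)


# Hypothetical disclosed employment values.
print(round(average_location([5, 30, 120, 600]), 3))
```

An average well below 0.5 indicates that values cluster in the lower part of their intervals, which is what a positively skewed size distribution would produce.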
The application of the methodology to the whole CBP database (all states, all counties, all economic hierarchical levels, and multiple years) would, of course, be the holy grail of this research. There is no conceptual obstacle to this expansion, as the optimization model would be expanded to include blocks of equations and variables for each of the fifty US states. The new features of this expanded model would be constraints on (1) the summation of state sector employments, which should match the employment in the US CBP file and (2) the summation of employment at any economic hierarchical level, which should match the corresponding employment at the higher hierarchical level.
The methodology can be applied to the employment data of other US regional economic databases, such as the EC and the Bureau of Economic Analysis (BEA) data, with necessary revisions of the constraints. For instance, the EC uses the NAICS to classify economic activities and follows the same flag strategy as used in CBP. However, county-level employment data are not provided for several sectors, and therefore, the county total constraint (equation 3) would have to be modified. In some of these sectors, only state totals are available. Metropolitan and micropolitan area data are available for other sectors (e.g., utilities). In some sectors (e.g., manufacturing), data are available for places and counties. Therefore, the methodology proposed here would have to be tailored to each major sector of the EC, depending upon the geographical detail. In the BEA database, flag information is not available for a suppressed record, and the flag constraint (equation 2) would have to be removed from the model. BEA employment data are available for counties, metropolitan and micropolitan areas, BEA economic areas, states, and the whole United States. However, the level of industry disaggregation varies across these geographies, and county-level employment data are available only at the one-digit NAICS level. Constraints could be introduced to enforce consistency between county and metropolitan/micropolitan employment data, in addition to consistency with county and state totals.
Conclusion
A regression-constrained optimization methodology for estimating suppressed records in a longitudinal economic database has been developed and implemented with 1999–2006 CBP data for Arizona. The actual suppressed records were first estimated to create a set of twenty-five complete data sets, using twenty-five combinations of the parameters used in the optimization model. Ten random data set samples were then created for each of these twenty-five complete data sets, mimicking the suppression pattern in the original data. The optimization model was then applied to these suppressed data sets while varying two parameters: the location l of the estimation target within its interval, and the weight w used to trade off the two objective functions. A goodness-of-fit measure was defined and calculated to measure the quality of the estimates. The results show that accounting for employment temporal trends strongly improves the accuracy of the estimates. Two decision-theoretic approaches (PF and CD) were then implemented to uncover possibly dominant parameter combinations. These analyses consistently point to the ranges of [0.25–0.50] for l and [0.7–0.8] for w.
Several issues related to the applicability and possible extensions of the proposed methodology were discussed. Further research could focus on (1) the adaptation and application of the methodology to other databases (e.g., EC and BEA), accounting for their specific features and (2) assessing the benefits of using data in other databases to help reduce the uncertainty in the estimation of suppressed data in any given database. In addition, spatial econometric models accounting for the possible spatial autocorrelation of employment between any county and its surrounding counties could help obtain more precise estimates. This extension is supported by Desmet and Fafchamps (2005), who analyze the spatial concentration of employment across US counties and provide evidence of spatial spillover effects.
Footnotes
Appendix A
Appendix B
Appendix C
Authors’ Note
An earlier version of this article was presented at the Mid-Continent Regional Science Association 43rd Annual Conference, Bloomington, Minnesota, 2012. The comments and suggestions of five referees on earlier drafts of this article are very much appreciated.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
