Abstract
Capture-recapture (CRC) is currently considered a promising method to integrate big data in official statistics. We previously applied CRC to estimate road freight transport with survey data (as the first capture) and road sensor data (as the second capture), using license plate and time-stamp to identify re-captured vehicles. A considerable difference was found between the single-source, design-based survey estimate, and the multiple-source, model-based CRC estimate. One possible explanation is underreporting in the survey, which is conceivable given the response burden of diary questionnaires. In this paper, we explore alternative explanations by quantifying their effect on the estimated amount of underreporting. In particular, we study the effects of 1) reporting errors, including a mismatch between the reported day of loading and the measured day of driving, 2) measurement errors, including false positives and OCR failure, 3) considering vehicles reported not owned as nonresponse error instead of frame error, and 4) response mode. We conclude that alternative hypotheses are unlikely to fully explain the difference between the survey estimate and the CRC estimate. Underreporting, therefore, remains a likely explanation, illustrating the power of combining survey and sensor data.
Introduction
Sensor data is becoming increasingly important in official statistics because such data has some advantages over sample survey data: no declining response rates, no response burden, and continuous measurements in real-time [1, 2, 3, 4]. Sensor data is currently rarely used to produce direct estimates because the data generating process is often unknown. However, sensor data can be used to assess underreporting bias in survey point estimates by linking it to survey data and applying capture-recapture (CRC) techniques. [5] illustrated this method using road sensor data and diary survey data to estimate road freight transport. Their CRC estimates based on the combined sources were higher than the design-based, survey-only estimates. One possible explanation for the difference is underreporting in the survey, which is conceivable given the high response burden, especially in diary questionnaires.
In this paper, we study potential alternative explanations from which practical implications and guidelines for survey agencies and transportation management can be drawn. First, owners of vehicles in the sample are requested to fill out the day of loading, whereas the sensors measure the day of driving. Second, sensors may measure vehicles that do not have to be reported (false positives), and Optical Character Recognition (OCR) failure may prevent linkage of sensor records to survey reports. Third, vehicles reported not owned were considered a frame error but may alternatively be considered nonresponse. Finally, if underreporting were the main explanation for the difference, it should be apparent only in the manual response modes and not in the automated response mode.
We see great potential for the use of CRC in assessing survey underreporting, as several European countries conduct road freight transport surveys [6, 7] and have installed a Weigh-in-Motion road sensor network [8, 9]. Our study deepens the understanding of the CRC methodology and strengthens trust in combining new data sources with established methodology in official statistics. Such empirical studies are of high practical importance for the eventual uptake of these methods in statistical production processes, supporting a transition from experimental to official statistics [10, 11].
Research background
Time-based diary surveys impose a heavy response burden. To reduce the response burden, respondents may respond inaccurately or not at all. Accordingly, such surveys often yield low response rates and downward-biased estimates [12, 13]. Diary surveys on transport and mobility suffer from these drawbacks, since the target variables rely on accurate and complete responses [14]. New data sources for validation and estimation could improve official statistics based on diary surveys.
For studies on underreporting in transport and mobility surveys, units have been equipped with GPS receivers or mobile phones [15]. For the first GPS household travel survey (1997, USA), [16] reported underreporting in trip rates of up to 31%. In the California Statewide Household Travel Survey, [17] reported rates of missed trips of up to 42%. In a comparative study, [18] documented levels of trip underreporting between 11% and 81% in GPS-based travel surveys in the USA. [19] reported only 7% underreporting for the Sydney Household Travel Survey (2004). The results of prior studies in the field of transport, travel, and mobility show evidence for underreporting bias, although the amount varies both within and between the reported findings. Issues commonly occurring in these studies include technical problems with GPS devices and varying data quality between instrument types [20]. Further, [21, 22] reported problems due to switch-offs, delays, battery issues, or the device not being carried, as well as difficulties matching recorded and reported data.
Here, instead of using GPS receivers, permanently installed road sensors are used to assess and adjust for underreporting bias in survey point estimates. Therefore, respondent-related issues such as those described above can be neglected. This methodology was first suggested by [5]. In this article, we report on further research to better understand differences in freight transport estimates with and without road sensor data.
Data
Survey data
The Dutch Road Freight and Transport Survey is a sample survey, conducted by Statistics Netherlands (CBS). A central aim of the survey is to estimate the total transported shipment weight (
The target population is the Dutch commercial vehicle fleet, excluding military, agricultural, and vehicles older than 25 years. Only vehicles with a weight of at least 3.5 t (empty vehicle weight
Each quarter a stratified random sample was drawn totaling 33,817 unique vehicle-week combinations in 2015. Vehicle owners are legally required to report the days on which the vehicle was loaded and the corresponding shipment weight for one week. No report is required if the vehicle was driving all day without loading or unloading or if it was not driving for transport purposes. The effect of measuring these false positives by road sensors is one of the research questions addressed in this paper.
Of the 33,817 vehicle weeks, 22,454 (66%) were reported as used on at least one day during the assigned week, 5304 (16%) were reported as not used during the assigned week, 2462 (7%) were reported as not owned, and 3597 (11%) were nonresponse.
The option to report that the vehicle was not used reduces the response burden considerably, since only a small part of the questionnaire has to be answered. This is expected to be the major cause of underreporting. Another way of responding with minimal burden is to report only a single day. The CRC approach allows assessing the severity of these two reporting errors.
Vehicles reported not owned are treated as nonresponse in the regular published official statistics. For the CRC study, such responses are defined as frame errors and excluded from analysis since the validation of such responses is not possible (only quarterly updates of the vehicle register, complexity in holding companies, vehicle rental/leasing). This decision was made to avoid false-positive links. An assumption made here is that all sample units classified as nonresponse own the vehicle, i.e., that if the vehicle is not owned this is always reported. The effect of this decision is one of the research questions addressed in this paper.
Owners of sampled license plates can respond by completing an internet questionnaire, by completing a paper questionnaire, or by providing a data export (XML) from a software-based journey planning system commonly used by large haulers. The paper questionnaire is available only upon request and is used by only a minor fraction of the respondents. Paper questionnaires generally yield lower data quality than technology-based response modes [24]. We will refer to the internet and paper questionnaires as the manual response modes, and to the planning system as the automated response mode. In this paper, we test the hypothesis that underreporting occurs only in the manual response modes.
Sensor data
The Weigh-in-Motion road sensor network’s purpose is to detect and enforce penalties on overloaded trucks, tractors, and other heavy transport vehicles [25]. In 2015, nine sensor systems were operating in both directions on Dutch highways, resulting in 18 measurement points (Fig. 1).

Fig. 1. Road sensor network on Dutch highways consisting of nine different systems with 18 measurement stations. Crosses and circles indicate the two stations of each of the nine WiM systems.
When a vehicle passes a station, it is weighed and classified, and a photograph of its front license plate is taken, which allows linkage to additional register information. A vehicle's individual axle weights are measured, summing to the total weight. Upward-biased axle measurements (over 16.1 t) were corrected using conditional mean imputation, i.e., replaced by the average across the axle measurements below 16.1 t per truck, based on the guidelines of [26] and expert information from the road administration. Transported shipment weight is calculated by subtracting the registered empty weight from the measured total weight. Negative values were trimmed to 0.
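The weight derivation described above can be illustrated with a minimal Python sketch. It is not the production code of the road administration or of the authors. The function names are hypothetical, an axle of exactly 16.1 t is treated as valid (the text only distinguishes "over" and "below"), and leaving a truck unchanged when all its axles exceed the threshold is an assumption.

```python
from statistics import mean

AXLE_LIMIT_T = 16.1  # threshold named in the text, in tonnes


def correct_axles(axle_weights_t):
    """Replace axle weights above the limit by the mean of the truck's other axles."""
    valid = [w for w in axle_weights_t if w <= AXLE_LIMIT_T]
    if not valid:
        # Assumption: nothing to impute from, so the measurements are left as they are.
        return list(axle_weights_t)
    fill = mean(valid)
    return [w if w <= AXLE_LIMIT_T else fill for w in axle_weights_t]


def shipment_weight(axle_weights_t, empty_truck_t, empty_trailer_t=0.0):
    """Measured total weight minus registered empty weight, trimmed at 0."""
    total = sum(correct_axles(axle_weights_t))
    return max(0.0, total - empty_truck_t - empty_trailer_t)
```

For example, a truck with axle measurements of 6, 8 and 20 t has its third axle replaced by 7 t before the registered empty weight is subtracted.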
The empty trailer weight could not always be linked, either because the rear license plate was not recognized by the OCR software (11,340) or because the trailer was not registered in the vehicle register (5980). If the front and rear license plates are identical, no trailer was pulled, and the trailer weight was set to 0. For the remaining missing values, conditional mean imputation was applied.
After matching survey and sensor data using license plate and day as the unique key, contingency tables can be constructed for the number of vehicle days and the transported shipment weight, cross-classified by whether a vehicle day was reported in the survey and measured by the sensors.
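The matching step can be sketched in Python as follows, assuming each source has already been reduced to a collection of (license plate, day) keys for the sampled vehicle weeks. The function name and the cell labels are hypothetical. Only the three observable cells are counted, since vehicle days that were neither reported nor measured are by construction not observed.

```python
def contingency_cells(survey_days, sensor_days):
    """Count vehicle days by source, using (license plate, day) as the unique key."""
    survey, sensor = set(survey_days), set(sensor_days)
    return {
        "both": len(survey & sensor),         # reported and measured
        "survey_only": len(survey - sensor),  # reported, not measured
        "sensor_only": len(sensor - survey),  # measured, not reported
    }
```

The same cross-classification could be applied to shipment weight by summing weights per cell instead of counting keys.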
CRC was initially developed to estimate the unknown size of animal populations but has also been applied to human populations [27]. It is based on the following assumptions: independent datasets, closed population, homogeneous capture probabilities, all elements belong to the target population, and perfect linkage of datasets.
The first assumption of independent datasets and the second assumption of a closed population are fulfilled. The probability that a sampled vehicle is reported in the survey is independent of the probability that it is measured by a sensor. By equating the study population to the survey sample, the population is closed by definition.
Third, the response or capture probabilities for the elements should be homogeneous for at least one data source [28, 29]. This assumption is met by modeling capture probabilities as a function of auxiliary information (see Subsection 4.3).
Fourth, elements in the data sources should belong to the population of interest. This assumption is met because sampled vehicles belong to the population by definition and vehicles detected by the sensors are filtered by linking their license plate to the register. Vehicles might not belong to the target population on specific days and in specific situations, but we consider these violations of the fifth assumption of perfect linkage.
The fifth assumption addresses the perfect linkage of the elements. It is the strength of the WiM system that vehicles can be linked by license plate. However, this assumption may occasionally be violated in both data sources for different reasons. Vehicle owners might report too few, too many or the wrong dates and they have to report the day of loading rather than the day of driving. Sensors also measure vehicles that are driving but not transporting, and OCR failure prevents linkage. These potential violations of the fifth assumption are addressed in this paper.
Register data
To model capture probabilities, variables from the vehicle and business registers are used. The vehicle register provides both technical and non-technical vehicle features. The business register provides 5 features about the vehicle owner.
Register data are linked at micro-level using license plate and quarter as unique key. For some sample units, no register information could be linked. For variables with rather small proportions of missing values, the missing values were replaced with the most common category (mode imputation). Otherwise, the category ‘Unknown’ was assigned. Continuous variables were then categorized based on their unweighted response quantiles.
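The preparation of the register covariates can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' procedure: the text does not quantify "rather small proportions", so the `max_missing_share` threshold is hypothetical, as are the function names and the choice of four quantile classes.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def impute_categorical(values, max_missing_share=0.05):
    """Mode imputation if few values are missing (None), else the category 'Unknown'."""
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)
    if observed and n_missing <= max_missing_share * len(values):
        fill = Counter(observed).most_common(1)[0][0]  # most common category
    else:
        fill = "Unknown"
    return [fill if v is None else v for v in values]


def categorize(values, n_classes=4):
    """Assign each observed value to a class bounded by unweighted quantiles."""
    observed = [v for v in values if v is not None]
    cuts = quantiles(observed, n=n_classes)  # n_classes - 1 cut points
    return [None if v is None else bisect_right(cuts, v) for v in values]
```

Missing continuous values are left as `None` here; how they were handled in the study is not stated in this section.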
Methods
Definitions and notations
Underreporting is estimated by comparing survey estimates corrected for selective nonresponse with CRC estimates corrected for both selective nonresponse and measurement error.
We define the indicator
Underreporting in the survey is estimated as the relative difference
The survey estimators for the total number of truck days
with
where
so that
Log-linear models for population size estimation in closed populations were introduced by [30]. They can be written as a generalized linear model, where count
where
where
The model is extended with auxiliary information from the register, to model heterogeneity in the capture probabilities in both sources:
where
The CRC estimator for target variable
where
A stepwise selection procedure based on the Bayesian Information Criterion (BIC) was used to choose an optimal model for the CRC estimator [31]. Only main effects of the variables were considered to avoid sparse contingency tables for the log-linear model. The model selection for the log-linear model is based on a logit-model selection, to cover the full information of the covariates. Using the approach proposed by [32, 33], two independent logit-models were applied: using
Precision estimation
To estimate the precision of
Reporting errors
Survey respondents must fill out the day of loading instead of the day of driving. When comparing survey and sensor data, this may be considered a reporting error by design. Two systematic reporting errors were simulated: underreporting and overreporting error. The simulation study was restricted to target variable
If vehicle
It follows,
The response pattern can be written as a string of 7 indicators
Underreporting error: replace each trailing 0 with a 1. For instance, 0010000 becomes 0011111. This new pattern attempts to correct for the questionnaire asking about the day of loading instead of the day of driving, and for respondents pooling multiple days of loading to the first day. Overreporting error: replace each trailing 1 with a 0. For instance, 0111010 becomes 0100010. This new pattern attempts to correct for overreporting, which is not our main concern but a control that should show the opposite effect of underreporting.
The response pattern was replaced for 1% to 100% of the respondents, repeating each sample 100 times. Changing 100% of the data implies all reported response patterns being erroneous, which is very unlikely. The results are considered an approximation of an upper error limit.
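The two pattern corrections and the resampling scheme can be sketched in Python as follows. The function names are hypothetical. The overreporting rule is inferred from the example 0111010 → 0100010, i.e., each run of consecutive reported days is collapsed to its first day. The simulation function returns only an unweighted count of reported days, whereas the study re-estimates the weighted survey and CRC estimators at each step.

```python
import random


def fill_following_days(pattern):
    """Underreporting correction: every day after the first reported day becomes 1."""
    first = pattern.find("1")
    if first == -1:
        return pattern  # nothing reported, nothing to fill
    return pattern[:first] + "1" * (len(pattern) - first)


def collapse_runs(pattern):
    """Overreporting correction: keep only the first day of each run of 1s."""
    out = []
    for i, day in enumerate(pattern):
        follows_reported_day = i > 0 and pattern[i - 1] == "1"
        out.append("0" if day == "1" and follows_reported_day else day)
    return "".join(out)


def simulate_reported_days(patterns, share, correction, rng):
    """Apply `correction` to a random share of respondents; return total reported days."""
    n_changed = round(share * len(patterns))
    changed = set(rng.sample(range(len(patterns)), n_changed))
    return sum(
        (correction(p) if i in changed else p).count("1")
        for i, p in enumerate(patterns)
    )
```

Calling `simulate_reported_days` 100 times per share with a seeded `random.Random` instance would mirror the repetition scheme described above.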
Measurement errors
Errors in the sensor observations were simulated to evaluate the effect of linking false positives (FP) and OCR failure. In contrast to the ‘reporting error’ simulation, both
As already reported in Subsection 3.2, about 30% of the sensor observations had no recognized license plate. The effect of OCR failure was simulated by moving units from measured to not measured by the sensors, independent from what was reported in the survey. As a result, both
The fractions of FP and OCR failure were varied from 1% to 100%. At every simulated step, observations were randomly dropped. To estimate the sampling error, each step was repeated 100 times. Changing 100% of the sensor observations in both setups implies the data source being completely unreliable, which is very unlikely.
Survey estimates, CRC estimates and the estimated underreporting (%) of the number of vehicle days and transported shipment weight (kt)
Vehicles reported not owned were treated as frame error and excluded because we do not know whether or not they can be measured by the sensors. If they have been scrapped or exported, the sensors cannot measure them. The decision to exclude these vehicles implies that all sample units classified as nonresponse are assumed to own the vehicle, i.e., that if the vehicle is not owned, then it is assumed that this is always reported. On the other hand, vehicles reported not owned are treated as nonresponse error in the published official statistics. To compare these two approaches, we re-estimated the amount of underreporting when treating vehicles reported not owned as nonresponse error. This means that the weights of the respondents were rescaled to
so that
To test the hypothesis that the difference between survey estimates and CRC estimates is due to manual underreporting in the survey, we stratified both estimators by response mode. Two strata were formed: manual (internet and paper questionnaires) and automated response mode (journey planning system). The weights
Results
Total
According to the CRC estimator, the weighted survey estimator underestimates both
If the mean or median were used in case of multiple recordings/reports per day (see Subsection 4.1), the underestimation by
The fit of the selected log-linear model is moderate. A likelihood ratio test comparing deviances showed that the selected model had a better fit than the null model. Using Pearson correlation coefficient, the linear relationship between the actual and predicted frequencies for
Reporting errors
If the reported day of loading would result in days of driving on all remaining days of the reporting week, both the survey estimate and the CRC estimate for

Fig. 2. Effect of simulated percentage of vehicles with reporting errors on the survey and CRC estimate of the number of vehicle days
In the simulation correcting for overreporting errors, the survey estimate would decrease linearly with the proportion of vehicles for which this error would be corrected (overreporting error in Fig. 2). The CRC estimate, however, is fairly robust and only steeply increases at high proportions. As a result, the absolute relative difference only increases. Obviously, overreporting cannot explain the relative difference.

Fig. 3. Effect of simulated percentage of FP (left panels) and OCR failure (right panels) on the estimated relative difference between the survey and CRC estimate of the number of vehicle days
The relative difference between the survey estimate and the CRC estimate gradually vanishes as more units are considered false positives (FP in Fig. 3). Recall that in this simulation, units were removed that were not reported in the survey but were measured by the sensors (cell
Thus, false positives can explain the relative difference between the survey and CRC estimates, but the estimated underreporting disappears only if the sensors mostly measure vehicles that do not have to be reported in the survey, or if the weighting model is not effective.
The relative difference between the survey estimate and the CRC estimate remains the same as more OCR failures are simulated (OCR in Fig. 3). Only the precision is compromised. Recall that in this simulation, units were moved from measured to not measured by the sensors, irrespective of whether or not they were reported in the survey (Subsection 4.7). Thus, OCR failure cannot explain the relative difference. This can also be shown analytically for a simple CRC estimator, such as the Lincoln-Petersen estimator [34].
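The analytic argument can be illustrated with a small Python sketch based on the Lincoln-Petersen estimator. It uses expected cell counts rather than random draws and is not the log-linear model with covariates used in this paper. The function names and the counts in the usage note are hypothetical, the survey total is an unweighted count, and the sign convention of the relative difference (negative when the survey estimate is below the CRC estimate) is an assumption based on the negative differences reported in the results.

```python
def lincoln_petersen(both, survey_only, sensor_only):
    """Two-source population size estimate: n_survey * n_sensor / n_both."""
    if both <= 0:
        raise ValueError("at least one recaptured unit is needed")
    return (both + survey_only) * (both + sensor_only) / both


def relative_difference(survey_total, crc_total):
    """Relative difference in percent (assumed sign: negative if survey < CRC)."""
    return 100.0 * (survey_total - crc_total) / crc_total


def with_ocr_failure(both, survey_only, sensor_only, failure_rate):
    """Expected cells if sensor records are lost independently of the survey."""
    keep = 1.0 - failure_rate
    return both * keep, survey_only + both * failure_rate, sensor_only * keep


def without_false_positives(both, survey_only, sensor_only, fp_share):
    """Expected cells if a share of the measured-but-unreported units is removed."""
    return both, survey_only, sensor_only * (1.0 - fp_share)
```

With hypothetical cells (200, 600, 100), the estimate is 800 × 300 / 200 = 1200. OCR failure scales the sensor total and the overlap by the same factor, so the factor cancels and the estimate stays the same. Removing all measured-but-unreported units instead shrinks the estimate to the survey count of 800, so the relative difference vanishes.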
Reported not owned
If the 2462 units reported not owned in the survey are considered a nonresponse error instead of a frame error, the nonresponse correction by the survey estimator increases from 8% (Subsection 5.1) to 16% for

Fig. 4. Effect of treating vehicles reported not owned as frame error or nonresponse error on the estimated relative difference between the survey and CRC estimate of the number of vehicle days
In contrast to reporting errors and measurement errors, no simulation is required because the number of vehicles reported not owned that were measured by the sensors is known (2219 vehicle days or 30.2 kt). If we nevertheless simulate that none of them is detected by the sensors, the absolute relative difference would decrease to about 12% (not shown). However, if the sensors detect none of them, it makes more sense to treat them as frame error. If we simulate that all of them are detected, the relative difference would not change.

Fig. 5. Effect of response mode on the estimated relative difference between the survey and CRC estimate of the number of vehicle days
The relative difference between the survey and CRC estimate is considerably larger in the manual response modes than in the automated response mode (Fig. 5). This finding supports our hypothesis that the difference is caused by underreporting in the survey. The relative difference is, however, still negative in the automated response mode. Thus, according to this analysis, about 13%-point (
Discussion and conclusion
With the rare opportunity to link survey, sensor and register data using a deterministic key, we presented a comprehensive study to better understand the use of sensor data and capture-recapture in official statistics. We studied several sources of error that potentially bias log-linear CRC estimates of road freight transport. In line with the literature, we showed that different levels of underreporting can be found within a single study on underreporting, depending on the study design. However, even in scenarios shown to be implausible, the estimated amount of underreporting did not decrease below 10%. We explored four alternatives to underreporting as an explanation for the difference between the survey and CRC estimates of road freight transport.
First, vehicle owners are asked to report the day of loading, whereas the sensors measure the day of driving. To study the effect of this mismatch, we simulated that the vehicle was driving on all days following the reported day of loading. Although this decreased the estimated difference between the survey and CRC estimate, the difference would not drop below about 10%. The opposite, overreporting error corrected for by collapsing multiple days of loading to the first reported day, only increased the difference.
Second, putative underreporting by survey respondents could be due to over-detection by sensors. Detected vehicles may not have to be reported, for instance, because they are empty or drive for maintenance. Unlikely high proportions of false positives can explain the difference completely, but the difference remains substantial at more reasonable, albeit unknown rates. The difference can be shown to be robust against linkage errors and sensor failure, although precision is compromised.
Third, vehicles reported not owned were considered a frame error, assuming they have been scrapped or exported. Treating them as nonresponse error only explained about 2% to 3%-point of the relative difference between the survey and CRC estimates. Moreover, the fewer of these vehicles are measured by the sensors, the more of the relative difference can be explained by treating them as nonresponse error. However, if it is more likely that they have been scrapped or exported, they should be treated as frame error.
Fourth, if the difference were caused by underreporting in the survey, it would be apparent only in the manual response modes (web and paper) and not in the automated response mode. Estimating the difference by response mode indeed showed that the difference is much lower in the automated than in the manual modes. The remaining difference in automated mode, however, suggests that 11%–13% can be attributed to underreporting. The remaining difference in automated mode could, for instance, be a nonresponse error not corrected for by post-stratification. The CRC model selection supports this hypothesis by choosing variables currently not included in the survey post-strata. On the other hand, it could still be underreporting, as the data is entered manually into this system by humans. Finally, this finding might be confounded by variables such as company size. As small companies do not use the automated mode, it is not possible to control for such a confounding variable.
This study has two main limitations. First, we were not able to estimate the number of false-positive links. This error source is considered highly important since the CRC estimates are highly sensitive to such errors. Second, the alternative explanations were studied individually rather than simultaneously. Regarding generalizability, future research could apply the CRC estimates and simulations to other systems, or to data generated in different years by the system described here.
This study also informs discussions of social and political implications, especially for freight transport management. Such research can serve as the basis for further work on recommendations for transport (infrastructure) management, economic and financial implications, emissions, and climate change.
We conclude that sensor data, in combination with the CRC estimator, provides a valid tool to assess underreporting in survey questionnaires. We consider the contribution of this study a useful reference for official statistics, survey agencies, transportation management, and statisticians if survey, sensor, and register data can be linked, and CRC can be applied.
