Abstract
Over the years, official statistics have shown an increasing territorial focus on providing detailed, high-quality information. The Population and Housing Census has always ensured the availability of sub-municipal data useful for social, economic, and environmental decision-making processes. The new Italian Permanent Census focuses heavily on the integration of administrative and sample data and plans to provide more stable and consistent statistical data at the various territorial levels every year. Within this framework, sub-municipal data are derived from the integration of the Base Register of Individuals and the Base Register of Places. Data accuracy depends on the quality of the registers and the procedures adopted to integrate and process the input data. In this regard, Istat is working to improve geocoding information and linking procedures. One of the problems encountered is the presence of non-geocoded units due to problems in the administrative data. Istat has studied a procedure that integrates deterministic and probabilistic approaches to assign the enumeration area code to these critical units. An experimental study was conducted to assess the quality of the imputation procedure. In this paper, we discuss the approach adopted, the evaluation process, the results obtained, and the impact on data quality.
Introduction
One of the main objectives of the Census is to provide statistical information at the municipal and sub-municipal levels, which is in high demand by the research community, the private sector, and public administrators. In recent years, there has been a transition in Italy from the traditional census to a strategy based on the integration of statistical registers, built mainly by integrating administrative data, and sample data. This has required an enormous effort on the part of the Italian National Institute of Statistics (Istat) to identify the best possible methodological and IT solutions to guarantee high levels of data accuracy, consistency, and timeliness [1]. In this context, the reliability of the results depends largely on the quality of the administrative data supplying the registers, an area in which much work remains to be done.
This article deals with the spatial information in the archives, which in some situations does not allow all statistical census units to be located in a sub-municipal area and thus prevents the production of data at maximum spatial detail. In past censuses, the geocoding operation took place during field data collection: the enumerator, when delivering the census questionnaire, recorded the household’s home address and the code of the enumeration area1 (EA) to which it belonged. Geocoding now occurs in the data processing phase, and it is not free from errors of missing or incorrect geocoding of census units (individuals, households, dwellings, buildings) in the EAs.
The accuracy of geocoding can certainly be increased by requiring source owners to pay more attention to quality and by working in the preliminary stages of processing the data once they have been acquired. On the other hand, specific imputation procedures (deterministic and probabilistic) have been developed to retrieve the geocoding of units without EA, exploiting geographical information, past census data, and data contained in archives.
The article will explain the imputation procedures, experimental tests, the analysis process, the results obtained, and will evaluate the impact of the procedures on the results. Finally, the validation process of some census results produced at the sub-municipal level will be illustrated, a preliminary operation to the official dissemination of the final data.
The demand for statistical information at the sub-municipal level
Over the years, there has been an increasing demand for sub-domain data from different actors and for different purposes. National and international institutions, the academic and research world, and the private sector have a continuous need for highly disaggregated spatial data to conduct spatial analyses, to pursue their business objectives, and to make social, economic, and environmental decisions.
European legislation requires the production of data per 1 km
The most frequently requested data concern the presence and characteristics of foreigners, the educational level of the population, current activity status, position in the profession, type of occupation or sector of economic activity of the employed, family types, and structural characteristics of dwellings.
The Census of Population and Housing is the only statistical survey able to guarantee the availability of data at such a high territorial detail. The transition from the traditional decennial Census in Italy to the new permanent Census has required a careful review of the strategy to ensure census results that allow comparability with the past and adequate availability of sub-municipal domain data. The statistical methodologies and IT procedures used in the Permanent Census strategy are constantly evolving, as is the supply of sub-municipal data in quantitative and qualitative terms.
The permanent census strategy
Since 2018, Istat has been conducting the Permanent Census of Population and Housing. The census has moved from a traditional to a combined approach that integrates data from administrative sources and two sample surveys on households appropriately designed for the objectives of the annual census production: the population count and the calculation of contingency tables related to the main census variables.
Appropriate methodological and IT architectures enable the integration of register data with the data collected on the sample households to produce census results at all geographic levels, from national to municipal.
The production of census data at the sub-municipal level requires the integration of two registers prepared by Istat: the Base Register of Individuals2 (BRI) and the Base Register of Places3 (BRP) [3]. Through the linkage between these registers, it is possible to locate individuals and households on the territory and, in particular, in EAs [4]. In the past, this operation took place in conjunction with field data collection; with the new strategy of the permanent census, the geocoding of statistical units to EAs is the result of the integration of BRI and BRP, and non-geocoded units4 may therefore arise. This critical issue is mainly due to unavoidable structural deficiencies in the territorial archives.
Geocoded and non-geocoded population by type of error
Source: Base Register of Individuals (BRI) as of 31.12.2019.
To be able to produce data at the sub-municipal level, Istat has developed some solutions that are applied before the linkage operation to improve the outcome of the integration between BRI and BRP, together with specific procedures, applied ex post, for processing the residual units that remain without geocoding. This made it possible to define a process that leads to the complete geocoding of all units (households and individuals) to EAs. The paper illustrates the proposed imputation methodologies and the experimental study that allowed the definition of a procedure for recovering the EA of non-geocoded units.
As mentioned in Section 3, BRP plays a crucial role in the process of producing census results as it allows the geographical localization of statistical units belonging to Istat registers and surveys.
The construction of BRP is a very complex and innovative process [5]. It consists of four components:
Census geographical database (enumeration areas and micro-zones); Addresses and geographical coordinates; Residential buildings and housing; Administrative and Statistical Territorial Units.
All these components contribute to the geocoding of units to produce the sub-municipal data of the permanent census. However, public administrations, as data providers, still pay little attention to the spatial information in the data sources used. In particular, the quality of the “address string” of individuals recorded in the Population Registers (PR) of the municipalities, the main source supplying the BRI, is not high.
The main operation performed to improve the quality of address strings is to subject them to a normalization process with commercial software. The software also returns the geographical coordinates of house numbers, if any, contributing to the generation of the BRP. This process allows the addresses to be geocoded and the two archives BRI and BRP to be linked through a specific “unique address key” (CUI).
In this context, the sources of possible errors are mainly due to:
the sources supplying the registers; the register generation process; the linkage process between BRI and BRP.
These critical issues are also reflected in the statistical units of the BRI, causing cases of missing or incorrect geocoding of individuals.
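The linkage step described above can be pictured as a key-based join. The following is a minimal sketch, not Istat's implementation: the record layout, the field names `cui` and `ea`, and the function name `link_bri_to_brp` are all hypothetical, and the commercial normalization software that produces the key is not modelled.

```python
def link_bri_to_brp(bri_records, brp_by_cui):
    """Join individuals (BRI) to places (BRP) through the unique address key.

    bri_records: list of dicts, each with a 'cui' entry (None when the
                 address was not recognized). Hypothetical layout.
    brp_by_cui:  dict mapping a CUI to a place dict with an 'ea' entry.
    Returns (geocoded, non_geocoded).
    """
    geocoded, non_geocoded = [], []
    for rec in bri_records:
        place = brp_by_cui.get(rec.get("cui"))
        if place and place.get("ea"):
            # The key matched a place that carries an enumeration area.
            geocoded.append({**rec, "ea": place["ea"]})
        else:
            # No key, no matching place, or place without EA: residual unit.
            non_geocoded.append(rec)
    return geocoded, non_geocoded
```

Records that end up in the second list are the residual units that the later imputation procedures are meant to handle.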
Table 1 shows the absolute values and percentages of the census population as of 31.12.2019, focusing on the approximately 2.6 million individuals who remain non-geocoded. The main unresolved cases are individuals whose PR address lacks a house number and individuals whose house number is not recognized by the normalizer because it does not exist or is incorrect. Each of these two cases involves about 1.1 million individuals, or 1.9% of the total number of individuals. Finally, the last two lines concern individuals without geographical coordinates or enumeration area (0.2% of the total) and individuals whose address is not recognized by the normalization software or does not exist (0.5% of the total).
A statistical process of checking and correcting the “consistency” of units of different populations (individuals, households, dwellings, buildings) was recently started in Istat. This should make all statistical units that belong to the same sub-municipal domain geographically consistent: each individual should belong to a household residing in a dwelling of a single building, and this building should be correctly geocoded. This goal is, however, hampered by the possible sources of error already discussed. A further critical point is the quality of the cadastral source used by the buildings register. To reduce these problems and to improve the final quality of the register, additional sources from the open world and from the Italian regions that produce cartography are also used.
Process improvement actions
To overcome the critical issues illustrated above, which do not allow the production of sub-municipal census data with acceptable levels of quality, Istat has defined several interventions and made some methodological choices to improve the statistical process. These concern:
the coverage and accuracy of sources, including ad-hoc surveys of municipalities; the process of address recognition and CUI assignment; the process of assigning geographic coordinates and EA to house numbers, including through the implementation of other open sources; the improvement of the record linkage operation between BRI and BRP; the integration of housing and building sources.
To improve the quality of spatial information in the archives, the list of unrecognized and non-geocoded addresses was sent to the municipalities. Municipalities were asked to correct errors, complete missing data, and update the information in their archives.
Procedures to improve the address recognition phase and the linkage operation between the BRI and BRP components were also implemented using other sources [6]. This made it possible to increase the level and quality of geocoding.
Nevertheless, a percentage of non-geocoded units remains, for which specific EA retrieval procedures have been defined. These techniques, described in the following sections, will be implemented in the future in the final sub-municipal data production process.
Imputation procedures for non-geocoded units
The literature on the imputation of missing addresses (also known as geo-imputation) is sparse. Applications to census data and health studies can be found in [7, 8, 9] and references therein. Generally speaking, there are two types of geo-imputation strategies: deterministic and stochastic. Deterministic methods assign cases to locations by following a fixed set of rules. They are adopted when the rules determining the imputations are highly reliable. Their advantages are computational efficiency and the transparency of the logic behind each imputation. When, instead, the rules for imputation are not clear enough, it is better to resort to probabilistic imputation methods. In this setting, uncertainty can be kept under control and evaluated with the usual tools of probability.
In practice, as in other NSIs’ imputation procedures, the two approaches are combined. Generally, the first step is carried out by resorting to deterministic methods, then the remaining missing data are imputed through probabilistic methods.
This is the approach adopted for the EA imputation in our application. In particular, the results of the deterministic procedures are used jointly as follows: the deterministically imputed EA is the value proposed most frequently by the deterministic procedures. For instance, if two or more procedures give the same EA, this will be the imputed value. When all procedures provide discordant EAs, the value given by the most reliable method is selected for imputation.
In the following sections, we detail the deterministic and probabilistic imputation methods, and the experiment performed to assess their quality and to design the overall imputation procedure.
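The combination rule just described can be sketched as a majority vote with a reliability-based fallback. This is an illustration under stated assumptions: the ranking in `RELIABILITY_ORDER` is a placeholder (the text does not give the actual order), the function name is hypothetical, and the handling of ties between equally frequent EAs is our own choice, since the text only covers the fully discordant case.

```python
from collections import Counter

# Hypothetical ranking, most reliable first. The actual order is not stated
# in the text; this list is only a placeholder for illustration.
RELIABILITY_ORDER = ["FR", "AD11", "SI", "REP", "RER"]

def combine_deterministic(proposals, reliability_order=RELIABILITY_ORDER):
    """Pick one EA from the EAs proposed by the deterministic methods.

    proposals maps a method label to its proposed EA (None if no result).
    Returns None when no method produced an EA.
    """
    available = {m: ea for m, ea in proposals.items() if ea is not None}
    if not available:
        return None  # left to the probabilistic step
    counts = Counter(available.values())
    top = max(counts.values())
    winners = {ea for ea, n in counts.items() if n == top}
    if len(winners) == 1:
        return next(iter(winners))  # a single most frequent EA
    # Tie, including the case where all methods disagree: take the EA
    # proposed by the most reliable method among the tied candidates.
    for method in reliability_order:
        if available.get(method) in winners:
            return available[method]
    return None
```

For example, `combine_deterministic({"FR": "A", "SI": "A", "REP": "B"})` returns `"A"` because two methods agree, regardless of the ranking.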
Deterministic methods
Different deterministic methods are proposed that exploit information from the household, the last traditional census, the real estate properties, and the geographical spatial coordinates.
Family reconstruction (FR)
This method retrieves the missing EA of individuals from the family to which they belong. Within a household, the EA of the geocoded members is assigned to the non-geocoded members. This procedure is not applied when the geocoded members have discordant EAs or when the household has an anomalously high number of individuals.
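The FR rule can be written in a few lines. The sketch below is only illustrative: the record layout and the function name are hypothetical, and the threshold `max_size` for an "anomalously large" household is an assumed value, since the text does not state one.

```python
def family_reconstruction(members, max_size=10):
    """Impute the household EA to members who lack one.

    members:  list of dicts with 'id' and 'ea' (None when missing).
    max_size: assumed cut-off above which a household is treated as
              anomalous and skipped (value chosen only for illustration).
    Returns {member id: imputed EA}; empty when the rule does not apply.
    """
    if len(members) > max_size:
        return {}
    known = {m["ea"] for m in members if m["ea"] is not None}
    if len(known) != 1:
        # No geocoded member at all, or discordant EAs within the household.
        return {}
    ea = next(iter(known))
    return {m["id"]: ea for m in members if m["ea"] is None}
```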
Spatial interpolation (SI)
This method aims to assign the EA to the population that lacks it but has an address, including a house number, whose location on the territory is unknown. This problem has several causes; among the most frequent are addresses newly established by municipalities, for which geographic coordinates are not yet available in the databases used.
SI is founded on the concept of geographic proximity. Specifically, an address is assumed to have a high probability of being in the same EA as a neighbouring address. In Italy, addresses have house numbers; for the population without EA, the distance, in terms of house numbers, between the address without a geographic location and the nearest address whose location is known was measured. The EA of that nearest address is used for the imputation. Note that, since even and odd house numbers lie on opposite sides of the street, the distance between even and odd numbering was also taken into account when measuring distance.
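A minimal sketch of the house-number distance idea follows. It assumes purely numeric house numbers and models the even/odd effect as a fixed additive `parity_penalty`; the text only says that parity was taken into account, so both the penalty and its default value are our assumptions, as is the function name.

```python
def nearest_known_ea(street, number, geocoded, parity_penalty=2):
    """Return the EA of the nearest geocoded house number on the same street.

    geocoded:       dict mapping (street, house_number) -> EA.
    parity_penalty: assumed extra distance added when the candidate lies on
                    the opposite (even vs odd) side of the street.
    Returns None when the street has no geocoded house number.
    """
    best = None  # (distance, EA) of the best candidate found so far
    for (s, n), ea in geocoded.items():
        if s != street:
            continue
        dist = abs(n - number)
        if n % 2 != number % 2:
            dist += parity_penalty
        if best is None or dist < best[0]:
            best = (dist, ea)
    return None if best is None else best[1]
```

With house numbers 8 and 11 geocoded on a street, number 10 takes the EA of number 8 under the default penalty, even though 11 is numerically closer, because 11 lies on the other side of the street.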
Address strings from the 2011 census (AD11)
This method is based on a comparison of the address strings associated with individuals without an EA, with the address strings in the 2011 Census data. When the linkage takes place, the EA associated with the address in the 2011 Census archives is retrieved, after verification on the municipality to which it belongs.
In more detail, the retrieval based on the 2011 Census address strings is carried out in three successive steps, each with less restrictive linkage constraints than the previous one.
The first step of the retrieval consists of recognizing, within the 2011 Census archive, the EA associated with the address string. Retrieval is made possible only for addresses found to be associated in the municipality with only one EA (uniqueness of correspondence between the address and EA).
The second step introduces the individual’s identification code into the linkage key. If the individual is the same, lives in the same municipality as in 2011, and at the same address, it is still reasonable to impute the same EA as in 2011, even when the linkage between the text strings of the two addresses is not fully and completely realized.
The third and last step is similar to the first one but makes use of the individual code as a linkage key. Less detailed information from the address string is used: the key is composed of the “name” of the street and the “house number” of the address, while the “species” element is excluded. In this way, it is possible to overcome failures of recognition related to:
the lack of the “species” element in either of the two textual addresses considered (e.g., “WITHOUT SPECIES TURRITA LOC. SCATTERED HOUSES 67” is considered equivalent to “FR. TURRITA LOC. SCATTERED HOUSES 67”); the presence of different ways of abbreviating the “species” (e.g., “CONTRADA SAN SALVATORE 77” is considered equivalent to “CDA SAN SALVATORE 77”, to “C.DA SAN SALVATORE 77” and to “CONTR. SAN SALVATORE 77”); changes over time in the “species” of the address (e.g., “PIAZZA GIUSEPPE MAZZINI 11” is considered equivalent to “PIAZZETTA GIUSEPPE MAZZINI 11”).
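Two ingredients of the AD11 steps, the species-free address key and the requirement of a unique EA per address within a municipality, can be sketched as follows. This covers only the string handling; the individual-code linkage is not modelled. The list of species tokens is a short illustrative sample, and the function names and tuple layout are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative, incomplete sample of street-type ("species") tokens.
SPECIES = re.compile(
    r"^(VIA|VIALE|PIAZZA|PIAZZETTA|CONTRADA|CONTR\.|CDA|C\.DA|FR\.)\s+"
)

def strip_species(address):
    """Drop a leading species token, keeping street name and house number."""
    return SPECIES.sub("", address.strip().upper())

def build_unique_lookup(census_2011, key=lambda a: a):
    """Map (municipality, address key) -> EA, only where the EA is unique.

    census_2011: iterable of (municipality, address, ea) tuples.
    key:         function turning an address string into a linkage key,
                 e.g. the identity (strict) or strip_species (looser).
    """
    eas = defaultdict(set)
    for municipality, address, ea in census_2011:
        eas[(municipality, key(address))].add(ea)
    # Keep only addresses associated with exactly one EA in the municipality.
    return {k: next(iter(v)) for k, v in eas.items() if len(v) == 1}
```

Building the lookup once with the identity key and once with `strip_species` gives a strict and a looser linkage table, in the spirit of the gradually relaxed constraints described above.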
Real estate property (REP)
This method uses information about the real estate property that the individual is reported to own. The source of information on real estate properties in Italy is the “Catasto delle Unità Immobiliari dell’Agenzia delle Entrate” (Cadastre of Real Estate Units). Istat places the real estate units from this administrative data source in an EA. Owners are imputed with the EA associated with the house, provided that some conditions are fulfilled. The real estate unit must be owned by the individual in the data year and in the municipality of residence or tax domicile. In cases where the individual owns more than one real estate unit in the municipality for the given year, de-duplication steps are used.
The result is extended to family members who still lack an EA. The extension of EA to family members is relevant numerically, as not all members of a family own the housing unit in which they live.
Real estate rentals (RER)
This method proceeds similarly to the previous one. The only difference is the relationship between the individual and the property: in the previous method the property is owned, whereas here it is rented. Again, when more than one lease is present, de-duplication steps are used, for the reasons given above.
The results obtained are then extended to family members still without an EA using the method that retrieves the EA through the family. The extension of the EA to family members is also important here since not all members of a household hold a lease related to the housing unit in which they live.
Probabilistic imputation
At the end of the deterministic imputation procedures, a certain number of units are still without EA. For those units, a step of probabilistic imputation is performed.
The probabilistic imputation is composed of a sequence of donor imputation steps, which differ mainly in the imputation cells used. Each donor imputation step chooses at random an observed household within an imputation cell and imputes its EA to a household with missing information in the same cell, that is, with the same values of the variables that define the imputation cells. At each step, the imputations are applied to the units not imputed in the previous one.
EA is imputed to the household following the principle that all the individuals of a household should be in the same EA.
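The sequence of donor steps can be sketched as a random hot-deck applied cell by cell. This is a simplified illustration, not the procedure actually used: the variables defining the cells are hypothetical, the ordering of the steps from finer to coarser cells is our assumption, and the donor pool is kept fixed to households whose EA was available before the probabilistic phase.

```python
import random
from collections import defaultdict

def donor_impute(households, cell_specs, seed=0):
    """Sequential random donor imputation of the EA at household level.

    households: list of dicts with 'ea' (None when missing) plus the
                variables used to define the cells (hypothetical names).
    cell_specs: list of tuples of variable names, one tuple per step.
    Households are updated in place; the ones still without EA are returned.
    """
    rng = random.Random(seed)
    donors = [h for h in households if h["ea"] is not None]
    recipients = [h for h in households if h["ea"] is None]
    for variables in cell_specs:
        # Group the donor EAs by imputation cell for this step.
        pools = defaultdict(list)
        for d in donors:
            pools[tuple(d[v] for v in variables)].append(d["ea"])
        still_missing = []
        for r in recipients:
            pool = pools.get(tuple(r[v] for v in variables))
            if pool:
                r["ea"] = rng.choice(pool)  # EA of a randomly chosen donor
            else:
                still_missing.append(r)     # passed on to the next step
        recipients = still_missing
    return recipients
```

Because the EA is assigned to the household record, all members of the household inherit the same EA, in line with the principle stated above.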
Imputations by methods and concordances with observed EAs and ADAs
Footnotes
Enumeration areas are geographical areas of defined spatial dimensions assigned to enumerators for population and housing census operations. They draw a partition of the municipalities’ territory and may consist of maps or lists of streets or dwellings/buildings.
The BRI contains information on some demographic variables such as sex, place and date of birth, citizenship, and place of residence from administrative data.
The BRP contains addresses, enumeration areas and, if possible, geographic coordinates.
4.4% of BRI units (as of 31/12/2019), about 2.6 million, were non-geocoded.
Sub-municipal areas that correspond to an administrative geographical classification and consist of aggregations of contiguous EAs.
The 2011 population and housing census was the last census conducted in Italy in the traditional way, where individuals and households were allocated in enumeration areas at the same time as the census operations. The enumerators covered the entire municipal territory, travelling through the enumeration areas and surveying all the population living within them.
Taking the municipality of Rome as an example, the dissimilarity index value of 6 percent compared to 2011 is equivalent to about 150,000 people distributed differently across the territory compared to 2011. This quantity is roughly equivalent to the flow data on registry registrations and cancellations (for births, deaths and transfers of residence) observable in a single year on the municipal registry.
The quality of geo-coding of addresses to enumeration areas is assessed by means of certain algorithms for matching information reported in different sources and administrative records.
Enumeration areas are classified according to whether they are located in urban areas, small residential areas outside the municipality, rural areas or industrial zones.




