Take me home: Attribute-driven spatial matching of registers in official statistics

Abstract

In response to the increasing demand for information, official statistics often rely on multi-source production processes. Since registers have lower production costs compared to sample surveys, they are frequently integrated to produce the required estimates. This paper proposes a reference reconciliation method for integrating a register of addresses with a register of buildings. The method exploits spatial proximity and property ownership information. It is implemented and evaluated using residential addresses from the Italian region of Tuscany, based on data from the Integrated System of Statistical Registers of the Italian National Institute of Statistics. Results show that the proposed method increases the percentage of matched addresses by 20%. Moderate precision and high recall values are key performance indicators of the approach. The method also allows for various extensions, including parameterization (e.g., urban/rural contexts), application to different statistical units and the use of complex distance measures.

Keywords

addresses population census cadastral archive housing arrangements spatial matching reference reconciliation data integration registers

1. Introduction

Over the past decade, the Italian National Institute of Statistics (Istat) has shifted its statistical production paradigm toward an integrated approach, promoting extensive use of registers.¹ The modernization of production processes involves several high-impact initiatives aimed at extending and enhancing an integrated use of multiple data sources.² This new production paradigm underpins the Integrated System of Statistical Registers (ISSR), a unified logical data asset resulting from the integration of survey and administrative data.

The ISSR includes four master statistical registers (ISSR also includes other registers.): the Population Register, the Business Register, the Register of Activities, and the Register of Places. These registers describe their entire corresponding populations of statistical units.³ Within the ISSR, each individual, enterprise or place is assigned a unique and unambiguous identification code.

One of the main objectives of the ISSR is to geo-reference all core statistical units. To reach this aim, extensive integration procedures have been implemented, involving the Statistical Register of Places and other registers. However, an initial integration step is needed within the Statistical Register of Places. Indeed, this register integrates and connects different spatial units, including addresses, buildings and dwellings, census blocks, grid cells, and territorial administrative and statistical areas.⁴ Figure 1 illustrates the ISSR components involved in this research: the Register of Addresses (RA), the Register of Buildings (RB), and the Population Register.

Figure 1.

Registers involved in this study: population register and, in the box, components of the statistical register of places.

The complex structure of the Statistical Register of Places is based on the observation that most statistical units rely on addresses. An address is defined as any direct or indirect access from a street to a building unit where activities may occur. It is therefore represented as a spatial point that can be mapped using its latitude and longitude coordinates. In Italy, the main source of address data within the ISSR is the National Archive of Addresses of Urban Streets, although several other administrative archives and data sources also contribute.⁴ The Register of Addresses (RA) contains the addresses gathered from all these administrative sources. Each address in the RA is assigned a unique identification code, called ID. Currently, the RA contains data on approximately 30 million IDs; about 88 percent of them have associated geographical coordinates, 44 percent are residential addresses, i.e., at least one person has declared residence at that location.⁴

Italian real estate assets constitute the stock information contained in the Register of Buildings (RB). This register primarily relies on cadastral administrative and spatial data, while also integrating information from geographic agencies and open data sources. A unique identifier is assigned to each building, preserving the hierarchical structure of the units (nested building units or dwellings within a building). Currently, the RB includes data on approximately 29 million buildings, of which 14.4 million are residential.⁴ Around 80 percent of buildings have valid geographical coordinates (of their building centroid).

An additional ISSR objective is to achieve single cohesive units—such as individuals linked to their dwellings, employees to their workplaces, and students to their schools, etc. Joint information on population and dwellings is of strategic importance, as it enables the production of numerous quality-of-life statistical indicators. In official statistics, common examples include indicators on living conditions, housing quality, home ownership, overcrowding rates, and population statistics related to territories. Istat already disseminates such statistics, but usually they stem from sample surveys, which involve high response burden and high costs. Estimating these indicators through the ISSR would significantly reduce costs for both respondents and producers.

The creation of cohesive statistical units allows for the effective use of microdata from available registers in a manner consistent with official statistics quality standards.⁵ From a theoretical perspective, the sole information on addresses would allow the integration of individuals with their dwellings. Specifically, each individual in the Population Register resides at a residential address r, which is included in the RA. Simultaneously, each individual's home is located at a building address, b, which is drawn from the RB. By linking residential and building addresses through an ID-based deterministic record linkage, it becomes possible to integrate information on the population and the territorial characteristics of dwellings.

However, the information provided by addresses alone is insufficient to create a complete set of cohesive units. Specifically, an ID may exist in the Register of Addresses (RA) but not in the Register of Buildings (RB); that is, there may be residential addresses where no building is located. A key assumption in ID-based deterministic record linkage is that an address is registered in both the RA and RB under the same ID. When this assumption does not hold, biased statistics may result.⁶ The problem of determining which IDs correspond to the same real-world address is known as reference or identifier reconciliation.^7–9 Despite the use of various synonyms (see Table 1 in¹¹, for example), scientific literature mentions reference reconciliation as the initial step in any data integration process.^10–14

This paper focuses on the reference (ID) reconciliation of addresses between RA and RB, using attributes from both RB and the Population Register. In this context, all addresses are contained within RA, but they originate from different sources.

On the one hand, residential addresses, denoted as r, are those included in administrative records certifying population residence. In Italy, two key regulations govern residential addresses: Chapter VII (Articles 41–44) of Legislative Decree No. 223/1989 and the Prime Minister's Decree (D.P.C.M.) of 12 May 2016 on the Population Census and the national archive of addresses. As in many countries,¹⁵ municipalities are responsible for the registration, deletion, and updating of residential addresses. As a result, the efficiency of this administrative process varies significantly across different municipalities, leading to a level of heterogeneity that cannot be easily accounted for in standardized data processing phases.

On the other hand, building addresses, denoted as b, originate from cadastral administrative data sources, which have distinct data generation, processing, and transmission procedures. It is well known that cadastral administrative data are prone to updating issues, as changes are typically made only in response to specific requests from property owners or following legal, fiscal, or administrative actions. As a result, cadastral information is updated infrequently, while residential addresses are more regularly validated and updated, often due to population migration, for example. Therefore, discrepancies between residential and building addresses may also arise from the temporal misalignment in the updates of these two types of addresses.¹⁶

Addresses may also be incomplete, particularly with regard to street numbers, suffixes, or types (e.g., street vs. road). Additionally, street numbers are more commonly missing in rural areas than in urban areas. In low-density population regions, further complications arise from the use of rural routes or place names (e.g., localities) instead of numerical street addresses.^17,18 Finally, while standardization methods routinely applied to addresses help resolve many inconsistencies, they can also introduce uncertainties. This is especially true for ambiguous street names (e.g., “via del Corso” (Avenue Street, an example of using a type of street as a street name)) or combined street names (e.g., names with double adjectives or previously assigned names).

To summarize, differences may exist for the same address when derived from different sources. Figure 2 illustrates an address r in the Register of Addresses (RA), where the textual component is correctly and fully listed, along with its assigned ID. The nearby building with address b, derived from cadastral information, is missing street number details. As a result, b has a different ID, as its content differs from that of r. The ID-based record linkage fails because the missing textual information prevents the correct r-b association, even though they represent the same physical address. To improve the cohesiveness of the ISSR, it is therefore necessary to reconcile the addresses in RA and RB.

Figure 2.

Id address of residence, r, and address of the building, b. They do not allow for an ID-based record linkage between the residents in r and their home in building at the address b due to missing street number in b.

The aim of this paper is to present a spatial reference (ID) reconciliation method applied to address and building registers. The method utilizes additional information from the specific statistical unit to be geo-referenced. In this study, the reference reconciliation focuses on residential addresses, where the population resides.

The paper is structured as follows. Section 2 describes the proposed attribute-driven spatial reference reconciliation method for addresses in the RA and RB registers, with a brief overview of the involved registers. Section 3 presents preliminary findings from the Italian region of Tuscany. Finally, Section 4 concludes the paper and outlines potential directions for future research and improvements.

2. Data sources and methodology

This section firstly provides a description of the study area, along with an overview of the Register of Addresses (RA) and the Register of Buildings (RB). It then details the proposed spatial reconciliation method.

2.1 Study area

The data used in this paper are derived from the Register of Buildings and the Register of Addresses. The study area corresponds to the Nomenclature of Territorial Units for Statistics (NUTS)¹⁹ level two region (NUTS2) of Tuscany, located in central Italy. Tuscany is composed of ten smaller NUTS3 regions and 273 municipalities (see Figure 3). According to the degree of urbanization,²⁰ over 60 percent of the municipalities are rural, 36 percent are classified as towns, and there are seven cities. The total population is approximately 3.7 million, spread over an area of less than 23 thousands km².

Figure 3.

Tuscany region in Italy and its smaller NUTS3 regions.

2.2 Data sources

The Italian Cadastral Information System²¹ serves as the primary source of information for the Register of Buildings (RB). The cadastral system is fundamental for the legal identification of real estate, fiscal equity, and the management of land use and zoning,²² as well as for land planning.¹⁵ Italy employs a system of cadastral and land registry parcels, which includes an urban Cadastre (for buildings). The Land Registry is simply a collection of deeds,²² while the urban Cadastre identifies each real estate unit. Cadastres based on the concept of a cadastral parcel—defined by the uniformity of soil, topography, ownership, and municipality—assign a single identifier to each parcel.

For the purposes of this paper, the cadastral information in the Register of Buildings (RB) is divided into two core components: the administrative component (derived from the Urban Building Cadastre archive) and its spatial representation within a cartographic system.

The administrative component of RB includes technical-physical, legal, and economic data for each unit. The main variables utilized in this paper are the cadastral building identifier, address identifier (ID), municipality code, cadastral typology, and personal identifier (see the Appendix). This paper focuses solely on buildings with dwellings, referred to as residential buildings. The spatial information in RB includes the geo-building identifier, address identifier (ID), municipality code, and building centroid coordinates²³ (see the Appendix).

Before the reconciliation process, a record linkage between the administrative and spatial components of the Register of Buildings (RB) is required. Specifically, matching by building identifier (Cadastral Building Identifier and Geo Building Identifier) enables the association of each unit with its corresponding centroid coordinates. This step may result in some buildings having missing coordinates. Inaccuracies such as updating issues, differences in map projections, or errors in reporting outdated cadastral information are common causes of internal mismatches.^22,24 In Tuscany, approximately 89 percent of buildings in RB have valid geographical coordinates, specifically those of the building centroids (see the first column in Table 1).

Table 1.
Percentage of buildings and residential addresses with geographical coordinates by NUTS3 regions of Tuscany.

NUTS3 Region name Buildings Residential addresses

Massa-Carrara 87.1 83.1

Lucca 82.8 88.6

Pistoia 89.3 95.2

Firenze 87.5 92.5

Livorno 91.5 91.1

Pisa 92.7 88.7

Arezzo 76.7 81.6

Siena 96.4 80.2

Grosseto 91.6 79.2

Prato 90.4 96.5

NUTS3 Region name	Buildings	Residential addresses
Massa-Carrara	87.1	83.1
Lucca	82.8	88.6
Pistoia	89.3	95.2
Firenze	87.5	92.5
Livorno	91.5	91.1
Pisa	92.7	88.7
Arezzo	76.7	81.6
Siena	96.4	80.2
Grosseto	91.6	79.2
Prato	90.4	96.5

The main variables from the Register of Addresses (RA) used in this research are address identifier (ID), municipality code, address coordinates, and information indicating whether the address is residential (see the Appendix). Several address quality issues can hinder the geo-referencing process.²⁵ The positional accuracy of addresses has been widely studied,²⁶ but other issues include missing street numbers, spelling errors in street names,²⁷ and address types referring to localities, particularly in rural areas. Table 1 presents the distribution of residential addresses with geographical coordinates. The variability observed in Table 1 is due to differences in data collection procedures, as well as the geomorphological and demographic characteristics of the NUTS3 regions. Nevertheless, the percentages suggest a satisfactory level of geospatial coverage.

The ID-based deterministic record linkage between the Register of Addresses (RA) and the Register of Buildings (RB) for Tuscany reveals that these registers share approximately 56 percent of addresses. This suggests that individuals living at the remaining percentage of addresses do not have corresponding dwellings. Table 2 displays the distribution of shared addresses by NUTS3 region. The goal of the spatial reference reconciliation method presented in this paper is to increase these shared percentages.

Table 2.

Percentage distribution of RA and RB shared IDs before the application of the reference reconciliation method, by NUTS3 regions of Tuscany (percentage on total number of residential IDs).

NUTS3 Region name	Shared IDs (%)
Massa-Carrara	44.8
Lucca	43.2
Pistoia	51.8
Firenze	70.7
Livorno	64.7
Pisa	49.4
Arezzo	59.8
Siena	48.6
Grosseto	52.6
Prato	69.2

2.3 Methodology

The scientific literature on address reference reconciliation, particularly in the context of matching points of interest, encompasses various methods, ranging from simple rule-based approaches to machine or deep learning techniques.^28,29 In the context of ISSR data integration, this paper proposes an optimization-based address reconciliation method that leverages spatial information associated with addresses, building centroids, and property characteristics.

In this case study, the units linked to addresses are individuals. Only non-shared residential addresses with spatial coordinates are selected from the Register of Addresses (RA). From the Register of Buildings (RB), only buildings with single-owners (single-owners are those individuals owning a single dwelling in a municipality or multiple dwellings as long as these belong to the same building) are included in the analysis. Additionally, the method selects only those buildings for which building centroid coordinates are available.

The rationale behind the method relies on the following two assumptions:

Ownership condition: single-owners of a dwelling in the municipality of residence usually live in their own property;

Spatial condition: the geographical coordinates of a building's centroid are assumed to lie in close proximity to the point representing its address on the map.

While the second condition is intuitive, the first is based on the fact that approximately 75 percent of Italians live in their own homes (https://ec.europa.eu/eurostat/web/interactive-publications/housing-2023#:∼:text=In%20the%20EU%20in%202022,31%25%20lived%20in%20rented%20housing). The proposed ID reconciliation method operationalizes these two conditions. Figure 4 illustrates the method. In the first step, residents at a given non-shared address $r \in R A$ in the Register of Addresses (RA) are identified. Then, their owned buildings (for the sake of simplicity, buildings are mentioned in the paper but dwellings in buildings are considered) are selected. Next, only the buildings that are in the same municipality (or another relevant territory/area) as r and whose owners are single-owner residents at r are considered for matching. Finally, among these buildings, the closest address b to r is labeled as match, resulting in an ID reconciliation between r and b. This application uses the Euclidean distance between the spatial coordinates of r and the building centroid coordinates of b. Other distance functions could also be used to measure proximity between r and b.

Figure 4.

Address matching driven by ownership status; r = residential address, b = building address, symbols labeled i1, i2, i3 and i4 represent individuals, dashed lines indicate residential relationships, full lines indicate ownership relationships.

The spatial matching problem described earlier can be viewed as an optimal assignment problem, where features in the Register of Addresses (RA) are matched to corresponding features in the Register of Buildings (RB). For each $r \in R A$ , let B_r be the set of buildings in the same municipality of r and whose owners are single-owner residents at r. Let $D S$ be a dissimilarity function between $r \in R A$ and $b \in R B$ . Thus, $d s_{r, b} = D S (r, b)$ can be considered the cost of matching any two r and b. In our context, the dissimilarity

\begin{aligned} d s_{r, b} = {\begin{array}{ll} H, & b \notin B_{r} \\ d i s t_{r, b}, & b \in B_{r} \end{array} \end{aligned}

(1)

expresses the cost of matching an address

r \in R A

to an address

b \in R B

where

d i s t_{r, b}

is the distance between addresses, while H is a high value greater than the maximum distance between addresses in RA.

Further, let $f_{r, b}$ be a binary indicator function of linking addresses $r \in R A$ and $b \in R B$ :

\begin{aligned} f_{r, b} = {\begin{array}{ll} 1, & i f r i s m a t c h e d t o b \\ 0, & o t h e r w i s e \end{array} \end{aligned}

(2)

Then, the reference reconciliation problem may be expressed as a linear optimization problem (LP):

\begin{aligned} \begin{array}{l} min \sum_{r \in R A, b \in R B} f_{r, b} * d s_{r, b} \\ s u b j e c t t o \\ \sum_{b \in R B} f_{r, b} = 1, \forall r \in R A \end{array} \end{aligned}

(3)

which minimizes the global cost of assignments. The constraint in equation (3) states that each residential address

r \in R A

may be linked to a unique building address

b \in R B

The formulation of address matching problems within an optimization framework offers several advantages. Firstly, LP problems are typically solved using specialized software such as Gurobi (https://www.gurobi.com/) or CPLEX (https://www.ibm.com/it-it/products/ilog-cplex-optimization-studio). Secondly, different cost functions and constraints in equation (3) can reflect various matching settings. For example, the constraints in equation (3) can be straightforwardly modified to establish a one-to-many or many-to-many relationship between RA and RB, thereby addressing different assignment problems.³⁰

The method does not rely on any a priori distance threshold for identifying r-b links. By being independent of any distance threshold, the method avoids issues related to computational burden or the risk of having empty admissible subsets of the Register of Buildings (RB).³¹

However, in practice, some data selection may be beneficial for computational efficiency. A maximum distance could be set, beyond which no candidate pairs are considered admissible. Alternatively, a further refinement can be achieved by using a slightly more complex proximity-based dissimilarity function like $d s_{r, b} = {\begin{array}{ll} D i s t_{m a x}, & b \notin B_{r} \\ d i s t_{r, b}, & b \in B_{r} \end{array}$ where $d i s t_{r, b}$ represents the distance (Euclidean distance or some other distance between addresses) between r and b while $D i s t_{m a x}$ is a high distance threshold above which no match is admissible. Depending on the location, different admissibility distance thresholds $D i s t_{m a x}$ for identifying r-b links can be applied, i.e., smaller distances in densely populated urban areas and larger distances in sparsely populated rural zones.

3. Results

This section presents preliminary results of the spatial reconciliation procedure, focusing on its outcomes, identification rates, precision, and recall.

Using information from addresses, buildings, and population registers in the Tuscany region, the spatial matching procedure identifies approximately 200 thousands new addresses (21 percent) as referring to the same real-world entity. Table 3 illustrates the improvements across Tuscany regions. The Pisa and Pistoia regions show the most significant number of newly matched addresses. Firenze exhibits the smallest increase in newly matched addresses; however, addresses in this region demonstrate the best agreement even without reference reconciliation (see Table 2). In general, the lower the initial matching rate between addresses, the greater the impact of ID reconciliation.

Table 3.
Increase in the percentage distribution of RA and RB shared addresses after the application of the reference reconciliation method by NUTS3 regions of Tuscany.

NUTS3 Region name New r-b links (%)

Massa-Carrara 26.3

Lucca 26.6

Pistoia 26.8

Firenze 14.2

Livorno 16.6

Pisa 27.3

Arezzo 16.9

Siena 23.3

Grosseto 18.1

Prato 17.8

NUTS3 Region name	New r-b links (%)
Massa-Carrara	26.3
Lucca	26.6
Pistoia	26.8
Firenze	14.2
Livorno	16.6
Pisa	27.3
Arezzo	16.9
Siena	23.3
Grosseto	18.1
Prato	17.8

Investigating the causes of missing original links could help improve administrative data collection processes. Once links between addresses are validated, it becomes possible to identify the factors that hindered their original linkage. In this application, 60 percent of these addresses have a different street number. Approximately 23 percent of building addresses lack a street number altogether, while only 10 percent show a discrepancy in street type or street name.

To evaluate the spatial matching procedure's ability to reconcile candidate r-b pairs, the identification rate is computed. This rate is defined as the number of reconciled r and b addresses divided by the corresponding target number of residential and building addresses. Figure 5 shows these rates, which are evaluated for different admissibility distances between residential addresses and building centroids. In line with scientific literature on address quality, the identification rate is calculated based on the degree of urbanization of Tuscany municipalities.²⁰

Figure 5.

Tuscany identification rates of addresses in RA and RB by degree of urbanization.

Our results confirm that urban and rural areas require distinct processing. For any fixed admissibility distance, significant differences between urban and rural areas are evident in both registers; however, RB exhibits considerably higher dissimilarities than RA. For the RA, the urban-rural divide decreases as the distance increases: the greater the distance, the smaller the difference between urban and rural addresses. In urban areas, a 100 m distance is sufficient to identify about 80 percent of addresses, whereas in rural areas, a distance of 300 m is needed to achieve the same result. Overall, distance does not influence urban-rural differences in RB. This issue may be related to several factors, such as second homes, depopulation, or aging populations in rural areas, and will be the subject of further investigation.

Finally, the last two indicators used to evaluate the outcomes of the spatial matching procedure are precision and recall. Precision is computed as the percentage of correctly matched units among all units, while recall is defined as the percentage of correctly matched units among all units that should be matched. Table 4 presents the values of these two quality measures in Tuscany.

Table 4.

Precision and recall of the spatial matching procedure by NUTS3 regions of Tuscany.

NUTS3 Region name	Precision (%)	Recall (%)
Massa-Carrara	54.4	78.3
Lucca	51.6	89.1
Pistoia	56.8	87.4
Firenze	78.3	95.6
Livorno	73.8	87.5
Pisa	56.2	90.4
Arezzo	70.7	89.7
Siena	57.7	83.3
Grosseto	64.4	80
Prato	73.9	84.9

To approximate these statistics, the benchmark is represented by the addresses shared by RA and RB (i.e., the common IDs). In four NUTS3 regions, precision exceeds 70 percent. However, other regions show lower values for this indicator. One way to increase precision is by reducing the number of false positives—i.e., decreasing the number of r-b links established by the spatial matching procedure that are not originally linked. While the proposed spatial matching procedure can be further improved, it is important to note that, in our data reconciliation problem, the benchmark does not fully cover the entire population. As a result, any reference reconciliation method will inevitably yield some false positives.

Table 4 shows high recall values (with an overall mean of about 87 percent), indicating the excellent accuracy of the proposed attribute-driven spatial matching procedure.

4. Conclusions and further developments

In response to the high costs of sampling surveys and to the aim of improving the quality of its products and processes, official statistics have increasingly adopted a multi-source paradigm. In Italy, the Integrated System of Statistical Registers (ISSR), a unified logical data asset, supports the production of official statistics. This paper focuses on implementing a reference reconciliation method for addresses and building registers to enhance integration within the ISSR. Several quality-of-life statistics could benefit from the joint use of these registers.

The method incorporates both spatial and non-spatial conditions. The spatial condition relates to the proximity between the spatial point representing the residential address and the geographical coordinates of a building centroid. The non-spatial condition involves the attribute used to filter and validate matches between residential addresses and buildings, specifically the property ownership status of the resident population. The results from the application to the Tuscany region are highly encouraging.

Based on these results, we assert that ownership-based spatial matching is a valid context-specific reference reconciliation method. While the method has been tested on residential addresses, the same approach can be applied to other statistical units, such as local units of enterprises, thereby enhancing reconciliation for a broader range of data. An additional attractive feature of the method is its independence from training sets, making it a ready-to-use algorithm. Furthermore, the procedure is flexible enough to be adapted to different contexts, such as urban/rural areas, by adjusting the allowed thresholds. Finally, the method can also be applied in other scenarios, such as validating existing linkages between addresses and buildings, deduplication or assessing the quality of cadastral information, thus highlighting municipalities with specific issues.

The method is particularly relevant for National Statistical Institutes that have access to detailed administrative data on property ownership and need to improve the matching of addresses from different sources. A high rate of dwelling ownership can be advantageous for achieving good results, but other information, such as rental data, can also be considered. A key factor in this application is the availability of detailed information on the internal composition of buildings, including the number and configuration of residential units. When multiple owners reside in the same building, the correctness of associations can be verified based on their homogeneity. Conversely, if no prior associations exist, the propagation of links can be extended to all dwellings within the building, thereby increasing the matched population rate. Finally, we note that the method relies on a high degree of completeness in the Register of Addresses. However, this type of register often suffers from over-representativeness. This reference reconciliation method can also help improve and update dictionaries and thesauruses.

In addition to its application across the entire national territory, the ownership-driven spatial data reconciliation method can be extended in several directions. On one hand, it could be applied to other types of holding rights, such as rentals, or to different types of owners, including legal entities or individuals who own multiple properties. The method could also be adapted to focus on households rather than individuals. On the other hand, alternative distance measures could be explored, such as incorporating string similarity, for instance. From a computational perspective, to reduce complexity, clusters of addresses and groups of buildings could be processed simultaneously. In this context, the proposed optimization formulation could be adapted to accommodate more complex scenarios. From a statistical standpoint, evaluation tools could be further developed, such as performing sensitivity analysis, even considering the geometrical shapes of the buildings involved.

The ultimate goal is the development and maintenance of a National Statistical Institute territorial infrastructure that combines spatial and non-spatial information, enhancing the rate of cohesive units, which are essential for accurate geo-coding and multi-source statistics.

Footnotes

Acknowledgements

Istat is not responsible for any views expressed in this paper.

We are deeply grateful to the reviewers for their careful reading of the manuscript and for their insightful comments and detailed suggestions, which significantly contributed to improving the quality of the paper.

ORCID iDs

Daniela Ichim

Luisa Franconi

Damiano Abbatini

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix

List of the main variables used in this research from the Register of Buildings (RB) - both administrative and spatial parts - and the Register of Addresses (RA).

Variables in the administrative part of the Register of Buildings.

Variable name	Description
Cadastral building identifier	Concatenation of cadastral municipality code, section code, map sheet number, parcel number.
Address identifier (ID)	Identifier produced by parsing and coding processes in the Register of Addresses. This study focusses only on residential addresses.
Municipality code	Istat code of the municipality derived from the corresponding cadastral municipality code.
Cadastral typology	Description of the building unit intended use. For the scope of this research, the focus is on residential buildings/dwellings.
Consistency data	Number of rooms, surface of the unit, etc.
Personal Identifier	Identifier of individuals or legal entities holding rights (ownership, rental, etc.) on the building unit inside the building.

Variables in the spatial part of the Register of Buildings.

Variable name	Description
Geo Building identifier	Concatenation of cadastral municipality code, section code, map sheet number, parcel number present in the spatial part of the register.
Address identifier (ID)	Identifier produced by parsing and coding processes in the Register of Addresses.
Municipality code	Istat code of the municipality derived from the corresponding cadastral municipality code.
Building centroid coordinates	Spatial coordinates of the centroid of the polygon representing the building on the map.

Variables in the Register of Addresses.

Variable name	Description
Address identifier	Identifier produced by parsing and coding processes.
Municipality code	Istat code of the municipality to whom the address belongs at a specific date.
Address coordinates	Address spatial coordinates.

References

Istat. Istat’s Modernisation Programme. 2016. Available from: https://www.istat.it/en/files/2011/04/IstatsModernistionProgramme_EN.pdf (accessed 15 May 2025)

Alleva

Falorsi

Petrarca

, et al. Measuring the accuracy of aggregates computed from a statistical register. J Off Stat 2021; 37: 481–503.

Alleva

. The new role of sample surveys in official statistics. In: ITACOSM 2017, The 5th Italian Conference on Survey Methodology, Bologna, Italy, 2017. Available from: https://www.istat.it/it/files//2015/10/Alleva_ITACOSM_14062017.pdf (accessed 15 May 2025)

Abbatini

Clary

Chiocchini

, et al. Statistical register of places: opportunities for sustainable and climate change related indicators. Ital J Econ Demogr Stat 2024; 78: 101–112. Available from: https://www.rieds-journal.org/rieds/issue/view/61 (accessed 15 May 2025)

Zhang

. Topics of statistical theory for register-based statistics and data integration. Stat Neerl 2012; 66: 41–63.

Crume

Duca

Ong

, et al. Population-level surveillance of congenital heart defects among adolescents and adults in Colorado: implications of record linkage. Am Heart J 2020; 226: 75–84.

Bakhtouchi

. Data reconciliation and fusion methods: a survey. Appl Comput Inform 2022; 18: 182–194.

Saïs

Pernelle

Rousset

. Combining a logical and a numerical method for data reconciliation. In: Spaccapietra

(ed.) J data semant XII. Berlin: Springer, 2009, pp. 70–98. https://doi.org/10.1007/978-3-642-00685-2_3

Benecha

Abreu

Jennings

, et al. Data reconciliation and estimation in an agricultural survey. Stat J IAOS 2023; 39: 461–472.

10.

Cochinwala

Kurien

Lalk

, et al. Efficient data reconciliation. Inf Sci 2001; 137: 1–15.

11.

Barlaug

Gulla

. Neural networks for entity matching: a survey. ACM Trans Knowl Discov Data 2021; 15: 1–37.

12.

Elmagarmid

Ipeirotis

Verykios

. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 2007; 19: 1–16.

13.

Enríquez

Domínguez-Mayo

Escalona

, et al. Entity reconciliation in big data sources: a systematic mapping study. Expert Syst Appl 2017; 80: 14–27.

14.

Lei

. Harmonizing full and partial matching in geospatial conflation: a unified optimization model. ISPRS Int J Geo-Inf 2022; 11: 375.

15.

Cienciała

Sobolewska-Mikulska

Sobura

. Credibility of the cadastral data on land use and the methodology for their verification and update. Land Use Policy 2021; 102: 105204.

16.

Foley

Shuttleworth

Martin

. Administrative data quality: investigating record-level address accuracy in the Northern Ireland health register. J Off Stat 2018; 34: 55–81.

17.

Vine

Degnan

Hanchette

. Geographic information systems: their use in environmental epidemiologic research. Environ Health Perspect 1997; 105: 598–605.

18.

Biljecki

Arroyo Ohori

Ledoux

, et al. Population estimation using a 3D city model: a multi-scale country-wide study in The Netherlands. PLoS One 2016; 11: e0156808.

19.

Eurostat. Statistical regions in the European Union and partner countries. NUTS and statistical regions 2021. Luxembourg: Publications Office of the European Union, 2020. https://doi.org/10.2785/850262

20.

Eurostat. Methodological manual on territorial typologies. Luxembourg: Publications Office of the European Union, 2019. https://doi.org/10.2785/930137

21.

Agenzia delle Entrate. The Italian Cadastral System. 2020. Available from: https://www.agenziaentrate.gov.it/portale/documents/180690/1185444/the+italian+cadestral+system+2018_The+Italian+Cadastral+System+2018.pdf (accessed 15 May 2025)

22.

Comparetti

Raimondi

. Cadastral models in EU member states. EQA Int J Environ Qual 2019; 33: 55–78.

23.

Salvucci

Salvati

. Official statistics, building censuses, and OpenStreetMap completeness in Italy. ISPRS Int J Geo-Inf 2022; 11: 29.

24.

Mora-Navarro

Femenia-Ribera

Velilla Torres

, et al. Geographical data and metadata on land administration in Spain. Land (Basel) 2022; 11: 1107.

25.

Zandbergen

. A comparison of address point, parcel and street geocoding techniques. Comput Environ Urban Syst 2008; 32: 214–232.

26.

Ratcliffe

. On the accuracy of TIGER-type geocoded address data in relation to cadastral and census areal units. ISPRS Int J Geo-Inf 2001; 15: 473–485.

27.

Curriero

Kulldorff

Boscoe

, et al. Using imputation to provide location information for nongeocoded addresses. PLoS One 2010; 5: e8998.

28.

Sun

, et al. Conflating point of interest (POI) data: a systematic review of matching methods. Comput Environ Urban Syst 2023; 103: 101977. https://doi.org/10.1016/j.compenvurbsys.2023.101977

29.

Rezaei Ghahroodi

Ranji

Rezaee

. Address matching using machine learning methods: an application to register-based census. Stat J IAOS 2024; 40: 25–40.

30.

Goodchild

. An optimisation model for linear feature matching in geographical data conflation. Int J Image Data Fusion 2011; 2: 309–328.

31.

Abbatini

Franconi

Ichim

. Improving statistical registers’ quality through attribute-driven spatial matching. In: Proc EU Qual Off Stat. 2024. Available from: https://airdrive.eventsair.com/eventsairwesteuprod/production-leading-public/2c79b176d4fd4985af183788d57ce5f2 (accessed 15 May 2025)