Abstract
In order to rationalize the massive use of administrative data in the production of demographic and social statistics, the French national statistical institute (INSEE) has decided to launch a program to create a system of interconnected registers of individuals, dwellings and households. This program, called RESIL, is based on the mobilization of external data, particularly administrative data, in strict compliance with the conditions for the protection of individual data.
Through these registers, INSEE will have a reference universe that will make it possible to: (i) build the sampling frames and the calibration margins of household surveys; (ii) measure the coverage quality of sources; and (iii) match different datasets (surveys with administrative data, or administrative data with other administrative data) in order to provide richer information.
To set up this information system, INSEE will draw on international experiences, as it plans to use methodologies already implemented in several national statistical institutes, including:
deterministic and probabilistic matching methods for identifying and comparing different administrative sources. These methods are essential insofar as France does not have a unique and shared identifier;
the sign of life method to define the reference population from the presence of individuals in several input data;
the Dual System Estimation methodology to measure the quality of coverage of the registers using annual population census data (it is a real opportunity to have these annual points of comparison to assess the quality of the registers, especially during the set-up period).
This paper will therefore present the stakes and the context, in particular the legal context, of the implementation of these registers, and will then describe the main methodological principles and the proofs of concept planned for their implementation.
Introduction
The French National Statistical Institute (INSEE) has decided to make major methodological and IT investments in order to build up an Information System (IS) based on the linking of multiple administrative sources. It is in this context that it was decided to launch the RESIL program. Through the constitution of statistical registers of individuals, dwellings and households, RESIL aims to facilitate, secure and make reliable the mobilization of external data, notably administrative data, within the French official statistical service, in strict compliance with the conditions of individual data protection.
Thus, RESIL will constitute a reference universe based on the use of a large number of sources through the so-called “sign of life” methodology, with the elaboration of a residence index. This reference universe will make it possible, among other uses, to measure the coverage defects of a source, and thus to correct them if necessary, and to serve as a basis for building survey sampling frames. In addition, RESIL will offer an enrichment service that will make it possible to enrich files from the National Statistical System (NSS, that is to say INSEE and the Ministerial Statistical Departments) by matching them with other social data (e.g. employment data, income data …). It will thus enable the transformation of statistical operations by gradually replacing survey data with administrative data.
This article presents why and how these major changes will be introduced. It then focuses on the main methodological challenges that will need to be addressed and on the solutions inspired by international experiences.
Why and under what conditions to build a register of individuals, dwellings and households in France?
The needs it will address
The current French demographic and social information system, despite its richness, has several weaknesses:
The French National Register of Identification of Natural Persons
An increasing use of administrative data that needs to be more resilient
INSEE already, and increasingly, uses administrative sources to produce statistics and thus complement its surveys. For example, the sampling frame for the surveys is constructed from tax data. However, this system is fragile because it depends, for many processes, on a single source, which also has defects in terms of coverage and of the location of students.
Besides, the increase in the production of administrative data means that we are moving from a world in which INSEE constructs its own data and uses them for studies, to a competitive world in which more data produced outside are open, and therefore offer a potential for analyses carried out well beyond official statistics.
The current internal organization at INSEE leads to receiving administrative sources by domain. This has made it possible to develop relevant statistics, but reflection on the more cross-sectional use of certain administrative sources, beyond the primary field of application, has been more sporadic. In the field of business statistics, the information system is much more structured and has been built for several years around the backbone of a statistical business register which is supported by the inter-administrative business register [2]. This experience will be very useful in building RESIL.
A “rolling census” that provides data every year, an essential opportunity for cross-checking quality
Since 2004, INSEE has conducted a rolling census [1]. This annual census data collection is a very important asset for measuring the quality improvement of a register system under construction. Every year, the annual census survey covers 5 million dwellings and 9.3 million inhabitants (people living in dwellings, as well as people living in institutions and homeless people), which provides a considerable sample size to serve as a basis for such an operation. The address register, which is used as a sampling frame in municipalities of over 10,000 inhabitants, will also be used to measure the quality of the dwelling register. Compared to countries that conduct a quinquennial or decennial census, INSEE will be able to measure annually the degree of coverage of the register, but also of the census, thus providing a lever for improving the quality of the census.
Legal context and ethical issues
First of all, it should be pointed out that the aim of RESIL is to construct statistical registers that will be used solely by the NSS. From a legal point of view, the processing envisaged does not differ significantly from the processing underway at INSEE, except for its systematic and exhaustive nature. For example, INSEE conducts longitudinal studies using panels with the same characteristics of matching and long-term storage of identification data (surnames, first names, birth date …). As with such processing, the completeness of the register and its use to match various sources will require full compliance with the “General Data Protection Regulation” (GDPR) in force in the European Union [3].
From a legal point of view, INSEE has the right to access administrative data (including, under certain conditions, sources held by a private operator as part of a public service mission), provided that the necessary measures are taken to ensure that this communication does not infringe secrets protected by law.
In France, the registration number in the national register (French acronym NIR), which is used in some of the administrative sources, is not shared with all French administrations, unlike the business identifier, and its use for statistical purposes is strictly controlled. To overcome these limitations, INSEE has launched a project called “Non-significant Statistical Code” (CSNS), in order to transform the NIR into a statistical identifier using hashing and encryption techniques, with no possibility of reversing the transformation to obtain the NIR from the CSNS. This statistical identifier will be used to build a service to facilitate the matching of files within the NSS, whenever they are based on the NIR or on civil status data (surname, first name(s), sex, date and place of birth). A decree was issued to authorize the creation of the CSNS and to define the conditions for its use (only for public statistical purposes, only within the public statistical service, and only with “non-sensitive variables”).
The CSNS will provide the tool and procedures for generating an anonymized, non-significant identifier for the register of individuals and for making record linkage with other files easier.
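The idea of a one-way transformation from NIR to statistical code can be illustrated with a short sketch. The source does not describe the actual CSNS algorithm, so everything here is an assumption made for illustration: the use of a keyed hash (HMAC-SHA256), the function name `pseudonymize_nir`, the secret key and the made-up NIR-like value.

```python
import hashlib
import hmac


def pseudonymize_nir(nir: str, secret_key: bytes) -> str:
    """Derive a non-significant code from a NIR-like string.

    Hypothetical scheme: a keyed hash (HMAC-SHA256). Without the key,
    the code cannot be recomputed from a candidate NIR, and the hash
    itself cannot be inverted to recover the NIR.
    """
    # Normalise the input so that formatting differences (spaces, case)
    # do not produce different codes for the same identifier.
    normalized = nir.replace(" ", "").upper()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Illustrative values only: the key and the identifier are invented.
KEY = b"example-key-held-by-a-trusted-unit"
code_spaced = pseudonymize_nir("1 85 05 78 006 084 36", KEY)
code_compact = pseudonymize_nir("185057800608436", KEY)
```

Because the same key and the same normalised input always give the same code, two files pseudonymised in this way can be matched on the code without either of them carrying the NIR itself.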
Figure 1. Architecture of RESIL.
Beyond the technical aspects, there is a strong issue of acceptability and legitimacy in having an exhaustive register of persons that can be used for record linking. INSEE must be transparent about its use of external data to reassure the French population. The existence of such a register is not self-evident and it will be necessary not only to show its real added value but also to explain the measures implemented to guarantee the security and confidentiality of the identifying data it contains. The legal and ethical framework for the use of these data and their availability will have to be widely communicated in order to reassure the public. Each of the matches performed must be made public by the recipients of the service.
The RESIL program aims to build a sustainable and evolving system of interconnected registers of individuals, households and dwellings, updated from various administrative sources. This investment represents an indispensable first step and an essential lever for transforming the demographic and social information system towards a wider use of administrative (and even private) sources. The system of registers will serve as the backbone of this information system and will allow matching with other sources, statistical, administrative or private, insofar as they include a common identifier with the register, either directly or by means of prior identification.
RESIL is fed by external sources and the National Register of Identification of Natural Persons. It offers:
a reception and identification service for external sources;
a reference universe based on the sign of life method;
a linkage service allowing external data to be enriched through prior identification and data linkage.
The architecture of the solution
Figure 1 shows the architecture and construction process of the RESIL system of statistical registers. External administrative and even private data sources are received and given a unique identifier to be made available to the system of registers and also to the direct users of these sources.
External sources considered include:
The National Register of Identification of Natural Persons (RNIPP): image of the civil status registers held in each commune for people born in France; it also includes people born abroad if they needed to be registered for the management of their social rights. The RNIPP only contains information on civil status: name, first names, sex, date and place of birth, date and place of death, as well as the registration number in the register (NIR). The RNIPP contains more people than the population resident in France, for example people who have left France or people who died abroad for whom a death certificate has not been sent to the French registration authorities.
Tax files: the Tax Administration maintains a register of real estate properties and their owners. From 2023 onwards, the housing tax file, which made it possible to assign individuals to dwellings, will be replaced by an online portal to identify secondary and vacant dwellings, still subject to taxation. Owners will declare the identity of the persons residing in their property (or properties). With this portal, the occupancy status (main, secondary, or vacant) of each dwelling will be known, as well as the name of the leaseholder, but not the names of the other occupants. At individual level, the Tax Administration also provides the personal income tax file, which contains information on the income declared and the income tax paid by taxpayers.
Social files: social security beneficiaries, recipients of various social allowances, salary declarations by employers …
Other sources covering sub-populations, such as the residence permits file or the student information system.
From the information contained in the external sources, RESIL will therefore constitute a reference universe (list of functional dwellings and of individuals residing in France during the year) using the “signs of life” method and its residency index [4]. However, RESIL will not contain any statistical data: it will only be the custodian of statistical units. Other demographic and social information systems will be the custodians of the characteristics of these statistical units.
In addition, the register will make it possible to link the statistical units to each other (links between individuals, and between individuals and dwellings). This is the purpose of the household register, which will be able to identify several types of households (dwelling, fiscal, etc.). The identity and contact information will be isolated from the rest of the register for security reasons. As a result, the registers will not be huge databases: they will contain many rows (individuals) but only a few columns (variables).
In parallel with RESIL, INSEE is setting up an address register which will be useful for RESIL. It will provide an identifier for each address as well as x, y coordinates for georeferencing. RESIL will use the identification service offered by this geographical register and will simply store the address identifier.
Services offered
Reception and identification service for external sources
To build up its registers, RESIL needs to retrieve information on the identity of individuals (or dwellings) present in external sources and to assign them its identifier. In fact, only the identification data (surnames, first names, date and place of birth, identifiers …) are useful to RESIL. But as it must read the external sources to retrieve this information, it was decided, for reasons of efficiency, to extend this reception to all the data that can be useful for producing statistics (salaries, number of hours worked, surface area of the dwellings, etc.). For a given source, it is therefore a matter of:
identifying the units present in the external source (from the administrative identifier if it is included, or otherwise from the identity information). This step requires the implementation of a matching tool (see 3.1 below);
retrieving identity data and addresses and registering in RESIL the individuals identified in the source. These data will be necessary for the implementation of the signs-of-life method (see 3.2 below);
loading and making available in the INSEE IS (in dedicated databases outside RESIL) the raw data from external sources which may be useful for producing statistics. These data will be anonymised but identified by the RESIL identifier (which will allow subsequent matching).
Figure 2. Identification and linkage process.
The register will produce a reference population for a given date from the cross-referencing of several administrative sources. This reference population consists of a list of individuals identified by a statistical identifier shared among the various social statistics information systems (demography, employment, income …) and will serve as a benchmark to measure their level of coverage and correct any bias.
Linkage services
The phase of reception and identification of external sources (described in paragraph 2.1 above) ensures that each INSEE internal process using external data has the identifier of the individuals in the register. However, partners in the NSS sometimes need to enrich their data with information available in the INSEE IS. For example, the statistical department of the ministry of health may wish to enrich morbidity data with information related to the professional status of the patient. This will require linking at individual level the patients file with the employment information system, managed by INSEE.
The enrichment service will offer:
a linkage service to retrieve useful variables from the INSEE IS using the RESIL identifier;
an identification service if the input file does not contain the identifier used by RESIL. This service is described in detail in Fig. 2.
The first step is to try direct identification against the RNIPP using the identity data (surname, first names, date of birth …). If the person is found in the RNIPP, the national identifier (NIR) is retrieved and immediately transformed into the non-significant identifier CSNS. However, as the RNIPP contains less information than RESIL (it does not contain the address of the individuals or their relationships), a specific tool for RESIL (called “RESIL identification service”) will be developed using probabilistic matching methods to improve the matching efficiency. On this occasion, an interface for the visual checking of pairs (in orange in Fig. 2) will be implemented to measure the quality of the matching (estimation of false positives and false negatives) and to optimize the model parameters (definition of distances and thresholds) if necessary. These visual checks can be carried out asynchronously. Such visual controls are implemented by Statistics Canada [5].
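The two-step logic described above (direct identification first, then a fallback with pairs sent to visual checking) can be sketched as follows. This is only an illustration under stated assumptions: a simple string-similarity score from the Python standard library stands in for the probabilistic model, and the function names, the field names, the thresholds `accept` and `review` and the sample records are all hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Upper-case a name and strip accents, spaces and punctuation."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        c for c in decomposed.upper()
        if c.isalpha() and not unicodedata.combining(c)
    )


def identify(record, reference, accept=0.95, review=0.80):
    """Return (status, reference_id) for one input record.

    Step 1: exact agreement on normalised names and birth date.
    Step 2: best name-similarity score among candidates with the same
    birth date; mid-range scores are routed to visual review.
    """
    surname = normalize(record["surname"])
    first_name = normalize(record["first_name"])
    best_id, best_score = None, 0.0
    for ref_id, ref in reference.items():
        # Blocking on birth date: a wrong date prevents any match here.
        if ref["birth_date"] != record["birth_date"]:
            continue
        ref_surname = normalize(ref["surname"])
        ref_first = normalize(ref["first_name"])
        if (surname, first_name) == (ref_surname, ref_first):
            return "match", ref_id
        score = (
            SequenceMatcher(None, surname, ref_surname).ratio()
            + SequenceMatcher(None, first_name, ref_first).ratio()
        ) / 2
        if score > best_score:
            best_id, best_score = ref_id, score
    if best_score >= accept:
        return "match", best_id
    if best_score >= review:
        return "review", best_id
    return "unmatched", None


# Invented reference records, for illustration only.
REFERENCE = {
    "R1": {"surname": "Lefèvre", "first_name": "Élodie", "birth_date": "1985-05-12"},
    "R2": {"surname": "Martin", "first_name": "Paul", "birth_date": "1990-01-01"},
}
status, ref_id = identify(
    {"surname": "LEFEVRE", "first_name": "Elodie", "birth_date": "1985-05-12"},
    REFERENCE,
)
```

Pairs returned with the status "review" are the ones a visual control interface would present to an operator; the decisions recorded there could then be used to estimate error rates and to adjust the two thresholds.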
Key methodological issues and proofs of concept
A register based on administrative data requires checking the quality of its sources
The massive use of administrative sources for statistical purposes naturally begins with a phase of qualification of candidate sources. The quality of coverage of any administrative source will be assessed against the annual Census survey to have the most accurate diagnosis of the quality of coverage of the sources. This matching will involve a sample of about 9 million individuals annually.
Initial measurements of the quality of coverage of tax sources have already taken place, but mainly at municipal level. Since RESIL is based on the matching at individual level, more detailed examinations are required. The first results of the matching of individual tax data with census data show that overall, the results of these matches are of good quality, but they highlight difficulties that will have to be taken into account in order to achieve an optimal level of quality.
Thus, for dwellings, one of the main problems concerns the address, which may differ from one source to another depending on whether it is located at a different building entrance or not (this represents more than 60% of the observed differences).
It is also sometimes difficult to find an explanation for the small differences in dwellings between the census and the fiscal data for a building containing a large number of dwellings.
For individuals, the quality of the census input (handwriting, for example) can be an obstacle to obtaining an excellent level of matching. Thus, whereas matching rates of 96% are achieved for Internet respondents, the matching rate is only 89% for paper questionnaire respondents. The Internet response rate is increasing year on year (60% for individuals in 2020), which is rather reassuring from this perspective. The false positive rate is less than 0.2% and the false negative rate is estimated at 0.5%. These results were obtained using deterministic matching methods, and comparisons with probabilistic methods are underway.
Benchmark of matching methods
In the absence of a population register including the address of the individuals, the statistical register of individuals will be constituted by matching several administrative sources. As these sources do not have a unique identifier, the register will be created by matching the different sources based on the identity features they contain. It is also an opportunity to test different matching methods (deterministic versus probabilistic). On the one hand, for deterministic matching, INSEE already has usable tools, which were used for the tests mentioned above. On the other hand, INSEE has no experience of probabilistic matching. On this point, rather than developing specific tools, the idea is to reuse open-source tools implemented in other NSIs. Contacts have been made with Statistics Canada and Istat, the Italian Statistical Institute, to test their tools, respectively G-COUP [6] and RELAIS [7].
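To make the contrast between the two families of methods concrete, the sketch below compares an exact-agreement rule with a Fellegi-Sunter style weight, the classical basis of probabilistic record linkage. It does not reproduce G-COUP or RELAIS: the m- and u-probabilities, the thresholds and the field names are hypothetical values chosen for illustration, whereas in practice they would be estimated from the data.

```python
import math

# Hypothetical parameters for each compared field:
# m = P(field agrees | records belong to the same person)
# u = P(field agrees | records belong to different persons)
M_PROB = {"surname": 0.95, "first_name": 0.90, "birth_date": 0.98}
U_PROB = {"surname": 0.01, "first_name": 0.05, "birth_date": 0.001}


def deterministic_match(a, b):
    """Deterministic rule: every compared field must agree exactly."""
    return all(a[field] == b[field] for field in M_PROB)


def match_weight(a, b):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field in M_PROB:
        m, u = M_PROB[field], U_PROB[field]
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement weight (> 0)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (< 0)
    return total


def classify(weight, upper=12.0, lower=0.0):
    """Two thresholds split pairs into link, possible link and non-link."""
    if weight >= upper:
        return "link"
    if weight >= lower:
        return "possible"
    return "non-link"


# Invented records: the second one carries a typo in the surname.
rec_a = {"surname": "DURAND", "first_name": "CLAIRE", "birth_date": "1978-03-02"}
rec_b = {"surname": "DURANT", "first_name": "CLAIRE", "birth_date": "1978-03-02"}
decision = classify(match_weight(rec_a, rec_b))
```

With these illustrative parameters, the pair with the surname typo is rejected by the deterministic rule but kept as a possible link by the weighted rule, which is the kind of case where probabilistic methods are expected to recover additional matches.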
The sign-of-life method to define the reference population with the elaboration of a residence index
The presence of an individual in different administrative sources will be used to calculate a probability of presence on the French territory for a given period. This method was elaborated and used in Estonia [4].
A sign of life is a binary characteristic which depends on three arguments: a person, a source and a period (year). It has a value of 0 if the person in question was not active in a particular source during the period of observation, or a value of 1 if the person was active in that source at least once during the year. For example, a person may be given a sign of life if he or she received income from work during the year, or consulted a doctor at least once, or received at least one social assistance payment, or declared income to the tax authorities. It is of course useful to take into account the person’s situation in the previous year.
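Therefore, the residence index of a given person for a given year is computed from that person’s signs of life in the various sources during the year, taking into account the person’s situation in the previous year [4]. The population of reference of the year is then made up of the persons whose residence index reaches a given threshold.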
The residence index may be calculated multiple times for a given reference year. Indeed, for a given year, not all sources will be available at the same time and the determination of several reference populations is envisaged (e.g., provisional, semi-definitive and definitive reference population). The threshold for the definition of residents from the residence index may vary according to the type of population. The residency index will therefore be dated.
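As an illustration of how signs of life, the previous year's situation and a threshold could combine, here is a minimal sketch. The functional form is an assumption made for this example (a weighted sum of the year's signs of life added to a carried-over share of the previous index, bounded to the interval from 0 to 1); the source weights, the rates `stability` and `signs_rate`, the threshold and the sample persons are all hypothetical and are not taken from the RESIL specification.

```python
# Hypothetical weights reflecting how strongly each source signals residence.
SOURCE_WEIGHTS = {"tax": 1.0, "payroll": 1.0, "health": 0.5}


def signs_of_life_score(signs, weights=SOURCE_WEIGHTS):
    """Weighted sum of binary signs of life (0 or 1) for one person-year."""
    return sum(weights[source] * value for source, value in signs.items())


def residence_index(previous_index, signs, stability=0.8, signs_rate=0.2):
    """Carry over part of last year's index and add this year's evidence.

    The result is bounded to [0, 1] so that it can be compared with a
    threshold in the same way every year.
    """
    raw = stability * previous_index + signs_rate * signs_of_life_score(signs)
    return min(1.0, max(0.0, raw))


def reference_population(indices, threshold=0.7):
    """Persons whose dated residence index reaches the chosen threshold."""
    return {person for person, index in indices.items() if index >= threshold}


# Invented person-years: (index of the previous year, signs of life this year).
people = {
    "A": (1.0, {"tax": 0, "payroll": 0, "health": 0}),  # resident, silent this year
    "B": (0.0, {"tax": 1, "payroll": 1, "health": 0}),  # newly observed
    "C": (0.5, {"tax": 1, "payroll": 1, "health": 1}),  # active in all sources
}
indices = {pid: residence_index(prev, signs) for pid, (prev, signs) in people.items()}
residents = reference_population(indices)
```

In this toy setting, a previously resident person who shows no sign of life keeps a high index for one year and only drops out if the silence persists, while a newly observed person needs more than one year of activity to enter the reference population. Recomputing the index as sources arrive, or changing the threshold, would yield the provisional and definitive populations mentioned above.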
The dual system estimation based on the rolling census
The production of registers must be accompanied by a measurement of their quality, and in particular their coverage, to overcome the current situation where discrepancies are observed between tax or social security sources and the census without being able to explain them or attribute them to one or other of these sources.
Furthermore, in the case of building a statistical register from many administrative sources, the issue of erroneous inclusions requires special attention.
Population estimation by comparing administrative sources or statistical registers and census data using the capture-recapture methodology and the DSE model is an area that is currently the subject of a great deal of methodological work.
Following the example of the CSO, the Irish Statistical Institute, the idea is to draw inspiration from the work of Zhang and Dunne [8, 9] who have developed a model known as “Trimmed Dual System Estimation (TDSE)” to take into account the case of erroneous inclusions but also of under-coverage in situations of comparison between administrative registers and the census.
In the case of France, the existence of annual census-type collections is a very important asset for measuring the quality of a register system under construction as it will enable measuring the degree of coverage of the register and of the census every year, thus providing a lever for improving the quality of both the register and the census.
Thus, RESIL will provide users not only with a reference population but also with indications on its quality. This information will be particularly valuable for defining the calibration margins of surveys and measuring the coverage of a new source.
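The basic dual system estimator that underlies this approach can be sketched in a few lines. This is the classical capture-recapture (Lincoln-Petersen) form, not the trimmed variant of Zhang and Dunne, and it assumes that the register and the census capture individuals independently, that matching is error-free and that neither list contains erroneous inclusions. The function names and the counts are invented for illustration.

```python
def dual_system_estimate(n_register, n_census, n_matched):
    """Classical capture-recapture estimate of the true population size."""
    if n_matched <= 0:
        raise ValueError("at least one matched individual is required")
    return n_register * n_census / n_matched


def coverage_rates(n_register, n_census, n_matched):
    """Estimated population and coverage rate of each list."""
    n_hat = dual_system_estimate(n_register, n_census, n_matched)
    return {
        "population": n_hat,
        "register_coverage": n_register / n_hat,  # equals n_matched / n_census
        "census_coverage": n_census / n_hat,      # equals n_matched / n_register
    }


# Invented counts for one census area: 950 persons in the register,
# 900 enumerated by the census, 855 found in both.
rates = coverage_rates(n_register=950, n_census=900, n_matched=855)
```

With these counts the estimated population is 1,000, so the register covers about 95% of it and the census about 90%. Erroneous inclusions in the register would bias this simple estimator upwards, which is the motivation for the trimmed approach cited above.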
Specific populations and territories
The quality of coverage of administrative sources is not the same for all territories and all types of population. In the overseas French territories, the quality of administrative sources is traditionally lower than in metropolitan France, and this quality even differs from one department to another. Therefore, specific procedures, for example additional surveys, will be implemented in these territories.
Similarly, the homeless population is difficult to capture from traditional administrative sources, where they are less present than other populations. For example, only 40% of the French-speaking homeless receive social allowances. It will therefore probably be necessary to adapt the sign-of-life model for these individuals. In addition, even when present in administrative sources, homeless persons rarely have a registered address: the last survey on this specific sub-population showed that only 60% of the homeless had a registered address.
Finally, if we wish to take the homeless population into account in the calibration margins that RESIL will produce, it will be necessary to calculate adjustment coefficients to reconstitute aggregates in line with the population estimates known at the French level and coming from specific surveys, because individual consideration will be complicated for individuals outside the “traditional” sources.
The methodological issues related to taking these specific populations into account in a statistical population register are currently little addressed in the literature, but this topic will be the subject of additional studies for the implementation of RESIL.
Conclusion
The development of statistical registers of individuals, households, and dwellings in countries without an administrative population register, such as France, is not an easy task. Indeed, the absence of an administrative register that accurately locates all individuals, unlike what may exist in the Nordic countries or even in Italy, means that there is no reference universe against which to compare the statistical registers.
To overcome this difficulty, the objective is to make the best use of the opportunities offered by the rolling census and the annual availability of data on about one fifth of the territory, in order to measure annually the quality of coverage of the registers but also to improve the quality of the census. In addition, INSEE will be able to benefit from the many international experiments that have been conducted on the subject, and in particular to draw inspiration from similar approaches that have taken place in Canada, Ireland, the UK, Italy, Spain, Australia and New Zealand. Finally, beyond the purely methodological aspects, there are also strong challenges in terms of communication and transparency about the processing operations implemented, in order to reassure the public about the lawfulness of such processing and to provide guarantees regarding respect for the confidentiality of individual data.
