Abstract
The use of registers has been increasingly popular in the field of population census because of its advantages over the traditional census. While the traditional census requires a large amount of fieldwork and data collection, the registered-based census can rely on pre-existing administrative data. As a result, the register-based census can save both time and budget. Thailand explored a use of the register-based census in 2020. In this paper, the authors layout the methodology in an aspect of data preparation and integration as well as analyze data quality of the register-based census compared with the traditional census in Chachoengsao province, Thailand. In addition, we compared conceptual frameworks that are commonly used for a register-based census in several countries and the number of databases (a recent single database VS multiple databases) used to construct the register-based census. We found that using a conceptual framework that counts the number of populations based on the main census variables on a single recent database is better than using a framework that counts population who appears on many registers in term of overcoverage and data distribution regarding to sex. This provides the evidence of using one recent and complete database is sufficient for conducting the register-based census. The authors end up with recommendations for conducing the register-based census.
Introduction
Population census has been conducted for many years with a goal of enumerating number of population and examining socio-demographic structure [5]. In Thailand, traditional census, by mainly using a face-to-face interview, has been conducted every 10 years from 1960 to 2010 [8]. Data from the traditional census has been used to design country development plan and public policies, to estimate the number of populations, and to be used as a sampling frame for other surveys [8]. In the past two decades, traditional census has encountered several problems. Respondents were not willing to participate in the traditional census [5, 7]. This leads to a decrease in response rate and low data quality [8]. Additionally, it is costly to conduct a traditional census [3, 7, 8]. As such, instead of a traditional census, several countries have conducted register-based census in which administrative data from registers have been used.
The pyramid of populations in Chachoengsao province. (Left) the pyramid from the traditional census. (Right) the pyramid from the register-based census.
To transmit from traditional census to register-based census, some countries are currently doing a combined census using both the administrative data and the traditional census. For example, in 2021, Poland conducted the census by using a mixed model which incorporates administrative data and data collected directly from respondents [5, 12]. Data from registers and data from respondents were compared to assess the data quality from registers. A list for 2021 Poland census is built based on list of persons with assigned address and list of the address and housing.
Currently, several countries especially in European countries altered the way to collect the census data from the traditional census to the register-based census [5, 8]. For example, in 2011, Austria implemented a completely register-based census in which registers consist of a population census, a census of workplaces, building census, and housing census [7, 9]. In the same year, Sweden was conducted a register-based census using business register, population register, real property register, and dwellings register [1].
Five conceptual frameworks that many countries have used to develop administrative data from registers are as follows:
Framework 1. Count population who can identify their current address. If a person has been identified on the current address in the nation, he/she will be included on the register-based census.
Framework 2. Count population who has Personal Identification number (PID) on registers. If a person has a PID on every register, he/she will be included on the register-based census.
Framework 3. Count population who has data on main census variables shown on registers. If a person has data on every main census variable (such as PID, full name, sex) on registers, he/she will be included on the register-based census. Countries must decide on census variables. In this study, we have used PID, First name, Last name, year of birth, prefix of name, and sex as the main census variables.
Framework 4. Count population who has at least some data on main census variables shown on register. If a person has at least some data on census variables on registers, he/she will be included on the register-based census. In this study, if a person has at least data for PID, First name, Last name, year of birth, prefix of name, or sex, he/she will be included on the register-based census.
Framework 5. Count population who appears on many registers. If a person appears on many registers, he/she will be included on the register-based census. In this study, we specified the number of registers to be at least 2.
Note. Some countries have been used more than one framework mentioned above.
In Thailand, due to the COVID-19 pandemic in 2020, National Statistic Organization (NSO) collected the traditional population census data only from 4 districts namely Panat Nikom, Chonburi; Sriracha, Chonburi; Rayong; and Chachoengsao. This study will focus only on Chachoengsao province which consist of 419,492 people in the population (214,044 females and 205,448 males) in the register-based census. The pyramid of population regarding age and sex shows in Fig. 1. Note that, due to the COVID-19 outbreak, only partial areas of Chachoengsao were completely surveyed in the 2019 traditional census. Hence, we selected only areas that had been surveyed in traditional for register-based census. This made the total of population in the census areas was 182,614 populations.
This paper layouts methods for data preparation and integration for Register-based census. In addition, the authors evaluate the data quality of register-based census developed under the 5 conceptual frameworks and the number of databases (a recent database VS multiple databases) by comparing data from respondents in Chachoengsao with administrative data from registers. Four measures of data quality include coverage rate, overcoverage rate, Chi-square histogram distance of sex and age distribution between register-based census and traditional census have been used. In our conclusion and discussion, we address methodological and practical issues found from this study along with suggestion.
Method for data preparation and integration
Data pipeline for R-census
Before analyzing register databases, the first step is to standardize all available databases that came from various sources and have different formats to be the same format without any unusable values. Then, all databases are integrated together to be a single database that is ready for analyzing in the register-based census. To accomplish this goal, the raw databases have to be processed into the data pipeline.
A data pipeline is a process that takes raw databases as inputs and transforms them into output database(s) that are ready to be used for a specific purpose [6]. For the register-based census, the data pipeline consists of three important steps: 1) data preparation and cleansing, 2) de-identification for anonymizing sensitive data, and 3) data integration. These three steps are the main parts of preparing data for register-based census in Austria [9], Netherlands [2], and Poland [5].
Data preparation and cleansing
Data preparation
Given raw databases from various sources, the data preparation step has a main purpose to prepare these databases that might have different formats into a manageable format.
The first step is to make all databases have the correct format. For example, if a database has non-roman characters, then the step that is required here is to convert the database into UTF-8 encoding. For a non-binary files, they have to be converted into either text files (e.g. the Comma-separated values (csv) file format) or database file format. Some files also contain special characters that are in the wrong places and need to be removed before proceeding the next steps (e.g. newline character
The second step is to make all databases have the correct records and columns. Some files might contain several pieces of information within a single column (e.g. title, name, and surname is in a single column). Hence, this single column must be split into multiple columns. Another common case is a file that has a record as a single string and has a “fixed-length” bytes for specific information (e.g. the first 14 bytes are reserved for ID and the next 20 bytes are for a name). This kind of file must be converted into a file that has multiple columns per record. Each column contains only a single piece of information (e.g. ID, name, surname). After all databases are ready, they will be processed into the data cleansing step.
Data cleansing
The steps of data preparation and cleansing.
In this step, the main goal is to make all values that share similar meaning have similar formats, and to remove all redundancy and unusable values within the databases so that these databases can be integrated into a single one without any problem.
An extract, transform, load (ETL) tool is the crucial one in this step. Briefly, the ETL tool is used to load data from different sources and transform them into the same format. For instance, in Fig. 2, the databases are in different file formats. The ETL tool extracts these files into the process for transforming them to be formatted databases (e.g. .parquet or .csv files) and performing any actions to remove redundancy and unusable values before loading them into the next process, which is the de-identifying process.
For standardizing values with a similar meaning, the first step is to interpret all data dictionaries to find values that share similar meaning. For example, in a sex column, each database might represent as ‘M’ for male and ‘F’ for female, while some other databases might have ‘1’ for male and ‘2’ for female. If the ‘M’ for male and ‘F’ for female have been chosen as a standard, then the ETL tool must transform all sex columns to have only ‘M’ and ‘F’ values.
For redundancy, for each single database, if there are multiple rows with the similar primary key (e.g. citizen ID), then there are many actions that can be taken. If the redundant rows share similar name, surname, and other important values, we might keep the most updated one and remove others. If we cannot identify which row is more reliable than others, then we have to remove all of them. Otherwise, these redundancy rows can cause the overcoverage issue when joining a database to others since some people from redundant rows might not exist.
For unusable values, some databases were collected by typing, which can cause some typos issues. Hence, all columns might be transforming so that only valid values exist in the columns. For example, in the sex column, suppose only ‘M’ and ‘F’ are valid values, the ETL tools will remove anything else and replace those invalid values with N/A.
In this work, the Pentaho Data Integration (PDI) was deployed as the main ETL tool. The PDI is an open source ETL tool with high efficiency in resource utilization compared against several open source ETL tools [11].
The next step after data preparation and cleansing is the de-identification step. De-identification is a process that has the main purpose to protect sensitive information (e.g. citizen ID, name, medical records) while maintaining uniqueness of values so that these encrypted values can be used as keys for joining databases. De-identification uses a Cryptographic Hash Function (CHF) that can encrypt information such that it is almost impossible to decode the encrypted information.
In some countries, as required by laws, sensitive information of citizens such as personal data must be anonymized before being used for the census. Hence, the de-identification step is a crucial step for register-based census in some EU countries, such as Austria [9], Netherlands [2], etc.
A) The steps of de-identification of a citizen database, which contains sensitive information. B) The steps of retrieving encrypted records back require a real ID from the data owner and the salt.
For the process of de-identification, there are two important steps: 1) encrypting information and 2) retrieving encrypted records with known keys. For the encrypting information step, in Fig. 3A), the input is a citizen database that contains sensitive information (citizen IDs, name, surname, etc.). Suppose only citizen IDs will be encrypted, in the first step, the process combines citizen IDs with the salt, which is a secret key. Then, the IDs with salt are encrypted by some de-identification algorithm. Finally, all original IDs are replaced with the encrypted IDs in the output database.
For retrieving encrypted records, in Fig. 3B), the database owner must have two things to retrieve the encrypted records: the salt (secret key) and the original ID before encryption. This typically can be done when a citizen who is the owner of ID and the database owner who possesses the salt both agree to retrieve the record. After having both salt and original ID, the next step is to combine salt and original ID with the same de-identification algorithm to get the encrypted ID. Then, by using the encrypted ID to search in the encrypted database, the target record can be retrieved.
The process of encryption using a Cryptographic Hash Function (CHF).
In this work, we deploy the “SHA-256” method, which is a Cryptographic Hash Function (CHF) to perform de-identification. The CHF were invented and mainly used in the field of Cryptography to encrypt the original message to be encrypted one that is unable to decode back to the original value (Fig. 4). The SHA-256 encrypts information using Secure Hash Algorithm 2 (SHA-2) with 256 bits of hash values. The SHA-2 was invented by the United States National Security Agency (NSA) in 2001 with the US patent 6829355 [10].
For the SHA-256, it has 256 bits of hash values, which implies that there are 2
Data integration process
After all databases have been cleaned and de-identified, the next step before the analysis of register-based census is to integrate all databases together. In Fig. 5, the process of data integration consists of three important steps of setting: 1) re-labeling table fields, 2) setting joining conditions, and 3) setting priority of data replacement.
In the step of re-labeling fields, since databases from different sources typically have the different field names that share the same meaning, we need to normalize them to be in the same standard. For example, some databases have the field ‘sex’ to represent the biological sex of citizen while other have ‘gender’ to represent the same thing. In this step, the main focus is to make the match between fields that represent the same meaning across databases. This can be done by analyzing the data dictionary of databases.
Data integration steps.
For the setting of joining conditions, the main goal is to define the primary key (or combined key) for joining all databases together. In the register-based census literature, the common field that is used as a primary key is the field of citizen ID because the IDs are unique among citizens, and they are typically used in registers as a primary key and as identifiers when retrieving citizen records. However, in some countries that have no citizen IDs, other fields might be used as keys.
Lastly, setting priority of data replacement among databases is the important step in data integration. When the same person has multiple records across databases, the issue of which values from which database should be used in the integrated database can be solved by using the priority list of data replacement. For example, suppose person A has a sex value as ‘male’ in four databases and ‘female’ in one database. The priority list is deployed to decide whether A is male or female based on the setting.
After all settings have been set, the databases will be integrated together to be a single database with some additional log data and ready to be used in the census analysis.
In the next subsection, the details of the three steps above are explained.
Primary key
The primary key or combined key is used for joining tables from multiple databases. The following properties are necessary for considering any field(s) to be the primary key [5].
There is only one unique value for each person or object (e.g. institute, building, company) in the field(s). No two objects share the same value. The same value appears in all databases to refer to the same object. The value remains attached to the object forever; even if the object is obsoleted (e.g. a person is dead), there are no reusable values in the databases. The value can be validated with any digital check or reference register whether the value is correct. In the aspect of citizen IDs, the validation goal is to check whether the owner of the ID exists as a citizen in the country.
The citizen ID typically passes all criteria above, hence, it can be used as a primary key to join databases. In the case when citizen IDs are not available, the typical combined key that is used as a primary key is the combination of name, surname, date of birth, sex, and address fields.
The mandatory property beside above properties for joining databases is that all records must have values, otherwise, we have to remove records that have no value for a primary key. In the case of having a primary key as a citizen ID field, we have to remove all records in any databases that have no citizen IDs. In the case of having the combination of name, surname, date of birth, sex, and address fields as a parimary key, if any values in these fields is not available in any records, these record must also be removed. Choosing the fields to be a part of combined key can cause a problem if some fields (e.g. sex, date of birth) have too much empty values because there are many records that must be removed. Hence, if the citizen IDs or any identifier exists for all members in the population, then it should be used as a primary key since it can satisfy all properties above and, theoretically, there are no records that are needed to be removed.
For the foreigners or temporary visitors that are considered to be members of census population, it is challenging to count them since they might not have a citizen ID in some country (e.g. Thailand), and the universal ID might not exist to identify them. The field of visitor passport IDs might be used as a primary key in this case.
Joining conditions
After having primary key(s), the next step is to set the joining conditions of databases. In the case of one primary key, all databases can be simply joined with the key. However, if there are more than one primary key, the order of joining must be considered. For example, suppose there are two primary keys: the citizen-ID field, and the combined key of name, surname, and address, then we need a global primary key for integrating these primary keys together. The reason is that some records might have no value in the first primary key but have values in other primary keys. Hence, to satisfy the condition that a primary key must have value in all records, a global primary key might be created for each record as a new one.
If the primary key that has the highest priority is the citizen-ID field and the second priority key is the combined key, then there are three steps as follows. First, all databases must be joined with the citizen IDs and all records that cannot be joined as the first step (e.g. records without citizen IDs) must be moved to the next step. Then, new values in the global primary key are created for these records linking with the citizen IDs: each unique citizen ID has a unique global primary key. For the second step, the records that cannot be joined in the first step are joined together using the combined key. Then, a unique value of the global primary key is created and linked with each unique value of the combined key (i.e. a unique combination of name, surname, and address). Lastly, in the third step, all records in the first and second steps are merged together to be a single table.
In this work, we deploy the citizen ID field as a primary key.
Priority order of data replacement
When there are more than one records of the same person that have values in a specific field (e.g. sex), we have to decide that which value is the valid one. According to the work in [5], the following properties can be considered when selecting value as a representative value in the integrated database.
Timeline: the recent data comes first in the data replacement. The older data must be dropped. Relevance of databases: choosing values from reliable sources [9]. Some registers are the main register of a country that are maintaining citizen information (e.g. the register from the ministry of interior), while other databases are from less important ministries that do not have a direct duty to maintain valid citizen information. For example, if the record has a conflict of sex values between the database of the ministry of interior and databases from other ministries, then we choose the value from the ministry of interior register as the valid value. Coverage of information: choosing databases that are rich in the number of records and fields. In this method, large databases are considered to be more reliable than the smaller one. Quality of databases: choosing values from high quality databases first. The quality of databases can be measured in several ways. Below are some examples. The completeness of fields: counting usable values in each field compared to the total number of records. Timeline: how recent the data is. Degree of integration: how many records have valid IDs w.r.t. some reference register(s). Utility: how many fields are useful for the census analysis.
In this work, we deploy the timeline as the data replacement method since there is no other information regarding relevance of databases and all registers have almost the same size and quality.
To evaluate the 5 conceptual frameworks used to develop register-based census, the authors examine the data quality of register-based census by comparing traditional census collected from respondents and data from register-based census using 5 frameworks based on 4 measures as follows:
Coverage rate (CR) is the proportion of people who are in traditional census are actually in registers. This can be calculated from
Overcoverage rate (OR) is the proportion of people who should not be in registers; however, they are in. This can be calculated from
Chi-square histogram distance measured a similarity of sex distribution between register-based census and traditional census and 4. Chi-square histogram distance measured a similarity of age distribution between register-based census and traditional census Chi-square histogram distance is a measure used to examine the distance of data distribution between two sources, i.e. register-based census and traditional census. The chi-square histogram distance can be calculated by taking a square root of
where of population in register-based census of population in traditional census
An ideal register-based census should be indicated by having high coverage rate, low overcoverage rate, and small value of chi-square histogram distance for sex and age distribution.
Results
This paper compares the Register-based Census (R-census) and the Traditional census (T-census) by focusing on data coverage and the similarities in basic demographic characteristics including sex and age of Thai people in Chachoengsao province. To compare the use of the single recent database and the multiple databases to create R-census, two types of databases are used to be the registered based census which are the 2019 Basic Minimum Need (BMN) database (the 2019 R-census) and the 2017–2019 Basic Minimum Need (BMN) database (the 2017–2019 R-census).
The 2019 R-census and the 2017–2019 R-census are used to enumerate the numbers of members, i.e. Thai population, under 5 conceptual frameworks:
Framework 1. Count population who can identify their current address
Framework 2. Count population who has Personal Identification number (PID) on registers
Framework 3. Count population who has data on main census variables shown on registers
Framework 4. Count population who has at least some data on main census variables shown on register
Framework 5. Count population who appears on many registers (only for the 2017–2019 R-census)
Data obtained from each framework from the 2019 R-census and the 2017–2019 R-census were compared with data obtained from the 2019 population census database of the National Statistical Office (T-census) in Chachoengsao province, Thailand. The results of the comparison are presented in the following measures:
Coverage rate Overcoverage rate Chi-square histogram distance measured a similarity of sex distribution (CHD-sex) Chi-square histogram distance measured a similarity of age distribution (CHD-age)
Results from analyzing only one recent registered database- the 2019 BMN database (the 2019 R-census)
Coverage and overcoverage between the 2019 R-census and data obtained from the traditional population census database of the National Statistical Office in 2019 (T-census) as well as the number of members in Chachoengsao province across framework 1–4 are reported in Tables 1 and 2. As shown in Table 1, number of members across 4 frameworks in both coverage aspect (number of members in both the 2019 R-census and the T-census,
Consequently, the coverage rate and overcoverage rate are identical across 4 frameworks. The coverage rate is 0. 689, reflecting the use of data from the 2019 R-census to represent 68.5% of the T-census data. The overcoverage rate is 0.656. The use of the 2019 R-census data to represent T-census data results in bias due to including data from the population in R-census but not in T-census, which accounts for 65.6% of the population in R-census.
Additionally, the number of Thai population as well as with respect to sex (Table 1) and age (Table 2) are the identical in framework 1–4. This reflects the completion in data collection on every main census variable. This is confirmed by interviewing with the data controller who manage the 2019 BMN database. The results show the flexibility in selecting any framework (Framework 1 to Framework 4) for creating the 2019 R-census database.
Population size, coverage, overcoverage, and number of members under 4 frameworks in the 2019 R-census
Population size, coverage, overcoverage, and number of members under 4 frameworks in the 2019 R-census
Note. Conceptual framework 5 will be used when the R-census is constructed from multiple registers. Since the single recent register has been used, the conceptual framework 5 cannot be used.
Number of members across age under 4 frameworks in the 2019 R-census and the 2019 T-census
Note. Conceptual framework 5 will be used when the R-census is constructed from multiple registers. Since the single recent register has been used, the conceptual framework 5 cannot be used.
The difference in the population proportion with respect to sex and age between the 2019 R-census and the 2019 T-census is examined by using Chi-Square Test for Homogeneity and Chi-Square Histogram distance (CHD) and the result was shown in Tables 3 and 4 respectively. According to sex, result from the Chi-Square Test for Homogeneity shows a statistically significant difference in the population proportion between the 2019 R-census and the 2019 T-census (
Alternatively, considering the median, the difference between the proportions of male and female population from the 2019 R-census and the T-census was found to be 0.79%.
From Table 4, the Chi-Square Test for Homogeneity reveals that the proportion of the population categorized by age between the 2019 R-census and the T-census was statistically significantly different at the 0.05 level of significance (
Chi-square Test for Homogeneity with respect to sex comparing between the 2019 R-census and the T-census and Chi-square histogram distance
Chi-square Test for Homogeneity with respect to age comparing between the 2019 R-census and the T-census and Chi-square histogram distance
Population size, coverage, overcoverage, and number of members under 4 frameworks in the 2017–2019 R-census
Number of members across age under 4 frameworks in the 2017–2019 R-census
Number of members across age under 4 frameworks in the 2017–2019 R-census
Chi-square test for homogeneity with respect to sex comparing between the 2019–2019 R-census and the T-census and Chi-square histogram distance
Chi-square test for homogeneity with respect to age comparing between the 2019–2019 R-census and the T-census and Chi-square histogram distance
Coverage rate, overcoverage rate, chi-square histogram distance (sex), and chi-square histogram distance (age) and ranks
Note. The number in the parathesis is the rank where 1 indicate the best R-census based on a particular measure.
Tables 5 and 6 show the coverage and overcoverage between the 2017–2019 R-census and data obtained from the 2019 traditional population census database of the National Statistical Office (the T-census) as well as the number of members in Chachoengsao province across framework 1–5. Coverage rate and overcoverage rate were identical from conceptual framework 1 to conceptual framework 4, which was 0.721 and 0.686 respectively, and for conceptual framework 5, the coverage rate and overcoverage rate were 0.682 and 0.683 respectively. Since there is no item nonresponse on the main census variables in the 2017–2019 R-census, the number of member as well as coverage and overcoverage are identical among the conceptual framework 1–4. Therefore, results in this section will be discussed separately into two groups which are conceptual frameworks 1–4 and conceptual framework 5.
The coverage rate in the conceptual frameworks 1–4 and the conceptual framework 5 was 72.1% and 68.2% respectively. This reflects the use of data from the 2017–2019 R-census to represent the T-census data from the conceptual frameworks 1–4 was better than that of the conceptual framework 5. In addition, the overcoverage rate in the conceptual frameworks 1–4 and the conceptual framework 5 was 68.6% and 68.3% respectively. This indicates that the use of data from R-census to represent T-census data leds to a bias due to overcounting the population in the 2017–2019 R-census that was not in the T-census; however, the overcoverage rate do not significantly differ between the conceptual frameworks 1–4 and the conceptual framework 5.
The difference in the population proportion between the 2017–2019 R-census and the T-census with respect to sex and age groups were tested by using the Chi-Square Test for Homogeneity and the Chi-Square Histogram distance (CHD) and the results are in Tables 7 and 8. As shown in the result from the Chi-Square Test for Homogeneity in Table 7, the population proportions are significantly different between the 2017–2019 R-census and the T-census in all of 5 conceptual frameworks (
As shown in Table 8, the Chi-Square Test for Homogeneity indicates the population proportion classified by age differ significantly between the 2017–2019 R-census and T-census in all frameworks (
The above results are summarized and compared by ranking the measures in which the 1
From Table 9, the smallest average rank (1.5) of measures was in framework 1–4 from the 2019 R-census. Thus, the conceptual framework 1–4 from the 2019 R-census was the framework that provided the most comprehensive measure that aligns with the 2019 traditional census from the National Statistical Office. This framework provides the lowest overcoverage rate which reflects the smallest amount of people outside the coverage area (Chachoengsao province) participating in the census. However, the overcoverage rate of 0.656 or 65.6% is still considered high. Additionally, the conceptual framework 1–4 in the 2019 R-census yields the lowest CHD (sex) value, indicating the smallest difference in the population proportion classified by sex between the 2019 R-census and the 2019 traditional census, with a median of the difference in sex proportion of only 0.79%.
Conclusion and discussion
Comparing the framework 1 to framework 5 on using a single and multiple databases, we found the identical results between framework 1–4 which is a result from the completion of data with respect to the main census variables including PID, First name, Last name, year of birth, prefix of name, and sex. Comparing between framework 1–4 and framework 5 on using a single and multiple databases, using framework 1–4 results in a better coverage rate and better data distribution regarding to sex. In this study, we examine a quality of the framework as individual framework. As mentioned earlier, many countries use several frameworks to conduct the R-census. Future research may incorporate many frameworks and test the quality of those combined framework.
Regarding to the number of databases, it is surprisingly that using only one recent database provide better data quality (lower overcoverage rate and better data distribution regarding to sex) than using multiple databases. This might be because the data especially the main census variables is quite complete, i.e. no item missing data. Moreover, many people might migrate to other areas outside the census area before the census date. This might cause the use of multiple databases from previous years to have higher overcoverage. This study used only Basic Minimum Need (BMN) database as the register based. Future research may consider using more than one source of database as well as to use databases that are reliable to a specific variable. For example, variables related to health should come from a database provided by the Ministry of Public Health. However, we must certain that those databases contain main fields used to merge every database. Moreover, variables used in each source must have the identical meaning as the census variable. For example, the word “household” has been defined differently across government agencies in Thailand. As such, we must be careful to use the variable as well-defined as census variable.
Resources including people and infrastructure as well as regulation should be prepared to support the register-based census. Conducting register-based census requires people who have database and computer skills to do the data preparation and integration. Infrastructure including multiple servers for distributed storages and processing of big data should be well prepared to handle data streaming from multiple sources. Moreover, required by laws, all sensitive data must be protected and can be accessed by only legitimate parties. Hence, infrastructure must have high-security access control and data protection mechanism. Having these resources ready will facilitate the country to do a register-based census.
A regulation that prevents using database from each governmental agency should be modified so that it can facilitate data sharing to conduct the register-based census. In addition, the government’s data security and privacy policies are the crucial factors in ensuring public trust. Many register-based censuses use administrative data that are collect for specific purposes, such as healthcare, and may not inform a purpose of census enumeration to the people whom they are collecting data from. The government agency should definitely inform the purpose of using their data in the register-based census and ensure their data privacy and data security.
Footnotes
Acknowledgments
We would like to thank the Thai National Statistical Office and National Electronics and Computer Technology Center who provided the authors data and facility to conduct the research.
