Abstract
Never has there been a more exciting time to be an official statistician. The data revolution is responding to the demands of the COVID-19 pandemic and a complex sustainable development agenda to improve how data is produced and used, to close data gaps to prevent discrimination, to build capacity and data literacy, to modernize data collection systems and to liberate data to promote transparency and accountability. But can all data be liberated in the production and communication of official statistics? This paper explores the UN Fundamental Principles of Official Statistics in the context of eight new and big data sources. The paper concludes that each data source can be used for the production of official statistics in adherence with the Fundamental Principles, and argues that these data sources should be used if National Statistical Systems are to adhere to the first Fundamental Principle of compiling and making available official statistics that honor citizens’ entitlement to public information.
Introduction
The need for a data revolution was first expressed by a High-Level Panel on the Post-2015 Development Agenda, appointed by the then UN Secretary-General Ban Ki-moon to advise on the global development agenda after the 2015 Millennium Development Goals (MDGs). The Panel report was quite brief and open to much interpretation, but noted that a “true data revolution would draw on existing and new sources of data to fully integrate statistics into decision making, promote open access to, and use of, data and ensure increased support to statistical systems” [1].
The data revolution has triggered a rapidly expanding data industry. Data providers and producers of statistics are proliferating. Policy makers, businesses and citizens want not only to be informed; they want to be informed quickly and in an easily accessible way. These users are, however, frequently turning to statistics other than official statistics, unaware of the quality limitations of what they are receiving.
The data revolution is also challenging the traditional producers of official statistics, National Statistical Offices (NSOs), to redefine themselves and their statistical production systems. Although they have progressively acquired and adopted new technologies, methods and standards, NSOs cannot remain solely providers of good-quality official statistics. Other producers of official statistics are emerging elsewhere in a Government’s National Statistical System, for example Agricultural Ministries, Fishing Ministries and Justice Ministries. As illustrated by Anil Arora [2], the role of NSOs, beyond the coordinating role that many hold by law (de lege), is therefore evolving: they are becoming producers as well as story-tellers, data stewards, data integrators and, finally, quality assessors and providers of standards and secure data architecture, especially in the era of fake news and post-truths.
Advancing globalization and digitization are also forcing official statisticians to cover new areas, change existing surveys and explore the potential hidden in new and big data sources. Moreover, the data revolution is affecting social attitudes to data sharing, privacy and confidentiality.
Therefore, the existence of reference principles for official statistics, the UN Fundamental Principles of Official Statistics [3] (UNFPOS), which describe the universal key values of official statistics, is of utmost importance, and the data revolution raises the need to reconsider their meaning.
This paper maps the UN Fundamental Principles against new and big data sources and concludes that the features of the Principles remain unchanged regardless of the data sources used, provided that producers of official statistics, wherever they are situated in a National Statistical System, adhere to them and adapt policies and frameworks to incorporate them. Following a brief definition of new and big data sources¹, chapter three discusses their use in official statistics through three key steps in the statistical production cycle: data acquisition, data processing and statistical dissemination. Chapter four illustrates the application of the UN Fundamental Principles to eight new and big data sources for official statistics, before chapter five concludes.
New and big data sources
Although no uniform definition has yet been agreed upon for big data, it is broadly accepted that the notion refers to data sets of increasing volume, velocity and variety: the three Vs [4]. The paper “An Assessment of big data for official statistics in the Caribbean” adds a fourth V to describe big data, namely their “veracity”, that is, conformity to facts [5]. Another feature of these data sources is their lack of consistent structure, meaning the lack of a pre-defined data model and/or the fact that they do not fit well into conventional relational databases.
Following the classification of the High-Level Group for the Modernization of Official Statistics (HLGMOS), established in 2010 under the umbrella of the Conference of European Statisticians, big data sources comprise:
Administrative data (arising from the administration of a program, be it governmental or not), e.g. electronic medical records, hospital visits, insurance records, bank records, food banks, etc.
Commercial or transactional sources (arising from the transaction between two entities), e.g. credit card transactions, on-line transactions (including from mobile devices), etc.
Data from sensors, e.g. satellite imaging, road sensors, climate sensors, etc.
Data from tracking devices, e.g. tracking data from mobile telephones, GPS, etc.
Behavioural data, e.g. online searches (about a product, a service or any other type of information), online page views, etc.
Opinions, e.g. comments on social media, etc. [4]
Another classification, developed by the UNECE Task Team on Big Data [6], divides sources into three groups: human-sourced information (social networks, blogs and comments, personal documents, pictures, videos, internet searches, mobile data content, SMS, user-generated maps and e-mail); process-mediated data (data produced by public agencies, medical records, data produced by businesses, commercial transactions, banking/stock records, e-commerce, credit cards); and machine-generated data, the so-called Internet of Things, i.e. data from sensors, home automation, weather/pollution sensors, traffic sensors, mobile sensors, mobile phone location, security/surveillance videos/images, mobile locations, and satellite images.
Big data in official statistics
Irrespective of the classification adopted, it is evident that the use of new and big data sources in official statistics poses considerable challenges. Each of these challenges falls into at least one of the following categories:
Legislative, i.e., with respect to the access and use of data;
Privacy, i.e., managing public trust and acceptance of data re-use and its link to other sources;
Financial, i.e., potential costs of sourcing data vs. benefits;
Management, e.g., policies and directives about the management and protection of the data;
Methodological, i.e., data quality and suitability of statistical methods;
Technological, i.e., issues related to information technology [4].
The UN Fundamental Principles of Official Statistics emerge as the main reference for considering these six challenges. Endorsed at the sixty-eighth session of the United Nations General Assembly in January 2014, they form a solid basis for all ethical and quality-related conceptual documents about official statistics throughout the world. The European Statistics Code of Practice (CoP), the OECD Recommendation for Good Statistical Practice and the Principles Governing International Statistical Activities are some of the many examples of the transposition of the UNFPOS and their adaptation into various quality frameworks. The UN Fundamental Principles also constitute the main quality axis of the UN Handbook for Statistical Organizations and of the UN National Quality Assurance Framework Manual.
Principle 5 is considered to be the most applicable of the UN Fundamental Principles to new and big data sources [7].
Principle 5 sets up the framework for all types of data sources to be considered in the production of official statistics, provided that the data sources ensure the quality of statistical output (including timeliness), are cost-efficient and minimize the reporting burden for the data providers.
Another reference point for considering the six challenges is Europe’s Ethical Guidelines [8] on the use of Big Data in European statistics. The guidelines draw the attention of NSOs to issues of professional ethics that can arise when big data are used in the production of official statistics. They examine, at the three main stages of the statistical production process (acquisition, processing and dissemination), ethical questions concerning the cornerstone values of official statistics.
Data acquisition
When acquiring any data, whether from a traditional or non-traditional source, the problem of data ownership emerges. New and big data are mostly collected by private companies, which in most countries are not legally bound to make data available for official statistics. Therefore, the acquisition of data for official statistics is often based on strategic partnerships established with private companies to gain access to data.
Here the main ethical reference should be UN Fundamental Principle 1, called the relevance, equal access and impartiality principle.
Principle 1 stresses, first, the need to resist any pressure from data providers, such as telecommunication companies in the case of mobile phone data, to put their own interest above the public interest. Second, the selection of data providers should be fully transparent and preceded by thorough research.
The UN Friends of the Chair Group on the Fundamental Principles of Official Statistics offers some practical advice on demonstrating Principle 1 in the context of new and big data sources, for example strengthening research divisions in NSOs, preferring a multi-mode approach when collecting information (e.g. from multiple data providers), and validating information through constant crosschecks.
Relevance, impartiality, equal access, interaction with users and planning are the most important features covered by Principle 1. Competition between NSOs and other producers of official statistics for access to new and big data sources has therefore been identified as a risk to be avoided.
According to Principle 6, the confidentiality principle, NSOs need to assure data providers, including providers of new and big data, that their data will be used exclusively for the purposes of official statistics. Ensuring that providing data to the NSO for official statistics poses minimal risk of harm to the providers’ operations is also an assurance mechanism consistent with Principle 6.
The confidentiality principle affords a distinct advantage to NSOs compared with other producers of official statistics, as confidentiality is usually not only a Fundamental Principle adhered to by NSOs but is also enshrined in the legal basis of an NSO.
In addition to confidentiality, NSOs should also be informed by data providers whether their customers are aware that data about them can be delivered to statistical authorities. This requirement is often part of privacy laws. The use of social media data for official statistics has been cited as an example of the risk and difficulty involved in obtaining users’ consent to the use of data containing information about them.
Data processing
At the stage of data processing, the UN Fundamental Principles serve as a remedy for the major threat: the number of quality-related aspects that may be compromised when using new and big data sources, or integrating them with other data sources such as census data, for official statistics. Examples include the risk of bias and manipulation within big data sources, the absence of any guarantee of stability and continuity in the data structure, and the lack of scientific proof when statistical models or imputation techniques are used in data processing.
The key reason for this risk is that new and big data sources are, in general, not designed for statistical purposes and therefore do not comply with statistical definitions, standards and methods [8]. According to Principle 3, it is of utmost importance to use new and big data sources in line with verifiable, internationally comparable and transparent statistical standards and procedures.
Metadata and paradata are also critical, as is establishing and agreeing upon new quality attributes for new and big data sources where these differ from the existing ones. This would emphasize the professionalism of NSOs, which is one of the major assets that make official statistics a reliable source of information.
During data processing, the risk of revealing personal information is also pertinent, as is the risk of improper use of those data, which can damage the reputation of official statistics. Here again, Principle 1 (on relevance, equal access and impartiality), Principle 6 (on confidentiality) and Principle 2 (on trust) may be widely applicable.
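As an illustration only, one common safeguard against revealing personal information in published aggregates is to suppress small cells. The sketch below is not a method prescribed by the Fundamental Principles or by the source documents; the function name `suppress_small_cells`, the sample records and the threshold of 5 are all assumptions chosen for the example.

```python
from collections import Counter


def suppress_small_cells(records, key, threshold=5):
    """Count records per category and mask counts below `threshold`.

    Hypothetical helper: masked cells are returned as None so that
    very small groups cannot be singled out in a published table.
    """
    counts = Counter(record[key] for record in records)
    return {
        category: (n if n >= threshold else None)
        for category, n in counts.items()
    }


# Invented sample data: 12 records in region A, 3 in region B.
records = [{"region": "A"}] * 12 + [{"region": "B"}] * 3
table = suppress_small_cells(records, "region", threshold=5)
print(table)
```

In practice the threshold and the treatment of masked cells (for example, secondary suppression) would follow the disclosure control policy of the NSO concerned.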
Statistics dissemination
At the stage of dissemination of statistical information, faced with complex techniques needed for the production of statistical output while using new and big data sources, it is vital to: inform the users about methods and procedures used to produce statistics (Principle 3), adhere to international standards ensuring quality and comparability (Principle 9) as well as to communicate and educate users, preventing an erroneous interpretation of statistical information (Principle 4). Based upon that, new and big data sources should be duly described, and applied methods and models should be documented to support an independent assessment of data processing and statistical results.
Mapping different types of new and big data sources against the UN Fundamental Principles of Official Statistics
Different types of new and big data sources raise different ethical questions, depending on a range of data characteristics, e.g. access to the data, content of personal information, quality issues in terms of suitability for the purpose of official statistics, the clarity of the methods to be applied in order to obtain statistical output, etc. [8].
In this chapter, eight types of new and big data sources are explored in relation to the Principles most relevant to each. A full mapping against each of the Fundamental Principles of Official Statistics is given in Table 1. Country examples for each type of new and big data source used in official statistics are listed in the original document of the UN Statistical Commission Friends of the Chair Group [9].
Mobile phone data
Mobile phone operators’ systems generate a very large amount of data on the use of mobile communication, including location information. These data are mostly used for business and marketing purposes. However, the location data can be used for generating official statistics about space-time movement of phones, needed for instance to supplement official tourism statistics.
The ethical issues at stake with this type of data concern the privacy of the data subjects (mobile phone data contain sensitive personal information), as well as the professional independence of NSOs, which might be compromised when creating partnerships with mobile data providers, particularly with regard to treating providers equally.
Principle 1 can be applied here to stay relevant and impartial in order to best meet users’ needs, by ensuring that no bias is introduced through data collection by the data provider or through data acquisition by the NSO.
In the further stages of the statistical production process, a major issue that can be identified during data processing is the limited suitability of mobile phone data for statistical purposes. At the dissemination stage, because of differences in definitions, the methodology for this type of data source can be complex. Here, Principle 3 is widely applicable: it requires a clear, transparent and understandable presentation of both official statistics and the corresponding metadata.
Data from smart electricity consumption meters
Data from smart electricity consumption meters can be interesting for official statistics because they provide information on energy consumption, which can be beneficial for statistics on household consumption expenditure, consumer price indexes, environment statistics or statistics on energy consumption.
The inclusion of this new and big data source in official statistics is well justified by the considerable reduction in the burden on respondents; Principle 5 therefore applies fully here.
A possible difficulty in securing access to a sufficient level of detail can be addressed through appropriate legal provisions. However, the coverage of this data source should also be explored to assess whether the statistics produced are representative and relevant. Smart meter data are not intended for statistical purposes, so there may also be a risk of discontinuity in the data source.
Satellite imagery data
NSOs are exploring the possibility of using data from satellite imagery in official statistics. This data source is expected to decrease the burden on respondents, to improve timeliness and to reduce survey costs. It can also contribute to providing more disaggregated data. Satellite data are used mainly to complement agriculture statistics.
The increasingly widespread use of this kind of data raises relatively few ethical concerns. The data are mostly publicly available and carry no privacy concerns, or only very limited ones.
The major issue can be the quality of statistical output, which is directly related to the quality of the satellite images, as well as the methodology, which, following Principle 3, should be clearly explained to users. According to Principle 1, NSOs need to perform thorough research on the methods used to compile statistics based on satellite images in order to guarantee the quality of their final products. At this point the relevance of Principle 10 can also be observed.
Principle 10 provides for the joint efforts of international statistical and research communities targeted at finding common methodological solutions, sharing infrastructure, saving resources and taking advantage of synergies.
Social media data
Social media data can be used in at least three ways: as a subject of official statistics (e.g. household and business use of social media); to disseminate official statistics, thus reaching out to all kinds of users; and as a source for compiling official statistics. The remainder of this section deals with social media as a source (social media data).
Disseminated via the internet, social media data still represent an area for further exploration for official statistics. Taking the form of messages, images, videos or searches, these data are voluntarily submitted by users on the web. In several countries, research is under way on using social media to measure the level of well-being of societies (studies on happiness explore sentiment analysis).
The ethical concerns associated with this type of new and big data source are related to privacy issues (there is a recurrent question whether the users of social networks should be notified that the information they post, although being public, will be used by statistical agencies) or to lack of access to data which sometimes has to be purchased from private owners. In this case an ethical question arises as to whether NSOs should pay for data sources that are going to be transformed into public official statistics.
Processing this type of new and big data source depends critically on applying the methodology adeptly enough to ensure a sufficient level of quality in the resulting official statistics. Therefore, strengthening research divisions in statistical offices is important if this type of data source is to be used effectively.
Social media data are also a vulnerable source of data when it comes to bias and manipulation. While disseminating statistics based on social media, it is important to accompany them with proper metadata and to describe them in an understandable way.
Again, when this type of new and big data source is used to produce official statistics, Principles 1, 2, 3, 6 and 9 could be applied: to ensure relevance (Principle 1), to prevent loss of trust (Principle 2), to subject the data to comprehensive, comparable methodologies and standards (Principles 3 and 9), and to assure users that their confidentiality is protected (Principle 6).
The use of social media data should also be studied from the perspective of the proliferation of fake news. Following Holan (2016), fake news can be defined as “invented material that has been cleverly manipulated so as to come across as reliable, journalistic reporting that may easily be spread online to a large audience that is willing to believe the stories and spread the message” [10].
In the context of social media data, there is a danger that data are collected from users and made available to advertisers, who use them mainly to target advertisements. Another example of manipulating the truth is the existence of fake accounts, so-called “bots”, which may distort the factual picture and so lower the quality of social media data for statistical purposes.
Against this background, the ethical reference provided by the UN Fundamental Principles seems to be one of the most effective, as the Principles promote official statistics over raw social media data, thereby placing a higher value on the production of reliable, comparable and high-quality data that meet international standards [11]. Elaborating on the arguments expressed during the conference “Truth in numbers: the role of data in a world of fact, fiction and everything in between” [12], two points follow. First, NSOs should be provided with the resources and infrastructure they need to support their roles as both standard setter and coordinator across the National Statistical System. Second, governments should ensure that there is no political interference in their National Statistical Systems, which would result in greater trust among citizens and reflect the role of NSOs as “guardians of the facts” [12].
Web-scraped data
Web scraping refers to a technique used for extracting data from websites. Official statistics uses web-scraping techniques, for example, to collect prices of different goods from the internet and to use this type of data source as a supplement to the Consumer Price Index (CPI). If online prices replaced prices collected in the traditional way, the costs of statistics could be considerably reduced. Like other new and big data sources, web-scraped data may raise legal issues (terms of use of websites differ across countries), and because they are not designed for statistical purposes, methodological issues are also at stake. Once these data are made available, statistics partially based on web-scraped data should, as in the previously described cases, be accompanied by proper metadata.
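To make the link between scraped prices and a price index concrete, the following minimal sketch computes a Jevons elementary index (the geometric mean of price relatives) over items observed in both periods. This is an assumption made for illustration: the source does not state which index formula NSOs apply to web-scraped prices, and the function name `jevons_index`, the item names and the prices are all invented. The scraping step itself is omitted; the prices are taken as already collected.

```python
import math


def jevons_index(base_prices, current_prices):
    """Jevons elementary index (base period = 100).

    Uses only items observed in both periods, a simple matched-model
    approach; items that appear or disappear are ignored here.
    """
    matched = [item for item in base_prices if item in current_prices]
    if not matched:
        raise ValueError("no items observed in both periods")
    log_sum = sum(
        math.log(current_prices[item] / base_prices[item]) for item in matched
    )
    return 100.0 * math.exp(log_sum / len(matched))


# Invented prices, standing in for values scraped from retailer websites.
base = {"milk_1l": 1.00, "bread_500g": 2.00, "eggs_12": 3.00}
current = {"milk_1l": 1.10, "bread_500g": 2.00, "coffee_250g": 4.50}

index = jevons_index(base, current)
print(round(index, 2))  # about 104.88: only milk and bread are matched
```

The matched-model restriction illustrates one of the methodological issues mentioned above: products on websites change frequently, so the treatment of unmatched items would need to be documented in the accompanying metadata.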
Road traffic sensors data and passengers tracking sensors data
Vehicle detection loops installed in pavements, or lasers, can detect vehicles passing or arriving at a certain point, e.g. approaching a traffic light, in motorway traffic, or approaching a bus terminal. Normally, the data are stored in a central data warehouse of the responsible authority, e.g. the national transport agency [8]. Their use in official statistics is rather common. They serve to draw a picture of the number of vehicles and the speed at which they move and, along with surveys, to estimate commuting time. Passenger information collected by the sensors can be used to determine bus routes and to monitor ridership in order to decide whether service to certain areas should be increased or decreased.
[Table 1: Mapping of the Fundamental Principles of Official Statistics against new and big data sources]
When this type of new and big data source is used, the major issues are linked to quality. There are usually no privacy-related concerns. Therefore, the Principles that could be applied here concern common methodologies, standards and definitions, as well as the proper explanation of statistics to users.
Scanner data
When a product is purchased in a supermarket, the scanner creates a record. These transaction records, collected from many sellers, are of great interest to statisticians. Used for the purposes of the Consumer Price Index (CPI) or to complement other domains of official statistics, such as business statistics or household expenditure, this new and big data source may be preferred to traditional collection methods, mainly because of its lower cost.
Ethical issues in this respect concern, for example, establishing partnerships with private companies and can be linked with Principle 1. Another question related to scanner data concerns methodological challenges (proper classification of products, choosing appropriate index methodologies); observance of Principles 3 and 9 can therefore be considered a solution in these cases.
CCTV data (Security/Surveillance videos), e.g. for citizen security purposes
Several countries and cities have decided to install cameras for citizen security purposes. While these may provide more comprehensive coverage of certain types of crime in certain areas, one has to be alert to both privacy issues (cameras capture everybody, criminals and non-criminals alike) and quality issues (the cameras must be properly distributed to give complete coverage).
Conclusion
Specific features of new and big data sources make it challenging for NSOs to comply with aspects of the Fundamental Principles of Official Statistics such as professional independence, access to sources, mandate for data collection, adequacy of resources, impartiality, objectivity and clarity of the methods used.
However, the technological development of today’s world makes it impossible not to include these sources in official statistical production processes. Their use reinforces their complementarity to traditional data sources. As much as NSOs should strive to build partnerships with non-official data producers and owners to make those data part of official statistical outputs, the cornerstone values of official statistics, such as quality, standards and professional independence, should never be neglected or compromised.
The timelessness and relevance of the UN Fundamental Principles of Official Statistics make them useful for the new technological reality. The proliferation of new and big data sources is a good opportunity to test and reinterpret the UN Fundamental Principles and thereby to confirm their universal character.
The mapping exercise performed by the UN Statistical Commission Friends of the Chair Group on the Fundamental Principles of Official Statistics confirmed the extensive applicability of the Principles to all aspects specific to new and big data sources. Moreover, because the Fundamental Principles constitute a basis for all data and statistical quality-related codes, charters and frameworks throughout the world, it can be argued that they can respond to current challenges, be it the data revolution or the 2030 Agenda, with NSOs playing the leading and coordinating role in the monitoring process.
Footnotes
¹ For the purposes of this document, the terms “big data”, “new data”, “non-conventional data” and “non-traditional data” are used interchangeably.
Acknowledgments
The paper is an edited and revised version of an unofficial background paper prepared by a United Nations Friends of the Chair Group on the Fundamental Principles of Official Statistics for the 51st session of the United Nations Statistical Commission.
The authors acknowledge and greatly appreciate the contributions of the Friends of the Chair group members.
