Abstract
The rapid development of Big Data, a result of increasing interaction with online systems by humans (e.g., online shopping, marketplaces) and machines (internet of things, mobile phones, etc.), has led to a measurement revolution. If mined and analyzed correctly, these massive data can provide valuable alternative data sources for official statistics, especially price statistics. Several studies on using diverse Big Data as new sources of price statistics in Indonesia have been initiated. This article provides a comprehensive review of experiences in exploiting various Big Data sources for price statistics, followed by current developments and near-future plans. The development of the system and IT infrastructure is also discussed. Based on this experience, the limitations, challenges, and advances of each approach are presented.
Introduction
The rapid development of Big Data, a result of increasing interaction with online systems by humans (e.g., social network platforms, marketplaces) and machines (internet of things, mobile phones, etc.), has led to a new era of measurement revolution. Several sources of Big Data are now available and easily accessed by almost all citizens. These massive data are being used by businesses, researchers, and governments for public policy. Big Data can provide innovative, real-time, and more granular insight into the national economy and can be an innovative data source in the production of official statistics [1]. Pramana et al. discussed several potentials and challenges of Big Data implementation for government policy in Indonesia [2]. The Consumer Price Index (CPI) is one of the important economic indicators; it provides information about the prices of commodities paid by consumers. Currently, most National Statistics Offices (NSOs), including BPS-Statistics Indonesia, rely on collecting commodity prices manually through direct field surveys.
Advances in Big Data technologies and vast data sources have led statisticians and economists around the world to experiment with creating price indexes based on new alternative sources, e.g., scanner data and web-scraped data from online shops or marketplaces. Scanner data are digital transaction data on the sales, prices, and types of items sold, recorded at retail shops. Web scraping extracts data from publicly available information on websites. Studies on the application of Big Data from e-commerce to the measurement of the CPI have been carried out in several countries. The CPI tracks the development of the prices of goods and services paid by consumers, so a change in the CPI describes the level of increase or decrease in those prices.
NSOs in several countries have started investigating and implementing the use of Big Data to retrieve information from e-commerce to measure the CPI. Statistics Netherlands has used scanner data from several supermarket chains in the compilation of price indices since 2010 [3]. Statistics Canada has reported challenges in utilizing scanner data for the CPI. The Practical Guide for Processing Supermarket Scanner Data [4] describes how to use supermarket scanner data so that the CPIs of EU countries are harmonized and comparable. The Federal Statistical Office of Germany (Destatis) uses web scraping in official German consumer price statistics and is still examining how scanner data can be used in the current production of the CPI [5].
Statistics New Zealand initiated a study to investigate which commodities can be covered by e-commerce. The results show that several CPI commodities in New Zealand cannot be covered by commodities sold through e-commerce [6]. Furthermore, Statistics Italy shows that web scraping techniques can improve the timeliness of the Consumer Price Survey (CPS) [7]. In addition, the US Bureau of Labor Statistics (BLS) has initiated projects to obtain prices from different sources, such as web-scraped data [8]. These alternative sources have been shown to have the potential to increase sample size, reduce respondent burden, obtain transaction prices more consistently, and provide real-time information [9]. Another alternative Big Data source is crowdsourcing, which collects information from a large community of users. Several NSOs, such as Statistics Canada, have proposed using crowdsourced data to complement their statistics.
In Southeast Asia, the development of e-commerce has had a major impact on retail and financial technology-based companies. In Indonesia, the number of e-commerce users is increasing, both as sellers and as buyers. Several companies, such as Gojek, Traveloka, Tokopedia, and Bukalapak, have reached the level of unicorn, a start-up business valued at over one billion US dollars. The rapid growth of e-commerce has three causes. First, 40 percent of the total population, or around 106 million people in Indonesia, own smartphones. Second, consumers in Indonesia now transact via smartphone applications 2.6 times more frequently than in 2014. Finally, the total number of online businesses in Indonesia increased to around 4.5 million in 2017. Of these, around 99 percent are micro entrepreneurs with income of less than 300 million rupiah per year, and 50 percent are online businesses without physical stores. The emergence of e-commerce will affect the demand and supply of goods and services, which in turn affect market prices, especially at the consumer level. Thus, it can indirectly affect the CPI in Indonesia. BPS-Statistics Indonesia still has no access to scanner data from retailers or supermarket chains in Indonesia, because no regulation yet allows such access. For this reason, research and development has focused on publicly available data, although the plan is to also include scanner data as one of the Big Data sources.
This study provides a comprehensive review of past experiences, current developments, and future plans for utilizing different Big Data sources, such as web-scraped online retail and marketplace data and crowdsourcing, to produce more relevant and timely price statistics than traditional sample survey data.
Past studies on development of price data collection
Social media signals on food prices
Over the past years, several studies have been conducted to utilize different Big Data sources, e.g., social media, crowdsourcing, and websites, for price statistics. In 2014, Pulse Lab Jakarta (PLJ), an innovation initiative of the United Nations Global Pulse and the Government of Indonesia, used social media, i.e., Twitter, to nowcast food prices in Indonesia. Tweets were collected using six keywords (commodity names) and then refined and filtered. The data were modelled to obtain a daily price point for each commodity (chicken, beef, and onion). The results from social media were then compared with the official prices obtained from a traditional survey.
The study shows that tweets about food prices are closely related to the official figures. Public tweets may therefore be an important signal for predicting prices and inflation in Indonesia [10]. One drawback of this approach is that Twitter use is concentrated in cities; few people in rural areas and outside Java, Indonesia's main island, use Twitter. In addition, the number of Twitter users is currently declining. Another drawback is that people tend to keep silent, not tweeting, when prices are falling. They only “shout” when prices increase.
Figure 1. Number of daily reports for the fish commodity in all markets, April–June 2015, in Lombok, NTB.
In 2015, PLJ, in collaboration with Premise and in close cooperation with the World Food Programme and the Food and Agriculture Organization, collected and monitored in real time the prices of 32 basic commodities, including foods such as tofu, tempe, spinach, mackerel, and eggs, using a smartphone app on Lombok Island, Nusa Tenggara Barat (NTB), Indonesia. Prices were collected every day throughout a geographic coverage area of nearly 20,000 square kilometers in the province. From April to June 2015, Premise was able to build a network of more than 500 contributors on the island of Lombok alone, with more than 5,000 unique places visited and 66,902 observation reports submitted.
Figure 1 shows the number of daily reports for fish in almost all traditional markets in Lombok, NTB. The size and color represent, respectively, the number of reports and the time of reporting (morning, afternoon, evening). Most people reported prices in the morning. The four most popular markets are Pasar Pelulan, Pasar Induk Mandalika, Pasar Umum Kediri, and Pasar Narmada, which are the main markets on Lombok Island. The number of daily reports varied, especially at the beginning of the program while contributors were still being recruited. After a month, more contributors had registered, leading to an increase in the number of reports.
Collecting data is just the beginning of a long process to obtain reliable statistics. Through crowdsourcing, data can be collected in almost real time over a wide area. However, the collected data contain noise, because contributors can enter any price in any unit. This poses various problems, such as different unit sizes, unknown commodity quality, incomplete data, and extreme prices. The researchers carried out several pre-processing steps: removing incomplete records; standardizing the unit size, including unit conversion and deletion of records with irrational unit sizes; removing extreme and unacceptable prices per unit size; and calculating daily prices by combining the pre-processed data.
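The pre-processing steps listed above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the record fields, the unit table, the price bounds, and the use of the daily median are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical unit table (grams per reported unit); an assumption for illustration.
GRAMS_PER_UNIT = {"kg": 1000.0, "g": 1.0, "ons": 100.0}


def clean_reports(reports, min_price_per_kg, max_price_per_kg):
    """Return (date, price per kg) pairs that pass the cleaning rules."""
    cleaned = []
    for r in reports:
        # 1. Remove incomplete records.
        if any(r.get(key) is None for key in ("date", "price", "size", "unit")):
            continue
        # 2. Standardize the unit; drop unknown units and irrational sizes.
        grams = GRAMS_PER_UNIT.get(r["unit"])
        if grams is None or r["size"] <= 0:
            continue
        price_per_kg = r["price"] / (r["size"] * grams) * 1000.0
        # 3. Remove extreme prices per unit (bounds are supplied by the analyst).
        if not (min_price_per_kg <= price_per_kg <= max_price_per_kg):
            continue
        cleaned.append((r["date"], price_per_kg))
    return cleaned


def daily_prices(cleaned):
    """Combine cleaned observations into one price per day (median, by assumption)."""
    by_day = defaultdict(list)
    for day, price in cleaned:
        by_day[day].append(price)
    return {day: median(prices) for day, prices in sorted(by_day.items())}
```

The median is used here only because it is robust to the remaining noise; the source does not state which daily aggregate was computed.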
The official prices collected by BPS were used as the ground truth against which the crowdsourced prices were compared. BPS collects prices of goods and services at the consumer level by conducting the CPS. For volatile commodities (chicken, beef, egg, onion, chili, low-quality rice, and premium-quality rice), the survey is performed every week, whereas for non-volatile commodities (mackerel, long bean, instant dry noodles, peanuts, and vegetable tomatoes), the data are collected every two weeks. The official price index is published every month. As the nature of the commodities is diverse, different approaches were proposed to nowcast volatile and non-volatile food prices. Rizkika et al. [11] discussed these approaches:
a. Volatile food price
Figure 2. The crowdsource-based predicted price and official data for the price of broiler chicken.
Figure 3. The crowdsource-based predicted price and official statistics for the price of tomatoes.
Before modeling, pre-processing techniques are applied: data cleaning, data transformation, smoothing, and data imputation. Data cleaning includes filtering the data by time and place of study and removing incomplete records, extreme prices, and outliers. Data transformation includes standardizing the unit price and calculating daily and weekly prices. Smoothing, using a smoothing spline, reduces fluctuation. Data imputation is needed to fill in data unavailable on certain days and to derive daily prices from weekly prices using temporal disaggregation. After pre-processing, two approaches, the Distributed Lag Model [12] and Neural Network Resilient Backpropagation [13], are carried out and compared. Figure 2 shows the prices of broiler chicken during the study period. The points represent the reported prices, the solid line is the official price, and the other lines are the predicted prices based on crowdsourced data.
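To illustrate the imputation step, the sketch below fills the days between two observed prices by linear interpolation. Linear interpolation is a deliberately simple stand-in chosen for this example; it is not the smoothing spline or the temporal disaggregation method used in the study, and the integer day index is an assumed representation.

```python
def interpolate_daily(observed):
    """Fill gaps between observed prices linearly.

    `observed` maps an integer day index to a price, e.g. weekly survey prices.
    Returns a dict with one price for every day from the first to the last observation.
    """
    points = sorted(observed.items())
    if not points:
        return {}
    daily = {}
    for (d0, p0), (d1, p1) in zip(points, points[1:]):
        for d in range(d0, d1):
            # Weighted position of day d between the two surrounding observations.
            daily[d] = p0 + (p1 - p0) * (d - d0) / (d1 - d0)
    last_day, last_price = points[-1]
    daily[last_day] = float(last_price)
    return daily
```

For example, two weekly prices on day 0 and day 7 produce eight daily values, with the intermediate days rising in equal steps.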
b. Non-volatile food price
For this type of commodity, two methods were implemented for nowcasting: a time series-based method (Nowcast Model) and a statistical filtering-based method followed by cubic smoothing spline modeling (IQR-Spline Model, KDE-Spline Model). Similar data pre-processing stages are carried out: data cleaning (including outlier handling and fraud detection) and data transformation to standardize units for each commodity. Figure 3 shows the price and predicted price for tomatoes. The results show that the best method depends on the type of commodity. The IQR-Spline model is better than the KDE-Spline model for long beans and mackerel, whereas the KDE-Spline model is better for instant dry noodles, peanuts, and vegetable tomatoes.
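The filtering half of an IQR-based approach can be sketched as below. This is only an illustration of interquartile-range filtering under assumed settings (the conventional multiplier of 1.5 and Python's default quartile method); the spline-fitting stage and the exact rules of the IQR-Spline model are not reproduced.

```python
from statistics import quantiles


def iqr_filter(prices, k=1.5):
    """Keep prices inside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed default."""
    if len(prices) < 4:
        return list(prices)  # too few points to estimate quartiles reliably
    q1, _, q3 = quantiles(prices, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [p for p in prices if low <= p <= high]
```

Applied to one day of reports, the filter drops a price such as 50000 when the other reports cluster around 10000 to 11000, and the surviving points would then feed the smoothing step.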
Figure 4. PasBeli consumer price data collection.
The crowdsourcing study shows high participation in price reporting through a mobile app. However, the main shortcoming of the previous crowdsourcing strategies is that the mobile app was not developed specifically for a price survey of specific commodities. Users could enter any value, price, and size/unit. Furthermore, not all the commodities required for the CPI, along with their standard units, were included in the Premise app. To address this, an Android app for price data collection called PasBeli was developed [14]. Screenshots of the app are shown in Fig. 4.
The app includes all commodities used for the CPI along with the correct size/unit. It connects to Google Maps so that users can report the prices of commodities bought at the closest traditional or modern market. The system records the prices, and a dashboard shows real-time prices across Indonesia. The main challenge is to promote the app so that a large number of people download and use it.
The rapid growth of online shopping, now boosted by the Covid-19 pandemic, has changed daily transactions all over the world, including in Indonesia. Online shopping is a form of electronic commerce that allows consumers to buy goods or services directly from a seller over the Internet. E-commerce can be defined as a technology, application, and business process that connects companies, consumers, and communities through electronic transactions and the electronic trade of goods, services, and information [15]. In general, e-commerce can be grouped into classified ads, marketplaces, and online retail.
In Indonesia, various online retailers are listed in the directory of members of the Indonesian E-Commerce Association (IDEA), for instance Hypermart, Klikmart, Bukupedia, Bhineka, Sephora, Zalora, BerryBenka, Electronic City, Century Pharmacy, Babyzania, Stationary, and Mothercare. Among marketplaces, the big players in Indonesia are Tokopedia, Shopee, and Bukalapak. The information publicly available from these websites can be captured, stored, and then used for price statistics and other macroeconomic facts. To capture information from such e-commerce sites, scraping programs can be implemented. The programs visit the e-commerce websites and retrieve information such as product names, product prices, and numbers of products sold. This information is then analyzed to identify the price changes that occurred in a certain time period.
In Indonesia, several studies have been carried out and several tools have been developed to implement web scrapers that can retrieve the prices of the commodity package used in compiling the CPI.
Figure 5. Screenshot of online retail prediction.
Sutiawan and Nugraha [16] developed a system that uses web scraping to collect information on many consumption commodities from several online retailers. The data obtained from web scraping are then used to predict prices using an artificial neural network. The predicted prices are then compared with the official prices to assess the accuracy of the approach. The results show that the system can predict the price of certain commodities close to the price survey data. However, the study shows only moderate prediction accuracy because of the small sample size, as only a few months of data were used for the case study.
These systems consist of the online CPS by BPS, the market necessity monitoring system of the Ministry of Trade, and the commodity price information system of the Commodity Futures Trading Regulatory Agency.
Arief and Kurniawan [17] developed a system for web scraping from different sites: online retailers (Klikindomaret, Alfacart, HappyFresh, and Hypermart Online) and a website reporting daily prices from traditional markets in Jakarta (
Figure 6. Coverage of commodities sold in e-commerce for each commodity group in the CPS.
Wijaya and Mariyah [18] developed a crawling system that retrieves information published on selected e-commerce sites, such as product name, brand name, quality, price, and other available information. The development was triggered by BPS's need to collect price data. At the time, BPS relied only on data collected from traditional and modern markets. The prices of goods and services sold online were not collected, although online purchases were growing fast and gradually changing the way people shop. The approach taken by [18] started with selecting appropriate e-commerce websites. The selection criteria were the frequency with which people buy goods on a site and its coverage of the BPS commodity package. Fourteen e-commerce websites were chosen, and the finding was that no single e-commerce site sold all goods listed in the BPS commodity package. The e-commerce websites therefore complemented each other. The authors then listed the URLs of the pages selling those goods. This was done manually because high precision was needed to select goods that match the BPS commodity criteria, such as brand, quality, and quantity. This stage was quite challenging because there were various similar items for each commodity.
Figure 6 shows that many kinds of dishes or foods could not yet be sold and purchased online. For the transportation, communication, and financial services group, no commodity sold in e-commerce matched the criteria. Data pre-processing techniques were applied to all successfully scraped data to handle noise, normalize the data, and transform them (data type, unit uniformity, and scale uniformity). The authors found that each e-commerce website had its own style and layout for selling goods. Their research did not stop at this stage: they also tried to calculate the consumer price index. The scraping was therefore run for three months to obtain a series long enough for the exercise. In the consumer price index calculation, for each month and each commodity, they calculated the average price, the relative price, and the consumption value. The challenge lay in the consumption value, because it requires the quantity of each commodity consumed in each month. Their approach was to adopt the consumption quantities of goods sold in traditional or modern markets, since no survey yet measured the consumption value of goods sold online. Consequently, the calculation would be biased, because online purchases may follow different patterns from offline purchases.
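The three quantities named above (average price, relative price, consumption value) fit together roughly as in the sketch below. It assumes the usual modified Laspeyres mechanics, in which each commodity's consumption value is carried forward by its relative price and the index is the ratio of current to base-period consumption values; the function name, argument layout, and figures are hypothetical and are not taken from [18].

```python
def update_index(avg_prev, avg_now, values_prev, base_values):
    """One monthly step of a modified Laspeyres-style index (assumed form).

    avg_prev, avg_now : average price per commodity, previous and current month
    values_prev       : consumption value per commodity in the previous month
    base_values       : consumption value per commodity in the base period
    Returns (index, values_now).
    """
    values_now = {}
    for commodity, value_prev in values_prev.items():
        relative_price = avg_now[commodity] / avg_prev[commodity]
        # Carry the consumption value forward by the price relative.
        values_now[commodity] = relative_price * value_prev
    index = 100.0 * sum(values_now.values()) / sum(base_values.values())
    return index, values_now


# Illustrative figures only.
index, values = update_index(
    avg_prev={"rice": 10000, "eggs": 20000},
    avg_now={"rice": 11000, "eggs": 20000},
    values_prev={"rice": 600.0, "eggs": 400.0},
    base_values={"rice": 600.0, "eggs": 400.0},
)
```

In this toy case a 10 percent rise in the rice price, with rice holding 60 percent of the base consumption value, moves the index by about six points. The bias discussed above enters through `base_values`, which come from offline consumption patterns.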
The approach taken by [18] had some limitations. First, the structure and layout of e-commerce websites often changed, so some information could not be retrieved. In addition, the listed URLs could go out of date. In the consumer price calculation performed by BPS, the same quality or brand from various selected respondents is used. In their study, however, the same quality or similar brands across e-commerce sites could not be identified. They found it difficult to distinguish the same quantity of goods sold on different e-commerce websites. It was also challenging to know where the goods were manufactured and where they were distributed and sold. Therefore, the quality weight and area weight could not be calculated, which meant that the consumer price calculation could not be disaggregated to the provincial level.
Figure 7. Flowchart of marketplace web scraping.
A marketplace is a special kind of e-commerce platform that provides well-organized items, commonly on a website, for selling and buying. A marketplace website, like other HTML-based web pages, contains many HTML tags, the markup that a web browser renders into the intended view. For official statistics purposes, only specific parts of the website are taken into account as data sources. The structure of the website, including its HTML elements, was inspected to find the right endpoint that returns the data, commonly in JSON format. The general process of marketplace web scraping is shown in Fig. 7. Marketplaces make some information available to the public to help users assess the quality and reputation of products and shops. Several statistically relevant data items could be explored to produce price statistics. The price of each product is shown clearly, but some noise in the stated price should be considered, such as discounts, reseller items, rare items, or second-hand goods.
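Once such an endpoint has been found, handling its JSON response might look like the sketch below. The payload and every field name in it (`products`, `original_price`, `condition`) are invented for illustration; real marketplaces use their own schemas. The example only shows how two of the noise sources mentioned above, discounts and second-hand goods, could be flagged or dropped.

```python
import json

# Hypothetical payload; all field names are assumptions for illustration.
SAMPLE_RESPONSE = """
{"products": [
  {"name": "Premium rice 5 kg", "price": 60000, "original_price": 65000, "condition": "new"},
  {"name": "Medium rice 5 kg", "price": 55000, "original_price": null, "condition": "new"},
  {"name": "Rice cooker", "price": 150000, "original_price": null, "condition": "used"}
]}
"""


def extract_prices(payload):
    """Parse a JSON payload and return one row per new (not second-hand) product."""
    rows = []
    for item in json.loads(payload).get("products", []):
        if item.get("condition") != "new":
            continue  # second-hand goods are treated as noise
        # Fall back to the selling price when no pre-discount price is given.
        list_price = item.get("original_price") or item["price"]
        rows.append({
            "name": item["name"],
            "price": item["price"],
            "list_price": list_price,
            "discounted": item["price"] < list_price,
        })
    return rows
```

Keeping both the selling price and the list price lets the analyst decide later whether discounted prices should enter the index.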
CPI calculation by BPS
The Consumer Price Index (CPI) is one of the economic indicators used to measure price change, in terms of inflation and deflation, at the consumer level. Due to changes in the public's consumption behavior, since 2020 BPS applies CPI on year 2018
LCS 2018 produced 835 commodities. The largest commodity basket is in Jakarta, with 473 goods and services. The city with the smallest commodity basket is Sintang municipality, with 248 commodities. The core component (core inflation) comprises 711 commodities, the administered prices component 23 commodities, and the volatile prices component 101 commodities. The enumeration for Consumer Price Statistics is carried out in traditional markets, modern markets, outlets, and official online stores in each region. The price of each commodity is obtained from direct interviews with retailers and/or by scraping price information from official online stores. The frequency of data collection differs among commodities, depending on the characteristics of each. Prices of some commodities are collected weekly, others fortnightly, and the rest monthly.
In general, the formula for calculating CPI by BPS is the modified Laspeyres formula, as below:
where:
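As a sketch, assuming the form of the modified Laspeyres index commonly given in BPS methodological notes (the notation below is an assumption, not taken from this text), the index for month n can be written as:

```latex
I_n = \frac{\displaystyle\sum_{i=1}^{k} \frac{P_{ni}}{P_{(n-1)i}} \, P_{(n-1)i} Q_{0i}}
           {\displaystyle\sum_{i=1}^{k} P_{0i} Q_{0i}} \times 100
```

Under this assumed notation, I_n is the index for month n; P_{ni} and P_{(n-1)i} are the prices of commodity i in months n and n-1; P_{(n-1)i} Q_{0i} is the consumption value of commodity i in month n-1; P_{0i} Q_{0i} is its consumption value in the base year; and k is the number of commodities in the basket. The ratio P_{ni} / P_{(n-1)i} is the relative price, so each month's consumption value is obtained by updating the previous month's value with the price relative.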
Maintaining the previously developed system
Several online retail web scraping applications have been developed, so the current work is to maintain these systems, especially the one developed by [18]. As the Covid-19 pandemic forced all face-to-face field surveys to be postponed, prices from online retailers have taken on an important role. The biggest challenge of web scraping is that the system needs to be modified whenever a website changes its structure or layout, which is a problem for websites that change frequently. Currently the system is run and maintained so that the scraper is updated whenever such a change occurs. Prices of commodities from online retailers such as Sephora, Elekcity, Klikmart, Century, Babyzania, Mothercare, and Bukupedia are collected.
Web scraping new sources
The online retailers scraped previously do not show commodity prices for a specific region. Hence, online retail data can be used only at the national level. However, the CPI of a city or region is also needed. In current development, Haqqoni and Pramana [19] investigate other strategies to obtain region-level prices. For that purpose, data are collected from several online retailers, information sites, and marketplaces that provide prices at the region level. An online retail
Two agencies in Indonesia that collect data from traditional markets and report prices daily are the Commodity Futures Trading Regulatory Agency (Bappebti)
Figure 8. Flowchart of marketplace web scraping for specific products.
The collected products are then categorized into the commodities listed in the commodity basket in the BPS CPS guidebook. The prices of these commodities are standardized. Furthermore, the weight of each type of commodity is balanced to avoid price bias for a commodity across different sources. In this case, several types of commodities are converted to prices per kilogram or per liter. Another new source is one of the largest marketplaces in Southeast Asia, which has a broader range of products. Bustaman et al. [20] discussed the technical web data gathering steps, computer system architecture, and data cleaning for obtaining data from a marketplace. However, their approach collects information on all products, which takes around two weeks to complete, and not all collected products are in the commodity basket used to calculate the CPI. Hence, for the purpose of the CPI, the web scraping is based on keywords that match the BPS commodity basket. Examples of the keywords are: rice, purebred, chicken, meat, beef, and eggs. The web scraping procedure is shown in Fig. 8. The main challenge with marketplace sites is the cleaning process. Sellers can enter any price and any unit, which need to be adjusted and checked for quality. In the cleaning stage, only products that have been sold previously are selected. The same steps as before, such as unit standardization, are carried out.
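The keyword matching, sold-items filter, and conversion to a price per kilogram or per liter described above can be sketched as follows. The listing fields (`title`, `price`, `sold`), the size pattern, and the conversion table are assumptions made for this example, and real listings need far more careful parsing.

```python
import re

# Matches a quantity followed by a unit in a product title, e.g. "5 kg" or "250ml".
SIZE_PATTERN = re.compile(r"(\d+(?:[.,]\d+)?)\s*(kg|g|gr|gram|l|liter|ml)\b")

# Conversion factors to the base unit (kilogram or liter); an illustrative table.
TO_BASE = {"kg": 1.0, "g": 0.001, "gr": 0.001, "gram": 0.001,
           "l": 1.0, "liter": 1.0, "ml": 0.001}


def standardize(listing, keywords):
    """Return (commodity, price per kg or liter), or None if the listing is unusable."""
    title = listing["title"].lower()
    if listing["sold"] <= 0:
        return None  # keep only products that have been sold before
    commodity = next((k for k in keywords if k in title), None)
    if commodity is None:
        return None  # not part of the commodity basket keywords
    match = SIZE_PATTERN.search(title)
    if match is None:
        return None  # no recognizable size, so the price cannot be standardized
    amount = float(match.group(1).replace(",", ".")) * TO_BASE[match.group(2)]
    if amount <= 0:
        return None
    return commodity, listing["price"] / amount
```

A listing titled "Premium Rice 5 kg" at 60000 with at least one sale would, under these assumptions, be standardized to 12000 per kilogram and assigned to the keyword "rice".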
Collecting, storing, and analyzing data from “unusual” sources instead of conventional surveys is not a trivial task for computing infrastructure. We divide the computing resources used, i.e., hardware and software, into three layers based on shared functionality and process characteristics, namely data collector, data staging and storage, and data processing; see Fig. 9.
Figure 9. Big Data processing stages.
The data collector is a subsystem that handles raw data collection or capture from Big Data sources, such as the internet, an intranet, or particular file formats. The data collector runs several application instances to speed up the collection process. For price statistics, each instance works in multi-thread and multi-process mode to scrape data from e-commerce websites. E-commerce websites, like other HTML-based web pages, contain many HTML tags, the markup that a web browser renders into the intended view. For official statistics purposes, only specific parts of the website are taken into account as data sources. The structure of the website, including its HTML elements, was inspected to find the right endpoint that returns the data, commonly in JSON format. Based on our findings, large marketplaces have well-organized website structures, so their data endpoints are easy to query. We call such a data endpoint a public API (Application Programming Interface). Public APIs were invoked using a small but reliable crawler program or script. Currently we use two kinds of crawlers: the first is Scrapy, an open-source web crawling framework written in Python [21], and the second is a custom multi-threaded Java-based application. The Java-based application has a smaller memory footprint and lower CPU usage than Scrapy [22].
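The multi-threaded collection pattern can be illustrated with Python's standard thread pool. This is a minimal sketch rather than the Scrapy or Java crawler described above: the fetch function is injected, and `fake_fetch` is a hypothetical stand-in that returns made-up data instead of calling a real endpoint.

```python
from concurrent.futures import ThreadPoolExecutor


def collect(keywords, fetch, max_workers=4):
    """Run fetch(keyword) for every keyword on a thread pool.

    Returns a dict mapping each keyword to whatever fetch returned for it.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input keywords.
        results = list(pool.map(fetch, keywords))
    return dict(zip(keywords, results))


def fake_fetch(keyword):
    """Stand-in for an HTTP request to a marketplace endpoint (hypothetical)."""
    return [{"keyword": keyword, "price": 10000 + 1000 * len(keyword)}]


collected = collect(["rice", "chicken", "eggs"], fake_fetch)
```

Threads suit this workload because a crawler spends most of its time waiting on network responses; `max_workers` corresponds to the thread-count parameter that has to be tuned against the target site's limits.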
The second layer comprises data staging and data storage, which provide, respectively, a temporary and a persistent database to the data collector. Captured data are forwarded to the data staging database as temporary storage before being transferred to the persistent database. Data staging should have good write performance, especially for INSERT operations, to minimize the waiting time of the crawler. In contrast, persistent data storage should have good read performance, because frequent SELECT queries will be carried out. Data are accumulated in the staging layer before they are loaded into persistent storage in batches or asynchronously. In certain cases, parsing and decoding are needed to transform the data into a structured model that can be analyzed by data scientists. Read performance can be improved by mirroring or replicating the data across several nodes and partitions, distributing data access and avoiding a single-point bottleneck. Approaches implemented for data staging include in-memory variables, a localhost relational database, and a cloud database. We use MySQL as the localhost database and Google BigQuery as the cloud database. For persistent data storage, we use Google BigQuery [23].
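The staging-then-batch-load flow can be sketched with an in-memory SQLite database. SQLite is used here only so the example is self-contained; it stands in for the MySQL staging database and the Google BigQuery persistent store, and the table and column names are assumptions.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a staging table and a persistent table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE staging (name TEXT, price REAL, scraped_at TEXT)")
    db.execute("CREATE TABLE prices (name TEXT, price REAL, scraped_at TEXT)")
    return db


def stage(db, rows):
    """Fast path used by the crawler: append raw rows to the staging table."""
    db.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)


def flush(db):
    """Batch-load everything in staging into the persistent table, then clear staging."""
    with db:  # commit on success, roll back on error
        moved = db.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
        db.execute("INSERT INTO prices SELECT name, price, scraped_at FROM staging")
        db.execute("DELETE FROM staging")
    return moved
```

Separating `stage` from `flush` keeps the crawler's write path short, while the batch load can run on its own schedule, which mirrors the asynchronous loading described above.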
The data processing subsystem is a set of applications used by data scientists to analyze, explore, and simulate the data sets. A variety of methods are used, such as statistical analysis, data mining, and data simulation. Because of this diversity, the analysis tools and software vary from standalone applications to modern parallel processing software such as Apache Zeppelin. For simple data tabulation or pivoting, business intelligence (BI) applications are used, i.e., Microsoft Power BI and Tableau.
Currently, we are generalizing and standardizing our Big Data infrastructure based on previous findings and experience. The high-level architecture is defined without depending on specific technologies and implementations. We have also increased the modularity of the subsystems, to gain separation of concerns, by dividing data staging and persistent data storage into two different layers. This enables flexible implementation of a wide range of modern storage technologies, such as in-memory cluster databases and distributed file systems, which can be elastically expanded without restructuring the whole system. We are in the process of evaluating different combinations of data storage infrastructures and technologies. Old systems are gradually being migrated into new, stable, integrated systems while we adjust parameters, such as the number of threads and instances, to find proper configurations.
Big Data initiative for price data collection
The next phase will focus on several directions. The first is exploring new data sources such as scanner data and transaction data from marketplaces or online shops. Under Government Regulation No. 80 of 2019 on Trade through Electronic Systems (e-commerce), all domestic and foreign traders who conduct trade through electronic systems in Indonesia must regularly submit their data and/or information to the government agency that carries out government affairs in the field of statistics. An advanced system, governance, and technical regulations for implementing that regulation are in progress.
For current Big Data sources, the focus will be on quality assurance of both the process and the data, in line with the Statistics Quality Assurance Framework. Furthermore, as the internet is full of copies, meaning that some data are duplicated across several sites, an artificial intelligence system will be developed to match products from similar sellers. The system will also automatically classify the products into the commodity basket according to the Classification of Individual Consumption According to Purpose (COICOP). All the processes mentioned above will be compiled into a pipeline system for real-time price data collection.
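As a very rough illustration of duplicate matching, the sketch below treats two listings as copies when their prices are equal and their normalized titles are nearly identical. Plain string similarity from the standard library is a stand-in chosen for this example; it is not the planned artificial intelligence system, and the threshold and field names are assumptions.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lower-case a title and collapse repeated whitespace."""
    return " ".join(title.lower().split())


def deduplicate(listings, threshold=0.9):
    """Keep the first listing of each group of near-identical title/price pairs."""
    kept = []
    for item in listings:
        title = normalize(item["title"])
        is_duplicate = any(
            item["price"] == other["price"]
            and SequenceMatcher(None, title, normalize(other["title"])).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(item)
    return kept
```

The pairwise comparison is quadratic in the number of listings, so a production system would need blocking or indexing before it could handle a full marketplace.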
Another plan concerns the methodological aspect: developing national and regional online Consumer Price Indexes, and combining the price indexes from surveys and Big Data to obtain more timely and reliable price statistics. Moreover, a nowcasting approach based on these new statistics will be developed as well. In addition, another important focus is strengthening the legal basis of web crawling for official statistics.
Future design of Big Data infrastructure
The main limitations of the current systems concern performance and scalability. The web scraper required one month to complete the scraping process for one large marketplace, for selected attributes and the top 50 most-purchased products. Our experiments show that the estimated time to collect the complete list of products from a large marketplace using the current infrastructure is about 83 days. This means we would lose monthly data series while waiting for the scraping process. The complete data of a single marketplace require approximately 2 terabytes of storage space for one round of data collection. Storing several months of such data on a single server is inefficient and impractical because of the high hardware specification required.
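A back-of-the-envelope calculation shows what these figures imply. The sketch assumes that collection time scales linearly with the number of parallel instances, which is optimistic because it ignores rate limits and coordination overhead; the 30-day target and the 12-round horizon are illustrative choices.

```python
import math


def instances_needed(single_instance_days, target_days):
    """Parallel instances required to finish within target_days, assuming linear scaling."""
    return math.ceil(single_instance_days / target_days)


def storage_tb(tb_per_round, rounds):
    """Raw storage needed to keep a number of full collection rounds."""
    return tb_per_round * rounds


monthly = instances_needed(83, 30)   # full crawl inside one monthly cycle
weekly = instances_needed(83, 7)     # full crawl inside one week
one_year = storage_tb(2, 12)         # twelve monthly rounds at about 2 TB each
```

Under these assumptions at least three instances are needed to keep a monthly series, twelve for a weekly one, and a year of monthly rounds occupies about 24 terabytes before any replication.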
Next-generation Big Data systems should increase collection speed and capacity to produce faster and more comprehensive analysis. There are several options for increasing performance, such as resource clustering, distributed processing, and in-memory processing. Fast storage, such as Random Access Memory (RAM) and Solid State Disks (SSD), is becoming inexpensive and offers higher throughput. This will change the way data are processed, from slow disk-based data access to fast in-memory data processing. Similarly, database systems are evolving toward elastic capacity, where storage space can be expanded online without restructuring the existing deployment. Storage on the scale of a large mainframe computer can be achieved by coordinating numerous commodity server machines, or even personal computers, called nodes, to work together in a connected network. Reliable replication algorithms also ensure high availability of the systems by maintaining copies of the data on more than one node as a backup when the main node fails.
For long-term, reliable production systems, deciding which strategies and technologies to choose is not a simple task. Recurrent experiments and linear testing should be carried out to examine them accurately. Big Data technologies are developing rapidly, with new tools continually emerging. Structured research should be established to assure sustainable and continuous system improvement that follows the advancement of Big Data technologies.
Summary
Several Big Data sources have been explored to obtain timelier price statistics in Indonesia. Social media give a good signal of price fluctuation; however, the signal is limited to price increases. Normal or lower prices cannot be captured, as people tend to be silent in these situations. Social media are also limited because the number of users may decline and users may change platforms. Crowdsourcing is another alternative that can capture actual consumer prices. However, the challenge is to ensure the validity of the data by avoiding fraud and removing outliers and duplicates. Another challenge is to increase participation from citizens. A synergistic system must be implemented so that participants benefit from reporting prices.
Web scraping is useful for official statistics, as it can provide faster, higher-quality statistics and reduce the number of surveys and the respondent burden. However, there are still challenges in scraping, such as identifying the right websites, scraper design, data processing, coverage, and legal issues. Each website has a specific structure, which may change at any time. The scraper therefore needs to be customized for each website and updated over time. Building robust and adaptive scrapers is the challenge. Furthermore, different sites have different products with different unit sizes, which makes the data cleaning and pre-processing stages more challenging. The number of products and sellers is increasing significantly, which requires suitable storage and computational infrastructure. Web scraping can be done in areas where digitalization is already mature; in Indonesia, some areas still need to use the conventional approach. The legal aspect of web scraping, i.e., permission to crawl, is also important. Hence, if feasible, the website owners should be informed. The scraping procedure must follow the Robots Exclusion Protocol rules published by each website, which may differ from site to site.
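Checking a site's Robots Exclusion Protocol rules before crawling can be done with Python's standard library, as in the sketch below. The robots.txt content, the `price-bot` user agent, and the `shop.example` URLs are invented for the example; a real crawler would download each site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary shop.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("price-bot", "https://shop.example/products/rice")
blocked = parser.can_fetch("price-bot", "https://shop.example/checkout/cart")
delay = parser.crawl_delay("price-bot")  # seconds to wait between requests, if stated
```

A scraper would skip any URL for which `can_fetch` returns False and pause for `delay` seconds between requests when the site states a crawl delay.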
The implementation of Big Data for producing price statistics in Indonesia has been initiated with various data sources, approaches, and methods. It is becoming increasingly viable, as it can reduce the workload of data collection and improve data granularity. However, many challenges remain, and further actions are needed before the results can serve as official statistics. The next challenge is not only obtaining new data sources but also finding the best data processing and modeling methods to produce high-quality statistics, together with suitable IT infrastructure.
