Abstract
We describe a case of price indices based on web scraped data, implemented in a mid-size city in a country with moderate inflation. Full consumer price (CPI) and construction cost (CCI) indices were implemented for an entire city, with efficient results compared to statistics based on traditional data collection methods. We argue that web scraping combined with big data techniques will make it possible to estimate more individualized and efficient metrics, comparable in quality to official statistics. Web scraping technologies empower civil society and small research groups alike by allowing them to gather and interpret socioeconomic data. They also help to create new dimensions of analysis by allowing changes in frequency and a focus on specific groups of products and services.
Introduction
Argentina has long been recognized as a country of great plains, extensive herds of cattle, good wine, tango and, sadly, persistent inflation. While sustained price increases were a major worldwide problem in the seventies and eighties, mainly due to supply shocks, the South American country experienced continuing inflation from the mid-fifties onward, culminating in two hyperinflations by the end of the eighties. During the nineties inflation was slowed by a tight fixed exchange rate program that ended in a major depression in the early 2000s. After this period there were no price pressures, but by 2005 demand-expanding policies were causing prices to increase slowly but persistently year after year. Unlike in previous inflationary processes, by 2007 [19, 3] the Argentine political administration intervened in the national statistical office, the Instituto Nacional de Estadísticas y Censos (INDEC), a highly respected institution at that time [4]. The purpose was not clear at the beginning, but soon after senior professional staff were removed, the publicly released inflation figures became suspiciously low. People began to distrust official information on this and other national statistical measurements, as perceived variations were higher than reported ones.
At the same time, digitalization had become omnipresent in daily life. A rising e-commerce trend made this option available in most parts of the world, including of course Argentina. Electronic commerce is possible only if online catalogues and digital payment methods are present. These catalogues usually show information on product features and prices. Since buyers cannot inspect the products themselves when purchasing online, retailers make an effort to display detailed information on each item, including the price, name, brand, package size, category, several pictures, and whether it is on promotion or under price control. By the early 2000s, developments in software that scrapes data from the Internet were making it easier to gather enormous databases. While this online information serves various purposes, such as anticipating inflation [14], it is particularly well suited to constructing price indicators that substitute for untrusted official statistics.
The aim of this paper is to present a successful case of a private venture based on web scraped data in which a minimal team managed to replicate and adapt consumer price and construction cost indexes for a mid-size urban area in Argentina. Given the inflationary setting of the economy, the frequency of data collection and calculation has been increased to produce weekly outcomes. The results so far support the conclusion that, compared with institutions that estimate inflation with standard procedures (hand-picking prices), online data collection is remarkably efficient, although it still faces tradeoffs. A larger volume of data is managed and comparable results are obtained at a cost that is one thousandth of that of traditional methods. On the other hand, many baskets of goods change composition regularly, and quality variations also affect prices; both remain uncontrolled.
Specifically, in late 2012 two social entrepreneurs in Bahia Blanca (700 km south of the capital city, Buenos Aires) began to collect a growing number of prices from local stores in databases in order to dispute untrustworthy official statistics. Lacking an economics background, and of course unaware of any precedent such as [5, 6, 8], by early 2014 they sought help on how to take proper advantage of this database and were advised by a local economist to create a more precise instrument for measuring living costs in the city. In September 2014 they officially launched IPC Online (Consumer Price Index Online) and began to make public all the information on local inflation. The quality of the estimation could be tested directly, given that a private institution had long been computing a local CPI based on the traditional method of in-person surveys.
The paper proceeds as follows. Section 2 provides a brief literature review. Section 3 presents the project, and Section 4 shows data and results, with a special focus on comparisons with alternative measurements. Section 5 discusses the weaknesses and strengths of the applied methodology, and Section 6 concludes.
Web scraping data based indices literature
Most retailers, including every large retailer, gather information through electronic scanners. Meanwhile, with the advance of digitalization, internet commerce has become better known to the general public. Online prices are now available via web scraping techniques and have become one of the most promising sources of socioeconomic information [12]. Like scanner data, web scraped data are subject to seasonality and promotions, perhaps at even higher frequency, a fact that should always be considered. Daily access to online general stores and retailers, which is usually freely permitted, can lead to a massive accumulation of information [2]. For instance, a very large US retailer such as Walmart offers more than 50 grocery categories for online shopping, with an average of around 35 items per category. This yields at least 1,750 items whose data might be gathered from this retailer on a daily basis. As a rule, a country that supports this development has at least 4 major retailers keeping data on item prices and attributes openly accessible on the web. Automated data collection can replace conventional price collection for some item categories. A few national statistical offices currently make use of scanner and web scraped data, including those of the Netherlands, Norway, Singapore, Sweden, Switzerland, and New Zealand [16: 1001]. With these “big data” sources, statistical agencies and academic researchers have an opportunity to study many research issues that used to be operationally infeasible and purely theoretical, and to explore new methods to solve them.
As previously mentioned, the Argentine government’s intervention in INDEC during 2007 made clear that it had no intention of improving the national statistics; its purpose was to control them. People did not find official inflation credible as a cost-of-living reference. By that time, technology offered possibilities never explored before. One of these new paths was proposed early on by [27, 28], who designed algorithms that explore the HTML of web pages and extract and store specific parts of the code. Specifically, InflacionVerdadera.com [5] was created to provide alternative price indexes to the INDEC indices in Argentina. From 2007 to 2012 it published a Food and Drinks Index as well as a Basic Food Basket Index, using a combination of daily prices from two large supermarkets. Table 1 shows a large gap between the locally based inflation estimates (CREEBBA and IPC Online) and the INDEC CPI published at that time. Perhaps this divergence also represents the gap between people’s perception and official figures.
CPIs Descriptive statistics (Sep-2014/Jun-2018)
Source: The authors.
Web scraped data have since been used in many contributions. For instance, [22] tests the technique for collecting prices of consumer electronics (goods) and airfares (services) for the Italian Harmonised Index of Consumer Prices (HICP). [27] uses data web scraped from the World Bank poverty database to improve the analysis of poverty-related variables that were publicly available but dispersed and underutilized. More recently, [22] uses a web scraped database for estimating forecasts at the product level. [18] experiments with web scraped data collection for categories in the Norwegian CPI and also stresses that data must be treated carefully before being processed. On a more technical level, there is an ongoing discussion on different ways of processing web scraped data-based indices with procedures such as FEWS (Fixed Effects with a Window Splice), GEKS, or Törnqvist RYGEKS, which control for quality changes (isolating only price change effects) in the CPI. [16, 17, 9, 10] are notable references on these technical issues. These studies mainly show the potential use of web scraped data for isolated and special groups of products: electronics and airfares, vegetables, cell phones, and clothing, among others. However, none of the works cited has properly faced the task of building a full CPI that includes, if not all, at least a significant portion of the categories present in an index.
Since 2007, inflation measurement, among other socioeconomic indicators, has been considered distorted in Argentina [5]. With expansionary fiscal and monetary policies sustained since the 2001 debacle, demand expansion pushed prices up noticeably, resulting in a higher cost of living. While price regulations (and the ensuing controls) were present, they covered only a handful of products and services, and market pressures soon demonstrated that the purchasing power of incomes was deteriorating. The Argentine government, on the other hand, stuck to its own figures. There is now statistical evidence that the price variations published by INDEC do not follow a natural pattern of variation but were manipulated and biased downwards [20].
Private solutions emerged in the form of consultants who hired former INDEC employees to create a basket of goods and services and ran price surveys at low frequency (at least once a month), usually obtaining incomplete baskets. Citizens had no alternatives to the indicators generated by the statistical office. At first, the work itself was hardly reproducible given financial and technical restrictions. It was barely feasible to assemble a team of surveyors or to gain access to skills in survey design. Funding was also required for facilities and to pay proper wages to the personnel. From the citizens’ side of the problem, it seemed that not much could be done besides complaining.
In 2012 two young entrepreneurs without an economics background decided to use web scraping technologies to gather price data. They targeted supermarkets in their own city and collected thousands of prices over a year and a half; by 2014 the stock of data was large enough to begin processing. The startup then grew by adding an economist, who provided data organization and the methodological design for implementing a consumer price index. New data were added and a few items began to be collected by more traditional methods (by phone or by reviewing price lists). By September 2014 the index was ready to reach a public audience and was positively received.
The procedure for computing the CPI has customarily involved a careful design of price lists, guided here by the technical reports from INDEC and adapted to the local environment (for instance, subway tickets cannot be considered in a city that lacks that service). Web scraped price data sets tend to be significantly bigger and considerably messier than their hand-picked counterparts, to the point that preparing and categorizing the information by hand is impractical. Automating the assignment of prices to categories concordant with those of the national CPI is therefore a major component of the present project. The categorization was done with a “bag-of-words” machine learning algorithm that recognizes common words and presentations in order to classify products into categories. Even so, most of the data collected could not be processed. Approximately 200 thousand prices are collected each month, but only 45 thousand are processed in the CPI.
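The classification step can be illustrated with a minimal sketch. The project's actual classifier is not specified beyond being of the bag-of-words type, so the multinomial Naive Bayes model below, the class name `BagOfWordsClassifier`, and the product descriptions are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a product description and split it into word tokens."""
    return re.findall(r"[a-záéíóúñü0-9]+", text.lower())


class BagOfWordsClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing.

    Hypothetical stand-in for the project's classifier, which the text
    describes only as a bag-of-words machine learning algorithm.
    """

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of examples
        self.vocabulary = set()

    def fit(self, descriptions, categories):
        for text, category in zip(descriptions, categories):
            tokens = tokenize(text)
            self.word_counts[category].update(tokens)
            self.doc_counts[category] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of every token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Invented training examples; real scraped descriptions would be far messier.
training = [
    ("leche entera sachet 1 l", "Food and beverages"),
    ("yerba mate 500 g", "Food and beverages"),
    ("galletitas dulces 300 g", "Food and beverages"),
    ("remera algodon hombre talle m", "Clothing"),
    ("pantalon jean mujer talle 40", "Clothing"),
    ("lavarropas automatico 6 kg", "Home equipment and maintenance"),
    ("heladera con freezer 300 l", "Home equipment and maintenance"),
]
classifier = BagOfWordsClassifier().fit(*zip(*training))
print(classifier.predict("leche descremada 1 l"))
print(classifier.predict("jean hombre talle 42"))
```

With only a handful of training examples the model relies heavily on shared words, which is consistent with the observation that a large share of collected prices cannot be assigned to a category.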
Another milestone was reached after the new government won the election in November 2015, when the web scraping frequency was increased to four runs a month. Technically, data could be scraped daily or even hourly. By June 2016 another index had been designed, calculated, and released: the Construction Cost Index Online for Bahía Blanca. The project received a small amount of funding from the local national university.
It is interesting to note that [23], for instance, envisage their work on collecting thousands of data points and estimating a CPI as “potentially leading to the production of inflation measures pertaining to small demographic or geographic populations, on timescales of days rather than months.” The IPC Online project actually works at a limited geographical scope and with a higher frequency of data collection (weekly), making that hope a reality.
Data and outcomes
The project started by collecting information on food and beverage prices, which make up the main expenditure most regularly observed by families (weight 0.38 in the national CPI). Data were then stored and accumulated for a year and a half. After receiving technical advice and adding a new member to the team, the project widened its search list of web prices to include information on the 9 chapters (broad categories) of the INDEC CPI. In the following months the database grew with new prices while the algorithms did their work. In the spirit of [23], we briefly describe how the project works in terms that non-specialized readers can understand.
Monthly inflation measurement at Bahia Blanca: CREEBBA and IPC Online
Since 1996, Bahia Blanca has had a private institution that estimates local inflation using a traditional survey of a basket of goods and services. The Centro Regional de Estudios Económicos Bahía Blanca Argentina (CREEBBA) [Bahia Blanca Argentina Economic Studies Regional Center] has measured inflation on a monthly basis, starting in a period of highly stable prices in Argentina. The methodology is a scaled-down variant of INDEC’s, using no more than four surveyors who gather prices in the main supermarkets and mid-size retail businesses and also collect information on service prices. An average basket comprises 800 items, and prices are collected by hand once a month.
By September 2014, the new IPC Online index was publicly announced and outcomes were regularly publicized by local media. The whole process includes several stages:
At the beginning, data were collected once a month. Now they are collected four times a month, even though the software system is capable of running at a higher frequency. Raw data are classified by a machine learning algorithm of the “bag of words” type and assigned to the corresponding product level of the INDEC CPI category layout. Weights were adapted to the facilities and supply of the city, and a reweighting procedure was carried out in the design stage. To complete the whole set of INDEC CPI categories, some prices must be obtained in the traditional way, by gathering data in the field. Fewer than 100 items are obtained that way, mainly services (regulated and non-regulated), items from stores without an online presence, cigarettes, and school tuition. Coverage of the INDEC CPI categories ranges from 76% to 90%, depending on updates to the web lists and technical changes. Each monthly data take regularly includes an average of 12 thousand items and 40 thousand prices.
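As an illustration of the reweighting step, the sketch below drops categories that do not exist locally and rescales the remaining weights proportionally. The text does not describe the procedure actually applied, so proportional rescaling is an assumption, and all weights other than the 0.38 share of food and beverages are invented.

```python
def reweight(weights, excluded):
    """Drop excluded categories and rescale the remaining weights to sum to 1."""
    kept = {name: w for name, w in weights.items() if name not in excluded}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no weight left after exclusions")
    return {name: w / total for name, w in kept.items()}


# Hypothetical weights: only the 0.38 share for food and beverages comes from the text.
national = {
    "Food and beverages": 0.38,
    "Subway tickets": 0.02,
    "All other items": 0.60,
}
local = reweight(national, excluded={"Subway tickets"})
print(local)
```

Proportional rescaling keeps the relative importance of the remaining categories unchanged; other schemes, such as reassigning the dropped weight to a close substitute, would be equally compatible with the description above.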
By mid-2017, the team developed software that reads the outcome sheets, draws charts and writes in natural language a fairly standard and automated press release that is uploaded to a blog (
Inflation evolution at national level (INDEC) and Bahia Blanca level (CREEBBA and IPC Online). Source: The authors.

Kernel density of the general level of the three CPI (data from Sep-14 to May-18). Source: The authors.
Kernel density of the general level of the three CPI (data from Apr-16 to May-18). Source: The authors.
Bahia Blanca has the rare distinction of being the second city in the country (after Buenos Aires) where prices are both surveyed in the traditional way and scraped online for inflation estimation. The outcome of the IPC Online project can therefore be contrasted, to a certain extent, with an estimate made by a traditional survey.
The INDEC CPI series was interrupted for four months between December 2015 and March 2016, while the new administration dealt with the sabotage that the former officials had carried out on data, algorithms, and computers [19]. Procedures had been distorted, software was corrupted, and union conflicts hindered the normalization of the institution. That embarrassing process, called normalization, took months during which INDEC did not publish its CPI and endorsed the measurements of private consulting firms as valid measures of national inflation.
As observed in Fig. 1, inflation measured by INDEC shows lower rates than CREEBBA and IPC Online in the period before normalization, while the figures follow one another closely in the period after April 2016.
As for the frequency distribution of the outcomes, Fig. 2 shows the kernel densities of the three CPI series presented in Fig. 1. Again, the two local CPIs almost overlap, while the INDEC CPI does not match them over the whole period. However, Fig. 3 shows that the three are aligned when only the period after the aforementioned normalization in April 2016 is considered. These results are consistent with the extensive information presented in [5].
To underline this, Table 1 shows the descriptive statistics for the three series and the two samples. Again, in the whole sample the INDEC CPI is significantly lower in average, median, kurtosis, and skewness than the other two. In the later subsample, median and average converge, but kurtosis and skewness remain low.
This divergence is, of course, also present in the accumulated variation of the three CPIs. Figs 4 and 5 show the accumulated rates of variation of the three CPIs in the two samples, respectively. While Fig. 4 shows a departure from the very beginning of the whole sample, Fig. 5 shows the accumulated variation since the normalization of the INDEC CPI in April 2016.
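For readers who wish to reproduce statistics of the kind reported in Table 1, the following sketch computes the four measures for a series of monthly rates. The paper does not state which estimators were used, so the population (moment-based) skewness and non-excess kurtosis below are assumptions, and the sample series is invented.

```python
import statistics


def describe(series):
    """Mean, median, moment-based skewness and (non-excess) kurtosis.

    Assumes the series is not constant, otherwise the variance is zero.
    """
    n = len(series)
    mean = statistics.fmean(series)
    deviations = [x - mean for x in series]
    m2 = sum(d ** 2 for d in deviations) / n  # second central moment
    m3 = sum(d ** 3 for d in deviations) / n  # third central moment
    m4 = sum(d ** 4 for d in deviations) / n  # fourth central moment
    return {
        "mean": mean,
        "median": statistics.median(series),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


# Invented monthly rates (percent), not data from the paper.
summary = describe([1.0, 2.0, 3.0, 4.0, 5.0])
print(summary)
```

Statistical packages often report sample-adjusted skewness and excess kurtosis instead, so values computed this way may differ from published tables by a known transformation.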
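The accumulated variation of a series of monthly rates can be obtained by compounding, as in the sketch below. Compounding is the usual convention but is assumed here, since the text does not state how the accumulated rates were computed; the monthly rates are invented.

```python
def accumulated_variation(monthly_rates_pct):
    """Compound monthly percentage rates into a running accumulated percentage."""
    accumulated = []
    factor = 1.0
    for rate in monthly_rates_pct:
        factor *= 1.0 + rate / 100.0
        accumulated.append((factor - 1.0) * 100.0)
    return accumulated


# Hypothetical monthly rates (percent), not figures from the paper.
rates = [2.0, 3.0, 1.5]
running = accumulated_variation(rates)
print([round(x, 2) for x in running])
```

Because rates compound, a small but persistent monthly gap between two indices grows into a wide gap in accumulated terms, which is why the divergence is more visible in accumulated series than in monthly ones.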
CPI Accumulated variation (Sep-2014/May-2018). Source: The authors.
CPI Accumulated variation (Apr-2016/May-2018). Source: The authors.
In sum, this subsection shows that a web scraped data-based CPI was created and maintained by a minimal team, replicating the estimates of official and private institutions that rely on the hand-picking approach to price collection. The results are highly comparable, both visually and statistically.
In October 2015, elections were held and the opposition won. During the change of administration in November 2015, prices rose sharply week after week, but there was no official measurement of that evolution. Several economists pointed this out on social media, and the IPC Online staff were asked to supply any available information. By December 2015, the algorithms had been recoded for weekly price data collection and the results were made publicly available through the project’s website. Each month was divided into four weeks, labeled Week 1 (W1) from day 1 to day 7
Inter-monthly weekly accumulated variation (IMWAV)
Each week (weighted average price) is compared against the full CPI of the previous month. As the month advances, IMWAV adds each new week (weighted average price set) and again compares the result against the full CPI of the previous month. This way, we can observe how monthly inflation builds up through the 4th week.
It uses the following formula: let
Where
IMWAV in June 2018. Source: The authors.
Each week (weighted average prices) is compared only to the previous week. It reflects weekly inflation dynamics more accurately. WIMV (T-1) is defined by Eq. (2)
Figure 7 shows the T-1 inflation rate and the accumulated inflation since the beginning of the weekly collection period in December 2015.
Rate and accumulated weekly inflation T-1. Source: The authors.
Figure 8 expands on the data presented in Fig. 6: the exchange rate pass-through (ERPT) effect seems to have taken place in week 3, affecting the chapters Home equipment and maintenance and Other goods and services, both of which dragged the CPI along, while in week 4 prices rose mainly for Clothing, Food and beverages, and Recreation. This seems a simple method for an initial exploration of whether a depreciation passes through to prices.
T-1 Weekly rate of inflation, June 2018. Source: The authors.
Each week (weighted average price) is compared to the same week of the previous month, so February W3 is compared to January W3 alone. This creates four (relatively) independent indices that allow us to observe the different weekly dynamics and in which week increases and decreases occur.
Data are of course more granular at the new frequency, which makes it possible to anticipate monthly inflation by observing how it evolves week after week, in the case of IMWAV. By observing T-1 inflation, the reader can identify the week in which specific ups and downs occur and which categories they affect. WIMV (T-4) is defined by Eq. (3):
On the other hand, T-4 inflation is quite a particular measurement. It was designed at the suggestion of the many local economists interacting with the project’s Twitter account (@ipconlinebb). It was meant to be interpreted in the same fashion as monthly inter-annual comparisons, but for weeks and months. Once estimated, the information it portrays proved a little hard to understand. Each week is not comparable with the preceding or the following week (contrary to IMWAV or T-1) and stands on its own, compared against its own values of the previous month. T-4 weekly inflation is thus interpreted as how each week varies on a monthly basis, and its accumulated variation can be compared to that of the other weeks. It helps to observe whether inflation differs distinctly from week to week. Figures 9 and 10 show that price variations are higher in the first and fourth weeks (beginning and end of the month), while the intermediate weeks account for less of the total inflation.
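The three weekly measures can be contrasted with a small sketch. It follows the verbal definitions given in this section, not the project's own formulas. The function names, the reading of IMWAV as the running average of the weeks observed so far, and the index levels are all assumptions made for illustration.

```python
def pct_change(current, previous):
    """Percentage variation between two index levels."""
    return (current / previous - 1.0) * 100.0


def imwav(weeks_so_far, previous_month_level):
    """Running average of the weeks observed so far vs. the previous month's full CPI.

    Averaging the observed weeks is an assumed reading of "adds the new one".
    """
    running_level = sum(weeks_so_far) / len(weeks_so_far)
    return pct_change(running_level, previous_month_level)


def wimv_t1(weekly_levels):
    """Each week vs. the previous week, over a chronological list of weekly levels."""
    return [pct_change(cur, prev) for prev, cur in zip(weekly_levels, weekly_levels[1:])]


def wimv_t4(current_month_weeks, previous_month_weeks):
    """Each week vs. the same week of the previous month (W1 vs. W1, and so on)."""
    return [pct_change(cur, prev)
            for cur, prev in zip(current_month_weeks, previous_month_weeks)]


# Invented weekly index levels for two consecutive months.
january = [100.0, 101.0, 102.0, 103.0]
february = [104.0, 105.0, 106.0, 107.0]
january_level = sum(january) / len(january)  # assumed full-month level

for k in range(1, 5):
    print(f"IMWAV after W{k}: {imwav(february[:k], january_level):.2f}%")
print("T-1:", [round(x, 2) for x in wimv_t1(january + february)])
print("T-4:", [round(x, 2) for x in wimv_t4(february, january)])
```

In this reading, IMWAV converges to the monthly rate as the fourth week is added, T-1 forms a single chained weekly series, and T-4 yields four separate series, one per week of the month.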
WIMV T-4 (each week compared to same week previous month). Source: The authors.
Accumulated T-4 weekly inflation. Source: The authors.
Weekly estimates in a moderate inflation setting are in high demand among decision makers. Indeed, as mentioned, the weekly indices were developed under demand pressure. When prices were rising without official (and trustworthy) information, agents’ expectations became anxious. The project had been publicly releasing price information for just a year and four months, and the team was flexible enough to adapt to changes in data collection frequency. Now the project sometimes releases inflation information within the same month, and it has many times been the first in the country to release early inflation information.
At the beginning of 2015 the team began to scrape data from the main online construction and hardware stores located in Bahia Blanca. By June of that year the project began to make publicly available its estimate of a local Índice de Costo de la Construcción Online (ICC Online, or Online Construction Cost Index), one of the first to our knowledge; a proposal for an index of this kind appears to have been sketched in [26]. The index was based on the INDEC Construction Cost Index, an indicator of how the prices of building items evolve, grouped in three chapters: construction materials (0.46 weight), wages and labor costs (0.456), and general expenditures (0.084). The first publicly released results were compared to the national INDEC CCI. The resemblance between the two was quite marked, as depicted in Fig. 11. The significant peaks in the rate of variation correspond to the chapter Wages and labor costs, given that official salaries are determined by the collective wage agreement of the union (Construction Worker Union of the Argentine Republic, or Unión Obrera de la Construcción de la República Argentina, UOCRA, in Spanish). Each peak corresponds to a month in which wages are updated for inflation, which shocks the second largest chapter by weight. Construction materials follow similar time series too.
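The effect of a wage update on the general level can be illustrated with the chapter weights quoted above. The sketch assumes that the general variation is the weighted sum of the chapter variations, which is only an approximation for a fixed-weight index; the chapter variations are invented.

```python
# Chapter weights as quoted in the text for the INDEC Construction Cost Index.
CCI_WEIGHTS = {
    "Construction materials": 0.46,
    "Wages and labor costs": 0.456,
    "General expenditures": 0.084,
}


def general_level_variation(chapter_variations_pct, weights=CCI_WEIGHTS):
    """Weighted sum of chapter variations (assumed aggregation rule)."""
    return sum(weights[chapter] * chapter_variations_pct[chapter] for chapter in weights)


# Hypothetical month with a collective wage update and little movement elsewhere.
wage_update_month = {
    "Construction materials": 1.0,
    "Wages and labor costs": 10.0,
    "General expenditures": 1.0,
}
total = general_level_variation(wage_update_month)
print(f"General level: {total:.3f}%")
```

With a weight of 0.456, a 10% wage update alone contributes about 4.6 percentage points to the general level, which is consistent with the peaks described for months of wage updating.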
Construction cost indices evolution at national level (INDEC) and Bahia Blanca level (Online CCI). Source: The authors.
Kernel densities of the two CCI. Source: The authors.
Descriptive statistics show quite similar values, in line with the preceding figures.
CCIs descriptive statistics
Source: The authors.
Web scraped data-based price indexes, including of course IPC Online, have various weaknesses inherent to the current limitations of the technique. The problems relate mainly to the irregular presence of product offers in web price lists and to the more intrinsic question of whether the captured price is what a consumer actually faces when going to a supermarket. A short list of problems follows:
The project does not apply any quality adjustment to products and services, so price changes associated with quality changes are not taken into account. In categories such as electronics, airfares [22], or clothing [13], changes in quality are frequent and diverse [17, 18]. There are also sector-specific shocks and sampling errors, as has long been noted for any fixed-weight index [1].
Products appear and disappear in web lists, so the baskets of goods and services representing categories change in composition. The project currently deals with this problem by always comparing two identical baskets in the current vs. the preceding period. In any case, baskets may not be directly comparable over longer periods.
Web scraped data cover at least three quarters of the information required to replicate the INDEC CPI. It is still not possible to cover the full CPI with web scraped data, and the rest of the data (mainly services and local products) must be obtained manually.
Prices in web lists are not necessarily the same as the prices in the supermarket aisle. This problem is present all across the world to different degrees, according to [6].
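The idea of comparing two identical baskets can be sketched as follows: only items priced in both periods enter the calculation, so items that appear or disappear do not distort the variation. The aggregation used here, a weighted arithmetic mean of price relatives, is an assumption, since the text does not state the elementary formula; item names and prices are invented.

```python
def matched_basket_variation(previous_prices, current_prices, weights=None):
    """Percentage change of a basket restricted to items priced in both periods."""
    common = sorted(set(previous_prices) & set(current_prices))
    if not common:
        raise ValueError("no items in common between the two periods")
    if weights is None:
        # Equal weights when none are supplied (assumption).
        weights = {item: 1.0 for item in common}
    total_weight = sum(weights[item] for item in common)
    relative = sum(
        weights[item] * current_prices[item] / previous_prices[item] for item in common
    )
    return (relative / total_weight - 1.0) * 100.0


# Invented prices: item "c" disappears and item "d" appears in the current period.
previous = {"a": 10.0, "b": 20.0, "c": 5.0}
current = {"a": 11.0, "b": 20.0, "d": 7.0}
variation = matched_basket_variation(previous, current)
print(f"{variation:.2f}%")
```

Chaining such period-to-period variations keeps each comparison like-for-like, but, as noted above, the basket underlying the chained index still drifts over longer horizons.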
Strengths can be found in the enormous flexibility of the operation, associated with using robots for repeated tasks. Prices are collected, together with feature information, in the millions, and this may accelerate depending on the frequency and scope of data collection. A few main strengths should be highlighted:
The costs of implementation are almost incomparable. This project in particular began with zero funding and has so far received less than 1,500 euros in state-funded grants and seed capital, with no full-time employees and no physical structure costs. IPC Online works as an academic, non-profit project. Institutions that follow traditional methods for calculating the CPI in the city spend at least a hundred times more.
Efficiency can be observed in the figures presented in this contribution: all estimates closely follow those of traditional methods.
The frequency of data collection can be customized as required, with the consequent problem of managing (more) millions of prices, which in this particular case would create a bottleneck downstream in item categorization. A smaller price collection structure will of course be less demanding.
Whenever stores release, change, or update their price lists, the captured data keep stacking up in a database that rapidly reaches millions of prices. This has no parallel in terms of human effort under traditional methods.
Automation also extends to drafting the final reports on inflation. Again, human intervention is not used at this stage.
Throughout this contribution we have argued that web scraping is an empowerment tool. The motivation of IPC Online, quite in the spirit of Cavallo’s contributions, was to act and to hold accountable a policy that misinformed economic agents. It allows cross-checking a highly important variable such as prices at different geographical scopes. We have shown how urban indices based on web scraped data were estimated and contrasted with traditional indices, yielding highly efficient measurements at one thousandth of the cost. Combined with the many other variables that can be captured or downloaded from the Internet [15], this paints a future of combined indicators for tracking socioeconomic life, even remotely.
Another important aspect is automation: most phases of IPC Online involve no human intervention. The process is as follows. Algorithms autonomously collect and sort the data and, once finished, wait for the order to calculate. Once the calculation is finished, an algorithm reads the results, captures monthly inflation, core inflation, and chapter (main category) variations, and writes a text laid out as a basic press release. Finally, it waits for approval before uploading it to the site. One can envisage a future in which statistical offices all across the world perform a relatively similar task.
Indeed, many of the uses discussed in the recent and growing literature on web scraped data-based indices express optimism about using this information as a complement to traditional indices. This contribution adds the positive experience of embarking on a full-scale CPI that relies on web scraped data, with hand-picked prices as the complement. The final outcome in the case of IPC Online does not deviate much from that of more complex and costly data collection processes: inflation rates are remarkably similar even with higher frequency, more data, and fewer personnel.
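The last two stages of this process, drafting the text and waiting for approval, can be sketched as follows. The wording of the release, the function names, and the figures are hypothetical and do not reproduce the project's software.

```python
def draft_press_release(month, headline_rate, core_rate, chapter_rates):
    """Turn computed results into a short, standardized text (hypothetical wording)."""
    top_chapter, top_rate = max(chapter_rates.items(), key=lambda item: item[1])
    sentences = [
        f"IPC Online: prices changed {headline_rate:.1f}% in {month}.",
        f"Core inflation was {core_rate:.1f}%.",
        f"The chapter with the largest variation was {top_chapter} ({top_rate:.1f}%).",
    ]
    return " ".join(sentences)


def publish(text, approved):
    """Release the text only after human approval; otherwise hold it back."""
    return text if approved else None


# Invented figures for illustration only.
release = draft_press_release(
    month="June 2018",
    headline_rate=3.9,
    core_rate=3.1,
    chapter_rates={"Food and beverages": 4.2, "Clothing": 5.0, "Recreation": 2.7},
)
print(publish(release, approved=True))
```

Keeping the approval step as an explicit gate mirrors the description above, in which the text is generated automatically but is uploaded only after it has been approved.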
