Abstract
Refining big data is a new multipurpose way to find, collect, and analyze information obtained from the web and off-line information sources about any research subject. It gives the opportunity to investigate (with an assumed level of statistical significance) the past and current status of information on a subject, and it can even predict the future. The refining of big data makes it possible to quantitatively investigate a wide spectrum of raw information on significant human issues—social, scientific, political, business, and others. Refining creates a space for new, rich sources of information and opens innovative ways for research. The article describes a procedure for refining big data and gives examples of its use.
Introduction
Almost all of the information generated in the world is recorded in digital form. Massive, unstructured, and structured digital resources are referred to as “big data” (BD). Analysis of such data can provide a new source of valuable information (Brabham, 2017). The process of obtaining it—mainly from the web—is called refining information (RI).
RI offers an opportunity to investigate (with an assumed level of significance) and describe the past and current status of hidden information about a subject or phenomenon of interest. With a certain probability, we can even predict the future (Salcedo, 2017). Generally, RI creates a space of rich and valuable sources of useful information for all intellectual human activity.
The usefulness of RI can be confirmed by the results of deep analysis of different fields of information: education, journalism, business, science, politics, and others. Our research, which covered both the past and the present, was done at the Faculty of Journalism, Information, and Book Studies at the University of Warsaw. It enabled us to quantitatively investigate social trends, using the collective memory of the web. Thanks to RI and the adoption of particular information technology tools, we identified data on subjects that were, and remain, important for the media and for technological innovation. This was confirmed by our predictions of the results of parliamentary and presidential elections in Poland in 2011 and 2015 (Gogołek, Jaruga, Kowalik, & Celiński, 2015) and by research carried out for the National Centre for Research and Development, aimed at finding the trends in technological innovations (Ceron, Curini, & Iacus, 2015; Ecker, 2017).
Steps of Refining Information
In the RI procedure, there are several steps that must be taken to collect and analyze information. They enable the accurate assessment of the subject under study in terms of attributes (words describing the subject) or in terms of special attributes, or sentiments (words describing emotions, e.g., good, bad, or other words perceived as emotional evaluations of the subject). The assessment covers the past and the present (real time), as well as prediction of the future.
The first five steps in this procedure are the basic part of RI (Figure 1). They form the basic procedure used in the RI computer system (software). The last two steps in this procedure (omitted in the article) concern the precise modeling of the subject and feedback regarding the procedure.

Figure 1. Steps of refining information (RI).
The First Step: Defining the Pillar: The Subject for Refining
The pillar is the subject under study. This can be the name of a brand, product, political party, organization, city, person, and so on. The pillar does not have to be restricted to a single word or phrase. On the contrary, it may be a whole set of words and phrases that are synonymous or antonymous, or a set of words and phrases concerning a given subject matter (e.g., Internet of Things, artificial intelligence [AI]). The pillar can encompass all the possible forms of a word or expression, including neologisms, words containing spelling mistakes or typos, and even hashtags, which are becoming increasingly common.
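A pillar defined as a set of surface forms can be sketched as a single search pattern. The following is a minimal illustration only, not the RI system's implementation: the list of variants, the function names, and the choice of a regular expression are all assumptions made for the example.

```python
import re

# Hypothetical pillar: these surface forms are chosen for illustration only.
PILLAR_VARIANTS = [
    "artificial intelligence",
    "artifical intelligence",      # a misspelled form
    "AI",
    "#AI",                         # hashtag forms
    "#ArtificialIntelligence",
]


def build_pillar_pattern(variants):
    """Compile one case-insensitive regex that matches any variant as a whole token."""
    # Try longer variants first so that a phrase wins over its own prefix.
    ordered = sorted(variants, key=len, reverse=True)
    alternatives = "|".join(re.escape(v) for v in ordered)
    # (?<![\w#]) and (?!\w) keep "AI" from matching inside words such as "said".
    return re.compile(r"(?<![\w#])(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def count_pillar(text, pattern):
    """Return the number of pillar occurrences in one text."""
    return len(pattern.findall(text))


PATTERN = build_pillar_pattern(PILLAR_VARIANTS)
```

Keeping all variants in one pattern means a text is scanned once per pillar, whichever spelling or hashtag form it uses.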
The Second Step: Identification of Sources of Dedicated Information
The material subjected to RI consists of BD source materials (materials) acquired from the web or from another valuable collection of BD available off-line in digital form (Marx, 2013). The final outcome of RI is the result of statistical analysis of the key phrases that make up the pillar and of the sentiments or attributes surrounding it. Thanks to RI, it is possible to extract new, valuable information that was hidden in the materials, for example, the evaluation of a social phenomenon (support, satisfaction, negation).
The information that is used by RI is identified using detection systems, for example, Web of Science and Scopus (tools: SciVal, InCites). Depending on the subject under study, the sources used include data and content from public institutions (e.g., the government, local government, European Union), commercial organizations, full-text databases such as DOAJ (Directory of Open Access Journals), patent databases, social websites, blogs, microblogs, forums, and RSS (Really Simple Syndication) feeds (e.g., from newspapers). On average, 70,000 sources are identified for a study. Each of them supplies from one to a hundred materials.
The Third Step: Collecting and Cleaning Source Materials
The next step, collecting the materials, is performed by a special RI robot (robot). This step consists of three stages: reading information from a source (mainly the web), filtering and purifying the content, and transforming it into the standardized form required by the RI system.
Using the sources, a robot collects the materials concerning the subject of research (e.g., election in a country, problems of health care in a region). The fundamental value of the materials collected by a robot is the independence of their selection: nobody censors the analyzed materials in any form.
The materials available online can be divided into three basic groups based on the manner of presentation: textual materials, images of text, and sound. At present, the RI system is able to analyze textual materials and text images. This does not preclude the use of audio materials examined using tools that analyze human speech. Graphic materials containing texts (e.g., parts of PDF documents) are analyzed using OCR (optical character recognition) technology.
The textual material obtained from the materials has to be given a standardized form before being subjected to further analysis.
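What "a standardized form" might involve can be sketched as follows. The article does not specify the transformation, so every step here (removing markup, decoding HTML entities, Unicode normalization, collapsing whitespace, lowercasing) and the function name `standardize` are assumptions made for illustration.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # any HTML/XML tag
SPACE_RE = re.compile(r"\s+")     # runs of whitespace, including non-breaking spaces


def standardize(raw):
    """Turn one collected material into plain, normalized, lowercase text."""
    text = TAG_RE.sub(" ", raw)                  # drop markup left over from the page
    text = html.unescape(text)                   # decode entities such as &amp;
    text = unicodedata.normalize("NFKC", text)   # unify compatibility characters
    text = SPACE_RE.sub(" ", text).strip()       # collapse whitespace
    return text.lower()
```

Lowercasing simplifies later frequency counts but discards capitalization, which a real study might want to keep for proper names.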
The Fourth Step: Attributes and Sentiments
In many cases, the purpose of RI on the pillar is to identify statistically significant attributes: the words that most often appear in the vicinity of the pillar. With the passage of time and changes to the subject of research, attributes become more or less significant. Due to this, identification procedures for attributes must be continuously updated. The most popular attributes create a dynamic picture of the pillar (Table 1).
Table 1. List of a Selection of Pillars of the Research Subject Industry 4.0.
Note. The table gives the number of occurrences, percentage share, and prediction of each pillar (January 2017 to November 2018).
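The idea of words appearing "in the vicinity of the pillar" can be sketched as a sliding window over a tokenized text. This is a simplified illustration: the window size, the single-token pillar, and the function name `neighbors` are assumptions, since the article does not state how the vicinity is measured.

```python
from collections import Counter


def neighbors(tokens, pillar, window=5):
    """Count words found within `window` tokens on either side of each pillar hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != pillar:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Skip the hit itself and any other occurrence of the pillar.
            if j != i and tokens[j] != pillar:
                counts[tokens[j]] += 1
    return counts
```

Calling `neighbors(tokens, "ai").most_common(20)` would then list candidate attributes for a hypothetical pillar "ai", ordered by how often they occur near it.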
For some pillars in specific research, it is reasonable to count the occurrences of sentiments instead of attributes. A sentiment is a word with an emotive nature, for example, “good” or “bad.” In the case of predicting the results of general elections, the number of positive sentiments makes it possible to anticipate the electoral success or failure of a candidate.
The RI system assumes that sentiments of pillars can be determined by means of three procedures: (1) the researcher’s intuition (Delphi method), based on examining a random sample of texts from the source dataset that will be subjected to investigation (Table 2); (2) available dictionaries of words that can be regarded as sentiments and that have so far been verified experimentally; and (3) a selection of attributes/presentiments (words) based on a frequency analysis of all the words from the source dataset selected for the subject. Once identified, the sentiments are verified through objective, independent sentiment assessments.
Table 2. Example of Positive Sentiments for the Pillar Business, Obtained by the Delphi Method (Ordered by Frequency From Left to Right and Down).
Note. The sentiments were used to refine stock exchange data.
The Fifth Step: A Simple Model of Refining
The natural need of any business or work that requires decision making is access to current, reliable information assessing the state of a particular subject, individual, or event. The aim is to access the present and background facts (history, relative information) in an effective, clear manner.
The simple model in the fifth step allows the attributes that describe the pillar to be identified, for example, quotes from candidates in the presidential election. In some types of research (e.g., elections), the first part of the procedure performed using the model is to identify the sentiments—a special set of attributes that describe emotion (Table 3)—that occur most frequently around the pillar. For each sentiment, the frequency is calculated at time intervals.
Table 3. List of Positive and Negative Sentiments.
Source. M. M. Bradley and P. J. Lang, Affective Norms for English Words (ANEW). The full list of words is not publicly available; it was obtained, at the special request of the authors, from its creators at the Center for the Study of Emotion and Attention at the University of Florida, http://csea.phhp.ufl.edu/media/anewmessage.html (accessed April 2, 2013).
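Calculating the frequency of each sentiment at time intervals can be sketched as below. This is an assumption-laden illustration: the three-word dictionary is invented for the example (it is not the ANEW list), the interval is taken to be one ISO calendar week, and the texts are assumed to be already standardized and separated by whitespace.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical mini-dictionary; a real study would use a vetted word list.
POSITIVE = {"good", "win", "support"}


def weekly_sentiment_counts(materials, sentiments):
    """Map (ISO year, ISO week) to a Counter of sentiment words seen in that week.

    `materials` is an iterable of (publication date, text) pairs.
    """
    by_week = defaultdict(Counter)
    for day, text in materials:
        year, week, _ = day.isocalendar()
        for word in text.lower().split():
            if word in sentiments:
                by_week[(year, week)][word] += 1
    return dict(by_week)
```

Each week's counter then supplies one observation per sentiment, which is the form of input a regression over time intervals needs.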
Analyzing how the sentiments of the pillar change across time intervals allows the state of the pillar at time $t_{n+1}$ to be predicted solely from the frequencies of sentiments (predictors) observed up to time $t_n$. The calculations employ multiple regression analysis, which is used to build the model that best fits the empirical data (frequencies of sentiments) observed up to $t_n$ and enables the status of the pillar at $t_{n+1}$ to be evaluated. This is possible because the regression model fits the data obtained at $t_{n+1}$ better (with statistical significance) than it fits random data.
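The regression step can be sketched with ordinary least squares. This is only an illustration under stated assumptions: the article does not say how its model is estimated, so the use of normal equations, the helper names `solve`, `fit_ols`, and `predict`, and the absence of any significance test are choices made for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(rows, targets):
    """Return least-squares coefficients [intercept, b1, b2, ...].

    Each row holds the sentiment frequencies of one interval; each target is the
    pillar value observed in the following interval.
    """
    xs = [[1.0] + list(r) for r in rows]               # prepend the intercept column
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, row):
    """Evaluate the fitted model for one new row of sentiment frequencies."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))
```

The sketch fails with a division by zero when predictors are perfectly collinear, and it says nothing about statistical significance, which would have to be assessed separately.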
Case Study: Using Sentiments to Predict the Outcome of an Election
The RI procedure outlined above was applied to predict the result of parliamentary and presidential elections in Poland (Gogołek & Kuczma, 2013a, 2013b; Gogołek et al., 2015). On the basis of only sentiments, a forecast of election results was obtained 3 days prior to the day of presidential and parliamentary elections.
The gap between the Duda/Komorowski difference indicated by the positive sentiments collected on the eve of the election (May 23, 2015) and the actual difference in the election results (May 24, 2015) was approximately 0.66 percentage points (actual difference: 3.10%, result of RI: 2.44%; see Figure 2).

Figure 2. Dynamics of changes in the number of positive sentiments for candidates in the presidential election.
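The reported forecast error follows directly from the two differences quoted in the text. The short check below uses only those published figures; the function name `forecast_error` is an assumption made for the example.

```python
def forecast_error(actual_margin, predicted_margin):
    """Absolute gap, in percentage points, between the actual and forecast margins."""
    return round(abs(actual_margin - predicted_margin), 2)


# Figures quoted in the text: actual difference 3.10%, RI result 2.44%.
error = forecast_error(3.10, 2.44)
```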
Case Study: Using Attributes to Determine the Fastest Growing Development Trends in Artificial Intelligence
The RI procedure was also applied to identify development trends in AI by identifying and analyzing attributes.
Usually, there are around 10,000 to 15,000 words in the vicinity of each pillar. The frequency of occurrence of each word and the frequency of occurrence of the pillar are calculated over a given period, for example, every 7 days for a year. Two variables are obtained, each consisting of 52 measurements (the number of weeks per year). A correlation between the frequencies of occurrence of the word and those of the pillar that exceeds the accepted threshold (e.g., r > 0.8) indicates that the word is an attribute. The basic attributes of AI are given in Table 4.
Table 4. Frequency of Attributes Related to Artificial Intelligence (I.2017-XI.2018).
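The correlation test for attributes can be sketched as follows. The threshold of 0.8 comes from the text; the use of the Pearson coefficient, the function names, and the handling of constant series are assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0   # a constant series cannot track the pillar
    return cov / sqrt(vx * vy)


def select_attributes(pillar_series, word_series, threshold=0.8):
    """Return the words whose frequency series correlates with the pillar's above the threshold."""
    return sorted(
        word
        for word, series in word_series.items()
        if pearson(series, pillar_series) > threshold
    )
```

In the setting described above, each series would hold 52 weekly frequencies, and `word_series` would contain the 10,000 to 15,000 words found near the pillar.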
Analysis of AI can be extended by noting changes in the frequency of occurrence of its attributes. Figure 3 shows linear and exponential functions fitted to nine attributes of AI by the method of least squares. The linear function is marked with a blue continuous line. The exponential function is drawn with a red dotted line. Black points show the number of publications/sources.

Figure 3. Quantification of attributes of artificial intelligence (I.2017-XI.2018).
The graphs show that the fastest growing AI development trends are reflected by its attributes: neural network, machine learning, deep learning, heuristic methods, fuzzy sets, rough sets, autonomous robots, and robot control.
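Linear and exponential trend lines of the kind described for Figure 3 can be approximated as below. This is a sketch under assumptions: the exponential curve is fitted by a least-squares line through the logarithms of the counts, which requires strictly positive counts and is not necessarily the method used for the figure; the function names are also illustrative.

```python
from math import exp, log


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def fit_exponential(xs, ys):
    """Fit y = A * exp(k*x) via a least-squares line through (x, ln y); returns (A, k)."""
    ln_a, k = fit_line(xs, [log(y) for y in ys])
    return exp(ln_a), k
```

Comparing the slope `b` or the growth rate `k` across attributes gives one simple way to rank them by how fast their frequency grows.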
Conclusion
The outlined RI procedure and the results of quantitative research using BD show that RI can be a reliable source of information about the status, changes, and prediction of changes of a process, event, thing, or person. This helps diagnose the conditions and dynamics of changes of the subject under study.
RI provides a meaningful way to discover new information sources for a wide spectrum of uses—from election study to tracking changes in AI.
Experience gained from the studies conducted for this paper (nearly 100 in total) allowed us to create an effective multipurpose system for refining scientific and economic information. It allowed us to collect dedicated information (using fairly sophisticated robots) from online and off-line sources in our private database in order to quantify the subject under study. It also helped to indicate new information, point out determinants, and predict changes of the subject.
The next steps in the RI procedure are the precise modeling of the subject under study and feedback regarding the procedure.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: the National Centre for Research and Development, Poland.
