A quantitative and text-based characterization of big data research

Abstract

This paper maps the research carried out in the field of Big Data through a detailed analysis of scholarly articles published on the theme during 2010-16, as indexed in Scopus. We collected all relevant publications on Big Data indexed in Scopus and analyzed them through both quantitative and textual characterization. The analysis delves into parameters such as research productivity, growth of research and citations, thematic trends, top publication sources and emerging topics in the field. The study also investigates country-wise publication output and impact in terms of average citations per paper, country-level collaboration patterns, authorship, and leading contributors (countries and institutions). The publication data is further subjected to a detailed textual analysis to identify key themes in Big Data research, disciplinary variations, and thematic trends and patterns. Quantitative measures show a steep increase in the number of publications related to Big Data over the last few years. Research in Big Data, though primarily considered a sub-discipline of Computer Science, is now carried out by researchers in many disciplines. Thematic analysis of the publications shows that Big Data attracts research interest from fields as diverse as Medicine and Social Sciences. The paper also identifies major keywords now associated with Big Data research, such as Cloud Computing, Deep Learning, Social Media and Data Analytics. Together, these results support a thorough understanding and visualization of the Big Data research area.

Keywords

Big data; big data analytics; data science; scientometrics

1. Introduction

Big Data refers to datasets that are so large as to pose challenges in storage and analysis via traditional data handling techniques. Big Data Analytics (BDA) is the umbrella term given to the practice of collecting, organizing and analyzing large sets of data (Big Data). Big Data analytics allows organizations to better comprehend the information contained within the data and helps in identifying the data that provides insightful knowledge for current as well as future business decisions. Big Data has percolated into possibly all domains of technology and is gaining huge attention from academia as well as industry and governments. Big Data analytics as a research area has become very important in recent years. It now encompasses all the techniques used to analyze data at large scale, spanning health care, policy making, astronomy, city planning, education, telecommunications, banking, IT and risk management, advertising, marketing and other strategic business domains.

Although Big Data has become a popular area of research, several competing definitions still try to explain the area. There are many debates about it and no single agreed definition exists. Three types of definition are found in the literature [1]: Attribute Definition, Comparative Definition and Architectural Definition. These three definitions cover the major concepts and key points made about Big Data by industry and academic experts.

International Data Corporation (IDC) 1 defines Big Data as a technology that is used to extract value from data sets so large that they are beyond the processing capabilities of traditional relational approaches. Definitions apart, Big Data is a result of our current technological capacities: we have come to a point where we can gather, store, and analyze massive amounts of data. The term Big Data has been floating around for almost two decades under the tags of high-performance data mining, text mining, predictive analytics, forecasting and optimization, and more lately data science [2]. However, it is only in recent years that the field has gained such popular resonance, and fresh, better and more effective approaches and technologies to manage Big Data are now being researched openly and rapidly.

Research advancements and results are typically communicated by publishing the research work. As science progresses, researchers around the globe consistently produce an ever-growing volume of ‘scholarly data’, which provides the basis for the worldwide dissemination of scientific findings. This scholarly data can be analyzed mainly by analyzing academic social networks and mining scholarly text. Analyzing this scholarly data leads to a better understanding of the science of science [3]. It is in this context that this paper attempts to map the research done in the area of Big Data by using scientometric and text-analytics methodologies.

The scholarly data on the theme of ‘Big Data’ published during last seven years (2010-16) in reputed journals and conferences, as indexed in Scopus 2 international multidisciplinary bibliographical database, are collected for analysis. This data is then computationally analyzed to:

compute standard scientometric indicators of research area,

identify growth pattern of published work in Big Data,

understand authorship & collaboration patterns,

analyze citation impact,

identify top publication sources,

identify most productive countries & institutions,

understand the thematic trends,

characterize the interdisciplinary research landscape of Big Data, and

The paper presents a comprehensive and analytical mapping of the emerging area of Big Data research. The assessment obtains distinctive and useful information about the activity and patterns in Big Data research. The rest of the paper is organized as follows. Section 2 presents a general overview of the Big Data discipline and the broader objectives of this analytical mapping effort. Section 3 describes related earlier work. Section 4 describes the data collected for analysis. Section 5 presents computed results on the distribution and growth of research output in Big Data, as seen in major journals and conferences. Section 6 describes the authorship patterns and collaboration behavior among researchers in Big Data. Section 7 identifies the major countries and institutions doing research in Big Data, and the top publication sources in which Big Data research is published. Section 8 presents disciplinary trends and variations observed in Big Data research output. Section 9 describes the major keywords and themes observed in published work in Big Data. A reasonably comprehensive list of major Big Data Analytics software and platforms is provided in Section 10. The paper concludes in Section 11 with a short summary of the paper and its relevance and usefulness for the Big Data research community.

2. Overview and objectives

Big Data covers all those techniques and technologies that require new insights to uncover hidden values and patterns from large datasets that are diverse, complex, and of a massive scale. In the past, Big Data has been characterized by three Vs, namely Volume, Velocity and Variety, as originally given by Gartner [4]. Volume refers to the sheer size of the data, Variety refers to the complex and varied forms of data, and Velocity refers to the high speed at which data is generated and must be processed. Later, two more Vs (Value and Veracity) were added to the definition. Value, added to this list by IDC [5], refers to the deep hidden value to be mined from Big Data sets. Veracity [6] refers to the measure of accountability or trustworthiness of the data. There have been attempts to efficiently manage these data characteristics [7]. The interest of people in Big Data can be observed through an analysis of search queries on the term. Fig. 1 shows interest over time as measured from search queries. Here the numbers represent search interest relative to the highest point on the chart for the given time. A value of 100 corresponds to the peak popularity of the term, a value of 50 means that the term is half as popular as at its peak, and a score close to 0 indicates negligible search interest. We can clearly observe that Big Data is generating enormous interest.

Fig.1

Research interests over time.

The ideas of the Internet of Things (IoT) and cloud computing (both of which started around 2010) gave momentum to research work in Big Data, which in turn called for trained manpower in the area. According to Tableau 3 (a business intelligence firm), the convergence of IoT, cloud, and big data will create new opportunities for self-service analytics. It may well be that in the coming years everything will have a sensor that sends data back to a central network. Gigantic volumes of structured and unstructured information are being generated by IoT, and a considerable share of this information is being deployed on cloud services. The data thus produced is often heterogeneous and resides across multiple systems, ranging from NoSQL databases to Hadoop clusters. While advancements in storage and processing are accelerating, accessing and understanding the big data itself still represents a noteworthy challenge. As a result, demand is growing for analytical tools that seamlessly interface with and consolidate a wide range of cloud-hosted data sources. Such tools empower organizations to explore and visualize any kind of data stored anywhere, helping them find hidden opportunities in their IoT investments.

As far as the cloud is concerned, in most big data scenarios a major share of the information comes from external sources, such as social media, demographic data, web data, events and feeds. Organizations perceive the growing significance of online social networking, though they face difficulties in utilizing its potential. Reasons why big data fits well in the cloud include Big Data’s requirement for a range of cutting-edge tools, skills, and investments; the involvement of huge amounts of external and distributed data; and the requirement for data services [8, 9]. Smart and intelligent services in healthcare [10], cities [11], education [12], businesses [13], social sensing [14] and disaster management [15] are all examples of the amalgamation of these phenomena. At the beginning of the data science era, the majority of research papers concentrated on the conception of multidimensional data. The temporal and streaming aspects are the most studied topics nowadays and are being examined in the context of complex data types such as graphs, social networks, and social streams. This area of analysis is rich and yet remains largely unexplored.

The main objective of this research paper is to analyze the metadata available in the scientific research publications produced on Big Data during the last seven years. The set of research papers is obtained from Scopus and contains all relevant documents which have at least one occurrence of the term Big Data in the title, keyword or abstract fields. Our aim is to present quantitative measures of research work in Big Data and to identify major keywords, themes, and disciplinary trends and variations in Big Data research. Various kinds of analysis are performed for this purpose on the full data and on data for three key journals in the Big Data research area.

3. Related work

Computational analysis of scientific articles in a specific discipline has been a popular line of research. Such studies attempt to acquire an in-depth understanding of the chosen field of research. Although no existing work could be found that conducts a quantitative and text-based mapping of the scholarly data available on ‘Big Data’ in Scopus, as conceptualized and carried out in this paper, several previous studies have helped in formulating the research plan and carrying out the analytical work of the present paper. In envisaging the budding area of Big Data in the recent past, various authors have enriched the field with their visions and laid the foundation for a deeper understanding of core Big Data concepts and definitions. One previous work [16] has inspected the myths that circulate widely due to the novelty of the research area, whereas another work [17] discusses the significance of studying Big Data and the challenges posed by the existence of Big Data. Several platforms and tools [18] related to storage, processing and retrieval are instrumental in increasing the efficiency of studies in Big Data science.

A few other papers deal with the characterization and analysis of research on Big Data. Boyd and Crawford [19] raised some crucial questions about Big Data research and offered answers in their work. Hoskins [20] discusses trends in thinking about big data. Chen et al. [21] discussed the future prospects and growth of Big Data as a research theme, along with a vision for the theme. Park & Leydesdorff [22] examined the social and semantic networks that emerge in Big Data. Halevi and Moed [23] inspected the evolution of research related to Big Data, focusing on descriptive statistics and using the Scopus database. However, to the best of our knowledge, the analytical characterization of Big Data research presented here has no parallel. This paper presents a comprehensive approach for quantitative and text-based characterization of Big Data research published during the last seven years. It identifies the rate of growth of publications, differential growth across publication types, and the distribution of papers among countries, institutions and journals. Subsequently, results of text-based analysis are presented which help in identifying major keywords, themes in the area and thematic trends. Analysis of three selected Big Data journals also presents useful insight about the Big Data research area.

4. Data

The analysis is based on research papers published during the last seven years (2010-16) and indexed in Scopus. The search query used for the data collection is given in Table 1. A total of 25,334 records were found as a result of this search query. The data collected includes documents of the types: conference paper, conference review, article, article in press, review, short survey, editorial, book chapter, book, note and letter. Out of these categories of scholarly data, ‘article’, ‘review’ and ‘article in press’ have been considered as journal publications, while ‘conference paper’ and ‘book chapter’ have been considered as conference publications. The remaining categories of publications were not taken into account for analysis.

Table 1
Dataset details

Query
[TITLE-ABS-KEY (“BIG DATA”) AND (LIMIT-TO (PUBYEAR, 2016) OR LIMIT-TO (PUBYEAR, 2015) OR LIMIT-TO (PUBYEAR, 2014) OR LIMIT-TO (PUBYEAR, 2013) OR LIMIT-TO (PUBYEAR, 2012) OR LIMIT-TO (PUBYEAR, 2011) OR LIMIT-TO (PUBYEAR, 2010))]
Category	Number of Publications
Journal	8720
Conference	16614

The whole analytical mapping has thereafter been carried out separately for journal publications and conference publications. After grouping and filtering the data, a total of 8,720 journal publications and 16,614 conference publications were retained. The data in Scopus is organized as records, each consisting of 41 fields which describe multiple attributes, namely: Authors, Title, Year, Source title, Volume, Issue, Art. No., Page start, Page end, Page count, Cited by, DOI, Link, Affiliations, Authors with affiliations, Abstract, Author Keywords, Index Keywords, Molecular Sequence Numbers, Chemicals/CAS, Trade names, Manufacturers, Funding Details, References, Correspondence Address, Editors, Sponsors, Publisher, Conference name, Conference date, Conference location, Conference code, ISSN, ISBN, CODEN, PubMed ID, Language of Original Document, Abbreviated Source Title, Document Type, Source, and EID. The data contained in the different fields has been utilized for the quantitative analysis.
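The grouping of records into journal and conference publications described in this section can be sketched as follows. This is only an illustrative sketch, not the pipeline actually used for the study: it assumes the records are available as a CSV export whose column names match the field names listed above (in particular ‘Document Type’), and the helper names `classify` and `split_records` as well as the sample rows are hypothetical.

```python
import csv
import io

# Document types grouped as described in Section 4. The helper names and the
# assumption of a CSV export with a "Document Type" column are illustrative.
JOURNAL_TYPES = {"article", "review", "article in press"}
CONFERENCE_TYPES = {"conference paper", "book chapter"}


def classify(document_type):
    """Return 'journal', 'conference', or None for an excluded document type."""
    kind = document_type.strip().lower()
    if kind in JOURNAL_TYPES:
        return "journal"
    if kind in CONFERENCE_TYPES:
        return "conference"
    return None


def split_records(csv_file):
    """Group the rows of a Scopus-style CSV export by publication category."""
    groups = {"journal": [], "conference": []}
    for row in csv.DictReader(csv_file):
        category = classify(row.get("Document Type", ""))
        if category is not None:
            groups[category].append(row)
    return groups


# Tiny in-memory stand-in for an exported file (made-up rows).
sample = io.StringIO(
    "Title,Year,Document Type\n"
    "Paper A,2014,Article\n"
    "Paper B,2015,Conference Paper\n"
    "Paper C,2016,Editorial\n"
)
groups = split_records(sample)
print(len(groups["journal"]), len(groups["conference"]))
```

In this sketch, document types outside the two groups (for example editorials, notes and letters) are simply dropped, mirroring the statement that the remaining categories were not taken into account.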

5. Quantifying the big data research output

The first parameter of analysis was the number of papers published in the area of Big Data per year and the rate of growth. This has been done by first categorizing the research papers in the data into two categories: conference and journal. The year-wise distribution of research output and cumulative research output for the time period 2010–2016 is presented in Table 2, categorized separately for conferences and journals. It is observed that the number of conference papers produced is much higher than the number of journal papers for the same period. This is expected since Big Data is a relatively new research area. It can also be observed from Table 2 that the total research output has increased significantly from 2010 to 2016. There is a trend of high growth in research done on Big Data, especially in the last three years (2014, 2015 and 2016). Fig. 2 depicts the annual growth of the number of publications on big data. Both publication types, conference and journal, record high growth.

Table 2
Year-wise research output

Year	Journal papers (JP)	Cumulative JP	Conference papers (CP)	Cumulative CP
2010	9	9	21	21
2011	32	41	42	63
2012	201	242	383	446
2013	625	867	1556	2002
2014	1660	2527	2714	4716
2015	2553	5080	5363	10079
2016	3640	8720	6535	16614
Total/Mean	8720		16614

Fig.2

Annual growth of research output.
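As a worked illustration of how the cumulative counts and the year-on-year growth can be derived from the annual counts in Table 2, consider the following minimal sketch. The yearly counts are taken from Table 2; the function names are illustrative, and growth is expressed here as a simple percentage change over the previous year, which is one of several possible growth measures.

```python
from itertools import accumulate

# Annual publication counts for 2010-2016, copied from Table 2.
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016]
journal = [9, 32, 201, 625, 1660, 2553, 3640]
conference = [21, 42, 383, 1556, 2714, 5363, 6535]


def cumulative(counts):
    """Running total of the annual counts."""
    return list(accumulate(counts))


def annual_growth(counts):
    """Percentage change of each year relative to the previous year."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(counts, counts[1:])]


jp_growth = annual_growth(journal)
cp_growth = annual_growth(conference)
for i, year in enumerate(years[1:]):
    print(year, round(jp_growth[i], 1), round(cp_growth[i], 1))

print(cumulative(journal)[-1], cumulative(conference)[-1])
```

The last line reproduces the totals of 8,720 journal papers and 16,614 conference papers.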

The second parameter analyzed was the country-wise distribution of research output on Big Data. Table 3(a) and (b) present the research output of the 10 most productive countries over all seven years, for journals and conferences, respectively. As seen in Table 3(a), the largest number of journal papers on Big Data comes from the United States, followed by China, the United Kingdom and India, in that order. For conference papers (Table 3(b)), there is close competition between China and the US for the largest output, followed by India and Germany. Fig. 3 presents a graphical representation of the number of journal and conference papers on Big Data by the most productive countries for the total period of analysis. It can be observed that the US and China taken together produce more than 50% of the total research papers on Big Data.

Table 3(a)

Country-wise research output (in Journals)

No. of Journal Publications = 8720
Country/ Year	2010	2011	2012	2013	2014	2015	2016	Total	% ^a
United States	6	14	74	244	583	816	950	2687	30.8
China	2	3	17	110	355	680	1063	2230	25.6
United Kingdom	1	1	12	51	122	188	293	668	7.7
India	0	0	1	10	55	180	283	529	6.1
South Korea	0	0	3	18	89	116	201	427	4.9
Germany	0	1	13	29	82	117	166	408	4.7
Australia	0	0	6	23	68	122	180	399	4.6
Japan	0	1	26	23	66	85	110	311	3.6
Canada	0	0	7	24	54	83	134	302	3.5
Italy	1	2	8	17	36	74	132	270	3.1

^a Percentage Contribution w. r. t. 8720 journal publications

Table 3(b)

Country-wise research output (in Conferences)

No. of Conference Publications = 16614
Country/ Year	2010	2011	2012	2013	2014	2015	2016	Total	% ^a
China	3	10	54	344	600	1190	2057	4258	25.6
United States	5	21	174	520	767	1457	1266	4210	25.3
India	0	2	22	60	158	378	638	1258	7.6
Germany	3	3	25	90	254	277	288	940	5.7
United Kingdom	0	1	15	97	148	268	276	805	4.8
South Korea	1	0	20	46	68	248	231	614	3.7
Japan	1	1	14	87	112	164	189	568	3.4
Italy	0	2	8	56	92	189	208	555	3.3
Australia	0	2	9	57	96	173	190	527	3.2
France	1	2	6	52	79	178	175	493	3.0

^a Percentage Contribution w. r. t. 16614 conference publications

Fig.3

Quantity of publications by top contributing countries.

6. Authorship and International collaboration patterns

The collaboration pattern among researchers producing papers on Big Data was the third parameter of analysis. First, we measured what proportion of papers on Big Data have one, two, three or more than three authors. Figures 4 and 5 present the distribution of the total research output in Big Data among one, two, three or more than three authors, for journal and conference papers, respectively. It is observed that both journal and conference papers now show a trend towards multi-authored papers. To further identify the nature of authorship collaboration, the affiliation of each author of a multi-authored paper was extracted. This information was then used to identify the country of each author and hence to find out how many papers have authors from two or more countries. Fig. 6 plots the country-level collaboration pattern of papers on Big Data. In this plot, the bigger a node, the higher the number of internationally collaborated papers (ICP) the country has. Similarly, the thickness of an edge indicates the strength of collaborative research between two countries. According to the spring-embedding algorithm [24], the closer a pair of countries in the two-dimensional space, the stronger the relationship between them. The line length is inversely proportional to the number of collaborative papers. It can be observed from the figure that the ‘United States - China’ tie is the strongest instance of international collaboration, followed by the ‘United States – United Kingdom’ pair. Further, the ‘United States’ has the highest number of ICP instances involving different countries. International collaboration patterns can be analyzed in more detail [25–27].

Fig.4

Authorship pattern (of Journal Papers) plotted year-wise.

Fig.5

Authorship pattern (of Conference Papers) plotted year-wise.

Fig.6

ICP Network at Country level.
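The counting step behind such a country-level collaboration network can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the country of each author has already been extracted from the affiliation field, and the function names and the sample papers are hypothetical. Layout and drawing of the network (for example with a spring-embedding algorithm) are left to a visualization tool.

```python
from collections import Counter
from itertools import combinations


def collaboration_edges(papers):
    """Count co-authored papers for every unordered pair of countries.

    `papers` is an iterable with one list of author countries per paper.
    """
    edges = Counter()
    for countries in papers:
        # Each country counts once per paper, however many authors it has.
        distinct = sorted(set(countries))
        for pair in combinations(distinct, 2):
            edges[pair] += 1
    return edges


def icp_counts(papers):
    """Count, per country, the papers with authors from two or more countries."""
    counts = Counter()
    for countries in papers:
        distinct = set(countries)
        if len(distinct) >= 2:
            counts.update(distinct)
    return counts


# Made-up sample: four papers with the countries of their authors.
papers = [
    ["United States", "China"],
    ["United States", "China", "United Kingdom"],
    ["India", "India"],
    ["United States", "United Kingdom"],
]
edges = collaboration_edges(papers)
nodes = icp_counts(papers)
print(edges.most_common())
print(nodes.most_common())
```

In the resulting network, the edge weights would determine edge thickness and the per-country counts would determine node size.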

7. Top publications sources, institutions and citations

The next quantitative parameter analyzed was the top publication sources in which Big Data research is being published, along with the major contributing institutions. For this purpose, the data has been processed to compute various scientometric indicators. These indicators include Total Publications (TP), Total Citations (TC), Average Citations Per Paper (ACPP) and h-index [28]. The h-index is considered a good composite measure of both quantity and quality. Table 4(a) shows the list of journals (arranged according to TP) producing the highest number of papers on Big Data. It is observed that ‘International Journal of Applied Engineering Research’ tops the list with a total of 113 papers related to Big Data. However, on the impact parameter (measured by ACPP), ‘IEEE Intelligent Systems’ and ‘PLOS One’ are two of the most prominent journals, having higher ACPP. Other important journals that published a good amount of research work on Big Data include ‘IEEE Access’, ‘Future Generation Computer Systems’ and ‘Computer’. Similarly, Table 4(b) lists the most productive conferences (and conference proceedings) on Big Data, where ‘Lecture Notes in Computer Science’ tops the list with a TP value of 1774 and an h-index of 16. ‘Proceedings of the ACM SIGMOD’ is another quality publication venue in the field, with a commendable h-index of 17 and ACPP of 10.00.

Table 4(a)
Top publication sources–journals

Publication Source	TP	TC	h-Index	ACPP
International Journal of Applied Engineering Research	113	25	2	0.22
IBM Data Management Magazine	83	10	1	0.12
Future Generation Computer Systems	78	668	14	8.56
Indian Journal of Science And Technology	64	119	7	1.86
IEEE Access	61	538	9	8.82
Computer	61	489	11	8.02
Big Data	59	219	7	3.71
PLOS One	56	408	9	7.29
Neurocomputing	50	226	7	4.52
Cluster Computing	48	179	9	3.73

Table 4(b)

Top publication sources— conferences/ proceedings

Publication Source	TP	TC	h-Index	ACPP
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)	1774	1945	16	1.10
ACM International Conference Proceeding Series	572	729	11	1.27
Procedia Computer Science	384	679	9	1.77
Communications in Computer and Information Science	346	155	5	0.45
CEUR Workshop Proceedings	313	189	6	0.60
Advances in Intelligent Systems and Computing	180	68	4	0.38
IFIP Advances in Information and Communication Technology	115	50	3	0.43
Applied Mechanics and Materials	99	26	3	0.26
Lecture Notes in Electrical Engineering	92	49	3	0.53
Proceedings of the ACM SIGMOD International Conference on Management of Data	85	850	17	10.00
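For reference, the indicators reported in Tables 4(a) and 4(b) can be computed from the per-paper citation counts of a source as in the following minimal sketch. It uses the standard definitions: ACPP is TC divided by TP, and the h-index is the largest h such that h papers have at least h citations each. The function names and the sample citation counts are illustrative, not data from the study.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def acpp(citations):
    """Average citations per paper: total citations over total publications."""
    return sum(citations) / len(citations) if citations else 0.0


# Made-up citation counts for the papers of one hypothetical source.
sample = [10, 8, 5, 4, 3]
print(len(sample), sum(sample), h_index(sample), acpp(sample))
# TP = 5, TC = 30, h-index = 4, ACPP = 6.0
```

The same definitions explain why a source with many papers but few citations, such as the first row of Table 4(a), ends up with a low h-index and a low ACPP despite its high TP.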

After identifying the top publication sources, the next step was to identify the most productive institutions in Big Data research. The TC, ACPP and h-index values have thus been computed for the data corresponding to each of the affiliated institutions. Table 5(a) and (b) present the top 10 contributing institutions in Big Data research, measured in terms of TP in the journal and conference categories, respectively. These tables give the TP, TC, ACPP and h-index values for each of the institutions. We can observe that ‘Chinese Academy of Sciences’ and ‘Tsinghua University’ are the top contributors of journal papers as well as conference papers. Different institutions, however, rank differently on different parameters. For example, ‘Massachusetts Institute of Technology’ and ‘Stanford University’ are the highest-impact institutions among those contributing journal publications, with an ACPP of 18.44 and 13.37, respectively.

Table 5(a)

Institutions with highest no. of journal papers

Institution	TP	TC	ACPP	h-Index
Chinese Academy of Sciences	219	1517	6.93	20
Tsinghua University	146	1203	8.24	15
Wuhan University	92	342	3.72	10
Ministry of Education China	83	339	4.08	9
Stanford University	65	869	13.37	14
Massachusetts Institute of Technology	64	1180	18.44	15
University of Oxford	64	741	11.58	15
Shanghai Jiaotong University	61	568	9.31	9
University of Southern California	61	757	12.41	12
Huazhong University of Science and Technology	59	504	8.54	12

Table 5(b)

Institutions with highest no. of Conference Papers

Institution	TP	TC	ACPP	h-Index
Chinese Academy of Sciences	281	681	2.42	12
Tsinghua University	204	601	2.95	7
Beijing University of Posts and Telecommunications	156	328	2.10	6
National University of Defense Technology	116	73	0.63	4
Shanghai Jiaotong University	112	227	2.03	9
Ministry of Education China	106	47	0.44	4
Beihang University	101	123	1.22	5
Amity University, Uttar Pradesh	99	26	0.26	3
Peking University	90	103	1.14	4
CNRS Centre National de la Recherche Scientifique	85	76	0.89	4

To examine the frequency and patterns of citations to journal articles and conference articles, citation analysis is done by calculating the number of citations received from journal and conference papers. The number of citations is obtained from Scopus. Table 6(a) and (b) present the citation behavior of journal and conference papers. It is observed that journal publications are cited more by journal publications (∼66%) and less by conference publications (∼34%). Similarly, conference publications are cited more by conference publications (∼68%) than by journal publications (∼32%). This is an interesting observation, and it suggests that different kinds of authors produce journal papers and conference papers.

Table 6(a)

Citation pattern of papers in journals

Year	Journal	Cited by Journal	Cited by Conference
2010	9	271(70.21%)	115(29.79%)
2011	32	613(75.21%)	202(24.79%)
2012	201	2154(52.91%)	1917(47.09%)
2013	625	4065(62.49%)	2440(37.51%)
2014	1660	7739(65.13%)	4144(34.87%)
2015	2553	7220(69.96%)	3100(30.04%)
2016	3640	3190(77.30%)	937(22.70%)
Total	8720	25252(66.27%)	12855(33.73%)

Table 6(b)

Citation pattern of papers in conferences

Year	Conference	Cited by Journal	Cited by Conference
2010	21	46(29.87%)	108(70.13%)
2011	42	365(31.38%)	798(68.62%)
2012	383	761(33.87%)	1486(66.13%)
2013	1556	1760(30.65%)	3983(69.35%)
2014	2714	2103(34.08%)	4068(65.92%)
2015	5363	1885(29.84%)	4432(70.16%)
2016	6535	492(30.67%)	1112(69.33%)
Total	16614	7412(31.68%)	15987(68.32%)
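The percentage shares in Tables 6(a) and 6(b) follow directly from the two citation counts in each row. A minimal sketch of this arithmetic, using the totals rows of the two tables (the function name is illustrative):

```python
def citation_shares(by_journal, by_conference):
    """Percentage of citations coming from journal and conference papers."""
    total = by_journal + by_conference
    return (
        round(100.0 * by_journal / total, 2),
        round(100.0 * by_conference / total, 2),
    )


# Totals row of Table 6(a): citations received by journal papers.
print(citation_shares(25252, 12855))  # (66.27, 33.73)

# Totals row of Table 6(b): citations received by conference papers.
print(citation_shares(7412, 15987))   # (31.68, 68.32)
```

By construction, the two shares in every row should add up to 100% (up to rounding), which provides a simple consistency check on the tabulated values.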

8. Disciplinary variation and keyword occurrence trends

Big Data research is considered to be an interdisciplinary research area. It is therefore natural to expect that researchers from many disciplines contribute to the field and that Big Data research is not limited by disciplinary boundaries. It is in this context that we have tried to identify the disciplinary distribution of the research records comprising our dataset. Table 7 presents the discipline-wise distribution of research publications in 15 major disciplines along with their percentage contribution to the total research output, for both conference and journal publications. This categorization into disciplines is based on the ‘Subject’ field of the Scopus data downloaded. It is observed that a total of 12,630 out of 16,614 conference publications (∼76%) and 3,642 out of 8,720 journal publications (∼42%) are from the ‘Computer Science’ discipline. This is an interesting observation: more than 50% of journal papers in Big Data research come from disciplines other than Computer Science. Engineering, Medicine, Social Sciences, Mathematics, and Business, Management and Accounting are some of the major contributing disciplines to Big Data research. Since a research publication may be categorized into more than one discipline (due to interdisciplinary outputs), the total percentage value here exceeds 100. This reflects the fact that more and more computer-oriented disciplines are emerging outside computer science (computational chemistry, computational biology, etc.), i.e. amalgamations of computer science and other fields.

Table 7
Discipline-wise distribution of research output

Journals			Conferences
Discipline	TP	%	Discipline	TP	%
Computer Science	3642	41.77	Computer Science	12630	76.02
Engineering	2131	24.44	Engineering	3662	22.04
Medicine	1174	13.46	Mathematics	3108	18.71
Social Sciences	1172	13.44	Decision Sciences	1454	8.75
Mathematics	1011	11.59	Social Sciences	1401	8.43
Business, Management and Accounting	672	7.71	Medicine	659	3.97
Decision Sciences	569	6.53	Business, Management and Accounting	538	3.24
Biochemistry, Genetics and Molecular Biology	536	6.15	Physics and Astronomy	415	2.50
Earth and Planetary Sciences	330	3.78	Materials Science	364	2.19
Materials Science	288	3.30	Energy	263	1.58
Physics and Astronomy	266	3.05	Earth and Planetary Sciences	183	1.10
Environmental Science	265	3.04	Economics, Econometrics and Finance	129	0.78
Multidisciplinary	243	2.79	Health Professions	122	0.73
Arts and Humanities	223	2.56	Environmental Science	98	0.59
Agricultural and Biological Sciences	217	2.49	Biochemistry, Genetics and Molecular Biology	83	0.50
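The percentage columns of Table 7 can be reproduced with simple arithmetic. The following Python sketch is only an illustration: the publication counts are taken from Table 7, while the function and variable names are our own assumptions. It also shows why the shares add up to more than 100%: the five largest journal disciplines alone account for more discipline assignments than there are journal papers.

```python
# Totals reported for the dataset (journal and conference publications).
JOURNAL_TOTAL = 8720
CONFERENCE_TOTAL = 16614


def share(tp, total):
    """Percentage share of `tp` publications in `total`, to two decimals."""
    return round(100.0 * tp / total, 2)


# Top five journal disciplines and their publication counts from Table 7.
journal_tp = {
    "Computer Science": 3642,
    "Engineering": 2131,
    "Medicine": 1174,
    "Social Sciences": 1172,
    "Mathematics": 1011,
}

shares = {name: share(tp, JOURNAL_TOTAL) for name, tp in journal_tp.items()}
assigned = sum(journal_tp.values())  # discipline assignments, not papers

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print("Assignments in top five disciplines:", assigned)
print("Exceeds number of journal papers:", assigned > JOURNAL_TOTAL)
```

Because one paper can carry several subject labels, the count of assignments is not bounded by the number of papers.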

To further identify disciplinary variations in Big Data research, important control terms are identified and a burst detection algorithm is run to understand keyword-based trends. For this purpose, the author keywords of each paper in the data are extracted and the occurrence frequency of every distinct author keyword is computed. The keywords are then arranged in descending order of occurrence frequency. The well-known burst detection algorithm developed by Kleinberg [29] is used to analyze the text streams of these keywords. Burst analysis helps to quantify the activity of a text stream during a specific period of time. The Science of Science (Sci2) tool is used for the burst detection. Table 8 lists the top 15 control terms with the starting and ending year of their bursts. A higher weight indicates greater popularity (a larger number of research papers) of that control term in Big Data research. The length of the burst gives the time period during which research in that area was active. Some concepts seem to have become infrequent, such as ‘Distributed Computing’, whose burst ended in 2014, while others are still being actively researched, such as ‘Social Media’, ‘Cloud Computing’, ‘Internet of Things’ and ‘Hadoop-MapReduce’. The bursts thus help in identifying the important keywords occurring in Big Data research publications and their periods of usage. The burst is calculated for data from the years 2010 to 2016 only.
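To give a feel for how burst detection works, the Python sketch below implements a simplified two-state version of Kleinberg's batched model [29]. It is not the Sci2 implementation. The yearly counts are hypothetical, and the scaling factor `s`, the transition parameter `gamma` and all function names are assumptions made for illustration. Each year is assigned either a baseline state or a burst state by minimizing a total cost, and the weight of a burst is the cost saved by using the burst state over that interval.

```python
from math import lgamma, log


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def state_cost(r, d, p):
    """Negative log-likelihood of r keyword papers among d papers at rate p."""
    return -(log_binom(d, r) + r * log(p) + (d - r) * log(1 - p))


def detect_bursts(r, d, s=2.0, gamma=1.0):
    """Return (start, end, weight) index triples for a two-state burst model."""
    n = len(r)
    if n == 0 or sum(d) == 0:
        return []
    p0 = sum(r) / sum(d)                  # baseline rate over the whole period
    if not 0 < p0 < 1:
        return []
    p1 = min(s * p0, 0.9999)              # elevated rate of the burst state
    up = gamma * log(n)                   # assumed cost of entering a burst
    sigma = [(state_cost(r[t], d[t], p0), state_cost(r[t], d[t], p1))
             for t in range(n)]

    # Viterbi-style dynamic programme; the automaton starts in state 0.
    cost = [sigma[0][0], up + sigma[0][1]]
    back = []
    for t in range(1, n):
        into0 = 0 if cost[0] <= cost[1] else 1
        into1 = 0 if cost[0] + up < cost[1] else 1
        new0 = cost[into0] + sigma[t][0]
        new1 = cost[into1] + (up if into1 == 0 else 0.0) + sigma[t][1]
        back.append((into0, into1))
        cost = [new0, new1]

    # Trace the cheapest state sequence backwards.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for prev in reversed(back):
        state = prev[state]
        states.append(state)
    states.reverse()

    # Collect maximal runs of the burst state and their weights.
    bursts, t = [], 0
    while t < n:
        if states[t] == 1:
            start, weight = t, 0.0
            while t < n and states[t] == 1:
                weight += sigma[t][0] - sigma[t][1]
                t += 1
            bursts.append((start, t - 1, weight))
        else:
            t += 1
    return bursts


# Hypothetical yearly counts for one keyword, 2010 to 2016.
years = list(range(2010, 2017))
keyword_papers = [1, 1, 1, 20, 25, 2, 1]
all_papers = [100] * 7
bursts = detect_bursts(keyword_papers, all_papers)
for start, end, weight in bursts:
    print(f"burst {years[start]}-{years[end]}, weight {weight:.2f}")
```

With these made-up counts, the only burst covers the two years in which the keyword's share of papers is far above its overall average.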

9. Interdisciplinary and co-word analysis

Since much of the research in Big Data is interdisciplinary in nature, we have analyzed the interdisciplinary collaboration trend in Big Data research. For this purpose, each paper has to be mapped to a subject area. The subject-area-based download from Scopus, which categorizes papers into 42 fields, helped us in this process. It may be noted that one article may appear in more than one discipline, which constitutes the essence of interdisciplinary research. The scholarly data with 43 fields (42 Scopus fields + the Subject Area field) has been given as input to the Sci2 tool for the creation of interdisciplinary collaboration networks. An interdisciplinary collaboration network is essentially a network of co-occurring disciplines, where nodes represent disciplines and edges represent the relationships between disciplines. The thickness of an edge depicts the strength of the tie (interdisciplinary research) between a pair of disciplines. VOSviewer has been used to visualize the co-occurring disciplines network. Fig. 7 shows the interdisciplinary collaboration network (i.e. the co-occurring disciplines network). From the figure, it is observed that the collaboration of ‘Computer Science’ with ‘Mathematics’ and ‘Engineering’ is more significant than any other disciplinary collaboration. Fig. 8 shows the alluvial diagram of the contribution of each discipline across the 7-year period. It can be observed that many research papers in the initial years are from disciplines other than Computer Science.
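The construction of a co-occurring disciplines network can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the Sci2 or VOSviewer procedure: the paper records are hypothetical and the function name is ours. Every pair of disciplines assigned to the same paper adds one to the weight of the edge between them, and that weight corresponds to the edge thickness described above.

```python
from collections import Counter
from itertools import combinations


def discipline_cooccurrence(papers):
    """Count how often each pair of disciplines is assigned to the same paper."""
    edges = Counter()
    for subjects in papers:
        # Sort so that each undirected pair is always stored in the same order.
        for a, b in combinations(sorted(set(subjects)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical records: each paper lists the disciplines it is assigned to.
papers = [
    ["Computer Science", "Mathematics"],
    ["Computer Science", "Engineering"],
    ["Computer Science", "Mathematics", "Engineering"],
    ["Medicine"],
]

edges = discipline_cooccurrence(papers)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```

A paper assigned to a single discipline contributes a node but no edge, so it does not affect the strength of any tie.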

Table 8
Top 15 control terms with burst detection

Word	Weight	Length	Start	End
Big data analytics	16.12	7	2010	–
Social media	8.84	7	2010	–
Cloud computing	7.59	7	2010	–
Internet of Things	7.13	7	2010	–
HADOOP-MapReduce	5.73	2	2015	–
Security	5.44	7	2010	–
Distributed computing	5.06	5	2010	2014
Spark	4.84	7	2010	–
Data integration	4.84	2	2015	–
Visual analytics	4.33	3	2014	–
Supervised learning	4.33	7	2010	–
Sentiment Analysis	3.44	4	2010	2013
Deep learning	3.09	7	2010	–
Data Analysis	2.6	2	2015	–
Business intelligence	2.13	7	2010	–

Fig. 7

Interdisciplinary collaboration network (2010–2016).

Fig. 8

Alluvial diagram of the year-wise contribution of disciplines.

We have also performed a co-word analysis to identify important thematic terms occurring in Big Data research and the relationships among them. Co-word analysis targets the co-occurrence of words and phrases in the dataset and relates concepts or ideas to subject areas. The occurrence of keywords together in the same article indicates an association between the corresponding topics [30, 31]. Our co-word analysis uses author keywords as the main data. Of the 8,720 journal articles, 1,919 (22%) do not contain any author keyword. A total of 15,852 unique author keywords are found in the whole data. For the co-word analysis, we have selected only those keywords with an occurrence frequency of at least 15. There are 131 such keywords, and they occur in 5,130 scholarly articles (∼59% of the 8,720 journal articles). The co-occurrence matrix of these 131 author keywords is extracted from the articles, and the network is created and visualized using VOSviewer. The hierarchical clustering algorithm of VOSviewer is applied to the co-word matrix. Fig. 9 shows the co-word network of these author keywords. We observe different clusters of related terms: (i) HDFS, MapReduce, Spark, NoSQL and HBase, representing big data technologies; (ii) machine learning, deep learning, clustering, classification and feature extraction, representing the techniques applied to Big Data; (iii) business analysis, supply chain management, predictive analysis and decision making, representing the applications of big data analytics; and (iv) social media platforms, networks and Twitter, representing the sources of big data generation. We have also plotted a cosine-normalized map of some frequently occurring keywords using the methodology of Mingers & Leydesdorff [32]. Fig. 10 shows the cosine-normalized map of the 90 most frequent author keywords, each with an occurrence frequency of at least 20. It is essentially a semantic map which depicts pragmatic groupings of related concepts. For instance, groupings around healthcare, intelligent agents, predictive analytics, social media analysis, and parallel and distributed computing can be observed in this map.
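The steps of the co-word analysis (frequency thresholding, co-occurrence counting and cosine normalization) can be illustrated with a short Python sketch. The keyword lists are hypothetical and the function name is ours. The normalization shown, the co-occurrence count divided by the square root of the product of the two keyword frequencies, is one common form of the cosine measure for binary occurrence data; that the maps in this study were computed in exactly this way is an assumption.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def coword_cosine(papers, min_freq):
    """Cosine-normalized co-occurrence of keywords with frequency >= min_freq."""
    # Number of papers in which each keyword occurs.
    freq = Counter(k for keywords in papers for k in set(keywords))
    kept = {k for k, n in freq.items() if n >= min_freq}

    # Raw co-occurrence counts among the retained keywords.
    cooc = Counter()
    for keywords in papers:
        for a, b in combinations(sorted(set(keywords) & kept), 2):
            cooc[(a, b)] += 1

    # Cosine for binary occurrence vectors: c_ab / sqrt(f_a * f_b).
    return {(a, b): n / sqrt(freq[a] * freq[b]) for (a, b), n in cooc.items()}


# Hypothetical author keywords of four papers.
papers = [
    ["hadoop", "mapreduce"],
    ["hadoop", "mapreduce", "spark"],
    ["hadoop", "spark"],
    ["deep learning"],
]

cosine = coword_cosine(papers, min_freq=2)
for pair, value in sorted(cosine.items()):
    print(pair, round(value, 3))
```

In the study itself the threshold would be 15 (or 20 for the cosine-normalized map) rather than 2; the small value here only keeps the toy data readable.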

Fig. 9

The co-word network of keywords.

Fig. 10

Cosine-normalized map of the 90 keywords which occur twenty or more times in the 8,720 journal publications (cosine > 0.1, modularity = 0.357).

The practice of referring to someone else’s work provides the fundamental linkages between individuals, ideas, journals and organizations, which together constitute a network that can be analyzed further. Moreover, references and citations also provide a linkage in time, between the earlier research that a paper references and the subsequent publications that cite it. The citation counts have been obtained from Scopus. We have also plotted the journal-to-journal citation network of the Big Data research data. Fig. 11 shows the overlay visualization of the journal-to-journal citation relationship network. The network contains 97 journals, each having at least 10 articles with at least 10 citations each (i.e. h-index ≥ 10). There are 2,713 unique journals cited in the 8,720 journal articles, of which 97 journals meet this criterion. The size of a node in the network is proportional to its citations, and the edges depict the citation relationships between journals. Graphical overlays aid in identifying prominent journals that are actively publishing in the field of Big Data. The most active journals have larger nodes, reflecting the number of citations they have received.
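The selection criterion used for the citation network (h-index ≥ 10) can be expressed as a small Python sketch. The journal names and citation counts below are hypothetical, and the helper name is our own; only the definition of the h-index follows Hirsch [28].

```python
def h_index(citations):
    """Largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-article citation counts for two journals.
journals = {
    "Journal A": [30, 25, 20, 18, 15, 14, 13, 12, 11, 10, 3, 1],
    "Journal B": [100, 50, 9, 8, 2],
}

# Keep only the journals that satisfy the h-index >= 10 criterion.
selected = [name for name, cites in journals.items() if h_index(cites) >= 10]
print(selected)
```

A journal with a few very highly cited articles but little else, such as the second one here, does not pass the threshold.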

Fig. 11

Overlay visualization of journal to journal citation relationship network.

10. Big data analytics software

We have also analyzed the downloaded research data to identify the major software platforms and architectures used in Big Data research. Since Big Data is a multi-disciplinary field, it has tools and platforms designed to address diverse applications in areas such as government [33], policing [34], physics [35], chemistry [36], life sciences [37], finance [38], e-commerce, health care [39], drug discovery [40] and, more recently, sentiment analysis [41, 42]. Advances in the field have reached a point where researchers can attempt to predict terrorist activities, decode human DNA, determine which gene contributes most to certain diseases and, of course, predict which click streams a user is most likely to respond to. Table 9 lists some of the popular big data processing technologies, along with their basic architecture and description. Varied solutions to real-time problems in big data have been proposed, for instance in 3D terrain modelling [43] and climate change [44]. Big Data will continue to grow horizontally as well as vertically, encompassing more and more new fields, and in time more and more technological platforms will be required by researchers. The data-as-a-service business model is also gaining ground. Demand for more efficient and more effective algorithms is emerging as well [45]. Current research in Big Data is no doubt a journey and not a final destination.

Table 9
Some popular big data processing technologies

Technology	Basic Architecture	Description
MapReduce^a	Programming model	Processing and generating large data sets with a parallel, distributed algorithm on a cluster.
Hadoop^b	Software framework	Distributed storage and processing of very large data sets on computer clusters built from commodity hardware.
Pig^c	High-level platform and language	Creating programs that run on Apache Hadoop.
Hive^d	Data warehouse	Data summarization, query, and analysis.
MongoDB^e	Cross-platform document-oriented database	Storage of documents rather than relational data.
NoSQL^f	Database	Storage and retrieval of non-relational data.
HBase^g	Database	Fault-tolerant storage of large quantities of sparse non-relational data in a distributed fashion.
Flume^h	Distributed service	Collecting, aggregating, and moving large amounts of log data in a robust and fault-tolerant way.
Spark^i	Cluster computing framework	Implicit data parallelism and fault tolerance.
Mahout^j	Library	Distributed and scalable machine learning algorithms focused primarily on collaborative filtering, clustering and classification.
Yarn^k	Cluster management technology	Resource management; a large-scale, distributed operating system for big data applications.
Zookeeper^l	Shared configuration service	Providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems.
Oozie^m	Server-based tool	Workflow scheduling system to manage Hadoop jobs.
Impala^n	Analytic massively parallel processing (MPP) database	SQL query engine for data stored in a computer cluster running Apache Hadoop.
Sqoop^o	Command-line interface application	Data transfer between relational databases and Hadoop.
Storm^p	Distributed computation framework	Event processing and batch, distributed processing of streaming data.

^a http://hortonworks.com/apache/mapreduce/
^b http://hadoop.apache.org/
^c https://pig.apache.org/
^d https://hive.apache.org/
^e https://www.mongodb.com/
^f http://nosql-database.org/
^g https://hbase.apache.org/
^h https://flume.apache.org/
^i http://spark.apache.org/
^j http://mahout.apache.org/
^k http://hortonworks.com/apache/yarn/
^l https://zookeeper.apache.org/
^m http://oozie.apache.org/
^n http://impala.io/
^o http://hortonworks.com/apache/sqoop/
^p http://storm.apache.org/
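To make the MapReduce programming model listed in Table 9 more concrete, the following toy Python sketch counts words in three phases (map, shuffle, reduce) on a single machine. It only imitates the structure of the model and does not use Hadoop or any distributed framework; the input documents and function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single count."""
    return key, sum(values)


# Hypothetical input documents.
documents = ["big data analytics", "big data platforms", "data science"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster, the map and reduce calls would run in parallel on different nodes, with the framework handling the shuffle between them.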

11. Conclusion

In this paper, a detailed analysis and mapping of research output on Big Data has been performed. Research output data from Scopus is used for a detailed characterization of Big Data research. The paper presented analytical outcomes for year-wise growth of research output, country-wise output, country-level international collaboration patterns, and authorship types and collaboration patterns, depicting the statistics and trends of the current status of research in Big Data. To begin with, the data has been arranged chronologically using the frequency or percentile method. Prominent growth in the number of research publications has been witnessed in the last three years. There is an inclination towards multi-author papers; as in other disciplines, collaborative research is predominant in the Big Data field as well. Other parameters such as ACPP and h-index give an overview of the citation analysis and of the citation trend so far in journals as well as conferences.

In addition to standard parametric characterization, text-analytics based approaches have also been used to identify the discipline-wise research output on Big Data. Computer Science and Engineering have been identified as the two major disciplines in this area of research. The research output has also been examined for important control terms that gained prominence in Big Data over the period. The bursty structure suggests that there has been an ongoing trend in Big Data research since 2010, with major emphasis on ‘social media’, ‘cloud computing’ and ‘internet of things’. The country-wise analysis of research output shows the widespread geography of research in Big Data, which is nevertheless dominated by the United States and China (both in journals and conferences). The US and China also have the largest number of collaborative research works with each other. The study also identifies major contributors to Big Data research. In terms of quantity, ‘Chinese Academy of Sciences’ and ‘Tsinghua University’ produced the largest number of publications in the last seven years. However, based on the citation index ACPP, ‘Massachusetts Institute of Technology’ tops the list for journal publications and ‘Chinese Academy of Sciences’ tops the list for conference publications. The most productive source of journal publications has been ‘International Journal of Applied Engineering Research’, while the most cited journal is ‘Future Generation Computer Systems’. For conference publications, ‘Lecture Notes in Computer Science’ has been the most productive as well as the most cited source. Big Data is now a growing and rapidly changing field of research which calls for the involvement of multiple disciplines. It cannot be limited to one particular field of interest. To our surprise, the research revealed that ‘Computer Science’ emerged as the most prominently researched Big Data field only a couple of years back.

Text-analytics methods have been deployed to analyze the interdisciplinary nature of the research. It is evident from the interdisciplinary collaboration network that ‘Mathematics’ and ‘Engineering’ have the strongest collaboration with ‘Computer Science’. In addition, associated concepts have been discovered by performing co-word analysis.

It can be established that Big Data is an ever-growing field, and there is no single silver bullet to answer all the queries raised by Big Data scenarios. Big Data is an extensively probed area for which more is less. Similar to its definition, the ways to handle it are also multiple (volume), varied (variety) and ever increasing (velocity). This paper helps in understanding the genesis and characteristics of research in the area.

References

1. Hu, Wen, T.-S. Chua and Li, Toward scalable systems for big data analytics: A technology tutorial, IEEE Access 12 (2014), 652–687.

2. Provost and Fawcett, Data science and its relationship to big data and data-driven decision making, Big Data 1(1) (2013), 51–59.

3. Xia, Wang, T.M. Bekele and Liu, Big scholarly data: A survey, IEEE Transactions on Big Data 3(1) (2017), 18–35.

4. M.A. Beyer and Laney, The importance of “big data”: A definition, Stamford, CT: Gartner (2012), 2014–2018.

5. Gantz and Reinsel, Extracting value from chaos, IDC Iview 1142 (2011), 1–12.

6. Schroeck, Shockley, Smart, Romero-Morales and Tufano, Analytics: The real-world use of big data: How innovative enterprises extract value from uncertain data, Executive Report, IBM Institute for Business Value and Said Business School at the University of Oxford, 2012.

7. Kaur and S.K. Sood, Efficient resource management system based on 4Vs of big data streams, Big Data Research 9 (2017), 98–106.

8. Depeige and Doyencourt, Actionable Knowledge As A Service (AKAAS): Leveraging big data analytics in cloud computing environments, Journal of Big Data 2(1) (2015).

9. Guo, T.G. Papaioannou and Aberer, Efficient indexing and query processing of model-view sensor data in the cloud, Big Data Research 1 (2014), 52–65.

10. Sakr and Elgammal, Towards a comprehensive data analytics framework for smart healthcare services, Big Data Research 4 (2016), 44–58.

11. Griffin, B.W. Nordstrom, Scholes, Joncas, Gordon, Krivenko, Haynes, Higdon, Stewart, Kolker, Montague and Kolker, A case study: Analyzing city vitality with four pillars of activity: live, work, shop, and play, Big Data 4(1) (2016), 60–66.

12. Dhar, Nilekani, Maruwada and Pappu, Big data as an enabler of primary education, Big Data 4(3) (2016), 137–140.

13. Fan, R.Y.K. Lau and J.L. Zhao, Demystifying big data analytics for business intelligence through the lens of marketing mix, Big Data Research 2(1) (2015), 28–32.

14. K.R. Varshney, G.H. Chen, Abelson, Nowocin, Sakhrani, Xu and B.L. Spatocco, Targeting villages for rural development using satellite image analysis, Big Data 3(1) (2015), 41–53.

15. Ofli, Meier, Imran, Castillo, Tuia, Rey, Briant, Millet, Reinhard, Parkan and Joost, Combining human computing and machine learning to make sense of big (Aerial) data for disaster response, Big Data 4(1) (2016), 47–59.

16. H.V. Jagadish, Big data and science: Myths and reality, Big Data Research 2(2) (2015), 49–52.

17. Jin, B.W. Wah, Cheng and Wang, Significance and challenges of big data research, Big Data Research 2(2) (2015), 59–64.

18. Singh and C.K. Reddy, A survey on platforms for big data analytics, Journal of Big Data 2(1) (2014).

19. Boyd and Crawford, Critical Questions for big data, Information, Communication & Society 15(5) (2012), 662–679.

20. Hoskins, Big data 2.0: Cataclysm or catalyst? Big Data 2(1) (2014), 5–6.

21. M. Chen, Mao, Zhang and V.C.M. Leung, Big data: Related technologies, challenges and future prospects, Heidelberg: Springer, 2014, p. 35.

22. H.W. Park and Leydesdorff, Decomposing social and semantic networks in emerging 'big data' research, Journal of Informetrics 7(3) (2013), 756–765.

23. Halevi and Moed, The evolution of big data as a research and scientific topic: Overview of the literature, Research Trends 30(1) (2012), 3–6.

24. Kamada and Kawai, An algorithm for drawing general undirected graphs, Information Processing Letters 31(1) (1989), 7–15.

25. L. Leydesdorff, C.S. Wagner, H.-W. Park and Adams, International collaboration in science: The global map and the network, El Profesional de la Informacion 22(1) (2013), 87–95.

26. Choi, J.S. Yang and H.W. Park, The triple helix and international collaboration in science, Journal of the Association for Information Science and Technology 66(1) (2014), 201–212.

27. Mehmood, G.S. Choi, O.F. von Feigenblatt and H.W. Park, Proving ground for social network analysis in the emerging research area 'Internet of Things' (IoT), Scientometrics 109(1) (2016), 185–201.

28. Hirsch, An index to quantify an individual's scientific research output, Proceedings of the National Academy of Sciences of the United States of America 102(46) (2005), 16569.

29. Kleinberg, Bursty and hierarchical structure in streams, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, 2002.

30. Ravikumar, Agrahari and S.N. Singh, Mapping the intellectual structure of scientometrics: A co-word analysis of the journal Scientometrics (2005-2010), Scientometrics 102(1) (2014), 929–955.

31. Cambrosio, Limoges, J.P. Courtial and Laville, Historical scientometrics? Mapping over 70 years of biological safety research with coword analysis, Scientometrics 27(2) (1993), 119–143.

32. Mingers and Leydesdorff, A review of theory and practice in scientometrics, European Journal of Operational Research 246(1) (2015), 1–19.

33. Yiu, The big data opportunity, Policy Exchange 8 (2012).

34. "News: Live Mint". Are Indian companies making enough sense of Big Data? Live Mint. Retrieved 2014-11-22.

35. Hogg, Big data challenges for physics in the next decades, Bulletin of the American Physical Society 57 (2012).

36. Wold, Albano, W.J. Dunn, Edlund, Esbensen, Geladi, Hellberg, Johansson, Lindberg and Sjostrom, Multivariate data analysis in chemistry, Chemometrics (1984), 17–95.

37. I. Higdon, Haynes, Stanberry, Stewart, Yandl, Howard, Broomall, Kolker and Kolker, Unraveling the complexities of life sciences data, Big Data 1(1) (2013), 42–50.

38. Dhar, Big data and the rise of machines in financial markets, Big Data 2(2) (2014), 65–67.

39. Dhar, Big data and predictive analytics in health care, Big Data 2(3) (2014), 113–116.

40. Howe, Costanzo, Fey, Gojobori, Hannick, Hide, D.P. Hill, Kania, Schaeffer, StPierre, Twigger, White and Yon Rhee, The future of biocuration, Nature 455(7209) (2008), 47–50.

41. Devaraj, Piryani and V.K. Singh, Lexicon ensemble and lexicon pooling for sentiment polarity detection, IETE Technical Review 33(3) (2015), 332–340.

42. Dhar, Can big data machines analyze stock market sentiment? Big Data 2(4) (2014), 177–181.

43. Wu, Deng and Paul, 3D terrain real-time rendering method based on CUDA-OpenGL interoperability, IETE Technical Review 32(6) (2015), 471–478.

44. J.H. Faghmous and Kumar, A big data guide to understanding climate change: The case for theory-guided data science, Big Data 2(3) (2014), 155–163.

45. V.S. Agneeswaran, Tonpay and Tiwary, Paradigms for realizing machine learning algorithms, Big Data 1(4) (2013), 207–214.
