Abstract
Scholarly articles are considered one of the primary media for disseminating inventions and discoveries. Traditionally, the usefulness and popularity of a scholarly article have been measured in terms of the citations it receives. However, in the changed research publishing landscape, where most publications are available in digital form through various digital libraries, new measures of the usefulness of scholarly articles have emerged. Scholarly articles are now easily available for access and download from various digital access portals. The use and popularity of these portals have also made it possible to integrate social media platforms with journal access and use. Most journals now maintain statistics about reads, downloads, social profile shares, etc. Several newer platforms such as ResearchGate, Academia and Mendeley have also become popular. Researchers now often share their articles on such platforms and use social media channels to disseminate them to a wider audience. This transformed environment makes it possible to track and measure the usefulness and popularity of scholarly articles through alternative metrics (now popularly known as Altmetrics), in contrast to traditional citation impact measures. Altmetrics attempts to derive the impact of a scholarly article from different kinds of data, such as social network shares, mentions and tweets. The use of Altmetrics varies widely from country to country and from discipline to discipline. This paper presents the findings of an exploratory analysis of the relevance of Altmetrics data through a case study of scholarly articles from India published during 2016, indexed in Web of Science and also present on ResearchGate. The results provide an interesting insight into the relatedness and correlation of the presence of scholarly articles in Web of Science and ResearchGate.
It is observed that about 61% of the papers indexed in Web of Science have an entry in ResearchGate. There are, however, disciplinary variations in the presence of articles in ResearchGate. Only about 61% of the disciplines in Web of Science are found to be covered in ResearchGate.
Introduction
Scholarly articles have been the primary medium for disseminating research advancements for centuries. Traditionally, scholarly articles were published in journals printed in volumes and issues. Over the last two decades, there has been a growing use of digital libraries and portals for the dissemination of and access to scholarly articles. Most journals now publish articles in electronic form, with some even discontinuing their print versions. For several decades, the impact of a scholarly article was measured by the citations it received. However, with the ecosystem of scholarly publishing and access going digital, there are now new possibilities for measuring the usefulness and impact of a scholarly article. Scholarly information exchange is now tightly connected to information and communication technology [1], which in turn has made it possible to observe and measure online transactions (such as reads, downloads etc.).
Authors are now using social networks to disseminate their published scholarly articles to a wider audience. Some social platforms also provide an archive where authors can upload pre-print versions of their articles. ResearchGate, Academia and Mendeley are some of the widely known and used social networks providing such a facility. On some platforms, authors can also publish draft versions of their articles and obtain feedback from fellow researchers before actually submitting the article to a journal. The availability of data about scholarly articles on such platforms, and of the online transactions around them, creates a possibility of identifying the usefulness and importance of a scholarly article. This may also address the long time period required by traditional bibliometric approaches for analysing the usefulness of articles. A measure like citations requires that the usage and referencing of an article be observed over a sufficiently long period to count the citations it receives. However, in today’s digital era, it is possible to create an alternative measure for assessing the usefulness of an article by measuring the online transactions around it. Such measures are commonly known as alternative metrics, i.e. Altmetrics [2]. They present an alternative and quite useful way to assess the initial impact of a scholarly article, even before it starts getting cited in traditional journals [3].
This paper presents a case study analysing the use of alternative metrics for scholarly articles published from India during the year 2016. We examine what proportion of the scholarly articles published in various journals by Indian authors find a place in the popular platform ResearchGate. To do this, publication records for the country India for the year 2016 are first obtained from Web of Science (WoS), a well-known bibliographic database. This data is then used as input, and all articles are searched for their presence in ResearchGate. ResearchGate statistics such as ‘reads’, ‘downloads’, ‘comments’ and ‘citations’ are recorded for the papers found there. This analysis is expected to help understand the usefulness of alternative impact assessment for articles from India. Disciplinary differences in the presence of articles on ResearchGate are also identified.
The rest of the paper is organized as follows: Section 2 describes the related work on the theme. Section 3 presents details of the dataset used. Section 4 describes the methodology and Section 5 presents the experimental results. Section 6 presents the conclusions.
Related work
Social networks have attracted researchers right from their inception. Several research works have been carried out on quantifying transactions on online social networks, measuring the size and spread of social networks, modeling the dynamics of social networks, etc. Research on scholarly articles in social networks is a relatively newer phenomenon. With researchers using social network platforms to popularize their articles and disseminate news about them, a large number of scholarly articles now find a mention on some social platform. Several new tools have been introduced, and new possibilities of measuring the impact of scholarly articles in social networks are being explored [4]. This new form of metrics, based on online social networks or reference manager tools, was first proposed in 2010 [5] and termed “altmetrics”.
Altmetrics is the study and use of scholarly impact measures based on activity in online tools and environments. The term refers not only to the field of study of scholarly data but also to the metrics themselves, which makes it different from “bibliometrics” or “scientometrics” [2]. More precisely, altmetrics is a transformation from traditional bibliometrics towards new kinds of metrics computed from social network transactions around scholarly articles [6]. The popularity of altmetrics has resulted in many journals providing social statistics about articles on their webpages.
As social networks are becoming ubiquitous, researchers are increasingly getting interested in altmetrics research. The topic has even gained enough prominence to be considered as a replacement for the h-index metric [7]. Altmetrics are based on events on social media, which are created as a result of actions around research articles. There are several classifications of events [8]. In ‘article-level metrics’ (ALMs, [10]), views, downloads, clicks, notes, saves, tweets, shares, likes, recommends, tags, posts, trackbacks, discussions, bookmarks, and comments are counted, rather than just citations of a paper in a database such as Scopus (Elsevier), or by a publisher such as the Public Library of Science (PLOS, [9]) [10]. There are many different ways in which altmetrics events can be measured from a data source [11].
One challenge, however, is that there is no standard definition of a specific altmetric event, so aggregators name their events differently. In addition, event counts from a single data source can be measured in different ways, and aggregators do not always explicitly state how the events are counted. Another issue in using social network data is that the data sources are liable to change or discontinue their service [8] at some point in time. This fluctuation in the availability of altmetrics poses a challenge, especially for reproducing the evidence behind the event counts.
Social networks play a pivotal role not only in sharing research but also in promoting it. For example, [12] claimed that Web publication has considerably changed the way in which researchers disseminate and advance their research. Many researchers are now trying to analyze scholarly transaction behavior on different social platforms. There are, however, relatively few research works reported on the use of ResearchGate data. One of the reasons could be the difficulties associated with collecting data from ResearchGate. Of the existing works on ResearchGate, most either describe different features of ResearchGate or use ResearchGate data for special purposes. ResearchGate data has also been used in various analyses such as author disambiguation [13].
Since ResearchGate mainly operates through user profiles, it can be very useful for identifying authors. The data also contain a social network among these authors, which can be used to compute well-known social network analysis indicators such as centrality [14]. ResearchGate may also be helpful for finding early citations or predicting them [15]. ResearchGate is becoming a very useful mode of information exchange among researchers in multiple disciplines [16]. Some previous research has discussed how ResearchGate can help to build reputation [17] and authorship mechanisms [18]. ResearchGate has, however, also been criticized for the RG score shown on its website [19].
To the best of our knowledge, no previous study has aimed to study altmetrics for a particular country or a group of disciplines. The closest work is a study on the poor altmetric score of China, which used Twitter for its analysis [20]. Some other related research works that use ResearchGate data rely on small datasets [21]. ResearchGate has also been studied in terms of its features, shortcomings, etc. Some authors have tried to use ResearchGate data for indirect benefits such as removing author ambiguities, finding early citations and network analysis. Our work is perhaps one of the first on a large dataset involving all publications from a big country like India for a complete year. It took us a long time to crawl the data from ResearchGate. The purpose of our analysis is to broadly identify how popular ResearchGate is among Indian researchers and whether there are any disciplinary differences in the trend. Impact on ResearchGate is also studied and recorded for future correlation studies with traditional citation counts.
Data collection
We have collected data from ResearchGate for all publications originating from India during 2016 as indexed in Web of Science (WoS). Since no API was available to obtain data from ResearchGate, we designed a crawler to get the data, and the crawling took a considerable amount of time. First, we collected data on publications originating from India during the year 2016 from WoS and extracted author information from it. Thereafter, for each publication record, ResearchGate was accessed programmatically to find out whether the publication is indexed in it; if the publication was found, the relevant statistics for it were scraped from ResearchGate. We obtained a total of 88,259 records indexed in WoS for the year 2016 for the country India. This data was collected from 5th to 10th May 2017.
ResearchGate was crawled to extract information about each of the publications found to be present in it. The crawling process was completed over a span of 5 months (May-September, 2017). Out of the 88,259 records indexed in WoS, 68,827 records were found in ResearchGate. Thus, approximately 78% of the Indian publications indexed in WoS during 2016 were found in ResearchGate, which is a decent amount of coverage. For the publication records found in ResearchGate, different usage and mention statistics were extracted and analyzed.
Methodology
The data obtained from WoS and ResearchGate was first pre-processed, which required cleaning and formatting. The data crawled from ResearchGate was unstructured and also contained some noise. It was, therefore, first cleaned by removing special characters and other undesired information, and then formatted into the desired structure through programs. The pre-processed data was then verified for correctness. For this purpose, the crawled records were cross-matched with the bibliometric data collected from WoS, and each matched record was tagged with its WoS data for further analysis. DOIs were used for the matching, owing to their uniqueness and reliability. A total of 53,832 records were successfully matched between the two sets. The total crawled set consists of 68,827 records, but for reliable and detailed analysis, the final dataset is restricted to these 53,832 matched records. This set constitutes approximately 78% of the crawled data and about 61% of the WoS publication records (88,259). This share is considered sufficient to represent the whole dataset, and every record in it has been verified through its DOI. Thus, the final dataset for analysis consists of 53,832 tagged and verified records. Table 1 presents a summary of the dataset.
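The DOI-based cross-matching described above can be illustrated with a minimal Python sketch. It is not the authors' actual program: the record layout (dictionaries with a `doi` key), the function names `normalize_doi`, `match_by_doi` and `coverage`, and the toy records are all assumptions made for illustration. Only the record counts passed to `coverage` come from the text.

```python
def normalize_doi(raw):
    """Lower-case a DOI and strip common URL/prefix forms; None if empty."""
    if not raw:
        return None
    doi = raw.strip().lower()
    prefixes = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    doi = doi.strip()
    return doi or None


def match_by_doi(wos_records, rg_records):
    """Keep ResearchGate records whose DOI also occurs in WoS, tagged with the WoS record."""
    wos_by_doi = {}
    for rec in wos_records:
        doi = normalize_doi(rec.get("doi"))
        if doi:
            wos_by_doi[doi] = rec
    matched = []
    for rec in rg_records:
        doi = normalize_doi(rec.get("doi"))
        if doi in wos_by_doi:
            matched.append({"doi": doi, "wos": wos_by_doi[doi], "rg": rec})
    return matched


def coverage(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Toy records (hypothetical), only to show the matching behaviour.
wos = [{"doi": "10.1000/ABC"}, {"doi": "10.1000/xyz"}, {"doi": None}]
rg = [{"doi": "https://doi.org/10.1000/abc", "reads": 12},
      {"doi": "10.9999/other", "reads": 3}]
matched = match_by_doi(wos, rg)

# Shares derived from the record counts reported in the text.
share_of_crawled = coverage(53832, 68827)   # matched vs. crawled records
share_of_wos = coverage(53832, 88259)       # matched vs. WoS records
```

Normalizing case and URL prefixes before comparison is a design choice of this sketch; records without a DOI are simply left unmatched, which mirrors the decision to restrict the final dataset to DOI-verified records.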
Data Summary
In ResearchGate, an author or publisher can upload all the details of a paper, including its full text. ResearchGate provides various information regarding uploaded papers, such as reads, citations, recommendations and comments. The crawling process aimed to scrape all this information. Each record crawled from ResearchGate consists of Title, Article Type, Authors, Published year, Reads, etc. Table 2 shows an example record crawled from ResearchGate.
Example of Data (Simplified Form)
The collected data usually has no structure. For example, the author field usually consists of the author order, the authors' RG scores and their affiliations. The affiliation of an author usually appears only if it is provided in the author's profile in a pre-defined format. The unformatted data looks like: “1st Manish Verma2.28 · University of Duisburg-Essen2nd Kanik Ram”. Here, one author is listed with his affiliation while the other has not provided affiliation information. Another challenge was that ResearchGate shows information for a maximum of four authors of a paper. This implies that if a publication has more than four authors, then only the first four are crawled.
The data obtained from ResearchGate for each record in the dataset was analyzed to obtain useful inferences. First, the visibility of the data was analyzed. By visibility we refer to the number of reads recorded in ResearchGate, which are provided for each article. Reads are a good measure of the use and influence of an article among people using ResearchGate. Reads can also be counted as a proxy for the impact of an article, if every read is assumed to have an impact on the reader. To further understand the usefulness of the reads parameter, the recommendations and comments fields are also analyzed. These mentions (comments and recommendations in ResearchGate) are another way to find out how much an article has been used. ResearchGate also provides a citation value, which can be used to find out how many times a paper uploaded on ResearchGate is cited by another paper on ResearchGate. We also wanted to find out if there exist any disciplinary differences in these statistics on ResearchGate; therefore, we performed a subject category wise analysis. This helped in understanding which subjects have higher coverage and which have lower.
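As an illustration of how such an unformatted author string might be turned into structured fields, the following Python sketch splits it on the ordinal markers and then separates the name, the RG score and the affiliation. It is a sketch under stated assumptions, not the parser actually used: it assumes that ordinals are limited to `1st` to `4th` (since at most four authors are shown), that the RG score is a number glued to the end of the name, and that the affiliation follows a `·` separator. The function name `parse_authors` and the output keys are hypothetical.

```python
import re

# Ordinal markers that introduce each author (at most four are shown).
ORDINAL = re.compile(r"(1st|2nd|3rd|4th)\s+")
# A name optionally followed directly by a numeric RG score, e.g. "Manish Verma2.28".
NAME_SCORE = re.compile(r"^(?P<name>.*?)(?P<score>\d+(?:\.\d+)?)?$")


def parse_authors(raw):
    """Split a raw ResearchGate author string into one dict per author."""
    parts = ORDINAL.split(raw.strip())
    # parts looks like ['', '1st', '<chunk>', '2nd', '<chunk>', ...]
    authors = []
    for ordinal, chunk in zip(parts[1::2], parts[2::2]):
        # The affiliation, if present, follows a middle dot.
        pieces = re.split(r"\s*·\s*", chunk.strip(), maxsplit=1)
        match = NAME_SCORE.match(pieces[0])
        score = match.group("score")
        authors.append({
            "position": int(ordinal[0]),
            "name": match.group("name").strip(),
            "rg_score": float(score) if score else None,
            "affiliation": pieces[1].strip() if len(pieces) > 1 else None,
        })
    return authors


raw = "1st Manish Verma2.28 · University of Duisburg-Essen2nd Kanik Ram"
authors = parse_authors(raw)
# authors[0] carries a score and an affiliation; authors[1] has neither.
```

A pattern-based parser like this remains fragile: a name or affiliation that itself contains an ordinal followed by a space (for instance an institute named "1st ...") would be split incorrectly, so the parsed output would still need the kind of verification against WoS described in the methodology.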
Visibility
It is interesting to examine the visibility and reads of the articles present in ResearchGate. As noted above, about 61% of the articles indexed in WoS from India during the year 2016 are present in ResearchGate. The number of reads of each article is another measure obtained, which is a good indicator of the popularity of an article even before it starts attracting citations. It is very impressive to find that 97% of the total set, i.e., 52,175 papers, have at least one read each. The read percentage is plotted in Fig. 1. A total of 2,966,101 reads are found for the whole dataset. This in turn indicates that the output of Indian research has attracted a good number of reads in a relatively short period of time (as the articles were published in 2016 only).

Read-Unread Distribution of the crawled Data.
The papers that have never been read make up only about 3% of the total papers in the dataset. ResearchGate provides a feature by which any institution can upload its published data as a repository, so that there is no need to create individual author profiles. This is found to be a primary reason for papers having no reads: out of 1,660 such papers, 974 (∼59%) do not have even one author registered in ResearchGate. Author profiles play a pivotal role in attracting readers, since authors attract followers who are prospective readers. ResearchGate thus appears to be an individual-centric platform.
The average number of reads per paper is 55.09 if the whole dataset is accounted for. If we take into account only those papers which are read at least once (i.e., 52,175 papers), this comes out to be ∼57. There are 80 papers in the dataset which have more than 1,000 reads each. These 80 papers have a cumulative read count of 172,674, which is almost 6% of the total reads in the whole dataset. To understand the pattern of readership, we identified the top 10 most read papers from this set. They are shown in Table 3 along with their authors, citations and comments. The highest read count of any paper in this set is 7,059, for a paper from the chemistry subject category.
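The read statistics quoted in this section follow from simple ratios of the reported totals. The short Python sketch below recomputes them as a consistency check; the function name `read_summary` and its keys are hypothetical, while the four input figures are taken from the text (the totals that the source writes in Indian digit grouping are entered here as plain integers).

```python
def read_summary(total_reads, total_papers, read_papers, top_reads):
    """Recompute the read statistics from the reported totals."""
    return {
        # Mean reads over every paper in the final dataset.
        "mean_all": total_reads / total_papers,
        # Mean reads over papers with at least one read.
        "mean_read_only": total_reads / read_papers,
        # Percentage of papers with at least one read.
        "pct_read": 100 * read_papers / total_papers,
        # Percentage of all reads held by the papers with over 1,000 reads.
        "pct_top": 100 * top_reads / total_reads,
    }


summary = read_summary(
    total_reads=2_966_101,   # total reads in the dataset
    total_papers=53_832,     # DOI-matched records
    read_papers=52_175,      # papers read at least once
    top_reads=172_674,       # cumulative reads of the 80 most read papers
)
# mean_all is about 55.1, mean_read_only about 56.8,
# pct_read about 96.9 and pct_top about 5.8.
```

These values agree, up to rounding, with the figures of 55.09, ∼57, 97% and almost 6% given in the text.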
Top 10 most read Papers
The next step of the analysis focused on the comments and recommendations fields of the data. A total of 4,523 comments have been made by ResearchGate users on these articles. However, only 1,520 papers have any comments. These comments can be seen as an indicator of readers’ interest in the articles present in ResearchGate. Unlike reads, comments indicate a higher level of engagement. Out of the 1,520 articles, one article has 491 comments. As far as recommendations are concerned, a total of 1,644 articles have been recommended by researchers. The total number of recommendations for all articles taken together is 2,945. One article has 226 recommendations, which is the highest number of recommendations for any individual paper.
Similarly, on analyzing the number of citations, we found that the total number of citations is 98,542. This amounts to an Average Citation Per Paper (ACPP) of 1.83. Table 4 presents details of comments, recommendations and citations for the dataset. The total citations of the papers in the dataset in WoS are 55,748, as against 98,542 in ResearchGate. The ACPP in ResearchGate is 1.83, compared to an ACPP of 1.03 in WoS. Thus, ResearchGate appears to record citations more quickly than WoS.
Impact & Citation Analysis Statistics
We tried to find out whether there is a similarity in the patterns of reads and citations, i.e., whether highly read papers are also highly cited and vice-versa. For this, we first marked the most read paper in the dataset (with a total of 7,059 reads) and then found the citations for it. This highly read paper has only 3 citations. Similarly, the most cited paper was marked (132 citations), but it was found to have only 9 reads. Table 5 shows these statistics. The results show that there is no direct correlation between the reads and citations of papers. This result agrees with earlier such findings as well [22]. It suggests that reads and citations cannot replace each other, but both can be used together to derive better indicators. At these extremes, reads and citations appear to be totally disjoint.
Most Read Article Vs. Most Cited Article
We extended this analysis to the 100 most read papers, taken as set A, and the 100 most cited papers, taken as set B. A Venn diagram showing the relationship between the two sets is plotted in Fig. 2. Only 8 articles are common to the two sets. Here, the read threshold is 903, which means that the 100th most read article has 903 reads, whereas the citation threshold is 30.

Venn diagram of 100 Most Read vs. Cited Articles.
This phenomenon is interesting as well as important to visualize. Therefore, we took larger sets to understand the relationship between reads and citations more clearly. The size of the observed sets was increased to 1000. Figure 3 presents a Venn diagram for this case. For the 1000 most read papers, the read threshold (the reads of the 1000th most read paper) comes to 303, which is substantially lower than for the earlier set, and the citation threshold decreases to 12. A total of 234 articles are common to these two sets, which accounts for 23% of the articles in each set. It can be concluded that there are no definite patterns of commonality between reads and citations.
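The comparison of the most read and most cited sets amounts to a top-k selection followed by a set intersection. The following Python sketch shows one way this could be computed. The record layout (dictionaries with `id`, `reads` and `citations` keys) and the function names `top_k` and `overlap` are assumptions for illustration; in the toy data, the first two records reuse the reads and citations reported for the most read and most cited papers, while the remaining records are invented.

```python
def top_k(records, key, k):
    """Return (ids of the k highest records by `key`, value of the k-th record)."""
    ranked = sorted(records, key=lambda r: r[key], reverse=True)[:k]
    ids = {r["id"] for r in ranked}
    threshold = ranked[-1][key] if ranked else None
    return ids, threshold


def overlap(records, k):
    """Compare the k most read with the k most cited records."""
    most_read, read_threshold = top_k(records, "reads", k)
    most_cited, citation_threshold = top_k(records, "citations", k)
    common = most_read & most_cited
    return {
        "common": len(common),
        "pct_common": 100 * len(common) / k,
        "read_threshold": read_threshold,
        "citation_threshold": citation_threshold,
    }


# Toy data: "a" and "b" use the values reported for the most read and
# most cited papers; "c", "d" and "e" are hypothetical.
records = [
    {"id": "a", "reads": 7059, "citations": 3},
    {"id": "b", "reads": 9, "citations": 132},
    {"id": "c", "reads": 950, "citations": 40},
    {"id": "d", "reads": 300, "citations": 12},
    {"id": "e", "reads": 20, "citations": 1},
]
result = overlap(records, k=2)
```

With k set to 100 and 1000 on the full dataset, the same procedure would yield the common counts and thresholds shown in the Venn diagrams. Ties at the threshold are broken here by input order, which is a simplification; the source does not state how ties were handled.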

Venn diagram of 1000 Most Read vs. Cited Articles.
It is important to have a brief idea of the subject distribution of the crawled data, as it reflects the research domains in which India is active. To visualize the subject category distribution, the matched data is tagged with the subject categories provided by WoS in the ‘SC’ field. There are 252 subject categories in total in WoS. The dataset used in this work is distributed over about 61% of these categories (153 in total). Thus, for some WoS categories there are no papers present in ResearchGate. Fig. 4 presents the subject category wise distribution of the articles from the dataset present in ResearchGate.

Subject Category (SC) Distribution of Crawled Data.
It is also interesting to examine the detailed distribution of subject categories among the research papers in the dataset. The data is thus further analyzed to identify the top 15 subject categories found in the collected data. First, the top 15 subject categories (in terms of number of papers) are extracted from the WoS data. The top 15 subject categories are then also extracted from the dataset used in this work. Of these, about 60% of the subject categories have coverage in ResearchGate. The detailed results are shown in Table 6. They show that some subjects have higher coverage in ResearchGate whereas some are not represented at all. In this subject category wise analysis, Chemistry is at the top of both the WoS and the ResearchGate lists, though its coverage percentage in ResearchGate is somewhat low (60.03%). Among these top 15 research areas, ‘general & internal medicine’ has the highest coverage in ResearchGate, at 72.18%. These results show a clear differentiation in the presence of articles of different subject categories in ResearchGate. Thus, there are disciplinary differences in the coverage of articles in ResearchGate, with some disciplines having better coverage.
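The subject category wise coverage can be sketched in Python as a pair of frequency counts followed by a ratio per category. This is an illustrative sketch only: it assumes that each record is a dictionary whose `SC` value may list several categories separated by semicolons, and the function names `count_categories` and `coverage_table` as well as the toy records are hypothetical.

```python
from collections import Counter


def count_categories(records):
    """Count papers per subject category; a paper may carry several categories."""
    counts = Counter()
    for rec in records:
        for cat in rec.get("SC", "").split(";"):
            cat = cat.strip()
            if cat:
                counts[cat] += 1
    return counts


def coverage_table(wos_records, matched_records, top_n=15):
    """Rows of (category, WoS papers, matched papers, coverage %) for the top WoS categories."""
    wos = count_categories(wos_records)
    matched = count_categories(matched_records)
    rows = []
    for cat, n_wos in wos.most_common(top_n):
        n_matched = matched.get(cat, 0)
        rows.append((cat, n_wos, n_matched, round(100 * n_matched / n_wos, 2)))
    return rows


# Toy records (hypothetical), only to show the computation.
wos = [{"SC": "Chemistry"}, {"SC": "Chemistry; Physics"}, {"SC": "Physics"},
       {"SC": "Chemistry"}, {"SC": "Sociology"}]
matched = [{"SC": "Chemistry"}, {"SC": "Chemistry; Physics"}]

rows = coverage_table(wos, matched, top_n=2)
categories_covered = len(count_categories(matched))  # categories with at least one matched paper
```

A category that appears in WoS but has no matched record gets a coverage of 0%, which corresponds to the disciplines described in the text as not represented in ResearchGate at all.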
Top 15 Subject category-wise Coverage in ResearchGate
Conclusion
The paper presents an altmetric analysis of research articles from India as a case study to observe coverage, visibility, usage patterns and disciplinary variations. WoS data for publications from 2016 is obtained and analyzed vis-à-vis its presence in ResearchGate. The analysis yields useful inferences. It is observed that about 61% of the research data indexed in WoS for the year 2016 is found in ResearchGate. This is indeed a good amount of coverage. Disciplinary variations are found in terms of coverage in ResearchGate, with some disciplines not covered at all. The analysis of ResearchGate statistics also yields important results. It is found that there is no clearly visible relationship between reads and citations. In fact, the analysis of sets of different sizes produces different proportions of common papers. Also, a highly read paper is not necessarily a highly cited paper and vice-versa. The rate of attracting citations is found to be higher in ResearchGate. There are also huge differences in the magnitude of mentions, comments, downloads and recommendations of different papers, with a few papers having high values and the majority having very low values.
The analysis yields useful results but at the same time raises new questions worth answering. One such question is whether the citation patterns, reads and recommendations of research papers in ResearchGate follow a power law. From the initial results this looks quite probable, and it can be explored in future work. A second important question concerns the disciplinary variations in coverage and usage statistics of ResearchGate for papers from different subject categories. What causes this observed variation remains to be explored. A third question concerns the coverage levels of research papers from different countries. It remains to be seen whether the coverage level stays close to 60% for research papers from other countries and regions. There are several other directions for future research in the area. For example, one can study the correlation between different usage metrics of ResearchGate, or even between metrics from different platforms such as Mendeley and Academia. Another possibility is to find out whether presence on platforms like ResearchGate, Mendeley or Academia has any impact on the citations that a paper obtains in the future. The field of research is relatively new, with many unanswered questions to explore.
