Abstract
This article discusses a method for calculating the disruption index (D index) based on COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations), which overcomes the difficulty of acquiring massive citation data, and verifies the reliability of the method through empirical research. We found that (1) the number of citation records for the focus papers differs little between Web of Science (WoS) and COCI; (2) the levels of disruptive innovation of the papers calculated based on WoS and COCI are strongly and significantly correlated; (3) among the D index and related extended indicators calculated based on COCI,
Keywords
1. Introduction
In business theory, disruptive innovations are those that create new markets and value networks or enter the bottom of existing markets and eventually displace established market-leading firms, products and alliances. Since Christensen and Rosenbloom [1] of Harvard Business School proposed this theory in 1995, many research cases have proved that seizing the important opportunity of disruptive innovation is crucial to getting a head start in the wave of global technological development. Therefore, in recent years, research related to disruptive innovation has been favoured by a large number of scholars in related fields who have conducted a series of studies on the connotation, characteristics, prediction, identification, evaluation and management of disruptive innovation.
In the current context of the urgent need for disruptive innovation in basic research, the first thing that must be addressed is the inclusiveness of the science and technology evaluation system and its evaluation paradigm, so that a system and culture supporting disruptive innovation can form. This plays an important role in implementing the national innovation-driven development strategy and improving national scientific and technological strength. However, how to evaluate research results scientifically and soundly in terms of disruptive innovation remains a common challenge for the international academic community [2].
As the main forms of output of scientific and technological research, patents and academic papers are the top priority for innovation evaluation. For the disruptive innovation evaluation of these two types of research results, Funk and Owen-Smith [3] and Wu et al. [4] carried out preliminary explorations and proposed two indicators, the consolidation-disruption (CD) index and the disruption index (D index) (as shown in equations (1) and (2)), to measure the level of disruptive innovation of patents and academic papers, respectively. Other scholars have since carried out a series of works to refine and expand them. Among them, Leydesdorff and Bornmann [5] proposed an automated computational program based on Web of Science (WoS) citation data. Although this study solved the problem of batch computation of the D index, the data needed by the program still had to be obtained manually by researchers. As an example, the evaluation study of five scientific and technical talents in a specialised field by Song et al. [6], which drew on 193 focus papers from 2006 to 2018, had to acquire 8465 citations of the focus papers and 3,012,930 citations of the references of the focus papers. This implies that even a very small-scale measurement of disruptive innovation requires a large effort to collect citation relationship data. Leydesdorff and Bornmann’s study avoided this problem by using metadata provided by the Max Planck Digital Library (MPDL), which essentially still relies on commercial citation databases. Although commercial citation databases often have the advantages of rapid data updates, operational sustainability and rich functionality, their high prices are not affordable for all countries and academic institutions.
In addition, the current mainstream commercial citation databases usually do not allow researchers unrestricted, in-depth use of the data, nor do they allow researchers to redistribute the data obtained from the databases. This poses tremendous difficulties for scholars who wish to conduct results validation, multi-perspective studies and baseline comparisons. Moreover, when researchers resort to informal data access channels such as web crawlers, they may put at risk their institution’s access to other purchased databases, whereas using open citation data effectively avoids this problem.
In this article, the sources and acquisition methods of the citation data currently used to calculate the disruption index, and the limitations they impose on research in practice, are reviewed first, and then a calculation method of the disruption index based on open citation data is proposed. Finally, the reliability of calculating the disruption index based on open citation data is verified by empirical research.
2. Overview of data sources for calculating the disruption index
In this article, we collected a total of 48 relevant papers using the D index and related extended indices from the WoS and CNKI (China National Knowledge Infrastructure) databases and investigated the citation data sources they used and how the data were acquired. Of the 48 research papers, 20 used commercial databases and 30 used open data (as shown in Figure 1). Among the 20 papers that used commercial databases, the main commercial database was WoS, but the access methods differed. Four research teams, Wu et al. [4], Li and colleagues [7–10], Bornmann and colleagues [5,11–14] and Liu and colleagues [15–18], mainly relied on the citation data resources provided by Clarivate (directly from Clarivate, MPDL, Indiana University and NSLC, respectively), whereas Song et al. obtained them through web crawlers. Among the 30 papers using open data, a wide variety of open data sets and repositories were involved, including USPTO Open Data, ORCID, APS Open Data, third-party open data sets, MAG, PubMed and the OpenCitations Index of Crossref open DOI-to-DOI citations (COCI).

Figure 1. Data sources, access and frequency of use in studies related to the disruption index.
In terms of the nature of research, the two pioneering studies in this field [3,4] both used open data sets, and Wu et al.’s [4] study also used commercial databases. However, in terms of study size, most of the studies using larger-scale data still rely on WoS. Third-party data and open data sets from domain institutions are usually applied in the initial exploration of research branches and in validation and cross-sectional studies. Among all the data sources involved, those with a wide range of subject areas include WoS, USPTO, MAG, PubMed and COCI; of these, USPTO and PubMed focus on patents and life sciences research papers, respectively, and MAG is out of service. According to a recent independent analysis by Martín-Martín et al. [19], the coverage of OpenCitations is now close to that of the two major proprietary citation indexes, WoS and Scopus. Thus, compared with the commercial database WoS, COCI strikes a balance between accessibility and domain coverage.
3. Calculation method of the disruption index based on open citation data
3.1. Data source
OpenCitations is an open scholarly infrastructure dedicated to publishing open citation data as linked open data using semantic web technologies, thus providing a disruptive alternative to traditional proprietary citation indexes [20]. OpenCitations was born out of the Initiative for Open Citations (I4OC), which was proposed in 2017. The aim of the initiative is to promote structured, separable and open citation data. Structured means that the data representing each publication and citation instance are expressed in a common machine-readable format which can be accessed programmatically. Separable means that citation instances can be accessed and analysed without access to the original documents in which the citations were created. Open means that the data can be freely accessed and reused.
As David Shotton [21], one of the two directors of OpenCitations who is also at the Centre for Electronic Research at the University of Oxford, argues, ‘Bibliographic metadata is a key component of the open scholarly ecosystem and should not be placed behind the subscription paywall’. However, there is currently a sharp conflict between the intellectual property and data disposal rights of database vendors and the scholarly community’s right to know about and control its data, and it is often difficult for researchers to have a voice. Opening citation data could bring many benefits to researchers and related institutions in the field of scientometrics, including but not limited to: increased discoverability of published content, which would particularly benefit individuals who are not members of academic institutions that subscribe to commercial citation databases; the ability to build new services from publicly available citation data that benefit publishers, researchers, funding agencies, academic institutions and the public, and to enhance existing services; and the creation of public citation maps to explore connections between fields of knowledge and track the evolution of ideas and scholarly disciplines [22].
Based on the relevant information disclosed on the OpenCitations website and the article ‘Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations’ published by Heibi et al. [23] in Scientometrics, this article summarises the current resources and services provided by OpenCitations (as shown in Figure 2), which include four main parts: data sets, query functions, tools and software. In this article, we focus on the data sets provided by OpenCitations:
OpenCitations Indexes and OpenCitations Meta. OpenCitations Indexes contain information about citations, which are treated not as simple links but as data entities. This means that descriptive properties, such as the date of creation, time span and type, can be assigned to each citation, making citations easier to describe, distinguish, count and process. OpenCitations Meta, by contrast, is responsible for storing and providing bibliographic metadata for all publications covered in OpenCitations Indexes. The master data set of OpenCitations Indexes, COCI, provides scholarly citation information between publications covering all scholarly subject areas, containing over 77 million bibliographic resources and over 1.463 billion citation links (as of December 2022 [24]). As a public digital infrastructure [25], OpenCitations has become one of the primary sources of scholarly data for publishers, authors, librarians, funders and researchers [26]. The metadata it contains is expanding at an average rate of 11% per year, and the functionality and support it provides for application programming interfaces (APIs) have been enhanced and expanded [27].
OpenCitations Corpus. OpenCitations Corpus contains publicly downloadable RDF data sets of bibliographic and citation data and is publicly available under the Creative Commons 0 (CC0) protocol. The collection of articles is based on the Europe PubMed Central REST API.
OpenCitations in Context Corpus. OpenCitations in Context Corpus is also an RDF data set that includes bibliographic and citation data mined from the full text of articles, such as text citations, structural elements, rhetorical elements and sequential numbering of structural elements, including text citation pointers [28]. Like OpenCitations Corpus, OpenCitations in Context Corpus content is derived from the open access subset provided by PubMed Central and is also collected using the Europe PubMed Central REST API and publicly distributed under the CC0 protocol.

Figure 2. Current composition of OpenCitations’ resources and services.
3.2. Data acquisition
OpenCitations provides four forms of access to all data in COCI. First, OpenCitations provides SPARQL endpoints for all published data sets. Second, all data in any OpenCitations data set can be retrieved using the HTTP REST API, which gives web developers and users without specialised semantic web skills easy access to the data. Third, OpenCitations has developed user-friendly browsing interfaces that can be used to search the data in all OpenCitations data sets. Fourth, OpenCitations offers dump files in CSV, N-Triples and Scholix formats on Figshare. The availability of multiple access methods meets the needs of different types of users in different usage scenarios and relieves the network load on the service provider. For large-scale data computing, the most suitable option is to download the dumps from Figshare to the researcher’s local machine for further study. Figshare is an online open access repository where researchers can upload their research results and make them publicly available. Users can upload files in any format, and each upload is given a DOI for free.
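For a handful of focus papers, the REST route may be enough. The following is a minimal sketch only: the endpoint pattern in `COCI_API`, the helper names and the shape of the JSON response (a list of records with `citing` and `cited` keys) are assumptions that should be checked against the current OpenCitations API documentation before use.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint pattern for the COCI REST API (not taken from this article);
# verify it against the OpenCitations documentation.
COCI_API = "https://opencitations.net/index/coci/api/v1"


def coci_url(operation: str, doi: str) -> str:
    """Build a request URL; operation is 'citations' or 'references'."""
    if operation not in ("citations", "references"):
        raise ValueError("operation must be 'citations' or 'references'")
    return f"{COCI_API}/{operation}/{quote(doi, safe='/')}"


def parse_links(payload: str) -> list[tuple[str, str]]:
    """Reduce a JSON response to (citing DOI, cited DOI) pairs."""
    return [(row["citing"], row["cited"]) for row in json.loads(payload)]


def fetch_links(operation: str, doi: str, timeout: float = 30.0) -> list[tuple[str, str]]:
    """Download and parse one response (requires network access)."""
    with urlopen(coci_url(operation, doi), timeout=timeout) as response:
        return parse_links(response.read().decode("utf-8"))
```

Because the disruption index also needs the citations of every reference of a focus paper, this route issues one request per reference, which is why the Figshare dump is preferable for anything beyond a small sample.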
3.3. Data processing
Take, for example, the processing of the data set stored on Figshare in CSV format, a plain text format containing a list of records that is usually used to exchange data between different applications. The value of this format lies in storing, transferring and sharing data between different application environments rather than in direct analysis. After the dump has been obtained from Figshare, it therefore needs to be processed before researchers can use it efficiently. Considering the size of the COCI data set, converting the CSV files into a local database makes the calculation of the disruption index much easier. Owing to Figshare’s file storage settings, the COCI data set on Figshare is sliced into multiple CSV files annotated with a time series. The transformation from multiple fragmented files to a single database can be done using the data import function of a visual database management tool such as Navicat. If the hardware environment used for the study does not provide a graphical interface, the same purpose can be achieved by using database software such as SQLite directly or by using a programming language with database manipulation capabilities such as Python (based on the sqlite3 library).
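The Python route mentioned above can be sketched as follows. This is an illustrative sketch rather than the procedure used in this study: the function name `import_dump`, the table name `citation` and the batch size are hypothetical, and the seven column names are assumed to match the header row of the CSV slices.

```python
import csv
import sqlite3
from pathlib import Path

# The seven COCI fields, assumed to match the header row of each CSV slice.
FIELDS = ("oci", "citing", "cited", "creation", "timespan", "journal_sc", "author_sc")


def import_dump(csv_dir: str, db_path: str, batch_size: int = 100_000) -> int:
    """Load every *.csv slice in csv_dir into one SQLite table; return rows loaded."""
    con = sqlite3.connect(db_path)
    columns = ", ".join(f"{name} TEXT" for name in FIELDS)
    con.execute(f"CREATE TABLE IF NOT EXISTS citation ({columns})")
    placeholders = ", ".join("?" for _ in FIELDS)
    insert = f"INSERT INTO citation VALUES ({placeholders})"
    loaded = 0
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            batch = []
            for row in csv.DictReader(handle):
                batch.append(tuple(row[name] for name in FIELDS))
                if len(batch) >= batch_size:
                    # Insert in batches so memory use stays bounded.
                    con.executemany(insert, batch)
                    loaded += len(batch)
                    batch = []
            if batch:
                con.executemany(insert, batch)
                loaded += len(batch)
        con.commit()  # one transaction per slice file
    con.close()
    return loaded
```

Committing once per slice keeps the import restartable at file granularity; indexes are deliberately left until after the bulk load, since maintaining them during insertion slows the import.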
After completing the basic data format conversion, multi-dimensional and multi-level data slicing of the entire database according to attributes such as citation creation time, journal source, authors, etc. and reasonable indexing of the data tables can save time spent in the subsequent calculation of the disruption index of the focus papers. Considering the size of the data and the performance requirements of the actual research, the above model is more suitable for a single researcher or a small team to carry out the relevant practice. If the research institution can provide high-performance equipment, Pandas or an in-memory database such as Redis would be a better choice for processing and would offer the possibility of interactive operations [29].
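As one possible illustration of the indexing and slicing step, the sketch below indexes the two DOI columns and copies one window of citation creation years into its own table. The table and index names are hypothetical, and the `creation` field is assumed to begin with a four-digit year.

```python
import sqlite3


def index_and_slice(con: sqlite3.Connection, first_year: int, last_year: int) -> int:
    """Index the citation table and copy one creation-year window into its own table."""
    # Lookups by citing and by cited DOI dominate the later calculation.
    con.execute("CREATE INDEX IF NOT EXISTS idx_citing ON citation (citing)")
    con.execute("CREATE INDEX IF NOT EXISTS idx_cited ON citation (cited)")
    # The slice name is built from integers only, so it is safe to interpolate.
    slice_name = f"citation_{int(first_year)}_{int(last_year)}"
    con.execute(f"DROP TABLE IF EXISTS {slice_name}")
    # Create an empty table with the same columns, then fill it.
    con.execute(f"CREATE TABLE {slice_name} AS SELECT * FROM citation WHERE 0")
    # creation is assumed to look like YYYY, YYYY-MM or YYYY-MM-DD.
    con.execute(
        f"INSERT INTO {slice_name} SELECT * FROM citation "
        "WHERE CAST(substr(creation, 1, 4) AS INTEGER) BETWEEN ? AND ?",
        (int(first_year), int(last_year)),
    )
    con.commit()
    return con.execute(f"SELECT COUNT(*) FROM {slice_name}").fetchone()[0]
```

Rows with an empty `creation` value fall outside every window under this scheme, so they need separate handling if they matter for the study.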
3.4. Calculation of the disruption index
After Wu et al. [4] proposed the D index, many scholars improved it and carried out further applications and studies according to specific application scenarios (see Table 1). However, no matter how the index is extended, it ultimately depends on three parameters, namely NF, NB and NR. Therefore, it is only necessary to obtain these three parameters for the focus papers to complete the calculation of their disruption index. Conveniently, the COCI data set provides seven fields (see Table 2). Once the researcher has selected the focus papers, the three parameters
Table 1. The disruption index and various extension indicators.
Table 2. The meaning of each field of COCI.

Figure 3. Process of calculating the disruption index based on COCI open citation data.
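The calculation can be sketched in a few lines once the citation links are in a local table. The sketch assumes the usual reading of the three parameters following Wu et al. [4]: NF counts papers citing the focus paper but none of its references, NB counts papers citing both the focus paper and at least one of its references, and NR counts papers citing at least one reference but not the focus paper, with D = (NF − NB) / (NF + NB + NR). The function name and the `citation(citing, cited)` table layout are hypothetical, and no publication-date filter is applied to the citing papers, which is a simplification.

```python
import sqlite3


def disruption_index(con: sqlite3.Connection, focus_doi: str):
    """Return (NF, NB, NR, D) for one focus paper from a citation(citing, cited) table."""
    # References of the focus paper: rows where it is the citing side.
    references = {row[0] for row in con.execute(
        "SELECT cited FROM citation WHERE citing = ?", (focus_doi,))}
    # Papers citing the focus paper.
    citers = {row[0] for row in con.execute(
        "SELECT citing FROM citation WHERE cited = ?", (focus_doi,))}
    # Papers citing at least one reference of the focus paper.
    reference_citers = set()
    for ref in references:
        reference_citers.update(row[0] for row in con.execute(
            "SELECT citing FROM citation WHERE cited = ?", (ref,)))
    reference_citers.discard(focus_doi)  # the focus paper itself does not count
    n_f = len(citers - reference_citers)   # cite the focus paper only
    n_b = len(citers & reference_citers)   # cite both
    n_r = len(reference_citers - citers)   # cite the references only
    total = n_f + n_b + n_r
    d = (n_f - n_b) / total if total else None  # undefined without any citers
    return n_f, n_b, n_r, d
```

Since the extended indicators are built from the same three counts, returning them alongside D lets the variants be derived without querying the database again.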
4. Empirical study
4.1. Research object
In order to better reflect the effect of calculating the disruption index based on open citation data, this study takes as its research objects 11 papers identified as Landmark in Faculty Opinions in 2020 (see Table 3), as well as a Nobel Prize paper and a group of highly cited papers with the same topic and publication time (see Table 4). Based on these papers, the reliability of the method of calculating the disruption index based on open citation data is verified.
Table 3. Landmark papers in Faculty Opinions of 2020.
Table 4. Nobel Prize papers and highly cited papers of the same year on the same topic.
Faculty Opinions, the most authoritative peer-reviewed database in the global biomedical field for nearly 20 years, incorporates the combined efforts of more than 8000 leading international experts and is a knowledge discovery tool for evaluating published research. Faculty Opinions’ reviewers are leading experts in the life sciences and medicine who provide comments, opinions and validation of key papers in their fields. The quality and rigour of the reviewers mean that researchers can be assured of the quality of the papers they recommend, and Faculty Opinions brings these recommendations together to recommend high-quality research to a wider audience.
4.2. Research indicators
The evaluation indices in this study include the following three types of indices: impact indices, innovation indices and peer review indices. Among them, Cumulative Citations for 3 years (CC3) are selected as the impact indicator;
4.3. Research results
4.3.1. Comparison of citation relationship data coverage
The number of citation relationship data of the 11 focus papers in the WoS core set (SCIE, SSCI and AHCI) and COCI is shown in Tables 5 and 6 and Figures 4 and 5.
Table 5. Reference/citation relationship data of Landmark papers in WoS core set and COCI.
Table 6. Reference/citation relationship data of Nobel Prize and relevant papers in WoS core set and COCI.

Figure 4. Reference/citation relationship data of Landmark papers in WoS core set and COCI.

Figure 5. Reference/citation relationship data of Nobel Prize and relevant papers in WoS core set and COCI.
From them, we can find that (1) the difference in citation relationship data between the WoS core set and COCI is very small for most of the focus papers, (2) the number of citations in COCI is higher than the number of citations in the WoS core set for all focus papers and (3) only a few focus papers show a large difference in the number of references between the WoS core set and COCI.
Phenomenon 1 is due to the fact that the WoS core set contains the vast majority of high-quality journals and therefore also covers the majority of cited literature sources. Phenomenon 2 is due to the fact that COCI also contains journals that are not included in the WoS core set, and these journals contribute part of the citation count. Phenomenon 3 may be due to the fact that the focus papers cite earlier published literature without DOI registration or literature from journals and publishers (especially some large for-profit publishers [37]) that do not open their citation data to Crossref [38].
4.3.2. Comparison and correlation analysis of the disruption index calculation results
Taking the

Figure 6. The logarithmic calculation result based on WoS and COCI.
Table 7. The calculation result based on WoS and COCI.
The calculation results for the Nobel Prize paper and the comparative highly cited papers in the two citation data sources show that the level of disruptive innovation of the Nobel Prize paper is much higher than that of the comparative highly cited papers. Moreover, the evaluation of disruptive innovation based on COCI can distinguish the differences in the level of disruptive innovation between papers in more detail (Table 8).
Table 8. The calculation result of Nobel Prize papers and comparative highly cited papers based on WoS and COCI.
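The consistency check between the two data sources amounts to correlating two lists of index values for the same focus papers. The sketch below computes a Pearson coefficient; the choice of Pearson is an assumption made for illustration, since the coefficient used in this study is not restated here, and the function name is hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long lists of index values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return covariance / math.sqrt(var_x * var_y)
```

In use, `x` would hold the index values computed from WoS and `y` those computed from COCI, in the same paper order.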
4.3.3. Comparison of the disruption index extended metrics with peer-reviewed results
The results of the different extension indicators of the disruption index calculated for the selected 11 focus papers based on COCI citation data are shown in Table 9. The rankings of the focus papers under each indicator and the sum of absolute ranking differences from the results of peer review are shown in Table 10. It can be found that all the extended indices are closer to the results of expert peer review than the original D index, among which the evaluation results of
Table 9. Innovation level of papers based on COCI citation data using different extended disruption indices.
Table 10. Ranking of innovation level of papers based on COCI citation data using different extended disruption indices.
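The sum of absolute ranking differences used to compare an index with the peer review results can be sketched as follows. The function names are hypothetical, and ties are broken by input order, which is a simplification rather than the tie rule of this study.

```python
def rank_descending(scores: dict[str, float]) -> dict[str, int]:
    """Rank papers from 1 (highest score); ties keep input order."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {paper: position for position, paper in enumerate(ordered, start=1)}


def sum_abs_rank_diff(index_scores: dict[str, float], peer_scores: dict[str, float]) -> int:
    """Sum over papers of |rank under the index - rank under peer review|."""
    if index_scores.keys() != peer_scores.keys():
        raise ValueError("both score sets must cover the same papers")
    index_rank = rank_descending(index_scores)
    peer_rank = rank_descending(peer_scores)
    return sum(abs(index_rank[paper] - peer_rank[paper]) for paper in index_rank)
```

A smaller sum means the ordering produced by an index is closer to the ordering given by the reviewers; zero means the two orderings are identical.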
The peer review indicators of the selected 11 focus papers in Faculty Opinions are shown in Table 11. The results of correlation analysis between different disruption indices and peer review indices are shown in Table 12. The correlation between
Table 11. Peer review results of 11 landmark papers.
Table 12. Correlation analysis between different extended disruption indices and peer-reviewed indicators.
5. Conclusion
5.1. COCI and WoS have comparable citation coverage
The amount of citation relationship data for the focus papers does not differ significantly between the WoS core set and COCI. For nine of the 11 focus papers, more citation/reference data can be obtained from COCI than from the WoS core set. Only two focus papers have less citation/reference data in COCI than in the WoS core set.
5.2. The calculation results based on COCI are consistent with those based on WoS
The levels of disruptive innovation of the papers calculated based on the WoS core set and on COCI open citation data are generally consistent, and the correlation between them is 0.9779, which is significant.
5.3. The expanded indices show the same superiority in COCI as in WoS
Among the selected disruption indices and related extension indicators, the correlation between
5.4. The method based on open citation data has its advantages and significance
Using the COCI data set to calculate the disruption index avoids the direct retrieval of citation data from commercial databases and the cost and time this entails, effectively reducing the difficulty of obtaining large-scale citation data, which improves the accuracy of scientometric metrics while ensuring the reproducibility of research. Considering the broad disciplinary coverage of COCI [40], although it has not yet fully replaced commercial citation databases such as WoS in research evaluation [41], it can serve as an important source of citation data for further disruption index research in the future.
In essence, the use of open citation data represented by COCI is not confined to the single topic of disruption index calculation but can be applied to the whole field of scientometrics. Research based on open citation data is conducive to the integration of the science of science and open science and helps to address long-standing problems in academic publishing and scientific communication, such as academic plagiarism, data falsification, results that are difficult to verify and data that are difficult to reuse, thus promoting the further development of scientometrics.
5.5. The method based on open citation data and this study still have some limitations
From the perspective of the method itself, we collected the citation relationship data of the focus papers from COCI in this study. However, earlier papers usually did not register a DOI, and some large for-profit publishers do not open their citation data to Crossref. Therefore, this method misses references and citations that lack a DOI.
From the research point of view, we assumed that landmark results must have a high level of disruptive innovation and therefore chose the results of peer review by domain experts as the ‘gold standard’ for disruptive innovation evaluation in this study. However, Faculty Opinions has a limited number of reviewers, there may be bias in peer assessment [42] and peer review tends to find results with high potential impact rather than disruptive results [43]. In addition, Faculty Opinions provides reviewers with a scoring mechanism that differentiates little between papers and therefore does not indicate well the degree of variation in the quality of the literature. The practice of Brezis and Birukou [44] illustrates that the correlation between the review results and the quality of the papers improves significantly if the number of reviewers is increased to about 10. Nevertheless, the findings obtained in this study based on open citation data are all largely consistent with those obtained from related studies conducted using commercial citation data. In future studies, we will systematically evaluate the effects of calculating the disruption index based on open citation data, and the differences in the evaluation effects of different extended disruption indices, using open peer review data from more subject areas.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This research received the financial support from the National Social Science Foundation of China (23BTQ085).
Availability of data
All reference relationship data can be obtained from opencitations.net, and all peer review information can be obtained from facultyopinions.com.
