Abstract
This manuscript builds on a novel, automatic, freely-available Bayesian approach to extract information in abstracts and titles to classify research topics by quartile. This approach is demonstrated for all
Keywords
Introduction
Institutions of higher education place increasing emphasis on the number and quality of peer-reviewed publications achieved by their faculty (Mantzavinos, 2018). Hence, defining a relevant research agenda around topics that are prevalent in high-impact journals can be a crucial step toward academic success for new researchers. This is a complex endeavour due to the wide spectrum and dynamic nature of publication topics in the literature (Syed & Weber, 2018; Zou et al., 2019; Vale, 2015; Mane & Borner, 2004).
New journals are launched constantly, making it increasingly difficult to keep up with academic research within any discipline. The evolution of open access journals has further contributed to that expansion, though it has also sparked a debate regarding the quality of some of these publishers (Beall, 2016; Vale, 2015; Kaiser, 2017; Wang, 2017). This debate has been intensified by the perceived lack of rigor among some new journals, especially those launched by predatory publishers (Beall, 2016).
A common requirement for faculty worldwide is to publish regularly, with publication targets set on pre-specified time schedules (oftentimes yearly), sometimes leading to an unintended reduction in the quality of scientific publications (Binswanger, 2014). Given the small odds of publishing in top journals in any discipline, it has been suggested that authors also assess the probability of publishing in lower-tier journals when identifying journal destinations for manuscripts (Sugimoto et al., 2013). However, the likelihood of publication among top and second-tier journals may also be influenced by the research topic. Journals’ scopes constitute the tightest filtering factor: research topics need to fall within these scopes for publication. The width of these scopes is journal-specific, with generalist journals embracing a wide array of research topics and others defining themselves in a more specialized way. Scope thus defines the feasibility of publishing any given manuscript within a journal. However, the likelihood of publication, while mainly associated with manuscript quality, is oftentimes also influenced by editorial preferences, especially in higher quartile journals, where research topics that are perceived as more relevant at a particular point in time may be favored, so that not every high-quality article can be published. There is evidence that publishers exert a strong influence on defining which research topics become mainstream (Kraus, 2002; Yuzhuo, 2019; Nash et al., 2018; Vale, 2015), which complicates the assessment of current and future influential topics within a discipline, as well as their trends. The combined likelihood of publication of each research topic in top journals (e.g., top quartile journals) will be defined by the combined set of scopes and editorial preferences.
This will be empirically reflected in the relative prevalence of each research topic within the combined set of manuscripts accepted for publication in such journals. As researchers progress in their careers, tenure and promotion committees may also find it challenging to assess the relevance of researchers’ portfolios and agendas due to the high number of specialized topics, even within common disciplines or departments (Moher et al., 2018; McKiernan et al., 2019; Wright & Vanderford, 2017). These issues are even more relevant for young researchers, especially those in lower income countries or universities with limited resources, for whom information inequities generate larger gaps from their peers and place them at an informational disadvantage during the early stages of their careers, which may persist throughout their progression toward faculty tenure and/or promotion.
There is a need for objective, freely-available, evidence-based methods to assess areas of focus and their relative prevalence among the scientific community. The hypothesis explored in this manuscript is that the prevalence of research topics is not uniform across topics or journal impact factors within the biological sciences, and, therefore, topic selection can have a major impact on young researchers’ careers. If this is the case, then the development of methods for assessment of trending research topics (as a snapshot or an inter-temporal view) can help researchers define and align their research agendas with the expectations of their home or aspirational institutions. This can also serve graduate students as they define their area of specialization by cataloging topics according to outcome metrics, such as target journal impact factors or quartiles. Institutional expectations regarding the quantity and quartile of peer-reviewed publications may be more likely to be met by those with research agendas developed around more prevalent topics. Additionally, expectations of researchers’ home institutions may be defined, or assessed, against specific peer institutional benchmarks (Trainer, 2008). In such cases, a comparative assessment against specific peer institution clusters could also be relevant.
Accurate classification of prevalent research topics and their associated trends in the literature can also enhance the submission process by identifying the most relevant venues for publication through the use of text mining procedures (Rebholz-Schuhmann et al., 2012a; Camargo et al., 2018). The availability of greater computational power has led to the development of powerful statistical methods to analyze different types of datasets, including those with heavy textual content (Rebholz-Schuhmann et al., 2012a; Liu et al., 2016; Syed & Weber, 2018; Zou et al., 2019). Researchers have new tools available to discover trends in research interest, and to steer their research output toward (or away from) such trends (Mane & Borner, 2004; Nettle & Frankenhuis, 2019; Wang, 2017). Bibliometric studies have grown rapidly across a wide range of disciplines such as sport sciences (Xianliang & Hongying, 2012), big data analysis (Akoka et al., 2017), environmental impact (Geng et al., 2017), energy research (Mao et al., 2015), computer science (Shukla et al., 2019), and software engineering (Garousi & Mantyla, 2016), among others. Within a wider topic-review framework, Chen et al. (2016) applied co-word analysis to projects of China’s National Natural Science Foundation, revealing hot topics such as game theory, supply chain management, and data mining.
Krallinger et al. (2010) reviewed some of the initial approaches to text mining within biological sciences. Rebholz-Schuhmann et al. (2012b) demonstrated the use of these techniques to form an integrative biology approach through a descriptive method, but did not rely on inferential, information-borrowing models. Other approaches built on pre-defined dictionaries include tools like Thalia (Soto et al., 2019) or BioReader (Simon et al., 2019), both demonstrated in the context of biomedical abstract classification. Sub-analyses have been performed to explore narrower research topics within the biological sciences, such as plant research with a focus on Arabidopsis thaliana (Landeghem et al., 2013), which is also built on dictionary-reliant software (EVEX), microbial diversity in food using the software AlvisIR (Chaix et al., 2019), or microorganisms (Lim et al., 2016), extracted using the text mining software @MInter. Such tools, while well-trained for narrower domains, may rely excessively on pre-defined dictionaries, statistical training, and narrow scopes. They may, therefore, require regular updates and may be less applicable to wider ranges of topics, or to topics encountered in interdisciplinary research.
From a methodological standpoint, Syed and Weber (2018) used Latent Dirichlet Allocation (Blei et al., 2003) to model documents as constructs distributed over
In these studies, researchers used a variety of methods to carry out textual analysis, such as counting key word frequency, aggregating the h-index of authors or journal papers, or elaborating a systematic mapping of existing research. However, a simple word count of topics within each manuscript for each of the journal quartiles assumes independence among the different quartiles a priori. This is difficult to justify both empirically and theoretically. Instead, by clustering the data according to International Scientific Indexing (ISI) quartile (though other metrics are also possible), the proposed approach in this manuscript calculates probabilities that are further modeled in a hierarchical manner. This is an essential step to borrow information from each cluster/quartile, allowing for dependence among the four different quartiles and providing a more accurate classification. This is a novel addition to the biological sciences literature. Topic exploration is aligned with commonly-used quartile-based journal clusters (Camargo et al., 2018; Casarin et al., 2021; Cortes et al., 2021), a metric oftentimes used for research quality assessment. Additionally, the proposed approach does not rely on data dictionaries and is freely available, which makes its use more democratic, serving as a tool to reduce global information inequities.
Methods
Data
This manuscript analyzes abstracts and titles of 149,129 papers published across all sub-disciplines of the biological sciences during the year 2017. The scope of this study is restricted to English-language publications indexed by the ISI to define a transparent set of well-established journals with a common textual space. Journals are clustered by ISI impact factor quartile. ISI quartile is a categorical ordered classifier that has been shown to have relevance in topic discovery (Taddy, 2013a). The study covers all scientific areas of biology, although more clustered analyses that focus on sub-disciplines are feasible. However, the definition of journals for sub-disciplines can be more challenging, especially as many publications are interdisciplinary within the biological sciences.
The definition of the set of journals comprising the dataset was performed following ISI’s categorization, where the inclusion category was the definition by ISI as a biological sciences journal, regardless of specialization (if any).
Model
The proposed methodological approach builds on the multinomial inverse regression method (MNIR) described in Taddy (2013b), which also provides a description of the freely-available R package used for the analysis within this manuscript.
Collections of documents can be analyzed as exchangeable sets of tokens (uni-grams), or more generally n-grams (Jurafsky & Martin, 2008). When dealing with text documents, tokens are stemmed words. For example, the words “hormone”, “hormones”, and “hormonal” all become “hormon”. This is a necessary step to collapse equivalent topics which may differ by a suffix.
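The stemming and tokenization step can be illustrated with a minimal sketch. It assumes a deliberately crude, hypothetical suffix-stripping rule (`toy_stem`) in place of an established stemmer, and whitespace tokenization; neither is claimed to be the procedure used in this manuscript or in the R package.

```python
from collections import Counter

# Hypothetical, deliberately crude suffix list; a real analysis would use an
# established stemmer (for example, the Porter algorithm).
SUFFIXES = ("al", "es", "s", "e")


def toy_stem(word: str) -> str:
    """Lower-case a word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so that short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bigram_counts(text: str) -> Counter:
    """Count adjacent pairs of stemmed tokens in a whitespace-split text."""
    tokens = [toy_stem(w) for w in text.split()]
    return Counter(zip(tokens, tokens[1:]))


# The three word forms from the text collapse to a single stem.
stems = {toy_stem(w) for w in ("hormone", "hormones", "hormonal")}

# Two surface forms of the same topic are counted as one bigram.
counts = bigram_counts("plant hormones and plant hormone signals")
```

In this toy example, "plant hormones" and "plant hormone" are counted under the same stemmed bigram, which is the collapsing effect described above.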
For consistency and replicability of our approach, this manuscript follows the notation in Taddy (2013b), where
Each of the possible
Modelling the conditional distribution of
where
The sufficient reduction score is
The model is completed with independent gamma-Laplace priors (Taddy, 2013b). This choice circumvents the need for traditional Markov Chain Monte Carlo approaches, while keeping the Bayesian structure and interpretability of outcomes. Simple optimization reaches the maximum a posteriori (MAP). In most real-time applications, only the MAP would be needed for decision-making, such as the assessment of key topics within the biological sciences.
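As an illustration only, the sketch below assumes the logistic-multinomial form commonly used for inverse regression, in which the probability of each token depends on a document attribute through a token-specific intercept and loading, and in which the reduction score projects token frequencies onto the loadings. The vocabulary, the parameter values, and the numeric coding `v` of the attribute are all hypothetical; this is neither the fitted model nor the interface of the R package.

```python
import math


def token_probabilities(alpha, phi, v):
    """Softmax of alpha_j + v * phi_j over tokens j."""
    eta = [a + v * p for a, p in zip(alpha, phi)]
    top = max(eta)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


def sufficient_reduction(phi, counts):
    """Project token frequencies (counts / document length) onto the loadings."""
    m = sum(counts)
    return sum(p * c / m for p, c in zip(phi, counts))


# Hypothetical vocabulary of three bigrams with made-up parameters.
alpha = [0.0, 0.0, 0.0]
phi = [1.0, 0.0, -1.0]

# Token probabilities at two hypothetical values of the document attribute.
q_low = token_probabilities(alpha, phi, v=-1.0)
q_high = token_probabilities(alpha, phi, v=1.0)

# Score for a hypothetical document with token counts (3, 1, 0).
z = sufficient_reduction(phi, counts=[3, 1, 0])
```

Under these assumptions, a token with a positive loading becomes more probable as the attribute increases, and a document dominated by such tokens receives a higher score.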
A simple word count for each journal quartile (i.e., category), a commonly-used approach, assumes independence among the different categories a priori. The proposed strategy uses text-specific dimension reductions based on the multinomial characteristics implied by exchangeability of token counts. As shown in Taddy (2013b), a topic model treats documents as drawn from a multinomial distribution with probabilities arising as a weighted combination of topic factors. These probabilities are further modeled in hierarchical form. Information is borrowed from each category in order to generate and extract dependencies among the four different sentiment/impact factor categories (journal quartiles).
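The contrast between independent counts and information borrowing can be sketched with a much simpler stand-in model. The example below uses beta-binomial shrinkage toward a pooled rate, which is not the gamma-Laplace MNIR specification of this manuscript; the counts, totals, and prior `strength` are made-up values chosen only to make the effect visible.

```python
def raw_rates(counts, totals):
    """Per-category rates computed independently (simple counting)."""
    return [c / n for c, n in zip(counts, totals)]


def pooled_rates(counts, totals, strength):
    """Posterior-mean rates under a beta prior centred on the pooled rate.

    `strength` is a hypothetical prior weight, in pseudo-documents.
    """
    overall = sum(counts) / sum(totals)
    return [(c + strength * overall) / (n + strength)
            for c, n in zip(counts, totals)]


# Made-up occurrences of one bigram in four categories; the last category
# has very few documents, so its raw rate is unreliable.
counts = [30, 20, 10, 0]
totals = [1000, 1000, 1000, 20]

overall = sum(counts) / sum(totals)
raw = raw_rates(counts, totals)
shrunk = pooled_rates(counts, totals, strength=100.0)
```

With independent counting, the sparse category receives a rate of exactly zero. With pooling, its estimate is pulled toward the overall rate, while well-populated categories change little.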
One key advantage of the method is its computational speed and tractability. Traditional approaches oftentimes have computational times and memory usage that are at least linear in the dimensionality of the text data. The combination of dimensionality reduction, information borrowing, and optimization within a Bayesian framework offers an attractive tool toward the automation of sentiment scores within the bigram text space.
This manuscript performs a dual analysis by extracting and classifying different sets of information aligned with different research questions and outcomes. First, all ISI publications in biological sciences are considered, regardless of authorship origin. This allows for a definition of the trends among the scientific community. The outcomes of this analysis can inform researchers regarding trending areas of interest, and whether their own research agendas are geared toward particular sentiment clusters (journal quartiles).
Second, a much smaller subset is extracted, corresponding solely to publications with an origin within two competing universities in Colombia (Uniandes and Unal). This exemplifies how this tool can serve for assessment of differentiation areas between institutions. This type of information can be used in multiple forms, including: (1) identification of research areas that may be understaffed and where peer/competing institutions may have, or start developing, a competitive advantage, thus informing decisions regarding new hires; and (2) identification of areas of research where institutions have a competitive advantage, hence informing decisions regarding marketing and promotion of the institution, as well as internal/institutional support toward increased external funding efforts.
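A comparison of this kind can be sketched, under simplifying assumptions, as a set comparison between two ranked topic lists. The function name `compare_topic_lists` and the topic labels are placeholders, and the Jaccard index is used here only as one convenient summary of overlap, not as a measure reported in this manuscript.

```python
def compare_topic_lists(topics_a, topics_b):
    """Summarize overlap and differences between two topic lists."""
    set_a, set_b = set(topics_a), set(topics_b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "shared": sorted(shared),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
        # Jaccard index: share of all listed topics common to both lists.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Placeholder labels only; these are not topics from either institution.
institution_a = ["topic one", "topic two", "topic three"]
institution_b = ["topic two", "topic three", "topic four", "topic five"]

summary = compare_topic_lists(institution_a, institution_b)
```

Topics listed under only one institution point to candidate areas of differentiation, while the shared list points to areas of direct competition.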
Results
Table 1 lists the top 40 countries in number of ISI publications in the biological sciences during 2017. As expected, the United States and most of Europe are among the regions with a high rate of publications, as well as Australia, China, India, and Mexico. This aligns not only with the relative economic power of these countries, but also with their population sizes. Africa, South America, and Central America show a more heterogeneous distribution. Colombia, which hosts the two institutions used for demonstrating the comparative analysis, ranked second in South America within the biological sciences.
Top 40 countries representing the most ISI publications in 2017 in biological sciences, with number of publications and percentage of the global research production
Abstract and title bigrams were collected from all journals. The top 40 bigrams extracted from the hierarchical Bayesian approach were listed in descending relevance for each quartile. The choice of 40 topics was arbitrary and defined by available space in a table of results, though a full distribution is available as the model outcome.
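The listing step can be sketched as a simple ranking of bigrams by a relevance score within each quartile. The scores below are made-up numbers and the helper `top_k` is hypothetical; in the analysis itself the relevance values come from the fitted model, and the cut-off is 40 rather than 2.

```python
def top_k(scores, k):
    """Return the k highest-scoring bigrams, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [bigram for bigram, _ in ranked[:k]]


# Made-up relevance scores for three bigrams in two quartiles.
scores_by_quartile = {
    "Q1": {"climate change": 0.9, "plant species": 0.8, "grain yield": 0.2},
    "Q4": {"climate change": 0.5, "plant species": 0.4, "grain yield": 0.7},
}

# One ranked list per quartile, truncated to the chosen table length.
tables = {q: top_k(s, k=2) for q, s in scores_by_quartile.items()}
```

Because the full set of scores is retained, the cut-off can be changed without refitting anything; only the truncation differs.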
At a global scale, as reflected in Tables 2 and 3, some of the frequent topics persist across quartile groups. However, there are some differences in the relative relevance of topics across quartiles. For example, grain yield is a major topic in the bottom quartiles, while it ranks lower in relevance within the top quartiles. The most frequent bigrams were those related to the big challenges of sustainable development: climate change, biodiversity, and water. The most frequent topics within water were drought stress, water quality, and soil water. This shows that the scientific community is, as expected, placing substantial emphasis on persistent environmental challenges of our time. Other recurrent bigrams are those related to soil health, namely soil moisture, soil water, soil organic, and soil layer.
Highest ranked research topics worldwide among ISI journals in biological sciences during 2017 (top 2 quartiles)
Highest ranked research topics worldwide among ISI journals in biological sciences during 2017 (bottom 2 quartiles)
Plant research appears more prevalent than animal or microorganism research, showing a bias toward this group of organisms. This could also reflect the relative importance of these subjects within academic departments. In the top quartile, the bigrams relating to plant species, plant growth, and tree species constitute most of the top 15 bigrams. Plant height and leaf area also appear in the remaining quartiles, indicating that these subjects, while relevant, are less prevalent in the top biological sciences journals. Microorganisms did not appear in the top 15 of any of the four quartile lists. However, they constitute an open area of research, preferred particularly by researchers in Central and South America, indicating the relevance of more geographically-focused sub-analyses. The only microorganism that appears in the top quartiles within the global list is the bacterium Escherichia coli. This highlights the global relevance of the topic. Issues that are more localized or regional in nature are relegated to lower relevance in both ranking and quartile. A notable absence is new molecular biology technology, such as CRISPR-Cas technology. Other remarkable absences among the more frequent bigrams are those related to immunology, cancer biology, and general health-related topics. However, those could be confounded by appearing in medical and public health publications, rather than biological sciences journals.
Finally, it is clear that genetic research remains a topic of high relevance. It appears in the top four topics within the first two quartiles, and in the top 10 within the lower quartiles. This appears to indicate that research in this topic remains mainstream within the top journals, while, when not publishable among those, the option of lower tier journals is also available for such manuscripts. The opposite occurs with topics such as body weight, which, although relevant across quartiles, experience a decaying relevance among journals in higher quartiles. An even more extreme case occurs with heavy metals research, which belongs to the top 10 topics in the lowest quartile, but does not make it to the top 40 in the highest quartile.
When analyzing the bigrams at a more local scale and between peers, bigrams become more specialized. In Colombia, the top two universities, Unal (
Highest ranked research topics at Unal among ISI journals in biological sciences during 2017 (top 2 quartiles)
Highest ranked research topics at Unal among ISI journals in biological sciences during 2017 (bottom 2 quartiles)
Highest ranked research topics at Uniandes among ISI journals in biological sciences during 2017 (top 2 quartiles)
Highest ranked research topics at Uniandes among ISI journals in biological sciences during 2017 (bottom 2 quartiles)
The analysis of Unal’s bigrams, listed in Tables 4 and 5, includes some terms related to agriculture, in particular passion fruit and cape gooseberry, a solanaceous fruit that is exported worldwide, a reflection of the links between their research and local agricultural needs. Other bigrams related to agriculture come from plant pathogens or plant disease, such as late blight, an important disease of potato and other solanaceous crops, and oxysporum dianthi, referring to the important pathogen Fusarium oxysporum f.sp. dianthi. Oil palm appears among the top three topics across the top three quartiles, dropping to tenth in the bottom quartile.
In contrast to the global list, some terms related to health and medicine (breast cancer and public health) appear in the list, which indicates that Unal’s researchers favor biological sciences journals for these topics compared to a more mainstream approach by biological sciences researchers. References to immunity are more frequent within Unal, which aligns with the long history of medical research within the institution. The causal agents of malaria, both Plasmodium falciparum and Plasmodium vivax, appear in Unal’s top quartile topics, which also reflects a major local need, given the endemic nature of malaria in Colombia (Padilla-Rodriguez et al., 2020). This highlights the importance of this disease in their research agenda. When compared to the global research list, animals, protozoans, and fungi are underrepresented relative to plants. While Unal’s major topics include several medical terms, there is no mention of transgenics or gene editing, which appear to be underrepresented areas of research at the institution when compared to the global list.
Uniandes is a smaller but highly prestigious private university. Uniandes’ top bigrams, listed in Tables 6 and 7, emphasize research in neglected tropical diseases, and in particular Chagas disease (chagasic patients and Chagas disease), the pathogen Trypanosoma cruzi, and its vector Rhodnius prolixus. The most frequent topics appear driven by regional needs. There are multiple bigrams related to microorganisms, such as Staphylococcus aureus, Lysinibacillus sphaericus, Trypanosoma cruzi, and Escherichia coli. This higher frequency of research on microorganisms reflects the existence of a microbiology program and highlights a potential preference for specialization of the institution within this area. There are also bigrams related to human populations, with a regional focus, namely Amerindian populations and south aboriginals, which also reflects the geographical scope and access to information of researchers in this institution.
When comparing both institutions, there are several areas of overlap. There is a high number of references to plants and microorganisms, in particular some protozoans and causal agents of tropical diseases. However, there are also some differences due to the specialization of each university. The departmental structure, with a single department within the biological sciences in each institution, also favours the co-existence of multiple areas of research and a complex definition of a departmental specialization. Therefore, the diversity in research topics within a single department is significant. Further exploration of the internal dynamics and historical causes for the similarities and differences between the two institutions is outside the scope of the manuscript. However, this type of comparative analysis is an outcome now available to both institutions to better define their respective areas of specialization and competitive advantage, as well as to identify areas of need within their biological sciences programs. This provides an example of how this approach can serve department heads and administrators toward defining their areas of interest and focus, whether the target is to increase breadth among or depth within research topics.
This study analyzes 149,129 manuscript abstracts and titles of academic ISI journals published during 2017 across the biological sciences. The journals were classified in four quartile categories according to their impact factors, and information borrowing across quartiles was possible through a hierarchical Bayesian approach. The analysis revealed that some topics persisted in all quartile groups, but others ranked higher depending on the quartile.
This manuscript presents a novel contribution to research classification within the biological sciences literature. First, it examines a neglected unit of analysis (bigram), which has a major impact in topic definition, since many topics in biology are defined on the basis of two words. For example, a unigram such as plant will be highly uninformative unless it qualifies the type of analysis with a second word, such as plant species or plant height. Others cannot be identified at all if not through a bigram directly, such as passion fruit, or oil palm, rendering traditional unigram methods less usable. While the approach is illustrated through bigrams, the methods allow for any n-dimensional combination of words, making it fully flexible to adjust to an incremental length of key words in the scientific literature, as well as differences by (sub)discipline or increased granularity of analysis within particular areas of research.
Efficient Bayesian MNIR methods are used to explore the impact that journal quartiles have on the definition of the most popular research topics across the biological sciences. These methods borrow information across quartiles to better assess their multidimensional nature.
One key advantage of the proposed approach is that it does not rely on pre-defined dictionaries or any prior knowledge. This is important for young researchers in lower income countries or institutions with limited resources, who may not have access to classification resources offered by publishers (oftentimes at a fee). This tool is freely available in R, a well-known free software. By democratizing information access, a more level playing field for competition is possible. By classifying bibliographic resources and extracting common themes, this proposed approach allows researchers to define and adjust their research agendas based on trending topics and their relevance.
A global assessment is provided with topics outlined across journal quartiles. However, most researchers are measured/evaluated against specific sets of expectations, which in some cases are regional or peer-based in nature (e.g., minimum impact factors, or minimum quartiles for their publications). A comparative assessment between global research topics and those in a localized set of universities demonstrates some important differences. These differences can help administrators and department heads gear future hires and reward particular areas of research. While some institutions will prefer to specialize and assess whether specific areas of research are competitive at a global scale, others can use the information to identify areas that may be lacking in their departmental research portfolios and gear hires to increase the breadth of topic representation within the department.
This study also allows for identification of research areas where a country or particular university can take a lead because of specific local conditions. For example, multiple topics in top quartiles within the comparative analysis were found to relate to biodiversity. Colombia, which is the second most biodiverse country in the world, can have direct access to data and enjoy a competitive advantage, though its top two institutions appear to be lagging.
Limitations and future research
One limitation of this study comes from constraining the scope of inclusion to manuscripts written in English. While it is the most common language for peer-reviewed publications, it does not constitute the full spectrum of ISI journals. This may have an influence throughout particular locations where English is not the main language of instruction or research, such as the South American continent. However, this effect will be tempered as research-oriented institutions become more focused on the impact of their research at a global level, and institutions where research is highly relevant tend to favour English-language publications.
Another limitation of this approach comes from the relatively narrow definition of a publication in biological sciences, which comprises our motivating example. Manuscripts that cross boundaries across sciences and are published in non-biological journals will fall outside the scope of this analysis. However, they will probably constitute a small proportion of publications by biology researchers.
Since trends represent long-term phenomena, future research includes the exploration of those dynamics both on a rolling basis and a fixed time-unit basis (monthly, yearly, etc.), allowing for snapshots of relative prevalence of current research topics and for assessment of the nature and speed of evolution of those over time. While we demonstrate the work as a snapshot in time, inter-temporal comparisons are possible. This approach would also help researchers assess whether particular topics are increasing or decreasing in relevance over time. Also, while bigrams were used to demonstrate the approach due to the nature of the biological sciences terminology, other disciplines or sub-disciplines may benefit from a different dimensionality of research topic groupings. This may also be beneficial when further clustering by sub-discipline and/or overall research topic is of interest. For example, if the interest is to extract sub-topics of research around climate change, and those are best represented by larger n-grams, the set of manuscripts can be initially restricted to those identified as addressing climate change (bigram) and a subsequent [
Conclusion
While journal scope is the key factor for consideration of a manuscript submission, prevalence of research topics in the biological sciences is uneven by journal quartile. This reflects an unequal likelihood of publication across research topics, some of which may be considered more relevant at specific points in time among the highest impact journals. This constitutes important information that young researchers, as well as those in lower income countries or within universities with limited resources, can use to define and gear their research portfolios, especially when their career progress is linked to research outcomes in higher impact publications. Our results demonstrate the prevailing differences in a specific year, but the methodology can be applied at any point in time, as well as over time, to explore both the current state of the literature and its anticipated dynamics.
