Hierarchical Bayesian classification methods to identify topics by journal quartile with an application in biological sciences

Abstract

This manuscript builds on a novel, automatic, freely-available Bayesian approach to extract information from abstracts and titles and classify research topics by journal quartile. The approach is demonstrated for all $N=$ 149,129 ISI-indexed publications in biological sciences journals during 2017. A Bayesian multinomial inverse regression approach is used to extract rankings of topics without the need for a pre-defined dictionary. Bigrams are used for extraction of research topics across manuscripts, and rankings of research topics are constructed by quartile. Worldwide and local results (e.g., a comparison between two peer/aspirational research institutions in Colombia) are provided, and differences are explored at both the global and local levels. Some topics persist across quartiles, while the relevance of others is quartile-specific. Challenges in sustainable development appear more prevalent in top quartile journals across institutions, while the two Colombian institutions favour plant and microorganism research. This approach can reduce information inequities by allowing young/incipient researchers in biological sciences, especially within lower income countries or universities with limited resources, to freely assess the state of the literature and the relative likelihood of publication in higher impact journals by research topic. It can also help institutions of higher education identify missing research topics and areas of competitive advantage.

Keywords

Information inequity; biological sciences; topic classification; Bayesian clustering; bigram; trending research topics

1. Introduction

Institutions of higher education place increasing emphasis on the number and quality of peer-reviewed publications achieved by their faculty (Mantzavinos, 2018). Hence, definition of a relevant research agenda in prevalent topics within high impact journals can be a crucial step to academic success for new researchers. This is a complex endeavour due to the wide spectrum and dynamic nature of publication topics in the literature (Syed & Weber, 2018; Zou et al., 2019; Vale, 2015; Mane & Borner, 2004).

New journals are launched constantly, making it increasingly difficult to keep up with academic research within any discipline. The evolution of open access journals has further contributed to that expansion, though it has also sparked a debate regarding the quality of some of these publishers (Beall, 2016; Vale, 2015; Kaiser, 2017; Wang, 2017). This debate has been enhanced with the perceived lack of rigor among some new journals, especially those launched by predatory publishers (Beall, 2016).

A common requirement for faculty worldwide is to publish regularly, with publication targets set on pre-specified time schedules (oftentimes yearly), sometimes leading to unintended quality reduction in scientific publications (Binswanger, 2014). Given the small odds of publishing in top journals in any discipline, it has been suggested that researchers also assess the probability of publishing in lower tier journals when identifying journal destinations for manuscripts (Sugimoto et al., 2013). However, the likelihood of publication among top and second tier journals may also be influenced by the research topic. Journals’ scopes constitute the tightest filter, since research topics need to fall within these scopes for publication. The width of these scopes is journal-specific, with generalist journals embracing a wide array of research topics and others defining themselves in a more specialized way. This defines the feasibility of publication of any given manuscript within a journal. However, the likelihood of publication, while mainly associated with manuscript quality, is oftentimes also influenced by editorial preferences, especially in higher quartile journals, where research topics that are perceived as more relevant at a particular point in time may be favored, so that not every high quality article can be published. There is evidence that publishers exert a strong influence on defining which research topics become mainstream (Kraus, 2002; Yuzhuo, 2019; Nash et al., 2018; Vale, 2015), which complicates the assessment of current and future influential topics within a discipline, as well as their trends. The combined likelihood of publication of each research topic in top journals (e.g., top quartile journals) will be defined by the combined set of scopes and editorial preferences.
This will be empirically reflected in the relative prevalence of each research topic within the combined set of manuscripts accepted for publication in such journals. As researchers progress in their careers, tenure and promotion committees may also find it challenging to assess the relevance of researchers’ portfolios and agendas due to the high number of specialized topics, even within common disciplines or departments (Moher et al., 2018; McKiernan et al., 2019; Wright & Vanderford, 2017). These issues are even more relevant for young researchers, especially those in lower income countries or universities with limited resources, for whom information inequities widen the gap with their peers and place them at an informational disadvantage during the early stages of their careers, a disadvantage which may persist throughout their progression toward faculty tenure and/or promotion.

There is a need for objective, freely-available, evidence-based methods to assess areas of focus and their relative prevalence among the scientific community. The hypothesis explored in this manuscript is that the prevalence of research topics is not uniform across topics or journal impact factors within the biological sciences, and, therefore, topic selection can have a major impact on young researchers’ careers. If this is the case, then the development of methods for assessment of research-trending topics (as a snapshot or an inter-temporal view) can help researchers define and align their research agendas with the expectations of their home or aspirational institutions. This can also help graduate students define their area of specialization by cataloging topics according to outcome metrics, such as target journal impact factors or quartiles. Institutional expectations regarding the quantity and quartile of peer-reviewed publications may be more likely to be met by researchers whose agendas are developed around more prevalent topics. Additionally, expectations of researchers’ home institutions may be defined, or assessed, against specific peer institutional benchmarks (Trainer, 2008). In such cases, a comparative assessment against specific peer institution clusters could also be relevant.

Accurate classification of prevalent research topics and their associated trends in the literature can also enhance the submission process by identifying the most relevant venues for publication through the use of text mining procedures (Rebholz-Schuhmann et al., 2012a; Camargo et al., 2018). The availability of greater computational power has led to the development of powerful statistical methods to analyze different types of datasets, including those with heavy textual content (Rebholz-Schuhmann et al., 2012a; Liu et al., 2016; Syed & Weber, 2018; Zou et al., 2019). Researchers have new tools available to discover research interest trends, and to gear their research productivity along with (or away from) such trends (Mane & Borner, 2004; Nettle & Frankenhuis, 2019; Wang, 2017). Bibliometric studies have grown rapidly across a wide range of disciplines such as sport sciences (Xianliang & Hongying, 2012), big data analysis (Akoka et al., 2017), environmental impact (Geng et al., 2017), energy research (Mao et al., 2015), computer science (Shukla et al., 2019), and software engineering (Garousi & Mantyla, 2016), among others. Within a wider topic-review framework, Chen et al. (2016) applied co-word analysis on projects of China’s National Natural Science Foundation, revealing hot topics such as game theory, supply chain management, and data mining.

Krallinger et al. (2010) reviewed some of the initial approaches to text mining within biological sciences. Rebholz-Schuhmann et al. (2012b) demonstrated the use of these techniques to form an integrative biology approach through a descriptive method, but did not rely on inferential, information-borrowing models. Other approaches built on pre-defined dictionaries include tools like Thalia (Soto et al., 2019) or BioReader (Simon et al., 2019), both demonstrated in the context of biomedical abstract classification. Sub-analyses have been performed to explore narrower research topics within the biological sciences, such as plant research with a focus on Arabidopsis thaliana (Landeghem et al., 2013), which is also built on dictionary-reliant software (EVEX), microbial diversity in food using the software AlvisIR (Chaix et al., 2019), or microorganisms (Lim et al., 2016), extracted with the text mining software @MInter. Such tools, while well-trained for narrower domains, may over-rely on pre-defined dictionaries, statistical training, and narrow scopes. They may, therefore, require regular updates and may be less applicable to wider ranges of topics, or to topics encountered throughout interdisciplinary research.

From a methodological standpoint, Syed and Weber (2018) used Latent Dirichlet Allocation (Blei et al., 2003) to model documents as constructs distributed over $k$ latent topics, where they included co-occurring words analysis and cluster analysis. Liu et al. (2016) used bags of words to preprocess the data and perform a cluster analysis together with the Latent Dirichlet Allocation process to model topics in bioinformatics. Nettle and Frankenhuis (2019) used accessible predetermined tools of databases to construct maps and build clusters to describe the changing structure of life history theory. Zou et al. (2019) used linear regression to detect the trends in oncolytic virus research. Rebholz-Schuhmann et al. (2012a) explored advances in literature analysis in an automated way, through an exhaustive list of pre-designed resources provided by several institutions within the life sciences and biology. Wang (2017) used arithmetic indexes and ratios to evaluate emerging research topics across different disciplines using Scopus.

In these studies, researchers used a variety of methods to carry out textual analysis, such as counting keyword frequency, aggregating the h-index of authors or journal papers, or elaborating a systematic mapping of existing research. However, a simple word count of topics within each manuscript for each of the journal quartiles assumes independence among the different quartiles a priori. This is difficult to justify both empirically and theoretically. Instead, by clustering the data according to International Scientific Indexing (ISI) quartile (though other metrics are also possible), the approach proposed in this manuscript calculates probabilities that are further modeled in a hierarchical manner. This is an essential step to borrow information across clusters/quartiles, which induces dependence among the four quartiles and provides a more accurate classification. This is a novel addition to the biological sciences literature. Topic exploration is aligned with commonly-used quartile-based journal clusters (Camargo et al., 2018; Casarin et al., 2021; Cortes et al., 2021), a metric oftentimes used for research quality assessment. Additionally, the proposed approach does not rely on data dictionaries and is freely available, which makes its use more democratic, serving as a tool to reduce global information inequities.

2. Methods

2.1 Data

This manuscript analyzes abstracts and titles of 149,129 papers published across all sub-disciplines of the biological sciences during the year 2017. The scope of this study is restricted to English-language publications indexed by the ISI to define a transparent set of well-established journals with a common textual space. Journals are clustered by ISI impact factor quartile. ISI quartile is a categorical ordered classifier that has been shown to have relevance in topic discovery (Taddy, 2013a). The study covers all scientific areas of biology, although more clustered analyses that focus on sub-disciplines are feasible. However, the definition of journals for sub-disciplines can be more challenging, especially as many publications are interdisciplinary within the biological sciences.

The definition of the set of journals comprising the dataset was performed following ISI’s categorization, where the inclusion category was the definition by ISI as a biological sciences journal, regardless of specialization (if any).

2.2 Model

The proposed methodological approach builds on the multinomial inverse regression method (MNIR) described in Taddy (2013b), which also provides a description of the freely-available R package used for the analysis within this manuscript.

Collections of documents can be analyzed as exchangeable sets of tokens (uni-grams), or more generally n-grams (Jurafsky & Martin, 2008). When dealing with text documents, tokens are stemmed words. For example, the words “hormone”, “hormones”, and “hormonal” all become “hormon”. This is a necessary step to collapse equivalent topics which may differ by a suffix.
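The stemming step can be illustrated with a minimal sketch. The suffix list and the function name `toy_stem` below are hypothetical and chosen only to reproduce the example in the text; this is a toy suffix stripper, not the stemming algorithm used in the analysis.

```python
# Toy suffix stripper (NOT a full stemming algorithm such as Porter's).
# The suffix list is a hypothetical choice made only for this example.
SUFFIXES = ("es", "al", "s", "e")  # two-letter suffixes are tried first


def toy_stem(word: str, min_stem: int = 3) -> str:
    """Lower-case a word and strip the first matching suffix, if any."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least `min_stem` characters would remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


# The three words from the text collapse to a single token.
stems = {w: toy_stem(w) for w in ("hormone", "hormones", "hormonal")}
```

In practice, an established stemmer would replace this toy rule set; the sketch only shows why suffix removal merges equivalent topics.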

For consistency and replicability of our approach, this manuscript follows the notation in Taddy (2013b), where ${\bf x}_{i}=(x_{i1},\ldots,x_{ip})^{\prime}$ is the vector of counts of $p$ possible tokens. In this particular application, and without loss of generality, these are bigrams of words included in article abstracts or titles. The choice of bigram instead of unigram as the unit of interest is relevant because some topics require two words to be fully described, such as climate change or soil water. Larger n-grams may enhance the specificity of the topic definition, but, in the absence of a dictionary, may also dilute some research topics in the rankings by splitting topics with a common underlying theme, such as climate change mitigation and climate change adaptation. Additionally, many plant and microorganism names are bigrams formed by a genus name followed by a specific epithet, such as Escherichia coli, where Escherichia names the genus of a gut bacterium and coli refers to its common location in the colon. Empirical frequencies are defined as ${\bf f}_{i}={\bf x}_{i}/m_{i}$, with $m_{i}=\sum_{j=1}^{p}x_{ij}$.
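As a minimal sketch of these definitions, the snippet below counts bigrams in one document and converts the counts to empirical frequencies. The function names and the toy token list are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in one document: the vector x_i."""
    return Counter(zip(tokens, tokens[1:]))


def empirical_frequencies(counts):
    """Divide each count by the document total m_i to obtain f_i = x_i / m_i."""
    m_i = sum(counts.values())
    return {bigram: c / m_i for bigram, c in counts.items()}


# Hypothetical, already-stemmed toy title (not taken from the dataset).
tokens = ["climat", "chang", "affect", "soil", "water", "and", "climat", "chang"]
x_i = bigram_counts(tokens)          # sparse representation of x_i
f_i = empirical_frequencies(x_i)     # entries sum to one
```

Here the document contains seven bigrams, of which ("climat", "chang") occurs twice, so its empirical frequency is 2/7.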

Each of the possible $n=$ 149,129 articles can be defined as a document. Documents are related to associated observable sentiment variables, $y_{i}$, which can be ordered into discrete ordered categories. In our case, those sentiment categories correspond to journal quartiles. The categorical ordered variable, $y$, has four categories, with $y=4$ defining the lowest sentiment, which aligns with standard quartile definitions (Camargo et al., 2018; Casarin et al., 2021).

Modelling the conditional distribution of $y_{i}|{\bf x}_{i}$ can be computationally prohibitive. However, upon collapsing token counts ${\bf x}_{y}=\sum_{i:y_{i}=y}{\bf x}_{i}$ by sentiment category $y\in\mathcal{Y}$, the resulting model becomes

$\displaystyle{\bf x}_{y}\sim\mathrm{MN}({\bf q}_{y},m_{y})\textrm{\,with\,}$

(1) $\displaystyle q_{yj}=\frac{\exp(\alpha_{j}+y\psi_{j})}{\sum_{l=1}^{p}\exp(\alpha_{l}+y\psi_{l})}\textrm{\,for\,}j=1,\ldots,p,\;y\in\mathcal{Y},$

where ${\bf x}_{y}$ follows a $p$-dimensional multinomial distribution with size parameter $m_{y}=\sum_{i:y_{i}=y}m_{i}$ and probabilities ${\bf q}_{y}=[q_{y1},\ldots,q_{yp}]^{\prime}$.
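The collapsing step and the probabilities in equation (1) can be sketched as follows. This is an illustration under stated assumptions: the function names, the toy counts, and the values of $\alpha$ and $\psi$ are hypothetical, and no model fitting is performed.

```python
import math


def collapse_by_quartile(docs):
    """Sum count vectors over documents sharing a quartile: x_y = sum_i x_i."""
    collapsed = {}
    for y, x in docs:
        total = collapsed.setdefault(y, [0] * len(x))
        for j, count in enumerate(x):
            total[j] += count
    return collapsed


def mnir_probabilities(alpha, psi, y):
    """Evaluate q_yj = exp(alpha_j + y*psi_j) / sum_l exp(alpha_l + y*psi_l)."""
    eta = [a + y * s for a, s in zip(alpha, psi)]
    shift = max(eta)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(e - shift) for e in eta]
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical toy inputs: three bigrams, documents labelled with quartile y.
docs = [(1, [2, 0, 1]), (1, [1, 1, 0]), (4, [0, 3, 1])]
x_by_y = collapse_by_quartile(docs)

alpha = [0.0, 0.0, 0.0]   # illustrative intercepts (assumed, not estimated)
psi = [-0.5, 0.5, 0.0]    # illustrative loadings (assumed, not estimated)
q1 = mnir_probabilities(alpha, psi, y=1)
q4 = mnir_probabilities(alpha, psi, y=4)
```

With these illustrative loadings, the second bigram has a positive $\psi_{j}$, so its probability is larger for $y=4$ than for $y=1$, while the first bigram moves in the opposite direction.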

The sufficient reduction score is $z_{i}={\bf\psi}^{\prime}{\bf f}_{i}$. This score is computed as the inner product between the multinomial inverse regression factor loadings, ${\bf\psi}=(\psi_{1},\ldots,\psi_{p})^{\prime}$, one for each token (or n-gram), and the empirical frequencies, ${\bf f}_{i}$, obtained from the token counts (Taddy, 2013b). Intuitively, $z_{i}$ provides the “average” factor loading contribution of document $i$. This reduction score is similar in philosophy to Altman’s $z$-score (Altman, 1968). The higher the $z$-score, the higher the content of bigrams with large factor loadings (${\bf\psi}$).
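A minimal sketch of the score computation is given below. The loadings and the two toy documents are hypothetical values chosen for illustration; `sr_score` is not a function from the package used in the analysis.

```python
def sr_score(psi, x):
    """Sufficient reduction score z_i = psi' f_i, with f_i = x_i / m_i."""
    m_i = sum(x)
    if m_i == 0:
        raise ValueError("document has no counted bigrams")
    # Inner product of loadings and counts, divided by the document total.
    return sum(p * c for p, c in zip(psi, x)) / m_i


# Hypothetical loadings for three bigrams and two toy documents.
psi = [-0.5, 0.5, 0.0]
doc_a = [4, 0, 0]   # only the bigram with a negative loading
doc_b = [0, 3, 1]   # mostly the bigram with a positive loading
z_a = sr_score(psi, doc_a)   # -0.5
z_b = sr_score(psi, doc_b)   # 0.375
```

Because the score is a frequency-weighted average of loadings, it does not depend on document length, only on the composition of the bigrams.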

The model is completed with independent gamma-Laplace priors (Taddy, 2013b). This choice circumvents the need for traditional Markov Chain Monte Carlo approaches, while keeping the Bayesian structure and interpretability of outcomes. Simple optimization reaches the maximum a posteriori (MAP). In most real-time applications, only the MAP would be needed for decision-making, such as the assessment of key topics within the biological sciences.
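To make the optimization target concrete, the following worked derivation sketches how a gamma-Laplace prior leads to a penalized objective. It assumes that each loading $\psi_{j}$ has a Laplace prior with rate $\lambda_{j}$, that $\lambda_{j}$ has a gamma hyperprior with shape $s$ and rate $r$, and that any prior on the intercepts $\alpha_{j}$ is omitted. The symbols $\lambda_{j}$, $s$, and $r$ and this parameterization are assumptions introduced here for illustration and should be checked against Taddy (2013b).

```latex
% Assumed priors (illustrative parameterization):
%   p(\psi_j \mid \lambda_j) = (\lambda_j/2)\exp(-\lambda_j|\psi_j|),
%   p(\lambda_j) \propto \lambda_j^{s-1}\exp(-r\lambda_j).
\begin{align*}
-\log p(\psi_j,\lambda_j)
  &= \lambda_j|\psi_j| - s\log\lambda_j + r\lambda_j + \text{const},\\
\hat{\lambda}_j
  &= \frac{s}{r+|\psi_j|}
  \quad\text{(setting the derivative in } \lambda_j \text{ to zero)},\\
-\log p(\psi_j,\hat{\lambda}_j)
  &= s\log\!\left(1+\frac{|\psi_j|}{r}\right) + \text{const}.
\end{align*}
% Resulting joint MAP objective, using q_{yj} from equation (1):
\begin{equation*}
\min_{\alpha,\psi}\;
  -\sum_{y\in\mathcal{Y}}\sum_{j=1}^{p} x_{yj}\log q_{yj}
  \;+\;\sum_{j=1}^{p} s\log\!\left(1+\frac{|\psi_j|}{r}\right).
\end{equation*}
```

Under these assumptions, the penalty is concave in $|\psi_{j}|$, so large loadings are shrunk proportionally less than small ones, and many loadings can be set exactly to zero.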

A simple word count for each journal quartile (i.e., category), a commonly-used approach, assumes independence among the different categories a priori. The proposed strategy uses text-specific dimension reductions based on the multinomial characteristics implied by exchangeability of token counts. As shown in Taddy (2013b), a topic model treats documents as drawn from a multinomial distribution with probabilities arising as a weighted combination of topic factors. These probabilities are further modeled in hierarchical form. Information is borrowed from each category in order to generate and extract dependencies among the four different sentiment/impact factor categories (journal quartiles).

One key advantage of the method is its computational speed and tractability. Traditional approaches oftentimes have computational times and memory usage that are at least linear in the dimensionality of the text data. The combination of dimensionality reduction, information borrowing, and optimization within a Bayesian framework offers an attractive tool toward the automation of sentiment scores within the bigram text space.

2.3 Global analysis vs. comparative analysis

This manuscript performs a dual analysis by extracting and classifying different sets of information aligned with different research questions and outcomes. First, all ISI publications in biological sciences are considered, regardless of authorship origin. This allows for a definition of the trends among the scientific community. The outcomes of this analysis can inform researchers regarding trending areas of interest, and whether their own research agendas are geared toward particular sentiment clusters (journal quartiles).

Second, a much smaller subset is extracted, corresponding solely to publications with an origin within two competing universities in Colombia (Uniandes and Unal). This exemplifies how this tool can serve for assessment of differentiation areas between institutions. This type of information can be used in multiple forms, including: (1) identification of research areas that may be understaffed and where peer/competing institutions may have, or start developing, a competitive advantage, thus informing decisions regarding new hires; and (2) identification of areas of research where institutions have a competitive advantage, hence informing decisions regarding marketing and promotion of the institution, as well as internal/institutional support toward increased external funding efforts.
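The comparative use case can be sketched with a simple list comparison. The example uses the first ten Quartile 1 topics reported for Unal and Uniandes in Tables 4 and 6; the function name `compare_topic_lists` and the choice of a top-10 cutoff are hypothetical and only illustrate how differentiation areas could be flagged.

```python
def compare_topic_lists(ours, theirs):
    """Split two ranked topic lists into shared topics and topics unique to each."""
    ours_set, theirs_set = set(ours), set(theirs)
    return {
        "shared": [t for t in ours if t in theirs_set],
        "only_ours": [t for t in ours if t not in theirs_set],
        "only_theirs": [t for t in theirs if t not in ours_set],
    }


# First ten Quartile 1 topics from Table 4 (Unal) and Table 6 (Uniandes).
unal_q1 = [
    "Species richness", "Genetic diversity", "Oil palm", "Binding peptides",
    "Red blood", "Immune response", "Gene expression", "Climate change",
    "Dry season", "Plasmodium falciparum",
]
uniandes_q1 = [
    "Chagas disease", "Gene flow", "Genetic diversity", "Species richness",
    "Trypanosoma cruzi", "Climate change", "Physical activity",
    "Chagasic patients", "Mitochondrial dna", "Body size",
]
result = compare_topic_lists(unal_q1, uniandes_q1)
```

Topics listed under `only_theirs` are candidates for areas where the peer institution may hold an advantage, while `only_ours` points to potential areas of differentiation.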

3. Results

Table 1 lists the top 40 countries by number of ISI publications in the biological sciences during 2017. As expected, the United States and most of Europe are among the regions with a high number of publications, as well as Australia, China, India, and Mexico. This aligns not only with the relative economic power of these countries, but also with their population sizes. Africa, South America, and Central America show a more heterogeneous distribution. Colombia, which hosts the two institutions used for demonstrating the comparative analysis, ranked third in South America within the biological sciences, behind Brazil and Argentina.

Table 1
Top 40 countries representing the most ISI publications in 2017 in biological sciences, with number of publications and percentage of the global research production

Ranking	Country	# Publications	% Total
1	United States of America	14189	13.09
2	Mainland China	11610	10.71
3	United Kingdom	5392	4.97
4	Germany	4934	4.55
5	India	4330	3.99
6	Australia	3665	3.38
7	Canada	3642	3.36
8	Spain	3534	3.26
9	Brazil	3331	3.07
10	France	3311	3.05
11	Italy	2877	2.65
12	Japan	2322	2.14
13	Russian Federation	1931	1.78
14	Poland	1837	1.69
15	Netherlands	1775	1.64
16	Switzerland	1616	1.49
17	Iran	1567	1.45
18	Sweden	1478	1.36
19	South Korea	1471	1.36
20	Denmark	1305	1.20
21	South Africa	1300	1.20
22	Mexico	1291	1.19
23	Belgium	1242	1.15
24	Turkey	1236	1.14
25	Czech Republic	1049	0.97
26	New Zealand	1047	0.97
27	Austria	1011	0.93
28	Malaysia	991	0.91
29	Norway	967	0.89
30	Portugal	952	0.88
31	Argentina	877	0.81
32	Indonesia	851	0.79
33	Egypt	831	0.77
34	Saudi Arabia	831	0.77
35	Finland	813	0.75
36	Thailand	733	0.68
37	Colombia	712	0.66
38	Pakistan	685	0.63
39	Chile	605	0.56
40	Taiwan	535	0.49

Abstract and title bigrams were collected from all journals. The top 40 bigrams extracted from the hierarchical Bayesian approach were listed in descending relevance for each quartile. The choice of 40 topics was arbitrary and defined by available space in a table of results, though a full distribution is available as the model outcome.
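Constructing a ranked list for one quartile can be sketched as a sort over per-bigram scores. The ranking criterion used below (sorting by a fitted probability for the quartile) is an assumption made for illustration, and the scores are invented numbers rather than model output.

```python
def top_k_bigrams(bigrams, scores, k):
    """Return the k bigrams with the largest scores, in descending order."""
    order = sorted(range(len(bigrams)), key=lambda j: scores[j], reverse=True)
    return [bigrams[j] for j in order[:k]]


# Hypothetical scores for one quartile (illustrative numbers, not model output).
bigrams = ["climate change", "grain yield", "gene expression", "soil water"]
q_y = [0.30, 0.10, 0.40, 0.20]
ranking = top_k_bigrams(bigrams, q_y, k=3)
```

Setting `k=40` and repeating the call for each quartile would produce lists with the same layout as the tables of results.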

3.1 Global analysis

At a global scale, as reflected in Tables 2 and 3, some of the frequent topics persist across quartile groups. However, there are some differences in the relative relevance of topics across quartiles. For example, grain yield is the top-ranked topic in the bottom two quartiles, while it ranks lower within the top two quartiles. The most frequent bigrams were those related to the big challenges of sustainable development: climate change, biodiversity, and water. The most frequent topics within water were drought stress, water quality, and soil water. This shows that the scientific community is, as expected, placing substantial emphasis on persistent environmental challenges of our time. Other recurrent bigrams are those related to soil health, namely soil moisture, soil water, soil organic, and soil layer.

Table 2
Highest ranked research topics worldwide among ISI journals in biological sciences during 2017 (top 2 quartiles)

Rank	Quartile 1	Quartile 2
1	Gene expression	Climate change
2	Climate change	Gene expression
3	Species richness	Genetic diversity
4	Genetic diversity	Plant species
5	Plant species	Species richness
6	Environmental conditions	Body weight
7	Oxidative stress	Antioxidant activity
8	Plant growth	Grain yield
9	Antioxidant activity	Plant growth
10	Body weight	Soil water
11	Tree species	Soil moisture
12	Grain yield	Tree species
13	Expression levels	Environmental conditions
14	Community composition	Water content
15	Cell wall	Leaf area
16	Soil moisture	Oxidative stress
17	Arabidopsis thaliana	Drought stress
18	Ecology evolutioned	Environmental factors
19	Body size	Water quality
20	Genes involved	Plant height
21	Soil water	Phenolic compounds
22	Gene flow	Expression levels
23	Phenolic compounds	Soil organic
24	Water content	Antioxidant capacity
25	Drought stress	Community structure
26	Chemical industry	Growth performance
27	Transcription factors	Body size
28	Body mass	Species composition
29	Reactive oxygen	Escherichia coli
30	Antioxidant capacity	Weight gain
31	Transcription factor	Essential oil
32	Community structure	Growing season
33	Oxygen species	Protein content
34	Genetic variation	Salt stress
35	Escherichia coli	Species diversity
36	Heat stress	Seed germination
37	Leaf area	Chemical composition
38	Environmental factors	Superoxide dismutase
39	Growth rates	Genetic variation
40	Growing season	Fish species

Table 3

Highest ranked research topics worldwide among ISI journals in biological sciences during 2017 (bottom 2 quartiles)

Rank	Quartile 3	Quartile 4
1	Grain yield	Grain yield
2	Climate change	Body weight
3	Body weight	Plant height
4	Antioxidant activity	Antioxidant activity
5	Genetic diversity	Leaf area
6	Plant species	Soil water
7	Plant growth	Climate change
8	Soil water	Genetic diversity
9	Plant height	Water quality
10	Leaf area	Heavy metals
11	Species richness	Water content
12	Soil moisture	Plant species
13	Water content	Plant growth
14	Water quality	Protein content
15	Environmental factors	Environmental factors
16	Tree species	Soil moisture
17	Soil organic	Essential oil
18	Drought stress	Soil organic
19	Protein content	Yield plant
20	Essential oil	Crude protein
21	Gene expression	Seed yield
22	Weight gain	Weight gain
23	Heavy metals	Growth performance
24	Growth performance	Chlorophyll content
25	Environmental conditions	Drought stress
26	Crude protein	River basin
27	Seed germination	Species richness
28	Species diversity	Seed germination
29	Chemical composition	Tree species
30	Phenolic compounds	Essential oils
31	Species composition	Root length
32	Chlorophyll content	Species diversity
33	Seed yield	Dominant species
34	Antioxidant capacity	Chemical composition
35	Community structure	Feed conversion
36	Radical scavenging	Radical scavenging
37	Superoxide dismutase	Soil layer
38	Oxidative stress	Nitrogen phosphorus
39	Salt stress	Enzyme activity
40	Essential oils	Lactic acid

Plant research appears more prevalent than animal or microorganism research, showing a bias toward this group of organisms. This could also reflect the relative importance of these subjects within academic departments. In the top quartile, bigrams relating to plant species, plant growth, and tree species feature prominently among the top 15 bigrams. Plant height and leaf area also appear in the remaining quartiles, indicating that these subjects, while relevant, are less prevalent in the top biological sciences journals. Microorganisms did not appear in the top 15 of any of the four quartile lists. However, they constitute an open area of research, preferred particularly by researchers in Central and South America, indicating the relevance of more geographically-focused sub-analyses. The only microorganism that appears in the top quartiles within the global list is the bacterium Escherichia coli. This highlights the global relevance of the topic. Issues that are more localized or regional in nature are relegated to lower relevance in both ranking and quartile. A notable absence is new molecular biology technology, such as CRISPR-Cas technology. Other remarkable absences among the more frequent bigrams are those related to immunology, cancer biology, and general health-related bigrams. However, those could be confounded by appearing in medical and public health publications, rather than biological sciences journals.

Finally, it is clear that genetic research remains a topic of high relevance. It appears in the top four topics within the first two quartiles, and in the top 10 within the lower quartiles. This appears to indicate that research on this topic remains mainstream within the top journals, while, when not publishable among those, the option of lower tier journals is also available for such manuscripts. The opposite occurs with topics such as body weight, which, although relevant across quartiles, experiences a decaying relevance among journals in higher quartiles. An even more extreme case occurs with heavy metals research, which belongs to the top 10 topics in the lowest quartile, but does not make it to the top 40 in the highest quartile.

3.2 Comparative analysis

When analyzing the bigrams at a more local scale and between peers, bigrams become more specialized. In Colombia, the top two universities, Unal ($N=$ 342 manuscripts) and Uniandes ($N=$ 159 manuscripts), were compared. Climate change topics and terms related to biodiversity were not as highly ranked in these two institutions, as reflected in the rankings of research topics for these institutions in Tables 4–7 in comparison with Tables 2 and 3. This result is somewhat surprising since Colombia is one of the most biodiverse countries in the world, but it also aligns with the difficulty of obtaining funding for research on global problems. Some of the general results observed in the global list are also reflected in these two institutions. For example, they also present a majority of bigrams related to plants and microorganisms among their top-ranked topics.

Table 4
Highest ranked research topics at Unal among ISI journals in biological sciences during 2017 (top 2 quartiles)

Rank	Quartile 1	Quartile 2
1	Species richness	Oil palm
2	Genetic diversity	Genetic diversity
3	Oil palm	Species richness
4	Binding peptides	Gene expression
5	Red blood	Antioxidant activity
6	Immune response	Climate change
7	Gene expression	Dry season
8	Climate change	Plant species
9	Dry season	Gene flow
10	Plasmodium falciparum	Immune response
11	Aboveground biomass	Genetic variability
12	Plant species	Population structure
13	Ecosystem services	Life cycle
14	Gene flow	Public health
15	Antioxidant activity	Rainy season
16	Synthetic peptides	Antimicrobial activity
17	Mycobacterium tuberculosis	Cape gooseberry
18	Genetic variability	Dry forest
19	Population structure	Chagas disease
20	Life cycle	Genetic differentiation
21	Public health	Costa rica
22	Rainy season	Passion fruit
23	Blood cell	Plasmodium falciparum
24	Tropical forests	Escherichia coli
25	Antimicrobial activity	Leaf area
26	Merozoite invasion	Protected areas
27	Membrane potential	Santa marta
28	Dry forest	Environmental conditions
29	Chagas disease	Body weight
30	Cervical cancer	Distribution species
31	Blood cells	River basin
32	Genetic differentiation	Binding peptides
33	Amazon basin	Environmental impact
34	Soybean meal	Plant growth
35	Costa rica	Xray diffraction
36	Economic environmental	Chemical composition
37	Escherichia coli	Blood samples
38	Cytochrome oxidase	Red blood
39	Immune responses	Geographical distribution
40	Breast cancer	Immune system

Table 5

Highest ranked research topics at Unal among ISI journals in biological sciences during 2017 (bottom 2 quartiles)

Rank	Quartile 3	Quartile 4
1	Antioxidant activity	Antioxidant capacity
2	Oil palm	Antioxidant activity
3	Passion fruit	Passion fruit
4	Gene expression	Cape gooseberry
5	Cape gooseberry	Soluble solids
6	Genetic diversity	Essential oils
7	Climate change	Leaf area
8	Antioxidant capacity	Storage time
9	Dry season	Essential oil
10	Leaf area	Oil palm
11	Santa marta	Santa marta
12	Species richness	Chemical composition
13	Soluble solids	Milk production
14	Chemical composition	Sugar cane
15	Plant species	Holstein cows
16	Gene flow	Gene expression
17	Genetic variability	Water deficit
18	Population structure	Weight loss
19	Essential oils	Climate change
20	Life cycle	Gas chromatography
21	Public health	Titratable acidity
22	Rainy season	Milk yield
23	Antimicrobial activity	Kikuyu grass
24	Essential oil	Dry season
25	Dry forest	Purple passion
26	Chagas disease	Water potential
27	Genetic differentiation	Water activity
28	Milk production	Gas production
29	Gas chromatography	Secondary metabolites
30	Holstein cows	Physalis viana
31	Costa rica	Fruit quality
32	Milk yield	Genetic diversity
33	Escherichia coli	Linoleic acid
34	Protected areas	Antioxidant properties
35	Water deficit	Plant species
36	Environmental conditions	Eastern andes
37	Body weight	Feeding systems
38	Distribution species	Gene flow
39	River basin	Coconut fiber
40	Environmental impact	Fruits stored

Table 6

Highest ranked research topics at Uniandes among ISI journals in biological sciences during 2017 (top 2 quartiles)

Rank	Quartile 1	Quartile 2
1	Chagas disease	Chagas disease
2	Gene flow	Gene flow
3	Genetic diversity	Physical activity
4	Species richness	Genetic diversity
5	Trypanosoma cruzi	Species richness
6	Climate change	Trypanosoma cruzi
7	Physical activity	Climate change
8	Chagasic patients	Chagasic patients
9	Mitochondrial dna	Santa marta
10	Body size	Mitochondrial dna
11	Genetic differentiation	Body size
12	Rhodnius prolixus	Genetic differentiation
13	Cleaner production	Rhodnius prolixus
14	Genetic variation	Genetic variation
15	Tree species	Tree species
16	Plant species	Plant species
17	Dry season	Dry season
18	Population structure	Population structure
19	Water quality	Water quality
20	Genetic structure	Genetic structure
21	Eastern cordillera	Eastern cordillera
22	Phylogenetic structure	Activated carbon
23	Activated carbon	Phylogenetic relationships
24	Phylogenetic relationships	Seed dispersal
25	Seed dispersal	Ae aegypti
26	Ae aegypti	Gene expression
27	Gene expression	National park
28	National park	Public health
29	Public health	Lysinibacillus sphaericus
30	Lysinibacillus sphaericus	Sierra nevada
31	Sierra nevada	Air quality
32	Air quality	Cloud forest
33	Cloud forest	Genetic variability
34	Genetic variability	Chronic chagasic
35	Santa marta	Soil water
36	Chronic chagasic	Tropical forests
37	Soil water	Microsatellite loci
38	Tropical forests	Activated carbons
39	Microsatellite loci	Common bean
40	Activated carbons	Life cycle

Table 7

Highest ranked research topics at Uniandes among ISI journals in biological sciences during 2017 (bottom 2 quartiles)

Rank	Quartile 3	Quartile 4
1	Physical activity	Physical activity
2	Chagas disease	Essential oil
3	Gene flow	Santa marta
4	Santa marta	Volatile compounds
5	Genetic diversity	Oil leaves
6	Species richness	Serranía perijá
7	Trypanosoma cruzi	New subspecies
8	Climate change	Staphylococcus aureus
9	Chagasic patients	Adaptive capacity
10	Volatile compounds	Stem cells
11	Mitochondrial dna	White shrimp
12	Essential oil	Chagas disease
13	Body size	Natural park
14	Adaptive capacity	Southern hemisphere
15	Genetic differentiation	Forest interior
16	Rhodnius prolixus	Gene flow
17	Staphylococcus aureus	Amerindian populations
18	Genetic variation	South aboriginals
19	Tree species	Species areas
20	Plant species	Economic value
21	Dry season	Forest edges
22	Population structure	Genetic diversity
23	Water quality	Expression serotonin
24	Genetic structure	Humid habitats
25	Eastern cordillera	Hta receptors
26	Activated carbon	Biopsies patients
27	Phylogenetic relationships	Cistothorus apolinari
28	Seed dispersal	Córdoba sucre
29	Ae aegypti	Decomposition process
30	Natural park	Departments córdoba
31	Gene expression	Desert huila
32	National park	Determined species
33	Public health	East andes
34	Lysinibacillus sphaericus	East slope
35	Sierra nevada	Gastric biopsies
36	Southern hemisphere	Germination percentage
37	Forest interior	Krishan bhallasons
38	Air quality	Landscape transformation
39	Cloud forest	Marta magdalena
40	Genetic variability	Martensis starmeri

The analysis of Unal’s bigrams, listed in Tables 4 and 5, includes several terms related to agriculture, in particular passion fruit and cape gooseberry (a solanaceous fruit that is exported worldwide), which reflects the links between the institution’s research and local agricultural needs. Other agriculture-related bigrams come from plant pathogens or plant diseases, such as late blight, an important disease of potato and other solanaceous crops, and oxysporum dianthi, which refers to the important pathogen Fusarium oxysporum f. sp. dianthi. Oil palm appears among the top three topics across the top three quartiles, dropping to tenth in the bottom quartile.

In contrast to the global list, some terms related to health and medicine (breast cancer and public health) appear in Unal’s list, which indicates that Unal’s researchers favor biological sciences journals for these topics more than biological sciences researchers do worldwide. References to immunity are more frequent within Unal, which aligns with the long history of medical research within the institution. The causal agents of malaria, Plasmodium falciparum and Plasmodium vivax, both appear in Unal’s top quartile topics. This reflects a major local need, given the endemic nature of malaria in Colombia (Padilla-Rodriguez et al., 2020), and highlights the importance of this disease in the institution’s research agenda. Compared with the global research list, animals, protozoans, and fungi are underrepresented relative to plants. While Unal’s major topics include several medical terms, there is no mention of transgenics or gene editing, which appear to be underrepresented areas of research at the institution when compared to the global list.

Uniandes is a smaller but highly prestigious private university. Uniandes’ top bigrams, listed in Tables 6 and 7, emphasize research in neglected tropical diseases, in particular Chagas disease (chagasic patients and Chagas disease), its pathogen Trypanosoma cruzi, and its vector Rhodnius prolixus. The most frequent topics appear to be driven by regional needs. There are multiple bigrams related to microorganisms, such as Staphylococcus aureus, Lysinibacillus sphaericus, Trypanosoma cruzi, and Escherichia coli. This higher frequency of research on microorganisms reflects the existence of a microbiology program and suggests a potential preference for specialization of the institution within this area. There are also bigrams related to human populations with a regional focus, namely Amerindian populations and south aboriginals, which reflect the geographical scope of, and access to information for, researchers in this institution.

When comparing both institutions, there are several areas of overlap. There is a high number of references to plants and microorganisms, in particular some protozoans and causal agents of tropical diseases. However, there are also some differences due to the specialization of each university. The departmental structure, with a single department within the biological sciences in each institution, also favours the co-existence of multiple areas of research and complicates the definition of a departmental specialization. Therefore, the diversity in research topics within a single department is significant. Further exploration of the internal dynamics and historical causes of the similarities and differences between the two institutions is outside the scope of the manuscript. However, this type of comparative analysis is an outcome now available to both institutions to better define their respective areas of specialization and competitive advantage, as well as to identify areas of need within their biological sciences programs. This provides an example of how this approach can serve department heads and administrators in defining their areas of interest and focus, whether the target is to increase breadth among or depth within research topics.
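As a rough illustration of how the overlap between two institutions could be quantified, the sketch below compares the ten highest ranked Quartile 3 bigrams for Unal (Table 5) and Uniandes (Table 7) using a Jaccard index. The Jaccard index and the cut-off at ten topics are assumptions made here for illustration; they are not part of the analysis reported in this manuscript, and the function and variable names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index between two collections of topics (0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Top 10 Quartile 3 bigrams, copied from Table 5 (Unal) and Table 7 (Uniandes).
unal_q3_top10 = [
    "antioxidant activity", "oil palm", "passion fruit", "gene expression",
    "cape gooseberry", "genetic diversity", "climate change",
    "antioxidant capacity", "dry season", "leaf area",
]
uniandes_q3_top10 = [
    "physical activity", "chagas disease", "gene flow", "santa marta",
    "genetic diversity", "species richness", "trypanosoma cruzi",
    "climate change", "chagasic patients", "volatile compounds",
]

# Topics ranked in the top 10 by both institutions, and the overlap score.
shared = sorted(set(unal_q3_top10) & set(uniandes_q3_top10))
overlap = jaccard(unal_q3_top10, uniandes_q3_top10)
```

A set-based measure such as this ignores rank order within each list; a rank-aware measure could be substituted if the ordering of topics matters for the comparison.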

4. Discussion

This study analyzes 149,129 manuscript abstracts and titles of academic ISI journals published during 2017 across the biological sciences. The journals were classified into four quartile categories according to their impact factors, and information borrowing across quartiles was possible through a hierarchical Bayesian approach. The analysis revealed that some topics persisted in all quartile groups, but others ranked higher depending on the quartile.

This manuscript presents a novel contribution to research classification within the biological sciences literature. First, it examines a neglected unit of analysis (the bigram), which has a major impact on topic definition, since many topics in biology are defined on the basis of two words. For example, a unigram such as plant will be highly uninformative unless a second word qualifies the type of analysis, as in plant species or plant height. Other topics, such as passion fruit or oil palm, cannot be identified at all except through a bigram, rendering traditional unigram methods less usable. While the approach is illustrated through bigrams, the methods allow for any n-dimensional combination of words, making the approach flexible enough to accommodate longer key words in the scientific literature, as well as differences by (sub)discipline or increased granularity of analysis within particular areas of research.
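The contrast between unigrams and bigrams can be illustrated with a minimal sketch. The two text fragments below are invented for illustration and are not taken from the corpus; the tokenizer is a simple assumption (lowercase alphabetic tokens, no stop-word removal) and does not reproduce the preprocessing of the tool described in this manuscript.

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Return the n-grams of a text as space-joined lowercase strings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical abstract fragments, invented for illustration only.
abstracts = [
    "Oil palm yield and plant height under water deficit",
    "Plant species richness near oil palm plantations",
]

# Corpus-level counts for single words and for word pairs.
unigram_counts = Counter(g for a in abstracts for g in ngrams(a, 1))
bigram_counts = Counter(g for a in abstracts for g in ngrams(a, 2))

# "plant" is counted twice as a unigram, but the bigrams separate
# "plant height" from "plant species"; "oil palm" only exists as a bigram.
```

Changing the argument `n` yields trigrams or longer word combinations with the same function.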

Efficient Bayesian multinomial inverse regression (MNIR) methods are used to explore the impact that journal quartiles have on the definition of the most popular research topics across the biological sciences. These methods borrow information across quartiles to better assess their multidimensional nature.

One key advantage of the proposed approach is that it does not rely on pre-defined dictionaries or any prior knowledge. This is important for young researchers in lower income countries or institutions with limited resources, who may not have access to classification resources offered by publishers (oftentimes at a fee). This tool is freely available in R, a well-known free software. By democratizing information access, a more level playing field for competition is possible. By classifying bibliographic resources and extracting common themes, this proposed approach allows researchers to define and adjust their research agendas based on trending topics and their relevance.

A global assessment is provided with topics outlined across journal quartiles. However, most researchers are evaluated against specific sets of expectations, which in some cases are regional or peer-based in nature (e.g., minimum impact factors, or minimum quartiles for their publications). A comparative assessment between global research topics and those in a localized set of universities demonstrates some important differences. These differences can help administrators and department heads gear future hires and reward particular areas of research. While some institutions will prefer to specialize and assess whether specific areas of research are competitive at a global scale, others can use the information to identify areas that may be lacking in their departmental research portfolios and gear hires to increase the breadth of topic representation within the department.

This study also allows for identification of research areas where a country or particular university can take a lead because of specific local conditions. For example, multiple topics in top quartiles within the comparative analysis were found to relate to biodiversity. Colombia, which is the second most biodiverse country in the world, can have direct access to data and enjoy a competitive advantage, though its top two institutions appear to be lagging.

4.1 Limitations and future research

One limitation of this study comes from constraining the scope of inclusion to manuscripts written in English. While it is the most common language for peer-reviewed publications, it does not constitute the full spectrum of ISI journals. This may influence results for particular locations where English is not the main language of instruction or research, such as the South American continent. However, this effect is likely to be tempered as research-oriented institutions become more focused on the impact of their research at a global level, since institutions where research is highly relevant tend to favour English-language publications.

Another limitation of this approach comes from the relatively narrow definition of a publication in biological sciences, which comprises our motivating example. Manuscripts that cross boundaries across sciences and are published in non-biological journals will fall outside the scope of this analysis. However, they will probably constitute a small proportion of publications by biology researchers.

Since trends represent long-term phenomena, future research includes the exploration of those dynamics both on a rolling basis and a fixed time-unit basis (monthly, yearly, etc.), allowing for snapshots of the relative prevalence of current research topics and for assessment of the nature and speed of their evolution over time. While we demonstrate the work as a snapshot in time, inter-temporal comparisons are possible. This approach would also help researchers assess whether the relevance of particular topics is increasing or decreasing over time. Also, while bigrams were used to demonstrate the approach due to the nature of the biological sciences terminology, other disciplines or sub-disciplines may benefit from a different dimensionality of research topic groupings. This may also be beneficial when further clustering by sub-discipline and/or overall research topic is of interest. For example, if the interest is to extract sub-topics of research around climate change, and those are best represented by larger n-grams, the set of manuscripts can be initially restricted to those identified as addressing climate change (bigram) and a subsequent $n$-gram analysis with $n \geq 3$ can be performed within that set of manuscripts.
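The two-stage procedure in the climate change example can be sketched as follows. This is only an illustration under stated assumptions: the three text fragments are invented, the helper names `ngram_set` and `restrict_then_count` are hypothetical, and simple document-frequency counts stand in for the Bayesian ranking used in this manuscript.

```python
import re
from collections import Counter

def ngram_set(text, n):
    """Return the set of distinct n-grams (space-joined, lowercase) in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def restrict_then_count(docs, anchor_bigram, n=3):
    """Keep documents containing the anchor bigram, then count in how many
    of those documents each n-gram appears (document frequency)."""
    subset = [d for d in docs if anchor_bigram in ngram_set(d, 2)]
    counts = Counter()
    for d in subset:
        counts.update(ngram_set(d, n))
    return subset, counts

# Hypothetical abstract fragments, invented for illustration only.
docs = [
    "Climate change alters sea surface temperature",
    "Sea surface temperature and climate change adaptation",
    "Gene flow among plant species",
]

# Stage 1 restricts to climate change manuscripts; stage 2 counts trigrams.
subset, trigram_counts = restrict_then_count(docs, "climate change", n=3)
```

In this toy example the trigram "sea surface temperature" appears in both retained fragments, while trigrams from the third fragment are excluded by the first stage.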

5. Conclusion

While journal scope is the key factor for consideration of a manuscript submission, prevalence of research topics in the biological sciences is uneven by journal quartile. This reflects an unequal likelihood of publication across research topics, some of which may be considered more relevant at specific points in time among the highest impact journals. This constitutes important information that young researchers – as well as those in lower income countries or within universities with limited resources – can use to define and gear their research portfolios, especially when their career progress is linked to research outcomes in higher impact publications. Our results demonstrate the prevailing differences in a specific year, but the methodology can be applied at any point in time, as well as over time, to explore both the current state of the literature and its anticipated dynamics.

References

Akoka, Wattiau, & Laoufi (2017). Research on big data – a systematic mapping study. Computer Standards and Interfaces.

Altman, E.I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4): 589-609.

Beall (2016). Best practices for scholarly authors in the age of predatory journals. The Annals of The Royal College of Surgeons of England, 98: 77-79.

Binswanger (2014). Excellence by Nonsense: The Competition For Publications in Modern Science. In: Bartling S., Friesike S. (eds) Opening Science. Springer.

Blei, & Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022.

Camargo, Gonzalez, Guzman, ter Horst, & Trujillo (2018). Topics and methods in economics, finance, and business journals: A content analysis enquiry. Heliyon, 4.

Casarin, Camargo, Correa, Dakduk, ter Horst, & Molina (2021). What makes a tweet be retweeted? A Bayesian trigram analysis of tweet propagation during the 2015 Colombian political campaign. Journal of Information Science, 47: 297-305.

Chaix, Deleger, Bossy, & Nedellec (2019). Text mining tools for extracting information about microbial biodiversity in food. Food Microbiology, 81: 63-75.

Chen, Xie, & Li (2016). Mapping the research trends by co-word analysis based on keywords from funded project. Procedia Computer Science, 91: 547-555. Promoting Business Analytics and Quantitative Management of Technology: 4th International Conference on Information Technology and Quantitative Management (ITQM 2016).

Cortes, Gonzalez, Gunn, ter Horst, Molina, Restrepo, & Zambrano (2021). Assessment of research topic prevalence by journal impact quartile in oral health sciences using Bayesian methods. Sage Open, 11: 1-8.

Garousi, & Mantyla, M.V. (2016). Citations, research topics and active countries in software engineering: A bibliometrics study. Computer Science Review, 19: 56-77.

Geng, Wang, Zuo, Zhou, & Mao (2017). Building life cycle assessment research: A review by bibliometric analysis. Renewable and Sustainable Energy Reviews, 76: 176-184.

Jurafsky, & Martin, J.H. (2008). Speech and language processing. Prentice Hall: Series in Artificial Intelligence.

Kaiser (2017). The preprint dilemma. Science, 357(6358): 1344-1349.

Krallinger, Leitner, & Valencia (2010). Analysis of biological processes and diseases using text mining approaches. Methods in Molecular Biology, 593: 341-382.

Kraus, J.R. (2002). Citation patterns of advanced undergraduate students in biology, 2000–2002. Science & Technology Libraries, 22(3–4): 161-179.

Landeghem, Bodt, Drebert, Inze, & Peer (2013). The potential of text mining in data integration and network biology for plant research: A case study on Arabidopsis. Plant Cell, 25: 794-807.

Lim, Chng, & Nagarajan (2016). @minter: Automated text-mining of microbial interactions. Bioinformatics, 32: 2981-2987.

Liu, Tang, Dong, Yao, & Zhou (2016). An overview of topic modeling and its current applications in bioinformatics. SpringerPlus, 5(1): 1608.

Mane, & Borner (2004). Mapping topics and topic bursts in PNAS. Proceedings of the National Academy of Sciences, 101(suppl 1): 5287-5290.

Mantzavinos (2018). Publish or perish: Implications for authors, reviewers and editors. Journal of Chemical Technology and Biotechnology, doi: 10.1002/jctb.5875.

Mao, Liu, Zuo, J.Z., & Wang (2015). Way forward for alternative energy research: A bibliometric analysis during 1994–2013. Renewable and Sustainable Energy Reviews, 48: 276-286.

McKiernan, E.C., Schimanski, L.A., Munoz Nieves, Matthias, Niles, M.T., & Alperin, J.P. (2019). Meta-research: Use of the journal impact factor in academic review, promotion, and tenure evaluations. eLife, 8: e47338.

Moher, Naudet, Cristea, I.A., Miedema, Ioannidis, J.P.A., & Goodman, S.N. (2018). Assessing scientists for hiring, promotion, and tenure. PLoS Biology, 16(3): e2004089-e2004089.

Nash, J.R., Araujo, R.J., & Shideler, G.S. (2018). Contributing factors to long-term citation count in marine and freshwater biology articles. Learned Publishing, 31(2): 131-139.

Nettle, & Frankenhuis, W.E. (2019). The evolution of life-history theory: A bibliometric analysis of an interdisciplinary research area. Proceedings of the Royal Society B: Biological Sciences, 286(1899): 20190040.

Padilla-Rodriguez, J.C., Olivera, M.J., & Guevara-Garcia, B.D. (2020). Parasite density in severe malaria in Colombia. PLoS ONE, 15: e0235119.

Rebholz-Schuhmann, Oellrich, & Hoehndorf (2012a). Text-mining solutions for biomedical research: Enabling integrative biology. Nature Reviews Genetics, 13: 829.

Rebholz-Schuhmann, Oellrich, & Hoehndorf (2012b). Text-mining solutions for biomedical research: Enabling integrative biology. Nature Reviews Genetics, 13: 829-839.

Shukla, A.K., Janmaijaya, Abraham, & Muhuri (2019). Engineering applications of artificial intelligence: A bibliometric analysis of 30 years (1988–2018). Engineering Applications of Artificial Intelligence, 85: 517-532.

Simon, Davidsen, Hansen, Seymour, Barnkov, & Olsen (2019). Bioreader: a text mining tool for performing classification of biomedical literature. BMC Bioinformatics, 19.

Soto, Przybyla, & Ananiadou (2019). Thalia: Semantic search engine for biomedical abstracts. Bioinformatics, 35: 1799-1801.

Sugimoto, Lariviere, & Cronin (2013). Journal acceptance rates: A cross-disciplinary analysis of variability and relationships with journal measures. Journal of Informetrics, 7: 897-906.

Syed, & Weber, C.T. (2018). Using machine learning to uncover latent research topics in fishery models. Reviews in Fisheries Science & Aquaculture, 26(3): 319-336.

Taddy (2013a). Measuring political sentiment on Twitter: Factor optimal design for multinomial inverse regression. Technometrics, 55(4): 415-425.

Taddy (2013b). Multinomial inverse regression for text analysis. Journal of the American Statistical Association, 108(503): 755-770.

Trainer (2008). The role of institutional research in conducting comparative analysis of peers. New Directions for Higher Education, pages 21–30.

Vale, R.D. (2015). Accelerating scientific publication in biology. Proceedings of the National Academy of Sciences, 112(44): 13439-13446.

Wang (2017). A Bibliometric Model for Identifying Emerging Research Topics. ArXiv, abs/1707.03599.

Wright, C.B., & Vanderford, N.L. (2017). What faculty hiring committees want. Nature Biotechnology, 35(9): 885-887.

Xianliang, & Hongying (2012). A bibliometric analysis on China sport science (2001–2010) based on CSSCI literature. Physics Procedia, 33: 2045-2054. 2012 International Conference on Medical Physics and Biomedical Engineering (ICMPBE2012).

Yuzhuo (2019). Examining similarities and differences of citation patterns between monographs and papers: A case in biology and computer science. Information Discovery and Delivery, 47(4): 229-241.

Zou, Luo, Zhang, Xia, Tan, & Huang (2019). Bibliometric analysis of oncolytic virus research, 2000 to 2018. Medicine, 98: e16817.
