Explainable paper classification system using topic modeling and SHAP

Abstract

The exponential growth of academic papers necessitates sophisticated classification systems to effectively manage and navigate vast information repositories. Despite the proliferation of such systems, traditional approaches often rely on embeddings that do not allow for easy interpretation of classification decisions, creating a gap in transparency and understanding. To address these challenges, we propose an innovative explainable paper classification system that combines Latent Semantic Analysis (LSA) for topic modeling with explainable artificial intelligence (XAI) techniques. Our objective is to identify which topics significantly influence the classification outcomes, incorporating Shapley additive explanations (SHAP) as a key XAI technique. Our system extracts topic assignments and word assignments from paper abstracts using LSA topic modeling. Topic assignments are then employed as embeddings in a multilayer perceptron (MLP) classification model, with the word assignments further utilized alongside SHAP for interpreting the classification results at the corpus, document, and word levels, enhancing interpretability and providing a clear rationale for each classification decision. We applied our model to a dataset from the Web of Science, specifically focusing on the field of nanomaterials. Our model demonstrates superior classification performance compared to several baseline models. Ultimately, our proposed model offers a significant advancement in both the performance and explainability of the system, validated by case studies that illustrate its effectiveness in real-world applications.

Keywords

Paper classification, topic modeling, latent semantic analysis, explainable artificial intelligence, Shapley value

1. Introduction

In the swiftly evolving landscape of academic research, the proliferation of academic papers has been driven by technological advancements, global collaboration, and the increasing recognition of publications as a metric of academic and professional success. This expansion is further accelerated by funding prerequisites, specialized fields of study, and the burgeoning realm of interdisciplinary research. These dynamics collectively forge an environment where the volume of academic papers is not just growing, but doing so exponentially.

Given the escalating number of publications, there is a pressing need for an efficient classification system to manage and organize this vast trove of information. A robust classification system is crucial for facilitating the standardization of categories, ensuring consistent access across various databases and institutions, and supporting effective communication within the academic community. Additionally, such a system plays a vital role in enhancing information management strategies in academic research, bolstering knowledge management practices, and refining paper recommendation systems.

Topic modeling is an essential unsupervised learning technique used to identify latent thematic structures within extensive collections of unstructured documents [1,2,3,4,5]. This method relies on analyzing the distribution and co-occurrence of words across documents, thereby revealing underlying themes and enhancing the interpretability of large datasets by clustering related documents together [6]. Such algorithms not only discern the fundamental topics embedded within texts but also provide human-readable labels, which facilitate further analysis and focused exploration of specific themes within a corpus [7]. Consequently, topic modeling significantly augments the interpretability of document classification by organizing content into coherent themes and deepening the understanding of the underlying textual data, thus proving invaluable for paper classifiers [8].

In the specific application of topic modeling, this study seeks to advance our comprehension of document themes through the incorporation of latent semantic analysis (LSA) [9]. LSA distills documents into a lower-dimensional space, capturing essential semantic relationships and thereby refining the differentiation between topics [10]. This process generates keywords that epitomize each topic, enabling users to quickly identify papers corresponding to specific thematic areas. Employing LSA in classifying papers leverages traditional embedding techniques and is poised to enhance the accuracy and comprehensiveness of topic detection in academic papers. The expected contribution of this approach is to provide a more nuanced understanding of the thematic structures within the academic papers, thereby supporting more effective classification and retrieval practices in scholarly databases.

The complexity inherent in deep learning models, particularly when utilized for classification tasks, presents significant interpretive challenges [11,12]. This complexity often obscures the decision-making process, making it difficult to understand how outcomes are derived. Such opacity can be a major hurdle in settings where transparency and accountability are critical. Explainable artificial intelligence (XAI) technologies offer a promising solution to this dilemma by elucidating the mechanisms behind model predictions, thereby enhancing the interpretability of results [13,14].

We utilize Shapley additive explanations (SHAP) [15] to interpret the classification outcomes of academic papers. SHAP, a widely recognized XAI technique, provides a detailed decomposition of the predictive contribution of each feature, allowing for an in-depth understanding of model behavior [16,17]. This technique is particularly valuable when integrated with topic modeling, as it enables a granular analysis of how specific topics and their associated keywords influence the classification results. By applying SHAP, we not only elucidate the direct contributions of individual topics but also provide a basis for validating the model’s reliability and fairness.

The necessity of integrating SHAP with topic modeling arises from the goal of creating a transparent and accountable classification system. The interpretability facilitated by SHAP ensures that users can comprehensively understand and trust the outcomes of the classification process. This integration is expected to enhance the trustworthiness of the classification system, which is crucial for its adoption in academic and research settings where decision rationales need to be clearly articulated.

Furthermore, the insights gained from this approach are anticipated to drive improvements in the model’s design and implementation [18,19]. By understanding which topics and words are most influential in the classification decisions, researchers and developers can fine-tune the model to better align with academic and research priorities. This level of analysis not only supports more accurate classifications but also aids in the discovery of emerging trends and gaps within academic papers, thereby guiding future research directions.

In this paper, we present a novel explainable paper classification system that integrates topic modeling and SHAP to categorize academic papers effectively. Employing LSA enhances the interpretability of the classification process by enabling a nuanced understanding of the thematic structures within documents. Further, by leveraging SHAP values, we provide a detailed elucidation of how individual topics and critical words influence classification outcomes, thus ensuring a comprehensive interpretative framework. This integration allows for interpretation at three distinct levels: the corpus-level, document-level, and word-level, offering a granular insight into the classification dynamics. Such multi-level analysis is pivotal for understanding the varying influences that specific topics and terms have on the classification results, providing a clear roadmap for adjustments and improvements in the classification process. To validate the robustness and explainability of our system, we apply it to the Web of Science (WoS) dataset in the field of nanomaterials. This application demonstrates the system’s superior capability in classifying academic papers, thereby affirming its utility in real-world academic settings.

The remainder of the paper is structured as follows. We present the system flow of the proposed system and provide a detailed explanation of the system in Section 2. Section 3 provides a summary of the literature pertaining to paper classification systems. Section 4 delves into the experimental procedure, offering detailed information about the datasets, preprocessing methods, and a comparative study to evaluate the performance of the proposed system. In Section 5, we perform real data analysis focusing on the interpretability of the proposed system. Finally, Section 6 concludes this work and describes future work.

2. Proposed method

2.1. Overview

Figure 1.

Flow chart of the proposed classification system, illustrating the sequence from document processing to the final classification and explanation stages. The diagram outlines the transformation of documents into a TF-IDF matrix, followed by LSA topic modeling to determine topic assignments. These assignments feed into an MLP classifier, with SHAP providing both global and local explanations of the classification outcomes at the topic and word levels.

Let $i = 1, 2, \dots, M$ be the labeled documents (academic paper abstracts), where M is the total number of documents. Our vocabulary comprises N words, selected based on word frequency. As a preliminary step, each document undergoes tokenization, converting the text into a sequence of words. These documents are then used to construct a word dictionary, which is subsequently transformed into a term frequency-inverse document frequency (TF-IDF) matrix. LSA topic modeling is applied to the TF-IDF matrix to ascertain topic assignments for each document [20] and word assignments for each topic. The topic assignments serve as the input embeddings for a multilayer perceptron (MLP) classifier, which categorizes each document into an appropriate category. Further, SHAP and word assignments are employed to identify the topics that effectively characterize each category and to clarify the influence of specific words on the classification within these categories. Both global and local explanations provided by SHAP enhance our understanding of the broad and nuanced impacts of topics and words on classification outcomes. This integrated approach, combining LSA topic modeling, MLP classifier, and SHAP, provides a systematic method to achieve more transparent and interpretable results in the task of paper classification, thereby addressing the growing need for explainable machine learning methodologies in academic settings.

Figure 1 illustrates the flow chart of the proposed system in this paper.

2.2. Latent semantic analysis

Latent semantic analysis (LSA) was initially introduced by Deerwester et al. [9] as a method for analyzing the relationships between a set of documents and the terms they contain. It operates under the foundational principle that words that share similar meanings often occur in comparable contexts [21]. LSA seeks to exploit this premise by utilizing mutual document constraints to deduce underlying topics. This conceptual framework posits that semantic structures can be discerned through patterns of word usage across texts, providing a robust mechanism for the induction and representation of knowledge [22].

Let $C_{i, j}$ be the term frequency or the raw count of a word j in a document i, and let $M_{j}$ be the document frequency or the number of documents containing a word j. The TF-IDF matrix combines term frequency and document frequency to express the importance of words in a matrix form, defined as $A = [A_{i j}]_{i = 1, \dots, M, j = 1, \dots, N}$ , where M is the total number of documents and N is the number of unique words across the document set. Each element $A_{i j}$ is calculated as $C_{i, j} \cdot \log (M / M_{j})$ [23]. This matrix is valuable in identifying word significance across documents but suffers from issues such as high dimensionality and noise [9]. To address these limitations, singular value decomposition (SVD) is applied to the TF-IDF matrix, enabling the reduction of its dimensionality and facilitating the extraction of topic assignments [24]. This process, referred to as LSA, employs truncated SVD to decompose the high-dimensional matrix $A \in R^{M \times N}$ into a product of three smaller matrices as shown in the following equation [25]:

$$A \approx U Σ V^{'}. \tag{1}$$

Let K be the number of latent topics. The matrix $U \in R^{M \times K}$ has orthonormal columns, the eigenvectors of $A A^{'}$ associated with its K largest eigenvalues, and represents the documents in the latent topic space. The matrix $V^{'} \in R^{K \times N}$ has orthonormal rows, the corresponding eigenvectors of $A^{'} A$, and encapsulates the contribution of each word to the topics. Each row $V_{k}^{'} \in R^{N}$ corresponds to the contribution of all words in the vocabulary to topic k. The diagonal matrix $Σ \in R^{K \times K}$ contains the K largest singular values, indicating the relative importance of each topic.

In this context, the i-th row of U, $U_{i} \in R^{K}$ , serves as the embedding vector for document i with each element $U_{i k}$ representing the contribution of topic k to that document. High absolute values in $U_{i}$ suggest a strong association between document i and the respective topics. Similarly, high absolute values in $V^{'}$ indicate a significant influence of corresponding words on the topics. These embeddings, which are interpretable in terms of topics, are crucial for the interpretability of the classification, a feature that will be further explored in subsequent sections.
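The construction in this subsection can be sketched on a toy corpus. The following is a minimal illustration, assuming NumPy is available; the count matrix `C`, the choice `K = 2`, and all variable names are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

# Toy term-count matrix C (M = 4 documents, N = 5 words); values are made up.
C = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

M, N = C.shape
doc_freq = (C > 0).sum(axis=0)   # M_j: number of documents containing word j
A = C * np.log(M / doc_freq)     # A_ij = C_ij * log(M / M_j)

K = 2                            # number of latent topics (illustrative only)
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)
U = U_full[:, :K]                # topic assignments per document, shape (M, K)
Sigma = np.diag(s_full[:K])      # K largest singular values, shape (K, K)
Vt = Vt_full[:K, :]              # word assignments per topic (V'), shape (K, N)

A_approx = U @ Sigma @ Vt        # rank-K approximation of A, as in Eq. (1)

# Indices of the three words with the largest absolute weight in each topic.
top_words = np.argsort(-np.abs(Vt), axis=1)[:, :3]
```

Row `U[i]` plays the role of the embedding $U_{i}$, and row `Vt[k]` plays the role of $V_{k}^{'}$. For a corpus of the size used in this paper, a sparse truncated solver would be needed instead of a dense full SVD.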

2.3. Multilayer perceptron

The multilayer perceptron (MLP) is a neural network architecture consisting of multiple perceptron layers organized in a hierarchical structure [26]. This architecture typically includes an input layer, several hidden layers, and an output layer, with each layer containing multiple neurons [27]. Neurons within these layers are interconnected through weights and biases, which are parameters adjusted during the learning process. Non-linearity in the network is introduced via activation functions, enhancing the model’s capability to capture complex patterns in the data [28].

The MLP is particularly adept at learning intricate and abstract representations through its multilayer structure, which allows the model to handle non-linear relationships and recognize diverse features across data sets [29]. We configure our MLP classifier with four hidden layers containing 128, 64, 32, and 16 neurons, respectively. Each hidden layer utilizes the rectified linear unit (ReLU) activation function, $g (x) = max (0, x)$ , to introduce non-linearity [30]. The Adam optimizer is employed to update weights efficiently throughout the training process [31].

The training of our MLP classifier involves up to 1,000 iterations with a batch size of 32, and the initial learning rate is set at 0.001. To mitigate overfitting and improve convergence, early stopping is implemented, halting training if no improvement in validation performance is observed over 10 consecutive iterations [32]. The operational framework of the classifier for a document’s topic embedding vector $U_{i}$ is expressed by the following equation:

$$f (U_{i}) = h (W_{5} g (W_{4} g (W_{3} g (W_{2} g (W_{1} U_{i} + B_{1}) + B_{2}) + B_{3}) + B_{4}) + B_{5}), \tag{2}$$

where $h (x) = 1 / (1 + e^{- x})$ is the sigmoid activation function, used here to model the output as a probability [33]. The variables $W_{l}$ and $B_{l}$ (for $l = 1, 2, 3, 4, 5$) denote the weights and biases associated with the four hidden layers ($l = 1, 2, 3, 4$) and the output layer ($l = 5$), respectively. The function $f (U_{i})$ thus predicts the probability that document i belongs to a particular category, based on its topic embedding. This detailed configuration and the mathematical representation of the MLP classifier underscore its robustness in classifying documents into thematic categories based on their latent semantic features.
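The forward pass of Eq. (2) can be sketched in plain Python. This is a minimal illustration with randomly initialized, untrained weights; the helper names (`dense`, `init_layer`, `forward`) and the initialization scheme are assumptions made for the example, and training with Adam and early stopping is not shown.

```python
import math
import random

def relu(v):
    # g(x) = max(0, x), applied element-wise
    return [max(0.0, x) for x in v]

def sigmoid(x):
    # h(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def dense(W, b, v):
    # Affine map W v + b for a weight matrix given as a list of rows
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def init_layer(n_out, n_in, rng):
    # Hypothetical initialization: small uniform weights, zero biases
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def forward(layers, u):
    # Four ReLU hidden layers followed by a single sigmoid output unit
    h = u
    for W, b in layers[:-1]:
        h = relu(dense(W, b, h))
    W_out, b_out = layers[-1]
    return sigmoid(dense(W_out, b_out, h)[0])

rng = random.Random(0)
K = 200                                   # topic-embedding dimension
sizes = [K, 128, 64, 32, 16, 1]           # input, four hidden layers, output
layers = [init_layer(n_out, n_in, rng) for n_in, n_out in zip(sizes[:-1], sizes[1:])]

u_i = [rng.gauss(0.0, 0.05) for _ in range(K)]   # stand-in for a row of U
p = forward(layers, u_i)                          # value of f(U_i) in (0, 1)
```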

2.4. Shapley additive explanations

SHAP, suggested by Lundberg et al. [15], offers a robust mechanism for attributing the prediction influence of individual features, accounting for their interaction and contribution across all possible coalitions of features. This method not only supports local explanations but also facilitates global understanding of feature importance, making it exceptionally useful for identifying and analyzing the topics that significantly impact paper classification [34,35].

In implementing SHAP, we prioritize its ability to handle complex and interactive effects among features, which provides a more nuanced and comprehensive interpretation. SHAP’s theoretical foundation in Shapley values ensures fairness and consistency in feature contribution, which is crucial for our analysis of how specific topics influence the classification outcomes across the entire corpus [36]. Consequently, SHAP allows us to present both specific and aggregate insights into how certain topics and words sway the classification process, thereby enhancing the transparency and accountability of our predictive modeling.

The concept of global feature importance is pivotal in understanding the overall influence of each input variable on the predictive outcomes. This measure is particularly essential when explaining complex models to stakeholders who require insights into which features most significantly drive model predictions. Global feature importance can be calculated by averaging the contributions derived from each local explanation, as highlighted by the concept of Shapley values [37]. These contributions are consolidated into a simplified predictive model that approximates the behavior of the original complex model [15]:

$$f (z) \approx f^{'} (z^{'}) = ϕ_{0} + \sum_{k = 1}^{K} ϕ_{k} z_{k}^{'}, \tag{3}$$

where f denotes the MLP classifier previously defined in Eq. (2), and $f^{'}$ represents a simplified explanatory model for f. The vector $z^{'} \in \{0, 1\}^{K}$ indicates the inclusion (1) or exclusion (0) of each topic in the estimation. The term $ϕ_{k} \in R$ represents the weight or importance of topic k within the local context, while $ϕ_{0}$ is the baseline value of the model when no topics are considered.

The local importance, or the Shapley value $ϕ_{k}$, of each topic k quantifies its contribution to the model’s prediction. Each Shapley value $ϕ_{k}$ reflects the incremental impact of topic k when it is included among the model’s inputs. The computation of the Shapley value for a topic k is formalized as follows [15]:

$$ϕ_{k} = \sum_{S \subseteq F \setminus \{k\}} \frac{| S |! (K - | S | - 1)!}{K!} [f (S \cup \{k\}) - f (S)], \tag{4}$$

where $F = \{1, 2, \dots, K\}$ denotes the full set of topics, $S \subseteq F \setminus \{k\}$ is a subset excluding topic k, and $| S |$ represents the number of topics in the subset S. The difference $[f (S \cup \{k\}) - f (S)]$ measures the impact of including topic k in the prediction. The sum of the topic importances $ϕ_{k}$, added to the baseline $ϕ_{0}$ as in Eq. (3), yields an approximation of the prediction value for the original model f. Equation (4) shows that the Shapley value $ϕ_{k}$ for topic k is computed as the weighted average of the changes in prediction values across all possible subsets of features, with weights given by the combinatorial factor $| S |! (K - | S | - 1)! / K!$. This comprehensive approach enables a matrix representation of all calculated Shapley values:

$$Φ = [ϕ_{i k}] \in R^{M \times K}, \quad i = 1, 2, \dots, M, \quad k = 1, 2, \dots, K, \tag{5}$$

where $ϕ_{i k}$ denotes the Shapley value for topic k associated with document i. Such quantitative assessments of topic influence are essential for understanding the model’s decision-making process [38]. A negative Shapley value indicates a negative influence on the classification outcome, whereas a positive value suggests a beneficial impact. Values close to zero imply minimal or negligible impact on the predictions [36,39].
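For a small number of features, the Shapley value formula in Eq. (4) can be evaluated exactly by enumerating every subset. The sketch below does this for three hypothetical topics. The value function `value_fn`, which replaces excluded topics with a zero baseline, and the toy model with one interaction term are assumptions made for illustration; practical SHAP implementations approximate this sum, since its cost grows as $2^{K}$.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, K):
    """Exact Shapley values for K features by direct evaluation of Eq. (4)."""
    features = list(range(K))
    phi = [0.0] * K
    for k in features:
        others = [j for j in features if j != k]
        for size in range(K):
            # Weight |S|! (K - |S| - 1)! / K! for subsets of this size
            weight = factorial(size) * factorial(K - size - 1) / factorial(K)
            for subset in combinations(others, size):
                S = frozenset(subset)
                phi[k] += weight * (value_fn(S | {k}) - value_fn(S))
    return phi

# Hypothetical toy model: linear in three topics plus one interaction term.
u = [0.8, -0.3, 0.5]          # topic embedding of one document (made up)
baseline = [0.0, 0.0, 0.0]    # value used for topics excluded from S (assumption)
weights = [1.5, 2.0, -1.0]

def value_fn(S):
    # Topics outside S are replaced by their baseline value before prediction.
    masked = [u[j] if j in S else baseline[j] for j in range(len(u))]
    return sum(w * x for w, x in zip(weights, masked)) + 0.4 * masked[0] * masked[2]

phi = shapley_values(value_fn, K=3)
```

In this toy case the values sum to `value_fn` on the full topic set minus `value_fn` on the empty set, which is the additivity expressed by Eq. (3), and the interaction between topics 0 and 2 is split equally between them.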

In the classification task, the influence of topics that negatively affect the model is considered less crucial compared to those with a positive influence [36,37,40]. This perspective stems from the observation that a specific class typically correlates with a relatively small subset of topics from the total available K topics. Topics that exhibit large Shapley values are interpreted as exerting substantial influence on model predictions.

To quantitatively assess the impact of each topic, the mean absolute Shapley value is used as a metric for global interpretability [41,42,43]. In this paper, we introduce the Global Influence Factor (GIF) $ψ_{k}$ for topic k, defined as the average of the absolute Shapley values across each column:

$$ψ_{k} = \frac{1}{M} \sum_{i = 1}^{M} | ϕ_{i k} | . \tag{6}$$

A high GIF value indicates that the corresponding topic significantly influences the classification process. This metric allows us to systematically identify the most impactful topics at the corpus level, enhancing the explainability of the classification results.
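As a small worked example, the GIF of Eq. (6) is the column-wise mean of absolute Shapley values. The matrix `Phi` below holds made-up values for three documents and three topics; the function name is hypothetical.

```python
def global_influence_factor(Phi):
    """Mean absolute Shapley value per topic (per column of Phi), as in Eq. (6)."""
    M = len(Phi)
    K = len(Phi[0])
    return [sum(abs(Phi[i][k]) for i in range(M)) / M for k in range(K)]

# Made-up Shapley matrix: rows are documents, columns are topics.
Phi = [
    [0.30, -0.05, 0.00],
    [-0.10, 0.02, 0.40],
    [0.20, -0.01, -0.10],
]

psi = global_influence_factor(Phi)
# Topics ordered from most to least influential at the corpus level.
ranking = sorted(range(len(psi)), key=lambda k: psi[k], reverse=True)
```

Because absolute values are averaged, a topic that pushes some documents strongly toward a class and others strongly away from it still receives a high GIF.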

In summary, the proposed system offers a three-fold advantage in enhancing the interpretability of classification outcomes:

− (Corpus-level) The system identifies which topics significantly impact classification outcomes through the GIF values calculated as per Eq. (6).

− (Document-level) The system elucidates the reasons behind the classification of a document into a particular category, based on the Shapley values defined in Eq. (4).

− (Word-level) The system provides insights into which words contribute to the classification of a document by examining the significant words within influential topics, utilizing the word assignments in the matrix $V^{'}$ .

These enhancements collectively foster a comprehensive understanding of the classification mechanism, from the broad perspective of topic influence down to the specific words that drive document classification.
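The document-level and word-level steps can be combined as sketched below: pick the topics with the largest positive Shapley values for one document, then list the words with the largest absolute weights in the corresponding rows of $V^{'}$. The vocabulary, the matrix `V_t`, the Shapley row `phi_i`, and both function names are hypothetical values chosen for illustration.

```python
def top_topics_for_document(phi_i, n=2):
    """Indices of up to n topics with the largest positive Shapley values."""
    order = sorted(range(len(phi_i)), key=lambda k: phi_i[k], reverse=True)
    return [k for k in order[:n] if phi_i[k] > 0]

def top_words_for_topic(V_t_row, vocab, n=3):
    """Words with the largest absolute weight in one row of V'."""
    order = sorted(range(len(vocab)), key=lambda j: abs(V_t_row[j]), reverse=True)
    return [vocab[j] for j in order[:n]]

# Made-up vocabulary and word assignments for two topics.
vocab = ["graphene", "oxide", "nanotube", "quantum", "dot"]
V_t = [
    [0.70, 0.60, 0.05, 0.02, 0.01],   # topic 0
    [0.03, 0.02, 0.10, 0.72, 0.68],   # topic 1
]
phi_i = [0.35, -0.04]                 # Shapley values of one document (made up)

explanation = {
    k: top_words_for_topic(V_t[k], vocab, n=2)
    for k in top_topics_for_document(phi_i)
}
```

Here only topic 0 has a positive Shapley value, so the explanation for this document consists of that topic and its two heaviest words.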

3. Related work

The classification of academic papers is a critical task for organizing scholarly materials, enabling efficient access to relevant research within various fields. Over time, numerous techniques have emerged to categorize papers based on content, keywords, and structure.

Initial paper classification strategies employed traditional embedding techniques such as Word2Vec [44], Doc2Vec [45], FastText [46], and TF-IDF [47]. These were often paired with classical classification models like support vector machines (SVM) [48] and k-nearest neighbor (KNN) algorithms [49]. While these methods laid the groundwork, they generally lacked in performance and did not provide explanations for their classification decisions, limiting their utility for deeper research analysis.

Further enhancements in topic modeling have incorporated embedding techniques to refine thematic clustering. Nguyen et al. [50] proposed a sophisticated hybrid model that combines the probabilistic likelihoods from LDA with a log-linear model employing pre-trained word embeddings to enhance topic specificity. Similarly, Bunk and Krestel [51] introduced an innovative approach termed WeLDA, which involves randomly substituting words associated with a particular topic with their corresponding embeddings, sampled from a Gaussian distribution, to enrich the semantic texture of the topics. Xu et al. [52] explored a geometric method by utilizing Wasserstein distances to concurrently learn topics and word embeddings, providing a more mathematically grounded approach to topic discovery. Additionally, Keya et al. [53] created the neural embedding allocation (NEA), which parallels the generative process of the embedded topic model (ETM) but optimizes it using a pre-fitted LDA model, thereby enhancing topic accuracy and relevance [54].

The advent of deep learning models, including MLP [55], long short-term memory (LSTM) [56], and convolutional neural network (CNN), significantly enhanced performance [57]. Hybrid models like Bi-LSTM-CNN [58] and C-LSTM [59] combined LSTM and CNN capabilities to better capture complex patterns in text. However, these models continued to struggle with explaining results when using traditional embeddings.

A significant drawback of deep learning models, despite their enhanced performance, is their lack of interpretability. The inability to explain classification results remains a substantial hurdle, as understanding the reasoning behind decisions is crucial for validation and trust in automated systems. The concept of explainable artificial intelligence (XAI) has emerged to address this challenge, with numerous studies utilizing SHapley Additive exPlanations (SHAP) to provide insights into the decisions made by complex models. Vilone and Longo [60] proposed a system that hierarchically classifies scientific studies using XAI techniques. Kim et al. [61] proposed the Explaining and Visualizing Convolutional Neural Networks for Text Information (EVCT) framework, which uses XAI techniques to minimize information loss while enhancing the explanations for predictions made by the algorithm. Ayoub et al. [62] used explainable natural language processing models to counter the COVID-19 infodemic, employing SHAP to explain the outputs of the DistilBERT model.

Integrating XAI technologies with embedding-enhanced topic modeling marks a significant advancement in paper classification. Such technologies elucidate the influence of specific topics on classification outcomes, improving decision-making transparency and fostering trust in automated systems. These methodologies not only elevate classification accuracy but also provide vital insights into the rationale behind model decisions [63].

4. Dataset and experiments

4.1. Dataset

We utilize a dataset comprising 456,472 academic paper abstracts related to nanomaterials, sourced from the Web of Science (WoS) and spanning publications from 2012 to 2017. We focus on five specific fields within nanomaterial research: Carbon Nanotube, Quantum Dot, Graphene, Nanosilica, and Nanosilicon. By analyzing abstracts, we aim to capture a broad spectrum of topics covered in each paper, which allows for a more detailed understanding of the research trends and themes within the field of nanomaterials. The true labels of the dataset were generated by nanotechnology experts from the Nanotechnology Policy Center at the Korea Institute of Materials Science in the project “Study of Nanotechnology Policy and Information Analysis” (2017M3A7A7057113) of the National Research Foundation of Korea. It is worth noting that papers may be categorized under multiple nanomaterial fields, indicating the interdisciplinary nature of many studies. Table 1 provides a breakdown of the number of papers associated with each of the five nanomaterial fields.

Table 1
Distribution of academic papers across five nanomaterial fields from the WoS database.

Field	Carbon Nanotube	Quantum Dot	Graphene	Nanosilica	Nanosilicon	Total
Counts	116,763	50,884	173,707	47,608	45,365	456,472

4.2. Experimental process

Figure 2.

The experimental process for real data analysis, illustrating each step from data preprocessing to model training.

Figure 2 illustrates the experimental process of real data analysis. Before applying LSA topic modeling, the dataset underwent a series of preprocessing steps. The abstracts were converted to lowercase, and the tokenization of sentences into words was performed using the NLTK (version 3.8.1) library in Python. Lemmatization was applied to normalize different forms of words to their base forms, including converting plural to singular forms and standardizing verb tenses. Additionally, insignificant special characters, particles, articles, single-letter words, and words from the NLTK English stopword list were removed.
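A dependency-free sketch of the filtering steps is given below. It is only a stand-in: a regular expression replaces the NLTK tokenizer, the stopword set is a tiny illustrative subset rather than the NLTK English list, and lemmatization is omitted entirely. The function name `preprocess` and the sample sentence are hypothetical.

```python
import re

# Tiny illustrative stopword set; the paper uses the NLTK English stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "is", "are", "with"}

def preprocess(abstract):
    """Lowercase, tokenize, and drop special characters, stopwords, and 1-letter tokens."""
    text = abstract.lower()
    # Keep alphanumeric runs only, which discards punctuation and special characters.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Lemmatization (e.g., "films" -> "film") is intentionally not performed here.
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]

tokens = preprocess("The synthesis of Graphene-oxide films is a 2-step process!")
```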

Upon preprocessing completion, the dataset was divided into training, validation, and testing sets. Initially, the dataset was split into training and testing sets in a 9:1 ratio. The training set was further subdivided into training and validation subsets with an 8:2 split.
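The two-stage split can be sketched as follows, assuming a simple random shuffle of document indices; the paper does not state whether the split was stratified or how it was seeded, so `split_indices` and its `seed` parameter are illustrative.

```python
import random

def split_indices(n_docs, seed=0):
    """Split indices 9:1 into (train+val, test), then 8:2 into (train, val)."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    n_test = n_docs // 10                 # 10% held out for testing
    test, rest = idx[:n_test], idx[n_test:]
    n_val = len(rest) // 5                # 20% of the remainder for validation
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(1000)
```

With 1,000 documents this yields 720 training, 180 validation, and 100 test documents, that is, 72%, 18%, and 10% of the data.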

To implement LSA topic modeling on the training and test datasets, we first trained the LSA model using the training data corpus and constructed a vocabulary set limited to words appearing in at least 30 documents. This threshold keeps computation efficient while causing minimal discrepancies across the performance metrics, and yields a lexicon containing N = 23,055 words. Words not present in the established vocabulary were excluded from both the training and test datasets. Subsequently, the data were transformed into TF-IDF matrices, where the TF-IDF matrix for the test data was generated based on the word importance derived from the training data.
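The vocabulary filter and the reuse of training-derived word importance can be sketched as below. The toy documents, the threshold `min_df=2` (the paper uses 30), and the helper names are assumptions for illustration; the IDF term follows the $\log (M / M_{j})$ form given in Section 2.2.

```python
import math
from collections import Counter

def build_vocab(train_docs, min_df):
    """Keep words that occur in at least min_df training documents."""
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))               # count each word once per document
    vocab = sorted(w for w, c in df.items() if c >= min_df)
    return vocab, df

def idf_from_training(vocab, df, n_train_docs):
    # IDF is computed on the training corpus only and reused for test documents.
    return {w: math.log(n_train_docs / df[w]) for w in vocab}

def tfidf_vector(doc, vocab, idf):
    # Out-of-vocabulary words are dropped before counting.
    counts = Counter(w for w in doc if w in idf)
    return [counts[w] * idf[w] for w in vocab]

train_docs = [
    ["graphene", "oxide", "film"],
    ["graphene", "sheet"],
    ["quantum", "dot", "film"],
    ["quantum", "dot", "dot"],
]
test_doc = ["graphene", "dot", "dot", "perovskite"]

vocab, df = build_vocab(train_docs, min_df=2)
idf = idf_from_training(vocab, df, len(train_docs))
x_test = tfidf_vector(test_doc, vocab, idf)
```

The word "perovskite" never appears in the training documents, so it contributes nothing to `x_test`, mirroring the exclusion of out-of-vocabulary words described above.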

The optimal number of topics, K, was established at 200 after conducting iterative adjustments ranging from 50 to 300 topics during the training phase. This number was selected based on its contribution to maximizing classification accuracy.

To address the class imbalance, undersampling techniques were employed, involving the random removal of indices of papers not pertaining to each specific nanomaterial field [64,65]. This approach was aimed at equalizing the number of papers across nanomaterial fields to improve model performance.
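One way to realize this step for a single field is sketched below: keep every paper that belongs to the field and randomly sample an equal number of papers that do not. The 1:1 ratio is an assumption, since the text does not state the target ratio explicitly, and the function name and labels are hypothetical.

```python
import random

def undersample_negatives(labels, seed=0):
    """Return indices of all positives plus an equally sized random sample of negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    kept_neg = rng.sample(neg, min(len(pos), len(neg)))
    return sorted(pos + kept_neg)

# 1 = paper belongs to the field, 0 = paper does not (made-up labels).
labels = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
kept = undersample_negatives(labels)
```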

Table 2

Distribution of academic papers across five nanomaterial fields in the training and test datasets after applying undersampling.

Split	Training	Training	Training	Training	Training	Test
Field	Carbon Nanotube	Quantum Dot	Graphene	Nanosilica	Nanosilicon	
Counts	233,526	101,768	347,414	95,216	90,730	45,647

The training dataset used for classification is detailed in Table 2, which presents the distribution of papers across the five nanomaterial fields after undersampling. Due to the varying amounts of training data per field, individual models were trained and evaluated separately for each field. Binary classification models were developed for each field, and the average of the individual evaluation results was calculated to derive comprehensive performance metrics across all five nanomaterial fields [66].
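The averaging of per-field binary results can be sketched as follows. The labels and predictions are made up, only accuracy and F1-score are shown, and an unweighted mean across fields is assumed.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up (true labels, predicted labels) for two of the binary field classifiers.
results = {
    "Graphene": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "Quantum Dot": ([1, 0, 0, 1], [1, 0, 1, 1]),
}

per_field = {name: (accuracy(t, p), f1_score(t, p)) for name, (t, p) in results.items()}

# Unweighted mean over fields (assumed form of the averaging described above).
mean_accuracy = sum(a for a, _ in per_field.values()) / len(per_field)
mean_f1 = sum(f for _, f in per_field.values()) / len(per_field)
```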

Shapley values for topics within each nanomaterial field were computed based on the test data. By taking the absolute values of these Shapley values for all papers and averaging them, GIF values for each topic were determined. This method allowed for an assessment of the average importance of each topic across all papers within each field.

4.3. Comparative study

We conducted a comparative analysis between our proposed system, which utilizes LSA and MLP, and four alternative embedding methods coupled with three different classification models. This comparison was designed to demonstrate the enhanced performance of our system in the task of paper classification. The effectiveness of each combination of embedding techniques and classification models was assessed using several model performance metrics, including accuracy, F1-score, and the area under the receiver operating characteristic curve (AUC).

4.3.1. Embedding methods

We utilized a range of embedding techniques to analyze the effectiveness of different textual representations. The selected methods include Word2vec [67,68], Doc2vec [69], LDA [1], and BERTopic [70] as the embedding techniques.

Word2vec is a prominent word embedding technique that represents words as vectors based on their meanings and contextual usage [67,68]. This model assesses the context of a word by considering up to five words preceding and following it within a sentence, including only those words that appear with a frequency of 40 or more. Notably, the Skip-Gram model, which predicts the surrounding context words from the center word, was selected over the Continuous Bag of Words (CBOW) model, which predicts the center word from its surrounding context, due to its superior performance [67,71,72]. This choice enhances the model’s ability to capture contextual meanings. Each word is represented by a 200-dimensional vector. For document embedding, the average of word vectors within the document is used to form a document vector, providing a consolidated representation that encapsulates overall semantic content [73].
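The document-embedding step, averaging the vectors of the words in a document, can be sketched as below. The three-dimensional vectors are made up (the paper uses 200 dimensions), and returning a zero vector for a document with no in-vocabulary words is an assumption of this sketch.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of all tokens that have a learned embedding."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim        # assumption: fall back to a zero vector
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# Made-up 3-dimensional word vectors.
word_vectors = {
    "graphene": [1.0, 0.0, 2.0],
    "oxide": [3.0, 2.0, 0.0],
}

doc_vec = document_vector(["graphene", "oxide", "unseen"], word_vectors, dim=3)
```

The token "unseen" has no embedding, for example because it falls below the frequency threshold, so it is skipped and the result is the mean of the two known vectors.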

Doc2vec, an extension of Word2vec, uniquely assigns vectors to entire documents, effectively representing them as vectors [69]. Like Word2vec, it examines up to five words before and after the target word within a sentence, including only those that meet a minimum frequency threshold. The distributed memory version of the paragraph vector (PV-DM) model was utilized for our Doc2vec implementation [69,74,75]. This model captures the semantic meanings of words within their contexts, enhancing the overall document representation. Each document is thereby represented by a 200-dimensional vector, which ensures a rich, context-aware embedding that encapsulates the thematic essence of the text.

LDA is a topic modeling technique that probabilistically infers the topic structure within documents [1,76]. To train the LDA model, the same dictionary generated via LSA was utilized. The batch size for document processing was set to 2,000, and the model was configured to discern 200 topics. The parameters of the Dirichlet distribution for each topic distribution were set equally, resulting in $α = 0.005$. The resulting probability distributions for the topics provided the embedding values for each document.

BERTopic leverages clustering technology and class-based TF-IDF to model topics effectively, generating discernible topic representations and enhancing the granularity of topic detection [70]. Documents are embedded into a vector space using the Sentence-BERT (SBERT) framework, specifically employing the all-MiniLM-L6-v2 model [77]. This version of SBERT has been trained on a vast corpus of 1,170,060,424 training tuples, including sources like Reddit comments and Wikipedia pages. It is known for its state-of-the-art performance on various sentence embedding tasks, facilitating effective semantic comparisons [78]. To manage the high dimensionality of the data, uniform manifold approximation and projection (UMAP) is applied, reducing dimensions to a three-dimensional space that preserves the semantic similarity of documents [79]. This step involves setting the number of neighboring points to 20 and the minimum distance between points to 0.1, using the Euclidean distance metric for calculations. Following dimensionality reduction, hierarchical density-based spatial clustering of applications with noise (HDBSCAN) is employed to identify dense clusters and segregate noise [80]. Additionally, clustering incorporates Prim’s algorithm, which utilizes k-dimensional (KD) trees to enhance clustering efficiency [81]. The minimum cluster size is set to 200, with a minimum of 30 neighbors required to form a cluster. The Euclidean distance metric and an excess mass method are used to select and validate clusters [82]. Finally, class-based TF-IDF (C-TFIDF) is implemented to refine topic representations. Unlike traditional TF-IDF, C-TFIDF assesses word importance within clusters, facilitating the generation of specific topic-word distributions for each document cluster. To optimize topic coherence, the representation of the least prevalent topic is iteratively merged with the most similar one, reducing the number of topics to 200.
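To make the class-based TF-IDF step concrete, the sketch below follows the weighting commonly given for BERTopic, $W_{t,c} = tf_{t,c} \cdot \log(1 + A / f_t)$, where $A$ is the average number of words per cluster and $f_t$ is the frequency of term $t$ across all clusters. The function name `c_tf_idf`, the per-cluster normalization of term counts, and the toy count matrix are assumptions for illustration; the BERTopic library's own implementation may differ in detail.

```python
import numpy as np

def c_tf_idf(term_counts: np.ndarray) -> np.ndarray:
    """Class-based TF-IDF: W[c, t] = tf[c, t] * log(1 + A / f[t]).

    term_counts has shape (n_clusters, n_terms) and holds the raw count of
    each term in the concatenated documents of each cluster. Every term is
    assumed to occur at least once overall, so f[t] is never zero.
    """
    # Term frequency within each cluster (normalized per cluster: an assumption).
    tf = term_counts / term_counts.sum(axis=1, keepdims=True)
    # A: average number of words per cluster.
    avg_words = term_counts.sum() / term_counts.shape[0]
    # f[t]: frequency of each term across all clusters.
    term_totals = term_counts.sum(axis=0)
    idf = np.log(1.0 + avg_words / term_totals)
    return tf * idf

# Two toy clusters over three terms; the third term is rare everywhere.
counts = np.array([[8.0, 1.0, 1.0],
                   [1.0, 8.0, 1.0]])
weights = c_tf_idf(counts)
print(weights)
```

In this toy case each cluster's highest weight falls on the term concentrated in that cluster, which is the behavior that lets C-TFIDF produce a distinct topic-word distribution per cluster.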

4.3.2. Classification models

As baseline classification models, we utilized logistic regression (LR) [83], random forest (RF) [84], and extreme gradient boosting (XGBoost) [85].

LR is widely used for binary classification, leveraging the logistic function to predict probabilities [83,23]. This method compares the output probability against a fixed threshold, set at 0.5, to categorize observations into binary classes (0 or 1) [86]. In our implementation, the model incorporates an L2 regularization penalty to prevent overfitting. We optimize the model using the limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm, renowned for its effectiveness in large-scale applications. The optimization process is controlled with a maximum of 1,000 iterations and a convergence tolerance of 0.001.
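The configuration described above could be expressed as follows. The paper does not name the library used, so scikit-learn is an assumption here, as are the synthetic data and variable names; only the hyperparameter values (L2 penalty, LBFGS, 1,000 iterations, tolerance 0.001, threshold 0.5) come from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for document embeddings: two well-separated classes.
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 5)),
               rng.normal(2.0, 1.0, size=(50, 5))])
y = np.array([0] * 50 + [1] * 50)

# The L2 penalty is scikit-learn's default, so it is not passed explicitly.
clf = LogisticRegression(solver="lbfgs", max_iter=1000, tol=1e-3)
clf.fit(X, y)

# Compare the predicted probability of class 1 against the fixed 0.5 threshold.
proba = clf.predict_proba(X)[:, 1]
labels = (proba >= 0.5).astype(int)
print(labels[:5], labels[-5:])
```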

RF is an ensemble method that enhances predictive accuracy by aggregating outputs from multiple decision trees [84,83]. This method not only increases the robustness of predictions but also helps control overfitting through its ensemble approach. We configured our RF with 200 trees, employing the Gini index as the criterion for optimizing splits [87]. We capped the maximum depth of each tree at 20, allowing for complex pattern recognition while preventing overfitting. Each internal node in the trees requires at least 5 samples to split, ensuring sufficient data for reliable decision-making and maintaining an equilibrium between bias and variance.
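A sketch of this random forest configuration is shown below, again assuming scikit-learn (the paper does not state the library). The synthetic data, the variable names, and the `random_state` value are illustrative; the four hyperparameters mirror the values reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic features standing in for 10-dimensional document embeddings.
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees
    criterion="gini",      # Gini index for choosing splits
    max_depth=20,          # cap on the depth of each tree
    min_samples_split=5,   # minimum samples needed to split an internal node
    random_state=0,        # fixed seed for reproducibility (an assumption)
)
forest.fit(X, y)

# Depth of the deepest fitted tree; it cannot exceed max_depth.
deepest = max(tree.get_depth() for tree in forest.estimators_)
print(len(forest.estimators_), deepest)
```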

XGBoost builds upon traditional gradient boosting frameworks by iteratively correcting the errors of previously built trees [85,88]. We use 200 trees, each of which can grow to a maximum depth of 20 to capture complex interactions. The minimum child weight is set at 10 to control overfitting by making the algorithm more conservative. The learning rate is fixed at 0.3 to moderate the impact of each individual tree and prevent rapid convergence to suboptimal solutions. Additionally, the gamma parameter, which specifies the minimum loss reduction required to make further splits on a leaf node, is set at 0.1.

5. Results

5.1. Results of comparative study

Table 3 presents a comparison of classification performance outcomes for nanomaterial academic papers. For performance comparisons on datasets related to Science & Engineering and Medical academic papers, please refer to Appendix A. The F1 score, which is the harmonic mean of precision and recall, is particularly emphasized given its relevance in addressing class distribution imbalances [89]. This metric is instrumental in assessing classification performance on imbalanced datasets and elucidates distinct patterns of model efficacy.
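The reason F1 is emphasized on imbalanced data can be seen in a small worked example. The labels below are made up for illustration and the helper `f1_score_binary` is a hypothetical name; the formula itself is the standard harmonic mean of precision and recall.

```python
def f1_score_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Imbalanced toy data: 3 positives out of 10, and the classifier finds only one.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_score_binary(y_true, y_pred)
print(accuracy, f1)
```

Here accuracy is 0.8 because the majority class dominates, while F1 is 0.5 because two of the three positive papers are missed.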

Table 3
Comparative performance metrics of five embedding methods paired with four classification models. The table displays accuracy, F1 score, and AUC for each method-model combination, highlighting the superior performance of the proposed LSA-MLP configuration.

Embedding method	Classification model	Accuracy	F1	AUC
Word2Vec	LR	0.8667	0.6970	0.9300
	RF	0.8074	0.6130	0.8905
	XGBoost	0.8665	0.6990	0.9373
	MLP	0.8792	0.7215	0.9468
Doc2Vec	LR	0.8283	0.6247	0.8967
	RF	0.7983	0.5684	0.8606
	XGBoost	0.8339	0.6355	0.9062
	MLP	0.8471	0.6566	0.9184
LDA	LR	0.7684	0.5520	0.8466
	RF	0.7695	0.5312	0.8225
	XGBoost	0.7578	0.5349	0.8324
	MLP	0.7745	0.5626	0.8575
LSA (proposed)	LR	0.9077	0.7792	0.9490
	RF	0.8746	0.6968	0.9227
	XGBoost	0.8977	0.7568	0.9510
	MLP (proposed)	0.9080	0.7816	0.9574
BERTopic	LR	0.9079	0.7781	0.9481
	RF	0.8748	0.6975	0.9226
	XGBoost	0.8973	0.7558	0.9522
	MLP	0.8983	0.7576	0.9503

The results demonstrate that the choice of embedding methods significantly influences the overall performance of the classification models. Among the evaluated combinations, the integration of LSA with the MLP classifier exhibited superior performance, achieving an F1 score of 0.7816. This result suggests an enhanced capability to capture and utilize semantic similarities and complex relationships within the data [90,91].

Interestingly, despite the advanced capabilities of BERTopic, which leverages a pre-trained model on extensive datasets to capture a broad semantic scope, it did not outperform the LSA-MLP configuration. The BERTopic and MLP combination achieved an F1 score of 0.7576, indicating that while it remains highly effective, it does not surpass the LSA-MLP setup in this specific evaluation. This finding is noteworthy, as it suggests that our proposed combination of LSA and MLP not only competes with but also slightly exceeds the performance of sophisticated pre-trained models like BERTopic in navigating the unique challenges presented by our dataset.

5.2. Results of topic modeling and SHAP

5.2.1. Evaluation for SHAP

Fidelity measures the degree to which the explanations approximate the original predictions of the model [92]. A lower fidelity error indicates that the SHAP values approximate the model’s predictions more accurately [93,94], and hence that the interpretations are more faithful to the model’s actual decision-making process.

Table 4
Fidelity scores for SHAP across five nanomaterial fields.

Field	Carbon Nanotube	Quantum Dot	Graphene	Nanosilica	Nanosilicon	Overall average
Fidelity	4.2298e-6	2.3664e-5	4.4665e-6	2.3064e-5	3.6221e-5	0.00073

Table 4 presents the fidelity scores across all nanomaterial fields, calculated as the mean squared error between the SHAP values’ sum and the model’s predicted probabilities. The fidelity values for all fields are exceptionally low, indicating that the SHAP explanations are highly consistent with the MLP classifier’s predictions in these fields. The overall average fidelity score, 0.00073, further corroborates the efficacy of SHAP in providing reliable explanations across all fields, thus validating the utility of SHAP in enhancing the transparency and reliability of the classification model.
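The fidelity calculation described above can be sketched as a mean squared error. One assumption is made explicit here: SHAP explanations are additive around a base (expected) value, so the sketch adds that base value to the sum of the Shapley values before comparing with the predicted probability. The function name `fidelity_mse` and the toy numbers are illustrative only.

```python
import numpy as np

def fidelity_mse(shap_values, base_value, predicted_proba):
    """Mean squared error between SHAP reconstructions and model outputs.

    shap_values: array of shape (n_papers, n_topics).
    base_value: the explainer's expected model output (assumed to be added
        to the row sums, following SHAP's additive form).
    predicted_proba: array of shape (n_papers,) with the model's probabilities.
    """
    reconstructed = base_value + shap_values.sum(axis=1)
    return float(np.mean((reconstructed - predicted_proba) ** 2))

# Toy values for two papers and two topics (made-up numbers).
toy_shap = np.array([[0.30, -0.10],
                     [0.05, 0.15]])
base = 0.2
proba = np.array([0.41, 0.40])
fidelity = fidelity_mse(toy_shap, base, proba)
print(fidelity)
```

In the toy case the first reconstruction is off by 0.01 and the second is exact, giving a fidelity error of 5e-5.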

5.2.2. Global explanations

Table 5
Summary of the top three topics with the highest GIF values for each nanomaterial field, along with the top five most influential words within each topic, based on their absolute assignments in the $V^{'}$ matrix.

Nanomaterial field	Topic No.	Topic name	Top 5 words
Carbon Nanotube	11	Carbon-based materials	tio, go, nanotube, qds, rgo
	10	Graphene-oxide materials	graphene, go, rgo, oxide, carbon
	21	Metallic materials	cu, fe, ag, cell, cnts
Quantum Dot	5	Quantum Dot-based membrane	composite, membrane, quantum, go, dot
	7	Photocatalytic catalysts	tio, catalyst, photocatalytic, qds, cd
	12	Materials for film-making	qds, ag, np, film, co
Graphene	10	Graphene-oxide materials	graphene, go, rgo, oxide, carbon
	2	Electrochemical devices	film, electrode, catalyst, capacity, electrochemical
	11	Carbon-based materials	tio, go, nanotube, qds, rgo
Nanosilica	12	Materials for film-making	qds, ag, np, film, co
	10	Graphene-oxide materials	graphene, go, rgo, oxide, carbon
	38	Materials for fiber composites	al, sio, cd, fiber, silica
Nanosilicon	6	Membranes for adsorption applications	film, adsorption, membrane, tio, thin
	26	Materials for electronic devices	si, cu, rgo, zno, go
	2	Electrochemical devices	film, electrode, catalyst, capacity, electrochemical

Figure 3.

Graphical representation of the GIF values for the top three influential topics within each nanomaterial field. Each bar’s length in the graph reflects the GIF value, illustrating the relative impact of these topics on the classification outcomes across different nanomaterial fields.

We focus on corpus-level explanations to identify which topics significantly influence the classification into five distinct nanomaterial fields based on the defined GIF values in Section 2.4. Table 5 lists the top three topics with the highest GIF values for each nanomaterial field. Additionally, it provides the names of these topics, which were manually derived from the analysis of the five most significant words associated with each topic. The significance of these words is assessed based on their absolute values in the $V^{'}$ matrix, which results from the LSA topic modeling process. Words with larger absolute values are considered more influential within their respective topics. Figure 3 graphically represents the GIF values of these top three influential topics for each field. The horizontal extent of each bar in the graph correlates with the GIF value of the corresponding topic, visually indicating their impact on the global classification of papers. A longer bar denotes a greater influence on the classification outcomes within each specific nanomaterial field.
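Selecting the most influential words of a topic by absolute assignment can be sketched as below. The orientation of the matrix (topics in rows, vocabulary in columns), the helper name `top_words`, and the toy values are assumptions for illustration and do not reproduce the study's actual $V^{'}$ matrix.

```python
import numpy as np

def top_words(v_matrix, vocabulary, topic, k=5):
    """Return the k words with the largest absolute weight for one topic.

    v_matrix is assumed to have shape (n_topics, n_words).
    """
    order = np.argsort(-np.abs(v_matrix[topic]))[:k]
    return [vocabulary[i] for i in order]

vocab = ["graphene", "film", "oxide", "cell"]
v_toy = np.array([[0.7, -0.1, 0.5, 0.05],
                  [0.1, -0.8, 0.0, 0.3]])

print(top_words(v_toy, vocab, topic=0, k=2))
print(top_words(v_toy, vocab, topic=1, k=2))
```

Ranking by absolute value matters because LSA assignments can be negative: in the second toy topic, "film" has the largest magnitude despite its negative sign.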

Carbon Nanotube

Carbon Nanotubes are cylindrical allotropes of carbon known for their unique nanostructure [95]. Renowned for their remarkable strength, high electrical conductivity, and low density, carbon nanotubes are highly valued in advanced applications such as batteries and composite materials [96,97,98]. The topics significantly influencing the classification of papers within the Carbon Nanotube field are Topics 11 (Carbon-based materials), 10 (Graphene-oxide materials), and 21 (Metallic materials).

Topic 11 predominantly covers carbon-based nanomaterials including nanotubes [95], quantum dots (qds) [99], reduced graphene oxide (rgo), and graphene oxide (go) [100]. Topic 10 is closely associated with graphene and its derivatives such as graphene oxide and reduced graphene oxide, reflecting the relevance of graphene-related research in this field [100]. Meanwhile, Topic 21 captures the classification of papers focusing on metals like copper (cu) [101], iron (fe) [102], and silver (ag) [103], known for their high electrical conductivity, alongside their interaction with carbon nanotubes [98].

Figure 3a presents the GIF values for these topics, where $ψ_{11} = 0.127$ and $ψ_{10} = 0.112$ are both notably higher than $ψ_{21} = 0.088$ for Topic 21. Topic 11 is thus approximately 1.4 times as influential as Topic 21 in classifying papers within the Carbon Nanotube category, and Topic 10 approximately 1.3 times as influential, highlighting the significant impact of both topics on the field.

Quantum Dot

Quantum dots are ultrafine semiconductor particles that are pivotal in developing display devices, offering vivid colors, longer lifespans, and greater cost-effectiveness compared to traditional LEDs and OLEDs [104]. The classification of Quantum Dot papers is chiefly influenced by Topics 5 (Quantum Dot-based membranes), 7 (Photocatalytic catalysts), and 12 (Materials for film-making).

Topic 5 pertains to the use of membranes comprising quantum dot composites, essential for various nanotechnology applications. Topic 7 includes research on photocatalytic substances such as titanium dioxide (tio2), which is often abbreviated as ‘tio’ in processed texts [105], qds [106], and cadmium (cd) [107] used in photocatalytic reactions. Topic 12 deals with materials employed in the manufacturing of displays, prominently featuring qds [104], silver (ag) [103], and cobalt (co) [108].

According to Fig. 3b, the GIF values for Topics 5, 7, and 12 are $ψ_{5} = 0.181$, $ψ_{7} = 0.123$, and $ψ_{12} = 0.096$, respectively. Notably, Topic 5, which is directly related to quantum dots, has an impact approximately 1.5 times that of Topic 7 and nearly twice that of Topic 12.

Graphene

Graphene is a two-dimensional nanomaterial comprised of a single layer of carbon atoms arranged in a hexagonal lattice, celebrated for its remarkable electrical and thermal conductivities [109]. Its versatility makes it a pivotal material in the development of advanced biocomposites for dental and medical applications [110]. The classification of Graphene-related papers is predominantly influenced by Topics 10 (Graphene-oxide materials), 2 (Electrochemical devices), and 11 (Carbon-based materials).

Topic 2 includes terms associated with electrochemical applications, underscoring its relevance to devices that capitalize on graphene’s exceptional conductive properties. Topics 10 and 11 effectively describe both Carbon Nanotubes and Graphene, which can be attributed to the properties the two materials share as carbon allotropes [95,109].

As illustrated in Fig. 3c, the GIF value for Topic 10 is $ψ_{10} = 0.265$, which is significantly higher than those for Topics 2 and 11, at $ψ_{2} = 0.094$ and $ψ_{11} = 0.080$, respectively. This disparity underscores the paramount importance of graphene-oxide materials in the classification within the Graphene field, demonstrating that Topic 10 is more than twice as influential as either of the other topics examined.

Nanosilica

Nanosilica refers to silica synthesized on the nanometer scale, which consists primarily of silicon and oxygen [111]. The classification of Nanosilica-related papers is significantly influenced by Topics 12 (Materials for film-making), 10 (Graphene-oxide materials), and 38 (Materials for fiber composites).

Topic 38, in particular, is characterized by its focus on fiber composite materials that often incorporate elements like aluminum (al) [112] and cadmium (cd) [113], demonstrating direct applications in Nanosilica technology. Notably, the term ‘sio’ within this topic represents silicon dioxide (SiO2), further emphasizing the connection to Nanosilica. This topic provides a comprehensive description of how Nanosilica is utilized within various composite materials, highlighting its widespread application.

Figure 3d presents the GIF values, with Topics 12, 10, and 38 showing values of $ψ_{12} = 0.099$, $ψ_{10} = 0.093$, and $ψ_{38} = 0.091$, respectively. The similar GIF values across these topics indicate that while Topic 38 plays a crucial role in classifying Nanosilica, it operates within a context where several topics collectively contribute to the field’s classification.

Nanosilicon

Nanosilicon refers to silicon at the nanoscale [114], with its properties and applications extensively explored in the realms of bio- and energy-related materials [115,116]. The classification of Nanosilicon-related papers is notably influenced by Topics 6 (Membranes for adsorption applications), 26 (Materials for electronic devices), and 2 (Electrochemical devices).

Topic 6 is particularly centered on research pertaining to advanced membrane materials that are pivotal in adsorption applications, often employing thin-film technologies. This topic’s focus reflects the innovative use of nanoscale materials in enhancing the functionality and efficiency of adsorption processes. Topic 26 delves into materials used in electronic device applications, encompassing a range of essential components such as silicon (si) [112], zinc oxide (zno) [117], cu [101], rgo, and go. The inclusion of diverse materials underscores the broad application spectrum of Nanosilicon in modern electronics.

Figure 3e illustrates the GIF values for these topics: $ψ_{6} = 0.111$, $ψ_{26} = 0.098$, and $ψ_{2} = 0.096$. The close proximity of these values highlights a balanced impact of these diverse topics on the classification within the Nanosilicon field, indicating that while each topic contributes significantly, they do so to a similar degree.

Various topics related to the characteristics and applications of nanomaterials have been identified for each nanomaterial and provide insights into diverse research trends in the nanomaterial field. Carbon Nanotubes, Quantum Dots, and Graphene, which are all nanostructures primarily based on carbon [95,104,109], exemplify the versatility and wide-ranging utility of carbon in nanotechnology. Conversely, Nanosilica and Nanosilicon, which are derived from silicon [111,114], showcase the diverse applications of silicon-based materials. The shared utilization of these elemental nanomaterials across similar application domains not only underscores their comparable properties but also highlights their integral role in advancing the nanotechnology field.
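The relative-influence comparisons made for each field in this subsection can be recomputed directly from the reported GIF values. The sketch below uses only the values quoted in the text; the dictionary layout and the helper name `ratios_to_top` are illustrative choices.

```python
# GIF values as reported for the top three topics of each field.
gif = {
    "Carbon Nanotube": {11: 0.127, 10: 0.112, 21: 0.088},
    "Quantum Dot": {5: 0.181, 7: 0.123, 12: 0.096},
    "Graphene": {10: 0.265, 2: 0.094, 11: 0.080},
    "Nanosilica": {12: 0.099, 10: 0.093, 38: 0.091},
    "Nanosilicon": {6: 0.111, 26: 0.098, 2: 0.096},
}

def ratios_to_top(values):
    """Ratio of the largest GIF value to each of the other topics' values."""
    top_topic = max(values, key=values.get)
    return {t: values[top_topic] / v for t, v in values.items() if t != top_topic}

ratios = {field: ratios_to_top(values) for field, values in gif.items()}
for field, field_ratios in ratios.items():
    formatted = ", ".join(f"Topic {t}: {r:.2f}x" for t, r in field_ratios.items())
    print(f"{field}: {formatted}")
```

Ratios close to 1, as for Nanosilica and Nanosilicon, correspond to the balanced influence noted above, whereas the Graphene field shows a single clearly dominant topic.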

5.2.3. Local explanations

Figure 4.

Visual representation of the top three topics with the highest Shapley values for documents classified into each nanomaterial field, used in the document-level analysis. The length of each bar in the graph correlates with the Shapley value, illustrating the relative impact of these topics on the classification of the respective documents.

Table 6

Frequencies of the top five significant words for each major topic as identified in Table 5, displayed for documents within each nanomaterial field. This table supports the word-level analysis by illustrating how frequently specific terms appear in the texts, aiding in understanding their influence on the classification of the documents into their respective nanomaterial fields.

Nanomaterial field	Topics
Carbon Nanotube	Topic 11		Topic 10		Topic 21
	Word	Frequency	Word	Frequency	Word	Frequency
	tio	0	graphene	1	cu	0
	go	0	go	0	fe	0
	nanotube	16	rgo	0	ag	0
	qds	0	oxide	0	cell	0
	rgo	0	carbon	2	cnts	3
	Total	16	Total	3	Total	3
Quantum Dot	Topic 5		Topic 7		Topic 12
	Word	Frequency	Word	Frequency	Word	Frequency
	composite	0	tio	8	qds	2
	membrane	0	catalyst	0	ag	4
	quantum	3	photocatalytic	3	np	0
	go	0	qds	2	film	0
	dot	3	cd	0	co	0
	Total	6	Total	13	Total	6
Graphene	Topic 10		Topic 2		Topic 11
	Word	Frequency	Word	Frequency	Word	Frequency
	graphene	3	film	0	tio	0
	go	1	electrode	2	go	1
	rgo	2	catalyst	0	nanotube	0
	oxide	1	capacity	0	qds	0
	carbon	4	electrochemical	3	rgo	2
	Total	11	Total	5	Total	3
Nanosilica	Topic 12		Topic 10		Topic 38
	Word	Frequency	Word	Frequency	Word	Frequency
	qds	0	graphene	0	al	6
	ag	0	go	0	sio	2
	np	5	rgo	0	cd	0
	film	0	oxide	1	fiber	0
	co	0	carbon	0	silica	2
	Total	5	Total	1	Total	10
Nanosilicon	Topic 6		Topic 26		Topic 2
	Word	Frequency	Word	Frequency	Word	Frequency
	film	8	si	5	film	8
	adsorption	0	cu	0	electrode	4
	membrane	0	rgo	0	catalyst	0
	tio	0	zno	0	capacity	0
	thin	1	go	0	electrochemical	2
	Total	9	Total	5	Total	14

Local explanations delve into why individual documents are classified into specific nanomaterial fields, offering a granular view of the underlying classification mechanisms. For this analysis, five representative documents, one from each nanomaterial field, were selected. These documents are examined at both the document-level and word-level to pinpoint the topics and words driving their classification.

At the document level, the influence of each topic on classification is quantified by Shapley values. We identify and compare the top three topics with the highest Shapley values for each document against those with the highest GIF values as listed in Table 5. The relative influence of each topic is visually illustrated in Fig. 4, where the horizontal length of each bar correlates directly with the Shapley value, indicating the topic’s impact on the classification.

The word-level analysis focuses on the word assignments from the $V^{'}$ matrix related to the topics identified as most influential at the document level. Table 6 presents the frequency of the top five significant words from each of the major topics identified in Table 5 within the analyzed documents. This analysis reveals which specific words are prominent in the texts and how they contribute to the categorization of the documents into their respective nanomaterial fields.
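Counting how often a topic's top words occur in a document, as reported in Table 6, can be sketched as follows. The tokenization rule (lowercasing and splitting on non-alphanumeric characters), the helper name `keyword_frequencies`, and the sample sentence are assumptions for illustration; the study's own preprocessing may differ.

```python
import re
from collections import Counter

def keyword_frequencies(text, keywords):
    """Count occurrences of each keyword among the lowercased tokens of text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word] for word in keywords}

# Made-up sentence standing in for a paper abstract.
abstract = "Carbon nanotube films: a nanotube network on carbon fibre."
topic_words = ["tio", "go", "nanotube", "qds", "rgo"]  # top words of Topic 11 (Table 5)

freq = keyword_frequencies(abstract, topic_words)
total = sum(freq.values())
print(freq, total)
```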

Through these detailed local explanations, we aim to provide a comprehensive understanding of which topics and words decisively influence the classification outcomes, enhancing the interpretability of our classification model within the context of nanomaterials research.

Carbon Nanotube

In the document-level analysis, the Shapley values clearly indicate a strong association with Topic 11, which has a Shapley value of 0.195. This value significantly surpasses those of Topics 40 and 23, which are 0.095 and 0.055, respectively, as shown in Fig. 4a. This prominent association suggests that the document is primarily influenced by the themes encapsulated in Topic 11.

At the word-level, the document contains several keywords critical to the Carbon Nanotube field: ‘nanotube’ appears 16 times, ‘carbon’ twice, and ‘cnts’ three times. The term ‘graphene’ was mentioned once, which may indicate a comparative discussion between Carbon Nanotube and Graphene. The frequent mentions of ‘nanotube’ align well with the themes of Topic 11, reinforcing its relevance.

It becomes evident that the document is classified as relating to Carbon Nanotube primarily due to its strong thematic alignment with Topic 11, as demonstrated by both the Shapley values and the predominant occurrence of related keywords. This correlation not only highlights the document’s substantial alignment with the identified topics but also confirms the influence of specific terms like ‘nanotube’ in steering the classification towards Carbon Nanotube. Consequently, the classification is strongly justified by the combined evidence from the document- and word-level insights.

Quantum Dot

At the document-level, the analysis reveals that Topic 7 dominates with a Shapley value of 0.570, significantly higher than those for Topics 11 and 18, which are 0.230 and 0.150, respectively, as depicted in Fig. 4b. This substantial value strongly suggests that the primary influence on this document stems from Topic 7.

Word-level scrutiny shows that the document extensively uses terms directly linked to Quantum Dot, such as ‘quantum,’ ‘dot,’ and ‘qds,’ which collectively appear eight times. Additionally, the frequencies of ‘tio,’ ‘photocatalytic,’ and ‘ag’ – terms integral to the discussion on silver-doped titanium dioxide photocatalysts – further align the content with Topic 7. In particular, ‘tio’ appears eight times, and ‘photocatalytic’ three times, reinforcing the document’s focus on photocatalytic materials.

This comprehensive analysis indicates that the document’s classification as Quantum Dot is convincingly justified by its alignment with Topic 7, as evidenced by the high frequency of relevant terms and the substantial Shapley values. The document’s content, enriched with specific references to Quantum Dot components and applications in photocatalysis, firmly positions it within this nanomaterial field.

Graphene

In the document-level analysis, the Shapley values from Fig. 4c show that Topic 2 has a Shapley value of 0.285, closely followed by Topic 10 with a value of 0.270, indicating that the paper is significantly influenced by both topics. This near equivalence in Shapley values highlights a substantial overlap in thematic content associated with electrochemical applications and graphene materials.

Turning to the word-level analysis, the paper’s text includes multiple instances of key terms that anchor it within these topics: ‘graphene’ is mentioned three times and ‘carbon’ four times, directly pointing to its focus on Graphene. Additionally, ‘go,’ ‘rgo,’ and ‘oxide,’ terms linked to chemically oxidized forms of graphene, appear once, twice, and once, respectively. The words ‘electrode’ and ‘electrochemical’ are mentioned twice and three times, respectively, emphasizing the discussion on graphene’s role in electrode materials. The combination of these words aligns well with the themes of both Topics 2 and 10, corroborating the Shapley value analysis. The inclusion of every keyword from Topic 10 at least once within the document underscores its strong alignment with this topic. Given the significant representation of themes and vocabulary for both Topics 2 and 10, the paper is aptly classified under Graphene.

Nanosilica

The document-level analysis indicates a strong association with Topic 38, as evidenced by the highest Shapley value of 0.185 displayed in Fig. 4d. Topic 38’s dominance is further corroborated by the words ‘sio’ and ‘silica,’ terms that are quintessential to Nanosilica, which appear a combined total of four times. The presence of ‘np’ (mentioned five times, per Table 6) and ‘al’ (six times) points towards discussions related to silica’s applications in cleaning agents.

Additionally, Topic 12 shows a significant influence with a Shapley value of 0.170. Keywords from this topic, appearing five times, denote it as the second most relevant topic for this paper. This topic’s presence supports the notion that the paper covers broader aspects of silica use, likely extending beyond just cleaning applications.

Given the frequent appearance of key terms from both Topics 38 and 12 and their respective Shapley values, the paper’s classification as Nanosilica is well justified.

Nanosilicon

In the document-level analysis, Topic 6 emerges as the predominant theme, with a Shapley value of 0.405 as shown in Fig. 4e. This value is significantly higher than those assigned to other topics, indicating a strong association with Topic 6. The analysis reveals that terms pertinent to Topic 6 appear nine times within the document, emphasizing its central theme.

At the word level, the document frequently uses ‘si,’ a direct indicator of Nanosilicon, which is mentioned five times and belongs to Topic 26 (Materials for electronic devices). The mentions of ‘film’ and ‘thin,’ appearing eight times and once, respectively, support the notion that the paper discusses the use of Nanosilicon in the fabrication of thin films for displays. This context is well aligned with Topic 6, whose most influential words include ‘film’ and ‘thin’ (Table 5).

The comprehensive presence of specific keywords from Topic 6, combined with the dominant Shapley value, substantiates the classification of this paper as Nanosilicon. The strong thematic ties to Nanosilicon highlighted by both the document-level Shapley values and the word-level frequency analysis convincingly justify the paper’s categorization within this nanomaterial field.

Overall, we can say that papers are typically classified into specific nanomaterial fields based on the prominence of elemental symbols or abbreviations that directly represent the nanomaterial, as well as terms associated with its applications, within the abstract. This trend underscores the importance of targeted vocabulary in accurately categorizing academic papers according to their focus within the realm of nanomaterials.

6. Conclusion

In this study, we developed an explainable paper classification system that integrates LSA topic modeling for embeddings, an MLP classification model, and SHAP. Our system, evaluated against four other embedding techniques and three other classification models, demonstrated superior performance, achieving an F1 score of 0.7816. Enhanced interpretability was achieved through SHAP value calculations, which provided corpus-, document-, and word-level explanations of the model’s decisions, increasing both transparency and accountability.

Despite its effective classification performance and interpretability, LSA’s reliance on linear assumptions can sometimes result in oversimplifications and misinterpretations of text data. Recent advancements have favored transformer-based pretrained models like BERT for more complex text interpretation tasks. Nevertheless, in our comparative study, the proposed LSA-MLP combination outperformed the BERTopic-MLP combination, supporting its efficacy in classifying nanomaterial-related literature. Additionally, while SHAP values enhance model interpretability, they depend heavily on the underlying model’s accuracy. Misinterpretations or biases in the model can distort SHAP outcomes, and their computational intensity might limit their use in real-time or large-scale applications.

The system is poised to significantly assist researchers by efficiently classifying papers, enabling them to identify current research trends and essential keywords. The use of XAI technology ensures result transparency, helps researchers understand classification criteria, and ensures reliable access to research outcomes. Moreover, visualizing the impact of specific topics or words on classifications enhances decision-making. This system’s utility extends beyond academic papers to other domains, like meeting minutes and news articles, facilitating improved information retrieval and organizational efficiency. Ultimately, it is designed to reduce researchers’ efforts, accelerating scholarly progress.

Future research will explore extending this system’s application beyond the nanomaterials domain to include broader datasets. We plan to assess its adaptability and scalability across various fields, aiming to enhance its performance and utility. Investigating improvements in handling diverse datasets and optimizing computational efficiency are key areas of interest, promising to broaden the system’s applicability and effectiveness in academic and practical applications.

Acknowledgments

This study was supported by the Sungshin Women’s University research grant of 2023 (H20230054).

Data availability statements

The datasets are available from the corresponding author upon request.

Table 1
Distribution of academic papers across five nanomaterial fields from the WoS database.

Field              Counts
Carbon Nanotube    116,763
Quantum Dot        50,884
Graphene           173,707
Nanosilica         47,608
Nanosilicon        45,365
Total              456,472

Table 4
Fidelity scores for SHAP across five nanomaterial fields.

Field              Fidelity
Carbon Nanotube    4.2298e-6
Quantum Dot        2.3664e-5
Graphene           4.4665e-6
Nanosilica         2.3064e-5
Nanosilicon        3.6221e-5
Overall average    0.00073