Abstract
Medical and health text documents pose a challenge for data handling and for retrieving relevant and meaningful documents. Automatically retrieving significant knowledge, with a better understanding of medical and health documents, is a challenging task. One popular approach for thematically understanding medical and health text documents and finding the topics in them is topic modeling. In this research, we propose a novel topic modeling approach, fuzzy k-means latent semantic analysis (FKLSA), which uses fuzzy clustering. Our method generates local and global term frequencies through the bag-of-words (BOW) model. Principal component analysis is used to remove the negative impact of high dimensionality on global term weighting. Previous work shows that redundancy in medical and health documents has a negative impact on the quality of text mining. Therefore, the main achievement of FKLSA is that it handles the redundancy issue in medical and health text documents and discovers semantically more precise topics. FKLSA can be used for finding the themes in a medical and health text corpus. These topics are further used for text classification and clustering tasks in text mining. Experimental results show that FKLSA performs better than LDA and RedLDA on redundant corpora. FKLSA’s time performance is also stable as the number of topics increases, and thus better than that of LDA and LSA on a big Twitter health dataset. Quantitative evaluations on real-world health and medical document datasets show that FKLSA gives higher performance than state-of-the-art topic models such as latent Dirichlet allocation and latent semantic analysis.
Introduction
Nowadays, a large amount of medical and health data is stored in electronic health records (EHRs), and there is a need to analyze these documents. The National Science Foundation identified the management and analysis of vast quantities of scientific records as one of the challenging tasks and a new field for future study [1]. These large collections of electronic records need new tools for organizing, searching, indexing and browsing the documents. In addition, finding the most relevant documents among all available documents is a difficult task. Health and medical text data in particular are generated and stored very quickly: for example, the number of paper publications on the PubMed website exceeded 6 million in 2015, and US hospital discharges also average more than 30 million records [2].
By applying advanced data analytics techniques to health and medical text data, companies could save $450 billion annually. There is therefore a need to develop efficient techniques that discover the hidden themes in complicated health and medical text data.
In natural language processing, the bag-of-words (BOW) model is used to obtain representations of the words in the input data, and it is a popular technique for representing medical and health text. The closely related continuous bag-of-words model is one of the word2vec models that analyze the semantics of natural language [3]. The working of BOW is explained by an example in Fig. 1. There are three documents, and an individual list is constructed for each document. In each document, words are scored according to their occurrences in that document. The words are then converted into vectors of occurrence counts, where rows represent the words and columns represent the documents. For example, the word “cancer” occurs once in document 1 and the word “of” occurs twice in document 2. The rough set is a tool for conceptualizing, organizing and analyzing several types of data; it deals with inaccurate and unclear information in applications associated with artificial intelligence. In accordance with rough set theory, irrelevant information is eliminated from the documents for the classification and decision-making process [4]. In soft set theory, N-soft sets are utilized in decision-making algorithms [5]. The parameters of the soft sets are words, numbers and sentences in various documents.

Example of bag-of-words.
When the documents are large, the resulting matrix is sparse [6]. Sparsity means that many words occur in the corpus, but any one document contains only a small percentage of them. Therefore, most elements of the BOW matrix are zero [7].
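The bag-of-words construction and the sparsity it produces can be illustrated with a minimal sketch. The three toy documents below are hypothetical and are not the documents of Fig. 1; the helper names `bag_of_words` and `sparsity` are likewise introduced only for illustration.

```python
from collections import Counter

def bag_of_words(documents):
    """Build a term-document count matrix (rows = words, columns = documents)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[doc_counts[word] for doc_counts in counts] for word in vocabulary]
    return vocabulary, matrix

def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)

# Hypothetical toy corpus, chosen only to show the layout of the matrix.
docs = [
    "cancer is a disease",
    "treatment of cancer and of infection",
    "exercise improves health",
]
vocab, bow = bag_of_words(docs)
cancer_row = bow[vocab.index("cancer")]  # counts of "cancer" per document
of_row = bow[vocab.index("of")]          # counts of "of" per document
```

Even in this tiny corpus most cells are zero, which is the effect that becomes dominant when the vocabulary grows to thousands of words.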
Topic modeling is a popular probabilistic text modeling technique that has been rapidly adopted in the text mining and information retrieval fields [8–10]. It is a standard technique for dealing with the high dimensionality and sparsity problems.
Topic modeling automatically organizes many documents into topics and their corresponding distributions.
Topic modeling is a text analysis technique in which documents contain different words and the term frequencies of these words are the features. The output matrix of topic modeling is shown in Fig. 2.

Topic modeling matrix interpretation.
In medical and health text mining, topic modeling is an efficient technique, but it still needs improvement because redundancy has a negative impact on topic modeling [11]. Most medical and health documents are redundant [12].
Therefore, in this research, we propose a novel fuzzy k-means latent semantic analysis technique for topic modeling of medical and health text corpora. The proposed technique addresses the following research questions. How can semantically precise topics be discovered from medical and health text documents? How can the effect of high dimensionality be removed? How can the redundancy issue be removed from a redundant corpus? How can higher classification and clustering results be achieved on medical text documents?
The analysis of medical and health text documents is useful for document classification and clustering tasks. It is also helpful for removing the redundancy issue in a medical text corpus. Therefore, we are motivated to develop a topic modeling approach that solves the redundancy issue in medical text corpora and whose topics can be used for text classification and clustering.
The proposed topic modeling approach shows higher performance on both redundant and non-redundant documents and estimates the number of topics within the corpus.
To measure the performance of FKLSA, we conduct extensive experiments on four real-world health and medical text datasets.
The experimental results show that 1) FKLSA discovers semantically strong topics from medical and health text corpora. 2) Our method generates local and global term frequencies through the bag-of-words (BOW) model, and principal component analysis is used to remove the negative impact of high dimensionality on global term weighting. 3) FKLSA performs better on redundant corpora than its competitors LDA and RedLDA. 4) FKLSA’s time performance is stable on the big Twitter health dataset as the number of topics increases, and better than that of LDA and LSA. 5) FKLSA discovers more precise topics from health and medical text documents than the baseline topic models. Therefore, the classification and clustering results of FKLSA are higher than those of the state-of-the-art topic models.
The rest of the paper is organized as follows. Section 2 describes previous related work. Section 3 explains the proposed technique in detail. Section 4 describes the experiments and results, Section 5 is the discussion, Section 6 gives examples of topics and Section 7 presents the conclusion.
In this section, we describe previous medical and health applications of topic modeling and fuzzy clustering.
In machine learning, statistical topic modeling is a method that finds patterns and hidden information in text data [13]. Text mining has two approaches: supervised and unsupervised learning. In classification, a model is trained on a corpus with labels and then assigns labels to new documents [13]. In clustering, every document in a large collection is assigned to a cluster based on similarity or dissimilarity.
Topic modeling is a standard unsupervised text mining method with many applications, including spam detection, SMS [14] and image tagging [15]. Topic modeling assumes that documents are probability distributions over topics and that topics are probability distributions over words.
Latent semantic analysis (LSA) is a topic modeling approach, used particularly in distributional semantics, that analyzes the relationships between a set of documents and the terms they contain by producing a set of topics related to the documents and terms. Unfortunately, a large collection of medical text documents is not structured and contains unfavorable patterns such as redundancies and contradictions [16]. This creates problems for the precise and automatic classification of medical text documents [17]. Latent semantic analysis is used for medical text classification and text mining [18–20]. It assumes that words with a close meaning will appear in similar parts of the text. A matrix of word counts per document (rows represent single words and columns represent documents) is constructed from a large portion of text, and a dimension reduction technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure between columns.
The latent Dirichlet allocation (LDA) topic model shows better performance in medical and health text mining [21]. LDA is a generative probabilistic topic model [10]. It has various applications in the health and medical domain, such as the discovery of clinical concepts and structures in patient health records [22], the prediction of proteins and their relationships [23], pattern identification in clinical events of brain cancer patients [24], and genomic analysis of time-to-event outcomes [25].
Topic models have been used for assigning ICD-9 codes to discharge summaries [26]. In [27], LDA is applied to FDA drug side-effect labels. The results of those experiments show that the extracted topics cluster the drugs by safety and by various therapeutic uses.
LDA is also used in clinical pathways to find the treatment behavior of patients [28] and to predict clinical order patterns [29], and several treatment activities are modeled with their timestamps in the clinical pathway [30].
One variant of LDA that addresses the redundancy issue in medical documents is redundancy-aware LDA (RedLDA), which shows higher performance than LDA [31].
Documents on the web are very complex, with complicated links leading from one web document to another. Effective clustering methods extract the latent meaning from documents. In [32], a technique is proposed that extracts the appropriate meaning from web documents, and hence their topics, through a fuzzy linguistic topological space together with a fuzzy clustering algorithm. In [33], a technique is proposed that reveals the main concepts in text documents. The concepts are explained by word-based connections in a semantic topological space built by a formal model.
The two major approaches to clustering are hard and fuzzy (soft) clustering. In hard clustering, every object belongs to exactly one cluster.
In fuzzy clustering, membership is fuzzy and objects can belong to several clusters [34]. Fuzzy clustering has been used to predict the response to citalopram treatment in alcohol dependence [35], analyze diabetic neuropathy [36], detect diabetic retinopathy early [37], characterize stroke and the causes of ischemic stroke [38–40], detect cancers such as breast cancer [41] and improve decision making for radiation therapy [42]. Fuzzy clustering is also used to improve ultrasound imaging techniques [43] and to analyze microarray data [44].
However, although fuzzy clustering has many applications in the medical and health domain, specifically in image processing, it has rarely been considered for topic modeling.
Therefore, in this research, a new topic modeling approach, FKLSA, is proposed for the analysis of medical and health text documents; it provides a link between fuzzy clustering and topic modeling.
Research methodology
In this section, we describe our proposed method, fuzzy k-means latent semantic analysis (FKLSA), which discovers latent semantic features in medical and health documents. FKLSA takes a fuzzy perspective, which is a new approach to topic modeling, and it is verified through different experiments on health and medical text corpora. FKLSA has the potential to handle the redundancy issue, discover more precise topics and give higher performance than competitors such as LDA and LSA.
FKLSA
Fuzzy logic is an extension of classical logic, with its truth values of one and zero, to truth values between one and zero. Documents and words can be fuzzy clustered through FKLSA, and each cluster is a topic. For example, in the given documents, FKLSA discovers the four topics shown in Fig. 3. On the left side (part A) there are some words that relate to several topics; after the fuzzy process is applied, these words confirm their association with their most related topics. In this process, each word is assigned a degree of fuzzy belonging with respect to each topic (cluster). There are three colors of circles, which show the membership magnitude from low (light gray) to high (dark). In part B of the figure, topic 1 is “disease”, which contains the words germs, afflictions and infection, and topic 2 is “exercise”, which contains the words gym, weight and body. Topic 3 is “bacteria”, which consists of the words toxin, bacteria and virus, and topic 4 is “injury”, with the words pain, wounds and therapy.

Example of topic modeling with the fuzzy process.
The proposed technique finds five matrices: the probability of words P(W), the probability of documents P(D), the probability of words over documents P(W|D), the probability of topics over documents P(T|D) and the probability of words over topics P(W|T). Figure 4 describes the overall conceptual framework of the proposed FKLSA topic modeling approach.

Conceptual Framework of FKLSA topic modeling approach.
Punctuation is erased from the text data.
The text data are converted to lower case.
Tokenized documents are created.
Stop words such as “an”, “a” and “the” are removed from the text data.
Short words with 2 characters are removed from the text data.
Long words with more than 15 characters are removed from the text data.
Words are normalized using the Porter stemmer.
After the BOW model is applied, words that do not appear more than two times are removed, and empty documents are also removed.
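The preprocessing steps listed above can be sketched as follows. This is an illustrative approximation rather than the authors’ implementation: the stop-word list is a small hypothetical subset, the function names `preprocess` and `prune` are introduced here, the short-word rule is assumed to drop words of 2 or fewer characters, and the Porter stemming step is omitted to keep the sketch free of external dependencies.

```python
import string
from collections import Counter

# Illustrative subset only; a real pipeline would use a full stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}

def preprocess(text, min_len=3, max_len=15):
    """Erase punctuation, lowercase, tokenize, drop stop words and short/long words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    # Assumption: "short" means 2 or fewer characters, "long" means more than 15.
    return [t for t in tokens
            if t not in STOP_WORDS and min_len <= len(t) <= max_len]

def prune(tokenized_docs, min_count=3):
    """Keep words that appear more than two times in the corpus, then drop empty documents."""
    totals = Counter(t for doc in tokenized_docs for t in doc)
    kept = [[t for t in doc if totals[t] >= min_count] for doc in tokenized_docs]
    return [doc for doc in kept if doc]

# Hypothetical raw documents.
raw = ["The virus, the VIRUS and a virus!", "Of it.", "Virus infection in the lung"]
tokens = [preprocess(r) for r in raw]
corpus = prune(tokens)
```

In this toy run the second document becomes empty after stop-word and short-word removal and is dropped, and the words “infection” and “lung” are pruned because each occurs only once.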
The typical weight $w_{dk}$ of term $k$ in document $d$, using the vector length as the normalization factor, is shown in Equation 2.
Terms are assigned weights $w_{dk}$ according to their importance, and these weights vary continuously between zero and one. The highest weight, one, is used for the most important terms and zero for the least important terms. In some cases, it may be helpful to use normalized weight assignments, where an individual term weight depends to some extent on the weights of the other terms in the same vector. The term frequency weighting scheme is widely used in text classification and categorization [48–52]. Let $f_{i,j}$ be the frequency of occurrence of index term $k_i$ in document $d_j$; then the total frequency $F_i$ of term $k_i$ is defined in Equation 3.
$N$ is the number of documents in the collection. The document frequency $n_i$ of a term $k_i$ is the number of documents in which $k_i$ occurs, and $n_i \le F_i$.
We use ten GTW methods in this research including “Normal, Unary, IDF smooth, IDF max, Entropy, Inverse document frequency (IDF), Probabilistic inverse document frequency (ProbIDF), Incremental global frequency IDF, Square root global frequency IDF and Global frequency IDF”.
The global term weights in Table 1 are found through Equations 4 and 5, where $tf_{ij}$ represents the number of occurrences of word $i$ in document $j$, $N$ represents the number of documents and $n_i$ is the number of documents in which term $i$ appears.
Global term weighting formulas
Normal is used to normalize and correct discrepancies in documents [54]. In unary GTW, no global weight is used. IDF assigns higher weights to infrequent terms and lower weights to common terms [55]. Entropy GTW assigns a higher weight to terms that have a lower frequency in the documents [56]. IDF smooth and IDF max are variants of inverse document frequency. ProbIDF gives a very low weight to a term that appears in every document [54]. IGFI, IGFS, and IGFF are introduced and discussed in [57].
IGFS is proposed as a combination of formulas in which the square root, which is a good local weight, is assumed as a global weight. The authors found that subtracting a large number from $F_i/n_i$ improves the performance. IGFF performs the best, and by adding one to its formula a new global term weighting formula, IGFI, is proposed.
The output of the GTW step is a set of term frequency (TF) matrices for the documents: TF-Normal, TF-Unary, TF-IDF smooth, TF-IDF max, TF-IDF, TF-Entropy, TF-ProbIDF, TF-IGFI, TF-IGFS, and TF-IGFF. Table 1 shows all global term weighting formulas.
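As an illustration of how a global weight turns a count matrix into a weighted TF matrix, the sketch below implements two of the schemes named above in their common textbook forms: IDF as $\log_2(N/n_i)$ and entropy as $1 + \sum_j p_{ij}\log_2 p_{ij} / \log_2 N$ with $p_{ij} = tf_{ij}/F_i$. These forms are assumptions; the exact formulas of Table 1 may differ. The function names and the toy matrix are hypothetical.

```python
import math

def document_frequency(row):
    """n_i: number of documents in which the term occurs at least once."""
    return sum(1 for count in row if count > 0)

def idf_weight(row):
    """Textbook inverse document frequency, log2(N / n_i)."""
    return math.log2(len(row) / document_frequency(row))

def entropy_weight(row):
    """Textbook entropy weight, 1 + sum_j p_ij * log2(p_ij) / log2(N)."""
    total = sum(row)  # F_i, the total frequency of the term
    acc = 0.0
    for count in row:
        if count > 0:
            p = count / total
            acc += p * math.log2(p)
    return 1.0 + acc / math.log2(len(row))

def apply_global_weight(matrix, weight_fn):
    """Multiply every local count tf_ij by the global weight of its term (row)."""
    weighted = []
    for row in matrix:
        g = weight_fn(row)  # assumes the term occurs at least once in the corpus
        weighted.append([count * g for count in row])
    return weighted

# Hypothetical term-document counts: rows are terms, columns are four documents.
term_document = [
    [1, 0, 0, 0],  # rare term
    [1, 1, 1, 1],  # term present in every document
    [2, 0, 3, 0],
]
tf_idf = apply_global_weight(term_document, idf_weight)
tf_entropy = apply_global_weight(term_document, entropy_weight)
```

Both schemes give the term that occurs in every document a global weight of zero and the rare term the largest weight, which matches the qualitative behavior described in the text.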
The principal component analysis (PCA) [58] technique is used to avoid the negative impact of high dimensionality on the matrices of the ten GTW methods. This method is used to reduce the dimensions of the data. PCA is the most popular multivariate technique. Its major goals are 1) to extract the most important information from the data table; 2) to compress the size of the data by keeping only the important information; 3) to simplify the description of the dataset; and 4) to analyze the structure of the observations and variables.
To achieve these goals, PCA computes new variables, referred to as principal components, which are obtained as linear combinations of all the original variables.
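The idea that a principal component is a linear combination of the original variables can be shown with a small dependency-free sketch. It extracts only the first principal component by power iteration on the covariance matrix, which is a simplification chosen for brevity; a practical pipeline would use a library PCA routine and keep several components. The function names and the toy data are hypothetical.

```python
import math

def center(data):
    """Subtract the column means so every variable has zero mean."""
    n, dims = len(data), len(data[0])
    means = [sum(row[d] for row in data) / n for d in range(dims)]
    return [[row[d] - means[d] for d in range(dims)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dims)]
            for a in range(dims)]

def first_principal_component(data, iterations=200):
    """Return the leading eigenvector of the covariance matrix and the projected scores."""
    centered = center(data)
    cov = covariance(centered)
    dims = len(cov)
    vec = [1.0] * dims  # assumes this start is not orthogonal to the leading direction
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Each score is a linear combination of the original (centered) variables.
    scores = [sum(r[d] * vec[d] for d in range(dims)) for r in centered]
    return vec, scores

# Hypothetical two-variable data lying on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, scores = first_principal_component(data)
```

Because the toy data are perfectly correlated, one component captures all of the variance, so two columns can be replaced by a single column of scores without losing information.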
The FKM algorithm partitions the data points into $k$ clusters $S_l$ ($l = 1, 2, 3, \ldots, k$), which are associated with the cluster centers $C_l$. The relationship between data points and clusters is fuzzy.
The membership $u_{i,j} \in [0, 1]$ represents the degree to which data point $X_i$ belongs to cluster center $C_j$. The set of data points is $S = \{X_i\}$. The fuzzy k-means algorithm is based on minimizing the distortion shown in Equation 6, where $u_{i,j}$ is the membership and $C_j$ represents the clusters.
$N$ represents the number of data points, $q$ is the fuzzifier parameter, $k$ is the number of clusters and $d_{i,j}$ is the squared Euclidean distance between cluster representative $C_j$ and data point $X_i$.
The fuzzy k-means clustering algorithm maps the data points to representative vectors and improves them through repeated partitioning of the data points. The algorithm starts with initial cluster centers, and the process is repeated until a stop criterion is satisfied. It is supposed that no two clusters have the same representative. If $d_{i,j} < n$, then $u_{i,j} = 1$ and $u_{i,l} = 0$ for $l \neq j$, where $n$ is a very small positive number. The fuzzy k-means algorithm performs the following steps.
1) Set the initial clusters $SC_0 = \{C_j(0)\}$, choose a value for $\epsilon$ and set $p = 1$.
2) Given the set of clusters $SC_p$, compute $d_{i,j}$ for $i = 1$ to $N$ and $j = 1$ to $k$, and update the memberships $u_{i,j}$ using Equation 7. $u_{i,j}$ is the membership degree of each document (D) with respect to the clusters (topics). The value of $u_{i,j}$ can be taken as the probability of topic $i$ given document $j$, $P(T_i|D_j)$. When $d_{i,j} < n$ for a very small $n$, $u_{i,j} = 1$. $P(T_i|D_j)$ can be used to find the (words × topics) matrix.
3) Using Equation 8, compute the center of every cluster to obtain the new set of clusters $SC_{p+1}$.
4) If $\| C_j(p) - C_j(p-1) \| < \epsilon$ for $j = 1$ to $k$, then stop, where $\epsilon > 0$ is a small positive number. Otherwise set $p + 1 \to p$ and go to step 2.
In terms of the number of calculations, the computational complexity of fuzzy k-means is $O(Nkt)$, where $t$ represents the number of iterations.
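The steps above can be sketched in a few lines. The membership and center updates below are the standard fuzzy c-means forms, $u_{i,j} = 1 / \sum_l (d_{i,j}/d_{i,l})^{1/(q-1)}$ and $C_j = \sum_i u_{i,j}^q X_i / \sum_i u_{i,j}^q$, which are assumed here to correspond to Equations 7 and 8. The function names, the default values of `q`, `eps` and `tiny`, and the toy points are hypothetical.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance d_ij between a data point and a center."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def update_memberships(points, centers, q=2.0, tiny=1e-12):
    """Step 2: recompute u_ij (rows = data points, columns = clusters)."""
    memberships = []
    for point in points:
        dists = [squared_distance(point, c) for c in centers]
        if min(dists) < tiny:
            # The point coincides with a center: full membership there, zero elsewhere.
            hit = dists.index(min(dists))
            row = [1.0 if j == hit else 0.0 for j in range(len(centers))]
        else:
            power = 1.0 / (q - 1.0)
            row = [1.0 / sum((d_j / d_l) ** power for d_l in dists) for d_j in dists]
        memberships.append(row)
    return memberships

def update_centers(points, memberships, q=2.0):
    """Step 3: each center is the mean of the points weighted by u_ij ** q."""
    dims = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** q for row in memberships]
        total = sum(weights)
        centers.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                        for d in range(dims)])
    return centers

def fuzzy_k_means(points, initial_centers, q=2.0, eps=1e-6, max_iter=100):
    """Steps 1-4: iterate until no center moves by more than eps."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        memberships = update_memberships(points, centers, q)
        new_centers = update_centers(points, memberships, q)
        shift = max(math.sqrt(squared_distance(a, b))
                    for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:
            break
    return centers, update_memberships(points, centers, q)

# Hypothetical 2-D points forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, memberships = fuzzy_k_means(points, [[0.0, 0.0], [10.0, 10.0]])
```

Each row of `memberships` sums to one, which is what allows a row to be read as a distribution over topics for one document.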
After that, P(D, T) is normalized for each of the topics using Equation 11.
Datasets
Four publicly available datasets are used in this research. The first dataset, the MuchMore Springer Bilingual Corpus (M dataset), is a labeled dataset. It consists of English medical abstracts from scientific journals on the Springer website. In this research, two journals are used: Federal health standard sheet and Arthroskopie.
The second dataset is the Ohsumed Collection (O dataset), which is a labeled corpus of medical abstracts from MeSH categories. In this research, the categories used are virus diseases, mycoses, and bacterial infections.
The third dataset is the health news tweets dataset (T dataset), which is unlabeled.
The fourth dataset is the synthetic WSJ redundant corpora, which are based on the Wall Street Journal corpus and widely used in natural language processing (NLP) [59, 60].
Documents classification
The first evaluation, classification with Bayesian optimization, is performed on two datasets: the MuchMore Springer medical abstracts and Ohsumed. Optimization is the process of locating the point that minimizes a real-valued function, which is called the objective function. Bayesian optimization maintains a Gaussian process model of the objective function and uses evaluations of the objective function to train the model. The cross-validated classification error is minimized by using Bayesian optimization. The Fit function in Matlab is used for Bayesian optimization.
MuchMore and Ohsumed are labeled datasets. Two classes, Arthroskopie and Federal health sheet, are selected for the Springer dataset.
Classification of the documents is performed on P(T|D). In this classification, documents are assigned to classes using features extracted from the text data. To achieve high performance, we used K-nearest neighbor (KNN) with Bayesian optimization. The KNN classifier is one of the top-performing text categorization methods and machine learning algorithms. KNN uses standard information retrieval techniques (e.g., tf-idf). KNN assigns to a document the most common category among its neighboring documents (a vote over the neighbor categories). KNN [61] relies strongly on the distance metric for the input data patterns. Distance metric learning learns a distance metric for the input space from a given collection of similar/dissimilar point pairs so that the distance relationships in the training data are preserved. The number of neighbors K can be estimated by cross-validation [13, 62].
The performance of FKLSA is compared with that of the LDA and LSA models using tenfold cross-validation. In this method, the data are divided into 10 subsets for 10 iterations. In each iteration, one subset is selected for testing and the others for training.
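A minimal sketch of KNN classification evaluated with tenfold cross-validation is given below. It assumes squared Euclidean distance and a fixed `k`, and it does not include the Bayesian optimization step described above. The function names and the toy topic-proportion features are hypothetical and stand in for rows of P(T|D).

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote over the k training documents closest to the query."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def k_fold_indices(n_items, n_folds=10):
    """Split item indices into n_folds disjoint test subsets."""
    return [list(range(fold, n_items, n_folds)) for fold in range(n_folds)]

def cross_validated_accuracy(features, labels, n_folds=10, k=3):
    """Each subset is used once for testing while the others are used for training."""
    correct = 0
    for test_idx in k_fold_indices(len(features), n_folds):
        held_out = set(test_idx)
        train_x = [features[i] for i in range(len(features)) if i not in held_out]
        train_y = [labels[i] for i in range(len(labels)) if i not in held_out]
        for i in test_idx:
            if knn_predict(train_x, train_y, features[i], k) == labels[i]:
                correct += 1
    return correct / len(features)

# Hypothetical two-topic proportions for 20 documents in two classes.
features = ([[0.9 - 0.01 * i, 0.1 + 0.01 * i] for i in range(10)]
            + [[0.1 + 0.01 * i, 0.9 - 0.01 * i] for i in range(10)])
labels = ["A"] * 10 + ["B"] * 10
accuracy = cross_validated_accuracy(features, labels)
```

The interleaved fold assignment is a simple way to keep both classes represented in every fold of this toy example; a stratified or shuffled split would be the usual choice on real data.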
The KNN method is used with 50, 100 and 150 topics as the input features of the documents in classification. The output of KNN is presented in a confusion matrix, as shown in Table 2.
Confusion matrix
TN: True negative (TN) is a correct prediction that an instance is negative.
FP: False positive (FP) is an incorrect prediction that an instance is positive.
FN: False negative (FN) is an incorrect prediction that an instance is negative.
TP: True positive (TP) is a correct prediction that an instance is positive.
We used precision, recall, accuracy, F-measure, and specificity to check the performance of FKLSA. Equations 14–18 show the precision, recall, accuracy, F-measure, and specificity formulas.
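The five measures can be computed directly from the four cells of the confusion matrix. The sketch below uses their standard definitions, which are assumed to match Equations 14–18; the function name and the example counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard measures derived from a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "specificity": specificity,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```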
Table 3 shows the basic statistics of datasets used in this research with their description and numbers of documents.
Datasets basic statistics
The classification results on the two datasets are shown in Tables 4–9. The results indicate that the proposed technique with the Normal, Unary, IDF smooth, IDF max, Entropy, inverse document frequency (IDF), probabilistic inverse document frequency (ProbIDF), log global frequency IDF, incremental global frequency IDF, square root global frequency IDF and global frequency IDF methods gives higher performance than the LSA and LDA models for different numbers of topics.
MuchMore springer datasets classification results on 50 topics
MuchMore springer datasets classification results on 100 topics
MuchMore springer datasets classification results on 150 topics
Ohsumed datasets classification results on 50 topics
Ohsumed datasets classification results on 100 topics
Ohsumed datasets classification results on 150 topics
The classification performance of FKLSA in terms of accuracy, precision, recall, F1-score and specificity is higher than that of the LDA and LSA topic models across the different numbers of topics.
The second evaluation, clustering, is performed on the unlabeled health news tweets dataset. Document clustering is performed on P(T|D) using the k-means clustering technique. There are two kinds of clustering validation methods, external and internal. The internal validation method is more precise than the external one [63]. The different numbers of topics and clusters are evaluated using the Calinski-Harabasz index, an internal validation method. The Calinski-Harabasz (CH) index [64] is a popular internal validation method. The CH index is a ratio-type index in which cohesion is estimated from the distances of the points in a cluster to its centroid, as shown in Equation 19.
The Calinski-Harabasz index evaluates the validity of all clusters based on the average sum of squared errors of the clusters. A higher Calinski-Harabasz index indicates better clustering results. The Calinski-Harabasz index obtains the best results on clusters [65].
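The CH index can be sketched in its standard form, the between-cluster dispersion divided by $(k - 1)$ over the within-cluster dispersion divided by $(N - k)$, which is assumed to match Equation 19. The function names and the four toy points are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom."""
    overall = centroid(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * squared_distance(center, overall)
        within += sum(squared_distance(p, center) for p in members)
    n, k = len(points), len(clusters)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical points: two tight groups far apart along the first axis.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
good = calinski_harabasz(points, [0, 0, 1, 1])  # groups match the labels
poor = calinski_harabasz(points, [0, 1, 0, 1])  # labels cut across the groups
```

The labeling that follows the natural groups scores far higher than the one that cuts across them, which is why a higher CH value is read as a better clustering.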
The performance of FKLSA is compared with that of the LSA and LDA models with 2 to 10 clusters on 25, 50, 75, 100, 125 and 150 topics. The CH index results indicate that the clustering performance of the proposed technique is higher than that of the LDA and LSA topic models for different features and clusters, as shown in Figs. 5–10.

Calinski-Harabasz for 25 topics of tweets datasets.

Calinski-Harabasz for 50 topics of tweets datasets.

Calinski-Harabasz for 75 topics of tweets datasets.

Calinski-Harabasz for 100 topics of tweets datasets.

Calinski-Harabasz for 125 topics of tweets datasets.

Calinski-Harabasz for 150 topics of tweets datasets.
In Figs. 5–10 the LSA curve appears close to zero because the CH index values of LSA are much lower than those of FKLSA. The clustering performance of FKLSA is better than that of LSA.
This experiment analyzes the impact of the redundancy issue using the synthetic WSJ redundant corpora. FKLSA is compared with LDA and with RedLDA, which was developed for handling the redundancy issue in medical documents [31]. LDA, RedLDA, and FKLSA are trained on the synthetic WSJ redundant corpora to compare the performance of these models. Log-likelihood is used to evaluate the models. A higher log-likelihood score indicates better generalization performance and means that the topic model is more successful in modeling the document structure of the input text corpus. Figure 11 shows the log-likelihood for the synthetic WSJ redundant corpora with numbers of topics ranging from 50 to 450. The experiment shows that the log-likelihood scores of FKLSA are higher than those of LDA and RedLDA for the redundant documents. Therefore, the performance of FKLSA on redundant text documents is better than that of LDA and RedLDA.

Likelihood comparison of WSJ corpora.
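One common way to compute a corpus log-likelihood from a topic model is to sum, over all observed words, the count times the log of the mixture probability $\sum_t P(w|t)\,P(t|d)$. The sketch below follows that form as an assumption; the text does not state the exact formula used in the experiment. The function name and the toy matrices are hypothetical.

```python
import math

def log_likelihood(counts, p_word_given_topic, p_topic_given_doc):
    """Sum of n(d, w) * log(sum_t P(w|t) * P(t|d)) over all observed words.

    counts[d][w]             -- occurrences of word w in document d
    p_word_given_topic[t][w] -- P(w | t)
    p_topic_given_doc[d][t]  -- P(t | d)
    """
    n_topics = len(p_word_given_topic)
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n == 0:
                continue
            p = sum(p_topic_given_doc[d][t] * p_word_given_topic[t][w]
                    for t in range(n_topics))
            total += n * math.log(p)
    return total

# Hypothetical corpus: two documents, two words, two topics.
counts = [[2, 0], [0, 3]]
p_w_t = [[0.9, 0.1], [0.1, 0.9]]
matched = log_likelihood(counts, p_w_t, [[1.0, 0.0], [0.0, 1.0]])
mismatched = log_likelihood(counts, p_w_t, [[0.0, 1.0], [1.0, 0.0]])
```

The model whose topic assignments match the documents receives the higher (less negative) score, which is the sense in which a higher log-likelihood indicates a better fit.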
The speed of FKLSA is compared with that of LDA and LSA using the big health news tweets dataset. The topic modeling process is based on the probability distribution of topics and on the words with higher probability in each topic for the posterior distribution. For LDA, Gibbs sampling is used as the approximation method in the experiment. This algorithm needs many iterations, so its computational cost increases with the number of words, topics, and documents. Figure 12 shows that FKLSA’s time performance is stable even with an increasing number of topics and better than that of LDA and LSA.

Time Execution comparison for twitter health dataset.
The experimental results indicate that the classification performance of FKLSA is higher than that of LDA and LSA for different numbers of topics. In the clustering results, the CH index of FKLSA with various numbers of clusters is higher than that of LDA and LSA. Therefore, the clustering performance of FKLSA is better than that of the other topic models, LDA and LSA. The redundancy issue is also analyzed by comparing FKLSA against LDA and RedLDA. The log-likelihood scores of FKLSA are better than those of LDA and RedLDA on the redundant corpus. In addition, the execution time of FKLSA is stable across different numbers of topics compared to LDA and LSA.
Example
Medical and health document collections contain many words. FKLSA can be run with different numbers of topics to estimate the performance of the proposed technique. FKLSA discovers more precise topics from health and medical documents. In this example, four topics from each of two datasets are shown in Tables 10 and 11: topics T7 to T10 and topics T5 to T8. These topics are discovered from the Ohsumed and Twitter health news tweets datasets. Each topic contains ten different words.
Example of four topics for Ohsumed dataset
Example of four topics for twitter health dataset
Collections of medical and health text documents are continuously growing, and the analysis of these documents is very important for obtaining valuable information. Medical and health archives such as PubMed provide valuable services to the scientific community.
Topic modeling is a popular method that discovers the hidden themes and structure in unorganized medical and health documents. This structure is used for searching, indexing, browsing and summarizing these documents.
In machine learning, fuzzy techniques are widely used for health and medical image processing and text processing. Existing topic modeling techniques are based on linear algebra and statistical distribution approaches.
In this research, the FKLSA topic modeling technique is developed and evaluated for discovering the latent semantic themes in medical and health documents. FKLSA avoids the negative effect of redundancy in medical and health documents. Experimental results show that the time performance of FKLSA is stable with an increasing number of topics. Furthermore, FKLSA also improves the classification accuracy for medical and health datasets and provides a new approach for text mining over medical and health datasets.
FKLSA is a new topic model for researchers and practitioners that has the flexibility to work with an extensive variety of fuzzy clustering and dimension reduction techniques. FKLSA can be used to discover the themes in medical and health text documents efficiently. Additionally, FKLSA works with discrete and continuous data and estimates the number of topics in medical documents.
Quantitative evaluation on four datasets shows that FKLSA outperforms competitive baselines with significant improvements. The experimental results suggest that FKLSA is a strong method that identifies the hidden structure in health and medical datasets. The results also show that FKLSA’s classification and clustering performance is higher than that of state-of-the-art baseline topic models such as LDA and LSA.
