Abstract
Text clustering has become an essential process for organizing unstructured text data into an appropriate format. However, it does not by itself pave the way for extracting the information needed to facilitate document representation. Today, retrieving relevant text data is crucial, yet most data is in an unstructured text format that is difficult to categorize. The major intention of this work is to implement a new text clustering model for unstructured data using classifier approaches. At first, the unstructured data is taken from standard benchmark datasets covering both the English and Telugu languages. The collected text data is then passed to the pre-processing stage. In feature extraction stage 1, the GloVe embedding technique is applied to the pre-processed data to extract text features. Similarly, in feature extraction stage 2, the pre-processed data is used to extract deep text features with a Text Convolutional Neural Network (Text CNN). Then, the text features from stage 1 and the deep features from stage 2 are combined and used for optimal feature selection by the Hybrid Sea Lion Grasshopper Optimization (HSLnGO), in which the traditional Sea Lion Optimization (SLnO) is superimposed with the Grasshopper Optimization Algorithm (GOA). Finally, text clustering is performed with Deep CNN-assisted hierarchical clustering, whose parameters are optimized by HSLnGO to improve the clustering performance. The simulation findings, assessed with different quantitative measures, illustrate that the framework yields impressive text classification performance on unstructured text data in contrast with other techniques.
Introduction
Humans communicate through speech and text, which produces a large volume of unstructured data; machine learning methods are therefore required to transform this data into significant subsets of features [1]. Text mining is now widely used for applications such as summarization, classification, and clustering, supported by semantically similar words [2]. Traditionally, text clustering is the automated procedure of categorizing text into different cluster groups. Within each cluster group, the similarity among the texts is high, whereas the similarity between cluster groups is low [3]. Every cluster group that is generated has a cluster centroid. Consequently, the appropriate text is chosen to form the group, from which the relevant information and content [4] can be obtained. Clustering of unstructured text data is carried out in three different ways: hierarchical-based, partition-based, and grid-based. Moreover, the text data is represented as numerical or categorical data. In contrast to Natural Language Processing (NLP), the clustering mechanism requires less manual involvement and offers more effective performance [5]. Thus, text clustering is a required process in the document representation, information retrieval, organization, and language processing domains. Because the data is unstructured, it is cumbersome to process in an automated way. Several corpus processing methods are deployed for extracting the required text data [6].
Additionally, text mining is a subclass of data mining, yet it faces some notable challenges in enhancing the clustering process [7] for unstructured text data. Text mining is deployed for retrieving information concealed in unstructured data. Some of the factors that limit clustering performance are “unspecified structural form of data, customizing analyses for spoken language, high probability of data scarcity” [8]. These drawbacks degrade the performance and reduce the reputation of the computer-aided approach, so much more knowledge about the clustering process is needed [9]. Inadequate data is another challenging issue in grouping the data. Missing values and restricted raw text tend to reduce the clustering efficacy and make the performance analysis fragile [10]. Other major downsides are high dimensionality, low transparency, and sparse data distribution. Some well-developed techniques such as the “K-Means algorithm, ordered clustering, Self Organizing Maps (SOMs), Expectation Maximization (EM)” become ineffective [11] on large datasets because all of them are limited in scalability. The earlier automated clustering methods still leave room for further improvement because they suffer from noteworthy limitations [12].
The aforementioned challenges pertain to the robustness and completeness of documents with relevant text data, even when a document has few words or lines [13]. Hence, another challenge is to complete the text information with only a short span of text features. Machine learning methods are commonly used to resolve unstructured data problems and enhance performance. Such machine learning techniques include “Artificial Neural Network (ANN), Random Forest (RF), and Support Vector Machine (SVM)” and so on, which remain directions for future work [14]. More recently, however, deep learning has become the trendsetting technique for text clustering. A deep learning model can readily retrieve the hidden information in unstructured text and explore it with higher clustering efficiency [15]. Some commonly used deep architectures are “Long-Short Term Memory (LSTM), Convolutional Neural Network (CNN), Deep Neural Network (DNN)” and so on. In deep learning, feature extraction is interlinked with learning the features of the text data. Some feature extraction methods represent semantically similar words in statistical and linguistic terms [16]. The relevant features are mostly obtained by word embedding, character embedding models, etc. However, this process is easily trapped by semantic restrictions and language dependency. Moreover, some optimization algorithms inspired by natural intelligent behavior are also used for solving the clustering problem. A recent innovation in text clustering is to extend the standard optimization algorithms with multi-objective functions. The reliability and scalability of the clustering process improve when deep learning-based techniques are implemented. Some hybrid models also help to determine optimal results on high-dimensional data [17].
Hence, several researchers have invented novel methods for approaching unstructured data using the text clustering method. This paper focuses on the hierarchical clustering process with a deep learning model to offer the result of better-clustered text data.
The core contributions of the proposed model are as follows.
To design and develop a novel hierarchical clustering model for unstructured text data using deep learning and a meta-heuristic approach.
To extract the relevant text features and deep features with GloVe embedding and the Text CNN model, respectively, which are then fused; the optimal features are subsequently acquired through the newly designed HSLnGO algorithm.
To develop a hybrid algorithm, HSLnGO, that provides optimal solutions; it is built by combining the SLnO and GOA algorithms and is used to further increase the classification rate.
To cluster the unstructured text features with a Deep CNN whose parameters, such as the learning rate, the number of epochs, and the number of hidden neurons, are tuned optimally by the HSLnGO algorithm.
To evaluate the performance of the clustering process through a comparative analysis with different validation measures, thereby confirming the effectiveness of the proposed clustering model.
The remainder of the paper is organized as follows. The review, research gaps, and challenges regarding the existing clustering methods are described in Section 2. Section 3 elaborates on the novel deep learning-aided hierarchical clustering model for unstructured data. Section 4 presents the feature extraction and the optimal feature selection. The deep learning-aided hierarchical clustering technique is described in Section 5. The results of the novel method are analyzed and discussed in Section 6. Finally, Section 7 concludes the paper.
Related works
In 2022, Hosseini and Varzaneh [18] presented a hybrid deep text clustering model based on a “stacked autoencoder and k-means clustering”. The text data was collected from the Barez data. The model included three major steps: i) text representation with a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model, termed ParsBERT, ii) extraction of relevant features with an autoencoder to reduce the feature dimension, and iii) clustering with the K-Means process. Finally, the performance was analyzed and its effectiveness was validated with the Silhouette Score. The deep learning model achieved better clustering performance than the existing clustering processes.
In 2020, Saeed et al. [19] described a corpus creation methodology that integrated metadata and retrieved text data. The corpus was composed of many unstructured documents, for which different weight nodes were acquired via multistage clustering. The proposed model contained various phases. The News dataset was used for clustering, and the output was associated with sub-corpora. This enhanced text detection and reduced the cost function using clusters. The new model extracted significant data related to the expected content and, in contrast to other techniques, increased the detection level.
In 2021, Kumar et al. [20] elaborated on two-phase cluster validation (TPCV) for unstructured text data. It was used to validate the similarity measures belonging to different cluster groups. Initially, the TPCV model identified the cluster centroid and thereby determined the probability of the closeness value of each cluster group. It was then employed to validate the separation probability between the cluster groups in terms of distance. The experimental results demonstrated that the implemented model attained better results than existing methods.
In 2016, Sundermann et al. [21] proposed an ensemble-based learning model for text mining of unstructured data. Once the text mining technique was applied, every metadata set was fed into the suggested model. Then, the resulting data was fused to provide a final ranking value for the user. Two models, K-Nearest Neighbors and BPR-Mapping, were combined into the ensemble model. The simulation results illustrated that the recommended system was feasible when using multiple data of various types, and its efficacy was demonstrated with different measures.
In 2019, Lee et al. [22] presented a novel document representation model using neural embedding and probabilistic word clustering. Here, clustering was used to determine semantically similar text. Depending on the membership, the neural embedding method was used to identify similar words. The clustering was then carried out probabilistically via the "fuzzy C-means clustering method or Gaussian mixture model". The findings showed higher accuracy on 12 different classes.
In 2021, Thirumoorthy and Muneeswaran [23] developed a novel system for Document Clustering (DC) using the Hybrid Jaya Optimization algorithm (HJO), termed HJO-DC. The performance was assessed with the Silhouette index to determine the data quality. The proposed method provided the expected outcome when contrasted with classical techniques such as K-Means and K-Medoids and with heuristic approaches such as GA, Cuckoo Search, PSO, Firefly, and Grey Wolf Optimizer.
In 2020, Fidan and Yuksel [24] organized unstructured text data into an appropriate format using Grey system theory, which was mainly used to handle small datasets. The raw comments or reviews were collected from Amazon, and each review was labeled as positive or negative. The Grey system method was applied together with hierarchical and partitional clustering mechanisms. In the simulation analysis, the method obtained more efficiently clustered data than traditional methods.
In 2019, Janani and Vijayarani [25] investigated a new approach, the Spectral Clustering algorithm with PSO (SCPSO), to render effective text clustering. The suggested model started from a random solution and used local and global optima results. The main intention of the model was to cluster the text data with PSO to resolve the issue of using a large number of datasets. On standard text-related datasets, the proposed model improved the clustering performance over “Spherical K-means, Expectation Maximization Method (EM) and existing PSO”. Thus, the proposed model offered better clustering results.
Research gaps and challenges
Earlier clustering approaches such as keyword-based algorithms are too fragile to provide appropriately clustered output for unstructured data. They also suffer from security overhead and data loss, since they are easily limited by time and data overhead. The merits and demerits of existing unstructured text clustering methods are given in Table 1. Autoencoder and K-Means [18] attains a higher Silhouette Score, but it cannot handle a large number of datasets and has inadequate knowledge to train the data. Multi-stage clustering [19] helps to assess the quality and effectiveness of unstructured text query output at a reduced cost, but its text assessments via computer-aided text segments need improvement. TPCV [20] achieves higher clustering accuracy; however, it is only applicable to sequential series of text data. The BPR-Mapping algorithm [21] improves accuracy while maintaining the extensibility and reliability of the model, but, because of confined data, it reduces the computational power of the system. The embedding-probabilistic model [22] obtains higher accuracy and reduces computational complexity; however, it reduces the semantic similarity measures among the texts. HJO [23] offers higher-quality clustering of text documents, but it is easily trapped in local optima. Grey system theory [24] increases efficacy even with small datasets, but it is limited by certain rough set theory rules. SCPSO [25] handles an enormous number of text documents, but processing on a multi-core CPU is critical. Hence, a novel clustering method for unstructured data is developed and executed using NLP techniques.
Strengths and weaknesses of existing unstructured text clustering methods
Data description of English and Telugu dataset
The proposed model uses English and Telugu text for unstructured text clustering. The datasets are described as follows.
English dataset: The input text is garnered from the below link as “
Telugu dataset: The raw unstructured text is obtained from the link as “
Here, the sample text of the total collected unstructured text data is denoted as
Pre-processing of raw text
The purpose of pre-processing is to transform the raw input data into an understandable format, referred to as cleaned data. Data pre-processing is mainly used to check for missing data, noisy or unwanted data, and inconsistent data. Such irrelevant data is removed to improve the results. The proposed methodology employs the pre-processing techniques described below.
Tokenization: In general, tokenization segments text into meaningful pieces, termed tokens. The resulting tokens take the form of phrases, symbols, words, and so on. For example, given the text “How are you”, tokenization splits the sentence into different tokens as
Stemming: The stemming process utilizes the input text
Stop words removal: The stemming text
Punctuation removal: Here, the input is declared as stop word removed text as
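The four pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the stop-word list, the suffix list, and the naive suffix-stripping stemmer are hypothetical stand-ins, and `string.punctuation` covers ASCII punctuation only, so the sketch applies to English text and would need language-specific resources for Telugu.

```python
import string

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}
# Hypothetical suffix list for a naive stemmer (assumption for illustration).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split raw text into whitespace-separated tokens."""
    return text.split()


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def remove_punctuation(tokens):
    """Delete ASCII punctuation characters and discard empty tokens."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = [t.translate(table) for t in tokens]
    return [t for t in cleaned if t]


def preprocess(text):
    """Apply the steps in the order described in the text."""
    tokens = tokenize(text)
    tokens = [stem(t) for t in tokens]
    tokens = remove_stop_words(tokens)
    return remove_punctuation(tokens)
```

Because punctuation is removed last, as in the text, a token such as "playing," is not stemmed in this sketch; a different ordering would change the output.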
Developed text clustering model
Text clustering plays a prominent role and is defined as the process of grouping data into different clusters, which are formed according to similarity measures among the data elements. The clustering mechanism also depends on the cluster attributes. Today, unstructured text documents are part of everyday needs. Because the text is unstructured, it is cumbersome to process in an automated way. Several corpus processing methods are deployed for extracting the required text data. Since the corpus is huge, sustaining efficiency during clustering is challenging. Another issue is dividing the documents appropriately into their respective groups of related text documents. Without proper corpus grouping, text retrieval and clustering performance degrade. Inefficient grouping also raises costs and requires more effort in the retrieval approach. Challenges such as poor reliability, high time consumption, and computational burden have persisted over the past few years. Other limitations in unstructured text clustering are the lack of storage independence, limited data mobility, and the inability to manage the data. To address these issues, a novel unstructured text clustering mechanism that uses a hybrid heuristic algorithm is developed. The architecture of the enhanced text clustering model is depicted in Fig. 1.
Architectural representation of proposed clustering model for unstructured text data.
The proposed method involves four stages: a) pre-processing, b) feature extraction, c) optimal feature selection, and d) Deep CNN-aided hierarchical clustering. Initially, unstructured text data in both English and Telugu is gathered from benchmark datasets. Next, pre-processing is carried out by tokenization, stemming, stop word removal, and punctuation removal, and the pre-processed text is used to increase the accuracy level. Subsequently, the resulting text is fed into the feature extraction phase. The significant text features and deep features are obtained using GloVe embedding and the Text CNN model, respectively. These features are then combined, and the optimal features are selected by the enhanced HSLnGO algorithm. The hybrid algorithm is built by combining the GOA and SLnO algorithms. Its main purpose is to yield an accurate solution while reducing time and structural complexity. In the end, the Deep CNN model is adopted for hierarchical clustering with the optimal features. To reduce the training time and burden, parameters such as the learning rate, number of epochs, and number of hidden neurons are optimized by the HSLnGO algorithm. Finally, the expected clustered output is obtained from the suggested framework.
Feature extraction: Phase 1
Once the text is pre-processed
GloVe embedding [26] is a log-bilinear model and is termed a count-based approach. It builds a global co-occurrence matrix whose entries record how often words appear together throughout the documents. The idea behind the count-based GloVe model is to estimate how frequently words co-occur. It therefore relies on ratios of co-occurrence probabilities, which tends to improve performance on word analogy tasks. It is used to increase accuracy and also helps to avoid loss of information.
GloVe relates the dot product of the word vectors to the logarithm of the probability of word co-occurrence. It is formulated using Eq. (1).
Here,
Here,
The term
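To make the count-based idea concrete, the sketch below builds a co-occurrence table and evaluates the weighted least-squares objective from the original GloVe formulation, in which the dot product of a word vector and a context vector plus two biases is fitted to the logarithm of the co-occurrence count. It is an illustration under stated assumptions: the paper's own Eq. (1) and its symbols are not recoverable here, the function names are hypothetical, the window counts are unweighted (the reference GloVe implementation down-weights distant pairs), and `x_max=100` and `alpha=0.75` are the commonly cited defaults.

```python
import math
from collections import defaultdict


def cooccurrence(tokens, window=2):
    """Count how often each (word, context) pair appears within the window."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1.0
    return counts


def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: damps rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(counts, w, w_ctx, b, b_ctx):
    """Sum of weighted squared errors between w.w_ctx + biases and log(count)."""
    loss = 0.0
    for (wi, wj), x in counts.items():
        dot = sum(p * q for p, q in zip(w[wi], w_ctx[wj]))
        err = dot + b[wi] + b_ctx[wj] - math.log(x)
        loss += weight(x) * err * err
    return loss
```

Training would minimize `glove_loss` over the vectors and biases; in practice, pre-trained GloVe vectors are usually loaded rather than trained from scratch.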
In this phase, the pre-processed text
In the aforementioned equation, the concatenation operator is indicated as
In Eq. (5), the bias and non-linear function is represented by
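The convolution step of a Text CNN can be illustrated as follows: a kernel spanning a window of consecutive word vectors is slid over the sentence matrix, a bias and a non-linear function are applied to each window, and the resulting feature map is reduced by max-over-time pooling. This is a minimal sketch of the general technique, not the paper's configuration; the function names are hypothetical and ReLU is an assumed choice of non-linearity.

```python
def relu(x):
    """Assumed non-linear function; the paper's choice is not stated here."""
    return x if x > 0.0 else 0.0


def conv_feature_map(embeddings, kernel, bias=0.0):
    """Slide a kernel of h rows over the sentence matrix (one row per word).

    Each output value is relu(sum(window * kernel) + bias) for one window of
    h consecutive word vectors.
    """
    h = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        total = bias
        for word_vec, kernel_row in zip(window, kernel):
            total += sum(a * k for a, k in zip(word_vec, kernel_row))
        feature_map.append(relu(total))
    return feature_map


def max_over_time(feature_map):
    """Keep the strongest response of the kernel over the whole sentence."""
    return max(feature_map)
```

A sentence shorter than the kernel height produces an empty feature map, so real pipelines pad the input; one pooled value per kernel yields the deep feature vector.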
Once the two feature sets are obtained from the above two models, they undergo feature concatenation. Hence, the fused features are represented as
Feature concatenation and selection of optimal features through the HSLnGO algorithm.
The main purposes of selecting optimal features are to reduce overfitting, to obtain less redundant text data, and to increase the performance level. In general, optimal feature selection reduces the number of false features while retaining the same level of true positive values. The optimal features carry significant information for clustering unstructured text data. Finally, the resultant optimal features are indicated
Hierarchical clustering and target generation
For unstructured text data, hierarchical clustering [28] is most commonly used to build a hierarchy of clustered data groups. The clustering is organized as a binary tree over the corresponding text data. Every cluster group is generated from data items: each leaf of the tree represents a single data item, whereas the root integrates all the data. Intermediate clusters are formed between the leaves and the root. This clustering resembles a “cluster of clusters” that are placed in sequence, like a queue, to build the tree. The clustering is split into two types: “(i) Agglomerative clustering is processed in bottom-up nature and (ii) Divisive clustering is executed in the way of top-down approach”.
Agglomerative hierarchical clustering can be implemented in many different ways. A dendrogram is a tree diagram that visualizes how similar sets of text data are linked to construct the hierarchical clustering. In dendrogram graphs, the x-axis and y-axis represent the cluster distance and the data points. The “legs” of the dendrogram indicate how the cluster groups are generated. The groups start as singletons, each composed of a single text data item on the axis.
The data is grouped into clusters according to distance. The “single-linkage” strategy merges groups according to the shortest distance over all data pairs, which is derived using Eq. (6).
Here, the cut-off distance among the text data is assessed using
Rather than relying on the shortest or the longest distance, the average linkage strategy calculates the average distance over all data pairs between the groups, as expressed in Eq. (8).
In addition, the centroid strategy determines the distance between the cluster centroids of the text data, as formulated in Eq. (9).
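The linkage strategies named above can be sketched as plain functions over two clusters of points, together with a small bottom-up (agglomerative) merge loop. This is an illustration rather than the paper's implementation: Euclidean distance is assumed, the function names are hypothetical, and complete linkage is included only as an assumption about the strategy whose equation is not recoverable here.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(a, b):
    """Shortest distance over all cross-cluster pairs."""
    return min(euclidean(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Longest distance over all cross-cluster pairs (assumed strategy)."""
    return max(euclidean(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(euclidean(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(a), centroid(b))


def agglomerate(points, linkage, n_clusters):
    """Start from singletons and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the merge distance at each iteration of `agglomerate` gives the heights needed to draw a dendrogram; the loop here is cubic in the number of points and is meant only to show the logic.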
During the training phase, hierarchical clustering is performed to produce the target clustered output. The optimal features are given as input to hierarchical clustering, and the clustered data is obtained with the various linkage strategies. The clustering mechanism is then accomplished with the help of Deep CNN, whose training and testing phases are executed to acquire the required clustered text data.
The proposed hybrid algorithm is mainly used to determine the optimal parameters of the Deep CNN-assisted hierarchical clustering mechanism. The novel algorithm is built by superimposing the conventional SLnO and GOA algorithms. The significant benefits of GOA are that it can solve problems without knowledge of the search space and that it can improve the initial random population and the global optimum. The advantages of SLnO include a stronger search for solutions and the ability to escape local optima. However, both algorithms have noteworthy limitations: the population may diverge because the exact fitness value is not provided, the linear convergence parameter tends to produce unbalanced results, and the convergence rate is inconsistent. The proposed HSLnGO algorithm is introduced to overcome these issues. The steps of the novel algorithm are as follows.
Step 1 – Parameter setting: Consider the total population as
Step 2 – Fitness calculation: Because the SLnO algorithm depends on whiskers to find the prey, its ability to render the optimal solution is degraded. Also, encircling the prey requires a larger population to form the groups. This costs time during optimization and makes it hard to improve the convergence rate. To improve the results, the novel algorithm designs the objective function with the help of the worst fitness Wfit and the best fitness Bfit values. Based on the condition, either the exploration or the exploitation phase is performed. The fitness function is derived using Eq. (10).
In the above equation, the variable
Step 3 – SLnO: The conventional SLnO [29] algorithm is based on the natural behavior of sea lions. Sea lions use their whiskers to determine the position of prey. The whiskers can identify the size, shape, and location of prey. Once the prey is identified, the sea lion calls the remaining members to locate the prey. This sea lion is designated as the leader, and the rest of the members update their positions with respect to the prey. Thus, the initial position of the sea lion is expressed in Eq. (11).
At the current iteration, the location of sea lion and prey is denoted as
Here,
Velocity: Sea lions can live both on land and in water. In water, the sounds of a sea lion travel about four times faster than on land. Thus, the velocity of the sea lion's sound is assessed for hunting the prey, which is given in Eq. (13).
Here, the leader sea lion is declared
New position: Using the best search sea lion, the hunting phase will be done. Here, the prey is termed a bait ball. Based on the value
Here, the absolute value of the distance between the optimal solution and sea lion is annotated
Finally,
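For orientation, the sea lion updates can be sketched in the form in which SLnO is commonly stated in the literature: an encircling move toward the prey, the leader's sound speed, a circle update around the prey, and a coefficient that shrinks linearly from 2 to 0. This is a hedged illustration only. The paper's own Eqs. (11) to (16) and their symbols are not recoverable here and may differ, all function and parameter names are hypothetical, and the random quantities (`b_rand` in [0, 1], `m` in [-1, 1]) are passed in explicitly so the functions stay deterministic.

```python
import math


def shrink_coefficient(iteration, max_iterations):
    """Encircling coefficient C, assumed to decrease linearly from 2 to 0."""
    return 2.0 * (1.0 - iteration / max_iterations)


def encircle(prey, sea_lion, b_rand, c):
    """Move toward the prey: new = prey - |2 * b_rand * prey - sea_lion| * c."""
    return [p - abs(2.0 * b_rand * p - s) * c for p, s in zip(prey, sea_lion)]


def leader_speed(theta, phi):
    """Speed of the leader's sound, |v1 * (1 + v2) / v2| with v1 = sin(theta),
    v2 = sin(phi). Undefined when sin(phi) is zero."""
    v1, v2 = math.sin(theta), math.sin(phi)
    return abs(v1 * (1.0 + v2) / v2)


def circle_update(prey, sea_lion, m):
    """Circle around the bait ball: |prey - sea_lion| * cos(2*pi*m) + prey."""
    return [abs(p - s) * math.cos(2.0 * math.pi * m) + p
            for p, s in zip(prey, sea_lion)]
```

In the exploration phase, the same `encircle` update is typically applied with a randomly chosen sea lion in place of the prey.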
Step 4 – GOA: The GOA [30] algorithm is inspired by the natural behavior of grasshoppers, insects that move toward a target, here termed the prey. In the exploration phase, the search agents move toward the prey, and in the exploitation phase the agents update their positions. The behavior of the GOA algorithm is expressed in Eq. (17).
Here,
The difference
The above equation moves the random population through the search space, but after a certain number of iterations all grasshoppers reach the comfort zone, where they no longer move. To bring the grasshoppers out of this zone and solve the optimization problem, a variant of Eq. (18) is expressed in Eq. (20).
Here, the upper and lower bound of
Here, the maximum and minimum value is denoted as
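The grasshopper update can be sketched in the form given in the original GOA formulation: a social force function, a coefficient that decreases linearly from a maximum to a minimum value over the iterations, and a position update that adds the scaled social interaction to the target position. This is an illustration under assumptions, not the paper's code: the paper's Eqs. (17) to (21) are not recoverable here, the names are hypothetical, `f=0.5`, `l=1.5`, `c_max=1.0`, and `c_min=1e-5` are the commonly cited defaults, and the clamping to the bounds is an added safeguard.

```python
import math


def social_force(r, f=0.5, l=1.5):
    """s(r) = f * exp(-r / l) - exp(-r): attraction minus repulsion."""
    return f * math.exp(-r / l) - math.exp(-r)


def goa_coefficient(t, max_iter, c_max=1.0, c_min=1e-5):
    """Coefficient c shrinking linearly from c_max to c_min over max_iter."""
    return c_max - t * (c_max - c_min) / max_iter


def goa_step(positions, target, lower, upper, c):
    """One position update for every grasshopper toward the target."""
    dims = len(target)
    new_positions = []
    for i, x_i in enumerate(positions):
        social = [0.0] * dims
        for j, x_j in enumerate(positions):
            if i == j:
                continue
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))
            if dist == 0.0:
                continue  # coincident grasshoppers: direction undefined, skip
            for d in range(dims):
                gap = abs(x_j[d] - x_i[d])
                unit = (x_j[d] - x_i[d]) / dist
                social[d] += (c * (upper[d] - lower[d]) / 2.0
                              * social_force(gap) * unit)
        moved = [c * social[d] + target[d] for d in range(dims)]
        # Assumed safeguard: keep every coordinate inside its bounds.
        new_positions.append(
            [min(max(v, lower[d]), upper[d]) for d, v in enumerate(moved)]
        )
    return new_positions
```

The outer `c` shrinks the movement around the target as iterations proceed, while the inner `c` shrinks the comfort, attraction, and repulsion zones between grasshoppers.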
Thus, the flow chart representation of the enhanced hybrid algorithm is depicted in Fig. 3.
Flow chart of proposed HSLnGO algorithm.
The recommended clustering model employs Deep CNN [31] to provide the clustered output, with some of its hyperparameters optimized by the developed HSLnGO algorithm. A few of the advantages of Deep CNN are described here. Deep CNN is capable of predicting the time of failure and also increases the accuracy rate. The essential aim of this Deep CNN-based clustering is to create an accurate clustered outcome. It decreases the running time and is suitable for real-time applications. It offers easy initialization, quick clustering within a minute, and efficient modification. Because of these advantages, it is chosen for clustering in this research work. The Deep CNN model is constructed from distinct layers: “convolutional layers, pooling layers, fully connected layers, and softmax classifier layers”. The resultant optimal features are trained through every hidden layer, each of which depends on the output of the preceding layer in the model. In addition, the activation function is applied to the output of the convolutional operation to map the features.
Meanwhile, the convolution relies on the optimal features that are represented as
In the aforementioned equation, the sentence matrix along with column size
The above matrix is formed by bias
Here, weight and bias values are represented by
Diagrammatic illustration of Deep CNN model for hierarchical clustering using HSLnGO algorithm.
The major downside of Deep CNN is that its performance may slow down when it performs the max pooling operation, and it requires a longer training time because it has multiple hidden neurons. Also, a large number of epochs creates a computational burden, and the learning rate also relies on the number of epochs. To solve this problem, the novel algorithm is used to optimize parameters such as the hidden neurons, learning rate, and epochs. With the optimal parameters, the clustering mechanism ensures the efficacy of the model. The fitness function of Deep CNN-assisted clustering is determined using Eq. (25).
Here, the
Similarly, the term precn denotes “precision”, which defines the nearness of the values to one another. It is formulated using Eq. (27).
In Eqs (26) and (27),
Simulation setup
The offered unstructured text data clustering was developed using Python
Evaluation metrics
Various measures are used for validating the performance; those measures are described below.
Accuracy: It is shown in Eq. (26).
Precision: It is derived using Eq. (27).
FPR and FNR: The false positive rate is the error rate at which text data is incorrectly reported as present. The false negative rate, on the other hand, is the rate at which unstructured text is incorrectly reported as absent when the data is actually present.
Sensitivity and Specificity: Sensitivity is the probability of a positive result when the data is actually positive, and specificity is the probability of a negative result when the data is actually negative.
F1-Score: It is defined as the harmonic mean of precision and recall.
FDR: It is defined as the ratio of the false positives to the total number of true and false positives.
NPV: The negative predictive value is the ratio of the true negatives to the total number of true and false negatives.
MCC: It measures the correlation between the predicted and actual labels.
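The measures listed above all follow from the four confusion-matrix counts. The sketch below computes them with their standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # also called recall
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "f1": ratio(2.0 * precision * sensitivity, precision + sensitivity),
        "fdr": ratio(fp, fp + tp),
        "npv": ratio(tn, tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

For a multi-cluster result, these counts are usually computed per class in a one-vs-rest fashion and then averaged.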
The k-fold analysis of the enhanced clustering model on dataset 1 is shown in Fig. 5 for varying k-fold numbers.
Evaluation of K-Fold for the proposed unstructured text data clustering model with hybrid algorithms for dataset 1 with respect to “(a) Accuracy, (b) F1-score, (c) FDR, (d) FNR, (e) FPR, (f) MCC, (g) NPV, (h) Precision, (i) Sensitivity, (j) Specificity”.
Figure 5h depicts the precision analysis of the novel data clustering model. When the K-fold is 3, the values reported for precision are 3.9% for EFO-Deep CNN, 3.15% for FISA-EFO-Deep CNN, 1.4% for GOA-Deep CNN, and 3.2% for SLnO-Deep CNN, each lower than that of the proposed HSLnGO-Deep CNN model. Thus, the higher precision indicates better clustering performance.
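As a reminder of what a k-fold number means in this analysis, the sketch below splits sample indices into k contiguous folds, each serving once as the test set while the remaining folds form the training set. It is a generic illustration with a hypothetical function name; how the paper shuffles or stratifies its data is not stated here.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs for k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds
```

Each measure is then computed on every test fold and averaged, which gives one point per k-fold number in plots of this kind.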
Figure 6 illustrates the k-fold analysis of the novel text data clustering model on dataset 1 for varying k-fold numbers.
Evaluation of K-Fold for the proposed unstructured text data clustering method with former classifiers for dataset 1 regarding “(a) Accuracy, (b) F1-score, (c) FDR, (d) FNR, (e) FPR, (f) MCC, (g) NPV, (h) Precision, (i) Sensitivity, (j) Specificity”.
The F1-Score analysis of the proposed method is demonstrated in Fig. 6b. When the K-fold is 2, the offered method attains a higher F1-score than DNN, LSTM, ELM, and CNN, for which the lower values of 17%, 10%, 8%, and 4% are reported, respectively. Hence, the suggested method achieves better results, which demonstrates the effectiveness of the clustered outcome.
The performance evaluation of the unstructured text clustering method is represented in Fig. 7 in terms of various learning percentages.
Estimation on proposed unstructured text data clustering method with hybrid algorithms for dataset 1 concerning “(a) Accuracy, (b) F1-score, (c) FNR, (d) FPR, (e) Precision”.
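The learning-percentage analysis varies the share of the data used for training. A minimal sketch is given below, assuming that the learning percentage denotes the percentage of samples assigned to the training set, with the remainder held out for testing; this interpretation, the function name `split_by_learning_percentage`, and the fixed shuffling seed are assumptions made for illustration.

```python
import random


def split_by_learning_percentage(samples, learning_percentage, seed=0):
    """Split samples into (train, test) by a training percentage."""
    if not 0 < learning_percentage < 100:
        raise ValueError("learning_percentage must be between 0 and 100")
    shuffled = list(samples)
    # Seeded shuffle so the split is reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * learning_percentage / 100)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    documents = [f"doc_{i}" for i in range(20)]  # hypothetical corpus
    for percentage in (35, 50, 75):
        train, test = split_by_learning_percentage(documents, percentage)
        print(percentage, len(train), len(test))
```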
Figure 7c shows the FNR analysis of the proposed method on dataset 1. The FNR values are 8.2% for EFO-Deep CNN, 5.2% for FISA-EFO-Deep CNN, 10.2% for GOA-Deep CNN, and 9.2% for SLnO-Deep CNN, each higher than that of the improved HSLnGO-Deep CNN model at 50
Figure 8 represents the performance analysis of the novel method over existing deep learning models through dataset 1.
The overall analysis of offered unstructured text data clustering model with existing classifiers for dataset 1 concerning “(a) Accuracy, (b) F1-score, (c) FNR, (d) FPR, (e) Precision”.
The accuracy analysis of the offered method is given in Fig. 8a. At the 75
The k-fold analysis for dataset 2 over classical heuristic algorithms is illustrated in Fig. 9.
K-fold analysis of the offered unstructured text data clustering model with hybrid algorithms for dataset 2 concerning “(a) Accuracy, (b) F1-score, (c) FDR, (d) FNR, (e) FPR, (f) MCC, (g) NPV, (h) Precision, (i) Sensitivity, (j) Specificity”.
The MCC analysis for the proposed method is given in Fig. 9f. When the k-fold number is 2, the enhanced clustering model achieves a higher MCC value than the existing algorithms, for which lower values of 7.3% for EFO-Deep CNN, 6% for FISA-EFO-Deep CNN, 2.3% for GOA-Deep CNN, and 6.6% for SLnO-Deep CNN are reported. Hence, the novel text clustering method offers better performance for unstructured text data.
Figure 10 shows the k-fold analysis of the proposed system with traditional classifiers for dataset 2.
K-fold analysis of proposed unstructured text data clustering method with former classifiers for dataset 2 regarding “(a) Accuracy, (b) F1-score, (c) FDR, (d) FNR, (e) FPR, (f) MCC, (g) NPV, (h) Precision, (i) Sensitivity, (j) Specificity”.
The F1-score analysis of the proposed method is shown in Fig. 10b. The F1-score values of 21% for DNN, 13% for LSTM, 11% for ELM, and 9% for CNN are lower than that of HSLnGO-Deep CNN. Owing to its higher F1-score, the proposed method yields a better clustered output.
Evaluation of the proposed clustering method while implementing dataset 2 is elucidated in Fig. 11. The FNR analysis is given in Fig. 11c.
The overall analysis of the proposed unstructured text data clustering model with hybrid algorithms for dataset 2 in terms of “(a) Accuracy, (b) F1-score, (c) FNR, (d) FPR, (e) Precision”.
When the learning percentage is 50, the FNR value obtained for the existing algorithms, namely 8% of EFO-Deep CNN, 5.4% of FISA-EFO-Deep CNN, 9.2% of GOA-Deep CNN, and SLnO-Deep CNN, is higher than that of the proposed HSLnGO-Deep CNN. Thus, the novel clustering method achieves a lower error rate.
Figure 12 demonstrates the performance analysis for dataset 2 compared with conventional classifiers. The precision analysis is represented in Fig. 12e.
The overall analysis of the proposed unstructured text data clustering method with existing classifiers for dataset 2 concerning “(a) Accuracy, (b) F1-score, (c) FNR, (d) FPR, (e) Precision”
At the 35
Table 2 evaluates the efficiency of the new method in contrast with various heuristic algorithms using dataset 1.
Comparative evaluation of proposed unstructured text data clustering model for dataset 1 using heuristic algorithms
The sensitivity of the proposed model is 5.51% higher than that of EFO-Deep CNN, 3.4% higher than that of FISA-EFO-Deep CNN, 6.1% higher than that of GOA-Deep CNN, and 7% higher than that of SLnO-Deep CNN. These results indicate that the proposed model improves the clustering performance.
Evaluation of the offered method, when compared with diverse classifiers, is given in Table 3 by implementing dataset 1.
Comparative analysis of proposed unstructured text data clustering model for dataset 1 using different classifiers
The precision value of the proposed method is obtained as 22.6% superior to DNN, 13.6% superior to LSTM, 11.5% more than ELM, 8.89% higher than CNN, and 0.3% higher than FISA-EFO-DNN. Since it offers the desired outcome, the proposed model delivers better text data clustering.
Table 4 estimates the comparative analysis for the new model using different algorithms and dataset 2.
Comparative evaluation of proposed unstructured text data clustering model for dataset 2 using heuristic algorithms
The NPV of the proposed model is 5.3% higher than that of EFO-Deep CNN, 2.9% higher than that of FISA-EFO-Deep CNN, 6.1% higher than that of GOA-Deep CNN, and 7.3% higher than that of SLnO-Deep CNN. Thus, the proposed text clustering model improves the clustering efficiency.
The comparative evaluation of the proposed method against different classifiers is given in Table 5 for dataset 2.
Comparative analysis of proposed unstructured text data clustering model for dataset 2 using different classifiers
The sensitivity of the proposed method is 23.3% higher than that of DNN, 12.5% higher than that of LSTM, 11.5% higher than that of ELM, and 8.38% higher than that of CNN. Given these gains, the proposed model delivers better text data clustering.
In the experimental evaluation, the k-fold analysis shows the equivalent performance. Here, the GOA is the 2
Conclusion
This paper has presented a novel unstructured text data clustering model based on a hybrid heuristic algorithm. Initially, the unstructured text data was collected from standard data sources. The collected data then underwent pre-processing, which comprised tokenization, stemming, stop-word removal, and punctuation removal. The pre-processed text data was given as input to feature extraction stages 1 and 2. In stage 1, the relevant text features were obtained by the GloVe embedding model, whereas in stage 2 the deep features were acquired through the Text-CNN method. The two feature sets were then concatenated, and the optimal features were selected from the concatenated set using the novel HSLnGO algorithm, which was built by combining the classical SLnO and GOA algorithms. The selected optimal features were fed into the Deep CNN-based hierarchical clustering, where they were clustered into different groups depending on the text data. To further improve the clustering performance, the hyperparameters of the Deep CNN were tuned by the HSLnGO algorithm. Finally, several metrics were used to assess the improvement of the clustering model. For dataset 1, the precision of the proposed method was 22.6% superior to DNN, 13.6% superior to LSTM, 11.5% more than ELM, 8.89% higher than CNN, and 0.3% higher than FISA-EFO-DNN. Compared with classical approaches, the recommended clustering mechanism offered better outcomes and provided effectively clustered data.
Funding
This research did not receive any specific funding.
Informed consent
Not applicable.
Ethical approval
Not applicable.
Author contributions
All authors have made substantial contributions to the conception and design, revising the manuscript, and the final approval of the version to be published. Also, all authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Footnotes
Acknowledgments
I would like to express my very great appreciation to the co-authors of this manuscript for their valuable and constructive suggestions during the planning and development of this research work.
Conflict of interest
The authors declare no conflict of interest.
