Hybrid unstructured text features for meta-heuristic assisted deep CNN-based hierarchical clustering

Abstract

Text clustering is an essential process for organizing unstructured text data into an appropriate format. However, existing approaches offer limited support for extracting the information needed for document representation, and retrieving relevant text data has become crucial. Most data is in an unstructured text format, which makes it difficult to categorize. The main intention of this work is to implement a new clustering model for unstructured text data using classifier approaches. First, the unstructured data is taken from standard benchmark datasets covering both the English and Telugu languages. The collected text data is then passed to the pre-processing stage. In feature extraction stage 1, the pre-processed data is fed to the GloVe embedding technique to extract text features. Similarly, in feature extraction stage 2, the pre-processed data is used to extract deep text features with a Text Convolutional Neural Network (Text CNN). The text features from stage 1 and the deep features from stage 2 are then concatenated and used for optimal feature selection with the Hybrid Sea Lion Grasshopper Optimization (HSLnGO) algorithm, in which the traditional Sea Lion Optimization (SLnO) is superimposed with the Grasshopper Optimization Algorithm (GOA). Finally, text clustering is performed with Deep CNN-assisted hierarchical clustering, whose parameters are optimized by HSLnGO to improve the clustering performance. The simulation findings illustrate that, across different quantitative measures, the framework yields impressive text classification performance on unstructured text data in contrast with other techniques.

Keywords

Unstructured data, text clustering, feature extraction, optimal feature selection, deep CNN-based hierarchical clustering, hybrid sea lion grasshopper optimization

1. Introduction

Humans communicate through speech and text, which produces a large amount of unstructured data; machine learning methods are therefore required to transform this data into significant subsets of features [1]. Text mining is now widely used for applications such as summarization, classification, and clustering, supported by semantically similar words [2]. Traditionally, text clustering is the automated procedure of categorizing text into different cluster groups. Within each cluster group the similarity among the texts is high, whereas the similarity between cluster groups is low [3]. Each cluster group is generated around cluster centroid data points, which are present in every cluster. Consequently, the appropriate texts are chosen to form a group from which the relevant information and content [4] can be obtained. Clustering of unstructured text data is performed in three different ways: hierarchical-based, partition-based, and grid-based clustering. Moreover, the text data is represented in the form of numerical or categorical data. Contrary to Natural Language Processing (NLP), the clustering mechanism requires less manual involvement and offers more effective performance [5]. Thus, text clustering is a required process in the document representation, information retrieval, organization, and language processing domains. Because the data is unstructured, it is cumbersome to process in an automated way. Several corpus processing methods are deployed for extracting the required text data [6].

Additionally, text mining is a subclass of data mining, yet it faces some notable challenges in enhancing the clustering [7] of unstructured text data. Text mining is deployed for retrieving information concealed in unstructured data. Some of the factors that limit clustering performance are “unspecified structural form of data, customizing analyses for spoken language, high probability of data scarcity” [8]. These drawbacks degrade the performance and reduce the reputation of the computer-aided approach, so more knowledge regarding the clustering process is needed [9]. Inadequate data is another challenging issue when grouping the data. Missing values and restricted raw text tend to reduce the clustering efficacy and make the performance analysis fragile [10]. Other major downsides are high dimensionality, low transparency, and sparse data distribution. Some well-developed techniques like the “K-Means algorithm, ordered clustering, Self Organizing Maps (SOMs), Expectation Maximization (EM)” become futile [11] when handling large datasets, since they all have limited scalability. The former automated clustering methods still leave room for improvement, since they suffer from some noteworthy limitations [12].

The aforementioned challenges pertain to the robustness and completeness of documents with relevant text data, even when a document has only a small number of words or lines [13]. Hence, another challenge is to capture the text information from short spans of text. Machine learning methods are commonly used to resolve unstructured data problems and enhance performance. Such techniques include “Artificial Neural Network (ANN), Random Forest (RF), and Support Vector Machine (SVM)” and so on, which remain a subject of future research [14]. More recently, deep learning has become the trendsetting technique for text clustering. A deep learning model can readily retrieve the hidden information in unstructured text and exploit it for higher clustering efficiency [15]. Some commonly used deep architectures are “Long-Short Term Memory (LSTM), Convolutional Neural Network (CNN), Deep Neural Network (DNN)” and so on. In deep learning, feature extraction is interlinked with learning the features of the text data. Some feature extraction methods represent semantically similar words in statistical and linguistic terms [16]. The relevant features are mostly obtained by word embedding, character embedding models, etc. However, this process is easily constrained by semantic restrictions and language dependency. Moreover, some optimization algorithms based on the natural behavior of intelligence are also used for solving the clustering problem. A recent innovation in text clustering is to extend the standard optimization algorithms with multi-objective functions. The reliability and scalability of the clustering process improve when deep learning-based techniques are implemented. Some hybrid models also help to determine optimal results on high-dimensional data [17]. Hence, several researchers have devised novel methods for handling unstructured data using text clustering. This paper focuses on the hierarchical clustering process with a deep learning model to deliver better-clustered text data.

The core objectives of the proposed model are as follows.

• To design and develop a novel hierarchical clustering model for unstructured text data using deep learning and a meta-heuristic approach.
• To extract the relevant text features and deep features with the GloVe embedding and Text CNN models, which are then fused. Further, the optimal features are acquired through the newly designed HSLnGO algorithm.
• To develop the hybrid algorithm for providing optimal solutions. The novel algorithm is developed by combining the SLnO and GOA algorithms, and it is used to further increase the classification rate.
• To cluster the unstructured text features with a Deep CNN, where factors such as the learning rate, the number of epochs, and the number of hidden neurons are tuned optimally by the HSLnGO algorithm.
• To evaluate the performance of the clustering process through a comparative analysis using different validation measures, thereby confirming the effectiveness of the proposed clustering model.

The remainder of the paper is organized as follows. The review, research gaps, and challenges regarding the existing clustering methods are described in Section 2. Section 3 elaborates on the novel deep learning-aided hierarchical clustering model for unstructured data. Section 4 presents the feature extraction and the optimal feature selection used in the subsequent section. The deep learning-aided hierarchical clustering technique is shown in Section 5. The results and discussion of the novel method are analyzed in Section 6. Finally, Section 7 concludes the paper.

2. Existing works

2.1 Related works

In 2022, Hosseini and Varzaneh [18] presented a hybrid deep text clustering model with the assistance of “stacked autoencoder and k-means clustering”. Initially, the text data was collected from the Barez data. The model comprised three major steps: i) text representation with a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model, termed ParsBERT, ii) extraction of relevant features by the autoencoder to reduce the dimension of the features, and iii) clustering by the K-Means process. Finally, the performance was analyzed and the effectiveness was validated with the Silhouette Score. The deep learning model achieved better clustering results than the existing clustering process.

In 2020, Saeed et al. [19] described a methodology of corpus creation that integrated the metadata and the retrieved text data. The corpus was composed of many unstructured documents, where the different weight nodes were acquired via multistage clustering. The proposed model contained various phases. Here, the News dataset was applied for clustering, where the output was associated with sun corpora. This led to enhanced text detection and a reduced cost function using clusters. The new model extracted significant data related to the expected content. In contrast to other techniques, the suggested system increased the detection level.

In 2021, Kumar et al. [20] elaborated on two-phase cluster validation (TPCV) for unstructured text data. It was used to validate the similarity measures belonging to different cluster groups. Initially, the TPCV model identified the cluster centroid, thus determining the probability of the closeness value of each cluster group. Further, it was employed to validate the separation probability between the cluster groups in terms of distance. The experimental results demonstrated that the implemented model attained better results than existing methods.

In 2016, Sundermann et al. [21] proposed an ensemble-based learning model for text mining of unstructured data. Once the text mining technique was implemented, every metadata set was fed into the suggested model. Then, the resultant data was fused to provide a final ranking value for the user. Specifically, two models, K-Nearest Neighbors and BPR-Mapping, were combined into an ensemble model. The simulation results illustrated that the recommended system was feasible when using multiple data of various types, and its efficacy was shown via different measures.

In 2019, Lee et al. [22] presented a novel document representation model using neural embedding and probabilistic word clustering. Here, clustering was done to determine semantically similar text. Depending on the membership, the neural embedding method was utilized for identifying similar words. Further, the clustering was accomplished with a probabilistic approach via the "fuzzy C-means clustering method or Gaussian mixture model". The findings showed higher accuracy results when implemented on 12 different classes.

In 2021, Thirumoorthy and Muneeswaran [23] developed a novel system for Document Clustering (DC) using the Hybrid Jaya Optimization (HJO) algorithm, termed HJO-DC. The performance was assessed by the Silhouette index to determine the data quality. The proposed method provided the expected outcome when contrasted with classical techniques such as K-Means and K-Medoids and heuristic approaches such as GA, Cuckoo Search, PSO, Firefly, and Grey Wolf Optimizer.

In 2020, Fidan and Yuksel [24] organized unstructured text data into an appropriate format using Grey system theory, which was mainly used to handle small datasets. The raw comments or reviews were collected from Amazon, where each review was labeled as positive or negative. The Grey system method was applied together with hierarchical and partitional clustering mechanisms. The simulation analysis showed that the approach obtained more efficiently clustered data than traditional methods.

In 2019, Janani and Vijayarani [25] investigated a new approach, the Spectral Clustering algorithm with PSO (SCPSO), to render effective text clustering. The suggested model started from a random solution and was guided by local and global optima results. The main intention of the model was to cluster the text data with PSO in order to resolve the issue of using a large number of datasets. On standard text-related datasets, the proposed model improved the clustering performance over “Spherical K-means, Expectation Maximization Method (EM) and existing PSO”. Thus, the proposed model offered better clustering results.

2.2 Research gaps and challenges

Former clustering approaches such as keyword-based algorithms are too fragile to provide appropriate clustered output for unstructured data. They also face security overhead and data loss, since they are easily limited by time and data overhead. The merits and demerits of existing unstructured text clustering methods are given in Table 1. Auto Encoder and K-Means [18] attains a higher Silhouette Score. However, it cannot handle a large number of datasets and has inadequate knowledge to train the data. Multi-Stage Clustering [19] helps to assess the quality and effectiveness of unstructured text query output at a reduced cost. However, its text assessments via computer-aided text segments need to be improved. TPCV [20] achieves higher clustering accuracy, but it is only applicable to sequential series of text data. The BPR-Mapping algorithm [21] improves the accuracy level while maintaining the extensibility and reliability of the model, but, owing to confined data, it reduces the computational power of the system. The embedding-probabilistic model [22] obtains a higher accuracy value and reduces the computational complexity. However, it reduces the semantic similarity measures among the text. HJO [23] offers higher-quality clustering performance for documenting the text, but it is easily trapped in local optima. Grey system theory [24] increases the efficacy even on small datasets, but it is limited by certain rough set theory rules. SCPSO [25] handles an enormous number of text documents, but it is critical to process on a multi-core CPU. Hence, a novel clustering method for unstructured data using NLP techniques is proposed and implemented.

Table 1
Strengths and weaknesses of existing unstructured text clustering methods

Author [citation]	Methodology	Strengths	Weaknesses
Hosseini and Varzaneh [18]	Auto encoder and K-means	• Attains the higher value of Silhouette Score.	• It cannot handle a large number of datasets. • Inadequate knowledge to train the data.
Saeed et al. [19]	Multi-stage clustering	• It helps to assess the quality of unstructured text query output at a reduced cost.	• Its text assessments via computer-aided text segments need to be improved.
Kumar et al. [20]	TPCV	• Achieves higher clustering accuracy.	• It is only applicable for sequential series of text data.
Sundermann et al. [21]	BPR-mapping algorithm	• It improves the accuracy level while maintaining the extensibility and reliability of the model.	• Due to confined data, it mitigates the computational power of the system.
Lee et al. [22]	Embedding-probabilistic model	• It scores a higher accuracy rate. • It reduces computational complexity.	• It reduces the semantic similarity measures among the text.
Thirumoorthy and Muneeswaran [23]	HJO	• It offers a more quality clustering performance for documenting the text.	• It easily gets trapped in the local optimum issue.
Fidan and Yuksel [24]	Grey system theory	• It increases the efficacy while using small datasets.	• It limits by certain rough set theory and its rules.
Janani and Vijayarani [25]	SCPSO	• It maintains an enormous amount of text documents.	• It is critical to process the multi-core CPU.

3. A deep learning-assisted hierarchical clustering for unstructured data clustering

3.1 Data description of English and Telugu dataset

The proposed model uses English and Telugu text for unstructured text clustering. The datasets are explained as follows.

English dataset: The input text is gathered from “https://www.kaggle.com/kritanjalijain/amazon-reviews?select=train.csv: Access date: 2022-05-27”. The dataset is named the Amazon Review Polarity Dataset and consists of 34,686,770 comments from 6,643,669 users on 2,441,053 products, from the “Stanford Network Analysis Project” (SNAP). For each polarity class, this subset has 1,800,000 training and 200,000 testing samples. Comments include plain text, ratings, and user and product feedback. Review scores of 1 and 2 are treated as negative, whereas scores of 4 and 5 are treated as positive; a score of 3 is ignored. The polarity dataset is constructed from these score values. In the dataset, class 1 and class 2 represent negative and positive, respectively. The body and heading of the user’s review are termed text and title, respectively.

Telugu dataset: The raw unstructured text is obtained from “https://github.com/subbareddy248/Datasets/tree/master: Access date: 2022-05-27”. This dataset is composed of Telugu text data used to perform the unstructured data clustering.

Here, a sample of the collected unstructured text data is denoted as $T_{d}$, where $d=1,2,\ldots,D$ and $D$ signifies the total number of unstructured texts used in the following sections.

3.2 Pre-processing of raw text

The purpose of pre-processing is to transform the raw input data into an understandable format, represented as cleaned data. Data pre-processing is mainly used to check for missing data, noisy or unwanted data, and inconsistent data. Such irrelevant data is removed to further improve the performance results. The proposed methodology employs the pre-processing techniques described below.

Tokenization: In general, tokenization is employed for segmenting text into meaningful pieces, termed tokens. The resultant tokens take the form of phrases, symbols, words, and so on. For example, given the text “How are you”, tokenization splits the sentence into the tokens $\langle$ How $\rangle$ $\langle$ are $\rangle$ $\langle$ you $\rangle$. Thus, the input data $T_{d}$ is given for tokenization, and the output is denoted as $T_{d}^{\textit{tk}}$.

Stemming: The stemming process takes the tokenized text $T_{d}^{\textit{tk}}$ as input. Stemming is commonly used to remove the suffix from a word, returning the root word, which carries the same meaning. For example, “Crying” has the suffix “ing”; by removing it, stemming yields the root word “Cry”. The result is represented as $T_{d}^{\textit{st}}$.

Stop words removal: The stemmed text $T_{d}^{\textit{st}}$ is fed as input for stop word removal. The main purpose of this technique is to discard unwanted or commonly used words that are present throughout the documents. For instance, articles and pronouns are treated as stop words and are removed from the text data in this process. Finally, the outcome of this technique is denoted as $T_{d}^{\textit{sw}}$.

Punctuation removal: Here, the input is the stop word removed text $T_{d}^{\textit{sw}}$. This step is solely employed to delete the punctuation marks that are present in the text or sentences. Punctuation marks such as the semi-colon (‘;’), exclamation mark (‘!’), comma (‘,’), and so on are removed in this step, which provides the pre-processed text $T_{d}^{\textit{pre}}$.
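The four pre-processing steps can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the regular-expression tokenizer, the tiny stop word list, and the naive suffix-stripping stemmer are hypothetical stand-ins chosen for brevity and are not the tools used in this work. The sketch targets English text only; Telugu would need a language-specific tokenizer and stemmer.

```python
import re
import string

# Hypothetical, deliberately tiny resources used only for illustration.
STOP_WORDS = {"a", "an", "the", "he", "she", "it", "they", "you", "i", "we", "is", "are"}
SUFFIXES = ("ing", "ed", "s")
PUNCTUATION = set(string.punctuation)


def tokenize(text):
    """Split text into word tokens and single punctuation tokens (T^tk)."""
    return re.findall(r"\w+|[^\w\s]", text)


def stem(token):
    """Naive stemmer: strip one known suffix if at least 3 characters remain."""
    lowered = token.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token


def preprocess(text):
    """Apply the steps in the order described: tokenize, stem, drop stop words, drop punctuation."""
    tokens = tokenize(text)                                      # T^tk
    stems = [stem(t) for t in tokens]                            # T^st
    kept = [t for t in stems if t.lower() not in STOP_WORDS]     # T^sw
    return [t for t in kept if t not in PUNCTUATION]             # T^pre


if __name__ == "__main__":
    print(tokenize("How are you"))
    print(preprocess("She is crying!"))
```

Keeping each step as a separate function mirrors the staged notation $T_{d}^{\textit{tk}}$, $T_{d}^{\textit{st}}$, $T_{d}^{\textit{sw}}$, and $T_{d}^{\textit{pre}}$, so any single stage can be swapped for a library implementation without touching the others.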

3.3 Developed text clustering model

Text clustering plays a prominent role and is defined as the process of grouping data into different clusters, where the cluster groups are formed according to similarity measures among the data elements. The clustering mechanism is also performed depending on the cluster attributes. Today, unstructured text documents are part of our daily needs. Because the text is unstructured, it is cumbersome to process in an automated way. Several corpus processing methods are deployed for extracting the required text data. Since the corpus is huge, sustaining efficiency during clustering is challenging. Another issue is appropriately dividing the documents into their respective groups of related text documents. Without proper corpus grouping, the text retrieval and clustering performance degrade. In addition, inefficient grouping increases cost and requires more effort in the retrieval approach. Challenges such as poor reliability, high time consumption, and computational burden have persisted over the past few years. Other limitations in unstructured text clustering are the lack of storage independence, limited data mobility, and the inability to manage the data. To address these issues, a novel unstructured text clustering mechanism that uses a hybrid heuristic algorithm is developed. The architecture of the proposed text clustering model is depicted in Fig. 1.

Figure 1.

Architectural representation of proposed clustering model for unstructured text data.

The proposed method involves four stages: a) pre-processing, b) feature extraction, c) optimal feature selection, and d) Deep CNN-aided hierarchical clustering. Initially, unstructured text data in both English and Telugu is gathered from benchmark datasets. Next, pre-processing is carried out through tokenization, stemming, stop word removal, and punctuation removal, and the resulting pre-processed text helps to increase the accuracy level. Subsequently, the pre-processed text is fed into the feature extraction phase, where the significant text features and deep features are obtained using GloVe embedding and the Text CNN model. These features are then combined, and the optimal features are selected by the HSLnGO algorithm. The hybrid algorithm is built by incorporating the GOA and SLnO algorithms, and its main purpose is to yield an accurate solution while reducing time and structural complexity. In the end, the Deep CNN model is adopted for hierarchical clustering with the help of the optimal features. To reduce the training time and burden, parameters such as the learning rate, the number of epochs, and the number of hidden neurons are optimized by the HSLnGO algorithm. Finally, the expected clustered output is attained through the suggested framework.

4. Feature extraction and feature concatenation for unstructured data clustering

4.1 Feature extraction: Phase 1

Once the text is pre-processed into $T_{d}^{\textit{pre}}$, it is used to obtain the significant features. The first feature set is extracted with the aid of the GloVe embedding model. GloVe stands for the Global Vector representation of the text. Using this embedding model, the relevant text features are obtained. It combines the benefits of two kinds of methods, count-based and prediction-based models. Unlike word2vec, which relies only on local information, the GloVe model utilizes global text data regarding semantic connections among the words or sentences. The key idea of GloVe embedding is to concentrate on word co-occurrences over the entire corpus, leveraging global statistics to provide a significant word vector.

GloVe embedding [26] is a log-bilinear model, termed a count-based approach. It builds a global matrix whose entries record the presence or absence of words throughout the document. The idea behind the count-based model of GloVe is to estimate how frequently the words appear in the document. It therefore determines the probability ratio, which tends to improve the performance on word analogy. This increases the accuracy and helps to avoid the loss of data.

GloVe relates the dot product of the word representations to the logarithm of the word co-occurrence value. It is formulated in Eq. (1).

$\displaystyle P_{x}^{E}\vec{P}_{y}+Q_{x}+\vec{Q}_{y}=\log(S_{xy})$ (1)

Here, $P$ and $\vec{P}$ signify the word vector and the context word vector, respectively. The scalar bias values of the $x^{\text{th}}$ and $y^{\text{th}}$ words are given by $Q$ and $\vec{Q}$. The term $S_{xy}$ denotes the entry of the co-occurrence matrix, defined as the number of times the $x^{\text{th}}$ word appears in the context of the $y^{\text{th}}$ word. The major issue in this model is the weighting of the word co-occurrences. To address this problem, the weight function is evaluated using Eq. (2).

$\displaystyle f(S_{xy})=\begin{cases}\left(S_{xy}/s_{\text{max}}\right)^{\beta}&\text{if }S_{xy}<s_{\text{max}}\\ 1&\text{otherwise}\end{cases}$ (2)

Here, $s_{\text{max}}=100$ and $\beta=3/4$ are fixed in this model. Finally, the cost function is assessed using Eq. (3).

$\displaystyle\textit{CF}=\sum_{x,y=1}^{V}f(S_{xy})\left(P_{x}^{E}\vec{P}_{y}+Q_{x}+\vec{Q}_{y}-\log(S_{xy})\right)^{2}$ (3)

The term $V$ indicates the vocabulary dimension. Thus, the resultant text features are obtained and denoted as $\textit{fs}_{f}^{\textit{GV}}$ .
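The weight function of Eq. (2) and the cost function of Eq. (3) can be sketched in plain Python as follows. This is a minimal illustration, not the training procedure used in this work: the function names and the list-of-lists data layout are assumptions, the product of the word vector and the context vector is read as a dot product, and zero co-occurrence entries are skipped because their logarithm is undefined (their weight would be zero in any case).

```python
import math

S_MAX = 100.0   # s_max from Eq. (2)
BETA = 0.75     # beta from Eq. (2)


def weight(s):
    """Weight function f(S_xy) of Eq. (2)."""
    return (s / S_MAX) ** BETA if s < S_MAX else 1.0


def glove_cost(S, P, P_ctx, Q, Q_ctx):
    """Cost function CF of Eq. (3).

    S      : V x V co-occurrence counts (list of lists)
    P      : word vectors, one list of floats per word
    P_ctx  : context word vectors, same shape as P
    Q      : scalar bias per word
    Q_ctx  : scalar bias per context word
    """
    total = 0.0
    for x, row in enumerate(S):
        for y, s in enumerate(row):
            if s <= 0:
                continue  # log(0) is undefined; such pairs carry zero weight
            dot = sum(p * c for p, c in zip(P[x], P_ctx[y]))
            error = dot + Q[x] + Q_ctx[y] - math.log(s)
            total += weight(s) * error ** 2
    return total


if __name__ == "__main__":
    # Toy vocabulary of two words with 2-dimensional vectors.
    S = [[0.0, 4.0], [4.0, 0.0]]
    P = [[0.1, 0.2], [0.3, 0.4]]
    P_ctx = [[0.2, 0.1], [0.4, 0.3]]
    Q = [0.0, 0.0]
    Q_ctx = [0.0, 0.0]
    print(glove_cost(S, P, P_ctx, Q, Q_ctx))
```

In practice the vectors and biases would be updated by gradient descent to minimize this cost; the sketch only evaluates it for fixed values.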

4.2 Feature extraction: Phase 2

In this phase, the pre-processed text $T_{d}^{\textit{pre}}$ is given as input to the Text-CNN [27] model, where the deep features of the text data are extracted. Text-CNN is a variant of the CNN model that improves the clustered outcomes. It deploys a one-dimensional convolution operation to retrieve the deep features. It helps to reduce overfitting, and it can reduce the number of trainable parameters while increasing the network’s accuracy. In contrast to a conventional CNN, the filters in Text-CNN have a constant width but heterogeneous heights. Let $v_{j}\in U^{s}$ be the $s$-dimensional word vector of the $j^{\text{th}}$ word, and let $l$ be the length of the sentence. The sentence is then represented as in Eq. (4).

$\displaystyle v_{1:l}=v_{1}\oplus v_{2}\oplus\ldots\oplus v_{l}$ (4)

In the aforementioned equation, the concatenation operator is indicated as $\oplus$. During the convolution operation, a filter is applied to every window of $m$ words, thereby creating a new feature. The new feature is determined using Eq. (5).

$\displaystyle n_{j}=F(W\cdot v_{j:j+m-1}+B)$ (5)

In Eq. (5), the bias and the non-linear function are represented by $B$ and $F$, respectively, $W$ denotes the filter weights, and the window of words is defined by $v_{j:j+m-1}$. By applying the filter to every possible window, a feature map is acquired, expressed as $n=[n_{1},n_{2},\ldots,n_{l-m+1}]$. Furthermore, a max pooling operation yields the maximum value $n_{\text{max}}=\max(n)$. Finally, the features are fused and passed to the last fully connected layer with the softmax function to obtain the deep features of the text data, denoted as $\textit{fs}_{f}^{\textit{TCNN}}$.
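Eq. (5) and the max pooling step can be sketched for a single filter in plain Python. This is a minimal illustration under assumptions: the function names are hypothetical, the filter $W$ is stored as a flat list of $m\cdot s$ weights, and `math.tanh` is used as a default non-linearity only because the text does not state which function $F$ is.

```python
import math


def conv_feature_map(v, W, B, m, F=math.tanh):
    """Slide one filter over every window of m consecutive word vectors.

    v : list of l word vectors, each a list of s floats
    W : filter weights as a flat list of m * s floats
    B : scalar bias
    F : non-linear function (tanh here is an assumption)
    Returns the feature map [n_1, ..., n_{l-m+1}].
    """
    feature_map = []
    for j in range(len(v) - m + 1):
        # Concatenate the window v_j (+) ... (+) v_{j+m-1} into one flat vector.
        window = [value for vec in v[j:j + m] for value in vec]
        feature_map.append(F(sum(w * x for w, x in zip(W, window)) + B))
    return feature_map


def max_over_time(feature_map):
    """Max pooling: keep the largest value of the feature map."""
    return max(feature_map)


if __name__ == "__main__":
    sentence = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4], [0.7, 0.1]]  # l = 4, s = 2
    filt = [0.5, -0.5, 0.25, 0.25]                                # m = 2
    fmap = conv_feature_map(sentence, filt, B=0.0, m=2)
    print(len(fmap), max_over_time(fmap))
```

A sentence of length $l$ and a window of $m$ words give $l-m+1$ feature values, which is why the loop runs over `len(v) - m + 1` positions. A full Text-CNN would apply many filters of different window sizes and concatenate their pooled outputs.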

4.3 Feature concatenation and optimal feature selection

Once the two feature sets are obtained through the above two models, they undergo feature concatenation. The fused features are represented as $\textit{FE}_{f}=\{\textit{fs}_{f}^{\textit{GV}},\textit{fs}_{f}^{\textit{TCNN}}\}$. Since two feature sets are combined, the result is a large set of extracted features. In such a long feature vector, some features contribute little to performance or are unwanted for the subsequent clustering process. To address this problem, the most important features are selected optimally with the help of the novel HSLnGO algorithm. Figure 2 illustrates the feature fusion and optimal feature selection.

Figure 2.

Feature concatenation and selection of optimal features through the HSLnGO algorithm.

The major intentions of selecting optimal features are to reduce overfitting, reduce redundant text data, and increase the performance level. In general, optimal feature selection reduces the number of false features while retaining the same level of true positive values. The optimal features carry significant information for clustering the unstructured text data. Finally, the resultant optimal features are denoted as $\textit{FE}_{s}^{\textit{opt}}$ and are used in the subsequent Deep CNN section, where the clustered output is acquired.
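Feature concatenation followed by selection can be sketched as below. The binary mask encoding, in which each candidate solution marks a feature as kept (1) or dropped (0), is an assumption made for illustration; the text does not state how HSLnGO encodes the selected features, and the function names are hypothetical.

```python
def concatenate_features(fs_gv, fs_tcnn):
    """Fuse the GloVe features and the Text CNN features into one vector FE_f."""
    return list(fs_gv) + list(fs_tcnn)


def select_features(features, mask):
    """Keep only the features whose mask entry is truthy (assumed binary encoding)."""
    if len(mask) != len(features):
        raise ValueError("mask and feature vector must have the same length")
    return [f for f, keep in zip(features, mask) if keep]


if __name__ == "__main__":
    fused = concatenate_features([0.12, 0.80, 0.33], [0.05, 0.91])
    # A hypothetical mask such as an optimizer might propose.
    print(select_features(fused, [1, 0, 1, 0, 1]))
```

Under this encoding, the optimizer would search over masks and score each one by the clustering quality obtained with the retained features.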

5. Deep learning-aided hierarchical clustering for unstructured data

5.1 Hierarchical clustering and target generation

For unstructured text data, hierarchical clustering [28] is most commonly used to build a hierarchy of clustered data groups. The clustering is represented as a binary tree over the text data. Every cluster group is generated from data items: each leaf of the tree represents a single data item, whereas the root integrates all the data. Between the leaves and the root, intermediate clusters are formed. This clustering resembles a “cluster of clusters”, in which groups are placed one after another, like a queue, to build the tree. The clustering is split into two types: “(i) Agglomerative clustering is processed in bottom-up nature and (ii) Divisive clustering is executed in the way of top-down approach”.

Agglomerative hierarchical clustering is implemented in many different ways. A dendrogram is a tree diagram that visualizes the linkage of similar sets of text data from which the hierarchical clustering is constructed. In dendrogram graphs, the x-axis and y-axis represent the cluster distance and the data points, respectively. The “legs” of the dendrogram indicate how the cluster groups are generated. Thus, the groups are initially presented as singletons, each composed of text data placed along the axis.

The data is grouped into clusters depending on distance. The “single-linkage” strategy uses the shortest distance over all the data pairs to merge the groups, as derived in Eq. (6).

$\displaystyle\textit{SLge}^{\textit{dis}}\left(\{X_{a}\}_{a=1}^{A},\{Y_{b}\}_{b=1}^{B}\right)=\min\limits_{a,b}\left\|X_{a}-Y_{b}\right\|$ (6)

Here, the distance between the text data items is assessed using $\left\|{X_{a}-Y_{b}}\right\|$. On the other hand, the “complete-linkage” strategy uses the longest distance over all the data pairs, as given in Eq. (7).

$\displaystyle\textit{CLge}^{\textit{dis}}\left(\{X_{a}\}_{a=1}^{A},\{Y_{b}\}_{b=1}^{B}\right)=\max\limits_{a,b}\left\|X_{a}-Y_{b}\right\|$ (7)

Between these best and worst distances, the average-linkage strategy calculates the average distance over all the data pairs between the groups, as expressed in Eq. (8).

$\displaystyle\textit{ALge}^{\textit{dis}}\left(\{X_{a}\}_{a=1}^{A},\{Y_{b}\}_{% b=1}^{B}\right)=\frac{1}{AB}\sum_{a=1}^{A}\sum_{b=1}^{B}\left\|X_{a}-Y_{b}\right\|$ (8)

Similarly, the centroid strategy determines the distance between the cluster centroids of the text data, as formulated in Eq. (9).

$\displaystyle\textit{Ctrd}^{\textit{dis}}\left(\{X_{a}\}_{a=1}^{A},\{Y_{b}\}_{% b=1}^{B}\right)=\left\|\left(\frac{1}{A}\sum_{a=1}^{A}X_{a}\right)-\left(\frac% {1}{B}\sum_{b=1}^{B}Y_{b}\right)\right\|$ (9)
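The four linkage distances of Eqs (6)–(9) can be written compactly as follows. This is a minimal sketch under the assumption that the norm is Euclidean; the function names and the two toy groups are hypothetical.

```python
import math

def dist(x, y):
    """Euclidean norm ||x - y||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(X, Y):
    """Eq. (6): shortest distance over all cross-group pairs."""
    return min(dist(x, y) for x in X for y in Y)

def complete_linkage(X, Y):
    """Eq. (7): longest distance over all cross-group pairs."""
    return max(dist(x, y) for x in X for y in Y)

def average_linkage(X, Y):
    """Eq. (8): mean distance over all A*B cross-group pairs."""
    return sum(dist(x, y) for x in X for y in Y) / (len(X) * len(Y))

def centroid(P):
    """Component-wise mean of a group of vectors."""
    return [sum(c) / len(P) for c in zip(*P)]

def centroid_linkage(X, Y):
    """Eq. (9): distance between the two group centroids."""
    return dist(centroid(X), centroid(Y))

# Two hypothetical groups of 2-D feature vectors.
X = [(0.0, 0.0), (0.0, 2.0)]
Y = [(3.0, 0.0), (3.0, 2.0)]
```

For any two groups, the single-linkage value is never larger than the average-linkage value, which in turn is never larger than the complete-linkage value.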

During the training phase, hierarchical clustering is performed to reach the target clustered output. The optimal features are given as input to hierarchical clustering, where the clustered data is obtained using the strategies described above. Here, the clustering mechanism is accomplished with the help of Deep CNN, where the training and testing phases are executed to acquire the required clustered text data.

5.2 Proposed HSLnGO

The proposed hybrid algorithm is mainly used to determine the optimal parameters of the Deep CNN-assisted hierarchical clustering mechanism. The novel algorithm is built by superimposing the conventional SLnO and GOA algorithms. The significant benefits of GOA are that it can handle problems without prior knowledge of the search space and that it improves the initial random population and the global optimum. The advantages of SLnO are its improved search for solutions and its ability to escape local optima. However, both algorithms face noteworthy challenges: the population may diverge when the exact fitness value is not provided, the linear convergence parameter tends to produce unbalanced results, and the convergence rate is inconsistent. To overcome these issues, the proposed HSLnGO algorithm is introduced. The steps of the novel algorithm are as follows.

Step 1 – Parameter setting: Consider the total population as $H$ and the maximum number of iterations as $N_{T}$ . These parameters are used in the subsequent steps of fitness calculation, exploration, and exploitation.

Step 2 – Fitness calculation: Since the SLnO algorithm depends on whiskers to find the prey, its ability to render the optimal solution degrades. Also, to encircle the prey, the sea lions need additional population members to form groups. This increases the optimization time and limits the convergence rate. To address these issues, the novel algorithm designs its objective function with the help of the worst fitness Wfit and the best fitness Bfit values. Based on the resulting condition, the exploration and exploitation phases are carried out. The fitness function is derived using Eq. (10).

$\displaystyle a=\textit{fit},\qquad b=\frac{\textit{Wfit}+\textit{Bfit}}{2}$ (10)

In the above equation, $a$ is the fitness value of the current solution and $b$ is the mean of the worst and best fitness values. If the condition ( $a<b$ ) is satisfied, the position is updated by the SLnO algorithm; otherwise, it is updated by the GOA algorithm.
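The switching rule of Eq. (10) can be sketched as follows. The function name and the fitness values are hypothetical, and the sketch assumes a minimisation problem, so that Bfit is the smallest and Wfit the largest fitness in the population.

```python
def choose_update(fit, worst_fit, best_fit):
    """Return the update rule that Eq. (10) selects for one agent."""
    a = fit
    b = (worst_fit + best_fit) / 2.0
    return "SLnO" if a < b else "GOA"

# Hypothetical fitness values of four agents (smaller is assumed to be better).
fitness = [0.9, 0.4, 0.1, 0.7]
b_fit, w_fit = min(fitness), max(fitness)
rules = [choose_update(f, w_fit, b_fit) for f in fitness]
```

With these values the threshold is 0.5, so the two agents with fitness below it are updated by SLnO and the other two by GOA.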

Step 3 – SLnO: The conventional SLnO [29] algorithm is inspired by the natural behavior of sea lions. Sea lions employ whiskers to determine the position of prey. The whiskers can identify the size, shape, and location of prey. Once the prey is identified, the sea lion calls the remaining members to locate the prey. This sea lion is designated as the leader, and the rest of the members update their positions with respect to the prey. Thus, the displacement between the sea lion and the prey is expressed in Eq. (11).

$\displaystyle\textit{dis}=\left|2\vec{R}\cdot L(n)-\textit{EA}(n)\right|$ (11)

At the current iteration, the locations of the sea lion and the prey are denoted as $\textit{EA}(n)$ and $L(n)$ , correspondingly. The term dis is the displacement between the sea lion and the prey. Here, $\vec{R}$ refers to a random vector in the range between 0 and 1. In the next iteration, the new position is generated as derived in Eq. (12).

$\displaystyle\textit{EA}(n+1)=L(n)-\textit{dis}\cdot\vec{B}$ (12)

Here, $\vec{B}$ is a vector that reduces the search range from 2 to 0 over the iterations, which is used for rapidly locating the prey and encircling it.
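The encircling update of Eqs (11) and (12) can be sketched as follows. The function names are hypothetical, the update is applied dimension by dimension, and the linear decay of $\vec{B}$ from 2 to 0 is an assumption, since the exact schedule is not stated here.

```python
import random

def encircle(agent, prey, B, rng):
    """One position update following Eqs. (11) and (12), per dimension."""
    new_pos = []
    for ea, l in zip(agent, prey):
        r = rng.random()               # random value in [0, 1)
        dis = abs(2.0 * r * l - ea)    # Eq. (11)
        new_pos.append(l - dis * B)    # Eq. (12)
    return new_pos

def shrink_B(n, n_total):
    """Assumed linear decay of B from 2 (first iteration) to 0 (last)."""
    return 2.0 * (1.0 - n / n_total)

rng = random.Random(0)                 # fixed seed for reproducibility
pos = encircle([1.0, -2.0], [0.5, 0.5], shrink_B(5, 25), rng)
```

As $\vec{B}$ approaches 0, the second term of Eq. (12) vanishes and the agent lands on the prey position, which matches the narrowing search range described above.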

Velocity: Sea lions can survive on both land and water. In water, the sound of a sea lion travels four times faster than on land. Thus, the velocity of the sea lion is assessed for hunting the prey, which is given in Eq. (13).

$\displaystyle\textit{EA}^{\textit{ldr}}=\left|\frac{\vec{S}_{1}(1+\vec{S}_{2})}{\vec{S}_{2}}\right|,\qquad\vec{S}_{1}=\sin\theta,\qquad\vec{S}_{2}=\sin\phi$ (13)

Here, the leader sea lion is denoted as $\textit{EA}^{\textit{ldr}}$ . The terms $\vec{S}_{1}$ and $\vec{S}_{2}$ give the velocity of the sea lion's sound in water and on land, respectively. Similarly, $\sin\phi$ models the sound that calls other members who are on land, and $\sin\theta$ models the converse case.
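Eq. (13) can be evaluated directly, as in the following minimal sketch. The function name and the example angles are hypothetical, and how $\theta$ and $\phi$ are chosen in practice is not specified here; the guard against $\sin\phi=0$ is an added safety check.

```python
import math

def leader_value(theta, phi):
    """Eq. (13): |S1 (1 + S2) / S2| with S1 = sin(theta) and S2 = sin(phi)."""
    s1, s2 = math.sin(theta), math.sin(phi)
    if s2 == 0.0:
        raise ValueError("sin(phi) is zero, so Eq. (13) is undefined")
    return abs(s1 * (1.0 + s2) / s2)

# Hypothetical angles: theta = 30 degrees, phi = 90 degrees.
value = leader_value(math.pi / 6, math.pi / 2)
```

In Algorithm 1 this value is compared with the threshold 0.25 to decide which SLnO position update is applied.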

New position: The hunting phase is carried out using the best search agent. Here, the prey is termed a bait ball. Based on the value of $\vec{B}$ in Eq. (12), the agent is able to reduce its distance to the location of the prey, which finally leads to surrounding the bait ball. Thus, the new solution is created as formulated in Eq. (14).

$\displaystyle\textit{EA}(n+1)=\left|L(n)-\textit{EA}(n)\right|\cdot\cos 2\pi t% +L(n)$ (14)

Here, the absolute value of the distance between the optimal solution and the sea lion is denoted as $\left|L(n)-\textit{EA}(n)\right|$ , and the random number $t$ lies in the range between $-$ 1 and 1. In Eq. (14), the position is updated based on the best search agent. During exploration, the search agents update their positions by choosing a sea lion randomly. This is given in Eqs (15) and (16).

$\displaystyle\textit{dis}=\left|2\vec{R}\cdot\textit{EA}^{\textit{rd}}(n)-\textit{EA}(n)\right|$ (15)

$\displaystyle\textit{EA}(n+1)=\textit{EA}^{\textit{rd}}(n)-\textit{dis}\cdot\vec{B}$ (16)

Finally, $\textit{EA}^{\textit{rd}}(n)$ signifies the random sea lion. When $\vec{B}$ is higher than 1, the search agent is selected randomly. When it is less than 1, the location is to be updated.
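The attack update of Eq. (14) can be sketched as follows. The function name is hypothetical and the update is applied per dimension; the exploration update of Eqs (15) and (16) has the same form as Eqs (11) and (12), with a randomly chosen sea lion in place of the prey.

```python
import math

def attack(agent, prey, t):
    """Eq. (14): move around the prey using a cosine of the random number t."""
    return [abs(l - ea) * math.cos(2.0 * math.pi * t) + l
            for ea, l in zip(agent, prey)]

# One-dimensional example: agent at 1.0, prey (best solution) at 3.0.
far_side = attack([1.0], [3.0], 0.0)    # cosine is +1
near_side = attack([1.0], [3.0], 0.5)   # cosine is -1
```

Because the cosine factor lies between -1 and 1, the new position always stays within the current distance $\left|L(n)-\textit{EA}(n)\right|$ of the prey, on either side of it.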

Step 4 – GOA: The GOA [30] algorithm is inspired by the natural behavior of grasshoppers when targeting and hunting prey. In the exploration phase, the search agents move toward the prey, and in the exploitation phase, the agents update their positions. Thus, the behavior of the GOA algorithm is expressed in Eq. (17).

$\displaystyle G_{p}=U_{p}+V_{p}+W_{p}$ (17)

Here, the position of the $p^{\text{th}}$ grasshopper is indicated as $G_{p}$ , and the “social interaction, gravity force and wind advection” of the $p^{\text{th}}$ grasshopper are denoted as $U_{p}$ , $V_{p}$ and $W_{p}$ , correspondingly. Equation (17) is expanded in Eq. (18).

$\displaystyle G_{p}=\sum_{\begin{subarray}{c}q=1\\ q\neq p\end{subarray}}^{H}r\left(|g_{q}-g_{p}|\right)\frac{g_{q}-g_{p}}{D_{pq}% }-v\hat{e}_{v}+w\hat{e}_{w}$ (18)

The difference $|g_{q}-g_{p}|$ gives the distance between the $p^{\text{th}}$ and $q^{\text{th}}$ grasshoppers, $r$ defines the social strength, and $\frac{g_{q}-g_{p}}{D_{pq}}$ represents the unit vector of the grasshopper interaction. Similarly, the terms $v$ and $w$ signify the gravitational and wind advection constants. The grasshopper distance is mapped to the range between 1 and 4. Also, $\hat{e}_{v}$ and $\hat{e}_{w}$ refer to the unit vectors in the direction of the center of the earth and of the wind. The term $r$ is derived in Eq. (19).

$\displaystyle r(m)=fe^{\frac{-m}{n}}-e^{-m}$ (19)

The above equation drives the random population during the search, but after a certain number of iterations, all grasshoppers reach the comfort zone, where they no longer move. To bring all the grasshoppers out of this zone and solve the optimization problem, another variant of Eq. (18) is expressed in Eq. (20).

$\displaystyle G_{p}=d\left(\sum_{\begin{subarray}{c}q=1\\ q\neq p\end{subarray}}^{H}d\frac{\textit{UB}_{s}-\textit{LB}_{s}}{2}r\left(|g_% {q}^{s}-g_{p}^{s}|\right)\frac{g_{q}-g_{p}}{D_{pq}}\right)+\vec{Z}_{s}$ (20)

Here, the upper and lower bounds of the $s^{\text{th}}$ dimension are given as $\textit{UB}_{s}$ and $\textit{LB}_{s}$ . The term $d$ is the decreasing coefficient that reduces the space of the “comfort, repulsion, and attraction zone”. The target prey in the $s^{\text{th}}$ dimension is given as $\vec{Z}_{s}$ . Throughout the iterations, the coefficient $d$ is calculated using Eq. (21).

$\displaystyle d=d_{\text{max}}-n\frac{d_{\text{max}}-d_{\text{min}}}{N}$ (21)

Here, the maximum and minimum values are denoted as $d_{\text{max}}$ and $d_{\text{min}}$ , and the current and last iterations are indicated as $n$ and $N$ . Finally, the optimal solutions generated through this algorithm are used in the subsequent section on Deep CNN. The pseudo-code of the proposed HSLnGO algorithm is described below.
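The social strength of Eq. (19) and the decreasing coefficient of Eq. (21) can be sketched as follows. The constants $f=0.5$ and $n=1.5$ in Eq. (19), as well as $d_{\text{max}}=1$ and $d_{\text{min}}=0.00001$ in Eq. (21), are values commonly used in the GOA literature and are assumptions here, since they are not stated in this section; the function names are hypothetical.

```python
import math

def social_force(m, f=0.5, n=1.5):
    """Eq. (19): r(m) = f * exp(-m / n) - exp(-m); f and n are assumed constants."""
    return f * math.exp(-m / n) - math.exp(-m)

def comfort_coefficient(n, N, d_max=1.0, d_min=0.00001):
    """Eq. (21): linear decrease of d from d_max to d_min; bounds are assumed."""
    return d_max - n * (d_max - d_min) / N

repulsion = social_force(0.0)               # negative value: repulsion
comfort_distance = 3.0 * math.log(2.0)      # r(m) = 0 for the assumed f and n
attraction = social_force(3.0)              # positive value: attraction
schedule = [comfort_coefficient(n, 25) for n in range(26)]
```

Under these assumed constants the force changes sign at a distance of about 2.08, which is the comfort distance at which neither attraction nor repulsion acts, and the coefficient $d$ shrinks monotonically over the 25 iterations.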

Algorithm 1: Proposed HSLnGO algorithm

Consider the total population $H$ and total iterations $N_{T}$
Compute the fitness values $a$ and $b$ for every agent using Eq. (10)
Assume the population as sea lions or grasshoppers
Set the best solution as $F$
For ( $n<N_{T}$ )
    If ( $a<b$ )
        Solution update by SLnO algorithm
        Compute leader sea lion using Eq. (13)
        If ( $\textit{EA}^{\textit{ldr}}<0.25$ )
            If ( $\vec{B}<1$ )
                Location is upgraded in Eq. (11)
            Else
                Using Eq. (16), the position is updated
            End If
        Else
            Update the location of the search agent by Eq. (14)
            Random position
        End If
    Else
        Solution update by GOA algorithm
        Compute $d$ using Eq. (21)
        Upgrade the position by Eq. (20)
    End If
    Upgrade $F$ once the new solution is generated
    $n=n+1$ ;
End For
Return $F$
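Algorithm 1 can be turned into a compact, runnable sketch as shown below. This is an illustration under several assumptions and not the authors' code: the toy objective `sphere`, the bounds, the random choice of $\theta$ and $\phi$, the linear decay of $\vec{B}$, the constants of Eqs (19) and (21), the use of the best solution as the prey position $L(n)$, and the clamping to the bounds are all choices made here for illustration.

```python
import math
import random

def sphere(x):
    """Toy objective (an assumption): smaller values are better."""
    return sum(v * v for v in x)

def social(m, f=0.5, scale=1.5):
    """Eq. (19) with assumed constants f and scale."""
    return f * math.exp(-m / scale) - math.exp(-m)

def hslngo(objective, dim=2, H=10, N_T=25, lb=-5.0, ub=5.0, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(H)]
    best = list(min(pop, key=objective))              # best solution F
    for n in range(N_T):
        fits = [objective(p) for p in pop]
        b = (max(fits) + min(fits)) / 2.0             # Eq. (10)
        B = 2.0 * (1.0 - n / N_T)                     # assumed decay from 2 to 0
        d = 1.0 - n * (1.0 - 1e-5) / N_T              # Eq. (21), assumed bounds
        new_pop = []
        for p, a in zip(pop, fits):
            if a < b:                                 # SLnO branch
                s1 = math.sin(rng.uniform(0.0, math.pi))
                s2 = math.sin(rng.uniform(0.1, math.pi - 0.1))
                ldr = abs(s1 * (1.0 + s2) / s2)       # Eq. (13)
                if ldr < 0.25:
                    # Eqs. (11)-(12) around the best, or Eqs. (15)-(16) around a random agent
                    ref = best if B < 1.0 else rng.choice(pop)
                    new = [l - abs(2.0 * rng.random() * l - x) * B
                           for x, l in zip(p, ref)]
                else:                                 # Eq. (14)
                    t = rng.uniform(-1.0, 1.0)
                    new = [abs(l - x) * math.cos(2.0 * math.pi * t) + l
                           for x, l in zip(p, best)]
            else:                                     # GOA branch, Eq. (20)
                new = []
                for s in range(dim):
                    total = 0.0
                    for q in pop:
                        if q is p:
                            continue
                        D = math.dist(p, q) or 1e-12  # guard against zero distance
                        total += (d * (ub - lb) / 2.0
                                  * social(abs(q[s] - p[s])) * (q[s] - p[s]) / D)
                    new.append(d * total + best[s])
            new = [min(ub, max(lb, v)) for v in new]  # keep inside the bounds
            new_pop.append(new)
            if objective(new) < objective(best):      # upgrade F
                best = new
        pop = new_pop
    return best

best_found = hslngo(sphere)
```

Because the best solution $F$ is only replaced when a strictly better candidate appears, the returned solution is never worse than the best member of the initial population.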

Thus, the flow chart representation of the enhanced hybrid algorithm is depicted in Fig. 3.

Figure 3.

Flow chart of proposed HSLnGO algorithm.

5.3 Deep CNN-based hierarchical clustering

The recommended clustering model employs Deep CNN [31] to provide the clustered output, where some of the hyperparameters are optimized by the developed HSLnGO algorithm. A few of the advantages of Deep CNN are described here. Deep CNN has the capability of predicting the failure of time, and it also increases the accuracy rate. The essential aspect of Deep CNN-based clustering is to create an accurate clustered outcome. It decreases the running time and is suitable for real-time applications. It also offers easy initialization, quick clustering within a minute, and efficient modification. Due to these advantages, it is chosen for clustering in this research work. The Deep CNN model is constructed with distinct layers: “convolutional layers, pooling layers, fully connected layers, and softmax classifier layers”. The resultant optimal features are passed through every hidden layer, each of which depends on the output of the preceding layer in the model. In addition, the activation function is applied after the convolutional operation for mapping the features.

Meanwhile, the convolution relies on the optimal features that are represented as $\textit{FE}_{s}^{\textit{opt}}\in\{\textit{FE}_{1}^{\textit{opt}},\textit{FE}_{2}^{\textit{opt}},\ldots,\textit{FE}_{S}^{\textit{opt}}\}$ , which are mapped to a matrix of dimension $S\times\gamma$ , where $\gamma$ is the total number of text data items and $S$ is the number of optimal features. The input comprises a sequence of characters that is represented in matrix format as $\textit{CH}\in D^{c\times|U|}$ , and the sentence matrix is defined as $\textit{SE}\in D^{c\times|V|}$ . It is then fed into the convolution layer, where the filter is $\textit{FI}\in D^{c\times x}$ . Hence, the convolution operation is determined in Eq. (22).

$\displaystyle L_{q}=(\textit{SE}*\textit{FI})_{q}=\sum_{r,p}(\textit{SE}_{[:,q% -s+1:q]}\otimes\textit{FI})_{rp}$ (22)

In the aforementioned equation, $[:,q-s+1:q]$ selects the columns of the sentence matrix covered by a window of column size $s$ , and $\otimes$ denotes element-wise multiplication. After this, the activation function is applied to render the feature maps of the text or words. Here, the ReLU (Rectified Linear Unit) is used as the activation function, and the pooling process then accumulates the biased and activated values obtained from the convolution operation. Thus, the pooling process is derived using Eq. (23).

$\displaystyle L_{\textit{pl}}=\begin{bmatrix}\textit{pl}(\textit{af}(L_{1}+b_{% 1}*\textit{uv}))\\ \vdots\\ \textit{pl}(\textit{af}(L_{z}+b_{z}*\textit{uv}))\\ \end{bmatrix}$ (23)

The above matrix is formed from the bias $b_{z}$ , the unit vector uv, and the $z^{\text{th}}$ convolution feature map $L_{z}$ , where pl and af denote the pooling and activation functions. Subsequently, max pooling is performed by taking only the maximum value of each feature map. At the last stage, the resultant feature is fed into the fully-connected softmax layer, where the probability distribution functions are calculated using Eq. (24).

$\displaystyle\textit{PDF}=\textit{SM}_{q}(y^{T}w+i)=\frac{e^{y^{T}w_{q}+i_{q}}}{\sum\limits_{r=1}^{R}e^{y^{T}w_{r}+i_{r}}}$ (24)

Here, weight and bias values are represented by $w_{r}$ and $i_{r}$ , respectively. Figure 4 illustrates the diagrammatic view of Deep CNN-based hierarchical clustering.
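The chain of Eqs (22)–(24) can be traced on a toy input with the following minimal sketch. It is not the authors' network: the function names, the tiny sentence matrix, the two filters, and the weights are hypothetical, the filter width plays the role of the window size $s$, and the bias is added as a scalar to every entry of a feature map.

```python
import math

def convolve(SE, FI):
    """Eq. (22): slide filter FI over the columns of sentence matrix SE."""
    rows, cols, width = len(SE), len(SE[0]), len(FI[0])
    out = []
    for q in range(width - 1, cols):   # window covers columns q-width+1 .. q
        out.append(sum(SE[r][q - width + 1 + p] * FI[r][p]
                       for r in range(rows) for p in range(width)))
    return out

def relu(values):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in values]

def pool(feature_map, bias):
    """One row of Eq. (23): add the bias, apply ReLU, keep the maximum."""
    return max(relu([v + bias for v in feature_map]))

def softmax(y, W, i):
    """Eq. (24): class probabilities from the pooled features y."""
    scores = [sum(a * b for a, b in zip(y, w_r)) + i_r for w_r, i_r in zip(W, i)]
    top = max(scores)                  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2 x 4 sentence matrix and two 2 x 2 filters.
SE = [[1.0, 2.0, 3.0, 4.0],
      [0.0, 1.0, 0.0, 1.0]]
filters = [[[1.0, 0.0], [0.0, 1.0]],
           [[-1.0, 0.0], [0.0, -1.0]]]
biases = [0.5, 0.0]

maps = [convolve(SE, FI) for FI in filters]
y = [pool(m, b) for m, b in zip(maps, biases)]
probs = softmax(y, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

The second filter produces only negative responses, so ReLU and max pooling reduce its feature map to zero, and the softmax assigns most of the probability mass to the first output.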

Figure 4.

Diagrammatic illustration of Deep CNN model for hierarchical clustering using HSLnGO algorithm.

The major downside of Deep CNN is that its performance may become slower with the max pooling operation, and it requires a longer training time because it has multiple hidden neurons. Also, a large number of epochs creates a computational burden, and the learning rate also relies on the number of epochs. To solve this problem, the novel algorithm is used to optimize parameters such as the hidden neurons, learning rate, and epochs. With the resulting optimal parameters, the clustering mechanism ensures the efficacy of the model. The fitness function of Deep CNN-assisted clustering is determined using Eq. (25).

$\displaystyle\textit{FF}=\mathop{\arg\min}\limits_{\{\textit{FE}_{s}^{\textit{opt}},\textit{Hn},\textit{LR},\textit{Ep}\}}\left[\frac{1}{\textit{acry}+\textit{precn}}\right]$ (25)

Here, $\textit{FE}_{s}^{\textit{opt}}$ denotes the selected optimal features, Hn refers to the number of hidden neurons, which has the range of [5, 255], the learning rate is denoted as LR, which lies between 0.01 and 0.99, and Ep is the number of epochs, which has the range of 5 to 50. The variable acry, termed “accuracy”, measures the closeness of the measurements of the text data to the true values. The general expression of accuracy is given in Eq. (26).

$\displaystyle\textit{acry}=\frac{\textit{HP}^{\textit{TI}}+\textit{LN}^{\textit{TI}}}{\textit{HP}^{\textit{TI}}+\textit{LN}^{\textit{TI}}+\textit{EP}^{\textit{FK}}+\textit{FN}^{\textit{FK}}}$ (26)

Similarly, precn, termed “precision”, measures the closeness of the measurements to each other. It is formulated using Eq. (27).

$\displaystyle\textit{precn}=\frac{\textit{HP}^{\textit{TI}}}{\textit{HP}^{\textit{TI}}+\textit{EP}^{\textit{FK}}}$ (27)

In Eqs (26) and (27), $\textit{HP}^{\textit{TI}}$ and $\textit{LN}^{\textit{TI}}$ refer to the true positive and true negative values, and $\textit{EP}^{\textit{FK}}$ and $\textit{FN}^{\textit{FK}}$ signify the false positive and false negative values.
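The objective of Eq. (25) can be evaluated from confusion-matrix counts as in the following sketch. The function names and the example counts are hypothetical; the bounds check simply restates the ranges given above for Hn, LR, and Ep.

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (26): share of correctly assigned items."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Eq. (27): share of predicted positives that are correct."""
    return tp / (tp + fp)

def fitness(tp, tn, fp, fn):
    """Bracketed term of Eq. (25); the optimizer searches for its minimum."""
    return 1.0 / (accuracy(tp, tn, fp, fn) + precision(tp, fp))

def in_bounds(hn, lr, ep):
    """Search ranges stated in the text for Hn, LR, and Ep."""
    return 5 <= hn <= 255 and 0.01 <= lr <= 0.99 and 5 <= ep <= 50

# Hypothetical counts for one candidate hyperparameter setting.
candidate = fitness(tp=45, tn=40, fp=5, fn=10)
perfect = fitness(tp=50, tn=50, fp=0, fn=0)
```

Since accuracy and precision are each at most 1, the smallest attainable value of the objective is 0.5, and lower values indicate better clustering.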

6. Results

6.1 Simulation setup

The offered unstructured text data clustering was developed using Python. To evaluate the effectiveness, the proposed algorithm used a population size of 10 and a total of 25 iterations. The performance of the proposed algorithm was compared with former heuristic algorithms, namely (Electric Fish Optimization) EFO-Deep CNN [32], (Fitness Improved Sensing Area) FISA-EFO-Deep CNN [33], GOA-Deep CNN [30], and SLnO-Deep CNN [29]. Similarly, the proposed hybrid deep learning model was compared with (Deep Neural Network) DNN [34], (Long-Short Term Memory) LSTM [35], (Extreme Learning Machine) ELM [36], CNN [37], and FISA-EFO-DNN [33].

6.2 Evaluation metrics

Various measures are used for validating the performance; those measures are described below.

Accuracy: It is shown in Eq. (26).

Precision: It is derived using Eq. (27).

FPR and FNR: The false positive rate gives the rate at which text data is incorrectly reported as present. The false negative rate gives the rate at which text data is incorrectly reported as absent when it is actually present.

$\displaystyle\textit{FPR}=\frac{\textit{EP}^{\textit{FK}}}{\textit{EP}^{\textit{FK}}+\textit{LN}^{\textit{TI}}}$ (28)

$\displaystyle\textit{FNR}=\frac{\textit{FN}^{\textit{FK}}}{\textit{FN}^{\textit{FK}}+\textit{HP}^{\textit{TI}}}$ (29)

Sensitivity and Specificity: Sensitivity is referred to as the “probability of an actual positive test” and specificity is computed by the “probability of a negative test”.

$\displaystyle\textit{Sensitivity}=\frac{\textit{HP}^{\textit{TI}}}{\textit{HP}^{\textit{TI}}+\textit{FN}^{\textit{FK}}}$ (30)

$\displaystyle\textit{Specificity}=\frac{\textit{LN}^{\textit{TI}}}{\textit{LN}^{\textit{TI}}+\textit{EP}^{\textit{FK}}}$ (31)

F1-Score: “It is defined as the harmonic mean value of precision and recall”. Here, Re denotes the recall, which equals the sensitivity in Eq. (30).

$\displaystyle\textit{F1Score}=2\ast\frac{\textit{precn}\ast\text{Re}}{\textit{% precn}+\text{Re}}$ (32)

FDR: “It is defined as the ratio between the false positive and a total number of both true and false positive”.

$\displaystyle\textit{FDR}=\frac{\textit{EP}^{\textit{FK}}}{\textit{HP}^{\textit{TI}}+\textit{EP}^{\textit{FK}}}$ (33)

NPV: “The negative predictive rate is measured by the ratio of true negative to the total value of true and false negative”.

$\displaystyle\textit{NPV}=\frac{\textit{LN}^{\textit{TI}}}{\textit{LN}^{% \textit{TI}}+\textit{FN}^{\textit{FK}}}$ (34)

MCC: The Matthews correlation coefficient measures the correlation between the predicted and the actual class labels.

$\displaystyle\textit{MCC}=\frac{\textit{HP}^{\textit{TI}}\times\textit{LN}^{% \textit{TI}}-\textit{EP}^{\textit{FK}}\times\textit{FN}^{\textit{FK}}}{\sqrt{(% \textit{HP}^{\textit{TI}}+\textit{EP}^{\textit{FK}})(\textit{HP}^{\textit{TI}}% +\textit{FN}^{\textit{FK}})(\textit{LN}^{\textit{TI}}+\textit{EP}^{\textit{FK}% })(\textit{LN}^{\textit{TI}}+\textit{FN}^{\textit{FK}})}}$ (35)
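The measures of Eqs (28)–(35) can all be computed from the four confusion-matrix counts, as in the following sketch. The function name and the example counts are hypothetical, and the FPR uses the true negative count in its denominator, which is the standard definition.

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute the measures of Eqs. (28)-(35) from confusion-matrix counts."""
    precn = tp / (tp + fp)
    sens = tp / (tp + fn)              # sensitivity, also called recall
    return {
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "Sensitivity": sens,
        "Specificity": tn / (tn + fp),
        "F1": 2.0 * precn * sens / (precn + sens),
        "FDR": fp / (tp + fp),
        "NPV": tn / (tn + fn),
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical counts: 45 true positives, 40 true negatives,
# 5 false positives, and 10 false negatives.
m = metrics(tp=45, tn=40, fp=5, fn=10)
```

Two identities make convenient sanity checks: FPR and specificity sum to 1, and FNR and sensitivity sum to 1.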

6.3 K-fold analysis using various existing algorithms with the proposed model for dataset 1

The k-fold analysis of the enhanced clustering model on dataset 1 is shown in Fig. 5 for different k-fold numbers.

Figure 5.

Evaluation of K-Fold for the proposed unstructured text data clustering model with hybrid algorithms for dataset 1 with respect to “(a) Accuracy, (b) F1-score, (c) FDR, (d) FNR, (e) FPR, (f) MCC, (g) NPV, (h) Precision, (i) Sensitivity, (j) Specificity”.

Figure 5h depicts the precision analysis of the novel data clustering model. When the K-fold is 3, the precision of the proposed HSLnGO-Deep CNN model is higher than that of EFO-Deep CNN by 3.9%, FISA-EFO-Deep CNN by 3.15%, GOA-Deep CNN by 1.4%, and SLnO-Deep CNN by 3.2%. Thus, the higher precision indicates a better clustering performance.

6.4 K-fold analysis using various existing classifiers with the proposed model for dataset 1

Figure 6 illustrates the k-fold analysis of the novel text data clustering model on dataset 1 for different k-fold numbers.

Figure 6.

Evaluation of K-Fold for the proposed unstructured text data clustering method with former classifiers for dataset 1 regarding “(a) Accuracy, (b) F1-score, (c) FDR, (d) FNR, (e) FPR, (f) MCC, (g) NPV, (h) Precision, (i) Sensitivity, (j) Specificity”.

The F1-Score analysis of the proposed method is demonstrated in Fig. 6b. When the K-fold is 2, the offered method attains an F1-score that is 17%, 10%, 8%, and 4% higher than those of DNN, LSTM, ELM, and CNN, respectively. Hence, the suggested method demonstrates the effectiveness of the clustered outcome.

6.5 Evaluation using various existing algorithms with the proposed model for dataset 1

The performance evaluation of the unstructured text clustering method is represented in Fig. 7 in terms of various learning percentages.

Figure 7.

Estimation on proposed unstructured text data clustering method with hybrid algorithms for dataset 1 concerning “(a) Accuracy, (b) F1-score, (c) FNR, (d) FPR, (e) Precision”.

Figure 7c elucidates the FNR analysis of the suggested method on dataset 1. At the 50 ${}^{\text{th}}$ learning percentage, the FNR value is 8.2% for EFO-Deep CNN, 5.2% for FISA-EFO-Deep CNN, 10.2% for GOA-Deep CNN, and 9.2% for SLnO-Deep CNN, all of which are higher than that of the improved HSLnGO-Deep CNN model. Therefore, the proposed system enhances the clustering performance through a lower false negative rate.

6.6 Performance analysis using various existing classifiers with the proposed model for dataset 1

Figure 8 represents the performance analysis of the novel method over existing deep learning models through dataset 1.

Figure 8.

The overall analysis of offered unstructured text data clustering model with existing classifiers for dataset 1 concerning “(a) Accuracy, (b) F1-score, (c) FNR, (d) FPR, (e) Precision”.

The accuracy analysis of the offered method is given in Fig. 8a. At the 75 ${}^{\text{th}}$ learning percentage, the offered method attains an accuracy that is 24%, 14%, 12%, 10%, and 3% higher than those of DNN, LSTM, ELM, CNN, and FISA-EFO-DNN, respectively. Hence, the recommended model achieves the expected accuracy and improves the clustering performance.

6.7 K-fold analysis using various existing algorithms with the proposed model for dataset 2

The k-fold analysis for dataset 2 over classical heuristic algorithms is illustrated in Fig. 9.

Figure 9.

K-fold analysis of the offered unstructured text data clustering model with hybrid algorithms for dataset 2 concerning “(a) Accuracy, (b) F1-score, (c) FDR, (d) FNR, (e) FPR, (f) MCC, (g) NPV, (h) Precision, (i) Sensitivity, (j) Specificity”.

The MCC analysis for the proposed method is given in Fig. 9f. When the k-fold number is 2, the enhanced clustering model achieves an MCC value that is higher than that of EFO-Deep CNN by 7.3%, FISA-EFO-Deep CNN by 6%, GOA-Deep CNN by 2.3%, and SLnO-Deep CNN by 6.6%. Hence, the novel text clustering method offers better performance for unstructured text data.

6.8 K-fold analysis using various existing classifiers with the proposed model for dataset 2

Figure 10 shows the k-fold analysis of the proposed system with traditional classifiers for dataset 2.

Figure 10.

K-fold analysis of proposed unstructured text data clustering method with former classifiers for dataset 2 regarding “(a) Accuracy, (b) F1-score, (c) FDR, (d) FNR, (e) FPR, (f) MCC, (g) NPV, (h) Precision, (i) Sensitivity, (j) Specificity”.

The F1-score analysis of the offered method is represented in Fig. 10b. The F1-score of HSLnGO-Deep CNN is higher than those of DNN, LSTM, ELM, and CNN by 21%, 13%, 11%, and 9%, respectively. Due to the higher value, the suggested method yields a better clustered output.

6.9 Performance analysis using various existing algorithms with the proposed method for dataset 2

Evaluation of the proposed clustering method while implementing dataset 2 is elucidated in Fig. 11. The FNR analysis is given in Fig. 11c.

Figure 11.

The overall analysis of the proposed unstructured text data clustering model with hybrid algorithms for dataset 2 in terms of “(a) Accuracy, (b) F1-score, (c) FNR, (d) FPR, (e) Precision”.

When the learning percentage is 50, the FNR value obtained for the existing algorithms is 8% of EFO-Deep CNN, 5.4% of FISA-EFO-Deep CNN, 9.2% of GOA-Deep CNN, and SLnO-Deep CNN, all of which are higher than that of the proposed HSLnGO-Deep CNN. Thus, the novel clustering method achieves a lower error rate and renders the desired outcome.

6.10 Performance analysis using various existing classifiers with the proposed model for dataset 2

Figure 12 demonstrates the performance analysis for dataset 2 compared with conventional classifiers. The precision analysis is represented in Fig. 12e.

Figure 12.

The overall analysis of the proposed unstructured text data clustering method with existing classifiers for dataset 2 concerning “(a) Accuracy, (b) F1-score, (c) FNR, (d) FPR, (e) Precision”

At the 35 ${}^{\text{th}}$ learning percentage, the HSLnGO-Deep CNN attains a precision that is 28%, 16%, 14%, and 12% higher than those of the existing classifiers DNN, LSTM, ELM, and CNN, respectively. Hence, the higher precision enhances the clustering performance for unstructured text data.

6.11 Overall comparative algorithmic analysis of a proposed model for dataset 1

Table 2 evaluates the efficiency of the new method in contrast with various heuristic algorithms using dataset 1.

Table 2
Comparative evaluation of proposed unstructured text data clustering model for dataset 1 using heuristic algorithms

Metrics	EFO-Deep CNN [32]	FISA-EFO-Deep CNN [33]	GOA-Deep CNN [30]	SLnO-Deep CNN [29]	HSLnGO-Deep CNN
Accuracy	90.91	93.045	90.35	89.385	96.475
Sensitivity	90.95718	93.07316	90.28845	89.41012	96.47669
Specificity	90.86264	93.01673	90.41178	89.35978	96.4733
Precision	90.90274	93.0453	90.43287	89.4012	96.48632
FPR	9.137361	6.983268	9.588218	10.64022	3.526701
FNR	9.042819	6.926839	9.711548	10.58988	3.523306
NPV	90.86264	93.01673	90.41178	89.35978	96.4733
FDR	9.097257	6.9547	9.56713	10.5988	3.513675
F1-Score	90.92995	93.05923	90.3606	89.40566	96.48151
MCC	81.81993	86.08995	80.70009	78.76992	92.94998

The sensitivity of the model is higher than that of the former algorithms: 5.51% higher than EFO-Deep CNN, 3.4% higher than FISA-EFO-Deep CNN, 6.1% higher than GOA-Deep CNN, and 7% higher than SLnO-Deep CNN. Due to these results, the proposed model improves the clustering performance.
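The sensitivity gaps quoted above can be recomputed from the Sensitivity row of Table 2. The short sketch below takes the absolute difference (in percentage points) between the proposed model and each baseline; the variable names are hypothetical and the values are copied from Table 2.

```python
# Sensitivity values (in %) copied from Table 2.
sensitivity = {
    "EFO-Deep CNN": 90.95718,
    "FISA-EFO-Deep CNN": 93.07316,
    "GOA-Deep CNN": 90.28845,
    "SLnO-Deep CNN": 89.41012,
    "HSLnGO-Deep CNN": 96.47669,
}

proposed = sensitivity["HSLnGO-Deep CNN"]

# Absolute gap between the proposed model and each baseline, in percentage points.
gaps = {name: round(proposed - value, 2)
        for name, value in sensitivity.items()
        if name != "HSLnGO-Deep CNN"}
```

The computed gaps are about 5.52, 3.40, 6.19, and 7.07 percentage points, which agree with the quoted figures up to rounding and suggest that the reported improvements are absolute rather than relative differences.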

6.12 Evaluation of classifier analysis on offered method for dataset 1

Evaluation of the offered method, when compared with diverse classifiers, is given in Table 3 by implementing dataset 1.

Table 3
Comparative analysis of proposed unstructured text data clustering model for dataset 1 using different classifiers

Metrics	DNN [34]	LSTM [35]	ELM [36]	CNN [37]	FISA-EFO-Deep CNN [33]	HSLnGO-Deep CNN
Accuracy	74.095	83.09	85.18	87.835	93.8	96.475
Sensitivity	74.01936	83.01228	85.1582	87.76325	91.38577	96.47669
Specificity	74.17092	83.16802	85.20188	87.90702	96.56652	96.4733
Precision	74.20452	83.19496	85.24328	87.93	96.48632	96.8254
FPR	25.82908	16.83198	14.79812	12.09298	3.433476	3.526701
FNR	25.98064	16.98772	14.8418	12.23675	8.614232	3.523306
NPV	74.17092	83.16802	85.20188	87.90702	96.56652	96.4733
FDR	25.79548	16.80504	14.75672	12.07	3.174603	3.513675
F1-Score	74.11183	83.10352	85.20072	87.84655	94.02697	96.48151
MCC	48.19021	66.18018	70.35998	75.67014	87.75152	92.94998

The precision of the proposed method is 22.6% higher than DNN, 13.6% higher than LSTM, 11.5% higher than ELM, 8.89% higher than CNN, and 0.3% higher than FISA-EFO-DNN. Since it offers the desired outcome, the proposed model delivers better text data clustering.

6.13 Overall comparative algorithmic analysis of a proposed model for dataset 2

Table 4 presents the comparative analysis of the new model against different algorithms on dataset 2.

Table 4
Comparative evaluation of proposed unstructured text data clustering model for dataset 2 using heuristic algorithms

Metrics	EFO-Deep CNN [32]	FISA-EFO-Deep CNN [33]	GOA-Deep CNN [30]	SLnO-Deep CNN [29]	HSLnGO-Deep CNN
Accuracy	90.81836	93.21357	90.21956	89.22156	96.40719
Sensitivity	90.61876	93.01397	90.21956	89.42116	96.40719
Specificity	91.01796	93.41317	90.21956	89.02196	96.40719
Precision	90.98196	93.38677	90.21956	89.06561	96.40719
FPR	8.982036	6.586826	9.780439	10.97804	3.592814
FNR	9.381238	6.986028	9.780439	10.57884	3.592814
NPV	91.01796	93.41317	90.21956	89.02196	96.40719
FDR	9.018036	6.613226	9.780439	10.93439	3.592814
F1-Score	90.8	93.2	90.21956	89.24303	96.40719
MCC	81.63738	86.42783	80.43912	78.44374	92.81437

The NPV of the model is higher than that of the former algorithms: 5.3% higher than EFO-Deep CNN, 2.9% higher than FISA-EFO-Deep CNN, 6.1% higher than GOA-Deep CNN, and 7.3% higher than SLnO-Deep CNN. Thus, the proposed text clustering model improves the clustering efficiency.

6.14 Evaluation of classifier analysis on offered method for dataset 2

The comparative evaluation of the offered method against diverse classifiers is given in Table 5 for dataset 2.

Table 5
Comparative analysis of proposed unstructured text data clustering model for dataset 2 using different classifiers

Metrics	DNN [34]	LSTM [35]	ELM [36]	CNN [37]	HSLnGO-Deep CNN
Accuracy	73.9521	83.03393	85.02994	88.62275	96.40719
Sensitivity	73.05389	83.83234	84.83034	88.02395	96.40719
Specificity	74.8503	82.23553	85.22954	89.22156	96.40719
Precision	74.39024	82.51473	85.17034	89.09091	96.40719
FPR	25.1497	17.76447	14.77046	10.77844	3.592814
FNR	26.94611	16.16766	15.16966	11.97605	3.592814
NPV	74.8503	82.23553	85.22954	89.22156	96.40719
FDR	25.60976	17.48527	14.82966	10.90909	3.592814
F1-Score	73.71601	83.16832	85	88.55422	96.40719
MCC	47.91192	66.07629	70.06044	77.25105	92.81437

The sensitivity of the novel method is 23.3% higher than DNN, 12.5% higher than LSTM, 11.5% higher than ELM, and 8.38% higher than CNN. Since it achieves better results, the proposed model delivers better text data clustering.

6.15 Discussion of results based on diverse existing approaches

In the experimental evaluation, the k-fold analysis shows equivalent performance. Here, GOA is the 2 ${}^{\text{nd}}$ best algorithm. At the same time, EFO achieves the lowest performance and holds a high error rate. EFO also suffers from overfitting and easily falls into local optima. Likewise, CNN is the 2 ${}^{\text{nd}}$ best classifier, whereas DNN achieves the lowest performance and holds a high error rate. It does not apply to real-time datasets and has reduced the computation time. The overall performance of the designed method confirmed its effectiveness. The performance of the classifiers and algorithms increased at the 80 ${}^{\text{th}}$ learning percentage, and performance degradation was also observed at the 80 ${}^{\text{th}}$ learning percentage.

7. Conclusion

This paper has presented a novel unstructured text data clustering model with the aid of a hybrid heuristic algorithm. Initially, the unstructured text dataset was collected from standard data sources. Once the data was gathered, it underwent the pre-processing phase, which consisted of tokenization, stemming, stop word removal, and punctuation removal. The pre-processed text data was given as input to feature extraction stages 1 and 2. In stage 1, the relevant text features were obtained by the GloVe embedding model, whereas in stage 2 the deep features of the data were acquired through the Text-CNN method. These two feature sets were concatenated, and the optimal features were then selected by the novel HSLnGO algorithm. This algorithm was built by combining the classical SLnO and GOA algorithms. Then, the optimal features were fed into the Deep CNN-based hierarchical clustering, where the features were clustered into different groups depending on the text data. To further improve the clustering performance, the hyperparameters of Deep CNN were tuned optimally by the HSLnGO algorithm. Finally, several metrics were utilized to assess the improvement of the clustering model. The precision of the proposed method was 22.6% higher than DNN, 13.6% higher than LSTM, 11.5% higher than ELM, 8.89% higher than CNN, and 0.3% higher than FISA-EFO-DNN. Compared with classical approaches, the recommended clustering mechanism has offered impressive outcomes and provided effective clustered data.

Funding

This research did not receive any specific funding.

Informed consent

Not applicable.

Ethical approval

Not applicable.

Author contributions

All authors have made substantial contributions to the conception and design, revising the manuscript, and the final approval of the version to be published. Also, all authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Acknowledgments

I would like to express my very great appreciation to the co-authors of this manuscript for their valuable and constructive suggestions during the planning and development of this research work.

Conflict of interest

The authors declare no conflict of interest.
