Abstract
This article proposes the modified AHC (Agglomerative Hierarchical Clustering) algorithm which considers the feature similarity and is applied to the text clustering. The words which are given as features for encoding texts into numerical vectors are semantic related entities, rather than independent ones, and the synergy effect between the word clustering and the text clustering is expected by combining both of them with each other. In this research, we define the similarity metric between numerical vectors considering the feature similarity, and modify the AHC algorithm by adopting the proposed similarity metric as the approach to the text clustering. The proposed AHC algorithm is empirically validated as the better approach in clustering texts in news articles and opinions. The significance of this research is to improve the clustering performance by utilizing the feature similarities.
Introduction
Introduction Text clustering refers to the process of segmenting a group of various texts into subgroups of content based similar texts, as an instance of pattern clustering. In this research, we assume an unsupervised learning algorithm is used as approach to the task, and texts are clustered based on their content based similarities. Texts are mapped into their structured forms and clustered based on their similarities. The results from the task are unnamed clusters of texts, and it is another task to name clusters based on their contents. Note that clustering is very expensive computation with the quadratic complexity by itself.
We cannot avoid some problems in encoding texts into numerical vectors and computing their similarities based on only attribute values. In encoding so, many features are needed for maintaining the robustness, since each feature has very tiny coverage in the given corpus [9]. The dominance of zero feature values in each numerical vector cases the very poor environment for computing similarities among numerical vectors [9]. The assumption that the features are independent of each other violates against the reality, especially in text mining tasks; the words as the features of texts are related with each other, semantically [24]. Therefore, as the challenge of this research against the problems, we consider features as well as feature values in computing the similarity between texts.
Let us mention what we propose in this research as agenda. In this research, assuming that words are given as features of numerical vectors representing texts, we will consider the semantic relations among features. Based on the fact, we define the similarity measure which involves both the feature similarity and the feature value similarity between text representations. Using the defined similarity measure, the AHC algorithm is modified into the version which computes the similarity between items as the approach to the text clustering. Therefore, the discriminations among numerical vectors are improved even in the sparse distribution, and texts may be expected into numerical vectors with their smaller dimensions.
We mention some benefits which are expected from this research, by implementing the above ideas. We open potentially the possibility that texts are represented more compactly into numerical vectors, because the discriminations among them are improved. The information loss is reduced in computing the similarity between texts, by reflecting the semantic similarities among words which are given as features. The system become more tolerant to the sparse distribution of numerical vectors by considering the similarity between attributes as well as one between attribute values. Therefore, we expect both the better performance of the text clustering and more efficient representations of texts from this research.
This article is organized into the five sections. In Section 2, we survey the relevant previous works. In Section 3, we describe in detail what we propose in this research. In Section 4, we validate empirically the proposed approach by comparing it with the traditional one. In Section 5, we mention the general discussion on the empirical validations and remaining tasks for doing the further research.
Previous works
PreviousWorks This section is concerned with the previous works which are relevant to this research. In Section 2.1, we explore the previous cases of applying the AHC algorithm to text mining tasks. In Section 2.2, we survey the schemes of encoding texts or words into structured data. In Section 2.3, we describe the previous machine learning algorithms which receive alternative structured data such as tables and string vectors to numerical vectors. Therefore, in this section, we provide the history about this research, by surveying the relevant previous works.
Using AHC algorithm to text mining tasks
Section02.01 This section is concerned with the previous cases of applying the AHC algorithm to text mining tasks. Clustering texts or words belongs to the text mining task, we adopt the AHC algorithm as the approach, in this research. In the AHC algorithm, data items are clustered by starting singletons and building up clusters by merging them in the bottom up direction. The fact that the AHC algorithm is popularly used in clustering tasks in any domain is the reason of adopting and modifying the AHC algorithm in this research. In this section, we survey the cases of using the AHC algorithm for clustering words or texts and introduce the clustering index as the evaluation metric of clustering data items.
Let us mention the previous cases of applying the AHC algorithm to the word clustering. In 2001, Slonim and Tishby proposed the hierarchical bottle neck algorithm which is a AHC variant as the approach to the word clustering [27]. In 2002, Dhillon et al. clustered words by the AHC algorithm and used the results for the hierarchical text categorization [5]. In 2016, Zhou et al. applied the word clustering to the hot topic detection [31]. In the above works, words are clustered, depending on their frequencies in each category, and the research on the word clustering is progressed for supporting the text categorization tasks.
Let us mention the previous cases of applying the AHC algorithm to the text clustering. In 2010, Gao et al. adopted the AHC algorithm for implementing the text clustering system based on the Map Reduce frame work [7]. In 2015, Gamare et al. proposed the mixture model including the AHC algorithm as the approach to the web document clustering [6]. In 2016, Ah-Pine et al. used the cosine similarity instead of the Euclidean distance in applying the AHC algorithm for clustering texts [2]. We use the AHC algorithm for text categorization by modifying or combining it with other approaches in the recent literatures.
Let us mention the evaluation metric, which is called clustering index, of clustering tasks. In 2007, Brun et al. mentioned various metrics of evaluating clustering results and classified them into the three types [4]. In 2007, Jo and Lee proposed the metric based on the intra-cluster similarity and the inter-cluster similarity, called cluster index [20]. In 2008, Jo used the clustering index for evaluating clustering results [10]. Therefore, we adopt the clustering index for validating the proposed approach empirically, in this research.
In this research, we adopt the AHC algorithm as the approach to the word clustering based on the above cases. In this research, words are clustered based on their meaning and topics with regardless of their spellings. We modify the AHC algorithm into another version and apply it to the word clustering, itself. We use the clustering index which integrates the both metrics, intra-cluster similarity and inter-cluster similarity, as the evaluation metric for validating the proposed approach. Therefore, the above literatures provide the basis for doing the current research.
Encoding schemes
Section02.02 This is concerned with the previous works on encoding words or texts into structured data. It has been assumed that the input data are always given as numerical vectors in applying machine learning algorithms to applications in any domain. In almost cases of applying machine learning algorithms to text mining texts, texts or words are encoded into numerical vectors. In order to solve the problems in encoding so, there were previous trials of encoding them into alternative structured data to numerical vectors. In this section, we explore the previous schemes of encoding texts or words into structured data.
It is known that texts or words are usually encoded into numerical vectors, by surveying articles on machine learning approaches to text categorization. In 1995, Wiener encoded texts into approximately three hundreds dimensional numerical vectors, in applying the neural networks to the text categorization [28]. In 1999, Yang encoded texts absolutely into numerical vectors, in evaluating machine learning algorithms in apply them to the text categorization [29]. In 2002, Sebastiani surveyed the machine learning approaches to the text categorization, under the assumption of encoding texts into numerical vectors [26]. The fact that texts should be encoded into numerical vectors for doing the text mining tasks is confirmed through the above literatures.
There existed the previous trials of encoding text into tables, instead of tables for doing the text categorization and clustering. In 2008, Jo modified the single pass algorithm into its table based version for the text clustering [10]. In 2011, Jo invented the table marching algorithm as the method of categorizing texts in his patent [16]. In 2015, Jo improved the table matching algorithm into the stable and robust version [17]. The table becomes the alterative text representation to the numerical vector for doing the both tasks.
There also existed the previous trials of encoding texts into string vectors as one more alternative structured data for doing the both tasks. In 2009, Jo proposed the semantic similarity between string vectors for using the KNN and the SVM for ding the text categorization [12]. In 2010, Jo modified the k means algorithm into its string vector based version as the approach to the text clustering [13]. In 2015, Jo defined and characterized mathematically the numerical operations on string as the fundamental research for modifying other machine learning algorithms into their string vector based versions [18]. It requires to define and characterize more semantic operations for modifying more advanced machine learning algorithms.
In this research, we adopt the scheme where texts are encoded into numerical vectors. Words are defined as features, and TF-IDF (Term Frequency-Inverse Document Frequency) and frequency are given as feature values. In order to avoid the poor discriminations among sparse numerical vectors, we need consider the similarity between words which is the feature similarity. The feature similarity and the feature value similarity are combined with each other for computing the similarity between vectors. We modify the AHC algorithm into the version where both similarities are computed and combined with each other as the approach to the text clustering.
Specialized machine learning algorithms
Section02.03 This section is concerned with the previous works on approaches to the text mining tasks where texts are encoded into alternative representations to numerical vectors. In Section 2.3, we explored the previous schemes of encoding texts into structured data and the modified versions of existing machine learning algorithms using the scheme. In this section, we introduce the string kernel which is a kernel function on string in using the SVM to the text categorization, and mention the new neural networks, NTC (Neural Text Categorizer) and NTSO (Neural Text Self Organizer) for doing the text categorization and the text clustering. It takes very much time for applying the string kernel version to the text categorization and needs more operations for advancing the created neural networks. In this section, we explore cases of using the string kernel based SVM and the two creative neural networks to text mining tasks.
The string kernel was proposed as a kernel function in using the SVM for classification tasks, in order to avoid problems in encoding texts into numerical vectors. In 2002, Lodhi et al. initially proposed the string kernel, in order solve the problems in doing so [23]. In 2004, Leslie et al. applied the string kernel to protein classification where a protein is given as a string [22]. In 2006, Kate and Mooney used the string kernel for processing semantically sentences, instead of entire full texts [21]. The SVM with the string kernel works not successfully in the text categorization, but it works successfully in the protein classification.
The neural network, NTC (Neural Text Categorizer), was previously invented as a more suitable approach to the text categorization. In 2000, Jo initially proposed the NTC, but it used its weights which are fixed based on word frequencies [8]. In 2008 and 2010, Jo improved the NTC into the version where word weights are updated, and applied it to both the exclusive text categorizations and the soft ones [11, 14]. In 2012, Pawar and Gawande mentioned the NTC as innovative approach to the text categorization, and in 2015, Abainia et al. applied it for categorizing Arabic texts [1, 25]. In future, the NTC needs to be expanded from the swallow version into deep version.
The NTSO (Neural Text Self Organizer) was previously created as a more suitable approach to the text clustering. In 2005, Jo and Japkowicz invented initially as the approach to the text clustering [19]. In 2006, Zehng et al. mentioned the NTSO as one of the main text clustering tools [30]. In 2010, Jo validated empirically the NTSO performance by comparing it with the popular approaches such as the k means algorithm and the Kohonen Networks [15]. The NTSO will be modified into its supervised version and semi-supervised version in future, like the Kohonen Networks.
In this research, we set the text clustering as the task against which we challenge. In the string kernel based SVM, the lexical similarity between strings is computed, rather than semantic ones; it is more suitable for processing proteins than for doing texts. The two neural networks, NTC and NTSO, take actually much computation time and are very complicated with respect to implementations; they are usually used for implementing heavy versions of text categorization and clustering systems. It requires the mathematical definition and characterizations on more operations on strings for expanding them. Therefore, in this research, we propose the modified AHC version for implementing light versions.
Proposed approach
ProposedApproach This section is concerned with the AHC (Agglomerative Hierarchical Clustering) algorithm which considers both the feature similarity and feature value one, and it consists of the three sections. In Section 3.1, we describe the process of encoding texts into numerical vectors as the text preprocessing. In Section 3.2, we present the equation with which the similarity between two numerical vectors is computed, considering the feature similarity. In Section 3.3, we mention the proposed version where the similarity is computed by the proposed scheme with respect to its learning process. Therefore, this section is intended to describe the proposed version of AHC algorithm as the approach to the text clustering task.
Text encoding
TextEncoding We cover the process of encoding a text into a numerical vector, in this section. A list of words is extracted from a corpus as the feature candidates. Among them, we select some as the features which are the attributes of numerical vectors which represent texts. For each text, we assign values such as frequencies or TF-IDF (Term Frequency and Inverse Document Frequency) weights as elements of the numerical vector. Therefore, this section is intended to describe the process of encoding texts into numerical vectors.
By indexing texts in the corpus, a collection of words is generated and they become feature candidates. The system integrate texts in the corpus into a single text and it is indexed into a list of words. The indexing is performed by the three basic steps: tokenization, stemming, and stopword removal. For further more efficiency, we may attach additional steps such as word filtering and expanding. Words which remain after the three steps are usually verbs, nouns, and adjectives.
After selecting some among the feature candidates as features, it is considered how to decide values to them. The number of selected features becomes the numerical vector dimension. In heuristic schemes, the frequencies or presence bits of features in the given text are assigned as the feature values. In the state of the art scheme, we may calculate the TF-IDF weight of the word by Equation (1)
Let us consider some issues in encoding texts into numerical vectors. The number of feature candidates which are extracted from a corpus is usually ten thousands and the number of selected features is usually several hundreds. Even if only small number of features is selected among them, the dimension of numerical vector which represent a text is still big. The numerical vector tend to be sparse; zero values are dominant by more than 95%. In computing the similarities among numerical vectors by the traditional similarity metrics, zero similarities or zero distances are easily generated.
This research proposed the two kinds of similarities between numerical vectors: feature similarity and feature value similarity. The feature similarity means the similarities among features within a numerical vector, while the feature value similarity means the similarity between two numerical vectors based on their values. The cosine similarity or the inverse Euclidean distance belong to the feature value similarity. In the next section, we will describe how to compute both similarities between two numerical vectors.
FeatureSimilarity We mention the proposed strategy of computing the similarity between numerical vectors. The strategy is outlined in Fig. 1, where both feature similarities are considered. The cosine similarity and the Euclidean distance will be regarded as the traditional similarity metrics in this research. When two vectors have non-zero values in their different features, the cosine similarity between them is computed to zero, while the proposed one generates a non-zero value. Therefore, this section is intended to describe how to calculate the similarity between two numerical vectors.

The Combination of Feature and Feature Value Similarity.
Let us assume that the features of numerical vectors which represent texts are given as words. Actually, the features are semantically related, rather than independent ones which is assumed in using the Naive Bayes [24]. Previously, various similarity metrics between two numerical vectors have been proposed [3] Manning and Schutze 1999. When two vectors have non-zero values in different features, the cosine similarity generates a zero value as the similarity. By considering the feature similarity as well as the feature value similarity, it is expected to avoid the situation.
In order to get feature similarities, we need to build the similarity matrix from a corpus. Its columns and rows correspond to the words which are selected as features. The similarity between two words is computed by Equation (2),
Let us assume that the words, t1, t2,…, t
d
, are selected from the feature candidates, and the two texts, which are notated by d1 and d2, are encoded into the two vectors, as follows:
As the payment, it takes the quadratic complexity, O (d2), to the dimension in computing the similarity between two vectors by Equation (3). If we use the traditional similarity metrics, such as the cosine similarity and the Euclidean distance, it takes only linear complexity, O (d) to the dimension, d. In encoding texts into same dimensional numerical vectors, it takes much more time for computing the similarity. However, it is required to encode texts into less dimensional numerical vectors, so the time difference for computing similarity may be reduced.
ProposedVersion of AHC AlgorithmThis section is concerned with the modified version of AHC algorithm which considers both the feature similarity and the feature value one. The texts which are given as clustering targets are encoded into numerical vectors whose features are texts by the scheme which was described in Section 1. The numerical vectors which represent words or texts tend to be sparse, inherently; zero values in each vector tend to be dominant over 90%. In the proposed version, the similarities among the numerical vectors are computed by Equation (3). In order to provide the detail explanation, we describe the proposed AHC version, together with the traditional one.
The traditional version of AHC algorithm is illustrated in Fig. 2. Words are encoded into numerical vectors, and it begins with unit clusters each of which has only single item. The similarity of every pairs of clusters is computed using the Euclidean distance or the cosine similarity, and the pair with its maximum similarity is merged into a cluster. The clustering by the AHC algorithm proceeds by merging cluster pairs and decrementing number of clusters by one. If the similarities among the sparse numerical vectors are computed, the traditional version becomes very fragile from the poor discriminations among them.

The Traditional Version of AHC Algorithm.
The proposed AHC version is illustrated in Fig. 3. Texts are encoded into numerical vectors, and the clustering begins with individual items. The similarities among numerical vectors are computed by Equation (3) which was presented in Section 3.2. Clustering proceeds by merging the pair with its maximum similarity. By replacing the Euclidean distance or the cosine similarity by Equation (3), it is expected to improve the discriminations among even sparse numerical vectors.

The Proposed Version of AHC Algorithm.
We may consider several schemes of computing a similarity between clusters. We may compute similarities of all possible pairs of items between two clusters and average over them as the cluster similarity. The maximum or the minimum among similarities of all possible pairs is set as the cluster similarity. In another scheme, we may select representative members of two clusters and the similarity between the selected members is regarded as the cluster similarity. In this research, we adopt the first scheme for computing the similarity between two clusters in using the AHC algorithm; other schemes will be considered in next research.
Let us compare the both AHC versions with each other. Both versions begin the clustering process with individual numerical vectors. In computing the similarity between two numerical vectors, the traditional version uses the Euclidean distance or cosine similarity mainly, whereas the proposed one uses the Equation (3). Like the traditional version, the pair of clusters with its maximum similarity is merged into a single cluster in doing the clustering process. However, the proposed version is more tolerant to sparse numerical vectors in computing the similarities among them than the traditional version.
Experiments This section is concerned with the empirical experiments for validating the proposed version of AHC algorithm, and consists of the four sections. In Section 4.1, we present the results from applying the proposed version of AHC to the text clustering on the collection, NewsPage.com. In Section 1 and 1, we mention the results from comparing the two versions of AHC algorithm with each other in clustering texts from 20NewsGroups.
NewsPage.com
SectionFourOne This section is concerned with for validating empirically the better performance of the proposed version on the collection: NewsPage.com. We set the number of clusters as four, following the number of categories, for evaluating the clustering results, and gather texts from the collection, category by category as labeled ones. Each text is allowed to be arranged into only one of the four clusters, in proceeding the clustering task, in this set of experiments. We use the clustering index which was proposed in [9] for evaluating the clustering results.
In Table 1, we specify the text collection, NewsPage.com, which is used in this set of experiments. The text collection was used for evaluating approaches to text categorization in previous works [17]. In this collection, the four categories are predefined: Business, Health, Internet, and Sports, and we select 300 texts at random in each category. The entire group which consist of 1200 texts is segmented into four subgroups by clustering algorithm, in this set of experiments. This collection was built by copying and pasting news articles from the web site, newspage.com, in 2005, as plain text files.
The Number of Texts in NewsPage.com
The Number of Texts in NewsPage.com
Let us mention the experimental process for validating empirically the proposed approach to the task of text clustering. In each category, we select the 300 texts among totally the 500 texts, and encode them into numerical vectors. The group of 1200 texts is segmented into the four clusters by the two versions of AHC algorithm. We use the clustering index which combines the intra-cluster similarity and inter-cluster similarity, for evaluating the both versions of AHC algorithm. The detail description of the clustering index is provided in [20], and it was previously used for evaluating the clustering results.
In Fig. 4, we illustrate the experimental results from clustering texts, using the both versions of AHC algorithm. The y-axis indicate the clustering index as the measure for evaluating the clustering results. In the x-axis, each group indicates the input size which is the dimension of numerical vectors which represent texts. In each group, the gray bar and the black bar indicates the achievements of the traditional version and the proposed version of AHC algorithm, respectively. The two bars in the most right group indicates the averages over their results of the four left groups.

Results from Clustering Texts in Text Collection: NewsPage.com.
Let us make the discussions on the results from doing the text clustering, using the both versions of AHC algorithm, as shown in Fig. 4. The clustering index which is the performance measure of these clustering tasks is in the range between and 0.1 and 0.45. The proposed version of AHC algorithm works strongly better in all input sizes as shown in Fig. 4. The reason of the better results of the proposed version is the improve discriminations among representations by considering the feature similarities as well as feature value ones. From this set of experiments, we conclude that the proposed version works better than the traditional one, in averaging over the four cases.
SectionFourThree This section is concerned with one more set of experiments for validating the better performance of the proposed version on the text collection, 20NewsGroups I. In this set of experiments, we predefine the four general categories in this collection, and gather texts from it in each predefined one as the classified ones. The task of this set of experiments is to cluster texts into the four subgroups based on their semantic similarities, exclusively. We evaluate the both versions of AHC algorithm by clustering index. Therefore, in this section, we observe the performances of the both versions with the different input sizes.
In Table 2, we specify the general version of 20NewsGroups which is used for evaluating the two versions of AHC algorithm. In 20NewsGroup, the hierarchical classification system is defined with the two levels; in the first level, the six categories, alt, comp, rec, sci, talk, misc, and soc, are defined, and among them, the four categories are selected, as shown in Table 2. In each category, we select 300 texts from 4000 or 5000 texts at random. Following the external evaluation, we use the classified words for evaluating clustering results. We obtain the collection, 20NewsGroup, by downloading from the web site, https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html, as one of the standard text collection for evaluating approaches to text categorization.
The Number of Texts in 20NewsGroups I
The Number of Texts in 20NewsGroups I
The experimental process is identical is that in the previous sets of experiments. In each category, we extract the 300 texts at random and encode them into numerical vectors with the input sizes, 10, 50, 100, and 200. The totally 1200 texts are clustered into the four subgroups by the two versions of AHC algorithm, based on their similarities. We use the clustering index which combines the intra-cluster similarity and the inverse inter-cluster similarity with each other, for evaluating the both versions, identically to the previous sets of experiments. We use the labeled texts and their target labels are hidden during clustering process.
In Fig. 5, we illustrate the experimental results from clustering the texts into the four groups on the broad version of 20NewsGroups. Figure 5 has the identical frame of presenting the results to those of Fig. 4. In each group, the gray bar and the black bar indicates the achievements of the traditional version and the proposed version of AHC algorithm, respectively. Fig. 5 presents the results from clustering texts by changing their input sizes. In this set of experiments, we adopt the external evaluation as the paradigm of evaluating the clustering results.

Results from Clustering Texts in Text Collection: 20NewsGroups I
Let us discuss the results from doing the text clustering using the both versions of AHC algorithm, as shown in Figure 5. The clustering indices of the both versions range between 0.05 and 0.4. The proposed version shows its outstandingly better performance in three of the four input sizes. It keeps its competitive performance in the input size, 200. From this set of experiments, we conclude that the proposed version wins over the traditional version, outstandingly, in averaging the four achievements.
SectionFourFour This section is concerned with one more set of experiments where the better performance of the proposed version is validated on another version of 20NewsGroups. In this set of experiments, the four specific categories are predefined in this collection. Texts are exclusively clustered into the four subgroups like the previous sets of experiments. We use the clustering index as the metric for evaluating clustering results. Therefore, in this section, we observe the performances of the both versions of AHC algorithm with the different input sizes.
The Number of Texts in 20NewsGroups II
The Number of Texts in 20NewsGroups II
The process of doing this set of experiments is same to that in the previous sets of experiments. We select the balanced number of texts from the collection over categories, and encode them into the representations with the input sizes which are identical to those in the previous set of experiments. Using the two versions of AHC algorithm, we cluster the 300 examples into the four clusters, identically to the previous set of experiments. We use the clustering index whose bases are the intra-cluster similarity and the inverse inter-cluster similarity, for evaluating the both versions of AHC algorithm. We evaluate the results from clustering items, using the labeled examples, following the external validity.
We present the experimental results from clustering the texts using the both versions of KNN algorithm on the specific version of 20NewsGroups. The frame of illustrating the classification results is identical to the previous ones. In each group, the gray bar and the black bar stand for the achievements of the traditional version and the proposed version, respectively. The y-axis in Fig. 6, indicates the clustering index which is used as the performance metric. In clustering texts, each of them is allowed to belong to only one cluster like the cases in the previous sets of experiments.

Results from Clustering Texts in Text Collection: 20NewsGroups II
Let us discuss on the results from clustering the texts on the specific version of 20NewsGroups, as shown in Fig. 6. The accuracies of the both versions range between 0.05 and 0.62. The proposed version shows its outstandingly better performance in the two smaller input sizes. However, it is leaded in the input size, 100, but it matches in the input size, 200. From this set of experiments, it is concluded that the proposed version shows its better performance by averaging over the accuracies of the four cases.
Conclusion Let us discuss the entire results from performing text clustering using the two versions of KNN algorithm. The both versions is compared with each other in the task of text clustering, in these sets of experiments. The proposed version show its better results in the three collections. The clustering indices of the traditional version range between 0.09 and 0.3, while those of the proposed version range between 0.12 and 0.62. From the three sets of experiments, we conclude that the proposed version improves the text clustering performance, as the contribution of this research.
The proposed approach should be applied and validated in the specialized domains: engineering, medicine, science, and law, and it should be customized to the suitable version. We may consider similarities among only some essential features rather than among all features, to cut down the computation time. We develop and combine various schemes of computing the similarities among features. By adopting the proposed approach, we will develop the text clustering system as a real version.
Footnotes
Acknowledgment
This work was supported by 2018 Hongik University Research Fund.
