Abstract
This article proposes the modified KNN (K Nearest Neighbor) algorithm which receives a string vector as its input data and is applied to the text summarization. The results from applying the string vector based algorithms to the text categorizations were successful in previous works and the text summarization is able to be viewed into a binary classification where each paragraph is classified into summary or non-summary. In the proposed system, a text which is given as the input is partitioned into a list of paragraphs, each paragraph is classified by the proposed KNN version, and the paragraphs which are classified into summary are extracted ad the output. The proposed KNN version is empirically validated as the better approach in deciding whether each paragraph is essential or not in news articles and opinions. We need to define and characterize mathematically more operations on string vectors for modifying more advanced machine learning algorithms.
Introduction
Introduction Text summarization refers to the process of selecting essential parts from a text automatically as its brief version. Each text is partitioned into sentences or paragraphs by the punctuation mark or the carriage return, respectively, and the text summarization may be viewed as a binary classification where each sentence or paragraph is classified into the essence part or not. Each sentence or paragraph is encoded into its structured form and prepare sentences or paragraphs labeled with essence or remaining as the samples. The system constructs the classification capacity by learning the labeled samples, classify novice sentences or paragraphs into essence or non-essence, and present the one which is classified into essence. We need to distinguish the automatic summarization from the manual summarization in that the manual summarization refers to the process of rewriting the entire text into its brief form.
Let us consider the motivations for doing this research. Since the text summarization task may be interpreted into a binary classification, it is regarded as an instance of text categorization [3]. In the traditional text categorization systems, texts were encoded into numerical vectors, which were very inefficient representations [18]. The numerical vectors as surrogates of texts have very poor transparency; they are unable to present their contents, symbolically [18]. Therefore, in this research, we encode each text into a string vector, in order to solve the problems.
Let us mention the proposals of this research as the solutions to the above problems. The text partitions which are paragraphs or sentences are encoded into string vectors, and the semantic similarity measure between string vectors is defined in this research. Using the similarity measure, we modify the KNN (K Nearest Neighbor) where a string vector is given as the direct input data. The modified version is used as the approach to the classification task which is mapped from the text summarization. Hence, in this research, the task of text summarization is regarded as the binary classification of each sentence or paragraph into an essence or a remaining one.
This research is expected to provide some benefits. By encoding the texts into the alternative representations to the numerical vectors, it is possible to avoid the above problems completely. It is more efficient and compact to encode texts into string vectors; the dimension of string vectors is usually only tens, while that of numerical vectors is usually hundreds, ten times for maintaining enough system robustness. The string vectors are more transparent than numerical vectors since they are characterized symbolically; we are able to guess the contents by referring only representations without their full sentences. However, as remaining tasks, we need to define more advanced operations on string vectors, in order to modify more advanced machine learning algorithms.
This article is organized into the five sections. In Section 2, we survey the relevant previous works. In Section 3, we describe in detail what we propose in this research. In Section 4, we validate empirically the proposed approach by comparing it with the traditional one. In Section 5, we mention the general discussion on the empirical validations and remaining tasks for doing the further research.
Previous works
PreviousWorks This section is concerned with the previous works which are relevant to this research. In Section 2.1, we explore the previous cases of applying the KNN algorithm to text mining tasks. In Section 2.2, we survey the schemes of encoding texts or words into structured data. In Section 2.3, we describe the previous machine learning algorithms which receive alternative structured data such as tables and string vectors to numerical vectors. Therefore, in this section, we provide the history about this research, by surveying the relevant previous works.
Using KNN algorithm to text mining tasks
Section02.01 This section is concerned with the previous cases of applying the KNN algorithm to text mining tasks. Classifying texts or words belong to a text mining task, and the KNN algorithm is adopted as the approach to the task in this research. The KNN algorithm belongs to the lazy learning algorithm which does not learn training examples in advance. The fact that the KNN algorithm is popularly used in classification tasks in any domain, as well as text categorization is the reason of adopting and modifying the KNN algorithm. In this section, we survey cases of applying the KNN algorithm to the word categorization, text categorization, and spam mail filtering.
Let us mention the previous cases of applying the KNN algorithm to the word categorization and its similar tasks. In 2001, Kim et al. translated words using the KNN algorithm between English and Korean [22]. In 2003, Pekar and Staab classified words into their synonyms, using the KNN algorithm [29]. In 2016, Stauffer et al. represented optimal images of handwritten words into graphs and applied the KNN to the word recognition [31]. The word categorization in this research is the task of classifying words based on their topics or meaning; it should be distinguished from ones which are mentioned in the above literatures.
Let us mention the previous cases of applying the KNN algorithm to the text categorization. In 2001, Sam et al. proposed the modified version of KNN which considers the feature importance for computing the similarity between numerical vectors [5]. In 2010, Khan et al. reviewed machine learning approaches including the KNN algorithm to the text categorization [21]. In 2014, Vishwanath et al. proposed the KNN version which computes the similarity between a test document and a class prototype and the Naive Bayes, as the approaches to text categorization [32]. Even recently, texts are still encoded into numerical vectors in using the KNN for the text categorization, in above literatures.
The spam mail filtering refers to a particular text categorization where each email is classified into spam or ham. In 2003, James et al. proposed the neural networks as the approach to the spam mail filtering and compared it with the KNN algorithm as the base approach [6]. In 2004, Lai and Tsai proposed the four machine learning algorithms including the KNN as the email categorization tools [23]. In 2010, Frite et al. used the KNN algorithm for the spam mail filtering with resampling methods [4]. In the above literatures, emails are regarded as texts and encoded into numerical vectors.
We survey the cases of using the KNN for the text mining tasks: word categorization, text categorization, and spam mail filtering. Textual data are encoded into numerical vectors in all above literatures. The KNN algorithm has been used as very popular approach to text mining tasks as well as any other kinds of classification tasks. Even if another approach has been proposed to the word and text categorization, it has been compared with the KNN algorithm. In this research, we adopt the KNN based on the above literatures and modify it into more suitable version for text mining tasks.
Encoding schemes
Section02.02 This is concerned with the previous works on encoding words or texts into structured data. It has been assumed that the input data are always given as numerical vectors in applying machine learning algorithms to applications in any domain. In almost cases of applying machine learning algorithms to text mining texts, texts or words are encoded into numerical vectors. In order to solve the problems in encoding so, there were previous trials of encoding them into alternative structured data to numerical vectors. In this section, we explore the previous schemes of encoding texts or words into structured data.
It is known that texts or words are usually encoded into numerical vectors, by surveying articles on machine learning approaches to text categorization. In 1995, Wiener encoded texts into approximately three hundreds dimensional numerical vectors, in applying the neural networks to the text categorization [33]. In 1999, Yang encoded texts absolutely into numerical vectors, in evaluating machine learning algorithms in apply them to the text categorization [34]. In 2002, Sebastiani surveyed the machine learning approaches to the text categorization, under the assumption of encoding texts into numerical vectors [30]. The fact that texts should be encoded into numerical vectors for doing the text mining tasks is confirmed through the above literatures.
There existed the previous trials of encoding text into tables, instead of tables for doing the text categorization and clustering. In 2008, Jo modified the single pass algorithm into its table based version for the text clustering [9]. In 2011, Jo invented the table marching algorithm as the method of categorizing texts in his patent [15]. In 2015, Jo improved the table matching algorithm into the stable and robust version [16]. The table becomes the alterative text representation to the numerical vector for doing the both tasks.
There also existed the previous trials of encoding texts into string vectors as one more alternative structured data for doing the both tasks. In 2009, Jo proposed the semantic similarity between string vectors for using the KNN and the SVM for ding the text categorization [11]. In 2010, Jo modified the k means algorithm into its string vector based version as the approach to the text clustering [12]. In 2015, Jo defined and characterized mathematically the numerical operations on string as the fundamental research for modifying other machine learning algorithms into their string vector based versions [17]. It requires to define and characterize more semantic operations for modifying more advanced machine learning algorithms.
In this research, we adopt the scheme where paragraphs are encoded into string vectors. In each string vector, the features indicate the relations between a text and a word, and the feature values do the text identifiers which satisfying the relations. We define the semantic similarity between two string vectors which is always given as a normalized value between zero and one. The similarity between words, becomes the basis for computing the similarity matrix. The KNN algorithm is modified as the approach to the text summarization by replacing the cosine similarity.
Specialized machine learning algorithms
This section is concerned with the previous works on approaches to the text mining tasks where texts are encoded into alternative representations to numerical vectors. In Section 2.3, we explored the previous schemes of encoding texts into structured data and the modified versions of existing machine learning algorithms using the scheme. In this section, we introduce the string kernel which is a kernel function on string in using the SVM to the text categorization, and mention the new neural networks, NTC (Neural Text Categorizer) and NTSO (Neural Text Self Organizer) for doing the text categorization and the text clustering. It takes very much time for applying the string kernel version to the text categorization and needs more operations for advancing the created neural networks. In this section, we explore cases of using the string kernel based SVM and the two creative neural networks to text mining tasks.
The string kernel was proposed as a kernel function in using the SVM for classification tasks, in order to avoid problems in encoding texts into numerical vectors. In 2002, Lodhi et al. initially proposed the string kernel, in order solve the problems in doing so [25]. In 2004, Leslie et al. applied the string kernel to protein classification where a protein is given as a string [24]. In 2006, Kate and Mooney used the string kernel for processing semantically sentences, instead of entire full texts [20]. The SVM with the string kernel works not successfully in the text categorization, but it works successfully in the protein classification.
The neural network, NTC (Neural Text Categorizer), was previously invented as a more suitable approach to the text categorization. In 2000, Jo initially proposed the NTC, but it used its weights which are fixed based on word frequencies [7]. In 2008 and 2010, Jo improved the NTC into the version where word weights are updated, and applied it to both the exclusive text categorizations and the soft ones [10, 13]. In 2012, Pawar and Gawande mentioned the NTC as innovative approach to the text categorization, and in 2015, Abainia et al. applied it for categorizing Arabic texts [1, 28]. In future, the NTC needs to be expanded from the swallow version into deep version.
The NTSO (Neural Text Self Organizer) was previously created as a more suitable approach to the text clustering. In 2005, Jo and Japkowicz invented initially as the approach to the text clustering [19]. In 2006, Zehng et al. mentioned the NTSO as one of the main text clustering tools [35]. In 2010, Jo validated empirically the NTSO performance by comparing it with the popular approaches such as the k means algorithm and the Kohonen Networks [14]. The NTSO will be modified into its supervised version and semi-supervised version in future, like the Kohonen Networks.
In this research, we set the text summarization as the task against which we challenge. In the string kernel based SVM, the lexical similarity between strings is computed, rather than semantic ones; it is more suitable for processing proteins than for doing texts. The two neural networks, NTC and NTSO, take actually much computation time and are very complicated with respect to implementations; they are usually used for implementing heavy versions of text categorization and clustering systems. It requires the mathematical definition and characterizations on more operations on strings for expanding them. Therefore, in this research, we propose the modified KNN version for implementing light versions.
Proposed approach
ProposedApproach This section is concerned with encoding words into string vectors, modifying the KNN (K Nearest Neighbor) into the string vector based version and applying it to the text summarization, and consists of the three sections. In Section 3.1, we deal with the process of encoding texts into string vectors. In Section 3.2, we describe formally the similarity matrix and the semantic operation on string vectors. In Section 3.3, we focus on the process of applying the KNN to the given task with viewing it into a classification task.
Text encoding
We deal with the process of mapping texts into string vectors, in this section. Figure 1 shows the basic three steps which are involved in encoding texts. A text is given as the input and a string vector which is composed of words is generated as the output. The features of string vectors are defined as posting, static, and grammatical properties. Therefore, we describe in detail how to encode a text into a string vector.

The Process of Text Encoding.
As the first step of encoding a text into a string vector, the text is indexed into a list of words. The text is segmented into tokens by white spaces, punctuation marks, and other special characters. Each token is changed into its root form through the stemming process. Among stemmed tokens, grammatical words which are called stop words are removed. After doing so, posting, statistical, and grammatical information are assigned to each remaining word.
In order to map a text into a string vector, we need to define features of string vector. In previous works on the string vector based machine learning algorithms, word frequencies were used for defining the features [7]. In this research, we proposes the three types of string vector features: posting features, statistical features, and grammatical features. The posting features are ones which indicate where words are positioned in the text, the statistical features are ones which indicate how many words occurs in the text, and the grammatical features are ones which indicate what words play role of. Because in the proposed system, the POS tagging module is not available, we use the posting and the statistical information about words as string vector features. We define the string vector features as follows:
Highest Frequent Word in the given Text Second Highest Frequent Word in the given Text Third Highest Frequent Word in the given Text Highest TF-IDF Weighted Word Second Highest TF-IDF Weighted Word Third Highest TF-IDF Weighted Word The Last Word in the Text The First Word in the Last Paragraph The Last Word in the First Paragraph
Once the above features are defined, we explain the process of assigning words to features. In the above process, a text is indexed into a list of words, together with their information. We defined the nine items as string vector features manually and, from the text the system takes words which correspond to the nine features. The nine words become the elements of the string vector. The dimension of string vectors which represent texts becomes nine, following the number of above features.
In the current research, the nine features are defined manually, but it is feasible, because the number of features is small. However, in encoding texts into numerical vectors, the features are defined, automatically; it is the trade off between the two encoding schemes. Even if the dimension of string vector is very smaller, the classification performance is better than the traditional classification systems where the raw data is encoded into numerical vectors. The reason is that numerical vectors representing texts tend to be sparse, but the string vector is very compact. In next research, we will consider schemes of defining the string vector features, automatically.
StringVectors This section is concerned with the operation on string vectors and the basis for carrying out it. It consists of two subsections and assumes that a corpus is required for performing the operation. In Section 1, we describe the process of constructing the similarity matrix from a corpus. In Section 1, we define the string vector formally and characterize the operation mathematically. Therefore, this section is intended to describe the similarity matrix and the operation on string vectors.
Similarity matrix
SimilarityMatrix Before explaining the operation on string vectors, we focus on the similarity matrix which is basis for executing it. The similarity matrix is constructed from a corpus as the square matrix. Its rows and columns correspond to words which are available in the corpus. The entries of the similarity matrix becomes the semantic similarity between two corresponding words. Therefore, in this section, we describe formally the similarity matrix.
In the similarity matrix, each entry indicate the semantic similarity between two words. The two words are notated by t
i
and t
j
, and they are viewed as the two sets of texts, T
i
and T
j
. The similarity between the two words is computed by Equation (1),
Assuming that there are N words in the corpus, we build the N × N square matrix, as follows:
We try to characterize the similarity matrix, mathematically. In the diagonal positions of the similarity matrix, the entry is always given as one, since the corresponding words are exactly same to each other. In the off-diagonal position, the entry value is a continuous real number between zero and one. The semantic similarity between two words which is computed by Equation (1) is always given as a normalized value. The similarity matrix is symmetry and it is proved as follows:
It takes much time to build the similarity matrix from a corpus. Once the matrix is built, it may be used continually, if the corpus is fixed. When the corpus is changed continually, the system becomes infeasible, if the similarity matrix is rebuilt, accordingly. When some texts are added or deleted, we need to consider the scheme of updating the similarity matrix, in the next research. Since the similarity matrix is usually big, we may consider using small distributed similarity matrix as basis for carrying out the operation.
StringVector and Semantic Similarity We mention the operation on string vectors which is called semantic similarity. In the traditional classification system, we used the cosine similarity or the Euclidean distance as the operation on numerical vectors. Using the string vector as the input vector was proposed by Jo in 2000 [7], so very few operations are defined. This research defines the semantic similarity between two string vectors and uses it for modifying the KNN algorithm. In this section, we will describe the operation formally.
A string vector refers to a finite ordered set of strings as follows:
Let us mention the semantic similarity between two string vectors which corresponds to the cosine similarity between two numerical vectors. The two string vectors are denoted as follows:
where each element, d1i and d21i indicates a text identifier. The operation is defined as Equation (2) as follows:
Let us characterize mathematically the operation on the string vectors. The commutative law applies as follows:
However, note that the transitive rule does not apply as follows:
Many operations were defined and characterized mathematically to numerical vectors, while very few operations were done to string vectors. In this research, only semantic similarity is defined and used for modifying the KNN algorithm. We need more advanced operations on string vectors for modifying other machine learning algorithms. In modifying the k means algorithm or the k medoid algorithm, we need the operation which correspond to averaging over numerical vectors. After this research, we try to modify the k means algorithm into its string vector based version.
Applicationto Text Summarization We mention the strategy of applying the proposed KNN version to the text summarization, in this section. The task should be mapped into a classification task where each item is classified into summary or non-summary, before applying it. We gather sample paragraphs which were labeled with summary or non-summary, and encode them into structured forms. The paragraphs which are classification targets are also texts; the classification which is mapped from the text summarization should be distinguished from the text categorization. Therefore, the text summarization is interpreted as a binary classification and the application scheme will be mentioned.
Figure 2 shows the proposed version of KNN where raw data is encoded into string vectors. The labeled sample texts are gathered and encoded into string vectors. A novice text is given as classification target, and encoded into a string vector. The KNN computes its similarities with the training string vectors by the process which was described in Section 1. It selects most similar training ones as its nearest neighbor, and decides by voting the labels of nearest neighbors. Since string vectors are compact representations, the proposed KNN version is expected be more robust than the traditional version.

The Proposed Version of KNN.
The text summarization is mapped into a binary classification, as shown in Fig. 3. A text is given as the input, and it is partitioned into paragraphs by carriage return. Each paragraph is classified into either of the two categories: ‘essence’ and ‘not’. The paragraphs which are classified into ‘essence’ are selected as the output of the text summarization system. For doing so, we need to collect paragraphs which are labeled with one of the two labels as sample examples, in advance.

View of Text Summarization into Binary Classification.
Let us consider the process of collecting sample paragraphs for doing the binary classification. The text summarization is mapped into the binary classification, as mentioned above. As shown in Fig. 4, a particular paragraph may be regarded differently, depending on the given domain; a paragraph should be summary in a particular domain, but it should be non-summary in another. The text collection from which we collect paragraphs should be partitioned into sub-collections by the domain. Here, the text which is given as the input is assumed to be tagged with its own domain.

Text Summarization: Domain Dependent Classification.
Let us consider the process of applying the KNN to the text summarization which is mapped into a classification. A text is given as the input, and the classifier which corresponds to the subgroup which is most similar to the given text with respect to its content is selected. The text is partitioned into paragraphs, and each paragraph is classified into ‘essence’ or ‘not’ by the classifier. The paragraphs which are classified into ‘essence’ are extracted as results from summarizing the text. Note that the text is rejected, if all paragraphs are classified into ‘not’.
In the text categorization, a text is classified into one or some among the predefined categories, independently of domains. However, the text summarization, a paragraph should be regarded as a summary or not, depending on the given domain. This is the reason of partitioning the text collection into sub-collections domain by domain and assuming that each text is tagged with its own domain. The domain granularity which is the degree of how much specific each domain is decides the text summarization performance. In too specific domains, we expect good performance, but it is difficult to obtain sample paragraphs.
Experiments This section is concerned with the empirical experiments for validating the proposed version of KNN, and consists of the five sections. In Section 1, we present the results from applying the proposed version of KNN to the text summarization on the collection, NewsPage.com. In Section 1, we show the results from applying it for classifying paragraphs into summary or not, from the collection, Opinosis. In Section 1 and 1, we mention the results from comparing the two versions of KNN with each other in the task of text summarization from 20NewsGroups.
NewsPage.com
SectionFourOne This section is concerned with the experiments for validating the better performance of the proposed version on the collection: NewsPage.com. We interpret the text summarization into the binary classification where each paragraph is classified into summary or non-summary, and gather the paragraphs which are labeled with one of the two categories, from the collection, topic by topic. Each paragraph is classified exclusively into one of the two labels. We fix the input size as 50 dimensions of numerical vectors, and use the accuracy as the evaluation measure. Therefore, this section is intended to observe the performance of the both versions of KNN in the four different domains.
In Table 1, we specify the text collection, NewsPage.com, which is used in this set of experiments. The collection was used for evaluating approaches to text categorization tasks in previous works [16]. In each category, we extract 250 paragraphs and label them with summary or non-summary, keeping the complete balance over the two labels. In each category, the set of 250 paragraphs is partitioned into the training set of 200 paragraphs and the test set of 50 ones. Each text is segmented into paragraphs by a carriage return, and they are corrected manually, in the process of extracting paragraphs.
The Number of Texts and Paragraphs in NewsPage.comTableThreeOne
The Number of Texts and Paragraphs in NewsPage.comTableThreeOne
Let us mention the experimental process for validating empirically the proposed approach to the task of text summarization. We collect the sample paragraphs which are labeled with summary or non-summary in each of the four topics: Business, Sports, Internet, and Health, and encode them into numerical and string vectors. For each of 50 examples, the KNN computes its similarities with the 200 training examples, and selects the three similarity training examples as its nearest neighbors. This set of experiments consists of the four independent binary classifications each of in which each paragraph is classified into one of the two labels by the two versions of KNN algorithm. We compute the classification accuracy by dividing the number of correctly classified test examples by the number of test examples, for evaluating the both versions.
In Fig. 5, we illustrate the experimental results from deciding whether each paragraph is a summary, or not, using the both versions of KNN algorithm. The y-axis indicates the accuracy which is the rate of the correctly classified examples in the test set. Each group in the x-axis means the domain within which the text summarization which is viewed as a binary classification is performed, independently. In each group, the gray bar and the black bar indicate the accuracies of the traditional version and the proposed version of the KNN algorithm. The most right group in Fig. 1 consists of the averages over the accuracies of the left four groups, and the input size which is the dimension of numerical vectors is set to 50.

Results from Summarizing Texts in Text Collection: NewsPage.com.
Let us make the discussions on the results from doing the text summarization, using the both versions of KNN algorithm, as shown in Fig. 5. The accuracy which is the performance measure of this classification task is in the range between 0.46 and 0.52. The proposed version of KNN algorithm works strongly better in the two domains: Health and Sports. It matches with the traditional version in the domain, Internet, but it loses in the domain, Business. In spite of that, from this set of experiments, we conclude the proposed version works much better than traditional one, in averaging over the four cases.
SectionFourTwo This section is concerned with the set of experiments for validating the better performance of the proposed version on the collection, Opinosis. We view the text summarization into a binary classification where each paragraph is classified into summary or non-summary, and collect the paragraphs, labeling manually with one of summary and non-summary from the collection. Each paragraph is exclusively classified into one of the two labels. We fix the input size to 50 and use the accuracy as the evaluation measure. In this section, we observe the performance of the both versions of KNN algorithm, in the three experiments as many as topics.
In Table 2, we specify the text collection, Opinosis, which is used in this set of experiments. The collection was used in previous works for evaluating the approaches to text categorization. We extracted the 50 paragraphs in each topic, and they are labeled with ‘summary’ or ‘non-summary’, keeping the complete balance over the labels. The 50 paragraphs is partitioned into the 40 as the training set and the 10 as the test set, in each topic. Each text is segmented into paragraphs by the carriage return, and some of them are corrected, in the processing of extracting paragraphs from texts.
The Number of Texts and Paragraphs in OpiniopsisTableThreeTwo
The Number of Texts and Paragraphs in OpiniopsisTableThreeTwo
We perform this set of experiments by the process which is described in section 1. We collect sample paragraphs which are labeled with ‘summary’ and ‘non-summary’ in each of the three domains: ‘Car’, ‘Electronics’, and ‘Hotel’, and we encode them into 50 sized numerical vectors and string vectors. For each test example, the both versions of KNN computes its similarities with the 40 training examples and select the three most similar training examples as its nearest neighbors. Each test example is classified into ‘keyword’ or ‘non-keyword’ by the two versions of KNN algorithm; we performed the three independent experiments as many as the domains. The classification accuracy is computed by the number of correctly classified test examples by the number of the test examples for evaluating the both versions of KNN algorithm.
In Fig. 6, we illustrate the experimental results from the text summarization which is mapped into a classification task, using the both versions of KNN algorithm. Like Fig. 5, the y-axis indicates the value of accuracy, and the x-axis indicates the group of two versions by a domain of Opniopsis. In each group, the gray bar and the black bar indicate the results of the traditional version and the proposed version of KNN algorithm. In Fig. 6, the most right group indicates the averages of the both version over their results of the left three groups. Therefore, Fig. 6 shows the results from classifying paragraphs into one of ‘summary’, and ‘non-summary’, by the both versions.

Results from Summarizing Texts in Text Collection: Opiniopsis.
We discuss the results from doing the text summarization which is mapped into a binary classification, using the both versions of KNN algorithm on Opinosis, shown in Fig. 6. While the accuracy values of the traditional version stay at 0.5, those of the proposed version range between 0.5 and 0.6. The proposed version works better than the traditional one, in the two domains: Car and Electronics. It is leaded by the traditional one in the domain, Hotel; its accuracy is 0.4. From this set of experiments, we conclude that the proposed one works slightly better in averaging the three cases.
SectionFourThree This section is concerned with one more set of experiments for validating the better performance of the proposed version on text collection, 20NewsGroup I. We gather paragraphs which are labeled with ‘summary’ or ‘non-summary’, from each broad category of 20NewsGroups I, by viewing the text summarization into a binary classification. The task of this set of experiments is to classify each paragraph exclusively into one of the two labels in each topic which is called domain. We fix the input size to 50 in encoding the paragraphs and use the accuracy as the evaluation measure. Therefore, in this section, we observe the performances of the both versions in the four different domains.
In Table 3, we specify the general version of 20NewsGroups which is used for evaluating the two versions of KNN algorithm. In 20NewsGroup, the hierarchical classification system is defined with the two levels; in the first level, the six categories, alt, comp, rec, sci, talk, misc, and soc, are defined, and among them, the four categories are selected, as shown in Table 3. In each category, we extract 250 paragraphs from 4000 or 5000 texts; the first half is labeled with ‘summary’, and the other half is labeled with ‘non-summary’. The 250 paragraphs is partitioned into the 200 ones in the training set and the 50 ones in the test sets, as shown in Table 3. In the process of gathering the classified paragraphs, each of them is labeled manually into one of the two categories by scanning individual texts.
The Number of Texts and Paragraphs in 20NewsGroups ITableThreeThree
The Number of Texts and Paragraphs in 20NewsGroups ITableThreeThree
The experimental process is identical is that in the previous sets of experiments. We collect the paragraphs by labeling manually them with ‘summary’ or ‘non-summary’ by scanning individual texts in each of the four domains, comp, rec, sci, and talk, and encode them into numerical vectors with the input size fixed to 50. For each test example, we compute its similarities with the 200 training examples, and select the three similar ones as its nearest neighbors. The versions of KNN algorithm classify each of the 50 test examples into one of the two categories by voting the labels of its nearest neighbors. Therefore, we perform the four independent set of experiments as many as domains, in each of which the two versions are compared with each other in the binary classification task.
In Fig. 7, we illustrate the experimental results from deciding whether each paragraph is a summary, or not, on the broad version of 20NewsGroups. Fig. 7 has the identical frame of presenting the results to those of Fig. 5 and 6. In each group, the gray bar and the black bar indicates the achievements of the traditional version and the proposed version of KNN algorithm, respectively. In the x-axis, each group indicates the domain within which each paragraph is classified into ‘summary’, or ‘non-summary’. This set of experiments consists of the four binary classifications in each of which it is done so.

Results from Summarizing Texts in Text Collection: 20NewsGroup I.
Let us discuss the results from doing the text summarization using the both versions of KNN algorithm as shown in Fig. 7. The accuracies of both versions range between 0.48 and 0.85. The proposed version shows its better performances in all of the four domains. It keeps its competitive performance in the other. From this set of experiments, the proposed version wins over the traditional one, certainly, in averaging its achievements of the four domains.
SectionFourFour This section is concerned with one more set of experiments where the better performance of the proposed version is validated on another version of 20NewsGroups. From each specific topic, separately, we gather the paragraphs which are labeled with ‘summary’ or ‘non-summary’. In this set of experiments, we view the text summarization into a binary classification, and carry out the four binary classifications, independently of each other. We fix the input size of representing paragraphs to 50 and use the accuracy as the evaluation metric. Therefore, in this section, we observe the performances of the both versions of KNN algorithm in the four different domains.
In Table 4, we specify the specific version of 20NewsGroups which is used as the test collection, in this set of experiments. Within the general category, sci, we predefine the four categories: ‘electro’, ‘medicine’, ‘script’, and ‘space’. In each topic, we extract 250 paragraphs from approximately 1000 texts and label each of them with ‘summary’ or ‘non-summary’, maintaining the complete balance. The set of 250 paragraphs is partitioned into the training set of 200 ones and the test set of 50 ones, as shown in Table 4. We use the accuracy as the metric for evaluating the results from classifying paragraphs.
The Number of Texts and Paragraphs in 20NewsGroups II
The Number of Texts and Paragraphs in 20NewsGroups II
The process of doing this set of experiments is same to that in the previous sets of experiments. We gather sample paragraphs which are labeled with ‘summary’ or ‘non-summary’, in each of the four domains: ‘electro’, ‘medicine’, ‘script’, and ‘space’, and encode them with the fixed input size: 50. We use the two versions of KNN algorithm for their comparisons. Each test paragraph is classified into one of the labels in each domain. We use the accuracy as the evaluation metric.
We present the experimental results from classifying the paragraphs using the both versions of KNN algorithm on the specific version of 20NewsGroups. The frame of illustrating the classification results is identical to the previous ones. In each group, the gray bar and the black bar stand for the achievements of the traditional version and the proposed version, respectively. The y-axis in Fig. 8, indicates the classification accuracy which is used as the performance metric. In this set of experiments, we execute the four independent classification tasks which correspond to their own domains, where each paragraph is classified into ‘summary’ or ‘non-summary’.

Results from Summarizing Texts in Text Collection: 20NewsGroup II.
Let us discuss the results from classifying the paragraphs using the both versions of KNN algorithm on the specific version of 20NewsGroups, as shown in Fig. 8. The accuracies as the performance metrics of this classification task which is mapped from the text summarization range between 0.46 and 0.75. The proposed version shows its better results in three of the four domains. It maintain its matching results in the domain, ‘medicine’. From this set of experiments, it is concluded that the proposed version have its better performance by averaging over the accuracies of the four domains.
Conclusion Let us discuss the results from summarizing texts using the two versions of KNN algorithm. In these sets of experiments, we compare the two versions with each other in the classification tasks which is mapped from the text summarizations. The proposed version shows its better results in all of the four collections. The classification accuracies of the traditional version range between 0.46 and 0.55, while those of the proposed version range between 0.52 and 0.70. From the four sets of experiments, we conclude that the proposed version improves the text summarization performance, as the contribution of this research.
We need to consider the remaining tasks for doing the further research. We will apply and validate the proposed approach in summarizing a text in the specific domains such as medicine, engineering, and law, rather than the general domains. In order to improve the performance, we may consider various types of features of string vectors. As another scheme of improving the performance, we define and combine multiple similarity measures between two string vectors with each other. By adopting the proposed approach, we may implement the text summarization system as a module or an independent system.
Footnotes
Acknowledgment
This work was supported by 2018 Hongik University Research Fund.
