Abstract
A vector space model (VSM) composed of selected important features is a common way to represent documents, including patent documents. Patent documents have some special characteristics that make it difficult to apply traditional feature selection methods directly: (a) it is difficult to find common terms for patent documents in different categories; and (b) the class label of a patent document is hierarchical rather than flat. Hence, in this article we propose a new approach that includes a hierarchical feature selection (HFS) algorithm, which can be used to select more representative features with greater discriminative ability to represent a set of patent documents with hierarchical class labels. The performance of the proposed method is evaluated through application to two document sets with 2400 and 9600 patent documents, where we extract candidate terms from their titles and abstracts. The experimental results reveal that a VSM whose features are selected by a proportional selection process gives better coverage, while a VSM whose features are selected with a weighted-summed selection process gives higher accuracy.
1. Introduction
Vector space models (VSMs) commonly represent documents as vectors with multiple features (i.e. terms). As there are thousands upon thousands of unique terms in a document set, it is almost impossible to construct a VSM using all of these terms. It is instead more efficient to use fewer terms to build a VSM to represent a document [1, 2]. A method for reducing the terms, a so-called dimensionality reduction method, is needed. The methods for the dimensionality reduction process can be further divided into two categories: feature selection and feature extraction [3]. This article designs a feature-selection method for reducing the dimensions in the term space to build a more efficient VSM.
Generally, there are two types of methods for feature selection: (a) simple feature selection and (b) feature selection with class labels. The main difference between them is whether they have class labels or not. In the scenario of simple feature selection methods, no document has a class label. The objective of a simple feature selection method is to select those features (i.e. terms) with more discriminative or representative abilities. On the other hand, feature selection methods with class labels can be applied to a set of documents with class labels. An example of a class label contained in a document set would be ‘food’ or ‘3C products’. In other words, some documents in the document set will belong to the ‘food’ class, while others will belong to the ‘3C product’ class. Terms like chocolate cake or noodles might be appropriately classified as representing documents in the food class since these terms are mostly used to describe food. On the other hand, terms such as mobile phones or laptops might be most appropriately classified as representing documents in the 3C product category, since these terms are mostly used to describe 3C products. The objective of this type of method is not only to select features with more discriminative or representative abilities, but also to increase categorization accuracy.
The target data used in this research is a set of patent documents. A patent document is quite similar to a general document except that the latter contains only unstructured data (i.e. text), while the former contains both unstructured and structured data. There are thus two main differences between a general document and a patent document: (a) it is difficult to find common terms for patent documents in different categories; and (b) the class labels of patent documents are hierarchical rather than flat. These differences are part of the special characteristics of patent documents.
Since patent documents are meant to describe the full scope of an industrial product innovation, the terms used in patent documents are not limited to a small scope, but encompass a full scale. Since different terms are used in different domains, it is not possible to use the terminology used in one domain to describe a patent document from another. For example, the terms used in patents in the ‘Textiles and Paper Section’ are naturally very different from those used in the ‘Electricity Section’. This lack of intersection makes it difficult to build a base of common terms (i.e. VSM) for all types of patent documents. Without a common term base, the vector representing a patent will degenerate into a zero-vector.
The International Patent Classification (IPC) hierarchy is the natural class labelling system for organizing patent documents. An IPC code can be viewed as a class label in the hierarchy. Class labels for patent documents are different from those used for tagging general documents. The latter is usually a flat label, while the former is part of the hierarchical structure. This hierarchical structure is divided into five levels: section, class, subclass, group and subgroup [4, 5]. Thus, IPC codes are hierarchical class labels. Therefore, when selecting the most representative terms for patent documents one should consider the discrimination ability both on the lower and upper levels. The advantage of ensuring that the discrimination abilities of a term cover all five levels is that, when these terms are used as VSMs for patent document classification, the predicted class label can hit a class label on an upper level, even though it might not hit a class label on the bottom level. It is better to hit a class label on a higher level than not at all.
Traditionally, selection of discriminative terms can be performed using the TFIDF method [6], the entropy method [7, 8] or the chi-square (χ2) method [9, 10]. It is usually assumed in these methods that the document set consists of general documents with flat labels. Even though these methods have also been used to select features for patent documents with IPC labels [11, 12], they still utilize general class labels with a flat structure. For example, Tseng et al. [11] used categories such as ‘FED’ and ‘Device’ to calculate the value of the correlation coefficient. Xue et al. [12] classified patents into four categories: ‘controlling and regulating device’, ‘cylinder and motor device’, ‘muffle and shock absorption device’ and ‘safety device’ in their experiment. The unique hierarchical feature of the system has never previously been considered as a way to improve classification accuracy for the upper or all levels. In other words, previously, the same feature selection methods as used for general documents have been applied to patent documents, and the hierarchical feature of patent documents has been neglected. To overcome this problem, a new feature selection method is needed that is capable of selecting the most discriminative terms for presenting the patent documents with the hierarchical class structure. Thus, this current work aims to modify the traditional feature selection method to do just that.
The above motivates us to ask the following research question: how do we select the most discriminative terms with the consideration of the hierarchical class structure for patent documents? Accordingly, a new method is designed to select terms for representing patent documents based on the discrimination abilities at all levels.
The advantages of the proposed method are: (a) the term’s discriminative abilities are considered at all levels while selecting features; and (b) the terms selected may be more general than those selected when only considering the abilities at the bottom level. This improved generality means that the more common terms can be used for patent classification in different categories. In turn, more patent documents can be successfully represented as non-zero vectors. For example, if term ‘A’ and term ‘B’ are the most representative words on level five (the bottom level) and level four, then term ‘B’ is very likely to appear in patents more often than ‘A’. So, if we include ‘B’ in the vector base, we will have more patents whose vectors have non-zero weights for element ‘B’. This can alleviate the problem of zero-vectors in representing patents as vectors.
The primary contribution of this article is discussed below. A new approach to selecting representative features is proposed based on the special characteristics of patent documents. The proposed approach takes the generality of a term and its discrimination on every level into consideration. Moreover, we also design an evaluation procedure based on the special characteristics of patent documents to evaluate the performance of the proposed approach. In the experiment, the proposed approach, the hierarchical feature selection (HFS) algorithm, is used to select the terms. Using these terms to build a VSM we can then improve the correctness of the document matching process for documents containing hierarchical class labels (i.e. patent documents). In addition, it can also improve the coverage of patents – specifically, the percentage of patents which can be represented as non-zero vectors.
2. Literature review
The following sub-sections review the relevant literature. We first introduce some background knowledge of the VSM. After a brief introduction of the VSM, the reasoning and the methodology for feature selection are demonstrated. After this, general information related to patent documents and the classification hierarchy is provided. Other relevant studies and information regarding patent classification are introduced in the final sub-section of the literature review.
2.1. VSM and feature selection
The VSM was originally introduced by Salton et al. in 1975 [13] for indexing and retrieving information, and is now widely used in text mining practices and document retrieval systems. All the documents and queries in a VSM can be represented by vectors. The VSM is also a popular method in information retrieval (IR) and is adopted to represent patent documents in this research. In addition, documents that are represented by vectors can be compared through the measurement of similarity, distance or correlation between the vectors [1, 8, 13, 14]. Besides the VSM, there are other methodologies for representing documents. One of them, proposed by Jiang et al. [15], is a graph-based method. Another represents patent documents as vectors of IPC codes to avoid the problem of term selection [16].
The accuracy and performance of a VSM are influenced by the selected indexing vocabulary. In order to build a more accurate and efficient indexing vocabulary, it is important to conduct a dimensionality reduction process. According to Li et al. [9], a major challenge in VSM methods is the high dimensionality of the feature space. Feature selection and feature extraction are two ways to simplify, refine, and obtain good attributes to build an indexing vocabulary. Trappey and Trappey [17] indicated numerous methods for selecting index terms and reducing the dimensions of the created indexing vocabulary: term frequency (TF) [1, 17], term frequency with inverse document frequency (TFIDF) [18, 19], variants of TFIDF [6], entropy [7, 8], chi-square (χ2) [9, 10], information gain and text clustering with feature selection (TCFS) [9].
However, these previous methods have been designed for documents without class labels or documents with flat class labels. This explains why it is necessary to design a new approach for selecting terms for documents with hierarchical labels. Thus we design a method of feature selection for patent documents which preserves their heterogeneous and informative attributes, by considering the hierarchical class labels within patent documents.
2.2. Patent documents and their classification hierarchy
A general document is simply a textual article containing many paragraphs, sentences and words. A patent document is similar to a general document, but includes rich and varied technical information as well as important research results [11, 20]. The difference between a patent document and a general document is that the former contains both unstructured data (i.e. textual articles) and semi-structured data. Patent documents can be retrieved from a patent database, such as the United States Patent and Trademark Office (USPTO). The USPTO issues over 150,000 patents each year to companies and individuals worldwide, and, as of February 2008, has granted over 7,950,000 patents [21].
The International Patent Classification (IPC) is an attribute within the patent document. It was first established by the Strasbourg Agreement in 1971. Information about the IPC is currently published by the World Intellectual Property Organization (WIPO). This officially published IPC structure is well constructed and has strong public confidence. The classification has a five-level hierarchical structure comprised of section, class, subclass, group and subgroup [4]. ‘H01L027/18’ is an example of an IPC code, where ‘H’ represents a section, ‘01’ means a class, ‘L’ means a subclass, ‘027’ represents a main group and ‘18’ represents a subgroup. When we predict the label of a patent, hitting a class label on a higher level is better than not at all. For example, consider a patent document with an IPC code of ‘H01L027/01’. Suppose we have two incorrect classification results: H01L021/06 and H01L027/04. Although both predictions are wrong, the latter is better than the former, because the latter hits the label from the root to main group H01L027 but the former only to subclass H01L. Kang et al. [22] notes that the IPC is a standard taxonomy for sorting, organizing, classifying, determining and searching patent documents. Thus, the IPC hierarchy can be viewed as a topic label or a class label regarding the contents.
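The comparison of the two incorrect predictions above can be sketched in code. This is a minimal illustration, assuming IPC codes are written in the 'H01L027/18' style used in this article; the regular expression and the function names `ipc_levels` and `matched_depth` are hypothetical helpers, not part of any IPC tooling.

```python
import re

# Assumed layout: section letter, 2-digit class, subclass letter,
# main group digits, '/', subgroup digits (e.g. 'H01L027/18').
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def ipc_levels(code):
    """Return the cumulative labels for section, class, subclass, group, subgroup."""
    match = IPC_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognised IPC code: {code!r}")
    section, cls, subclass, group, subgroup = match.groups()
    return [
        section,
        section + cls,
        section + cls + subclass,
        section + cls + subclass + group,
        section + cls + subclass + group + "/" + subgroup,
    ]


def matched_depth(true_code, predicted_code):
    """Count the levels, starting from the root, on which two codes agree."""
    depth = 0
    for true_label, predicted_label in zip(ipc_levels(true_code), ipc_levels(predicted_code)):
        if true_label != predicted_label:
            break
        depth += 1
    return depth


# The two incorrect predictions discussed in the text.
depth_first = matched_depth("H01L027/01", "H01L021/06")   # agrees down to subclass H01L
depth_second = matched_depth("H01L027/01", "H01L027/04")  # agrees down to main group H01L027
```

Under these assumptions the second prediction matches one level deeper than the first, which is the sense in which it is the better of the two wrong answers.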
2.3. Patent classification
Patent mining consists of patent retrieval, patent classification and patent clustering. Previous studies have shown that classifying patent documents can be done automatically through various methods. Some of those methods, for example, are the k-nearest-neighbour classifiers and Bayesian classifiers [23, 24], machine learning algorithms [25], the k-nearest-neighbour with patent’s semantic structure [26], back-propagation network [27], or the artificial neural network with pre-constructed ontology schemas [28]. Generally, traditional text mining approaches have been employed to build vectors for patent document representation, after which patent classification can be conducted through vector similarity calculation or other supporting techniques.
The goal of this current work is not to solve the patent document classification problem but rather to find a better feature selection method for producing vectors. This method should demonstrate better classification results than using a standard method to predict the IPC label. We use patent classification as a means to prove the superiority of our proposed method.
3. Problem definition
The goal of the design is to select the most representative terms for the patent document set and to construct a VSM for these patent documents. The problem is defined in the following paragraphs.
According to the hierarchical structure of the IPC codes, there are five levels in the hierarchy (see Figure 1). Each level can be denoted as la (a = 1 to 5). Since there are five levels, the set of IPC codes in each level is denoted as Ca, and the kth class label in set Ca is denoted as cak. For example, C1 is the set of IPC codes in the section level (i.e. l1) within the hierarchy, and c11 is the first IPC code in set C1. Similarly, c5k is the kth IPC code in the subgroup level (i.e. l5) and c51 can be ‘A01B001/02’. The IPC codes are viewed as hierarchical class labels in this article.

Figure 1. Hierarchical structure of the IPC codes.
A set of patent documents is denoted as D = {d1, d2,…, dn}, where n is the total number of patent documents in set D. Every patent document in set D is denoted as dj (j = 1 to n). Naturally, many patent documents belong to every single IPC code. The relationships among patent documents and class labels can now be expressed as:
$$D(c_{ak}) = \{\, d_j \in D \mid d_j \text{ has class label } c_{ak} \,\}$$
which means all patent documents dj with the class label cak. For example, document d1 belongs to the classes labelled c51 and c52, and documents d2 and d3 belong to the class labelled c52. The relations among these three patent documents and the two class labels are illustrated in Figure 2. In other words, they can be denoted as D(c51) = {d1} and D(c52) = {d1, d2, d3}, respectively.

Figure 2. Relationships among patent documents and class labels.
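The mapping from class labels to documents in this example can be sketched as an inverted index. The dictionary `doc_labels` simply restates the three-document example from the text; the function name `docs_by_label` is a hypothetical helper introduced for illustration.

```python
from collections import defaultdict

# The example from the text: d1 carries two labels, d2 and d3 carry one each.
doc_labels = {
    "d1": ["c51", "c52"],
    "d2": ["c52"],
    "d3": ["c52"],
}


def docs_by_label(labels_of_doc):
    """Build D(c): for every class label c, the set of documents carrying it."""
    index = defaultdict(set)
    for doc, labels in labels_of_doc.items():
        for label in labels:
            index[label].add(doc)
    return dict(index)


D = docs_by_label(doc_labels)
```

Here `D["c51"]` and `D["c52"]` play the role of D(c51) and D(c52).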
The terms contained in patent documents are the features to be selected in this research. The set of terms is denoted as T = {t1, t2,…, tq}, where q is the total number of terms in set T. The variable ti indicates the ith term in set T. Generally, a patent document consists of multiple terms, and a single term may appear in several different patent documents. Since there are numerous terms in a patent document and numerous patent documents in the document set, there will be thousands of terms ti in set T, representing all the patent documents in set D.
Let RT be the set of representative terms in the set of patent documents. Given a set of patent documents D, the proposed approach is to select the r most representative terms to store in RT. Thus, only r representative terms remain after selection.
4. Proposed HFS algorithm
The aim of the HFS algorithm is to provide a procedure for selecting terms with higher representative abilities and greater discriminative abilities. The fact that the class labels (i.e. IPC codes) representing patent documents are not arranged in the format of a flat structure but a hierarchical structure is considered in the design of the HFS algorithm.
In the procedure, the so-called ‘hp-entropy’ formula is designed based on the traditional entropy formula as follows:
$$\mathrm{Entropy}(D_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where pi is the probability of class Ci in D1, determined by dividing the number of tuples of class Ci in D1 by |D1|, the total number of tuples in D1, and m is the number of classes [29].
In this article, the proposed hp-entropy method is used to calculate the hierarchical entropy for a term ti on level la as follows:
$$\text{hp-entropy}(t_i, l_a) = -\sum_{k=1}^{m} \frac{n_{iak} + 1}{n_{ak} + 1} \log_2 \frac{n_{iak} + 1}{n_{ak} + 1}$$
where niak means the total number of documents containing term ti with class label cak; and nak means the total number of documents with class label cak. The variable m is the total number of class labels in level la. Moreover, both the denominator (i.e. nak) and the numerator (i.e. niak) are incremented by 1 to avoid evaluating log2 0. After hp-entropy(ti, la) has been computed for all terms in level la, the scores are stored in the list TermLista.
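The computation can be sketched as follows. This is a minimal sketch under one reading of the definitions above: it assumes that hp-entropy sums −p log2 p over the m class labels of a level, with p = (niak + 1)/(nak + 1). The function name `hp_entropy` and the document counts are hypothetical and are not taken from Table 1.

```python
import math


def hp_entropy(term_doc_counts, class_doc_counts):
    """Smoothed entropy of a term over the class labels of one level.

    term_doc_counts[k]  : documents with class label k that contain the term (n_iak)
    class_doc_counts[k] : all documents with class label k (n_ak)
    Assumed form: -sum(p * log2(p)) with p = (n_iak + 1) / (n_ak + 1).
    """
    total = 0.0
    for n_iak, n_ak in zip(term_doc_counts, class_doc_counts):
        p = (n_iak + 1) / (n_ak + 1)
        total -= p * math.log2(p)
    return total


# Hypothetical level with four class labels of nine documents each.
class_sizes = [9, 9, 9, 9]
concentrated = hp_entropy([8, 0, 0, 0], class_sizes)  # term confined to one class
spread = hp_entropy([4, 4, 4, 4], class_sizes)        # term spread evenly
```

With these made-up counts the concentrated term receives the lower score, matching the intent that a lower hp-entropy marks a more discriminative term.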
Karanikolas and Skourlas [30] proposed that ‘the appropriate key phrases for text classification are those that are frequent enough within the documents of only one or a few classes in the training set’. This statement is conceptually similar to the idea proposed in this research. However, the Authority List Creation Algorithm (ALCA) [30] used the method of finding frequent item sets in association rules in data mining to form the feature list (a list of word-phrases). In addition, the class labels used in that work had a flat structure. The methodologies used in the ALCA and in this current article are quite different.
A simple example illustrating the concept of hp-entropy is given in Table 1. The value in the cell of ti across cak in the table is the value of niak (which means the total number of documents containing term ti with class label cak). For example, there are in total four documents containing term t8 with class label c51, and no documents containing term t8 with class labels c52, c53 and c54. The value of hp-entropy(t8, l5) is equivalent to 1.304201. The hp-entropy of t8 on level five is the lowest among all the hp-entropy scores for the other terms. This means that t8 is the most discriminative term in this example and should be selected.
Table 1. Example using hp-entropy to select terms
After the above computation, each term will have five hp-entropy(ti, la) scores. Since the class labels of patent documents are hierarchical in structure, with five levels, every candidate term will have five hp-entropy scores corresponding to the five different levels. With this design there are two possible ways to find the hp-entropy score for a candidate term.
4.1. Proportional strategy
In the proportional strategy, representative terms are selected level by level according to the hp-entropy scores (the smaller the better). The terms with lower hp-entropy scores from the fifth level are picked out first, followed by those from the fourth level, third level, second level and first level, successively. In addition, the proportion of terms selected on each level can be different.
For example, if the total number of selected representative terms is set to be 100, the proportion of terms for each level can be 20, 20, 20, 20 and 20 terms, respectively. The proportion for each level can also be set differently, such as 40 terms from level five, 30 terms from level four, 20 terms from level three, 10 terms from level two, and zero from level one. Since the representative terms are selected level by level, terms may be duplicated from previous selections. In order to prevent duplication, a term which has already been picked from a lower level (e.g. level five) will not be selected for a higher level (e.g. level four).
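The level-by-level selection can be sketched as below. It is an illustration only: the function name `proportional_select`, the data layout and the toy scores are hypothetical, and the sketch assumes that when a term has already been picked on a lower level, the next-best unpicked term fills the quota, which the text does not state explicitly.

```python
def proportional_select(scores_by_level, quotas):
    """Pick terms level by level, lowest hp-entropy first.

    scores_by_level : {level: {term: hp_entropy_score}}
    quotas          : {level: number of terms to take from that level}
    Levels are visited from the bottom (5) upwards; a term picked on a
    lower level is skipped on higher levels (assumed: quota is backfilled).
    """
    selected = []
    seen = set()
    for level in sorted(quotas, reverse=True):
        ranked = sorted(scores_by_level[level].items(), key=lambda kv: (kv[1], kv[0]))
        picked = 0
        for term, _score in ranked:
            if picked == quotas[level]:
                break
            if term in seen:
                continue  # already chosen on a lower level
            selected.append(term)
            seen.add(term)
            picked += 1
    return selected


# Toy scores for two levels only.
toy_scores = {
    5: {"a": 0.1, "b": 0.2, "c": 0.9},
    4: {"a": 0.05, "b": 0.4, "c": 0.3},
}
chosen = proportional_select(toy_scores, {5: 1, 4: 1})
```

In the toy data term "a" ranks first on both levels, so level four falls through to "c".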
4.2. Weighted-summed strategy
In the weighted-summed strategy, representative terms are selected based on the aggregate scores of hp-entropy. Let wa be the weight at level la. An aggregate score for a term is accumulated by manually assigning weights with the following formula:
$$\mathrm{score}(t_i) = \sum_{a=1}^{5} w_a \times \text{hp-entropy}(t_i, l_a)$$
For example, the assigned weights for each level are 1.0, 0.5, 0.25, 0.125 and 0.0625 for the fifth, fourth, third, second and first levels, respectively. After the computation, every term is assigned an aggregate score of its own. The top k terms with the lowest aggregate hp-entropy scores will be picked out. If the total number of selected representative terms is set to be 100, the top 100 terms with the lowest aggregate hp-entropy scores will be chosen.
In addition to the two proposed strategies for selecting representative terms with which to build a VSM, there is a traditional technique for selecting representative terms. This traditional strategy is utilized to provide a baseline for comparison and evaluation. The traditional technique selects representative terms using the hp-entropy scores on level five only. In other words, the top k terms ti with the lowest hp-entropy(ti, l5) values will be picked as representative terms.
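The weighted-summed strategy and the level-five baseline can be sketched with one function. This is an illustration under assumptions: the name `weighted_sum_select`, the data layout and the toy scores are hypothetical, and every term is assumed to have a score on every weighted level. Expressing the baseline as a weight of 1 on level five and 0 elsewhere is a convenience of this sketch, not a statement from the text.

```python
def weighted_sum_select(scores_by_level, weights, k):
    """Return the k terms with the lowest weighted sum of hp-entropy scores.

    scores_by_level : {level: {term: hp_entropy_score}}
    weights         : {level: weight w_a}
    """
    terms = set.intersection(*(set(scores_by_level[level]) for level in weights))
    aggregate = {
        term: sum(weights[level] * scores_by_level[level][term] for level in weights)
        for term in terms
    }
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(aggregate, key=lambda term: (aggregate[term], term))[:k]


# Toy scores for two levels only.
toy_scores = {
    5: {"a": 1.0, "b": 0.5, "c": 0.8},
    4: {"a": 0.2, "b": 2.0, "c": 0.4},
}
weighted_top2 = weighted_sum_select(toy_scores, {5: 1.0, 4: 0.5}, k=2)
baseline_top2 = weighted_sum_select(toy_scores, {5: 1.0, 4: 0.0}, k=2)
```

In the toy data term "b" looks best on level five alone but drops out once its poor level-four score is weighted in.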
5. Experiments and evaluation
The patent documents used in the experiments were collected from the USPTO patent database. A total of 2400 and 9600 patent documents were collected to form the patent document sets DExp1 and DExp2, respectively, for testing the performance of the proposed method. Patent documents for collections DExp1 and DExp2 were randomly and evenly selected from ‘Section A: Human necessities’ and ‘Section H: Electricity’. In other words, there were 1200 patent documents in section A and 1200 patent documents in section H for DExp1, and 4800 patent documents in section A and 4800 patent documents in section H for DExp2. These patent documents included two section-level IPC codes, four class-level IPC codes (two class-level IPC codes for each section-level IPC code), eight subclass-level IPC codes (two subclass-level IPC codes for each class-level IPC code), 16 group-level IPC codes (two group-level IPC codes for each subclass-level IPC code) and 80 subgroup-level IPC codes (five subgroup-level IPC codes for each group-level IPC code). In each branch of the subgroup-level IPC code, 30 patent documents were collected for set DExp1. For DExp2 the target was 120 patents per branch, but the actual number ranged from 30 to 290 because some branches did not have enough patent documents and the shortfall had to be made up by drawing more patents from other branches. The experiment on document set DExp1 is named Exp-1 and that on document set DExp2 is named Exp-2. Collecting patent documents from different branches of IPC codes can help preserve the diversity of the patent content.
Tseng et al. [11] indicated that fields in patent documents can be divided into structured and unstructured types. Since the main focus in this article is on designing a method for text mining, only unstructured fields are considered. Additionally, Xue et al. [12] stated that the title field contains the subject of the invention (i.e. describes what the invention is), while the abstract field contains a brief statement of the main working principles or working structure of the invention (i.e. describes the core of the patent). To simplify the procedure, only the title and abstract of the patent documents were used for extracting candidate terms. From the 2400 patent documents collected, 16,345 unique terms were extracted as candidate terms, while 33,854 unique terms were extracted as candidates from the 9600 patent document set. The pre-processing operation to extract terms from the text of every patent document was then begun. The pre-processing procedure included the tasks of part-of-speech (POS) tagging, the elimination of stop words or insignificant terms, and morphological operations.
The process of POS tagging involves the labelling of the corresponding syntactic features for every term in a sentence and every sentence in a document. Terms with the following POS tags were retained to be candidates of the representative and discriminative terms: singular or mass noun (NN), plural noun (NNS), singular proper noun (NNP), plural proper noun (NNPS), adjective (JJ), comparative adjective (JJR), superlative adjective (JJS), base form verb (VB), past tense verb (VBD), gerund or present participle (VBG), past participle (VBN), non-third-person singular present verb (VBP), and third-person singular present verb (VBZ). In the process of the elimination of stop words or insignificant terms, terms identified as useless were eliminated. The morphological operation dealt with the morphological problems associated with terms. For example, a plural noun ‘accessories’ would be transformed into a singular noun ‘accessory’. After the pre-processing procedure, the total number of terms that remained and were stored in the term set T was 9977 for Exp-1 and 16,206 for Exp-2.
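The three pre-processing steps can be sketched on tokens that have already been POS-tagged. The set of retained tags is the one listed above; the stop-word list, the lemma lookup table and the function name `candidate_terms` are hypothetical placeholders, since the text does not name the tagger, stop list or morphological tool that was used.

```python
# POS tags retained as candidates, as listed in the text.
KEPT_TAGS = {
    "NN", "NNS", "NNP", "NNPS",
    "JJ", "JJR", "JJS",
    "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
}

# Hypothetical stand-ins for the stop list and the morphological step.
STOP_WORDS = {"be", "is", "are", "was", "were", "have", "has"}
LEMMAS = {"accessories": "accessory"}


def candidate_terms(tagged_tokens):
    """Filter (word, tag) pairs by POS tag and stop list, then normalise the form."""
    terms = []
    for word, tag in tagged_tokens:
        word = word.lower()
        if tag not in KEPT_TAGS or word in STOP_WORDS:
            continue
        terms.append(LEMMAS.get(word, word))
    return terms


tagged = [("The", "DT"), ("accessories", "NNS"), ("are", "VBP"), ("removable", "JJ")]
terms = candidate_terms(tagged)
```

A real pipeline would replace the lookup table with a proper lemmatiser or stemmer; the table is used here only to keep the sketch self-contained.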
Afterwards the HFS algorithm was utilized to calculate the score of hp-entropy for each candidate term in set T. Since the hierarchy has five levels, every candidate term would have five hp-entropy scores for the five different levels. This article utilized two strategies for accumulating hp-entropy scores for every term: the proportional strategy and weighted-summed strategy.
In Exp-1, a total of 1000 terms were selected as the most representative terms to present a patent document. There were a total of 11 sets of representative terms selected in the experiment. Using the traditional selection strategy, the representative terms were selected based on the score of hp-entropy for level five only. Hence, the 1000 terms selected using the traditional selection strategy in the experiment were the top 1000 terms with the lowest hp-entropy scores for level five. This served as a baseline for comparing and evaluating the performance of the other two proposed techniques. Similarly, in Exp-2 the 2000 most discriminative terms were selected. Since the size of the patent document set in Exp-2 is larger than that in Exp-1, the number of selected terms should be larger.
For the proportional strategy and weighted-summed strategy, five different sets of parameters were used to sift out a total of 10 sets of 1000 and 2000 representative terms (for Exp-1 and Exp-2, respectively) from candidate terms as described in Table 2. Take ‘P-1’ in the proportional selection strategy in Exp-1 as an example. The 333 terms, 267 terms, 200 terms, 133 terms and 67 terms with the lowest hp-entropy scores were selected for the fifth, fourth, third, second and first levels, respectively. Another example is: ‘W-1’ in the weighted-summed selection strategy in Exp-1. The accumulated hp-entropy score for a term is summed from the hp-entropy in level five times one, in level four times one, in level three times one, in level two times one and in level one times one, respectively. Finally, the 1000 terms with the lowest accumulated hp-entropy scores were picked out for ‘W-1’ selection strategy in Exp-1.
Table 2. Parameters used in each selection strategy
After the term selection process, each set of 1000 or 2000 representative terms was utilized to build a VSM for Exp-1 or Exp-2, respectively. The 11 VSMs in either Exp-1 or Exp-2 were named: Traditional, VSM-P-1, VSM-P-2, VSM-P-3, VSM-P-4, VSM-P-5, VSM-W-1, VSM-W-2, VSM-W-3, VSM-W-4 and VSM-W-5, respectively. All of the patent documents (a total of 2400 in Exp-1 and a total of 9600 in Exp-2) were transformed to their corresponding document vectors based on these 11 VSMs. In addition, the value of a term in the VSM for a document vector was computed utilizing the TFICF (term frequency, inverse category frequency) formula. In other words, TFICF was used to compute the weighting score of each representative term in a patent document to form the document vector. The TFICF considers the weight of a term across document categories, while traditional TFIDF (term frequency, inverse document frequency) considers the weight of a term across documents [6]. Since the classifiers introduced later are meant to distinguish between the categories of documents, it is more appropriate to use TFICF than TFIDF. The TFICF formula is
$$\mathrm{TFICF}(t_i) = tf_{i,j} \times \log \frac{|C_5|}{|C_5(t_i \in D(c_{5k}))|}$$
where TFICF(ti) stands for the weight of term ti in document dj; tfi,j is the occurrence of term ti in document dj; |C5| is the total number of class labels in level five; and |C5(ti∈D(c5k))| is the number of class labels in level five whose documents contain term ti. The TFICF formula is used for assigning different weights for each selected term. The first constituent (tfi,j) increases the weight of term ti for a given document dj proportional to the frequency of the term in the document. The second constituent (
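The weighting can be sketched as follows. This is a minimal sketch that assumes the usual inverse-frequency form, tf multiplied by log(|C5| / number of level-five labels whose documents contain the term); the natural logarithm, the function name `tficf` and the sample numbers are assumptions of the sketch.

```python
import math


def tficf(tf, n_categories, n_categories_with_term):
    """Weight of a term in one document under the assumed TFICF form.

    tf                     : occurrences of the term in the document
    n_categories           : number of class labels on level five, |C5|
    n_categories_with_term : level-five labels whose documents contain the term
    """
    if tf == 0 or n_categories_with_term == 0:
        return 0.0  # the term contributes nothing to this document vector
    return tf * math.log(n_categories / n_categories_with_term)


# Hypothetical term: 3 occurrences, found in 4 of the 80 subgroup-level labels.
weight_rare = tficf(3, 80, 4)
# A term found under every label carries no category information.
weight_everywhere = tficf(3, 80, 80)
```

Under this form a term that occurs under every level-five label receives zero weight, however frequent it is in the document.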
VSMs, which consist of varied features (i.e. terms), will have different abilities to represent a patent document. In order to assess the representative abilities of every VSM, an evaluation process was designed. In the evaluation process, a set of document vectors transformed by one of the 11 VSMs in either Exp-1 or Exp-2 was first treated as training documents to learn the classification model using the support vector machine (SVM). Each classification model is named after its corresponding VSM. For example, a classification model for the vector space model VSM-P-1 is called CMd-P-1. The classification accuracy of each model (listed as Acc. in Tables 3 and 4) was estimated by the percentage of patent documents that were correctly classified by the model.
Table 3. Results of Exp-1 for each selection strategy with different parameter settings
Table 4. Results of Exp-2 for each selection strategy with different parameter settings
Since it is better to classify a document into a partially correct class than not at all, it is worth demonstrating the hierarchical accuracy of the classification model. The hierarchical accuracy (listed as Hie-Acc. in Tables 3 and 4) is proposed to measure the correctness of document classification on every level of the hierarchy and is computed by taking the average correctness score SCj for every document dj classified by the model. In the evaluation process, if the classification model classifies a document with a completely accurate class label, the correctness score for this document will be one. If a document is classified with an accurate class label on the group level, subclass level, class level and section level, the correctness score for this document will be 0.8, 0.6, 0.4 and 0.2, respectively.
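The scoring rule can be sketched in a few lines. The sketch represents each label as its five cumulative level labels; the function names are hypothetical, and a score of 0 for a prediction that misses even the section level is an assumption, since the text lists only the scores from 0.2 to 1.

```python
def correctness_score(true_path, predicted_path):
    """Score a prediction by the deepest level on which it matches the true label.

    Each path lists the labels for section, class, subclass, group and subgroup.
    A match down to level a scores a / 5; no match at all scores 0 (assumed).
    """
    depth = 0
    for true_label, predicted_label in zip(true_path, predicted_path):
        if true_label != predicted_label:
            break
        depth += 1
    return depth / 5


def hierarchical_accuracy(pairs):
    """Average correctness score over (true_path, predicted_path) pairs."""
    return sum(correctness_score(t, p) for t, p in pairs) / len(pairs)


truth = ("H", "H01", "H01L", "H01L027", "H01L027/01")
exact = truth
same_group = ("H", "H01", "H01L", "H01L027", "H01L027/04")
same_subclass = ("H", "H01", "H01L", "H01L021", "H01L021/06")
hie_acc = hierarchical_accuracy([(truth, exact), (truth, same_group), (truth, same_subclass)])
```

The three sample predictions score 1, 0.8 and 0.6, so the hierarchical accuracy of this toy set is their mean.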
When a patent is transformed into a vector according to a VSM and all of its TFICF weights are zero, the vector degenerates into a zero vector and the VSM fails to represent that patent. We therefore define coverage as an indicator of the representative ability of the selected terms in each VSM. The higher the coverage, the better the selected terms represent the patent documents in the document set. Thus, coverage is a good indicator of the quality of the selected terms for representing a document set:
Besides coverage, another measure, recall, is used to estimate the percentage of document vectors that are correctly recognized. Recall is obtained by multiplying the accuracy (i.e. Acc.) by the coverage.
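The relationship between the three measures can be sketched in Python as follows. This is only an illustration under stated assumptions: coverage is taken to be the share of documents whose vectors are non-zero, and accuracy is computed over the non-zero vectors only. Both readings are inferred from the surrounding description rather than from a formula given here, and the function name and toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def evaluate(vectors: Sequence[Sequence[float]],
             true_labels: Sequence[str],
             predicted_labels: Sequence[str]) -> Dict[str, float]:
    """Compute accuracy, coverage and recall for one classification model.

    Assumptions (not confirmed by the text):
      coverage = non-zero vectors / all documents
      accuracy = correct predictions / non-zero vectors
      recall   = accuracy * coverage
    """
    # A document is "covered" when at least one of its weights is non-zero.
    covered: List[int] = [i for i, vec in enumerate(vectors)
                          if any(weight != 0 for weight in vec)]
    coverage = len(covered) / len(vectors)

    correct = sum(1 for i in covered
                  if true_labels[i] == predicted_labels[i])
    accuracy = correct / len(covered) if covered else 0.0

    return {"accuracy": accuracy,
            "coverage": coverage,
            "recall": accuracy * coverage}


# Toy data: four documents, two of which degenerate into zero vectors.
toy_vectors = [[0.0, 1.2], [0.0, 0.0], [0.7, 0.0], [0.0, 0.0]]
toy_truth = ["c1", "c2", "c1", "c3"]
toy_predicted = ["c1", "c2", "c2", "c3"]
print(evaluate(toy_vectors, toy_truth, toy_predicted))
```

Under these assumptions, recall equals the share of all documents that are both represented by a non-zero vector and classified correctly, which is why it can serve as a single overall score.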
The results obtained from Exp-1 using each selection strategy with different parameter settings are shown in Table 3, and Table 4 shows the corresponding results of Exp-2. In both tables, the classification models are named after the term selection strategy and its parameter setting. For example, the model that uses the proportional selection strategy with the parameter setting P-1 is named ‘CMd-P-1’.
The experimental results show that the VSMs constructed using the terms picked out by the proportional selection and weighted-summed selection strategies were generally better (in accuracy, coverage and recall) than the VSM formed by the traditional method. Furthermore, the proportional classification models (CMd-P-1 to CMd-P-5) gave the highest coverage, while the weighted-summed classification models (CMd-W-1 to CMd-W-5) gave the highest accuracy and the highest hierarchical accuracy.
6. Discussion
The accuracy rate is the ratio of correctly predicted vectors (i.e. patent documents) to non-zero vectors. The P-value is used to show the significance of the difference in accuracy between a model constructed using the proposed method and one built with the traditional method. If the P-value is less than 0.05, * appears after the value and denotes a significant difference; ** after the value denotes a P-value of less than 0.01, meaning a highly significant difference; *** indicates a P-value of less than 0.001, which represents an extremely significant difference. The coverage reflects the ability of a classification model to represent patent documents. When the coverage rate is relatively low, the number of non-zero vectors, i.e. the number of patent documents available for prediction, is smaller. The accuracy is independent of the coverage: high coverage does not directly lead to high or low accuracy. The recall rate can be used to measure the overall performance of the classification model.
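The asterisk convention described above can be written as a small Python helper. This is a sketch of the marking rule only; the function name is hypothetical, and the statistical test that produces the P-values is not specified here, so it is not modelled.

```python
def significance_marker(p_value: float) -> str:
    """Map a P-value to the asterisk notation used in the result tables.

    The thresholds follow the text: *** below 0.001, ** below 0.01,
    * below 0.05, and no marker otherwise.
    """
    if p_value < 0.001:
        return "***"  # extremely significant difference
    if p_value < 0.01:
        return "**"   # highly significant difference
    if p_value < 0.05:
        return "*"    # significant difference
    return ""         # not significant at the 0.05 level


# Hypothetical P-values, for illustration only.
for p in (0.2, 0.03, 0.004, 0.0002):
    print(f"{p}{significance_marker(p)}")
```

The thresholds are checked from the strictest to the most lenient so that each P-value receives only its strongest applicable marker.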
In Exp-1, the accuracy rates of classification models CMd-P-2, CMd-P-4, CMd-W-1, CMd-W-2, CMd-W-3, CMd-W-4 and CMd-W-5 were higher than those of the traditional model. Furthermore, the classification models CMd-W-1, CMd-W-2, CMd-W-3, CMd-W-4 and CMd-W-5 had P-values of less than 0.001, indicating an extremely significant difference from the traditional model. The hierarchical accuracy rates of all the classification models were higher than those of the traditional model. The improved performance means that these classification models could predict the class label more correctly. Among them, the classification models CMd-W-1 and CMd-W-3 had the best performance in terms of hierarchical accuracy in Exp-1. The classification models CMd-P-1, CMd-P-3, CMd-P-4 and CMd-P-5 had better coverage scores than the traditional model. The best coverage was obtained with CMd-P-5. Thus, the classification model CMd-P-5 in Exp-1 could represent the most patent documents (which are non-zero vectors). All the recall rates of the classification models, with the exception of CMd-P-2 in Exp-1, were higher than those obtained with the traditional model, and the recall rate for CMd-P-5, 7.76%, is the highest in Exp-1. This means that the classification model CMd-P-5 had the best overall performance during experiment Exp-1. The correctness of the prediction obtained using CMd-P-5 for both non-zero vectors and zero vectors was the highest.
In Exp-2, the accuracy rates of CMd-W-1, CMd-W-2, CMd-W-3, CMd-W-4 and CMd-W-5 were higher than those of the traditional model. Their P-values were less than 0.001, indicating an extremely significant difference between these models and the traditional model. Furthermore, the hierarchical accuracy rates obtained with CMd-P-4, CMd-W-1, CMd-W-2, CMd-W-3, CMd-W-4 and CMd-W-5 were higher than those of the traditional model. The five classification models that used weighted-summed selection strategies had P-values of less than 0.001 and were extremely significantly different from the traditional model. Among them, CMd-W-2 had the best performance in terms of hierarchical accuracy in Exp-2. The coverage scores obtained with CMd-P-1, CMd-P-2, CMd-P-3, CMd-P-4 and CMd-P-5 were higher than those of the traditional model. The best coverage was obtained with CMd-P-2, which means that the classification model CMd-P-2 in Exp-2 could represent the largest number of patent documents. Since the document set was much larger than that in Exp-1, the number of selected terms (i.e. 2000) was higher than that used in Exp-1 (i.e. 1000). The reason for increasing this number was to maintain the coverage rate. According to Tables 3 and 4, the coverage rates of the classification models in Exp-1 ranged from 6% to 9% and those in Exp-2 ranged from 7% to 10%, so the coverage rates in the two experiments were approximately the same. The recall rates of CMd-P-1, CMd-P-2, CMd-P-3, CMd-P-4, CMd-P-5 and CMd-W-5 in Exp-2 were higher than those of the traditional model. Among them, CMd-P-4 had the highest recall rate (6.31%) in Exp-2, which means that CMd-P-4 had the best overall performance in Exp-2.
The reason for the high coverage of the proportional classification models could be that they collected the most representative terms from each level. Since the most representative terms could cover more documents, considering the representative abilities level by level could result in maximum coverage. In addition, the proportional classification models could expand the representative abilities and retain the commonly used terms on the higher levels (e.g. the section level). On the other hand, the reason for the high accuracy of the weighted-summed classification models could be that they considered the representative ability of a term on all levels simultaneously. The classification models CMd-W-1, CMd-W-2 and CMd-W-3 took the discrimination abilities of a term on all five levels into consideration, while CMd-W-4 and CMd-W-5 only considered the discrimination abilities of a term on two or three levels. The higher accuracy of CMd-W-1, CMd-W-2 and CMd-W-3 compared with CMd-W-4 and CMd-W-5 shows that weighting more levels can produce higher accuracy than weighting fewer levels. This supports the validity of the proposed approach. In other words, a classification model will perform better if it is based on a VSM that considers terms’ discrimination abilities on all levels.
According to the results of the experiments and the evaluation, this research shows that accuracy and coverage can be enhanced by considering the relationship between the hierarchical levels and the term discrimination abilities. This is a revolutionary finding in feature selection regarding hierarchical class labels because, to our knowledge, no previous research has used this relationship to help select discriminative features. Although it is still at a preliminary stage, this finding might change the way we select the most discriminative features for documents with hierarchical labels. In the future, it is expected that new feature selection methods will be developed based on this finding to further improve the performance of feature selection. In addition, we can attempt to apply these feature selection methods to other document domains with hierarchical labels, such as news classification structures, disease classification codes and book classification codes.
In summary, the results of the two experiments indicate a trend that was not affected by the size of the document set: the proportional selection strategies could lead to higher coverage and a higher recall rate, and the weighted-summed selection strategies could lead to a higher accuracy rate.
7. Conclusion and future study
This article proposes an approach for constructing VSMs to represent patent documents. The results of the experiments and the process of evaluation reveal that a VSM built on terms selected by a proportional selection strategy can give higher coverage, while a VSM built on terms selected via a weighted-summed strategy can have higher accuracy. Furthermore, the results also indicate that both proposed methods can significantly improve the hierarchical accuracy of patent classification.
The limitations of this research are explained below. The proposed method provides a new approach to representing patent documents with hierarchical class labels via a VSM. To test the performance of the proposed method, we built classification models using SVM (one of the most widely used classification methods) from the vectors generated from patent documents. Although the experimental results indicate that our method can generate vectors with better performance, the validity of our results is limited to classification tasks with the SVM method. It is still unknown whether similar results can be obtained by applying the generated vectors to other tasks or to classification models other than SVM.
This is the first article that utilizes the hierarchical label structure to design a better feature selection scheme. This effort points to an interesting direction for future research, not only for patent documents, but also for other documents with hierarchical labels. That is, what other hierarchical label structures can be used to design improved methods for feature selection, feature extraction, classification and other tasks? For example, we may consider a hierarchical label structure used in medicine, i.e. the International Statistical Classification of Diseases and Related Health Problems, commonly known as the International Classification of Diseases (ICD).
Future studies could: test more parameters and settings in each selection strategy to further demonstrate the validity of the proposed method; apply normalization based on the most frequent term of the jth document when calculating the value of TFICF; try other ways of setting correctness scores to test the validity of the measurement of hierarchical accuracy; try other sizes of patent document set to further examine the validity of the proposed method; test different numbers of selected terms to see whether the experimental results differ; or use more textual fields, such as the claims and description of patent documents, to evaluate the validity of the proposed method.
