Abstract
Text classification (a.k.a. text categorisation) is an effective and efficient technology for information organisation and management. As the volume of information resources on the Web and corporate intranets continues to grow, text classification has become increasingly important and has attracted wide attention from many research fields. Many feature selection methods and classification algorithms have been proposed in the literature, and the technology has important real-world applications. However, the dramatic increase in the availability of massive text data from various sources is creating a number of issues and challenges for text classification, such as scalability. The purpose of this report is to give an overview of existing text classification technologies for building more reliable text classification applications, and to propose a research direction for addressing the challenging problems in text mining.
Introduction
As information resources on the Web and corporate intranets continue to grow, there is a pressing need for more effective and efficient technologies to help people search and manage these resources. Machine learning for text classification (or categorisation) is an important technique for information organisation and management. Extensive research has been invested in text classification, and many kinds of classification techniques have been developed, including neural networks, decision trees, k-nearest neighbors, support vector machines, Naïve Bayes, linear least squares, Rocchio’s method, and rule-based methods [7,9,28,29,43,57,85]. There are also many efficient and practical text classification applications in areas such as information retrieval, text filtering, classification of news stories, categorisation of incoming e-mail messages and memos, classification of Web pages, classification of academic papers by technical domains and sub-domains, spam and porn filtering, bioinformatics, customer service automation, topic categorisation and sentiment analysis [12,20,48,70,75,86,89,99,100]. A number of studies have focused on processing the textual information available in healthcare datasets to improve care while lowering costs [6], on using text mining technology to develop a computer-based diagnostic decision-support system that helps physicians make better medical decisions [2], or on applying medical data mining technology to detect adverse drug events [27].
However, the dramatic increase in the availability of massive text data from various sources is creating a number of issues and challenges for text analysis and classification. The main purpose of this document is to provide a survey of the development on these research issues and challenges.
The paper is organized as follows. Section 2 begins with a brief introduction to text classification to provide the basic concepts and background knowledge. In Section 3, we review some important feature selection methods. Section 4 discusses classification algorithms. Section 5 presents some important real-world applications. Section 6 concludes this study.
Text classification
Text classification is the process of classifying an incoming stream of documents into categories by using classifiers learned from training samples [49]. The Machine Learning (ML) approach to text classification has gained popularity and has eventually become the dominant one [70]. Using machine learning, the objective is to learn classifiers from examples so that documents can be assigned to categories automatically. The input to a classifier is a training set of records, each of which is tagged with a class label. A set of attribute values defines each record. The goal is to induce a model or description for each class in terms of the attributes. The model is then used to classify future records whose classes are unknown. More formally, text classification (TC) assigns a Boolean value to each pair
In general, a text classification problem can be a “binary” classification problem if there are exactly two classes, a “multi-class” problem if there are more than two classes and each document falls into exactly one class, or a “multi-label categorization” problem if a document may have more than one associated category in a classification scheme [86]. The basic form of text classification is binary classification, in which a text document is given one of two labels, usually referred to as positive and negative. Multi-label and multi-class tasks are often handled by reducing them to k binary classification tasks, one for each category [70,86]. For example, in [50] a multi-label classification problem was converted into a set of binary classification problems, and a simple convolutional neural network (CNN) model was then applied to the text classification.
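The reduction to k binary tasks can be illustrated with a minimal sketch. The function name `to_binary_tasks` and the toy categories are hypothetical and serve only to show the idea: each category gets its own positive/negative labelling of the same documents.

```python
from typing import Dict, List, Set


def to_binary_tasks(doc_labels: List[Set[str]], categories: List[str]) -> Dict[str, List[int]]:
    """Reduce a multi-label dataset to one binary labelling per category.

    doc_labels[i] is the set of categories attached to document i.
    For each category c, the result holds 1 where the document carries c
    (positive) and 0 otherwise (negative).
    """
    return {c: [1 if c in labels else 0 for labels in doc_labels] for c in categories}


# Toy data (assumed for illustration): four documents, three categories.
doc_labels = [{"sport"}, {"sport", "politics"}, {"finance"}, set()]
tasks = to_binary_tasks(doc_labels, ["sport", "politics", "finance"])
for category, targets in tasks.items():
    print(category, targets)
```

Each list in `tasks` can then be used as the target of a separate binary classifier, one per category.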
Current text classifiers cannot unambiguously describe the decision boundary between positive and negative objects owing to uncertainties caused by text feature selection and the knowledge learning process. To overcome this problem, a new three-way decision model was proposed recently. The objective of the new model is to deal with the uncertain boundary in order to improve binary text classification performance, based on rough set techniques and a centroid solution [47].
Several algorithms have been proposed for binary classification, including neural networks, decision trees, k-Nearest Neighbor, the Naive Bayes classifier, rough-set-based classifiers, and Support Vector Machines [90,91,97]. These algorithms can be naturally extended to multi-class classification. Another way to solve a multi-class problem is to convert it into a set of binary classification problems [1].
A multi-label dataset must first be transformed into a single-label dataset before it can be processed by binary classification. At least four approaches for transforming a multi-label dataset into a single-label dataset were presented in [8]: All Label Assignment (ALA), No Label Assignment (NLA), Largest Label Assignment (LLA), and Smallest Label Assignment (SLA). Among these approaches, ALA is usually the best; however, it suffers from the fact that duplicate documents with different labels introduce noise and decrease categorization effectiveness. ALA corresponds to Problem Transformation 5 (PT5) in [76,77]. A new transformation, Entropy-based Label Assignment (ELA), which modifies ALA, was proposed in [8]. More details of multi-label classification are provided in [92].
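The four transformations can be sketched as follows. The definitions used here are assumptions inferred from the approach names (ALA copies a document once per label, NLA drops multi-label documents, LLA and SLA keep the label that has the most or the fewest documents in the collection); the exact definitions in [8] may differ, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

Doc = Tuple[str, Set[str]]  # (document id, set of labels)


def label_sizes(docs: List[Doc]) -> Counter:
    """Count how many documents carry each label."""
    sizes = Counter()
    for _, labels in docs:
        sizes.update(labels)
    return sizes


def transform(docs: List[Doc], strategy: str) -> List[Tuple[str, str]]:
    """Turn multi-label documents into (document id, single label) pairs."""
    if strategy not in {"ALA", "NLA", "LLA", "SLA"}:
        raise ValueError(f"unknown strategy: {strategy}")
    sizes = label_sizes(docs)
    out = []
    for doc_id, labels in docs:
        ordered = sorted(labels)  # sorted for deterministic output
        if len(ordered) <= 1:
            # Single-label documents pass through unchanged.
            out.extend((doc_id, lab) for lab in ordered)
        elif strategy == "ALA":
            # Assumed: one copy of the document per label.
            out.extend((doc_id, lab) for lab in ordered)
        elif strategy == "LLA":
            # Assumed: keep the label carried by the most documents.
            out.append((doc_id, max(ordered, key=lambda lab: sizes[lab])))
        elif strategy == "SLA":
            # Assumed: keep the label carried by the fewest documents.
            out.append((doc_id, min(ordered, key=lambda lab: sizes[lab])))
        # "NLA": multi-label documents are dropped.
    return out


docs = [("d1", {"a"}), ("d2", {"a", "b"}), ("d3", {"a"}), ("d4", {"b", "c"})]
for name in ("ALA", "NLA", "LLA", "SLA"):
    print(name, transform(docs, name))
```

Under ALA, documents d2 and d4 each appear twice with different labels, which is the duplication that the text identifies as a source of noise.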

Machine learning for text classification.
Machine learning for text classification can be categorised into supervised, semi-supervised, and unsupervised learning tasks. In supervised learning, the machine is presented with training examples consisting of input/output pairs and is required to predict the output values of new examples based on their input values. Supervised learning requires a set of training examples. However, labelled samples, although available, are sometimes inadequate or insufficient; this setting is referred to as semi-supervised text classification. A semi-supervised approach was proposed in [58] to learn classifiers from only partially labelled samples (the training documents are pre-classified into a set of possible classes with only one correct class). An approach was introduced in [18] to learn classifiers from only positive and unlabelled samples, without negative ones. The approach first extracts negative samples from the unlabelled set and then builds classifiers as usual. Supervised and semi-supervised text classification techniques rely, to a greater or lesser extent, on pre-classified samples to learn classifiers. Unsupervised learning refers to the problem of trying to find hidden structure in unlabelled data. The authors of [83] proposed building the classification model for a target class without associated training samples by analysing the correlated auxiliary classes.
As shown in Fig. 1, the process of text categorization generally consists of the feature selection, classification model development and classification evaluation tasks. In this section, we will discuss feature selection, which is a very important task for text classification.
In text classification, a text document is typically represented as a feature vector. The feature vectors are typically high-dimensional but sparse. Feature selection is a technique that selects a subset of the features available for describing the data before a learning algorithm is applied. Feature selection techniques aim to remove non-informative features according to corpus statistics and to reduce the dimensionality of the data. Feature selection can increase classification accuracy and decrease computational complexity by eliminating noise features. Therefore, effective feature selection is essential for improving the scalability, efficiency and accuracy of a text classifier [7]. For instance, Yang and Pedersen [87] have shown that, depending on the classifier, effectiveness can be moderately increased (⩽5%) by using term space reduction. In [79], a rough-set-based feature selection method was proposed.
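Information gain, one of the corpus statistics studied in [87], gives a concrete example of scoring a term for feature selection. The sketch below assumes the standard entropy-based definition, IG(t) = H(C) − P(t)·H(C | t present) − P(t absent)·H(C | t absent), estimated from document counts; the function names and toy data are hypothetical.

```python
import math
from typing import List, Set


def entropy(counts: List[int]) -> float:
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs: List[Set[str]], labels: List[str], term: str) -> float:
    """Bits of class information gained by knowing whether `term` occurs."""
    n = len(docs)
    if n == 0:
        return 0.0
    classes = sorted(set(labels))
    # Per-class document counts, split by presence or absence of the term.
    with_t = [sum(1 for d, y in zip(docs, labels) if y == c and term in d) for c in classes]
    without_t = [sum(1 for d, y in zip(docs, labels) if y == c and term not in d) for c in classes]
    overall = [a + b for a, b in zip(with_t, without_t)]
    return (entropy(overall)
            - (sum(with_t) / n) * entropy(with_t)
            - (sum(without_t) / n) * entropy(without_t))


docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "match"}]
labels = ["sport", "sport", "politics", "politics"]
for term in ("goal", "match"):
    print(term, information_gain(docs, labels, term))
```

In this toy collection "goal" separates the two classes perfectly and receives the maximum score, while "match" occurs equally in both classes and receives a score of zero, so a selector that keeps the top-scoring terms would discard it.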
In the feature selection phase, the text document pre-processing task is performed first. The pre-processing phase comprises at least three tasks: the selection of the initial/complete feature set, the identification of related information that is not directly used as features, and dimensionality reduction.
Types of feature
The features can be simple structures (words), complex linguistic structures (e.g. phrases, lexical dependencies, part of speech (POS)), statistical structures (e.g. n-grams, patterns (a.k.a. termsets)), supporting information (e.g. a word’s compactness, a word’s first position), Named Entities (e.g., people or organization names), etc. in the document. The set of features consists of one or more types. Most systems use only one kind of feature (e.g. terms). However, many works have found that using more than one type of feature can increase classification performance [55,59].
Term-based feature
Terms (a.k.a. normalized words) are the most common type of feature in document representation. The “bag of words” model is a typical model used for representing and computing the feature values for a document. In that model, each feature in the feature vector corresponds to a given word or phrase. The advantage of term-based features is that a complex natural language document is transformed into a set of simple independent terms, and using simple term features makes classification efficient. However, relation information among terms is lost [72]. Another problem with using terms as features is semantic ambiguity: the polysemy and synonymy problems. The polysemy problem means that a term can be used to express different things in different contexts (e.g. driving a car and driving results). This affects precision. The synonymy problem means that different terms can be used to express the same thing (e.g. espionage and spy). This affects recall.
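A minimal bag-of-words sketch makes the representation concrete. The tokenizer (lower-cased runs of ASCII letters) and the function names are simplifying assumptions; real systems typically add stop-word removal and stemming.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep runs of ASCII letters as terms."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the vocabulary and one term-count vector per document."""
    vocab = sorted({tok for d in docs for tok in tokenize(d)})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for d in docs:
        vec = [0] * len(vocab)
        for term, count in Counter(tokenize(d)).items():
            vec[index[term]] = count
        vectors.append(vec)
    return vocab, vectors


vocab, vectors = bag_of_words(["Driving a car", "driving results, driving sales"])
print(vocab)
print(vectors)
```

Both documents share the feature "driving" even though the word is used in different senses, which is the polysemy problem described above; word order and other relations among terms are not recorded at all.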
Word senses are used to address synonymy, polysemy and homonymy, which relate to word meanings (a.k.a. senses). As an example of polysemy, the word “phone” can be a noun (a device) or a verb (to communicate). For English, a popular lexical database that provides the senses of words is WordNet [53]. In [30], the authors used a WordNet-based corpus annotated by linguists to compare word-based and sense-based features for categorization. Using a small training set (182 documents), they found that sense-based features did not improve effectiveness significantly. The work in [54] also concluded that word-sense features were not sufficient to improve text categorization effectiveness. Another investigation, however, indicated that using the relationship hierarchy of WordNet synsets helps categorization performance [62].
Phrase-based feature
A phrase is defined in a dictionary as “a group of words which is part rather than the whole of a sentence” [80]. For example, “take away” and “pull out” are phrases. Phrases have been used intensively as features in information retrieval; however, at least in early works, they were not effective [40]. The phrase is not a good feature because it does not meet four criteria for good features: (i) a small number of indexing terms, (ii) a flat distribution of values for an indexing term, (iii) a lack of redundancy among terms, and (iv) low noise in indexing term values [41]. However, there is no detailed quantitative analysis of why phrases fail to improve categorization effectiveness compared with single-word features. Other works [55] explained the failure of phrases for document representation in text classification; however, their analyses were short and descriptive. Regarding criterion (i), the author of [40] used an automatic syntactic phrase identification method and found 32,521 phrases but only 22,791 words. In [19], it was reported that syntactical phrase improved precision
Pattern-based feature
A newer approach to document representation uses termsets (a.k.a. patterns). The Pattern Taxonomy Model (PTM) [46,81,95] uses intra-document frequent closed sequential patterns, with the paragraph as the transactional unit. A pattern is closed if none of its immediate supersets has exactly the same support count. PTM treats closed patterns as meaningful patterns because most of the sub-sequence patterns of a closed pattern have the same frequency, which means they always occur together in a document. Smaller patterns in the taxonomy are usually more general because they have a high occurrence frequency in both positive and negative documents, whereas larger patterns are usually more specific since they have a small chance of being found in both positive and negative documents [46]. PTM prunes non-closed patterns from the document representation in an attempt to reduce the size of the feature set by removing noisy patterns.
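The closedness test can be illustrated on a toy document. As a simplification, the sketch below treats patterns as unordered termsets rather than the sequential patterns used by PTM, and it enumerates candidates by brute force, so it is only suitable for very small vocabularies. The function names and the data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_termsets(paragraphs: List[Set[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Map each termset found in at least min_support paragraphs to its support."""
    terms = sorted(set().union(*paragraphs))
    result = {}
    for size in range(1, len(terms) + 1):
        for combo in combinations(terms, size):
            support = sum(1 for p in paragraphs if set(combo) <= p)
            if support >= min_support:
                result[frozenset(combo)] = support
    return result


def closed_termsets(frequent: Dict[FrozenSet[str], int]) -> Dict[FrozenSet[str], int]:
    """Keep patterns with no immediate superset of exactly the same support."""
    closed = {}
    for pattern, support in frequent.items():
        absorbed = any(
            len(other) == len(pattern) + 1 and pattern < other and s == support
            for other, s in frequent.items()
        )
        if not absorbed:
            closed[pattern] = support
    return closed


# One document split into three paragraphs (the transactional units).
paragraphs = [{"text", "mining", "pattern"}, {"text", "mining"}, {"text", "pattern"}]
frequent = frequent_termsets(paragraphs, min_support=2)
closed = closed_termsets(frequent)
for pattern, support in closed.items():
    print(sorted(pattern), support)
```

Here {mining} is pruned because its superset {mining, text} has the same support, so the smaller pattern carries no additional information, whereas {text} is kept because it occurs in more paragraphs than any of its supersets.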
PTM has been applied as an effective information filtering system [46,96]. Unlike most other document representations, the PTM representation is not used as input to a machine learning classifier; instead, it is used to produce a set of weighted terms. This set of weighted terms serves as the class representation, which makes the scoring of new documents more efficient.
Feature weighting methods
A text document is represented by a feature vector that includes the selected features and their associated weights. A feature weight indicates the degree of information represented by the feature occurrences in a document and reflects the relative importance of the feature.
A number of term weighting methods derived from information retrieval including term frequency (
Terms that repeat multiple times in a document are considered important. Terms that appear in many documents are considered common and are not indicative of document content. Based on this idea, the number of documents in a document collection is defined as N. The frequency of a term
Information gain (IG) [70,87] measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. The information gain of term t and category
The Latent Semantic Indexing (LSI) method was developed by [11] based on Latent Semantic Analysis (LSA) [14]. LSI uses a matrix factorization from linear algebra, singular value decomposition (SVD), to transform the original high-dimensional data into a new, lower-dimensional orthogonal approximation by applying truncated SVD to the word-document matrix. This new space is a more compact document representation. Words and documents that are closely associated are placed near one another in the new “semantic space”. LSI compresses the original document vectors into a smaller semantic space by taking advantage of the patterns of co-occurrence as an implicit higher-order structure. This “semantic structure” or “theme” (associations of terms with documents) can be viewed as conceptual user representations of documents [69]. The variety of these representations can be greater for document collections produced by several different people. The new set of vectors can be viewed as pseudo-document vectors. However, the created features are not intuitively interpretable. In a recent paper [22], the authors proposed a new method for sensor storage that combines semantic web concepts and a data aggregation method, along with aligning sensors in hierarchical form, to respond to semantic web-based queries more effectively.
There is much current work on weighting methods. Two examples are relevance frequency (RF) [35] (a supervised inter-document method that exploits the distribution of relevant documents in the collection) and distributional features [82] (an intra-document method).
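The classic term frequency and document frequency idea discussed earlier in this section (a term is important when it repeats within a document and less indicative when it appears in many documents) can be sketched as a tf-idf weight. The sketch assumes the widely used form tf(t, d) · log(N / df(t)), where N is the number of documents in the collection; other variants exist, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term of each tokenized document by tf * log(N / df)."""
    n = len(docs)
    # Document frequency: the number of documents that contain the term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted


docs = [["text", "mining", "text"], ["text", "classification"], ["data", "mining"]]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

A term that occurs in every document of the collection receives a weight of zero under this form, which reflects the view that such terms are not indicative of document content.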
Classification algorithms
Types of classifiers
There are different types of text classifier. In [18], text classifiers are categorized into two types: kernel-based classifiers and instance-based classifiers. Typical kernel-based classifier learning approaches include Support Vector Machines (SVMs) [28] and regression models [70]. Typical instance-based classification approaches include K-Nearest Neighbour (K-NN) [10] and its variants, which do not rely upon the statistical distribution of training samples. Other research works, such as [86,94], categorize classifier learning techniques differently: linear classifiers, including linear SVM, regularized logistic regression, ridge regression (i.e., regularized least squares fit), Naïve Bayes (NB) methods and boosted linear classifiers; and non-linear classifiers, including KNN, SVM with non-linear kernels, boosting, decision trees, and Neural Networks (NN) with hidden layers. Among those classifiers, the most notable ones are SVM, KNN, logistic regression, and boosting.
Conventional classifiers
Text has the properties of a high-dimensional feature space (more than 1000 features), few irrelevant features (dense concept vector), and sparse document vectors (most feature values in a document vector are zero), and most text categorization problems are linearly separable [28]. SVM is an outstanding text categorization method because of its ability to cope with these properties [28]. Even when all available features were used (no dimensionality reduction), SVM achieved good effectiveness and was difficult to beat [63]. The investigation in [35] showed that SVM’s effectiveness rapidly increased to 0.8–0.9 in F measure and then remained relatively steady. In contrast, kNN peaked at 0.7–0.8 in F measure and declined as the number of features increased.
SVM is based on the structural risk minimization principle from statistical learning theory [78]. The method was introduced to text classification by Joachims [28]. It is an inductive learning technique. SVM can be used to find a hyperplane h that has the maximum margin as the decision boundary in the linear space. A classification task usually involves training and testing data that consist of data instances. Each instance in the training set contains one class label and several features (attributes).
For the linear classification, SVMs learn linear decision rules

A binary support vector classifier in two dimensions.
The SVM requires the solution of the following optimization problem [78]:
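In its standard soft-margin form, this problem is usually written as shown below. The notation is an assumption rather than the notation of [78]: the training set is $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$, $\mathbf{w}$ and $b$ define the hyperplane, $\xi_i$ are slack variables, and $C > 0$ trades margin width against training errors.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
  & \tfrac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad
  & y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1 - \xi_i,
    \qquad \xi_i \ge 0, \qquad i = 1, \dots, n .
\end{aligned}
```

Under this formulation a new document $\mathbf{x}$ is classified by $h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{\top}\mathbf{x} + b)$, and minimizing $\mathbf{w}^{\top}\mathbf{w}$ corresponds to maximizing the margin $2 / \lVert \mathbf{w} \rVert$.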
However, SVM is not well suited to the classification of large datasets or text corpora, since the training complexity of SVMs is highly dependent on the input size. A new multikernel SVM was developed in [65] to deal with high-dimensional data. The experimental results demonstrated that the new multikernel SVM classifier achieved better accuracy than the classic SVM, while its training was significantly faster than that of several other SVM classifiers.
Logistic regression is another commonly-used statistical approach for obtaining a linear classifier. One regression model is called the Linear Least Squares Fit, a mapping approach developed in [85]. A multivariate regression model is automatically learned from a training set of documents and their categories. The training data are represented in the form of input/output vector pairs where the input vector is a document in the conventional vector space model (consisting of words with weights), and the output vector consists of categories (with binary weights) of the corresponding document. By solving the linear least-squares fit on the training pairs of vectors, one can obtain a matrix of word-category regression coefficients. The matrix defines a mapping from an arbitrary document to a vector of weighted categories. By sorting these category weights, a ranked list of categories is obtained for the input document.
KNN (K-Nearest Neighbour) classification is a well-known statistical approach that has been intensively studied in pattern recognition for over four decades [10]. KNN has been applied to text categorization since the early stages of the research [85]. The KNN approach is widely used because of its simplicity and prediction accuracy. Given an arbitrary input document, the system ranks its nearest neighbours among the training documents and uses the categories of the k top-ranking neighbours to predict the categories of the input document. The similarity score of each neighbour document to the new document being classified is used as the weight of each of its categories; the sum of the category weights over the k nearest neighbours is used for category ranking. The model complexity is controlled by the choice of k. Formally, the degree of freedom of a kNN classification model is defined as
Rocchio’s method is a classic vector-space-model method for document routing or filtering in information retrieval. When it is applied to text categorization, the basic idea is to construct a prototype vector for each category using a training set of documents. Given a category, the vectors of the documents belonging to this category are given a positive weight, and the vectors of the remaining documents are given a negative weight. By summing up these positively and negatively weighted vectors, the prototype vector of this category is obtained. This method is easy to implement, efficient in computation, and has been used as a baseline in several evaluations [9,44]. A potential weakness of this method is the assumption of one centroid per category; consequently, Rocchio’s method does not perform well when the documents belonging to a category naturally form separate clusters [84].
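The similarity-weighted KNN ranking and the Rocchio prototype described above can be sketched together. The sketch assumes cosine similarity over sparse term-weight vectors; averaging within the positive and negative groups and the default weights `pos_weight=1.0` and `neg_weight=0.25` are illustrative assumptions rather than values taken from the cited works, and all names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = Dict[str, float]      # sparse vector: term -> weight
Example = Tuple[Vector, str]   # (document vector, category)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_rank(train: List[Example], doc: Vector, k: int) -> List[Tuple[str, float]]:
    """Rank categories by the summed similarity of the k nearest neighbours."""
    neighbours = sorted(((cosine(vec, doc), cat) for vec, cat in train), reverse=True)[:k]
    scores = defaultdict(float)
    for sim, cat in neighbours:
        scores[cat] += sim  # the similarity acts as the weight of the category
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def rocchio_prototype(train: List[Example], category: str,
                      pos_weight: float = 1.0, neg_weight: float = 0.25) -> Vector:
    """Sum positively weighted in-category and negatively weighted other vectors."""
    pos = [v for v, c in train if c == category]
    neg = [v for v, c in train if c != category]
    proto = defaultdict(float)
    for v in pos:
        for t, w in v.items():
            proto[t] += pos_weight * w / len(pos)
    for v in neg:
        for t, w in v.items():
            proto[t] -= neg_weight * w / len(neg)
    return dict(proto)


train = [
    ({"goal": 1.0, "match": 1.0}, "sport"),
    ({"goal": 1.0, "team": 1.0}, "sport"),
    ({"vote": 1.0, "party": 1.0}, "politics"),
]
doc = {"goal": 1.0, "team": 1.0}
ranked = knn_rank(train, doc, k=2)
sport_proto = rocchio_prototype(train, "sport")
politics_proto = rocchio_prototype(train, "politics")
print(ranked)
print(cosine(sport_proto, doc), cosine(politics_proto, doc))
```

With a Rocchio prototype, classifying a new document requires one similarity computation per category instead of one per training document, which is the source of its efficiency; the cost is the single-centroid assumption noted above.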
Other classifiers such as Bayesian probabilistic classifiers can be found in [42], decision trees in [16,42], inductive rule learning algorithms in [9], neural networks in [57] and on-line learning approaches in [9,44].
In general, SVM has very good performance; however, in many situations other methods perform comparably or even better. SVM and KNN significantly outperform NN and NB for low-frequency categories (fewer than 10 documents), but all the methods perform comparably when the categories are sufficiently common (over 300 documents) [84]. With a highly skewed class distribution (imbalanced data), SVM performance can drop sharply [15]. For distributional features (i.e. features that represent word compactness in a document), KNN is more suitable than SVM [82]. For spam email filtering, which needs high efficiency, Naïve Bayes is a popular method [68].
Hierarchical text classification is an important text mining and natural language processing task in many real-world applications and it aims at classifying text documents into classes that are organized into a hierarchy [52]. The hierarchical information can be added into feature representation for hierarchical text classification [64,66]. The basic idea is that the notion of class attributes will allow generalization to take place across (similar) categories and not just across training examples belonging to the same category [64].
In recent years, the authors of [25,26] have developed a simpler recursive regularization of weight vectors of linear classifiers in the hierarchical model for the large-scale hierarchical text classification. A new weakly-supervised neural method for hierarchical text classification was proposed in [52]. This new method is different from a deep neural model that requires a large amount of training data. The weakly-supervised model requires only easy-to-provide weak supervision signals such as a few class-related documents or keywords.
Deep learning for classification
Deep learning [37] is a machine learning technique based on representation learning where the system automatically learns and discovers the features needed for classification from the processing of multiple layers of input data. Deep learning has become a mainstream machine learning technique with capacity in various nonlinear modelling tasks such as the classification and feature extraction process from complex datasets. In recent years, many deep learning techniques including the convolutional neural networks (CNN) [38,39], and recurrent neural networks (RNN) [4], have been explored for text classification.
Deep learning has proven effective for end-to-end learning of hierarchical feature representations, and deep neural networks have demonstrated superior performance for flat text classification [52]. RNNs have been proposed for document classification in [34,73], and a hierarchical attention network model that emphasizes important sentences and words was presented in [88]. Deep convolutional neural networks have been applied to short text classification and sentence-level classification in [13,31]. A character-level convolutional network (ConvNet) model was designed for text classification in [93]. A graph-CNN-based deep learning model was proposed in [61] to convert text to a graph-of-words, on which graph convolution operations are applied for feature extraction.
Applications in the real world
The most popular text classification algorithms, such as the Support Vector Machine (SVM), K Nearest Neighbours (K-NN), logistic regression, and Rocchio’s method, have been used for information retrieval and filtering tasks [17,70,87]. A combination of lexico-syntactic and statistical learning approaches was proposed in [36] to extract and build a more reliable domain ontology for an intelligent information retrieval system. In [98], a new method was presented that represents a user profile by a relevant topic ontology; the method is capable of measuring the user profile more objectively and hence has great potential to enhance information retrieval and filtering processes. The authors of [74] used ontology techniques to build user profiles for knowledge-based and personalized Web information gathering systems.
Text classification techniques are widely used in concept-based Web information gathering systems. The paper [21] described how text classification techniques are used for concept-based Web information gathering. Web users submit a topic associated with some specified concepts. The gathering agents then search for the Web documents that are referred to by the concepts. A list of tasks in Web information gathering to which text classification techniques may contribute is outlined in [70], including automatic indexing for Boolean information retrieval systems, document organization (particularly in personal organization or structuring of a corporate document base), text filtering, word sense disambiguation, and hierarchical categorization of web pages. Text classification techniques have been utilized by [51] to classify Web documents into the best matching interest categories, based on their referring semantic concepts.
Although most research has focused on classification into topic categories, other types of classification, such as sentiment classification (positive, neutral or negative), have received a lot of attention in recent years. Sentiment classification classifies the opinion expressed in a document or a sentence as positive, negative or neutral [60]. Many supervised and unsupervised text classification methods have been used in sentiment classification. In [100], the authors used an unsupervised method to analyse tweet data for social event prediction. The authors of [75] analysed tweet data for depression detection.
Recurrent neural networks also benefit sentiment classification because they are capable of capturing sequential information. In [45], the authors investigated tree-structured long short-term memory (LSTM) networks for text and sentiment classification and discussed when tree structures are necessary. Some hierarchical models have also been proposed for document-level sentiment classification [73]; these generate different levels (e.g., phrase, sentence or document) of semantic representation within a document.
There are many practical text classification applications in diverse domains. In bioinformatics [32], text mining applications include subcellular localization prediction [5,71] and protein clustering [23]. In the finance field, the various applications of text mining are presented in a survey paper [33], which broadly categorizes them into FOREX rate prediction, stock market prediction, customer relationship management (CRM) and cyber security. In the survey [24], text mining techniques were applied to decision support systems in dental clinics.
Conclusions
The purpose of this survey has been to describe and analyse the state of the art of text classification, and to convey to the reader a sense of our excitement about the intellectual richness and breadth of the area. In recent years, many research groups have invested much effort in automated text analysis and classification and have made many great achievements. A rich literature is growing around these topics. However, challenging problems still exist in these areas. In particular, research issues such as how to achieve a breakthrough in current text classification for very large-scale text categorization problems, and how to build more effective and efficient feature selection models, have attracted a great deal of research attention.
We very much hope we have provided some helpful information to the readers who are encouraged to take up the many challenges that remain in the area.
