De-noising documents with a novelty detection method utilizing class vectors

Abstract

The classification of customer-voice data is important in real business because customer-voice data must be delivered to the relevant departments and responsible individuals. However, customer-voice data typically includes many novel words, such as typos, informal terms, and words that are too general to discriminate between categories of customer-voice data. Such noisy data often has a negative effect on the classification task. In this study, an improved novelty detection method is proposed that utilizes class vectors, which have high cosine similarity with the words that effectively discriminate between classes. Each class vector is treated as the centroid or the mean of the word vector distribution of its class, as derived from the neural embedding model, and a novelty score is calculated for each word so that novel words are effectively detected. The novelty scores are calculated by variants of the GMM and KMC methods that utilize class vectors. Qualitative observations verify the validity of the proposed method, and quantitative experiments verify the representational effectiveness and classification performance obtained when the method is applied to customer-voice data. The experimental results indicate that the classification performance on customer-voice data improves when the proposed novelty detection method is applied.

Keywords

De-noising documents, novelty detection, class vector, customer-voice

1. Introduction

Customer-voice (Voice of the Customer, VOC) is a term that indicates a customer’s feelings regarding their experience with a business, product, and/or service. A customer-voice analysis provides important outputs and benefits for product developers. It provides a detailed understanding of a customer’s requirements and a common language for a team involved in the product development process. Additionally, it could be a key input to set appropriate design specifications for a new product or service and a highly useful springboard for product innovation [1, 2, 3].

As mentioned above, the customer-voice plays an important role in various fields and is used by various departments. Thus, it is important to categorize customer-voice data and deliver it to the relevant departments and responsible individuals. For instance, categorizing the customer-voice data of mobile devices into system, user interface, design, and appearance categories enables its delivery to the proper departments and gives an overview of how customer-voice data is distributed across functions. Hence, it is necessary to classify customer-voice data into functional categories prior to analyzing it [4].

Customer-voice data is gathered across a variety of channels, including phone, e-mail, and the web, and is stored as text documents. It consists of extremely unstructured text, since e-mail contents and phone call recordings are stored without any proofreading. Thus, it typically includes mistakes such as typos, as well as informal terms such as interjections and slang, as shown in Table 1. From the viewpoint of representing and classifying customer-voice data, these words are considered noisy data since they provide no significant information about the meaning of a customer-voice. Furthermore, noisy data typically has a negative effect on the classification task, and even small amounts of noisy data can severely decrease overall performance [5].

Table 1
Example of problematic customer-voice data

Example of customer-voice	Problem
… I hate these tpye of device. You are not concernded about fixing the rebotting problem. Only interested in making high-end device. You’d better get rid of that attittude and get your head right…	Contains many typos
… I dunno what ur thinking. I am waiting your fucking response. I just wanna stop this and sue for damages. You piss me off. Don’t make me mad. OK? …	Including lots of slang or informal terms
… I am upset that I bought this stupid thing. for three days.. money.. angry about this. you said it was my responsibility. I didn’t do anything. but you said it’s my fault. You said I should pay for the repairs. I’m angry about this ….	Unnecessary repetition without meaningful words

Hence, novelty detection is considered an effective way to remove various noisy words. Removing the words detected by novelty detection is expected to improve the representation and classification performance [5]. Furthermore, if novelty detection can also detect less important words that have extremely low discriminative power for a classification task, in addition to noisy data, then removing these words is also expected to improve the classification performance. For instance, words that are too general to represent specific classes, such as ‘this’, ‘have’, or ‘and’, have extremely low discriminative power with respect to the classification task. Table 2 shows examples of novel words included in real customer-voice data of mobile devices collected from LG Electronics. The examples include typos and words with low discriminative power. In this study, improvements to previous novelty detection methods are proposed to effectively remove the aforementioned words.

Table 2

Example of novelty words in customer-voice data

Type	Example of words
Typos	de, tecnic, EPRD, sus, aguardo, poseedor, nsp, KAKTFTN, QUNC, VER, Adreno, uu
Words with low discriminative power	phone, device, smartphone, both, if, blah, is, was, and, or, my, your, it, that, who, which, not, no, last, first, now

Novelty detection in a textual domain aims to detect novel documents, sentences, words or interesting topics. Several studies and applications apply the novelty detection method in the case of text data as described in Section 2. However, these studies mainly focus on the document or sentence level in which various features could be easily extracted and represented as a vector dimension. Additionally, novelty detection studies of word levels are mostly based on a dictionary or a corpus due to a lack of suitable methods to extract word features.

There are many methods to detect novelty; among them, distance/density based methods and probabilistic methods are used most frequently. Distance/density based methods are based on the assumption that normal data are tightly clustered and that novel data occur far away from the other data points. Probabilistic approaches are based on estimating the generative probability density function of the data, with the assumption that low density areas in the training set indicate areas with a low probability of containing normal objects [6].

However, the direct application of previous novelty detection methods in the vector space of words causes a misclassification problem. That is, with respect to the previous methods, the words that are unique and less frequent but meaningful for a specific class are classified as novelties since they could be situated at a distance from other words.

The use of class vectors decreases the risk that a word that is unique and infrequent but meaningful for a specific class is classified as a novelty. A class vector is trained with a neural network similar to that of a simple neural embedding model, in which the class label of each document is incorporated into the network along with the words [7]. Class vectors have high cosine similarity with the words that discriminate between classes. Conversely, words with extremely low cosine similarity with every class vector have little or no discriminative power for a classification task. Therefore, novel words can be detected effectively by treating each class vector as the centroid or mean of the distribution of words generated from its class. The novelty detection method proposed in the present study utilizes class vectors in this way, and the representation and classification performance improve after the words it detects are removed.

Accordingly, in this study, two previous novelty detection methods are modified, namely the Gaussian mixture model (GMM), which is a probabilistic novelty detection method, and K-means clustering (KMC), which is a distance/density based novelty detection method. Each class vector is taken as the mean of a latent distribution in the Gaussian mixture model and as the centroid of a latent cluster in K-means clustering. The proposed method is used to remove noisy words. The results of the proposed novelty detection method and the previous methods are then compared qualitatively, and the representational effectiveness and classification performance of customer-voice data are verified with the application of both methods.

2. Related work

Novelty detection can be defined as the task of recognizing that data differ in some respect from the data that are considered normal. Novelty detection methods are commonly classified into five categories, namely the probabilistic approach, the distance/density based approach, the reconstruction based approach, the domain based approach, and information theoretic techniques. Among these, the probabilistic approach and the distance/density based approach are the most commonly used [6, 8]. The probabilistic approach uses probability density estimation and assumes that low-density areas correspond to low probabilities of containing normal data. The distance/density based approach assumes that normal data are tightly clustered and located close to each other, in contrast to novel data. This study proposes improvements to two such methods: the Gaussian mixture model, which is a probabilistic approach, and K-means clustering, which is a distance/density based approach.

Novelty detection in the textual domain aims to detect novel documents, sentences, words, or interesting topics. There are many examples of novelty detection in the textual domain, and these studies apply various methods, including the statistical approach, the mixture of models approach, the neural network based approach, the support vector machine based approach, and the clustering based approach [9, 10, 11, 12, 13, 14]. However, these studies focus on novelty detection at the document or sentence level, mainly because various features, such as word frequency, frequent POS lists, and average length, can be easily extracted from a document or sentence [15, 16, 17]. Meanwhile, novelty detection studies at the word level are mostly based on a dictionary or a corpus only, due to the lack of suitable methods to represent words in a vector space [18, 19].

Recently, the neural embedding model, which is based on neural networks and goes beyond simple co-occurrence statistics, was developed, and it can represent a word in a vector space. The neural embedding approach is based on the assumption that words occurring in similar contexts tend to have similar meanings [20, 21]. Based on this assumption, the neural embedding model is trained to predict neighboring words with the objective function $\frac{1}{T}\sum_{t=k}^{T-k}\log p(w_{t}|w_{t-k},...,w_{t+k})$ in CBOW or $\frac{1}{T}\sum_{t=k}^{T-k}\log p(w_{t-k},...,w_{t+k}|w_{t})$ in Skip-gram, where $T$ denotes the number of words and $k$ denotes the window size of neighboring words. The hidden nodes can then be used as the representation of each word $w_{t}$.
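The role of the window size $k$ in both objectives can be illustrated with a minimal Python sketch. It only enumerates the (target, context) pairs over which the sums run; it does not train any model. The function name `full_window_pairs` and the toy sentence are hypothetical and are used here only for illustration.

```python
def full_window_pairs(tokens, k):
    """Return (target, context) pairs for every position that has
    k neighbours on each side, mirroring the sum over t in the objectives."""
    pairs = []
    for t in range(k, len(tokens) - k):
        # k words to the left of position t plus k words to the right
        context = tokens[t - k:t] + tokens[t + 1:t + k + 1]
        pairs.append((tokens[t], context))
    return pairs


# Toy sentence (hypothetical); CBOW would predict the target from the
# context, while Skip-gram would predict the context from the target.
tokens = "i hate these type of device".split()
pairs = full_window_pairs(tokens, k=2)
for target, context in pairs:
    print(target, "<-", context)
```

With a window size of 2, only the two middle words of this six-word sentence have a full window, which is why the sums in the objectives start at $t=k$ and stop $k$ positions before the end.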

The main advantage of the neural embedding model is that words with similar meanings are located close to each other and that the semantic distance between words is preserved by considering each word’s semantic context. Because the words are represented in a continuous vector space, various machine learning techniques, including novelty detection, can be applied at the word level. For instance, Camacho-Collados applied novelty detection in a vector space of words to present a framework for the intrinsic evaluation of word vector representations [22]. However, the aim of that study was the evaluation of representational performance and not the detection of novel words in a whole word distribution. Thus, to the best of the authors’ knowledge, the present study is the first to propose a novelty detection method in a word vector space that uses a class vector.

A class vector is trained with a neural network similar to that of a simple neural embedding model. Sachan and Kumar suggested an architecture that embeds word vectors in conjunction with class vectors by incorporating both into a neural network [7]. In a manner similar to the simple neural embedding model, the neural network is trained with the objective function $\sum_{i=1}^{V}\log p(w_{i}|w_{\textit{context}})+\sum_{j=1}^{k}\sum_{i=1}^{V}\log p(w_{i}|c_{j})$, where $V$ denotes the number of words and $k$ denotes the number of classes. Learning the class vectors $c_{j}$ jointly with the word vectors $w_{i}$ leads to class vectors that have high cosine similarity with the words that discriminate between classes. For instance, the IMDB dataset has two classes, namely positive and negative. Negative words, such as ‘awful’, are located close to the negative class vector, while positive words, such as ‘wonderful’ or ‘lovely’, are located close to the positive class vector [23]. Table 3 shows examples of words with high cosine similarity with each class vector of the customer-voice data collected from LG Electronics.

Table 3
Example of words with high cosine similarity with each class vector

Class	Word	Cosine similarity	Word	Cosine similarity
Network connection	Wi-Fi	0.5579	Connect	0.3439
	Connection	0.4068	Access	0.3427
	Contact	0.4000	Network	0.3331
	Bluetooth	0.3594	HBS	0.3324
Multimedia	Play	0.5942	Watch	0.4485
	Music	0.5223	VR	0.4320
	TV	0.4866	Streaming	0.4290
	Video	0.4791	Codec	0.4014
Security & backup	Recovery	0.6564	Backup	0.4555
	Data	0.5591	Delete	0.4492
	Important	0.4856	Memory	0.4393
	Move	0.4634	Loss	0.4316
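A ranking such as the one in Table 3 can be obtained by sorting the words by their cosine similarity with a class vector. The following is a minimal sketch under stated assumptions: the two-dimensional vectors are made-up toy values, not embeddings from the LG Electronics data, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_words_by_class(word_vectors, class_vector, top_n=3):
    """Return the top_n (word, similarity) pairs, most similar first."""
    scored = [(word, cosine_similarity(vec, class_vector))
              for word, vec in word_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy 2-d vectors (hypothetical values for illustration only).
word_vectors = {"wifi": [0.9, 0.1], "music": [0.1, 0.9], "and": [-0.5, -0.5]}
network_class_vector = [1.0, 0.0]
ranking = rank_words_by_class(word_vectors, network_class_vector)
for word, similarity in ranking:
    print(f"{word}\t{similarity:.4f}")
```

In this toy setting the word pointing in nearly the same direction as the class vector is ranked first, while the general word ‘and’ has a negative similarity and is ranked last.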

In this study, variants of previous novelty detection methods, namely the Gaussian mixture model [24, 25, 26] and the K-means clustering approach [27, 28, 29], are proposed that utilize a class vector, as described in Section 3. The Gaussian mixture model is a parametric probability density function (PDF) represented as a weighted sum of Gaussian component densities. For multivariate data, $p(x|\theta)$ is defined as a finite mixture model with $J$ components, and each component is a multivariate Gaussian density with parameters ${\theta}_{j}=\{{\mu}_{j},{\Sigma}_{j}\}$, as follows:

$\displaystyle p(x|\theta)=\sum_{j=1}^{J}\alpha_{j}p_{j}(x|z_{j},\theta_{j}),$

$\displaystyle p_{j}(x|z_{j},\theta_{j})=\frac{1}{{(2\pi)}^{d/2}{|{\Sigma}_{j}|}^{1/2}}e^{-\frac{1}{2}{(x-\mu_{j})}^{T}\Sigma_{j}^{-1}(x-\mu_{j})}$

where $\alpha_{j}=p(z_{j})$ denotes the mixture weight, that is, the probability that a randomly selected $x$ was generated by component $j$, with $\sum_{j=1}^{J}\alpha_{j}=1$. After the parameters are estimated with the Expectation-Maximization (EM) algorithm, data points with a low PDF value are considered novel.

K-means clustering aims to partition $n$ observations into $k(\leqslant n)$ sets $S=\{S_{1},S_{2},...,S_{k}\}$ so as to minimize the within-cluster sum of squares, that is, the sum of squared distances from each point to the center of its cluster. Thus, its objective is to find the following:

$\displaystyle\mathop{\text{argmin}}\limits_{S}\sum_{i=1}^{k}\sum_{X\in S_{i}}\textit{dist}^{2}(X,\mu_{i})$

where $\mu_{i}$ denotes the mean of the points in $S_{i}$. After the $n$ observations are partitioned into $k$ sets and the location of each centroid is calculated, each data point that is far from its centroid is considered novel.
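The distance based baseline can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes Euclidean distance, a fixed number of Lloyd iterations, and a deterministic initialisation with the first $k$ points (random or k-means++ seeding is more common in practice). The toy points and function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations; returns the k centroids."""
    # Assumption: seed with the first k points so the sketch is reproducible.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: euclidean(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def novelty_scores(points, centroids):
    """Novelty score of each point: distance to its closest centroid."""
    return [min(euclidean(p, c) for c in centroids) for p in points]


# Two tight groups plus one isolated point (toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 30)]
centroids = kmeans(points, k=2)
scores = novelty_scores(points, centroids)
print(max(zip(scores, points)))
```

Note that the isolated point is absorbed into one of the clusters and pulls that cluster's centroid toward itself, because the centroids are re-estimated from the data in every iteration.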

3. Proposed method

As described above, customer-voice data is extremely unstructured and contains mistakes such as typos and informal terms. It also contains words that are of little importance for representing each class. The examples in Table 2 illustrate such noisy words. Removing these noisy words by novelty detection is expected to improve the representation and classification performance.

First, consider applying a previous novelty detection method to a vector space of words calculated by a neural embedding model in order to detect the noisy words. Assume a data set such as that shown in Fig. 1, in which each circle represents a word vector calculated by the neural embedding model. Ideally, the purple circles and green circles would be grouped into two main clusters, and the yellow circle would be classified as a novelty. However, applying the GMM novelty detection method to these data detects both the green and the yellow circles as novelties, since these words are located at a distance from the other words, as shown in Fig. 1, where the red ‘+’ and blue ‘+’ indicate the means of the Gaussian distributions. This implies that words that are distant from other words because of their uniqueness and low frequency are classified as novel words by the previous method, even though they are meaningful words that explain specific classes or are important for the classification task. Thus, applying the previous novelty detection method without modification is not sufficient for the effective detection of novel words.

Figure 1.

Results of previous novelty detection method.

The utilization of the class vector addresses this limitation. As described in Section 2, class vectors have high cosine similarity with the words that discriminate between classes. Hence, each class vector is taken as the mean or centroid of a word distribution, so that words that are close to a class vector or have a high PDF value are regarded as meaningful words that effectively explain a class. Meanwhile, words that are far from every class vector or have a low PDF value are regarded as noisy words, such as typos, or as words of little importance for discriminating between classes. Figure 2 shows the advantage of the proposed novelty detection method that utilizes a class vector. In the proposed method, each class vector is located near the centroid of the distribution of the words that represent its class. Even though the word distribution composed of green words contains only a small number of words, the class vector indicated by the green ‘+’ is located near its centroid. Therefore, the proposed method effectively separates meaningful words from novel words by utilizing class vectors. Thus, in this study, class vector based variants are proposed for the Gaussian mixture model and the K-means clustering approach, which are the methods most frequently used in novelty detection tasks.

Figure 2.

Advantage of proposed novelty detection method.

The details of the proposed novelty detection method are presented below. Formally, let $D=\{d_{1},...,d_{N}\}$ be the set of documents, where $N$ denotes the number of documents. Additionally, let $W=\{w_{1},...,w_{V}\}$ be the set of words, where $V$ denotes the total number of words in $D$, and let $C=\{c_{1},...,c_{k}\}$ be the set of classes, where $k$ denotes the total number of classes in $D$.

Each word vector $w_{i}$ and class vector $c_{j}$ is an $h$-dimensional vector that represents a word or a class, respectively, where $h$ denotes the number of hidden nodes defined by the user in the neural embedding model. The number of class vectors is equal to the number of classes in the data.

I. Calculate the vector representation of each word $w_{i}$ and each class $c_{j}$. Specifically, $w_{i}$ and $c_{j}$ are obtained by optimizing the function $\sum_{i=1}^{V}\log p(w_{i}|w_{\textit{context}})+\sum_{j=1}^{k}\sum_{i=1}^{V}\log p(w_{i}|c_{j})$.

II. Calculate the novelty score with the improved Gaussian mixture model or the improved K-means clustering method, each of which utilizes class vectors.

(a) Improvements of Gaussian mixture model:

Apply the improved GMM method, which takes each class vector as the mean of a distribution. Each distribution is assumed to be the distribution of the words of one class. As in the standard model, the improved GMM is represented as a weighted sum of $k$ component Gaussian densities, as given by the following equation:

$\displaystyle p(W|\mu,\Sigma)=\sum_{j=1}^{k}m_{j}g(W|\mu_{j},\Sigma_{j})$

where $m_{j}$, $j=1,...,k$, denote the mixture weights and $g(W|\mu_{j},\Sigma_{j})$, $j=1,...,k$, denote the component Gaussian densities. Each component density is a Gaussian function of the following form:

$\displaystyle g(W|\mu_{j},\Sigma_{j})=\frac{1}{{(2\pi)}^{h/2}{|{\Sigma}_{j}|}^{1/2}}e^{-\frac{1}{2}{(W-\mu_{j})}^{T}\Sigma_{j}^{-1}(W-\mu_{j})}$

Then, the mean vector $\mu_{j}$ is fixed to the class vector $c_{j}$, and only $m_{j}$ and $\Sigma_{j}$ are calculated and updated by the Expectation-Maximization (EM) algorithm as follows:

$\displaystyle\bar{m}_{j}=\frac{1}{V}\sum_{i=1}^{V}\frac{m_{j}g(w_{i}|\mu_{j},\Sigma_{j})}{p(w_{i}|\mu,\Sigma)},$

$\displaystyle\bar{\Sigma}_{j}=\frac{\sum_{i=1}^{V}\frac{m_{j}g(w_{i}|\mu_{j},\Sigma_{j})}{p(w_{i}|\mu,\Sigma)}(w_{i}-\mu_{j})(w_{i}-\mu_{j})^{T}}{\sum_{i=1}^{V}\frac{m_{j}g(w_{i}|\mu_{j},\Sigma_{j})}{p(w_{i}|\mu,\Sigma)}}$
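The EM updates with fixed means can be sketched as follows. This is a simplified illustration and not the authors' implementation: it assumes a diagonal covariance for each component (the equations above use a full $\Sigma_{j}$), a fixed number of iterations, and small numerical floors; the toy two-dimensional vectors and all function names are hypothetical. The novelty score is taken as the negative log of the mixture density.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (a simplification)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
        for xi, mi, v in zip(x, mean, var)
    )


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def component_logs(x, class_vectors, weights, variances):
    """log m_j + log g(x | c_j, Sigma_j) for every component j."""
    return [math.log(m) + log_gaussian_diag(x, c, var)
            for m, c, var in zip(weights, class_vectors, variances)]


def fit_fixed_mean_gmm(words, class_vectors, iterations=30, floor=1e-6):
    """EM that updates only the weights and variances; means stay at c_j."""
    k, h = len(class_vectors), len(class_vectors[0])
    weights = [1.0 / k] * k
    variances = [[1.0] * h for _ in range(k)]
    for _ in range(iterations):
        # E-step: responsibility of component j for word i
        resp = []
        for w in words:
            logs = component_logs(w, class_vectors, weights, variances)
            total = log_sum_exp(logs)
            resp.append([math.exp(value - total) for value in logs])
        # M-step: the mean update is skipped on purpose
        for j in range(k):
            r_sum = max(sum(r[j] for r in resp), floor)
            weights[j] = r_sum / len(words)
            variances[j] = [
                max(floor, sum(r[j] * (w[d] - class_vectors[j][d]) ** 2
                               for r, w in zip(resp, words)) / r_sum)
                for d in range(h)
            ]
    return weights, variances


def novelty_score(x, class_vectors, weights, variances):
    """Negative log of the mixture density; higher means more novel."""
    return -log_sum_exp(component_logs(x, class_vectors, weights, variances))


# Toy data: two class vectors, five words near them, one stray word.
class_vectors = [[0.0, 0.0], [10.0, 10.0]]
words = [[0.5, -0.2], [-0.3, 0.4], [0.1, 0.1],
         [9.8, 10.1], [10.3, 9.7], [5.0, -5.0]]
weights, variances = fit_fixed_mean_gmm(words, class_vectors)
scores = [novelty_score(w, class_vectors, weights, variances) for w in words]
print(scores)
```

Because the means never move, a stray word can only inflate the variance of the nearest component; it cannot drag the component center away from the class vector.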

(b) Improvements of K-means clustering:

The improved KMC method takes each class vector as the centroid of a cluster. Each cluster is assumed to be the cluster of the words of one class. The improved KMC method aims to minimize an objective function $J$, known as the squared error function, given by the following expression:

$\displaystyle J=\sum_{j=1}^{k}\sum_{W\in S_{j}}\textit{dist}(W,\mu_{j})^{2}$

where $S=\{S_{1},...,S_{k}\}$ denotes the set of clusters. Each centroid vector $\mu_{j}$ is fixed to the class vector $c_{j}$, and each data point is assigned to the cluster whose center is nearest to it. No additional step is required to recalculate the centroids. The distance between a word $w_{i}$ and the centroid of the cluster containing $w_{i}$ is used as the novelty score.
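Since the centroids are fixed, the KMC variant reduces to a single assignment pass. The sketch below assumes Euclidean distance, which is only one possible choice for $\textit{dist}$ (the text does not specify the distance function), and uses made-up two-dimensional vectors; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmc_with_class_vectors(word_vectors, class_vectors):
    """Assign each word to its nearest class vector.

    Returns {word: (cluster index, distance)}; the distance is the
    novelty score. No centroid update step is performed.
    """
    result = {}
    for word, vec in word_vectors.items():
        distances = [euclidean(vec, c) for c in class_vectors]
        nearest = min(range(len(class_vectors)), key=distances.__getitem__)
        result[word] = (nearest, distances[nearest])
    return result


# Toy vectors (hypothetical values for illustration only).
class_vectors = [[1.0, 0.0], [0.0, 1.0]]
word_vectors = {"wifi": [0.9, 0.1], "music": [0.1, 0.8], "uu": [-3.0, -3.0]}
assignment = kmc_with_class_vectors(word_vectors, class_vectors)
most_novel = max(assignment, key=lambda word: assignment[word][1])
print(most_novel, assignment[most_novel])
```

In this toy example the word that is far from both class vectors receives the largest novelty score, while each of the other words is assigned to the class vector it lies next to.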

III. Finally, the PDF value, that is, the weighted sum of the $k$ component Gaussian densities, is utilized as the novelty score in the GMM variant, and the distance between a word $w_{i}$ and the centroid of the cluster containing $w_{i}$ is utilized as the novelty score in the KMC variant to detect novel words. That is, in the GMM variant, words with a PDF value lower than a user-defined probability are considered novel words. In the KMC variant, words whose distance from the centroid exceeds a user-defined distance are considered novel words.

In step I, each word vector $w_{i}$ and class vector $c_{j}$ is calculated by the neural embedding model. The number of dimensions of $w_{i}$ and $c_{j}$ equals the number of hidden nodes of the neural embedding model, as defined by the user. In step II, a variation of the Gaussian mixture model or of the K-means clustering method that utilizes class vectors is used to calculate the novelty score, which is derived from the PDF value or from the distance to the centroid, respectively. In step III, novel words are detected by applying the user-defined threshold to the novelty score in each method.

4. Experiment

This section describes the performance of the proposed novelty detection method through observations and experiments. Figure 3 summarizes the experiment flow. First, the novelty detection results of both methods are observed. That is, the novel words detected by the proposed novelty detection method and by the previous method are examined and compared with respect to the novelty of the words. Second, each customer-voice data representation is constructed only from words that are not determined to be novelties by the proposed method or by the previous method, respectively. For example, if ‘tpye’ and ‘of’ are determined to be novelties by the proposed method, then ‘I hate these tpye of device’ is represented as a document vector, such as a bag-of-words, built only from ‘I’, ‘hate’, ‘these’, and ‘device’. The representational effectiveness and classification performance of the customer-voice data representations constructed without the novel words detected by each method are then compared.
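The de-noising example above can be reproduced with a short Python sketch. Whitespace tokenisation and the function name `denoise` are assumptions made for illustration; the set of novel words is the one given in the example.

```python
from collections import Counter


def denoise(tokens, novel_words):
    """Keep only the tokens that were not flagged as novelties."""
    return [token for token in tokens if token not in novel_words]


novel_words = {"tpye", "of"}
tokens = "I hate these tpye of device".split()
kept = denoise(tokens, novel_words)

# Bag-of-words representation built from the remaining words only.
bag_of_words = Counter(kept)
print(kept)
print(bag_of_words)
```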

4.1 Data description

In this study, customer-voice data of mobile devices collected from the Mobile Communication (MC) department of LG Electronics is used to verify the representational effectiveness and classification performance of the proposed method and its applicability to customer-voice data. The data was collected between April 23, 2014 and March 23, 2017, and was manually labeled into 12 classes by domain experts at LG Electronics. In order to avoid a class imbalance problem, a similar number of customer-voice records was collected from each class, as shown in Table 4 below.

Table 4
Customer-voice data set collected from LG Electronics

Class	Number of customer-voice data	Class	Number of customer-voice data
OS upgrade	900	Network connection	900
Multimedia	900	Call & Message	900
Hard key & input error	900	Heating & Processing	900
Water-proof & Dust-proof	900	Battery & Power	900
Accessory	793	Appearance & Display	900
Security & Backup	900	User Interface	900
		Total 12 classes	10,693

Figure 3.

Summary of the experiment flow

4.2 Experiment setup

First, to calculate the word vectors, the neural embedding model is configured with a window size of 8 and 300 hidden nodes. The same parameter settings are applied in the construction of the customer-voice data representations that rely on word embeddings, such as the neural embedding based word clustering approach [30] and the probabilistic word clustering based approach [31] described in the following paragraphs, to minimize the impact of hyper-parameters on the overall experiments.

Second, a de-noised customer-voice data representation is constructed from only those words that are not determined to be novelties, in order to compare the representational effectiveness and classification performance of the customer-voice data representations obtained with the proposed method and with the previous method. Specifically, the top 1%, 2%, 3%, 4%, 5%, 6%, 8%, 10%, 12%, 15%, or 20% of novel words detected by the proposed method and by the previous method are removed as a de-noising step prior to constructing the customer-voice data representation. Then, the representational effectiveness and classification performance of each de-noised representation are compared.
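Selecting a given percentage of the most novel words can be sketched as below. The sketch assumes that the percentage refers to a fraction of the vocabulary and that a higher score means a more novel word; the scores and the function name `select_novel_words` are hypothetical.

```python
def select_novel_words(scores, fraction):
    """Return the words whose novelty score falls in the top `fraction`
    of the vocabulary (assumption: higher score means more novel)."""
    n_remove = int(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_remove])


# Made-up novelty scores for a five-word vocabulary.
scores = {"wifi": 0.14, "music": 0.22, "backup": 0.31, "blah": 0.76, "uu": 0.77}
removed = select_novel_words(scores, fraction=0.4)
print(sorted(removed))
```

For the GMM variant the same function can be used once the PDF value is converted into a score that grows with novelty, for example its negative logarithm.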

The customer-voice data representation methods include Term Frequency-Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), the topic vector, the neural embedding based word clustering approach, and the probabilistic word clustering based approach. TF-IDF is the most common document representation method, in which a document is fundamentally represented by the counts of word occurrences within the document [32, 33]. LSA applies singular value decomposition (SVD) to the term-frequency matrix to reduce the number of rows while preserving the similarity structure among columns [34]. The topic vector is an inferred topic proportion that is typically used as a topic feature to represent the document [35]. In the neural embedding based word clustering approach [30, 36] and the probabilistic word clustering based approach [31], semantically similar terms are grouped into a common cluster by clustering the word vectors generated by the neural embedding model, and document vectors are then represented by the frequencies of these clusters. The only difference between the two methods is that the probabilistic word clustering based approach additionally considers the membership strength of words by utilizing a soft clustering method. In this experiment, the number of clusters is fixed at 150 for both word clustering approaches to minimize the impact of the number of clusters on the experiments.
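The cluster frequency representation used by the word clustering approaches can be illustrated with a minimal sketch of the hard clustering case. The mapping `word_to_cluster` stands in for the output of a word clustering step and is hypothetical, as is the choice to ignore words that have no cluster.

```python
from collections import Counter


def cluster_histogram(tokens, word_to_cluster, n_clusters):
    """Represent a document by how often each word cluster occurs in it."""
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    return [counts.get(cluster, 0) for cluster in range(n_clusters)]


# Hypothetical cluster assignment of five words into three clusters.
word_to_cluster = {"wifi": 0, "bluetooth": 0, "music": 1, "video": 1, "backup": 2}
document = ["wifi", "music", "video", "hello", "wifi"]
vector = cluster_histogram(document, word_to_cluster, n_clusters=3)
print(vector)
```

The soft clustering variant would instead add each word's membership strength to every cluster rather than a count of one to a single cluster.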

Finally, the second experiment, on representational effectiveness, is similar to that performed by Dai et al. [37]. In that study, triplets of documents were constructed in which two documents were selected from the same class, while the third document was selected from a different class. If the document calculated as most distant from the other two is the one from the different class, then the result is considered correct. The dataset of the present study contains 12 different classes, and thus 132 unique ordered combinations of classes are available for the triplets. Additionally, 1000 triplets are created for each combination, and thus the experiment is performed on 132000 triplets. In the third experiment, on document classification, the classification result is considered correct if the prediction model assigns the document to its actual class. A majority voting ensemble model, as used in several studies [38, 39], is adopted for the classification task. The K-Nearest Neighbor classifier [40, 41, 42], Support Vector Machine [43, 44, 45], Logistic Regression [46], Gaussian Naive Bayes classifier [47, 48, 49], and Neural Network [50, 51] models are combined in the ensemble model.
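The triplet evaluation can be sketched as follows. This is one possible reading of the protocol, not a confirmed reproduction: a triplet is counted as correct when the two same-class documents form the closest pair, cosine distance is assumed as the distance measure, and the toy document vectors and function names are hypothetical.

```python
import itertools
import math
import random


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def triplet_correct(a, b, c):
    """a and b share a class, c does not; correct if c is the odd one out."""
    d_ab = cosine_distance(a, b)
    return d_ab < cosine_distance(a, c) and d_ab < cosine_distance(b, c)


def triplet_accuracy(docs_by_class, triplets_per_pair, seed=0):
    rng = random.Random(seed)
    correct = total = 0
    # Ordered class pairs: (class of a and b, class of c)
    for same, other in itertools.permutations(docs_by_class, 2):
        for _ in range(triplets_per_pair):
            a, b = rng.sample(docs_by_class[same], 2)
            c = rng.choice(docs_by_class[other])
            correct += triplet_correct(a, b, c)
            total += 1
    return correct / total


# 12 classes give 12 * 11 = 132 ordered class pairs.
n_class_pairs = len(list(itertools.permutations(range(12), 2)))

# Toy document vectors for two well separated classes.
docs_by_class = {
    "network": [[1.0, 0.1], [0.9, 0.0], [1.0, 0.2]],
    "multimedia": [[0.0, 1.0], [0.1, 0.9], [0.2, 1.0]],
}
accuracy = triplet_accuracy(docs_by_class, triplets_per_pair=100)
print(n_class_pairs, accuracy)
```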

4.3 Experimental results

4.3.1 Observation of novelty detection results

Table 5
Words with lowest novelty score

Novelty detection method	Examples of words (Novelty score)
GMM with class vector (Proposed)	LCD ($-$143.32), breakage ($-$142.87), Marshmellow ($-$141.85), break ($-$141.69), health ($-$137.66), voice ($-$133.93), GPS ($-$133.23), battery ($-$132.46), volume ($-$130.04), QWERTY ($-$127.24)
KMC with class vector (Proposed)	touch (0.1728), restore (0.1947), security (0.2676), Lollipop (0.3927), WiFi (0.3022), ringtone (0.3169), LCD (0.3173), memo (0.3414), message (0.3503), backup (0.3850)
Previous GMM	is ($-$176.90), do ($-$175.71), again ($-$175.75), various ($-$174.43), season ($-$162.19), after ($-$160.06), opposite ($-$154.20), Samsung ($-$152.29), phone ($-$149.64), important ($-$147.67)
Previous KMC	and (0.1584), my (0.1606), of (0.1697), it (0.1754), your (0.1808), was (0.1811), have (0.1989), this (0.1990), is (0.2000), no (0.2006)

Table 5 shows the words with the lowest novelty scores as determined by the proposed method and the previous method. The novelty score of the GMM method is the negative logarithm of the PDF value, and that of the KMC method is the distance from the closest centroid. In the proposed method, the words with the lowest novelty scores are highly discriminative words that represent individual classes, such as ‘LCD’, ‘voice’, ‘security’, and ‘WiFi’. In the previous method, the words with the lowest novelty scores are words that are too general to discriminate between classes, such as ‘phone’, ‘again’, and ‘after’. In particular, the previous KMC method extracts extremely general words, such as ‘my’, ‘of’, and ‘it’. This implies that the novelty score of the proposed method is a more appropriate measure than that of the previous method for determining whether a word effectively represents a class.

Table 6

Words with highest novelty score

Novelty detection method	Examples of words (Novelty score)
GMM with class vector (Proposed)	aguardo (146.61), de (146.61), suddenly (146.59), regards (146.59), may (146.57), holiday (146.45), sus (146.38), poseedor (146.32), why (146.30), method (146.28)
KMC with class vector (Proposed)	both (0.7652), volkswagen (0.7652), normal (0.7652), blah (0.7651), if (0.7651), time (0.7651), age (0.7651), last (0.7651), uu (0.7647), SIRS (0.7647)
Previous GMM	electronic (29.39), statement (29.33), eBay (29.16), YouTube (29.07), native (28.95), connection (28.86), showing (28.65), progress (28.52), VOLTE (28.33), photography (28.33)
Previous KMC	premium (0.4109), repair (0.4101), than (0.4101), Media (0.4092), provide (0.4090), open (0.4074), Windows (0.4071), read (0.4070), music (0.4064), GUI (0.4043)

Table 6 shows the words with the highest novelty scores as determined by each method. Typos such as ‘de’, ‘sus’ and ‘uu’ and meaningless words such as ‘blah’, ‘last’ and ‘may’ are detected by the proposed method but not by the previous method. From a qualitative viewpoint, these results indicate that the proposed method performs better in the detection of novel words. Additionally, it is intuitively expected that the representational effectiveness and classification performance will improve when those words are detected and removed by the proposed method.
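The removal step can be sketched as below. This is a hedged illustration: the function name `denoise` is hypothetical, and it is assumed here that the removal ratio is applied to the scored vocabulary (the words with the highest novelty scores are dropped first). The scores are the KMC with class vector values reported in Tables 5 and 6; the two short documents are invented for the example.

```python
def denoise(documents, novelty, ratio):
    """Drop the `ratio` fraction of scored words with the highest novelty score.
    Words without a score are kept unchanged."""
    ranked = sorted(novelty, key=novelty.get, reverse=True)
    n_remove = int(len(ranked) * ratio)
    noisy = set(ranked[:n_remove])
    return [[w for w in doc if w not in noisy] for doc in documents]

# KMC with class vector scores taken from Tables 5 and 6.
scores = {"touch": 0.1728, "WiFi": 0.3022, "LCD": 0.3173, "uu": 0.7647, "blah": 0.7651}

# Invented tokenised documents, for illustration only.
docs = [["LCD", "blah", "touch"], ["uu", "WiFi"]]

cleaned = denoise(docs, scores, ratio=0.4)
print(cleaned)
```

With five scored words and a ratio of 0.4, the two highest-scoring words (‘blah’ and ‘uu’) are removed, while the class-specific words remain in the documents.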

4.3.2 Representational effectiveness

Table 7
Accuracy of representational effectiveness

Document representation method	Novelty detection method	No de-noising	5% de-noising	10% de-noising	20% de-noising
TF-IDF	GMM with class vector ${}^{*}$	0.6542	0.6663	0.6696	0.6714
	KMC with class vector ${}^{*}$	0.6542	0.6674	0.6699	0.6754
	Previous GMM	0.6542	0.6439	0.6469	0.6446
	Previous KMC	0.6542	0.6490	0.6482	0.6489
Neural embedding based clustering [30, 36]	GMM with class vector ${}^{*}$	0.7909	0.8113	0.8135	0.8161
	KMC with class vector ${}^{*}$	0.7909	0.8162	0.8217	0.8229
	Previous GMM	0.7909	0.7829	0.7781	0.7837
	Previous KMC	0.7909	0.7914	0.7959	0.7830
Probabilistic clustering based [31]	GMM with class vector ${}^{*}$	0.8801	0.9058	0.9222	0.9264
	KMC with class vector ${}^{*}$	0.8801	0.9051	0.9241	0.9251
	Previous GMM	0.8801	0.8910	0.8871	0.8810
	Previous KMC	0.8801	0.8845	0.8899	0.8831
Topic vector	GMM with class vector ${}^{*}$	0.6786	0.6941	0.6986	0.7025
	KMC with class vector ${}^{*}$	0.6786	0.6955	0.7050	0.7033
	Previous GMM	0.6786	0.6742	0.6764	0.6692
	Previous KMC	0.6786	0.6733	0.6713	0.6763
LSA	GMM with class vector ${}^{*}$	0.6761	0.6851	0.6924	0.6989
	KMC with class vector ${}^{*}$	0.6761	0.6855	0.6932	0.7010
	Previous GMM	0.6761	0.6742	0.6804	0.6802
	Previous KMC	0.6761	0.6685	0.6789	0.6774

${}^{*}$ : Proposed method.

Figure 4.

Accuracy of representational effectiveness.

Figure 5.

Accuracy of classification performance.

Table 7 and Fig. 4 show the representational effectiveness of the customer-voice data after applying the proposed method and the previous method. As shown in Fig. 4, the representational effectiveness of the proposed method increases as the removal ratio of novel words increases from 0% to 20%, for almost every document representation method; the only dip occurs for KMC with class vector on the topic vector representation (0.7050 at 10% to 0.7033 at 20%). In contrast, the representational effectiveness of the previous method increases and decreases irregularly with changes in the removal ratio, and even falls below the result without de-noising for specific document representation methods such as the neural embedding based word clustering approach. The representational effectiveness of the proposed method exceeds that of the previous method at every non-zero removal ratio.
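The trend can be checked directly from the reported numbers. The snippet below is only a reading aid: it copies four rows of Table 7 (a subset chosen for brevity) and uses two hypothetical helper functions to test whether the accuracy rises at every step and to compute the net change between no de-noising and 20% de-noising.

```python
# Removal ratios (%) and accuracies copied from four rows of Table 7.
ratios = [0, 5, 10, 20]
table7_rows = {
    ("TF-IDF", "GMM with class vector"): [0.6542, 0.6663, 0.6696, 0.6714],
    ("TF-IDF", "Previous GMM"): [0.6542, 0.6439, 0.6469, 0.6446],
    ("Neural embedding", "GMM with class vector"): [0.7909, 0.8113, 0.8135, 0.8161],
    ("Neural embedding", "Previous GMM"): [0.7909, 0.7829, 0.7781, 0.7837],
}

def increases_steadily(values):
    """True if every accuracy is higher than the one at the previous ratio."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))

def net_gain(values):
    """Accuracy at the largest removal ratio minus accuracy without de-noising."""
    return round(values[-1] - values[0], 4)

for (representation, method), accuracies in table7_rows.items():
    print(representation, "|", method, "|",
          increases_steadily(accuracies), "|", net_gain(accuracies))
```

For these rows, the proposed GMM variant rises at every step and ends with a positive net gain, whereas the previous GMM fluctuates and ends below its starting accuracy.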

4.3.3 Classification performance

Table 8
Accuracy of classification performance

Document representation method	Novelty detection method	No de-noising	5% de-noising	10% de-noising	20% de-noising
TF-IDF	GMM with class vector ${}^{*}$	0.6403	0.6471	0.6510	0.6523
	KMC with class vector ${}^{*}$	0.6403	0.6506	0.6522	0.6545
	Previous GMM	0.6403	0.6311	0.6358	0.6364
	Previous KMC	0.6403	0.6329	0.6391	0.6346
Neural embedding based clustering [30, 36]	GMM with class vector ${}^{*}$	0.6723	0.6902	0.6982	0.7027
	KMC with class vector ${}^{*}$	0.6723	0.6874	0.6918	0.7053
	Previous GMM	0.6723	0.6668	0.6498	0.6555
	Previous KMC	0.6723	0.6739	0.6700	0.6690
Probabilistic clustering based [31]	GMM with class vector ${}^{*}$	0.8638	0.8808	0.8876	0.8907
	KMC with class vector ${}^{*}$	0.8638	0.8867	0.8856	0.8994
	Previous GMM	0.8638	0.8642	0.8661	0.8657
	Previous KMC	0.8638	0.8695	0.8605	0.8738
Topic vector	GMM with class vector ${}^{*}$	0.6401	0.6626	0.6651	0.6698
	KMC with class vector ${}^{*}$	0.6401	0.6616	0.6719	0.6758
	Previous GMM	0.6401	0.3890	0.6270	0.6419
	Previous KMC	0.6401	0.3802	0.6487	0.6497
LSA	GMM with class vector ${}^{*}$	0.6443	0.6532	0.6572	0.6627
	KMC with class vector ${}^{*}$	0.6443	0.6552	0.6631	0.6728
	Previous GMM	0.6443	0.6469	0.6514	0.6507
	Previous KMC	0.6443	0.6502	0.6467	0.6439

${}^{*}$ : Proposed method.

Table 8 and Fig. 5 show the classification performance on customer-voice data after applying the proposed method and the previous method. As with the representational effectiveness, the classification performance of the proposed method improves with increases in the removal ratio of novel words (the only dip is KMC with class vector on the probabilistic clustering based representation, from 0.8867 at 5% to 0.8856 at 10%) and exceeds that of the previous method for all representation methods. The better performance of the proposed method is attributed to the fact that, by utilizing a class vector, it detects novel words more effectively than the previous method.

5. Conclusion

In this study, an advanced novelty detection method utilizing a class vector is proposed to effectively detect and remove novel words such as typos or meaningless words. Noisy data has a negative effect on the classification task, and even a small amount of noisy data can severely decrease the overall performance. Previously, a pre-processing method that removes certain parts of speech, such as prepositions or articles, was applied for this purpose. However, this method is not sufficient to remove all the different types of noisy words, such as typos and other informal words.

Recently, the development of neural embedding models that represent words in a vector space has made it possible to apply various novelty detection methods in a word vector space. Therefore, previous novelty detection methods, such as density based or probabilistic methods, could constitute another option for dealing with these problems. Nevertheless, with the previous method, words that are unique and infrequent yet meaningful for a specific class are classified as novelties because they may be situated at a distance from other words.

Thus, an advanced method that utilizes a class vector is proposed in the present study. The class vector is trained with a neural network similar to that of a simple neural embedding model, in which the word vectors and the class vectors are embedded jointly. According to a previous study, class vectors possess high cosine similarity with words that discriminate between classes. Thus, the utilization of the class vector decreases the risk that unique and infrequent but meaningful words for a specific class are classified as novelties.

The proposed method incorporates the class vector into previous novelty detection methods, namely the GMM and KMC based methods, and the observations show that it detects novel words more effectively than the previous method. In the experiments, the representational effectiveness and classification performance of the customer-voice representations obtained with the proposed method exceeded those of the previous method. Additionally, the results of the proposed method improved steadily with increases in the removal ratio of novel words, irrespective of the document representation method. Therefore, it is concluded that the novelty score of the proposed method is a more appropriate measure than that of the previous method for determining whether a word effectively represents a class. Furthermore, the better performance of the proposed method is attributed to the fact that, by utilizing a class vector, it detects novel words more effectively than the previous method.

In this study, novel words are detected and removed prior to document representation. Thus, the proposed method can be effectively applied to unordered document representation methods such as TF-IDF, the neural embedding based word clustering approach, LSA and the probabilistic word clustering based approach. However, it is difficult to apply the proposed method to ordered document representation methods, such as convolutional neural network based models [52, 53] or recurrent neural network based models [54], since intermittently missing words impair the sequentiality of a sentence. Thus, future studies will involve the development of more comprehensive methods that can be effectively applied to ordered document representation methods. It is expected that future studies will aid in the wide application of the proposed novelty detection method utilizing a class vector in various text mining tasks arising in real business environments.

Acknowledgments

I would like to express my appreciation to LG Electronics, which provided the customer-voice dataset used in the experiments of this study.

References

1. Katz G.M., The ‘One Right Way’ to gather the voice of the customer, PDMA Visions Magazine 25(4) (2001), 1–6.
2. Gaskin S.P., Griffin, Hauser J.R., Katz G.M. and Klein R.L., Voice of the Customer, Wiley International Encyclopedia of Marketing (2010).
3. Griffin and Hauser J.R., The voice of the customer, Marketing Science 12(1) (1993), 1–27.
4. Temkin B.D., Chatham and Amato, The Customer Experience Value Chain: An Enterprisewide Approach For Meeting Customer Needs, Forrester Research, March 15 (2005).
5. Mahapatra, Srivastava and Srivastava, Contextual anomaly detection in text data, Algorithms 5(4) (2012), 469–489.
6. Pimentel M.A., Clifton D.A., Clifton and Tarassenko, A review of novelty detection, Signal Processing 99 (2014), 215–249.
7. Sachan D.S. and Kumar, Class Vectors: Embedding representation of Document Classes, arXiv preprint arXiv:1508.00189 (2015).
8. Singh and Upadhyaya, Outlier detection: applications and techniques, International Journal of Computer Science Issues 9(1) (2012), 307–323.
9. Ando, Clustering needles in a haystack: An information theoretic analysis of minority and outlier detection, in: Seventh IEEE International Conference on Data Mining (ICDM 2007), IEEE, 2007, pp. 13–22.
10. Spinosa E.J., de Leon, Ponce and Gama, Novelty detection with application to data streams, Intelligent Data Analysis 13(3) (2009), 405–422.
11. Zhang, Ghahramani and Yang, A probabilistic model for online document clustering with application to novelty detection, in: Advances in Neural Information Processing Systems, 2004, pp. 1617–1624.
12. Baker L.D., Hofmann, McCallum and Yang, A hierarchical probabilistic model for novelty detection in text, in: Proceedings of International Conference on Machine Learning, Citeseer, 1999.
13. Manevitz and Yousef, Learning from positive data for document classification using neural networks, in: Proceedings of the 2nd Bar-Ilan Workshop on Knowledge Discovery and Learning, 2000.
14. Manevitz L.M. and Yousef, One-class SVMs for document classification, Journal of Machine Learning Research 2(Dec) (2001), 139–154.
15. Guthrie, Allison and Wilks, Unsupervised Anomaly Detection, in: IJCAI, 2007, pp. 1624–1628.
16. Guthrie, Unsupervised Detection of Anomalous Text, PhD thesis, University of Sheffield, 2008.
17. Guthrie and Wilks, An Unsupervised Approach for the Detection of Outliers in Corpora, LREC, 2008.
18. Heymann, Walter, Haeb-Umbach and Raj, Unsupervised word segmentation from noisy input, in: Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, IEEE, 2013, pp. 458–463.
19. Chatterji, Chatterjee and Sarkar, An Efficient Technique for De-Noising Sentences using Monolingual Corpus and Synonym Dictionary, in: COLING (Demos), Citeseer, 2012, pp. 59–66.
20. Harris Z.S., Distributional structure, Word 10(2–3) (1954), 146–162.
21. Turney P.D., Pantel et al., From frequency to meaning: Vector space models of semantics, Journal of Artificial Intelligence Research 37(1) (2010), 141–188.
22. Camacho-Collados and Navigli, Find the word that does not belong: A Framework for an Intrinsic Evaluation of Word Vector Representations, in: ACL Workshop on Evaluating Vector Space Representations for NLP, 2016, pp. 43–50.
23. Park, Supervised feature representations for document classification, PhD thesis, Seoul National University, 2016.
24. Chandola, Banerjee and Kumar, Anomaly detection: A survey, ACM Computing Surveys (CSUR) 41(3) (2009), 15.
25. Markou and Singh, Novelty detection: a review-part 1: statistical approaches, Signal Processing 83(12) (2003), 2481–2497.
26. Eskin, Anomaly detection over noisy data using learned probability distributions, in: Proceedings of the International Conference on Machine Learning, Citeseer, 2000.
27. Srivastava and Zane-Ulman, Discovering recurring anomalies in text reports regarding complex space systems, in: IEEE Aerospace Conference, 2005, p. 37.
28. Srivastava, Enabling the discovery of recurring anomalies in aerospace problem reports using high-dimensional clustering techniques, in: 2006 IEEE Aerospace Conference, IEEE, 2006, p. 17.
29. Kim, Kang, Cho, Lee H.-J. and Doh, Machine learning-based novelty detection for faulty wafer detection in semiconductor manufacturing, Expert Systems with Applications 39(4) (2012), 4075–4083.
30. Kim H.K., Kim and Cho, Bag-of-Concepts: Comprehending Document Representation through Clustering Words in Distributed Representation (2015).
31. Lee, Song and Cho, Document representation based on probabilistic word clustering in customer-voice classification (2016).
32. Baeza-Yates, Ribeiro-Neto et al., Modern information retrieval, Vol. 463, ACM Press, New York, 1999.
33. Manning C.D. and Schütze, Foundations of statistical natural language processing, Vol. 999, MIT Press, 1999.
34. Landauer T.K., Foltz P.W. and Laham, An introduction to latent semantic analysis, Discourse Processes 25(2–3) (1998), 259–284.
35. Cai and Graesser, Can Word Probabilities from LDA be Simply Added up to Represent Documents?
36. Suárez-Paniagua, Segura-Bedmar and Martínez, Word embedding clustering for disease named entity recognition, in: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, 2015, pp. 299–304.
37. Dai A.M., Olah and Le Q.V., Document embedding with paragraph vectors, arXiv preprint arXiv:1507.07998 (2015).
38. Orrite, Rodríguez, Martínez and Fairhurst, Classifier ensemble generation for the majority vote rule, in: Iberoamerican Congress on Pattern Recognition, Springer, 2008, pp. 340–347.
39. Bouziane, Messabih and Chouarfia, Profiles and majority voting-based ensemble method for protein secondary structure prediction, Evolutionary Bioinformatics Online 7 (2011), 171.
40. Mucherino, Papajorgji P.J. and Pardalos P.M., k-Nearest Neighbor Classification, in: Data Mining in Agriculture, Springer, 2009, pp. 83–106.
41. Cost and Salzberg, A weighted nearest neighbor algorithm for learning with symbolic features, Machine Learning 10(1) (1993), 57–78.
42. Cover and Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13(1) (1967), 21–27.
43. Steinwart and Christmann, Support vector machines, Springer Science & Business Media, 2008.
44. Vladimir V.N. and Vapnik, The nature of statistical learning theory, Springer, Heidelberg, 1995.
45. Vapnik V.N. and Vapnik, Statistical learning theory, Vol. 1, Wiley, New York, 1998.
46. Walker S.H. and Duncan D.B., Estimation of the probability of an event as a function of several independent variables, Biometrika 54(1–2) (1967), 167–179.
47. Langley, Iba and Thompson, An analysis of Bayesian classifiers, in: AAAI, Vol. 90, 1992, pp. 223–228.
48. Domingos and Pazzani, On the optimality of the simple Bayesian classifier under zero-one loss, Machine Learning 29(2–3) (1997), 103–130.
49. Lewis D.D., Naive (Bayes) at forty: The independence assumption in information retrieval, in: European Conference on Machine Learning, Springer, 1998, pp. 4–15.
50. McCulloch W.S. and Pitts, A logical calculus of the ideas immanent in nervous activity, Bulletin of Mathematical Biology 52(1–2) (1990), 99–115.
51. Gallant S.I., Neural network learning and expert systems, MIT Press, 1993.
52. dos Santos C.N. and Gatti, Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts, in: COLING, 2014, pp. 69–78.
53. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882 (2014).
54. Lai, Liu and Zhao, Recurrent Convolutional Neural Networks for Text Classification, in: AAAI, 2015, pp. 2267–2273.
