NgramSPD: Exploring optimal n-gram model for sentiment polarity detection in different languages

Abstract

Due to the rapid growth of web platforms such as blogs, discussion forums, peer-to-peer networks, and various other types of social media, Sentiment Polarity Detection (SPD), i.e. classifying texts by “positive” or “negative” orientation, has become an increasingly important and challenging task in recent years. There is a growing need for management and study of SPD not only in English, but also in other languages. The key to using Machine Learning (ML) for SPD lies in engineering a representative set of features. This paper explores different (byte, character and word) $n$-gram based text representation models in order to determine the most valuable model for representing text documents in various languages, one that can be used successfully by ML classification techniques for solving the SPD task. The proposed $n$-gram models were used in conjunction with the k Nearest Neighbours (kNN), Support Vector Machine (SVM) and Maximum Entropy (MaxEnt) algorithms to determine the opinion polarity of movie reviews. The effectiveness and language independence of the proposed $n$-gram models were demonstrated in experiments performed on seven publicly available movie review benchmarks in Arabic, Czech, English, French, Spanish, Turkish, and Serbian, the authors’ mother tongue. Formal evaluation has confirmed that the proposed byte and character $n$-gram models outperform the word $n$-gram model and, in conjunction with the presented MaxEnt algorithm, outperform other supervised ML techniques used with more complex document representation approaches. In some cases (Arabic, Czech, French, Serbian and Turkish), significant improvements over the baselines have been achieved. Despite their simplicity and broad applicability, byte and character $n$-grams have been shown to capture information on different levels – lexical and syntactic.

Keywords

Sentiment polarity detection movie reviews n-grams SVM MaxEnt kNN

1. Introduction

Sentiment Analysis (SA) is a challenging task that combines natural language processing (NLP) and text mining techniques in order to automatically identify and analyze emotions and opinions in text. In the process of analyzing emotions we generally speak of three text classification (TC) types: (1) identification of subjectivity (opinion classification or subjectivity identification), used to divide texts into those that carry emotional content and those that only have factual content; (2) sentiment classification (sentiment polarity detection – SPD) of texts that carry emotional content into those with positive and those with negative emotional content; (3) determination of the strength or intensity of emotional polarity (strength of orientation). In this paper, we focus on the SPD problem, which has attracted a great deal of attention, partly due to its potential applications.

Different approaches have been used for SPD, but the mainstream approach basically consists of two major methodologies: a supervised statistical methodology based on the application of Machine Learning (ML) algorithms (when a collection of labeled training data is available), and an unsupervised semantic methodology usually applied when linguistic resources are available but training data have not been provided. In order to take advantage of both methodologies, some studies apply a hybrid statistical-semantic approach. Regardless of the chosen approach, there are a number of challenges to be dealt with. Pang and his colleagues [36] concluded that the sentiment classification problem is more challenging than the traditional topic-based classification problem. Moreover, Turney [49] found movie reviews to be the most difficult of several domains for sentiment classification. Another difficulty with handling texts written by web users is the presence of different kinds of textual errors, such as typing, spelling and grammatical errors.

Although most of the research activity on sentiment analysis has concentrated on texts in English [36, 32, 44, 39], people increasingly express their points of view, experiences and opinions in many other languages. Therefore, there is a growing need for management and study of SPD in languages other than English. The aim of this paper is to determine whether there is a single type of feature for representing text documents that is valuable regardless of the language, so that it can be used successfully by statistical ML classification techniques to solve SPD efficiently while avoiding the previously mentioned challenges. We explore byte, character and word n-gram based document representation models in conjunction with the k Nearest Neighbours (kNN), Support Vector Machine (SVM) and Maximum Entropy (MaxEnt) ML algorithms for solving the SPD task in different natural languages. The effectiveness of the techniques and their language independence have been demonstrated in experiments performed on publicly available benchmark datasets of movie reviews in seven languages: English, Spanish, Arabic, French, Czech, Turkish and Serbian. Note that English, Spanish, Arabic and French are among the top ten languages most used on the Internet according to the Internet World Stats ranking.¹

¹
http://www.internetworldstats.com/stats7.htm.

The rest of the paper is organized as follows: the next section presents previous work on using n-gram based models to solve the SPD task. Section 3 presents the proposed n-gram based document representation models and the ML classification techniques used in this work, and Section 4 describes the experimental framework. The experimental results are presented in Section 5, while Section 6 compares them with previously published SPD techniques. Section 7 concludes the paper.

2. Related work

SA in multiple natural languages is explored in [25] as an SPD classification task, where the authors applied an SVM classifier to a character n-gram document representation model. The training data sources were local Amazon websites in English, German and French. The experiments showed that a classification model based on multilingual labelled data outperforms models that use a training set from a single language only. Experiments with different n-gram features were not performed. In [12] the authors compared character with word n-grams in solving SPD on a standard corpus composed of 1600 hotel reviews. They concluded that character $n$-grams improve the F1 measure in the detection of positive and negative opinions (by 2.3% and 2.1%, respectively) compared with word n-gram features. Furthermore, the authors concluded that character n-grams provide good performance even with a very small training corpus. In [42] the author compared character n-grams against word n-grams in the classification of positive, negative, and neutral sentences using three datasets in German. The first dataset covered articles taken from a net-clipping database, while the second and third were based on Amazon product reviews in two domains – mobile phones and notebooks. Across multiple experiments, character n-grams provided better performance than word n-grams (by 4% on F1 score). The author also concluded that pre-processing techniques such as stemming did not have any positive influence on classification performance. The same n-gram types – word and character – were used in [4]. The aim of that study was to explore whether the character n-gram model offered improved accuracy on three corpora: movie reviews, tweets, and a corpus of Facebook photo comments. Three standard classifiers (Naive Bayes (NB), SVM and MaxEnt) were used in the experiments, with character n-gram features where $1\leqslant n\leqslant 8$, and word n-gram features where $1\leqslant n\leqslant 2$. In 35 of the 36 experiments in total, character n-grams achieved better accuracy. Also, in [20] the authors compared character n-grams ($1\leqslant n\leqslant 10$) with word n-grams ($1\leqslant n\leqslant 3$). They concluded that character n-grams gave better accuracy than word n-grams in conjunction with the NB ML algorithm on two datasets – IMDb and IMDb-NOT. Only character n-grams were used in [1] to detect sentiment polarity in English, Spanish, and French. Some studies combined word and character $n$-grams [54, 55, 50]. In [55] word and character $n$-grams were combined for sentiment analysis of Chinese online reviews. In [5] a method² for representing a word as a bag of character n-grams was proposed. This technique, known as “word hashing” or “sub-word embeddings”, breaks words down into character n-grams and represents them as vectors of character n-grams. For movie rating prediction, word and sub-word embeddings were used in [50].

²
https://research.fb.com/fasttext/.

To our knowledge, although byte n-grams have been used in many other text classification tasks, there has been only one example of using byte n-grams in solving the SPD task. In [17] the authors used a byte n-gram frequency statistics method for document representation, and a variant of the kNN algorithm (for $k=1$) for solving the SPD task in English and Spanish. This paper represents a significant improvement of that work.

3. N-gram-based SPD techniques

3.1 Document representation models

The role of the document representation component is to represent a text document so as to facilitate machine manipulation while retaining as much information as needed. Text documents should be transformed into a compact and applicable representation which will be used uniformly in training, validation and classification. A text document $d_{j}$ is usually represented as a vector of term weights $\overrightarrow{d_{j}}=(w_{1j},w_{2j},\ldots,w_{|T|j})$, where $T$ is the set of terms that occur at least once in at least one document from the training set, and $0\leqslant w_{kj}\leqslant 1$ represents, loosely speaking, how much term $t_{k}$ contributes to the semantics of document $d_{j}$ [43].

A common and often overwhelming characteristic of text data is its extremely high dimensionality. Feature selection techniques are widely employed to reduce the dimensionality of data and enhance the discriminatory information. The word “feature” usually has two different but closely related meanings in the context of text classification. One meaning refers to the unit (corresponding to a term) used to represent or to index a document, while the other focuses on how to assign an appropriate weight to a given term.

A typical choice of “feature” in its first meaning is to identify terms with words. This is often called either set-of-words or bag-of-words (BoW) approach to document representation, depending on whether weights are binary or not. This approach treats text as a set of words ignoring the fact that text is a sequence of data [36].

Moreover, in morphologically rich languages [47], a word can have a large number of derived word forms (for example, in Serbian, the lemma ljubav ‘love’ has 15 inflected forms [33]). Since the word forms of one word have the same or similar meaning, it is sometimes useful to replace them with a single form, usually the stem or lemma. This replacement is used as part of dimensionality reduction in many NLP tasks. Stemming (a process of reducing inflected or derived word forms to their stem, base or root form) and lemmatization (a process of reducing inflected word forms to the basic word form – lemma) are language-dependent dimensionality reduction techniques, and raw text first has to be tokenized (divided into single words). However, many Asian languages (Chinese and Japanese, for example) do not have explicit word boundaries in text. Besides the BoW model, many studies use n-gram models.

Definition 1 Given a sequence of tokens $S=(s_{1},s_{2},\ldots,s_{N+(n-1)})$ over the token alphabet $\Sigma$, where $N$ and $n$ are positive integers, an n-gram of the sequence $S$ is any n-long subsequence of consecutive tokens. The $i^{th}$ n-gram of $S$ is the sequence $s_{i},s_{i+1},\ldots,s_{i+(n-1)}$ [46].

The term $n$-gram can be defined on a word, character or byte level. Extracting, for example, character n-grams from a document is like moving an n-character wide “window” across the document's internal representation, character by character. Each window position covers n characters, defining a single n-gram.
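The sliding window of Definition 1 can be illustrated with a minimal Python sketch. The function names (`sliding_ngrams`, `char_ngrams`, `byte_ngrams`, `word_ngrams`), the UTF-8 default and whitespace tokenization for words are assumptions made for this illustration; they are not taken from the software used in the experiments.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order of position."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    # Tokens are the characters of the string.
    return sliding_ngrams(text, n)


def byte_ngrams(text, n, encoding="utf-8"):
    # Tokens are the bytes of the encoded string (encoding is an assumption).
    return sliding_ngrams(text.encode(encoding), n)


def word_ngrams(text, n):
    # Tokens are whitespace-separated words (a simplifying assumption).
    return [tuple(gram) for gram in sliding_ngrams(text.split(), n)]


print(char_ngrams("good film", 3))
print(byte_ngrams("good film", 3))
print(word_ngrams("a very good film", 2))
```

A sequence of length $N+(n-1)$ yields exactly $N$ n-grams, which is why the list comprehension runs over `len(tokens) - n + 1` window positions.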

Byte vs character n-grams: In the case of languages written in the Latin alphabet, byte and character n-grams are quite similar, since one character is mostly represented by one byte. The difference is usually in the set of characters considered (n-grams on the character level usually do not take into account the distinction between uppercase and lowercase letters, punctuation symbols and digits). The difference is especially significant when alphabets such as Arabic, Chinese or Serbian Cyrillic are used. In the case of Asian languages, for example, one character is usually represented by two bytes, depending on the coding scheme used, so 75% of byte-level n-grams include half-characters (all odd-length n-grams, and half of the even-length n-grams), because the text is simply treated as a sequence of bytes instead of characters. Although byte n-grams sometimes have no specific meaning, especially for humans (for example, when they contain only one of the two bytes that represent a character), their extraction from a text does not require information about the coding scheme used, which is why they are a simpler representation for computer processing.
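The half-character effect can be checked with a small sketch. It assumes a text in which every character occupies exactly two bytes; the Serbian Cyrillic word љубав (ljubav) in UTF-8 is used here purely as an illustration, and the helper names are hypothetical.

```python
def splits_character(start, n, bytes_per_char=2):
    """True if the byte window [start, start + n) cuts through a character."""
    return start % bytes_per_char != 0 or (start + n) % bytes_per_char != 0


def split_fraction(text, n, encoding="utf-8"):
    """Fraction of byte n-grams of length n that contain a half-character."""
    data = text.encode(encoding)
    if len(data) != 2 * len(text):
        raise ValueError("sketch assumes exactly two bytes per character")
    starts = range(len(data) - n + 1)
    return sum(splits_character(s, n) for s in starts) / len(starts)


word = "љубав"  # 5 Cyrillic letters, 10 bytes in UTF-8
for n in (2, 3, 4):
    print(n, split_fraction(word, n))
```

Every odd-length window must start or end inside a character, while an even-length window does so only when it starts at an odd byte offset, which is roughly half of the positions.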

Byte/character vs word n-grams: Although word level analysis seems to be intuitive, it ignores the fact that many languages, such as Chinese for example, do not have word boundaries explicitly identified in text, thus making word segmentation a difficult problem.

When used in processing natural language documents, character and byte n-grams exhibit many good features:

• Language and topic independence: There is no need for any text preprocessing or higher level processing, such as tagging, parsing, or other language-dependent and nontrivial NLP tasks;
• Robustness: There is relative insensitivity to spelling variations and errors. Since each string is decomposed into small parts, any errors that are present tend to affect only a limited number of these parts, leaving the rest intact;
• Word stemming comes essentially for free: The n-grams for related forms of a word (e.g., “advance”, “advanced”, “advancing”, “advancement”, etc.) intrinsically have a lot in common when viewed as sets of n-grams;
• No linguistic knowledge is required: It is not necessary to have any linguistic information, even about the space character used for word separation, the new line character, uppercase and lowercase letters, and the like;
• Completeness: The token alphabet is known in advance;
• Efficiency: Only one-pass processing is required.

Using char and byte $n$-gram features can help especially when dealing with noise and missing values. For example, in [54] a method that substitutes (hashes) each word by letter-tri-grams (char $n$-grams, $n=3$) is proposed for the purpose of dimension reduction. The authors illustrated the idea on the English text stream “2014 Sci-Fi Movies”, which is first converted to the boundary-marked words #2014# #sci# #fi# #movies#, and then broken into the corresponding letter-tri-grams #20 201 014 14# #sc sci ci# #fi fi# #mo mov ovi vie ies es#, which are the final input for the training data structure. The authors pointed out that substituting a word by letter-tri-grams helps in treating noisy and missing data: (1) it reduces vocabulary size (from 500 K to 30 K by replacing each term with its letter-tri-grams); (2) the morphological variants of the same text are mapped to close letter-tri-grams, so spelling errors can be overcome; (3) trigram hashing also helps to cover data unseen in the training dataset.
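The letter-tri-gram decomposition of the example from [54] can be reproduced with a short sketch. The function names and the tokenization rule (lowercasing and splitting on anything that is not an ASCII letter or digit) are assumptions chosen so that “Sci-Fi” yields the two words shown above; the original method may tokenize differently.

```python
import re


def letter_trigrams(word, boundary="#"):
    """Mark word boundaries with '#' and slide a 3-character window."""
    marked = f"{boundary}{word.lower()}{boundary}"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def hash_text(text):
    """Split text into alphanumeric words and hash each into letter-tri-grams."""
    words = re.findall(r"[A-Za-z0-9]+", text)  # tokenization rule is an assumption
    return [letter_trigrams(w) for w in words]


for trigrams in hash_text("2014 Sci-Fi Movies"):
    print(trigrams)
```

Because a misspelled or unseen word still shares most of its tri-grams with its correct or related forms, the resulting feature space is both smaller and more tolerant of noise than a word vocabulary.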

The main disadvantage of n-gram techniques is that they yield a large number of n-grams. However, in the case of relatively small-scale text collections, the possibility of easily generating several thousand distinctive features can actually be an advantage.

N-gram techniques have been successfully used for a long time in a wide variety of problems and domains. In NLP they turn out to be effective in many applications, for example, text compression [51], information retrieval [8], authorship attribution [24], flat and hierarchical topic text classification [14, 15, 16] etc.

In our experiments we used byte, character and word n-grams with their normalized frequencies as document representation models. To produce the n-grams and their normalized frequencies, a variant of the publicly available software package Text::Ngrams,³ written by Keselj [24], was used.

³
http://web.cs.dal.ca/

3.2 kNN

The first technique that we present in this paper for SPD is a variant of the kNN (for $k=1$) n-gram technique, introduced and successfully used by Keselj and his colleagues [24] for authorship attribution. An author profile is defined as an ordered set of pairs $(x_{1},f_{1}),(x_{2},f_{2}),\ldots,(x_{L},f_{L})$ of the $L$ most frequent character n-grams $x_{i}$ and their normalized frequencies $f_{i}$. The authorship is determined on the basis of dissimilarities between two profiles, comparing the most frequent n-grams. Identical texts will obviously have an identical set of the $L$ most frequent n-grams, and thus have zero dissimilarity. Different text documents will be more or less similar to each other, depending on the number of most frequent n-grams they share. Algorithm 1 gives the detailed steps for this kNN n-gram based text classification technique (kNNnGT).

Algorithm 1 Train&Validation_kNNnGT(C, D(Train), D(Validation), n_min, n_max, L_min, L_max, step_L)
Input: Set of category labels $C$, training and validation sets of documents D(Train), D(Validation),
initial and final values (with a step) for training classifier parameters: n_min, n_max; L_min, L_max, step_L.
Output: Classification accuracy
for each $n$ from n_min to n_max do
    // Produce the set of “category documents”
    for each $c\in C$ do
        $\textit{doc}(c)\leftarrow\textit{ConcatenateTextsOfAllTrainingDocsInCategory}(D(\textit{Train}),c)$
    $D(C)\leftarrow\bigcup_{c\in C}\textit{doc}(c)$
    // For each validation and category document, construct its profile
    for each $\textit{doc}\in D(\textit{Validation})\cup D(C)$ do
        $\textit{Ngrams}(\textit{doc})\leftarrow\textit{ExtractAllNgrams}(\textit{doc},n)$
        for each $x\in\textit{Ngrams}(\textit{doc})$ do
            $\textit{frequencies}[x]\leftarrow\textit{CalculateTheNormalizedFrequency}(x,\textit{doc})$
        $\textit{Profile}(\textit{doc})\leftarrow\textit{ListNgramsByDescFreq}(\bigcup_{x\in\textit{Ngrams}(\textit{doc})}(x,\textit{frequencies}[x]))$
    for each $L$ from L_min to L_max with step step_L do
        // Cut off the validation and category profiles at the length $L$
        for each $\textit{doc}_{v}\in D(\textit{Validation})$ do $\textit{Profile}_{L}(\textit{doc}_{v})\leftarrow\textit{Profile}(\textit{doc}_{v})\|L$
        for each $c\in C$ do $\textit{Profile}_{L}(\textit{doc}(c))\leftarrow\textit{Profile}(\textit{doc}(c))\|L$
        // Calculate dissimilarity measure between validation and category profiles
        for each $\textit{doc}_{v}\in D(\textit{Validation})$ do
            for each $\textit{doc}(c)\in D(C)$ do
                $\textit{diss}_{vc}\leftarrow\textit{DissimilarityMeasure}(\textit{Profile}_{L}(\textit{doc}_{v}),\textit{Profile}_{L}(\textit{doc}(c)))$
            // Select the most similar category (or categories)
            $c(\textit{doc}_{v})\leftarrow\textit{argmin}_{c\in C}\textit{diss}_{vc}$
        Compute the accuracy of the produced categorization
select $n^{\prime}$ and $L^{\prime}$ which provide the highest accuracy
return accuracy for $n^{\prime}$ and $L^{\prime}$

Note that the dissimilarity measure plays an important role. We used a dissimilarity measure in the form of the relative distance presented in [24]:

$\displaystyle d(\mathcal{P}_{1},\mathcal{P}_{2})=\sum_{x\in\textit{Profile}}\left(\frac{2\cdot(f_{1}(x)-f_{2}(x))}{f_{1}(x)+f_{2}(x)}\right)^{2}$ (1)

where $f_{1}(x)$ and $f_{2}(x)$ are frequencies of an n-gram $x$ in the category profile $\mathcal{P}_{1}$ and the test document profile $\mathcal{P}_{2}$ , respectively.
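The profile construction and the relative distance of Eq. (1) can be sketched as follows. This is an illustrative sketch, not the implementation used in the experiments: it assumes that the sum runs over the union of the two profiles and that an n-gram missing from a profile has frequency 0, and the names `profile`, `dissimilarity` and `classify` as well as the toy category texts are hypothetical.

```python
from collections import Counter


def profile(text, n, L):
    """The L most frequent character n-grams with their normalized frequencies."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(L)}


def dissimilarity(p1, p2):
    """Relative distance of Eq. (1); absent n-grams count as frequency 0 (assumption)."""
    total = 0.0
    for x in set(p1) | set(p2):
        f1, f2 = p1.get(x, 0.0), p2.get(x, 0.0)
        total += (2.0 * (f1 - f2) / (f1 + f2)) ** 2
    return total


def classify(doc, category_texts, n, L):
    """Assign doc to the category whose profile is least dissimilar (k = 1)."""
    doc_profile = profile(doc, n, L)
    return min(
        category_texts,
        key=lambda c: dissimilarity(profile(category_texts[c], n, L), doc_profile),
    )


categories = {
    "positive": "great great wonderful film",
    "negative": "awful boring terrible film",
}
print(classify("great great wonderful film", categories, n=3, L=50))
```

Under the stated assumption, an n-gram that occurs in only one of the two profiles contributes exactly 4 to the sum, so the measure penalizes non-shared n-grams heavily regardless of how frequent they are.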

3.3 SVM

SVM classifiers have been shown to be efficient and effective for TC. SVM is a supervised ML method that generates input-output mapping functions from a set of labeled training data. Although the original SVM model was designed for binary classification, for the purpose of this research we used $\textit{SVM}^{\textit{Multiclass}}$,⁴ proposed by Joachims [21]. It was built using the direct method proposed by Crammer and Singer [7], rather than by decomposing a multiclass problem into a number of binary classification problems.

⁴
Available at http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html.

During the training phase, $\textit{SVM}^{\textit{Multiclass}}$ finds the solution of the following optimization problem:

$\displaystyle\min_{w,\xi}\ \frac{\sum_{i=1}^{k}w_{i}^{2}}{2}+C\frac{\sum_{i=1}^{n}\xi_{i}}{n}$

$\displaystyle \text{s.t.}\ \forall i\in[1..n],\ \forall y\in[1..k]:\ [x_{i}\cdot w_{y_{i}}]\geqslant[x_{i}\cdot w_{y}]+100\,\Delta(y_{i},y)-\xi_{i}$ (2)

where $\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{n},y_{n})\}$ is a training set with labels $y_{i}$ in $[1..k]$, $w_{y}$ is the weight vector of class $y$, $C$ is the usual regularization parameter that trades off margin size and training error, and $\Delta(y_{i},y)$ is the loss function that returns $0$ if $y_{i}$ equals $y$, and $1$ otherwise. To solve this optimization problem, $\textit{SVM}^{\textit{Multiclass}}$ uses an algorithm based on Structural SVMs [48].
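To make the constraint in Eq. (2) concrete, the sketch below evaluates the objective for a fixed set of weight vectors. It is not a solver and not the Structural SVM algorithm of [48]; it assumes that $w_{i}^{2}$ denotes the squared norm of the weight vector of class $i$, uses 0-based class labels, and the names `dot`, `slack` and `objective` are hypothetical. For fixed weights, the smallest slack that satisfies all constraints of example $i$ is $\xi_{i}=\max_{y}\big(x_{i}\cdot w_{y}+100\,\Delta(y_{i},y)\big)-x_{i}\cdot w_{y_{i}}$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slack(x, y_true, W):
    """Smallest xi that satisfies every constraint of Eq. (2) for one example."""
    scores = [dot(x, w) for w in W]
    worst = max(
        scores[y] + (0.0 if y == y_true else 100.0)  # 100 * Delta(y_true, y)
        for y in range(len(W))
    )
    return worst - scores[y_true]


def objective(X, Y, W, C):
    """Value of the objective of Eq. (2) for fixed weight vectors W."""
    regularizer = 0.5 * sum(dot(w, w) for w in W)  # squared norms (assumption)
    return regularizer + C * sum(slack(x, y, W) for x, y in zip(X, Y)) / len(X)


W = [[1.0, 0.0], [0.0, 1.0]]      # one weight vector per class
X = [[200.0, 0.0], [1.0, 0.0]]    # two toy documents
Y = [0, 0]                        # both belong to class 0
print(slack(X[0], Y[0], W), slack(X[1], Y[1], W), objective(X, Y, W, C=2.0))
```

The first document is scored well above the required margin of 100 and needs no slack, while the second is classified correctly but with too small a margin, so it still contributes to the training error term.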

3.4 MaxEnt

Maximum entropy modeling is a supervised ML method that predicts the probability distribution of data labels $y$ by maximizing the entropy function in Eq. (3), subject to the constraints imposed by the training data $x$.

$\displaystyle H(p)=-\sum_{x,y}p(x)\,p(y|x)\log p(y|x)$ (3)

It can be shown that there is always a unique model $p^{\ast}$ with maximum entropy under the given constraints, and then we have:

$\displaystyle p^{\ast}={\textit{argmax}}_{p}\ H(p)$ (4)
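The entropy function of Eq. (3) can be evaluated directly for small distributions. The sketch below is only a numerical illustration with hypothetical names and toy probabilities; it does not perform the constrained maximization of Eq. (4), which in practice is done by an iterative procedure.

```python
import math


def conditional_entropy(p_x, p_y_given_x):
    """H(p) of Eq. (3) for dict-based distributions; terms with p(y|x) = 0 contribute 0."""
    h = 0.0
    for x, px in p_x.items():
        for py in p_y_given_x[x].values():
            if py > 0.0:
                h -= px * py * math.log(py)
    return h


p_x = {"doc_a": 0.5, "doc_b": 0.5}
uniform = {x: {"pos": 0.5, "neg": 0.5} for x in p_x}
skewed = {x: {"pos": 0.9, "neg": 0.1} for x in p_x}
certain = {x: {"pos": 1.0, "neg": 0.0} for x in p_x}

print(conditional_entropy(p_x, uniform))
print(conditional_entropy(p_x, skewed))
print(conditional_entropy(p_x, certain))
```

With no constraints at all, the uniform model attains the largest value ($\log 2$ for two labels); constraints derived from the training data move the maximizer away from uniform only as far as they require.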

In this paper we used the SharpEntropy library, a part of SharpNLP,⁵ a C# port of Apache OpenNLP, used in [33].

⁵

https://sharpnlp.codeplex.com.

4. Experimental framework

In this section, we describe the framework used to evaluate the experiments carried out in this work.

4.1 Performance evaluation

The performance evaluation of classification methods can be measured in different ways. Frequently used measures for classification are: ROC curve (for imbalanced datasets), Gmean (the geometric mean of accuracies measured on each class separately), Dominance Index (difference between true positive and true negative rates which can be interpreted as a measure of balancing these rates), etc. In this research, we used the typical evaluation metrics that come from information retrieval – Precision (P), Recall (R), F1 measure and Accuracy [2]:

$\displaystyle P=\frac{\textit{TP}}{\textit{TP}+\textit{FP}},\quad R=\frac{\textit{TP}}{\textit{TP}+\textit{FN}},\quad F1=\frac{2PR}{P+R},\quad \textit{Acc}=\frac{\textit{TP}+\textit{TN}}{\textit{TP}+\textit{TN}+\textit{FP}+\textit{FN}}$ (5)

where TP (True Positives) is defined as the number of documents that were correctly assigned to the considered category, TN (True Negatives) is the number of the assessments where the system and a human expert agree on a negative label, FP (False Positives) is the number of documents that were incorrectly assigned to the considered category, and FN (False Negatives) is the number of negative labels that the system assigned to documents otherwise assessed as positive by the human expert [22]. All presented measures can be aggregated over all categories in two ways: micro-averaging – the global calculation of measure considering all the documents as a single dataset regardless of categories, and macro-averaging – the average on measure scores of all the categories. In this research we used macro-averaged P, R, F1 and Acc.
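The measures of Eq. (5) and their macro-averaging can be sketched in a few lines. The function names and the toy labels are hypothetical, and macro-averaged F1 is taken here as the mean of the per-category F1 scores, which is one common convention and an assumption of this sketch.

```python
def per_class_counts(gold, pred, label):
    """TP, FP, FN, TN when `label` is treated as the considered category."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    tn = sum(g != label and p != label for g, p in pairs)
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy as in Eq. (5); empty ratios count as 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    return p, r, f1, acc


def macro_average(gold, pred, labels):
    """Average each measure over the categories."""
    rows = [scores(*per_class_counts(gold, pred, c)) for c in labels]
    return tuple(sum(column) / len(labels) for column in zip(*rows))


gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
print(macro_average(gold, pred, ["pos", "neg"]))
```

In this toy example one positive review is misclassified, which lowers recall for the positive category and precision for the negative one; the macro-average gives both categories equal weight regardless of their size.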

Table 1

Benchmarks of movie reviews used for SPD

Data collections	Language	Number of positive reviews	Number of negative reviews	Size in MB (uncompressed)
CornellPD	English	1000	1000	7.90
MuchoCine	Spanish	1351	1274	7.60
OCA	Arabic	250	250	2.04
FMR	French	1000	1000	1.36
TMR	Turkish	5331	5331	1.47
CSFD	Czech	30897	29716	19.6
SerbMR-2C	Serbian	841	841	2.21

4.2 Benchmarks

For empirical evaluation of the SPD techniques presented, seven balanced and publicly available benchmark movie review datasets were used (see Table 1):

Cornell Polarity Dataset (CornellPD) – in English.⁶

⁶
CornellPD: http://www.cs.cornell.edu/people/pabo/movie-review-data/.

This corpus was first introduced by Pang and Lee [35]. It contains 1000 positive and 1000 negative reviews (we used here version 2.0) and it was compiled before 2002, with 20 reviews per author (312 authors total) per category.

MuchoCine – in Spanish.⁷

⁷

MuchoCine is available at: http://www.lsi.us.es/.

The corpus consists of 3878 movie reviews collected from the MuchoCine website. For this study, neutral opinions (movies with a score of 3 out of 5) have not been used, so the total number of documents on which the experiments have been performed is 2625, with 1274 negative reviews, and 1351 positive reviews.

OCA – in Arabic.⁸

⁸

OCA is available at: http://timm.ujaen.es/recursos/oca-corpus/.

The OCA corpus was collected by Rushdi-Saleh et al. [41] from Arabic movie reviews. It contains 500 reviews collected from 15 different web pages, consisting of 250 positive and 250 negative reviews.

French movie reviews (FMR) – in French. FMR was collected by Ghorbel and Jacot [13] from the web.⁹

⁹

http://www.allocine.com.

The corpus contains 2000 French movie reviews, 1000 positive and 1000 negative, from 10 movies. All reviews are between 500 and 1000 characters long.

Turkish movie reviews (TMR) – in Turkish.¹⁰

¹⁰

TMR is available at: http://www.win.tue.nl/

The TMR dataset was collected by Demirtas and Pechenizkiy [10] from Beyazperde web pages. They restricted this dataset to 5331 positive and 5331 negative sentences. On this website, reviews are rated on a scale from 0 to 5 by the same users who wrote them. A review is considered positive if its rating is equal to or above 4, and negative if it is below or equal to 2.

CSFD – in Czech.¹¹

¹¹

CSFD is available at: http://liks.fav.zcu.cz/sentiment/.

This corpus was collected by Habernal and Brychcín [18]. They downloaded 91,381 movie reviews from the Czech Movie Database¹² and split them into three categories according to their star rating (0–2 stars as negative, 3–4 stars as neutral, 5–6 stars as positive). The dataset contains 30897 positive, 30768 neutral, and 29716 negative reviews, respectively. In our experiments we used only positive and negative reviews.

¹²

http://www.csfd.cz/.

Table 2

Results for movie reviews in English, Spanish and Arabic

		CornellPD				MuchoCine				OCA
	n-gram	P	R	F1	Acc	P	R	F1	Acc	P	R	F1	Acc
kNN	Byte	83.68	83.68	83.68	83.68	80.66	80.66	80.66	80.66	95.78	95.78	95.78	95.78
	Char	84.43	84.43	84.43	84.43	75.94	75.94	75.94	75.94	92.17	92.17	92.17	92.17
	Word	82.48	82.48	82.48	82.48	73.47	73.47	73.47	73.47	90.96	90.96	90.96	90.96
MaxEnt	Byte	86.82	88.37	87.58	87.48	81.31	91.29	86.00	84.69	91.16	99.58	95.16	94.89
	Char	89.42	88.97	89.19	89.22	85.31	89.59	87.39	86.67	92.44	98.00	95.12	94.95
	Word	82.9	84.42	83.65	83.5	81.71	86.14	83.86	82.92	85.38	98.09	91.26	90.59
SVM	Byte	88.89	85.72	87.23	86.99	88.31	84.74	86.43	85.72	94.82	87.55	90.96	90.55
	Char	88.99	85.74	87.29	87.04	86.32	80.89	83.39	82.72	96.44	86.64	91.14	90.58
	Word	78.79	78.81	78.54	78.62	81.36	77.44	79.09	78.41	97.64	85.55	90.65	89.37

Notes: Bold numbers denote the best results for each SPD technique and each dataset among byte, character and word $n$ -grams, while underlined numbers denote the best results among different SPD techniques for each dataset.

SerbMR-2C – in Serbian. SerbMR-2C was collected by Batanović et al. [3] and it contains 841 positive and 841 negative reviews from eight websites. Some of the reviews are written in Cyrillic, but most of them are in the Latin script. A great majority of the reviews use ekavian pronunciation.

We have made the byte, character and word n-gram representations of all the benchmarks used in our work publicly available¹³ to the scientific community for research purposes.

¹³

www.matf.bg.ac.rs/~jgraovac/ngramSPD/benchmarks.zip.

Figure 1.

Comparison of different $n$ -gram models.

5. Empirical results

In order to compare different $n$-gram based document representation models, we conducted a set of experiments with the kNN, SVM and MaxEnt ML techniques on all benchmarks. We wanted to explore how these methods behave on the same domain (movie reviews) under the same conditions. For that reason, (1) the same balanced datasets were used in all three methods; (2) training and testing datasets were created in the same way, by using 10-fold cross-validation (CV); and (3) for each experiment, the main parameter (the length of an $n$-gram feature) had the same value for all features. Under these conditions, we explored (1) how these three methods behave when the type of features changes (word, char, byte); (2) how the performance evaluation metrics change depending on the main parameter of a feature, the length of the $n$-gram, when both the method and the type of features are fixed; (3) whether there is a combination of method, feature type and $n$-gram length that is optimal for all datasets, i.e. whether there is an optimal, language-independent method and, if so, what type of features it uses. The results are presented in the following tables and figures.
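The 10-fold CV protocol can be sketched as a generator of train/test index splits. How the folds were actually drawn is not stated, so the shuffling, the fixed seed and the function name `k_fold_indices` are assumptions made only for illustration.

```python
import random


def k_fold_indices(n_docs, k=10, seed=0):
    """Yield (train, test) index lists; every document is tested exactly once."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)         # fixed seed: an assumption
    folds = [indices[i::k] for i in range(k)]    # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

Reusing the same splits for kNN, SVM and MaxEnt keeps the comparison between techniques and feature types free of differences caused by the partitioning itself.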

Feature selection was performed automatically and no additional knowledge about the domain was used. In the case of byte and character n-grams we performed experiments for n-gram lengths between 2 and 9, while for word n-grams we used lengths between 1 and 3, for all methods. We used the available optional parameters related to noise and missing values – cutoff frequency and smoothing. In the case of kNN, the classification parameter $L$ (the number of most frequent $n$-grams) took values from 1000 to 60000 with step 1000. In the MaxEnt method, the SharpNLP GisModel was used with 100 iterations and cutoff 5 (i.e., only $n$-grams that occur at least 5 times are used). The obtained results are presented in Table 2 (English, Spanish and Arabic), Table 3 (French and Turkish) and Table 4 (Czech and Serbian).

We can conclude that the best results have been obtained with the MaxEnt ML technique, with the exception of OCA in Arabic (where the best results have been obtained with kNN). A comparison of the different n-gram models for each benchmark and each ML technique is shown graphically in Fig. 1. From this figure we conclude that the accuracy of an SPD technique strongly depends on the dataset used, so for fair benchmarking the same datasets need to be used when comparing SPD techniques. Also, the obtained results show that byte and character n-gram models outperform the word n-gram model for all benchmarks and all ML techniques, while, as we expected, the results for byte and character n-grams are quite similar. In the case of word n-grams, the best results have been obtained for $n=1$ (which comes down to the BoW model). In the case of byte n-grams, the best results have been obtained for $n$ between 3 and 9, while in the case of character n-grams, the best results have been obtained for $n$ between 5 and 7. Figure 2 compares different byte n-gram lengths for the MaxEnt ML technique. We have made the detailed results, obtained for the MaxEnt ML technique applied to all benchmarks for different n-gram lengths, publicly available.¹⁴

¹⁴
www.matf.bg.ac.rs/~jgraovac/ngramSPD/results.zip.

Table 3

Results for movie reviews in French and Turkish

		FMR				TMR
	$n$ -gram	P	R	F1	Acc	P	R	F1	Acc
kNN	Byte	92.07	92.07	92.07	92.07	86.24	86.38	86.31	86.30
	Char	92.51	92.51	92.51	92.51	85.94	86.69	86.31	86.25
	Word	91.47	91.47	91.47	91.47	79.77	83.98	81.82	81.34
MaxEnt	Byte	94.34	96.73	95.52	95.46	90.76	90.32	90.54	90.56
	Char	94.82	96.21	95.51	95.48	88.99	88.81	88.90	88.91
	Word	92.67	94.24	93.45	93.40	88.51	88.20	88.35	88.37
SVM	Byte	93.19	92.88	93.00	93.00	89.78	90.01	89.88	89.89
	Char	93.90	93.65	93.76	93.75	88.58	89.04	88.75	88.79
	Word	86.55	87.95	87.17	87.30	86.10	85.45	85.74	85.68

Table 4

Results for movie reviews in Czech and Serbian

		CSFD				SerbMR-2C
	$n$ -gram	P	R	F1	Acc	P	R	F1	Acc
kNN	Byte	89.63	89.67	89.65	89.65	81.14	81.14	81.14	81.14
	Char	89.81	89.92	89.86	89.86	80.60	80.60	80.60	80.60
	Word	82.80	84.84	83.81	83.61	68.86	68.86	68.86	68.86
MaxEnt	Byte	92.94	94.00	93.47	93.30	87.83	82.64	85.12	85.54
	Char	94.15	93.58	93.86	93.76	87.83	82.64	85.12	85.54
	Word	90.66	94.44	92.51	92.21	83.06	75.56	79.09	79.99
SVM	Byte	92.89	91.66	92.27	92.07	84.06	83.16	83.54	83.47
	Char	93.72	90.89	92.29	92.01	82.40	84.00	83.08	83.29
	Word	90.55	86.89	88.68	88.21	78.47	78.94	78.66	78.78

Notes: Bold numbers denote the best results for each SPD technique and each dataset among byte, character and word $n$-grams, while underlined numbers denote the best results among different SPD techniques for each dataset.
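As a quick consistency check on how the columns of Tables 3 and 4 relate, the sketch below recomputes F1 as the harmonic mean of precision and recall for two rows of Table 3. The function name is hypothetical, and the assumption that F1 is derived from the reported P and R in this way is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) copied from Table 3, MaxEnt with byte n-grams.
rows = {
    "FMR": (94.34, 96.73, 95.52),
    "TMR": (90.76, 90.32, 90.54),
}

if __name__ == "__main__":
    for dataset, (p, r, reported) in rows.items():
        print(dataset, round(f1_score(p, r), 2), "reported:", reported)
```

For these two rows the recomputed value matches the reported F1 to two decimals. Some other rows (for example SVM with byte $n$-grams on FMR) differ in the second decimal, which may reflect per-class or per-fold averaging; this section does not state how the reported values were averaged.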

6. Comparison with other SOA methods

In order to evaluate the performance of the $n$-gram based techniques presented in this paper, we compare our results with those of SOA techniques over all benchmarks. Tables 5–7 present the comparison with other supervised (statistical), unsupervised (semantic) and hybrid (statistical-semantic) techniques, respectively. Only the best reported results are presented for each technique and each benchmark. Because of the diversity of evaluation methods and the different methodologies on which the techniques are based, the results of the comparison must be interpreted with caution. For example, in the case of the CSFD benchmark, we performed two-class (positive and negative) classification, while other authors performed three-class (positive, neutral, negative) classification, so better results were to be expected from our approach (F1 higher by 15.36 percentage points). In all other cases, to make the comparison more convincing and fair, the same datasets were used in our experiments as in those performed by other authors. The results confirm that the presented simple byte and character level $n$-gram based document representation models, in conjunction with the proposed supervised ML techniques, outperform all other supervised statistical ML techniques (with the exception of the MuchoCine benchmark, where we obtained almost the same results as the SOA), although those techniques used more complex document representation models. Many of them used the BoW model with different preprocessing steps such as removal of stop words and rare words, stemming, lemmatization, correction of spelling mistakes or removal of special characters. Note that in the case of character-level $n$-grams we only removed punctuation marks and digits and converted the text to uppercase, while in the case of the byte-level $n$-gram model there was no need for any preprocessing steps.

In addition to the supervised statistical ML techniques, our technique outperforms the unsupervised semantic techniques (see Table 6) and in some cases (Arabic, French and Czech) the hybrid statistical-semantic techniques that are at least partially knowledge-based (see Table 7). Some of them used valence shifters (negations, intensifiers, diminishers) [23], some used text mining techniques to extract frequent word sub-sequences and dependency sub-trees from sentences in a document dataset [31], some embedded background knowledge derived from Wikipedia into a semantic kernel used to enrich the document representation model [52], while some used the Dictionary of Affect in Language (DAL) [9] to assign an “evaluation” value to each word depending on its affective content. Comparing the results in Table 7 with those in Table 5, we conclude that hybrid SOA approaches outperform statistical supervised ML techniques. This further leads to the conclusion that our supervised ML techniques have great potential to be combined with knowledge-based techniques that use additional linguistic resources, such as SentiWordNet [11], WordNet-Affect or General Inquirer [45], in order to improve accuracy. Regardless, our supervised ML (byte and character) $n$-gram based techniques have in many cases (Arabic, Czech, French, Serbian and Turkish) achieved SOA results.

Table 5

Comparison with other supervised statistical SOA results

Dataset	Authors	ML technique	Document representation model with preprocessing	P	R	F1	Acc
CornellPD in English	Matsumoto et al. [31]	SVM	Removing rare words, word $n$-grams	N/A	N/A	N/A	88.10
	Wang and Domeniconi [52]	SVM	“Stop words” filtering, removing rare words, stemming, bag-of-words	81.24	N/A	N/A	N/A
	Martineau and Finin [29]	SVM	Removing rare words, bag-of-words	N/A	N/A	N/A	88.10
	Our approach	MaxEnt	Char $n$-grams	89.42(+8.18)	88.97	89.19	89.22(+1.12)
MuchoCine in Spanish	del-Hoyo et al. [9]	NN, SVM	“Stop words” filtering, part-of-speech tagging, bag-of-words	N/A	N/A	N/A	77.13
	Martínez-Cámara et al. [30]	SVM, NB, BBR, kNN	“Stop words” filtering, stemming, bag-of-words	87.21	87.01	87.10	87.08
	Our approach	MaxEnt	Char $n$-grams	85.31(-1.90)	89.59(+2.58)	87.39(+0.29)	86.67(-0.41)
OCA in Arabic	Rushdi-Saleh et al. [41]	NB, SVM	Correcting spelling mistakes, removing special characters, “stop words” filtering, stemming, word $n$-grams	87.38	95.20	90.00	90.60
	Perea-Ortega et al. [38]	NB, SVM	Online translation, “stop words” filtering, bag-of-words	86.99	94.80	90.73	N/A
	Perea-Ortega et al. [37]	NB, SVM	Online translation, “stop words” filtering, word $n$-grams	86.55	96.40	91.22	N/A
	Our approach	kNN	Byte $n$-grams	95.78	95.78	95.78(+4.56)	95.78(+5.18)
TMR in Turkish	Demirtas and Pechenizkiy [10]	NB, SVM, MaxEnt	Online translation, word $n$-grams	N/A	N/A	N/A	80.10
	Our approach	MaxEnt	Byte $n$-grams	90.76	90.32	90.54	90.56(+10.46)
CSFD in Czech	Habernal et al. [19]	SVM, MaxEnt	“Stop words” filtering, stemming, lemmatization, phonetic transcription, removing diacritics, word and char $n$-grams	N/A	N/A	78.50	N/A
	Our approach*	MaxEnt	Char $n$-grams	94.15	93.58	93.86(+15.36)	93.76
SerbMR-2C in Serbian	Batanović et al. [3]	MNB, LR, SVM	Stemming, word $n$-grams	87.50	88.50	88.00	79.10
	Our approach	MaxEnt	Byte $n$-grams	87.83(+0.33)	82.64	85.12	85.54(+6.44)

Notes: Bold numbers denote the best results for each dataset. In brackets we indicated how much our results differ from the best published results. (*) In our approach we performed two-class (positive and negative) classification, while other authors performed three-class (positive, neutral, negative) classification.

Table 6

Comparison with unsupervised semantic SOA results

Dataset	Authors	Linguistic resources	P	R	F1	Acc
MuchoCine in Spanish	del-Hoyo et al. [9]	DAL	N/A	N/A	N/A	67.64
	Molina-González et al. [34]	SOL	63.93	62.74	63.33	63.16
	Our approach	–	85.31(+21.38)	89.59(+26.85)	87.39(+24.06)	86.67(+23.51)
TMR in Turkish	Vural et al. [51]	SSL	N/A	N/A	N/A	79.20
	Our approach	–	90.76	90.32	90.54	90.56(+11.36)

Notes: Bold numbers denote the best results for each dataset. In brackets we indicated how much our results differ from the best published results. DAL = Dictionary of Affect in Language; SOL = Spanish Opinion Lexicon; SSL = SentiStrength Lexicon.

Figure 2. Comparison of different $n$-gram lengths using the byte $n$-gram model and the MaxEnt ML technique.

Table 7

Comparison with hybrid statistical-semantic SOA results

Dataset	Authors	ML techniques	Knowledge-based techniques	P	R	F1	Acc
CornellPD in English	Pang and Lee [35]	NB, SVM	Subjectivity sentence extraction	N/A	N/A	N/A	87.20
	Matsumoto et al. [31]	SVM	Word sub-sequences, dependency subtrees	N/A	N/A	N/A	92.20
	Whitelaw et al. [53]	SVM	Lexicon of appraising adjectives	N/A	N/A	N/A	90.20
	Kennedy and Inkpen [23]	SVM	Valence shifters	86.10	86.15	86.15	86.20
	König and Brill [27]	SVM	Human reasoning over text patterns	N/A	N/A	N/A	91.00
	Wang and Domeniconi [52]	SVM	Wikipedia	86.37	N/A	N/A	N/A
	Prabowo and Thelwall [40]	RBC, SVM	GIL	N/A	N/A	N/A	87.29
	Our approach	MaxEnt	–	89.42	88.97	89.19	89.22(-2.98)
MuchoCine in Spanish	del-Hoyo et al. [9]	NN, SVM	DAL	N/A	N/A	N/A	80.86
	Martín et al. [28]	SVM, NB, C4.5, BBR	SWN	88.58	88.57	88.56	88.57
	Our approach	MaxEnt	–	85.31(-3.27)	89.59(+1.02)	87.39(-1.17)	86.67(-1.90)
OCA in Arabic	Perea-Ortega et al. [38, 37]	NB, SVM	SWN	85.66	98.00	91.42	N/A
	Our approach	kNN	–	95.78(+10.12)	95.78(-2.22)	95.78(+4.36)	95.78
FMR in French	Ghorbel and Jacot [13]	SVM	SWN	N/A	N/A	N/A	93.25
	Our approach	MaxEnt	–	94.82	96.21	95.51	95.48(+2.23)
CSFD in Czech	Habernal and Brychcín [18]	MaxEnt	Word clusters (from semantic spaces)	N/A	N/A	80.00	N/A
	Brychcín and Habernal [6]	MaxEnt	Word clusters (from semantic spaces)	N/A	N/A	81.53	N/A
	Our approach*	MaxEnt	–	94.15	93.58	93.86(+12.33)	93.76

Notes: Bold numbers denote the best results for each dataset. In brackets we indicated how much our results differ from the best published results. GIL = General Inquirer Lexicon; DAL = Dictionary of Affect in Language; SWN = SentiWordNet. (*) In our approach we performed two-class (positive and negative) classification, while other authors performed three-class (positive, neutral, negative) classification.
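The bracketed values in Tables 5–7 are differences, in percentage points, between our result and the best published result for the same dataset and measure. A minimal sketch of that arithmetic, using accuracy values copied from Table 5, is given below; the function name and the data layout are hypothetical.

```python
def delta(ours, best_published):
    """Difference in percentage points, rounded to two decimals as in the tables."""
    return round(ours - best_published, 2)


# (dataset, our accuracy, best published accuracy), copied from Table 5.
accuracy_rows = [
    ("CornellPD", 89.22, 88.10),
    ("MuchoCine", 86.67, 87.08),
    ("OCA", 95.78, 90.60),
    ("TMR", 90.56, 80.10),
    ("SerbMR-2C", 85.54, 79.10),
]

if __name__ == "__main__":
    for dataset, ours, best in accuracy_rows:
        # The explicit sign format mirrors the bracket notation used in the tables.
        print(f"{dataset}: {ours:.2f}({delta(ours, best):+.2f})")
```

A positive value means our approach is ahead of the best published result, a negative value that it is behind. Where a measure is marked N/A for all other authors, no difference can be computed and no bracket is given.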

7. Conclusion and future work

In this paper we explored byte, character and word $n$-gram document representation models in order to see whether there is a single type of feature that can represent text documents in different languages and be used successfully by machine learning (ML) classification techniques to solve SPD efficiently. In our experiments we used the supervised ML techniques kNN, SVM and MaxEnt. We demonstrated the presented $n$-gram models on seven different publicly available benchmark corpora of movie reviews in paradigmatically quite different languages (Cornell Polarity Dataset in English, MuchoCine in Spanish, OCA in Arabic, FMR in French, TMR in Turkish, CSFD in Czech, and SerbPD and SerbMR-2C in Serbian). We have found that byte and character $n$-gram models outperform the word $n$-gram model. There is no need for any text preprocessing or higher-level processing, so the use of taggers, parsers or other language-dependent and non-trivial natural language processing tools is avoided. The models are fully language and topic independent and do not require any prior information about document content or language. Despite the simplicity and broad applicability of these models, we have obtained the best results among all supervised statistical ML techniques, and in some cases (Serbian, Arabic, French, Turkish, Czech) we have obtained state-of-the-art results.

The overall success of the presented techniques shows that character and especially byte $n$-gram based document representation models are sound and promising for solving the SPD task. They provide an inexpensive and effective way to analyse sentiment in large collections of documents written in any language, so we are encouraged to continue this line of research. Since the semantic content of texts and knowledge of the domain are very important, and the $n$-gram technique is applicable to different domains and problems, our further work will address the use of external linguistic resources such as SentiWordNet, WordNet-Affect or General Inquirer in order to improve accuracy in a particular language. On the other hand, classification models may be improved by choosing Self-Organizing Maps [26], neural networks and deep learning, so we will experiment with building and testing different classifiers.

Acknowledgments

The work presented has been financially supported by the Ministry of Science and Technological Development, Republic of Serbia, through Projects No. 174021 and No. III47003.

References

1. Argueta and Chen Y.-S., Multi-lingual sentiment analysis of social data based on emotion-bearing patterns, In Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP), Dublin, Ireland, Association for Computational Linguistics and Dublin City University, 2014, pp. 38–43.
2. Baeza-Yates R.A. and Ribeiro-Neto, Modern information retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
3. Batanović, Nikolić and Milosavljević, Reliable baselines for sentiment analysis in resource-limited languages: The serbian movie review dataset, In LREC, 2016, pp. 2688–2696.
4. Blamey, Crick and Oatley, Ru:-) or:-( character- vs. word-gram feature selection for sentiment classification of osn corpora, In SGAI Conf., Springer, 2012, pp. 207–212.
5. Bojanowski, Grave, Joulin and Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
6. Brychcín and Habernal, Unsupervised improving of sentiment analysis using global target context, In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, Hissar, Bulgaria, September 2013, INCOMA Ltd. Shoumen, BULGARIA, pp. 122–128.
7. Crammer and Singer, On the algorithmic implementation of multiclass kernel-based vector machines, Journal of Machine Learning Research 2 (2002), 265–292.
8. De Heer, Experiments with syntactic traces in information retrieval, Information Storage and Retrieval 10(3-4) (1974), 133–144.
9. del Hoyo, Hupont, Lacueva F.J. and Abadía, Hybrid text affect sensing system for emotional language analysis, In Proceedings of the international workshop on affective-aware virtual agents and social robots, ACM, 2009, p. 3.
10. Demirtas and Pechenizkiy, Cross-lingual polarity detection with machine translation, In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, ACM, 2013, p. 9.
11. Esuli and Sebastiani, Sentiwordnet: a high-coverage lexical resource for opinion mining, Evaluation, 2007, pp. 1–26.
12. Fusilier D.H., Montes-y Gómez, Rosso and Cabrera R.G., Detection of opinion spam with character n-grams, In International Conference on Intelligent Text Processing and Computational Linguistics, Springer International Publishing, 2015, pp. 285–294.
13. Ghorbel and Jacot, Sentiment analysis of french movie reviews, In Advances in Distributed Agent-Based Retrieval Tools, Springer, 2011, pp. 97–108.
14. Graovac, A variant of n-gram based language-independent text categorization, Intelligent Data Analysis 18(4) (2014), 677–695.
15. Graovac, Kovačević and Pavlović-Lažetić, Language independent n-gram-based text categorization with weighting factors: A case study, Journal of Information and Data Management 6(1) (2015), 4.
16. Graovac, Kovačević and Pavlović-Lažetić, Hierarchical vs. flat n-gram-based text categorization: can we do better? Computer Science and Information Systems 14(1) (2017), 103–121.
17. Graovac and Pavlović-Lažetić, Language-independent sentiment polarity detection in movie reviews: A case study of english and spanish, In 6th International Conference ICT Innovations, 2014, pp. 13–22.
18. Habernal and Brychcín, Semantic spaces for sentiment analysis, In International Conference on Text, Speech and Dialogue, Springer, 2013, pp. 484–491.
19. Habernal, Ptáček and Steinberger, Supervised sentiment analysis in czech social media, Information Processing & Management 50(5) (2014), 693–707.
20. Hartmann, Klenk, Burkovski and Heidemann, Sentiment detection with character n-grams, In Proceedings of the Seventh International Conference on Data Mining (DMIN’1), 2011, pp. 364–368.
21. Joachims, Making large-scale svm learning practical, Technical Report, SFB 475: Komplexitätsreduktion in Multivariaten Datenstrukturen, Universität Dortmund, 1998.
22. Joachims, Learning to classify text using support vector machines: Methods, theory and algorithms, Kluwer Academic Publishers, 2002.
23. Kennedy and Inkpen, Sentiment classification of movie reviews using contextual valence shifters, Computational Intelligence 22(2) (2006), 110–125.
24. Kešelj, Peng, Cercone and Thomas, N-gram-based author profiles for authorship attribution, In Proceedings of the conference pacific association for computational linguistics, PACLING, volume 3, 2003, pp. 255–264.
25. Kincl, Novák and Přibil, Sentiment classification in multiple languages: Fifty shades of customer opinions, In Business Challenges in the Changing Economic Landscape-Vol. 2, Springer, 2016, pp. 267–275.
26. Kohonen, Schroeder M.R. and Huang T.S., editors, Self-Organizing Maps, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 3rd edition, 2001.
27. König A.C. and Brill, Reducing the human overhead in text categorization, In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2006, pp. 598–603.
28. Martín-Valdivia M.-T., Martínez-Cámara, Perea-Ortega J.-M. and Alfonso Ureña López, Sentiment polarity detection in spanish reviews combining supervised and unsupervised approaches, Expert Systems with Applications 40(10) (2013), 3934–3942.
29. Martineau and Finin, Delta tfidf: An improved feature space for sentiment analysis, Icwsm 9 (2009), 106.
30. Martínez-Cámara, Martín-Valdivia M.-T. and Ureña López L.A., Opinion classification techniques applied to a spanish corpus, In Muñoz, Montoyo and Métais, editors, Natural Language Processing and Information Systems, Springer Berlin Heidelberg, 2011, pp. 169–176.
31. Matsumoto, Takamura and Okumura, Sentiment classification using word sub-sequences and dependency sub-trees, In PAKDD, volume 5, Springer, 2005, pp. 301–311.
32. Mesnil, Mikolov, Ranzato and Bengio, Ensemble of generative and discriminative techniques for sentiment analysis of movie reviews, CoRR, abs/1412.5335, 2015.
33. Mladenović, Mitrović, Krstev and Vitas, Hybrid sentiment analysis framework for a morphologically rich language, Journal of Intelligent Information Systems 46(3) (2016), 599–620.
34. Molina-González M.D., Martínez-Cámara, Martín-Valdivia M.-T. and Perea-Ortega J.M., Semantic orientation for polarity classification in spanish reviews, Expert Systems with Applications 40(18) (2013), 7250–7257.
35. Pang and Lee, A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2004, p. 271.
36. Pang, Lee and Vaithyanathan, Thumbs up? Sentiment classification using machine learning techniques, In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, Association for Computational Linguistics, 2002, pp. 79–86.
37. Perea-Ortega J.M., Martín-Valdivia M.T., Urena-López L.A. and Martínez-Cámara, Improving polarity classification of bilingual parallel corpora combining machine learning and semantic orientation approaches, Journal of the Association for Information Science and Technology 64(9) (2013), 1864–1877.
38. Perea-Ortega J.M., Martínez-Cámara, Martín-Valdivia M.-T. and Urena-López L.A., Combining supervised and unsupervised polarity classification for non-english reviews, In International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2013, pp. 63–74.
39. Poria, Cambria, Gelbukh A.F., Bisio and Hussain, Sentiment data flow analysis by means of dynamic linguistic patterns, IEEE Comp. Int. Mag. 10(4) (2015), 26–36.
40. Prabowo and Thelwall, Sentiment analysis: A combined approach, Journal of Informetrics 3(2) (2009), 143–157.
41. Rushdi-Saleh, Martín-Valdivia M.T., Urena-López L.A. and Perea-Ortega J.M., Oca: Opinion corpus for arabic, Journal of the Association for Information Science and Technology 62(10) (2011), 2045–2054.
42. Rybina, Sentiment analysis of contexts around query terms in documents, PhD thesis, Master’s thesis, 2012.
43. Sebastiani, Machine learning in automated text categorization, ACM computing surveys (CSUR) 34(1) (2002), 1–47.
44. Socher, Perelygin, Chuang, Manning C.D. and Potts, Recursive deep models for semantic compositionality over a sentiment treebank, In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, October 2013, Association for Computational Linguistics, pp. 1631–1642.
45. Stone P.J., Dunphy D.C. and Smith M.S., The general inquirer: A computer approach to content analysis, 1966.
46. Tomović, Janičić and Kešelj, n-gram-based classification and unsupervised hierarchical clustering of genome sequences, Computer Methods and Programs in Biomedicine 81(2) (2006), 137–153.
47. Tsarfaty, Seddah, Goldberg, Kübler, Candito, Foster, Versley, Rehbein and Tounsi, Statistical parsing of morphologically rich languages (spmrl): what, how and whither, In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, Association for Computational Linguistics, 2010, pp. 1–12.
48. Tsochantaridis, Support vector learning for interdependent and structured output spaces, In Proc. International Conference on Machine Learning (ICML), 2004.
49. Turney P.D., Thumbs up or thumbs down: semantic orientation applied to unsupervised classification of reviews, In Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics, 2002, pp. 417–424.
50. Varshit, Batchu V.V., Dakannagari M.M.K.R. and Mamidi, Sentiment as a prior for movie rating prediction, In 2nd International Conference on Innovation in Artificial Intelligence, ICIAI-2018, Shanghai, China, 2018.
51. Vural A.G., Cambazoglu B.B., Senkul and Tokgoz Z.O., A framework for sentiment analysis in turkish: Application to polarity detection of movie reviews in turkish, In Computer and Information Sciences III, Springer, 2013, pp. 437–445.
52. Wang and Domeniconi, Building semantic kernels for text classification using wikipedia, In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2008, pp. 713–721.
53. Whitelaw, Garg and Argamon, Using appraisal groups for sentiment analysis, In Proceedings of the 14th ACM international conference on Information and knowledge management, ACM, 2005, pp. 625–631.
54. Song and Massey, Generalized learning of neural network based semantic similarity models and its application in movie search, In IEEE International Conference on Data Mining Workshop, ICDMW 2015, Atlantic City, NJ, USA, November 14–17, 2015, pp. 86–93.
55. Zheng, Wang and Gao, Sentimental feature selection for sentiment analysis of chinese online reviews, Int. J. Machine Learning & Cybernetics 9(1) (2018), 75–84.

¹ http://www.internetworldstats.com/stats7.htm.

² https://research.fb.com/fasttext/.

⁴ Available at http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html.

⁶ CornellPD: http://www.cs.cornell.edu/people/pabo/movie-review-data/.
