Stock price prediction through sentiment analysis of corporate disclosures using distributed representation

Abstract

Many studies have exploited textual data, such as news, online blogs, and financial reports, to predict stock price movements. Previous studies framed the task as a classification problem, predicting upward or downward movement of stock prices from text documents. Such an approach, however, may be inappropriate when combined with sentiment analysis. In financial documents, the same words may convey different sentiments in different sectors; if documents from multiple sectors are learned simultaneously, performance can deteriorate. Therefore, we conducted sentiment analysis of the 8-K financial reports of firms sector by sector, employing distributed representation to predict stock price movements. Experimental results show that our approach improves prediction performance by 25.4% over the baseline model, and that the direction of post-announcement stock price movements shifts in accordance with the polarity of the sentiment of the reports. Our model not only improves predictability but also provides visualizations, which may help agents actively trading in the field understand the drivers of the observed stock movements. The two main aspects of our model, predictability and interpretability, can provide meaningful insights to help decision-makers in the industry with time-split trading decisions or data-driven detection of promising companies.

Keywords

Stock price change prediction, distributed representation, sentiment analysis, visualization, 8-K financial reports

1. Introduction

The prediction of stock prices has been one of the major branches of financial research. Previous studies relied on mathematical models to predict stock prices using quantitative variables such as historical time series of stock prices [23, 2], as well as microeconomic and macroeconomic indicators [18]. Despite continuous efforts, econometric and artificial intelligence models have not achieved significant improvements in performance. This is partly due to the difficulty of discovering novel, meaningful factors that help explain stock price movements aside from the traditional ones. Moreover, models that utilize a limited set of input variables are of less value, especially to decision-making agents and investors actively operating in the financial industry, since such information is fair-to-all and publicly available, hence reducing their chance to “beat the market.” To tackle such shortcomings, more recent studies began to incorporate textual data in the analysis of market behaviors, ranging from the comprehensive reports on a company’s performance submitted to the Securities and Exchange Commission (SEC), such as 10-Ks and 10-Qs [15], opinion columns in the Wall Street Journal [12], and the finance section of the New York Times [17], to earnings press releases [21]. Specific words, phrases, paragraphs, and documents appearing in text data convey certain sentiments, and the analysis of these sentiments is gaining growing attention as a new method for stock price prediction in the financial field. Recent work has suggested that deep learning models may improve performance in predicting stock prices [41]. However, the same words may convey different sentiments in different sectors, and performance can degrade if the prediction model learns documents from multiple sectors simultaneously. 
For example, words such as war, attack, and terrorism are generally associated with negative sentiments, but they may have positive implications for firms in the military supply or military intelligence industries. Therefore, it is necessary to design a model that can accurately account for different industry effects when using textual information in prediction tasks.

Figure 1.

(a) 8-K announcement of Citigroup Inc. and subsequent stock price movement (b) 8-K announcement of JP Morgan Chase & Co and subsequent stock price movement.

The objective of this paper is to predict stock price movements in both quantitative and qualitative ways by analyzing the sentiments of 8-K financial reports by means of distributed representation. For example, consider the excerpts from 8-K reports by Citigroup Inc. and JP Morgan Chase & Co and their respective stock price movements presented in panels (a) and (b) of Fig. 1. Citigroup’s 8-K report, presented in panel (a), was released on July 17, 2002; JP Morgan’s, presented in panel (b), was released on Jan 15, 2010. Words highlighted in blue are associated with negative sentiments. The values in parentheses show the percentage decline in stock prices following the 8-K announcement date. Figure 1 shows that the stock price of Citigroup fell by 26.9% in the four business days after its announcement, while that of JP Morgan dropped by 7.2% within three business days after its announcement. The observations from Fig. 1 indicate that the sentiment of financial report announcements may be a useful tool for explaining subsequent stock price movements. Our goal is to automate such a process and produce meaningful results by employing the distributed representation method.

Figure 2.

Data flow diagram in terms of prediction and interpretation.

Distributed representation expresses documents as vectors, which are used to embed documents along with class information in the same space. This enables the calculation of distances among documents and sentiment classes, hence allowing identification of the sentiment class of a given document. Moreover, since document representations are in the form of vectors, one can easily visualize the sentiment class of a document after dimension reduction. A data flow diagram of our methodology is shown in Fig. 2. The distributed representation method incorporated in this study not only improves the predictability of the model compared with the conventional one-hot encoding approach but also provides visualization of the prediction results, hence adding interpretability. The improvement in predictability, as well as the visualizations produced by the analysis, can help active traders in the field when making time-split trading decisions or performing data-driven detection of promising companies. In addition, the visualization produced by the model may directly benefit decision-making agents in the industry by providing an intuitive illustration of the relationship between sentiment and stock price movement: our visualization results place stock price movements side by side with the sentiment trend of documents reporting on the respective company, which aids the understanding of readers and assists their decision-making.

The remainder of this paper is structured as follows. Section 2 reviews past literature that has incorporated text data in stock price prediction tasks. Section 3 introduces the model framework of this study. Section 4 describes the experimental settings and reports the results. Finally, Section 5 concludes the paper.

2. Related work

Various hypotheses and models have been proposed and applied to explain financial markets. According to the efficient market hypothesis of Fama [14], it is impossible to predict the market perfectly since publicly available information is fully reflected in stock price and the market will respond only to the new information. However, Lo and MacKinlay [26] insisted that the market could be predicted to some extent. Apart from the traditional autoregressive integrated moving average (ARIMA) models [3], more recent work began to look to artificial intelligence algorithms to solve stock price prediction tasks. For instance, R. Sitte and J. Sitte[39] used a neural network (NN) to detect weak signals in the S&P 500 time series, while Ballings et al. [4] employed random forest (RF) model to predict the direction of stock price movements. Support vector machine (SVM) was another technique utilized in various studies to predict stock price indices [43]. In addition, in the financial market and industry, researchers have actively employed both textual and numerical data in order to predict stock price changes. Chen and Liu[7] and Schumaker and Chen[38] make use of both news articles and text from social network services as well as online blogs, where the former represents market sentiment, while the latter, individual investor’s sentiments.

Druz et al. [13] collected earnings call transcripts from 2004 to 2012 for all stocks belonging to the S&P 500 index and studied the relationship between managerial tone and the variables of interest to investors, including future stock returns. A “negativity” score, defined as the difference between the number of negative words and the number of positive words, divided by the sum of the numbers of negative and positive words plus 1, was used to standardize the “tone surprise”, the excessive component of managerial tone. Druz et al. [13] reported that tone surprise was a significant factor in predicting earnings per share adjustments to the sell-side trader. Furthermore, an adjusted long-short strategy derived from the regression framework, going long on stocks with positive tone surprise and short on stocks with negative tone surprise, obtained a 1% return within 60 days after the earnings call. Jegadeesh and Wu [22] found a meaningful relationship between market reaction and the tone quantified from Form 10-K documents by counting negative and positive words. The results from multivariate regressions showed that both negative and positive words were significant features in explaining market reaction, and the effect of the tone of documents was observed quickly in the market, mostly within two weeks. The same methodology was applied to IPO prospectuses to examine the relationship with IPO underpricing, and it was found that the tone of IPO prospectuses adversely affected IPO underpricing. Heston and Sinha [19] predicted stock prices exploiting almost 1 million news stories. The sentiment of news stories was measured by using the Thomson Reuters sentiment engine, and the sentiment of companies was calculated by subtracting the negative sentiment score from the positive. Results showed that the duration of stock return predictability depended on the temporal aggregation of news. 
Sentiment of news aggregated over a day predicts returns for only a few days, while the sentiment of news aggregated over a week retains predictability for up to a quarter. In addition, positive news causes stock returns to rise quickly, while negative news affects stock returns with a long-delayed reaction. Bollen et al. [5] studied the relationship between the Dow Jones Industrial Average (DJIA) and public mood measured by quantifying Twitter data using OpinionFinder. Google-Profile of Mood States (GPOMS) measured public mood in terms of six dimensions: calm, alert, sure, vital, kind, and happy. The correlation between public mood and DJIA was analyzed by Granger causality analysis, and it was observed that the “Calm” mood was the most significant feature. The self-organizing fuzzy NN method predicted the daily up and down changes of DJIA closing values with 86.7% accuracy and a 6% decrease in mean average percentage error (MAPE). Lee et al. [25] reported that, for S&P 1500 companies, the performance of stock price prediction improved when using linguistic features of financial reports compared with the existing analysis using quantitative indicators such as earning surprise, recent movement, volatility, and event category. The unigram features extracted from documents were expressed using non-negative matrix factorization (NMF) and used as input variables. In a three-class classification problem using RF, the model with quantitative indicators served as a baseline with 50.1% accuracy; when linguistic features were added, the prediction accuracy rose to 55.5%, an improvement of more than 10% over the baseline. Sun et al. [42] explored the effect of text information from user-generated microblogs on the market. On a financial communications platform called StockTwits®,¹ textual data were collected for five years. The authors created a term-document matrix and dense input variables through a sparse matrix factorization model. Using the latent space model of Ming et al. [31] improved performance, with a prediction accuracy of 51.37%, compared with the baseline regression model. It was found that StockTwits® data contain information useful to asset managers and investors, and the study contributed to the use of high-volume social media data without using news and text sentiment.

¹
http://www.StockTwits.com.

Many studies have thus confirmed that the performance of stock price prediction improves when text data are used rather than numerical data only. Accordingly, various algorithms have been developed and applied to better represent text data and improve prediction performance.

Maas et al. [27] proposed a method to capture sentiment and semantic term-document information by combining unsupervised and supervised techniques, to overcome the problem of not capturing sentiment information when representing documents. In order to capture semantic similarities, they constructed a probability model of a document that uses a continuous mixture distribution over words, $p(d)=\int p(d,\theta)d\theta=\int p(\theta)\prod_{i=1}^{N}p(w_{i}|\theta)d\theta$ , where $N$ is the number of words in $d$ and $w_{i}$ is the $i^{th}$ word in $d$ . To capture word sentiment, the logistic function $p(s=1|w,R,\psi)=\sigma(\psi^{T}\phi_{w}+b_{c})$ was used, where $R$ is a word representation matrix, $\psi$ is a vector of regression weights, $\phi_{w}$ is the vector representation of $w$ , and $b_{c}$ is a scalar bias. The objective function was set up by combining the two probabilities in order to search for optimal parameters. It was shown that, using cosine similarity, the proposed method extracted semantically and sentimentally similar words better than the model that uses only the semantic objective function, or latent semantic analysis (LSA) [44]. For example, when five words similar to the word “romantic” were extracted, the proposed method extracted “romance”, “love”, “sweet”, “beautiful”, and “relationship”, whereas LSA extracted “romance”, “screwball”, “grant”, “comedies”, and “comedy”. The performance of the proposed algorithm was tested against linear SVM on bag-of-words features, latent Dirichlet allocation (LDA), and LSA, using the widely used IMDB dataset of movie reviews. Document-level sentiment polarity classification was conducted, and the proposed method outperformed other vector space models (VSMs). Notably, performance improved using only simple sentiment information, so the scope of application is wide. 
Naïve Bayes (NB) and SVM, commonly used as baselines for text classification, were in some cases found to perform better than state-of-the-art models, which showed that complex models do not always perform well [46]. Wang also found that bigram features give more consistent gains than unigram features. The study used various datasets widely used for data analysis, such as RT-s [34], CR [20], MPQA [47], Subj [33], RT-2k [33], and IMDB [27]. The performance of recently proposed algorithms was tested using the snippet datasets. The algorithms considered included the Recursive Autoencoder (RAE) [40], RAE-pretrain [8], which is trained on Wikipedia, and Voting and Rules [32], which use a sentiment lexicon and hard-coded reversal rules. Experimental results showed that multinomial naïve Bayes (MNB) with bigram features and SVM with naïve Bayes features (NBSVM) tend to outperform other models. In addition, performance on full-length reviews was compared with the BoW and LDA results from [27], the tf. $\Delta$ idf result [29], and the word representation restricted Boltzmann machine [11]. The results showed that NBSVM achieved 91.22% accuracy on the IMDB classification problem and showed the best performance on the other full-length datasets. Furthermore, it provides robust results for both snippets and full-length text.

Previous work, however, has not attempted to combine distributed representation and visualization to assist agents on the industry side. The model of this study is designed to do exactly that, by exploiting distributed representation of documents to improve predictability while providing an easy-to-understand summary of the results via visualization, which together may help guide the decision-making process of financial agents in the field.

3. Methodology

This study predicts stock price movements in two ways: model-based and visualization-based. The framework of our methodology is shown in Fig. 3.

Figure 3.

Diagram of stock price change prediction.

Model-based analysis calculates the sentiment of 8-K financial reports via the distributed representation method [35] to predict stock price changes, while the visualization-based analysis provides a qualitative assessment of the prediction results.

3.1 Distributed representations

In order to frame the stock price prediction problem as a classification model with text as input, it is important to represent documents appropriately and clearly for the task at hand. The most basic way to represent documents is through a bag-of-words (BoW) approach. The BoW model considers a document as a bag of its words, disregarding grammar and word order. The biggest shortcoming of the BoW model is that as the number of words increases, the dimension of the one-hot representations grows with the vocabulary size and quickly becomes unmanageable. Distributed representation addresses this limitation by projecting input data onto a continuous space of a smaller dimension [37]. The similarity between words and documents can easily be calculated, because each document is projected onto the same continuous embedding space. The resulting representation can then be used in various ways, from extracting words that are “closer” to a given word to clustering similar documents. Moreover, it has been reported that performance, to some extent, is maintained even when working with a small dataset. In this study, we employ word2vec [30] to obtain distributed representations of documents. Word2vec is a simple NN model that embeds words onto a continuous embedding space, and it has become the most widely used word embedding model because it reduces computational time and enables learning several times faster than, for instance, calculating a sparse matrix as required by the conventional BoW method. One interesting feature of word2vec is that linguistic regularities can be applied to the representation vectors. For example, one may express Rome as a combination of Paris, France, and Italy in the following way: v(“Rome”) $=$ v(“Paris”) $-$ v(“France”) $+$ v(“Italy”). There are two ways through which word2vec learns distributed representations of words: Skip-gram and Continuous Bag-of-Words (CBOW). 
Skip-gram predicts the context words given a target word; CBOW predicts the probability of observing the target word given its context words.
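To make the difference between the two training schemes concrete, the following minimal sketch generates the training examples each scheme would see for a toy sentence. The function names `skipgram_pairs` and `cbow_examples` and the window radius `r` are hypothetical and chosen for illustration only; this is not the word2vec implementation itself.

```python
def skipgram_pairs(tokens, r):
    """Skip-gram: one (target, context) pair per context word within radius r."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-r, r + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs


def cbow_examples(tokens, r):
    """CBOW: one (context words, target) example per position."""
    examples = []
    for t, target in enumerate(tokens):
        context = [tokens[t + j] for j in range(-r, r + 1)
                   if j != 0 and 0 <= t + j < len(tokens)]
        examples.append((context, target))
    return examples


tokens = ["firm", "showed", "good", "performance"]
print(skipgram_pairs(tokens, r=1))
print(cbow_examples(tokens, r=1))
```

Skip-gram yields several observations per target word, one per context word, whereas CBOW yields a single observation per position.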

The skip-gram model is reportedly suitable for learning distances in discrete data for which similarity is difficult to define. It represents textual data as continuous vectors, such that words appearing in similar contexts are placed close to each other in the same embedding space. Figure 4 shows the structure of the skip-gram model.

Figure 4.

The structure of the Skip-gram model[30].

Skip-gram utilizes an NN, which consists of an encoder and a predictor. The encoder converts the input words, coded discretely, into a continuous vector; the predictor then uses the resulting vector from the encoder to predict context words. Mathematically, given training words $w_{1},w_{2},\ldots,w_{T}$ , the objective function of skip-gram is defined as:

$\frac{1}{T}\sum_{t=1}^{T}{\sum_{-r\leqslant j\leqslant r,j\neq 0}\log p(w_{t+j}|w_{t})}$ (1)

where $r$ is the size of the training context. The conditional probability $p(w_{t+j}|w_{t})$ is calculated by the softmax function:

$p(w_{t+j}|w_{t})=\frac{\exp(v^{\prime}_{w_{t+j}}{}^{T}v_{w_{t}})}{\sum_{w=1}^{W}\exp(v^{\prime}_{w}{}^{T}{v_{w_{t}}})}$ (2)

where $v_{w}$ and $v^{\prime}_{w}$ are the input and output vector representations of $w$ , and $W$ is the number of words in the vocabulary. Skip-gram treats each context-target pair as a new observation, and this tends to work better with large datasets. One of the extensions of word2vec is the paragraph vector (PV) [24]. The PV model adds a paragraph ID when sliding through each word of the corresponding paragraph, which allows learning word and document vectors simultaneously in the same embedding space. A graphical illustration of the PV model is presented in Fig. 5.
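As an illustration of the skip-gram objective in Eqs. (1) and (2), the sketch below evaluates the softmax probability and the average log-probability on toy embeddings with NumPy. It assumes the usual word2vec convention of one input vector and one output vector per vocabulary word; the names `V_in`, `V_out`, `softmax_prob`, and `skipgram_objective` are hypothetical, and the full softmax is used here for clarity rather than the faster approximations used in practice.

```python
import numpy as np


def softmax_prob(V_in, V_out, target, context):
    """p(context | target): softmax over the whole vocabulary, as in Eq. (2)."""
    scores = V_out @ V_in[target]      # one score per vocabulary word
    scores = scores - scores.max()     # stabilize the exponentials
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context]


def skipgram_objective(V_in, V_out, word_ids, r):
    """Average log-probability of the context words, as in Eq. (1)."""
    T = len(word_ids)
    total = 0.0
    for t in range(T):
        for j in range(-r, r + 1):
            if j != 0 and 0 <= t + j < T:
                total += np.log(
                    softmax_prob(V_in, V_out, word_ids[t], word_ids[t + j]))
    return total / T


rng = np.random.default_rng(0)
W, dim = 5, 3                       # toy vocabulary size and embedding dimension
V_in = rng.normal(size=(W, dim))    # input vectors
V_out = rng.normal(size=(W, dim))   # output vectors
print(skipgram_objective(V_in, V_out, [0, 1, 2, 3, 4], r=1))
```

Training maximizes this quantity with respect to both sets of vectors.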

Figure 5.

Distributed model using paragraph vectors [24].

In Fig. 5, $D$ is a PV matrix, which is expressed as a set of unique vectors, one for each paragraph, and $W$ is a unique vector mapping of words. PV is inserted as an input vector just as a word vector is. PV provides information about the current context of words belonging to the same paragraph; therefore, it may be thought of as a memory model in some sense. PV allows consideration of word order, as bag-of-n-gram models do, but with a much smaller dimension.

On top of the PV, we enrich the model by adding class vectors when learning the documents. This is because the PV model, without the class vector, may produce results that are not suitable for sentiment analysis of financial documents. Since word2vec slides through the input text word by word, when the sets of context words are similar, the target words will end up being placed very close to each other in the embedding space even if they convey very different sentiments. For example, suppose there are two documents evaluating the operational performance of two different firms, Firm A and Firm B. Document 1 evaluates Firm A well by stating: “ $\ldots$ showed good performance.” On the other hand, Firm B receives a harsher remark, and Document 2 reports: “ $\ldots$ showed bad performance.” In both cases, the context words surrounding the target words are exactly the same, “showed” and “performance”; hence, word2vec will place “good” and “bad” at nearly the same position in the embedding space when the learning is complete. This is not appropriate, especially for sentiment analysis, since “good” and “bad” are clearly opposites in terms of sentiment. In this study, we address this issue by setting up a supervised learning framework for the PV model, forcing the model to learn class information simultaneously with PV as input variables.

Class information is the sentiment label of a document, determined by the direction of the closing price movement on the next business day. Each document is assigned one of two labels (“UP” or “DOWN”) as its class depending on the stock price shift on the next business day. “UP” means that the price rose by more than a given criterion, while “DOWN” means that the price fell by more than the criterion. Because stock price movements are already known at the time of learning, we design our model as supervised learning. The main difference between Supervised PV (SPV) models and PV models is that the class label information of documents is added to the input vector at training time. The objective function of SPV models is defined as:

$\frac{1}{T}\sum_{t=1}^{T}\{{\sum_{-r\leqslant j\leqslant r,j\neq 0}{\log p(w_{t}|w_{t+j})+\log p(w_{t}|d)+\log p(w_{t}|c)}}\}$ (3)

where $c$ indicates the class of the document, $d$ is the document, and all other notation follows that of the skip-gram model. The above framework can be considered a form of the SPV model introduced in [35], applied in a financial setting.

3.2 Visualization

In this study, we visualize the result of sentiment analysis via distributed representation. By expressing a document with a distributed representation, words, sentences, documents, and class information in the text document can be represented by vectors of the same dimension. It is, however, very difficult to visualize vectors of large dimension. There exist a number of dimensionality reduction techniques for visualization purposes, such as linear discriminant analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Principal Component Analysis (PCA). LDA calculates a linear combination of variables to categorize observations into two or more groups, and its performance is reported to be fairly good. However, because it is a supervised method, its applications are limited [16]. t-SNE is one of the newer methods, which reduces dimension while maintaining the relative distances between observations based on the non-linear relationships among them. However, because it is a non-linear method, it takes a long time to compute [28]. PCA, which is similar to LDA but unsupervised, is a multivariate technique that analyzes data in which observations are described by several inter-correlated dependent variables. PCA extracts important information from the data in the form of orthogonal variables called principal components [1]. PCA is simple, yet it has a mathematical relationship to other popular machine learning methods such as $k$ -means clustering and factor analysis. For these reasons, this study chooses PCA to reduce dimension and visualize the vectors resulting from the supervised PV learning stage. We linearly map the data to a lower-dimensional space using PCA and find the two eigenvectors with the greatest variances. These eigenvectors are then used to visualize class information, words, and documents in two-dimensional space. Figure 6 presents an example of our approach for visualization.

Figure 6.

Example of visualization[35].

In the above figure, we represent class information, namely ‘c1.0’, ‘c2.0’, $\ldots$ ,‘c5.0’, in a red box, words in green, and documents in blue. ‘c1.0’ is a negative class and ‘c5.0’ is a positive class. ‘trn5-73001’ document is located close to class 5 and appears as a positive document, and words such as ‘love’, ‘wonderful’, ‘great’, and ‘awesome’ are classified as positive words. In this way, we will visualize the sentiment of documents by embedding the word, document and class into the same space.
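The PCA projection used for visualization in this section can be sketched as follows. This is a minimal NumPy illustration on random vectors, not the code used to produce Fig. 6; the function name `pca_2d` and the toy dimensions are assumptions made for the example.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto the two principal components of largest variance."""
    Xc = X - X.mean(axis=0)                       # center each dimension
    cov = np.cov(Xc, rowvar=False)                # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
    return Xc @ top2


rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 50))   # stand-in for word, document, and class vectors
coords = pca_2d(vectors)
print(coords.shape)
```

Because word, document, and class vectors share one embedding space, they can be stacked into a single matrix, projected together, and plotted on the same two axes.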

3.3 Model-based prediction

The framework of model-based prediction is outlined in Fig. 7.

Figure 7.

The framework of model-based prediction.

An input document is parsed to extract unigram and bigram features, and the document-term matrix is created using term frequency (TF) and term frequency-inverse document frequency (TFIDF). We use six prediction models: logistic regression (LR), random forest (RF), multinomial Naïve Bayes (MNB), support vector machine (SVM), Naïve Bayes SVM (NBSVM), and supervised PV (SPV) as introduced in Section 3.1. LR and RF serve as baseline models, and they use only unigram features as input variables. A detailed description of each algorithm is given in Sections 3.3.1 through 3.3.5.
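The feature extraction step can be illustrated with a small pure-Python sketch that builds TF and TFIDF document-term matrices from unigram and bigram features. The weighting shown, raw count times $\log(N/\mathrm{df})$, is one common TFIDF variant and is an assumption here, since the exact variant is not specified above; the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Unigram and bigram features of a token list."""
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


def tf_and_tfidf(docs):
    """Return the vocabulary, the TF matrix, and the TFIDF matrix."""
    counts = [Counter(ngrams(d.lower().split())) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {term: sum(term in c for c in counts) for term in vocab}
    tf = [[c[term] for term in vocab] for c in counts]
    tfidf = [[c[term] * math.log(n_docs / df[term]) for term in vocab]
             for c in counts]
    return vocab, tf, tfidf


vocab, tf, tfidf = tf_and_tfidf(["good performance", "bad performance"])
print(vocab)
print(tf)
```

A term that occurs in every document, such as "performance" in this toy corpus, receives a TFIDF weight of zero under this variant.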

3.3.1 Random forest

RF, an ensemble learning method for classification and prediction, is a combination of a number of decision trees or regression trees. Given a training set $X=x_{1},x_{2},\ldots,x_{n}$ with responses $Y=y_{1},y_{2},\ldots,y_{n}$ , $B$ trees are built by repeatedly selecting a random sample with replacement from the training set. After training, the response of an unseen sample $x^{\textit{new}}$ is predicted by taking a majority vote among the trees for classification, or by averaging the predictions from all trees on $x^{\textit{new}}$ in the case of regression trees, as follows:

$\hat{T}=\frac{1}{B}\sum_{b=1}^{B}{\hat{T}_{b}(x^{\textit{new}})}$ (4)

where $\hat{T}_{b}$ is the prediction from the $b$ -th tree [6]. RF is one of the most accurate and widely known supervised learning algorithms. For many datasets, it yields a highly accurate classifier and provides estimates of which variables are influential in the classification. However, RF has been observed to overfit on noisy datasets.
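The sampling and aggregation steps of RF can be sketched as below, assuming the individual trees have already been trained and their predictions are available. The helper names are hypothetical, and tree construction itself is omitted.

```python
import numpy as np


def bootstrap_sample(n, rng):
    """Indices of a random sample of size n drawn with replacement."""
    return rng.integers(0, n, size=n)


def rf_regress(tree_preds):
    """Average of the B tree predictions, as in Eq. (4)."""
    return float(np.mean(tree_preds))


def rf_classify(tree_votes):
    """Majority vote among the B tree class predictions."""
    labels, counts = np.unique(tree_votes, return_counts=True)
    return labels[np.argmax(counts)]


rng = np.random.default_rng(0)
print(bootstrap_sample(5, rng))
print(rf_regress([1.0, 2.0, 3.0]))
print(rf_classify(["UP", "DOWN", "UP"]))
```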

3.3.2 Logistic regression

LR is widely used when the output variable is categorical. The probability of the response can be estimated through the logistic function [10, 45]. The logarithm of the odds, $y_{i}$ , where the odds are the ratio of the probability of $Y=1$ to that of $Y=0$ at $X=x$ , is predicted by the linear regression:

$y_{i}=\log\left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{n}x_{n}+\epsilon_{i},\quad\epsilon_{i}\stackrel{\text{iid}}{\sim}N(0,\sigma^{2})$ (5)

where $p_{i}$ is the probability that the response variable equals 1 for case $i$ , $\beta_{0}$ is the intercept of the linear regression equation, and $\beta_{1},\ldots,\beta_{n}$ are the regression coefficients. Eq. (5) can be solved for $p_{i}$ , which gives the logistic function:

$\hat{p}=\hat{T}(Y=1|x)=\frac{\exp(\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\ldots+\hat{\beta}_{n}x_{n})}{1+\exp(\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\ldots+\hat{\beta}_{n}x_{n})}$ (6)

The resulting probability $\hat{p}$ ranges between 0 and 1. If $\hat{p}$ is greater than a predetermined criterion value, $x$ is classified as class 1; otherwise, it is classified as class 0. LR is easy to implement and provides an intuitive interpretation of the relationship between the response variable and the predictor variables. LR, however, is often less accurate due to overfitting.
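The prediction rule of LR can be written in a few lines. The sketch below assumes the coefficients have already been estimated; the threshold of 0.5 is an illustrative choice for the criterion value, and the function names are hypothetical.

```python
import math


def predict_proba(x, beta0, betas):
    """Logistic function of Eq. (6) applied to the linear predictor."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))   # equals exp(z) / (1 + exp(z))


def classify(x, beta0, betas, threshold=0.5):
    """Class 1 if the estimated probability exceeds the criterion, else class 0."""
    return 1 if predict_proba(x, beta0, betas) > threshold else 0


print(predict_proba([0.0, 0.0], 0.0, [1.0, 1.0]))
print(classify([2.0], 0.0, [1.0]))
```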

3.3.3 Multinomial naïve bayes

MNB is a classification model that uses the conditional probability of belonging to a class, assuming that the variables are independent of each other [36]. Given data $x_{i}$ , $i=1,\ldots,n$ with $K$ classes, the conditional probability of belonging to class $k$ is as follows:

$p(C_{k}|x_{1},x_{2},\ldots,x_{n})=p(C_{k}|\bm{x})=\frac{p(C_{k})p(\bm{x}|C_{k})}{p(\bm{x})}$ (7)

where the denominator is a constant, regardless of the class, which is calculated only from the observed data. Since the variables are assumed to be independent given the class, the probability model can be expressed using only the numerator:

$p(C_{k},x_{1},x_{2},\ldots,x_{n})=p(C_{k})p(x_{1}|C_{k})\cdots p(x_{n}|C_{k})=p(C_{k})\prod_{i=1}^{n}p(x_{i}|C_{k})$ (8)

The predicted class is then the one that maximizes Eq. (8):

$\hat{y}=\arg\max_{k\in\{1,\ldots,K\}}p(C_{k})\prod_{i=1}^{n}p(x_{i}|C_{k})$ (9)

Because MNB assumes that all variables are independent of each other, it can lead to inaccurate results depending on the data set. However, it is possible to estimate the distribution of a class as a one-dimensional distribution, so the model has good performance and exhibits fast speed even with large datasets.
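The decision rule of Eq. (9) is usually evaluated in log space to avoid numerical underflow from multiplying many small probabilities. The following sketch assumes that the class priors and per-term likelihoods have already been estimated and that each $x_{i}$ is a term count, so that a term seen $x_{i}$ times contributes its log-likelihood $x_{i}$ times; the names and toy numbers are hypothetical.

```python
import math


def mnb_predict(x_counts, log_prior, log_likelihood):
    """Return the class k maximizing log p(C_k) + sum_i x_i * log p(term_i | C_k)."""
    scores = {}
    for k in log_prior:
        scores[k] = log_prior[k] + sum(
            count * log_likelihood[k][i] for i, count in enumerate(x_counts))
    return max(scores, key=scores.get)


log_prior = {"UP": math.log(0.5), "DOWN": math.log(0.5)}
log_likelihood = {
    "UP": [math.log(0.8), math.log(0.2)],     # toy p(term_i | UP)
    "DOWN": [math.log(0.3), math.log(0.7)],   # toy p(term_i | DOWN)
}
print(mnb_predict([2, 0], log_prior, log_likelihood))
```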

3.3.4 Support vector machine

SVM is a supervised learning model that constructs a hyperplane in a high-dimensional space for classification, regression, or other tasks. The hyperplane $w^{T}x+b=0$ , where $x$ is a data point, $w$ is the normal vector to the hyperplane, and $b$ is the bias, separates the vector space [9]. A graphical illustration of the SVM model is shown in Fig. 8.

Figure 8.

Support vector machine.

We search for a decision boundary (shown as the solid line in Fig. 8); the observations closest to the boundary are called the support vectors. The distance between these support vectors and the decision boundary is $1/||w||$ , and the width $2/||w||$ is called the margin. Finally, learning the SVM can be formulated as an optimization problem:

$\displaystyle\max_{w}\frac{2}{||w||}\quad\text{s.t.}\quad y_{i}(w^{T}x_{i}+b)\geqslant 1$ (10)

We classify the data using a discriminant function, $f(x)=w^{T}x+b$ , with the optimized parameters $w$ and $b$ . If $f(x)$ is greater than 1, the observation is classified as class 1; if $f(x)$ is less than $-1$ , as class 0. SVM is a model based on structural risk minimization; it offers good predictive performance and a wide range of applications, although optimization requires a long time for learning with large datasets.
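To make the quantities in Eq. (10) concrete, the sketch below evaluates the discriminant function, the margin, and the constraints for a hand-picked hyperplane on two toy points. It does not solve the optimization problem; the function names, the toy data, and the label coding $y_{i}\in\{-1,+1\}$ are assumptions made for the example.

```python
import numpy as np


def discriminant(w, b, x):
    """f(x) = w^T x + b."""
    return float(np.dot(w, x) + b)


def margin(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)


def satisfies_constraints(w, b, X, y):
    """Check y_i (w^T x_i + b) >= 1 for every training point."""
    return bool(np.all(y * (X @ w + b) >= 1))


w, b = np.array([2.0, 0.0]), 0.0
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1, -1])
print(discriminant(w, b, X[0]), margin(w), satisfies_constraints(w, b, X, y))
```

Among all hyperplanes that satisfy the constraints, the SVM selects the one with the largest margin.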

3.3.5 Naïve Bayes SVM

NBSVM is very similar to SVM, except that it uses transformed input variables $x_{i}=r\circ f_{i}$, the elementwise product of the log-count ratio $r$ and the feature count vector $f_{i}$ of document $i$. The log-count ratio $r$ is defined as follows:

$r=\log\left(\frac{p/\|p\|_{1}}{q/\|q\|_{1}}\right)$ (11)

where $p=\alpha+\sum_{i:y_{i}=1}f_{i}$, $q=\alpha+\sum_{i:y_{i}=-1}f_{i}$, and $\alpha$ is the smoothing parameter. The interpolated weight vector $w^{\prime}$ is calculated using an interpolation parameter ($\beta$) as follows:

$w^{\prime}=(1-\beta)\bar{w}+\beta w$ (12)

where $\bar{w}=\|w\|_{1}/|V|$ is the average magnitude of $w$ and $|V|$ is the size of the feature set. Finally, the discriminant function for NBSVM is defined as $f(x)=w^{\prime T}x+b$. An observation $x$ is assigned to class 1 if $f(x)$ is positive and to class 0 if it is negative. Applying these transformed inputs and interpolated weights to SVM can improve performance.
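The transformations in Eqs (11) and (12) can be illustrated with a short Python sketch. The toy count vectors, the function names, and the example weight vector are assumptions made for illustration; training the SVM on the transformed inputs is omitted.

```python
import math


def log_count_ratio(features, labels, alpha=1.0):
    """Eq. (11): r = log((p / ||p||_1) / (q / ||q||_1)).

    `features` holds one count vector f_i per document and `labels`
    holds +1 or -1 for each document.
    """
    n = len(features[0])
    p = [alpha] * n  # smoothed counts over documents with y_i = +1
    q = [alpha] * n  # smoothed counts over documents with y_i = -1
    for f, y in zip(features, labels):
        target = p if y == 1 else q
        for j, value in enumerate(f):
            target[j] += value
    p_norm, q_norm = sum(p), sum(q)  # L1 norms; every entry is positive
    return [math.log((pj / p_norm) / (qj / q_norm)) for pj, qj in zip(p, q)]


def transform(f, r):
    """Elementwise product r * f, the input passed to the SVM."""
    return [rj * fj for rj, fj in zip(r, f)]


def interpolate(w, beta=0.25):
    """Eq. (12): w' = (1 - beta) * w_bar + beta * w, with w_bar = ||w||_1 / |V|."""
    w_bar = sum(abs(wj) for wj in w) / len(w)
    return [(1 - beta) * w_bar + beta * wj for wj in w]


# Hypothetical counts for two features in two documents.
features = [[2, 0], [0, 2]]
labels = [1, -1]
r = log_count_ratio(features, labels, alpha=1.0)   # p = [3, 1], q = [1, 3]
x0 = transform(features[0], r)
w_prime = interpolate([1.0, -3.0], beta=0.25)      # w_bar = 2
```

A feature that occurs mostly in positive documents receives a positive ratio and one that occurs mostly in negative documents a negative ratio, so the transformation rescales each count by how strongly the feature indicates a class.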

4. Experiments and results

4.1 Data

We use the 8-K financial reports (https://www.sec.gov/edgar.shtml) as the primary data source. 8-K financial reports disclose unscheduled material events or corporate changes at a company that could be of importance to the shareholders or the SEC [25]. The data contain the company ID, the date of the report, the relevant business events (such as bankruptcies, layoffs, the election of a director, or a change in credit), and the main contents. For the sentiment analysis, we collected 8-Ks from 2002 to 2012 for four companies in the financial sector, as listed in Table 1.

Table 1

Company list

Ticker symbol	Company name	Number of documents
C	Citigroup Inc.	513
WFC	Wells Fargo and Co	427
GS	Goldman Sachs Group Inc.	257
JPM	JP Morgan Chase and Co	835
Total number of documents		2,032

Preprocessing included removing stopwords and replacing all numbers, whatever their meaning, with the word ‘num’. We gathered each company’s daily stock prices from Yahoo! Finance (http://finance.yahoo.com/) and used them to construct the target variable. The output variable takes the classes “UP” and “DOWN”, calculated from the difference in the company’s stock price before and after the report is released. News announced in the middle of the day is assumed to affect the stock price of the next day. We therefore used the closing price on the date of the announcement as the stock price before the release and the opening price of the next day as the stock price after the release [25]. We normalize this difference by subtracting the change in the S&P 500 index over the same period, to remove the effect of market conditions (bull or bear) from the influence of news on stock movements. The equation is:

$\Delta=\frac{\textit{SP}_{T+1}-\textit{SP}_{T}}{\textit{SP}_{T}}-\frac{\textit{S\&P500}_{T+1}-\textit{S\&P500}_{T}}{\textit{S\&P500}_{T}}$ (13)

where $T$ is the announcement date of the financial report and SP indicates the individual stock price. We used the closing price at time $T$ and the opening price at time $T+1$. For instance, if a company’s stock price rises 2% and the S&P 500 index goes up 1% after the event, the normalized difference equals 1%. We set the criterion value at 1 and assigned the “UP” class if the difference is greater than the criterion and the “DOWN” class otherwise.
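Eq. (13) and the labeling rule can be written as a short Python sketch. The prices below are hypothetical. The sketch also assumes that the criterion is expressed in percentage points, which is one reading of the text rather than a confirmed detail, so the criterion is passed explicitly.

```python
def normalized_change(close_t, open_t1, index_close_t, index_open_t1):
    """Eq. (13): stock return minus S&P 500 return from close at T to open at T+1."""
    stock_return = (open_t1 - close_t) / close_t
    index_return = (index_open_t1 - index_close_t) / index_close_t
    return stock_return - index_return


def label(delta, criterion_pct):
    """Assign "UP" when the change, in percentage points, exceeds the criterion.

    Treating the criterion as percentage points is an assumption of this sketch.
    """
    return "UP" if 100.0 * delta > criterion_pct else "DOWN"


# The example from the text: the stock rises 2% while the index rises 1%.
delta = normalized_change(100.0, 102.0, 2000.0, 2020.0)

# A hypothetical stronger move: the stock rises 3% while the index rises 1%.
strong = normalized_change(100.0, 103.0, 2000.0, 2020.0)
strong_label = label(strong, criterion_pct=1.0)

# A hypothetical decline: the stock falls 2% while the index is flat.
weak_label = label(normalized_change(100.0, 98.0, 2000.0, 2000.0), criterion_pct=1.0)
```

Because the rule uses a strict inequality, a normalized change exactly equal to the criterion falls into the “DOWN” class.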

4.2 Experiment settings

We set the parameters to the same values as those used in [46] to allow a direct comparison with that work. The full list of parameters for each algorithm is reported in Table 2.

Table 2
Parameters

Algorithm	Parameters	Values
SVM	Tradeoff	0.1
	Tradeoff	1
NBSVM	$\alpha$	1
	$\beta$	0.25

Tradeoff indicates the tradeoff between the training errors and the model complexity; $\alpha$ is the smoothing parameter, and $\beta$ is the interpolation parameter. We use ten-fold cross-validation for performance evaluation.
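For readers unfamiliar with the procedure, ten-fold cross-validation can be sketched as below. The shuffling, the random seed, and the function name `k_fold_indices` are assumptions; the text does not state whether the folds were shuffled or stratified. The sample count of 2,032 is the total number of documents in Table 1.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each sample appears in exactly one test fold. The shuffling and the
    seed are assumptions of this sketch.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(2032, k=10))
fold_sizes = [len(test) for _, test in splits]
```

A model is trained on each `train` list and evaluated on the matching `test` list, and the ten accuracies are then averaged.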

4.3 Prediction of price in percentage change after the 8-K report announcements

Based on the assumption that a financial report has its greatest impact on the company’s stock price on the day after its announcement, we applied all the algorithms described in Section 3.3. The results are shown in Table 3.

Table 3
Prediction accuracy

Method	Accuracy	Method	Accuracy
LR-Uni(Baseline)	54.65	SVM-Uni	63.41
RF-Uni	59.21	SVM-Bi	63.80
MNB-Uni	62.33	NBSVM-Uni	62.87
MNB-Bi	64.91	NBSVM-Bi	65.67
SPV model	68.54

We used two types of input variables, extracted with unigram and bigram models, and set the output variable as the stock price change one day after the announcement. ‘Uni’ and ‘Bi’ indicate that a prediction model uses unigram or bigram features, respectively. With unigram features, RF performs better than LR, and SVM outperforms MNB. NBSVM improves on SVM with bigram features (65.67 vs. 63.80), although not with unigram features (62.87 vs. 63.41). Models with bigram features generally perform better than those with unigram features. The SPV model achieves the highest accuracy (68.54), a relative improvement of 25.4% over the LR baseline (54.65). To compare the performance of the prediction models, we conducted an independent two-sample t-test assuming unknown standard deviations. The test was run for each of the 28 pairwise combinations of 8 different models. The null hypothesis, $H_{0}$ : $\mu_{A}=\mu_{B}$ , was rejected when the difference in the means of a pair was statistically significant. At a significance level of 0.05, the null hypothesis was rejected for 96.4% of the pairs; that is, all combinations except one rejected the null hypothesis.
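The two summary figures in this paragraph can be reproduced with simple arithmetic. The sketch below uses only the accuracies from Table 3 and the counts stated in the text; the variable names are introduced here for illustration.

```python
from itertools import combinations

# Accuracies taken from Table 3.
baseline_accuracy = 54.65   # LR-Uni
spv_accuracy = 68.54        # SPV model

# Improvement relative to the baseline, in percent.
relative_improvement = (spv_accuracy - baseline_accuracy) / baseline_accuracy * 100

# Number of unordered pairs among the 8 models compared by the t-test.
n_pairs = len(list(combinations(range(8), 2)))

# Share of pairs rejecting the null hypothesis when all but one pair do so.
rejected_share = (n_pairs - 1) / n_pairs * 100
```

The improvement is therefore relative: the absolute gain is 13.89 accuracy points, which is 25.4% of the baseline accuracy.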

4.4 Sentiment of the 8-K report announcements and visualization of stock prices

Unlike other areas, documents in financial markets are not independent of each other and affect stock prices for a certain period of time. Based on this property, we visualized the sentiment of 8-K reports over time and confirmed that the relationship between the sentiment and the stock price movements is plausible. The sentiment of the published documents and the stock price of Wells Fargo and Company from 2008 to 2009 are shown in Fig. 9. During this period, the stock price of Wells Fargo was volatile due to the financial crisis. Documents were grouped into two-month periods because of the limited number of documents.

Figure 9.

Sentiment of 8-K reports and stock price for Wells Fargo Company.

Panel (a) of Fig. 9 shows the sentiment of the documents in each two-month period, whereas panel (b) plots the 10-day moving average of the stock price. As the sentiment of the documents changes, the stock price moves in line with the sentiment trend. For example, in panel (a), the sentiment of the documents in March-April 2008 grew relatively more negative; this is reflected in the stock price, which fell moderately steeply from the end of April 2008, as plotted in panel (b). In addition, the sentiment of the documents from October 2008 to February 2009 is negative, and the stock price decreased steadily from December 2008. This illustration shows that the sentiment of the documents is closely related to the stock price and that, since the change in the direction of the stock price appears after the document announcement, the sentiment of documents can be used as a leading indicator. Furthermore, plots like Fig. 9 can serve as a qualitative evaluation of the model, providing practitioners with more interpretable insights.

4.5 Firm-by-Firm Visualization of the sentiment of the 8-K report announcements

We plot the sentiment trend of documents over time for the four selected companies as listed in Table 1. Furthermore, we compare the sentiment trend of entire documents with stock price movements side-by-side for each selected company from 2002 to 2012. The resulting visualization is shown in Fig. 10.

Figure 10.

The sentiment of entire documents (left) and stock price trend (right) for each company.

Panel (a) of Fig. 10 shows the sentiment of the documents belonging to each company considered in the analysis. The farther to the right a document lies, the more positive its sentiment. The center of the documents in the space is represented by a red dot. The plots in panel (b) represent the volatility in stock prices, measured as the change relative to the first business day in 2002. For example, a y-value of 30% means that the stock price rose 30% compared to the first business day in 2002. As shown in the plots in panel (a) of the figure, the sentiment trend is different for each company. There are a handful of documents with neutral or positive sentiment for Goldman Sachs, Wells Fargo and Company, and JP Morgan, while many documents of Citigroup Inc. exhibit relatively negative sentiment. Meanwhile, Goldman Sachs shows the steepest positive volatility of stock price in panel (b), while Citigroup displays negative volatility. Observations from Fig. 10 indicate that Citigroup had a number of issues that greatly reduced its stock price after 2002, and these issues appear to be associated with the negative sentiment conveyed by its financial reports. Fig. 11 plots the negative sentiment index against the negative stock price index.

Figure 11.

Correlation between negative sentiment index and negative stock price index.

In Fig. 11, the x-axis is the negative sentiment index and the y-axis is the negative stock price index. Both indices are calculated as follows:

$\textit{Negative Sentiment Index}=\frac{\sum_{i\in\mathbf{N}}{|d_{i}|}}{\sum_{j\in\mathbf{T}}{|d_{j}|}}$ (14)

where $\mathbf{N}$ is the set of negative sentiment documents, $\mathbf{T}$ is the set of total documents, and $d$ is the distance from 0.

$\textit{Negative Stock Price Index}=\sum_{i\in\mathbf{V}_{N}}{|v_{i}|}$ (15)

where $\mathbf{V}_{N}$ is the set of negative values of stock price volatility and $v$ is the volatility in stock prices as compared to the first business day in 2002. The correlation between the two indices is 0.9918, which means that the negative sentiment index and the negative stock price index are strongly positively correlated. Therefore, the polarity of the report sentiment is consistent with the movements of the stock price trend.
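Eqs (14) and (15) translate directly into code. In the sketch below, the input values are hypothetical, and the sign convention (a negative $d$ marks a document with negative sentiment) is an assumption consistent with the description of Fig. 10, where documents farther to the right are more positive.

```python
def negative_sentiment_index(distances):
    """Eq. (14): sum of |d_i| over negative documents divided by sum of |d_j| over all.

    `distances` are signed distances from 0; a negative value is assumed
    to indicate negative sentiment.
    """
    total = sum(abs(d) for d in distances)
    negative = sum(abs(d) for d in distances if d < 0)
    return negative / total


def negative_stock_price_index(changes):
    """Eq. (15): sum of |v_i| over the negative values of stock price volatility."""
    return sum(abs(v) for v in changes if v < 0)


# Hypothetical values for one company, not taken from the paper's data.
sentiment_index = negative_sentiment_index([-2.0, 1.0, -1.0, 4.0])   # 3 / 8
price_index = negative_stock_price_index([0.30, -0.10, -0.25])       # 0.10 + 0.25
```

Computing both indices for each company gives one point per company in a plot such as Fig. 11.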

5. Conclusion

This study predicts the direction of stock price changes using 8-K financial reports for four selected companies in the financial sector. We propose two methods for the prediction task: a model-based method and a visualization-based method. For the model-based stock price prediction, unigram features were extracted from the financial reports and applied to the LR and RF models, and the results were used as baselines. Unigram and bigram features were applied to the MNB, SVM, and NBSVM models for comparison. Finally, we predicted stock price changes using distributed representations. Experiment results show that the distributed representation produced the most accurate predictions, with a relative improvement in prediction accuracy of 25.4% over the baseline model. For the visualization-based method, we visualize the sentiment of a document by projecting the class information and the document onto a space of the same dimension. This visualization helps one easily understand the sentiment changes in the financial documents of a selected company, while providing a rich illustration of the relationship between sentiment trends and stock price movements. Visualization results confirmed that when the sentiment of documents was positive, the stock price showed an upward trend, and that when the sentiment was negative, the stock price fell. The mean values of sentiment were calculated to create a sentiment index over the entire set of documents. Visualization results showed that the different sentiment trends of the documents of the selected companies were well reflected in the stock price movements of the corresponding companies. For Goldman Sachs (GS), the sentiment of the documents as a whole was positive, and this trend was consistent with the stock price moving mostly in the positive direction. In contrast, for Citigroup (C), there were many documents with negative sentiment, and the stock price decreased continuously after 2002.

Our proposed model not only improves accuracy but also provides interpretability by producing visualizations that show the sentiment trends of the associated documents. It allows a trader to visualize the sentiment of the documents of a company of interest and to view the trend of the sentiment at a glance, which helps active traders and decision-making agents in the field make more data-informed decisions. In the financial market, traders are often required to make split-second decisions, and our proposed model can potentially provide them with opportunities to monitor new investment candidates objectively and to detect companies in crisis.

In this study, we analyzed only a few companies in the financial sector; in the future, we expect to gain more insights by analyzing a wider range of companies. In addition, by analyzing both public documents and private documents, such as SNS posts and online blogs, through distributed representation, it may be possible to extract sentiment words that are specific to the financial sector and to build a sentiment dictionary designed for the financial domain. Finally, since visualizing the sentiment of documents is a powerful tool, further research is needed on more effective visualization tools.

Acknowledgments

This work was supported by the BK21 Plus Program (Center for Sustainable and Innovative Industrial Systems, Department of Industrial Engineering, Seoul National University) funded by the Ministry of Education, Korea (No. 21A20130012638), the National Research Foundation (NRF) grant funded by the Korea government (MSIP) (No. 2011-0030814), and the Institute for Industrial Systems Innovation of SNU.

References

1. Abdi and Williams L.J., Principal component analysis, Wiley Interdisciplinary Reviews: Computational Statistics 2(4) (2010), 433–459.
2. Ahn J.J., Lee S.J., Oh K.J. and Kim T.Y., Intelligent forecasting for financial time series subject to structural changes, Intelligent Data Analysis 13(1) (2009), 151–163.
3. Ariyo A.A., Adewumi A.O. and Ayo C.K., Stock price prediction using the ARIMA model, In Computer Modelling and Simulation (UKSim), 2014 UKSim-AMSS 16th International Conference on, 2014, pp. 106–112.
4. Ballings, Van den Poel, Hespeels and Gryp, Evaluating multiple classifiers for stock price direction prediction, Expert Systems with Applications 42(20) (2015), 7046–7056.
5. Bollen, Mao and Zeng, Twitter mood predicts the stock market, Journal of Computational Science 2(1) (2011), 1–8.
6. Breiman, Random forests, Machine Learning 45(1) (2001), 5–32.
7. Chen C.-M. and Liu C.-Y., Personalized e-news monitoring agent system for tracking user-interested Chinese news events, Applied Intelligence 30(2) (2009), 121–141.
8. Collobert and Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, In Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 160–167.
9. Cortes and Vapnik, Support-vector networks, Machine Learning 20(3) (1995), 273–297.
10. Cox D.R., The regression analysis of binary sequences, Journal of the Royal Statistical Society. Series B (Methodological), 1958, pp. 215–242.
11. Dahl G.E., Adams R.P. and Larochelle, Training restricted Boltzmann machines on word observations, arXiv preprint arXiv:1202.5695, 2012.
12. Dougal, Engelberg, Garcia and Parsons C.A., Journalists and the stock market, The Review of Financial Studies 25(3) (2012), 639–679.
13. Druz, Wagner A.F. and Zeckhauser R.J., Tips and tells from managers: How analysts and the market read between the lines of conference calls, Technical report, National Bureau of Economic Research, 2015.
14. Fama E.F., Multiperiod consumption-investment decisions, The American Economic Review, 1970, pp. 163–174.
15. Feldman, Govindaraj, Livnat and Segal, Management’s tone change, post earnings announcement drift and accruals, Review of Accounting Studies 15(4) (2010), 915–953.
16. Fisher R.A., The use of multiple measurements in taxonomic problems, Annals of Human Genetics 7(2) (1936), 179–188.
17. Garcia, Sentiment during recessions, The Journal of Finance 68(3) (2013), 1267–1300.
18. Ghosn and Bengio, Multi-task learning for stock selection, In Advances in Neural Information Processing Systems, 1997, pp. 946–952.
19. Heston S.L. and Sinha N.R., News vs. sentiment: Predicting stock returns from news stories, Financial Analysts Journal 73(3) (2017), 1–17.
20. Hu and Liu, Mining and summarizing customer reviews, In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, pp. 168–177.
21. Huang A.H., Zang A.Y. and Zheng, Evidence on the information content of text in analyst reports, The Accounting Review 89(6) (2014), 2151–2180.
22. Jegadeesh and Wu, Word power: A new approach for content analysis, Journal of Financial Economics 110(3) (2013), 712–729.
23. Kumar, Agrawal and Joshi S.D., Multiscale rough set data analysis with application to stock performance modeling, Intelligent Data Analysis 8(2) (2004), 197–209.
24. Le and Mikolov, Distributed representations of sentences and documents, In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1188–1196.
25. Lee, Surdeanu, MacCartney and Jurafsky, On the importance of text analysis for stock price prediction, In LREC, 2014, pp. 1170–1175.
26. Lo A.W. and MacKinlay A.C., Stock market prices do not follow random walks: Evidence from a simple specification test, The Review of Financial Studies 1(1) (1988), 41–66.
27. Maas A.L., Daly R.E., Pham P.T., Huang, Ng A.Y. and Potts, Learning word vectors for sentiment analysis, In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, Association for Computational Linguistics, 2011, pp. 142–150.
28. Maaten L.V.D. and Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008), 2579–2605.
29. Martineau and Finin, Delta TFIDF: An improved feature space for sentiment analysis, In Proceedings of ICWSM 9 (2009), 106.
30. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, In Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
31. Ming, Wong, Liu and Chiang, Stock market prediction from WSJ: Text mining via sparse matrix factorization, In Data Mining (ICDM), 2014 IEEE International Conference on, IEEE, 2014, pp. 430–439.
32. Nakagawa, Inui and Kurohashi, Dependency tree-based sentiment classification using CRFs with hidden variables, In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 786–794.
33. Pang and Lee, A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2004, p. 271.
34. Pang and Lee, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2005, pp. 115–124.
35. Park E.L., Ph.D dissertation: Supervised feature representation for document classification, Seoul National University, 2016, pp. 1–160.
36. Rennie J.D., Shih, Teevan and Karger D.R., Tackling the poor assumptions of naive Bayes text classifiers, In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 616–623.
37. Rumelhart D.E., Hinton G.E. and Williams R.J., Learning internal representations by error propagation, Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
38. Schumaker R.P. and Chen, A discrete stock price prediction engine based on financial news, Computer 43(1) (2010).
39. Sitte and Sitte, Neural networks approach to the random walk dilemma of financial time series, Applied Intelligence 16(3) (2002), 163–171.
40. Socher, Pennington, Huang E.H., Ng A.Y. and Manning C.D., Semi-supervised recursive autoencoders for predicting sentiment distributions, In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 151–161.
41. Socher, Perelygin, Wu J.Y., Chuang, Manning C.D., Ng A.Y., Potts et al., Recursive deep models for semantic compositionality over a sentiment treebank, In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013, pp. 1631–1642.
42. Sun, Lachanski and Fabozzi F.J., Trade the tweet: Social media text mining and sparse matrix factorization for stock market prediction, International Review of Financial Analysis 48 (2016), 272–281.
43. Tay F.E.H. and Cao L.J., Improved financial time series forecasting by combining support vector machines with self-organizing feature map, Intelligent Data Analysis 5(4) (2001), 339–354.
44. Turney P.D. and Pantel, From frequency to meaning: Vector space models of semantics, Journal of Artificial Intelligence Research 37 (2010), 141–188.
45. Walker S.H. and Duncan D.B., Estimation of the probability of an event as a function of several independent variables, Biometrika 54(1-2) (1967), 167–179.
46. Wang and Manning C.D., Baselines and bigrams: Simple, good sentiment and topic classification, In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, volume 2, 2012, pp. 90–94.
47. Wiebe, Wilson and Cardie, Annotating expressions of opinions and emotions in language, Language Resources and Evaluation 39(2) (2005), 165–210.
