Abstract
Stock price prediction has been an attractive research domain for both investors and computer scientists for more than a decade. Predicting the reaction of the stock market, especially from released financial news articles and published stock prices, still poses a great challenge to researchers because the prediction accuracy is relatively low. For prediction purposes, linear regression is a popular method. Statistical metrics, such as document frequency (DF), term frequency-inverse document frequency (TF-IDF) and information gain (IG), are used for feature selection to extract the most expressive features and reduce the high dimensionality of the data. However, the effectiveness of the available metrics in identifying important financial feature representations that have dependable and strong relations with the stock price has not been explored. The objectives of this study are (i) to investigate the performance of five statistical metrics, namely, DF, TF-IDF, IG, Chi-square Statistics (Chi-Sqr) and occurrence, in identifying important features that can represent the news and have a strong relationship with the stock price; (ii) to introduce feedback variables, namely, the prediction accuracy (PA), directional accuracy (DA) and closeness accuracy (CA), to capture the interaction between the released news and the published stock prices; and (iii) to introduce a prediction model that integrates features from financial news and a stock price value series based on a 20-minute time lag using linear regression. The experiment used the ELR-BoW method to build 330 datasets, with five statistical metrics selecting feature sizes of 50, 100, 150, 200, 250, 300, 400, 500, 600, 700 and 800. The performance of ELR-BoW is assessed on three parameters, namely, PA, DA and CA, and is compared using Naïve Bayes (NB) as the benchmark approach and the Support Vector Machine (SVM).
The proposed ELR-BoW-SVM obtained a higher accuracy than ELR-BoW-NB; the best feedback measure is PA, which has an F-measure value of 0.842. In addition, the best number of features is 300, selected using the document frequency (DF) statistical metric. This study demonstrates that the identification of the top feature representations for financial news is highly promising for automatic news article processing in stock prediction.
Keywords
Introduction
Stock market prediction continually draws the attention of researchers and financial investors because mastering the nuances of the market promises the ability to gain surplus profits. The rapid growth of online textual data such as financial news poses a challenge in extracting valuable information and determining its relationship to the stock market. In this respect, the main limitation of stock prediction models lies in transferring unstructured data to a structured format in order to model the stock market dynamics accurately [16].
Investors are interested in obtaining the highest profits from the market; identifying the future trend of a stock is therefore important, and this is termed forecasting the stock price. Predictions of stock prices can be performed using structured data (i.e., stock price records) and unstructured data (i.e., financial news with regard to the stocks). The analysis of structured data is categorized into two types, namely, fundamental analysis and technical analysis. Fundamental analysis evaluates a security by examining the related economic, financial and other qualitative and quantitative factors, whereas technical analysis utilizes statistics on the stock market, such as past prices and volumes, which are modeled using mathematical tools to predict trends in future values [10, 27].
On the other hand, analysis methods that use unstructured data have gained more attention in stock price prediction. Among the popular methods is the text mining approach, which aims to explore and exploit the relationship between the news articles and the time-stamped stock prices [28]. Several studies have demonstrated the influence of news articles on the stock market price, where there is a strong relationship between the time of the stock price fluctuation and the time of the released news articles. The information provided in the news articles includes a number of terms that have a direct effect on the stock price [12]. Most of the previous studies extract a set of features, such as the top financial terms published in the news, and use machine learning techniques in the prediction model [1, 34, 42]. These studies assign weights to these features to predict the stock market movements. However, these methods have obtained very weak stock price prediction performance, mainly because of the difficulty of capturing the relationships between the structured and unstructured data, which indicate the stock fluctuation behaviors. Moreover, stock market prediction based on time series data alone might not be sufficient, owing to the huge number of political, economic and psychological factors that affect stock market movements, which are inherently noisy, non-stationary and non-deterministic [8].
According to Nassirtoussi et al. [28], there is a strong correlation between the news articles and stock price.
Previous studies have confirmed that news articles have both positive and negative impacts on stock price movements, and that these articles affect the measurement of return volatility [9, 17]. The strong efficient market hypothesis (EMH) states that the stock market is influenced by all kinds of information. This hypothesis has motivated us to investigate all the possible information that has an impact on stock price movements [4]. Therefore, it is important to process all the available information related to the stock market to extract the most useful time series patterns and increase the performance of stock price prediction [21].
Recently, the combination of structured and unstructured data is assumed to provide better stock price prediction by combining the features that are extracted from both data modalities. Several techniques have been investigated to build more representative features for the stock market fluctuations. The bag-of-words technique is implemented [9, 30, 38] to denote the binary representation of terms, but the frequencies of these features are ignored [9]. Additionally, different techniques have been investigated, for example, noun phrases and named entities [34] are implemented to extract the occurrences of the named entities.
Other studies have explored the impact of statistical metrics on the prediction accuracy, such as the TF-IDF method, which captures the distribution of features inside the documents [13, 16, 30]. An attempt to select the features using the mutual information (MI), balanced mutual information (BMI) and chi-sqr to predict the directions of the stock prices has been made [14]. However, the existing statistical-based approaches still have a weak ability to capture the relationship between the news articles and the stock prices, to model all of the relative movement and fluctuations of the stocks accurately [42]. Moreover, there is no available research that has investigated the best statistical metrics to decide on the most representative features for the prediction modeling of a fluctuating stock price. A short-timeline-based prediction has an added value compared with the existing methods, which have commonly depended on the intra-day rate [30].
A few studies capture the impact of correlation features to explore further relationships between the unstructured data and the stock price [11, 13, 34]. However, these methods have performed very weakly in capturing correlation features, mainly for two reasons: (i) they ignore the temporal effect of the stock price over a short timeline, and (ii) existing techniques are limited in representing expressive features that affect the stock price movements.
Existing techniques are limited in their ability to extract correlation features that affect the stock price from a staggering amount of textual data. In this study, we therefore intend to develop an algorithm for feature representation using time series data for short-timeline prediction that implements a technique to discover series correlation features based on temporal events to predict the stock market movements. This study investigates the performance of statistical metrics and introduces feedback variables to build an Enhanced Linear Regression Based Bag-of-Word Model for Feature Representation (ELR-BoW) algorithm. ELR-BoW utilizes the relationship between the news articles and stock prices based on bag-of-words for short-timeline stock prediction. The ELR-BoW algorithm is based on a heuristic that uses statistical measures to speed up the search for the best solution in the search space [20, 22]. The heuristic search aims to discover series of correlations between the features for short-timeline prediction [44]. The contributions of the study are three-fold: (i) identifying the best feature extraction model using five statistical measures, namely, DF, TF-IDF, IG, Chi-square Statistics (Chi-Sqr) and occurrences; (ii) introducing feedback variables, namely, closeness, directional movement and prediction, as indicative measures for the interaction between the financial news and stock prices; and (iii) proposing stock price predictions based on linear regression. The S&P500 index close prices dataset is used.
Our study shows that feature representation using the ELR-BoW algorithm is able to discover the relationship and represent the direct effect of news articles on the stock price. The implementation of the proposed feedback measure (PA) pushed the F-measure value up to 0.842 when the features are incorporated with SVM. The analysis of different feature sizes and different feature selection methods demonstrates that the best feature size is 300 when using the DF selection method.
This paper is organized as follows. Section 2 introduces the background of the study. Section 3 presents the enhanced linear regression based bag-of-word model for feature representation (ELR-BoW), which is based on short-timeline stock information for stock price prediction. Section 4 presents the effectiveness evaluation. Section 5 presents the experimental results and the findings of the paper. Section 6 provides the conclusions of the paper.
Related studies
Financial time series facilitate the effective extraction of positive patterns of the stock market and the prediction of its movements. The topic of stock market prediction continually draws the attention of researchers and financial investors, since whoever is capable of mastering the nuances of the market can beat it and gain surplus profits. Generally, investors are unaware of their stocks' behavior; hence, they face difficulty in trading stocks. Investors often fail to gain more profit from trading because they are uncertain about the nature of the stock market and unsure of which stock to buy or sell. It is therefore crucial for them to be able to predict the future behaviour of stock prices in order to gain more insight for trading.
This has further encouraged academic researchers and business practitioners to develop more time series prediction models by implementing artificial intelligence (AI) techniques, such as an artificial neural network (ANN), that are extensively used to accurately forecast the stock index and direction of its change [19, 29]. Meanwhile, excellent performance from the Support Vector Machine applications has been obtained in investigating the issue of forecasting the stock index futures market [7, 8]. However, the main challenge in stock price prediction is the price fluctuations [6, 16].
According to the strong efficient market hypothesis (EMH), stock market price fluctuations reflect all the information available about the stock market [24]. Furthermore, the EMH elucidates a link between the published information and the market price movements. Investors cannot guarantee that they will always achieve consistent returns even if they have prior knowledge of the stock information before the investment [5]. The enormous amount of financial news generated from different sources has a direct effect on the market movement [39]. Therefore, understanding the news content and combining it with the stock price data can contribute to increasing the accuracy of the stock price prediction model.
Handling textual data remains a sophisticated task owing to the large amount of information and the variety of available sources. To analyze this information and uncover the relationship, Natural Language Processing (NLP) techniques need to be used to identify the most significant terms that might cause changes in security prices [35]. The analysis of textual information thus offers a good opportunity to determine whether a news article contains good or bad news and to attempt to predict the future direction of the stock price.
The idea is to trade (buy/sell) the stock when good or bad information arrives. Unexpected good and bad news always occurs in the stock markets, and these cases make the stock price unpredictable owing to the high volatility [37]. News articles contain trustworthy information that moves stock prices. According to Zhang and Skiena [43], news articles are considered a reliable source that can be as important as the commodity. Therefore, text mining pre-processing is important for analyzing the text information and extracting the most significant features that have an impact on stock price movements [32]. Although several studies address stock market movements, investors are still interested in knowing more about them.
One of the most challenging aspects is predicting stock market movements from textual data, owing to the difficulty of capturing correlation features between the stock price and news articles [6, 16]. A few studies attempt to address the problem by proposing systems that capture the impact of correlation features based on bag-of-words (BoW), exploring further relationships between the unstructured data and the stock price to predict the stock price in specific periods [11, 13, 28, 34]. However, such prediction models struggle to perform accurately because of sudden changes in the stock market and huge per-minute price fluctuations [34]. The investigated approaches are still at an early stage, and there is a need to examine more deeply the inclusion of the extracted features with the stock price to demonstrate the impact of price fluctuation for short-timeline prediction [28].
Studies that model the relationship between the released news and the market movement have grown over the years. These include the investigation of representative features from the news and from the stock data as well as machine learning algorithms such as Support Vector Machines (SVM) [15, 28, 34], Naïve Bayes [14, 41] and decision tree [30]. To represent the relationship between the news articles and the stock prices, there are several studies that map the news with the stock price time stamp to predict the stock price for specific periods. Table 1 shows a comparison of the pre-processing steps and machine learning methods used in various studies on modeling stock prices, which span from 1998 to 2015.
Pre-processing steps and machine learning methods for stock price modeling
The pre-processing steps are divided into feature selection, dimensionality reduction, data representation technique and timeline used. The bag-of-words technique is the most commonly used technique for feature selection, which is mainly due to its advantage of retaining the occurrence multiplicity [13]. For dimensionality reduction purposes, several methods have been used, such as filtering according to certain occurrence thresholds, expert-based keyword determinations, and scoring-based methods. At the same time, the binary method and the term frequency-inverse document frequency (TFIDF) method are the most used representation techniques because they indicate the weight of the selected terms in representing the documents. For the purpose of correlating the financial documents with the released stock price, several time granularities have been used. The time-line for the 20-minute stock close value has achieved a remarkable explanation with regard to the news impact on the stock market [31].
The developed machine learning-based prediction models can be explained according to the forecasting type and the classifiers. Various forecasting types have been applied, such as binary class, multiple class and discrete. However, only the discrete type forecasting through a regression-based technique can allow numerically based estimation of a stock price [23]. Several classifier types have also been explored, and the SVM is the most popular [3].
However, none of the existing approaches covered in the literature have provided a method for feedback measurement to capture the interaction between the fluctuating stock price and the released news. These approaches have a low prediction accuracy because, by depending on the latest stock price only, they ignore the stock fluctuation. The relationship between the stock price and the related messages in the released financial news is also vague. Linear regression is a machine learning-based approach that has the capability of capturing the relations from the financial news. The linear regression approach requires the identification of strong features that can represent the direction of the stock price [26].
Although the TFIDF is the widely used representation approach, the performance of other statistically based feature representation methods on improving stock predictions is unknown. Therefore, this research fills this gap and addresses the investigation of effective feature representations through statistical metrics-based evaluation and through introducing feedback variables into the linear regression models, toward achieving high stock prediction accuracy.
The primary goal of this paper is to introduce an enhancement to the conventional bag-of-words representation that is able to capture the temporal events that affect the stock price in time-series data. The proposed model is based on the integration of statistical measurements with linear regression for short-timeline prediction (within a 20-minute context) from published financial news and the stock price. The proposed model maps and represents the most relevant features, which increases the classification accuracy. Figure 1 presents the general architecture of the ELR-BoW implementation for discovering the temporal effect from each feature vector.
General architecture of ELR-BoW.
The general architecture of the model building is composed of three phases. The first phase, bag-of-words representation, uses 5000 news articles to build a lexicon and applies the pre-processing steps; the second phase is stock price pre-processing; and the third phase is the feature selection technique. The next subsections describe each phase in detail. The ELR-BoW algorithm is designed to tackle the limitations of feature representation using time series data for short-timeline prediction. It aims to discover series correlation features based on temporal events to predict the stock market movements. ELR-BoW implements different statistical metrics and introduces feedback variables to build an effective linear regression model, which utilizes the relationship between the news articles and stock prices for short-timeline stock prediction.
The dataset consists of a total of 46674 news articles and stock price information on the S&P500, gathered from online financial news sources such as noodle and Reuters. It is stored in three tables, namely, the Ticker, Quote and News tables; only the Quote and News tables are used in this experiment. The Quote table records the stock information, with the attributes quote symbol (the stock name), quote time, quote close, quote high, quote low, quote open and quote volume. Only the quote close data are used in this experiment.
We implemented the text mining pre-processing steps, such as tokenization, stemming and stop-word removal, to extract patterns from the structured or unstructured data. The main aim of these steps is to clean the text data by eliminating irrelevant tokens (such as stop words, conjunctions and prepositions) to reduce the dimensionality of the term space [36]. Text pre-processing removes the tokens that do not carry any significant meaning; such tokens are noisy and irrelevant, and are not treated as features in text mining applications [2].
This research utilizes both the structured data (released stock price) and the unstructured data (financial corpus) for building the prediction model. The financial news corpus consists of 8500 articles, which were collected between 6th November 2013 and 25th March 2014. A total of 5000 news articles are used for training and building a lexicon, while 3500 news articles are used for evaluation purposes.
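The pre-processing steps named above can be illustrated with a minimal sketch. The stop-word list, the suffix-stripping rule and the function names `crude_stem` and `preprocess` are assumptions made for illustration; the paper does not state which tokenizer, stop list or stemmer was used, and a real pipeline would likely rely on an established stemmer.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one used in the study.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to",
              "for", "is", "are"}


def crude_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


# Hypothetical headline, for illustration only.
terms = preprocess("The shares of the company are rising in early trading")
print(terms)
```

The reduced token list is what a lexicon would be built from; dropping stop words and merging inflected forms is what shrinks the term space.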
The main process in the first phase is to represent the news articles (unstructured data) using the bag-of-words technique.
Enhanced BoW (eBoW) representation algorithm.
For the evaluation documents, the feature sets are formed in binary format (0, 1) to represent the absence or presence of terms in each document [18]. The binary representation of the words can be expressed as
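A minimal sketch of such a binary encoding follows: each document is mapped to a 0/1 vector over the lexicon, with 1 marking a term that is present. The lexicon contents, the whitespace tokenizer and the function name `binary_vector` are illustrative assumptions rather than the paper's implementation.

```python
def binary_vector(document, lexicon):
    """Return a 0/1 list marking which lexicon terms occur in the document."""
    # Assumption: lowercase whitespace splitting stands in for the full
    # pre-processing pipeline (tokenization, stemming, stop-word removal).
    tokens = set(document.lower().split())
    return [1 if term in tokens else 0 for term in lexicon]


# Hypothetical lexicon and headline, for illustration only.
lexicon = ["profit", "loss", "merger", "dividend"]
vector = binary_vector("Quarterly profit rises after merger", lexicon)
print(vector)  # [1, 0, 1, 0]
```

Because only presence is recorded, a term mentioned ten times and a term mentioned once yield the same feature value.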
Statistical measures for BoW representation.
The first phase utilizes the bag-of-words technique to list all of the words in the financial news corpus. Then, for each of the words, their scores according to the five statistical metrics, namely, TFIDF, Occurrence, Chi-Square, IG, and DF, are calculated. Next, the words stored in each vector are determined based on the top score. Figure 3 shows the steps in the first phase.
The formulas for the calculation of the statistical metrics are as follows:
Term frequency-inverse document frequency (TFIDF): TFIDF evaluates the significance of a single word inside a document
where
Occurrence: Occurrence measures the number of times a word occurs across all of the documents, to indicate how relevant the word is to the domain.
Chi-square: The Chi-square value is a statistical metric that is used to compare the independence between two random variables using the following equation:
where
Information gain (IG): This metric is used to measure the expected reduction in entropy by assuming the presence and absence of a term in the document. The expected reduction in entropy is caused by partitioning the examples according to a given attribute.
where
Document frequency (DF): The document frequency depends on a very simple idea, to calculate the number of appearances of a single term in the existing documents, which is aimed at measuring how often the term is used.
where
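The five metrics can be sketched in code using their common textbook forms. This is an illustration under stated assumptions, not the paper's exact formulation: the variable names, the binary class labels, the raw-count TF with `log(N / df)` IDF, and the 2x2 contingency form of the chi-square statistic are all choices made here.

```python
import math
from collections import Counter


def entropy(pos, neg):
    """Binary entropy (in bits) of a class split given by two counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


def score_terms(docs, labels):
    """Score every term with DF, occurrence, TF-IDF, chi-square and IG.

    docs   : list of token lists (already pre-processed)
    labels : list of 0/1 class labels, one per document (assumed binary)
    """
    n = len(docs)
    n_pos = sum(labels)
    counts = [Counter(doc) for doc in docs]
    scores = {}
    for term in set().union(*docs):
        # Document frequency: number of documents containing the term.
        df = sum(1 for c in counts if term in c)
        # Occurrence: total count of the term over the whole corpus.
        occ = sum(c[term] for c in counts)
        # TF-IDF summed over documents, with tf = raw count and
        # idf = log(N / df); one common variant among several.
        tfidf = occ * math.log(n / df)
        # 2x2 contingency table of term presence against class.
        n11 = sum(1 for c, y in zip(counts, labels) if term in c and y == 1)
        n10 = df - n11            # term present, class 0
        n01 = n_pos - n11         # term absent, class 1
        n00 = n - df - n01        # term absent, class 0
        denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
        chi2 = n * (n11 * n00 - n10 * n01) ** 2 / denom if denom else 0.0
        # Information gain: class entropy minus the entropy remaining
        # after splitting the corpus on presence/absence of the term.
        ig = entropy(n_pos, n - n_pos)
        ig -= (df / n) * entropy(n11, n10)
        ig -= ((n - df) / n) * entropy(n01, n00)
        scores[term] = {"df": df, "occ": occ, "tfidf": tfidf,
                        "chi2": chi2, "ig": ig}
    return scores


# Hypothetical four-document corpus with up (1) / down (0) labels.
docs = [["profit", "rise"], ["profit", "gain"],
        ["loss", "fall"], ["loss", "profit"]]
labels = [1, 1, 0, 0]
scores = score_terms(docs, labels)
print(scores["loss"])
```

In this toy corpus, "loss" appears only in class-0 documents and in all of them, so it separates the classes perfectly and receives the maximum information gain of 1 bit.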
In the second phase, a time series for stock price pre-processing (structured data) is conducted and incorporated into the feature vectors that were built in the first phase. The preparation of stock price information and the feedback parameters aims to compose the prediction model’s features. The details of the process in each phase are described. Figure 4 shows the algorithm for the stock price pre-processing in phase 2.
Stock price representation algorithm.
The algorithm mechanism selects the document symbol
news date: specific intervals from Nov 6, 2013, to Mar 25, 2014 (35 days), to build data that is comparable to data in previous studies [12, 25, 34].
news time: from 9:00 am to 4:00 pm. These intervals are extremely important to restrict the news articles that highly affect the stock price to market hours, to reduce the impact of overnight news and allow for market prediction.
20-minute lag-time: to remove redundant news and ensure that only news that appears within 20 minutes is retained [28, 34].
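The filters above can be sketched as follows, under one possible reading: an article is kept if it falls inside the date range and market hours, and it is then paired with the close quoted at publication time and the close quoted 20 minutes later. The function names, the tuple layout of the quotes and the rule of taking the first quote at or after a timestamp are assumptions made for illustration; the paper's own algorithm (Figure 4) is not reproduced in this text.

```python
from bisect import bisect_left
from datetime import datetime, time, timedelta

MARKET_OPEN, MARKET_CLOSE = time(9, 0), time(16, 0)
START, END = datetime(2013, 11, 6), datetime(2014, 3, 25, 23, 59)
LAG = timedelta(minutes=20)


def in_window(published):
    """True if the article falls inside the date range and market hours."""
    return (START <= published <= END
            and MARKET_OPEN <= published.time() <= MARKET_CLOSE)


def close_at_or_after(quotes, moment):
    """Return the first close quoted at or after `moment`, else None.

    quotes: list of (timestamp, close) tuples sorted by timestamp.
    """
    times = [t for t, _ in quotes]
    i = bisect_left(times, moment)
    return quotes[i][1] if i < len(quotes) else None


def align(news_times, quotes):
    """Pair each retained article with the close at t and at t + 20 min."""
    rows = []
    for published in news_times:
        if not in_window(published):
            continue
        now = close_at_or_after(quotes, published)
        later = close_at_or_after(quotes, published + LAG)
        # Articles without both quotes cannot be labeled and are dropped.
        if now is not None and later is not None:
            rows.append((published, now, later))
    return rows


# Hypothetical quotes and publication times, for illustration only.
quotes = [(datetime(2013, 11, 6, 10, 0), 100.0),
          (datetime(2013, 11, 6, 10, 20), 101.5),
          (datetime(2013, 11, 6, 10, 40), 101.0)]
news = [datetime(2013, 11, 6, 10, 0),    # kept
        datetime(2013, 11, 6, 20, 30),   # outside market hours
        datetime(2013, 11, 6, 10, 30)]   # no quote 20 minutes later
rows = align(news, quotes)
print(rows)
```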
Stock price process for mapping between news articles and stock price.
Close values for a stock price in time series.
At the end of the filtering process, 1887 financial news articles remain. Figure 6 shows the close values for the stock price at 20-minute intervals. For each document, three stock price values are obtained; these values are the stock close time
where
For the purpose of feedback measures that enable the relation between the predicted and actual close values to be captured, three parameters are used. These parameters evaluate the strength of the linear regression prediction technique, namely, the Prediction Accuracy (PA), Directional Accuracy (DA), and Closeness Accuracy (CA).
Prediction Accuracy, PA: measures how close the
Directional Accuracy, DA: measures how close the
Closeness Accuracy, CA: measures how close the
The values of the feedback parameters are incorporated into the earlier prepared features and are used as prepared data for building the NB and SVM classifiers. The next section presents the heuristic feature selection technique to discover temporal information.
In order to select the most useful feature set
Heuristic feature selection technique.
Figure 7 shows the pseudo code for the heuristic-based feature selection technique, which describes the procedure for selecting the best set of features for temporal data. The process begins by taking the top feature set of unique words as input, sorting the features according to their scores, and inserting a set of best feature numbers.
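The ranking step described here can be sketched as below. This covers only the sort-and-truncate part; the heuristic search over the feature space in Figure 7 is not reproduced in this text and is not modeled. The function name `top_k_features`, the alphabetical tie-break and the example scores are assumptions made for illustration.

```python
def top_k_features(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical scores from one statistical metric.
scores = {"profit": 0.9, "loss": 0.7, "merger": 0.7, "dividend": 0.2}

# One feature set per candidate size; the study uses sizes from 50 to 800.
feature_sets = {k: top_k_features(scores, k) for k in (1, 2, 3)}
print(feature_sets)
```

A deterministic tie-break keeps the selected feature sets reproducible when several terms share the same score.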
The process starts with identifying a set of features in the search space. At the first step, a feature number is selected according to feature vectors
The classification effectiveness can be evaluated using three evaluation measures, namely, accuracy, F-measure and weighted accuracy. These measures are used to evaluate the effectiveness of the binary classification in document categorization. The classification process labels the data as one of two categories, positive or negative, and the outcome is represented in a confusion matrix for the two-class problem.
The confusion matrix consists of four categories: false positive (FP) refers to negative instances that are incorrectly labelled as positive, true positive (TP) refers to positive instances that are correctly labelled as positive, true negative (TN) refers to negative instances that are correctly labelled as negative, and false negative (FN) refers to positive instances that are incorrectly labelled as negative. These four categories are used to calculate the precision, recall, and F-measure.
Average of precision for the d class label:
Average of recall for d class label:
Average of F-measure for d class variables:
Weighted accuracy for the F-measure values of classes x and y:
Accuracy for positive predictive value:
The evaluation is performed by measuring the weighted average F-measure values for the classified classes. The macro F-measure score is computed by calculating the total performance for all categories. Then, the total score is used to calculate the performance for each category in the table.
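As a sketch of these measures, the code below uses the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure as their harmonic mean. Weighting each class's F-measure by its share of the instances is an assumption about how the weighted average is formed; the function names and example labels are hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def evaluate(actual, predicted, classes=("up", "down")):
    """Accuracy, per-class (P, R, F) and support-weighted F-measure."""
    pairs = list(zip(actual, predicted))
    per_class = {}
    for cls in classes:
        tp = sum(a == cls and p == cls for a, p in pairs)
        fp = sum(a != cls and p == cls for a, p in pairs)
        fn = sum(a == cls and p != cls for a, p in pairs)
        per_class[cls] = precision_recall_f(tp, fp, fn)
    n = len(actual)
    accuracy = sum(a == p for a, p in pairs) / n
    # Assumption: each class is weighted by its share of the true labels.
    weighted_f = sum(per_class[c][2] * actual.count(c) / n for c in classes)
    return accuracy, per_class, weighted_f


# Hypothetical labels for five news articles.
actual = ["up", "up", "up", "down", "down"]
predicted = ["up", "up", "down", "down", "up"]
accuracy, per_class, weighted_f = evaluate(actual, predicted)
print(accuracy, weighted_f)
```

Here the "up" class has F = 2/3 with three instances and the "down" class has F = 1/2 with two, so the weighted F-measure is (3/5)(2/3) + (2/5)(1/2) = 0.6.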
As described before, the main aim of this study is to enhance the bag-of-words representation mechanism to capture the effect of news article features on the stock price. The bag-of-words representations are composed in time-series form; this kind of representation allows the temporal effect to be predicted within a 20-minute timeline. Past studies indicate that incorporating bag-of-words with the temporal effect of the stock price leads to the discovery of more patterns in the stock price [40]. This makes logical sense: a proper representation of the document in a time-series format together with the stock price allows a model to provide more accurate predictions.
Details of the PA datasets space numbers with different feature sizes
We performed experiments on our dataset with a bag-of-words representation, which contained 1887 news articles. To compare the performances of the different feature selection methods (Chi-Square, DF, TF-IDF, IG and occurrence), we allowed each feature selection method to select the most relevant 50, 100, 150, 200, 250, 300, 400, 500, 600, 700 and 800 features from the 1887 articles and to represent each news article in the feature vector with respect to the number of selected features. For each feature vector, a binary representation is used (0, 1); these values indicate the absence or presence of the features inside each news article. We extracted 165 datasets (Table 2) to cover all of the feature sizes, and then we labeled the data in two directions, namely, up and down, using three different class labels, namely, PA, DA and CA, as explained in the previous section.
In order to distinguish the datasets, a unique number has been assigned to each data space (DS). Table 2 shows the DS numbers used for each of the PA, DA and CA feedback measures. The DS numbers indicate the feature size and the statistical measure, respectively.
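The dataset count can be checked by enumerating the combinations, taking the eleven feature sizes listed with the results tables, the five metrics and the three class labels. The variable names are illustrative; the multiplication by two classifiers is only one reading that is consistent with the figure of 330 quoted in the abstract.

```python
from itertools import product

feature_sizes = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 800]
metrics = ["chi-sqr", "df", "tf-idf", "ig", "occ"]
class_labels = ["PA", "DA", "CA"]
classifiers = ["NB", "SVM"]

# One dataset per (class label, metric, feature size) combination.
datasets = list(product(class_labels, metrics, feature_sizes))
print(len(datasets))  # 165

# Assumption: each dataset is evaluated once per classifier.
runs = list(product(classifiers, datasets))
print(len(runs))  # 330
```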
An experiment has been conducted to observe the effectiveness of the feedback measures and statistical metrics as the representative features for the stock price modeling. This evaluation testifies to the ability of the features (which consist of the news ID, news publication time, the top selected expressive words determined by each of the statistical metrics,
By building the techniques investigated in the literature review on our own dataset, we can justify and benchmark our approach. The classification accuracy is used to assess the performance of the stock price feedback using Naïve Bayes [14, 41] and SVM [15, 28, 34]. This allows us to verify that the improvements in our results are feasible based on the stock market feedback.
In our experiment, we measured the performance of the stock price using two classification methods, the NB and SVM, using three class labels PA, DA and CA, and we compared the performance of the proposed class label prediction accuracy (PA) against the closeness accuracy (CA) [34, 38] and the direction accuracy (DA) [12, 34]. We used the number of correctly classified instances and the accuracy for the whole test set. In addition, to evaluate the best classification method, we used the F-measure value for each direction (up, down) and the weighted accuracy for PA-SVM against PA-NB. Finally, we conducted the experiment using different feature sizes. We calculated the average and standard deviation values for PA-SVM to identify the best feature size and the best statistical metrics.
Classification accuracy for NB using chi-sqr
Classification accuracy for NB using DF
Classification accuracy for NB using TF-IDF
Classification accuracy for NB using IG
Classification accuracy for NB using occurrences
Classification accuracy for SVM using chi-sqr
Classification accuracy for SVM using DF
Classification accuracy for SVM using TF-IDF
Classification accuracy for SVM using IG
Classification accuracy for SVM using OCC
In this experiment, Naïve Bayes (NB) and support vector machines (SVM) are used with feature sets of different sizes, namely, 50, 100, 150, 200, 250, 300, 400, 500, 600, 700 and 800. Five feature selection metrics, namely, chi-sqr, df, tf-idf, ig and occ, were applied at each of these feature set sizes. Tables 3–7 show the results for the NB classifier, while Tables 8–12 show the results for the SVM classifier. To measure the performance of the two classification methods, we focused on the percentage accuracy for the test set and the number of correctly classified instances. The performance measurement assesses the ability of the ELR-BoW algorithm, using the feedback measurements PA, DA and CA, to identify the classifier with the best accuracy and the highest number of correctly classified instances for the different feature set sizes.
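To illustrate how one metric and one feature size define a feature set, the sketch below ranks terms by document frequency (DF) and keeps the top k as bag-of-words features. It is a simplified illustration under our own assumptions (pre-tokenized documents, alphabetical tie-breaking, hypothetical function names), not the ELR-BoW implementation, and the three toy documents are invented for the example.

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # set(): each document counts a term at most once
    return df


def select_top_features(tokenized_docs, k):
    """Return the k terms with the highest DF (ties broken alphabetically)."""
    df = document_frequency(tokenized_docs)
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def to_bow_vector(tokens, features):
    """Represent one document as term counts over the selected features."""
    counts = Counter(tokens)
    return [counts[feature] for feature in features]


if __name__ == "__main__":
    # Toy documents for illustration only.
    docs = [
        ["shares", "rise", "profit"],
        ["shares", "fall", "loss"],
        ["profit", "rise", "shares", "rise"],
    ]
    features = select_top_features(docs, k=3)
    print(features)
    print([to_bow_vector(doc, features) for doc in docs])
```

Swapping the ranking function for another metric (for example TF-IDF or chi-square) while keeping `k` fixed would give the other feature representations compared in the experiment.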
Finding 1: Investigate the performance of ELR-BoW using NB against three class labels
This test evaluates the effectiveness of ELR-BoW with the naïve Bayes (NB) classifier based on the weighted accuracy. The aim is to compare the performance of the proposed feedback measure PA against the state-of-the-art feedback measures (DA, CA) and to assess the impact of different feature representations on the prediction accuracy using the naïve Bayes classifier. Tables 3–7 tabulate the results for the NB classifier with the different statistical metrics, using the PA, DA and CA class labels to measure the fluctuations of the stock price. PA achieves the highest accuracy in (DS11, DS32), with an accuracy of 73.09%; the number of correctly classified instances was 1314 and 1300 for chi-sqr and df, respectively. For DA, the best result was reported for (DS61), with an accuracy of 58.20% for chi-sqr. CA scored the lowest accuracy of the three class labels; its best accuracy is in (DS111), with 53.87%, also for chi-sqr.
The obtained results show that PA achieved a higher accuracy than DA and CA with the different statistical measures in all of the test datasets. Chi-sqr achieves the best results across the three feedback measures. We also notice that the best accuracies are recorded when the feature size is between 150 and 400, which indicates that these feature sizes have a remarkable influence on the feedback measurements of stock price movements. From this point, we can conclude that previous studies focused on introducing stock price models rather than on investigating the performance of the feedback measurements. The strong association established by ELR-BoW between the extracted features and the stock price over a short 20-minute timeline yielded significant enhancements in the performance of PA.
Finding 2: Investigate the performance of ELR-BoW using SVM against three class labels
This test determines the performance of ELR-BoW using SVM in terms of the prediction accuracy and the correctly classified instances. As before, we compare the performance of the proposed feedback measure PA against the state-of-the-art feedback measures DA and CA. In Tables 8–12, the SVM classifier was applied to the same datasets. The obtained results show that PA performed best in (DS7 and DS9), for DF and IG, respectively: the number of correctly classified instances was 1350, with an accuracy of 73.09% using 100 features for both. We also note that the accuracy decreased using DA and CA, whose best accuracies were 59.17 and 55.22 in (DS83 and DS130), respectively.
Based on the obtained results, we observed that PA again achieved better results than DA and CA, owing to the effectiveness of the ELR-BoW algorithm in measuring the feedback and representing the features for stock price modeling. To be more exact, when the linear regression was used, the accuracy increased significantly for PA, as can be seen in Tables 8–12. Across the other statistical metrics, we found that PA also performs better than the other class labels. The drop in accuracy for DA and CA was caused by the inaccurate evaluation of the class label.
The achieved results indicate that PA performs significantly well in capturing the impact of news articles on the stock price. Additionally, the implementation of ELR-BoW proved to be a successful improvement in discovering the relationships and representing the direct effect of news articles on the stock prices. ELR-BoW for short-timeline intraday stock prediction also had a strong impact when exploring different feature representations for the stock prices. The results might be useful for market traders, as they make it easier to predict the stock prices efficiently. In addition, the results make logical sense in clarifying realistic stock price fluctuation behavior and show that the prediction is close to the eventual outcomes. From this point onward, we want to shed light on the impact of the ELR-BoW implementation on feature representation. It is evident that the proposed class label PA brings significant enhancements for all of the statistical metrics.
Evaluate the performance of PA using the NB and SVM classification methods.
Based on findings 1 and 2, we found that PA scored the best results with both classification methods, NB and SVM. In this section, we evaluate the classification methods using the PA feedback measurements for NB and SVM. We calculate the F-measure value for each direction (up, down) and the weighted accuracy for PA-SVM against PA-NB. Tables 13–17 show the results for the classification methods using different feature sizes and the feature selection metrics chi-sqr, df, tf-idf, ig and occ. In this evaluation, we focus on the F-measure value for the up direction and the weighted accuracy.
The classification results for the chi-sqr
The classification results for the DF
The classification results for the TF-IDF
The classification results for the IG
The classification results for the OCC
In Tables 13–17, we show the results that were obtained using the PA class label for both classifiers, NB and SVM. For chi-sqr, the weighted accuracy reached 0.666 for SVM and 0.654 for NB. In addition, the F-measure for the news articles in the up direction achieved scores of 0.84 and 0.842, respectively. The results show that SVM performs better than NB across the different feature sizes.
Figures 8–12 present the weighted classification accuracy for the five statistical measures CHI-SQR, DF, TF-IDF, IG and OCC, respectively. The weighted accuracy is measured using the naïve Bayes and SVM algorithms for different feature sizes. According to [14, 16], NB achieved promising results, and therefore, we used NB as the comparison for SVM. The five figures show that the SVM trend line is better than that of NB across all the comparisons. In Fig. 8, the highest scores recorded are 0.665 and 0.654 for SVM and NB, respectively. The results also show that the SVM performance changed only slightly across all the feature sizes, whereas the NB performance decreased dramatically when a large number of features was used. This indicates that NB is weak at classifying a large number of features.
The weighted accuracy for the CHI-SQR using SVM and NB-Based PA.
The weighted accuracy for the DF using SVM and NB-Based PA.
The weighted accuracy for the TF-IDF using SVM and NB-Based PA.
The weighted accuracy for the IG using SVM and NB-Based PA.
The weighted accuracy for the OCC using SVM and NB-Based PA.
F-measure value for chi-sqr.
F-measure value for DF.
F-measure value for TF-IDF.
F-measure value for IG.
F-measure value for occurrence.
In Fig. 9, both classifiers record a drop in accuracy for DF when using [150–300] features; the performance then increases slightly from 400 to 800 features. The highest accuracy reported for SVM is 0.666 when using 100 features, and the highest accuracy for NB is 0.655 when using 150 features. The results for DF provide strong evidence of the impact of the feature representation on performance. Similarly, Figs 10–12 indicate that SVM is better than NB at classifying the stock market data. The reported results show a negative relationship between the number of features and performance: the trend line decreases as the number of features increases. The plotted figures show that using a small number of features is better than using a large number.
The implementation of statistical measures assists in exploring a wide range of features, which leads to discovering more relationships with the market movement, and the results indicate that the selected features exploit the characteristics of the statistical measures for feature representation. ELR-BoW was able to successfully identify strong features that represent the condition of the stock market. An important remark on feature selection follows from the classifier performance: with feature selection, the accuracy increases because the number of irrelevant features in the training set is reduced.
To evaluate the impact of different feature sizes on the classification accuracy, the proposed ELR-BoW algorithm was tested with two classifiers, SVM and NB, using 11 different feature sizes. The main purpose of using different feature sizes is to identify the feature size that best reveals a strong correlation between the extracted features. In addition, five statistical metrics were used to select features in different representations. Based on findings 1 and 2, the weighted accuracy for PA is significantly better than that for CA and DA. Therefore, the results for CA and DA are discarded from the analysis.
To evaluate the impact of different feature sizes, we used the F-measure for PA-SVM and PA-NB to compare the performance of the different feature sizes and statistical metrics. Tables 13–17 summarize the results for the SVM and NB classifiers using the PA class label with the different feature sizes. From these tables, we plotted Figs 13–17 for the feature selection statistical measures (CHI-SQR, DF, TF-IDF, IG and OCC).
In this section, the obtained results are further analyzed using a paired sample t-test to evaluate the performance of the proposed method PA-SVM against PA-NB. The results are presented in Tables 18–20. The mean (M) and standard deviation (SD) of the F-measure values for each classification method and feature size are presented in Table 18. In addition, for each feature size, Tables 19 and 20 present the correlation between the features, the significance value and the p-value.
In Table 18, we report M and SD for SVM and NB at each feature size. The standard deviation is used to evaluate the distribution of the data and to judge whether a specific data point is standard and expected or unusual and unexpected. A low standard deviation tells us that the data are closely clustered around the average, while a high standard deviation indicates that the data are dispersed over a wider range of values.
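As a hedged sketch of the statistics described here, the following computes the mean, the sample standard deviation and the paired-sample t statistic for two matched lists of F-measure values. The numbers are made-up placeholders, not values from Tables 18–20, and whether the paper uses the sample or the population standard deviation is not stated, so the sample form is an assumption. Obtaining a p-value would additionally require the t distribution with n − 1 degrees of freedom, which is omitted.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) of a list of scores."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(first, second):
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = first - second."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error


if __name__ == "__main__":
    # Placeholder F-measure values for illustration only.
    svm_scores = [0.84, 0.83, 0.82, 0.80]
    nb_scores = [0.82, 0.80, 0.80, 0.77]
    print("SVM (M, SD):", summarize(svm_scores))
    print("NB  (M, SD):", summarize(nb_scores))
    print("paired t:", paired_t_statistic(svm_scores, nb_scores))
```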
Standard deviation and mean values for PA using SVM and NB
The results in Table 18 indicate that ELR-BoW helps SVM to produce better results than the NB classifier. The mean value M is larger when the number of features is small and decreases as the number of features increases. In contrast, the standard deviation SD was 0.00055 and 0.00841 when the number of features was 50 and increased dramatically to 0.04075 and 0.05614 at 800 features for SVM and NB, respectively. The best mean value (M) is 0.8396 for SVM, while the best M for NB is 0.8274. It can be clearly seen in the table that the SD for SVM is lower than the SD for NB, which indicates that SVM is better across all the feature sizes.
The results demonstrate the effect of a large number of features on the classifier results. Using a lower number of features minimizes the number of irrelevant features in the training set and results in an increase in the performance. On the other hand, with an increased number of features, the amount of irrelevant data increases, and the accuracy decreases due to the curse of dimensionality.
In Table 19, the correlation R between the features for SVM and NB is calculated. The correlation measures the relationship between two variables. It is denoted by R, which is commonly used to describe the strength of the linear relationship between two values. The R value can range from
Shows the correlation and the significant results between the PA-SVM and PA-NB
The best correlation value R
Presents the t-test p-value for two variables SVM and NB
Moreover, a paired sample t-test was conducted to evaluate whether statistically significant differences existed between PA-SVM and PA-NB for the different feature sizes (Table 20). The significance level below (
Furthermore, the Wilcoxon test is used to compare the support vector machine (SVM) and naïve Bayes (NB) classifiers. The Wilcoxon analysis is based on the average F-measure value for each feature size. The results in Table 21 show that SVM is significantly better than NB, whereas the significance level below (
The comparisons between SVM and NB using Wilcoxon test
a. Wilcoxon Signed Ranks test; b. Based on positive ranks.
This finding indicates that there is a significant difference between SVM and NB. From this point, we should shed light on the strength of SVM in predicting stock market movements. To determine the differences between the feature selection measures for SVM and naïve Bayes, it is advisable to rank the statistical measures using Friedman’s test based on the obtained F-measure values.
The obtained results are further analyzed using Friedman’s test for PA-SVM and PA-NB. The test is used to rank the statistical measures, and the results are tabulated in Table 22. As Table 22 shows, for PA-SVM the best-performing feature selection measure was DF, with rank 2.5, whereas the worst was IG, with rank 5.318. The results for PA-NB likewise show that the best statistical measure was DF, with rank 5.3182, and the worst was OCC, with rank 9.681. In addition, the
Average ranking of PA-SVM and PA-NB for different feature sizes
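The average ranks on which Friedman's test is based can be sketched as follows: within each block (here, one feature size), the candidates are ranked by score, tied scores share the mean of their rank positions, and the ranks are then averaged over all blocks. This is a generic illustration with placeholder names and scores, assuming that rank 1 denotes the highest score; it does not compute the Friedman test statistic itself and does not reproduce Table 22.

```python
def average_ranks(scores_by_candidate):
    """Average within-block ranks; rank 1 is the highest score, ties share a mean rank.

    `scores_by_candidate` maps a candidate name to a list of scores,
    one score per block, with the blocks in the same order for every candidate.
    """
    names = list(scores_by_candidate)
    n_blocks = len(scores_by_candidate[names[0]])
    totals = {name: 0.0 for name in names}
    for block in range(n_blocks):
        ordered = sorted(names, key=lambda name: -scores_by_candidate[name][block])
        start = 0
        while start < len(ordered):
            # Extend the group while the next candidate has the same score.
            end = start
            while (end + 1 < len(ordered)
                   and scores_by_candidate[ordered[end + 1]][block]
                   == scores_by_candidate[ordered[start]][block]):
                end += 1
            shared_rank = ((start + 1) + (end + 1)) / 2  # mean of 1-based positions
            for name in ordered[start:end + 1]:
                totals[name] += shared_rank
            start = end + 1
    return {name: totals[name] / n_blocks for name in names}


if __name__ == "__main__":
    # Placeholder scores for three candidates over three blocks.
    scores = {"A": [0.9, 0.8, 0.7], "B": [0.8, 0.8, 0.6], "C": [0.7, 0.6, 0.9]}
    print(average_ranks(scores))
```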
Based on the statistical test analysis, the best number of features for the stock price is 300, and DF is the best-performing statistical measure, with a best F-measure value of 0.842. Moreover, representing the bag-of-words features using different statistical metrics increased the flexibility to express the extracted features based on the characteristics of each metric and to capture the most discriminating features, in spite of the results being slightly similar
We believe that the results met our expectations. We conclude that the proposed ELR-BoW method was implemented successfully and that the PA-SVM feedback measure is robust for building correlated features between the news articles and stock prices. The results testify that there is an improvement in predicting the stock market
In summary, our research introduced the ELR-BoW algorithm for feature representation in stock market prediction. We evaluated the performance of the proposed method and measured the effect of financial news articles on the S&P500 stock market. The news articles were represented as features, and the feature vectors were constructed using five statistical metrics to select the best features. Then, the class label was derived from the close prices using linear regression to calculate three different representations, namely, PA, DA and CA. The naïve Bayes (NB) and support vector machine (SVM) classifiers were trained to evaluate the performance in terms of correctly classified instances and the accuracy on the whole test set. Additionally, the F-measure and weighted accuracy are used to indicate the changes that occur in the stock price for the two categories, up and down.
In general, the results were satisfactory because they answered the research objectives, which were to identify the best feature extraction model using five statistical metrics: chi-sqr, DF, TF-IDF, IG and occurrence. It was found that ELR-BoW using SVM obtained better performance than ELR-BoW using NB with the three feedback measures PA, DA and CA. DF obtained the best performance compared to the other statistical metrics, and the use of different feature representations with the ELR-BoW algorithm helped to capture the stock market’s sudden movements for short-timeline prediction. In addition, the results demonstrated remarkable improvements in performance using the proposed PA class label to measure the feedback between the stock price and the published news articles. The study also introduced an accurate prediction model for the S&P500 stock market using linear regression, which tackled the issue of stock market prediction for short-timeline movements based on a 20-min prediction window.
Additionally, the experimental results were remarkably significant in capturing the relationship between the news and the stock prices. The ELR-BoW for SVM successfully achieved a highly significant correlation between the features, the
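One way to picture how linear regression over close prices can yield an up/down label is sketched below: a least-squares line is fitted to the prices observed in a window after a news release, and the sign of the slope gives the direction. This is an illustrative assumption only; the definitions of PA, DA and CA are not given in this section, and the function names, the 5-minute sampling and the price values are hypothetical.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def direction_label(minutes, prices):
    """Label the window 'up' if the fitted slope is positive, otherwise 'down'."""
    return "up" if least_squares_slope(minutes, prices) > 0 else "down"


if __name__ == "__main__":
    # Hypothetical close prices sampled over a 20-minute window after a news release.
    minutes = [0, 5, 10, 15, 20]
    prices = [100.0, 100.2, 100.1, 100.4, 100.5]
    print("slope:", least_squares_slope(minutes, prices))
    print("label:", direction_label(minutes, prices))
```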
This work differs from previous studies in the way the dataset is built. Its strength lies in using a large number of statistical measures to select the features and in exploring feature representation enhancements using the linear regression method. Additionally, this work emphasizes investigating the relationship with the stock price using a feature selection method that incorporates five statistical metrics for stock market prediction. That approach captures relationships that demonstrate the interactions between the news articles and the stock prices to predict the movement in one of two directions, up or down.
Despite the significant outcomes of this study, some weak points remain open for debate. From this perspective, we propose possible directions for future research that require further vigorous investigation. In our work, we do not include any semantic method to select the features that reflect the condition of the market and resolve vagueness. Thus, we foresee focusing on integrating some distinct features that might be considered. Within text mining for market prediction, we have not found any method that is dedicated to context capturing or abstraction and that captures the information required for the stock market. Because this domain is an emerging field, such methods are strongly needed. The utilization of computational processing must be investigated rigorously. Last, given the availability of a staggering amount of online data, the application of dimensionality reduction methods is highly recommended for further enhancements in the field of market prediction.
Acknowledgments
This work is supported by Malaysia Ministry of Education Exploratory Research Grant Scheme (ERGS/1/2013/ICT07/UKM/02/4).
