Learning the sentiment of soccer fans from data on bets and social nets

Abstract

In this paper we propose a Hidden Markov Model (HMM) to predict the sentiment of soccer fans from information about match results. The model was constructed from data collected from a social network in which fans of a soccer team periodically express their feelings towards their team. We argue that the choice of an HMM is justified because the change in a fan’s sentiment is analogous to a Markovian process of state change over time. We perform a comparative evaluation between variations of the proposed models and also between the most accurate of them and classification algorithms. A second-order HMM that considers match results and fans’ betting information is the most accurate model, even though the models are constructed from results of different kinds of championships.

Keywords

Hidden Markov models, sentiment analysis, social networks

1. Introduction

The interactions performed in digital media, which characterize the era of Web 2.0, allow for previously unimaginable studies of human social behavior. The extraction of patterns from social networks, for instance, provides affordable and reliable means to analyze crowd sentiment. Not surprisingly, researchers explore this subject from multiple fields such as Information Retrieval [21], Natural Language Processing [6,17,23] and Machine Learning [19], among others.

Our work falls into this general context of learning crowd sentiment from data provided by users of a given social network. Three differences, however, must be noted. The first relates to our research goal itself, which is to construct a predictive model rather than to extract sentiment from the interactions of a group of people. It is common to find research that studies the impact of a crowd’s mood on an event in society. A recent example is [4], which studied the influence of the sentiment expressed by Twitter users on the protests that took place in Egypt in 2011. A correlation between the strength of the riots and the number of negative tweets on a particular day was demonstrated. In another context, it has been shown that a change in crowd sentiment can also affect the acceptance of a product on the market; the financial success of a product is related to the sentiment people express about it. For example, [16] shows that positive or negative reviews of movies in blogs immediately before their release affect their financial success. Despite their benefits, these approaches are limited because people do not always post on social networks in real time or immediately after the event that affects their mood. Soccer fans, for instance, participate in social networks with varying intensity according to their team’s victories and defeats. Inferring their sentiment from posts right after a match, even when the final result is known, might therefore not be viable, which reduces inference power. It is then difficult for law enforcement authorities, for instance, to use information about the mood of fans to prevent crowd violence after a sporting event.

The second difference relates to our basic research hypothesis. We consider that people’s mood changes continually over time in response to events that affect them. This may be modeled as an evolutionary process of state change over time. Unlike sentiment analysis works, which generally use classification models, we argue that it is important to consider that the sentiment felt at a given time is influenced by sentiments previously held.

Lastly, the third difference, from a practical standpoint, is that although we use data from a social network to build our model, we do not apply natural language processing techniques to treat semi-structured data. This is unnecessary because we use the database of a social network called FootyCrowd (FC, http://www.footycrowd.com). In this social network, formed by soccer fans, two pieces of information are essential for our study. The first is the sentiment of a soccer team’s fans. On FC, fans are periodically invited to express how they feel about their team at a specific moment. The second is a sportsbook of official games from Brazilian championships. The bets placed on teams serve as indicators of which teams are favorites and which are not. Such information is an equally important feature for the model, because the result of a match and the strength of the opposing team are essential to the change in fans’ sentiment.

The data collected from FootyCrowd and our basic research hypothesis have led us to propose a model of a fan group’s sentiment that evolves over time and is influenced by the results of matches in official championships. This data structure naturally led us to choose Hidden Markov Models (HMM) as the modeling instrument. In the Markov model, the fans’ sentiment is represented by latent variables, while the match results and the characteristics of the participating teams are the observations, or visible variables. The results obtained with several HMM variations, differing both in the number of states and in the model’s order, show that a second-order HMM considering match results and fans’ betting information is more accurate than the other variants. Comparisons with classification algorithms have also demonstrated the advantages of using HMM.

The rest of this article is structured as follows. First, we present a brief review of Hidden Markov Models and of the algorithms we use to explore the sentiment model created for the fans. Afterward we describe how HMM was used to model the sentiment of soccer fans, including the different modeling options, which vary in complexity depending on the amount of information collected on the matches.

A comparative evaluation will be performed between variations of the proposed models and also between the most accurate of them and classification algorithms.

2. Background knowledge

2.1. Hidden Markov model

“A HMM is a doubly stochastic process with an underlying stochastic process that is not observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observed symbols.” [22]

Formally, we have a set O of observations $o_{t}$ for a time t and a set of states $S = \{s_{1}, s_{2}, s_{3}, \dots, s_{n}\}$ . The states are what the model attempts to probabilistically infer from the observations.

Figure 1 presents a first order Markov chain. In order to estimate a following state, $s_{t + 1}$ it is necessary to know only about the current state and the observation $o_{t + 1}$ . In greater orders, however, the future state will depend on more than the current state. For instance, in a second order chain, to define $s_{t + 1}$ it will be necessary to know of the state $s_{t - 1}$ and $s_{t}$ .

Fig. 1.

Markov chain.

From both variables, state and observation, the Markov model is formed by a transition model, $P (s_{t + 1} ∣ s_{t})$ , which gives the probability of transition between states, and an observation model, $P (o_{t} ∣ s_{t})$ , which gives the probability of an observation given the state, in other words, the probability of observing $o_{t}$ while being in state s at time t.

In short, a Markov model is composed of the following:

A set of states $S = \{s_{1}, s_{2}, s_{3}, \dots, s_{n}\}$

A set of observations $O = \{o_{1}, o_{2}, o_{3}, \dots, o_{n}\}$

A vector π with initial probabilities for each state

A transition matrix A, with a transition model

An observation matrix B, representing the observation model

Such a model allows probabilistic inference through algorithms such as Baum–Welch [22], which adjusts the model’s matrices according to the observation sequence, and the Viterbi algorithm [24], which infers the most probable sequence of states.
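To make the role of π, A and B concrete, the following is a minimal sketch of Viterbi decoding for a first-order HMM. It is not the implementation used in this work: the function name `viterbi`, the dictionary layout of the parameters and the toy probabilities in the usage note are illustrative assumptions.

```python
import math


def viterbi(observations, states, pi, A, B):
    """Return the most probable state sequence for a first-order HMM.

    pi[s] is the initial probability of state s, A[r][s] the transition
    probability from r to s, and B[s][o] the probability of observing o
    in state s. Log-probabilities are used to avoid numerical underflow.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # delta[s]: best log-probability of any path that ends in state s
    delta = {s: log(pi[s]) + log(B[s][observations[0]]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: delta[r] + log(A[r][s]))
            new_delta[s] = delta[best_prev] + log(A[best_prev][s]) + log(B[s][obs])
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)

    # Trace the best path back from the most probable final state
    path = [max(states, key=lambda s: delta[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

As a usage illustration with invented numbers, take two states `good` and `bad`, π = (0.6, 0.4), transitions good→good 0.8 and bad→bad 0.7, and observation probabilities P(victory | good) = 0.7 and P(victory | bad) = 0.2. For the sequence victory, victory, defeat, defeat the function returns good, good, bad, bad.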

2.2. Related work

Sentiment extraction from social network logs is an important research area and an important tool for social scientists. Different practical applications are emerging, as shown by [18]. The authors connect measures of public opinion collected from polls with sentiment estimated from tweets, and highlight the potential of tweets as a complement to traditional polling. Based on sentiment extraction from Twitter, [2] indicate that the accuracy of stock market predictions can be improved by the inclusion of specific public mood dimensions. These works basically identify the polarity of tweets using a dataset of terms with their corresponding polarity. Several methods in Natural Language Processing use a similar strategy [13,20].

Sentiment classification has been used to study readers’ emotions [12]. Emoticons (emotion labels) have also been used for sentiment classification: [25] built a system called MoodLens, in which 95 emoticons are mapped into four categories of sentiment from Chinese tweets in Weibo. [10] designed an approach for classifying headline emotion based on information collected from the World Wide Web. Also associated with topic models, there are two related works [8,15] that share some similarities with ours. The first studies the change of sentiment over time in written documents using a Topic Sentiment Change Analysis. The second proposes a probabilistic mixture model, Topic-Sentiment Mixture, to extract topics and sentiment from weblogs. It uses a Hidden Markov Model to extract topic life cycles and sentiment dynamics from the documents.

More closely related to our approach, since it tries to model emotions with respect to time spans, [26] inserted an emotional layer in Latent Dirichlet Allocation. It uses topic models to analyze the correlation between the sentiment of the crowd and topics in news comments.

Some works, while not exactly performing sentiment analysis and opinion mining, are related to ours. An example is [9], which seeks to understand what makes viewers of baseball games in Korea interact with other fans in an online chat during a match. The factors analyzed are pre-game factors, the team’s statistics in previous matches and during the game, as well as factors that indicate how enjoyable the game is for a fan. In addition to studying data about fans and games, another similarity to our work is that, to define the factors during the games, the authors used a Hidden Markov Model to predict the progress of the “half-innings.” With this information they try to estimate the number of messages in chats using regression models. Another related work is [1], which proposes an assistant, based on a fuzzy neural network, for retrieving relevant information about NBA players on the Internet in order to help NBA scouting agents.

None of these works studies the relation between people’s mood, which changes continually over time, and the events that affect them, as we intend to do in this research.

2.3. FootyCrowd

The social network FootyCrowd (FC) is a digital space for interaction among soccer fans. It includes an area for interaction among fans of the same team, called a team page, and an area for interaction among fans of different teams. On the team page, each fan is asked weekly about the sentiment he/she is experiencing towards the team at that specific moment. This sentiment can be expressed in six distinct forms: great, good, worry, sad, bad and terrible. Each user has a voting period, which is not necessarily the same as that of other users. FootyCrowd defines the sentiment of a team’s fans at a given time as the most voted sentiment registered by fans in the seven previous days.

On FC, fans participate in competitions and are encouraged to win virtual prizes through gamification strategies. The main competition is a race in which one can earn points and badges through bets placed on the results of matches from Brazilian soccer championships. The bets are made with virtual money and are of two kinds: bets on the result (victory, tie, defeat) or on the match’s exact final score. The social network has more than 200,000 subscribed users, 10% of whom are considered active.

Figure 2 shows the distribution of the 34,764 votes on sentiment by fans from 2012 to 2014. It is possible to observe that most votes are toward positive sentiments (great and good).

Fig. 2.

Distribution of votes by sentiment per year.

We use results of matches from several championships that take place in Brazil during a yearly season, 896 matches in all. A total of 113,820 bets were placed on these matches, an average of 127 bets per match. We used only the type of bet in which one chooses win, lose or draw.

3. Applying HMM to learn about soccer fan’s sentiment

3.1. First order basic models

In order to model the sentiment of fans we initially developed three variations of a first order Markov model, each adding more information. Our hypothesis is that the more information the model contains, the better it represents reality. The models are the following:

M1 – This model considers the 6 phases that describe fans’ sentiment, s, toward a team in FootyCrowd. Each phase is modeled as a state in the Markov model. Thus $s \in \{\text{great}, \text{good}, \text{worry}, \text{sad}, \text{bad}, \text{terrible}\}$ . The observations (or visible states) represent the result of a match for a given team. Formally, $o \in \{\text{victory}, \text{defeat}, \text{tie}\}$ .

M2 – In this model we added the concept of rout. The expectation is that a defeat by a difference of several goals has a greater effect on fans’ mood. A rout is a match in which the goal difference between the two teams is equal to or greater than three. The model in this case has five visible states: winning by rout, winning, tying, losing and losing by rout.

M3 – This model adds an extra feature: the bets placed on matches, which are available in the social network. From the number of bets placed it is possible to know which team is the favorite and which is the dark horse (non-favorite) in each game. The belief is that if a team considered a favorite ends up losing the game, its fans will be more dissatisfied, whereas if it wins, the victory has no great impact on fans’ happiness. Therefore, a model was defined with 15 observations (Table 1) and the same states as M2.

Table 1
Possible values of the observations in the model that considers bets placed on matches

Victory as very favorite	Victory as favorite	Victory as very dark horse (non-favorite)	Victory as dark horse (non-favorite)
Victory	Defeat	Tie
Defeat as very favorite	Defeat as favorite	Defeat as very dark horse	Defeat as dark horse
Tie as very favorite	Tie as favorite	Tie as very dark horse	Tie as dark horse

3.2. How to define a team’s favoritism

Defining a team’s level of favoritism in a match required a more elaborate strategy to adapt the data coming from FootyCrowd. We began by computing the difference, Δ, between the number of bets favorable to the analyzed team ( $v_{f}$ ) and the number of non-favorable bets ( $v_{n f}$ ). $\begin{matrix} (1) & Δ = | v_{f} - v_{n f} | \end{matrix}$ We noticed that the distribution of the differences follows a Power Law. We therefore used a Maximum Likelihood Estimator to estimate the slope (α) of that distribution, obtaining $α = 1.24$ . From this parameter value a pure Power Law curve was plotted and compared to the curve of the data. Fig. 3 shows that both curves are fairly similar.

Fig. 3.

Curve illustrating the number of matches for each difference in comparison to a Power Law.
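The exact estimator used to obtain α is not detailed here. As an illustration only, the sketch below uses the standard continuous maximum-likelihood estimate of a power-law exponent, $\hat{α} = 1 + n / \sum_{i} \ln (x_{i} / x_{min})$ . The function name `power_law_alpha` and the cut-off parameter `x_min` are assumptions introduced for this example; a discrete variant may be more appropriate for bet counts.

```python
import math


def power_law_alpha(values, x_min=1.0):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Only values >= x_min enter the estimate, so differences of zero
    (perfectly balanced bets) are ignored. x_min must be positive.
    """
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum
```

Applied to the list of Δ values, one per match, this returns a single exponent that can then be used to plot the pure Power Law curve for comparison with the data.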

Having observed that the differences per number of matches follow a Power Law distribution, we decided to divide the data into four equal parts using quartiles. The first, second and third quartiles obtained were 10, 40 and 90. Assuming that most matches have fairly balanced bets, we defined the first two quartiles as matches with neither a favorite nor a dark horse; hence any match with a difference between 0 and 40 has a plain observation: winning, losing or tying. By this criterion, 54% of the matches do not have favorites.

Differences over 40 were analyzed separately and again divided into four equal groups, which gave new quartile values of 60, 90 and 120. We then assigned the observations favorite and dark horse to all matches in the first of these quartiles, with a difference between 40 and 60, and the observations very favorite and very dark horse to all matches with a difference greater than 60. Of the matches in which the difference is greater than 40, 43% are favorite or dark horse and 57% are very favorite or very dark horse.
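The thresholds above can be turned into a small labeling routine. The sketch below is one possible reading, not the procedure used to build the dataset: the function names are hypothetical, the treatment of the boundary values (a difference of exactly 40 or 60) is an assumption, and so is the rule that the team with more favorable bets is the favorite.

```python
def favoritism_level(delta):
    """Map the bet difference (Eq. 1) to a level using the cut-offs 40 and 60."""
    if delta <= 40:       # assumption: 40 itself counts as balanced
        return "none"
    if delta <= 60:       # assumption: 60 itself counts as moderate
        return "moderate"
    return "strong"


def observation_label(result, bets_for, bets_against):
    """Build an M3-style observation such as 'defeat as very favorite'.

    result is 'victory', 'defeat' or 'tie' for the analyzed team;
    bets_for and bets_against are the favorable and non-favorable bets.
    """
    delta = abs(bets_for - bets_against)
    level = favoritism_level(delta)
    if level == "none":
        return result
    # Assumption: more favorable bets than non-favorable ones means favorite
    role = "favorite" if bets_for > bets_against else "dark horse"
    prefix = "very " if level == "strong" else ""
    return f"{result} as {prefix}{role}"
```

For example, a team that loses with 150 favorable and 20 non-favorable bets is labeled `defeat as very favorite`, while a tie with 30 favorable and 80 non-favorable bets is labeled `tie as dark horse`.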

3.3. Estimating parameters

The Markov model includes as its parameters a vector of initial probabilities, a transition matrix between states and an observation matrix.

The initial probabilities are the probabilities of each model state at the beginning of the process. Thus, in our initial model the vector of initial probabilities has a value for each of the states great, good, worry, sad, bad and terrible.

The transition matrix represents the probability of a transition from a specific state at time t to any other state at the subsequent time. To build the transition matrix we first define $γ (s_{i}, s_{j})$ , which represents the frequency of transitions from the state $s_{i}$ to the state $s_{j}$ . The probability of a transition from $s_{i}$ to $s_{j}$ is then the frequency of transitions between these two states divided by the frequency of transitions from $s_{i}$ to all states of the model. $\begin{matrix} (2) & P (s_{j} ∣ s_{i}) = \frac{γ (s_{i}, s_{j})}{\sum_{w = 1}^{6} γ (s_{i}, s_{w})} \end{matrix}$

The observation matrix stores the probabilities of occurrence of the observations. For instance, in our basic model (M1) they represent the probability of winning, losing or tying given some state of fan sentiment, i.e. the probability of an observation given a state, $P (o_{i} ∣ s_{j})$ .

In our model the probability of observing $o_{i}$ when fans are in state $s_{j}$ , $P (o_{i} ∣ s_{j})$ , is the number of sentiment records in state $s_{j}$ with observation $o_{i}$ at time t, $γ (o_{i} ∣ s_{j})$ , divided by the total number of sentiment records in state $s_{j}$ , $γ (s_{j})$ . $\begin{matrix} (3) & P (o_{i} ∣ s_{j}) = \frac{γ (o_{i} ∣ s_{j})}{γ (s_{j})} \end{matrix}$
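The two frequency-based estimates can be sketched as follows for the M1 model. The sketch assumes that the data is available as two aligned lists, one sentiment and one match result per time step; the function name `estimate_matrices` and this data layout are illustrative assumptions, not a description of the FootyCrowd database.

```python
from collections import Counter

STATES = ["great", "good", "worry", "sad", "bad", "terrible"]
OBSERVATIONS = ["victory", "defeat", "tie"]


def estimate_matrices(state_seq, obs_seq):
    """Estimate the transition and observation matrices by relative frequency.

    state_seq[t] is the recorded sentiment at time t and obs_seq[t] the
    match result observed at the same time step.
    """
    trans = Counter(zip(state_seq, state_seq[1:]))  # transition counts
    emit = Counter(zip(state_seq, obs_seq))         # state/observation counts
    A, B = {}, {}
    for s in STATES:
        out_total = sum(trans[(s, w)] for w in STATES)
        A[s] = {w: (trans[(s, w)] / out_total if out_total else 0.0)
                for w in STATES}
        s_total = sum(emit[(s, o)] for o in OBSERVATIONS)
        B[s] = {o: (emit[(s, o)] / s_total if s_total else 0.0)
                for o in OBSERVATIONS}
    return A, B
```

States that never occur in the data receive rows of zeros in this sketch, which is one reason a smoothing step is useful before training.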

3.4. Markov models of higher order

Our hypothesis is that a fan’s sentiment depends not only on the immediately preceding state, but also on what the sentiment was further back in time. We decided to represent this dependence through a second order Markov model.

We follow the strategy of [14], specifying the second-order Markov chain by a 3-D matrix $\{a_{i j k}\}$ . $\begin{matrix} (4) & a_{i j k} = P (q_{t} = s_{k} ∣ q_{t - 1} = s_{j}, q_{t - 2} = s_{i}, q_{t - 3}, \dots) \end{matrix}$

The probability of the state sequence $Q = (q_{1}, q_{2}, q_{3}, \dots, q_{T})$ is defined as: $\begin{matrix} (5) & P (Q) = Π_{q_{1}} a_{q_{1} q_{2}} \prod_{t = 3}^{T} a_{q_{t - 2} q_{t - 1} q_{t}} \end{matrix}$ where $Π_{i}$ is the probability of state $s_{i}$ at time $t = 1$ and $a_{i j}$ is the probability of the transition $s_{i} \to s_{j}$ at time $t = 2$ .

Each second-order Markov model has an equivalent first-order model on the twofold product space. For instance, Fig. 4 shows an HMM2 (top) and its equivalent HMM1 (bottom).

Fig. 4.

Example of a HMM2 represented as a HMM1.
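The equivalence can be illustrated with a short sketch that rewrites second-order transition probabilities as a first-order transition table over pairs of states. The function name `to_first_order` and the dictionary representation are assumptions made for this example.

```python
from itertools import product


def to_first_order(states, a2):
    """Build the equivalent first-order transition table on S x S.

    a2[(i, j, k)] is P(q_t = k | q_{t-1} = j, q_{t-2} = i). A pair state
    (i, j) can only move to a pair (j, k) that shares the middle state;
    every other pair-to-pair transition has probability zero.
    """
    pairs = list(product(states, repeat=2))
    A = {p: {q: 0.0 for q in pairs} for p in pairs}
    for i, j, k in product(states, repeat=3):
        A[(i, j)][(j, k)] = a2.get((i, j, k), 0.0)
    return A
```

With N states the product space has N² pair states, so the six sentiment states used here would give 36 pair states. Each row of the resulting table still sums to one whenever the second-order probabilities do.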

The extension of the Viterbi algorithm to HMM2 is straightforward. Instead of referring to a state in the state space S, one refers to an element of the twofold product space $S \times S$ . To adjust the second-order HMM parameters, we used the extended Baum–Welch algorithm proposed by [14], which is based on modified forward and backward functions. The forward function $α_{t} (j, k)$ defines the probability of the partial observation sequence $o_{1}, \dots, o_{t}$ and the transition ( $s_{j}, s_{k}$ ) between time $t - 1$ and t. $α_{t} (j, k)$ can be computed from $α_{t - 1} (i, j)$ , in which ( $s_{i}, s_{j}$ ) and ( $s_{j}, s_{k}$ ) are two transitions between states $s_{i}$ and $s_{k}$ . The backward function $β_{t} (i, j)$ is computed in a similar way.

Let $γ (s_{i}, s_{j}, s_{k})$ represent the frequency of the transition sequence through the states $s_{i}$ , $s_{j}$ and $s_{k}$ , in which $s_{i}$ is the state at time $t - 2$ , $s_{j}$ is the state at time $t - 1$ and $s_{k}$ is the state at time t. The probability of each second order transition is then calculated by the following equation: $\begin{matrix} (6) & \begin{matrix} P (q_{t} = s_{k} ∣ q_{t - 1} = s_{j}, q_{t - 2} = s_{i}) \\ = \frac{γ (s_{i}, s_{j}, s_{k})}{\sum_{w = 1}^{N} γ (s_{i}, s_{j}, s_{w})} \end{matrix} \end{matrix}$ where N represents the number of states.

The second order models were generated only for the model with the greatest amount of information (M3).

3.5. Laplacian smoothing

In order to prevent any probability from being zero and to improve the performance of HMM, we used Laplacian Smoothing (LS). Smoothing is a mathematical method that removes excess variability from data while keeping the same expressiveness [3].

Equation (7) shows how to compute a probability p using LS. $\begin{matrix} (7) & p = \frac{N (y, x) + 1}{N (x) + w} \end{matrix}$ Here $N (y, x)$ is the number of times the events x and y occur together, $N (x)$ is the number of times that the event x occurs and w is the number of possible values that y can assume. For instance, if y refers to the sentiment of a person and can assume the values happy, neutral and sad, then $w = 3$ .

Smoothing changes the definition of the model probabilities. The probability of a sentiment change, $P (s_{j} ∣ s_{i})$ , is estimated as in Eq. (8). $\begin{matrix} (8) & P (s_{j} ∣ s_{i}) = \frac{γ (s_{i}, s_{j}) + 1}{\sum_{w = 1}^{N} γ (s_{i}, s_{w}) + N} \end{matrix}$ where N represents the number of states ( $N = 6$ in our models). The same change applies to the second-order transitions, where $γ (s_{i}, s_{j}, s_{k})$ is the frequency of the state sequence $s_{i}$ , $s_{j}$ , $s_{k}$ at times $t - 2$ , $t - 1$ and t. $\begin{matrix} (9) & \begin{matrix} P (q_{t} = s_{k} ∣ q_{t - 1} = s_{j}, q_{t - 2} = s_{i}) \\ = \frac{γ (s_{i}, s_{j}, s_{k}) + 1}{\sum_{w = 1}^{N} γ (s_{i}, s_{j}, s_{w}) + N} \end{matrix} \end{matrix}$ Finally, the probability $P (o_{i} ∣ s_{j})$ is given by Eq. (10). $\begin{matrix} (10) & P (o_{i} ∣ s_{j}) = \frac{γ (o_{i} ∣ s_{j}) + 1}{γ (s_{j}) + k} \end{matrix}$ where k represents the number of observations.
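A minimal sketch of this add-one estimate for a single conditioning event is given below. The function name `laplace_smooth` and the dictionary of counts are illustrative assumptions; the constant added to the denominator is the number of possible outcomes, that is, the number of states in Eq. (8) and the number of observations in Eq. (10).

```python
def laplace_smooth(counts, outcomes):
    """Return add-one smoothed probabilities for one conditioning event.

    counts maps each outcome to how often it co-occurred with the
    conditioning event; outcomes lists every value the outcome can take,
    including those that were never seen.
    """
    total = sum(counts.get(y, 0) for y in outcomes)
    return {y: (counts.get(y, 0) + 1) / (total + len(outcomes))
            for y in outcomes}
```

For example, if a state was recorded with 3 victories, 1 tie and no defeats, the smoothed observation probabilities are 4/7, 2/7 and 1/7: the unseen defeat no longer has probability zero and the values still sum to one.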

4. Empirical evaluation

The trials used the Baum–Welch algorithm to train the generated Markov models and the Viterbi algorithm to discover which sequence of states was most plausible for an observation sequence.

Our first evaluation compared models with varying amounts of information represented in the visible states. We compared these models with a baseline algorithm based on the intuition that whenever a team wins a match, fans’ sentiment rises and, whenever it loses, the sentiment degrades. For instance, if a team is in a good phase and wins, it is upgraded to a great phase. If the team wins again in the next game, it remains in the great phase, since that is the upper limit. If the team loses, fans move it from the good phase to worry, and if the team ties, it remains in its current phase.

Fig. 5.

Impact of observations in transitions between states in the baseline system.

Figure 5 presents a diagram of states explaining how, in this baseline algorithm, the observations (arcs) impact on states (circles).
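The baseline can be sketched as a simple walk over the ordered sentiment phases. The sketch assumes that the phases are ranked in the order in which they are listed in this paper (great, good, worry, sad, bad, terrible), that each result moves the sentiment by exactly one phase, and that a defeat in the terrible phase leaves the sentiment unchanged; the function names are hypothetical.

```python
# Phases ordered from worst to best (assumed ranking)
PHASES = ["terrible", "bad", "sad", "worry", "good", "great"]


def baseline_step(phase, result):
    """Move one phase up on a victory, one down on a defeat, stay on a tie."""
    idx = PHASES.index(phase)
    if result == "victory":
        idx = min(idx + 1, len(PHASES) - 1)  # 'great' is the upper limit
    elif result == "defeat":
        idx = max(idx - 1, 0)                # 'terrible' is the lower limit
    return PHASES[idx]


def baseline_sequence(start, results):
    """Return the phase reached after each match result, starting from start."""
    sequence, phase = [], start
    for result in results:
        phase = baseline_step(phase, result)
        sequence.append(phase)
    return sequence
```

Starting from good, the results victory, victory, defeat, tie, defeat give the phases great, great, good, good, worry.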

To assemble the initial model, which we refer to as the training model, we estimated the probabilities of the transition and observation matrices from FootyCrowd data on fan sentiment. This data consists of votes cast during the 2012 and 2013 seasons by fans of 8 of the largest Brazilian teams (in terms of number of fans), which are also the top teams in FootyCrowd’s fan ranking. The teams are Corinthians, Palmeiras, Santos, São Paulo, Botafogo, Vasco, Flamengo and Fluminense. In addition to the sentiment database, we used data from matches played in 2012 and 2013. Altogether, 533 match results were used. Since the model will attempt to estimate states for matches of teams that have no registered sentiment, the initial probabilities in our model were the same for all sentiments, $P_{initial} (S_{i}) = 0.166$ .

The results of those teams’ matches in the 2014 season were used to refine the model, which, after applying Baum–Welch, yielded new probabilities. With this refined model, Viterbi can be applied to infer fan sentiment for each week after a match and compare it with the fan sentiment captured by FC. Table 2 presents the number of matches for each team in the 2014 season.

Table 2

Number of matches for each team in the test database

Team	Number of matches
Corinthians	44
Palmeiras	43
São Paulo	42
Santos	49
Flamengo	50
Vasco	46
Fluminense	43
Botafogo	46
Total	363

The inference of fan sentiment was made individually for each team. For example, to test the accuracy of the algorithm in estimating the sentiment of Flamengo’s supporters, for each of the 50 matches in the test database we compare the sentiment inferred by the Viterbi algorithm with the sentiment of the crowd registered in FootyCrowd just after that match. Using M1, the success rate was 29.16%, meaning that for 14 matches the inference made via Viterbi was correct. Table 3 presents, per team, the percentage of matches in which the inference made by the Viterbi algorithm agrees with the fan sentiment on FootyCrowd at that moment. The table shows results for the different model types (M1, M2 and M3, all first order) as well as results obtained with the baseline system previously described.

Table 3

Results for the three first order models and the baseline system

	M1	M2	M3	Baseline
Corinthians	10.25%	29.54%	22.72%	38.63%
Palmeiras	50.0%	48.83%	60.46%	46.51%
São Paulo	40.47%	40.47%	50.0%	21.42%
Santos	25.0%	29.78%	44.68%	30.61%
Flamengo	29.16%	34.0%	50%	46.0%
Vasco	41.3%	21.73%	64.44%	15.21%
Fluminense	20.0%	37.2%	32.55%	46.51%
Botafogo	38.63%	69.56%	60.89%	23.91%

It is possible to notice an improvement in the results for almost all teams as more information is included in the observations of the Markov model. For instance, for Santos, the Viterbi algorithm with M1 was accurate in 25% of inferences. This accuracy increased to 29.78% with M2 and to 44.68% with M3. However, the models do not have a clear advantage over the baseline system.

4.1. Defining rout

In the previous section, in the M2 model, we considered the concept of rout. This concept makes it possible to measure how impactful it is for fans that their team lost or won by a large goal difference. Rout was defined as a difference of 3 or more goals (M2-Dif.3) at the end of the match. We also ran the model with the boundary set to a difference of 4 or more goals (M2-Dif.4). Table 4 shows these values.

Table 4
Comparison with the results of the M2 model for different definitions of rout

	M2-Dif.3	M2-Dif.4
Corinthians	29.54%	29.54%
Palmeiras	48.83%	74.41%
São Paulo	40.47%	40.47%
Santos	29.78%	36.17%
Flamengo	34.0%	28.0%
Vasco	21.73%	45.65%
Fluminense	37.2%	25.58%
Botafogo	69.56%	67.39%

In general the results are similar and without large variations. However, the significant differences identified in the results of Palmeiras and Fluminense indicate that the perception of what a rout is, and its impact on the sentiment of the crowd, varies from team to team. Further studies need to be conducted to validate this.

4.2. Second order models

We compared the results of the best first order model (M3) with its second order extension. Table 5 presents the results of the tests conducted for the second order HMM (HMM2) and compares them to the first order HMM (HMM1).

Table 5
Results from sentiment inference by First (HMM1) and second order HMM (HMM2)

Team	HMM1	HMM2
Corinthians	22.72%	77.72%
Palmeiras	60.46%	23.25%
São Paulo	50.0%	47.61%
Santos	44.68%	53.19%
Flamengo	50%	52.0%
Vasco	64.44%	77.77%
Fluminense	32.55%	53.48%
Botafogo	60.89%	32.60%

There is a significant improvement in the inferences made by HMM2 compared to HMM1. Only the inferences made for the sentiment of Palmeiras’ and Botafogo’s fans did not improve (we elaborate on this matter further below). We observed that, as the order of the Markov model increases, the number of examples (with respect to transitions between states and observations) available to build the model decreases. In the second order model, some transitions in the transition matrix and some probabilities in the observation matrix remain null. Further tests with more data are necessary to reach definitive conclusions on whether increasing the order will always increase the accuracy of the model.

4.3. Comparison with classification algorithms

The accuracy of HMM was compared with two classification algorithms: Support Vector Machine [5] and Naïve Bayes [11]. The classification algorithms were executed from the Weka framework [7]. The training and testing base was set up in the following manner: the classes consisted of the six sentiments, which represent the vote of team supporters on their feelings toward their team. The set of attributes that compose an example is:

Result (or results) of the match (or matches) which assumes values lost, tied or won;

If the team is favorite or non-favorite in the match;

The level of favoritism or of non-favoritism of a team, which may consist of favorite, non-favorite, most favorite, most non-favorite or neither.

Different tests were conducted with SVM and Naïve Bayes, in which we varied the number of matches in each training example between one and two. Our aim was to allow for a fair comparison with first and second order HMM.

Table 6
Results for sentiment inference using SVM with one and two matches (SVM1, SVM2), first and second order HMM (HMM1, HMM2), Naïve Bayes with one and two matches (NB1, NB2) and the baseline (BL) algorithm previously described

Team	SVM1	HMM1	NB1	SVM2	NB2	HMM2	BL
Corinthians	19.04%	22.72%	30.95%	30%	10%	77.72%	38.63%
Palmeiras	57.14%	60.46%	59.52%	52.38%	61.9%	23.25%	46.51%
São Paulo	27.5%	50.0%	37.5%	47.36%	21.05%	47.61%	21.42%
Santos	33.3%	44.68%	45.83%	58.33%	58.33%	53.19%	30.61%
Flamengo	22.44%	50%	36.73%	33.33%	37.5%	52.0%	46.0%
Vasco	13.63%	64.44%	25%	45.45%	27.27%	77.77%	15.21%
Fluminense	35.71%	32.55%	40.47%	47.61%	28.57%	53.48%	46.51%
Botafogo	55.55%	60.89%	53.33%	63.63%	50%	32.60%	23.91%
Average	33.03%	48.21%	41.15%	47.24%	36.8%	51.20%	33.58%

Table 6 shows the results of all models. Variations of SVM and Naïve Bayes did not present significant differences.

Among the tests performed with HMM, SVM, Naïve Bayes and the baseline algorithm, the most accurate results were obtained with the HMM2 model. This indicates that the historical information of results and the temporal evolution of sentiment are important to obtain better inferences in this scenario. However, the inferences made for the sentiment of Palmeiras and Botafogo fans did not improve when compared to HMM1, which required a more thorough analysis of the FC data. The reason for this is discussed in the next section.

5. Discussion

5.1. Bias towards good moments

From the data we observed that FC fans use the functionality “Fan Sentiment” more frequently when their team has a positive result. This means that, when a team loses, the fan does not express frustration with the same frequency as he expresses happiness. This is probably due to the fact that FC is a social network, and interacting with friends and acquaintances from the community after a bad result is not particularly pleasant.

Table 7
Percentage of votes of fans in FootyCrowd after a match

Teams	Votes on victory	Votes on tie	Votes on defeat
Corinthians	51%	34%	13%
Palmeiras	41%	34%	24%
São Paulo	42%	23%	34%
Santos	40%	33%	26%
Flamengo	38%	28%	32%
Vasco	37%	38%	25%
Fluminense	38%	28%	32%
Botafogo	55%	24%	21%

Table 7 shows that, for all eight teams considered except Vasco, fans voted most often after a victory. Even for Vasco, the shares of votes after a victory and after a tie were very close. For Corinthians, for example, 51% of votes occurred after a victory, 34% after a tie and 13% after a defeat. In such cases, the difference between the number of votes after a victory and after a defeat is fairly large (e.g. Corinthians, Palmeiras and São Paulo).

These observations show that the learnt models tend to favor transitions between positive sentiments, assigning larger probabilities to positive phases. We counted the votes for positive phases (great and good) and negative phases (worry, sad, bad and terrible) and found that more than twice as many votes fall in positive sentiments (27,579) as in negative ones (11,787).

As previously explained, increasing the order of the Markov model decreases the number of examples available to build the model and, since the states with more transitions are the ones that involve positive sentiments, it is presumed that the second-order model could be even more biased towards positive sentiment. Therefore, predictions of the sentiment of fans whose team's test data contain more negative sentiments will have lower accuracy.

This assumption made us investigate more closely the relation between positive/negative sentiment and the inference results achieved with HMM2. Palmeiras and Botafogo, the teams with the lowest accuracy in inferences made with the second-order model, are the only teams that presented more negative than positive sentiments across the observations in the test dataset. Table 8 shows those numbers for all teams.

Table 8

Amount of positive and negative sentiments for each team

Of the 42 matches from FC's database for Palmeiras that were tested with Viterbi on the HMM2 models, 20 are marked as positive sentiment and 22 as negative sentiment. Botafogo has 14 positive votes and 32 negative ones. By contrast, for Corinthians, for which the HMM2 model had the highest accuracy rate, 43 matches were tested as Viterbi observations; 38 of them are marked as positive sentiment and only 5 as negative.
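The positive/negative tally behind Table 8 can be sketched in a few lines of Python. The grouping of the six sentiments into positive and negative follows the text; the function names and the toy sequence are illustrative assumptions, not the FootyCrowd data.

```python
# Grouping of the six sentiments as described in Section 5.1.
POSITIVE = {"great", "good"}
NEGATIVE = {"worry", "sad", "bad", "terrible"}


def polarity_counts(sentiments):
    """Return (number of positive labels, number of negative labels)."""
    positive = sum(s in POSITIVE for s in sentiments)
    negative = sum(s in NEGATIVE for s in sentiments)
    return positive, negative


def has_negative_majority(sentiments):
    """True when a team's test sequence is dominated by negative sentiment."""
    positive, negative = polarity_counts(sentiments)
    return negative > positive


# Invented test sequence for one team (one majority sentiment per match).
toy_team = ["great", "good", "sad", "worry", "bad", "terrible", "good"]
print(polarity_counts(toy_team), has_negative_majority(toy_team))
```

A check of this kind flags in advance the teams for which a model biased towards positive transitions is likely to perform poorly.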

5.2. Tests performed removing data from the analyzed team

We also performed tests in which the sentiment votes and match results of the team whose sentiment would be inferred for the 2014 season were removed from the construction of the initial model. This means that, in order to infer the sentiment of Corinthians fans for a given period, the initial model with data from 2012 and 2013 was created without considering a single sentiment vote made by Corinthians fans. The goal was to verify how much the inferences obtained from the model depended on sentiment data provided by the fans of that team themselves.

Table 9
Results from sentiment inference for models with (W) and without (WO) influence of the team with inferred sentiment

Teams	HMM2(W)	HMM2(WO)
Corinthians	77.72%	77.72%
Palmeiras	23.25%	34.88%
São Paulo	47.61%	47.61%
Santos	53.19%	51.06%
Flamengo	52.0%	50%
Vasco	77.77%	77%
Fluminense	53.48%	53.48%
Botafogo	32.60%	39.13%

Table 9 shows no substantial decline in results: accuracy is unchanged or higher for five of the eight teams and falls by at most about two percentage points for the others. This indicates that the Markov model might be applied to teams not included in the training data in order to infer the crowd sentiment of their fans.
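The hold-out procedure of Section 5.2 amounts to a simple filter over the match records. The sketch below assumes each record is a dictionary with hypothetical `team`, `season` and `sentiment` keys; the actual FootyCrowd schema is not described in the text.

```python
def split_without_team(records, held_out_team, test_season=2014):
    """Train on earlier seasons of all other teams; test on the held-out team."""
    train = [r for r in records
             if r["team"] != held_out_team and r["season"] < test_season]
    test = [r for r in records
            if r["team"] == held_out_team and r["season"] == test_season]
    return train, test


# Invented records for illustration.
records = [
    {"team": "Corinthians", "season": 2012, "sentiment": "good"},
    {"team": "Corinthians", "season": 2014, "sentiment": "great"},
    {"team": "Santos", "season": 2013, "sentiment": "worry"},
    {"team": "Santos", "season": 2014, "sentiment": "good"},
    {"team": "Vasco", "season": 2012, "sentiment": "sad"},
]

train, test = split_without_team(records, "Corinthians")
print(len(train), len(test))
```

Repeating the split once per team gives the HMM2(WO) column; the HMM2(W) column corresponds to dropping the team filter on the training side.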

6. Taking into consideration championship differences

So far, the sentiment of fans was analyzed based on the entire dataset of matches available on FootyCrowd. We considered the results from the National Championship (Serie A), the Brazilian Cup and two regional championships (one for the region of Rio de Janeiro and another for the region of São Paulo). In these tournaments, teams qualify in different ways. The Brazilian Cup is a single-elimination knockout tournament featuring two-legged ties played by 86 teams, representing all 26 Brazilian states plus the Federal District. In the first two rounds, if the away team wins the first match by two or more goals, it progresses straight to the next round, skipping the second leg. The Brazilian Cup uses the away goals rule, which states that the team that has scored more goals “away from home” wins if scores are otherwise equal.

It is worth noting that in this kind of tournament advancing to the next phase is more important than a single victory. Consequently, fans may be happy even after a defeat.

The National Championship of Serie A, or Brasileirão, has 20 clubs. During the course of a season (from May to December) each club plays the others twice (a double round-robin system), once at its home stadium and once at that of its opponent, for a total of 38 games. Teams receive three points for a win and one point for a draw. No points are awarded for a loss. Teams are ranked by total points, victories, goal difference and goals scored. At the end of each season, the club with the most points is crowned champion. A system of promotion and relegation exists between Serie A and Serie B. The four lowest-placed teams in Serie A are relegated to Serie B, and the top four teams from Serie B are promoted to Serie A. Also, the top four of Serie A qualify for the Libertadores Cup (a continental tournament). Football fans created their own way of following the league table. They consider that there are three regions: the G4, involving those disputing the top four places; the Z4, involving those who want to leave the last four positions; and the intermediary region, involving the teams in between the two zones.
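The ranking rule described above (points, then victories, then goal difference, then goals scored) maps directly onto a composite sort key. The following Python sketch uses invented clubs and a hypothetical `Club` class; it ignores any further tie-breakers the competition may apply.

```python
from dataclasses import dataclass


@dataclass
class Club:
    name: str
    wins: int
    draws: int
    goals_for: int
    goals_against: int

    @property
    def points(self):
        # Three points for a win, one for a draw, none for a loss.
        return 3 * self.wins + self.draws

    @property
    def goal_difference(self):
        return self.goals_for - self.goals_against


def rank(clubs):
    """Order clubs by points, victories, goal difference and goals scored."""
    return sorted(
        clubs,
        key=lambda c: (c.points, c.wins, c.goal_difference, c.goals_for),
        reverse=True,
    )


# Invented standings: three clubs tied on 35 points, one on 39.
clubs = [
    Club("Club A", wins=10, draws=5, goals_for=30, goals_against=20),
    Club("Club B", wins=11, draws=2, goals_for=28, goals_against=25),
    Club("Club C", wins=10, draws=5, goals_for=33, goals_against=22),
    Club("Club D", wins=12, draws=3, goals_for=25, goals_against=15),
]

table = rank(clubs)
print([c.name for c in table])
```

In a 20-club table ranked this way, the first four positions form the G4 and the last four the Z4.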

At the state level, several independent championships exist. The most important ones are those of São Paulo (typically with 20 clubs) and Rio (with only 16 clubs). They may include obscure formats or experiment with proposed innovations in rules, which can influence the perception of fans. Typically, the Brazilian Cup, the regional tournaments and the Libertadores Cup are played simultaneously. Tracking the sentiment of the fans in this context is challenging because performance in different championships, each with different qualification rules, depends on many hidden variables that are not present in the data. We hypothesize that this might be one of the reasons for the high variability of the results of our models. We therefore decided to investigate fan sentiment considering only the results from matches of Serie A, the longest championship in the country (eight months long). We also decided to represent, in the model, information about the position of the club in the overall standings.

A second-order Markov model (named here HMM2Z) was created with observations that consider the three previously mentioned zones. We created three categories. The first groups the clubs that are at least three points away from G4. The second groups those that are three points away from Z4, and the last comprises the rest of the teams (intermediary zone). Table 10 presents the observations of the HMM2Z model.

Table 10
Observations from the combination of favoritism and the zones of Serie A

Win very favorite G4	Win very favorite intermediary	Win very favorite Z4
Win favorite G4	Win favorite intermediary	Win favorite Z4
Win very dark horse G4	Win very dark horse intermediary	Win very dark horse Z4
Win dark horse G4	Win dark horse intermediary	Win dark horse Z4
Win G4	Win intermediary	Win Z4
Defeat very favorite G4	Defeat very favorite intermediary	Defeat very favorite Z4
Defeat favorite G4	Defeat favorite intermediary	Defeat favorite Z4
Defeat very dark horse G4	Defeat very dark horse intermediary	Defeat very dark horse Z4
Defeat dark horse G4	Defeat dark horse intermediary	Defeat dark horse Z4
Defeat G4	Defeat intermediary	Defeat Z4
Tie very favorite G4	Tie very favorite intermediary	Tie very favorite Z4
Tie favorite G4	Tie favorite intermediary	Tie favorite Z4
Tie very dark horse G4	Tie very dark horse intermediary	Tie very dark horse Z4
Tie dark horse G4	Tie dark horse intermediary	Tie dark horse Z4
Tie G4	Tie intermediary	Tie Z4
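The observation alphabet of Table 10 is the Cartesian product of three match results, five favoritism levels (including "neither", which appears as an empty label) and three zones, giving 45 symbols. A minimal Python sketch that enumerates them is shown below; the variable names are illustrative and capitalization is normalized to lower case for the favoritism labels.

```python
from itertools import product

RESULTS = ["Win", "Defeat", "Tie"]
# The empty string stands for "neither favorite nor dark horse".
FAVORITISM = ["very favorite", "favorite", "very dark horse", "dark horse", ""]
ZONES = ["G4", "intermediary", "Z4"]


def observation_symbols():
    """Enumerate every result/favoritism/zone combination as a label."""
    symbols = []
    for result, favoritism, zone in product(RESULTS, FAVORITISM, ZONES):
        parts = (result, favoritism, zone)
        symbols.append(" ".join(p for p in parts if p))
    return symbols


# Map each label to a column index of the HMM observation matrix.
SYMBOL_INDEX = {label: i for i, label in enumerate(observation_symbols())}
print(len(SYMBOL_INDEX))
```

Tripling the alphabet in this way also spreads the same training data over three times as many observation probabilities, which is worth keeping in mind when comparing HMM2Z with HMM2.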

The dataset used to create the models contains matches from June to December of 2012, 2013 and 2014. The test set refers to the same period of 2015. Table 11 compares the results of HMM2Z and SVM using the information about the three classification zones. HMM2 without this information is also displayed.

Table 11

Results comparing HMM2 to HMM2Z as well as comparing to SVM after the addition of the feature representing the team classification in the championship

Teams	HMM2Z	HMM2	SVM
Corinthians	73.52%	85.29%	25.0%
Palmeiras	58.82%	20.58%	70.0%
São Paulo	20.58%	50.0%	18.75%
Santos	47.05%	55.88%	68.75%
Flamengo	58.82%	58.82%	62.5%
Vasco	58.82%	55.88%	46.87%
Fluminense	38.23%	52.94%	56.66%
Botafogo	82.35%	17.64%	68.75%
Average	54.64%	49.63%	52.0%
Standard Deviation	0.1812	0.2035	0.1896

In terms of accuracy, the HMM2Z model did not consistently improve on HMM2, which does not take into account the position of the teams in the league: the average rose from 49.63% to 54.64%, but accuracy fell for four of the eight teams. Most striking was the fact that SVM started to reach levels of accuracy similar to those of HMM. Our interpretation is that information on the position of the team in the championship carries semantics that correlate with the sentiment of the fans. That is, if a team is in the Z4 zone, it is because it has suffered several defeats and the crowd already had a negative sentiment. The same holds for the G4 zone with wins and positive feelings. This is exactly the information that is inherent in the HMM and that was not previously available to the classification algorithms. In other words, it seems that classification is competitive in terms of accuracy if some historical information is represented in the examples.

To confirm our intuition, we ran another test, this time considering only matches that took place in the first half of the year, when many championships take place concurrently and no historical information was available. The tests took into account the information about favoritism and the match results, using the HMM2 model described above. However, we used a dataset of matches played between January 1st and May 31st of 2012, 2013 and 2014. The test dataset refers to the same period of 2015.

The results (see Table 12) confirm our intuition. Comparatively, the accuracy of SVM decreased. On the other hand, the variability of the results increased. HMM2 seems to be more robust because it is less sensitive to situations in which historical information is unavailable.

Table 12

Comparison of HMM2 with SVM using data with first-semester results

Teams	HMM2	SVM
Corinthians	100%	12.5%
Palmeiras	94.73%	55.56%
São Paulo	52.94%	12.5%
Santos	52.63%	22.22%
Flamengo	82.35%	75%
Vasco	42.10%	33.33%
Fluminense	17.64%	12.5%
Botafogo	63.15%	77.77%
Average	63%	37.67%
Standard Deviation	0.2610	0.2617
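The summary rows of Table 12 can be recomputed from the per-team values. The sketch below assumes that the reported standard deviation is the population standard deviation taken over accuracies expressed as fractions; this is an inference from the numbers, not something the text states.

```python
from statistics import mean, pstdev

# Per-team accuracies (in %) copied from Table 12.
HMM2 = [100, 94.73, 52.94, 52.63, 82.35, 42.10, 17.64, 63.15]
SVM = [12.5, 55.56, 12.5, 22.22, 75, 33.33, 12.5, 77.77]


def summarize(percentages):
    """Return (mean in %, population standard deviation as a fraction)."""
    fractions = [p / 100 for p in percentages]
    return mean(percentages), pstdev(fractions)


hmm2_mean, hmm2_std = summarize(HMM2)
svm_mean, svm_std = summarize(SVM)
print(f"HMM2: mean {hmm2_mean:.2f}%, std {hmm2_std:.4f}")
print(f"SVM:  mean {svm_mean:.2f}%, std {svm_std:.4f}")
```

Under that assumption the recomputed values agree with the table to about three decimal places, and they show that the two models have nearly identical spread despite very different averages.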

7. Conclusion

This research investigated a database of sentiments expressed by fans of soccer teams in Brazil. We proposed a predictive model based on Hidden Markov Models because of their stochastic characteristics, which describe a process that operates over a long period of time. The hypothesis that a sentiment is formed over time, and not just by a single observation, is supported by comparing results from HMM with classifiers such as Support Vector Machine and Naïve Bayes. A significant improvement in results was also observed when a second-order model was used. This reinforced the hypothesis that current sentiments depend on previous states.

Variations of the Markov models show that the accuracy rate improves when observations are capable of expressing wins or losses as well as the favoritism of the opponents. Another important finding is that sentiment evolution over time also needs to be represented.

We found indications that the feeling of the fans behaves differently during the year. In the first half, when the championships take place in parallel and are based on different rules, inferences show greater variability. During the second half, when a long-term championship occurs, the variability is lower. On both occasions HMM seems to represent the feeling of the fans well, although in the second half classification using the team's ranking information in the championship also presents itself as an appropriate model.

The limitations of our approach guide us towards future investigations. In all our tests, model validation was based on comparing the feeling with the most votes in FootyCrowd with the feeling that has the highest probability of being inferred by the Viterbi algorithm. Investigating new forms of assessment is part of our intentions for future work. In particular, we believe that the comparison can be made over the whole probability distribution. That is why we are investigating how to represent this problem as an HMM with continuous states. Instead of inferring the sentiment represented by the majority of the votes, we want to infer a histogram that represents the distribution of fans’ sentiment.
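The validation criterion used throughout the paper, namely comparing the most-voted sentiment with the decoded one, can be written as a short Python sketch. The vote lists and predictions below are invented, and the function names are illustrative; the decoding step itself (Viterbi) is assumed to have been run already.

```python
from collections import Counter


def majority_sentiment(votes):
    """Return the sentiment with the most votes for one match."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(votes_per_match, predicted):
    """Share of matches where the decoded sentiment equals the majority vote."""
    hits = sum(
        majority_sentiment(votes) == label
        for votes, label in zip(votes_per_match, predicted)
    )
    return hits / len(predicted)


# Invented fan votes after three matches and the sentiments a model decoded.
votes_per_match = [
    ["good", "good", "great"],
    ["sad", "bad", "sad"],
    ["worry", "good", "worry"],
]
predicted = ["good", "sad", "good"]

print(f"{accuracy(votes_per_match, predicted):.2%}")
```

This criterion discards everything about the vote distribution except its mode, which is the limitation that motivates inferring a full histogram instead.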

The bias towards good moments observed in the dataset is also an important direction for future work. We think that this problem is similar to class imbalance in supervised learning methods. One alternative is to apply under- or oversampling procedures to the data in order to obtain more accurate model estimates.

References

1. Atlas and Y.-Q. Zhang, Fuzzy neural web agents for efficient NBA scouting, Web Intelligence and Agent Systems: An International Journal 6 (2008), 83–91.

2. Bollen, Mao and Zeng, Twitter mood predicts the stock market, Journal of Computational Science 2 (2011), 1–8. doi:10.1016/j.jocs.2010.12.007.

3. Boodidhi, Using smoothing techniques to improve the performance of Hidden Markov’s Model, Master Dissertation, University of Nevada, 2011.

4. Borge-Holthoefer, Magdy, Darwish and Weber, Content and network dynamics behind Egyptian political polarization on Twitter, in: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work and Social Computing, 2014, pp. 1–30.

5. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (1998), 121–167. doi:10.1023/A:1009715923555.

6. Gao, Wei, Li, Liu and Zhou, Co-training based bilingual sentiment lexicon learning, in: Workshops at the Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013, pp. 26–28.

7. Hall, Frank, Holmes, Pfahringer, Reutemann and Witten, The WEKA data mining software: An update, SIGKDD Explor 11 (2008), 10–18. doi:10.1145/1656274.1656278.

8. Jiang, Meng and Yu, Topic sentiment change analysis, in: Proceedings of International Conference on Machine Learning and Data Mining, 2011, pp. 443–457.

9. Ko, Yeo, Lee, Lee and Y.J. Jang, What makes sports fans interactive? Identifying factors affecting chat interactions in online sports viewing, PLOS ONE 11 (2016).

10. Kozareva, Navarro, Vazquez and Montoyo, Ua-zbsa: A headline emotion classification through web information, in: Proceedings of the Fourth International Workshop on Semantic Evaluations, 2007, pp. 335–337.

11. D.D. Lewis, Naive (Bayes) at forty: The independence assumption in information retrieval, in: ECML’98: Tenth European Conference on Machine Learning, 1998.

12. K.H.-Y. Lin and H.-H. Chen, Ranking reader emotions using pairwise loss minimization and emotional distribution regression, in: Proceedings Conference on Empirical Methods in Natural Language Processing, 2008, pp. 136–144.

13. Liu, Li and Guo, Emoticon smoothed language models for Twitter sentiment analysis, in: Proceedings of AAAI Conference on Artificial Intelligence, 2012, pp. 1678–1684.

14. J.-F. Mari, J.-P. Haton and Kriouile, Automatic word recognition based on second-order hidden Markov models, in: Proceedings of IEEE Transactions on Speech and Audio Processing, Vol. 5, 1997, pp. 22–25.

15. Mei, Ling, Wondra, Su and Zhai, Topic sentiment mixture: Modeling facets and opinions in weblogs, in: Proceedings of the International Conference on World Wide Web, 2007, pp. 171–180.

16. Mishne and Glance, Predicting movie sales from blogger sentiment, in: AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, 2006, pp. 155–158.

17. S.M. Mohammad, Kiritchenko and Zhu, NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets, in: Proceedings of the Seventh International Workshop on Semantic Evaluation Exercises, Vol. 2, 2013, pp. 321–327.

18. O’Connor, Balasubramanyan, B.R. Routledge and N.A. Smith, From tweets to polls: Linking text sentiment to public opinion time series, in: Proceedings of the International AAAI Conference on Weblogs and Social Media, 2010.

19. Ortigosa, J.M. Martín and R.M. Carro, Sentiment analysis in Facebook and its application to e-learning, Computers in Human Behavior 31 (2014), 527–541. doi:10.1016/j.chb.2013.05.024.

20. Palanisamy, Yadav and Elchuri, Serendio: Simple and practical lexicon based approach to sentiment analysis, in: Proceedings of International Workshop on Semantic Evaluation, Atlanta, 2013.

21. Paltoglou and Thelwall, A study of information retrieval weighting schemes for sentiment analysis, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 1386–1395.

22. L.R. Rabiner and B.H. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine 3 (1986), 4–16. doi:10.1109/MASSP.1986.1165342.

23. Vanzo, Croce and Basili, A context-based model for sentiment analysis in Twitter, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, 2014, pp. 2345–2354.

24. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Transactions on Information Theory 13 (1967), 260–269. doi:10.1109/TIT.1967.1054010.

25. Zhao, Dong, Wu and Xu, Moodlens: An emoticon-based sentiment analysis system for Chinese tweets, in: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 1528–1531.

26. Zhu, Ge, Chen and Liu, Tracking the evolution of social emotions: A time-aware topic modeling perspective, in: Proceedings of IEEE International Conference on Data Mining, 2014, pp. 697–706.
