Life aspect inference of tweets based on probability distribution

Abstract

Many people share their daily events and opinions on Twitter. Some tweets are beneficial to others and are related to such aspects of a user’s real life as eating, traffic conditions, and weather. In this paper, we propose a method that infers the real-life aspect distribution of tweets using labeled tweets. Our method infers the aspect probability distributions by a hierarchical estimation framework (HEF), which is hierarchically composed of both unsupervised and supervised machine learning methods. In the first phase, it extracts topics from a sea of tweets using Latent Dirichlet Allocation (LDA). In the second phase, it builds associations between topics and real-life aspects using a small set of labeled tweets. The probability distribution of aspects is inferred using the associations based on the bag of terms extracted from unknown tweets. Our experimental evaluations with a large number of actual tweets demonstrate the effectiveness and robustness of our inference method. In particular, in the case of single-label training, HEF showed significantly lower Jensen-Shannon divergence (JSD) values than the baseline methods: Naive Bayes, SVM, and L-LDA.

Keywords

Twitter, real life, hierarchical estimation framework, probability distribution inference, t-test

1. Introduction

Twitter (http://twitter.com), which is one of the most popular social media services, had 200 million active users per month at the end of September 2013 [22]. Since Twitter only permits users to post short sentences of up to 140 characters, users can easily post their experiences and opinions about daily events.

Thus, Twitter posts are often both useful and timely because they typically discuss current events. For example, tweets about traffic jams or traffic accidents are quite valuable for users who will pass those places. Supermarket sales and bargain information are also helpful for neighborhood consumers. Such tweets, which are highly regional, up-to-date, and beneficial to others, are called real-life tweets. We classified such tweets into 14 aspects to present them based on user contexts. The 14 aspects (Table 1), which are assumed to be the life aspects of users, were defined with reference to the Yahoo directory (http://business.yahoo.com) and some other web portals. A tweet mentioning a traffic accident is labeled with the Traffic aspect, and one about supermarket sales and bargain information is labeled with the Expense aspect.

Table 1

Aspects of real life

Aspect	Abbreviation	Typical terms
Appearance	(App.)	clothes, dressing, wearing, fashion, uniforms, kimono, decoration, makeup, haircuts …
Contact	(Con.)	appointments, meetings, invitations, family, friends, parties, drinking parties, get-togethers …
Disasters	(Dis.)	flood, tornados, earthquakes, seismic ocean waves, power loss, hazards, secondary disasters …
Eating	(Eat.)	cooking, dining out, eating, restaurants, recipes, ingredients …
Events	(Eve.)	festivals, ceremonies, projects, schedules of events, conferences, special days, art shows …
Expense	(Exp.)	shopping, orders, advertisements, discounts, bargains, markets, sales, purchases …
Health	(Hea.)	colds, physical condition, aches and pains, hospital, health management method, medicine …
Hobbies	(Hob.)	leisure-time, pastime, entertainment, hobbies, interest, games, music, television, movies …
Living	(Liv.)	home, lodgings, furniture, cleaning, doing laundry, living, apartment, accommodation …
Locality	(Loc.)	sightseeing, regionally specific, local information ...
School	(Sch.)	study, class, examinations, education, research, homework, coursework, cancellation of lectures …
Traffic	(Tra.)	trains, buses, airplanes, timetables, traffic information, clogs, roads, traffic jams, accidents …
Weather	(Wea.)	weather forecasts, temperature, humidity, hail, rain, thunder, sky, air, wind, pollen …
Work	(Wor.)	job hunting, part-time jobs, coursework, opening a store, closing a business, job, employment …

The Great East Japan Earthquake Disaster, which occurred in March 2011 (https://en.wikipedia.org/wiki/2011_Tohoku_earthquake_and_tsunami), provides an example of the benefits of real-life tweets. There was a great amount of confusion in the stricken area immediately following the earthquake. There was a lack of food, suspension of water supply, and train service cancellations. At that time, useful tweets reported the location of water supplies and food distribution sites, as well as the service status of trains, demonstrating that such real-life tweets helped the users in the devastated region [28].

Depending on the tweet, we might have to assign several aspects to it. For example, a tweet such as “A heavy snowstorm caused a traffic accident near the JFK airport” mentions heavy snowfall and a traffic accident. Its main topic is the traffic accident, but it also provides weather information. Therefore, we label it as both Traffic and Weather. In our previous research, we proposed a hierarchical estimation framework (HEF) to estimate multiple aspects of unknown tweets [29] and showed in experimental evaluations that two to five aspects are generally estimated for actual tweets. Because real-life aspects are related, many aspects appear together in tweets. Indeed, in the above tweet, the traffic accident was caused by severe weather. Thus, real-life tweets are likely to include various aspects.

An approach that estimates several aspects of a tweet can provide real-life information for specific users. Exhaustive-oriented users might expect broad information that includes the specific aspects. In contrast, accuracy-oriented users might desire strictly selected real-life information on specific aspects. When we visit sightseeing locations, for example, we want information about them. Multi-label classification methods [29] cannot distinguish tightly associated aspects from loosely associated ones because they treat all the estimated aspects with the same weight.

In this paper, we propose an inference method of the real-life aspect distribution of tweets. The aspect distribution is represented by a probability distribution for each tweet. Accurately inferring the probability distribution of the aspects means supporting either the strict or broad associations between tweets and aspects. As an inference method of probability distribution, we extend HEF, which is composed of both unsupervised and supervised machine learning techniques. In the first phase, it extracts topics from a sea of tweets using Latent Dirichlet Allocation (LDA). In the second phase, it calculates the relevance scores between topics and aspects using a small set of labeled tweets to build associations among them. Aspect scores for unknown tweets are calculated using the associations between topics and aspects based on the terms extracted from them. As the most important feature in this paper, we propose an optimal association building method based on a t-test, which is an efficient strategy to manage the relationships between topics and aspects. In this paper, as in the training model of a typical classification method, we assume that the training data are given as labels, not as probability distributions of the aspects. Our challenge is to train from labeled tweets and infer the probability distribution of the aspects of unknown tweets based on a natural extension of HEF.

The remainder of our paper is organized as follows. In Section 2, related works are discussed. In Section 3, we explain the details of the extended HEF mechanism that infers the probability distribution of aspects. In Section 4, the experimental evaluations for inferring the probability distributions are described, including JS divergence and Euclidean distance. In Section 5, we discuss the effectiveness of our inference method. We conclude the paper and briefly describe future works in Section 6.

2. Related works

2.1. Information extraction from Twitter

The study of information extraction from Twitter is flourishing. Sakaki et al. [19] assumed that Twitter users act as sensors that discover events occurring in real time in the real world. Mathioudakis et al. [13] extracted bursty keywords from automatically collected tweets and found trends that fluctuated in real time by grouping keywords based on their co-occurrence. Zhao et al. [32] extracted tweets about information needs using a Support Vector Machine (SVM) to discover real-world trends and events. Wang et al. [24] estimated user interests from posted tweets to discover effective users for tweet diffusion. Rajadesingan et al. [16] detected sarcasm on Twitter to help companies improve their customer service. They introduced psychological studies and the sentiment scores of terms into their modeling framework to detect sarcasm. Bollen et al. [12] analyzed sentiment on Twitter based on a six-dimensional mood representation (tension, depression, anger, vigor, fatigue, and confusion) and determined that the mood on Twitter correlates with such real-world values as stock prices and coincides with cultural events. In this paper, we infer the real-life aspect distribution of unknown tweets.

2.2. Extracting information related to user’s life

Many studies have extracted information that is beneficial for the lives of users. Extracting traffic information from social media has been particularly widely studied. Sakaki et al. [20] extracted real-time driving information from social media to provide current traffic situations to users. Their system incorporated geographically related terms into geographical coordinates. When the number of tweets satisfying their three defined rules exceeded their threshold, they judged that a target railway was running late. Aramaki et al. [1] predicted influenza epidemics using Twitter. They extracted tweets related to influenza using an SVM trained on tweets that literally mention influenza patients. Ishino et al. [9] extracted information on both souvenirs and tourist spots as travel information from travel blog entries. Moreover, they built a collection of travel information links by extracting hyperlinks from travel blog entries.

Although these studies extract beneficial information in particular life aspects, our research concurrently estimates several aspects of unknown tweets based on probability distribution.

2.3. Topic model

Topic model studies widely use LDA [2], which is a latent topic extraction method devised as a probabilistic topic model. LDA supposes that a document has a mixture distribution of multiple topics, each of which is expressed by a probability distribution over terms. Zhao et al. [31] proposed a model called Twitter-LDA, based on the hypothesis that one tweet expresses one slice of a topic’s content. They classified tweets by topics and extracted keywords to express their contents. Diao et al. [6] detected bursty topics using Time-User-LDA, which is an extension of LDA. They evaluated the accuracy of topic detection among three LDA models and showed that Time-User-LDA detects topics with the highest accuracy.

Topic models are being applied to many studies. Zhang et al. [30] recommended bands and musicians to music lovers using LDA by calculating the degree of artist similarity based on generated topics. Riedl et al. [18] found the change-points of topics using LDA by calculating the similarity between sentences that express topic frequency vectors. Ma et al. [11] automatically annotated hashtags to tweets. Their PLSA-style models include user, time, and tweet content factors and achieved higher precision than other methods. Based on these previous studies, we applied LDA for building associations between aspects and topics as a part of HEF.

2.4. Multi-label classification

Widely known multi-label classification methods are based on SVM, Naive Bayes classifiers, and LDA. SVM, a classification method that performs supervised learning, has high generalization capability and classification performance [5]. Chang et al. [3] developed an SVM library called LIBSVM, which achieves multi-label classification by building models that combine several labels. A Naive Bayes classifier assumes that the term occurrences in a document are independent, and label probabilities are calculated from these terms using Bayes’ rule. It estimates the labels with the highest probability for a document [7]. Wei et al. [25] proposed multi-label classification based on Naive Bayes classifiers and estimated several labels whose probability exceeds the average score calculated over all the label probabilities. Ramage et al. [17] proposed a model called Labeled LDA (L-LDA) that extended LDA to supervised learning. To extract latent topics, it assumes that the labels are the contents of documents. L-LDA can extract a one-to-one correspondence between LDA’s latent topics and document labels. Kase and Miura [10] estimated new labels for an existing news corpus. They calculated the occurrence probability of each feature in each class from a multi-label dataset and estimated an additional label with high probability using an EM algorithm based on the multinomial mixture model.

These methods show high estimation performance for such long documents as blogs and newspapers when sufficient training data are available. However, tweets consist of fewer terms because their length averages 45 characters [14]. Moreover, as training data, fresh tweets are preferred because tweets are easily influenced by the real world. Under these conditions, conventional multi-label classification methods fail to produce adequate performance to estimate several aspects of unknown tweets.

2.5. Our approaches

In this paper, we propose an inference method of the probability distribution of real-life aspects for unknown tweets by extending HEF, which was proposed in our previous work [29], to estimate several aspects of unknown tweets. HEF is composed of two-phase training: First, many topics are extracted using unsupervised topic models. Second, associations between topics and aspects are built based on supervised learning using labeled data. In our previous work, the association building had to explore an optimal parameter in each aspect using tuning data to enhance estimation performance of aspects. The main extension in this paper is to automatically build the optimal associations between many topics and aspects.

3. Probability distribution inference

3.1. Overview of HEF

The hierarchical estimation framework (HEF), which is a multi-label classification model that we previously proposed [29] (Fig. 1), is composed of two phases in a hierarchical manner. In the first phase, many topics are extracted from a sea of tweets using LDA. In the second phase, associations between topics and aspects are constructed using a small set of labeled tweets. We calculate the aspect scores for unknown tweets using the associations based on the terms extracted from them. Unknown tweets are then labeled with the aspects whose scores exceed particular thresholds.

Fig. 1.

Hierarchical estimation framework.

Typical supervised machine learning methods directly calculate the term likelihood from labeled training data. Terms in unknown tweets that do not appear in the training data cannot play an effective role in the estimation by conventional methods. In contrast, HEF is composed of a three-level hierarchy: Tweet-Topic-Aspect. The terms in a tweet are expanded using co-occurring terms in appropriate topics. As a result, we showed that HEF can estimate several appropriate aspects from a small set of short labeled documents, i.e., tweets.

In this paper, we extend the second phase of HEF to infer the probability distribution of the aspects of unknown tweets and propose an optimal association building method to manage topics extracted by LDA.

3.2. Relevance calculation

To build associations between many topics and fewer aspects in the second phase of HEF, we prepare a small set of labeled tweets and calculate the relevance between topics t and aspects a as joint probability $p(a, t)$. Let W be the set of terms extracted from the labeled tweets. Relevance $p(a, t)$ is calculated as follows:

$$p(a, t) = \sum_{w \in W} p(w \mid a)\, p(w \mid t), \tag{1}$$

where $p(w \mid t)$ denotes the occurrence probability of term w in topic t, calculated beforehand by LDA. $p(w \mid a)$ denotes the occurrence probability of term w in aspect a and is calculated as follows:

$$p(w \mid a) = \frac{n_{w, a}}{\sum_{w' \in W} n_{w', a}}, \tag{2}$$

where $n_{w, a}$ denotes the number of occurrences of term w in the tweets labeled with aspect a. Note that this equation calculates the relevance between topics and aspects using only the occurrence probabilities.
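The relevance calculation of Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy tweets, and the two toy topic-term distributions are hypothetical, and in practice $p(w \mid t)$ would come from a trained LDA model.

```python
from collections import Counter


def term_prob_given_aspect(labeled_tweets, aspect):
    """Eq. (2): relative frequency of each term in the tweets labeled with `aspect`."""
    counts = Counter()
    for terms, aspects in labeled_tweets:
        if aspect in aspects:
            counts.update(terms)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def relevance(p_w_given_a, p_w_given_t):
    """Eq. (1): sum over terms of p(w|a) * p(w|t).

    Terms missing from either distribution contribute zero.
    """
    return sum(p * p_w_given_t.get(w, 0.0) for w, p in p_w_given_a.items())


# Hypothetical toy data: (terms, labels) pairs and two topic-term distributions.
labeled = [
    (["rain", "train", "delay"], {"Traffic", "Weather"}),
    (["train", "accident"], {"Traffic"}),
]
topic_terms = {
    "t0": {"train": 0.5, "delay": 0.3, "accident": 0.2},
    "t1": {"rain": 0.7, "wind": 0.3},
}

p_w_traffic = term_prob_given_aspect(labeled, "Traffic")
rel = {t: relevance(p_w_traffic, dist) for t, dist in topic_terms.items()}
print(rel)  # the train-related topic t0 is more relevant to Traffic than t1
```

In this toy case the Traffic aspect is more relevant to the train-related topic than to the rain-related one, although the rain topic still receives a non-zero score through the tweet labeled with both aspects.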

3.3. Association building

We build associations between topics and aspects based on relevance $p(a, t)$. Our approach assumes that each aspect consists of many topics. Since we consider that important topics for each aspect have high relevance, an effective strategy of association building connects topics to aspects based on the strength of the relevance. We arrange the topics in descending order of relevance for each aspect and divide them into two sets. Our purpose is to discover the dividing point at which the set of topics with high relevance differs most significantly from the set with low relevance. The set of topics with high relevance is our candidate for associations. To achieve this, we adopt the t value of Welch’s t-test [26], which is a statistical test of the difference between the means of two independent groups. When the value of Welch’s t-test exceeds a threshold, the two independent groups are significantly different.

Topic set $T_{a}$ for aspect a is given as follows:

$$T_{a} = \underset{T_{x} \subset T}{\operatorname{argmax}}\ \text{t-test}(T_{x}, T_{y} \mid a), \quad T_{y} = T \setminus T_{x}, \tag{3}$$

where $T_{x}$ denotes the set of topics with high relevance and $T_{y}$ denotes the complement of $T_{x}$ in the set T of all topics extracted by LDA. $T_{a}$ is the $T_{x}$ whose t-test value against $T_{y}$ is the highest among all the dividing points. Welch’s t-test is defined as follows:

$$\text{t-test}(T_{x}, T_{y} \mid a) = \frac{\mu_{x} - \mu_{y}}{\sqrt{\frac{\sigma_{x}^{2}}{|T_{x}|} + \frac{\sigma_{y}^{2}}{|T_{y}|}}}, \tag{4}$$

$$\mu_{i} = \frac{1}{|T_{i}|} \sum_{t \in T_{i}} p(a, t), \tag{5}$$

$$\sigma_{i} = \sqrt{\frac{1}{|T_{i}|} \sum_{t \in T_{i}} \left(p(a, t) - \mu_{i}\right)^{2}}, \tag{6}$$

where $|T_{x}|$ and $|T_{y}|$ denote the numbers of topics in $T_{x}$ and $T_{y}$.
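The search for the dividing point can be sketched as below. The sketch is illustrative only: it uses the standard form of Welch's statistic (variance divided by group size) together with the $\frac{1}{|T_i|}$ normalization of Eq. (6). The function names, the toy relevance values, and the handling of the zero-variance case are assumptions that do not come from the paper.

```python
import math


def welch_t(high, low):
    """Welch's t value between two groups, with the 1/|T_i| variance of Eq. (6)."""
    def mean_var(xs):
        mu = sum(xs) / len(xs)
        return mu, sum((x - mu) ** 2 for x in xs) / len(xs)

    mu_x, var_x = mean_var(high)
    mu_y, var_y = mean_var(low)
    denom = math.sqrt(var_x / len(high) + var_y / len(low))
    if denom == 0.0:
        # Assumption: with zero variance in both groups, treat distinct means
        # as a perfect split and equal means as no split at all.
        return math.inf if mu_x > mu_y else 0.0
    return (mu_x - mu_y) / denom


def select_topics(relevance_by_topic):
    """Eq. (3): try every dividing point of the relevance-sorted topics."""
    ranked = sorted(relevance_by_topic, key=relevance_by_topic.get, reverse=True)
    scores = [relevance_by_topic[t] for t in ranked]
    best_k, best_t = len(ranked), -math.inf
    for k in range(1, len(ranked)):  # both groups must be non-empty
        t_val = welch_t(scores[:k], scores[k:])
        if t_val > best_t:
            best_k, best_t = k, t_val
    return set(ranked[:best_k]), best_t


# Hypothetical relevance values p(a, t) of six topics for one aspect.
toy_relevance = {"a": 0.9, "b": 0.8, "c": 0.85, "d": 0.1, "e": 0.12, "f": 0.08}
associated, t_value = select_topics(toy_relevance)
print(sorted(associated), round(t_value, 2))
```

Because the topics are sorted by relevance, only $|T| - 1$ dividing points need to be tested per aspect, rather than all subsets of T.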

Here, we focus on the relationship between topics and aspects. As mentioned above, each aspect consists of many topics. The importance of topic t in each aspect a is calculated as conditional probability $p(a \mid t)$. On the other hand, some topics are connected to many aspects with high relevance. For example, topics that are aggregated by location names extracted by LDA with high occurrence probability are connected with high relevance to all aspects because real-life tweets often contain location names. Similarly, topics including stop-words [21] will be associated with strong relevance to many aspects. This problem can be solved by calculating conditional probability $p(t \mid a)$ as the importance of aspect a in each topic t. Because it is normalized over all aspects, topics with high relevance for all aspects have low conditional probability $p(t \mid a)$.

The conditional probabilities are calculated as follows:

$$p(a \mid t) = \frac{p(a, t)}{\sum_{t' \in T_{a}} p(a, t')}, \qquad p(t \mid a) = \frac{p(a, t)}{\sum_{a' \in A} p(a', t)}, \tag{7}$$

where $T_{a}$ denotes the topics associated with aspect a and A denotes the set of all aspects. These two probabilities are used for the probability distribution inference in Section 3.4.
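A small sketch of the two normalizations in Eq. (7) may help to see how they differ. It follows the notation of Eq. (7) as written in the text: $p(a \mid t)$ is normalized over the topics in $T_a$, and $p(t \mid a)$ over all aspects. The function name and the toy relevance values are hypothetical.

```python
def conditional_probs(relevance, selected):
    """Eq. (7). relevance[a][t] holds p(a, t); selected[a] is the topic set T_a.

    Returns two dicts keyed by (aspect, topic), named as in the text.
    """
    aspects = list(relevance)
    p_a_given_t, p_t_given_a = {}, {}
    for a in aspects:
        z_topics = sum(relevance[a][t] for t in selected[a])
        for t in selected[a]:
            z_aspects = sum(relevance[b][t] for b in aspects)
            p_a_given_t[(a, t)] = relevance[a][t] / z_topics    # over topics in T_a
            p_t_given_a[(a, t)] = relevance[a][t] / z_aspects   # over all aspects
    return p_a_given_t, p_t_given_a


# Hypothetical values: t0 is shared evenly by both aspects, t1 is mostly Weather.
toy_relevance = {
    "Weather": {"t0": 0.2, "t1": 0.6},
    "Traffic": {"t0": 0.2, "t1": 0.2},
}
toy_selected = {"Weather": {"t0", "t1"}, "Traffic": {"t0", "t1"}}
p_a_t, p_t_a = conditional_probs(toy_relevance, toy_selected)
print(p_a_t[("Weather", "t1")], p_t_a[("Weather", "t1")])
```

In the toy data, the shared topic t0 receives a lower $p(t \mid a)$ for Weather than the characteristic topic t1, which is the damping effect described for topics that are relevant to many aspects.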

Fig. 2.

Probability distribution inference method.

3.4. Inference

To infer the probability distribution of real-life aspects for unknown tweets, we use the associations between topics and aspects. The inference flow using the associations is shown in Fig. 2. First, terms are extracted from tweets. Second, the occurrence probabilities of all the terms are calculated for each topic. After that, the aspect score is calculated based on the terms’ probabilities and the associations. Aspect scores $p(a \mid W_{tw})$ between tweet $tw$ and aspect a are calculated as follows:

$$p(a \mid W_{tw}) = \frac{1}{Z} \sum_{t \in T_{a}} \sum_{w \in W_{tw}} p(w \mid t)\, p(a \mid t)\, p(t \mid a), \tag{8}$$

where $W_{tw}$ denotes the set of terms extracted from unknown tweet $tw$ and $p(w \mid t)$ denotes the occurrence probability of term w in topic t. Z is the normalization constant, i.e., the sum of the unnormalized scores over all aspects.

$p(a \mid t)$ gives high relevance to important topics t for aspects a. However, several aspects might strongly associate with the same topics. For example, topics in which verbs have a high rank of occurrence probability are given high relevance from many aspects because verbs often appear in many aspects. $p(t \mid a)$ gives high relevance to the characteristic topics of aspects and low relevance to topics that are shared by several aspects. Here, we must consider the properties of real-life aspects. For example, flood and heavy rain often appear in the same sentence because floods are generally caused by heavy rain; they are aggregated in the same topic by LDA. From Table 1, because flood and heavy rain are respectively included in the Disaster and Weather aspects, both aspects should share flood and heavy rain topics. However, $p(t \mid a)$ alone gives low relevance to the Disaster and Weather aspects for such shared topics. To consider both $p(t \mid a)$ and $p(a \mid t)$, we multiply them in the score calculation of Eq. (8).
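The score calculation of Eq. (8) can be sketched as follows. This is a minimal illustration in which every table is a hypothetical toy value. The uniform fallback for a tweet whose terms match no topic is an assumption, since the text does not say how such tweets are handled.

```python
def infer_aspect_distribution(tweet_terms, topic_terms, selected, p_a_given_t, p_t_given_a):
    """Eq. (8): normalized aspect scores for the terms of one unknown tweet."""
    scores = {}
    for a, topics in selected.items():
        scores[a] = sum(
            topic_terms[t].get(w, 0.0) * p_a_given_t[(a, t)] * p_t_given_a[(a, t)]
            for t in topics
            for w in tweet_terms
        )
    z = sum(scores.values())
    if z == 0.0:
        # Assumption: fall back to a uniform distribution when nothing matches.
        return {a: 1.0 / len(scores) for a in scores}
    return {a: s / z for a, s in scores.items()}


# Hypothetical toy model with one topic per aspect.
topic_terms = {"t0": {"rain": 0.5, "flood": 0.5}, "t1": {"train": 1.0}}
selected = {"Weather": {"t0"}, "Traffic": {"t1"}}
p_a_given_t = {("Weather", "t0"): 1.0, ("Traffic", "t1"): 1.0}
p_t_given_a = {("Weather", "t0"): 0.8, ("Traffic", "t1"): 0.4}

dist = infer_aspect_distribution(
    ["rain", "flood", "train"], topic_terms, selected, p_a_given_t, p_t_given_a
)
print(dist)
```

The output is a probability distribution over aspects rather than a set of labels, so the same result can serve both strict use (take only the top aspect) and broad use (take every aspect above a small probability).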

4. Experimental evaluations

To clarify the effectiveness of our proposed method that infers the probability distribution, we evaluated the JS divergence (JSD) and the Euclidean distance (ED) between each method’s inferred and correct probability distributions. As baseline methods, we adopted Naive Bayes, SVM, and L-LDA.

4.1. Dataset and parameter settings

4.1.1. Collecting many regional tweets

Our method requires many tweets for generating topics using LDA. We used the Twitter Search API (https://dev.twitter.com/docs/api/1/get/search) and collected 2,390,553 tweets that were posted from April 15, 2012 to August 4, 2012, each of which has “Kyoto” as its Japanese regional information.

4.1.2. Real-life tweets

To construct associations and evaluate our method, we prepared a small set of 1,500 labeled tweets, each of which has “Kyoto” as the Japanese location information. We used three examinees: examinee E1 is the first author, and E2 and E3 are university students living in the city of Tsukuba. During the labeling process, the examinees freely consulted Table 1 and viewed the example tweets in each aspect and why they were classified as such. They selected the most suitable aspect for each tweet as the first aspect and the next two most suitable aspects as the second and third aspects. If no suitable aspect remained, they selected “other” to identify it as a non-real-life tweet.

We evaluated the κ coefficients [4] among the top candidates, i.e., the aspects that each examinee selected first. When the κ coefficient is high, the classification agreement rate among the examinees is high. The κ coefficient was 0.687 between examinees E1 and E2, 0.595 between E1 and E3, and 0.576 between E2 and E3. The average was 0.619, which indicates substantial agreement.
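As a hedged illustration of the agreement measure, the following sketch computes a pairwise κ under the assumption that the coefficient of [4] is Cohen's κ, which the text does not state explicitly. The two short label sequences are hypothetical. The last lines only re-derive the reported average from the three reported pairwise values.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences of equal length.

    Undefined (division by zero) when the chance agreement equals 1.
    """
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical top candidates of two examinees for four tweets.
e1 = ["Eating", "Eating", "Traffic", "Weather"]
e2 = ["Eating", "Traffic", "Traffic", "Weather"]
kappa = cohen_kappa(e1, e2)

# Average of the three pairwise values reported in the text.
reported = [0.687, 0.595, 0.576]
average = sum(reported) / len(reported)
print(round(kappa, 3), round(average, 3))
```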

4.1.3. Single-label dataset for training

To identify the most appropriate aspect for each tweet, we extracted the aspect that two or three examinees selected as the top candidate. The number of labels of each aspect is shown in the “Single-label” column in Table 2. The Eating aspect received the most labels: 136 out of 1,500 tweets. Eight aspects were each assigned to at least 100 tweets. The total number of labeled tweets was 1,345. The remaining 155 tweets ($= 1{,}500 - 1{,}345$) were those for which no two examinees agreed on the top candidate.

Table 2

Number (#) and probability of labels by aspect

Aspect	Single-label #	Single-label $P(a)$	Single-label $|T_{a}|$	Multi-label #	Multi-label $P(a)$	Multi-label $|T_{a}|$
Appearance	104	0.0773	341	151	0.0636	329
Contact	100	0.0743	343	208	0.0877	341
Disaster	39	0.0290	379	52	0.0219	363
Eating	136	0.1011	247	219	0.0923	296
Event	85	0.0632	241	219	0.0923	288
Expense	76	0.0565	334	211	0.0889	387
Health	92	0.0684	348	121	0.0510	322
Hobby	108	0.0803	339	200	0.0843	312
Living	97	0.0721	332	141	0.0594	328
Locality	68	0.0506	320	147	0.0619	348
School	110	0.0818	321	153	0.0645	323
Traffic	107	0.0796	346	136	0.0573	306
Weather	111	0.0825	291	157	0.0662	291
Working	105	0.0781	299	176	0.0742	303
Other	7	0.0052	248	82	0.0346	375
Total	1,345	1.0000		2,373	1.0000	

4.1.4. Multi-label dataset for training

For the multi-label dataset, each tweet is labeled with every aspect that at least one of the three examinees selected as the first candidate. Therefore, the multi-label dataset is a superset of the single-label dataset. The number of labels of each aspect is shown in the “Multi-label” column in Table 2. The Eating and Event aspects are the most frequently labeled ones. For the Contact, Event, Expense, and Locality aspects, the number of labels is more than twice that in the single-label dataset.
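The two labeling rules can be sketched as follows, assuming that each tweet comes with the top candidates of the three examinees. The function names are hypothetical. The first example mirrors the open-campus tweet discussed in Section 5, where two examinees chose School and one chose Event.

```python
from collections import Counter


def single_label(top_candidates):
    """Aspect chosen first by at least two examinees, or None if all disagree."""
    aspect, votes = Counter(top_candidates).most_common(1)[0]
    return aspect if votes >= 2 else None


def multi_labels(top_candidates):
    """Every aspect chosen first by at least one examinee."""
    return set(top_candidates)


open_campus = ["School", "School", "Event"]   # top candidates of E1, E2, E3
print(single_label(open_campus))              # School
print(sorted(multi_labels(open_campus)))      # ['Event', 'School']

no_majority = ["Eating", "Event", "Locality"]
print(single_label(no_majority))              # None: excluded from single-label data
```

Whenever a single label exists, it is always contained in the multi-label set, which is why the multi-label dataset is a superset of the single-label one.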

4.1.5. Probability distribution dataset for evaluation

To give the probability distribution of the aspects for each tweet, we used all of the candidate aspects assigned by the three examinees. Based on the reciprocal rank (RR) [23], which is an evaluation metric for search engine effectiveness, we assumed that aspects selected with a higher rank have greater weight for the tweet. Correct probability distribution $P(a \mid tw)$ of each aspect a in tweet $tw$ is given as follows:

$$RR(a \mid tw) = \frac{1}{|A|} + \sum_{e \in E} \frac{1}{\operatorname{rank}(a \mid tw, e)}, \tag{9}$$

$$P(a \mid tw) = \frac{RR(a \mid tw)}{\sum_{a' \in A} RR(a' \mid tw)}, \tag{10}$$

where E denotes the set of all examinees. $\operatorname{rank}(a \mid tw, e)$ is the candidate number (1, 2, or 3 for the first, second, or third candidate) of aspect a labeled by examinee e for tweet $tw$. The term $\frac{1}{|A|}$ is a constant for smoothing the probability distribution; in this paper, it is the reciprocal of the number of aspects. Probability $P(a \mid tw)$ of aspect a is the RR value divided by the sum of the RR values over all aspects.
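A worked sketch of Eqs. (9) and (10) is given below. It assumes that an aspect an examinee did not select contributes nothing to the sum, which the text implies but does not state. The function name, the four-aspect list, and the rankings are hypothetical and are kept small so that the arithmetic is easy to follow.

```python
def correct_distribution(rankings, aspects):
    """Eqs. (9)-(10). rankings: one ranked list (1st, 2nd, 3rd) per examinee."""
    smoothing = 1.0 / len(aspects)               # the 1/|A| term of Eq. (9)
    rr = {a: smoothing for a in aspects}
    for ranked in rankings:
        for position, a in enumerate(ranked, start=1):
            rr[a] += 1.0 / position              # reciprocal rank of aspect a
    total = sum(rr.values())
    return {a: value / total for a, value in rr.items()}


# Hypothetical setting with four aspects and three examinees.
aspects = ["School", "Event", "Traffic", "Other"]
rankings = [
    ["School", "Event"],   # E1: School first, Event second
    ["School"],            # E2: School only
    ["Event", "School"],   # E3: Event first, School second
]
dist = correct_distribution(rankings, aspects)
print(dist)
```

Here School accumulates $0.25 + 1 + 1 + 0.5 = 2.75$ and Event $0.25 + 0.5 + 1 = 1.75$, while the two unselected aspects keep only the smoothing term $0.25$; dividing by the total of 5 gives 0.55, 0.35, 0.05, and 0.05.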

4.1.6. Parameter settings

LDA requires hyperparameters. Based on related works [8], we set α to $\frac{50}{|T|}$ and β to 0.1. $|T|$ denotes the number of topics, which is chosen from among 50, 100, 200, 500, and 1,000 in Section 4.4.1. The number of LDA iterations is 100 in every case.
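For reference, the resulting α for each candidate number of topics can be tabulated with a few lines. The dictionary keys are hypothetical names chosen for this sketch and do not correspond to the options of any particular LDA library.

```python
TOPIC_COUNTS = [50, 100, 200, 500, 1000]
BETA = 0.1
ITERATIONS = 100


def lda_settings(num_topics):
    """Hyperparameters described in the text: alpha = 50 / |T|, beta = 0.1."""
    return {
        "num_topics": num_topics,
        "alpha": 50 / num_topics,
        "beta": BETA,
        "iterations": ITERATIONS,
    }


settings = [lda_settings(k) for k in TOPIC_COUNTS]
for s in settings:
    print(s["num_topics"], s["alpha"])
```

With this rule, α shrinks from 1.0 at 50 topics to 0.05 at 1,000 topics, so the per-document topic prior becomes sparser as the number of topics grows.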

4.2. Evaluation metrics

To correctly evaluate our method’s performance, we used 10-fold cross validation. We evaluated the JSD and ED between the inferred and correct probability distributions. JSD is a metric that measures the similarity between two probability distributions [15]. When both metrics are low, our method accurately infers the probability distribution of tweets. JSD and ED between probability distributions x and y are calculated as follows:

$$JSD(x, y) = \frac{1}{2} \left( \sum_{a \in A} x(a) \log \frac{x(a)}{z(a)} + \sum_{a \in A} y(a) \log \frac{y(a)}{z(a)} \right), \tag{11}$$

$$ED(x, y) = \sqrt{\sum_{a \in A} \left(x(a) - y(a)\right)^{2}}, \tag{12}$$

where $z(a)$ denotes the average of $x(a)$ and $y(a)$.
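Both metrics are short enough to sketch directly. The base of the logarithm is not specified in the text, so the natural logarithm is assumed here. Terms with zero probability are skipped, following the usual convention that $0 \log 0 = 0$. The function names and the toy distributions are hypothetical.

```python
import math


def jsd(x, y):
    """Eq. (11): Jensen-Shannon divergence between two distributions (natural log)."""
    total = 0.0
    for a in x:
        z = 0.5 * (x[a] + y[a])
        if x[a] > 0.0:
            total += x[a] * math.log(x[a] / z)
        if y[a] > 0.0:
            total += y[a] * math.log(y[a] / z)
    return 0.5 * total


def euclidean(x, y):
    """Eq. (12): Euclidean distance between two distributions."""
    return math.sqrt(sum((x[a] - y[a]) ** 2 for a in x))


correct = {"School": 0.55, "Event": 0.35, "Traffic": 0.10}
inferred = {"School": 0.50, "Event": 0.30, "Traffic": 0.20}
print(jsd(inferred, correct), euclidean(inferred, correct))

# Two distributions with disjoint support give the maximum JSD, log 2.
print(jsd({"School": 1.0, "Event": 0.0}, {"School": 0.0, "Event": 1.0}))
```

With the natural logarithm, JSD is bounded between 0 (identical distributions) and $\log 2 \approx 0.693$, which gives a scale for reading the reported values.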

4.3. Baseline methods

We extracted nouns, verbs, and adjectives using a Japanese morphological analyzer called MeCab (http://mecab.sourceforge.net/) and entered the same sets of words and label(s) into every method.

4.3.1. Uniform distribution (UD)

As the simplest comparison method, we prepared the uniform distribution of aspects, each of which has $\frac{1}{| A |}$ probability. $| A |$ is the number of aspects.

4.3.2. Prior distribution (PD)

The prior distribution is calculated from the ratio of the number of labels of each aspect in the training dataset. UD and PD do not depend on the set of words appearing in the tweets.
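The two word-independent baselines can be sketched as below. The label counts are the single-label counts from Table 2, used here over the whole dataset for illustration. In the cross-validation setting described in the text, the prior would instead be computed from each training fold. The function names are hypothetical.

```python
def uniform_distribution(aspects):
    """UD: every aspect receives probability 1 / |A|."""
    return {a: 1.0 / len(aspects) for a in aspects}


def prior_distribution(label_counts):
    """PD: relative frequency of each aspect's labels in the training data."""
    total = sum(label_counts.values())
    return {a: n / total for a, n in label_counts.items()}


# Single-label counts taken from Table 2.
single_label_counts = {
    "Appearance": 104, "Contact": 100, "Disaster": 39, "Eating": 136,
    "Event": 85, "Expense": 76, "Health": 92, "Hobby": 108, "Living": 97,
    "Locality": 68, "School": 110, "Traffic": 107, "Weather": 111,
    "Working": 105, "Other": 7,
}

ud = uniform_distribution(list(single_label_counts))
pd_ = prior_distribution(single_label_counts)
print(round(ud["Eating"], 4), round(pd_["Eating"], 4))
```

The prior for Eating, $136 / 1{,}345$, reproduces the $P(a)$ value listed for that aspect in Table 2.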

4.3.3. Naive Bayes (NB)

A Naive Bayes classifier [5], which is one of the most typical and effective classification methods, assigns to a document the label with the highest posterior probability. In our experimental evaluations, we used the normalized posterior probabilities of each document as its inferred distribution.

4.3.4. Support vector machine (SVM)

We used LIBSVM [3] as a support vector machine library. LIBSVM provides a probability estimation tool [27] for each class in addition to document classification. As SVM parameters, we chose a linear kernel and set parameter C to 1.0, as indicated by a grid search in the LIBSVM tools (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).

4.3.5. Labeled LDA (L-LDA)

Labeled LDA is an LDA extended model proposed by Ramage et al. [17]. L-LDA sets the hyperparameters of both α and β, as in LDA. We experimentally set α to 0.1 and β to 0.1, and the iterative calculation count in L-LDA was 100.

4.4. Experimental results

4.4.1. Comparison of number of topics

We evaluated the micro-average value of JSD between the inferred and correct probability distributions in both the single and multi-label cases (Table 3). In both cases, JSD decreased as the number of topics increased. The decrease leveled off from 500 topics because the JSD difference between 500 and 1,000 topics is slight. The minimum JSD was achieved at 1,000 topics in both the single and multi-label cases. Therefore, we used 1,000 as the number of topics for HEF.

Table 3

JSD scores for each number of topics in HEF

# of topics	Single-label	Multi-label
50	0.2408	0.2170
100	0.2324	0.2127
200	0.2159	0.1977
500	0.1987	0.1852
1,000	0.1926	0.1820

4.4.2. Number of topics connected to each aspect

We show the number of topics associated with each aspect in the $|T_{a}|$ column of Table 2. These numbers are optimized by Welch’s t-test. The largest numbers of topics in the single and multi-label cases are 379 for the Disaster aspect and 387 for the Expense aspect, respectively. The smallest number of topics belongs to the Event aspect in both the single and multi-label cases (241 and 288).

We show the relevance and the t-test distributions of the Disaster and Event aspects in Figs 3 and 4. The horizontal axes of both figures are the topic rankings arranged in descending order of relevance. The left and right vertical axes of both figures are the relevance and t-test values, respectively. The Disaster aspect achieved the maximum t-test value between 300 and 400 topics. On the other hand, the Event aspect achieved the maximum between 200 and 300 topics.

Fig. 3.

Relevance and t-test value distribution of Disaster.

Fig. 4.

Relevance and t-test value distribution of Event.

4.4.3. Comparison of association building methods

To evaluate the effectiveness of our association building method, we implemented three simple methods to build associations. First, we associated only the topic with the highest relevance with each aspect; second, we associated the ten topics with the highest relevance with each aspect; finally, we associated all the topics with each aspect.
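The three simple strategies differ only in how many of the relevance-ranked topics are kept, which a single helper can express. The function name and the toy relevance values are hypothetical.

```python
def top_k_topics(relevance_by_topic, k=None):
    """Keep the k most relevant topics for one aspect; k=None keeps all topics."""
    ranked = sorted(relevance_by_topic, key=relevance_by_topic.get, reverse=True)
    return set(ranked if k is None else ranked[:k])


# Hypothetical relevance values p(a, t) of twelve topics for one aspect.
toy_relevance = {f"t{i}": 1.0 / (i + 1) for i in range(12)}

highest_topic = top_k_topics(toy_relevance, k=1)
highest_10 = top_k_topics(toy_relevance, k=10)
all_topics = top_k_topics(toy_relevance)
print(len(highest_topic), len(highest_10), len(all_topics))
```

Unlike these fixed cut-offs, the t-test method chooses a different number of topics for each aspect from the shape of its relevance distribution.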

Table 4

JSD of each association building method

Method	Single-label	Multi-label
Highest topic	0.3376	0.2921
Highest 10 topics	0.2391	0.2281
All topics	0.2068	0.1935
t-test topics	0.1926	0.1820

The JSD values of each method are shown in Table 4. The minimum JSD value was achieved by the t-test topics. The first and second methods showed higher JSD values than the third. These results suggest that a few topics are insufficient to represent an aspect. However, to build refined associations, the extra topics must be removed from the third method’s result, which is the role of the t-test selection.
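
The three simple association strategies compared in Table 4 can be written as a small selector. This is a sketch under the assumption that each aspect has one relevance score per topic; `select_topics`, the strategy names, and the toy scores are hypothetical and not taken from the paper.

```python
def select_topics(relevance, strategy):
    """Return the topic ids associated with one aspect.

    relevance: dict mapping topic id -> relevance to the aspect.
    strategy:  "highest", "highest10", or "all".
    """
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    if strategy == "highest":
        return ranked[:1]
    if strategy == "highest10":
        return ranked[:10]
    if strategy == "all":
        return ranked
    raise ValueError(f"unknown strategy: {strategy}")

# Toy relevance scores (hypothetical) for 20 topics; topic 0 is the most relevant.
toy_relevance = {topic: 1.0 / (topic + 1) for topic in range(20)}

for name in ("highest", "highest10", "all"):
    print(name, len(select_topics(toy_relevance, name)))
```

The t-test selection sits between "highest10" and "all": it keeps a per-aspect number of top-ranked topics rather than a fixed count.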

4.4.4. Inference performance of each method

We show the micro-average and the standard deviation of JSD and ED for each method in Figs 5 and 6. The vertical axes show the JSD and ED values. We performed a one-sided t-test of HEF’s JSD and ED values against each baseline method’s values. The results are drawn above each baseline method as “*” symbols in the figures: “***” represents a significant difference at the 1% level, “**” at the 5% level, and “*” at the 10% level.

According to the t-test results, our method estimated the probability distributions significantly better than all the baseline methods in the single-label case. In the multi-label case, HEF performed significantly better than every baseline method except SVM. The JSD value of HEF trained with the multi-label dataset is significantly better, at the 10% level, than that of HEF trained with the single-label dataset.
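
The star notation used in the figures maps a p-value to a marker. A minimal sketch is given below; the function name `stars` is hypothetical, and the p-value itself is assumed to come from the one-sided t-test described above.

```python
def stars(p_value):
    """Map a p-value to the significance markers used in Figs 5 and 6."""
    if p_value < 0.01:
        return "***"  # significant at the 1% level
    if p_value < 0.05:
        return "**"   # significant at the 5% level
    if p_value < 0.10:
        return "*"    # significant at the 10% level
    return ""         # not significant at the 10% level

for p in (0.003, 0.03, 0.08, 0.4):
    print(p, repr(stars(p)))
```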

Fig. 5. JS divergence.

Fig. 6. Euclidean distance.
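
The values plotted in Fig. 6 can be computed along the following lines. This sketch assumes that the micro-average is the mean of the per-tweet distances and that the standard deviation is the sample standard deviation; both are assumptions, and the helper names and toy distributions are hypothetical.

```python
import math
from statistics import mean, stdev

def euclidean(p, q):
    """Euclidean distance between two probability distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def summarize(inferred, correct, distance):
    """Mean and sample standard deviation of per-tweet distances."""
    distances = [distance(p, q) for p, q in zip(inferred, correct)]
    return mean(distances), stdev(distances)

# Toy distributions over three aspects for two tweets (hypothetical values).
inferred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
correct = [[0.6, 0.3, 0.1], [0.1, 0.6, 0.3]]
avg, sd = summarize(inferred, correct, euclidean)
print(round(avg, 4), round(sd, 4))
```

Passing a JSD function instead of `euclidean` would give the corresponding summary for Fig. 5.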

5. Discussion

From Figs 5 and 6, the multi-label dataset achieved lower JSD and ED values than the single-label dataset for every method except the uniform distribution. The reason is clear: the multi-label dataset provides more detailed training information for inferring the probability distributions than the single-label one. In particular, the JSD value of SVM decreased by 0.04 ( $= 0.22 - 0.18$ ) in the multi-label case compared with the single-label one.

Our method showed the lowest JSD and ED values in both the single- and multi-label cases. In the single-label case, HEF showed significantly higher performance than the other methods. An example tweet that illustrates the reason is shown in Table 5 and Fig. 7. Table 5 shows the tweet and its labels. The main topic of this tweet is an open campus, and two examinees selected the School aspect as their top candidate. Therefore, this tweet received the School label in the single-label case. On the other hand, examinee E3 selected the Event aspect as the top candidate because E3 regarded an open campus as an event. In fact, examinee E1 selected the Event aspect as the second candidate. In the multi-label case, this tweet was labeled with the School and Event aspects.

Table 5
Effectively inferred probability distributions of aspects by HEF

Examinees	1st	2nd	3rd
E1	School	Event	Hobby
E2	School	Other
E3	Event	School	Other
Tweet	We’ll hold an open campus for Kyoto Seika University on June 9, and some professors will provide special lectures!
Single-label	School
Multi-label	School, Event
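
The labels in Table 5 can be reproduced from the examinees' rankings with a short sketch. The single-label rule (majority of first choices) follows the description in the text. The multi-label rule used here (the set of all first choices) is only an assumption that happens to match this example; the actual rule is defined in the dataset section. The variable names are hypothetical.

```python
from collections import Counter

# Rankings copied from Table 5 (E2 gave only two candidates).
rankings = {
    "E1": ["School", "Event", "Hobby"],
    "E2": ["School", "Other"],
    "E3": ["Event", "School", "Other"],
}

first_choices = [ranked[0] for ranked in rankings.values()]

# Single label: the aspect chosen first by the most examinees.
single_label, votes = Counter(first_choices).most_common(1)[0]

# Multi-label (assumed rule): every aspect that is some examinee's first choice,
# listed in order of first appearance.
multi_label = sorted(set(first_choices), key=first_choices.index)

print(single_label, votes, multi_label)
```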

Fig. 7. Probability distributions of aspects estimated by each method for Table 5’s tweet.

Figure 7 shows the correct probability distribution of Table 5’s tweet as a solid black line. Reflecting the examinees’ labeling results, the School and Event aspects have higher probabilities than the other aspects. The figure also shows the probability distributions estimated by each method trained with the single-label dataset. We focus on the probabilities of the School and Event aspects. Every method inferred a higher probability for the School aspect than for the other aspects. NB inferred a probability higher than 0.50, whereas SVM and HEF inferred probabilities lower than the correct one. For the Event aspect, HEF estimated the probability closest to the correct one.

Table 6

High-occurrence-probability terms in highest relevance topic associated with Event

Topics	Characteristic words
Topic #387	participation, Kyoto, lecture, held, hall, culture, campus, conference, university

Table 6 shows the high-occurrence-probability words in the topic that HEF associated with the Event aspect with the highest relevance. This topic includes terms related to the Event aspect: “participation”, “held”, and “conference”. It also includes such terms as “lecture”, “campus”, and “university”, which suggests that these terms often appear together in tweets. In other words, such terms as “campus” and “university” are frequently mentioned in connection with Event aspect terms, including “held”. Our method can build associations between this topic and both the Event and School aspects because it can use such terms as “held” to assign the Event aspect and “lecture” to assign the School aspect. Although Table 5’s tweet includes many terms that suggest the School aspect, such as “university”, “professor”, and “lecture”, the only term that suggests the Event aspect is the verb “held”. For this reason, NB, SVM, and L-LDA, all of which directly calculate the likelihood of terms, did not estimate the Event aspect appropriately. In contrast, HEF inferred the Event aspect with high probability for Table 5’s tweet because it associated the Event aspect with high relevance with topic #387, in which LDA assigns high occurrence probabilities to “held”, “lecture”, and “university”.
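
The effect of a topic shared by two aspects can be illustrated with a toy calculation. This is not the paper's inference formula: it assumes, purely for illustration, that an aspect's score is the relevance-weighted sum of the tweet's topic probabilities, normalized over aspects. All numbers, topic #12, and the function name `infer_aspects` are hypothetical; only topic #387 is taken from Table 6.

```python
def infer_aspects(topic_dist, associations):
    """Toy aspect distribution from a tweet's topic distribution.

    topic_dist:   dict topic id -> p(topic | tweet)
    associations: dict aspect -> {topic id: relevance}
    """
    scores = {
        aspect: sum(topic_dist.get(topic, 0.0) * rel for topic, rel in topics.items())
        for aspect, topics in associations.items()
    }
    total = sum(scores.values())
    # Normalize so that the aspect scores form a probability distribution.
    return {a: s / total for a, s in scores.items()} if total > 0 else scores

# Hypothetical topic distribution of the open-campus tweet.
topic_dist = {387: 0.6, 12: 0.4}

# Hypothetical associations: topic #387 is shared by School and Event.
associations = {
    "School": {387: 0.5, 12: 0.9},
    "Event": {387: 0.8},
    "Hobby": {12: 0.1},
}

result = infer_aspects(topic_dist, associations)
print({aspect: round(p, 3) for aspect, p in result.items()})
```

Because topic #387 is associated with both aspects, the Event aspect receives a substantial probability even though the tweet contains few Event-specific terms, which is the behavior described above.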

Finally, Table 7 shows, for each examinee, the average number of labels assigned to a tweet (Mean), its standard deviation (SD), and the number of tweets that received one, two, or three labels. The means and standard deviations of examinees E1 and E3 are close, and their distributions of the number of assigned labels are also similar. However, the average number of labels assigned by examinee E2 is greater than those of E1 and E3: as the “Three labels” column shows, E2 tended to assign many labels to a tweet. These results suggest that the criteria for assigning aspects differ among users. For example, E1 and E3 are accuracy-oriented users and E2 is an exhaustiveness-oriented user. A multi-label classification approach has difficulty accommodating an individual user’s requirements or various situations. In contrast, a representation of tweets by probability distributions can accommodate both types of users.

Table 7

Average number of labels for a tweet by each examinee

	Mean	SD	One label	Two labels	Three labels
E1	1.519	0.633	834 (55.6%)	553 (36.9%)	113 (7.5%)
E2	2.498	0.700	180 (12.0%)	393 (26.2%)	927 (61.8%)
E3	1.497	0.625	859 (57.3%)	536 (35.7%)	105 (7.0%)
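
The Mean and SD columns of Table 7 follow from the label counts. The sketch below recomputes them; it assumes the sample standard deviation (divisor n - 1), which reproduces the rounded values in the table. The names `counts` and `mean_sd` are hypothetical.

```python
import math

# Number of tweets with one, two, and three labels, copied from Table 7.
counts = {
    "E1": (834, 553, 113),
    "E2": (180, 393, 927),
    "E3": (859, 536, 105),
}

def mean_sd(one, two, three):
    """Mean and sample standard deviation of the number of labels per tweet."""
    n = one + two + three
    mean = (1 * one + 2 * two + 3 * three) / n
    squared_dev = (
        one * (1 - mean) ** 2 + two * (2 - mean) ** 2 + three * (3 - mean) ** 2
    )
    return mean, math.sqrt(squared_dev / (n - 1))

for examinee, c in counts.items():
    m, sd = mean_sd(*c)
    print(examinee, round(m, 3), round(sd, 3))
```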

6. Conclusion

In this paper, we proposed a method that infers the real-life aspect distribution of tweets with a hierarchical estimation framework (HEF) using a small set of labeled tweets. To evaluate its effectiveness, we prepared a small set of labeled tweets based on the classifications of three examinees. Our experimental evaluation with a prototype system demonstrated that HEF can appropriately infer the probability distribution of the aspects of unknown tweets.

The main contributions of this paper are as follows. First, we showed that a few topics are insufficient to represent an aspect and that refined associations are built by removing extra topics with low relevance using a t-test. Second, in the case of single-label training, HEF showed significantly lower JS divergence and Euclidean distance values than every baseline method because topics are shared by several aspects.

These results show that our scheme is an effective method for inferring probability distributions from a small labeled dataset for such short sentences as tweets. In the future, we will confirm the effectiveness of our method using other datasets, such as newspapers and blogs.

Acknowledgements

This work was supported by Grants-in-Aid for Scientific Research No. 25280110 and No. 15J05599 and by NII’s strategic open-type collaborative research.

References

1. Aramaki, Maskawa and Morita, Twitter catches the flu: Detecting influenza epidemics using Twitter, in: Proceedings of the EMNLP 2011, AAAI, 2011, pp. 1568–1576.
2. D.M. Blei, A.Y. Ng and M.I. Jordan, Latent Dirichlet allocation, JMLR 3 (2003), 993–1022.
3. Chang and Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2(3) (2011), 1–27. doi:10.1145/1961189.1961198.
4. Cohen, A coefficient of agreement for nominal scales, EPM 20(1) (1960), 37–46.
5. Cortes and Vapnik, Support-vector networks, JMLR 20(3) (1995), 273–297.
6. Diao, Jiang, Zhu and E.-P. Lim, Finding bursty topics from microblogs, in: Proceedings of the ACL 2012, ACM, 2012, pp. 536–544.
7. Domingos and Pazzani, On the optimality of the simple Bayesian classifier under zero–one loss, JMLR 29(2–3) (1997), 103–130.
8. T.L. Griffiths and Steyvers, Finding scientific topics, NAS 101 (2004), 5228–5235. doi:10.1073/pnas.0307752101.
9. Ishino, Nanba and Takezawa, Automatic compilation of an online travel portal from automatically extracted travel blog entries, in: Information and Communication Technologies in Tourism 2011, Springer, 2011, pp. 113–124. doi:10.1007/978-3-7091-0503-0_10.
10. Kase and Miura, Mining classes by multi-label classification, in: Proceedings of the EGC 2015, RNTI, 2015, pp. 77–82.
11. Ma, Sun, Yuan and Cong, Tagging your tweets: A probabilistic modeling of hashtag annotation in Twitter, in: Proceedings of the CIKM 2014, ACM, 2014, pp. 999–1008.
12. J.B.H. Mao and Pepe, Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena, in: Proceedings of the ICWSM 2011, AAAI, 2011, pp. 450–453.
13. Mathioudakis and Koudas, Twittermonitor: Trend detection over the Twitter stream, in: Proceedings of the SIGMOD 2010, ACM, 2010, pp. 1155–1158.
14. Mizunuma, Yamamoto, Yamaguchi, Ikeuchi, Satoh and Shimada, Twitter bursts: Analysis of their occurrences and classifications, in: Proceedings of the ICDS 2014, IARIA XPS, 2014, pp. 182–187.
15. K.P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012, p. 58.
16. Rajadesingan, Zafarani and Liu, Sarcasm detection on Twitter: A behavioral modeling approach, in: Proceedings of the WSDM 2015, ACM, 2015, pp. 97–106.
17. Ramage, Hall, Nallapati and C.D. Manning, Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora, in: Proceedings of the EMNLP 2009, ACM, 2009, pp. 248–256.
18. Riedl and Biemann, TopicTiling: A text segmentation algorithm based on LDA, in: Proceedings of the ACL 2012, ACM, 2012, pp. 37–42.
19. Sakaki, Okazaki and Matsuo, Earthquake shakes Twitter users: Real-time event detection by social sensors, in: Proceedings of the WWW 2010, ACM, 2010, pp. 851–860.
20. Sakaki, Okazaki and Matsuo, Tweet analysis for real-time event detection and earthquake reporting system development, IEEE Transactions on Knowledge and Data Engineering 25(4) (2013), 913–931. doi:10.1109/TKDE.2012.29.
21. Suresh, Krishnamurthy, Badrinath and Veni Madhavan, Advances in Intelligent Data Analysis X, LNCS, Vol. 7014, Springer, 2011, pp. 364–375. doi:10.1007/978-3-642-24800-9_34.
22. Twitter, Twitter reports fourth quarter and fiscal year 2013 results, 2014, available at https://investor.twitterinc.com/releasedetail.cfm?ReleaseID=823321.
23. Voorhees and Tice, The TREC-8 question answering track evaluation, in: Proceedings of the TREC-8, ACM, 1999, pp. 77–82.
24. Wang, Bu, Chen, W.V. Zhang, Cai and He, Whom to mention: Expand the diffusion of tweets by @ recommendation on micro-blogging systems, in: Proceedings of the WWW 2013, ACM, 2013, pp. 1331–1340.
25. Wei, Zhang, Li and Miao, A naive Bayesian multi-label classification algorithm with application to visualize text search results, International Journal of Advanced Intelligence 3(2) (2011), 173–188.
26. B.L. Welch, The generalization of ‘student’s’ problem when several different population variances are involved, Biometrika 34(1/2) (1947), 28–35.
27. T.-F. Wu, C.-J. Lin and R.C. Weng, Probability estimates for multi-class classification by pairwise coupling, JMLR 5 (2004), 975–1005.
28. Yamamoto, Ogasawara, Suzuki and Furukawa, Tourism informatics: 9. Information propagation network for 2012 Tohoku earthquake and tsunami on Twitter, IPSJ Magazine 53(11) (2012), 1184–1191 (in Japanese).
29. Yamamoto and Satoh, Hierarchical estimation framework of multi-label classifying: A case of tweets classifying into real life aspects, in: Proceedings of the ICWSM 2015, ACM, 2015, pp. 523–532.
30. Y.C. Zhang, D.O. Séaghdha, Quercia and Jambor, Auralist: Introducing serendipity into music recommendation, in: Proceedings of the WSDM 2012, ACM, 2012, pp. 13–22.
31. W.X. Zhao, Jiang, He, Song, Achananuparp, E.-P. Lim and Li, Topical keyphrase extraction from Twitter, in: Proceedings of the HLT 2011, ACM, 2011, pp. 379–388.
32. Zhao and Mei, Questions about questions: An empirical analysis of information needs on Twitter, in: Proceedings of the WWW 2013, ACM, 2013, pp. 1545–1556.
