Abstract
Given the immediacy of social networks, commonly called sociodigital networks, methods are needed to retrieve large amounts of information and to interpret it visually and in an organized way. Although there are tools that classify information through visualization, generally in the form of graphs, identifying the topics around an event remains complicated. This article describes the use of dendrograms as an alternative visual representation, built by analyzing the frequency of the terms used in tweets as well as the relationships between them. The use of semantic dendrograms facilitates the immediate identification of the themes and subtopics of a given event by showing them clustered in the form of a tree.
Introduction
The world currently generates more than 2.5 quintillion bytes of data daily, and 80% of it is unstructured [15]. That is, the data are expressed in a natural language (spoken, written or visual) that a human can easily understand but traditional software cannot. The algorithms currently in use should be able to find common patterns in the data in order to obtain the desired information and, if possible, to process it quickly in real time. Nevertheless, this is not always the case. Consequently, it is fundamental to have algorithms that analyze the relationships between data [15].
Online social media have created new ways to communicate, interact and share information with a wide audience. People have found the Internet an attractive place to show support for a cause. Compared with offline channels, it offers enhanced ease, reduced risk and immediate gratification [14]. As a result, Twitter, for example, has revolutionized communication, giving rise to new terms to describe online behaviors. The term slacktivism is often used interchangeably with clicktivism [8]; it signifies the ease with which individuals click on an online petition or social-media activist page and feel that they are actually helping [6].
The sheer volume of participation calls for the development of new tools for the study of large volumes of textual data. In computer science and linguistics, new text mining techniques are being developed to facilitate access to much of the information generated daily. The existing computational models for the representation of texts include the Boolean Model, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) and the Vector Space Model (VSM). In this paper, an approach based on LSA is presented in order to find relations between the terms contained in tweets. This work explores an alternative visualization of the results, produced by analyzing the frequent terms and grouping them by similar theme. The utility of visualizing the themes of a Twitter event with dendrograms is demonstrated.
The rest of the paper is organized as follows. Section 2 presents the state of the art by describing platforms that use different kinds of visualization to show tweets, and explores research work in Social Network Analysis. Section 3 describes the process followed to analyze tweets and visualize them. Section 4 presents a case study. Finally, Section 5 presents the main conclusions and ideas for future work.
Related work
Online social media content generally takes the form of posts: short informal texts, often with colloquial syntax and non-standard orthography or spelling, and frequently lacking punctuation [10]. Infographics can help to place sets of data in a useful context and to transform data into information, generally in order to share knowledge. Visual communication is more effective than textual communication because of the human cognitive ability to understand visual information [17]. Also, with the increasing amount of data published online, it becomes very important to analyze the data and build narratives from them, as well as to produce graphics that support those narratives. Some tools have therefore been developed to visualize posts in a more attractive way, organizing information through automatic infographics or graphs. One of them is Spot, an interactive real-time Twitter visualization that displays posts in the form of particles representing actors, communities and a word cloud. However, the visualization is sometimes not pertinent because stop words remain or terms are not well selected.
Another tool is Twitonomy, which analyzes a user account to present mentions, retweets, followers, top hashtags, most engaging users and other information on a page that is often very cluttered. Its most interesting contribution is the possibility of downloading the generated charts. Mentionmapp is a tool that makes it possible to explore, discover and visualize who is talking with whom, that is, the connections between users. Visualizing the flow of information allows an interpretation that the posts alone do not offer. Nevertheless, none of these tools organizes the data into subcategories to represent the main themes.
Recent research in Social Network Analysis (SNA) has produced a great number of studies that contribute to the visualization of information based on graphs. Two fields are covered in the visualization of data on Twitter: 1) topic detection and 2) knowledge visualization. The first addresses the detection of topics and their evolution over time [4, 26]. The second, knowledge visualization, tries to create a more representative and useful result by including a semantic dimension [5, 35]. However, it is difficult to find examples that cover both fields. In general, the vast majority of research work using tweets is applied to the study of the influence of actors (e.g. [11, 22]), sentiment analysis (e.g. [19, 23]) or opinion mining (e.g. [2, 24]).
A prototype to visualize Twitter information in qualitative, temporal, geographic, hierarchical or contextual infographics is presented in [13]. This is a good example of the interest in providing users with additional resources so that they can understand, interpret and analyze the results on their own.
In general, the tools and methods mentioned above propose different ways to analyze and visualize the information produced in social networks. However, they are mainly focused on the identification of actors and topics and their evolution over time. There is little interest in analyzing the posts around a selected hashtag in order to facilitate the understanding of a topic and its evolution.
The present work seeks to provide a tool that facilitates the identification of the relevant subthemes of a main theme through the use of dendrograms. In cluster analysis, a dendrogram is a tree graph that can be used to examine and show how groups or clusters are formed. Each leaf represents an individual observation spaced along the horizontal axis, while the vertical axis indicates a distance or dissimilarity measure. The height of a node represents the distance between the two clusters that the node joins [27].
The clustering of tweets is a recent research field whose techniques come mainly from biological studies. In computer science, the clustering of tweets has been used in [28] to represent temporal tweet content from cultural events. Dendrograms have also been used to represent user communities, as in [29, 30]. In [30], a clustering technique based on Ward's method is proposed in order to obtain accurate dendrograms for the sentiment analysis of tweets. However, the dendrograms are not shown in the article, so their purpose and utility cannot be validated. An interesting work combining different visual approaches, including dendrograms, is presented in [31], based on tweets produced as a consequence of two airline crashes. In this case, a stratified analysis of the message structure was achieved by performing hierarchical clustering on the terms used in the messages. By performing cluster analysis on the tweet data, it was possible to reconstruct the meanings contained in the posts in a reasonable and statistically sound manner [31]. In [32], the authors explore whether linguistic content analysis of Twitter data can be insightful for understanding the various latent issues during an epidemic. They found that citizens were more concerned with long-term issues than with short-term issues such as fever and rash. Using hierarchical clustering and word co-occurrence analysis, they found underlying themes related to immediate effects such as the spread of Zika.
Another use of dendrograms is presented in [33], where a sample of 4800 tweets was examined through hierarchical cluster analysis and textual analysis. The main purpose was to cluster extremist terms coming from hate groups in order to explore the convergence of white extremist political ideology with mainstream political ideology on Twitter. The same was done in [34], where the authors manually identified a group of tweets tagged with either the hashtag #JeSuisCharlie or #JeNeSuisPasCharlie, which resulted from the digital movement that emerged after the shooting attack by two self-proclaimed Islamist gunmen at the offices of the French satirical weekly Charlie Hebdo on 7 January 2015. Some of the results are presented using dendrograms to identify relevant information during crisis events and to explore the structure of the discussion over time.
At the same time, dendrograms are used with text mining to show the relation that subtopics can have with other secondary aspects that occur around the event. All the examples cited above use dendrograms to analyze tweets after an event has occurred, but they are not designed to study events in real time. The idea behind this article is to use dendrograms to extract subtopics while hundreds or thousands of tweets are being generated, allowing the user to select the number of clusters to be visualized.
The next section presents the two modules of the process used to analyze tweets.
Text data mining, Latent Semantic Analysis and visualization
The proposed system is composed of two modules: 1) text data mining and 2) Latent Semantic Analysis (LSA).
Text data mining
Text data mining is a process of exploratory data analysis [9] that leads to the discovery of previously unknown information, or to answers to questions for which the answer is not currently known. The main steps of text data mining are: 1) data import, 2) corpus handling, 3) preprocessing, 4) metadata management and 5) creation of term-document matrices.
Data import concerns the retrieval of tweets by specifying a hashtag. R and its twitteR library have been used to connect to the Twitter platform. By specifying the keys provided by the Twitter developer page, it is possible to access the Application Program Interface (API).
The main structure for managing documents is the corpus, a set of text documents that share similar characteristics. In this case, every document represents a tweet that is tagged with the same hashtag.
One of the steps carried out before storing the tweets is preprocessing, which eliminates data that is not of interest. For example, punctuation, numbers, links, tabs and blank spaces are removed. Stopwords are also eliminated before the corpus is transformed into a matrix. It is of special interest to filter the corpus for documents satisfying given properties.
Metadata can be used to annotate the whole corpus with additional information. Finally, term-document matrices are created from the corpus. In this case, terms are represented in rows and documents in columns. Once the term-document matrix has been created, a large number of R functions (clustering, classification, etc.) can be applied.
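As a rough illustration of this cleaning step, the following sketch applies the removals listed above to a single tweet. It is written in Python with the standard library only, purely for illustration (the work described here was done in R); the stop-word list is a tiny hypothetical one, and the removal of user mentions is an assumption borrowed from the filtering of actors described in the case study.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOPWORDS = {"the", "is", "a", "to", "of", "and", "in", "from", "rt"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(tweet):
    """Return the cleaned, lower-cased tokens of one tweet."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)          # user mentions (assumed to be the "actors")
    text = re.sub(r"\d+", " ", text)           # numbers
    text = text.translate(PUNCT_TO_SPACE)      # punctuation, including '#'
    tokens = text.split()                      # split() also collapses tabs and blank spaces
    return [t for t in tokens if t not in STOPWORDS]

cleaned = preprocess("RT @user: Hurricane #Irma is 50 miles\tfrom Miami! https://t.co/abc")
print(cleaned)  # ['hurricane', 'irma', 'miles', 'miami']
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being split into fragments.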
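The layout of such a matrix, with terms in rows and documents in columns, can be sketched as follows. This is a minimal Python illustration on three invented, already-cleaned tweets, not the R code used in this work; the function name `term_document_matrix` is hypothetical.

```python
from collections import Counter

def term_document_matrix(docs):
    """docs: list of token lists. Returns (terms, matrix) with one row per term."""
    terms = sorted({token for doc in docs for token in doc})
    counts = [Counter(doc) for doc in docs]
    # matrix[i][j] = number of times terms[i] occurs in document j
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix

# Three invented tweets, already tokenized and cleaned.
docs = [
    ["hurricane", "irma", "miami"],
    ["irma", "storm", "irma"],
    ["miami", "storm"],
]
terms, tdm = term_document_matrix(docs)
for term, row in zip(terms, tdm):
    print(f"{term:<10}", row)
```

Each printed row gives the counts of one term across the three tweets; for instance, "irma" occurs once in the first tweet and twice in the second.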
Latent Semantic Analysis
Existing approaches to analyzing posts mainly rely on parts of the text in which opinions are explicitly expressed, such as polarity terms, affect words and their co-occurrence frequencies. However, opinions and sentiments are often conveyed implicitly through latent semantics, which makes purely syntactic approaches ineffective [34].
One of the vector space models is Latent Semantic Analysis (LSA), which assumes that some seemingly independent words are related by unobserved underlying themes. LSA is a statistical model that makes it possible to determine the distances and semantic relationships between pieces of textual information, whether words, phrases or paragraphs. It indexes phrases or documents to reflect topic similarity based on word co-occurrence, a valid indicator of topic relatedness [7].
The main idea of LSA is that the totality of information about all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other [36].
LSA begins by processing a large corpus that contains thousands of words, paragraphs and phrases. The corpus is represented as a matrix of frequencies whose rows are the different words of the corpus and whose columns are the different paragraphs or phrases. In this way, the matrix contains the number of times that each word appears in each text. Subsequently, a weighting is applied in order to downplay the most frequent terms, since they do not provide relevant information, and to give more weight to the less frequent terms as well as to those that appear moderately often. This weight expresses both the word's importance in the particular document and the degree to which the word type carries information in the domain of discourse in general [36].
The next step is to apply an algorithm called Singular Value Decomposition (SVD) in order to reduce the size of the matrix to a more manageable number of dimensions without losing important information from the original one. This reveals similarities that are latent in the document collection. The SVD of the co-occurrence matrix identifies a basis whose vectors correspond to specific topics, or concepts, that are relevant to the text [1]. Given that words with similar meanings tend to occur together, the initial term space is transformed into a semantic space. The degree of semantic similarity between any pair of texts is measured by the cosine of the two corresponding vectors, which ranges from +1 (identical) to -1 (opposite). Near-zero values represent unrelated texts [36].
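The weighting scheme is not named in this description. As a sketch only, the following Python example uses tf-idf as a familiar stand-in (an assumption, not necessarily the scheme applied in this work) to show how a term present in every tweet is downplayed while a rarer term keeps its weight.

```python
import math

def tf_idf(matrix):
    """matrix: term-document counts with terms in rows. Returns a weighted copy."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        doc_freq = sum(1 for count in row if count > 0)  # tweets containing the term
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        weighted.append([count * idf for count in row])
    return weighted

# Invented counts for two terms over four tweets.
counts = [
    [1, 1, 1, 1],  # a term that appears in every tweet
    [0, 2, 0, 0],  # a term that appears in a single tweet
]
weights = tf_idf(counts)
print(weights[0])  # every weight is 0.0: the term does not discriminate between tweets
print(weights[1])  # only the second tweet gets a non-zero weight, 2 * log(4)
```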
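The cosine measure mentioned above can be written in a few lines. The following Python sketch uses invented two-dimensional vectors; returning 0.0 for an all-zero vector is a convention chosen here, not something prescribed by LSA.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty vectors (an assumption of this sketch)
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2], [2, 4])    # close to +1: same direction
unrelated = cosine([1, 0], [0, 1])         # 0: orthogonal vectors
opposite = cosine([1, 2], [-1, -2])        # close to -1: opposite direction
print(same_direction, unrelated, opposite)
```

Because the cosine depends only on direction, a long text and a short text about the same theme can still obtain a value near +1.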
In the next section, we present a case study in order to explain the whole process of visualizing the themes of a Twitter event.
Case study: hurricane #Irma
The implemented visualization was made in R using a representative hashtag. The case study was recently changed in order to reflect current events. The arrival of hurricanes in several countries and cities has generated a high level of participation in social media. So, to show the process of analysis and the final visualization, a search for tweets was made on the event #Irma during the first days of September 2017.
To analyze the tweets, it was important to follow all the steps explained in Section 3.1. In the data import step we retrieved 10,000 tweets, the number specified in the query. Corpus handling included the replacement of badly downloaded characters so that the scripts could read all the tweets without problems.
The preprocessing step involved filtering out actors in order to clean the corpus before working with it. After that, a term-document matrix, representing the relationship between terms and tweets, was generated. Each row stands for a term and each column for a document, and an entry is the number of occurrences of the term in the document. However, term-document matrices become very large even for normal-sized data sets. Therefore, we used a method to remove sparse terms, i.e., terms occurring in only very few documents. Normally, this reduces the matrix dramatically without losing significant relations inherent to the matrix. The elimination concerns those terms whose percentage of sparse elements (i.e., tweets in which the term occurs 0 times) approaches 100.
Term document matrix: composed of 270 terms and 10000 tweets
Non-/sparse entries: 11474/2688526
Sparsity: 100%
Maximal term length: 14
Weighting: term frequency (tf)
In the above result, the term-tweet matrix is composed of 270 terms and 10000 tweets. It is very sparse: about 99.6% of the entries are zero, which is reported as 100% after rounding.
With a term-document matrix, it is possible to apply several functions in order to inspect the corpus, for example the frequency of terms or the associations between terms (see Figs. 1 and 2).
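The sparsity figure can be checked directly from the reported entry counts. The short Python computation below uses only the numbers printed in the summary; the function name `sparsity` is introduced here for illustration.

```python
def sparsity(non_sparse, sparse):
    """Fraction of matrix entries that are zero."""
    return sparse / (non_sparse + sparse)

n_terms, n_tweets = 270, 10000
non_sparse, sparse = 11474, 2688526

total_cells = n_terms * n_tweets           # 2,700,000 cells in the matrix
ratio = sparsity(non_sparse, sparse)

print(total_cells == non_sparse + sparse)  # the two counts cover every cell
print(f"{ratio:.2%}")                      # exact share of zero entries
print(f"{ratio:.0%}")                      # the same share rounded to a whole percent
```

The exact share is slightly below 100%; it reaches 100% only once rounded to a whole percent.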

Fig. 1. Terms that occur at least 50 times.

Fig. 2. Terms that occur at least 100 times.
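Selecting the terms that reach a minimum frequency amounts to summing each row of the term-document matrix. The Python sketch below illustrates the idea on an invented four-term matrix; it is comparable in spirit to the frequency filters available in R, but the function name and the data are hypothetical.

```python
def frequent_terms(terms, matrix, min_count):
    """Return the terms whose total count over all documents is at least min_count."""
    return [term for term, row in zip(terms, matrix) if sum(row) >= min_count]

# Invented term-document counts (terms in rows, three tweets in columns).
terms = ["hurricane", "irma", "miami", "storm"]
matrix = [
    [1, 0, 0],
    [1, 2, 0],
    [1, 0, 1],
    [0, 1, 1],
]
frequent = frequent_terms(terms, matrix, min_count=2)
print(frequent)  # ['irma', 'miami', 'storm']
```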
Another piece of information of interest to users when analyzing tweets is the relationship that certain keywords have with others. For example, Fig. 3 shows which words appear alongside the words hurricane, Miami, Irma and storm, and their frequency of appearance.

Fig. 3. Relationship between the words hurricane, Miami, Irma and storm.

Fig. 4. Matrix showing terms and tweets for the hashtag Irma.
From the created corpus, without sparse terms, we obtain the term-tweet matrix of size m×n (10×10 in the example), named A, where each row corresponds to a term and each column corresponds to a tweet. If the term i appears a times in the tweet j, then A[i, j] = a.
Through A, a new term-term matrix can be obtained:
$$B = A A^{T}$$
If the terms i and j appear together in b tweets, then B[i,j] = b. On the other hand, the tweet-tweet matrix is $C = A^{T} A$: if the tweets i and j have c terms in common, then C[i,j] = c.
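Taking the term-term and tweet-tweet matrices to be the products B = A Aᵀ and C = Aᵀ A, which is the usual LSA construction and an assumption here, both can be computed with plain matrix multiplication. The Python sketch below uses an invented binary 4×3 matrix; with 0/1 entries, B[i][j] is exactly the number of tweets in which terms i and j co-occur.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols] for row in x]

# Invented binary term-tweet matrix: 4 terms (rows) x 3 tweets (columns).
A = [
    [1, 0, 1],  # hurricane
    [1, 1, 0],  # irma
    [0, 1, 1],  # storm
    [1, 0, 0],  # miami
]
B = matmul(A, transpose(A))  # 4x4 term-term matrix
C = matmul(transpose(A), A)  # 3x3 tweet-tweet matrix

print(B[0][1])  # tweets containing both "hurricane" and "irma": 1
print(C[0][1])  # terms shared by the first and second tweet: 1
```

The diagonal of B counts the tweets containing each term, and the diagonal of C counts the distinct terms in each tweet.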
Next, a clustering method is used, which is essentially a set of rules for dividing up a proximity matrix to form groups of similar objects [3]. Sparse terms are removed so that the clustering plot is not crowded with words. Then, taking A into account, a matrix of Euclidean distances between objects is obtained. Euclidean distances are calculated by defining the matrices:
S: matrix of eigenvectors of B
U: matrix of eigenvectors of C
P: diagonal matrix whose elements are the square roots of the eigenvalues of matrix B
By using SVD, the dimensions are reduced by eliminating the smaller values of the matrix P. The k largest values are kept, creating the matrix $P_k$.
As a consequence, the matrices S and $U^{T}$ are also reduced, to $S_k$ and $U_k^{T}$. The matrix A is approximated as:
$$A \approx A_k = S_k P_k U_k^{T}$$
The dimension of $S_k$ will be m×k, for $P_k$ it will be k×k and for $U_k^{T}$ it will be k×n. The words will be represented by the rows of the m×k matrix $S_k$.
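To make the reduction concrete for k = 1, the following Python sketch finds the dominant eigenvector of A Aᵀ by power iteration and builds the corresponding rank-1 approximation of A. Power iteration is swapped in here only to keep the example free of external libraries; it is not claimed to be the SVD routine used in this work, and the matrix is invented.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def top_singular_triple(A, iterations=100):
    """Power iteration on A A^T. Returns (s, sigma, u) with A ~ sigma * s * u^T."""
    At = transpose(A)
    s = [1.0] * len(A)  # start vector; must not be orthogonal to the dominant direction
    for _ in range(iterations):
        s = matvec(A, matvec(At, s))  # multiply by A A^T without forming it
        length = norm(s)
        s = [x / length for x in s]
    u = matvec(At, s)                 # A^T s equals sigma * u
    sigma = norm(u)                   # square root of the largest eigenvalue of A A^T
    u = [x / sigma for x in u]
    return s, sigma, u

# Invented matrix: two terms that always co-occur, plus one isolated term.
A = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
s, sigma, u = top_singular_triple(A)
A_1 = [[sigma * s_i * u_j for u_j in u] for s_i in s]  # rank-1 approximation (k = 1)
print(round(sigma, 6))
print([[round(x, 6) for x in row] for row in A_1])
```

With k = 1 the approximation keeps the block formed by the two co-occurring terms and drops the isolated one, which is the kind of smoothing the reduction is meant to achieve.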
Any given query will be represented by the centroid of the vectors of its terms. The calculation of distances is shown in Fig. 5.

Fig. 5. Example of distances between terms.
The decision on the optimal number of clusters was left to the user, who is considered the expert. However, this decision is subjective, especially as the number of objects increases: if too few clusters are selected, the resulting clusters are heterogeneous and artificial, while if too many are selected, their interpretation is often complicated.
Dendrograms are used to show clusters; they are tree diagrams that represent the hierarchical structure of the data. Cluster analysis, through a dendrogram, of the conversations that occur in a Twitter search allows us to see the semantic relationships between words and how they relate to each other. That is, it makes it possible to see “what conversations” are being produced and not just which words are most used (see Fig. 6).
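The effect of a user-chosen number of clusters can be sketched with a small agglomerative procedure that repeatedly merges the two closest groups until k groups remain. The Python example below is an illustration only: the term coordinates are invented, and complete linkage over Euclidean distances is an assumption, since the linkage rule is not stated here.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerate(vectors, labels, k):
    """Merge the two closest clusters (complete linkage) until only k remain."""
    clusters = [[i] for i in range(len(vectors))]
    merges = []  # (height, members) of each merge, i.e. the inner nodes of the dendrogram

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        height, a, b = min(pairs)
        merged = clusters[a] + clusters[b]
        merges.append((height, sorted(labels[i] for i in merged)))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return [sorted(labels[i] for i in c) for c in clusters], merges

# Invented 2-D coordinates for five terms (not taken from the real data).
labels = ["miami", "cuba", "florida", "irma", "storm"]
vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 10.0)]

groups, merges = agglomerate(vectors, labels, k=3)
print(sorted(groups))
print([height for height, _ in merges])
```

Asking for a smaller k forces further merges at greater heights, which is the trade-off between heterogeneous clusters and hard-to-read ones described above.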

Fig. 6. Dendrogram for the tweets obtained for the hashtag Irma.
The example given in Fig. 6 is simple because the matrix was reduced in order to present a small dendrogram that is clear and easy for the user to understand. However, dendrograms can show a lot of information according to user needs.
In the dendrogram shown in Fig. 6 it is possible to see the main topics of the tweets. The terms “miami”, “latest”, “hurricaneirma”, “cuba” and “hurricane” are clustered into one group. In the next group the terms “florida” and “irma” are together. Finally, “track”, “now” and “storm” are in the last cluster.
Comparing the query #Irma from one day to the next reveals substantial differences (see Fig. 7). In this case, the term-document matrix is composed of 263 terms and 10000 tweets. In the dendrogram shown in Fig. 8, one cluster corresponds to the term “irma”. The next one is composed of the term “water”, and the last one reflects what is happening in the news, with the relevant terms “news”, “miami”, “key”, “hurricane”, “florida”, “downtown”, “breaking” and “completely”.
Term document matrix: composed of 263 terms and 10000 tweets
Non-/sparse entries: 19075/2610925
Sparsity: 99%
Maximal term length: 20
Weighting: term frequency (tf)

Fig. 7. Matrix with terms and tweets for the hashtag Irma, one day after Fig. 4.

Fig. 8. Dendrogram for the tweets obtained for the hashtag Irma, one day after Fig. 6.
Conclusions
As a consequence of the great growth that society has experienced in generating information and in the capacity to store it, text mining is becoming more and more frequent, since much of the information gathered today is in the form of text. Nevertheless, there is a need to develop new methods and techniques to interpret information quickly.
Automatic processing of unstructured data is a difficult task for several reasons. Among them is its high dimensionality; for this reason, many statistical techniques include among their tasks reducing the size of the documents, in our case tweets, without changing their semantic content.
This article describes the work done for the automatic identification of the topics and sub-themes that arose around an event on Twitter. One of the advantages offered by the use of LSA is the possibility of analyzing and obtaining information not only from structured data, but also from texts. Also, the visualization of subthemes in the form of dendrograms allows users to interpret information and to select the number of clusters according to their needs.
Further work should consider the automatic comparison of the generated dendrograms, which could be of interest for decision making. In addition, the user will be able to select the size of the dendrogram according to the number of terms. The generation of dendrograms using LSA also has to be evaluated against current word embedding models.
