GOW-Stream: A novel approach of graph-of-words based mixture model for semantic-enhanced text stream clustering

Abstract

Recently, the rapid growth of social networks and online news resources on the Internet has made text stream clustering an important task in multiple domains (e.g., text retrieval diversification, social event detection, and text summarization). Unlike traditional static text clustering, text stream clustering faces specific challenges related to the rapid change of topics/clusters and the high velocity of incoming document batches. Recent well-known model-based text stream clustering models, such as DTM, DCT, and MStream, evaluate words independently, which means they largely ignore the relations between words while sampling clusters/topics. This tends to reduce overall model accuracy, especially for short text documents such as comments and microblogs in social networks. To tackle these problems, in this paper we propose a novel graph-of-words (GOW) based text stream clustering approach, called GOW-Stream. Applying the common GOWs generated from each document batch while sampling clusters/topics helps to overcome the word-independent evaluation challenge. GOW-Stream is designed to achieve better text stream clustering performance than recent state-of-the-art baselines. Extensive experiments on multiple real-world benchmark datasets demonstrate the effectiveness of the proposed model in terms of both accuracy and running time.

Keywords

Text stream clustering, topic model, graph-of-words

1. Introduction

Clustering is one of the common primitive tasks in text mining [1, 2, 3]. Text clustering has been found useful for numerous real-world applications, such as improving information retrieval [2], text summarization [4], search result diversification, and sampling text documents in a latent topic space. In short, text clustering is the process of partitioning a set of unlabeled documents into $k$ categories/clusters/topics. For example, in social event detection, text clustering can help to identify hot trends or frequently discussed topics in social networks (such as COVID-19 or the China-United States trade war). In text retrieval, text clustering can group relevant search results (as text documents) so that users can find the information they need more easily. We are now entering the Big Data era, with the tremendous rise of online social networks in which billions of users interact with one another every day. These online social media have facilitated the development and rapid spread of online news and digital resources to anyone who can connect to the Internet. Much of this user-generated content takes the form of short text documents, such as comments, tweets, and posts on Twitter or Facebook. These short text documents carry valuable information [1, 3] that can represent hot real-life events, such as social discussions about the spread of the COVID-19 virus, the 2020 China-India skirmishes, or the China-United States trade war.

In the past, most studies of text clustering concentrated on long, static text corpora. Models designed for this traditional setting cannot be applied directly to rapidly changing corpora of short text documents, such as comments, posts, and microblogs on social networks like Twitter and Facebook. Clustering rapidly arriving short text documents is considered more difficult than traditional static text clustering because of three main properties: the diversity of document length (from only a few words to very long), the sparsity of the text data representation, and the fast change/evolution of clusters/topics across the document batches that arrive sequentially from the stream. Moreover, traditional text clustering techniques are not designed to handle the high velocity and natural sparsity of short text collections that arrive as streams from social media resources.

Recently, many researchers have paid attention to text stream clustering in order to achieve better performance in terms of both clustering accuracy and time efficiency. Topic modelling is one of the most common approaches to text stream clustering. Topic modelling based models assume that text documents are generated by a mixture model. Then, by estimating the model’s parameters with techniques such as Gibbs Sampling (GS) or Sequential Monte Carlo (SMC), the distributions of topics/clusters of a given text corpus can be obtained. Inspired by the original Latent Dirichlet Allocation (LDA) [5] model, several extensions have been proposed for text stream topic modelling, such as the well-known DTM [6], TM-LDA [7], ST-LDA [8], DCT [9], and MStream/MStreamF [10]. These mixture model-based techniques infer the topic distributions over the documents of a given text stream to fulfil the clustering task. However, LDA-based models such as DTM, TM-LDA, and ST-LDA are considered unsuitable for short text documents. Because of an inherent drawback of LDA-based techniques, the topic-document mixture model needs a reasonable number of common words in each document to infer high-quality topics. Therefore, these LDA-based models only achieve good performance on long text stream documents with sufficiently rich context. More recently, the DCT and MStream/MStreamF models were proposed to overcome the challenge of short-text stream clustering; however, these models still largely ignore the relations between words while inferring the topic/cluster distributions of given documents.

1.1 Problem definitions

There are two key problems of text stream clustering which have been investigated by researchers in the past. The first is related to short documents in text streams, and arises especially in topic modelling based approaches. The second is the lack of word dependency evaluation while inferring the topic/cluster distributions over text documents.

1.1.1 Shortage in short-text stream clustering

In topic modelling/mixture model approaches to text stream clustering, cluster/topic inference relies mostly on the content (the distributed words) of the documents. The content of the documents in a stream must be rich enough (have a reasonable number of occurring words) to properly infer the multinomial topic distribution of each document. Therefore, when documents contain few words, overall model accuracy decreases significantly. Recent research demonstrates that most topic modelling/mixture model based text stream clustering techniques cannot achieve good performance on short text documents with only a few words, such as comments or microblogs on social networks. In addition, one of the major difficulties in clustering streaming data is the rapid change of topics/clusters over time, such as hot trends or frequently discussed topics on social networks. The topic/cluster distributions of text streams are therefore dynamic over time, which is also known as “concept drift”. The short text documents within each streaming batch might cover different topics and are sparse in their raw structure. Within topic modelling based approaches, properly choosing the number of clusters for each document batch of a stream, given the diversity of textual structure and covered topics, is not an easy task. Moreover, applying a fixed number of clusters/topics to all document batches of a text stream, as previous topic modelling approaches do, is inflexible and cannot deal with the concept drift problem. Hence, identifying changes in the topic distribution over very short text documents such as comments (Facebook) or tweets (Twitter) is an extremely challenging task that has attracted much attention from researchers in recent years.

1.1.2 Lack of word dependency evaluation

Besides the challenges related to concept drift in short-text stream clustering, the lack of word dependency evaluation is also considered a major drawback of recent text stream clustering approaches. In most model-based text stream clustering techniques, the words of a document are evaluated separately, without considering their order of occurrence or their relationships within specific textual contexts. A text document is a complex natural-language structure: depending on language usage, the words in each document are organized following a specific systematic structure. Therefore, different word orders or combinations (relationships between words) might carry different semantic meanings, which influence the topics covered by the documents in which they occur. A common assumption of model-based text stream clustering techniques is that documents which share the same sets of common words tend to choose the same topics/clusters; this is the traditional bag-of-words (BOW) representation. The major drawback of the BOW representation is that it largely ignores word order (e.g., “the lazy fox jumps over the brown dog” is different from “the brown dog jumps over the lazy fox”) and word relationships (such as the combined words “United States” or “corona virus”). Therefore, sampling the distribution of common words over documents during topic/cluster inference without considering the order in which words occur might reduce clustering accuracy. Extending the inference process with word dependency evaluation within each document’s context should therefore improve the quality of text stream clustering output.
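The word-order limitation of BOW can be checked on the two example sentences with a minimal Python sketch. The helper names `bag_of_words` and `ordered_pairs` are illustrative assumptions and not part of any model discussed in this paper; whitespace tokenization is assumed for simplicity.

```python
from collections import Counter


def bag_of_words(text):
    """Word counts only; word positions are discarded."""
    return Counter(text.lower().split())


def ordered_pairs(text):
    """Directed adjacent-word pairs, which keep local word order."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


a = "the lazy fox jumps over the brown dog"
b = "the brown dog jumps over the lazy fox"

# Identical bags of words, but different ordered word pairs.
same_bow = bag_of_words(a) == bag_of_words(b)
same_pairs = ordered_pairs(a) == ordered_pairs(b)
print(same_bow, same_pairs)
```

Under BOW the two sentences are indistinguishable, whereas the ordered pairs differ (for instance, only the first sentence contains the pair "fox" followed by "jumps").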

1.2 Our contributions

To meet these challenges, in this paper we propose a novel mixture model-based text stream clustering approach, called GOW-Stream, which evaluates the graphs-of-words (GOWs) occurring in a given text corpus and is inspired by our previous work [11]. GOW-Stream is designed to improve both accuracy and running time for text stream clustering by thoroughly evaluating the relationships between words while inferring clusters. The contributions of this paper can be summarized in three main points:

•
We propose applying an n-gram text-to-graph (text2graph) transformation together with a frequent subgraph mining (FSM) technique to extract common GOWs from a given text corpus. The occurrences of common GOWs in each text document are then used to support the estimation of the distributions of topics/clusters over documents.
•
Next, we formally define the mechanism of GOW-Stream, a mixture model-based approach that handles short-text stream clustering by combining word-independent (separate words in each document) and word-dependent (co-occurring words in common GOWs) evaluations. GOW-Stream not only tackles the challenges related to the natural concept drift of text streams, but also achieves better accuracy and running time than previous models based on word-independent evaluation.
•
Extensive experiments on real-world standard datasets demonstrate the efficiency and effectiveness of the proposed GOW-Stream model for short-text stream clustering in comparison with recent state-of-the-art algorithms, such as DTM [6], Sumblr [4] and MStream [10].

The rest of the paper is organized as follows. In the second part, we review recent studies and common trends in text stream clustering, and discuss the pros and cons of recently proposed text stream models, which are the main motivation for our contributions. In the third part, we formally describe the methodology of the proposed GOW-Stream model, with details of its concepts and implementation. In the fourth part, we present empirical studies of the performance of GOW-Stream in comparison with recent well-known text stream clustering baselines on two labelled benchmark datasets, together with experimental studies of the model’s hyper-parameter sensitivity and running time. Finally, in the last part, we conclude our work and highlight some possible directions for future improvement.

2. Related works and motivations

Recent studies of text stream clustering fall into three main categories: topic modelling based, dynamic mixture model-based, and similarity-based approaches.

2.1 Traditional topic modelling based approach

As the earliest approach to text stream clustering, topic modelling is a family of algorithms that discover latent topics/thematic structure in given text documents. Latent Dirichlet Allocation (LDA) [5] is one of the best-known topic modelling algorithms; it infers latent topics, which are probability distributions over words, from a set of text documents. LDA represents each discovered latent topic as a distribution over words and each document as a distribution over latent topics. Research has demonstrated that topic modelling can effectively model the temporal nature of topics/clusters in text streams and deal with the sparsity of documents. Many LDA-based extensions have been introduced to cope with the dynamic nature of topics/clusters across the batches of a text stream, such as topics over time (TOT) [12], the dynamic topic model (DTM) [6], the topic tracking model (TTM) [13], temporal LDA (TM-LDA) [7], and streaming LDA (ST-LDA) [8]. These models can effectively infer dynamic topics from long documents in streams. However, they need a fixed number of topics/clusters to be specified in advance for all document batches of a stream, which prevents them from dealing with the change of topics/clusters over time.

2.2 Dynamic mixture model-based approach

Since the number of topics/clusters varies with time and across the document batches of a stream, a fixed number of topics is a major limitation of LDA-based techniques in dealing with the natural topic evolution of text streams. To overcome this drawback, continuous improvements related to dynamic topic modelling have been proposed. This approach, also called the Dirichlet Process (DP) [14] method, is widely used for handling topic evolution in text stream clustering. Mostly inspired by LDA-based models, mixture model-based text stream clustering algorithms infer the distributions of topics/clusters over documents, which are assumed to be generated by a mixture model. Then, sampling techniques such as Gibbs Sampling or Sequential Monte Carlo are applied to estimate the model’s parameters, so as to obtain the distributions of topics/clusters over a given text stream. In other words, dynamic mixture model-based text clustering techniques mostly rely on Bayesian non-parametric methods for dynamic topic modelling. This approach has demonstrated its effectiveness in automatically discovering topics/clusters from sparse text streams. Recent well-known models of this approach include the Dirichlet-Hawkes Topic Model (DHTM) [15], the Dynamic Clustering Topic Model (DCT) [9] and the Temporal Dirichlet Process Mixture Model (TDPM) [16]. These Dirichlet process based baselines offer potential solutions to the concept/topic drift problem of text streams. However, they still have drawbacks. DHTM is considered unable to work well on short text documents. In contrast, DCT is designed to work with short-text streams; however, it cannot track the evolution of topics/clusters across the document batches of a stream, where the number of topics/clusters might change over time. Hence, it fails to deal with the concept drift challenge. TDPM is an offline text clustering framework which requires the whole set of text documents from a given stream; it is therefore unsuitable for clustering high-velocity incoming text. More recently, MStream/MStreamF, built on the Dirichlet Process Multinomial Mixture Model (DPMM) [17], was proposed to effectively predict latent topics/clusters from short-text streams. However, MStream/MStreamF still shares the common shortcoming of relying on an independent word representation while inferring topics/clusters. Sparsity and ignored word dependencies might make the topics/clusters discovered from text streams ambiguous.

2.3 Vector space representation based approach

Similar to the classical text clustering approach for static text corpora, the given text documents in streams are transformed into feature vectors, and off-the-shelf measures such as cosine similarity or Euclidean distance are then applied to measure the similarity between text documents and given topics/clusters. The vector space representation (VSR) based approach has been widely studied for high-velocity text stream clustering, with well-known similarity-based models such as CluStream [18], DenStream [19], and Sumblr [4]. However, these VSR-based text clustering techniques have two major drawbacks. The first is related to the concept/topic drift challenge: the number of topics/clusters must be specified in advance. The second is related to the initial document-topic/cluster similarity threshold: a proper similarity threshold must be selected manually in order to decide whether a new text document from a stream belongs to a specific topic/cluster or not. Moreover, the quality of the document vectors is also influenced by document length. Because of these severe challenges, the VSR-based approach is less attractive than the mixture model-based approach for text stream clustering.
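To make the threshold drawback concrete, the following is a minimal Python sketch of a generic similarity-based assignment loop. It is not an implementation of CluStream, DenStream, or Sumblr; the function names `cosine` and `assign_stream`, the raw term-frequency vectors, and the summed-vector cluster representation are all simplifying assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_stream(docs, threshold):
    """Assign each arriving document to its most similar cluster,
    or open a new cluster when the best similarity is below the threshold."""
    clusters = []  # each cluster is the summed term-frequency vector of its documents
    labels = []
    for doc in docs:
        vec = Counter(doc.lower().split())
        sims = [cosine(vec, c) for c in clusters]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            clusters[best].update(vec)
            labels.append(best)
        else:
            clusters.append(Counter(vec))
            labels.append(len(clusters) - 1)
    return labels


docs = ["trade war tariffs", "trade war talks", "virus outbreak news"]
loose = assign_stream(docs, threshold=0.5)
strict = assign_stream(docs, threshold=0.9)
print(loose, strict)
```

On this toy stream the first two documents share a cluster at a threshold of 0.5 but are split at 0.9, which illustrates how strongly the result depends on a manually chosen threshold.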

3. Methodology

In this part, we formally present a mixture model-based approach, called GOW-Stream, which incorporates word dependency evaluation by applying the distributions of common graphs-of-words (GOWs) over the documents of a given text stream. GOW-Stream is a Dirichlet Process Multinomial Mixture (DPMM) based text stream clustering method designed to improve the quality of clustering short, sparse text streams. First, we briefly introduce an approach for extracting common GOWs from text documents by applying the text2graph transformation and a frequent subgraph mining (FSM) technique. Then, we present a novel topic/cluster inference technique, mostly inspired by the previous MStream/MStreamF model, in which documents are represented by multinomial distributions over both their occurring words and their common GOWs. Thereby, both the independent words and the common GOWs of each document in a given text stream are considered in the process of topic/cluster generation.

3.1 Preliminaries and background concepts

In this section, we briefly introduce the preliminary concepts on which our proposed GOW-Stream model builds.

3.1.1 Graph-of-words (GOW) representation

Text2graph transformation. GOW-based text document representation is a well-known NLP approach which transforms a text document ($d$) into a graph-based structure, denoted as $G_d=(V_d,E_d)$, where the set of nodes ($V_d$) represents the set of unique words occurring in document ($d$), $W=\{w_1,w_2,\dots,w_{|W|}\}$, and the set of edges ($E_d$) represents the co-occurrence relations between these words. The co-occurrence relations between words can be extracted flexibly within a predefined sliding window. This text2graph transformation is a statistical approach to representing the co-occurrence relationships between words in a text, without considering the semantic meaning of the connections between words. The transformed textual graphs can be directed or undirected. The simplest implementation of the GOW representation uses an undirected graph to represent the co-occurrence relations between words (illustrated in Fig. 1A). If the order in which words occur within documents must be considered, the constructed textual graphs should be directed. In more advanced implementations of the text2graph approach, the constructed textual graphs can be weighted, to take into account the co-occurrence frequency of two words, and labelled with the part-of-speech annotations of words (illustrated in Fig. 1B).

Frequent common subgraphs (FCS) as unsupervised document features. Given the set of textual graphs ($\{G_1,G_2,\dots,G_{|D|}\}$) constructed from a text corpus ($D$), let $V$ and $E$ be the set of distinct occurring words $W$ (the graph nodes) and the set of their co-occurrence relations, respectively. We then apply a frequent subgraph mining technique, such as gSpan or FFSM, to extract a set of common subgraphs, denoted as $F=\{G^{\prime}_1,G^{\prime}_2,\dots,G^{\prime}_{|F|}\}$, where each common subgraph $G^{\prime}_f=(V^{\prime}_f,E^{\prime}_f)$, with $V^{\prime}_f\subseteq V$ and $E^{\prime}_f\subseteq E$, is considered a distinctive feature of the documents in which $G^{\prime}_f$ is included. Unlike the use of common words as distinctive features for text representation, also known as the bag-of-words (BOW) representation, the use of common textual subgraphs is considered more semantic because it can capture word dependency and word order relationships. Therefore, in combination with the classical BOW-based representation, a document ($d$) is now decomposed into the following tuple (as shown in Eq. (1)):

$\displaystyle\langle{{W}}_{{d}}{}:{{N}}_{{d}}|{{F}}_{{d}}\rangle$ (1)

Where,

•

$W_d$ and $N_d$ present the set of unique words occurring in the given document ($d$) and their frequencies, respectively. The frequencies form a vector whose entry $N^w_d$ is the occurrence frequency of a specific word ($w$) in document ($d$), and the total number of word occurrences in the document is $N_d=\sum_{w\in d}N^w_d$.

•

$F_d$ presents the set of common GOWs contained in the given document ($d$).
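The tuple of Eq. (1) can be sketched as a small Python data structure. This is an illustration under stated assumptions: the function name `decompose` and the dictionary keys are hypothetical, each graph is stored as a set of undirected word-pair edges, and a common GOW is treated as contained in a document when all of its edges appear in the document graph (a shortcut that is valid here only because every word labels exactly one node).

```python
from collections import Counter


def decompose(tokens, doc_edges, common_gows):
    """Build the tuple <W_d : N_d | F_d> for one tokenized document.

    doc_edges   -- set of frozenset({w1, w2}) edges of the document graph
    common_gows -- list of common GOWs, each given as a set of such edges
    """
    counts = Counter(tokens)
    return {
        "W_d": sorted(counts),             # unique words
        "N_d_w": dict(counts),             # per-word frequencies N_d^w
        "N_d": sum(counts.values()),       # total number of word occurrences
        # indices of the common GOWs whose edges all occur in the document
        "F_d": [i for i, g in enumerate(common_gows) if g <= doc_edges],
    }


tokens = ["corona", "virus", "spreads", "corona", "virus"]
doc_edges = {
    frozenset(("corona", "virus")),
    frozenset(("virus", "spreads")),
    frozenset(("spreads", "corona")),
}
common_gows = [
    {frozenset(("corona", "virus"))},
    {frozenset(("trade", "war"))},
]
doc = decompose(tokens, doc_edges, common_gows)
print(doc)
```

Each contained GOW is recorded once per document, so no GOW frequency needs to be stored.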

Figure 1.

Illustrations of text document to graph-based structure transformation (text2graph).

Algorithm 1. Extracting common GOWs from the given document set (D)
Input: • Document set $D$ • Sliding window ($s$) • Minimum support value ($\sigma$). Output: The set of common GOWs for document set ($D$), denoted as: $F_D$
Function ExtractGOWs($D, s, \sigma$):
  Initialize: $G_D=[]$
  For document ($d$) in $D$:
    Initialize: $G_d=$ Text2Graph($d, s$)
    Update: $G_D$.append($G_d$)
  End for
  Initialize: $F_D=$ gSpanAlgorithm($G_D,\sigma$)
  Return $F_D$
End function
Function Text2Graph($d, s$):
  Initialize: $G_d$ #graph-based structure
  Initialize: $W_d=[]$, $\text{Seq}_d=[]$
  For word ($w$) in tokenize($d$):
    If $w$ not in $W_d$: $W_d$.append($w$)
    Update: $\text{Seq}_d$.append($w$)
  End for
  Update: $G_d$.nodes.create($W_d$) #creating the set of nodes from the set of unique words
  For position ($p$) in range(0, $|\text{Seq}_d|$):
    For $i$ in range(1, $s+1$): #offsets 1..s, so a word is never linked to itself at offset 0
      If $p-i\geq 0$: Update: $G_d$.edges.create($\text{Seq}_d[p]$, $\text{Seq}_d[p-i]$)
      If $p+i<|\text{Seq}_d|$: Update: $G_d$.edges.create($\text{Seq}_d[p]$, $\text{Seq}_d[p+i]$)
    End for
  End for
  Return $G_d$
End function

Each common GOW in $F_d$ is counted only once per document ($d$), so there is no need to calculate the occurrence frequency of common GOWs in each document. Algorithm 1 illustrates the steps for extracting common GOWs from a given raw text corpus, using the gSpan algorithm for frequent subgraph mining.
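The Text2Graph step of Algorithm 1 can be sketched in plain Python as follows. Two points are assumptions made for illustration: the window is interpreted as linking each word to the next $s$ words (for undirected edges this also covers the previous $s$ words), and the mining step is replaced by a much simpler stand-in, `frequent_edges`, that keeps only single-edge subgraphs reaching the minimum support. This stand-in is not gSpan; a full frequent subgraph miner would also return larger connected subgraphs.

```python
from collections import Counter


def text2graph(tokens, s):
    """Undirected co-occurrence graph of one tokenized document.

    Returns (nodes, edges), where each edge is a frozenset of two words
    that occur within s positions of each other.
    """
    nodes = set(tokens)
    edges = set()
    for pos, w in enumerate(tokens):
        for i in range(1, s + 1):
            if pos + i < len(tokens) and tokens[pos + i] != w:
                edges.add(frozenset((w, tokens[pos + i])))
    return nodes, edges


def frequent_edges(docs, s, min_support):
    """Simplified stand-in for frequent subgraph mining: single edges
    that occur in at least min_support documents."""
    support = Counter()
    for doc in docs:
        _, edges = text2graph(doc.lower().split(), s)
        support.update(edges)  # a set, so each edge counts once per document
    return {e for e, c in support.items() if c >= min_support}


docs = [
    "corona virus spreads fast",
    "new corona virus cases",
    "trade war talks",
]
common = frequent_edges(docs, s=1, min_support=2)
print(common)
```

Because every word labels exactly one node, an edge is fully identified by its two words, which keeps the support counting simple.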

3.1.2 Text stream clustering and dynamic mixture model based approach

Mixture model-based text stream clustering. In general, text stream clustering differs substantially from classical static text clustering, because documents of different lengths arrive continuously over time. For common text stream data, such as comments and microblogs from social networks, the number of documents, the document length and the covered clusters/topics might be diverse and change rapidly at different times ($t$). Formally, each arriving document set, also known as a streaming batch, comes to the system at a specific time ($t$) and is denoted as $D_t=\{d_1,d_2,\dots,d_{|D_t|}\}$, where each document $d_t$ contains a set of unique words $N_{d_t}=\{w_1,w_2,\dots,w_{|N_{d_t}|}\}$. Over all document sets that arrive sequentially from the text stream, $D=\{D_1,D_2,\dots,D_T\}$ with $T\to\infty$, the ultimate objective of mixture model-based text stream clustering is to group relevant documents in $D_t$ into topics/clusters, denoted as $Z_t=\{z_1,z_2,\dots,z_K\}$ with $K\to\infty$. Following the concept/topic drift assumption, the number of topics/clusters $K$ of a given text stream changes over time ($t$). Following the traditional topic modelling approach, each discovered topic/cluster is represented as a multinomial distribution over the unique words that appear in the $t$-th document batch, denoted as $z_t=\{\textit{prob}(w_1),\textit{prob}(w_2),\dots,\textit{prob}(w_{|N_{d_t}|})\}$, where $\textit{prob}(\cdot)$ stands for a probability. Then, each document $d_t$ is represented as $d_t=\{\textit{prob}(z_1),\textit{prob}(z_2),\dots,\textit{prob}(z_{|Z_t|})\}$, the set of probabilities of the extracted clusters/topics. However, in order to cope with the changes of clusters/topics in each arriving document batch, each document $d_t$ is assigned only to its highest-likelihood topic/cluster $z_t$. Hence, for two different topics/clusters $a$ and $b$ with $a\neq b$, whose assigned documents are $z^a_t=\{d^a_1,d^a_2,\dots,d^a_n\}$ and $z^b_t=\{d^b_1,d^b_2,\dots,d^b_n\}$ (each document $d^a_t$, $d^b_t\in D_t$), we have $z^a_t\cap z^b_t=\emptyset$.

Dirichlet Process [14] & Pólya urn scheme. Commonly applied in mixture model-based text stream approaches, the Dirichlet Process (DP) is a non-parametric stochastic process for modelling data. It supports drawing a sample $\mathcal{N}$ from a distribution $G$ that is itself generated from a given base distribution $G_0$, denoted as $G\sim DP(\alpha,G_0)$, where $\alpha$ is a concentration hyper-parameter controlling the distribution of each drawn sample $\mathcal{N}$. To draw a sequence of samples $\{\mathcal{N}_1,\mathcal{N}_2,\dots,\mathcal{N}_n\}$ from a distribution $G$, the Pólya urn scheme is applied as follows (as shown in Eq. (2)):

$\displaystyle\mathcal{N}_n\mid\mathcal{N}_{1:n-1}\sim\frac{\alpha}{\alpha+n-1}G_0+\frac{\sum^{n-1}_{k=1}\delta(\mathcal{N}_n-\mathcal{N}_k)}{\alpha+n-1}$ (2)

Where,

•

${n}$ is the number of draws from distribution ${G}$.

•

${\delta}(x)$ is the indicator function, where ${\delta}(x)=1$ when $x=0$ , otherwise ${\delta}(x)=0$ .
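The sequential draws of Eq. (2) can be simulated with a short Python sketch. The function name `polya_urn` and the `base_draw` callback are assumptions introduced for illustration; `base_draw` stands in for sampling a fresh value from the base distribution $G_0$, and here it simply issues a new integer identifier so that repeated values are easy to recognize.

```python
import itertools
import random


def polya_urn(n, alpha, base_draw, rng):
    """Draw n samples sequentially following the Pólya urn scheme.

    With i earlier draws, a fresh value is taken from the base distribution
    with probability alpha / (alpha + i); otherwise one of the earlier draws
    is repeated, chosen uniformly (so popular values are repeated more often).
    """
    draws = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            draws.append(base_draw(rng))
        else:
            draws.append(rng.choice(draws))
    return draws


ids = itertools.count()
draws = polya_urn(50, alpha=1.0, base_draw=lambda rng: next(ids), rng=random.Random(42))
num_distinct = len(set(draws))
print(num_distinct, "distinct values among", len(draws), "draws")
```

The first draw always comes from the base distribution, since the probability of a fresh value is $\alpha/\alpha=1$ when no earlier draws exist; larger values of $\alpha$ produce more distinct values.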

Repeating the $n$ draws from distribution $G$ yields a set of $K$ distinct values, with $K<n$, which partitions the $n$ draws into $K$ topics/clusters. The distribution over these $K$ partitions follows a well-known process called the Chinese Restaurant Process (CRP). While the CRP describes the draws from distribution $G$, the stick-breaking process describes $G$ itself as $G(\mathcal{N})=\sum^{\infty}_{k=1}\theta_k\delta(\mathcal{N}-\mathcal{N}_k)$ with $\mathcal{N}_k\sim G_0$. The mixture weights $\theta=\{\theta_k\}^{\infty}_{k=1}$ are constructed by the GEM (Griffiths, Engen, and McCloskey) distribution of the DP as $\theta\sim\textit{GEM}(\gamma)$. Then, the stick-breaking construction is applied to the generative process of the DPMM model as follows (as shown in Eq. (3)):

$\displaystyle\begin{aligned}\theta\mid\gamma&\sim\textit{GEM}(\gamma)\\ \mathcal{N}_k\mid\beta&\sim\textit{Dirichlet}(\beta),\quad k\to\infty\\ z_d\mid\theta&\sim\textit{Mult}(\theta),\quad k\to\infty\\ d\mid z_d,\{\mathcal{N}_k\}^{\infty}_{k=1}&\sim\textit{prob}(d\mid\mathcal{N}_{z_d})\end{aligned}$ (3)

Where,

•

$z_d$ presents the cluster which generates document ($d$).

•

$\textit{prob}(d\mid\mathcal{N}_{z_d})$ is the probability that a given document ($d$) is generated by cluster $z_d$, defined as: $\textit{prob}(d\mid\mathcal{N}_{z_d})=\prod_{w\in d}\textit{Mult}(w\mid\mathcal{N}_{z_d})$.

From Eq. (3), the Bayesian assumption is that the words ($W_d$) of each document ($d$) are generated independently given the topic/cluster $z$ to which ($d$) is assigned. Then, the sequential draws of samples can be obtained by the CRP. This method therefore assumes that the probability of the words in each document is evaluated independently, without considering their positions or co-occurrence relationships. Figure 2 illustrates the graphical generative process of our proposed GOW-Stream model.
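As a hedged illustration of the mixture weights $\theta\sim\textit{GEM}(\gamma)$ in Eq. (3), the standard stick-breaking construction can be sketched in Python. The function name `stick_breaking` and the finite truncation level are assumptions made so that the sketch terminates; the construction in the text has infinitely many components.

```python
import random


def stick_breaking(gamma, truncation, rng):
    """Truncated stick-breaking construction of GEM(gamma) weights.

    Starting from a stick of length 1, each step breaks off a
    Beta(1, gamma) fraction of what remains and uses it as the next weight.
    Returns the weights and the length of the leftover stick.
    """
    weights = []
    remaining = 1.0
    for _ in range(truncation):
        frac = rng.betavariate(1.0, gamma)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return weights, remaining


weights, remaining = stick_breaking(gamma=1.0, truncation=20, rng=random.Random(0))
print(sum(weights) + remaining)
```

The weights and the leftover stick always sum to one (up to floating-point error), and a smaller $\gamma$ concentrates the mass on the first few clusters.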

Figure 2.

Generative process of our proposed GOW-Stream model.

3.2 Proposed GOW-Stream model

In this section, we formally present our main contribution: a text stream clustering technique which improves the quality of the topics/clusters identified in a given text stream by utilizing the distributions of the extracted common GOWs over documents. Our proposed GOW-Stream is a DPMM-based model inspired by the previous MStream/MStreamF model.

3.2.1 GOW-based cluster/topic representation

In the traditional approach to static text corpus clustering, clusters are represented as the means of document sets in a given vector space. In contrast, in the recent mixture model approach, clusters are represented as cluster feature vectors, or simply cluster vectors. A cluster vector is a tuple, denoted as $\langle n_z:\overrightarrow{n_z},m_z\rangle$, where $n_z:\overrightarrow{n_z}$ presents the number of words in topic/cluster $z$ and their corresponding occurrence frequencies in that cluster, respectively, and $m_z$ is the number of documents assigned to topic/cluster $z$. To combine this with the distribution of the common GOWs extracted from documents, we restructure the cluster feature as follows (as shown in Eq. (4)):

$\displaystyle\langle{{f}}_{{z}}{}:\overrightarrow{{{f}}_{{z}}},{{n}}_{{z}}{}:% \overrightarrow{{{n}}_{{z}}},{{m}}_{{z}}\rangle$ (4)

Where,

• $f_{z}:\overrightarrow{f_{z}}$ denotes the number of extracted common GOWs assigned to topic/cluster $z$ and their corresponding occurrence frequencies, respectively.
• $n_{z}:\overrightarrow{n_{z}}$ denotes the number of words assigned to topic/cluster $z$ and their corresponding occurrence frequencies, respectively.
• $m_{z}$ is the number of documents assigned to topic/cluster $z$ .

Similar to the previous MStream/MStreams models, this cluster vector representation has two important properties: it is addible and removable. These properties of our proposed GOW-Stream model are described as follows (Eqs. (5a) and (5b)):

$\displaystyle\begin{aligned} f^{g}_{z}&=f^{g}_{z}+|F^{g}_{d}|,\ \text{with }\forall g\in d\\ f_{z}&=f_{z}+|F_{d}|\\ n^{w}_{z}&=n^{w}_{z}+N^{w}_{d},\ \text{with }\forall w\in d\\ n_{z}&=n_{z}+N_{d}\\ m_{z}&=m_{z}+1\end{aligned}$ (5a)

$\displaystyle\begin{aligned} f^{g}_{z}&=f^{g}_{z}-|F^{g}_{d}|,\ \text{with }\forall g\in d\\ f_{z}&=f_{z}-|F_{d}|\\ n^{w}_{z}&=n^{w}_{z}-N^{w}_{d},\ \text{with }\forall w\in d\\ n_{z}&=n_{z}-N_{d}\\ m_{z}&=m_{z}-1\end{aligned}$ (5b)

Where,

• $f^{g}_{z}$ and $n^{w}_{z}$ are the frequency of the common GOW ( $g$ ) and the frequency of the word ( $w$ ) in the cluster ( $z$ ), respectively.
• $F^{g}_{d}$ and $N^{w}_{d}$ denote the occurrences of the common GOW ( $g$ ) and of the word ( $w$ ) in a given document ( $d$ ), respectively. Each common GOW ( $g$ ) is considered to occur only once in a document, therefore the value of $|F^{g}_{d}|$ is always 1.
• $F_{d}$ and $N_{d}$ are the set of common GOWs and the number of words in the given document ( $d$ ), respectively.
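The addible and removable properties of Eqs. (5a) and (5b) can be illustrated with a minimal Python sketch. The class name `ClusterFeature` and its field names are illustrative assumptions and are not taken from the original implementation; a document is assumed to be given as a list of words plus a set of common GOW identifiers.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ClusterFeature:
    """Cluster feature vector of Eq. (4); all names here are hypothetical."""
    gow_freq: Counter = field(default_factory=Counter)   # f_z^g per common GOW g
    word_freq: Counter = field(default_factory=Counter)  # n_z^w per word w
    f: int = 0  # f_z: total common GOW occurrences in the cluster
    n: int = 0  # n_z: total word occurrences in the cluster
    m: int = 0  # m_z: number of documents in the cluster

    def add(self, words, gows):
        """Eq. (5a): add a document (word list, set of common GOW ids)."""
        gows = set(gows)              # each GOW counts once per document
        self.word_freq.update(words)  # n_z^w += N_d^w
        self.gow_freq.update(gows)    # f_z^g += |F_d^g| = 1
        self.n += len(words)
        self.f += len(gows)
        self.m += 1

    def remove(self, words, gows):
        """Eq. (5b): undo a previous add() of the same document."""
        gows = set(gows)
        self.word_freq.subtract(words)
        self.gow_freq.subtract(gows)
        # Unary plus drops entries whose count fell to zero.
        self.word_freq = +self.word_freq
        self.gow_freq = +self.gow_freq
        self.n -= len(words)
        self.f -= len(gows)
        self.m -= 1
```

Because both operations only touch counters, adding or removing a document costs time proportional to the document length, which is what makes the representation convenient for streaming updates.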

3.2.2 Topic/cluster inference of GOW-Stream model

The most important part of a mixture model-based text stream clustering algorithm is the definition of the relationship between documents and topics/clusters, i.e., the probability that a new document is assigned to a specific topic/cluster. The traditional approach of similarity-based text clustering mainly uses a similarity threshold to control the topic/cluster assignment of each text document in a given stream. However, selecting a proper similarity threshold is challenging due to concept/topic drift and the variety of document lengths in real-world text streams. Inspired by previous works, we apply a dynamic cluster inference technique based on DPMM to obtain the probability of a document ( $d$ ) choosing an existing cluster ( $z$ ), denoted as: $\textit{prob}(z_{d}=z|\overrightarrow{z}_{\neg d},\overrightarrow{d},\alpha,\beta)$ . With $\overrightarrow{d}$ the vector of documents collected from the stream and $\overrightarrow{z}$ the set of recorded clusters, this probability is proportional to the following (Eq. (6)):

$\displaystyle\textit{prob}(z_{d}=z|\overrightarrow{z}_{\neg d},\overrightarrow{d},\alpha,\beta)\propto\textit{prob}(z_{d}=z|\overrightarrow{z}_{\neg d},\alpha)\cdot\textit{prob}(d|z_{d}=z,\overrightarrow{d}_{z,\neg d},\beta)$ (6)

Where,

• $\overrightarrow{z}_{\neg d}$ is the set of recorded cluster assignments, excluding that of the given document ( $d$ ).
• $\overrightarrow{d}_{z,\neg d}$ is the set of collected documents which are assigned to the cluster ( $z$ ), excluding the given document ( $d$ ).

Adding a document to an existing topic/cluster. In Eq. (6), the first part, $\textit{prob}(z_{d}=z|\overrightarrow{z}_{\neg d},\alpha)$ , is the probability that document ( $d$ ) chooses a cluster ( $z$ ) given the cluster/topic assignments of all other documents except the current document ( $d$ ). To obtain this probability, we apply the inference techniques of classical DP-based topic models. The first part of Eq. (6) is calculated as follows (see Eq. (7)):

$\displaystyle\textit{prob}(z_{d}=z|\overrightarrow{z}_{\neg d},\alpha)\propto\frac{m_{z,\neg d}}{D-1+\alpha D}$ (7)

Where,

• $D$ is the overall number of documents in the current streaming batch.
• $m_{z,\neg d}$ is the number of documents in the current cluster ( $z$ ), excluding document ( $d$ ).

The second part of Eq. (6), $\textit{prob}(d|z_{d}=z,\overrightarrow{d}_{z,\neg d},\beta)$ , considers the relevance between the common GOWs ( $g$ ) and words ( $w$ ) of a given document ( $d$ ) and those of a given cluster ( $z$ ), and can be further derived as follows (see Eq. (8)):

$\displaystyle\textit{prob}(d|z_{d}=z,\overrightarrow{d}_{z,\neg d},\beta)=\frac{\prod_{w\in d}\prod^{N^{w}_{d}}_{j=1}(n^{w}_{z,\neg d}+\beta+j-1)}{\prod^{N_{d}}_{i=1}(n_{z,\neg d}+W\beta+i-1)}+\frac{\prod_{g\in d}\prod^{F^{g}_{d}}_{j=1}(f^{g}_{z,\neg d}+\beta+j-1)}{\prod^{F_{d}}_{i=1}(n_{z,\neg d}+F\beta+i-1)}$ (8)

Where,

• $W$ and $F$ denote the sets of occurring words and common GOWs in the current document collection ( $D$ ).
• $n^{w}_{z,\neg d}$ and $f^{g}_{z,\neg d}$ are the numbers of occurrences of the word ( $w$ ) and of the common GOW ( $g$ ) in the given topic/cluster ( $z$ ), excluding the given document ( $d$ ).
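To make Eqs. (7) and (8) concrete, the following minimal Python sketch computes the unnormalised score of Eq. (6) for an existing cluster. It is only an illustration under stated assumptions: the function names and the dictionary layout of `cluster` are hypothetical, $W$ and $F$ are taken to be the numbers of distinct words and common GOWs, and the GOW denominator uses $n_{z,\neg d}$ exactly as Eq. (8) is printed.

```python
from collections import Counter


def rising_product(start, count):
    """Return start * (start + 1) * ... * (start + count - 1); 1.0 if count == 0."""
    result = 1.0
    for j in range(count):
        result *= start + j
    return result


def existing_cluster_score(doc_words, doc_gows, cluster, D, W, F, alpha, beta):
    """Unnormalised Eq. (6) for an existing cluster: Eq. (7) times Eq. (8).

    `cluster` is a dict with keys word_freq and gow_freq (Counters), n and m,
    all computed WITHOUT the document being scored (the "not d" statistics).
    """
    prior = cluster["m"] / (D - 1 + alpha * D)            # Eq. (7)

    # Word part of Eq. (8).
    word_num = 1.0
    for w, count in Counter(doc_words).items():           # count = N_d^w
        word_num *= rising_product(cluster["word_freq"][w] + beta, count)
    word_den = rising_product(cluster["n"] + W * beta, len(doc_words))

    # GOW part of Eq. (8); each common GOW occurs once per document.
    gows = set(doc_gows)
    gow_num = 1.0
    for g in gows:
        gow_num *= cluster["gow_freq"][g] + beta
    gow_den = rising_product(cluster["n"] + F * beta, len(gows))

    # Eq. (8) adds the word term and the GOW term.
    return prior * (word_num / word_den + gow_num / gow_den)
```

In this sketch, a document without any common GOW yields empty products, so its GOW term evaluates to 1; a practical implementation would probably need to decide how to treat that case.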

Adding a document to a new topic/cluster. In the previous part, we defined the probability that a new document chooses an existing cluster. When a new document does not match any existing topic/cluster, a new cluster must be created for it, so we also define the probability of a document creating a new topic/cluster. For the DPMM-based dynamic topic/cluster inference approach, where the number of clusters is infinite, the transformation of $\theta\sim\textit{GEM}(\gamma)$ to $\theta\sim\textit{GEM}(\alpha D)$ should be applied. Therefore, the probability that a new cluster ( $K+1$ ), with $K$ the number of current topics/clusters, is created for a given document ( $d$ ) is obtained by modifying the first part (Eq. (9a)) and the second part (Eq. (9b)) of Eq. (6) as follows:

$\displaystyle\textit{prob}(z_{d}=K+1|\overrightarrow{z}_{\neg d},\alpha)\propto\frac{\alpha D}{D-1+\alpha D}$ (9a)

$\displaystyle\textit{prob}(d|z_{d}=K+1,\overrightarrow{d}_{z,\neg d},\beta)=\frac{\prod_{w\in d}\prod^{N^{w}_{d}}_{j=1}(\beta+j-1)}{\prod^{N_{d}}_{i=1}(W\beta+i-1)}+\frac{\prod_{g\in d}\prod^{F^{g}_{d}}_{j=1}(\beta+j-1)}{\prod^{F_{d}}_{i=1}(F\beta+i-1)}$ (9b)

Where,

• $K$ is the number of current topics/clusters which have been discovered from the given text stream.
• $\alpha D$ and $\beta$ are the pseudo number of documents and the pseudo number of occurrences of each word and common GOW in the newly created ( $K+1$ )-th cluster, respectively.
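The new-cluster case of Eqs. (9a) and (9b) can be sketched in the same way. As before, this is a hedged illustration rather than the original implementation: the function names are hypothetical, and $W$ and $F$ are assumed to be the numbers of distinct words and common GOWs in the batch.

```python
from collections import Counter


def rising_product(start, count):
    """Return start * (start + 1) * ... * (start + count - 1); 1.0 if count == 0."""
    result = 1.0
    for j in range(count):
        result *= start + j
    return result


def new_cluster_score(doc_words, doc_gows, D, W, F, alpha, beta):
    """Unnormalised probability of opening cluster K+1: Eq. (9a) times Eq. (9b)."""
    prior = alpha * D / (D - 1 + alpha * D)               # Eq. (9a)

    # Word part of Eq. (9b): an empty cluster has only pseudo counts beta.
    word_num = 1.0
    for count in Counter(doc_words).values():             # count = N_d^w
        word_num *= rising_product(beta, count)
    word_den = rising_product(W * beta, len(doc_words))

    # GOW part of Eq. (9b): each common GOW occurs once, so only j = 1 applies.
    n_gows = len(set(doc_gows))
    gow_num = beta ** n_gows
    gow_den = rising_product(F * beta, n_gows)

    return prior * (word_num / word_den + gow_num / gow_den)
```

With the small values of $\alpha$ used in the experiments, the prior of Eq. (9a) is small, so a new cluster is opened only when no existing cluster explains the words and common GOWs of the document well.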

In fact, our proposed GOW-Stream model combines the word and common GOW distributions in DPMM-based topic/cluster inference for text streams. By integrating the distributions of common GOWs within documents, GOW-Stream aims to capture richer semantics of the discovered clusters by utilizing the word co-occurrence relationships in text documents. This not only helps to improve the clustering accuracy but also reduces the ambiguity of the discovered clusters. The overall procedure of our proposed GOW-Stream is described in Algorithm 2.

Algorithm 2. GOW-Stream algorithm
Input:
• Document set $D_{t}$ at a specific time ( $t$ ) from a given text stream.
• Sliding window ( $s$ ) and minimum support value ( $\sigma$ ).
• Model’s hyper-parameters: $\alpha$ , $\beta$ .
Output: cluster assignments $z_{d}$ for the current batch
1: Initialize: $K=[]$ #storing feature vectors of discovered clusters of the input document batch $D_{t}$
2: Initialize: $Z=[\|D_{t}\|]$ #dictionary structure for storing document’s id – cluster’s id, e.g.: $<$ doc_1: cluster_1 $>$ , etc.
3: Initialize: $F=$ ExtractGOWs ( $D_{t},\sigma$ ) #extracting common GOWs from the given document set ( $D_{t}$ ) – see Algorithm 1
4: For document ( $d$ ) in $D_{t}$ :
5:   Set: $P_{Z,d}=[]$
6:   Set: $F_{d}=F[d]$ #getting common GOWs which are contained in current document ( $d$ )
7:   For $z_{i}$ in $K$ :
8:     Calculate: $P_{z_{i},d}=\textit{prob}(z_{i},d)$ #calculating the probability of a given document ( $d$ ) choosing an existing cluster ( $z_{i}$ ) – following Eqs (6)–(8)
9:     Update: $P_{Z,d}$ .append ( $P_{z_{i},d}$ )
10:   End for
11:   Set: $P^{\textit{existing}}_{z_{i},d}=\max_{i}(P_{Z,d})$ , with $z_{i}$ the cluster attaining this maximum
12:   Calculate: $P^{\textit{new}}_{z_{i},d}=\textit{prob}(z_{\|K\|+1},d)$ #calculating the probability of a given document ( $d$ ) creating a new cluster ( $z=z_{\|K\|+1}$ ) – following Eqs (6), (9a) and (9b)
13:   If $P^{\textit{existing}}_{z_{i},d}<P^{\textit{new}}_{z_{i},d}$ then: #creating new feature vector for new cluster $z_{\|K\|+1}$
14:     Set: $m_{z_{\|K\|+1}}=1$
15:     For common GOWs ( $g$ ) in $F_{d}$ : $\to$ Set: $f^{g}_{z_{\|K\|+1}}=\|F^{g}_{d}\|$
16:     Set: $f_{z_{\|K\|+1}}=\|F_{d}\|$
17:     For word ( $w$ ) in $W_{d}$ : $\to$ Set: $n^{w}_{z_{\|K\|+1}}=N^{w}_{d}$
18:     Set: $n_{z_{\|K\|+1}}=N_{d}$
19:     Update: $K$ .append ( $z_{\|K\|+1}$ )
20:     Update: $Z[d]=z_{\|K\|+1}$
21:   Else: #updating feature vector for cluster $z_{i}$ with a new given document (as described in Eq. (5a))
22:     Update: $m_{z_{i}}+=1$
23:     For common GOWs ( $g$ ) in $F_{d}$ : $\to$ Update: $f^{g}_{z_{i}}+=\|F^{g}_{d}\|$
24:     Update: $f_{z_{i}}+=\|F_{d}\|$
25:     For word ( $w$ ) in $W_{d}$ : $\to$ Update: $n^{w}_{z_{i}}+=N^{w}_{d}$
26:     Update: $n_{z_{i}}+=N_{d}$
27:     Update: $K[z_{i}]$ #updating feature vector of $z_{i}$ after the new document ( $d$ ) is added
28:     Update: $Z[d]=z_{i}$
29:   End if
30: End for
31: Return $K$ and $Z$

In the first stage, common GOWs are extracted from the input document set, denoted as $F$ , with the initial sliding window and minimum support value ( $\sigma$ ) (line 3). The set of extracted common GOWs $F$ is then used in the subsequent topic/cluster inference steps. For the first text document ( $d$ ) in a given batch ( $D_{t}$ ), the model creates a new topic/cluster. Each subsequently arriving document in the stream is then evaluated to decide whether it chooses an existing cluster or creates a new one, by calculating the corresponding probabilities. For each document, the probability of choosing each existing recorded topic/cluster in ( $K$ ) is calculated (lines 7–10). Then, the cluster ( $z_{i}$ ) with the highest probability for document ( $d$ ) is chosen (line 11), and its probability is denoted as $P^{\textit{existing}}_{z_{i},d}$ . Next, the probability of creating a new cluster for document ( $d$ ) is calculated (line 12), denoted as $P^{\textit{new}}_{z_{i},d}$ . If the probability of choosing an existing cluster is not smaller than that of creating a new cluster ( $P^{\textit{existing}}_{z_{i},d}\geq P^{\textit{new}}_{z_{i},d}$ ), the cluster feature vector of $z_{i}$ is updated with the parameters ( $f,n,m$ ) of the newly added document ( $d$ ) (lines 21–28), following Eq. (5a). Otherwise, a new cluster $z_{|K|+1}$ is created with the initial parameters of document ( $d$ ) (lines 13–20).
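The procedure described above can be sketched as a single pass over one batch in Python. This is a simplified illustration under several assumptions and not the original implementation: common GOWs are assumed to be already extracted and passed in as a set of identifiers per document, $W$ and $F$ are taken as the numbers of distinct words and common GOWs in the batch, only one iteration is performed, and all function and key names are hypothetical.

```python
from collections import Counter


def rising(start, count):
    """start * (start + 1) * ... * (start + count - 1); 1.0 if count == 0."""
    out = 1.0
    for j in range(count):
        out *= start + j
    return out


def score(words, gows, cluster, D, W, F, alpha, beta):
    """Eq. (7) x Eq. (8) for an existing cluster; Eq. (9a) x (9b) if cluster is None."""
    if cluster is None:
        m, n, wf, gf = alpha * D, 0, Counter(), Counter()
    else:
        m, n, wf, gf = cluster["m"], cluster["n"], cluster["wf"], cluster["gf"]
    prior = m / (D - 1 + alpha * D)
    w_num = 1.0
    for w, count in Counter(words).items():
        w_num *= rising(wf[w] + beta, count)
    g_num = 1.0
    for g in gows:
        g_num *= gf[g] + beta
    w_term = w_num / rising(n + W * beta, len(words))
    g_term = g_num / rising(n + F * beta, len(gows))
    return prior * (w_term + g_term)


def one_pass(batch, alpha=0.03, beta=0.03):
    """batch: list of (word_list, gow_set). Returns (clusters K, assignments Z)."""
    D = len(batch)
    W = len({w for words, _ in batch for w in words})
    F = len({g for _, gows in batch for g in gows})
    clusters, assignments = [], []
    for words, gows in batch:
        gows = set(gows)
        scores = [score(words, gows, c, D, W, F, alpha, beta) for c in clusters]
        p_new = score(words, gows, None, D, W, F, alpha, beta)
        if not scores or max(scores) < p_new:
            # Lines 13-20 of Algorithm 2: open a new, empty cluster.
            clusters.append({"m": 0, "n": 0, "wf": Counter(), "gf": Counter()})
            z = len(clusters) - 1
        else:
            # Lines 21-28: reuse the best-scoring existing cluster.
            z = scores.index(max(scores))
        c = clusters[z]
        c["m"] += 1                 # Eq. (5a) updates
        c["n"] += len(words)
        c["wf"].update(words)
        c["gf"].update(gows)
        assignments.append(z)
    return clusters, assignments
```

For example, a batch with two documents about one topic followed by two documents about another, each pair sharing a common GOW, is split into two clusters by this sketch.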

4. Experiments and discussions

In this section, we present extensive experiments on real-world datasets to evaluate the performance of our proposed GOW-Stream model in comparison with recent state-of-the-art text stream clustering baselines, including DTM, Sumblr and MStream.

4.1 Dataset and evaluation metric usage

4.1.1 Dataset descriptions

To fairly evaluate the performance of the different text stream clustering models, including our proposed GOW-Stream, we use two main real-world labelled datasets which are commonly used in the empirical studies of previous works, together with their synthetic variants. These datasets are:

• Google-News (GN) (Google News: https://news.google.com/): this dataset was first introduced by [Yin and Wang, 2014] and contains 11,109 labelled documents. The documents in this dataset are assigned to 152 different topics/clusters.
• Tweets (Tw) (Tweets (TREC) dataset: http://trec.nist.gov/data/microblog.html): this dataset is constructed by collecting tweets from the Twitter social network. The collected tweets were labelled in the 2011–2015 microblog tracks of the Text Retrieval Conference (TREC), NIST. This dataset contains 269 topics/clusters which cover 30,322 tweets of different lengths.
• Synthetic datasets (GN-T, Tw-T): these two datasets are modified versions of the above Google-News (GN) and Tweets (Tw) datasets, built to simulate the topic/concept drift of real-world text streams, where clusters/topics only occur at a specific time (e.g.: COVID-19, China-India skirmish, etc.) and then disappear. Following the procedure of previous works, we first sorted the tweets (Tw) and news (GN) by their labelled clusters/topics. Then, we equally divided these tweets/news into 16 parts before shuffling them.
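The construction of the synthetic datasets can be sketched as follows. The function name and signature are hypothetical, and the sketch assumes that the shuffling is applied within each of the 16 parts, which is one plausible reading of the procedure described above.

```python
import random


def make_drift_stream(docs, labels, parts=16, seed=0):
    """Sort documents by label, cut into `parts` near-equal parts, shuffle each part.

    Returns a list of batches; each batch is a list of (doc, label) pairs.
    """
    order = sorted(range(len(docs)), key=lambda i: labels[i])
    n = len(order)
    # Integer boundaries give exactly `parts` chunks whose sizes differ by at most 1.
    bounds = [k * n // parts for k in range(parts + 1)]
    rng = random.Random(seed)
    batches = []
    for start, end in zip(bounds, bounds[1:]):
        chunk = order[start:end]
        rng.shuffle(chunk)  # assumption: shuffle inside each part only
        batches.append([(docs[i], labels[i]) for i in chunk])
    return batches
```

Because the documents are sorted by label before being cut, each batch contains only a few topics, so topics appear for a limited time and then vanish from the stream.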

Similar to previous works, we apply simple text preprocessing, including transforming all text to lower case, removing all stop-words and stemming words. The datasets used in our experiments are considered short-length and sparse due to the low average number of words in each document as well as the large number of covered topics (as shown by the statistics in Table 1).

Table 1
Details of the experimental datasets

	No. of documents	No. of topics	No. of unique words	Average length
GN and GN-T (synthetic)	11,109	152	8,110	6.23
Tw and Tw-T (synthetic)	30,322	269	12,301	7.97

Overall, the Tw/Tw-T dataset is considered more challenging than GN/GN-T due to the larger number of labelled topics covered by its text documents.

4.1.2 Evaluation metric usage

To evaluate the accuracy of the text clustering task with different text stream clustering algorithms, we use two main evaluation metrics: NMI and the F1 measure. These evaluation metrics are used in our experiments as follows:

Normalized Mutual Information (NMI). This is the most common evaluation metric, widely used to evaluate the quality of a clustering output against the given ground truth. NMI is considered the strictest metric for evaluating clustering performance and takes values within the range [0, 1]. When the clustering output totally matches the given ground truth/labelled classes, the NMI value is 1, whereas its value is close to 0 when the clustering output is randomly generated. The NMI metric is formally defined as follows (see Eq. (10a)):

$\displaystyle\textit{NMI}=\frac{\sum_{c,k}n_{c,k}\log\left(\frac{N\cdot n_{c,k}}{n_{c}\cdot n_{k}}\right)}{\sqrt{\left(\sum_{c}n_{c}\log\frac{n_{c}}{N}\right)\cdot\left(\sum_{k}n_{k}\log\frac{n_{k}}{N}\right)}}$ (10a)

Where,

• $n_{c}$ and $n_{k}$ are the number of documents in a class ( $c$ ) and the number of documents in a cluster ( $k$ ), respectively.
• $n_{c,k}$ is the number of documents in both class ( $c$ ) and cluster ( $k$ ).
• $N$ is the total number of documents in the given dataset.
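As an illustration of Eq. (10a), the NMI can be computed from two label lists with a few lines of Python. The function name is an assumption, and returning 0.0 when either partition has a single group (zero denominator) is a convention chosen for this sketch only.

```python
import math
from collections import Counter


def nmi(classes, clusters):
    """NMI of Eq. (10a) from ground-truth classes and predicted clusters."""
    N = len(classes)
    n_c = Counter(classes)                   # documents per class
    n_k = Counter(clusters)                  # documents per cluster
    n_ck = Counter(zip(classes, clusters))   # documents per (class, cluster) pair

    numerator = sum(
        n * math.log(N * n / (n_c[c] * n_k[k])) for (c, k), n in n_ck.items()
    )
    h_c = sum(n * math.log(n / N) for n in n_c.values())
    h_k = sum(n * math.log(n / N) for n in n_k.values())
    denominator = math.sqrt(h_c * h_k)
    # Convention of this sketch: undefined NMI (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator > 0 else 0.0
```

Since cluster identifiers are arbitrary, a clustering that reproduces the classes under different names still obtains an NMI of 1.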

F1 measure. This is a well-known metric for both clustering and classification tasks. The F1 metric considers both precision ( ${P}$ ) and recall ( ${R}$ ) values of clustering outputs to compute the F1 value. The F1 metric is formally defined as the following (see Eq. (10b)):

$\displaystyle P=\frac{TP}{TP+FP},R=\frac{TP}{TP+FN},F1=2\cdot\frac{P\cdot R}{P% +R}$ (10b)

Where,

• $TP$ is the number of text documents which are assigned to the correct clusters (relying on their corresponding labelled classes).
• $FP$ and $FN$ are the number of documents which are assigned to a specific cluster but do not belong to it, and the number of documents which belong to a cluster but are not assigned to it, respectively.
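Eq. (10b) is a direct computation once the three counts are available; the short Python sketch below illustrates it. The function name is hypothetical, and mapping undefined ratios (zero denominators) to 0.0 is an assumption of this sketch.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) of Eq. (10b) from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For instance, with 8 correctly assigned documents, 2 wrongly assigned ones and 2 missed ones, precision, recall and F1 are all 0.8.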

4.2 Experimental setups

To compare the performance of our proposed GOW-Stream model, three main state-of-the-art text stream clustering baselines are implemented in our experiments: DTM [6], Sumblr [4] and MStream [10]. These text stream clustering models are briefly described as follows:

• DTM (Dynamic Topic Modelling) [6]: one of the earliest dynamic topic modelling approaches, which discovers latent topics/clusters from sequential text documents such as text streams. However, DTM is a “fixed number of topics” approach, which means that the number of clusters/topics must be specified in advance. Therefore, it fails to deal with the natural topic/concept drift challenge of the text stream clustering task.
• Sumblr [4]: the most well-known similarity-based model for the text stream clustering task. Sumblr is an online approach for clustering tweets from the Twitter social network. It is considered effective for short-length text clustering, needing only one pass over each batch both to assign clusters to new documents and to maintain compact cluster statistics. However, as with DTM, the number of clusters/topics in Sumblr must be configured initially, which makes it fail to cope with the topic/concept drift challenge.
• MStream [10]: the most recent mixture model-based approach for handling the short-length text stream clustering problem as well as the topic/concept drift challenge. MStream applies a DPMM-based inference technique to discover the clusters/topics covered by the documents in a given text stream. Extensive experiments on standard datasets demonstrated the effectiveness of MStream in both one-pass text document clustering and cluster update/maintenance. However, MStream is a word-independent evaluation approach which largely ignores the dependency relationships between words in a text document. Therefore, it might be unable to deal with the ambiguity of the clusters/topics extracted from text streams.

For the initial configuration of each model, we apply the default settings with which the original works reported their highest accuracy. The configurations of each text stream clustering model on the different datasets are described in Table 2. For the DTM and Sumblr models, the initial number of topics/clusters must be specified in advance, so we applied a different number of clusters/topics for each dataset. To arrive at the hyper-parameter configurations in Table 2, we conducted extensive studies of how the model’s hyper-parameters affect its overall accuracy, as reported in Section 4.3.3.

Table 2
Details of configurations for text stream clustering models

Dataset	Model	Hyper-parameters	Number of initial topics ( ${K}$ )
		$\alpha$	$\beta$
GN and GN-T (synthetic)	DTM	0.01	N/A	170
	Sumblr	N/A	0.02	170
	MStream	0.03	0.03	N/A ( ${K=0}$ )
	GOW-Stream	0.03	0.03	N/A ( ${K=0}$ )
Tw and Tw-T (synthetic)	DTM	0.01	N/A	300
	Sumblr	N/A	0.02	300
	MStream	0.03	0.03	N/A ( ${K=0}$ )
	GOW-Stream	0.03	0.03	N/A ( ${K=0}$ )

For each text stream clustering model, the number of iterations for each arriving document batch is set to 10. Overall, each dataset is divided into 16 document batches, and the clustering output of each batch is evaluated using the metrics listed above (Section 4.1.2). For the experiments on each document batch, we ran 10 independent trials for each model and report the average results.

4.3 Experimental results and discussions

4.3.1 Text stream clustering task

In this experiment, we compare the performance of our proposed GOW-Stream model with different state-of-the-art text stream clustering baselines, including DTM, Sumblr and MStream, on the two standard Google News and Twitter datasets. For each model, we conducted the text clustering experiments on both datasets 10 times and reported the average results with standard deviations in terms of the NMI and F1 metrics. Tables 3 and 4 show the experimental outputs of the text stream clustering task with different models in terms of the NMI and F1 metrics, respectively.

Table 3
Average outputs of text clustering task with different models in terms of NMI metric

Model	Dataset
	GN	GN-T	Tw	Tw-T
DTM	0.723003 $\pm$ 0.03	0.685827 $\pm$ 0.05	0.676872 $\pm$ 0.03	0.681405 $\pm$ 0.02
Sumblr	0.580666 $\pm$ 0.05	0.548127 $\pm$ 0.08	0.542124 $\pm$ 0.06	0.580197 $\pm$ 0.05
MStream	0.895725 $\pm$ 0.01	0.872712 $\pm$ 0.03	0.852766 $\pm$ 0.02	0.889187 $\pm$ 0.02
GOW-Stream	0.928976 $\pm$ 0.01	0.896155 $\pm$ 0.03	0.873884 $\pm$ 0.02	0.905492 $\pm$ 0.01

Table 4
Experimental outputs of text clustering task with different models in terms of F1 metric

Model	Dataset
	GN	GN-T	Tw	Tw-T
DTM	0.876349 $\pm$ 0.02	0.832146 $\pm$ 0.01	0.839111 $\pm$ 0.01	0.885007 $\pm$ 0.02
Sumblr	0.891121 $\pm$ 0.01	0.848805 $\pm$ 0.02	0.855527 $\pm$ 0.03	0.907313 $\pm$ 0.01
MStream	0.976687 $\pm$ 0.01	0.930837 $\pm$ 0.01	0.940604 $\pm$ 0.01	0.978933 $\pm$ 0.01
GOW-Stream	0.985201 $\pm$ 0.01	0.935536 $\pm$ 0.01	0.943755 $\pm$ 0.02	0.979442 $\pm$ 0.01

Figure 3.

Experimental results for different number of document batches in terms of NMI metric.

Figure 4.

Experimental results for different number of document batches in terms of F1 metric.

In general, the experimental outputs in Tables 3 and 4 show that our proposed GOW-Stream achieves better accuracy than the recent text stream clustering models on all given datasets. The GOW-Stream model attains its highest performance on the Google News datasets (GN and synthetic GN-T), with on average 91.25% and 96.03% in terms of the NMI and F1 metrics, respectively. For the Tweets datasets, which are considered more challenging than GN, the GOW-Stream model also achieves a stable accuracy of 88.96% and 96.15% in terms of the NMI and F1 metrics, respectively. Compared with the other text stream clustering models, GOW-Stream significantly outperforms Sumblr and DTM, by about 60.12% and 30.26% in terms of the NMI metric, respectively (relative gains of the NMI averaged over the four datasets in Table 3). The experimental results also show that GOW-Stream performs slightly better than the recent well-known mixture model-based MStream, by about 2.68% in terms of the NMI metric.
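The percentages quoted above are consistent with relative gains of the mean NMI over the four datasets of Table 3, which the following short Python sketch reproduces. The variable and function names are illustrative; the values are copied from Table 3.

```python
# Mean NMI values per dataset (GN, GN-T, Tw, Tw-T), copied from Table 3.
NMI_TABLE = {
    "DTM": [0.723003, 0.685827, 0.676872, 0.681405],
    "Sumblr": [0.580666, 0.548127, 0.542124, 0.580197],
    "MStream": [0.895725, 0.872712, 0.852766, 0.889187],
    "GOW-Stream": [0.928976, 0.896155, 0.873884, 0.905492],
}


def mean(values):
    return sum(values) / len(values)


def relative_gain(model, baseline, table=NMI_TABLE):
    """Relative improvement (in percent) of the mean NMI of `model` over `baseline`."""
    return 100.0 * (mean(table[model]) / mean(table[baseline]) - 1.0)


for baseline in ("Sumblr", "DTM", "MStream"):
    print(baseline, round(relative_gain("GOW-Stream", baseline), 2))
```

Under this reading, the gains over Sumblr, DTM and MStream evaluate to roughly 60.12%, 30.26% and 2.68%, matching the figures in the text.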

Further evaluations of the accuracy of each text stream clustering model on the separate document batches (Figs 3 and 4) demonstrate that both MStream and GOW-Stream produce better and more stable text stream clustering outputs than the previous DTM and Sumblr models. Moreover, the F1 evaluations in Table 4 indicate that the mixture model-based approach of MStream and GOW-Stream is more flexible and stable across documents of different lengths than the classical topic modelling and similarity-based approaches. To sum up, our experiments show that the proposed GOW-Stream outperforms recent baselines in the text stream clustering task, which indicates that using the distribution of GOWs in text documents helps to improve the accuracy of short-length text stream clustering.

4.3.2 Scalability performance

Model’s speedup performance. In this part, we evaluate the scalability of our proposed GOW-Stream model against the other text stream clustering models. We implemented and ran GOW-Stream, MStream, Sumblr and DTM on the same CentOS 6.5 computer with an Intel Xeon E5-2620 v4 2.10 GHz CPU (8 cores, 16 threads) and 64 GB of memory. All models are configured with 10 iterations for each document batch, with 16 batches for each of the GN and Tw datasets. Each model was run 5 times and the average execution time (in seconds) is reported as the final result. Figure 5 shows the speed of the different text stream clustering models on the Tw (Fig. 5A) and GN (Fig. 5B) datasets. As shown by the experimental outputs, both MStream and GOW-Stream are significantly faster than the traditional approaches of Sumblr and DTM. Specifically, GOW-Stream is approximately 20.08 and 7.29 times faster than DTM and Sumblr, respectively, on both the GN and Tw datasets. Compared with MStream, GOW-Stream is also about 2.8 times faster. These scalability experiments demonstrate that combining independent word evaluation with common GOW evaluation while inferring topics/clusters from text streams can help to speed up the model’s convergence.

Figure 5.

Scalability performance of different text stream clustering techniques.

Figure 6.

Evaluation on the influence of number of iterations on the accuracy performance of GOW-Stream model.

Figure 7.

Experimental results for influence of the $\alpha$ hyper-parameter.

Figure 8.

Experimental results for influence of the $\beta$ hyper-parameter.

Influence of number of iterations. Most topic modelling and mixture model-based approaches need a reasonable number of iterations for each document batch to reach an acceptable accuracy. In this part, we investigate the influence of the number of iterations per document batch on the overall accuracy of our GOW-Stream model in terms of the NMI metric. Similar to the previous experiments, we conducted the experiments on the two datasets Tw and GN with different numbers of iterations for each document batch. Each experiment was repeated 10 times and the average results are reported. Figure 6 shows the changes in the accuracy of the text stream clustering task with different numbers of iterations per document batch on both the Tw and GN datasets. The experimental results demonstrate that the accuracy of our proposed model stabilizes within the range of 7–10 iterations per document batch. This shows that our proposed GOW-Stream model converges quite fast.

4.3.3 Model’s hyper-parameters sensitivity

To evaluate the influence of the model’s parameters on its accuracy, we conducted extensive experiments investigating how changes of the $\alpha$ and $\beta$ hyper-parameters affect the text clustering quality. For both topic modelling and dynamic mixture model-based text stream clustering approaches, the initial hyper-parameter setup might have a considerable impact on the outputs as well as on the model’s convergence. Therefore, selecting appropriate default hyper-parameters is important for the model to reach its highest performance in terms of both accuracy and running time. To evaluate the influence of the model’s parameters, we implemented the GOW-Stream model with the default configurations (as shown in Table 2) and changed the values of the initial $\alpha$ and $\beta$ hyper-parameters within a specific range in order to observe the fluctuations of the model’s accuracy, evaluated by the NMI metric. The parameter sensitivity experiments in this section also used the two standard Tw and GN datasets, with $\alpha$ and $\beta$ varied within the range of [0.01, 0.05]. In each experiment, the value of one hyper-parameter is changed while the others are fixed. Figures 7 and 8 show the influence of the $\alpha$ and $\beta$ hyper-parameters, respectively, on the accuracy of our proposed GOW-Stream model in terms of the NMI metric.

As shown by the experimental results, the proposed GOW-Stream model achieves stable accuracy with different values of both the $\alpha$ and $\beta$ hyper-parameters. The model’s accuracy stays within the [0.84 $\pm$ 0.02, 0.94 $\pm$ 0.01] range, and the GOW-Stream model reaches its highest accuracy with the value of 0.03 for both $\alpha$ and $\beta$ . Overall, the extensive experiments on the parameter sensitivity of the proposed GOW-Stream demonstrate the effectiveness as well as the stability of applying common GOW evaluation to the short-length sparse text stream clustering task.

5. Conclusions and future works

In this paper, we formally propose a novel semantic-enhanced approach for text stream clustering by applying the distributions of common graph-of-words (GOWs) over short-length text documents. GOW evaluation has proven effective in text mining tasks thanks to its capability of naturally capturing dependency relationships between words, such as co-occurrence and order relationships. GOW is an unsupervised text restructuring technique which has been widely applied in multiple semantic-enhanced approaches due to its simplicity and efficiency of implementation, without using any advanced supervised NLP technique. By combining it with frequent subgraph mining (FSM), we can extract common GOWs from the given text corpora; these common GOWs act as distinctive features of text documents. To overcome the drawbacks of previous text stream clustering models related to the evaluation of word dependencies, we combine word-independent and common GOW-based evaluation in the topic/cluster inference process of the Dirichlet Process Multinomial Model (DPMM) to enhance the text clustering outputs from the given streams. Extensive experiments on benchmark datasets demonstrate the effectiveness of our proposed model in handling the short-length sparse text stream clustering task in comparison with recent state-of-the-art baselines, including DTM, Sumblr and MStream. As a future improvement, we intend to extend the implementation of our proposed GOW-Stream model to a distributed processing environment designed for handling large-scale and high-velocity textual data streams, such as Apache Spark Streaming.

Acknowledgments

This research is funded by Thu Dau Mot University under grant number DT.20-031 and by Vietnam National University Ho Chi Minh City (VNU-HCMC) under grant number DS2020-26-01.

References

1. Zhou, Chen, Zhang and He, Unsupervised event exploration from social text streams, Intelligent Data Analysis 21(4) (2017), 849–866.

2. Aggarwal C.C., A Survey of Stream Clustering Algorithms, Data Clustering, 2013, 231–258.

3. Zhu, Xiao, Deng, Sun and Bai, A joint model of extended LDA and IBTM over streaming Chinese short texts, Intelligent Data Analysis 23(3) (2019), 681–699.

4. Shou, Wang, Chen and Chen, Sumblr: continuous summarization of evolving tweet streams, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2013.

5. Blei D.M., Ng A.Y. and Jordan M.I., Latent dirichlet allocation, Journal of Machine Learning Research 3(Jan) (2003), 993–1022.

6. Blei D.M. and Lafferty J.D., Dynamic topic models, in: Proceedings of the 23rd International Conference on Machine Learning, 2006.

7. Wang, Agichtein and Benzi, TM-LDA: efficient online modeling of latent topic transitions in social media, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.

8. Amoualian, Clausel, Gaussier and Amini M.R., Streaming-LDA: A copula-based approach to modeling topic dependencies in document streams, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

9. Liang, Yilmaz and Kanoulas, Dynamic clustering of streaming short documents, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

10. Yin, Chao, Liu, Zhang and Wang, Model-based clustering of short text streams, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.

11. Pham and Ta C.D.C., GOW-LDA: Applying Term Co-occurrence Graph Representation in LDA Topic Models Improvement, in: International Conference on Computational Science and Technology, 2017.

12. Wang and McCallum, Topics over time: a non-Markov continuous-time model of topical trends, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

13. Iwata, Watanabe, Yamada and Ueda, Topic tracking model for analyzing consumer purchase behavior, in: Twenty-First International Joint Conference on Artificial Intelligence, 2009.

14. Teh Y.W., Dirichlet Process, 2010, 280–287.

15. Farajtabar, Ahmed, Smola A.J. and Song, Dirichlet-hawkes processes with applications to clustering continuous-time document streams, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.

16. Ahmed and Xing, Dynamic non-parametric mixture models and the recurrent chinese restaurant process: with applications to evolutionary clustering, in: Proceedings of the 2008 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2008.

17. Yin and Wang, A model-based approach for text clustering with outlier detection, in: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), 2016, pp. 625–636.

18. Aggarwal C.C., Philip S.Y., Han and Wang, A framework for clustering evolving data streams, in: Proceedings 2003 VLDB Conference, Morgan Kaufmann, 2003.

19. Cao, Ester, Qian and Zhou, Density-based clustering over an evolving data stream with noise, in: Proceedings of the 2006 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2006.
