Unsupervised latent event representation learning and storyline extraction from news articles based on neural networks

Abstract

Storyline extraction aims to generate concise summaries of related events unfolding over time from a collection of temporally ordered news articles. Existing approaches to storyline extraction are typically built on probabilistic graphical models that jointly model the extraction of events and storylines from news published in different periods. However, their parameter inference procedures are often complex and require a long time to converge, which hinders their use in practical applications. More recently, a neural network-based approach has been proposed to tackle these limitations. However, it does not learn event representations of documents, which are important for the quality of the generated storylines. In this paper, we propose a novel unsupervised neural network-based approach that jointly extracts latent events and the link patterns of storylines from documents over time. Specifically, event representations are learned by a stacked autoencoder and clustered for event extraction, and a fusion component is then incorporated to link related events across consecutive periods for storyline extraction. The proposed model has been evaluated on three news corpora, and the experimental results show that it outperforms state-of-the-art approaches with significant improvements.

Keywords

Storyline extraction; event representation; neural network

1. Introduction

With the rapid development of social media, a massive amount of data is generated by online news media each day. Given this volume of news articles, helping the public digest the data effectively is crucial. Storyline extraction aims to extract a series of concise summaries of related events unfolding over time from a collection of news articles. It helps readers get a quick overview of how current events develop without reading through a large number of articles. Therefore, automatically extracting storylines from text has been intensively studied in recent years [33, 13, 17, 7, 38, 37].

Generally, a storyline can be considered as a cluster of documents in which documents describing the same event are grouped based on their content and their temporal dependency across neighboring periods. Therefore, storyline extraction can be formalized as an evolutionary clustering problem. Most existing approaches to storyline extraction fall into two classes: pipeline methods and end-to-end methods. Pipeline methods first extract events in each time epoch separately, and then link related events across epochs to form storylines in a post-processing step. For instance, Huang et al. [14] first extracted topics of short texts based on word co-occurrence patterns, and then proposed an event evolutionary algorithm to extract storylines. Yu et al. [34] developed a context-dependent storyline detection model using subgraph learning. Different ways of measuring the event similarity between documents have been employed [17, 21]. However, pipeline methods suffer from error propagation. In contrast, traditional end-to-end methods typically use probabilistic graphical models to describe the extraction process of storylines [13, 7, 38]. For example, Diao et al. [7] developed a probabilistic model that extracts events and topics simultaneously, using the Recurrent Chinese Restaurant Process (RCRP) to capture the essence of events from tweets. Du et al. [8] proposed to combine the Dirichlet process and the Hawkes process to capture the characteristics of asynchronous streaming data. However, their parameter inference procedures are often complex and converge slowly, which makes them impractical to deploy in real-world applications. Instead of employing probabilistic graphical models, Zhou et al. [37] developed an unsupervised neural network-based approach for storyline extraction using a pairwise ranking loss. However, it is unable to learn latent event representations, which are crucial for storyline extraction.

To overcome the limitations mentioned above, in this paper, we propose a novel unsupervised deep neural network-based approach for jointly learning latent event representations and extracting storylines without any annotated data. To extract the inherent structure of storylines and capture the contextual information of the same storyline automatically from news stream data, a non-linear autoencoder is employed to learn a global mapping from raw documents to underlying semantic features. Then storyline distribution is learned by calculating the distance between the mapped documents and their corresponding meta-events. The model parameters are refined iteratively to minimize the KL divergence between the storyline distribution and the normalized distribution. Moreover, a fusion component is incorporated into the deep clustering framework as a constraint in latent event space to link related events in neighboring time epochs for constructing storylines.

The main contributions of this paper are summarized below:

• A novel unsupervised neural network-based model for jointly learning latent events and extracting storylines from temporally ordered news articles is proposed. A fusion component is employed to connect the related events in neighboring epochs. To the best of our knowledge, this is the first attempt to perform storyline extraction by incorporating the deep embedded clustering framework.
• We evaluate the proposed approach on three corpora and observe significant improvements on three metrics when compared with the state-of-the-art approaches to storyline extraction.

The remainder of the paper is organized as follows: Section 2 discusses related literature on storyline extraction and briefly describes the deep embedded clustering framework; Section 3 provides the details of the proposed model; Section 4 presents our experiments; and Section 5 concludes the paper with suggestions for future work.

2. Related work

Our work is the first to incorporate deep embedded clustering into storyline extraction. To facilitate the description of our model, we briefly review existing storyline extraction algorithms and introduce deep embedded clustering.

2.1 Storyline extraction

Storyline extraction from text has been extensively studied in recent years. Yan et al. [33] calculated the similarity of summaries on each date with a summarization method to detect events. Kawamae et al. [15] proposed a trend analysis method to detect topic evolution over time. Lin et al. [20] proposed to generate storylines from relevant tweets via graph optimization. Radinsky et al. [26] extracted storylines by text clustering and entity entropy. Huang et al. [13] developed a mixture event model to capture the local and global aspects of events and then generated storylines with an optimization method. Storyline extraction can be cast as the classical topic detection and tracking (TDT) problem if a storyline is considered as a hidden topic. Topic model-based methods are a natural choice for the TDT problem because of their interpretability. However, classical topic models such as Latent Dirichlet Allocation (LDA) [5] cannot capture the dependency between topics across consecutive periods. Consequently, Blei et al. [4] developed a dynamic topic model that captures the evolution of topics in a sequentially organized corpus of documents. Zhang et al. [36] proposed an evolutionary hierarchical Dirichlet process that incorporates time dependency to discover topic evolution patterns from documents. Zhou et al. [39] proposed an unsupervised Bayesian model that models a storyline as a joint distribution over entities and topics. Zhou et al. [38] proposed a generative model that extracts event representations and generates storylines simultaneously, and that uses a per-token Metropolis-Hastings sampler to reduce sampling complexity. Li et al. [19] proposed to generate storylines with a time-dependent hierarchical Dirichlet process, which can detect different levels of topic information in the corpus. Li et al. [18] proposed to capture the topic evolution pattern in a storyline via an evolutionary hierarchical Dirichlet process. Instead of methods based on the Dirichlet process, Ahmed et al. [2] presented a time-dependent topic-cluster model, a hierarchical approach that combines LDA and clustering via RCRP [7]. Huang et al. [14] proposed a dynamic Chinese Restaurant Process (CRP) model that considers the birth, survival, and death of a storyline, which is more in line with how storylines actually develop over time. Tang et al. [28] developed a hybrid distance-dependent CRP-based hierarchical topic model for clustering news articles. Nevertheless, probabilistic graphical models usually have complex structures and require a long time to converge, which hinders their application in the real world.

In recent years, deep learning has demonstrated its power in many research areas, including natural language processing. There has been increasing interest in exploring neural network-based approaches for topic detection from text. For example, Cao et al. [6] explained topic models from the perspective of neural networks and proposed a neural topic model in which the representations of words and documents are combined into a unified framework. Tian et al. [29] proposed a sentence-level recurrent topic model, which assumes that the generation of each word within a sentence depends on both the topic of the sentence and the historical context of its preceding words. Xie et al. [32] proposed a Deep Embedded Clustering (DEC) model that simultaneously learns feature representations and cluster assignments using deep neural networks by optimizing a clustering objective. Wang et al. [31] proposed an adversarial topic model that uses a Dirichlet prior and a generative adversarial network to capture the semantic patterns among latent topics. Miao et al. [23] developed a neural variational document model for topic modeling, which is based on the variational autoencoder and uses a multivariate Gaussian as the prior distribution of the latent space. Srivastava et al. [27] replaced the mixture assumption at the word level and proposed ProdLDA to improve the performance of topic extraction. However, the neural network-based models above can only extract events from documents independently and cannot capture how events evolve over time. Zhou et al. [37] first proposed a neural network-based approach for storyline extraction without any annotated data. However, that model can only perform rough event clustering across epochs and is not able to learn latent event representations, which are crucial for storyline extraction.

2.2 Deep embedded clustering

The Deep Embedded Clustering model is a neural network-based clustering approach [32] that has received widespread attention since it was proposed. For example, Dizaji et al. [9] and Guo et al. [10] both borrowed the clustering objective of DEC and improved clustering performance by replacing the structure of the autoencoder. Guo et al. [11] proposed a two-stage deep clustering algorithm by incorporating data augmentation and self-paced learning into the DEC framework. Li et al. [16] proposed deep boosted clustering, which uses a convolutional autoencoder to improve the clustering of image datasets. Dizaji et al. [9] developed DEPICT by adding a balanced assignment loss to the DEC framework to alleviate trivial solutions. Moreover, Asadi et al. [3] proposed to deal with the spatio-temporal clustering problem by using DEC. Hadifar et al. [12] proposed to learn discriminative features by combining the SIF embedding and the DEC framework. While DEC has been studied extensively in different areas, relatively few works have applied it to NLP tasks.

DEC considers the problem of clustering a set of $n$ points $\{x_{i}\in X\}_{i=1}^{n}$ into $k$ clusters, each represented by a centroid $\mu_{j},j=1,\cdots,k$ . It consists of two steps: (1) parameter initialization with deep autoencoder, (2) finetuning with clustering loss. In the parameter initialization step, a deep autoencoder is employed to initialize the parameters. First, an encoder is used to transform samples $x_{i}$ from data space $X$ to latent space $Z$ with a non-linear mapping $z_{i}=f(x_{i})$ . Then a decoder $\tilde{x_{i}}=g(z_{i})$ is used to restore $z_{i}$ . The deep autoencoder for initialization can capture salient features in samples, which helps to optimize the clustering. The centroids $\mu_{j}$ can be initialized by some traditional clustering methods, such as $k$ -means.

In the clustering step, DEC clusters the samples by iteratively minimizing the distance between the mapped point $z_{i}$ and the centroid $\mu_{j}$ until convergence. First, it uses the Student’s $t$-distribution [22] to measure the similarity between $z_{i}$ and $\mu_{j}$ as follows:

$\displaystyle q_{i,j}=\frac{(1+\lVert{z}_{i}-{\mu}_{j}\rVert^{2}/\alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j^{\prime}}(1+\lVert{z}_{i}-{\mu}_{j^{\prime}}\rVert^{2}/\alpha)^{-\frac{\alpha+1}{2}}}$ (1)

where $\alpha$ is the number of degrees of freedom of the Student’s $t$-distribution, and $q_{i,j}$ can be interpreted as the probability of assigning sample $i$ to cluster $j$. An auxiliary distribution $p_{i,j}$, defined in terms of $q_{i,j}$, is then introduced to help the model learn from high-confidence assignments. In DEC, $p_{i}$ is computed by raising $q_{i}$ to the second power and normalizing by the frequency per cluster:

$\displaystyle p_{i,j}=\frac{q^{2}_{i,j}/f_{j}}{\sum_{j^{\prime}}q^{2}_{i,j^{\prime}}/f_{j^{\prime}}}$ (2)

where $f_{j}=\sum_{i}q_{i,j}$ are the soft cluster frequencies. Using $p_{i,j}$ improves cluster purity and puts more emphasis on data points assigned with high confidence.

The clustering objective is defined as the KL divergence between $p_{i}$ and $q_{i}$. In each iteration, $q_{i}$ is pushed closer to $p_{i}$, and $\mu_{j}$ and $z_{i}$ are updated according to the KL loss.
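As a minimal sketch of Eqs. (1) and (2) and the KL objective, the following pure-Python snippet computes the soft assignment, the auxiliary distribution, and the loss for a handful of points. The function names and the toy points are illustrative assumptions and do not come from the DEC reference implementation.

```python
import math


def soft_assign(z, centroids, alpha=1.0):
    """Eq. (1): Student's t similarity between each point and each centroid."""
    q = []
    for zi in z:
        kernel = []
        for mu in centroids:
            dist2 = sum((a - b) ** 2 for a, b in zip(zi, mu))
            kernel.append((1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(kernel)
        q.append([k / total for k in kernel])
    return q


def target_distribution(q):
    """Eq. (2): square q and normalize by the soft cluster frequency f_j."""
    n_clusters = len(q[0])
    f = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / f[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all points and clusters."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
    )


# Toy data (made up): three 2-d latent points and two centroids.
z = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.0]]
centroids = [[0.0, 0.0], [2.0, 2.0]]
q = soft_assign(z, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

On this toy input, each row of `p` is more peaked than the corresponding row of `q`, which is the sharpening effect that lets the model learn from its own high-confidence assignments.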

Documents describing the same event should share a similar storyline distribution, and the clustering advantage of DEC comes from mapping the input data to a latent space in which such a distribution can be computed. We therefore incorporate DEC into our framework to extract storylines, so that our model performs event representation learning and storyline extraction jointly.

3. Methodology

To model the generative process of a storyline in consecutive periods from a stream of documents, we propose the Deep Embedded Storyline Extraction Model (DESEM). In our model, storyline extraction is still considered as an evolutionary clustering problem, and deep embedded clustering is incorporated to cluster events in each time epoch. Moreover, we use a fusion component to link related events across epochs, so that events are clustered and storylines are constructed jointly in an end-to-end manner. For ease of understanding, we first give several basic definitions used in our model:

Event: an abstract concept describing things that happened in some places at some time. We model each event $e$ as a quadruple $\langle\bm{l},\bm{p},\bm{o},\bm{w}\rangle$, which contains four elements: location $\bm{l}$, person $\bm{p}$, organization $\bm{o}$, and keywords $\bm{w}$. For example, the event “Apple vs Samsung” can be modeled as $\langle\bm{l}:\textit{U.S.},\textit{Americas};\bm{p}:\textit{Jury};\bm{o}:\textit{Apple},\textit{Samsung};\bm{w}:\textit{patent},\textit{pay}\rangle$.

Meta-event: the specific embodiment of the event $e$ which can be considered as the document cluster centroid $\mu$ in our model.

Epoch: the raw corpus $\mathcal{D}=[D_{1},D_{2},\cdots,D_{T}]$ is divided into $T$ slices according to the publishing date of the documents, and each slice $D_{t}$ is considered as one epoch.

Storyline: an event sequence $s=[e_{1},e_{2},\cdots,e_{E}]$ , which describes a developing process of the highly related events, where $E$ is the number of highly related events across different epochs in storyline $s$ .

Storyline Extraction: a process which aims to extract the storyline set $\mathcal{S}=\{s_{1},s_{2},\cdots,s_{S}\}$ from the corpus $\mathcal{D}$ , where $S$ is the number of storylines.
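The definitions above can be mirrored by two small data structures. This is only an illustrative sketch: the class and field names are assumptions, and tagging each event with an epoch index is a convenience that the definitions do not prescribe.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Quadruple <l, p, o, w>; each element is a list of surface strings."""
    locations: List[str] = field(default_factory=list)
    persons: List[str] = field(default_factory=list)
    organizations: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)


@dataclass
class Storyline:
    """Sequence of related events, each tagged with its epoch index."""
    events: List[Event] = field(default_factory=list)
    epochs: List[int] = field(default_factory=list)

    def add(self, epoch: int, event: Event) -> None:
        self.epochs.append(epoch)
        self.events.append(event)


# The "Apple vs Samsung" example from the Event definition.
example = Event(
    locations=["U.S.", "Americas"],
    persons=["Jury"],
    organizations=["Apple", "Samsung"],
    keywords=["patent", "pay"],
)
story = Storyline()
story.add(1, example)
```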

Figure 1.

Overall architecture of the DESEM. Top: the stacked autoencoder maps the quadruple of each document $d$ to their latent event representation $z$ . The dotted lines indicate that the parameters of stacked autoencoder are different for different epochs. Bottom: the structure of different epochs sorted by the publishing date, the solid red line shows the update process of meta-event $\mu$ . Note that there is no fusion component at epoch 1.

The DESEM structure is shown in Fig. 1. The proposed model contains three main components: (1) the latent event mapping component, shown in the top part of Fig. 1, defines the representation mapping function that maps the original document features $d$ to a latent event space $\mathcal{Z}$; (2) the storyline distribution learning component computes the storyline distribution $q_{t}$ from the meta-event $\mu_{t}$ and the mapped latent event $z$, and then optimizes the model parameters iteratively by minimizing the distance between $q_{t}$ and its normalized distribution $p_{t}$; (3) the fusion component, which takes the meta-event at the previous epoch as input, is incorporated into the storyline construction component to capture the dynamic evolution of different storylines, and its output is subsequently used as the meta-event at the current epoch. In the following, we explain each component of our model in more detail.

3.1 Latent event mapping

Clustering original document features into events in raw space usually faces the problem of the “curse of dimensionality” because of the high dimension of input documents. Furthermore, documents that describe the same event may not share many words in common and could use different expressions. Thus, it is more sensible to learn low-dimensional representations to capture the most salient event features in documents. We assume that each news document is an instance of the corresponding event (i.e., meta-event) and the documents describing the same event would scatter around the same meta-event in a latent space.

In the latent event mapping step, we use a deep non-linear Stacked Autoencoder (SAE) to learn the mapping from the original feature space of documents to the latent event space, as shown in the top part of Fig. 1. The stacked autoencoder consists of two non-linear mapping units, an encoder function $F(\mathcal{D}|{\Theta})\to\mathcal{Z}$ and a decoder function $G(\mathcal{Z}|{\Omega})\to\tilde{\mathcal{D}}$, where the encoder and the decoder have symmetric structures and the reconstruction $\tilde{\mathcal{D}}$ is expected to restore $\mathcal{D}$. $\mathcal{D}\in\mathbb{R}^{n}$ and $\mathcal{Z}\in\mathbb{R}^{m}$ are the original feature space and the latent event space, respectively. Internally, the hidden layer $z$ from the latent event space describes the code used to represent the input. ${\Theta}$ and $\Omega$ are the parameters of the encoder and the decoder, respectively. Two types of autoencoder are widely used in deep models: the under-complete autoencoder, which constrains the dimension of the latent space to be lower than that of the input space, and the denoising autoencoder (DAE), which has to recover the original input from a corrupted version of it [10]. In our model, the denoising autoencoder is employed in the pretraining phase, and the under-complete structure forces the dimension $m$ of the latent event code to be lower than the dimension $n$ of the input data. A lower dimension $m$ forces the autoencoder to capture the most salient event features in a document.

In the stacked autoencoder, each pair of layers (the forward $i$-th layer of the encoder and the backward $i$-th layer of the decoder) is initialized by a denoising autoencoder that reconstructs the previous layer’s output after random corruption, which makes the representation robust to perturbation [30, 32]. Given an input $x$, the denoising autoencoder is trained to minimize the mean squared reconstruction loss below:

$\displaystyle\mathcal{L}_{\textit{MSE}}=||x-g_{\textit{dae}}(f_{\textit{dae}}(\tilde{x}))||_{2}^{2}$ (3)

where $\tilde{x}\sim\textit{Dropout}(x)$, and $f_{\textit{dae}}$ and $g_{\textit{dae}}$ are the encoder and the decoder, respectively. Thus, the denoising autoencoder has to recover the clean $x$ from its corrupted version $\tilde{x}$, which implicitly helps to capture the underlying structure of the data-generating distribution. We train each denoising autoencoder by minimizing the squared reconstruction loss on the previous layer’s output. Moreover, the rectified linear unit (ReLU) is used in the denoising autoencoder for non-linear approximation.
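To make Eq. (3) concrete, the sketch below evaluates the denoising reconstruction loss for a single one-layer encoder and decoder pair. It is a simplified forward pass only, under several assumptions: dropout zeroes entries without rescaling, the decoder is purely linear, and the function names and toy weights are hypothetical.

```python
import random


def dropout(x, rate, rng):
    """Corrupt x by zeroing each entry with probability `rate` (no rescaling)."""
    return [0.0 if rng.random() < rate else v for v in x]


def linear(weights, bias, x):
    """Affine map; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(h):
    return [max(0.0, v) for v in h]


def dae_loss(x, enc_w, enc_b, dec_w, dec_b, rate, rng):
    """Eq. (3): squared error between clean x and the reconstruction of corrupted x."""
    x_tilde = dropout(x, rate, rng)
    code = relu(linear(enc_w, enc_b, x_tilde))
    x_hat = linear(dec_w, dec_b, code)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


# Toy setup (made up): identity weights, so the clean input is restored exactly.
rng = random.Random(0)
x = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
clean_loss = dae_loss(x, identity, zeros, identity, zeros, rate=0.0, rng=rng)
corrupted_loss = dae_loss(x, identity, zeros, identity, zeros, rate=0.5, rng=rng)
```

Note that the loss always compares the reconstruction with the clean input $x$, not with $\tilde{x}$, which is what forces the network to undo the corruption.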

3.2 Storyline distribution learning

After latent event mapping, we keep the encoder $F(\mathcal{D}|{\Theta})\to\mathcal{Z}$ and transform the $i$-th document $\bm{d}_{i}$ into the corresponding latent event representation $\bm{z}_{i}$. We then use Eq. (1) with $\alpha=1$ to measure the similarity between the embedded event instance $\bm{z}_{i}$ and the meta-event $\bm{\mu}_{j}$. Finally, we obtain the storyline distribution $\bm{q}_{i}=\{q_{i,1},q_{i,2},\cdots,q_{i,S}\}$ of document $\bm{d}_{i}$, where $S$ is a hyper-parameter that represents the total number of storylines.

A news article usually contains only one major event, so we assume that there is at most one event in each news article. The document $\bm{d}_{i}$ is assigned to the event index that has the maximum probability in $\bm{q}_{i}$.

3.3 Storyline construction

Under the assumption that the meta-event extracted in the previous time epoch $t-1$ can affect the extraction process in epoch $t$, we establish the connection between the current event in epoch $t$ and the previous one in epoch $t-1$ with a fusion component and capture this connection in a unified framework.

Suppose that $\bm{\mu}_{j}$ in epoch $t-1$ has been learned, denoted as $\bm{\mu}_{t-1,j}$, and that we need to learn the meta-event $\bm{\mu}_{t,j}$ in the current epoch $t$. More concretely, a linear fusion component is used to fuse them as follows:

$\displaystyle\bm{\mu}^{\prime}_{t,j}=\bm{W}\times\bm{\mu}_{t-1,j}+\bm{U}\times\bm{\mu}_{t,j}+\bm{b}$ (4)

where $\bm{W}\in\mathbb{R}^{m\times m}$ , $\bm{U}\in\mathbb{R}^{m\times m}$ are the parameter matrices, $\bm{b}\in\mathbb{R}^{m\times 1}$ is the bias, and $\bm{\mu}^{\prime}_{t,j}$ is the fused meta-event, which can be considered as a recurrent unit that gradually folds in the related meta-events over time.

For the initial epoch, we use the meta-event $\bm{\mu}_{t,j}$ directly, which is initialized using the standard K-means approach for the whole corpus. If two documents from different epochs have the same event assignment, then the two documents are considered to be in the same storyline. The fused meta-event for storyline $j$ at epoch $t$ , $\bm{\mu}^{\prime}_{t,j}$ , is updated accordingly.

With the fusion component, the proposed end-to-end DESEM approach can construct storylines directly. Furthermore, our approach can deal with flexible types of storylines, such as intermittent storylines (a storyline ends at a certain epoch and then resumes some epochs later), without any post-processing. This is because the meta-event $\bm{\mu}_{t,j}$ stores the information of the $j$-th storyline. If the storyline disappears at epoch $t+1$ but resumes at epoch $t+3$, then $\bm{\mu}^{\prime}_{t+3,j}$ will still be able to recover the previous storyline information stored in $\bm{\mu}_{t,j}$, which naturally deals with intermittent storylines.
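The fusion step of Eq. (4), together with the special case of the first epoch, can be sketched as follows. The helper names are hypothetical, and the toy matrices are chosen only so that the result is easy to check by hand; in the model, $\bm{W}$, $\bm{U}$ and $\bm{b}$ are learned.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def fuse_meta_event(mu_prev, mu_curr, W, U, b):
    """Eq. (4): mu'_{t,j} = W mu_{t-1,j} + U mu_{t,j} + b."""
    w_part = matvec(W, mu_prev)
    u_part = matvec(U, mu_curr)
    return [w + u + c for w, u, c in zip(w_part, u_part, b)]


def meta_event_for_epoch(t, mu_prev, mu_curr, W, U, b):
    """Epoch 1 has no predecessor, so its meta-event is used directly."""
    if t == 1:
        return list(mu_curr)
    return fuse_meta_event(mu_prev, mu_curr, W, U, b)


# Toy values (made up) in a 2-d latent event space.
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
mu_prev = [1.0, 0.0]
mu_curr = [0.0, 1.0]
fused = meta_event_for_epoch(2, mu_prev, mu_curr, W, U, b)
first = meta_event_for_epoch(1, mu_prev, mu_curr, W, U, b)
```

With these toy weights the fused meta-event is the average of the previous and current meta-events, which illustrates how information from epoch $t-1$ is folded into epoch $t$.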

3.4 Training

In the training step, we first pretrain the stacked autoencoder by initializing each of its layers with a denoising autoencoder. After initialization, we finetune the stacked autoencoder using the training set. Then, we iteratively refine the meta-event $\bm{\mu}_{t,j}$ by learning from its high-confidence event assignments with the help of an auxiliary reference distribution, using the clustering loss [32]. Specifically, the storyline distribution $\bm{q}_{i}$ is trained to be close to an auxiliary reference distribution $\bm{p}_{i}$. We use the Kullback-Leibler (KL) divergence to measure the difference between $\bm{p}_{i}$ and $\bm{q}_{i}$ as follows:

$\displaystyle\mathcal{L}_{KL}=\sum_{i}KL(\bm{p}_{i}\,\|\,\bm{q}_{i})=\sum_{i}\sum_{j}p_{i,j}\log\frac{p_{i,j}}{q_{i,j}}$ (5)

Then, we compute the reference distribution $p_{i,j}$ by normalizing $q_{i,j}$ as Eq. (2).

From Eq. (2), it can be seen that the auxiliary reference distribution $\bm{p}_{i}$ is defined by $\bm{q}_{i}$, so the training process can be seen as a form of self-training [24].

3.5 Optimization

We optimize $\mathcal{L}$ using Stochastic Gradient Descent (SGD) with momentum. The gradients of $\mathcal{L}$ with respect to the event instance representation $\bm{z}_{i}$, the meta-event representation $\bm{\mu}_{j}$, and the fusion parameters $\bm{W}$, $\bm{U}$, and $\bm{b}$ are computed as:

$\displaystyle\frac{\partial{\mathcal{L}}}{\partial{\bm{z}_{i}}}=\frac{\alpha+1}{\alpha}\sum_{j}\left(1+\frac{\lVert\bm{z}_{i}-\bm{\mu}_{j}\rVert^{2}}{\alpha}\right)^{-1}\times(p_{i,j}-q_{i,j})(\bm{z}_{i}-\bm{\mu}_{j})$ (6)

$\displaystyle\frac{\partial{\mathcal{L}}}{\partial{\bm{\mu}_{j}}}=-\frac{\partial{\mathcal{L}}}{\partial{\bm{z}_{i}}}\times\bm{U}^{T}$ (7)

$\displaystyle\frac{\partial{\mathcal{L}}}{\partial{\bm{W}}}=-\frac{\partial{\mathcal{L}}}{\partial{\bm{z}_{i}}}\times\bm{\mu}_{t-1,j}^{T}$ (8)

$\displaystyle\frac{\partial{\mathcal{L}}}{\partial{\bm{U}}}=-\frac{\partial{\mathcal{L}}}{\partial{\bm{z}_{i}}}\times\bm{\mu}_{t,j}^{T}$ (9)

$\displaystyle\frac{\partial{\mathcal{L}}}{\partial{\bm{b}}}=-\frac{\partial{\mathcal{L}}}{\partial{\bm{z}_{i}}}$ (10)
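Eq. (6) can be checked numerically for a single document. The sketch below compares the analytic gradient with a central finite-difference estimate of the KL loss, treating the reference distribution $\bm{p}_{i}$ as a constant, which is an assumption consistent with computing $\bm{p}$ once per iteration. All names and toy values are illustrative.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_row(z_i, centroids, alpha):
    """Eq. (1) for one document."""
    kernel = [
        (1.0 + sq_dist(z_i, mu) / alpha) ** (-(alpha + 1.0) / 2.0)
        for mu in centroids
    ]
    total = sum(kernel)
    return [k / total for k in kernel]


def kl_row(p_i, q_i):
    return sum(p * math.log(p / q) for p, q in zip(p_i, q_i))


def grad_z(z_i, centroids, p_i, alpha):
    """Analytic gradient of Eq. (6), with p_i held fixed."""
    q_i = soft_row(z_i, centroids, alpha)
    grad = [0.0] * len(z_i)
    for mu, p, q in zip(centroids, p_i, q_i):
        coeff = (alpha + 1.0) / alpha * (p - q) / (1.0 + sq_dist(z_i, mu) / alpha)
        for d in range(len(z_i)):
            grad[d] += coeff * (z_i[d] - mu[d])
    return grad


def numeric_grad_z(z_i, centroids, p_i, alpha, eps=1e-6):
    """Central finite differences of the KL loss with respect to z_i."""
    grad = []
    for d in range(len(z_i)):
        plus = list(z_i)
        minus = list(z_i)
        plus[d] += eps
        minus[d] -= eps
        loss_plus = kl_row(p_i, soft_row(plus, centroids, alpha))
        loss_minus = kl_row(p_i, soft_row(minus, centroids, alpha))
        grad.append((loss_plus - loss_minus) / (2.0 * eps))
    return grad


# Toy values (made up): one 2-d document, three meta-events.
z_i = [0.3, -0.2]
centroids = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
p_i = [0.7, 0.2, 0.1]
analytic = grad_z(z_i, centroids, p_i, alpha=1.0)
numeric = numeric_grad_z(z_i, centroids, p_i, alpha=1.0)
```

The two estimates should agree up to finite-difference error, both for $\alpha=1$ as used in the experiments and for other positive values of $\alpha$.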

We use standard backpropagation to update the model parameters. Based on the model structure and the loss function described above, the training procedure for DESEM is given in Algorithm 3.5.

Algorithm: Training procedure for the DESEM
Input: max training iterations MaxIter, total epochs $T$, news corpus $\{\mathcal{D}_{t}\}_{1}^{T}$ grouped by the publishing date
Initialize the model parameters $\{\Phi_{t}\}_{1}^{T}$
for each pair of the SAE layers do
    Create a denoising autoencoder using the layers
    for each batch $\mathcal{M}$ in $\{\mathcal{D}_{t}\}_{1}^{T}$ do
        Calculate the batch loss $\mathcal{L}_{\textit{MSE}}^{(\mathcal{M})}$ and update the gradients of the denoising autoencoder
    end for
    Copy the model parameters of the denoising autoencoder to the corresponding layers of the SAE
end for
for each batch $\mathcal{M}$ in $\{\mathcal{D}_{t}\}_{1}^{T}$ do
    Calculate the batch loss $\mathcal{L}_{\textit{MSE}}^{(\mathcal{M})}$ and update the gradients of the SAE
end for
Initialize $\bm{\mu}_{1}$ with standard K-means
for epoch $t$ from 1 to $T$ do
    for $s$ from 1 to MaxIter do
        Compute the storyline distribution $\bm{q}$ and the target distribution $\bm{p}$ over corpus $\mathcal{D}_{t}$
        for each batch $\mathcal{M}$ in $\mathcal{D}_{t}$ do
            if $t=1$ then
                Compute the storyline distribution $\bm{q}^{(\mathcal{M})}$ of batch $\mathcal{M}$ with $\bm{\mu}_{t,j}$ directly
            else
                Compute the storyline distribution $\bm{q}^{(\mathcal{M})}$ of batch $\mathcal{M}$ by Eq. (4) with $\bm{\mu}^{\prime}_{t,j}$
            end if
            Obtain the target distribution $\bm{p}^{(\mathcal{M})}$ corresponding to batch $\mathcal{M}$ from $\bm{p}$
            Calculate the minibatch loss $\mathcal{L}^{(\mathcal{M})}_{KL}$ and update the gradients $\nabla_{\Phi_{t}}\mathcal{L}_{\mathcal{M}}$ of the model parameters $\Phi_{t}$
        end for
    end for
end for

4. Experiments

In this section, we first describe the datasets, the baseline approaches, and the hyper-parameter settings used in our experiments, and then present the evaluation results of our model compared with the baseline approaches. Moreover, we conduct ablation experiments to evaluate the effectiveness of the various components of our model. In addition, we visualize the latent meta-events and the structured representations of storylines to validate the effectiveness of storyline extraction.

4.1 Datasets

To evaluate the proposed DESEM, we use the three datasets from [38], which were crawled from the GDELT Event Database.¹

¹ http://data.gdeltproject.org/events/index.html.

In these datasets, documents are grouped by their publishing date and each day is considered as an epoch:

• Dataset I contains more than 500K news articles published in the month of May in 2014.
• Dataset II, a subset of Dataset I, contains news articles published in the first week of May 2014. It consists of more than 100K documents, and altogether 77 storylines were manually annotated for evaluation.
• Dataset III includes 30 different types of storylines manually constructed from Dataset I, and four types of storyline are contained in this dataset: (1) long-term storylines which last for more than two weeks; (2) short-term storylines which last for less than one week; (3) intermittent storylines which last for more than two weeks in total, but stop for a period of time and then appear again; (4) new storylines that emerge in the middle of the whole period, not starting at the beginning.

We extract the quadruple $\langle\bm{l},\bm{p},\bm{o},\bm{w}\rangle$ from each document and concatenate the corresponding word embeddings to form the input vector $\bm{d}=[\bm{l},\bm{p},\bm{o},\bm{w}]$ of the SAE. A document may contain more than one entity for the same event element type, for example, mentions of several different locations. In that case, we calculate the weighted sum of all the embeddings of that type (e.g., all location embeddings) according to their occurrence frequency. If a certain event element is missing from the document, we set it to “null”, which is represented by a zero vector. We use pre-trained GloVe [25] to initialize each word with a 100-dimensional embedding vector, so $\bm{d}$ is a 400-dimensional vector.
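The construction of the input vector can be sketched as below. Several details are assumptions made for illustration: the weights are occurrence frequencies normalized to sum to one, mentions without an embedding are skipped, each mention is looked up as a single token, and the function names and toy embeddings are hypothetical.

```python
from collections import Counter

EMBED_DIM = 100  # per-element embedding size stated in the text


def weighted_embedding(mentions, embeddings, dim=EMBED_DIM):
    """Frequency-weighted sum of mention embeddings; zero vector if none."""
    counts = Counter(m for m in mentions if m in embeddings)
    total = sum(counts.values())
    vec = [0.0] * dim
    if total == 0:
        return vec  # missing element ("null")
    for mention, count in counts.items():
        weight = count / total
        for d, value in enumerate(embeddings[mention]):
            vec[d] += weight * value
    return vec


def document_vector(locations, persons, organizations, keywords,
                    embeddings, dim=EMBED_DIM):
    """Concatenate the four element vectors in the order [l, p, o, w]."""
    vec = []
    for mentions in (locations, persons, organizations, keywords):
        vec.extend(weighted_embedding(mentions, embeddings, dim))
    return vec


# Toy 2-d embeddings (made up) to keep the example readable.
toy_embeddings = {"u.s.": [1.0, 0.0], "apple": [0.0, 1.0]}
toy_vector = document_vector(
    ["u.s.", "u.s."], [], ["apple"], [], toy_embeddings, dim=2
)
full_size = len(document_vector([], [], [], [], {}))
```

With the default 100-dimensional embeddings, the concatenation of four element vectors gives the 400-dimensional input mentioned above.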

4.2 Experimental setup

In our experiments, we follow the pre-processing steps described in [38]. We use the Stanford Named Entity Recognizer² to identify the named entities. The keywords are extracted from documents as follows: we first remove common stop-words and keep only tokens that are verbs, nouns, or adjectives, and then filter out unimportant words based on TF-IDF; the remaining words are considered keywords. We choose the following methods as the baseline approaches.

² https://nlp.stanford.edu/software/CRF-NER.html.

RCRP [1] is a non-parametric model for evolutionary clustering, which assumes that storyline distribution in the previous epoch is the prior distribution for the current epoch.

SDM [39] assumes that the number of storylines is fixed and the storyline is modeled as a joint distribution over entities and keywords. The dependency of different documents of the same storyline at different periods is captured by modifying the Dirichlet prior.

DSEM [38] is integrated with the CRP so that the number of storylines can be determined automatically without human intervention. Moreover, a per-token Metropolis-Hastings sampler based on LightLDA [35] is used to reduce sampling complexity.

NSEM [37] is an unsupervised storyline extraction model based on a neural network. In this model, the title and the main body of a news article are assumed to share a similar storyline distribution. Moreover, similar documents described in neighboring periods are assumed to share similar storyline distributions. A pairwise ranking algorithm is used to optimize the model parameters.

We set the dimensions of the encoder network, a fully connected multilayer perceptron (MLP), to $n$-200-200-1000-50 for all datasets, where $n$ is the dimension of the original input document features; in our experiments, $n$ is set to 400. The decoder network is symmetric with the encoder network, so its dimensions are 50-1000-200-200-$n$. Except for the dimensions, the other settings of the autoencoder are the same as in [32].

For DESEM, NSEM, and SDM, we run the official code with the default parameter settings, except for the parameters mentioned below, on the same dataset slices. The number of storylines is set to 200 on Dataset I and 100 on Dataset II and Dataset III.³

³ Although the number of storylines is fixed as prior knowledge for all the models, some of the storyline indicators are not assigned any documents (i.e., empty clusters). As such, the number of extracted storylines actually varies across models.

RCRP and DSEM can determine the number of clusters automatically thanks to the Chinese restaurant process. For RCRP, the hyper-parameter $\alpha$ is set to 1. For SDM and NSEM, the number of historical epochs $M$ is set to 7, the same as the setting in [38]. We train the model in each epoch $t$ for 400 iterations with a batch size of 256. The learning rate of the SGD optimizer is 0.1 with momentum 0.9, and it decays every 20000 iterations with a rate of 0.1. The degree of freedom $\alpha$ of the Student $t$-distribution is set to 1 in the experiments.
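Two of these settings can be illustrated with a short sketch. The first function assumes that "decays with a rate of 0.1" means the learning rate is multiplied by 0.1 at each decay step. The second assumes the Student $t$ kernel is used as in the deep embedded clustering method of [32], where the similarity between a latent vector and a cluster center is $(1 + \lVert z - \mu \rVert^2 / \alpha)^{-(\alpha+1)/2}$, normalized over centers. Function names are illustrative, not taken from the released code.

```python
def step_decay_lr(iteration, base_lr=0.1, decay_rate=0.1, decay_every=20000):
    """Learning rate after `iteration` updates under step decay (assumed multiplicative)."""
    return base_lr * decay_rate ** (iteration // decay_every)


def soft_assignment(z, centers, alpha=1.0):
    """Student t-kernel similarity of one latent vector to each center, normalized to sum to 1."""
    weights = []
    for mu in centers:
        sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
        weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
    total = sum(weights)
    return [w / total for w in weights]
```

With $\alpha = 1$ the kernel reduces to $1 / (1 + \lVert z - \mu \rVert^2)$, so a point that coincides with one center and lies at unit distance from another is assigned to them with weights 2/3 and 1/3.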

4.3 Evaluation

We use the evaluation metrics of [38] to evaluate our model. We evaluate storyline extraction performance in terms of precision, recall, and F-measure, which are commonly used for evaluating information extraction models. Precision is calculated based on the following criteria: 1) the entities and keywords extracted refer to the same storyline; 2) the duration of the storyline is correct.

For Dataset I, it is difficult to obtain a gold standard, so we check the extraction results manually by searching for relevant news articles (e.g., by entities and keywords) in the same period. Three annotators carry out the evaluation, manually comparing the retrieved results with the extracted results against the criteria listed above to determine the correctness of each extraction result. The result agreed on by consensus is taken as the ground truth. We ignore duplicate storylines extracted in the experimental phase.

4.4 Experimental results

The experimental results of the proposed approach in comparison with the baselines on Dataset I, II and III are presented in Table 1. For Dataset I, which contains more than 500K news articles, no ground truth is available, so we report only the precision of our model and the baseline approaches, obtained by manually examining the extracted storylines. It is also difficult for a topic model-based method such as RCRP to extract storylines efficiently from such a large number of documents. Thus, RCRP is not run on Dataset I, but we compare it with our model on Dataset II and Dataset III.

Table 1
Performance comparison of the storyline extraction results on Dataset I, II and III

Dataset I
Method	Precision (%)	# of extracted storylines
SDM	70.20	104
DSEM	75.43	114
NSEM	76.58	121
DESEM	82.08	142
Dataset II
Method	Precision (%)	Recall (%)	F-measure (%)
RCRP	67.11	66.23	66.67
SDM	70.67	68.80	69.27
DSEM	73.17	77.92	75.47
NSEM	75.31	79.22	77.22
DESEM	79.66	80.51	80.08
Dataset III
Method	Precision (%)	Recall (%)	F-measure (%)
RCRP	61.54	53.33	57.14
SDM	54.17	43.33	48.15
DSEM	75.00	70.00	72.41
NSEM	77.78	70.00	73.69
DESEM	80.00	76.67	78.29

It can be observed from Table 1 that DESEM achieves the best performance on all three datasets. Specifically, for Dataset I, DESEM extracts 173 storylines, of which 142 are correct, giving a precision well above those of the other three baselines. For Dataset II, which contains 77 storylines, DESEM gives a recall similar to that of NSEM but a higher precision, which leads to an improved F-measure overall, outperforming the best baseline, NSEM, by 2.86 percentage points. For Dataset III, DESEM outperforms all the baselines by a large margin in F-measure, with improvements ranging from 4.6 to about 30 percentage points. The remarkable improvement on Dataset III, which has 30 complex storylines, empirically suggests that DESEM is better suited to dealing with flexible storylines than the baseline approaches.

We can also observe that RCRP gets better results than SDM on Dataset III but worse results on Dataset II across all metrics. A reasonable interpretation is that Dataset III contains more flexible storylines, such as storylines that begin in the middle of the period, or storylines that start at the beginning, have no relevant articles in the middle of the period, and appear again later. SDM aims to capture long-distance dependency by taking statistics from the past $M$ epochs into account; however, it cannot detect new storylines that emerge at a later time. As a result, SDM fails to identify the newly emerged storylines in Dataset III, whereas RCRP can generate new storylines automatically and therefore performs better than SDM on Dataset III. Overall, DESEM outperforms all the baselines across the three datasets. This may be attributed to the effective representations extracted by the autoencoder. In addition, iteratively refining the meta-events can improve clustering accuracy, which indirectly enhances storyline extraction performance.
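The reported figures can be checked with a few lines of arithmetic. The sketch below assumes that the F-measure is the balanced harmonic mean of precision and recall (F1), which the paper does not state explicitly; the function names are illustrative.

```python
def precision_pct(num_correct, num_extracted):
    """Precision in percent: correct storylines over all extracted storylines."""
    return 100.0 * num_correct / num_extracted


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean), assumed here to be the metric used."""
    return 2.0 * precision * recall / (precision + recall)


# DESEM on Dataset I: 142 correct storylines out of 173 extracted.
desem_p_dataset1 = precision_pct(142, 173)
# DESEM on Dataset II: precision 79.66%, recall 80.51% (Table 1).
desem_f_dataset2 = f_measure(79.66, 80.51)
```

Both values agree with the DESEM rows of Table 1 after rounding to two decimals.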

4.5 Ablation study

To evaluate the effectiveness of the components of the proposed DESEM framework, we conduct experiments with some key components removed or replaced. We try two variants, DESEM-AE and AE+$k$-means. DESEM-AE is DESEM without the latent event mapping step: the original document feature vector $\bm{d}=[\bm{l},\bm{p},\bm{o},\bm{w}]$ is fed into DESEM-AE directly. For AE+$k$-means, we keep the autoencoder for extracting the latent event features $z$, but replace the storyline distribution learning and storyline extraction steps with $k$-means with Euclidean distance for event clustering. To construct storylines, we initialize the $k$-means centers in the $t$-th epoch with the cluster centers obtained by $k$-means in the $(t-1)$-th epoch. The experimental results on Dataset II are shown in Table 2.

Table 2
Comparison of storyline extraction performance with variants of DESEM

Dataset II
Method	Precision (%)	Recall (%)	F-measure (%)
DESEM-AE	54.48	48.05	51.06
AE+ $k$ -means	70.50	58.44	63.90
DESEM	79.66	80.51	80.08

It can be observed from Table 2 that DESEM-AE gives the worst result. DESEM with the autoencoder achieves remarkable improvements in precision and recall, indicating the importance of using the autoencoder to extract latent event features from the simple weighted aggregation of word embeddings of key event elements in the original news articles. This shows that although word embeddings capture syntactic and semantic regularities to some extent, higher-level feature extraction is required to achieve better storyline extraction results. Compared with AE+$k$-means, DESEM achieves a significantly higher recall (80.51% versus 58.44%). This shows that the neural network-based approach to storyline extraction is more effective at recovering actual storylines than the traditional $k$-means algorithm.
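The storyline construction used by the AE+$k$-means variant, where the centers of one epoch seed the clustering of the next, can be sketched as below. This is a plain Lloyd-style $k$-means written for illustration only; the function name, the fixed iteration count, and the toy 2-D latent features are assumptions and not part of the paper.

```python
def kmeans(points, init_centers, n_iter=20):
    """Lloyd's algorithm with Euclidean distance, started from the given centers."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: an empty cluster keeps its previous center.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical 2-D latent features for two consecutive epochs.
z_prev = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
z_curr = [[0.4, 0.2], [0.6, 0.2], [5.4, 5.2], [5.6, 5.2]]

centers_prev, labels_prev = kmeans(z_prev, init_centers=[[0.0, 0.0], [5.0, 5.0]])
# Warm start: cluster j in epoch t continues storyline j from epoch t-1.
centers_curr, labels_curr = kmeans(z_curr, init_centers=centers_prev)
```

Because the cluster indices are carried over through the initial centers, documents assigned to cluster $j$ in consecutive epochs are read as belonging to the same storyline.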

4.6 Visualization of event representations

We present the structured representations of the storylines “Saudi MERS” and “American MERS” in Fig. 2, where the structured events (entity, person, organization, keywords) of the storylines and their durations can be easily observed. Fig. 2 shows how the two storylines develop according to the keywords. In the initial epoch, the MERS virus is detected in both Saudi Arabia and America at the same time, and then the virus starts to spread. In the end, the American patient with MERS is improving, whereas there is panic in Saudi Arabia. Thus, the two storylines follow a similar trend at an early stage but have different endings, which accords with Fig. 3.

Figure 2.

The structured representations of the storylines “Saudi MERS” and “American MERS”.

Figure 3.

The representations of storylines evolved over time.

Because the whole corpus is mapped into the same latent space, we can observe how storylines evolve over time. We randomly select some extracted storylines from Dataset II and use t-SNE to visualize the meta-event $\bm{\mu}$, which represents the corresponding cluster or storyline. From Fig. 3, we can observe how a storyline develops in the latent event space. For example, the storylines “Saudi MERS” and “American MERS” are both about the spread of the MERS virus, except that the events happen in different countries, and the two storylines lie close to each other in the latent space. This suggests that similar storylines have similar semantic information and development trends.

5. Conclusion and future work

In this paper, we have proposed a novel unsupervised neural network-based storyline extraction model, termed DESEM, to extract structured storylines from news articles over time. To jointly learn event representations and extract storylines, a stacked autoencoder is employed to learn latent event representations from documents, and a recurrent fusion component is used as a constraint to construct storylines by connecting related events across neighboring epochs. Experimental results show that our approach outperforms state-of-the-art approaches by a large margin. In future work, we will extend the proposed model to mine deep semantic relationships among events and to establish hierarchical relations among events at a finer granularity.

Acknowledgments

We would like to thank the reviewers for their valuable comments and helpful suggestions. This work was funded by the National Key Research and Development Program of China (2016YFC1306704), the National Natural Science Foundation of China (61772132) and the Natural Science Foundation of Jiangsu Province of China (BK20161430).

References

1. Ahmed, Eisenstein, Xing, Smola A.J. and Teo C.H., Unified analysis of streaming news, in: Proceedings of the 20th International Conference on World Wide Web, ACM, 2011, pp. 267–276.

2. Ahmed, Teo C.H., Eisenstein, Smola and Xing, Online inference for the infinite topic-cluster model: Storylines from streaming text, in: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 101–109.

3. Asadi and Regan, Spatio-temporal clustering of traffic data with deep embedded clustering, in: Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Prediction of Human Mobility, ACM, 2019, pp. 45–52.

4. Blei D.M. and Lafferty J.D., Dynamic topic models, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 113–120.

5. Blei D.M., A.Y. and Jordan M.I., Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

6. Cao, Liu and Ji, A novel neural topic model and its supervised extension, in: Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 2210–2216.

7. Diao and Jiang, Recurrent Chinese restaurant process with a duration-based discount for event identification from Twitter, in: Proceedings of the 2014 SIAM International Conference on Data Mining, SIAM, 2014, pp. 388–397.

8. Farajtabar, Ahmed, Smola A.J. and Song, Dirichlet-Hawkes processes with applications to clustering continuous-time document streams, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 219–228.

9. Ghasedi Dizaji, Herandi, Deng, Cai and Huang, Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5736–5745.

10. Guo, Gao, Liu and Yin, Improved deep embedded clustering with local structure preservation, in: Proceedings of Conference on International Joint Conference on Artificial Intelligence, 2017, pp. 1753–1759.

11. Guo, Liu, Zhu and Yin, Adaptive self-paced deep clustering with data augmentation, in: IEEE Transactions on Knowledge and Data Engineering, 2019.

12. Hadifar, Sterckx, Demeester and Develder, A self-training approach for short text clustering, in: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), 2019, pp. 194–199.

13. Huang and Huang, Optimized event storyline generation based on mixture-event-aspect model, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013, pp. 726–735.

14. Huang, Zhu and Heng P.-A., The dynamic Chinese restaurant process via birth and death processes, in: AAAI, 2015, pp. 2687–2693.

15. Kawamae, Trend analysis model: trend consists of temporal words, topics, and timestamps, in: Proceedings of the fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 317–326.

16. Qiao and Zhang, Discriminatively boosted image clustering with fully convolutional auto-encoders, Pattern Recognition 83 (2018), 161–173.

17. and Cardie, Timeline generation: Tracking individuals on Twitter, in: Proceedings of the 23rd International Conference on World Wide Web, ACM, 2014, pp. 643–652.

18. and Li, Evolutionary hierarchical Dirichlet process for timeline summarization, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2013, pp. 556–560.

19. Wang and Wang, Time-dependent hierarchical Dirichlet model for timeline generation, arXiv preprint arXiv:1312.2244, 2013.

20. Lin, Wang, Chen and Li, Generating event storylines from microblogs, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ACM, 2012, pp. 175–184.

21. Liu, Niu, Lai, Kong and Xu, Growing story forest online from massive breaking news, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, 2017, pp. 777–785.

22. Maaten L.v.d. and Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008), 2579–2605.

23. Miao and Blunsom, Neural variational inference for text processing, in: International Conference on Machine Learning, 2016, pp. 1727–1736.

24. Nigam and Ghani, Analyzing the effectiveness and applicability of co-training, in: International Conference on Information & Knowledge Management, 2000.

25. Pennington, Socher and Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

26. Radinsky and Horvitz, Mining the web to predict future events, in: Proceedings of the sixth ACM International Conference on Web Search and Data Mining, ACM, 2013, pp. 255–264.

27. Srivastava and Sutton, Autoencoding variational inference for topic models, arXiv preprint arXiv:1703.01488, 2017.

28. Tang, Zhang and Zhuang, Sketch the storyline with charcoal: a non-parametric approach, in: Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

29. Tian, Gao and Liu T.-Y., Sentence level recurrent topic model: Letting topics speak for themselves, arXiv preprint arXiv:1604.02038, 2016.

30. Vincent, Larochelle, Lajoie, Bengio and Manzagol P.-A., Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research 11 (2010), 3371–3408.

31. Wang, Zhou and He, ATM: adversarial-neural topic model, Information Processing & Management 56 (2019), 102098.

32. Xie, Girshick and Farhadi, Unsupervised deep embedding for clustering analysis, in: International Conference on Machine Learning, 2016, pp. 478–487.

33. Yan, Kong, Huang, Wan and Zhang, Timeline generation through evolutionary trans-temporal summarization, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 433–443.

34. Zhao, Zhang and Wu, Tracking news article evolution by dense subgraph learning, Neurocomputing 168 (2015), 1076–1084.

35. Yuan, Gao, Dai, Wei, Zheng, Xing E.P., Liu T.-Y. and Ma W.-Y., LightLDA: Big topic models on modest computer clusters, in: Proceedings of the 24th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2015, pp. 1351–1361.

36. Zhang, Song, Zhang and Liu, Evolutionary hierarchical Dirichlet processes for multiple correlated time-varying corpora, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010, pp. 1079–1088.

37. Zhou, Guo and He, Neural storyline extraction model for storyline generation from news articles, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Vol. 1, 2018, pp. 1727–1736.

38. Zhou, Dai X.-Y. and He, Unsupervised storyline extraction from news articles, in: Proceedings of Conference on International Joint Conference on Artificial Intelligence, 2016, pp. 3014–3021.

39. Zhou and He, An unsupervised Bayesian modelling approach for storyline detection on news articles, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1943–1948.
