Storyline extraction from news articles with dynamic dependency

Abstract

Storyline generation aims to produce a concise summary of related events unfolding over time from a collection of news articles. It can be cast into an evolutionary clustering problem by separating news articles into different epochs. Existing unsupervised approaches to storyline generation are typically based on probabilistic graphical models. They assume that the storyline distribution at the current epoch depends on a weighted combination of the storyline distributions in the previous $M$ epochs. The evolutionary parameters of such a long-term dependency are typically set by a fixed exponential decay function to capture the intuition that events in more recent epochs have a stronger influence on storyline generation in the current epoch. However, we argue that the amount of relevant historical contextual information should vary for different storylines. Therefore, in this paper, we propose a new Dynamic Dependency Storyline Extraction Model (D ${}^{2}$ SEM) in which the dependencies among events in different epochs but belonging to the same storyline are dynamically updated to track the time-varying distributions of storylines. The proposed model has been evaluated on three news corpora, and the experimental results show that it outperforms the state-of-the-art approaches and is able to capture the dependency on historical contextual information dynamically.

Keywords

Storyline extraction, dynamic dependency, topic model, event extraction

1. Introduction

Storyline generation aims to produce a concise summary of related events unfolding over time from a collection of news articles. It enables people to gain a quick glimpse of the development of current events without the need of reading through a large number of news articles. It has been intensively studied in recent years [13, 8, 24, 23, 22].

Storyline generation can be cast into an evolutionary clustering problem by separating news articles into different epochs [7]. Most existing approaches to storyline generation can be categorized into two types: pipeline methods and joint methods. Pipeline methods usually first cluster the events in every epoch and then link related events across epochs to form storylines. For example, in [13, 8], events were extracted separately from different time periods and were subsequently combined into storylines in a post-processing step. An obvious drawback of pipeline methods is error propagation. Joint models perform event extraction and storyline generation simultaneously, and are typically based on probabilistic graphical models. For instance, Zhou et al. [23] proposed a Dynamic Storyline Extraction Model (DSEM) which considers the dependencies among events across different epochs but belonging to the same storyline by using a pre-defined exponential decay function to quantify the degree of dependency. However, the actual degrees of dependency on past epochs are not known a priori. Liang et al. [14] proposed a Dynamic Clustering Topic Model (DCT) which is able to track the time-varying distributions of topics over documents and words over topics. Nevertheless, the DCT model is designed for short documents and cannot extract structured representations of storylines.

To extract structured storylines and capture the dependencies between events in the same storyline, in this paper we propose a nonparametric generative model for storyline generation which automatically captures the amount of relevant historical contextual information. To make the approach easier to understand, we first clarify several basic definitions:

Event: an event $e$ is something that happens in some place, and it may involve some persons or organizations. In general, each event is represented by a quadruple $\langle\bm{l},\bm{p},\bm{o},\bm{w}\rangle$ consisting of four elements: locations $l$ , persons $p$ , organizations $o$ and keywords $w$ .

Storyline: a storyline $s$ is the developing process of some events over time, and can be seen as a sequence of highly related events $s=[e_{1},e_{2},\cdots,e_{E}]$ happening in subsequent time, where $E$ is the number of events in the storyline $s$ .

Storyline extraction: storyline extraction aims to extract the storyline set $\mathcal{S}=\{s_{1},s_{2},\cdots,s_{S}\}$ from the corpus $\mathcal{D}=[D_{1},D_{2},\cdots,D_{T}]$ , where $S$ is the number of storylines and $T$ is the number of epochs. The whole corpus $\mathcal{D}$ is split into epochs, and the $\{D_{i}\}$ are ordered according to the publishing date.

In our approach, each news article is assumed to describe one storyline $s$ , and each storyline is modeled as a joint distribution over locations $l$ , organizations $o$ , persons $p$ and keywords $w$ . To represent the dependencies of events across different epochs but belonging to the same storyline, we assume that the priors of the distributions in the current epoch are weighted combinations of the corresponding distributions in previous epochs. The weights act as evolutionary model parameters that capture the variation of the distributions across epochs, and they are learned directly from the data using a fixed-point iterative estimation method.

The main contributions of this paper are summarized below:

• A non-parametric generative model is proposed to extract storylines, which can capture the dynamic temporal dependencies of related events in different epochs over a news stream. Collapsed Gibbs sampling [16] and a fixed-point iterative estimation method are used for parameter inference.
• The proposed model has been evaluated on three news corpora and shows a significant improvement in storyline extraction compared to a number of state-of-the-art approaches.

The remainder of the paper is organized as follows: Section 2 discusses related work; Section 3 describes the proposed model; Section 4 discusses our experiments and we conclude this paper in Section 5.

2. Related work

There are two lines of research related to our work, storyline extraction and topic tracking over time. In this section, we will give a brief review about the two lines and discuss the most related models about them.

2.1 Storyline extraction

There have been many studies on storyline detection and extraction from text. For example, to generate storylines from massive texts on the Internet, Yan et al. [19] calculated correlations of individual summaries on each date generated using a text summarization algorithm. Lin et al. [15] extracted storylines by first obtaining relevant tweets and then generating storylines via graph optimization. Binh et al. [3] chose relevant time points and contents to be included in a storyline summary using a linear regression model. Radinsky and Horvitz [18] constructed storylines based on text clustering and entity entropy. Specifically, the clustered articles in a chain are either published within a threshold time horizon of one another, or the date of an article is mentioned in the text of a more recent article in the chain. The performance of these pipeline approaches relies on the accuracy of calculating context similarity and time closeness, and they suffer from the problem of error propagation. For joint models, Huang et al. [10] developed a mixture event aspect model to distinguish local and global aspects of events described in sentences and utilized an optimization method to generate storylines. Huang et al. [11] first extracted topics from a short text corpus based on word co-occurrence patterns, and then developed an event evolution mining algorithm to discover hot events and their evolutions. To make use of the complex relationship between articles, Yu et al. [20] proposed a context-dependent news article storyline detection model based on dense subgraph learning, which adaptively learns a set of cross-article link patterns. However, an event is represented there by a set of raw articles and lacks an effective structured representation. Zhou et al. [23] proposed a non-parametric generative model called Dynamic Storyline Extraction Model (DSEM) to extract structured storylines, in which the Chinese Restaurant Process (CRP) [4] was used to determine the number of storylines automatically. However, it models the storyline dependencies with a fixed decay function, which is a strong assumption and might not be suitable for real-world data.

2.2 Topic tracking over time

There have been many studies on tracking the dynamic variation of topics over time. Blei et al. [5] proposed a Dynamic Topic Model (DTM) which captures the evolution of topics in a temporally organized corpus of documents. It models the topic dependencies by chaining the natural parameters of the topic distribution in a state space model that evolves with Gaussian noise. To model higher order dependencies, Ahmed and Xing [2] proposed a temporal Dirichlet process mixture model (TDPM). In every epoch, it calculates the contributions from previous epochs, which decay exponentially over time. To track topics adaptively, Iwata et al. [12] proposed a Topic Tracking Model (TTM) which can adaptively track the variations in interests and trends based on current purchase logs and previously estimated interests and trends of users. Specifically, the current Dirichlet prior is obtained from the mean of the posterior Dirichlet parameters in previous epochs, and a precision variable is used to represent the topic persistency over time. In a similar vein, Liang et al. [14] proposed a dynamic clustering topic model (DCT) which is able to track the time-varying distributions of topics over documents and words over topics on short documents. DCT models temporal dynamics by factorizing the Dirichlet parameter into the mean of the distribution at the previous timestep and a set of precision values, as in TTM. However, such approaches are not suitable for storyline tracking, which is a more complex task than topic tracking.

Our proposed approach is partly inspired by DSEM. However, DSEM only models the storyline dependencies of temporally-ordered documents using a fixed exponential decay function, without considering the variation of dependencies in the data. Our proposed Dynamic Dependency Storyline Extraction Model (D ${}^{2}$ SEM) instead extracts storylines from news streaming data by taking into account the dynamic dependencies of storyline distributions across epochs.

3. Methodology

In this section, we propose the Dynamic Dependency Storyline Extraction Model (D ${}^{2}$ SEM) to learn the dynamic dependency of related events in real world data stream. The notations used in our model are summarized in Table 1.

Table 1
Notations used in the article

Symbol	Description
$\bm{\alpha}$	The Dirichlet prior for the storyline distribution
$\bm{\pi}$	The Multinomial distribution of the storyline
$\bm{\varepsilon}$	The Dirichlet prior for the storyline-location distribution
$\bm{\phi}$	The Multinomial distribution of the storyline-location
$\bm{\eta}$	The Dirichlet prior for the storyline-organization distribution
$\bm{\psi}$	The Multinomial distribution of the storyline-organization
$\bm{\epsilon}$	The Dirichlet prior for the storyline-person distribution
$\bm{\omega}$	The Multinomial distribution of the storyline-person
$\bm{\gamma}_{s}$	The Dirichlet prior for the storyline-word distribution
$\bm{\gamma}_{bg}$	The Dirichlet prior for the background word distribution
$\bm{\varphi}$	The Multinomial distribution of the storyline-word
$\bm{\zeta}$	The Dirichlet prior for the word switch variable distribution
$\bm{\chi}$	The Multinomial distribution of the word switch variable
$x$	Word switch indicator
$\bm{\mu}$	Weights of the storyline distribution in the past epoch for the current one

To model the generation of storylines and characterize the variation of distributions across epochs, we assume that the storyline $s$ is modeled as a joint distribution over storyline-locations $l$ , storyline-organizations $o$ , storyline-persons $p$ and storyline-keywords $w_{s}$ . A document is drawn from either an existing storyline or a new storyline in epoch $t$ by the Chinese Restaurant Process (CRP). The distributions of the events in epoch $t$ depend on the corresponding distributions in the last $M$ epochs. The graphical model of D ${}^{2}$ SEM is shown in Fig. 1.

Figure 1.

The graphical representation of D ${}^{2}$ SEM. In the left half of the figure, the horizontal lines (in red and black colors) show the dependencies across epochs. The right half of the figure is the plate notation of the model in the epoch $t$ . Note that there might be temporal dependencies longer than two time steps shown in the figure.
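To illustrate how a CRP lets a document join an existing storyline or open a new one, the following Python sketch computes the assignment probabilities of a plain CRP. It is a simplified illustration only: the function name `crp_probabilities` and the toy counts are assumptions, and the sketch ignores the cross-epoch dependency that the full model adds.

```python
from typing import List


def crp_probabilities(counts: List[int], beta: float) -> List[float]:
    """Return plain-CRP assignment probabilities for one incoming document.

    counts -- number of documents already assigned to each existing storyline
    beta   -- CRP hyperparameter controlling how often a new storyline opens
    The last entry of the result is the probability of a new storyline.
    """
    total = sum(counts) + beta
    probs = [n / total for n in counts]
    probs.append(beta / total)
    return probs


# Two existing storylines with 3 and 1 documents, beta = 1.
probs = crp_probabilities([3, 1], beta=1.0)
```

With these toy counts the larger storyline attracts the new document most often, while a new storyline remains possible with probability proportional to `beta`.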

In static topic models such as Latent Dirichlet Allocation (LDA) [6], there is an underlying assumption that the topic distribution and word distribution are independent of the past distributions, and they have Dirichlet priors with a static set of parameters. However, the independence assumption is not realistic when it comes to dealing with data streams. For storyline extraction, the distributions at epoch $t$ are usually dependent on past distributions. Taking the storyline distribution as an example, without considering the dependency, the storyline distribution at epoch $t$ can be written as follows:

$\displaystyle P(\bm{\pi}_{t}|\alpha)\propto\prod_{s=1}^{S_{t}}\pi_{t,s}^{\alpha_{s}-1}$ (1)

where $\alpha=\{\alpha_{s}\}_{s=1}^{S}$ , $\alpha_{s}>0$ is the Dirichlet prior for the storyline distribution, $S$ is the number of storylines, and $\bm{\pi}_{t}$ is the storyline distribution at epoch $t$ .

In the dynamic setting, the dependencies of storyline distributions in the past latest $M$ epochs need to be taken into account. Also, the storyline distribution at epoch $t$ needs to be updated when a new set of documents is observed at $t$ . In the simplest case, $M=1$ , that is, the storyline distribution at $t$ , $\bm{\pi}_{t}$ , is only dependent on the storyline distribution at $t-1$ , $\bm{\pi}_{t-1}$ . In real life, events may have long-term effects, that is, the storyline distribution may depend on historical contextual information spanning over multiple past epochs. To capture such a long-term dependency, we assume the Dirichlet prior of the storyline distribution at epoch $t$ is a weighted combination of the historical storyline distributions in the past $M$ epochs. More concretely, the Dirichlet prior for the distribution of storylines at epoch $t$ is:

$\displaystyle\bm{\alpha}_{t}=\sum_{m=1}^{M}\bm{\mu}_{t,m}\bm{\pi}_{t-m}$ (2)

where $\bm{\mu}_{t,m}$ denotes the weight (or impact) of the storyline distribution in the past epoch $t-m$ for the derivation of the storyline distribution in the current epoch $t$ . We call $\bm{\mu}_{t,m}$ evolutionary parameters and will show in Section 3.1.3 how to estimate them automatically from data. $\bm{\pi}_{t-m}$ is the posterior storyline distribution at $t-m$ epoch.

Accordingly, the storyline distribution $p(\bm{\pi}_{t})$ at epoch $t$ can be written as:

$\displaystyle p(\bm{\pi}_{t}|\bm{\alpha}_{t})\propto\prod_{s=1}^{S_{t}}\pi_{t,s}^{\bm{\alpha}_{t,s}-1}$ (3)
$\displaystyle=\prod_{s=1}^{S_{t}}\pi_{t,s}^{(\sum_{m=1}^{M}\bm{\mu}_{t,m,s}\bm{\pi}_{t-m,s})-1}$ (4)

where $M$ is the dependency length.
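As a minimal numerical sketch of Eq. (2), the following Python snippet builds the Dirichlet prior for epoch $t$ from the posterior storyline distributions of the last $M$ epochs. The function name `dirichlet_prior`, the list-of-lists layout and the toy numbers are illustrative assumptions rather than part of the model description.

```python
from typing import List


def dirichlet_prior(mu: List[List[float]], past_pi: List[List[float]]) -> List[float]:
    """Eq. (2): alpha_{t,s} = sum_m mu_{t,m,s} * pi_{t-m,s}.

    mu[m][s]      -- weight of epoch t-(m+1) for storyline s (assumed layout)
    past_pi[m][s] -- posterior storyline distribution at epoch t-(m+1)
    """
    n_storylines = len(past_pi[0])
    alpha = [0.0] * n_storylines
    for mu_m, pi_m in zip(mu, past_pi):
        for s in range(n_storylines):
            alpha[s] += mu_m[s] * pi_m[s]
    return alpha


# Toy example with M = 2 past epochs and 3 storylines.
past_pi = [[0.5, 0.3, 0.2],   # epoch t-1
           [0.2, 0.2, 0.6]]   # epoch t-2
mu = [[4.0, 4.0, 4.0],        # weights for epoch t-1
      [1.0, 1.0, 5.0]]        # weights for epoch t-2
alpha_t = dirichlet_prior(mu, past_pi)
```

In this toy setting the third storyline puts more weight on epoch $t-2$ than on epoch $t-1$, a pattern that a single decay function shared by all storylines cannot express.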

In a similar way, the dynamic variation of the multinomial distribution $p(\bm{\phi}_{t}|\{\bm{\mu}_{t,m}^{(\phi)},\bm{\phi}_{t-m}\}_{m=1}^{M})$ of location, $p(\bm{\psi}_{t}|\{\bm{\mu}_{t,m}^{(\psi)},\bm{\psi}_{t-m}\}_{m=1}^{M})$ of organization, $p(\bm{\omega}_{t}|\{\bm{\mu}_{t,m}^{(\omega)},\bm{\omega}_{t-m}\}_{m=1}^{M})$ of person and $p(\bm{\varphi}_{t}|\{\bm{\mu}_{t,m}^{(\varphi)},\bm{\varphi}_{t-m}\}_{m=1}^{M})$ of keyword can be derived accordingly, where $\bm{\mu}_{t,m}^{(\phi)}$ , $\bm{\mu}_{t,m}^{(\psi)}$ , $\bm{\mu}_{t,m}^{(\omega)}$ , $\bm{\mu}_{t,m}^{(\varphi)}$ are the weights of the storyline-location, storyline-organization, storyline-person and storyline-keyword distributions in epoch $t-m$ for the current one, respectively.

Given the historical storyline distributions $\{\bm{\pi}_{t-m}\}_{m=1}^{M}$ , storyline-location distribution $\{\bm{\phi}_{t-m}\}_{m=1}^{M}$ , storyline-organization distribution $\{\bm{\psi}_{t-m}\}_{m=1}^{M}$ , storyline-person distribution $\{\bm{\omega}_{t-m}\}_{m=1}^{M}$ and storyline-keyword distribution $\{\bm{\varphi}_{t-m}\}_{m=1}^{M}$ in previous epochs, the generative process of our proposed model is shown below:

For each time epoch $t$ from $1$ to $T$ :

• Draw a distribution $\bm{\pi}_{t}$ over prior storylines $\bm{\pi}_{t}\sim$ Dirichlet $(\sum_{m=1}^{M}\bm{\mu}_{t,m}^{(\pi)}\bm{\pi}_{t-m})$
• Draw a distribution over word switch variable $\chi\sim$ Beta $(\zeta)$
• For background words, draw a background word distribution $\varphi_{t,bg}\sim$ Dirichlet $(\gamma_{t,bg})$
• For each existing storyline $s\in\{1,\cdots,S_{t}\}$ :
  – Draw a distribution over location $\bm{\phi}_{t,s}\sim$ Dirichlet $(\sum_{m=1}^{M}\bm{\mu}_{t,m}^{(\phi)}\bm{\phi}_{t-m,s})$
  – Draw a distribution over organization $\bm{\psi}_{t,s}\sim$ Dirichlet $(\sum_{m=1}^{M}\bm{\mu}_{t,m}^{(\psi)}\bm{\psi}_{t-m,s})$
  – Draw a distribution over person $\bm{\omega}_{t,s}\sim$ Dirichlet $(\sum_{m=1}^{M}\bm{\mu}_{t,m}^{(\omega)}\bm{\omega}_{t-m,s})$
  – Draw a distribution over keyword $\bm{\varphi}_{t,s}\sim$ Dirichlet $(\sum_{m=1}^{M}\bm{\mu}_{t,m}^{(\varphi)}\bm{\varphi}_{t-m,s})$
• For each document $d\in\{1,\cdots,D_{t}\}$ :
  – Draw a storyline indicator $s_{d}$ from CRP.
    † If $s_{d}$ is an existing one, use existing distributions in the following steps
    † If $s_{d}$ is a new storyline, draw distributions $\varphi_{t,s_{\textit{new}}}\sim$ Dirichlet $(\gamma_{t,s_{\textit{new}}})$ , $\omega_{t,s_{\textit{new}}}\sim$ Dirichlet $(\epsilon_{t,s_{\textit{new}}})$ , $\psi_{t,s_{\textit{new}}}\sim$ Dirichlet $(\eta_{t,s_{\textit{new}}})$ and $\phi_{t,s_{\textit{new}}}\sim$ Dirichlet $(\varepsilon_{t,s_{\textit{new}}})$
  – For each location $l\in\{1,\cdots,L_{d}\}$ , choose a location $l\sim$ Multinomial $(\bm{\phi}_{t,s_{d}})$
  – For each organization $o\in\{1,\cdots,O_{d}\}$ , choose an organization $o\sim$ Multinomial $(\bm{\psi}_{t,s_{d}})$
  – For each person $p\in\{1,\cdots,P_{d}\}$ , choose a person $p\sim$ Multinomial $(\bm{\omega}_{t,s_{d}})$
  – For each word position $n\in\{1,\cdots,N_{d}\}$ :
    † Draw a switch variable $x_{d,n}\sim$ Multinomial $(\chi)$
    † If $x_{d,n}=0$ , choose a background word $w_{n,bg}\sim$ Multinomial $(\varphi_{t,bg})$
    † If $x_{d,n}=1$ , choose a keyword $w_{n,s_{d}}\sim$ Multinomial $(\bm{\varphi}_{t,s_{d}})$

where $D_{t}$ is the number of documents in epoch $t$ . $L_{d},O_{d},P_{d},N_{d}$ are the numbers of locations, organizations, persons and words in document $d$ respectively. $\varepsilon_{t,s_{\textit{new}}}$ , $\eta_{t,s_{\textit{new}}}$ , $\epsilon_{t,s_{\textit{new}}}$ and $\gamma_{t,s_{\textit{new}}}$ are the Dirichlet priors of distributions of locations, organizations, persons and keywords for the new storyline sampled at epoch $t$ .
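The document-level part of the generative process can be sketched in Python as follows. This is a toy illustration under simplifying assumptions: the storyline $s_{d}$ is taken as already chosen, the distributions are given as plain dictionaries instead of being drawn from Dirichlet priors, and all names (`generate_document`, `sample_categorical`, the placeholder vocabulary) are hypothetical.

```python
import random
from typing import Dict, List


def sample_categorical(rng: random.Random, dist: Dict[str, float]) -> str:
    """Draw one item from a categorical distribution given as {item: prob}."""
    items = list(dist)
    return rng.choices(items, weights=[dist[i] for i in items], k=1)[0]


def generate_document(rng: random.Random, storyline: Dict[str, Dict[str, float]],
                      background: Dict[str, float], chi: float,
                      n_loc: int, n_org: int, n_per: int, n_words: int) -> Dict[str, List]:
    """Generate one document for an already chosen storyline s_d.

    storyline  -- categorical distributions 'phi' (locations), 'psi'
                  (organizations), 'omega' (persons), 'varphi' (keywords)
    background -- categorical distribution over background words
    chi        -- probability that a word position is a storyline keyword (x = 1)
    """
    doc = {
        "locations": [sample_categorical(rng, storyline["phi"]) for _ in range(n_loc)],
        "organizations": [sample_categorical(rng, storyline["psi"]) for _ in range(n_org)],
        "persons": [sample_categorical(rng, storyline["omega"]) for _ in range(n_per)],
        "words": [],
        "switches": [],
    }
    for _ in range(n_words):
        x = 1 if rng.random() < chi else 0            # word switch variable
        source = storyline["varphi"] if x == 1 else background
        doc["switches"].append(x)
        doc["words"].append(sample_categorical(rng, source))
    return doc


# Placeholder distributions for one storyline and the background vocabulary.
storyline = {
    "phi": {"city_a": 0.7, "city_b": 0.3},
    "psi": {"org_a": 1.0},
    "omega": {"person_a": 0.5, "person_b": 0.5},
    "varphi": {"election": 0.6, "vote": 0.4},
}
background = {"said": 0.5, "today": 0.5}
doc = generate_document(random.Random(0), storyline, background,
                        chi=0.6, n_loc=2, n_org=1, n_per=2, n_words=6)
```

Each word position first draws the switch variable and then picks its word from either the background distribution or the storyline keyword distribution, mirroring the last three steps of the process above.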

3.1 Inference and parameter estimation

We use collapsed Gibbs sampling to infer the parameters of our model given the observed news stream documents. Gibbs sampling is a Markov chain Monte Carlo method that constructs a Markov chain whose stationary distribution is the posterior of interest. Each latent variable, here $s_{t,d}$ and $x_{d,n}$ , is repeatedly sampled from its conditional distribution given the current values of all other variables and the data [9]. Such samples can be used to empirically estimate the target distributions.

3.1.1 Storyline sampling

Letting the subscript $-d$ denote the quantity that excludes counts in document $d$ , the conditional posterior for $s_{t,d}$ is:

$\displaystyle p(s_{t,d}=j|{\bm{s}_{t,-d}},\bm{l},\bm{o},\bm{p},\bm{w},\Lambda)\propto\frac{\prod_{l}^{L}\prod_{b=1}^{n_{j,l}^{(d)}}(n_{j,l}+\sum_{m=1}^{M}\mu_{t,j,l,m}^{(\phi)}\phi_{t-m,j,l}-b)}{\prod_{b=1}^{n_{j}^{L_{(d)}}}(n_{j}^{L}+\sum_{m=1}^{M}\sum_{l}^{L}\mu_{t,j,l,m}^{(\phi)}\phi_{t-m,j,l}-b)}$
$\displaystyle{}\times\frac{\prod_{o}^{O}\prod_{b=1}^{n_{j,o}^{(d)}}(n_{j,o}+\sum_{m=1}^{M}\mu_{t,j,o,m}^{(\psi)}\psi_{t-m,j,o}-b)}{\prod_{b=1}^{n_{j}^{O_{(d)}}}(n_{j}^{O}+\sum_{m=1}^{M}\sum_{o}^{O}\mu_{t,j,o,m}^{(\psi)}\psi_{t-m,j,o}-b)}$
$\displaystyle{}\times\frac{\prod_{p}^{P}\prod_{b=1}^{n_{j,p}^{(d)}}(n_{j,p}+\sum_{m=1}^{M}\mu_{t,j,p,m}^{(\omega)}\omega_{t-m,j,p}-b)}{\prod_{b=1}^{n_{j}^{P_{(d)}}}(n_{j}^{P}+\sum_{m=1}^{M}\sum_{p}^{P}\mu_{t,j,p,m}^{(\omega)}\omega_{t-m,j,p}-b)}$
$\displaystyle{}\times\frac{\prod_{w}^{W}\prod_{b=1}^{n_{j,w}^{(d)}}(n_{j,w}+\sum_{m=1}^{M}\mu_{t,j,w,m}^{(\varphi)}\varphi_{t-m,j,w}-b)}{\prod_{b=1}^{n_{j}^{W_{(d)}}}(n_{j}^{W}+\sum_{m=1}^{M}\sum_{w}^{W}\mu_{t,j,w,m}^{(\varphi)}\varphi_{t-m,j,w}-b)}$
$\displaystyle{}\times\left\{\begin{array}{l}\frac{\beta}{n_{\textit{new},-d}+\beta},\text{ new }s\text{ at }t\\ \\ \frac{n_{\textit{new},j,-d}}{n_{\textit{new},-d}+\beta},\text{ existing }s\text{ at }t\\ \\ \frac{n_{\textit{pri},j,-d}+\sum_{m=1}^{M}\mu_{t,j,m}^{(\pi)}\pi_{t-m,j}}{\sum_{s=1}^{S_{t}}(n_{s}+\sum_{m=1}^{M}\mu_{t,j,m}^{(\pi)}\pi_{t-m,j}-1)},\text{ others}\\ \end{array}\right.$ (5)

where $L$ , $O$ , $P$ , $W$ are the numbers of locations, organizations, persons and keywords in the corpus $D$ respectively. $n_{\textit{new},-d}$ denotes the number of documents assigned to the new storyline generated in the current epoch $t$ , $n_{\textit{new},j,-d}$ denotes the number of documents assigned to the new storyline indicator $j$ in the current epoch $t$ , and $n_{\textit{pri},j,-d}$ denotes the number of documents assigned to an existing storyline in the past $M$ epochs. $n_{j,l}$ is the number of times location $l$ is assigned to storyline $j$ , and $n_{j}^{L}$ denotes the total number of locations assigned to storyline $j$ in the corpus. $n_{j,o}$ is the number of times organization $o$ is assigned to storyline $j$ , and $n_{j}^{O}$ denotes the total number of organizations assigned to storyline $j$ in the corpus. $n_{j,p}$ is the number of times person $p$ is assigned to storyline $j$ , and $n_{j}^{P}$ denotes the total number of persons assigned to storyline $j$ in the corpus. $n_{j,w}$ is the number of times word $w$ is assigned to storyline $j$ , and $n_{j}^{W}$ denotes the total number of words assigned to storyline $j$ in the corpus. Variables with the $(d)$ notation denote the counts relating to document $d$ only. The terms in the big curly bracket denote the probability of assigning document $d$ to storyline $j$ by incorporating the CRP, and $\beta$ is the CRP hyperparameter used when a new storyline $s$ is generated.
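Each of the four entity-type ratios in Eq. (5) has the same shape, so one factor can be sketched in Python as below. The sketch works in log space to avoid underflow and is a direct transcription of a single ratio under stated assumptions: the function name `log_entity_factor` and the dictionary layout are hypothetical, and `prior[item]` stands for the already computed sum $\sum_{m}\mu\cdot$ (past distribution) for that item.

```python
import math
from typing import Dict


def log_entity_factor(n_j: Dict[str, int], n_d: Dict[str, int],
                      prior: Dict[str, float]) -> float:
    """Log of one ratio in Eq. (5) for a single entity type.

    n_j   -- counts of each item in storyline j, including document d
    n_d   -- counts of each item in document d only
    prior -- prior[item] = sum_m mu_{t,j,item,m} * dist_{t-m,j,item} (must be > 0)
    """
    # Numerator: product over items and over b = 1..n_{j,item}^{(d)}.
    log_num = 0.0
    for item, count_d in n_d.items():
        for b in range(1, count_d + 1):
            log_num += math.log(n_j[item] + prior[item] - b)
    # Denominator: product over b = 1..(total count of this type in d).
    total_j = sum(n_j.values())
    total_d = sum(n_d.values())
    total_prior = sum(prior.values())
    log_den = 0.0
    for b in range(1, total_d + 1):
        log_den += math.log(total_j + total_prior - b)
    return log_num - log_den


# Toy counts: storyline j holds locations a (twice) and b (once);
# document d contributes one of the mentions of a.
value = log_entity_factor({"a": 2, "b": 1}, {"a": 1}, {"a": 1.0, "b": 1.0})
```

Summing the four log factors (locations, organizations, persons, keywords) and adding the log of the CRP term would give the unnormalized log probability of assigning document $d$ to storyline $j$.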

3.1.2 Word category sampling

Letting the subscript $n$ denote the word position, for each word token $w_{d,n}$ in document $d$ , the posterior probability of being a background word is:

$\displaystyle p(x_{d,n}=0|{\bm{x}_{d,-n}},{\bm{w}_{bg}},\Lambda)\propto\frac{n_{d,bg}+\zeta_{bg}-1}{\sum_{y=1}^{2}(n_{d,y}+\zeta_{y})-1}\cdot\frac{n_{bg,w_{d,n}}+\gamma_{bg,w_{d,n}}-1}{\sum_{w=1}^{W}(n_{bg,w}+\gamma_{bg,w})-1}$ (6)

where $n_{d,bg}$ denotes the number of background words in document $d$ and $n_{bg,w_{d,n}}$ denotes the number of times word $w_{d,n}$ is assigned to the background word category.

The posterior probability of the word token $w_{d,n}$ being a storyline word is:

$\displaystyle p(x_{d,n}=1|{\bm{x}_{d,-n}},{\bm{w}_{j}},\Lambda)\propto\frac{n_{d,j}+\zeta_{j}-1}{\sum_{y=1}^{2}(n_{d,y}+\zeta_{y})-1}\cdot\frac{n_{j,w_{d,n}}+\gamma_{j,w_{d,n}}-1}{\sum_{w=1}^{W}(n_{j,w}+\gamma_{j,w})-1}$ (7)

where $n_{d,j}$ denotes the number of words in document $d$ assigned to the storyline $j$ and $n_{j,w_{d,n}}$ denotes the number of times word token $w_{d,n}$ is assigned to the storyline $j$ .
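Eqs. (6) and (7) share one form, so the two unnormalized scores can be computed by the same expression and then normalized. The Python sketch below transcribes them literally; the function name `switch_posterior`, the tuple layout (index 0 for background, index 1 for storyline $j$) and the toy counts are assumptions made for illustration.

```python
from typing import Tuple

Pair = Tuple[float, float]


def switch_posterior(n_d: Pair, zeta: Pair, n_w: Pair, gamma_w: Pair,
                     n_total: Pair, gamma_total: Pair) -> Pair:
    """Normalized probabilities (background, storyline) following Eqs. (6)-(7).

    n_d[c]         -- words of category c in document d
    zeta[c]        -- switch prior for category c
    n_w[c]         -- times the current word type is assigned to category c
    gamma_w[c]     -- word prior of the current word type in category c
    n_total[c]     -- total words assigned to category c
    gamma_total[c] -- sum of the word priors of category c over the vocabulary
    """
    # Shared denominator of the switch term: sum_y (n_{d,y} + zeta_y) - 1.
    switch_den = sum(n_d[c] + zeta[c] for c in (0, 1)) - 1
    scores = []
    for c in (0, 1):
        switch_term = (n_d[c] + zeta[c] - 1) / switch_den
        word_term = (n_w[c] + gamma_w[c] - 1) / (n_total[c] + gamma_total[c] - 1)
        scores.append(switch_term * word_term)
    norm = sum(scores)
    return scores[0] / norm, scores[1] / norm


p_bg, p_story = switch_posterior(n_d=(3, 3), zeta=(1.0, 1.0), n_w=(2, 5),
                                 gamma_w=(1.0, 1.0), n_total=(10, 10),
                                 gamma_total=(10.0, 10.0))
```

In this toy case the word type occurs more often under the storyline than in the background, so the storyline category receives the larger probability.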

3.1.3 Evolutionary parameter estimation

During sampling, at each iteration, the weight parameters $\bm{\mu}_{t}^{(\pi)}$ , $\bm{\mu}_{t}^{(\phi)}$ , $\bm{\mu}_{t}^{(\psi)}$ , $\bm{\mu}_{t}^{(\omega)}$ , $\bm{\mu}_{t}^{(\varphi)}$ can be estimated by maximizing the joint distribution $p(\mathbf{d}_{t},s_{t,d}|\{\bm{\pi},\bm{\phi},\bm{\psi},\bm{\omega},\bm{\varphi}\}_{t-m},\{\bm{\mu}^{(\pi,\phi,\psi,\omega,\varphi)}\}_{t,m},\Lambda)$ . We apply a fixed-point iteration method [17] to obtain the optimal weights at $t$ . The update formula of $\bm{\mu}_{t}^{(\pi)}$ is as follows:

$\displaystyle(\mu_{t,j,m})^{\text{new}}\leftarrow\mu_{t,j,m}\times\frac{A_{t,j}-B_{t,j}}{C_{t,j}-D_{t,j}},$ (8)

where $A_{t,j}=\Psi(n_{t,j}+\sum_{m=1}^{M}\mu_{t,j,m}^{(\pi)}\pi_{t-m,j})$ , $B_{t,j}=\Psi(\sum_{m=1}^{M}\mu_{t,j,m}^{(\pi)}\pi_{t-m,j})$ , $C_{t,j}=\Psi(\sum_{j}(n_{t,j}+\sum_{m=1}^{M}\mu_{t,j,m}^{(\pi)}\pi_{t-m,j}))$ and $D_{t,j}=\Psi(\sum_{j}\sum_{m=1}^{M}\mu_{t,j,m}^{(\pi)}\pi_{t-m,j})$ . $\Psi(\cdot)$ defined by $\Psi(x)=\frac{\partial\log\Gamma(x)}{\partial x}$ is the digamma function. The update rules of $\bm{\mu}_{t}^{(\phi)}$ , $\bm{\mu}_{t}^{(\psi)}$ , $\bm{\mu}_{t}^{(\omega)}$ and $\bm{\mu}_{t}^{(\varphi)}$ are in similar forms.
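One step of the update in Eq. (8) can be sketched in Python as follows. The code transcribes the formula as printed and is only an illustration: the function names `digamma` and `update_mu` and the `[j][m]` array layout are assumptions, and the digamma helper is a standard recurrence-plus-asymptotic-series approximation written here to keep the example dependency-free. The sketch assumes every storyline has a positive prior and that at least one document is assigned in epoch $t$.

```python
import math
from typing import List


def digamma(x: float) -> float:
    """Digamma function Psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                 # shift x upward using Psi(x) = Psi(x+1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def update_mu(mu: List[List[float]], past_pi: List[List[float]],
              n_t: List[int]) -> List[List[float]]:
    """One fixed-point step of Eq. (8) for the storyline weights.

    mu[j][m]      -- current weight mu_{t,j,m} (assumed layout)
    past_pi[j][m] -- storyline probability pi_{t-m,j} for the m-th past epoch
    n_t[j]        -- number of documents assigned to storyline j at epoch t
    """
    # prior[j] = sum_m mu_{t,j,m} * pi_{t-m,j}
    prior = [sum(w * p for w, p in zip(mu[j], past_pi[j])) for j in range(len(mu))]
    c = digamma(sum(n + a for n, a in zip(n_t, prior)))   # C_{t,j}
    d = digamma(sum(prior))                               # D_{t,j}
    new_mu = []
    for j, a_j in enumerate(prior):
        ratio = (digamma(n_t[j] + a_j) - digamma(a_j)) / (c - d)   # (A - B) / (C - D)
        new_mu.append([w * ratio for w in mu[j]])
    return new_mu


# Two storylines, M = 2 past epochs, unit initial weights.
mu = [[1.0, 1.0], [1.0, 1.0]]
past_pi = [[0.5, 0.5], [0.5, 0.5]]
new_mu = update_mu(mu, past_pi, n_t=[3, 1])
```

With these toy inputs the storyline that received three documents has its weights scaled up and the one that received a single document has them scaled down; in practice the step would be repeated until the weights stop changing.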

Based on the recurrent structure of D ${}^{2}$ SEM, the inference procedure for one epoch is listed in Algorithm 1.

Algorithm 1. Inference for the D ${}^{2}$ SEM model at epoch $t$ .

Input: Previous storyline distributions $\{\bm{\pi}_{t-m}\}_{m=1}^{M}$ , storyline-location distributions $\{\bm{\phi}_{t-m}\}_{m=1}^{M}$ , storyline-organization distributions $\{\bm{\psi}_{t-m}\}_{m=1}^{M}$ , storyline-person distributions $\{\bm{\omega}_{t-m}\}_{m=1}^{M}$ , storyline-keyword distributions $\{\bm{\varphi}_{t-m}\}_{m=1}^{M}$ , news documents $D_{t}$ at time $t$ , extracted storylines $\mathcal{S}_{t-1}$ from $\{D_{i}\}_{1}^{t-1}$ in epoch $t-1$
Initialize storyline assignments randomly for all documents in $D_{t}$
for $\textit{iter}=1$ to $N_{\textit{iter}}$ do
  for $d$ in $D_{t}$ do
    Draw $s_{t,d}$ from $p(s_{t,d}|{\bm{s}_{t,-d}},\bm{l},\bm{o},\bm{p},\bm{w},\Lambda)$
    if $s_{t,d}=j$ is a new storyline then
      Add $j$ to the existing storylines set
    end if
    Update $n_{j,l}$ , $n_{j,o}$ , $n_{j,p}$ , $n_{j,w}$ , $n_{\textit{new},j}$ and $n_{\textit{pri},j}$
    for $w_{d,n}$ in $d$ do
      Draw $x_{d,n}$ from $p(x_{d,n}|{\bm{x}_{d,-n}},{\bm{w}},\Lambda)$
      Update $n_{d,bg}$ , $n_{d,j}$ , $n_{bg,w_{d,n}}$ and $n_{j,w_{d,n}}$
    end for
  end for
  Update evolutionary parameters $\bm{\mu}_{t}^{(\pi)}$ , $\bm{\mu}_{t}^{(\phi)}$ , $\bm{\mu}_{t}^{(\psi)}$ , $\bm{\mu}_{t}^{(\omega)}$ and $\bm{\mu}_{t}^{(\varphi)}$
end for
Update $\mathcal{S}_{t-1}$ to $\mathcal{S}_{t}$ according to the distributions learned in the current epoch $t$

4. Experiments

In this section, we first describe the datasets, baselines and parameter settings used in our experiments. Then we evaluate the performance of our D ${}^{2}$ SEM compared with the baselines. In addition, we visualize the learned dependencies of related events in the same storyline to verify our assumption.

4.1 Setup

To evaluate the proposed approach, we use the same three datasets as in [23]. The statistics of the three datasets are presented in Table 2.

Table 2
Statistics of the three datasets

Datasets	Documents	Storylines	Dates
I	526,587	N/A	1–30 May 2014
II	101,654	77	1–7 May 2014
III	23,376	30	1–30 May 2014

From Table 2, it can be seen that Dataset I is a large dataset without any manually annotated storylines. Dataset II is one week of data extracted from Dataset I, in which 77 storylines are identified. In [23], storylines are categorized into four types: (1) long-term storylines which last more than 2 weeks; (2) short-term storylines which last less than 1 week; (3) intermittent storylines which last more than 2 weeks and are interrupted in the middle of the period; (4) new storylines which start in the middle of the period rather than at the beginning. Since not all of these types of storylines exist in Dataset II, Dataset III was manually constructed to contain all four types of storylines; altogether 30 storylines are identified.

In our experiments, we used the Stanford Named Entity Recognizer (https://nlp.stanford.edu/software/CRF-NER.html) to identify the named entities. In addition, we removed common stopwords and only kept tokens which are verbs, nouns, or adjectives.

We chose the following five models as the baselines.

• DTM [5], the dynamic topic model based on the Markovian assumption that the topic-word distribution at the current time period is only influenced by the topic-word distribution in the previous time period.
• RCRP [1], a non-parametric model for evolutionary clustering based on the Recurrent Chinese Restaurant Process (RCRP), which assumes that the popularity of the past story is a good prior for the popularity of the current story.
• SDM [24] assumes that the number of storylines is fixed and the storyline is modeled as a joint distribution over entities and keywords. The dependency of events at different time periods but belonging to the same storyline is captured by modifying Dirichlet priors.
• DSEM [23] is integrated with CRPs so that the number of storylines can be determined automatically without human intervention. Moreover, a per-token Metropolis-Hastings sampler based on LightLDA [21] is used to reduce sampling complexity.
• NSEM [22] is an unsupervised storyline extraction model based on a neural network. In this model, the title and main body of a news article are assumed to share a similar storyline distribution. Moreover, similar documents described in neighboring time periods are assumed to share similar storyline distributions. A pairwise ranking algorithm is used to optimize the network.

For DTM, SDM and NSEM, the number of storylines is set to 100 on both Datasets II and III. Although the number of storylines is set to 100 for these models, some of the storyline indices are not assigned any documents (i.e. empty clusters). As such, the number of actually extracted storylines varies across models.

We initialize the storyline-location distribution, storyline-organization distribution, storyline-person distribution and storyline-keyword distribution to $\phi_{0}=1/L$ , $\psi_{0}=1/O$ , $\omega_{0}=1/P$ and $\varphi_{0}=1/W$ , similar to the parameter setting in [14]. The hyperparameter $\beta$ is set to 1. To take into account the impact of historical contextual information, the number of past epochs $M$ is set to 7, which is the same as in SDM and DSEM.

4.2 Evaluation

As there is no direct evaluation metric for storyline extraction, we evaluate the performance in terms of precision, recall and F-measure, which are commonly used in evaluating information extraction systems. The precision is calculated based on the following criteria: 1) the entities and keywords extracted refer to the same storyline; 2) the duration of the storyline is correct. We assume that the start date (or end date) of a storyline is the publication date of the first (or last) related news article.
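For reference, the F-measure can be computed from precision and recall as in the short Python sketch below. It assumes the balanced F-measure (the harmonic mean of the two), since the weighting is not stated explicitly here; the function name `f_measure` is illustrative. The example values are the D ${}^{2}$ SEM precision and recall on Dataset II reported in Table 3.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Both inputs must use the same unit (for example percentages).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# D2SEM on Dataset II: precision 79.13 %, recall 79.22 %.
f_dataset2 = round(f_measure(79.13, 79.22), 2)
```

Under this assumption the result agrees with the F-measure of 79.17 reported for that row.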

For Dataset I, there is no gold standard available, so we check the extraction results manually by searching for the relevant news articles in the same period and comparing the retrieved results with the model extraction results based on the criteria listed above. Specifically, we restrict the start and end dates of the search engine to make them consistent with the dataset. We then query the search engine with the extracted event elements, i.e. entities and keywords. We keep the top 20 results and discard those which are not related to events. Based on the retrieved results, three annotators individually check the results and choose the most related event. After consensus is reached, the result is taken as the ground truth. As duplicate storylines might be extracted, we ignore the duplicates during the examination process.

4.3 Experimental results

Table 3
Performance comparison of the storyline extraction results on Datasets I, II and III

Dataset I
Method	Precision (%)	# of extracted storylines
SDM	70.20	104
DSEM	75.43	114
NSEM	76.58	121
D ${}^{2}$ SEM	79.14	139
Dataset II
Method	Precision (%)	Recall (%)	F-measure (%)
DTM	62.67	61.03	61.84
RCRP	67.11	66.23	66.67
SDM	70.67	68.80	69.27
DSEM	73.17	77.92	75.47
NSEM	75.31	79.22	77.22
D ${}^{2}$ SEM	79.13	79.22	79.17
Dataset III
Method	Precision (%)	Recall (%)	F-measure (%)
DTM	46.16	43.33	42.86
RCRP	61.54	53.33	57.14
SDM	54.17	43.33	48.15
DSEM	75.00	70.00	72.41
NSEM	77.78	70.00	73.69
D ${}^{2}$ SEM	78.57	76.67	77.61

The experimental results of the proposed approach in comparison to the baselines on Datasets I, II and III are presented in Table 3. For Dataset I, since no ground truth is available, we only report precision values obtained by manually examining the extracted storylines. Because Dataset I contains more than 500 thousand documents, we only report the precision of SDM, DSEM, NSEM and our D ${}^{2}$ SEM on it, as it is difficult for topic-model-based approaches such as DTM and RCRP to complete storyline extraction within a few days. We do, however, report the performance of DTM and RCRP on Dataset II, a subset of Dataset I, where they perform worse than the other approaches. DTM and RCRP are therefore not compared with our model on Dataset I.

It can be observed from Table 3 that D ${}^{2}$ SEM extracts more storylines and achieves higher precision than the other three baselines on Dataset I. On Dataset II, D ${}^{2}$ SEM has the same recall as NSEM but a higher precision, i.e. a larger fraction of the storylines it extracts are correct, which may be attributed to the dynamic dependency. On Datasets II and III, D ${}^{2}$ SEM outperforms the second-best model, NSEM, by 1.95 and 3.92 percentage points in F-measure respectively.

Compared with DSEM, which models the dependency with a fixed decay function, our D ${}^{2}$ SEM outperforms it by a remarkable margin. This further shows that adaptive dynamic dependencies are more suitable for real-world applications.

Overall, D ${}^{2}$ SEM performs better than all the other baselines on all three metrics.

4.4 Performance with different $M$

Figure 2. The performance of D ${}^{2}$ SEM with different dependency lengths.

Intuitively, D ${}^{2}$ SEM should perform better with longer dependency lengths than with shorter ones. To explore the impact of $M$ on the evaluation results, we conduct experiments on Dataset III by varying $M$ from 1 to 7.

Figure 2 shows the precision, recall and F-measure results for different dependency lengths $M$. We observe a generally upward trend: the performance of the model improves with longer dependency lengths. However, the performance saturates when the dependency length is greater than 6, which indicates that the impact of historical contextual information from more than 6 days back is negligible.

4.5 Time complexity analysis

Figure 3. The time complexity of D ${}^{2}$ SEM with different dependency lengths.

Figure 4. The time complexity of D ${}^{2}$ SEM in different sizes of documents compared with NSEM and DSEM.

Since longer dependency lengths could result in higher computational complexity, we conduct an experiment on Dataset III by varying the dependency length from 1 to 7 in epoch 7, which contains 4193 documents.

We train our model on a PC equipped with an Intel Core i7 3770 CPU (3.40 GHz) and 16 GB of DDR3 RAM. Figure 3 illustrates the time consumed in each training iteration: the ‘inference’ curve shows the time spent in the update procedure of the precision coefficients, the ‘sample’ curve shows the time consumed in Gibbs sampling, and the third curve shows the total time of these two procedures. It can be observed that the run time of our model increases with the dependency length, and that the run time of the update procedure (the ‘inference’ curve) grows faster than that of ‘sample’.

Since further increasing the dependency length incurs a higher computational cost with little impact on the storyline extraction results, it is reasonable to limit $M$ to less than 7 days.

To explore the efficiency of the proposed D ${}^{2}$ SEM, we also conduct an experiment comparing it with DSEM and NSEM. We train each model on training data ranging from 1,000 to 10,000 documents, and set the dependency length to 7 for both DSEM and D ${}^{2}$ SEM. Figure 4 illustrates the logarithm of the time consumed per iteration for different sizes of training data. It can be observed that NSEM remains the fastest approach, and that the run time of D ${}^{2}$ SEM is lower than that of DSEM.

4.6 Dependency visualization

To visualize the impact of the dependency on historical context on the storyline extraction results, we conduct experiments on Dataset III with $M=7$ and randomly choose three storylines which last throughout the 7-day period. We also retrieve the values of $\{\mu^{(\pi)}_{t,j,m}\}_{m=1}^{M}$ of these storylines in the last epoch.

Figure 5 shows the results: subfigure (a) shows the storyline about ‘The MERS in Saudi Arabia’, subfigure (b) the storyline of ‘Arrest of Gerry Adams over Jean McConville murder’ and subfigure (c) the storyline of ‘The Kentucky Derby in Louisville’. In each subfigure, we also show three structured representations extracted by our model to illustrate what happened in the past.

Figure 5. Dependencies visualization of three storylines.

For comparison, we also conduct experiments using the exponential decay function $f(m)=\exp(-\lambda\cdot m)$ with $\lambda=0.5$, as used in DSEM, to set the weights of the historical contextual information in the past epochs. To investigate the proportions that previous epochs contribute to the current epoch, we normalize the values of $\{\mu^{(\pi)}_{t,j,m}\}_{m=1}^{M}$ for D ${}^{2}$ SEM so that they sum to 1. It can be observed that, since the exponential decay function assigns fixed decaying weights to the past epochs, with the most recent epoch having the largest weight, its weight pattern is the same for all three example storylines. With the automatically updated weights used in D ${}^{2}$ SEM, however, the impact of the historical context varies across storylines. For the first storyline, the two most recent epochs are of approximately equal importance for the events in the current epoch. For the second storyline, only the most recent epoch significantly influences the generation of the events in the current epoch. For the third storyline, although the most recent epoch has the greatest influence, the historical epochs from 2 to 4 have approximately equal effects. The three storylines show that D ${}^{2}$ SEM can learn the influence of historical epochs adaptively.
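To make the comparison concrete, the sketch below computes the fixed weights that an exponential decay $\exp(-\lambda m)$ assigns to the past $M$ epochs and rescales a set of learned weights so that they sum to 1. It is only an illustration: the function names are hypothetical, the value 0.5 quoted above is read as the decay rate $\lambda$, the decay weights are also normalized here (an assumption made so that both weight vectors can be compared as proportions), and the `learned` values are made-up placeholders rather than weights produced by D ${}^{2}$ SEM.

```python
import math


def exponential_decay_weights(num_epochs, decay_rate=0.5):
    """Normalized weights exp(-decay_rate * m) for past epochs m = 1..num_epochs."""
    raw = [math.exp(-decay_rate * m) for m in range(1, num_epochs + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


# Fixed pattern: identical for every storyline, strictly decreasing in m.
fixed = exponential_decay_weights(num_epochs=7)

# Made-up placeholder values standing in for the learned dependency
# weights of a single storyline (not actual model output).
learned = normalize([4.0, 4.0, 1.0, 0.5, 0.25, 0.15, 0.1])

for m, (w_fixed, w_learned) in enumerate(zip(fixed, learned), start=1):
    print(f"m={m}: fixed={w_fixed:.3f} learned={w_learned:.3f}")
```

The fixed weights always decrease with $m$ at the same rate, whereas the normalized learned weights can take any shape, such as the two equal leading weights in this placeholder.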

5. Conclusions and future work

In this paper, we have proposed an unsupervised Bayesian model, called the Dynamic Dependency Storyline Extraction Model (D ${}^{2}$ SEM), to extract storylines from temporally ordered news articles. To model the dynamic dependency on historical contextual information for storyline generation, we use the storyline distributions inferred in previous epochs as priors for the inference of the storyline distribution in the current epoch. Moreover, the dependency weights are automatically learned from data, which enables the modelling of dynamic dependencies for different storylines. Experimental results show that our proposed model outperforms the state-of-the-art approaches, which illustrates the advantage of using adaptive dynamic dependencies on real data. In future work, we will explore how to adjust the dependency length dynamically according to the data in order to reduce the computational cost.

Acknowledgments

We would like to thank the reviewers for their valuable comments and helpful suggestions. This work was funded by the National Key Research and Development Program of China (2016YFC1306704), the National Natural Science Foundation of China (61772132) and the Natural Science Foundation of Jiangsu Province of China (BK20161430).

References

1. Ahmed, Eisenstein, Xing, Smola A.J. and Teo C.H., Unified analysis of streaming news, in: Proceedings of the 20th International Conference on World Wide Web, ACM, 2011, pp. 267–276.

2. Ahmed and Xing, Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: with applications to evolutionary clustering, in: Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, 2008, pp. 219–230.

3. Binh Tran, Alrifai and Quoc Nguyen, Predicting relevant news events for timeline summaries, in: Proceedings of the 22nd International Conference on World Wide Web, ACM, 2013, pp. 91–92.

4. Blei D.M., Griffiths T.L. and Jordan M.I., The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies, Advances in Neural Information Processing Systems 16 (2004), 17–24.

5. Blei D.M. and Lafferty J.D., Dynamic topic models, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 113–120.

6. Blei D.M., Ng A.Y. and Jordan M.I., Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

7. Chakrabarti, Kumar and Tomkins, Evolutionary clustering, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 554–560.

8. Diao and Jiang, Recurrent Chinese restaurant process with a duration-based discount for event identification from Twitter, in: Proceedings of the 2014 SIAM International Conference on Data Mining, SIAM, 2014, pp. 388–397.

9. Griffiths T.L. and Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences 101 (2004), 5228–5235.

10. Huang and Huang, Optimized event storyline generation based on mixture-event-aspect model, in: EMNLP, 2013, pp. 726–735.

11. Huang, Zhu and Heng P.-A., The dynamic Chinese restaurant process via birth and death processes, in: AAAI, 2015, pp. 2687–2693.

12. Iwata, Watanabe, Yamada and Ueda, Topic tracking model for analyzing consumer purchase behavior, in: IJCAI, Vol. 9, 2009, pp. 1427–1432.

13. Li and Cardie, Timeline generation: Tracking individuals on Twitter, in: Proceedings of the 23rd International Conference on World Wide Web, ACM, 2014, pp. 643–652.

14. Liang, Yilmaz and Kanoulas, Dynamic clustering of streaming short documents, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 995–1004.

15. Lin, Wang, Chen and Li, Generating event storylines from microblogs, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ACM, 2012, pp. 175–184.

16. Liu J.S., The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem, Publications of the American Statistical Association 89 (1994), 958–966.

17. Minka, Estimating a Dirichlet distribution, 2000.

18. Radinsky and Horvitz, Mining the web to predict future events, in: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, ACM, 2013, pp. 255–264.

19. Yan, Kong, Huang, Wan and Zhang, Timeline generation through evolutionary trans-temporal summarization, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 433–443.

20. Zhao, Zhang and Wu, Tracking news article evolution by dense subgraph learning, Neurocomputing 168 (2015), 1076–1084.

21. Yuan, Gao, Dai, Wei, Zheng, Xing E.P., Liu T.-Y. and Ma W.-Y., LightLDA: Big topic models on modest computer clusters, in: Proceedings of the 24th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2015, pp. 1351–1361.

22. Zhou, Guo and He, Neural storyline extraction model for storyline generation from news articles, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Long Papers), Vol. 1, 2018, pp. 1727–1736.

23. Zhou, Dai X.-Y. and He, Unsupervised storyline extraction from news articles, in: IJCAI, 2016, pp. 3014–3021.

24. Zhou and He, An unsupervised Bayesian modelling approach for storyline detection on news articles, in: EMNLP, 2015, pp. 1943–1948.
