Personalized news recommendation using graph-based approach

Abstract

Over the past decade, there has been a proliferation of online news articles. News articles can contain rich content and contextual information pertaining to groups in societies, such as senior citizens, child rights groups, religious minorities, or environmentalist groups. In addition, news articles contain different object types such as people, organizations, statistical (numerical) information, countries, authors, or events. Thus, it is possible to create a complex heterogeneous graph containing multi-type objects (vertices) and multi-type linkages (edges) among the objects, such as common keywords found between two news articles. We call such a graph a Heterogeneous News Graph (HNG). Currently, it is possible to extract rich information and knowledge from an HNG. It is our belief that one could use an HNG to resolve the bias and visibility issues found in many news sources, as well as capture important news articles. First, due to the number of news feeds currently available in this digital age, readers want a filtered view of relevant news articles, allowing them to focus on important (breaking) news that contains rich contextual information for their particular societal group. For example, senior citizen groups might want to know about new safety measures taken by police for elderly people. Second, visibility is another problem in the world of journalism, where there are multiple objects in the news articles, such as authors and organizations. For example, readers might need to know who the relevant authors, or experts, are for particular topics, such as Libya, Afghanistan, or climate change. To address the issues of determining importance and visibility of objects, we propose novel graph-based approaches using HNGs that will (1) rank the expertise of an article’s author on a specific topic, and (2) identify articles of particular interest and value.
In summary, we propose a novel graph-based approach for determining context and content in news articles so that more personalized recommendations can be realized.

Keywords

News mining, graph mining, recommendation system

1. Introduction

The ubiquitous nature of the World Wide Web has resulted in an increase in the volume and variety of available unstructured, heterogeneous data. In particular, online news sources provide a wide range of information on societal issues, events related to crimes, terrorist attacks, and security issues, with respect to both the general population and cultures. To date, research has used news articles to solve problems such as predicting future crimes, analyzing patterns, and predicting the locations of offenders. In addition, some sources and systems are connected to one another, such as social networks, GPS maps, and computer networks. Because of the ubiquitous nature of online news networks, they can potentially form a rich and complex heterogeneous network. A network is potentially heterogeneous based on two criteria. First, it is possible that the information is derived from multiple news sources. For example, if the research or system involves two news sources, such as The New York Times and the Washington Post, then it is called a heterogeneous network. Second, heterogeneous networks can also be viewed as logical networks, consisting of multiple object types and multiple linkages connecting those objects. In this way, each article can be seen in a more abstract form as a collection of different types of objects. In this case, each news article mentions different types of objects, such as organization names, person names, locations, dates, numbers, images, keywords (i.e., tags), and image captions. For example, an object type person names mentioned in articles could be the names of statesmen, names of experts who provide opinions on certain societal issues, offenders’ names, or defenders’ names. Similarly, an object type organization names mentioned in articles could be the White House, World Health Organization (WHO), or United Nations (UN).

To ground this in a real-world scenario, let’s take the example of two news articles. The words mentioned in the object type caption of an image found in the first news article can interact with the object type keywords in the second news article. Let’s assume that the first news article has an image with the caption “Nintendo and Universal Studios Japan released concept art for Super Nintendo World. Nintendo is also working with Universal Orlando for a Nintendo component”.¹

¹ http://www.ibtimes.co.in/super-nintendo-world-open-universal-studios-tokyo-ahead-olympics-tribute-nintendo-games-707998.

In addition, let’s assume that the object type keywords contains the word Japan in the second news article. Thus, there is a link between the location object type “Japan” that appears in the caption of an image in the first news article and a keyword from the second news article. Simultaneously, the second news article might contain an object type organization named “Japan International School” in the content of the news. Between the location object type named “Japan” from the first article and the organization object type named “Japan International School”, there is one common word or token, “Japan”, and hence another link can be formed. In short, there are two types of linkages from the first news article to the second news article. This shows that both linkages and objects can be of multiple types. Similarly, other object types, such as person names, dates, and keywords, from one news article can interact with other news articles, forming a complex heterogeneous network. In other words, links in heterogeneous networks indicate interactions among various object types in the network.
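The two-article example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the `find_links` helper, and the lower-cased whitespace tokenization are hypothetical choices for this sketch, not the extraction pipeline used in this work.

```python
def tokens(value):
    """Lower-cased whitespace tokens of an object value (illustrative tokenizer)."""
    return {token.lower() for token in value.split()}


def find_links(article_a, article_b):
    """Return one link per pair of objects that share at least one token.

    Each article is a dict mapping an object type to a list of values.
    A link records both object types, so linkages can be of multiple types.
    """
    links = []
    for type_a, values_a in article_a.items():
        for value_a in values_a:
            for type_b, values_b in article_b.items():
                for value_b in values_b:
                    shared = tokens(value_a) & tokens(value_b)
                    if shared:
                        links.append((type_a, value_a, type_b, value_b, sorted(shared)))
    return links


# Objects taken from the running example in the text.
first_article = {"location": ["Japan"]}  # from the image caption
second_article = {
    "keyword": ["Japan"],
    "organization": ["Japan International School"],
}

article_links = find_links(first_article, second_article)
for link in article_links:
    print(link)
```

The sketch yields two links of different types (location to keyword, and location to organization), both carried by the shared token "japan".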

In this work, we consider news articles as heterogeneous networks possessing multiple object types and linkages, where the graph input files are created chronologically by the publication date of each news article. We propose to mine these types of heterogeneous networks for context in the scope of personalization for different communities, where each community has a particular set of interests.

1.1 What is personalized context mining of news streams?

Personalization in this work is based on two notions: (1) physical outlook, and (2) interests. Physical outlook includes but is not limited to gender, race, age, and location. Interests involve a range of topics that a typical citizen and/or interest group would be interested in, such as climate change, crime, corruption, or human trafficking.

In this work, context refers to important information that could potentially catch the attention of policy makers and/or citizens. There could be specific contextual information related to topics such as corruption, human trafficking, or road accidents. Citizens are looking for specific context related to the issues, such as whether any new measures taken by municipal corporations will reduce road accidents, whether any new law or regulation passed by the federal or state government involves human trafficking, or how such laws are going to be implemented.

1.2 Why is personalized context mining important?

Users can have varying levels of interest depending on their job or personal interests. The following are motivational examples from two individuals’ perspectives:

Jacob Lawrence works as a policy advisor for a think-tank which serves the United States government. Mr. Lawrence works as a prominent advisor on Middle Eastern policy. In his previous experience, he worked for seven years as a journalist in Iraq. Mr. Lawrence is interested in specific contextual information and knowledge that can be found in the news, such as:

• Who are the experts for research and collaboration on specific Middle Eastern topics such as Syria, Iran, Afghanistan, or terrorism?
• Articles mentioning security situations of peer journalists in Middle Eastern countries such as Iran, Iraq, or Syria.
• Articles that mention attacks on Western culture and its citizens (e.g., teachers from Western countries attacked in the Arabian peninsula countries of Saudi Arabia or Qatar).

Fatima Bhutto is a citizen of Bangladesh who works as an activist on the issue of human trafficking. Some children from Bangladesh are trafficked to South Asian countries, such as India and Nepal, in order to be eventually sold for bonded labor or prostitution in rich Middle Eastern countries, such as the United Arab Emirates (e.g., Dubai). Hence, Ms. Bhutto travels extensively to South Asian and Middle Eastern countries to bring trafficked children back to their home country. Ms. Bhutto is interested only in those news articles that mention specific context and knowledge on human trafficking, such as the following:

• Articles having discussions about security issues for foreign travelers and precautions to be taken for one or more countries in either South Asia or the Middle East.
• Articles containing contextual information on specific social issues such as child labor and human trafficking. Some examples are measures/programs introduced by governments, recent statistics, prominent entities involved such as NGOs and politicians, or interesting patterns and trends pertaining to trafficking in one or more countries in South Asia or the Middle East.
• Articles from those who are experts in human trafficking. Such articles might compare different measures taken among different nations in South Asia or the Middle East to counter the human trafficking situation.

In summary, exploring interesting contextual information from news streams for different societal issues could better serve different communities such as policy-makers, travelers, or citizens. In other words, we believe that context mining should be based upon the perspectives of relevant persons and societal communities.

In this research, we will convert textual content from news articles into heterogeneous networks containing various object types using Natural Language Processing (NLP) techniques, and explore graph-based mining classification and ranking approaches. This work involves addressing two research problems:

• Which are the important news articles to follow on a specific societal issue? Some of the challenges involve the structure of important facts and information, and capturing different entities within the context.
• Who are the experts in news articles? Currently, there is no visibility into who the expert journalists on a specific topic are. Thus, what needs to be studied is how to rank journalists in terms of their relevance. Some of the challenges involve incomplete information and overlapping topics within news articles.

In addressing these two problems, our hope is to provide the right contextual information and knowledge for citizens and policy analysts. First, users will be able to filter the appropriate articles that convey important contextual information. Second, users will know who the experts are on a specific topic.

2. Related work

The change detection approach presented in this work is most closely related to the research that is being done on novelty detection, particularly in a temporal setting. Gaughan and Smeaton [5] study novelty detection using the TREC data set. The NIST TREC² data is from 2002–2005, and includes tracks that are divided into event and opinion topics. For example, they use the 2004 data, which uses 3 news feeds from Xinhua, the New York Times and the Associated Press. In their study of novelty detection, they employ a Term Frequency-Inverse Document Frequency (TF-IDF) variant. The authors use the F1-score for evaluation, and achieve an F-score of 0.622 and 0.807 on the 2004 and 2003 data respectively. Li and Croft [13, 14] also study the novelty detection problem using TREC. First, their algorithm converts the query into a query and its expected related answer type. The basic idea is that if there is a combination of query words, named entities, etc., available in a sentence, this increases the possibility of an answer. The approach uses a concept called “answer patterns”. Answer patterns are a list of answer candidates, each with a specific pattern prepared for each question using a belief or heuristic.

² www.trec.nist.gov.

Schiffman and McKeown [22] leverage contextual information for the novelty detection problem. The authors use the context of the sentence along with novel words and named entities. The algorithm tries to find the optimal value for 11 parameters, weights, and thresholds. The algorithm uses a random hill-climbing algorithm with backtracking for learning weights. The algorithm achieves a recall of 0.86 on the average of all runs, in comparison to cosine similarity with 0.81. Karkali et al. [11] study the problem of online novelty detection on news streams. Their work uses two data sets, one from the Google News RSS feed and another from Twitter. Novelty is defined in terms of a predefined window on the past. The proposed algorithm is based on the TF-IDF, and is evaluated using a linearly combined single detection cost [16].

The work presented in this paper differs from previous efforts, as our proposed method is a graph-based approach. Our graph-based approach has the advantage of leveraging (1) structure, (2) content information, (3) context information (using an aspect-level and fine-grained approach of individually extracting and using dates, numbers, and organizations), and (4) sentiments in graphs. Also, we currently focus on discovering news articles relevant to policy makers – a problem for those who are tasked with generating relevant societal policies.

In addition to the task of recognizing change in news articles, there is also the issue of ranking the importance of authors. There are two primary general techniques used for ranking authors: topic modeling and PageRank. Tang et al. [27] study the problem of finding a consensus topic in multiple contexts. Using Twitter data, the authors create a multi-view based on the hashtags and the dates and times of tweets. Each view creates a set of documents called pseudo documents, for which the authors run a topic model for each individual view, finding a consensus topic by majority voting. Balog et al. [2] implement a generative probabilistic language model with two search strategies, leveraging two different types of evidence. Additionally, they extend the model to include co-occurrence information, using the Text REtrieval Conference (TREC) 2005 expert finding data sets.

PageRank [19] performs two basic functions on a network. First, there is a random walk by the surfer. Second, there is a propagation of influence in a network. In terms of using PageRank, little research has been done for the problem of co-ranking. Zhou et al. [31] propose a random walk based algorithm for ranking authors and documents together. The authors propose a novel coupling random walk, which separately ranks authors and documents. The coupling random walk has two networks for random walks: one is a social network of authors, and the other is a citation network of documents. The coupling random walk helps to mutually reinforce the influences between authors and documents, thus leveraging additional information from the network. A similar algorithm is also implemented by Wu et al. [28] for image and tag co-ranking.
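To make the random-walk view of PageRank concrete, the following is a minimal power-iteration sketch in Python. The toy graph, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from dangling nodes are illustrative assumptions; they are not taken from [19] or from the co-ranking methods discussed here.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration over a dict of node -> list of targets."""
    nodes = sorted(set(out_links) | {v for targets in out_links.values() for v in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Random jump: the surfer restarts at a uniformly chosen node.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = out_links.get(source, [])
            if targets:
                # Influence propagates evenly along outgoing links.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[source] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-node graph: "c" receives links from both "a" and "b".
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy_graph)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

In this toy graph the node with the most incoming influence ("c") ends up ranked highest, which is the behavior that co-ranking approaches extend to multiple coupled networks.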

CoRanking [30] implements a random walk with restart to simultaneously rank authors, venues, and documents. The authors use the concept of a population propagation factor (PPF) that is added to each link pointing to an object, where different types of objects such as authors, papers, and venues have different propagation factors. In other words, each object type will have a probability parameter for the random surfer. Evaluation is done by using a Discounted Cumulative Gain (DCG) score, where their CoRanking approach is better than PageRank for ranking authors on the topic of Opinion Mining. Additionally, the authors study using user generated content to improve ranking results.
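For readers unfamiliar with the evaluation measure mentioned above, a short sketch of Discounted Cumulative Gain follows. It uses one common formulation (relevance divided by the base-2 logarithm of the rank position plus one); the exact variant used in [30] may differ, and the relevance grades below are hypothetical.

```python
import math


def dcg(relevances):
    """DCG of a ranked list of graded relevance scores (best-ranked item first)."""
    return sum(
        relevance / math.log2(position + 1)
        for position, relevance in enumerate(relevances, start=1)
    )


# Hypothetical relevance grades for two rankings of the same three authors.
good_ranking = [3, 2, 1]
poor_ranking = [1, 2, 3]
print(dcg(good_ranking), dcg(poor_ranking))
```

Because lower positions are discounted, a ranking that places the most relevant authors first receives the higher score.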

In summary, previous research has focused on ranking authors in networks from the same domain with the same underlying authors. For example, different bibliographic databases are used for ranking the same authors. The primary difference in our co-ranking approach (Section 4) is that we are able to rank two different sets of authors in heterogeneous networks.

3. Detecting change in news feeds using a context based graph

3.1 Overview

The proliferation of news channels on the Web has introduced a wide range of diverse data. These news articles actively report on stories involving crimes, terrorist attacks, and security issues relevant to the general population. Communities at risk deal with issues such as children forced into labor, senior citizens traveling in high-risk urban crime zones, and senior citizens not having access to public restrooms. Dark heterogeneous data sources, such as news feeds, HTML files, PDF files, and tables, provide various statistical information and expert opinion analysis on these issues. For example, an article on child workers reported that “Half of the 5.5 million working children in India are concentrated in five states: Bihar, Uttar Pradesh, Rajasthan, Madhya Pradesh and Maharashtra” (Source: timesofindia.com, 13 Jun 2015). These types of web data provide a rich and complex set of information and knowledge on societal issues, such as policies proposed by the government or the implementation of a new law, that need to be extracted in a meaningful way for knowledge representation.

News mining has been studied in a variety of contexts. Earlier work involved grouping related news items, the fusion of news articles, and the summarization of information from disparate sources. However, news mining within a specific context could be more useful for a targeted group of users based upon their interests – what is commonly referred to as personalized context mining. Specific targeted groups could be based on age, location, gender, physical attributes, or a variety of aspects based upon one’s own interests. For instance, one could be interested in community challenges, such as corruption, or perhaps safety in public places, such as on a beach or road, or at a railway station. In this work, we particularly examine news articles that catch the attention of policy makers or special interest groups on societal issues such as child labor and human trafficking.

First, we define the various types of changes that our approach attempts to discover (Section 3.2). Section 3.3 presents our definition of what constitutes a news article, or generally, a document. Section 3.4 describes our process of data collection and preparation. Section 3.5 presents our proposed graph topology and the tools used to extract the information, followed by Section 3.6, which discusses the ground truth and evaluation methods for our experiments. Section 3.7 presents our proposed method. Section 3.8 presents the experimental setup for our proposed method and baseline comparisons, followed by experimental results in Section 3.9. We then conclude and discuss future work.

3.2 Change detected

First, we need to define what we mean by a document where there is “change detected”.

Definition of change detected. A document (article) might contain one or more sentences that catch the attention of policy makers and/or groups interested in a particular social issue. In other words, the document includes one or more sets of information particularly useful to policy-makers and interest groups. In order for a document to be useful, the information should mention a solution or intervention, an account of large resource damages, and/or contextual information explaining the issue. We will mark articles containing such information as change detected articles. In other words, the article has “changed” from being just a typical news story.

Three types of change detected. There are three types of change detected articles we are interested in from the perspective of interest groups:

• Solution-based Change Detected (SBCD): If an article contains a sentence that mentions an intervention or solution, then the article is marked as change detected. Here is an example of precautions against landslips:
  – “Come rain, Yercaud Ghat Road will be witnessing recurrence of minor landslips. To avoid this and to ensure smooth movement of traffic on Salem-Yercaud Ghat Road, the district administration through the state highways has started carrying out constructing retention walls to prevent mud and rocks slip in rain. Big stone blocks are being used to erect walls.” (Source: thehindu.com)
• Context-based Change Detected (CBCD): If an article is rich in context, such as getting attention from the public and policy makers, then the article is marked as change detected. For example, only a few Ebola outbreak warning articles mention expert opinions on possible reasons behind the event, and when they do, the information provided is very detailed. In other words, rich context articles provide asymmetric information content on the corresponding topic. For example:
  – “The death rate in the Ebola outbreak has risen to 70 per cent and there could be up to 10,000 new cases a week in two months, the World Health Organization warned Tuesday.” (Source: ndtv.com)
• Resource-damage-based Change Detected (RBCD): If an article contains a number of resource losses, such as through injuries, deaths, or revenue, that are higher than expected, then the article is marked as change detected. For example, buildings that collapse due to a conflict or an earthquake usually result in a significant amount of property and lives lost:
  – “More than 20 persons were killed in road accidents in the district including the two major ones that took place on Dindigul-Palani Road in November.” (Source: thehindu.com)

3.3 Document definition

A news article can be represented as a document. Each document (article) is denoted by $d_{i}\in D$, where $D$ represents a collection of documents. Each document contains three features: url, body, and date. Specifically, $d_{i}=\{\textit{url},\textit{body},\textit{date}\}$, where url contains a web link to an article, body refers to the textual content of the article, and date contains the date and time of the article. Table 1 shows an example of features in a sample document from our RSS feed.

Table 1
Features from a sample document (article)

Name	Value
Url	http://www.thehindu.com/todays-paper/tp-national/tp-tamilnadu/Road-safety-likely-in-school-curriculum/article15274337.ece
Body (partial)	CHENNAI: The State government is considering inclusion of road safety as a chapter in the school curriculum, according to Joint Transport Commissioner D. Narayanamoorthy.Road safety rules such as compulsory helmet use for two-wheeler riders need to be taught in schools and the government may soon include it in school books, he told a function here on Wednesday.When contacted, a senior official of the School Education Department said that some classes already had a portion on road safety and safe transportation. However, a decision regarding integrating a module for all the children would have to be taken by the syllabus committee during the next revision of syllabus, he added. Educationists welcome this proposal and feel it will help increase awareness levels.President of Principals’ Association of Matriculation Schools in Tamil Nadu N. Vijayan says: “It is certainly a very good move. When children are made to understand the importance of road safety, they will ensure that their parents follow safety measures.” (continues.)
Date	07-Aug-2008 16:29
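The document definition and the sample in Table 1 can be mirrored by a small record type. The sketch below is an assumption-laden illustration: the `Document` class, the `parse_feed_date` helper, and the date format string are hypothetical names and choices inferred from the single sample row, not the data model used in this work.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Document:
    """One news article d_i = {url, body, date}."""
    url: str
    body: str
    date: datetime


def parse_feed_date(raw):
    """Parse a date such as '07-Aug-2008 16:29'.

    Note: %b expects English month abbreviations, which holds under the
    default C/English locale.
    """
    return datetime.strptime(raw, "%d-%b-%Y %H:%M")


sample = Document(
    url="http://www.thehindu.com/todays-paper/tp-national/tp-tamilnadu/"
        "Road-safety-likely-in-school-curriculum/article15274337.ece",
    body="CHENNAI: The State government is considering inclusion of road "
         "safety as a chapter in the school curriculum",  # partial body
    date=parse_feed_date("07-Aug-2008 16:29"),
)

# Documents can then be ordered chronologically by their date feature.
collection = sorted([sample], key=lambda document: document.date)
print(collection[0].date.isoformat())
```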

Table 2

Data statistics from our data set

Newspaper	Archive years	Total news articles	Articles mentioning 12 societal issues
The Hindu	2009–2015 & 2000–2005	36715	358
Times of India	2009–2015	1726674	7527
NDTV	2009–2015	189976	2118

The goal is that for each document $d_{i}\in D$, our algorithm will mark the document as either change detected or not.

3.4 Data collection and preparation

The following is how we collected and prepared the data.

3.4.1 Data collection

First, we crawled the index pages of the yearly archives of three Indian newspapers – The Hindu,³ Times of India,⁴ and NDTV⁵ – extracting the list of URLs from the archives. The URLs also contain the title of the article, as shown in Table 1. We then used grep on each URL to filter news articles based on 12 different societal issues using the following keywords: fire, traffic, kidnap, senior citizen, juvenile, mining, ebola, swine, migrant, slavery, collapse, and road accident. It should be noted that while this particular list of keywords was chosen somewhat arbitrarily, it was based upon feedback from experts in India who deemed these particular issues of the most importance. We did not use any lexical expansion on our keyword search filtering, but we did filter out irrelevant articles (e.g., mentions of an employee being fired instead of fire accidents), which reduced our experimental data set down to 8,433 news articles. Data statistics are shown in Table 2.

³ www.thehindu.com.
⁴ www.timesofindia.com.
⁵ www.ndtv.com.
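The URL filtering step can be approximated in Python as follows. This is a sketch rather than the actual grep invocation, which is not given in the text: the separator handling for multi-word keywords (hyphen, underscore, or space) and the example URLs are assumptions made for illustration.

```python
import re

# The 12 keywords listed in the text.
KEYWORDS = [
    "fire", "traffic", "kidnap", "senior citizen", "juvenile", "mining",
    "ebola", "swine", "migrant", "slavery", "collapse", "road accident",
]


def keyword_pattern(keyword):
    """Case-insensitive substring pattern; words may be joined by '-', '_' or ' '."""
    parts = [re.escape(part) for part in keyword.split()]
    return re.compile("[-_ ]".join(parts), re.IGNORECASE)


PATTERNS = {keyword: keyword_pattern(keyword) for keyword in KEYWORDS}


def matched_issues(url):
    """Keywords whose pattern occurs anywhere in the URL (grep-like matching)."""
    return [keyword for keyword, pattern in PATTERNS.items() if pattern.search(url)]


# Hypothetical URLs, shaped like the slug-style URL in Table 1.
urls = [
    "http://example.com/news/Two-killed-in-road-accident/a1.ece",
    "http://example.com/news/Employee-fired-over-leak/a2.ece",
    "http://example.com/news/Budget-session-begins/a3.ece",
]
selected = [url for url in urls if matched_issues(url)]
print(selected)
```

Because the match is a plain substring test without lexical analysis, the second URL is selected by the keyword "fire" even though it concerns an employee being fired, which is exactly the kind of noise that is filtered out afterwards.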

3.4.2 Data preparation

Second, we analyzed the news articles related to the 12 societal issues for possible changes. Articles that report uncommon incidents within the context of social issues and policy are marked for change detection. Each article was read by human annotators and searched for one or more of the change detected types defined previously: solution-based, context-based, and/or resource-based. If the article provides a solution or intervention to a social problem, such as the discovery of the Ebola virus, it is marked as change detected because of the solution-based impact. If an article mentions experts’ opinions, such as an expert opinion on the Ebola outbreak, that is considered a context-based change. If the article mentions huge resource losses, such as mass human casualties due to Ebola, it would be considered a resource-based change, and would be marked as change detected.

We also discovered a few out of context (noise) articles. For example, an article mentioning a “cease fire” is inappropriate for a fire accident, and thus is removed from consideration. In addition, a few articles might contain information appropriate for one or more of the different change detected types. For example, the article at [1] is marked for change detection, based on both resource-based and solution-based impact because 30 people died in the stampede, making it eligible for being considered as a resource-based change, and a process was initiated to correct the root cause of the stampede, thereby also providing a solution.

3.5 Graph topology

The input for our approach is a graph. An example of our proposed graph topology of news articles is shown in Fig. 1. In order to create this graph, we used openNLP⁶ to extract the information with part-of-speech tagging. For example, if a given sentence is “Cops arrested the murderer in the park”, openNLP tags it as “Cops/NNS arrested/VBN the/DT murderer/NN in/IN the/DT park/NN”. From this tagged sentence, we extract verbs, numbers, etc., as shown in Fig. 1. We also used the Stanford NLP tool⁷ to extract organizations as nodes, as shown in Fig. 1, where dashed and dotted lines represent types, such as “organization” and “verb”, and nodes with solid lines represent actual values from the articles. For each article, we created an “article” node under a “mainNode” node (used solely for ease of retrieval). Under an “article” node, we created hard-coded number, verb, and organization nodes. For “verb”, we further created hard-coded (dotted) nodes named $S1,S2,\ldots,Sn$, where $S1$ represents the first sentence, $S2$ represents the second sentence, and so on. In other words, we create a tree-graph for each article. The extracted verbs, numbers, and organization names are attached as child nodes under their corresponding articles. The child nodes from one article can then be linked with child nodes of other articles. In addition, common action verbs are used in our proposed Graph-Cut algorithm, which turns the tree shown in Fig. 1 into a graph (see Section 3.7 for more details).

⁶ www.opennlp.apache.org.
⁷ www.nlp.stanford.edu/.
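The per-article tree described above can be sketched as a list of parent-child edges. The node naming scheme, the `build_article_tree` and `shared_verbs` helpers, and the second article's verbs are hypothetical; the sketch only mirrors the layout of Fig. 1 as described in the text and does not call openNLP or the Stanford tools.

```python
def build_article_tree(article_id, verbs_by_sentence, numbers, organizations):
    """Return the (parent, child) edges of the tree for one article."""
    article = f"article{article_id}"
    edges = [("mainNode", article)]
    # Hard-coded type nodes under each article node.
    for type_name in ("number", "verb", "organization"):
        edges.append((article, f"{article}/{type_name}"))
    # Extracted values hang under their type node.
    for value in numbers:
        edges.append((f"{article}/number", f"{article}/number/{value}"))
    for value in organizations:
        edges.append((f"{article}/organization", f"{article}/organization/{value}"))
    # Verbs are grouped under per-sentence nodes S1, S2, ..., Sn.
    for index, verbs in enumerate(verbs_by_sentence, start=1):
        sentence = f"{article}/verb/S{index}"
        edges.append((f"{article}/verb", sentence))
        for verb in verbs:
            edges.append((sentence, f"{sentence}/{verb}"))
    return edges


def shared_verbs(verbs_by_sentence_a, verbs_by_sentence_b):
    """Verbs two articles have in common; candidates for cross-article links."""
    flat_a = {verb for sentence in verbs_by_sentence_a for verb in sentence}
    flat_b = {verb for sentence in verbs_by_sentence_b for verb in sentence}
    return sorted(flat_a & flat_b)


# "Cops arrested the murderer in the park" yields the single verb "arrested".
first_verbs = [["arrested"]]
second_verbs = [["approved"], ["arrested", "said"]]  # hypothetical second article

tree = build_article_tree(1, first_verbs, numbers=[], organizations=[])
common = shared_verbs(first_verbs, second_verbs)
print(tree)
print(common)
```

Linking the leaf nodes of different articles through such shared verbs is what turns the collection of per-article trees into a single graph.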

Figure 1.

Example graph topology of news articles.

Graph weight. We use the Stanford Sentiment Analysis library⁸ to mark the sentiment of each sentence in a news article. For each sentence, Stanford sentiment analysis predicts one of the following sentiment values: 1-very negative, 2-negative, 3-neutral, 4-positive, and 5-very positive. These values then represent edge weights in our graph. For all other edge relationships, we set the weight to zero. In Fig. 1, w15, w16, w17, w18, w20, and w21 contain non-zero sentiment values. All other edges have zero weight and are shown only with their edge labels.

⁸ www.nlp.stanford.edu/sentiment/.
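The weighting scheme can be illustrated with a small lookup. The text does not spell out which edges carry the sentiment weights beyond the labels in Fig. 1, so the sketch assumes, purely for illustration, that the weight sits on the edge leading into each sentence node; the `edge_weight` and `count_negative` helpers and the edge kinds are hypothetical, and no call to the Stanford library is made.

```python
# Sentiment scale as stated in the text.
SENTIMENT_VALUES = {
    "very negative": 1,
    "negative": 2,
    "neutral": 3,
    "positive": 4,
    "very positive": 5,
}


def edge_weight(edge_kind, sentiment_label=None):
    """Sentiment value for sentence edges (assumed placement), zero otherwise."""
    if edge_kind == "sentence" and sentiment_label is not None:
        return SENTIMENT_VALUES[sentiment_label]
    return 0


def count_negative(sentence_labels):
    """Number of sentences labelled negative or very negative."""
    return sum(1 for label in sentence_labels if SENTIMENT_VALUES[label] <= 2)


# Hypothetical per-sentence predictions for one article.
labels = ["negative", "neutral", "very negative"]
weights = [edge_weight("sentence", label) for label in labels]
print(weights, edge_weight("organization"), count_negative(labels))
```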

The result is a weighted graph with vertices consisting of verbs, organization names, dates, and numbers. The resulting graph consists of 117,343 vertices and 231,142 edges.

3.6 Evaluation

For the ground truth needed for evaluating our approach, each article is examined in detail by two policy making experts working for think tanks. Annotator 1 is employed by SunWorks Consultant Private Limited, leading a team examining news article publications related to Chinese social policies, government activities, foreign policies, China’s neighborhood relations, and the economy. Annotator 2 also works in the area of Chinese affairs studies, employed at the Institute of Chinese Studies (ICS). ICS is funded by the Ministry of External Affairs, Government of India. ICS promotes interdisciplinary studies and research on China and the rest of East Asia, with a focus on domestic politics, international relations, economy, history, health, education, border studies, language, and culture. They examined documents related to the three change detected types described previously, using the approaches described in the data preparation section, marking each article as change detected or not. It is important to note that only articles on which both annotators agreed were marked as change detected. If there was a disagreement in terms of the context of a specific article, the article was removed from the data set. This resulted in 74 articles (out of the original 8,433) being marked as irrelevant.

In order to evaluate our approach, we use recall, F1-score, and accuracy, compared against existing standard approaches.
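For reference, the three measures can be computed from binary labels as sketched below, where 1 denotes change detected and 0 denotes not. The label vectors are hypothetical and serve only to show the arithmetic; they are not results from our experiments.

```python
def confusion_counts(y_true, y_pred):
    """True/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Recall, F1-score, and accuracy for the change detected class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return {"recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical ground truth and predictions for eight articles.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = evaluate(truth, predicted)
print(metrics)
```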

3.7 Our proposed method

In this section, we propose an algorithm that uses an objective function based upon a graph-cut approach. First, we present the important aspects relevant to change detected documents. Articles that are considered as appropriate for detecting this type of change involve one or more of the following aspects:

User-based impact. A user is a named entity that is an organization, as extracted by the Stanford NLP tool (as discussed earlier). For example, articles of interest may mention an organization, such as the World Health Organization (WHO) and the Public Health Agency of Canada. We can then capture attribute information about the mentioned organization names using a tool like the Stanford NLP library. For example, the following sentence can be parsed to recognize the entity: “The Canadian government will ship 800 vials of its experimental Ebola vaccine to the World Health Organization in Geneva beginning on Monday, the Public Health Agency of Canada said on Saturday.” (Source: ndtv.com, 18 Oct 2014). In this work, an expert is defined as the organization that is extracted using the Stanford NLP tool.

Action-based impact. We are interested in capturing verbs that imply changes, such as a new law or proposal being introduced or implemented. For example: “The standing committee of the Nashik Municipal corporation on Wednesday approved a proposal of the civic administration to purchase 125 fire proximity suits for its fire brigade personnel at a cost of Rs 1.19 crore as part of safety measures.” (Source: timesofindia.com, 21 Aug 2014). The basic idea is to capture interesting verbs, such as “approved”, “clears” or “passes”, in the context of change. In this work, we try to leverage such verbs using graph structure.

Resource-based impact. In the case of resource damages, lives lost (or hurt), or revenue lost, we need to capture attribute information about the resource affected. For example: “Colombian authorities said on Monday night that aggravated manslaughter charges would be filed against a bus driver over the deaths of 32 children from a fire in the overcrowded vehicle bringing them home from Sunday school.” (Source: timesofindia.com, 20 May 2014). So, in this example, we want to capture the value 32 in the context of death.

In short, in order to classify an article as change detected or not, our method must effectively capture and leverage contextual information that mentions user-based information (e.g., organization), action-based verbs (e.g., protest, strike), and resource-based information (e.g., “32”).

Graph. In this work, a news article is represented as a graph $G$. Each verb, organization, and number is represented as a node. Edges connect nodes of the same value (verb, name, number) between news articles. $D$ represents the set of articles (documents) in our data set. $M$ represents the total number of documents in our data set. The definition of a news article is provided below.

Definition 1. An article. An article is represented as $d_{i}\in D$. Document $d_{i}$ has $N$ preceding documents by date. Each of the neighbors is represented by $d_{j}\in D$. $M$ represents the total number of articles. A document $d_{i}$ is defined as $d_{i}=\{d_{i}^{vb},d_{i}^{\textit{org}},w_{i}^{\textit{neg}},N_{i}^{\textit{num}},N_{i}^{yr},d_{i}^{\textit{neig}}\}$. $d_{i}^{vb}$ represents the number of common (action) verbs the article shares with all other articles in the graph $G$. $d_{i}^{\textit{org}}$ represents the number of organization names (tokens) mentioned. $w_{i}^{\textit{neg}}$ represents the number of negative sentiment sentences, $N_{i}^{\textit{num}}$ represents the number of times a number is mentioned (excluding numbers that represent a year), and $N_{i}^{yr}$ represents the number of times a year is mentioned. $d_{i}^{\textit{neig}}=\{d_{1},..,d_{j}\}$ represents the set of neighboring (preceding $N$) articles. $N$ represents the number of preceding articles (neighbors) we wish to compare.

As will be demonstrated shortly, we discover that an N value of 5 gives an increased F1 score as shown in Table 7, and is subsequently used as the minimum neighbor in our experiments.

Definition 2. Smoothing Function. In order to overcome noise while fitting the data for our model, we implement a smoothing function. We examined several smoothing functions, but most were sensitive to outliers such as zero. Thus, we ended up implementing Eq. (1). In particular, we require smoothing for features such as $d_{i}^{vb}$ and $d_{i}^{\textit{org}}$. We use $fn_{i}^{vb}$ and $fn_{i}^{\textit{org}}$, as defined in Eq. (1), in place of $d_{i}^{vb}$ and $d_{i}^{\textit{org}}$, respectively. First, this smoothing function maps outliers such as $d_{i}^{vb}=$ 0 and $d_{i}^{\textit{org}}=$ 0 to 1. Second, all values are smoothed to lie in the range (0, 1].
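The behavior of the smoothing function can be illustrated with a minimal Python sketch. The function name `smooth` is ours and is only illustrative; the formula is the one given for $fn_{i}^{vb}$ and $fn_{i}^{\textit{org}}$.

```python
def smooth(count):
    """Smoothing function fn: maps a raw count d to 1 / (1 + d**2)."""
    if count < 0:
        raise ValueError("counts must be non-negative")
    return 1.0 / (1.0 + count ** 2)


# A zero count (the outlier case) maps to 1; larger counts shrink toward 0,
# so every smoothed value lies in the interval (0, 1].
for raw in (0, 1, 3, 30):
    print(raw, smooth(raw))
```

Because the count is squared, the smoothed value falls quickly: a count of 3 already gives 0.1, so articles with very many verb or organization mentions contribute small values.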

Indeed, if we use the raw values $d_{i}^{vb}$ and $d_{i}^{\textit{org}}$ instead of $fn_{i}^{vb}$ and $fn_{i}^{\textit{org}}$ in our objective function, the precision and recall of the graph-cut with $N=$ 5 neighbors are 0.4249 and 0.1675, respectively. That is, recall declines from 0.4019 (with smoothing) to 0.1675, because the unsmoothed values include outliers.

Definition 3. New words for a document ($d_{i}^{\textit{nword}}$). New words for a document is the average number of new words appearing in a document $d_{i}$ compared to its $N$ neighbors.

$\displaystyle fn_{i}^{vb}=fn(d_{i}^{vb})=\frac{1}{1+(d_{i}^{vb})^{2}},\qquad fn_{i}^{\textit{org}}=fn(d_{i}^{\textit{org}})=\frac{1}{1+(d_{i}^{\textit{org}})^{2}}$ (1)

$\displaystyle CC=\Bigg\{\overbrace{\sqrt{fn_{i}^{vb}\cdot fn_{i}^{\textit{org}}}\cdot w_{i}^{\textit{neg}}}^{\text{A}}-\beta\,\overbrace{\frac{\sum_{j=1}^{N}\sqrt{fn_{j}^{vb}\cdot fn_{j}^{\textit{org}}}\cdot w_{j}^{\textit{neg}}}{N}}^{\text{B}}\Bigg\}\cdot\Bigg\{\underbrace{d_{i}^{\textit{nword}}-\frac{\sum_{j=1}^{N}d_{j}^{\textit{nword}}}{D_{N}}}_{\text{C}}\Bigg\}$ (2)

$\displaystyle \beta=\alpha\cdot\Bigg\{\sqrt{N_{i}^{\textit{num}}+N_{i}^{yr}}-\underbrace{\frac{\sum_{j=1}^{N}\sqrt{N_{j}^{\textit{num}}+N_{j}^{yr}}}{N}}_{\text{E}}\Bigg\}$ (3)

Objective function. The following discusses the context (graph structure) information, content information, and cut cost (CC) used in our proposed approach. For content and context information, we remove the stop words and perform word-stemming. Our approach calculates the minimum graph-cut in Eq. (2) in comparison with its neighbors. We leverage two types of information from each article: Context and Content. Context leverages information, such as verbs, numbers, and sentiments, using graph structure. Content leverages information, such as new words and similarity metrics. Our algorithm is based on the strength of context and content of an article with its neighbor.

Context. Contextual information involves the mention of common action verbs (social context) and popular social organizations across multiple articles, thus forming edges, or links, across documents. For example, two or more articles might be connected via the node of a common verb such as “act”, “notify”, or “announce”. These common (action) verbs might occur in phrases such as “Law passed” or “New traffic rules announced”, indicating a change (per our definition). Similarly, organization names and numbers, such as the year in one document, can form linkages to other documents. For instance, the more numbers are mentioned in an article, the more likely it has a statistical significance. In addition, more numbers in an article likely indicates that resource-based changes are contained in the article. For example: “2100 died because of Ebola in year 2015 alone”. In short, we use organization names, numbers, and common action verb linkages, tagged using the openNLP library and the Stanford NLP tool, for the contextual (structural) information in the graph.

Content. Content information that carries new words is considered to be carrying new information. For example: “New mobile application has been introduced for senior citizens.” Part C of Eq. (2) is based upon new words. For each document $d_{i}$, we calculate the count of new words that have not occurred in the $N$ preceding documents, which we call neighbors. Thus, in terms of context and content, our first intuition is that an article with considerable changes will also mention organization names, common (action) verbs, numbers, etc. Our second intuition is that this article will have fewer negative sentiments than its neighbors. This is due to the fact that change detected articles might have information such as solutions proposed, statistical information, and guidelines mentioned by experts, all of which is (presumably) positive information to the reader. In other words, interesting change detected articles might have fewer sentences with negative sentiments (i.e., a smaller value for $w_{i}^{\textit{neg}}$), and are more likely to mention numbers, organization names, and action verbs, as compared to their neighbors.

Cut cost. For each article, we calculate the Cut Cost (CC), as defined in Eq. (2). We first calculate the geometric mean of the common verb function $fn_{i}^{vb}$ and the organization function $fn_{i}^{\textit{org}}$ for a given document $d_{i}$, and multiply it by the count of negative sentences in the article, $w_{i}^{\textit{neg}}$. This product is marked as Part A in Eq. (2). Then, Part A is compared to the average of the same quantity over its neighbors, which is marked as Part B in Eq. (2). Similarly, for a given document $d_{i}$, we calculate $d_{i}^{\textit{nword}}$, a count of new words that have not occurred in the past $N$ documents. Part C of Eq. (2) shows where we capture the difference between the calculated new words of $d_{i}$ and the average number of new words of all documents in the data set $d_{\textit{avg}}^{\textit{nword}}$. If the cut cost is positive, we mark the article as change detected. Again, the basic intuition is that a change detected article will have a mix of statistics, fewer negative sentiments, and organization names that act as change agents in comparison to its neighbors. A further explanation of how the cut cost works is given later in this section.

Penalty. For articles that contain no relevant information on context, our objective function needs to over-penalize them on the cut cost. To do this, we introduce a penalty $\beta$, as shown in Eq. (3). $\beta$ captures the difference between the square root of the article's combined count of numbers and years and the average of the same quantity over its neighbors. Part E of Eq. (3) represents this neighbor average.

Figure 2.

Our proposed probabilistic graphical model CBLDA.

Figure 3.

Effect of penalty parameter alpha.

Parameter $\alpha$ controls the degree of penalization. In our experiments, we evaluated alpha from 0.1 to 40 and found the best $F1$ to be when $\alpha=$ 0.95. Figure 3 shows the effect of $\alpha$ on precision, recall, and $F1$ . After $\alpha=$ 2, precision, recall and $F1$ do not improve, and actually maintain the same percentage.

Graph-Cut Algorithm
Input: $D$ documents of our data set. alpha $=$ 0.9
for each article $d_{i}\in D$:
    get the “article” node of $d_{i}$
    get all child nodes under the “article” node of $d_{i}$
    calculate cut cost CC for $d_{i}$ as per Eq. (2)
    if calculated CC $>$ 0:
        classify $d_{i}$ as “change detected”
    else:
        classify $d_{i}$ as “NOT change detected”

Cut cost case studies. We discuss three different case studies for the polarity of different components in terms of cut cost, as shown in Table 3. Each case study in the table has three news articles, where one article is a change detected article, and the other two are neighbors of the change detected article. For example, article id 3 is a change detected article, whereas news article ids 1 and 2 are neighbors of article id 3. Table 4 shows the features $d_{i}^{\textit{org}}$, $d_{i}^{vb}$, $N_{i}^{\textit{num}}$, $w_{i}^{\textit{neg}}$, $N_{i}^{yr}$, and $d_{i}^{\textit{nword}}$ for each article in Table 3. Using Eq. (1), $fn_{i}^{vb}$ is calculated from $d_{i}^{vb}$, and $fn_{i}^{\textit{org}}$ is calculated from $d_{i}^{\textit{org}}$. Using these features, we calculate the cut cost with Eq. (2). As shown in Table 6, the equation has three components with individual polarity: the value of $\beta$, the value of Part (A-B), and the value of Part C.

Table 3

Table showing set of articles for three different case studies for polarity of cut-cost in Eq. (2). Each case study has three news articles, where one article is a change detected article, and other two are neighbors of the change detected article. For example, article id 3 is change detected article, whereas news article ids 1 and 2 are neighbors of article id 3

Article ID	URL	Is this change detected article?	Description
Articles for Case 1
1	http://timesofindia.indiatimes.com/world/us/kidnapping-talibans-new-income-source/articleshow/3457736.cms	No	Neighbor article for article id 3.
2	http://timesofindia.indiatimes.com/city/patna/new-crop-of-educated-kidnappers-irks-police/articleshow/1437777.cms	No	Neighbor article for article id 3.
3	http://www.ndtv.com/world-news/militants-kidnap-kill-20-iraqi-soldiers-561301	Yes	–
Articles for Case 2
4	http://www.thehindu.com/todays-paper/tp-national/tp-tamilnadu/unused-pipes-cause-hindrance-to-road-users/article3379112.ece	No	Neighbor article for article id 6.
5	http://www.thehindu.com/todays-paper/tp-national/tp-tamilnadu/rs23-crore-for-linking-bypass-road-with-perambalur-town/article3715263.ece	No	Neighbor article for article id 6.
6	http://timesofindia.indiatimes.com/city/allahabad/robots-designed-by-students-to-fill-potholes-on-allahabad-roads/articleshow/19164244.cms	Yes	–
Articles for Case 3
7	http://timesofindia.indiatimes.com/india/kidnapping-its-now-a-rich-industry-in-india/articleshow/1001952.cms	No	Neighbor article for article id 9.
8	http://timesofindia.indiatimes.com/india/kidnapped-indian-diplomat-traced/articleshow/7371396.cms	No	Neighbor article for article id 9.
9	http://timesofindia.indiatimes.com/world/rest-of-world/maoists-kidnap-100-kids-in-nepal/articleshow/1226860.cms	Yes	–

Table 4

Table showing feature values of each article in Table 3 as represented in Eqs (2) and (3)

Article ID	Count of organization tokens $d_{i}^{\textit{org}}$	Count of verb tokens $d_{i}^{vb}$	Count of number tokens $N_{i}^{\textit{num}}$	Count of sentences with negative sentiments $w_{i}^{\textit{neg}}$	Count of year tokens $N_{i}^{yr}$	Count of new words $d_{i}^{\textit{nword}}$
Article ID $=$ 1 (Case 1)	13	42	6	20	1	10.5
Article ID $=$ 2 (Case 1)	2	43	6	24	0	9.5
Article ID $=$ 3 (Case 1)	3	30	10	14	0	15.0
Article ID $=$ 4 (Case 2)	8	25	2	19	0	5.5
Article ID $=$ 5 (Case 2)	0	25	5	22	0	4.5
Article ID $=$ 6 (Case 2)	25	33	6	14	1	2.0
Article ID $=$ 7 (Case 3)	18	25	7	16	2	9.0
Article ID $=$ 8 (Case 3)	0	24	3	16	0	5.0
Article ID $=$ 9 (Case 3)	20	9	1	8	0	9.5

Table 5

Table showing values of $\beta$ , Part (A-B), and Part C in cut-cost as represented in Eq. (2). Value of $\alpha=$ 0.95

Article ID	$\beta$	Part (A-B)	Part C	Cut-cost CC
Article ID $=$ 3 (Case 1)	0.5839	0.0639	5	0.3198
Article ID $=$ 6 (Case 2)	0.7795	$-$ 0.3625	$-$ 3	1.0875
Article ID $=$ 9 (Case 3)	$-$ 1.297	0.49933	2.5	1.24833
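
The values in Table 5 can be recomputed from the feature counts in Table 4. The following Python sketch does this under two assumptions of ours: each case uses only the two neighbors listed in Table 3 (so $N=$ 2), and the denominator $D_{N}$ in Part C is taken to be the number of neighbors. All function and variable names are illustrative only.

```python
from math import sqrt

ALPHA = 0.95  # value of alpha stated in the Table 5 caption

# Feature counts copied from Table 4: (org, vb, num, neg, yr, nword)
TABLE4 = {
    1: (13, 42, 6, 20, 1, 10.5),
    2: (2, 43, 6, 24, 0, 9.5),
    3: (3, 30, 10, 14, 0, 15.0),
    4: (8, 25, 2, 19, 0, 5.5),
    5: (0, 25, 5, 22, 0, 4.5),
    6: (25, 33, 6, 14, 1, 2.0),
    7: (18, 25, 7, 16, 2, 9.0),
    8: (0, 24, 3, 16, 0, 5.0),
    9: (20, 9, 1, 8, 0, 9.5),
}


def smooth(count):
    """Eq. (1): 1 / (1 + count**2)."""
    return 1.0 / (1.0 + count ** 2)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def context_strength(article):
    """Geometric mean of the smoothed verb and organization counts,
    multiplied by the negative-sentence count (Part A for the article,
    one summand of Part B for a neighbor)."""
    org, vb, _num, neg, _yr, _nword = article
    return sqrt(smooth(vb) * smooth(org)) * neg


def number_strength(article):
    """Square root of the combined number and year counts, as in Eq. (3)."""
    return sqrt(article[2] + article[4])


def cut_cost(article, neighbors, alpha=ALPHA):
    """Return (beta, Part (A-B), Part C, CC) following Eqs (2) and (3)."""
    beta = alpha * (number_strength(article)
                    - mean(number_strength(n) for n in neighbors))
    part_ab = (context_strength(article)
               - beta * mean(context_strength(n) for n in neighbors))
    # Assumption: D_N is the number of neighbors.
    part_c = article[5] - mean(n[5] for n in neighbors)
    return beta, part_ab, part_c, part_ab * part_c


# Change detected article and its two neighbors, as listed in Table 3.
CASES = {3: (1, 2), 6: (4, 5), 9: (7, 8)}

for target, neighbor_ids in CASES.items():
    beta, part_ab, part_c, cc = cut_cost(
        TABLE4[target], [TABLE4[j] for j in neighbor_ids])
    label = "change detected" if cc > 0 else "NOT change detected"
    print(target, round(beta, 4), round(part_ab, 4), part_c, round(cc, 4), label)
```

Under these assumptions the computed $\beta$, Part (A-B), Part C, and CC agree with Table 5 up to rounding, and all three articles receive a positive cut cost.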

Table 6

Table showing polarity of $\beta$, Part (A-B), and Part C in cut-cost as represented in Eq. (2). The polarity in this table is taken from Table 5. Polarity of the three parts affects polarity of cut-cost and thereby classification of an article for change detection. Value of $\alpha=$ 0.95

Article ID	Is value of $\beta$ positive?	Is value of Part (A-B) positive?	Is value of Part C positive?	Is value of cut-cost CC greater than zero?
Article ID $=$ 3 (Case 1)	Yes	Yes	Yes	Yes
Article ID $=$ 6 (Case 2)	Yes	No	No	Yes
Article ID $=$ 9 (Case 3)	No	Yes	Yes	Yes

First, it is worth noting that the smoothed verb value $fn_{i}^{vb}$ decreases as $d_{i}^{vb}$ increases, and the smoothed organization value $fn_{i}^{\textit{org}}$ decreases as $d_{i}^{\textit{org}}$ increases. Part A therefore depends on the interplay between $d_{i}^{vb}$ and $d_{i}^{\textit{org}}$. Second, how much penalty $\beta$ is applied on Part B can be controlled by $\alpha$. Third, in general, change detected news articles, particularly those that are solution-based and context-based, are expected to have moderate negative sentiments $w_{i}^{\textit{neg}}$. News articles that are not change detected might have slightly more negative sentences compared to change detected news articles. Since change detected news articles do not occur often, neighbors are more likely not to be change detected news articles. Hence, whether a news article is marked as change detected depends on how its $fn_{i}^{vb}$ and $fn_{i}^{\textit{org}}$ values compare with the average of its neighbors in Part (A-B), together with Part C. For the following three case studies, we therefore discuss the interplay between $fn_{i}^{vb}$, $fn_{i}^{\textit{org}}$, $N_{i}^{\textit{num}}$, $N_{i}^{yr}$, and $d_{i}^{\textit{nword}}$.

Case 1: Resource-based change detected article. All three components in Eq. (2) have positive polarity. First, in the case of resource-based change detected news articles, we expect more numbers to be mentioned. Such articles will have relatively more numbers in comparison to their neighbors. This will make $\beta$ positive. Second, resource-based articles might mention accidents or earthquakes where lives are lost and resources are damaged. Some important entities such as the United Nations will be mentioned with many common action verbs in comparison to their neighbors. In other words, the values for $d_{i}^{vb}$ and $d_{i}^{\textit{org}}$ are moderate. In contrast, a neighbor article which is not change detected will have higher values for verbs $d_{i}^{vb}$, but will likely have a smaller value (or zero) for organization names $d_{i}^{\textit{org}}$, which will give Part B a lower value. This dynamic will make Part (A-B) more likely to be positive, even after penalizing Part B with a positive $\beta$ value. Third, in general, change detected news articles are more likely to have more new words compared to their neighbors, thus making the polarity of Part C positive. Since all components are positive, the cut-cost will be positive and the document will be classified as change detected.

Case 2: Solution-based change detected article. In this case, the polarity of Part (A-B) is negative, the polarity of Part C is negative, and the polarity of $\beta$ is positive. First, sometimes people gather for a yearly event, such as a competition at a university or conference, for proposing solutions to societal problems. Such events have numbers such as the number of people attending the event, the number of companies/universities attending, or prize money (e.g., article id 6). Thus, a solution-based change detected article will have more number and year combinations as compared to its neighbors. This will make $\beta$ positive and larger. Second, when groups of people come together to solve a societal problem, they discuss the existing problematic situation (i.e., fewer new words), and implementing solutions (i.e., more action verbs) with many organizations participating (i.e., more organization name mentions). More action verbs and organization names will make the Part A value smaller than that of its neighbors, and penalizing Part B with a larger $\beta$ value will more likely make the polarity of Part (A-B) negative. With fewer new words than its neighbors, Part C is negative as well, so the product of the two negative parts gives a positive cut cost (for article id 6 in Table 5, $-$ 0.3625 times $-$ 3 gives 1.0875).

Case 3: Context-based change detected article. In this case, the polarity of Part (A-B) is positive, the polarity of Part C is positive, and the polarity of $\beta$ is negative. Case 3 is similar to Case 1, but occurs mostly in context-based change detected news articles that have expert mentions and two characteristics:

• There are two types of news articles that mention years and numbers. The first type is not change detected news articles that mention some statistical information (not from experts). We call these “recent history type” news articles, which give detailed explanations of a topic such as kidnapping. For example, article id 7 discusses the rise in kidnapping events with the years and numbers mentioned. The second type is change detected news articles, particularly context-based/resource-based news articles, that carry the opinions of experts with numbers and years mentioned. These articles are more likely to have fewer numbers and years compared to their neighbors, which might include recent history type articles. In this case, the polarity of $\beta$ is more likely to be negative, which makes the polarity of Part (A-B) positive.

• Relatively more new words compared to the average of their neighbors, which is generally expected for change detected news articles. This will make Part C more likely positive.

3.8 Experiments

The four comparison methods we use are cosine similarity, new word count, new word count with threshold, and Jaccard coefficient. In our comparison methods, as mentioned earlier, for all $N$ preceding document comparisons we use a value of 5 for $N$. All experiments are run on a Mac with a 2.8 GHz Intel Core i7 and 16 GB of 1600 MHz DDR3 memory.

3.8.1 Our proposed method

We implement our proposed method as an iterative algorithm, as shown in Algorithm 3. The algorithm iterates over one document $d_{i}$ at a time. It gets all the child nodes such as date, number, verb, and organization under the “article” node of $d_{i}$ , then calculates the cut cost for $d_{i}$ with Eq. (2). If the cut cost is positive, the algorithm classifies the article as “change detected”.

3.8.2 Comparison methods

Cosine similarity [3] and the Jaccard coefficient [7] are the two most popularly used methods. These methods have also been applied to problems such as novelty detection [11] and the discovery of similar documents [24]. We use them to calculate the TF-IDF similarity between 2 articles. Another approach involves calculating the new words count when comparing a document to past documents, which can be used to discover an uncommon document. We also chose this technique as one of our baselines as it has been used repeatedly in other related research [13, 14, 15].

For our experiments, the articles from all three newspapers listed in Table 2 are merged and sorted chronologically. Since several articles from The Hindu newspaper have missing dates, we only use the month and year for the chronological ordering of news articles. Our basic intuition is that rich (uncommon) contextual documents will have less similarity with their recent past and future documents. For each of the baseline methods, $D^{q}$ represents a set of chronologically ordered documents containing a topic query $q$. We prepared a TF-IDF vector for each document $d_{i}^{q}\in D^{q}$. TF-IDF methods are often used as a fast and effective means of comparison with similarity-based methods such as cosine similarity. We also removed stop words and performed word-stemming (i.e., Porter stemmer).

Cosine similarity. For a query $q$, we iterate through each individual document $d_{i}^{q}$ and calculate its cosine similarity with each of the $N$ preceding documents. Then, we average these pairwise cosine similarities. The Top n% of documents with the least average cosine similarity are marked as change detected. As discussed under Definition 1, $N=$ 5 is the minimum number of neighbors we use, so we do not consider $N$ in the range [1–4]. Thus, we experimented with values of 5, 10, and 15, and found the best precision, recall, and F1-score using $N=$ 5. The cosine similarity for two vectors $A$ and $B$ is shown in Eq. (4).

$\displaystyle\textit{Cosine}(A,B)=\frac{A\cdot B}{\|A\|\,\|B\|}$ (4)
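
As an illustration of this baseline, the sketch below computes the cosine similarity of Eq. (4) on sparse TF-IDF vectors, averages it over a document's neighbors, and selects the least similar fraction of documents. The dictionary representation, the function names, the toy weights, and the rule of truncating the Top n% to a whole number of documents are assumptions made for this example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an empty vector
    return dot / (norm_a * norm_b)


def mean_cosine(doc, neighbors):
    """Average pairwise cosine similarity between a document and its neighbors."""
    return sum(cosine(doc, nb) for nb in neighbors) / len(neighbors)


def least_similar(scores, fraction):
    """Positions of the `fraction` of scores that are lowest (at least one)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=scores.__getitem__)
    return sorted(ranked[:k])


# Toy TF-IDF vectors in chronological order (weights are made up).
docs = [
    {"law": 0.5, "pass": 0.5},
    {"law": 0.5, "pass": 0.5},
    {"law": 0.4, "road": 0.6},
    {"robot": 0.7, "pothol": 0.7},
]

# Score documents 2 and 3 against their two preceding documents.
scores = [mean_cosine(docs[i], docs[i - 2:i]) for i in (2, 3)]
print(scores, least_similar(scores, 0.5))
```

In this toy run the last document shares no terms with its neighbors, so it has the lowest average similarity and is the one selected.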

New word count. For a query q, we iterate through each individual document $d_{i}^{q}$ and count the new words that have not occurred in each of the N preceding documents. Then, we average the new word count for all pairs of documents. The Top n% of documents with the highest new word count are marked as change detected.
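
A minimal sketch of the new word count is given below. It assumes that each document is already a list of stemmed tokens with stop words removed, and that a word is counted once per document; the function name and the example tokens are ours.

```python
def new_word_count(doc_tokens, neighbor_token_lists):
    """Average, over the neighbors, of the number of distinct words in the
    document that do not occur in that neighbor."""
    vocabulary = set(doc_tokens)
    counts = [len(vocabulary - set(tokens)) for tokens in neighbor_token_lists]
    return sum(counts) / len(counts)


doc = ["new", "mobile", "application", "senior", "citizen"]
neighbors = [
    ["senior", "citizen", "safety"],                 # 3 words of doc are new
    ["mobile", "application", "senior", "citizen"],  # 1 word of doc is new
]
print(new_word_count(doc, neighbors))
```

Averaging the pairwise counts is what produces fractional values such as 10.5 when the number of neighbors is even.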

New word count with threshold. This method is similar to the new word count approach. In this case, we provide a threshold of percentage difference between new words found in an individual document $d_{i}^{q}$ with the average new word count of $N$ preceding documents. The Top n% of documents with the highest percentage difference are marked as change detected. We experimented with threshold values between 80 and 90, and discovered the best precision using a $\textit{threshold}=$ 90.

Jaccard coefficient. For a query $q$ , we iterate through each individual document $d_{i}^{q}$ and calculate the Jaccard coefficient for each of the $N$ preceding documents. Then, we average the pair of documents’ calculated Jaccard coefficients. The Top n% of documents with the lowest Jaccard coefficients are marked as change detected. The equation of Jaccard Coefficient is given in Eq. (5) for two sets $A$ and $B$ .

$\displaystyle J(A,B)=\frac{|A\cap B|}{|A\cup B|}$ (5)
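
The Jaccard baseline can be sketched in the same way. Treating each document as a set of tokens, returning 0 for two empty sets, and the function names are assumptions made for this example.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(union)


def mean_jaccard(doc_tokens, neighbor_token_lists):
    """Average Jaccard coefficient between a document and its neighbors."""
    values = [jaccard(doc_tokens, tokens) for tokens in neighbor_token_lists]
    return sum(values) / len(values)


print(jaccard(["law", "pass", "road"], ["pass", "road", "robot"]))
print(mean_jaccard(["law", "pass"], [["law", "pass"], ["robot"]]))
```

Documents with the lowest average coefficient share the least vocabulary with their neighbors, which is why they are the ones marked as change detected.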

In all the above methods, we iterate through each document from a chronologically sorted set. We usually take the $N$ preceding documents. However, for the first few documents, there might not be enough preceding documents. Hence, we take either the $N$ succeeding documents (if available), or a combination of preceding and succeeding documents. For example, suppose we choose a value of 15 for $N$. When iterating on the $6^{th}$ document, we would need a total of $N=$ 15 preceding documents, but only 5 are available. In this case, we also include the 10 succeeding documents.
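
The neighbor selection just described can be sketched as follows. The function name, the 0-based indexing, and the choice to fill the shortfall with the documents immediately following the current one are assumptions of this example.

```python
def neighbor_indices(i, total, n):
    """Indices (0-based) of up to n neighbors of document i in a
    chronologically sorted list of `total` documents: the preceding
    documents first, then succeeding documents to fill any shortfall."""
    preceding = list(range(max(0, i - n), i))
    missing = n - len(preceding)
    succeeding = list(range(i + 1, min(total, i + 1 + missing)))
    return preceding + succeeding


# The example from the text: N = 15 and the 6th document (index 5).
picked = neighbor_indices(5, total=100, n=15)
print(len(picked), picked)
```

For the 6th document this returns the 5 preceding documents followed by the next 10, giving 15 neighbors in total; once enough history is available, only preceding documents are returned.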

3.9 Results and discussion

Table 7 shows our experimental results, comparing baseline methods to our proposed Graph-Cut approach. We first compare and discuss results when dealing with the Top 20% and $N=$ 5. The accuracy of the baseline approaches, except cosine similarity, is noticeably better than that of our Graph-Cut approach. However, Graph-Cut gives the best overall precision (new word count comes close), recall, and F1-score. We also varied the number $N$ of preceding (neighbor) documents, with results for $N$ values of 10 and 15 shown in Table 7. The results are similar to those for $N=$ 5: the new word count approach is again close, or even slightly better, in terms of precision, but not in terms of recall or F1-score.

Table 7
Results showing Precision (Prec), Recall, F1-Score, and Accuracy (Acc) for baseline methods and our Graph-Cut approach. For baseline methods, the Top 20% of articles are marked as change detected. Different values for $N$ representing number of preceding document (neighbors) experiments and results are shown

Top n%	N	Method	Prec	Recall	F1-score	Acc
20%	5	Cosine	0.3462	0.1869	0.2427	0.5489
20%	5	New word count	0.4207	0.1799	0.2520	0.5800
20%	5	New word count-Threshold	0.2363	0.2935	0.2618	0.6414
20%	5	Jaccard	0.3550	0.1768	0.2360	0.6648
–	5	Graph-Cut	0.4283	0.4019	0.4146	0.5589
20%	10	Cosine	0.3363	0.1831	0.2371	0.5451
20%	10	New word count	0.4290	0.1857	0.2592	0.5832
20%	10	New word count-Threshold	0.2363	0.2935	0.2618	0.6414
20%	10	Jaccard	0.3673	0.1824	0.2437	0.6672
–	10	Graph-Cut	0.4266	0.4000	0.4128	0.5576
20%	15	Cosine	0.3231	0.1767	0.2284	0.5399
20%	15	New word count	0.4256	0.1842	0.2571	0.5818
20%	15	New word count-Threshold	0.2363	0.2935	0.2618	0.6414
20%	15	Jaccard	0.3802	0.1880	0.2515	0.6698
–	15	Graph-Cut	0.4289	0.4055	0.4168	0.5611
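
As a consistency check on Table 7, the F1-score column can be recomputed from the Prec and Recall columns as their harmonic mean. The sketch below does this for the $N=$ 5 rows; the function name is ours and the numbers are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Prec, Recall, reported F1-score) for N = 5, copied from Table 7.
ROWS = {
    "Cosine": (0.3462, 0.1869, 0.2427),
    "New word count": (0.4207, 0.1799, 0.2520),
    "New word count-Threshold": (0.2363, 0.2935, 0.2618),
    "Jaccard": (0.3550, 0.1768, 0.2360),
    "Graph-Cut": (0.4283, 0.4019, 0.4146),
}

for method, (prec, recall, reported) in ROWS.items():
    print(method, round(f1_score(prec, recall), 4), reported)
```

The recomputed values match the reported F1-scores to within rounding, and they make visible that the Graph-Cut advantage in F1 comes mainly from its much higher recall.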

In addition, we evaluated the baseline methods with different values for the Top n% other than the Top 20% marked for change detection. It is worth noting that the percentage of change detected articles in our data set is approximately 14%, so we experimented with values of the Top 10% and Top 15% being marked as change detected. We include Table 8 to show how the baseline results are affected by the value of $n$ used for the Top n% marked as change detected documents. For the Top 15%, recall and F1 are lower than for the Top 20% for every baseline, although precision improves for new word count and Jaccard; for the Top 10%, recall and F1 are lower still. In Fig. 3, graph-cut achieves the best $F1$ when $\alpha=$ 0.95. After $\alpha=$ 2, all of our measures flatten out. For the Top 15% and $N=$ 5, Student's t-test of our 4 evaluation metrics for New word count with Threshold and Graph-Cut shows significance only at 14%. However, Graph-Cut has the better F1-score (0.4146 versus 0.2019), although its accuracy is lower (0.5589 versus 0.6262).

Table 8

Results showing effect of different Top n % values marked as change detected on performance of baselines

Top n%	N	Method	Prec	Recall	F1-score	Acc
10%	5	Cosine	0.3070	0.0842	0.1321	0.5727
10%	5	New word count	0.4692	0.0923	0.1542	0.6040
10%	5	New word count-Threshold	0.2021	0.1640	0.1810	0.6230
10%	5	Jaccard	0.4042	0.0933	0.1516	0.6420
10%	10	Cosine	0.3070	0.0842	0.1321	0.5727
10%	10	New word count	0.4692	0.0923	0.1542	0.6040
10%	10	New word count-Threshold	0.2021	0.1640	0.1810	0.6230
10%	10	Jaccard	0.4042	0.0933	0.1516	0.6420
10%	15	Cosine	0.3158	0.0922	0.1427	0.5744
10%	15	New word count	0.4789	0.0964	0.1604	0.6059
10%	15	New word count-Threshold	0.2021	0.1640	0.1810	0.6230
10%	15	Jaccard	0.4158	0.0974	0.1578	0.6431
15%	5	Cosine	0.3445	0.1322	0.1910	0.5605
15%	5	New word count	0.4729	0.1371	0.2125	0.5971
15%	5	New word count-Threshold	0.1839	0.2240	0.2019	0.6262
15%	5	Jaccard	0.4103	0.1357	0.2039	0.6540
15%	10	Cosine	0.3445	0.1322	0.1910	0.5605
15%	10	New word count	0.4729	0.1371	0.2125	0.5971
15%	10	New word count-Threshold	0.1839	0.2240	0.2019	0.6262
15%	10	Jaccard	0.4103	0.1357	0.2039	0.6540
15%	15	Cosine	0.3102	0.1248	0.1779	0.5508
15%	15	New word count	0.4757	0.1377	0.2135	0.5979
15%	15	New word count-Threshold	0.1839	0.2240	0.2019	0.6262
15%	15	Jaccard	0.3907	0.1337	0.1992	0.6512

It should also be noted that the running time of the baseline algorithms ranges from 5 to 10 seconds. In comparison, our Graph-Cut method takes approximately 1 second to complete. Our Graph-Cut method has a linear running time of $O(M)$, where $M$ is the number of articles in our dataset.

4. Co-ranking authors in heterogeneous news networks

4.1 Overview

The rapid growth of news channel entities over the past decade has revolutionized the way people consume news. Digitization and Internet penetration have encouraged people to read news articles online. Until recently, research has primarily studied areas like news recommendation, summarizations, and actors’ interplay in news articles. However, the Web is highly complex with a rich set of diverse news channels or networks, and the heterogeneous nature of these networks results in an even more complex interaction of people, places, and things.

In order to address the issue of personalized news recommendation for different target groups, we have hypothesized a dual-layered approach. So far, we have presented our novel graph-cut approach that addresses the issue of filtering out unimportant news articles. However, in order to address the issue of finding news articles from authoritative news sources, we need to be able to rank the authors on topics of interest such as Syria and human trafficking. In order to do that, we will introduce our novel context-based graphical model, which deals with the issue of handling two different (heterogeneous) news sources. One is a traditional news network, or TNS (e.g., Los Angeles Times, New York Times, etc.), which presents various local and international topics, such as corruption, new law proposals, crimes, and elections. The other is an institutional news network from policymaker institutes, or PNS (e.g., World Health Organization (WHO), Brookings Institute, etc.), that also publishes on similar events and topics. There are two main points about the nature of TNS and PNS. First, TNS focuses on influencing citizens and on growing its readership globally. PNS focuses on influencing policymakers, academics, and politicians. Second, TNS journalists are more likely to cover topics in breadth rather than depth. In contrast, PNS journalists are more likely to cover topics in depth rather than breadth. Users are often overwhelmed by too many news articles, as well as by the questionable reputation of what they are reading. Users want to know which experts to follow on specific topics. The problem of identifying experts in a bibliographic network has been previously studied [2, 26, 29]. However, such a ranking study has not been done for authors of news articles. We will study the problem of finding experts on specific topics from our proposed heterogeneous network of news sources.
Our goal is to discover who are the expert journalists and policy analysts on specific topics. In order to achieve our goal, we will address two challenges. First, the two networks are heterogeneous in their object types, such as the mentions of people, locations, etc. Second, many news articles have incomplete information.

The following section addresses the issue of determining which news authors are relevant by: (1) combining two networks where the news stories are written by different authors with different objectives: one plays the role of a journalist serving the public, and the other plays the role of a policy adviser, for example at a think tank. In other words, TNS and PNS are heterogeneous networks; and (2) showing that our method improves the performance for co-ranking authors compared to other methods. In this section, we first discuss the problem in detail. Section 4.3 presents the data that will be used in our evaluation, the data source, and how the data is prepared. In Section 4.4, we present our proposed approach, and Section 4.5 discusses various comparison methods. Section 4.6 explains our experimental setup, followed by how our approach is evaluated in Section 4.7. We then conclude with results in Section 4.8, followed by concluding remarks in Section 5.

4.2 Problem statement

In this section, we present some definitions and define the problem statement. Table 9 presents the important symbols, and their description, that are discussed in this section.

Table 9
Notation

Symbol	Description
$d_{i}$	An individual document from our collection $D$ .
$a_{i}$	Individual author $a_{i}\in A$ .
$T_{g}$	Global tags: the dictionary of applied tags extracted from the collection $D$.
$Z$	Number of (latent) topics.
$V$	Number of unique words in our data set, i.e., the dictionary.
$N_{d_{i}}$	Number of unique words in a document $d_{i}$ from the set $D$.
$z_{i}$	$i^{th}$ topic assigned to a document.
$w_{i}$	Represents a word (token) from a document.
$C_{d_{i}}$	Represents the list of 6 different context types for a document.
$C^{t,d_{i}}$	Represents any one of the 6 context types from $C_{d_{i}}$ for a document $d_{i}$.
$W_{C^{t}}$	Represents the dictionary of words for a specific context type $C^{t,d_{i}}$.
$w_{c,t}$	Represents a word (token) of index $c$ from $W_{C^{t}}$ from a document $d_{i}$.
$\alpha$	Dirichlet prior for $\theta$.
$\theta$	Multinomial distribution of topic $z_{m}$ over documents $D_{i}$.
$\beta$	Dirichlet prior for $\phi$.
$\phi$	Distribution of word $w_{i}$ for topics $Z$.
$\gamma$	Dirichlet prior for $\mu$.
$\mu$	Multinomial distribution of word $w_{c,t}$ in a context type $C^{t,d_{i}}$.

4.2.1 Definitions

For the purpose of presenting and evaluating our proposed approach, we will define our news feed collection as one that takes a set of documents (articles) $D$ from both TNS and PNS networks. Let $G_{\textit{pns}}$ represent the graph of a PNS network, and $G_{\textit{tns}}$ represent the graph of a TNS network. Each document, $d_{i}$ , consists of the following features: title, date time, body, author names, and keywords (applied tags). While title and body are two separate features, we refer to the use of both of them as textual content. Let $Z$ be the number of latent topics. Let $Q=$ {Syria, Climate Change, Election, India, Crime, Boko Haram} represent the set of queries, and let $q\in Q$ represent a single query. Let $A=\{a_{1},a_{2},...\;a_{n}\}$ be the set of unique authors in our collection, where $a_{i}$ represents an individual author. Let $V$ indicate the vocabulary, i.e., the unique words in $D$ after removing stop words.

Definition 1. Applied tags and missing related tags. In the data sets that we captured for this work, only one-third of the documents have tags, which we call Applied Tags, or $T^{\textit{applied},d_{i}}$. Of that one-third, about 50% of the documents have, on average, 5 applied tags. To overcome this sparsity in applied tags, we assign additional appropriate tags to document $d_{i}$, which we term Missing Related Tags, or $T^{\textit{miss},d_{i}}$. This process is discussed in detail in Section 4.3.2. In total, we have the set $T_{d_{i}}=T^{\textit{applied},d_{i}}\cup T^{\textit{miss},d_{i}}$, where $T^{\textit{applied},d_{i}}$ represents the set of applied tags, and $T^{\textit{miss},d_{i}}$ represents the set of missing related tags.

Definition 2. Context. A context represents a specific view of a document $d_{i}$. We are interested in 6 different context types: organization, person, location, tags (applied tags together with missing related tags), title, and the first line of the article. We chose these specific contexts because they provide different aspects of a news story. $C_{d_{i}}=\{C^{\textit{org},d_{i}},C^{\textit{per},d_{i}},C^{\textit{loc},d_{i}},C^{\textit{tags},d_{i}},C^{\textit{title},d_{i}},C^{\textit{fline},d_{i}}\}$, where

1. $C^{\textit{org},d_{i}}$ represents the organization context,
2. $C^{\textit{per},d_{i}}$ represents the person context,
3. $C^{\textit{loc},d_{i}}$ represents the location context,
4. $C^{\textit{tags},d_{i}}=T_{d_{i}}$ represents the union of applied tags and missing related tags,
5. $C^{\textit{title},d_{i}}$ represents the headline of the article, and
6. $C^{\textit{fline},d_{i}}$ represents the first line of the article.

The contexts are extracted from the body of each article using a natural language processing technique. We employ the Stanford NLP library (www.nlp.stanford.edu/) for this purpose. $C^{t,d_{i}}$ indicates an element from $C_{d_{i}}$. In other words, $C^{t,d_{i}}$ indicates any one of the 6 context types from $C_{d_{i}}$. Additionally, we want to represent words of a context type, where $w_{c,t}$ indicates a word (token) from a specific context type $C^{t,d_{i}}$ of a document. For example, $w_{c,\textit{org}}$ indicates a word from the organization context type $C^{\textit{org},d_{i}}$ of a document.

Definition 3. A document. A document is represented as $d_{i}=\{A_{d},d_{\textit{type}},dt,\textit{from},\textit{title},\textit{body},T_{d_{i}},C_{d_{i}}\}$. $A_{d}=\{a_{1},a_{2},..,a_{k}\}$ represents the list of authors who wrote $d_{i}$, $\textit{from}$ represents the source (e.g., Washington Post, New York Times, etc.), $\textit{title}$ represents the title, $\textit{body}$ represents the body, $dt$ represents the date time, and $d_{\textit{type}}$ is a document indicator having a value of either 0 or 1. If $d_{\textit{type}}=0$, then $d_{i}$ is from $G_{\textit{tns}}$. If $d_{\textit{type}}=1$, then $d_{i}$ is from $G_{\textit{pns}}$.

4.2.2 Problem statement

We formally define our problem statement as follows: given a query topic $q$ where $q\in Q$ , we rank and return the top N experts from each of the networks of TNS and PNS.

4.3 Data statistics and data preparation

4.3.1 Data statistics

In this section, we discuss the data, data statistics, and ground truth used in this paper. We collect the news articles using an RSS feed collection. We then select the articles that contain at least one of the six queries of interest: Syria, climate change, election, India, crime, and Boko Haram. The TNS network is captured from a total of 118 RSS news feeds, while the PNS network is captured from a total of 54 RSS news feeds. The number of unique authors from TNS and PNS is 1,826 and 997, respectively. The number of documents mentioning the six queries from TNS and PNS is 6,198 and 2,918, respectively. Hence, the total number of documents is 9,116. Table 10 shows a sample of the various TNS and PNS data sources used in this work.

Table 10
Sample of news feed sources TNS and PNS networks

Traditional network (TNS)	Policy network (PNS)
Abc.net.au, Azernews.az, Bdonline.co.uk, Andhrawishesh.com, Arabtimesonline.com, Armenianow.com, Asia.nikkei.com	Aspeninstitute.org, Eastwestcenter.org, Climatechange.ifpri.info, Southasia.ifpri.info, Brussels.gmfus.org, Globalfutures.cgiar.org, Capri.cgiar.org

Ground Truth. To evaluate the effectiveness of our proposed approach, the same two human annotators mentioned in Section 3.6 compiled the ground truth for our six topics, and ranked all the authors. For each article, the Stanford NLP Library is used to extract people, locations, and organizations. Statistics, such as the number of articles by an author, the average number of locations per document, the average number of persons mentioned per document, the average number of numbers per document, and the average number of organizations per document, are then calculated and given to the annotators. The annotators then read all the supplied articles to familiarize themselves with the content and context.

Our idea for labeling ground truth is based upon the work of Zhang et al. [4, 29]. Similar to [29], we use pooled relevance judgments combined with human judgments. In [29], a two-step method is used to rank (ground truth) the authors in a bibliographic network. First, the top N ranked authors from three existing systems (Rexa, Libra, and ArnetMiner) are combined into a single list. Second, two human judges assess the ranking of each author based on criteria such as the number of publications, the number of top conference papers, and the distinguished awards received by each author. In our work, however, labeling is a three-step process. Because this is the first known work (at the time of this writing) in the domain of ranking journalists, we do not have other systems from which to pool results, as was done by Zhang et al. [29]. Therefore, we must create a pooled relevance judgment with initial scores. First, for each author, the annotators calculate an initial score by combining the number of articles per topic and the average number of persons per document, resulting in an initial ranking list of all authors. The initial score is then normalized by dividing by the total of all initial scores. Second, the annotators compare each author in the initial list with their nearest positioned authors for an initial score threshold of less than 5, which was chosen intuitively based upon their experience. The result was 26 unique authors (13 author pairs). Third, the annotators use objective measures, such as the average number of locations mentioned per document, the average number of numbers per document, and the average number of organizations per document, for a potential re-ranking. Out of the 26 authors, 8 do not require any change in their ranking position. The remaining 18 authors (9 author pairs, or approximately 0.63% of the 2,823 unique authors in total) are re-ranked by swapping their ranking positions.
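The first labeling step can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how the two statistics are "combined", so the sketch assumes a plain sum, and the author names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def initial_ranking(stats: Dict[str, Tuple[int, float]]) -> List[Tuple[str, float]]:
    """Rank authors by a normalized initial score.

    stats maps an author name to (articles_on_topic, avg_persons_per_document).
    """
    # Assumption: "combining" is taken to mean a plain sum of the two statistics.
    raw = {author: n_articles + avg_persons
           for author, (n_articles, avg_persons) in stats.items()}
    # Normalize by the total of all initial scores, as described in the text.
    total = sum(raw.values())
    normalized = {author: (score / total if total else 0.0)
                  for author, score in raw.items()}
    # Highest normalized score first.
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)


# Hypothetical authors and statistics, for illustration only.
example = {"author_a": (12, 3.0), "author_b": (4, 1.0), "author_c": (7, 2.0)}
ranking = initial_ranking(example)
```

The resulting list would then be handed to the annotators for the pairwise comparison and re-ranking steps, which remain manual.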

4.3.2 Data preparation

As mentioned earlier, sparsity in the applied tags of news articles is a challenge. The following discusses how we attempt to overcome this issue.

Missing related tags. We first give an example to motivate missing related tags. Suppose there are two authors, Sujata Rao and Ian Bremmer. Both work on a common topic, China, and there are 2 additional topics related to China: economy and oil. Both authors write about the economy; however, only Sujata Rao has a document tagged with both related topics, economy and oil. Ian Bremmer has a document whose textual content mentions both economy and oil; however, the document is only tagged with economy and China (i.e., oil is missing). To overcome this sparsity in applied tags, we try to find the appropriate tags and assign them to $T^{\textit{miss},d_{i}}$. For each document $d_{i}$, we remove stop words, and then iterate over each token (word) sequentially from left to right and top to bottom. In other words, we treat each document as a bag of words without the stop words. The first occurring $k$ tokens of a document $d_{i}$ that are not found in the applied tags, but are available in the global tags $T_{g}$ (refer to Table 9), are assigned to $T^{\textit{miss},d_{i}}$. For documents with applied tags available, the average number of applied tags per document is 8 in our collection, so we arbitrarily chose a value of $k=10$ as a good starting point.
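The tag-completion procedure above can be sketched in a few lines. This is a minimal illustration, not our implementation: the stop-word list is a tiny hypothetical stand-in, tags are assumed to be single lowercase tokens, and skipping repeated tokens is an added assumption.

```python
from typing import Iterable, List, Set

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS: Set[str] = {"the", "a", "an", "of", "and", "in", "on", "is", "to", "as"}


def missing_related_tags(tokens: Iterable[str],
                         applied_tags: Iterable[str],
                         global_tags: Iterable[str],
                         k: int = 10) -> List[str]:
    """Return the first k tokens that are global tags but not applied tags."""
    applied = {tag.lower() for tag in applied_tags}
    known = {tag.lower() for tag in global_tags}
    missing: List[str] = []
    for token in tokens:  # left to right, top to bottom
        word = token.lower()
        if word in STOP_WORDS:
            continue
        # Assumption: a token is recorded once even if it occurs repeatedly.
        if word in known and word not in applied and word not in missing:
            missing.append(word)
            if len(missing) == k:
                break
    return missing


# Mirrors the example in the text: "oil" is mentioned but not applied as a tag.
body = "China economy slows as oil prices fall".split()
tags = missing_related_tags(body, ["economy", "China"],
                            ["china", "economy", "oil", "election"])
```

In this example the document is tagged with economy and China only, so the sketch recovers oil as a missing related tag.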

4.4 Context-based latent dirichlet allocation (CBLDA)

In this section, we present our proposed CBLDA probabilistic graphical approach, as shown in Fig. 2.

First, we will present the probabilistic generating algorithm for CBLDA. Then, we will discuss the objective function for generating the probability. And finally, we will discuss the implemented inference algorithm.

Let $W_{C^{t}}$ represent the dictionary of words for a specific context type $C^{t,d_{i}}$ . While performing parameter estimation for $\alpha,\beta,\gamma$ , CBLDA only needs to have the dictionary by topic $(V*Z)$ count matrix, document by topic $(D*Z)$ , and the dictionary of context by topic $(W_{C^{t}}*Z)$ count matrix per context. Our proposed CBLDA is shown in Algorithm 4.4.

Algorithm 4.4 CBLDA
Input: $D$ documents of our data set; $\alpha=\frac{50}{Z}$, $\beta=0.01$, $\gamma=0.1$.
1. For each topic $z$, draw $\phi_{z}$ and $\mu_{z}$.
2. For each of the word positions $w_{k}$ from document $d_{i}$:
(a) draw a topic $z_{d_{i}}$ from Multinomial($\theta_{d_{i}}$);
(b) draw a word $w_{i}$ from Multinomial($\phi_{z_{d_{i}}}$);
(c) draw a word $w_{c,t}$ from Multinomial($\mu_{z_{d_{i}}}$) for a uniformly chosen specific context type $C^{t,d_{i}}$ (i.e., $W_{C^{t}}$).
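To make the generative story concrete, the following is a hedged sketch of the sampling steps using only the Python standard library. It assumes that $\phi_{z}$, $\mu_{z}$, and $\theta_{d_{i}}$ are drawn from symmetric Dirichlet distributions with parameters $\beta$, $\gamma$, and $\alpha$ (in line with Table 9), and that $\mu$ is kept per context type; the function names and toy vocabularies are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def sample_dirichlet(concentration: float, size: int, rng: random.Random) -> List[float]:
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    # The floor guards against underflow to 0.0 for very small concentrations.
    draws = [max(rng.gammavariate(concentration, 1.0), 1e-300) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def draw_topic_params(vocab, context_vocabs, num_topics, beta, gamma, rng):
    """Step 1: draw phi_z (words) and mu_z (context words) for each topic."""
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    # Assumption: one mu per context type, each a distribution over W_{C^t}.
    mu = {t: [sample_dirichlet(gamma, len(words), rng) for _ in range(num_topics)]
          for t, words in context_vocabs.items()}
    return phi, mu


def generate_document(num_positions: int, vocab: List[str],
                      context_vocabs: Dict[str, List[str]],
                      phi, mu, alpha: float,
                      rng: random.Random) -> List[Tuple[int, str, str, str]]:
    """Step 2: for each position draw a topic, a word, and a context word."""
    num_topics = len(phi)
    theta = sample_dirichlet(alpha, num_topics, rng)
    context_types = sorted(context_vocabs)
    document = []
    for _ in range(num_positions):
        z = rng.choices(range(num_topics), weights=theta)[0]
        word = rng.choices(vocab, weights=phi[z])[0]
        t = rng.choice(context_types)  # context type chosen uniformly
        context_word = rng.choices(context_vocabs[t], weights=mu[t][z])[0]
        document.append((z, word, t, context_word))
    return document


rng = random.Random(0)
vocab = ["syria", "election", "crime"]
context_vocabs = {"loc": ["damascus", "delhi"], "org": ["who", "un"]}
phi, mu = draw_topic_params(vocab, context_vocabs, num_topics=2,
                            beta=0.01, gamma=0.1, rng=rng)
doc = generate_document(5, vocab, context_vocabs, phi, mu, alpha=50 / 2, rng=rng)
```

Each generated position carries the sampled topic, the word, the chosen context type, and the context word, which is the information the count matrices described above are built from.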

Second, we estimate the posterior probabilities of $z$ and $w$, as defined in Eq. (6), similar to the work of [25, 26]. However, our estimate differs in that (1) our approach includes contexts, i.e., $C^{t,d_{i}}$, where, as shown in Algorithm 4.4, we choose $C^{t,d_{i}}$ uniformly at random, and (2) our approach is document-centric, as opposed to the author-centric approaches in the references. In other words, we use $\theta_{d_{i},z}$ from Eq. (7) to infer the ranking of an author for a topic. Specifically, we calculate the average of the topic-by-document probability over all documents written by the author. Eq. (6) consists of three parts (where all counts exclude the current instance):

1. $C_{d_{i},z}^{-d_{i}}$, which represents the number of times document $d_{i}$ is associated with topic $z$, where the superscript $-d_{i}$ indicates excluding the current instance,
2. $C_{w_{i},z}^{-d_{i}}$, which represents the number of times word $w_{i}$ is assigned to topic $z$, excluding the current instance, and
3. $C_{w_{c,t},z}^{-d_{i}}$, which represents the number of times word $w_{c,t}$ from context type $C^{t,d_{i}}$ (i.e., $W_{C^{t}}$) is assigned to topic $z$, excluding the current instance.

Third, we employ the Gibbs sampling approach [6] for inference, as shown in the Algorithm 4.4. The algorithm converges in about 320 iterations.

$\displaystyle P(z_{d_{i}},w_{i},w_{c,t}\mid z_{-d_{i}},w_{-i},w_{-c,t},\alpha,\beta,\gamma)\propto(C_{d_{i},z_{d_{i}}}^{-d_{i}}+\alpha)\,\frac{C_{w_{d_{i}},z_{d_{i}}}^{-d_{i}}+\beta}{\sum_{w_{i}}(C_{w_{i},z_{d_{i}}}^{-d_{i}}+\beta)}\,\frac{C_{w_{c,t_{d_{i}}},z_{d_{i}}}^{-d_{i}}+\gamma}{\sum_{w_{c,t}}(C_{w_{c,t},z_{d_{i}}}^{-d_{i}}+\gamma)}$ (6)
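As an illustration of how Eq. (6) is used inside one Gibbs update, the sketch below computes the unnormalized weight of each topic for a single position and samples a new topic from those weights. The data layout (lists and dictionaries of counts) and the function names are assumptions made for this example; all counts are assumed to already exclude the current instance.

```python
import random
from typing import Dict, List


def topic_weights(doc_topic: List[int],
                  word_topic: Dict[str, List[int]],
                  ctx_topic: Dict[str, List[int]],
                  word: str, ctx_word: str,
                  alpha: float, beta: float, gamma: float) -> List[float]:
    """Unnormalized Eq. (6) weights, one per topic.

    doc_topic[z]       : count of topic z in the current document
    word_topic[w][z]   : count of word w assigned to topic z
    ctx_topic[w_ct][z] : count of context word w_ct (chosen context type) in topic z
    """
    vocab_size = len(word_topic)
    ctx_size = len(ctx_topic)
    weights = []
    for z in range(len(doc_topic)):
        # The sum over words of (count + beta) equals total + V * beta.
        word_total = sum(counts[z] for counts in word_topic.values())
        ctx_total = sum(counts[z] for counts in ctx_topic.values())
        weights.append(
            (doc_topic[z] + alpha)
            * (word_topic[word][z] + beta) / (word_total + vocab_size * beta)
            * (ctx_topic[ctx_word][z] + gamma) / (ctx_total + ctx_size * gamma)
        )
    return weights


def sample_topic(weights: List[float], rng: random.Random) -> int:
    """Draw a topic index proportionally to the weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]


# Toy counts: topic 0 dominates for this document, word, and context word.
w = topic_weights(doc_topic=[2, 0],
                  word_topic={"syria": [3, 0], "vote": [0, 3]},
                  ctx_topic={"damascus": [2, 0], "delhi": [0, 2]},
                  word="syria", ctx_word="damascus",
                  alpha=1.0, beta=0.01, gamma=0.1)
new_topic = sample_topic(w, random.Random(0))
```

A full sampler would decrement the counts for the current position, call these two functions, and then increment the counts for the newly sampled topic.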

After the Gibbs sampling, the probability of a topic given a document is defined by Eq. (7) using a chain rule similar to LDA [8]. The probability of a word given a topic is defined by Eq. (8). The probability of a word in a specific context type $C^{t,d_{i}}$ given a topic is defined by Eq. (9). In short, CBLDA lets us control which of the multiple views of a document are used, through $C^{t,d_{i}}$.

$\displaystyle\theta_{d_{i},z}=\frac{C_{d_{i},z}+\alpha}{\sum_{z^{\prime}}(C_{d_{i},z^{\prime}}+\alpha)}$ (7)

$\displaystyle\phi_{w_{i},z}=\frac{C_{w_{i},z}+\beta}{\sum_{w_{i}^{\prime}}(C_{w_{i}^{\prime},z}+\beta)}$ (8)

$\displaystyle\mu_{w_{c,t},z}=\frac{C_{w_{c,t},z}+\gamma}{\sum_{w_{c,t}^{\prime}}(C_{w_{c,t}^{\prime},z}+\gamma)}$ (9)

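The document-centric ranking step can be sketched as follows: Eq. (7) is evaluated per document, and an author's score for a topic is the average of $\theta_{d_{i},z}$ over the author's documents. How a query $q$ is mapped to a topic index is not covered here, so the sketch simply takes the topic index as input; the document identifiers and author names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def doc_topic_probabilities(topic_counts: List[int], alpha: float) -> List[float]:
    """Eq. (7): smoothed topic distribution for one document."""
    denominator = sum(topic_counts) + len(topic_counts) * alpha
    return [(count + alpha) / denominator for count in topic_counts]


def rank_authors(doc_topic_counts: Dict[str, List[int]],
                 doc_authors: Dict[str, List[str]],
                 topic: int, alpha: float, top_n: int) -> List[Tuple[str, float]]:
    """Average theta[d][topic] over each author's documents, then sort."""
    scores = defaultdict(list)
    for doc_id, counts in doc_topic_counts.items():
        theta = doc_topic_probabilities(counts, alpha)
        for author in doc_authors[doc_id]:
            scores[author].append(theta[topic])
    averaged = {author: sum(vals) / len(vals) for author, vals in scores.items()}
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical counts for three documents and two topics.
counts = {"d1": [8, 2], "d2": [1, 9], "d3": [5, 5]}
authors = {"d1": ["ann"], "d2": ["bob"], "d3": ["ann", "bob"]}
top = rank_authors(counts, authors, topic=0, alpha=1.0, top_n=2)
```

Here ann's documents give topic 0 probabilities of 0.75 and 0.5, so her average of 0.625 places her ahead of bob.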
4.5 Comparison methods

In this section, we discuss various existing approaches with which we will compare our approach.

PageRank. PageRank [20] is a widely used link-analysis ranking algorithm, introduced by Page et al. and used by the Google search engine. To evaluate our approach, we use the PageRank implementation from NetworkX [23]. We create a graph in which each node represents an author, and the edge value between two authors is calculated using the bag of words from the documents published by both authors. Specifically, we calculate the term frequency-inverse document frequency (tf-idf) based cosine similarity and assign it as the edge value. The basic idea is that a user first starts by reading an article on a webpage. If the user is interested in the topic, the user will navigate to articles on similar topics, likely from other authors appearing on the same webpage. The reader might also use a search engine to look for similar articles or authors. The random surfer of PageRank resembles this surfing behavior of the reader. Thus, due to its popularity and its similarity to how one might choose an article as being authoritative, we choose PageRank as a comparison method.
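The construction of the author graph can be illustrated with a small, dependency-free sketch. It is not the NetworkX implementation used in our experiments: the power iteration below is a plain stand-in, the smoothed idf variant is an assumption, and the authors and tokens are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(author_tokens: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """One tf-idf vector per author, built from that author's bag of words."""
    n = len(author_tokens)
    df = Counter()
    for tokens in author_tokens.values():
        df.update(set(tokens))
    # Assumption: smoothed idf, log((1 + n) / (1 + df)) + 1.
    return {author: {w: c * (math.log((1 + n) / (1 + df[w])) + 1.0)
                     for w, c in Counter(tokens).items()}
            for author, tokens in author_tokens.items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def author_graph(author_tokens: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Undirected weighted graph: edge value = tf-idf cosine similarity."""
    vectors = tfidf_vectors(author_tokens)
    names = list(vectors)
    graph = {a: {} for a in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim > 0:
                graph[a][b] = graph[b][a] = sim
    return graph


def weighted_pagerank(graph, damping: float = 0.85, iterations: int = 100):
    """Power iteration; nodes without edges spread their rank uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for u, w in graph[v].items():
                new_rank[u] += damping * rank[v] * w / out_weight[v]
        rank = new_rank
    return rank


tokens = {"ann": ["syria", "war", "damascus"],
          "bob": ["syria", "war", "aleppo"],
          "cal": ["cricket", "match", "india"]}
ranks = weighted_pagerank(author_graph(tokens))
```

The defaults mirror the damping factor of 0.85 and the 100 iterations used in our experimental setting; in this toy graph the two authors who share vocabulary end up ranked above the isolated one.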

Probabilistic latent semantic indexing (PLSI). In PLSI, the probability of generating a word comes from the topic layer [9]. In PLSI, the topic mixture is conditioned on each document. We choose the Expectation-Maximization method, which is a general algorithm for calculating the maximum-likelihood for latent variables (i.e., incomplete data) [18]. PLSI converges at around 120 iterations.

Latent Dirichlet allocation (LDA). LDA also uses the topic layer in order to calculate the probability of generating a word. LDA is similar to PLSI, except for the topic mixture. Specifically, in LDA, the topic mixture is sampled from the conjugate Dirichlet prior, which is the same for all documents. We use a simple LDA implementation from Mallet [17], which converges after at least 300 iterations.

Pachinko allocation model (PAM). The Pachinko Allocation Model (PAM) was proposed to improve over topic models such as LDA [12]. PAM models the sparse correlation between the topics in addition to the word-correlation. We use the PAM implementation from Mallet [17], which converges after at least 240 iterations.

4.6 Experimental setting

In this section, we discuss the settings in our experiments for our proposed approach, and the other four methods used in the evaluation. The following settings are based upon the typical settings used in other reported research work.

PageRank. We set the damping factor to 0.85 for 100 iterations.

PLSI. The number of topics $Z$ is set to 6.

LDA. We set $\alpha=50/Z$ , where $Z$ is the total number of topics, and is set to 6. We also set $\beta=$ 0.01.

PAM. The number of topics $Z$ is set to 6.

CBLDA. We find that our estimated topic model is not sensitive to hyper-parameters. Hence, we choose fixed values for $\alpha$ , $\beta$ , $\gamma$ . We set $\alpha=50/Z$ , where Z is the total number of topics, $\beta=$ 0.01, and $\gamma=$ 0.1, similar to [21, 26]. CBLDA converges at around 320 iterations.

Table 11
Results showing Precision@N (P@N), MAP@N, and DCG@N with N ranging [10–50] for network $G_{\textit{tns}}$ and $G_{\textit{pns}}$

Method	P@10	P@20	P@30	P@40	P@50
Traditional network $G_{\textit{tns}}$
PageRank	0.0000	0.0416	0.1111	0.1208	0.1333
PLSI	0.1999	0.2833	0.2944	0.2958	0.2966
LDA	0.1830	0.2660	0.3160	0.3290	0.3290
PAM	0.3233	0.3083	0.3444	0.3291	0.3233
CBLDA	0.2500	0.3166	0.3500	0.3458	0.3333
Policy network $G_{\textit{pns}}$
PageRank	0.0666	0.0666	0.0888	0.0833	0.0800
PLSI	0.4333	0.4500	0.4499	0.4458	0.4700
LDA	0.5490	0.5250	0.4770	0.4290	0.4730
PAM	0.5001	0.4394	0.5185	0.5118	0.5001
CBLDA	0.5666	0.4916	0.5000	0.4666	0.4800
Method	MAP@10	MAP@20	MAP@30	MAP@40	MAP@50
Traditional network $G_{\textit{tns}}$
PageRank	0.0000	0.0745	0.1599	0.1666	0.2076
PLSI	0.6851	0.4795	0.5059	0.4701	0.4716
LDA	0.5350	0.4177	0.4190	0.4240	0.4480
PAM	0.4666	0.5250	0.5177	0.4541	0.4666
CBLDA	0.6280	0.5053	0.5272	0.5102	0.5037
Policy network $G_{\textit{pns}}$
PageRank	0.2793	0.3821	0.4632	0.5176	0.4640
PLSI	0.7085	0.7260	0.6791	0.6433	0.6739
LDA	0.7490	0.7410	0.7070	0.6780	0.6840
PAM	0.6962	0.7112	0.6964	0.6854	0.6962
CBLDA	0.7854	0.8079	0.7561	0.7283	0.7189
Method	DCG@10	DCG@20	DCG@30	DCG@40	DCG@50
Traditional network $G_{\textit{tns}}$
PageRank	0.6807	0.5506	0.5555	0.5492	0.5487
PLSI	0.7797	0.7092	0.6994	0.6994	0.6982
LDA	0.6220	0.6104	0.6089	0.6160	0.6199
PAM	0.6855	0.6867	0.6829	0.6834	0.6855
CBLDA	0.7731	0.7348	0.7319	0.7309	0.7284
Policy network $G_{\textit{pns}}$
PageRank	0.6073	0.5764	0.5694	0.5677	0.5677
PLSI	0.8817	0.8630	0.8477	0.8367	0.8321
LDA	0.9105	0.8762	0.8700	0.8635	0.8552
PAM	0.8447	0.8606	0.8532	0.8504	0.8447
CBLDA	0.9222	0.9048	0.8926	0.8832	0.8755

Table 12
Results showing Average Difference in Precision, MAP, and DCG for each of the baselines with our approach CBLDA for both Traditional Network (TNS) and Policy Network (PNS)

Pairs for comparison	TNS (%)	PNS (%)
Average Difference in Precision
PageRank vs CBLDA	23.7786	42.3900
PLSI vs CBLDA	4.5146	5.1160
LDA vs CBLDA	3.4546	1.0360
PAM vs CBLDA	$-$0.6534	0.6980
Average Difference in MAP
PageRank vs CBLDA	37.9240	33.8080
PLSI vs CBLDA	$-$2.1480	7.3160
LDA vs CBLDA	5.2220	4.7520
PAM vs CBLDA	1.2954	6.2226
Average Difference in DCG
PageRank vs CBLDA	16.2880	31.7960
PLSI vs CBLDA	2.2628	4.3420
LDA vs CBLDA	12.438	1.8920
PAM vs CBLDA	5.5020	4.4926
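The formula behind the average differences is not spelled out in the table. One reading that appears consistent with the P@N rows of Table 11 is the mean, over the five cut-offs, of the CBLDA score minus the baseline score, expressed in percentage points. The sketch below applies that assumed reading to the PageRank and CBLDA precision rows for the traditional network; the function name is hypothetical.

```python
from typing import Sequence


def average_difference(ours: Sequence[float], baseline: Sequence[float]) -> float:
    """Mean of (ours - baseline) over all cut-offs, in percentage points."""
    if len(ours) != len(baseline) or not ours:
        raise ValueError("score lists must be non-empty and of equal length")
    return 100.0 * sum(o - b for o, b in zip(ours, baseline)) / len(ours)


# P@10 ... P@50 on the traditional network, copied from Table 11.
cblda_tns = [0.2500, 0.3166, 0.3500, 0.3458, 0.3333]
pagerank_tns = [0.0000, 0.0416, 0.1111, 0.1208, 0.1333]

diff = average_difference(cblda_tns, pagerank_tns)
```

Under this reading the value comes out at about 23.78, in line with the PageRank vs CBLDA precision entry for TNS above.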

4.7 Evaluation

In this section, we discuss the ground truth, the different datasets used, and the evaluation metrics used in our experiments.

Precision@N: Precision provides the fraction of retrieved documents which are relevant to the query. It can be evaluated at a given cut-off value $N$ . The precision for each query (topic) is calculated, and then averaged over all the queries for precision@N.
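A minimal sketch of Precision@N for author ranking is given below. It assumes that the ground truth for a query is available as a set of relevant authors, and it divides by $N$ even when fewer than $N$ authors are returned; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_n(ranked: Sequence[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n ranked authors that are relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for author in ranked[:n] if author in relevant)
    return hits / n


def mean_precision_at_n(runs: List[Tuple[Sequence[str], Set[str]]], n: int) -> float:
    """Average Precision@N over all queries (topics)."""
    return sum(precision_at_n(ranked, relevant, n) for ranked, relevant in runs) / len(runs)


# Two hypothetical queries with their ranked authors and relevant sets.
runs = [(["a", "b", "c", "d"], {"a", "c"}),
        (["x", "y", "z", "w"], {"y"})]
p_at_2 = mean_precision_at_n(runs, 2)
```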

Mean Average Precision@N (MAP@N): For a set of queries, MAP is defined as the mean of the average precision scores for each query [26]. MAP@N is the MAP evaluated at a given cut-off N.
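The MAP@N computation can be sketched as follows. Several normalization conventions exist for average precision at a cut-off; this sketch assumes division by $\min(|\textit{relevant}|, N)$, which may differ from the convention used in [26]. Names and data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def average_precision_at_n(ranked: Sequence[str], relevant: Set[str], n: int) -> float:
    """Average of the precision values at each relevant position in the top n."""
    hits = 0
    precision_sum = 0.0
    for position, author in enumerate(ranked[:n], start=1):
        if author in relevant:
            hits += 1
            precision_sum += hits / position
    # Assumption: normalize by the number of relevant authors reachable within n.
    denominator = min(len(relevant), n)
    return precision_sum / denominator if denominator else 0.0


def map_at_n(runs: List[Tuple[Sequence[str], Set[str]]], n: int) -> float:
    """Mean of the per-query average precision scores."""
    return sum(average_precision_at_n(r, rel, n) for r, rel in runs) / len(runs)


runs = [(["a", "x", "b"], {"a", "b"}),   # hits at positions 1 and 3
        (["x", "y"], {"a"})]             # no hits
score = map_at_n(runs, 3)
```

For the first query the precision values at the two hits are 1 and 2/3, giving an average precision of 5/6; the second query contributes 0.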

Discounted Cumulative Gain@N (DCG@N): DCG has been introduced by Järvelin and Kekäläinen [10]. DCG works on the principle that highly relevant documents which appear lower in a search result list should be penalized more. This is because the search result was less accurate, and thus the graded relevance value is logarithmically reduced proportionally with respect to the position of the result.

We want to calculate DCG for the top $N$. Hence, we choose the top $N$ predicted authors, where $\textit{rel}_{i}$ denotes the ground truth rank of the author at predicted position $i$ (e.g., $\textit{rel}_{1}$ is the ground truth rank of the first predicted author).

$\displaystyle\textit{DCG}_{n}=\textit{rel}_{1}+\sum_{i=2}^{N}{\frac{\textit{rel}_{i}}{\log_{2}(i)}}$ (10)

In order to get a value between 0 and 1 for easy comparison, we need to normalize $\textit{DCG}_{n}$ . First, we calculate the Ideal DCG (IDCG) by sorting the author rankings in descending order (i.e., higher relevance first). Then, the normalized DCG is calculated as follows:

$\displaystyle\textit{DCG}=\frac{\textit{DCG}_{n}}{\textit{IDCG}}$ (11)
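Eqs. (10) and (11) can be sketched directly. The example assumes that $\textit{rel}_{i}$ is encoded as a graded relevance value where larger means more relevant, and that the ideal ordering is obtained by sorting all available relevance values before truncating to $N$; both are assumptions of this illustration, and the relevance values are hypothetical.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Eq. (10): rel_1 plus rel_i / log2(i) for positions i >= 2."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))


def normalized_dcg(relevances: Sequence[float], n: int) -> float:
    """Eq. (11): DCG of the top n divided by the ideal DCG."""
    actual = dcg(relevances[:n])
    # Assumption: the ideal list sorts all relevance values, highest first.
    ideal = dcg(sorted(relevances, reverse=True)[:n])
    return actual / ideal if ideal else 0.0


# Relevance values in predicted order for one hypothetical query.
predicted = [3, 2, 3, 0, 1]
score = normalized_dcg(predicted, 5)
```

A ranking that already lists authors in descending relevance obtains a normalized value of 1, and any other ordering obtains a value below 1.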

The precision@N, MAP@N, and DCG@N results are shown in Table 11.

4.8 Results

Table 11 shows that, at P@10, our CBLDA approach achieves the second best precision (0.25) after PAM on the traditional network, and the best precision (0.5666) on the policy network. Averaged over all values of P@N, the traditional network precision of CBLDA is better than that of PLSI ($\sim$5%), LDA ($\sim$3%), and PageRank ($\sim$24%). Similarly, for the policy network, the precision of CBLDA is, on average, better than that of PLSI ($\sim$5%), LDA ($\sim$1%), PAM ($\sim$0.7%), and PageRank ($\sim$42%).

Table 11 also shows the mean average precision (MAP) results. Averaged over all values of MAP@N, the traditional network MAP of CBLDA is better than that of LDA ($\sim$9%), PLSI ($\sim$1%), and PageRank ($\sim$41%). Similarly, for the policy network, the MAP of CBLDA is $\sim$5% better than LDA, $\sim$7% better than PLSI, and much better than PageRank ($\sim$34%).

Similarly, the table shows that the DCG of CBLDA is comparable with that of PLSI for the traditional network. However, for the policy network, the DCG of CBLDA is $\sim$4% better than PLSI. Averaged over all values of DCG@N, the traditional network DCG of CBLDA is better than LDA ($\sim$12%) and better than PageRank ($\sim$16%). Similarly, for the policy network, the DCG of CBLDA is only slightly better on average than LDA ($\sim$2%), but much better than PageRank ($\sim$32%). The average differences in each metric between each of the baseline algorithms and CBLDA are shown in Table 12; overall, CBLDA provides positive average differences in most comparisons.

In summary, PageRank gives the worst performance in terms of precision, MAP, and DCG. On average, our approach outperforms the baseline methods for Precision@N (by anywhere from 1% to 42%), for MAP@N (1%–41%), and for DCG@N (2%–31%). PLSI and LDA give better precision or MAP in either the traditional or the policy network, while CBLDA performs better in both networks using these metrics. PAM appears equivalent in precision to CBLDA; however, the MAP and DCG of CBLDA are much better than those of PAM. In other words, CBLDA is still preferable, since it places highly ranked authors at the top of the recommendation list.

5. Conclusion

In summary, we showed that through contextual graphs, enriched with heterogeneous objects extracted from news articles using NLP techniques, we can improve the two mining tasks of classification and ranking. The result is that we are able to personalize news recommendation to the needs of the different readership communities.

In our study and proposed approach for classifying news articles (Section 3), we presented a novel graph-cut algorithm that outperforms baseline methods in terms of precision, recall, and F1, while still being comparable in accuracy. We collected data from three different news sources, extracted common verbs, organizations, and numbers, and built a weighted graph using the sentiment of each sentence in the news articles. We also studied the penalty parameter $\alpha$ and reported its impact on our evaluation metrics. In the future, we will investigate using an external ontology for policy making to help improve the leveraging of the context (structural information). For example, in Wikipedia (https://www.wikipedia.org/), the “Union Council of Ministers of India” provides different department names within the government. This might enable us to better capture entities, such as organization names. In addition, we will examine augmenting the graph with certain important features such as the designation of people names. For example: “Forensic science officials will also be called upon to examine the building quality”. One challenge here is to effectively extract the designation (i.e., “forensic science officials”). Due to an overwhelming number of common nouns, these designations (nouns) become underrepresented, and thus are not used effectively. Also, we are currently working on extending our temporal graph to graph streaming approaches, thereby exploring related streaming techniques.

In our study and approach for ranking authorship (Section 4), we presented novel co-ranking algorithms applied to two different networks, i.e., traditional and policy networks. Our work is different from other ranking algorithms in that we rank authors from two different types of networks, extracting 6 different contexts from news articles. We proposed our method CBLDA, which is an extension of the Latent Dirichlet Allocation model. We demonstrated that our proposed approach performs better overall in terms of precision, mean average precision and discounted cumulative gain. In the future, we will investigate the extraction of additional context from the documents, such as the designation of individuals (e.g., Prime Minister, President, etc.). Another idea is to implement a multiple Gibbs sampling approach [26]. In our case, we would like to implement separate Gibbs sampling for each network sharing a common Multinomial-Dirichlet parameter, so as to better incorporate local and global network information.

In this work, we consider that our proposed graph-cut and CBLDA approaches could be integrated together – the goal of our next step. First, we will extract features such as verbs and organization from the news articles, and use our proposed Graph-Cut approach for change detection to capture contextually important news articles in the stream. Second, we will classify the news articles as either important or not, using the features extracted from these news articles as context for our CBLDA algorithm. In addition, we will apply both algorithms to a sequence of articles from real-time news streams.

Acknowledgments

We sincerely thank Jayshree Borah with the China Studies Centre, and the Indian Institute of Technology Madras in helping with the labeling of the articles as well as providing useful feedback. This material is based upon work supported by the National Science Foundation under Grant No. 1318957.

References

1. http://ndtv.com/allahabad-news/allahabad-stampede-not-due-to-railing-collapse-railway-minister-pawan-kumar-bansal-512954.
2. Balog, Azzopardi and de Rijke, A language modeling framework for expert finding, Information Processing & Management 45(1) (2009), 1–19.
3. Blank, Resource description and selection for similarity search in metric spaces, volume 19. University of Bamberg Press, 2015.
4. Buckley and Voorhees E.M., Retrieval evaluation with incomplete information, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2004, pp. 25–32.
5. Gaughan and Smeaton A.F., Finding new news: Novelty detection in broadcast news, in: Asia Information Retrieval Symposium, Springer, 2005, pp. 583–588.
6. Geman and Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence (6) (1984), 721–741.
7. Goodman L.A., Thompson K.M., Weinfurt, Corl, Acker, Mueser K.T. and Rosenberg S.D., Reliability of reports of violent victimization and posttraumatic stress disorder among men and women with serious mental illness, Journal of Traumatic Stress 12(4) (1999), 587–599.
8. Griffiths, Gibbs sampling in the generative model of latent Dirichlet allocation, Technical Report, 2002.
9. Hofmann, Probabilistic latent semantic indexing, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 1999, pp. 50–57.
10. Järvelin and Kekäläinen, IR evaluation methods for retrieving highly relevant documents, in: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2000, pp. 41–48.
11. Karkali, Rousseau, Ntoulas and Vazirgiannis, Efficient online novelty detection in news streams, in: WISE (1), 2013, pp. 57–71.
12. and McCallum, Pachinko allocation: DAG-structured mixture models of topic correlations, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 577–584.
13. and Croft W.B., Novelty detection based on sentence level patterns, in: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, ACM, 2005, pp. 744–751.
14. and Croft W.B., Improving novelty detection for general topics using sentence level information patterns, in: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, ACM, 2006, pp. 238–247.
15. and Croft W.B., An information-pattern-based approach to novelty detection, Information Processing & Management 44(3) (2008), 1159–1188.
16. Manmatha, Feng and Allan, A critical examination of TDT’s cost function, in: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2002, pp. 403–404.
17. McCallum A.K., Mallet: A machine learning for language toolkit. 2002.
18. Mei and Zhai, A note on EM algorithm for probabilistic latent semantic analysis, in: Proceedings of the International Conference on Information and Knowledge Management, CIKM, 2001.
19. Page, Brin, Motwani and Winograd, The PageRank citation ranking: Bringing order to the web, Stanford InfoLab, 1999.
20. Page, Brin, Motwani and Winograd, The PageRank citation ranking: Bringing order to the web. 1999.
21. Porteous, Newman, Ihler, Asuncion, Smyth and Welling, Fast collapsed Gibbs sampling for latent Dirichlet allocation, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 569–577.
22. Schiffman and McKeown K.R., Context and learning in novelty detection, in: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2005, pp. 716–723.
23. Schult D.A. and Swart, Exploring network structure, dynamics, and function using NetworkX, in: Proceedings of the 7th Python in Science Conferences (SciPy 2008), volume 2008, 2008, pp. 11–16.
24. Steinbach, Karypis, Kumar et al., A comparison of document clustering techniques, in: KDD Workshop on Text Mining, volume 400, Boston, 2000, pp. 525–526.
25. Steyvers, Smyth, Rosen-Zvi and Griffiths, Probabilistic author-topic models for information discovery, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, pp. 306–315.
26. Tang, Zhang, Jin, Yang, Cai, Zhang and Su, Topic level expertise search over heterogeneous networks, Machine Learning 82(2) (2011), 211–237.
27. Tang, Zhang and Mei, One theme in all views: modeling consensus topics in multiple contexts, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013, pp. 5–13.
28. Wang and Shepherd, Co-ranking images and tags via random walks on a heterogeneous graph, in: International Conference on Multimedia Modeling, Springer, 2013, pp. 228–238.
29. Zhang, Tang, Liu and Li, A mixture model for expert finding, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2008, pp. 466–478.
30. Zhang, Feng, Tang, Ojokoh and Liu, Co-ranking multiple entities in a heterogeneous network: Integrating temporal factor and users' bookmarks, in: International Conference on Asian Digital Libraries, Springer, 2011, pp. 202–211.
31. Zhou, Orshanskiy S.A., Zha and Giles C.L., Co-ranking authors and documents in a heterogeneous network, in: Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, IEEE, 2007, pp. 739–744.

