Abstract
Named Entity Disambiguation is the task of assigning entities from a Knowledge Graph (KG) to mentions of such entities in a textual document. The state-of-the-art for this task balances two disparate sources of similarity: lexical, defined as the pairwise similarity between mentions in the text and names of entities in the KG; and semantic, defined through some graph-theoretic property of a subgraph of the KG induced by the choice of entities for each mention. Departing from previous work, our notion of semantic similarity is rooted in Information Theory and is defined as the mutual information between random walks on the disambiguation graph induced by choice of entities for each mention. We describe an iterative algorithm based on this idea, and show an extension that uses learning-to-rank, which yields further improvements. Our experimental evaluation demonstrates that this approach is robust and very competitive on well-known existing benchmarks. We also justify the need for new and more difficult benchmarks, and provide an extensive experimental comparison of our method and previous work on these new benchmarks.
Introduction
A knowledge graph (KG) is a repository of structured information consisting of unique entities (e.g., notable people, cities, companies and other kinds of organizations, etc.), facts about entities (e.g., the date of birth of such people), and relations between entities (e.g., the cities where such people were born). The recent advent of large KGs, derived from Web-scale corpora and/or Wikipedia, has renewed the interest in algorithmic understanding of natural language text, especially in the context of the Web and social media where facts or properties about named entities are described in many documents.
Two crucial tasks in natural language understanding have to do with named entities, which are the persons, organizations, locations, etc. that are explicitly mentioned in text using proper nouns: (1) Named Entity Recognition (NER) corresponds to finding mentions to entities in the text; and (2) Named Entity Disambiguation (NED), which is the task of disambiguating the named entities by linking them to the actual entities in the KG (when possible). The NER process is usually done by taking lexical and grammatical features into account, meaning that some of the mentions identified through this process may refer to entities that are not in the KG. NED, on the other hand, is done for a specific KG, and provides a mapping between each mention and an existing KG entity (or
Our work concerns the NED task. This article describes effective algorithms for solving this problem, assuming the input is a KG and a document where all mentions to entities (explicit or implicit) have been identified.
Challenges The ambiguity of natural language has always been a challenge for NED, even to humans, because most real world entities can be referred to in many different ways (e.g., people have nicknames), while the same textual mention may refer to multiple real-world entities (e.g., different people have the same name). The following examples illustrate the issues:
Observe that the same entity (the NBA franchise team Utah Jazz) is referred to in different ways in the examples above: as “Utah” in Example 1 and explicitly by its full name in Example 2. On the other hand, two different players are mentioned in the same way, by their last name “Malone”: Jeff Malone in Example 1 and Karl Malone in Example 2. The disambiguation of the mentions to the players is hard because of the shared context: both mentions refer to basketball players who played as an NBA All-Star, and also played for the same franchise.

Example named entity disambiguation scenario.
A typical NED system proceeds in two stages: (1) candidate selection, which aims at quickly finding a small number of KG objects which are likely to be mentioned in the text, and (2) mention disambiguation which computes the final mapping between the named entities in the text and the objects in the KG. Candidate selection is often started by consulting alias dictionaries (see, e.g., [35]) using coarse-grained string similarity matching to minimize the chances of filtering out good candidates. The disambiguation step, on the other hand, is done by aggregating multiple and much more sophisticated notions of similarity.
The first disambiguation methods assumed that the mentions in the text were completely independent, and relied on local contextual features such as the words surrounding the mentions [2,3], and statistical features derived from KGs. These approaches work best when the surrounding context is rich enough to uniquely identify the objects being mentioned. For instance, these methods would work really well with famous politicians with an uncommon last name. On the other hand, they would fail to distinguish the two NBA players in the examples above given their shared context (from a lexical point of view): both players have the same last name and played for the same team. Indeed, “Malone” in both examples is likely to be mapped to Karl Malone, who is more well known and thus has a higher prior.
To avoid being fooled by disproportional priors, most state-of-the-art approaches exploit known connections between the objects in the KG to help with the disambiguation, based on the assumption that the disambiguation of each mention should somehow affect the disambiguation of another. For instance, Example 1 has an explicit mention to the NBA team Washington Bullets, which is easy to disambiguate; once that is done, the NED system must be significantly more likely to map “Malone” in that sentence to Jeff Malone, since he is more strongly related to that team. These associations between objects in the KG induce notions of semantic relatedness which are often captured as some property of a disambiguation graph containing objects in the KG which were identified as candidates for the mentions in the text as illustrated in Fig. 1.
The choice of semantic similarity determines both the accuracy and the computation cost of the NED method. One successful strategy [14] computes the set-similarity involving (multi-word) keyphrases about the mentions and the entities, collected from the KG. This approach works best when the named entities in the document are mentioned in similar ways to those in the corpus from which the KG is built (typically, Wikipedia). Another approach [27] computes the set similarity between the neighbor entities directly connected in the KG. Doing so, however, ignores entities that are indirectly connected yet semantically related, making limited use of the KG graph.
In previous work [11], we introduced a method that used an information-theoretic notion of similarity based on stationary probability distributions resulting from random walks [38] on the disambiguation graph, which led to consistently superior accuracy. This article describes substantial extensions over that method.
Our approach
Our
Contributions This article presents several significant extensions over our previous work:
a revised iterative disambiguation algorithm with fewer parameters and a simplified optimization goal than in [11];
a new disambiguation method based on a state-of-the-art “learning-to-rank” approach that further improves on the accuracy reported previously;
a comparative evaluation of both approaches against 11 other NED systems on 16 public datasets using a the GERBIL framework [40];
a much deeper experimental evaluation than previous works in the area, establishing that previous public benchmarks are rather “easy”, in the sense that a simple baseline can correctly disambiguate most mentions;
a framework for deriving new benchmarks and two benchmarks from large Web corpora (Wikipedia and Clueweb 2012) with documents of increasing difficulty, which are also balanced (i.e., they have the same number of documents in each difficulty class).
We evaluate the learning-to-rank approach in two ways. First, we follow the standard machine learning methodology (namely to separate the benchmarking data into training and testing sets) for each dataset. Next, we use the widely used CoNLL dataset for training, and test the resulting system on all other benchmarks. This results in an algorithm that consistently outperforms previous methods, across benchmarks and often by a wide margin.
It is also worth noting that our statically hand-tuned algorithm also outperformed most of previous methods and is quite competitive with the learning approach. These observations corroborate the superiority and robustness of using random walks for the NED task. In summary, the algorithm and the evaluation methodology described in this article significantly push the state-of-the-art in this task.
NED with random walks
This section first describes Named Entity Disambiguationas an optimization problem and then gives an overview of our solution based on using Random Walks to estimate semantic similarities and a greedy, iterative approximation algorithm (Section 3).
The NED problem
Let d be a document with all mentions to named entities marked up through an NER process, and
More precisely:
(Named Entity Disambiguation).
Given a set of mentions
A good assignment Γ balances two factors: the local similarity between mention
As usual, we define the local similarity
The global coherence
Under the reasonable assumption that the local similarity is normalized, we can formulate the NED problem as a min-max optimization where the goal is to maximize the global coherence while minimizing the loss in pairwise local similarity within the assignment, which can be estimated as
The primary role of
The state-of-the-art is to define
Global coherence with random walks
Our approach to capturing the global coherence [11] is rooted in Information Theory and corresponds to the mutual information between probability distributions arising from random walk processes on the disambiguation graph: one always restarting from a single candidate entity, and the other restarting from all entities used in the assignment Γ.
More precisely, we build a disambiguation graph G (Fig. 1) which is a subset of the KB: the nodes are candidate entities and their immediate neighbors, and the edges are the associations between these entities in the KB. Let N be the number of nodes in G. A random walk with restart is an iterative process that assigns scores to nodes in the graph, defined as:
Given any entity e in the graph, if we define the preference vector as
We define
Intuitively, our notion of coherence favors candidate entity
Linking to NIL
A mention is linked to
Disambiguation algorithms

Iterative

vecInit
This section gives the details of our two NED methods, which use the same underlying algorithm and differ only on the way they compute the semantic similarity.
As previously observed (see, e.g., [15]), the NED problem is intimately connected with a number of NP-hard optimizations on graphs, including the maximum m-clique problem [10], from which a polynomial time reduction is not hard to construct. Thus we resort to an iterative greedy algorithm, called Walking NED(
Algorithm 2 computes the vector representing the (partial) assignment of entities to mentions in the input document, from which we can copmute the documents signature.
A final step of the algorithm is to assign
Parameters The experimental evaluation reported here was obtained with the following parameter setting: in Eq. (1)
Most approaches including ours consider the NED problem as an entity ranking problem: candidates are ranked according to a score (e.g. Eq. (6)), and the highest ranked candidate is assigned. Such ranking is based on multiple criteria (e.g., prior probability, context similarity, semantic similarity, etc.) that may apply differently in different situations, making it hard or impossible to craft a fixed way of aggregating them all that performs well in all cases. With benchmarking data available, one can leverage machine learning to derive a better ranking strategy, at least for the specific dataset used for training.
This Learning to Rank approach originates from Information Retrieval. The methods can be divided into three classes [21]: pointwise, listwise, and pairwise. The pointwise approaches consider query-document pairs as independent instances, and use regression to predict the scores of each document (given the query) and rank them accordingly. As such, pointwise methods are not trained on actual rankings, but instead on features from the documents and the queries. On the other hand, listwise approaches are trained on the actual document rankings for different queries, and their goal is to learn how to predict a new ranking given a new query. Finally, the pairwise approaches work on ordered pairs of documents instead of full rankings, and seek to predict which document in the pair would rank higher.
Among the Learning-to-Rank approaches we experimented with, LambdaMART [41], which is a pairwise method, achieved the best performance. LambdaMART employs the MART (Multiple Additive Regression Trees) algorithm to learn Boosted Regression Trees, and takes advantage of LambdaRank [4] gradients to bypass the difficulty of handling non-smooth cost function introduced by most IR measures, and combines the cost function and IR metrics together to utilize the global ranking structure of documents.
Our Learning-to-Rank
Features Although there are many useful features for ranking [45], our main goal is to establish the robustness and utility of the semantic similarity for the NED task, rather than performing exhaustive feature engineering at the risk of over-fitting. Thus we use four features, all of which are familiar in this research area: prior probability, context similarity, semantic relatedness, and name similarity which is measured by the N-Gram distance [19] between a mention and the canonical name of the entity. More precisely, given a mention-entity pair
Training data Using a pairwise ranking method, for each mention we need ordered pairs of entities
In our experiments, we verify the system performance with two training approaches. First, we used the standard 10-fold cross-validation, thus training and testing on the same benchmark. Also, we experimented with training on the CoNLL dataset, which is the largest of the public benchmarks, while testing on other datasets. Both approaches worked well. In particular, we found that training CoNLL data produced a quite robust method that worked well on other benchmarks.
Prediction Using the training data, our approach trains a model and uses it to build an evaluator which takes in a feature set representing the similarity between a mention and its candidate entity and computes the probability that the mention refers to the candidate. All candidate entities are evaluated and ranked with the probability. The entity with the highest probability is then chosen as the final entity. More specifically, given the feature set
Computational cost
There are two factors contributing to the cost of
Let
As for the time required for the actual scoring, for fixed KG and input document, computing the prior probability and the context similarity are done through database lookups at
In our experience, the highest actual costs lie in building the disambiguation graphs, which must be done for each input document, and performing the random walks. Our current implementation keeps the entire disambiguation graph in main memory to speed this up. We leave for future work improving the performance of
Experimental validation
Given the host of applications where NED is useful and the inherent difficulty of the problem, a lot of effort has been devoted recently in establishing fair and comprehensive benchmarks for this task. In particular, Web-based experimental platforms such as GERBIL [40] are a clear step in the right direction, as they go a long way in automating the collection and reporting of results of different algorithms under the same benchmarks and evaluation conditions. In the appendix we report the results of both our methods on GERBIL version 1.2.4 with the D2KB setting.
Despite their effectiveness and convenience, the information reported by platforms such as GERBIL (particularly, aggregate accuracy measurements) is not enough for a deeper analysis that can lead to algorithmic improvements. Thus, we re-implemented and experimented with several NED systems and compared them against both
Experiment configuration We refer to previous work [11] for the details on building the entity graph and the initial pruning of candidate entities prior to the actual disambiguation per se. The KB used in our experiment is built from the Wikipedia 20130606 dump. The source code for our system is available from
Metrics We use the standard accuracy, precision, recall, and F1:
Where
Evaluation on established benchmarks
We compare
Cucerzan [7] – the first global NED approach,
M&W [26] – a leading machine learning NED solution,
Han11 [13] – a global method that also uses random walks (on a disambiguation graph built differently from ours),
AIDA [15] – a global method that formulates NED as a subgraph optimization problem,
GLOW [31] – a system combining local and global features for NED,
RI [6] – the start-of-the-art NED system using relational inference for mention disambiguation.
We also evaluate two useful baselines:
As mentioned in [9], GERBIL uses an old version of the public datasets. Thus we update four widely used public benchmarks: (1) MSNBC [7], with 20 news articles from 10 different topics (two articles per topic) and 656 linkable mentions in total; (2) AQUAINT, compiled by Milne and Witten [26], with 50 documents and 727 linkable mentions from a news corpus from the Xinhua News Service, the New York Times, and the Associated Press; (3) ACE2004 [31], a subset of the documents used in the ACE2004 Coreference documents with 35 articles and 257 linkable mentions, annotated through crowdsourcing; and (4)
The original dataset includes 5 other documents where all mentions are linked to
To avoid discrepancy of results due to different Wikipedia versions used in different systems, we update all datasets and results of compared NED systems to their redirected entities in our Wikipedia dump. All datasets used in this evaluation, including the new benchmarks introduced below, as well as the accuracy results obtained with each method on each document can be downloaded from
Table 1 shows the results of the two baselines and the above listed NED systems on the four public benchmarks. As customary, we report F1 aggregated across mentions (micro-averaged, indicated as
Accuracy results of all methods on the 4 public benchmarks
For the learning to rank approaches,
Discussion A few observations are worth making here. Among previous work, RI has the best performance across benchmarks. The disambiguation via textual similarity alone, as done by the
The reader will notice that virtually every method in the literature is evaluated against a baseline like
With respect to
Although the four benchmarks discussed above are useful reference points, since they are well-known and have been used for the evaluation of most NED systems, they leave a lot to be desired for a deeper and more systematic accuracy evaluation. As noted in the previous section, they are clearly biased towards popular entities, and thus, not representative of all scenarios where NED is necessary. To further illustrate the point, Table 2 breaks down the number of documents in each benchmark at different levels of accuracy achieved by
Breakdown of the public benchmarks by the accuracy of the Prior method; #docs and #mentions are, respectively, the number of documents and the average number of mentions per document in each bracket; the number in parenthesis is the fraction of the entire benchmark covered by each bracket
Breakdown of the public benchmarks by the accuracy of the
A desirable feature of any thorough evaluation that is not necessarily fulfilled by any of the previous benchmarks is that of representativeness. Namely, it would be ideal to have a mix of mentions or documents with different levels of difficulty in equal proportions (say on a 10-point scale from “easy” to “hard”). Without such equity, the effectiveness metrics reported in the literature (which aggregate at the mention or document level) may not be good predictors of actual performance in real applications. For instance, if a large fraction of the mentions in the benchmarks are “too easy” compared to real documents, the metrics will overestimate the true accuracy.
Of course, in order to fine tune the difficulty of the mentions and the documents in a benchmark one needs a reliable indicator of “difficulty” that can be applied to a large number of documents. Manual annotations are clearly undesirable here, and so is crowdsourcing: the number of annotations needed might prove prohibitive and even if resources are not a concern this leads to a single benchmark (i.e., if more documents are needed, more annotations would be required).
To obtain new and balanced benchmarks, we consider the
More precisely, we applied
Some statistics about the proposed benchmarks: Fig. 2(a) shows the average number of mentions per document and Fig. 2(b) shows the average number of candidates per mention. For the sake of comparison, we also report the same statistics from the documents in the

Corpus statistics.

Disambiguation graph statistics.
As one can see, the variability in our datasets is considerably smaller compared to
Average per-bracket accuracy on large-scale benchmarks. Brackets for
Figure 4 shows the accuracy on the new benchmarks. We plot the accuracy of the best performing methods for each of the difficulty brackets (defined by the accuracy of the
A few observations are worth mentioning here. First, the two new benchmarks complement the
Our

Average accuracy of the top-5 methods on the Wikipedia, Clueweb 12, and
We now look at the kinds of errors made by our method. To do so, we manually inspected every error for the smaller MSNBC, AQUAINT, and ACE2004 datasets, and analyzed 20 errors randomly picked in each bracket for the larger ones.
The first observation is that in the older benchmarks, a larger fraction of the errors in our method happen in the candidate selection phase, as illustrated in Fig. 5. On average, 54% of the errors in the smaller benchmarks are due to candidate selection (compared to 18% in the other ones). This reinforces the hypothesis that the entities mentioned in these older benchmarks are easier to disambiguate.3
Note: given that the smaller benchmarks are older, it is unlikely they mention more out-of-KB entities than the other ones, especially

Breakdown of errors by
Below we discuss prototypical errors in each of the phases.
Incorrect co-reference resolution We employ a co-reference resolution algorithm in our text processing pipeline to increase recall. Due to the heuristic nature of the algorithm, it is possible that distinct named entities are incorrectly deemed to be the same. For example, in the sentence “
Incomplete alias dictionary Currently, we disambiguate only those mentions corresponding to an alias from Wikipedia, leading to problems in sentences like “Thirteen miners were trapped inside the Sago Mine near Buckhannon,
Aggressive pruning Another source of error by our method is pruning the correct entity from the disambiguation graph. For example, in sentence “A state coordinator for the
Errors during mention disambiguation
These are errors where the correct entities according to the ground truth were selected as the candidates but not chosen during the mention disambiguation phase by our algorithm.
Lack of application domain We observed that most of the errors associated with locations happen because the documents in most benchmarks are news articles that start with the location of the news source reporting the news (e.g., New York Times documents always start with a mention to New York). More often than not, such locations are totally unrelated to the topic of the documents and other mentions in the document, breaking the global coherence assumption. These errors, which can be easily fixed via pre-processing, accounts for 5% of the mistakes of our algorithm in the MSNBC and AIDA benchmarks and 2% across all benchmarks.
Need for deeper text analysis There are of course very hard disambiguation cases where a deeper understanding of the text would be needed for a successful algorithmic approach. One example is the sentence: “Maj. Gen.
Questionable errors
We argue that in many cases, our algorithm (as well as other systems) chose an entity that are considered erroneous by the ground truth but that would be acceptable to a human judge. For example, in the sentence: “Coach Saban said the things
Questionable disambiguation errors
Questionable disambiguation errors
Given the iterative
A mention to the USS Cole which should have been linked to USS Cole (DDG-67), was linked to USS Cole bombing.
As for the initialization step, we found that most documents in our benchmarks do have unambiguous mentions available, and most of them are correctly linked to the true entity.5
Recall (Section 3) we initialize the document disambiguation vector with unambiguous mentions when available.
Finally, we found that most other errors happened after 5 iterations, when the document disambiguation vector already captures the topic fairly well. These errors are for mentions that are not semantically related to other mentions in the document, or simply due to the disproportionately high priors favoring (incorrectly) head entities.
NED is commonly cast as a ranking problem where we estimate the likelihood with which each candidate entity should be assigned to each mention, choosing the one with the highest likelihood. Based on the features used and the way mentions are disambiguated, most work about entity disambiguation in the literature can be divided into two main groups.
The first group, which we refer to as local methods, disambiguate each mention in a document independently of others using local features. These methods represent mentions and entities with feature vectors and measure the likelihood using a compatibility function such as vector similarity measures. The most commonly used local features include lexical features such as bag-of-words [2], keyphrases [14], n-grams [37], or semantic word embeddings [47] extracted or derived from the surrounding context, and statistical features such as the probability of entities given a mention from a knowledge graph. While the first local methods were unsupervised [2,3], combining features through manually tuned parameters, others optimize the parameter selection with various classifiers [8,24,26,43,44,46] often using Wikipedia as a source of training data. As described, the main issues with local methods are the data sparsity problem on the features and ignoring the semantic dependencies between mentions.
The second group of methods, which we refer to as global methods, adhere to the hypothesis that mentions from the same document are semantically coherent around the topic of the document, and cast the disambiguation problem as an optimization whose objective is to find the assignment with maximum global coherence. Thus appropriate semantic relatedness measures and efficient inference for the assignment with maximum global coherence are the two most important factors. Cucerzan [7] proposed the first global approach which measures the semantic relatedness between entities using Wikipedia categories of entities. Milne and Witten (M&W) [26] represent each entity with their directly connected entities in a knowledge graph and measure relatedness using normalized Google distance between the representations. Kulkarni et al. [20] also use the M&W semantic relatedness measure in their collective approach. In addition to that, Ratinov et al. [31] add the PMI (Pointwise Mutual Information) of entities into their SVM classifier, and Cheng and Roth [6] model more fine-grained semantics such as relations between mentions as constraints in the semantic relatedness, which greatly improved the results. More recently, Zwicklbauer et al. [47] use word embeddings from the context of entities as the representation to measure their semantic relatedness.
With semantic relatedness measures, entities and their relationships can be represented as an entity graph, and the optimization objective is to find a subgraph with maximum global coherence in which each node corresponds to the referent entity of a mention in the document. AIDA [15] aims to find a dense subgraph in the mention-entity graph that contains all mentions and a single mention-entity edge for each mention according to their definition of graph density. Han et. al [13] also use the graph built around candidates of mentions and apply random walk over the graph to obtain the pagerank score of entities which is used to measure the global coherence. Instead of using PageRank, AGDISTIS [39] applies the HITS algorithm on the graph and uses the authority score of a candidate for its global coherence with mentions and other entities.
The assumption that mentions belong to one single topic may not be true in many documents. For such cases, topic modelling can be used to distribute latent topics across candidate entities. Han and Sun [12] combine the local compatibility and global coherence using a generative entity-topic model to infer the underlying referent entities. Li et al. [22] introduce additional information of entities mined from external corpus into a generative graph model to improve the effectiveness of NED, especially for entities with rare information available in the knowledge graph. Kataria et al. [18] proposed a semi-supervised hierarchical topic model for entity disambiguation based on the Wikipedia’s category hierarchy and word-entity associations. Ganea et al. [9] recently proposed a light-weight probabilistic model based on the co-occurrence statistics derived from knowledge graphs.
Most semantic relatedness measures employed in NED systems compute the relatedness between two entities only. Instead, we propose a unified semantic representation for both entities and the collective entity-to-mention assignment (Γ), with which we can measure how well a candidate entity fits with those previously assigned in an unified way. Our representation can capture the semantics of unpopular entities (those with low degree in the entity graph), which makes our NED approach more robust. This observation is supported by our experiments. The idea of using random walks with restart has been applied on graphs constructed from the WordNet [25], with the stationary distribution to represent the semantics of words. It has been shown to be effective in the word similarity measurement [1,16], and word sense disambiguation [30]. However, we are not aware of any previous work using the stationary distribution from random walk with restart to represent entities and documents in NED.
When it comes to learning to rank, Zheng et al. [45] applied learning to rank approaches on the NED task and demonstrated its superior effectiveness over most state-of-the-art algorithms. Their results showed that the listwise method ListNet [5] performed better than the pairwise approach Ranking Perceptron [34]. However, we found that the pairwise approach LambdaMART [41] achieved the best performance on our datasets among most learning to rank algorithms.
Conclusion
We described a method for named entity disambiguation that combines lexical and statistical features with semantic signatures derived from random walks over suitably designed disambiguation graphs. Our semantic representation uses more relevant entities from the knowledge graph, thus reducing the effect of feature sparsity, and results in substantial accuracy gains. We described a hand-tuned greedy algorithm as well as one based on learning-to-rank. Both outperform the previous state-of-the-art by a wide margin. Moreover, we showed that our
Moreover, we demonstrated several shortcomings of the existing NED benchmarks and described an effective way for deriving better benchmarks and described two new such benchmarks based on web-scale annotated corpora (ClueWeb12 and Wikipedia). Our benchmark generation method can be tuned to produce “harder” or “easier” cases as desired. Overall, the benchmarks we describe complement the largest currently available public benchmark. Our experimental evaluation compared our methods against six leading competitors and two very strong baselines, revealing the superiority and robustness of our NED system in a variety of settings. Our method was particularly robust when disambiguating unpopular entities, making it a good candidate to address the “long tail” in Information Extraction.
Future work There are several directions worth exploring. Section 4.5 lists several ideas for algorithmic improvements that can lead to better NED systems in the future. Also, while the new benchmarks described here can be used for both accuracy and scalability tests (as one can easily obtain large quantities of documents from ClueWeb12 and Wikipedia), further work is needed in helping the design and verification of ground-truths.
System performance is one issue we will need to improve. Currently, the average time to disambiguate a document with less than 100 is in the order of a few minutes, and the time could increase with the number of mentions in a document and average number of candidates per mention. Majority of the time is spent on the expensive random walk computations. Approximating the semantic signature with less expensive random walk algorithms would be helpful. Other than that, designing a system using appropriate indexing and utilizing the current large-scale data processing infrastructure would also be interesting.
Footnotes
Evaluation on GERBIL
GERBIL [40] is a general framework for entity annotation, which has more than 11 NED systems and 16 public datasets for the entity disambiguation task. We compare our system with all available systems including graph-based systems: AGDISTIS [39], AIDA [15], Babelfy [28], FOX [36], WAT [29], xLisa [42], and PBoH [9]; context-based systems: DBpedia Spotlight [23], FREME NER [33], Kea [37], and NERD-ML [32]. The datasets mainly contain documents from news article, RSS feeds, tweets, encyclopedia and the mix. Detailed statistics of the datasets are shown in Table 5. We can see that most datasets are from news articles which means that entities mentioned in the documents are popular ones. Performance on datasets with very few mentions per document such as Microposts2014 and N3-RSS-500 will depends more on the contextual similarity, and less on the global coherence of most approaches.
Table 6 gives the results evaluated using GERBIL,6
Due to an implementation issue, each dataset has to be evaluated separately. Results are available in
There is a drop on the results from version 1.1.4 to version 1.2.4. Details:
We can see that comparing to other NED systems, our two systems
