Abstract
This article gives an overview of recent efforts focusing on integrating heterogeneous data using Knowledge Graphs. I introduce a pipeline consisting of five steps to integrate semi-structured or unstructured content. I discuss some of the key applications of this pipeline through three use-cases, and present the lessons learnt along the way while designing and building data integration systems.
Introduction
Data abounds in large enterprises. Beyond structured data, which garnered a lot of attention from data specialists in the past, the last few decades saw the meteoric rise of semi-structured and unstructured data including JSON documents, email or social network messages, and media content. Most companies are struggling to create a coherent and integrated view over all those types of data.
Knowledge Graphs have become one of the key modalities to integrate disparate data in that context. They provide declarative and extensible mechanisms to relate arbitrary concepts through flexible graphs that can be leveraged by downstream processes such as entity search [10] or ontology-based access to distributed information [2].

Fig. 1. The XI Pipeline goes through a series of five steps to integrate semi-structured or unstructured content leveraging a Knowledge Graph.
Yet, integrating enterprise data into a given Knowledge Graph is a highly complex and time-consuming task. In this article, I briefly summarize the recent research efforts from my group in that regard. I introduce the XI Pipeline, an end-to-end process to semi-automatically map existing content onto a Knowledge Graph (see Section 2). I also discuss a series of systems we designed, built and deployed in that context to integrate publications (Section 3.1), social content (Section 3.2), and cloud infrastructure data (Section 3.3). Finally, I conclude by making a number of observations and recommendations for future efforts in Big Data Integration based on our past experience in that domain (Section 4).
The XI Pipeline
An overview of the pipeline we devised to integrate heterogeneous contents leveraging a Knowledge Graph is given in Fig. 1. This pipeline focuses on semi-automatically integrating unstructured or semi-structured documents, as they are from our perspective the most challenging types of data to integrate, and as end-to-end techniques to integrate strictly structured data abound [9,13]. The Knowledge Graph underpinning the integration process should be given a priori, and can be built by crowdsourcing (see Section 3.2), by sampling from existing graphs (Section 3.1) or through a manual process (Section 3.3). The integration process starts with semi-structured or unstructured data given as input (left-hand side of Fig. 1) and goes through a series of steps, described below, to integrate the content by creating a set of new nodes and edges in the Knowledge Graph as output (right-hand side of Fig. 1).
Named-Entity Recognition (NER)
The first step is to go through all labels / textual contents in the input data and identify all entity mentions (e.g., locations, objects, persons or concepts) appearing in the text. Two main strategies can be applied here:
(1) When the Knowledge Graph is complete and contains all entities of interest along with their labels, we proceed with Information Retrieval techniques to build inverted indices over the Knowledge Graph and identify all potential entities from the text by leveraging ad-hoc object retrieval techniques [21].
(2) When the Knowledge Graph is incomplete and is missing a number of entities and labels of interest, things get more complex. The main problem we face in that case is to identify entities from text while not knowing anything about them, which is intrinsically a very challenging problem. To solve this issue, we leverage NLP techniques (part-of-speech tags), third-party information such as large collections of N-grams, and Machine Learning to identify new entities and add them to the Knowledge Graph dynamically [11].
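The first strategy can be illustrated with a minimal sketch, assuming a toy Knowledge Graph held in memory. The entity identifiers, the `KG_LABELS` dictionary and both function names are hypothetical and chosen for illustration only; the ad-hoc object retrieval techniques of [21] are considerably richer than this exact-match lookup.

```python
from collections import defaultdict

# Hypothetical toy Knowledge Graph: entity id -> known labels.
KG_LABELS = {
    "kg:Roger_Federer": ["Roger Federer", "Federer"],
    "kg:Switzerland": ["Switzerland", "Swiss Confederation"],
    "kg:Basel": ["Basel"],
}

def build_inverted_index(kg_labels):
    """Map each lower-cased label (as a token tuple) to the entities carrying it."""
    index = defaultdict(set)
    for entity, labels in kg_labels.items():
        for label in labels:
            index[tuple(label.lower().split())].add(entity)
    return index

def find_mentions(text, index, max_len=3):
    """Scan the text left to right, preferring the longest matching label."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    mentions, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in index:
                # Keep the surface form together with its candidate entities.
                mentions.append((" ".join(tokens[i:i + n]), sorted(index[key])))
                i += n
                break
        else:
            i += 1  # no label starts at this token
    return mentions

index = build_inverted_index(KG_LABELS)
mentions = find_mentions("Roger Federer was born in Basel, Switzerland.", index)
```

Each returned pair holds a surface form and its candidate entities, which is the input expected by the linking step that follows.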
Entity linking
The first step typically returns a set of textual mentions (surface forms) from the input data, along with a set of candidates (entities) from the Knowledge Graph potentially corresponding to the mentions. The following task is to decide which entity from the graph corresponds to which mention from text and to link them. Many techniques can be used to solve this problem, which is typically referred to as Entity Disambiguation or Entity Linking in the literature [14].
Our solution to that problem departs from the state of the art in two important ways [3]: we use probabilistic graphs to combine several techniques, and micro-task crowdsourcing [5] to improve the results leveraging human computation. Empirical results show that involving humans in this process improves the end results by over 10% compared to automated approaches [4].
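The interplay between automated scoring and human computation can be sketched as follows. This is a deliberately simplified stand-in, not the probabilistic model of [3]: it multiplies an assumed popularity prior by a smoothed context-overlap count, normalizes the scores, and flags the mention for crowdsourcing when the best candidate is not confident enough. The function name, the candidate data and the `crowd_threshold` value are all hypothetical.

```python
def link_mention(candidates, context_tokens, crowd_threshold=0.7):
    """Pick an entity for a mention and decide whether to ask the crowd.

    candidates: dict mapping entity id -> (prior, set of descriptive keywords).
    context_tokens: set of lower-cased words surrounding the mention.
    """
    scores = {}
    for entity, (prior, keywords) in candidates.items():
        overlap = len(keywords & context_tokens)
        # Add-one smoothing keeps a candidate alive without any context overlap.
        scores[entity] = prior * (1 + overlap)
    total = sum(scores.values())
    posterior = {entity: score / total for entity, score in scores.items()}
    best = max(posterior, key=posterior.get)
    # Low confidence: hand the mention over to a crowdsourcing micro-task.
    needs_crowd = posterior[best] < crowd_threshold
    return best, posterior[best], needs_crowd

# Hypothetical candidates for the surface form "Paris".
candidates = {
    "kg:Paris_(France)": (0.8, {"france", "capital", "seine"}),
    "kg:Paris_Hilton": (0.2, {"hotel", "heiress", "celebrity"}),
}
ambiguous = link_mention(candidates, {"hotel", "celebrity", "party"})
clear = link_mention(candidates, {"france", "capital", "visited"})
```

In the first call the prior and the context disagree, so the confidence stays below the threshold and the mention would be routed to human workers; in the second call both signals agree and the link is accepted automatically.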
Type ranking
The next step we perform is rather unusual. We assume that each entity in the Knowledge Graph is associated with a series of types (there are many techniques to infer such entity types when they are missing from the Knowledge Graph, e.g., statistical techniques [6]). However, the types associated with a given entity in the graph are typically not all relevant to the mention of that entity as found in the input data. Hence, we introduced the task of ranking the types of an entity given its mention and context in the input data [18]. We leverage features from both the underlying type hierarchy and the textual context surrounding the mention to solve this task in practice [19]. The result of this process is a ranking of fine-grained types associated with each entity mention, which can be invaluable when tackling downstream steps such as Co-Reference Resolution or Relation Extraction (see below).
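As a rough illustration of combining the two families of features, the sketch below scores each type by its depth in a toy hierarchy (deeper types are more specific) and by the overlap between assumed type keywords and the mention context. The hierarchy, the keywords and the weights are hypothetical; the actual features used in [18,19] are more elaborate.

```python
# Hypothetical type hierarchy: type -> parent type (None for the root).
TYPE_PARENT = {
    "Thing": None,
    "Person": "Thing",
    "Athlete": "Person",
    "TennisPlayer": "Athlete",
    "Philanthropist": "Person",
}

# Hypothetical keywords that tend to co-occur with each type.
TYPE_KEYWORDS = {
    "Athlete": {"match", "training"},
    "TennisPlayer": {"match", "tournament", "serve"},
    "Philanthropist": {"donated", "foundation", "charity"},
}

def depth(type_name):
    """Number of edges between a type and the root of the hierarchy."""
    d = 0
    while TYPE_PARENT[type_name] is not None:
        type_name = TYPE_PARENT[type_name]
        d += 1
    return d

def rank_types(entity_types, context_tokens, w_context=2.0, w_depth=0.5):
    """Order the types of an entity from most to least relevant to a mention."""
    def score(type_name):
        overlap = len(TYPE_KEYWORDS.get(type_name, set()) & context_tokens)
        return w_context * overlap + w_depth * depth(type_name)
    return sorted(entity_types, key=score, reverse=True)

types = ["Person", "Athlete", "TennisPlayer", "Philanthropist"]
ranking = rank_types(types, {"donated", "foundation", "charity"})
```

With a context about charity work, the type `Philanthropist` is ranked first even though `TennisPlayer` sits deeper in the hierarchy, which is the behaviour the task is meant to capture.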
Co-Reference Resolution
Up to this point, we have created a series of high-quality links, along with relevant type information, to integrate mentions from the input data to entities in the Knowledge Graph. However, a number of further mentions available in the input data, such as noun phrases (e.g., “the Swiss champion” or “the former president”), cannot be resolved by our method. To tackle this issue, we introduce a Co-Reference Resolution step capturing further mentions from the input data and disambiguating them by taking advantage of all the data integrated so far. We developed novel methods to do so, simultaneously leveraging fine-grained type information [12] as well as deep neural networks [8] to maximize the quality of the results.
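To give an intuition of how fine-grained types help here, the following sketch resolves a noun phrase to the most recent previously linked entity whose types share words with the phrase. It is a simple heuristic written for illustration and does not reproduce the neural methods of [8,12]; the linked entities, their types and their positions are hypothetical.

```python
STOPWORDS = {"the", "a", "an"}

# Hypothetical output of the earlier steps:
# (entity id, fine-grained types, token position of its latest mention).
LINKED_ENTITIES = [
    ("kg:Roger_Federer", {"Swiss", "champion", "tennis player"}, 3),
    ("kg:Barack_Obama", {"American", "former president"}, 10),
]

def resolve_noun_phrase(phrase, position, linked_entities):
    """Return the linked entity best matching a noun phrase, or None."""
    words = set(phrase.lower().split()) - STOPWORDS
    best_entity, best_key = None, None
    for entity, types, last_position in linked_entities:
        if last_position >= position:
            continue  # only look backwards in the text
        type_words = {w for t in types for w in t.lower().split()}
        overlap = len(words & type_words)
        if overlap == 0:
            continue
        # Prefer more shared words, then the most recently mentioned entity.
        key = (overlap, last_position)
        if best_key is None or key > best_key:
            best_entity, best_key = entity, key
    return best_entity

champion = resolve_noun_phrase("the Swiss champion", 20, LINKED_ENTITIES)
president = resolve_noun_phrase("the former president", 25, LINKED_ENTITIES)
unknown = resolve_noun_phrase("the famous actor", 30, LINKED_ENTITIES)
```

A phrase that matches no type, such as the last one, stays unresolved instead of being forced onto an entity.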
Relation extraction
The final step is to extract semantic relationships between the entities appearing in the input data. This is important in order to correctly capture the articulation of the input data as well as the dependencies between the extracted entities. Relation extraction is, generally speaking, a very challenging task, as there are a myriad of (explicit or implicit) ways to express a given relationship between several entities in the input data. To solve this problem, we resort to Distant Supervision leveraging the Knowledge Graph [16]. The basic idea is as follows: we consider pairs of entities connected through a relation in the Knowledge Graph as training data, and try to identify similar entities connected through the same relation from the input data. We devised a new neural architecture (the Aggregated Piecewise Convolutional Neural Network [15]) to solve this task effectively in practice.
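The data-generation side of Distant Supervision can be sketched in a few lines, assuming entities are matched by plain string containment. The triples, the sentences and the function name are hypothetical, and the neural model of [15] that would consume these examples is not shown.

```python
# Hypothetical Knowledge Graph triples: (head, relation, tail).
KG_TRIPLES = [
    ("Roger Federer", "bornIn", "Basel"),
    ("Bern", "capitalOf", "Switzerland"),
]

# Hypothetical input sentences.
SENTENCES = [
    "Roger Federer grew up near Basel.",
    "Bern is the seat of the government of Switzerland.",
    "Roger Federer won the tournament in Basel.",
    "Geneva hosts many international organisations.",
]

def distant_labels(triples, sentences):
    """Label every sentence mentioning both entities of a triple with its relation."""
    examples = []
    for sentence in sentences:
        for head, relation, tail in triples:
            if head in sentence and tail in sentence:
                examples.append((sentence, head, tail, relation))
    return examples

training_examples = distant_labels(KG_TRIPLES, SENTENCES)
```

The third sentence is labeled `bornIn` although it says nothing about a birthplace. Such noisy labels are inherent to Distant Supervision and are one reason why a robust model is needed on top of the generated examples.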
Use-cases
The outcome of the process described above is a set of nodes and links connecting mentions from the input data to entities and relations in the Knowledge Graph. As a result, the Knowledge Graph can then be used as a central gateway (i.e., as a mediation layer) to retrieve all heterogeneous pieces of data related to a given entity, type, relation or query.
We extended this generic approach to integrate various types of Big Data. We briefly present below three such deployments focusing on integrating different input data: (1) research articles, (2) social media content, and (3) cloud infrastructure data.
ScienceWise: Integrating research articles
As the production of research artifacts is booming, it is getting more and more difficult to track down all the papers related to a given scientific topic. The ScienceWise [1] platform (co-created with EPFL and Leiden University) was conceived in that context, in order to help physicists track down articles of interest from arXiv. The platform allows physicists to register their interests using a Knowledge Graph where most entities relating to physics have been defined through crowdsourcing. As new articles are uploaded on arXiv, they are automatically integrated into the Knowledge Graph using a pipeline similar to (although simpler than) the one described above in Section 2. As a result, the physicists are automatically notified whenever a new paper relating to one of their interests gets uploaded.
ArmaTweet: Integrating social media contents
The second system we built tackles social media content. Specifically, we looked into how Knowledge Graphs can help integrate series of tweets (i.e., microposts) that are difficult to handle otherwise given their short and noisy nature. The resulting system, ArmaTweet [20] (a collaboration between ArmaSuisse, the University of Oxford and my group), takes as input a stream of tweets, extracts structured representations from the tweets using a pipeline similar to the one presented above, and integrates them into a Knowledge Graph built by borrowing content from both DBpedia and WordNet. ArmaTweet allows users to pose complex queries (such as “find all politicians dying in Switzerland” or “find all militia terror acts”) against a set of tweets, which could not be handled otherwise using classical Information Retrieval or Knowledge Reasoning methods.
Guider: Integrating cloud infrastructure data
Another integration project we worked on (together with Microsoft CISL) is Guider [7]: a system to automatically integrate cloud infrastructure data to a Knowledge Graph. The input data in this case is a very large set of logs produced automatically by a distributed computing infrastructure. We parse and integrate the log data drawing from the pipeline described in Section 2, but considerably customizing it to take into account the specificities of the data (e.g., classical NLP or entity linking techniques cannot be applied in this context, as the input data does not contain any sentence). The resulting graph captures lineage information among files and jobs running on the infrastructure. The deployed system is now used for a series of applications at Microsoft including job auditing and compliance, automated SLO extraction of recurring tasks, and global job ranking.
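To make the notion of lineage more concrete, the sketch below builds a small lineage graph from log lines and walks it upstream. The log format, the file paths and the function names are entirely hypothetical and do not reflect the actual logs or implementation of Guider [7]; the point is only that such data calls for pattern-based parsing instead of sentence-level NLP.

```python
import re
from collections import defaultdict

# Hypothetical log lines: each job reads one file and writes one file.
LOG_LINES = [
    "job=extract read=/raw/events.log write=/stage/events.csv",
    "job=aggregate read=/stage/events.csv write=/report/daily.csv",
    "malformed line that the parser skips",
]

LOG_PATTERN = re.compile(r"job=(\S+) read=(\S+) write=(\S+)")

def build_lineage(lines):
    """Return a mapping node -> set of direct upstream nodes (files and jobs)."""
    parents = defaultdict(set)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # ignore lines that do not follow the assumed format
        job, source, target = match.groups()
        parents["job:" + job].add("file:" + source)
        parents["file:" + target].add("job:" + job)
    return parents

def upstream(node, parents):
    """Collect every file and job that a node transitively depends on."""
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

lineage = build_lineage(LOG_LINES)
dependencies = upstream("file:/report/daily.csv", lineage)
```

Here `dependencies` contains both jobs and both of their input files, which is the kind of answer an auditing or compliance application would ask of the graph.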
Conclusions & lessons learnt
Drawing from our own experience, Knowledge Graphs proved to be powerful and flexible abstractions to integrate heterogeneous pieces of content. Yet, the integration process required to correctly map the input data onto a Knowledge Graph is taxing, as automated techniques cannot fully grasp the semantics of arbitrary input data (yet). While working on the various efforts described above, we learnt a few lessons that we hope will be valuable for future research.
First, human attention (in the form of crowdsourcing or manual inspection of the input and/or output data) is still key to providing high-quality results. While automated techniques have improved, they are still far from providing ideal results. Along similar lines, one cannot expect perfect results from human experts either, given the inevitable subjectivity or ambiguity of some of the tasks in a large-scale integration project.
Second, entity types represent very useful constructs in integration efforts. We are not talking about coarse-grained types (e.g., person or location), but rather about very specific, fine-grained types (e.g., Dropout from Harvard or Municipalities of the canton of Fribourg) borrowed from a rich and expressive type hierarchy. Associating and ranking such fine-grained types early in the pipeline for each entity mention found in the input data is invaluable for many downstream tasks such as data summarization, co-reference resolution or relation extraction.
Third, the quality of the integration process is always constrained by the quality of the Knowledge Graph used as a mediation layer. Large Knowledge Graphs are typically full of errors and inconsistencies [17], which have to be fixed prior to the integration process in order to maximize the quality of the results. Missing data in the Knowledge Graph is yet another issue, which jeopardizes the entire integration process, as working with incomplete data is inherently very challenging.
Finally, designing a generic platform capable of integrating different data for different applications proved to be impractical. Even if, as described above, many ideas and processes can be recycled from one project to the next, real data is always intricate and specific, making it essential to specialize the approach for the use-case at hand. Providing a library of composable software artifacts, each responsible for a certain integration subprocess and each focusing on a certain data modality, might be an interesting avenue for future work in that context.
