Abstract
This paper presents a survey on multilingual Knowledge Graph Question Answering (mKGQA). We employ a systematic review methodology to collect and analyze the research results in the field of mKGQA by defining scientific literature sources, selecting relevant publications, extracting objective information (e.g., problem, approach, evaluation values, used metrics, etc.), thoroughly analyzing the information, searching for novel insights, and methodically organizing them. Our insights are derived from 46 publications: 26 papers specifically focused on mKGQA systems, 14 papers concerning benchmarks and datasets, and 7 systematic survey articles. Starting its search from 2011, this work presents a comprehensive overview of the research field, encompassing the most recent findings pertaining to mKGQA and Large Language Models. We categorize the acquired information into a well-defined taxonomy, which classifies the methods employed in the development of mKGQA systems. Moreover, we formally define three pivotal characteristics of these methods, namely resource efficiency, multilinguality, and portability. These formal definitions serve as crucial reference points for selecting an appropriate method for mKGQA in a given use case. Lastly, we delve into the challenges of mKGQA, offer a broad outlook on the investigated research field, and outline important directions for future research. Accompanying this paper, we provide all the collected data, scripts, and documentation in an online appendix.
Keywords
Introduction
The most popular search engines on the Web process dozens of billions of queries per day.1
While speaking about the Web, the extraction of direct answers based on structured data is enabled by the introduction of the Semantic Web [11], which aims at making the Web data machine-readable. The Semantic Web corresponds to the formal definition of a Knowledge Graph (KG) [64] and, therefore, can be considered as a giant decentralized KG. A dominant share of all KGs as well as the Semantic Web itself are described with the Resource Description Framework (RDF) [89] (RDF-based KGs). A family of systems addressing the challenge of giving direct answers based on KGs is named Knowledge Graph Question Answering (KGQA) systems. KGQA systems have the same objective as KBQA ones, however, a KB
It is well known that the Web, which in this context is frequently confused with the Internet, is the major information source for people all over the world [35,76,124] in different domains of life. Despite that, the majority of the information on the Web (53.6%)3
The scope of this paper is limited to multilingual and cross-lingual Knowledge Graph Question Answering (mKGQA) systems. mKGQA systems extend standard KGQA functionality by making it possible to process questions or search for information in several different languages
Our analysis of the related work published over the past decade suggests (see Section 3) that currently available systematic surveys on KGQA barely address the aspect of multilinguality. In particular, the majority of the related surveys dedicate one paragraph or less to multilinguality while mentioning it as a challenge for KGQA. In addition, none of the related work concentrates specifically on the multilingual aspect of these systems. As there are KGQA systems that explicitly focus on multilinguality, there is clearly a need for a survey on mKGQA. This paper is guided by the following research questions:
In this work, we employ a systematic review methodology posited in [12,78,91,103] (see Section 2) to collect and analyze the research results in the field of mKGQA by defining scientific literature sources, selecting relevant publications, extracting objective information (e.g., evaluation values, used metrics, etc.), analyzing the information, searching for new insights as well as generalizing them in a structured manner (i.e., in the form of a taxonomy). Finally, we summarize our observations and present a general outlook on the investigated research direction. Our insights are mainly derived from 46 publications, which were selected from more than a thousand publications that were retrieved during the initial selection phase. After the manual verification of the formal selection criteria, we selected 26 papers about mKGQA systems, 14 papers about benchmarks and datasets, and 7 systematic survey articles. To ensure the transparency and reproducibility of this work, we provide all the collected data, scripts, and documentation as an online appendix.
This article is structured as follows. In Section 2, we describe the methodology of the systematic survey. Section 3 contains the overview of the related systematic surveys about KGQA. In Section 4, we review the mKGQA systems and propose the taxonomy of the methods. The benchmarks for the mKGQA are reviewed in Section 5. We analyze and discuss the results of the work in Section 6. The article is concluded in Section 7.
To ensure transparency and reproducibility, we followed a strict systematic review methodology, which is based on prior literature [12,78,91,103]. In this section, we describe the methodology explicitly within the context of the actual review execution process. The methodology consists of three major phases: selection of sources, initial selection of publications, and extraction and systematization of the information. The phases are described in the following subsections. It is worth mentioning that only the authors of this work were involved in the review process. The first author led the review process by conducting the respective steps (e.g., writing scripts for automated information extraction (Section 2.2), manual information extraction (Section 2.3), etc.). The other authors cross-checked the work of the first author. All authors held regular synchronization meetings to ensure mutual agreement.
Selection of sources
For the sources, we used well-established digital research databases related to computer science that offer free access to advanced search features. Following our multilingual agenda, we went beyond English for the literature search and used sources in the following languages: English, German, and Russian. We chose these languages because at least one of the authors is a native speaker of each. To identify the sources, we used three search engines – Google,6
English:
German:
Literally translated as:
Russian:
Literally translated as:
To search for publications, we used digital research databases (sources) that were selected during the previous phase. With the advanced search functionality and the corresponding complex search queries, the three main aspects of the publications had to be covered:
System aspect – Question Answering systems; Data aspect – RDF-based Knowledge Graphs; Language aspect – Multilinguality and cross-linguality.
We considered publications of the following types: descriptions of a system, descriptions of a benchmarking dataset, and survey publications. Thus, a publication needed to match the following acceptance criteria: it describes a QA system, a related benchmarking dataset, or is a systematic survey; the described system or benchmark is intended to work on RDF-based KGs; the described system or dataset is intended to work with multiple languages (at least two). Furthermore, only publications released in the period from 2011 to 2023 were considered.17
This work started in 2021; the original intention was to cover publications of the past decade, which is why we started searching from 2011.
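The acceptance criteria above can be read as a simple predicate over candidate publications. The following is a minimal sketch of that reading, not the script from the online appendix: the `Candidate` record and its field names are hypothetical, and applying the KG and language criteria only to systems and datasets (not to surveys) is our interpretation of the wording.

```python
from dataclasses import dataclass

# Hypothetical record; field names are illustrative and not taken from the appendix scripts.
@dataclass
class Candidate:
    title: str
    year: int
    kind: str           # "system", "dataset", "survey", or anything else
    rdf_based_kg: bool  # intended to work on RDF-based knowledge graphs
    languages: int      # number of languages the work is intended for

def is_accepted(p: Candidate, first_year: int = 2011, last_year: int = 2023) -> bool:
    """Apply the stated acceptance criteria to one candidate publication."""
    if not (first_year <= p.year <= last_year):
        return False
    if p.kind == "survey":
        return True
    if p.kind in ("system", "dataset"):
        # Assumption: the KG and language criteria are worded for systems and datasets only.
        return p.rdf_based_kg and p.languages >= 2
    return False
```

For instance, a 2018 system paper that works on an RDF-based KG in three languages passes the check, while the same paper published in 2010, or one covering a single language, does not.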
The conceptual representation of the query and its corresponding parts. The parts are concatenated with the
For each of the sources, we used a complex search query that covers all three aspects described above. The conceptual form of the search query is presented in Table 1. From the search results, we automatically extracted the following publication properties: authors, title, abstract, publication year, DOI/URL, source, and publisher. The script and the corresponding documentation are available in the online appendix.18
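A query of this conceptual form can be assembled programmatically from one term group per aspect. The sketch below is only an illustration: the function `build_query` is hypothetical, and both the terms and the joining operators passed to it are placeholder assumptions; the actual values are those of Table 1 and may differ between sources.

```python
def build_query(aspects: dict[str, list[str]], within: str, between: str) -> str:
    """Join the terms of each aspect with `within`, then join the aspect groups with `between`."""
    groups = []
    for terms in aspects.values():
        quoted = [f'"{term}"' for term in terms]
        groups.append("(" + f" {within} ".join(quoted) + ")")
    return f" {between} ".join(groups)

# Placeholder terms and operators (assumptions made for this example only).
aspects = {
    "system": ["question answering"],
    "data": ["knowledge graph", "RDF"],
    "language": ["multilingual", "cross-lingual"],
}
query = build_query(aspects, within="OR", between="AND")
print(query)
```

Keeping the operators as explicit arguments makes it easy to adapt the same term groups to the query syntax of each digital research database.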
Statistics on the selected and accepted publications grouped by their sources
Considering our research background in mKGQA, we found that our systematic review methodology has high specificity: some relevant publications that were known to us beforehand were not retrieved by the review process. Therefore, we added one exception to the selection process. We included publications that we were previously aware of if they matched the three main aspects (see above) and were either cited at least five times or published through a peer-reviewed process (see column “Related Work” in Table 2). The share of publications from the “Related Work” source is roughly 10%.
After the initial publication selection phase, all accepted publications were manually analyzed in more detail. In particular, along with the annotated information (authors, title, abstract, publication year, DOI/URL, source, publisher), we manually extracted the factual information needed to answer the research questions from the publications and transformed it into a tabular format with the following columns:
Paper type – describing a system, a dataset, or a systematic survey; Problem – textual description of the problem; Approach – proposal of authors on how to resolve the problem in general terms; Solution – actual results of the authors towards solving the problem; Languages – a set of languages that were used regarding the multilingual aspect; Knowledge graphs – set of knowledge graphs that were directly used in the work; Datasets – set of datasets that were mentioned or directly used in a publication; Metrics – a set of metrics used for the evaluation in a publication; Technologies – a set of technologies that were mentioned explicitly in a paper or seen in a repository; Source code & demo URLs – the links to the source code or/and demo application; Comment – an optional brief comment or remarks on the publication.
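The columns listed above map naturally onto a flat table. As a minimal sketch (not the format used in the online appendix), the following writes such records as CSV; the snake_case column identifiers and the choice to store set-valued columns as semicolon-separated strings are our own assumptions.

```python
import csv
import io

# Column names follow the list above; the identifiers themselves are an assumption.
COLUMNS = [
    "paper_type", "problem", "approach", "solution", "languages",
    "knowledge_graphs", "datasets", "metrics", "technologies",
    "source_code_and_demo_urls", "comment",
]

def to_csv(rows: list[dict]) -> str:
    """Serialize extraction records; missing columns are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        # Set- or list-valued columns become sorted, semicolon-separated strings.
        flat = {
            key: "; ".join(sorted(value)) if isinstance(value, (set, list)) else value
            for key, value in row.items()
        }
        writer.writerow(flat)
    return buffer.getvalue()

# Example record using values reported for the SWSNL system in Section 4.
example = {
    "paper_type": "system",
    "languages": {"English", "Czech"},
    "metrics": ["Precision", "Recall", "F1 score"],
}
print(to_csv([example]))
```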
Thereafter, we cross-checked19
The first author performed the manual extraction; the other authors checked the resulting information.
In this section, we review the survey articles that are related to the considered research field and were selected according to the methodology described in Section 2. Table 3 presents an overview of the publications described below.
Overview of the surveys
The overview of the survey papers that include the aspect of multilinguality
In 2017, Höffner et al. [63] dedicated their survey to the following problem: instead of a shared effort, many essential components are redeveloped. While shared practices emerge over time, they are not systematically collected. Moreover, as the authors describe it, most systems focus on a specific aspect while the others are quickly implemented, which leads to low benchmark scores and thus undervalues the contribution. The authors propose to mitigate these problems by systematically collecting and structuring methods of dealing with common challenges faced by the used approaches. The methodology consists of the following inclusion criteria for the publications: available via Google Scholar20
The search query:
The survey of Diefenbach et al. [40], which was published in 2017, targets the problem of making an “enormous amount of information in the form of knowledge bases” available with the help of question answering systems. The authors claim that they focus on the techniques behind existing QA systems (unlike the other articles). They consider five tasks (question analysis, phrase mapping, disambiguation, query construction, and querying distributed knowledge) in the QA process and describe how QA systems solve them. The defined main goal of the authors is to describe, classify, and compare all techniques used by QA systems participating in the QALD22
Question Answering over Linked Data.
In 2020, da Silva et al. [33] published a survey on end-to-end “simple QA systems”. The authors claim that in the traditional approaches, the process of answering a question can be divided into five steps: question analysis, phrase mapping, disambiguation, query construction, and querying distributed knowledge. However, given the improvements in deep neural network models and the higher availability of training data, end-to-end architectures have become the state of the art. To conduct a systematic survey, the authors decided to focus on deep learning-based QA systems designed to answer factoid questions. In particular, they describe how each existing system addresses its critical features in terms of training end-to-end models. The authors also evaluate these systems and discuss how each approach differs from the others in terms of the challenges tackled and the strategies employed. The methodology of the survey has the following inclusion criteria: 1. an initial search23
The search query:
The survey article of Dimitrakis et al. [45] was published in 2020. The authors claim that the other surveys published up to 2018 review only the corresponding QA systems, while this survey also contains a detailed list of available training and evaluation datasets for QA. Another distinctive feature of the survey is that it discusses how different types of QA systems and information sources can be combined into a unified pipeline, which helps researchers find combinations that can be more effective. As a result, the authors review approaches covering text-based, data-based, and hybrid methods as well as the corresponding datasets. Note that the authors do not describe a publication selection methodology. The multilingual aspect is covered only by a small paragraph.
In 2021, Zhang et al. [152] published their paper on deep learning in KBQA. The authors claim that the recent advances in deep learning are entering the KBQA field to improve the corresponding systems. The survey reviews recent deep learning-based KBQA efforts for simple questions in two main streams: (1) the information extraction style and (2) the semantic parsing style. Then, the authors turn to the efforts that extend the neural architectures to answer more complex questions that require multi-hop deep reasoning. Finally, several well-known benchmarks for evaluating KBQA systems are reviewed (e.g., WebQuestions [10], SimpleQuestions [14], LC-QuAD [135]). The authors mention the following remaining challenges: compositional generalizability, the gap between natural language and a knowledge base, lack of training data, limited coverage of KBs, and lack of data in languages other than English. The survey does not describe its publication selection protocol. Only one sentence is dedicated to the multilingual aspect.
The survey on KBQA by Pereira et al. [103] was published in 2022. The authors tackle the problem of the KB data accessibility as the visual navigation approach is not rich enough to answer more complex questions, and querying using SPARQL is not suitable for users who have not mastered the use of formal querying languages. The survey is mainly focused on the biomedical data domain. The authors follow a strict methodology – PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [100] to report the protocol’s execution and present the findings. In this survey, 66 documents were analyzed to classify KBQA systems according to their architectural styles. The survey reviews 25 semantic parsing pipeline systems, 12 using subgraph matching, 7 based on templates, and 22 performing information extraction. The authors believe that on the one hand, it is necessary to answer increasingly complex questions, and on the other hand, there is a need to deal with the inherent incompleteness of KBs. There is only one paragraph dedicated to the multilingual functionality.
In 2022, Antoniou et al. [4] published a survey on SQA systems. The authors claim that no categories of SQA systems (i.e., no typology or taxonomy) had been identified before this survey and that, at the date of publication, no survey on the categorization of SQA systems existed. The survey therefore distinguishes categories of SQA systems based on a set of criteria in order to lay the groundwork for a collection of common practices. The authors believe that the categorization and systematization can help developers, or anyone interested, to directly find out the techniques or steps used by each system or to benchmark their own system against existing ones. The classification created in the survey is based on the following properties: domain, data source, types of questions, types of analysis done on questions, types of representations used for questions, characteristics of the KB, techniques used for retrieving answers, user interaction, and answers. The methodology of the survey is not described. The authors dedicate only one paragraph to the multilingual functionality.
From the survey articles considered above, we can clearly see that none of them targets the multilingual aspect of KGQA specifically. However, multilinguality is mentioned in all of the publications as an important challenge. The absence of a survey paper dedicated specifically to mKGQA is the main motivation for conducting this work. We also note that four out of the seven papers do not publish their methodology or review protocol; hence, the reproducibility of their work is questionable. Therefore, we encourage the research community to include the methodology or the protocol in their survey papers.
Multilingual question answering systems over knowledge graphs
In this section, we review the 21 mKGQA systems that were discovered in the 26 publications24
Some systems are described in multiple publications as they present minor improvements.
In this section, we group the reviewed systems by the methods that were used for their implementation. These method groups are described in detail in Section 4.2. Although some systems may belong to multiple groups, we assign each system to a single group according to the method group that dominates within it.
Systems predominantly using methods based on rules and templates (G1)
The QALL-ME system [52] was published in 2011 and is designed to provide relevant information and answers to arbitrary questions of its users. The authors regard this task as a challenge because of “the exponential growth of digital information”. Therefore, the authors of QALL-ME propose a reusable architecture for building multilingual QA systems that answer questions with the help of structured answer data sources from freely specifiable domains. The workflow of the system is managed by a software module named QA Planner, which orchestrates the QA components and thus passes the input question through the whole system until an answer is found. These components are a language identifier, an entity annotator, a term annotator, a temporal expression annotator, a query generator, and an answer retriever. The authors do not explicitly mention which methods were used for the implementation; however, pattern mappings are used for query generation. The system implements a Service-Oriented Architecture (SOA). The authors claim that QALL-ME works with German, Spanish, English, and Italian. The system is intended to work on a non-public RDF-based knowledge graph and was evaluated on a benchmark that was not published on the Web; hence, it is not possible to compare QALL-ME with other systems. Nevertheless, the authors report the accuracy (72.89% on average across all languages) and the performance of the Recognizing Textual Entailment (RTE) component (86.97% on average across all languages). The authors claim that more attention is needed regarding the acquisition of minimal question patterns and the interactive QA process. It is worth mentioning that the authors provide the source code of the system, written in Java, which is currently outdated. The QA components used in the system mostly work in a dictionary-based setting and are thus challenging to port to other datasets.
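The orchestration idea behind a planner of this kind, i.e., passing a question through a sequence of components until an answer is available, can be sketched in a few lines. The code below is a toy illustration under our own assumptions, not QALL-ME's Java implementation: the component functions, the pattern table, and the data entry are all invented for the example.

```python
from typing import Callable

Component = Callable[[dict], dict]

# Toy pattern mappings and toy data (invented for this sketch).
PATTERNS = {
    "en": {"where is ": "location"},
    "de": {"wo ist ": "location"},
}
TOY_DATA = {("location", "astra cinema"): "Main Street 1"}

def identify_language(state: dict) -> dict:
    # Naive stand-in for a real language identifier.
    state["lang"] = "de" if state["question"].lower().startswith("wo ") else "en"
    return state

def generate_query(state: dict) -> dict:
    # Map a question prefix to a relation; the remainder is treated as the entity.
    text = state["question"].lower().rstrip("?")
    for prefix, relation in PATTERNS[state["lang"]].items():
        if text.startswith(prefix):
            state["query"] = (relation, text[len(prefix):])
    return state

def retrieve_answer(state: dict) -> dict:
    if state.get("query") in TOY_DATA:
        state["answer"] = TOY_DATA[state["query"]]
    return state

def run_pipeline(question: str, components: list[Component]) -> dict:
    """Pass the question through the components in order; stop once an answer exists."""
    state = {"question": question}
    for component in components:
        state = component(state)
        if "answer" in state:
            break
    return state

result = run_pipeline("Wo ist Astra Cinema?", [identify_language, generate_query, retrieve_answer])
print(result.get("answer"))
```

The dictionary-based pattern table also illustrates the portability concern raised above: every new language or domain requires new entries.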
The KGQA system authored by Aggarwal [1] was published in 2012. The author targets the problem of the poor accuracy of multilingual natural-language interfaces that provide access to the Semantic Web data. The approach of Aggarwal is a cross-lingual semantic search method, which aims to retrieve all relevant pieces of information even if they are available in languages different from the initial question’s language. Similarly to the previous paper – QALL-ME – this system is implemented with a multilingual QA pipeline that performs entity search (exact match between the entity and ontology label), parse tree generation (using the Stanford Parser25
The QAKiS system [21,30] was originally published in 2013 with an extension in 2014 [20]. The authors of QAKiS focus on the inequality of the information across the multilingual DBpedia chapters: chapters can contain different information from one language to another, and some provide more specific information on certain topics. Thus, the ability to utilize all the information across the languages would be beneficial for QA systems. The approach is targeted at enhancing users’ interactions with the Web of Data by offering query interfaces that provide flexible mappings between NL expressions, concepts, and relations in structured KBs. The implementation of the QAKiS system contains four main multilingual modules: Named Entity Recognition – NER (Stanford Core NLP NER26
The SWSNL system [56] was published in 2013. The authors aim at simplifying form-based search by introducing a system that is able to search over domain-specific data with NL- or keyword-based input. The approach of the authors is to create a component-based KGQA system, similar to QALL-ME and the system of Aggarwal, that is able to answer questions over a domain-specific RDF-based KG. The resulting solution is a Java-based application that converts a textual question into a KG-independent query (with preprocessing, NER, and semantic parsing) and thereafter transforms it into SPARQL using a rule-based interpretation approach. The authors create the target KG by crawling an accommodation website and integrating its information into a custom ontology. For the QA evaluation, the authors collect and annotate a dataset of 68 questions in English and Czech. The reported Precision, Recall, and F1 score are 66%, 34%, and 45%, respectively. The authors name the following possible follow-up contributions: extension of the evaluation corpus, integration of a full-text search, and improvement of the NER module as well as of the performance of the system.
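As a quick plausibility check, the reported F1 score is consistent with the harmonic mean of the reported Precision and Recall. The sketch below uses the standard F1 definition; whether the authors computed F1 from aggregate values in exactly this way is an assumption on our side.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported for SWSNL: Precision 66%, Recall 34%.
print(f"{f1_score(0.66, 0.34):.1%}")
```

The result rounds to the reported 45%. The same check applied to figures from other systems only holds when F1 is derived from aggregate Precision and Recall; per-question averaging can yield different values.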
The AMUSE system was published in 2017 [57]. The authors target the “lexical gap” that arises when mapping natural-language questions to SPARQL queries, especially in a multilingual setting. The approach within the AMUSE system is based on probabilistic inference, which is aimed at predicting the query that has the highest probability of being the correct interpretation of the given question string. The actual implementation has two layers and is built with the help of Universal Dependencies (UD) [99] and Java. The first layer (L2KB) is trained using an entity linking objective that learns to link parts of the query to identifiers. The second layer is a query construction layer that takes the top k results from the L2KB layer and assigns semantic representations to the words to yield a logical representation of the complete question with the help of a semantic parse tree. The final output of the system is a SPARQL query. The AMUSE system works with English, German, and Spanish over the DBpedia KG. The evaluation is performed on the QALD-6 dataset, where the following macro F1 score values were obtained: 34%, 37%, and 42% for the supported languages, respectively. The authors see support for questions whose SPARQL queries require modifiers (e.g., filtering) as a possible improvement for their system.
The WDAqua-core0 system [43] was released in 2017. The authors aim at the problem of handling the growing amount of structured Semantic Web data. The system uses a combinatorial approach based on the semantics encoded in the underlying KG. The implementation of the WDAqua-core0 is carried out using the Qanary framework [15,42]. It can answer questions that require not only
The KGQA system UDepLambda [118] was released in 2017. The authors underline the problem of the particular focus on the English language in the publications related to KGQA. Similarly to the AMUSE system, the proposed approach is to convert NL questions to logical forms, which are thereafter converted to machine-interpretable representations. The actual solution is also based on Universal Dependencies [98] and maps NL to logical forms, representing the underlying predicate-argument structures, in an almost language-independent manner. The system UDepLambda works with English, German, and Spanish over the Freebase KG. The evaluation is done on two benchmarks, WebQuestions and GraphQuestions. For the first benchmark, the following F1 score values are obtained: 49.5%, 46.1%, and 47.5% for the supported languages, respectively. For the second benchmark, the reported F1 score values are 17.7%, 9.5%, and 12.8% for the supported languages, respectively. Although the system achieves reasonable results, it should be noted that the questions in languages other than English were machine-translated.
The system MuG-QA [155] was released in 2018 and targets the problem of handling data in light of the rapid development of RDF and KGs and the increase of non-English data. The approach of answering questions in the multilingual setting is focused on forming an abstract conceptual grammar from the questions. Once a question is parsed, the resulting abstract grammar tree is matched with a KG to formulate a SPARQL query. The MuG-QA grammar is formed using the Grammatical Framework (GF) [116] and the GF Resource Grammar Library [117]; the entities and classes are linked using the “interlanguage-links-dataset” [69]. The system works with English, French, Italian, and German. MuG-QA is evaluated on the QALD-7 benchmark, which contains queries over DBpedia. The resulting micro F1 score values are as follows: 67.7%, 56.6%, 65.6%, and 61.3% for the supported languages, respectively. The authors identify the “semantic flexibility” of the system and the addition of more languages as possible improvement directions. Grammar-based methods require experts and increased labor costs for creating the grammars. Although the abstract grammar tree is language-agnostic, mappings still have to be created to introduce a new language.
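Because the systems in this section report either macro or micro F1 scores, their values are not directly comparable. The following sketch contrasts the two using the generic textbook definitions over sets of gold and predicted answers; the exact evaluation conventions of individual QALD editions may differ, and the toy answer sets are invented for illustration.

```python
def prf(tp: int, n_pred: int, n_gold: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (zero-safe)."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_f1(pairs: list[tuple[set, set]]) -> float:
    """Average of per-question F1 scores; every question counts equally."""
    scores = [prf(len(gold & pred), len(pred), len(gold))[2] for gold, pred in pairs]
    return sum(scores) / len(scores)

def micro_f1(pairs: list[tuple[set, set]]) -> float:
    """F1 over counts pooled across questions; questions with many answers weigh more."""
    tp = sum(len(gold & pred) for gold, pred in pairs)
    n_pred = sum(len(pred) for _, pred in pairs)
    n_gold = sum(len(gold) for gold, _ in pairs)
    return prf(tp, n_pred, n_gold)[2]

# Toy (gold, predicted) answer sets for two questions.
pairs = [({"a"}, {"a"}), ({"b", "c", "d", "e"}, {"b", "x"})]
print(round(macro_f1(pairs), 3), round(micro_f1(pairs), 3))
```

In this toy case the macro score is higher than the micro score because the poorly answered question has more gold answers and therefore dominates the pooled counts.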
The WDAqua-core1 system [44] was published in 2018 and extends its predecessor, WDAqua-core0. The authors claim that a freely available KGQA solution will allow the setup of the corresponding services across many new data sources and will likely boost the publication of new RDF datasets. The approach of WDAqua-core1 is based on the assumption that the questions can be understood by ignoring the syntax while focusing only on the semantics of the words. The implementation consists of a modular pipeline that contains query expansion, query construction, query ranking, and answer decision. The system is also integrated into the Qanary framework. Query expansion finds all concepts related to a particular n-gram substring in a question, query construction combines the concepts using a pre-defined algorithm for query patterns, query ranking ranks the generated queries according to a set of manually constructed features, and answer decision utilizes a binary classifier for additional filtering of queries. WDAqua-core1 supports English, German, French, Italian, and Spanish. The set of supported KGs includes DBpedia, Wikidata, MusicBrainz, and DBLP. The WDAqua-core1 system was evaluated on the QALD-{
The LAMA system [115] was published in 2018. The authors of the system target the conventional problem of RDF data accessibility by providing an NL interface to RDF such that a user does not have to learn a query language. As an approach, it is proposed to develop a QA system that is based on analyzing lexico-syntactic patterns that can help generate corresponding SPARQL queries, i.e., they search for generalized linguistic structures that denote semantic relationships between concepts. The actual KGQA solution contains several processing phases: pre-processing (syntax parsing and question classification), generation of additional intermediate structures (dependency tree, POS tags, question type), and a core processing module, which transforms the syntax tree into an intermediate representation; finally, the intermediate representation is parsed to generate one or more triple patterns used in the final SPARQL query. The solution is implemented using the following tools: SyntaxNet,27
The UTQA system [111] was released in 2016. The authors highlight the particular focus of the KGQA research field on the English language only. In the authors’ view, this happens for two reasons: the lack of multilingual tools and resources on the one hand, and the “vocabulary gap” between source and target languages on the other. The approach exploited in the UTQA system is based on a set of multilingual components that sequentially process a question: keyword extraction (using maximum-entropy Markov model, non-English ones are translated with Google Translate API28
The Platypus system [101] was released in 2018 and is available online.29
The QAnswer system, which is a follow-up of WDAqua-core0 and WDAqua-core1, was released in 2019 [39,41]. The authors target the problem of the limited accessibility of a large number of LOD datasets. This problem stems from the fact that the majority of the systems allow accessing only one dataset and one language. The proposed approach is the same as for the WDAqua-core1 system: it is multilingual and KG-agnostic. The QA process consists of the following four steps: question expansion, query construction, query ranking, and answer decision. The system is extended by introducing feedback and re-training functionality based on a user’s data. QAnswer supports English, German, French, Italian, Spanish, Portuguese, Arabic, and Chinese. By default, the system is able to answer questions over Wikidata, DBpedia, MusicBrainz, DBLP, and Freebase. The evaluation results on the QALD-{
The DeepPavlov system [50] was published in 2020. The authors of the system target the answering of complex questions with “logical or comparative reasoning”. As an approach, it is proposed to decompose the task of KBQA into multiple steps or components: query template prediction, entity detection, entity linking, relation ranking, path ranking, constraint extraction (if the question has constraints), and generation of a query from the extracted entities, relations, and constraints. The component pipeline is based on deep neural networks. Questions are classified by query template type using the BERT [37] large language model (CLS token). Entity detection is performed with BERT-based sequence labeling. Entity linking is implemented using fuzzy matching of the string extracted at the entity detection step against an inverted index. For relation ranking, relation candidates are extracted from the linked entities; the question’s token embeddings are passed to a 2-layer Bi-LSTM to obtain hidden states, the dot product of these hidden states and the relation embeddings (of their titles) is computed, and the result is passed to a Softmax layer (the model is trained to maximize the product of the token embedding and the correct relation embedding). BERT is used for path ranking of relation candidates, and regular expressions are used to extract modifiers. The solution is implemented using the Python programming language. The DeepPavlov KBQA system supports only English and Russian. The system is compatible with the Wikidata KG. The evaluation is done on the LC-QuAD 2.0 dataset; the authors report a Precision of 60%, a Recall of 66%, and an F1 score of 63%. It is worth mentioning that the used models, and therefore the whole system, are quite resource-intensive: proper functionality on a CPU machine requires around 32 GB of RAM.
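The scoring step of the relation ranking described above (dot product followed by Softmax) can be illustrated in isolation. This is a minimal sketch under strong assumptions, not DeepPavlov's code: in the real system the question representation comes from a Bi-LSTM and the relation embeddings are learned, whereas the two-dimensional vectors and relation titles below are hand-written toy values.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_relations(question_vec: list[float],
                   relation_vecs: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score each candidate relation by dot product with the question vector, then normalize."""
    names = list(relation_vecs)
    scores = [sum(q * r for q, r in zip(question_vec, relation_vecs[name])) for name in names]
    probs = softmax(scores)
    return sorted(zip(names, probs), key=lambda item: item[1], reverse=True)

# Toy vectors (assumptions); a trained model would supply these.
question_vec = [1.0, 0.0]
candidates = {"place of birth": [0.9, 0.1], "date of birth": [0.1, 0.9]}
ranking = rank_relations(question_vec, candidates)
print(ranking[0][0])
```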
The authors of the Tiresias system [95], which was published in 2022, focus on improving the multilingual accessibility of KGQA systems. In addition to the structured DBpedia information, the authors propose to use multilingual DBpedia abstracts as an additional information source. The Tiresias system processes a question sequentially: (1) the main named entity is recognized with DBpedia Spotlight, (2) a DBpedia abstract is retrieved for the entity using a SPARQL query, (3) the question text is translated into English using Bing or Helsinki MT, and (4) the final answer is produced with a pre-trained BERT-like QA model. The authors evaluate their system on a custom bilingual dataset (English and Greek) with a manually defined approach that splits the results into correct, partially correct, and wrong. Hence, the evaluation results cannot be compared to the other systems, as no standard metrics are used in this work. The authors see the technical accessibility of Tiresias, additional information sources in the QA process, and the set of supported languages as possible extension areas.
The DeepPavlov 2023 system [136] was released in 2023. The authors’ main objective is to provide a user with full NL answers verbalized from KG triples. In addition, the previous version of this system [50] was improved w.r.t. QA quality and now represents the state of the art for KGQA on the Russian RuBQ 2.0 benchmark [119]. The system conducts the following tasks: entity detection, entity linking, relation ranking, SPARQL template prediction, SPARQL slot filling, and path ranking. As a result of the latter step, a complete SPARQL query over Wikidata is generated, which can be executed to get an answer. In the answer generation step, the system takes the query paths with answer URIs and uses the JointGT model [75] to produce the answer text. The components of the system that conduct the aforementioned tasks are BERT-based models trained on different KGQA datasets, such as LC-QuAD 1.0 [135]. The DeepPavlov 2023 system works with English and Russian; however, these are two different system instances, as it uses monolingual neural models. The English version was evaluated on the LC-QuAD dataset (47% F1 score), and the Russian version was evaluated on the RuBQ 2.0 dataset (53.1% F1 score). The system is accompanied by working source code. As a future objective, the authors plan to combine knowledge from both structured and unstructured sources.
The authors of the XSemPLR approach [153] tackle the task of cross-lingual semantic parsing (CLSP) over SQL, lambda calculus, and other meaning representations (eight in total), including SPARQL. The authors claim that their main contribution is a unified benchmark for CLSP constructed from nine existing datasets. However, for this survey, the most important contribution is the evaluation of multilingual LLMs on KGQA benchmarks. In particular, for CLSP over SPARQL the authors used the MCWQ benchmark [32], which contains questions in English, Hebrew, Kannada, and Chinese with queries over Wikidata. The following LMs were evaluated: LSTM [60], mBERT+Pointer-based Decoders (PTR) [36], XLM-R+PTR [31], mBART [28], Codex [26], BLOOM [122], and mT5 [147]. The aforementioned models were used in the following settings: monolingual, monolingual few-shot, multilingual, cross-lingual zero-shot transfer, and cross-lingual few-shot transfer. The highest results were achieved by the mT5 model in the monolingual setting. Based on the exact match metric, the results are as follows: 39.29%, 33.02%, 23.74%, and 24.56% (for the aforementioned languages). Nevertheless, the authors claim that multilingual LLMs (e.g., BLOOM) are still inadequate for CLSP tasks. This work is provided with the source code for the evaluation. The authors identify the performance gap between monolingual training and cross-lingual transfer learning as an open challenge.
The CLRN system [131] represents a new approach to engage with the challenges of Cross-lingual KGQA (CLKGQA). Traditional methods typically revolve around the melding of multiple CLKGs into one consolidated KG. However, the authors challenge this approach, emphasizing shortcomings in the ability of existing Entity Alignment (EA) models to accurately align entity pairs in CLKGs. The authors suggest two important challenges to address: dependency of a QA model on a unified KG, and enhancement of an EA model’s performance. To tackle these issues, they propose the Cross-lingual Reasoning Network (CLRN), a revolutionary multi-hop QA model that allows for flexible shifting between knowledge graphs at any point in the multi-hop reasoning process. Further, they establish an iterative framework that couples the CLRN and EA models to extract potential alignment triple pairs from the CLKGs during the QA procedure, thus enhancing the performance of the EA model. Their experimental results demonstrate that the CLRN outperforms other baselines. The experiments were conducted on the MLPQ [130] benchmark that incorporates language-specific DBpedia KGs in English, Chinese, and French. The authors particularly note meaningful improvement in the EA model’s performance through iterative enhancement, leading to a statistically significant 1.0% increase in Hit@1 and Hit@10. Additionally, they open up an interesting discourse on the relationship between QA and EA from the QA perspective. The authors make their dataset and code publicly available, furthering the scope for future explorations.
The system authored by Y. Zhou et al. [154] was published in 2021. The authors aim to meet the rising demand for KGQA systems that answer multilingual questions. However, building a large-scale KG, as well as annotating QA data, is costly for each new language. Therefore, there is a considerable KGQA performance gap between source and target languages, which is consistent with the empirical results on a wide range of other tasks in prior works. The idea of the approach is to pre-train a multilingual transformer encoder in a self-supervised manner and thereafter fine-tune it on the data of a data-rich (source) language. The assumption is that the fine-tuned model is generalizable enough to perform inference in other low-resource (target) languages. This paradigm can be adapted to KGQA in order to construct symbolic logical forms for KG queries. It is also proposed to replace the fully supervised machine translator with unsupervised bilingual lexicon induction (BLI) [71] for word-level translation. The actual implementation uses BLI-based augmentation of the multilingual training data, after which the encoder is adapted to the augmented data. An adversarial learning strategy coupled with the BLI-based augmentation is proposed for robust cross-lingual transfer. The system by Y. Zhou et al. is capable of working with English, Farsi, German, Romanian, Italian, Russian, French, Dutch, Spanish, Hindi, and Portuguese. As the system works with the DBpedia KG, the evaluation is done on the LC-QuAD 1.0 (translated to multiple languages with Google Translate) and QALD-9 benchmarks. The average F1 score values across the languages are 75.9% and 63.0% for the benchmarks, respectively. Given that LC-QuAD 1.0 was machine-translated for the evaluation, the reported performance of the system may be questionable.
The system by A. Perevalov et al. [105] was published in 2022. The authors focus on the problem of unequal language distribution on the Web and, therefore, unequal content accessibility. In addition, only a few research initiatives target the problem of multilingual access in the KGQA field. Therefore, the authors propose to combine well-known KGQA systems with machine translation (MT) tools in order to measure the impact of machine translation on question-answering quality and to determine whether machine translation could be an alternative to multilingual solutions. In the actual solution, the authors combine QAnswer, DeepPavlov, and Platypus with Yandex Translate API30
The Lingua Franca approach [128] aims to improve the method by Perevalov et al. (see paragraph above) by combining a named entity-aware MT approach with mKGQA systems. Lingua Franca leverages symbolic information about named entities stored in Wikidata to preserve their correct translation to the target language. In particular, the developed solution has the following processing steps: (1) named entity recognition and linking for identifying the named entities in a question, and (2) MT with an entity-replacement technique using the entity labels from Wikidata in the target language. The approach was evaluated with the QAnswer and Qanary KGQA systems on the QALD-9-plus dataset using German, French, and Russian questions. The majority of the experimental cases (19 out of 24) show that the KGQA systems that used Lingua Franca outperformed the ones that used standard MT tools.
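The entity-replacement idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the Lingua Franca implementation: the function names, the placeholder format, and the toy translator are hypothetical, the identifier Q365 is used only as an illustrative key, and the sketch assumes that the MT tool leaves the placeholder token untouched.

```python
def translate_with_entities(question, entities, target_labels, translate):
    """Translate a question while protecting its named entities.

    entities: list of (surface_form, entity_id) pairs found by NER/NEL.
    target_labels: entity_id -> entity label in the target language.
    translate: any text-to-text MT callable, treated as a black box.
    """
    placeholders = {}
    masked = question
    for index, (surface, entity_id) in enumerate(entities):
        token = f"[ENT{index}]"
        masked = masked.replace(surface, token)
        placeholders[token] = target_labels[entity_id]
    translated = translate(masked)
    # Restore the entities using their labels in the target language.
    for token, label in placeholders.items():
        translated = translated.replace(token, label)
    return translated

def toy_translate(text):
    """Stand-in for a real MT service (hypothetical one-entry phrase table)."""
    table = {"Wo liegt [ENT0]?": "Where is [ENT0] located?"}
    return table.get(text, text)

result = translate_with_entities(
    "Wo liegt Köln?",
    [("Köln", "Q365")],    # assumed output of entity recognition and linking
    {"Q365": "Cologne"},   # assumed target-language label from the KG
    toy_translate,
)
```

Because the entity label is taken from the KG rather than produced by the translator, the translated question contains a label that the downstream KGQA system can link.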
In Table 4 we summarize the reviewed systems ordered by publication date with their characteristics such as:
publication year of a paper;
languages that were used in the evaluation or supported by the described system;
knowledge graphs that were used in the evaluation;
datasets (or benchmarks) used in the evaluation;
metrics used to measure the QA quality of a system;
technologies used to implement the described system;
code/demo availability;
methods that were used according to the taxonomy described in Section 4.2.
The overview on the multilingual KGQA systems published between 2011 and 2023
In the next subsection, we present the taxonomy of the methods that are used to develop mKGQA systems.
While answering
Overview
The taxonomy is based on our review and materials from the previous survey articles [33,40,63,103]. Note that not all of the methods to be presented below are working in an end-to-end manner, meaning that not all of them directly produce an answer or a SPARQL query. Some of the methods require the use of other methods to form a complete mKGQA system.
In a nutshell, there are two system development paradigms. The first class of KGQA systems relies on a sequence of predefined task-oriented components. This paradigm is named “Semantic Parsing” and is often referred to as “QA pipelines” [15]. In such systems, a question is processed in multiple steps, one per component, for example, NER, Relation Prediction (REL), Query Builder (QB), and Query Executor (QE). The aim of such systems is to convert an NL question to a SPARQL query. The second paradigm is named “end-to-end KGQA” and aims at answering a question in a single step. These systems are mainly based on neural network-related approaches [23], e.g., ranking of an answer candidate given a question, or translation of a question to a query (end-to-end semantic parsing) [149]. Both of the aforementioned paradigms may utilize one or more methods from the taxonomy defined below.
We organize the methods as follows, the high-level general groups (denoted as “G”) contain the low-level concrete methods (denoted as “M”):
G1 – methods based on rules and templates:
M1.1 – syntax tree parsing is used to convert NL to a machine-readable syntax tree;
M1.2 – grammar rules are used to extract structured information from NL with manually defined rules;
M1.3 – logical representations are used as a machine-readable intermediate form to represent the semantics of a given NL text;
M1.4 – dictionaries, indexes, and templates are used for generating queries or matching entities and relations;
G2 – statistical methods:
M2.1 – classical machine learning and statistical methods are used for the downstream tasks of KGQA (e.g., NER, REL, etc.);
M2.2 – deep learning methods are mainly used in the context of language modeling, graph embedding models, and encoder-decoder architectures;
G3 – machine translation methods:
M3.1 – end-to-end machine translation methods are used for direct translation of a source language to the target one that is supported by the system;
M3.2 – enhanced machine translation methods are used for machine translation with intermediate improvements (e.g., KG enriched [96]) of a source language to the target one that is supported by the system.
The taxonomy is demonstrated in Fig. 1. It is worth mentioning that the methods are not mutually exclusive within one system. Thus, multiple of them can be used within one system, for example, M1.4 and M2.2 are used within the QAnswer system. The methods used by the corresponding systems are listed in Table 4.

Let us review the methods while specifically focusing on their multilingual capability. It is worth underlining that the methods of groups G1 and G2 are widely used in monolingual KGQA, i.e., are not multilingual by definition, however, they may provide this functionality to some extent. For example, the method M2.2 (deep learning) covers multilingual language models as well as monolingual ones. The methods of group G3 are mostly used in the multilingual context. Technically, the methods from the G3 could be placed under the groups G1 or G2, however, in the context of mKGQA, G3 represents a completely different approach on the ideological level.
Most of the syntax tree parsing (M1.1) implementations are grammar-based (e.g., Stanford NLP Parser [80] or BLLIP [24]). Hence, those are closely dependent on the language-specific grammar rules, which are related to method M1.2. However, it is possible to extend syntax tree parsing methods to multiple languages while implementing it with multilingual language models or the ones trained separately on different languages (method M2.2), as demonstrated in the following publications [25,46,48]. One of the major state-of-the-art multilingual syntactic parsers Stanza [112] is based on Universal Dependencies [98], which is a large tree bank for many languages. The method M1.1 can be used for building machine-readable representations of a NL question, NER, and REL.
The method based on grammar rules (M1.2), as described above, is language-dependent by definition. Context-free grammars (CFG) are commonly used in NLP due to their efficient implementation [55]. To implement this method, one needs to define a set of language-dependent rules that extract particular structures from an NL text. The Grammatical Framework (GF) [116] provides a syntax for creating pseudo-multilingual grammars (one still needs to define general rules for multiple languages, although it appears to be more convenient). Tools such as POTATO [83] and the Yargy parser [84] provide grammar-based functionality. The method M1.2 can be used for NER and REL; in addition, the structural elements of an NL text, such as the subject-predicate-object structure, can be extracted. The extracted information leads to the building of machine-readable representations of a text, which are related to the following method M1.3.
The methods using logical representations (M1.3) are aimed at creating machine-readable meaning representations of NL with the means of description logics (DLs). For example, a question “In what city was Angela Merkel born?” is represented in a logical form as “
The methods based on dictionaries, indexes, and templates (M1.4) are mainly focused on lookup tasks and query generation. One example is the named entity linking (NEL) task, which can be solved by looking up the label-URI dictionary of a KG. The dictionaries can also be exploited for mapping between language-specific terms. The templates can be used for SPARQL query generation by filling their slots with the information extracted in the previous steps, e.g., the DeepPavlov system uses several query templates that correspond to the different query types. Hence, method M1.4 is language-dependent by default but can be extended to serve multilingual functionality, e.g., by introducing multilingual dictionaries that link all the language-specific labels of one entity together. The method M1.4 is used for such tasks as NEL, term translation, and QB.
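A minimal sketch of M1.4 is given below: a multilingual label dictionary that maps every language-specific label of an entity to one URI, combined with a single query template. The dictionary contents, the template, and the function names are illustrative assumptions rather than the implementation of any reviewed system.

```python
# Multilingual label dictionary: every language-specific label points to one URI.
LABEL_INDEX = {
    ("en", "germany"): "http://www.wikidata.org/entity/Q183",
    ("de", "deutschland"): "http://www.wikidata.org/entity/Q183",
    ("fr", "allemagne"): "http://www.wikidata.org/entity/Q183",
}

# Query template with two slots: the linked entity and the predicted relation.
TEMPLATE = "SELECT ?answer WHERE {{ <{entity}> <{relation}> ?answer . }}"

def link_entity(label, lang):
    """Look up the URI for a label in the given language (None if unknown)."""
    return LABEL_INDEX.get((lang, label.strip().lower()))

def build_query(label, lang, relation_uri):
    """Fill the template slots with the linked entity and the relation."""
    entity_uri = link_entity(label, lang)
    if entity_uri is None:
        raise KeyError(f"no entity for label {label!r} in language {lang!r}")
    return TEMPLATE.format(entity=entity_uri, relation=relation_uri)

CAPITAL = "http://www.wikidata.org/prop/direct/P36"  # Wikidata property "capital"
query_de = build_query("Deutschland", "de", CAPITAL)
query_fr = build_query("Allemagne", "fr", CAPITAL)
```

Both calls yield the same SPARQL query, which is exactly the effect of linking all language-specific labels of one entity together.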
The methods based on classical machine learning and statistical methods (M2.1) are solving a variety of classification and sequence labeling tasks. One can utilize logistic regression in combination with TF-IDF for detecting the expected answer type (classification task) [104]. Another example can be the NER task using Hidden Markov Models (HMMs) [9]. Conventionally, these methods are known as lightweight, transparent, and explainable in comparison to deep learning (method M2.2). Nevertheless, their multilingual functionality is limited. For example, while considering TF-IDF as a method for text-to-vector transformations, the usage of this method in multiple languages leads to extremely sparse feature sets as the vocabulary increases with each language. Consequently, one needs to develop different language-specific models in order to process multiple languages. The method M2.1 can be used for NER, answer type detection, POS tagging, intent detection, REL, and other similar tasks.
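The sparsity argument can be made concrete with a small sketch. It does not compute TF-IDF weights; it only counts the absent terms in a document-term matrix, since a term that is absent from a document always has a zero TF-IDF weight. The three toy corpora and the function names are invented for illustration.

```python
def vocabulary(corpus):
    """Sorted set of whitespace-separated, lower-cased tokens in the corpus."""
    return sorted({token for doc in corpus for token in doc.lower().split()})

def sparsity(corpus):
    """Share of zero entries in the document-term matrix of the corpus."""
    vocab = vocabulary(corpus)
    zeros = 0
    for doc in corpus:
        tokens = set(doc.lower().split())
        zeros += sum(1 for term in vocab if term not in tokens)
    return zeros / (len(corpus) * len(vocab))

english = ["who wrote hamlet", "who painted guernica"]
german = ["wer schrieb hamlet", "wer malte guernica"]
french = ["qui a écrit hamlet", "qui a peint guernica"]

mono = sparsity(english)                     # one language, one vocabulary
multi = sparsity(english + german + french)  # shared multilingual vocabulary
```

In this toy setting only the named entities are shared across languages, so the vocabulary grows from 5 to 12 terms and the share of zero entries rises from 0.4 to roughly 0.72.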
The methods based on deep learning (M2.2) are applied to a wide range of KGQA tasks. The two main applications of this method class are graph embeddings and language modeling. The graph embedding direction is used in the KGQA field to search for and extract sub-graphs, relations, or entities given a textual question [121]. The language modeling direction is aimed at solving the KGQA downstream tasks with better quality and generalizability than the classical machine learning methods [148]. LLMs are well-suited for working in the multilingual setting (e.g., multilingual BERT supports 104 languages [109]); however, they require fine-tuning for the downstream tasks. The paper [153] demonstrated that LLMs are able to work in a zero- and few-shot setting with multiple languages for SPARQL query generation. Hence, the method class M2.2 is suitable for all the downstream KGQA tasks (e.g., NER, REL), as well as for end-to-end KGQA (i.e., producing a SPARQL query directly or an answer). Nevertheless, one needs to take into account the resource consumption of the deep learning-based methods, especially considering the latest LLMs such as PaLM [29], Chinchilla [61], LLaMa [134], and others.
The end-to-end machine translation methods (M3.1) are utilized for translating source languages to the target ones that are supported by a particular KGQA system. In this case, the machine translation tool is treated as a “black box”; hence, a developer does not influence its working process. The majority of the neural machine translation models (e.g., [133]) provide the corresponding functionality for one language pair per model, i.e., multiple models are required for translating different language pairs. However, the state-of-the-art large generative models [17] provide the ability to handle multiple languages for the translation tasks, despite the majority of the training data being in English.
The methods related to the enhanced machine translation M3.2 are serving the same functionality as the M3.1, however, in this context, more background knowledge is used in the process. These machine translation methods can take into account the KG embeddings of the named entities (e.g., KG-NMT approach [96]), tag the named entities in an input text before translating it [137], or use bilingual lexicon induction for training data augmentation (e.g., [154]). Therefore, the translation process is improved based on the used background knowledge. For instance, the additional information regarding named entities ensures that they are not corrupted during the translation process.
To completely cover
Multilinguality denotes the usage of several different languages [66]. Formally, we consider a KGQA system as multilingual if it supports more than one language by default (i.e., without additional efforts to re-train, clone, or fine-tune it). In a more general definition, the languages handled by a multilingual system must belong to different language groups (e.g., Finnish vs Russian), alphabets (e.g., Bulgarian vs French), or writing systems (e.g., German vs Arabic). It is worth underlining that the quality deviation among the different languages handled by a multilingual method should tend to zero, while demonstrating the absolute results comparable with the monolingual state-of-the-art method. Let us formally state when a KGQA system handles multilingual questions well using the following definitions:
L is a set of languages:
Therefore, a system

Visual representation of the multi-objective quality function for mKGQA systems, the gold-colored results represent the Pareto front (optimal solution). The systems are associated with the quality quadrants that help to easily interpret the values.
We encourage researchers to use this procedure and our findings for comprehensive quality measurement of mKGQA systems.
While working on
We analyzed the data on reviewed mKGQA systems (see Table 4) regarding the languages and language families that are covered by them to answer

The visual representation of language and language family coverage among the mKGQA systems.
To the best of our knowledge, only 21 mKGQA systems were developed within the research context during the past decade. We observed that in most cases, namely 95%, the mKGQA systems target the Indo-European language family. Therefore, the mKGQA systems mostly work within one writing system, namely Alphabetic (using the Latin, Cyrillic, Armenian, or Greek script). Hence, the actual generalizability and scalability of the used methods across diverse languages are unclear. In addition, our survey demonstrated that all systems except QAnswer target general-domain KGs only. We also note that researchers do not use standard metrics or evaluation tools that would ensure comparable evaluation results. Finally, the analysis showed that a significant share of the systems, namely 11 out of 21, is not accessible because the demo websites and source code are outdated or absent. Therefore, the experimental results are not reproducible.
We highlighted three main groups of methods for mKGQA: rules and templates (G1), statistical methods (G2), and machine translation (G3). The analysis of the taxonomy created in this work demonstrated that researchers prefer to reuse monolingual methods that are adapted to other languages rather than working towards language-agnostic ones. The analysis of the KGQA systems showed that assigning a system to only one method group is not possible, as most of them combine multiple methods.
Based on our observations, we foresee the following research challenges and research directions for mKGQA:
Developing methods and systems that work with diverse languages, in particular the ones that: (a) originate from different language groups (e.g., Finno-Ugric and Balto-Slavic); (b) have different alphabets/scripts (e.g., Cyrillic and Latin); (c) use different writing systems (e.g., Alphabetical and Abjad).
Addressing how well zero- or few-shot transfer methods work w.r.t. training and evaluation on the diverse languages (see the list above).
Providing case studies on domain-specific applications of mKGQA systems (e.g., Material Science, Chemistry, Business, Government, Law, etc.).
Searching for the advanced mKGQA evaluation metrics, namely: (a) How well does a system perform w.r.t. different languages (e.g., does it have a high quality variance among different languages)? (b) What is a criterion of high quality w.r.t. mKGQA?
Designing and developing general-domain large-scale multilingual benchmarking datasets for trustworthy evaluation of mKGQA systems by: (a) incorporating diverse languages (see the above paragraph about diverse languages); (b) providing gold-standard answers on multiple KGs for wider applicability (e.g., at least Wikidata, DBpedia).
Exploring capabilities of LLMs for mKGQA by: (a) using Retrieval-augmented Generation (RAG), i.e., providing LLMs with relevant triples from KGs by verbalizing the triples and using them as a part of a prompt; (b) fine-tuning an LLM on verbalized triples from a KG for learning the facts.
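The retrieval-augmented generation direction mentioned above, in which KG triples are verbalized and placed into a prompt, can be sketched as follows. The triple format, the prompt wording, and the function names are assumptions made for illustration; the retrieval of relevant triples and the call to an LLM are deliberately left out.

```python
def verbalize(triple):
    """Turn a (subject, predicate, object) triple of labels into a sentence."""
    subject, predicate, obj = triple
    return f"{subject} {predicate} {obj}."

def build_prompt(question, triples):
    """Compose a prompt that lists the verbalized facts before the question."""
    facts = "\n".join(f"- {verbalize(t)}" for t in triples)
    return (
        "Answer the question using only the facts below.\n"
        f"Facts:\n{facts}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Triples assumed to be retrieved from a KG for the question (hand-written here).
triples = [
    ("Angela Merkel", "was born in", "Hamburg"),
    ("Hamburg", "is located in", "Germany"),
]
prompt = build_prompt("In what city was Angela Merkel born?", triples)
```

For the multilingual case, the same structure could be used with entity and predicate labels taken in the language of the question, provided that the KG offers such labels.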
Benchmarks for multilingual question answering over knowledge graphs
This section describes the existing benchmarks that have been developed for mKGQA. The overview of the benchmarks will showcase their unique characteristics and contributions to the field, shedding light on the respective progress and challenges of mKGQA.
Overview
The research in the field of KGQA is strongly dependent on data, nevertheless, the particular challenge is the lack of the benchmarks for trustworthy evaluation of the KGQA systems in multiple languages [88,97]. In the field of OpenQA, several works related to machine translation of existing benchmarking datasets were done (e.g., [22,88]). However, this is not the case for KGQA. To the best of our knowledge, only five benchmarks (or benchmark series34
Some of the benchmarks (e.g., RuBQ 1–2, or QALD 1–10) have multiple versions. Therefore, we refer to them in general as benchmark series.
Overview on the mKGQA benchmarks
The QALD is a well-established benchmark series for mKGQA. It has several multilingual versions, namely QALD-{3,6,7,8,9,9-plus,10}. The QALD-3 includes 199 questions and ground truth SPARQL queries over DBpedia and MusicBrainz.35
EventQA is the benchmark for answering event-centric questions (e.g., “In which tournament, known as major, did Jason Dufner win?”). The benchmark contains 1000 questions in the corresponding languages: English, German, and Portuguese. The SPARQL queries are targeting the EventKG [54] – an event-centric KG that incorporates 690,247 events. The benchmark is represented using a newly developed JSON structure.
RuBQ KGQA benchmarking series includes two versions. The latest one – RuBQ 2.0 – contains 2910 questions and is based on its previous version RuBQ 1.0. Similarly to the latest QALD versions, the SPARQL queries within the RuBQ are written for Wikidata. The creation of this benchmark was done in a semi-automatic way: the automatically collected question-answer pairs in textual format were annotated using an entity linking tool; thereafter, the linked entities were checked by crowd-workers; finally, based on the linked question and answer entities, the SPARQL queries were generated and manually validated. The questions are represented in the native Russian language and machine-translated English language. Additionally, it contains a list of entities, relations, answer entities, SPARQL queries, answer-bearing paragraphs, and query-type tags. The benchmark uses a newly developed JSON structure.
MCWQ is a KGQA benchmarking dataset over Wikidata (similarly to QALD and RuBQ) that is based on the previously created CFQ dataset [77]. MCWQ contains questions in the Hebrew, Kannada, Chinese, and English languages. All the non-English questions were obtained using machine translation with several rule-based adjustments. It has a detailed structure including questions with highlighted entities, the original SPARQL query over Freebase [13] (from CFQ), the SPARQL query over Wikidata (introduced in MCWQ), textual representations of a question in the four aforementioned languages, and additional fields. The benchmark uses a newly developed JSON structure.
Mintaka is a recent KGQA benchmark that provides 20,000 questions in the following languages: English, Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish. Similarly to QALD, RuBQ, and MCWQ, Mintaka’s SPARQL queries are over Wikidata. The structure of Mintaka includes annotated named entities, gold standard answers, and internal fields such as question category (e.g., geography) and complexity (e.g., ordinal). The benchmark does not contain the gold standard SPARQL queries, instead, the Wikidata entity IDs are provided as an answer and the corresponding language-specific labels. The questions, annotated named entities, and gold standard answers were created and annotated by crowd-workers respectively. Mintaka follows a newly created JSON structure.
MLPQ is a large-scale KGQA benchmark that contains 300,000 questions in English, Chinese, and French. The SPARQL queries of MLPQ are over DBpedia, similarly to the early QALD versions. MLPQ has its own RDF structure that contains a question’s text with a language tag, ground truth answer URIs, and a SPARQL query. As the benchmark was generated automatically, the authors provided the query templates that were used for the generation process. In addition, MLPQ is accompanied by the DBpedia triples covering all the used questions.
Given the number and characteristics of the reviewed benchmarks, it is clear that little attention is paid to the mKGQA research field. To the best of our knowledge, only 14 benchmarks tackle the multilingual aspect (some of them have different versions, which are the subsets of the newest ones). The total number of benchmarks (mono- and multilingual) according to the KGQA leaderboard is 48 [107]. Hence, the share of multilingual benchmarks is 29.2%. We observed that the crowd-sourcing approach for benchmark creation is gaining traction. The crowd-sourcing tasks vary from the verification of linked entities [82] and the translation and validation of questions [106] to the creation and annotation of questions from scratch [123]. While answering

The bubble chart represents the chronological order of the benchmarking datasets, their number of questions, and languages.
We also discovered some flaws in the existing benchmarks:
The maximum order of magnitude for the number of questions among the manually and semi-automatically created benchmarks is
There is no standardized format used across the research groups for the KGQA benchmarks (except QALD JSON);
The majority of the benchmarks stick to only one KG, namely Wikidata (e.g., RuBQ, Mintaka);
The benchmarks created with machine translation tools or automatically have unclear questions’ quality (RuBQ, MCWQ, MLPQ).
The aforementioned flaws may become the objectives for future research in this field. Especially, it is worth developing a standardized format for the benchmarks, enlarging them regarding the number of questions, languages, and knowledge graphs using the crowd-sourcing setting.
This section discusses the analyzed work on mKGQA. The first subsection focuses on the challenges encountered in this field, highlighting the obstacles and limitations that researchers face when developing and evaluating KGQA systems in multiple languages. The second subsection explores potential future research directions in mKGQA, highlighting areas that require further exploration and innovation. By examining both the challenges and future research directions, this section aims to provide valuable insights and guide future advancements in mKGQA.
Challenges
While reviewing the related survey papers, mKGQA systems, and corresponding datasets, we identified several challenges that currently exist in this research field. In the following paragraphs, we discuss the most remarkable challenges in the mKGQA field.
Noisy human natural language input
One of the challenges in mKGQA is effectively handling noisy human natural language input. This challenge is amplified when dealing with multiple languages because different languages have diverse structures, grammatical rules, and vocabulary [3]. Moreover, questions can range from well-formed NL, where the syntax and grammar align with the language’s rules, to keyword questions, which consist of a few crucial terms without a proper sentence structure.
Grammatical and orthographic errors further contribute to the noisy input in mKGQA. Since users may not be fluent in all the languages they interact with, they are prone to making mistakes in constructing sentences, selecting appropriate words, or adhering to correct grammar rules. Grammatical and orthographic errors add complexity to understanding and interpreting the user’s intent accurately. The mKGQA systems that internally use methods from the group G1 “Rules and Templates” (e.g., SWSNL [56], AMUSE [57], UDepLambda [118]) are especially sensitive to this kind of noise.
Another aspect of noisy input in mKGQA is the misspelling of named entities. Named entities are essential components for understanding the context and semantics of a question [88]. However, users may misspell named entities or transliterate them incorrectly from one language to another. For example, a French-speaking user asking a question in English about the “Eiffel Tower” may mistakenly spell it as “Eiffle Tower”. This makes it harder to map and resolve named entities across languages. This challenge was first mentioned by Perevalov et al. [105] and addressed within the Lingua Franca system [128] by Srivastava et al.
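One simple way to tolerate such misspellings, which is not necessarily the technique used by the cited systems, is approximate string matching against the entity labels of a KG. The following minimal sketch uses Python's standard `difflib` module; the label index, the `ex:` identifiers, the `resolve_entity` helper, and the similarity cutoff are assumptions made for illustration only.

```python
from difflib import get_close_matches

# Hypothetical label index: lowercased surface form -> KG entity identifier.
# The "ex:" identifiers are placeholders, not real KG resources.
LABEL_INDEX = {
    "eiffel tower": "ex:EiffelTower",
    "tour eiffel": "ex:EiffelTower",
    "eiffelturm": "ex:EiffelTower",
    "louvre": "ex:Louvre",
}


def resolve_entity(mention, cutoff=0.8):
    """Return the entity whose label is most similar to the mention, or None.

    The cutoff (assumed value) is the minimum difflib similarity ratio
    a label must reach to be accepted as a match.
    """
    matches = get_close_matches(
        mention.lower(), list(LABEL_INDEX), n=1, cutoff=cutoff
    )
    return LABEL_INDEX[matches[0]] if matches else None


if __name__ == "__main__":
    print(resolve_entity("Eiffle Tower"))
    print(resolve_entity("Brandenburg Gate"))
```

Because labels in several languages point to the same identifier, a misspelled mention can be resolved regardless of the language in which it was typed, as long as a sufficiently similar label exists in the index.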
A user question and a knowledge graph are expressed in different languages
One of the critical challenges highlighted in the existing literature is handling the mKGQA process when a user question and a KG are expressed in different languages [131]. This setting is commonly referred to as cross-linguality.
To address the cross-linguality challenge, the conventional approach involves enriching KGs with multilingual labels. By incorporating multilingual labels, KGs become capable of accommodating and understanding different languages. This approach aims to provide a more seamless and efficient user experience, regardless of the language in which the question is asked.
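The effect of multilingual labels can be sketched with a toy in-memory structure that mimics language-tagged `rdfs:label` values in an RDF-based KG. The entities, labels, and helper functions (`build_index`, `link`, `add_label`) below are hypothetical and serve only to show why a missing label in one language blocks entity linking for that language.

```python
# Toy KG fragment: entity -> {language tag: label}, mimicking rdfs:label
# literals with language tags. Identifiers and labels are illustrative.
LABELS = {
    "ex:Germany": {"en": "Germany", "de": "Deutschland", "fr": "Allemagne"},
    "ex:France": {"en": "France", "de": "Frankreich"},
}


def build_index(labels):
    """Map (language, lowercased label) -> entity for direct lookup."""
    return {
        (lang, text.lower()): entity
        for entity, by_lang in labels.items()
        for lang, text in by_lang.items()
    }


def link(mention, lang, index):
    """Return the entity labeled `mention` in language `lang`, or None."""
    return index.get((lang, mention.lower()))


def add_label(labels, entity, lang, text):
    """Enrich the KG fragment with one additional language-tagged label."""
    labels.setdefault(entity, {})[lang] = text


if __name__ == "__main__":
    index = build_index(LABELS)
    print(link("Deutschland", "de", index))  # label exists in German
    print(link("Francia", "es", index))      # no Spanish label yet
    add_label(LABELS, "ex:France", "es", "Francia")
    print(link("Francia", "es", build_index(LABELS)))
```

In this sketch, a Spanish mention cannot be linked until a Spanish label has been added, which is the gap that label enrichment is meant to close.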
Another approach for dealing with cross-linguality is centered around leveraging text and graph vector representations [23]. Often, the availability of multilingual training data is limited, which makes it challenging to directly train models for mKGQA. In such cases, text and graph vector representations offer a way to bridge this gap. These representations capture semantic similarities and relationships between words or entities across different languages, allowing effective knowledge retrieval and alignment even in the absence of extensive multilingual training data. The context of cross-linguality was tackled within the LAMA system [115] explicitly on language-specific DBpedia KGs. However, current versions of KGs including DBpedia are not divided by languages.
The use of cultural traits and country-specific aspects
In addition to the previously mentioned challenges, another significant aspect in mKGQA is the consideration of cultural traits and country-specific aspects when searching for information. This challenge arises due to the diverse preferences, interests, and expectations of users across different cultures and countries.
When users search for information or ask questions, their expectations may vary based on their cultural background and country of origin. For instance, if a user from the United States asks for recommendations for popular TV series, their preferences and expectations might differ from those of a user from Japan or France. The concept of popularity and the perception of what constitutes an engaging TV series can vary significantly across cultures.
Addressing this challenge requires a nuanced understanding of cultural differences and the ability to provide contextually relevant answers to users from different backgrounds. Approaches to this challenge should involve real-user feedback on ranking different entities pertaining to a common search topic (e.g., TV series). This may include setting up crowd-sourcing tasks, gathering their results, and building a dataset with the corresponding ranks. To the best of our knowledge, this challenge has not been tackled by any work in the mKGQA field.
Quality discrepancies among different languages
One significant challenge evident from evaluation values in the field of mKGQA is that the quality of QA is notably lower in languages other than English. In particular, the evaluation results [105] for the QAnswer, DeepPavlov, and Platypus systems demonstrate high deviation among the QA quality results in different languages. This observation underlines the need for the research community to develop effective solutions aimed at enhancing the performance of multilingual question answering systems and bridging this gap in quality between languages.
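The notion of deviation among languages can be made concrete with a small calculation. The sketch below computes the mean, the population standard deviation, and the best-to-worst gap of per-language scores; the F1 values are invented for illustration and are not the results reported in [105], and the `quality_spread` helper is a hypothetical name.

```python
from statistics import mean, pstdev

# Hypothetical per-language F1 scores of a single system.
# These numbers are NOT taken from [105] or any other cited work.
f1_by_language = {"en": 0.50, "de": 0.40, "ru": 0.30, "fr": 0.40}


def quality_spread(scores):
    """Summarize how evenly QA quality is distributed across languages."""
    values = list(scores.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),             # population standard deviation
        "gap": max(values) - min(values),  # best minus worst language
    }


if __name__ == "__main__":
    for name, value in quality_spread(f1_by_language).items():
        print(f"{name}: {value:.3f}")
```

A lower standard deviation at a comparable mean indicates a more even distribution of quality across the supported languages.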
The existing disparities in QA quality arise due to various factors. Firstly, the availability of high-quality training data and resources for languages other than English may be limited, resulting in inadequate training for multilingual models. Moreover, the linguistic complexities, divergent language structures, and semantic nuances present in different languages further contribute to the lower performance in non-English languages.
To address this challenge, substantial efforts are needed to develop techniques that align multilingual KGs, curate larger and more diverse benchmarking datasets, and improve the training methodologies for multilingual QA models. In particular, a promising strategy is to leverage zero- or few-shot inference while working with LLMs. These approaches aim to reduce biases toward English and provide a more equitable distribution of performance across different languages.
The lack of language diversity
The landscape of mKGQA reveals certain limitations when it comes to targeting languages that belong to different language groups or employ distinct alphabets or writing systems. The existing systems predominantly focus on widely spoken languages (e.g., QAKiS [21], MuG-QA [155]) and may therefore overlook the linguistic diversity represented by low-resource and endangered languages. The systems from our review evaluated on the most diverse sets of languages are QAnswer [41], Zhou et al. [154], and Perevalov et al. [105].
In practice, the availability of mKGQA systems catering to low-resource and endangered languages is severely limited, with very few systems specifically designed for these language varieties (e.g., Bashkir language in [105]). This dearth of attention towards such languages is concerning, as it hinders the inclusivity and accessibility of KG-based information retrieval for diverse linguistic communities.
A common approach adopted in the development of mKGQA systems is the use of machine translation or the adaptation of monolingual methods to accommodate multiple languages (cf. [105,128]), thereby pursuing a multilingual capability. However, this adaptation of methods primarily focuses on language pairs with substantial linguistic resources, neglecting the unique challenges associated with low-resource languages.
Size and translation quality of benchmarking datasets
It is important to acknowledge that the existing benchmarks for mKGQA are relatively small in scale. In terms of magnitude, the non-automatically generated benchmarks typically consist of around
The limited size of these benchmarks poses challenges in accurately evaluating and benchmarking the performance of such systems. The smaller scale restricts the diversity of queries and contexts that can be covered. This, in turn, limits the generalizability of the performance results and may hinder the development of robust mKGQA systems.
Additionally, the quality of translations within these benchmarks is often a concern. In some cases, the questions are machine-translated (e.g., RuBQ 2.0 [119]), leading to questionable translation quality. Poor translation quality (e.g., in QALD-9 [139]) introduces noise and inaccuracies into the benchmark data, potentially affecting the reliability of evaluations and comparisons between different multilingual question answering systems.
Effective usage of large language models
In recent times, there has been a surge of attention and interest in leveraging LLMs for mKGQA as well as other NLP tasks. However, despite the initial excitement and optimism surrounding these powerful models, reports from the research community suggest that the obtained results do not always align with the high expectations set forth.
In particular, cross-lingual semantic parsing has faced challenges in achieving the desired performance with LLMs. Zhang et al. [153] demonstrated that the results obtained for cross-lingual semantic parsing did not meet the anticipated levels of accuracy and quality. Similarly, even in monolingual settings, Klager et al. [79] reported underwhelming outcomes with LLMs for semantic parsing tasks.
One promising approach, which addresses these shortcomings, involves integrating structured data directly into the mKGQA process, building upon the practices established in the domain of monolingual KGQA [7,72].
General challenges
The field of mKGQA encounters several challenges that require attention and resolution. In contrast to the aforementioned challenges, these extend beyond the multilingual aspect and also encompass the broader field of KGQA:
The majority of KGQA systems predominantly target a single domain, often limited to the general domain. This domain-specificity restricts the applicability and scope of these systems, hindering their potential to address diverse domains effectively.
The lack of advanced metrics for evaluating the performance of KGQA systems poses a significant challenge. Without such evaluation metrics, it becomes difficult to comprehensively measure, compare, and benchmark the effectiveness of different multilingual KGQA approaches.
Ensuring the reproducibility of KGQA systems is crucial for building upon existing research. However, we have observed low reproducibility rates in some studies within the field.
The lack of a standardized format for KGQA benchmarks poses another significant challenge. Without a common benchmark format (e.g., QALD-JSON), it becomes technically arduous to compare the performance of different systems or to share and replicate research findings.
It is prevalent for KGQA benchmarks to focus solely on a single KG (e.g., DBpedia or Wikidata). This narrow focus limits the generalizability and applicability of the developed systems, as they may struggle when applied to different KGs.
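To illustrate what a common benchmark format buys in practice, the following sketch converts flat records into a nested structure loosely modelled on QALD-JSON and then extracts all questions of one language. The field names, the flat input layout, and the helper functions are assumptions for illustration and should be checked against the actual format specification before use.

```python
import json


def to_common_format(records):
    """Convert flat records into a QALD-JSON-like structure (assumed layout)."""
    return {
        "questions": [
            {
                "id": str(record["id"]),
                "question": [
                    {"language": lang, "string": text}
                    for lang, text in sorted(record["translations"].items())
                ],
                "query": {"sparql": record["sparql"]},
            }
            for record in records
        ]
    }


def questions_in(benchmark, lang):
    """Collect all question strings available in the given language."""
    return [
        variant["string"]
        for item in benchmark["questions"]
        for variant in item["question"]
        if variant["language"] == lang
    ]


if __name__ == "__main__":
    # Hypothetical record; the SPARQL query is a placeholder.
    records = [
        {
            "id": 1,
            "translations": {
                "en": "Who wrote Faust?",
                "de": "Wer schrieb Faust?",
            },
            "sparql": "SELECT ?author WHERE { ?work ?p ?author }",
        }
    ]
    benchmark = to_common_format(records)
    print(json.dumps(benchmark, ensure_ascii=False, indent=2))
    print(questions_in(benchmark, "de"))
```

Once every benchmark is available in one such structure, the same evaluation script can iterate over any of them, which is the practical benefit of standardization.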
Future research directions
Future research directions in the field of mKGQA hold substantial potential for advancements. Analyzing the challenges, the following research directions have been identified for further exploration:
Study on how people ask questions in different languages: understanding how individuals formulate questions in various languages, including their native language (L1), second language (L2), etc. is crucial. This research direction entails analyzing the syntactic structures employed, common errors made, and prevalent misspellings encountered in multilingual question posing on the Web.
mKGQA with unequal data distribution across languages: KGs often exhibit disparities in data availability across different languages. Investigating efficient mKGQA techniques that tackle this unequal distribution of language-specific data within KGs is an important research direction.
mKGQA with cultural context: consideration of a user’s cultural background can significantly influence their information needs and preferences. Therefore, incorporating and adapting the ranking of answer candidates in accordance with a user’s cultural context represents an essential research direction for enhancing the effectiveness of mKGQA.
Improving mKGQA quality in languages other than English: while much progress has been made in English KGQA, there is a need to extend the focus to other languages. A crucial objective in this research direction is to minimize the standard deviation of QA quality among different languages, ensuring comparable performance across all supported languages.
Generalization of methods to handle diverse languages: enhancing the applicability of mKGQA techniques to languages originating from different language groups, alphabets, writing systems, low-resource, and endangered languages is a vital research direction. It involves developing approaches that can effectively address the unique challenges posed by each language type.
Extending benchmarks for mKGQA: Expanding the existing benchmarks used for evaluating mKGQA systems is essential. This research direction involves augmenting the number of questions and translations available in different languages and ensuring the quality of these resources to ensure comprehensive evaluation and comparison of mKGQA approaches.
Encouraging publication of negative results: as a majority of publications tend to report positive results, advocating for the publication of negative results is vital. This research direction promotes transparency and prevents duplication of efforts by enabling researchers to learn from unsuccessful attempts and focus on more promising avenues.
Ensuring high-level reproducibility and comparability: to establish a strong foundation in the field of mKGQA, it is imperative to ensure the reproducibility and comparability of research results. This research direction emphasizes the adoption of standardized evaluation metrics, the release of open-source code, and the sharing of datasets to facilitate fair comparison and validation of proposed approaches.
We encourage the research community to consider these research directions, as pursuing them can lead to further progress in the field of mKGQA, improved systems, and comprehensive solutions for addressing multilingual information needs.
Conclusion
In conclusion, this systematic survey provides a comprehensive overview of the current state of research in mKGQA. Our work is primarily motivated by the lack of surveys that focus specifically on mKGQA (see Section 1). We followed a strict review methodology (see Section 2) to ensure the reproducibility and transparency of our results. In the first step, 1875 publications were retrieved according to the defined criteria. The publication search was conducted in three languages: English, German, and Russian. Finally, 46 publications were accepted for the review, of which 26 are related to systems, 14 to benchmarks, and 7 to related survey papers.
We systematically reviewed all the aforementioned publications and proposed the taxonomy of the methods for mKGQA systems. We formally defined three characteristics of the methods for mKGQA systems, namely: resource efficiency, multilinguality, and portability. We prepared the online repository37
To answer
While answering
Finally, to answer
It is worth mentioning that our review methodology favors precision over recall when selecting the publications. Therefore, some relevant papers (e.g., [49,151]) have been left out because they do not match our selection criteria. The most common reason for a mismatch is the absence of a peer review process (e.g., preprint only, see Section 2.2) or a lack of citations for preprints (our threshold is at least five, see Section 2.2). Also, some papers use multilingual benchmarks but focus only on the English language [51].
In the future, we will regularly update the survey results through the online KGQA leaderboard to keep track of the state-of-the-art in the mKGQA field. We believe that our review methodology enables other researchers to conduct similar work more efficiently. We also encourage researchers to deal with the challenges and research directions mentioned in Section 6.
Footnotes
Acknowledgements
This work has been partially supported by grants for the ITZBund
