Abstract
Large language models (LLMs) have shown effectiveness in various natural language understanding (NLU) tasks. However, they face notable limitations like hallucinations, a lack of contextual knowledge, and outdated or incomplete knowledge when applied across knowledge-intensive domains such as scientific research, biomedical sciences, finance, law, and others. These challenges commonly arise from the scarcity and under-representation of domain-specific data during the training and model alignment phases. Furthermore, Large Language Models (LLMs) struggle to provide nuanced expertise, as their internal knowledge remains static and generalized, hindering their ability to reason accurately or deliver context-aware results in specialized tasks. This survey investigates the integration of external knowledge into LLMs to address these limitations. The focus is on decoder-based LLMs, that is, autoregressive models that generate text sequentially. By investigating parametric and non-parametric approaches, this work discusses methods to enhance model reasoning capabilities, factual accuracy, and adaptability for domain-specific and knowledge-intensive tasks. Additionally, it highlights the potential of integrating external knowledge to improve explainability and ensure more trustworthy outputs. This survey supports software developers and natural language processing (NLP) researchers in designing NLU systems for specialized domains by leveraging pre-trained LLMs. Additionally, the work provides a foundation for advancing LLM-based NLU systems with insights into future research areas.
Keywords
Introduction
Large language models (LLMs) have demonstrated strong performance across a range of natural language understanding (NLU) tasks. This is due to their ability to encode vast amounts of knowledge extracted from enormous data crawled from the Internet (Hu et al., 2023). They acquire knowledge through the training phase, during which they process massive corpora of text data to learn statistical patterns and relationships between words, phrases, and concepts. Unlike explicit knowledge repositories such as relational databases, the knowledge in LLMs is encoded implicitly in their parameters. This implicit nature means that retrieving specific pieces of information is not straightforward and depends on probabilistic generation rather than deterministic querying (Wang et al., 2024b). Research shows that LLMs contain factual knowledge (Hu et al., 2023), but their knowledge is static, that is, confined to the state of information at the time of training when used in isolation, or “as is.” Relying solely on the knowledge embedded in these models’ parameters presents several fundamental challenges (Sanu et al., 2024), including hallucinations, outdated data, and a lack of domain-specific context. These challenges can be mitigated by integrating external knowledge with LLMs. 1
Tasks like sentiment analysis or spelling and grammar correction perform well with generic models because they rely primarily on linguistic pattern detection rather than deep subject understanding, and thus do not always require external knowledge integration. However, knowledge-intensive tasks such as information extraction (IE), which involve identifying relevant entities and relationships from domain-specific documents to create triples for knowledge graph construction, require structured domain modeling and access to specific, context-rich details to produce accurate results. For instance, creating a knowlegde graph (KG) from scientific ontologies and domain-specific documents requires compliance with domain-imposed constraints on entities, relations, and validation rules, which must be effectively incorporated into the LLM’s generation process (Gao et al., 2023; Wang et al., 2024b). Moreover, in scientific fields, critical details are often stored in proprietary, confidential documents that are not included in publicly available datasets, including numerical measurements (e.g., temperature, pressure), material properties, procedural descriptions, and safety information, which are derived from technical datasheets, experimental records, and internal reports. Handling these specialized documents requires precise contextual knowledge, encompassing domain-specific concepts, relationships, and constraints, which can be realized via integrating external knowledge with LLMs. In this work, we focus on integrating such externally available knowledge with LLMs to enhance performance on natural language understanding (NLU) tasks, particularly knowledge-intensive ones such as IE, knowledge graph construction (KGC), and knowledge graph population (KGP) for domains poorly represented in the generic dataset. The goal of this integration is to digitize and semantically structure complex textual data and semi-structured documents, thereby enabling downstream reasoning tasks. Beyond these, other knowledge-intensive NLU tasks, such as entity linking, question answering, fact verification, commonsense reasoning, scientific reasoning, and technical document summarization, also benefit substantially from external knowledge integration to ensure factual consistency and domain-specific accuracy.
Survey Purpose
There are several survey papers, as discussed below, that focus on model architecture, model benchmarking, hardware requirements, and explore various applications of LLMs. However, there is a noticeable shortage of literature offering an in-depth and focused discussion on approaches to knowledge integration. For instance, LLM families, such as GPT (Radford et al., 2018), have been compared, examining their internal architectures and datasets for training and fine-tuning, and their applications in diverse fields (Kumar, 2024; Minaee et al., 2024). Additionally, Wang et al. (2024e) shed light on the progression of language models, tracing their development from statistical and neural language models to transformers and, ultimately, LLMs. Another survey discusses open-source LLMs, emphasizing aspects such as data collection, model architectures, and training methodologies (Kukreja et al., 2024). The study by Dong et al. (2025) focuses on “safeguards” and “guardrails” touching the ethical aspects of LLMs usage. Others investigate reinforcement learning using human feedback, fine-tuning on domain-specific datasets, and task-specific fine-tuning (McIntosh et al., 2024; Susnjak et al., 2025) and transfer learning techniques (Sulaiman & Hamzah, 2024). Regarding the synergy between knowledge-based systems and LLMs, Some et al. (2025) provide a comprehensive examination of methods for integrating LLMs with knowledge-based systems. It provides an extensive overview of LLM evolution and architectural paradigms and discusses approaches for combining LLMs with knowledge resources such as knowledge bases, knowledge graphs, and retrieval-augmented generation (RAG) systems. However, it does not offer a detailed methodological analysis of the techniques used to integrate external knowledge into LLMs or address knowledge-intensive NLU tasks for domain-specific datasets. The study presents a broader perspective on what integration entails but lacks a systematic discussion of the underlying mechanisms that operationalize such integration. In essence, that work addresses the question, “how can LLMs be combined with knowledge-based systems in general?”
Building upon this foundation, the present survey extends the discussion from a conceptual overview to a methodological exploration. While prior studies, including Some et al. (2025), provide valuable insights into the general landscape of LLM-knowledge integration, there remains a lack of a comprehensive perspective that individually examines techniques for integrating external knowledge with LLMs for NLU-based systems, their challenges, and future applications. The current survey addresses this gap by examining methods and mechanisms for integrating external knowledge sources into LLMs to improve their performance on knowledge-intensive NLU tasks. It focuses on the question, “how can external, structured, or semi-structured knowledge be injected into LLMs to improve NLU in knowledge-intensive domains?” This not only enhances the domain-specificity of LLM responses but also addresses critical issues such as bias and misinformation, resulting in more reliable and trustworthy systems for users across various applications. By exploring these techniques, the paper aims to provide insights into improving model robustness and adaptability, thereby facilitating their adoption in real-world scenarios.
Intended Audience:
The intended audience for this survey includes researchers, practitioners, and developers working in natural language processing (NLP), machine learning, the Semantic Web, and artificial intelligence who seek to enhance LLMs for knowledge-intensive tasks. These tasks include IE, question answering, KGC, and KGP, particularly in specialized domains where external knowledge integration is necessary for accuracy and contextual understanding. By addressing the challenges and opportunities discussed in this paper, researchers can gain a deeper understanding of how to effectively leverage and integrate external knowledge to improve model accuracy, scalability, and applicability across diverse domains.
Scope and Literature Survey Methodology
The literature survey employed a taxonomy of search phrases and keywords, developed through iterative refinement during preliminary analysis. This analysis focused on foundational research in external knowledge representation, such as ontologies and knowledge bases. It also considered the limitations of LLMs in handling specialized scientific content. Additionally, approaches for building domain-specific, knowledge-intensive NLU systems were examined, particularly those that combine formal ontologies with proprietary semi-structured documents to construct knowledge graphs for domains underrepresented in general LLM training data. The scope of this survey encompasses the following: An overview of the concept of LLM-based natural language understanding systems. Limitations of LLMs in knowledge-intensive tasks. Approaches for addressing these limitations through external knowledge integration, categorized into parametric and non-parametric methods. Future research directions for advancing knowledge-augmented LLM-based NLU systems.
To facilitate literature retrieval, a Python script was developed to query multiple academic databases using our taxonomy as input. The script systematically crawled scientific papers from open-source databases including Google Scholar, 2 Semantic Scholar, 3 DBLP, 4 and ACL Anthology. 5 Papers from IEEE Xplore 6 and other sources with access restrictions were retrieved manually. The complete taxonomy, the Python script, and the crawling result are publicly available (Yadav, 2025). Research papers published between 2019 and 2024 were considered for inclusion, with a particular emphasis on those from 2022 to 2024 to ensure coverage of recent advances. Studies published in 2025 were manually selected based on their direct relevance to the survey’s objectives. From the set of papers retrieved through the Python-based crawling script, relevant studies were manually selected for in-depth research by filtering them based on title or abstract. The Python script facilitated the retrieval of seeding papers, while pertinent additional studies were identified through manual searches. Consequently, a hybrid approach combining automated retrieval and manual searching was employed to collect the literature for the survey. Table 1 provides a quantitative overview of the selected literature corpus, summarizing paper types, publication formats, and peer-review status.
Overview of the Selected Literature Corpus (2019–2025) Grouped by Different Categorizations.
While every effort was made to ensure comprehensive coverage, certain methodological constraints should be acknowledged: First, the keyword-based retrieval process may have unintentionally excluded studies that address knowledge integration under different conceptualizations or terminologies. Second, the rapid pace of progress in LLM research implies that very recent publications, that is, those appearing shortly before or after the completion of this survey, may not yet be indexed in the consulted databases. Third, the focus on English-language literature may have led to the omission of relevant work published in other languages. Notwithstanding these limitations, the resulting corpus is considered sufficiently broad and representative to capture the major trends and directions in contemporary research on LLM–knowledge integration.
This survey extends existing work by systematically examining knowledge integration in LLMs across both parametric and non-parametric paradigms, offering a more holistic view than prior studies that focus mainly on prompting and reasoning (Some et al., 2025) or RAG (Gao et al., 2023). Kindly refer to Table 2 for an overview of the surveyed methods. Our distinct contribution lies in: (a) coverage of both parametric and non-parametric integration techniques, (b) emphasis on methodological aspects of knowledge-intensive NLU tasks, such as IE and KGC, and (c) inclusion of recent developments in retrieval augmented generation (RAG) knowledge editing, and constrained-decoding. Although this survey offers a methodological analysis of knowledge integration strategies, unified performance benchmarking across these diverse approaches remains difficult due to differences in evaluation tasks, datasets, and domain-specific requirements. Our primary focus is on the systematic categorization and in-depth analysis of methods, with performance considerations highlighted where methodologically needed.
Overview of Surveyed Approaches.
What Are LLM-Based NLU Systems?
LLMs are large neural networks/deep learning models trained on vast corpora of text to capture patterns, semantics, and syntactic structures in natural language. Generally, models with hundreds of millions to several billion parameters are considered large. Examples, include BERT (340M parameters), GPT-3 (175B parameters), and T5 (11B parameters) (Devlin et al., 2019; Radford et al., 2018). These models have demonstrated impressive capabilities in various NLP tasks, including sentiment analysis (Krugmann & Hartmann, 2024), text classification (Fields et al., 2024), machine translation (Gao et al., 2024; Vaswani et al., 2017), named entity recognition (NER) (Luo et al., 2023), and others. NLU is a subfield of natural language processing (NLP) that builds systems to comprehend and interpret human language. It bridges the gap between human language and machine comprehension (Kulkarni, 2023). It involves various tasks, including semantic role labeling, NER, relationship extraction, coreference resolution, intent detection, question answering, reasoning with textual information, and more. The NLU algorithms extract structured patterns from unstructured or semi-structured data 7 (Singh et al., 2018). The emergence of language models such as BERT and the GPT series has led to significant improvements in NLU performance. Integrating LLMs with NLU systems has been shown to substantially improve the semantic understanding of natural language (Huang et al., 2024b), thereby enhancing accuracy. Numerous implementations have been developed to facilitate this integration. For example, McTear et al. (2023) have optimized their RASA-based (Sharma & Joshi, 2020) dialogue system by integrating the GPT-3.5-turbo model (Ye et al., 2023) to engage users in a motivational health coaching offering reflective dialogues. Rajasekharan et al. (2023) combine LLM with Answer Set Programming (ASP) to do qualitative and mathematical reasoning. Mukanova et al. (2024) propose a methodology that uses LLMs for performing the task of ontology enrichment and semantic processing. Other works also include: using LLMs for ontology engineering and knowledge graph creation (Shimizu & Hitzler, 2025). While these examples illustrate key applications of LLMs in NLU, they do not represent an exhaustive list, as research in this area continues to evolve. However, despite their success, they still face limitations in dealing with domain-specific knowledge and ensuring factual correctness in generated responses (Albtosh, 2024; Bouhoun et al., 2024; Kerner, 2024; Nagar et al., 2025).
How Do LLMs Perform NLU?
A breakthrough came with the introduction of transformers by Vaswani et al. (2017). Transformers, as originally defined, are encoder-decoder architectures that primarily rely on attention mechanisms. This mechanism models dependencies among elements of a sequence, such as words in a sentence in natural language processing or patches/pixels in an image in computer vision, and it assigns learnable weights that capture their relative relevance. Transformers serve as the fundamental architecture for LLMs, with different models leveraging specific components of the transformer structure. For instance, BERT utilizes only the encoder (Devlin et al., 2019), whereas GPT is built solely on the decoder mechanism (Radford et al., 2018). In contrast, models like BART incorporate both the encoder and decoder components (Lewis et al., 2020a). LLMs perform NLU by learning contextual relationships between words and sentences. Our focus is on decoder-based models, that is, the causal language models that produce the outputs autoregressively. These models are generative language models. In the training phase, they calculate the probability of a word’s occurrence based on the previous context words, that is, learn the statistical representation of language (Brown et al., 2020; Yan et al., 2024). In a fine-tuning phase, the pre-trained models are steered towards user-specific responses by applying task-specific fine-tuning or instruction fine-tuning. The work by OpenAI (Radford et al., 2018) offers valuable insights into the concepts of generative pre-training and discriminative fine-tuning. They demonstrate how large parameter neural networks, such as the GPT-family models, can offer a task-agnostic architecture for performing NLU tasks with better results than task-specific models. Furthermore, it has been observed that decoder-based GPT-type models excel in knowledge extraction compared to encoder-based models, except when the knowledge is a standalone word or composed of independent words (Allen-Zhu & Li, 2025a). LLMs reliance on large-scale static and generic text corpora can result in issues such as the generation of hallucinated facts or incomplete reasoning, which poses challenges for tasks requiring factual accuracy or specialized knowledge (Qiang et al., 2024). Developing techniques that incorporate external knowledge sources for access to proprietary or domain-specific data could significantly mitigate these issues, enabling LLMs to deliver more accurate and contextually relevant outputs across various applications and domains (Ye et al., 2024).
Challenges of LLM-Based NLU Systems
Since LLMs form the core of modern NLU systems, any shortcomings in LLMs naturally propagate to the NLU systems built on top of them. In the following, we discuss typical limitations of LLMs that emerge when they are applied to domain-specific or knowledge-intensive tasks. These challenges directly affect LLM-based NLU systems, reducing their ability to reliably process, interpret, and reason over specialized information.
Hallucinations Leading to Limitations in Factual Accuracy
One of the primary limitations of LLM-based NLU systems is their tendency to generate grammatically correct but factually incorrect information, a phenomenon known as “hallucination” or “confabulation” (Lewis et al., 2020b). Hallucination results in the generation of misinformation, making LLMs unreliable and ultimately reducing the accuracy of the output responses. In healthcare, for example, LLM hallucinations can have serious consequences, such as providing medically incorrect guidance, which could lead to harmful patient outcomes (Maes, 2025). In scientific fields, these inaccuracies may limit the effectiveness of LLMs in extracting knowledge and structuring data. One reason for hallucination is the lack of factual information from proprietary data (Lewis et al., 2020b). When performing tasks such as IE, LLMs can produce extraneous or “hallucinated” content—for example, appending interpretations or qualitative judgments to a value extracted from a technical datasheet (e.g., stating that a measured temperature is “high” or “low”). Although the extracted key values may be accurate, LLMs often generate supplementary information that is not explicitly present in the source document, potentially introducing semantic noise or misinformation. Another prevalent form of hallucination involves numerical distortion—either by altering decimal values or generating entirely spurious numbers absent from the input source. Additionally, when processing technical datasheets (semi-structured documents), LLMs may generate misinformation as the generated information is neither explicitly nor implicitly present in the datasheets used for training. For instance, consider the following use-case. The prompt that produced the output below included a task instruction for information extraction, along with a machine-parsed version of the datasheet, originally in PDF format. The input structure is provided below for clarity
8
:
The model-generated output: “There are apparent asymmetries in the bias applied to some of the samples (upper samples have a slight bias-rand), suggesting an inconsistency in the manufacturing process that could affect the quality of the final product.” In this output, the clause following “suggesting” constitutes a hallucination, as it introduces content not present in the source datasheet and thereby adds noise to the extracted information. Such hallucinations are particularly problematic for tasks such as information extraction and knowledge graph construction, where the accuracy and reliability of results depend on maintaining strict alignment with the underlying data sources. Relying on hallucinated data in these tasks can lead to inaccuracies and misrepresentations, ultimately affecting the reliability and applicability of LLM-generated knowledge. Therefore, understanding and investigating hallucinations in LLMs is important for their seamless application.
At a broader level, hallucinations are classified as intrinsic (resulting from internal parameters of the model) and extrinsic (due to integrated external knowledge) (Cleti & Jano, 2024). These categories can be further divided into subtypes, including fact-based hallucinations, where incorrect information is generated, and faithfulness hallucinations, wherein the output of the LLMs fails to align with the input prompt. Additionally, coherence hallucinations are characterized by the generation of incoherent text. Other types include relevance hallucinations, where the LLM produces an out-of-domain or irrelevant response, and sensibility hallucinations, which involve the generation of nonsensical text (Huang et al., 2025). Researchers or practitioners of NLP have different perspectives on hallucinations. The survey by Cleti and Jano (2024) highlights the dual nature of hallucinations across diverse domains like healthcare/science and art/design. For fact-based fields like healthcare and scientific research, hallucinations lead to the generation of incorrect facts that are not acceptable. On the other hand, for philosophical and abstract fields like arts and design, hallucinations can foster creativity by generating unconventional or unseen outputs that inspire innovations. Therefore, the issue of hallucinations and ways of mitigating it depends on the respective domain, or at least the broader categorization of the domain.
One perspective on why hallucinations persist in LLMs is that existing guardrail mechanisms are inadequate and fail to effectively prevent them. According to Pantha et al. (2024), generic guardrails face significant challenges when applied to specialized domains. Ideally, LLMs should rely on guardrails to issue alerts when they lack domain-specific understanding, signaling their inability to generate a reliable response. However, in reality, these mechanisms are often inadequate. The problem is that while some LLMs recognize their limitations and trigger alerts, many fail to detect their own uncertainty and instead generate misleading responses with unwarranted confidence. This failure stems from the lack of domain-specific understanding, preventing LLMs from correctly identifying when their outputs should be restricted. As a result, hallucinations persist. The inadequacy of generic guardrails highlights the need for domain-specific interventions. Research has explored customized guardrails designed to mitigate hallucinations in scientific applications. Proposed frameworks and methodologies address key challenges, including time sensitivity, contextualization of knowledge, and intellectual property concerns. By integrating domain-aware constraints, these solutions aim to enhance the trustworthiness and reliability of LLM-generated content in specialized fields.
Incompleteness and Outdated Data
LLMs struggle with processing or generating up-to-date information for the NLU task, as their training data is static and may not capture up-to-date domain-specific knowledge. This limitation arises because most pre-trained LLMs rely on datasets collected at a specific time, making them unable to reflect real-time changes, temporal trends, or evolving domain-specific knowledge (Fan et al., 2024; Gao et al., 2023; Lewis et al., 2020b). For example, an LLM-based system used in the manufacturing sector for tasks such as reasoning, design assistance, or technical question-answering may miss recent innovations in materials, automation methods, or production standards (Li et al., 2024a). This lack of timely updates can lead to outdated or incomplete insights, limiting the system’s usefulness in fast-moving industrial contexts. Similarly, in healthcare, where medical treatments, clinical guidelines, and biomedical discoveries evolve rapidly, an outdated LLM may generate recommendations based on outdated information, with potentially serious consequences. To overcome these challenges, it is crucial to integrate mechanisms that enable LLMs to access external, up-to-date knowledge bases or adapt dynamically to evolving datasets. The strategies for handling outdated knowledge enable models to bridge the gap between static training data and the real-time knowledge needed for accurate, reliable decision-making across diverse domains.
A common challenge with LLMs, closely related to the problem of incomplete or outdated knowledge, is their inconsistent response generation, that is, producing varied outputs for the same input. Even when the required external knowledge is available, the reliability of an NLU system also depends on the model’s ability to generate stable and predictable outputs. Inconsistency arises from several factors, including limited or imbalanced training data, stochastic sampling strategies, sensitivity to prompt phrasing, training data biases, and architectural design choices (Zhao et al., 2024b). Mathematically, LLMs are deterministic: given identical inputs and fixed model parameters (temperature = 0, fixed random seed), they produce the same output. However, most real-world applications intentionally introduce stochastic sampling through hyperparameters like temperature and top-p to promote diverse text generation. This causes LLMs to probabilistically sample tokens at each generation step, resulting in output variability across repeated prompts. Beyond stochasticity, another inherent source of inconsistency lies in the stateless nature of LLMs. Unlike systems such as recommender engines or chatbots that depend on historical user interactions or time-dependent profiles, LLMs operate without memory of previous exchanges. This statelessness improves privacy and security by avoiding storage of user-specific data, but it also presents difficulties for tasks requiring continuity or persistent reasoning. Without retained context, responses may appear arbitrary or disconnected when the input lacks explicit contextual cues. As a result, the stateless nature of LLMs contributes to the unpredictability of their outputs, leading to inconsistent responses (Liu et al., 2023, p. 7). Such variability poses a significant concern, as it affects the reliability and trustworthiness of LLMs across a wide range of applications (Saxena et al., 2024).
Lack of Contextual Understanding and Related Ambiguities
LLMs are sensitive to the noisy language of prompts. The lack of contextual understanding can contribute to linguistic ambiguities, as LLMs rely on statistical patterns inferred from the text rather than explicit reasoning (Keluskar et al., 2024). For instance, words with multiple meanings, such as “bank” (which could refer to a financial institution or the side of a river), can be misinterpreted if the model lacks sufficient domain-specific knowledge or contextual clues. This limitation becomes particularly evident in specialized fields or nuanced conversations where precise understanding is critical. Consequently, semantic disambiguation in LLMs remains an active research area (Yang et al., 2023). Keluskar et al. (2024) analyze ambiguities in LLM-generated responses within the context of open-domain question answering. They argue that ambiguity primarily arises due to a lack of context, which includes not only textual cues but also social and psychological ones. Their study highlights the need to integrate external knowledge sources, such as knowledge graphs, to enhance clarity and disambiguation. Their results demonstrate that contextual enrichment, that is, contextual information to LLMs, significantly reduces disambiguation and increases accuracy. Unlike scientific experts, LLMs struggle to understand the nuances of experimental setups or domain-specific methodologies. Humans can interpret the meaning of a document, identify and link related data across tables, and infer implicit relationships and contextual insights. LLMs, however, are generally unable to perform such reasoning reliably, limiting their ability to extract and contextualize knowledge from complex scientific texts. This often results in superficial or incomplete responses, such as misinterpreting parameters like temperature or pressure in experimental setups, because the context is not clearly defined.
Although LLMs possess relevant knowledge, they often struggle to apply it when prompts contain ambiguous entity types. Therefore, understanding different types of ambiguities is crucial for improving LLM-powered NLU tasks. Ambiguities can be categorized into three primary types: semantic, syntactic, and lexical (Zait & Zarour, 2018). Semantic ambiguity, also known as referential ambiguity, pertains to multiple interpretations of a word, phrase, or sentence. For example, the question, “What is the home stadium of the Cardinals?” yields different answers depending on whether it refers to the Arizona Cardinals (football) or the St. Louis Cardinals (baseball) (Keluskar et al., 2024). As this example illustrates, even humans struggle to resolve meaning without sufficient context, making it even more challenging for LLMs (Zait & Zarour, 2018). Lexical ambiguity arises at the word level and is related to parts-of-speech tagging. For instance, the word silver can function as a noun, adjective, or verb: “She bagged two silver (noun) medals.”, “She made a silver (adjective) speech.”, and “His worries had silvered (verb) his hair.” (Anjali & Babu, 2014). Lastly, syntactic ambiguity concerns the grammar and structure of a sentence. A widely cited example is: “I saw the man with the telescope.” This can be interpreted as either “I saw the man [who was holding the telescope].” or “I used the telescope to see the man.” Different types of ambiguities contribute to vagueness, fuzziness, and uncertainty in language, leading to confusion in LLM responses. Augmenting contextual information or extending LLMs with external knowledge bases can mitigate these issues (Singh & Patil, 2024). Additionally, LLMs struggle with self-verification, demonstrating a lack of self-consistency. This highlights a key challenge in polysemy resolution, emphasizing the need for further research into entity-type ambiguities and the broader complexities of language understanding (Sedova et al., 2024).
Approaches of Integrating External Knowledge Into LLM-Based NLU Systems
In this survey, external knowledge refers to information supplied through domain-specific documents, structured resources like knowledge bases or relational databases, and ontologies that formally represent domain expertise. Such specialized knowledge covers factual details—concepts, relations, and constraints that are unique to a domain, but also extends to insights from unstructured sources, including narrative text or document passages. Both structured and unstructured knowledge play essential roles in enabling reasoning within a specific domain. For example, interpreting technical datasheets can greatly benefit from integrating domain-specific ontologies with a defined set of entities and relations to achieve a comprehensive understanding of the mentioned concepts. In specialized domains such as science and engineering, integrating external knowledge can reduce hallucinations, enhance reasoning, improve factual accuracy, and boost overall task performance. For instance, the work by Wang and Li (2023) identifies challenges in using LLMs for manufacturing applications and recommends integrating an external knowledge base to generate industry-relevant insights. Another example of external knowledge integration can be seen in text-to-query systems, where LLMs translate natural language inputs into formal query languages such as SQL or SPARQL. In both cases, the model must understand and reason over the underlying structured schema—tables and relations in databases or entities and predicates in knowledge graphs, to generate syntactically correct and semantically meaningful queries. These systems demonstrate how LLMs integrate with structured external knowledge sources to produce actionable outputs, generating knowledge that extends beyond their pre-trained parameters.
Classification of knowledge:
In the context of LLMs, data refers to text/corpora from books, websites, articles, among others, and knowledge is derived by processing the data during the training phase, that is, the data is processed to create knowledge (da Costa & Oliveira e Souza Filho, 2024). LLMs store internal knowledge within their parameters, which is derived during the training phase from large-scale, unstructured, and unlabeled datasets. For larger LLMs with over 100 billion parameters, this training data encompasses vast amounts of internet-scale information. The knowledge stored in model parameters is represented by numerical values known as weights, which are learned probabilistically through the backpropagation algorithm. This form of knowledge retention is referred to as implicit memory or internal knowledge. However, this knowledge is inherently limited as it only reflects data available up to the model’s last training timestamp. Additionally, the training corpus is generic, often lacking specialized domain-specific information due to confidentiality restrictions that prevent such documents from being included in publicly available datasets. Furthermore, even within general datasets, highly detailed or technical documents may be disproportionately represented. Therefore, LLMs often requires external data to perform domain-specific or knowledge-intensive tasks, which serves as the additional contextual knowledge. Unlike implicit memory, this external data is maintained outside the model, typically in knowledge bases, relational databases, or structured file systems. When LLMs consume and process such external information, it is referred to as external knowledge. Integrating external knowledge with LLMs is crucial for reducing hallucinations, enhancing contextual understanding, and ensuring access to up-to-date information. This integration can be done in two ways, at a broader level: (i) parametric knowledge integration, that is, knowledge is encoded within the parameters of the LLMs, (ii) non-parametric knowledge integration, where the knowledge is not encoded within the model weights, but supplied from a storage system outside the LLM, like relational or knowledge databases. Table 2 summarizes the parametric and non-parametric methods, and the following sections explain these methods in detail.
Parametric Knowledge
As mentioned earlier, parametric knowledge is implicitly embedded within the model’s architecture (Allen-Zhu & Li, 2025a). Therefore, parametric methods for integrating knowledge with LLMs involve modifying their internal parameters and encoding/embedding information during training or fine-tuning.
Pre-Training LLMs
The development of LLMs typically follows a hierarchical training continuum that transitions from general-purpose learning to task- and fact-specific adaptation. At the foundation lies pre-training, where models are trained from scratch on large, unlabeled, and diverse corpora using self-supervised objectives (e.g., masked or causal language modeling) to acquire general linguistic and world knowledge. Foundational examples include encoder-based models such as BERT (Devlin et al., 2019) and decoder-based models such as GPT (Radford et al., 2018). Following the general pre-training phase, domain-adaptive pre-training (DAPT) continues training a pre-trained model on unlabeled, domain-specific corpora to better align its internal representations with specialized knowledge (e.g., biomedical, legal, or financial domains) (Chalkidis et al., 2020; Peng et al., 2021; Vasantharajan et al., 2022). For example, LEGAL-BERT (Chalkidis et al., 2020) and FinBERT (Peng et al., 2021) are further trained on unstructured textual data from their respective domains—LEGAL-BERT on diverse English legal texts such as legislation, court cases, and contracts, and FinBERT on the Reuters TRC2 financial news corpus ((2004), NIST). Both models demonstrate superior performance compared to the original BERT model on downstream tasks within their domains. In continuation of DAPT, task-adaptive pre-training (TAPT) further refines the model using unlabeled text drawn directly from the downstream task domain, enabling better alignment with the stylistic and distributional characteristics of the task data. This approach, formalized by Gururangan et al. (2020), has been shown to improve model robustness and task-specific generalization before supervised fine-tuning.
Building upon these stages, several studies have explored enhancing the model by incorporating external structured knowledge into the language modeling process. For instance, Yan et al. (2021) introduce a knowledge fusion layer on top of the transformer architecture to integrate knowledge graph information during pre-training without altering the underlying model structure. Similarly, Dong et al. (2023) propose a structured knowledge–aware pre-training framework that embeds structured knowledge into the model using the masked language modeling (MLM) objective from BERT, enabling it to learn representations of complex subgraphs for improved performance on Knowledge Base Question Answering (KBQA) tasks. Related approaches that inject knowledge into the pre-training, DAPT, and TAPT phases include (Hu et al., 2024; Zhang et al., 2022, 2023), all of which demonstrate that structured knowledge integration enhances model understanding and reasoning capabilities.
In the context of (large) language models, NLU can be viewed as a process of manipulating and extracting knowledge, where relevant information is identified and adapted to meet task-specific requirements. One illustrative example is IE, which involves deriving structured signals, such as entities or relations, from data, retrieving pertinent facts encoded within the model, or classifying sentences into predefined categories (Allen-Zhu & Li, 2025b). Such operations highlight how LLMs leverage and transform knowledge to support a broad range of NLU tasks. However, despite their ability to store vast amounts of information, LLMs often struggle to extract and manipulate knowledge effectively. For instance, a model trained on the fact “Abraham Lincoln was born in Hodgenville, K.Y.” may correctly answer the direct question “Where was Abraham Lincoln born?” but fail to respond to the inverse query “Who was born in Hodgenville, K.Y.?” unless explicitly trained with bidirectional mappings. This type of retrieval is called reverse retrieval. This limitation highlights that memorizing knowledge alone does not guarantee effective knowledge extraction. To address this gap, Allen-Zhu and Li (2025b) propose strategies to enhance LLMs performance by refining knowledge representation processes. One approach involves rewriting pre-training data using small auxiliary models to generate diverse permutations of knowledge, thereby improving retrieval flexibility. Another method focuses on incorporating fine-tuning data during pre-training, enhancing the model’s ability to reason over stored information.
Fine-Tuning LLMs
Unlike pre-training, DAPT, and TAPT, fine-tuning uses labeled, task-specific datasets and supervised learning objectives to adapt the model to specific applications, such as question answering, text classification, and summarization. The BERT paper (Devlin et al., 2019) distinguishes pre-training and fine-tuning for larger language models. During fine-tuning, the parameters of a pre-trained language model are updated using labeled examples to optimize its performance on downstream NLP tasks (Howard & Ruder, 2018; Soudani et al., 2024). For instance, Lee et al. (2019) first adapt BERT to the biomedical domain through additional pre-training on biomedical text, resulting in BioBERT, and subsequently fine-tune it on labeled, task-specific datasets for downstream NLP tasks such as NER, relation extraction, and question answering. LLMs, when fine-tuned, can be used to extract factual knowledge. Kazemi et al. (2023) fine-tune LLMs to do knowledge extraction and KGC. In general, fine-tuning has been effective in overcoming performance limitations on domain-specific NLU. The work of Muralidharan et al. (2024) examines the effectiveness of LLMs in understanding and extracting information in the scientific domain. They propose a methodology called Knowledge-AI, which fine-tunes LLMs for downstream NLU tasks, such as question answering, NER, summarization, and text generation.
Exploring different approaches to fine-tuning LLMs is essential for understanding their practical implications, scalability, and resource requirements. Fine-tuning strategies differ primarily in the number of parameters updated during training, influencing computational efficiency, memory consumption, and model adaptability. Full fine-tuning modifies all parameters of an LLM, typically achieving strong task-specific adaptation. However, this approach is computationally expensive, requiring high-end GPU clusters, particularly for models with more than 100 billion parameters (Rajabzadeh et al., 2024). To address these constraints, the NLP community has increasingly turned to the Parameter-Efficient Fine-Tuning (PEFT) paradigm, which updates only a small fraction of parameters, or introduces a small number of additional trainable parameters—while keeping the core model weights frozen. This reduces hardware requirements and accelerates training, making fine-tuning feasible on modest computational setups. Within this paradigm, several techniques have emerged, including Low-Rank Adaptation (LoRA) (Hu et al., 2022), prefix-tuning (Li & Liang, 2021), prompt-tuning (Petrov et al., 2023), BitFit (Ben Zaken et al., 2022), and adapter-based fine-tuning (Poth et al., 2023). Among these, adapter-based methods are one of the earliest and most prominent forms of PEFT. These methods introduce small, trainable neural modules, known as adapters, within the transformer layers of a pre-trained model. During fine-tuning, only the adapter parameters are updated.
For instance, Tian et al. (2024) combine knowledge graphs (KGs) with LLMs at the parameter level to improve reasoning. They note that simply adding KG information via prompting can introduce inconsistencies and lead the model to overly rely on its prior knowledge. To address this, they introduce a KG-adapter that leverages PEFT-based fine-tuning to embed both node and relation representations from the KG into the model. This approach enhances reasoning performance on KGQA. Similarly, Omeliyanenko et al. (2023) introduce an adapter-based architecture for integrating KG knowledge into LLMs for link prediction tasks. Their approach inserts lightweight, trainable layers between the transformer blocks of a pre-trained model, enabling task-specific adaptation without full model retraining (Wang et al., 2021). Extending these foundations, Wang et al. (2024a) propose an infuser-guided adapter integration framework—an optimized variant of adapter-based fine-tuning. This approach effectively mitigates catastrophic forgetting during knowledge integration. To ensure the added modules do not interfere with the model’s internal representations, they introduce infuser-based adapters, which selectively inject knowledge only when the model lacks it. This selective infusion mechanism delivers superior performance compared to conventional adapter-based methods, indicating a promising direction for adaptive, knowledge-aware LLM fine-tuning.
Fine-tuning LLMs requires significant resources and computational power and comes with its own set of challenges. Kazemi et al. (2023) highlight a downside known as “Frequency Shock”, where fine-tuned models tend to overpredict rare entities while underpredicting common ones in the training data, ultimately degrading performance. Ghosal et al. (2024) address another issue: “attention imbalance”, where the attention mechanism unevenly prioritizes specific tokens over others. Ovadia et al. (2024) compare two approaches for embedding factual knowledge into LLMs—unsupervised fine-tuning and Retrieval-Augmented-Generation (RAG). Their findings suggest that RAG outperforms unsupervised fine-tuning. They also note that unsupervised training often exposes models to multiple variations of the same fact, complicating the accurate retention of factual information. These challenges emphasize the importance of exploring non-parametric methods for integrating external knowledge into LLMs (cf., Section 4.2).
Knowledge Editing
Incorporating new knowledge into the parameters of LLMs through fine-tuning is computationally expensive due to their vast scale and the billions of parameters they contain. This challenge has motivated the development of more efficient mechanisms for integrating external knowledge without the high computational cost of full model retraining. Knowledge-based model editing (KME), or knowledge editing, addresses this by modifying only a small and targeted subset of parameters responsible for encoding specific information, thereby updating the model’s knowledge while preserving its pre-trained capabilities (Pinter & Elhadad, 2023; Wang et al., 2024c).
Unlike fine-tuning, which typically adjusts all or a broad range of parameters to optimize model performance for a downstream task, KME focuses on localized and semantically precise modifications. As highlighted by Ishigaki et al. (2024), knowledge editing generally proceeds in two stages: (a) localization, where the model identifies the neurons or attention heads associated with the target knowledge, and (b) modification, where only those parameters are updated to reflect new or corrected information. This targeted approach enables locality, ensuring that updates affect only the intended knowledge, and generality, allowing the edit to generalize across semantically related contexts (Wang et al., 2024d).
Beyond its application in LLMs, knowledge editing techniques have broader implications in machine learning, such as mitigating data biases and improving model robustness on downstream tasks (Ilharco et al., n.d). By maintaining the integrity of previously learned information while efficiently integrating new knowledge, KME provides a computationally efficient and semantically stable alternative to fine-tuning, making it especially valuable for dynamically updating LLMs in knowledge-intensive domains.
To better illustrate the conceptual and computational distinctions between pre-training, fine-tuning, and knowledge editing, Table 3 provides a structured comparison highlighting their objectives, data requirements, training scope, efficiency, and key representative techniques. This comparative overview clarifies the boundary between full-scale model adaptation and targeted knowledge modification, situating knowledge editing as a lightweight yet powerful alternative for dynamically updating LLMs in knowledge-intensive settings.
Comparison of Pre-Training, Fine-Tuning, and Knowledge Editing in Large Language Models (LLMs).
Comparison of Pre-Training, Fine-Tuning, and Knowledge Editing in Large Language Models (LLMs).
Steering an LLM involves guiding its output to align with specific stylistic, instructional, or knowledge-driven constraints (Lai et al., 2024). While it is commonly applied to text style transfer (TST)—for instance, adapting an LLM to generate Shakespearean-style language, steering also plays a crucial role in knowledge integration, ensuring that models correctly interpret and apply external information in their responses. There are multiple perspectives on steering and styling LLMs, many of which facilitate the integration of external knowledge. Steering techniques can be broadly classified into parametric and non-parametric approaches 9 . Parametric methods, such as neuron deactivation or fine-tuning, modify model parameters to influence responses. On the other hand, non-parametric methods, such as prompt-based steering, control model behavior without altering the underlying parameters. Lai et al. (2024) propose a novel parametric steering approach called sNeuron-TST for steering the style. This method identifies the neurons associated with source and target styles, deactivating the source-style neurons to enforce the target style in generated text. However, they observe that deactivating these neurons leads to performance degradation. To address this, they introduce a constructive decoding method that compensates for the removed neurons, improving output consistency. Beyond stylistic control, the similar parameter-level steering mechanisms of LLMs are also applicable to knowledge integration, as they direct the model’s attention mechanisms to prioritize user-specified information (e.g., instructions or domain-specific knowledge). This is achieved by identifying subsets of attention heads and applying attention reweighting, which enables the model to effectively incorporate and process new information. Such steering methods have been shown to enhance performance on knowledge-intensive tasks (Lamb et al., 2025; Zhang et al., 2024c). Therefore, steering techniques ensure that models remain aligned with the external knowledge sources.
Other Methods and Limitations of Parametric Approaches
Other methods of parametric data augmentation to LLMs include embedding techniques. They transform external knowledge into vector representations that can be integrated into the LLM’s latent space, thereby capturing the text’s linguistic features. The NLP community has used embeddings for years to transform raw textual information into numerical representations that can be processed by AI algorithms (Zhang et al., 2024a). Additionally, methods such as graph embeddings (Jain & Lapata, 2024; Pan et al., 2024) and knowledge graph transformers (Liu et al., 2024) enable the model to learn richer representations of external knowledge, thereby improving its contextual understanding and reasoning abilities. Another prominent parametric technique is knowledge distillation, where a large teacher model transfers its knowledge to a smaller student model by training the student to approximate the teacher’s outputs. This process embeds the teacher’s knowledge directly into the student’s parameters, allowing for model compression and efficiency gains while retaining much of the teacher’s performance (Hinton et al., 2015; Sanh et al., 2019; Wang et al., 2020). In the context of LLMs, knowledge distillation is widely used to reduce computational overhead and memory requirements, thereby making LLMs easier to deploy and more scalable.
Parametric knowledge within LLMs often faces challenges, such as a lack of explainability, as it is stored within the model weights; that is, the knowledge is converted into numerical values, making its provenance difficult to trace. This might lead to security risks due to its opaque nature. Another challenge is that updating knowledge is computationally expensive and time-consuming, as parametric knowledge updates require some level of modification to model weights/layers/architecture. To address these limitations, external non-parametric knowledge emerges as a good option, offering enhanced transparency, flexibility, adaptability, and operational simplicity (Wang et al., 2024b). Table 4 summarizes how these methods differ from non-parametric strategies in terms of computation, explainability, flexibility, and update mechanisms.
Comparison Between Parametric and Non-Parametric Methods.
Comparison Between Parametric and Non-Parametric Methods.
As previously discussed, non-parametric knowledge refers to information or knowledge provided to the LLM without altering its internal weights or architecture. This type of knowledge is stored in a separate system outside the model and is not encoded within its trainable parameters. The following are the most commonly used approaches for integrating external and up-to-date knowledge into LLMs in a non-parametric manner.
Prompting Methods
Prompting refers to the process of providing textual instructions to an LLM to elicit desired task-specific behavior. Common prompting strategies include zero-shot prompting, where only the instruction is provided without examples; few-shot prompting, which supplements the instruction with a few input–output examples; and chain-of-thought (CoT) prompting, which guides the model by decomposing the reasoning process into intermediate steps (Brown et al., 2020; Sun et al., 2024). As a non-parametric approach, prompting does not alter the model’s internal parameters or weights. Instead, it conditions the model externally via contextual cues by embedding relevant information directly into the prompt, within the LLM’s context limit. While techniques such as CoT prompting (Bosma et al., 2022) improve reasoning tasks like retrieval and classification, they remain insufficient for more complex operations, such as inverse search, where models must infer relationships beyond explicitly stated data, as explained in Section 4.1.1. In such cases, advanced methodologies like RAG and reversal training are essential, as only a limited amount of context can be included in the prompt due to the LLMs ’s context limits. Lack of context results in issues like hallucination and inappropriate response generation. To solve this problem, a new field of research emerged called context engineering (Piñeiro-Martín et al., 2025; Rasmussen et al., 2025), in which relevant context is provided to the LLM to support knowledge-intensive NLU tasks. RAG and knowledge graph (KG) integration, discussed in detail in the following sub-sections, can indeed be conceptualized as forms of context engineering. These findings highlight the importance of knowledge integration in LLMs, bridging the gap between limited contextual understanding and the effective application of knowledge in real-world, knowledge-intensive tasks (Allen-Zhu & Li, 2025a, 2025b)
Knowledge-Based Methods: Knowledge Graphs and Ontologies
Knowledge Graphs and their role in LLMs:
Knowledge graphs, such as DBpedia (Auer et al., 2007), or Wikidata (Vrandečić & Krötzsch, 2014), are external knowledge sources that store knowledge in the form of nodes and edges (Fensel et al., 2020). By linking LLMs with these graphs, it is possible to enhance the models’ ability to reason about entities and their relationships. In recent years, significant research has focused on integrating knowledge graphs and LLMs to reduce hallucinations and improve factual accuracy. LLMs and KGs complement each other’s capabilities. Merging both helps overcome each other’s limitations. LLMs gain access to factual knowledge from KGs, which improves accuracy and trustworthiness, as in fact-checking. KGs, on the other hand, benefit from LLMs in language processing and language understanding tasks like: synthesis of user responses, question-answering, automated KG construction, etc. (Choudhary & Reddy, 2023; Luo et al., 2024a; Pan et al., 2024). There are various ways to integrate KGs and LLMs. First, KG-enhanced LLMs leverage KGs as external knowledge sources to provide domain-specific context to LLMs, thereby improving the factual accuracy of their outputs. Second, LLM-augmented KGs, where LLMs are used for populating the triples in the knowledge graph, assist in KG-related tasks, and incorporate ontologies to ensure that the generated triples conform to domain-specific rules and constraints. Third is the synergized LLMs+KGs, a unified framework that aims to enhance each other’s capabilities through knowledge representation and reasoning (Pan et al., 2024).
KG-enhanced LLMs, LLM-augmented KGs, and synergized KG+LLMs are methods for integrating KGs and LLMs. However, the focus of the work is KG-enhanced LLMs, where KG serves as the external source. For instance, Ye et al. (2024) introduce a method to integrate KGs with Large Language Models (LLMs) to enhance factual accuracy. Their approach employs deep reinforcement learning, which identifies relevant inference paths within the KG based on user input and incorporates this information into the LLM prompt, thereby providing context that yields more domain-specific and accurate results. The work by Wang et al. (2025) discusses the challenges of handling evolving knowledge in the medical domain, emphasizing the need for continuous model updates and the integration of external knowledge. They propose a 3-step framework to develop an LLM-powered AI application for the medical domain: (i) modeling (which breaks down a complex task into simpler sub-tasks), (ii) optimization (enhancing the model response generation relevancy and accuracy by integrating external knowledge), and (iii) system engineering. They also highlight that more research is needed on system optimization, specifically on augmenting external knowledge with LLMs. Chen et al. (2024) handle out-of-date information issues in LLMs by integrating knowledge graphs to facilitate accurate fact identification and logical reasoning. They propose a method called Graph Memory-based Editing for Large Language Models (GMeLLo), which combines the strengths of both KGs and LLMs to perform multi-hop question answering in dynamic environments, that is, those with frequently updated external knowledge. Such advancements not only enhance the performance of LLMs in NLU tasks but also foster greater trust and adoption of these models across critical fields such as healthcare, science, and education. Building on these integration strategies, researchers have explored other practical applications that leverage the synergy between KGs and LLMs to address specific tasks, such as question answering and improving explainability. Feng et al. (2023), integrate KGs with LLMs to develop a multiple options question answering system. They call their methodology a knowledge solver, which allows them to search for relevant facts in the integrated knowledge graph. The approach also increases the explainability of LLMs’ reasoning processes by providing complete retrieval paths, as demonstrated through experiments on the datasets: CommonsenseQA (Talmor et al., 2021), OpenbookQA (Mihaylov et al., 2018), and MedQA-USMLE (Jin et al., 2021). There are other works in the same research area. For instance, Jiang et al. (2024) highlight the importance of connecting domain-specific KGs to LLMs for the task of domain-related question answering. They propose a subgraph retrieval method based on the CoT and PageRank, which returns the paths most likely to contain the answer, thereby improving the efficiency of the given NLU task (domain-dependent question answering). Additionally, Luo et al. (2024a) integrate KGs with LLMs to enable reasoning abilities for the task of knowledge-graph question answering (KGQA). They propose planning-retrieval-reasoning: LLMs first create a reasoning plan and then perform reasoning, which involves fetching reasoning paths from the KG using the generated plan. Hallucinations happen if the reasoning plan is incorrect. The aim is to distill the knowledge from KGs into LLMs to generate faithful relation paths as plans. Research studies such as (Ji et al., 2024; Ma et al., 2025; Wang et al., 2023a) explore similar approaches, focusing on using LLMs for reasoning and question answering by incorporating knowledge graphs. However, when developing methods that combine KGs and LLMs, scalability must be a key consideration. Without effective scalability, integrating extensive knowledge bases or graphs can demand substantial resources, posing significant challenges for large-scale deployment (Zhang et al., 2024b).
Constrained-Decoding in LLMs:
Recent advancements in natural language generation (NLG) have introduced constrained-decoding methods for LLMs (Chen et al., 2022; Hokamp & Liu, 2017; Post & Vilar, 2018), which apply ontological rules, among other constraints, to ensure that LLM outputs maintain logical consistency and conform to domain-specific structures (Pan et al., 2024). It can also be seen as a way of integrating external knowledge with LLMs. The purpose of this integration is to provide a response adhering to the provided input syntax or semantics. Traditional constrained decoding with LLMs was applied at the syntax level; a typical example is the introduction of JSON mode (Tam et al., 2024). Other frameworks for syntax-level constrained generation include Outline (Outline, 2026) and Guidance (Guidance, 2026). They can output formats like JSON, XML, python scripts, and SQL, amongst others. However, the concept can also be extended to the level of semantics. A foundational work in this area is presented by Hokamp and Liu (2017), where they utilize lexicons to control the generation process of LLMs. An evolving research area under the umbrella of semantic-based constrained decoding concerns interactions and potential synergies between graph-based knowledge systems and LLMs. Ontologies, the foundational framework for structuring and organizing knowledge graphs, formally represent the domain by serving as the Terminology Box (T-Box) that defines classes and their relationships (Luo et al., 2024b). These structured frameworks are instrumental in guiding the extraction and construction of domain-specific knowledge graphs and in constraining and refining LLM outputs to align with established schemas. Constrained-decoding leverages entities from the KG or the schema and structure of ontologies to regulate LLM responses, ensuring relevance and adherence to domain-specific logic. Technically, constrained decoding encompasses methods for controlling the output tokens of LLMs using external knowledge sources, such as controlled vocabularies, taxonomies, structured ontologies, or domain-specific knowledge graphs. To summarize, ontologies and knowledge graphs provide the necessary structure and knowledge to anchor LLM-generated outputs, enabling the generation of domain-relevant language that aligns with the ontology’s semantics. Although constrained decoding yields precise and structured responses, it is limited by increased latency, resulting in longer response generation times compared to unconstrained generation (Geng et al., 2023).
Memory-Based Systems
Retrieval-Augmented Generation (RAG):
It is a kind of memory-oriented framework for LLMs, since it facilitates the external storage (in the vector-database) and dynamic retrieval of information. The work by Akbar et al. (2025) discusses the use of vector databases and RAG frameworks to manage memory for conversational AI systems. The RAG approach integrates dense vector search with LLM-based text generation, thereby improving response quality in knowledge-intensive applications. As its name implies, RAG retrieves contextually relevant information for a query using dense vector retrieval techniques. The retrieved information is then embedded into the LLM’s prompt during text generation, improving both the factual accuracy and contextual relevance of the output. Rather than training or fine-tuning the LLM with vast amounts of knowledge, which consumes substantial resources and time, external knowledge can be dynamically retrieved using methods such as RAG. This reduces computational load and enables LLMs to scale efficiently while still benefiting from external knowledge (Li et al., 2025c). It is most useful in scenarios where LLMs lag behind task-specific architectures. This approach enables the model to access and incorporate knowledge in real time, eliminating the need to store the entire knowledge base within its parameters. This significantly reduces computational overhead and enhances the model’s decision provenance (Lewis et al., 2020b).
As highlighted above, RAG involves storing data in vector databases, such as Elasticsearch (Elastic, 2025), ChromaDB (Chroma, 2025), or others. Vector databases store vector representations of text, which are multidimensional arrays of numbers. They serve as the external memory. Several frameworks are available online that facilitate the seamless integration of RAG with LLMs. Examples include Hugging Face’s RAG implementation, Haystack (deepset, 2025), and LangChain (Mavroudis, 2024), among others. These frameworks provide tools and libraries to streamline the development of RAG pipelines, enabling efficient retrieval and context-enhanced text generation.
External knowledge integration into LLMs via RAG can be viewed as both parametric and non-parametric. Naive RAG implementations include using a pre-trained dense vector retriever, for example, instruct-embeddings (Su et al., 2023) and connecting it with a pre-trained LLM. In this scenario, neither of the two models—the retrievers nor the LLMs—is further fine-tuned, that is, following the non-parametric approach. The naive RAG approach focuses solely on the inference stage. In contrast, advanced RAG implementations offer two methods. The first method fine-tunes the dense-vector retriever for the specific task or domain. This is a non-parametric approach because the LLM’s parameters remain unchanged. The second method fine-tunes both the retriever and the LLM for the specific task or domain. This is a parametric approach, in which the LLM’s parameters and weights are adjusted to suit the use case. Some progress has also been made in pre-training RAG systems (Gao et al., 2023).
Further research on RAG implementations includes the following: RAG task categorization (Zhao et al., 2024a), where queries are classified into levels based on the type of external data required: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. This is done to ensure that the correct and most appropriate data is fetched for the given task and provided to the prompt as context. Zhao et al. (2024a) propose three methods of ingesting data into LLMs, that is, context (providing the retrieved context directly to the LLM), small model (adding a small model trained on the domain to help external data integration with LLMs), and fine-tuning (fine-tuning the LLM). They also highlight the challenges of deploying data-augmented LLMs and believe that there is no one-size-fits-all solution for it. Anjum et al. (2025) propose a framework called HALO for mitigating hallucinations in the medical question-answering systems to enhance reliability and accuracy. They use the RAG technique to integrate domain-specific information with the LLMs. The results show an increase in LLM accuracies from 44% to 65% for Llama-3.1 and from 56% to 70% for ChatGPT. Additional work in a similar line of research includes Hwang et al. (2025), who introduce Reliability RAG, an extension of traditional RAG systems designed to handle multiple data sources. Their approach focuses on estimating the reliability of various sources within the database. Similarly, Khan et al. (2024) developed a PDF-based, LLM-powered RAG system, showcasing its application in processing and retrieving information from PDF documents.
Traditional RAG techniques have several limitations that can negatively impact the quality of generated responses. In particular, the retrieval component often returns irrelevant or low-quality results, which directly affects the final output. Key issues in retrieval include the lack of pre-retrieval query processing, the absence of post-retrieval enhancements such as reranking, and text chunking that disregards semantic boundaries. The generation component also faces constraints, including the context window limitation of LLMs and the inherent performance differences between smaller and larger models. To mitigate retrieval-related shortcomings, optimized approaches such as Advanced RAG and Modular RAG have been proposed, with a primary focus on enhancing retrieval quality. For a detailed discussion of these methods, see Abo El-Enen et al. (2025). To address generation-related issues, such as context-window limitations and model capacity constraints, prior work has proposed methods, including positional interpolation, to extend context windows (Chen et al., 2023a) and architectural remedies for generation breakdowns in long-context settings (Hosseini et al., 2025).
RAG research is ongoing and rapidly evolving, with emerging approaches, such as GraphRAG 10 (Peng et al., 2025) and other optimization methods, being proposed. Therefore, this work can be further extended to incorporate additional state-of-the-art research in this direction.
Other Memory-Based Methods:
Knowledge graphs and ontologies, as discussed in Section 4.2.2, serve as a form of semantic memory for LLMs (Akbar et al., 2025, p. 12). Unlike the plain text-based embeddings typically used in RAG systems, they organize information into structured graph representations. This graphical organization enables graph-based retrieval, which goes beyond simple semantic-similarity search. Such capabilities form the key motivation behind the development of GraphRAG (Peng et al., 2025). In a similar manner, Rasmussen et al. (2025) adopt temporal knowledge graphs, that is, graph structures that evolve over time and retain historical information, as a mechanism for episodic memory (facts as episodes, changing over time) in the implementation of GraphRAG. Beyond retrieval-centric approaches, MemGPT (Packer et al., 2023) introduces a complementary memory paradigm by simulating operating-system-like virtual memory management. Whereas RAG primarily augments LLMs with external embedding-based or semantic knowledge retrieved from document corpora, MemGPT equips them with working memory by storing and recalling prior interactions from external storage, thereby sustaining dialogue continuity and overcoming fixed context window limitations. Instead of modifying the model’s parameters, MemGPT externalizes memory management: the LLM maintains a limited “active context” (analogous to RAM or short-term memory) while offloading less immediately relevant information into external storage (analogous to disk or long-term memory). This external memory may include prior conversation history, task states, or knowledge chunks, all stored outside the model’s internal weights. When needed, MemGPT dynamically recalls and reinserts this information back into the context window, effectively integrating relevant working memory. In this way, MemGPT functions as another non-parametric memory-based method for LLMs. Similarly, Li et al. (2025d) investigate memory integration methods motivated by operating systems for LLMs. Refer to the survey (Zhang et al., 2025) for more literature related to memory-based systems for LLMs.
Tool Usage and Function Calling
The core of agentic AI systems lies in their ability to use external tools or functions. Tool usage refers to the ability of LLM-based systems to interact with external resources or services to perform tasks beyond their internal capabilities or training data. Function calling is a specific type of tool usage in which an LLM invokes a predefined function or API. A simple example of a tool is a calculator. Since LLMs do not perform actual calculations but instead predict the next word based on previous text, they might correctly answer simple questions like
Reinforcement Learning
Reinforcement learning (RL) is a subfield of machine learning in which an agent improves its performance by interacting with an environment and receiving feedback in the form of rewards or penalties (Sutton & Barto, 1998). In the context of Large Language Models (LLMs), RL can be employed to optimize model behavior across different tasks. Here, the model acts as the agent, while the environment is more broadly defined as the task setup and feedback loop; within this loop, the reward function or reward model, that is, another machine learning system trained to evaluate the outputs of the LLMs, serves as an approximation of the environment’s feedback. Importantly, Reinforcement Learning from Human Feedback (RLHF) extends this paradigm by training the environment’s reward signal using human feedback, thereby aligning the model’s outputs more closely with human expectations. Instruction-following is one such task in which the reward model guides the agent toward producing responses aligned with human preferences. A prominent example is ChatGPT, which was tuned using RLHF. In this case, the reward model was trained on human-labeled data that distinguished between high-quality and low-quality responses, as described in Agarwal et al. (2022). Following the similar area of research, Yan et al. (2025), propose a variant of RLHF called Reinforcement Learning from Knowledge Graph Feedback (RLKGF), replacing human feedback as they are time-consuming and costly to accumulate, with a knowledge graph. They assume that the human reasoning process could be substituted by the reasoning paths in a knowledge graph. However, they also note that the knowledge in the KG represents human thinking, thereby aligning the model with human preferences. Their results show that RLKGF outperforms Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022; Lee et al., 2024), another variant of RLHF in which the human feedback component is replaced by AI-generated feedback. Any discussion of RL in the context of LLMs cannot overlook the InstructGPT paper (Agarwal et al., 2022), although classifying it and its related variants as parametric or non-parametric is not straightforward. The reward model is trained externally, making that component non-parametric, while the LLMs policy is updated internally, meaning its weights are adjusted based on signals from the reward model, which is parametric. Taken together, this constitutes a hybrid approach. However, there are other RL-based methods that are purely non-parametric. One such method is RL for prompt optimization. For instance, Deng et al. (2022) introduce RLPrompt, a discrete prompt-optimization method that harnesses reinforcement learning to generate optimal prompts that are transferable across models. Extending the work of RLPrompt, Kwon et al. (2024) introduce StablePrompt, an automatic prompt tuning method based on reinforcement learning. In this framework, an agent model generates candidate prompts, while an anchor model is incorporated to stabilize policy updates. The target LLM, evaluated against the dataset, supplies reward signals that reflect the quality of the generated prompts and drive the optimization process.
Future Research Directions
As discussed in the sections above, there are limitations with LLMs which require further research. Additionally, the wide range of applications of LLMs in NLU tasks presents numerous opportunities for future investigation. The following section outlines potential research directions for enhancing NLU by leveraging LLMs and incorporating external knowledge to create highly efficient systems or models. These research directions emerged as key areas for further study, based on challenges highlighted in the surveyed literature and LLM optimization papers.
Explainability in Knowledge-Augmented LLMs:
As LLMs integrate more external knowledge, ensuring their explainability becomes increasingly essential. Future research should focus on developing techniques to understand how knowledge is incorporated into the model’s decision-making process. Additionally, improving the trustworthiness of applications built with LLMs will require methods to recover accurate citations for LLM-generated answers and effectively identify potential biases in the information they rely on Rorseth et al. (2024). For instance, improving the transparency of how knowledge graphs are integrated with LLMs and clarifying their reasoning processes (Chen et al., 2023b) can enhance user trust.
Optimizing LLM Finetuning:
Future research can also address the issue of Frequency Shock (cf. Section 4.1.2) (Kazemi et al., 2023). Other areas of improvement related to fine-tuning include mitigation strategies for overcoming attention imbalance (Ghosal et al., 2024). Attention imbalance, as the name suggests, is a phenomenon observed in LLMs in which the attention mechanism disproportionately focuses on certain tokens. This can lead to the suppression of important information, such as factual knowledge stored in the model’s parameters, from being used in generating responses.
Scalability and Efficiency Improvements in Knowledge Integration:
A promising direction is improving the scalability and efficiency of knowledge-augmented LLMs. It is a critical area of research given the increasing demands for LLMs. Numerous strategies have been proposed to optimize LLMs, focusing on reducing computational cost and improving inference speed without compromising the model performance. These techniques include model compression, architectural advancements, and data efficiency methods, among others. Each of these techniques has its advantages and disadvantages and therefore requires further investigation and exploration (Huang et al., 2024a; Li et al., 2025b).
Agentic AI and Knowledge Integration:
Future research could explore the integration of Agentic AI with LLMs, representing a significant advancement in artificial intelligence by combining the strengths of symbolic paradigms (symbolic representation and logic) with those of connectionist paradigms (neural networks). This integration has the potential to enhance knowledge integration and decision-making capabilities, enabling the development of autonomous agents that can navigate complex environments, reason, and learn from experiences. The synergy between these paradigms is crucial for enhancing the adaptability and reasoning capabilities of AI systems (Sharma, 2024; Xiong et al., 2024).
Interactive Knowledge Retrieval and Integration:
Interactive NLP systems (iNLP), where knowledge is dynamically retrieved and integrated during inference, offer a promising avenue for improving LLM-based NLU systems. For example, developing dynamically evolving knowledge graphs (Chen et al., 2024). These systems engage in back-and-forth dialogues with external knowledge sources, refining their understanding of the task and accessing up-to-date information (Wang et al., 2023b). Combining LLM, Semantic Web (KGs and ontologies), and iNLP can also be a promising future research area.
Discussion
This paper explores LLM-based NLU, particularly for knowledge-intensive tasks such as IE and KGC in domain-specific contexts. It surveys methods for integrating domain-specific knowledge into LLMs to enhance accuracy and address common limitations of their out-of-the-box usage. It provides an overview of LLM-based NLU systems, discussing the technical details of how NLU tasks benefit from the use of LLMs since the advent of transformers. As LLMs have limitations, such as hallucinations, limited contextual understanding in specialized domains, and incomplete/outdated data, these limitations directly affect the performance of NLU systems. As part of the survey, the challenges of LLMs are discussed, and approaches to mitigate them are investigated. We acknowledge that while LLMs have demonstrated exceptional performance across a wide range of NLU tasks, integrating external knowledge sources, such as ontologies and knowledge graphs, among others, is a critical pathway to address issues of factual accuracy, domain-specific reasoning, and ambiguity resolution. The discussion emphasizes the importance of integrating external knowledge to enhance factual accuracy, minimize hallucinations, and optimize performance on domain-specific tasks. Two primary approaches for incorporating knowledge into LLMs were explored, namely parametric and non-parametric methods. Parametric methods encode knowledge within a model’s parameters by updating its weights through techniques such as pre-training, fine-tuning, steering, knowledge editing, embedding-based approaches, and knowledge distillation. These approaches offer advantages such as higher accuracy in knowledge extraction and manipulation. However, they are computationally intensive and prone to challenges such as catastrophic forgetting, knowledge conflicts, and limited explainability. In contrast, non-parametric approaches, such as prompting strategies, knowledge-based techniques, memory-augmented systems (e.g., RAG), and tool-usage or function-calling methods, operate without extensive model training. These methods offer greater explainability, improved hardware efficiency, and increased flexibility, making them well-suited for scenarios where adaptability and explainability are crucial. Table 4 provides a comparison of these approaches across other relevant dimensions.
The choice of integration technique ultimately depends on the specific use case and the nature and volume of the data. Furthermore, the method’s efficiency can only be determined through experimentation and thorough evaluation, as with other AI methods and algorithms. As research into improving knowledge-intensive NLU systems with LLMs is ongoing, there is ample room for experimentation and further investigation into developing autonomous methods to enhance the explainability of LLM-generated outputs and to explore the integration of Agentic AI with LLMs. Combining fields such as the Semantic Web, NLP, and interactive NLP (iNLP) to create systems that incorporate real-time human feedback, thereby adding a human-in-the-loop dimension, represents a significant opportunity to advance LLM-based NLU systems.
Conclusion
The survey investigates techniques of integrating external knowledge with LLMs. The taxonomy and the Python script 12 used for searching and crawling the seed papers can be found at Yadav (2025). The study seeks to provide insights to enhance the effectiveness of the out-of-the-box LLMs in handling domain-specific and knowledge-intensive tasks. Its emphasis lies in examining research on LLM-driven NLU systems, while highlighting the inherent challenges of LLMs that constrain the performance of such NLU applications. Approaches for overcoming the common limitations of LLMs, including hallucinations, lack of contextual understanding, and outdated data, are discussed in detail. These approaches suggest integrating external knowledge to achieve better factual accuracy with LLMs for knowledge-intensive tasks, such as IE and KGC. The investigation highlights that while parametric methods for integrating knowledge into LLMs have shown a positive impact on performance, they face challenges, including a lack of explainability and the extensive use of hardware resources during model training. Non-parametric approaches, which rely on external storage systems for knowledge retention, offer improved explainability and allow the provenance of results to be traced. Moreover, implementing non-parametric approaches is less computationally expensive. However, the decision of which approach to pick depends on the particular problem. Trade-offs must be made between performance and hardware requirements. Lastly, future research directions for improving LLM-based NLU systems are discussed.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
