Abstract
Large Language Models (LLMs) excel at many tasks but often struggle with complex, multi-step reasoning, leading to inconsistencies and hallucinations. To address this, we propose a neural-symbolic integration framework that enhances LLM reasoning by incorporating formal knowledge (such as logical rules, ontologies, and knowledge graphs) into their chain-of-thought (CoT) process. Our approach retrieves and integrates symbolic information to guide logical inference, resulting in more accurate and interpretable outputs. Experiments on compositional reasoning benchmarks demonstrate significant improvements over standard LLM methods. This work highlights the potential of neural-symbolic integration for developing more reliable and explainable AI systems in high-stakes applications.
Introduction
Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, driven by their ability to learn from vast amounts of data. However, despite their impressive language understanding capabilities, LLMs often struggle with complex, multi-step reasoning tasks (Cobbe et al., 2021; Wei et al., 2022). This can lead to issues such as inconsistent outputs, hallucinated information, and a lack of transparent decision-making processes (Wei et al., 2022). In high-stakes domains like healthcare, law, and finance, where reliability and explainability are paramount, these limitations present significant challenges (Cabrera et al., 2024; Cobbe et al., 2021).
One promising avenue to address these challenges is the integration of Formal Knowledge into LLMs (Alotaibi et al., 2024; Cobbe et al., 2021; Dhanraj & Eliasmith, 2025). Formal Knowledge refers to information that is represented in structured, symbolic forms—such as logical rules, ontologies, and knowledge graphs—which provide clear, unambiguous representations of domain-specific facts and relationships (Hogan et al., 2021). Such knowledge representations have long been used in classical AI systems to ensure logical consistency and offer clear justifications for decisions (Russell & Norvig, 2010). They can serve as a reliable foundation to support robust reasoning, mitigating some of the statistical shortcomings of purely data-driven approaches.
Despite the availability of rich formal knowledge sources, current LLMs predominantly rely on learned statistical patterns, which can be insufficient for tasks that require precise logical inference and multi-step reasoning (Wei et al., 2022). The absence of an internal mechanism to incorporate structured reasoning often results in outputs that are difficult to interpret and verify (Wei et al., 2022; Zhao et al., 2024). Therefore, bridging the gap between statistical learning and symbolic reasoning emerges as a critical research challenge (Alotaibi et al., 2024; Besold et al., 2017; Cabrera et al., 2024; Dhanraj & Eliasmith, 2025; Liang & Jordan, 2017; Xie et al., 2025). By integrating formal knowledge with LLMs, we aim to leverage the strengths of both approaches—harnessing the adaptability and language fluency of neural models while enforcing logical consistency and transparency through symbolic structures.
In this paper, we propose a novel neural-symbolic integration framework designed to enhance the reasoning capabilities of LLMs by incorporating formal knowledge into their chain-of-thought (CoT) process. Our approach uses a multi-stage reasoning pipeline in which the LLM interacts with an external formal knowledge base. The system retrieves relevant symbolic information via a retrieval-augmented mechanism and integrates it into the model's reasoning process, thereby grounding the inference in explicit, well-defined rules and relationships. Such integration not only reduces the likelihood of hallucinations but also improves the overall interpretability of the reasoning process.
Our main contributions can be summarized as follows:
The remainder of this paper is organized as follows. Section 2 reviews related work in LLMs, formal knowledge representations, and neural-symbolic integration. Section 3 details our proposed framework, including the architecture and integration strategy. Section 4 describes our experimental setup and evaluation metrics, followed by a discussion of our results in Section 5. Finally, Section 6 concludes the paper and outlines directions for future research.
Background and Related Work
This section reviews foundational work in LLMs, formal knowledge representations, and neural-symbolic integration, with a focus on the mathematical formulations that underpin these approaches.
Large Language Models
Recent advancements in LLMs, particularly those based on the Transformer architecture, have led to significant improvements in natural language processing (Vaswani et al., 2017). Models such as GPT and BERT demonstrate remarkable fluency and generalization. However, they primarily rely on statistical correlations learned from large corpora, often leading to (Wei et al., 2022; Zhao et al., 2023):
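The probability of generating a token sequence $y = (y_1, \dots, y_T)$ given an input $x$ follows the standard autoregressive factorization:
$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)$$
The attention weights are computed by scaled dot-product attention (Vaswani et al., 2017), where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension (here $K$ denotes the key matrix, not the knowledge set K used elsewhere in the paper):
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$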
To address their inherent limitations in complex logical reasoning, recent research has explored three main groups of techniques to improve LLM inference accuracy:
Self-consistency (e.g., voting mechanisms): Multiple chains of thought (CoTs) are generated and a majority vote is taken on the final answer, which reduces errors from a single flawed reasoning path (Wang et al., 2022).
Reflection: LLMs are prompted to review and critique their own generated CoT and final answer, identifying potential errors and iteratively refining their reasoning (Shinn et al., 2023).
Tree-of-Thought or Graph-of-Thought: These methods explore multiple reasoning paths or structures, moving beyond linear CoT to find optimal solutions in complex problem-solving scenarios (Alotaibi et al., 2024; Yao et al., 2023).
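The voting step of self-consistency can be sketched in a few lines. This is a minimal illustration, not the procedure of Wang et al. (2022) in full: the function name `majority_vote` and the sampled answers are hypothetical, and the sketch assumes the final answer has already been extracted from each sampled chain.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among sampled reasoning chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common keeps first-seen order among equal counts,
    # so the earliest answer wins a tie.
    return counts.most_common(1)[0][0]

# Hypothetical final answers extracted from five sampled chains of thought.
sampled = ["18", "18", "20", "18", "17"]
print(majority_vote(sampled))
```

Voting only aggregates final answers; it does not check whether any individual chain is logically valid.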
While these techniques have shown promise, they often still grapple with ensuring strict logical consistency, mitigating hallucinations, and providing transparent, verifiable reasoning, especially in high-stakes domains. Our proposed framework aims to bridge this gap by integrating formal knowledge directly into the reasoning process.
Formal Knowledge Representations
Formal knowledge is typically represented in symbolic forms such as logical rules, ontologies, and knowledge graphs. For instance, a simple logical rule can be expressed as:
$$\text{Fever} \land \text{Cough} \Rightarrow \text{Flu}$$
This pattern means that if a patient has both Fever and Cough, then they are likely to have Flu, with a certain confidence level (Hogan et al., 2021).
Logical rules provide a formal mechanism for encoding constraints and inference rules. A common approach is the use of first-order logic (FOL), which allows for quantification over variables (Lloyd, 1984):
$$\forall x\, \forall y\; \big(\text{Parent}(x, y) \rightarrow \text{Ancestor}(x, y)\big)$$
This rule states that if x is a parent of y, then x is also an ancestor of y. More generally, knowledge bases can store sets of logical rules K that define valid inference patterns.
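Knowledge graphs structure entities and their relationships as triples $(h, r, t)$, where $h$ and $t$ are the head and tail entities and $r$ is the relation between them:
$$\mathcal{G} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$$
with $\mathcal{E}$ the set of entities and $\mathcal{R}$ the set of relations.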
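To make the triple view concrete, the sketch below stores facts as (head, relation, tail) triples and applies the parent/ancestor rule to them once. The entity names and the helper `apply_parent_rule` are hypothetical. Only the rule quoted in the text is applied, so no transitive ancestors are derived.

```python
# Facts stored as (head, relation, tail) triples, as in a knowledge graph.
facts = {
    ("alice", "parent", "bob"),
    ("bob", "parent", "carol"),
}

def apply_parent_rule(triples):
    """Apply Parent(x, y) -> Ancestor(x, y) to every matching triple."""
    derived = {(h, "ancestor", t) for (h, r, t) in triples if r == "parent"}
    return set(triples) | derived

closed = apply_parent_rule(facts)
for triple in sorted(closed):
    print(triple)
```

Deriving that alice is an ancestor of carol would need an additional transitivity rule, which the text does not state.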
Neural-Symbolic Integration
Neural-symbolic approaches combine the learning capabilities of neural networks with the rigor of symbolic reasoning (Dhanraj & Eliasmith, 2025; Manhaeve et al., 2018). This field aims to pair the pattern recognition strengths of neural networks with the explainability and logical consistency of symbolic AI. Key approaches in this direction include:
Neural-Symbolic Machines (NSM): Liang & Jordan (2017) learn semantic parsers that map natural language to logical forms, often with weak supervision, with the aim of building interpretable models that can reason over knowledge bases.
Probabilistic Logic Programming (PLP) with Neural Components: DeepProbLog (Manhaeve et al., 2018) integrates neural networks into a probabilistic logic programming framework, allowing learning from data while maintaining logical interpretability and explicit reasoning paths.
Symbolic Knowledge Injection/Constraint Satisfaction: Recent efforts, particularly with LLMs, incorporate formal knowledge (e.g., logical rules, ontologies) to guide or constrain the model's output and reasoning process (Dhanraj & Eliasmith, 2025). These methods often aim to prevent illogical outputs and ensure factual correctness. For instance, Alotaibi et al. (2024) explore using knowledge graphs with symbolic logic to enhance LLM reasoning, and Cabrera et al. (2024) investigate finite-state machines for improving LLM reasoning and planning. Similarly, Xie et al. (2025) propose rule-based reinforcement learning to unleash logical reasoning in LLMs.
One method to achieve this is to modify the LLM's output by incorporating symbolic constraints (Dhanraj & Eliasmith, 2025). Suppose the LLM generates a CoT
Alternatively, a symbolic loss term can be incorporated into the training objective:
In retrieval-augmented frameworks, the LLM is conditioned on the input x and a set of relevant knowledge snippets
This allows the model to integrate external formal knowledge during generation, reducing hallucinations and enhancing factual accuracy (Gao et al., 2023).
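A retrieval-augmented prompt can be sketched as follows. The word-overlap scorer is a deliberately simple stand-in for a real retriever (dense or sparse), and the functions `retrieve` and `build_prompt`, the snippets, and the prompt layout are all illustrative assumptions rather than the system described here.

```python
def retrieve(query, knowledge, k=2):
    """Rank snippets by word overlap with the query and return the top k."""
    query_words = set(query.lower().split())
    scored = []
    for index, snippet in enumerate(knowledge):
        overlap = len(query_words & set(snippet.lower().split()))
        if overlap > 0:
            # Negative overlap sorts best-first; index breaks ties by order.
            scored.append((-overlap, index, snippet))
    scored.sort()
    return [snippet for _, _, snippet in scored[:k]]

def build_prompt(query, snippets):
    """Place the retrieved knowledge ahead of the question."""
    lines = ["Knowledge:"] + [f"- {s}" for s in snippets] + [f"Question: {query}"]
    return "\n".join(lines)

# Hypothetical knowledge snippets written in plain language.
knowledge = [
    "every mammal is warm-blooded",
    "a whale is a mammal",
    "a contract requires an offer and acceptance",
]
query = "is a whale warm-blooded"
top = retrieve(query, knowledge)
prompt = build_prompt(query, top)
print(prompt)
```

The resulting prompt would then be passed to the LLM, so that generation is conditioned on both the input and the retrieved snippets.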
Chain-of-thought (CoT) prompting guides LLMs to generate intermediate reasoning steps (Wei et al., 2022). This process can be mathematically formulated by marginalizing over a latent reasoning variable $c$ (the chain of thought):
$$P(y \mid x) = \sum_{c} P(y \mid c, x)\, P(c \mid x)$$
Within a neural-symbolic context,
Proposed Framework
In this work, we propose a novel neural-symbolic integration framework, shown in Figure 1, designed to enhance the reasoning capabilities of LLMs by incorporating formal knowledge into their CoT process. Our approach consists of three main components: (1) a knowledge retrieval module, (2) a CoT generator conditioned on both the input and retrieved formal knowledge, and (3) a final answer generator that integrates the intermediate reasoning with the input context. The overall system is designed to enforce logical consistency through symbolic constraints.

Figure 1. Neural-Symbolic Integration (NSI) Framework.
Given an input x, our system first retrieves a set of relevant formal knowledge snippets
This framework operates in a sequential manner, where retrieved knowledge informs the generation of intermediate reasoning steps (chain-of-thought), which in turn guides the final output, ensuring logical consistency at each stage.
To construct the knowledge set K from training data and external sources, we propose a hybrid approach combining data-driven extraction with formal knowledge integration:
where
This hybrid approach ensures that K is comprehensive, accurate, and generalizable across reasoning tasks, and actively mitigates ambiguity or conflict in symbolic rules through a multi-faceted validation process involving logical consistency checks and human-curated external knowledge sources.
To incorporate formal knowledge into the reasoning process, we condition the generation of the CoT on both the input and the retrieved knowledge. The CoT probability is formulated as:
This formulation ensures that each intermediate step
To guarantee that the generated CoT adheres to the logical rules encoded in the formal knowledge, we introduce a symbolic constraint function
For example, if a retrieved knowledge rule states “All birds can fly” and the LLM generates a CoT that concludes “Penguins can fly,” the constraint function
This ensures that the output is not only plausible but also logically sound according to the formal knowledge.
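A constraint check of this kind might look like the sketch below. It rests on several assumptions that are not taken from the paper: the function returns 1 for a consistent chain and 0 otherwise, each reasoning step has already been parsed into a (subject, predicate, value) claim, and the knowledge base is a dictionary of known truth values. The facts about penguins and sparrows are hypothetical.

```python
def constraint(chain_claims, knowledge):
    """Return 1 if no claim contradicts the knowledge base, else 0."""
    for subject, predicate, value in chain_claims:
        expected = knowledge.get((subject, predicate))
        # Claims the knowledge base says nothing about are not penalized.
        if expected is not None and expected != value:
            return 0
    return 1

def filter_chains(chains, knowledge):
    """Keep only the candidate chains that satisfy the constraint."""
    return [chain for chain in chains if constraint(chain, knowledge) == 1]

# Hypothetical knowledge base of (subject, predicate) -> truth value.
knowledge = {("penguin", "can_fly"): False, ("sparrow", "can_fly"): True}

good_chain = [("sparrow", "can_fly", True)]
bad_chain = [("penguin", "is_bird", True), ("penguin", "can_fly", True)]

print(constraint(good_chain, knowledge))
print(constraint(bad_chain, knowledge))
print(filter_chains([good_chain, bad_chain], knowledge))
```

A binary check like this can reject candidate chains at inference time. A graded variant, for example one that counts violations, would be better suited to use as a training penalty.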
During training, we integrate the standard cross-entropy loss for the LLM with an additional symbolic loss term that penalizes violations of the formal constraints. The overall loss function is defined as:
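One form consistent with this description is a weighted sum of the two terms. The notation is assumed for illustration: $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss, $\mathcal{L}_{\text{sym}}$ the symbolic penalty, and $\lambda$ a non-negative weight. It should not be read as the exact definition used in the experiments.

```latex
\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda\, \mathcal{L}_{\text{sym}}, \qquad \lambda \ge 0
```

As a purely illustrative calculation, with $\mathcal{L}_{\text{CE}} = 1.20$, $\mathcal{L}_{\text{sym}} = 0.50$, and $\lambda = 0.1$, the total is $1.20 + 0.1 \times 0.50 = 1.25$. Setting $\lambda = 0$ recovers plain cross-entropy training.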
The inference procedure of our framework consists of the following steps:
The LLM generates a sequence of logical steps, conditioned by both the input and the retrieved formal knowledge, forming the explicit reasoning path.
The LLM synthesizes the final output based on the generated CoT and the original input context.
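The sequential structure of inference (retrieve, then reason, then answer) can be sketched as a function that chains three callables. The stage functions below are trivial stubs standing in for the retriever and the LLM calls, and all names are hypothetical. The constraint check is omitted to keep the sketch short.

```python
from typing import Callable, List, Tuple

Retriever = Callable[[str], List[str]]
CotGenerator = Callable[[str, List[str]], List[str]]
AnswerGenerator = Callable[[str, List[str]], str]

def run_inference(
    x: str,
    retrieve: Retriever,
    generate_cot: CotGenerator,
    generate_answer: AnswerGenerator,
) -> Tuple[List[str], List[str], str]:
    """Run the three stages in order and return every intermediate result."""
    knowledge = retrieve(x)              # stage 1: fetch formal knowledge
    steps = generate_cot(x, knowledge)   # stage 2: reason with that knowledge
    answer = generate_answer(x, steps)   # stage 3: synthesize the final output
    return knowledge, steps, answer

# Stub stages; a real system would call a retriever and an LLM here.
def toy_retrieve(x: str) -> List[str]:
    return ["mammal(x) -> warm_blooded(x)", "mammal(whale)"]

def toy_cot(x: str, knowledge: List[str]) -> List[str]:
    return [f"Using {k}" for k in knowledge]

def toy_answer(x: str, steps: List[str]) -> str:
    return "yes" if steps else "unknown"

knowledge, steps, answer = run_inference(
    "Is a whale warm-blooded?", toy_retrieve, toy_cot, toy_answer
)
print(answer)
```

Returning the retrieved knowledge and the intermediate steps alongside the answer keeps the reasoning path available for inspection.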
This framework leverages the complementary strengths of statistical learning and symbolic reasoning, providing a more robust and interpretable multi-step reasoning process.
Experimental Setup
In this section, we describe the datasets, baselines, evaluation metrics, and implementation details used to assess the effectiveness of our proposed neural-symbolic integration framework for reasoning tasks.
Datasets
We conduct experiments on several widely adopted benchmark datasets that target different aspects of reasoning:
For each dataset, we use the standard train-validation-test splits provided in the literature. Additionally, for ablation studies, we construct modified versions of these datasets by introducing controlled variations to assess the robustness of our method.
Baselines
To evaluate our framework, we compare it against several baselines:
Evaluation Metrics
We employ several evaluation metrics to comprehensively assess our method:
Learning rate:
Batch size: 16
Number of training epochs: 10–20 (depending on the dataset)
Fine-tuning is performed on NVIDIA GPUs with 16 GB memory.
The fine-tuned LLM, conditioned on both the input and the retrieved knowledge, produces a sequence of explicit reasoning steps in natural language, mimicking a step-by-step human thought process.
The LLM then synthesizes the ultimate conclusion or answer based on the completed chain-of-thought and the initial query. The solutions are developed iteratively through this reasoning pipeline, guided by formal knowledge.
For each dataset, the following experimental steps are conducted:
Summary
This experimental setup is designed to rigorously evaluate the ability of our neural-symbolic integration framework to enhance multi-step reasoning in LLMs. By benchmarking on datasets such as GSM8K, CFQ, and SCAN, and comparing against strong baselines, we aim to demonstrate significant improvements in reasoning accuracy, consistency, and interpretability.
Results and Analysis
In this section, we present both quantitative and qualitative results of our proposed neural-symbolic integration framework, comparing it with related methods on typical reasoning datasets. Herein, we provide a detailed analysis of the performance and discuss the key factors that influence the results achieved by our model.
Role and Impact of Formal Knowledge K
The integration of formal knowledge K plays a critical role in augmenting LLMs with grounded, verifiable reasoning capacity. In our framework, K acts as both a filter and a guide: it constrains outputs to respect domain-specific logic and provides symbolic structure that complements the model's latent representations.
We demonstrate that the structure of K—whether it is handcrafted, mined from data, or learned through inductive logic programming—can significantly shape the model's behavior in downstream tasks.
Examples of K Used in Practice:
Biomedical domain:
These triples prevent unsafe medical recommendations by enforcing pharmacological constraints.
Legal reasoning:
Enables LLMs to generate legally sound conclusions in contract analysis.
Educational domain:
Ensures consistency in course planning and progression reasoning.
Commonsense reasoning:
Supports compositional generalization in tasks like SCAN or CFQ (Lake & Baroni, 2018).
These structured knowledge snippets serve as anchor points that help filter out illogical responses, correct spurious generalizations, and improve interpretability and factual reliability in complex reasoning tasks.
Reasoning Chain for “Is a Whale Warm-Blooded?”:
If a creature is a mammal, then by R1 it must be warm-blooded. We know from F1 that whales belong to the class of mammals. Therefore, whales inherit the property warm-blooded.
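The chain above can be reproduced mechanically with a small forward-chaining loop. The sketch assumes R1 is the rule mammal(x) → warm_blooded(x) and F1 is the fact mammal(whale), as the reasoning text implies. The tuple encoding and the function `forward_chain` are illustrative choices.

```python
def forward_chain(facts, rules):
    """Apply unary rules until no new fact appears; return facts and a trace."""
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for name, premise, conclusion in rules:
            # sorted() copies the set, so adding to `known` inside is safe.
            for predicate, entity in sorted(known):
                new_fact = (conclusion, entity)
                if predicate == premise and new_fact not in known:
                    known.add(new_fact)
                    trace.append(
                        f"{name}: {premise}({entity}) -> {conclusion}({entity})"
                    )
                    changed = True
    return known, trace

facts = {("mammal", "whale")}                   # F1
rules = [("R1", "mammal", "warm_blooded")]      # R1

known, trace = forward_chain(facts, rules)
print(("warm_blooded", "whale") in known)
print(trace)
```

The trace records which rule produced each derived fact, which is the kind of explicit justification that makes the reasoning chain verifiable.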
Quantitative Results
Table 1 summarizes the performance of our model compared to several baselines, including a Vanilla LLM (without neural-symbolic integration), a RAG approach, and a previously proposed neural-symbolic baseline. The metrics reported include overall accuracy, CoT consistency, and an explainability score (on a scale of 1 to 5), with average values and standard deviations over multiple runs.
Table 1. Performance Comparison on Benchmark Datasets.
Our proposed method achieves a significant improvement, with an accuracy of 79.6% on datasets such as GSM8K, outperforming the Vanilla LLM by approximately 11%. The chain-of-thought consistency and explainability scores also show notable enhancements, indicating a more coherent and interpretable reasoning process.
To assess the contribution of each component in our framework, we conducted ablation studies. Table 2 presents the performance when key modules are removed or modified.
Table 2. Ablation Study Results.
The results indicate that each component (formal knowledge integration, constraint enforcement, and knowledge retrieval) plays a critical role. For instance, removing formal knowledge integration reduces accuracy by 5.5%, demonstrating its importance in guiding the reasoning process. Removing or altering the symbolic loss disables or weakens the consistency checks, and the ablation uses this to demonstrate the critical role of that loss in performance.
In addition to the quantitative metrics, we analyzed several reasoning examples to assess the interpretability of the generated CoT. Our method produces detailed intermediate steps that align well with the formal rules retrieved from the knowledge base. For example, in a math word problem from GSM8K, the baseline LLM might directly output a numerical answer, while our model explicitly outlines steps like identifying quantities, applying relevant arithmetic operations, and verifying against known properties (e.g., non-negative results). In contrast to the baseline LLM, the reasoning chains from our model are more logically coherent and transparent, which facilitates human evaluation and trust. Due to space constraints, detailed examples and case studies will be provided in an appendix or within the supplementary materials of the codebase upon acceptance.
Factors Affecting Model Performance
Several factors have been identified that influence the performance of our proposed framework:
Discussion
Our proposed neural-symbolic integration framework, which augments LLMs with formal knowledge, has demonstrated promising improvements in multi-step reasoning tasks. By incorporating a knowledge retrieval module, a CoT generator conditioned on external formal knowledge, and symbolic constraint enforcement, our approach has improved answer accuracy, reasoning consistency, and interpretability compared to standard LLMs and retrieval-augmented baselines.
However, several limitations of the current framework have been identified:
Looking ahead, several avenues for future research emerge:
In summary, while our proposed framework marks a significant step toward integrating formal knowledge with neural models for enhanced reasoning, overcoming its current limitations will be key to realizing its full potential in real-world, high-stakes applications.
Conclusion
In this paper, we presented a novel neural-symbolic integration framework that enhances the reasoning capabilities of LLMs by incorporating formal knowledge into their CoT process. Our approach effectively leverages a combination of a knowledge retrieval module, a CoT generator conditioned on external symbolic information, and a constraint enforcement mechanism to ensure logical consistency. This integration not only improves the accuracy of multi-step reasoning tasks but also enhances the interpretability of the generated reasoning chains.
Our experimental results on benchmark datasets, including GSM8K, CFQ, and SCAN, demonstrate that the proposed method outperforms conventional LLMs and retrieval-augmented models. Ablation studies further validate the importance of integrating formal knowledge and enforcing symbolic constraints in guiding the reasoning process.
Despite these promising outcomes, our framework faces certain limitations, such as dependency on the quality of the external knowledge base, increased computational complexity, and sensitivity to hyperparameter settings. Future work will focus on dynamic knowledge updates, improved retrieval mechanisms, and more scalable integration techniques to address these challenges.
Overall, our work contributes to bridging the gap between data-driven learning and formal reasoning, paving the way for more robust, interpretable, and reliable AI systems in high-stakes applications.
Footnotes
Acknowledgements
This research was conducted under the research project QG.24.80 “Research on developing inference techniques for LLMs and their applications in the legal field” of Vietnam National University, Hanoi.
Author Contributions
Ngoc-Khuong Nguyen: Conceptualization, Methodology, Software, Writing—Original Draft.
Viet-Ha Nguyen: Data Curation, Validation, Visualization, Writing—Review & Editing.
Anh-Cuong Le: Supervision, Project Administration, Methodology, Resources, Writing—Review & Editing.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Vietnam National University, Hanoi (VNU) (grant number QG.24.80).
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
