Formal Knowledge Augmented Language Models for Explainable and Robust Reasoning

Abstract

Large Language Models (LLMs) excel at many tasks but often struggle with complex, multi-step reasoning, leading to inconsistencies and hallucinations. To address this, we propose a neural-symbolic integration framework that enhances LLM reasoning by incorporating formal knowledge (such as logical rules, ontologies, and knowledge graphs) into their chain-of-thought (CoT) process. Our approach retrieves and integrates symbolic information to guide logical inference, resulting in more accurate and interpretable outputs. Experiments on compositional reasoning benchmarks demonstrate significant improvements over standard LLM methods. This work highlights the potential of neural-symbolic integration for developing more reliable and explainable AI systems in high-stakes applications.

Keywords

neural-symbolic, formal knowledge, reasoning

1. Introduction

Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, driven by their ability to learn from vast amounts of data. However, despite their impressive language understanding capabilities, LLMs often struggle with complex, multi-step reasoning tasks (Cobbe et al., 2021; Wei et al., 2022). This can lead to issues such as inconsistent outputs, hallucinated information, and a lack of transparent decision-making processes (Wei et al., 2022). In high-stakes domains like healthcare, law, and finance, where reliability and explainability are paramount, these limitations present significant challenges (Cabrera et al., 2024; Cobbe et al., 2021).

One promising avenue to address these challenges is the integration of Formal Knowledge into LLMs (Alotaibi et al., 2024; Cobbe et al., 2021; Dhanraj & Eliasmith, 2025). Formal Knowledge refers to information that is represented in structured, symbolic forms—such as logical rules, ontologies, and knowledge graphs—which provide clear, unambiguous representations of domain-specific facts and relationships (Hogan et al., 2021). Such knowledge representations have long been used in classical AI systems to ensure logical consistency and offer clear justifications for decisions (Russell & Norvig, 2010). They can serve as a reliable foundation to support robust reasoning, mitigating some of the statistical shortcomings of purely data-driven approaches.

Despite the availability of rich formal knowledge sources, current LLMs predominantly rely on learned statistical patterns, which can be insufficient for tasks that require precise logical inference and multi-step reasoning (Wei et al., 2022). The absence of an internal mechanism to incorporate structured reasoning often results in outputs that are difficult to interpret and verify (Wei et al., 2022; Zhao et al., 2024). Therefore, bridging the gap between statistical learning and symbolic reasoning emerges as a critical research challenge (Alotaibi et al., 2024; Besold et al., 2017; Cabrera et al., 2024; Dhanraj & Eliasmith, 2025; Liang & Jordan, 2017; Xie et al., 2025). By integrating formal knowledge with LLMs, we aim to leverage the strengths of both approaches—harnessing the adaptability and language fluency of neural models while enforcing logical consistency and transparency through symbolic structures.

In this paper, we propose a novel neural-symbolic integration framework designed to enhance the reasoning capabilities of LLMs by incorporating formal knowledge into their chain-of-thought (CoT) process. Our approach involves a multi-stage reasoning pipeline in which the LLM interacts with an external formal knowledge base. The system retrieves relevant symbolic information via a retrieval-augmented mechanism and integrates it into the model's reasoning process, thereby grounding the inference in explicit, well-defined rules and relationships. Such integration not only reduces the likelihood of hallucinations but also improves the overall interpretability of the reasoning process.

Our main contributions can be summarized as follows:

Novel Framework: We introduce a neural-symbolic integration framework that seamlessly combines LLMs with formal symbolic reasoning components.

Hybrid Reasoning Mechanism: We develop a method that retrieves and incorporates structured knowledge—such as logical rules and ontologies—into the CoT reasoning process, enhancing logical inference and consistency.

Empirical Validation: We conduct extensive experiments on compositional reasoning benchmarks and domain-specific tasks, demonstrating that our approach outperforms conventional LLMs in terms of accuracy, consistency, and explainability.

Enhanced Interpretability: Our framework provides a clear, traceable chain of reasoning that offers insights into the decision-making process of the model, making it easier for users to understand and trust the output.

The remainder of this paper is organized as follows. Section 2 reviews related work in LLMs, formal knowledge representations, and neural-symbolic integration. Section 3 details our proposed framework, including the architecture and integration strategy. Section 4 describes our experimental setup and evaluation metrics, followed by a discussion of our results in Section 5. Finally, Section 6 concludes the paper and outlines directions for future research.

2. Background and Related Work

This section reviews foundational work in LLMs, formal knowledge representations, and neural-symbolic integration, with a focus on the mathematical formulations that underpin these approaches.

2.1 Large Language Models

Recent advancements in LLMs, particularly those based on the Transformer architecture, have led to significant improvements in natural language processing (Vaswani et al., 2017). Models such as GPT and BERT demonstrate remarkable fluency and generalization. However, they primarily rely on statistical correlations learned from large corpora, which often leads to the following issues (Wei et al., 2022; Zhao et al., 2023):

Hallucinations: Generating plausible yet factually incorrect outputs.

Lack of Logical Consistency: Difficulty in performing rigorous multi-step reasoning.

Opacity: Limited interpretability of the internal decision-making process.

The probability of generating a token sequence $y = \{y_{1}, y_{2}, \dots, y_{T}\}$ given an input $x$ is modeled as:

\begin{aligned} P (y | x) = \prod_{t = 1}^{T} P (y_{t} | x, y_{< t}) \end{aligned}

(1)

where each token probability is computed via a softmax function:

\begin{aligned} P (y_{t} | x, y_{< t}) = softmax (W_{o} h_{t}) \end{aligned}

(2)

and the hidden state $h_{t}$ is computed using self-attention:

\begin{aligned} h_{t} = \sum_{i = 1}^{n} α_{t, i} v_{i} . \end{aligned}

(3)

The attention weights $α_{t, i}$ are defined as:

\begin{aligned} α_{t, i} = \frac{exp (\frac{q_{t} \cdot k_{i}}{\sqrt{d_{k}}})}{\sum_{j = 1}^{n} exp (\frac{q_{t} \cdot k_{j}}{\sqrt{d_{k}}})} \end{aligned}

(4)

with $q_{t}$, $k_{i}$, and $v_{i}$ representing the query, key, and value vectors respectively, and $d_{k}$ being the key dimensionality.
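As a minimal sketch of Eqs. (3) and (4) for a single query position, the following pure-Python example computes the attention weights and the resulting hidden state. The function names and the toy vectors are illustrative assumptions, not part of any model described in this paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Return (weights, hidden) for one query position, following Eqs. (3)-(4)."""
    d_k = len(query)
    # Scaled dot product q_t . k_i / sqrt(d_k) for every key
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    # h_t = sum_i alpha_{t,i} v_i, computed per output dimension
    hidden = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, hidden


# Toy vectors (hypothetical values, chosen only for illustration)
q_t = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
alpha, h_t = attention(q_t, keys, values)
```

The weights in `alpha` sum to one, and keys with a larger dot product with the query receive a larger share of the weight.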

To address their inherent limitations in complex logical reasoning, recent research has explored two main groups of techniques to improve LLM inference accuracy:

Model Training Techniques: These methods involve explicitly training or fine-tuning LLMs to enhance their reasoning capabilities. Approaches like Supervised Fine-Tuning (SFT) are used to train models on datasets of logical reasoning problems, enabling them to generate more accurate multi-step solutions. Reinforcement Learning (RL) techniques, such as Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) or specific policy-based methods, can further refine LLMs to align their reasoning with human preferences or predefined logical rules, minimizing errors and inconsistencies.

Prompting Techniques and Extensions of CoT: Beyond direct fine-tuning, several advanced prompting strategies have been developed to elicit better logical inference from LLMs without extensive model retraining. Chain-of-Thought (CoT) prompting (Wei et al., 2022) is a foundational technique in this category, guiding LLMs to generate intermediate reasoning steps before arriving at a final answer. Extensions of CoT include various methods to improve the robustness and accuracy of these chains:

Self-consistency (e.g., voting mechanisms): This involves generating multiple CoTs and taking a majority vote on the final answer, effectively reducing errors from a single flawed reasoning path (Wang et al., 2022).

Reflection: LLMs are prompted to review and critique their own generated CoT and final answer, identifying potential errors and iteratively refining their reasoning (Shinn et al., 2023).

Tree-of-Thought or Graph-of-Thought: These methods explore multiple reasoning paths or structures, moving beyond linear CoT to find optimal solutions in complex problem-solving scenarios (Alotaibi et al., 2024; Yao et al., 2023).

While these techniques have shown promise, they often still grapple with ensuring strict logical consistency, mitigating hallucinations, and providing transparent, verifiable reasoning, especially in high-stakes domains. Our proposed framework aims to bridge this gap by integrating formal knowledge directly into the reasoning process.

2.2 Formal Knowledge Representations

Formal Knowledge is typically represented in symbolic forms such as logical rules, ontologies, and knowledge graphs. For instance, a simple logical rule can be expressed as:

\begin{aligned} \forall x (ϕ (x) \Rightarrow ψ (x)) \end{aligned}

(5)

where $ϕ (x)$ and $ψ (x)$ are predicates over the domain of $x$.

Logical rules provide a formal mechanism for encoding constraints and inference rules. A common approach is the use of first-order logic (FOL), which allows for quantification over variables (Lloyd, 1984):

\begin{aligned} \forall x \, (Fever (x) \land Cough (x) \Rightarrow Flu (x)) \end{aligned}

(6)

This rule states that if a patient has both Fever and Cough, then they have Flu. In practice, such a pattern may hold only with a certain confidence level rather than strictly (Hogan et al., 2021). A second example is:

\begin{aligned} \forall x, y Parent (x, y) \Rightarrow Ancestor (x, y) \end{aligned}

(7)

This rule states that if x is a parent of y, then x is also an ancestor of y. More generally, knowledge bases can store sets of logical rules K that define valid inference patterns.

Knowledge graphs structure entities and their relationships as:

\begin{aligned} K = \{ (e_{i}, r_{i j}, e_{j}) \mid e_{i}, e_{j} \in E, \; r_{i j} \in R \} \end{aligned}

(8)

where $E$ is the set of entities and $R$ is the set of relations.
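To make the two representations concrete, here is a small sketch that stores a knowledge graph as a set of triples, as in Eq. (8), and applies the Parent-implies-Ancestor rule of Eq. (7). The entity names are hypothetical, and the transitivity rule in `close_ancestor` is an added assumption that does not appear in the text above.

```python
# Knowledge graph as a set of (head, relation, tail) triples, as in Eq. (8).
# Entity names are hypothetical.
triples = {
    ("alice", "parent", "bob"),
    ("bob", "parent", "carol"),
}


def apply_parent_rule(kg):
    """Eq. (7): Parent(x, y) => Ancestor(x, y)."""
    return {(h, "ancestor", t) for (h, r, t) in kg if r == "parent"}


def close_ancestor(kg):
    """Assumed extra rule (not in the text): Ancestor is transitive."""
    facts = set(kg) | apply_parent_rule(kg)
    changed = True
    while changed:
        changed = False
        # Snapshot the current ancestor pairs before adding new facts
        pairs = [(h, t) for (h, r, t) in facts if r == "ancestor"]
        for a, b in pairs:
            for c, d in pairs:
                if b == c and (a, "ancestor", d) not in facts:
                    facts.add((a, "ancestor", d))
                    changed = True
    return facts


derived = close_ancestor(triples)
```

With the rule from Eq. (7) alone, only direct parents become ancestors; the assumed transitivity rule is what links `alice` to `carol`.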

2.3 Neural-Symbolic Integration

Neural-symbolic approaches combine the learning capabilities of neural networks with the rigor of symbolic reasoning (Dhanraj & Eliasmith, 2025; Manhaeve et al., 2018). This field aims to harness the pattern recognition strengths of neural networks with the explainability and logical consistency of symbolic AI. Key approaches in this direction include:

Neural-Symbolic Machines (NSM): Liang & Jordan (2017) focus on learning semantic parsers by mapping natural language to logical forms, often leveraging weak supervision. NSMs aim to build interpretable models that can reason over knowledge bases.

Probabilistic Logic Programming (PLP) with Neural Components: DeepProbLog (Manhaeve et al., 2018) integrates neural networks into a probabilistic logic programming framework, allowing learning from data while maintaining logical interpretability and explicit reasoning paths.

Symbolic Knowledge Injection/Constraint Satisfaction: Recent efforts, particularly with LLMs, involve incorporating formal knowledge (e.g., logical rules, ontologies) to guide or constrain the model's output and reasoning process (Dhanraj & Eliasmith, 2025). These methods often aim to prevent illogical outputs and ensure factual correctness. For instance, Alotaibi et al. (2024) explore using knowledge graphs with symbolic logic to enhance LLM reasoning, and Cabrera et al. (2024) investigate finite-state machines for improving LLM reasoning and planning. Similarly, Xie et al. (2025) propose rule-based reinforcement learning to unleash logical reasoning in LLMs.

One method to achieve this is to modify the LLM's output by incorporating symbolic constraints (Dhanraj & Eliasmith, 2025). Suppose the LLM generates a CoT $z = \{z_{1}, z_{2}, \dots, z_{M}\}$ as intermediate reasoning steps. The final output probability becomes:

\begin{aligned} P (y | x) = \sum_{z} P (y, z | x) \end{aligned}

(9)

Alternatively, a symbolic loss term can be incorporated into the training objective:

\begin{aligned} L = L_{LLM} (y, y^{*}) + λ L_{symbolic} (C (z, K)) \end{aligned}

(10)

where $L_{LLM}$ is the standard cross-entropy loss, $L_{symbolic}$ penalizes violations of the symbolic constraints, and $λ$ is a weighting factor.

2.4 Retrieval-Augmented Generation (RAG)

In retrieval-augmented frameworks, the LLM is conditioned on the input $x$ and a set of relevant knowledge snippets $K (x) = \{k_{1}, k_{2}, \dots, k_{R}\}$ retrieved from a knowledge base (Gao et al., 2023). The conditional probability is modeled as:

\begin{aligned} P (y | x, K (x)) = \prod_{t = 1}^{T} P (y_{t} | x, y_{< t}, K (x)) \end{aligned}

(11)

This allows the model to integrate external formal knowledge during generation, reducing hallucinations and enhancing factual accuracy (Gao et al., 2023).

2.5 Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting guides LLMs to generate intermediate reasoning steps (Wei et al., 2022). This process can be mathematically formulated by marginalizing over a latent reasoning variable $z$ :

\begin{aligned} P (y | x) = \sum_{z} P (y | z, x) P (z | x) \end{aligned}

(12)

Within a neural-symbolic context, $P (z | x)$ is further constrained by symbolic rules derived from formal knowledge K, ensuring that the generated CoT is both logically sound and interpretable.
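The marginalization in Eq. (12) can be worked through numerically. The sketch below uses hypothetical toy distributions over two candidate reasoning chains. The `constrain` function shows one assumed way of restricting $P (z | x)$ with symbolic rules, namely discarding inconsistent chains and renormalizing; the text does not specify the exact mechanism.

```python
# Hypothetical toy distributions for a single input x.
p_z_given_x = {"z_valid": 0.6, "z_flawed": 0.4}  # P(z | x)
p_y_given_zx = {  # P(y | z, x)
    "z_valid": {"yes": 0.9, "no": 0.1},
    "z_flawed": {"yes": 0.2, "no": 0.8},
}


def marginal(p_z, p_y_given_z):
    """Eq. (12): P(y | x) = sum_z P(y | z, x) P(z | x)."""
    out = {}
    for z, pz in p_z.items():
        for y, py in p_y_given_z[z].items():
            out[y] = out.get(y, 0.0) + py * pz
    return out


def constrain(p_z, is_consistent):
    """Assumed mechanism: drop chains that violate K, then renormalize."""
    kept = {z: p for z, p in p_z.items() if is_consistent(z)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no reasoning chain satisfies the constraints")
    return {z: p / total for z, p in kept.items()}


unconstrained = marginal(p_z_given_x, p_y_given_zx)
constrained = marginal(
    constrain(p_z_given_x, lambda z: z == "z_valid"), p_y_given_zx
)
```

Here the unconstrained marginal for "yes" is 0.9 × 0.6 + 0.2 × 0.4 = 0.62, while removing the flawed chain raises it to 0.9.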

3. Our Proposed Model

In this work, we propose a novel neural-symbolic integration framework, shown in Figure 1, designed to enhance the reasoning capabilities of LLMs by incorporating formal knowledge into their CoT process. Our approach consists of three main components: (1) a knowledge retrieval module, (2) a CoT generator conditioned on both the input and retrieved formal knowledge, and (3) a final answer generator that integrates the intermediate reasoning with the input context. The overall system is designed to enforce logical consistency through symbolic constraints.

Figure 1. Neural-Symbolic Integration (NSI) Framework.

3.1 Overview of the Framework

Given an input $x$, our system first retrieves a set of relevant formal knowledge snippets $K (x) = \{k_{1}, k_{2}, \dots, k_{R}\}$ from an external knowledge base. This formal knowledge is then used to condition the CoT generation and guide the reasoning process. The model generates a latent CoT $z = \{z_{1}, z_{2}, \dots, z_{M}\}$ and, subsequently, a final output $y$. Mathematically, the overall probability of generating the final answer is modeled as:

\begin{aligned} P (y | x, K (x)) = \sum_{z} P (y | z, x, K (x)) P (z | x, K (x)) \end{aligned}

(13)

This framework operates in a sequential manner, where retrieved knowledge informs the generation of intermediate reasoning steps (chain-of-thought), which in turn guides the final output, ensuring logical consistency at each stage.

3.2 Generating the Knowledge Set K

To construct the knowledge set K from training data and external sources, we propose a hybrid approach combining data-driven extraction with formal knowledge integration:

Pattern Mining: Identify frequent and high-confidence logical patterns from structured and unstructured data using statistical methods such as association rule mining.

Inductive Logic Programming (ILP): Given positive and negative examples, learn logical rules that generalize over the observed data:

\begin{aligned} K = ILP (D^{+}, D^{-}) \end{aligned}

(14)

where $D^{+}$ and $D^{-}$ are sets of positive and negative examples respectively. These examples serve as ground truth for the ILP algorithm: $D^{+}$ consists of observations that are consistent with the desired logical rules, while $D^{-}$ comprises observations that violate these rules. For instance, if $D^{+}$ contains (has_fever, has_cough, is_flu) and $D^{-}$ contains (has_fever, no_cough, not_flu), ILP can learn the rule: has_fever(X) $\land$ has_cough(X) ⇒ is_flu(X).

Neural-Symbolic Integration: Use transformer-based models fine-tuned with logical constraints to validate and refine the generated knowledge snippets.

Crowdsourced and External Knowledge Augmentation: Incorporate human-validated rules and external knowledge bases (e.g., Wikidata, ConceptNet) to enrich K.

This hybrid approach is intended to make K comprehensive, accurate, and generalizable across reasoning tasks. It mitigates ambiguity or conflict in symbolic rules through a multi-faceted validation process involving logical consistency checks and human-curated external knowledge sources.

To mitigate the risk of accentuating model hallucination when K is enriched with external content, our framework employs a multi-layered validation strategy. First, the “Neural-Symbolic Integration” step validates and refines knowledge snippets using logical constraints, filtering out inconsistent or erroneous information. Second, by incorporating “Crowdsourced and External Knowledge Augmentation” with human-validated rules, we leverage verified information. Finally, the “Constraint Enforcement” mechanism in our proposed model filters out any generated CoT that is not consistent with the formal knowledge, serving as a safeguard against hallucinations.
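The pattern mining step can be illustrated with a small support and confidence computation for the candidate rule has_fever ∧ has_cough ⇒ is_flu. This is a sketch of association-rule scoring only, not of ILP; the patient records and the thresholds are hypothetical.

```python
# Hypothetical observations, each a set of attributes for one patient.
records = [
    {"has_fever", "has_cough", "is_flu"},
    {"has_fever", "has_cough", "is_flu"},
    {"has_fever", "has_cough"},
    {"has_fever"},
    {"has_cough"},
]


def support(itemset, data):
    """Fraction of records that contain every item in itemset."""
    return sum(1 for r in data if itemset <= r) / len(data)


def confidence(antecedent, consequent, data):
    """support(antecedent and consequent) / support(antecedent)."""
    sup_a = support(antecedent, data)
    if sup_a == 0:
        return 0.0
    return support(antecedent | consequent, data) / sup_a


def keep_rule(antecedent, consequent, data, min_support=0.3, min_confidence=0.6):
    """Keep a candidate rule only if it is frequent and high-confidence.

    The threshold values are illustrative assumptions.
    """
    return (
        support(antecedent | consequent, data) >= min_support
        and confidence(antecedent, consequent, data) >= min_confidence
    )


conf = confidence({"has_fever", "has_cough"}, {"is_flu"}, records)
accepted = keep_rule({"has_fever", "has_cough"}, {"is_flu"}, records)
```

In this toy data, three of five records contain both symptoms and two of those also contain is_flu, so the rule has confidence 2/3 and passes the assumed thresholds.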

3.3 Chain-of-Thought Generation with Formal Knowledge

To incorporate formal knowledge into the reasoning process, we condition the generation of the CoT on both the input and the retrieved knowledge. The CoT probability is formulated as:

\begin{aligned} P (z | x, K (x)) = \prod_{i = 1}^{M} P (z_{i} | z_{< i}, x, K (x)) . \end{aligned}

(15)

This formulation ensures that each intermediate step $z_{i}$ is generated while taking into account the entire context provided by the input x and the relevant formal knowledge $K (x)$ .

3.4 Enforcing Consistency via Symbolic Constraints

To guarantee that the generated CoT adheres to the logical rules encoded in the formal knowledge, we introduce a symbolic constraint function $C (z, K)$ defined as:

\begin{aligned} C (z, K) = \begin{cases} 1, & \text{if } z \text{ is consistent with } K, \\ 0, & \text{otherwise} . \end{cases} \end{aligned}

(16)

For example, if a retrieved knowledge rule states “All birds can fly” and the LLM generates a CoT that concludes “Penguins, which are birds, cannot fly,” the constraint function $C (z, K)$ would evaluate to 0, deeming the reasoning inconsistent with K. Conversely, if the rule is “All birds have feathers” and the LLM concludes “Sparrows have feathers,” $C (z, K)$ would evaluate to 1. The final answer is then selected from those outputs whose corresponding reasoning chains satisfy the symbolic constraint. In other words, we refine the prediction as:

\begin{aligned} y^{*} = \arg\max_{y \, : \, C (z, K) = 1} P (y | x, K (x)) . \end{aligned}

(17)

This ensures that the output is not only plausible but also logically sound according to the formal knowledge.
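A minimal sketch of Eqs. (16) and (17) follows. It makes a strong simplifying assumption: both K and each reasoning step are reduced to ground claims of the form (subject, predicate, truth value), so consistency is just the absence of a direct contradiction. A real checker would need to handle quantified rules; all names and probabilities here are hypothetical.

```python
def C(z, K):
    """Eq. (16) under a simplifying assumption.

    K maps (subject, predicate) to a truth value; z is a list of
    (subject, predicate, value) claims. Returns 1 if no claim in z
    contradicts K, else 0.
    """
    for subject, predicate, value in z:
        known = K.get((subject, predicate))
        if known is not None and known != value:
            return 0
    return 1


def select_answer(candidates, K):
    """Eq. (17): highest-probability answer among chains with C(z, K) = 1.

    candidates is a list of (answer, chain, probability) tuples.
    Returns None when every chain violates K.
    """
    valid = [(y, p) for (y, z, p) in candidates if C(z, K) == 1]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[1])[0]


# Hypothetical knowledge and candidate outputs
K = {("sparrow", "has_feathers"): True}
candidates = [
    ("Sparrows have no feathers", [("sparrow", "has_feathers", False)], 0.7),
    ("Sparrows have feathers", [("sparrow", "has_feathers", True)], 0.3),
]
best = select_answer(candidates, K)
```

Note that the constrained selection can return a lower-probability answer: the first candidate scores higher under the model but is rejected because its chain contradicts K.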

3.5 Training Objective

During training, we integrate the standard cross-entropy loss for the LLM with an additional symbolic loss term that penalizes violations of the formal constraints. The overall loss function is defined as:

\begin{aligned} L = L_{LLM} (y, y^{*}) + λ L_{symbolic} (C (z, K)), \end{aligned}

(18)

where:

$L_{LLM} (y, y^{*})$ is the standard cross-entropy loss between the predicted output and the ground truth,

$L_{symbolic} (C (z, K))$ penalizes inconsistencies in the CoT,

$λ$ is a hyperparameter controlling the trade-off between the two loss components.
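The combined objective in Eq. (18) can be sketched on plain numbers. The form of the symbolic penalty is an assumption: here it is simply $1 - C (z, K)$, so a consistent chain adds nothing and a violating chain adds λ. Because this penalty is binary, a gradient-based training setup would presumably need a differentiable surrogate, which the text does not specify.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[target_index])


def symbolic_penalty(c_value):
    """Assumed penalty: 0 when C(z, K) = 1, and 1 when C(z, K) = 0."""
    return 1.0 - c_value


def total_loss(probs, target_index, c_value, lam):
    """Eq. (18): L = L_LLM + lambda * L_symbolic."""
    return cross_entropy(probs, target_index) + lam * symbolic_penalty(c_value)


# Hypothetical predicted distribution over two answers, ground truth at index 0
loss_consistent = total_loss([0.5, 0.5], 0, c_value=1, lam=0.5)
loss_violating = total_loss([0.5, 0.5], 0, c_value=0, lam=0.5)
```

With identical predictions, the violating chain incurs exactly λ more loss than the consistent one, which is the trade-off the hyperparameter controls.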

3.6 Inference Process

The inference procedure of our framework consists of the following steps:

1. Knowledge Retrieval: Given input x, retrieve relevant knowledge $K (x)$ from an external knowledge base. This step employs a retrieval mechanism to fetch formal knowledge snippets pertinent to the given query or problem.

2. Chain-of-Thought Generation: Generate intermediate reasoning steps z with:

\begin{aligned} P (z | x, K (x)) = \prod_{i = 1}^{M} P (z_{i} | z_{< i}, x, K (x)) \end{aligned}

(19)

The LLM generates a sequence of logical steps, conditioned by both the input and the retrieved formal knowledge, forming the explicit reasoning path.

3. Constraint Enforcement: Apply the constraint function $C (z, K)$ to ensure the reasoning chain z is consistent with the formal knowledge. Only outputs with $C (z, K) = 1$ are considered valid. This acts as a post-generation validation step, ensuring that the model's reasoning adheres to predefined logical rules and constraints.

4. Answer Generation: Generate the final answer y by combining the CoT with the input context, modeled as $P (y | z, x, K (x))$. Marginalizing over the reasoning chains gives the overall answer probability, as in Equation (13):

\begin{aligned} P (y | x, K (x)) = \sum_{z} P (y | z, x, K (x)) P (z | x, K (x)) \end{aligned}

(20)

The LLM synthesizes the final output based on the generated CoT and the original input context.

This framework leverages the complementary strengths of statistical learning and symbolic reasoning, providing a more robust and interpretable multi-step reasoning process.
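The four inference steps above can be sketched as a small pipeline in which each stage is passed in as a function. Everything below the `run_pipeline` definition is a hard-coded stand-in: the knowledge base, the candidate chains, and the consistency check are hypothetical toys, and no actual LLM is called.

```python
def run_pipeline(x, retrieve, generate_chains, is_consistent, answer):
    """Retrieval, CoT generation, constraint enforcement, answer generation."""
    k_x = retrieve(x)  # step 1: K(x)
    chains = generate_chains(x, k_x)  # step 2: candidate chains z
    valid = [z for z in chains if is_consistent(z, k_x)]  # step 3: C(z, K) = 1
    if not valid:
        return None, None
    # Assumption: generate_chains returns chains ordered by model score,
    # so the first valid chain is used for the answer.
    z = valid[0]
    return answer(x, z, k_x), z  # step 4


# Toy stand-ins for each stage (all hypothetical)
KB = {"whale": ["R1: all mammals are warm-blooded", "F1: whales are mammals"]}


def toy_retrieve(x):
    return [s for key, snippets in KB.items() if key in x.lower() for s in snippets]


def toy_generate(x, k_x):
    return [
        ["whales are fish", "fish are cold-blooded"],
        ["whales are mammals", "mammals are warm-blooded"],
    ]


def toy_consistent(z, k_x):
    # Stand-in check: reject the one claim known to contradict F1
    return "whales are fish" not in z


def toy_answer(x, z, k_x):
    return "yes" if "mammals are warm-blooded" in z else "no"


y, chain = run_pipeline(
    "Is a whale warm-blooded?", toy_retrieve, toy_generate, toy_consistent, toy_answer
)
```

Passing the stages as functions keeps the control flow visible and makes it straightforward to swap in a different retriever or checker.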

4. Experiment

In this section, we describe the datasets, baselines, evaluation metrics, and implementation details used to assess the effectiveness of our proposed neural-symbolic integration framework for reasoning tasks.

4.1 Datasets

We conduct experiments on several widely adopted benchmark datasets that target different aspects of reasoning:

GSM8K: A dataset consisting of grade-school math word problems that require multi-step reasoning. Each problem typically includes a natural language description and a numerical answer. It contains about 8.5K problems in total, split into roughly 7.5K training problems and 1K test problems.

CFQ (Compositional Freebase Questions): This dataset is designed to evaluate compositional generalization in semantic parsing. It involves converting natural language questions into structured queries (e.g., SPARQL) and tests the ability to generalize to unseen combinations of known components. The dataset is split into standard train-validation-test splits provided in the literature, with a focus on out-of-distribution generalization.

SCAN: A synthetic dataset for testing compositional generalization in language tasks. SCAN challenges models to translate simple commands (e.g., “jump twice”) into action sequences, emphasizing the ability to handle novel compositions. It has a structured grammar, allowing for precise control over generalization patterns.

For each dataset, we use the standard train-validation-test splits provided in the literature. Additionally, for ablation studies, we construct modified versions of these datasets by introducing controlled variations to assess the robustness of our method.

4.2 Baselines

To evaluate our framework, we compare it against several baselines:

Vanilla LLM: A large language model (e.g., GPT-3 or GPT-4) with standard CoT prompting, without any explicit integration of formal knowledge.

Retrieval-Augmented LLM (RAG): A model that retrieves external knowledge from a large database but does not enforce symbolic constraints.

Existing Neural-Symbolic Methods: Recent approaches that integrate symbolic reasoning into neural models, where applicable.

4.3 Evaluation Metrics

We employ several evaluation metrics to comprehensively assess our method:

Accuracy: The percentage of correctly solved problems (e.g., correct final answers in GSM8K and CFQ). For numerical answers (e.g., GSM8K), exact match is required. For semantic parsing tasks (e.g., CFQ), correctness is verified against the gold standard logical form.

Logical Consistency: A measure of how often the generated CoT adheres to the formal knowledge constraints. This is quantified via a consistency score, calculated as the ratio of CoT outputs that satisfy the symbolic constraint function:

\begin{aligned} \text{Consistency Score} = \frac{\lvert \{ z : C (z, K) = 1 \} \rvert}{\text{total number of } z \text{ generated}} \end{aligned}

(21)

F1-Score: Used in tasks where partial matches are acceptable (e.g., in the semantic parsing tasks of CFQ; Keysers & Dohan, 2020). This metric gives partial credit when a prediction overlaps with the reference without matching it exactly.

Explainability: A qualitative evaluation, where expert annotators assess the interpretability and coherence of the reasoning chains generated by the model. Scores are assigned on a scale of 1 to 5, with higher scores indicating better explainability. The inter-rater agreement protocol will be detailed in future work.
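Two of the metrics above are simple enough to sketch directly. The consistency score follows Eq. (21). For F1, the granularity is not specified in the text, so the token-level overlap used here is an assumption, as are the example strings.

```python
from collections import Counter


def consistency_score(flags):
    """Eq. (21): share of generated chains with C(z, K) = 1.

    flags is a list of 0/1 values, one per generated chain.
    """
    if not flags:
        return 0.0
    return sum(flags) / len(flags)


def token_f1(prediction, reference):
    """Token-level F1 between two whitespace-tokenized strings (assumed granularity)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = consistency_score([1, 1, 0, 1])
partial = token_f1("a b c", "a b d")
```

In the second example two of three tokens match on each side, so precision, recall, and F1 are all 2/3.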

4.4 Implementation Details

Preprocessing: All textual inputs are preprocessed through tokenization and normalization. For datasets like CFQ and SCAN, additional parsing is performed to align natural language queries with their corresponding structured representations.

Model Training: Our framework is implemented by fine-tuning a pre-trained LLM with the additional symbolic constraint loss. The specific base LLM used for our experiments is a fine-tuned version of Llama 2 (7B parameters), chosen for its strong performance and accessibility. The training objective is:

\begin{aligned} L = L_{LLM} (y, y^{*}) + λ L_{symbolic} (C (z, K)), \end{aligned}

(22)

where $L_{LLM}$ is the cross-entropy loss between the predicted output $y$ and the ground truth $y^{*}$, and $L_{symbolic}$ penalizes violations of the symbolic constraints. We set $λ$ through hyperparameter tuning.

Hyperparameters:

Learning rate: $η = 1 \times 10^{- 5}$

Batch size: 16

Number of training epochs: 10–20 (depending on the dataset)

Fine-tuning is performed on NVIDIA GPUs with 16GB memory.

Inference: During inference, for each input x, the model performs:

1. Knowledge Retrieval: Retrieve the set $K (x)$ of relevant formal knowledge snippets. This module utilizes a similarity-based search (e.g., embedding similarity) to identify the most pertinent rules or facts from the pre-constructed knowledge base.

2. Chain-of-Thought Generation: Generate intermediate reasoning steps z with:

\begin{aligned} P (z | x, K (x)) = \prod_{i = 1}^{M} P (z_{i} | z_{< i}, x, K (x)) . \end{aligned}

(23)

The fine-tuned LLM, conditioned on both the input and the retrieved knowledge, produces a sequence of explicit reasoning steps in natural language, mimicking a step-by-step human thought process.

3. Answer Generation: Produce the final answer y as:

\begin{aligned} P (y | x, K (x)) = \sum_{z} P (y | z, x, K (x)) P (z | x, K (x)) . \end{aligned}

(24)

The LLM then synthesizes the ultimate conclusion or answer based on the completed chain-of-thought and the initial query. The solutions are developed iteratively through this reasoning pipeline, guided by formal knowledge.

4. Constraint Enforcement: Filter out predictions for which the symbolic constraint $C (z, K) \neq 1$ . This final step ensures that only logically consistent outputs, validated against the formal knowledge, are presented as results, reducing hallucinations and enhancing reliability.
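The similarity-based search in the Knowledge Retrieval step could look roughly like the following. This is a sketch under a loud assumption: a bag-of-words count vector stands in for a learned embedding, and the knowledge snippets are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, knowledge_base, top_r=2):
    """Return the top_r snippets ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda k: cosine(q, embed(k)), reverse=True)
    return ranked[:top_r]


# Hypothetical knowledge base
knowledge_base = [
    "all mammals are warm-blooded",
    "whales are mammals",
    "contracts with invalid signatures are void",
]
k_x = retrieve("are whales warm-blooded", knowledge_base)
```

The two snippets about mammals outrank the unrelated contract rule because they share more tokens with the query.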

4.5 Experimental Procedure

For each dataset, the following experimental steps are conducted:

1. Baseline Comparison: Evaluate the performance of our method and all baselines on the test set, reporting accuracy, consistency scores, and F1-scores.

2. Ablation Study: Perform ablations by removing the symbolic constraint loss and/or the retrieval component, to assess the contribution of each module.

3. Qualitative Analysis: Examine a set of generated chains of thought to provide insights into how formal knowledge improves the reasoning process.

4. Statistical Significance: Run experiments over multiple random seeds (e.g., 5 seeds) and report the mean and standard deviation of the performance metrics.
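Reporting over multiple seeds amounts to a mean and a standard deviation per metric. The sketch below uses the sample standard deviation (dividing by n − 1), which is an assumption since the text does not say which estimator is used; the accuracy values are hypothetical.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical accuracies from five random seeds
seed_accuracies = [78.0, 79.0, 80.0, 81.0, 82.0]
mean_acc, std_acc = summarize(seed_accuracies)
```

If the population estimator is preferred, `statistics.pstdev` can be used in place of `statistics.stdev`.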

4.6 Summary

This experimental setup is designed to rigorously evaluate the ability of our neural-symbolic integration framework to enhance multi-step reasoning in LLMs. By benchmarking on datasets such as GSM8K, CFQ, and SCAN, and comparing against strong baselines, we aim to demonstrate significant improvements in reasoning accuracy, consistency, and interpretability.

5. Results and Analysis

In this section, we present both quantitative and qualitative results of our proposed neural-symbolic integration framework, comparing it with related methods on typical reasoning datasets. Herein, we provide a detailed analysis of the performance and discuss the key factors that influence the results achieved by our model.

5.1 Role and Impact of Formal Knowledge K

The integration of formal knowledge K plays a critical role in augmenting LLMs with grounded, verifiable reasoning capacity. In our framework, K acts as both a filter and a guide: it constrains outputs to respect domain-specific logic and provides symbolic structure that complements the model's latent representations.

We demonstrate that the structure of K—whether it is handcrafted, mined from data, or learned through inductive logic programming—can significantly shape the model's behavior in downstream tasks.

5.2 Examples of K Used in Practice

Biomedical domain:

– (Aspirin, treats, Headache)

– (Paracetamol, contraindicated_with, LiverDisease)

These triples prevent unsafe medical recommendations by enforcing pharmacological constraints.

Legal reasoning:

– Contract(x) $\land$ InvalidSignature(x) → Void(x)

Enables LLMs to generate legally sound conclusions in contract analysis.

Educational domain:

– ∀x, EnrolledIn(x, c) → Attending(x, c)

– EnrolledIn(x, c) $\land$ HasPrerequisite(c, p) → Completed(x, p)

Ensures consistency in course planning and progression reasoning.

Commonsense reasoning:

– Bird(x) $\land$ ¬Fly(x) → Penguin(x)

– Cat → Animal → CanMove

Supports compositional generalization in tasks like SCAN or CFQ (Lake & Baroni, 2018).

These structured knowledge snippets serve as anchor points that help filter out illogical responses, correct spurious generalizations, and improve interpretability and factual reliability in complex reasoning tasks.

5.3 Reasoning Chain for “Is a Whale Warm-Blooded?”

Input question: Is a whale warm-blooded?

Retrieved knowledge K:

R1: All mammals are warm-blooded.

F1: Whales are mammals.

Intermediate reasoning chain z:

If a creature is a mammal, then by R1 it must be warm-blooded.

We know from F1 that whales belong to the class of mammals.

Therefore, whales inherit the property warm-blooded.

Logic constraints applied: The system verifies each step of z against formal rules in K. For example, it applies modus ponens: from “X is a mammal” and “all mammals are warm-blooded,” it infers “X is warm-blooded.”

Final answer y: Yes, whales are warm-blooded animals.
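The modus ponens check in this example can be sketched with a tiny forward-chaining routine. It assumes that rules are unary implications of the form P(x) ⇒ Q(x) and facts are (entity, property) pairs; the encoding and names are illustrative, not the system's actual representation.

```python
# R1: Mammal(x) => WarmBlooded(x), encoded as premise -> conclusion
rules = {"mammal": "warm_blooded"}
# F1: Mammal(whale)
facts = {("whale", "mammal")}


def forward_chain(known_facts, implications):
    """Apply modus ponens repeatedly until no new fact can be derived."""
    derived = set(known_facts)
    changed = True
    while changed:
        changed = False
        for entity, prop in list(derived):
            conclusion = implications.get(prop)
            if conclusion is not None and (entity, conclusion) not in derived:
                derived.add((entity, conclusion))
                changed = True
    return derived


def entails(entity, prop, known_facts, implications):
    """True if (entity, prop) follows from the facts and rules."""
    return (entity, prop) in forward_chain(known_facts, implications)


answer = entails("whale", "warm_blooded", facts, rules)
```

A step in z such as "whales are warm-blooded" would pass this check, while an unsupported claim such as "whales are cold-blooded" would not be derivable.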

5.4 Quantitative Results

Table 1 summarizes the performance of our model compared to several baselines, including a Vanilla LLM (without neural-symbolic integration), a RAG approach, and a previously proposed neural-symbolic baseline. The metrics reported include overall accuracy, CoT consistency, and an explainability score (on a scale of 1 to 5), with average values and standard deviations over multiple runs.

Table 1.
Performance Comparison on Benchmark Datasets.

Model	Accuracy (%)	CoT Consistency (%)	Explainability Score
Vanilla LLM	68.5 ± 2.3	55.2 ± 3.1	3.8/5
RAG	72.3 ± 1.8	60.1 ± 2.5	4.1/5
Neural-Symbolic Baseline	74.0 ± 2.0	62.5 ± 2.8	4.2/5
Proposed Method	79.6 ± 1.5	70.8 ± 2.0	4.5/5

Our proposed method achieves the highest accuracy, 79.6%, on datasets such as GSM8K, outperforming the Vanilla LLM by 11.1 percentage points and the neural-symbolic baseline by 5.6 percentage points. CoT consistency rises from 55.2% to 70.8% and the explainability score from 3.8 to 4.5 relative to the Vanilla LLM, indicating a more coherent and interpretable reasoning process.

5.5 Ablation Studies

To assess the contribution of each component in our framework, we conducted ablation studies. Table 2 presents the performance when key modules are removed or modified.

Table 2.
Ablation Study Results.

Configuration	Accuracy (%)	CoT Consistency (%)
Full Model (Proposed)	79.6	70.8
Without Formal Knowledge Integration	74.1	62.2
Without Constraint Enforcement	76.5	65.0
Without Knowledge Retrieval	73.2	60.5

The results indicate that each component (formal knowledge integration, constraint enforcement, and knowledge retrieval) plays a critical role. For instance, removing formal knowledge integration lowers accuracy by 5.5 percentage points, demonstrating its importance in guiding the reasoning process. Removing or altering the symbolic loss disables or weakens the consistency checks; without constraint enforcement, accuracy falls by 3.1 points and CoT consistency by 5.8 points. Removing knowledge retrieval causes the largest drop, 6.4 points in accuracy and 10.3 points in CoT consistency.
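The per-component contributions can be read directly off Table 2 by subtracting each ablated configuration from the full model. The short sketch below does that arithmetic; the dictionary layout and the `drop_from_full` helper are illustrative names, and the numbers are copied from Table 2.

```python
from typing import Dict

# Values copied from Table 2 (percent).
FULL: Dict[str, float] = {"accuracy": 79.6, "cot_consistency": 70.8}
ABLATIONS: Dict[str, Dict[str, float]] = {
    "Without Formal Knowledge Integration": {"accuracy": 74.1, "cot_consistency": 62.2},
    "Without Constraint Enforcement": {"accuracy": 76.5, "cot_consistency": 65.0},
    "Without Knowledge Retrieval": {"accuracy": 73.2, "cot_consistency": 60.5},
}


def drop_from_full(full: Dict[str, float], ablated: Dict[str, float]) -> Dict[str, float]:
    """Absolute drop in percentage points for each metric, rounded to one decimal."""
    return {metric: round(full[metric] - ablated[metric], 1) for metric in full}


if __name__ == "__main__":
    for name, scores in ABLATIONS.items():
        print(name, drop_from_full(FULL, scores))
```

These differences are absolute percentage points, not relative changes.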

5.6 Qualitative Analysis

In addition to the quantitative metrics, we analyzed several reasoning examples to assess the interpretability of the generated CoT. Our method produces detailed intermediate steps that align well with the formal rules retrieved from the knowledge base. For example, in a math word problem from GSM8K, the baseline LLM might directly output a numerical answer, while our model explicitly outlines steps like identifying quantities, applying relevant arithmetic operations, and verifying against known properties (e.g., non-negative results). In contrast to the baseline LLM, the reasoning chains from our model are more logically coherent and transparent, which facilitates human evaluation and trust. Due to space constraints, detailed examples and case studies will be provided in an appendix or within the supplementary materials of the codebase upon acceptance.

5.7 Factors Affecting Model Performance

Several factors influence the performance of our proposed framework:

Quality of the Formal Knowledge Base: The accuracy and completeness of the retrieved formal knowledge $K(x)$ are critical. Insufficient or outdated knowledge can adversely affect the reasoning chain.

Chain-of-Thought Length: There is a trade-off between the granularity of the intermediate steps and the computational cost. Longer chains tend to capture more detailed reasoning but require careful tuning to avoid unnecessary complexity.

Hyperparameter $λ$: The balance between the standard cross-entropy loss and the symbolic constraint loss is governed by the hyperparameter $λ$. Our experiments show that moderate values of $λ$ yield the best overall performance.

Retrieval Accuracy: The effectiveness of the knowledge retrieval module directly impacts the quality of the CoT. Enhancements in retrieval (e.g., using semantic search) can further boost the performance.
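To make the role of $λ$ concrete, the sketch below assumes the combined objective has the additive form L = L_CE + λ · L_sym, where L_CE is the cross-entropy on the target token and L_sym is a penalty in [0, 1] measuring how strongly the symbolic constraints are violated. Both this form and the `symbolic_penalty` definition are assumptions made for illustration; this section does not specify the exact loss.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target class under `probs`."""
    return -math.log(probs[target])


def symbolic_penalty(violations: Sequence[float]) -> float:
    """Assumed penalty: mean violation degree over the checked constraints.

    Each entry is in [0, 1]: 0 means the constraint holds, 1 means it is
    fully violated. An empty list means no constraint was checked.
    """
    return sum(violations) / len(violations) if violations else 0.0


def total_loss(probs: Sequence[float], target: int,
               violations: Sequence[float], lam: float) -> float:
    """Assumed combined objective: L = L_CE + lam * L_sym."""
    return cross_entropy(probs, target) + lam * symbolic_penalty(violations)


if __name__ == "__main__":
    probs = [0.7, 0.2, 0.1]
    violations = [0.0, 1.0]  # one satisfied constraint, one violated
    for lam in (0.0, 0.5, 2.0):
        print(lam, total_loss(probs, 0, violations, lam))
```

With λ = 0 the objective reduces to plain cross-entropy, and as λ grows the penalty term dominates, which matches the trade-off between under-using and over-constraining with symbolic guidance.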

5.8 Discussion

Our proposed neural-symbolic integration framework, which augments LLMs with formal knowledge, has demonstrated promising improvements in multi-step reasoning tasks. By incorporating a knowledge retrieval module, a CoT generator conditioned on external formal knowledge, and symbolic constraint enforcement, our approach has improved answer accuracy, reasoning consistency, and interpretability compared to standard LLMs and retrieval-augmented baselines.

However, several limitations of the current framework have been identified:

Quality and Coverage of Formal Knowledge: The effectiveness of the proposed method is closely tied to the quality of the external knowledge base. Incomplete or outdated formal knowledge can lead to suboptimal reasoning chains, and may even propagate errors through the reasoning process. For instance, in a legal reasoning system, missing a crucial statute could lead to incorrect legal advice, even if the LLM's language generation is fluent.

Scalability and Computational Overhead: Integrating symbolic constraints and performing knowledge retrieval for every input increases computational complexity. This may affect scalability, especially in large-scale applications or when the CoT is very long. Deploying such systems in real-time, high-throughput environments such as financial fraud detection could face latency issues.

Hyperparameter Sensitivity: Balancing the loss components between the standard cross-entropy and the symbolic constraint (governed by the hyperparameter $λ$) is critical. Improper tuning can either over-constrain the model or fail to leverage the symbolic guidance effectively.

Retrieval Accuracy: The retrieval module's ability to fetch relevant formal knowledge directly impacts the quality of the reasoning chain. Inaccuracies in retrieval can result in a reasoning process that deviates from the intended logical framework.

Interpretability vs. Flexibility Trade-Off: While symbolic constraints enhance interpretability by ensuring logical consistency, they may limit the model's flexibility to generate creative or non-standard solutions when necessary. In exploratory scientific research, for example, a purely logical path might overlook novel, less obvious connections.

Looking ahead, several avenues for future research emerge:

Dynamic Knowledge Updates: Future work should explore methods for continuously updating the formal knowledge base to reflect new information and evolving domain knowledge. This dynamic approach would help maintain the relevance and accuracy of the retrieved knowledge.

Improved Retrieval Mechanisms: Enhancing the retrieval component using advanced semantic search or embedding-based retrieval methods could improve the relevance of the formal knowledge snippets, thereby strengthening the overall reasoning process.

Scalable Neural-Symbolic Integration: Research should focus on reducing the computational overhead associated with symbolic constraint enforcement, perhaps through more efficient approximation methods or modular integration techniques that allow for parallel processing. This could also include incorporating data provenance mechanisms for better explainability and traceability of reasoning chains.

Extended Modalities and Multi-task Learning: Extending the framework to incorporate multi-modal inputs (e.g., combining text with images or structured data) and applying it to various reasoning tasks across different domains could demonstrate its broader applicability.

Adaptive Constraint Tuning: Developing techniques for adaptive tuning of the balancing hyperparameter $λ$ during training, possibly through meta-learning approaches, may lead to a more robust integration of formal knowledge without sacrificing the model's flexibility.
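As a rough illustration of the similarity-based retrieval mentioned under Improved Retrieval Mechanisms, the sketch below ranks knowledge snippets by cosine similarity to the query. To stay self-contained it uses bag-of-words count vectors as a stand-in for learned embeddings; the `embed`, `cosine`, and `retrieve` functions are hypothetical and do not describe the retrieval module used in our experiments.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)  # missing tokens count as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, snippets: List[str], k: int = 2) -> List[str]:
    """Return the k snippets most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(snippets, key=lambda s: cosine(query_vec, embed(s)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    knowledge = [
        "all mammals are warm-blooded",
        "whales are mammals",
        "a course has a prerequisite",
    ]
    print(retrieve("are whales mammals", knowledge))
```

Replacing `embed` with a dense sentence encoder would leave `retrieve` unchanged, which is why the ranking step is kept separate from the representation.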

In summary, while our proposed framework marks a significant step toward integrating formal knowledge with neural models for enhanced reasoning, overcoming its current limitations will be key to realizing its full potential in real-world, high-stakes applications.

6. Conclusion

In this paper, we presented a neural-symbolic integration framework that enhances the reasoning capabilities of LLMs by incorporating formal knowledge into their CoT process. Our approach combines a knowledge retrieval module, a CoT generator conditioned on external symbolic information, and a constraint enforcement mechanism to ensure logical consistency. This integration not only improves the accuracy of multi-step reasoning tasks but also enhances the interpretability of the generated reasoning chains.

Our experimental results on benchmark datasets, including GSM8K, CFQ, and SCAN, demonstrate that the proposed method outperforms conventional LLMs and retrieval-augmented models. Ablation studies further validate the importance of integrating formal knowledge and enforcing symbolic constraints in guiding the reasoning process.

Despite these promising outcomes, our framework faces certain limitations, such as dependency on the quality of the external knowledge base, increased computational complexity, and sensitivity to hyperparameter settings. Future work will focus on dynamic knowledge updates, improved retrieval mechanisms, and more scalable integration techniques to address these challenges.

Overall, our work contributes to bridging the gap between data-driven learning and formal reasoning, paving the way for more robust, interpretable, and reliable AI systems in high-stakes applications.

Footnotes

Acknowledgements

This research was conducted under the research project QG.24.80 “Research on developing inference techniques for LLMs and their applications in the legal field” of Vietnam National University, Hanoi.

ORCID iDs

Ngoc-Khuong Nguyen

Anh-Cuong Le

Author Contributions

Ngoc-Khuong Nguyen: Conceptualization, Methodology, Software, Writing—Original Draft.

Viet-Ha Nguyen: Data Curation, Validation, Visualization, Writing—Review & Editing.

Anh-Cuong Le: Supervision, Project Administration, Methodology, Resources, Writing—Review & Editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Vietnam National University, Hanoi (VNU) (grant number QG.24.80).

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets analyzed during the current study are publicly available:

GSM8K is available at .

CFQ is available at .

SCAN is available at .

Source code for the proposed framework is available at

References

Alotaibi, Kulkarni, & Zhou (2024). Graph of logic: Enhancing LLM reasoning with graphs and symbolic logic. In IEEE international conference on big data, BigData 2024, Washington, DC, USA.

Besold, Garcez, Bader, Bowman, Domingos, Hitzler, Kühnberger, Lamb, Lowd, Lima, Penning, Pinkas, Poon, & Zaverucha (2017). Neural-symbolic learning and reasoning: A survey and interpretation. ArXiv, abs/1711.03902.

Cabrera, Barros, & Costa (2024). Improving LLMs’ reasoning and planning with finite-state machines. In Intelligent systems - 34th Brazilian conference, BRACIS 2024, Belém Do Pará, Brazil, November 17–21, 2024, Proceedings, Part II. 15413 (pp. 110–124).

Cobbe, Kosaraju, Bavarian, Chen, Jun, Kaiser, Plappert, Tworek, Hilton, Nakano, & Hesse (2021). Training verifiers to solve math word problems. ArXiv preprint arXiv:2110.14168.

Dhanraj & Eliasmith (2025). Improving rule-based reasoning in LLMs via neurosymbolic representations. CoRR, abs/2502.01657.

Gao, Xiong, Gao, Jia, Pan, Dai, Sun, Guo, & Wang (2023). Retrieval-augmented generation for large language models: A survey. ArXiv, abs/2312.10997.

Hogan, Blomqvist, Cochez, D’amato, Melo, Gutierrez, Kirrane, Gayo, Navigli, Neumaier, Ngomo, Polleres, Rashid, Rula, Schmelzeisen, Sequeda, Staab, & Zimmermann (2021). Knowledge graphs. ACM Computing Surveys, 54(4), 1–37. https://doi.org/10.1145/3447772

Keysers & Dohan (2020). CFQ: A dataset for compositional generalization in semantic parsing. In ICLR workshop.

Lake, B. M., & Baroni (2018). Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on learning representations (ICLR).

Liang & Jordan, M. I. (2017). Neural-symbolic machines: Learning semantic parsers on freebase with weak supervision. In International conference on learning representations (ICLR).

Lloyd (1984). Foundations of logic programming: Symbolic computation. Springer-Verlag.

Manhaeve, Dumancic, Kimmig, Demeester, & De Raedt (2018). DeepProbLog: Neural probabilistic logic programming. In Proceedings of the international joint conference on artificial intelligence (IJCAI).

Ouyang, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, & Lowe (2022). Training language models to follow instructions with human feedback. In Proceedings of the 36th international conference on neural information processing systems.

Russell & Norvig (2010). Artificial Intelligence: A Modern Approach. Prentice Hall.

Shinn, Cassano, Gopinath, Narasimhan, & Yao (2023). Reflexion: Language agents with verbal reinforcement learning. In Proceedings of the 37th international conference on neural information processing systems.

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, A. N., Kaiser, Ł., & Polosukhin (2017). Attention is all you need. In Advances in neural information processing systems (NeurIPS) (pp. 5998–6008).

Wang, Wei, Schuurmans, Chi, & Zhou (2022). Self-consistency improves chain of thought reasoning in language models. ArXiv, abs/2203.11171.

Wei, Schuurmans, Bosma, Ichter, Xia, Chi, & Zhou (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in neural information processing systems (NeurIPS).

Xie, Gao, Ren, Luo, Hong, Dai, Zhou, Qiu, & Luo (2025). Logic-RL: Unleashing LLM reasoning with rule-based reinforcement learning. CoRR, abs/2502.14768.

Yao, Zhao, Shafran, Griffiths, Cao, & Narasimhan (2023). Tree of thoughts: Deliberate problem solving with large language models. In Proceedings of the 37th international conference on neural information processing systems.

Zhao, Chen, Yang, Liu, Deng, Cai, Wang, & Yin (2024). Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol., 15(2), 1–38. https://doi.org/10.1145/3639372

Zhao, Zhou, Junyi, Tianyi, Wang, Hou, Min, Zhang, Dong, Yang, Chen, Jiang, Ren, Tang, Liu, & Wen (2023). A survey of large language models. ArXiv, abs/2303.18223.
