Abstract
Ontology integration plays a vital role in forming a unified knowledge base from existing knowledge. The integration process relies on ontology matching and ontology merging, and the cost of matching is dominated by the scale of the ontologies. Earlier researchers have used the Ontology (Meta) Matching (OMM) technique to generate an efficient set of ontology alignments, which is then used to match two ontologies. This process has limitations: it depends on domain experts to choose applicable similarity measures, it requires appropriate external resources (such as WordNet or UMLS) to generate efficient semantic similarity, and it needs many computations to generate alignments. Together, these limitations increase the computational complexity. To resolve this problem, researchers have employed a divide-and-conquer strategy that executes multiple matching tasks concurrently through parallel processing. There is still a need to reduce the computational complexity of integrating large-scale ontologies. This article therefore proposes a general and scalable framework to integrate large-scale ontologies. The proposed framework uses a nature-inspired algorithm, ant colony optimization, to balance and optimize the workload of the machines during parallel processing. A large language model (LLM) is used to generate efficient similarity results during the matching process, which results in a cohesive integrated ontology. The OAEI Anatomy track is used to assess the computational time of the framework. The experimental results show an improvement in computational time while satisfying the principles of ontology coherence and entity coverage.
Introduction
The semantic web and ontology community (Machado et al., 2020; Stoilos et al., 2018) looks forward to enriching ontologies to share a knowledge base (He et al., 2023a, 2023b; Norouzi et al., 2023) across intelligent systems. Such a shared knowledge base helps recommend the data required to make better decisions. The biomedical ontologies Human Phenotype Ontology (HP) (Köhler et al., 2017), Mammalian Phenotype Ontology (MP) (Cynthia et al.), and Human Disease Ontology (DOID) (Lynn Schriml et al., 2003) were developed for biomedical research purposes. HP provides a vocabulary for computational analysis of the human phenome and for diagnosing monogenic diseases. DOID contains a classification of human disease terminologies related to biomedical resources, human health, neurological disease, and neurological disorders. MP provides information about mammalian organisms that is used to describe morphological, physiological, and behavioral changes in mammals during testing.
Various ontology engineers have designed biomedical ontologies for related domains, which results in heterogeneous ontologies (Xue et al., 2020). Biomedical ontologies involve syntactic and semantic heterogeneity. In HP, most class labels are written in capital letters, whereas in MP, all class labels are written in lowercase letters. Syntactic heterogeneity of this kind occurs when similar terminologies are written in different cases. Figure 1 shows the presence of semantic heterogeneity in the HP and MP ontologies: HP names a class “Abnormality of corneal size,” whereas MP defines a class with a similar meaning as “abnormal cornea size.” It is essential to address such data heterogeneity issues to ensure the robust working of an intelligent system.

Semantic Heterogeneity Issue of HP Ontology and MP Ontology.
Researchers also use these biomedical ontologies in their work. Suppose that researchers in the biomedical domain want to investigate medicines for hearing abnormalities on the basis of genetic data as well as patient disease data. In this case, they cannot rely on either the HP or the DOID ontology alone, because the HP ontology provides only genetic information and the DOID ontology provides only disease data. A new ontology must be formed by integrating the knowledge bases of both ontologies to support better decisions.
The ontology integration process either forms a new integrated ontology or enriches an existing ontology. The integration process depends on ontology matching and ontology merging processes. The integration of ontologies involves merging similar entities by following the structural relationships of the input ontologies. In earlier research, the Entity-to-Entity Ontology Model (EEOM) (Ocker et al., 2022; Xue et al., 2021a, 2021b, 2021c) and the Graph-Based Ontology Model (GPOM) (Ochieng & Kyanda, 2018; Suleiman et al., 2024) approaches are used to integrate ontologies. In the EEOM approach, ontology alignments are used to identify similar entities via different types of mapping rules. These similar entities are merged to form the integrated ontology incrementally. The GPOM approach represents candidate ontologies as graph data structures. Traversal algorithms, such as breadth-first search (BFS) or depth-first search (DFS), are used to traverse the graphs. Different graph theory algorithms are used to determine a set of subgraphs that cover the maximum number of nodes. Afterward, the subgraphs are taken one by one to identify similar entities, with or without the use of a set of alignments. Once similar entities are found, the merging process starts and continues until all the subgraphs are covered. In this way, an integrated ontology is formed.
Whether the EEOM or the GPOM approach is used, ontology integration falls into three categories. In the first category, defined as OIT1, an integrated ontology is formed via ontology alignment (Osman et al., 2021a, 2021b; Portisch et al., 2022; Xue et al., 2021a, 2021b, 2021c; Zhaoming & Rong, 2022; Zhou et al., 2023; Zhu et al., 2022). The ontology matching process is used to generate ontology alignments, and the resulting set of alignments is used to integrate the ontologies. The second category, defined as OIT2, integrates ontologies by using a reference alignment provided by external resources. Some researchers have also relied on human domain experts to provide information about the reference alignment (Geng, 2023; Norouzi et al., 2023; Ochieng & Kyanda, 2018; Portisch et al., 2022; Xue et al., 2021a, 2021b, 2021c). In the third category, defined as OIT3, an integrated ontology is formed without the use of ontology alignments. In this category, the matching process is used to identify similar entities, and these similar entities are merged to form the integrated ontology.
OIT1 comprises a more exhaustive alignment process that requires more time and memory. To match and merge two ontologies with “n” number of entities may have
Earlier research has yielded remarkable results using the Ontology (Meta) Matching (OMM) technique to match heterogeneous ontologies. Section 3 describes the OMM technique in detail. The OMM technique may consume approximately 2/3 of the computational time while matching large-scale ontologies. To address this computational time issue, this work is proposed under the OIT3 category. The proposed system first divides the large-scale ontologies into small-scale ontologies via superclass and subclass structural relations. This divide strategy makes it possible to balance the heavy load of the matching process by harnessing the power of parallel computing. The load balancing process is optimized via the ant colony optimization (ACO) algorithm. The matching process uses a large language model (LLM) to avoid the lengthy OMM process. Recently, LLMs have also been used to match large-scale ontologies efficiently (He et al., 2023a; Zhang et al., 2024). The merge process runs in the background to merge similar entities using the similarity score returned by the LLM. Accordingly, this work proposes the design and development of a generic framework for integrating large-scale ontologies using an LLM.
The remainder of this article is organized as follows. Section 2 reviews related work to identify limitations in integrating ontologies. This section also reviews the existing methodologies that have used LLMs to match large-scale ontologies with significant results. Section 3 provides an overview of the OMM approach and its challenges. Section 4 presents the design of the proposed model. Section 5 presents the working methodology and the algorithms proposed to implement the model. Section 6 presents an evaluation of the proposed framework. Section 7 presents concluding remarks.
Integration of Ontologies and Related Work
Babalou and Konig-Ries (2020) provide a framework, CoMerger, which uses a structural similarity-based partitioning technique to divide large-scale ontologies into “n” blocks. The classes that are closest to each other in terms of structural similarity are retained in one block. In this way, “n” blocks are formed, each of which keeps relevant and closely related classes together. Afterward, a direct merging process is initiated to merge these blocks into a merged ontology. This work does not use similarity measures or functions, which reduces the computational time. The authors also suggest exploiting parallel processing.
The authors (Patel & Jain, 2019) addressed computational time issues while integrating large-scale ontologies using a divide-and-conquer greedy algorithmic strategy and parallel processing. The input ontologies are treated as source ontologies and target ontologies on the basis of the degree of coupling and cohesion. The ontology that has maximum coupling and minimum cohesion among the concepts is taken as the source ontology. The other one is treated as the target ontology. The Lin measure agglomerative algorithm is applied to partition the source ontology into small clusters of concepts. These clusters are used to partition the target ontology into clusters. The cluster matching and matrix aggregation algorithms are applied to match and integrate the source and target ontologies via parallel computing. The pool of ontology matchers is used to generate efficient alignments. Ultimately, this work reduces the computational time, but it has to follow a critical process of the OMM. Future work should apply a workload balancing strategy to efficiently utilize the “n” number of processors involved in parallel processing.
Li et al. (2020) provided an ontology matching process using filtering and verification phases. During the filtering phase, typical entities that have high similarity scores are identified using syntactic and semantic similarity functions. Afterward, these typical entities are used to divide the large-scale ontologies into subontologies. A clustering approach is applied to form clusters or blocks of such typical entities. These blocks are again partitioned to form pairs of blocks. Each pair of blocks is used to form a pair of subontologies via extension methods. The verification phase matches the relevant subontologies. During the filtering phase, the irrelevant entities are excluded before matching, which reduces the computational time. However, the partitioning process changes the structure of the original ontologies so that subontologies are not matched properly, which results in a low recall rate. Babalou and Konig-Ries (2020) and Patel and Jain (2019) employed parallel computing, but there is still a need to apply and optimize a load balancing strategy during parallel computing. The proposed framework addresses this shortcoming.
Xue et al. (2023) noted that the running time for integrating large-scale ontologies is high because thousands of classes are matched with other classes. The nondominated sorting genetic algorithm (NSGA-II) is used to reduce the computational time. It integrates knowledge of different large-scale economics and finance domain ontologies to resolve heterogeneity issues. This work only evaluates the efficiency of the matching process and does not evaluate the integrated ontologies to measure their degree of consistency. Huang et al. (2022) used a multiobjective OMM technique combined with the particle swarm intelligence algorithm to find an optimized set of ontology alignments. The framework achieves better performance for small-scale ontologies. While matching large-scale ontologies, however, the framework tends to become trapped in local optima, and a diversity-enhancing strategy is used to address this issue. This work reviews guidelines to avoid premature convergence while applying a population-based nature-inspired algorithm. Huang et al. (2020), Lu and Xue (2020), Xue et al. (2021a, 2021b, 2021c), and Zhou et al. (2023) have matched large-scale ontologies using nature-inspired computing algorithms. These algorithms yield the best results in handling a large search space but do not consider the evaluation of computational time.
Suleiman et al. (2024) merged RDF graph ontologies iteratively in memory to form a new ontology. To identify similarity scores, different natural language processing (NLP) tools, such as BERT, SM-DTR, fuzzy string matching algorithms, and the WordNet and Word2Vec databases, have been used. This work efficiently matches and updates the nodes of the RDF graph ontology in memory during the merging process, which optimizes memory usage. It efficiently merges medium-scale ontologies with minimum computational time. However, there is a need to assess the computational time required to integrate large-scale RDF graph ontologies.
Xue and Zhang (2021) used the Unified Medical Language System (UMLS) to generate efficient semantic similarity, as WordNet is not suitable for biomedical ontologies. A similar issue is also observed in Xue et al. (2018, 2020), where the authors used UMLS. This indicates a limitation of the OMM technique: it depends on the choice of external resources used to generate efficient semantic matching. In the proposed system, this issue is resolved with the help of an LLM.
Osman et al. (2021b) provided a holistic approach for integrating large-scale ontologies via LogMap, a benchmark matching tool that provides reference alignments. This approach integrates multiple ontologies at once via simple merging and full merging, which avoids incremental pairwise merging of two concepts. This minimizes the computational time but is not suitable for real-time environments. Ocker et al. (2022) provided a semiautomatic framework to integrate small-scale production domain ontologies. It focuses on validating the syntactic, semantic, and structural inconsistency among the ontologies to be integrated. Overall, this work highlights the need for a general framework that integrates large-scale ontologies not only for the production domain but also for any other domain. He et al. (2023b) developed a novel framework for integrating small-scale energy storage system ontologies. This work addresses issues such as incoherence, ambiguity, and redundancy of integrated ontologies that are created for a variety of applications. Semantic-based ontology matching is used to identify a set of correspondences. The semantic heterogeneity is resolved via a pretrained and well-established universal sentence encoder (USE) language model. During the matching process, candidate entities are converted into embedding vectors using the USE and compared using cosine similarity to identify the semantic similarities. The authors also compared the manual efforts required by developers to align the ontologies, which indicates that manual alignment is a laborious and time-consuming task. Hence, it is essential to avoid user interventions during the ontology integration process. Hnatkowska et al. (2020) provided attribute-based integration of ontologies. This model relies on human experts to identify semantic similarities with the help of the WordNet database. This approach helps generate accurate semantic matching; however, the user interventions increase the computational time.
Large Language Model and Related Work
He et al. (2023a) tested the zero-shot performance of an LLM in matching large-scale ontologies of the OAEI Bio-ML track. The well-structured NCIT-DOID and SNOMED-FMA ontologies are used to test the performance of the LLM. The framework works in two rounds to perform concept-level and structural-level matching of ontologies. It provides input as a prompt to ChatGPT (GPT-4). The first round uses a binary classification method to identify overlapping concepts: it returns “1” for overlapping concepts and “0” otherwise. In the second round, the LLM is asked through prompts to identify hierarchical parent–child relationships among the concepts. In this way, a critical ontology matching task is performed very quickly with the help of the LLM. Compared with BERTMap, the framework with the LLM achieves outstanding performance. The authors suggest first designing a framework and then designing a prompt accordingly to generate efficient matching results. In this article, the working of their model is reviewed to design an algorithm that generates efficient calls to the LLM in line with the design of the proposed framework.
Earlier work employed ChatGPT-4 to match ontologies. This approach can generate error-prone responses while matching large-scale ontologies because some LLMs can handle only limited amounts of data. To address this issue, Giglou et al. (2024) provide a framework, LLMs4OM, that uses a dual-module strategy consisting of a retrieval phase and a matching phase. During the retrieval phase, the user sends either queries or prompts to the LLM to retrieve similar entities. Matching is subsequently performed by leveraging the power of the LLM on the basis of the prompt. In this work, zero-shot prompts were used to accomplish the matching task efficiently without the use of trained models. The authors perform a zero-shot performance comparison with seven (7) LLMs. They thereby provide an approach to choosing a specific LLM to integrate large-scale ontologies of any domain. The proposed framework builds on the LLM-based matching phase of that work.
Zhang et al. (2024) developed a multiagent dialog model that cognitively assists the framework through an LLM. It provides the reference alignment to initiate the ontology matching process without intervention by a domain expert. Thus, it reduces the number of decisions that human experts must make to form the reference alignment. Hertling and Paulheim (2023) used an LLM to perform large-scale ontology matching. This work uses an NLP approach to identify overlapping entities and generate ontology alignments. The matching is initiated via zero-shot and few-shot prompts to the LLM. Because this model still relies on ontology alignment, the computational time required to complete the ontology matching process increases.
Ontology (Meta) Matching Technique and its Challenges
The OMM technique uses similarity measures to identify heterogeneous entities. The similarity measures are also called ontology matchers. Appendix A specifies different similarity measures through Tables A1 to A5. The syntactic similarity measure is also called terminological similarity. It is used to determine the morphological similarity between entities. The semantic measure is also called a linguistic measure. It considers the synonym or hypernym relationship between two entities (Xue et al., 2021a) and calculates the semantic distance between two entities. The structural similarity measure is also called a context similarity measure. In this measure, the similarity between entities is calculated by comparing the structural information of two entities. Figure 2 shows the generic steps used in the OMM technique mentioned by Xue et al. (2021a).

Ontology (Meta) Matching Approach.
OMM is defined as a mathematical function
The recall, precision, and f-score metrics of the TF-IDF technique are used to assess the quality of the alignments. The f-score value is optimized via either a single-objective or multiobjective optimization function. In such cases, the ontology matching function is called the single objective (Huang et al., 2022; Verhodubs, 2020; Xingsi & Pan, 2018) or multiobjective (Zhou et al., 2023) OMM problem. The optimized f-score is used to choose highly capable alignments to identify similarity among the ontologies. In this way, OMM operations are accomplished, and then the merging process is initiated to integrate the ontologies.
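The three quality metrics mentioned above can be illustrated with a minimal sketch that uses their standard definitions over sets of correspondences. The function name `alignment_scores` and the toy class identifiers are hypothetical and are not taken from the framework; the sketch does not reproduce the optimization step.

```python
def alignment_scores(found, reference):
    """Return (precision, recall, f_score) for a set of correspondences.

    `found` is the alignment produced by a matcher and `reference` is the
    gold-standard alignment; both are collections of (source, target) pairs.
    """
    found, reference = set(found), set(reference)
    correct = len(found & reference)  # correspondences that are in both sets
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score


# Hypothetical toy alignments (identifiers are illustrative only).
reference = {("HP:1", "MP:1"), ("HP:2", "MP:2"), ("HP:3", "MP:3"), ("HP:4", "MP:4")}
found = {("HP:1", "MP:1"), ("HP:2", "MP:2"), ("HP:3", "MP:9")}
precision, recall, f_score = alignment_scores(found, reference)
```

Here two of the three correspondences found are correct and two of the four reference correspondences are recovered, so the precision is 2/3, the recall is 1/2, and the f-score is 4/7.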
The OMM technique faces certain challenges while handling the integration of large-scale ontologies. These are stated as follows:
It is critical to choose appropriate mathematical equations for the required syntactic, semantic, and structural similarity measures. Most of the time, they are chosen either on the basis of expert suggestions or on the basis of past experimental experience. Appendix A, Table A5 presents variations in the equation used to compute the weighted sum similarity measure.
The computational time increases when storing and computing the data of each similarity matrix.
Relying on domain experts to provide the reference alignment is a critical and time-consuming task when dealing with large-scale ontologies.
In the OMM technique, the majority of studies use the WordNet electronic database to identify the semantic similarity. This database might not be suitable for some domains, such as the biomedical domain (Xue & Zhang, 2021; Xue et al., 2018, 2020).
There is no hard-and-fast rule for setting the threshold value. Researchers either set it with the help of human domain experts or set it by choosing the maximum similarity score or average similarity score (Xue et al., 2021a, 2021b, 2021c).
This section presents the design of the proposed framework, which sets the baseline for examining its operation. The design has two steps: the preliminary setup of the proposed framework and the design of the mathematical model.
Preliminary Setup of the Proposed Framework
The parameters and their values are defined with reference to earlier work (Babalou & Konig-Ries, 2019; Negi & Malik, 2018; Osman et al., 2021a). Tables 1 and 2 present the bucket lists of parameters used to trigger the ontology matching and integration processes, respectively. Table 3 lists the parameters applicable to evaluating the integrated ontology.
Bucket List of Parameters for Ontology Matching.
Bucket List of Parameters for Ontology Integration.
List of Parameters for Evaluating the Integrated Ontology.
Figure 3 shows the mathematical model of the proposed framework. It is represented as a theoretical machine, M = (Q, Σ, q0, F, δ), where Q = {q0, q1, q2, q3, q4, q5, q6, q7, q8, q9} is a set of states, Σ is a finite set of input data, q0 is the start state of the machine, F is the final state of the machine, and δ: Q × Σ → Q is the transition function. The transition function δ shows the working of the machine on the basis of input parameters. Table 4 describes the operation of the machine via the transition function.

Theoretical Model of the Proposed Framework.
List of States of the Theoretical Model and the Delta Transition Function.
Phase I
The theoretical model starts at the start state q0, reads the input ontologies O1 and O2, and moves to state q1. State q1 preprocesses the input ontologies. The sets of preliminary parameters MC_SET and MR_SET are formed for the ontology matching and merging processes, respectively. The input ontologies are converted into RDF graph ontologies and stored as Onto_Graph (O1) and Onto_Graph (O2). Afterward, a preliminary analysis of the RDF graph ontologies is performed to identify the number of superclasses and subclasses of all the classes. Then, the machine moves to state q2 with MC_SET, MR_SET, Onto_Graph (O1), and Onto_Graph (O2). At state q2, Onto_Graph (O1) and Onto_Graph (O2) are divided into small-scale ontologies using superclass and subclass structural relations to minimize the large search space. This state forms the subontologies Sub_Onto (O1) and Sub_Onto (O2), which are used for further processing.
Phase II
At state q3, the load balancing strategy is applied through parallel processing to distribute the heavy matching load among the machines. At this stage, a function, Assignment (Cin, Cjm, Number of Machines), is initiated to assign the classes Cin and Cjm of Sub_Onto(O1) and Sub_Onto(O2) to the “n” machines. This helps to reduce the computational time. State q4, called Onto_Match, matches Cin and Cjm in a loop until all classes of Sub_Onto(O2) are matched with Sub_Onto (O1). If similarity between the classes Cin and Cjm is found, then state q4 sends the Optimistic_Similarity score to state q5. Otherwise, state q4 is repeated by calling the function GetNextClass() to match the next classes. To match these classes, state q4 sends requests via SetNextAssignment (Cin, Cjm, number of machines) to state q3. In this way, state q3 runs in the background to perform parallel processing. The Optimistic_Similarity score for each matching step is stored and referred to at state q5.
Phase III
At state q5, a merge operation is performed to merge the overlapping classes. Simultaneously, at state q6, the integration process starts to integrate the overlapped classes. The integrated ontology and the list of evaluation parameters are sent as inputs to state q6. This state is responsible for evaluating the integrated ontology via the list of evaluation parameters. The machine moves to q7 to evaluate the integrated ontology. Finally, the machine enters the final state q8 and provides an evaluated integrated ontology.
Proposed Framework Architecture
Working Methodology and Algorithms of the Proposed Framework
Figure 4 shows the proposed system architecture diagram. The proposed framework is implemented by considering the general workflow of the mathematical model mentioned in Section 4.2.1. The working methodology has three (3) stages. Stage I is the preprocessing stage, Stage II is ontology matching and load balancing, and Stage III is the ontology integration stage.

Proposed System Architecture Diagram.
Algorithm_1 lists the essential steps followed by the proposed framework to generate the integrated ontology. The preprocessing stage starts at Step No. 1 and ends at Step No. 8 of Algorithm_1. This module establishes the preliminary setup of the ontology matching process in line with Section 4.1, Table 1. The preprocessing stage reads two ontologies in Web Ontology Language (.owl) format. Afterward, the framework lets the user decide the sequence of ontologies to be merged. Suppose that the user provides the sequence (O1, O2); then, the O1 ontology is treated as the source ontology, and the O2 ontology is treated as the target ontology. This step resolves a limitation of earlier work (Patel & Jain, 2019), in which the framework itself chooses the source and target ontologies.
The proposed framework uses an RDF graph format to integrate large-scale ontologies. Therefore, the input ontologies are converted to RDF graphs and named RDF Graph1 and RDF Graph2. The source ontology and target ontology partitioning algorithms are written using superclass and subclass structural relations (Babalou & Konig-Ries, 2020). These are applied to divide large-scale source and target ontologies into small-scale ontologies and generate the RDF SubGraph1 and RDF SubGraph2. The partitioning strategy minimizes the large search space (Hnatkowska et al., 2020; Li et al., 2020; Xue & Zhang, 2021; Xue et al., 2021a, 2021b, 2021c; Zhu et al., 2022).
Algorithm_2 gives the pseudocode to partition the source ontology. Algorithm_3 gives the pseudocode to partition the target ontology.
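The idea of partitioning by superclass and subclass relations can be illustrated with a minimal sketch. It is not a reproduction of Algorithm_2 or Algorithm_3: it assumes a single-parent, cycle-free class hierarchy given as a plain dictionary, and the function name `partition_by_root` and the toy class names are hypothetical. Each top-level superclass and its descendants form one small-scale sub-ontology.

```python
from collections import defaultdict


def partition_by_root(parent_of):
    """Group classes into sub-ontologies, one per top-level superclass.

    `parent_of` maps each class to its direct superclass (None for a root).
    Assumption: every class has at most one superclass and there are no cycles.
    """
    def root_of(cls):
        # Walk up the subclass chain until a class without a superclass is found.
        while parent_of.get(cls) is not None:
            cls = parent_of[cls]
        return cls

    blocks = defaultdict(set)
    for cls in parent_of:
        blocks[root_of(cls)].add(cls)
    return dict(blocks)


# Hypothetical toy hierarchy (class names are illustrative only).
parent_of = {
    "eye": None, "cornea": "eye", "lens": "eye",
    "ear": None, "cochlea": "ear",
}
blocks = partition_by_root(parent_of)
```

With this toy hierarchy, the classes under "eye" and the classes under "ear" end up in two separate blocks, so each block can be matched independently of the other.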
The ontology matching and load balancing stage starts at Step No. 9 and ends at Step No. 11 of Algorithm_1. At this stage, RDF SubGraph1, RDF SubGraph2, and the candidate class are provided as inputs to the ontology matching module. The candidate class is compared with each class of RDF Graph1. The RDF SubGraph1 and RDF SubGraph2 are traversed via the DFS algorithm. The ontology matching module applies parallel processing through the load balancing module. This is mentioned at Step No. 10 in Algorithm_1.
The load balancing module is responsible for managing and optimizing the heavy matching load. It is implemented with the help of the ACO algorithm (Krishnaveni et al., 2019). At this stage, “n” machines/nodes are considered to distribute the heavy load of the matching process. During parallel processing, the matching tasks are assigned to machines according to their working probability. The working probability of each machine is determined from the current running load of that machine, the current maximum load among the “n” machines, and the pheromone level. At the initial stage, the same pheromone level is assigned to each machine. If all the machines have approximately similar loads, then equal working probabilities are assigned to them. Otherwise, the working probability is normalized to the scale of the working load, which supports a fair distribution of matching tasks according to the working ability of each machine. Afterward, the cumulative working probability of each machine is calculated, and the nodes are sorted accordingly. This ordering lets the most probable machines be chosen first for a matching task, while nodes that have low working probabilities still take part in the matching process. The working load of each machine to which a matching task is assigned is increased. In this way, the classes of RDF SubGraph1 are divided among the “n” machines. The execution of the matching task is then initiated to match the candidate class with the classes of RDF SubGraph1.
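The probability-based assignment described above can be sketched as follows. This is a minimal illustration, not the authors' Algorithm_4 or Algorithm_5: the weighting formula (pheromone multiplied by spare capacity relative to the current maximum load), the `tolerance` used to decide that loads are "approximately similar," and all function names are assumptions introduced here. Selection uses a roulette wheel over the cumulative probabilities.

```python
import random


def working_probabilities(loads, pheromones, tolerance=0.05):
    """Return one selection probability per machine.

    Assumed weighting (not from the source): pheromone times spare capacity,
    where spare capacity is measured against the current maximum load.
    Pheromone levels are assumed to be positive.
    """
    max_load = max(loads)
    if max_load - min(loads) <= tolerance * max(max_load, 1):
        # Loads are roughly equal: every machine is equally likely.
        return [1.0 / len(loads)] * len(loads)
    # Dividing by max_load + 1 keeps the busiest machine's weight above zero.
    weights = [p * (1.0 - load / (max_load + 1)) for load, p in zip(loads, pheromones)]
    total = sum(weights)
    return [w / total for w in weights]


def pick_machine(probabilities, rng):
    """Roulette-wheel selection over machines sorted by probability."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    draw = rng.random()
    cumulative = 0.0
    for i in order:
        cumulative += probabilities[i]
        if draw < cumulative:
            return i
    return order[-1]  # guard against floating-point rounding


def assign_tasks(tasks, loads, pheromones, rng):
    """Assign each task to a machine and increase that machine's load."""
    loads = list(loads)
    assignment = {}
    for task in tasks:
        machine = pick_machine(working_probabilities(loads, pheromones), rng)
        assignment[task] = machine
        loads[machine] += 1
    return assignment, loads


rng = random.Random(7)
probs = working_probabilities([2, 8, 5], [1.0, 1.0, 1.0])
assignment, final_loads = assign_tasks(["c1", "c2", "c3", "c4"], [2, 8, 5], [1.0, 1.0, 1.0], rng)
```

With loads of 2, 8, and 5 and equal pheromone levels, the lightly loaded first machine receives the highest probability (7/12), while the busiest machine keeps a small nonzero probability (1/12) and can therefore still be selected.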
The pseudocode mentioned in Algorithm_4 describes the operation of the load balancing module. The ACO instance is created at line number 11 of Algorithm_4. The pseudocode of ACO is presented in Algorithm_5.
Algorithm_51 uses the pseudocode to assign matching tasks to different machines via their working probabilities. At line no. 3 of Algorithm_51, the pseudocode for computing the working probability of machines is invoked. It is mentioned in Algorithm_511. Algorithm_52 uses the pseudocode to execute the matching task by internally calling the API calls to LLM and generate the similarity score.
Algorithm_1 proceeds to Step No. 11, where the machines initiate API calls to the LLM through a prompt construction process. The LLM identifies the semantic similarity and sends the answer as a “true” or “false” or “a subclass of Class 1.” This result is treated as a similarity score. If the similarity score is “true,” then the candidate class is merged with a class of RDF SubGraph1. Otherwise, if the answer is supposed to be “a subclass of Class 1,” then the candidate class is merged, and its subclasses are added to RDF SubGraph1 according to the structural relation. The RDF SunGraph2 is traversed via a bottom-to-the-up approach to move to the superclass of the candidate class. The superclass of the earlier candidate class is identified as the new candidate class and is involved in the matching process. The matching process is called recursively to match the newly identified candidate class with each class of RDF Graph1. This process is repeated until the end point of RDF SubGraph2 is reached. In this way, the matching process runs in parallel.
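The handling of the three possible LLM answers can be sketched as follows. The prompt wording, the function names, and the `fake_llm` stand-in are all hypothetical: the source does not give its prompt, and no real LLM API is called here. The sketch only shows how an answer could be mapped to a merge action for one candidate class.

```python
def build_prompt(candidate, target):
    """Build a prompt for one class pair (hypothetical wording)."""
    return (
        f"Class 1: {target}\nClass 2: {candidate}\n"
        "Do these classes denote the same concept? "
        "Answer 'true', 'false', or 'a subclass of Class 1'."
    )


def interpret_answer(answer):
    """Map the raw LLM answer to a merge action."""
    text = answer.strip().lower().strip(".\"'")
    if text == "true":
        return "merge"
    if "subclass of class 1" in text:
        return "merge_as_subclass"
    return "no_match"


def match_candidate(candidate, target_classes, ask_llm):
    """Return (target, action) for the first matching target class, else None.

    `ask_llm` is any callable that takes a prompt string and returns the answer.
    """
    for target in target_classes:
        action = interpret_answer(ask_llm(build_prompt(candidate, target)))
        if action != "no_match":
            return target, action
    return None


def fake_llm(prompt):
    # Stand-in for the real API call; it compares case-normalized labels only.
    lines = prompt.splitlines()
    target = lines[0].split(": ", 1)[1]
    candidate = lines[1].split(": ", 1)[1]
    return "true" if target.lower() == candidate.lower() else "false"


result = match_candidate("Cornea", ["lens", "cornea"], fake_llm)
```

Passing the LLM call in as a callable keeps the matching loop independent of any particular model or client library, so the stand-in can be replaced by a real API call without changing the loop.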
The framework stores the matching results that are returned from “n” machines and referred during the merging process. The merge process also starts at the same time to merge the overlapped classes of RDF SubGraph2 to RDF SubGraph1. The pseudocode for the above process is presented in Algorithm_521, Algorithm_6. At, line no. 5, of Algorithm_521, Algorithm_6 is invoked that generate API call to LLM. Algorithm_61 uses the pseudocode to call to LLM and generate the similarity score. In Stage III, the proposed framework follows Steps No. 12 to No. 14 of Algorithm_1 to generate the integrated ontology in an incremental way.
The pheromone levels of the machines are updated after the execution of Stage III. The proposed framework manages parallel processing so that the computing power of the machines is fully utilized. For this purpose, the total time each machine requires to complete its assigned task is computed. Machines that perform well respond in less time, and their pheromone level is increased. Poorly performing machines may respond slowly or may stop responding after some time. In such cases, the matching process slows down, which increases the computational time. Therefore, it is essential to identify such machines to improve the running time of the matching process. To this end, the pheromone level of a poorly performing machine is decreased gradually over a certain number of iterations. If, after these iterations, the pheromone level of such a machine becomes negative, the machine receives a penalty and is declared a compromised node. The compromised node is disqualified in the next iteration, which helps optimize the parallel operation of the matching process. This strategy indirectly improves the computational time.
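A minimal sketch of this update rule is given below. The response-time limit, the reward, and the decay amount are assumed values chosen for illustration, and the function name `update_pheromone` is hypothetical; the source does not state the actual constants.

```python
def update_pheromone(pheromone, elapsed, time_limit, reward=1.0, decay=0.5):
    """Reward machines that respond in time; decay and disqualify the others.

    elapsed maps a machine to its response time, or None if it did not respond.
    reward, decay, and time_limit are assumed parameters, not values from the paper.
    """
    updated = {}
    compromised = set()
    for machine, level in pheromone.items():
        response_time = elapsed.get(machine)
        if response_time is not None and response_time <= time_limit:
            level += reward  # well-performing machine
        else:
            level -= decay  # slow or unresponsive machine
        if level < 0:
            compromised.add(machine)  # penalty: excluded from the next iteration
        else:
            updated[machine] = level
    return updated, compromised


# Hypothetical iteration: machine2 does not respond and drops below zero.
levels, excluded = update_pheromone(
    {"machine1": 1.0, "machine2": 0.4},
    {"machine1": 10.0, "machine2": None},
    time_limit=30.0,
)
```

Because the decay is small relative to the reward, a machine is excluded only after several poor iterations, which mirrors the gradual decrease described above.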
Experimental Setup
The benchmark dataset of Anatomy Track provided by the Ontology Alignment Evaluation Initiative (OAEI, 2022) is used to evaluate the performance of the proposed framework. Table 5 lists the data of the input ontologies.
Dataset of the Input Ontologies.
The proposed framework is developed using Kotlin (JVM, version 1.9.22), JDK 19, and Apache Jena (version 5.0.0). LLM Studio (version 0.2.27) with Llama-3-8B Instruct is used specifically for the ontology matching process. The framework is executed in a cloud computing environment on three A100 80G X 896 CPU core machines with 773 GB of RAM and 640 GB of GPU memory. During execution, virtualization is performed along with parallel computing: each of the three machines is virtualized into four (4) instances, and Llama-3-8B is loaded on all three machines. In this way, three (3) machines X four (4) instances = 12 instances are created. It is therefore possible to match and integrate large-scale ontologies with a minimal number of nodes in parallel processing. With this setup, the proposed framework is executed to integrate the large-scale biomedical ontologies of the human anatomy track.
For the comparative analysis, existing frameworks that specifically use an LLM in the ontology matching and integration processes are considered. The main purpose is to compare the computational time required by the proposed framework with the computational time required by the other frameworks under their respective methodologies. A set of parameters is defined and coded with the numbers (1) to (5): the integration category (1), the type of ontology process and scale of the participating ontologies (2), computational time (3), dataset information (4), and evaluation result details (5). The computational time parameter is assessed via the (√) and (X) symbols. The (√) symbol indicates that the framework addresses computational time, and the (X) symbol indicates that it does not. Table 6 presents a comparative analysis of the proposed framework with existing frameworks. The analysis shows that the proposed method completed both the ontology matching and the integration process in 49 min and 50 s. Compared with the existing LLMA Dialog Model and OLaLa frameworks, the proposed framework improves the computational time.
Evaluation of the Proposed Framework Results With Existing Frameworks.
The proposed framework considers the RDF graph representation of the input ontologies and builds the integrated ontology as an RDF graph. Therefore, different graph metrics (Suleiman et al., 2024) are used to evaluate the quality of the integrated ontology. In this study, the graph metrics Absolute Root Cardinality, Absolute Leaf Cardinality (ALC), Average Depth, Maximal Depth, Average Breadth, and Maximum Breadth are used. The online tool OntoMetrics (Lantow, 2016) is used to evaluate the integrated ontology with these graph metrics.
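To make these metrics concrete, the following sketch computes them for a small class hierarchy. It relies on simplified, assumed definitions (roots are classes without a superclass, leaves are classes without subclasses, depth is counted per root-to-leaf path starting at 1, and breadth is the number of classes per level), and it assumes a non-empty, acyclic hierarchy. OntoMetrics may define or count these quantities differently, so the sketch does not reproduce the tool.

```python
def hierarchy_metrics(children):
    """children maps a class to the list of its direct subclasses (assumed acyclic)."""
    all_children = {c for kids in children.values() for c in kids}
    nodes = set(children) | all_children
    roots = nodes - all_children
    leaves = {n for n in nodes if not children.get(n)}

    leaf_depths = []  # depth of every root-to-leaf path
    level_sizes = {}  # number of classes visited at each depth
    stack = [(root, 1) for root in roots]
    while stack:
        node, depth = stack.pop()
        level_sizes[depth] = level_sizes.get(depth, 0) + 1
        kids = children.get(node, [])
        if not kids:
            leaf_depths.append(depth)
        stack.extend((kid, depth + 1) for kid in kids)

    return {
        "absolute_root_cardinality": len(roots),
        "absolute_leaf_cardinality": len(leaves),
        "average_depth": sum(leaf_depths) / len(leaf_depths),
        "maximal_depth": max(leaf_depths),
        "average_breadth": sum(level_sizes.values()) / len(level_sizes),
        "maximal_breadth": max(level_sizes.values()),
    }


# Hypothetical four-class hierarchy used only for illustration.
toy = {"organ": ["heart", "lung"], "heart": ["ventricle"]}
metrics = hierarchy_metrics(toy)
```

In this toy hierarchy, merging additional leaf classes under "lung" would raise the leaf cardinality, which is the kind of change the ALC comparison between source and integrated ontology is meant to capture.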
Table 7 shows the numbers of classes and object properties. The numbers of classes in the source ontology and integrated ontology are not the same. This finding indicates that one thousand forty-seven (1047) classes of Adult Mouse Anatomy ontology are similar to some classes in Human Anatomy. These were merged into the Human Anatomy ontology to enrich it.
Evaluation of Integrated Ontology—Part I.
Table 8 shows that the ALC of the integrated ontology is greater than the ALC value of the source ontology. This finding indicates that the proposed framework generates a cohesive integrated ontology. Thus, it has enriched the source ontology with additional related data from the target ontology. Table 9 shows the evaluation of the integrated ontology. Here, the ontology is evaluated as per the setup of evaluation parameters mentioned in Table 3, Section 4.1. This finding indicates that the proposed framework successfully integrates large-scale ontologies.
Evaluation of Integrated Ontology—Part II.
Evaluation of Integrated Ontology—Part III.
A scalable framework is provided to integrate large-scale ontologies; it addresses the main challenges of scalability and computational complexity. The work proceeds in stages. In the first stage, a preliminary setup of the ontology matching and integration processes is designed. Afterward, the general workflow of the proposed research is designed and represented via a mathematical model. In the second stage, the mathematical model is used to develop the proposed framework. The proposed framework uses partitioning and load-balancing algorithms to enhance the performance of the critical matching process. The partitioning algorithm uses superclass and subclass relations and the cardinality of subclasses to partition the large-scale RDF graph ontologies into subgraphs without disturbing the structural relations.
After partitioning, the classes within the subgraphs are matched concurrently. At this point, the proposed framework applies an ACO algorithm that not only balances the concurrent execution of the matching tasks but also optimizes it. The load-balancing stage initiates the matching and merging processes concurrently and generates the integrated ontology iteratively. The matching process benefits from the ability of the LLM to calculate the similarity score. This avoids the extensive computations otherwise required to identify syntactic, semantic, and structural similarity. The partitioning of the ontologies, the matching of classes, and the merging of classes are all performed on the RDF graph, which also decreases the computational complexity.
The proposed framework is compared with existing frameworks that specifically use an LLM to examine the improvement in computational time. Afterward, the quality of the integrated ontology is evaluated via base and graph metrics. The evaluation indicates that the integrated ontology is consistent. The findings show that the proposed algorithm successfully integrates large-scale ontologies. In the future, the proposed framework will be extended to evaluate the computational time required to integrate large-scale BIO-ML ontologies.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Appendix A: Summary of similarity measures
See Tables A1–A5.
