Abstract
Case-Based Reasoning (CBR) system maintenance is an important issue for current medical systems research. Large-scale CBR systems are becoming more omnipresent, with immense case libraries consisting of millions of cases. Case-Base Maintenance (CBM) is the implementation of the following policies allowing to revise the organization and/or the content (information content, representation field of application, or the implementation) of the Case Base (CB) to improve future thinking. Diverse case-base deletion and addition policies have been proposed which claim to preserve case-base competence. This paper presents a novel clustering-based deletion policy for CBM that exploits the K-means clustering algorithm. Thus, CBM becomes a central subject whose objective is to guarantee the quality of the CB and improve the performance of CBM. The proposed approach exploited clustering, which groups similar cases using the K-means algorithm. We rely on the characterization made of the different cases in the CB, and we find this characterization by a method based on a criterion of competence and performance. From this categorization, case deletion becomes obvious. This quality depends on the competence and performance of the CB. Test results show that the proposed deletion strategy improved the efficiency of the CB while preserving competence.Furthermore, its performance was 13% more reliable. The effectiveness of the proposed approach examined on the medical databases and its performance has been compared with the existing approaches on deletion policy. Experimental results are very encouraging.
Keywords

Introduction
Case-Based Reasoning (CBR) is a machine learning algorithm for problem-solving and learning that caught a lot of attention over the last few years [1]. CBR is a problem-solving approach that consists of reusing past experiences to solve a new problem [2, 3]. A case can be defined by a pair: (Prob, Sol(Prob)) and the solution associated with the problem and is stored in a memory called the Case Base (CB). The problem part includes observational data and the behavioral situation, while the solution part contains a description of the solution provided by the reasoning. A case stored in the CB is called a source case and noted (srce, Sol(srce)), which will be used as a base for solving a new case that will be called a target case. The reasoning process often consists of five main steps: development, retrieval, adaptation, validation, and memorization. First, the elaboration phase builds a target case (target, Sol(?)) by completing or filtering the description of a problem from a possibly incomplete description from a case to be solved. Then, following a similarity metric, we recall source cases similar to the target case, adapted by building a Sol (target) solution; the validation of the solution comes next, if necessary. Finally, memorization consists of storing the new case (target, Sol(target)) once validated, if this storage is considered appropriate.
Any mature system, including CBR systems, must be maintained during the operating phase to ensure the quality of this system. Indeed, the maintenance of the CBR system becomes necessary for systems that are designed to operate for long periods and/or that will have to process large volumes of data and cases. It should be noted that the quality of a CBR system is related to the definition and representation of a case, the organization of the CB, the various indexes used, and the definition of suitable measures of similarities for case retrieval and adaptation steps [4]. There are various works in this field [5, 6, 7, 8], ranging from the modeling of the CBR cycle highlighting the phases related to maintenance, through the control of the different sources of knowledge constituting the CBR system, to the maintenance of CBR, which will be developed in this work.
The Case-Base Maintenance (CBM) phase is an important step for the good functioning of the CB operation throughout the entire life cycle of the medical system. CB is progressively enriched by the successive addition of cases, leading to an explosion in the number of cases in the CB and, on the other hand, to solutions that may be contradictory. Indeed, the explosion in the number of cases has repercussions over the system’s response in its retrieval and adaptation phase. Moreover, if we introduce any case in this CB we can get wrong solutions to the problem encountered mainly in the medical domain [9]. Therefore, implementing an effective medical CBR system is explicitly linked to the quality of the CB, which is the container of knowledge. CBR is a cognitive model centered on memory, a strategy that focuses on learning new competencies or generating hypotheses for new situations based on previous experiences; such strategies rely heavily on the competence of the CB to make highly adequate decisions. Therefore, the quality of CB becomes of paramount importance, mainly when we discuss the implementation of a medical diagnostic system using a CBR framework. CBR is adapted to the medical domain since medical experts use the knowledge (medical imaging) they have obtained from books and experiments. The exact functioning of CBR is a fundamental part of the system: learning by memorizing cases.
Aiming to maintain or even improve the quality, which might have degraded after several rounds of CBR, we turned to a particular branch of research. This branch focuses on the partitioning of the CB, which allows to build a refined CB structure and maintain it. The reasoning uses cases already stored in a CB. This base is supposed to be representative of all the problems that can be posed to the system. But the more the base grows, the longer the calculation time will be. This is why CB organization techniques and search and matching algorithms are particularly important. Hence several organizations are presented in [10]. Furthermore, CBM has been described as improving the performance of CBR systems: CBM implements policies to revise the organization or content of the CB to facilitate future reasoning for a particular set of performance objectives [9].
This work proposes a case deletion strategy for CBM based on the unsupervised K means algorithm. The combination of the CBM approach and the clustering algorithm has been chosen to take advantage of both learning paradigms’ strengths to improve the performance of CBR systems. The aim is to remove misclassified cases that take time for the reasoning of CBR and influence its performance. An efficient CBR system cannot stand without its CB, which is the system’s core. The proposed approach aims to build and maintain a quality CB with improved competence and performance criteria by reducing its size. The remainder of the paper is organized as follow: Section 2 presents related work of CBM systems, Section 3 describes some preliminaries of CBR, CBM and its strategies; Section 4 presents the proposed approach for a CBM, while Section 5 analyses the experimental results carried out in medical data sets; finally, Section 6 concludes this work and suggests future research.
Related work
This section presents the work that we consider most representative of the progress of the CBM. The selected studies contribute to CBR and, for the most part, influence the CBR community’s current work. This overview gives an insight into the extent of the level based on the deletion policies to CBM. In [1], Smyth and Keane proposed a case competency model to guide learning and case deletion. The authors presented a suite of algorithms to build and maintain this competency model at runtime effectively. Two new deletion policies (fingerprint deletion and fingerprint deletion) preserve competency by referring to this model at the time of deletion. The preliminary experimental results are promising in demonstrating that the competence estimates are useful in preserving the real competence of the system.
The system proposed by Smyth [5] is an approach to maintenance based on removing harmful cases from the CB. A competency model guides this approach to ensure that efficiency and competence are preserved and optimized during maintenance. In addition, Smyth and al. suggested additional ways to use the competency model when maintaining and acquiring cases. For example, the model can be used to identify potentially abnormal cases. Another possibility is to use the model to identify competence-rich subsets of a CB to be used as client-side CBs in a distributed CBR system. This also implies the possibility of a competency-based approach for distributed cases recovery. Finally, assistance in file creation can be provided by informing the engineer of knowledge about the regions of the CB with high or low competence.
Salam’o and Golobardes [11] presented two approaches based on deletion policies to case memory maintenance. The foundations of both approaches are the raw set theory, but each applies a different policy to remove or maintain cases. The main purpose of these methods is to maintain the system’s competence and reduce, as much as possible, the size of case memory to avoid inconsistent and redundant instances, obtain compact case memories, and maintain or improve the competence of the CBR system. Their experiences using different fields, most of them from the UCI repository [39], show that reduction techniques retain the competence obtained by the original case memory. The results obtained were compared with those obtained using well-known reduction techniques.
A measure of CBM was proposed by Haushin et al. [8], based on the existing literature and illustrated by a first test based on 69 cases. Among the existing CBM policies and strategies, the authors proposed a case deletion strategy. They are based on the characterization of the different cases in the CB, and we find it by a method based on a competence criterion. From this categorization, case deletion becomes evident. The notion of competence is quantified by measuring the maintenance action of the MC competence based on the concepts of recoverability and achievability associated with a relative recovery action. The interest of this measure is to quantify both the competence and performance of a CB.
Lu et al. [12] proposed a new competency model and a new maintenance procedure for the proposed competency model. Based on this competence maintenance procedure, fingerprint-based recovery (FBR), a competence-based case recovery method, can maintain its recovery and efficiency. The system proposed by Ali et al. [13] ensures that an acceptable level of competence is maintained. It was implemented using an autonomous forest fire application database (AFFA). The empirical investigation reveals that the proposed approach surpasses the fingerprint removal policy. In the literature, the fingerprint deletion criteria proposed in [4] are treated as a reference for case deletion policies. The authors implemented their proposed approach and the fingerprint removal policy. The competence of the two approaches was compared. The similarity between cases was calculated using Euclidean distance. For grouping purposes, the standard K-means classification algorithm was used. Although the fingerprint removal technique claims to improve the efficiency of the CB while preserving competence, its performance was 7% lower than the proposed approach at the optimal points but was worse at the other points.
Addition of two phases to the CBR cycle [5].
In the system proposed by Yan et al. [14], the problem-solving performance of a CBR system is closely related to the quantity and quality of cases stored in the CB. With the continued growth in the size of the business base, the so-called “submersion problem” can occur when the time cost of recovery exceeds the benefit of accuracy. From a cognitive science perspective, a dynamic maintenance method improved by selective and intentional memory forgetting CBR is proposed, which can mimic the memory function of the human brain selectively save new cases, update forgotten values and intentionally delete the old case. Experiences show the effectiveness of the proposed method. Selective memory and international forgetting policy can significantly reduce time and space complexity and maintain or improve the accuracy of the CBR classifier, thus improving CBR performance. The system proposed by Leake and Schack [15] proposed an improved approach. This flexible function deletion removes parts of cases, allowing a selective, uniform and uniform compression of the CB from different layers. It proposes and evaluates an initial set of feature removal strategies. The experimental results show that case-based maintenance may have to be modified when the content of the cases is not uniform. In such contexts, characteristic-based strategies can give better precision than case strategies. Moreover, the calculation bases and reference times may not be aligned, giving a space/time trade-off that can be exploited.
In [16], the authors presented a hybrid CBM method that equally utilizes the benefits of case addition and case deletion policies to maintain the CB in online and offline modes, respectively. The proposed approach has been evaluated using a simulated model of autonomic forest fire application. Its performance has been compared with the existing methods on a large case-base of the simulated case study. Mazin et al. [17] proposed a hybrid approach that combines genetic algorithm and CBR (GCBR) to improve CBR diagnosis. This approach applies the experience and knowledge of existing failure diagnosis cases to newly provided cases. The proposed approach can be valuable, especially for solving problems associated with moving failures. In addition, this method improves approaches that need to use similar systems but are in different fields. Considered as the essence of CBR system maintenance, CB directly impacts the quality of these systems. In this paper, we present a clustering-based CBM removal strategy that exploits the K-means clustering algorithm. The results presented in this paper show that the proposed policy is more efficient than the existing alternative reference deletion policy and ensures better competence and performance.
This section introduces the CBR system and CBM policies. First, the CBR cycle is presented, followed by the CBR maintenance policies, and finally, the criteria for assessing the quality of CBR.
Case base reasoning (CBR)
CBR can be defined as a reasoning paradigm that relies on past solved problems, also called source cases, to solve new problems, also called target problems [2, 3]. This paradigm is used in many industrial systems to solve problems in various application areas [18]. Often, CBR is presented as a solution to the bottleneck of the knowledge acquisition stage through the use of experimental knowledge, which is easier to collect. However, in contrast to other knowledge-based systems, several arguments present CBR as a much more accessible to implement a solution. Moreover, it has been reported that it is impossible to economize on the knowledge acquisition effort in practice.
CBR has a multiphase cycle, the number of phases of which varies according to different literature sources. It can be composed of three, four or five phases. Fuchs et al. [19] determined three phases, namely retrieval, adaptation, and memorization. Aamodt and Plaza [2] are the first authors to have described the CBR cycle and compose it in four phases: retrieval (search for asimilar case), adaptation (reuse of the found case), validation (revision of the selected case) and memorization (learning). Mille et al. [20] proposed adding a preliminary phase of elaboration at the beginning of the cycle. Figure 1 shows the CBR cycle with these five phases.In this study, we operated a five-phase CBR cycle [2].
Elaboration phase: the target case is constructed by completing or filtering the description of a problem from a possibly incomplete description. Retrieval phase: sources from the case database by searching for matches between descriptors of the source cases and the case to be solved (target). Adaptation phase: consists of constructing a solution to the problem of the target case inspired by the most similar source case(s) solution. Review phase: of the proposed solution in the case of a possible unsatisfactory solution, it would then be possible to correct it. In this case, the solution is evaluated in the real world using either the user, a human expert, domain knowledge or an automatic process.
Diagram of the different strategies and criteria used in the CBM [4]. Memorization phase: consists of storing a new solved case in the CB if this storage is considered suitable to enrich the system’s memory.

The work of Roth-Berghofer [21] and Reinartz [22] is focused on the modelling of the life cycle of the CBR system integrating the phases related to maintenance. Thus, two metaphases are proposed concerning application and maintenance, respectively. In addition, Richter [23] proposed a system of CBR through knowledge sources or containers. In this perspective, maintenance consists of developing techniques to control and react to changes in these different sources of knowledge. It should be noted that the sources adopted are:
Vocabulary source: it contains all the information on the definitions and structures used; Source of measures of similarity: it contains the necessary measures for case tracing. Adaptation source: it contains the rules for transforming the solution. Source of the case database: it represents the content and organization of the CB.Note that the first three sources of knowledge are developed before the system is running, while the CB usually is updated dynamically. According to Richter, each source can carry almost all of the available knowledge, and manipulations on one source have a small impact on the others.
Maintenance can be associated with each source of knowledge. Gabel [24] worked on learning similarity measures, such as evaluating weights associated with the descriptors of a case or the acquisition of similarity measures thanks to utility functions from a processing feedback loop associated with CBR applications. The CB has a central role, which explains why most of the work carried out in this area is essentially based on CBM [9]. Furthermore, the knowledge of a CBR system is case-related since cases are affected by any changes in knowledge sources. Therefore, the CB is the knowledge source most sensitive to changes in the CBR system, and its consultation is the most appropriate to trigger maintenance operations [4].
CBM is implementing policies to revise the organization and/or content (representation, scope, information content, or implementation) of the CB to improve future thinking [25]. Indeed, the CBM is a set of different realities, such as eliminating case redundancy, removing inconsistent cases, selecting groups of cases and improving the reasoning power of the system. In addition, cases can be rewritten to repair the problems of inconsistencies [26]. In their publication, Reinartz et al. [22] specified that CBM intervenes in the CBR cycle and, more precisely, at the end of this cycle when learning cases in the case base.Indeed, the authors proposed two new phases in this cycle, namely a scanning phase and a restoration phase (Fig. 2).
The scanning phase considers the current state of the CB and assesses its quality. The quality of the CB depends on several criteria based on the number of cases in the CB, their problem-solving power, and the response time.If the quality is poor, this phase suggests specific changes to achieve the desired quality. It also allows the triggering of maintenance during the “online” operation of the CBR cycle by proposing, among other things, several types of maintenance operations [4]. Finally, the restoration phase is used if the quality of the CB is unsatisfactory after the scanning phase. It allows selecting the methods used to choose a modification operator among those suggested by the scanning phase. These operators allow the content of the CB to be changed. Thus, the selection of modification operators aims to obtain the required quality level. As a result, the quality of the information is updated to reflect the change [4]. The same authors reported that a third field is added in the case representation (quality information), which contains all the data necessary to perform case database maintenance.
Indeed, the problem is dealt with by giving it an adequate solution in the “application” section. Then, in the maintenance part, the learning of the case in the CB begins. Before adding a case to the CB, the scanning phase evaluates it against the case set in the CB according to quality criteria that will be discussed in the next section. If the case is retained, the case database is updated in the restoration phase.
CBM criteria
The different approaches proposed for CBM can be divided into two policies, one concerning optimization and the other concerning CB partitioning (Fig. 2). These approaches aim to reduce the search time for cases either by optimization by eliminating the least relevant cases, following two strategies: addition and deletion of cases, or by partitioning, by partitioning the CB into several search spaces allowing the incremental selection of information-rich attributes that can cover the structure of the CB [25]. Several properties of the CB cases
Criteria for assessing the quality of CB
A CB is judged to be of good quality if it allows the CBR system to resolve as many problems as possible in a correct manner within a reasonable period of time. Several criteria for evaluating CB quality have been proposed in the literature. In this work, we are particularly interested in the following criteria [1, 25, 27]:
competence: it is measured by the number of different problems for which the system provides a good solution; system performance: it is assessed by the response time needed to propose a solution to a target case; this measure is directly related to adaptation costs and research costs; recovery of a CB case: it represents the set of target cases that this case can resolve; attainability of a target case: this is the set of source cases that can be used to solve it.
After explaining the different criteria for evaluating a CB, the following subsections describe how they are used and evaluated in the different CBM strategies. Subsequently, the two strategies of adding and deleting cases are defined [4].
A reduced CB is constructed from a blank CB, by successively adding cases and maximizing a criterion. There are two methods, one maximizing the competence criterion and the other the performance criterion.
Case deletion strategy
From a CB, this strategy evaluates cases according to a criterion to remove them and reduce the CB to a given number of cases. Evaluation criteria such as competence, redundancy, and inconsistency have been used in different methods.
Proposed approach
This section presents the design part of our work, in which a global reflection is conducted. It contains the main objectives, its general architecture, and the different components. In summary, automatic classification consists of grouping, in an unsupervised way, a set of objects or more widely of data, so that the objects of the same group (called cluster) are closer (in the sense of a chosen (dis)similarity criterion) to each other than those of the other groups (clusters). This is the main task in data mining. A statistical data analysis technique is widely used in many fields, including automatic learning, pattern recognition, signal and image processing, information retrieval, etc. Several methods have been developed in this context, the most popular being the K-means, which owes its popularity to its simplicity and ability to process large datasets.It addition, this section presents a deletion strategy for CBM that uses the K-means clustering algorithm. It presents the proposed approach, and its comparison with the case deletion criteria (identifies cases; case type, data nature). The database is grouped into k classes (cluster). This work aims to develop a CBM system for CBR approach. We used a method of clustering breast diseases by K-means (number of clusters
Description of the breast disease dataset [28]
Description of the breast disease dataset [28]
Detailed system architecture.
The database used in this paper is the breast disease database, and it is established by the IBN SINA Annaba Hospital (Algeria) [28, 29]. It is a multi-class database containing 100 instances (patients) with cancer, cyst, lipoma, adenofibroma, and abscess disease. Moreover, it includes 25 patients with cancer disease, 22 patients with cyst, 34 with adenofibroma, 12 with lipoma, and 7 patients with abscess disease. Thus, a patient is characterized by a vector of 22 characteristics (Attributes), including the class that contains the disease (cancer, lipoma, abscess, adenofibroma, cyst) (Table 1).
Data pre-processing
The good results that an automatic classifier can provide are primarily based on the pre-processing phase. Data from poor pre-processing will jeopardize the quality of the classifier. In our case, all available data represent software measurements stored in Access files. Files can only be loaded in arff (Attribute-Relation File Format) format. For this reason, our access files must be converted into arff files. We eliminated missing values in the database because the K-means algorithm does not process missing data. This database was transformed into a numerical database for the CBR approach during the classification step. In addition, a learning base was created with 70% of the original base and a test base with the remaining 30%. For our implementation, we used the libraries (import numpy as np, import pandas as pd, import math) to make the different manipulations of the different algorithms such as the clustering algorithm, the mathematical functions, the displays of the clusters, the management of the input and output flows, etc.
Clustering (K-means algorithm)
The choice of a learning method is very important. We used the clustering method, which helps us to decide on a CBR system. We chose K-means as the clustering method. The unsupervised clustering algorithm has shown interesting results in improving the data representation and classification results [30, 31], giving us a good research line to explore. The K-means algorithm developed by MacQueen in [32] is one of the simplest unsupervised learning algorithms. It is called the mobile centres algorithm [33, 34], which assigns each point in a cluster with the nearest center(centroid). The center is the average of all the points in the cluster. Its coordinates are the arithmetic mean for each dimension separately from all the points in the cluster, i.e., each cluster is represented by its gravity center.
Let
where,
The pseudo-code of the K-means algorithm is summarized in Algorithm 1 [32].
In this paper, a CBM approach based on clustering is proposed and implemented. First, the CB is grouped into k clusters using the K-means clustering algorithm (Fig. 4) (
Categorization of cases in a clustered CB.
The difference between these case categories has been represented in Fig. 4. A basic sample of cases was grouped into five groups. The pivotal cases are shown as stars. Support cases are noted as black squares, while axillary cases are shown as black dots. In such a scenario, cases located at the cluster border will not be adequately represented by the corresponding centers due to the clusters’ larger radius. The proposed approach searches for these cases to delete them as misclassified cases. We obtain this by applying K-meansto our database: the database is grouped into 5 clusters (
Table 2 shows that K-means divides the CB(breast diseases) into 5 clusters (cluster 0, cluster 1, cluster 2, cluster 3 and cluster 4) and classifies the 5 classes (diseases: cancer, cyst, adenofibroma, lipoma, and abscess) in clusters according to their closest affiliation to the center of gravity. Figure 5 illustrates that there are misclassified cases (described in Table 2).
Result of K-means of grouping into 5 clusters
Clustering of the breast disease database into 5 clusters.
From these results, we can say that CB maintenance is necessary when we chose the strategy of removing misclassified cases according to the algorithm in the following. According to the determined test criterion, the maintenance algorithm applied on a case-by-case basis (breast disease) allows the removal of misclassified cases. Algorithm 2 shows the CBM algorithm proposed by our system and the policy of deleting misclassified cases whose auxiliary cases are most likely to be deleted according to the determined test (a deletion threshold). Among the misclassified cases, we deleted the instances (patients) whose most effective attributes (called master symptoms) have a minimum number of positive values (i.e., master symptom values
Master symptom Value negative Value positive
Algorithm 2 summarizes the pseudo-code of the proposed maintenance algorithm (deletion policy).
The clustering process always requires a learning base as input. Creating a learning base means having individuals (in our case, patients with breast disease) whose class membership is known with certainty. Our database contains 100 patients; we divided it into two sub-databases: the learning database contains 70 patients, and we created 30 patients for the test. Moreover, to evaluate the algorithm, we divided our database into two subbases: the learning set and the test set, from which we performed 10 change tests each time.
Clustering of the CB breast diseases before maintenance
Clustering of the CB breast diseases before maintenance








