A new case-deletion strategy for case-base maintenance based on K-means Clustering Algorithm applied to medical data

Abstract

Maintenance of Case-Based Reasoning (CBR) systems is an important issue in current medical systems research. Large-scale CBR systems are becoming increasingly common, with case libraries that can contain millions of cases. Case-Base Maintenance (CBM) is the implementation of policies for revising the organization and/or the content (information content, representation, field of application, or implementation) of the Case Base (CB) in order to improve future reasoning. Diverse case deletion and addition policies have been proposed that claim to preserve case-base competence. CBM has thus become a central subject whose objective is to guarantee the quality of the CB and improve the performance of the CBR system. This quality depends on the competence and performance of the CB. This paper presents a novel clustering-based deletion policy for CBM that exploits the K-means clustering algorithm to group similar cases. We rely on a characterization of the different cases in the CB, obtained by a method based on competence and performance criteria. From this categorization, the choice of cases to delete follows directly. Test results show that the proposed deletion strategy improved the efficiency of the CB while preserving competence. Furthermore, its performance was 13% more reliable. The effectiveness of the proposed approach was examined on medical databases, and its performance was compared with existing deletion policies. Experimental results are very encouraging.

Keywords

Case-Based Reasoning (CBR), Case-Base Maintenance (CBM), deletion strategies, clustering, K-means algorithm, performance, competence


1. Introduction

Case-Based Reasoning (CBR) is a machine learning approach to problem-solving and learning that has attracted considerable attention over the last few years [1]. It consists of reusing past experiences to solve a new problem [2, 3]. A case can be defined as a pair (Prob, Sol(Prob)), that is, a problem and its associated solution, and is stored in a memory called the Case Base (CB). The problem part includes observational data and the behavioral situation, while the solution part contains a description of the solution provided by the reasoning. A case stored in the CB is called a source case and is denoted (srce, Sol(srce)); it serves as a basis for solving a new case, called the target case. The reasoning process often consists of five main steps: elaboration, retrieval, adaptation, validation, and memorization. First, the elaboration phase builds a target case (target, Sol(?)) by completing or filtering a possibly incomplete description of the problem to be solved. Then, following a similarity metric, source cases similar to the target case are retrieved and adapted to build a solution Sol(target); the validation of the solution comes next, if necessary. Finally, memorization consists of storing the new case (target, Sol(target)) once validated, if this storage is considered appropriate.

Any mature system, including CBR systems, must be maintained during the operating phase to ensure the quality of this system. Indeed, the maintenance of the CBR system becomes necessary for systems that are designed to operate for long periods and/or that will have to process large volumes of data and cases. It should be noted that the quality of a CBR system is related to the definition and representation of a case, the organization of the CB, the various indexes used, and the definition of suitable measures of similarities for case retrieval and adaptation steps [4]. There are various works in this field [5, 6, 7, 8], ranging from the modeling of the CBR cycle highlighting the phases related to maintenance, through the control of the different sources of knowledge constituting the CBR system, to the maintenance of CBR, which will be developed in this work.

The Case-Base Maintenance (CBM) phase is an important step for the proper functioning of the CB throughout the entire life cycle of the medical system. The CB is progressively enriched by the successive addition of cases, leading, on the one hand, to an explosion in the number of cases in the CB and, on the other hand, to solutions that may be contradictory. Indeed, the explosion in the number of cases affects the system’s response time in its retrieval and adaptation phases. Moreover, if arbitrary cases are introduced into the CB, the system can return wrong solutions to the problem encountered, which is a particular concern in the medical domain [9]. Therefore, implementing an effective medical CBR system is explicitly linked to the quality of the CB, which is the container of knowledge. CBR is a cognitive model centered on memory, a strategy that focuses on learning new competencies or generating hypotheses for new situations based on previous experiences; such strategies rely heavily on the competence of the CB to make adequate decisions. The quality of the CB is therefore of paramount importance, especially when implementing a medical diagnostic system using a CBR framework. CBR is well suited to the medical domain since medical experts use the knowledge (medical imaging) they have obtained from books and experience. Learning by memorizing cases is a fundamental part of how a CBR system functions.

Aiming to maintain or even improve the quality of the CB, which might have degraded after several rounds of CBR, we turned to a particular branch of research. This branch focuses on the partitioning of the CB, which makes it possible to build a refined CB structure and maintain it. The reasoning uses cases already stored in a CB. This base is supposed to be representative of all the problems that can be posed to the system. However, the more the base grows, the longer the computation time becomes. This is why CB organization techniques and search and matching algorithms are particularly important; several such organizations are presented in [10]. Furthermore, CBM has been described as improving the performance of CBR systems: CBM implements policies to revise the organization or content of the CB to facilitate future reasoning for a particular set of performance objectives [9].

This work proposes a case deletion strategy for CBM based on the unsupervised K-means algorithm. The combination of the CBM approach and the clustering algorithm has been chosen to take advantage of the strengths of both learning paradigms to improve the performance of CBR systems. The aim is to remove misclassified cases that slow down the reasoning of the CBR system and degrade its performance. An efficient CBR system cannot stand without its CB, which is the system’s core. The proposed approach aims to build and maintain a quality CB with improved competence and performance criteria by reducing its size. The remainder of the paper is organized as follows: Section 2 presents related work on CBM systems; Section 3 describes some preliminaries of CBR, CBM and its strategies; Section 4 presents the proposed approach for CBM; Section 5 analyses the experimental results obtained on medical data sets; finally, Section 6 concludes this work and suggests future research.

2. Related work

This section presents the work that we consider most representative of the progress of CBM. The selected studies contribute to CBR and, for the most part, influence the CBR community’s current work. This overview gives an insight into the state of deletion-policy-based approaches to CBM. In [1], Smyth and Keane proposed a case competence model to guide learning and case deletion. The authors presented a suite of algorithms to build and maintain this competence model effectively at runtime. Two new deletion policies (footprint deletion and footprint-utility deletion) preserve competence by referring to this model at the time of deletion. The preliminary experimental results are promising in demonstrating that the competence estimates are useful in preserving the real competence of the system.

The system proposed by Smyth [5] is an approach to maintenance based on removing harmful cases from the CB. A competence model guides this approach to ensure that efficiency and competence are preserved and optimized during maintenance. In addition, Smyth et al. suggested additional ways to use the competence model when maintaining and acquiring cases. For example, the model can be used to identify potentially abnormal cases. Another possibility is to use the model to identify competence-rich subsets of a CB to be used as client-side CBs in a distributed CBR system. This also implies the possibility of a competence-based approach to distributed case retrieval. Finally, assistance in case creation can be provided by informing the knowledge engineer about the regions of the CB with high or low competence.

Salamó and Golobardes [11] presented two approaches to case memory maintenance based on deletion policies. The foundation of both approaches is rough set theory, but each applies a different policy to remove or maintain cases. The main purpose of these methods is to maintain or improve the competence of the CBR system while reducing, as much as possible, the size of the case memory, so as to avoid inconsistent and redundant instances and obtain compact case memories. Their experiments on different domains, most of them from the UCI repository [39], show that the reduction techniques retain the competence obtained with the original case memory. The results obtained were compared with those obtained using well-known reduction techniques.

A measure of CBM was proposed by Haushin et al. [8], based on the existing literature and illustrated by a first test based on 69 cases. Among the existing CBM policies and strategies, the authors proposed a case deletion strategy. It relies on a characterization of the different cases in the CB, obtained by a method based on a competence criterion. From this categorization, the choice of cases to delete follows directly. The notion of competence is quantified by measuring the maintenance action of the MC competence based on the concepts of recoverability and achievability associated with a relative recovery action. The interest of this measure is that it quantifies both the competence and the performance of a CB.

Lu et al. [12] proposed a new competence model and a new maintenance procedure for it. Based on this competence maintenance procedure, footprint-based retrieval (FBR), a competence-based case retrieval method, can maintain its retrieval effectiveness and efficiency. The system proposed by Ali et al. [13] ensures that an acceptable level of competence is maintained. It was implemented using the case base of an autonomic forest fire application (AFFA). The empirical investigation reveals that the proposed approach surpasses the footprint deletion policy. In the literature, the footprint deletion criteria proposed in [4] are treated as a reference for case deletion policies. The authors implemented both their proposed approach and the footprint deletion policy and compared the competence of the two. The similarity between cases was calculated using Euclidean distance, and the standard K-means algorithm was used for clustering. Although the footprint deletion technique claims to improve the efficiency of the CB while preserving competence, its performance was 7% lower than that of the proposed approach at the optimal points and worse still at the other points.

Figure 1.

Addition of two phases to the CBR cycle [5].

Yan et al. [14] observed that the problem-solving performance of a CBR system is closely related to the quantity and quality of cases stored in the CB. With the continued growth in the size of the case base, the so-called “swamping problem” can occur when the time cost of retrieval exceeds the benefit in accuracy. From a cognitive science perspective, they proposed a dynamic maintenance method for CBR, enhanced by selective memory and intentional forgetting, which mimics the memory function of the human brain to selectively save new cases, update forgotten values, and intentionally delete old cases. Experiments show the effectiveness of the proposed method: the selective memory and intentional forgetting policy can significantly reduce time and space complexity and maintain or improve the accuracy of the CBR classifier, thus improving CBR performance. Leake and Schack [15] proposed an improved approach, flexible feature deletion, which removes parts of cases, allowing a selective, uniform compression of the CB from different layers. They proposed and evaluated an initial set of feature deletion strategies. The experimental results show that case-based maintenance may have to be modified when the content of the cases is not uniform. In such contexts, feature-based strategies can give better accuracy than case-based strategies. Moreover, the calculation bases and reference times may not be aligned, giving a space/time trade-off that can be exploited.

In [16], the authors presented a hybrid CBM method that equally utilizes the benefits of case addition and case deletion policies to maintain the CB in online and offline modes, respectively. The proposed approach has been evaluated using a simulated model of autonomic forest fire application. Its performance has been compared with the existing methods on a large case-base of the simulated case study. Mazin et al. [17] proposed a hybrid approach that combines genetic algorithm and CBR (GCBR) to improve CBR diagnosis. This approach applies the experience and knowledge of existing failure diagnosis cases to newly provided cases. The proposed approach can be valuable, especially for solving problems associated with moving failures. In addition, this method improves approaches that need to use similar systems but are in different fields. Considered as the essence of CBR system maintenance, CB directly impacts the quality of these systems. In this paper, we present a clustering-based CBM removal strategy that exploits the K-means clustering algorithm. The results presented in this paper show that the proposed policy is more efficient than the existing alternative reference deletion policy and ensures better competence and performance.

3. Preliminaries

This section introduces the CBR system and CBM policies. First, the CBR cycle is presented, followed by the CBR maintenance policies, and finally, the criteria for assessing the quality of CBR.

3.1 Case base reasoning (CBR)

CBR can be defined as a reasoning paradigm that relies on past solved problems, also called source cases, to solve new problems, also called target problems [2, 3]. This paradigm is used in many industrial systems to solve problems in various application areas [18]. CBR is often presented as a solution to the bottleneck of the knowledge acquisition stage through the use of experiential knowledge, which is easier to collect. Moreover, in contrast to other knowledge-based systems, several arguments present CBR as a solution that is much easier to implement. However, it has been reported that it is impossible to economize on the knowledge acquisition effort in practice.

CBR has a multiphase cycle, the number of phases of which varies across the literature: it can be composed of three, four or five phases. Fuchs et al. [19] identified three phases, namely retrieval, adaptation, and memorization. Aamodt and Plaza [2] were the first authors to describe the CBR cycle, decomposing it into four phases: retrieval (search for a similar case), adaptation (reuse of the found case), validation (revision of the selected case) and memorization (learning). Mille et al. [20] proposed adding a preliminary elaboration phase at the beginning of the cycle. Figure 1 shows the CBR cycle with these five phases. In this study, we adopt a five-phase CBR cycle [2].

• Elaboration phase: the target case is constructed by completing or filtering a possibly incomplete description of the problem.
• Retrieval phase: source cases are retrieved from the CB by searching for matches between the descriptors of the source cases and those of the case to be solved (target).
• Adaptation phase: a solution to the target problem is constructed, inspired by the solution(s) of the most similar source case(s).
• Review phase: the proposed solution is evaluated in the real world by the user, a human expert, domain knowledge, or an automatic process; if it is unsatisfactory, it can then be corrected.
• Memorization phase: the newly solved case is stored in the CB if this storage is considered suitable to enrich the system’s memory.

Figure 2.
Diagram of the different strategies and criteria used in the CBM [4].

3.2 Case base maintenance (CBM)

The work of Roth-Berghofer [21] and Reinartz [22] is focused on the modelling of the life cycle of the CBR system integrating the phases related to maintenance. Thus, two metaphases are proposed concerning application and maintenance, respectively. In addition, Richter [23] proposed a system of CBR through knowledge sources or containers. In this perspective, maintenance consists of developing techniques to control and react to changes in these different sources of knowledge. It should be noted that the sources adopted are:

• Vocabulary source: it contains all the information on the definitions and structures used;
• Source of similarity measures: it contains the measures needed for case retrieval;
• Adaptation source: it contains the rules for transforming the solution;
• Source of the case base: it represents the content and organization of the CB.

Note that the first three sources of knowledge are developed before the system is running, while the CB is usually updated dynamically. According to Richter, each source can carry almost all of the available knowledge, and manipulations on one source have a small impact on the others.

Maintenance can be associated with each source of knowledge. Gabel [24] worked on learning similarity measures, such as evaluating weights associated with the descriptors of a case or the acquisition of similarity measures thanks to utility functions from a processing feedback loop associated with CBR applications. The CB has a central role, which explains why most of the work carried out in this area is essentially based on CBM [9]. Furthermore, the knowledge of a CBR system is case-related since cases are affected by any changes in knowledge sources. Therefore, the CB is the knowledge source most sensitive to changes in the CBR system, and its consultation is the most appropriate to trigger maintenance operations [4].

3.2.1 CBM policies

CBM is the implementation of policies to revise the organization and/or content (representation, scope, information content, or implementation) of the CB to improve future reasoning [25]. In practice, CBM covers a range of activities, such as eliminating case redundancy, removing inconsistent cases, selecting groups of cases, and improving the reasoning power of the system. In addition, cases can be rewritten to repair inconsistencies [26]. Reinartz et al. [22] specified that CBM intervenes in the CBR cycle and, more precisely, at the end of this cycle, when cases are learned into the case base. Indeed, the authors proposed two new phases in this cycle, namely a scanning phase and a restoration phase (Fig. 2).

The scanning phase considers the current state of the CB and assesses its quality. The quality of the CB depends on several criteria based on the number of cases in the CB, their problem-solving power, and the response time. If the quality is poor, this phase suggests specific changes to achieve the desired quality. It also allows maintenance to be triggered during the “online” operation of the CBR cycle by proposing, among other things, several types of maintenance operations [4]. The restoration phase is used if the quality of the CB is found to be unsatisfactory by the scanning phase. It selects a modification operator among those suggested by the scanning phase. These operators change the content of the CB, and their selection aims to reach the required quality level. As a result, the quality information is updated to reflect the change [4]. The same authors reported that a third field (quality information) is added to the case representation; it contains all the data necessary to perform case base maintenance.

Indeed, the problem is dealt with by giving it an adequate solution in the “application” section. Then, in the maintenance part, the learning of the case in the CB begins. Before adding a case to the CB, the scanning phase evaluates it against the case set in the CB according to quality criteria that will be discussed in the next section. If the case is retained, the case database is updated in the restoration phase.

3.2.2 CBM criteria

The different approaches proposed for CBM can be divided into two policies, one concerning optimization and the other concerning CB partitioning (Fig. 2). These approaches aim to reduce the search time for cases either by optimization, which eliminates the least relevant cases following two strategies (addition and deletion of cases), or by partitioning, which divides the CB into several search spaces, allowing the incremental selection of information-rich attributes that can cover the structure of the CB [25]. Several properties of the CB cases were proposed to perform a CB evaluation.

3.3 Criteria for assessing the quality of CB

A CB is judged to be of good quality if it allows the CBR system to resolve as many problems as possible in a correct manner within a reasonable period of time. Several criteria for evaluating CB quality have been proposed in the literature. In this work, we are particularly interested in the following criteria [1, 25, 27]:

• competence: it is measured by the number of different problems for which the system provides a good solution;
• system performance: it is assessed by the response time needed to propose a solution to a target case; this measure is directly related to adaptation costs and research costs;
• recovery of a CB case: it represents the set of target cases that this case can resolve;
• attainability of a target case: this is the set of source cases that can be used to solve it.
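The last two criteria can be illustrated with a minimal Python sketch. It assumes a hypothetical predicate `solves(source, target)` that decides whether a source case can solve a target case; the toy predicate below (same class and a feature distance within a radius) and the three toy cases are illustrative assumptions, not the criterion or data used in this paper.

```python
def recovery_and_attainability(case_ids, solves):
    """Return two dicts: recovery[c] is the set of cases that c can solve,
    attainability[c] is the set of cases that can solve c."""
    recovery = {c: set() for c in case_ids}
    attainability = {c: set() for c in case_ids}
    for src in case_ids:
        for tgt in case_ids:
            if solves(src, tgt):
                recovery[src].add(tgt)
                attainability[tgt].add(src)
    return recovery, attainability


# Toy cases (hypothetical): one numeric descriptor and a class label.
toy_cases = {"a": (1.0, "cyst"), "b": (1.2, "cyst"), "c": (5.0, "cancer")}


def toy_solves(src, tgt, radius=0.5):
    # Assumed rule: a source solves a target if both share the same class
    # and their descriptors differ by at most `radius`.
    (x_src, y_src), (x_tgt, y_tgt) = toy_cases[src], toy_cases[tgt]
    return y_src == y_tgt and abs(x_src - x_tgt) <= radius


recovery, attainability = recovery_and_attainability(list(toy_cases), toy_solves)
```

In this toy example, cases "a" and "b" each recover both cyst cases, while "c" recovers only itself.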

After explaining the different criteria for evaluating a CB, the following subsections describe how they are used and evaluated in the different CBM strategies. Subsequently, the two strategies of adding and deleting cases are defined [4].

3.3.1 Case addition strategy

A reduced CB is constructed starting from an empty CB by successively adding cases so as to maximize a criterion. There are two methods, one maximizing the competence criterion and the other the performance criterion.
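As a rough illustration of the competence-maximizing variant, the following Python sketch greedily adds the case that solves the largest number of not-yet-solved cases. The function name `greedy_addition`, the `max_size` parameter, and the toy recovery sets are assumptions made for this example; the strategies cited above may select cases differently.

```python
def greedy_addition(recovery, max_size):
    """Build a reduced case base from an empty one.

    recovery maps each case id to the set of cases it can solve.
    At each step the case with the largest gain in solved cases is added.
    """
    selected = []
    solved = set()
    candidates = set(recovery)
    while candidates and len(selected) < max_size:
        # sorted() makes tie-breaking deterministic (first id wins).
        best = max(sorted(candidates), key=lambda c: len(recovery[c] - solved))
        if not recovery[best] - solved:
            break  # no remaining case increases competence
        selected.append(best)
        solved |= recovery[best]
        candidates.remove(best)
    return selected, solved


# Toy recovery sets (hypothetical).
toy_recovery = {"a": {"a", "b"}, "b": {"b"}, "c": {"c", "d"}, "d": {"d"}}
reduced_cb, solved_cases = greedy_addition(toy_recovery, max_size=2)
```

Here two cases ("a" and "c") are enough to solve all four toy problems, so "b" and "d" never enter the reduced CB.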

3.3.2 Case deletion strategy

From a CB, this strategy evaluates cases according to a criterion to remove them and reduce the CB to a given number of cases. Evaluation criteria such as competence, redundancy, and inconsistency have been used in different methods.

4. Proposed approach

This section presents the design of our work: its main objectives, its general architecture, and its different components. In summary, automatic classification (clustering) consists of grouping, in an unsupervised way, a set of objects, or more broadly of data, so that the objects of the same group (called a cluster) are closer to each other, in the sense of a chosen (dis)similarity criterion, than to those of the other groups. Clustering is a main task in data mining and a statistical data analysis technique widely used in many fields, including machine learning, pattern recognition, signal and image processing, and information retrieval. Several methods have been developed in this context, the most popular being K-means, which owes its popularity to its simplicity and its ability to process large datasets. In addition, this section presents a deletion strategy for CBM that uses the K-means clustering algorithm, together with its comparison against the case deletion criteria (case identification, case type, data nature). The database is grouped into k classes (clusters). This work aims to develop a CBM system for the CBR approach. We used K-means clustering of breast diseases (number of clusters $=$ 5) to design an efficient and reliable diagnostic support system (Fig. 3).

Table 1
Description of the breast disease dataset [28]

Id	Attribute	Value
S1	Age	$>$ 35 and $\leqslant$ 65, $\leqslant$ 35, $>$ 65
S2	ANTECEDENT_fami	Yes, no, no, no.
S3	Gender	F
S4	flow_mame	Yes
S5	retraction _mame	Yes, no, no.
S6	Adenopathy	Yes, no, no.
S7	modification_tc	Yes, no, no.
S8	Mass	Yes, no, no.
S9	Mastodynia	Yes, no, no.
S10	Mammography	Yes
S11	Opacity	Yes, no, no.
S12	Clarte	Yes, no, no.
S13	nothing_a_report	Yes, no, no.
S14	Shape	2 cm–5 cm, $>$ 5 cm, $<$ 2 cm, none
S15	Size	Yes, no, no, no.
S16	Contours	Oval, round, speculated, nothingness
S17	Homogeinite	Irregular, regular, nothingness
S18	Microcalcification	Heterogeneous, homogeneous, homogeneous, nothingness
S19	Testing	m, b
S20	Cytology	m, b
S21	Biopsy	m, ‘inconclusive’, b
S22	Disease	Cancer, cyst, adenofibroma, lipoma, abscess

Figure 3.

Detailed system architecture.

4.1 Description of the database

The database used in this paper is the breast disease database established by the IBN SINA Annaba Hospital (Algeria) [28, 29]. It is a multi-class database containing 100 instances (patients) distributed over five diseases: 25 patients with cancer, 22 with cyst, 34 with adenofibroma, 12 with lipoma, and 7 with abscess. Each patient is characterized by a vector of 22 attributes, including the class attribute that gives the disease (cancer, lipoma, abscess, adenofibroma, cyst) (Table 1).

4.2 Data pre-processing

The quality of the results that an automatic classifier can provide depends primarily on the pre-processing phase: poorly pre-processed data will jeopardize the quality of the classifier. In our case, all available data represent software measurements stored in Access files. Files can only be loaded in ARFF (Attribute-Relation File Format) format, so our Access files had to be converted into ARFF files. We eliminated missing values from the database because the K-means algorithm does not process missing data. The database was then transformed into a numerical database for the CBR approach during the classification step. In addition, a learning base was created with 70% of the original base and a test base with the remaining 30%. For our implementation, we used the Python libraries NumPy, pandas, and math (import numpy as np, import pandas as pd, import math) to handle the clustering algorithm, the mathematical functions, the display of the clusters, the management of input and output, etc.
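The pre-processing steps described above (dropping missing values, numerical encoding, 70%/30% split) can be sketched with pandas and NumPy as follows. This is only a minimal illustration: the function name `preprocess`, the integer encoding via `pd.factorize`, the random seed, and the toy rows are assumptions, and the authors' actual conversion from Access/ARFF files is not reproduced here.

```python
import numpy as np
import pandas as pd


def preprocess(df, label_col, train_frac=0.7, seed=0):
    """Drop incomplete rows, encode attributes as integers, split train/test."""
    clean = df.dropna().reset_index(drop=True)
    encoded = clean.copy()
    for col in encoded.columns:
        if col != label_col:
            # Assumed encoding: each distinct value becomes an integer code.
            encoded[col] = pd.factorize(encoded[col])[0]
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(encoded))
    n_train = int(round(train_frac * len(encoded)))
    return encoded.iloc[order[:n_train]], encoded.iloc[order[n_train:]]


# Toy rows (hypothetical values; only the column names echo Table 1).
raw = pd.DataFrame({
    "Age": ["<=35", ">65", "35-65", "35-65", None, "<=35",
            ">65", "35-65", "<=35", ">65", "35-65"],
    "Mass": ["yes", "no", "yes", "yes", "no", "no",
             "yes", "no", "yes", "no", "yes"],
    "Disease": ["cancer", "cyst", "cancer", "lipoma", "cyst", "cyst",
                "cancer", "abscess", "cancer", "cyst", "adenofibroma"],
})
train_set, test_set = preprocess(raw, label_col="Disease")
```

The class column is left untouched in this sketch so that it can later be compared with the cluster assignments.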

4.3 Clustering (K-means algorithm)

The choice of a learning method is very important. We used clustering to support decision-making in the CBR system, and chose K-means as the clustering method. Unsupervised clustering has shown interesting results in improving data representation and classification results [30, 31], which gives us a promising line of research to explore. The K-means algorithm, developed by MacQueen [32], is one of the simplest unsupervised learning algorithms. Also called the moving centers algorithm [33, 34], it assigns each point to the cluster with the nearest center (centroid). The center is the average of all the points in the cluster: its coordinates are the arithmetic means, computed separately for each dimension, over all the points in the cluster, i.e., each cluster is represented by its center of gravity.

Let $N=\{x_{1},\dots,x_{n}\}$ be the set of $n$ objects to be clustered by a similarity criterion, where $x_{i}\in\Re^{d}$ for $i=1,\ldots,n$ and $d\geqslant 1$ is the number of dimensions. Additionally, let $k\geqslant 2$ be an integer and $K=\{1,\ldots,k\}$ . For a $k$ -partition $P=\{G(1),\ldots,G(k)\}$ of $N$ , let $\mu_{j}$ denote the centroid of cluster $G(j)$ , for $j\in K$ , and let $M=\{\mu_{1},\ldots,\mu_{k}\}$ and $W=\{w_{11},\ldots,w_{nk}\}$ , where the $w_{ij}$ are binary assignment variables. The clustering problem can then be formulated as an optimization problem [32], which is described by Eq. (1):

$$\begin{aligned} P:\ \text{minimise}\quad & z(W,M)=\sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}\,d(x_{i},\mu_{j}) \\ \text{subject to}\quad & \sum_{j=1}^{k} w_{ij}=1,\quad \text{for}\ i=1,\ldots,n, \\ & w_{ij}=0\ \text{or}\ 1,\quad \text{for}\ i=1,\ldots,n\ \text{and}\ j=1,\ldots,k \end{aligned} \tag{1}$$

where, $w_{ij}=$ 1 implies object $x_{i}$ belongs to cluster $G(j)$ and $d(x_{i}$ , $\mu_{j}$ ) denotes the Euclidean distance between $x_{i}$ and $\mu_{j}$ for $i=1,{\ldots},n$ and $j=1,{\ldots},k$ .
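For a given assignment, the objective $z(W,M)$ can be evaluated directly. The NumPy sketch below assumes that the assignment is stored as an integer label per object (so that $w_{ij}=1$ exactly when the label of $x_{i}$ is $j$); the function name `kmeans_objective` and the toy points are illustrative assumptions.

```python
import numpy as np


def kmeans_objective(X, labels, centroids):
    """Sum of Euclidean distances between each object and its assigned centroid.

    Because each object belongs to exactly one cluster, the double sum over
    i and j reduces to one distance per object.
    """
    diffs = X - centroids[labels]
    return float(np.linalg.norm(diffs, axis=1).sum())


# Toy example: two objects around (0, 1) and one object at (10, 0).
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
labels_toy = np.array([0, 0, 1])
centroids_toy = np.array([[0.0, 1.0], [10.0, 0.0]])
z_toy = kmeans_objective(X_toy, labels_toy, centroids_toy)
```

With these toy values the three distances are 1, 1 and 0, so the objective equals 2.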

The pseudo-code of the K-means algorithm is summarized in Algorithm 1 [32].

Algorithm 1. Standard K-means algorithm
# Initialization:
	$N:=\{x_{1},\ldots,x_{n}\}$ ;
	$M:=\{\mu_{1},\ldots,\mu_{k}\}$ ;
# Classification:
	For each $x_{i}\in N$ :
		Calculate the Euclidean distance from $x_{i}$ to each of the $k$ centroids $\mu_{j}\in M$ ;
		Assign object $x_{i}$ to the closest centroid $\mu_{j}$ ;
# Centroid calculation:
	For each $j\in K$ : calculate centroid $\mu_{j}$ as the mean of the objects assigned to it;
# Convergence:
	If $M=\{\mu_{1},\ldots,\mu_{k}\}$ remains unchanged in two consecutive iterations then:
		Stop the algorithm;
	else:
		Go to Classification
End
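The steps of Algorithm 1 can be sketched in NumPy as follows. Algorithm 1 does not specify how the initial centroids are chosen; drawing $k$ distinct objects at random, the iteration cap `max_iter`, and the handling of empty clusters are assumptions of this sketch rather than part of the published algorithm.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialization (assumed): k distinct objects drawn at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Classification: assign each object to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centroid calculation: mean of the objects assigned to each cluster.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Convergence: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment, consistent with the returned centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids


# Toy data: two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo = kmeans(X_demo, k=2)
```

On the toy data the two groups end up in different clusters; on real data the result can depend on the initial centroids, which is why the seed is exposed as a parameter.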

In this paper, a CBM approach based on clustering is proposed and implemented. First, the CB is grouped into $k$ clusters using the K-means clustering algorithm (Fig. 4), with $K=$ 5 clusters for our database. Each cluster is represented by its mean. Three categories of cases are distinguished. Pivot cases are the cases that play a primary role in the resolution of the majority of cases (the most relevant cases). The cases most similar to the pivot cases are designated as Support cases, and the remaining cases are Auxiliary cases.

Figure 4.

Categorization of cases in a clustered CB.

The difference between these case categories is represented in Fig. 4, where a sample of cases is grouped into five clusters. Pivot cases are shown as stars, support cases as black squares, and auxiliary cases as black dots. In such a scenario, cases located at the cluster border are not adequately represented by the corresponding centers because of the clusters’ larger radius. The proposed approach searches for these cases in order to delete them as misclassified cases. Applying K-means to our database groups it into 5 clusters ( $K=$ 5) (Fig. 5).
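One possible way to compute this categorization is sketched below. The text does not give exact thresholds, so the rules used here are assumptions: the pivot of a cluster is taken to be the case closest to the centroid, support cases are those whose distance to the pivot is at most a hypothetical fraction `support_fraction` of the largest such distance in the cluster, and all other cases are auxiliary.

```python
import numpy as np


def categorize_cases(X, labels, centroids, support_fraction=0.5):
    """Label every case as 'pivot', 'support' or 'auxiliary' (assumed rules)."""
    categories = np.empty(len(X), dtype=object)
    for j in range(len(centroids)):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue
        # Pivot: the case closest to the cluster centroid.
        d_centroid = np.linalg.norm(X[idx] - centroids[j], axis=1)
        pivot = idx[d_centroid.argmin()]
        # Support: cases close enough to the pivot; the rest are auxiliary.
        d_pivot = np.linalg.norm(X[idx] - X[pivot], axis=1)
        cutoff = support_fraction * d_pivot.max()
        for i, d in zip(idx, d_pivot):
            if i == pivot:
                categories[i] = "pivot"
            elif d <= cutoff:
                categories[i] = "support"
            else:
                categories[i] = "auxiliary"
    return categories


# Toy cluster: four cases on a line, centroid at their mean (1, 0).
X_cat = np.array([[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [4.0, 0.0]])
labels_cat = np.array([0, 0, 0, 0])
centroids_cat = X_cat.mean(axis=0, keepdims=True)
categories_cat = categorize_cases(X_cat, labels_cat, centroids_cat)
```

In the toy cluster, the case at (1, 0) is the pivot, the case at (0, 0) is a support case, and the two farthest cases are auxiliary, mirroring the border cases discussed above.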

Table 2 shows that K-means divides the CB (breast diseases) into 5 clusters (cluster 0, cluster 1, cluster 2, cluster 3 and cluster 4) and distributes the 5 classes (diseases: cancer, cyst, adenofibroma, lipoma, and abscess) among the clusters according to their proximity to the centers of gravity. Figure 5 illustrates that there are misclassified cases (described in Table 2).

Table 2

Result of K-means of grouping into 5 clusters

Cluster	Cluster 0	Cluster 1	Cluster 2	Cluster 3	Cluster 4
Class	Cancer lipoma	Adenofibroma lipoma abscess	Cancer	Cancer	Cyst
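Mixed clusters such as cluster 0 and cluster 1 in Table 2 are where misclassified cases appear. A simple way to flag them, assuming that a case is misclassified when its disease differs from the majority disease of its cluster (an assumption of this sketch, since the exact rule is not spelled out here), is:

```python
from collections import Counter


def misclassified_cases(labels, classes):
    """Indices of cases whose class differs from their cluster's majority class."""
    by_cluster = {}
    for i, (cluster, cls) in enumerate(zip(labels, classes)):
        by_cluster.setdefault(cluster, []).append((i, cls))
    flagged = []
    for members in by_cluster.values():
        majority, _ = Counter(cls for _, cls in members).most_common(1)[0]
        flagged.extend(i for i, cls in members if cls != majority)
    return sorted(flagged)


# Toy assignment (hypothetical): one lipoma case falls in a cancer cluster.
toy_labels = [0, 0, 0, 1, 1]
toy_classes = ["cancer", "cancer", "lipoma", "cyst", "cyst"]
flagged_cases = misclassified_cases(toy_labels, toy_classes)
```

Only the lipoma case (index 2) is flagged in this toy example.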

Figure 5.

Clustering of the breast disease database into 5 clusters.

4.4 Description of proposed case base maintenance algorithm

These results show that CB maintenance is necessary; we chose the strategy of removing misclassified cases according to the algorithm given below. Following a determined test criterion (a deletion threshold), the maintenance algorithm is applied to the case base (breast disease) and removes misclassified cases. Algorithm 2 shows the proposed CBM algorithm and its deletion policy, under which misclassified auxiliary cases are the ones most likely to be deleted. Among the misclassified cases, we deleted the instances (patients) whose most effective attributes (called master symptoms) have a minimum number of positive values (i.e., a maximum number of negative values) according to expert opinion. In our breast disease CB, we relied on medical imaging attributes as follows:

• Master symptom $=$ the factor that causes the disease (defined by the doctor);
• Negative value $=$ false;
• Positive value $=$ true.

Algorithm 2 summarizes the pseudo-code of the proposed maintenance algorithm (deletion policy).

Algorithm 2: Proposed case base maintenance algorithm
1:	Input:
2:	Case Base (CB) // Data set divided into train set and test set
3:	All clusters obtained by K-means ( $k=$ 5 clusters) on the train set;
4:	Output:
5:	A maintained case base (reduced case base), competence, performance
6:	Begin
7:	Repeat
8:	Determine the list of cases (instances) for each cluster;
9:	Determine the list of misclassified cases;
10:	For each (misclassified Case) do
11:	if (Case is a pivot) then Save (Case)
12:	else if (Case is a support) then Save (Case)
13:	else if (Case is an auxiliary) then
14:	Begin
15:	Refer to “Medical Imaging Attributes”
16:	Repeat
17:	if the number of master symptoms $=$ maximum_value_negative_number then Delete (corresponding misclassified case)
18:	Until (maximum_value_negative_number $\leqslant$ maximum_value_positive_number)
19:	End
20:	Until we obtain a well-maintained case base with better competence and performance;
21:	End.
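A single deletion pass of Algorithm 2 could look like the following Python sketch. It is a simplified reading, not the authors' implementation: cases are assumed to be dictionaries of boolean imaging attributes, the role of each case is assumed to be known in advance, and `max_positive` is a hypothetical stand-in for the deletion threshold fixed by the expert.

```python
def maintain_case_base(cases, misclassified, roles, master_symptoms, max_positive=0):
    """Return the indices of the cases kept after one deletion pass.

    cases: list of dicts mapping attribute name -> bool (True = positive).
    misclassified: indices of the misclassified cases.
    roles: 'pivot', 'support' or 'auxiliary' for every case.
    master_symptoms: attribute names chosen by the expert.
    max_positive: hypothetical threshold; a misclassified auxiliary case
        is deleted when it has at most this many positive master symptoms.
    """
    to_delete = set()
    for i in misclassified:
        if roles[i] in ("pivot", "support"):
            continue  # pivot and support cases are always saved
        positives = sum(bool(cases[i].get(s, False)) for s in master_symptoms)
        if positives <= max_positive:
            to_delete.add(i)
    return [i for i in range(len(cases)) if i not in to_delete]
```

In a full maintenance loop, clustering and this pass would be repeated until the competence and performance of the case base stop improving.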

4.4.1 Learning phase


The clustering process always requires a learning base as input. Creating a learning base means having individuals (in our case, patients with breast disease) whose class membership is known with certainty. Our database contains 100 patients, which we divided into two sub-databases: a learning database of 70 patients and a test database of 30 patients. To evaluate the algorithm, we repeated this split for 10 tests, changing the learning and test sets each time.
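The repeated 70/30 split described above can be sketched as below. The function name, the fixed seed, and the use of Python's `random` module are assumptions made for illustration; the text does not say how the test sets were drawn.

```python
import random


def holdout_splits(n_cases, n_test, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated random hold-out splits."""
    rng = random.Random(seed)
    indices = list(range(n_cases))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # The first n_test shuffled indices form the test set.
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


# Ten splits of 100 patients into 70 learning and 30 test cases.
splits = list(holdout_splits(100, 30))
```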

Table 3
Clustering of the breast disease CB before maintenance

No. iteration	Number of clusters	Number of cases per cluster	Learning rate	Number of cases per cluster	Test rate

19%

30%

17%

30%

27%

10%

28%

17%

4.4.2 Testing phase

It is interesting to evaluate the algorithm’s performance on an independent data set: the test set. Indeed, we cannot rely on the results obtained on the training set because the learned model is no longer independent of these data. This phase allows us to assess the generalization power of the classifier, i.e., its capacity to obtain good results on any set of data from the same distribution. For the test set, we used 30 patients. We performed several tests, changing the test set each time, to deduce the best performance and competence rates.

5. Validation

According to the above definitions, we define a validation technique and two evaluation metrics: competence and performance.

5.1 Random

Validation with random splitting is an efficient way to evaluate the experimental results. The data set is divided into a fixed number of subsets ( $k$ ), and the holdout method is repeated $k$ times (in our case, $k=$ 10). Each time, one of the $k$ subsets is used as a test set, and the remaining $k-1$ subsets are combined to form a training (learning) set. Then, the average error over all $k$ tests is calculated. The advantage of this method is that the way the data is divided matters less: each data point occurs exactly once in a test set and $k-1$ times in a training set. The variance of the resulting estimate is reduced as $k$ increases. A variant of this method is to randomly split the data into a test set and a training set $k$ different times.
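The $k$-fold procedure described in this subsection can be illustrated with a short sketch. The helper name and the seed are hypothetical; the only property the sketch aims to show is that every case appears exactly once in a test set and $k-1$ times in a training set.

```python
import random


def k_fold_splits(n_cases, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold validation."""
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

The average error is then obtained by evaluating the model on each test fold and averaging the $k$ results.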

5.2 Performance measures

In unsupervised K-means clustering, performance is measured by the response time. In addition, the error rate evaluates the system’s success and is used to choose the best competence rate for this unsupervised learning method. The measures used are explained below:

• Competence: is measured by the number of different problems for which the system provides a good solution. Thus, competence is the ability of a CB to solve certain problems. It is calculated according to the cluster sum of squared errors, noted $J$ (Eq. (2)), i.e., the quadratic error between each case and the center of its cluster:

$\displaystyle J=\sum\limits_{i=1}^{k}\sum\limits_{x_{j}\in C_{i}}\|x_{j}-c_{i}\|^{2}$ (2)

where $x_{j}$ : a case of the cluster $C_{i}$ , $c_{i}$ : centre of gravity of the cluster $C_{i}$ , $k$ : number of clusters.
• Performance: is measured by the response time required to propose a solution to a target case. This measure is directly related to adaptation costs and search costs. It is calculated as the time taken to build the model.
• Accuracy (classification rate): performance can also be calculated based on the precision of the CB, which represents the classification power of the CB.
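As an illustration of the competence measure, the cluster sum of squared errors of Eq. (2) can be computed as in the sketch below. The function name and the toy data are assumptions; cases are taken to be numeric tuples and each case carries the index of its cluster.

```python
import math


def sum_of_squared_errors(cases, labels, centroids):
    """Cluster sum of squared errors: squared Euclidean distance of each
    case to the centroid of the cluster it is assigned to, summed over
    all cases."""
    return sum(
        math.dist(x, centroids[lab]) ** 2 for x, lab in zip(cases, labels)
    )


# Toy example: two clusters with centroids (1, 0) and (10, 0).
cases = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 1]
centroids = [(1.0, 0.0), (10.0, 0.0)]
J = sum_of_squared_errors(cases, labels, centroids)
```

A smaller value of the sum indicates tighter clusters, which is why a decrease after maintenance is read as an improvement in competence.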

6. Experiment and analysis

6.1 Unsupervised clustering of K-means and CBM

This section evaluates the unsupervised clustering of a multi-class database using the K-means approach before and after maintenance on our breast disease database. We split the database (CB) into two sub-bases: the learning database contains 70 patients and the test database contains 30 patients. Tables 3 and 4 summarize the clustering of the database before and after maintenance:

Figure 6.

Clustering result diagram before maintenance of the breast disease database.

Figure 6 shows the result obtained by clustering on the initial base (before maintenance).

This first step identifies 36 misclassified cases, to which we apply our deletion algorithm.

Table 4

Clustering of the breast disease database after maintenance

No. iteration	Number of clusters	Number of cases per cluster	Learning rate	Number of cases per cluster	Test rate

33%

44%

27%

22%

23%

26%

15%

Table 5

Performance and competence statistics of the breast disease CB resulting from the proposed maintenance algorithm

Characteristic of the case base	Before	After
Initial size of the case base	100	87
Number of misclassified cases	36	24
Number of cases deleted	13	/
Size of the case base obtained (after maintenance)	87	/
Performance rates of the case base
Reduction	13%	/
Error rate	36%	27.58%
Accuracy rate (classification rate)	64%	72.42%
Competence of the case base	250.0	182.0

The following figure (Fig. 7) shows the result obtained by clustering after maintenance.

These two diagrams show an improvement in the learning phase. In addition, there is a reduction in the learning and testing rates which shows that the reduced base (by removing the misclassified cases) is more efficient than the initial base for the classification of cases.

In the second step, applying K-means to the new CB (cleaned base), we obtain a base of 87 cases after deleting 13 cases, of which 24 are still misclassified. Thus, we can see an improvement in learning of about 13%.

Table 5 summarizes the statistics on the performance and competence of the breast disease CB resulting from the maintenance method.

Table 6

Competence and performance measurement of the breast disease database before and after maintenance

Tests	K-means before maintenance		K-means after maintenance
	Competence	Performance	Competence	Performance
Test 1	158	0.02	120	0.02
Test 2	158	0	120	0.01
Test 3	158	0.02	120	0.01
Test 4	158	0	120	0.01
Test 5	158	0	120	0.00
Test 6	158	0.01	120	0
Test 7	158	0.01	120	0
Test 8	158	0	120	0
Test 9	158	0	120	0
Test 10	158	0	120	0
Average	158	0.0044	120	0.0033

Figure 7.

Clustering result diagram after the maintenance of the Breast Disease database.

The performance, expressed as the reduction rate of the CB relative to accuracy, shows a good result, since the CB is reduced by 13%. Table 6 reports the clustering results with K-means, with the aim of improving the response time needed to propose a solution to a target case. This time is directly related to the adaptation and search costs in a cleaned and maintained database. Table 6 shows the competence and performance obtained by K-means on our base:

• Competence: cluster sum of squared errors;
• Performance: response time (time taken to build the model) in seconds.

Figures 8 and 9 illustrate the performance measures and system competence.

Figure 8.

Diagram of the performance obtained by K-means before and after maintenance.

Based on Fig. 8, we observed that the response time is better after maintenance than before over the 10 tests performed on the base; it remains almost stable from the 8th test for pre-maintenance learning and stabilizes from the 5th test in post-maintenance learning. We also found from Table 6 that the sum of squared errors, which determines the system’s competence, is lower after maintenance than before. This indicates the importance of cleaning the base and removing the misclassified cases, which makes the system more efficient, as shown in Fig. 9.

Table 7

Characteristics of the four databases used (our breast disease database and the three UCI datasets)

Bases	Size	Number of attributes	Nature	Class
Breast diseases	100	22	Multi-class	1 $=$ Cancer; 2 $=$ Cyst; 3 $=$ Adenofibroma; 4 $=$ Lipoma; 5 $=$ Abscess
Thyroid	215	5	Multi-class	1 $=$ Normal; 2 $=$ Hyper; 3 $=$ Hypo
Hepatitis	80	20	Bi-class	Die; Live
Breast cancer Wisconsin	683	11	Bi-class	Benign; Malignant

Figure 9.

Diagram of the competence obtained by K-means before and after maintenance.

6.2 Comparison result

In this section, we compare the results of our approach on our breast disease database with those obtained on three medical databases from UCI. Four databases were therefore used for this comparison: our breast disease database and three databases available on the UCI machine learning databases website, namely the Hepatitis database, the Wisconsin Breast Cancer database, and the Thyroid database. Table 7 presents the characteristics of these databases, and the three UCI databases are described below.

6.2.1 Hepatitis database

This database contains 80 instances (patients) and 20 attributes, including the class attribute; 67 patients are classified as alive and 13 as dead. It is therefore a two-class database.

6.2.2 Wisconsin Breast Cancer Database

The breast database (Wisconsin Breast Cancer) is a two-class database. It contains 683 instances (patients) and 11 attributes, including the class attribute {malignant, benign}; 444 instances are benign and 239 are malignant.

6.2.3 Thyroid database

Thyroid is a multi-class database. It contains 215 instances (patients) and 6 attributes, including a class attribute {normal, hyper, hypo}: the normal class contains 150 patients, the hyper class 35 patients, and the hypo class 30 patients.

6.3 Unsupervised clustering of K-means and maintenance of the three bases

This section evaluates unsupervised clustering with the K-means approach before and after maintenance on the three databases used for the comparison. Each database (CB) was divided into two sub-bases: a training base containing about 70% of the patients and a test base containing the remaining 30% (see the per-cluster counts in Tables 8, 10, and 12). Tables 8–13 summarize the clustering of each database before and after maintenance.

Table 8
Clustering of the Thyroid database before maintenance

No. iteration	Number of clusters	Number of cases per cluster	Learning rate	Number of cases per cluster	Test rate
4	0	20	13%	10	15%
	1	107	71%	43	66%
	2	23	16%	12	18%

6.3.1 The Thyroid base

We identified 89 misclassified cases, and by applying our deletion algorithm, we obtained the results listed in Table 9.

Table 9
Thyroid base clustering after maintenance

No. iteration	Number of clusters	Number of cases per cluster	Learning rate	Number of cases per cluster	Test rate
4	0	17	12%	12	20%
	1	106	76%	38	62%
	2	17	12%	11	18%

6.3.2 The Hepatitis base

Table 10
Clustering of the Hepatitis database before maintenance

No. iteration	Number of clusters	Number of cases per cluster	Learning rate	Number of cases per cluster	Test rate
5	0	34	61%	15	62%
	1	22	39%	9	38%

We identified 29 misclassified cases, and by applying our deletion algorithm, we obtained the results summarized in Table 11.

Table 11

Clustering of the Hepatitis database after maintenance

No. iteration	Number of clusters	Number of cases per cluster	Learning rate	Number of cases per cluster	Test rate
4	0	19	37%	7	30%
	1	33	63%	16	70%

6.3.3 The Wisconsin Breast Cancer database

Table 12
Clustering of the Wisconsin Breast Cancer database before maintenance

No. iteration	Number of clusters	Number of cases per cluster	Learning rate	Number of cases per cluster	Test rate
5	0	293	61%	147	72%
	1	185	39%	58	28%

We identified 313 misclassified cases, and by applying our deletion algorithm, we obtained the results listed in Table 13.

Table 13

Clustering of the Wisconsin Breast Cancer database after maintenance

No. iteration	Number of clusters	Number of cases per cluster	Learning rate	Number of cases per cluster	Test rate
6	0	261	60%	123	65%
	1	176	40%	65	35%

This section evaluated the results obtained by the K-means algorithm before and after maintenance on the three bases: Thyroid, Hepatitis, and Wisconsin Breast Cancer. Tables 8–13 show an improvement in the training phase, with a reduction in the training and testing rates, indicating that the reduced base (obtained by deleting the misclassified cases) performs better than the initial base for case classification.

6.4 Evaluation of the results of the three databases used

Tables 14–16 summarize the maintenance results for the three databases with K-means clustering. The aim is to reduce the response time needed to propose a solution to a target case, which is directly related to the adaptation and search costs in a cleaned and maintained database. Tables 14–16 show the competence and performance obtained by K-means on the three bases used: Thyroid, Hepatitis, and Wisconsin Breast Cancer.

6.4.1 The Thyroid base

Table 14
Competence and performance measurement of the Thyroid base before and after maintenance

Tests	K-means Before Maintenance		K-means After Maintenance
	Competence	Performance	Competence	Performance
Test 1	8.634	0.00	8.309	0.00
Test 2	8.634	0.01	8.309	0.00
Test 3	8.634	0.00	8.309	0.00
Test 4	8.634	0.00	8.309	0.00
Test 5	8.634	0.00	8.309	0.00
Test 6	8.634	0.01	8.309	0.00
Test 7	8.634	0.00	8.309	0.00
Test 8	8.634	0.00	8.309	0.00
Test 9	8.634	0.00	8.309	0.00
Test 10	8.634	0.00	8.309	0.02
Average	8.634	0.001	8.309	0.002

Competence: Cluster sum of squared errors. Performance: Response time (time taken to build a model) in seconds.

6.4.2 The Hepatitis base

Table 15
Competence and performance measurement of the Hepatitis base before and after maintenance

Tests	K-means Before Maintenance		K-means After Maintenance
	Competence	Performance	Competence	Performance
Test 1	169.997	0.00	154.470	0.00
Test 2	169.997	0.00	154.470	0.00
Test 3	169.997	0.02	154.470	0.01
Test 4	169.997	0.00	154.470	0.01
Test 5	169.997	0.00	154.470	0.00
Test 6	169.997	0.00	154.470	0.00
Test 7	169.997	0.00	154.470	0.00
Test 8	169.997	0.00	154.470	0.00
Test 9	169.997	0.00	154.470	0.00
Test 10	169.997	0.00	154.470	0.00
Average	169.997	0.002	154.470	0.00125

6.4.3 Wisconsin Breast Cancer database

Table 16
Competence and performance measurement of the Wisconsin Breast Cancer base before and after maintenance

Tests	K-means before maintenance		K-means after maintenance
	Competence	Performance	Competence	Performance
Test 1	191.869	0.02	187.323	0.00
Test 2	191.869	0.02	191.869	0.02
Test 3	191.869	0.00	191.869	0.01
Test 4	191.869	0.00	191.869	0.00
Test 5	191.869	0.00	191.869	0.02
Test 6	191.869	0.02	191.869	0.00
Test 7	191.869	0.00	191.869	0.00
Test 8	191.869	0.00	191.869	0.02
Test 9	191.869	0.02	191.869	0.00
Test 10	191.869	0.00	191.869	0.00
Average	191.869	0.008	191.415	0.0055

Tables 14–16 show the performance and competence measures obtained by the unsupervised K-means clustering approach before and after CBM on the three databases: Thyroid, Hepatitis, and Wisconsin Breast Cancer. The results indicate that the performance on the Thyroid, Hepatitis, and Wisconsin Breast Cancer databases changes after maintenance, although only slightly (by about 0.001, 0.00075, and 0.0025 seconds, respectively), as does the competence, which shows the importance of CBM.

6.5 Comparative study between databases according to performance criteria

By applying the K-means on the four bases of comparison (maintained bases), we obtain the statistics summarized in Table 17.

Table 17
Statistics on the performance and competence of the four databases resulting from the maintenance method

	Breast diseases		Thyroid		Hepatitis		Wisconsin Breast Cancer
Bases	Before	After	Before	After	Before	After	Before	After
Features and characteristics of the case base
Initial size of the case base	100	87	215	201	80	75	683	625
Number of misclassified cases	36	24	89	38	29	11	313	5
Number of cases deleted	13	/	14	/	5	/	58	/
Size of the case base obtained (after maintenance)	87	/	201	/	75	/	625	/
Performance rates of the case base
Reduction	13%	/	6.15%	/	6.25%	/	8.49%	/
Error rate	36%	27.58%	41.39%	18.90%	36.25%	14.66%	45.83%	0.80%
Accuracy rate (classification rate)	64%	72.42%	58.61%	81.10%	63.75%	85.34%	54.17%	99.20%
Competence of the case base	250.0	182.0	8.634	8.309	169.997	154.470	191.869	187.323

An overall analysis of the results presented in Table 17 shows the performance rates (reduction, error rate, and accuracy rate) for the four bases. The reduction is largest for the breast disease base, at 13% of the initial base, and more moderate for the other three bases (6.15%, 6.25%, and 8.49%, respectively). After the misclassified cases are deleted, the error rate decreases by 8.42, 22.49, 21.59, and 45.03 percentage points, respectively, for the four bases. This indicates better learning and serves the goal of a more efficient system. The accuracy rates (classification rates) obtained with our algorithm are encouraging: the classification rate is higher after maintenance than before the removal of misclassified cases, with an improvement of 8.42 points for the breast disease database and of 22.49, 21.59, and 45.03 points for the other three databases used for comparison. Figures 10 and 11 show the differences between the two maintenance phases according to the performance measures.
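The error and accuracy rates in Table 17 follow directly from the case counts, as the short sketch below illustrates. The function names are hypothetical, and the recomputed values may differ from the printed ones in the last digit, since the table does not state how its figures were rounded.

```python
def error_rate(misclassified, size):
    """Percentage of misclassified cases in a case base of the given size."""
    return 100.0 * misclassified / size


def accuracy_rate(misclassified, size):
    """Classification rate, i.e. the complement of the error rate."""
    return 100.0 - error_rate(misclassified, size)


# (size before, misclassified before, size after, misclassified after),
# taken from the counts reported in Table 17.
bases = {
    "Breast diseases": (100, 36, 87, 24),
    "Thyroid": (215, 89, 201, 38),
    "Hepatitis": (80, 29, 75, 11),
    "Wisconsin Breast Cancer": (683, 313, 625, 5),
}

for name, (n0, m0, n1, m1) in bases.items():
    before = accuracy_rate(m0, n0)
    after = accuracy_rate(m1, n1)
    print(f"{name}: accuracy {before:.2f}% -> {after:.2f}% "
          f"(gain {after - before:.2f} points)")
```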

Figure 10.

Diagram showing the performance factors of the four CBs before maintenance.

Figure 11.

Diagram showing the performance factors of the four CBs after maintenance.

Table 18 shows the results obtained by the unsupervised K-means clustering approach before and after database maintenance (cleaning), together with the performance and competence measures obtained on the four databases (the ten tests performed for the three comparison databases are reported in Section 6.4).

Table 18

Comparative study between the four databases used

The databases	Before maintenance		After maintenance
	Competence	Performance	Competence	Performance
Breast diseases (Our database)	158	0.004	120	0.003
Thyroid	8.634	0.002	8.31	0.002
Hepatitis	169.997	0.002	154.47	0.00125
Breast	191.869	0.008	191.41	0.005

Table 18 shows that the response time of our system improves after maintenance, although only slightly (by about 0.00111 seconds), as does the competence, which shows the importance of database maintenance. Figures 12 and 13 confirm this finding.

Figure 12.

Diagram of the performance obtained by K-means before and after maintenance for the four databases used.

We found that the response times are almost stable for the breast disease base. In the Thyroid base, however, the performance decreased somewhat, which we attribute to its multi-class nature, as mentioned before, whereas in the other two bases (Wisconsin Breast Cancer and Hepatitis) the response time is reduced. We attribute this gain in performance to the two-class nature of these bases and to their numerical data. Competence, on the other hand, improves more for multi-class bases than for two-class bases, since the squared error is reduced more for multi-class bases (Fig. 13).

Figure 13.

Diagram of the competence obtained by K-means before and after maintenance for the four databases used.

We can therefore consider that, after maintenance, the four databases give satisfactory performance (in terms of execution or response time) and competence (in terms of squared error) compared to the initial databases (before maintenance). Moreover, the nature of the data plays an important role. With K-means, the proposed algorithm is more efficient on numerical data than on symbolic data, which is reflected in the highest classification rate being obtained on the Wisconsin Breast Cancer database. The number of classes also matters: the classification and learning rates are higher for two-class databases (such as Wisconsin Breast Cancer and Hepatitis) than for multi-class ones.

Table 19

Performance comparison by different methods with the proposed approach

Reference	Storage size of case base	Competence	Performance
Mohamed Karim Haouchine et al, 2006 [4]	10.99	92.33	–
Ali Rabia et al., 2010 [13]	80	68	40
Abir Smiti, Zied Elouedi, 2014 [35]	32.45	85.92	55
Abir Smiti, Zied Elouedi, 2018 [36]	49.60	98.98	50.1
Khan MJ, Hayat H and Awan I 2019 [16]	12	96.30	–
Our approach	68.30	99.20	55.5

6.6 Comparison between different methods

To make a comparative analysis of the proposed approach, we tested other CBM systems based on a deletion policy that are widely used in the literature. The results obtained by each method were then compared according to the evaluation criteria proposed in this work. Table 19 shows the comparison in terms of the performance and competence criteria for these deletion-based CBM systems. From the results obtained, we observe that our maintenance algorithm gives good competence compared to other methods [16, 36]. The K-means method solves an unsupervised task and does not require any prior information about the data. It is useful for discovering a hidden structure that improves the results of the CBM system (competence and performance). Therefore, clustering facilitates CB maintenance, and the structuring of attributes in turn facilitates clustering. The case deletion strategy improves the efficiency of the CB while preserving competence; its performance is more reliable than before maintenance, with a significant reduction of the base of about 13%.

The experimentation demonstrates that the proposed method is an interesting CBM approach, capable of efficiently reducing the CB size and retrieval time while achieving satisfactory competence and performance. We are aware that our approach has shortcomings; the strategy can be improved in future work by further exploring the membership values of redundant and misclassified cases with respect to multiple clusters. Furthermore, to show the scalability of our approach, we intend to evaluate it on real CBs.

7. Conclusion

Case base maintenance (CBM) is one of the most important research areas in CBR. The proposed clustering-based machine learning method promises increased competence and consistency over other methods. The model combines CBR and clustering. It is a way to organize the case base that facilitates the other steps of the CBR cycle, including similar-case retrieval and maintenance. Indeed, the effectiveness of a CBR system depends on the speed and quality of the case retrieval process. However, in a traditional CBR application, the CB grows rapidly, and its content can be extremely varied. Therefore, the case base must be properly organized (into clusters) during learning so that relevant cases can be retrieved, which is done here by K-means clustering.

The K-means algorithm is one of the most popular non-hierarchical clustering approaches and is among the most widely used tools in scientific and medical applications. Although it does not work very well for symbolic attributes, it makes good geometric and statistical sense for numerical attributes. The proposed deletion technique improved the efficiency of the case base while preserving competence, and its performance was 13% more reliable. The proposed approach exploits the clustering of similar cases using the K-means algorithm. It relies on a characterization of the different cases in the case base, obtained from a competence and performance criterion. From this categorization, the deletion of cases follows directly. The advantage of our application lies in using multi-class databases, which are less used in the CBR approach. The results obtained are satisfactory for improving the database, especially for the maintenance of our breast disease database, in terms of performance and competence.

Several perspectives emerge from the work done, and some research directions and extensions are thus possible:

First, it would be interesting to maintain the complete knowledge of the system and not only the case base. To this end, we consider maintaining the similarity container by changing the similarity measures. An update strategy is also conceivable to change the weights of the case descriptors in the case base, and we plan to increase the size of our breast disease database. Second, since the similarity computed by the model does not always match the human interpretation of similarity between cases, it would be interesting to introduce fuzzy logic in the calculation of similarity between two cases by defining states such as “somewhat similar”, “not very similar”, or “very similar”. Finally, we want to consider the impact of deep learning in the medical domain [37]; in [38], a deep belief network (DBN) is used in the classification process to diagnose the existence of a disease. This network has shown interesting results regarding feature selection and classification accuracy with a large storage size.

References

1. Smyth, Keane. ‘Remembering to forget’, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995, pp. 377-382.

2. Aamodt, Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications. 1994; 7(1): 39-59.

3. Aamodt. Knowledge-Intensive Case-Based Reasoning and Sustained Learning. Proc. of the 9th European Conference on Artificial Intelligence, ECCBR’04, Lecture Notes in Artificial Intelligence, 2004, pp. 1-15, Springer.

4. Haouchine, Chebel-Morello, Zerhouni. Maintenance d’un système de raisonnement à partir de cas, in: International Conference, Modelling and Diagnosis, ICCMD’06, Université Badji Mokhtar, Annaba, Algérie, 2006.

5. Smyth. Case-base maintenance, in: Pasqual del Pobil, Mira, Ali, eds, Tasks and Methods in Applied Artificial Intelligence. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), 1998, pp. 507-516. doi: 10.1007/3-540-64574-8_436.

6. Salamó, López-Sánchez. Adaptive case-based reasoning using retention and forgetting strategies. Knowledge-Based Systems. 2011; 24(2): 230-247.

7. Juarez, Craw, Lopez-Delgado, Campos. Maintenance of case bases: current algorithms after fifty years, IJCAI, 2018.

8. Nakhjiri, Salamó, Sànchez-Marrè. Reputation-based maintenance in case-based reasoning. Knowledge-Based Systems. 2020; 193: 105283.

9. Leake, Wilson. Categorizing case-base maintenance: Dimensions and directions. In: European Workshop on Advances in Case-Based Reasoning, Springer, 1998, pp. 196-207.

10. Kolodner. An introduction to case-based reasoning. Artificial Intelligence Review. 1992; 6(1): 3-34. doi: 10.1007/BF00155578.

11. Salamó, Golobardes. Hybrid Deletion Policies for Case Base Maintenance, in: FLAIRS, 2003, Barcelona, Spain.

12. Zhang. Maintaining Footprint-Based Retrieval for Case Deletion, Decision Systems & e-Service Intelligence (DeSI) Lab, Centre for Quantum Computation & Intelligent Systems (QCIS), Faculty of Engineering and Information Technology, University of Technology, Sydney, PO Box 123, Broadway, NSW, 2007, Australia.

13. Rabia et al. Clustering based deletion policy for case-base maintenance, in: 2010 6th International Conference on Emerging Technologies (ICET), Islamabad, Pakistan: IEEE, 2010, pp. 45-48. doi: 10.1109/ICET.2010.5638384.

14. Yan, Qian, Zhang. Memory and forgetting: An improved dynamic maintenance method for case-based reasoning. Information Sciences. 2014; 287: 50-60. doi: 10.1016/j.ins.2014.07.040.

15. Leake, Schack. Flexible Feature Deletion: Compacting Case Bases by Selectively Compressing Case Contents, in: Hüllermeier, Minor, eds, Case-Based Reasoning Research and Development. Cham: Springer International Publishing (Lecture Notes in Computer Science), 2015, pp. 212-227. doi: 10.1007/978-3-319-24586-7_15.

16. Khan, Hayat, Awan. Hybrid case-base maintenance approach for modeling large scale case-based reasoning systems. Human-centric Computing and Information Sciences. 2019; 9(1): 9. doi: 10.1186/s13673-019-0171-z.

17.

Mohammed

et al., Genetic case-based reasoning for improved mobile phone faults diagnosis. Comput Electr Eng. Oct 2018; 71: 212-222. doi: 10.1016/j.compeleceng.2018.07.053.

18.

Mohammed

Belal

A-K

Ibrahim

. Case based reasoning shell frameworkas decision support tool. Indian Journal of Science and Technology. November 2016; 9(42): doi: 10.17485/ijst/2016/v9i42/101280.

19.

Fuchs

Lieber

Mille

Napoli

. Une première formalisation de la phase d’élaboration du raisonnement à partir de cas. Actes du 14

{}^{\textit{i\`{e}me}}

atelier du raisonnement à partir de cas, Besançon, Mars, 2006.

20.

Mille

Fuchs

Herbeaux

. A unifying framework for Adaptation in Case-Based Reasoning. In: Voss

, ed., Proceedings of the ECAI’96 Workshop: Adaptation in Case-Based Reasoning, 1996, pp. 22-28.

21.

Roth-Berghfer

Iglezzakis

. Six Steps in Case-Based Reasoning: Towards a maintenance methodology for case-based reasoning systems, in Proceedings of the 9th German Workshop on CBR, GWCBR’01. Proceedings of the 9th German Workshop on CBR, GWCBR’01, Germany, 2001.

22. Reinartz, Iglezakis, Roth-Berghofer. Review and Restore for Case-Base Maintenance. Computational Intelligence. 2001; 17(2): 214-234. doi: 10.1111/0824-7935.00141.

23. Richter. Introduction, in: Lenz, Bartsch-Spörl, Burkhard H.D., Wess, eds, Case-Based Reasoning Technology: From Foundations to Applications. Springer-Verlag, Berlin, 1998, pp. 1-15.

24. Gabel. On the Use of Vocabulary Knowledge for Learning Similarity Measures, in: Althoff K-D, eds, Professional Knowledge Management. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), 2005, pp. 272-283. doi: 10.1007/11590019_32.

25. Yang. Keep It Simple: A Case-Base Maintenance Policy Based on Clustering and Information Theory, in: Hamilton, eds, Advances in Artificial Intelligence. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), 2000, pp. 102-114. doi: 10.1007/3-540-45486-1_9.

26. Chebli, Djebbar, Merouani. Improving the performance of computer-aided diagnosis systems using semi-supervised learning: a survey and analysis. International Journal of Intelligent Information and Database Systems. 2020; 13(2/3/4): 454. doi: 10.1504/IJIIDS.2020.10031616.

27. Karoui, Kanawati, Petrucci. COBRAS: Cooperative CBR System for Bibliographical Reference Recommendation, in: Roth-Berghofer, Göker, Güvenir, eds, Advances in Case-Based Reasoning. Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), 2006, pp. 76-90. doi: 10.1007/11805816_8.

28. Refai, Merouani, Aouras. Maintenance of a Bayesian network: application using medical diagnosis. Evolving Systems. 2016; 7(3): 187-196. doi: 10.1007/s12530-016-9146-8.

29. Seghir, Djebbar. Maintenance de la base de cas par l’algorithme d’apprentissage K-means. Computer Science Department, Badji Mokhtar University, Annaba, Algeria, 2018.

30. Ezzat, Mahdy, Hassanien, Darwish Vaclav, Gupta. Automatic 3D Reconstruction Detection System for Knee Osteoarthritis Based on K-Means Algorithm, in: Khanna, Gupta, Bhattacharyya, Snasel, Platos, Hassanien A.E., eds, International Conference on Innovative Computing and Communications, vol. 1087. Singapore: Springer Singapore, 2020, pp. 819-828. doi: 10.1007/978-981-15-1286-5_72.

31. Arunkumar et al. K-Means clustering and neural network for object detecting and identifying abnormality of brain tumor. Soft Comput. Oct. 2019; 23(19): 9083-9096. doi: 10.1007/s00500-018-3618-7.

32. MacQueen. Some methods for classification and analysis of multivariate observations, in: Proc. 5th Berkeley Symp. Math. Statistics and Probability. 1967; 1: 281-297.

33. Diday. Une nouvelle méthode de classification automatique et reconnaissance des formes: la méthode des nuées dynamiques. Revue de Statistique Appliquée. 1971; XIX: 19-33.

34. Diday. The dynamic clusters method in nonhierarchical clustering. International Journal of Computing and Information Sciences. 1973; 2: 61-88.

35. Smiti, Elouedi. WCOID-DG: An approach for case base maintenance based on Weighting, Clustering, Outliers, Internal Detection and Dbscan-Gmeans. Journal of Computer and System Sciences. 2014; 80(1): 27-38. doi: 10.1016/j.jcss.2013.03.006.

36. Smiti, Elouedi. SCBM: soft case base maintenance method based on competence model. Journal of Computational Science. 2018; 25: 221-227. doi: 10.1016/j.jocs.2017.09.013.

37. Pustokhina, Pustokhin, Gupta, Khanna, Shankar, Nguyen. An Effective Training Scheme for Deep Neural Network in Edge Computing Enabled Internet of Medical Things (IoMT) Systems. IEEE Access. 2020; 8: 107112-107123. doi: 10.1109/ACCESS.2020.3000322.

38. Alqaralleh BAY, Vaiyapuri, Parvathy, Gupta, Khanna, Shankar. Blockchain-assisted secure image transmission and diagnosis model on Internet of Medical Things Environment. Pers Ubiquitous Comput. Feb. 2021. doi: 10.1007/s00779-021-01543-2.

39. UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.html.
