Abstract
Outlier detection is a process to find out the objects that have the abnormal behavior. It can be applied in many aspects, such as public security, finance and medical care. An information system (IS) as a database that shows relationships between objects and attributes. A real-valued information system (RVIS) is an IS whose information values are real numbers. A RVIS with missing values is an incomplete real-valued information system (IRVIS). The notion of inner boundary comes from the boundary region in rough set theory (RST). This paper conducts experiments directly in an IRVIS and investigates outlier detection in an IRVIS based on inner boundary. Firstly, the distance between two information values on each attribute of an IRVIS is introduced, and the parameter λ to control the distance is given. Then, the tolerance relations on the object set are defined according to the distance, by the way, the tolerance classes, the λ-lower and λ-upper approximations in an IRVIS are put forward. Next, the inner boundary under each conditional attribute in an IRVIS is presented. The more inner boundaries an object belongs to, the more likely it is to be an outlier. Finally, an outlier detection method in an IRVIS based on inner boundary is proposed, and the corresponding algorithm (DE) is designed, where DE means degree of exceptionality. Through the experiments base on UCI Machine Learning Repository data sets, the DE algorithm is compared with other five algorithms. Experimental results show that DE algorithm has the better outlier detection effect in an IRVIS. It is worth mentioning that for comprehensive comparison, ROC curve and AUC value are used to illustrate the advantages of the DE algorithm.
Introduction
As an important branch of data mining, outlier detection aims to discover objects that behave very differently from most objects. Outlier detection plays an important role in many aspects. In public safety, it helps to detect fraudulent act [2]; in medical diagnosis, it can predict diseases of concern [29]; in network monitoring, it can be applied to data monitoring in wireless sensors [36]; in industrial production, it can be used to identify defective products [10]. Hawkins [11] gave a generally accepted definition, “an outlier is a deviation from other observations that leads to suspicion that it is produced by a different mechanism."
Outlier detection methods can be roughly divided into the following categories: statistical distribution-based method [24, 30]; depth-based method [17]; cluster-based method [13, 19]; distance-based method [21] and density-based method [3]. Statistical distribution-based method requires the data to have prior knowledge of the distribution law, and then treats the data that does not satisfy the distribution law as abnormal data. Depth-based method can obtain good detection results for two-dimensional or three-dimensional data, but it is difficult to apply to high-dimensional data. Cluster-based method detects outliers by examining the relationship between objects and clusters. In other words, an outlier is an object that belongs to a small sparse cluster or does not belong to any cluster. The detection effect of this method is very dependent on the parameter settings of the clustering algorithm. The idea of distance-based methods is that if a data object is far away from most data objects, the object is an outlier. The idea of density-based method is to calculate the local density of each object and the density of its neighbor points to judge the degree of abnormality. The latter two methods can be considered to have the same advantages and disadvantages. The advantage is that it is easy to understand and easy to implement. The disadvantage is that the distance needs to be calculated and the computational overhead is high. In addition, some generative neural networks in deep learning, such as GAN [22, 34] are also used to predict the distribution of data sets for outlier detection. The main focus of outlier detection is only on continuous numerical data. Most machine learning methods and neural network methods need sorted data to obtain vector distance, and lack of mechanisms to deal with non sorted classified data. It will also face problems such as dependence on big data, interpretability and training.
Rough set theory (RST) proposed by Pawlak [27, 28] is a set of theories to study incomplete data and imprecise knowledge representation. The main idea is to minimize the expression of knowledge under the premise of keeping the classification ability unchanged, and then derive the decision-making or classification rules of the problem. RST is widely used in feature selection [6, 35], pattern recognition [7, 48], classification [47] and data mining [4, 14], all of which are closely related to information systems (IS). Unlike other data mining methods, RST does not require any prior knowledge. Because of these advantages, RST is used for outlier detection. Many scholars have achieved good results by using it in various information systems to study the problem of outlier detection.
Initially on the classification IS, Francisco et al. [8] constructed a mathematical framework for outlier detection based on RST. Feng et al. [9] studied an outlier detection method combining information entropy and RST. Jiang et al. [15] investigated the hybrid outlier detection algorithm based on RST and GrC. Nguyen [25] proposed an outlier detection method based on RST approximate reasoning. Jiang et al. [20] gave an outlier detection algorithm based on approximate precision entropy. At that time, the numerical processing on the RVIS required discretization, which may lead to information loss. Later, in order to better study numerical and heterogeneous IS, Sangeetha et al. [33] proposed an outlier detection method for mixed data sets, using entropy weighted density outlier detection method based on rough sets to detect outliers. Yuan et al. [45, 46] introduced neighborhood rough sets and proposed some outlier detection methods that can effectively deal with mixed data. Feng et al. [44] proposed an outlier detection algorithm based on the neighborhood value difference metric by using the standard deviation neighborhood radius and the mixed distance metric. In recent years, fuzzy rough sets (FRSs) have also attracted a lot of attention as a generalization of rough sets. At first, FRSs have been successfully applied to feature selection and pattern recognition of numerical or heterogeneous features [37, 43]. Then, in order to study the application of FRSs in outlier detection. Yuan and Chen et al. [41, 42] investigated an outlier detection method based on fuzzy information entropy and an outlier detection method based on fuzzy rough granules.
It is worth mentioning that in the outlier detection tasks of different information systems, the processing of missing data is only described in the articles [41, 42], and the processing of missing data is the maximum probability filling method. After filling in the missing data, it is equivalent to carrying out research on a complete information system.
The contribution of this paper is to study that when part of the values in the real-valued information system (RVIS) are missing, we do not need to fill in the missing values, but directly detect outliers on the incomplete real-valued information system (IRVIS). From the point of view of the information value set under each attribute in the information system, a distance degree definition suitable for an IRVIS is designed. The idea behind the definition of distance degree is to consider the size of the set of information values under different attributes and the uniqueness of the values in the set, and the induced probability distribution can be used to define the distance degree between the information values of two objects under different conditions. Then, the tolerance class of each object is calculated by using the distance degree. After having the tolerance class, the inner boundary is proposed according to the concept of upper and lower approximations of sets in RST. The degree of marginality of an object is defined based on the number of different inner boundaries to which the object belongs. In addition, normalization gives the degree of exceptionality to evaluate the possibility of an object becoming an outlier. Finally, the degree of exceptionality (DE) outlier detection algorithm based on inner boundary is described. Experiments are carried out on ten data sets downloaded from UCI Machine Learning Repository, and five outlier detection algorithms (CBLOF [13], DB [21], KNN [31], SEQ [18], NOOF [5]) are compared. The experimental results shows that the DE outlier detection algorithm often has the better outlier detection effect in an IRVIS.
Our main contributions can be summarized as follows.
(1) A distance degree suitable for an IRVIS is studied. In particular, the distance degree between a missing value and another missing value is large enough, the distance degree between a missing value and a real value is relatively large, and the distance degree between two real values is clear.
(2) Based on the rich theoretical knowledge of RST, the concept of inner boundary is defined, and the degree of exceptionality of each object are calculated, which to some extent enriches the requirements of the definition of outlier concept.
(3) An outlier detection algorithm of an IRVIS based on inner boundary is proposed.
(4) The experiment proves the superiority of this algorithm in an IRVIS.
The remainder of this paper is organized as follows. Section 2 describes some preparatory work. Section 3 studies outlier detection in an IRVIS based on inner boundary, including examples and algorithms. Section 4 gives experimental analysis. Section 5 carries out evaluation analysis. Section 6 summaries this paper.
The workflow of this paper is depicted in Figure 1.

The workflow of this paper.
This section introduces some preliminaries needed for the formulation of the proposed method.
We first look back at binary relations and an IRVIS. Throughout this paper, O, A denote two non-empty finite sets, 2 O is the power set of O and |X| is the cardinality of X ∈ 2 O .
In this paper, put
Binary relations
Recall that R is a binary relation on O whenever R ⊆ O × O.
(1) reflexive, if (o, o′) ∈ R for any o ∈ O;
(2) symmetric, if (o, o′) ∈ R implies (o′, o) ∈ R;
(3) transitive, if (o, o′) ∈ R and (o′, o″) ∈ R imply (o, o″) ∈ R.
R is said to be an equivalence relation on O, if R is reflexive, symmetric and transitive; R is called a tolerance relation on O, if R is reflexive and symmetric.
An IRVIS
Let (O, A) be an IS. If there is a ∈ A such that * ∈ V a , here * means a null or unknown value, then (O, A) is called an incomplete information system (IIS).
Let (O, A) be an IIS. For each a ∈ A, denote
Then,
If P ⊆ A, then (O, P) is referred to as the subsystem of (O, A).
An IRVIS (O, A)
An IRVIS (O, A)
∀ a ∈ A, denote
For missing data, we have the following thoughts.
(1) Consider “o ≠ o′, a (o) = * , a (o′) ≠ * , a ∈ A ", because “a (o)" is treated as “do not care" information, thus a (o) has the probability of
(2) Consider “o ≠ o′, a (o) ≠ * , a (o′) = * , a ∈ A ", because “a (o′)" is treated as “do not care" information, thus a (o′) has the probability of
(3) Consider “o ≠ o′, a (o) = * , a (o′) = * , a ∈ A ", a (o) and a (o′) both have the probability of
d (a (o) , a (o′)) =
Tolerance relations in an IRVIS
Denote
(1) If P 1 ⊆ P2 ⊆ A, then ∀ λ ∈ [0, 1], ∀ o ∈ O,
(2) If 0 ≤ λ1 ≤ λ2 ≤ 1, then ∀ P ⊆ A, ∀ o ∈ O,
So
(2)
So
Then
Moreover, if
Put
(1)
(2)
(3)
(4) If P1 ⊆ P2 ⊆ A, then ∀ λ ∈ [0, 1], ∀ X ∈ 2 O ,
(5) If 0 ≤ λ1 ≤ λ2 ≤ 1, then ∀ P ⊆ A, ∀ X ∈ 2 O ,
(6)
(7)
(1)
(2)
(3)
Obviously,
To put it bluntly, the inner boundary is the intersection of the set X with the boundary set. These elements in the inner boundary represent a contradiction with the set X, because some elements satisfy X and others do not satisfy X in its tolerance class.
Outlier detection in an IRVIS based on the inner boundary
Outliers in an IRVIS based on the inner boundary
Research on outlier detection method based on inner boundary is inspired by literature [8]. In this section, we define outliers in an IRVIS based on the inner boundary.
A λ-exceptional subset E of X is made out of the elements in X that contradict m tolerance relations
(1) e ∈ E is called dispensable in E, if E - {e} is also a λ-exceptional subset of X.
(2) e ∈ E is called indispensable in E, if E - {e} is not a λ-exceptional subset of X.
NonCredundant exceptional subsets constitute the fundamental source of elements to be considered outlier candidates in subsequent stages of the detection method.
Clearly,
The purpose of this notion is to normalize the
Obviously,
In this paper, the set of all (λ, μ)-outliers in X or the subsystem (X, A) is denoted as
An example
In this part, an example is given to find outliers in an IRVIS based on the inner boundary.
The tolerance class of each object
The tolerance class of each object
Let
Suppose X = {o1, o2, o3, o4, o6}, Obviously, X ∈ Φ λ (O)
(1) For a
i
(i = 1, 2, 3, 4), the λ-lower approximation and the λ-upper approximation of X is calculated as follows:
(2) The corresponding
(3) The λ-degree of marginality of o in X by Definition 3.1 is calculated as follows:
(4) With λ-degree of marginality of o in X, the λ-degree of exceptionality of o in X can get it all at once. The results are shown below:
Given μ = 0.7, we have
Only o1 has a
This section introduces the outlier detection algorithm based on inner boundary in an IRVIS and analyzes the time complexity of related algorithms.
Given an IRVIS (O, A), Our outlier detection method is represented in algorithms 1, 2 and 3. Algorithm 1 calculates the tolerance class of each object under each attribute. Algorithm 2 calculates the
Algorithm 1 Computing
1:
2:
3: Calculating d (a (o) , a (o′)) ;
4:
5:
6: λ ∈ [0, 1]
7:
8: return
The time complexity for Algorithm 1 is as follows. In step 3, the time complexity for calculating the d (a (o) , a (o′)) is O (|V a |2) where V a = {a (o) : o ∈ O}.
For Algorithm 2, the time complexity of computing the intersection of each object’s tolerance classes with object set X in step 8 is O (|A| × |X|2). Then the time complexity in step 17 is O (|A|), thus the total time complexity for Algorithm 2 amounts to O (|A| × |X|2).
Step 10 in Algorithm 3 is to determine whether there is a subset relationship between the tolerance classes of two different objects, and its time complexity is O (|A|2 × |X|). The time complexity of steps 19 to 25 is O (|E|), O (|E|) ≤ O (|X|). So the total time complexity for Algorithm 3 is O (|A|2 × |X|).
Experimental analyses
Experimental setup
The experiments in this section are carried out on a computer equipped with Inter Corei5-7200U processor, frequency of 2.50 GHz, storage of 2.70 GHz and memory of 8 GB. The operating system is Windows 10. The experiments are developed with Python 3.8.
Algorithm 2 Calculating the
1: Obtain tolerance class
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19: return
Algorithm 3 Calculating the outlier set
1: E← ∅
2:
3: Obtain
4:
5: flag = False
6:
7:
8: continue;
9:
10:
11: flag = True;
12: break;
13:
14:
15:
16:
17:
18:
19:
20: Calculating
21: Calculating
22:
23:
24:
25:
26: return
The DE outlier detection algorithm is compared with 5 main outlier detection algorithms. Including distance-based algorithm (DB), k-nearest neighbors-based algorithm (KNN), cluster-based local outlier factor algorithm (CBLOF), sequence-based method (SEQ), neighborhood-based method (NOOF). To compare the performance of all outlier detection algorithms, the evaluation metric proposed by Aggarwal and Yu is used [1].
For each algorithm, this paper first calculates the outlierness of each object in the data set, and arranges them in descending order. The higher the outlierness of an object after the arrangement, the higher the probability that the object will become an outlier. Then selects the front part of the rearranged object sequence at intervals of the same size to see how many objects are actually selected from the rare class (real outlier object). The more objects selected from the rare class, the better the performance of the outlier detection algorithm. For example, there are 24 objects in the rare class of the data set, and an object sequence in descending order of outlierness is obtained through the outlier detection algorithm. Taking the first 20 objects in the object sequence. If there are nearly 20 objects that really belong to the rare class, it is considered that the performance of the outlier detection algorithm is very good. Since the missing data in each data set only contains a small part of itself, the parameter of DE algorithm in all data sets is set to 0.2. Relevant parameters in the algorithm proposed by others, those are set differently according to different data sets, but the purpose is to carry out multiple experiments under each data set and select the parameter setting with the best effect.
The threshold μ mentioned in the DE outlier detection algorithm is the standard used to judge whether an object is detected as an outlier. The threshold μ is set according to the number of objects in rare class. Intuitively, the DE outlier detection algorithm calculates the outlierness of each object and obtains a sequence in descending order of outlierness. If in a data set, the number of objects in rare class is m, the mth outlierness in the sequence is taken as the threshold μ.
Data set experiment results
Ten data sets from UCI Machine Learning Repository are used: Glass Identification Data Set, Computer Hardware Data Set, Divorce Predictors Data Set, Ionosphere Data Set, Breast Cancer Wisconsin (Original) Data Set, Breast Cancer Wisconsin (Diagnostic) Data Set, Airfoil Self-Noise Data Set, (OBS) Network Data Set, Vehicle Silhouettes Data Set, Obesity Data Set. The rare class is that contains only a small fraction of objects in each data set. It is usually defined according to specific context semantics. For example, in the context of medical cases, uncommon case is usually regarded as the rare class.
These data sets belong to real-value data sets. In order to study the task of outlier detection in an IRVIS, data sets need to be preprocessed. It includes converting the data set into an information system table, where one row corresponds to one object and one column corresponds to one attribute. It also includes randomly missing 1% of all information values in each information system table.
In UCI Machine Learning Repository public data sets, it is found that few data sets are suitable for outlier detection tasks. The 10 data sets used in this paper are mainly applied in the experiments of classification and clustering tasks. These data sets are relatively balanced in the number distribution of decision class, those are not suitable for outlier detection. In order to solve this problem, this paper uses the experimental technology proposed in [12] to preprocess these data sets to obtain data sets for outlier detection. This method randomly deletes objects of the rare class and retains objects of other classes. Finally, an unbalanced distribution in the decision class is formed. In this paper, we preprocessed two data sets (Divorce Predictors, Breast Cancer Wisconsin (Diagnostic)) to make them more discriminative. Table 3 summarizes the data sets used in this paper.
Data set statistics
Data set statistics
For more details, please see the experimental results of the following data sets.
(1) The threshold μ is set according to the proportion of rare class in data set. Specifically, the outlier ratio is 4.21% (see Table 3), and then the outlierness ranked at 4.21% in descending order is taken as the threshold μ (see Figure 2). The red points in the figure represent the distribution of real outlier points in the data set, and the blue points represent the distribution of normal points.

Scatter chart for Glass Identification.
(2) Table 4 shows the experimental results in detail, it is equivalent to giving the dynamic data table of outlier detection rate. The bold numbers in each row in the table indicate that the outlier detection rate is the best in a current outlier detection operation. In Table 4, “Top ratio" refers to the percentage of objects those are in front of the rearranged object sequence. “number of objects" indicates the number of selected objects. “Number of rare classes" refers to the number of real outliers contained in the selected objects. “Coverage" refers to the ratio of “Number of rare classes" to the number of real outliers.
Experimental results in Glass Identification Data Set
(3) Figure 3 shows the change trend of detection rate of six outlier detection algorithms. The line graph clearly reflects that DE algorithm has a relatively good advantage over the other five algorithms on the glass identification data set.

Detection rate comparison for Glass Identification.
In the outlier detection experiment of glass identification data set, the experimental progress and results are introduced in detail. For the next 9 data sets, Their scatter chart and detection rate comparison chart will be displayed together.

Detection rate comparison chart results.

Scatter chart results.
Experimental results in Computer Hardware Data Set
Experimental results in Divorce Predictors Data Set
Experimental results in Vehicle Silhouettes Data Set
Comparison among outlier detection algorithms
Experimental results in Ionosphere Data Set
Experimental results in Breast Cancer Wisconsin (Original) Data Set
Experimental results in Breast Cancer Wisconsin (Diagnostic) Data Set
Confusion matrix for predicting an outlier
Experimental results in Airfoil Self-Noise Data Set
Experimental results in (OBS) Network Data Set
Experimental results in Obesity Data Set
In this section, ROC curve and AUC standard are used to evaluate the experimental results, and Friedman test is performed.
The confusion matrix is required to draw the ROC curve. The confusion matrix is shown in table 16. There are 4 possible outcomes in the prediction of outliers: an outlier taken as outlier(true positive, TP), an outlier taken as normal (false negative, FN), an normal taken as outlier (false positive, FP) and an normal taken as normal (true negative, TN).
AUC results
AUC results
Accordingly, the true positive rate(TPR) and false positive rate(FPR) could be defined as follows:TPR=TP/(TP+FN), FPR=FP/(FP+TN).
Seting a threshold and calculating its corresponding TPR and FPR. Then, takeing FPR as the abscissa and TPR as the ordinate, which can correspond to a point in the two-dimensional map. After setting some thresholds, the ROC curve can be connected. The TPR is called “detection rate”, which represents the probability of being identified as an outlier among true outliers. On the contrary, the FPR is called false positive rate and means the probability of being identified as an outlier among non-outliers. Obviously, when evaluating the performance of the algorithm, these two indicators are often opposite. The ROC curve intuitively depicts the relative trade-off between TPR and FPR. If the ROC curve is closer to the upper left corner, it indicates that the algorithm performs better. The ROC curve gives a vivid representation of the performance of the algorithm, but is not specific enough. At this time, the numerical AUC needs to be introduced to quantitatively reflect the performance of the algorithm. AUC is the area under the ROC curve. Assuming that the probability of positive sample predicted by a model is P1, and the probability of negative sample predicted by a model is P2, AUC is the probability of P1textgreaterP2 (with higher values being better). The value range of AUC is generally [0.5, 1]. The larger the AUC, the more samples are correctly identified as outliers. In other word, the better the performance of the algorithm.
Figure 6 shows the ROC results for 10 data sets. The ROC results show that DE algorithm performs better on most data sets and runs stably. More clear and powerful conclusions can be drawn from AUC results, as exhibited in Table 16 (boldface numbers are the best) DE algorithm is the best overall.

ROC results.
Here are 6 algorithms and 10 data sets. Considering that each algorithm has a performance ranking on each data set, we do Friedman test to compare whether there are significant differences between algorithms. The experimental results are shown in Figure 7, DE algorithm has the lowest ranking mean on the 10 data set, which is significantly different from the algorithms with higher ranking mean.

Friedman test result.
A RVIS is an IS that uses real-valued data to display relationships between objects and attributes. An IRVIS is a RVIS with partially missing information values. In this paper, an outlier detection method in an IRVIS based on inner boundary has been studied and the corresponding DE algorithm has been proposed. Multiple experiments on ten data sets from UCI Machine Learning Repository have been carried out. When the performance of the other five algorithms fluctuates with different data sets, the DE algorithm remains relatively stable and has certain robustness. This shows that the performance of the proposed outlier detection algorithm is better than the other five outlier detection algorithms in an IRVIS. In future work, we will consider applying the algorithm proposed in this paper to detect abnormal cells on biological genetic data, because the expression of biological genetic data in different cell samples is real-valued data. But due to the complexity of biological genetic data, we need to focus on improving the time complexity of the proposed method to better adapt to big data.
Footnotes
Acknowledgments
The authors would like to thank the editors and the anonymous reviewers for their valuable comments and suggestions, which have helped immensely in improving the quality of the paper. This work is supported by Natural Science Foundation of Guangxi (2021GXNSFAA220114, 2022GXNSFAA035552).
