Outlier detection for incomplete real-valued data based on inner boundary

Abstract

Outlier detection is the process of finding objects that behave abnormally. It is applied in many areas, such as public security, finance and medical care. An information system (IS) is a database that shows relationships between objects and attributes. A real-valued information system (RVIS) is an IS whose information values are real numbers. An RVIS with missing values is an incomplete real-valued information system (IRVIS). The notion of inner boundary comes from the boundary region in rough set theory (RST). This paper works directly in an IRVIS and investigates outlier detection in an IRVIS based on inner boundary. Firstly, the distance between two information values on each attribute of an IRVIS is introduced, and a parameter λ to control the distance is given. Then, tolerance relations on the object set are defined according to the distance, and from them the tolerance classes and the λ-lower and λ-upper approximations in an IRVIS are put forward. Next, the inner boundary under each conditional attribute in an IRVIS is presented. The more inner boundaries an object belongs to, the more likely it is to be an outlier. Finally, an outlier detection method in an IRVIS based on inner boundary is proposed, and the corresponding algorithm (DE) is designed, where DE means degree of exceptionality. In experiments based on data sets from the UCI Machine Learning Repository, the DE algorithm is compared with five other algorithms. The experimental results show that the DE algorithm has the better outlier detection effect in an IRVIS. For a comprehensive comparison, ROC curves and AUC values are used to illustrate the advantages of the DE algorithm.

Keywords

RST IRVIS Outlier detection Inner boundary

1 Introduction

As an important branch of data mining, outlier detection aims to discover objects that behave very differently from most objects. Outlier detection plays an important role in many areas. In public safety, it helps to detect fraudulent acts [2]; in medical diagnosis, it can predict diseases of concern [29]; in network monitoring, it can be applied to data monitoring in wireless sensors [36]; in industrial production, it can be used to identify defective products [10]. Hawkins [11] gave a generally accepted definition: “an outlier is a deviation from other observations that leads to suspicion that it is produced by a different mechanism.”

Outlier detection methods can be roughly divided into the following categories: statistical distribution-based methods [24, 30], depth-based methods [17], cluster-based methods [13, 19], distance-based methods [21] and density-based methods [3]. Statistical distribution-based methods require prior knowledge of the distribution of the data, and then treat the data that do not fit the distribution as abnormal. Depth-based methods can obtain good detection results for two-dimensional or three-dimensional data, but are difficult to apply to high-dimensional data. Cluster-based methods detect outliers by examining the relationship between objects and clusters. In other words, an outlier is an object that belongs to a small sparse cluster or does not belong to any cluster. The detection effect of these methods depends heavily on the parameter settings of the clustering algorithm. The idea of distance-based methods is that if a data object is far away from most data objects, the object is an outlier. The idea of density-based methods is to calculate the local density of each object and the density of its neighbor points to judge the degree of abnormality. The latter two kinds of methods share the same advantages and disadvantages: they are easy to understand and implement, but distances need to be calculated, so the computational overhead is high. In addition, some generative neural networks in deep learning, such as GANs [22, 34], are also used to predict the distribution of data sets for outlier detection. The main focus of such outlier detection methods is on continuous numerical data. Most machine learning and neural network methods need ordered data to obtain vector distances, and lack mechanisms to deal with unordered categorical data. They also face problems such as dependence on big data, interpretability and training.

Rough set theory (RST) proposed by Pawlak [27, 28] is a set of theories to study incomplete data and imprecise knowledge representation. The main idea is to minimize the expression of knowledge under the premise of keeping the classification ability unchanged, and then derive the decision-making or classification rules of the problem. RST is widely used in feature selection [6, 35], pattern recognition [7, 48], classification [47] and data mining [4, 14], all of which are closely related to information systems (IS). Unlike other data mining methods, RST does not require any prior knowledge. Because of these advantages, RST is used for outlier detection. Many scholars have achieved good results by using it in various information systems to study the problem of outlier detection.

Initially, on classification IS, Francisco et al. [8] constructed a mathematical framework for outlier detection based on RST. Feng et al. [9] studied an outlier detection method combining information entropy and RST. Jiang et al. [15] investigated a hybrid outlier detection algorithm based on RST and GrC. Nguyen [25] proposed an outlier detection method based on RST approximate reasoning. Jiang et al. [20] gave an outlier detection algorithm based on approximate precision entropy. At that time, numerical processing on an RVIS required discretization, which may lead to information loss. Later, in order to better study numerical and heterogeneous IS, Sangeetha et al. [33] proposed an outlier detection method for mixed data sets, which uses an entropy-weighted density method based on rough sets to detect outliers. Yuan et al. [45, 46] introduced neighborhood rough sets and proposed some outlier detection methods that can effectively deal with mixed data. Feng et al. [44] proposed an outlier detection algorithm based on the neighborhood value difference metric by using the standard deviation neighborhood radius and the mixed distance metric. In recent years, fuzzy rough sets (FRSs) have also attracted a lot of attention as a generalization of rough sets. At first, FRSs were successfully applied to feature selection and pattern recognition with numerical or heterogeneous features [37, 43]. Then, in order to study the application of FRSs in outlier detection, Yuan and Chen et al. [41, 42] investigated an outlier detection method based on fuzzy information entropy and an outlier detection method based on fuzzy rough granules.

It is worth mentioning that, among the outlier detection tasks on different information systems, the processing of missing data is described only in [41, 42], where missing values are filled in by the maximum probability filling method. After the missing data are filled in, the research is in effect carried out on a complete information system.

The contribution of this paper is to show that, when some of the values in a real-valued information system (RVIS) are missing, we do not need to fill in the missing values, but can detect outliers directly on the incomplete real-valued information system (IRVIS). From the point of view of the information value set under each attribute in the information system, a distance degree suitable for an IRVIS is designed. The idea behind the definition of the distance degree is to consider the size of the set of information values under each attribute and the uniqueness of the values in the set; the induced probability distribution is then used to define the distance degree between the information values of two objects in the different cases. Then, the tolerance class of each object is calculated by using the distance degree. From the tolerance classes, the inner boundary is defined according to the concept of upper and lower approximations of sets in RST. The degree of marginality of an object is defined as the number of different inner boundaries to which the object belongs. Normalizing it gives the degree of exceptionality, which evaluates how likely an object is to be an outlier. Finally, the degree of exceptionality (DE) outlier detection algorithm based on inner boundary is described. Experiments are carried out on ten data sets downloaded from the UCI Machine Learning Repository, and five outlier detection algorithms (CBLOF [13], DB [21], KNN [31], SEQ [18], NOOF [5]) are compared. The experimental results show that the DE outlier detection algorithm often has the better outlier detection effect in an IRVIS.

Our main contributions can be summarized as follows.

(1) A distance degree suitable for an IRVIS is studied. In particular, the distance degree between a missing value and another missing value is large enough, the distance degree between a missing value and a real value is relatively large, and the distance degree between two real values is clear.

(2) Based on the rich theoretical knowledge of RST, the concept of inner boundary is defined, and the degree of exceptionality of each object is calculated, which to some extent enriches the definition of the concept of outlier.

(3) An outlier detection algorithm of an IRVIS based on inner boundary is proposed.

(4) The experiment proves the superiority of this algorithm in an IRVIS.

The remainder of this paper is organized as follows. Section 2 describes some preparatory work. Section 3 studies outlier detection in an IRVIS based on inner boundary, including examples and algorithms. Section 4 gives the experimental analysis. Section 5 carries out the evaluation analysis. Section 6 summarizes this paper.

The workflow of this paper is depicted in Figure 1.

Fig. 1

The workflow of this paper.

2 Preliminaries

This section introduces some preliminaries needed for the formulation of the proposed method.

We first look back at binary relations and an IRVIS. Throughout this paper, O, A denote two non-empty finite sets, 2^O is the power set of O and |X| is the cardinality of X ∈ 2^O.

In this paper, put

$O = \{o_{1}, o_{2}, \dots, o_{n}\}, \quad A = \{a_{1}, a_{2}, \dots, a_{m}\} .$

2.1 Binary relations

Recall that R is a binary relation on O whenever R ⊆ O × O. A binary relation R on O is said to be

(1) reflexive, if (o, o) ∈ R for any o ∈ O;

(2) symmetric, if (o, o′) ∈ R implies (o′, o) ∈ R;

(3) transitive, if (o, o′) ∈ R and (o′, o″) ∈ R imply (o, o″) ∈ R.

R is said to be an equivalence relation on O, if R is reflexive, symmetric and transitive; R is called a tolerance relation on O, if R is reflexive and symmetric.

2.2 An IRVIS

Definition 2.1 ([27]). Let O be an object set and A an attribute set. Suppose that O and A are finite sets. Then the pair (O, A) is called an information system (IS), if each attribute a ∈ A determines an information function a : O → V_a, where V_a = {a (o) : o ∈ O}.

Let (O, A) be an IS. If there is a ∈ A such that * ∈ V_a, here * means a null or unknown value, then (O, A) is called an incomplete information system (IIS).

Let (O, A) be an IIS. For each a ∈ A, denote

$V_{a}^{*} = V_{a} - \{a (o) : a (o) = *\} .$

Then, $V_{a}^{*}$ means the set of all non-missing information values with respect to the attribute a.

Definition 2.2. Suppose that (O, A) is an IIS. Then (O, A) is referred to as an incomplete real-valued information system (IRVIS), if for any a ∈ A and o ∈ O, a (o) is either a real number or the missing value *.

If P ⊆ A, then (O, P) is referred to as the subsystem of (O, A).

Example 2.3. Table 1 expresses an IRVIS (O, A), where O = {o₁, o₂, ⋯ , o₇} is an object set and A = {a₁, a₂, a₃, a₄} is a set of attributes.

Table 1
An IRVIS (O, A)

	a₁	a₂	a₃	a₄
o₁	23.6	0	200	*
o₂	15.5	1	*	10
o₃	*	0	*	80
o₄	18.3	1	200	10
o₅	23.6	0	300	*
o₆	*	1	100	40
o₇	25.4	1	*	80

Example 2.4. (Continued from Example 2.3)

$V_{a_{1}}^{*} = \{23.6, 15.5, 18.3, 25.4\}, V_{a_{2}}^{*} = V_{a_{2}} = \{0, 1\},$

$V_{a_{3}}^{*} = \{100, 200, 300\}, V_{a_{4}}^{*} = \{10, 40, 80\} .$

2.3 The distance between information values with respect to each attribute in an IRVIS

∀ a ∈ A, denote

$\hat{a} = \max V_{a}^{*} - \min V_{a}^{*} .$

For missing data, we have the following thoughts.

(1) Consider the case o ≠ o′, a (o) = *, a (o′) ≠ *, a ∈ A. Because a (o) is treated as “do not care” information, a (o) equals any given value of $V_{a}^{*}$ with probability $\frac{1}{| V_{a}^{*} |}$ .

(2) Consider the case o ≠ o′, a (o) ≠ *, a (o′) = *, a ∈ A. Because a (o′) is treated as “do not care” information, a (o′) equals any given value of $V_{a}^{*}$ with probability $\frac{1}{| V_{a}^{*} |}$ .

(3) Consider the case o ≠ o′, a (o) = *, a (o′) = *, a ∈ A. Both a (o) and a (o′) equal any given value of $V_{a}^{*}$ with probability $\frac{1}{| V_{a}^{*} |}$ , so, treating the two values as independent, the joint probability that a (o) and a (o′) take a given pair of values is $\frac{1}{| V_{a}^{*} |^{2}}$ .

Definition 2.5. Let (O, A) be an IRVIS. Then ∀ o, o′ ∈ O, ∀ a ∈ A, the distance degree between a (o) and a (o′) is defined as follows:

$d (a (o), a (o^{'})) = \begin{cases} 0, & o = o^{'}; \\ 1 - \frac{1}{| V_{a}^{*} |^{2}}, & o \neq o^{'}, a (o) = *, a (o^{'}) = *; \\ 1 - \frac{1}{| V_{a}^{*} |}, & o \neq o^{'}, a (o) \neq *, a (o^{'}) = *; \\ 1 - \frac{1}{| V_{a}^{*} |}, & o \neq o^{'}, a (o) = *, a (o^{'}) \neq *; \\ 0, & o \neq o^{'}, a (o) \neq *, a (o^{'}) \neq *, a (o) = a (o^{'}); \\ \frac{| a (o) - a (o^{'}) |}{\hat{a}}, & o \neq o^{'}, a (o) \neq *, a (o^{'}) \neq *, a (o) \neq a (o^{'}) . \end{cases}$
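The case analysis of Definition 2.5 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `distance_degree`, the use of `None` for the missing value *, and the representation of an attribute as a list indexed by object are all assumptions made here. The column is assumed to contain at least one non-missing value.

```python
from typing import Optional, Sequence


def distance_degree(column: Sequence[Optional[float]], i: int, j: int) -> float:
    """Distance degree between the values of objects i and j on one attribute.

    `column` holds a(o) for every object o; None stands for the missing value '*'.
    """
    if i == j:                                    # same object
        return 0.0
    known = {v for v in column if v is not None}  # V_a^*: non-missing values
    x, y = column[i], column[j]
    if x is None and y is None:                   # both values missing
        return 1.0 - 1.0 / len(known) ** 2
    if x is None or y is None:                    # exactly one value missing
        return 1.0 - 1.0 / len(known)
    if x == y:                                    # equal real values
        return 0.0
    # different real values, scaled by the range max V_a^* - min V_a^*
    return abs(x - y) / (max(known) - min(known))


# Attribute a1 of Table 1 (index 0 is o1, index 6 is o7).
A1 = [23.6, 15.5, None, 18.3, 23.6, None, 25.4]
print(distance_degree(A1, 0, 4))            # 0.0    (o1, o5: equal values)
print(distance_degree(A1, 0, 2))            # 0.75   (o1 real, o3 missing)
print(distance_degree(A1, 2, 5))            # 0.9375 (o3, o6 both missing)
print(round(distance_degree(A1, 0, 1), 3))  # 0.818  (|23.6 - 15.5| / 9.9)
```

Note that the last branch cannot divide by zero: two different real values force the range of $V_{a}^{*}$ to be positive.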

2.4 Tolerance relations in an IRVIS

Definition 2.6. Let (O, A) be an IRVIS. Given P ⊆ A and λ ∈ [0, 1],

$R_{P}^{λ} = \{(o, o^{'}) \in O \times O : \forall a \in P, d (a (o), a (o^{'})) \leq λ\} .$

$R_{P}^{λ}$ is called the binary relation induced by the subsystem (O, P) with respect to λ. Clearly, $R_{P}^{λ}$ is a tolerance relation on O, since d (a (o) , a (o)) = 0 and d is symmetric in its two arguments.

Denote

$R_{P}^{λ} (o) = \{o^{'} \in O : (o, o^{'}) \in R_{P}^{λ}\} .$ Then $R_{P}^{λ} (o)$ is referred to as the tolerance class of the object o under $R_{P}^{λ}$ .
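As an illustration of Definition 2.6, the sketch below computes tolerance classes for the IRVIS of Table 1. The names `distance_degree`, `tolerance_class` and `TABLE_1`, the dictionary layout and the use of `None` for * are assumptions of this sketch, not part of the paper.

```python
from typing import Dict, List, Optional, Set

Column = List[Optional[float]]  # None stands for the missing value '*'


def distance_degree(col: Column, i: int, j: int) -> float:
    """Distance degree of Definition 2.5 between objects i and j on one attribute."""
    if i == j:
        return 0.0
    known = {v for v in col if v is not None}  # V_a^*
    x, y = col[i], col[j]
    if x is None and y is None:
        return 1.0 - 1.0 / len(known) ** 2
    if x is None or y is None:
        return 1.0 - 1.0 / len(known)
    return 0.0 if x == y else abs(x - y) / (max(known) - min(known))


def tolerance_class(table: Dict[str, Column], attrs: List[str],
                    lam: float, i: int) -> Set[int]:
    """Objects whose distance degree to object i is at most lam on every attribute in attrs."""
    n = len(next(iter(table.values())))
    return {j for j in range(n)
            if all(distance_degree(table[a], i, j) <= lam for a in attrs)}


TABLE_1: Dict[str, Column] = {
    "a1": [23.6, 15.5, None, 18.3, 23.6, None, 25.4],
    "a2": [0, 1, 0, 1, 0, 1, 1],
    "a3": [200, None, None, 200, 300, 100, None],
    "a4": [None, 10, 80, 10, None, 40, 80],
}

# Indices are 0-based, so index 0 stands for o1.
class_a1 = tolerance_class(TABLE_1, ["a1"], 0.5, 0)          # o1, o5, o7
class_a1_a3 = tolerance_class(TABLE_1, ["a1", "a3"], 0.5, 0)  # o1, o5
print(sorted(class_a1), sorted(class_a1_a3))
```

Adding the attribute a₃ shrinks the class of o₁ from {o₁, o₅, o₇} to {o₁, o₅}: a larger attribute subset can only remove objects from a tolerance class.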

Proposition 2.7. Let (O, A) be an IRVIS. Then the following properties hold:

(1) If P ₁ ⊆ P₂ ⊆ A, then ∀ λ ∈ [0, 1], ∀ o ∈ O,

$R_{P_{2}}^{λ} (o) \subseteq R_{P_{1}}^{λ} (o);$

(2) If 0 ≤ λ₁ ≤ λ₂ ≤ 1, then ∀ P ⊆ A, ∀ o ∈ O,

$R_{P}^{λ_{1}} (o) \subseteq R_{P}^{λ_{2}} (o) .$

Proof. (1) $\forall o^{'} \in R_{P_{2}}^{λ} (o)$ , we have ∀ a ∈ P₂, d (a (o) , a (o′)) ≤ λ. Since P₁ ⊆ P₂, ∀ a ∈ P₁, d (a (o) , a (o′)) ≤ λ. Thus $o^{'} \in R_{P_{1}}^{λ} (o)$ .

So $R_{P_{2}}^{λ} (o) \subseteq R_{P_{1}}^{λ} (o)$ .

(2) $\forall o^{'} \in R_{P}^{λ_{1}} (o)$ , we have ∀ a ∈ P, d (a (o) , a (o′)) ≤ λ₁. Since λ₁ ≤ λ₂, ∀ a ∈ P, d (a (o) , a (o′)) ≤ λ₂. Thus $o^{'} \in R_{P}^{λ_{2}} (o)$ .

So $R_{P}^{λ_{1}} (o) \subseteq R_{P}^{λ_{2}} (o)$ . □

Definition 2.8. Let (O, A) be an IRVIS. Given P ⊆ A, λ ∈ [0, 1] and X ∈ 2^O. Define

$\underline{R_{P}^{λ}} (X) = \{o \in O : R_{P}^{λ} (o) \subseteq X\};$

$\bar{R_{P}^{λ}} (X) = \{o \in O : R_{P}^{λ} (o) \cap X \neq \emptyset\} .$

Then $\underline{R_{P}^{λ}} (X)$ and $\bar{R_{P}^{λ}} (X)$ are called the λ-lower approximation and the λ-upper approximation of X, respectively.

Moreover, if $\underline{R_{P}^{λ}} (X) = \bar{R_{P}^{λ}} (X)$ , then X is called an exact set with respect to $R_{P}^{λ}$ ; otherwise, X is called a rough set with respect to $R_{P}^{λ}$ .
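The two approximations of Definition 2.8 translate directly into set comprehensions. The following sketch assumes the tolerance classes are already available as a dictionary from object names to sets; the names `lower_approx`, `upper_approx` and `CLASSES_A3` are illustrative, and the classes are those listed for a₃ with λ = 0.5 in Table 2 of Example 3.8.

```python
from typing import Dict, Set


def lower_approx(classes: Dict[str, Set[str]], X: Set[str]) -> Set[str]:
    """Objects whose tolerance class is contained in X."""
    return {o for o, cls in classes.items() if cls <= X}


def upper_approx(classes: Dict[str, Set[str]], X: Set[str]) -> Set[str]:
    """Objects whose tolerance class meets X."""
    return {o for o, cls in classes.items() if cls & X}


# Tolerance classes under a3 for lambda = 0.5 (Table 2).
CLASSES_A3 = {
    "o1": {"o1", "o4", "o5", "o6"},
    "o2": {"o2"},
    "o3": {"o3"},
    "o4": {"o1", "o4", "o5", "o6"},
    "o5": {"o1", "o4", "o5"},
    "o6": {"o1", "o4", "o6"},
    "o7": {"o7"},
}
X = {"o1", "o2", "o3", "o4", "o6"}

lower = lower_approx(CLASSES_A3, X)
upper = upper_approx(CLASSES_A3, X)
print(sorted(lower))   # ['o2', 'o3', 'o6']
print(sorted(upper))   # ['o1', 'o2', 'o3', 'o4', 'o5', 'o6']
print(lower == upper)  # False, so X is a rough set with respect to this relation
```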

Put

$Φ_{a}^{λ} (O) = \{X \in 2^{O} : X \text{ is a rough set with respect to } R_{a}^{λ}\},$

$Φ^{λ} (O) = ⋂_{a \in A} Φ_{a}^{λ} (O) .$

Theorem 2.9. Let (O, A) be an IRVIS.

(1) $\bar{R_{P}^{λ}} (\emptyset) = \underline{R_{P}^{λ}} (\emptyset) = \emptyset$ , $\underline{R_{P}^{λ}} (O) = \bar{R_{P}^{λ}} (O) = O$ .

(2) $\underline{R_{P}^{λ}} (X) \subseteq X \subseteq \bar{R_{P}^{λ}} (X)$ .

(3) $X \subseteq Y \Rightarrow \underline{R_{P}^{λ}} (X) \subseteq \underline{R_{P}^{λ}} (Y), \bar{R_{P}^{λ}} (X) \subseteq \bar{R_{P}^{λ}} (Y)$ .

(4) If P₁ ⊆ P₂ ⊆ A, then ∀ λ ∈ [0, 1], ∀ X ∈ 2^O,

$\underline{R_{P_{1}}^{λ}} (X) \subseteq \underline{R_{P_{2}}^{λ}} (X), \bar{R_{P_{2}}^{λ}} (X) \subseteq \bar{R_{P_{1}}^{λ}} (X) .$

(5) If 0 ≤ λ₁ ≤ λ₂ ≤ 1, then ∀ P ⊆ A, ∀ X ∈ 2^O,

$\underline{R_{P}^{λ_{2}}} (X) \subseteq \underline{R_{P}^{λ_{1}}} (X), \bar{R_{P}^{λ_{1}}} (X) \subseteq \bar{R_{P}^{λ_{2}}} (X) .$

(6) $\underline{R_{P}^{λ}} (X \cap Y) = \underline{R_{P}^{λ}} (X) \cap \underline{R_{P}^{λ}} (Y)$ ;

$\bar{R_{P}^{λ}} (X \cup Y) = \bar{R_{P}^{λ}} (X) \cup \bar{R_{P}^{λ}} (Y)$ .

(7) $\underline{R_{P}^{λ}} (O - X) = O - \bar{R_{P}^{λ}} (X)$ ; $\bar{R_{P}^{λ}} (O - X) = O - \underline{R_{P}^{λ}} (X)$ .

Proof. The conclusions follow directly from Definition 2.6, Proposition 2.7 and Definition 2.8. □

Definition 2.10. Let (O, A) be an IRVIS. Given a ∈ A, λ ∈ [0, 1] and $X \in Φ_{a}^{λ} (O)$ .

(1) $B_{a}^{λ} (X) = \bar{R_{a}^{λ}} (X) - \underline{R_{a}^{λ}} (X)$ is called the λ-boundary of X with respect to a.

(2) ${IB}_{a}^{λ} (X) = X - \underline{R_{a}^{λ}} (X)$ is called the inner λ-boundary of X with respect to a.

(3) ${OB}_{a}^{λ} (X) = \bar{R_{a}^{λ}} (X) - X$ is called the outer λ-boundary of X with respect to a.

Obviously,

${IB}_{a}^{λ} (X) = B_{a}^{λ} (X) \cap X, B_{a}^{λ} (X) = {IB}_{a}^{λ} (X) \cup {OB}_{a}^{λ} (X);$

$X \in Φ_{a}^{λ} (O) \Leftrightarrow B_{a}^{λ} (X) \neq \emptyset,$

$X \in Φ^{λ} (O) \Leftrightarrow \forall a \in A, B_{a}^{λ} (X) \neq \emptyset .$

Put simply, the inner boundary is the intersection of the set X with the boundary of X. Each element of the inner boundary is in conflict with X, in the sense that its tolerance class contains some elements that belong to X and others that do not.
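The three boundaries of Definition 2.10 and the identities stated after it can be checked on a small case. The sketch below is only an illustration: the helper `boundaries` is a hypothetical name, and the lower and upper approximations are the ones reported for a₃ in Example 3.8.

```python
from typing import Set, Tuple


def boundaries(X: Set[str], lower: Set[str],
               upper: Set[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (boundary, inner boundary, outer boundary) of X."""
    boundary = upper - lower  # upper approximation minus lower approximation
    inner = X - lower         # elements of X outside the lower approximation
    outer = upper - X         # elements of the upper approximation outside X
    return boundary, inner, outer


X = {"o1", "o2", "o3", "o4", "o6"}
lower = {"o2", "o3", "o6"}                      # lower approximation under a3
upper = {"o1", "o2", "o3", "o4", "o5", "o6"}    # upper approximation under a3

boundary, inner, outer = boundaries(X, lower, upper)
print(sorted(inner))  # ['o1', 'o4']
print(sorted(outer))  # ['o5']
print(inner == boundary & X, boundary == inner | outer)  # True True
```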

3 Outlier detection in an IRVIS based on the inner boundary

3.1 Outliers in an IRVIS based on the inner boundary

The outlier detection method based on inner boundary is inspired by [8]. In this section, we define outliers in an IRVIS based on the inner boundary.

Definition 3.1. Let (O, A) be an IRVIS. Given λ ∈ [0, 1] and E ⊆ X ∈ Φ^λ (O). Then E is called a λ-exceptional subset of X, if for any a ∈ A, $E \cap {IB}_{a}^{λ} (X) \neq \emptyset$ .

A λ-exceptional subset E of X is made out of the elements in X that contradict m tolerance relations $R_{a_{1}}^{λ}$ , $R_{a_{2}}^{λ}$ , ⋯, $R_{a_{m}}^{λ}$ .

Definition 3.2. Let (O, A) be an IRVIS. Given λ ∈ [0, 1] and E ⊆ X ∈ Φ^λ (O). Suppose that E is a λ-exceptional subset of X.

(1) e ∈ E is called dispensable in E, if E - {e} is also a λ-exceptional subset of X.

(2) e ∈ E is called indispensable in E, if E - {e} is not a λ-exceptional subset of X.

Definition 3.3. Let (O, A) be an IRVIS. Given λ ∈ [0, 1] and E ⊆ X ∈ Φ^λ (O). Suppose that E is a λ-exceptional subset of X. Then E is called non-redundant, if each e ∈ E is indispensable in E.

Non-redundant exceptional subsets constitute the fundamental source of elements to be considered outlier candidates in subsequent stages of the detection method.

Definition 3.4. Let (O, A) be an IRVIS. Given a ∈ A, λ ∈ [0, 1] and $o \in X \in Φ_{a}^{λ} (O)$ . If $o \in {IB}_{a}^{λ} (X)$ , then o is said to contradict $R_{a}^{λ}$ in X.

Definition 3.5. Let (O, A) be an IRVIS. Given λ ∈ [0, 1] and o ∈ X ∈ Φ^λ (O). Then λ-degree of marginality of o in X is defined as

${DM}_{o}^{λ} (X) = | \{{IB}_{a}^{λ} (X) : o \in {IB}_{a}^{λ} (X), a \in A\} | .$

${DM}_{o}^{λ} (X)$ represents the degree of contradiction of o with respect to m tolerance relations $R_{a_{1}}^{λ}$ , $R_{a_{2}}^{λ}$ , ⋯, $R_{a_{m}}^{λ}$ .

Clearly,

$0 \leq {DM}_{o}^{λ} (X) \leq | \{a \in A : o \in {IB}_{a}^{λ} (X)\} | \leq m .$

Definition 3.6. Let (O, A) be an IRVIS. Given λ ∈ [0, 1] and o ∈ X ∈ Φ^λ (O). Then λ-degree of exceptionality of o in X is defined as

${DE}_{o}^{λ} (X) = \frac{| \{{IB}_{a}^{λ} (X) : o \in {IB}_{a}^{λ} (X), a \in A\} |}{m} .$

The purpose of this notion is to normalize the ${DM}_{o}^{λ} (X)$ such that it can be limited to values between 0 and 1.

Obviously,

$0 \leq {DE}_{o}^{λ} (X) \leq 1 .$
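Definitions 3.5 and 3.6 count the distinct inner boundaries that contain an object, so two attributes with the same inner boundary contribute only once. A minimal sketch of this reading follows; the function names and the dictionary `INNER` (one inner boundary per attribute, taken from Example 3.8) are assumptions of the sketch.

```python
from typing import Dict, Set


def degree_of_marginality(inner: Dict[str, Set[str]], o: str) -> int:
    """Number of distinct inner boundaries (compared as sets) that contain o."""
    return len({frozenset(b) for b in inner.values() if o in b})


def degree_of_exceptionality(inner: Dict[str, Set[str]], o: str) -> float:
    """Degree of marginality divided by m, the number of attributes."""
    return degree_of_marginality(inner, o) / len(inner)


# Inner boundaries of X = {o1, o2, o3, o4, o6} for lambda = 0.5 (Example 3.8).
INNER = {
    "a1": {"o1"},
    "a2": {"o1", "o2", "o3", "o4", "o6"},
    "a3": {"o1", "o4"},
    "a4": {"o3"},
}

print(degree_of_marginality(INNER, "o1"))     # 3
print(degree_of_exceptionality(INNER, "o1"))  # 0.75
print(degree_of_exceptionality(INNER, "o3"))  # 0.5
```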

Definition 3.7. Let (O, A) be an IRVIS. Given λ, μ ∈ [0, 1] and o^* ∈ X ∈ Φ^λ (O). Then o^* is called a (λ, μ)-outlier in X or in the subsystem (X, A), if there exists a non-redundant λ-exceptional subset E of X such that o^* ∈ E and ${DE}_{o^{*}}^{λ} (X) > μ .$

In this paper, the set of all (λ, μ)-outliers in X or the subsystem (X, A) is denoted as $Ω_{μ}^{λ} (X)$ .

3.2 An example

In this part, an example is given to find outliers in an IRVIS based on the inner boundary.

Example 3.8. Given an IRVIS (O, A) shown in Table 1. Pick λ = 0.5. Then the tolerance class of each object is shown in Table 2.

Table 2
The tolerance class of each object

	$R_{a_{1}}^{λ} (o)$	$R_{a_{2}}^{λ} (o)$	$R_{a_{3}}^{λ} (o)$	$R_{a_{4}}^{λ} (o)$
o₁	{o₁, o₅, o₇}	{o₁, o₃, o₅}	{o₁, o₄, o₅, o₆}	{o₁}
o₂	{o₂, o₄}	{o₂, o₄, o₆, o₇}	{o₂}	{o₂, o₄, o₆}
o₃	{o₃}	{o₁, o₃, o₅}	{o₃}	{o₃, o₇}
o₄	{o₂, o₄}	{o₂, o₄, o₆, o₇}	{o₁, o₄, o₅, o₆}	{o₂, o₄, o₆}
o₅	{o₁, o₅, o₇}	{o₁, o₃, o₅}	{o₁, o₄, o₅}	{o₅}
o₆	{o₆}	{o₂, o₄, o₆, o₇}	{o₁, o₄, o₆}	{o₂, o₄, o₆}
o₇	{o₁, o₅, o₇}	{o₂, o₄, o₆, o₇}	{o₇}	{o₃, o₇}

Let $g_{1} = R_{a_{1}}^{λ} (o_{1}), g_{2} = R_{a_{2}}^{λ} (o_{1}), g_{3} = R_{a_{3}}^{λ} (o_{1}), g_{4} = R_{a_{4}}^{λ} (o_{1})$ . From Table 2, we have g₁ = {o₁, o₅, o₇} , g₂ = {o₁, o₃, o₅} , g₃ = {o₁, o₄, o₅, o₆} , g₄ = {o₁}.

Suppose X = {o₁, o₂, o₃, o₄, o₆}. Obviously, X ∈ Φ^λ (O).

(1) For a_i (i = 1, 2, 3, 4), the λ-lower approximation and the λ-upper approximation of X are calculated as follows:

$\underline{R_{a_{1}}^{λ}} (X) = \{o_{2}, o_{3}, o_{4}, o_{6}\}$ , $\bar{R_{a_{1}}^{λ}} (X) = \{o_{1}, o_{2}, o_{3}, o_{4}, o_{5}, o_{6}, o_{7}\},$

$\underline{R_{a_{2}}^{λ}} (X) = \emptyset$ , $\bar{R_{a_{2}}^{λ}} (X) = \{o_{1}, o_{2}, o_{3}, o_{4}, o_{5}, o_{6}, o_{7}\},$

$\underline{R_{a_{3}}^{λ}} (X) = \{o_{2}, o_{3}, o_{6}\}$ , $\bar{R_{a_{3}}^{λ}} (X) = \{o_{1}, o_{2}, o_{3}, o_{4}, o_{5}, o_{6}\},$

$\underline{R_{a_{4}}^{λ}} (X) = \{o_{1}, o_{2}, o_{4}, o_{6}\}$ , $\bar{R_{a_{4}}^{λ}} (X) = \{o_{1}, o_{2}, o_{3}, o_{4}, o_{6}, o_{7}\} .$

(2) The corresponding ${IB}_{a}^{λ} (X)$ can be calculated accordingly:

${IB}_{a_{1}}^{λ} (X) = X - \underline{R_{a_{1}}^{λ}} (X) = \{o_{1}\},$

${IB}_{a_{2}}^{λ} (X) = X - \underline{R_{a_{2}}^{λ}} (X) = \{o_{1}, o_{2}, o_{3}, o_{4}, o_{6}\},$

${IB}_{a_{3}}^{λ} (X) = X - \underline{R_{a_{3}}^{λ}} (X) = \{o_{1}, o_{4}\},$

${IB}_{a_{4}}^{λ} (X) = X - \underline{R_{a_{4}}^{λ}} (X) = \{o_{3}\} .$

(3) The λ-degree of marginality of o in X by Definition 3.5 is calculated as follows:

${DM}_{o_{1}}^{λ} (X) = | \{{IB}_{a_{1}}^{λ} (X), {IB}_{a_{2}}^{λ} (X), {IB}_{a_{3}}^{λ} (X)\} | = 3,$

${DM}_{o_{2}}^{λ} (X) = | \{{IB}_{a_{2}}^{λ} (X)\} | = 1,$

${DM}_{o_{3}}^{λ} (X) = | \{{IB}_{a_{2}}^{λ} (X), {IB}_{a_{4}}^{λ} (X)\} | = 2,$

${DM}_{o_{4}}^{λ} (X) = | \{{IB}_{a_{2}}^{λ} (X), {IB}_{a_{3}}^{λ} (X)\} | = 2,$

${DM}_{o_{6}}^{λ} (X) = | \{{IB}_{a_{2}}^{λ} (X)\} | = 1 .$

(4) With the λ-degree of marginality of o in X, the λ-degree of exceptionality of o in X follows immediately by dividing by m = 4. The results are shown below:

${DE}_{o_{1}}^{λ} (X) = \frac{{DM}_{o_{1}}^{λ} (X)}{m} = \frac{3}{4} = 0.75,$

${DE}_{o_{2}}^{λ} (X) = \frac{{DM}_{o_{2}}^{λ} (X)}{m} = \frac{1}{4} = 0.25,$

${DE}_{o_{3}}^{λ} (X) = \frac{{DM}_{o_{3}}^{λ} (X)}{m} = \frac{2}{4} = 0.5,$

${DE}_{o_{4}}^{λ} (X) = \frac{{DM}_{o_{4}}^{λ} (X)}{m} = \frac{2}{4} = 0.5,$

${DE}_{o_{6}}^{λ} (X) = \frac{{DM}_{o_{6}}^{λ} (X)}{m} = \frac{1}{4} = 0.25,$

Given μ = 0.7, we have

${DE}_{o_{2}}^{λ} (X) = {DE}_{o_{6}}^{λ} (X) < {DE}_{o_{3}}^{λ} (X) = {DE}_{o_{4}}^{λ} (X) < μ < {DE}_{o_{1}}^{λ} (X) .$

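Only o₁ has a degree of exceptionality ${DE}_{o_{1}}^{λ} (X)$ higher than the threshold μ. Since E = {o₁, o₃} is a non-redundant λ-exceptional subset of X that contains o₁, the object o₁ is taken as an outlier according to Definition 3.7.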
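The numbers of this example can be reproduced with a short script. The sketch below is illustrative only: it represents * by `None`, uses 0-based indices for objects, and for brevity checks only the condition on the degree of exceptionality, not the existence of a non-redundant exceptional subset. All function and variable names are assumptions.

```python
from typing import Dict, List, Optional, Set

Column = List[Optional[float]]  # None stands for the missing value '*'

TABLE_1: Dict[str, Column] = {
    "a1": [23.6, 15.5, None, 18.3, 23.6, None, 25.4],
    "a2": [0, 1, 0, 1, 0, 1, 1],
    "a3": [200, None, None, 200, 300, 100, None],
    "a4": [None, 10, 80, 10, None, 40, 80],
}


def distance_degree(col: Column, i: int, j: int) -> float:
    """Distance degree of Definition 2.5 on one attribute."""
    if i == j:
        return 0.0
    known = {v for v in col if v is not None}
    x, y = col[i], col[j]
    if x is None and y is None:
        return 1.0 - 1.0 / len(known) ** 2
    if x is None or y is None:
        return 1.0 - 1.0 / len(known)
    return 0.0 if x == y else abs(x - y) / (max(known) - min(known))


def inner_boundary(col: Column, X: Set[int], lam: float) -> Set[int]:
    """X minus its lower approximation under a single attribute."""
    n = len(col)
    lower = set()
    for i in range(n):
        cls = {j for j in range(n) if distance_degree(col, i, j) <= lam}
        if cls <= X:
            lower.add(i)
    return X - lower


def exceptionality(table: Dict[str, Column], X: Set[int], lam: float) -> Dict[int, float]:
    """Degree of exceptionality of every object of X."""
    inner = [inner_boundary(col, X, lam) for col in table.values()]
    m = len(table)
    return {o: len({frozenset(b) for b in inner if o in b}) / m for o in sorted(X)}


X = {0, 1, 2, 3, 5}  # o1, o2, o3, o4, o6
de = exceptionality(TABLE_1, X, lam=0.5)
for o, value in de.items():
    print(f"o{o + 1}: {value}")
outliers = [f"o{o + 1}" for o, value in de.items() if value > 0.7]
print(outliers)  # ['o1']
```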

3.3 Outlier detection algorithms in an IRVIS based on inner boundary

This section introduces the outlier detection algorithm based on inner boundary in an IRVIS and analyzes the time complexity of related algorithms.

Given an IRVIS (O, A), our outlier detection method is presented in Algorithms 1, 2 and 3. Algorithm 1 calculates the tolerance class of each object under each attribute. Algorithm 2 calculates ${IB}_{A}^{λ} (X)$ , the collection of the inner boundaries of the object set X under each attribute. Algorithm 3 calculates the outlier set.

Algorithm 1 Computing $R_{B}^{λ} (o)$

Input: An IRVIS (O, A), a threshold λ ∈ [0, 1], B ⊆ A and o ∈ O.

Output: The tolerance class $R_{B}^{λ} (o)$ .

1: for a ∈ B do

2: for o, o′ ∈ O do

3: Calculate d (a (o) , a (o′)) ;

4: end for

5: end for

6: λ ∈ [0, 1]

7: $R_{B}^{λ} (o) \leftarrow \{o^{'} \in O : \forall a \in B, d (a (o), a (o^{'})) \leq λ\};$

8: return $R_{B}^{λ} (o)$ .

The time complexity for Algorithm 1 is as follows. In step 3, the time complexity for calculating the d (a (o) , a (o′)) is O (|V_a|²) where V_a = {a (o) : o ∈ O}.

For Algorithm 2, the time complexity of computing the intersection of each object’s tolerance classes with object set X in step 8 is O (|A| × |X|²). Then the time complexity in step 17 is O (|A|), thus the total time complexity for Algorithm 2 amounts to O (|A| × |X|²).

Step 10 in Algorithm 3 is to determine whether there is a subset relationship between the tolerance classes of two different objects, and its time complexity is O (|A|² × |X|). The time complexity of steps 19 to 25 is O (|E|), O (|E|) ≤ O (|X|). So the total time complexity for Algorithm 3 is O (|A|² × |X|).

4 Experimental analyses

4.1 Experimental setup

The experiments in this section are carried out on a computer equipped with an Intel Core i5-7200U processor (listed frequencies of 2.50 GHz and 2.70 GHz) and 8 GB of memory. The operating system is Windows 10. The experiments are developed with Python 3.8.

Algorithm 2 Calculating the ${IB}_{A}^{λ} (X)$

Input: An IRVIS (O, A), a threshold λ ∈ [0, 1] and X ∈ 2^O.

Output: ${IB}_{A}^{λ} (X)$ .

1: Obtain tolerance class $R_{a}^{λ} (o)$ by Algorithm 1;

2: ${IB}_{A}^{λ} (X) \leftarrow \emptyset$

3: for a ∈ A do

4: $\underline{R_{a}^{λ}} (X) \leftarrow \emptyset$

5: $\bar{R_{a}^{λ}} (X) \leftarrow \emptyset$

6: ${IB}_{a}^{λ} (X) \leftarrow \emptyset$

7: for o ∈ O do

8: if $R_{a}^{λ} (o) \cap X = R_{a}^{λ} (o)$ then

9: $\underline{R_{a}^{λ}} (X) \leftarrow \underline{R_{a}^{λ}} (X) \cup \{o\}$ ; $\bar{R_{a}^{λ}} (X) \leftarrow \bar{R_{a}^{λ}} (X) \cup \{o\}$ ;

10: else if $R_{a}^{λ} (o) \cap X \neq \emptyset$ then

11: $\bar{R_{a}^{λ}} (X) \leftarrow \bar{R_{a}^{λ}} (X) \cup \{o\}$ ;

12: end if

13: end for

14: ${IB}_{a}^{λ} (X) \leftarrow X - \underline{R_{a}^{λ}} (X)$ ;

15: end for

16: for a ∈ A do

17: ${IB}_{A}^{λ} (X) \leftarrow {IB}_{A}^{λ} (X) \cup \{{IB}_{a}^{λ} (X)\}$ ;

18: end for

19: return ${IB}_{A}^{λ} (X)$ .

Algorithm 3 Calculating the outlier set

Input: An IRVIS (X, A) and two thresholds λ, μ ∈ [0, 1], X ∈ 2^O.

Output: $Ω_{μ}^{λ} (X)$ .

1: E← ∅

2: $Ω_{μ}^{λ} (X) \leftarrow \emptyset$

3: Obtain ${IB}_{A}^{λ} (X)$ by Algorithm 2;

4: for ${IB}_{a}^{λ} (X) \in {IB}_{A}^{λ} (X)$ do

5: flag = False

6: for ${IB}_{a^{'}}^{λ} (X) \in {IB}_{A}^{λ} (X)$ do

7: if ${IB}_{a^{'}}^{λ} (X) = {IB}_{a}^{λ} (X)$ then

8: continue;

9: end if

10: if ${IB}_{a^{'}}^{λ} (X) \cap {IB}_{a}^{λ} (X) = {IB}_{a^{'}}^{λ} (X)$ then

11: flag = True;

12: break;

13: end if

14: end for

15: if not flag then

16: $E \leftarrow E \cup {IB}_{a}^{λ} (X)$ ;

17: end if

18: end for

19: for o ∈ E do

20: Calculating ${DM}_{o}^{λ} (X)$

21: Calculating ${DE}_{o}^{λ} (X)$

22: if ${DE}_{o}^{λ} (X) \geq μ$ then

23: $Ω_{μ}^{λ} (X) \leftarrow Ω_{μ}^{λ} (X) \cup \{o\}$ ;

24: end if

25: end for

26: return $Ω_{μ}^{λ} (X)$ .
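One possible reading of Algorithm 3 in Python is sketched below. It takes the inner boundaries as input instead of calling Algorithm 2, keeps an inner boundary only if no other, different inner boundary is contained in it, and uses the comparison of step 22 (greater than or equal to μ), whereas Definition 3.7 uses a strict inequality. The function name `outlier_set` and the dictionary `INNER` (taken from Example 3.8) are assumptions of this sketch.

```python
from typing import Dict, Set


def outlier_set(inner: Dict[str, Set[str]], mu: float) -> Set[str]:
    """Candidate selection and thresholding in the spirit of Algorithm 3."""
    boundaries = list(inner.values())
    candidates: Set[str] = set()
    for b in boundaries:
        # Discard b when another, different inner boundary is a subset of it.
        contains_other = any(other != b and other <= b for other in boundaries)
        if not contains_other:
            candidates |= b

    m = len(inner)  # number of attributes
    result: Set[str] = set()
    for o in candidates:
        dm = len({frozenset(b) for b in boundaries if o in b})  # degree of marginality
        if dm / m >= mu:                                        # degree of exceptionality
            result.add(o)
    return result


INNER = {
    "a1": {"o1"},
    "a2": {"o1", "o2", "o3", "o4", "o6"},
    "a3": {"o1", "o4"},
    "a4": {"o3"},
}

print(sorted(outlier_set(INNER, mu=0.7)))  # ['o1']
print(sorted(outlier_set(INNER, mu=0.5)))  # ['o1', 'o3']
```

In this example the candidate set is {o₁, o₃}: the inner boundaries under a₂ and a₃ both contain the inner boundary under a₁ and are therefore discarded. As written, an empty inner boundary would be contained in every other one; how such a case should be handled is not specified in the text.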

The DE outlier detection algorithm is compared with five main outlier detection algorithms: the distance-based algorithm (DB), the k-nearest neighbors-based algorithm (KNN), the cluster-based local outlier factor algorithm (CBLOF), the sequence-based method (SEQ) and the neighborhood-based method (NOOF). To compare the performance of all outlier detection algorithms, the evaluation metric proposed by Aggarwal and Yu [1] is used.

For each algorithm, this paper first calculates the outlierness of each object in the data set, and arranges the objects in descending order of outlierness. The higher an object ranks in this ordering, the higher the probability that it is an outlier. Then the front part of the ordered object sequence is selected at intervals of the same size to see how many of the selected objects actually come from the rare class (the real outliers). The more objects selected from the rare class, the better the performance of the outlier detection algorithm. For example, suppose there are 24 objects in the rare class of a data set, and an object sequence in descending order of outlierness is obtained through an outlier detection algorithm. Take the first 20 objects in the sequence. If nearly all 20 of them really belong to the rare class, the performance of the outlier detection algorithm is considered very good. Since the missing data account for only a small part of each data set, the parameter of the DE algorithm is set to 0.2 in all data sets. The parameters of the other algorithms are set differently for different data sets: multiple experiments are carried out on each data set and the parameter setting with the best effect is selected.

The threshold μ mentioned in the DE outlier detection algorithm is the standard used to judge whether an object is detected as an outlier. The threshold μ is set according to the number of objects in the rare class. Intuitively, the DE outlier detection algorithm calculates the outlierness of each object and obtains a sequence in descending order of outlierness. If the number of objects in the rare class of a data set is k, the k-th outlierness in the sequence is taken as the threshold μ.
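The choice of μ described above can be sketched as follows. The function name `threshold_mu` and the toy scores are assumptions of this illustration; the scores happen to be the degrees of exceptionality of Example 3.8.

```python
from typing import Sequence


def threshold_mu(scores: Sequence[float], n_rare: int) -> float:
    """Outlierness at rank n_rare when the scores are sorted in descending order."""
    if not 1 <= n_rare <= len(scores):
        raise ValueError("n_rare must be between 1 and the number of objects")
    return sorted(scores, reverse=True)[n_rare - 1]


scores = [0.75, 0.25, 0.5, 0.5, 0.25]
mu = threshold_mu(scores, 2)
flagged = [i for i, s in enumerate(scores) if s >= mu]
print(mu)       # 0.5
print(flagged)  # [0, 2, 3]
```

Because the degree of exceptionality takes only a few distinct values, ties at the threshold are likely, and a non-strict comparison can then flag more objects than the size of the rare class, as in this toy case.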

4.2 Data set experiment results

Ten data sets from the UCI Machine Learning Repository are used: Glass Identification Data Set, Computer Hardware Data Set, Divorce Predictors Data Set, Ionosphere Data Set, Breast Cancer Wisconsin (Original) Data Set, Breast Cancer Wisconsin (Diagnostic) Data Set, Airfoil Self-Noise Data Set, (OBS) Network Data Set, Vehicle Silhouettes Data Set, Obesity Data Set. The rare class is the class that contains only a small fraction of the objects in each data set. It is usually defined according to specific context semantics. For example, in the context of medical cases, an uncommon case is usually regarded as the rare class.

These data sets are real-valued data sets. In order to study the task of outlier detection in an IRVIS, the data sets need to be preprocessed. The preprocessing includes converting each data set into an information system table, where one row corresponds to one object and one column corresponds to one attribute. It also includes randomly removing 1% of all information values in each information system table.
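The random removal of 1% of the information values can be sketched as below. This is an assumed procedure, not the authors' script: the function name `mask_values`, the use of `None` for removed values, the rounding of the number of removed cells and the fixed seed are all choices made for the illustration.

```python
import random
from typing import List, Optional

Row = List[Optional[float]]


def mask_values(table: List[Row], ratio: float = 0.01, seed: int = 0) -> List[Row]:
    """Return a copy of the table with a fraction `ratio` of its cells set to None."""
    rng = random.Random(seed)  # fixed seed so that the masking is reproducible
    cells = [(r, c) for r in range(len(table)) for c in range(len(table[0]))]
    n_missing = round(ratio * len(cells))
    masked = [list(row) for row in table]  # leave the input table untouched
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = None
    return masked


# A toy table with 20 objects and 10 attributes, i.e. 200 information values.
table = [[float(r * 10 + c) for c in range(10)] for r in range(20)]
masked = mask_values(table, ratio=0.01)
n_none = sum(value is None for row in masked for value in row)
print(n_none)  # 2
```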

Among the public data sets of the UCI Machine Learning Repository, few are suitable for outlier detection tasks. The 10 data sets used in this paper are mainly used in experiments on classification and clustering tasks. These data sets are relatively balanced in the distribution of the decision classes, so they are not directly suitable for outlier detection. To solve this problem, this paper uses the experimental technique proposed in [12] to preprocess these data sets and obtain data sets for outlier detection. This method randomly deletes objects of the rare class and retains the objects of the other classes, so that an unbalanced distribution of the decision classes is formed. In this paper, we preprocessed two data sets (Divorce Predictors, Breast Cancer Wisconsin (Diagnostic)) to make them more discriminative. Table 3 summarizes the data sets used in this paper.

Table 3
Data set statistics

ID	Data set	Dimensions	Instances	Outlier ratio
1	Glass Identification	11	214	4.21%
2	Computer Hardware	10	209	5.74%
3	Divorce Predictors	55	130	35.38%
4	Ionosphere	34	351	35.90%
5	Breast Cancer Wisconsin (Original)	10	699	34.48%
6	Breast Cancer Wisconsin (Diagnostic)	32	489	26.99%
7	Airfoil Self-Noise	6	1503	5.26%
8	(OBS) Network	22	1075	11.16%
9	Vehicle Silhouettes	19	846	23.52%
10	Obesity	17	2111	12.88%

The detailed experimental results for each data set are given below.

Glass Identification Data Set. The Glass Identification data set from the UCI Machine Learning Repository includes 214 objects with 11 attributes and can be divided into 6 classes. The decision class “tableware glass” is regarded as the rare class because it accounts for only 4.21% of the whole data set. The experimental figures and the result table for this data set are presented below, together with an explanation of the results.

(1) The threshold μ is set according to the proportion of the rare class in the data set. Specifically, the outlier ratio is 4.21% (see Table 3), so the outlierness ranked at the 4.21% position in descending order is taken as the threshold μ (see Figure 2). The red points in the figure represent the real outliers in the data set, and the blue points represent the normal objects.

Fig. 2

Scatter chart for Glass Identification.

(2) Table 4 shows the experimental results in detail; it can be read as a dynamic table of the outlier detection rate. The bold numbers in each row indicate the best outlier detection rate at the current stage of detection. In Table 4, “Top ratio” refers to the proportion of objects taken from the front of the object sequence sorted by outlierness, “Number of objects” is the number of selected objects, “Number of rare classes” is the number of real outliers contained in the selected objects, and “Coverage” is the ratio of “Number of rare classes” to the total number of real outliers.

Table 4

Experimental results in Glass Identification Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.04(8)	4(44.4444)	0(0.0)	1(11.1111)	0(0.0)	1(11.1111)	1(11.1111)
0.08(16)	7(77.7778)	0(0.0)	1(11.1111)	1(11.1111)	1(11.1111)	1(11.1111)
0.11(24)	8(88.8889)	2(22.2222)	1(11.1111)	3(33.3333)	1(11.1111)	2(22.2222)
0.15(32)	8(88.8889)	2(22.2222)	1(11.1111)	3(33.3333)	1(11.1111)	2(22.2222)
0.19(40)	8(88.8889)	5(55.5556)	4(44.4444)	3(33.3333)	3(33.3333)	2(22.2222)
0.23(48)	8(88.8889)	6(66.6667)	9(100.0)	4(44.4444)	4(44.4444)	4(44.4444)
0.26(56)	9(100.0)	9(100.0)	9(100.0)	5(55.5556)	4(44.4444)	4(44.4444)
0.3(64)	9(100.0)	9(100.0)	9(100.0)	5(55.5556)	4(44.4444)	4(44.4444)
0.34(72)	9(100.0)	9(100.0)	9(100.0)	6(66.6667)	6(66.6667)	5(55.5556)
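The quantities reported in Table 4 can be computed from a ranking as in the following sketch. The function name is hypothetical, and the positions of the real outliers in the ranking are invented so that the counts agree with the DE column of Table 4; the actual positions are not given in the paper.

```python
def coverage_at_top_k(ranked_is_outlier, k):
    """Count real outliers among the first k ranked objects and their coverage in percent."""
    total = sum(ranked_is_outlier)
    if total == 0:
        raise ValueError("the ranking contains no real outliers")
    found = sum(ranked_is_outlier[:k])
    return found, 100.0 * found / total


# Hypothetical ranking of 214 objects in descending order of outlierness with
# 9 real outliers; the positions below are illustrative, not taken from the paper.
outlier_positions = {0, 2, 3, 6, 9, 11, 14, 20, 50}
ranking = [i in outlier_positions for i in range(214)]

found, coverage = coverage_at_top_k(ranking, k=8)
rows = [(k, *coverage_at_top_k(ranking, k)) for k in range(8, 73, 8)]
```

Each tuple in `rows` holds the number of selected objects, the number of real outliers among them, and the coverage in percent, which corresponds to one row of the table for a single algorithm.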

(3) Figure 3 shows the trend of the detection rate of the six outlier detection algorithms. The line graph shows that the DE algorithm has an advantage over the other five algorithms on the Glass Identification data set.

Fig. 3

Detection rate comparison for Glass Identification.

The experimental procedure and results for the Glass Identification data set have been introduced in detail above. For the remaining 9 data sets, the scatter charts and the detection rate comparison charts are displayed together.

Fig. 5

Detection rate comparison chart results.

Computer Hardware Data Set. The Computer Hardware data set from the UCI Machine Learning Repository includes 209 objects and 10 attributes. The last attribute is the decision attribute, which represents the estimated relative performance score. In this data set, all objects with “performance score ≥ 350” are regarded as objects of the rare class, because these objects account for only 5.74% of the data set (see Table 3). According to the percentage of the rare class and the calculation method of the threshold μ, the threshold μ for this data set is obtained (see Figure 4(a)). Table 5 shows that, among the ten objects ranked as most likely to be outliers by each algorithm, the DE algorithm detects 10 real outliers. In contrast, the CBLOF algorithm detects 3, the DB algorithm detects 6, the KNN algorithm detects 6, the SEQ algorithm detects 9, and the NOOF algorithm detects 9. In the subsequent stages, the DE algorithm always achieves the best outlier detection rate.

Fig. 4

Scatter chart results.

Table 5

Experimental results in Computer Hardware Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.05(10)	10(90.9091)	3(27.2727)	6(54.5455)	6(54.5455)	9(81.8182)	9(81.8182)
0.1(20)	11(100.0)	8(72.7273)	11(100.0)	7(63.6364)	11(100.0)	11(100.0)
0.14(30)	11(100.0)	11(100.0)	11(100.0)	7(63.6364)	11(100.0)	11(100.0)
0.19(40)	11(100.0)	11(100.0)	11(100.0)	9(81.8182)	11(100.0)	11(100.0)
0.24(50)	11(100.0)	11(100.0)	11(100.0)	11(100.0)	11(100.0)	11(100.0)
0.29(60)	11(100.0)	11(100.0)	11(100.0)	11(100.0)	11(100.0)	11(100.0)
0.34(70)	11(100.0)	11(100.0)	11(100.0)	11(100.0)	11(100.0)	11(100.0)
0.38(80)	11(100.0)	11(100.0)	11(100.0)	11(100.0)	11(100.0)	11(100.0)

Divorce Predictors Data Set. The data set has 170 objects and 55 attributes. The last column of the attribute set is the decision column, which divides the data set into the classes “0” and “1”, representing predicted divorce and predicted non-divorce respectively. The “0” and “1” classes contain 86 and 84 objects respectively, so the data set needs to be preprocessed to make it unbalanced. To this end, a part of the objects in the “0” (predicted divorce) class is randomly removed, so that this class becomes the rare class of interest in the outlier detection experiment. The final data set contains 130 objects, of which the rare class accounts for 35.38% (see Table 3). The threshold μ calculated according to the outlier ratio is 0.3519 (see Figure 4(b)). Table 6 shows that the DE and KNN algorithms perform very well until all true outliers are detected.

Table 6

Experimental results in Divorce Predictors Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.16(20)	20(44.4444)	16(35.5556)	20(44.4444)	20(44.4444)	6(13.3333)	20(44.4444)
0.31(40)	40(88.8889)	34(75.5556)	40(88.8889)	39(86.6667)	18(40.0)	29(64.4444)
0.47(60)	45(100.0)	37(82.2222)	41(91.1111)	45(100.0)	32(71.1111)	38(84.4444)
0.62(80)	45(100.0)	42(93.3333)	41(91.1111)	45(100.0)	43(95.5556)	44(97.7778)
0.78(100)	45(100.0)	43(95.5556)	45(100.0)	45(100.0)	45(100.0)	45(100.0)
0.93(120)	45(100.0)	43(95.5556)	45(100.0)	45(100.0)	45(100.0)	45(100.0)

Table 12

Experimental results in Vehicle Silhouettes Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.02(20)	8(4.0404)	6(3.0303)	8(4.0404)	8(4.0404)	2(1.0101)	6(3.0303)
0.05(40)	16(8.0808)	13(6.5657)	15(7.5758)	16(8.0808)	6(3.0303)	8(4.0404)
0.07(60)	24(12.1212)	20(10.101)	23(11.6162)	24(12.1212)	9(4.5455)	12(6.0606)
0.09(80)	33(16.6667)	27(13.6364)	31(15.6566)	33(16.6667)	11(5.5556)	20(10.101)
0.12(100)	41(20.7071)	32(16.1616)	41(20.7071)	41(20.7071)	19(9.596)	25(12.6263)
0.14(120)	51(25.7576)	36(18.1818)	51(25.7576)	51(25.7576)	26(13.1313)	33(16.6667)

Table 14

Comparison among outlier detection algorithms

Algorithm	Advantage	Disadvantage
DE	high performance	requires distance calculation
CBLOF	deals with local outliers	unstable performance
DB	relatively simple	unstable performance
KNN	easy to understand	requires distance calculation
SEQ	deals with sequences	requires discretization pretreatment for numeric data
NOOF	relatively simple	underuse of data information

Ionosphere Data Set. The data set has 351 objects and 34 attributes. The last column of the attribute set is the decision column, which divides the data set into two categories (g: good, b: bad). The category labeled “b” accounts for only 35.90% of the data set (see Table 3), so it is regarded as the rare class. According to the outlier ratio, the threshold μ = 0.4848 is calculated (see Figure 4(c)). The dynamic table of the outlier detection rate in Table 7 shows that the DE, KNN and NOOF algorithms have the best outlier detection effect in the initial stage. In the intermediate stage the KNN algorithm is the best, but in the end the DE algorithm is the first to detect all the real outliers.

Table 7

Experimental results in Ionosphere Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.06(20)	20(16.0)	18(14.4)	10(8.0)	20(16.0)	18(14.4)	20(16.0)
0.11(40)	40(32.0)	37(29.6)	28(22.4)	40(32.0)	33(26.4)	40(32.0)
0.17(60)	58(46.4)	56(44.8)	46(36.8)	60(48.0)	41(32.8)	57(45.6)
0.23(80)	73(58.4)	73(58.4)	63(50.4)	79(63.2)	54(43.2)	69(55.2)
0.29(100)	90(72.0)	93(74.4)	79(63.2)	93(74.4)	59(47.2)	80(64.0)
0.34(120)	105(84.0)	105(84.0)	98(78.4)	101(80.8)	69(55.2)	87(69.6)
0.4(140)	116(92.8)	110(88.0)	110(88.0)	109(87.2)	78(62.4)	95(76.0)
0.46(160)	120(96.0)	110(88.0)	114(91.2)	111(88.8)	91(72.8)	99(79.2)

Breast Cancer Wisconsin (Original) Data Set. The data set includes 699 objects and 10 attributes. The last attribute is the decision attribute, which divides all objects into “benign” and “malignant”; the “malignant” objects account for only 34.48% of the whole data set (see Table 3). Regarding the “malignant” class as the rare class, the threshold μ for this data set is obtained from the percentage of the rare class and the calculation method of the threshold μ (see Figure 4(d)). As shown in Table 8, the DE, SEQ and NOOF algorithms all perform very well in the initial detection stage, after which the DE algorithm is slightly inferior to the SEQ algorithm.

Table 8

Experimental results in Breast Cancer Wisconsin (Original) Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.03(20)	20(8.4034)	9(3.7815)	18(7.563)	18(7.563)	20(8.4034)	20(8.4034)
0.06(40)	40(16.8067)	19(7.9832)	38(15.9664)	35(14.7059)	40(16.8067)	40(16.8067)
0.09(60)	60(25.2101)	35(14.7059)	56(23.5294)	51(21.4286)	59(24.7899)	60(25.2101)
0.11(80)	75(31.5126)	46(19.3277)	75(31.5126)	67(28.1513)	79(33.1933)	80(33.6134)
0.14(100)	95(39.916)	64(26.8908)	95(39.916)	85(35.7143)	98(41.1765)	98(41.1765)
0.17(120)	115(48.3193)	79(33.1933)	115(48.3193)	104(43.6975)	117(49.1597)	115(48.3193)
0.2(140)	135(56.7227)	93(39.0756)	135(56.7227)	122(51.2605)	135(56.7227)	131(55.042)

Breast Cancer Wisconsin (Diagnostic) Data Set. The data set has 569 objects and 32 attributes. The last attribute is the decision attribute, which divides the data set into “M” (malignant) and “B” (benign). The classes “M” and “B” contain 212 and 357 objects respectively. The “M” (malignant) class is regarded as the rare class in the experiment, and a part of the objects in this class is randomly removed. The final data set contains 489 objects, of which the rare class accounts for 26.99%, as shown in Table 3. With the ratio of the rare class, the threshold μ for this data set is calculated as 0.4333 (see Figure 4(e)). As shown in Table 9, the detection performance of the DE algorithm is very good at the beginning and at the end, but it lags slightly behind the CBLOF algorithm in the intermediate detection stage.

Table 9

Experimental results in Breast Cancer Wisconsin (Diagnostic) Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.04(20)	20(15.2672)	13(9.9237)	15(11.4504)	15(11.4504)	19(14.5038)	18(13.7405)
0.08(40)	40(30.5344)	21(16.0305)	32(24.4275)	28(21.374)	38(29.0076)	33(25.1908)
0.12(60)	60(45.8015)	31(23.6641)	49(37.4046)	38(29.0076)	52(39.6947)	47(35.8779)
0.16(80)	80(61.0687)	44(33.5878)	64(48.855)	49(37.4046)	66(50.3817)	60(45.8015)
0.2(100)	97(74.0458)	58(44.2748)	74(56.4885)	63(48.0916)	80(61.0687)	71(54.1985)
0.25(120)	105(80.1527)	77(58.7786)	84(64.1221)	75(57.2519)	92(70.229)	81(61.8321)
0.29(140)	110(83.9695)	97(74.0458)	91(69.4656)	83(63.3588)	100(76.3359)	91(69.4656)
0.33(160)	114(87.0229)	117(89.313)	96(73.2824)	95(72.5191)	108(82.4427)	93(70.9924)
0.86(420)	131(100.0)	129(98.4733)	124(94.6565)	130(99.2366)	129(98.4733)	127(96.9466)
0.9(440)	131(100.0)	130(99.2366)	126(96.1832)	130(99.2366)	129(98.4733)	130(99.2366)

Table 15

Confusion matrix for predicting an outlier

	Predicted outlier	Predicted inlier
Actual outlier	TP	FN
Actual inlier	FP	TN

Airfoil Self-Noise Data Set. The data set has 1503 objects and 6 attributes. The last column of the attribute set is the decision column, which represents the “scaled sound pressure level” in decibels. The experiment focuses on objects with sound pressure level ≤ 110 or sound pressure level ≥ 136, which are regarded as the rare class of special concern. This rare class accounts for only 5.26% of the whole data set (see Table 3). With the proportion of the rare class and the calculation method of the threshold μ, the threshold μ for this data set is 0.6 (see Figure 4(f)). As shown in Table 10, none of the outlier detection algorithms performs well on this data set. For example, among the top 20 objects ranked as most likely to be outliers, the best algorithm at this stage, SEQ, detects only 5 real outliers. Among the six algorithms, the detection effect of the DE algorithm is relatively good.

Table 10

Experimental results in Airfoil Self-Noise Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.01(20)	3(4.0)	0(0.0)	1(1.3333)	1(1.3333)	5(6.6667)	2(2.6667)
0.03(40)	3(4.0)	0(0.0)	1(1.3333)	3(4.0)	12(16.0)	5(6.6667)
0.04(60)	10(13.3333)	0(0.0)	3(4.0)	4(5.3333)	15(20.0)	6(8.0)
0.05(80)	10(13.3333)	2(2.6667)	6(8.0)	4(5.3333)	16(21.3333)	11(14.6667)
0.07(100)	14(18.6667)	2(2.6667)	7(9.3333)	4(5.3333)	18(24.0)	13(17.3333)
0.08(120)	17(22.6667)	2(2.6667)	9(12.0)	4(5.3333)	18(24.0)	18(24.0)
0.09(140)	19(25.3333)	4(5.3333)	12(16.0)	4(5.3333)	20(26.6667)	19(25.3333)
0.11(160)	26(34.6667)	5(6.6667)	14(18.6667)	5(6.6667)	21(28.0)	20(26.6667)
0.12(180)	27(36.0)	9(12.0)	16(21.3333)	9(12.0)	22(29.3333)	22(29.3333)

(OBS) Network Data Set. The data set has 1075 objects and 22 attributes, the last of which is the decision attribute. It divides all objects into four categories, namely “NB-No Block”, “Block”, “No Block” and “NB-Wait”. There are 500 objects in the “NB-No Block” class, 120 objects in the “Block” class, 155 objects in the “No Block” class, and 300 objects in the “NB-Wait” class. The outlier detection experiment focuses on the “Block” class, which is regarded as the rare class and accounts for 11.16% of the data set (see Table 3). According to the proportion of the rare class and the calculation method of the threshold μ, the threshold μ for this data set is 0.8571 (see Figure 4(g)). In the dynamic table of the outlier detection rate shown in Table 11, the detection rate of the DE algorithm is the highest at every stage.

Table 11

Experimental results in (OBS) Network Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.02(20)	20(17.2414)	4(3.4483)	16(13.7931)	7(6.0345)	13(11.2069)	6(5.1724)
0.04(40)	34(29.3103)	16(13.7931)	24(20.6897)	12(10.3448)	29(25.0)	10(8.6207)
0.06(60)	46(39.6552)	22(18.9655)	33(28.4483)	16(13.7931)	40(34.4828)	10(8.6207)
0.07(80)	57(49.1379)	34(29.3103)	36(31.0345)	20(17.2414)	52(44.8276)	12(10.3448)
0.09(100)	66(56.8966)	41(35.3448)	41(35.3448)	28(24.1379)	54(46.5517)	19(16.3793)
0.11(120)	73(62.931)	55(47.4138)	49(42.2414)	33(28.4483)	56(48.2759)	28(24.1379)
0.13(140)	81(69.8276)	61(52.5862)	51(43.9655)	39(33.6207)	68(58.6207)	34(29.3103)

Vehicle Silhouettes Data Set. The data set has 846 objects and 19 attributes. The last attribute is the decision attribute, which divides all objects into four classes: the “van” class includes 199 objects, the “saab” class includes 217 objects, the “bus” class includes 218 objects, and the “opel” class includes 212 objects. The number of objects in these four classes is roughly the same, but when the experiment is interested in detecting the “van” class, this class and the remaining classes taken together form an unbalanced distribution. “Van” is therefore the rare class, accounting for 23.52% (see Table 3). The threshold μ = 0.7222 is calculated according to the outlier ratio (see Figure 4(h)). In the dynamic table of the outlier detection rate shown in Table 12, the DE and KNN algorithms achieve relatively good detection rates among the six outlier detection algorithms.

Table 13

Experimental results in Obesity Data Set

Top ratio(%)	Number of rare classes(Coverage(%))
(Number of objects)	DE	CBLOF	DB	KNN	SEQ	NOOF
0.01(20)	20(7.3801)	0(0.0)	0(0.0)	0(0.0)	5(1.845)	3(1.107)
0.02(40)	31(11.4391)	4(1.476)	3(1.107)	4(1.476)	7(2.583)	7(2.583)
0.03(60)	35(12.9151)	9(3.321)	7(2.583)	10(3.69)	15(5.5351)	15(5.5351)
0.04(80)	36(13.2841)	16(5.9041)	17(6.2731)	16(5.9041)	23(8.4871)	18(6.6421)
0.05(100)	56(20.6642)	18(6.6421)	21(7.7491)	20(7.3801)	33(12.1771)	22(8.1181)
0.06(120)	63(23.2472)	22(8.1181)	23(8.4871)	24(8.8561)	37(13.6531)	28(10.3321)
0.07(140)	63(23.2472)	27(9.9631)	28(10.3321)	27(9.9631)	43(15.8672)	31(11.4391)
0.08(160)	63(23.2472)	30(11.0701)	34(12.5461)	28(10.3321)	43(15.8672)	38(14.0221)

Obesity Data Set. The data set has 2111 objects and 17 attributes. The last attribute is the decision attribute, which divides the data set into “Insufficient Weight”, “Normal Weight”, “Overweight” and “Obesity”. As the rare class, “Insufficient Weight” accounts for only 12.88% of the whole data set (see Table 3). The threshold μ = 0.75 is calculated according to the outlier ratio (see Figure 4(i)). In the dynamic table of the outlier detection rate shown in Table 13, the DE algorithm achieves a relatively good detection rate among the six outlier detection algorithms.

5 Evaluation analyses

In this section, the ROC curve and the AUC value are used to evaluate the experimental results, and the Friedman test is performed.

The confusion matrix is required to draw the ROC curve; it is shown in Table 15. There are 4 possible outcomes in the prediction of outliers: an outlier taken as an outlier (true positive, TP), an outlier taken as normal (false negative, FN), a normal object taken as an outlier (false positive, FP) and a normal object taken as normal (true negative, TN).

Table 16
AUC results

Data sets	AUC value(rank)
	DE	CBLOF	DB	KNN	SEQ	NOOF
Glass Identification	0.9551(1)	0.8377(2)	0.625(6)	0.7783(3)	0.6901(5)	0.7026(4)
Computer Hardware	1.0(1)	0.9617(5)	0.9795(4)	0.934(6)	0.9972(3)	0.9977(2)
Divorce Predictors	0.9999(1)	0.8546(3)	0.7107(6)	0.9952(2)	0.7172(5)	0.814(4)
Ionosphere	0.9556(1)	0.9137(3)	0.911(4)	0.9381(2)	0.8056(6)	0.8687(5)
Breast Cancer (Original)	0.9748(2)	0.8744(6)	0.952(5)	0.9551(4)	0.9879(1)	0.9713(3)
Breast Cancer (Diagnostic)	0.9379(1)	0.8654(3)	0.8127(6)	0.8406(4)	0.8971(2)	0.8346(5)
Airfoil Self-Noise	0.7173(1)	0.483(6)	0.6293(3)	0.552(5)	0.6798(2)	0.6219(4)
(OBS) Network	0.9478(1)	0.7697(5)	0.8239(3)	0.692(6)	0.9079(2)	0.7846(4)
Vehicle Silhouettes	0.7794(2)	0.7294(5)	0.7454(3.5)	0.7454(3.5)	0.6785(6)	0.7842(1)
Obesity	0.6699(3)	0.6502(6)	0.704(2)	0.661(5)	0.7321(1)	0.6629(4)
Average rank	1.4	4.4	4.25	4.05	3.3	3.6

Accordingly, the true positive rate (TPR) and the false positive rate (FPR) are defined as follows: TPR = TP/(TP + FN), FPR = FP/(FP + TN).
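As a small worked example of these two definitions, the sketch below computes TPR and FPR from confusion-matrix counts. The function name is hypothetical. The counts are derived from the first DE row of Table 4 (214 objects, 9 real outliers, 8 objects selected, 4 real outliers among them), under the assumption that the selected objects are the predicted outliers.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of real outliers that are detected
    fpr = fp / (fp + tn)  # share of normal objects wrongly flagged
    return tpr, fpr


# Derived counts: 8 objects flagged, 4 of them real outliers, 9 real outliers in
# total, 214 objects overall (so 205 normal objects, 4 of which are flagged).
tp, fn = 4, 9 - 4
fp, tn = 8 - 4, 214 - 9 - (8 - 4)

tpr, fpr = rates(tp, fn, fp, tn)
```

The resulting TPR of 4/9 is the same quantity that Table 4 reports as a coverage of 44.4444%.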

A threshold is set and its corresponding TPR and FPR are calculated. Taking FPR as the abscissa and TPR as the ordinate, each threshold corresponds to a point in the two-dimensional plane. After several thresholds have been set, the resulting points are connected to form the ROC curve. The TPR is also called the “detection rate” and represents the probability that a true outlier is identified as an outlier. In contrast, the FPR represents the probability that a non-outlier is identified as an outlier. When evaluating the performance of an algorithm, these two indicators usually conflict: lowering the threshold to detect more outliers also tends to flag more normal objects. The ROC curve intuitively depicts the trade-off between TPR and FPR. The closer the ROC curve is to the upper left corner, the better the algorithm performs. The ROC curve gives a vivid representation of the performance of the algorithm, but it is not specific enough, so the numerical AUC is introduced to reflect the performance quantitatively. AUC is the area under the ROC curve. Assuming that the probability predicted by a model for a positive sample is P1 and the probability predicted for a negative sample is P2, AUC is the probability that P1 > P2 (with higher values being better). The value range of AUC is generally [0.5, 1]. The larger the AUC, the more samples are correctly identified as outliers; in other words, the better the performance of the algorithm.
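The two views of AUC described above, the area under the ROC curve and the probability that P1 > P2, can be compared in a short sketch. This is an illustration only: the function names, scores, and labels are hypothetical, the area is approximated with the trapezoidal rule, and ties between a positive and a negative score are counted as one half in the pairwise version (an assumption).

```python
def roc_points(scores, is_outlier):
    """Sweep thresholds over the distinct scores and return (FPR, TPR) points."""
    pos = sum(is_outlier)
    neg = len(is_outlier) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, is_outlier) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, is_outlier) if s >= t and not y)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(scores, is_outlier):
    """Fraction of (outlier, normal) pairs in which the outlier scores higher."""
    pos = [s for s, y in zip(scores, is_outlier) if y]
    neg = [s for s, y in zip(scores, is_outlier) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outlierness scores and ground-truth labels for eight objects.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [True, True, False, True, False, False, False, False]

area = auc_trapezoid(roc_points(scores, labels))
prob = auc_pairwise(scores, labels)
```

For this example both computations give 14/15, since 14 of the 15 outlier/normal pairs are ordered correctly.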

Figure 6 shows the ROC results for the 10 data sets. The ROC results show that the DE algorithm performs better on most data sets and runs stably. Clearer conclusions can be drawn from the AUC results in Table 16 (boldface numbers are the best): the DE algorithm is the best overall.

Fig. 6

ROC results.

There are 6 algorithms and 10 data sets. Since each algorithm has a performance ranking on each data set, the Friedman test is used to check whether there are significant differences between the algorithms. The experimental results are shown in Figure 7. The DE algorithm has the lowest mean rank over the 10 data sets, which is significantly different from the algorithms with higher mean ranks.
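The mean ranks and a Friedman statistic can be reproduced from the ranks in Table 16 with the sketch below. It uses the standard chi-square form of the Friedman statistic without a tie correction, which is an assumption; the paper does not state which variant or post-hoc procedure was used, and the function names are hypothetical.

```python
# Per-data-set ranks copied from Table 16, in the table's row order.
RANKS = {
    "DE":    [1, 1, 1, 1, 2, 1, 1, 1, 2, 3],
    "CBLOF": [2, 5, 3, 3, 6, 3, 6, 5, 5, 6],
    "DB":    [6, 4, 6, 4, 5, 6, 3, 3, 3.5, 2],
    "KNN":   [3, 6, 2, 2, 4, 4, 5, 6, 3.5, 5],
    "SEQ":   [5, 3, 5, 6, 1, 2, 2, 2, 6, 1],
    "NOOF":  [4, 2, 4, 5, 3, 5, 4, 4, 1, 4],
}


def average_ranks(ranks):
    """Mean rank of each algorithm over all data sets."""
    return {name: sum(r) / len(r) for name, r in ranks.items()}


def friedman_statistic(ranks):
    """Friedman chi-square statistic (no tie correction) from per-data-set ranks."""
    k = len(ranks)                          # number of algorithms
    n = len(next(iter(ranks.values())))     # number of data sets
    mean = average_ranks(ranks)
    sum_sq = sum(r * r for r in mean.values())
    return 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


mean_ranks = average_ranks(RANKS)
statistic = friedman_statistic(RANKS)
```

Under the null hypothesis of no difference between algorithms, this statistic is compared with a chi-square distribution with k − 1 = 5 degrees of freedom.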

Fig. 7

Friedman test result.

6 Conclusions

An RVIS is an IS that uses real-valued data to display relationships between objects and attributes. An IRVIS is an RVIS with partially missing information values. In this paper, an outlier detection method in an IRVIS based on inner boundary has been studied and the corresponding DE algorithm has been proposed. Multiple experiments on ten data sets from the UCI Machine Learning Repository have been carried out. While the performance of the other five algorithms fluctuates across data sets, the DE algorithm remains relatively stable and shows a certain robustness. This indicates that the proposed outlier detection algorithm performs better than the other five outlier detection algorithms in an IRVIS. In future work, we will consider applying the proposed algorithm to detect abnormal cells in biological genetic data, because the expression of biological genetic data in different cell samples is real-valued. However, due to the complexity of biological genetic data, we need to focus on reducing the time complexity of the proposed method so that it can better adapt to big data.

Acknowledgments

The authors would like to thank the editors and the anonymous reviewers for their valuable comments and suggestions, which have helped immensely in improving the quality of the paper. This work is supported by Natural Science Foundation of Guangxi (2021GXNSFAA220114, 2022GXNSFAA035552).

References

1. Aggarwal C.C. and Yu P.S., Outlier detection for high dimensional data, in: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (2001), 37–46.
2. Bolton R.J., Hand D.J., Provost, Breiman, Bolton R.J. and Hand D.J., Statistical fraud detection: A review (with comments and rejoinder), Statistical Science 17(3) (2002), 235–255.
3. Breunig M.M., Kriegel H.P., Ng R.T. and Sander, LOF: identifying density-based local outliers, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (2000), 93–104.
4. Chen H.M., Li T.R., Da, Lin J.H. and Hu C.X., A rough-set-based incremental approach for updating approximations under dynamic maintenance environments, IEEE Transactions on Knowledge and Data Engineering 25(2) (2013), 274–284.
5. Chen Y.M., Miao D.Q. and Zhang H.Y., Neighborhood outlier detection, Expert Systems with Applications 37(12) (2010), 8745–8749.
6. Dai J.H., Hu Q.H., Zhang J.H., Hu and Zheng N.G., Attribute selection for partially labeled categorical data by rough set approach, IEEE Transactions on Cybernetics 47(9) (2017), 2460–2471.
7. Dai J.H., Wang W.T. and Xu, An uncertainty measure for incomplete decision tables and its applications, IEEE Transactions on Cybernetics 43(4) (2013), 1277–1289.
8. Francisco M.P., Jos V.B.M., Alberto F.O. and Miguel A.O., Algorithm for the detection of outliers based on the theory of rough sets, Decision Support Systems 75(C) (2015), 63–75.
9. Feng, Sui and Cao, An information entropy-based approach to outlier detection in rough sets, Expert Systems with Applications 37(9) (2010), 6338–6344.
10. Guo, Wu and Li, Fault forecast and diagnosis of steam turbine based on fuzzy rough set theory, in: Second International Conference on Innovative Computing, Information and Control (ICICIC 2007), IEEE (2007), 501–501.
11. Hawkins D.M., Identification of outliers, Chapman and Hall, London (1980).
12. Hawkins, He, Williams G.J. and Baxter R.A., Outlier detection using replicator neural networks, in: International Conference on Data Warehousing and Knowledge Discovery, Springer, Berlin, Heidelberg (2002), 170–180.
13. He, Xu and Deng, Discovering cluster-based local outliers, Pattern Recognition Letters 24(9-10) (2003), 1641–1650.
14. Hu Q.H., Yu D.R., Xie Z.X. and Liu J.F., Fuzzy probabilistic approximation spaces and their information measures, IEEE Transactions on Fuzzy Systems 14 (2006), 191–201.
15. Jiang and Chen Y.M., Outlier detection based on granular computing and rough set theory, Applied Intelligence 42(2) (2015), 303–322.
16. Jiang and Chen Y.M., Outlier detection based on granular computing, in: International Conference on Rough Sets and Current Trends in Computing, Springer-Verlag (2008).
17. Johnson, Kwok and Ng R.T., Fast computation of 2-dimensional depth contours, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (1998), 224–228.
18. Jiang, Sui Y.F. and Cao C.G., Some issues about outlier detection in rough set theory, Expert Systems with Applications 36 (2009), 4680–4687.
19. Jayakumar and Thomas B.J., A new procedure of clustering based on multivariate outlier detection, Journal of Data Science 11(1) (2013), 69–84.
20. Jiang, Zhao H.B., Du J.W., Xue and Peng Y.J., Outlier detection based on approximation accuracy entropy, International Journal of Machine Learning and Cybernetics 10(9) (2018), 2483–2499.
21. Knorr E.M., Ng R.T. and Tucakov, Distance-based outliers: algorithms and applications, The VLDB Journal 8(3) (2000), 237–253.
22. Liu, Li, Zhou, Jiang, Sun, Wang and He, Generative adversarial active learning for unsupervised outlier detection, IEEE Transactions on Knowledge and Data Engineering 32 (2020), 1517–1528.
23. Macia-Perez, Berna-Martinez J.V., Oliva A.F. and Ortega M.A.A., Algorithm for the detection of outliers based on the theory of rough sets, Decision Support Systems 75 (2015), 63–75.
24. Markou and Singh, Novelty detection: a review, part 1: statistical approaches, Signal Processing 83(12) (2003), 2481–2497.
25. Nguyen T.T., Outlier detection: An approximate reasoning approach, in: Proceedings of the International Conference on Rough Sets and Intelligent Systems Paradigms (2007), 495–504.
26. Pal S.K., Meher S.K. and Dutta, Class-dependent rough-fuzzy granular space, dispersion index and classification, Pattern Recognition 45(7) (2012), 2690–2707.
27. Pawlak, Rough sets, International Journal of Computer and Information Science 11 (1982), 341–356.
28. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht (1991).
29. Raymond, Outlier detection in personalized medicine, in: ACM SIGKDD Workshop on Outlier Detection and Description (ODD ’13) (2013), 7–7.
30. Rousseeuw P.J. and Leroy A.M., Robust regression and outlier detection, Journal of the American Statistical Association 31(2) (1987), 260–261.
31. Ramaswamy, Rastogi and Shim, Outlier detection based on rough sets theory, Intelligent Data Analysis 13(2) (2000), 191–206.
32. Shaari, Bakar A.A. and Hamdan A.R., Outlier detection based on rough sets theory, Intelligent Data Analysis 13(2) (2009), 191–206.
33. Sangeetha and Geetha M.A., A fuzzy proximity relation approach for outlier detection in the mixed dataset by using rough entropy-based weighted density method, Soft Computing Letters 3 (2021), 100027.
34. Sabokrou, Khalooei, Fathy and Adeli, Adversarially learned one-class classifier for novelty detection, in: IEEE Conference on Computer Vision and Pattern Recognition (2018), 3379–3388.
35. Wang C.Z., Huang, Shao M.W., Hu Q.H. and Chen D.G., Feature selection based on neighborhood self-information, IEEE Transactions on Cybernetics 50(9) (2020), 4031–4042.
36. Wang and Mao, Outlier detection based on a dynamic ensemble model: Applied to process monitoring, Information Fusion 51 (2019), 244–258.
37. Wang J.T., Qian Y.H., Li F.J., Liang J.Y. and Ding W.P., Fusing fuzzy monotonic decision trees, IEEE Transactions on Fuzzy Systems 28 (2020), 887–900.
38. Wang C.Z., Wang, Shao M.W., Qian Y.H. and Chen D.G., Fuzzy rough attribute reduction for categorical data, IEEE Transactions on Fuzzy Systems 28 (2020), 818–830.
39. Yao Y.Y., Granular computing for data mining, in: Proceedings of SPIE Conference on Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security (2006), 1–12.
40. Yao Y.Y., Granular computing for data mining, in: Dasarathy B.V. (ed) Proceedings of SPIE Conference on Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security (2006), 12.
41. Yuan, Chen H.M., Li T.R., Liu and Wang, Fuzzy information entropy-based adaptive approach for hybrid feature outlier detection, Fuzzy Sets and Systems 421 (2021), 1–28.
42. Yuan, Chen H.M., Li T.R., Liu, Sang B.B. and Wang, Outlier detection based on fuzzy rough granules in mixed attribute data, IEEE Transactions on Cybernetics (2021).
43. Yang Y.Y., Chen D.G., Wang, Tsang E.C. and Zhang D.L., Fuzzy rough set based incremental attribute reduction from dynamic data with sample arriving, Fuzzy Sets and Systems 312 (2017), 66–86.
44. Yuan and Feng, Outlier detection algorithm based on neighborhood value difference metric, Journal of Computational and Applied Mathematics 38 (2018), 81–85.
45. Yuan, Zhang X.Y. and Feng, Sequence-based mixed attribute outlier detection in neighborhood rough sets, J. Chin. Comput. Syst. 39(6) (2018), 1317–1322.
46. Yuan, Zhang X.Y. and Feng, Hybrid data-driven outlier detection based on neighborhood information entropy and its developmental measures, Expert Systems with Applications 112 (2018), 243–257.
47. Zhao S.Y., Chen, Li C.P., Du X.Y. and Sun, A novel approach to building a robust fuzzy rough classifier, IEEE Transactions on Fuzzy Systems 23(4) (2015), 769–786.
