Exploiting scatter matrix on one-class support vector machine based on low variance direction

Abstract

When building a well-performing one-class classifier, the low variance directions of the training data set can provide important information. The Covariance-guided One-Class Support Vector Machine (COSVM) exploits these low variance directions, resulting in better accuracy. However, this classifier does not take the dispersion of the data within the single class into account. In particular, it makes no use of subclass information in the target class. As a solution, we propose the Scatter Covariance-guided One-Class Support Vector Machine (SC-OSVM), a novel variation of the COSVM classifier. In the kernel space, our approach uses subclass information to jointly decrease within-subclass and between-subclass dispersion. Our technique is still based on a convex optimization problem that can be solved efficiently using standard numerical methods. A comparison on artificial and real-world data sets shows that SC-OSVM provides more efficient and robust solutions than standard COSVM and other contemporary one-class classifiers.

Keywords

Support vector machine, kernel covariance matrix, one-class classification, low variances, subclass information

1. Introduction

The identification of “abnormal” instances or events in data sets is the focus of many machine learning challenges. This task is also known as outlier detection or anomaly detection. Machine learning may be used in both supervised and unsupervised settings. The unsupervised setting works on unlabeled data and is concerned only with the inherent structure of the data set. Anomaly detection arises in a variety of practical domains. In the life sciences, for example, medical images are analyzed to discover aberrant cells or tumors. Another well-known application is fraud detection, which has grown in importance with the rise of digital products and online payment methods. Network intrusion detection is a further critical application, in which server and network traffic is monitored. In these domains, traditional multi-class (or binary) classification algorithms pre-define two (or more) categories into which an unknown item is placed. The disadvantages of this sort of classifier become apparent when the unknown item does not fit into any of those categories. Many real-world problems, such as tumor identification or rare medical illnesses, provide only a limited amount of such noisy data during the training phase [20]. Similarly, measurements of normal nuclear power plant operation (normal/target data) are readily available, whereas measuring the corresponding data in the event of an accident (abnormal/outlier data) is extremely risky [16]. For this reason, one-class classification has been widely used in robust machine learning applications. During the training phase, only one class (positive/target) is well represented. The other class (negative/outliers) is either poorly sampled or very small. During testing, the aim of a one-class classifier is to find the optimal separation between target and outlier data.
For these reasons, the one-class classification problem has a variety of applications, including medical analysis [11], anomaly detection [35], face recognition [42], and web page categorization [30]. When the data boundary is concave and elongated, the required number of training items may be quite large. Boundary-based techniques, such as the One-Class Support Vector Machine (OSVM) [32] or Support Vector Data Description (SVDD) [31], rely only on the boundary points surrounding the target class. A more practical approach is to combine the maximum margin criterion with control of the data spread [7].

The performance of most classification algorithms is strongly influenced by characteristics of the data set being modeled, such as class distribution imbalance, class overlap, and lack of density. The OKC classifier [3] is a hybrid of the one-class SVM, k-nearest neighbors, and CART algorithms. It is designed to classify non-linearly separable imbalanced data sets without re-sampling. In [2], a personalized federated learning method based on the One-Class Support Vector Machine (FedP-OCSVM) is proposed to overcome the privacy and transmission issues that arise when a central machine learning model is trained on data from various sources. A local model is trained on each device or sensor (client), and the resulting support vectors are computed for each data set collected. Unlike multi-class classification tasks, one-class classification benefits from emphasizing the low variance directions of the target class distribution. To this end, the Covariance-guided One-Class Support Vector Machine (COSVM) [17] incorporates the covariance matrix into the objective function of OSVM, which emphasizes the low variance directions. This method has proven effective in reducing estimation error and enhancing classification performance. Nevertheless, in real-world scenarios in which the data are widely spread and subclasses arise within the training class, COSVM has some limitations. To overcome these limitations, several classifiers have recently used subclass information to improve their performance.

The Subclass One-Class Support Vector Machine (SOC-SVM) [24] introduces the within-subclass dispersion of the training data into the OSVM objective function. Experimental results illustrate the benefits of this approach for video summarization. The Kernel Support Vector Description (KSVDD) [25] method modifies the optimization process in feature spaces of any dimension. Applications to face recognition and human action recognition problems show that KSVDD outperforms OSVM, standard SVDD, and the minimum variance SVDD (MV-SVDD) [29]. The robust least squares one-class support vector machine (RLS-OCSVM) [40] offers better robustness and generalization ability than related methods. It is based on the correntropy loss function, which efficiently reduces the effect of outliers, and its optimization problem is solved with the half-quadratic optimization technique. However, the performance of robust LS-OCSVM depends on the value of its scale constant $\eta$ . If $\eta$ is set appropriately, performance improves; an incorrect setting of $\eta$ leads to relatively low performance. One-Class Slab Support Vector Machines (OCSSVM) [18] have proved to be more accurate than regular SVMs, one-class SVMs, and other one-class classifiers. OCSSVM uses a modified Sequential Minimal Optimization (SMO) technique to improve scalability without sacrificing precision. However, its results depend strongly on the initial parameter selection. OCSSVM is well suited to open-set problems, although it may struggle with balanced data sets. Hardware innovations such as parallelism and caching may further reduce training time, and the working set selection approach could be adjusted to improve efficiency. Combining SMO with other algorithms, optimizing loss functions, and lowering the number of iterations could also be advantageous.

In this work, we present Scatter-Covariance OSVM (SC-OSVM), inspired by [28], which uses subclass information and includes the between-subclass and within-subclass scatter matrices in the COSVM objective function. First, we divide the target class into subclasses. Second, we calculate the between and within scatter matrices. Then, we optimize the COSVM objective function with the between and within scatter matrices incorporated, to provide the best classification accuracy. Unlike the SOC-SVM and KSVDD techniques, SC-OSVM reduces not only the scatter of the data inside each subclass but also the scatter between subclasses, which improves classification performance. Furthermore, SC-OSVM uses a trade-off parameter to regulate the contribution of our scatter matrix $S$ and the covariance matrix $\Delta$ . The suggested approach is still based on a convex optimization problem, whose optimal solution can be found with standard numerical methods.

The remainder of the paper is structured as follows. The next section presents an overview. Section 3 gives more details on the new subclass method based on COSVM. Section 4 compares our technique to existing state-of-the-art one-class classifiers on a variety of common datasets. Finally, Section 5 includes some closing thoughts.

2. Overview

In this part, we review related work on one-class classification and the influence of low variance directions. We also introduce clustering algorithms.

2.1 One-class classification

In one-class classification (OCC) [19], positive training samples, known as targets, must be distinguished from all other, negative items, known as outliers. Three learning settings can be distinguished:

•
Only positive examples are available for learning. Measurements under the normal operating conditions of a nuclear power station, for example, are simple to collect, whereas the same quantities are dangerous to measure in the event of an accident.
•
Positive examples and a small number of poorly distributed negative examples are available for learning. This is most noticeable in tumor identification or rare medical conditions, where negative training examples are scarce.
•
Positive and unlabeled data are used for learning. This category has attracted a lot of research interest in the document classification community, where unlabeled samples are always accessible.

A vast variety of algorithms and models have been developed to solve OCC problems. These approaches are classified into three types: density-based methods, boundary-based methods, and reconstruction-based methods.

2.2 The scaling of the data on the one-class classification problem

The separation between target objects and outliers depends on a clear representation of the data. Principal Component Analysis (PCA) is a common pre-processing approach that reduces dimensionality by removing low variance directions and retaining only high variance directions. However, [19] showed that in one-class classification the low variance directions are the most descriptive, and that the high variance directions are not always the best option. Many one-class classifiers are sensitive to data scaling and are often adversely affected by data distributed in (irregular) subspaces. If the data are not linearly separable in the original (input) space, they are mapped into a higher dimensional feature space (the kernel feature space), which is robust to large-scale variations in the scaling of the input data. One-class classification in kernel space is often used as an outlier detection and novelty detection approach. Tax and Duin [34] and Schölkopf et al. [32] used the support vector machine (SVM) technique, which optimizes the distance margin between the two analyzed classes using support vectors. It has also been applied to the OCC problem as OSVM [1] and SVDD [7]. The basic concept underlying these techniques is to create a decision boundary around the target data in order to distinguish outliers (non-positives) from positive data.

2.3 Clustering algorithm

Clustering is an unsupervised classification method in which no labeled data are available for learning. Its primary task is to discover the distribution of patterns and relevant correlations in large data sets by assigning similar data to the same group, known as a cluster. Clustering is applied in a variety of disciplines, such as market segmentation [37] and document classification [21]. Its two key difficulties are assigning similar items to the same group and detecting overlapping structures and non-linearly separable clusters [27]. The evaluation of clustering, also known as cluster validity, is a critical procedure for determining the right number of clusters in a data set. The aim of cluster validity [13] is to find the number of clusters C that best describes the data structure according to a validity function. Fuzzy clustering has been extensively studied and applied in a wide range of areas, and fuzzy cluster validation is a critical part of it; the clustering literature offers several validity indices for fuzzy clustering. For these reasons, choosing a clustering method and validity index that provide the optimal grouping of our data is a critical challenge.

3. The novel scatter-covariance OSVM

In this part, we describe our proposed algorithm in further detail. We begin by introducing the COSVM method, which incorporates the target class’s estimated covariance matrix to favor low variance directions.

3.1 The COSVM method

The estimated covariance matrix of the training data covers all projection directions, from low variance to high variance. COSVM incorporates the kernel covariance matrix into the objective function of the OSVM optimization problem to maintain the robustness of the OSVM classifier while emphasizing the low variance directions. Using the kernel trick, the convex optimization problem of the COSVM method may be written as follows:

$\displaystyle\min_{\bm{\alpha}}\;\bm{\alpha}^{T}(\eta\mathbf{Q}+(1-\eta)\Delta)\bm{\alpha}$

$\displaystyle s.t.\;\;\;0\leqslant\alpha_{i}\leqslant\frac{1}{vN},\;\;\sum_{i=1}^{N}\alpha_{i}=1,$ (1)

where

$\displaystyle\Delta=\mathbf{Q}(I-1_{N})\mathbf{Q}^{T}.$ (2)

For clarity, we have used the vectorized form $\bm{\alpha}=(\alpha_{1},\ldots,\alpha_{N})$ , and $v\in(0,1]$ is the key parameter that controls the fraction of outliers and that of support vectors (SVs). $\mathbf{Q}$ is the kernel matrix as defined in Eq. (3):

$\displaystyle\mathbf{Q}(i,j)=\mathcal{K}(x_{i},x_{j}),\;\;i=1,\ldots,N;\;\;j=1,\ldots,N.$ (3)

$I$ is the identity matrix and $1_{N}$ is a matrix with all entries $\frac{1}{N}$ , and $\eta$ is the tradeoff parameter that controls the balance between the kernel matrix $\mathbf{Q}$ and the dual kernel covariance matrix $\Delta$ . According to [17], one can directly manipulate the confidence on the training dataset by adjusting the value of $v$ during the training phase. If the training dataset is extremely dependable, $v$ might be adjusted to a low number to take into account the entire training dataset. If it is unknown whether or not the training dataset accurately represents the target class, $v$ might be adjusted to a greater value.
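As a minimal sketch of Eqs. (1)-(3), the matrix that appears in the COSVM objective can be assembled with NumPy as follows. The radial basis kernel is the one adopted later in the experimental protocol; the function names, the toy data, and the values of `sigma` and `eta` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def rbf_kernel_matrix(X, sigma):
    """Q[i, j] = exp(-||x_i - x_j||^2 / sigma), as in Eq. (3) with an RBF kernel."""
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative round-off
    return np.exp(-sq_dists / sigma)


def cosvm_dual_matrix(Q, eta):
    """Return (eta * Q + (1 - eta) * Delta, Delta) with Delta = Q (I - 1_N) Q^T."""
    n = Q.shape[0]
    centering = np.eye(n) - np.full((n, n), 1.0 / n)  # I - 1_N, entries of 1_N are 1/N
    delta = Q @ centering @ Q.T                       # Eq. (2)
    return eta * Q + (1.0 - eta) * delta, delta


# Toy data for illustration only (not one of the paper's data sets).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
Q = rbf_kernel_matrix(X, sigma=2.0)
H, delta = cosvm_dual_matrix(Q, eta=0.5)
```

Because $I-1_{N}$ is a symmetric projection, $\Delta$ built this way is symmetric and positive semidefinite, which is what keeps the objective convex.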

3.2 Derivation of SC-OSVM

SC-OSVM takes the subclass distribution into account in order to provide more efficient and robust solutions than standard COSVM. The whole approach is founded on projecting the training data set of $N$ samples, $\mathcal{X}=\{x_{i}\}_{i=1}^{N}$ , to a higher dimensional feature space $\mathcal{F}={\{\Phi(x_{i})\}}_{i=1}^{N}$ by the function $\Phi$ , where linear classification might be achieved. In practice, $\mathcal{F}$ is not calculated directly. Instead, the kernel trick [38] is used, where a kernel function $\mathcal{K}$ calculates the inner products of the higher dimensional data samples: $\mathcal{K}(x_{i},x_{j})=\langle\Phi(x_{i}),\Phi(x_{j})\rangle,\forall i,j\in\{1,2,\ldots,N\}$ . Because the whole training set belongs to one class only, we cluster all of the training vectors after mapping to feature space to find $K$ clusters $\{C_{d}\}_{d=1}^{K}$ , where $|C_{d}|={N_{d}},\forall d\in\{1,2,\ldots,K\}$ . Let $m_{\Phi}^{d}$ denote the mean of the samples of cluster $C_{d}$ calculated in feature space:

$\displaystyle m_{\Phi}^{d}=\frac{1}{N_{d}}\sum_{i=1}^{N_{d}}\Phi(x_{i}).$ (4)

We then calculate the kernel covariance matrix $\Sigma^{d}_{\Phi}$ of the training cluster $C_{d}$ using $m_{\Phi}^{d}$ :

$\displaystyle\Sigma_{\Phi}^{d}=\sum_{i=1}^{N_{d}}(\Phi(x_{i})-m_{\Phi}^{d})(\Phi(x_{i})-m_{\Phi}^{d})^{T}.$ (5)

When $K$ subclasses are generated inside the target class, the within subclass scatter matrix (distribution of the training vectors) may be represented as follows:

$\displaystyle S_{\Phi}^{\mathbf{w}}=\sum_{d=1}^{K}\left(\frac{N_{d}}{N}\Sigma_{\Phi}^{d}\right),$ (6)

where $\frac{N_{d}}{N}$ is the prior probability of the $d$ -th subclass. The between scatter matrix can be defined as follows:

$\displaystyle S_{\Phi}^{B}=\sum_{d=1}^{K}\sum_{b=1,b\neq d}^{K}(m_{\Phi}^{d}-m_{\Phi}^{b})(m_{\Phi}^{d}-m_{\Phi}^{b})^{T}.$ (7)
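To make Eqs. (4)-(7) concrete, the sketch below evaluates them in the special case of a linear kernel, where $\Phi(x)=x$ and the feature space can be written out explicitly. This special case, the function name `subclass_scatter`, and the four-point toy data are assumptions made purely for illustration.

```python
import numpy as np


def subclass_scatter(X, labels):
    """Within- and between-subclass scatter (Eqs. (5)-(7)) assuming Phi(x) = x."""
    n, dim = X.shape
    clusters = np.unique(labels)
    means = np.array([X[labels == c].mean(axis=0) for c in clusters])  # Eq. (4)

    s_within = np.zeros((dim, dim))
    for c, mean in zip(clusters, means):
        centered = X[labels == c] - mean
        sigma_c = centered.T @ centered              # Eq. (5), not divided by N_d
        s_within += (len(centered) / n) * sigma_c    # Eq. (6), weight N_d / N

    s_between = np.zeros((dim, dim))
    for d in range(len(clusters)):
        for b in range(len(clusters)):
            if b != d:                               # Eq. (7) sums over ordered pairs
                diff = means[d] - means[b]
                s_between += np.outer(diff, diff)
    return s_within, s_between


# Two subclasses of two points each.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 1.0], [10.0, 3.0]])
labels = np.array([0, 0, 1, 1])
s_w, s_b = subclass_scatter(X, labels)
```

For this toy input the within scatter is the identity matrix, while the between scatter is dominated by the horizontal gap between the two subclass means. Note that Eq. (7) counts each pair of subclasses twice, once as $(d,b)$ and once as $(b,d)$.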

Using these definitions, we incorporate the within-subclass and between-subclass scatter matrices as the additional terms $\mathbf{w}^{T}S_{\Phi}^{\mathbf{w}}\mathbf{w}$ and $\mathbf{w}^{T}S_{\Phi}^{B}\mathbf{w}$ , respectively, into the objective function of the COSVM optimization problem, Eq. (1). The term $\mathbf{w}^{T}S_{\Phi}^{\mathbf{w}}\mathbf{w}$ is used to minimize the dispersion within subclasses, whereas the term $\mathbf{w}^{T}S_{\Phi}^{B}\mathbf{w}$ minimizes the dispersion between subclasses. However, optimization techniques for COSVM solve the dual problem, not the primal one. As a result, it is preferable to include the subclass scatter matrices directly in the dual problem. Therefore, we have to use the kernel trick to represent the additional term $\mathbf{w}^{T}S_{\Phi}^{\mathbf{w}}\mathbf{w}$ in terms of dot products only. From the theory of reproducing kernels, we know that any solution $\mathbf{w}$ must lie in the span of all training samples. Hence, we can find an expansion of $\mathbf{w}$ of the form:

$\displaystyle\mathbf{w}=\sum_{i=1}^{N}\alpha_{i}\Phi(x_{i}).$ (8)

By using the definitions of $\Sigma^{d}_{\Phi}$ in Eq. (5) and $m^{d}_{\Phi}$ in Eq. (4), together with the kernel function $\mathcal{K}(x_{i},x_{j})=\langle\Phi(x_{i}),\Phi(x_{j})\rangle,\forall i,j\in\{1,2,\ldots,N\}$ , we derive the dot product form in Eq. (9), where $\mathbf{Q}$ is the kernel matrix as defined in Eq. (3), $I$ is the identity matrix, and $1_{N}$ is a matrix with all entries $\frac{1}{N}$ .

$\displaystyle\begin{aligned}
\mathbf{w}^{T}S_{\Phi}^{\mathbf{w}}\mathbf{w}
&=\biggl(\sum_{i=1}^{N}\alpha_{i}\Phi^{T}(x_{i})\biggr)\biggl(\sum_{d=1}^{K}\frac{N_{d}}{N}\Sigma_{\Phi}^{d}\biggr)\biggl(\sum_{k=1}^{N}\alpha_{k}\Phi(x_{k})\biggr)\\
&=\sum_{i=1}^{N}\sum_{k=1}^{N}\sum_{d=1}^{K}\sum_{j=1}^{N_{d}}\frac{N_{d}}{N}\alpha_{i}\Phi^{T}(x_{i})(\Phi(x_{j})-m_{\Phi}^{d})(\Phi(x_{j})-m_{\Phi}^{d})^{T}\alpha_{k}\Phi(x_{k})\\
&=\sum_{i=1}^{N}\sum_{k=1}^{N}\sum_{d=1}^{K}\sum_{j=1}^{N_{d}}\frac{N_{d}}{N}\biggl(\alpha^{d}_{i}\mathcal{K}(x_{i},x_{j})-\frac{1}{N_{d}}\sum_{l=1}^{N_{d}}\alpha^{d}_{i}\mathcal{K}(x_{i},x_{l})\biggr)\biggl(\alpha^{d}_{k}\mathcal{K}(x_{k},x_{j})-\frac{1}{N_{d}}\sum_{m=1}^{N_{d}}\alpha^{d}_{k}\mathcal{K}(x_{k},x_{m})\biggr)\\
&=\sum_{i=1}^{N}\sum_{k=1}^{N}\sum_{d=1}^{K}\sum_{j=1}^{N_{d}}\frac{N_{d}}{N}\biggl(\alpha^{d}_{i}\alpha^{d}_{k}\mathcal{K}(x_{i},x_{j})\mathcal{K}(x_{k},x_{j})-\frac{2\alpha^{d}_{i}\alpha^{d}_{k}}{N_{d}}\sum_{l=1}^{N_{d}}\mathcal{K}(x_{i},x_{j})\mathcal{K}(x_{k},x_{l})+\frac{\alpha^{d}_{i}\alpha^{d}_{k}}{{N_{d}}^{2}}\sum_{l=1}^{N_{d}}\sum_{m=1}^{N_{d}}\mathcal{K}(x_{i},x_{l})\mathcal{K}(x_{k},x_{m})\biggr)\\
&=\sum_{i=1}^{N}\sum_{k=1}^{N}\sum_{d=1}^{K}\sum_{j=1}^{N_{d}}\frac{N_{d}}{N}\biggl(\alpha^{d}_{i}\alpha^{d}_{k}\mathcal{K}(x_{i},x_{j})\mathcal{K}(x_{k},x_{j})-\frac{\alpha^{d}_{i}\alpha^{d}_{k}}{N_{d}}\sum_{l=1}^{N_{d}}\mathcal{K}(x_{i},x_{j})\mathcal{K}(x_{k},x_{l})\biggr)\\
&=\sum_{d=1}^{K}\frac{N_{d}}{N}\Bigl({\bm{\alpha}^{d}}^{T}{\mathbf{Q}^{d}}^{T}\mathbf{Q}^{d}\bm{\alpha}^{d}-{\bm{\alpha}^{d}}^{T}{\mathbf{Q}^{d}}^{T}1_{N_{d}}\mathbf{Q}^{d}\bm{\alpha}^{d}\Bigr)\\
&=\sum_{d=1}^{K}\frac{N_{d}}{N}{\bm{\alpha}^{d}}^{T}{\mathbf{Q}^{d}}^{T}(I-1_{N_{d}})\mathbf{Q}^{d}\bm{\alpha}^{d}\\
&=\bm{\alpha}^{T}\sum_{d=1}^{K}\frac{N_{d}}{N}\Delta_{d}\bm{\alpha}
\end{aligned}$ (9)

$\Delta_{d}$ is the transformed version of $\Sigma_{\Phi}^{d}$ to be used in the dual form:

$\displaystyle\Delta_{d}=\mathbf{Q_{d}}(I-1_{N})\mathbf{Q_{d}}^{T}.$ (10)

This form of the kernel covariance matrix $\Delta_{d}$ is expressed only in terms of the kernel function and can be calculated easily using the kernel trick. Let $M_{d}$ be the “kernel mean of cluster $C_{d}$ ”, which is an $N$ -dimensional vector. Each component of $M_{d}$ is defined as:

$\displaystyle(M_{d})_{j}=\frac{1}{N}\sum_{i=1}^{N_{d}}\mathcal{K}(x_{i},x_{j}),\;\;\forall j=1,\ldots,N.$ (11)

Using this definition, the between scatter term $\mathbf{w}^{T}S_{\Phi}^{B}\mathbf{w}$ is expressed in dual form in Eq. (12).

$\displaystyle\mathbf{w}^{T}S_{\Phi}^{B}\mathbf{w}=\biggl(\sum_{i=1}^{N}\alpha_{i}\Phi^{T}(x_{i})\biggr)\biggl(\sum_{d=1}^{K}\sum_{b=1,b\neq d}^{K}(m_{\Phi}^{d}-m_{\Phi}^{b})(m_{\Phi}^{d}-m_{\Phi}^{b})^{T}\biggr)\biggl(\sum_{k=1}^{N}\alpha_{k}\Phi(x_{k})\biggr)=\bm{\alpha}^{T}\sum_{d=1}^{K}\sum_{b=1,b\neq d}^{K}(M_{d}-M_{b})(M_{d}-M_{b})^{T}\bm{\alpha}.$ (12)

Hence, our target term to incorporate into the COSVM dual problem is:

$\displaystyle\bm{\alpha}^{T}\biggl(\sum_{d=1}^{K}\frac{N_{d}}{N}\Delta_{d}+\sum_{d=1}^{K}\sum_{b=1,b\neq d}^{K}(M_{d}-M_{b})(M_{d}-M_{b})^{T}\biggr)\bm{\alpha}.$ (13)
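The dual-form term of Eq. (13) can be checked numerically. The sketch below builds both parts from a kernel matrix and cluster labels, then compares them with the primal quantities for a linear kernel, where $\mathbf{w}=\sum_{i}\alpha_{i}x_{i}$ is explicit. Several choices here are assumptions made so that the check goes through: $\mathbf{Q}_{d}$ is taken to be the $N\times N_{d}$ block of kernel columns belonging to cluster $C_{d}$, the centering matrix has size $N_{d}$, and the kernel mean is normalized by $N_{d}$ to match the cluster mean of Eq. (4). The function name and toy data are likewise hypothetical.

```python
import numpy as np


def dual_scatter_matrices(K, labels):
    """Return the two N x N matrices summed inside Eq. (13)."""
    n = K.shape[0]
    clusters = np.unique(labels)
    within = np.zeros((n, n))
    kernel_means = []
    for c in clusters:
        members = np.flatnonzero(labels == c)
        n_c = len(members)
        Q_c = K[:, members]                                  # assumed N x N_d block
        centering = np.eye(n_c) - np.full((n_c, n_c), 1.0 / n_c)
        within += (n_c / n) * (Q_c @ centering @ Q_c.T)      # (N_d / N) * Delta_d
        kernel_means.append(Q_c.mean(axis=1))                # assumed 1/N_d normalization
    between = np.zeros((n, n))
    for d in range(len(clusters)):
        for b in range(len(clusters)):
            if b != d:
                diff = kernel_means[d] - kernel_means[b]
                between += np.outer(diff, diff)
    return within, between


# Linear-kernel check: Phi(x) = x, so w = X^T alpha can be formed directly.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))
labels = np.repeat([0, 1, 2], 4)
alpha = rng.random(12)
alpha /= alpha.sum()
within, between = dual_scatter_matrices(X @ X.T, labels)

w = X.T @ alpha
means = [X[labels == c].mean(axis=0) for c in range(3)]
primal_within = sum(
    (4 / 12) * np.sum(((X[labels == c] - means[c]) @ w) ** 2) for c in range(3)
)
primal_between = sum(
    ((means[d] - means[b]) @ w) ** 2 for d in range(3) for b in range(3) if b != d
)
```

Under these assumptions `alpha @ within @ alpha` agrees with `primal_within` and `alpha @ between @ alpha` agrees with `primal_between` up to round-off, so the dual matrices carry the same information as $S_{\Phi}^{\mathbf{w}}$ and $S_{\Phi}^{B}$ without ever forming $\Phi$.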

With this replacement, our proposed SC-OSVM method can be described by the optimization problem defined in Eq. (14).

$\displaystyle\min_{\bm{\alpha}}\;\bm{\alpha}^{T}\eta\mathbf{Q}\bm{\alpha}+\bm{\alpha}^{T}(1-\eta)\biggl(\sum_{d=1}^{K}\frac{N_{d}}{N}\Delta_{d}+\sum_{d=1}^{K}\sum_{b=1,b\neq d}^{K}(M_{d}-M_{b})(M_{d}-M_{b})^{T}\biggr)\bm{\alpha}$

$\displaystyle s.t.\;\;\;0\leqslant\alpha_{i}\leqslant\frac{1}{vN},\;\;\sum_{i=1}^{N}\alpha_{i}=1.$ (14)

Because both the kernel matrix $\mathbf{Q}$ and the covariance matrix $\Delta$ are positive definite [22, 14], the suggested approach still yields a convex optimization problem. The added terms are positive semidefinite as well, since each $\Delta_{d}$ has the same form as $\Delta$ and each $(M_{d}-M_{b})(M_{d}-M_{b})^{T}$ is an outer product. As a result, the problem has a single global optimum, which can be found efficiently using numerical methods.
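The text does not name a particular solver, so the following is only one possible sketch: a Frank-Wolfe iteration with exact line search, chosen here because the feasible set of Eq. (14) (a simplex with box constraints) has a simple greedy linear minimizer. The solver choice, function names, iteration count, and random test matrix are all assumptions; any standard quadratic programming routine could be substituted.

```python
import numpy as np


def capped_simplex_vertex(grad, cap):
    """Minimize grad @ s over {0 <= s_i <= cap, sum(s) = 1} by greedy filling."""
    s = np.zeros_like(grad)
    remaining = 1.0
    for i in np.argsort(grad):           # cheapest coordinates first
        s[i] = min(cap, remaining)
        remaining -= s[i]
        if remaining <= 0.0:
            break
    return s


def solve_dual_qp(H, nu, n_iter=500, tol=1e-12):
    """Frank-Wolfe for min a^T H a s.t. 0 <= a_i <= 1/(nu*N), sum(a) = 1.

    H is assumed symmetric positive semidefinite.
    """
    n = H.shape[0]
    cap = 1.0 / (nu * n)
    alpha = np.full(n, 1.0 / n)          # uniform start is feasible because nu <= 1
    for _ in range(n_iter):
        grad = 2.0 * H @ alpha
        direction = capped_simplex_vertex(grad, cap) - alpha
        gap = -grad @ direction          # Frank-Wolfe duality gap, always >= 0
        if gap < tol:
            break
        curvature = direction @ H @ direction
        # Exact minimizer of the quadratic along the segment, clipped to [0, 1].
        step = 1.0 if curvature <= 0.0 else min(1.0, gap / (2.0 * curvature))
        alpha = alpha + step * direction
    return alpha


# Random symmetric PSD matrix standing in for the matrix of Eq. (14).
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 20))
H = A @ A.T / 20.0
H = (H + H.T) / 2.0
alpha = solve_dual_qp(H, nu=0.2)
```

Every iterate is a convex combination of feasible points, so the constraints of Eq. (14) hold throughout, and the exact line search guarantees that the objective never increases. Frank-Wolfe converges slowly near the optimum, which is acceptable for a sketch but not a reason to prefer it in practice.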

The balance between the dispersion term and the kernel matrix is controlled by our control parameter $\eta$ .

Figure 1.

Case 1: Schematic depiction of the decision hyperplane for SC-OSVM when the optimal linear projection would be along the direction of high variance. In this case, the optimal control parameter value for SC-OSVM is $\eta=1$ .

Figure 2.

Case 2: Schematic depiction of the decision hyperplane for SC-OSVM when the optimal linear projection would be along the direction of low variance. In this case, the optimal control parameter value for SC-OSVM is $\eta=0$ .

Figure 3.

Comparison between COSVM and SC-OSVM : The value of the trade-off parameter is set equal to $0$ ( $\eta=0$ ), to only consider the dispersion term.

3.3 The impact of the trade-off parameter $\eta$

Finding the right value for $\eta$ is a critical step in improving classification using SC-OSVM. The parameter $\eta$ controls the relative contribution of our kernel matrix $\mathbf{Q}$ , the between scatter matrix $S_{\Phi}^{B}$ , and the within subclass scatter matrix $S_{\Phi}^{\mathbf{w}}$ . Figure 1 shows the case where the optimal decision hyperplane for the example target data is in the same direction as the high variance. In this situation the control parameter $\eta$ is set to $1$ , so the low variance directions are given no particular consideration. Figure 2, on the other hand, shows the scenario where the optimal decision hyperplane is parallel to the direction of low variance; $\eta$ can be set to $0$ in this situation. In real-world situations, however, the optimal decision hyperplane is unlikely to be exactly parallel to the direction of low or high variance, so $0<\eta<1$ . As a result, the value of $\eta$ must be adjusted such that the linear projections of the target data and the outlier data overlap as little as possible. To optimize $\eta$ , we choose an indirect method, which is discussed in full in the next section.

3.4 Schematic depictions

In this section, Fig. 3 gives a schematic illustration of the benefit of our SC-OSVM technique over the unimodal COSVM.

4. Experimental results

This section provides a thorough experimental analysis and findings for our proposed approach, which was tested on both artificial and benchmark real-world one-class datasets and compared to contemporary one-class classifiers. First, we examine the impact of varying the value of our main control parameter $\eta$ .

4.1 Optimising the value of $\eta$

First, we must optimize our key control parameter $\eta$ and consider the consequences of adjusting it. There is no simple technique (e.g. cross-validation) for optimizing the value of $\eta$ . As a consequence, we employ a stopping criterion to determine the optimal $\eta$ value. As the stopping criterion, we use a predefined proportion of outliers permitted ( $f_{\textit{OL}}$ ). The proportion of outliers is calculated by determining what fraction of the training samples are classified as outliers by the constructed target boundary. For each data set, we set $\eta$ to $1$ and gradually lower its value while keeping track of the proportion of outliers. Importantly, there is no conflict between the value of $f_{\textit{OL}}$ and the value of the COSVM parameter $v$ , and they may be changed independently to match the purpose of the dataset to be trained. The COSVM parameter $v$ can be set to any value between $0$ and $1$ . The value of $f_{\textit{OL}}$ in SC-OSVM can be set to any value between $0$ and $v$ .
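The search described above can be sketched as a simple loop. The exact stopping rule is not spelled out in the text, so this sketch assumes that $\eta$ is lowered in fixed steps and that the last value whose training outlier fraction stays at or below $f_{\textit{OL}}$ is kept. The function `select_eta`, the step size, and the callable `train_outlier_fraction` (which would train SC-OSVM for a given $\eta$ and return the fraction of training samples rejected) are hypothetical names introduced for illustration.

```python
def select_eta(train_outlier_fraction, f_ol, step=0.05):
    """Lower eta from 1 toward 0; keep the last eta whose training outlier
    fraction does not exceed f_ol (assumed stopping rule).

    Returns None if even eta = 1 exceeds f_ol.
    """
    n_steps = int(round(1.0 / step))
    best = None
    for k in range(n_steps + 1):
        eta = max(1.0 - k * step, 0.0)
        if train_outlier_fraction(eta) > f_ol:
            break                      # stopping criterion reached
        best = eta
    return best


# Stand-in for a trained model: the rejected fraction grows as eta decreases.
def toy_fraction(eta):
    return 0.02 + 0.1 * (1.0 - eta)


chosen = select_eta(toy_fraction, f_ol=0.055, step=0.1)
```

With the toy curve above the loop stops after $\eta$ reaches roughly $0.7$. In a real run each call to `train_outlier_fraction` solves one instance of Eq. (14), so a coarse step keeps the cost manageable.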

4.2 Data sets used

In our studies, we employed both artificially generated datasets and real-world datasets to evaluate the robustness of our suggested approach in varying situations.

4.2.1 Artificial datasets

For the tests using artificially generated data, we constructed several sets of 2D four-class data drawn from two distinct families of distributions: 1) Gaussian distributions with various covariance matrices. 2) Gamma distributions with various shape and scale parameters. Two distinct sets were constructed for each family, one with little overlap and the other with high overlap. Plots of these data sets are shown in Fig. 4. Each set has a total of four classes, one of which was selected as the target and the others as outliers.

Figure 4.

Four artificial four-class datasets used for comparison. The blue class represents the target class (in each subfigure caption).

4.2.2 UCI machine learning dataset

For the real-world scenario, the majority of the datasets were obtained from the UCI machine learning repository [4]. We have mainly focused on one of the most significant domains of one-class classification, medical diagnosis [10]. As these datasets were initially multi-class, one class is identified as the target and the others as outliers. Some of the target and outlier sets were too easy to classify, and we did not include those sets in our results. We also tested the robustness of our technique against varied feature sizes by using data sets of varying size and dimension. Table 1 shows that the dimensions range from $3$ to $300$ , whereas the training set sizes range from $21$ to $288$ .

Table 1
Description of real data sets

Data set name	Number of targets	Number of outliers	Number of features	Number of clusters
Haberman’s Survival	81	225	3	2
Biomedical (diseased)	67	127	5	4
Biomedical (healthy)	127	67	5	4
SPECT Images (normal)	95	254	44	2
Balance-scale left	288	337	4	5
Balance-scale right	288	337	4	4
Waveform	21	600	300	3

4.2.3 The TON-IoT network dataset

The TON-IoT network dataset has 22,339,021 records of normal and attack data. A subset of 461,043 records was collected from the entire network dataset so as to include all the attack and normal events (300,000 normal and 161,043 attack). These records can be used to apply different machine learning models and to handle the challenge of imbalanced normal and attack records, in which the number of normal records is much greater than the number of abnormal ones. Since the TON-IoT network dataset has a large number of attributes of different types, such as categorical and numeric ones, the attributes must be filtered and processed to improve the performance of machine learning techniques. The total number of features (without pre-processing) is 43. The total number of clusters found using VSC as the validity index with K-means is 4 or 5.

4.3 The area under an ROC curve

Performance measurement plays a fundamental role in machine learning, particularly in classification problems. The receiver operating characteristic (ROC) curve is commonly used in medical applications and research to evaluate diagnostic tests [36]. The ROC curve is a plot of the true positive rate (TPR) on the vertical axis against the false positive rate (FPR) on the horizontal axis over all possible decision thresholds. The TPR is the proportion of cases with the disease who test positive for it based on the diagnostic test, whereas the FPR is the proportion of cases without the disease who test positive for it based on the same diagnostic test. This curve is important for evaluating the ability of a test to discriminate the true condition of subjects, for identifying the best cut-off values, and for comparing two alternative diagnostic tests when both are performed on the same subjects. The Area Under the Curve (AUC) is an effective measure of classifier performance since it is based on the whole ROC curve and thus includes all possible classification thresholds.
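The AUC can be computed without tracing the ROC curve, using its standard interpretation as the probability that a randomly chosen target receives a higher score than a randomly chosen outlier. The sketch below uses that pairwise form; it assumes that higher scores mean "more target-like", and the function name is illustrative.

```python
def auc_from_scores(target_scores, outlier_scores):
    """AUC as the fraction of (target, outlier) pairs ranked correctly.

    Ties count one half. Assumes a higher score means more target-like.
    """
    wins = 0.0
    for t in target_scores:
        for o in outlier_scores:
            if t > o:
                wins += 1.0
            elif t == o:
                wins += 0.5
    return wins / (len(target_scores) * len(outlier_scores))


# Three of the four (target, outlier) pairs are ordered correctly.
example_auc = auc_from_scores([0.8, 0.4], [0.5, 0.2])
```

The double loop costs $O(n_{t}n_{o})$ comparisons, which is fine for data sets of the size used here; a rank-based formulation gives the same value more cheaply for large sets.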

4.4 SC-OSVM method algorithm

The algorithm below outlines the proposed SC-OSVM method. It begins by partitioning the training dataset $\mathcal{X}$ into $K$ clusters $\{C_{s}\}_{s=1}^{K}$ using a clustering technique. For each cluster $C_{s}$ , the dual covariance matrix $\Delta_{s}$ is estimated using Eq. (10). The between subclass scatter matrix $S_{\Phi}^{B}$ and within subclass scatter matrix $S_{\Phi}^{\mathbf{w}}$ are then calculated. These scatter matrices are incorporated into the objective function of the COSVM optimization problem (Eq. (1)). The COSVM method is applied to all data to find the trade-off parameter $\eta$ . The contribution of the scatter matrix $S$ and covariance matrix $\Delta$ is regulated using the trade-off parameter $\eta$ . Finally, the SC-OSVM method is trained and tested.

SC-OSVM method algorithm

1.
Let $\mathcal{X}=\{x_{i}\}_{i=1}^{N}$ represent the training data set of $N$ samples, which are the feature vectors associated with the database. Divide $\mathcal{X}$ into $K$ clusters $\{C_{s}\}_{s=1}^{K}$ , where $|C_{s}|={N_{s}},\forall s\in\{1,2,\ldots,K\}$ .
2.
Estimate the dual covariance matrix $\Delta_{s}$ of each cluster $C_{s}$ using Eq. (10).
3.
Calculate the within $S_{\Phi}^{\mathbf{w}}$ and between $S_{\Phi}^{B}$ subclass scatter matrices.
4.
Incorporate the within subclass and between subclass scatter matrices as the additional terms $\mathbf{w}^{T}S_{\Phi}^{\mathbf{w}}\mathbf{w}$ and $\mathbf{w}^{T}S_{\Phi}^{B}\mathbf{w}$ , respectively, into the objective function of the COSVM optimization problem, Eq. (1).
5.
Apply the COSVM method to all data and find the trade-off parameter $\eta$ .
6.
Use the trade-off parameter $\eta$ to regulate the contribution of our scatter matrix $S$ $(S=S_{\Phi}^{\mathbf{w}}+S_{\Phi}^{B})$ and covariance matrix $\Delta$ .
7.
SC-OSVM training and testing.
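Step 1 leaves the clustering technique open. As one possible instance, the sketch below runs plain Lloyd's K-means in the input space with a deterministic farthest-point initialization. This is a simplification: the derivation clusters the samples after mapping to feature space, and the choice of $K$ via a validity index is not shown. The function name and toy points are assumptions.

```python
import numpy as np


def kmeans_labels(X, k, n_iter=100):
    """Plain Lloyd's K-means in input space; returns one cluster label per sample."""
    centers = [X[0]]
    for _ in range(1, k):                        # farthest-point initialization
        dists = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers, dtype=float)

    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)     # assign to nearest center
        for j in range(k):
            if np.any(new_labels == j):          # keep the old center if a cluster empties
                centers[j] = X[new_labels == j].mean(axis=0)
        if np.array_equal(new_labels, labels):
            break                                # assignments stopped changing
        labels = new_labels
    return labels


# Two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = kmeans_labels(X, k=2)
```

The resulting label vector is what the later steps need in order to form each $\Delta_{s}$ and the kernel means of the clusters.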

4.5 Experimental protocol

SC-OSVM performance was compared to Covariance guided One-class SVM (COSVM), One-class SVM (OSVM), Support Vector Data Description (SVDD), $K$ Nearest Neighbors (K-NN), Parzen, and Gaussian. The classifiers were built using DDtools [33], and the radial basis kernel was used for kernelization. This kernel is calculated as $\mathcal{K}(x_{i},x_{j})=e^{-{\|x_{i}-x_{j}\|}^{2}/\sigma}$ , where $\sigma$ is the positive “width” parameter. We let the number of clusters range from ${C_{\min}}=2$ to ${C_{\max}}=10$ for all data sets, assuming that the target class of each data set contains a minimum of $2$ and a maximum of $10$ clusters (subclasses). The number of subclasses is obtained by applying a clustering approach and the validity index described in [5, p. 1419] separately to the samples belonging to each of the classes. We used $10$ -fold stratified cross validation. In practice, we set aside $10\%$ of randomly selected target data and added it to the outliers for testing, with the remainder serving as training data. This procedure was repeated $10$ times to create distinct training and testing sets. The final result was obtained by averaging over the $10$ models, which ensures that the findings obtained were not a product of chance. Furthermore, we used the Area Under the ROC Curve (AUC) [9] to evaluate the techniques, and we report it in the results Table 5. In order to achieve a high separation between targets and outliers, the AUC criterion must be maximized.
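One reading of the splitting procedure above is a repeated random hold-out over the target class, sketched below. The interpretation (independent random draws rather than disjoint folds), the function name, and the seed are assumptions; the held-out target indices would be combined with the outlier samples to form each test set.

```python
import numpy as np


def holdout_splits(n_targets, n_repeats=10, test_fraction=0.1, seed=0):
    """Yield (train_idx, test_idx) index arrays over the target class."""
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_fraction * n_targets)))
    for _ in range(n_repeats):
        perm = rng.permutation(n_targets)
        yield perm[n_test:], perm[:n_test]   # remaining targets train, held-out test


# Example with 81 targets, the size of the smallest-dimension set in Table 1.
splits = list(holdout_splits(81))
```

Averaging the AUC over the ten resulting models then gives the reported score for a data set.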

4.6 Classifiers

All classifiers in this study are defined in the Matlab toolbox DDtools [33]. Our suggested approach was tested against the following seven established classifiers:

•
Gaussian [8] This classifier, as the name indicates, represents the target data with a single Gaussian density and uses maximum likelihood estimation for the mean and covariance matrix. For high-dimensional data, a regularization term is frequently added to the covariance matrix. This classifier does not require any parameters to be tuned.
•
Parzen [23] The Parzen density estimation approach, also known as kernel density estimation, estimates the probability density of the training data, most often with a Gaussian kernel. The kernel width is determined by maximizing the likelihood of the training set using a leave-one-out approach.
•
K Nearest Neighbors (K-NN) [15] This method determines whether an incoming data point belongs to the target class or to the outliers using the distance to its $K$ nearest neighbors. The majority of the neighbors decide on the class to which the point belongs. The parameter $K$ must be optimized, which may be done using the leave-one-out density estimation method.
•
Support Vector Data Description (SVDD) [41] SVDD is a boundary-based one-class classifier that describes the training data by the hypersphere of smallest volume that encloses the training data points. This classifier can also benefit from the kernel technique.
•
One-class SVM (OSVM) [16] The major difference between OSVM and SVDD is that SVDD estimates the boundary around the training data set using a hypersphere, whereas OSVM looks for the maximum-margin hyperplane that separates the training data set from the origin. OSVM can be viewed as a two-class SVM in which the origin is regarded as the sole member of the second class.
•
One-Class Neural Network (OC-NN) [6] This is a one-class classifier that combines a feed-forward neural network with the one-class SVM.
•
Covariance-guided One-class SVM (COSVM) [17] This classifier is explained in depth in Section 3.1. It is a modification of the original OSVM method that incorporates the covariance matrix into the OSVM objective function to emphasize low variance directions.
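
To make the simplest of these baselines concrete, the following is a minimal sketch of a Gaussian one-class classifier in Python with NumPy. It is not the DDtools implementation. The class name `GaussianOneClass`, the regularization constant `reg`, and the `reject_fraction` used to place the decision threshold are all assumptions introduced for illustration.

```python
import numpy as np

class GaussianOneClass:
    """Single-Gaussian data description with a regularized covariance matrix."""

    def __init__(self, reg=1e-3, reject_fraction=0.1):
        self.reg = reg                          # assumed regularization strength
        self.reject_fraction = reject_fraction  # assumed share of training targets rejected

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # Sample covariance plus a ridge term so that the matrix is invertible
        cov = np.atleast_2d(np.cov(X, rowvar=False)) + self.reg * np.eye(X.shape[1])
        self.precision_ = np.linalg.inv(cov)
        # Threshold on the squared Mahalanobis distance of the training targets
        self.threshold_ = np.quantile(self.distance(X), 1.0 - self.reject_fraction)
        return self

    def distance(self, X):
        diff = np.asarray(X, dtype=float) - self.mean_
        return np.einsum("ij,jk,ik->i", diff, self.precision_, diff)

    def predict(self, X):
        """True for points accepted as targets, False for outliers."""
        return self.distance(X) <= self.threshold_

# Example on synthetic two-dimensional target data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
model = GaussianOneClass().fit(X_train)
accepted = model.predict(np.vstack([model.mean_, [10.0, 10.0]]))
```

The negated distance can serve as a target score when ROC curves are built for this baseline.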

DDtools [33] was used to implement the k-NN, Parzen, Gaussian, and SVDD classifiers. The SVM-KM toolkit was used to implement OSVM, COSVM, and SC-OSVM [39]. Kernelization in SVDD, OSVM, COSVM, and SC-OSVM employed the radial basis kernel. For these classifiers, the kernel width parameter $\sigma$ is important: it specifies the "scale" of the data used for training and testing. The data may not scale well at small values of $\sigma$. The proportion of support vectors (SVs) obtained from the training stage [12] is an excellent indicator of poor scaling: if all of the data points are treated as SVs, the scaling is poor and the classifier merely memorizes the data. A good heuristic for choosing $\sigma$ is therefore to begin with a small value, gradually increase it, and watch the proportion of SVs; the value of $\sigma$ retained for a given dataset is the one at which the number of SVs no longer decreases [32]. We carry this out individually for SVDD, OSVM, COSVM, and SC-OSVM in order to identify suitable $\sigma$ values for each dataset. The approach for tuning our control parameter $\eta$ is detailed in Section 4.1. As previously stated, the $\eta$ for SC-OSVM is calculated by applying the unimodal COSVM to the whole target class, which avoids the high cost of computing a separate $\eta$ value for each cluster. As a result, the same $\eta$ value is used for all target class clusters. For OSVM, COSVM, and SC-OSVM, the parameter $v$ was set to $0.2$. The corresponding parameter for SVDD is termed the fraction of rejection, and it was likewise set to $0.2$ using DDtools [33]. The lowest threshold for the percentage of outliers $f_{OL}$ when optimizing $\eta$ was set to $0.1$. It is worth noting that tuning the $v$ and $f_{OL}$ parameters per dataset might result in even better performance in these tests. However, we have chosen the same values of $v$ and $f_{OL}$ for all datasets, since ideal values for these parameters cannot be specified in a real-world situation, where future data points are always unknown. These parameters may be adjusted for a practical application, and the system can be re-trained as needed.
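
The $\sigma$ heuristic described above can be sketched as a simple search loop. The following Python sketch is an illustration under stated assumptions, not the procedure used in the experiments: `sv_fraction` is a hypothetical callback that trains the chosen classifier with a given $\sigma$ and returns the proportion of training points that become SVs, and `tol` is an assumed tolerance below which the proportion is considered to have stopped decreasing.

```python
def select_sigma(sv_fraction, sigmas, tol=0.01):
    """Return the first sigma after which the SV proportion stops decreasing.

    sv_fraction: callable sigma -> proportion of SVs in [0, 1] (hypothetical).
    sigmas: candidate values in increasing order, starting from a small value.
    tol: assumed minimum decrease that still counts as an improvement.
    """
    sigmas = list(sigmas)
    if not sigmas:
        raise ValueError("at least one candidate sigma is required")
    previous = sv_fraction(sigmas[0])
    for i in range(1, len(sigmas)):
        current = sv_fraction(sigmas[i])
        if previous - current <= tol:
            # A larger sigma no longer reduces the SV proportion noticeably.
            return sigmas[i - 1]
        previous = current
    return sigmas[-1]

# Stand-in SV proportions, for illustration only (not measured values).
demo_fractions = {0.1: 1.0, 1: 0.6, 10: 0.3, 100: 0.295, 1000: 0.295}
chosen_sigma = select_sigma(demo_fractions.get, [0.1, 1, 10, 100, 1000])
```

In practice the callback would wrap one training run of SVDD, OSVM, COSVM, or SC-OSVM, so the search costs one training run per candidate value.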

Table 2
Average FPR (%) at a fixed TPR of $95\%$ for each method on the 4 datasets

Experiment	OSVM	COSVM	SC-OSVM
Gamma (high overlap)	98.67	97.42	46.75
Gaussian (high overlap)	91.57	90.32	18.47
Biomedical (healthy)	92.54	91.87	0.66
Biomedical (diseased)	96.27	95.58	1.03

Table 3
Average FPR (%) at a fixed TPR of $99\%$ for each method on the 4 datasets

Experiment	OSVM	COSVM	SC-OSVM
Gamma (high overlap)	98.39	97.35	37.62
Gaussian (high overlap)	96.71	97.16	7.50
Biomedical (healthy)	95.49	95.40	14.66
Biomedical (diseased)	98.54	97.12	9.16

4.7 Results and discussion

Our proposed algorithm was evaluated using several criteria, such as accuracy and the true and false positive rates. The true positive rate (TPR) is the fraction of records of a given type that are classified correctly. The false positive rate (FPR) is the fraction of legitimate records that are classified as anomalous. Tables 2 and 3 show the FPR values at a fixed TPR of $95\%$ and $99\%$, respectively, computed for different artificial and real-world datasets.
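
The two criteria used in this section, the FPR at a fixed TPR and the AUC, can both be computed directly from classifier scores. The following is a minimal sketch in Python with NumPy, assuming that a higher score means "more likely positive"; which class plays the positive role is left to the caller. The function names `fpr_at_tpr` and `auc` are hypothetical, and this is not the DDtools routine used to produce the tables.

```python
import math
import numpy as np

def fpr_at_tpr(pos_scores, neg_scores, target_tpr):
    """FPR at the highest threshold whose TPR is at least target_tpr."""
    pos = np.sort(np.asarray(pos_scores, dtype=float))[::-1]  # descending
    neg = np.asarray(neg_scores, dtype=float)
    # Number of positives that must be accepted (epsilon guards float round-off)
    k = max(1, math.ceil(target_tpr * len(pos) - 1e-9))
    threshold = pos[k - 1]
    return float(np.mean(neg >= threshold))

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Toy scores, for illustration only.
pos_demo = [0.9, 0.8, 0.7, 0.6]
neg_demo = [0.65, 0.3, 0.2, 0.1]
fpr_75 = fpr_at_tpr(pos_demo, neg_demo, 0.75)
fpr_100 = fpr_at_tpr(pos_demo, neg_demo, 1.0)
auc_demo = auc(pos_demo, neg_demo)
```

The pairwise AUC computation is quadratic in the number of scores, which is acceptable for data sets of the size used here; a rank-based formula would be preferable for large ones.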

Table 4
Average AUC of each method for the $7$ real-world data sets (best method in bold, second best emphasized)

Dataset	k-NN	Parzen	Gaussian	SVDD	OSVM	OC-NN	COSVM	SC-OSVM
Biomedical (healthy)	36.83	40.02	64.66	81.38	82.65	83.25	85.80	89.81
Biomedical (diseased)	89.42	90.09	89.60	90.28	91.04	90.55	91.04	92.16
SPECT (normal)	84.28	96.45	93.90	92.32	95.29	86.82	96.79	97.81
Balance-scale left	87.78	91.23	92.09	94.20	96.13	96.87	97.19	98.51
Balance-scale right	87.76	91.23	91.95	94.72	97.57	96.07	97.69	98.82
Haberman’s Survival	66.49	67.97	60.09	68.87	69.62	68.54	69.59	70.28
Waveform	87.78	97.02	92.10	97.20	97.19	96.67	97.23	98.36

Table 5

Average AUC of each method for the $4$ artificial data sets (best method in bold)

Dataset	OSVM	COSVM	SC-OSVM
Gauss. (low overlap)	93.12	95.66	96.88
Gauss. (high overlap)	87.15	88.29	91.10
Gamma (low overlap)	87.65	91.27	93.03
Gamma (high overlap)	96.87	98.17	98.69

Table 6

Average AUC of OSVM, COSVM and SC-OSVM for the Train-test-Network (best method in bold)

Dataset	OSVM	COSVM	SC-OSVM
Train-test-Network	91.03	93.66	97.81

The average AUC values achieved by the classifiers on the artificial and real data sets are reported in Tables 4, 5, and 6. As we can see, SC-OSVM outperforms all other classifiers and produces the best results on almost all data sets in terms of unbiased AUC values derived by averaging over $10$ distinct models. Indeed, SC-OSVM has the benefit of decreasing overlap and increasing classification accuracy by reducing dispersion within and between subclasses of the target class. In general, we observe that the k-NN, Gaussian, and Parzen classifiers perform poorly on real-world datasets (Table 4) when compared to SVM-based classifiers (SVDD, OSVM, COSVM). This is due to the intrinsic limitations of these classifiers. Because k-NN classifies a data point based entirely on its neighbors, it is susceptible to outliers [15]. The Gaussian classifier assumes that the underlying distribution is Gaussian, which is not necessarily the case in real datasets. With high-dimensional data or a limited sample size, the Parzen classifier suffers from poor performance, as noted in [23]. This drawback of the Parzen classifier is clearly visible from the poor results on the Gene Expression datasets, which have a very high dimension. SVM-based classifiers do not rely on these assumptions, resulting in better results in the vast majority of instances. However, in the case of artificial datasets (Table 5), we observe that these three classifiers (k-NN, Gaussian, and Parzen) are competitive with, if not better than, SVM-based techniques. This is because the artificial datasets are created using a pre-defined regular distribution. As expected, the Gaussian approach performs well when the dataset was created using a Gaussian distribution. It also works rather well on datasets generated by the banana distribution, because the banana distribution is created by superimposing an underlying Gaussian distribution over a banana shape. However, in the case of the Gamma distribution, it performs badly since the distribution does not meet the classifier’s assumption.

Using these performance metrics, we can confirm that the proposed emphasis on low variance directions improves classification accuracy, since all the classes appear to be recognized fairly well.

In terms of training computational complexity, the SC-OSVM is nearly as complex as the COSVM. In fact, the dual kernel covariance matrix may be computed during pre-processing and reused throughout the training phase. The SC-OSVM technique solves the quadratic programming problem using sequential minimal optimization and therefore scales as $O(N^{3})$, where $N$ is the number of training data points [1].

Table 7 shows the average training times, in milliseconds, of the algorithms used in the experiments on both artificial and real-world datasets. As expected, SC-OSVM has training times similar to those of COSVM. The training times for the other algorithms (K-NN, Parzen, Gaussian, SVDD, OSVM, and OC-NN) are also listed in the table. The training times of K-NN, Gaussian, and Parzen are lower than that of SC-OSVM on the artificial datasets because these three algorithms are not based on an iterative optimization process. K-NN, Gaussian, and Parzen have closed-form solutions that can be computed efficiently, while SC-OSVM requires an iterative optimization process, which can be time-consuming. However, on the real-world datasets, SC-OSVM has training times comparable to those of K-NN, Gaussian, and Parzen, indicating that the iterative optimization process does not significantly affect the training times in this case.

Table 7

Average training times in milliseconds of different algorithms for the experiments on the artificial and real-world datasets

Experiment	K-NN	Parzen	Gaussian	SVDD	OSVM	OC-NN	COSVM	SC-OSVM
Artificial datasets	5.6	6.1	5.83	7.1	6.9	8.67	7.4	8.5
Real-world datasets	113.5	114.6	114.1	113.8	115.3	136.8	127.7	131.2

Figure 5.

ROC curves of each classifier for the SPECT (normal) data set.

We provide a graphical representation of the data set models by displaying the actual Receiver Operating Characteristic (ROC) curves [26] for the SPECT (normal) real-world data set. Figure 5 shows the ROC curves for all the classifiers and indicates that SC-OSVM delivers the highest performance.

5. Conclusion

This article presents an improved version of the Covariance-guided One-class SVM (COSVM) classifier called Scatter Covariance-guided One-class SVM (SC-OSVM), which uses subclass information of the target class to reduce dispersion within and between subclasses and improve classification performance. The proposed approach involves dividing the target class into subclasses, calculating the between-subclass and within-subclass scatter matrices, and incorporating these matrices into the COSVM objective function to achieve the best classification accuracy.

The advantage of the SC-OSVM over other techniques such as SOC-SVM and KSVDD is that it reduces dispersion not only within each subclass but also between subclasses to enhance classification performance. Moreover, the SC-OSVM uses a trade-off parameter to regulate the contribution of the scatter matrix $S$ and the covariance matrix $\Delta$ in the optimization process. The suggested approach is based on a convex optimization problem, whose optimal solution can be found using traditional numerical methods.

To evaluate the performance of the SC-OSVM, various artificial and real-world benchmark datasets were used to compare it to contemporary one-class classifiers. The results show that the SC-OSVM achieves the highest average AUC on the data sets reported in Tables 4, 5, and 6. In the future, the suggested SC-OSVM will be tested on security applications such as face recognition and anomaly detection.

References

1. Ahmad and Dey, A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets, 2010.
2. Anaissi, Suleiman and Alyassine, A personalized federated learning algorithm for one-class support vector machine: An application in anomaly detection, In Computational Science – ICCS 2022: 22nd International Conference, London, UK, June 21–23, 2022, Proceedings, Part IV, Berlin, Heidelberg, Springer-Verlag, 2022, pp. 373–379.
3. Ayyagari M.R., Classification of imbalanced datasets using one-class svm, k-nearest neighbors and cart algorithm, International Journal of Advanced Computer Science and Applications 11(11) (2020).
4. Blake and Merz, Uci repository of machine learning data sets, 1999.
5. Bouguessa, Wang and Sun, An objective approach to cluster validation, Pattern Recognition Letters 27(13) (2006), 1419–1430.
6. Chalapathy, Menon A.K. and Chawla, Anomaly detection using one-class neural networks, 2018.
7. Cristianini and Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, UK, 2000.
8. Dong and Zhou, Gaussian classifier-based evolutionary strategy for multimodal optimization, IEEE Trans. Neural Networks Learn. Syst. 25(6) (2014), 1200–1216.
9. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27(8) (2006), 861–874.
10. Fazli and Nadirkhanlou, A novel method for automatic segmentation of brain tumors in mri images, CoRR, abs/1312.7573, 2013.
11. Gardner A.B., Krieger A.M., Vachtsevanos G.J. and Litt, One-class novelty detection for seizure analysis from intracranial eeg, Journal of Machine Learning Research 7 (2006), 1025–1044.
12. Hayton P.M., Schölkopf, Tarassenko and Anuzis, Support vector novelty detection applied to jet engine vibration spectra, In Leen T.K., Dietterich T.G. and Tresp, editors, NIPS, Cambridge, MA, USA, MIT Press, 2000, pp. 946–952.
13. Hernández-Orallo, Roc curves for regression, Pattern Recognition 46(12) (2013), 3395–3411.
14. Horn and Charles, Matrix Analysis, Cambridge University Press, USA, 1990.
15. Jiang and Zhou Z.-H., Editing training data for knn classifiers with neural network ensemble, In Yin F.-L., Wang and Guo, editors, Advances in Neural Networks – ISNN 2004, Berlin, Heidelberg, Springer Berlin Heidelberg, 2004, pp. 356–361.
16. Juszczak, Learning to recognise: a study on one-class classification and active learning, PhD thesis, Delft University of Technology, Netherlands, 2006.
17. Khan N.M., Ksantini, Ahmad I.S. and Guan, Covariance-guided one-class support vector machine, Pattern Recognition 47(6) (2014), 2165–2177.
18. Kumar, Sinha, Chakrabarti and Vyas, A fast learning algorithm for one-class slab support vector machines, Knowledge-Based Systems 228 (2021), 107–267.
19. Kwak and Oh, Feature extraction for one-class classification problems: Enhancements to biased discriminant analysis, Pattern Recognition 42(1) (2009), 17–26.
20. and Croft W.B., Improving novelty detection for general topics using sentence level information patterns, In CIKM ’06: Proceedings of the 15th ACM international conference on Information and knowledge management, Arlington, Virginia, USA, ACM, 2006, pp. 238–247.
21. Liu and Liu, Research of fast som clustering for text information, Expert Syst. Appl. 38(8) (2011), 9325–9333.
22. Michelli, Interpolation of scattered data: Distance matrices and conditionally positive definite functions, Constructive Approximation 2 (1986), 11–22.
23. Muto and Hamamoto, Improvement of the parzen classifier in small training sample size situations, Intelligent Data Analysis 5(6) (2001), 477–490.
24. Mygdalis, Iosifidis, Tefas and Pitas, Exploiting subclass information in one-class support vector machine for video summarization, In ICASSP, South Brisbane, QLD, Australia, IEEE, 2015, pp. 2259–2263.
25. Mygdalis, Iosifidis, Tefas and Pitas, Kernel subclass support vector description for face and human action recognition, In SPLINE, Aalborg, Denmark, IEEE, 2016, pp. 1–5.
26. Nallammal and Radha, Performance evaluation of face recognition based on pca, lda, ica and hidden markov model, In Kannan and Andrès, editors, ICDEM, volume 6411 of Lecture Notes in Computer Science, Springer, 2010, pp. 96–100.
27. N’cir C.-E.B., Essoussi and Limam, Kernel-based methods to identify overlapping clusters with linear and nonlinear boundaries, J. Classification 32(2) (2015), 176–211.
28. Nheri, Ksantini, Kaâniche M.B. and Bouhoula, A novel dispersion covariance-guided one-class support vector machines, In Farinella G.M., Radeva and Braz, editors, VISIGRAPP (4: VISAPP), Portugal, SCITEPRESS, 2020, pp. 546–553.
29. Parra, Deco and Miesbach, Statistical independence and novelty detection with information preserving nonlinear maps, Neural Computation 8 (1996), 260–269.
30. and Davison B.D., Web page classification: Features and algorithms, ACM Comput. Surv. 41(2) (2009), 12:1–12:31.
31. Sadeghi and Hamidzadeh, Automatic support vector data description, Soft Comput. 22(1) (2018), 147–158.
32. Scholkopf, Platt J.C., Shawe-Taylor, Smola A.J. and Williamson R.C., Estimating the support of a high-dimensional distribution, Neural Computation 13(7) (2001).
33. Tax, Ddtools, the data description toolbox for matlab, 2012. version 19.1.
34. Tax D.M.J. and Duin R.P.W., Uniform object generation for optimizing one-class classifiers, J. Mach. Learn. Res. 2 (2001), 155–173.
35. Tellenbach, Detection, classification and visualization of anomalies using generalized entropy metrics, PhD thesis, TU, 2012.
36. Tsang I.W., Kwok J.T. and Li, Learning the kernel in mahalanobis one-class support vector machines, In IJCNN, Vancouver, BC, Canada, IEEE, 2006, pp. 1169–1175.
37. van Hattum and Hoijtink, Market segmentation using brand strategy research: Bayesian inference with respect to mixtures of log-linear models, J. Classification 26(3) (2009), 297–328.
38. Vapnik V.N., An overview of statistical learning theory, IEEE Transactions on Neural Networks 10(5) (1999), 988–999.
39. Vapnik V.N., The Nature of Statistical Learning Theory, Springer, New York, NY, 2000.
40. Xing H.-J. and Li L.-F., Robust least squares one-class support vector machine, Pattern Recognit. Lett. 138 (2020), 571–578.
41. Zafeiriou and Laskaris N.A., On the improvement of support vector techniques for clustering by means of whitening transform, IEEE Signal Process. Lett. 15 (2008).
42. Zeng, Roisman G.I., Wen and Huang T.S., Spontaneous emotional facial expression detection, Journal of Multimedia 1(5) (2006), 1–8.