Principal component analysis method is one of the most widely used statistical procedures for data dimension reduction. The traditional principal component analysis method is sensitive to outliers since it is based on the sample covariance matrix. Meanwhile, the deviation of the principal component analysis based on the Minimum Covariance Determinant (MCD) estimation is significantly increased as the data dimension increases. In this paper, we propose a high-dimensional robust principal component analysis based on the Rocke estimator. Simulation studies and a real data analysis illustrate that the finite sample performance of the proposed method is significantly better than those of the existing methods.
Principal component analysis (PCA) is calculated by a covariance matrix. In order to get a robust principal component, it needs a robust covariance matrix estimator. Robust estimation methods include MCD estimator[1], MM estimator[2] S-D estimator[3], Rocke estimator[4] MRCD[5, 6, 7, 8], etc. In practical applications, the dimension of data is often very large. Since the Rocke estimator is more robust than other robust estimators in high-dimensional data[9], this paper selects the Rocke estimator to calculate the principal component and compares it with PCA based on the MCD estimator. We first introduce the definition and shortcomings of traditional PCA and PCA based on the MCD estimator. Then we describe the basic principle and calculation process of the Rocke estimator, and finally carry out numerical simulation and application to obtain the excellent robustness and practicability of PCA based on the Rocke estimator.
PCA and its instability
PCA is a common method for data dimensionality reduction, which is widely used in economic management, medicine, and other fields. In practical applications, many problems will involve a large number of variables, which will increase the calculation and complexity of the problem. And each variable reflects some information from the research to a certain extent, and there will be a certain correlation between these variables. In the process of data analysis, we all hope to involve fewer variables and more information, so PCA plays a great role in the actual problem research.
Its basic idea is to change many variables into a few principal components. Principal components are generally expressed as linear combinations of original variables, and these principal components can still retain most of the information from the original data, which is very useful in reducing the data dimension. For example, finding the principal component of p-dimensional data is to find the linear combination and to maximize its variance, that is to maximize , where is a -dimensional vector and , is the covariance matrix of variables. Variances represents the amount of information carried, and principal components are calculated in turn. Generally, the sum of the variances corresponding to principal components is more than 80% of the total variance.
In all kinds of observation data, because of technology and other reasons, there will be outliers, especially in the data of multiple variables ( variable), it is impossible to judge whether the data is outliers from a few dimensions (less than ). Furthermore, outliers have a great impact on the calculation of sample mean and sample covariance matrix. The traditional PCA is calculated via the sample mean and sample covariance matrix, so that the traditional PCA is very sensitive to outliers.
In view of the above shortcomings, some literatures consider the robust PCA based on the robust covariance matrix, such as MCD estimation[10]. However, the deviation of the mean vector and covariance matrix estimated by MCD estimation increases significantly with the increase of data dimension, which means that its stability is still not good enough, resulting in a big deviation from the original data when calculating the eigenvalues and eigenvectors of the covariance matrix. In this paper, we propose a more stable Rocke estimator which can also maintain good stability under high-dimension data, so that it has good robustness, high efficiency and faster computing speed than MCD estimation in low-dimension and high-dimension ( 15) data.
The basic principle and algorithm of Rocke estimation
Brief
Rocke estimation is a more robust estimation method for high-dimensional data based on the improved S-estimation. It was first proposed by Rocke [5]. Then Maronna and Yohai[4] improved it and made an empirical comparison with other estimates, and came to the conclusion that its robustness is better than other estimations in high-dimensional data, and the influence of the initial estimator in the estimation is more obvious. Through the comparison, it is concluded that the initial value estimated by KSD[11] is the best choice compared with other initial estimators in robustness and calculation speed. The Rocke estimation has good robustness in the estimation of mean and covariance matrix of high-dimensional data with 15.
The basic principle of Rocke estimation
Rocke estimation mainly uses the non-monotonic weight function and the robust covariance matrix estimator which is obtained by the continuous iterative updating of the Mahalanobis distances of each sample, so as to obtain the robust principal component. For example, there are n p-dimensional data from distribution function , KSD is firstly used to estimate the initial mean vector and covariance matrix of the data, then the squared Mahalanobis distances are calculated. Let be the scale estimator of Mahalanobis distance, and the function proposed by Rocke (1996) is a continuous stage weight function. Through the iterative process, the weight of the sample which deviates from 1 is reduced to 0. The closer is to 1, the greater the weight is given, and the maximum is 1. Thus, a new mean vector and variance matrix are obtained, which are iterated continuously. When the Mahalanobis distances of each sample changes very little, that is to say, when the scale estimator of Mahalanobis distance changes very little, the iteration stops, and a robust covariance matrix can be obtained.
The concrete algorithm of Rocke estimation
Step 1: Centralizing the data, and then get the mean vector and covariance matrix of sample data by KSD estimation.
Step 2: denotes the scale estimator of Mahalanobis distance, which can be obtained from the following equation
Where , which controls the size of breakdown point, and is smooth and nondecreasing in , with and When , Rocke estimation can obtain the largest breakdown point. The relationship of function and weight function is , which can be expressed as
where denotes weighted range. Because the Mahalanobis distance approximately follows the chi-square distribution of the degree of freedom satisfies beyond , where is a confidence level. When p is large, distribution tends to symmetrical shape, then we have . So let , then can be calculated by fixed point method.
Step 3: In the above steps, we get the Mahalanobis distance through the initial estimator, and then calculate the new mean vector and covariance matrix through the weight function. The weight function is
The final covariance matrix can be obtained by applying different weights to different samples to calculate , the un-normalized covariance matrix C through the following equations :
Step 4: Repeat Step 2 and Step 3 until , then we can get the final stable covariance matrix, where tol denotes the preset error. Next, the correlation matrix is calculated, and then the eigenvalues and eigenvectors are calculated for PCA.
Simulation
In this section, the traditional PCA (TPCA), PCA based on MCD estimation (MCDPCA) and PCA based on Rocke estimation (RPCA) are compared under the same set of data. We simulate data sets from mixture distribution with sample sizes of 100, 150, 300 and and dimensions of 6, 30, where denotes the ratio of contamination. Outliers come from the distribution . We set to be 0 ( 30, 300) or 0.5 ( 30, 150) in the simulations.
By changing the ratio of contamination, we repeated the simulation 100 times, using the TPCA, MCDPCA and RPCA to get 100 covariance matrices corresponding to each Then the corresponding covariance matrix and principal component eigenvalues are calculated, and then the averages of these 100 groups of eigenvalues are calculated, and the standard error (SD) corresponding to each eigenvalue of these 100 simulations is calculated to reflect its stability. The comparison results are shown from Table 1 to Table 4, and the simulation results of 30 dimensional data only give the results of the first 12 eigenvalues.
Results of 6-dimensional data simulation, 0
TPCA
SD
RPCA
SD
MCDPCA
SD
1.3494
0.0867
1.4536
0.1241
1.4634
0.1032
1.1672
0.0529
1.2220
0.0718
1.2355
0.0797
1.0375
0.0430
1.0543
0.0586
1.0547
0.0527
0.9308
0.0493
0.9063
0.0552
0.9134
0.0567
0.8226
0.0495
0.7662
0.0676
0.7649
0.0643
0.6925
0.0590
0.5975
0.0839
0.5681
0.1061
5.4943
0.1505
1.4371
0.1045
1.4434
0.1044
0.1373
0.0406
1.2252
0.0701
1.2351
0.0700
0.1146
0.0345
1.0527
0.0596
1.0645
0.0625
0.0994
0.0304
0.9216
0.0542
0.9241
0.0592
0.0848
0.0265
0.7762
0.0642
0.7650
0.0698
0.0695
0.0222
0.5872
0.0783
0.5680
0.0953
5.7557
0.0502
1.4213
0.0948
3.0444
2.1540
0.0677
0.0146
1.2111
0.0616
0.7825
0.5660
0.0563
0.0117
1.0593
0.0537
0.6833
0.4966
0.0478
0.0099
0.9144
0.0531
0.5920
0.4343
0.0398
0.0086
0.7767
0.0554
0.5027
0.3712
0.0327
0.0082
0.6172
0.0787
0.3951
0.3001
5.8391
0.0299
1.5534
0.7787
5.7919
0.7692
0.0453
0.0087
1.1759
0.2115
0.0662
0.2171
0.0374
0.0072
1.0238
0.1839
0.0523
0.1816
0.0314
0.0062
0.8978
0.1654
0.0412
0.1543
0.0261
0.0055
0.7550
0.1478
0.0299
0.1241
0.0208
0.0045
0.5941
0.1311
0.0185
0.0943
Partial results of 30-dimensional data simulation,
TPCA
SD
RPCA
SD
MCDPCA
SD
1.6353
0.0504
1.6663
0.0555
1.7648
0.0676
1.5493
0.0353
1.5748
0.0417
1.6518
0.0540
1.4805
0.0343
1.5032
0.0357
1.5710
0.0420
1.4194
0.0293
1.4379
0.0297
1.5048
0.0424
1.3666
0.0232
1.3857
0.0242
1.4369
0.0397
1.3226
0.0221
1.3341
0.0237
1.3831
0.0343
1.2791
0.0257
1.2904
0.0209
1.3305
0.0301
1.2372
0.0220
1.2455
0.0230
1.2781
0.0262
1.1965
0.0207
1.2038
0.0220
1.2316
0.0267
1.1577
0.0213
1.1660
0.0205
1.1885
0.0217
1.1222
0.0184
1.1283
0.0208
1.1463
0.0220
1.0880
0.0194
1.0955
0.0197
1.1048
0.0225
27.2945
0.4508
1.6901
0.0595
24.173
9.4582
0.1572
0.0264
1.5910
0.0429
0.3502
0.5454
0.1480
0.0246
1.5165
0.0339
0.3319
0.5225
0.1407
0.0228
1.4544
0.0321
0.3185
0.5060
0.1344
0.0221
1.4027
0.0277
0.3046
0.4856
0.1298
0.0214
1.3493
0.0237
0.2923
0.4669
0.1246
0.0211
1.3041
0.0269
0.2791
0.4470
0.1201
0.0204
1.2609
0.0257
0.2686
0.4308
0.1159
0.0194
1.2149
0.0233
0.2578
0.4147
0.1117
0.0188
1.1717
0.0195
0.2486
0.4009
0.1079
0.0181
1.1337
0.0220
0.2371
0.3821
0.1040
0.0173
1.0948
0.0226
0.2280
0.3685
Partial results of 6-dimensional data simulation,
TPCA
SD
RPCA
SD
MCDPCA
SD
1.4862
0.1086
1.6477
0.1647
1.7383
0.1849
1.2369
0.0820
1.3040
0.0998
1.3630
0.1161
1.0739
0.0631
1.0811
0.0860
1.0903
0.1034
0.9019
0.0668
0.8725
0.0869
0.8579
0.0887
0.7348
0.0684
0.6586
0.0949
0.6154
0.1017
0.5664
0.0723
0.4360
0.1020
0.3352
0.1211
5.4591
0.4462
1.6440
0.1518
1.6955
0.7353
0.1623
0.1233
1.3348
0.1037
1.3570
0.2442
0.1282
0.0992
1.0902
0.0860
1.0969
0.2015
0.1047
0.0912
0.8574
0.0803
0.8652
0.1658
0.0834
0.0713
0.6464
0.0802
0.6193
0.1480
0.0622
0.0644
0.4272
0.1032
0.3661
0.1263
5.7289
0.0998
1.6273
0.1529
1.8002
0.7353
0.0834
0.0299
1.2973
0.1028
1.2839
0.2442
0.0647
0.0232
1.0741
0.0841
1.0525
0.2015
0.0512
0.0183
0.8724
0.0753
0.8527
0.1658
0.0413
0.0171
0.6733
0.0992
0.6232
0.1480
0.0305
0.0146
0.4556
0.0967
0.3875
0.1263
5.8210
0.0385
1.7499
0.7501
2.8723
1.9470
0.0554
0.0118
1.2648
0.2406
0.9480
0.5912
0.0433
0.0098
1.0295
0.1917
0.7740
0.4836
0.0338
0.0082
0.8405
0.1673
0.6156
0.3910
0.0269
0.0070
0.6539
0.1445
0.4845
0.3148
0.0197
0.0057
0.4614
0.1280
0.3056
0.2142
Partial results of 30-dimensional data simulation,
TPCA
SD
RPCA
SD
MCDPCA
SD
1.9610
0.0804
2.0030
0.0820
2.2988
0.1169
1.8075
0.0640
1.8408
0.0666
2.1057
0.0874
1.6970
0.0484
1.7270
0.0534
1.9427
0.0720
1.6087
0.0458
1.6354
0.0488
1.8180
0.0635
1.5312
0.0427
1.5490
0.0464
1.7113
0.0523
1.4528
0.0330
1.4677
0.0356
1.6116
0.0520
1.3857
0.0309
1.4033
0.0320
1.5184
0.0494
1.3293
0.0316
1.3412
0.0316
1.4321
0.0432
1.2714
0.0313
1.2800
0.0327
1.3545
0.0418
1.2169
0.0323
1.2182
0.0314
1.2764
0.0439
1.1602
0.0297
1.1657
0.0332
1.2004
0.0372
1.1076
0.0269
1.1121
0.0285
1.1322
0.0385
27.2256
0.6510
2.0373
0.0911
12.0097
12.6877
0.1912
0.0443
1.8815
0.0658
1.3587
0.9515
0.1769
0.0407
1.7548
0.0564
1.2508
0.8761
0.1645
0.0376
1.6594
0.0483
1.1763
0.8259
0.1557
0.0360
1.5714
0.0485
1.1076
0.7784
0.1481
0.0338
1.4922
0.0404
1.0485
0.7387
0.1399
0.0334
1.4168
0.0347
0.9853
0.6946
0.1326
0.0316
1.3501
0.0375
0.9217
0.6497
0.1263
0.0300
1.2885
0.0311
0.8701
0.6136
0.1201
0.0285
1.2189
0.0306
0.8227
0.5803
0.1137
0.0269
1.1642
0.0250
0.7808
0.5518
0.1086
0.0257
1.1036
0.0256
0.7344
0.5190
It can be seen from the above results that in the case of low dimension (6-dimension), when there is no contaminated data, RPCA and MCDPCA can get almost the same results as the TPCA. When there are 10% outliers, RPCA and MCDPCA have good robustness, while TPCA has been greatly different. When the proportion of outliers is increased to 20%, the result of MCDPCA is still stable when , and the standard error corresponding to the result of MCDPCA increases when , indicating that there are unstable results (see Annex 1 for details). When the proportion of outliers is increased to 30%, the result of MCDPCA is quite different from that of original data, while the result of RPCA is always stable. Furthermore, judging from the standard errors of the above eigenvalues, the stability of RPCA is much better than that of TPCA and MCDPCA.
In the case of high-dimension (30-dimension) (see the results in Annex 1 for details), the result of MCDPCA has shown a significant fluctuation when the proportion of outliers reaches 10%, while the RPCA still maintains a good robustness when the proportion of outliers reaches 20%.
Real data analysis
In order to show that RPCA has a good robustness in the actual situation, we use the Wine dataset (1991) from the UCI machine learning, which can be available online at https://archive-beta.ics.uci.edu/dataset/109/wine. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. We select all 71 class-2 samples as real data, and randomly add 9 class-1 samples as contaminated data. Therefore, this dataset contains 13 variables and 80 samples, including 9 outliers data numbered 1 to 9. So, the contaminated rate is 11.25% All results can be obtained by using CovSest and CovMcd functions in the “rrcov” package.
Firstly, for comparison purposes we use Rocke estimation and MCD estimation to calculate the robust distance of this data and draw distance-distance graphs. The Fig. 1 illustrates the relationship of robust distance calculated by the Rocke estimation method and mahalanobis distance. As we can see, all nine observations of class-1 are clustered in the upper left area, where the points are called outliers. And five observations are clustered in the upper right area, which are diagnosed as bad leverage points. However, in the Fig. 2, MCD estimation has worse robustness in this situation, diagnosing only two outliers from class-1.
Diagnostic plot of Rocke estimation for the Wine dataset.
Diagnostic plot of MCD estimation for the Wine dataset.
Secondly, we calculate the TPCA, RPCA and MCDPCA for the sample data which only contains class-2 observations, so as to get the eigenvalues of the uncontaminted data. Then we use the dataset above to calculate eigenvalues with TPCA, RPCA and MCDPCA respectively. The results are shown in Table 5.
The comparison results for the Wine data
Uncontaminted data
Contaminted data
TPCA
RPCA
MCDPCA
TPCA
RPCA
MCDPCA
s 0.7167
0.7934
0.9613
0.8621
0.6838
1.4888
0.5806
0.5220
0.5246
0.5787
0.6491
0.5169
0.4250
0.3126
0.4147
0.3703
0.4680
0.4056
0.3608
0.2843
0.3323
0.3228
0.2712
0.2803
0.3037
0.2610
0.2450
0.2944
0.2121
0.2087
0.2404
0.1648
0.2376
0.1894
0.1350
0.1708
0.1493
0.1422
0.1477
0.1284
0.1170
0.1261
0.1329
0.0971
0.1130
0.1126
0.0862
0.0922
0.1133
0.0623
0.0777
0.1013
0.0671
0.0622
0.0929
0.0494
0.0575
0.0742
0.0444
0.0468
0.0600
0.0383
0.0316
0.0580
0.0349
0.0412
0.0377
0.0199
0.0299
0.0284
0.0179
0.0196
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
It can be seen from the above results that under the data without contamination, the eigenvalues of TPCA, MCDPCA and RPCA have similar results. Furthermore, after adding contaminated data, the eigenvalues of TPCA differ from that of MCDPCA with the results calculated from the original data, while RPCA still can show the characteristics of the original data.
Conclusion
Because the traditional PCA will be greatly affected by the outliers when calculating the sample covariance matrix, which makes the results deviate greatly. MCDPCA is a covariance matrix based on MCD estimation, which is a relatively robust estimation method, but its robustness is not as good as Rocke estimation. With the increase of data dimension, the deviation of MCD estimation increases significantly, which makes the deviation of MCDPCA very large. Rocke estimation has good robustness in high-dimensional data, and its calculation speed is faster than MCD estimation. Therefore, in the actual PCA, the application of Rocke estimation can improve the calculation speed and avoid the influence of outliers, which can be well applied to improve the existing methods, such as Gaussian process.[12, 13]
Footnotes
Acknowledgments
The authors acknowledge the Innovative Research Team in Universities of Guangdong Province of China(Grant No. 2021KCXTD079),Intelligent Vocational Education Engineering Technology Research Center of Guangdong Province of China(Grant No. 2021A118), Guangdong Province National Key Research and Development Plan (Grant No.2018B010109003).
TatsuokaKSTylerDE. On the uniqueness of S-functionals and M-functionals under nonelliptical distributions. The Annals of Statistics.2000; 28(4): 1219-1243.
3.
ZuoYCuiHHeX. On the Stahel-Donoho estimator and depthweighted means of multivariate data. The Annals of Statistics.2004; 32(1): 167-188.
4.
MaronnaRAYohaiVJ. Robust and efficient estimation of multivariate scatter and location. Computational Statistics and Data Analysis.2017; 109: 64-75.
5.
BoudtKRousseeuwPJVanduffelSVerdonckT. The minimum regularized covariance determinant estimator. Statistics and Computing.2020; 30(1): 113-128.
6.
BulutH. Mahalanobis distance based on minimum regularized covariance determinant estimators for high dimensional data. Communications in Statistics-Theory and Methods.2020; 49(24): 5897-5907.
7.
SchreursJVranckxIHubertMSuykensJKRuosseeuwPJ. Outlier detection in non-elliptical data by kernel MRCD. Statistics and Computing.2021; 31(5): 1-18.
8.
ZahariahSMidiH. Minimum regularized covariance determinant and principal component analysis-based method for the identification of high leverage points in high dimensional sparse data. Journal of Applied Statistics.2022; 1-19.
9.
LuoLBaoSPengX. Robust monitoring of industrial processes using process data with outliers and missing values. Chemometrics and Intelligent Laboratory Systems, 2019; 192: 103827.
10.
WangBChenY. A robust principal component analysis based on MCD estimator and its empirical study. Application of Statistics and Management.2006; 25(4): 462-468.
11.
PeñaDPrietoFJ. Combining random and specific directions for outlier detection and robust estimation in high-dimensional multivariate data. Journal of Computational and Graphical Statistics.2007; 16(1): 228-254.
12.
YinFLinZKongQXuYLiD. TheodoridisSCuiS. FedLoc: Federated learning framework for data-driven cooperative localization and location data processing. IEEE Open Journal of Signal Processing, 2020; 1: 187-215.
13.
XuYYinFXuWLinJCuiS. Wireless traffic prediction with scalable gaussian process: framework, algorithms, and verification. IEEE Journal on Selected Areas in Communications.2019; 37(6): 1291-1306.