MFFA: Multi-level feature fusion and anomaly map compensation for anomaly detection

Abstract

Embedding similarity-based methods have obtained good results in unsupervised anomaly detection (AD). Such methods usually use feature vectors from a model pre-trained on ImageNet to calculate scores by measuring the similarity between test samples and normal samples, and then localize anomalous regions based on these scores. However, features extracted in this way may not be sufficiently adapted to the anomalous patterns that occur in industrial anomaly detection tasks. To alleviate this problem, we design a novel anomaly detection framework, MFFA, which includes a pseudo sample generation (PSG) block, a local-global feature fusion perception (LGFFP) block and an anomaly map compensation (AMC) block. The PSG block makes the pre-trained model more suitable for real-world anomaly detection tasks by fine-tuning it with the CutPaste augmentation. The LGFFP block aggregates shallow and deep features on different patches and feeds them to CaiT (Class-attention in image Transformers) to guide self-attention, so that local and global information interacts effectively between different patches. The AMC block lets the two anomaly maps generated by the nearest neighbor search and multivariate Gaussian fitting compensate for each other, improving the accuracy of anomaly detection and localization. In experiments, the MVTec AD dataset is used to verify the generalization ability of the proposed method in various real-world applications. It achieves 99.1% AUROC in detection and 98.4% AUROC in localization.

Keywords

Anomaly detection; Pseudo sample; Feature fusion; Transformer; Anomaly map compensation

1 Introduction

Anomaly detection (AD) is a critical task that usually refers to identifying and localizing anomalies with limited, or even no, prior knowledge of abnormality. In real-world application scenarios, anomalies are very rare events, and usually no prior information about them is available; only a large number of normal samples can be collected. Therefore, the task is commonly posed as unsupervised anomaly detection and addressed in a one-class learning setting.

Unsupervised anomaly detection approaches assume that the training dataset contains no anomalous samples and provides only a set of normal samples for reference, which is applicable when normal samples far outnumber anomalous ones. These methods are well suited to cases where it is difficult to collect anomalous data, such as medical and industrial applications [1, 2].

In the computer vision field, anomaly detection refers to assigning an anomaly score to the whole image. Anomaly localization is a more complex task: it assigns an anomaly score to each pixel or each patch of pixels to output an anomaly map. Thus, anomaly localization can produce more precise and interpretable results.

In anomaly detection and localization, methods fall into two main categories: reconstruction-based methods [3–5] and embedding similarity-based methods [6–9]. Reconstruction-based methods, however, may sometimes produce low reconstruction errors for anomalous images as well, which causes anomaly detection to fail. Recently, embedding similarity-based methods have achieved more promising performance. They use feature vectors from a model pre-trained on ImageNet to calculate scores by measuring the similarity between test samples and normal samples, and thereby detect anomalous regions. Compared with reconstruction-based methods, embedding similarity-based methods are not only simple and explainable, but also perform better.

However, embedding similarity-based methods still have two limitations for real-world applications. On the one hand, these methods usually require extra data, such as ImageNet, to pre-train a model. As a result, the pre-trained models may lack adaptability for detecting tiny anomalous spatial irregularities. On the other hand, previous studies used straightforward summation or concatenation to combine shallow and deep features of the network. Such a feature fusion approach cannot sufficiently exploit local and global information between different patches to perceive various anomalous patterns accurately. Recently, CFA [10] made a related attempt by adding a patch descriptor after the pre-trained network and obtaining target-oriented features through a hyper-sphere loss, thus alleviating the lack of adaptability. However, the additional component makes the network more difficult to train. Our goal is to alleviate this problem while keeping the original structure, without adding extra components.

Given the shortcomings of existing methods, we propose a novel anomaly detection framework named MFFA, which includes a pseudo sample generation (PSG) block, a local-global feature fusion perception (LGFFP) block, and an anomaly map compensation (AMC) block. It improves on previous studies in three aspects: network, feature and anomaly map. Firstly, to improve the adaptability of the model to anomalous patterns and learn more essential representations from industrial datasets, we propose a PSG block that constructs a classification task with CutPaste augmentation to fine-tune the original pre-trained model. This encourages the model to learn the spatial irregularities of anomalous samples generated by CutPaste, thus making the model more suitable for industrial anomaly detection tasks. Secondly, to better realize the interaction of local and global information between different patches, we propose an LGFFP block. The LGFFP block feeds the aggregated shallow and deep features to CaiT to guide the self-attention between patches, thus helping information interaction and fusion between different patches. Finally, our proposed AMC block uses pixel-level multiplication to let the two anomaly maps generated by nearest neighbor search and multivariate Gaussian fitting compensate for each other, thus enhancing the response of the anomaly map to anomalous regions and reducing its false positives in normal regions.

The contributions of our method can be summarized as follows.

We propose a PSG block, which uses anomalous samples created by CutPaste to make the model learn the spatial irregularities unique to the anomalous pattern, so as to reduce the overestimation of the normality of anomalous features.

We propose an LGFFP block, which guides the self-attention between patches so that the multi-scale features of different patches fully interact and fuse, increasing the perception of anomalous areas.

We propose an AMC block, which makes full use of the advantages of the NNS and Gaussian methods to generate a more robust anomaly segmentation map.

The remainder of this paper is organized as follows. We first review the related works in Section 2 and describe the general framework and key components of our method in Section 3. In Section 4, ablation studies are conducted, and experimental results are analyzed. Section 5 presents our conclusion.

2 Related works

2.1 Reconstruction-based methods

Reconstruction-based methods are widely used for anomaly detection and localization. They encode or reconstruct normal images using neural network architectures such as the Autoencoder (AE) [3, 11–13], Transformer [14], Variational Autoencoder (VAE) [15–18] or Generative Adversarial Network (GAN) [4, 20]. These methods rely on the hypothesis that generative models trained only on normal samples can successfully reconstruct normal regions but fail on anomalous regions [12, 21]. To localize anomalies, reconstruction-based methods take the pixel-wise reconstruction error or the structural similarity index [22] as the anomaly score [11]. Recently, InTra [14] proposed a Transformer model with masked patches, which tried to improve performance by using global attention. Although reconstruction-based methods are very intuitive and interpretable, their performance is limited by the fact that they may also produce low reconstruction errors for anomalous images.

2.2 Embedding similarity-based methods

Embedding similarity-based methods use feature vectors from a model pre-trained on ImageNet to calculate scores by measuring the similarity between test samples and normal samples, and thereby detect anomalous regions [6–9, 23]. Patch-SVDD [23] extended Support Vector Data Description (SVDD) to the patch level. The anomaly score was measured by a Nearest Neighbor Search (NNS) based on the distance between an embedding vector from a test image and each embedding vector from the training images. Patch-SVDD greatly improved the fine-grained detection of anomalies. PaDiM [6] proposed a patch-based approach that preserves the correspondence between image space and feature maps; it is simple and effective and does not require much inference time. SPADE [8] also achieved good performance by using the K-NN algorithm on a set of normal embedding vectors at test time. A memory bank [9] and the parameters of multivariate Gaussian distributions [7] have also been constructed or fitted from the features of normal images, so as to better record the general pattern of normal samples.

Our method is similar to the aforementioned approaches in that it also uses the feature vectors of a pre-trained model and combines K-NN with multivariate Gaussian distributions to detect anomalies. However, the MFFA framework combines and improves on previous studies in the aspects of network, feature and anomaly map to achieve more accurate anomaly detection results.

2.3 Self-supervised learning

Self-supervised learning has developed significantly over the past few years; it reduces the dependency on labeled data and enables unsupervised semantic extraction. Representation learning is the core problem of self-supervised learning in computer vision. Many methods have been proposed to learn image representations without labels, such as [24–26]. These methods help models learn useful representations by training on a pretext task with unlabeled data.

In recent research, data augmentation has also been widely used in anomaly detection to transform unsupervised tasks into supervised learning tasks [27–30] by adding pseudo-anomalies to the provided normal samples. For example, CutPaste [27] generated pseudo-anomalies by pasting small patches onto normal images and trained a model to detect these anomalous regions. DRÆM [30] used simulated anomalies to discriminatively learn a distance measure over the joint space of original and reconstructed images, producing accurate anomaly segmentation maps. MemSeg [31] proposed a memory-based segmentation network, which introduced simulated anomalies and memory samples to segment anomalies from the perspective of differences and commonalities. MSPBA [32] proposed a multi-scale patch-based representation learning method, which designed a novel loss function to extract critical and representative information from normal images by combining SVDD, rotation prediction and K-means ideas. However, these approaches are prone to bias towards pseudo-anomalies and may fail to detect a large variety of anomaly types.

Fig. 1

Inference process of the proposed method. The NNS anomaly map is generated by the nearest neighbor search between the test embeddings and the memory bank, and the Gaussian anomaly map is generated by the Mahalanobis distance between the embedding of each patch in test image and the learned distribution parameters. Finally, the final anomaly map is generated through the AMC block.

3 Method

3.1 Overall architecture

In this section, we present our novel framework, MFFA, as shown in Fig. 1.

We first design a PSG block that constructs a classification task using CutPaste augmentation to fine-tune the original pre-trained model, so that the model learns the spatial irregularities of anomalous samples generated by CutPaste. After that, we design an LGFFP block that feeds the aggregated shallow and deep features to CaiT to guide the self-attention between different patches and effectively perceive various anomalous patterns. The anomaly detection process is then divided into two parts: one detects anomalies in local patches, and the other captures patch correlations at different semantic levels to reduce false positives. Applying the Nearest Neighbor Search to the test embeddings responds well to anomalies in local patches, while fitting a multivariate Gaussian distribution at each patch position of the training images summarizes the information carried by the normal image features at that position and captures its correlations. Finally, we design an AMC block that lets the two anomaly maps compensate for each other by pixel-level multiplication, which enhances the response of the anomaly map to anomalous regions and reduces its false positives in normal regions.

Fig. 2

(left) The process of CutPaste augmentation. A small rectangular area of variable size and aspect ratio is cut from a normal training image. Then, the patch is rotated or its pixel values are jittered. Finally, the patch is pasted back into the image at a random position. In particular, CutPaste-Scar uses a scar-like (longer and thinner) rectangular box. (right) The process of constructing a classification task. In this process, the pre-trained CNN is fine-tuned.

Fig. 3

Structure of the LGFFP block. We first concatenate the features of the first two layers of the network, and then directly add the processed feature of the third layer to the concatenated feature to match the dimension of the transformer block of CaiT and aggregate local and global information. Subsequently, the aggregated feature is input into the transformer block of pre-trained CaiT to sufficiently interact information between different patches, and finally the output feature is obtained.

3.2 Pseudo sample generation block

Embedding similarity-based methods usually require additional data, such as ImageNet, to pre-train a model. This may leave the extracted features insufficiently adapted to the task, because a model pre-trained on ImageNet cannot effectively detect the tiny spatial irregularities of anomalies.

Hence, to improve the generalization ability of the model and learn more information from industrial datasets, following [27, 33], we formulate a self-supervised pretext task and propose a PSG block that constructs a classification task using CutPaste augmentation to fine-tune the original pre-trained model. The PSG block encourages the model to learn the spatial irregularities of pseudo samples generated by CutPaste, thus making the features extracted by the model more suitable for industrial anomaly detection tasks.

Figure 2 (left) shows the process of CutPaste augmentation. Following [27], we adopt the same data augmentation strategy: first, a small rectangular area of variable size and aspect ratio is cut from a normal training image. Then, the patch is rotated or its pixel values are jittered. Finally, the patch is pasted back into the image at a random location. CutPaste-Scar uses a scar-like (longer and thinner) rectangular box to fit smaller anomalies. The pseudo samples do not need to simulate real anomalies completely; a rough approximation is sufficient. Figure 2 (right) depicts the process of constructing the classification task used to fine-tune the model. Normal, CutPaste, and CutPaste-Scar images are input into the pre-trained CNN for classification, and the CNN is fine-tuned during this classification. The fine-tuned CNN is then used in the training and inference processes.
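The cut, jitter and paste steps described above can be sketched in NumPy as follows. This is only an illustrative sketch, not the implementation used in the paper: the function name `cutpaste`, the default `area_ratio` and `aspect_ratio` ranges, and the multiplicative brightness jitter are assumptions, and the rotation step is omitted for brevity.

```python
import numpy as np

def cutpaste(image, rng, area_ratio=(0.02, 0.15), aspect_ratio=(0.3, 3.3)):
    """Cut a random rectangle from `image` and paste it at a random location.

    `image` is a float array of shape (H, W, C) with values in [0, 1].
    The default ranges are illustrative assumptions, not values from the text.
    """
    h, w = image.shape[:2]
    area = rng.uniform(*area_ratio) * h * w
    ratio = rng.uniform(*aspect_ratio)  # patch height / patch width
    ph = int(np.clip(round(np.sqrt(area * ratio)), 1, h))
    pw = int(np.clip(round(np.sqrt(area / ratio)), 1, w))

    # Source location of the patch.
    sy = rng.integers(0, h - ph + 1)
    sx = rng.integers(0, w - pw + 1)
    patch = image[sy:sy + ph, sx:sx + pw].copy()

    # Jitter pixel values (assumed here: a random global brightness factor).
    patch = np.clip(patch * rng.uniform(0.9, 1.1), 0.0, 1.0)

    # Destination location, chosen independently of the source.
    dy = rng.integers(0, h - ph + 1)
    dx = rng.integers(0, w - pw + 1)
    out = image.copy()
    out[dy:dy + ph, dx:dx + pw] = patch
    return out
```

A scar-like variant could be approximated by passing a smaller `area_ratio` together with a more extreme `aspect_ratio`; the exact settings are not specified here, so any such values would be a further assumption.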

3.3 Local-global feature fusion perception block

Embedding similarity-based methods usually use straightforward summation or concatenation to combine shallow and deep features of the network. However, we believe that this straightforward operation cannot sufficiently exploit local and global information between different patches. To alleviate this problem and perceive various anomalous patterns, we propose an LGFFP block. The LGFFP block feeds the aggregated shallow and deep features to CaiT [34] to guide the self-attention between patches, thus allowing full interaction of local and global information between different patches, as shown in Fig. 3. CaiT is a variant of ViT [35] whose performance does not saturate early as depth increases.

To match the dimension of CaiT, we first concatenate the features of the first two layers of the network, denoted by $f_{E}^{1}$ and $f_{E}^{2}$ respectively, to capture local information f_con. Then, after up-sampling and random feature selection, the third layer feature $f_{E}^{3}$ is directly added to the f_con to aggregate local and global information to obtain f_agg. Subsequently, the f_agg is input into the transformer block of pre-trained CaiT to sufficiently interact local and global information between different patches, and finally the output feature f_out is obtained.
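The aggregation step can be illustrated at the level of tensor shapes. The sketch below is a hedged reading of the description, not the authors' code: it assumes channel-last feature maps, nearest-neighbor up-sampling to the resolution of the first layer, and a fixed random channel subset for the third layer; the function names are hypothetical. With WideResNet50, whose first three stages output 256, 512 and 1024 channels, the concatenation has 256 + 512 = 768 channels, which is the dimension the third-layer feature is reduced to.

```python
import numpy as np

def upsample_nearest(feature, factor):
    """Nearest-neighbor up-sampling of an (H, W, C) map by an integer factor."""
    return feature.repeat(factor, axis=0).repeat(factor, axis=1)

def aggregate_features(f1, f2, f3, seed=0):
    """Build f_agg from three feature maps of decreasing resolution.

    f1: (H, W, C1), f2: (H/a, W/a, C2), f3: (H/b, W/b, C3) with C3 >= C1 + C2.
    The up-sampling method and the seed are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    # Bring the second layer to the resolution of the first, then concatenate.
    f2_up = upsample_nearest(f2, f1.shape[0] // f2.shape[0])
    f_con = np.concatenate([f1, f2_up], axis=-1)      # local information

    # Up-sample the third layer and keep a random subset of its channels
    # so that its channel count matches f_con.
    f3_up = upsample_nearest(f3, f1.shape[0] // f3.shape[0])
    keep = np.sort(rng.choice(f3.shape[-1], size=f_con.shape[-1], replace=False))
    f_agg = f_con + f3_up[..., keep]                  # local + global
    return f_agg
```

The resulting `f_agg` would then be flattened into a sequence of patch tokens and passed through the CaiT transformer blocks to obtain f_out; that step depends on the specific CaiT implementation and is not shown.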

Fig. 4

The training process of the proposed method. (top) Coreset sub-sampling is used to construct a memory bank for the inference process. (bottom) The multivariate Gaussian distribution is fitted for each patch of the training images to construct the parameter matrix for the inference process.

3.4 Training

The training process of the proposed method is divided into two parts: memory bank based nearest neighbor search and patch distribution modeling.

Figure 4 shows the nearest neighbor search method we used. Following [6, 8], we also use the pre-trained model but fine-tune it by the PSG block. To avoid features that are too generic or too biased towards the task of natural image classification, we select the middle-level and high-level features of the network and concatenate them.

At the same time, average pooling is applied to the concatenated feature maps to enlarge the receptive field and improve robustness to small spatial deviations.

Then we flatten all concatenated feature maps and adopt the coreset sub-sampling mechanism of [9, 36–38] to select representative features to form a memory bank, thus reducing memory consumption and inference time while retaining performance.

Conceptually, coreset selection aims to find a subset S ⊂ A such that solutions of a problem over A can be closely approximated, and computed much more quickly, by solutions over S. Here we use minimax facility location coreset selection to ensure that the feature coverage of the memory bank is roughly similar to that of the original feature space.

To further reduce coreset selection time, we first reduce the feature dimension of the original feature space by random linear projections. The coreset is then built greedily. First, a feature is randomly selected from the original space as the initial coreset element. Then, for each feature in the original space, the distance to its nearest neighbor in the coreset is computed, and the feature whose nearest-neighbor distance is largest is added to the coreset. This step is repeated until the coreset reaches the target size. The final goal is to iteratively approximate:

$M_{B} = \arg \min_{M_{B} \subset M} \max_{m \in M} \min_{n \in M_{B}} {∥ m - n ∥}_{2}$ (1)

where M_B denotes the memory bank, and M denotes the original feature space.

Equation 1 is intuitively interpreted as minimizing the maximum distance between the elements of the original space and the nearest neighbor in the memory bank. Following the method in [9], we solve for M_B. We believe that using the Nearest Neighbor Search on the testing embeddings can better detect the responsive anomalies in local patches.
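The greedy procedure that approximates Eq. 1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact implementation of [9]: the function name `greedy_coreset`, the Gaussian random projection, the `proj_dim` default and the seed are all hypothetical, and the sketch returns indices into the original feature set.

```python
import numpy as np

def greedy_coreset(features, ratio=0.01, proj_dim=128, seed=0):
    """Greedy farthest-point selection of a coreset from (n, d) features.

    Returns the indices of the selected features. `ratio` is the fraction
    of features kept (the compression rate).
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    k = max(1, round(n * ratio))

    # Random linear projection to make distance computations cheaper
    # (assumed here to be a Gaussian projection).
    if proj_dim < d:
        proj = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
        z = features @ proj
    else:
        z = features

    # Start from a random element, then repeatedly add the feature that is
    # farthest from its nearest neighbor in the current coreset.
    selected = [int(rng.integers(n))]
    min_dist = np.linalg.norm(z - z[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(z - z[nxt], axis=1))
    return np.array(selected)
```

Keeping a running `min_dist` array means each iteration only needs the distances to the newly added element, so the cost per iteration is linear in the number of features.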

Meanwhile, we also adopt the patch distribution modeling of PaDiM [6], as illustrated in Fig. 4. In our case, however, the patch embedding vectors are generated by the CNN fine-tuned with the PSG block.

During the training process, each patch of the training images is associated with its spatially corresponding embedding vectors in the fine-tuned CNN feature maps. Embedding vectors from different layers are then fused to get embedding vectors at different semantic levels. As feature maps have a lower resolution than the input image, many pixels have the same embeddings. Hence, the input image can be divided into a grid of (i, j) ∈ [1, W] × [1, H] positions where W × H is the resolution of the largest feature map. Finally, each patch position (i, j) in this grid is associated with an embedding vector x_ij computed as described above.

Since the generated patch embedding vector x_ij may contain redundant information, random feature selection is adopted to reduce the complexity of training and testing.
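Random feature selection amounts to drawing one fixed subset of channel indices and applying it to every embedding, at training and at test time. A minimal sketch follows; the helper name `make_channel_selector`, the seed, and the total dimension used in the example are assumptions for illustration.

```python
import numpy as np

def make_channel_selector(total_dim, keep_dim, seed=0):
    """Draw a fixed, sorted subset of channel indices without replacement."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(total_dim, size=keep_dim, replace=False))

# Example: reduce hypothetical 1024-dimensional embeddings on a 4 x 4 grid
# to 570 dimensions, the reduced size used in the experiments.
selector = make_channel_selector(total_dim=1024, keep_dim=570)
embeddings = np.zeros((4, 4, 1024))
reduced = embeddings[..., selector]   # shape (4, 4, 570)
```

The same `selector` must be reused for the test embeddings so that each retained dimension refers to the same channel in both phases.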

To learn the normal image characteristics at position (i, j), we first compute the set of patch embedding vectors at (i, j), $X_{ij} = \{x_{ij}^{k}, k \in [1, N]\}$, from the N normal images, as shown in Fig. 4. To summarize the information carried by this set, we assume that X_ij is generated by a multivariate Gaussian distribution $N (μ_{ij}, Σ_{ij})$, where μ_ij is the sample mean of X_ij, and the sample covariance Σ_ij is estimated as follows:

$Σ_{ij} = \frac{1}{N - 1} \sum_{k = 1}^{N} (x_{ij}^{k} - μ_{ij}) {(x_{ij}^{k} - μ_{ij})}^{T} + λ I$ (2)

where the regularisation term λI makes the sample covariance matrix Σ_ij full rank and invertible. Finally, each patch position is associated with a multivariate Gaussian distribution, stored in the matrix of Gaussian parameters shown in Fig. 4.

We believe that fitting a multivariate Gaussian distribution for each patch position of the training images can effectively reduce the false positives caused by the nearest neighbor search method, because it considers the correlation of multiple patches at the same location.
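Equation 2 can be evaluated for all positions at once. The sketch below assumes the embeddings are stacked in an array of shape (N, H, W, D); the function name and array layout are illustrative choices, and λ defaults to 0.01.

```python
import numpy as np

def fit_patch_gaussians(embeddings, lam=0.01):
    """Fit one Gaussian per patch position.

    embeddings: array of shape (N, H, W, D) from N normal images.
    Returns the mean of shape (H, W, D) and the regularized covariance
    of shape (H, W, D, D), following Eq. 2.
    """
    n, h, w, d = embeddings.shape
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Sum of outer products over the N images, for every position (h, w).
    cov = np.einsum("nhwi,nhwj->hwij", centered, centered) / (n - 1)
    cov += lam * np.eye(d)   # the λI term keeps every matrix invertible
    return mean, cov
```

Without the λI term, the covariance at a position would be singular whenever N is not larger than D, which is the typical case when only a few hundred normal images are available.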

3.5 Inference

In the inference process, our method generates two anomaly maps, as shown in Fig. 1. The first is the NNS anomaly map M_NNS generated by the nearest neighbor search, and the second is the Gaussian anomaly map M_gaussian generated by the Mahalanobis distance [39].

The NNS anomaly map M_NNS is generated by the nearest neighbor search between the test embeddings and the memory bank to capture anomalies in local patches. The embedding sets of the test image and the memory bank are denoted by $m_{test}^{j} (j = 1, \dots, n)$ and $m_{bank}^{i} (i = 1, \dots, m)$, respectively. The anomaly score is calculated as follows:

$AL (j) = \min_{i \in \{1, \dots, m\}} {∥ m_{test}^{j} - m_{bank}^{i} ∥}_{2}$ (3)

Finally, we resize AL (j) (j = 1, ⋯ , n) to the same size as the original image to obtain the NNS anomaly map M_NNS.
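Equation 3 and the subsequent resizing can be sketched as below. The brute-force distance computation and the nearest-neighbor up-sampling are simplifications chosen for clarity; the interpolation used for resizing is not specified in the text, so `upsample_nearest` is an assumption.

```python
import numpy as np

def nns_anomaly_scores(test_emb, bank):
    """AL(j): L2 distance from each test embedding to its nearest bank entry.

    test_emb: (n, D) patch embeddings of one test image.
    bank:     (m, D) memory bank.
    """
    diff = test_emb[:, None, :] - bank[None, :, :]      # (n, m, D)
    return np.linalg.norm(diff, axis=2).min(axis=1)     # (n,)

def upsample_nearest(score_grid, factor):
    """Resize an (h, w) score grid by an integer factor (assumed method)."""
    return np.repeat(np.repeat(score_grid, factor, axis=0), factor, axis=1)

# Toy example: four test patches on a 2 x 2 grid, two bank entries.
bank = np.array([[0.0, 0.0], [10.0, 0.0]])
test_emb = np.array([[3.0, 4.0], [10.0, 0.0], [9.0, 0.0], [0.0, 1.0]])
scores = nns_anomaly_scores(test_emb, bank)             # [5., 0., 1., 1.]
m_nns = upsample_nearest(scores.reshape(2, 2), 4)       # shape (8, 8)
```

For a realistic memory bank, the (n, m, D) intermediate array would be too large, and a chunked computation or an approximate nearest neighbor index would be used instead.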

Another anomaly map M_gaussian is obtained by using the Mahalanobis distance Maha (x_ij) to give an anomaly score to the patch in position (i, j) of a test image. Maha (x_ij) can be interpreted as the distance between the test patch embedding x_ij and learned distribution $N (μ_{ij}, Σ_{ij})$ , where Maha (x_ij) is computed as follows:

$Maha (x_{ij}) = \sqrt{{(x_{ij} - μ_{ij})}^{T} Σ_{ij}^{- 1} (x_{ij} - μ_{ij})}$ (4)

Thus, the matrix of Mahalanobis distances

$M = {(Maha (x_{ij}))}_{1 \le i \le W, 1 \le j \le H}$ (5)

can be calculated and resized to the same size as the original image to obtain the Gaussian anomaly map M_gaussian.
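Equations 4 and 5 can be computed for all positions in one batched call. The sketch assumes the test embeddings and the fitted parameters are stored with the layouts given in the docstring; it solves a linear system at each position instead of forming the inverse covariance explicitly, which is a numerical choice made here rather than something prescribed by the text.

```python
import numpy as np

def mahalanobis_map(x, mean, cov):
    """Mahalanobis distance at every patch position.

    x, mean: (H, W, D) test embeddings and fitted means.
    cov:     (H, W, D, D) fitted covariance matrices.
    Returns an (H, W) array of distances.
    """
    diff = x - mean
    # Solve cov @ s = diff at each position; s equals cov^{-1} diff.
    sol = np.linalg.solve(cov, diff[..., None])[..., 0]
    return np.sqrt(np.einsum("hwd,hwd->hw", diff, sol))
```

When the covariance is the identity matrix, the result reduces to the Euclidean distance between the test embedding and the mean, which gives a quick sanity check.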

3.5.1 Anomaly map compensation block

Finally, we propose an AMC block that lets the above two anomaly maps compensate for each other. The final anomaly map M_final is constructed by pixel-wise multiplication of the two anomaly maps, which enhances the response of the anomaly map to anomalous regions and reduces its false positives in normal regions.

$M_{final} = M_{NNS} ⨂ M_{gaussian}$ (6)

where ⨂ denotes pixel-wise multiplication. The image-level anomaly score is defined as the maximum value of the final anomaly map M_final.
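Equation 6 and the image-level score translate directly into code. The toy values below are made up to show the compensation effect: a pixel that only one of the two maps flags receives a small product, while a pixel flagged by both keeps a large response.

```python
import numpy as np

def anomaly_map_compensation(m_nns, m_gaussian):
    """Pixel-wise product of the two maps and the image-level score (max)."""
    m_final = m_nns * m_gaussian
    return m_final, float(m_final.max())

# Toy 2 x 2 maps with hypothetical values in [0, 1].
m_nns = np.array([[0.9, 0.8],
                  [0.1, 0.1]])
m_gaussian = np.array([[0.9, 0.1],
                       [0.1, 0.2]])
m_final, image_score = anomaly_map_compensation(m_nns, m_gaussian)
# m_final is [[0.81, 0.08], [0.01, 0.02]]: the top-right pixel, flagged only
# by the NNS map, is suppressed, while the top-left pixel stays high.
```

Both maps must already have the same resolution, which is the case here because each is resized to the original image size before the product is taken.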

4 Experiment

4.1 Dataset description

Our proposed method is evaluated on the MVTec AD dataset [11], which is commonly used for anomaly detection and localization in industrial settings. It contains 15 real-world categories, with 5 classes of textures and 10 classes of objects. The training set consists of 3629 images without anomalies, while the test set contains both anomalous and normal images, 1725 in total. Each class has multiple types of anomalies for testing. In addition, the dataset provides pixel-level annotations for the anomalous test images.

Table 1
Ablation study on the complementarity of the AMC block for the Nearest Neighbor Search and the multivariate Gaussian distribution

Component	Detection	Localization
	result (%)	result (%)
NNS	98.5	98.0
Gaussian	97.8	98.3
NNS+Gaussian (AMC block)	98.5	98.3

Table 2

Ablation study on baseline(NNS+Gaussian), the PSG block and the LGFFP block

Component	Detection	Localization
	result (%)	result (%)
baseline	98.5	98.3
baseline + LGFFP	98.7	98.4
baseline + LGFFP + PSG	99.1	98.4

Table 3

Anomaly localization results on the BTAD dataset. We compare our method with VT-ADL, FastFlow, and convolutional autoencoders trained with MSE loss and MSE+SSIM loss

Category	AE MSE [40]	AE MSE+SSIM [40]	VT-ADL [40]	FastFlow [41]	Ours
0	0.49	0.53	0.99	0.95	0.97
1	0.92	0.96	0.94	0.96	0.96
2	0.95	0.89	0.77	0.99	0.99
Average	0.78	0.79	0.90	0.97	0.98

4.2 Experimental settings

All images in MVTec AD are resized to a fixed resolution (e.g. 256 × 256). Following the convention in prior works, anomaly detection and localization are performed on one category at a time. In this experiment, we adopt WideResNet50 as the backbone of our method. The training process of the proposed method is divided into two parts. Firstly, a memory bank for the inference process is constructed by coreset sub-sampling. Secondly, a multivariate Gaussian distribution is fitted for each patch position of the training images to construct the parameter matrix for the inference process.

In the memory bank construction part, we use the layer 2 and layer 3 features of the backbone, and the coreset compression rate is 0.01.

Table 4
Comparison of detection and localization results with state-of-the-art methods on MVTec AD (AUROC, %). For each category with images of 256 × 256 resolution, the highest AUROC value is in bold

Image Size	256×256
	Category	Patch SVDD [23]		SPADE [8]		PaDiM [6]		CutPaste [27]		Reverse Distillation [42]		Ours
		detec.	local.	detec.	local.	detec.	local.	detec.	local.	detec.	local.	detec.	local.
	Carpet	92.9	92.6	98.6	97.5	99.9	99.0	93.9	98.3	98.9	98.9	99.2	99.1
	Grid	94.6	96.2	99.0	93.7	95.7	96.5	100.0	97.5	100.0	99.3	99.4	96.4
Textures	Leather	90.9	97.4	99.5	97.6	100.0	98.9	100.0	99.5	100.0	99.4	100.0	99.1
	Tile	97.8	91.4	89.8	87.4	97.4	93.9	94.6	90.5	99.3	95.6	100.0	98.3
	Wood	96.5	90.8	95.8	88.5	98.8	94.1	99.1	95.5	99.2	95.3	99.8	95.8
	Average	94.5	93.7	96.5	92.9	98.4	96.5	97.5	96.3	99.5	97.7	99.7	97.7
	Bottle	98.6	98.1	98.1	98.4	99.8	98.2	98.2	97.6	100.0	98.7	100.0	98.8
	Cable	90.3	96.8	93.2	97.2	92.2	96.8	81.2	90.0	95.0	97.4	99.7	98.8
	Capsule	76.7	95.8	98.6	99.0	91.5	98.6	98.2	97.4	96.3	98.7	98.0	98.8
	Hazelnut	92.0	97.5	98.9	99.1	93.3	97.9	98.3	97.3	99.9	98.9	100.0	98.6
Objects	Metal Nut	94.0	98.0	96.9	98.1	99.2	97.1	99.9	93.1	100.0	97.3	100.0	99.1
	Pill	86.1	95.1	96.5	96.5	94.4	96.1	94.9	95.7	96.6	98.2	98.2	98.6
	Screw	81.3	95.7	99.5	98.9	84.4	98.3	88.7	96.7	97.0	99.6	96.5	98.2
	Toothbrush	100.0	98.1	98.9	97.9	97.2	98.7	99.4	98.1	99.5	99.1	96.4	98.5
	Transistor	91.5	97.0	81.0	94.1	97.8	97.5	96.1	93.0	96.7	92.5	99.9	98.8
	Zipper	97.9	95.1	98.8	96.5	90.9	98.4	99.9	99.3	98.5	98.2	100.0	99.0
	Average	90.8	96.7	96.0	97.6	94.1	97.8	95.5	95.8	98.0	97.9	98.9	98.7
Total Average	92.1	95.7	96.2	96.0	95.5	97.3	96.1	96.0	98.5	97.8	99.1	98.4

In the multivariate Gaussian distribution fitting part, we use the features of the first three layers of the backbone, where the dimension of the third-layer feature after random feature selection is 768. For CaiT, we use the first 11 transformer blocks. In the patch distribution modeling, to reduce the redundancy of the patch embedding vectors, the dimension retained by random feature selection is 570. The hyperparameter λ in the covariance estimation is set to 0.01.

For model training, we use the same hyperparameters for every class. We use the SGD optimizer with a momentum of 0.9, a weight decay of 0.00003, and a batch size of 32. We fine-tune the model for 64 epochs with a learning rate of 0.0003.

Fig. 5

Anomaly detection examples from MVTec AD. Columns from left to right represent anomalous images, ground truth, predicted heat map, predicted mask, and segmentation results. Even tiny or inconspicuous anomalies can have strong responses.

Fig. 6

The statistics of image-level anomaly scores with and without CaiT for all categories. The x-axis indicates the anomaly score from 0 to 1 and the y-axis is the count.

Fig. 7

Visualization of anomaly segmentation heat maps with and without CaiT.

Fig. 8

Visualization on large or structural anomalies. From left to right are: bottle, metal nut, grid, hazelnut, transistor and zipper.

Fig. 9

Visualization on tiny or inconspicuous anomalies. From left to right are: carpet, tile, wood, cable, pill, toothbrush and screw.

Fig. 10

Failure cases of our method in the toothbrush category. Our method incorrectly localizes noisy regions and does not detect actual anomaly regions.

In the inference process, a Gaussian filter with σ = 4 is used to smooth the anomaly score map.
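The smoothing step can be sketched with a separable Gaussian filter. Only σ = 4 comes from the settings above; the truncation radius of 4σ, the reflective border handling and the function names are assumptions of this sketch, and a library routine could be used instead.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalized 1-D Gaussian kernel truncated at `truncate` * sigma."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()

def smooth_anomaly_map(score_map, sigma=4.0):
    """Smooth a 2-D score map by filtering rows and then columns."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    # Reflective padding (an assumption) keeps the output the same size.
    padded = np.pad(score_map, radius, mode="reflect")
    rows = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 0, rows)
```

Because the kernel sums to one, a constant map is left unchanged, while isolated single-pixel responses are spread out and reduced in height.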

4.3 Experimental results

4.3.1 Ablation study

We conduct several experiments to analyze our method. First, we investigate the complementarity of the AMC block for the Nearest Neighbor Search and the multivariate Gaussian distribution and report the numerical results in Table 1.

We take the Nearest Neighbor Search (NNS) and the multivariate Gaussian distribution (Gaussian) as two baselines and use the network features described in the experimental settings. NNS responds more strongly to anomalies in local patches, but it does not consider the correlations between normal patterns, which may lead to false positives. Gaussian, in contrast, learns the correlation between multiple patches at the same location, so it is more accurate in anomaly localization, but its response to local anomalies is weaker than that of NNS. Gaussian is also less robust on unaligned categories, a disadvantage that NNS can compensate for. The AMC block combines the two strategies and matches the better of the two baselines on both metrics in Table 1 (98.5% in detection and 98.3% in localization).

In addition, we explore the effectiveness of the PSG block and the LGFFP block for anomaly detection and show the results in Table 2. We take the above combination (NNS+Gaussian) as the baseline and add the two components one at a time. The results show that combining these components improves the accuracy of anomaly detection and localization. After the LGFFP block is added, both detection and localization improve, which indicates that guiding the self-attention helps to increase the perception of anomalous areas. It is worth noting that the PSG block classifies the entire image directly and performs no fine-grained operation. Therefore, it improves only the detection performance, while the localization performance remains stable.

4.3.2 Comparison with the state-of-the-art methods

For anomaly detection and localization, we take the area under the receiver operating characteristic curve (AUROC) as the evaluation metric, and the results are shown in Table 3. The comparison baselines are Patch-SVDD [23], SPADE [8], PaDiM [6], CutPaste [27] and Reverse Distillation [42].
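For reference, the AUROC metric can be computed directly from scores and labels. The sketch below uses the pairwise (Mann-Whitney) formulation, in which AUROC is the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one, with ties counted as one half. The function name `auroc` is illustrative, and the quadratic loop is only suitable for small inputs.

```python
def auroc(scores, labels):
    """AUROC from anomaly scores and binary labels (1 = anomalous, 0 = normal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomalous sample ranked above normal sample
            elif p == n:
                wins += 0.5   # ties get half credit
    return wins / (len(pos) * len(neg))

# Image-level use: one score per image.
image_auroc = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])

# Pixel-level use: flatten the anomaly map and the ground-truth mask first.
anomaly_map = [[0.1, 0.2], [0.9, 0.3]]
mask = [[0, 0], [1, 0]]
pixel_auroc = auroc([v for row in anomaly_map for v in row],
                    [m for row in mask for m in row])
```

The same function serves both detection and localization; only the unit being scored (image or pixel) changes.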

The average results show that our approach exceeds state-of-the-art methods by 0.6% in anomaly detection and 0.6% in anomaly localization, reaching 99.1% AUROCs and 98.4% AUROCs respectively. For textures and objects, our approach surpasses state-of-the-art methods in both anomaly detection and localization, 99.7%/97.7% and 98.9%/98.7% AUROCs, respectively. Figure 5 shows examples of anomaly maps produced by our method, which is used to localize anomalies in images from the MVTec AD dataset.

4.3.3 Discussions

We analyze the effect of CaiT in the LGFFP block on the fitting of multivariate Gaussian distributions in terms of both detection and localization.

For detection, we visualize the statistics of image-level anomaly scores with and without CaiT, as shown in Fig. 6. A non-overlapping distribution of normal (blue) and anomalous (red) scores indicates strong AD ability. For most categories, the distribution of anomaly scores does not change much. However, for grid, leather, metal nut, toothbrush and wood, using CaiT widens the gap between normal and anomalous scores, bringing the scores closer to the correct category.

For localization, we visualize anomaly segmentation heat maps with and without CaiT, as shown in Fig. 7. CaiT attends to the information shared between different patches, thereby increasing the perception of anomalous areas.

To investigate the robustness of our method to various types of anomalies, we also classify the anomaly types into two categories: large or structural anomalies and tiny or inconspicuous anomalies, and qualitatively evaluate the performance by the visualizations in Figs. 8 and 9. Compared to CutPaste using the same augmentation strategy, our method produces a more significant response to the anomaly region.

4.3.4 Inference speed and simulation platform

The proposed MFFA is implemented in PyTorch and runs on an NVIDIA Tesla P100-SXM2-16GB GPU with an Intel(R) Xeon(R) Gold 6248R CPU. We measure the inference time of PaDiM [6] and SPADE [8], which is 0.339 s and 0.359 s per image, respectively. MFFA takes 0.356 s per image, which is comparable to both.
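A per-image timing of this kind can be obtained with a loop like the one below. This is a generic sketch, not the authors' benchmarking script: `infer` stands for any callable that processes one image, and the warm-up count is an assumption. When timing GPU code in PyTorch, `torch.cuda.synchronize()` should be called before reading the clock, because CUDA kernels run asynchronously.

```python
import time

def mean_inference_time(infer, images, warmup=1):
    """Return the mean wall-clock time in seconds that `infer` takes per image."""
    for image in images[:warmup]:
        infer(image)                      # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for image in images:
        infer(image)
    elapsed = time.perf_counter() - start
    return elapsed / len(images)

# Usage with a dummy model that only sums the pixel values.
dummy_images = [[0.0] * 1024 for _ in range(8)]
seconds_per_image = mean_inference_time(sum, dummy_images)
```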

4.4 Limitations

Although our method achieves a better average AUROC, it is less effective in specific categories. On toothbrush, its detection performance is lower than that of state-of-the-art methods. Images in this category contain many dots that act as noise. Our method focuses on these noisy dots, which weakens the detection of actual anomalies. We suspect that this is caused by the randomness of the CutPaste augmentation strategy, which makes the model learn the irregularities of the noise. As shown in Fig. 10, our method localizes the noisy regions rather than the true anomaly region.
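To make the source of this randomness concrete, the core CutPaste operation can be sketched as below: a rectangular patch is cut from a random location of a normal image and pasted at another random location. This is a simplified illustration on a 2-D list of pixel values with a fixed patch size. The original CutPaste [27] additionally varies the patch size and aspect ratio and may jitter or rotate the patch, which is omitted here; the function name and parameters are illustrative.

```python
import random

def cutpaste(image, patch_h, patch_w, rng=random):
    """Return a copy of `image` (list of rows) with one patch cut and pasted elsewhere."""
    h, w = len(image), len(image[0])
    if patch_h > h or patch_w > w:
        raise ValueError("patch must fit inside the image")
    # Source and destination positions are sampled independently and uniformly.
    src_r = rng.randrange(h - patch_h + 1)
    src_c = rng.randrange(w - patch_w + 1)
    dst_r = rng.randrange(h - patch_h + 1)
    dst_c = rng.randrange(w - patch_w + 1)
    patch = [row[src_c:src_c + patch_w] for row in image[src_r:src_r + patch_h]]
    out = [list(row) for row in image]    # copy so the input image is left untouched
    for i in range(patch_h):
        out[dst_r + i][dst_c:dst_c + patch_w] = patch[i]
    return out

# Usage: create a pseudo-anomalous sample from a 6x6 "image" of distinct values.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
pseudo_sample = cutpaste(image, 2, 2, random.Random(0))
```

Because the pasted patch can land anywhere, a patch containing a noisy dot may be moved onto a clean region, which is consistent with the hypothesis that the model learns to treat such dots as irregularities.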

4.5 Detection and localization on other datasets

To demonstrate the generality of our method, we also evaluate it on another dataset, BTAD [40], which has 3 categories of industrial products with 2540 images. The training set contains only normal images, while the testing set is a mixture of normal and anomalous images. Using pixel-level AUROC as the measure, we compare our results with those of FastFlow [41] and three methods reported in VT-ADL [40]: an autoencoder with mean square error loss, an autoencoder with SSIM loss, and VT-ADL. The comparison results are shown in Table 4. Our method achieves 98% pixel-level AUROC, which is higher than the best performance of FastFlow.

5 Conclusion and future work

We design a novel anomaly detection framework, MFFA, which improves on previous studies in three respects: network, feature and anomaly map. First, the PSG block encourages the model to sufficiently learn the spatial irregularities of generated pseudo samples, thus improving the adaptability of the model to anomalous patterns in industrial anomaly detection tasks. Second, the LGFFP block pays more attention to the local and global information between different patches, thus increasing the perception of anomalous areas. Last, the AMC block combines the advantages of memory-bank-based nearest neighbor search and patch distribution modeling to achieve more accurate anomaly detection results. Our experimental results show superior anomaly detection and localization performance on real-world datasets. On MVTec AD, the method achieves 99.1% AUROC in detection and 98.4% AUROC in localization.

Despite the considerable performance achieved by our method, there is still room for improvement; for example, the classification in the PSG block is not generalized to a fine-grained level. In future work, we will extend the PSG block to the patch level to improve the performance of anomaly localization.

CRediT authorship contribution statement

Ruifan Zhang: Conceptualization, Methodology, Writing - original draft, Writing - review & editing, Visualization, Project administration, Data curation. Hao Wang: Conceptualization, Methodology, Writing - original draft, Writing - review & editing, Visualization, Project administration, Data curation. Gongping Yang: Supervision, Resources.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant U1903127, and in part by the TaiShan Industrial Experts Programme under Grant tscy20200303.

References

1. Napoletano, Piccoli and Schettini, Anomaly detection in nanofibrous materials by cnn-based self-similarity, Sensors 18(1) (2018), 209.

2. Heger, Desai and El Abdine M.Z., Anomaly detection in formed sheet metals using convolutional autoencoders, Procedia CIRP 93 (2020), 1281–1285.

3. Fei, Huang, Jinkun, Li, Zhang and Lu, Attribute restoration framework for anomaly detection, IEEE Transactions on Multimedia (2020).

4. Sabokrou, Khalooei, Fathy and Adeli, Adversarially learned one-class classifier for novelty detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3379–3388.

5. Perera, Nallapati and Xiang, Ocgan: One-class novelty detection using gans with constrained latent representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906.

6. Defard, Setkov, Loesch and Audigier, Padim: a patch distribution modeling framework for anomaly detection and localization, in: International Conference on Pattern Recognition, Springer, 2021, pp. 475–489.

7. Rippel, Mertens and Merhof, Modeling the distribution of normal data in pre-trained deep features for anomaly detection, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 6726–6733.

8. Cohen and Hoshen, Sub-image anomaly detection with deep pyramid correspondences, arXiv preprint arXiv:2005.02357 (2020).

9. Roth, Pemula, Zepeda, Scholkopf, Brox and Gehler, Towards total recall in industrial anomaly detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14318–14328.

10. Lee, Lee and Song B.C., Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization, arXiv preprint arXiv:2206.04325 (2022).

11. Bergmann, Fauser, Sattlegger and Steger, Mvtec ad - a comprehensive real-world dataset for unsupervised anomaly detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9592–9600.

12. Bergmann, Lowe, Fauser, Sattlegger and Steger, Improving unsupervised defect segmentation by applying structural similarity to autoencoders, arXiv preprint arXiv:1807.02011 (2018).

13. Gong, Liu, Le, Saha, Mansour M.R., Venkatesh and van den Hengel A., Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1705–1714.

14. Pirnay and Chai, Inpainting transformer for anomaly detection, in: International Conference on Image Analysis and Processing, Springer, 2022, pp. 394–406.

15. Venkataramanan, Peng K.-C., Singh R.V. and Mahalanobis, Attention guided anomaly localization in images, in: European Conference on Computer Vision, Springer, 2020, pp. 485–503.

16. Kingma D.P. and Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013).

17. Sato, Hama, Matsubara and Uehara, Predictable uncertainty-aware unsupervised deep anomaly segmentation, in: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, pp. 1–7.

18. Liu, Li, Zheng, Karanam, Wu, Bhanu, Radke R.J. and Camps, Towards visually explaining variational autoencoders, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8642–8651.

19. Pidhorskyi, Almohsen and Doretto, Generative probabilistic novelty detection with adversarial autoencoders, Advances in Neural Information Processing Systems 31 (2018).

20. Akcay, Atapour-Abarghouei and Breckon T.P., Ganomaly: Semi-supervised anomaly detection via adversarial training, in: Asian conference on computer vision, Springer, 2018, pp. 622–637.

21. Schlegl, Seebock, Waldstein S.M., Langs and Schmidt-Erfurth, f-anogan: Fast unsupervised anomaly detection with generative adversarial networks, Medical Image Analysis 54 (2019), 30–44.

22. Wang, Simoncelli E.P. and Bovik A.C., Multiscale structural similarity for image quality assessment, in: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, IEEE, 2003, pp. 1398–1402.

23. Yi and Yoon, Patch svdd: Patch-level svdd for anomaly detection and segmentation, in: Proceedings of the Asian Conference on Computer Vision, 2020.

24. Chen and He, Exploring simple siamese representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15750–15758.

25. Hjelm R.D., Fedorov, Lavoie-Marchildon, Grewal, Bachman, Trischler and Bengio, Learning deep representations by mutual information estimation and maximization, arXiv preprint arXiv:1808.06670 (2018).

26. Doersch, Gupta and Efros A.A., Unsupervised visual representation learning by context prediction, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1422–1430.

27. Li C.-L., Sohn, Yoon and Pfister, Cutpaste: Self-supervised learning for anomaly detection and localization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9664–9674.

28. Yan, Zhang, Xu, Hu and Heng P.-A., Learning semantic context from normal samples for unsupervised anomaly detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35 (2021), 3110–3118.

29. Zavrtanik, Kristan and Skocaj, Reconstruction by inpainting for visual anomaly detection, Pattern Recognition 112 (2021), 107706.

30. Zavrtanik, Kristan and Skocaj, Draem - a discriminatively trained reconstruction embedding for surface anomaly detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8330–8339.

31. Yang, Wu, Liu and Feng, Memseg: A semi-supervised method for image surface defect detection using differences and commonalities, arXiv preprint arXiv:2205.00908 (2022).

32. Tsai C.-C., Wu T.-H. and Lai S.-H., Multi-scale patch-based representation learning for image anomaly detection and segmentation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 3992–4000.

33. Tack, Mo, Jeong and Shin, Csi: Novelty detection via contrastive learning on distributionally shifted instances, Advances in Neural Information Processing Systems 33 (2020), 11839–11852.

34. Touvron, Cord, Sablayrolles, Synnaeve and Jegou, Going deeper with image transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 32–42.

35. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly et al., An image is worth 16×16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).

36. Sener and Savarese, Active learning for convolutional neural networks: A core-set approach, arXiv preprint arXiv:1708.00489 (2017).

37. Sinha, Zhang, Goyal, Bengio, Larochelle and Odena, Small-gan: Speeding up gan training using coresets, in: International Conference on Machine Learning, PMLR, 2020, pp. 9005–9015.

38. Agarwal P.K., Har-Peled, Varadarajan K.R. et al., Geometric approximation via coresets, Combinatorial and computational geometry 52(1) (2005).

39. Mahalanobis P.C., On the generalized distance in statistics, National Institute of Science of India, 1936.

40. Mishra, Verk, Fornasier, Piciarelli and Foresti G.L., Vtadl: A vision transformer network for image anomaly detection and localization, in: 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), IEEE, 2021, pp. 01–06.

41. Yu, Zheng, Wang, Li, Wu, Zhao and Wu, Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows, arXiv preprint arXiv:2111.07677 (2021).

42. Deng and Li, Anomaly detection via reverse distillation from one-class embedding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9737–9746.


Table 1
Ablation study on the complementarity of the AMC block for the Nearest Neighbor Search and the multivariate Gaussian distribution

| Component | Detection result (%) | Localization result (%) |
| --- | --- | --- |
| NNS | 98.5 | 98.0 |
| Gaussian | 97.8 | 98.3 |
| NNS+Gaussian (AMC block) | 98.5 | 98.3 |