Abstract
When conventional semantic segmentation methods are applied to fabric defect detection, the miss rate for small defects is relatively high, and deeper network models tend to lose the features of small defects and have poor real-time performance. To address these problems, we propose two semantic segmentation networks that are sensitive to small defects: ClothNet, based on deep feature fusion, and ClothNet-tiny, based on atrous spatial pyramid pooling. First, in ClothNet, deep and shallow features are fused to compensate for the information loss caused by pooling. Second, ClothNet-tiny is designed to improve the detection speed. Finally, a loss function that adapts to defect size, namely weighted Dice (WD) loss, is proposed. The results on the validation set show that ClothNet achieves a mean MIoU (mean intersection over union) of 78.8%. Compared to fully convolutional networks, ClothNet reduces memory consumption by 28% and ClothNet-tiny by 77%.
Introduction
Fabric defects are the biggest factor affecting fabric quality and reduce fabric usability.1,2 Defect detection is used not only to evaluate fabric quality but also to infer the operating state of the machines. 3 Currently, fabric defects are mainly detected manually; because human attention and proficiency are limited, even skilled workers achieve only about 70% detection accuracy. 4 The defect detection process also affects productivity and results in significant labor costs. 5
In recent years, with the rapid development of computer vision, many fabric defect detection algorithms have emerged. 6 These algorithms can be divided into four major categories: statistical algorithms, spectral algorithms, dictionary learning, and deep learning. 7
As fabric defects are random, statistical algorithms were first studied, in which the images were preprocessed and then statistically identified. A typical approach was the Sylvester matrix similarity estimation algorithm proposed by Kumari et al. 8 The algorithm involved six stages: resolution matching, image enhancement using histogram specification and median mean-based sub-image-clipped histogram equalization, image registration by alignment and hysteresis processes, image subtraction, edge detection, and fault detection by rank of the Sylvester matrix. The algorithm achieved 93% accuracy on KTH-TIPS-I and KTH-TIPS-II, but the real-time performance was poor.
As fabric textures are periodic, spectral algorithms model the texture and then use the model to detect defects. Jing et al. 9 used a genetic algorithm to tune a Gabor filter on defect-free fabric so that it matched the defect-free texture, and then used the tuned filter to detect defects; this achieved good results on plain fabrics but could not handle color fabrics. Schneider and Merhof 10 proposed a Fourier analysis-based algorithm for woven fabric detection, extracting distinct peaks to obtain texture frequencies and combining template matching and fuzzy clustering to extract warp and weft angles, obtain weave pattern information, and construct a yarn map. The method obtained 97% accuracy on a dataset of 140 real fabric images, but its range of applicability was narrow and its sampling requirements were high.
Mimicking the process by which human inspectors are trained to detect fabric defects, visual inspection methods based on biological vision and dictionary learning have been proposed; both are low-rank models and are therefore also referred to as low-rank decomposition. Li et al. 11 integrated dictionary learning and Laplace regularization into a low-rank representation model, exploiting the low-rankness and sparsity of biological vision. Dictionary learning reduced the noise of the saliency maps, and the addition of Laplace regularization enlarged the grayscale difference between defects and background; finally, the generated visual saliency maps were segmented. This method achieved better results on fabrics with complex patterns, but its huge computational cost prevented real-time operation. Noting that the traditional low-rank decomposition model ignored the effect of Gaussian noise, Shi et al. 12 proposed a low-rank decomposition model with noise regularization and gradient information. The algorithm used noise regularization to expand the gradient between defects and nondefects and constrained the noise terms based on the gradient information to guide the matrix decomposition; it achieved good recall and precision but could not run in real time.
In recent years, deep learning has become the preferred choice for many vision tasks, and neural network models can now approach human-like scene understanding. Many researchers have applied deep learning to the fabric detection problem with satisfactory results. Liu et al. 13 classified defective images based on the VGG16 neural network model of Simonyan and Zisserman 14 and pruned the original VGG16 model by visualizing the feature maps, maximizing model performance while reducing the network parameters. Tests on a self-constructed dataset achieved good classification performance. The disadvantage was that the model could only distinguish the presence or absence of defects and could not obtain the size, type, or number of defects. Jing et al. 15 proposed a sliding window recognition method based on classification networks, which could give a rough estimate of defect contours. However, obtaining specific contour and classification information requires dense segmentation of the image, which increases the computational effort and prevents good real-time performance.
Because the above-mentioned deep learning networks do not reflect the location and size of defects well, object detection methods, 16 pixel-level segmentation, and semantic segmentation have since been developed. Their application to fabric defects is reviewed below.
Li used the R-CNN model to identify fabric defects; 17 the prior bounding box sizes of R-CNN were obtained by clustering the fabric defect sizes. The traditional nonmaximum suppression method was replaced by a soft nonmaximum suppression method to avoid incorrectly discarding predicted bounding boxes due to overlap, 18 but the approach was still inadequate for fabric defects with a large aspect ratio, such as broken yarns. 19
Because the size of the predicted bounding box does not reflect the actual size of the defect, pixel-level segmentation of defects is a better alternative. Huo et al. 20 used Bayesian optimization to prune UNet++, and then denoised and labeled the connected domains on the feature map output by the neural network. A threshold was set to decide whether a connected domain was a defect. The method reduced the inference time by 24% with a 1.4% reduction in Intersection over Union. The method could identify the rough contours of defects, but the detection process could not be performed end to end, and the defects could not be classified.
Semantic segmentation predicts a class for every pixel. 21 It not only segments the defects but also identifies their classes, and thus provides all information about the defects. 22 Jing et al. 23 used the UNet network structure, replaced the backbone with MobileNetv2, which has fewer parameters, and used depthwise separable convolution to improve the inference speed, achieving 70% accuracy (mean intersection over union; MIoU) on a self-built dataset. Semantic segmentation is usually used in robotics and autonomous driving, 24 and some classic network structures have emerged; however, most of these networks are designed for semantic segmentation of complex scenes. In the fabric defect detection task, the scene semantics are relatively homogeneous, so these networks are deeper than necessary. Moreover, some networks have huge receptive fields, whereas fabric defects are generally small and narrow, so such small-scale features are easily overlooked.
Therefore, it is necessary to redesign the networks to cope better with the task of fabric defect detection. Meanwhile, in semantic segmentation network training, the imbalance of positive and negative samples in the dataset can lead to nonconvergence or slow convergence of the model, which can be addressed with a weighted loss function. The weights are often set manually and need to be adjusted several times. We designed a new loss function that adaptively balances the positive and negative samples and assigns weights more reasonably, improving the training effect for small-size categories. We also used a decaying cosine annealing learning rate to prevent the model from falling into a local optimum and to reduce the oscillations in model performance caused by restarting the learning rate. In summary, the proposed approach makes the following contributions:
WD loss with a size-adaptive function is designed to solve the sample imbalance problem; the size-adaptive function is a nonlinear function of the pixel occupancy of positive samples. WD loss is shown to make the model converge better than Dice loss, improving the mean MIoU of the model by 4.7% and the hole MIoU by 9.2%.
ClothNet is designed based on deep feature fusion; maintaining a large feature map size through dilated convolution makes small-scale defects less likely to be lost. ClothNet had a missed detection rate of 1.3% and a false detection rate of 2.0% on the test set.
ClothNet-tiny uses a tandem network structure; compared to fully convolutional networks (FCNs), it reduces memory cost by 77%. ClothNet-tiny had a missed detection rate of 2.7% and a false detection rate of 3.4% on the test set.
A decaying cosine annealing function is used to prevent the model from falling into a local optimum and to ensure the stability of the final trained model.
Models
This section describes the proposed ClothNet and ClothNet-tiny network models in detail. ClothNet uses an encoder–decoder network structure, consisting of a backbone part and an upsampling part, and uses deep feature fusion to compensate for the information loss due to pooling. 25 ClothNet-tiny performs atrous spatial pyramid pooling after the backbone and then concatenates the feature maps produced by the dilated convolutions. The concatenated feature map is sent to the upsampling section to recover the original resolution. All convolutional layers are followed by batch normalization, 26 and LeakyReLU is used as the activation function with
ClothNet
Residual blocks are used to build the backbone. Deeper networks are generally considered to perform better, but in practice very deep networks suffer from degradation. Residual connections solve the degradation problem by establishing an identity mapping. In Figure 1, the 1 × 1 convolution in the residual block not only acts as a fully connected layer along the channel dimension, but can also flexibly reduce or increase the number of channels. Applying the larger N × N convolution to the channel-reduced tensor reduces the amount of computation. The N × N convolution is used to enlarge the network receptive field and increase the network parameters. Figure 1(a) shows the structure when the input and output dimensions are different, and Figure 1(b) shows the structure when they are the same. The residual block is widely used in backbone networks because of its excellent performance and fast speed. 28

Residual block: (a) different input and output dimensions and (b) same input and output dimensions.
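The saving from applying the N × N convolution to a channel-reduced tensor can be illustrated by counting weights. The sketch below is only an illustration: the channel widths (256 in and out, 64 in the middle) and N = 3 are assumed values, not the ones used in ClothNet, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def bottleneck_params(c_in, c_mid, c_out, n):
    """1x1 channel reduction, then n x n convolution, then 1x1 channel expansion."""
    return (conv_params(c_in, c_mid, 1)
            + conv_params(c_mid, c_mid, n)
            + conv_params(c_mid, c_out, 1))


# Hypothetical channel widths, chosen only for illustration.
plain = conv_params(256, 256, 3)              # direct 3x3 convolution
reduced = bottleneck_params(256, 64, 256, 3)  # 3x3 applied after 1x1 reduction

print(plain, reduced, round(reduced / plain, 3))
```

With these assumed widths the bottleneck form needs roughly an eighth of the weights of the direct N × N convolution, and the multiply-accumulate count per output position shrinks by the same ratio.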
The output of the
Referring to Figure 2(a), when the input is 240 × 240, the minimum span of defects is concentrated between 8 and 15 pixels. In Figure 2(b) to (e), as the feature map size decreases, the defect contours become blurred, which also means that subsequent contour restoration becomes difficult. When the encoder downsamples the feature map to 60 × 60, the minimum span of defects occupies only 2–3 pixels, and 30 × 30 is the minimum size that ensures the features are not scaled away. To preserve the detection rate of small defects and to allow the resolution to be restored in the subsequent contour inference process, dilated convolution is selected instead of max pooling.

Downsampling size comparison: (a) image 240 × 240; (b) feature map 240 × 240; (c) feature map 120 × 120; (d) feature map 60 × 60 and (e) feature map 30 × 30.
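The pixel spans quoted above follow from simple proportional scaling. A minimal sketch, assuming the defect span shrinks linearly with the feature map size (the function name is illustrative):

```python
def span_after_downsampling(span_px, input_size, feature_size):
    """Span of a defect, in feature-map pixels, after uniform downsampling."""
    return span_px * feature_size / input_size


for size in (240, 120, 60, 30):
    low = span_after_downsampling(8, 240, size)
    high = span_after_downsampling(15, 240, size)
    print(f"{size} x {size}: minimum span between {low:.2f} and {high:.2f} pixels")
```

At 60 × 60 an 8 to 15 pixel span maps to about 2 to 3.75 pixels, and at 30 × 30 the narrowest defects map to a single pixel, so any further downsampling would leave them narrower than one feature-map pixel.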
Dilated convolution can provide the receptive field expansion that max pooling brings, without changing the feature map size. 29 Dilated convolution with a kernel size of 3 × 3 and a dilation rate of 2 is used at the end of the encoder section. Figure 3 shows the 3 × 3 convolution kernel for different dilation rates.

Dilated convolution kernel with different dilation rates (Dr).
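To make the effect of the dilation rate concrete, the sketch below computes the effective extent of a dilated kernel and the positions it samples, using the standard relation k_eff = k + (k − 1)(Dr − 1). The helper names are illustrative.

```python
def effective_kernel_size(k, dilation):
    """Side length of the area covered by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def tap_offsets(k, dilation):
    """(row, col) offsets, relative to the centre, sampled by a dilated k x k kernel."""
    half = k // 2
    return [(dy * dilation, dx * dilation)
            for dy in range(-half, half + 1)
            for dx in range(-half, half + 1)]


for rate in (1, 2, 3):
    print(rate, effective_kernel_size(3, rate), tap_offsets(3, rate)[:3])
```

A 3 × 3 kernel with a dilation rate of 2 still has nine weights but covers a 5 × 5 area, which is how the receptive field grows without reducing the feature map size.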
The calculation of the encoder receptive field is shown in equation (1). Where
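As a hedged illustration of how an encoder receptive field can be computed, the sketch below uses the commonly used layer-by-layer recurrence, in which each layer adds (k_eff − 1) times the product of the preceding strides. It is not assumed to be identical to equation (1), and the layer list is a hypothetical example rather than the ClothNet encoder.

```python
def receptive_field(layers):
    """Receptive field of a stack of convolutions.

    layers: list of (kernel, stride, dilation) tuples, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent feature-map pixels
    for kernel, stride, dilation in layers:
        k_eff = kernel + (kernel - 1) * (dilation - 1)
        rf += (k_eff - 1) * jump
        jump *= stride
    return rf


# Hypothetical stack: 3x3 conv, 3x3 conv with stride 2, 3x3 conv with dilation 2.
example = [(3, 1, 1), (3, 2, 1), (3, 1, 2)]
print(receptive_field(example))
```

For this example the receptive field grows from 3 to 5 to 13 pixels; the dilated layer contributes the largest step because its effective kernel is 5 × 5 and it operates after a stride-2 layer.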
The decoder uses transposed convolutions and a tandem structure to form the upsampling layers: the transposed convolution reduces the number of channels while restoring the resolution of the image, and the convolutions further extract features and infer the class to which each pixel belongs. Outputs of the backbone at different scales are summed with the matching upsampled feature maps to achieve large-span feature fusion, which benefits contour derivation. Finally, the classification probabilities of the pixels are obtained with the SoftMax activation function. Figure 4 shows the network structure of ClothNet.

ClothNet.
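The decoder steps described above (resolution recovery by transposed convolution, fusion by summation, and SoftMax classification) can be sketched with plain Python. The kernel size 2 and stride 2 of the transposed convolution are assumptions made for illustration; the output-size relation is the standard one for a transposed convolution without dilation.

```python
import math


def transposed_conv_size(n, kernel, stride, padding=0, output_padding=0):
    """Output side length of a transposed convolution (no dilation)."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


def fuse_by_sum(upsampled, skip):
    """Element-wise sum of two equally sized 2-D feature maps."""
    assert len(upsampled) == len(skip) and len(upsampled[0]) == len(skip[0])
    return [[u + s for u, s in zip(row_u, row_s)]
            for row_u, row_s in zip(upsampled, skip)]


def softmax(logits):
    """Class probabilities for one pixel from its per-class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed 2x2, stride-2 transposed convolutions: 60 -> 120 -> 240.
sizes = [60]
for _ in range(2):
    sizes.append(transposed_conv_size(sizes[-1], kernel=2, stride=2))
print(sizes)

print(fuse_by_sum([[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]))
print(softmax([2.0, 1.0, 0.1]))
```

Summation requires the upsampled map and the backbone output to have the same shape, which is why each fusion is made between feature maps of matching scale.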
ClothNet-tiny
Figure 5 shows the network structure of ClothNet-tiny. ClothNet-tiny also uses the residual block as the basic layer of the backbone. In the residual block, the input dimensions are aligned with the output dimensions, and convolution is used instead of max pooling for downsampling. The minimum downsampling size of the backbone is 30 × 30, the smallest size that keeps the minimum defect span from being scaled away while giving higher speed. The receptive field of the ClothNet-tiny encoder is 38 × 38. Dilated convolutions with dilation rates of 1, 2, 3, and 4 process the feature maps output by the backbone. The extracted feature maps are concatenated as inputs to the decoder network for scale diversity. Unlike ClothNet’s decoder network, ClothNet-tiny does not use large-span feature fusion, and the whole decoder network is connected in tandem.

ClothNet-tiny.
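For the parallel dilated branches to be concatenated, each branch must return a feature map of the same spatial size. The sketch below checks this with the standard convolution output-size relation, assuming 3 × 3 kernels with padding equal to the dilation rate; the branch channel count of 64 is a hypothetical value.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output side length of a convolution on an n x n input."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


rates = (1, 2, 3, 4)   # dilation rates used after the backbone
branch_channels = 64   # hypothetical number of channels per branch

# With a 3x3 kernel, padding equal to the dilation rate keeps the 30 x 30 size.
branch_sizes = [conv_output_size(30, 3, padding=r, dilation=r) for r in rates]
concat_channels = branch_channels * len(rates)

print(branch_sizes, concat_channels)
```

Each branch sees a different effective kernel extent (3, 5, 7, and 9 pixels for rates 1 to 4), so the concatenated tensor carries features at several scales while staying at 30 × 30.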
Comparison
According to Table 1, compared to FCN, ClothNet has 90% fewer parameters and ClothNet-tiny has 93% fewer. Even compared to PSPNet, which is more lightweight, ClothNet has 44% fewer parameters and ClothNet-tiny 62% fewer. Multiscale feature fusion and large-span feature fusion structures are used in ClothNet, which results in better network performance, but such structures require temporary storage of a large number of tensors, which increases memory usage. The higher downsampling rate of ClothNet-tiny makes it faster. As seen from Figure 6(a), in the fabric defect detection task the large fabric width makes it necessary to split the images into tiles. Multiple tiles are packed into one batch for network prediction, which makes video memory occupation very important. ClothNet-tiny eliminates a large number of feature fusions, and the entire network is structured essentially in tandem, making its video memory occupation much lower. In Figure 6(b), compared with FCN, ClothNet reduces memory cost by 28% and ClothNet-tiny by 77%.
Model comparison
MIoU: mean intersection over union; FCN: fully convolutional network.

Image segmentation and video memory cost.
Training
Loss function
In semantic segmentation, positive and negative samples are unbalanced, and the number of negative samples is usually much larger than the number of positive samples. Figure 7(a) shows the distribution of positive and negative samples in the dataset; the pixel occupancy of positive samples is only

Distribution of samples: (a) number of positive and negative sample pixels and (b) number of pixels in different classes.
The Dice loss provides a good indication of the similarity between predicted results and ground truth. 31
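For reference, a minimal sketch of the standard soft Dice loss for a single class is given below, together with a generic weighted combination over classes. The weights are passed in by the caller as a stand-in; the size-adaptive function that produces them in WD loss is not assumed here, and the smoothing term `eps` is an illustrative choice.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for one class.

    pred:   flat list of predicted probabilities in [0, 1]
    target: flat list of ground-truth labels (0 or 1)
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def weighted_dice_loss(preds, targets, weights):
    """Weighted mean of per-class Dice losses; the weights are supplied by the caller."""
    total = sum(weights)
    return sum(w * dice_loss(p, t)
               for p, t, w in zip(preds, targets, weights)) / total


target = [0, 0, 0, 1]                        # one positive pixel out of four
print(dice_loss([0, 0, 0, 1.0], target))     # perfect prediction
print(dice_loss([0, 0, 0, 0.5], target))     # under-confident prediction
```

Because the Dice loss is normalized by the predicted and true foreground areas rather than by the total pixel count, a class that occupies few pixels still contributes a loss on the same 0 to 1 scale as a large one.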
To measure the effect of negative samples on the loss values, a size-adaptive weighting function designed for Dice loss, denoted by
Learning rate
The learning rate plays an important role in the convergence of the network model. In general, the learning rate gradually decreases as the number of training epochs increases so that the model eventually converges. When the objective function has multiple peaks, such a strategy can cause the model to fall into a local optimum rather than the global optimum. Cosine annealing restarts the learning rate by setting progressively longer periods and resetting the learning rate to its initial value at the beginning of each period, thus allowing the model to jump out of a local optimum. 32
Cosine annealing also has drawbacks: restarting the learning rate after a certain number of training iterations leads to large oscillations in model performance and slow convergence, so a decay factor is needed to constrain the learning rate.

Learning rate.
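One way to realize a decaying cosine annealing schedule is sketched below: the learning rate follows a cosine curve within each period, periods grow progressively longer, and the restart value is multiplied by a decay factor at each restart. All hyperparameter values (base rate, minimum rate, first period, period multiplier, decay factor) are assumptions for illustration, not the settings used for ClothNet.

```python
import math


def decayed_cosine_lr(epoch, base_lr=1e-3, min_lr=1e-5,
                      first_period=10, period_mult=2, decay=0.5):
    """Learning rate at a 0-indexed epoch for cosine annealing with decayed restarts."""
    period, start, peak = first_period, 0, base_lr
    # Walk forward to the period that contains this epoch.
    while epoch >= start + period:
        start += period
        period *= period_mult   # progressively longer periods
        peak *= decay           # each restart begins from a lower value
    progress = (epoch - start) / period
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * progress))


for e in (0, 5, 9, 10, 29, 30):
    print(e, f"{decayed_cosine_lr(e):.6f}")
```

With these assumed values the rate restarts at epochs 10 and 30, but from 5e-4 and 2.5e-4 rather than from the initial 1e-3, which limits the size of the jump at each restart.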
Experimental evaluation
In this section, the performance of ClothNet and ClothNet-tiny is evaluated. All training and testing were performed on an IPC with an Intel Core i7-6850K CPU, an NVIDIA GeForce GTX 1080 Ti GPU, and Windows 10 OS; Adam was the optimizer for both networks,
Dataset
Both training and testing were performed using a self-built dataset containing six classes: defect-free, carrying, broken yarn, knots, holes, and soiled yarn. The original images were acquired with a 1920 × 1440 camera; each image was divided into a 4 × 3 grid of 480 × 480 subimages, which were then resized to 240 × 240. Subimages without defects were removed, and only those with defects were kept. Each image was manually labeled at the pixel level, giving a total of 1451 images and 1846 defects. Table 2 shows the number of images and defects in each class. Some images have multiple disconnected defects, and each image contains at least one defect; 1303 images were selected as the training set and 148 as the test set. All images and ground truths were 240 × 240 in size. Figure 9 shows some defect images and ground truths from the dataset.
Number of images and defects

Dataset: (a) image and (b) ground truth.
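The tiling of a camera frame into subimages can be sketched as follows. The function only computes the crop boxes; the actual cropping and resizing to 240 × 240 would be done with an image library, which is left out here. The function name and box convention (left, top, right, bottom) are illustrative choices.

```python
def tile_boxes(width=1920, height=1440, cols=4, rows=3):
    """Crop boxes (left, top, right, bottom) for a cols x rows grid over one frame."""
    tile_w, tile_h = width // cols, height // rows
    return [(c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            for r in range(rows)
            for c in range(cols)]


boxes = tile_boxes()
print(len(boxes), boxes[0], boxes[-1])

# Each 480 x 480 tile is then resized by a factor of 0.5 to 240 x 240.
scale = 240 / (boxes[0][2] - boxes[0][0])
print(scale)
```

One 1920 × 1440 frame therefore yields twelve 480 × 480 tiles, which is also the natural group of images to pack into a batch at inference time.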
Training and verification
First, mean intersection over union,
In equation (7),

Intersection over union (IoU) schematic.
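As a hedged illustration of the metric, the sketch below computes the intersection over union of one class from a predicted label map and a ground-truth label map, and averages it over the classes that occur. This is the common definition and is not assumed to match equation (7) term for term.

```python
def class_iou(pred, truth, cls):
    """IoU of one class; pred and truth are 2-D lists of class indices."""
    intersection = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == cls, t == cls
            intersection += in_pred and in_truth
            union += in_pred or in_truth
    return intersection / union if union else None


def mean_iou(pred, truth, classes):
    """Mean IoU over the classes that appear in the prediction or the ground truth."""
    ious = [v for v in (class_iou(pred, truth, c) for c in classes) if v is not None]
    return sum(ious) / len(ious)


truth = [[0, 0, 1],
         [0, 1, 1]]
pred = [[0, 0, 1],
        [0, 0, 1]]
print(class_iou(pred, truth, 0), class_iou(pred, truth, 1), mean_iou(pred, truth, [0, 1]))
```

In this toy example one defect pixel is missed, so the background IoU is 3/4 and the defect IoU is 2/3; a single wrong pixel costs the small class far more than the large one, which is why small defects are the hardest to score well on.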
A strategy is needed for deciding when to stop training the model, avoiding overfitting due to overtraining, and saving time.33,34 Assume that
A generalization loss
In equation (10), we can find that the appearance of higher generalization loss in the later stages of training (for larger value of
Set stop threshold

Loss-epoch: (a) ClothNet and (b) ClothNet-tiny.
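A stopping rule based on generalization loss can be sketched as follows. The sketch uses the widely used definition in which the generalization loss at an epoch is the relative increase, in percent, of the current validation error over the lowest validation error seen so far; this is an assumption and may differ in detail from the criterion defined above, and the error values and thresholds are made up for illustration.

```python
def generalization_loss(val_errors):
    """Generalization loss (percent) at each epoch, relative to the best error so far."""
    best = float("inf")
    losses = []
    for err in val_errors:
        best = min(best, err)
        losses.append(100.0 * (err / best - 1.0))
    return losses


def stop_epoch(val_errors, threshold):
    """First epoch whose generalization loss exceeds the threshold, or None."""
    for epoch, gl in enumerate(generalization_loss(val_errors)):
        if gl > threshold:
            return epoch
    return None


# Hypothetical validation errors over five epochs.
errors = [0.50, 0.40, 0.42, 0.38, 0.45]
print([round(gl, 2) for gl in generalization_loss(errors)])
print(stop_epoch(errors, threshold=10.0))
```

A low threshold stops at the first small rise in validation error, while a higher one tolerates temporary oscillations, so the threshold trades training time against the risk of stopping before the model has converged.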
ClothNet was trained for 1000 epochs to verify the improvement brought by WD loss. Figure 12 shows the MIoU-epoch plots for each class in the test set. Figure 12(a) shows that the defect-free class converges quickly even with imbalanced training samples. In Figure 12(c) to (f), WD loss clearly makes these classes converge earlier than Dice loss; in particular, for the hole class in Figure 12(e), WD loss converged 502 epochs earlier. This earlier convergence indicates that WD loss is effective in dealing with sample imbalance, so training with WD loss reaches a well-performing model in less time. In Figure 12, the detection accuracy (MIoU) of the model trained with WD loss is higher than that of the model trained with Dice loss; in Figure 12(c) to (e), the MIoU of the small defects broken yarn, knot, and hole reached 72.0%, 77.2%, and 71.7%, respectively, indicating that the size-adaptive weighting function in WD loss improves detection performance for small defects.

Mean intersection over union (MIoU) with Dice loss and weighted dice (WD) loss: (a) defect-free; (b) carrying; (c) broken yarn; (d) knot; (e) hole and (f) soiled yarn.
Comparison
To test the performance of ClothNet and ClothNet-tiny, we selected models that perform well in semantic segmentation tasks for comparison, including UNet, FCN, and PSPNet. The performance of the models was judged using MIoU as the metric, and all models were trained and tested on the same training and test sets.
As can be seen from Table 1, ClothNet achieves the best results in all classes, and ClothNet-tiny achieves good results with far fewer parameters than the other networks. Both ClothNet and ClothNet-tiny show better results in detecting the small-size hole class. UNet does well in contour segmentation but has poor classification ability. FCN’s huge number of parameters provides better accuracy. PSPNet fails to achieve better accuracy, but its smaller number of parameters is an obvious advantage.
Figure 13 shows the detection results for each class. In general, none of the models misses a defect, but ClothNet and ClothNet-tiny clearly give better classification results and produce no wrong detections. In the detection of small holes, the traditional models are less effective at inferring the defect contour and class, while ClothNet and ClothNet-tiny give excellent results.

Outputs of different models.
In terms of detail performance, in Figure 13, we can observe that ClothNet’s segmentation of contours is superior to that of the traditional network and ClothNet-tiny, due to ClothNet’s minimum downsampling size of 60 × 60 and the use of a large number of feature fusion techniques. In the detection results of the broken yarn class in Figure 13, we can better see ClothNet’s superior performance in contour segmentation.
ClothNet reaches 48 FPS and ClothNet-tiny reaches 99 FPS, both satisfying the real-time requirement (30 FPS). Packing groups of images into batches can effectively improve the efficiency of the network: for a fixed number of images, the larger the batch size, the higher the efficiency. If memory is sufficient, larger batch sizes are encouraged in tasks with high parallelism requirements. Table 3 shows the time taken to process 1000 images with different batch sizes.
Time cost with different batch sizes
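The quoted frame rates can be related to processing time with simple arithmetic. The sketch below assumes a constant throughput and assumes that the FPS figures refer to 240 × 240 subimages, so that one full camera frame corresponds to 12 of them; both assumptions are illustrative and are not taken from Table 3.

```python
def seconds_for(n_images, fps):
    """Time to process n_images at a constant throughput of fps images per second."""
    return n_images / fps


def full_frames_per_second(tile_fps, tiles_per_frame=12):
    """Full camera frames per second when each frame is split into tiles."""
    return tile_fps / tiles_per_frame


for name, fps in (("ClothNet", 48), ("ClothNet-tiny", 99)):
    print(name,
          f"{seconds_for(1000, fps):.1f} s per 1000 images,",
          f"{full_frames_per_second(fps):.2f} full frames per second")
```

Under these assumptions, 1000 subimages take about 20.8 s at 48 FPS and about 10.1 s at 99 FPS; measured times for different batch sizes will deviate from these constant-throughput figures.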
Conclusion
To achieve end-to-end fabric defect detection, ClothNet and ClothNet-tiny were designed, both of which use an encoder–decoder architecture. ClothNet uses large-span feature fusion to enhance defect contour segmentation, and a residual connection module to avoid network degradation and reduce computation. ClothNet-tiny has a lower memory cost and faster detection speed, and therefore lower hardware requirements in practice. ClothNet and ClothNet-tiny were trained and tested on a self-built dataset and then compared with traditional models.
ClothNet has only 10% of the parameters of FCN, and even compared to UNet and PSPNet, which have fewer parameters, ClothNet has 61.2% and 44.1% fewer parameters, respectively. Compared to ClothNet, ClothNet-tiny has 31.5% fewer parameters and only 4.7% lower performance (MIoU mean).
In the actual defect distribution, the samples are seriously imbalanced, which makes the network insensitive to smaller defects. WD loss addresses the imbalance between positive and negative samples by calculating them separately. The weights are calculated by the size-adaptive function, which assigns larger weights to small-size samples and thus addresses the imbalance between samples. Compared with Dice loss, WD loss gives higher detection accuracy, with improvements of 2.7%, 1.8%, 10.3%, 15.1%, and 5.6% in the five defect classes, respectively, and converges 502 epochs earlier in the hole class; the trend of earlier convergence is also seen in the other classes.
Current defect detection methods still require manual data collection and labeling. For printed fabrics, for example, with their numerous and frequently updated patterns, such supervised networks are not well suited to this way of working, whereas semisupervised or unsupervised neural networks can be deployed faster. In addition, defect images can be generated semantically with a Generative Adversarial Network (GAN), and the random labels and generated images can be fed to the semantic segmentation network as supplementary data, which can alleviate the problem of insufficient data to some extent. This will be our future research direction.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
