Abstract
As a special case of content-based image retrieval, fabric retrieval has high potential application value in many fields. However, fabric retrieval places stricter demands on the results, which makes it difficult to apply common retrieval methods directly. Two obstacles make the problem challenging: the variety and complexity of fabric appearance, and the high accuracy required of the retrieval. To address this issue, this paper presents a novel method for fabric image retrieval based on soft similarity and listwise learning. First, a soft similarity between two fabric images is defined to describe their relationship. Then, a convolutional neural network with a compact structure and cross-domain connections is designed to learn the fabric image representation. Finally, listwise learning is introduced to train the convolutional neural network model and hash function. The generated hash codes are used to index the fabric images. The experiments are conducted on a wool fabric dataset. The experimental results show that the newly proposed method clearly improves on our previous work.
Introduction
Recently, with the advent of the fourth consumption era (the fast consumption era), ‘small-batch’ and ‘multi-variety’ production has gradually become the norm in the textile industry. Under this production mode, textile companies have accumulated a large amount of historical production data, which makes it more difficult to find similar fabrics. The traditional search method relies on manual comparison, which is time-consuming and labor-intensive. Keyword-based image retrieval 1,2 methods have been widely used in the textile industry. However, such methods rely too heavily on manually labeled keywords, which makes the results highly subjective. Content-based image retrieval (CBIR) 3–7 is an effective way to address this issue.
Fabric image retrieval, a special case in CBIR, has high potential application value in many fields, such as e-commerce, inventory management, and textile product design. Fabric retrieval plays a very important role in product search work. For example, in textile order production, the processing will analyze the process parameters of the samples provided by the customer, and then find the visually identical or similar products from the historically produced products. If the historical production data are huge, it will be very difficult to complete this work manually. An accurate image retrieval system can accomplish this task well.
The image retrieval system receives an input query and is expected to output a list of images that are related or similar to the query. Technically speaking, CBIR has two core components: image representation and similarity measurement. Image representation vectorizes the images (including the queries and the images in the database), while similarity measurement ranks the images in the database by their relevance to the query image. The most challenging task in CBIR is to associate pixel-based low-level features with the high-level semantic features of human perception, 8 a difficulty known as the ‘semantic gap’. CBIR builds an index on the visual content of images and searches for similar images according to a defined similarity. The choice of visual features directly affects the performance of the retrieval algorithm and is the key technology of CBIR.
Image retrieval methods have gone through significant development in the past decade, starting with descriptors based on hand-crafted features, 9 first organized in bag-of-words, 10 and further expanded by spatial verification, 11 Hamming embedding, 12 and query expansion. 13 The traditional CBIR methods often use global feature descriptors, such as color histogram, 14 color moment, 15 and local binary pattern 16,17 to represent the visual content of the image. With the proposal of the scale-invariant feature transform (SIFT), 18 bag-of-words technology 19 was grafted into CBIR, and many local features were subsequently proposed for CBIR. The SIFT algorithm is highly robust to geometric transformations such as deformation, rotation, and scaling. Much work built around SIFT has had great influence in CBIR and in computer vision more broadly. Although they have achieved a certain success, these hand-crafted methods depend heavily on feature engineering, which limits them.
In 2012, AlexNet 20 won the ImageNet image recognition competition by a wide margin over second place, and deep learning technology has since attracted great attention in the field of machine vision. Krizhevsky et al. 20 used the 4096-dimensional output of a fully connected layer in AlexNet as the image index and retrieved images directly on the ImageNet dataset with very good results, verifying that a convolutional neural network (CNN) represents image content well. However, the extracted high-dimensional representation also incurs a huge computational overhead, causing the problem of ‘dimensionality disaster’. Many subsequent studies have centered on how to reduce the dimensionality of the features extracted by deep CNNs.
There are two main problems in the current research on fabric image retrieval: (a) many methods target only a specific type of fabric, which makes them less adaptable; (b) the existing methods represent the fabric from only one or two dimensions, which makes it difficult to describe the visual characteristics of the fabric fully.
As mentioned before, fabric image retrieval is a special case of common image retrieval. However, retrieval methods for general images are difficult to apply directly to fabric image retrieval, because general image retrieval tends to pay more attention to the local information in the image, whereas fabric retrieval pays more attention to small elements such as texture primitive shape, primitive size, color, and area composition. For example, the focus of the image shown in Figure 1(a) is the zebra, but the focus of the fabric image shown in Figure 1(b) is the global color and texture features. Moreover, the content of general images is often easier to describe (in Figure 1(a), a zebra is eating), while the content of fabric images is more difficult to describe. Therefore, it is more difficult to represent fabric images with fine textures.

General image and fabric image. (a) An image of a zebra and (b) an image of a striped fabric.
Most of the features in a woven fabric image are global low-level features such as color and texture, together with the middle- and high-level features produced by their combination. Studies 21–23 have shown that the last layer of a deep CNN contains the highest-order features that the model can extract. The output of this layer is the deep feature learned after several convolutional operations. Its visual content is highly abstract and contains rich semantic information, such as the location, size, and category of the target. These features are an abstraction of the output of the previous layer, so the output of the previous layer has a lower degree of abstraction. Therefore, the features extracted by deeper convolutional layers are more abstract. Figure 2 shows the representation effect of VGG convolutional layers with different depths on fabric images. Generally, the output of the first and second layers contains rich color and texture features. In this paper, we designed a deep CNN with a compact structure and cross-domain connections to bridge the ‘semantic gap’. To train the proposed network, a soft similarity between fabric images and a listwise loss are defined. Moreover, a hashing layer is introduced to solve the problem of ‘dimensionality disaster’. This paper takes worsted woven fabrics as the main research object. However, the proposed framework can be extended to other types of fabrics, provided that the CNN model is retrained on datasets of those fabrics.

Feature activation map output by the first four convolutional units of VGG-16.
Network architecture
As shown in Figure 3, the proposed end-to-end framework for fabric image representation consists of five convolutional blocks, two short-circuit connections, two fully connected layers, and a hash layer. The network is based on VGG-16, 24 but is closer to a combination of VGG-16 and ResNet. 25 As mentioned before, most of the features in a woven fabric image are global low-level features such as color and texture, together with the middle- and high-level features produced by their combination. The short-circuit connections were therefore added to preserve low-level features in the deep layers.

The architecture of the proposed convolutional neural network (CNN) model.
There are two types of convolutional blocks used in the proposed model, which are shown in Figure 4(a) and (b). The convolution kernels all use a small size of 3 × 3. For a given receptive field, stacking small kernels increases the number of nonlinear layers and the network depth, which benefits the learning model. It also reduces the number of trainable parameters, which lowers the risk of overfitting to a certain extent. To prevent vanishing gradients during training, a batch normalization 26 layer is added before the ReLU activation in each convolution block to normalize the output of the previous layer. Among the convolutional layers, the batch normalization layer treats each feature map output by the previous layer as a neuron. To prevent the loss of information in the normalization, two trainable parameters γ and β are added so that the normalized output can be rescaled and shifted. The normalization process is shown in the following equation (1):
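Equation (1) is not reproduced in this text. Assuming it follows the standard batch normalization formulation, it can be sketched as below. Only γ and β are named in the text; the symbols $x_i$, $m$, $\mu_B$, $\sigma_B^2$, and $\epsilon$ (mini-batch activations, batch size, batch mean, batch variance, and a small stabilizing constant) are assumptions introduced here for illustration.

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
```

In this form, γ and β let the network undo the normalization where that is useful, which is how the information loss mentioned above would be avoided.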

Two convolutional blocks which are used in the network architecture. (a) Conv block 1 and (b) conv block 2.
Each convolution block contains a 2 × 2 max-pooling layer, which halves the height and width of the feature map. In Figure 3, the thick solid arrow indicates the forward propagation process, and the dashed arrow indicates the short-circuit connection, also known as the cross-domain connection, whose purpose is to transfer the shallow, low-level features to the deep layer. To give the feature maps of the two layers the same shape, the feature maps of the shallow layer undergo a 2 × 2 max-pooling operation when they are transferred to the deep layer.
Soft similarity learning of fabric images
The definition of soft similarity
For two fabric images, $I_i$ and $I_j$, with multidimensional labels, the similarity is commonly defined as
Assuming that
As shown in Figure 5, the similarity matrix M has 16 different situations when D = 4. According to equation (2), the similarity of pairwise images can be divided into five levels. For approximate nearest neighbor (ANN) search, the generated hash codes should preserve the similarity of the pairwise fabric images. To be specific, for a pair of generated binary codes $b_i$ and $b_j$, if

The similarity matrix when D = 4.
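The defining equations are not reproduced in this text. As a minimal sketch, assume the soft similarity is the fraction of the D label dimensions on which two images agree; this assumption is consistent with the five levels 0, 0.25, 0.5, 0.75, and 1 that arise when D = 4. The function name and the example label values are hypothetical.

```python
from typing import Sequence


def soft_similarity(labels_i: Sequence[int], labels_j: Sequence[int]) -> float:
    """Fraction of label dimensions on which two images agree (assumed form)."""
    if len(labels_i) != len(labels_j) or not labels_i:
        raise ValueError("label vectors must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_i, labels_j))
    return matches / len(labels_i)


# Hypothetical labels, one entry per labeled dimension (D = 4).
query = (1, 0, 1, 2)
candidate = (1, 2, 1, 0)
print(soft_similarity(query, candidate))  # 0.5 (two of four dimensions agree)
```

Under this assumption, each of the 16 match patterns for D = 4 maps to one of the five similarity levels.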
Listwise learning
In this study, the fabrics in the dataset are all labeled along four dimensions. The similarity between two fabric images can therefore take five values: 0, 0.25, 0.5, 0.75, and 1. During training, the model receives a set of five fabric images,
Let
To ensure the quality of learning, the outputs r of the last fully connected layer of the model and the final hash codes b are constrained. Moreover, to reduce the loss of information in the hashing process, the quantitative loss is introduced into the objective function. The output of the last fully connected layer of the model is mapped by the tanh activation function into the open interval (–1, +1). The quantitative loss encourages values less than 0 to move closer to –1 and values greater than 0 to move closer to +1, which can be written as:
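The formula is not reproduced in this text. The sketch below assumes one commonly used form, the mean squared gap between |r| and 1, which has the behavior described above; the paper's exact expression may differ. The function names and input values are hypothetical.

```python
import math
from typing import List, Sequence


def quantization_loss(r: Sequence[float]) -> float:
    """Mean squared gap between |r_k| and 1 (an assumed, commonly used form)."""
    return sum((abs(v) - 1.0) ** 2 for v in r) / len(r)


def binarize(r: Sequence[float]) -> List[int]:
    """Map relaxed outputs to hash bits in {-1, +1} by their sign."""
    return [1 if v >= 0 else -1 for v in r]


logits = [2.5, -0.1, 0.3, -3.0]           # hypothetical pre-activation values
relaxed = [math.tanh(v) for v in logits]  # tanh keeps every value inside (-1, +1)
print(binarize(relaxed))                  # [1, -1, 1, -1]

# Outputs already close to -1 or +1 incur a smaller penalty.
print(quantization_loss(relaxed) > quantization_loss([0.99, -0.99, 0.99, -0.99]))  # True
```

The penalty is zero only when every output sits exactly at –1 or +1, so minimizing it reduces the information lost when the relaxed outputs are converted to binary codes.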
The framework of the fabric image retrieval system
The parameters in the proposed CNN model are optimized under the supervision of the defined soft similarity. The trained model will be applied to build the fabric image retrieval system. Generally, the CBIR system contains two parts: an offline module and an online module, as shown in Figure 6. In the offline module, the images in the retrieval database are converted into fixed-length vectors. In this work, the trained model will generate a k-bit hash code for each fabric image in the retrieval database, and the generated hash code will be regarded as the index of the corresponding fabric image. The black line with arrows in Figure 6 indicates the process of the offline module.

The framework of the fabric image retrieval system.
The online module covers all the logic of the system from input to output. The retrieval system receives a query image and inputs it into the trained CNN model to generate the corresponding hash code. Finally, the system calculates the Hamming distances between the generated hash code and all indexes in the index library and outputs the relevant fabric images. The Hamming distance is computed by:
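The formula is not reproduced in this text. As a minimal sketch of the query step, the Hamming distance between two k-bit codes is the number of positions at which they differ; for codes with entries in {–1, +1} this equals (k – ⟨b_i, b_j⟩) / 2. The function names, image names, and 4-bit codes below are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def hamming_distance(b_i: Sequence[int], b_j: Sequence[int]) -> int:
    """Number of positions at which two equal-length codes differ."""
    if len(b_i) != len(b_j):
        raise ValueError("hash codes must have the same length")
    return sum(x != y for x, y in zip(b_i, b_j))


def retrieve(query_code: Sequence[int],
             index: Dict[str, Sequence[int]],
             top_n: int) -> List[Tuple[str, int]]:
    """Return the top_n indexed images with the smallest Hamming distance."""
    scored = [(name, hamming_distance(query_code, code)) for name, code in index.items()]
    scored.sort(key=lambda pair: pair[1])  # stable sort keeps insertion order on ties
    return scored[:top_n]


# Hypothetical index built by the offline module (4-bit codes for brevity).
index = {
    "fabric_a": [1, -1, 1, 1],
    "fabric_b": [-1, -1, 1, -1],
    "fabric_c": [1, -1, -1, 1],
}
query_code = [1, -1, 1, 1]
print(retrieve(query_code, index, top_n=2))  # [('fabric_a', 0), ('fabric_c', 1)]
```

A linear scan like this is enough to illustrate the ranking; a production index would typically pack the bits into integers and compare them with XOR and a population count.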
Experimental configuration
Dataset
To study fabric image retrieval, our previous works 32–37 established a fabric dataset named WFID, 33 which contains 82,073 fabric images. The images in this dataset are labeled from four views: (a) coarse texture, classified according to the presence, absence, or type of pattern on the fabric surface, with four categories: monochrome fabric, stripe fabric, lattice fabric, and patterned fabric; (b) fine texture, divided into three categories according to the weave structure of the fabric; (c) fabric style, which refers to the subjective feel of a fabric (monochrome fabric: dark and bright; others: casual and business); (d) the pattern forming method, distinguished according to its literal meaning. In this paper, the proposed CNN model and the compared learning-based methods are all trained on the training set of 33,645 fabric images, and the performance of all methods is evaluated on the validation set (consisting of 1029 sets of data, each of which is an image and the 20 most relevant images in the dataset). The images were captured in the RGB color model using a scanner (Canon 9000F Mark II). The light source of the scanner was a white light-emitting diode, which guarantees a stable capture environment, and the resolution was set to 200 dpi.
Implementation details
In this paper, the proposed retrieval framework and CNN are implemented using the PyTorch toolkit. The hardware environment is as follows: CPU = E5 2623V4@2.60 GHz, RAM = DDR4 32G, GPU = GeForce RTX 3090 (24G) × 2. All compared deep learning-based methods are implemented using the PyTorch toolkit with VGG-16 as the backbone, and the other methods, which are based on hand-crafted descriptors, are implemented using MATLAB 2018b.
During training, the hyper-parameter configuration is as follows: batch_size = 32, weight_decay = $5 \times 10^{-5}$, optimizer = Adam, and initial learning_rate = $1 \times 10^{-3}$. Specifically, the learning rate of the parameters of the convolutional layers inherited from VGG-16 is set to one-tenth of that of the subsequent layers, and the strategy for adjusting the learning rate can be denoted by:
Results and discussion
Parameter analysis
There are three parameters
Table 1 shows the results of the proposed method under different parameter configurations. In the proposed model, the quantitative loss serves as an auxiliary component, so it needs only a small weight. When its weight is too large, it degrades the learning of fabric image features. Therefore, this paper sets the weight of the quantitative loss to 0.001.
Results of mean average precision (mAP) and NDCG50 for different parameter configurations
Comparison with different settings
To prove the rationality of the proposed configuration, this study compared the performance of different network configuration settings on the WFID dataset. Specifically, the compared network configurations include: (a) entire VGG-16, abbreviated as EV; (b) shallow VGG-16 without short-circuit connection, abbreviated as SV; (c) entire ResNet-50, abbreviated as ER; (d) shallow ResNet-50, abbreviated as SR; (e) entire AlexNet, abbreviated as EA; (f) shallow AlexNet with short-circuit connection, abbreviated as SAS; (g) proposed setting, shallow VGG-16 with short-circuit connection. It is stated here that only the convolutional layer is different in the above configurations, and the fully connected layer and the hash layer are the same.
This study first compared the PR curves, mAP, and NDCG50 of the models mentioned above on the validation set, and the results are shown in Figure 7 and Table 2. The larger the area enclosed by the PR curve and the coordinate axes, the better the performance of the retrieval method. As shown in Figure 7, the performance gap between the different models is not large, which illustrates the rationality of the proposed network framework to a certain extent. Comparing EV, ER, and EA, it can be seen that the representation ability of ResNet is better than that of AlexNet and VGG. However, the performance of SV is stronger than that of SR and SA, which shows that the representation ability of VGG is more robust. In addition, after the short-circuit connection is added to the SV model (proposed), the performance of the model improves significantly, indicating that this structure helps shallow features pass to the deep layers without obstruction. It is for these reasons that this study chose shallow VGG-16 with short-circuit connection as the stem of the proposed framework.

Precision-recall curves of models with different configurations at four different code lengths: (a) 32 bits; (b) 64 bits; (c) 128 bits and (d) 256 bits.
The mAP and NDCG50 results of models with different configurations at four different code lengths.
Ablation study for listwise learning
The proposed method of fabric image retrieval mainly includes two technical components: short-circuit connection and listwise learning. The previous section has verified that the short-circuit connection component improves the model retrieval performance. Among image representation methods based on similarity learning, pairwise learning using triplet loss is the most commonly employed. The idea of triplet loss can be described as:
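The formula is not reproduced in this text. As a hedged sketch, a commonly used form of the triplet loss requires the anchor to be closer to a similar (positive) sample than to a dissimilar (negative) sample by at least a margin: max(0, d(a, p) – d(a, n) + margin). The Euclidean distance, the margin value, and the example vectors below are assumptions for illustration; some formulations use squared distances instead.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]
print(triplet_loss(anchor, positive, [3.0, 4.0]))  # 0.0 (negative is far enough away)
print(triplet_loss(anchor, positive, [0.0, 1.5]))  # 0.5 (margin is violated)
```

Each triplet only encodes a binary similar/dissimilar relation, which is the limitation that supervision over a ranked list of graded similarities is meant to address.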
This section conducts an ablation study to verify the superiority of listwise learning in the proposed method. In this experiment, pairwise and listwise learning methods are used to train the proposed CNN model. To make a fair comparison, all other configurations of the model are the same in the two training runs. The results of the comparative experiment are shown in Figure 8. The two learning methods differ little in mAP, and both achieve good performance, which shows that the proposed CNN model represents fabric images well. However, the NDCG50 scores of the two learning methods differ considerably, and listwise learning performs significantly better than pairwise learning. NDCG50 reflects how well the search algorithm orders its results. The experimental results show that listwise learning supervises the model with richer similarity information and thus achieves better retrieval performance.

Comparative experiment results of listwise learning and pairwise learning.
Comparison with several retrieval methods
To demonstrate the rationality of the proposed method, this section compares the retrieval performance of the proposed model with several retrieval methods, including three retrieval methods (1–3) based on similarity learning and four methods (4–7) dedicated to fabric image retrieval. A brief introduction to the compared methods and their implementation is as follows:
1. Improved deep hashing network (IDHN) 28 is the first deep hashing method that directly uses pairwise quantified similarity, which can reflect the fine-grained similarity between a pair of multilabel images for supervised learning. The authors implemented this model using TensorFlow. Following the code provided by the authors, this model was reimplemented using PyTorch, with VGG-16 configured as the basic network.
2. Central similarity quantization (CSQ) 38 was proposed to optimize the central similarity between data points. The authors developed the central similarity with CNNs to learn a hash function. This study directly uses the authors’ public code for comparison experiments, where the feature learning network is VGG-16.
3. Deep pairwise-supervised hashing (DPSH) 39 is a model that can perform simultaneous feature learning and hash code learning for applications with pairwise labels. The authors provided us with the code implemented in PyTorch. Similarly, this paper also uses VGG as the stem of its model.
4. Fabric retrieval using hierarchical search (FRHS) 32 employs a hierarchical search strategy that includes coarse-level retrieval and fine-level retrieval (our previous work). Its image representation is based on a deep sparse network driven by a classification task.
5. Fabric retrieval based on multitask learning (FRMT) 33 proposed a multitask learning framework to represent the fabric image (our previous work). The learning of this model is guided by four classification tasks.
6. Wool fabric retrieval based on CNN and ANN (FRCA). 36 This method is proposed for fabric images based on Fourier transform and local binary pattern (texture feature). This study implements this method using MATLAB.
7. Learning deep similarity models with focus ranking (FRDS). 40 This novel embedding method, termed focus ranking, can be easily unified into a CNN for jointly learning image representations and metrics in the context of fine-grained fabric image retrieval. This study implements this method using PyTorch.
The above methods are all trained on the training set of WFID. In the performance evaluation, each fabric image in the validation set is used as a query input to the retrieval system, which outputs a ranked list of retrieval results. This study calculates the mAP and NDCG50 of each retrieval from the output results, and the final result is the average of the mAP and NDCG50 values over all retrievals.
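The evaluation protocol above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact definitions: average precision is computed over a binary relevance list and normalized by the number of relevant items in that list, and NDCG uses the common gain 2^rel – 1 with a log2 rank discount, taking the graded relevance to be the soft similarity to the query. The ideal ordering is computed from the retrieved list only. All names and example values are hypothetical.

```python
import math
from typing import Sequence


def average_precision(relevant: Sequence[bool]) -> float:
    """Mean of precision@rank taken at each rank holding a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg(gains: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the same gains sorted in descending order."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Two hypothetical queries: graded relevance of each returned image, in rank order.
results = [[1.0, 0.75, 0.5, 0.0], [0.5, 1.0, 0.0, 0.75]]
print(mean([ndcg(gains, k=50) for gains in results]))
print(mean([average_precision([g > 0 for g in gains]) for gains in results]))
```

Because NDCG rewards placing the most similar fabrics first, two methods can have similar mAP and still differ clearly in NDCG50.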
The quantitative results are presented in Table 3. IDHN, CSQ, and DPSH all use pairs of similar or dissimilar images to drive the learning of image representation and coding. These methods simply define the relationship between two images as similar or dissimilar (hard similarity) and ignore partially shared annotations, so their NDCG50 scores are not high. The code length has little effect on FRHS because it uses a two-step search strategy; its poor retrieval performance arises because it uses only one dimension of label information. Comparing FRMT (previous work) with the method proposed in this study, the difference in mAP is not obvious, but NDCG50 is greatly improved. Although FRMT learns all the label information, it does not consider the relationship between labels of different dimensions. The newly proposed method defines a soft similarity and uses the soft similarity among a list of images to supervise the model in learning fabric representation and coding. As a result, the more similar a fabric is to the query, the higher it appears in the results, while results that are completely different from the query are pushed toward the end. This is also why the newly proposed method achieves a better NDCG50 score.
Comparison of experiment results with several retrieval methods.
This study also presents the retrieval performance of the compared methods on different classes of fabric images in Figure 9. The results show that all methods retrieve pure color, stripe, and lattice fabrics better than patterned fabrics, because the features in patterned fabric images are more complicated. Our method also performs significantly better than the other methods, especially on the NDCG evaluation index. The experimental results once again demonstrate the effectiveness and superiority of the proposed method for fabric image retrieval. Some retrieval samples are shown in Figure 10. Notably, the retrieval time for a single image with the proposed method is only 0.24 seconds.

Experimental results on different classification fabrics. D1, D2, D3, and D4, respectively, represent monochrome, stripe, lattice, and pattern fabric.

Some retrieval examples.
Conclusions
In this paper, a novel method for fabric image retrieval based on listwise learning was presented. To narrow the gap between the fabric images and annotations, a CNN with a compact structure and cross-domain connections is designed. Then, the soft similarity is defined to describe and quantify the relationship between paired fabric images. The listwise learning is introduced to train the proposed model. The objective function consists of three parts: listwise loss of features
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China under grant 61976105, in part by the National Key R&D Program of China under grant 2017YFB0309200, and in part by the Fundamental Research Funds for the Central Universities under grant JUSRP52007A.
