Incorporating convolutions into transformers for textile fiber identification from fabric images

Abstract

Simple, fast and effective fiber identification can help consumers purchase their desired apparel and help the industry conduct large-scale textile testing. This paper presents a transformer architecture incorporating convolutions to recognize fibers in textile surface images, which meets the above requirements. Firstly, a convolution operation is performed on textile images to pick up overlapping patches as tokens and the linear projections in transformer encoders are replaced by depth-wise separable convolutions to extract the fiber representations. Secondly, the multi-head cross-attention module enables each label embedding to be compared with features at each spatial location to locate and pool the corresponding fiber characteristics. Finally, a simplified asymmetric loss is introduced to further purify the extracted fiber features. Experiments demonstrate that the proposed approach provides a significant improvement in fiber identification accuracy over both state-of-the-art multi-label classification frameworks and fiber identification architectures.

Keywords

Fiber identification, non-destructive fiber recognition, pattern recognition, multi-label classification, convolutional encoder, multi-head cross-attention

With the pursuit of a high quality of life, consumers are increasingly concerned about the comfort and functionality of garments, such as skin-friendliness, wicking properties and sun protection.¹⁻³ The attributes of apparel are closely associated with the types of raw fibers.⁴,⁵ Clothes of 100% cotton have superior softness, moisture absorption and skin-friendliness, yet are prone to wrinkling. Clothing of 100% polyester provides excellent durability, sun protection and drapability, but poor water vapor permeability. Fabrics with various mixing ratios of cotton and polyester fibers have different properties.⁶,⁷ Therefore, simple and efficient identification of textile components can help consumers get their desired products.

The most common fiber identification methods are burning fibers, testing solubility, the staining test, microscope observation and photo discrimination.⁸⁻¹¹ They are widely used by various research units and fiber identification institutions because of their high accuracy in specific environments. All of these techniques, however, necessitate tearing textiles to collect fiber samples. In addition, they have drawbacks such as long testing times, high appraisal environmental requirements, strong human impact and an inability to conduct large-scale rapid fiber recognition.¹²,¹³ For these reasons, a considerable number of textiles are marketed with components that do not match their labels, putting consumers’ interests at risk.¹⁴ Hence, non-destructive technologies for quick fiber classification, such as infrared spectroscopy and image recognition, have been developed.¹⁵,¹⁶ The former classifies fibers by analyzing the spectrogram produced by the fibers’ absorption of infrared energy. Infrared spectroscopy necessitates prior knowledge of the types of garment fibers to be tested, and the cost of owning such equipment is very high, so this method is only employed by research institutes and customs. The latter identifies fibers with photographs taken directly on the textile surfaces by magnifiers or high-definition cameras. This image identification method without tearing the fabric and extracting the fibers to make samples is becoming a new research interest. Kampouris et al.¹⁶ collected a dataset of nine different classes of garment surfaces with portable photometric stereo sensors and distinguished fibers in single-component apparel images. Feng et al.¹³ built a textile surface image dataset with 50 $\times$ magnification and designed a multi-label classification network with a multi-branch DenseNet to identify fibers. Ohi et al.¹⁷ presented an Xception ensemble fiber recognition framework that performed well in single-component fabric classification.
All of the above research relies on garment surface image datasets with magnifications of 50 $\times$ or less, making it difficult for models to understand slender fibers, for example, about 10 µm in diameter for cotton fibers, 8 µm for silk and even tens of nanometers for chemical fibers. In addition, the inherent limitations of convolutional neural networks (CNNs) for small target recognition in multi-label classification¹⁸ led to unsatisfactory results for fiber recognition in garment surface images, especially for mixed fiber fabrics.

Many powerful multi-label classification schemes have emerged in recent years that may help identify textile fibers. CNNs and recurrent neural networks (RNNs) jointly characterize semantic label dependencies and image-label correlations,¹⁹,²⁰ graph convolutional networks (GCNs) model correlations between multiple labels²¹⁻²³ and transformers tackle complex dependencies between image features and target labels.²⁴⁻²⁶ However, the large intra-type and small inter-type variations of apparel fibers¹⁶ seriously affect previous models’ performances. Moreover, various curling, entanglements and overlapping between fibers as well as the small sample size make fiber classification more challenging. There are numerous efforts¹⁸,²⁷,²⁸ to improve models’ representation capabilities by combining CNNs with transformers in various ways to exploit the advantages of CNNs in collecting local information and transformers in capturing long-range dependencies. Peng et al.²⁹ extracted fiber picture features with a transformer encoder and near-infrared spectrogram feature with a CNN and classified five types of fibers.

A fiber identification framework that incorporates convolution into the transformer architecture is proposed. Firstly, overlapping patches are obtained by convolutional operations on the input textile images or feature maps with strides different from patch sizes. These patches are flattened into token sequences, which are reshaped back into two-dimensional (2D) token maps after layer normalization.³⁰ This not only extracts local information about fibers in the image but also gradually decreases the feature resolution (i.e., token number) while increasing the feature dimension (i.e., token width) across stages. On the token maps, the $k \times k$ depth-wise separable convolutions (DWSCs)³¹ replace the linear projections in transformer encoders³² to generate the vectors $query$ , $key$ and $value$ for the self-attention, further acquiring local spatial context and alleviating the semantic ambiguity. Secondly, a transformer decoder-based textile composition unmixing module is used for fiber classification. Spatial features from the fiber feature extractor as the $key$ and the $value$ along with label embeddings extracted by the multi-head self-attention without masks as the query are sent together to transformer decoders to perform multi-head cross-attention. This makes each label embedding adaptively pool the desired fiber features to predict the existence of the corresponding type of fibers by subsequent binary classification. Finally, a textile surface dataset at a magnification of 200 $\times$ has been collected with commercially available magnifiers and mobile devices, such as smartphones. In these images, the physical characteristics of fibers can be clearly observed, which helps the recognition algorithm understand fiber representations.

Proposed approach

This work aims to identify fibers from textile surface images with challenges such as bending, overlapping and entanglement between fibers. Since only the physical mixing and intertwining of fibers occurs in the process of making fibers into fabrics, their properties are usually retained. Cotton fibers are flat and ribbon-like, with irregular twists around the axis.³³ Rayon fibers have longitudinal straight grooves with bright lusters after dyeing.³⁴ Animal hair fibers are all curly and scaly, but there are some subtle differences among various hair types. For example, wool scales are irregularly shaped into tiles or oblique rings, while cashmere scales are arranged in circles. In addition, compared with wool scales, cashmere scales are more evenly arranged, less dense and have wider scale spacing. By mining these unique visual properties, the proposed architecture can significantly improve fiber recognition performance, even with the small sample size and imbalanced sample problems. As shown in Figure 1, transformer encoders fused with convolutions are employed to extract fiber spatial representations and transformer decoders to adaptively decode various types of fiber characteristics. Convolutions extract overlapping patch embeddings as tokens, which can prevent the information loss of 2D spatial features that appear in non-overlapping patches,³⁵ thus preserving substantial nuanced information about fine fibers, such as wool scale patterns. Using DWSCs³¹ instead of linear projections to generate $query$, $key$ and $value$ enables the framework to further acquire local spatial contexts, decrease computational complexity and diminish semantic ambiguity in self-attentions.²⁸,³⁶ Fiber features and label features learned separately by multi-head self-attentions are fed together to multi-head cross-attentions to locate fiber features for each label. Moreover, a simplified asymmetric loss³⁷ is introduced to further purify the extracted fiber representations.

Figure 1.

Illustration of the proposed fiber identification framework: (a) fiber spatial feature extraction module; (b) textile composition unmixing module; (c) the convolutional encoder and (d) the transformer decoder without masks. DWSC: depth-wise separable convolution.

Fiber feature extraction

The fiber feature extraction architecture is a hierarchical transformer encoder incorporating convolution, which consists of three stages, as shown in Figure 1(a). Given a garment image $I \in R^{H \times W \times 3}$ as input, a 7 $\times$ 7 convolution with a stride of 4 and an output channel of $C_{1}$ is performed to extract overlapping patches $F_{1} \in R^{H_{1} \times W_{1} \times C_{1}}$ , which are then flattened as tokens $F_{t 1} \in R^{H_{1} W_{1} \times C_{1}}$ , where $H \times W$ is the resolution of the current stage input, $H_{1} W_{1}$ is the number of tokens, $P$ is padding and $H_{1} \times W_{1}$ is height and width of the current output feature map, as in the following equation:

$$H_{1} = \left\lfloor \frac{H - 7 + 2P}{4} + 1 \right\rfloor, \quad W_{1} = \left\lfloor \frac{W - 7 + 2P}{4} + 1 \right\rfloor \tag{1}$$
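As a rough check of Equation (1), the following sketch propagates an input resolution through three token-embedding convolutions. The kernel sizes and strides follow the text (7 $\times$ 7 with stride 4, then 3 $\times$ 3 with stride 2 for the two later stages); the padding values of 2, 1 and 1 and all function names are assumptions made for illustration only.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial size after a convolution, following the form of Equation (1)."""
    return (size - kernel + 2 * padding) // stride + 1


# (kernel, stride, padding) of the token-embedding convolution in each stage.
# The paddings (2, 1, 1) are assumed values, not taken from the text.
STAGES = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]


def token_map_sizes(height: int, width: int) -> list[tuple[int, int]]:
    """Return (H_s, W_s) after each stage for an input of height x width."""
    sizes = []
    for kernel, stride, padding in STAGES:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        sizes.append((height, width))
    return sizes


print(token_map_sizes(320, 320))
```

Under these assumptions, a 320 $\times$ 320 input shrinks to 80 $\times$ 80, 40 $\times$ 40 and 20 $\times$ 20 token maps, so the number of tokens falls by a factor of four at each of the later stages.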

Tokens are normalized by layer normalization³⁰ and then reshaped back into 2D feature maps $F_{1} \in R^{H_{1} \times W_{1} \times C_{1}}$ . DWSC³¹ transforms these maps into new token maps, which are then flattened into sequences $query (Q)$ , $key (K)$ and $value (V)$ for the multi-head self-attention operation, replacing the position-wise linear projection in the standard encoder,³² as shown in Figure 1(c). The DWSC performs 3 $\times$ 3 depth-wise convolution, batch normalization and 1 $\times$ 1 point-wise convolution sequentially. The $query$ , $key$ and $value$ obtained by DWSC with stride size 1 perform multi-head self-attention operations in the standard encoder layer according to Equations (2) and (3) to extract spatial feature information about different fibers in the image. After layer normalization and multi-layer perceptron (MLP) processing, this information is output as the new feature map $F_{s 1} \in R^{H_{1} \times W_{1} \times C_{1}}$ :

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{C_{K}}}\right) V \tag{2}$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{Attention}_{1}, \dots, \mathrm{Attention}_{h}) W^{O} \tag{3}$$

where $C_{K}$ is the dimension of the $key$ and $W^{O}$ is the learnable weight parameter for the concatenated outputs from attention heads 1 to $h$. As illustrated in Figure 1(a), the convolution extracts tokens and the convolutional encoder captures fibers’ characteristic information together as “stage 1.” Taking the output feature map from the previous stage as input, a 3 $\times$ 3 convolution with a stride of 2 is employed to extract tokens and “stage 1” is repeated twice as “stage 2” and “stage 3.” These stages jointly generate a hierarchical representation similar to that of a CNN. Each stage gradually decreases the feature resolution (i.e., the number of tokens) while increasing the feature dimension (i.e., the width of tokens) through convolutional operations.
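Equation (2) can be checked on small inputs with a minimal single-head sketch in plain Python. The function names are illustrative, and the sketch deliberately omits the DWSC projections, the split into $h$ heads and the output weight $W^{O}$ of Equation (3).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head Attention(Q, K, V) = softmax(Q K^T / sqrt(C_K)) V.

    Q has shape (n_q, C_K), K has shape (n_k, C_K) and V has shape
    (n_k, C_V), all given as lists of rows.
    """
    scale = math.sqrt(len(K[0]))
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(C_K).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

When all keys are identical, the weights are uniform and each output row is the mean of the value rows; when one key matches the query far better than the others, the output approaches that key's value row.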

Fiber composition unmixing

Fiber classification is carried out in the transformer decoder, as shown in Figure 1(b). The output spatial features $F_{s 3} \in R^{H_{3} \times W_{3} \times C_{3}}$ from stage 3 pass through an additional linear projection layer that projects the features from dimension $C_{3}$ to dimension $C$. The projected features are flattened into $F \in R^{H_{3} W_{3} \times C}$ and undergo cross-attention with the $queries$ (label embeddings) $Q_{0} \in R^{N \times C}$ to locate and pool type-related characteristics in the transformer decoders, where $H_{3}$, $W_{3}$ and $C_{3}$ are the output height, width and channel of stage 3, respectively, $C$ is the desired $query$ dimension in the decoder and $N$ is the number of fiber types. Label (fiber type) features are learned from the previous label embeddings via multi-head self-attention without masks in the transformer decoder with Equation (4), and positional encodings are then added to form the $queries$ of the subsequent cross-attention, as shown in Figure 1(d):

$$Q_{i}^{(1)} = \mathrm{MultiHead}(\tilde{Q}_{i-1}, \tilde{Q}_{i-1}, Q_{i-1}) \tag{4}$$

where $Q_{i-1}$ represents the output of the previous decoder layer and $\tilde{Q}_{i-1}$ indicates that the positional encoding has been added; the function parameters of Equations (2) and (3) are omitted for simplicity. Then, the label features and the reshaped fiber spatial features from the transformer encoder (i.e., the fiber feature extractor) undergo multi-head cross-attention so that each label embedding, $Q_{i-1, n} \in R^{C}$, $n = 1, \dots, N$, can adaptively locate and pool the fiber characteristics of interest from $F$, as shown in Equation (5):

$$Q_{i}^{(2)} = \mathrm{MultiHead}(\tilde{Q}_{i}^{(1)}, \tilde{F}, F) \tag{5}$$

where $\tilde{F}$ (with the positional encoding added) and $F$ are both derived from the fiber features. Each label embedding $Q_{i}^{(2)}$ that gains better type-related characteristics is updated to $Q_{i}$ after layer normalization through a position-wise feed-forward network (FFN)³² according to Equation (6):

$$Q_{i} = \mathrm{FFN}(Q_{i}^{(2)}) = \max(0, Q_{i}^{(2)} W_{1} + b_{1}) W_{2} + b_{2} \tag{6}$$

where max (·) represents the rectified linear unit (ReLU) activation function, $W_{1}$ and $W_{2}$ are learned weight parameters and $b_{1}$ and $b_{2}$ are bias parameters. The label embedding $Q_{0}$ is updated layer by layer by the above procedure, and the contextual information of the input garment image is gradually added by cross-attention calculation. Independently constructing learnable parameters for each label makes the semantics of each label feature quite clear, which helps the $N$ types of fibers be better decoded in parallel in each layer.

Finally, the queried feature vectors $Q_{L} \in R^{N \times C}$ from the last decoder layer ( $L$ layer) are projected into logit values for each type of fiber $Q_{L, n} \in R^{C}$ in the linear projection layer, and then the existence of each type of fiber label is predicted by a sigmoid function based on the information obtained from Equation (7):

$$p_{n} = \mathrm{Sigmoid}(W_{n}^{T} Q_{L, n} + b_{n}) \tag{7}$$

where $W_{n} \in R^{C}$, $W = [W_{1}, \dots, W_{N}]^{T} \in R^{N \times C}$ and $b_{n} \in R$, $b = [b_{1}, \dots, b_{N}] \in R^{N}$ are parameters in the linear layer, and $p = [p_{1}, \dots, p_{N}]^{T} \in R^{N}$ are the predicted probabilities for each type of fiber.
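The per-label prediction of Equation (7) can be sketched as follows. Each label query is scored with its own weight vector and bias, rather than by one shared classifier. The function names are illustrative, and the 0.5 decision threshold is the value stated for the experiments.

```python
import math


def predict_fiber_probabilities(Q_L, W, b):
    """p_n = sigmoid(W_n . Q_{L,n} + b_n) for each of the N label queries.

    Q_L and W are lists of N vectors of length C; b is a list of N biases.
    """
    probs = []
    for q_n, w_n, b_n in zip(Q_L, W, b):
        logit = sum(w * q for w, q in zip(w_n, q_n)) + b_n
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


def present_fibers(probs, threshold=0.5):
    """Indices of fiber types whose predicted probability exceeds the threshold."""
    return [n for n, p in enumerate(probs) if p > threshold]
```

For example, with two label queries whose logits are 3 and -1, only the first fiber type is reported as present.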

Loss function

Although the types of fibers can be well discriminated through cross-attention operation in transformer decoders, the imbalanced fiber characteristics in mixed fabrics and the small sample size problems may interfere with fiber differentiation. To overcome the above issues, a simplified asymmetric loss function³⁸ is introduced, which works remarkably well at alleviating the long-tail data distribution in multi-label classification and performs excellently in experiments.

There are $N$ types of fibers in total, and an input textile image $I$ has the label vector $y_{I} = [y_{1}, \dots, y_{N}]$, where $y_{n} \in \{0, 1\}$, $n = 1, \dots, N$, is a binary label indicating whether image $I$ has fiber type $n$. In other words, $y_{n} = 1$ if the image $I$ has the $n$th type of fibers; otherwise, $y_{n} = 0$. The framework predicts the probability of the existence of each type of fiber in the image $I$, $p = [p_{1}, \dots, p_{N}]^{T} \in R^{N}$. Then, the loss of each training sample is evaluated with the simplified asymmetric loss:

$$L = \frac{1}{N} \sum_{n = 1}^{N} \begin{cases} (1 - p_{n})^{\gamma+} \log(p_{n}), & y_{n} = 1 \\ (p_{n})^{\gamma-} \log(1 - p_{n}), & y_{n} = 0 \end{cases} \tag{8}$$

In experiments, the hyperparameters γ+ and γ− were set to 0 and 1, respectively, by default. The loss of each sample in the training dataset is averaged to compute the total loss.
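A minimal per-sample sketch of Equation (8) with the default hyperparameters is given below. Two points are assumptions of this sketch rather than statements from the text: the sum is negated so that the returned value is a quantity to minimize, and a small `eps` clamp guards the logarithm against zero. The function name is illustrative.

```python
import math


def simplified_asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=1.0, eps=1e-8):
    """Per-sample value of Equation (8), negated so that lower is better.

    p: predicted probabilities p_1..p_N; y: binary labels y_1..y_N.
    eps is an assumed safeguard that keeps log() away from zero.
    """
    total = 0.0
    for p_n, y_n in zip(p, y):
        if y_n == 1:
            # Positive label: weight (1 - p_n)^gamma_pos on log(p_n).
            total += (1.0 - p_n) ** gamma_pos * math.log(max(p_n, eps))
        else:
            # Negative label: weight p_n^gamma_neg on log(1 - p_n).
            total += p_n ** gamma_neg * math.log(max(1.0 - p_n, eps))
    return -total / len(p)
```

With $\gamma- = 1$, a confidently rejected negative (small $p_{n}$) contributes much less than it would under plain binary cross-entropy ($\gamma- = 0$), which reduces the influence of the many easy negative labels on each sample.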

Experimental details

Dataset

To increase the diversity of samples in the study, some photos were taken from randomly purchased clothes and masks from online or brick-and-mortar stores, while more photos were captured in the field during visits to numerous fashion stores with magnifiers and mobile devices. Figure 2 depicts the process for photographing the garment surfaces. Firstly, as shown in Figure 2, a commercially available optical magnifier (about US$50) with a magnification of 200 $\times$ was connected to a mobile device such as a smartphone or a laptop via Wi-Fi. The mobile devices were employed to observe and save target images. Secondly, the magnifier was randomly placed at five different locations on each garment to take one image at a time. To minimize the effect of textile color, clothing pattern and fabric texture on fiber identification, the magnifier was randomly rotated clockwise or counterclockwise for each photograph, as shown in Figure 3. If the backside of a garment could be captured, five additional photos were taken in the same way to obtain a more varied distribution of fibers. Since various smart devices collected photos at varying resolutions, each clothing picture was then manually cropped under the supervision of textile experts to generate a 400 $\times$ 400 pixel image that retained the greatest amount of semantic information. The label information came from manufacturers’ content labels and textile experts’ identification reports, which included the weaving method (i.e., woven fabrics, knitted fabrics and non-woven materials), fiber type and composition ratio (e.g., cotton 60% and linen 40%).

Figure 2.

The equipment and sampling procedure: (a) a magnifier; (b) objective lens with a diameter of 2 mm surrounded by eight light-emitting diodes and (c) example of a magnifier connected to a smartphone via Wi-Fi for photographing textile fibers.

Figure 3.

Some samples taken from different directions: (a) frontside images of a knitted fabric with 100% wool; (b), (c) frontside and backside images of a woven fabric with 91% cotton, 8% polyester and 1% spandex; (d) backside images of a non-woven material with 100% polypropylene.

Some 24,125 images of 78 textile categories made of 28 types of fibers were collected, where fabrics composed of the same types of fibers in different mixing ratios belong to one category (e.g., textiles with the composition of cotton 75% and rayon 25% are in the same category as fabrics with the composition of cotton 40% and rayon 60%). The statistical distribution of samples is depicted in Figure 4, which highlights the significant imbalances in various types of fibers and textile categories.
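The category rule above can be expressed as a key that depends only on the set of fiber types present, alongside the multi-hot label vector used for multi-label classification. The sketch below is illustrative: the function names and the dictionary format for a composition are assumptions, not part of the dataset tooling.

```python
def textile_category(composition):
    """Map a composition such as {"cotton": 75, "rayon": 25} to a category key.

    The key depends only on which fiber types are present, not on their ratios.
    """
    return frozenset(fiber for fiber, percent in composition.items() if percent > 0)


def label_vector(composition, fiber_types):
    """Multi-hot vector with 1 for every fiber type present in the textile."""
    present = textile_category(composition)
    return [1 if fiber in present else 0 for fiber in fiber_types]
```

Two cotton/rayon blends with different ratios therefore share one category and one label vector, while a 100% cotton fabric falls into a different category.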

Figure 4.

Statistical information on the collected textile dataset: (a) statistics of the type of fibers in each sample; (b) occurrence frequency of each type of fiber in all textile samples and (c) textile sample number for each textile category.

There are 8.32% single-component textiles, 34.47% two-component textiles, 46.51% three-component textiles, 7.96% four-component textiles and 2.74% other-component textiles, while some components in multi-component textiles are less than 5%. In addition, the number of different types of fibers and textile categories varies greatly, and even some types of fibers or textile categories appear in only a few dozen garment images. The types of fibers in Figure 4(b) are acetate, alpaca, acrylic, bamboo fiber, camel hair, cashmere, cotton, cuprammonium, hemp, kapok fiber, linen, lyocell, milk fiber, modal, nylon, polyester, polyethylene, polylactic acid, polypropylene, rabbit hair, ramie, rayon, silk, soybean, spandex, vinylon, wool and yak hair.

Evaluation metrics

The fiber identification performance of various frameworks was evaluated with the mean average precision (mAP), per-class precision (CP), recall (CR) and F1-measure (CF1) and the overall precision (OP), recall (OR) and F1-measure (OF1). These metrics are calculated as follows:

$$\begin{gathered} CP = \frac{1}{C} \sum_{i} \frac{N_{i}^{c}}{N_{i}^{p}}, \quad CR = \frac{1}{C} \sum_{i} \frac{N_{i}^{c}}{N_{i}^{g}}, \quad CF1 = \frac{2 \times CP \times CR}{CP + CR}, \\ OP = \frac{\sum_{i} N_{i}^{c}}{\sum_{i} N_{i}^{p}}, \quad OR = \frac{\sum_{i} N_{i}^{c}}{\sum_{i} N_{i}^{g}}, \quad OF1 = \frac{2 \times OP \times OR}{OP + OR} \end{gathered} \tag{9}$$

where $C$ is the number of labels, $N_{i}^{c}$ is the number of images predicted correctly for the $i$th fiber type, $N_{i}^{p}$ is the number of images predicted for the $i$th type and $N_{i}^{g}$ is the number of ground truth images for the $i$th type. Among them, mAP, CF1 and OF1 are relatively more comprehensive and important metrics. Since different thresholds may affect the metric values, the threshold was set to 0.5 in all experiments.
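The per-class and overall metrics can be computed from binary ground-truth and prediction matrices, where the predictions are obtained by thresholding the probabilities at 0.5. The sketch below is a plain-Python illustration; the function name and the convention of returning 0 for an empty denominator are assumptions.

```python
def multilabel_metrics(y_true, y_pred):
    """Compute CP, CR, CF1, OP, OR and OF1 from binary label matrices.

    y_true and y_pred are lists of per-image lists of 0/1 values,
    with one column per fiber type.
    """
    n_labels = len(y_true[0])
    correct = [0] * n_labels    # N_i^c: correctly predicted images per type
    predicted = [0] * n_labels  # N_i^p: predicted images per type
    truth = [0] * n_labels      # N_i^g: ground-truth images per type
    for t_row, p_row in zip(y_true, y_pred):
        for i, (t, p) in enumerate(zip(t_row, p_row)):
            correct[i] += t * p
            predicted[i] += p
            truth[i] += t

    def ratio(a, b):
        return a / b if b else 0.0  # assumed convention for empty denominators

    def f1(precision, recall):
        return ratio(2 * precision * recall, precision + recall)

    cp = sum(ratio(c, p) for c, p in zip(correct, predicted)) / n_labels
    cr = sum(ratio(c, g) for c, g in zip(correct, truth)) / n_labels
    op = ratio(sum(correct), sum(predicted))
    overall_recall = ratio(sum(correct), sum(truth))
    return {"CP": cp, "CR": cr, "CF1": f1(cp, cr),
            "OP": op, "OR": overall_recall, "OF1": f1(op, overall_recall)}
```

The per-class metrics average the ratios over fiber types, so rare types count as much as common ones, whereas the overall metrics pool the counts first and are dominated by frequent types.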

Experimental setup

For all experiments, the input textile surface images were resized to 320 $\times$ 320. Some 80% of the images in the dataset were selected for training and 20% for testing. Brightness change, contrast change, rotation and flipping were used for data augmentation during training. Due to the small inter-type differences in fiber features, CvT-W24²⁸ without class tokens was adopted as the fiber feature extraction backbone, with the stride size of convolutional projection for $query / key / value$ adjusted to 1, padding to 2 and other parameters to their default values. The output fiber feature size is 20 $\times$ 20 $\times$ 1024, so $C = C_{3} = 1024$ is set in experiments. After adding positional encoding and reshaping, the fiber features are sent to the component decoding module, which employs two transformer decoder layers without attention masks for label feature updating. Each test architecture is initialized with ImageNet-trained weights and then further trained on the textile dataset. The architecture was trained for 100 epochs employing the AdamW³⁹ optimizer, with a batch size of 64, weight decay of 10⁻³, hyperparameters β1 of 0.9 and β2 of 0.9999 and a learning rate of 10⁻⁵.

Results and discussion

Results of fiber identification

The proposed approach was compared with the three frameworks specifically for identifying fibers in garment surface images¹³,¹⁷,²⁹ and the state-of-the-art multi-label classification methods, including GCN-based models²³,⁴⁰,⁴¹ and transformer-based models.¹⁸,²⁴,²⁵

As shown in Figure 4, the types of fibers in the collected dataset are remarkably imbalanced. To mitigate the challenges of imbalanced fibers and small sample sizes, only fibers that make up more than 5% of a fabric's composition and occur in at least 10 different textiles were identified in experiments. Table 1 shows the fiber identification results of various architectures for 22 types of fibers in textile surface images at 200 $\times$ magnification. Notably, the presented framework outperforms its competitors in the main metrics mAP, CF1 and OF1. For example, its mAP is 3.2 percentage points higher than that of ISiC-Net and 3.5 percentage points higher than that of CU-Net, indicating that the proposed strategy is more suitable for identifying textile fibers.

Table 1.

Comparison of state-of-the-art multi-label image classification models on the collected textile dataset; the best results are shown in bold

Model	mAP	CP	CR	CF1	OP	OR	OF1
MS-CMA²³	73.9	73.4	64.7	68.8	76.3	67.8	71.8
SSGRL⁴¹	74.2	77.3	60.1	67.6	79.8	63.9	70.5
ML-GCN⁴⁰	74.6	75.4	63.2	68.7	78.3	66.9	72.2
TDRG²⁴	75.9	76.5	62.7	68.9	79.5	67.2	72.8
THFuse¹⁸	76.5	78.1	64.3	70.6	80.8	66.4	72.9
M3TR²⁵	76.7	77.5	66.2	71.4	80.1	66.7	72.8
FabricNet¹⁷	75.6	76.8	62.4	68.9	78.4	67.6	72.6
CU-Net¹³	76.9	77.4	68.7	72.8	78.2	72.2	75.1
ISiC-Net²⁹	77.2	80.9	63.4	71.1	81.7	69.8	75.3
Ours	80.4	80.7	70.2	75.1	81.9	72.7	76.9

mAP: mean average precision; CP: per-class precision; CR: per-class recall; CF1: per-class F1-measure; OP: overall precision; OR: overall recall; OF1: overall F1-measure.

The occurrence frequency of fiber types in the new dataset is shown in Figure 5(a), which excludes fibers that occur in fewer than 10 textiles and those with less than 5% composition. Fiber types 1 to 22 are alpaca, acrylic, bamboo fiber, camel hair, cashmere, cotton, hemp, kapok fiber, linen, lyocell, modal, nylon, polyester, polylactic acid, polypropylene, rabbit hair, ramie, rayon, silk, spandex, wool, and yak hair. The per-type precision and the per-type recall of the presented framework on the new dataset are depicted in Figure 5(b), which illustrates that the proposed approach performs well even on types of fibers with small sample sizes. The pink circles mark the types of fibers with small sample sizes (i.e., those with fewer than 500 samples). Blue squares mark the three types of fibers with the lowest F1 values, and red triangles mark the three types of fibers with the highest F1 values. Type 16 (rabbit hair), type 17 (ramie) and type 21 (wool) with higher F1 values are natural fibers with unique visual characteristics, while type 10 (lyocell fiber), type 11 (modal fiber) and type 19 (spandex) with lower F1 values are man-made fibers. Lyocell fibers (type 10) with smooth surfaces and modal fibers (type 11) with grooves are confused with polyester fibers (type 13) and rayon fibers (type 18), respectively, which have similar surface characteristics and much larger sample sizes. The worst-performing spandex fiber (type 19) is mostly used in textiles covered by other fibers because of its high elasticity (elongation up to 700%),⁴²,⁴³ which makes its recognition more challenging.

Figure 5.

Performance of the presented method on each type of fiber in the collected textile dataset: (a) occurrence frequency of each fiber type in the new dataset and (b) the per-type precision and recall of the proposed method (color online only).

Ablation study

Ablation experiments were carried out on the textile surface dataset to evaluate the contribution of each component in the proposed framework to fiber identification. The framework consists of the CvT network to extract fiber features from garment images, the transformer decoder to localize and pool fiber features for each label and the simplified asymmetric loss function to further purify the extracted fiber representations.

Firstly, the performance of different backbones for extracting fiber features was compared, that is, only the fiber feature extraction backbone in the architecture was replaced without changing the other components. The frameworks with the various fiber feature extractors were retrained, and their classification results are shown in Table 2. As a fiber feature extractor, CvT-w24, which incorporates convolutions into the transformer architecture, outperforms both the convolutional network (ResNet101) and the transformer framework (Swin-L) in the main metrics mAP, CF1 and OF1 because the CvT captures both local and global information better when identifying slender fibers.

Table 2.

Performance comparison of the proposed architecture with different fiber feature extractors

Extractors	mAP	CP	CR	CF1	OP	OR	OF1
ResNet101	77.4	78.6	68.4	73.1	79.8	70.4	74.8
Swin-L	79.6	80.8	68.6	74.2	82.5	70.8	76.2
CvT-w24	80.4	80.7	70.2	75.1	81.9	72.7	76.9

mAP: mean average precision; CP: per-class precision; CR: per-class recall; CF1: per-class F1-measure; OP: overall precision; OR: overall recall; OF1: overall F1-measure.

Then, ablation experiments were performed on other components with CvT-w24 as the fiber feature extractor, as shown in Table 3. The Dec represents that the multi-head cross-attention module in the transformer decoder is utilized to match label embeddings and fiber representations, while ASL means that a simplified asymmetric loss function is employed to purify the fiber features. The baseline has a 0.8% improvement in mAP accuracy due to ASL, indicating that the ASL helps to purify the extracted fiber representation. The mAP accuracy of fiber recognition with the Dec in the baseline is substantially improved by 3.2%, while the mAP of the proposed method is improved by 3.7% compared to that of the baseline. The above results prove the practicality of the strategy to differentiate different types of fibers in an image by exploiting the multi-head cross-attention mechanism in transformer decoders, which allows each label to adaptively locate fiber features and pool the desired features. Figure 6 illustrates the test loss curves of the framework with different components. The loss of the proposed method decreases faster than those of the other approaches, and the loss curve of the method is steadier than that of baseline (CvT-w24), which further validates the effectiveness of the multi-head cross-attention (Dec) and ASL.

Table 3.

Ablation study for different components

Methods	Dec	ASL	mAP	CF1	OF1
Baseline (CvT-w24)			76.7	69.8	71.9
		✓	77.5	70.3	72.1
	✓		79.9	74.7	76.4
	✓	✓	80.4	75.1	76.9

Dec represents that the multi-head cross-attention module in the transformer decoder is utilized to match label embeddings and fiber representations, while ASL means that a simplified asymmetric loss function is employed to purify the fiber features.

mAP: mean average precision; CF1: per-class F1-measure; OF1: overall F1-measure.

Figure 6.

Test loss curves of presented architecture with different components. Dec represents that the multi-head cross-attention module in the transformer decoder is utilized to match label embeddings and fiber representations, while ASL means that a simplified asymmetric loss function is employed to purify the fiber features.

Visualization of attention maps

The top five attention windows of some cross-attention maps in the last layer of the architecture were visualized and three of them were zoomed in for a better observation of fiber characteristics, as shown in Figure 7. The ground truth labels (queries) for raw textile images are displayed in the text above the images. These windows showed that the presented framework could approximately locate the unique properties of fibers, such as the scales of wool, the smooth surface of polypropylene and the twists of cotton. This proved the effectiveness of the presented fiber identification approach, which incorporated convolutions into the transformer encoder to extract features of slender fibers and employed the multi-head cross-attention mechanism in transformer decoders to locate and pool the desired characteristics for each label embedding.

Figure 7.

Visualization of some samples.

Conclusions

This paper presents a textile fiber identification framework that incorporates convolutions into the transformer architecture to identify multiple types of fibers at once by just processing textile surface images. Experiments demonstrate that using overlapping image patches as input tokens and DWSCs instead of linear projections in the transformer encoder extracts richer fiber characteristics, and that the multi-head cross-attention module in the transformer decoder effectively lets each label embedding query the presence of its fiber type and pool type-related fiber features. The proposed method enables the simple, fast and effective automatic identification of fibers without damaging the fabric, which is of significant importance for improving productivity and production efficiency in the textile industry. Firstly, it identifies fibers without tearing the fabric or using chemical reagents, thereby saving resources and reducing environmental impact. Secondly, the recognition algorithm automatically extracts fiber features from fabric surface images and performs fiber classification, reducing the need for manual operations and the occurrence of human errors. Thirdly, it can instantly recognize the fibers in the captured fabric surface photos, significantly reducing the time required for fiber testing. Fourthly, it can discriminate various types of fibers in the fabric at once, which improves the efficiency of fiber discrimination. Finally, it has excellent versatility in the textile industry, such as for real-time fiber identification for ordinary consumers purchasing textiles, automatic detection of fabric defects in textile fabrication by textile manufacturers and large-scale fiber composition classification for customs or textile testing companies.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The Fundamental Research Funds for the Central Universities (No. 2232023Y-01), the National Natural Science Foundation of China (Grant No. 61972081), and the Natural Science Foundation of Shanghai (Grant No. 22ZR1400200).

ORCID iD

Luoli Xu
