Attribute-guided and attribute-manipulated similarity learning network for fashion image retrieval

Abstract

Learning the similarity between fashion items is essential for many fashion-related tasks. Most methods based on global or local image similarity cannot meet fine-grained, attribute-related retrieval requirements. We are the first to clearly distinguish the concepts of attribute names and their values, and we divide fashion retrieval tasks that combine images and text into two categories: attribute-guided retrieval and attribute-manipulated retrieval. We propose a hierarchical attribute-aware embedding network (HAEN) that takes images and attributes as input, learns multiple attribute-specific embedding spaces, and measures fine-grained similarity in the corresponding spaces. It can accurately map different attributes to the corresponding areas of the image, thereby facilitating the feature fusion of the two modalities, text and image, including enhancement and replacement. On this basis, we propose three attribute-manipulated similarity learning methods: HAEN_Avg, HAEN_Rec, and HAEN_Cmb. With comprehensive validation on two real-world fashion datasets, we demonstrate that our methods can effectively leverage semantic knowledge to improve image retrieval performance on both attribute-guided and attribute-manipulated retrieval tasks.

1. Introduction

Fashion image retrieval [1, 2, 3, 4] refers to retrieving images that meet the user’s search intent. Traditional image retrieval systems only allow users to use text or image queries to express their search intent. It is difficult for users to describe their search intention through a single textual query in actual scenarios. Meanwhile, it is also difficult for users to find ideal images to express their intention accurately. In some cases, users want to search for fashion items with similar designs instead of roughly the same or similar items. There is also a situation where users want to query images similar to a given image but have other characteristics. Figure 1 shows these two different requirements of users for image retrieval tasks. The first two are based on a given image and an attribute and query other similar images with the same attribute value, such as neckline design in (a-1) and sleeve design in (a-2). The latter two are based on a given image and an attribute value and query similar images with the new attribute values, such as red/flowers in (b-1) and mini blue in (b-2). These detailed requirements need to combine the text description of the image and the image itself in the search.

Figure 1.

Four examples of composing text and image for image retrieval. The query conditions are listed on the left, and the results that meet user requirements are on the right.

With the development of deep learning, deep neural networks have been widely used in clothing retrieval tasks and have achieved remarkable results. Most proposed methods learn a joint embedding space so that item similarity can be measured by the distance between items in that space [2]. However, such coarse-grained methods are usually affected by occlusion, cropping, or different views between the original and the target images [3].

Composing text and image for image retrieval (CTI-IR) or conditional image retrieval is a new yet challenging task [4, 5, 6]. The input query is not the conventional image or text but a composition, i.e., a reference image and its corresponding modification or enhancement text. The text gives additional conditions which describe the semantic modification or enhancement from the query image to the target gallery images. The main challenge comes from the semantic fusion of image and text, i.e., the feature mapping and fusion of two types of different modal data [7].

Therefore, this paper will focus on understanding the semantics between text and images and design a multi-modal similarity learning network for two types of retrieval tasks. The main contributions of this paper are summarized as follows:

(1) To the best of our knowledge, we are the first to clearly distinguish the concepts of attribute names and their values, which will facilitate future related research. In this paper, we refer to them as attributes and attribute labels, respectively. We also divide fashion retrieval that combines text and image into two categories: attribute-guided and attribute-manipulated fashion retrieval.

(2) We propose a hierarchical attribute-aware embedding network (HAEN) for semantic mapping between images and attributes. Given an attribute, HAEN can accurately map it to the corresponding areas of the image, thereby facilitating the feature fusion of the two modalities, text and image, including enhancement and replacement.

(3) We also propose three similarity learning methods based on HAEN for attribute-manipulated image retrieval tasks: HAEN_Avg, HAEN_Rec, and HAEN_Cmb. HAEN_Cmb is a fusion of the first two approaches, taking full advantage of their strengths to obtain better results than the baseline. We are the first to propose a network that unifies the two tasks, i.e., attribute-guided and attribute-manipulated retrieval.

(4) Extensive experiments conducted on two real-world datasets validate the superiority of our network.

The remainder of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 details the proposed network HAEN and its three variants. The experimental results and analysis are presented in Section 4, followed by the conclusion and future work in Section 5.

2. Related work

2.1 Fashion retrieval

Content-based image retrieval methods have attracted wide attention due to their convenience and accuracy [8, 9]. Most of the current research works on fashion retrieval follow global similarity computation and matching, e.g., they use CNN (convolutional neural network) or other deep learning networks to aggregate local features into a single global representation and then perform similarity calculations [10, 11]. To obtain local features of images and pay attention to the fine-grained similarity, Veit et al. [12] proposed to use a set of masks to select the embedding dimensions related to specific attributes and calculated the similarity based on the mask embedding vector. Kuang et al. [3] proposed a graph reasoning network to build a similarity pyramid, which represents the similarity by considering both global and local features. However, retrieval that only relies on images will be affected by certain factors such as occlusion and cropping. Attribute labels play an important role in fine-grained retrieval [13]. Ma et al. [14] proposed a fashion similarity learning network and used both image and the attributes to specify user’s query intent to perform attribute-guided fine-grained retrieval. This idea also attracted the attention of other researchers [5, 15]. How to perform feature fusion between text attributes and images, and map images to different attribute-related embedding spaces is significant for accurate retrieval, and it is still worthy of further research. Therefore, this paper will use multi-modal data (text $+$ images) to guide fine-grained retrieval tasks.

2.2 Composing text and image for image retrieval

Although CNN and related deep learning networks are widely used to learn the visual features of images, they cannot efficiently compose visual representations and natural-language semantics. Composing image and text for image retrieval incorporates user feedback into the image retrieval task to guide or modify the retrieval results according to user expectations. Vo et al. [7] introduced a residual gating operation to fuse image and text embeddings, while Chen et al. [16] concatenated the text embedding at multiple layers of the image CNN to extract the composite representations. To find a more fine-grained semantic relationship between text and image, Jandial et al. [5] introduced a pyramid gated fusion mechanism that uses CNN-based hierarchical visual representations across different abstractions to generate fine-grained visio-linguistic embeddings and leverages their coarse hierarchy structure to learn the final compositional representation. Wen et al. [4] designed two composition modules, a fine-grained local-wise module and a global-wise module, and used a mutual enhancement module to promote knowledge sharing between them. Despite the outstanding progress of these studies, they do not consider the effect of attribute relationships on identifying fine-grained differences in images. In this paper, we study fine-grained semantic associations between attributes and images for text-conditioned image retrieval.

2.3 Attention mechanism in multi-modal similarity learning

In the field of text $+$ image multi-modal learning, the attention mechanism is an indispensable technology. Ji et al. [9] proposed a tag-based attention mechanism and a context-based attention mechanism to improve the performance of cross-domain retrieval of fashion images. Li et al. [17] proposed a joint attribute detection and visual attention framework for clothes image captioning. Veit et al. [12] designed a fine-grained similarity embedding space by selecting the dimensions of the global embedding space through the mask corresponding to the specific attribute. Wei et al. [18] first extracted salient image regions and sentence tokens using R-CNN, and then learned fine-grained relationships between segments by applying self-attention and multi-modal cross-attention for image and sentence matching. Most notably, Ma et al. [14] proposed an attribute feature embedding network, which learns attribute-based embeddings in an end-to-end manner to measure the attribute-specified fine-grained similarity of fashion items and achieves state-of-the-art performance. It introduces two modules, attribute-aware spatial attention (ASA) and attribute-aware channel attention (ACA). Inspired by these works, we also use two attribute-aware attention modules for fine-grained similarity learning and propose a visual-textual attention learning scheme that learns the interactive attention on visual and attribute features. Unlike previous work, our approach relies mainly on spatial attention and channel attention to extract semantic features, avoiding the reliance on pre-trained object detectors like R-CNN, and is thus well suited for fine-grained visual search of fashion images.

3. Methodology

3.1 Problem formulation

Definition 1. Fashion entity, image, attribute and attribute label (tag). A fashion entity refers to a singular, identifiable, and separate fashion object. Generally, a fashion entity consists of an image and a set of attribute labels. Attributes are used to describe the characteristics of an entity in a certain aspect, such as “sleeve length”. We use the set $A=\{a\}$ to represent the attribute set. The space formed by the attributes is called the attribute space. The attribute label (tag) refers to the value of the attribute and describes its specific characteristic, such as “3/4 sleeves” or “sleeveless”. Each attribute $a_{i}\in A(1\leqslant i\leqslant|A|)$ corresponds to a label set $B_{i}=\{b_{i}^{1},b_{i}^{2},\ldots\}$ , and the label sets of different attributes may differ in size. Table 1 lists three commonly used attributes and their corresponding attribute labels.

Table 1
Examples of attributes and their corresponding attribute labels

Attributes	Attribute labels (values)
Collar design	Shirt collar, peter pan, puritan collar, rib collar
Neckline design	Strapless neck, deep V neckline, straight neck, V neckline
Sleeve length	Sleeveless, cup sleeves, short sleeves, elbow sleeves, 3/4 sleeves

Definition 2. Global similarity and local similarity [3] for image retrieval. The image retrieval task is described as follows. Given one image query $x$ and one gallery set $G=\{y\}$ , it computes the similarities between $x$ and each $y$ and ranks them. $x=\{x^{i}\}$ and $y=\{y^{i}\}$ , where $x^{i}\in\mathbb{R}^{C\times 1}$ and $y^{i}\in\mathbb{R}^{C\times 1}$ are the local $i^{\text{th}}$ features of the query image and the target one respectively. Global similarity is defined as

$\displaystyle\textit{Sim}(x,y)=S_{g}(G(x),G(y))$ (1)

where $G(\cdot)$ is the aggregation function, and $S_{g}(\cdot,\cdot)$ is the scalar global similarity function.

The aggregation function is usually the average pooling or max-pooling operator, and the similarity function is often realized by the cosine similarity or Euclidean distance. However, the aggregation function might aggregate noisy features such as cluttered background, other objects, or regions that can be observed only in the query or only in the gallery image because of occlusion, cropping, or different views. Therefore, local similarity, which considers only the characteristics of one or a few particular aspects, is also used. The local similarity is defined as

$\displaystyle\textit{Sim}(x,i,y)=S_{l}(x^{i},y^{i})$ (2)

where $S_{l}(\cdot,\cdot)$ is the scalar local similarity function.
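As a rough illustration of Eqs. (1) and (2), the sketch below uses average pooling for $G(\cdot)$ and cosine similarity for both $S_{g}$ and $S_{l}$. These are two of the common choices mentioned above rather than a prescribed configuration, and the function names and toy feature values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_pool(local_feats):
    """G(.): average a list of C-dimensional local features into one vector."""
    n = len(local_feats)
    return [sum(col) / n for col in zip(*local_feats)]

def global_similarity(x, y):
    """Eq. (1): compare the pooled representations of two images."""
    return cosine(average_pool(x), average_pool(y))

def local_similarity(x, y, i):
    """Eq. (2): compare only the i-th local features."""
    return cosine(x[i], y[i])

# Toy images with two local features each (C = 2); values are made up.
x = [[1.0, 0.0], [0.0, 1.0]]
y = [[1.0, 0.0], [1.0, 0.0]]

sim_global = global_similarity(x, y)
sim_local_0 = local_similarity(x, y, 0)
sim_local_1 = local_similarity(x, y, 1)
```

In this toy case the two images agree perfectly on local feature 0 and not at all on local feature 1, while the pooled comparison blends both into a single intermediate score.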

Definition 3. Attribute-guided image retrieval [14]. Attributes describe the characteristics of the entity in a certain aspect, so attribute-guided image retrieval can reflect the local similarity of two images. Given an image $x$ and a specific attribute $a$ , an attribute-guided feature vector $\mathcal{H}(x,a)\in\mathbb{R}^{C\times 1}$ is learned to reflect the characteristics of the corresponding attribute in the image $x$ . Therefore, for two fashion images $x$ and $y$ , the fine-grained fashion similarity w.r.t. the attribute is expressed by the cosine similarity between $\mathcal{H}(x,a)$ and $\mathcal{H}(y,a)$ . The attribute-guided image retrieval will return the top-k results with higher similarity.

$\displaystyle\textit{Sim}(x,a,y)=S_{a}(\mathcal{H}(x,a),\mathcal{H}(y,a))$ (3)

where $S_{a}(\cdot,\cdot)$ is the scalar similarity function, here the cosine similarity.

Definition 4. Attribute-manipulated image retrieval [7]. It is also known as text conditioned image retrieval (TCIR), which can be formally defined as follows. Given a multi-modal query of a reference image $x$ and its modification attribute label $b$ , we need to retrieve its corresponding target image from a set of gallery images. Suppose there is a set of triplets, denoted as $\mathcal{D}=\{(x,b,y)_{i}\}_{i=1}^{N}$ , where $x$ is the reference image, $b$ is the modification text, $y$ is the target image, and $N$ is the total number of triples. We aim to learn the latent space where the representation of the multi-modal query $(x,b)$ and that of the target image $y$ should be close. The similarity is expressed as

$\displaystyle\textit{Sim}(x,b,y)=S_{b}(\mathcal{T}(x,b),F(y))$ (4)

where $S_{b}(\cdot,\cdot)$ is the scalar global similarity function, $\mathcal{T}(\cdot)$ represents the transformation for mapping the multi-modal query to the latent space, and $F(\cdot)$ denotes that for the target image. The attribute-manipulated image retrieval will return the top-k targets with the greatest similarity.

Problem Statement. In this paper, we build a network to deal with these two kinds of image retrieval tasks. When the input is an image $x$ and an attribute $a$ , it returns images that have the same label as $x$ on the attribute $a$ and are similar to $x$ ; when the input is an image $x$ and an attribute label $b$ , it returns images that have label $b$ and are similar to $x$ .

3.2 HAEN

Figure 2.

The proposed HAEN consists of five key modules: (a) feature extraction, (b) hierarchical attribute embedding (HAE), (c) attention, (d) mask, and (e) embedding branch.

Figure 2 is an overview of our proposed hierarchical attribute-aware embedding network, which contains five modules: feature extraction module, HAE module, attention module, mask module, and embedding branch module. First, we adopt a CNN model such as ResNet pre-trained on ImageNet [19] as the backbone network for feature extraction. To retain the spatial information of the image, we remove the last fully connected (FC) layer of the pre-trained CNN model. The image feature is represented as $I\in\textrm{R}^{c\times h\times w}$ , where $h\times w$ is the size of the feature map and $c$ is the number of channels. For a specific attribute $a$ , we use a hierarchical attribute embedding vector $E(a)$ to represent it.

3.2.1 Attribute and hierarchical relation embedding

The categories and attributes of fashion products provide clues to identify similarities in fashion products. As shown in Fig. 2, tops are associated with attributes such as “sleeve length” and “neckline”, while pants do not have these attributes. When we represent the attributes with one-hot vectors, the distance between sleeve length and neckline and that between sleeve length and pant length are the same, which is not consistent with the perception of human experts. We therefore need to encode the relationships between attributes; this adds information to the original input and helps the model learn how attributes are related.

Based on the knowledge of fashion experts, we represent fashion attributes with a hierarchical structure. These attributes come from two fashion datasets, FashionAI and DARN. They form an attribute tree, where each leaf node denotes an attribute value and each parent node denotes an attribute name or category. Some parent nodes contain multiple child nodes, and sibling nodes share the characteristics of their parent node, e.g., “sleeve” and “neck” are sibling nodes, and their parent node is “up”. We introduce an operator “ $<$ ” to describe the hierarchical relationship between attributes. Assume that $p$ and $s$ are two nodes in the attribute tree; $p<s$ means that $p$ is an ancestor of $s$ in the tree. We then define $e_{p,s}=1$ if $p<s$ or $p=s$ , and $e_{p,s}=0$ otherwise. The hierarchy embedding of attribute $a$ is defined as

$\displaystyle E(a)=[e_{1,a},\ldots,e_{Q,a}]$ (5)

where $Q$ is the number of nodes in the attribute tree.
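The construction of $E(a)$ in Eq. (5) can be sketched as follows. The tree here is a hypothetical toy example: only the “up”, “sleeve”, and “neck” nodes come from the text, while the “bottom” and “pant length” nodes, the node ordering, and the function names are assumptions made for illustration.

```python
def ancestors_or_self(node, parent):
    """Return the node together with all of its ancestors in the tree."""
    chain = set()
    while node is not None:
        chain.add(node)
        node = parent.get(node)  # None once the root has been passed
    return chain

def hierarchy_embedding(attribute, nodes, parent):
    """Eq. (5): e_{p,a} = 1 if p is an ancestor of a or p == a, else 0."""
    chain = ancestors_or_self(attribute, parent)
    return [1 if p in chain else 0 for p in nodes]

def hamming(u, v):
    """Number of positions where two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

# Hypothetical toy tree with Q = 5 nodes.
nodes = ["up", "sleeve", "neck", "bottom", "pant length"]
parent = {"sleeve": "up", "neck": "up", "pant length": "bottom"}

e_sleeve = hierarchy_embedding("sleeve", nodes, parent)     # [1, 1, 0, 0, 0]
e_neck = hierarchy_embedding("neck", nodes, parent)         # [1, 0, 1, 0, 0]
e_pant = hierarchy_embedding("pant length", nodes, parent)  # [0, 0, 0, 1, 1]
```

Because siblings share the bit of their common parent, “sleeve” and “neck” end up closer to each other (Hamming distance 2) than “sleeve” and “pant length” (distance 4), which is the property that one-hot vectors lack.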

3.2.2 Attention module

Some recent studies employ attention mechanisms to locate salient meaningful regions and identify potential associations between image regions and outputs. Considering that attribute-guided features are related to specific regions and styles of images, inspired by [14], we use two attention modules, ASA and ACA, to capture attribute-related features.

ASA: The core idea of the spatial attention mechanism is that every attribute should correspond to one or several regions of an image. Given an image $x$ and a specific attribute $a$ , the spatial attention vector is first obtained with the spatial attention module ASA. A convolutional layer followed by the nonlinear activation function tanh is used to obtain $p(x)\in\textrm{R}^{c{{}^{\prime}}\times h\times w}$ .

$\displaystyle p(x)=\tanh(\textit{Conv}_{c{{}^{\prime}}}(x))$ (6)

where $\textit{Conv}_{c{{}^{\prime}}}$ contains $c{{}^{\prime}}$ convolution kernels of size $1\times 1$ . For the attribute $a$ , we first project its hierarchical attribute embedding $E(a)$ into a $c{{}^{\prime}}$ -dimensional vector with an FC layer, then expand it to the same dimension as that of the image feature, and obtain $p(a)\in\textrm{R}^{c{{}^{\prime}}\times h\times w}$ .

$\displaystyle p(a)=\tanh(W_{s}E(a))\cdot 1$ (7)

where $W_{s}\in\textrm{R}^{c{{}^{\prime}}\times Q}$ is a transformation matrix and $1\in\textrm{R}^{c{{}^{\prime}}\times h\times w}$ is a spatial duplication matrix of ones. Then the attention weight $\alpha_{s}\in\textrm{R}^{h\times w}$ is expressed as

$\displaystyle\alpha_{s}=\textit{softmax}(\tanh(\textit{Conv}_{1}(p(a)\otimes p(x))))$ (8)

where $\otimes$ is the element-wise multiplication operation, $\textit{Conv}_{1}$ contains a $1\times 1$ convolution kernel. Finally, through the following formula, we can get the spatial attention vector $I_{s}\in\textrm{R}^{c}$ .

$\displaystyle I_{s}[j]=\alpha_{s}[j]\odot x[j]$ (9)

where $\odot$ represents inner product operation.

ACA: For general visual tasks, different channels often represent different objects. For fashion image retrieval tasks, different channels represent different styles. Channel attention vector $I_{c}\in\textrm{R}^{c}$ is obtained through the channel attention module. On the one hand, we map the visual attention vector $I_{s}$ to the joint space by global average pooling, which can translate spatial visual features into low-dimensional features.

$\displaystyle I_{s}{{}^{\prime}}=\textit{GlobalAveragePooling}(I_{s})$ (10)

On the other hand, we encode the attribute with a separate attribute embedding. An embedding layer is used to embed attribute $a$ into an embedding vector $q(a)$ with the same dimensionality as $I_{s}$ .

$\displaystyle q(a)=\delta(W_{c}E(a))$ (11)

where $W_{c}\in\textrm{R}^{c\times Q}$ denotes the transformation matrix, and $\delta$ refers to ReLU function. Two consecutive FC layers are employed to obtain channel attention weight $\alpha_{c}\in\textrm{R}^{c}$ .

$\displaystyle\alpha_{c}=\sigma(W_{2}\delta(W_{1}[q(a),I_{s}{{}^{\prime}}]))$ (12)

where $[,]$ denotes the concatenation operation, $W_{1}\in\textrm{R}^{\frac{c}{r}\times 2c}$ and $W_{2}\in\textrm{R}^{c\times\frac{c}{r}}$ denote the transformation matrices, $r$ is the reduction rate, and $\sigma$ is the sigmoid function. The channel attention vector $I_{c}\in\textrm{R}^{c}$ is obtained by the element-wise multiplication between $\alpha_{c}$ and $I_{s}{{}^{\prime}}$ , shown as

$\displaystyle I_{c}=\alpha_{c}\otimes I_{s}{{}^{\prime}}$ (13)
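A minimal sketch of the ACA computation in Eqs. (10) to (13) is given below, using plain Python lists in place of tensors. The weight values and toy sizes are made up, the helper names are hypothetical, and the outer activation $\sigma$ is taken to be a sigmoid gate, which is an assumption of this sketch.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def global_average_pooling(feature_map):
    """Eq. (10): average each channel of a c x h x w map over its positions."""
    pooled = []
    for channel in feature_map:
        values = [val for row in channel for val in row]
        pooled.append(sum(values) / len(values))
    return pooled

def channel_attention(I_s, E_a, W_c, W_1, W_2):
    I_s_pooled = global_average_pooling(I_s)             # Eq. (10)
    q_a = relu(matvec(W_c, E_a))                         # Eq. (11)
    hidden = relu(matvec(W_1, q_a + I_s_pooled))         # "+" concatenates the lists
    alpha_c = sigmoid(matvec(W_2, hidden))               # Eq. (12), sigmoid assumed
    I_c = [a * f for a, f in zip(alpha_c, I_s_pooled)]   # Eq. (13)
    return alpha_c, I_c

# Toy sizes: c = 2 channels, a 2 x 2 feature map, Q = 3 tree nodes, r = 2.
I_s = [[[1.0, 3.0], [5.0, 7.0]],
       [[2.0, 2.0], [2.0, 2.0]]]
E_a = [1, 1, 0]
W_c = [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]]   # c x Q
W_1 = [[0.1, 0.1, 0.1, 0.1]]               # (c / r) x 2c
W_2 = [[1.0], [-1.0]]                      # c x (c / r)

alpha_c, I_c = channel_attention(I_s, E_a, W_c, W_1, W_2)
```

Each entry of `alpha_c` lies strictly between 0 and 1, so the module rescales channels rather than selecting them outright; with these toy weights the first channel is emphasized and the second is suppressed.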

3.2.3 Embedding branch module

Two consecutive FC layers are employed on the channel attention output $I_{c}$ to generate the similarity feature $s$ , which is shared by multiple nodes of the specific attribute under the same parent node.

$\displaystyle s=\delta(W_{q^{p}}\delta(W_{q^{g}}I_{c}))$ (14)

where $q$ denotes the node of $a$ in the attribute tree, $p$ is the parent of $q$ , $g$ is the parent of $p$ , $W_{q^{g}}\in\textrm{R}^{\frac{c}{r{{}^{\prime}}}\times c}$ and $W_{q^{p}}\in\textrm{R}^{\frac{c}{2r{{}^{\prime}}}\times\frac{c}{r{{}^{\prime}}}}$ denote two transformation matrices, $\delta$ is ReLU function, and $r{{}^{\prime}}$ is the reduction rate. Note that if the node has no children, it is regarded as the lowest child.

3.2.4 Mask module

To disentangle similarity feature $s$ into meaningful dimensions corresponding to different attributes under the same parent node, we introduce a mask operation, which acts as a gating function to select relevant dimensions.

$\displaystyle f_{\textit{mask}}=\delta(W_{\textit{mask}}E(a))$ (15)

where $W_{\textit{mask}}\in\textrm{R}^{\frac{c}{2r{{}^{\prime}}}\times Q}$ is a transformation matrix, $\delta$ is ReLU activation function. With mask weight, the attribute specific masked feature vector of given image $x$ with the specified attribute $a$ is calculated as

$\displaystyle\mathcal{F}(x,a)=s\otimes f_{\textit{mask}}$ (16)

Finally, we further employ an FC layer over $\mathcal{F}(x,a)$ to obtain the attribute-guided feature of given images.

$\displaystyle\mathcal{H}(x,a)=WI_{c}+b,$ (17)

where $W\in\textrm{R}^{c\times c}$ is the transformation matrix, $b\in\textrm{R}^{c}$ denotes the bias term.

3.2.5 Model learning

We use the triplet ranking loss to learn the similarity vector, which is proven effective in embedding learning tasks [20, 21]. Assume that $x_{i}$ is an anchor image, $\textit{Sim}(x_{i},a,x_{j})$ means the similarity of similar image pairs about attribute $a$ , and $\textit{Sim}(x_{i},a,x_{k})$ means the similarity of dissimilar image pairs about $a$ . The goal is to achieve the relationship of two similarities as $\textit{Sim}(x_{i},a,x_{j})>\textit{Sim}(x_{i},a,x_{k})$ , when given a triplet of $(x_{i},x_{j},x_{k})$ . The triplet loss is commonly defined as

$\displaystyle L(x_{i},x_{j},x_{k}|a)=\max\{0,m-\textit{Sim}(x_{i},a,x_{j})+\textit{Sim}(x_{i},a,x_{k})\}$ (18)

where $m$ represents the margin, which is empirically set to be 0.2 in our experiments. Finally, we train the model to minimize the triplet ranking loss on the triplet set, and the overall objective function of the model is

$\displaystyle\underset{\theta}{\arg\min}\sum\nolimits_{(x_{i},x_{j},x_{k}|a)}L(x_{i},x_{j},x_{k}|a)$ (19)

where $\theta$ denotes all trainable parameters.
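To make the objective concrete, the following sketch evaluates the triplet ranking loss of Eq. (18) and the sum inside Eq. (19) on a few hand-picked similarity values. The similarity numbers and function names are hypothetical; in practice the similarities would come from the network and the sum would be minimized with respect to $\theta$.

```python
def triplet_loss(sim_pos, sim_neg, margin=0.2):
    """Eq. (18): hinge on the gap between positive and negative similarity."""
    return max(0.0, margin - sim_pos + sim_neg)

def total_loss(triplet_sims, margin=0.2):
    """The sum inside Eq. (19), taken over (sim_pos, sim_neg) pairs."""
    return sum(triplet_loss(p, n, margin) for p, n in triplet_sims)

# Hypothetical similarity values for three triplets.
sims = [
    (0.9, 0.1),  # positive far ahead of negative: no loss
    (0.5, 0.4),  # correct order, but the gap is smaller than the margin
    (0.2, 0.6),  # wrong order: the largest penalty
]

losses = [triplet_loss(p, n) for p, n in sims]
loss_sum = total_loss(sims)
```

Only triplets whose positive similarity does not exceed the negative one by at least the margin contribute to the sum, so well-separated triplets stop influencing training.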

3.3 HAEN_Avg

In the field of fashion, for attribute-manipulated retrieval tasks, the popular method is feature substitution [1], i.e., the feature corresponding to the modified label in the image is replaced with the feature value of the new label. The feature value of each label is calculated in advance. For example, it uses the feature value of the “3/4 sleeves” label to replace the feature corresponding to the “sleeveless” label of the input image.

Since HAEN can help extract the characteristics of each attribute corresponding to a given image, we propose the method HAEN_Avg to implement the attribute-manipulated fashion retrieval. The specific steps are as follows.

Step 1. Extract the feature value of each label in the image gallery. For any image $y\in G$ , we feed the image and each attribute $a_{i}\in A(1\leqslant i\leqslant|A|)$ into the HAEN model to obtain the attribute-guided image features $\mathcal{H}(y,a_{i})$ . Suppose $y$ has the attribute label $b_{i}^{j}(1\leqslant j\leqslant|B_{i}|)$ for attribute $a_{i}$ , and let $\zeta_{i}^{j}$ denote the set of feature vectors collected for $b_{i}^{j}$ ; we then add $\mathcal{H}(y,a_{i})$ to $\zeta_{i}^{j}$ .

Step 2. Calculate the mean vector of each label. Then we get the mean feature values of all labels.

$\displaystyle\bar{\zeta}_{i}^{j}=\frac{1}{|\zeta_{i}^{j}|}\sum\nolimits_{v\in\zeta_{i}^{j}}v$ (20)

Step 3. Replace the input image feature with the mean image features correlated to the input label. Suppose there is an input $(x,b_{i{{}^{\prime}}}^{j{{}^{\prime}}})$ . First, we input the hierarchical attribute embedding corresponding to each attribute of $x$ into HAEN to obtain the respective attribute feature vectors $(\mathcal{H}(x,a_{1}),\mathcal{H}(x,a_{2}),\ldots,\mathcal{H}(x,a_{i{{}^{\prime}}}),\ldots,\mathcal{H}(x,a_{|A|}))$ . Then $\mathcal{H}(x,a_{i{{}^{\prime}}})$ is replaced by $\bar{\zeta}_{i{{}^{\prime}}}^{j{{}^{\prime}}}$ . The final new feature vector of $x$ is $(\mathcal{H}(x,a_{1}),\mathcal{H}(x,a_{2}),\ldots,\bar{\zeta}_{i{{}^{\prime}}}^{j{{}^{\prime}}},\ldots,\mathcal{H}(x,a_{|A|}))$ .

Step 4. Similarity computation. For the input $(x,b_{i{{}^{\prime}}}^{j{{}^{\prime}}})$ , the similarity between it and the target image $y$ is calculated as

$\displaystyle\textit{Sim}(x,b_{i{{}^{\prime}}}^{j{{}^{\prime}}},y)=\sum\limits_{i=1,\,i\neq i{{}^{\prime}}}^{|A|}S_{b}(\mathcal{H}(x,a_{i}),\mathcal{H}(y,a_{i}))+S_{b}(\bar{\zeta}_{i{{}^{\prime}}}^{j{{}^{\prime}}},\mathcal{H}(y,a_{i{{}^{\prime}}}))$ (21)
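The four steps of HAEN_Avg can be sketched as below. The feature vectors stand in for precomputed outputs $\mathcal{H}(\cdot,a)$ of a trained HAEN and are made-up numbers; the data layout, the function names, and the use of cosine similarity for $S_{b}$ are assumptions of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Eq. (20): element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_label_means(gallery):
    """Steps 1-2: group features by (attribute, label) and average each group.

    Each gallery item maps attribute -> (label, attribute-guided feature).
    """
    groups = {}
    for item in gallery:
        for attr, (label, feat) in item.items():
            groups.setdefault((attr, label), []).append(feat)
    return {key: mean_vector(feats) for key, feats in groups.items()}

def haen_avg_similarity(query_feats, target_feats, attr, label, label_means):
    """Steps 3-4: swap in the mean feature of the new label, then apply Eq. (21)."""
    manipulated = dict(query_feats)
    manipulated[attr] = label_means[(attr, label)]
    return sum(cosine(manipulated[a], target_feats[a]) for a in manipulated)

# Hypothetical gallery with two attributes and 2-D features.
gallery = [
    {"sleeve": ("sleeveless", [1.0, 0.0]), "neck": ("v", [0.0, 1.0])},
    {"sleeve": ("3/4", [0.0, 1.0]), "neck": ("v", [0.0, 1.0])},
    {"sleeve": ("3/4", [0.0, 3.0]), "neck": ("round", [1.0, 0.0])},
]
label_means = build_label_means(gallery)

# Query: a sleeveless, V-neck item that should become "3/4" sleeved.
query = {"sleeve": [1.0, 0.0], "neck": [0.0, 1.0]}
targets = [{attr: feat for attr, (_, feat) in item.items()} for item in gallery]
scores = [haen_avg_similarity(query, t, "sleeve", "3/4", label_means) for t in targets]
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy gallery the second item ranks first because it matches the requested sleeve label while keeping the query's neckline, which is the behaviour that attribute-manipulated retrieval asks for.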

3.4 HAEN_Rec

The HAEN_Avg method is simple and easy to understand [1], but it ignores the information of other attributes. Moreover, it replaces individual features with global average features, which may result in the loss of individual variability. We therefore propose HAEN_Rec, a feature substitution method based on reconstruction. The core of this method is the reconstruction module, whose objective is to generate new features based on the target label. Figure 3 shows an overview of the proposed HAEN_Rec.

Figure 3.

The proposed HAEN_Rec consists of two parts: (a) HAEN and (b) reconstruction module.

3.4.1 Reconstruction module

Each attribute $a_{i}$ corresponds to a label set $B_{i}$ of different sizes. $B_{i}=\{b_{i}^{1},b_{i}^{2},\ldots\}$ . All attribute labels form a binary vector $H=[h_{1}^{1},h_{1}^{2},\ldots,h_{2}^{1},h_{2}^{2},\ldots,h_{|A|}^{1},h_{|A|}^{2},\ldots]$ . If the item has label $b_{i}^{j}$ , then $h_{i}^{j}=1(1\leqslant i\leqslant|A|,1\leqslant j\leqslant|B_{i}|)$ , otherwise $h_{i}^{j}=0$ . Therefore, $H$ can be used as an item label indicator to let the model know which label the target item should have. To fully mine the latent semantic information of labels, we introduce a label semantic embedding matrix $W_{c}\in\textrm{R}^{D\times J}$ , and get the semantic representation $\hat{H}=[\hat{h}_{1}^{1},\hat{h}_{1}^{2},\ldots,\hat{h}_{2}^{1},\hat{h}_{2}^{2},\ldots,\hat{h}_{|A|}^{1},\hat{h}_{|A|}^{2},\ldots]$ and the regularized attribute operation indicator $\bar{H}=[\bar{h}_{1}^{1},\bar{h}_{1}^{2},\ldots,\bar{h}_{2}^{1},\bar{h}_{2}^{2},\ldots,\bar{h}_{|A|}^{1},\bar{h}_{|A|}^{2},\ldots]$ .

$\displaystyle\hat{h}_{i}^{j}=W_{c}\bar{h}_{i}^{j},\quad\bar{h}_{i}^{j}=h_{i}^{j}\left/\sum\nolimits_{j=1}^{|H|}h_{i}^{j}\right.$ (22)

After preprocessing, the image is represented as $x\in\textrm{R}^{c\times h\times w}$ , where $h\times w$ is the image size, and $c$ is the number of channels. To seamlessly integrate the semantic representation of visual features and attributes, we copy $\hat{H}$ along the height and width dimensions according to the size of $x$ and get $\tilde{H}$ , and the reconstructed image features are expressed as

$\displaystyle\tilde{x}=\textit{Conv}_{3}(\textit{ReLU}(\textit{Conv}_{2}([x,\tilde{H}])))$ (23)

where $\textit{Conv}_{2}$ and $\textit{Conv}_{3}$ respectively represent the convolutional layer with $3\times 3$ convolution kernels, and $[,]$ represents the concatenation operation.

To fuse the image with the manipulation label, we still need to pay attention to the spatial regions and channels related to the modified attribute in the image features. We also use ASA module and ACA module to get the attention weights $\alpha_{s}\in\textrm{R}^{h\times w}$ and $\alpha_{c}\in\textrm{R}^{c}$ .

$\displaystyle I_{s}[j]=\alpha_{s}[j]\odot\tilde{x}[j],\quad I_{c}=\alpha_{c}\otimes I_{s}$ (24)

$I_{c}$ is the final image feature, which will be input to the embedding branch module. The output of the embedding branch module is the manipulated feature vector $\mathcal{T}(x,b)$ , which will be used for similarity computing.

3.4.2 Model learning

HAEN_Rec is based on HAEN, so we first train the HAEN model and then train the reconstruction module with the HAEN parameters fixed. We also use the triplet ranking loss as the loss function. First, we construct a set of triplets $T=\{(x_{i},x_{j},x_{k}|a)\}$ , where $x_{j}$ is an image similar to $x_{i}$ on attribute $a$ and $x_{k}$ is an image dissimilar to $x_{i}$ on attribute $a$ . Given a triplet from $T$ , our goal is to train the model so that $S_{a}(x_{i},x_{j}|a)>S_{a}(x_{i},x_{k}|a)$ . Here $S_{a}(x_{i},x_{j}|a)$ denotes the fine-grained similarity between $x_{i}$ and $x_{j}$ with respect to attribute $a$ , which can be expressed as the cosine similarity between $\mathcal{H}(x_{i},a)$ and $\mathcal{H}(x_{j},a)$ . The triplet loss function is defined as

$\displaystyle L(x_{i},x_{j},x_{k}|a)=\max(0,m-S_{a}(x_{i},x_{j}|a)+S_{a}(x_{i},x_{k}|a))$ (25)

where $m$ is the margin value, which is set to 0.2 empirically in the experiment.

3.4.3 Similarity computation

Given an image $x$ and a label $b$ , $b$ is a label of attribute $a_{i{{}^{\prime}}}$ .

Step 1. Input $x$ and each attribute into HAEN to get $F=\{\mathcal{H}(x,a_{1}),\mathcal{H}(x,a_{2}),\ldots,\mathcal{H}(x,a_{|A|})\}$ .

Step 2. Feed the attribute manipulation indicator of $x$ and $b$ into the reconstruction module to get the reconstructed feature vector $\mathcal{T}(x,b)$ .

Step 3. Replace the original feature $\mathcal{H}(x,a_{i{{}^{\prime}}})$ with $\mathcal{T}(x,b)$ and get $F{{}^{\prime}}=\{\mathcal{H}(x,a_{1}),\ldots,\mathcal{T}(x,b),\ldots,\mathcal{H}(x,a_{|A|})\}$ .

Step 4. Calculate the similarity between $x$ and the item in the gallery.

$\displaystyle\textit{Sim}(x,b,y)=\sum\limits_{i=1,\,i\neq i{{}^{\prime}}}^{|A|}S_{b}(\mathcal{H}(x,a_{i}),\mathcal{H}(y,a_{i}))+S_{b}(\mathcal{T}(x,b),\mathcal{H}(y,a_{i{{}^{\prime}}}))$ (26)

3.5 HAEN_Cmb

The effectiveness of HAEN_Rec depends on the clarity of the original query image. In real scenes, many clothing images are occluded or distorted. In such cases, the reconstructed attribute features may carry less useful information than those of HAEN_Avg. We therefore propose a combination method, HAEN_Cmb, which uses an attention-based adaptive feature fusion module to integrate the outputs of the two methods and adaptively adjusts their weights to obtain a target feature vector that better matches the query.

Figure 4.

The adaptive fusion module.

Figure 4 shows the structure of the adaptive fusion module. The input includes two parts: the image feature $I_{c}^{\textit{Rec}}$ produced by the reconstruction module and the average feature $I_{c}^{\textit{Avg}}$, which is calculated in advance and saved locally. Unlike in HAEN_Avg, $I_{c}^{\textit{Avg}}$ here is not the final feature vector output by HAEN but the image feature vector output by ACA in the middle of the model, because the two features need to be input to the EB module after fusion. The fusion process is as follows.

Step 1. Feed the concatenated vector of $I_{c}^{\textit{Rec}}$ and $I_{c}^{\textit{Avg}}$ into a FC layer for information integration.

$\displaystyle I_{f}=W_{3}\delta([I_{c}^{\textit{Rec}},I_{c}^{\textit{Avg}}])$ (27)

where $[,]$ is the concatenation operator, $W_{3}$ is a transformation matrix, and $\delta$ is the ReLU function. After the attention module, we obtain the weights $\alpha_{f}\in R^{2}$.

$\displaystyle\alpha_{f}=\textit{softmax}(I_{f})$ (28)

Step 2. Assuming that the weights corresponding to HAEN_Avg and HAEN_Rec are $w^{A}$ and $w^{R}$ respectively, the final result can be expressed as

$\displaystyle I_{c}^{\textit{Cmb}}=(w^{A}\otimes I_{c}^{\textit{Avg}})\oplus(w^{R}\otimes I_{c}^{\textit{Rec}})$ (29)

where $\otimes$ is the dot product operator, $\oplus$ is the addition operator. $I_{c}^{\textit{Cmb}}$ is input to the EB module to get the objective similarity vector $f_{\textit{cmb}}(I,a_{q})$ .
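Eqs. (27) to (29) can be traced with a small sketch in plain Python. Two points are assumptions rather than statements from the text: the two softmax outputs are read as $(w^{R},w^{A})$ in the same order as the concatenation, and $W_{3}$ is taken to have two rows so that $\alpha_{f}\in R^{2}$. The function name and toy values are hypothetical.

```python
import math

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]

def softmax(v):
    """Numerically stable softmax over a short list."""
    shift = max(v)
    exps = [math.exp(x - shift) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_fusion(i_rec, i_avg, w3):
    """Fuse the reconstructed and average features with attention weights."""
    concat = relu(i_rec + i_avg)  # delta([I_rec, I_avg])
    i_f = [sum(w * x for w, x in zip(row, concat)) for row in w3]  # Eq. (27)
    w_r, w_a = softmax(i_f)  # Eq. (28); component order is an assumption
    fused = [w_a * a + w_r * r for a, r in zip(i_avg, i_rec)]  # Eq. (29)
    return fused, (w_a, w_r)

# Toy features and an all-zero W_3 (made-up values), which yields equal weights.
i_rec = [1.0, 2.0]
i_avg = [3.0, 4.0]
w3 = [[0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0]]

fused, (w_a, w_r) = adaptive_fusion(i_rec, i_avg, w3)
```

With a zero matrix the softmax returns equal weights, so the fused vector is the midpoint of the two inputs; a trained $W_{3}$ would shift the weights toward whichever feature is more informative for the query.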

Model Learning. The training process of HAEN_Cmb is similar to that of HAEN_Rec. First, train the HAEN model. Then, train the reconstruction module after fixing the HAEN parameters. Finally, train the attention module after fixing the parameters of HAEN and the reconstruction module.

4. Experimental results

In this section, we perform extensive experiments on two benchmark datasets to evaluate our proposed HAEN and HAEN-based methods. We aim to answer the following research questions:

RQ1: Can our proposed HAEN perform better than other competitive models on attribute-guided fashion retrieval tasks?

RQ2: Can our proposed models based on HAEN perform better than other competitive models on attribute-manipulated fashion retrieval tasks?

RQ3: Are the key components in HAEN (i.e., HAE method, mask technique, and embedding branch) helpful for improving retrieval results?

RQ4: What is the visual effect of the attention mechanism in the model?

4.1 Experimental settings

Datasets. Recently, researchers in the community have contributed some fashion datasets for different research purposes, such as DARN [2], DeepFashion [8], FashionAI [22], FashionVC [23], and Fashion 200K [24]. However, most available public datasets lack attribute annotations and cannot be directly used for similarity measurement tasks based on attributes. It is worth noting that Ma et al. [14] have reconstructed two datasets based on FashionAI and DARN to fill the gap. Therefore, we use these two datasets in this paper to conduct experiments on mining attribute-driven similarity.

FashionAI [22] is a large-scale, high-quality fashion dataset. There are 8 attributes with 245 attribute values covering 6 categories of women’s clothing. Each attribute (e.g., sleeve length) is associated with a list of multiple attribute values (e.g., sleeveless, cup sleeve, and short sleeve). In our experiment, we split the dataset into three subsets for training/validation/testing at a ratio of 8:1:1 and train the model by constructing triplets from the training set. Concretely, for the triplet with respect to the sleeve length attribute, we randomly sampled two images with the same attribute value sleeveless as a similar pair and an image with a different sleeve length value as a dissimilar one.

DARN [2] is a fashion dataset collected for cross-domain image retrieval tasks. The dataset contains 253,983 images, each annotated with 9 fashion attributes, and the total number of attribute values is 179. After deleting the broken URLs, we obtain 200,580 images for our experiments. We sample the triplets in the same way as for the FashionAI dataset.
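The triplet construction described for FashionAI and reused for DARN can be sketched as below. This is an illustrative assumption of how such sampling might look, not the authors' code: the label dictionary, image identifiers, and the function `sample_triplet` are hypothetical.

```python
import random

def sample_triplet(labels, attribute, rng):
    """Return (anchor, positive, negative, attribute) for one attribute.

    labels maps image_id -> {attribute_name: attribute_value}.
    """
    by_value = {}
    for image_id, attrs in labels.items():
        if attribute in attrs:
            by_value.setdefault(attrs[attribute], []).append(image_id)
    # A similar pair needs two images sharing a value; the negative needs another value.
    candidates = sorted(v for v, ids in by_value.items() if len(ids) >= 2)
    if not candidates or len(by_value) < 2:
        raise ValueError("not enough labelled images to form a triplet")
    value = rng.choice(candidates)
    anchor, positive = rng.sample(by_value[value], 2)
    other_values = sorted(v for v in by_value if v != value)
    negative = rng.choice(by_value[rng.choice(other_values)])
    return anchor, positive, negative, attribute

# Toy annotations (made-up image ids; values taken from the sleeve length example).
labels = {
    "img1": {"sleeve_length": "sleeveless"},
    "img2": {"sleeve_length": "sleeveless"},
    "img3": {"sleeve_length": "short sleeve"},
    "img4": {"sleeve_length": "cup sleeve"},
}
anchor, positive, negative, attr = sample_triplet(labels, "sleeve_length", random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when comparing runs.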

Evaluation Metrics. For attribute-guided retrieval, given an image and an attribute, the model is required to find images similar to the query image subject to this attribute. We use mean average precision (MAP) as the evaluation metric.
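For reference, a common way to compute MAP from ranked retrieval lists is sketched below. It assumes that each query ranks the whole gallery, so every relevant image appears somewhere in its list, and it uses boolean relevance flags; the exact protocol used in the experiments is not specified here, so this is illustrative only.

```python
def average_precision(ranked_relevance):
    """Average of precision@rank over the ranks where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

# Two toy queries: True marks a gallery image sharing the query's attribute value.
rankings = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = (1/2) / 1
]
map_score = mean_average_precision(rankings)
```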

For attribute-manipulated retrieval, we use top-k recall rate as the evaluation metric. It is defined as

$\displaystyle\textit{recall}=\frac{\sum\nolimits_{j=0}^{N}\textit{hit}(j)}{N^{g}}$ (30)

where $N$ is the total number of test images and $N^{g}$ is the number of images with the attribute label. If one of the first $k$ results retrieved has the same attributes as the query, it means a hit, i.e., $\textit{hit}(j)=1$ means that the $j^{\text{th}}$ query is hit, otherwise $\textit{hit}(j)=0$ .
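A minimal sketch of Eq. (30) follows. For simplicity it assumes that each gallery image carries a single label and that a hit means the requested label appears among the first $k$ results; the query lists, label names, and the value of $N^{g}$ are made up for illustration.

```python
def hit(ranked_labels, target_label, k):
    """1 if any of the first k retrieved items carries the target label, else 0."""
    return 1 if target_label in ranked_labels[:k] else 0

def top_k_recall(queries, k, n_g):
    """Eq. (30): number of hit queries divided by N^g."""
    return sum(hit(ranked, target, k) for ranked, target in queries) / n_g

# Each query: (labels of the retrieved items in ranked order, requested label).
queries = [
    (["red", "blue", "blue"], "blue"),
    (["mini", "maxi", "midi"], "midi"),
]
n_g = len(queries)  # assumption: every toy query has a valid target label

recall_at_1 = top_k_recall(queries, 1, n_g)
recall_at_2 = top_k_recall(queries, 2, n_g)
recall_at_3 = top_k_recall(queries, 3, n_g)
```

Because a hit at rank $k$ remains a hit for every larger $k$, the recall is non-decreasing in $k$.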

4.2 Attribute-guided retrieval results (RQ1)

For two different tasks, we use different baselines. We use four different models for experimental comparisons of attribute-guided retrieval tasks. To make a fair comparison, we use Resnet50 pre-trained on ImageNet as the image feature extraction network for all models.

(1)
Standard Triplet Network (STN) [12]: It is a simple similarity learning model by adding an FC layer as the embedding layer after the Resnet feature extraction network. The network parameters are updated through the backpropagation according to the triplet loss. Through the training of all triplets, STN aims to learn the similarity measure space of all attributes.
(2)
Conditional Similarity Networks (CSN) [12]: As an extended model of STN, CSN uses a learned mask to select relevant dimensions of the embedding space to model the specific attribute similarity.
(3)
Attribute-Specific Embedding Network (ASEN) [14]: This model has achieved the best performance in the field. It learns attribute-guided embedding space for fine-grained fashion similarity prediction.
(4)
HAEN: Proposed in this paper, HAEN leverages the hierarchical attribute embedding of a specific attribute as a condition. Two attention modules are first applied to extract the features related to the attribute. Then, it integrates the feature information through the embedding branch module and selects the dimensions related to the attribute through the mask module.
(5)
w/o HAE: It is a variant of HAEN. It feeds the one-hot vector into the attention module and HAE into the mask module to get the output features of the final FC layer.

Table 2
MAP of specific attribute-based retrieval on FashionAI

Model	Skirt_length	Sleeve_length	Coat_length	Pant_length	Collar_design	Lapel_design	Neckline_design	Neck_design	Mean MAP
STN	48.38	28.14	29.82	54.56	62.58	38.31	26.64	40.02	38.52
CSN	61.97	45.06	47.30	62.85	69.83	54.14	46.56	54.47	54.47
ASEN	64.44	54.63	51.27	63.53	70.79	65.36	59.50	58.67	61.02
HAEN	64.13	55.52	56.41	72.31	73.32	69.22	62.41	59.80	64.14
w/o HAE	65.30	54.93	55.13	70.17	70.91	68.23	61.11	59.42	63.15

To demonstrate that using specific attribute spaces to measure fashion similarity is better than using a general space, Table 2 shows the retrieval results of different models on each attribute on the FashionAI dataset, and we get the following observations.

(1)
STN performs the worst, and its MAP value is only 38.52. The MAP value of CSN is 54.47, which is an increase of 41.41% compared to STN.
(2)
ASEN performs better. Compared with STN and CSN, its MAP value has increased by 58.41% and 12.02%, respectively. The main reason is that the model uses the attention mechanism to extract more relevant features to the specific attribute.
(3)
Compared with ASEN, HAEN proposed in this paper has a 5.11% improvement in MAP.
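The percentages in these observations are relative gains in mean MAP. As a quick check, the sketch below recomputes them from the Mean MAP column of Table 2; only the helper name `relative_gain` is an addition.

```python
def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return 100.0 * (new - old) / old

# Mean MAP values copied from Table 2 (FashionAI).
mean_map = {"STN": 38.52, "CSN": 54.47, "ASEN": 61.02, "HAEN": 64.14}

gains = {
    "CSN vs STN": relative_gain(mean_map["CSN"], mean_map["STN"]),
    "ASEN vs STN": relative_gain(mean_map["ASEN"], mean_map["STN"]),
    "ASEN vs CSN": relative_gain(mean_map["ASEN"], mean_map["CSN"]),
    "HAEN vs ASEN": relative_gain(mean_map["HAEN"], mean_map["ASEN"]),
}

for name, value in gains.items():
    print(f"{name}: {value:.2f}%")
```

The four values agree with the 41.41%, 58.41%, 12.02%, and 5.11% quoted above. In absolute terms, the gain of HAEN over ASEN is 3.12 MAP points.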

We also conduct experiments on the DARN dataset. Table 3 shows the retrieval results of different models on each attribute. The observations we get are similar to those from Table 2.

4.3 Attribute-manipulated retrieval results (RQ2)

AMNet is a classic model for attribute manipulation retrieval tasks and has been used as a benchmark in many studies. We compare the proposed methods with AMNet in the following experiment.

We set $k$ to 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 to provide a basis for detailed analysis. HAEN_Avg uses the average value of each label as its feature vector, HAEN_Rec uses the value obtained by the reconstruction module as the feature vector, and HAEN_Cmb uses an adaptive attention module to fuse the feature vectors obtained by the above two methods. Figure 5 shows the top-k recall of attribute-manipulated retrieval on FashionAI. From it, we can make the following observations.

Table 3
MAP of specific attribute-based retrieval on DARN

Model	Clothes category	Clothes button	Clothes color	Clothes length	Clothes pattern	Clothes shape	Collar shape	Sleeve length	Sleeve shape	Mean MAP
STN	15.48	34.12	16.43	36.26	43.14	35.38	14.43	67.23	55.67	35.34
CSN	25.13	40.25	44.68	47.36	48.47	45.22	26.98	76.73	56.76	45.73
ASEN	27.41	42.25	46.08	48.15	48.72	49.09	26.79	78.52	58.17	47.24
HAEN	32.10	47.04	45.03	48.27	49.92	51.22	28.05	78.29	58.47	48.70
w/o HAE	31.86	45.24	45.47	48.04	49.38	50.48	27.15	79.12	57.78	48.27

Figure 5.

Top-k recall rate of attribute-manipulated retrieval on FashionAI.

(1)

As $k$ increases, the top-k recall of all four methods shows an upward trend. The three methods proposed in this paper obtain better retrieval performance than AMNet on all attributes. The reason is that AMNet uses Resnet to extract image features and measures the similarity between items in a single common space. In contrast, HAEN extracts different image features for different attributes, so that the similarity between items can be evaluated in attribute-specific feature spaces.

(2)

For six attributes (clothes_category, clothes_button, clothes_length, clothes_shape, sleeve_shape, and sleeve_length), HAEN_Rec performs better than HAEN_Avg, which shows the effectiveness of the reconstruction module. However, HAEN_Rec does not perform as well as HAEN_Avg for the two attributes clothes_color and clothes_pattern. The reason is that there is no clear correlation between these attributes and specific regions of an image. In this case, the average embedding feature can be used for similarity calculation, while the reconstructed embedding feature has a negative effect and is not suitable for similarity calculation.

(3)

The HAEN_Cmb method combines the structures of HAEN_Avg and HAEN_Rec and allows them to complement each other through an adaptive feature attention module. It can be seen from the figure that for all attributes, regardless of $k$, HAEN_Cmb performs best and its top-k curve is smoother, which demonstrates the effectiveness of the proposed adaptive feature attention module.

4.4 Ablation study (RQ3)

4.4.1 The impact of HAE method

Table 4
Parameter statistics of different models

Model	#params	#params besides Resnet
ASEN	13753897	3161601
HAEN	13370025	2777729
HAEN w/o HAE	13361705	2769409

The relationship between fashion attributes, especially the hierarchical relationship, is useful information for retrieval. We use a hierarchical attribute coding method to embed this information into the model so that the attention module can learn the hierarchical structure between attributes. It can be seen from Tables 2 and 3 that HAEN (using the HAE method) improves retrieval precision and prediction accuracy compared to HAEN w/o HAE. Table 4 shows that the number of trainable parameters of HAEN is smaller than that of ASEN. These results support the effectiveness of hierarchical relationships among attributes for fashion retrieval.
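The figures in Table 4 can be cross-checked with a few lines of arithmetic. The sketch below only uses the counts from the table; the variable names are illustrative.

```python
# Parameter counts copied from Table 4: (total, excluding Resnet).
param_counts = {
    "ASEN": (13753897, 3161601),
    "HAEN": (13370025, 2777729),
    "HAEN w/o HAE": (13361705, 2769409),
}

# Backbone size implied by each row: total minus the non-Resnet part.
backbone = {name: total - head for name, (total, head) in param_counts.items()}

# Difference between HAEN and its w/o HAE variant, and HAEN's saving relative to ASEN.
hae_extra = param_counts["HAEN"][1] - param_counts["HAEN w/o HAE"][1]
saving_vs_asen = param_counts["ASEN"][1] - param_counts["HAEN"][1]
saving_pct = 100.0 * saving_vs_asen / param_counts["ASEN"][1]

print(backbone, hae_extra, saving_vs_asen, round(saving_pct, 2))
```

All three rows imply the same backbone size of 10,592,296 parameters, so the differences between models come entirely from the modules added on top of Resnet. HAEN has 8,320 more parameters than its w/o HAE variant and about 12% fewer non-Resnet parameters than ASEN.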

4.4.2 The impact of mask technique

When representing item features, we first share the same FC layer to integrate feature information according to the hierarchical relationship between attributes and then use an embedding layer. When some attributes share specific characteristics, their embedding spaces should be shared in some dimensions. Thus, we use a mask operation similar to [12] to select specific dimensions for sub-nodes to measure their similarity. Figure 6 shows an instance of the mask technique. In the mask graph, two child nodes under the same parent node, such as “neck_design” and “lapel_design” or “pant_length” and “skirt_length”, overlap in more dimensions than two child nodes under different parent nodes.
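The mask operation and the notion of overlapping dimensions can be illustrated with a small sketch. The binary masks below are invented for illustration (learned masks are generally real-valued), and the Jaccard-style overlap measure is an assumption chosen for readability, not a quantity defined in the paper.

```python
def apply_mask(embedding, mask):
    """Select attribute-relevant dimensions by element-wise multiplication."""
    return [e * m for e, m in zip(embedding, mask)]

def mask_overlap(mask_a, mask_b, threshold=0.5):
    """Share of active dimensions common to both masks (Jaccard index)."""
    active_a = {i for i, m in enumerate(mask_a) if m > threshold}
    active_b = {i for i, m in enumerate(mask_b) if m > threshold}
    union = active_a | active_b
    return len(active_a & active_b) / len(union) if union else 0.0

# Hypothetical masks over a 6-dimensional embedding.
masks = {
    "neck_design":  [1, 1, 1, 0, 0, 0],
    "lapel_design": [1, 1, 0, 1, 0, 0],
    "pant_length":  [0, 0, 0, 1, 1, 1],
}

embedding = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
neck_view = apply_mask(embedding, masks["neck_design"])

same_parent = mask_overlap(masks["neck_design"], masks["lapel_design"])
different_parent = mask_overlap(masks["neck_design"], masks["pant_length"])
```

In this toy setting the two design attributes share half of their active dimensions, while a design attribute and a length attribute share none, which mirrors the pattern described for Figure 6.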

4.4.3 The impact of embedding branch module

To examine the effectiveness of the EB module, we add it to the ASEN model. The new model, called ASEN $+$ EB, is a variant of ASEN based on HAEN. We use a separate FC layer and an embedding layer for each attribute.

Table 5
MAP of specific attribute-based retrieval

Model	Mean MAP (FashionAI)	Mean MAP (DARN)	#params
ASEN	61.02	47.24	13753897
ASEN $+$ EB	65.35	50.03	15066665
HAEN	64.14	48.70	13370025

Figure 6.

Visualization of the mask operations on certain attributes.

Figure 7.

Two instances of attribute-guided fashion retrieval on FashionAI. The area in the red bounding box corresponds to the given attribute. The blue bounding boxes indicate that the images have the same label as the given image in terms of the given attribute. The green ones mean the images do not include the same label.

Figure 8.

Two instances of attribute-manipulated fashion retrieval on FashionAI. The blue bounding box indicates that the image attributes are consistent with the query conditions, and the green indicates that the image attributes do not meet the query conditions.

Figure 9.

Heatmaps of different attributes on different models.

From Table 5, we can see that ASEN $+$ EB performs best, which demonstrates the effectiveness of the EB module. However, ASEN $+$ EB needs to learn more parameters (by the counts in Table 5, about 9.5% more than ASEN and 12.7% more than HAEN), which leads to higher computational cost. Because ASEN $+$ EB uses a separate FC layer and embedding layer for each attribute, its parameter count and training cost keep growing as the number of attributes increases. The EB module is also used in HAEN, but the structure of HAEN is simpler than that of ASEN $+$ EB. Therefore, HAEN offers the best trade-off between computational cost and retrieval accuracy.

4.5 Case study (RQ4)

4.5.1 Image retrieval visualization

To further demonstrate the effectiveness of the proposed methods, we examine application examples on the two tasks. Figure 7 shows two instances of attribute-guided fashion retrieval on the FashionAI dataset with given attributes by using HAEN. These results demonstrate that HAEN is good at capturing the attribute-specific fine-grained similarity among fashion items. Figure 8 shows two instances of attribute-manipulated fashion retrieval on the FashionAI dataset.

4.5.2 Attention visualization

To verify the ability of the attention modules in HAEN to locate areas based on attributes, we visualize the attribute-guided attention function on fashion images. Figure 9 shows the comparison between HAEN and ASEN. The following two points can be seen.

(1)
The attention module in HAEN can more accurately identify the area related to a certain attribute in the fashion image. For example, in Fig. 9, for the attribute skirt length, ASEN mistakenly positions the focus in the middle of the coat, while HAEN accurately identifies the beginning and end of the skirt. For the attribute collar design, ASEN focuses on the human head, with only two scattered areas related to the attribute, while HAEN locates the continuous area related to the attribute.
(2)
For some images with complex scenes, neither ASEN nor HAEN performs well when attending to specific attributes such as coat length and pant length. However, HAEN can find the beginning and end positions of the coat when attending to the coat length attribute, while ASEN incorrectly locates the attribute on the sleeves of the clothes.

5. Discussion and conclusion

To address queries that combine text and image information in the fashion field, we propose for the first time in this paper to distinguish two tasks: attribute-guided image retrieval and attribute-manipulated image retrieval. We propose a similarity learning network, HAEN, based on the attention mechanism, which can learn the semantic relationship between text attributes and images. Extensive experiments show that, compared with state-of-the-art methods, HAEN performs better in attribute-specific similarity learning by using the hierarchical relationship between attributes and obtains competitive results in attribute-guided retrieval tasks. Based on HAEN, the HAEN_Avg, HAEN_Rec, and HAEN_Cmb methods are proposed for attribute-manipulated retrieval tasks. HAEN_Cmb is a fusion of the first two methods, absorbing their advantages and obtaining much better results than the baseline. Since HAEN training requires images to be labelled, accurate image annotation is critical to improving retrieval performance. Therefore, we will study the multi-label classification and automatic labelling of fashion images in our future work.

References

1. Cheng, Song, Chen, Hidayati S.C. and Liu, Fashion meets computer vision: A survey, ACM Computing Surveys 54(4) (2021), 1–41.

2. Huang, Feris R.S., Chen and Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1062–1070.

3. Kuang, Gao, Luo, Chen, Lin and Zhang, Fashion retrieval via graph reasoning networks on a similarity pyramid, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3066–3075.

4. Wen, Song, Yang, Zhan and Nie, Comprehensive linguistic-visual composition network for image retrieval, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1369–1378.

5. Jandial, Badjatiya, Chawla, Chopra, Sarkar and Krishnamurthy, Sac: Semantic attention composition for text-conditioned image retrieval, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 4021–4030.

6. Yang, Wang, Zhou and Li, Cross-modal joint prediction and alignment for composed query image retrieval, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3303–3311.

7. Jiang, Sun, Murphy, L.-J., Fei-Fei and Hays, Composing text and image for image retrieval-an empirical odyssey, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6439–6448.

8. Liu, Luo, Qiu, Wang and Tang, Deepfashion: Powering robust clothes recognition and retrieval with rich annotations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1096–1104.

9. Wang, Zhang and Yang, Cross-domain image retrieval with attention modeling, in: Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 1654–1662.

10. Wang, Zhang, Zhou and Gu, Clothing retrieval with visual attention model, in: 2017 IEEE Visual Communications and Image Processing (VCIP), IEEE, 2017, pp. 1–4.

11. Han, Jiang Y.-G. and Davis L.S., Learning fashion compatibility with bidirectional lstms, in: Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 1078–1086.

12. Veit, Belongie and Karaletsos, Conditional similarity networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 830–838.

13. Liao, Zhao, Ngo C.-W. and Chua T.-S., Interpretable multimodal retrieval for fashion products, in: Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 1571–1579.

14. Dong, Long, Zhang, Xue and Ji, Fine-grained fashion similarity learning by attribute-specific embedding network, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 11741–11748.

15. Yan, Ding, Zhang and Wang, Learning fashion similarity based on hierarchical attribute embedding, in: 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2021, pp. 1–8.

16. Chen, Gong and Bazzani, Image search with text feedback by visiolinguistic attention learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3001–3011.

17. Zhang and Zhao, Clothes image caption generation with attribute detection and visual attention model, Pattern Recognition Letters 141 (2021), 68–74.

18. Wei, Zhang and Wu, Multi-modality cross attention network for image and sentence matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10941–10950.

19. Zhang, Ren and Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

20. Vasileva M.I., Plummer B.A., Dusad, Rajpal, Kumar and Forsyth, Learning type-aware embeddings for fashion compatibility, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 390–405.

21. Dong, Yang and Wang, Dual encoding for zero-example video retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9346–9355.

22. Zou, Kong, Wong, Wang, Liu and Cao, Fashionai: A hierarchical dataset for fashion understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 296–304.

23. Song, Feng, Liu, Nie and Ma, Neurostylist: Neural compatibility modeling for clothing matching, in: Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 753–761.

24. Han, Huang P.X., Zhang, Zhu, Zhao and Davis L.S., Automatic spatially-aware fashion concept discovery, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1463–1471.
