Abstract
Holistic scene understanding is a challenging problem in computer vision. Most recent research in this field has focused on object detection, semantic segmentation and relationship detection. Attributes provide meaningful information about an object instance, so that the instance can be described in more detail during scene understanding. However, most research on attributes has been limited to special conditions. For example, several studies focus only on the attributes of a specific object class; because these solutions target limited scenarios, they are hard to generalize to other scenarios. We also find that most research on multi-attribute detection regards each attribute as a binary class and simply applies the multi-binary-classifier method. These strategies do not consider the relation between each pair of attributes. They run into trouble on “imperfect” attribute datasets (those with missing and incomplete annotations) and perform poorly on long-tail attribute classes (which have a lower annotation frequency and more missing labels). In this paper, we focus on multi-attribute detection for a variety of object classes and take the relation between attributes into consideration. We propose a GRU-based model that detects a variable-length attribute sequence, together with a customized loss computation method, to address the “imperfect” attribute dataset problem. Furthermore, we perform ablative studies to demonstrate the effectiveness of each part of our method. Finally, we compare our model with several existing multi-attribute detection methods on the VG (Visual Genome) and CUB200 bird datasets to demonstrate the superior performance of the proposed model.
Introduction
Holistic scene understanding is a challenging task in computer vision, autonomous robotics and other fields of intelligent agent research. To achieve scene understanding for intelligent agents, great efforts have been made in recent years on tasks such as object detection Redmon et al. [19, 20], image segmentation Long et al. [15], Liang et al. [13], relationship detection Zhang et al. [27], Dai et al. [4], Li et al. [10] and image classification Daniels and Metaxas (2018).
We believe that attribute detection is also a key task in holistic scene understanding. An attribute is a fine-grained description of an object instance, and attributes can describe the object from many perspectives, such as color, material, shape and action. Several other tasks have been shown to perform better when object attributes are used, e.g., question answering Antol et al. [1], Yu et al. [25], automatic caption generation Johnson et al. [7], Liang et al. [11], fine-grained recognition Farhadi et al. [5] and face recognition Iranmanesh, Kazemi, Soleymani, Dabouei and Nasrabadi (2018), Taherkhani et al. [22].
We consider object detection to be one of the classic tasks of data mining. Analyzing anomalous data helps enterprises or users understand the formation mechanism behind it, so that they can make corresponding decisions and avoid losses. With the development of networks, anomaly detection for structured data, that is, graph anomaly detection, has received increasing attention. However, general-purpose graph neural networks (such as Graph Convolutional Networks) are mainly designed for normal data and are prone to the problem of “over-smoothing” in anomaly detection tasks: the representations of abnormal and normal nodes become difficult to distinguish, which affects the accuracy of anomaly detection. Thus, object detection is usually a complex systems engineering problem, and choosing an appropriate graph neural network is a key factor affecting system performance.
We regard the attributes of an object instance as a set, so the order of the attributes in the annotation does not matter. Attributes are undoubtedly related to each other, and the strength of the relationship varies between different pairs of attributes. For example, “yellow” and “orange” are both color attributes, and their relationship should be stronger than that between “yellow” and “big”. Some attributes are abstract and visual features alone represent them poorly, but with the help of the relationships between attributes they may be easier to detect. We also consider that there are relationships between attributes and object classes; for example, “wet” is highly related to “sand”. We find that only a few studies take the relations described above into consideration, such as Deep variation-structured reinforcement learning (DVRL) Liang et al. [12] and Multi-Label Image Recognition with Graph Convolutional Networks (ML-GCN) Chen et al. [3].
A large part of the research on multi-attribute detection focuses only on the attributes of a specific object class to reduce the annotation workload, and the attribute classes are predefined to ensure highly accurate attribute annotation. These works then use the multi-binary-classifier to solve the multi-attribute detection problem. This approach works well for the attributes of a specific object class with a well-annotated attribute dataset, but if we do not limit the object class and cannot ensure that all attributes are carefully annotated, the multi-binary-classifier method runs into trouble.
Public attribute detection datasets are more challenging than object detection datasets. For each object instance, more than one attribute can be labeled by an annotator, so the workload is multiplied. Enumerating and labeling all the attributes of a given object instance is difficult for annotators, which makes attribute datasets more uncertain. An attribute may be unlabeled either because it is truly absent or because of the annotator’s negligence; thus, we can only be sure that the annotated attributes are positive samples, but not that all unlabeled attributes (especially long-tail attributes) are negative samples. We call this the “imperfect” characteristic of many attribute datasets. Moreover, the number of attributes varies significantly between objects, which makes the traditional classifier method inadequate.
Visual Genome (abbreviated as VG below) Krishna et al. [8] is a dataset for vision tasks with several kinds of annotations, including 18k object classes and 13k attribute classes; its attributes are generated from the region captions. Thus, the attribute annotation in VG is “imperfect”: about half of the object instances have no attribute labels, and the other half are labeled with only a few conspicuous attributes. We evaluate several multi-attribute detection methods on VG (in Section 4), and the results show that the multi-binary-classifier method does not work well on an “imperfect” attribute dataset, because it regards all unlabeled attributes as negative samples when optimizing the model and is therefore more likely to predict negative results for long-tail attributes.
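To make this failure mode concrete, the following minimal NumPy sketch shows how a multi-binary-classifier loss treats every unlabeled attribute as a negative sample. The attribute names, the scores and the helper `bce_per_attribute` are illustrative assumptions and are not taken from any of the compared implementations.

```python
import numpy as np

def bce_per_attribute(probs, targets, eps=1e-7):
    """Element-wise binary cross-entropy for independent attribute classifiers."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))

# Hypothetical attribute vocabulary and one object instance.
attributes = ["red", "wooden", "large", "wet"]
labeled = {"red"}  # the annotator listed only the most conspicuous attribute

# Multi-hot target: every unlabeled attribute becomes a negative sample.
targets = np.array([1.0 if a in labeled else 0.0 for a in attributes])

# Suppose the model (plausibly correctly) believes the object is also wooden.
probs = np.array([0.9, 0.8, 0.1, 0.1])

loss = bce_per_attribute(probs, targets)
for name, value in zip(attributes, loss):
    print(f"{name}: {value:.3f}")
```

In this toy case the largest loss term belongs to “wooden”, a plausible but unlabeled attribute, so the gradient pushes the classifier toward a negative prediction for it.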
In this paper, we aim to design a GRU-based attribute detection module that takes the relationship between attributes into consideration and avoids the performance drop on “imperfect” attribute datasets. We adopt a CNN to provide the features from which the GRU obtains the attributes. A CNN can robustly classify objects even if they are placed in different orientations, a property known as invariance. More specifically, convolutional neural networks cope stably with changes in image translation, viewing angle, image size and lighting conditions. Moreover, a CNN can take image data directly as input, without manual preprocessing, additional feature extraction or other complex operations; this retains more information by reducing the loss of data features, and CNNs have achieved very good results in the field of video processing. A GRU network has fewer network parameters and uses the hidden state for information transfer. Its convergence time is shorter, which better meets the real-time needs of human-computer interaction and image processing. The GRU-based attribute detection module has strong processing capabilities for data with long-term correlations and, with fewer network parameters, better meets real-time requirements. We compare our method with other attribute detection methods on the VG dataset; the results, shown in Section 4, are better than those of current methods under the same dataset and implementation details.
Related works
Attributes detection
The focus of AI adoption is restricted to improving the efficiency of operations or the effectiveness of operations. It can be leveraged to improve the stakeholder experience as well. We have seen continuous growth in the adoption of AI within the IT industry. With the help of artificial intelligence, an organization can get ROI in real time. This means that organizations will see their efforts being paid off. This is one of the most important AI trends. As more data regulations come into play, trust in AI will be pivotal. Most research in the field of multi-attribute detection is limited to a specific object class. For instance, Sarker, Kamal, Jabreel, Rashwan, Banu, Radeva and Puig (2018) focus on the attributes of food, Li et al. [9] and Lin, Zheng, Zheng, Wu and Yang (2017) focus on the attributes of persons, and Huang et al. [6], Mahbub et al. [16] and Zhuang et al. [28] focus on the attributes of faces. These methods are specially designed for the selected scenario and may not work well in other scenes.
Most of the above studies use the multi-binary-classifier method (such as Li et al. [9], Lin et al. (2017), Mahbub et al. [16], Zhuang et al. [28]) to detect attributes. Two of them (Li et al. [9] and Lin et al. (2017)), which have released their source code on GitHub, were selected for comparison with our method in Section 4.
We also select two methods (DVRL Liang et al. [12] and ML-GCN Chen et al. [3]), which can adapt to a given scene simply by changing the attribute classes, for comparison with our method.
DVRL Liang et al. [12] is not limited to the attributes of a specific object class and uses a correlation matrix (between objects and attributes) to improve performance. We also evaluate this method in Section 4 for comparison.
In the implementation of DVRL Liang et al. [12], a guided graph is used to represent the attributes of each object; it is generated from the dataset before training using the object-attribute correlation matrix. This method has both advantages and disadvantages: although most of the impossible attributes are filtered out, some positive attributes are filtered out as well. Moreover, the performance decreases significantly when the object detection module makes mistakes, because the wrong object-attribute correlations are then used to filter the attributes.
ML-GCN Chen et al. [3] uses Graph Convolutional Networks (GCN) to take the relations between detection targets into consideration in the multi-label detection task. The authors take object detection as an example and apply a multi-layer GCN to the object correlation matrix to capture the relations between objects. Attribute detection is also a multi-label detection task, so the method should be applicable; we evaluate it on attribute detection in Section 4.
Our method is inspired by Yang et al. [24], who use a sequence model for article attribute detection and design an LSTM-based model to detect attributes from articles. We believe that RNN-based methods have potential for attribute detection in the vision field, and we design a GRU-based method to detect the attributes of each object instance. In contrast to DVRL and ML-GCN, we do not explicitly use an object-attribute (or attribute-attribute) correlation matrix computed from the dataset.
Multi-label classification
Multi-attribute detection is not a simple multi-class classification problem but rather a multi-label classification problem, because more than one attribute can be detected for a single object instance and the number of detected attributes varies. Multi-label classification is an extension of multi-class classification in which an instance may be assigned to one or more classes rather than to a single label. In a multi-label problem, there is no limit to the number of classes that an instance can be assigned to. Formally, multi-label classification can be seen as the problem of finding a model that maps the input x to a binary vector y (assigning a value of 0 or 1 to each element (tag) in y). Several transformation methods exist for multi-label classification. The baseline is the binary relevance approach, whose core idea is to train an individual binary classifier for each label independently. In our proposed method, the GRU-based module handles the multi-label classification problem and takes the relation between attributes into consideration.
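As a small illustration of the formal view above, the sketch below encodes a variable-size label set as a fixed-length binary vector y and decodes independent per-label scores in the binary relevance style. The vocabulary, the scores and the 0.5 threshold are assumptions chosen only for the example.

```python
import numpy as np

vocab = ["red", "wooden", "large", "brown"]  # hypothetical attribute vocabulary
index = {name: i for i, name in enumerate(vocab)}

def to_binary_vector(labels):
    """Encode a variable-size label set as a fixed-length 0/1 vector y."""
    y = np.zeros(len(vocab), dtype=np.int64)
    for name in labels:
        y[index[name]] = 1
    return y

def binary_relevance_decode(scores, threshold=0.5):
    """Decode independent per-label scores; each label is decided on its own."""
    return [vocab[i] for i, s in enumerate(scores) if s >= threshold]

y = to_binary_vector({"red", "large"})
predicted = binary_relevance_decode(np.array([0.7, 0.2, 0.6, 0.4]))
```

Because every label is thresholded separately, this baseline has no way to use the relation between attributes, which is the gap the sequence model is meant to address.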
Multi-label classification solutions inspired by deep learning techniques have also been proposed. Some of these solutions rely on autoencoders, which take into account the nature of unsupervised learning. With this approach, the underlying features of any given data can be efficiently extracted, resulting in a well-encoded reduced dataset. Some authors rely on convolutional neural networks (CNNs) to solve prediction problems involving images, sound, text and video. CNNs use a mix of convolutional, pooling and standard processing layers to capture abstract features that describe the problem. The role of the convolutional layer is to detect the local connections of features from the previous layer, while the pooling layer merges reasonably similar features for the higher layers. Meanwhile, other researchers focus on a specific application sector of federated learning, namely data-driven cooperative localization and location data processing, in order to better understand the attributes during training. Yin et al. [30] presented a centralized implementation, where interested readers can find more details about the measurements, the configurations of the DNNs and a diagram of the whole navigation system.
Variable-length attributes detection network
Figure 1 shows the overall diagram of our proposed method, which consists of three parts: feature extraction, multi-attribute detection and the loss computation method. The two primary modules of our network are the feature extraction module and the GRU-based multi-attribute detection module. We also use a message passing structure from the object detection branch to the attribute detection branch to refine the attribute features and take the relationship between the object and its attributes into consideration.
The overall structure of our model VADN. The DCNN generates shared convolutional features and the RPN predicts object regions. Our model can be divided into three modules: the feature extraction module (which extracts the attribute appearance, color and space features), the attribute feature refinement module and the GRU-based multi-attribute detection module.
In the proposed method, the appearance features of each object are obtained from the deep convolutional network (DCNN) with an ROI-pooling layer. Because the DCNN is pretrained for object detection, we cannot be sure that the extracted features satisfy the needs of attribute detection. Moreover, the ROI-pooling layer loses most of the space features, because it resizes the input features to a fixed size. We consider the color and space features to be especially useful for attribute detection; thus we design two dedicated modules for color and space feature extraction (as shown in Fig. 1) to extend the appearance features extracted by the pretrained DCNN.
Attribute appearance features
We directly use the pooled features from the DCNN as input and then apply two consecutive fully connected layers (FC_1 and FC_2) with ReLU units to extract the appearance feature vector. Both fully connected layers have a size of 1024, so the extracted appearance feature vector has length 1024.
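The two-layer appearance head can be sketched in NumPy as follows. Only the layer names FC_1 and FC_2 and the width of 1024 come from the text; the input size `POOLED_DIM`, the weight initialization and the batch of three objects are assumptions made so that the example runs.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init_fc(in_dim, out_dim):
    """Small random weights and zero bias (the initialization is an assumption)."""
    return rng.normal(0.0, 0.01, size=(in_dim, out_dim)), np.zeros(out_dim)

POOLED_DIM = 4096  # assumed flattened ROI-pooled feature size; depends on the backbone
W1, b1 = init_fc(POOLED_DIM, 1024)  # FC_1
W2, b2 = init_fc(1024, 1024)        # FC_2

def appearance_features(pooled):
    """pooled: (num_objects, POOLED_DIM) -> (num_objects, 1024)."""
    hidden = relu(pooled @ W1 + b1)
    return relu(hidden @ W2 + b2)

pooled = rng.normal(size=(3, POOLED_DIM))
features = appearance_features(pooled)
```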
Color features
We first use two convolution layers, with 8 and 32 convolution kernels respectively, to extract color features. Because the color features should be space independent, both convolution layers use 1×1 convolution kernels. We then use max pooling and mean pooling for down-sampling. After that, an ROI-pooling layer extracts the color feature map for each object instance. Finally, a fully connected layer with a ReLU unit generates the color feature vector, which has length 128.
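The reason 1×1 kernels give space-independent color features is that such a convolution applies the same linear map to every pixel. The sketch below covers only the two 1×1 convolution layers (8 and 32 kernels); the pooling, ROI-pooling and final fully connected layer are omitted, and the ReLU between the layers as well as the random weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, weight, bias):
    """1x1 convolution: the same linear map applied at every pixel.
    x: (H, W, C_in), weight: (C_in, C_out), bias: (C_out,)."""
    return x @ weight + bias

W_a, b_a = rng.normal(size=(3, 8)), np.zeros(8)    # first layer, 8 kernels
W_b, b_b = rng.normal(size=(8, 32)), np.zeros(32)  # second layer, 32 kernels

def color_map(image):
    """image: (H, W, 3) RGB -> (H, W, 32) per-pixel color features.
    The ReLU between the two layers is an assumption."""
    hidden = np.maximum(conv1x1(image, W_a, b_a), 0.0)
    return conv1x1(hidden, W_b, b_b)

image = rng.random((4, 6, 3))
out = color_map(image)
flipped_out = color_map(image[:, ::-1])  # mirror the image horizontally
```

Mirroring the input simply mirrors the output, which confirms that each output value depends only on the color of its own pixel and not on its neighbors or position.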
Space features
The space features consist of local space features and global space features. As shown in Fig. 2, we first generate a 6-tuple from the object ROI and use a fully connected layer (FC_space_local) to embed the 6-tuple into the local space features. We then use a fully connected layer (FC_space_global) to embed the local space features into the global space feature space. Because the number of object proposals is variable, the max function is finally used to combine the global space features from the different object instances and generate the global space feature vector.
The 6-tuple is defined as
Space feature extraction module, which uses object ROIs to extract local and global space features.
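A rough NumPy sketch of the local and global space feature path is given below. The layer names FC_space_local and FC_space_global and the max over objects follow the text; the 6-tuple used here (normalized box corners plus normalized width and height), the layer widths and the ReLU activations are purely hypothetical stand-ins, since the paper's own definition is not reproduced in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def six_tuple(box, img_w, img_h):
    """Hypothetical 6-tuple: normalized corners plus normalized width and height.
    This is an assumption, not the definition used in the paper."""
    x1, y1, x2, y2 = box
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                     (x2 - x1) / img_w, (y2 - y1) / img_h])

LOCAL_DIM, GLOBAL_DIM = 64, 64  # assumed layer sizes
W_local, b_local = rng.normal(0, 0.1, (6, LOCAL_DIM)), np.zeros(LOCAL_DIM)
W_global, b_global = rng.normal(0, 0.1, (LOCAL_DIM, GLOBAL_DIM)), np.zeros(GLOBAL_DIM)

def space_features(boxes, img_w, img_h):
    tuples = np.stack([six_tuple(b, img_w, img_h) for b in boxes])
    local = np.maximum(tuples @ W_local + b_local, 0.0)        # FC_space_local
    per_object = np.maximum(local @ W_global + b_global, 0.0)  # FC_space_global
    global_vec = per_object.max(axis=0)  # element-wise max over all objects
    return local, global_vec

boxes = [(10, 20, 110, 220), (50, 60, 90, 100), (0, 0, 640, 480)]
local, global_vec = space_features(boxes, img_w=640, img_h=480)
local_perm, global_perm = space_features(boxes[::-1], img_w=640, img_h=480)
```

The element-wise max yields a vector of fixed length regardless of how many proposals there are, and it does not depend on the order of the proposals.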
The object and attribute appearance features are both extracted from the given image but in different branch, and these two types of features have different focus. We consider that there are relationships between the object and attributes and use the object features to refine the attribute features. During the feature refinement process, we do not directly concatenate these two types of appearance features together. Instead, a gate function is used to refine the attribute appearance features with the help of object appearance features. For each object instance, we use the object appearance features, which are extracted from the object detection branch, to refine the attribute appearance features. The features from “object to attribute” are simply defined as follows:
Where
where
We then use gated features
Where
We generate the attribute word vectors with the method of Pennington et al. [18] before the training phase, and these word vectors are used in the GRU-based multi-attribute detection module. The precomputed word vector dictionary of Pennington et al. [18] is computed from 6B tokens with a 400K vocabulary, and the similarity of two attribute word vectors can generally represent the correlation of the two attributes.
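One common way to measure the similarity of two word vectors is cosine similarity; the text does not state which measure is used, so this choice is an assumption. The three-dimensional vectors below are made-up toy values standing in for the pretrained word vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d vectors standing in for pretrained word vectors (values are made up).
word_vectors = {
    "yellow": np.array([0.9, 0.1, 0.0]),
    "orange": np.array([0.8, 0.3, 0.1]),
    "big":    np.array([0.0, 0.2, 0.9]),
}

sim_color = cosine_similarity(word_vectors["yellow"], word_vectors["orange"])
sim_mixed = cosine_similarity(word_vectors["yellow"], word_vectors["big"])
```

With these toy values the two color words score higher than the color and size pair, mirroring the “yellow”, “orange” and “big” example from the introduction.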
In our proposed method, the variable-length attributes detection module works on the following principle: At state
To satisfy this principle, we design a GRU-based model as shown in Fig. 3. The GRU-based multi-attribute detection module consists of four parts. FC_attr is a fully connected layer that transforms the concatenated input features into the attribute feature space. The transformed attribute feature (
GRU-based multi-attribute detection module. Attributes are detected at each step.
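For readers unfamiliar with GRUs, the sketch below shows a standard GRU update and a greedy step-by-step decoding loop that emits one attribute per step until an end token is produced. It is a generic illustration and not the module of Fig. 3: the input layout (attribute feature concatenated with the previous attribute's word vector), the end token, the output layer and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Standard GRU update (Cho et al. convention); sizes are illustrative."""
    def __init__(self, in_dim, hid_dim):
        def mat(r, c):
            return rng.normal(0.0, 0.1, (r, c))
        self.Wz, self.Uz = mat(in_dim, hid_dim), mat(hid_dim, hid_dim)
        self.Wr, self.Ur = mat(in_dim, hid_dim), mat(hid_dim, hid_dim)
        self.Wh, self.Uh = mat(in_dim, hid_dim), mat(hid_dim, hid_dim)

    def step(self, x, h):
        z = sigmoid(x @ self.Wz + h @ self.Uz)             # update gate
        r = sigmoid(x @ self.Wr + h @ self.Ur)             # reset gate
        h_cand = np.tanh(x @ self.Wh + (r * h) @ self.Uh)  # candidate state
        return (1.0 - z) * h + z * h_cand

def decode(cell, feature, embed, W_out, end_id, max_len=10):
    """Greedy decoding: emit one attribute id per step until END or max_len."""
    h = np.zeros(cell.Uz.shape[0])
    prev = np.zeros(embed.shape[1])  # no previous attribute at the first step
    out = []
    for _ in range(max_len):
        h = cell.step(np.concatenate([feature, prev]), h)
        idx = int(np.argmax(h @ W_out))
        if idx == end_id:
            break
        out.append(idx)
        prev = embed[idx]
    return out

FEAT, EMB, HID, VOCAB = 16, 8, 32, 5  # assumed sizes; id VOCAB is the END token
cell = GRUCell(FEAT + EMB, HID)
embed = rng.normal(size=(VOCAB, EMB))
W_out = rng.normal(size=(HID, VOCAB + 1))
sequence = decode(cell, rng.normal(size=FEAT), embed, W_out, end_id=VOCAB)
```

Because decoding stops at the end token, the length of the returned sequence is not fixed in advance, which is the property needed for variable-length attribute detection.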
Dimension-reduced visualization of several example attribute word vectors. The red circle is the start node and the connected nodes are the labeled attributes of a given object instance. The edges form the unique path that starts from the node “red” and passes through the other labeled attribute nodes, found by the greedy method.
As mentioned in Section 3.2, we regard the target attributes of an object as a digraph (as shown in Fig. 3), whose edges are the relationships among these attributes. There is an edge between every pair of attribute nodes in the digraph. We regard the word vector similarity as the strength of the relationship: the stronger the relationship between two attributes, the higher the edge weight. Because the digraph is static, from a given start attribute node we can find a unique path that passes through all the other given nodes with the largest sum of edge weights. For example, as shown in Fig. 4, assume that an object instance has four labeled attributes “wooden, large, brown, red”. We select “red” as the start node and use the greedy method (selecting the highest-weight edge from the last node, step by step) to find the unique path through these attribute nodes. Then, we select the other nodes “wooden”, “large” and “brown” as the start node in turn and repeat the above steps. Finally, we have four unique paths through these four nodes, and the four target attribute sequences are generated.
In the training phase, we design a special strategy to select one target attribute sequence from the candidate target attribute sequences and compute the loss. The strategy is described below:
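The greedy construction of the target attribute sequences can be sketched as follows. The function names and the two-dimensional toy vectors are hypothetical; cosine similarity is assumed as the edge weight, and the step-by-step greedy choice is used as a heuristic for following strong edges.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def greedy_sequence(start, labels, vectors):
    """Walk from `start`, always moving to the most similar unvisited attribute."""
    path, remaining = [start], [a for a in labels if a != start]
    while remaining:
        last = path[-1]
        nxt = max(remaining, key=lambda a: cosine(vectors[last], vectors[a]))
        path.append(nxt)
        remaining.remove(nxt)
    return path

def target_sequences(labels, vectors):
    """One candidate target sequence per possible start attribute."""
    return {start: greedy_sequence(start, labels, vectors) for start in labels}

# Toy 2-d vectors (made-up values, not real word vectors).
vectors = {
    "red":    np.array([1.0, 0.0]),
    "brown":  np.array([0.9, 0.3]),
    "wooden": np.array([0.5, 0.8]),
    "large":  np.array([0.0, 1.0]),
}
sequences = target_sequences(["wooden", "large", "brown", "red"], vectors)
```

With these toy vectors, the sequence starting at “red” visits “brown”, then “wooden”, then “large”; an object with four labeled attributes yields four candidate sequences, one per start node.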
Condition 1: No attributes are labeled in the annotation. In this condition, we consider that the object’s attributes were very probably omitted by the annotator. Thus, we simply skip this object instance when computing the attribute loss.
Condition 2: The attribute annotation is not empty and all the labeled attributes appear in the predicted attribute sequence. In this condition, we consider that all the labeled attributes are detected, but several omitted attributes may still exist. We try to keep the predicted result in this condition and slightly penalize the unlabeled attributes that are detected, to
Example of condition 2.
Condition 3: The attribute annotation is not empty and the first predicted attribute does not appear in the attribute labels. This condition is very likely in the first few epochs of the training phase. We randomly select one target attribute sequence from the target attribute sequences and compute the loss.
Condition 4: The attribute annotation is not empty and the first predicted attribute appears in the attribute labels. This is the general state in the training phase. We select the target attribute sequence that starts with the first predicted attribute from the target attribute sequences and compute the loss.
Once the target attribute sequence is determined, the target sequences are padded to a fixed length so that the loss can be computed more easily over a batch of object instances, and the attribute loss is computed as follows. In condition 3 and condition 4:
where
In the condition 2:
where
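The four-condition selection strategy described above can be sketched as a small function. This is one possible reading under stated assumptions: the function name and return format are hypothetical, condition 2 is checked before conditions 3 and 4, an empty prediction is treated as condition 3, and for condition 2 the prediction itself is returned as the sequence to keep. The loss terms themselves are not modeled.

```python
import random

def select_target(labels, predicted, sequences, rng=random):
    """Return (condition, target_sequence) for one object instance.
    labels: annotated attributes; predicted: model output sequence;
    sequences: dict mapping a start attribute to its target sequence."""
    if not labels:
        return 1, None                 # skip: annotation was likely omitted
    if set(labels) <= set(predicted):
        return 2, list(predicted)      # keep prediction, softly penalize extras
    if not predicted or predicted[0] not in labels:
        return 3, sequences[rng.choice(sorted(labels))]  # random target sequence
    return 4, sequences[predicted[0]]  # sequence starting at the first prediction

sequences = {"red": ["red", "brown"], "brown": ["brown", "red"]}
labels = ["red", "brown"]
case_skip = select_target([], ["red"], sequences)
case_keep = select_target(labels, ["brown", "red", "large"], sequences)
case_random = select_target(labels, ["large", "red"], sequences, rng=random.Random(0))
case_start = select_target(labels, ["brown", "large"], sequences)
```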

Dataset
In our experiments, we use VG (Visual Genome) Krishna et al. [8] as the primary dataset. There are 108,249 images in the VG dataset, and the number of attribute instances is 1,670,182. The attribute annotations in the VG dataset are not standardized in many cases. For instance, some attributes are merged into a single attribute (e.g., “black” and “white” are always labeled as “black and white”), some attributes are synonymous but spelled differently (e.g., “grey” and “gray” have the same meaning but are labeled as two different attributes), and some attributes are labeled as short sentences (e.g., “playing tennis”).
Therefore, we use WordNet Miller (1998) and NLTK Bird and Loper [2] to normalize the attributes in the VG dataset. Then, we randomly select 46,164 images for training and 10,000 images for testing. During preprocessing, we keep the 100 most frequent attribute labels. After preprocessing, we have 343,838 attribute instances in total, with 6.12 attributes per image. The training set has 280k attributes in total and 6.1 attributes per image. The test set has 61k attributes in total and 6.1 attributes per image.
Other dedicated attribute detection datasets, such as CUB-200, follow a standard annotation principle during the annotation process, and the attribute annotation is a binary vector. We can be sure that every attribute is labeled as either a positive or a negative sample, so no further processing is needed. The CUB-200 bird dataset Wah et al. [23], which has 200 object classes and 313 attribute classes, consists of 11,788 images, with 5,994 for training and 5,794 for testing. It has 31.48 attributes per image on average.
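The kind of attribute normalization and frequency filtering applied to VG can be illustrated with a pure-Python sketch. The hand-written `SYNONYMS` table is a hypothetical stand-in for the WordNet and NLTK lookup, the splitting rule on “and” is an assumption, and the handling of short sentences such as “playing tennis” is left unchanged because the text does not specify it.

```python
from collections import Counter

SYNONYMS = {"grey": "gray"}  # hypothetical table standing in for a WordNet lookup

def normalize(raw):
    """Split merged attributes and map spelling variants to one canonical form."""
    parts = [p.strip() for p in raw.lower().split(" and ")]
    return [SYNONYMS.get(p, p) for p in parts if p]

def top_k_attributes(annotations, k):
    """annotations: iterable of raw attribute strings; returns the k most frequent."""
    counts = Counter(a for raw in annotations for a in normalize(raw))
    return [name for name, _ in counts.most_common(k)]

raw = ["black and white", "grey", "Gray", "white", "playing tennis", "gray"]
kept = top_k_attributes(raw, k=2)
```

In the experiments k would be 100; the toy call with k=2 keeps “gray” and “white”.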
Implementation details
MSDN Li et al. [10] is a multi-task detection network that performs object detection, relationship detection and caption generation jointly. In our experiments, we use MSDN as our base model only for object detection, and the parameters in MSDN are fixed during the training phase. To compare our model fairly with other methods, such as DVRL and MSDN, we select VGG16 as the base feature extraction module. The parameters in the additional layers are normally initialized and fine-tuned during the training phase. Adam is used to optimize our model. The learning rate starts at 0.001 and is reduced by a factor of 0.2 after each epoch. Moreover, our model can be trained end-to-end.
To show that all of our design choices work well for the attribute detection task and that no module is redundant, we perform ablative studies in Section 4.4. Four variant models are designed as follows:
“VADN_base” is the base model in the ablative study; only the attribute appearance feature is used to detect attributes.
“VADN_C” adds a color feature extraction module to “VADN_base”.
“VADN_CS” adds a space feature extraction module to “VADN_C”.
“VADN_CSR” adds a feature refinement module to “VADN_CS”.
We also design a variant model (VADN_freq) without the loss computation method described in Section 3; instead, it simply uses the ascending order of attribute frequency as an alternative order, to verify the effect of our loss computation method.
For comparison with existing methods, we retrain four existing attribute detection models, whose code has been released on GitHub, on the VG dataset. To eliminate data differences among these methods, all the models use the same pre-trained RPN model for object region proposal. Deep MAR Li et al. [9] and APR Lin et al. (2017) are designed for the person attribute detection task; we treat every object as a person and retrain these two models on the VG dataset. DVRL Liang et al. [12] is designed for attribute detection on images, and ML-GCN Chen et al. [3] was introduced for multi-label detection; we simply regard the attributes of the given object region as the target, without any modification of the model structure.
In the above comparison, we train only the attribute detection branch, while the object detection branch and the shared convolution head are fixed. We further train our model VADN_CSR jointly with our object detection model to fine-tune the weights, and we call the fine-tuned model our final model, VADN_final. After fine-tuning, our model shows a slight improvement on both the object detection and the attribute detection tasks. Figure 6 gives the scene graph visualization results of our proposed method.
Due to the imperfect attribute annotations in the VG dataset, accuracy is not an effective metric for multi-attribute detection. Predicted attributes that do not appear in the ground truth may still be correct. Considering that 66% of the objects in VG are not annotated with attributes, the number of potential attributes can be huge. We use the F1-score, which considers both recall and precision, as the primary metric to evaluate our model. Note that we ignore object instances that have no attribute annotations. We also use the average number of attributes per object (Attr per object) as a metric to show the effectiveness of our attribute features and loss methods.
The WmAP metric is used in FineTag Zakizadeh et al. [26] to evaluate their model on CUB200; the higher, the better.
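The evaluation protocol can be sketched as below: precision, recall and F1 are computed over attribute sets, skipping objects without annotations, together with the average number of predicted attributes per object. Pooling the counts over all evaluated objects (micro-averaging) and counting “Attr per object” only over evaluated objects are assumptions, since the text does not specify the averaging scheme; the function name is hypothetical.

```python
def attribute_f1(predictions, ground_truth):
    """predictions / ground_truth: lists of attribute sets, one per object.
    Objects with an empty annotation are ignored."""
    tp = n_pred = n_true = n_objects = 0
    for pred, true in zip(predictions, ground_truth):
        if not true:
            continue
        tp += len(pred & true)
        n_pred += len(pred)
        n_true += len(true)
        n_objects += 1
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    attrs_per_object = n_pred / n_objects if n_objects else 0.0
    return precision, recall, f1, attrs_per_object

preds = [{"red", "wooden"}, {"large"}, {"wet"}]
truth = [{"red"}, set(), {"wet", "brown"}]
precision, recall, f1, per_obj = attribute_f1(preds, truth)
```

In the toy example the second object is skipped because it has no annotation, and “wooden” counts against precision even though it may be a correct but unlabeled attribute, which is exactly the ambiguity discussed above.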
Ablative analysis of feature type
To analyze the performance of each module, we perform an ablative analysis. We compare our four variant models for the multi-attribute detection task using recall, precision and F1-score metrics. The results are shown in Table 1.
Color features: As shown in Table 1, our model VADN_C improves recall and precision by 1.41% and 1.38%, respectively, compared with VADN_base. The F1-score also improves by 1.4%.
Space features: As shown in Table 1, our model VADN_CS improves recall and precision by 0.1% and 1.29%, respectively, compared with VADN_C. The F1-score also improves by 0.69%. The improvement in precision benefits from the space features.
Refinement process: As shown in Table 1, our model VADN_CSR achieves a 0.5% improvement in recall compared with the VADN_CS model. The F1-score also improves by 0.25%. The improvement in recall benefits from the refinement process.
The refinement process from object features to attribute features is effective for attribute detection. We think the primary reason is that the attribute features cover the whole object region, whereas the object features focus on the foreground. The refinement process can therefore filter out some background features, so the refined features are more suitable for attribute detection.
Ablative analysis of the variant models. A, C and S are three modules to extract attribute appearance features, color features and space feature, respectively. The R is the refinement module, which uses the object appearance features to refine the attribute appearance features
(a) is the image with object and attribute labels. (b) is the THREE-semantic-level scene graph. There are many things on the table, such as food, bowl and glass. Because the plate is not detected, some relationships, which regard sandwich as object or subject, are unsuitable.
We compare our arrangement method with a simple frequency-order arrangement in the rearrangement process discussed in Section 3.3. VADN_freq is 7.22% lower than our rearrangement method on the attribute detection task. This supports our view that the difficulty of detecting different attributes does not simply follow the ascending order of attribute frequency in the dataset. Our method finds a suitable order of the attributes for each object and tends to predict more attributes per object instance, which is appropriate for an “imperfect” attribute dataset, because unlabeled attributes may also be present on the object instance.
Evaluation results of the different arrange method
Attributes detection task
Deep MAR Li et al. [9] and APR Lin et al. (2017) are multi-binary-classifier methods, which are widely used in attribute detection. However, such methods are not suitable for “imperfect” attribute datasets, because they regard all unlabeled attributes as negative samples in the training phase. Our VADN improves the F1-score by 8.51% and 18.1% compared to Deep MAR and APR, respectively.
DVRL Liang et al. [12] uses the object-attribute correlation matrix to improve attribute detection performance. Attribute detection in DVRL depends excessively on object detection, so object detection errors propagate to attribute detection. The attribute detection module in our method is only partially based on the object features, and our model improves the F1-score by 13.19% compared to DVRL. Although DVRL has high recall, it predicts more wrong attributes and has lower overall performance. ML-GCN Chen et al. [3] is designed for multi-label detection but is not well suited to attribute detection, because the correlation between attributes is in fact complex and the attribute correlation matrix cannot completely represent the relations between attributes in images. Moreover, the VG dataset is “imperfect”, so ML-GCN performs poorly.
On the CUB-200 dataset, our VADN also achieves a 1.3% improvement compared to FineTag Zakizadeh et al. [26].
Evaluation results of existing methods on VG and CUB-200 datasets. The VGG16+rank and FineTag are both provided by Zakizadeh et al. [26]
Object and relationship detection are not the goals of our design, but we find that these two tasks also improve slightly with the help of attribute detection. We integrate our model into MSDN and fine-tune the whole model. As shown in Table 4, VADN improves object detection by 0.21% and 0.10% in terms of Recall@64 and mAP, respectively, and improves relationship detection by 0.19% and 0.13% in terms of Recall@50 and Recall@100, respectively.
Results of object and relationship detection. The current version of MSDN is our base model, and VADN is our final model
This paper addresses the problem of multi-attribute detection on imperfect attribute datasets. We propose a special loss function that transforms the multi-attribute detection problem into a sequence prediction problem. Then, we design a GRU-based attribute detection module to solve it. Three different types of features are used for attribute detection, and a refinement process further improves the performance. We perform ablative studies and several comparisons to show the contribution of each part of our design. Furthermore, we compare our model VADN with several existing methods to demonstrate its performance. In the future, first, we will consider using a multi-agent reinforcement learning algorithm to select the optimal aggregation threshold across different relationships, and optimizing the fusion weights of the model based on cross-validation, to guarantee reliable and robust prediction performance. Second, a flexible design of end-to-end networks and loss functions is necessary to optimize the combination of branch settings and attention mechanisms. Third, we will combine our attribute detection model with other high-semantic-level tasks (such as question answering and captioning) to further improve performance.
