Multimodal semantic analysis with regularized semantic autoencoder

Abstract

Real-world data is multimodal, and to classify it with machine learning algorithms, the features of both modalities must be transformed into a common latent space. When high-dimensional features are transformed into the common space, they lose their locality information and become susceptible to noise. This article addresses this issue in the semantic autoencoder and presents a novel algorithm that maps distinct features into a common hidden space while preserving locality. We call it the discriminative regularized semantic autoencoder (DRSAE). It keeps the low-dimensional features on the manifold to manage the inter- and intra-modality structure of the data. The data has multiple labels, which are transformed into a feature-aware label space using conditional principal label space transformation (CPLST). With the proposed two-fold algorithm, we achieve a significant improvement in text retrieval from an image query and image retrieval from a text query.

Keywords

Semantic autoencoder, hypergraph, twofold validation, cross-modal retrieval

1 Introduction

Multimodal community applications have grown rapidly in recent years. Multimedia data such as images and videos are generated in large quantities on the web, and users draw on this data to fulfil their requirements. When users search for multimedia data, text features are extracted from it; the text description of an image or video conveys its content. Cross-modal retrieval is used to retrieve such multimodal data: a text query can retrieve images, and an image query can retrieve text. It exploits the correlations between modalities. For example, a user interested in a sportsman can submit an image query and obtain related multimodal information, including text details, videos, and images. Cross-modal retrieval provides richer results than single-modal retrieval and gives users a more flexible search experience. Because the modalities are inconsistent, image and text representations cannot be matched with each other directly. Various semantic learning techniques have been proposed for image-text retrieval in the past. Cross-modal retrieval remains a major challenge because the query and the retrieved data belong to different modalities, e.g. an image is the input used to search for text, or vice versa. This problem is addressed by a mapping function that represents the data in a common space. The common space representation [1, 4] gives the different modalities the same dimensionality, so that similarity can be estimated with a simple metric. Learning such a similar representation for cross-media content is the fundamental research problem. The common space representation is used for single media data.
Some supervised and semi-supervised cross-modal retrieval methods use semantic information, such as the class label and the label space, for mapping into the common representation (Table 1). Manifold regularization is used to learn cross-media features in a vector-valued reproducing kernel Hilbert space (RKHS) with a kernel transformation. The RKHS-based SCVM algorithm [3] improves the retrieval accuracy.

Table 1
Notations

DHMMSAE	Discriminative Dual Hypergraph multimodal semantic autoencoder
HMMSAE	Dual Hypergraph multimodal semantic autoencoder
MMSAE	Multimodal semantic autoencoder
SVM	Support vector machine
KNN	K-nearest neighbours
PT	Problem transformation
BR	Binary relevance
CPLST	Conditional Principal label space transformation

For high-dimensional heterogeneous multimedia data, nonlinear learning methods are used to extract semantic similarity. Canonical correlation analysis and multiple kernel learning are used to capture the semantic information of high-dimensional data. A hybrid approach extracts the semantic information of big data by combining a modal-sharing transfer sub-network with a layer-sharing correlation sub-network. In the layer-sharing correlation sub-network, cross-modal semantic correlation information is used to complete the cross-modal retrieval task in the target domain [4].

Multilabel double-layer learning (MDLL) is proposed for the multilabel cross-modal retrieval task [5]. Deep supervised cross-modal retrieval (DSCMR) provides a common space in which the different modalities of a dataset can be compared. A combined model of the image and text components of multimedia content is presented in [6]. The authors of [7] suggested Scalable Deep Multi-modal Learning (SDML) for the cross-modal retrieval task. SDML works on the principle of maximizing the between-class variation and minimizing the within-class variation. Generalized Semi-Supervised Structured Subspace Learning (GSS-SL) is proposed for the cross-modal retrieval task in [15]. A joint optimization framework with kernel correlation maximization and discriminative structure preservation is presented in [16]. A scheme combining different modalities, feature selection and multimodal graph regularization is proposed in [17].

1.1 Autoencoder based technique

Cross-modal retrieval tasks are divided into four subcategories: a) Single-Label Paired (SL-P), b) Single-Label Unpaired (SL-U), c) Multi-Label Paired (ML-P) and d) Multi-Label Unpaired (ML-U). In SL-P, the samples of the two modalities are in one-to-one correspondence. In SL-U, the pairing is missing. ML-P deals with multi-label data that carries pair information. In ML-U, both modalities carry multi-label data in different forms. Cross-modal retrieval of paired and unpaired data is performed with hashing methods. The retrieval process is further classified as single-label or multilabel; in the single-label case, each sample belongs to a single semantic class. The cross-modal approach has some drawbacks. Semantic word representations can be learned more easily by cross-modal retrieval than text word representations. In most multimodal retrieval approaches, autoencoders [1, 8–12] are used to decode the semantic information of the data. The newly assigned weights are multiplied with the dimension and content of the image or text, and the text and visual modalities are encoded as vectors automatically.

1.2 Research gap

From the above discussion, we can point out the following research gaps:

In multimodal analysis, the features are mapped into a common latent space, but they lose their locality, which makes them prone to noise and classification errors.

Binary labels in multimodal analysis include only the label information and ignore the mapped feature information.

The data is not divided into the same classes based on the features or label only.

High-dimensional features transformation in the encoder is prone to noise. So, the features are also transformed in multimodal space with hypergraph regularization.

1.3 Motivation and contribution

In this paper, we approach multimodal classification in two folds. The data has multiple labels. For single-modality data, multilabel methods fall into two categories: algorithm adaptation (e.g. AdaBoost, rank-SVM, KNN) and problem transformation (PT) (e.g. label power set, binary relevance). Binary relevance (BR) is a widely used PT method because it can be combined with any classifier, but it considers only the labels and ignores the feature information [29]. Figure 1 illustrates this. It shows the common latent space representation along the first two directions. The image-text data are not mapped distinctly in Fig. 1(a) and (b). The BR method for label transformation mixes the labels because feature information is unavailable, as in Fig. 1(a). CPLST uses the feature information in the label transformation and separates the labelled data better, but the labels are still not distinct. We also experimented with other label transformation schemes [30, 31], but CPLST outperformed them in multimodal analysis, as Section 3 also shows. Although the multimodal extension of CPLST maps the labels into an embedding space, the projection of the features into the common space is not efficient in the multimodal semantic autoencoder.

Fig. 1

Common latent dimension representation for the three categories with the largest number of samples in the Wiki dataset. Results are shown for (a) BR + DHMMSAE, (b) MCPLST + MMSAE and (c) MCPLST + DHMMSAE.

Our contributions in this work are:

Transforming the features into multimodal space by Conditional Principal label Space Transformation (CPLST).

Introducing a novel multimodal semantic autoencoder named Distinctive Regularized Semantic Autoencoder (DRSAE).

Preserving the locality information of high-dimensional features by hypergraph regularization.

This work presents the joint projection of image and text features into a common latent space in the semantic autoencoder with distinctive features. The locality information of the features can be lost because of the high dimensionality of the data, and preserving it makes the data less prone to noise. To preserve the locality of the feature space in the manifold space, we add hypergraph regularization to the semantic autoencoder. In machine learning, features that lie close together in the semantic space are commonly assumed to be more likely to share labels. For the manifold locality projection of the autoencoder (AE) to succeed, features with multiple labels should be distinct, so a discriminative regularization factor is also added to the multimodal semantic AE. Figure 1(c) supports this. The contribution is demonstrated in Section 3 of this article. We name the proposed semantic autoencoder the Distinctive Regularized Semantic Autoencoder (DRSAE).

Before mapping the data into the common space, we transform the multilabels into a feature-aware semantic space by following the multimodal extension of Conditional Principal label space transformation (CPLST) proposed in [1].

1.4 Paper organization

The rest of this paper is organized as follows. Section 2 presents the proposed semantic autoencoder and its optimization. Section 3 describes the experiments and discusses the results, and Section 4 concludes the paper.

2 Proposed semantic autoencoder

As discussed previously, the semantic autoencoder (SAE) [8] does not project the features into a low-dimensional space, and the feature projection of the multimodal SAE depends on it [1]. To project the multimodal features into the low-dimensional space, we add regularization to the SAE. P_v and P_t are the projection matrices of the image and text data, respectively, and V and T contain the original image and text features.

In the text modality, the features are projected to a hidden representation that retains the complete information of the original feature matrix and can be mapped back to the original features. This can be represented as

$P_{t} T = U, \quad T = P_{t}^{T} U$ (1)

where $U \in ℝ^{d \times n}$ is the representation of the n image/text features in the d-dimensional hidden space. Similarly, the projection of the image modality to the hidden space and back is represented as

$P_{v} V = U, \quad V = P_{v}^{T} U$ (2)

The hidden space is shared by both modalities to capture their semantic similarity. We define the autoencoder loss as the squared Frobenius norm of the approximation error. It minimizes the error of projecting the feature matrix to the hidden space and back from the hidden space:

$\begin{matrix} \arg\min (∥ P_{v} V - U ∥_{F}^{2} + α ∥ V - P_{v}^{T} U ∥_{F}^{2} \\ + ∥ P_{t} T - U ∥_{F}^{2} + β ∥ T - P_{t}^{T} U ∥_{F}^{2}) \end{matrix}$ (3)
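As a numerical check of Equation 3, the following NumPy sketch evaluates the four loss terms for given matrices. The function name `sae_loss` and the toy dimensions are assumptions made for illustration; features are stored column-wise, so V, T and U have one column per sample, as in the text.

```python
import numpy as np

def sae_loss(P_v, P_t, V, T, U, alpha, beta):
    """Evaluate the objective of Equation 3 for fixed matrices."""
    encode_v = np.linalg.norm(P_v @ V - U) ** 2     # ||P_v V - U||_F^2
    decode_v = np.linalg.norm(V - P_v.T @ U) ** 2   # ||V - P_v^T U||_F^2
    encode_t = np.linalg.norm(P_t @ T - U) ** 2     # ||P_t T - U||_F^2
    decode_t = np.linalg.norm(T - P_t.T @ U) ** 2   # ||T - P_t^T U||_F^2
    return encode_v + alpha * decode_v + encode_t + beta * decode_t

# Toy example (hypothetical sizes): 3-d hidden space, 8 samples.
rng = np.random.default_rng(0)
d, n, m_v, m_t = 3, 8, 6, 5
U = rng.standard_normal((d, n))
# Projections with orthonormal rows, so P P^T = I and the loss can reach zero.
P_v = np.linalg.qr(rng.standard_normal((m_v, d)))[0].T
P_t = np.linalg.qr(rng.standard_normal((m_t, d)))[0].T
V, T = P_v.T @ U, P_t.T @ U   # features that decode exactly from U
loss_exact = sae_loss(P_v, P_t, V, T, U, alpha=1.0, beta=1.0)
loss_noisy = sae_loss(P_v, P_t, V + 0.1, T, U, alpha=1.0, beta=1.0)
```

In this toy setting the loss vanishes when the features lie exactly in the range of the decoder and grows as soon as the image features are perturbed.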

Equation 3 represents a linear autoencoder with a single hidden layer. As discussed earlier, this equation does not guarantee that the projection maintains the manifold structure of the data. To preserve the manifold structure, the feature matrix has to be represented in a low-dimensional feature space. For this purpose, we add hypergraph regularization to the autoencoder's loss function, and Equation 3 becomes:

$\begin{matrix} \arg\min (∥ P_{v} V - U ∥_{F}^{2} + α ∥ V - P_{v}^{T} U ∥_{F}^{2} \\ + tr (P_{v} V L V^{T} P_{v}^{T}) + γ ∥ P_{v} V ∥_{F} + δ ∥ P_{v} ∥_{2, 1} \\ + ∥ P_{t} T - U ∥_{F}^{2} + α ∥ T - P_{t}^{T} U ∥_{F}^{2} + tr (P_{t} T L T^{T} P_{t}^{T}) \\ + γ ∥ P_{t} T ∥_{F} + δ ∥ P_{t} ∥_{2, 1} + β ∥ U - C ∥_{F}^{2}) \end{matrix}$ (4)

The first two terms are the loss for projecting the image features into the hidden space and back from it. The third term is the hypergraph regularization term, where L is the Laplacian matrix computed from the adjacency matrix of the feature graph. We also add nuclear norm and $ℓ_{2,1}$ norm regularization to obtain a low-rank structure. The same terms are defined for the text modality. A soft regularizer $∥ U - C ∥_{F}^{2}$ is also added.

Inspired by [23, 24], we add supervised label information to the above equation. The discriminative label information is added to Equation 4 as:

$\begin{matrix} \arg\min (∥ P_{v} V - U ∥_{F}^{2} + α ∥ V - P_{v}^{T} U ∥_{F}^{2} \\ + tr (P_{v} V L V^{T} P_{v}^{T}) + γ ∥ P_{v} V ∥_{F} + δ ∥ P_{v} ∥_{2, 1} \\ + ∥ P_{t} T - U ∥_{F}^{2} + α ∥ T - P_{t}^{T} U ∥_{F}^{2} + tr (P_{t} T L T^{T} P_{t}^{T}) \\ + γ ∥ P_{t} T ∥_{F} + δ ∥ P_{t} ∥_{2, 1} + β ∥ U - C ∥_{F}^{2} \\ + σ ∥ S - AU ∥_{F}^{2}) \end{matrix}$ (5)

Here, the term $∥ S - AU ∥_{F}^{2}$ is the squared Frobenius norm that localizes the labels, $S \in ℝ^{c \times n}$ is the class label matrix, and $A \in ℝ^{c \times r}$ is a randomly initialized matrix. S and A are non-negative matrices; c and r denote the number of classes in the label matrix and the number of features, respectively. S is constructed as:

$S_{ij} = \begin{cases} 1, & \text{if } y_{j} = i \\ 0, & \text{otherwise} \end{cases} \quad j = 1, 2, \dots, n, \; i = 1, 2, \dots, c$ (6)
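The construction of S in Equation 6 can be sketched in a few lines of NumPy. The helper name `label_indicator` and the use of 0-based class indices are assumptions of this sketch; rows index classes and columns index samples, so that S has the same shape as AU.

```python
import numpy as np

def label_indicator(y, num_classes):
    """Build S with S[i, j] = 1 when sample j belongs to class i, else 0."""
    y = np.asarray(y)
    S = np.zeros((num_classes, y.shape[0]))
    S[y, np.arange(y.shape[0])] = 1.0   # one non-zero entry per column
    return S

# Four samples with (0-based) class labels 0, 2, 1, 2 and c = 3 classes.
S = label_indicator([0, 2, 1, 2], num_classes=3)
```

Each column of S contains exactly one non-zero entry, which marks the class of the corresponding sample.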

2.1 Discriminative Dual Hypergraph regularization Calculation

We have text and image modalities for the semantic analysis, so dual hypergraph regularization is proposed. To obtain the hypergraph regularization parameter, a weighted graph with k nodes is constructed for the given k points x₁, x₂, x₃, …, x_k. Figure 2 illustrates the proposed algorithm, and Algorithm 1 lists the steps to calculate the hypergraph regularization parameter.

Fig. 2

Multimodal semantic autoencoder with Hypergraph regularization for locality preservation.

Algorithm 1: Steps to calculate the graph regularization parameter

Input: feature matrix for text or image modality
Output: graph regularization value
1. connect the points i and j with an edge using the k-nearest-neighbour scheme
2. choose the weight between these two points by the heat kernel scheme as $W_{ij} = e^{- \frac{∥ x_{i} - x_{j} ∥^{2}}{t}}$
3. for each set of connected points, compute the eigenvalues as LP_vV = λDP^TV^T, where D_ii = ∑_jW_ij is the diagonal weight matrix
4. construct the positive semidefinite Laplacian matrix L = D - W
5. if all graphs are not constructed
6. go to step 3
7. end if
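The graph construction in steps 1, 2 and 4 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name, the default values of k and t, and the symmetrisation of the neighbour graph are choices of the sketch, and it builds an ordinary pairwise k-NN graph exactly as the steps describe, without forming hyperedges. The input stores one sample per row.

```python
import numpy as np

def knn_heat_kernel_laplacian(X, k=5, t=1.0):
    """Return (W, D, L) for a k-NN graph with heat-kernel weights.

    X has shape (n_samples, n_features); L = D - W as in step 4.
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped against round-off.
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(dist2[i])
        neighbours = order[order != i][:k]          # step 1: k nearest points
        W[i, neighbours] = np.exp(-dist2[i, neighbours] / t)  # step 2
    W = np.maximum(W, W.T)          # keep an edge if either end selected it
    D = np.diag(W.sum(axis=1))      # D_ii = sum_j W_ij
    L = D - W                       # step 4: graph Laplacian
    return W, D, L

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))    # 10 toy samples with 4 features
W, D, L = knn_heat_kernel_laplacian(X, k=3, t=2.0)
```

The resulting L is symmetric with zero row sums, which is what makes the trace term in Equation 4 a non-negative smoothness penalty.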

2.2 Optimization of autoencoder loss function minimization

Equation 4 is a convex joint objective in P_v, P_t and U and is solved by an iterative algorithm. The subproblems for P_v and P_t have the same form, so we solve only for P_v here. Expanding the squared loss terms for P_v, the image-modality component is written as in Equation 7:

$\begin{matrix} \arg\min (∥ P_{v} V ∥_{F}^{2} + ∥ U ∥_{F}^{2} - 2 〈 P_{v} V, U 〉 \\ + α (∥ V ∥_{F}^{2} + ∥ P_{v}^{T} U ∥_{F}^{2} - 2 〈 V, P_{v}^{T} U 〉) \\ + tr (P_{v} V L V^{T} P_{v}^{T}) + γ ∥ P_{v} V ∥_{F} + δ ∥ P_{v} ∥_{2, 1}) \end{matrix}$ (7)

Taking the partial derivative of Equation 7 with respect to P_v and setting it to zero gives:

$\begin{matrix} P_{v} V V^{T} - U V^{T} + α (P_{v}^{T} U V^{T} - U V^{T}) + P_{v} V L V^{T} \\ + \frac{γ}{2} V + δ P_{v} D = 0 \end{matrix}$ (8)

Equation 8 can be written as $P_{v} {VV}^{T} (1 + L) - (1 + α) {UV}^{T} + α P_{v}^{T} {UV}^{T} = 0$ (9)

Similarly for text modality, the solution is $P_{t} {TT}^{T} (1 + L) - (1 + α) {UT}^{T} + α P_{t}^{T} {UT}^{T} = 0$ (10)
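A Sylvester equation has the form AX + XB = Q. The sketch below solves such an equation with NumPy by vectorisation, which is a dense substitute for the Bartels-Stewart algorithm and is practical only for small matrices; `scipy.linalg.solve_sylvester` provides a Bartels-Stewart solver for larger problems. The specific arrangement A = αUU^T, B = V(I + L)V^T, Q = (1 + α)UV^T is an assumption of this sketch (the arrangement used for the original SAE [8], extended with the Laplacian term, and with the γ and δ terms dropped), not a transcription of Equation 9. The helper name and toy sizes are hypothetical.

```python
import numpy as np

def solve_sylvester_dense(A, B, Q):
    """Solve A X + X B = Q through the identity
    (I kron A + B^T kron I) vec(X) = vec(Q), with column-major vec."""
    d, m = Q.shape
    K = np.kron(np.eye(m), A) + np.kron(B.T, np.eye(d))
    x = np.linalg.solve(K, Q.flatten(order="F"))
    return x.reshape((d, m), order="F")

rng = np.random.default_rng(0)
d, m, n, alpha = 3, 5, 20, 1.0        # hidden dim, feature dim, samples
U = rng.standard_normal((d, n))
V = rng.standard_normal((m, n))
# A toy symmetric graph and its Laplacian L = D - W.
W = rng.random((n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

A = alpha * U @ U.T                   # d x d
B = V @ (np.eye(n) + L) @ V.T         # m x m
Q = (1.0 + alpha) * U @ V.T           # d x m
P_v = solve_sylvester_dense(A, B, Q)
residual = np.linalg.norm(A @ P_v + P_v @ B - Q)
```

The residual confirms that the returned matrix satisfies the equation; the solution is unique here because A and B are both positive definite for these toy sizes.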

Equations 9 and 10 are Sylvester equations and can be solved by the Bartels-Stewart algorithm. To obtain the solution for U, we differentiate Equation 5 with respect to U and set the derivative to zero, which gives

$\begin{matrix} (U - P_{v} V) + α P_{v} (P_{v}^{T} U - V) + (U - P_{t} T) + α P_{t} \\ (P_{t}^{T} U - T) + β (U - C) + σ (A^{T} U - S) = 0 \end{matrix}$ (11)

$\begin{matrix} U - P_{v} V + α P_{v} P_{v}^{T} U - α P_{v} V + U - P_{t} T + α P_{t} \\ P_{t}^{T} U - α P_{t} T + β U - β C + σ A^{T} U - σ S = 0 \end{matrix}$ (12)

Rearranging Equation 12:

$\begin{matrix} U (1 + α P_{v} P_{v}^{T} + 1 + α P_{t} P_{t}^{T} + σ A^{T} + β) \\ = P_{v} V + α P_{v} V + P_{t} T + α P_{t} T + β C + σ S \\ = (1 + α) P_{v} V + (1 + α) P_{t} T + β C + σ S \end{matrix}$ (13)

The final solution for U is:

$\begin{matrix} U = ((1 + α) P_{v} V + (1 + α) P_{t} T + β C + σ S) . \\ {(1 + α P_{v} P_{v}^{T} + 1 + α P_{t} P_{t}^{T} + σ A^{T} + β)}^{- 1} \end{matrix}$ (14)

$\begin{matrix} U = ((1 + α) P_{v} V + (1 + α) P_{t} T + β C + σ S) . \\ {(α P_{v} P_{v}^{T} + α P_{t} P_{t}^{T} + σ A^{T} + (2 + β))}^{- 1} \end{matrix}$ (15)

We solve this equation by an alternating minimization procedure. Algorithm 2 shows the optimization process of the multimodal semantic hypergraph regularized autoencoder.

Algorithm 2: Multimodal semantic hypergraph regularized autoencoder optimization

Input: image and text feature matrices V, T, iterations count
Output: projection matrices P_v, P_t
1. calculate the Laplacian matrix L for V and T
2. initialize U by C
3. take the partial derivative w.r.t. P_v and solve Equation 9.
4. take the partial derivative w.r.t. P_t and solve Equation 10.
5. fix P_v, P_t and update U by Equation 15.
6. if iterations not finished
7. repeat steps 3, 4 and 5
8. end if
9. project the feature matrices with the calculated P_v, P_t
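The alternating scheme of Algorithm 2 can be sketched as below. This is an illustration under explicit assumptions rather than the authors' implementation: the γ and δ terms are dropped, A is kept fixed, the stationarity conditions are written in the dimensionally consistent forms αUU^T P + P X(I + L)X^T = (1 + α)UX^T for each projection and a d x d linear system for U, and the Sylvester equations are solved by dense vectorisation instead of Bartels-Stewart. All function names, the random toy data and the random Laplacians are hypothetical; α, β and σ default to the values 1, 0.1 and 0.1 reported in Section 3.3.

```python
import numpy as np

def solve_sylvester_dense(A, B, Q):
    """Solve A X + X B = Q by vectorisation (small dense problems only)."""
    d, m = Q.shape
    K = np.kron(np.eye(m), A) + np.kron(B.T, np.eye(d))
    return np.linalg.solve(K, Q.flatten(order="F")).reshape((d, m), order="F")

def objective(P_v, P_t, U, V, T, L_v, L_t, C, S, A, alpha, beta, sigma):
    """Simplified loss: Equation 5 without the gamma and delta terms."""
    def modality(P, X, L):
        return (np.linalg.norm(P @ X - U) ** 2
                + alpha * np.linalg.norm(X - P.T @ U) ** 2
                + np.trace(P @ X @ L @ X.T @ P.T))
    return (modality(P_v, V, L_v) + modality(P_t, T, L_t)
            + beta * np.linalg.norm(U - C) ** 2
            + sigma * np.linalg.norm(S - A @ U) ** 2)

def fit(V, T, L_v, L_t, C, S, A, alpha=1.0, beta=0.1, sigma=0.1, iters=10):
    d, n = C.shape
    I_n, I_d = np.eye(n), np.eye(d)
    U = C.copy()                                  # step 2: initialise U by C
    history = []
    for _ in range(iters):
        # Steps 3-4: with U fixed, each projection solves a Sylvester equation.
        P_v = solve_sylvester_dense(alpha * U @ U.T, V @ (I_n + L_v) @ V.T,
                                    (1 + alpha) * U @ V.T)
        P_t = solve_sylvester_dense(alpha * U @ U.T, T @ (I_n + L_t) @ T.T,
                                    (1 + alpha) * U @ T.T)
        # Step 5: with P_v and P_t fixed, U solves a d x d linear system.
        M = ((2 + beta) * I_d + alpha * (P_v @ P_v.T + P_t @ P_t.T)
             + sigma * A.T @ A)
        rhs = (1 + alpha) * (P_v @ V + P_t @ T) + beta * C + sigma * A.T @ S
        U = np.linalg.solve(M, rhs)
        history.append(objective(P_v, P_t, U, V, T, L_v, L_t, C, S, A,
                                 alpha, beta, sigma))
    return P_v, P_t, U, history

def random_laplacian(n, rng):
    """Laplacian of a random symmetric graph, used only as toy input."""
    W = rng.random((n, n))
    W = (W + W.T) / 2.0
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
n, d, c, m_v, m_t = 40, 3, 4, 6, 5
V, T = rng.standard_normal((m_v, n)), rng.standard_normal((m_t, n))
C = rng.standard_normal((d, n))
labels = rng.integers(0, c, size=n)
S = np.zeros((c, n))
S[labels, np.arange(n)] = 1.0
A = rng.random((c, d))
P_v, P_t, U, history = fit(V, T, random_laplacian(n, rng),
                           random_laplacian(n, rng), C, S, A)
```

Because each block update is an exact minimiser of a convex quadratic subproblem in this simplified setting, the recorded objective values in `history` should not increase from one sweep to the next.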

3 Experiment

3.1 Dataset

Wiki dataset: This dataset is built from Wikipedia featured articles and contains approximately 2866 image-text pairs. 76% of the feature pairs are used for training and 24% for testing. The dataset has ten label categories. Text features are extracted with 10-dim LDA, and image features are extracted using a CNN.

3.2 Evaluation parameter

We evaluate the classification results by mean average precision (MAP). MAP is computed for text search with an image query and for image search with a text query. The Hamming loss is used to evaluate the label transformation.
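The two measures can be sketched in plain Python as follows. Several conventions for average precision exist; this sketch assumes that a retrieved item is relevant when it shares the query's class label and that precision is averaged over the relevant items found within the first R results. The function names and the toy inputs are hypothetical.

```python
def average_precision(relevance, R=None):
    """Average precision over a ranked list of booleans, cut at R if given."""
    ranked = list(relevance) if R is None else list(relevance)[:R]
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank        # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(query_labels, gallery_labels, scores, R=None):
    """scores[q][j] is the similarity of query q to gallery item j."""
    aps = []
    for label, row in zip(query_labels, scores):
        order = sorted(range(len(row)), key=lambda j: -row[j])
        aps.append(average_precision(
            [gallery_labels[j] == label for j in order], R))
    return sum(aps) / len(aps)

def hamming_loss(Y_true, Y_pred):
    """Fraction of label entries that differ between two binary matrices."""
    wrong = sum(a != b for row_t, row_p in zip(Y_true, Y_pred)
                for a, b in zip(row_t, row_p))
    return wrong / (len(Y_true) * len(Y_true[0]))

ap = average_precision([True, False, True, False])
map_score = mean_average_precision([0], [0, 1, 0], [[0.9, 0.8, 0.1]])
loss = hamming_loss([[1, 0], [0, 1]], [[1, 1], [0, 1]])
```

Setting R = 50 restricts the evaluation to the first 50 retrieved items, while R = None corresponds to evaluating over all results.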

3.3 Results

The multimodal classification problem requires projecting the data into a hidden space. We use two-modal data consisting of images and their text descriptions, which share the same labels because their semantic meaning is similar. To extract the semantic code vector, we experimented with three label space dimension reduction methods: MCPLST [25], FaIE [26] and CSSP [27]. The labels are transformed into a new representation instead of binary vectors for the multiclass case. Table 2 shows the baseline comparison on the Wiki dataset with the proposed projection of feature vectors into the hidden space of the semantic AE.

Table 2
Baseline comparison of wiki dataset for multimodal classification

Baseline methods	R = 50		R = all
	Image to text	Text to image	Image to text	Text to image
Without label transformation + proposed projection	0.484494	0.577670	0.465550	0.439779
MCPLST label transformation + proposed projection	0.492440	0.579081	0.470555	0.442287
CSSP label transformation + proposed projection	0.491427	0.579762	0.471134	0.442978
FaIE label transformation + proposed projection	0.490147	0.584372	0.469321	0.442604
MCPLST label transformation + MMSAE projection	0.488901	0.581128	0.469954	0.442595
MCPLST label transformation + hypergraph regularized projection	0.491490	0.579827	0.471004	0.442782

To evaluate the performance, MAP is calculated for the first R = 50 results of a query and for all results. Both directions, image to text and text to image, are checked. The proposed projection scheme is also tested without label transformation; in that case, the labels are converted into multilabels by a one-to-many method similar to BR, and the accuracy is the lowest among all baseline schemes. Label transformation by CPLST gives the largest improvement with the proposed projection for image queries searching for text. This does not hold for text queries, where the FaIE transformation is best. The projection with baseline hypergraph regularization and MCPLST performs better than the MMSAE projection. We include the MMSAE projection because the previous work by Wu et al. [1] used this scheme. The MAP results over all test results are also competitive, but we assume that in a multimodal search framework the user will hardly be interested in more than 50 search results. The Hamming losses of the label transformation methods are plotted in Fig. 3.

Fig. 3

Hamming loss vs sub-problems reduction for Label transformation ratio.

The Hamming loss is lowest for the CPLST label transformation and highest for BR. The parameters α, β, and σ are set to 1, 0.1, and 0.1, respectively. The hidden space size cannot exceed the number of label categories (d ⩽ C). To set the hidden dimension, we evaluated the baseline methods for different values of d, as shown in Fig. 4. We tested the hidden dimensions d_i = 2, 4, 6, 8, 10 for the Wiki dataset, which has 10 label categories. The plot in Fig. 4 demonstrates that MAP increases with the hidden dimension. The improvement with the hidden dimension is more pronounced for the scheme with no label transformation and DHMMSAE. However, this pattern is not visible in the other schemes because of the semantic label transformation with feature information.

Fig. 4

Baseline comparison of MAP for hidden dimension varying from 2 to 10 for the Wiki dataset.

It is also noticed that after d_i = 4, the improvement in MAP is not significant for MCPLST with the DHMMSAE and MMSAE projections, whereas FaIE and CSSP label transformation with the DHMMSAE projection gain a noticeable improvement at every d_i. This suggests that the MCPLST label transformation performs better at a higher hidden space dimension of the semantic autoencoder. A similar improvement pattern can be noticed in both the text-to-image and image-to-text directions.

We compared the proposed projection scheme with PLS [27], GMLDA [28], 3-view CCA [29], ml-CCA [30], LGCFL [31], JFSSL [32] and MMSAE [1] in Table 3. PLS uses the covariance/correlation between data pairs to maximize the projection in the common space, whereas the other schemes are supervised learning schemes that project the features into a latent space. GMLDA combines CCA and LDA. 3-view CCA, ml-CCA, LGCFL and JFSSL extend the algorithm to incorporate the label information. MMSAE projects the features into the semantic space after the labels are transformed into an embedding space. Owing to distinct labels with features projected into the LPP space, DHMMSAE improves the image-to-text classification accuracy by up to 2.5% over the most recent SAE work, MMSAE [1].

Table 3

Wiki dataset classification evaluation for text to image and image to text search queries

Baseline methods	R = 50		R = all
	Image to text	Text to image	Image to text	Text to image
PLS	0.445	0.549	0.417	0.410
GMLDA	0.458	0.549	0.414	0.401
3view-CCA	0.439	0.521	0.372	0.354
ml-CCA	0.467	0.537	0.430	0.403
LGCFL	0.473	0.548	0.442	0.418
JFSSL	0.471	0.535	0.443	0.414
MMSAE	0.488	0.581	0.469	0.442
Proposed	0.492	0.580	0.470	0.442

For the text-to-image query, however, the two schemes show no difference. This is because the text features are extracted with Linear Discriminant analysis (LDA), which yields low-dimensional features. LDA also belongs to the feature space dimension reduction paradigm, like MCPLST. Because of these LDA features, MMSAE and DHMMSAE perform similarly for text-to-image search. The proposed scheme shows a significant improvement over all other state-of-the-art methods in Table 3 for both image-to-text and text-to-image retrieval.

4 Conclusion

In this work on multimodal semantic analysis, we started with multilabel transformation so that the labels inherit the feature information. The CPLST, FaIE, CSSP and BR label transformation methods were tested with the proposed semantic autoencoder and evaluated by the Hamming loss curve. CPLST proved the most effective method for semantic code vector generation. The semantic autoencoder is regularized with a hypergraph to preserve the manifold information of the features in the latent space. To make the projection more distinct and robust to feature noise, a discriminative factor is also added to the SAE. We name this semantic autoencoder DHMMSAE. Results were tested on the Wiki dataset for image queries searching for text and vice versa. An improvement of 2.5% over the recent MMSAE work is achieved, and larger gains are observed over six other state-of-the-art methods. A quantitative analysis over the hidden space dimension is also presented to assess its effect.

Throughout this work, we used a linear single-layer autoencoder, so we plan to experiment with a multilayer semantic autoencoder next. We also noticed that the LDA text features do not improve DHMMSAE because distinctive features are already extracted beforehand. A good embedding space linking images and text may be required to improve multimodal classification.

References

1. Wu, Wang and Huang, Multimodal semantic autoencoder for cross-modal retrieval, Neurocomputing 331 (2019), 167–175. doi: 10.1016/j.neucom.2018.11.042

2. Cao, Lin, He and He, Hybrid representation learning for cross-modal retrieval, Neurocomputing 345 (2019), 45–57. doi: 10.1016/j.neucom.2018.10.082

3. Zhang, Wang and Dai, Semi-supervised cross-modal common representation learning with vector-valued manifold regularization, Pattern Recognition Letters (2019), https://doi.org/10.1016/j.patrec.2019.01.002

4. Huang, Peng and Yuan, Cross-modal common representation learning by hybrid transfer network, arXiv preprint arXiv:1706.00153 (2017).

5. …, Ma, Wang, Liu and Huang, Multilabel double-layer learning for cross-modal retrieval, Neurocomputing 275 (2018), 1893–1902. doi.org/10.1016/j.neucom.2017.10.032

6. Rasiwasia, Pereira J.C., Coviello, Doyle, Lanckriet G.R.G., Levy and Vasconcelos, A new approach to cross-modal multimedia retrieval, In Proceedings of the 18th ACM international conference on Multimedia (2010), 251–260. ACM.

7. …, Zhen, Peng and Liu, Scalable Deep Multi-modal Learning for Cross-Modal Retrieval, In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (2019), 635–644. ACM, https://doi.org/10.1145/3331184.3331213

8. Kodirov, Xiang and Gong, Semantic autoencoder for zero-shot learning, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), 3174–3183.

9. Jang, Seo and Kang, Recurrent neural network-based semantic variational autoencoder for sequence-to-sequence learning, Information Sciences 490 (2019), 59–73. doi.org/10.1016/j.ins.2019.03.066

10. Talwar, Mongia, Sengupta and Majumdar, AutoImpute: Autoencoder based imputation of single-cell RNA-seq data, Scientific Reports 8(1) (2018), 16329. doi.org/10.1038/s41598-018-34688-x

11. Corizzo, Ceci and Japkowicz, Anomaly detection and repair for accurate predictions in geo-distributed Big Data, Big Data Research 16 (2019), 18–35. doi.org/10.1016/j.bdr.2019.04.001

12. Silberer and Lapata, Learning grounded meaning representations with autoencoders, In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (2014), 721–732.

13. Huang, Zhang, Zhao, Xu and Li, Image–text sentiment analysis via deep multimodal attentive fusion, Knowledge-Based Systems 167 (2019), 26–37. doi.org/10.1016/j.knosys.2019.01.019

14. Carrara, Esuli, Fagni, Falchi and Fernández A.M., Picture it in your mind: Generating high-level visual representations from textual descriptions, Information Retrieval Journal 21(2-3) (2018), 208–229. doi.org/10.1007/s10791-017-9318-6

15. Zhang, Ma, Li, Huang and Tian, Generalized semi-supervised and structured subspace learning for cross-modal retrieval, IEEE Transactions on Multimedia 20(1) (2017), 128–141.

16. … and Wu X.-J., Cross-modal subspace learning with Kernel correlation maximization and Discriminative structure-preserving, arXiv preprint arXiv:1904.00776 (2019). doi.org/10.1007/s11042-020-08989-1

17. Wang, He, Wang and Tan, Joint feature selection and subspace learning for cross-modal retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10) (2015), 2010–2023. doi: 10.1109/TPAMI.2015.2505311

18. Liu, Gao, Han, Wang and Gao, Graph and autoencoder based feature extraction for zero-shot learning, In Proc. IJCAI, (2019), 15–36.

19. Hong, Chen, Wang and Tang, Hypergraph regularized autoencoder for image-based 3D human pose recovery, Signal Processing 124 (2016), 132–140. https://doi.org/10.1016/j.sigpro.2015.10.004

20. Huang, Elhoseiny, Elgammal and Yang, Learning hypergraph-regularized attribute predictors, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 409–417.

21. Zhang, Ding and Cui, Introducing Hypergraph Signal Processing: Theoretical Foundation and Practical Applications, IEEE Internet of Things Journal (2019).

22. Hao Y.-J., Gao Y.-L., Hou M.-X., Dai L.-Y. and Liu J.-X., Hypergraph Regularized Discriminative Nonnegative Matrix Factorization on Sample Classification and Co-Differentially Expressed Gene Selection, Complexity 2019 (2019), https://doi.org/10.1155/2019/7081674

23. Long, Lu, Peng and Li, Graph regularized discriminative non-negative matrix factorization for face recognition, Multimedia Tools and Applications 72(3) (2014), 2679–2699. https://doi.org/10.1007/s11042-013-1572-z

24. Tai and Lin H.-T., Multilabel classification with principal label space transformation, In Neural Computation (2012).

25. Lin, Ding, Hu and Wang, Multilabel classification via feature-aware implicit label space encoding, In Proceedings of the 31st International Conference on Machine Learning - Volume 32 (ICML’14). JMLR.org, II-325-II-333.

26. … and Kwok, Efficient Multi-label Classification with Many Labels, Proceedings of the 30th International Conference on Machine Learning, in PMLR 28(3) (2013), 405–413.

27. Rosipal and Krämer, Overview and Recent Advances in Partial Least Squares. In: C. Saunders, M. Grobelnik, S. Gunn, J. Shawe-Taylor (eds.) Subspace, Latent Structure and Feature Selection. SLSFS 2005, Lecture Notes in Computer Science 3940 (2006), Springer, Berlin, Heidelberg.

28. Sharma, Kumar, Daume and Jacobs D.W., Generalized Multiview Analysis: A discriminative latent space, 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, (2012), 2160–2167. doi:10.1109/CVPR.2012.6247923

29. Gong, Ke, Isard and Lazebnik, A multi-view embedding space for modelling internet images, tags, and their semantics, International Journal of Computer Vision 106(2) (2014), 210–233.

30. Ranjan, Rasiwasia and Jawahar C.V., Multi-label Cross-Modal Retrieval, 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, (2015), 4094–4102. doi:10.1109/ICCV.2015.466

31. Kang, Xiang, Liao, Xu and Pan, Learning Consistent Feature Representation for Cross-Modal Multimedia Retrieval, IEEE Transactions on Multimedia 17(3) (2015), 370–381. doi: 10.1109/TMM.2015.2390499

32. Wang, He, Wang and Tan, Joint Feature Selection and Subspace Learning for Cross-Modal Retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10) (2016), 2010–2023. doi: 10.1109/TPAMI.2015.2505311
