Unifying attentive sparse autoencoder with neural collaborative filtering for recommendation

Abstract

The autoencoder network has proven to be a powerful technique for recommender systems. Current approaches to using autoencoders in recommender systems fall into two categories: modeling user-item interactions solely with an autoencoder, and integrating an autoencoder with other models. Most existing autoencoder-based methods assume that all features of the model’s input contribute equally to the final prediction, which amounts to using uniform attention weight vectors; however, this assumption is unreliable, especially when users interact with different items at different frequencies. Moreover, when an autoencoder is combined with traditional methods, the usual strategy is to predict user preferences with a linear kernel, i.e., the inner product of user and item vectors, which limits expressive power and hurts recommendation performance under data sparsity and cold start. To tackle these two problems, we propose a novel hybrid deep learning model for top-n recommendation, called attentive stacked sparse autoencoder (A-SAERec), which captures the attention weight vector of a user over items and is then combined with neural matrix factorization to improve recommendation performance. Extensive experiments on four real-world datasets show that our A-SAERec algorithm achieves significant improvements over state-of-the-art algorithms.

Keywords

Attention mechanism, sparse autoencoder, collaborative filtering, recommender systems

1. Introduction

Personalized recommender systems have become a powerful tool for alleviating information overload in many scenarios (e.g. social recommendation [1], POI recommendation [2], etc.). Traditional collaborative filtering techniques, such as matrix factorization (MF) [3], have achieved great success in recommendation tasks. Relying on the user-item rating matrix, MF methods represent users’ interests and items’ features as latent factor vectors in a common latent space, and then predict a user’s preference for an item with a linear kernel, i.e., the dot product of their latent factors. Under data sparsity and cold start, however, MF methods often fail to capture complicated user-item interactions, which leads to slow model learning and overfitting [4, 5].

In recent years, deep learning has been widely used in recommender systems [12], and its capability to capture complex relationships within data has been verified. As an unsupervised neural network, the autoencoder can learn a representation of the input data, which serves as feature extraction and dimensionality reduction, and this has attracted researchers’ attention to how it can be leveraged in recommender systems [7]. Different ways of using autoencoders in recommender systems have been explored. For example, the AutoRec method takes user rating vectors or item rating vectors as input and predicts user ratings in the output layer [8]. The CDAE model directly uses the user-item implicit feedback matrix as input to predict user preferences [9]. The variational autoencoder, a generative model with multinomial likelihood, has been extended to collaborative filtering for implicit feedback [10]. Besides, side information is utilized in autoencoder models to alleviate data sparsity and improve recommendation performance [11, 12, 19]. In addition, existing works integrate the autoencoder with traditional methods to improve the performance of recommender systems, such as combining the denoising autoencoder with matrix factorization and integrating the contractive autoencoder with singular value decomposition [23, 13].

Although these methods achieve better performance than the MF method, all of them ignore the fact that users usually assign different attention weights to different items, which can result in misleading recommendations. Motivated by this observation, we propose a novel hybrid model for top-n recommendation, called attentive stacked sparse autoencoder (A-SAERec), which integrates the attention mechanism into the stacked sparse autoencoder to learn complex feature representations from input data, capture the attention weights of each user-item pair, and make more relevant recommendations. The major contributions of this paper are summarized below.

1) We design a hybrid deep learning framework for top-n recommendation. Instead of leveraging a linear kernel of the inner product of user and item vectors to predict user preferences, we employ an attentive stacked sparse autoencoder to learn features and integrate them into the neural matrix factorization model to exploit user-item interactions.
2) We combine the attention network with the stacked sparse autoencoder: the output of the hidden layer is used as the input of the attention network to capture the attention weight vector of each specific user-item pair, and the element-wise product of the hidden layer output and the attention weight vector then forms the latent vector. As far as we know, this is the first attempt to leverage a joint neural network that tightly couples the attention mechanism with the stacked sparse autoencoder for personalized recommendation.
3) We conduct comprehensive experiments on four real-world datasets, which demonstrate that our algorithm outperforms state-of-the-art algorithms and improves the accuracy of recommendation.

The rest of this paper is organized as follows. Section 2 discusses the related work on autoencoder techniques used for recommender systems. Sections 3 and 4 present the basic knowledge of autoencoders and the proposed model architecture, respectively. Section 5 shows the experiment results and analyses. Finally, we conclude the work of this paper.

2. Related work

Early research on recommendation mainly focused on traditional machine learning approaches. For example, Robles et al. [14] present a new Bayes approach for collaborative filtering, Xu et al. [15] design an SVM (Support Vector Machine)-based prediction method for a personal recommender system for TV programs, and Oku et al. [16] utilize SVM for context-aware recommendation. However, with the rapid growth of Internet services, traditional methods face data sparsity and cold start problems, which make them unable to satisfy the needs of personalized recommendation. Recently, applying deep learning to recommender systems has demonstrated great success [17]. As a popular choice for the deep learning architecture of recommender systems, autoencoders have attracted increasing attention from researchers. The existing works on autoencoders in recommender systems can be classified into two categories: modeling user-item interactions solely with autoencoders, and integrating autoencoders with other models.

2.1 Modeling relying solely on autoencoders

Ouyang et al. [18] design autoencoder-based collaborative filtering (ACF), which first converts each rating in the range [1, 5] to a vector consisting of 0s and 1s, and then takes the vector as input to predict ratings. However, the ACF method contains only one hidden layer, so its capability of extracting features is limited. Sedhain et al. [8] propose AutoRec, which can be seen as a variant of the ACF method in which the user’s ratings are used directly as inputs. Strub et al. [20] design a collaborative filtering neural network based on the stacked denoising autoencoder (SDAE) to make the model more robust, where side information is exploited to enhance the accuracy of recommendation. Wu et al. [9] present the collaborative denoising autoencoder (CDAE) method, which learns latent representations of corrupted user-item preferences that can best reconstruct the full input. Zhang et al. [21] propose hybrid collaborative recommendation via a semi-autoencoder, which exploits both content information and the learned non-linear characteristics to produce personalized recommendations. Liang et al. [10] and Karamanolakis et al. [22] employ the variational autoencoder to learn feature representations from side information to alleviate the data sparsity problem.

Although these methods have improved the performance of recommender systems, all of them assume by default that a user keeps the same attention weight vector when facing different items, which can result in misleading recommendations.

2.2 Integrating autoencoders with other models

Existing works prefer to combine the autoencoder with traditional collaborative filtering methods, such as matrix factorization algorithms. Dong et al. [23] propose a novel deep learning model (aSDAE), which integrates the SDAE with matrix factorization and leverages side information to provide accurate recommendations. Li et al. [24] integrate the variational autoencoder with matrix factorization and leverage both content data and rating information for recommendation. Alfarhood et al. [25] design a dual autoencoder framework with matrix factorization for recommending scientific articles. Analyzing the above approaches, we find that they rely on matrix factorization to predict the final user preferences; however, simply using a linear kernel, i.e., the inner product of user and item vectors, may fail to give an accurate prediction of user-item interactions.

To sum up, we propose a novel hybrid deep learning framework that, instead of leveraging a linear kernel to model user-item interactions, uses neural matrix factorization to make the final recommendation and employs an attention network to capture the attention weights of each user-item pair to improve recommendation performance.

Figure 1. Illustration of the architecture of sparse autoencoder.

3. Preliminaries

3.1 Sparse autoencoder

The autoencoder is an unsupervised neural network for feature representation extraction and dimensionality reduction, which consists of two main parts: the encoder and the decoder. The encoder ${h(\cdot)}$ takes a given input $x\in{\mathbb{R}^{D}}$ and squashes it into a latent representation $z\in{\mathbb{R}^{K}}$, while the decoder $f(\cdot)$ maps this hidden representation back to a reconstructed vector $\hat{x}$ (i.e. $f(h(x))\approx x$). The overall computation of the autoencoder is defined as

$\displaystyle z=h(x)=h(Wx+b),$

$\displaystyle\hat{x}=f(W^{\prime}h(x)+b^{\prime}),$ (1)

where $W\in\mathbb{R}^{K\times D}$ and $W^{\prime}\in\mathbb{R}^{D\times K}$ are weight matrices, and $b$ and $b^{\prime}$ are bias vectors. The reconstruction error can be formulated either as the squared error or as the cross-entropy:

$\displaystyle\textit{Loss}=\frac{1}{2}{\|x-\hat{x}\|}^{2},$ (2)

$\displaystyle\textit{Loss}=-\sum\limits_{i=1}^{n}(x_{i}\log\hat{x}_{i}+(1-x_{i})\log(1-\hat{x}_{i})).$ (3)
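Eqs. (1)-(3) can be illustrated with a minimal pure-Python sketch. It assumes sigmoid activations for both $h$ and $f$ and small random weights; the helper names (`affine`, `encode`, `decode`) and the toy sizes are illustrative choices, not part of the original model description.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, b):
    """Return W x + b, where W is stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode(W, b, x):
    # Eq. (1), first line: z = h(W x + b), with h assumed to be the sigmoid
    return [sigmoid(v) for v in affine(W, x, b)]

def decode(W_dec, b_dec, z):
    # Eq. (1), second line: x_hat = f(W' z + b'), with f assumed to be the sigmoid
    return [sigmoid(v) for v in affine(W_dec, z, b_dec)]

def squared_error(x, x_hat):
    # Eq. (2)
    return 0.5 * sum((a - c) ** 2 for a, c in zip(x, x_hat))

def cross_entropy(x, x_hat, eps=1e-12):
    # Eq. (3); eps guards against log(0)
    return -sum(a * math.log(c + eps) + (1 - a) * math.log(1 - c + eps)
                for a, c in zip(x, x_hat))

rng = random.Random(0)
D, K = 6, 3  # toy input and latent sizes (assumed)
W = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(K)]      # K x D
W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(K)] for _ in range(D)]  # D x K
b, b_dec = [0.0] * K, [0.0] * D

x = [1, 0, 0, 1, 1, 0]
z = encode(W, b, x)
x_hat = decode(W_dec, b_dec, z)
print(squared_error(x, x_hat), cross_entropy(x, x_hat))
```

With binary implicit-feedback inputs, the cross-entropy form is a natural fit because each reconstructed entry can be read as a probability.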

For a training set of $M$ instances, the autoencoder trains the parameters to minimize the average reconstruction error as

$\displaystyle J(W,b)=\left[\frac{1}{M}\sum\limits_{i=1}^{M}{l}(x_{i},\hat{x}_{i})\right]+\frac{\lambda_{w}}{2}\sum\limits_{l=1}^{n_{l}-1}\sum\limits_{i=1}^{s_{l}}\sum\limits_{j=1}^{s_{l+1}}(W_{ji}^{(l)})^{2},$ (4)

where the first term is the average reconstruction error, and the second term is a weight decay term that helps prevent overfitting. $\lambda_{w}$ is the weight decay parameter, $n_{l}$ represents the number of layers in the autoencoder, $s_{l}$ denotes the number of neurons in layer $l$, and $W_{ji}^{(l)}$ is the connection weight between unit $i$ in layer $l$ and unit $j$ in layer $l+1$.

The autoencoder tries to approximate an identity function $h_{w,b}(x)\approx x$; however, the identity function seems a particularly trivial function to learn. If the original inputs can be perfectly reconstructed from the extracted features, the features tend to be redundant and not representative enough. Thus, researchers impose a sparsity constraint on the hidden units to form a novel autoencoder structure, the sparse autoencoder (SAE) [26], which forces the network to learn a compressed representation of the input data by limiting the neuron output values. In the SAE method, a neuron is regarded as inactive when its output value is close to 0. To constrain the neurons to be inactive most of the time, we use $\hat{\rho}_{j}$ to denote the average activation value of hidden neuron $j$, which is computed as

$\displaystyle\hat{\rho}_{j}=\frac{1}{n}\sum\limits_{i=1}^{n}\left[{a_{j}^{(l)}(x_{i})}\right],$ (5)

where $a_{j}^{(l)}(x_{i})$ represents the activation of hidden unit $j$ when the autoencoder is given the input $x_{i}$. We set a sparsity parameter $\rho$, which is usually a small value (e.g. $\rho=$ 0.01), and enforce $\hat{\rho}_{j}=\rho$ to achieve the constraint. The SAE adds an extra penalty term that makes the activation rate of the hidden layer sparse:

$\displaystyle\sum\limits_{j=1}^{s_{2}}KL(\rho\|\hat{\rho}_{j})=\sum\limits_{j=1}^{s_{2}}\left[\rho\log\frac{\rho}{\hat{\rho}_{j}}+(1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_{j}}\right],$ (6)

where $s_{2}$ represents the number of units in the hidden layer, and $KL(\rho\|\hat{\rho}_{j})$ is the Kullback-Leibler divergence between $\rho$ and $\hat{\rho}_{j}$. The cost function of a sparse autoencoder is described as

$\displaystyle J_{\textit{sparse}}(W,b)=J(W,b)+\beta\sum\limits_{j=1}^{s_{2}}KL(\rho\|\hat{\rho}_{j}),$ (7)

where $J(W,b)$ is defined in Eq. (4), and $\beta$ controls the weight of the sparsity penalty term.
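The sparsity penalty of Eqs. (5)-(7) can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the value `beta=3.0` is an assumed placeholder (the text does not state $\beta$), and `base_cost` stands for $J(W,b)$ computed elsewhere.

```python
import math

def mean_activation(activations):
    """Eq. (5): average activation of each hidden unit over all instances.

    `activations` is a list of hidden vectors, one per training instance.
    """
    n = len(activations)
    units = len(activations[0])
    return [sum(a[j] for a in activations) / n for j in range(units)]

def kl_bernoulli(rho, rho_hat):
    """One summand of Eq. (6): KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sparsity_penalty(activations, rho=0.01):
    """Eq. (6): sum of the KL terms over the hidden units."""
    return sum(kl_bernoulli(rho, r) for r in mean_activation(activations))

def sparse_cost(base_cost, activations, rho=0.01, beta=3.0):
    """Eq. (7): J_sparse = J + beta * penalty (beta=3.0 is an assumed value)."""
    return base_cost + beta * sparsity_penalty(activations, rho)

# Two training instances, two hidden units (toy sigmoid outputs)
hidden = [[0.5, 0.2], [0.3, 0.4]]
print(mean_activation(hidden))
print(sparse_cost(1.0, hidden))
```

The penalty is zero only when every $\hat{\rho}_{j}$ equals $\rho$ and grows as the average activations move away from the target.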

3.2 Stacked sparse autoencoder

A Stacked Sparse Autoencoder (SSAE) is a deep learning architecture that consists of multiple layers of basic sparse autoencoders, where the outputs of each layer are wired to the inputs of the successive layer. For each instance $x_{i}\in\mathbb{R}^{d}$, the SSAE with $L$ hidden layers takes it directly as input and approximates a non-linear function $g(\cdot)$ that transforms the input $x_{i}$ into a low-dimensional but high-level feature representation $h_{\textit{SAE}}$, which is computed as

$\displaystyle h_{\textit{SAE}}=g_{L}(W_{L}g_{L-1}(W_{L-1}\cdots g_{1}(W_{1}x_{i}+b_{1})\cdots+b_{L-1})+b_{L}).$ (8)

Existing works have demonstrated that a deep autoencoder with more hidden layers can generate more complicated hierarchies of features; however, the computational cost of training the model may increase significantly as more hidden layers are stacked. The impact of the number of stacked hidden layers on model performance will be discussed in Section 5.4.2.
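The nested composition in Eq. (8) amounts to a simple loop over layers. The sketch below assumes sigmoid activations for every $g_{l}$ and uses toy layer widths; `stacked_encode` and `random_layer` are hypothetical helper names.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(W, b, x):
    """Compute W x + b, where W is stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def stacked_encode(layers, x):
    """Eq. (8): apply g_l(W_l h + b_l) for l = 1..L, feeding each output forward."""
    h = x
    for W, b in layers:
        h = [sigmoid(v) for v in dense(W, b, h)]
    return h

def random_layer(n_in, n_out, rng):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
widths = [8, 4, 2]  # illustrative sizes only: input 8, two hidden layers
layers = [random_layer(a, c, rng) for a, c in zip(widths, widths[1:])]

x_i = [1, 0, 1, 1, 0, 0, 1, 0]
h_sae = stacked_encode(layers, x_i)
print(h_sae)
```

Each extra layer adds one more matrix-vector product per instance, which is the source of the training cost mentioned above.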

4. Proposed methodology

In our proposed model, we use $M$ and $N$ to denote the number of users and items, respectively. The user-item interaction matrix $R\in\mathbb{R}^{M\times N}$ is built from users’ implicit feedback as

$\displaystyle r_{ui}=\left\{\begin{array}{ll}1&\text{if }\textit{user}_{u}\text{ interacts with }\textit{item}_{i},\\ 0&\text{otherwise},\end{array}\right.$ (9)

where $r_{ui}=1$ indicates that user ${u}$ may prefer item $i$. When processing the training data, we convert a score of 5 to a binary rating of 1 and any score below 5 to 0. The primary goal of recommender systems is to generate an item list that reflects the user’s preferences.
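The conversion from explicit scores to the binary matrix of Eq. (9) can be sketched as follows. The function name `binarize` and the toy `(user, item, score)` triples are illustrative assumptions; only the rule "5 becomes 1, anything lower becomes 0" comes from the text.

```python
def binarize(ratings, num_users, num_items, threshold=5):
    """Build the implicit-feedback matrix R of Eq. (9).

    `ratings` is an iterable of (user_index, item_index, score) triples.
    A score equal to the maximum (5) maps to 1; lower scores and missing
    entries stay 0, following the preprocessing described above.
    """
    R = [[0] * num_items for _ in range(num_users)]
    for u, i, score in ratings:
        if score >= threshold:
            R[u][i] = 1
    return R

toy_ratings = [(0, 0, 5), (0, 2, 3), (1, 1, 5), (1, 2, 4)]
R = binarize(toy_ratings, num_users=2, num_items=3)
print(R)  # [[1, 0, 0], [0, 1, 0]]
```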

4.1 The attentive sparse autoencoder

In this section, we design a novel deep learning network, the attentive sparse autoencoder (A-SAE), a generic framework that integrates the attention mechanism into the sparse autoencoder. Inspired by human visual attention, which can focus on a specific part of an image or on particular words in a sentence, the attention mechanism in deep learning can be simply understood as a vector of weights that measures the importance of input elements. Assume there is a user-item implicit interaction matrix with $M$ users and $N$ items. The SAE takes as input the vector $x\in\mathbb{R}^{N}$, which holds the interactions of a user with the $N$ items, and generates the hidden latent representation $h_{\textit{SAE}}\in\mathbb{R}^{k}$ with the non-linear function $g(\cdot)$ as

$\displaystyle h_{\textit{SAE}}(x)=g_{L}(W_{L}g_{L-1}(x)+b_{L}),$ (10)

and the hidden representation $h_{\textit{SAE}}(x)$ is fed into an attention layer,

$\displaystyle t(x)=\textit{ReLU}(W_{h}h_{\textit{SAE}}(x)+b_{h}),$

$\displaystyle a(x)=\textit{softmax}(t(x)),$ (11)

where $W_{h}\in\mathbb{R}^{k\times k}$ and $b_{h}\in\mathbb{R}^{k}$ are the weight matrix and bias vector, respectively, and $k$ is the number of neurons in the attention layer. $t(x)$ denotes the output of the activation function of the attention layer. We employ the softmax function to normalize $t(x)$ into attention scores; its $i$-th component is defined as

$\displaystyle a_{i}(x)=\frac{\exp(t_{i}(x))}{\sum_{j=1}^{k}\exp(t_{j}(x))}.$ (12)

The final output of the feature representation of the attentive sparse autoencoder can be computed as

$\displaystyle f_{\textit{Att}}(x)=a(x)\odot h_{\textit{SAE}}(x),$ (13)

that is, we combine the hidden layer output with the softmax output via an element-wise product. The reconstruction vector $\hat{x}$ and the A-SAE objective function are then given by

$\displaystyle\hat{x}=f(W^{\prime}f_{\textit{Att}}(x)+b^{\prime}),$

$\displaystyle L_{e}=J_{\textit{a-sae}}(W,W_{h},b,b_{h})=J(W,W_{h},b,b_{h})+\beta\sum\limits_{j=1}^{s_{2}}KL(\rho\|\hat{\rho}_{j})+\lambda_{h}\|W_{h}\|^{2},$ (14)

where $\lambda_{h}\|W_{h}\|^{2}$ is the $L_{2}$ regularization term on the attention weights, which is used to prevent overfitting, and $\lambda_{h}$ controls the regularization strength.
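The attention layer of Eqs. (11)-(13) can be sketched in a few lines. The example below is illustrative only: the hidden vector and the identity weight matrix are toy values, and `attentive_features` is a hypothetical helper name.

```python
import math

def softmax(t):
    """Eq. (12), computed with the usual max-shift for numerical stability."""
    m = max(t)
    exps = [math.exp(v - m) for v in t]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_features(h, W_h, b_h):
    """Return (a, f_att) for a hidden vector h.

    t = ReLU(W_h h + b_h)     -- Eq. (11), first line
    a = softmax(t)            -- Eq. (11), second line
    f_att = a * h elementwise -- Eq. (13)
    """
    t = [max(0.0, sum(w * hj for w, hj in zip(row, h)) + bi)
         for row, bi in zip(W_h, b_h)]
    a = softmax(t)
    return a, [ai * hi for ai, hi in zip(a, h)]

h = [0.2, 0.9, 0.5]                      # toy output of the sparse autoencoder
W_h = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]]                  # identity weights, for illustration
b_h = [0.0, 0.0, 0.0]

a, f_att = attentive_features(h, W_h, b_h)
print(a)
print(f_att)
```

Because the attention scores sum to one, hidden units with larger scores keep a larger share of their activation in $f_{\textit{Att}}(x)$.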

Figure 2. Structure of the A-SAERec model. For simplicity, we do not show the decoder parts of each basic SAE.

4.2 Combining A-SAE with neural collaborative filtering

The hybrid architecture of the proposed A-SAERec is shown in Fig. 2. It contains two main networks: the A-SAE network for modeling features and the Neural Collaborative Filtering (NCF) framework for modeling interactions between users and items. Traditional collaborative filtering methods, such as matrix factorization, employ a linear kernel, i.e., the inner product of user and item vectors, to predict user-item interactions; however, linear functions may fail to capture the complex structure of user-item interactions, and existing literature has demonstrated that non-linearities have a potential advantage in improving recommendation performance. By replacing the inner product, He et al. [29] devise a general neural collaborative filtering architecture, which leverages generalized matrix factorization (GMF) and a non-linear multi-layer perceptron (MLP) to model the user-item interaction. However, in the NCF model the representations of users and items are randomly initialized from the one-hot encoding of the user (item) ID in the input layer, which explores the features of items and users only in a limited manner. The A-SAERec method exploits an attentive stacked sparse autoencoder to capture both user and item features, and the output of A-SAE then serves as the input of the second network to model user-item interactions.

Instead of randomly initializing the users’ and items’ representations in the NCF framework, both GMF and MLP first use the A-SAE to extract the feature representations of users and items. Let $p_{u}^{G}$ and $q_{i}^{G}$ represent the user latent vector and item latent vector in GMF, respectively. We define the element-wise product of the user and item vectors as

$\displaystyle\phi(p_{u}^{G},q_{i}^{G})=p_{u}^{G}\odot\ q_{i}^{G},$ (15)

then the result is projected to the output layer as

$\displaystyle\hat{r}_{ui}=a_{\textit{out}}(h^{T}(p_{u}^{G}\odot q_{i}^{G})),$ (16)

where $a_{\textit{out}}$ denotes the activation function and $h$ is the weight vector of the output layer. GMF is a generalized version of matrix factorization (MF), which extends MF by adopting different activation functions. For example, we employ the sigmoid function $\sigma(z)=1/(1+e^{-z})$ as $a_{\textit{out}}$ and allow $h$ to be learned from data with the log loss.

To further model the interaction between users and items, the MLP provides a high degree of flexibility and non-linearity for learning the relationship between the extracted user latent vector $p_{u}^{M}$ and item latent vector $q_{i}^{M}$, which is shown as

$\displaystyle z_{1}=\phi_{1}(p_{u}^{M},q_{i}^{M})=\Bigg{[}\begin{array}{c}p_{u}^{M}\\ q_{i}^{M}\end{array}\Bigg{]},$

$\displaystyle\phi_{2}(z_{1})=a_{2}(W_{2}^{T}z_{1}+b_{2}),$

$\displaystyle\dots$

$\displaystyle\phi_{L}(z_{L-1})=a_{L}(W_{L}^{T}z_{L-1}+b_{L}),$

$\displaystyle\hat{r}_{ui}=\sigma(\phi_{L}(z_{L-1})),$ (17)

where $W$ , $b$ , and $a$ represent weight matrix, bias vector, and activation function respectively. We combine two models by concatenating their last hidden layer to output the predicted score of the interaction between user $u$ and item $i$ :

$\displaystyle\phi^{\textit{GMF}}=p_{u}^{G}\odot q_{i}^{G},$

$\displaystyle\phi^{\textit{MLP}}=a_{L}\left(W_{L}^{T}a_{L-1}\left(\cdots a_{2}\left(W_{2}^{T}\Bigg{[}\begin{array}{c}p_{u}^{M}\\ q_{i}^{M}\end{array}\Bigg{]}+b_{2}\right)\cdots\right)+b_{L}\right),$

$\displaystyle\hat{y}_{ui}=\sigma\left(h^{T}\Bigg{[}\begin{array}{c}\phi^{\textit{GMF}}\\ \phi^{\textit{MLP}}\end{array}\Bigg{]}\right).$ (18)

The loss function of the proposed A-SAERec method will be discussed in Section 4.3.
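The fusion of the GMF and MLP branches in Eq. (18) can be sketched as below. This is a toy illustration under stated assumptions: ReLU is assumed for the MLP activations $a_{l}$, each weight matrix is stored with one row per output unit (so it plays the role of $W_{l}^{T}$), and all numbers and helper names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(W, b, x):
    """Compute W x + b with W stored as a list of rows (one row per output unit)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def gmf(p, q):
    """GMF branch: element-wise product of user and item latent vectors, Eq. (15)."""
    return [a * c for a, c in zip(p, q)]

def mlp(p, q, layers):
    """MLP branch: concatenate the two vectors, then apply each layer with ReLU."""
    z = list(p) + list(q)
    for W, b in layers:
        z = [max(0.0, v) for v in dense(W, b, z)]
    return z

def predict(p_g, q_g, p_m, q_m, layers, h):
    """Eq. (18): sigmoid of h^T applied to the concatenated branch outputs."""
    fused = gmf(p_g, q_g) + mlp(p_m, q_m, layers)
    return sigmoid(sum(hi * fi for hi, fi in zip(h, fused)))

# Toy latent vectors; in A-SAERec these would come from the A-SAE encoders
p_g, q_g = [0.2, 0.4], [0.5, 0.1]
p_m, q_m = [0.3, 0.7], [0.6, 0.2]
mlp_layers = [([[0.1, -0.2, 0.3, 0.4],
                [0.5, 0.1, -0.3, 0.2]], [0.0, 0.1])]
h = [0.5, -0.5, 0.25, 0.75]

y_hat = predict(p_g, q_g, p_m, q_m, mlp_layers, h)
print(y_hat)
```

The output weight vector `h` has one entry per fused feature, so its length must equal the GMF dimension plus the width of the last MLP layer.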

4.3 Optimizer

For the feature extraction part, the Stochastic Gradient Descent (SGD) algorithm is employed to optimize the objective function of A-SAE. The gradients with respect to the parameters are given as

$\displaystyle\frac{\partial}{\partial W_{ij}^{(l)}}J(W,W_{h},b,b_{h})=\left[\frac{1}{M}\sum\limits_{i=1}^{M}\frac{\partial}{\partial W_{ij}^{(l)}}J(W,W_{h},b,b_{h};x_{i},\hat{x}_{i})\right]+\lambda_{w}W_{ij}^{(l)},$

$\displaystyle\frac{\partial}{\partial W_{h}}J(W,W_{h},b,b_{h})=\left[\frac{1}{M}\sum\limits_{i=1}^{M}\frac{\partial}{\partial W_{h}}J(W,W_{h},b,b_{h};x_{i},\hat{x}_{i})\right]+\lambda_{h}W_{h},$

$\displaystyle\frac{\partial}{\partial b_{i}^{(l)}}J(W,W_{h},b,b_{h})=\frac{1}{M}\sum\limits_{i=1}^{M}\frac{\partial}{\partial b_{i}^{(l)}}J(W,W_{h},b,b_{h};x_{i},\hat{x}_{i}),$

$\displaystyle\frac{\partial}{\partial b_{h}}J(W,W_{h},b,b_{h})=\frac{1}{M}\sum\limits_{i=1}^{M}\frac{\partial}{\partial b_{h}}J(W,W_{h},b,b_{h};x_{i},\hat{x}_{i}),$ (19)

where $\lambda_{w}$ and $\lambda_{h}$ are the regularization coefficients. The parameters of A-SAE are updated according to the following rule,

$\displaystyle\Delta=\Delta-\eta\frac{\partial}{\partial\Delta}J(W,W_{h},b,b_{h}),$ (20)

where $\Delta$ denotes the parameter set $\{W_{ij}^{(l)},W_{h},b_{i}^{(l)},b_{h}\}$ and $\eta$ is the learning rate.
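A single update of Eq. (20), using the gradient structure of Eq. (19), can be sketched as follows. The helper name `sgd_step` and the toy numbers are assumptions; the data gradients are taken as given rather than derived by backpropagation.

```python
def sgd_step(params, data_grads, lr=0.001, weight_decay=0.0):
    """One update of Eq. (20): param <- param - lr * (data gradient + weight_decay * param).

    For weights, `weight_decay` plays the role of lambda_w or lambda_h in Eq. (19).
    For biases, leave it at 0.0 because Eq. (19) has no decay term for them.
    """
    return [p - lr * (g + weight_decay * p) for p, g in zip(params, data_grads)]

weights = [1.0, -2.0]
grads = [0.5, 0.5]  # averaged data gradients, assumed to be precomputed

new_weights = sgd_step(weights, grads, lr=0.1, weight_decay=0.1)
new_biases = sgd_step([0.3], [0.2], lr=0.1)
print(new_weights, new_biases)
```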

The deep interaction modeling process outputs the predicted rating $\hat{y}_{ui}$ between user $u$ and item $i$. Given the definition of implicit feedback, the value of $r_{ui}$ can be viewed as a label: 1 denotes that item $i$ is relevant to user $u$, and 0 otherwise. The prediction score $\hat{y}_{ui}$ indicates the probability that the item is relevant to the user, so it needs to be constrained to the range [0, 1], which can be easily achieved with a probabilistic function such as the logistic function. The prediction loss is

$\displaystyle L_{p}=-\sum\limits_{(u,i)\in R\cup R^{-}}\left[y_{ui}\log\hat{y}_{ui}+(1-y_{ui})\log(1-\hat{y}_{ui})\right]+\lambda_{\theta}\|\theta\|^{2},$ (21)

where $y_{ui}$ is the label given by $r_{ui}$, $\lambda_{\theta}$ is the regularization coefficient, and $\theta$ denotes the model parameters being regularized. Apart from the regularization term, the loss is the binary cross-entropy loss, so we transform the recommendation problem with implicit feedback into a binary classification problem. Therefore, the loss function of our proposed A-SAERec model can be described as

$\displaystyle\textit{Loss}=L_{p}+\gamma L_{eu}+\delta L_{ei},$ (22)

where $L_{eu}$ and $L_{ei}$ denote the losses of user and item feature representation extraction, respectively, as given in Eq. (14). $\gamma$ and $\delta$ are the hyper-parameters that weight $L_{eu}$ and $L_{ei}$, respectively. The learning procedure of A-SAERec is shown in Algorithm 1.
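The combined objective of Eqs. (21) and (22) can be sketched as follows, assuming the standard binary cross-entropy form. The function names are hypothetical, the default `lam=0.001` for $\lambda_{\theta}$ is an assumed placeholder, and the defaults for `gamma` and `delta` follow the value 0.001 reported in the experimental settings.

```python
import math

def prediction_loss(y_true, y_pred, theta, lam=0.001, eps=1e-12):
    """Eq. (21): binary cross-entropy over labelled pairs plus an L2 penalty.

    `theta` is a flat list of the parameters being regularized; `lam` stands
    for lambda_theta (assumed value). eps guards against log(0).
    """
    bce = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
               for y, p in zip(y_true, y_pred))
    return bce + lam * sum(t * t for t in theta)

def total_loss(l_p, l_eu, l_ei, gamma=0.001, delta=0.001):
    """Eq. (22): prediction loss plus weighted user and item A-SAE losses."""
    return l_p + gamma * l_eu + delta * l_ei

labels = [1, 0, 1]       # observed interaction, sampled negative, observed interaction
scores = [0.9, 0.2, 0.6]
l_p = prediction_loss(labels, scores, theta=[0.5, -0.5])
print(total_loss(l_p, l_eu=2.0, l_ei=3.0))
```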

Algorithm 1: Learning algorithm for A-SAERec
Input: the user-item implicit feedback matrix $r_{ui}$
Output: the prediction score $\hat{y}_{ui}$
pre-train the user feature extraction model A-SAE with input $U$
pre-train the item feature extraction model A-SAE with input $I$
for all training samples do
    $U\leftarrow L_{eu}(U)$
    $I\leftarrow L_{ei}(I)$
    for each parameter in $\Delta$ do
        update the parameter by Eq. (20)
    end for
end for
$p_{u}^{G}$, $q_{i}^{G}$, $p_{u}^{M}$, $q_{i}^{M}$ $\leftarrow$ initialize with the A-SAE model
for epoch in range[Epochs] do
    for each pair $\langle p_{u}^{G},q_{i}^{G}\rangle$, $\langle p_{u}^{M},q_{i}^{M}\rangle$ do
        $\phi^{\textit{GMF}}=p_{u}^{G}\odot q_{i}^{G}$
        $\phi^{\textit{MLP}}$: the first layer is $z_{1}=\Bigg{[}\begin{array}{c}p_{u}^{M}\\ q_{i}^{M}\end{array}\Bigg{]}$; the remaining layers are computed as in Eq. (17)
        $\hat{y}_{ui}=f_{\textit{activation}}\left(\Bigg{[}\begin{array}{c}\phi^{\textit{GMF}}\\ \phi^{\textit{MLP}}\end{array}\Bigg{]}\right)$
        $L\leftarrow$ employ Eq. (21) with the SGD method to learn parameters
    end for
    evaluate the Top-K recommendations
end for

5. Experiment

Extensive experiments are conducted to answer the following research questions:

RQ1: How does our proposed A-SAERec algorithm perform compared with state-of-the-art baseline algorithms for recommender systems?

RQ2: How do different hyper-parameter settings (e.g., the number of hidden layers and the dimension of the latent factors) affect the performance of the A-SAERec algorithm?

RQ3: Can the attention network effectively improve the performance of our A-SAERec algorithm?

5.1 Datasets

Four real-world datasets are used to demonstrate the validity of our proposed attentive stacked sparse autoencoder model: Movielens-10M (https://grouplens.org/datasets/movielens/), Yelp (https://www.yelp.com/dataset/challe), Book-crossing (http://www2.informatik.uni-freiburg.de/∼cziegler/BX/), and LastFM (https://labrosa.ee.columbia.edu/millionsong/lastfm).

• Movielens-10M dataset consists of about 10 million ratings from 72,000 users on 10,000 movies.

• Yelp dataset records user ratings on local businesses and also contains social relations and attribute information of businesses; it contains 1,392,041 interactions of users on different electronics.

• Book-crossing dataset comes from a book-drifting website where readers can exchange their books with each other; it contains 1,149,780 ratings on 271,379 books by 278,858 users.

• LastFM dataset is a music dataset that includes 1892 users, 17,632 artists, and 92,834 ratings, and contains a list of each user’s most popular artists and the number of plays.

For all of the above datasets, we filter out users with fewer than 10 interactions. The characteristics of the datasets are summarized in Table 1.

Table 1

The statistics of the datasets

Datasets	Users	Items	Interactions	Sparsity
Ml-10M	67,312	10,677	3,089,624	99.57%
Yelp	9,396	143,992	811,610	99.94%
Book-Crossing	278,858	271,397	1,149,780	99.95%
LastFM	1,892	17,632	92,834	99.41%
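As a quick check on how a sparsity figure can be derived, the sketch below assumes the common definition of sparsity as one minus the fraction of observed user-item pairs. The text does not state which definition was used, so this is an assumption; under it, the Ml-10M and Yelp rows of Table 1 are reproduced.

```python
def sparsity(users, items, interactions):
    """Fraction of the user-item matrix that is empty (assumed definition)."""
    return 1.0 - interactions / (users * items)

ml10m = sparsity(67312, 10677, 3089624)
yelp = sparsity(9396, 143992, 811610)
print(f"{ml10m:.2%}")  # 99.57%
print(f"{yelp:.2%}")   # 99.94%
```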

5.2 Evaluation metrics

To evaluate the performance of our model, we randomly split each dataset into a training set and a test set. For each user, we randomly hold out one item that the user has interacted with and sample 100 items that the user has not interacted with to form the test set. Two common ranking metrics, Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG), are adopted to measure model performance. HR intuitively measures how many test items are present in the top-n recommendation list, and is defined as

$\displaystyle\textit{HR@N}=\frac{\text{Number of Hits@N}}{|GT|},$ (23)

where $|GT|$ is the number of items in the test set. NDCG accounts for the position of each hit by assigning higher scores to hits ranked higher in the recommendation list. It is computed as

$\displaystyle\textit{NDCG@N}=\sum\limits_{i=1}^{N}\frac{2^{\textit{rel}_{i}}-1}{\log_{2}(i+1)},$ (24)

where $\textit{rel}_{i}$ is the graded relevance of the recommendation result at position $i$ .
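Under the protocol described above (one held-out item ranked against sampled non-interacted items), Eqs. (23) and (24) reduce to simple functions of the held-out item's rank. The sketch below assumes binary relevance, so the ideal DCG equals 1 and no further normalization is needed, and it averages NDCG over users; the function names and toy ranks are illustrative.

```python
import math

def rank_of_positive(pos_score, neg_scores):
    """1-based rank of the held-out item among the sampled negatives.

    Ties are counted against the held-out item (a conservative assumption).
    """
    return 1 + sum(1 for s in neg_scores if s >= pos_score)

def hit_at_n(rank, n):
    return 1.0 if rank <= n else 0.0

def ndcg_at_n(rank, n):
    # With a single relevant item (rel = 1), Eq. (24) has one non-zero term:
    # (2^1 - 1) / log2(rank + 1)
    return 1.0 / math.log2(rank + 1) if rank <= n else 0.0

def evaluate(ranks, n):
    """Average HR@N (Eq. 23) and NDCG@N over all test users."""
    hr = sum(hit_at_n(r, n) for r in ranks) / len(ranks)
    ndcg = sum(ndcg_at_n(r, n) for r in ranks) / len(ranks)
    return hr, ndcg

# Toy example: three users whose held-out items are ranked 1st, 3rd and 12th
ranks = [1, 3, 12]
hr10, ndcg10 = evaluate(ranks, n=10)
print(hr10, ndcg10)
```

In this toy case two of the three held-out items fall in the top 10, and the hit at rank 3 contributes $1/\log_{2}4=0.5$ to the NDCG sum.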

5.3 Baselines

We use the Python library TensorFlow to implement our proposed A-SAERec algorithm (https://github.com/haomiaocqut/ReSys_A-SAERec) and compare it against several traditional collaborative filtering baselines as well as state-of-the-art deep learning based models:

• Biased SVD [27]: Singular value decomposition is a dimensionality reduction method that is widely used for its capacity to improve the performance of recommender systems.

• ItemKNN [28]: A standard item-based collaborative filtering method that recommends items similar to those the user has previously adopted.

• NeuMF [29]: A general deep learning model for top-N recommendation that consists of a generalized MF component and an MLP component. It is generic and can express and generalize matrix factorization under its framework.

• AutoSVD $++$ [13]: A hybrid recommendation model that integrates the contractive autoencoder paradigm into the matrix factorization technique, with superior scalability and computational efficiency.

• JCA [30]: A joint collaborative autoencoder framework that simultaneously learns both user-user and item-item correlations, leading to a robust model and effectively improving the performance of top-N recommendation.

• RecVAE [31]: A generalized model based on variational autoencoders for collaborative filtering with implicit feedback.

5.4 Experiment results and analysis

For our proposed A-SAERec, we randomly select 70% of each dataset as the training set, 10% as the validation set, and the remaining 20% as the test set. The optimizer is the Adam optimizer, and the learning rate $\eta$ is set to 0.001. We employ five hidden layers for the attentive stacked sparse autoencoder with a shape of ‘$D*10\rightarrow D*8\rightarrow D*6\rightarrow D*4\rightarrow D*2\rightarrow D*4\rightarrow D*6\rightarrow D*8\rightarrow D*10$’, in which $D$ is the dimension and is set to 10. The penalty parameters $\lambda_{w}$, $\lambda_{h}$, $\gamma$, and $\delta$ are set to 0.001 for all datasets. The sparsity parameter $\rho$ is set to 0.01, and the number of MLP layers in the deep interaction modeling part is set to 4. Improvements are reported against the best baseline on each dataset.

5.4.1 Performance comparison (RQ1)

Table 2 reports the performance of our A-SAERec algorithm and the baseline algorithms in terms of HR@5 and NDCG@5. As we can see, A-SAERec achieves the best performance among all compared algorithms in most cases. In detail, on the Ml-10M dataset, the HR@5 and NDCG@5 of A-SAERec reach 0.7136 and 0.5546, respectively. On the Book-Crossing dataset, the relative improvements over the state-of-the-art model RecVAE are 3.39% in HR@5 and 10.12% in NDCG@5. On the Ml-10M dataset, the HR@5 of A-SAERec outperforms NeuMF by a large margin, with an improvement of 4.07%. On LastFM, the HR@5 and NDCG@5 of A-SAERec reach 0.4717 and 0.3481, respectively. However, on the Yelp dataset, AutoSVD $++$ outperforms A-SAERec in HR@5 by a relative margin of about 2.95% when the number of hidden layers of both AutoSVD $++$ and A-SAERec is 8.

Table 2
Performance of recommendation models with metric HR@5 and NDCG@5

Model	Ml-10M		Yelp		Book-Crossing		LastFM
	HR@5	NDCG@5	HR@5	NDCG@5	HR@5	NDCG@5	HR@5	NDCG@5
Biased SVD	0.6257	0.4527	0.2374	0.1469	0.3438	0.2485	0.3349	0.2227
ItemKNN	0.5941	0.4239	0.2125	0.1051	0.3071	0.2045	0.2251	0.1246
NeuMF	0.6729	0.5023	0.2023	0.1331	0.3597	0.1717	0.4470	0.3309
AutoSVD $++$	0.6330	0.4641	0.2612	0.1529	0.3531	0.2539	0.3547	0.2579
JCA	0.6569	0.4687	0.2420	0.1485	0.3834	0.2772	0.3432	0.2497
RecVAE	0.6428	0.4913	0.2251	0.1232	0.3921	0.2914	0.4009	0.3088
A-SAERec	0.7136	0.5546	0.2535	0.1557	0.4054	0.3209	0.4717	0.3481
%Improv.	4.07%	10.41%	$-$ 2.95%	1.83%	3.39%	10.12%	5.53%	5.20 %

Table 3

Performance of recommendation models with metric HR@10 and NDCG@10

Model	Ml-10M		Yelp		Book-Crossing		LastFM
	HR@10	NDCG@10	HR@10	NDCG@10	HR@10	NDCG@10	HR@10	NDCG@10
Biased SVD	0.6820	0.4327	0.3571	0.1909	0.4628	0.2619	0.4847	0.2711
ItemKNN	0.6437	0.4032	0.3413	0.1523	0.4597	0.2538	0.3843	0.1756
NeuMF	0.7434	0.5422	0.3021	0.1527	0.3362	0.3108	0.5239	0.3201
AutoSVD++	0.6850	0.4412	0.3582	0.2016	0.4660	0.2829	0.4918	0.3212
JCA	0.7154	0.4807	0.3320	0.1852	0.4052	0.2756	0.5376	0.3078
RecVAE	0.7253	0.5140	0.3452	0.1931	0.4784	0.3324	0.5631	0.3516
A-SAERec	0.7997	0.5812	0.3778	0.2130	0.5016	0.3510	0.5957	0.3835
%Improv.	7.60%	7.19%	5.47%	5.65%	4.83%	5.60%	5.79%	9.07%

Figure 3.

The HR@5 and NDCG@5 performance of our A-SAERec algorithm compared with other baselines on four datasets, with latent dimension ranging from 10 to 50.

Figure 4.

The HR@10 and NDCG@10 performance of our A-SAERec algorithm compared with other baselines on four datasets, with latent dimension ranging from 10 to 50.

Table 3 shows the HR@10 and NDCG@10 performance of all compared algorithms. The results show that A-SAERec performs better than every baseline algorithm. For example, on the LastFM dataset, A-SAERec improves HR@10 and NDCG@10 by 5.79% and 9.07% over RecVAE, respectively. On the remaining three datasets, the HR@10 improvements are 5.47% on Yelp, 4.83% on Book-Crossing, and 7.60% on Ml-10M, and the NDCG@10 improvements are 5.65%, 5.60%, and 7.19% on Yelp, Book-Crossing, and Ml-10M, respectively. In addition, the NeuMF, AutoSVD++, JCA, and RecVAE models generally show better performance than the two traditional methods, Biased SVD and ItemKNN, which indicates the effectiveness of deep learning techniques in improving recommendation performance. To sum up, the experimental results show that the joint structure of A-SAERec performs better than the state-of-the-art algorithms and can effectively improve the performance of recommender systems.
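
For readers who want to reproduce numbers of this kind, the following Python sketch gives one common way to compute HR@K and NDCG@K for a single user with binary relevance. It is a minimal illustration under our own assumptions (the function names, the toy ranking, and the single held-out item are hypothetical), and the exact evaluation protocol used for Tables 2 and 3 may differ.

```python
import math

def hit_ratio_at_k(ranked_items, relevant, k):
    """1.0 if any relevant item appears in the top-k list, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked_items[:k]) else 0.0

def ndcg_at_k(ranked_items, relevant, k):
    """Binary-relevance NDCG@k: DCG of the list over the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked_items[:k])
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the held-out item "i9" is ranked third.
ranking = ["i7", "i3", "i9", "i1", "i4", "i8"]
held_out = {"i9"}
print(hit_ratio_at_k(ranking, held_out, 5))  # 1.0
print(ndcg_at_k(ranking, held_out, 5))       # 0.5
```

The reported metric would then be the average of these per-user values over all test users.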

5.4.2 Hyper-parameter investigation (Q2)

Figure 3 shows the HR@5 and NDCG@5 performance of all compared algorithms with respect to the number of latent factors $D$. We vary the dimension over [10, 20, 30, 40, 50]. In Fig. 3, we can see that A-SAERec achieves a clear advantage over the baselines on all datasets, which indicates that A-SAERec is effective in modeling complex user-item interactions. In particular, the improvement of A-SAERec is most obvious on the Ml-10M and Book-Crossing datasets. A possible reason is that these datasets are relatively dense and contain more user-item interaction samples, so that the attention network can learn more effective attention weights from the data.

Figure 4 shows the HR@10 and NDCG@10 performance of all compared algorithms with respect to the number of latent factors $D$. The conclusion is almost the same regardless of the latent dimension setting: our proposed A-SAERec model outperforms all baseline algorithms. As the latent dimension changes, the results of A-SAERec also fluctuate only within a small range. From Fig. 4, we can see that RecVAE is a strong baseline that beats NeuMF, JCA, and AutoSVD++ on the Ml-10M, Book-Crossing, and LastFM datasets, even though they have deep structures. In terms of HR@5 and NDCG@5, the NeuMF algorithm outperforms all other baseline methods on the Ml-10M and LastFM datasets. According to Table 3, the HR@10 and NDCG@10 of A-SAERec reach 0.7997 and 0.5812 on the Ml-10M dataset, respectively, which is a highly satisfactory result.

Figure 5.

Illustration of the four metrics performance of A-SAERec on four datasets, with the number of hidden layer ranging from 1 to 5.

The experimental results shown in Figs 3 and 4 indicate the utility of our A-SAERec algorithm on the task of top-N recommendation. The attention weights learned from user-item interactions benefit the learned representations of users and items. The substantial improvement of A-SAERec over the baselines can be credited to two reasons: (1) our A-SAERec algorithm models user-item interactions with complex user and item features extracted from the implicit feedback matrix via a non-linear network, whereas traditional methods apply a linear kernel, i.e., an inner product of the user and item vectors, to predict the score; (2) our model leverages an attention network to capture the attention weights of users and items on different aspects of each item and each user, which improves the performance of the recommender system.

Figure 5 shows the HR@5, NDCG@5, HR@10, and NDCG@10 performance of A-SAERec with respect to the number of hidden layers $L$. We vary the number of hidden layers over [1, 2, 3, 4, 5] inside each encoder and decoder. In our A-SAERec model, a five-layer network is utilized to reconstruct the input. As shown in Fig. 5, fewer than two layers are not enough to learn effective feature representations, and three or four hidden layers are generally sufficient to train the model.

5.4.3 Impact of attention network (Q3)

In our A-SAERec model, we assume that a user places different importance on different items, and we use an attention network to capture the corresponding attention weights. To verify the utility of the attention mechanism in A-SAERec, we design an ablation experiment in which the following two models are compared with our A-SAERec algorithm.

• SAERec: A variant of the A-SAERec method, which combines the stacked sparse autoencoder with neural matrix factorization. The difference from our algorithm is that it does not use the attention mechanism.
• NeuMF: A deep learning model which only employs the one-hot encoding of a user’s and an item’s identities as input.
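
To make the difference between the two autoencoder variants concrete, the following Python sketch contrasts an input that is passed to the encoder unchanged (as in SAERec) with an input that is first reweighted by attention weights (as in A-SAERec). This is only a schematic illustration: the softmax normalization, the function names, and the example scores are assumptions of ours and do not reproduce the attention network defined earlier in the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_input(interactions, scores):
    """Reweight an implicit-feedback vector by softmax attention weights.

    `scores` stands in for the output of an attention network
    (hypothetical values here, not learned ones).
    """
    weights = softmax(scores)
    return [w * x for w, x in zip(weights, interactions)]

interactions = [1.0, 0.0, 1.0, 1.0]  # implicit feedback of one user
scores = [2.0, 0.5, 0.1, 1.0]        # hypothetical attention scores

saerec_input = interactions                             # no attention
asaerec_input = attentive_input(interactions, scores)   # with attention
print(saerec_input)
print([round(v, 3) for v in asaerec_input])
```

In the ablation, everything after this input stage is kept the same, so any performance gap between SAERec and A-SAERec can be attributed to the attention weights.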

Figure 6.
Illustration of the four metrics performance of compared three algorithms on four datasets.

The results are shown in Fig. 6, where we test each algorithm on four datasets in terms of HR@5, NDCG@5, HR@10, and NDCG@10. From Fig. 6, we can see that: (1) the SAERec algorithm outperforms the NeuMF algorithm by a large margin; in particular, the HR@5, NDCG@5, HR@10, and NDCG@10 of the SAERec algorithm reach 0.692, 0.5287, 0.7832, and 0.5532, respectively. The relative improvements over the NeuMF algorithm are 2.93%, 5.25%, 8.26%, and 1.82%, which shows the utility of extracting representations from user-item interaction data and also demonstrates the effectiveness of our proposed structure in improving the performance of recommender systems; (2) the A-SAERec algorithm performs better than SAERec across all datasets, which verifies the effectiveness of the attention mechanism in capturing the attention weights for each user-item pair.
6. Conclusion

In this paper, we present a hybrid deep collaborative filtering framework (A-SAERec), which combines the attentive stacked sparse autoencoder with neural collaborative filtering. To the best of our knowledge, this is the first attempt to leverage a joint neural network that tightly couples the attention mechanism with the stacked sparse autoencoder for personalized recommendation. For user and item feature extraction, we utilize an attentive stacked sparse autoencoder with the implicit feedback as input. Neural matrix factorization is then exploited to model user-item interactions using the extracted user and item features as input. To improve recommendation performance, we employ the attention network to capture the attention weights for each user-item pair. Experiments on four real-world datasets show that our algorithm outperforms the baselines and demonstrate its utility in improving the performance of recommender systems.

Footnotes

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61702063), the Natural Science Foundation of Chongqing (No. cstc2019jcyj-msxmX0544), and the Science and Technology Research Program of Chongqing Municipal Education Commission (No. KJQN202001136).

References

1. Fan, Zhao, Tang and Yin, Graph neural networks for social recommendation, in: Proc. the 2019 World Wide Web Conference, 2019, pp. 417–426.
2. Yin, Wang, Chen and Zhou, Spatial-aware hierarchical collaborative deep learning for POI recommendation, IEEE Transactions on Knowledge and Data Engineering 29(11) (2017), 2537–2551.
3. Koren, Bell and Volinsky, Matrix factorization techniques for recommender systems, Computer 42(8) (2009), 30–37.
4. Lam X.N., T.D. and Duong A.D., Addressing cold-start problem in recommendation systems, in: Proc. the 2nd International Conference on Ubiquitous Information Management and Communication, 2008, pp. 208–211.
5. Xue A.Y., Xie, Zhang, Huang and Li, Solving the data sparsity problem in destination prediction, The VLDB Journal 24(2) (2015), 219–243.
6. Zhang, Liu and Jin, Deep learning based recommender system: A survey and new perspectives, Frontiers of Computer Science 52(1) (2019), 1–38.
7. Zhang, Liu and Jin, A survey of autoencoder-based recommender systems, Frontiers of Computer Science 14(2) (2020), 430–450.
8. Sedhain, Menon A.K., Sanner and Xie, Autorec: Autoencoders meet collaborative filtering, in: Proc. the 24th International Conference on World Wide Web, 2015, pp. 111–112.
9. DuBois, Zheng A.X. and Ester, Collaborative denoising auto-encoders for top-n recommender systems, in: Proc. the Ninth ACM International Conference on Web Search and Data Mining, 2016, pp. 153–162.
10. Liang, Krishnan R.G., Hoffman M.D. and Jebara, Variational autoencoders for collaborative filtering, in: Proc. the 2018 World Wide Web Conference, 2018, pp. 689–698.
11. Chen and de Rijke, A collective variational autoencoder for top-n recommendation with side information, in: Proc. the 3rd Workshop on Deep Learning for Recommender Systems, 2018, pp. 3–9.
12. Zhang, Ren and Ji, Taxonomy-aware collaborative denoising autoencoder for personalized recommendation, Applied Intelligence 49(6) (2019), 2101–2118.
13. Zhang, Yao and Xu, An efficient hybrid collaborative filtering model via contractive auto-encoders, in: Proc. the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 957–960.
14. Robles, Larrañaga, Peña J.M., Marbán, Crespo and Pérez M.S., Collaborative filtering using interval estimation naive Bayes, in: Proc. the First International Atlantic Web Intelligence Conference, 2003, pp. 46–53.
15. J.A. and Araki, A SVM-based personal recommendation system for TV programs, in: Proc. the 12th International Multi-Media Modelling Conference, 2006.
16. Oku, Nakajima, Miyazaki and Uemura, Context-aware SVM for context-dependent information recommendation, in: Proc. the 7th International Conference on Mobile Data Management, 2006, pp. 109–109.
17. Da’u and Salim, Recommendation system based on deep learning methods: A systematic review and new directions, Artificial Intelligence Review 53(4) (2020), 2709–2748.
18. Ouyang, Liu, Rong and Xiong, Autoencoder-based collaborative filtering, in: Proc. the 2014 International Conference on Neural Information Processing, 2014, pp. 184–291.
19. Meng and Zhang, Collaborative additional variational autoencoder for top-N recommender systems, IEEE Access 7 (2019), 5707–5713.
20. Strub and Mary, Collaborative filtering with stacked denoising autoencoders and sparse inputs, in: Proc. the 2015 NIPS Workshop on Machine Learning for eCommerce, 2015.
21. Zhang, Yao, Wang and Zhu, Hybrid collaborative recommendation via semi-autoencoder, in: Proc. the 2017 International Conference on Neural Information Processing, 2017, pp. 185–193.
22. Karamanolakis, Cherian K.R., Narayan A.R., Yuan, Tang and Jebara, Item recommendation with variational autoencoders and heterogeneous priors, in: Proc. the 3rd Workshop on Deep Learning for Recommender Systems, 2018, pp. 10–14.
23. Dong, Sun, Yuan and Zhang, A hybrid collaborative filtering model with deep structure for recommender systems, in: Proc. the AAAI Conference on Artificial Intelligence, 2017, pp. 1309–1315.
24. and She, Collaborative variational autoencoder for recommender systems, in: Proc. the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 305–314.
25. Alfarhood and Cheng, CATA++: A collaborative dual attentive autoencoder method for recommending scientific articles, IEEE Access 8 (2020), 183633–183648.
26. Sparse autoencoder, CS294A Lecture Notes 72 (2011), 1–19.
27. Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, in: Proc. the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 426–434.
28. Sarwar, Karypis, Konstan and Riedl, Item-based collaborative filtering recommendation algorithms, in: Proc. the 10th International Conference on World Wide Web, 2001, pp. 285–295.
29. Liao, Zhang, Nie and Chua T.S., Neural collaborative filtering, in: Proc. the 26th International Conference on World Wide Web, 2017, pp. 173–182.
30. Zhu, Wang and Caverlee, Improving top-k recommendation via joint collaborative autoencoders, in: The World Wide Web Conference, 2019, pp. 3483–3482.
31. Shenbin, Alekseev, Tutubalina, Malykh and Nikolenko S.I., RecVAE: A new variational autoencoder for Top-N recommendations with implicit feedback, in: Proc. the 13th International Conference on Web Search and Data Mining, 2020, pp. 528–536.

1 https://grouplens.org/datasets/movielens/.

5 https://github.com/haomiaocqut/ReSys_A-SAERec.
