Abstract
Collaborative filtering (CF) has achieved great success in recommender systems over the past decades. CF-based methods first map users and items to latent factors in a shared latent space, and then predict user ratings on items with a linear function such as the inner product or cosine similarity. Such methods use only the original latent features, although feature interactions are usually helpful for enhancing recommendation performance. To tackle this issue, we use Factorization Machines (FM) to enhance linear methods by incorporating second-order feature interactions. In this paper, we propose a novel hybrid model, AutoFM, which combines a Denoising Autoencoder (DAE) and FM. Following the collaborative filtering paradigm, AutoFM first uses DAEs to map users and items to latent factors, and then uses FM to calculate user ratings on items. To tackle the cold start problem, we also feed users' and items' side information into FM in addition to the latent factors. We evaluate AutoFM on three real-world datasets, and the experimental results show that AutoFM consistently outperforms state-of-the-art methods.
Introduction
With the development of the Internet, the amount of online information has grown explosively, and it takes users more and more time to find useful information. Recommender systems have therefore become indispensable in helping people overcome information overload. The goal of a recommender system is to help users identify the items that best fit their personal tastes from a large repository of items.
Existing recommendation algorithms can be roughly categorized into three classes: content-based, collaborative-based and hybrid algorithms. Content-based recommendation algorithms [1, 2] make use of item content, such as a film's director, and recommend items that are similar to those the user liked in the past. Collaborative-based recommendation algorithms [3–5] use only historical user data, such as user ratings on items. They recommend items that other users with similar preferences liked in the past. Collaborative-based algorithms generally perform better than content-based ones. However, they have some limitations. Their performance drops significantly when the rating matrix becomes sparse as the numbers of users and items grow. In addition, they cannot recommend items to new users without historical records, which is known as the cold start problem. These issues have been addressed in recent years by hybrid recommendation algorithms [6–8]. Hybrid algorithms integrate content-based and collaborative-based algorithms and can improve recommendation performance under sparse data conditions.
In recent years, deep learning has achieved great success in speech recognition [9, 10], computer vision [11, 12] and Natural Language Processing [13, 14]. Researchers have also studied the application of deep learning to recommender systems [15–18]. Sedhain et al. [19] used an autoencoder to encode the historical behavior vectors of users or items, and used its output layer to predict users' ratings on items. Strub et al. [20] added masking noise to the sparse input to simulate unknown ratings on items. Accordingly, they introduced a denoising autoencoder that reconstructs the original sparse input. Because the user-item rating matrix is sparse, Strub et al. [21] extended the autoencoder to integrate side information of users or items, which improved performance. Dong et al. [22] presented a deep learning model called the additional stacked Denoising Autoencoder (aSDAE). It extends the stacked Denoising Autoencoder to integrate additional side information into the inputs, which addresses the cold start and data sparsity problems. Zheng et al. [23] used an autoencoder to handle users' implicit feedback. Unlike a standard autoencoder, their learning objective is a cross-entropy loss. Item content contains much information that is useful for recommendation; however, few works exploit it in the recommendation task. Li et al. [24] mapped item content to a content latent factor via a variational autoencoder (VAE), and also embedded collaborative information into a collaborative latent factor. They added the item content latent factor to the item collaborative latent factor to obtain the item latent factor. They then computed the inner product of the user and item latent factors to predict user ratings on items.
Existing autoencoder-based recommendation models first obtain the latent factors of users and items. They then usually apply a linear function, such as the inner product or cosine similarity, to predict a user's preference for an item. Such a function considers only the original latent features, whereas modeling feature interactions has been shown to improve prediction performance [25]. To take feature interactions into account, we propose a new hybrid collaborative recommendation model named AutoFM. It automatically learns second-order feature interactions using Factorization Machines (FM) [26]. Specifically, we first use a DAE to map the sparse historical behavior vectors of users and items to latent factors, and then predict user ratings on items through FM. In addition, to deal with the cold start problem, we add side information of users and items to the proposed model. We evaluated AutoFM on two public datasets and on a dataset collected from an Android app named Green Travel. Experiments on these three real-world datasets show that our model achieves strong recommendation performance. The main contributions of this paper are summarized as follows:
We propose a hybrid collaborative recommendation model, AutoFM, which integrates a denoising autoencoder and Factorization Machines. It effectively extracts latent vectors and learns second-order feature interactions between users and items.
To improve recommendation performance, we first use a denoising autoencoder, which randomly sets a fraction of the input vector to zero, instead of a basic autoencoder. Second, we add side information of users and items to the hybrid model by concatenating the latent vectors and the side-information vectors as the input of the FM part.
We conduct experiments on real-world datasets to evaluate the effectiveness of our hybrid model. Experimental results show that it outperforms state-of-the-art methods in terms of Root Mean Squared Error (RMSE).
Related work
There are various recommendation methods, including content-based, collaborative filtering-based, knowledge-based and hybrid methods. Among them, collaborative filtering is one of the most commonly used. Hybrid models can improve recommendation performance by exploiting additional information about users and items, referred to as side information. Besides the autoencoder-based methods mentioned in Section 1, there are many other collaborative filtering-based and hybrid recommendation methods [27]. Probabilistic Matrix Factorization (PMF) [5] is a classical collaborative filtering algorithm for recommendation; it models the user preference matrix as the product of two lower-rank user and item matrices. Salakhutdinov et al. [28] proposed a recommender model based on restricted Boltzmann machines (RBM), the first recommendation model built on deep learning. Wang et al. [29] proposed a hierarchical Bayesian model called collaborative deep learning (CDL). It jointly performs deep representation learning for the side information and collaborative filtering for the rating matrix. Ying et al. [30] proposed collaborative deep ranking (CDR), a hybrid pair-wise approach for implicit feedback. CDR integrates deep feature representations of item content into the Bayesian framework of a pair-wise ranking model. Zhang et al. [31] proposed a hybrid model that generalizes the contractive auto-encoder paradigm into the matrix factorization framework. It jointly models content information as effective and compact representations, and leverages implicit user feedback to make accurate recommendations.
Recurrent neural networks (RNNs) have also been widely studied for recommender systems, and they are particularly suitable for session-based recommendation. Hidasi et al. [32] proposed a GRU-based model for session-based recommendation. Tan et al. [33] improved this model with several strategies, such as data augmentation and a method to account for shifts in the input data distribution. Smirnova and Vasile [34] proposed a contextual RNN that takes contextual information into account in both the input and output layers. Convolutional Neural Networks (CNNs) have been used in recommender systems to extract features from textual or image data. Seo et al. [35] used two convolutional neural networks to learn feature representations from user and item review texts, and predicted rating scores with a dot product in the final layer. Wang et al. [36] adopted a CNN to extract image features and exploited these features for Point-of-Interest (POI) recommendation. He et al. [37] incorporated visual features learned via a CNN into matrix factorization.
Preliminaries
In this section, we briefly review denoising autoencoders and factorization machines.
Denoising autoencoder
An autoencoder is a special feed-forward neural network consisting of an encoder and a decoder [39]. It is an unsupervised network whose output only needs to reconstruct the initial input. The encoder f(·) takes a given input
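The following NumPy sketch makes the encoder/decoder pair concrete with a one-hidden-layer autoencoder. Several details are illustrative assumptions and not the paper's exact formulation: the parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), the name g used for the decoder, and the use of ReLU on both layers. The ReLU choice matches the activation reported in the parameter settings.

```python
import numpy as np

def relu(z):
    # Element-wise rectified linear unit.
    return np.maximum(z, 0.0)

def encode(x, W_enc, b_enc):
    # Encoder f(.): maps the input vector x to a hidden representation h.
    return relu(W_enc @ x + b_enc)

def decode(h, W_dec, b_dec):
    # Decoder (called g here for illustration): maps h back to input space.
    return relu(W_dec @ h + b_dec)

def reconstruct(x, params):
    # Full pass x -> h -> x_hat; training pushes x_hat towards x.
    h = encode(x, params["W_enc"], params["b_enc"])
    return decode(h, params["W_dec"], params["b_dec"])

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 3
params = {
    "W_enc": rng.uniform(-0.1, 0.1, size=(n_hidden, n_in)),
    "b_enc": np.zeros(n_hidden),
    "W_dec": rng.uniform(-0.1, 0.1, size=(n_in, n_hidden)),
    "b_dec": np.zeros(n_in),
}
x = np.array([5.0, 0.0, 3.0, 0.0, 4.0, 1.0])  # toy sparse rating vector
x_hat = reconstruct(x, params)
```

Without corruption, such a network can drift towards copying its input. That weakness motivates the denoising variant.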
However, a basic autoencoder often degenerates into an identity network. In the recommendation task in particular, our goal is to accurately predict user ratings on items that have not yet been observed, rather than to reconstruct observed ratings. In other words, missing ratings bring no information to the network. A denoising autoencoder reconstructs the input from a corrupted version of it in order to learn a more effective representation [38]. This prevents the network from simply learning the identity and makes its hidden representation capture useful structure of the input. There are three common ways to corrupt the input [20]:
Gaussian Noise: Gaussian noise is added to a subset of the input.
Masking Noise: A fraction ν of the input is randomly forced to 0.
Salt-and-Pepper Noise: A fraction ν of the input is randomly forced to 0 or 1.
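The three corruption schemes can be sketched as follows. The function names are illustrative, and the defaults 0 and 1 for salt-and-pepper noise assume inputs scaled to [0, 1]. In a raw rating vector, 0 also means "unobserved", which is why masking noise is a natural way to simulate missing ratings.

```python
import numpy as np

def gaussian_noise(x, sigma, fraction, rng):
    # Add N(0, sigma^2) noise to a randomly chosen subset of the entries.
    mask = rng.random(x.shape) < fraction
    return x + mask * rng.normal(0.0, sigma, size=x.shape)

def masking_noise(x, fraction, rng):
    # Force a random fraction of the entries to zero.
    keep = rng.random(x.shape) >= fraction
    return x * keep

def salt_and_pepper_noise(x, fraction, rng, low=0.0, high=1.0):
    # Set a random fraction of the entries to `low` or `high`,
    # each with probability 1/2.
    x_noisy = np.array(x, dtype=float)
    hit = rng.random(x.shape) < fraction
    coin = rng.random(x.shape) < 0.5
    x_noisy[hit & coin] = low
    x_noisy[hit & ~coin] = high
    return x_noisy
```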
In this paper, we used masking noise to corrupt the input, and the corrupted version of the input is
The loss function of the network is:
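One common form of reconstruction loss for sparse rating vectors is sketched below as an assumption for illustration. It computes the squared error only on observed ratings and weights corrupted and uncorrupted entries separately. The weights are called `alpha` and `beta` below, in the spirit of the α/β weights listed for CFN in the parameter settings.

```python
import numpy as np

def masked_reconstruction_loss(x, x_hat, corrupted, alpha=1.0, beta=1.0):
    # x         : original sparse rating vector (0 = unobserved)
    # x_hat     : network output for the corrupted input
    # corrupted : boolean mask of the entries hit by the noise
    # Only observed ratings contribute; unobserved entries carry no signal.
    observed = x != 0
    sq_err = (x_hat - x) ** 2
    loss_corrupted = sq_err[observed & corrupted].sum()
    loss_clean = sq_err[observed & ~corrupted].sum()
    # alpha weights the ratings hidden by the noise (prediction),
    # beta weights the ratings the network could still see (reconstruction).
    return alpha * loss_corrupted + beta * loss_clean
```

For example, with `x = [5, 0, 3, 4]`, `x_hat = [4, 2, 3, 2]` and entries 0 and 3 corrupted, the unobserved entry 1 is ignored. The loss is `alpha * (1 + 4) + beta * 0`.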
Factorization machines (FM) were first introduced in [26] in 2010. The idea behind FMs is to model interactions between features (explanatory variables) using factorized parameters. FMs can estimate all interactions between features even under extreme data sparsity. Given a real-valued feature vector
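For reference, the standard degree-2 FM of [26] predicts ŷ(x) = w_0 + Σ_i w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j. Here each feature i has a weight w_i and a k-dimensional factor vector v_i. The sketch below evaluates this formula directly with a double loop; the names `w0`, `w` and `V` are illustrative.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    # Degree-2 factorization machine, evaluated directly in O(k n^2):
    #   y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j
    # x: (n,) features, w0: global bias, w: (n,) weights, V: (n, k) factors.
    n = x.shape[0]
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            # The pair weight is a dot product of two factor rows, so it can
            # be estimated even if features i and j never co-occur in training.
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y

x = np.array([1.0, 2.0, 3.0])
V = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_example = fm_predict_naive(x, 0.5, np.array([0.1, 0.2, 0.3]), V)  # approx. 6.9
```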
Equation (2) requires $O(kn^2)$ time because all pairwise interactions have to be computed. With some reformulation, however, the complexity can be reduced from $O(kn^2)$ to linear $O(kn)$. The pairwise interactions in (2) can be reformulated as:
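For reference, the standard reformulation of the pairwise term from [26] is the identity below. As in [26], feature i is assumed to have a k-dimensional factor vector v_i with entries v_{i,f}. Each of the k bracketed terms needs only O(n) work, which gives the O(kn) total:

```latex
\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j
  = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^{2}
  - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2} \right]
```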
We omit the derivation of Equation (4), which can be found in [26].
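The linear-time evaluation can be sketched in NumPy and checked numerically against the direct double loop. The function names and the random test sizes are illustrative.

```python
import numpy as np

def fm_predict_linear(x, w0, w, V):
    # O(kn) degree-2 FM using the identity
    #   sum_{i<j} <v_i, v_j> x_i x_j
    #     = 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
    s = V.T @ x                   # (k,): sum_i v_{i,f} x_i for each factor f
    s_sq = (V ** 2).T @ (x ** 2)  # (k,): sum_i v_{i,f}^2 x_i^2
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s_sq)

def fm_predict_naive(x, w0, w, V):
    # Reference O(kn^2) version, used only to check the fast one.
    n = x.shape[0]
    pair = sum((V[i] @ V[j]) * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))
    return w0 + w @ x + pair

rng = np.random.default_rng(42)
n, k = 20, 4
x = rng.random(n)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
assert np.isclose(fm_predict_linear(x, w0, w, V), fm_predict_naive(x, w0, w, V))
```

The fast version touches each (feature, factor) entry a constant number of times. This is why FM stays affordable even with long, sparse input vectors.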
In this section, we introduce our proposed hybrid collaborative filtering model, AutoFM. Figure 1 illustrates the structure of the AutoFM model, which consists of two parts: the autoencoder part and the FM part.

Components and structure of AutoFM.
Given m users
Autoencoder part
In this part, we apply two autoencoder networks to perform collaborative filtering. They learn latent factors of both users and items from sparse vectors
Similarly, the I-DAE can be formulated as follows:
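The following sketch shows how the two DAEs could produce the latent factors for a single (user, item) pair. It rests on several assumptions:

- the rating matrix `R` has m user rows and n item columns, with 0 for unobserved ratings;
- the U-DAE reads a user's row and the I-DAE reads an item's column;
- the hidden-layer activation serves as the latent factor.

The hidden size of 2 and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dae(n_in, n_hidden):
    # One-hidden-layer DAE; weights ~ U[-0.1, 0.1], biases start at zero.
    return {
        "W_enc": rng.uniform(-0.1, 0.1, (n_hidden, n_in)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.uniform(-0.1, 0.1, (n_in, n_hidden)),
        "b_dec": np.zeros(n_in),
    }

def dae_forward(dae, x, corruption_ratio=0.0):
    # Masking noise, then encode to the latent factor h and decode to x_hat.
    keep = rng.random(x.shape) >= corruption_ratio
    h = np.maximum(dae["W_enc"] @ (x * keep) + dae["b_enc"], 0.0)
    x_hat = np.maximum(dae["W_dec"] @ h + dae["b_dec"], 0.0)
    return h, x_hat

# Toy rating matrix R: m users x n items, 0 = unobserved.
R = np.array([[5, 0, 3, 0],
              [0, 4, 0, 1],
              [2, 0, 0, 5]], dtype=float)
m, n = R.shape
u_dae = make_dae(n_in=n, n_hidden=2)  # U-DAE reads a user's row (length n)
i_dae = make_dae(n_in=m, n_hidden=2)  # I-DAE reads an item's column (length m)

u, i = 0, 2
h_u, _ = dae_forward(u_dae, R[u, :], corruption_ratio=0.2)  # user latent factor
h_i, _ = dae_forward(i_dae, R[:, i], corruption_ratio=0.2)  # item latent factor
```

The corruption is a training-time device. At prediction time one would typically call `dae_forward` with `corruption_ratio=0.0`.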
The objective function of the autoencoder part is formulated as follows:
In this part, we integrate the side information into collaborative filtering; Fig. 1(b) illustrates the structure of the FM part. Let x ∈ X and y ∈ Y denote the side information of users and items, respectively. Here, side information covers a wide variety of information, such as user profiles, user group affiliations, film plotlines and book reviews.
First, we concatenate $h_u$, $h_i$, $x$ and $y$ to obtain a feature vector
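The following sketch shows the concatenation and the resulting FM prediction for one (user, item) pair. The toy latent factors, the side vectors and their sizes are hypothetical. The FM uses the linear-time form of the pairwise term from the preliminaries.

```python
import numpy as np

def fm_predict(z, w0, w, V):
    # Linear-time degree-2 FM on the concatenated feature vector z.
    s = V.T @ z
    return w0 + w @ z + 0.5 * np.sum(s ** 2 - (V ** 2).T @ (z ** 2))

rng = np.random.default_rng(1)
h_u = rng.random(4)                      # user latent factor (toy values)
h_i = rng.random(4)                      # item latent factor (toy values)
x_side = np.array([0.0, 1.0, 0.0])       # hypothetical one-hot user attribute
y_side = np.array([1.0, 0.0, 1.0, 0.0, 0.0])  # hypothetical multi-hot genres

z = np.concatenate([h_u, h_i, x_side, y_side])  # FM input feature vector
k = 8                                           # FM embedding size
w0, w = 0.0, np.zeros(z.size)
V = rng.uniform(-0.1, 0.1, (z.size, k))         # embedding matrix
r_hat = fm_predict(z, w0, w, V)                 # predicted rating for (u, i)
```

All four blocks share one embedding matrix `V`. As a result, the pairwise term covers user-item, latent-side and side-side feature pairs alike, and this is how the side information enters the interactions.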
The objective function of the FM part is formulated as follows:
where the regularization term is the L2 regularization of the embedding matrix.
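Below is a minimal sketch of the FM-part objective. It assumes a sum of squared errors over the observed (user, item) pairs plus λ‖V‖²_F on the embedding matrix V. The names, and the use of a sum rather than a mean, are assumptions.

```python
import numpy as np

def fm_predict(z, w0, w, V):
    # Linear-time degree-2 FM prediction for one feature vector z.
    s = V.T @ z
    return w0 + w @ z + 0.5 * np.sum(s ** 2 - (V ** 2).T @ (z ** 2))

def fm_objective(Z, r, w0, w, V, lam):
    # Z: (N, d) FM inputs for N observed (user, item) pairs; r: (N,) ratings.
    # Squared prediction error plus lam * ||V||_F^2 on the embedding matrix.
    preds = np.array([fm_predict(z, w0, w, V) for z in Z])
    return np.sum((r - preds) ** 2) + lam * np.sum(V ** 2)
```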
The loss function of AutoFM consists of three parts: the prediction error, the reconstruction error, and the regularization of the latent factors and the embedding matrix. It can be formulated as follows:
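The three-part loss can be sketched as below. The weights `w_rec`, `w_lat` and `w_emb` are hypothetical names for trade-off hyperparameters. The parameter settings list α, β and λ, but the sketch does not assume which term each of them multiplies. Reconstruction errors are counted only on observed entries, following the DAE discussion.

```python
import numpy as np

def autofm_loss(r, r_hat, X_u, X_u_hat, X_i, X_i_hat, H_u, H_i, V,
                w_rec=1.0, w_lat=0.01, w_emb=0.01):
    # 1) Prediction error of the FM part on observed ratings.
    pred = np.sum((r - r_hat) ** 2)
    # 2) Reconstruction error of the U-DAE and I-DAE on observed entries only.
    rec = (np.sum(((X_u_hat - X_u) * (X_u != 0)) ** 2)
           + np.sum(((X_i_hat - X_i) * (X_i != 0)) ** 2))
    # 3) L2 penalties on the latent factors and on the FM embedding matrix.
    reg = w_lat * (np.sum(H_u ** 2) + np.sum(H_i ** 2)) + w_emb * np.sum(V ** 2)
    return pred + w_rec * rec + reg
```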
In this section, we conduct extensive experiments on three real-world datasets. First, we present the datasets and the evaluation metric used in our experiments. Then, we describe the baseline algorithms selected for comparison and introduce the parameter settings. Finally, we analyze the experimental results.
Datasets and evaluation metric
Dataset Description. We evaluate the performance of our AutoFM model on two public datasets and on a dataset collected from an Android app named Green Travel. MovieLens is a movie rating dataset that has been widely used to evaluate CF algorithms. We use two stable benchmark datasets from it, MovieLens-100K and MovieLens-1M. The Green-travel dataset contains a large number of users, restaurants and restaurant ratings. MovieLens-100K contains 100K ratings from 943 users on 1682 movies. MovieLens-1M contains more than 1 million ratings from 6040 users on 3706 movies. Green-travel contains more than 1.5 million ratings from 9276 users on 4132 restaurants. Each rating is an integer between 1 and 5. MovieLens-1M is sparser than MovieLens-100K: only 2.57% of its user-item matrix entries contain ratings, compared with 3.49% for MovieLens-100K. Green-travel has ratings in 1.41% of its user-item matrix entries.
Moreover, we extract user and item side information to construct the additional matrices
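One common way to build such side-information vectors, assumed here for illustration, is to one-hot or multi-hot encode categorical attributes. The attribute vocabularies below (gender, occupation, genre) are hypothetical examples in the style of MovieLens metadata. They are not the exact features used in the experiments.

```python
import numpy as np

def multi_hot(values, vocabulary):
    # Encode a set of categorical values as a 0/1 vector over `vocabulary`.
    index = {v: j for j, v in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary))
    for v in values:
        vec[index[v]] = 1.0
    return vec

# Hypothetical attribute vocabularies.
genders = ["F", "M"]
occupations = ["artist", "engineer", "student"]
genres = ["Action", "Comedy", "Drama", "Romance"]

x_user = np.concatenate([multi_hot(["M"], genders),
                         multi_hot(["student"], occupations)])
y_item = multi_hot(["Comedy", "Romance"], genres)
```

Stacking one such vector per user (or item) row gives a user (or item) side-information matrix.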
Evaluation Metrics. We employ the widely used Root Mean Squared Error (RMSE) as the evaluation metric for measuring prediction accuracy. It is defined as follows:
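The metric is the standard RMSE over the N test ratings, RMSE = sqrt((1/N) Σ (r_ui − r̂_ui)²). A small NumPy sketch:

```python
import numpy as np

def rmse(r_true, r_pred):
    # Root mean squared error over the N test ratings:
    #   RMSE = sqrt( (1/N) * sum_(u,i) (r_ui - r_hat_ui)^2 )
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    return float(np.sqrt(np.mean((r_true - r_pred) ** 2)))

print(rmse([4, 3, 5, 1], [3.5, 3.0, 4.0, 2.0]))  # 0.75
```

Squaring before averaging penalizes large rating errors more heavily than the mean absolute error would.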
Baselines. To evaluate the performance of our proposed hybrid collaborative filtering model, we compare it with five other recommendation algorithms:
PMF: Probabilistic Matrix Factorization [5] factorizes the user-item matrix into user and item factors. It assumes Gaussian observation noise and Gaussian priors on the latent factor vectors.
SVD: Singular Value Decomposition [40] takes the rating matrix as input and estimates two low-rank matrices. It also uses user and item biases to reduce the error.
Autorec: An autoencoder-based collaborative filtering model that reconstructs the sparse user or item vector. The entries of the reconstructed vector are taken as the predicted ratings [19].
CFN: This model adds side information to the input layer and to every hidden layer of the autoencoder network [21].
aSDAE: The Additional Stacked Denoising Autoencoder, which uses users' historical behaviors together with user and item side information [22].
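PMF and SVD both predict a rating from user and item latent factors through an inner product. This is the kind of linear prediction function that AutoFM replaces with FM. The sketch below assumes the common biased matrix-factorization form for SVD: the global mean plus user and item biases plus the inner product. The exact formulation in [40] may differ.

```python
import numpy as np

def svd_predict(mu, b_u, b_i, p_u, q_i):
    # Biased matrix factorization: global mean + user bias + item bias
    # + inner product of the user and item latent factors.
    return mu + b_u + b_i + p_u @ q_i

def pmf_predict(p_u, q_i):
    # PMF without biases: the rating is modeled by the inner product alone.
    return p_u @ q_i

p_u, q_i = np.array([1.0, 0.5]), np.array([0.4, 0.2])
r_svd = svd_predict(3.5, 0.2, -0.1, p_u, q_i)  # approx. 4.1
r_pmf = pmf_predict(p_u, q_i)                  # approx. 0.5
```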
Parameter Settings. In the experiments, we split the datasets into 80% for training and 20% for testing.
For PMF and SVD, we used grid search to find the best number of latent factors from {8, 16, 32, 64, 128, 256} and the best regularization parameter from {0.0001, 0.001, 0.01, 0.1, 1.0, 10}. For Autorec and CFN, we selected the dimension of the hidden layer from {16, 32, 64, 128, 256, 512} and the regularization parameter λ from {0.0001, 0.001, 0.01, 0.1, 1.0, 10}.
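The grid search above can be sketched as follows. `fit_and_score` is a hypothetical callback that trains PMF or SVD with a given number of latent factors `k` and regularization `lam`, then returns an RMSE. The random 80/20 split is likewise an assumption about how the split was drawn.

```python
import itertools
import numpy as np

def split_80_20(ratings, rng):
    # Randomly assign 80% of the (user, item, rating) rows to training.
    idx = rng.permutation(len(ratings))
    cut = int(0.8 * len(ratings))
    return ratings[idx[:cut]], ratings[idx[cut:]]

latent_factors = [8, 16, 32, 64, 128, 256]
reg_params = [0.0001, 0.001, 0.01, 0.1, 1.0, 10]
grid = list(itertools.product(latent_factors, reg_params))  # 36 candidates

def grid_search(train, valid, fit_and_score):
    # Score every (k, lam) pair and return the one with the lowest RMSE.
    scores = {(k, lam): fit_and_score(train, valid, k, lam) for k, lam in grid}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

Scoring the candidates on a slice held out from the training portion keeps the final test RMSE unbiased. Scoring them on the test split itself would not.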
For CFN, we select (α, β) from {(1.0, 0.75), (0.75, 0.5), (0.5, 0.25), (0.25, 0.1)}. For aSDAE, we set the parameters α, β and λ to 0.2, 0.8 and 0.01, respectively, and set the number of layers to 4 as in [22].
For AutoFM, we select the size of the user and item latent factors from {100, 200, 300, 500, 700, 1000}. The embedding size of the FM part is selected from {8, 16, 24, 32, 64, 128}. α, β and λ are each selected from {0.0001, 0.001, 0.01, 0.1, 1.0, 10}. The corruption ratio ν is selected from {0.0, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4}. The weights W of the network are initialized from a uniform distribution on [-0.1, 0.1], and the biases b are initialized to zero. The batch size is set to 128 and the learning rate to 0.0001. We use the ReLU(·) activation function for every layer of the network, including the output layer, and use Adam to minimize the loss function.
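The initialization and batching settings above can be sketched as follows. The layer shape is chosen only for illustration: 4132 inputs (one per Green-travel restaurant, as a user DAE would read) and 512 hidden units. The Adam update itself is left to a framework optimizer.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    # Weights ~ Uniform[-0.1, 0.1); biases start at zero.
    W = rng.uniform(-0.1, 0.1, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

def minibatches(n_rows, batch_size, rng):
    # Yield shuffled index batches; the last batch may be smaller.
    order = rng.permutation(n_rows)
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
W, b = init_layer(n_in=4132, n_out=512, rng=rng)
batches = list(minibatches(1000, 128, rng))  # 7 full batches + 1 of size 104
```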
Evaluation results
For each model, we conduct five experiments on each dataset. Table 1 shows the average RMSE of PMF, SVD, Autorec, CFN, aSDAE and our proposed hybrid model on the three datasets. We run three variants of AutoFM:
- AutoFM1: to verify the effectiveness of the hybrid model itself, we do not use side information.
- AutoFM2: to verify the effectiveness of side information, we add side information to AutoFM.
- AutoFM3: to verify the effectiveness of adding side information through FM, we replace the DAE of AutoFM2 with an SDAE as in [22].
AutoFM1 and AutoFM2 both use a three-layer DAE consisting of an input layer, a hidden layer and an output layer. These three experiments answer three questions: whether the hybrid model is effective, whether side information helps, and whether adding side information through FM is effective.
RMSE of the baselines and AutoFM. AutoFM1 does not use side information; AutoFM2 does. AutoFM3 replaces the DAE with an SDAE.
PMF, SVD, Autorec and AutoFM1 do not use side information. Table 1 shows that AutoFM1 achieves a smaller RMSE than PMF, SVD and Autorec, which demonstrates that our hybrid model is effective. In addition, Autorec and AutoFM outperform PMF and SVD, which demonstrates the effectiveness of deep learning. CFN adds side information to Autorec; similarly, AutoFM2 adds side information to AutoFM1. Table 1 shows that CFN outperforms Autorec and AutoFM2 outperforms AutoFM1, which demonstrates the effectiveness of side information. AutoFM2 and AutoFM3 extend CFN and aSDAE, respectively, by adding side information with our FM-based method. Table 1 shows that AutoFM2 outperforms CFN and AutoFM3 outperforms aSDAE, which demonstrates that adding side information through FM is effective.
Figure 2 shows the RMSE of AutoFM on the three datasets for different corruption ratios. Fig. 2 shows that the sparser the dataset, the smaller the corruption ratio required to achieve the best RMSE.

RMSE of AutoFM for different corruption ratio on three datasets.
We also analyze the number of parameters and the time cost per epoch of all deep learning methods on the Green-travel dataset. The results are shown in Table 2. The Green-travel dataset contains more than 1.5 million ratings from 9276 users on 4132 restaurants. The user and item side-information vectors have lengths 186 and 2746, respectively. For item Autorec and item CFN, we set the number of hidden units to 512. For AutoFM1 and AutoFM2, we set the hidden layer size to 512 in both the user DAE and the item DAE. For aSDAE and AutoFM3, we set the hidden layers of the user DAE and the item DAE to 512, 64, 512. For AutoFM, we set the embedding size of the FM part to 8. Because Autorec does not use side information, it has the fewest parameters.
Number of parameters (×10⁶) and training time (s) for each epoch of the deep models considered for the Green-travel dataset
CFN and aSDAE use fully connected neural networks to handle side information, so they need more parameters than the other models. AutoFM uses FM to handle side information, so it needs only a small number of extra parameters compared with Autorec. Table 1 shows that AutoFM2 and aSDAE achieve similar RMSE, even though AutoFM2 uses a 3-layer DAE and aSDAE uses a 5-layer SDAE. When the 5-layer SDAE is used in AutoFM3, it achieves a smaller RMSE than aSDAE. Moreover, AutoFM needs only about 1/8 as many parameters as aSDAE. These results further demonstrate the effectiveness of our hybrid model.
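As a rough check on the parameter discussion, the following sketch counts parameters under the Green-travel settings given above: 512 hidden units, FM embedding size 8, and side vectors of length 186 and 2746. It assumes single-hidden-layer DAEs with untied encoder and decoder weights. It also assumes an FM with a global bias, one linear weight and one k-dimensional embedding per feature. It is a back-of-the-envelope estimate, not a reproduction of Table 2.

```python
def dae_params(n_in, n_hidden):
    # Encoder (n_in * n_hidden weights + n_hidden biases)
    # + decoder (n_hidden * n_in weights + n_in biases); no weight tying.
    return 2 * n_in * n_hidden + n_hidden + n_in

def fm_params(n_features, k):
    # Global bias + one linear weight and one k-dim embedding per feature.
    return 1 + n_features + n_features * k

n_users, n_items = 9276, 4132      # Green-travel
user_side, item_side = 186, 2746   # side-information vector lengths
hidden, k = 512, 8

# U-DAE reads item-length user rows; I-DAE reads user-length item columns.
dae_total = dae_params(n_items, hidden) + dae_params(n_users, hidden)
fm_without_side = fm_params(2 * hidden, k)
fm_with_side = fm_params(2 * hidden + user_side + item_side, k)
side_cost_fm = fm_with_side - fm_without_side  # extra FM parameters for side info
side_cost_dense = item_side * hidden  # one 512-unit dense layer on item side info
```

Under these assumptions, the side information adds (186 + 2746) × (1 + 8) = 26,388 FM parameters. By contrast, a single 512-unit fully connected layer reading the 2746-dimensional item side vector already needs 1,405,952 weights.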
Moreover, Table 2 shows that the more parameters a model has, the more time it needs per epoch. AutoFM takes much more time than Autorec because of its additional FM part.
In this paper, we propose AutoFM, a hybrid collaborative recommendation model that integrates a denoising autoencoder (DAE) and factorization machines (FM). AutoFM can effectively extract latent vectors and learn second-order feature interactions between user and item latent factors. Moreover, we add side information of users and items to the FM part to improve recommendation performance. The autoencoder part maps the sparse vectors of users and items to latent factors. The FM part first concatenates the latent factors and the side-information vectors into one feature vector, which is then used as the input of FM to predict user ratings on items. AutoFM thus integrates collaborative filtering with side information while considering second-order feature interactions between the latent factors of users and items. Experimental results on three real-world datasets demonstrate that our hybrid model outperforms a number of other models.
The FM part learns second-order feature interactions; in future work, it can be extended to higher-order feature interactions between latent factors. In addition, user ratings on items are a kind of explicit feedback. However, explicit ratings are not always available for every user. Implicit feedback, e.g. users' viewing history, is easier to collect and richer than explicit ratings. In future work, both explicit and implicit feedback can be incorporated into the model to achieve better recommendation performance.
Acknowledgment
This work was supported in part by the National Key R&D Program of China under Grant 2018YFC0831502.
