Collaborative social deep learning for celebrity recommendation

Abstract

How to recommend celebrities to the public has recently become an interesting problem in real social network applications. In this problem, the matrix of users’ following actions is usually very sparse, which causes conventional collaborative filtering based methods to degrade significantly in recommendation performance. To address the sparsity problem, side information can be exploited. Collaborative social topic regression (CSTR) is an appealing recent method that combines the matrix of general users’ following actions, content information and a social network of celebrities. However, this method relies on the topic model latent Dirichlet allocation (LDA) as its critical component, so the learned content representation may not be compact and effective enough. Moreover, it ignores the social network of general users, which is also available and helpful for recommendation. In this paper, we employ a deep learning component to learn a more effective feature representation and incorporate the social network information of general users by adding social regularization terms. We propose a novel hierarchical Bayesian model named collaborative social deep learning (CSDL), which jointly performs deep learning on the content information and collaborative filtering on general users’ following actions, the social network of celebrities and that of general users. Experiments on two real-world datasets show the effectiveness of our proposed model.

Keywords

Celebrity recommendation, deep learning, social network, text mining

1. Introduction

Nowadays, social network websites such as Twitter and Tencent MicroBlog have become popular places for users to connect with friends and share opinions online. Some special users, such as celebrities, famous organizations, and well-known groups, spread the newest ideas and are followed by a large number of other users. More and more users regard social network websites as an appealing medium for obtaining up-to-date information from authorities and elites [11]. The public has become more interested in celebrities and famous organizations. Celebrities and organizations, in turn, have realized the enthusiasm of common people for the newest ideas and opinions that interest them, and have signed up for their own accounts one after another to propagate their opinions [5].

In social network websites, celebrities are a small fraction of all users, but their absolute number is still large; they can flood users with a huge amount of information and hence put them at risk of information overload. Therefore, how to recommend celebrities (items) whom general users are truly interested in following becomes an interesting problem [42]. In this problem, we can only observe users’ following actions from microblogs, and the absence of a following action does not mean that a user is not interested. The following actions can be seen as positive samples. This class of collaborative filtering with only positive samples is called one-class collaborative filtering (OCCF) [21]. Therefore, conventional collaborative filtering (CF), which automatically predicts the interest of a particular user based on the collective behavior records of similar users or items [10, 29], cannot be directly applied to this celebrity recommendation problem. What is more, traditional CF-based models suffer from the sparsity and imbalance of rating data, especially for new and infrequent users. To deal with these major weaknesses of traditional CF-based recommendation systems, many models have been proposed to exploit side information, such as items’ content information [1, 31] and users’ social network [9, 17]. For instance, collaborative topic regression (CTR) [31] is a state-of-the-art model that naturally incorporates content information, via latent Dirichlet allocation (LDA) [2], into a collaborative filtering framework. Trustwalker [9] is a social network-based CF model that finds the user’s like-minded neighbors to address the rating sparsity limitation. But very few attempts have been made to utilize both content and social network information [22].

CSTR [5] is a recently proposed hybrid method. It is a probabilistic graphical model that seamlessly integrates a topic model with probabilistic matrix factorization (PMF) [25], which factorizes following records and social information. To the best of our knowledge, it is the first model to explore whether celebrities’ social network improves the performance and efficiency of recommending celebrities. However, this model learns the features of side information associated with items by LDA, and is therefore not effective enough at learning the latent representation, especially when the side information is very sparse [33]. Fortunately, deep learning, as a family of feature-learning methods, is a promising alternative. It models multiple levels of representation of the raw input by composing simple but non-linear modules, each of which transforms the representation at one level into a slightly more abstract representation at a higher level [12]. Deep learning has attracted a lot of attention because of its promising performance in learning representations on various tasks [7, 20]. Recently, collaborative deep learning (CDL) [33] was proposed, which tightly couples deep learning and matrix factorization to improve recommendation performance and efficiency.

Meanwhile, CSTR only considers the social network of celebrities. Since the social network information of general users also exists and is available, it is desirable to incorporate this kind of side information into celebrity recommendation models. However, the social network of general users and that of celebrities differ greatly [5]: (i) the social network of general users is huge and sparse, but that of celebrities is small and dense; (ii) to a certain extent, celebrities’ social network can be regarded as relationships among different kinds of interests, whereas the social network of general users may mainly indicate friendship; (iii) celebrities are followed by plenty of other users in their social network, but general users are not.

In this paper, we propose a hierarchical Bayesian model called collaborative social deep learning (CSDL) for celebrity recommendation. It incorporates deep learning for content information, matrix factorization for both celebrities’ social network and users’ interest matrix, and social regularization terms for the social network of general users. We connect three data sources, namely content, the social network of celebrities and general users’ interests, through a shared item latent feature space. The matrix factorization of celebrities’ social network and general users’ interests learns the low-rank item latent feature space, while a deep learning method named stacked denoising autoencoders (SDAE) provides a content representation of the items in that space, in order to achieve better recommendation performance. As for the social network of general users, we first adopt Jaccard’s coefficient, a feature intrinsic to the network topology, to compute node similarity [18]. Then, CSDL incorporates it by adding social regularization terms that constrain the taste difference between a general user and others. Note that, although our method employs SDAE for feature representation, as a generic framework it can also admit other deep learning methods, such as convolutional neural networks and recurrent neural networks.

The main contributions of this study in celebrity recommendation in social websites can be summarized as follows.

(i)
We propose a hierarchical Bayesian framework, named CSDL, which combines deep feature learning of item content, celebrities’ social network, general users’ social network and general users’ preferences to improve the performance and efficiency of celebrity recommendation.
(ii)
We introduce the social network of general users by adopting Jaccard’s coefficient to compute a user similarity value, which can be utilized to constrain the taste difference between a general user and others.
(iii)
Even if a new user has followed only one celebrity, CSDL can alleviate the data sparsity problem in CF by using the social network of celebrities and social information of general users, which consequently improves the recommendation performance. Even when the content information is very sparse, CSDL can achieve good performance by employing a deep learning component to learn a more effective feature representation. And by integrating multiple information sources into one model, CSDL can also improve the recommendation accuracy even when the social network matrix of celebrities is sparse.
(iv)
We conduct experiments on two real-world datasets to evaluate the effectiveness of CSDL. Experimental results show CSDL outperforms other state-of-the-art methods in terms of Recall and Average Precision under different sparsity level settings.

The rest of the paper is organized as follows. We review the related work in Section 2 and then explain the details of our method in Section 3. Experimental settings and results are provided in Section 4. Finally, Section 5 concludes the paper and discusses future work.

2. Related work

In general, our work is closely related to the following topics: latent factor models, latent factor models associated with social network, and latent factor models based on deep learning.

2.1 Latent factor models

Traditional recommender systems are mostly based on collaborative filtering, for which there are two widely used approaches. In neighborhood methods [10, 28], the similarity between users, computed from the content they have consumed and rated, is the basis of a new recommendation. The most effective recommendation methods are latent factor models [24, 25], which provide better recommendation results than the neighborhood methods [31]. Among latent factor methods, matrix factorization (MF) works well [31]. It discovers the latent factors of user-item interactions by factorizing the feedback matrix into a joint latent space of user and item features. Many MF-based methods have been proposed, and existing works usually place some assumptions on the latent factors. Probabilistic matrix factorization [24, 25] assumes that both the prior distribution of the latent factors and the distribution of the observed ratings given the latent factors are Gaussian. MF-based models such as non-negative matrix factorization (NMF [13]) and maximum margin matrix factorization (MMMF [28]) are usually used to reduce the dimensionality of the user-item matrix and smooth out noise. All of them help algorithm scalability, but cannot be directly applied to the case in which ratings have only two states: observed and not observed. To handle this problem, Hu et al. [8] introduce different confidence levels for observed and unobserved items. In our model, we adopt PMF for both the social network of celebrities and general users’ following records, and we regard missing following actions as unobserved data, modeling them as negative samples with low confidence.

2.2 Latent factor models associated with social network

Recently, some works have studied how social networks can help improve recommendation performance. The works in [22, 38] explore the social network of common users to better understand common users’ interests. Social matrix factorization [17, 22] incorporates social relationships into MF to improve recommendation performance. Shen and Jin [27] study varieties of users’ social relationships. Cheng et al. [4] utilize users’ social relationships to mine users’ preferences on locations and recommend favorite locations. Guo et al. [6] study social trust information and show that both the explicit and the implicit influence of trust should be taken into consideration in a recommendation model. The work in [37] considers trust relationships to reflect more accurately users’ reciprocal influence on the formation of their own opinions for high-quality recommendations. The works most relevant to ours are [3, 5]. They combine LDA with social matrix factorization to obtain powerful learned features. The difference between these two works is that the topic model in [3] contributes to both the user interest matrix and celebrities’ social network, while the topic model in [5] only plays a part in user interest matrix factorization. However, Ding et al. [5] only consider the social network of celebrities. In our model, besides the social network of celebrities, we also focus on how the social network of general users affects the task of celebrity recommendation.

2.3 Latent factor models based on deep learning

The idea of combining deep learning models with collaborative filtering has been proposed only in recent years. To the best of our knowledge, Salakhutdinov et al. [26] are the first to apply deep learning to the task of collaborative filtering. They modify restricted Boltzmann machines for this task and achieve good performance. Recently, some deep learning models have been proposed to learn latent factors from content information, such as raw features of audio and text [7, 20]. X. Wang and Y. Wang [36] utilize deep belief nets (DBN) for music recommendation, and unify feature extraction and recommendation of songs in a joint framework. This framework automatically learns the feature vectors of songs using a deep belief network, which is a generative probabilistic graphical model. In contrast, Oord et al. [19] address the music recommendation problem using convolutional neural networks (CNN) rather than DBN. They first conduct a weighted matrix factorization to obtain latent factors for all songs, and then use deep learning to map audio content to those latent factors. Most recently, Wang et al. [33] propose a hierarchical Bayesian model (CDL) which tightly couples SDAE and MF. To the best of our knowledge, CDL is the first hierarchical Bayesian model to bridge the gap between state-of-the-art deep learning models and recommender systems. This work is close to ours but differs in that CDL does not incorporate social network information, which is very important in celebrity recommendation as discussed before. In our model, a deep learning component is adopted to learn latent features from content information, and these features are then combined with a latent factor model.

Figure 1.

The graphical model for the PMF model.

3. The proposed model

In this section, we first discuss probabilistic matrix factorization. Then, we review stacked denoising autoencoders. After that, we present CSDL in detail.

3.1 Probabilistic matrix factorization

In our model, we conduct PMF. Similar to previous work [24, 25], we place Gaussian priors on the latent factors. Note that we employ PMF to factorize the social network of celebrities and general users’ following records simultaneously, and they share the item latent feature space. The graphical model is shown in Fig. 1. Let ${u_{i}},{v_{j}},{s_{m}}\in{\mathbb{R}^{K}}$ be user $i$’s factor vector, celebrity $j$’s factor vector and the extra social latent factor for celebrity $m$ respectively, where $K$ is the dimensionality of the latent space. We can get the corresponding representation in a matrix form, i.e. $U=({u_{i}})_{i=1}^{I},$ $V=({v_{j}})_{j=1}^{J},$ and $S=({s_{m}})_{m=1}^{J},$ where $I$ is the number of users and $J$ is the number of celebrities. Let $R$ be the user-celebrity following matrix, with ${r_{ij}}=1$ if user $i$ follows celebrity $j$ and ${r_{ij}}=0$ otherwise. The social network between celebrities is a directed graph, represented as an adjacency matrix $Q,$ where ${q_{mj}}=1$ when there is a following action from $m$ to $j$ and ${q_{mj}}=0$ otherwise.

We assume the following generative process:

1.
For each user $i,$ draw a user latent vector

$\displaystyle{u_{i}}\sim N(0,\lambda_{u}^{-1}{I_{K}}).$
2.
For each item $j,$ draw an item latent vector

$\displaystyle{v_{j}}\sim N(0,\lambda_{v}^{-1}{I_{K}}).$
3.
For each celebrity $m,$ draw an extra social latent vector

$\displaystyle{s_{m}}\sim N(0,\lambda_{s}^{-1}{I_{K}}).$
4.
For each user-item pair $(i,j),$ draw a following action

$\displaystyle{r_{ij}}\sim N(u_{i}^{T}{v_{j}},c_{ij}^{-1}),$

where ${c_{ij}}$ is the confidence parameter for ${r_{ij}}.$
5.
For each celebrity’s social network pair $(m,j),$ draw a social action ${q_{mj}}\sim N(s_{m}^{T}{v_{j}},d_{mj}^{-1}),$ where ${d_{mj}}$ is the confidence parameter for ${q_{mj}}.$

Note that, ${I_{K}}$ is a $K$ dimensional identity matrix.
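The generative process above can be simulated directly. The following is a minimal NumPy sketch, not the authors' implementation: the function name `sample_pmf` is hypothetical, latent vectors are stored as columns, and a single precision value stands in for each of ${c_{ij}}$ and ${d_{mj}}$ (in the paper these depend on the observed entries, so this is a simplifying assumption).

```python
import numpy as np

def sample_pmf(I, J, K, lam_u, lam_v, lam_s, c=1.0, d=1.0, seed=0):
    """Draw U, V, S, R, Q from the Gaussian generative process.

    Assumption: one scalar precision c (resp. d) replaces the
    entry-dependent confidences c_ij (resp. d_mj).
    """
    rng = np.random.default_rng(seed)
    # Priors N(0, lambda^{-1} I_K): standard deviation is lambda^{-1/2}.
    U = rng.normal(0.0, lam_u ** -0.5, size=(K, I))  # column i is u_i
    V = rng.normal(0.0, lam_v ** -0.5, size=(K, J))  # column j is v_j
    S = rng.normal(0.0, lam_s ** -0.5, size=(K, J))  # column m is s_m
    # Observations are Gaussian around the inner products.
    R = rng.normal(U.T @ V, c ** -0.5)  # I x J, mean u_i^T v_j
    Q = rng.normal(S.T @ V, d ** -0.5)  # J x J, mean s_m^T v_j
    return U, V, S, R, Q
```

Because $V$ appears in the means of both $R$ and $Q$, the two observation matrices are coupled through the shared item latent space, which is the property the model relies on.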

In many practical recommendation scenarios, users rarely give explicit feedback, while implicit feedback is very common (e.g. clicking and browsing history). This class of collaborative filtering with only positive samples is called OCCF [21]. In that work, the authors introduce different confidence parameters ${c_{ij}}$ for different ratings ${r_{ij}}.$ We use the same strategy and give ${c_{ij}}$ a higher value when ${r_{ij}}=1$ than when ${r_{ij}}=0.$ The larger ${c_{ij}}$ is, the more we trust ${r_{ij}}.$ An entry ${r_{ij}}=0$ can be interpreted in two ways: user $i$ is either not interested in item $j$ or unaware of it. ${c_{ij}}$ is defined as follows,

$\displaystyle{c_{ij}}=\left\{\begin{array}[]{l}a,{\rm{\quad}}{r_{ij}}=1,\\ b,{\rm{\quad}}{r_{ij}}=0,\end{array}\right.$

where $a$ and $b$ are tuning parameters satisfying $a>b>0.$ We distinguish three cases for ${d_{mj}}$ : (i) ${q_{mj}}=1,$ in which case we are very confident that celebrity $m$ is interested in $j;$ (ii) ${q_{mj}}=0$ but ${q_{jm}}=1,$ in which case we assign moderate confidence; (iii) both ${q_{mj}}$ and ${q_{jm}}$ are 0, in which case we assign low confidence. So, we get the definition of ${d_{mj}}$ as follows,

$\displaystyle{d_{mj}}=\left\{\begin{array}[]{l}e,{\rm{\quad}}{q_{mj}}=1,\\ f,{\rm{\quad}}{q_{mj}}=0{\quad\rm{and}\quad}{q_{jm}}=1,\\ g,{\rm{\quad}}{q_{mj}}=0{\quad\rm{and}\quad}{q_{jm}}=0,\\ \end{array}\right.$

where $e,$ $f$ and $g$ are tuning parameters satisfying $e>f>g>0.$ As Section 4.3 shows, we use five-fold cross validation on the training set with grid search to decide the confidence parameters. These confidence parameters let the model weight samples according to how much they are trusted. Feedback carries different confidence in different domains, so the selected values may differ across datasets. The maximum a posteriori (MAP) estimate of PMF is then obtained by maximizing the joint log-likelihood, which amounts to solving the following problem,

$\displaystyle\mathop{\max}\limits_{U,V,S}-\frac{1}{2}\sum\nolimits_{i,j}{{c_{ij}}{{({r_{ij}}-u_{i}^{T}{v_{j}})}^{2}}}-\frac{1}{2}\sum\nolimits_{m,j}{{d_{mj}}{{({q_{mj}}-s_{m}^{T}{v_{j}})}^{2}}}-\frac{{\lambda_{u}}}{2}\sum\nolimits_{i}{\left\|{{u_{i}}}\right\|_{2}^{2}}-\frac{{\lambda_{v}}}{2}\sum\nolimits_{j}{\left\|{{v_{j}}}\right\|_{2}^{2}}-\frac{{\lambda_{s}}}{2}\sum\nolimits_{m}{\left\|{{s_{m}}}\right\|_{2}^{2}},$

where ${\lambda_{u}},$ ${\lambda_{v}}$ and ${\lambda_{s}}$ are hyper-parameters.
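The confidence definitions and the PMF objective can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function names and the default values of $a,$ $b,$ $e,$ $f$ and $g$ are hypothetical (the paper selects them by grid search), latent vectors are stored as columns, and the conventional factor $1/2$ of a Gaussian log-density is used, which does not change the maximizer.

```python
import numpy as np

def confidence_c(R, a=1.0, b=0.01):
    """c_ij = a where r_ij = 1 and b elsewhere (a > b > 0)."""
    return np.where(R == 1, a, b)

def confidence_d(Q, e=1.0, f=0.1, g=0.01):
    """d_mj = e if q_mj = 1; f if q_mj = 0 and q_jm = 1; g otherwise."""
    D = np.full(Q.shape, g, dtype=float)
    D[(Q == 0) & (Q.T == 1)] = f  # only the reverse edge exists
    D[Q == 1] = e                 # the edge m -> j exists
    return D

def pmf_objective(U, V, S, R, Q, C, D, lam_u, lam_v, lam_s):
    """Weighted squared-error fit plus L2 penalties (to be maximized).

    U is K x I; V and S are K x J; R, C are I x J; Q, D are J x J.
    """
    fit_r = -0.5 * np.sum(C * (R - U.T @ V) ** 2)
    fit_q = -0.5 * np.sum(D * (Q - S.T @ V) ** 2)
    reg = -0.5 * (lam_u * np.sum(U ** 2)
                  + lam_v * np.sum(V ** 2)
                  + lam_s * np.sum(S ** 2))
    return fit_r + fit_q + reg
```

Keeping the confidences as dense arrays is only practical for small examples; with thousands of celebrities a sparse representation of the entries that differ from the low-confidence default would be preferable.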

Figure 2.
A 1-layer SDAE with $L=2.$

3.2 Stacked denoising autoencoders

To recover the clean input, SDAE [30] learns a compressed representation from randomly corrupted input through a feedforward neural network. We define a $J$-by-$Z$ matrix ${X_{c}}$ as the clean input to the SDAE, while the noise-corrupted matrix is denoted by ${X_{0}},$ where $Z$ is the size of the vocabulary. The description of celebrity $j$ is therefore represented by the bag-of-words vector ${X_{c,j*}}.$ The output of layer $l$ of the SDAE is denoted by ${X_{l}},$ which is a matrix containing $J$ rows and ${K_{l}}$ columns. Analogous to $X_{c},$ row $j$ of ${X_{l}}$ is denoted by ${X_{l,j*}}.$ $L$ is the number of SDAE layers, ${W_{l}}$ is the weight matrix of layer $l$ and ${b_{l}}$ is the bias vector of layer $l.$ For simplicity, we define ${W^{+}}$ as the set of weight matrices and biases of all layers. Figure 2 is an example of SDAE with $L=2.$ An SDAE solves the following optimization problem,

$\displaystyle\mathop{\min}\limits_{{W^{+}}}\left\|{{X_{c}}-{X_{L}}}\right\|_{F}^{2}+{\lambda_{w}}\sum\limits_{l}{(\left\|{{W_{l}}}\right\|_{F}^{2}+\left\|{{b_{l}}}\right\|_{2}^{2})},$

where ${\lambda_{w}}$ is the regularization hyper-parameter.
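As an illustration of the optimization problem above, the sketch below evaluates the SDAE loss with sigmoid layers, as in the probabilistic formulation that follows. The helper names are hypothetical, and masking noise (zeroing random entries) is assumed for the corruption step, since the text does not specify the noise type.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def corrupt(Xc, rate, rng):
    """Assumed masking noise: zero each entry with probability `rate`."""
    keep = rng.random(Xc.shape) >= rate
    return Xc * keep

def sdae_forward(X0, weights, biases):
    """Return [X_0, X_1, ..., X_L]; X_l = sigmoid(X_{l-1} W_l + b_l)."""
    outputs = [X0]
    for W, b in zip(weights, biases):
        outputs.append(sigmoid(outputs[-1] @ W + b))
    return outputs

def sdae_loss(Xc, X0, weights, biases, lam_w):
    """||X_c - X_L||_F^2 + lam_w * sum_l (||W_l||_F^2 + ||b_l||_2^2)."""
    XL = sdae_forward(X0, weights, biases)[-1]
    recon = np.sum((Xc - XL) ** 2)
    reg = lam_w * sum(np.sum(W ** 2) + np.sum(b ** 2)
                      for W, b in zip(weights, biases))
    return recon + reg
```

With $L=2,$ `weights` holds a $Z\times K_{1}$ encoder matrix and a $K_{1}\times Z$ decoder matrix, and the middle output `sdae_forward(...)[1]` is the compressed representation ${X_{L/2}}.$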

Assuming that the corrupted input ${X_{0}}$ and the clean input ${X_{c}}$ are observed variables, SDAE can be generalized as a probabilistic model [33]. Then, we can get the following generative process:

1.
For each layer $l$ of the SDAE network,

(a)
For each column $n$ of the weight matrix ${W_{l}},$ draw

$\displaystyle{W_{l,n}}\sim N(0,\lambda_{w}^{-1}{I_{{K_{l}}}}).$
(b)
Draw the bias vector ${b_{l}}\sim N(0,\lambda_{w}^{-1}{I_{{K_{l}}}}).$
(c)
For each row $j$ of ${X_{l}},$ draw

$\displaystyle{X_{l,j}}\sim N(\sigma({X_{l-1,j}}{W_{l}}+{b_{l}}),\lambda_{h}^% {-1}{I_{{K_{l}}}}).$

2.
For each item $j,$ draw a clean input,

$\displaystyle{X_{c,j}}\sim N({X_{L,j}},\lambda_{n}^{-1}{I_{J}}).$

where ${\lambda_{w}},$ ${\lambda_{h}}$ and ${\lambda_{n}}$ are hyper-parameters and $\sigma(\cdot)$ is the sigmoid function. If ${\lambda_{h}}$ goes to infinity, i.e., ${X_{l,j}}=\sigma({X_{l-1,j}}{W_{l}}+{b_{l}}),$ the model degenerates to the original SDAE.

Figure 3.
(a) shows the graphical model of CSDL. The part inside the dashed rectangle shows an example of SDAE with $L=2.$ (b) shows the graphical model of the degenerated CSDL, called CSDL–. For simplicity, we use ${X_{0}},$ ${X_{L/2}}$ and ${X_{L}}$ to replace $X_{0,j}^{T},$ $X_{L/2,j}^{T}$ and $X_{L,j}^{T}$ respectively. Note that the shaded nodes denote observed variables.

Figure 4.
A simple social network. ${u_{1}},$ ${u_{2}}$ and ${u_{3}}$ represent users, ${v_{1}}$ and ${v_{2}}$ represent items. The directed edges indicate following actions.

3.3 Collaborative social deep learning

Figure 3(a) shows the graphical model of CSDL. Next, we describe in detail a general framework that integrates matrix factorization, social regularization and deep feature learning. The generative process of CSDL is as follows:

1.
For each layer $l$ of the SDAE network,

(a)
For each column $n$ of $W_{l}^{+},$ which collects the weight matrix and bias vector of layer $l,$ draw $W_{l,n}^{+}\sim N(0,\lambda_{w}^{-1}{I_{{K_{l}}}}).$
(b)
For each row $j$ of ${X_{l}},$ draw

$\displaystyle{X_{l,j}}\sim N(\sigma({X_{l-1,j}}{W_{l}}+{b_{l}}),\lambda_{h}^% {-1}{I_{{K_{l}}}}).$

2.
For each item (celebrity) $j,$

(a)
Draw a clean input ${X_{c,j}}\sim N({X_{L,j}},\lambda_{n}^{-1}{I_{J}}).$
(b)
Draw a latent item offset vector ${\tau_{j}}\sim N(0,\lambda_{v}^{-1}{I_{K}}),$ and then set the latent item vector to be:

$\displaystyle{v_{j}}={\tau_{j}}+X_{\frac{L}{2},j}^{T}.$

3.
For each celebrity $m,$ draw a social latent factor

$\displaystyle{s_{m}}\sim N(0,\lambda_{s}^{-1}{I_{K}}).$
4.
For all users, draw latent user vector set, $U\sim p(U),$ where

$\displaystyle p(U)\propto\prod_{i=1}^{I}N({u_{i}}|0,\lambda_{u}^{-1}{I_{K}})\prod_{i=1}^{I}\prod_{t=1,t\neq i}^{I}N({u_{i}}|{u_{t}},\lambda_{f}^{-1}Y_{it}^{-1}{I_{K}}).$ (1)
5.
For each user-item pair, draw a following action

$\displaystyle{r_{ij}}\sim N(u_{i}^{T}{v_{j}},c_{ij}^{-1}).$
6.
For each social network pair, draw a relationship

$\displaystyle{q_{mj}}\sim N(s_{m}^{T}{v_{j}},d_{mj}^{-1}).$

Note that the middle layer ${X_{L/2}}$ serves as a link among description information, general users’ following action records and the social network of celebrities. This middle layer is the hinge that enables CSDL to exploit these three different data sources. In the extreme case ${\tau_{j}}=0,$ general users’ following action records and the social network of celebrities do not affect the generation of the latent factor of items.

In order to further improve recommendation accuracy, our model also considers the social information of general users. The social network of general users differs from that of celebrities as discussed before, so we adopt a different strategy to handle it. Figure 4 shows a simple following topology in social network websites. Each following action generates a tie. For example, there exists a tie when user ${u_{1}}$ follows celebrity ${v_{1}}.$ Therefore, ${u_{1}}$ has a total of two ties. A microblogging-style follow network is usually built on shared interests. The stronger the tie between two users, the more similar their interests are likely to be [35]. The study of Onnela et al. [18] provides empirical confirmation of the following intuitions: (i) tie strength is partly determined by the local network structure; (ii) the stronger the tie between two users, the more common friends they have.

We adopt Jaccard’s coefficient, a simple measure that effectively captures common neighborhood, to compute nodes’ similarity values. Then, we incorporate the social network of general users by assigning a different prior to each user, based on the similarity values between that user and his or her friends, as Eq. (1) shows. We denote the similarity matrix between general users by $Y\in{\mathbb{R}^{I\times I}}.$ We let ${F_{i}}$ be the set of users (celebrities or general users) followed by the $i{\rm{th}}$ user, and ${F_{t}}$ be the corresponding set for the $t{\rm{th}}$ user. Similar to [41], we have:

$\displaystyle{Y_{it}}=\frac{{\left|{{F_{i}}\cap{F_{t}}}\right|}}{{\left|{{F_{i}}\cup{F_{t}}}\right|}},$ (2)

where $\left|{{F_{i}}\cap{F_{t}}}\right|$ is the number of following actions the two general users have in common, and $\left|{{F_{i}}\cup{F_{t}}}\right|$ is the total number of distinct following actions of the two users. By construction, every similarity value in $Y$ lies in the range [0,1]. This definition has a natural probabilistic interpretation: given two arbitrary users $u$ and $v,$ their Jaccard’s coefficient equals the probability that a tie chosen uniformly at random from the union of their ties is shared by both users [15, 35].
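Equation (2) can be computed directly from the sets of followed accounts. The sketch below is a plain-Python illustration with hypothetical names; it sets the diagonal of $Y$ to zero because Eq. (1) only involves pairs with $t\neq i,$ and it returns zero when both users follow nobody, a convention assumed here to avoid division by zero.

```python
def jaccard(a, b):
    """Jaccard's coefficient |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_matrix(following):
    """Build Y from a list whose i-th entry is the set F_i."""
    n = len(following)
    Y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for t in range(n):
            if i != t:  # Eq. (1) excludes t = i
                Y[i][t] = jaccard(following[i], following[t])
    return Y

# Toy data (hypothetical): three users and the accounts each one follows.
following = [{"v1", "v2"}, {"v1"}, {"v2", "u1"}]
Y = similarity_matrix(following)
```

The double loop costs $O(I^{2})$ set operations. For 10,000 users this is feasible but slow in pure Python, so a practical implementation would probably restrict the computation to pairs of users that share at least one followed account.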

Note that this work can be extended to other applications. For example, when recommending scientific articles to users in CiteULike, users’ reference libraries, the citation relations between articles and the co-author relations between users are available, and the proposed model can easily be adapted to this recommendation task.

3.4 Parameter learning and optimization

In our model, we develop an EM-style algorithm to learn the maximum a posteriori (MAP) estimates (because of our model’s Bayesian nature, fully Bayesian methods can also be applied [33]). We denote by $\Phi$ the set of hyper-parameters ${\lambda_{u}},{\lambda_{v}},{\lambda_{s}},{\lambda_{w}},{\lambda_{f}},{\lambda_{n}},{\lambda_{h}};$ with Bayesian inference, we get

$\displaystyle P(U,V,S,{X_{l}},{W^{+}}|R,Q,{X_{0}},{X_{c}},\Phi)\propto P(U|{\lambda_{u}},{\lambda_{f}})P(V|{\lambda_{v}},{X_{L/2}})P(S|{\lambda_{s}})P({W^{+}}|{\lambda_{w}})P(R|U,V)$
$\displaystyle\quad\times P(Q|S,V)P({X_{c}}|{X_{L}},{\lambda_{n}})P({X_{l}}|{X_{l-1}},W_{l}^{+},{\lambda_{h}}).$ (3)

Analogous to the generalized SDAE, to reduce computational complexity, we can also take ${\lambda_{h}}$ to infinity. Maximization of posterior probability is equivalent to maximizing the joint log-likelihood of $U,V,S,\{{X_{l}}\},{X_{c}},{W^{+}},R$ and $Q$ given $\Phi,$

$\displaystyle\ell=-\frac{{{\lambda_{u}}}}{2}\sum\limits_{i}{\left\|{{u_{i}}}\right\|_{2}^{2}}-\frac{{{\lambda_{s}}}}{2}\sum\limits_{m}{\left\|{{s_{m}}}\right\|_{2}^{2}}-\frac{{{\lambda_{w}}}}{2}\sum\limits_{l}{(\left\|{{W_{l}}}\right\|_{F}^{2}+\left\|{{b_{l}}}\right\|_{2}^{2})}-\frac{{{\lambda_{v}}}}{2}\sum\limits_{j}{\left\|{{v_{j}}-X_{L/2,j*}^{T}}\right\|_{2}^{2}}-\frac{{{\lambda_{n}}}}{2}\sum\limits_{j}{\left\|{{X_{L,j*}}-{X_{c,j*}}}\right\|_{2}^{2}}$
$\displaystyle\quad-\sum\limits_{i,j}{\frac{{{c_{ij}}}}{2}}{({r_{ij}}-u_{i}^{T}{v_{j}})^{2}}-\sum\limits_{m,j}{\frac{{{d_{mj}}}}{2}{{({q_{mj}}-s_{m}^{T}{v_{j}})}^{2}}}-\frac{{{\lambda_{f}}}}{2}\sum\limits_{i}{\sum\limits_{{u_{t}}\in U\setminus\{{u_{i}}\}}{{Y_{it}}\left\|{{u_{i}}-{u_{t}}}\right\|_{2}^{2}}}.$ (4)

As discussed in the previous section, ${X_{L/2}}$ serves as a bridge between deep learning and matrix factorization. There are two extreme cases that perform poorly in the experiments. In the first case, when ${\lambda_{v}}/{\lambda_{n}}$ goes to positive infinity, the reconstruction error term effectively disappears, which results in an ineffective learned feature ${X_{L/2}}.$ In contrast, the other extreme case happens when ${\lambda_{v}}/{\lambda_{n}}$ approaches zero. In this second case, CSDL degenerates into two isolated models, i.e., the latent representation learned by deep learning and the latent factors learned with matrix factorization no longer influence each other. Note that the social parameter ${\lambda_{f}}$ balances the effect of the social relations of general users. When ${\lambda_{f}}=0,$ CSDL collapses to the model called CSDL–, shown in Fig. 3b, in which the component for the social relations of general users essentially vanishes.
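To make the roles of the individual terms concrete, the following sketch evaluates the log-likelihood of Eq. (4) for given parameters, in the limit ${\lambda_{h}}\to\infty.$ It is an illustration under assumptions, not the authors' code: the function name and the dictionary `lam` of hyper-parameters are hypothetical, latent vectors are stored as columns, and the SDAE outputs `X_mid` $({X_{L/2}})$ and `X_L` are assumed to be precomputed.

```python
import numpy as np

def csdl_objective(U, V, S, R, Q, C, D, Y, X_mid, X_L, Xc,
                   weights, biases, lam):
    """Joint log-likelihood of Eq. (4), up to an additive constant.

    Shapes: U is K x I; V, S are K x J; R, C are I x J; Q, D are J x J;
    Y is I x I with zero diagonal; X_mid is J x K; X_L, Xc are J x Z.
    lam maps "u", "v", "s", "w", "n", "f" to the hyper-parameters.
    """
    reg_u = -0.5 * lam["u"] * np.sum(U ** 2)
    reg_s = -0.5 * lam["s"] * np.sum(S ** 2)
    reg_w = -0.5 * lam["w"] * sum(np.sum(W ** 2) + np.sum(b ** 2)
                                  for W, b in zip(weights, biases))
    # Couples item factors to the SDAE middle layer.
    tie_v = -0.5 * lam["v"] * np.sum((V - X_mid.T) ** 2)
    # Reconstruction error of the SDAE.
    recon = -0.5 * lam["n"] * np.sum((X_L - Xc) ** 2)
    fit_r = -0.5 * np.sum(C * (R - U.T @ V) ** 2)
    fit_q = -0.5 * np.sum(D * (Q - S.T @ V) ** 2)
    # Social regularizer: sum_i sum_t Y_it * ||u_i - u_t||^2.
    diff = U[:, :, None] - U[:, None, :]   # K x I x I
    sq_dist = np.sum(diff ** 2, axis=0)    # I x I
    social = -0.5 * lam["f"] * np.sum(Y * sq_dist)
    return (reg_u + reg_s + reg_w + tie_v + recon
            + fit_r + fit_q + social)
```

Setting `lam["f"] = 0` removes the last term, which corresponds to the CSDL– variant. The dense $K\times I\times I$ array is only meant for small examples.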

Similar to [5, 33], we optimize the objective function using coordinate ascent. Given current ${W^{+}},$ we set the gradients of $\ell$ with respect to ${u_{i}},$ ${v_{j}}$ and ${s_{m}}$ to zero. Then we obtain the following updating rules:

$\displaystyle{u_{i}}\leftarrow{({\lambda_{u}}{I_{K}}+V{C_{i}}{V^{T}}+{\lambda_{f}}Y_{i}^{T}{1_{I}}{I_{K}})^{-1}}(V{C_{i}}{R_{i}}+{\lambda_{f}}U{Y_{i}}),$ (5)
$\displaystyle{s_{m}}\leftarrow{({\lambda_{s}}{I_{K}}+V{D_{m}}{V^{T}})^{-1}}V{D_{m}}{Q_{m}},$ (6)
$\displaystyle{v_{j}}\leftarrow{({\lambda_{v}}{I_{K}}+U{C_{j}}{U^{T}}+S{D_{j}}{S^{T}})^{-1}}({\lambda_{v}}X_{L/2,j*}^{T}+U{C_{j}}{R_{j}}+S{D_{j}}{Q_{j}}),$ (7)

where ${C_{i}}$ is a diagonal matrix with ${c_{ij}}$ as its diagonal elements, and ${R_{i}}={({r_{i1}},{r_{i2}},\cdots,{r_{iJ}})^{T}}$ is a column vector containing all the following action records of user $i.$ Similarly, ${C_{j}}$ and ${R_{j}}$ are defined for celebrity $j.$ ${Q_{m}}={({q_{m1}},{q_{m2}},\cdots,{q_{mJ}})^{T}}$ and ${D_{m}}$ is a diagonal matrix with ${d_{mj}}$ as its diagonal elements. ${Q_{j}}$ and ${D_{j}}$ are defined in the same way for celebrity $j.$ ${Y_{i}}={({Y_{i1}},{Y_{i2}},\cdots,{Y_{iI}})^{T}}$ and ${1_{I}}$ is an $I$-dimensional column vector with all elements equal to 1.
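The three closed-form updates can be sketched as follows. This is a minimal NumPy illustration under assumptions (hypothetical function names; latent vectors stored as columns of $K\times I$ or $K\times J$ arrays; $Y$ with zero diagonal). The diagonal matrices ${C_{i}},$ ${D_{m}},$ ${C_{j}}$ and ${D_{j}}$ are never formed explicitly: multiplying the columns of a factor matrix by a vector of confidences has the same effect. In the user update, `Y[i].sum()` is the scalar sum of user $i$’s similarity values, and `U @ Y[i]` is the similarity-weighted sum of the other users’ vectors.

```python
import numpy as np

def update_user(i, U, V, R, C, Y, lam_u, lam_f):
    """Update of u_i, Eq. (5)."""
    K = U.shape[0]
    VC = V * C[i]  # equals V C_i, shape K x J
    A = (lam_u + lam_f * Y[i].sum()) * np.eye(K) + VC @ V.T
    rhs = VC @ R[i] + lam_f * (U @ Y[i])
    return np.linalg.solve(A, rhs)

def update_social(m, V, Q, D, lam_s):
    """Update of s_m, Eq. (6)."""
    K = V.shape[0]
    VD = V * D[m]  # equals V D_m
    A = lam_s * np.eye(K) + VD @ V.T
    return np.linalg.solve(A, VD @ Q[m])

def update_item(j, U, S, R, Q, C, D, x_mid_j, lam_v):
    """Update of v_j, Eq. (7); x_mid_j is row j of X_{L/2}."""
    K = U.shape[0]
    UC = U * C[:, j]  # equals U C_j
    SD = S * D[:, j]  # equals S D_j
    A = lam_v * np.eye(K) + UC @ U.T + SD @ S.T
    rhs = lam_v * x_mid_j + UC @ R[:, j] + SD @ Q[:, j]
    return np.linalg.solve(A, rhs)
```

Using `np.linalg.solve` instead of forming the inverse is a numerical choice made here; it gives the same result as the formulas but is usually more stable.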

Based on the current $U$ and $V,$ we use the back-propagation learning algorithm to learn the weight matrix ${W_{l}}$ and bias vector ${b_{l}}$ of each layer. The gradients of $\ell$ with respect to ${W_{l}}$ and ${b_{l}}$ are:

$\displaystyle{\nabla_{{W_{l}}}}\ell=-{\lambda_{w}}{W_{l}}-{\lambda_{v}}\sum\limits_{j}{{\nabla_{{W_{l}}}}X_{L/2,j*}^{T}(X_{L/2,j*}^{T}-{v_{j}})}-{\lambda_{n}}\sum\limits_{j}{{\nabla_{{W_{l}}}}{X_{L,j*}}({X_{L,j*}}-{X_{c,j*}})},$
$\displaystyle{\nabla_{{b_{l}}}}\ell=-{\lambda_{w}}{b_{l}}-{\lambda_{v}}\sum\limits_{j}{{\nabla_{{b_{l}}}}X_{L/2,j*}^{T}(X_{L/2,j*}^{T}-{v_{j}})}-{\lambda_{n}}\sum\limits_{j}{{\nabla_{{b_{l}}}}{X_{L,j*}}({X_{L,j*}}-{X_{c,j*}})}.$

The above steps are repeated until the model converges to a steady state. In order to learn a more robust feature representation, we adopt the dropout strategy. Some commonly used techniques can also be adopted to alleviate the local optimum issue; for instance, we add a momentum term in our model. Finally, we obtain the latent factors $U$ and $V$ to predict the potential following actions.

3.5 Prediction

After learning the optimal parameters ${U^{*}},$ ${V^{*}},$ ${S^{*}}$ and ${W^{{+^{*}}}},$ we use point estimates to calculate the predicted following actions, and then, we obtain the rules:

$\displaystyle r_{ij}^{*}\approx{(u_{i}^{*})^{T}}(X_{L/2,j*}^{{T^{*}}}+\tau_{j}^{*})={(u_{i}^{*})^{T}}v_{j}^{*}.$

Note that, when any new celebrity $j$ has no following action in the training data, its offset $\tau_{j}^{*}$ will be 0.
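The prediction rule amounts to an inner product followed by a ranking. The sketch below is illustrative: the function names are hypothetical, and excluding celebrities the user already follows from the ranked list is an assumption about how recommendations would be served, not something stated above.

```python
import numpy as np

def item_vector(x_mid_j, tau_j=None):
    """v_j = tau_j + X_{L/2,j}^T; the offset is zero for a celebrity
    with no following action in the training data."""
    return x_mid_j if tau_j is None else x_mid_j + tau_j

def recommend(i, U, V, R, top_n=10):
    """Rank celebrities for user i by the score u_i^T v_j.

    U is K x I, V is K x J, R is the I x J training matrix.
    """
    scores = U[:, i] @ V                      # length-J score vector
    scores = np.where(R[i] == 1, -np.inf, scores)  # assumed filter
    order = np.argsort(-scores)               # descending score
    return order[:top_n].tolist()
```

For example, with one latent dimension, `U = [[1.0]]`, `V = [[0.2, 0.9, 0.5]]` and a user who already follows the second celebrity, the call `recommend(0, U, V, R, top_n=2)` ranks the third celebrity ahead of the first.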

4. Experiments

In this section, we design several experiments on two real-world datasets, and compare performance between CSDL and three state-of-the-art algorithms. The experiments are designed to answer the following questions:

(i) To what degree does CSDL outperform the state-of-the-art methods, especially when the data is extremely sparse?
(ii) To what degree does the sparsity level of the celebrities’ social network affect recommendation performance?
(iii) How is the recommendation performance affected by the social parameter ${\lambda_{f}}$ and other parameters?

Table 1
Statistics of the user-item matrix

Statistics	User	Item
Min. number of general users’ following actions	6	0
Max. number of general users’ following actions	1009	3031
Avg. number of general users’ following actions	28.8	68.9

Table 2
Statistics of the matrix of celebrities’ social network

Statistics	Degrees
Min. number of degrees	0
Max. number of degrees	962
Avg. number of degrees	36.4

Figure 5.
Data sparsity of general users’ following actions.

Figure 6.
Data sparsity of celebrities’ social relations.

4.1 Datasets description

To illustrate the performance of CSDL, we use the same Tencent MicroBlog and Twitter datasets as in [5]. We sample a subset of 10,000 users from the whole set of users, excluding noisy users who follow fewer than 5 celebrities. To show the effectiveness of the social network of celebrities, we use the full social network of celebrities. The Tencent MicroBlog dataset has 4,183 celebrities, 10,000 users and 288,491 user following action records, so the sparsity is 0.68%; its celebrities’ social network has 152,284 edges, so the sparsity is 0.87%. The Twitter dataset has 900 celebrities, 10,000 users and 198,243 user following action records, so the sparsity is 2.2%; its celebrities’ social network has 129,510 edges, so the sparsity is 15.99%. The descriptions of celebrities contain the keywords extracted from the corresponding MicroBlog profiles of the celebrities. The vocabulary size $Z$ is 20,915 for Tencent MicroBlog and 5,468 for Twitter.
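The sparsity figures can be reproduced from the reported counts. The sketch below assumes that "sparsity" here means the percentage of observed (non-zero) entries in the corresponding matrix; the helper name `observed_percent` is hypothetical.

```python
def observed_percent(n_nonzero, n_rows, n_cols):
    """Percentage of non-zero entries in an n_rows x n_cols matrix."""
    return 100.0 * n_nonzero / (n_rows * n_cols)

# Tencent MicroBlog: 10,000 users x 4,183 celebrities, 288,491 records.
tencent_follow = observed_percent(288_491, 10_000, 4_183)
# Tencent MicroBlog celebrity network: 4,183 x 4,183, 152,284 edges.
tencent_social = observed_percent(152_284, 4_183, 4_183)
# Twitter: 10,000 users x 900 celebrities, 198,243 records.
twitter_follow = observed_percent(198_243, 10_000, 900)
# Twitter celebrity network: 900 x 900, 129,510 edges.
twitter_social = observed_percent(129_510, 900, 900)

for name, value in [("Tencent following", tencent_follow),
                    ("Tencent social", tencent_social),
                    ("Twitter following", twitter_follow),
                    ("Twitter social", twitter_social)]:
    print(f"{name}: {value:.4f}%")
```

Under this assumption, the first value lies between 0.68% and 0.69%, and the other three agree with the reported 0.87%, 2.2% and 15.99% after rounding.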

Next, we take the Tencent MicroBlog dataset as an example to analyze the statistical features of the data. The statistics of the following action matrix of general users and of the social network of celebrities are summarized in Tables 1 and 2, respectively. The data sparsity of general users’ following actions and of celebrities’ social relations is shown in Figs 5 and 6, respectively. In Fig. 5, the y-axis is the number of general users and the x-axis is the number of general users’ following actions. In Fig. 6, the y-axis is the number of celebrities and the x-axis is the number of directed edges.

4.2 Evaluation metrics

Following [33], we randomly choose $P$ items (celebrities) from each general user’s following records to form the training set and take all the rest as the testing set. To evaluate our proposed model under different sparsity levels, we set $P$ to 1 and 10 to obtain the sparse and dense settings, respectively. For each sparsity level, we repeat the experiment five times with different randomly selected training sets and report the average results.
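The splitting protocol can be sketched as follows. This is a minimal illustration on a toy binary matrix, not the code used in the experiments; the function name `split_per_user` and the toy data are assumptions.

```python
import numpy as np

def split_per_user(R, P, rng):
    """Split a binary following matrix R (users x celebrities).

    For each user, P followed celebrities are drawn at random for training;
    all remaining followed celebrities go to the test set.
    """
    train = np.zeros_like(R)
    test = np.zeros_like(R)
    for i in range(R.shape[0]):
        followed = np.flatnonzero(R[i])
        if followed.size == 0:
            continue  # nothing to split for this user
        chosen = rng.choice(followed, size=min(P, followed.size), replace=False)
        train[i, chosen] = 1
        test[i, np.setdiff1d(followed, chosen)] = 1
    return train, test

rng = np.random.default_rng(0)
R = (rng.random((5, 20)) < 0.5).astype(int)
R[:, :2] = 1  # ensure every toy user follows at least two celebrities

train_sparse, test_sparse = split_per_user(R, P=1, rng=rng)    # sparse setting
train_dense, test_dense = split_per_user(R, P=10, rng=rng)     # dense setting
```

Repeating the call with five different seeds and averaging the metrics would mirror the five repetitions described above.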

Following action records are a form of implicit feedback: a zero entry may mean either that the user is not interested in the item or that the user is not aware of its existence. Precision is therefore not a suitable performance measure. Recall and Average Precision (AP) are commonly used evaluation metrics in top-M recommendation with implicit feedback. Similar to [5], we employ Recall and AP to quantify the recommendation performance:

$\displaystyle{\text{Recall}}@{\text{M}}=\frac{{\text{following in top M list}}}{{\text{total following number}}},$

$\displaystyle{\text{AP}}@{\text{M}}=\frac{{\sum\nolimits_{m=1}^{M}{\text{precision}(m)\cdot\text{rel}(m)}}}{{\text{total following number}}},$

where rel(m) is a binary variable that equals 1 if the user follows the celebrity at rank m and 0 otherwise, and precision(m) is the precision of the top-m list. For both Recall and AP, the final result is the average over all users.
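The two metrics can be written down directly from the definitions above. The sketch below computes them for a single user; the function names and the toy ranking are hypothetical, and the per-user values would still have to be averaged over all users.

```python
def recall_at_m(ranked, relevant, M):
    """Fraction of the user's test followings that appear in the top-M list."""
    hits = sum(1 for c in ranked[:M] if c in relevant)
    return hits / len(relevant)

def ap_at_m(ranked, relevant, M):
    """Sum of precision(m) * rel(m) over the top-M list,
    divided by the total number of test followings."""
    hits = 0
    total = 0.0
    for m, c in enumerate(ranked[:M], start=1):
        if c in relevant:          # rel(m) = 1
            hits += 1
            total += hits / m      # precision(m) of the top-m list
    return total / len(relevant)

# Toy example: the user follows celebrities 7, 9 and 5 in the test set.
ranked = [3, 7, 1, 9, 4]
relevant = {7, 9, 5}
recall = recall_at_m(ranked, relevant, M=5)   # 2 hits out of 3 followings
ap = ap_at_m(ranked, relevant, M=5)           # (1/2 + 2/4) / 3
```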

4.3 Baselines and experimental settings

Our comparison includes three state-of-the-art hybrid recommendation algorithms and our proposed model, as follows:

CTR. Collaborative Topic Regression is a model combining topic modeling and collaborative filtering [31].

CSTR. Collaborative Social Topic Regression is a hybrid model performing latent Dirichlet allocation and matrix factorization, which introduces side information of celebrities’ social network [5].

CDL. Collaborative Deep Learning is a hierarchical Bayesian model, which tightly couples deep feature learning of content information and collaborative filtering [33].

CSDL. Collaborative Social Deep Learning is our proposed model, which can be seen as the extension of CDL.

In the experiments, we first use a validation set to find the optimal parameters for CTR, CDL and CSTR. When $P=1,$ we choose $0.25I$ general users’ following records from the test set to form the validation set, and we remove these records when testing the model. When $P=10,$ we use 80% of the following records in the training set to train the model and take the rest as the validation set. Using five-fold cross validation on the training set with grid search, we find that all models achieve good performance when $a=1,$ $b=0.1,$ $K=50.$ The tuning parameters $e,$ $f,$ $g$ for the confidence parameters ${d_{mj}}$ of CSTR are 1, 0.2 and 0.1, respectively. Other hyper-parameters, such as ${\lambda_{u}}$ and ${\lambda_{v}},$ are also tuned carefully; their values differ across models and across data sparsity settings of our training datasets.

For CSDL, we directly set $a=1,$ $b=0.1,$ $e=1,$ $f=0.2,$ $g=0.1,$ $K=50$ and use grid search over the hyper-parameters $\Phi,$ again with five-fold cross validation. For the model parameters of the SDAE, such as the learning rate $\eta,$ we use the same strategy. We set the noise level to 0.3 to obtain the corrupted input ${X_{0}}$ from the clean input ${X_{c}}.$ Similar to [33], we set the dropout rate to 0.1 to achieve adaptive regularization. For example, the 2-layer CSDL model $(L=4)$ has a Bayesian SDAE with architecture 20915-200-50-200-20915 for the Tencent MicroBlog dataset.
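As a rough illustration of these settings, the sketch below corrupts a clean input and lists the weight shapes of the 20915-200-50-200-20915 architecture. The text does not state the type of noise, so masking noise (randomly zeroing entries) is an assumption here; the helper names `mask_noise` and `layer_shapes` are likewise hypothetical.

```python
import numpy as np

def mask_noise(X_c, noise_level, rng):
    """Corrupt the clean input by zeroing each entry with probability
    noise_level (masking noise; an assumed choice of corruption)."""
    keep = rng.random(X_c.shape) >= noise_level
    return X_c * keep

def layer_shapes(sizes):
    """Weight-matrix shapes (inputs, outputs) for consecutive layer sizes."""
    return [(sizes[l], sizes[l + 1]) for l in range(len(sizes) - 1)]

rng = np.random.default_rng(0)
X_c = rng.random((3, 8)) + 0.1          # tiny stand-in for the J x Z content matrix
X_0 = mask_noise(X_c, noise_level=0.3, rng=rng)

# Architecture of the 2-layer model (L = 4) for Tencent MicroBlog.
sizes = [20915, 200, 50, 200, 20915]
shapes = layer_shapes(sizes)            # W_1 ... W_4
n_weights = sum(n_in * n_out for n_in, n_out in shapes)
middle_units = sizes[len(shapes) // 2]  # layer L/2 has K = 50 units
```

The first and last weight matrices account for almost all of the weights, which is consistent with the first layer dominating the training cost.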

Table 3
AP@M when the matrix of general users’ following actions is sparse and dense

		Sparse		Dense
		AP@60	AP@300	AP@60	AP@300
Tencent blog	CSDL	0.3056	0.2141	0.4039	0.3170
	CDL	0.2754	0.1961	0.3899	0.3004
	CSTR	0.2807	0.1966	0.3912	0.3024
	CTR	0.1845	0.1235	0.3852	0.2988
Twitter	CSDL	0.3008	0.2156	0.4186	0.3160
	CDL	0.2610	0.1869	0.3953	0.2947
	CSTR	0.2781	0.1995	0.3983	0.2949
	CTR	0.2289	0.1552	0.3933	0.2935

Figure 7.

Recall@M when the matrix of general users’ following actions is sparse.

Figure 8.

Recall@M when the matrix of general users’ following actions is dense.

4.4 Comparison with other models

To compare the recommendation quality of the methods experimentally, we measure the Recall and AP of each method. Figures 7 and 8 show the Recall of CSDL, CDL, CSTR and CTR on the real-world Tencent MicroBlog and Twitter datasets under both $P=1$ and $P=10$ for M $=$ 20, 30, 40, 50, 60. The curves indicate that Recall tends to increase as the number of recommended items M increases. From Fig. 7, we find that CTR performs poorly when $P=1,$ that is, when the matrix of general users’ following action records is extremely sparse; its performance is far from satisfactory due to the sparsity problem. CDL outperforms CTR by using deep learning rather than LDA, since deep learning can handle the sparse text information much better and learn a much more effective latent representation for each celebrity. By adding the social network of celebrities, CSTR also achieves better performance than CTR. Our model further improves the performance by effectively integrating deep learning, the social network of celebrities and the social relationships of general users.

From Fig. 8, we find that CDL, CSTR and CTR achieve very close performance when the matrix of general users’ following action records is dense $(P=10).$ This result indicates that the general users’ following action records have a great effect in the celebrity recommender system: when a general user has followed many celebrities, auxiliary information such as content information and the social network of celebrities contributes little additional improvement. However, by adding the social relations of general users, our method still achieves a decent performance improvement. Note that our method is a unified model that integrates four information sources, namely content, general users’ following action records, the social network of celebrities and the social relations of general users, which learn from each other. We therefore believe that the improvement also comes from the unified model, not just from the social relationships of general users.

For the ranking metric AP, CSDL also achieves the best performance, as shown in Table 3 for $P=1$ and $P=10$ with M $=$ 60 and 300.

Table 4
AP@M when the matrix of celebrities’ social relations is sparse and dense

	Sparse		Dense
	AP@60	AP@300	AP@60	AP@300
CSDL	0.2776	0.1967	0.2858	0.2017
CSTR	0.2553	0.1785	0.2638	0.1837

Figure 9.

Recall@M when the matrix of celebrities’ social relations is sparse.

Figure 10.

Recall@M when the matrix of celebrities’ social relations is dense.

4.5 Sparsity impact of celebrities’ social network

Both CSDL and CSTR integrate the social network of celebrities into the model to improve celebrity recommendation performance. To explore the effectiveness of the social network of celebrities, we therefore vary the sparsity level of the celebrity network in the Tencent MicroBlog dataset. We set up two cases:

(i) randomly select 2 edges associated with each celebrity from the celebrities’ social network to form a sparse social network;
(ii) randomly select 20 edges associated with each celebrity from the celebrities’ social network to form a dense social network.
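The two cases can be sketched as a per-celebrity edge subsampling step. The sketch assumes that the social network is stored as a binary adjacency matrix and that "edges associated with each celebrity" refers to the non-zero entries in that celebrity's row; both points, and the function name `subsample_edges`, are assumptions.

```python
import numpy as np

def subsample_edges(Q, n_edges, rng):
    """Keep at most n_edges randomly chosen edges per celebrity (per row)."""
    Q_sub = np.zeros_like(Q)
    for m in range(Q.shape[0]):
        neighbors = np.flatnonzero(Q[m])
        if neighbors.size == 0:
            continue  # celebrity with no edges stays isolated
        kept = rng.choice(neighbors, size=min(n_edges, neighbors.size),
                          replace=False)
        Q_sub[m, kept] = 1
    return Q_sub

rng = np.random.default_rng(0)
Q = (rng.random((6, 30)) < 0.8).astype(int)     # toy adjacency matrix
Q_sparse = subsample_edges(Q, n_edges=2, rng=rng)    # case (i)
Q_dense = subsample_edges(Q, n_edges=20, rng=rng)    # case (ii)
```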

Figures 9 and 10 illustrate the Recall under the sparse and dense social networks, respectively, when $P=1$ and M $=$ 60, 100, 200, 300, 400. Although a smaller M might be more reasonable in some applications, there are other scenarios where a large M makes sense. For example, a sociologist writing a paper about the influence of celebrities on social network websites may need to study more than 100 celebrities. When the social network of celebrities is sparse, CSDL clearly outperforms CSTR. CSDL also achieves better performance than CSTR under the dense social network, as shown in Fig. 10. For AP, we find a similar result, as illustrated in Table 4. In short, CSDL can handle both sparse and dense social networks.

Moreover, the larger M is, the larger the margin by which CSDL outperforms CSTR. From this, we conclude that CSDL is more stable.

Figure 11.
Recall@M when the matrix of general users’ following actions is sparse and ${\lambda_{f}}=0.$

Figure 12.
Recall@M when the matrix of general users’ following actions is dense and ${\lambda_{f}}=0.$

4.6 Parameters impact analysis

In this part, we analyze the impact of the parameters on the Tencent MicroBlog dataset.

4.6.1 Social parameter ${\lambda_{f}}$

The social parameter ${\lambda_{f}}$ balances the effect of the social relations of general users: the larger ${\lambda_{f}}$ is, the more the social relations of general users are used to make recommendations. When ${\lambda_{f}}=0,$ CSDL collapses to CSDL–, in which the social relations component vanishes. In Figs 11 and 12, we find that CSDL– outperforms CSTR under both $P=1$ and $P=10.$ The performance gain of CSDL– over CSTR comes from the deep learning component, which learns a more effective feature representation. Moreover, because CSDL integrates the social relations of general users, it further improves the performance, as shown in Figs 11 and 12. In conclusion, the social relations of general users enhance the recommendation accuracy.

4.6.2 Extreme cases of ${\lambda_{v}}/{\lambda_{n}}$

As discussed before, our model suffers when ${\lambda_{v}}/{\lambda_{n}}$ is either too large or too small. Figure 13 shows the impact of ${\lambda_{v}}/{\lambda_{n}}$ under $P=10,$ obtained by changing ${\lambda_{v}}$ while fixing the other hyper-parameters. We observe that the performance degrades gradually as ${\lambda_{v}}$ is increased or decreased from its optimal value. This phenomenon can be explained as follows: when ${\lambda_{v}}/{\lambda_{n}}$ is large, the auxiliary content information about celebrities controls the learning process of $V$ and the performance depends entirely on ${X_{L/2}};$ when ${\lambda_{v}}/{\lambda_{n}}$ is small, the performance is determined purely by the matrix factorization component. The experimental result indicates that CSDL achieves better performance when the content information of celebrities, matrix factorization on the celebrities’ social network and general users’ following action records are combined appropriately.

Table 5
Recall@60 when the matrix of general users’ following actions is sparse and dense

#layers	1	2	3
Sparse following records	0.3701	0.3745	0.3701
Dense following records	0.5528	0.5536	0.5503

Figure 13.

Recall@M when ${\lambda_{v}}$ is varied with the other hyper-parameters fixed and $P=10.$

4.6.3 Layers of SDAE

We use SDAE, a deep learning method, to learn a deep representation of the content. SDAE transforms the representation at one level into a slightly more abstract representation at a higher level. We therefore study the effect of the number of SDAE layers. Table 5 shows the Recall@60 results when CSDL is configured with different numbers of layers under both sparse and dense general users’ following action records. As we can see, the best Recall@60 is obtained with two layers, and CSDL starts to overfit when it exceeds two layers.

4.7 Algorithm complexity

To update ${u_{i}},$ ${v_{j}},$ ${s_{m}},$ we have to calculate $U{C_{j}}{U^{T}},$ $S{D_{j}}{S^{T}},$ $V{C_{i}}{V^{T}},$ $V{D_{m}}{V^{T}},$ which is the computational bottleneck. With naive calculation, the time complexities are $O({K^{2}}I)$ and $O({K^{2}}J)$ for each general user and each celebrity, respectively. Similar to [8], we can achieve a significant speedup using the fact that $U{C_{j}}{U^{T}}=U({C_{j}}-bI){U^{T}}+bU{U^{T}}$ (here the $I$ in $bI$ denotes the identity matrix). We can pre-compute $bU{U^{T}},$ and $({C_{j}}-bI)$ has only ${I_{a}}$ non-zero elements, where ${I_{a}}$ is the number of general users who have followed celebrity $j,$ and typically ${I_{a}}\ll I.$ We adopt the same trick for the other matrix products. Consequently, the computational complexity of updating each of ${u_{i}}$ and ${s_{m}}$ is $O({K^{2}}J+{K^{3}}).$ The complexity of updating ${v_{j}}$ is $O({K^{2}}J+{K^{3}}+Z{K_{1}}),$ where ${K_{1}}$ is the dimensionality of the output of the first layer of the SDAE. Since the computation is dominated by the first layer, the complexity of updating all the weights and biases is $O(JZK_{1}).$ Thus, for a whole epoch, the total time complexity is $O(JZK_{1}+K^{2}IJ+IK^{3}+JK^{3}).$
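The speedup identity can be checked numerically. The sketch below assumes the usual confidence scheme in which $c_{ij}=a$ when user $i$ follows celebrity $j$ and $c_{ij}=b$ otherwise; the sizes and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
K, I = 4, 1000              # latent dimension and number of general users (toy sizes)
a, b = 1.0, 0.1             # confidence values used in the experiments

U = rng.standard_normal((K, I))                  # one column per general user
followers = rng.choice(I, size=20, replace=False)
r_j = np.zeros(I)
r_j[followers] = 1                               # users who follow celebrity j
c_j = np.where(r_j == 1, a, b)                   # diagonal of C_j (assumed scheme)

# Naive product U C_j U^T: touches all I columns, O(K^2 I) per celebrity.
naive = (U * c_j) @ U.T

# Fast version: b U U^T is computed once and shared by all celebrities;
# (C_j - bI) is non-zero only for the I_a followers, so the correction
# term costs O(K^2 I_a).
bUUt = b * (U @ U.T)
U_a = U[:, followers]
fast = bUUt + (a - b) * (U_a @ U_a.T)
```

Both routes give the same $K\times K$ matrix, but only the small correction term has to be recomputed for each celebrity.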

5. Conclusion

In this paper, we have developed a novel hierarchical Bayesian model called CSDL for celebrity recommendation in the context of social network websites. CSDL seamlessly integrates general users’ following action records, celebrities’ content information and social network information into one principled model. In CSDL, social network information includes both the relationships of celebrities and those of general users. In addition, CSDL uses side information and deep feature learning to alleviate the sparsity problem faced by traditional CF methods and CSTR. The experimental results on two real-world datasets show that our approach outperforms other state-of-the-art algorithms under different sparsity level settings.

In the future, we will try other deep learning methods to further improve the performance of our hierarchical Bayesian framework. One promising choice is the recurrent neural network (RNN), which can take the context and order of words into account. We will also further explore the social information to improve recommendation performance.

Acknowledgments

This work is supported by the National Key Research and Development Program of China under grant 2016YFB1000901, the National Natural Science Foundation of China (NSFC) under grant No. 61202227 and Provincial Natural Science Foundation of Anhui Higher Education Institution of China under grant No. KJ2018A0013.

References

1. Basilico and Hofmann, Unifying collaborative and content-based filtering, in: ICML, (2004), 9.

2. Blei D.M., A.Y. and Jordan M.I., Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

3. Chen, Zheng, Wang, Hong and Lin, Context-aware Collaborative Topic Regression with Social Matrix Factorization for Recommender Systems, in: AAAI, (2014).

4. Cheng, Yang, King and Lyu M.R., Fused matrix factorization with geographical and social influence in location-based social networks, in: AAAI, (2012).

5. Ding, Jin and Li, Celebrity recommendation with collaborative social topic regression, in: International Joint Conference on Artificial Intelligence, (2013), 2612–2618.

6. Guo, Zhang and Yorke-Smith, A Novel Recommendation Model Regularized with User Trust and Item Ratings, IEEE Transactions on Knowledge & Data Engineering 28(7) (2016), 1607–1620.

7. Cao and Cao, Deep Modeling of Group Preferences for Group-Based Recommendation, in: AAAI, (2014), 1861–1867.

8. Koren and Volinsky, Collaborative Filtering for Implicit Feedback Datasets, in: ICDM, (2008), 263–272.

9. Jamali and Ester, TrustWalker: a random walk model for combining trust-based and item-based recommendation, in: KDD, (2009), 397–406.

10. Koren, Bell and Volinsky, Matrix Factorization Techniques for Recommender Systems, Computer 42(8) (2009), 30–37.

11. Kwak, Lee, Park and Moon, What is Twitter, a social network or a news media? in: WWW, (2010), 591–600.

12. Lecun, Bengio and Hinton, Deep learning, Nature 521(7553) (2015), 436–444.

13. Lee D.D. and Seung H.S., Algorithms for Non-negative Matrix Factorization, in: NIPS, (2000), 556–562.

14. Kawale and Fu, Deep Collaborative Filtering via Marginalized Denoising Auto-encoder, in: CIKM, (2015), 811–820.

15. Liben-Nowell and Kleinberg, The link-prediction problem for social networks, Journal of the Association for Information Science and Technology 58(7) (2007), 1019–1031.

16. Liu Z.D. and Huang, Relational Stacked Denoising Autoencoder for Tag Recommendation, in: AAAI, (2015).

17. Yang, Lyu R.M. et al., SoRec: social recommendation using probabilistic matrix factorization, in: CIKM, (2008), 931–940.

18. Onnela J.P., Saramäki, Hyvönen, Szabó, Lazer, Kaski et al., Structure and tie strengths in mobile communication networks, Proceedings of the National Academy of Sciences of the United States of America 104(18) (2007), 7332–7336.

19. Oord A.V.D., Dieleman and Schrauwen, Deep content-based music recommendation, in: NIPS, 26 (2013), 2643–2651.

20. Ouyang, Liu, Rong and Xiong, Autoencoder-Based Collaborative Filtering, in: NIPS, (2014), 284–291.

21. Pan, Zhou, Cao and Liu N.N., One-Class Collaborative Filtering, in: ICDM, (2008), 502–511.

22. Purushotham, Liu and Kuo C.C.J., Collaborative Topic Regression with Social Matrix Factorization for Recommendation Systems, Computer Science, (2012).

23. Rennie J.D.M. and Srebro, Fast maximum margin matrix factorization for collaborative prediction, in: ICML, (2005), 713–719.

24. Salakhutdinov and Mnih, Bayesian probabilistic matrix factorization using Markov chain Monte Carlo, in: ICML, (2008), 880–887.

25. Salakhutdinov and Mnih, Probabilistic matrix factorization, in: NIPS, (2008), 1257–1264.

26. Salakhutdinov, Mnih and Hinton, Restricted Boltzmann machines for collaborative filtering, in: ICML, (2007), 791–798.

27. Shen and Jin, Learning personal + social latent factor model for social recommendation, in: KDD, (2012), 1303–1311.

28. Srebro, Rennie J.D.M. and Jaakkola, Maximum-Margin Matrix Factorization, in: NIPS, (2005), 1329–1336.

29. and Khoshgoftaar T.M., A survey of collaborative filtering techniques, Advances in Artificial Intelligence 2009(12) (2009), 2.

30. Vincent, Larochelle, Lajoie, Bengio and Manzagol P.A., Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, Journal of Machine Learning Research 11(12) (2010), 3371–3408.

31. Wang and Blei D.M., Collaborative topic modeling for recommending scientific articles, in: KDD, (2011), 448–456.

32. Wang and Li, Relational Collaborative Topic Regression for Recommender Systems, IEEE Transactions on Knowledge & Data Engineering 27(5) (2015), 1343–1355.

33. Wang and Yeung D.Y., Collaborative Deep Learning for Recommender Systems, in: KDD, (2015), 1235–1244.

34. Wang and Yeung D.Y., Towards Bayesian Deep Learning: A Framework and Some Existing Methods, IEEE Transactions on Knowledge and Data Engineering 28(12) (2016), 3395–3408.

35. Wang, Ester, Wang and Chen, Social Recommendation with Strong and Weak Ties, in: CIKM, (2016), 5–14.

36. Wang and Wang, Improving Content-based and Hybrid Music Recommendation using Deep Learning, in: the ACM International Conference on Multimedia, (2014), 627–636.

37. Yang, Lei, Liu and Li, Social Collaborative Filtering by Trust, in: International Joint Conference on Artificial Intelligence, (2013), 2747–2753.

38. Yang S.H., Long, Smola, Sadagopan, Zheng and Zha, Like like alike: joint friendship and interest propagation in social networks, in: WWW, (2011), 537–546.

39. Yao, Wang, Zhang and Cao, Collaborative Topic Ranking: Leveraging Item Meta-Data for Sparsity Reduction, in: AAAI, (2015), 374–380.

40. Ying, Chen, Xiong and Wu, Collaborative Deep Ranking: A Hybrid Pair-Wise Recommendation Algorithm with Implicit Feedback, in: PAKDD, (2016).

41. Zhao, Cai and Zhuang, User Preference Learning for Online Social Recommendation, IEEE Transactions on Knowledge & Data Engineering 28(9) (2016), 2522–2534.

42. Zimmerman, Parameswaran and Kurapati, Celebrity Recommender, Proceedings of Workshop on Personalization in Future TV, Universidad De Malaga, (2002), 39–47.
