Modeling multi-attribute and implicit relationship factors with self-supervised learning for recommender systems

Abstract

Interactions between users and items can be naturally modeled as a user-item bipartite graph in recommender systems, and emerging research is devoted to exploring user-item graphs for collaborative filtering. In reality, a user-item interaction usually stems from more complex underlying factors, such as the user’s specific preferences, and a user-item bipartite graph can be used to understand these differences in motivation. However, existing research has not clearly proposed and modeled the factors that drive the differences, and it ignores the similarities between user pairs and item pairs, which prevents it from capturing fine-grained user preferences effectively. In addition, most GNN-based recommendation models have two further limitations: first, the model’s accuracy depends on the number of observed interactions in the dataset; second, node representations are vulnerable to noisy interactions. This work develops a novel recommendation model called “Multi-Attribute and Implicit Relationship Factors With Self-Supervised Learning for Collaborative filtering” (MIS-CF), which explicitly proposes and models multi-attribute and implicit relationship factors for collaborative filtering recommendation. Meanwhile, an auxiliary self-supervised learning task is designed to help the downstream task optimize the node representation. MIS-CF aggregates multi-attribute spaces through the user-item bipartite graph and establishes user-user and item-item graphs to model the similarity relationships of neighbor pairs through a memory model. The self-supervised learning task performs contrastive learning via self-discrimination, mining the rich auxiliary signals within the data and improving the accuracy and robustness of the model. Moreover, a sparse regularizer is utilized to alleviate the overfitting problem.
Extensive experimental results on three public datasets not only show the significant performance and robustness gain of the proposed model but also prove the effectiveness and interpretability of fine-grained implicit factors modeling.

Keywords

Recommender system, computing methodologies, collaborative filtering, self-supervised learning

1. Introduction

With information overload, users want to obtain interesting information more efficiently in the Internet environment. Companies also hope that their products can attract and retain users to the greatest extent, thereby achieving growth. The recommender system generates personalized item recommendations and deals with the information overload problem. Since the recommender system has received positive feedback in practice, it has not only aroused great interest in academia [1, 2] but has also been extensively developed in industry [3].

A de facto solution for many modern recommender systems is the Collaborative Filtering (CF) technique. The basic assumption is that people who made similar purchases in the past tend to make similar choices in the future [4]. The matrix factorization algorithm [5, 6] adds the concept of latent vectors to the CF algorithm; the vectors are inferred from the record of user-item interactions. However, only the characteristics of the user and the item are considered, and an explicit encoding of the user-item interaction is missing. In essence, this kind of user-item interaction information can be naturally modeled as a graph. The graph can contain more precise information and enhance the connectivity between users and items. More recently, graph convolutional neural networks have become one of the best-performing architectures for various graph learning tasks. GC-MC [7] uses two multi-link graph convolution layers to aggregate user features and item features. LightGCN [8] is a simplified yet powerful graph convolution network that can efficiently aggregate the information of neighbor nodes from low order to high order without feature transformation or nonlinear activation.

Although they are effective, these methods still have three important limitations. First, they assume that users purchase items with a constant motivation. However, in the real world, the motivations behind a user’s decision-making are multiple. People’s personal attributes significantly affect their preferences: a person will prefer items with different attributes depending on his own attributes, which derive from personality, occupation, professional direction, etc. Therefore, methods that do not distinguish between purchase motives inevitably lose valuable fine-grained information. The recent work MCCF [9] captures more complex interaction characteristics by setting up multiple components, but still does not use more comprehensive and effective information to provide interpretable suggestions. Secondly, these methods only consider the characteristics of the nodes in the bipartite graph and treat the graph as an independent individual, ignoring the user-user and item-item relationships outside the bipartite graph. Implicit relationships can be modeled through user-user and item-item graphs to reflect more complex interaction characteristics. Thirdly, the interactions observable in a real recommendation scenario are very sparse compared to the size of the entire dataset, so these models are limited to learning node embeddings of bipartite graphs from sparse data. More importantly, GNN-based models are sensitive to the structure of the constructed user-item bipartite graph: any change in the graph structure (e.g., adding noisy interactions) can dramatically degrade model performance.

Figure 1.

A toy example of purchase records with different purchasing motivations.

Figure 1 shows a toy example. Suppose the latent attributes of the users are ignored and the difference in purchase motivation is not considered. In that case, the likelihoods of user $U_{1}$ buying item $I_{3}$ or $I_{4}$ cannot be compared. However, if the influence of latent attributes on preferences is taken into account (users $U_{1}$, $U_{3}$, and $U_{4}$ prefer high-tech products, whereas user $U_{2}$ prefers artworks), it can be determined that item $I_{4}$ is more suitable for $U_{1}$ than $I_{3}$. From the user-item interaction perspective, item $I_{4}$ is purchased by users who prefer high-tech products. As a result, $U_{1}$ tends to purchase $I_{4}$. From the similarity of the attributes of the purchased items at the user-user and item-item level, it can be inferred that the preferences of $U_{3}$ and $U_{1}$ are more similar. This similarity of attributes can be captured by implicit relationship modeling. Consequently, it is necessary to design a recommender system that can describe fine-grained user preferences on both levels.

In this paper, multi-attribute factors and implicit relationship factors are explicitly proposed and modeled. For a given user-item bipartite graph, multiple attributes are first extracted. Then a two-layer attention mechanism is used to distinguish the probability distribution over the attribute spaces. Finally, the attribute factors are modeled. At the same time, a sparse regularizer alleviates the overfitting caused by attribute factors that reflect similar motivations. For the implicit relationship part, user-user and item-item graphs are used separately. The attention-based memory module learns the specific relation vector between node pairs, and relation-level attention then automatically selects information-rich neighbors for preference modeling. We also enhance the robustness of the overall model by adding self-supervised learning as an auxiliary task. Two different views are first generated by reconstructing the graph structure of the initial user-item graph. Then an embedded representation of each node is generated under each of the two views. Finally, a contrastive learning loss is constructed to mine the internal auxiliary signals of the node data in the graph.

The contributions of this paper are as follows:

•

MIS-CF, a new collaborative filtering method based on graph neural networks, is proposed. It can capture the fine-grained hidden factors behind user behavior based on attribute-level attention and implicit relationship aggregation.

•

A memory model is applied to separately constructed user-user and item-item graphs to capture the relationship information between neighbor pairs. MIS-CF learns all three graphs at the same time and unifies the multi-attribute and implicit relationship information through the information fusion layer to achieve end-to-end learning.

•

Self-supervised learning is applied as an auxiliary task for graph neural network-based recommendation, mining auxiliary signals within the graph by means of node self-discrimination. The obtained auxiliary signals enrich node representation learning and enhance the robustness of the model against noisy interactions.

•

Extensive experiments on three public datasets are conducted to evaluate the proposed method. Experimental results show the effectiveness and interpretability of MIS-CF.

2. Related work

2.1 Graph neural networks

The classic collaborative filtering algorithm used to be the preferred model for recommender systems. The matrix factorization (MF) model was derived from collaborative filtering and has become the most popular of the various collaborative filtering methods. The early MF method [10] was designed to simulate the user’s explicit feedback by mapping users and items to a latent factor space so that the user-item relationship (rating) can be obtained through the dot product of latent factors. PMF [5] is based on regularized matrix factorization and introduces a probability model for further optimization. BiasedMF [6] introduces a bias term and a regularization term to improve accuracy and alleviate the problem of overfitting, respectively. LLORMA-Local [11] uses different combinations of low-rank approximations to reconstruct rating matrix entries. With the widespread use of deep learning technology, neural MF models have emerged. For example, AutoRec [12] uses the idea of the autoencoder to complete the self-encoding of the item or user vector and then uses the result of the self-encoding to get the estimated score; CF-NADE [13] constructs a denoising autoencoder, which immediately eliminates part of the input space in each iteration. However, compared to MF, graphs have natural advantages in representing rich pairwise relationship information in recommender systems. With the great achievements of Graph Neural Networks (GNNs) [14, 15, 16], more and more works try to apply GNNs in recommender systems to capture information better. In the early stages of development, Bruna et al. tried to migrate the convolution of Euclidean space to the spatial and frequency domains of the graph network [17]. Then the application of polynomial spectral filters and linear filters dramatically reduced the computational cost [18, 16]. Along with spectral graph convolution, directly performing graph convolution in the spatial domain has also been investigated [19, 20].
After that, the attention mechanism was used to generate the weights of neighbor nodes [15], and the heterogeneous graph attention network was used to distinguish the different importance of nodes and meta-paths in the convolution process [21]. DisenGCN [22] uses a novel neighborhood routing mechanism on the homogeneous graph to find the factors that may lead to an edge from a given node to one of its neighbors. LR-GCCF [23] uses the residual method for user-item bipartite graph representation in the learning process. NIA-GCN [24] uses the heterogeneity of the user-item bipartite graph to model the relationship information between neighbor nodes. MBGCN [25] explores the importance of multiple behaviors for recommendation. NGCF [26] establishes a user-item bipartite graph to gather high-order neighborhood information. GC-MC [7] uses two multi-link graph convolution layers to aggregate user features and item features. MCCF [9] is the first to explore using the edge information of bipartite graphs to construct multiple components to obtain multiple preferences of users. Multi-GCCF [2] adds multiple graphs to the model. These methods still ignore more fine-grained information and do not make full use of graph information to capture multi-level latent attribute information.

2.2 Self-supervised learning

Self-supervised learning (SSL) is a new learning paradigm that can learn auxiliary signals from raw data. It was first used in the fields of computer vision (CV) and natural language processing (NLP) [27, 28, 29, 30]. Current research on SSL can be roughly divided into two categories: generative models and contrastive models. Auto-encoding [31] is one of the most popular generative models [32], which enhances the robustness of the model by artificially adding noisy data. Contrastive models learn to compare via a Noise Contrastive Estimation (NCE) objective [33, 34, 35]. In view of the excellent performance of contrastive learning on graph representation learning, a growing body of work applies self-supervised learning to recommendation models based on graph neural networks. SGL [36] generates a two-view representation of the nodes on the user-item bipartite graph and then constructs contrastive learning to optimize the node representation. $\mathcal{S}^{3}$-Rec [37] designs an auxiliary task for SSL that uses random masks on attributes and items to maximize mutual information on attributes and sequences. The work in [38] proposes to construct self-supervised item recommendation with a two-tower DNN model utilizing uniform feature masking and discarding. CLS4Rec [39] proposes a contrastive SSL framework for improving social recommendation, which adopts random augmentation. Considering that self-supervised learning on graphs can obtain auxiliary signals from within the data by exploring the graph structure, and that the models mentioned above have achieved better performance in their scenarios with SSL, we set SSL as an auxiliary task to make full use of the information inside the graph data.

Figure 2.

The framework of MIS-CF. This example predicts the rating that user $u$ would give to item $i$ . For simplicity, the figure only presents the user part and the item part is similar. The specific details of modules are presented in the corresponding section.

3. Approach

The overall framework of the proposed model is shown in Fig. 2. Firstly, the Bipartite Graph Convolutional Neural Network (Bipar-GCN) layer acts as an encoder to generate user and item embeddings by processing the user-item bipartite graph. Secondly, the Self-supervised Learning (SSL) module serves as an auxiliary task to mine the information contained in the bipartite graph itself, enriching the representations of users and items. Thirdly, the Implicit Relation Modeling (IRM) layer encodes additional latent information by constructing and processing two other graphs, which represent user-user similarity and item-item similarity, respectively. Fourth, the Attention-based Memory Module in the IRM layer assists in capturing latent relationship information by generating a relation vector $f$ .

3.1 Bipartite graph convolutional neural networks

In the recommendation scenario, the users’ ratings of items can be modeled as a user-item bipartite graph with two types of nodes $\mathcal{G}=\{\mathcal{U,I,E,R}\}$ , where $\mathcal{U}$ and $\mathcal{I}$ represent the sets of $N_{u}$ users and $N_{i}$ items, respectively; the rating set $\mathcal{R}$ contains the rating levels $\left\{1,\ldots,R\right\}$ ; and each edge $e=\left(u,i,r\right)\in\mathcal{E}$ indicates that user $u$ has given an explicit rating $r$ to item $i$ . We use $\mathbf{U}=[\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{N_{u}}]\in\mathbb{R}^{d\times N_{u}}$ and $\mathbf{P}=[\mathbf{p}_{1},\mathbf{p}_{2},\ldots,\mathbf{p}_{N_{i}}]\in\mathbb{R}^{d\times N_{i}}$ to represent the feature matrices of users and items, respectively, where $d$ is the dimension of their features. The details of this module can be seen in Fig. 3.

Figure 3.

Illustration of the Bipar-GCN module from the user part. The number of attribute spaces is a hyperparameter that should be set at the beginning.

3.1.1 Multi-attribute extraction

Assume that $M$ latent attribute spaces can be extracted from the user-item bipartite graph $\mathcal{G}$ . The $m$ -th latent attribute space affects the $m$ -th purchase motivation in the user-item interactions. Then the transformation matrices $\mathbf{W}=\{\mathbf{W}_{1},\mathbf{W}_{2},\ldots,\mathbf{W}_{M}\}$ and $\mathbf{Q}=\{\mathbf{Q}_{1},\mathbf{Q}_{2},\ldots,\mathbf{Q}_{M}\}$ are used to extract the latent attributes of users and items, respectively. The $m$ -th latent attribute space for user $u$ is denoted as:

$\displaystyle\mathbf{s}^{u}_{m}=\mathbf{W}_{m}\mathbf{u}_{u}$ (1)

The same holds for the $m$ -th latent attribute space of item $i$ , which is denoted as:

$\displaystyle\mathbf{h}^{i}_{m}=\mathbf{Q}_{m}\mathbf{p}_{i}$ (2)
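
As a minimal sketch of Eqs. (1) and (2), the projections into the $M$ attribute spaces can be written with NumPy. The dimensions, the random initialization, and the choice of square $d\times d$ transformation matrices are assumptions made only for illustration; the text does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 8, 3                      # feature dimension and number of attribute spaces (illustrative)
u_feat = rng.normal(size=d)      # feature vector u_u of one user
p_feat = rng.normal(size=d)      # feature vector p_i of one item

# One transformation matrix per attribute space (the square d x d shape is an assumption).
W = rng.normal(size=(M, d, d))   # user-side matrices W_1..W_M
Q = rng.normal(size=(M, d, d))   # item-side matrices Q_1..Q_M

s_u = W @ u_feat                 # Eq. (1): s_u[m] = W_m u_u, shape (M, d)
h_i = Q @ p_feat                 # Eq. (2): h_i[m] = Q_m p_i, shape (M, d)
```

Each row of `s_u` and `h_i` is the representation of the same user or item in one attribute space.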

3.1.2 Node-level attention

In the following, we target user $u$ and his interacted item set $\mathcal{P}_{u}$ . Through the above formulas, user $u$ has $M$ user-specific attributes $\{\mathbf{s}^{u}_{m}\}^{M}_{m=1}$ , and each item $i\in\mathcal{P}_{u}$ also has $M$ item-specific attributes $\{\mathbf{h}^{i}_{m}\}^{M}_{m=1}$ . The next step is to aggregate the attribute spaces of the item nodes to encode the $m$ -th attribute feature of user $u$ . However, the items $i\in\mathcal{P}_{u}$ cannot be assumed to influence user $u$ with the same weight. Therefore, a node-level attention mechanism is added to infer which items user $u$ actually purchased because of the $m$ -th attribute space.

In particular, the possibility of user $u$ purchasing item $i$ based on the $m$ -th attribute space can be formulated as:

${}^{*}e^{ui}_{m}=\textit{att}_{\textit{node}}(\mathbf{s}^{u}_{m},\mathbf{h}^{i}_{m};m),$ (3)

where $\textit{att}_{\textit{node}}$ denotes the deep neural network which performs the node-level attention. The result is then normalized by the softmax function to obtain the weight coefficient $e^{ui}_{m}$ :

$\displaystyle e^{ui}_{m}=\textit{softmax}({}^{*}e^{ui}_{m})=\frac{\exp(\sigma(\mathbf{a}^{T}_{m}\cdot[\mathbf{s}^{u}_{m}\oplus\mathbf{h}^{i}_{m}]))}{\sum_{j\in\mathcal{P}_{u}}\exp(\sigma(\mathbf{a}^{T}_{m}\cdot[\mathbf{s}^{u}_{m}\oplus\mathbf{h}^{j}_{m}]))},$ (4)

where $\sigma$ denotes the activation function, $\oplus$ denotes the concatenation operation, and $\mathbf{a}_{m}$ is the node-level attention vector for the $m$ -th attribute space.

In this way, by aggregating all the item-specific attributes of $i\in\mathcal{P}_{u}$ , the $m$ -th item-aggregated attribute feature $\mathbf{z}^{u}_{m}$ for user $u$ can be learned to be expressed as:

$\displaystyle\mathbf{z}^{u}_{m}=\sigma\left(\sum_{i\in\mathcal{P}_{u}}e^{ui}_{m}\cdot\mathbf{h}^{i}_{m}\right).$ (5)

It can be seen from Eq. (5) that each $\mathbf{z}^{u}_{m}$ is aggregated based on the features of the items purchased by the user $u$ based on the current attribute, so it represents the user’s corresponding purchase motivation.
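
The node-level attention of Eqs. (3) to (5) can be sketched for a single user and a single attribute space as follows. This is an illustrative sketch rather than the authors' implementation: the LeakyReLU activation standing in for $\sigma$, the function name `node_level_attention`, and the toy dimensions are all assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # Assumed choice for the activation sigma; the text does not name one.
    return np.where(x > 0, x, slope * x)

def node_level_attention(s_um, H_m, a_m):
    """s_um: (d,) user attribute s^u_m; H_m: (n, d) rows h^i_m for i in P_u;
    a_m: (2d,) attention vector. Returns (weights e^{ui}_m, aggregated z^u_m)."""
    concat = np.concatenate([np.broadcast_to(s_um, H_m.shape), H_m], axis=1)  # [s (+) h], (n, 2d)
    scores = leaky_relu(concat @ a_m)        # unnormalized scores, Eq. (3)
    scores = scores - scores.max()           # shift for numerical stability
    e = np.exp(scores) / np.exp(scores).sum()  # softmax over the items in P_u, Eq. (4)
    z_um = leaky_relu(e @ H_m)               # weighted aggregation, Eq. (5)
    return e, z_um

rng = np.random.default_rng(0)
d, n_items = 4, 5
e, z = node_level_attention(rng.normal(size=d),
                            rng.normal(size=(n_items, d)),
                            rng.normal(size=2 * d))
```

The weights `e` form a probability distribution over the items the user interacted with, so items that better match the user in this attribute space contribute more to `z`.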

3.1.3 Space-level attention

Similarly, considering that different attributes have different degrees of influence on users’ purchase motivation, an attribute space-level attention mechanism is needed to learn the importance of different attribute spaces.

Taking $M$ item-aggregated attribute spaces of user $u$ as input, we aim to learn the weights of each item-aggregated attribute space $(\beta^{u}_{1},\beta^{u}_{2},\ldots,\beta^{u}_{M})$ as follows:

$\displaystyle(\beta^{u}_{1},\beta^{u}_{2},\ldots,\beta^{u}_{M})=\textit{att}_{\textit{spac}}(\mathbf{z}^{u}_{1},\mathbf{z}^{u}_{2},\ldots,\mathbf{z}^{u}_{M}),$ (6)

where $\textit{att}_{\textit{spac}}$ denotes the deep neural network which performs the attribute space-level attention. Considering that the importance of different attribute spaces still depends on the attributes of the user $u$ itself, we concatenate $\mathbf{z}^{u}_{m}$ and $\mathbf{s}^{u}_{m}$ , and learn their unified embedding as follows:

$\displaystyle\mathbf{d}^{u}_{m}=\sigma(\mathbf{C}_{m}\cdot[\mathbf{z}^{u}_{m}\|\mathbf{s}^{u}_{m}]+\mathbf{b}_{m}),$ (7)

where $\mathbf{C}_{m}$ is the weight matrix and $\mathbf{b}_{m}$ is the bias vector. Then, with a space-level attention vector $\mathbf{q}$ , the importance $\beta^{*}_{m}$ of the $m$ -th item-aggregated attribute space can be learned as follows:

$\displaystyle\beta^{*}_{m}=\sigma(\mathbf{q}^{T}\cdot\mathbf{d}^{u}_{m}+b),$ (8)

where $b$ is the bias. It is worth noting that $\mathbf{q}$ and $b$ are parameters shared by all attribute spaces because, in our hypothesis, the purchase motivation caused by attributes can be quantified. The softmax function is then used to normalize $\beta^{*}_{m}$ to obtain the weight $\beta^{u}_{m}$ of the $m$ -th item-aggregated attribute space, as shown below:

$\displaystyle\beta^{u}_{m}=\frac{\exp(\beta^{*}_{m})}{\sum^{M}_{k=1}\exp(\beta^{*}_{k})}.$ (9)

According to each attribute space $\mathbf{z}^{u}_{m}$ and its weight, the final embedding $\mathbf{z}_{u}$ of user $u$ can be obtained in the bipartite graph, as shown below:

$\displaystyle\mathbf{z}_{u}=\sum^{M}_{m=1}\beta^{u}_{m}\cdot\mathbf{z}^{u}_{m}.$ (10)
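
Eqs. (7) to (10) can be sketched in NumPy for one user as shown below. The `tanh` activation standing in for $\sigma$, the function name `space_level_attention`, and the parameter shapes are assumptions chosen so that the dimensions are consistent; they are not taken from the paper.

```python
import numpy as np

def space_level_attention(Z, S, C, b_vec, q, b):
    """Z, S: (M, d) item-aggregated attributes z^u_m and user attributes s^u_m.
    C: (M, d, 2d) weight matrices, b_vec: (M, d) biases, q: (d,), b: scalar."""
    concat = np.concatenate([Z, S], axis=1)                   # [z || s], shape (M, 2d)
    D = np.tanh(np.einsum("mij,mj->mi", C, concat) + b_vec)   # Eq. (7), shape (M, d)
    beta_star = np.tanh(D @ q + b)                            # Eq. (8), q and b shared across spaces
    beta = np.exp(beta_star - beta_star.max())
    beta = beta / beta.sum()                                  # Eq. (9)
    z_u = beta @ Z                                            # Eq. (10), shape (d,)
    return beta, z_u

rng = np.random.default_rng(0)
M, d = 3, 4
Z = rng.normal(size=(M, d))
S = rng.normal(size=(M, d))
beta, z_u = space_level_attention(Z, S,
                                  rng.normal(size=(M, d, 2 * d)),
                                  np.zeros((M, d)),
                                  rng.normal(size=d), 0.0)
```

Because `beta` sums to one, the final embedding `z_u` is a convex combination of the $M$ attribute-space features.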

3.2 Implicit relation modeling

In our hypothesis, there are multiple attributes for users and items, and these attributes affect the user’s preferences for items. Similarly, because there are similar decision-making patterns in human nature, common item preferences can be inferred through groups of people with similar attributes.

The addition of the Implicit Relation Modeling (IRM) layer could model similar users and items. It would capture the similar relationships between paired nodes by constructing and learning additional user-user and item-item graphs (multi-graphs). We select thresholds based on the cosine similarity that leads to an average degree of 20 for each graph.
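
One way to choose a cosine-similarity threshold that yields a target average degree is sketched below. The input representation (one feature row per user or item), the function name `build_similarity_graph`, and the tie handling are assumptions; the text only states that the threshold is chosen so that the average degree is 20.

```python
import numpy as np

def build_similarity_graph(X, avg_degree=20):
    """Connect rows of X whose cosine similarity reaches a data-driven threshold.

    X: (n, d) array, one row per user (or item). Returns (adjacency, threshold).
    """
    n = X.shape[0]
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    sim = Xn @ Xn.T                               # pairwise cosine similarity
    rows, cols = np.triu_indices(n, k=1)          # each unordered pair once, no self-pairs
    pair_sim = sim[rows, cols]
    # An undirected graph with average degree k has n * k / 2 edges.
    n_edges = min(n * avg_degree // 2, pair_sim.size)
    threshold = np.sort(pair_sim)[-n_edges]       # n_edges-th largest similarity
    keep = pair_sim >= threshold
    adj = np.zeros((n, n), dtype=bool)
    adj[rows[keep], cols[keep]] = True
    return adj | adj.T, threshold

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))                     # hypothetical feature rows
adj, thr = build_similarity_graph(X, avg_degree=4)
```

With continuous-valued features, ties at the threshold are unlikely, so the resulting graph has the requested average degree.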

3.2.1 Relation vector generation

Figure 4.

Illustration of the attention-based memory module from the user part. The memory size is set to 4 in this example.

In most cases, users and their similar users, or items and their similar items, only share the same attributes in certain aspects. However, these implicit relationships cannot be directly reflected in the existing data. Inspired by recent advances in memory networks and attention mechanisms, we insert an attention-based memory module to learn the relation vectors among similar users or items. The structure of this module is shown in Fig. 4. The memory matrix of this module is represented as $\mathcal{M}=\{\mathcal{M}_{1},\mathcal{M}_{2},\ldots,\mathcal{M}_{N}\}\in\mathbb{R}^{N\times d}$ , where $N$ is the memory size and $d$ is the dimension of the user and item embeddings. For example, the input of the user part of this module is a similar user pair $(u_{n},u_{(n,l)})$ , where $u_{(n,l)}$ denotes the $l$ -th similar user of user $u_{n}$ . This module returns the vector $f^{u}_{(n,l)}$ , which represents the relationship between $u_{n}$ and $u_{(n,l)}$ .

For a given pair of similar users, an operation is firstly applied to learn their joint embedding $c_{u}$ , which is given by:

$\displaystyle c_{u}=\frac{u_{n}\odot u_{(n,l)}}{\|u_{n}\|\cdot\|u_{(n,l)}\|},$ (11)

where $\odot$ denotes the element-wise product of vectors.

Then the joint embedding $c_{u}$ is extended to a matrix via the memory matrix $\mathcal{M}$ :

$\displaystyle F^{u}_{j}=c_{u}\odot\mathcal{M}_{j},$ (12)

For user $u$ , each of his similar users has a matrix $F^{u}\in\mathbb{R}^{N\times d}$ , which can be interpreted as storage of conceptual building blocks used to describe the similar preferences in different attribute spaces ( $N$ can be seen as the number of latent spaces).

Since different attribute spaces have different influences, the key matrix $\mathbf{K}\in\mathbb{R}^{N\times d}$ is used to learn the attention of similar users in different attribute spaces. Each element of the attention vector $\alpha$ is defined as:

$\displaystyle\alpha^{*}_{j}=\mathbf{K}^{T}_{j}c_{u}.$ (13)

The final attention scores are obtained by normalizing $\alpha^{*}$ using the softmax function:

$\displaystyle\alpha_{j}=\frac{\exp(\alpha^{*}_{j})}{\sum_{k}\exp(\alpha^{*}_{k})}.$ (14)

Finally, in order to generate the relation vector, these attention scores are used to calculate a weighted representation of $F$ :

$\displaystyle f^{u}_{(n,l)}=\sum_{j}\alpha_{j}F^{u}_{j}.$ (15)

The output is a specific relation vector $f^{u}_{(n,l)}$ , which can be seen as the influence vector of $u_{(n,l)}$ to user $u_{n}$ ’s preferences.
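
The attention-based memory module of Eqs. (11) to (15) can be sketched for one pair of similar users as follows. The function name `relation_vector`, the random memory and key matrices, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def relation_vector(u_n, u_l, memory, keys):
    """u_n, u_l: (d,) embeddings of a user and one similar user.
    memory, keys: (N, d) memory matrix M and key matrix K.
    Returns (f, alpha): relation vector (d,) and attention scores (N,)."""
    c = (u_n * u_l) / (np.linalg.norm(u_n) * np.linalg.norm(u_l))  # joint embedding, Eq. (11)
    F = c[None, :] * memory                   # F_j = c (.) M_j, Eq. (12), shape (N, d)
    alpha_star = keys @ c                     # alpha*_j = K_j^T c, Eq. (13)
    alpha = np.exp(alpha_star - alpha_star.max())
    alpha = alpha / alpha.sum()               # softmax, Eq. (14)
    f = alpha @ F                             # weighted sum of memory slices, Eq. (15)
    return f, alpha

rng = np.random.default_rng(0)
d, N = 6, 4                                   # embedding size and memory size (illustrative)
u_n, u_l = rng.normal(size=d), rng.normal(size=d)
memory, keys = rng.normal(size=(N, d)), rng.normal(size=(N, d))
f, alpha = relation_vector(u_n, u_l, memory, keys)
```

Since every slice `F_j` is the joint embedding scaled element-wise by a memory row, `f` equals the joint embedding multiplied element-wise by an attention-weighted mixture of the memory rows.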

So far, several relation vectors $\{f^{u}_{(n,1)},f^{u}_{(n,2)},\ldots,f^{u}_{(n,l)}\}$ have been obtained for user $u_{n}$ . Each relation vector represents the influence of one similar user on the target user $u_{n}$ . To infer the influence of different similar users on user $u_{n}$ , a relation-level attention mechanism is introduced to adaptively learn the weight of each relation vector.

3.2.2 Relation-level attention

Intuitively, if a similar user has more expertise in a class of items, i.e., a stronger attribute in that field, then his choices have a greater impact on the target user. We therefore use a deep neural network to learn the weights of the individual relation vectors $(\omega^{u}_{(n,1)},\omega^{u}_{(n,2)},\ldots,\omega^{u}_{(n,l)})$ as follows:

$\displaystyle(\omega^{u}_{(n,1)},\omega^{u}_{(n,2)},\ldots,\omega^{u}_{(n,l)})=\textit{att}_{\textit{rela}}(f^{u}_{(n,1)},f^{u}_{(n,2)},\ldots,f^{u}_{(n,l)}),$ (16)

Formally, with the relation vector $f^{u}_{(n,l)}$ as input:

$\displaystyle\omega^{*}_{(n,l)}=\sigma(Wf^{u}_{(n,l)}+b),$ (17)

where $W$ and $b$ are model parameters and $\sigma$ denotes the activation function.

Then, the final relation-level attention is normalized with a softmax function to obtain the weight of each similar user:

$\displaystyle\omega^{u}_{(n,l)}=\frac{\exp(\omega^{*}_{(n,l)})}{\sum_{j\in L_{n}}\exp(\omega^{*}_{(n,j)})},$ (18)

where $L_{n}$ denotes the set of all similar users of user $u_{n}$ , whose size equals the degree of the node in the multi-graph.

With these learned weights, we can obtain the final embedding $\mathbf{v}_{u}$ of user $u$ in IRM layer as follows:

$\displaystyle\mathbf{v}_{u}=\sum_{l\in L_{n}}\omega^{u}_{(n,l)}f^{u}_{(n,l)}.$ (19)
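
Eqs. (17) to (19) can be sketched as a small NumPy function. Since $\omega^{*}_{(n,l)}$ is a scalar, the parameter $W$ is treated here as a vector `w`; that shape, the `tanh` activation, and the function name are assumptions.

```python
import numpy as np

def relation_level_attention(F_rel, w, b):
    """F_rel: (L, d) relation vectors f^u_(n,l) of the L similar users.
    w: (d,) weight vector, b: scalar bias. Returns (omega, v_u)."""
    omega_star = np.tanh(F_rel @ w + b)       # Eq. (17), one score per similar user
    omega = np.exp(omega_star - omega_star.max())
    omega = omega / omega.sum()               # Eq. (18), softmax over the neighbors L_n
    v_u = omega @ F_rel                       # Eq. (19), shape (d,)
    return omega, v_u

rng = np.random.default_rng(0)
L, d = 5, 6
F_rel = rng.normal(size=(L, d))
omega, v_u = relation_level_attention(F_rel, rng.normal(size=d), 0.0)
```

The embedding `v_u` is a weighted average of the relation vectors, so neighbors with larger `omega` dominate the user's representation in the IRM layer.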

3.3 Information fusion

It is crucial to effectively merge the separately modeled multi-attribute information and implicit relationship embeddings. In this work, we investigated three methods to summarize the individual embeddings into a single embedding vector ( $\mathbf{U}_{u}$ for user $u$ and $\mathbf{I}_{i}$ for item $i$ ): element-wise sum, concatenation, and attention mechanism. The actual operation of these three methods is described in Table 1. We experimentally compared them in Section 4.

Table 1
Comparison of different message fusion methods

Method	Formula
Element-wise sum	$\mathbf{U}_{u}=\mathbf{z}_{u}+\mathbf{v}_{u}$
Concatenation	$\mathbf{U}_{u}=[\mathbf{z}_{u}\|\mathbf{v}_{u}]$
Attention	$\mathbf{A}_{u}=\textit{softmax}(\sigma(W_{a1}\cdot\mathbf{z}_{u}+W_{a2}\cdot\mathbf{v}_{u}))$ ; $\mathbf{U}_{u}=[\mathbf{z}_{u};\mathbf{v}_{u}]\cdot\mathbf{A}_{u}$
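
The three fusion options in Table 1 can be sketched as follows. For the attention variant, the sketch assumes that $W_{a1}$ and $W_{a2}$ have shape $2\times d$, so that $\mathbf{A}_{u}$ holds one weight per embedding; this shape, the `tanh` activation, and the function name `fuse` are assumptions rather than details given in the table.

```python
import numpy as np

def fuse(z_u, v_u, mode, W_a1=None, W_a2=None):
    """Fuse the Bipar-GCN embedding z_u and the IRM embedding v_u, both of shape (d,)."""
    if mode == "sum":
        return z_u + v_u                               # element-wise sum, shape (d,)
    if mode == "concat":
        return np.concatenate([z_u, v_u])              # concatenation, shape (2d,)
    if mode == "attention":
        a = np.tanh(W_a1 @ z_u + W_a2 @ v_u)           # two scores, shape (2,)
        a = np.exp(a - a.max())
        a = a / a.sum()                                # A_u, softmax over the two embeddings
        return np.stack([z_u, v_u], axis=1) @ a        # [z_u; v_u] . A_u, shape (d,)
    raise ValueError(f"unknown fusion mode: {mode}")

rng = np.random.default_rng(0)
d = 4
z_u, v_u = rng.normal(size=d), rng.normal(size=d)
W_a1, W_a2 = rng.normal(size=(2, d)), rng.normal(size=(2, d))
fused_sum = fuse(z_u, v_u, "sum")
fused_cat = fuse(z_u, v_u, "concat")
fused_att = fuse(z_u, v_u, "attention", W_a1, W_a2)
```

Note that concatenation doubles the embedding size, so the input layer of any downstream predictor has to be sized accordingly.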

3.4 Self-supervised graph learning

Figure 5.

A toy example of connectivity in a GCN model with two different data augmentation methods. Only node dropout is displayed in the overall framework; the other method is edge dropout.

SSL extracts auxiliary signals by means of node self-discrimination in the following two steps: reconstructing the graph structure to build different vector representations of a node under two views, and then comparing the embedded representations of the learned nodes.

3.4.1 Data enhancement on graph structure

Nowadays, SSL is widely used in the CV and NLP fields, where data augmentation methods such as rotation, blurring, and cropping produce good performance improvements. However, models built on graph neural networks cannot directly replicate these approaches when performing data enhancement, for the following reasons:

(1)
A graph convolutional neural network operates on a topological structure, which differs from the regular Euclidean structure in CV.
(2)
Each data instance is processed independently in CV, while the nodes of a graph built from user-item interactions are not independent and are connected to one another.

The bipartite graph is constructed based on users’ actual ratings of items, and the connections between the nodes contain rich collaborative filtering information. Hence, mining this inherent graph structure information is helpful for representation learning. Inspired by recent work on SGL [36], we attempt two approaches on bipartite graphs of user-item interactions, as shown in Fig. 5. A node-dropout or edge-dropout approach is taken to generate two related views $s_{1}$ and $s_{2}$ independently of each other. The same drop rate is used during the construction of both views to ensure fair contrastive learning. The specific operations are as follows:

•
Node dropout: Nodes in the graph are randomly discarded with probability ${p}$ , together with the edges connected to them. The operation of building the views is as follows:

$\displaystyle s_{1}(\mathcal{G})=\left(\mathbf{M}^{\prime}\odot\mathcal{I},\mathcal{E}\right),\quad s_{2}(\mathcal{G})=\left(\mathbf{M}^{\prime\prime}\odot\mathcal{I},\mathcal{E}\right),$ (20)

where $\mathbf{M}^{\prime},\mathbf{M}^{\prime\prime}\in\{0,1\}^{|\mathcal{I}|}$ are mask vectors used to discard item nodes connected to the central node, thus generating different views of the user node. With node dropout, the neighboring nodes that are more helpful for learning the central node representation can be identified, and the sensitivity of the model to changes in graph structure can be reduced.
•
Edge dropout: Edges in the graph are randomly discarded with probability ${p}$ . The operation of creating the views is as follows:

$\displaystyle s_{1}(\mathcal{G})=\left(\mathcal{I},\mathbf{M}_{1}\odot\mathcal{E}\right),\quad s_{2}(\mathcal{G})=\left(\mathcal{I},\mathbf{M}_{2}\odot\mathcal{E}\right)$ (21)

where $\mathbf{M}_{1},\mathbf{M}_{2}\in\{0,1\}^{|\mathcal{E}|}$ are the mask vectors used to discard the edges connected to the central node in the graph. It is worth noting that the nodes in the graph are not removed in this process. Some noisy interactions can be removed by edge dropout, which also enhances the robustness of the model.
•
Node representation under different views: The operations on graph structure mentioned above need to be performed before each training epoch. Any one of the structure change methods needs to be uniquely specified before model training to generate the associated two views randomly. Under each of the two views, the corresponding user embedding representation is updated by aggregating the neighboring points on the graph. The generation is done as follows:

$\displaystyle\mathbf{{Z}_{u}}=H\left(\mathbf{{Z}_{i}},\mathcal{G}\right)$ (22)

where ${H}$ denotes the task of aggregating neighbors $\mathbf{Z_{i}}$ with attribute space awareness to update the central node representation $\mathbf{Z_{u}}$ on graph $\mathcal{G}$ , as described in Section 3.1. Throughout the process of data enhancement, no additional trainable weight matrix is introduced.
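
The two augmentation operators of Eqs. (20) and (21) can be sketched on an edge list as follows. Storing the graph as an array of (user, item) index pairs and the function names `node_dropout` and `edge_dropout` are assumptions made for illustration.

```python
import numpy as np

def node_dropout(edges, n_items, p, rng):
    """Drop each item node with probability p, together with its edges (Eq. (20))."""
    item_mask = rng.random(n_items) >= p        # mask in {0,1}^{|I|}; True means the item is kept
    return edges[item_mask[edges[:, 1]]]

def edge_dropout(edges, p, rng):
    """Drop each edge independently with probability p; nodes are kept (Eq. (21))."""
    edge_mask = rng.random(len(edges)) >= p     # mask in {0,1}^{|E|}
    return edges[edge_mask]

rng = np.random.default_rng(0)
# Toy interaction list of (user index, item index) pairs.
edges = np.array([[0, 0], [0, 1], [1, 1], [2, 2], [2, 3], [3, 0]])
view_1 = edge_dropout(edges, p=0.3, rng=rng)    # two independently sampled views
view_2 = edge_dropout(edges, p=0.3, rng=rng)    # built with the same drop rate
node_view = node_dropout(edges, n_items=4, p=0.3, rng=rng)
```

Calling the same operator twice with the same rate `p` gives the two related views; in training, the views would be resampled before each epoch as described above.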

3.4.2 Contrast learning loss

The user node representations generated under the two augmented views are used to construct positive and negative samples for contrastive learning. We treat the representations $\mathbf{z}_{u}^{\prime},\mathbf{z}_{u}^{\prime\prime}$ of the same user node under the two views as a positive sample pair, and the representations $\mathbf{z}_{u}^{\prime},\mathbf{z}_{v}^{\prime\prime}$ of different user nodes under the two views as a negative sample pair. The auxiliary self-supervised task encourages the representations in a positive pair to be as consistent as possible, while enforcing divergence between the representations in a negative pair. Formally, we follow SimCLR [40] and adopt the contrastive loss InfoNCE [41]. The contrastive learning loss on the user side is constructed as follows:

$\displaystyle\mathcal{L}_{ssl}^{user}=\sum_{u\in\mathcal{U}}-\log\frac{\exp\left(s\left(\mathbf{z}_{u}^{\prime},\mathbf{z}_{u}^{\prime\prime}\right)/\tau\right)}{\sum_{v\in\mathcal{U}}\exp\left(s\left(\mathbf{z}_{u}^{\prime},\mathbf{z}_{v}^{\prime\prime}\right)/\tau\right)}$ (23)

where ${s}$ measures the similarity between two node representations and is set to the cosine similarity function, and ${\tau}$ is the temperature parameter in softmax, which can be tuned to keep model training from falling into a local optimum. The self-supervised loss on the item side is obtained in the same way, and we combine the losses on both sides to obtain the final loss of the self-supervised task.
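
As a rough illustration of Eq. (23), the following sketch evaluates the user-side loss for small embedding lists. It assumes all embeddings are nonzero vectors and uses a log-sum-exp shift for numerical stability; the toy vectors and the value of the temperature are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity s(a, b); assumes both vectors are nonzero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(view1, view2, tau):
    """Eq. (23): for each user u, the positive pair is (view1[u], view2[u]);
    every (view1[u], view2[v]) enters the softmax denominator."""
    loss = 0.0
    for u, z_u in enumerate(view1):
        logits = [cosine(z_u, z_v) / tau for z_v in view2]
        top = max(logits)  # shift for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        loss += log_denominator - logits[u]
    return loss

# Hypothetical embeddings of three users under two views.
z1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
z2_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
z2_shuffled = [z2_aligned[1], z2_aligned[2], z2_aligned[0]]

loss_aligned = info_nce(z1, z2_aligned, tau=0.2)
loss_shuffled = info_nce(z1, z2_shuffled, tau=0.2)
```

When the two views of each user agree (`z2_aligned`), the loss is lower than when the views are mismatched (`z2_shuffled`), which is the behavior the auxiliary task rewards.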

Remark: We described the user representation learning process in detail here. Because item representation learning is the dual process, we omit it for brevity.

3.5 Rating prediction

So far, we have obtained the final embeddings of user $u$ and item $i$ from the user part and the item part separately (i.e., $\mathbf{U}_{u}$ and $\mathbf{I}_{i}$). We concatenate them and pass the result through an MLP to predict the rating $r^{\prime}_{ui}$ from $u$ to $i$ as:

$\displaystyle\begin{aligned} g_{1}&=[\mathbf{U}_{u}\oplus\mathbf{I}_{i}],\\ g_{2}&=\sigma(W_{2}\cdot g_{1}+b_{2}),\\ &\;\;\vdots\\ g_{l}&=\sigma(W_{l}\cdot g_{l-1}+b_{l}),\\ r^{\prime}_{ui}&=w^{T}\cdot g_{l},\end{aligned}$ (24)

where $l$ is the index of a hidden layer.
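
The forward pass of Eq. (24) can be sketched in a few lines. The example assumes ReLU for $\sigma$ (the activation reported in the parameter settings) and uses hand-picked toy weights so the result can be checked by hand; none of the numbers come from the trained model.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]

def affine(W, b, g):
    """Compute W . g + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, g)) + bias for row, bias in zip(W, b)]

def predict_rating(user_emb, item_emb, layers, w_out):
    g = list(user_emb) + list(item_emb)   # g_1 = [U_u (+) I_i], concatenation
    for W, b in layers:                   # g_k = sigma(W_k . g_{k-1} + b_k)
        g = relu(affine(W, b, g))
    return sum(w * x for w, x in zip(w_out, g))  # r'_ui = w^T . g_l

# Hypothetical 2-d embeddings and a single hidden layer of width 3.
user_emb = [1.0, 2.0]
item_emb = [0.5, -1.0]
layers = [(
    [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, -1.0, 0.0]],
    [0.5, 0.0, 0.0],
)]
w_out = [2.0, 3.0, 4.0]

rating = predict_rating(user_emb, item_emb, layers, w_out)
```

Here the hidden layer yields [1.5, 1.0, 0.0] after ReLU, so the predicted rating is 2(1.5) + 3(1.0) + 4(0.0) = 6.0.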

3.6 Multi-task joint training

Since our final task is rating prediction, the primary goal of training is to minimize the difference between the predicted rating and the ground truth:

$\displaystyle\mathcal{L}_{r}=\frac{1}{2|\mathcal{O}|}\sum_{(u,i)\in\mathcal{O}}(r^{\prime}_{ui}-r_{ui})^{2},$ (25)

where $\mathcal{O}$ is the set of observed ratings, and $r_{ui}$ is the ground truth rating given by user $u$ to item $i$.

To alleviate overparametrization and overfitting, we apply $L_{0}$ regularization [42] to our objective function. By sparsifying the multi-attribute extraction matrices $\mathbf{W}$ and $\mathbf{Q}$, we avoid unnecessary resource usage and alleviate overfitting, because irrelevant degrees of freedom are pruned away. To further improve the performance of the model, we add the contrastive learning loss to the joint training task. The final objective function is as follows:

$\displaystyle\mathcal{L}_{ssl}=\mathcal{L}_{ssl}^{user}+\mathcal{L}_{ssl}^{item},$ (26)

$\displaystyle\min_{\Theta}\mathcal{L}=\mathcal{L}_{r}+\lambda_{1}\|\theta\|_{0}+\lambda_{2}\mathcal{L}_{ssl}$

where $\Theta$ denotes the model parameter set, $\theta=\{\mathbf{W},\mathbf{Q}\}$, and $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters that control the strengths of the sparse regularization and the SSL task, respectively.
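
To make the composition of the objective concrete, the sketch below evaluates it for toy inputs. It counts nonzero entries to obtain $\|\theta\|_{0}$ exactly, which is only suitable for evaluating the objective; as we understand it, the cited $L_{0}$ method [42] trains with a differentiable surrogate, which is not reproduced here. All numeric values are illustrative.

```python
def rating_loss(predicted, truth):
    """Eq. (25): squared error averaged over observed ratings, halved."""
    n = len(truth)
    return sum((p - r) ** 2 for p, r in zip(predicted, truth)) / (2 * n)

def l0_norm(matrices):
    """Number of nonzero entries across the given matrices (lists of rows)."""
    return sum(1 for M in matrices for row in M for x in row if x != 0.0)

def joint_loss(predicted, truth, W, Q, ssl_user, ssl_item, lam1, lam2):
    """L = L_r + lam1 * ||theta||_0 + lam2 * (L_ssl^user + L_ssl^item)."""
    l_ssl = ssl_user + ssl_item
    return rating_loss(predicted, truth) + lam1 * l0_norm([W, Q]) + lam2 * l_ssl

# Hypothetical values: two observed ratings, tiny W and Q, given SSL losses.
total = joint_loss(
    predicted=[4.0, 3.0], truth=[5.0, 3.0],
    W=[[0.5, 0.0], [0.0, 0.0]], Q=[[1.0, 2.0]],
    ssl_user=1.0, ssl_item=2.0, lam1=0.01, lam2=0.1,
)
```

With these inputs the three terms are 0.25, 0.01 x 3 = 0.03 and 0.1 x 3.0 = 0.3, giving a total of 0.58.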

4. Experiments

Experiments on three real-world datasets were performed to evaluate our model, and the ablation studies on each proposed component were also conducted. Further, experiments were carried out to illustrate the influence of different information fusion methods on the results.

4.1 Datasets and evaluation metrics

We conducted extensive experiments on three real datasets: MovieLens, Amazon, and Yelp, which are publicly accessible and vary in terms of domain, size, and sparsity.

•
MovieLens-100K: A widely adopted benchmark dataset for movie recommendation, which contains 100,000 ratings from 943 users on 1,682 movies. Its sparsity, measured here as the fraction of user-item pairs with an observed rating, is 0.06304.
•
Amazon: A widely used product recommendation dataset, which contains 65,170 ratings from 1,000 users on 1,000 items; the sparsity is 0.06517.
•
Yelp: A local business recommendation dataset, which contains 30,838 ratings from 1,286 users on 2,614 items; the sparsity is 0.00917.
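
The sparsity values listed above are consistent with the ratio of observed ratings to all possible user-item pairs, so a lower value means a sparser dataset. A short check using only the counts given in the list (the helper name is ours):

```python
def observed_fraction(num_ratings, num_users, num_items):
    """Fraction of the user-item matrix that carries an observed rating."""
    return num_ratings / (num_users * num_items)

# (ratings, users, items) as stated in the dataset descriptions.
datasets = {
    "MovieLens-100K": (100_000, 943, 1_682),
    "Amazon": (65_170, 1_000, 1_000),
    "Yelp": (30_838, 1_286, 2_614),
}

fractions = {name: observed_fraction(*stats) for name, stats in datasets.items()}
```

The computed fractions agree with the reported figures to about five decimal places, and Yelp is roughly seven times sparser than the other two datasets.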

We randomly selected 80% of historical ratings for each dataset as the training set and treated the remaining as the test set.

For all experiments, we evaluated our model and the baselines with two widely used metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
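
For reference, the two metrics can be computed as in the following sketch; the prediction and ground-truth lists are invented for illustration.

```python
import math

def rmse(predicted, truth):
    """Root mean squared error over paired predictions and ratings."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, truth)) / len(truth))

def mae(predicted, truth):
    """Mean absolute error over paired predictions and ratings."""
    return sum(abs(p - r) for p, r in zip(predicted, truth)) / len(truth)

# Hypothetical predictions for four test ratings.
predicted = [3.5, 4.0, 2.0, 5.0]
truth = [4.0, 4.0, 1.0, 3.0]

test_rmse = rmse(predicted, truth)  # sqrt(5.25 / 4)
test_mae = mae(predicted, truth)    # 3.5 / 4
```

RMSE penalizes large errors more heavily than MAE, which is why the single error of 2.0 in this example pushes RMSE above MAE.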
4.2 Baseline algorithms

We studied the performance of the following models. Matrix factorization methods: PMF[5], BiasMF[6] and LLORMA-Local[11]; autoencoder-based methods: AUTOREC[12] and CF-NADE[13]; graph convolutional network based collaborative filtering models: GC-MC[7], LightGCN[8] and MCCF[9]. For AUTOREC and CF-NADE, we report the item-based setting, which performs better than the user-based setting.

Table 2
Performance comparison of rating predictions. The smaller the value, the better the performance. Bold numbers represent the performance of our model on each dataset, and underlined numbers represent data for the best performing model on each dataset

Models	PMF	BiasMF	LLORMA	AUTOREC	CF-NADE	GC-MC	LightGCN	MCCF	MIS-CF
Yelp
RMSE	0.3967	0.3902	0.3890	0.3817	0.3857	0.3850	0.3721	0.3806	0.3397
MAE	0.1571	0.1616	0.1547	0.1201	0.1427	0.1354	0.0997	0.1029	0.0921
Amazon
RMSE	0.9339	0.9028	0.9019	0.9213	0.8987	0.8946	0.8898	0.8876	0.8709
MAE	0.7113	0.6759	0.6725	0.7064	0.6565	0.6619	0.6521	0.6428	0.6314
Movielens
RMSE	0.9638	0.9257	0.9313	0.9435	0.9229	0.9145	0.9092	0.9070	0.8925
MAE	0.7559	0.7258	0.7286	0.7370	0.7168	0.7160	0.7072	0.7050	0.6943

4.3 Parameter settings

We randomly initialized the model parameters with a Gaussian distribution $\mathcal{N}(0,0.1)$ and used Adam as the optimizer. For the hyper-parameters specific to MIS-CF, we tuned the batch size, the learning rate, the dropout rate and the coefficient of SSL within the ranges of {64,128,256}, {0.0005,0.001,0.002}, {0.1,0.2,0.4,0.5} and {0.005,0.01,0.05,0.1,0.5}, respectively. The parameters for $L_{0}$ regularization were set according to [42]. We varied the number of attribute spaces in the range of {1,2,3,4}. In the Attention-based Memory Module, the number of memory slices $M$ was set to 4 for Yelp and 8 for Amazon and MovieLens. For the neural network, we empirically employed two layers for all the neural parts and used ReLU as the activation function. The model was implemented in PyTorch, and the embedding dimension was tested in {16,32,64,128,256,512} for different experiments. All the baselines were initialized as in the corresponding papers, and for the neural network models we used the same embedding dimension for a fair comparison. They were then carefully tuned to achieve optimal performance.
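
The search space for the MIS-CF-specific hyper-parameters can be written down as a small configuration. The sketch assumes an exhaustive grid search, which the text does not state explicitly, and the dictionary keys are our own names.

```python
from itertools import product

# Candidate values as listed in the parameter settings.
search_space = {
    "batch_size": [64, 128, 256],
    "learning_rate": [0.0005, 0.001, 0.002],
    "dropout_rate": [0.1, 0.2, 0.4, 0.5],
    "ssl_coefficient": [0.005, 0.01, 0.05, 0.1, 0.5],
}

def grid(space):
    """Enumerate every combination of the candidate values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]

configs = grid(search_space)
num_configs = len(configs)  # 3 * 3 * 4 * 5 combinations
```

Under this assumption there are 180 configurations per dataset, before the number of attribute spaces and the embedding dimension are varied.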

4.4 Comparison with baselines

Table 2 reports the overall performance compared with baselines. Each result was the average performance from 5 runs with random initializations. From the results, we make the following observations:

•
The proposed model consistently outperforms all the baselines, indicating the effectiveness of MIS-CF for recommendation. More precisely, MIS-CF improves over the strongest baseline with respect to RMSE by 8.71%, 1.88%, and 1.60% on Yelp, Amazon and MovieLens, respectively. For MAE, MIS-CF outperforms the strongest baseline by 7.62%, 1.77%, and 1.52%, respectively. This shows that MIS-CF can better predict ratings by using multiple graphs and multiple attention mechanisms to leverage latent information.
•
The performance of MIS-CF on the Yelp dataset is significantly improved, even though the Yelp dataset is sparse. This suggests that information can be better captured by adding multiple graphs, thereby effectively alleviating the sparsity problem in collaborative filtering.
•
AUTOREC, CF-NADE, GC-MC, and MCCF generally outperform PMF, BiasMF, and LLORMA-Local, indicating the power of neural network models. Meanwhile, among these baselines, the GNN-based models perform better overall than the other models, which suggests that GNNs are well suited to graph-structured data.
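
The improvement percentages quoted in the first observation can be reproduced from Table 2 as the relative error reduction over the best baseline, (best baseline - MIS-CF) / best baseline. The sketch below copies the values from Table 2; the function name is ours.

```python
# Baseline order: PMF, BiasMF, LLORMA, AUTOREC, CF-NADE, GC-MC, LightGCN, MCCF.
table2 = {
    ("Yelp", "RMSE"): ([0.3967, 0.3902, 0.3890, 0.3817, 0.3857, 0.3850, 0.3721, 0.3806], 0.3397),
    ("Yelp", "MAE"): ([0.1571, 0.1616, 0.1547, 0.1201, 0.1427, 0.1354, 0.0997, 0.1029], 0.0921),
    ("Amazon", "RMSE"): ([0.9339, 0.9028, 0.9019, 0.9213, 0.8987, 0.8946, 0.8898, 0.8876], 0.8709),
    ("Amazon", "MAE"): ([0.7113, 0.6759, 0.6725, 0.7064, 0.6565, 0.6619, 0.6521, 0.6428], 0.6314),
    ("MovieLens", "RMSE"): ([0.9638, 0.9257, 0.9313, 0.9435, 0.9229, 0.9145, 0.9092, 0.9070], 0.8925),
    ("MovieLens", "MAE"): ([0.7559, 0.7258, 0.7286, 0.7370, 0.7168, 0.7160, 0.7072, 0.7050], 0.6943),
}

def relative_improvement(baseline_errors, ours):
    """Percentage error reduction of our model over the best (lowest) baseline."""
    best = min(baseline_errors)
    return 100.0 * (best - ours) / best

improvements = {
    key: round(relative_improvement(baselines, ours), 2)
    for key, (baselines, ours) in table2.items()
}
```

The strongest baseline is LightGCN on Yelp and MCCF on Amazon and MovieLens, and the resulting percentages match those reported in the text.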

4.5 Ablation analysis

We performed ablation analysis on the Yelp and Amazon datasets, starting with only the Bipar-GCN layer, then adding the self-supervised learning task without the IRM layer, then adding the IRM layer without the memory module, and finally using the complete model. Table 3 illustrates the performance contribution of each component. The embedding size is 128 for all ablation experiments. We also compare against the $d=$ 64 baselines because they outperform the $d=$ 128 versions.

Table 3
Ablation analysis

Architecture	Amazon
	RMSE	MAE
Best baseline ( $d=$ 64)	0.8876	0.6428
Best baseline ( $d=$ 128)	0.9062	0.6545
Bipar-GCN	0.8846	0.6417
Bipar-GCN $+$ SSL	0.8732	0.6324
Bipar-GCN $+$ IRM(without memory model)	0.8781	0.6379
MIS-CF ( $d=$ 128)	0.8709	0.6314

We make the following observations:

•
All the main components of our proposed model (Bipar-GCN, the IRM layer, the memory model, and SSL) contribute to its performance.
•
Even if the aggregation method is not changed, adding multiple layers or a joint self-supervised learning task improves performance. This indicates that making full use of multi-graph information and the structure information contained in the original graph enhances the original model.
•
Our model performs better than the baselines at larger embedding sizes.
•
Combining all components leads to further improvement, indicating that the different aggregation methods capture different information about users, items, and user-item relationships.

4.6 Effect of different information fusion methods

After obtaining two embeddings from different components, we compared three methods of fusing them into a single vector: element-wise sum, concatenation, and attention. Table 4 shows the experimental results for Yelp and Amazon. Attention performs better than summation and concatenation. It provides additional flexibility and may enable the model to recognize more valuable information.

Table 4
Comparison of different information fusion methods

	Yelp		Amazon
	RMSE	MAE	RMSE	MAE
Element-wise sum	0.3487	0.0945	0.8864	0.6423
Concatenation	0.3510	0.0992	0.8882	0.6483
Attention	0.3397	0.0921	0.8709	0.6314
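
The three fusion methods compared in Table 4 can be contrasted with a small sketch. The sum and concatenation variants are standard; the attention variant shown here is only one plausible form (softmax weights from a dot product with a query vector), and the `query` vector is a hypothetical stand-in for whatever learned parameters the model actually uses.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_sum(a, b):
    """Element-wise sum: output keeps the embedding dimension."""
    return [x + y for x, y in zip(a, b)]

def fuse_concat(a, b):
    """Concatenation: output dimension is doubled."""
    return list(a) + list(b)

def fuse_attention(a, b, query):
    """Weighted sum with softmax weights; `query` is an assumed learned vector."""
    scores = [dot(query, a), dot(query, b)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    w_a, w_b = exps[0] / total, exps[1] / total
    return [w_a * x + w_b * y for x, y in zip(a, b)], (w_a, w_b)

# Hypothetical 2-d embeddings from the two components.
emb_a = [1.0, 2.0]
emb_b = [3.0, 4.0]
fused, weights = fuse_attention(emb_a, emb_b, query=[0.0, 0.0])
```

With a zero query the weights are equal and attention reduces to an average; a trained query can shift weight toward the more informative embedding, which is the extra flexibility that sum and concatenation lack.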

4.7 Performance of different dropout strategies

To verify the performance of different data augmentation methods, we selected the sparse Yelp dataset and the denser Amazon dataset. We compared the node-dropout and edge-dropout methods, and the results are shown in Table 5. The overall performance of the edge-dropout approach is better than that of the node-dropout approach on both datasets, and the gap is larger on Yelp. We attribute this to the edge-dropout approach being better able to capture the inherent patterns of the graph structure. Compared with edge dropout, node dropout may lose more of the information contained in the original bipartite graph.

Table 5
Comparison of different dropout strategies

	Yelp		Amazon
	RMSE	MAE	RMSE	MAE
Node dropout	0.34210	0.1007	0.8768	0.6352
Edge dropout	0.33971	0.0921	0.8709	0.6314

4.8 Robustness to noisy interactions

To verify the improvement in model robustness brought by the auxiliary SSL task, we conducted robustness experiments on the Yelp dataset. Specifically, we added a certain proportion of adversarial samples to the training set (e.g., artificially adding ratings to negative user-item interactions) and kept the test set unchanged. We conducted a comparative analysis at 5%, 10%, and 15% noise rates, and the results are shown in Fig. 6. The results show that the model trained with self-supervised learning is less sensitive to noisy data. Especially when the amount of noisy data is large, MIS-CF is much more stable than MI-CF (MIS-CF without the SSL module).
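
One way to construct such noisy training sets is sketched below. It assumes that the noise rate is measured relative to the size of the training set and that fake ratings are drawn uniformly from the rating scale; the text specifies neither detail, so both are illustrative choices, as are the function name and the toy data.

```python
import random

def inject_noise(train, num_users, num_items, noise_rate, rating_values, seed):
    """Append fake ratings on user-item pairs that have no observed rating.
    `train` is a list of (user, item, rating) triples; the input is not modified."""
    rng = random.Random(seed)
    observed = {(u, i) for u, i, _ in train}
    free_pairs = num_users * num_items - len(observed)
    remaining = min(int(round(noise_rate * len(train))), free_pairs)
    noisy = list(train)
    while remaining > 0:
        u, i = rng.randrange(num_users), rng.randrange(num_items)
        if (u, i) in observed:
            continue  # only unobserved (negative) interactions receive noise
        observed.add((u, i))
        noisy.append((u, i, rng.choice(rating_values)))
        remaining -= 1
    return noisy

# Hypothetical training set: 10 users, 10 items, 20 ratings on a 1-5 scale.
train = [(u, (3 * u + k) % 10, float(1 + (u + k) % 5)) for u in range(10) for k in range(2)]
noisy_train = inject_noise(train, 10, 10, noise_rate=0.10,
                           rating_values=[1.0, 2.0, 3.0, 4.0, 5.0], seed=0)
```

The test set is left untouched, so any degradation measured afterwards reflects how strongly the injected interactions distort the learned representations.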

Figure 6. Impact of different noise ratios on Yelp.

4.9 Hyperparameter analysis

Figure 7. Impact of the number of latent attribute spaces on three real datasets.

Figure 8. Impact of embedding dimensions on three real datasets.

4.9.1 The number of Attribute Spaces

We varied the number of attribute spaces within {1,2,3,4} while keeping the other parameters fixed. Figure 7 shows the experimental results on the three real datasets. The number of attribute spaces strongly affects the results, and the optimal number depends on the dataset. For Yelp, the bipartite graph is sparser than the others, and most of the ratings take only two levels, so one attribute space is enough to model the latent attributes. For Amazon and MovieLens, the graphs are much denser, with an even distribution of ratings, so the advantages of multiple attribute spaces become apparent. Once the number of attribute spaces reaches the value that gives the best performance, increasing it further degrades performance, which may be related to overfitting.

4.9.2 The embedding dimensions

The embedding dimension $d$ is one of the critical parameters that affect the performance and capacity of the model. The effect of different embedding dimensions on performance is shown in Fig. 8. As $d$ increases, the performance of the recommendation model initially improves because a larger dimension enhances the representation ability. However, once $d$ exceeds the optimal value, performance decreases. An appropriate embedding dimension $d$ therefore needs to be chosen to obtain the best performance.

5. Conclusion

We proposed a novel recommendation model, MIS-CF, which models multi-attribute and implicit relationship factors with self-supervised learning for collaborative filtering recommender systems. Our idea is to explore the two factors that affect users' purchase motivation in order to reveal the fine-grained factors behind the interactions. Meanwhile, the performance and robustness of the model are further improved by an auxiliary task for joint training, which makes full use of the information hidden in the graph structure. Firstly, the user-item bipartite graph is used to model the multi-attributes of users and items, and the latent semantics of specific user-item pairs are encoded and represented as attribute spaces. Secondly, the user-user and item-item graphs are explicitly modeled, and implicit relationships are modeled at a fine-grained level by using the memory attention network and relational attention. Three embeddings are learned from two perspectives, which significantly increases the representation capability and reflects fine-grained user preferences. Thirdly, considering the necessity of mining graph structure information, we construct the embedding representations of nodes under different views of the user-item bipartite graph. Different views of each node are generated through graph structure transformations that fall into two categories: dropping edges and dropping nodes. Comparing the representations under the two views, we maximize the distance between the members of each negative sample pair while enforcing consistency within each positive sample pair to learn a better representation.

Extensive experiments on three datasets also demonstrated the effectiveness of our approach, and an ablation study quantitatively verified that each component made a significant contribution. We will integrate auxiliary information to further improve optimization efficiency in the future.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 62072060, 72074036); this work is also partly funded by the China Postdoctoral Science Foundation (2020M673145) and the Program for Innovation Research Groups at Institutions of Higher Education in Chongqing (CXQT21032).

References

1. Ying, Chen, Eksombatchai, Hamilton W.L. and Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 974–983.
2. Sun, Zhang, Coates, Guo, Tang and He, Multi-graph convolution collaborative filtering, in: 2019 IEEE International Conference on Data Mining (ICDM), IEEE, 2019, pp. 1306–1311.
3. Cheng H.-T., Koc, Harmsen, Shaked, Chandra, Aradhye, Anderson, Corrado, Chai, Ispir, Wide & deep learning for recommender systems, in: Proceedings of the 1st workshop on deep learning for recommender systems, 2016, pp. 7–10.
4. Schafer J.B., Frankowski, Herlocker and Sen, Collaborative filtering recommender systems, in: The adaptive web, Springer, 2007, pp. 291–324.
5. Mnih and Salakhutdinov R.R., Probabilistic matrix factorization, Advances in neural information processing systems 20 (2007), 1257–1264.
6. Koren, Bell and Volinsky, Matrix factorization techniques for recommender systems, Computer 42(8) (2009), 30–37.
7. Berg R.v.d., Kipf T.N. and Welling, Graph convolutional matrix completion, arXiv preprint arXiv:1706.02263, (2017).
8. Deng, Wang, Zhang and Wang, Lightgcn: Simplifying and powering graph convolution network for recommendation, in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 2020, pp. 639–648.
9. Wang, Shi, Song and Li, Multi-component graph convolutional collaborative filtering, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 6267–6274.
10. Koren, Bell and Volinsky, Matrix factorization techniques for recommender systems, Computer 42(8) (2009), 30–37.
11. Lee, Kim, Lebanon and Singer, Local low-rank matrix approximation, in: International conference on machine learning, PMLR, 2013, pp. 82–90.
12. Sedhain, Menon A.K., Sanner and Xie, Autorec: Autoencoders meet collaborative filtering, in: Proceedings of the 24th international conference on World Wide Web, 2015, pp. 111–112.
13. Zheng, Tang, Ding and Zhou, A neural autoregressive approach to collaborative filtering, in: International Conference on Machine Learning, PMLR, 2016, pp. 764–773.
14. Gori, Monfardini and Scarselli, A new model for learning in graph domains, in: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2, IEEE, 2005, pp. 729–734.
15. Veličković, Cucurull, Casanova, Romero, Lio and Bengio, Graph attention networks, arXiv preprint arXiv:1710.10903, (2017).
16. Kipf T.N. and Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907, (2016).
17. Bruna, Zaremba, Szlam and LeCun, Spectral networks and locally connected networks on graphs, arXiv preprint arXiv:1312.6203, (2013).
18. Defferrard, Bresson and Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, arXiv preprint arXiv:1606.09375, (2016).
19. Duvenaud, Maclaurin, Aguilera-Iparraguirre, Gómez-Bombarelli, Hirzel, Aspuru-Guzik and Adams R.P., Convolutional networks on graphs for learning molecular fingerprints, arXiv preprint arXiv:1509.09292, (2015).
20. Atwood and Towsley, Diffusion-convolutional neural networks, in: Advances in neural information processing systems, 2016, pp. 1993–2001.
21. Wang, Shi, Wang, Cui and Yu P.S., Heterogeneous graph attention network, in: The World Wide Web Conference, 2019, pp. 2022–2032.
22. Cui, Kuang, Wang and Zhu, Disentangled graph convolutional networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 4212–4221.
23. Chen, Hong, Zhang and Wang, Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 27–34.
24. Sun, Zhang, Guo, Tang and Coates, Neighbor Interaction Aware Graph Convolution Networks for Recommendation, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 1289–1298.
25. Jin, Gao, Jin and Li, Multi-behavior recommendation with graph convolutional networks, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 659–668.
26. Wang, Feng and Chua T.-S., Neural graph collaborative filtering, in: Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval, 2019, pp. 165–174.
27. Bachman, Hjelm R.D. and Buchwalter, Learning representations by maximizing mutual information across views, arXiv preprint arXiv:1906.00910, (2019).
28. Chen, Kornblith, Norouzi and Hinton, A simple framework for contrastive learning of visual representations, in: International conference on machine learning, PMLR, 2020, pp. 1597–1607.
29. Hjelm R.D., Fedorov, Lavoie-Marchildon, Grewal, Bachman, Trischler and Bengio, Learning deep representations by mutual information estimation and maximization, arXiv preprint arXiv:1808.06670, (2018).
30. Zhai, Oliver, Kolesnikov and Beyer, S4l: Self-supervised semi-supervised learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1476–1485.
31. Devlin, Chang M.-W., Lee and Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, (2018).
32. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, in: Advances in neural information processing systems, 2013, pp. 3111–3119.
33. Gidaris, Singh and Komodakis, Unsupervised representation learning by predicting image rotations, arXiv preprint arXiv:1803.07728, (2018).
34. Fan, Xie and Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
35. Oord A.v.d. and Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748, (2018).
36. Wang, Feng, Chen, Lian and Xie, Self-supervised graph learning for recommendation, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 726–735.
37. Zhou, Wang, Zhao W.X., Zhu, Wang, Zhang, Wang and Wen J.-R., S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 1893–1902.
38. Yao, Cheng D.Z., Chen, Menon, Hong, Chi E.H., Tjoa and Kang, Self-supervised learning for deep models in recommendations, arXiv e-prints, (2020), arXiv–2007.
39. Xie, Sun, Liu, Gao, Ding and Cui, Contrastive Pre-training for Sequential Recommendation, arXiv e-prints, (2020), arXiv–2010.
40. Chen, Kornblith, Norouzi and Hinton, A simple framework for contrastive learning of visual representations, in: International conference on machine learning, PMLR, 2020, pp. 1597–1607.
41. Gutmann and Hyvärinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 297–304.
42. Louizos, Welling and Kingma D.P., Learning Sparse Neural Networks through $L_0$ Regularization, arXiv preprint arXiv:1712.01312, (2017).
