Abstract
Interactions between users and items can be naturally modeled as a user-item bipartite graph in recommender systems, and emerging research is devoted to exploiting user-item graphs for collaborative filtering. In reality, user-item interactions usually stem from more complex underlying factors, such as users’ specific preferences, and a user-item bipartite graph can be used to understand these differences in motivation. However, existing research has not clearly proposed and modeled the factors behind these differences and ignores the similarities between user pairs and item pairs, which prevents it from capturing fine-grained user preferences effectively. In addition, most GNN-based recommendation models have two further limitations: first, their accuracy depends on the number of observed interactions in the dataset; second, node representations are vulnerable to noisy interactions. This work develops a novel recommendation model called “Multi-Attribute and Implicit Relationship Factors With Self-Supervised Learning for Collaborative Filtering” (MIS-CF), which explicitly proposes and models multi-attribute and implicit relationship factors for collaborative filtering recommendation. Meanwhile, an auxiliary self-supervised learning task is designed to help the downstream task optimize the node representations. MIS-CF aggregates multi-attribute spaces through the user-item bipartite graph and establishes user-user and item-item graphs to model the similarity relationships of neighbor pairs through a memory model. The self-supervised learning task performs contrastive learning via self-discrimination, thus mining the rich auxiliary signals within the data and improving the accuracy and robustness of the model. Moreover, a sparse regularizer is used to alleviate overfitting.
Extensive experimental results on three public datasets not only show the significant performance and robustness gain of the proposed model but also prove the effectiveness and interpretability of fine-grained implicit factors modeling.
Introduction
With information overload, users want to obtain interesting information more efficiently on the Internet. Companies also hope that their products can attract and retain users to the greatest extent, thereby driving growth. Recommender systems generate personalized item recommendations and address the information overload problem. Because recommender systems have proven effective in practice, they have not only aroused great interest in academia [1, 2] but have also been extensively developed in industry [3].
A de facto solution for many modern recommender systems is the Collaborative Filtering (CF) technique. Its basic assumption is that people who made similar purchases in the past tend to make similar choices in the future [4]. The matrix factorization algorithm [5, 6] adds the concept of latent vectors to the CF algorithm. These vectors are inferred from the record of user-item interactions. However, only the characteristics of the user and the item are considered, and an explicit encoding of the user-item interaction is missing. In essence, this kind of user-item interaction information can be naturally modeled as a graph. The graph can contain more precise information and enhance the connectivity between users and items. More recently, graph convolutional neural networks have become one of the best-performing architectures for various graph learning tasks. GC-MC [7] uses two multi-link graph convolution layers to aggregate user features and item features. LightGCN [8] is a simplified yet powerful graph convolution network that efficiently aggregates the information of neighbor nodes from low order to high order without feature transformation or nonlinear activation.
Although these methods are effective, they still have three important limitations. First, they assume that users purchase items with a constant motivation. However, in the real world, the motivations behind a user’s decision-making are multiple. People’s personal attributes significantly affect their preferences: a person will prefer items with different attributes depending on his or her own attributes, which derive from personality, occupation, professional direction, etc. Therefore, methods that do not distinguish between purchase motives inevitably lose valuable fine-grained information. The recent work MCCF[9] captures more complex interaction characteristics by setting up multiple components, but it still does not use more comprehensive and effective information to provide interpretable suggestions. Secondly, these methods only consider the characteristics of the nodes in the bipartite graph and treat the graph in isolation, ignoring the user-user and item-item relationships outside the bipartite graph. Implicit relationships can be modeled through user-user and item-item graphs to reflect more complex interaction characteristics. Thirdly, the interactions observable in a real recommendation scenario are very sparse compared to the size of the entire dataset, so these models are limited to learning node embeddings of bipartite graphs from sparse data. More importantly, GNN-based models are sensitive to the structure of the constructed user-item bipartite graph. Any change in the graph structure (e.g., adding noisy interactions) can dramatically degrade model performance.
A toy example of purchase records with different purchasing motivations.
Figure 1 shows a toy example. Suppose the latent attributes of the user are ignored, and the difference in purchase motivation is not considered. In that case, the possibility of the user
In this paper, multi-attribute factors and implicit relationship factors are explicitly proposed and modeled. For a given user-item bipartite graph, multiple attributes are first extracted. Then a two-layer attention mechanism is used to distinguish the probability distribution over the attribute spaces. Finally, the attribute factors are modeled. At the same time, a sparse regularizer alleviates the overfitting caused by attribute factors that reflect similar motivations. For the implicit relationship part, user-user and item-item graphs are used separately. An attention-based memory module learns the specific relation vector between node pairs. Relation-level attention is then used to automatically select information-rich neighbors for preference modeling. We also enhance the robustness of the overall model by adding an auxiliary self-supervised learning task. Two different views are first generated by perturbing the graph structure of the initial user-item graph. Then an embedding of each node is generated under each of the two views. Finally, a contrastive learning loss is constructed to mine the internal auxiliary signals of the node data in the graph.
The contributions of this paper are as follows:
We propose MIS-CF, a new collaborative filtering method based on graph neural networks. It can capture the fine-grained hidden factors behind user behavior through attribute-level attention and implicit relationship aggregation.
A memory model is applied to separately constructed user-user and item-item graphs to capture the relationship information between neighbor pairs. MIS-CF learns all three graphs at the same time and unifies the multi-attribute and implicit relationship information through an information fusion layer to achieve end-to-end learning.
Self-supervised learning is applied as an auxiliary task for graph neural network-based recommendation, mining auxiliary signals within the graph by means of node self-discrimination. The obtained auxiliary signals further enrich node representation learning and make the model more robust to noisy interactions.
Extensive experiments on three public datasets are conducted to evaluate the proposed method. Experimental results show the effectiveness and interpretability of MIS-CF.
Graph neural networks
The classic collaborative filtering algorithm used to be the preferred model for recommender systems. The matrix factorization (MF) model was derived from collaborative filtering and has become the most popular of the various collaborative filtering methods. The early MF method[10] was designed to model the user’s explicit feedback by mapping users and items to a latent factor space so that the user-item relationship (rating) can be obtained through the dot product of latent factors. PMF[5] builds on regularized matrix factorization and introduces a probabilistic model for further optimization. BiasedMF[6] introduces a bias term and a regularization term to improve accuracy and alleviate overfitting, respectively. LLORMA-Local[11] uses different combinations of low-rank approximations to reconstruct rating matrix entries. With the widespread use of deep learning, neural MF models have emerged. For example, AutoRec[12] uses the idea of the autoencoder to encode the item or user vector and then uses the encoding to estimate the rating; CF-NADE[13] constructs a denoising autoencoder, which immediately eliminates part of the input space in each iteration. However, compared to MF, graphs have natural advantages in representing the rich pairwise relationship information in recommender systems. With the great achievements of Graph Neural Networks (GNNs) [14, 15, 16], more and more works try to apply GNNs in recommender systems to capture information better. In the early stages of development, Bruna et al. tried to migrate the convolution of Euclidean space to the spatial and frequency domains of the graph network[17]. The application of polynomial spectral filters and linear filters then dramatically reduced the computational cost[18, 16]. Along with spectral graph convolution, directly performing graph convolution in the spatial domain has also been investigated[19, 20].
After that, the attention mechanism is used to generate the weights of neighbor nodes[15], and the heterogeneous graph attention network is also used to distinguish the different importance of nodes and meta-paths in the convolution process[21]. DisenGCN[22] uses a novel neighborhood routing mechanism on the homogeneous graph to find the factors that may lead to an edge from a given node to one of its neighbors. LR-GCCF [23] used the residual method for user-item bipartite graph representation in the learning process. NIA-GCN [24] uses the heterogeneity of the user-item bipartite graph to model the relationship information between neighbor nodes. MBGCN [25] explores the importance of multiple behaviors for the recommendation. NGCF [26] established a user-item bipartite graph to gather high-order neighborhood information. GC-MC[7] uses two multi-link graph convolution layers to aggregate user features and item features. MCCF[9] is the first to explore using the edge information of bipartite graphs to construct multiple components to obtain multiple preferences of users. Multi-GCCF[2] adds multiple graphs to the model. These methods still ignore more fine-grained information and do not make full use of graph information to capture multi-level latent attribute information.
Self-supervised learning
Self-supervised learning (SSL) is a new learning paradigm that can learn auxiliary signals from raw data. It was first used in the fields of computer vision (CV) and natural language processing (NLP)[27, 28, 29, 30]. Current research on SSL can be roughly divided into two categories: generative models and contrastive models. Auto-encoding[31] is one of the most popular generative models[32], which enhances the robustness of the model by artificially adding noisy data. Contrastive models learn to compare via a Noise Contrastive Estimation (NCE) objective[33, 34, 35]. In view of the excellent performance of contrastive learning on graph representation learning, a growing body of work applies self-supervised learning to recommendation models based on graph neural networks. SGL[36] generates two-view representations of the nodes on the user-item bipartite graph and then constructs a contrastive learning objective to optimize the node representations.
The framework of MIS-CF. This example predicts the rating that user 
The overall framework of the proposed model is shown in Fig. 2. Firstly, the Bipartite Graph Convolutional Neural Network (Bipar-GCN) layer acts as an encoder to generate user and item embeddings by processing the user-item bipartite graph. Secondly, the Self-supervised Learning (SSL) module serves as an auxiliary task to mine the information contained in the bipartite graph itself, enriching the representations of users and items. Thirdly, the Implicit Relation Modeling (IRM) layer encodes additional latent information by constructing and processing two further graphs, which represent user-user similarity and item-item similarity, respectively. Fourth, the Attention-based Memory Module in the IRM layer assists in capturing latent relationship information by generating a relation vector
Bipartite graph convolutional neural networks
In the recommendation scenario, the users’ ratings of items can be modeled as a user-item bipartite graph with two types of nodes
Illustration of the Bipar-GCN module for the user part. The number of attribute spaces is a hyperparameter that should be set at the beginning.
Assume that
And it is same for the
In the following, we target user
In particular, the possibility of user
where
where
In this way, by aggregating all the item-specific attributes of
It can be seen from Eq. (5) that each
Similarly, considering that different attributes have different degrees of influence on users’ purchase motivation, an attribute space-level attention mechanism is needed to learn the importance of different attribute spaces.
Taking
where
where
where
According to each attribute space
In our hypothesis, there are multiple attributes for users and items, and these attributes affect the user’s preferences for items. Similarly, because there are similar decision-making patterns in human nature, common item preferences can be inferred through groups of people with similar attributes.
The addition of the Implicit Relation Modeling (IRM) layer could model similar users and items. It would capture the similar relationships between paired nodes by constructing and learning additional user-user and item-item graphs (multi-graphs). We select thresholds based on the cosine similarity that leads to an average degree of 20 for each graph.
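The threshold selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each user (or item) is represented by its row of ratings, and the helper names `cosine` and `build_similarity_graph` are hypothetical. Since an undirected graph with n nodes and average degree d has n·d/2 edges, the sketch keeps the n·d/2 most similar pairs and reports the similarity of the last kept pair as the implied threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length rating vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_similarity_graph(rating_rows, avg_degree):
    """Return (threshold, edges) of an undirected similarity graph.

    The threshold is chosen so that the graph has roughly the requested
    average degree: n * avg_degree / 2 edges for n nodes.
    """
    n = len(rating_rows)
    pairs = sorted(
        ((cosine(rating_rows[i], rating_rows[j]), i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    n_edges = min(len(pairs), round(n * avg_degree / 2))
    kept = pairs[:n_edges]
    threshold = kept[-1][0] if kept else 1.0
    edges = [(i, j) for _, i, j in kept]
    return threshold, edges


# Toy usage: four users, four items, target average degree of 1.
rows = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 1, 4, 5]]
threshold, edges = build_similarity_graph(rows, avg_degree=1)
print(edges)
```

The all-pairs comparison is quadratic in the number of nodes, which is acceptable for a sketch but would need an approximate nearest-neighbor search on larger datasets.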
Relation vector generation
Illustration of the attention-based memory module for the user part. The memory size is set to 4 in this example.
In most cases, users and similar users, items and similar items only have the same attributes in certain aspects. However, these implicit relationships between them cannot be directly reflected in the existing data. Inspired by recent advances in memory networks and attention mechanisms, we inserted an attention-based memory module to learn the relation vectors among similar users or items. The structure of this module is shown in Fig. 4. The memory matrix of this module is represented as
For a given pair of similar users, an operation is firstly applied to learn their joint embedding
where
Then the joint embedding
For user
Since different attribute spaces have different influences, the key matrix
The final attention scores are obtained by normalizing
Finally, in order to generate the relation vector, these attention scores were used to calculate a weighted representation of
The output is a specific relation vector
So far, there have been several relation vectors
Intuitively, if a similar user has more expertise in a class of items, i.e., a stronger attribute in that field, then that user’s choices have a greater impact on the target user. We therefore use a deep neural network to learn the weight of each individual relation vector
Formally, with relation vector
where
Then, the final relation-level attention is normalized with a softmax function to obtain the weight of each similar user:
where
With these learned weights, we can obtain the final embedding
It is crucial to merge these different embeddings effectively for the separately modeled multi-attribute information and implicit relationships. In this work we investigated three methods to summarize the individual embeddings into a single embedding vector(
Comparison of different message fusion methods
A toy example of connectivity in a GCN model with two different data augmentation methods. We only display node dropout in the overall framework; the other method is edge dropout.
SSL extracts auxiliary signals by means of node self-discrimination in the following two steps: reconstructing the graph structure to build different vector representations of a node under two views, and then comparing the embedded representations of the learned nodes.
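The comparison step can be illustrated with a contrastive objective. The sketch below uses an InfoNCE-style loss with cosine similarity and a temperature, which is an assumption on our part (it is the form commonly used in contrastive learning, not necessarily the exact loss of this paper). The function names and the default temperature are hypothetical. The two embeddings of the same node form the positive pair, and embeddings of other nodes in the second view act as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def info_nce(view1, view2, temperature=0.2):
    """Mean InfoNCE-style loss; view1[i] and view2[i] embed the same node."""
    total = 0.0
    for i, z in enumerate(view1):
        logits = [cosine(z, other) / temperature for other in view2]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        # Negative log-softmax of the positive pair (same node, other view).
        total += log_denominator - logits[i]
    return total / len(view1)


# Toy usage: consistent views yield a lower loss than mismatched views.
view1 = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
mismatched = [[0.1, 0.9], [0.9, 0.1]]
print(info_nce(view1, aligned) < info_nce(view1, mismatched))
```

Minimizing this quantity pulls the two views of a node together while pushing apart the representations of different nodes.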
Nowadays, SSL is widely used in the CV and NLP fields, where data augmentation methods such as rotation, blurring, and cropping yield good performance improvements. However, models built on graph neural networks cannot directly replicate these augmentation approaches for the following reasons:
The user-item graph processed by a graph convolutional neural network is a topological structure, unlike the regular Euclidean structure in CV. Each data instance is processed independently in CV, whereas the nodes of a graph built from user-item interactions are not independent but connected to one another.
The bipartite graph is constructed based on users’ actual ratings of items, and the connections between the nodes contain rich collaborative filtering information. Hence, mining this inherent graph structure information is helpful for representation learning. Inspired by the recent work SGL[36], we try two approaches on the user-item bipartite graph, as shown in Fig. 5. Node dropout or edge dropout is applied to generate two correlated views s1 and s2 independently of each other. The same drop rate is used when constructing both views so that the contrastive learning is fair. The equations for the specific operations are as follows:
Node dropout: Nodes in the graph are randomly discarded with probability
where
Edge dropout: Edges in the graph are randomly discarded with a certain probability
where
Node representation under different views: The graph structure operations described above are performed before each training epoch. One of the two structure perturbation methods must be specified before model training to randomly generate the two correlated views. Under each of the two views, the corresponding user embedding is updated by aggregating the neighboring nodes on the graph. The generation is done as follows:
where
The user node representations generated under the two augmented views are used to construct positive and negative samples for contrastive learning. We consider the user representation
where
So far, obtaining the final embeddings of user
where
Since our final task is rating prediction, the primary goal of training is to minimize the difference between the predicted rating and the ground truth:
where
It is worth noting that, to alleviate overparameterization and overfitting, we employ the
where
Experiments on three real-world datasets were performed to evaluate our model, and the ablation studies on each proposed component were also conducted. Further, experiments were carried out to illustrate the influence of different information fusion methods on the results.
Datasets and evaluation metrics
We conducted extensive experiments on three real datasets: MovieLens, Amazon, and Yelp, which are publicly accessible and vary in terms of domain, size, and sparsity.
MovieLens-100K: A widely adopted benchmark dataset in movie recommendation, which contains 100,000 ratings from 943 users on 1,682 movies; the sparsity is 0.06304.
Amazon: A widely used product recommendation dataset, which contains 65,170 ratings from 1,000 users on 1,000 items; the sparsity is 0.06517.
Yelp: A local business recommendation dataset, which contains 30,838 ratings from 1,286 users on 2,614 items; the sparsity is 0.00917.
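As a quick arithmetic check, the reported sparsity values agree with the ratio of observed ratings to all possible user-item pairs, that is, ratings / (users × items). The short sketch below recomputes this ratio from the counts listed above; the function name `density` is ours.

```python
def density(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that holds an observed rating."""
    return n_ratings / (n_users * n_items)


# (ratings, users, items) as reported for each dataset.
datasets = {
    "MovieLens-100K": (100_000, 943, 1_682),
    "Amazon": (65_170, 1_000, 1_000),
    "Yelp": (30_838, 1_286, 2_614),
}

for name, (n_ratings, n_users, n_items) in datasets.items():
    print(f"{name}: {density(n_ratings, n_users, n_items):.6f}")
```

Under this definition, a smaller value means a sparser interaction matrix, which is why Yelp is described as the sparsest of the three datasets.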
We randomly selected 80% of historical ratings for each dataset as the training set and treated the remaining as the test set.
For all experiments, we evaluated our model and the baselines with two widely used evaluation metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
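For reference, both metrics can be computed from paired lists of predicted and ground-truth ratings using their standard definitions. The function names and the toy ratings below are illustrative only.

```python
import math


def rmse(predicted, actual):
    """Root Mean Squared Error over paired ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, actual):
    """Mean Absolute Error over paired ratings."""
    absolute = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute) / len(absolute)


# Toy usage with three test ratings (errors: 0, -1, +2).
predicted = [3.0, 4.0, 5.0]
actual = [3.0, 5.0, 3.0]
print(rmse(predicted, actual), mae(predicted, actual))
```

Because RMSE squares each error before averaging, it penalizes large deviations more heavily than MAE, so the two metrics can rank models differently.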
We studied the performance of the following models. Matrix factorization methods: PMF[5], BiasMF[6], and LLORMA-Local[11]; autoencoder-based methods: AUTOREC[12] and CF-NADE[13]; graph convolutional network-based collaborative filtering models: GC-MC[7] and LightGCN[8]. In addition, we use the item-based setting of AUTOREC and CF-NADE, which performs better than the user-based setting.
Performance comparison of rating predictions. The smaller the value, the better the performance. Bold numbers represent the performance of our model on each dataset, and underlined numbers represent data for the best performing model on each dataset
We randomly initialized the model parameters with a Gaussian distribution
Comparison with baselines
Table 2 reports the overall performance compared with baselines. Each result was the average performance from 5 runs with random initializations. From the results, we make the following observations:
The proposed model consistently outperforms all the baselines, indicating the effectiveness of MIS-CF for recommendation. More precisely, MIS-CF improves over the strongest baselines in RMSE by 8.71%, 1.88%, and 1.60% on Yelp, Amazon, and MovieLens, respectively. In MAE, MIS-CF outperforms the strongest baselines by 7.62%, 1.77%, and 1.52%, respectively. This shows that MIS-CF can better predict ratings by using multiple graphs and multiple attention mechanisms to leverage latent information.
The performance of MIS-CF on the Yelp dataset improves significantly, even though the Yelp dataset is sparse. This illustrates that information can be better captured by adding multiple graphs, thereby effectively alleviating the sparsity problem in collaborative filtering.
AUTOREC, CF-NADE, GC-MC, and MCCF generally outperform PMF, BiasMF, and LLORMA-Local, indicating the power of neural network models. Meanwhile, among these baselines, GNN-based models perform better overall than the other models, which suggests that GNNs are well suited to graph-structured data.
We performed an ablation analysis on the Yelp and Amazon datasets, starting with only the Bipar-GCN layer, then adding the self-supervised learning task without the IRM layer, then adding the IRM layer without the memory module, and finally completing the IRM layer. Table 3 illustrates the performance contribution of each component. The embedding size is 128 for all ablation experiments. We compare to
Ablation analysis
We make the following observations:
All the main components of our proposed model, Bipar-GCN, IRM layer, Memory Model, and SSL, are demonstrated to be functional and effective.
Even if the aggregation method is not changed, adding multiple layers or a joint self-supervised learning task can significantly improve performance. This verifies that the full use of multi-graph information and structure information contained in the original graph considerably enhances the original model.
Our model performs better in larger embedding sizes compared to the baseline.
Combining all components leads to further improvement, indicating that the different aggregation methods capture different information about users, items, and user-item relationships more effectively.
After obtaining two embeddings from different components, we compared how the three methods for summarizing them into a single vector affect the results. Table 4 shows the experimental results for Yelp and Amazon. Attention performs better than summation and concatenation. Attention provides additional flexibility to the model and may enable it to recognize more valuable information.
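The three fusion strategies can be contrasted with a small sketch. Summation and concatenation follow their usual definitions. The attention variant is only one plausible form and is an assumption on our part: each embedding is scored by its dot product with a `query` vector (standing in for learned parameters), and the softmax-normalized scores weight the sum. All function names are hypothetical.

```python
import math


def fuse_sum(a, b):
    """Element-wise summation; the output keeps the input dimension."""
    return [x + y for x, y in zip(a, b)]


def fuse_concat(a, b):
    """Concatenation; the output dimension is the sum of both inputs."""
    return list(a) + list(b)


def fuse_attention(a, b, query):
    """Softmax-weighted sum; `query` is a stand-in for learned attention parameters."""
    scores = [sum(q * x for q, x in zip(query, emb)) for emb in (a, b)]
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shifted for numerical stability
    weights = [e / sum(exps) for e in exps]
    return [weights[0] * x + weights[1] * y for x, y in zip(a, b)]


# Toy usage with two 2-dimensional embeddings.
a, b = [1.0, 2.0], [3.0, 4.0]
print(fuse_sum(a, b))
print(fuse_concat(a, b))
print(fuse_attention(a, b, query=[0.0, 0.0]))  # zero query gives equal weights
```

Unlike summation, which weights both embeddings equally, and concatenation, which leaves the weighting to later layers, the attention variant can adapt the contribution of each embedding per node.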
Comparison of different information fusion methods
To verify the performance of different data augmentation methods, we selected the sparse Yelp dataset and the denser Amazon dataset. A comparison test was performed between the node-dropout and edge-dropout methods, and the final experimental results are shown in Table 5. The results show that the edge-dropout approach performs better overall than the node-dropout approach on both datasets. In particular, on Yelp, the edge-dropout approach yields a larger improvement than the node-dropout approach. We attribute this to the fact that edge dropout can better capture the inherent patterns of the graph structure. Compared with edge dropout, node dropout may lose more of the information contained in the original bipartite graph.
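The two augmentation strategies compared here can be sketched on an edge list. This is a minimal illustration, assuming the bipartite graph is stored as (user, item) pairs with distinct identifiers for users and items; the function names are hypothetical. It also shows why node dropout can lose more information: dropping one node removes every edge attached to it.

```python
import random


def edge_dropout(edges, rate, rng):
    """Keep each edge independently with probability 1 - rate."""
    return [edge for edge in edges if rng.random() >= rate]


def node_dropout(edges, rate, rng):
    """Drop each node with probability `rate`, together with all its edges."""
    nodes = sorted({node for edge in edges for node in edge})
    dropped = {node for node in nodes if rng.random() < rate}
    return [(u, i) for u, i in edges if u not in dropped and i not in dropped]


# Toy usage: two views are drawn independently with the same drop rate.
edges = [("u0", "i0"), ("u0", "i1"), ("u1", "i1"), ("u2", "i0"), ("u2", "i2")]
rng = random.Random(0)
view_1 = edge_dropout(edges, rate=0.2, rng=rng)
view_2 = edge_dropout(edges, rate=0.2, rng=rng)
print(len(view_1), len(view_2))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible while still producing two different views from successive calls.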
Comparison of different dropout strategies
To verify that the auxiliary SSL task improves model robustness, we conducted anti-interference experiments on the Yelp dataset. Specifically, we added a certain proportion of adversarial samples to the training set (e.g., artificially adding ratings to negative user-item interactions) and kept the test set unchanged. We conducted a comparative analysis at 5%, 10%, and 15% noise rates, and the results are shown in Fig. 6. The experimental results show that the model trained with self-supervised learning is less sensitive to noisy data. Especially when the amount of noisy data is large, MIS-CF is much more stable than MI-CF (MIS-CF without the SSL module).
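One way to construct such a noisy training set is sketched below. Several details are assumptions, since the text does not specify them: the noise rate is taken relative to the training-set size, the injected ratings are drawn uniformly from the values 1 to 5, and the function name `inject_noise` is hypothetical. Only unobserved user-item pairs receive artificial ratings, and the test set is not touched.

```python
import random


def inject_noise(train, n_users, n_items, noise_rate, rng,
                 rating_values=(1, 2, 3, 4, 5)):
    """Return the training triples plus artificial ratings on unobserved pairs.

    `train` is a list of (user, item, rating) triples. The number of injected
    ratings is noise_rate * len(train) (an assumption, see the lead-in).
    """
    observed = {(u, i) for u, i, _ in train}
    candidates = [(u, i) for u in range(n_users) for i in range(n_items)
                  if (u, i) not in observed]
    n_noise = min(len(candidates), round(noise_rate * len(train)))
    noisy = [(u, i, rng.choice(rating_values))
             for u, i in rng.sample(candidates, n_noise)]
    return train + noisy


# Toy usage: 20 observed ratings in a 10 x 10 matrix, 25% noise.
train = [(u, (u + k) % 10, 4) for u in range(10) for k in range(2)]
noisy_train = inject_noise(train, 10, 10, noise_rate=0.25, rng=random.Random(0))
print(len(train), len(noisy_train))
```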
Impact of different noise ratios on Yelp.
Impact of the number of latent attribute spaces on three real datasets.
Impact of embedding dimensions on three real datasets.
We varied the number of attribute spaces within {1,2,3,4} while keeping the other parameters the same. Figure 7 shows the experimental results on the three real datasets. The number of attribute spaces dramatically affects the results, and the optimal number depends on the specific dataset. For Yelp, the bipartite graph is sparser than the others, and most of the ratings take only two levels. Therefore, one attribute space is enough to model the latent attributes. For Amazon and MovieLens, the graphs are much denser, with an even distribution of ratings, so the advantages of multiple attribute spaces become apparent. Once the number of attribute spaces reaches the value with the best performance, further increases degrade the results, which may be related to overfitting.
The embedding dimensions
The embedding dimension
Conclusion
We proposed a novel recommendation model, MIS-CF, that models multi-attribute and implicit relationship factors with self-supervised learning for collaborative filtering recommender systems. Our idea is to explore the two factors that affect users’ purchase motivation in order to reveal the fine-grained factors behind the interactions. Meanwhile, the performance and robustness of the model are further improved by an auxiliary task for joint training, which makes full use of the information hidden in the graph structure. Firstly, the user-item bipartite graph is used to model the multiple attributes of users and items, and the latent semantics of specific user-item pairs are encoded and represented as attribute spaces. Secondly, the user-user and item-item graphs are explicitly modeled, and implicit relationships are modeled at a fine-grained level by using the memory attention network and relational attention. Three embeddings were learned from two perspectives, which significantly increased the representation capabilities and reflected fine-grained user preferences. Thirdly, considering the necessity of mining graph structure information, we construct embeddings of nodes under different views of the user-item bipartite graph. The views of each node are generated through a graph structure transformation, which falls into two categories: dropping edges and dropping nodes. Comparing the representations under the two views, we maximize the distance between negative sample pairs while enforcing consistency between positive sample pairs to learn better representations.
Extensive experiments on three datasets also demonstrated the effectiveness of our approach, and an ablation study quantitatively verified that each component made a significant contribution. We will integrate auxiliary information to further improve optimization efficiency in the future.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Grant No. 62072060, 72074036); this work is also partly funded by the China Postdoctoral Science Foundation (2020M673145) and the Program for Innovation Research Groups at Institutions of Higher Education in Chongqing (CXQT21032).
