Collaborative graph contrastive learning for recommendation

Abstract

Recently, Graph Neural Networks (GNNs) that aggregate neighborhood collaborative information have shown effectiveness in recommendation. However, GNN-based models suffer from over-smoothing and data sparsity problems. Due to its self-supervised nature, contrastive learning has gained considerable attention in the field of recommendation as a way to alleviate data sparsity. Graph contrastive learning models are widely used to learn the consistency of representations by constructing different graph augmentation views. Most current graph augmentation methods rely on random perturbation, which destroys the original graph structure information and misleads embedding learning. In this paper, an effective graph contrastive learning paradigm, CollaGCL, is proposed, which constructs graph augmentation by using singular value decomposition to preserve crucial structure information. CollaGCL enables perturbed views to effectively capture global collaborative information, mitigating the negative impact of graph structural perturbations. To optimize the contrastive learning task, the extracted meta-knowledge is propagated throughout the original graph to learn reliable embedding representations. The self-information learning between views enhances the semantic information of nodes, thus alleviating the problem of over-smoothing. Experimental results on three real-world datasets demonstrate the significant improvement of CollaGCL over state-of-the-art methods.

Keywords

Self-supervised learning; Recommendation; Contrastive learning; Data augmentation

1 Introduction

Recommendation has been extensively utilized to alleviate information overload [9]. Among various methods, Collaborative Filtering (CF) mines user preferences directly from historical user-item interactions and serves as a popular architecture for recommendation [6, 10]. Many existing popular methods have achieved significant improvement by building on the CF paradigm [19, 40]. However, CF methods require high-quality interaction information to learn user and item representations [5]. Graph Convolutional Networks (GCNs) are effective in recommendation because they stack multiple convolutional layers to extract local collaborative signals [4, 13]. Various GCN variants [33], such as graph auto-encoders [2, 3] and attention mechanisms [38], learn reliable user and item representations. Most GCN-based models require sufficient high-quality labeled data for training. However, in practical scenarios, data is extremely sparse, making it challenging for GCN-based models to capture high-quality user-item interaction information [26]. Contrastive learning (CL), which acquires general features from unlabeled data, has been substantiated as an efficient strategy for tackling data sparsity [16]. Owing to its consistency requirement, contrastive learning has achieved significant success in numerous recommendation tasks [15, 28].

Graph Contrastive Learning (GCL) is the application of CL in graph-based recommendation. The primary idea of GCL for enhancing user and item representations is to capture the consistency between different augmentation views [35]. Most current GCL methods focus on constructing efficient graph augmentation views. SGL [22] generates graph augmentations by applying stochastic strategies, such as node dropping and edge perturbation, which destroy the structural information of user-item interactions. LightGCL [34] utilizes singular value decomposition (SVD) augmentation to preserve critical information within the graph structure. Some studies have proposed contrastive learning methods that do not rely on graph augmentation. SimGrace [30] constructs an augmented view by randomly perturbing the parameters of the graph neural network. To improve performance in graph contrastive learning, SimGCL [31] and XSimGCL [39] add uniform noise to user and item embeddings as an alternative to graph augmentation for constructing perturbed views. XSimGCL optimizes the model on a single view, which provides a feasible way to simplify contrastive learning. The combination of contrastive learning with other recommendation methods has also achieved great success, such as sequential recommendation [42] and reinforcement learning-based recommendation [43].

Despite the effectiveness of these methods, GCL-based recommendation suffers from several problems: i) Current GCN-based recommendation methods are limited by over-smoothing [14, 20] and noisy interactions [25]. ii) Graph augmentation with structural perturbation may disrupt structure information, which misleads the learning of embeddings [31]. iii) It is difficult for current GCL methods to extract valuable information from global collaborative relations. To alleviate the problems above, an effective graph contrastive learning method, Collaborative Graph Contrastive Learning (CollaGCL), is proposed in this paper, in which the graph augmentation is guided by randomized singular value decomposition. CollaGCL focuses on assisting the perturbed view in learning reliable user and item embedding representations to optimize the recommendation task. Several modules are designed to mitigate the negative impact of structural perturbation by restoring the missing structural information. The contributions are as follows:

To mitigate the impact of structural perturbation, CollaGCL introduces global collaborative information into the learning process of perturbed views. The extracted meta-knowledge is propagated throughout the original graph to optimize recommendation and contrastive learning tasks.

The joint-view representation learning module is designed to allow the perturbed view to actively aggregate effective collaborative information from the main view. This helps mitigate the negative effects of over-smoothing and graph augmentation.

Experiments conducted on both baselines and CollaGCL have confirmed the improved recommendation performance of CollaGCL. Further experiments validate the robustness of CollaGCL.

2 Related work

2.1 GCN-based methods

The current mainstream recommendation models make recommendations by learning embedding representations of users and items. Efficiently learning high-quality representations is an important research focus. GCN [31] pioneered the application of convolutional neural network (CNN) concepts to graph structures for learning neighborhood information of nodes. Owing to its simplicity and ease of implementation, GCN has gradually become a core component of recommendation models, such as NCF [6] and NGCF [10]. However, GCNs face challenges such as high complexity on large-scale datasets and over-smoothing after multiple convolution layers. LightGCN [18] significantly reduces complexity by removing the feature transformation and non-linear activation modules from GCN. However, the issue of over-smoothing persists. ResGCN [11] constructs a residual connection convolutional network to strengthen self-information and alleviate the over-smoothing problem. This allows the model to maintain a healthy learning trend even after 56 layers. However, the self-information of nodes originates from a single view and cannot capture diverse information. SVD-GCN [32] directly calculates the final embeddings of LightGCN using a low-rank method based on singular value decomposition, eliminating the need for convolution operations. This effectively alleviates over-smoothing but lacks a certain level of interpretability.

Although these methods have made significant contributions to alleviating the issues of over-smoothing and high-complexity in GCNs, there is still a need for further research on addressing the challenges that GCN faces in graph contrastive learning (GCL).

2.2 Contrastive learning for recommendation

Contrastive learning, owing to its self-supervised [12, 27] nature, is effective in learning features from unlabeled samples [24], serving as a valuable approach to mitigate the data sparsity issues in recommendation. Learning embedding representations from diverse graph structures and constructing a contrastive learning framework for optimization is a recent research hotspot in recommendation. SGL [22] employs random dropout of nodes and edges as augmentation, constructing an effective CL framework. However, the random approach may discard some crucial structural information, making it challenging for the model to learn information between different views. Introducing multiple perturbed views also further increases the model’s complexity. AutoGCL [36] uses the learned embeddings to automatically deploy different graph augmentation strategies for each node. However, AutoGCL does not take into account edge augmentation strategies. SimGCL [31] and XSimGCL [39] introduce uniform noise to the embedding spaces of users and items, achieving feature augmentation. XSimGCL proposes a single-view CL framework, optimizing SimGCL’s multi-view framework and providing new ideas for the simplification of graph contrastive learning. The CL strategy of LightGCL involves reconstructing the adjacency matrix using truncated SVD. However, it does not take into account the impact of lost structural information. AdaptiveGCL [41] employs a variational graph autoencoder to generate existing edges and learn a parameterized mask for removing noisy edges in the graph. This method dynamically generates suitable perturbation views but may face complexity issues due to a large number of parameters and frequent graph generation.

Current GCL methods employ various augmentation strategies [21], but these approaches often fail to preserve crucial structural information from the original graph and to mitigate the negative impact of the augmentation.

3 Methodology

As depicted in Fig. 1, CollaGCL builds a contrastive learning framework using LightGCN (View 1) and SVD augmentation (View 2). It is recommended to deploy LightGCN in the Graph Encode module and normalization in the Norm module. In the Joint-View Learning module, a fully-connected layer and an aggregation function are used to learn collaborative information from the main view (View 1). The Global Learning module should be able to capture global collaborative information and generate appropriate embeddings for subsequent tasks.

Fig. 1

Overall structure of CollaGCL.

As a common practice of CF, $U = \{u_{1}, \ldots, u_{i}, \ldots, u_{I}\}$ and $V = \{v_{1}, \ldots, v_{j}, \ldots, v_{J}\}$ denote the user and item sets, where I and J denote their sizes. The adjacency matrix $A \in ℝ^{I \times J}$ represents the relationships between users and items, where $A_{ij} = 1$ indicates a connection between user i and item j, while $A_{ij} = 0$ signifies no connection between them. $\tilde{A} = A + I$ represents the self-connected adjacency matrix, where $I$ is the identity matrix, and ${\tilde{D}}_{ii} = \sum_{j} {\tilde{A}}_{ij}$ defines the degree matrix for nodes. The normalized adjacency matrix used for graph convolution is denoted as $\bar{A} = {\tilde{D}}^{- \frac{1}{2}} \tilde{A} {\tilde{D}}^{- \frac{1}{2}}$ . The low-rank method of Randomized Singular Value Decomposition decomposes the adjacency matrix $\bar{A}$ into ${\hat{U}}_{q}$ , ${\hat{S}}_{q}$ , and ${\hat{V}}_{q}$ , such that $\hat{A} = {\hat{U}}_{q} {\hat{S}}_{q} {\hat{V}}_{q}^{T}$ , with ${\hat{U}}_{q}^{T} {\hat{U}}_{q} = I$ and ${\hat{V}}_{q}^{T} {\hat{V}}_{q} = I$ . Here, $\hat{A}$ represents the reconstructed adjacency matrix, and q denotes the rank of the SVD (only the top q most important components are retained). The detailed low-rank decomposition method is explained in Section 3.2.2.

3.1 Global collaborative relation learning

Graph augmentation with random perturbation may remove important structure information, which misleads the representation learning. In addition, removing important nodes or edges from the graph structure may compromise the integrity of the connected graph, resulting in a lack of learnable consistency among contrastive views [31]. As shown in Fig. 2, to alleviate the negative impact of graph structure perturbation, CollaGCL captures the first-hop neighborhood information of the original graph structure. The reason is that first-layer embeddings capture smooth user-item interaction information, whereas embeddings become over-smoothed as the number of layers increases. In accordance with common practice, users and items are represented as d-dimensional embeddings to learn neighborhood representations. Then, a single-layer LightGCN is employed to capture collaborative information: $c^{(u)} = \bar{A} \cdot h_{0}^{(v)}, c^{(v)} = {\bar{A}}^{T} \cdot h_{0}^{(u)}$ (1)

Fig. 2

The framework of global learning module.

where $h_{0}^{(u)} \in ℝ^{I \times d}$ and $h_{0}^{(v)} \in ℝ^{J \times d}$ represent the user and item embeddings randomly initialized with the Xavier initialization method, respectively, and d denotes the embedding size. $c^{(u)} \in ℝ^{I \times d}, c^{(v)} \in ℝ^{J \times d}$ represent the collaborative information. Although collaborative information effectively represents the original graph structure, insufficient neighborhood information for certain nodes results in inadequate training and impacts the overall performance of the model. As depicted in Fig. 2, to mitigate the influence of these nodes, a low-rank method is employed to construct interaction graphs between users and users, as well as items and items, for further embedding learning. To reduce model complexity, the interaction graphs are computed directly from the reconstructed adjacency matrix $\hat{A}$ and simplified with its low-rank factors: $\begin{matrix} A_{uu} & = \hat{A} \cdot {\hat{A}}^{T} = {\hat{U}}_{q} {\hat{S}}_{q}^{2} {\hat{U}}_{q}^{T} \\ A_{ii} & = {\hat{A}}^{T} \cdot \hat{A} = {\hat{V}}_{q} {\hat{S}}_{q}^{2} {\hat{V}}_{q}^{T} \end{matrix}$ (2)

Subsequently, the user and item embeddings generated by the low-rank constructed graph can be represented as $e_{uu} = {\hat{U}}_{q} {\hat{S}}_{q}^{2} {\hat{U}}_{q}^{T} \cdot h_{0}^{(u)}$ and $e_{ii} = {\hat{V}}_{q} {\hat{S}}_{q}^{2} {\hat{V}}_{q}^{T} \cdot h_{0}^{(v)}$ . In order to avoid affecting the optimization of the original graph and enhance the generalization ability of the model, the low-rank parameter $w \in ℝ^{q \times d}$ is defined to replace ${\hat{U}}_{q}^{T} \cdot h_{0}^{(u)}$ and ${\hat{V}}_{q}^{T} \cdot h_{0}^{(v)}$ . Therefore, the formulas for $e_{uu}$ and $e_{ii}$ can be simplified to: $e_{uu} = {\hat{U}}_{q} {\hat{S}}_{q}^{2} \cdot w, e_{ii} = {\hat{V}}_{q} {\hat{S}}_{q}^{2} \cdot w$ (3)

Finally, the collaborative information is aggregated with the generated graph information to enhance the embedding representations. The embeddings from the global learning layer serve as inputs for both the main view and the perturbed view, replacing the randomly initialized embeddings $h_{0}^{(u)}$ and $h_{0}^{(v)}$ , because calculating embeddings directly from $h_{0}^{(u)}$ and $h_{0}^{(v)}$ leads to a performance decrease: $E_{0}^{(u)} = G_{0}^{(u)} = c^{(u)} + e_{uu}, E_{0}^{(v)} = G_{0}^{(v)} = c^{(v)} + e_{ii}$ (4)

where $E_{l}^{(u)} \in ℝ^{I \times d}$ and $G_{l}^{(u)} \in ℝ^{I \times d}$ represent the l-th layer aggregated user embeddings for the main view and the perturbed view, respectively. $E_{l}^{(v)} \in ℝ^{J \times d}$ and $G_{l}^{(v)} \in ℝ^{J \times d}$ are item embeddings in the l-th layer.
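The global learning step in Eqs. (1) to (4) can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the bipartite degree normalization in `normalize_bipartite`, the toy sizes, the `xavier` helper, and the use of an exact truncated SVD in place of the randomized SVD are all choices made here for the example.

```python
import numpy as np

def normalize_bipartite(A):
    """Symmetric degree normalization of a user-item matrix (an assumed form)."""
    du = np.maximum(A.sum(axis=1), 1.0)  # user degrees, guarded against zero
    dv = np.maximum(A.sum(axis=0), 1.0)  # item degrees, guarded against zero
    return A / np.sqrt(du)[:, None] / np.sqrt(dv)[None, :]

def global_learning(A_bar, U_q, s_q, V_q, h0_u, h0_v, w):
    """Return the shared initial embeddings E_0 = G_0 for users and items."""
    c_u = A_bar @ h0_v             # Eq. (1): first-hop collaborative information
    c_v = A_bar.T @ h0_u
    e_uu = (U_q * s_q**2) @ w      # Eq. (3): U_q S_q^2 w
    e_ii = (V_q * s_q**2) @ w      # Eq. (3): V_q S_q^2 w
    return c_u + e_uu, c_v + e_ii  # Eq. (4)

rng = np.random.default_rng(0)
I, J, d, q = 6, 5, 4, 2  # toy sizes, chosen only for illustration
A = (rng.random((I, J)) < 0.5).astype(float)
A_bar = normalize_bipartite(A)

# An exact truncated SVD stands in for the randomized SVD here.
U, s, Vt = np.linalg.svd(A_bar, full_matrices=False)
U_q, s_q, V_q = U[:, :q], s[:q], Vt[:q].T

def xavier(rows, cols):
    bound = np.sqrt(6.0 / (rows + cols))  # Xavier (Glorot) uniform bound
    return rng.uniform(-bound, bound, size=(rows, cols))

h0_u, h0_v, w = xavier(I, d), xavier(J, d), xavier(q, d)
E0_u, E0_v = global_learning(A_bar, U_q, s_q, V_q, h0_u, h0_v, w)
print(E0_u.shape, E0_v.shape)
```

Multiplying the factors by `s_q**2` column-wise avoids ever forming the dense I × I and J × J graphs of Eq. (2), which is the point of the low-rank simplification.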

3.2 Joint-view representation learning

3.2.1 Main view propagation

To better aggregate global collaborative information, a two-layer LightGCN was adopted to encode the neighborhood information for each node in the main view. In the l-th layer, the aggregation process is expressed as $E_{l}^{(u)} = \bar{A} \cdot E_{l - 1}^{(v)}, E_{l}^{(v)} = {\bar{A}}^{T} \cdot E_{l - 1}^{(u)}$ (5)
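A minimal sketch of the main-view propagation in Eq. (5), assuming dense NumPy arrays and toy inputs (the function name and the random data are illustrative only):

```python
import numpy as np

def main_view_propagation(A_bar, E0_u, E0_v, num_layers=2):
    """Return per-layer embeddings [E_0, ..., E_L] for users and items."""
    E_u, E_v = [E0_u], [E0_v]
    for _ in range(num_layers):
        # Eq. (5): both updates read the previous layer, so compute them
        # before appending either result.
        new_u = A_bar @ E_v[-1]
        new_v = A_bar.T @ E_u[-1]
        E_u.append(new_u)
        E_v.append(new_v)
    return E_u, E_v

rng = np.random.default_rng(0)
A_bar = rng.random((4, 3)) / 3.0  # stand-in for a normalized adjacency matrix
E0_u, E0_v = rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
E_u, E_v = main_view_propagation(A_bar, E0_u, E0_v)
print(len(E_u), E_u[-1].shape)
```

Because there is no feature transformation or activation, two layers amount to multiplying the initial user embeddings by $\bar{A} {\bar{A}}^{T}$, which is why the per-layer outputs are kept separately for later layer aggregation.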

3.2.2 Singular value decomposition augmentation

To construct a better contrastive learning paradigm and extract important structure information, the SVD method is used to reconstruct the graph structure as the perturbed view. Firstly, SVD is performed on the adjacency matrix $\bar{A}$ as $\bar{A} = U S V^{T}$ . Here, $S \in ℝ^{I \times J}$ is a diagonal matrix storing the singular values of $\bar{A}$ . $U \in ℝ^{I \times I}$ and $V \in ℝ^{J \times J}$ are orthonormal matrices. However, executing SVD on large matrices is computationally expensive. In order to reduce computational costs, the randomized SVD [1] algorithm is used as a replacement for the time-consuming SVD algorithm. The core concept behind the randomized SVD algorithm is to approximate the range of the input matrix with a low-rank orthonormal matrix and perform SVD on the resulting smaller matrix. As illustrated in Fig. 3, the adjacency matrix is reconstructed with the randomized SVD as $\hat{A} = {\hat{U}}_{q} {\hat{S}}_{q} {\hat{V}}_{q}^{T}$ , where ${\hat{U}}_{q} \in ℝ^{I \times q}, {\hat{V}}_{q} \in ℝ^{J \times q}$ , and ${\hat{S}}_{q} \in ℝ^{q \times q}$ retain the top q most important components of the adjacency matrix $\bar{A}$ , and $\hat{A}$ is the reconstructed adjacency matrix of $\bar{A}$ .

Fig. 3

The process of graph augmentation.
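The randomized SVD idea described above can be sketched as follows. This is the generic random-projection scheme (random test matrix, optional power iterations, QR, then a small exact SVD), written under the assumption of a dense NumPy matrix; the oversampling size, iteration count, and function name are illustrative and are not taken from the paper.

```python
import numpy as np

def randomized_svd(A, q, oversample=5, n_iter=2, seed=0):
    """Approximate the top-q singular triplets of A by random projection."""
    rng = np.random.default_rng(seed)
    k = min(q + oversample, min(A.shape))
    omega = rng.normal(size=(A.shape[1], k))  # random test matrix
    Y = A @ omega                             # sample the range of A
    for _ in range(n_iter):                   # power iterations sharpen the spectrum
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                    # orthonormal basis of the sampled range
    B = Q.T @ A                               # small k x J projected matrix
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_b[:, :q], s[:q], Vt[:q].T    # U_q (I x q), s_q (q,), V_q (J x q)

# Usage on a toy matrix of exact rank 3, where the rank-3 reconstruction is exact.
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 20))
U_q, s_q, V_q = randomized_svd(A, q=3)
A_hat = (U_q * s_q) @ V_q.T
print(np.abs(A_hat - A).max())
```

On a real interaction matrix the input would be sparse and far from exactly low-rank, so $\hat{A}$ is only an approximation that keeps the dominant components; the exact-rank toy matrix is used here only to make the result easy to check.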

3.2.3 Perturbed view propagation

GCNs aggregate too much information from high-order neighborhoods after stacking multiple layers, making embeddings hard to distinguish and leading to the over-smoothing problem [8]. In contrastive learning, the over-smoothing problem still exists due to the construction of multiple views. Different contrastive views capture various collaborative information, and joint-view connections are utilized to incorporate self-information that emphasizes the semantics of the center node. However, directly connecting the embeddings of different views could potentially amplify the impact of noise. Therefore, fully-connected layers are used to reintegrate the main view embeddings to learn suitable information. Embeddings from different views are combined to mitigate the issue of over-smoothing. The specific implementation of message aggregation is as follows: $\begin{matrix} m_{l - 1}^{(u)} & = G_{l - 1}^{(u)} + f (E_{l - 1}^{(u)}) \\ m_{l - 1}^{(v)} & = G_{l - 1}^{(v)} + f (E_{l - 1}^{(v)}) \end{matrix}$ (6)

where f (·) is a fully-connected layer that maps the embedding into a d-dimensional space. The perturbed view preserves the essential graph structure information to learn important and reliable embedding representations. In each layer, the perturbed view propagation aggregates the joint-view messages $m^{(u)}$ and $m^{(v)}$ as the embeddings for that layer: $\begin{matrix} G_{l}^{(u)} = & \hat{A} \cdot m_{l - 1}^{(v)} = {\hat{U}}_{q} {\hat{S}}_{q} {\hat{V}}_{q}^{T} \cdot m_{l - 1}^{(v)} \\ G_{l}^{(v)} = & {\hat{A}}^{T} \cdot m_{l - 1}^{(u)} = {\hat{V}}_{q} {\hat{S}}_{q} {\hat{U}}_{q}^{T} \cdot m_{l - 1}^{(u)} \end{matrix}$ (7)
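Eqs. (6) and (7) can be sketched as below. The sketch assumes that f is a single affine layer `E @ W + b` shared by users, items, and all layers, which the text does not specify; the identity weights and random inputs are placeholders chosen so the toy result is easy to verify.

```python
import numpy as np

def perturbed_view_propagation(E_u, E_v, G0_u, G0_v, U_q, s_q, V_q, W, b):
    """E_u, E_v hold the main-view layers [E_0, ..., E_L]; returns [G_0, ..., G_L]."""
    G_u, G_v = [G0_u], [G0_v]
    for l in range(1, len(E_u)):
        # Eq. (6): joint-view messages, with f(E) = E @ W + b (assumed form)
        m_u = G_u[-1] + (E_u[l - 1] @ W + b)
        m_v = G_v[-1] + (E_v[l - 1] @ W + b)
        # Eq. (7): multiply by the factors right to left so that the dense
        # I x J matrix A_hat is never materialized
        G_u.append(U_q @ (s_q[:, None] * (V_q.T @ m_v)))
        G_v.append(V_q @ (s_q[:, None] * (U_q.T @ m_u)))
    return G_u, G_v

rng = np.random.default_rng(0)
I, J, d, q = 5, 4, 3, 2
U, s, Vt = np.linalg.svd(rng.random((I, J)), full_matrices=False)
U_q, s_q, V_q = U[:, :q], s[:q], Vt[:q].T
E_u = [rng.normal(size=(I, d)) for _ in range(3)]  # stand-ins for E_0..E_2
E_v = [rng.normal(size=(J, d)) for _ in range(3)]
W, b = np.eye(d), np.zeros(d)  # identity layer, only to keep the toy checkable
G_u, G_v = perturbed_view_propagation(E_u, E_v, E_u[0], E_v[0], U_q, s_q, V_q, W, b)
print(len(G_u), G_u[-1].shape)
```

Evaluating the product from right to left costs on the order of q (I + J) d operations per layer, which matches the low-rank term that appears in the complexity analysis.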

3.3 Cross-view meta-knowledge diffusion

This module is designed to capture reliable information (meta-knowledge) to propagate throughout the entire graph structure to optimize embedding representations. The embeddings learned during the propagation process in the perturbed view contain essential structural information and less noise. Layer aggregation with weight is used to extract reliable information as meta-knowledge from each layer. This method captures various semantic information from different layers and uses weights to learn uniform feature representations. The extracted meta-knowledge is as follows: $k^{(u)} = \frac{1}{L + 1} \sum_{l = 0}^{L} G_{l}^{(u)}, k^{(v)} = \frac{1}{L + 1} \sum_{l = 0}^{L} G_{l}^{(v)}$ (8)

Setting the aggregation coefficient as $\frac{1}{L + 1}$ in CollaGCL yields better results, but it can be adjusted to different values or replaced by a learnable parameter. The nodes in the perturbed view lose part of their neighborhood information due to perturbation. The extracted meta-knowledge $k^{(u)} \in ℝ^{I \times d}$ and $k^{(v)} \in ℝ^{J \times d}$ is injected into the Graph Encode module with the aim of propagating representative and important information throughout the entire graph structure. This helps the discarded parts of the structure to regain valuable information. Normalization distributes features in a uniform range to reduce the influence of outliers. Each component with smaller singular values aggregates meta-knowledge from its neighborhood to obtain essential information. The results of meta-knowledge diffusing in the original graph structure are expressed as $\begin{matrix} d^{(u)} = & \frac{\bar{A} \cdot k^{(v)}}{\max (∥ \bar{A} \cdot k^{(v)} ∥_{2})} \\ d^{(v)} = & \frac{{\bar{A}}^{T} \cdot k^{(u)}}{\max (∥ {\bar{A}}^{T} \cdot k^{(u)} ∥_{2})} \end{matrix}$ (9)

where $d^{(u)}$ and $d^{(v)}$ denote the user and item embeddings obtained by diffusing meta-knowledge through the entire graph structure. In order to better optimize the CL loss, it is necessary to extract feature information from different views. The embedding for the main view directly uses the meta-knowledge extraction method to aggregate layer information. For the perturbed view, the diffused knowledge $d^{(u)}$ and $d^{(v)}$ needs to be aggregated with the meta-knowledge $k^{(u)}$ and $k^{(v)}$. This effectively preserves the information in the perturbed view, and the diffused knowledge from the original graph further helps optimize the insufficiently learned embeddings in the perturbed view to alleviate information loss caused by dimension reduction in the SVD process. Therefore, the calculation for the final embeddings used for CL loss in the two views is as follows: $\begin{matrix} e^{(u)} = \frac{1}{L + 1} \sum_{l = 0}^{L} E_{l}^{(u)}, & g^{(u)} = k^{(u)} + d^{(u)} \\ e^{(v)} = \frac{1}{L + 1} \sum_{l = 0}^{L} E_{l}^{(v)}, & g^{(v)} = k^{(v)} + d^{(v)} \end{matrix}$ (10)

where $e^{(u)}$ and $g^{(u)}$ represent the final user embeddings of the main view and the perturbed view for contrastive learning, respectively. $e^{(v)}$ and $g^{(v)}$ are defined in the same way for items. The embeddings used for recommendation tasks should aggregate information from different views. Because the perturbed view embeddings contain information from some insufficiently trained nodes, directly aggregating them onto $e^{(u)}$ and $e^{(v)}$ would harm embedding consistency. Therefore, using the embeddings $d^{(u)}$ and $d^{(v)}$, optimized on the original structure, as perturbed view information for recommendation tasks is a better choice. The embedding representations $z^{(u)}$ and $z^{(v)}$ for recommendations are $z^{(u)} = e^{(u)} + d^{(u)}, z^{(v)} = e^{(v)} + d^{(v)}$ (11)

Cross-view meta-knowledge diffusion serves as an embedding optimization module, providing high-quality embeddings for the loss function. The construction of this module has two benefits: i) it compensates for the missing information caused by structural perturbations and helps contrastive learning better distinguish positive and negative instances. ii) it assists the recommendation task in making more accurate predictions. Both aspects work together to improve the model’s performance.
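The diffusion pipeline of Eqs. (8) to (11) can be sketched as follows. One point is an interpretation rather than something the text states: the normalization in Eq. (9) is read here as division by the largest row-wise L2 norm. The helper names and random toy inputs are likewise illustrative.

```python
import numpy as np

def layer_mean(layers):
    """Uniform layer aggregation with coefficient 1 / (L + 1)."""
    return sum(layers) / len(layers)

def max_row_norm_scale(X):
    """Divide by the largest row-wise L2 norm (one reading of Eq. (9))."""
    return X / np.linalg.norm(X, axis=1).max()

def meta_knowledge_diffusion(A_bar, E_u, E_v, G_u, G_v):
    k_u, k_v = layer_mean(G_u), layer_mean(G_v)  # Eq. (8): meta-knowledge
    d_u = max_row_norm_scale(A_bar @ k_v)        # Eq. (9): diffusion on the original graph
    d_v = max_row_norm_scale(A_bar.T @ k_u)
    e_u, e_v = layer_mean(E_u), layer_mean(E_v)  # Eq. (10): main view
    g_u, g_v = k_u + d_u, k_v + d_v              # Eq. (10): perturbed view
    z_u, z_v = e_u + d_u, e_v + d_v              # Eq. (11): recommendation embeddings
    return (e_u, e_v), (g_u, g_v), (z_u, z_v)

rng = np.random.default_rng(0)
I, J, d, L = 5, 4, 3, 2
A_bar = rng.random((I, J)) / J  # stand-in for a normalized adjacency matrix
E_u = [rng.normal(size=(I, d)) for _ in range(L + 1)]
E_v = [rng.normal(size=(J, d)) for _ in range(L + 1)]
G_u = [rng.normal(size=(I, d)) for _ in range(L + 1)]
G_v = [rng.normal(size=(J, d)) for _ in range(L + 1)]
(e_u, e_v), (g_u, g_v), (z_u, z_v) = meta_knowledge_diffusion(A_bar, E_u, E_v, G_u, G_v)
print(z_u.shape, z_v.shape)
```

Note that the contrastive pair is (e, g) while the recommendation embedding z reuses only the diffused part d of the perturbed view, mirroring the distinction drawn in the text.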

3.4 Contrastive learning

To enhance recommendation performance, a multi-task joint optimization strategy is employed for contrastive learning. Combining the BPR loss $L_{rec}$ and the InfoNCE loss $L_{cl}$ , the total loss $L$ is represented as: $L = L_{rec} + λ_{1} \cdot L_{cl} + λ_{2} \cdot ∥ Θ ∥_{2}^{2}$ (12)

where Θ is the collection of all parameters, including the randomly initialized embeddings and the fully-connected layers. λ₁ and λ₂ denote the weights of the CL loss and the L₂ regularization, respectively. The predicted preference of user u for item i is given by the inner product ${\hat{y}}_{u, i} = {z_{u}^{(u)}}^{T} z_{i}^{(v)}$ . The specific implementation of the BPR loss is: $L_{rec} = - \sum_{(u, i^{+}, i^{-}) \in B} \log (σ ({\hat{y}}_{u, i^{+}} - {\hat{y}}_{u, i^{-}}))$ (13)

where σ (·) and log(·) represent the sigmoid and logarithm functions, respectively. Items $i^{+}$ and $i^{-}$ are the positive and negative instances. Traditional CL models like SGL and SimGCL contrast node embeddings by constructing two additional views, while the embeddings generated from the main view are not directly involved in the InfoNCE loss. In CollaGCL, the embeddings from the two views are directly used as the comparison objects for the InfoNCE loss: $L_{cl}^{(u)} = \sum_{i \in B} - \log \frac{\exp (s (e_{i}^{(u)}, g_{i}^{(u)}) / τ)}{\sum_{j \in B} \exp (s (e_{i}^{(u)}, g_{j}^{(u)}) / τ)}$ (14)

where s (·) and τ denote cosine similarity and temperature, respectively. The InfoNCE loss $L_{cl}^{(v)}$ for the items is defined in the same way. The losses for users and items are combined to obtain the final loss used for CL tasks as $L_{cl} = L_{cl}^{(u)} + L_{cl}^{(v)}$ . This calculation strategy enables the independent optimization of user and item embedding representations to reduce mutual influence between them, while simultaneously allowing control over their impact on the overall loss.
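The three loss terms in Eqs. (12) to (14) can be sketched in NumPy as follows. This is a forward-pass illustration only (no gradients or optimizer), and the function names and the numerically stable formulations are choices made for the sketch rather than details from the paper.

```python
import numpy as np

def bpr_loss(z_u, z_pos, z_neg):
    """Eq. (13): rows are (user, positive item, negative item) triples."""
    diff = np.sum(z_u * z_pos, axis=1) - np.sum(z_u * z_neg, axis=1)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp avoids overflow
    return np.sum(np.logaddexp(0.0, -diff))

def info_nce(e, g, tau):
    """Eq. (14): row i of e and row i of g form the positive pair."""
    e_n = e / np.linalg.norm(e, axis=1, keepdims=True)
    g_n = g / np.linalg.norm(g, axis=1, keepdims=True)
    logits = (e_n @ g_n.T) / tau  # cosine similarity divided by temperature
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return np.sum(log_denom - np.diag(logits))

def total_loss(l_rec, l_cl_u, l_cl_v, params, lam1, lam2):
    """Eq. (12), with L_cl = L_cl^(u) + L_cl^(v)."""
    reg = sum(np.sum(p**2) for p in params)
    return l_rec + lam1 * (l_cl_u + l_cl_v) + lam2 * reg

rng = np.random.default_rng(0)
z_u, z_pos, z_neg = (rng.normal(size=(8, 4)) for _ in range(3))
e, g = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
loss = total_loss(bpr_loss(z_u, z_pos, z_neg), info_nce(e, g, 0.2),
                  info_nce(e, g, 0.2), [z_u], lam1=0.1, lam2=1e-7)
print(loss)
```

Keeping the user and item InfoNCE terms as separate calls reflects the independent optimization described above, while the single weight λ₁ controls their joint contribution.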

4 Evaluation

The recommendation performance of CollaGCL is assessed through several representative experiments, aiming to address the following questions:

RQ1: How does the recommendation performance of CollaGCL compare to baseline methods?

RQ2: What is the computational complexity of CollaGCL?

RQ3: How uniform are the embeddings learned by CollaGCL?

RQ4: How do the key components of CollaGCL improve recommendation performance?

RQ5: How does CollaGCL perform against popularity bias?

RQ6: How does the configuration of hyperparameters affect the performance of CollaGCL?

4.1 Experimental settings

4.1.1 Datasets and evaluation metrics

Three real-world datasets were used to assess model performance in the experiments, and the detailed information about the datasets is presented in Table 1. Yelp is a dataset containing user rating information, Gowalla consists of users’ check-in records, and Amazon is a dataset of users’ ratings on products in the book category. The datasets are partitioned into training, testing, and validation sets in a 7:2:1 ratio, and Recall@N and NDCG@N are employed as evaluation metrics [10, 23], where N ∈ {20, 40}.
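For reference, Recall@N and NDCG@N for a single user can be computed as in the sketch below. It assumes the common binary-relevance definitions (recall divided by the number of held-out items, and an ideal DCG truncated at N); the paper does not spell out its exact variant, and the item IDs are made up for illustration.

```python
import math

def recall_at_n(ranked, relevant, n):
    """Fraction of a user's held-out items that appear in the top-n list."""
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / len(relevant)

def ndcg_at_n(ranked, relevant, n):
    """Binary-relevance NDCG: discounted hits divided by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:n]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(n, len(relevant))))
    return dcg / ideal

ranked = [3, 1, 7, 5]  # hypothetical top-4 recommendation list
relevant = {1, 5, 9}   # hypothetical held-out test items
print(recall_at_n(ranked, relevant, 4), ndcg_at_n(ranked, relevant, 4))
```

In an evaluation loop these per-user values would be averaged over all test users, with items seen during training excluded from the ranked list.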

Table 1
Statistics of experimented datasets

Dataset	User#	Item#	Interaction#	Density
Yelp	29,601	24,734	1,517,326	0.00187
Gowalla	50,821	57,440	1,172,425	0.00044
Amazon	78,578	77,801	2,240,156	0.00047

4.1.2 Baseline methods

Some popular baseline methods are compared with CollaGCL in terms of performance, including CF and CL methods, and below is a brief introduction to these baselines:

NGCF [10] employs collaborative filtering methods to extract high-order collaborative signals between users and items.

LightGCN [18] adopts a simplified GCN structure without matrix transformation and non-linear activation.

HCCF [29] employs CL on a hypergraph to capture users and items information.

SGL [22] introduces random elimination of nodes and edges to construct contrastive views.

NCL [37] combines users (items) graph structural and semantic space neighbors to construct contrastive learning task.

SimGCL [31] adds uniform noise to user and item embeddings as an alternative approach to graph augmentation for constructing perturbed views.

XSimGCL [39] builds single view CL tasks with uniform noise perturbation on user-item embeddings. It improves training efficiency compared to SimGCL.

LightGCL [34] adopts SVD augmentation to construct perturbed view.

4.1.3 Hyperparameter settings

All baseline methods follow the parameter settings given in their original papers and are then tuned for optimal performance. The parameter information for CollaGCL is presented in Table 2, and these parameters are tuned within reasonable ranges to search for the best performance. Model parameters are optimized with the Adam optimizer.

Table 2
The parameter information for CollaGCL

Parameter	Definition	Scope
λ₁	weight of CL loss	{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
λ₂	weight of regularization	{1e-6, 1e-7, 1e-8}
τ	temperature	{0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1.0, 3.0, 5.0}
$B$	mini-batch size	{1024, 2048, 4096, 8192}
d	embedding size	{32, 64, 128}
L	GNN layers	{1, 2, 3}
q	rank of SVD	{1, 3, 5, 7, 9, 11, 13, 15}
α	learning rate	0.001

4.2 Performance comparison (RQ1)

All baseline methods and CollaGCL were trained on the datasets, and detailed results are provided in Table 3. By comparing the results, it can be noticed that contrastive learning methods (SGL, SimGCL, etc.) generally outperform collaborative filtering methods (NGCF, LightGCN), primarily because self-supervised learning effectively mitigates the data sparsity issue. Compared to contrastive learning methods with structural perturbation (SGL, LightGCL, etc.), approaches that retain graph structural information (SimGCL, XSimGCL) often achieve better results, suggesting that preserving graph structure information is beneficial for enhancing contrastive learning performance. In comparison to the baseline methods, CollaGCL exhibits significant advantages on all three datasets. This is mainly attributed to CollaGCL’s mitigation of the negative effects of structural perturbation and the design of modules to reduce the over-smoothing problem in graph neural networks. The improvement on the Yelp dataset, which has fewer users and items but denser interactions than the Gowalla and Amazon datasets, is relatively smaller. The reason for this phenomenon may be that the increase in the number of node interactions limits the advantages of contrastive learning to a certain extent. Nodes with a large number of interactions may fuse noisy information from their neighbors, which degrades their representations. The significant improvement on the Gowalla and Amazon datasets highlights the superiority of CollaGCL in handling sparse data. This is mainly attributed to the construction of the contrastive learning framework and the SVD augmentation.

Table 3
Performance comparison with baselines on different datasets

Data	Metric	NGCF	LightGCN	HCCF	LightGCL	SGL	NCL	SimGCL	XSimGCL	CollaGCL	Impr.
Yelp	R@20	0.0830	0.0909	0.0988	0.1003	0.1008	0.1012	0.1088	0.1098	0.1147	4.5%
	N@20	0.0699	0.0777	0.0844	0.0862	0.0861	0.0866	0.0940	0.0948	0.0985	3.9%
	R@40	0.1351	0.1467	0.1587	0.1587	0.1619	0.1620	0.1718	0.1739	0.1788	2.8%
	N@40	0.0891	0.0981	0.1066	0.1075	0.1083	0.1089	0.1169	0.1181	0.1218	3.1%
Gowalla	R@20	0.1917	0.2177	0.1941	0.2119	0.2231	0.2256	0.2264	0.2158	0.2551	12.6%
	N@20	0.1140	0.1276	0.1153	0.1232	0.1358	0.1372	0.1386	0.1321	0.1549	11.7%
	R@40	0.2702	0.3054	0.2783	0.2984	0.3111	0.3100	0.3130	0.3019	0.3454	10.3%
	N@40	0.1345	0.1507	0.1371	0.1458	0.1589	0.1596	0.1613	0.1546	0.1786	10.7%
Amazon	R@20	0.0752	0.1004	0.0880	0.1150	0.1034	0.1030	0.1103	0.1090	0.1319	19.5%
	N@20	0.0563	0.0755	0.0667	0.0893	0.0804	0.0791	0.0861	0.0848	0.1047	21.6%
	R@40	0.1150	0.1516	0.1364	0.1681	0.1537	0.1513	0.1634	0.1617	0.1880	15.1%
	N@40	0.0196	0.0924	0.0826	0.1067	0.0969	0.0951	0.1033	0.1020	0.1229	18.9%
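
The Impr. column can be checked with a few lines of arithmetic. The sketch below assumes that Impr. is the relative gain of CollaGCL over a reference baseline, and takes XSimGCL as the reference for the Yelp rows, where it is the strongest baseline. The helper name `relative_improvement` is hypothetical and not part of the original work.

```python
# Yelp rows of Table 3 as (reference baseline = XSimGCL, CollaGCL).
yelp = {
    "R@20": (0.1098, 0.1147),
    "N@20": (0.0948, 0.0985),
    "R@40": (0.1739, 0.1788),
    "N@40": (0.1181, 0.1218),
}


def relative_improvement(reference: float, ours: float) -> float:
    """Relative gain of `ours` over `reference`, in percent (assumed definition)."""
    return 100.0 * (ours - reference) / reference


# Round to one decimal place, as in the table.
impr = {m: round(relative_improvement(ref, ours), 1) for m, (ref, ours) in yelp.items()}
for metric, value in impr.items():
    print(f"{metric}: {value}%")
```

Under this assumption the computed values agree with the Impr. entries reported for Yelp.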

4.3 Efficiency study (RQ2)

GCL methods incur additional computational cost because they need to construct extra views. The definitions of parameters are provided in Table 2. In addition, E and M denote the number of interactions and the number of nodes in a batch, respectively. The complexities of the compared methods are shown in Table 4. SVD is performed in the pre-processing stage, which takes $O (qE)$. Compared to GCN-based propagation, SVD-based graph convolution has lower complexity, since q (I + J) < E. Due to the use of additional graph convolutional networks, CollaGCL’s complexity is higher than that of LightGCL and LightGCN. CollaGCL employs L + 2 rounds of LightGCN during the training process, so the complexity is $O [2 E (L + 2) d]$. The propagation of the perturbed view and the low-rank graph generation use a graph reconstruction method, so the complexity is $O [2 q (I + J) (L + 1) d]$. The complexity of the InfoNCE loss is $O (B d + B Md)$ and that of the BPR loss is $O (2 B d)$. The total number of parameters used by CollaGCL is (q + d + I + J) d. CollaGCL’s complexity mainly arises from graph convolution operations. It uses a small number of parameters for training, resulting in a relatively fast training speed.

Table 4
Complexity comparison of baseline methods

Stage	Computation	LightGCN	LightGCL	CollaGCL
Pre-processing	Normalization	$O (E)$	$O (E)$	$O (E)$
	SVD	-	$O (qE)$	$O (qE)$
Training	Graph Convolution	$O (2 ELd)$	$O (2 ELd)$	$O [2 E (L + 2) d]$
		-	$O [2 q (I + J) Ld]$	$O [2 q (I + J) (L + 1) d]$
	BPR Loss	-	$O (2 B d)$	$O (2 B d)$
	InfoNCE Loss	$O (B d + B Md)$	$O (B d + B Md)$	$O (B d + B Md)$
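
The condition q (I + J) < E can be checked numerically. The sketch below is an illustration only: it assumes that I and J are the numbers of users and items, takes E as the total interaction count from Table 1, and uses q = 5, the value reported as optimal in the hyperparameter analysis. The helper `low_rank_cost` is a hypothetical name.

```python
# Dataset statistics copied from Table 1: (users I, items J, interactions E).
datasets = {
    "Yelp": (29_601, 24_734, 1_517_326),
    "Gowalla": (50_821, 57_440, 1_172_425),
    "Amazon": (78_578, 77_801, 2_240_156),
}
q = 5  # assumed here; reported as the optimal value of q


def low_rank_cost(num_users: int, num_items: int, rank: int) -> int:
    """The q(I + J) factor of the SVD-based propagation cost."""
    return rank * (num_users + num_items)


for name, (num_users, num_items, num_edges) in datasets.items():
    cost = low_rank_cost(num_users, num_items, q)
    print(f"{name}: q(I+J) = {cost:,} vs E = {num_edges:,} (ratio {cost / num_edges:.2f})")
```

With these numbers the inequality holds on all three datasets, which is consistent with the claim that the low-rank propagation is cheaper per layer than propagation over the full interaction graph.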

4.4 Uniformity study (RQ3)

The uniformity of the learned embedding representations is an intuitive indicator for evaluating graph contrastive learning. It is closely related to the model’s recommendation performance, but excessive uniformity can also degrade performance. To illustrate the uniformity of the embeddings, users and items are ranked by their number of interactions, with the top 10% defined as hot and the bottom 80% as cold. Subsequently, 500 users and items are sampled from each of the hot and cold groups, and their embedding representations are learned. For visualization, t-SNE is used to obtain a 2-dimensional representation, and Gaussian kernel density estimation curves of the angles, arctan (feature_y/feature_x), are then plotted.
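
The angle-density step can be sketched in plain Python. This is a minimal illustration, not the authors' plotting code: it assumes the 2-dimensional t-SNE output is already available (random points stand in for it), uses `atan2` instead of arctan (feature_y/feature_x) so that all four quadrants are covered, and applies a non-periodic Gaussian kernel with an arbitrarily chosen bandwidth. All function names are hypothetical.

```python
import math
import random


def angles(points):
    """Angle of each 2-D point in [-pi, pi]."""
    return [math.atan2(y, x) for x, y in points]


def gaussian_kde(samples, grid, bandwidth=0.3):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]


# Synthetic stand-in for the 2-D t-SNE output of 500 sampled nodes.
rng = random.Random(0)
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]

theta = angles(points)
grid = [-math.pi + 2.0 * math.pi * i / 200 for i in range(201)]
density = gaussian_kde(theta, grid)

# A flatter curve (small gap between peak and trough) suggests a more uniform distribution.
print(f"peak density: {max(density):.3f}, lowest density: {min(density):.3f}")
```

A steep, peaked curve corresponds to clustered embeddings, while a flat curve corresponds to embeddings spread evenly over the angles.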

The experimental results are shown in Fig. 4. On Yelp, all methods learn clustered embedding features. In all cases, the density curves of popular users and items exhibit a certain degree of steepness. The proportion of highly interactive nodes in Yelp is much higher than in the other two datasets. Therefore, the Yelp dataset is more prone to over-smoothing and noise problems in the hot groups, and it may be beneficial to limit the interactions of highly interactive nodes to alleviate this problem. LightGCN, compared to CL methods, learns a more clustered distribution, possibly due to the sparsity of the data. Compared to other CL methods, SGL exhibits poorer uniformity, possibly because the random dropout strategy makes learning difficult for certain nodes. CollaGCL, XSimGCL, and SimGCL learn more uniform embedding representations, indicating that contrastive learning mitigates data sparsity. CollaGCL additionally exhibits local clustering, which prevents excessive uniformity that could make features difficult to distinguish.

Fig. 4

The distribution of representations learned from three datasets.

4.5 Ablation study (RQ4)

To examine whether the proposed modules have a positive impact, ablation experiments are conducted on different variants. Brief descriptions of all variants are as follows:

baseline: contrastive learning using only SVD augmentation.

w/o-gcr: at the model input stage, the module of global learning is removed.

w/o-jrl: in the graph convolution stage, the module of joint-view learning is removed.

w/o-meta: in this variant, meta-knowledge diffusion is eliminated.

As the results in Table 5 show, CollaGCL outperforms all variants on all datasets. Removing global learning (w/o-gcr) or meta-knowledge diffusion (w/o-meta) has a significant impact on all three datasets, highlighting the importance of global collaborative information for perturbed view learning and the optimization role of meta-knowledge for recommendation tasks. Removing joint-view learning (w/o-jrl) has a greater impact on the dataset with relatively less data and less influence on the larger and sparser datasets. This could be because the sparsity of data affects the effectiveness of graph convolutional networks, making it challenging for perturbed views to learn valuable information from the main view. CollaGCL and all its variants outperform the baseline, indicating that every combination of modules has a beneficial effect.

Table 5
Ablation study on key components of CollaGCL

Data	Yelp		Gowalla		Amazon	
Variants	Recall@20	NDCG@20	Recall@20	NDCG@20	Recall@20	NDCG@20
baseline	0.0955	0.0813	0.2037	0.1210	0.0813	0.0613
w/o-gcr	0.1035	0.0882	0.2370	0.1442	0.1185	0.0946
w/o-jrl	0.1074	0.0922	0.2529	0.1541	0.1290	0.1031
w/o-meta	0.1107	0.0951	0.2209	0.1358	0.1131	0.0893
CollaGCL	0.1147	0.0985	0.2551	0.1549	0.1319	0.1047

4.6 Resistance against popularity bias (RQ5)

To investigate the model’s ability to mitigate popularity bias, the data is divided into groups by interaction count, with the ranges {0-5, 6-10, 11-15, 16-20, 21-25}; items with fewer interactions are defined as long-tail items. The decomposed Recall@20 is used as the evaluation metric: $\mathrm{Recall}^{(g)} = \frac{| (\mathbb{V}_{rec}^{u})^{(g)} \cap \mathbb{V}_{test}^{u} |}{| \mathbb{V}_{test}^{u} |}$, where $\mathbb{V}_{test}^{u}$ denotes the item set associated with user u in the test set, and $(\mathbb{V}_{rec}^{u})^{(g)}$ represents the set of items belonging to group g that are recommended to user u.
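
The decomposed recall for a single user can be sketched as follows. The item identifiers, group labels, and the function name `decomposed_recall` are hypothetical toy values chosen for illustration, and the recommended list is assumed to contain no duplicates.

```python
def decomposed_recall(recommended, test_items, item_group):
    """Per-group recall for one user: |rec^(g) & test| / |test|."""
    test_set = set(test_items)
    if not test_set:
        return {}
    hits = {}
    for item in recommended:
        if item in test_set:
            group = item_group[item]
            hits[group] = hits.get(group, 0) + 1
    return {group: count / len(test_set) for group, count in hits.items()}


# Hypothetical toy data: item -> interaction-count group, plus one user's lists.
item_group = {"a": "0-5", "b": "0-5", "c": "6-10", "d": "21-25", "e": "6-10"}
top_k = ["a", "c", "d", "e"]     # recommended items (the top 20 in the experiment)
held_out = ["a", "b", "c", "d"]  # the user's test items

per_group = decomposed_recall(top_k, held_out, item_group)
print(per_group)
print(sum(per_group.values()))
```

Because every item belongs to exactly one group, the per-group values sum to the user's overall recall, so the decomposition shows how much of the recall comes from long-tail items.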

The experimental results are shown in Fig. 5. CollaGCL effectively mitigates the popularity bias on both datasets and provides good recommendations for nodes with few interactions. This is because CollaGCL effectively learns the diversity of features from different views and mitigates the issue of data sparsity. As the interaction count increases, the performance of CollaGCL gradually approaches that of the baseline methods, because more interactions lead to more thorough training of these nodes. It is noteworthy that in certain groups SGL performs worse than LightGCN, possibly because the random dropout strategy disrupts the graph structure and affects connectivity, resulting in suboptimal embeddings for some groups. The multi-perturbed view structure in SimGCL allows more diverse features to be learned for the long-tail items, leading to better performance than XSimGCL in certain groups. The application of contrastive learning in recommendation models helps optimize embeddings for long-tail items, alleviating the problems of popularity bias and data sparsity.

Fig. 5

The ability to promote long-tail items.

4.7 Hyperparameter analysis (RQ6)

Experiments were carried out to assess how key parameters influence the overall performance of CollaGCL. Each parameter was varied while the other parameters were kept at their optimal values. As shown in Fig. 6, the performance trend with respect to q is relatively flat on all three datasets, which is attributed to the effective preservation of important information. On the Amazon dataset, however, excessively large values of q introduce noise, while excessively small values of q fail to retain crucial information, both leading to a sharp decline in performance. According to the experimental results, the optimal value of q on the three datasets is 5.

Fig. 6

The impact of q.

The temperature coefficient τ is used to control the model’s discrimination against negative samples. Setting a larger τ leads to a more uniform distribution, where the model treats all negative samples equally. On the other hand, a smaller τ sharpens the distribution, making the model pay more attention to hard negative samples. As shown in Fig. 7, CollaGCL is highly sensitive to the parameter τ, and the optimal value for τ across the three datasets is found to be 0.2.
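
The effect of τ on hard negatives can be illustrated with a small sketch. It assumes an InfoNCE-style loss in which negatives enter through exp(similarity / τ); the similarity values and the function name `negative_weights` are hypothetical. The softmax weights below are proportional to the share of the gradient that each negative sample receives relative to the other negatives.

```python
import math


def negative_weights(similarities, tau):
    """Softmax over negative-sample similarities scaled by 1 / tau."""
    scaled = [s / tau for s in similarities]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities between an anchor and four negatives;
# the first one is the hardest negative (most similar to the anchor).
neg_sims = [0.9, 0.5, 0.1, -0.3]
for tau in (0.05, 0.2, 1.0):
    weights = negative_weights(neg_sims, tau)
    print(f"tau={tau}: {[round(w, 3) for w in weights]}")
```

With a small τ almost all of the weight falls on the hardest negative, while a large τ spreads the weight nearly evenly, which matches the behaviour described above.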

Fig. 7

The impact of τ.

The parameter λ₁ is used to control the influence of the CL loss on the model. As shown in Fig. 8, on the large and relatively sparse Gowalla and Amazon datasets, CollaGCL tends to prefer larger weights for optimizing the model, with the optimal value being 0.8. On the other hand, for the relatively smaller Yelp dataset, CollaGCL tends to favor smaller weights for model optimization, and the optimal value is found to be 0.2. It can be observed that when λ₁ takes a value of 1.0 on Yelp, the performance sharply declines. The reason for this phenomenon may be that the model is overly sensitive to the contrastive learning loss, leading to issues such as training instability.

Fig. 8

The impact of λ₁.

5 Conclusion

This paper proposed an effective contrastive learning method that uses an SVD-based approach to construct perturbed views. Given that graph structural perturbations result in missing connectivity information in the original graph, CollaGCL incorporates global collaborative information as view input signals. To mitigate the impact of over-smoothing, the joint-view representation learning module reorganizes the main view information and applies self-connections to emphasize each node’s self-information. To better optimize the training task, the cross-view meta-knowledge diffusion module extracts meta-knowledge based on the properties of singular value decomposition and propagates it within the graph structure, improving the embeddings of both views. Experimental results demonstrate that CollaGCL improves recommendation performance and mitigates popularity bias. Future work will consider utilizing auto-encoders to optimize model performance and designing suitable single-view propagation schemes.

Acknowledgments

This work was supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province [KYCX23_3078].

Table 1
Statistics of experimented datasets

Dataset	User#	Item#	Interaction#	Density
Yelp	29,601	24,734	1,517,326	0.00187
Gowalla	50,821	57,440	1,172,425	0.00044
Amazon	78,578	77,801	2,240,156	0.00047