RGCF: Refined graph convolution collaborative filtering with concise and expressive embedding

Abstract

Graph Convolution Networks (GCNs) have attracted significant attention and have become the most popular method for learning graph representations. In recent years, many efforts have focused on integrating GCNs into recommender tasks and have made remarkable progress. At their core, these methods explicitly capture the high-order connectivities between nodes in the user-item bipartite graph. However, we found two potential drawbacks in traditional GCN-based recommendation models: the excessive information redundancy yielded by the nonlinear graph convolution operation reduces the expressiveness of the resultant embeddings, and the important popularity features that are effective in sparse recommendation scenarios are not encoded in the embedding generation process.

In this work, we develop a novel GCN-based recommendation model, named Refined Graph convolution Collaborative Filtering (RGCF), where a refined graph convolution structure is designed to match non-semantic ID inputs. In addition, a new fine-tuned symmetric normalization is proposed to mine node popularity characteristics and further incorporate the popularity features into the embedding learning process. Extensive experiments were conducted on three public million-size datasets, and the RGCF improved by an average of 13.45% over the state-of-the-art baseline. Further comparative experiments validated the effectiveness and rationality of each part of our proposed RGCF. We released our code at https://github.com/hfutmars/RGCF.

Keywords

Collaborative filtering, recommendation, graph convolution network, high-order connectivity, information redundancy, node popularity

1. Introduction

Modern recommendation systems have been widely applied to many online services such as video recommendation [4], music recommendation [5], e-commerce [8], and social networks [28]. Collaborative Filtering (CF) is the mainstream algorithm in recommendation systems [19, 21]. It vectorizes all users and items using only their ID features (user and item IDs, which are sequentially encoded features), and reconstructs their historical interactions with the inner product of these vectors. Matrix Factorization (MF) [30] is the most classical CF method and can achieve good performance when sufficient interaction data is available. However, since sparsity is ubiquitous in modern recommendation, MF fails to learn expressive vector representations for users and items. In this case, how to make full use of the limited interaction data and how to dig out additional features from only these ID data become the key to improving the performance of CF models.

1.1 Why graph convolution networks?

In order to solve the performance bottleneck caused by the sparsity of datasets, many efforts have been devoted to constructing complex embedding functions. Specifically, integrating all available useful information into the embedding representations can improve model performance. SVD++ [15] is a pioneering work that incorporates the items a user has historically interacted with into the construction of the user's representation, modeling his/her preference to obtain an expressive embedding. However, SVD++ only encodes explicit connectivities between users and items into the embedding function, while forgoing the modeling of the implicit connectivities, which can be viewed as the paths between the current node and its multi-hop neighbor nodes in the user-item bipartite graph (also known as high-order connectivities). Graph-based methods [9, 31] are capable of capturing these high-order connectivities due to their capability of learning path information. For example, HOP-Rec [31] indirectly integrates high-order connectivities into the embedding learning process by using random walks to enrich the interaction data of users with multi-hop connected items. Unlike graph-based methods that indirectly use high-order connectivities to enrich the training data, GCN-based methods [22, 32, 14] directly encode high-order connectivities into the embedding function. NGCF [26] has demonstrated that such a direct high-order connectivity modeling strategy in GCN is better than other indirect modeling strategies, and it is clear that GCN is the state-of-the-art approach for capturing high-order connectivities inside the user-item interaction graph.

1.2 Why refined graph convolution networks?

It is worth mentioning that in some GCN-based machine learning tasks, such as image classification [34] and node classification [14], nonlinear network layers are necessary for feature extraction since the initial vector representations contain abundant and diverse information. In contrast, the IDs of users (or items) used by most CF methods carry no complicated patterns or diverse semantic information that can be mined. Therefore, directly using nonlinear graph convolution layers to process ID features, as in [26, 32], will inevitably bring noise into the learned embeddings and further reduce the ability to capture high-order connectivities. To be specific, the network layers in the GCN fail to distill useful information and features from aggregated embedding inputs that are mapped from one-hot ID features only. To this end, LR-GCCF [35] designed a linear GCN structure by removing the nonlinear activation function from the network layers and achieved significant improvements. Additionally, the large number of parameters in the weight transformation matrices of the network layers makes the model prone to overfitting and introduces redundant information into the embedding outputs. As discussed above, the nonlinear network layers in the traditional GCN structure are not suitable for recommendation tasks. We elaborate on this in Section 3.2.

The traditional graph convolution process can be seen as a weighted summation of neighbor nodes, which is then passed through a nonlinear network layer. Most GCN-based methods [26, 32] apply symmetric normalization to compute the weight for each neighbor node. Despite its effectiveness for learning expressive embeddings, we argue that such symmetric normalization is insensitive to node popularity and fails to capture it. In Fig. 1, compared to item i4 with only one interaction, user u1 may prefer i3 with five interactions, because more interactions usually indicate potentially higher quality. This illustrates that node popularity is an important distinguishing feature. However, considering only the popularity feature is insufficient for modeling cold user preferences. For example, in Fig. 1, we cannot tell whether user u1 prefers i5 or i4. From the perspective of popularity, u1 prefers i5, which has higher popularity, while from the perspective of collaborative filtering, u1 prefers i4 because there is a path between u1 and i4 but no path between u1 and i5. Such a phenomenon indicates that neither the collaborative signal nor the popularity feature alone can learn user preference well, but adding popularity features to the learning process of collaborative signals is clearly a feasible and effective strategy. Node popularity is reflected by a node's own degree in the user-item interaction graph. Therefore, using these degree data to fine-tune the symmetric normalization can coordinate node popularity and collaborative signals to further improve the performance of the model. Section 4.3 verifies this assumption.

Figure 1.

Illustration of the user-item bipartite interaction graph for target user u1, item i3 is interacted with {u2, u4, u5, u6, u7}, item i4 is interacted with {u3}, u1 is more interested in i3 than i4 because i3 and i4 are both the 3-order neighbor nodes of u1 but i3 is more popular than i4. Item i5 is more popular than item i4 but there is no path between i5 and u1, therefore we cannot distinguish which item is more in line with u1’s preference.

1.3 Why not the layer aggregation mechanism?

In some nonlinear GCN-based CF methods, such as NGCF [26] and LR-GCCF [35], the layer-aggregation mechanism [33] is applied to concatenate embeddings obtained at each convolution layer as the final embeddings. Despite its effectiveness, we argue that this layer-aggregation mechanism is unnecessary in CF scenarios when the negative impact of nonlinear network layers is eliminated. Specifically, when nonlinear network layers are not considered, the graph convolution can be seen as a linear aggregation process, that is, the embedding obtained at N-th convolution layer is equivalent to a linear combination of the embeddings of all neighbors within N hops, and the concatenation of embeddings obtained at each layer can also be seen as a similar linear combination. Therefore, using layer-aggregation mechanism in CF scenarios where no nonlinear units are used is redundant and meaningless. The reason why this layer-aggregation mechanism can work in nonlinear GCN-based models is that the information redundancy and noise generated by the network layers may be weakened by the embedding concatenation of each layer. However, in our research we actually found that the nonlinear network structure and the layer-aggregation mechanism limited the model’s learning process for high-order connectivities in CF scenarios. This assumption is detailed in Section 3.3 and verified in Section 4.3.

1.4 Our proposal and contributions

In this work, we discuss the limitations of traditional GCN structures for capturing the high-order relations among different entity nodes in recommendation tasks, and propose a new GCN-based CF model, RGCF, where the embeddings of entities are reconstructed through a refined graph convolution structure and several strategies are used to reduce the noise and redundancy that exist in GCN-based methods. First, a linear weighted average operation is used to replace the complex and nonlinear network layers in the embedding function of the GCN-based methods. Then, we simply use the embeddings obtained at the last layer as the final representations to avoid the information overlap caused by concatenating the embeddings of each convolution layer (the layer-aggregation mechanism). Lastly, we propose a fine-tuned symmetric normalization to integrate popularity information into the collaborative signal learning process. In addition, we further improve the model performance by changing the weights of the self-loop nodes in the aggregation process on the user-item graph. We have conducted extensive experiments on three public datasets, and the results show that RGCF achieves significant improvement over other state-of-the-art baselines. To be more specific, our model improves over LR-GCCF w.r.t. Recall@20 by 7.2%, 11.8%, and 22.5% on Gowalla, Yelp2018, and Amazon-Book, respectively.

The main contributions of this work are as follows:

• We first propose that traditional nonlinear GCN-based methods suffer from information redundancy and highlight its negative impact on the model's ability to capture high-order connectivities.
• We present an RGCF framework, which designs a refined graph convolution structure to reduce information redundancy and incorporate node popularity features into the learning process of the collaborative signal.
• We have conducted extensive experiments on three public million-size datasets, empirically demonstrating the state-of-the-art performance of our proposed RGCF.
• To make it easier for subsequent researchers to reproduce our work, we have released our code at https://github.com/hfutmars/RGCF.

The rest of this paper is organized as follows. We first give a brief review of related work in Section 2. We then elaborate the proposed RGCF and discuss the information redundancy in Section 3. In Section 4, we report the experimental results and analyze the effectiveness and rationality of our proposed RGCF. Finally, we conclude this paper in Section 5.

2. Related work

This section introduces factorization-based CF methods and GCN-based CF methods, which are most related to our work.

2.1 Factorization-based CF methods

The core idea of the factorization-based methods is to parameterize all users and items and use the product of the user matrix and the item matrix to reconstruct the interaction matrix. For example, Matrix Factorization (MF) [30] obtains vector representations of users and items by mapping their IDs. In order to improve the expressiveness of the user embeddings, SVD++ integrates the embeddings of historically interacted items into the user embeddings [15]. Meanwhile, many works hold that auxiliary properties related to users and items, such as age, gender, occupation, price, and multimedia features [7, 8], are relevant to user preferences, and integrate these properties into embeddings to improve model performance. Despite the effectiveness of the above methods, they ignore the importance of modeling high-order connectivities. Some works can capture such high-order connectivities. For example, HOSLIM [3] encodes high-order interactions into embeddings, but its time complexity is too high to handle million-size datasets. DICF [29] and NCF [10] apply nonlinear neural networks as the interaction function to capture high-order interactions. HOP-Rec [31] fuses graph and matrix factorization methods, using random walks to find high-order neighboring nodes as positive samples of the target node, and achieves convincing results. However, HOP-Rec only uses high-order interactions to enrich the training data, and the embedding representations of users and items lack an explicit encoding of high-order connectivities.

2.2 GCN-based CF methods

The GCN-based methods [14, 27, 6] are capable of capturing the high-order interactions between graph nodes and integrating them into the node representations. In recent years, many works have applied GCN techniques to the research field of recommender systems. GC-MC [22] uses GCN to construct an encoder that aggregates the information of first-order neighbors into the embedding representations of the target nodes. Compared with GC-MC, PinSage [32] extends the message aggregation function to higher-order cases and achieves better model performance. Section 4.4.1 of NGCF [26] has shown that high-order neighborhood information aggregation can improve the expressiveness of the embeddings. NGCF [26] is a recent work that combines GCN and MF to integrate high-order connectivities into the user and item embedding representations and predicts the preference score with their inner product. LR-GCCF [35] is an improved version of NGCF that designs a linear graph convolution structure by removing the nonlinear activation function.

Despite their effectiveness, we theoretically and empirically find that they suffer from some redundancy problems discussed in Section 3.5 and the capability of capturing high-order connectivities is suboptimal. We design a refined graph convolution structure to avoid these information redundancy problems and achieve the significant performance improvements shown in Table 2.

3. Methodology

In this section, we first briefly introduce the basic concepts of GCN [14] and NGCF [26], and then present the details of our model structure, as illustrated in Fig. 2. Finally, we discuss the negative impact of information redundancy on GCN-based methods.

3.1 Preliminary

Graph Convolution Networks. The core idea of GCN [14] is to capture graph structure information by transforming and aggregating the representations of neighboring nodes. To be specific, GCN includes multiple convolution layers, in which layer $l+1$ depends on the output of layer $l$ . In each layer, information about the target entity can be aggregated by its neighbor nodes. Thus, high-order embeddings can be effectively captured by stacking such multiple convolution layers. The convolution operation can be formulated as follows:

$\displaystyle\bm{E}^{(l+1)}=\sigma\left((\bm{D}+\bm{I})^{-0.5}(\bm{A}+\bm{I})(\bm{D}+\bm{I})^{-0.5}\bm{E}^{(l)}\bm{W}^{(l)}\right),$ (1)

where $\bm{A}+\bm{I}$ is the $n\times n$ adjacency matrix with self-loops added, $n$ is the total number of nodes, $\bm{I}$ is the identity matrix, $\bm{D}+\bm{I}$ denotes the diagonal node degree matrix of $\bm{A}+\bm{I}$ with elements $(\bm{D}+\bm{I})_{ii}=\bm{D}_{ii}+1$, $\bm{E}^{(l+1)}$ and $\bm{E}^{(l)}$ are the $n\times k$ matrices that respectively denote the embeddings obtained at layers $l+1$ and $l$ for all nodes, $k$ is the embedding length, $\bm{W}^{(l)}$ is the weight transformation matrix at layer $l$, and $\sigma(\cdot)$ is a nonlinear activation function.
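As a rough illustration of Eq. (1), the following NumPy sketch applies one convolution layer to a toy three-node graph. The function name `gcn_layer`, the choice of `tanh` as $\sigma(\cdot)$, and the toy graph are assumptions made for the example only; they are not taken from the original work.

```python
import numpy as np

def gcn_layer(A, E, W, activation=np.tanh):
    """One graph convolution layer in the form of Eq. (1) (illustrative sketch)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)            # adjacency matrix with self-loops, A + I
    deg = A_tilde.sum(axis=1)          # diagonal of D + I (always >= 1)
    d_inv_sqrt = deg ** -0.5
    # Symmetric normalization: (D + I)^-0.5 (A + I) (D + I)^-0.5
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return activation(A_hat @ E @ W)

# Toy path graph 0 - 1 - 2 (hypothetical data).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
rng = np.random.default_rng(0)
E0 = rng.normal(size=(3, 4))           # n x k node embeddings
W0 = rng.normal(size=(4, 4))           # k x k weight transformation matrix
E1 = gcn_layer(A, E0, W0)              # embeddings after one layer
```

Stacking several such calls, each with its own weight matrix, gives the multi-layer propagation described above.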

Neural Graph Collaborative Filtering. NGCF [26] is a recent GCN-based CF method. Distinct from the standard GCN [14], NGCF integrates the element-wise product of the target node and its neighboring nodes into the embedding function, and concatenates the embeddings obtained at each layer as the final representations. The multi-layer aggregation process for user $u$ can be formulated as follows:

$\displaystyle\bm{e_{u}}^{(l+1)}=\sigma\left(W_{1}^{(l)}\bm{e_{u}}^{(l)}+\sum_{i\in N_{u}}\frac{1}{\sqrt{|N_{u}||N_{i}|}}\left(W_{1}^{(l)}\bm{e_{i}}^{(l)}+W_{2}^{(l)}(\bm{e_{u}}^{(l)}\odot\bm{e_{i}}^{(l)})\right)\right),$ (2)

where $W_{1}^{(l)}$ and $W_{2}^{(l)}$ are the weight matrices at layer $l$ , $e_{u}^{(l)}$ and $e_{u}^{(l+1)}$ are the embeddings at layer $l$ and $l+1$ for current user $u$ , respectively. $e_{i}^{(l)}$ is the embedding for item $i$ at layer $l$ , $\frac{1}{\sqrt{|N_{u}||N_{i}|}}$ is the graph Laplacian norm [14] to normalize the embeddings aggregated from previous layer, where $N_{u}$ and $N_{i}$ respectively denote u’s and i’s neighborhood, $\sigma(\cdot)$ is the nonlinear activation function. The prediction function of NGCF is formulated as follows:

$\displaystyle\hat{\bm{r}}_{ui}={(\bm{e_{u}}^{1}||\bm{e_{u}}^{2}||\ldots||\bm{e_{u}}^{L})}^{T}\cdot(\bm{e_{i}}^{1}||\bm{e_{i}}^{2}||\ldots||\bm{e_{i}}^{L}),$ (3)

where $\hat{\bm{r}}_{ui}$ is the predicted score between user $u$ and item $i$, $||$ is the concatenation operation, and $\bm{e_{u}}^{L}$ and $\bm{e_{i}}^{L}$ denote the embeddings of user $u$ and item $i$ obtained at layer $L$, respectively.
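To make Eqs. (2) and (3) concrete, here is a minimal single-user sketch. The helper names (`ngcf_user_update`, `ngcf_score`) and the use of LeakyReLU as $\sigma(\cdot)$ are assumptions for illustration; the text above only states that $\sigma(\cdot)$ is a nonlinear activation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # Assumed choice of nonlinearity for this sketch.
    return np.where(x > 0, x, slope * x)

def ngcf_user_update(e_u, item_embs, item_degrees, W1, W2):
    """Eq. (2) for one user; item_embs holds one row per neighbor item."""
    n_u = item_embs.shape[0]                      # |N_u|
    out = W1 @ e_u                                # self term
    for e_i, n_i in zip(item_embs, item_degrees):
        norm = 1.0 / np.sqrt(n_u * n_i)           # graph Laplacian norm
        out += norm * (W1 @ e_i + W2 @ (e_u * e_i))
    return leaky_relu(out)

def ngcf_score(user_layers, item_layers):
    """Eq. (3): inner product of the layer-wise concatenated embeddings."""
    return float(np.concatenate(user_layers) @ np.concatenate(item_layers))

# Hypothetical toy inputs: one user with a single neighbor item of degree 4.
e_u = np.array([1.0, -1.0])
items = np.array([[1.0, 1.0]])
e_u_next = ngcf_user_update(e_u, items, [4], np.eye(2), np.zeros((2, 2)))
```

With an identity `W1` and a zero `W2`, the update reduces to a normalized neighbor sum followed by the activation, which makes the role of each term easy to inspect.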

Linear Residual Graph Convolution Collaborative Filtering. To the best of our knowledge, LR-GCCF is the state-of-the-art GCN-based CF model, which can be seen as an improved version of NGCF by removing the nonlinear activation function and product terms from the graph convolution process of NGCF model. The embedding generation process for user $u$ can be formulated as follows:

$\displaystyle\bm{e_{u}}^{(l+1)}=W^{(l)}\bm{e_{u}}^{(l)}+\sum_{i\in N_{u}}\frac{1}{\sqrt{|N_{u}||N_{i}|}}W^{(l)}\bm{e_{i}}^{(l)},$ (4)

where $W^{(l)}$ is the weight matrix at layer $l$. Compared with Eq. (2), Eq. (4) removes only the activation function $\sigma(\cdot)$ and the product term $\bm{e_{u}}^{(l)}\odot\bm{e_{i}}^{(l)}$. The prediction function of LR-GCCF is exactly the same as that of NGCF; see Eq. (3).

Figure 2.

Illustration of our RGCF model, which integrates high-order connectivities into the embeddings for user $u1$ and item $i1$ and outputs the matching score for that user-item pair. $\{i1,i2,i3,i4,i5\}$ is the connected item set for user $u1$, $\{u1,u2,u3,u4\}$ is the connected user set for item $i1$, $A$ is the normalized adjacency matrix, $A_{\textit{uaib}}$ equals $\frac{1}{{|N_{ua}|}^{0.5}{|N_{ib}|}^{p}}$, $I$ is the identity matrix, and $D$ is the degree matrix.

3.2 Model

In this section, we present a detailed description of our RGCF model. As Fig. 2 shows, the embeddings of users and items are generated separately. User (item) embeddings are generated by propagating information from the first to the last layer. In each layer, the entity (user or item) is embedded by aggregating the information from both its neighbor nodes and the entity itself.

3.2.1 Embedding initialization

We first randomly initialize the embedding representations of all users and items. We are dealing with a collaborative filtering scenario where only ID information is available, that is, the input data consists only of user IDs and item IDs without any additional information. The embeddings can therefore only be initialized by a simple ID mapping. The initialized user embeddings and item embeddings are as follows:

$\displaystyle\bm{E}^{(0)}_{u}=\{\bm{e}^{(0)}_{u_{1}},\bm{e}^{(0)}_{u_{2}},\ldots,\bm{e}^{(0)}_{u_{m}}\},\quad\bm{E}^{(0)}_{i}=\{\bm{e}^{(0)}_{i_{1}},\bm{e}^{(0)}_{i_{2}},\ldots,\bm{e}^{(0)}_{i_{n}}\},$ (5)

where $E^{(0)}_{u}$ and $E^{(0)}_{i}$ are the initialized user and item embedding matrices, respectively, $m$ and $n$ are the total number of users and items, respectively, and $\bm{e}^{(0)}_{u_{1}}$ denotes the initialized vector representation of user $u_{1}$ .
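The ID mapping in Eq. (5) amounts to a row lookup in two randomly initialized tables. The sketch below assumes a zero-mean Gaussian initializer with a small standard deviation; the initializer, the function name `init_embeddings`, and the toy sizes are illustrative assumptions, since the text only says the embeddings are initialized randomly.

```python
import numpy as np

def init_embeddings(num_users, num_items, k, std=0.01, seed=0):
    """Randomly initialize the user and item embedding tables of Eq. (5)."""
    rng = np.random.default_rng(seed)
    E_u = rng.normal(0.0, std, size=(num_users, k))   # m x k user table
    E_i = rng.normal(0.0, std, size=(num_items, k))   # n x k item table
    return E_u, E_i

E_u, E_i = init_embeddings(num_users=4, num_items=5, k=8)
e_u1 = E_u[0]   # the ID of user u_1 simply selects row 0 of the table
```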

3.2.2 Embedding generation

Inspired by the outstanding performance of GCN and SGCN in capturing high-order interactions, we design a refined graph convolution network structure to incorporate high-order neighborhood information into the embedding representations of users and items. The refined graph convolution can be seen as a linear aggregation process over the graph (the light blue part of the Fig. 2). We use the user embedding construction to detail the aggregation process, and the item embedding aggregation is similar.

The refined first-order GCN operation can be seen as a process of capturing messages from the node itself and from neighboring nodes, and incorporating this information into the node’s own embedding representation. We formulate these two messages as follows:

$\displaystyle m_{u\leftarrow i}^{(0)}=\sum_{i\in N_{u}}\frac{1}{{|N_{u}|}^{0.5}{|N_{i}|}^{0.5}}\cdot\bm{e}_{i}^{(0)},$ (6)

$\displaystyle m_{u\leftarrow u}^{(0)}=\frac{1}{{|N_{u}|}^{0.5}{|N_{u}|}^{0.5}}\cdot\bm{e}_{u}^{(0)},$ (7)

where $m_{u\leftarrow i}^{(0)}$ is the message from neighboring nodes, which reflects the historical interactions of user $u$, and $m_{u\leftarrow u}^{(0)}$ is the message from the node itself, which can be viewed as the intrinsic properties of the node. $\bm{e}_{i}^{(0)}$ and $\bm{e}_{u}^{(0)}$ are the initialized vector representations of item $i$ and user $u$, respectively, and $\frac{1}{{|N_{u}|}^{0.5}{|N_{i}|}^{0.5}}$ is the graph Laplacian norm (or symmetric normalization) used to normalize the aggregated message, which is equivalent to $\frac{1}{\sqrt{|N_{u}||N_{i}|}}$ in Eq. (2). $|N_{u}|$ denotes the number of neighbor nodes of user $u$, and $|N_{i}|$ denotes the number of neighbor nodes of item $i$.

Node Popularity Fusion. In Section 1, we have shown that node popularity is an important and distinguishing feature, especially for users with less interaction data (or cold users), and that incorporating node popularity features into the learning process of collaborative signals is a more effective way to model user preferences than traditional CF methods. In our proposed refined GCN structure, the symmetric normalization method is improved to mine the node popularity feature. Specifically, we set the exponent of the neighbor degree in the symmetric normalization as a hyper-parameter greater than 0 and less than 0.5, which is equivalent to increasing the weights of neighboring nodes according to their popularity. As such, the aggregated messages in Eqs (6) and (7) can be formulated as follows:

$\displaystyle m_{u\leftarrow i}^{(0)}=\sum_{i\in N_{u}}\frac{1}{{|N_{u}|}^{0.5}{|N_{i}|}^{p}}\cdot\bm{e}_{i}^{(0)},$ (8)

$\displaystyle m_{u\leftarrow u}^{(0)}=\frac{1}{{|N_{u}|}^{0.5}{|N_{u}|}^{p}}\cdot\bm{e}_{u}^{(0)},$ (9)

where $p$ is a hyper-parameter to control the effect of the popularity of neighboring nodes, and $p\in\{f|0<f<0.5\}$ . Note that the optimal $p$ is 0.4 in our experiments.
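The effect of lowering the exponent from 0.5 to $p$ can be checked numerically. In the sketch below, `neighbor_weight` is a hypothetical helper that evaluates the coefficient in Eq. (8); the degrees are made-up values. Relative to symmetric normalization, the weight of a neighbor with degree $|N_{i}|$ is multiplied by $|N_{i}|^{0.5-p}$, so popular neighbors gain more than unpopular ones.

```python
def neighbor_weight(deg_u, deg_i, p=0.5):
    """Coefficient of neighbor i in user u's aggregation, as in Eq. (8)."""
    return 1.0 / (deg_u ** 0.5 * deg_i ** p)

deg_u = 2                                   # hypothetical user degree
for deg_i in (1, 5):                        # an unpopular and a popular item
    symmetric = neighbor_weight(deg_u, deg_i, p=0.5)
    tuned = neighbor_weight(deg_u, deg_i, p=0.4)
    # The ratio tuned / symmetric equals deg_i ** (0.5 - 0.4).
    print(deg_i, symmetric, tuned, tuned / symmetric)
```

For an item with a single interaction the ratio is 1, so its weight is unchanged, while every item with more interactions receives a larger weight than under symmetric normalization.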

Next, we aggregate the two messages $m_{u\leftarrow i}^{(0)}$ and $m_{u\leftarrow u}^{(0)}$ to obtain a first-order embedding representation for user $u$. In addition, we argue that these two messages contribute differently to the final representation of node $u$, so a hyper-parameter $\lambda$ is set to control the weight of the message from $u$ itself. We formulate this process as follows:

$\displaystyle\bm{e}^{(1)}_{u}=m_{u\leftarrow i}^{(0)}+\lambda\cdot m_{u\leftarrow u}^{(0)},$ (10)

where $\bm{e}^{(1)}_{u}$ is the embedding representation generated by the first-order refined graph convolution operation for user $u$ .

High-order Stacking. As with the traditional GCNs, we stack multiple first-order graph convolution operations in Eq. (10) to capture high-order interactions. After stacking $l$ such message aggregation operations, the resultant embedding representation of the user $u$ can preserve the neighbor information within $l$ hops. Such a high-order stacking process can be formulated as follows:

$\displaystyle\left\{\begin{array}{l}m_{u\leftarrow i}^{(l-1)}=\sum_{i\in N_{u}}\frac{1}{{|N_{u}|}^{0.5}{|N_{i}|}^{p}}\cdot\bm{e}_{i}^{(l-1)},\\[4pt] m_{u\leftarrow u}^{(l-1)}=\frac{1}{{|N_{u}|}^{0.5}{|N_{u}|}^{p}}\cdot\bm{e}_{u}^{(l-1)},\\[4pt] \bm{e}^{(l)}_{u}=m_{u\leftarrow i}^{(l-1)}+\lambda\cdot m_{u\leftarrow u}^{(l-1)},\end{array}\right.$ (11)

where $\bm{e}^{(l)}_{u}$ is the embedding of user $u$ after stacking $l$ graph convolution operations, $m_{u\leftarrow i}^{(l-1)}$ and $m_{u\leftarrow u}^{(l-1)}$ denote the message of neighbor nodes and node $u$ itself at hop $l-1$ , which is built on the node embedding representations $\bm{e}_{i}^{(l-1)}$ and $\bm{e}_{u}^{(l-1)}$ of the previous hop. We can generate the representation for the item node $i$ in a similar way.

Algorithm 3.2.2 gives the specific iterative embedding generation process.

Algorithm 3.2.2: Embedding generation

Input: Initial embeddings $\bm{e_{u}^{(0)}}$ and $\bm{e_{i}^{(0)}}$ for user node $u$ and item node $i$; set of $u$'s neighborhood embeddings $\{\bm{e_{i}^{(0)}}\mid i\in\bm{N_{u}}\}$; set of $i$'s neighborhood embeddings $\{\bm{e_{u}^{(0)}}\mid u\in\bm{N_{i}}\}$; self-loop weight $\lambda$; and depth of message aggregation layers $L$.

Output: The embedding representations $\bm{e_{u}^{(L)}}$ and $\bm{e_{i}^{(L)}}$ obtained at convolution layer $L$ for nodes $u$ and $i$.

Let $l=1$.
while $l\neq L+1$ do
    Generate embedding for user $u$:
        $m_{u\leftarrow i}^{(l-1)}=\sum_{i\in N_{u}}\frac{1}{{|N_{u}|}^{0.5}{|N_{i}|}^{p}}\cdot e_{i}^{(l-1)}$
        $m_{u\leftarrow u}^{(l-1)}=\frac{1}{{|N_{u}|}^{0.5}{|N_{u}|}^{p}}\cdot e_{u}^{(l-1)}$
        $e_{u}^{(l)}=m_{u\leftarrow i}^{(l-1)}+\lambda\cdot m_{u\leftarrow u}^{(l-1)}$
    Generate embedding for item $i$:
        $m_{i\leftarrow u}^{(l-1)}=\sum_{u\in N_{i}}\frac{1}{{|N_{i}|}^{0.5}{|N_{u}|}^{p}}\cdot e_{u}^{(l-1)}$
        $m_{i\leftarrow i}^{(l-1)}=\frac{1}{{|N_{i}|}^{0.5}{|N_{i}|}^{p}}\cdot e_{i}^{(l-1)}$
        $e_{i}^{(l)}=m_{i\leftarrow u}^{(l-1)}+\lambda\cdot m_{i\leftarrow i}^{(l-1)}$
    $l=l+1$
end while
return $\bm{e_{u}^{(L)}}$ and $\bm{e_{i}^{(L)}}$
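The iterative procedure can be sketched node by node as follows. This is an illustrative reading of Eq. (11) rather than the released implementation: the function name `rgcf_propagate`, the adjacency-list input format, and the toy data are assumptions, and the sketch assumes every user and item has at least one interaction so that no degree is zero.

```python
import numpy as np

def rgcf_propagate(user_items, E_u, E_i, L=2, p=0.4, lam=1.0):
    """Node-wise embedding generation; user_items[u] lists the items of user u."""
    num_items = E_i.shape[0]
    item_users = [[] for _ in range(num_items)]
    for u, items in enumerate(user_items):
        for i in items:
            item_users[i].append(u)
    deg_u = np.array([len(items) for items in user_items], dtype=float)
    deg_i = np.array([len(users) for users in item_users], dtype=float)

    for _ in range(L):
        new_u = np.zeros_like(E_u)
        new_i = np.zeros_like(E_i)
        # Both updates read the embeddings of the previous layer (l - 1).
        for u, items in enumerate(user_items):
            m_nbr = sum(E_i[i] / (deg_u[u] ** 0.5 * deg_i[i] ** p) for i in items)
            m_self = E_u[u] / (deg_u[u] ** 0.5 * deg_u[u] ** p)
            new_u[u] = m_nbr + lam * m_self
        for i, users in enumerate(item_users):
            m_nbr = sum(E_u[u] / (deg_i[i] ** 0.5 * deg_u[u] ** p) for u in users)
            m_self = E_i[i] / (deg_i[i] ** 0.5 * deg_i[i] ** p)
            new_i[i] = m_nbr + lam * m_self
        E_u, E_i = new_u, new_i
    return E_u, E_i

# Hypothetical toy graph: user 0 interacted with item 0, user 1 with items 0 and 1.
user_items = [[0], [0, 1]]
E_u0 = np.array([[1.0], [2.0]])
E_i0 = np.array([[3.0], [4.0]])
E_u1, E_i1 = rgcf_propagate(user_items, E_u0, E_i0, L=1, p=0.5, lam=1.0)
```

The explicit loops are meant for readability only; they make the per-node messages visible at the cost of speed.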

Matrix Implementation. In practice, we use sparse matrix multiplications to implement the above mentioned embedding function. The detailed operations can be formulated as follows:

$\displaystyle\bm{E}^{(l)}=(\bm{D}+\lambda\bm{I})^{-0.5}(\bm{A}+\lambda\bm{I})(\bm{D}+\lambda\bm{I})^{-p}\bm{E}^{(l-1)},$ (12)

where $\bm{A}+\lambda\bm{I}$ is the $(m+n)\times(m+n)$ adjacency matrix with a weighted self-loop added, $m$ and $n$ are the numbers of users and items, $\bm{I}$ is the identity matrix, $\lambda$ is a hyper-parameter that controls the weight of the self-loop, $\bm{D}+\lambda\bm{I}$ denotes the diagonal node degree matrix with elements $(\bm{D}+\lambda\bm{I})_{ii}=\bm{D}_{ii}+\lambda$, $\bm{E}^{(l)}$ and $\bm{E}^{(l-1)}$ are the $(m+n)\times k$ matrices that denote the embeddings of all users and items obtained at layers $l$ and $(l-1)$, respectively, and $k$ is the embedding length.
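A literal reading of Eq. (12) is sketched below with dense NumPy arrays to keep the example self-contained; a practical implementation would store the normalized adjacency as a sparse matrix. The function name `rgcf_matrix_propagate` and the toy interaction matrix are assumptions. Note that, read literally, the degree terms here include $\lambda$, so the coefficients are not numerically identical to those of the node-wise form in Eq. (11).

```python
import numpy as np

def rgcf_matrix_propagate(R, E0, L=2, p=0.4, lam=1.0):
    """Apply Eq. (12) L times; R is the m x n binary interaction matrix."""
    m, n = R.shape
    A = np.zeros((m + n, m + n))
    A[:m, m:] = R                         # user -> item edges
    A[m:, :m] = R.T                       # item -> user edges
    deg = A.sum(axis=1) + lam             # diagonal of D + lam * I
    # (D + lam I)^-0.5 (A + lam I) (D + lam I)^-p, built once and reused
    A_hat = (deg ** -0.5)[:, None] * (A + lam * np.eye(m + n)) * (deg ** -p)[None, :]
    E = E0
    for _ in range(L):
        E = A_hat @ E
    return E

# Hypothetical toy data: 2 users, 2 items, embedding length 3.
R = np.array([[1.0, 0.0],
              [1.0, 1.0]])
E0 = np.random.default_rng(0).normal(size=(4, 3))
E2 = rgcf_matrix_propagate(R, E0, L=2)
```

Because the normalized matrix does not change across layers, it only needs to be built once and can then be applied repeatedly.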

3.3 Prediction

Distinct from NGCF [26], which concatenates the multiple representations obtained at each convolution layer, we use the embedding obtained at the last layer as the final representation in the RGCF framework, which is the same as in the standard GCN [14]. The key reason is that concatenating representations from different layers leads to information redundancy. To be specific, the embeddings obtained at layer $l$ already contain most of the information that comes from the previous layers, since the aggregation operation of the previous layers in RGCF is linear (the graph convolution operation in RGCF does not use nonlinear network layers). Therefore, in RGCF, we obtain the final representations for user $u$ and item $i$ as follows:

$\displaystyle\bm{e_{u}}^{*}=\bm{e_{u}}^{(L)},\quad\bm{e_{i}}^{*}=\bm{e_{i}}^{(L)},$ (13)

where $\bm{e_{u}}^{(L)}$ and $\bm{e_{i}}^{(L)}$ are the embeddings obtained at last layer $L$ for user $u$ and item $i$ respectively.

Inner product is applied to predict the matching score of a user-item pair $<u,i>$ . We formulate the prediction function as follows:

$\displaystyle\hat{\bm{r}}_{ui}={\bm{e_{u}}^{*}}^{T}\bm{e_{i}}^{*},$ (14)

where $\hat{r}_{ui}$ is a predicted preference score for $u$ towards the target item $i$ , and $\bm{e_{u}}^{*}$ and $\bm{e_{i}}^{*}$ are the final representations for user $u$ and item $i$ .
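Eq. (14) for all user-item pairs is a single matrix product of the final embedding tables. The sketch below also shows a top-$K$ ranking step that masks already observed items; that ranking step, the function names, and the toy embeddings are assumptions added for illustration and are not described in this section.

```python
import numpy as np

def predict_scores(E_u, E_i):
    """Eq. (14) for every pair: entry [u, i] is the inner product of e_u* and e_i*."""
    return E_u @ E_i.T

def top_k_items(scores_u, seen, k):
    """Rank items for one user, skipping items the user has already interacted with."""
    scores = scores_u.astype(float)          # astype returns a copy
    scores[np.array(list(seen), dtype=int)] = -np.inf
    return np.argsort(-scores)[:k].tolist()

# Hypothetical final embeddings: 1 user, 3 items, embedding length 2.
E_u_final = np.array([[1.0, 0.0]])
E_i_final = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
scores = predict_scores(E_u_final, E_i_final)
recommended = top_k_items(scores[0], seen={2}, k=2)
```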

3.4 Training

Loss Function. We use Bayesian Personalized Ranking (BPR) loss [18] to optimize the parameters for our model. The basic assumption for BPR loss is that the observed interactions can reflect stronger preferences than unobserved ones, that is to say, the predicting score for an observed user-item pair should be higher than unobserved one. The loss function for our model is formulated as follows:

$\displaystyle\textit{loss}=\sum_{(u,i,j)\in O}-\ln\sigma(\hat{r}_{ui}-\hat{r}_{uj})+\alpha\|E\|_{2}^{2},$ (15)

where $O=\{(u,i,j)\mid i\in N_{u},j\notin N_{u}\}$ is the training data, $N_{u}$ denotes the observed item set for user $u$ , $\sigma(\cdot)$ is the sigmoid function; we apply $L_{2}$ regularization on $E$ parameterized by $\alpha$ , $E$ is the final embeddings obtained at the last layer.
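A minimal NumPy version of the loss in Eq. (15) might look as follows. The function name `bpr_loss`, the batch format (a list of `(u, i, j)` index triples), and the default value of `alpha` are assumptions; the sketch regularizes whichever embedding tables are passed in and computes only the loss value, not its gradient.

```python
import numpy as np

def bpr_loss(E_u, E_i, triples, alpha=1e-4):
    """BPR loss of Eq. (15) over (user, positive item, negative item) triples."""
    u, i, j = (np.asarray(col) for col in zip(*triples))
    pos = np.sum(E_u[u] * E_i[i], axis=1)      # predicted scores of observed pairs
    neg = np.sum(E_u[u] * E_i[j], axis=1)      # predicted scores of unobserved pairs
    # -ln(sigmoid(x)) = ln(1 + exp(-x)); logaddexp evaluates this stably
    ranking = np.logaddexp(0.0, -(pos - neg)).sum()
    reg = alpha * (np.sum(E_u ** 2) + np.sum(E_i ** 2))
    return ranking + reg

# Hypothetical toy embeddings and one training triple (u=0, i=0, j=1).
E_u_toy = np.array([[1.0, 0.0]])
E_i_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_value = bpr_loss(E_u_toy, E_i_toy, [(0, 0, 1)], alpha=0.0)
```

In training, the triples would be drawn by pairing each observed interaction with a sampled unobserved item for the same user.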

Optimizer. Mini-batch Adam optimizer [13] is applied to optimize our model and update the model parameters. Note that the parameters that need to be updated are the embeddings mapped from the IDs of all users and items, which is almost equivalent to that of MF [30].

Time Complexity Analysis. Embedding generation and model prediction are the two main operations in our RGCF model. We use efficient sparse matrix multiplications for the embedding generation, with complexity linear in the number of nonzero entries in the adjacency matrix (i.e., the number of interactions), namely $O(L|R^{+}|)$, where $L$ is the number of graph convolution layers and $|R^{+}|$ denotes the number of nonzero entries in the adjacency matrix. Because only the inner product operation is involved in the model prediction, the time complexity of this part is $O(|R^{+}|d)$, where $d$ denotes the embedding size. Hence, the overall complexity of RGCF is $O(L|R^{+}|+|R^{+}|d)$.

Additionally, we recorded the training and inference time of MF, NGCF, LR-GCCF, and RGCF. The four models cost around 7 s, 40 s, 36 s, and 22 s per training epoch on the Gowalla dataset, respectively, and 45 s, 143 s, 137 s, and 130 s for model inference on the test data.

3.5 Discussion on information redundancy

Why are the network layers redundant? Distinct from traditional GCN-based methods, the nonlinear network layers are removed in our RGCF since they bring no benefit to model performance. Although the network layers can find hidden patterns from complex input embeddings that usually contain rich side information, the expressiveness of embeddings will be limited if the inputs do not have complex patterns. Meanwhile, the overfitting problem caused by too many parameters of the network layers cannot be completely eliminated even if dropout technology is applied.

Why is the layer-aggregation mechanism redundant? In the recommendation scenario, the input consists of ID data only, so the nonlinear network layers cannot benefit model performance. Under these conditions, we argue that the layer-aggregation mechanism is redundant. The main reason is that the embedding aggregation at each layer is a linear transformation, and the embeddings obtained at layer $l$ already contain the information inside the embeddings of the previous layers. As such, concatenating the embeddings of each layer (the layer-aggregation mechanism) is equivalent to counting the contribution of low-order interactions multiple times, so the contribution of important high-order interactions is relatively weakened. This analysis supports our argument that the redundancies of traditional GCN-based recommendation methods lead to poor capture of high-order interactions. We use the following simplified formula, which ignores the influence of the graph Laplacian norm, to justify this assumption.

$\displaystyle\bm{E}^{(2)}=(\bm{A}+\bm{I})(\bm{A}+\bm{I})\bm{E}^{(0)}=\bm{A}(\bm{A}+\bm{I})\bm{E}^{(0)}+(\bm{A}+\bm{I})\bm{E}^{(0)}=\bm{A}(\bm{A}+\bm{I})\bm{E}^{(0)}+\bm{E}^{(1)},$ (16)

where $\bm{E}^{(1)}$ and $\bm{E}^{(2)}$ denote the embedding matrices obtained at the first and second layers. We can see that $\bm{E}^{(2)}$ contains $\bm{E}^{(1)}$. Therefore, concatenating the embeddings of each layer is unnecessary once the network layers are removed, as in RGCF. It is worth mentioning that the concatenation operation in traditional GCNs [14, 26] can still be effective, for two reasons: (1) the defective embeddings in traditional GCNs, impaired by the nonlinear network layers, may be remedied by the concatenation operation to some extent; and (2) the embedding aggregation at each layer is no longer a linear transformation, since a nonlinear activation function is applied in the graph convolution process. We report an experimental comparison in Section 4.3 to verify this assumption.
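
The identity in Eq. (16) can be checked numerically on a toy graph. The sketch below is only an illustration under the same simplification (no Laplacian normalization); the adjacency matrix, the initial embeddings, and the helper names `matmul` and `add` are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


# Toy bipartite graph: 2 users (nodes 0-1) and 2 items (nodes 2-3).
A = [[0, 0, 1, 1],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [1, 1, 0, 0]]
I = [[int(r == c) for c in range(4)] for r in range(4)]
E0 = [[1, 2], [3, 4], [5, 6], [7, 8]]  # arbitrary initial embeddings, d = 2

A_plus_I = add(A, I)
E1 = matmul(A_plus_I, E0)  # first-layer embeddings
E2 = matmul(A_plus_I, E1)  # second-layer embeddings

# Right-hand side of Eq. (16): A(A + I)E0 + E1, where (A + I)E0 is E1.
rhs = add(matmul(A, E1), E1)
print(E2 == rhs)
```

Because the propagation is linear, $\bm{E}^{(1)}$ appears as an additive term inside $\bm{E}^{(2)}$, which is the redundancy the argument relies on.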

Why is the product term redundant? In NGCF, the product term $\bm{e}_{u}\odot\bm{e}_{i}$ in Eq. (2) magnifies the preference scores of the user-item pairs, which can increase the affinity of the interacted nodes and help speed up model convergence. However, such product terms are also redundant when the interaction function is an inner product. Specifically, the inner product of $\bm{e}_{u}$ and $\bm{e}_{i}$ can reconstruct the information of the product term $\bm{e}_{u}\odot\bm{e}_{i}$. We further verify this assumption in Section 4.3.
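
The relation between the two quantities can be made concrete with a small example. The sketch below uses hypothetical embedding values and shows only that the inner product equals the sum of the entries of the element-wise product; it is not a proof of redundancy, which is examined experimentally in Section 4.3.

```python
# Hypothetical user and item embeddings (d = 3).
e_u = [0.5, -1.0, 2.0]
e_i = [1.5, 0.25, -0.5]

# Element-wise (Hadamard) product, i.e., the product term e_u ⊙ e_i.
hadamard = [a * b for a, b in zip(e_u, e_i)]

# Inner product used as the interaction function.
inner = sum(a * b for a, b in zip(e_u, e_i))

print(hadamard, inner, sum(hadamard) == inner)
```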

4. Experiments

In this section, we conduct experiments on three public datasets to evaluate the performance of our proposed model. We aim to answer the following research questions:

• RQ1: How does our proposed RGCF perform compared to other state-of-the-art CF models?
• RQ2: Is each component of the refined graph convolution structure helpful for improving model performance?
• RQ3: How do the key hyper-parameter settings affect the performance of our proposed RGCF?

4.1 Experimental settings

Dataset Description. We conducted experiments on three datasets: Gowalla [16], Yelp2018, and Amazon-Book, which are the same datasets used in NGCF [26]. Table 1 shows the statistics of the three datasets. To ensure the quality of the datasets, the 10-core setting is applied to retain only users and items with at least ten interactions. For each dataset, we sampled 80% of each user's historical interactions as the training set and treated the remaining 20% as the test set; we further sampled 10% of the interactions in the training data as the validation set to tune the hyper-parameters.
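
The per-user split described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code: the function name, the random seed, and the choice to remove the validation interactions from the training set are assumptions, and the 10-core filtering step is not shown.

```python
import random
from collections import defaultdict


def split_interactions(interactions, train_ratio=0.8, valid_ratio=0.1, seed=0):
    """Per-user hold-out split: 80% train / 20% test, then 10% of train as validation."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    train, valid, test = [], [], []
    for user, items in by_user.items():
        items = items[:]          # copy before shuffling
        rng.shuffle(items)
        n_train = int(len(items) * train_ratio)
        user_train, user_test = items[:n_train], items[n_train:]
        # Assumption: validation interactions are held out of the training set.
        n_valid = int(len(user_train) * valid_ratio)
        valid += [(user, i) for i in user_train[:n_valid]]
        train += [(user, i) for i in user_train[n_valid:]]
        test += [(user, i) for i in user_test]
    return train, valid, test


# Toy data: 3 users with 20 interactions each (hypothetical item IDs).
toy = [(u, i) for u in range(3) for i in range(20)]
train, valid, test = split_interactions(toy)
print(len(train), len(valid), len(test))
```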

Table 1
Statistics of the datasets

Dataset	#User	#Item	#Interaction	Density
Gowalla	29,858	40,981	1,027,370	0.00084
Yelp2018	31,668	38,048	1,561,406	0.00130
Amazon-Book	52,643	91,599	2,984,108	0.00062
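
The density column is consistent with the ratio of observed interactions to all possible user-item pairs. The short sketch below recomputes it from the counts in Table 1; the function name is hypothetical.

```python
# (#User, #Item, #Interaction) as listed in Table 1.
stats = {
    "Gowalla": (29_858, 40_981, 1_027_370),
    "Yelp2018": (31_668, 38_048, 1_561_406),
    "Amazon-Book": (52_643, 91_599, 2_984_108),
}


def density(num_users: int, num_items: int, num_interactions: int) -> float:
    """Fraction of the user-item matrix that is observed."""
    return num_interactions / (num_users * num_items)


densities = {name: round(density(*s), 5) for name, s in stats.items()}
print(densities)
```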

Evaluation Metrics. For a fair comparison, we adopted the evaluation protocol widely used by most GCN-based CF baselines [26, 35], using recall@$K$ and ndcg@$K$ to evaluate model performance. Specifically, we computed recall@20 and ndcg@20 for each user in the test set and report the averages over all users. Note that for each user, we treated all items that the user has not interacted with as negative samples.
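
For reference, the two metrics can be sketched for a single user as below. The paper does not spell out the formulas in this section, so the sketch assumes the common binary-relevance definitions (recall as the fraction of held-out items retrieved, and NDCG with a $\log_2$ discount normalized by the ideal ranking); the function names and the toy ranking are hypothetical.

```python
import math


def recall_at_k(ranked_items, test_items, k=20):
    """Fraction of the user's held-out items that appear in the top-k ranking."""
    hits = sum(1 for item in ranked_items[:k] if item in test_items)
    return hits / len(test_items)


def ndcg_at_k(ranked_items, test_items, k=20):
    """Binary-relevance NDCG@k with a log2 position discount."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]) if item in test_items)
    ideal_hits = min(k, len(test_items))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


ranked = ["a", "b", "c", "d"]   # hypothetical ranking of unseen items, best first
held_out = {"a", "c", "z"}      # hypothetical test items of one user
r = recall_at_k(ranked, held_out, k=3)
n = ndcg_at_k(ranked, held_out, k=3)
print(r, n)
```

The reported scores would then be the means of these per-user values over all test users.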

Baselines. We compared our proposed method with the following baselines:

• MF [30]: A matrix factorization method optimized with the Bayesian Personalized Ranking (BPR) loss, which is widely used as a recommendation baseline.
• SVD $++$ [15]: A variant of MF that uses the user's historical interactions to model the user's preferences. It can also be regarded as a one-layer linear GCN that passes messages only for user embeddings. For a fair comparison, we used the BPR loss to optimize SVD $++$.
• NeuMF [10]: A state-of-the-art neural collaborative filtering method that uses nonlinear neural networks as the interaction function.
• HOP-Rec [31]: A state-of-the-art graph-based method that uses random walks to enrich the interaction data between users and their multi-hop connected items.
• GC-MC [22]: A GCN-based model that uses only one graph convolution layer to generate the user and item representations.
• PinSage [32]: A GCN-based recommendation method that uses GraphSAGE [6] to aggregate neighbor node information.
• NGCF [26]: A GCN-based MF framework that combines the embeddings obtained at different GCN layers into the final user and item representations (layer-aggregation mechanism).
• LR-GCCF [35]: A model that improves the GCN structure by removing the nonlinear activation function from the graph convolution layer and uses the same layer-aggregation mechanism as NGCF to construct the final embedding representation.

Our proposed models:

• RGCF: Our proposed model, which uses refined graph convolution to update node embeddings. Unlike LR-GCCF and NGCF, RGCF incorporates node popularity features into embedding generation and directly uses the embedding of the last layer as the final representation.
• RGCF $+$ sn: A variant of RGCF in which the fine-tuned symmetric normalization is replaced by the traditional symmetric normalization. This variant is used to investigate the effectiveness of incorporating node popularity features into the learning process of CF signals.

Parameter Settings. To make a fair comparison, we set the embedding size to 64 for all models. We applied a grid search to tune the following hyper-parameters: the learning rate is searched in $\{0.001,0.0005,0.0001,0.00005\}$, the coefficient of $L_{2}$ regularization is searched in $\{1,0.1,\ldots,10^{-6},10^{-7}\}$, and the weight of the self-loop is searched in $\{0.0,0.3,0.5,0.7,1.0,1.2,1.5,1.7,2.0\}$. Our experimental results show that the optimal learning rate is 0.001 and that the optimal $L_{2}$ regularization coefficient is $10^{-4}$ for Gowalla, $10^{-4}$ for Yelp2018, and $10^{-5}$ for Amazon-Book. In addition, we report the time cost of tuning the three hyper-parameters on the Gowalla dataset (tuning the learning rate costs 31.6 hours, the $L_{2}$ coefficient 63.3 hours, and the self-loop weight 71.2 hours), where the number of training epochs is 1000 and the graphics card is an NVIDIA Titan X (Pascal).

Table 2

Overall performance comparison w.r.t. recall@20 and ndcg@20 on Gowalla, Yelp2018, and Amazon-Book datasets

Method	Gowalla		Yelp2018		Amazon-Book
	Recall@20	NDCG@20	Recall@20	NDCG@20	Recall@20	NDCG@20
MF	0.1291 ( $-$ 29.6%)	0.1109 ( $-$ 28.4%)	0.0433 ( $-$ 36.8%)	0.0354 ( $-$ 37.2%)	0.0250 ( $-$ 48.3%)	0.0196 ( $-$ 47.9%)
SVD $++$	0.1439 ( $-$ 21.5%)	0.1267 ( $-$ 18.2%)	0.0500 ( $-$ 27.0%)	0.0412 ( $-$ 30.0%)	0.0332 ( $-$ 31.4%)	0.0251 ( $-$ 33.2%)
NeuMF	0.1326 ( $-$ 27.7%)	0.1212 ( $-$ 21.7%)	0.0451 ( $-$ 34.2%)	0.0363 ( $-$ 35.6%)	0.0258 ( $-$ 46.7%)	0.0200 ( $-$ 46.8%)
HOP-Rec	0.1399 ( $-$ 23.7%)	0.1214 ( $-$ 21.6%)	0.0517 ( $-$ 24.5%)	0.0428 ( $-$ 24.11%)	0.0309 ( $-$ 36.15%)	0.0232 ( $-$ 38.3%)
GC-MC	0.1395 ( $-$ 23.9%)	0.1204 ( $-$ 22.2%)	0.0462 ( $-$ 32.6%)	0.0379 ( $-$ 32.8%)	0.0288 ( $-$ 40.5%)	0.0224 ( $-$ 40.4%)
PinSage	0.1420 ( $-$ 22.5%)	0.1262 ( $-$ 18.5%)	0.0489 ( $-$ 28.6%)	0.0401 ( $-$ 28.9%)	0.0298 ( $-$ 38.4%)	0.0233 ( $-$ 38.0%)
NGCF	0.1547 ( $-$ 15.6%)	0.1327 ( $-$ 14.3%)	0.0579 ( $-$ 15.5%)	0.0477 ( $-$ 15.4%)	0.0337 ( $-$ 30.4%)	0.0261 ( $-$ 30.6%)
LR-GCCF	0.1701 ( $-$ 7.2%)	0.1452 ( $-$ 6.2%)	0.0604 ( $-$ 11.8%)	0.0498 ( $-$ 11.7%)	0.0375 ( $-$ 22.5%)	0.0296 ( $-$ 21.3%)
RGCF $+$ sn	0.1811 ( $-$ 1.2%)	0.1526 ( $-$ 1.4%)	0.0655 ( $-$ 4.3%)	0.0534 ( $-$ 5.3%)	0.0450 ( $-$ 7.0%)	0.0349 ( $-$ 7.2%)
RGCF	0.1833	0.1548	0.0685	0.0564	0.0484	0.0376
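
The percentages in parentheses appear consistent with the relative gap of each method to RGCF, i.e., (baseline − RGCF) / RGCF. The sketch below recomputes a few Gowalla Recall@20 entries from Table 2 under that assumption; the function name is hypothetical.

```python
def relative_gap(baseline: float, rgcf: float) -> float:
    """Relative gap of a baseline to RGCF, in percent (negative means worse than RGCF)."""
    return (baseline - rgcf) / rgcf * 100.0


# Recall@20 on Gowalla, copied from Table 2.
recall_gowalla = {"MF": 0.1291, "NGCF": 0.1547, "LR-GCCF": 0.1701, "RGCF+sn": 0.1811}
rgcf_recall = 0.1833

gaps = {method: round(relative_gap(value, rgcf_recall), 1)
        for method, value in recall_gowalla.items()}
print(gaps)
```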

4.2 Performance comparison (RQ1)

We compared the performance of all methods in this section. Table 2 reports the performance of recall@20 and ndcg@20 for all compared methods. We have the following findings:

• MF achieved poor performance on the three datasets, indicating that a simple inner product is insufficient to capture the complex connectivities between users and items. NeuMF outperformed MF on all datasets, validating the effectiveness of applying neural networks to distill the nonlinear relations between users and items.
• The performance of GC-MC relative to MF and NeuMF demonstrates that integrating first-order connectivities into the embedding process helps improve the expressiveness of the embeddings.
• HOP-Rec outperformed GC-MC in all cases. The key reason is that HOP-Rec exploits high-order neighbors to enrich the training data, while GC-MC considers first-order neighbors only.
• PinSage slightly outperformed GC-MC on all datasets, which illustrates the necessity of stacking multiple graph convolution layers to capture higher-order interactions.
• SVD $++$ significantly outperformed GC-MC and PinSage, which also verifies that using a nonlinear network layer to process ID embeddings adds information redundancy and noise to the representations, thereby degrading model performance.
• NGCF outperformed HOP-Rec on all datasets, which demonstrates that explicitly integrating high-order connectivities into the embedding process is more effective than exploiting high-order interactions to enrich the training data. Meanwhile, NGCF performed slightly better than SVD $++$. The main reason is that NGCF integrates high-order interactions into the embeddings of both users and items, while SVD $++$ integrates only first-order interactions into the user embeddings.
• LR-GCCF achieved better performance than NGCF in all cases, which is owing to the fact that LR-GCCF removes the nonlinear activation function from the graph convolution process. This result demonstrates that the nonlinear component in GCN is redundant for recommendation scenarios.
• Our proposed RGCF achieved the best performance in all cases. Specifically, RGCF outperformed the state-of-the-art LR-GCCF w.r.t. Recall@20 by 7.2%, 11.8%, and 22.5% on Gowalla, Yelp2018, and Amazon-Book, respectively. Moreover, RGCF achieved larger improvements on Yelp2018 and Amazon-Book than on Gowalla, which is correlated with the additional incorporation of node popularity features in RGCF.
• RGCF outperformed RGCF $+$ sn in all cases, which demonstrates that capturing node popularity features and integrating them into the CF signal can significantly improve model performance.

Table 3
Performance of RGCF w.r.t. Recall@20 and NDCG@20 on Gowalla, Yelp2018, and Amazon-Book datasets under 5-fold cross validation

Fold	Gowalla		Yelp2018		Amazon-Book
	Recall@20	NDCG@20	Recall@20	NDCG@20	Recall@20	NDCG@20
Fold-1	0.1815	0.1523	0.0677	0.0558	0.0475	0.0359
Fold-2	0.1845	0.1553	0.0701	0.0575	0.0482	0.0367
Fold-3	0.1833	0.1539	0.0689	0.0553	0.0487	0.0371
Fold-4	0.1823	0.1527	0.0660	0.0551	0.0491	0.0382
Fold-5	0.1835	0.1549	0.0669	0.0561	0.0486	0.0376
Mean	0.1830	0.1538	0.0679	0.0560	0.0484	0.0371

Considering that evaluation on a single 20% hold-out set may be affected by anomalies in the dataset, we additionally ran a more rigorous 5-fold cross validation for RGCF to eliminate this possible bias and to verify the model's effectiveness. Table 3 records the Recall@20 and NDCG@20 per fold for RGCF on the Gowalla, Yelp2018, and Amazon-Book datasets. We found that the performance of RGCF under 5-fold cross validation is very close to that obtained with the 20% hold-out, which validates the effectiveness of our RGCF.
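
As a quick check on the stability across folds, the sketch below recomputes the mean Recall@20 per dataset from the fold values in Table 3 and additionally reports the sample standard deviation, which is not given in the table.

```python
from statistics import mean, stdev

# Recall@20 per fold, copied from Table 3.
recall_folds = {
    "Gowalla": [0.1815, 0.1845, 0.1833, 0.1823, 0.1835],
    "Yelp2018": [0.0677, 0.0701, 0.0689, 0.0660, 0.0669],
    "Amazon-Book": [0.0475, 0.0482, 0.0487, 0.0491, 0.0486],
}

summary = {name: (round(mean(values), 4), round(stdev(values), 4))
           for name, values in recall_folds.items()}
for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.4f}, std={sd:.4f}")
```
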
4.3 Is refined graph convolution structure effective? (RQ2)

In this section, we first verify that the three components of GCN-based methods discussed in Section 3.5 are redundant. We then compare models with different numbers of convolution layers to verify whether our proposed RGCF enhances the capture of high-order connectivities. Finally, we investigate the performance of RGCF under different levels of dataset sparsity.

Table 4
Performance of RGCF variants with different information redundancies

Method	Gowalla		Yelp2018		Amazon-Book
	Recall@20	NDCG@20	Recall@20	NDCG@20	Recall@20	NDCG@20
RGCF $+$ np	0.0584 ( $-$ 68.1%)	0.0505 ( $-$ 67.4%)	0.0216 ( $-$ 68.5%)	0.0193 ( $-$ 65.8%)	0.0164 ( $-$ 66.1%)	0.0125 ( $-$ 66.8%)
RGCF $+$ n	0.0608 ( $-$ 66.8%)	0.0525 ( $-$ 66.1%)	0.0238 ( $-$ 65.3%)	0.0206 ( $-$ 63.5%)	0.0166 ( $-$ 65.7%)	0.0128 ( $-$ 66.0%)
RGCF $+$ npc	0.1547 ( $-$ 15.6%)	0.1327 ( $-$ 14.3%)	0.0559 ( $-$ 18.4%)	0.0455 ( $-$ 19.3%)	0.0344 ( $-$ 28.9%)	0.0266 ( $-$ 29.3%)
RGCF $+$ nc	0.1616 ( $-$ 11.8%)	0.1387 ( $-$ 10.4%)	0.0562 ( $-$ 18.0%)	0.0473 ( $-$ 16.1%)	0.0359 ( $-$ 25.8%)	0.0276 ( $-$ 26.6%)
RGCF $+$ pc	0.1679 ( $-$ 8.4%)	0.1436 ( $-$ 7.2%)	0.0584 ( $-$ 14.7%)	0.0499 ( $-$ 11.5%)	0.0366 ( $-$ 24.4%)	0.0287 ( $-$ 23.7%)
RGCF $+$ p	0.1745 ( $-$ 4.8%)	0.1485 ( $-$ 4.1%)	0.0625 ( $-$ 8.8%)	0.0529 ( $-$ 6.2%)	0.0380 ( $-$ 21.5%)	0.0308 ( $-$ 18.1%)
RGCF $+$ c	0.1730 ( $-$ 5.6%)	0.1467 ( $-$ 5.2%)	0.0585 ( $-$ 14.6%)	0.0502 ( $-$ 11.0%)	0.0373 ( $-$ 22.9%)	0.0304 ( $-$ 19.1%)
RGCF	0.1833	0.1548	0.0685	0.0564	0.0484	0.0376

4.3.1 Impact of information redundancy

We have analyzed the redundancy issues of some state-of-the-art GCN models in Section 3.5, namely (1) nonlinear network layers redundancy, (2) embedding concatenation redundancy, and (3) element-wise product redundancy.

For the sake of presentation, we divided the experiment into two parts: Part A with nonlinear network layers, and Part B without nonlinear network layers.

For the experiments with network layers (Part A), we derive the following variants from RGCF:

• RGCF $+$ n denotes the variant of RGCF in which only the nonlinear network layers are retained to process the embeddings at each graph convolution layer.
• RGCF $+$ np denotes the variant of RGCF that retains the nonlinear network layers and the product terms.
• RGCF $+$ nc denotes the variant of RGCF that retains the nonlinear network layers and the embedding concatenation.
• RGCF $+$ npc denotes the variant of RGCF that contains all three redundancies; it is equivalent to NGCF if the node popularity feature is not considered.

For the experiments without nonlinear network layers (Part B), we similarly derive the following variants from RGCF:

• RGCF $+$ c denotes the variant of RGCF in which only the embedding concatenation is retained.
• RGCF $+$ p denotes the variant of RGCF in which only the product terms are retained.
• RGCF $+$ pc denotes the variant of RGCF that retains the embedding concatenation and the product terms.
• RGCF denotes the model in which all three redundancies are removed.
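
The naming scheme of the eight variants can be summarized as three on/off switches. The sketch below is a hypothetical helper for reading the suffixes (n, p, c); the class and function names are not from the released code, and the separate RGCF $+$ sn variant of Section 4.1 is deliberately rejected because its suffix has a different meaning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    nonlinear: bool  # "n": nonlinear network layers retained
    product: bool    # "p": element-wise product terms retained
    concat: bool     # "c": layer-wise embedding concatenation retained

    @property
    def part(self) -> str:
        """Part A keeps the network layers; Part B removes them."""
        return "A" if self.nonlinear else "B"


def parse_variant(name: str) -> Variant:
    """Map a name such as 'RGCF+npc' to its three redundancy switches."""
    base, _, suffix = name.partition("+")
    if base != "RGCF" or set(suffix) - set("npc"):
        raise ValueError(f"not an n/p/c redundancy variant: {name!r}")
    return Variant("n" in suffix, "p" in suffix, "c" in suffix)


for name in ["RGCF+n", "RGCF+np", "RGCF+nc", "RGCF+npc",
             "RGCF+c", "RGCF+p", "RGCF+pc", "RGCF"]:
    variant = parse_variant(name)
    print(name, "-> Part", variant.part, variant)
```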

Table 4 reports the experimental results. We have the following findings:

(1) The nonlinear network layers are redundant.

• The variants in Part B that remove the network layers (RGCF, RGCF $+$ p, RGCF $+$ pc, RGCF $+$ c) significantly outperformed the variants in Part A that retain them (RGCF $+$ n, RGCF $+$ np, RGCF $+$ npc, RGCF $+$ nc). The most striking comparison is between RGCF $+$ n and RGCF: RGCF achieved an average improvement of 65.57% over RGCF $+$ n. Such significant improvements illustrate that the nonlinear network layers in the graph convolution process are redundant for recommendation scenarios with only ID inputs. We conclude that the nonlinear network component is the main driver of the information redundancy problem in GCN-based CF approaches.
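
The 65.57% figure matches the mean of the six relative gaps listed for RGCF $+$ n in Table 4 (two metrics on three datasets), as the following short check shows.

```python
# Relative gaps (in percent) of RGCF+n to RGCF, copied from Table 4:
# Recall@20 and NDCG@20 on Gowalla, Yelp2018, and Amazon-Book.
gaps_rgcf_n = [66.8, 66.1, 65.3, 63.5, 65.7, 66.0]

average_gap = round(sum(gaps_rgcf_n) / len(gaps_rgcf_n), 2)
print(average_gap)
```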

(2) The layer-aggregation mechanism is redundant, but it can alleviate the information redundancy caused by nonlinear network layers.

• In Part B (without network layers), RGCF and RGCF $+$ p outperformed RGCF $+$ c and RGCF $+$ pc on the three datasets, respectively, and the concatenation operation (layer-aggregation mechanism) reduced model performance by an average of 10.58%. These results illustrate that the layer-aggregation mechanism in some GCN-based models [26, 35] is redundant when nonlinear network layers are not used in the graph convolution process.
• In Part A (with network layers), RGCF $+$ np and RGCF $+$ n performed much worse than RGCF $+$ npc and RGCF $+$ nc, respectively, which is the opposite of the results in Part B. These results illustrate that the layer-aggregation mechanism can alleviate the information redundancy caused by nonlinear network layers.
• RGCF in Part B outperformed RGCF $+$ nc in Part A in all cases, which indicates that although the layer-aggregation mechanism can alleviate the information redundancy caused by the network layers to some extent, it does not eliminate its negative effects. Therefore, discarding both the network layers and the layer-aggregation mechanism is the better choice for minimizing information redundancy.

(3) The product terms are redundant.

• The variants without the product terms achieved better performance than their counterparts with the product terms. In all cases, RGCF outperformed RGCF $+$ p, RGCF $+$ c outperformed RGCF $+$ pc, RGCF $+$ nc outperformed RGCF $+$ npc, and RGCF $+$ n outperformed RGCF $+$ np. These results indicate that the product terms in NGCF are redundant and bring no benefit to model performance, whether or not the network layers are used.

4.3.2 Effect of the number of graph convolution layer

To illustrate the impact of the number of graph convolution layers $L$ on RGCF, we report Recall@20 and NDCG@20 on Gowalla, Yelp2018, and Amazon-Book with different $L$ in Fig. 3. From Fig. 3, we have the following observations:

• The performance of NGCF and RGCF w.r.t. Recall@20 and NDCG@20 improved significantly with increasing layer depth in most cases, which demonstrates that high-order interactions are essential for modeling user preference. When the depth increased to four layers, the performance of both RGCF and NGCF on Yelp2018 decreased slightly, possibly due to overfitting.
• As the layer depth increased, the performance of NGCF improved only slightly, while RGCF showed an impressive improvement in all cases. This result shows that RGCF benefits much more from additional layers than NGCF, again verifying that the refined structure in RGCF is capable of capturing high-order connectivities in the user-item interaction graph.

Table 5
Performance of RGCF w.r.t. Recall@20 and NDCG@20 on Gowalla dataset with different cropping ratios

#Cropping ratio	20%		40%		60%
#Interaction	877862		714797		554255
Metrics	Recall@20	NDCG@20	Recall@20	NDCG@20	Recall@20	NDCG@20
MF	0.1110	0.0854	0.10063	0.07691	0.08822	0.06409
NGCF	0.1316	0.1001	0.1192	0.0923	0.1058	0.0769
LR-GCCF	0.1328	0.1022	0.1201	0.0932	0.1060	0.0770
RGCF	0.1565	0.1202	0.1266	0.1038	0.1203	0.0907

Figure 3.
Performance of NGCF and RGCF with different numbers of convolution layers $L$ w.r.t. recall@20 on Gowalla, Yelp2018, and Amazon-Book datasets.

4.3.3 Test performance w.r.t. interaction sparsity level

An additional experiment was carried out to investigate the performance of RGCF when user interactions are sparse. We cropped the Gowalla dataset by removing 20%, 40%, and 60% of the user interaction records and tested MF, NGCF, LR-GCCF, and RGCF on the cropped datasets. Detailed results are reported in Table 5: RGCF achieved significant improvements over the other baselines on all three cropped datasets, further validating that RGCF is capable of alleviating the data sparsity problem.

4.4 Study of hyper-parameters (RQ3)

Figure 4.
Performance of RGCF with different self-loop weights $\lambda$ w.r.t. $\textit{recall}@20$ and $\textit{ndcg}@20$ on Yelp2018, Gowalla, and Amazon-Book datasets.

Figure 5.
Performance of NGCF and RGCF with different $L_{2}$ regularization coefficient $\alpha$ w.r.t. $\textit{recall}@20$ and $\textit{ndcg}@20$ on Yelp2018, Gowalla, and Amazon-Book datasets.

In this section, we investigated the effect of the self-loop weight $\lambda$ and the $L_{2}$ regularization coefficient on the performance of our proposed model.

4.4.1 Effect of self-loop weight

To investigate how the self-loop weight affects model performance, we searched $\lambda$ in the range of $\{0.0,0.2,0.5,0.7,1.0,1.2,1.5,1.7,2.0\}$. Figure 4 plots the effect of the self-loop weight $\lambda$ w.r.t. $\textit{recall}@20$ and $\textit{ndcg}@20$ on the three datasets. Specifically, RGCF achieved the best performance when $\lambda=1.2$ for Gowalla, $\lambda=0.5$ for Yelp2018, and $\lambda=0.0$ for Amazon-Book. These results show that the importance of the self-loop differs across datasets. Therefore, finding an appropriate self-loop weight can be an effective strategy to further improve recommendation performance.

4.4.2 Effect of $L_{2}$ regularization coefficient

Figure 5 shows the test performance of RGCF w.r.t. Recall@20 and NDCG@20 under different $L_{2}$ regularization coefficient settings on the three datasets. We tuned the $L_{2}$ regularization coefficient $\alpha$ in the range of $\{10^{-1},10^{-2},10^{-3},10^{-4},10^{-5}\}$. From the experimental results, we found that RGCF achieved the best performance when $\alpha=10^{-4}$ for Gowalla, $\alpha=10^{-4}$ for Yelp2018, and $\alpha=10^{-5}$ for Amazon-Book.
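
To illustrate where the coefficient $\alpha$ acts, the sketch below adds an $L_{2}$ penalty to a pairwise BPR ranking loss. BPR is used here because Section 4.1 names it for the MF and SVD $++$ baselines; that RGCF is trained with exactly this objective is an assumption of the sketch, and the function name, scores, and parameter values are hypothetical.

```python
import math


def bpr_loss_with_l2(pos_scores, neg_scores, params, alpha):
    """Pairwise BPR ranking loss plus an L2 penalty weighted by alpha."""
    # -ln(sigmoid(x)) written as log(1 + exp(-x)) for each (positive, negative) pair.
    ranking = sum(math.log1p(math.exp(-(p - n)))
                  for p, n in zip(pos_scores, neg_scores))
    penalty = alpha * sum(w * w for w in params)
    return ranking + penalty


pos = [2.0, 0.5]            # hypothetical scores of observed items
neg = [1.0, 0.5]            # hypothetical scores of sampled unobserved items
params = [1.0, -2.0, 0.5]   # hypothetical embedding parameters
for alpha in (1e-1, 1e-4, 1e-5):
    print(alpha, bpr_loss_with_l2(pos, neg, params, alpha))
```

A larger $\alpha$ penalizes large embedding norms more strongly, which is the trade-off explored in Fig. 5.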

5. Conclusion

In this work, we highlighted the negative impact of the information redundancy existing in traditional nonlinear GCNs and presented a novel GCN-based CF model, RGCF, which uses a refined graph convolution structure to eliminate this redundancy. Additionally, RGCF integrates popularity features into the learning process of the collaborative signal to achieve better recommendations for cold users (users with few interactions) and further alleviate the issue of data sparsity. Extensive experimental results demonstrate the state-of-the-art performance of RGCF, and the further comparison experiments verify the effectiveness and rationality of each of its parts.

In future work, we wish to further improve the performance of RGCF by using the attention mechanism [23, 2] to precisely assign the weight for neighboring nodes and control the granularity of node popularity feature for different recommendation scenarios. Meanwhile, we are interested in integrating causal inference [20] and knowledge graph [1, 25] into our RGCF to improve the interpretability of recommendations.

Acknowledgments

This work is supported by the special support plan for innovation and entrepreneurship in Anhui Province.

References

1. Cao, Wang and Chua T.S., Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, WWW, 2019, 151–161.
2. Chen, Zhang, Nie, Liu and Chua T.S., Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention, SIGIR, 2017, 335–344.
3. Christakopoulou and Karypis, HOSLIM: Higher-Order Sparse LInear Method for Top-N Recommender Systems, KDD, 2014, 38–49.
4. Covington, Adams and Sargin, Deep neural networks for youtube recommendations, RecSys, 2016, 191–198.
5. Fang, Grunberg, Lui and Wang, Development of a music recommendation system for motivating exercise, ICOT, 2017, 83–86.
6. Hamilton W.L., Ying and Leskovec, Inductive representation learning on large graphs, NeurIPS, 2017, 1025–1035.
7. and McAuley, Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering, WWW, 2016, 507–517.
8. and McAuley, Vbpr: Visual bayesian personalized ranking from implicit feedback, AAAI 30(1) (2016).
9. Gao, Kan M.Y. and Wang, Birank: Towards ranking on bipartite graphs, TKDE 29(1) (2017), 57–71.
10. Liao, Zhang, Nie and Chua T.S., Neural collaborative filtering, WWW, 2017, 173–182.
11. Jiang, Niu, Guo, Mustafa, Lin, Chen and Zhou, Novel boosting frameworks to improve the performance of collaborative filtering, ACML, 2013, 87–99.
12. Kabbur, Ning and Karypis, Fism: factored item similarity models for top-n recommender systems, KDD, 2013, 659–667.
13. Kingma D.P. and Ba, Adam: A method for stochastic optimization, ICLR, 2015.
14. Kipf T.N. and Welling, Semi-supervised classification with graph convolutional networks, ICLR, 2017.
15. Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, KDD, 2008, 426–434.
16. Liang, Charlin, McInerney and Blei D.M., Modeling user exposure in recommendation, WWW, 2016, 951–961.
17. Ning and Karypis, Slim: Sparse linear methods for top-n recommender systems, ICDM, 2011, 497–506.
18. Rendle, Freudenthaler, Gantner and Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, UAI, 2009, 452–461.
19. Sarwar, Karypis, Konstan and Riedl, Item-based collaborative filtering recommendation algorithms, WWW, 2001, 285–295.
20. Stephen and Flavian, Causal embeddings for recommendation, RecSys, 2018, 104–112.
21. and Khoshgoftaar T.M., A survey of collaborative filtering techniques, Advances in Artificial Intelligence, 2009, 1–19.
22. Berg R.V.D., Kipf T.N. and Welling, Graph convolutional matrix completion, arXiv preprint arXiv:1706.02263, 2017.
23. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser and Polosukhin, Attention is all you need, NeurIPS, 2017, 5998–6008.
24. Battaglia P.W., Hamrick J.B., Bapst et al., Relational inductive biases, deep learning, and graph networks, arXiv preprint arXiv:1806.01261, 2018.
25. Wang, Cao, Liu and Chua T.S., KGAT: Knowledge graph attention network for recommendation, KDD, 2019, 950–958.
26. Wang, Feng and Chua T.S., Neural graph collaborative filtering, SIGIR, 2019, 165–174.
27. Souza, Zhang, Fifty and Weinberger, Simplifying graph convolutional networks, ICML, 2019, 6861–6871.
28. Sun, Hong, Wang and Wang, A neural influence diffusion model for social recommendation, SIGIR, 2019, 235–244.
29. Xue, Wang, Liu and Hong, Deep item-based collaborative filtering for top-n recommendation, TOIS, 2019, 33:1–33:25.
30. Koren, Bell and Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer 42(8) (2009), 30–37.
31. Yang J.H., Chen C.M., Wang C.J. and Tsai M.F., Hop-rec: high-order proximity for implicit recommendation, RecSys, 2018, 140–144.
32. Ying, Chen, Eksombatchai, Hamilton W.L. and Leskovec, Graph convolutional neural networks for web-scale recommender systems, KDD (Data Science track), 2018, 974–983.
33. Tian, Sonobe, Kawarabayashi K.I. and Jegelka, Representation Learning on Graphs with Jumping Knowledge Network, ICML, 2018, 5449–5458.
34. Wang and Gupta, Zero-shot recognition via semantic embeddings and knowledge graphs, CVPR, 2018, 6857–6866.
35. Chen, Hong, Zhang and Wang, Revisiting Graph based Collaborative Filtering: A Linear Residual Graph Convolutional Network Approach, AAAI, 2020, 27–34.
