Abstract
Knowledge distillation, a model compression technique, has drawn considerable attention in recommender systems (RS). Bidirectional distillation trains the teacher and the student models in both directions so that the two models improve each other collaboratively. However, this strategy does not fully exploit the representation capability of each item and lacks interpretability regarding item importance. How to design an effective sampling scheme therefore remains worth further study. In this paper, we propose an improved rank discrepancy-aware item sampling strategy to enhance bidirectional distillation learning. Specifically, using the distillation loss, we train the teacher and student models to reflect a user's preference for unobserved items. We then propose an improved rank discrepancy-aware sampling strategy, based on a feedback learning mechanism, that transfers only the useful information by which the two models can enhance each other. The key part of the multiple distillation training is to select valuable items that can be re-distilled in the network for training. The proposed technique addresses the inherently high ambiguity of recommender systems. Experimental results on several real-world recommender system datasets demonstrate that the improved bidirectional distillation strategy achieves better performance.
Keywords
Introduction
As Recommender Systems (RS) grow in scale, researchers usually adopt complicated models to capture the relationships between users and items [2, 12]. A recommender with a large number of learnable parameters can show superior performance. However, it may suffer from high computational cost and long inference time. These drawbacks become more severe for web-scale applications with very large numbers of users and items, because the number of parameters grows sharply.
To address these problems, several works [2–4] have introduced Knowledge Distillation (KD) into recommender systems. KD learns a compact student model under the guidance of a pretrained, more sophisticated teacher model. Related works fall into two types: unidirectional distillation (UD) and bidirectional distillation (BD), illustrated in Fig. 1 (a) and (b), respectively. First, UD in Fig. 1 (a) transfers information in one direction only. It passes useful information from the teacher model to the student model, on the assumption that the teacher is consistently better than the student [2, 3]. However, this strategy cannot exploit the cases in which the student performs better than the teacher. Notably, in recommender system scenarios, Kweon et al. [1] found that the knowledge of the student model can also help to enhance the teacher model, i.e., the teacher is not necessarily optimal on the test set. The second strategy, BD in Fig. 1 (b), performs collaborative learning for both the teacher and the student. The BD method in [1] shows that some items ranked highly by the teacher are not effective for enhancing the student, and vice versa. Because focusing only on the high-ranked items of the teacher and the student reduces distillation performance, the BD model distills information about items that are ranked relatively high by the teacher but relatively low by the student. However, it does not fully exploit the representation capabilities of the items and lacks interpretability regarding item importance.

Two knowledge transfer strategies: (a) unidirectional distillation and (b) bidirectional distillation.
Based on the above analysis, how to develop an effective sampling scheme is still an open problem. In this paper, we propose an improved rank discrepancy-aware item sampling strategy to enhance bidirectional distillation learning. First, using the distillation loss, we train the teacher and student models to reflect a user's preference for unobserved items. Then, we propose an improved rank discrepancy-aware sampling strategy, based on a feedback learning mechanism, that transfers only the useful information that can enhance both the teacher and the student models. Notably, the improved sampling framework contains a shallow distillation stage and a deep interaction stage with multiple distillation training. Together they distill only the informative knowledge and uncover potential knowledge for predicting, in a bidirectional manner, a target user's preference on unobserved items. Previous unidirectional distillation strategies assume that the teacher model is invariably better than the student. In contrast, the proposed method adopts bidirectional distillation to exploit the fact that, in recommender systems, the student is superior to the teacher on a large part of the test set, so that the teacher and the student improve each other collaboratively during training. Compared with the related bidirectional distillation strategy, our method adds an improved rank discrepancy-aware item sampling strategy with multiple distillation training, so that the representation capabilities of the items and the interpretability of item importance can be exploited. Thus, our method can effectively address the inherently high ambiguity of recommender systems.
In brief, the main contributions of this paper are as follows. First, we develop a new sampling strategy to perform bidirectional distillation for top-K recommender systems. Second, by integrating a feedback learning mechanism into the rank discrepancy-aware sampling scheme, the multiple distillation training can select and make full use of the valuable items, so that only the important and useful information is distilled and the two models effectively enhance each other. Third, experimental results on several real-world recommender system datasets demonstrate that the proposed method achieves superior performance compared with related knowledge distillation methods.
The rest of this paper is organized as follows. Related works on knowledge distillation in RS are introduced in Section 2. Section 3 presents the motivation of our work and describes the optimization process of the proposed model. The effectiveness of the proposed method is experimentally evaluated in Section 4. Finally, the conclusion is provided in Section 5.
There are several strategies to reduce the model size and inference time of recommender systems [5–8, 12]. These works adopt discrete quantization for users and items to generate efficient recommendations and can successfully reduce the model size. Some works [2–4] have used knowledge distillation in the field of RS. The strategy of
Furthermore, ensemble models have been jointly trained to obtain good results in computer vision/image processing tasks and natural language processing (NLP) [9–11]. For instance, the proposed Deep Mutual Learning
In the task of NLP,
Our method
Problem statement
Considering that the set of users is $P = \{p_1, p_2, \ldots, p_n\}$ and the set of items is $Q = \{q_1, q_2, \ldots, q_m\}$, we denote $M$ as the user-item interaction matrix as follows:
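As a minimal sketch of this notation, the snippet below builds a small interaction matrix for n users and m items. It assumes the common implicit-feedback convention in which an entry of M is 1 when the user has interacted with the item and 0 otherwise; this convention, the helper name `build_interaction_matrix`, and the toy data are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

def build_interaction_matrix(num_users, num_items, interactions):
    """Return an n x m binary matrix M with M[p, q] = 1 for observed (user, item) pairs.

    Assumption: implicit feedback, so unobserved pairs are stored as 0.
    """
    M = np.zeros((num_users, num_items), dtype=np.int8)
    for p, q in interactions:
        M[p, q] = 1
    return M

# Toy data (hypothetical): 3 users, 4 items, 4 observed interactions.
observed = [(0, 1), (0, 3), (1, 0), (2, 2)]
M = build_interaction_matrix(3, 4, observed)

# Items that user 0 has not interacted with; these are the "unobserved items"
# whose preference a top-K recommender has to predict.
unobserved_for_user0 = np.flatnonzero(M[0] == 0)
```

The unobserved entries are ambiguous by nature: a 0 may mean that the user dislikes the item or simply has not seen it yet.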
In this section, based on the bidirectional distillation learning strategy, we propose a novel method named

Block diagram of the proposed method. Based on the distillation loss of BD, in the first shallow distillation stage, we train the teacher and student models to reflect a user's preference for unobserved items. The recommendation lists
In this section, to obtain an effective student model of smaller size, we describe the joint learning loss inspired by the bidirectional distillation learning strategy [1], which can be written as follows:
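To make the structure of such a joint loss concrete, here is a rough NumPy sketch under stated assumptions: each model minimizes its own collaborative filtering loss plus a distillation term weighted by a balancing coefficient (corresponding to γT→S or γS→T). The binary cross-entropy form, the temperature, and all function names are assumptions made for illustration; the exact formulation should be taken from [1].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(pred, target, eps=1e-12):
    """Mean binary cross-entropy; target may be a hard label or a soft probability."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def joint_loss(own_logits, labels, own_logits_sampled, other_logits_sampled,
               gamma, temperature=1.0):
    """Assumed form: collaborative filtering loss + gamma * distillation loss.

    own_logits / labels: scores and binary labels on the training pairs.
    *_logits_sampled: scores of both models on the items sampled for distillation.
    gamma: balancing weight of the distillation term (hypothetical default-free parameter).
    """
    cf_loss = bce(sigmoid(own_logits), labels)
    # The other model's tempered predictions act as soft targets.
    soft_targets = sigmoid(other_logits_sampled / temperature)
    distill_loss = bce(sigmoid(own_logits_sampled / temperature), soft_targets)
    return cf_loss + gamma * distill_loss

# Toy scores (hypothetical) for the student, guided by the teacher.
student_logits = np.array([2.0, -1.0, 0.5])
labels = np.array([1.0, 0.0, 1.0])
student_sampled = np.array([0.0, 0.0])
teacher_sampled = np.array([1.0, -2.0])

loss_without_kd = joint_loss(student_logits, labels, student_sampled, teacher_sampled, gamma=0.0)
loss_with_kd = joint_loss(student_logits, labels, student_sampled, teacher_sampled, gamma=0.5)
```

The teacher's loss has the same shape with the roles swapped, which is what makes the scheme bidirectional.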
In this part, we describe an effective rank discrepancy-aware sampling strategy, based on a feedback learning mechanism, that distills only informative knowledge so that the models can enhance each other in a bidirectional manner. Some previous learning strategies [2, 3] simply select the high-ranked items in unidirectional distillation, which leads to undesired performance. In bidirectional distillation, the performance gap between the teacher and the student must be taken into account carefully when choosing the knowledge to transfer, because not all knowledge is useful for enhancing performance. For the item i, the teacher model assigns the rank
In the shallow distillation stage, the bidirectional learning strategy is used to obtain the rank list for each user, as described below.
We select the items that are ranked high by the teacher model but low by the student model by:
Based on the assumption that the student is superior to the teacher on a large part of the test set, the student model can also help to improve the teacher model. In contrast to the teacher-to-student direction, we select the items that are ranked high by the student model but low by the teacher model by:
According to Equation (4) and Equation (5), in the training of shallow distillation learning, each recommender model gains informative knowledge by learning on the items ranked high by the other model but low by itself. Thus, for each user p, we can obtain two modules to record the ranking of the teacher and the ranking of the student as follows: recommendation list
Shallow distillation learning does not always precisely reflect a user's preference for unobserved items because of noise items. With multiple distillation training built on the above shallow distillation, we can obtain two matrices that record the ranking of the teacher and the ranking of the student: recommendation list matrix
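The selection rule described above (favor items ranked high by the other model but low by the model being trained) can be sketched as follows. This is only an illustration of the shallow, BD-style sampling step: the tanh weighting of the rank gap is our reading of the scheme in [1], and `epsilon`, the function names, and the toy ranks are hypothetical. It does not implement the feedback-based correction of the improved strategy.

```python
import numpy as np

def rank_discrepancy_probs(rank_source, rank_target, epsilon=1e-2):
    """Sampling probabilities for distilling knowledge from source to target.

    Ranks are 0-based (0 = top of the list). An item gets probability mass only
    if the source model ranks it higher (smaller rank) than the target model.
    Assumption: weights follow tanh(max((rank_target - rank_source) * epsilon, 0)).
    """
    rank_source = np.asarray(rank_source, dtype=float)
    rank_target = np.asarray(rank_target, dtype=float)
    weights = np.tanh(np.maximum((rank_target - rank_source) * epsilon, 0.0))
    total = weights.sum()
    if total == 0.0:
        # No discrepancy in this direction: fall back to uniform sampling.
        return np.full(len(weights), 1.0 / len(weights))
    return weights / total

def sample_items(rank_source, rank_target, num_samples, rng, epsilon=1e-2):
    """Draw item indices according to the rank-discrepancy probabilities."""
    probs = rank_discrepancy_probs(rank_source, rank_target, epsilon)
    return rng.choice(len(probs), size=num_samples, replace=True, p=probs)

# Toy ranks (hypothetical) of five unobserved items for one user.
rank_teacher = [0, 1, 2, 3, 4]
rank_student = [4, 1, 0, 3, 2]

probs_t2s = rank_discrepancy_probs(rank_teacher, rank_student)  # teacher -> student
probs_s2t = rank_discrepancy_probs(rank_student, rank_teacher)  # student -> teacher
sampled_for_student = sample_items(rank_teacher, rank_student, 3, np.random.default_rng(0))
```

In this toy case only item 0 is eligible for teacher-to-student distillation, while items 2 and 4 share the probability mass in the student-to-teacher direction.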
Then, we need to check which items associated with the user p in
Experimental data
We use three real-world datasets to conduct experiments.
We compare our method with three widely used base architectures. To be specific, the competitive methods are introduced as follows:
Three state-of-the-art KD techniques are compared with our method:
Evaluation indicators and settings
In this section, we adopt two metrics to evaluate the ranking performance of these recommender systems. Specifically, Hit Ratio (H@K) tests whether the testing item is listed in the top-K list [18] and Normalized Discounted Cumulative Gain (N@K) [19] gives a higher assessment value to the items at higher rankings in the top-K list.
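A minimal sketch of the two metrics is given below. It assumes a leave-one-out protocol with a single held-out test item per user, so that the ideal DCG equals 1; the function names and the toy ranking are illustrative assumptions.

```python
import math

def hit_ratio_at_k(ranked_items, test_item, k):
    """H@K: 1 if the held-out test item appears in the top-K list, else 0."""
    return 1.0 if test_item in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, test_item, k):
    """N@K for a single held-out item: 1 / log2(position + 2), position 0-based.

    Higher positions in the top-K list receive larger scores.
    """
    top_k = list(ranked_items[:k])
    if test_item not in top_k:
        return 0.0
    position = top_k.index(test_item)
    return 1.0 / math.log2(position + 2)

# Toy ranking (hypothetical item ids), best item first.
ranked = [7, 3, 9, 1]
h_hit = hit_ratio_at_k(ranked, 9, 3)    # item 9 is in the top 3
h_miss = hit_ratio_at_k(ranked, 1, 3)   # item 1 is ranked 4th
n_top = ndcg_at_k(ranked, 7, 3)         # item 7 is ranked 1st
n_third = ndcg_at_k(ranked, 9, 3)       # item 9 is ranked 3rd
```

Both values are averaged over all test users to obtain the reported H@K and N@K.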
In this section, we introduce the experimental setup used throughout the paper. We use PyTorch [20] for the implementation. For each dataset, we use the hyperparameters that empirically gave the best performance. We train our models with the Adam optimizer [21] with L2 norm regularization, and we select the learning rate from {0.00001, 0.0001, 0.001, 0.002}, following the work in [1]. The batch size is set to 128. For NeuMF, we adopt a two-layer MLP to implement the network. For CDAE, we adopt a two-layer MLP for the encoder and the decoder. The dropout ratio is set to 0.5. For the CD and RD methods, we refer to the best-performing experimental results reported in [1].
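The settings listed above can be collected in a small configuration object, sketched here in plain Python. The class and field names are hypothetical, and the L2 regularization weight is deliberately left out because the text does not state its value.

```python
from dataclasses import dataclass
from typing import Iterator

# Learning-rate candidates stated in the text.
LEARNING_RATES = (0.00001, 0.0001, 0.001, 0.002)

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the stated training settings."""
    learning_rate: float
    batch_size: int = 128
    dropout: float = 0.5
    mlp_layers: int = 2       # two-layer MLP for NeuMF and for the CDAE encoder/decoder
    optimizer: str = "adam"   # Adam with L2 regularization (weight not specified here)

def candidate_configs() -> Iterator[TrainConfig]:
    """Yield one configuration per candidate learning rate."""
    for lr in LEARNING_RATES:
        yield TrainConfig(learning_rate=lr)

configs = list(candidate_configs())
```

Each candidate would be trained and scored on validation data, keeping the configuration with the best result for the dataset at hand.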
Result analysis and discussion
Tables 6 list the recommendation performance of the CDAE and NeuMF methods on different real-world datasets, namely CiteULike, Foursquare-TYO and Foursquare-NYC. We denote by "Teacher" and "Student" the base architectures trained individually without distillation learning, and by "BD-Teacher" and "BD-Student" the teacher and student models jointly trained with bidirectional distillation. "iBD-Teacher" and "iBD-Student" are the teacher and student models trained with our improved bidirectional distillation strategy (iBD). Improv.BD-T is the performance improvement of the teacher under the iBD learning strategy, and Improv.BD-S is that of the student. From the results in Table 1, the teacher model is effectively enhanced by iBD, by up to 1.26% under H@50 and 3.25% under H@100 on the CiteULike dataset. Similarly, the student model is enhanced by up to 4.23% under H@50 and 3.93% under H@100 on the CiteULike dataset. Moreover, from Tables 3, we can also see that the proposed method outperforms the other methods. As shown in Tables 6, the student model of our solution is 5.35%, 6.69% and 5.14% better than the other student model under H@50 on the CiteULike, Foursquare-TYO and Foursquare-NYC datasets, respectively.
Performance comparison of base model CDAE on CiteULike dataset
Performance comparison of base model CDAE on Foursquare-TYO dataset
Performance comparison of base model CDAE on Foursquare-NYC dataset
Performance comparison of base model NeuMF on CiteULike dataset
Performance comparison of base model NeuMF on Foursquare-TYO dataset
Performance comparison of base model NeuMF on Foursquare-NYC dataset
Based on the above experimental results in Tables 3, for both BD and iBD, the information from the student helps to enhance the teacher model. Compared with BD, the proposed iBD transfers useful knowledge from the student more effectively, because the selected items labeled as noise items are re-distilled in the network for training. This provides more information for deciding whether the user actually dislikes an item or potentially likes it.
Figure 3 shows the performance of BD and iBD over different epochs for the CDAE base model on the CiteULike dataset. We observe that: 1) when the number of epochs is less than 300, the two methods achieve similar results; 2) when the number of epochs is larger than 300, our proposed method obtains the best result for H@50 and N@50 on CiteULike. This suggests that our learning strategy, based on the feedback learning mechanism, can select noise items so that only the informative knowledge is distilled and the two models enhance each other.

The performance of BD and iBD with different 10 epochs: (a) H@50 and (b) N@50.
We further examine how the improved rank discrepancy-aware sampling scheme, based on the feedback learning mechanism, affects the performance of iBD, in order to evaluate the proposed sampling strategy. Figures 4(a) and 4(b) show the number of knowledge items to be distilled in the teacher model and the student model within iBD based on CDAE on the Foursquare-NYC dataset. Figures 5(a) and 5(b) show the same quantities within iBD based on NeuMF on the Foursquare-TYO dataset. Specifically, we show the proportion between the number of items sampled for each user during the bidirectional distillation process and the number of items subsequently corrected. In Figs. 5(a) and 5(b), the blue height is the total number of items screened by each user before 10 rounds, and the red height is the total number of items subsequently corrected according to the feedback learning mechanism. As we can see, the noise items, after multi-round learning, contain information that helps the models better address the inherently high ambiguity of recommender systems.

Item analysis and selection strategy of improved rank discrepancy-aware sampling. (a) Teacher model within iBD based on CDAE on Foursquare-NYC dataset. (b) Student model within iBD based on CDAE on Foursquare-NYC dataset.

Item analysis and selection strategy of improved rank discrepancy-aware sampling. (a) Teacher model within iBD based on NeuMF on Foursquare-TYO dataset. (b) Student model within iBD based on NeuMF on Foursquare-TYO dataset.
Here, we give the hyperparameter analysis of γT→S and γS→T to evaluate the effectiveness of iBD. We show the results of NeuMF on the CiteULike dataset. For the proposed improved rank-discrepant item selection based on bidirectional distillation, γT→S and γS→T are both chosen from the set {0.1, 0.5, 0.9}. Specifically, Fig. 6 (a) and (b) show the performance of the teacher and the student with varying γT→S and γS→T, which balance the distillation loss terms. From Fig. 6 (a), when γT→S and γS→T are around 0.5, the student model obtains the best performance under H@50 on the CiteULike dataset. For the teacher model, the trend of the experimental result is very similar to that of the student, but the corresponding values are more robust to γT→S.

Hyperparameter analysis of the student and teacher models on CDAE. (a) γT→S for the student model under H@50 and (b) γS→T for the teacher model under H@50.
In this section, we validate the proposed iBD with two base models, CDAE and NeuMF, on three datasets: CiteULike, Foursquare-TYO and Foursquare-NYC. The experiments were run on a workstation with an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz and CUDA with an NVIDIA GeForce RTX 2080 Ti GPU for acceleration. The running times of the pre-train phase, the distillation phase and the predict phase are shown in Table 7. From the results in Table 7, the running time of Pre-train and Distillation depends heavily on two factors: the dataset and the base model architecture. A model trained on a larger dataset requires more running time. Notably, the predict phase of the proposed method is relatively efficient across the different base model architectures, taking less than 5 seconds.
Time analysis (s)
In this paper, we develop a new sampling strategy to perform bidirectional distillation for top-K recommender systems, named improved rank discrepancy-aware sampling based on bidirectional distillation learning. Under our framework, the teacher model and the student model improve each other collaboratively in the training phase. Based on the feedback learning mechanism, both the teacher model and the student model can select and make full use of the noise items, so that only the important and useful information is distilled and the two models effectively enhance each other. Thus, our method can effectively address the inherently high ambiguity of recommender systems. Finally, experimental results on several real-world recommender system datasets demonstrate that the proposed iBD achieves superior performance compared with related knowledge distillation methods.
