Abstract
With the popularity of networks and the increasing number of online users, recommender systems have suffered from the privacy leakage of sensitive information. While people enjoy recommender services, their information is exposed to the networks. To protect the privacy of users when using the recommender services, we propose a multi-level combined privacy-preserving model that maintains high accuracy of recommendation with privacy protection and alleviates the data sparsity problem. Our scheme contains two steps of recommendation. First, a multi-level combined random perturbation (MCRP) model is proposed on the client side. Our model dynamically divides multiple disturbance levels and adds noise of different ranges to the rating matrix according to Gaussian and uniform mixed disturbances. Second, on the server side, we propose a pseudo rating prediction filling (PRPF) algorithm based on the matrix factorization model. Combining the PRPF algorithm with the MCRP method significantly improves the recommender accuracy and effectively increases privacy security. Sensitive analysis and comparison experiments show that the proposed privacy method has certain advantages in security and recommender accuracy by using three publicly available datasets.
Introduction
With the rapid development of the Internet, personalized recommendation has had far-reaching development in many fields such as e-commerce, smart travel, and social networks [1–3]. The many applications of the recommender system need a large amount of user personal data, and the data are rich in content and stored on the network. There is no doubt that user sensitive information is not lacking. Meanwhile, if we directly use the data without privacy protection, the many mature applications in the data mining area will inevitably lead to serious privacy breaches [4]. In 2010, the Netflix Prize competition was stopped because it threatened the security of the private information of users.
In the current study, the privacy-preserving recommender system has gained widespread attention. There are many ways to protect users’ history, such as anonymization [5], homomorphic encryption [6], secure multiparty computation [7], and random perturbation [8, 9]. Chang et al. [5] designed an anonymous algorithm to effectively achieve privacy goals using a third-party center to collect the information of users and perform anonymous operation. However, finding a completely trusted third party in reality is difficult. Therefore, privacy protection should occur before submitting the data to the server in our paper. Badsha et al. [6] proposed a privacy-preserving collaborative filtering method based on homomorphic encryption, introducing a semi-trusted third party to calculate and recommend. Erkin et al. [7] has designed a cryptographic protocol by using homomorphic encryption and secure multiparty computation techniques, trying to minimize the computation overhead and produce recommendations. Homomorphic encryption and secure multiparty computation could protect user privacy, but because the number of users is too large, the efficiency is not satisfactory. Polat et al. [8] proposed a collaborative recommendation using random perturbations to solve the efficiency and privacy-preserving problem. This method has the advantage of low computation overload and high efficiency. Nikolaos et al. [9] proposed a new multi-level perturbation method, generating noise based on multiple levels and different ranges of random values to achieve privacy protection. However, there are two main problems with this method. One problem is that random perturbation is only based on uniform distribution, which easily suffers from the privacy attack. The other problem is that multiple levels are simply randomly generated and impact the recommended accuracy.
At present, collaborative filtering (CF) [10] is very successful among many recommendation technologies, and the model-based CF algorithm has been widely used in various fields [11]. However, there are no effective ways for CF to resolve some problems, such as sparse data. When the number of users and the number of items increase a large amount, the rating matrix gets sparse. Consequently, using the sparse data to calculate the correlation Similarity or predicted rating yields poor results. In addition, this result means that sparse data affects the result of recommendations and that addressing the sparse problem helps improve the accurate predictions [12]. Moreover, implementing privacy-preserving technology to protect user data also greatly improves the quality of recommendations. Therefore, we could offset the impact on privacy protection by alleviating the problem of data sparsity while improving recommender accuracy.
To solve these problems, this paper proposes a multi-level combined privacy-preserving scheme based on random perturbation technology, which is applied to centralized scenarios. The main contributions of this study to relevant literature are summarized as follows. On the client side, we can design a multi-level combined random perturbation (MCRP) method that dynamically divides multiple perturbation levels and then randomly generates noise of different ranges at different levels to disturb user data. Experimental analyses show that the MCRP method has certain advantages in security and recommender accuracy. On the server side, we propose a PRPF algorithm that uses the latent factor model to predict the ratings of items that the user did not evaluate and then fill them into the rating matrix. The results show that the pseudo rating prediction filling (PRPF) is effective and improves the recommender accuracy.
We can validate the proposed privacy-preserving scheme through extensive experimental research on the three publicly available datasets, MovieLens, FilmTrust, and MovieTweetings, and the results show that this scheme improves recommender accuracy while protecting user privacy.
Related works
This section introduces matrix factorization and describes how randomized perturbation technique protects user privacy. It also presents the impact of data sparseness on recommendations.
Matrix factorization
The dimension of the rating matrix is often too large. To improve the efficiency in managing them, the matrix factorization methods [13] such as singular value decomposition (SVD) [14] and non-negative matrix factorization (NMF) [15] have been proposed. These methods focus on decomposing the high-dimensional sparse matrix into multiple low-dimensional matrices and letting the continued product of multiple low-dimensional matrices optimally approximate the original matrix. The latent factor model can employ singular value decomposition to map both users and items to a latent factor space and infer the preference of users and items using explicit or implicit feedback, such that user-item interactions are characterized as inner products in that space. As a result, the latent factor model can easily make further predictions by reconstructing the low-dimensional latent factor matrix [16]. A typical SVD model that associates each user u with a user latent factor vector p
u
∈ R
f
and each item with an item i latent factor vector q
i
∈ R
f
is proposed. Then, the prediction is made by calculating the inner product
Here, ∑
u
∥ p
u
∥ 2 + ∑
u
∥ q
i
∥ 2 + ∑j∈Nuyj2 is the regularizing term that effectively avoids overfitting by penalizing the magnitude of the parameters λ. The objective function can be minimized using the stochastic gradient descent (SGD), which is an iterative stochastic optimization algorithm. Once the latent factor vectors have been obtained, rating values will be quickly predicted by using the SGD algorithm according to Equation (2).
The basic idea of the randomized perturbation technique is to disguise private data by adding random noise that satisfies certain mathematical distributions [8]. If the number of users is significantly large, it is superior to cryptography methods, such as homomorphic encryption, in efficiency. However, the randomized perturbation technique also sacrifices a certain degree of recommended accuracy and privacy security. Moreover, as the added noise intensity is decreased, the recommended accuracy increases and the privacy security decrease. There are many methods for using random perturbation [18], such as additive perturbation and multiplicative perturbation. Additive perturbation is that the perturbed data Y is formed by adding the original data X to random noise R, as follows: Y = X + R. Random perturbations have less effect on data availability than other privacy protections because of these two approximation techniques.
Note that A and B are the original vectors, where A = (a1, a2,..., an) and B = (b1, b2,..., bn). Let A and B are disguised by R = (r1, r2,..., rn) and V = (v1, v2,..., vn), respectively, where R and V are uniformly distributed in the range [- α, α]. Let A′ and B′ be the disturbed rating vector, where A′ = A + R and B′ = B + R. Because of the independence of vectors, the scalar product of A and B can be estimated from A′ and B′, as follows [8].
Note that A is the original vectors, where A = (a1, a2,..., an). A is disguised by R = (r1, r2,..., rn), where R is uniformly distributed in the range [- α, α]. Let A′ be the disturbed rating vector, where A′ = A + R. The sum of the random values is close to zero, and the sum of the vectors is calculated, as follows [8].
Currently, there are many excellent CF recommended models, but they also have the problem of data sparseness [19]. Therefore, data sparseness has become a major obstacle that needs to be overcome. Resolving this problem could apply predictive filling to the ratings matrix. The data filling algorithm has many filling methods, and they are directly filled with the mean value of the user or item ratings, weighted average, mode, and the median. Although these methods have a small effect on alleviating the problem of data sparsity, they are single and the filled data lacks credibility. This affects the user-item similarity calculation or the model prediction result, resulting in the degradation of recommended quality. Therefore, we propose a new rating filling algorithm based on the latent factor model. We make full use of this advantage that the latent factor model has higher accuracy in prediction. We can predict the unrated items and fill in the training dataset, which can alleviate data sparsity and increase data security.
A multi-level combined privacy-preserving scheme
In this section, we will propose a multi-level combined privacy-preserving scheme based on a random perturbation technique, which is applied to centralized scenarios. On the client side, users’ ratings are first processed through the MCRP model and then submitted to the server. On the server side, we utilize the latent factor model to predict and fill the items’ ratings that users have not rated and return the recommended results to users. An overview of our model is shown in Fig. 1.

Centralized recommended scenarios.
In this section, we introduce our MCRP Model. In the recommender system, our MCRP Model can lead to the problem of user information leakage without privacy protection. Thus, we propose an MCRP model based on the random perturbation theory that was applied to the client side. A framework of MCRP model is shown in Fig. 2.

Framework of the MCRP model.
In the MCRP model, the process of dealing with ratings can be decomposed into the following several steps:
1) Calculate the maximum of the disturbed level. We can divide the disposal process of the perturbation into multiple levels. Consequently, we need to confirm the maximum disturbed level according to the range of user ratings. If the level is too large to make the disturbed ratings exceed the range, it will be meaningless. We find the maximum disturbed level (max_level) is closely related to the ratings, and max_level can be determined by Equation (5). This ensures that different datasets automatically adopt proper disturbed levels, making the MCRP model have wide universality.
2) Divide multiple levels. If we will directly perturb according to the max_level that was calculated, the recommended accuracy is affected. Meanwhile, if the attacker knows Equation (3), the ratings can be determined and are not secure. Consequently, we can propose a strategy for dynamic random partition. First, we set the parameter of the dynamic division ratio α, and the first-level perturbation is taken by random sampling to extract α percentage of the overall ratings. If the max_level is greater than two, we continue to randomly select α percentage of the remaining user ratings to use the second-level perturbation. As long as the current level is less than the max_level, we use recursion to continue selecting data points. If the current level is already the max_level, we use the max_level perturbation for all remaining user ratings. For example, Table 1 shows all the dynamic random partition levels.
Dynamic random partition
3) Mixed perturb all levels of ratings. In the MCRP model, we introduce the Gaussian distribution of noise and the uniform distribution of noise to randomly perturb together. Consequently, it is necessary to set the parameter of mixed perturbation ratio β, which is used to confirm the ratio of Gaussian to uniform noise at each level. The disturbed range is gradually expanded from the first to the highest level. If the current level may be the highest level, the N level, uniform noise is generated in the range [-N, N], and Gaussian noise is generated by a mean of zero, where variance is calculated through the items that the user rated. If not, we can easily define each level perturbation as follows. In the first level, the random noise is limited in the range [–1, 1], the uniform noise is generated in the range [–1, 1], and the Gaussian noise is generated by a mean of zero and variance of one. In the second level, the random noise is limited in the range [–2, 2], the uniform noise is generated in the range [–2, 2], and the Gaussian noise is generated by a mean of zero and variance two. Lastly, we must ensure the ratings validity by checking the disturbed ratings whether they are out of range. If a value drops below the minimum value, the perturbed rating takes the minimum value, and in the case that the value exceeds the maximum value, the perturbed rating takes the maximum value.
To alleviate the poor recommended quality because of data sparsity, we propose a pseudo rating prediction filling algorithm. We can plan to use the latent factor model to predict the unrated items. Initially, we will need to use the data to train the model because the model itself must be trained. We can randomly extract unrated items for prediction and then fill the predicted ratings into the ratings matrix. However, overfilling cannot help improve the recommender quality and affect the validity of user data. We randomly extract 20% of the real dataset as the test data in advance and then automatically extract the rest as the training set. The predicted ratings are only added to the training data to assist in training model parameters and to alleviate data sparseness. To acquire better experimental results, we set the parameter of fill rate ω, which indicates the percentage of pseudo rating in the real dataset. We can determine the optimal fill rate ω by performing many experiments.
Algorithm analysis
Algorithm 1 and Algorithm 2 run separately on the client side and server side. The run time of algorithm 1 is used to divide multiple levels and adds noise for all ratings. The algorithm need to divide the model into k levels in advance, randomly select the ratings by random sampling, and then generate random noise. The time complexities of algorithm 1 is O (k * n). The run time of algorithm 2 uses the latent factor model to predict the unrated items and its time complexity is O (m * n). Until s users are predicted, the time complexity of Algorithm 2 is O (s * mn).
Experiments
Datasets
In this section, we select three real datasets that are publicly available and widely used in evaluating recommenders. Moreover, Table 2 shows the basic statistics of these datasets.
Dataset statistics
Dataset statistics
MovieLens dataset is provided by the GroupLens research group (http://MovieLens.umn.edu). The dataset contains 100 thousand ratings, with values from 1 to 5.
FilmTrust dataset is collected from the FilmTrust website (http://firm-trust.com). The dataset contains 35,497 ratings, and the rating scale of this dataset is in the range from 0.5 to 4.
MovieTweetings dataset is collected from Twitter users (https://github.com/sidooms/MovieTweeting). The dataset contains 100000 ratings, and the rating scale of this dataset is in the range from 0 to 10.
In our experiments, the performance measures is the root mean square error (RMSE), which judges the quality of recommendations by calculating the deviation between the predicted and real ratings. As the RMSE value decreases, the accuracy of the recommended results increases and the RMSE is defined as follows.
The value difference (VD) is represented by the relative value difference in the Frobenius norm [20]. Thus, we select the VD to evaluate the effect of privacy protection on the datasets. The VD is defined as follows.
The F1 score effectively balance the correlation between the two different indicators. We use the F1 score to evaluate the equilibrium between recommender accuracy and privacy protection. When F1 is higher, the experimental scheme that we proposed is better, and the F1 score is defined as follows.
For resolving the problem of comparability between two different types of metrics, we use the range normalization to convert the evaluation metrics from the absolute value to the relative value. If the metric is an appropriateness metrics that the closer the indicator value is a certain value, as the evaluation result improves, we can use the Equation (9) to transform, and the formula is defined as follows:
In this section, we introduce the settings of the environment and parameters. The environment of the experimental algorithm is the Intel Core2 Quad CPU Q9500 @2.83 GHz, 4GB RAM, Windows 7 system. All algorithms are programmed in the C# language on the Visual Studio 2015. We use 80% of the data as the training dataset and the remaining 20% as the test dataset from the above dataset. No data is fixed for training or testing. In our work, the experimental results are obtained by each algorithm under the optimal parameters. Specifically, the Regularization Parameter λ is 0.1, the step size γ is 0.01, and the iterations are 50.
Experiment results
Sensitive analysis
In this section, we evaluate the privacy-preserving recommender system in terms of both recommender accuracy and privacy security. We apply the MCRP model based on the latent factor model and then use the RMSE and the VD to verify the impact of our model on these real datasets.
First, we analyze the effects of the dynamic division ratio α and the mixed perturbation ratio β on the RMSE and VD. In Fig. 3, the result shows that as α increased from 0 to 1, the proportion of the first-level perturbation increased in our model, indicating that the disturbed range is reduced and the recommender accuracy is improved. Meanwhile, the value of VD continued to decline, indicating that the privacy security gradually weakens. Figure 3 also shows that as β increased from 0 to 1, the proportion of random noise generated by Gaussian perturbations increased, indicating that the privacy security constantly improved. At the same time, the value of RMSE increased, and the accuracy declined. The results show that there is a close correlation between the accuracy and security. As the privacy security continued to increase, the recommender accuracy continued to decrease, and vice versa.

Results of different datasets in terms of RMSE and VD.
To solve the above problem, we use the F1 score to reconcile the RMSE and VD indicators. However, these two indicators are not in the same dimension. We can use Equation (9) to convert RMSE to the same unit as VD by range normalization. Then, we applied Equation (8) to calculate the F1 score. Figure 4 shows the final result between the RMSE and the VD from three real datasets. The figure shows that as α is 0.5 and β is 0.6 on the MovieLens and FilmTrust datasets, the F1 score can take the maximum value, which is the best compromise between the recommended accuracy and privacy security. As α is 0.4 and β is 0.7, the F1 score takes the maximum value on the MovieTweetings dataset. This result shows that the MCRP model can achieve better privacy protection and improve recommender quality at the same time. However, the F1 score shows two different results on these datasets. The two different results are mainly caused by the difference between the maximum ratings of the MovieLens and FilmTrust data sets where they are 5 points and 4.5 points, respectively. According to Equation (5) to calculate the max_level, only two disturbed levels can be taken at most. However, the maximum rating of the MovieTweetings dataset is 10 points, and three disturbed levels can be used. Therefore, the F1 score is different, and the optimal α and β value for the MCRP model is closely related to the dataset.

Results of different datasets on the value of F1.
On the client side, applying the MCRP model can effectively protect user data security, but it has a greater impact on recommender accuracy. For improving the recommender accuracy, we propose a PRPF algorithm on the server side. We compare the PRPF algorithm with a median filling algorithm, which was filled by the median average of the row and column data in the ratings matrix. The contrast experiment is based on the MCRP model and uses the optimal α and β value, which was already confirmed in Section 4.4.1. Figure 5 shows the comparison of the two algorithms under the RMSE and TIME metrics on three datasets. The abscissa axis is the filling rate ω, which represents the number of pseudo rating that adds ω multiple of the real data to the training set. The TIME metric is the application’s execution time in the computer.

Results of different datasets in terms of RMSE and TIME.
Figure 5(a) (b) and (c) show that the PRPF algorithm is significantly better than the median filling algorithm in recommender accuracy. Figure 5(d) (e) and (f) show that the PRPF is also superior to the median filling algorithm in the execution time. Table 2 shows that the sparsity rates of these datasets are 93.69%, 98.86%, and 99.94%, respectively. The table shows that as the dataset becomes sparser, the PRPF algorithm performs better. Meanwhile, as the filling rate ω increases, the running time increases and the efficiency decreases. By comparing the recommender accuracy and running time, the filling rate ω can be 1.1 on the MovieLens dataset, the filling rate ω can be 1.6 on the Film Trust dataset, and the filling rate ω is 3.5 on the MovieTweetings dataset. These values could explain the different filling rates by having the different datasets having their own data sparse rate. Thus, each dataset will have its own optimal fill rate.
To display the combined impact of the privacy-preserving model and the filling algorithm on the recommendation, we could combine the MCRP model with the PRPF algorithm and then compare them with the SVD++ algorithm, only the MCRP model, and only the MPPM model [9]. Figure 6 shows the comparison results about RMSE on three datasets. The MCRP model could use the optimal α and β values, as shown in Section 5.1, and the PRPF algorithm also can use the optimal ω values, as shown in Section 5.2.

Results of different datasets on the value of RMSE.
Figure 6 shows that the SVD++ algorithm is most effective under the recommender accuracy without privacy protection. However, with the privacy-preserving requirement, the MCRP model has better accuracy, and the value of RMSE is obviously lower than the MPPM model. This could be explained through the result that the process of optimizing the noise generated by multi-level perturbation. Instead of simply randomizing, our model is dynamically divided into multiple disturbed levels and then mixed to generate random noise of different ranges at different levels. The results then indicate the most reasonable experimental parameter through many experiments.
Figure 6 also shows that as the MCRP model combined with the PRPF algorithm, it performs better than MCRP under the recommender accuracy. After using the privacy-preserving model, the recommender quality is inevitably influenced. This result shows that the PRPF algorithm can effectively alleviate the sparse data and significantly improve recommender accuracy.
To solve the problem of privacy leakage in recommendation processes, we present an MCRP model applied on the client side that dynamically divides multiple disturbance levels and adds noise of different ranges at different levels according to the mixed disturbances. In addition, on the server side, we present the PRPF algorithm based on the matrix factorization, which alleviates the data sparse problem and improves the recommender accuracy. The experimental results show that the MCRP model combined with the PRPF algorithm in our paper has certain advantages in security and recommender accuracy.
Future works will focus on applying fuzzy tools to collaborative filtering recommender scenario. Fuzzy sets could model some uncertain and vague concepts which cannot be represented in a precise way, such as implicit feedback of social relationship, interest preference [21]. Using fuzzy tools to generate fuzzy profiles for analysing rating tendencies of users and apply fuzzy profiles to achieve natural noise management, detection and correction in the global or local level [22, 23]. We also could implement them to control and reduce added noise under privacy-preserving scenarios. In addition, we could consider a fuzzy approach to enhance the quality of the corrected users’ or items’ preferences, improve recommender systems performance in the model-based CF and use fuzzy tools to reduce the natural noise during matrix factorization. Therefore, specific researches will be needed to carry out related to the characteristic of model-based recommender scenarios, and we can apply fuzzy tools to solve practical problems.
Footnotes
Acknowledgments
This research was supported by the Natural Science Foundation of China (No. 61672039), Natural Science Foundation of Anhui Province (No. 1808085MF172) and Key Program in the Youth Elite Support Plan in Universities of Anhui Province (gxyqZD2019010).
