A matrix factorization recommendation model for tourism points of interest based on interest shift and differential privacy

Abstract

Adding noise to users’ historical data helps protect user privacy in recommendation systems but degrades recommendation performance. To address this problem, this paper proposes a matrix factorization recommendation model for tourism points of interest based on interest shift and differential privacy. The model improves recommendation performance by analyzing user interest preferences. Specifically, user interest points are extracted from user tags and from user ratings weighted by a time-series factor to calculate user interest drift. Similar neighbors are then found to train user feature preferences, which are integrated into the matrix factorization model as regular terms. Meanwhile, based on differential privacy theory, a private neighbor selection algorithm combining the K-Medoids clustering algorithm and the exponential mechanism is designed to protect the identity of neighbors and resist KNN attacks. In addition, the Laplace mechanism is used to provide differential privacy protection for the model’s gradient descent process. Finally, the feasibility of the proposed model is verified through experiments, and the results indicate that the model has advantages in both recommendation accuracy and privacy protection.

Keywords

Matrix factorization, recommendation system, differential privacy, interest shift, clustering

1 Introduction

With the rise of social networks and travel websites, tourists are often overwhelmed by the large number of information searches and product choices. A tourism recommendation system can effectively alleviate this information overload [1]. The recommendation system learns the interests and preferences of users from their historical browsing records, then recommends items or item sets of potential interest and provides personalized services. However, training the learning algorithms requires a large amount of data.

To provide users with more accurate services and make recommendation services acceptable to more users, the recommendation system needs to collect more user data, such as browsing information, purchase information, rating information, and user attribute tags. However, these data include personal information that users do not want to disclose. Suppose a tourist is traveling in Xi’an. The tourist first books a hotel near a scenic spot through a group-buying website. The platform obtains the tourist’s location and then recommends a restaurant according to the tourist’s consumption level and food preferences. If the tourist chooses Sichuan cuisine, the merchant and the platform can infer where the tourist comes from based on this consumption. If the data is leaked to a third party, the user’s preferences are easily exposed. Attackers can exploit a recommendation system that lacks privacy protection to accurately obtain users’ real private information, which endangers users’ reputation and even their lives and property. This runs contrary to the goal of giving users a better experience and hinders the development of recommendation systems. Therefore, keeping user data secure and private while users enjoy personalized travel recommendations is essential to research on personalized travel recommendation algorithms.

At present, privacy-preserving methods for travel recommendation systems fall mainly into three types: generalization, data perturbation, and encryption. The studies in [2, 3] use the k-anonymity method, which generalizes a user’s information so that the user is hidden among k similar users. However, k-anonymity does not strictly define the background knowledge held by the attacker, so its security is low when facing new attacks. The studies in [4, 5] proposed data scrambling methods, which protect user data by perturbing users’ historical data to a certain extent before sending them to the recommendation server. Although data scrambling is simple, it suffers from insufficient protection capability. The studies in [6, 7] use homomorphic encryption to calculate the similarity of nearest neighbors in the collaborative filtering recommendation process. However, homomorphic encryption suffers from high computational complexity and low recommendation efficiency on large-scale datasets. The above privacy-preserving methods address user privacy only; recommendation methods that also take user preferences into account while protecting privacy have received little research attention. Our previous work proposed a distributed user privacy-preserving recommendation framework [8], but the approach did not consider that users’ preferences shift over time.

Based on user interest shift and differential privacy, this paper proposes a matrix factorization recommendation model for tourism points of interest that maintains recommendation performance without leaking users’ private data. The model extracts user interest points from user tags and from user ratings weighted by a time-series factor to calculate the user interest shift. Meanwhile, a private neighbor selection method combining the K-Medoids clustering algorithm and the exponential mechanism is designed to protect neighbor identities. In addition, the Laplace mechanism is used to add random noise to the model’s gradient descent process, which further increases the security of the recommendation model. The contributions of this paper are summarized as follows:

An Interest Drift Matrix Factorization (ITMF) recommendation model is proposed. The user interest points are extracted from user tags and user ratings under time-series factors to calculate user interest drift. Also, the neighbors with similar interests are found to train the user feature matrix, which is incorporated into the matrix factorization model in the form of regular terms.

A matrix factorization recommendation model based on differential privacy and interest shift (Differential Privacy Interest Drift Matrix Factorization, DPITMF) is proposed. The model recognizes that both the identity of neighbors and the recommendation process require differential privacy protection. Accordingly, a private neighbor selection method combining the K-Medoids clustering algorithm and the exponential mechanism is designed to resist KNN attacks, and the Laplace mechanism is used to protect the gradient descent process of the recommendation model.

The privacy security of the proposed model is proven through theory, and the efficiency of the recommendation system is verified through experiments.

The rest of this paper is organized as follows. Section 2 reviews related work on travel recommendation and privacy protection methods. Section 3 introduces the background on the matrix factorization recommendation model and differential privacy. Section 4 details the construction and optimization of the recommendation system model. Section 5 presents the experimental setup and results. Finally, Section 6 concludes the paper.

2 Related work

2.1 Travel recommendation

As an effective tool for dealing with information overload, the personalized recommendation system has received extensive research attention. Kofler et al. [9] collected travel photos shared on the Flickr platform and recommended photos of tourist attractions related to the destination set by the user. Moreno et al. [10] designed a personalized travel destination recommendation system, which used collaborative filtering technology to recommend similar attractions to users. Loh et al. [11] searched the tourism ontology knowledge base for destinations or tourist attractions with high similarity to the user’s preferences and recommended them to users. Levi et al. [12] used content-based recommendation technology to recommend hotels of potential interest to users, with user satisfaction measured through questionnaire surveys. Historical travel data is important to the recommendation model, but the user’s interest points change over time. A recommendation model that relies excessively on historical travel data without considering the time factor cannot meet the needs of personalized recommendation.

2.2 Privacy protection method

In practical applications, service providers make high-quality recommendations by mining user information, but the lack of privacy protection methods causes user privacy leakage [13, 14] in the recommendation system. Making accurate predictions while protecting user privacy and security is a research hotspot. Researchers have proposed a variety of solutions to the privacy protection problem in recommendation systems. The common privacy protection technologies include anonymization, encryption, and data perturbation. Anonymization protects user privacy by collecting user data and performing anonymization operations on it; improper anonymization, however, results in poor data utility. Encryption relies on cryptographic algorithms; it offers very high security but incurs high communication and computation overhead. Data perturbation generally obfuscates the real data by adding random noise to the original distribution of the data. Many privacy protection methods build on the principle of data perturbation, among which random perturbation and differential privacy are the most widely known. The latter is adopted in this paper. The concept of differential privacy was first proposed by Dwork et al. [15–17] and has been continuously improved since. Under differential privacy protection, the attacker cannot infer the real data even with maximal background knowledge. Differential privacy was first applied to recommendation systems by McSherry et al. [18]. The combination of the Laplace mechanism with collaborative filtering recommendation can effectively reduce the large recommendation performance loss caused by differential privacy. Zhu et al. [19] proposed to apply the exponential mechanism of differential privacy to user-based collaborative filtering to strictly protect the identity of user neighbors. Berlioz et al. [20] proposed to apply differential privacy to the matrix factorization recommendation system to protect the entire process of matrix factorization. Xian et al. [21] proposed a collaborative filtering algorithm (DPSS++) based on differential privacy and SVD++. Three differential privacy mechanisms were designed, based on gradient perturbation, objective function perturbation, and output perturbation, respectively, which improve prediction accuracy without leaking user privacy. Yang et al. [22] proposed a framework that includes two perturbation methods to prevent the threat of inference attacks against users. All of the above privacy-preserving methods address user privacy only. In contrast, privacy-preserving recommendation methods that take into account both user privacy and user preferences have received little research attention.

3 Preliminaries

3.1 Matrix factorization recommendation model

In collaborative filtering recommendation, users’ ratings of items are stored in the user-item rating matrix. Let m denote the number of users and n the number of items; the user-item rating matrix R is an m × n matrix that stores the ratings of all users for all items, expressed as R = {r_ui | 1 ≤ u ≤ m, 1 ≤ i ≤ n}. The user-item rating matrix is defined as follows: $R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix}$

Because the user-item rating matrix R is high-dimensional and extremely sparse, matrix factorization is introduced to improve processing efficiency [23]. Specifically, matrix factorization predicts ratings for the items that users have not rated and thereby fills in the original matrix, alleviating the impact of data sparseness on recommendation accuracy [24]. As a result, matrix factorization can be used in collaborative filtering recommendation, and matrix factorization algorithms have received extensive research attention. Since matrix factorization reduces the dimensionality by decomposing the original matrix into two submatrices characterized by latent factors associated with users and items, it is also called the latent factor model (LFM). Subsequently, other improved recommendation models were proposed, such as Singular Value Decomposition (SVD) [25] and Nonnegative Matrix Factorization (NMF) [26]. These model-based collaborative filtering recommendation algorithms have been used in practical applications, and their recommendation accuracy is significantly better than that of traditional user-item collaborative filtering. The matrix factorization process is shown in Fig. 1.

Fig. 1

Schematic diagram of matrix factorization.

In the matrix factorization model, the original matrix R can be approximated by the product of the user feature matrix P of size m × k and the item feature matrix Q of size n × k obtained by decomposition, where k represents the dimensionality of the latent feature vectors. Thus, the predicted rating of user u for item i can be directly expressed as the inner product $\hat{r}_{ui} = q_{i}^{T} p_{u}$ of the user feature vector p_u and the item feature vector q_i. To learn the feature vectors p_u and q_i and improve the accuracy of the final prediction, a regularized objective function L is minimized to reduce the error; the penalty term in L avoids over-fitting during training. The objective function L is defined as follows: $L = \frac{1}{2} \sum_{u} \sum_{i \in N_{u}} (r_{ui} - q_{i}^{T} p_{u})^{2} + \frac{λ}{2} (\sum_{u} ‖ p_{u} ‖^{2} + \sum_{i} ‖ q_{i} ‖^{2})$ (1) where N_u denotes the set of items rated by user u and λ is a penalty parameter that controls the degree of regularization of the objective function; the larger its value, the stronger the regularization. The gradient descent method (e.g., stochastic gradient descent, SGD) is a local optimization algorithm that reaches a local minimum through iterative updates in the negative gradient direction. Based on this, Koren et al. [27] proposed the classic SVD++ model. To offset the differences between scoring systems and make predictions for a specific system, the global average μ, user bias b_u, and item bias b_i are also introduced. The prediction made by the SVD++ model is expressed as follows: $\hat{r}_{ui} = μ + b_{u} + b_{i} + q_{i}^{T} (p_{u} + | N_{u} |^{- \frac{1}{2}} \sum_{j \in N_{u}} y_{j})$ (2) where y_j is the implicit-feedback factor vector of item j.
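As an illustration of how objective (1) can be minimized with SGD, the following is a minimal sketch rather than the implementation used in this paper. The function names, the hyperparameter defaults, the positive random initialization, and the choice to apply the penalty once per observed rating (a common SGD variant) are all assumptions made for the example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def objective(ratings, P, Q, lam):
    """Value of Eq. (1) on the observed (u, i, r_ui) triples."""
    sq_err = sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)
    reg = sum(dot(p, p) for p in P) + sum(dot(q, q) for q in Q)
    return 0.5 * sq_err + 0.5 * lam * reg

def train_mf(ratings, n_users, n_items, k=2, lam=0.05, lr=0.02,
             epochs=500, seed=0):
    """Plain SGD on Eq. (1); all defaults are illustrative assumptions."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(P[u], Q[i])          # r_ui - q_i^T p_u
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Step against the gradient of the squared error plus penalty.
                P[u][f] += lr * (err * qi - lam * pu)
                Q[i][f] += lr * (err * pu - lam * qi)
    return P, Q

# Toy data: (user, item, rating) triples on a 3 x 3 rating matrix.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = train_mf(ratings, n_users=3, n_items=3)
```

The predicted rating for an unobserved pair, such as user 0 and item 2, is then `dot(P[0], Q[2])`.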

3.2 Differential privacy

3.2.1 The theoretical basis of differential privacy

Definition 1. ɛ-Differential privacy (ɛ-DP) [28]. Given a randomized algorithm A with output range Range(A), if Equation (3) holds for any two adjacent data sets D and D′ that differ in only one record and for any output set S ⊆ Range(A), then A is said to satisfy ɛ-differential privacy. $Pr [A (D) \in S] \leq \exp (ɛ) \times Pr [A (D^{'}) \in S]$ (3) where Pr [·] denotes probability taken over the randomness of A, which bounds the risk of privacy leakage, and ɛ is the privacy protection parameter (privacy budget). The closer ɛ is to 0, the stronger the privacy protection. In particular, when ɛ is 0, the highest level of privacy protection is achieved in theory: the output distributions of the algorithm on D and D′ are identical, so the two data sets cannot be distinguished from the query results. However, excessive privacy protection results in poor data utility.

In the differential privacy theory, the sensitivity can measure the biggest change caused by the query function, and it can be exploited to control the amount of noise added. Too much noise affects data availability, while too little noise results in low privacy protection. Usually, global sensitivity and local sensitivity are defined.

Definition 2. Global sensitivity. Given a query function f, for any adjacent data sets D and D′, the global sensitivity of the query function can be defined as: ${GS}_{f} = max_{D, D^{'}} ∥ f (D) - f (D^{'}) ∥$ (4)

A small global sensitivity of the query function, e.g., the sensitivity of the counting function is 1, can effectively ensure data security. However, if the query function has a large global sensitivity, it is necessary to add enough noise to the output function to ensure privacy and security, which results in poor data availability. As a result, the concept of local sensitivity was proposed.

Definition 3. Local sensitivity. Given a query function f : D → R^d, where d represents the query dimension of f, and a data set D, the local sensitivity of f over all data sets D′ adjacent to D is defined as: $LS_{f} = \max_{D^{'}} ∥ f (D) - f (D^{'}) ∥$ (5)

Since the local sensitivity is determined by the query function f and the specific data in the data set, the local sensitivity is usually smaller than the global sensitivity. The equation describes the relationship between global and local sensitivities: ${GS}_{f} = max_{D} ({LS}_{f})$ (6)

3.2.2 Differential privacy protection mechanism

Definition 4. Laplace mechanism [29]. Given a data set D and any query function f : D → R^d with sensitivity Δf, if the algorithm A_L(D) satisfies Equation (7), it is said to implement the Laplace mechanism, where Lap(Δf/ɛ) denotes random noise drawn from the Laplace distribution with scale Δf/ɛ. $A_{L} (D) = f (D) + Lap (\frac{Δ f}{ɛ})$ (7)
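Equation (7) can be illustrated for a scalar query with the short sketch below. The function names are hypothetical. The noise is generated with the standard library only, using the fact that the difference of two independent exponential variables with mean b follows a Laplace distribution with scale b.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return f(D) + Lap(sensitivity / epsilon), as in Eq. (7)."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    if rng is None:
        rng = random.Random()
    return true_value + laplace_noise(sensitivity / epsilon, rng)

# Counting query (sensitivity 1) answered with privacy budget 0.5.
noisy_count = laplace_mechanism(120, sensitivity=1.0, epsilon=0.5)
```

A smaller ɛ gives a larger noise scale Δf/ɛ, which is the privacy and utility trade-off described in Definition 1.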

Definition 5. Exponential mechanism [29]. Given the input data set D and the algorithm A_E(D, Q, ɛ), let Q(D, r) be the utility function of the output entity object r, and let ΔQ be the global sensitivity of Q(D, r). If the probability that A_E outputs the entity object r ∈ R is proportional to $\exp (\frac{ɛ Q (D, r)}{2 Δ Q})$, i.e., it satisfies Equation (8), then A_E(D) is said to satisfy the exponential mechanism. $Pr [A_{E} (D) = r] = \frac{\exp (\frac{ɛ Q (D, r)}{2 Δ Q})}{\sum_{r^{'} \in R} \exp (\frac{ɛ Q (D, r^{'})}{2 Δ Q})}$ (8)
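The sampling rule in Equation (8) can be sketched as follows. The function names are hypothetical, and the utilities are assumed to be precomputed values of Q(D, r) for each candidate r. Subtracting the maximum score before exponentiating does not change the probabilities and only avoids numerical overflow.

```python
import math
import random

def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities of Eq. (8) for precomputed utilities."""
    scores = [epsilon * q / (2.0 * sensitivity) for q in utilities]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # shift for stability
    total = sum(weights)
    return [w / total for w in weights]

def exponential_mechanism(candidates, utilities, epsilon, sensitivity,
                          rng=None):
    """Draw one candidate according to the probabilities above."""
    if rng is None:
        rng = random.Random()
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]

probs = exponential_mechanism_probs([1.0, 2.0, 3.0], epsilon=1.0,
                                    sensitivity=1.0)
```

With ɛ = 0 every candidate is equally likely, while a larger ɛ concentrates the probability on the candidates with the highest utility.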

3.2.3 The combined principle of differential privacy

Differential privacy obeys two combination principles: the sequence combination principle and the parallel combination principle.

Property 1. Principle of sequence combination [29]. Given that the privacy budgets of n independent algorithms {A₁, A₂, A₃, ⋯ , A_n} are respectively {ɛ₁, ɛ₂, ɛ₃, ⋯ , ɛ_n}, for the same data set D, the combined algorithm {A₁(D) , A₂(D) , ⋯ , A_n(D)} satisfies $\sum_{i = 1}^{n} ɛ_{i}$ -differential privacy protection.

Property 2. Principle of parallel combination [23]. Given that the privacy budgets of n independent algorithms {A₁, A₂, A₃, ⋯ , A_n} are respectively {ɛ₁, ɛ₂, ɛ₃, ⋯ , ɛ_n}, on the n disjoint data sets {D₁, D₂, D₃, ⋯ , D_n}, the combined algorithm satisfies $max_{i} ɛ_{i}$ -differential privacy protection.
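The two properties amount to simple budget arithmetic, sketched below. The helper names and the numbers are illustrative assumptions: a total budget ɛ = 1.0 is split into five selection steps at ɛ/10 each and 20 noisy iterations at ɛ/40 each, all applied to the same data set, so Property 1 applies.

```python
def sequential_budget(epsilons):
    """Property 1: budgets spent on the same data set add up."""
    return sum(epsilons)

def parallel_budget(epsilons):
    """Property 2: on disjoint data sets the largest budget dominates."""
    return max(epsilons)

epsilon = 1.0
n_iters = 20                                   # hypothetical iteration count
selection = [epsilon / 2 / 5] * 5              # five steps at epsilon/10
gradient = [epsilon / (2 * n_iters)] * n_iters # 20 steps at epsilon/40
total = sequential_budget(selection + gradient)
```

Up to floating-point rounding, `total` equals the overall budget ɛ, since each half contributes ɛ/2.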

4 Recommendation system model construction and optimization

The user’s interest bias is calculated from the user’s label data and the score under the time factor. Based on this, the personalized recommendation of tourist attractions based on the bias of user interest is realized. Fig. 2 illustrates a recommendation framework for tourism points of interest based on interest offset and differential privacy.

Fig. 2

Tourist interest point recommendation framework based on interest shift and differential privacy. Firstly, the user’s original travel dataset is privacy preserved by adding noise, then the travel interest point recommendation model is constructed, and finally the recommendation results are fed back to the user.

4.1 User interest deviation

The user’s tag data serves as a marker of the user’s behavior and habits. A user who has many tags, or who is tagged with the same tag many times, thereby reveals his actual interests. Moreover, when a recommendation cold start occurs, recommendations can be derived from the user’s interests, which alleviates the data sparsity problem to a certain extent. A user’s choices are likely to be driven by his interest preferences, so knowing the user’s interest in each tag in advance helps the recommendation system make predictions from the perspective of user interest. Therefore, the interest value of user i for label j (Ist_ij) is constructed as follows: $Ist_{ij} = \frac{\sum_{k \in C_{j}} label_{i, k}}{\sum_{l \in C} label_{i, l}}$ (9) where C represents the set of labels and label_i,k represents the number of evaluations by user i of items with label k; the equation gives the ratio of the number of evaluations by user i of items with label j to the total number of evaluations by user i. These counts can be obtained from user data statistics and are then used to calculate the similarity of interest values between users. Following the Pearson correlation coefficient, the similarity is calculated as follows: $Lsim (u, v) = \frac{\sum_{j \in V_{u, v}} (Ist_{uj} - \bar{Ist_{u}}) (Ist_{vj} - \bar{Ist_{v}})}{\sqrt{\sum_{j \in V_{u, v}} (Ist_{uj} - \bar{Ist_{u}})^{2}} \sqrt{\sum_{j \in V_{u, v}} (Ist_{vj} - \bar{Ist_{v}})^{2}}}$ (10) where Ist_uj and Ist_vj respectively represent the interest values of user u and user v for tag j; $\bar{Ist_{u}}$ and $\bar{Ist_{v}}$ respectively represent the average interest values of user u and user v.

Since the user’s interest fluctuates with time, ratings that are closer in time are more valuable to the recommendation system. Therefore, the Ebbinghaus forgetting curve is used to calculate a time sequence factor, which is added to the calculation of user similarity to reflect the change of user ratings over time. The time sequence factor is calculated as follows: $WT_{ui} = \begin{cases} 1, & t_{\max} = t_{\min} \\ e^{\frac{t_{ui} - t_{\min}}{t_{\max} - t_{\min}} - 1}, & t_{\max} \neq t_{\min} \end{cases}$ (11) where WT_ui represents the time sequence factor of user u for item i; t_ui represents the time at which user u rated item i; t_max and t_min represent the most recent and the earliest rating times, respectively. The Pearson correlation coefficient is then applied to the time-weighted user ratings to calculate the similarity: $Tsim (u, v) = \frac{\sum_{i \in V_{u, v}} (WTr_{ui} - \bar{r_{u}}) (WTr_{vi} - \bar{r_{v}})}{\sqrt{\sum_{i \in V_{u, v}} (WTr_{ui} - \bar{r_{u}})^{2}} \sqrt{\sum_{i \in V_{u, v}} (WTr_{vi} - \bar{r_{v}})^{2}}}$ (12) where r_ui and r_vi respectively represent the ratings of user u and user v on item i; $\bar{r_{u}}$ and $\bar{r_{v}}$ represent the average ratings of user u and user v; and WTr_ui = WT_ui · r_ui, WTr_vi = WT_vi · r_vi.
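The time sequence factor (11) and the time-weighted similarity (12) can be sketched as follows. Several choices here are assumptions made for the example: ratings are stored as a dictionary mapping an item to a (rating, timestamp) pair, t_min and t_max are taken per user, the sums run over the items rated by both users, and the similarity is set to 0 when there are no common items or a denominator is zero.

```python
import math

def time_factor(t_ui, t_min, t_max):
    """Eq. (11): 1 for the most recent rating, exp(-1) for the oldest."""
    if t_max == t_min:
        return 1.0
    return math.exp((t_ui - t_min) / (t_max - t_min) - 1.0)

def weighted_ratings(ratings, items):
    """Return ([WT_ui * r_ui for the given items], mean rating of the user)."""
    times = [t for _, t in ratings.values()]
    t_min, t_max = min(times), max(times)
    mean_r = sum(r for r, _ in ratings.values()) / len(ratings)
    wtr = [time_factor(ratings[i][1], t_min, t_max) * ratings[i][0]
           for i in items]
    return wtr, mean_r

def tsim(ratings_u, ratings_v):
    """Eq. (12) over the items rated by both users (assumed meaning of V)."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if not common:
        return 0.0
    xs, mean_u = weighted_ratings(ratings_u, common)
    ys, mean_v = weighted_ratings(ratings_v, common)
    num = sum((x - mean_u) * (y - mean_v) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mean_u) ** 2 for x in xs))
           * math.sqrt(sum((y - mean_v) ** 2 for y in ys)))
    return num / den if den > 0 else 0.0

user_u = {0: (5.0, 0), 1: (3.0, 5), 2: (1.0, 10)}   # item -> (rating, time)
user_v = {0: (4.0, 2), 1: (2.0, 4), 3: (5.0, 9)}
similarity = tsim(user_u, user_v)
```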

Finally, the score similarity under time drift Tsim(u, v) is combined with the similarity of user interest Lsim(u, v), where α ∈ [0, 1] is the balance factor. When users have rated few items, mining potential points of interest from the user tag information to assist the prediction helps alleviate the cold start problem. The deviation of user interest is calculated as follows: $Hsim (u, v) = α \times Tsim (u, v) + (1 - α) \times Lsim (u, v)$ (13)
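A small worked sketch of Equations (9), (10), and (13) follows. It rests on stated assumptions: each tag is treated as its own label group (so C_j contains only tag j), averages are taken over a user's own tags, and the Tsim value is supplied as a precomputed number. All function names are hypothetical.

```python
import math

def interest_values(label_counts):
    """Eq. (9) with one count per tag: share of a user's ratings per tag."""
    total = sum(label_counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in label_counts.items()}

def lsim(ist_u, ist_v):
    """Eq. (10): Pearson-style similarity over the tags both users have."""
    common = sorted(set(ist_u) & set(ist_v))
    if not common:
        return 0.0
    mean_u = sum(ist_u.values()) / len(ist_u)
    mean_v = sum(ist_v.values()) / len(ist_v)
    num = sum((ist_u[j] - mean_u) * (ist_v[j] - mean_v) for j in common)
    den = (math.sqrt(sum((ist_u[j] - mean_u) ** 2 for j in common))
           * math.sqrt(sum((ist_v[j] - mean_v) ** 2 for j in common)))
    return num / den if den > 0 else 0.0

def hsim(tsim_uv, lsim_uv, alpha):
    """Eq. (13): convex combination with balance factor alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * tsim_uv + (1.0 - alpha) * lsim_uv

ist_u = interest_values({"museum": 3, "park": 1})   # 0.75 and 0.25
ist_v = interest_values({"museum": 1, "park": 3})   # 0.25 and 0.75
combined = hsim(0.8, lsim(ist_u, ist_v), alpha=0.5)
```

In this toy case the two users have opposite tag preferences, so Lsim is -1 and the combined value with Tsim = 0.8 and α = 0.5 is -0.1.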

4.2 Matrix factorization model based on interest migration (ITMF)

When a user makes a decision, the user is usually influenced by the friends around him. A friend who can influence the user’s decisions is referred to as the user’s neighbor, and neighbors usually have interests similar to the user’s own. Therefore, the interest preferences of the neighbors can be exploited by the matrix factorization model to train the latent feature matrices of users and items, thus improving the prediction accuracy. Specifically, the k users with the highest interest offset Hsim with respect to the target user are taken as the neighbors of the target user and are integrated into the objective function in the form of user regular terms. In this way, the user’s feature vector is always influenced by the user’s neighbors and is drawn close to theirs according to the interest offset. Based on interest shift, the objective function of the matrix factorization model is defined as: $L_{\min} (R, P, Q) = \frac{1}{2} \sum_{u} \sum_{i} I_{ui} (r_{ui} - p_{u} q_{i}^{T})^{2} + \frac{λ}{2} (\sum_{u} ∥ p_{u} ∥_{F}^{2} + \sum_{i} ∥ q_{i} ∥_{F}^{2}) + \frac{β}{2} \sum_{u} \sum_{j \in N_{u}} Hsim (u, j) ∥ p_{u} - p_{j} ∥_{F}^{2}$ (14) where I_ui equals 1 if user u has rated item i and 0 otherwise, and Hsim(u, j) represents the interest offset between users u and j. The larger the interest offset between two users, the more strongly the regular term pulls their feature vectors together, so the distance between them becomes relatively small. N_u represents the set of neighbors of the target user u, and β is the weight of the interest regular term.

The personalized matrix factorization model is solved following the stochastic gradient descent method. The partial derivative of the objective function is calculated. Then, the latent factor vector is updated iteratively until the objective function converges to find the final parameter value. The process of finding partial derivatives is as follows:

$\frac{\partial L}{\partial p_{u}} = \sum_{i} I_{ui} (p_{u} q_{i}^{T} - r_{ui}) q_{i} + λ p_{u} + β \sum_{j \in N_{u}} Hsim (u, j) (p_{u} - p_{j})$ (15) $\frac{\partial L}{\partial q_{i}} = \sum_{u} I_{ui} (p_{u} q_{i}^{T} - r_{ui}) p_{u} + λ q_{i}$ (16) The feature vectors are updated as follows: $p_{u} = p_{u} - η * \frac{\partial L}{\partial p_{u}}$ (17) $q_{i} = q_{i} - η * \frac{\partial L}{\partial q_{i}}$ (18) where η is the learning rate, i.e., the step length of each iteration.
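The user-side update, Equations (15) and (17), can be sketched for a single user as follows. The data layout (lists of pairs) and the function names are assumptions made for illustration; the sum over rated items stands in for the indicator I_ui.

```python
def itmf_user_gradient(p_u, rated, neighbors, lam, beta):
    """Eq. (15) for one user.

    rated:     list of (q_i, r_ui) pairs for the items the user has rated
    neighbors: list of (p_j, hsim_uj) pairs for the users in N_u
    """
    k = len(p_u)
    grad = [lam * x for x in p_u]                       # lambda * p_u
    for q_i, r_ui in rated:
        err = sum(a * b for a, b in zip(p_u, q_i)) - r_ui
        for f in range(k):
            grad[f] += err * q_i[f]                     # (p_u q_i^T - r_ui) q_i
    for p_j, sim in neighbors:
        for f in range(k):
            grad[f] += beta * sim * (p_u[f] - p_j[f])   # neighbor regular term
    return grad

def sgd_step(vec, grad, eta):
    """Eqs. (17)/(18): move against the gradient with learning rate eta."""
    return [x - eta * g for x, g in zip(vec, grad)]

p_u = [1.0, 0.0]
grad = itmf_user_gradient(p_u, rated=[([1.0, 1.0], 3.0)],
                          neighbors=[([0.0, 0.0], 0.5)],
                          lam=0.1, beta=0.2)
p_u_new = sgd_step(p_u, grad, eta=0.1)
```

In this toy case the prediction error is 1 - 3 = -2, so the gradient is [-2 + 0.1 + 0.1, -2] = [-1.8, -2.0] and the updated vector is [1.18, 0.2].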

4.3 Matrix factorization model combining differential privacy and interest timing (DPITMF)

The differential privacy protection is divided into two parts to construct a centralized differential privacy model. The first part uses the exponential mechanism to protect the identity of user neighbors; the second uses the Laplace mechanism to add noise to the gradient descent process of the matrix factorization model. The privacy budget ɛ is allocated to the two parts at a ratio of 1:1, i.e., ɛ/2 each, so that the whole privacy protection scheme satisfies differential privacy. The design of the DPITMF privacy protection model is illustrated in Fig. 3.

Fig. 3

Tourist interest point recommendation framework based on interest shift and differential privacy.

4.3.1 Privacy neighbor selection

In the ITMF model, the interest preference of the user’s neighbor is similar to that of the user. If an attacker pretends to be the target user’s neighbor, he can use his fake interests to infer the real interests of the user. To protect the identity of the user’s neighbors, a private neighbor selection method based on the K-Medoids clustering algorithm is designed. The method must ensure that the user has high-quality neighbors while the identities of these neighbors are strictly protected by differential privacy. The detailed process of selecting private neighbors is shown in Fig. 4.

Fig. 4

The process of private neighbor selection. First, K-Medoids clustering is performed on the original tourism data to obtain the set of potential neighbors of the user. Candidate neighbor sets are then randomly selected from the set of potential neighbors, and finally the set of private neighbors is determined, which is the final set of neighbors.

Algorithm 1. K-Medoids clustering

Input: user-item rating matrix R, number of clusters k

Output: user clusters V₁, V₂, …, V_k

1: Randomly select one user as the first initial cluster center C₁

2: for i ← 2 to k

3: for each u ∈ U

4: Calculate the distances S₁, S₂, …, S_i-1 from the cluster centers C₁, C₂, …, C_i-1 to the user u

5: S_min ← min(S₁, S₂, …, S_i-1)

6: end for

7: Choose the next center C_i by roulette-wheel selection, sampling users with probabilities derived from S_min

8: end for /* Complete the calculation of the initial cluster centers (C₁, C₂, …, C_k) */

9: for each u ∈ U

10: Calculate the distances (S₁, S₂, …, S_k) from user u to the k cluster centers

11: S_min ← min(S₁, S₂, …, S_k)

12: divide(u, S_min) // Assign user u to the cluster whose center is at the shortest distance

13: end for

14: update(C₁, C₂, …, C_k, C₁′, C₂′, …, C_k′) /* For each cluster, accumulate the distances from each user to the other users in the same cluster, and use the user with the smallest sum of distances as the new cluster center */

15: if compare(C₁, C₂, …, C_k, C₁′, C₂′, …, C_k′) /* Check whether the cluster centers are unchanged */

16: end loop

17: else

18: repeat 9∼14

19: return V₁, V₂, …, V_k
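The steps of Algorithm 1 can be sketched in runnable form as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the input is a precomputed symmetric distance matrix (for example, derived from the interest offset between users), the roulette-wheel weights are the squared distances to the nearest existing center as in K-Means++, and the iteration cap is a hypothetical safeguard.

```python
import random

def k_medoids(dist, k, seed=0, max_iter=100):
    """Cluster n points given an n x n distance matrix; returns (medoids, clusters)."""
    n = len(dist)
    rng = random.Random(seed)
    medoids = [rng.randrange(n)]                       # step 1
    while len(medoids) < k:                            # steps 2-8
        d_min = [min(dist[u][c] for c in medoids) for u in range(n)]
        weights = [d * d for d in d_min]               # assumed K-Means++ rule
        if sum(weights) == 0:                          # all points coincide
            rest = [u for u in range(n) if u not in medoids]
            medoids.append(rng.choice(rest))
        else:
            medoids.append(rng.choices(range(n), weights=weights, k=1)[0])

    def assign(centers):                               # steps 9-13
        clusters = {c: [] for c in centers}
        for u in range(n):
            nearest = min(centers, key=lambda c: dist[u][c])
            clusters[nearest].append(u)
        return clusters

    for _ in range(max_iter):
        clusters = assign(medoids)
        # Step 14: the member with the smallest summed distance becomes the center.
        new_medoids = [
            min(members, key=lambda m: sum(dist[m][v] for v in members))
            if members else c
            for c, members in clusters.items()
        ]
        if set(new_medoids) == set(medoids):           # steps 15-16
            break
        medoids = new_medoids                          # steps 17-18
    return medoids, list(assign(medoids).values())     # step 19

points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, clusters = k_medoids(dist, k=2)
```

On this well-separated toy data the procedure groups the first three and the last three points, with the middle point of each group as its medoid.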

Algorithm 2. Privacy neighbor selection algorithm

Input: target user u, number of clusters k, privacy budget ɛ

Output: private neighbor set N

1: Execute Algorithm 1 and find the cluster C_u of the target user u in the clustering result

2: if length(C_u) ≥ 5N

3: take the 5N users in C_u closest to the user u as the potential neighbors

4: else

5: add all users in C_u to the potential neighbors and take the rest from the closest cluster

6: divide(5, pn₁, pn₂, …, pn₅) /* Randomly divide the set of potential neighbors into 5 small sets */

7: for i ← 1 to 5

8: M ← enumerate(N/5, pn_i) /* Enumerate all the possibilities of size N/5 from pn_i and store them in M */

9: calculate Hsim(u, v) /* Calculate according to formula (13) */

10: for each candidate subset of size N/5 in M /* Calculate according to formula (21) */

11: calculate p(N, L)

12: end for

13: N ← N ∪ random_sample(N/5, p(N, L)) /* Random sampling adds N/5 users to the final neighbor set N */

14: end for

15: return N

Step 1: K-Medoids clustering. The K-Medoids clustering algorithm is used for data preprocessing. The specific clustering steps are described in Algorithm 1, and the interest preference degree Hsim(u, v) proposed in Section 4.1 is used as the distance in the algorithm.

Step 2: Generation of a set of potential neighbors. In Step 1, K-Medoids is combined with the K-Means++ algorithm to cluster the user set. A potential neighbor set of 5 times the size of the neighbor set N is considered reasonable. Therefore, if the number of users in the target cluster is greater than or equal to 5N, the closest 5N users are selected as potential neighbors; otherwise, all users in the target cluster are selected, the neighboring clusters are searched, and the shortfall is filled with their users in order of distance.

Step 3: Random enumeration of neighbor selection. The neighbor enumeration mechanism is adopted to enumerate the set M of all possible neighbor selections, and the exponential mechanism is then used to select the neighbor set N. To balance security and recommendation performance, the set of potential neighbors is randomly divided into five parts, and all candidate subsets of size N/5 are enumerated within each part, which is far cheaper than enumerating every subset of size N from the whole set of potential neighbors.

Step 4: Private neighbor selection under the exponential mechanism. Since the enumeration is divided into five parts in Step 3, the private neighbor selection is also performed five times under the exponential mechanism. Finally, the five subsets of size N/5 are merged to obtain a final neighbor set of size N. Accordingly, the privacy budget is also divided into five parts, and the budget for each selection is ɛ/10, i.e., one fifth of the ɛ/2 allocated to neighbor selection. The utility function of the exponential mechanism is designed as follows. Assuming that the target user u is located in the cluster C_u and that N ∈ L is a candidate neighbor set of the target user u, where L is the collection of all enumerated candidate sets, the utility function is defined as: $Q (C_{u}, u, N) = \sum_{v \in N} | Hsim (u, v) |$ (19)

According to the definition of the exponential mechanism, the probability of outputting N as the neighbor set should be proportional to $exp (\frac{ɛ Q (C_{u}, u, N)}{2 Δ Q})$, where ΔQ is the sensitivity of the utility function Q and C_u′ is a neighboring set of C_u that differs in one user rating. Considering the maximum change of the utility function, ΔQ is calculated as follows: $Δ Q = max_{N} ∥ Q (C_{u}, u, N) - Q ({C_{u}}^{'}, u, N) ∥$ (20)

For the private neighbor selection under the exponential mechanism, the probability distribution over all candidate neighbor sets in L is calculated through Equation (21). Following this distribution, one candidate is randomly sampled as the neighbor set N. The complete description of this process is shown in Algorithm 2. $p (N, L) = \frac{exp (\frac{ɛ \sum_{v \in N} | Hsim (u, v) |}{2 Δ Q})}{\sum_{N^{'} \in L} exp (\frac{ɛ \sum_{v \in N^{'}} | Hsim (u, v) |}{2 Δ Q})}$ (21)
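The selection in Steps 3 and 4 can be sketched in Python as follows. This is a minimal illustration rather than the authors' Algorithm 2: the function names, the dictionary `hsim_to_u` (mapping each potential neighbor v to Hsim(u, v)), and the externally supplied sensitivity value are assumptions introduced here. It splits the potential neighbors into five random parts, enumerates every subset of size N/5 in each part, and samples one subset per part with probability proportional to exp(ɛ′Q/(2ΔQ)), where ɛ′ = (ɛ/2)/5.

```python
import itertools
import math
import random


def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng):
    """Sample one candidate with probability proportional to
    exp(epsilon * utility(candidate) / (2 * sensitivity))."""
    scores = [epsilon * utility(c) / (2.0 * sensitivity) for c in candidates]
    top = max(scores)  # shifting by the maximum leaves the probabilities unchanged
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def private_neighbor_selection(hsim_to_u, n_neighbors, epsilon, sensitivity,
                               parts=5, seed=None):
    """hsim_to_u: dict {potential neighbor id: Hsim(u, v)} (hypothetical input).
    Assumes every part holds at least n_neighbors // parts users."""
    rng = random.Random(seed)
    users = list(hsim_to_u)
    rng.shuffle(users)                      # random division into parts
    per_part = n_neighbors // parts         # size of each selected subset
    eps_part = (epsilon / 2.0) / parts      # budget of one selection round

    def utility(subset):
        # Utility in the spirit of Equation (19): sum of |Hsim(u, v)|.
        return sum(abs(hsim_to_u[v]) for v in subset)

    chosen = []
    for i in range(parts):
        part = users[i::parts]
        subsets = list(itertools.combinations(part, per_part))
        chosen.extend(
            exponential_mechanism(subsets, utility, eps_part, sensitivity, rng)
        )
    return chosen
```

Enumerating all subsets grows combinatorially with the part size, which is one plausible reason for splitting the candidates into five smaller parts before sampling.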

4.3.2 Gradient perturbation

Matrix factorization models usually use stochastic gradient descent (SGD) for parameter learning and score prediction. However, the gradient descent process is not safe, because an attacker can infer the user feature matrix through the regression function. Therefore, the gradient descent of the ITMF model proposed in this section is protected by gradient perturbation: random noise generated by the Laplace mechanism is added to achieve differential privacy protection. Suppose that gradient descent requires k iterations, where k is a preset parameter of the algorithm. In this case, the privacy budget of each iteration is ɛ/(2k). Since noise is added in every gradient descent iteration, a bound on the scoring error is set to limit the influence of the noise. Meanwhile, the local sensitivity Δr is calculated as the difference between the maximum score and the minimum score.

4.3.3 Security analysis

In this section, it is proved that the DPITMF model proposed in this paper satisfies ɛ-differential privacy. The privacy budget of the model is divided into two halves: one for the exponential mechanism and one for the Laplace mechanism. It is proved that each part satisfies ɛ/2-differential privacy.

Theorem 1. Algorithm 2 satisfies ɛ/2-differential privacy.

Proof. Let D₁ and D₂ be two data sets that differ by at most one record, and let d₁ and d₂ denote the corresponding sets of potential neighbors obtained by Algorithm 1 and the optimized clustering results. The two potential neighbor sets are selected from the same user group, and it is guaranteed that only one user score is inconsistent between them. Since the exponential mechanism is applied five times, each application consumes one-fifth of the privacy budget ɛ/2 allotted to neighbor selection.

According to the composition property of differential privacy (Property 1), if each neighbor selection satisfies differential privacy, then the algorithm composed of the private neighbor selections still satisfies differential privacy. Therefore, according to the definition of the exponential mechanism (Definition 5), the probability ratio of outputting an arbitrary N in each private neighbor selection is bounded as follows. In fact, a subset of size N/5 is output each time a private neighbor selection is performed; for convenience of expression, the output is written as N and the privacy budget as ɛ/2. $\begin{matrix} \frac{Pr (M_{PNS} (d_{1}) = N)}{Pr (M_{PNS} (d_{2}) = N)} = \frac{\frac{exp (\frac{ɛ Q (d_{1}, N)}{4 Δ Q})}{\sum_{N^{'} \in L} exp (\frac{ɛ Q (d_{1}, N^{'})}{4 Δ Q})} \times Pr (d_{1}, N)}{\frac{exp (\frac{ɛ Q (d_{2}, N)}{4 Δ Q})}{\sum_{N^{'} \in L} exp (\frac{ɛ Q (d_{2}, N^{'})}{4 Δ Q})} \times Pr (d_{2}, N)} \\ = \frac{exp (\frac{ɛ Q (d_{1}, N)}{4 Δ Q})}{exp (\frac{ɛ Q (d_{2}, N)}{4 Δ Q})} \times \frac{\sum_{N^{'} \in L} exp (\frac{ɛ Q (d_{2}, N^{'})}{4 Δ Q})}{\sum_{N^{'} \in L} exp (\frac{ɛ Q (d_{1}, N^{'})}{4 Δ Q})} \\ \leq exp (\frac{ɛ}{4}) \times \frac{\sum_{N^{'} \in L} exp (\frac{ɛ}{4}) exp (\frac{ɛ Q (d_{1}, N^{'})}{4 Δ Q})}{\sum_{N^{'} \in L} exp (\frac{ɛ Q (d_{1}, N^{'})}{4 Δ Q})} \\ \leq exp (\frac{ɛ}{4}) \times exp (\frac{ɛ}{4}) \times \frac{\sum_{N^{'} \in L} exp (\frac{ɛ Q (d_{1}, N^{'})}{4 Δ Q})}{\sum_{N^{'} \in L} exp (\frac{ɛ Q (d_{1}, N^{'})}{4 Δ Q})} \\ = exp (\frac{ɛ}{2}) \end{matrix}$ (22) where Pr(d₁, N) and Pr(d₂, N) are the probabilities of randomly sampling the set N from d₁ and d₂. Since d₁ and d₂ contain the same users and the neighbor set N is sampled independently, the two probabilities are equal and cancel.

Theorem 2. Algorithm 3 satisfies ɛ/2-differential privacy.

Proof. Let R and R′ be two adjacent rating matrices that differ in only one user rating record. Noise is added to the matrix factorization process in line 6 of Algorithm 3. The sensitivity of the prediction error is bounded as: max ||L_min(R, p, q)′ - L_min(R′, p, q)′||₁ ≤ max ||R_ui - R_ui′||₁ ≤ Δr.

According to the Laplace mechanism, each iteration therefore satisfies ɛ/(2k)-differential privacy. Since the algorithm runs for k iterations, it follows from the sequential composition of differential privacy that the matrix factorization process satisfies ɛ/2-differential privacy. Since Algorithm 2 (Theorem 1) and Algorithm 3 (Theorem 2) each satisfy ɛ/2-differential privacy, sequential composition further implies that the complete model satisfies ɛ-differential privacy.
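As a worked check of the budget accounting behind the two theorems, assuming that sequential composition (Property 1) applies across the five neighbor selections and the k gradient descent iterations, the total budget adds up as:

$5 \times \frac{ɛ}{10} + k \times \frac{ɛ}{2 k} = \frac{ɛ}{2} + \frac{ɛ}{2} = ɛ$

Purely for illustration, with ɛ = 6 and k = 50 (the settings reported in Section 5, assumed here to be the same k), each neighbor selection spends 6/10 = 0.6, each SGD iteration spends 6/100 = 0.06, and each of the two stages spends 3.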

Algorithm 3. Gradient perturbation algorithm

Input: user-item rating matrix R, number of SGD iterations k, privacy budget ɛ

Output: latent feature matrices p_u and q_i

1: Randomly initialize the user and item feature matrices p_u and q_i

2: for each iteration 1, …, k

3: for each r_ui ∈ R

4: calculate L_min(R, p, q) /* Calculate the objective function according to formula (14)*/

5: Δr = r_max - r_min

6: L_min(R, p, q) ′ = L_min(R, p, q) + Lap(2kΔr/ɛ) /* Laplace noise with scale Δr/(ɛ/2k), matching the per-iteration budget ɛ/2k */

7: update(p_u) /* Update the user feature matrix according to formula (17) */

8: update(q_i) /* Update the item feature matrix according to formula (18) */

9: end for

10: end for

11: return p_u and q_i
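The loop structure of Algorithm 3 can be sketched in Python as below. This is a simplified illustration under several assumptions that are not taken from the paper: plain regularized matrix factorization updates stand in for formulas (17) and (18), the interest regular term is omitted, the Laplace noise is added to the per-rating prediction error, the scale is the standard Laplace-mechanism value Δr divided by the per-iteration budget ɛ/(2k), and the noisy error is clamped to [-Δr, Δr] as one possible way to bound the scoring error. The names `laplace` and `dp_sgd_mf` are hypothetical.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_sgd_mf(ratings, n_users, n_items, k_iters, epsilon,
              r_min=1.0, r_max=5.0, dim=4, gamma=0.01, lam=0.1, seed=0):
    """ratings: list of (user index, item index, score) triples.
    Returns the latent feature matrices (P, Q) as lists of lists."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.1) for _ in range(dim)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(dim)] for _ in range(n_items)]

    delta_r = r_max - r_min                  # sensitivity (line 5)
    eps_iter = epsilon / (2.0 * k_iters)     # per-iteration budget
    scale = delta_r / eps_iter               # assumed Laplace scale

    for _ in range(k_iters):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(P[u], Q[i]))
            err = (r - pred) + laplace(scale, rng)    # perturbed error (line 6)
            err = max(-delta_r, min(delta_r, err))    # assumed error bound
            for f in range(dim):
                pu, qi = P[u][f], Q[i][f]
                # Stand-ins for formulas (17) and (18): plain regularized SGD.
                P[u][f] += gamma * (err * qi - lam * pu)
                Q[i][f] += gamma * (err * pu - lam * qi)
    return P, Q
```

The defaults for γ, λ and the score range follow the values quoted in Section 5; the feature dimension of 4 is arbitrary.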

5 Experimental analysis

5.1 Experimental data set

The data set used in this paper was collected from Google reviews, which contain comments on 24 categories of tourist attractions in Europe (referred to as TravelRating). Google user ratings range from 1 to 5, and the average user rating for each category is calculated. This data set is widely used in recommender-system teaching and research. The data set contains 5456 ratings, including 943 users’ ratings on 24 items. Each score is an integer ranging from 1 to 5; the larger the value, the more the user likes the item.

5.2 Experimental parameter settings

The experiment is conducted on a computer equipped with an Intel Core2 Quad CPU (Q9500, 2.83 GHz) and 4 GB of memory, running the Windows 7 operating system. All algorithms are programmed in C# on the Visual Studio 2015 platform. The experiment uses a cross-validation method to divide the data set into 10 groups, and the ratio of the training set to the test set is 8:2. Each time, a group of data is randomly selected, and no data is fixed for training or testing only. The experimental results in this paper are obtained under the optimal parameters of each algorithm. Specifically, the learning rate γ is set to 0.01, the regularization factor λ is set to 0.1, and the number of iterations is set to 50. The experiment uses the root mean square error (RMSE) to evaluate the recommendation accuracy, judging the quality of recommendation by the deviation between the predicted score and the true score. The closer the value of RMSE is to 0, the higher the accuracy. RMSE is calculated as follows: $RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(R_{i} - r_{i})}^{2}}$ (23)

The experiment also reports the mean absolute error (MAE). When the predicted score is completely consistent with the true score, MAE is equal to 0, indicating a perfect model; the greater the error, the greater the value. MAE is calculated as follows: $MAE = \frac{1}{n} \sum_{i = 1}^{n} | R_{i} - r_{i} |$ (24)
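Equations (23) and (24) translate directly into code. The following minimal Python sketch assumes that the predicted and true scores are given as two equally long, non-empty sequences; the function names are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long score sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def mae(predicted, actual):
    """Mean absolute error between two equally long score sequences."""
    n = len(predicted)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n
```

Because RMSE squares each deviation before averaging, it penalizes large individual errors more heavily than MAE, which is why the two metrics are usually reported together.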

5.3 Experimental results and analysis

Since two recommendation models, ITMF and DPITMF, are proposed in this paper, the experiment investigates the recommendation parameter settings of the ITMF model and the privacy parameter settings of the DPITMF model. Then, the models are compared with existing algorithms and analyzed.

5.3.1 ITMF model performance analysis

Sensitivity analysis is performed on the three key parameters in the ITMF model in advance, including the balance factor between the user interest value similarity and the user similarity under the time factor, the learning factor of the interest regular term in the model, and the number of user neighbors k. Since the values of the three parameters have an impact on the model optimization, the controlled variable method is exploited to fix the other two parameters while the effect of changing one parameter on the model is investigated. The basic matrix factorization algorithm (Basic MF) and the classic SVD++ model that incorporates implicit user feedback [25] are taken for comparison.
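The controlled variable procedure described above can be sketched as a one-at-a-time sweep. In this Python sketch, `evaluate` is a hypothetical callback that trains a model with the given parameter dictionary and returns its RMSE; it is not part of the paper, and the parameter names are placeholders.

```python
def sweep_one_at_a_time(evaluate, defaults, grids):
    """For each parameter in grids, vary it over its candidate values while
    all other parameters stay at their defaults, and keep the value with the
    lowest score returned by evaluate (for example, RMSE)."""
    best = {}
    for name, values in grids.items():
        scores = {}
        for value in values:
            params = dict(defaults)   # copy so the defaults are not modified
            params[name] = value
            scores[value] = evaluate(params)
        best[name] = min(scores, key=scores.get)
    return best
```

A call such as `sweep_one_at_a_time(evaluate, {"alpha": 0.5, "beta": 0.01, "k": 5}, {"alpha": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]})` would return the best α for fixed β and k. This approach is cheaper than a full grid search but does not capture interactions between parameters.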

(1) The impact of the balance factor α on the recommendation accuracy of the ITMF model

The balance factor α represents the weight of the user similarity under the time factor and is varied from 0 to 1. The results obtained on the TravelRating data set are shown in Fig. 5. It can be seen that the ITMF model achieves better recommendation performance than the classic algorithms. Meanwhile, the values of RMSE and MAE fluctuate as the balance factor α increases. When α is 0.6, both RMSE and MAE are the lowest, indicating the best recommendation accuracy. At this point, the user similarity under the time factor has a weight of 0.6, and the user interest similarity has a weight of 0.4.

Fig. 5

The impact of the balance factor α on the performance of the ITMF model.

(2) The influence of the learning factor β of the interest regular term on the recommendation accuracy of the ITMF model

The impact of the learning factor β on the performance of the ITMF model is illustrated in Fig. 6. The experimental results show that when the learning factor β is 0.01, both the values of RMSE and MAE are the smallest, and the recommendation accuracy of the ITMF model is the best.

Fig. 6

The impact of the learning factor β on the performance of the ITMF model.

(3) The influence of the number of neighbors k of the target user on the recommendation accuracy of the ITMF model

It can be seen from Fig. 7 that learning the interests and preferences of the target user's neighbors contributes to a more accurate prediction for the target user. When the number of neighbors increases from 1 to 3, both RMSE and MAE decrease significantly, indicating an obvious improvement. When the number of neighbors changes from 4 to 10, RMSE and MAE fluctuate without an overall downward trend, indicating that the improvement in prediction accuracy levels off as the number of neighbors increases. Therefore, the number of neighbors with the lowest RMSE in the range of 4 to 10 is taken as the best number of neighbors for the subsequent experiments.

Fig. 7

The impact of the number of neighbors k on the performance of the ITMF model.

5.3.2 Privacy analysis of DPITMF model

The DPITMF model has the same user interest offset part as the ITMF model. Therefore, the optimal parameters of the ITMF model determined in Section 5.3.1 can be used directly for the privacy analysis of the DPITMF model, and the impact of the privacy budget on the recommendation results is investigated. The DPSS++ algorithm [21], a differential privacy protection algorithm based on SVD++ and gradient perturbation, is taken for comparison.

(1) The impact of the privacy budget ɛ on the privacy protection of the DPITMF model

Figure 8 shows the comparison results on the TravelRating data set. For a low privacy budget, the recommendation performance of the DPITMF model is worse than that of the classic DPSS++ algorithm. The reason is that the DPITMF model does not spend the entire privacy budget ɛ on gradient perturbation: half of the budget is used for implicit neighbor identity protection, and only the other half is used for gradient perturbation throughout the experiment. The smaller the privacy budget, the more random noise is added; the stronger the privacy protection, the lower the data availability. However, as the privacy budget increases, the DPITMF model achieves higher recommendation accuracy than the DPSS++ algorithm under the same privacy budget ɛ, even though its gradient perturbation carries more random noise. This is because the DPITMF model incorporates user interest offset information, which helps the model express user feature preferences more accurately and thus makes recommendations more accurate. It can be seen from Fig. 8 that when the privacy budget ɛ reaches 6, the recommendation performance begins to stabilize. Although a larger privacy budget ɛ contributes to a better recommendation effect, the noise introduced will be smaller and the privacy protection will be weaker. Therefore, balancing recommendation performance and privacy protection, the privacy budget ɛ is set to 6 in subsequent experiments.

Fig. 8

The impact of the privacy budget ɛ on the privacy protection of the DPITMF model.

5.3.3 Performance comparison of recommended models

To better demonstrate the recommendation performance of the proposed models, the ITMF and DPITMF models are compared with the classic SVD++ [25] and DPSS++ [21] algorithms. The ITMF and DPITMF models use the optimal model parameters determined in Sections 5.3.1 and 5.3.2, respectively. In Fig. 9, the abscissa represents the dimension of the feature vector, and the impact of the dimension on the recommendation performance is observed by changing the dimension value.

Fig. 9

Recommendation performance comparison of recommendation models.

It can be seen from Fig. 9 that the ITMF model proposed in this paper has obvious advantages in recommendation accuracy: both RMSE and MAE are significantly reduced, indicating that the fusion of the user interest offset helps improve the prediction accuracy of the recommendation model. The DPITMF model also performs better than the DPSS++ algorithm in terms of RMSE and MAE. Although the addition of privacy protection has a certain impact on the recommendation performance, the integration of the user interest offset alleviates the loss of recommendation accuracy to a certain extent, so that user privacy can be ensured while a reasonable balance between privacy protection and recommendation quality is maintained.

6 Conclusions

To address the privacy protection requirements of recommendation systems and the problem that privacy protection technology degrades recommendation performance, a matrix factorization recommendation algorithm based on user interest offset and differential privacy is proposed in this paper. User interest preferences are extracted from user tags and user ratings under time-series factors, and similar neighbors are exploited to train the user's feature preferences, which are then integrated into the matrix model in the form of regular terms. Meanwhile, based on differential privacy theory, a private neighbor selection method combining the K-Medoids clustering algorithm and the exponential mechanism is designed to improve the accuracy of recommendation while satisfying the privacy protection requirements. Besides, the Laplace mechanism is used to protect the gradient descent process of the model and ensure the security of the recommendation model. Finally, the feasibility of the proposed privacy protection scheme is verified by experiments. Future work will further mine user data in social networks, such as the social relationships between users and the influence of authoritative users on ordinary users, to improve the performance of the recommendation system.

Acknowledgments

This research was supported by the Natural Science Foundation of China (No. 61972439), Natural Science Foundation of Anhui Province (No. 1808085MF172) and Key Program in the Youth Elite Support Plan in Universities of Anhui Province (gxyqZD2019010).

References

1. Borras, Moreno and Valls, Intelligent tourism recommender systems: A survey, Expert Systems with Applications 41(16) (2014), 7370–7389.

2. Ramakrishnan, Keller, Mirza, Grama and Karypis, Privacy risks in recommender systems, IEEE Internet Computing 5(6) (2001), 54–63.

3. Wong R.K. and Chi C.-H., Efficient role mining for context-aware service recommendation using a high-performance cluster, IEEE Transactions on Services Computing 10(6) (2017), 914–926.

4. Bost, Popa R.A., Tu, Goldwasser, Machine learning classification over encrypted data, in Network and Distributed System Security Symposium, (San Diego, USA: ISOC), (2015), pp. 4324–4325.

5. Polatidis, Georgiadis C.K., Pimenidis and Mouratidis, Privacy-preserving collaborative recommendations based on random perturbations, Expert Systems with Applications 71 (2017), 18–25.

6. Liu, Wang, Li, Liu, Li, Zhou and Zhang, A privacy-preserving framework for trust-oriented point-of-interest recommendation, IEEE Access 6 (2018), 393–404.

7. Erkin, Veugen, Toft and Lagendijk R.L., Generating private recommendations efficiently using homomorphic encryption and data packing, IEEE Transactions on Information Forensics and Security 7(3) (2012), 1053–1066.

8. Zheng, Luo, Wang, Sun, Chen, Hu and Wang, Research on location-based distributed differential privacy recommendation method, Acta Electronica Sinica 49(1) (2021), 99.

9. Kofler, Caballero, Menendez, Occhialini and Larson, Near2me: An authentic and personalized social media-based recommender for travel destinations, in Proceedings of the 3rd ACM SIGMM International Workshop on Social Media (2011), pp. 47–52.

10. Antonio, Aida, David, Lucas M.J., Borràs, SigTur/E-Destination: Ontology-based personalized recommendation of tourism and leisure activities, Engineering Applications of Artificial Intelligence 26(1) (2013), 633–651.

11. Loh, Lorenzi and Lichtnow, A tourism recommender system based on collaboration and text analysis, Information Technology and Tourism 6 (2003), 157–165.

12. Levi, Mokryn, Diot, Taft, Finding a needle in a haystack of reviews: Cold start context-based hotel recommender system, in Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys ’12, (New York, NY, USA), pp. 115–122, Association for Computing Machinery, 2012.

13. Andersen, Karlsen, Privacy Preserving Personalization in Complex Ecosystems, pp. 247–261, Berlin, Heidelberg: Springer Berlin Heidelberg, 2018.

14. Calandrino J.A., Kilzer, Narayanan, Felten E.W., Shmatikov, You might also like: Privacy risks of collaborative filtering, in 2011 IEEE Symposium on Security and Privacy, (2011), pp. 231–246.

15. Dwork, Kenthapadi, McSherry, Mironov, Naor, Our data, ourselves: Privacy via distributed noise generation, in Proceedings of the 24th Annual International Conference on The Theory and Applications of Cryptographic Techniques, (Berlin, Heidelberg), pp. 486–503, Springer-Verlag, 2006.

16. Dwork, Differential privacy: A survey of results, in Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, TAMC’08, (Berlin, Heidelberg), pp. 1–19, Springer-Verlag, 2008.

17. Dwork, Calibrating noise to sensitivity in private data analysis, Lecture Notes in Computer Science 3876(8) (2012), 265–284.

18. McSherry, Mironov, Differentially private recommender systems: Building privacy into the Netflix Prize contenders, in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’09, (New York, NY, USA), pp. 627–636, Association for Computing Machinery, 2009.

19. Zhu, Li, Ren, Zhou, Xiong, Differential privacy for neighborhood-based collaborative filtering, in Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’13, (New York, NY, USA), pp. 752–759, Association for Computing Machinery, 2013.

20. Berlioz, Friedman, Kaafar M.A., Boreli, Berkovsky, Applying differential privacy to matrix factorization, in Proceedings of the 9th ACM Conference on Recommender Systems, RecSys’15, (New York, NY, USA), pp. 107–114, Association for Computing Machinery, 2015.

21. Zhengzheng, Qiliang, Xiaoyu, Wei and Jiyuan, Collaborative filtering algorithm based on differential privacy and SVD++, Control and Decision 34(01) (2019), 43–54.

22. Yang, Li, Sun, Zhang, A differential privacy framework for collaborative filtering, Mathematical Problems in Engineering, 2019.

23. Polatidis G.C.K., Nikolaos, A multi-level collaborative filtering method that improves recommendations, Expert Systems with Applications 48 (2016), 100–110.

24. Koutrika, Modern recommender systems: From computing matrices to thinking with neurons, in Proceedings of the 2018 International Conference on Management of Data, SIGMOD’18, (New York, NY, USA), pp. 1651–1654, Association for Computing Machinery, 2018.

25. Xun, Jing, Guangyan and Yanchun, SVD-based incremental approaches for recommender systems, Journal of Computer & System Sciences 81(4) (2015), 717–733.

26. Xin, Mengchu, Yunni and Qingsheng, An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems, IEEE Transactions on Industrial Informatics 10(2) (2014), 1273–1284.

27. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’08, (New York, NY, USA), pp. 426–434, Association for Computing Machinery, 2008.

28. Dwork and Roth, The algorithmic foundations of differential privacy, Foundations and Trends in Theoretical Computer Science 9(3-4) (2013), 211–407.

29. Lyu, Su and Yang, Differential privacy: From theory to practice, Synthesis Lectures on Information Security, Privacy, & Trust 8(4) (2016), 1–138.