Abstract
Collaborative filtering is a widely used approach in recommendation systems which predict user preferences by learning from user-item ratings. To extract either user relationship or item dependencies, there exist several well known approaches; among them clustering is of great importance. Traditional clustering methods in collaborative filtering usually suffer from two fundamental problems: sparsity and scalability. Sparsity refers to a situation where most users rate only a small number of items, while scalability denotes a huge number of both users and items. Inspired by these problems, this paper presents a novel stepwise paradigm, SPCF, which in the first step clusters users and items separately using their latent similarity. Once the primary clusters of the first level are formed, the second level simultaneously clusters the user and item clusters by means of co-clustering. The advantages of SPCF are threefold; first, it is able to alleviate the well known sparsity problem which intrinsically exists in collaborative filtering; second, the proposed method offers an elegant solution to the scalability problem based on dimensionality reduction which in turn leads to better performance of the model; third, experimental results on two versions of a Movielens dataset for prediction have demonstrated that the proposed method can reveal major interests of users or items in promising manner.
Keywords
1. Introduction
With the rapidly growing amount of information available on the internet, users need tools to assist them in obtaining required information. In order to satisfy this need, recommender systems [1] have emerged, e.g. there are popular recommenders for movies, books, music, etc. Typically in a recommender system, we have a set of users and a set of items. Each user
Two common classes of algorithm used in recommender systems, are content-based and collaborative filtering (CF). Content-based algorithms suggest their recommendations based on the content of the items (e.g. the text of a news article), while CF algorithms make recommendations based on information recorded by the system about the preferences of all users.
Usually, CF is based on the explicit ratings that users suggest for items. So, input data can be represented in terms of a rating matrix, which has one row for each user and one column for each item. Matrix elements indicate the ratings given by the users to the various items. Each element in a rating matrix is usually a natural number taken from a fixed range (e.g. 1–5). Here, if there are

The left square and right circular vertices denote two parts of bipartite graph.
Grouping users into clusters based on items they have rated already is a popular technique in CF. This method is based on the fact that if customer
Generally, most collaborative filtering methods suffer from either sparsity or scalability problems. While sparsity refers to a situation where most users rate only a small number of items, scalability denotes the existence of a huge number of users and items in the system.
In the proposed algorithm (SPCF), we combine both single-side and co-clustering algorithms to increase accuracy and alleviate sparsity and scalability problems, as well as decreasing the computational complexity of the co-clustering algorithm. To realize our objective, we use singular value decomposition (SVD) [2] to capture latent relationships between users and items that allows us to compute the predicted liking of a certain item by a user. Also SVD produces a low-dimensional representation of the original user–item space and then we compute neighbourhood in the reduced space. By utilizing SVD in the proposed method, we form both user clusters and item clusters separately. Subsequently, the co-clustering algorithm is performed on the user clusters and item clusters bipartite graph to merge similar clusters together. So, SPCF transforms the original data to a new low dimension space, one with less sparsity, while exploring the similarity between users and items. As experimental results on both Movielens-100k and Movielens-1M datasets show, this approach outperforms some well-known CF approaches according to the mean absolute error (MAE) evaluation measure. The overall architecture of the proposed approach is shown in Figure 2.

The overall architecture of the approach. The lines with arrows represent the workflow as well as the flow of data. (a) The m×n bipartite graph (m users and n items) as input, (b) single-side clustering, top rectangular and bottom represent user clusters and item clusters respectively, where k corresponds to the number of user clusters and l denotes the number of item clusters, (c) co-clustering, where T is the number of co-clusters, (d) prediction scheme, (e) estimated value for test pair as output.
The remainder of this paper proceeds as follows. In the next section, we summarize related works in brief, while in Section 3 our approach is described with respect to it. We validate the proposed approach experimentally in Section 4. The final section provides some concluding remarks and directions for future research.
2. Related work
Automated recommender systems can be broadly classified into two groups – (i) content-based systems [1, 3] that predict preferences based on the content of the items and the explicit interests of the users, and (ii) CF systems [4, 5] that make predictions based solely on the known ratings of similar users. The relative performance of these techniques depends on the sparsity of the known ratings. However, recent advances in internet and database technology have made it feasible to collect a large number of user–item ratings with little or no effort on the part of users, thus making CF a superior and more practical approach [4].
CF approaches are often classified as a memory- or model-based approach. Fundamentally, the memory-based techniques use similarity measures such as Pearson correlation [6] and cosine similarity [7] to determine a neighbourhood of like-minded users for each user. Then, these techniques predict the user’s rating for an item as a weighted average of ratings of the neighbours. Examples of memory-based CF include user-based methods [1, 8, 6] and item-based methods [9, 4]. The advantage of the memory-based methods over model-based alternatives is the smaller number of parameters to be tuned.
On the other hand, in model-based approaches, training examples are used to generate a model which is able to predict the ratings for items that a test user has not rated before. Examples include matrix factorization approaches such as SVD [2] and NNMF (Non Negative Matrix Factorization)-based [10, 11] filtering techniques that predict unknown ratings based on a low rank approximation of the original ratings matrix. The missing values in the original matrix are filled using average values of the rows or columns. Unlike correlation-based methods, the matrix factorization techniques create latent features of users and items and hence, handle sparsity problems in a better fashion. However, utilizing the data matrix factorization dramatically improves the scalability which is a crucial problem in recommender system framework.
In addition, there are also some works on CF approaches including Bayesian networks [12, 13], clustering [14], horting [15] and co-clustering [16, 17]. Bayesian networks create a model based on a training set with a decision tree at each node and edges representing user information. The model can be built off-line over a matter of hours or days. The resulting model is very small, very fast, and essentially as accurate as the nearest neighbour methods.
Clustering techniques try to identify groups of users with similar preferences. Once the clusters are created, a particular item’s rating could be predicted for a particular user by averaging opinions of the other users in that cluster. These techniques usually produce less personal recommendations than other methods, and in some cases, perform less accurately than nearest neighbour algorithms [1]. However, once the clustering process is complete, performance can be very good, since the size of the group that must be analysed is much smaller. Anyway, it is clear that using clustering as the pre-processing step is a worthwhile trade-off between accuracy and throughput.
Horting is a graph-based technique in which nodes are users, and edges between them represent the degree of similarity between two end-point users where predictions are produced by walking the graph to nearby nodes and combining their corresponding user’s opinions. It differs from the nearest neighbour algorithm as the graph may be walked through other users who have not rated the required item. Horting is reported to achieve better results than the nearest neighbour algorithm [15].
Co-clustering, or bi-clustering is the problem of simultaneously clustering rows and columns of a data matrix which arises in diverse data mining applications, such as the simultaneous clustering of genes and experimental conditions in bioinformatics [17, 18], documents and words in text mining [19, 20], users and movies in recommender systems [21], etc. In [20], Dhillon proposes an algorithm based on spectral graph partitioning. This algorithm, is a co-clustering algorithm and does not use or define any notion of similarity between individual documents. It models a document collection as a weighted bipartite graph with each edge running from a document vertex to a word vertex. The edge weight is the corresponding entry in matrix
It is worth mentioning that those methods especially designed to alleviate the sparsity problem usually hybridize a CF approach and a content-based recommender [23–25]. However, the proposed approach differs from others owing to the fact that it remains a pure CF approach.
Huang et al. [23] construct a bipartite graph while an edge between a user and an item is added when the user likes the item. An item is then ranked for a user based on the number of paths between that item and the user. Although the system gives good results, its application scope is limited because it can only handle binary data. A user either likes or dislikes an item and edges have no weight. Papagelis [26] proposed another graph-based algorithm in which the graph contains only users. The system uses only binary data: edges are weighted based on the number of co-ratings between two users. In [24], Nanopoulos proposed a method which exploits the transitive correlations between items, i.e. some items might be heavily correlated and thus a given rating for one item might tell us something about the rating for other items as well. The algorithm recommends the top-k (transitively) correlated items. This can be seen as a hybrid approach to the sparsity problem. Huang et al. [27] presented a graph-based recommender system that only recommends a number of items and no predictions are made for the other items. This system uses a two-layer graph, containing user–user, item–item and user–item edges. This graph is then searched for recommendable items.
Note that none of these graph-based methods can be directly compared to our method either because they predict a ranking instead of an absolute value or because they can only handle binary data.
3. Stepwise Partitioning for Collaborative Filtering (SPCF)
A considerable number of CF algorithms are based on either user or item clustering. The reason why clustering has gained increasing attention within recommender system communities lies in the well-known problem of sparsity. To deal with sparsity in the raw data, it is advantageous to transform users and items into clusters as a pre-processing process.
The majority of CF algorithms find neighbours of users or items (usually known as correlation-based CF), as well as clustering them, using similarity measures such as Pearson correlation [29]. There have been great achievements in CF approaches using these measures; however, there are still some issues to be further explored.
The first issue that should be considered is sparsity. To calculate correlation between two users/items, these measures rely solely on exact matches of them, and the other users/items do not affect the correlation value. Here, we illustrate this problem with a simple example. Suppose that three customers:

Correlation-based CF problem has been shown in this example. The values shown next to edges correspond to the correlation value between two endpoint customers.
As a result, because of the large number of missing values in recommender systems, these approaches are not reliable enough.
Second, scalability is one of the other challenges that must be tackled by correlation-based CF algorithms. The computational complexity of these algorithms grows with both the number of users and the number of items. With a large number of users and items, a typical web-based recommender system, running existing algorithms will suffer from serious scalability problems.
Based on these understandings, this paper studies clustering by utilizing the nature of SVD which acquires latent features of users and items when it factors the original matrix. Since SVD uses all existing ratings in order to form features space, it does not suffer from the first mentioned problem. Hence, the SVD-based CF algorithm is the best choice for large databases, thus overcoming the second problem.
SPCF is different from existing approaches in the sense that, first, it utilizes SVD to produce latent features, rather than predicting missing values as other CF algorithms do. By generating these latent features, it is able to overcome the sparsity problem. Second, the scalability problem is handled once the clustering phase is incorporated into our SPCF algorithm. Moreover, using a scalable approach (SVD) to undertake this phase, compared to correlation-based approaches, strengthens the efficiency of proposed approach. Third, by transforming the original
3.1. Single-side clustering as the first step
Sarwar et al. proposed SVD based CF as a well-known matrix factorization technique in a recommender system [2]. To deal with missing values, there exist two common approaches. The first approach makes use of row averages to fill in the missing values, whereas, the second approach utilizes the column information. In this paper, we take into account both rows and columns. So, we obtain a filled matrix,
The reduced dimensional representation of the original space is less sparse than its high dimensional one and led us to form the neighbourhood in that space. So, we started with the original user-item matrix
Accordingly, for each user or item,

The overall architecture of user clustering and item clustering process. The squares correspond to matrices, whose dimensions have been denoted below them.
3.2. Bipartite graph spectral co-clustering as the second step
The bipartite graph
where
In this paper, we introduce the adjacency matrix
In SPCF, co-clustering algorithm is performed on
There are some well known co-clustering algorithms, for instance, information theoretic Co-clustering [19], Bayesian co-clustering [28] and scalable co-clustering [21]. Here, we adopt a bipartite graph spectral co-clustering algorithm [20] to our approach which poses the co-clustering problem in terms of finding minimum cut vertex partitions between left side and right side of the bipartite graph. Given a partitioning of the vertex set V into two subsets V1 and V2, the cut between them is defined by
Additionally, the definition of cut is extended to T vertex subsets,
Also, we need an objective function that, in addition to small cut values, captures the need for more ‘balanced’ clusters. To do so, for a subset of vertices as a partition, define its weight to be
where
We consider subsets
Given two different partitioning with the same cut value, the above objective function value is smaller for the more balanced partitioning. Thus minimizing

Here, the optimized cut to partitioning this graph is drawn using dashed line. Assuming the top formed cluster as
Finding a globally optimal solution to such a graph partitioning problem is NP complete; however, in [20] it was shown that the second left and right singular vectors of a suitably normalized adjacency matrix give an optimal solution to the real relaxation of this discrete optimization problem. Based on this observation, we present a spectral algorithm that simultaneously partitions the bipartite graph, and demonstrate that this algorithm can reach a good global solution in practice.
The pseudo code of the proposed algorithm is shown in Algorithm 2. Here, two diagonal matrix
In the prediction phase, it is essential that user and item biases are not misleading for the estimation of matrix
where
As an illustration, consider the scalable co-clustering algorithm described in [21]. This algorithm partitions users based on items and vice versa simultaneously, and performs this operation iteratively until convergence to the stable clusters. Applying this algorithm to the training set shown in Figure 6(a) concludes in two user clusters and two item clusters as illustrated in Figure 6(b).

Formed clusters by scalable co-clustering algorithm. (a) Training set matrix, (b) training set bipartite graph, where formed clusters are represented as coloured partitions, (c) calculation of rating value for user u4 on item i3 using Equation (9).
Once the clustering process is done, using Equation (9), the test rating value from user
It can be shown that approximation matrix  is the least squares solution that preserves user, item and co-cluster averages, or in other words, it is the best additive approximation to
By taking into account the concept of Equation (9), in this paper we introduce a prediction scheme that includes two parts: first, we calculate the estimation of user
Where
It is clear that, Equation (11) is similar to Equation (9), while a prior user–item bipartite graph has been reduced to a user cluster–item cluster bipartite graph.
Then we incorporate user bias and item bias and predict
It should be noted that Equation (11) is a general formula that estimates the rating value of a user cluster on an item cluster. Then, using Equation (12), we use user
To better understand the performance of SPCF, in Figure 7, we have shown the overall process of this approach on the previous mentioned training set. The rating value from user u4 on item i3, using above equations, has been predicted to be 4.37. It means that the proposed approach estimates the real value more precisely than scalable co-clustering approach.

An example of applying SPCF algorithm on a sample dataset A. The matrix M corresponds to the extracted bipartite graph after single-side clustering. It is calculated using Equation (3) and its elements denote the bipartite graph edge weights. The rating value of user u4 on item i3, is estimated using our proposed prediction scheme.
4. Experimental results
This section provides empirical evidence to demonstrate whether the proposed algorithm improves the prediction accuracy or not. Accordingly, we evaluate the performance of SPCF against a number of well-known CF algorithms, as well as its ability to alleviate sparsity and scalability problems.
4.1. Chosen datasets
We use two versions of Movielens dataset to evaluate the performance of our proposed approach compared to a number of well-known algorithms. In the literature, the first version is known as Movielens-100k and the second one as Movielens-1M. Movielens-1M was gathered using www.movielens.org website April 2000–February 2003, as opposed to September 1997–April 1998 for the MovieLens-100k. The details of these datasets can be found in Table 1.
Characteristics of data sets used in this study for experimental evaluation.
To evaluate our approach versus sparsity, we divide the rating records into training set and test set according to different ratios. We call this the training ratio and denote it by
It should be noted that, since some items in Movielens-1M are not rated at all, we remove these items, so our dataset is reduced to include 6040 users and 3706 items.
In order to get a better picture of the rating information, we illustrate the sparsity pattern of the training data in Figure 8. All dots represent known rating values and white areas depict unknown rating information. The rating information in Figure 8 might seem slightly denser than it is in reality. This is due to the low resolution and the slightly enlarged point size for the known ratings plotted. However, Figure 8(a) reveals that, for Movielens-100k, most users only rated a small portion of the items available, and most items received modest user feedback. On the other hand, Figure 8(b) indicates that existing ratings in Movielens-1M are distributed more uniform than in Movielens-100k. This means that approximately all users have rated identical number of items and the items have received as equal ratings as each other.

Sparsity patterns of (a) Movielens-100k, (b) Movielens-1M. In both plots, the x-axis indexes the items while the y-axis corresponds to users.
4.2. Performance measure
Statistical accuracy metrics evaluate the accuracy of a predictor by comparing the predicted rating results with user provided raw values in the test dataset. MAE is a commonly used method while, the lower the value of MAE, the more accurately the recommendation algorithm predicts user ratings. Formally, MAE is calculated using Equation (13).
where,
4.3. Observation analysis
In this section, we provide empirical evidence showing that our stepwise approach can provide high quality predictions on unknown ratings as compared to traditional CF techniques. We also study the benefits of alleviating sparsity as well as scalability problems of proposed method.
4.3.1. Quality evaluation
To evaluate the performance, both datasets are split into five sets of approximately the same size, called folds. Then, we evaluate each approach five times, each time using one fold for testing and all other folds for training. It means we choose x = 0.8. Utilizing both Movielens-100k and Movielens-1M, the MAEs of a number of well known CF approaches and our proposed method are calculated. Therefore, we compare the obtained results to evaluate our approach. Figure 9 shows the yielding MAEs, as part (a) states the results on Movielens-100k and part (b) is upon Movielens-1M. Because of this, SPCF outperforms five well-known CF approaches using MAE metric. All algorithms presented in this paper are implemented using MATLAB.

MAE using various CF algorithms for
We try to compare our approach with both memory-based and model-based CF algorithms. As mentioned before, SVD and NNMF are model-based approaches while correlation-based CF is amemory-based approach. According to definition of the model-based CF described previously, we categorize the co-clustering algorithm as model-based since it generates a number of clusters from training data, whereas prediction is done according to these clusters as a model. Also, SPCF is categorized as model-based algorithms, since both single-side clustering and co-clustering steps are model-based.
The parameters of the above algorithms were chosen experimentally. In the experiments, the number of user clusters and item clusters in scalable co-clustering and the rank parameter of the SVD and NNMF algorithms were chosen to be identical as described in next sub-section. However, the number of neighbours in the correlation-based CF algorithm was set to a fixed number. Also, the number of clusters in spectral co-clustering is explained in next subsection. Since co-clustering and the NNMF algorithms only provide locally optimal results, we performed multiple runs of these algorithms and picked the best solution.
4.3.2. Scalability discussion
Most existing CF algorithms suffer low scalability issue as their calculation complexity increased quickly both in time and space when number of users or items increases. Ideally, partitioning will improve the quality of the CF predictions and increase the scalability of these systems. However, traditional clustering methods in CF usually cannot handle the large number of users and items. As mentioned before, in order to find neighbours of users or items as well as clustering them, majority of CF algorithms utilized similarity measures such as Pearson correlation. These algorithms are usually known as correlation-based CF, for which scalability is a disadvantage. This statement implies the computation that grows with both the number of users and the number of item, and in the literature, these approaches are known as non-scalable ones. On the other hand, SVD is famous for its ability to deal with high-dimensional datasets. As a result, the single-side clustering step in SPCF, using SVD, can overcome the scalability problem.
Again, once the number of users and items increase, the spectral co-clustering algorithm, as presented in Algorithm 2, cannot work efficiently. This is because of the complexity of computations increases with high-dimensional sample matrices. Therefore, clustering step can reduce the
4.3.3. Sensitivity to sparsity
4.3.3.1. Experiments with Movielens-100k
We varied the training ratio,
To focus the paper, we compared our scheme with representative algorithms from model-based and memory-based approaches including SVD, spectral co-clustering and correlation-based CF.
As mentioned before, in SVD, the r×r matrix S is reduced to have only t largest diagonal values to obtain a matrix St, t < r. In [2], using Movielens-100k, the entire process was repeated for
In the proposed approach, we compared MAEs for different number of clusters as well as co-clusters, and report the best ones. Based on obtained results, for
Additionally, spectral co-clustering results are produced by
As you see in Figure 10(a), for

Proposed approach evaluation vs. three other algorithms using MAE metric for different fraction of training set. (a) Movielens-100k, (b) Movielens-1M.
Finally, as we expect, our proposed approach outperforms co-clustering for all values of
Additionally, for
4.3.3.2. Experiments with Movielens-1M
In Figure 10(b), like Figure 10(a), we compare our approach with SVD, spectral co-clustering and correlation-based CF. However, here we apply Movielens-1M to obtain the results. As before, we ran the experiment 10 times each time choosing different training ratios and take the average to generate the plots.
In order to find the optimum value of
Towards the proposed approach, lower MAEs are acquired using
The results in Movielens-1M are somehow different with Movielens-100k. Here, for
Additionally, here the results on SVD, Spectral Co-clustering and our approach have approximately an identical difference in all ranges of
5. Conclusion and future work
Sparsity as well as scalability are two fundamental challenges in the CF framework. Sparsity refers to a situation where informative data are lacking or are insufficient. On the other hand, scalability denotes the expensive computations required by CF once the number of users and items tend to increase. In this work, we devise a two-step clustering model, SPCF, to overcome these problems. Moreover, in order to predict a user rating to a specified item, we introduce a prediction scheme for acquired model. It is important to know that the use of clustering in collaborative filtering intuitively alleviates the sparsity problem. This is because a user or item without adequate information is attached to a group of similar objects and can share the information of whole group members. Most CF methods cluster users or items using similarity measures like Pearson correlation; because of sparsity, these techniques cannot find similar objects efficiently. This is because, to calculate correlation between two users/items, these measures rely upon exact matches of them, and the other users/items do not affect the correlation value. Additionally, clustering is a useful method to increase the scalability when we are encountered with a large dataset, like E-commerce sites which sometimes offer millions of products for sale. However, the traditional clustering approaches are not usable when the numbers of users and items grow.
To overcome these problems, in this work we utilize the nature of SVD, which acquires latent features of users and items when it factors the original matrix. By applying the SVD algorithm at the first step, user and item feature matrices are made, then by performing k-means algorithm on them, user as well as item clusters are created. Co-clustering is performed at second step on user clusters and item clusters to aggregate similar ones together. Using single-side clustering prior to co-clustering and introducing an efficient prediction scheme, improve our method compared to a number of well known methods which is shown in the experimental results section. It is important to know that this is the first attempt to incorporate spectral co-clustering algorithm in a recommender system, and here we show that it can be used in this field to put related users and items together efficiently. This research opens up new research directions. The first direction is its capability to alleviate the notion of the sparsity problem that intrinsically exists in CF. Second, the SPCF offers an elegant solution for scalability problem based on dimensionality reduction that in turn leads to better performance of the model. Third, experimental results on two versions of Movielens dataset have demonstrated that the proposed method can reveal major interest of users or items in promising manner.
For future work, we plan to propose a dynamic collaborative filtering approach that can support the entry of new users, items and ratings using incremental version of proposed method. Also, we try to adapt our approach to a situation when timestamps exist in datasets.
