Abstract
This article presents a scalable and optimized recommender system for e-commerce web sites to maintain a better customer relationship management and survive among its competitors. The proposed system analyses the clickstream data obtained from an ecommerce site and predicts the preference level of the customer for the products clicked but not purchased using efficient classifiers such as decision trees, artificial neural networks and extended trees. Collaborative filtering technique is used to recommend products in which similarity measures are used along with efficient rough set leader clustering algorithm which helps in making accurate and fast recommendations. To determine the effectiveness of the proposed approach, an experimental evaluation has been done which clearly depicts the better performance of the system as compared to conventional approaches.
Keywords
Introduction
Recommender systems (RS) [6] are interactive knowledge based systems that finds out preferences of users on a commercial website on the basis of information acquired explicitly (ratings and queries answered) or implicitly [3] (browsing behaviour, likes, previous purchases made, applications downloaded, books read and websites visited). For a commercial website, maintaining a good relationship with a customer is an essential part of business strategy to set the company apart from other competitors. RS are utilized in a variety of areas such as movies, news, books, websites, research articles and products to maintain customer relationship management (CRM) strategy.
Conventional RS deals with binary purchase data where only two outputs are possible: whether the product is purchased (1) or not (0). However this does not depicts the preference of a user for a product not purchased as many other reasons affects the purchase such as budget, low on time etc. Thus the study done in this article is based on implicit details of user. Click stream path [2, 10] of a user is traced and strong attributes are identified that justify the preference of a user for a particular product. Experiments carried out clearly shows that results are better than binary purchase data. The system provides a framework to extract implicit details and finds correlation between like-minded users. A cluster of similar users is formed and preference of the product not viewed by target user is predicted numerically.
The proposed work performs the following task: Initially the raw data is processed to extract the strong attributes. These attributes have impact on the purchase of the products such as reading time spent on the product, number of times product is visited, whether product is previously purchased or not, time, product webpage is visited directly or user came through browsing the website etc. After the attributes are identified products are categorized according to their preferences. Products placed in cart have low preference than products that are purchased. Also products that are viewed but not placed in cart have low preferences than former one. Similarly products that are not even seen have the least priority for the user.
To illustrate the effectiveness and performance of RS, Experiments are carried out on click stream data [10] of a commercial website of Europe. F1 value is used as a metric that clearly shows proposed RS is giving better results than that of conventional ones.
The rest of the article is divided as follows. Section 2 presents the related work. In Section 3 details of the proposed RS is elaborated and experiments are carried out. Section 4 gives the experimental results followed by conclusion and future work in Section 5.
Related work
Conventional RS work well where customers show their preferences for specific products in an explicit way. These user preferences are represented in utility matrix that consists of n users and m items. This matrix is generally sparse as many products are not rated by users. Each cell corresponds to the preference of user i for product j. Blank cell represents that products are not rated by the user. The task of RS is to assign some priority to these unrated products on the basis of similarity among users and products. The data collected above is in explicit form as it is totally dependent on user feedback. Although it may not be always accurate as it is also affected by user behaviour and contextual situations. In e-commerce websites, users surfing behaviour is traced to get the insight of users interests. This weblog data is pre-processed to find users interest for various products on the basis of purchases made, time spent on each webpage, number of visits and so on. The most common information filtering approaches used in RS can be categorized as collaborative filtering (CF) [1, 12], content based filtering (CBF) [6, 17] and hybrid approaches [11, 13].
In CF approach, similar users are identified by the ratings given by them for a product using measures such as Pearson correlation (PC) [14], constrained Pearson correlation (CPC) [17] and cosine vector similarity [8]. After that top-n items are recommended that are preferred by like-minded users. Major steps followed in CF approach are: A user profile is constructed from the feedback/rating information given by him/her on each product. Like-minded users are clustered together that have same interest as that of target user. This similarity can be calculated using the measures such as PC, CPC, cosine vector or distance-based similarity. Ratings given by these clustered users for an item are then used to predict the preference of target user. These ratings can be predicted using measures such as average rating or weighted sum. The ratings can also be predicted using probabilistic model or machine learning model that is built by large set of rating sample.
Although the approach is very commonly used, it suffers from few limitations such as New user problem [7, 18]: The system faces difficulty in recommending items to users who have never rated any items in past. New item problem [9, 18]: Similarly items which are new in market are recommended less as they are never rated before. Sparsity problem [6]: If items are not sufficiently rated by users then utility matrix becomes sparse which leads to poor recommendations.
CBF approach suggests items based on high similarity between them and recommend highly rated items [4]. A similarity between user and item profile is calculated and top-n items with high similarity score are recommended i.e. they suggest items based on high similarity between items to recommend and items already purchased. Major steps followed in CBF approach are as follows. An item profile is constructed based on item attributes preferred by users. A content-based user profile is built from set of features of items that each user purchased. Similarity between user and item profiles is calculated using similarity measures. Top-n items with high similarity scores are recommended.
CBF also suffers from certain limitations such as Insufficient feature problem [17]: Item profiles are incomplete as it is not easy to obtain sufficient number of features. Over – Specialization Problem [6]: Recommended items are similar to the items purchased in past. Unusual user problem [6, 16]: Users whose purchasing behaviour is not uniform get poor recommendations.
Hybrid filtering approach [5, 15] combines the features of CF and CBF to remove limitations of said approaches and gives more accurate recommendations [18].
Proposed approach
Dataset
The data set selected is of a big retailer in Europe that sells books, toys, clothes etc. online and contains one month detail of users who visited the site and products they viewed. Over 10 million users visited the website and viewed 62546 products. The data is pre-processed and we find preferences of those users who have clicked more than 70 products. Thus a total of 57540 users are selected to find preferences for 4300 products.
Phases of proposed algorithm
Phase I: Implicit details of users are extracted from the click stream path followed by them and strong attributes are identified. Phase II: Classification is performed to find preference of the products that are neither purchased nor placed in cart. Phase III: Clustering is performed to group users that have similar taste in product brand due to which finding similar neighbours of target user for the commonly clicked product becomes less time consuming. Phase IV: CF is performed to predict the preference of a product that is not viewed by the target user.
In first phase, click stream path of user is processed to extract the attributes that justify customer interest for a particular product. Attributes such as time spent on reading product details, whether product webpage is directly searched for or it is traced during browsing, number of times that product is visited, whether that particular page is bookmarked or printed, whether the product is placed in basket or it is purchased or not are identified. Pre-Processing is performed and attributes identified are justified. Table 1 shows the probability of purchase of a product after its cart placement as 85.5 (=562/657). The value clearly justifies that cart placed product is having higher probability of purchase than those products that are not placed in cart.
Purchase probability after cart placement
Purchase probability after cart placement
Table 2 shows the relationship among the products that are predetermined to be purchased or user get to know about the product while browsing the website. The probability of products directly visited (0.23 = 443/1890) has higher probability of purchase than those products which are browsed (0.04 = 119/2410) by chance.
Purchase probability after searching
Table 3 presents the no. of visits made to the products. If the product is highly visited then it has higher chance of purchase (0.38 > 0.33 or 0.08).
Purchase probability justifying #visits
If time spent on a particular product webpage is more as compared to others then it has higher probability of purchase. Table 4 presents the desired result which is justified using t-test as reading time is captured in seconds which is considered as continuous attribute. To verify the hypothesis that higher the time taken to read a product, the more probability of purchase is justified as difference between mean of purchase and not purchase is significant at 5%. Thus a longer reading time indicate higher chances of purchase made.
Reading time: Result of t– Test (Significance Level = 0.05)
Finally the strong attributes recognized from our click stream data set are shown in Table 5.
Strong attributes along with their descriptions
In second phase, Decision Tree (DT), Artificial Neural Networks (ANN) and Extra-Tree (ET) classifiers are applied to find the preferences of product that are viewed at least once but not placed in the cart. It is obvious that these products have less probability of purchase than products that are purchased or that are placed in the cart. Table 6 shows the comparison among three classifiers. The classifiers were trained in 4 folds (A1, A2, A3, A4) and accuracy of each fold is calculated. Based on average accuracy, the choice of classifier is made. Clearly results show that DT is the best classifier for the processed dataset.
Comparison of classifiers
In third phase, utility matrix is updated as preference of product that is neither purchased nor placed in cart is identified. Sparseness of utility matrix is now filled with predicted preferences. The task now is to predict the preferences of the products that are unseen by the user. For this purpose like-minded users are considered for prediction. Since the dataset is vast, a need arise to first cluster similar users based on an attribute that is having higher interest (product brand) as compared to others and then neighbourhood formation is done within the cluster. Clustering is performed using Rough Set Leader Clustering. Rough Set Leader clustering finds out all the leaders in the dataset in a single iteration. Initially a random user is selected as a leader. Then each user’s similarity is checked with the leaders, if the similarity is greater than a threshold, the user is added to that corresponding leader’s cluster, otherwise it will be a new leader. This is repeated until all users are covered. For better clustering, rough set properties are used in collaboration with Leader clustering algorithm. According to which the following 3 rough set properties must be satisfied while clustering the data: A user can not be a part of more than one upper bound. Lower bound is contained in the Upper bound. A user can not belong to only one Upper bound.
Upper bound of a set X is the region which holds the elements that may belong to X. Lower bound of X contains the elements that certainly belong to X.
To satisfy the above properties, two thresholds are defined: General Threshold (GT): Controls the assignment of a customer to a cluster. Rough Threshold (RT): Controls the overlapping between clusters.
The thresholds must be carefully determined for the dataset being used to control similarity of users in each cluster and overlapping between the clusters.
In last phase, we need to identify the preferences of the products that are not even viewed once. CF is involved in the formation of nearest neighbours that helps to find like-minded users for target user. The measures used in predicting neighbours are Pearson Correlation (PC) as in Equation (1), Constrained Pearson Correlation (CPC) as in Equation (2), and Cosine Vector (CV) as in Equation (3). Pearson Correlation (PC): PCa,b find the similarity among user a and user b. Prefa,j and Prefb,j are preferences of user a and user b for clicked product j respectively. Also
Constrained Pearson Correlation (CPC):
CPCa,b is same as PCa,b except that it is based on the value m that is the mid of preferences. Since we are considering the normalized preferences in our experiment, the value of m is taken as 0.5. Cosine Vector (CV):
Cosine Vector is computed by measuring the angle between two vectors. Here the two vectors are preferences of user a and user b. Proximity value is calculated by considering their preferences for commonly clicked product j. Experiments justifies that CPC gives better correlation among the rest.
After determining the similarity among users, the next task is to proximate the preferences of the products that are unseen by the target user and either clicked, placed in cart or purchased by the like-minded users.
A neighbourhood formation value k is varied to be 5, 7 and 10 in the experiments. NBa,j denotes the neighbourhood of user a whose preferences for product j are calculated in above phases. To calculate the preference of user a for product j, weighted average for that product is calculated. When PC is used as proximity measure, user a preference for product j is calculated by Equation (4).
Similarly, when CPC is used as proximity measure then weighted average of preferences is calculated by Equation (5).
Utility matrix is completed by computing the preferences in all the above phases and top-N products are recommended to the users of the e-commerce website. The number of recommended products is varied from 5 to 20 in increments of 5 to measure its effect on accuracy of recommendation.
Evaluation measures such as ‘recall’ and ‘precision’ are used to test the accuracy of RS. These methods are commonly used in the field of information retrieval which are described in Equations (6 and 7).
Where
When the # of hidden products increases, recall also increases but precision decreases as the two measures are inversely related. Therefore a combined measure F1 is used whose higher value indicates a better accuracy of the designed RS. The harmonic mean F1 of recall and precision is given by Equation (8).
The algorithm for the proposed technique is defined in Fig. 1.

Calculating preferences using proposed method.
F1 values: Conventional VS proposed approach
The experiments are carried out in Python in which sklearn.ensemble library is used. The dataset is pre-processed in MATLAB R2013a and strong attributes are identified as in Table 5. Data is randomly divided into two sets. Training data consists of 70% and rest of the data is treated as test set. To classify the products that are viewed by the user but neither purchased nor cart placed, three classifiers are used. As shown in Fig. 2(a), DT classifier outperforms the rest of the two.

Comparison of precision, recall and F1 values for different similarity measures.
Figure 2(b-d) shows that among three similarity measures, CPC outperforms rest of the two. Thus to find correlation among like-minded users CPC is used to form neighbourhood of the target user. Table 7 clearly depicts that CPC outperforms PC and CV proximity measures in almost all the cases. Also proposed approach gains higher F1 values by clustering the users having similar interest in product brands than those in the conventional approach. Also grouping like-minded users and finding preferences for the product saves enormous amount of time as the data set is very vast.
The approach used in this article utilizes the clickstream data and predicts the preferences for the products clicked by the customer using machine learning classifiers. For the products not clicked, CF is used to predict their preferences. For the same, proximity measures such as PC, CPC and CV are used to find similarity between users. Rough set leader clustering is used to cluster similar customers having interest in a specific brand. Finally weighted average of the customers in a cluster is used to calculate the preference of the products not clicked by the target user. Using the preference matrix, top-N preference products are recommended to the user. For the analysis and evaluation of the recommender system, F1 values are analysed. From the analysis, we observed that the classifiers and clustering algorithm used outperform the ones used in the conventional approach.
Further data used is very vast and recommendations are made only for users who have clicked more than 70 products. Preferences for users who have spent very less time on e-commerce site and clicked very few products need to be calculated more accurately and proximity measures defined in the experiment gives low similarity among peer users.
