Abstract
As the size of websites continues to grow, current research focuses on the development of intelligent websites which facilitate the browsing by providing a navigation aid to the website users. Web page recommendation systems provide suggestions to the website users about the webpages that may be of concern to them by evaluating the collective navigation behavior of previous website users. The main motive of this study was to explore the utilization of partially ordered sequential rules (POSR) in making future predictions for website users. Sequential rules provide the association between the events that occur in a particular sequence. In this paper, two sequential rule mining algorithms, namely TRuleGrowth and CMRules have been separately used to generate sequential rules. Then the sequential rules were used to make predictions about the future interests of the users regarding webpages. The experimental results on a real life dataset have revealed that the rules generated by TRuleGrowth algorithm were able to make predictions with higher accuracy than those generated by CMRules algorithm.
Introduction
With the massive invasion of data into the World Wide Web, the discovery of the interesting patterns from the data, and further evaluating the patterns is a perplexing and fascinating job to perform [9]. Web page recommendation systems are usually implemented on web servers and make use of data obtained as a collection of web browsing sequences. The patterns extracted from the web servers can be used to predict and understand the behavior of the navigation of website’s visitors [20]. A web navigation pattern can be defined as a pathway through more than one page of the website. This pathway can be extracted from the access logs of the web server [8]. Also, a session on a website is defined as a sequence of webpages accessed by a specific user in a limited time span [9].
The procedure of automatic discovery and pattern analysis from log files is defined as Web usage mining or Web log mining [9]. Web usage mining techniques such as mining of sequential patterns [3], association rule mining [2, 17], and clustering of web pages [5, 6] discover diverse kind of user access patterns from log files of the websites, which can be retrieved and used to provide the recommendations to the users. In the retrieval process, the association rule mining [16] plays a significant method to find the strongly related elements. It captures associations among the webpages that co-occur frequently in the web navigation database [17].
Fournier-Viger et al. [14] stressed on the point that sequential rules [1] could provide a better alternative to sequential pattern mining in addressing the problem of predictions [14]. A partially ordered sequential rule is a type of sequential rule which simultaneously exists in different sequences where elements in the antecedent and consequent of the rule are not in the same order. The partially ordered sequential rules consider the confidence of sequential patterns. There are many areas where sequential rule mining has been in practice such as doing an analysis of the stock market, climate observations and e-learning [1, 19]. The previous researchers observed that the prediction accuracy of POSR can be higher than regular sequential rules [14]. In this study, recommendation of the webpages that are having a high probability to be visited next by the user have been done by using the partially ordered sequential rules. The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 presents the formal definitions of some essential terms and also describes the significant features of CMRules [14] and TRuleGrowth [11, 12] algorithms. Section 4 describes the proposed recommendation model. Section 5 presents the results. Finally, Section 6 presents conclusions and future work.
Related work
Kazienko [15] has done his work on traditional association rules. The core part of his work is to determine indirect relations prevailing between pages that infrequently occur together. The authors presented a different category of pages called transitive pages. The authors also divided indirect association rules into two categories: partial indirect association rules and complete indirect association rules.
Kim [18] developed an algorithm for mining of web data named as streaming association rule model. The model generates association rules from navigation paths in which weights are assigned to web pages according to the navigation order in user sessions [4] which tends to discover remarkable relations among the pages of websites. The authors first divide the problem into smaller parts which helps to decrease the time of scanning of the database. This directly enhances the result by increasing the prediction accuracy of the system as compared to traditional systems.
Peng [7] has presented a method in which association rules can be proficiently extracted from the log files of the websites. For processing the web log records, the author proposed the use of FP-growth algorithm which obtains patterns which are frequently accessed by users. The author uses a unique methodology of combining the browsing patterns of users and layout of the website and then finding the association rules.
Mobasher et al. [2] presents a methodology which mines association rules from clickstream data extracted from log files of websites. For saving frequent patterns, the authors present a data structure which is particularly appropriate for web recommendation systems.
Weiyang et al. [17] presented a novel collaborative recommendation technique which is designed to mine association rules. There is no need to mention the minimum support value in advance as stated in other association rule mining algorithms. As a substitute of minimum support, a specified range is mentioned for the given number of rules, and for each user the algorithm amends the minimum value of support so that rules which falls in that specified range can be extracted.
Recommendation based on partially ordered sequential rules
Preliminaries of sequential rule mining
Sequential Database. It can be defined as a large collection of logical order sequences and elements occurring in these sequences. A sequence is a well-organized list of elements and for each sequence, a unique sequence identification number is allocated which represents it uniquely. Web logs contain a sequential database consisting of navigation sequences. Sequential Rules. Sequential rules can be described in the form of A ⇒ B, where A = {x, y … z} and B = {p, q … r}, are two collections of events. For finding rules, two measures of association rule mining: support and confidence are used. The value of support measure is stated as a percentage of all the sequences that comprise the rule and value of confidence measure is stated as the probability that the pattern of the sequence B will show up after A. The rule A ⇒ B mentioned above signifies that if all events in head A occur in a particular order, then the events of body B is about to occur with a particular value of confidence measure in some specific order. The formulas of support and confidence of a rule A ⇒ B are defined as follows:
The rule A ⇒ B is said to be of size x*y if |A| = xand |B| = y. The term sids (A ⇒ B) represents the set of sequence numbers of the those navigation sequences where the rule occurs [12]. For example, sids (A ⇒ B) = {2, 3} depicts that rule (A ⇒ B) is present in sequence 2 and sequence 3. The term |S| represents the total number of sequences in the database. Partially Ordered Sequential Rule. A partially-ordered sequential rule can be denoted as A ⇒ B, where A = {x, y … z} and B = {p, q … r}, are two collections of events, such that A∩ B = ∅, A≠ ∅ and B≠ ∅. The interpretation of a sequential rule A ⇒ B is that if events in A occur, the events in B will also occur a while later in the same sequence. The order of occurrence of events of A does not matter and the same is true for events of B. It is to be noted that in a sequential rule the events of B may occur before events of A, but in partially ordered sequential rules, the events of B can occur only after events of A.
For instance, the sequence {a, b, c, f, g, e} may contain the sequential rule {a, b, c} ⇒ {e, f, g}, but it cannot contain the sequential rule {a, b, f} ⇒ {c}, because the element c did not occur afterwards f.
CMRules [14] is an algorithm which helps to find the partially ordered sequential rules. Its methodology states that from association rules one can extract the sequential rules. CMRules is considered as an extension of association rule mining algorithm. To extract sequential rules from database, the algorithm implements a two-step procedure: In the first step, sequential database is considered as a simple transactional database. The algorithm uses an association rule mining method such as Apriori technique [16] to extract all association rules instead of sequential rules. In the second step, the original database is examined to eradicate the rules that do not satisfy the condition of minimum support and minimum confidence. The resulting output is set of all sequential rules.
When the user wants to extract the association rules and sequential rules both at the same time, CMRules algorithm is preferred. Significant features of CMRules are given as follows: It can be applied on large sized databases. It works very well at low support thresholds. It can reuse the capabilities of existing association rule mining algorithms.
TRuleGrowth algorithm
TRuleGrowth algorithm [12] has been developed to discover partially ordered sequential rules. It is the extension of RuleGrowth algorithm [13]. The TRuleGrowth algorithm follows the concept of the sliding window, where a window consists of a group of successive events in a navigation sequence. A sliding window is a frame which travels from starting element of a navigation sequence to its last element. For instance, if a sequence is of size 13 and a window is of size 3 time unit, then the sliding-window can move in 15 different positions. The methodology of TRuleGrowth finds the partially ordered sequential rules which are falling within the boundary of the sliding window, that is, within the maximum number of sequential elements.
In RuleGrowth algorithm [12], first those rules are acquired which are having size 1×1 which means both head and body of the association rules consist of a single element and then iteratively expands them by checking the sequences that contain the rule of size 1×1, so that single items can be found that can enlarge first rule’s left and right sides. By following this strategy, it is ensured that only those rules will be considered in the database that are valid rules according to the algorithm. TRuleGrowth follows the same strategy along with the sliding window constraint. The steps of the TRuleGrowth are described as follows: The first step of TRuleGrowth algorithm is to examine the sequence database one time to find the sequence set (sids) for each item. The items which does not satisfy the minimum support constraint are discarded. After filtering the elements in the second step, the method considers each combination of elements one at a time of size and rules of size 1×1 are generated. For each rule, the calculation of support value is performed. If the calculated support comes out to be greater than the minimum support, then the procedures for expanding antecedent and consequent of rules are executed iteratively so that the respective parts of rule can be enlarged. In the last step, the value of confidence measure is calculated for each rule generated in the previous step. The rules with confidence value greater than the minimum confidence threshold are given out as valid partially ordered sequential rules.
It is to be noted that at every step the sliding window constraint should be satisfied. It means that all the elements in the rule should fall within the window. The above mentioned steps are repeated to find the partially ordered sequential rules. The significant features of TRuleGrowth are given as follows: It generates a much smaller number of rules. It is much faster than the other methods. It uses rule expansion to create the sequential rules.
Table 1 depicts the input parameters and output generated by CMRules and TRuleGrowth algorithms.
Proposed recommendation model
Figure 1 shows the proposed recommendation model. User sessions sequence database will be divided into two parts: training dataset and testing dataset. Partially ordered sequential rules will be extracted from training data using sequential rule mining algorithm such as CMRules and TRuleGrowth. Testing data would be used to evaluate the predictions of webpages which will be made using the recommendation model.
In the presented recommendation model, recommendations are done using partially ordered sequential rules extracted in sequential rule mining phase. As described above, the partial ordered sequential rule X ⇒ Y is an association among more than one unordered itemsets, where X is denoted as head of the rule and Y is denoted as the body of the rule. For recommending webpages to the users, first of all, a threshold value and a prediction value will be set. The threshold value defines the number of webpages already browsed by the active user. The threshold value represents the length of subsequence of testing data sequence which will be matched with the head of the rule. Every sequence present in the testing data will be matched with head of each rule. For example, if the threshold value is set at 3, then first three pages of a testing sequence will be used to make recommendations.
While iterating through all the partially ordered sequential rules, if a match occurs in any of the rules, then the elements in the body of that rule will be the set of predicted webpages. The prediction value plays a major role. It will define the total number of predictions which have to be made from the rule body. For example, a prediction value of 3 means that the system is making recommendations for the next three pages that may be traversed by the user.
Pseudo code
In this study, the proposed algorithm which has been used to make recommendations from partially ordered sequential rules is given below.
Experimental evaluation
Dataset used
The dataset which has been used to conduct the experiments is the MSNBC [10] dataset. It has been extracted from the web logs of msnbc.com and msn.com. The sequences of database consist of click-stream data. In the dataset, there are 31,790 numbers of sequences and distinctive item count in the dataset is 17. In one sequence the average number of itemsets is 13.33. For training and testing data, the whole data set was divided into a ratio of 70:30. First seventy percent sequences of data set were used as training data and remaining thirty percent were used as testing data.
Evaluation metric
In order to evaluate the performance of the proposed recommendation system, a standard measure named accuracy (or precision) was used. Accuracy is a recommendation measure which finds the percentage of accurate recommendations generated by the system. After considering each page in the respective testing user session, the recommendation system produces a recommendations set R(p). To calculate the value of the accuracy measure, the set R(p) is compared with the set T(p) which contains the webpages actually accessed by the user The formula for accuracy is given in Equation (3) below.
Table 2 depicts the characteristics of the system configuration used in the experiments.
Evaluation procedure
In the proposed recommendation model, the algorithms, TRuleGrowth and CMRules which extract partially ordered sequential rules from a sequence database have been used. The set of partially ordered sequential rules is fed to the recommendation module which picks each sequence from the testing dataset one by one and produce recommendations for it. The recommendation set generated for each sequence is denoted as T(p). For each testing sequence, the accuracy value is calculated and the final accuracy value is the average calculated over accuracy values of all the sequences. To evaluate the performance of TRuleGrowth and CMRules algorithms, the accuracy values generated by these algorithms at different input are compared with each other.
Choice of parameter values
To determine the optimal values of each input parameter, many experiments were performed with different values of the parameters and the combination of values which gave sufficient number of rules were used finally. Table 3 shows the parameter values used in the experiments.
Results and discussions
Figures 2 to 4 show the accuracy in percentage for varying number of recommended pages (prediction value) when CMRules and TRuleGrowth algorithms were applied on MSNBC dataset. It was assumed that at least two webpages accesses by the user are needed to make predictions. Also, the results in Figs. 2–4 were generated with threshold values 2, 3 and 4, respectively.
The significant observations from the results are given as follows: It can be observed that in case of both the algorithms, as the number of recommended pages increases, the accuracy also increases. It makes it clear that more prior information of webpages accessed by the users will give more accuracy. The accuracy achieved by TRuleGrowth is always higher than that achieved by CMRules. The use of sliding window constraint by TRuleGrowth might be the reason for it. The TRuleGrowth generates much stronger and significant rules that CMRules method. Due to the use of sliding window constraint, the number of rules generated by TRuleGrowth is much smaller than the number of rules generated by CMRules. The reason behind the lesser accuracy of CMRules may be the generation of a large number of incompetent rules.
Conclusions and future works
Some earlier studies have concentrated on the development of web page recommendation systems based on concepts such as clustering, association rules, and frequent pattern mining, but none of the studies have used sequential association rules for this purpose. Navigation sequences are more closely related to sequential rules as these rules are interested in making an association between events that occur in a particular order. In this study, two sequential rule mining algorithms, TRuleGrowth and CMRules, have been used to make webpage predictions and a comparison between these two methods have shown that the accuracy values of TRuleGrowth method is more than CMRules based method. In future, other sequential rule mining algorithms should be used in making webpage predictions. Future work can also focus on making the comparison between the sequential rule mining techniques and other techniques available for web page recommendation.
