Fuzzy c-Least Medians clustering for discovery of web access patterns from web user sessions data

Abstract

Mining the web usage data of e-business organizations is essential for understanding clients’ web utilization patterns, which can help these businesses arrive at vital business decisions. Because of the non-deterministic web access behavior of web clients, web user session data is usually noisy and imperfect. Such imperfection has a negative impact on the pattern discovery process. A major issue with the prevalent Fuzzy c-Means (FCM) and Fuzzy c-Medoids (FCMdd) methods is that they are not robust against noise, because a single outlier object can lead to a very different clustering result. In this research we propose a robust Fuzzy c-Least Medians (FCLMdn) clustering framework to deal with user session data contaminated with noise and outlier user session objects, with the objective of improving the quality of the extracted patterns. To deal with the high dimensionality of user session data, which may contain noise and outliers, a fuzzy set theoretic approach for assigning fuzzy weights to user sessions and their associated URLs is proposed. Our results clearly indicate that the quality of the user session clusters formed using the FCLMdn algorithm is much better than that of the clusters formed using the FCM and FCMdd algorithms in terms of various cluster validity indices.

Keywords

Fuzzy clustering, Fuzzy c-Least Medians clustering, web usage mining

1. Introduction

The “Web User Session Clustering” is articulated as the automatic discovery and analysis of usage clusters from web user session data as a result of user interactions with web resources on one or more web sites [1]. The discovered patterns are usually represented as collections of web resources that are frequently accessed by groups of users having common interests [2].

There are several challenges associated with user session clustering from unlabeled, semi-structured and noise-contaminated web user session data [3]. Due to human interactions and the non-deterministic browsing patterns of various web users, user session data may be incomplete or may involve noise and outliers [4]. Such data, which suffers from ambiguity and vagueness, does not have crisp boundaries and often contains overlapping clusters [5]. Fuzzy clustering is therefore a more suitable technique than the traditional hard or crisp clustering techniques for the discovery of web usage models from web usage data [6].

One of the best known fuzzy clustering techniques is Fuzzy c-Means (FCM) clustering [7], which was first proposed by Dunn and later modified by many authors [8, 9]. The FCM algorithm uses an objective function that is minimized while partitioning the data objects. The algorithm computes the cluster centers and assigns to each user session a membership value in the range 0 to 1 for every cluster. Another category of fuzzy clustering algorithm, known as the Fuzzy c-Medoids (FCMdd) algorithm [10], extends the Hard c-Medoids algorithm by incorporating the fuzzy set concept to produce fuzzy clusters [11].

One of the major problems associated with the FCM and FCMdd based user session clustering algorithms, which try to minimize a sum of squared errors type objective function, is that they are not robust against noise-contaminated user session data. A single outlier or noise user session object could lead to a very different clustering result [12]. In this research work we propose a Fuzzy c-Least Medians (FCLMdn) based user session clustering algorithm to deal with session data contaminated with noise and outlier user session objects. Unlike the FCM and FCMdd algorithms, which minimize the sum of the squared errors, the FCLMdn algorithm tries to minimize the median of the squared error, i.e. the median of the Euclidean distances of the $m$ user sessions from their respective cluster medoids.

Figure 1 depicts an overview of the FCLMdn based framework for the discovery of web user session clusters. The steps involved in the proposed framework are i) fuzzy weight assignment to user sessions, ii) discovery of user session clusters using the proposed FCLMdn clustering technique and iii) evaluation of the discovered clusters using various fuzzy validity indexes.

Figure 1.

Fuzzy c-Least Medians based discovery of user session clusters.

The rest of the paper is organised as follows. Section 2 provides a brief overview of the related work in the field of web usage mining. Section 3 provides a brief discussion of the methodology used to assign fuzzy weights to web user sessions. Section 4 presents the mathematical formulation and algorithmic details of the FCM, FCMdd and proposed FCLMdn based user session clustering. Section 5 discusses the validity indexes utilized for the quantitative measurement of the quality of the discovered fuzzy clusters. Section 6 deals with the experimental results and the related discussions. Finally, conclusions are presented in Section 7.

2. Related work

Clustering techniques have been applied to user session data in order to discover user session clusters representing similar URL access patterns. “Clustering aims at dividing the data set into clusters where inter-cluster similarities are minimized while the intra cluster similarities are maximized”. A detailed description of various clustering methodologies is provided in [13]. The most widely used clustering methods are partitioning based, which partition the given data objects into $c$ clusters. Numerous clustering algorithms have been used for mining web logs in order to group user sessions based on how similar their URL access patterns are [14]. In [15, 16] the authors provide a crisp clustering based framework for web session data clustering, including a comparative study of various crisp clustering techniques. Web user session clustering has been applied in a wide range of applications. Those most widely reported in the literature are described below:

• Web personalization: The goal of personalization is to provide users with what they need without them asking for it explicitly [17]. Web personalization implies the delivery of dynamic content, such as text, advertisements or product recommendations, according to the needs or interests of users [18]. Making dynamic recommendations to a web user based on the user’s profile is very useful for many e-commerce sites [19, 20].
• Web site design: In order to design a complex web site, the designer must anticipate the users’ needs and structure the site accordingly. Users’ needs may also change over time, and their usage patterns may violate the designer’s initial expectations [21]. Understanding user needs requires understanding how users view the data available and how they actually use the site [22, 23].
• System improvements: Web usage mining provides the key to understanding web traffic behavior, which can be utilized to decide strategies for web caching, load balancing and data distribution [24, 25].
• Business intelligence: Due to intense competition, the business community has realized the necessity of intelligent marketing strategies and relationship management [26]. Web usage mining attempts to discover useful knowledge from the data obtained from the interactions of users with the Web [27].

Due to human interactions and the non-deterministic browsing patterns of various Web users, user session data may be incomplete, may involve noise, and may suffer from ambiguity and vagueness [5]. Fuzzy clustering is a more suitable technique than the traditional hard or crisp clustering techniques for the discovery of web usage models from web usage data [28]. Fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete or noisy data, mixed media information and human interaction [29]. Fuzzy set based clustering of user session data partitions a set of user sessions into clusters, where each user session may belong to several clusters with different degrees of membership. This enables the clusters to grow into their natural shapes [30]. A membership value of zero indicates that the user session is not a member of that cluster. A non-zero membership value shows the degree to which the user session represents a cluster [31].

For the discovery of user session clusters, the most popularly used fuzzy clustering methods are the Fuzzy c-Means and Fuzzy c-Medoids clustering algorithms. The Fuzzy c-Means algorithm [32] performs fuzzy clustering in such a way that a given user session may belong to several clusters, with the degree of belongingness specified by membership grades between 0 and 1. It uses an objective function that is minimized while partitioning the user sessions. The objective function of Fuzzy c-Means represents the weighted sum of distances between the user sessions and their cluster centers. The Fuzzy c-Medoids (FCMdd) algorithm [11] introduces a medoid for each cluster, which is the most centralized user session object representing that cluster [33]. The objective function of the FCMdd method is similar to that of FCM, with the only difference being that it utilizes cluster medoids instead of cluster means.

FCM has been used for the discovery of usage patterns from web log data [34, 35]. Variants of Fuzzy c-Means that have been applied in the domain of web usage clustering include [36, 37, 38, 39, 40, 41]. The FCMdd method has been used for user session clustering in [42]. The major problem associated with these algorithms is that they are not robust against noise-contaminated user session data. A single outlier or noise user session object could lead to a very different clustering result [43]. In order to provide robustness against noise, a Fuzzy c-Least Medians (FCLMdn) based user session clustering algorithm is proposed that deals with user session data contaminated with noise and outlier objects. The Fuzzy c-Least Medians algorithm tries to minimize the median of the squared error, i.e. the median of the Euclidean dissimilarities of user sessions from their cluster centers.

3. Fuzzy weight assignment to web user sessions

Web user session data is extracted by preprocessing and transforming the unlabeled, semi-structured textual raw web log data into a set of numeric user session vectors [44]. Preprocessing utilizes a variety of algorithms and heuristic techniques to perform tasks such as data cleaning, user identification and session identification. Data cleaning eliminates irrelevant and extraneous URL references and those generated by spider navigation [45]. User identification is performed to identify the users corresponding to the web log requests [46]. User session identification segments the requests corresponding to individual users into sessions, each representing a single visit to the site [47]. In [48, 49, 50], the authors provide details of the various techniques utilized for preprocessing web log data.

Let $\mathscr{U}\leftarrow\{u_{1},u_{2},\ldots,u_{n}\}$ be the set of all the URLs appearing in the preprocessed log, where $n$ represents the number of URLs. Let $\mathscr{S}=\{s_{i}|i=1\ldots m\}$ be a set of $n$ -dimensional user session vectors where $m$ is the number of user sessions. Each user session $s_{i}$ takes the form:

$s_{i}\leftarrow\{s_{i}^{1},s_{i}^{2},\ldots,s_{i}^{n}\}\leftarrow\{\mathscr{A}_{i}^{u_{1}},\mathscr{A}_{i}^{u_{2}},\ldots,\mathscr{A}_{i}^{u_{n}}\}$

where each $s_{i}^{k}\leftarrow\mathscr{A}_{i}^{u_{k}}$ represents the number of times $u_{k}$ is accessed during the session $s_{i}$. $\mathscr{S}$ may be represented in the following matrix form

$\mathscr{S}[m\times n]=\begin{pmatrix}s_{1}^{1}&s_{1}^{2}&\ldots&s_{1}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ s_{m}^{1}&s_{m}^{2}&\ldots&s_{m}^{n}\end{pmatrix}$ (1)
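As an illustration of how the matrix in Eq. (1) could be populated, the following minimal Python sketch counts URL accesses per session. The input layout (one list of requested URLs per session), the function name `build_session_matrix` and the toy URLs are assumptions made for this example and do not come from the preprocessing pipeline used in this study.

```python
def build_session_matrix(session_accesses):
    """Return (urls, matrix), where matrix[i][k] counts accesses of urls[k] in session i."""
    # Column order: sorted list of all distinct URLs (an arbitrary but reproducible choice).
    urls = sorted({u for accesses in session_accesses for u in accesses})
    index = {u: k for k, u in enumerate(urls)}
    matrix = []
    for accesses in session_accesses:
        row = [0] * len(urls)
        for u in accesses:
            row[index[u]] += 1  # s_i^k: number of times u_k is accessed in session s_i
        matrix.append(row)
    return urls, matrix


# Hypothetical toy log: each inner list is the URL sequence of one session.
toy_sessions = [
    ["/home", "/courses", "/home"],
    ["/library"],
    ["/home", "/library", "/courses"],
]
urls, S = build_session_matrix(toy_sessions)
for row in S:
    print(row)
```

Each row of `S` is one $n$-dimensional session vector $s_{i}$, and each column corresponds to one URL $u_{k}$.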

The number of URL items appearing in the access logs could be in the thousands. Filtering the logs by removing references to low session support URLs (i.e. those that are not supported by a specified number of user sessions) can provide an effective dimensionality reduction method while improving clustering results. For this purpose, fuzzy weights are assigned to all the URLs using a membership function based on their user session support count. All URLs with a session support count lower than the lower threshold $\alpha_{l}$ are assigned the weight 0. All URLs with a session support count higher than the upper threshold $\alpha_{h}$ are assigned the weight 1. The remaining URLs, with a session support count between $\alpha_{l}$ and $\alpha_{h}$, are assigned a weight between 0 and 1 using a fuzzy membership function. We utilize a linear fuzzy membership function (LFMF) and the standard S-shaped fuzzy membership function (SFMF) to assign the weights to the URL items. Let $\mathscr{X}_{u_{k}}$ be the session support count of the URL $u_{k}$ as given below:

$\mathscr{X}_{u_{k}}\leftarrow|\{s_{i}|\mathscr{A}_{i}^{u_{k}}>0\}|$ (2)

The fuzzy weight $\mathscr{W}_{u_{k}}$ assigned to the URL $u_{k}$ using LFMF and SFMF is given in Eqs (3) and (4) respectively.

$\mathscr{W}_{u_{k}}\leftarrow\begin{cases}0,&\mathrm{if}\>\mathscr{X}_{u_{k}}\leqslant\alpha_{l}\\ 1,&\mathrm{if}\>\mathscr{X}_{u_{k}}\geqslant\alpha_{h}\\ (\mathscr{X}_{u_{k}}-\alpha_{l})/(\alpha_{h}-\alpha_{l}),&\mathrm{if}\>\alpha_{l}<\mathscr{X}_{u_{k}}<\alpha_{h}\end{cases}$ (3)

$\mathscr{W}_{u_{k}}\leftarrow\begin{cases}0,&\mathrm{if}\>\mathscr{X}_{u_{k}}\leqslant\alpha_{l}\\ 2{\left((\mathscr{X}_{u_{k}}-\alpha_{l})/(\alpha_{h}-\alpha_{l})\right)}^{2},&\mathrm{if}\>\alpha_{l}\leqslant\mathscr{X}_{u_{k}}\leqslant(\alpha_{l}+\alpha_{h})/2\\ 1-2{\left((\mathscr{X}_{u_{k}}-\alpha_{h})/(\alpha_{h}-\alpha_{l})\right)}^{2},&\mathrm{if}\>(\alpha_{l}+\alpha_{h})/2\leqslant\mathscr{X}_{u_{k}}\leqslant\alpha_{h}\\ 1,&\mathrm{if}\>\mathscr{X}_{u_{k}}\geqslant\alpha_{h}\end{cases}$ (4)
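The two membership functions in Eqs (3) and (4) can be sketched in a few lines of Python. The function names `lfmf` and `sfmf` and the threshold values used below are assumptions chosen purely for illustration; they are not the settings reported for the experiments.

```python
def lfmf(x, low, high):
    """Linear fuzzy membership function, following Eq. (3)."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def sfmf(x, low, high):
    """Standard S-shaped fuzzy membership function, following Eq. (4)."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    midpoint = (low + high) / 2
    if x <= midpoint:
        return 2 * ((x - low) / (high - low)) ** 2
    return 1 - 2 * ((x - high) / (high - low)) ** 2


# Illustrative thresholds (assumed values, not taken from the experiments).
alpha_l, alpha_h = 1, 6
for support in range(1, 7):
    print(support,
          round(lfmf(support, alpha_l, alpha_h), 2),
          round(sfmf(support, alpha_l, alpha_h), 2))
```

Compared with the linear function, the S-shaped function assigns smaller weights just above the lower threshold and larger weights just below the upper threshold, so weakly supported URLs are damped more strongly.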

Since very small sized sessions may represent the noise present in the data, weights are assigned to all the sessions using a fuzzy membership function based on the number of URLs accessed by those sessions. Let $\mathscr{Y}_{s_{i}}$ be the number of URLs accessed in session $s_{i}$ .

$\mathscr{Y}_{s_{i}}\leftarrow|\{u_{k}|\mathscr{A}_{i}^{u_{k}}>0\}|$ (5)

Let $\beta_{l}$ and $\beta_{h}$ represent the lower and upper thresholds on the URL count respectively. The fuzzy weight $\mathscr{W}_{s_{i}}$ assigned to the session $s_{i}$ using LFMF and SFMF is given in Eqs (6) and (7) respectively.

$\mathscr{W}_{s_{i}}\leftarrow\begin{cases}0,&\mathrm{if}\>\mathscr{Y}_{s_{i}}\leqslant\beta_{l}\\ 1,&\mathrm{if}\>\mathscr{Y}_{s_{i}}\geqslant\beta_{h}\\ (\mathscr{Y}_{s_{i}}-\beta_{l})/(\beta_{h}-\beta_{l}),&\mathrm{if}\>\beta_{l}<\mathscr{Y}_{s_{i}}<\beta_{h}\end{cases}$ (6)

$\mathscr{W}_{s_{i}}\leftarrow\begin{cases}0,&\mathrm{if}\>\mathscr{Y}_{s_{i}}\leqslant\beta_{l}\\ 2{\left((\mathscr{Y}_{s_{i}}-\beta_{l})/(\beta_{h}-\beta_{l})\right)}^{2},&\mathrm{if}\>\beta_{l}\leqslant\mathscr{Y}_{s_{i}}\leqslant(\beta_{l}+\beta_{h})/2\\ 1-2{\left((\mathscr{Y}_{s_{i}}-\beta_{h})/(\beta_{h}-\beta_{l})\right)}^{2},&\mathrm{if}\>(\beta_{l}+\beta_{h})/2\leqslant\mathscr{Y}_{s_{i}}\leqslant\beta_{h}\\ 1,&\mathrm{if}\>\mathscr{Y}_{s_{i}}\geqslant\beta_{h}\end{cases}$ (7)
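To make the counting in Eqs (2) and (5) concrete, the sketch below derives the session support counts and per-session URL counts from a small session matrix and then applies the linear weighting of Eq. (6). The toy matrix and the helper names are assumptions for illustration; the thresholds $\beta_{l}=1$ and $\beta_{h}=4$ are used here only as example values.

```python
def url_support_counts(S):
    """X[k]: number of sessions that access URL k at least once (Eq. 2)."""
    n = len(S[0])
    return [sum(1 for row in S if row[k] > 0) for k in range(n)]


def session_url_counts(S):
    """Y[i]: number of distinct URLs accessed in session i (Eq. 5)."""
    return [sum(1 for value in row if value > 0) for row in S]


def linear_weight(x, low, high):
    """Linear fuzzy membership function shared by Eqs (3) and (6)."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


# Hypothetical 4 x 4 session matrix (rows: sessions, columns: URLs).
S = [
    [1, 2, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 3],
    [2, 0, 1, 0],
]
X = url_support_counts(S)
Y = session_url_counts(S)
beta_l, beta_h = 1, 4
session_weights = [linear_weight(y, beta_l, beta_h) for y in Y]
print(X, Y, session_weights)
```

A session that touches only one URL receives weight 0 under these thresholds and therefore contributes nothing to the weighted distances, which is how very small sessions are suppressed as probable noise.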

Table 1 provides brief descriptions of various mathematical symbols and terms used in this article.

Table 1

Description of the symbols used in this study

Notation	Description
$\mathscr{S}$	Set of user sessions extracted from the web logs
$m$	Number of user sessions
$s_{i}$	$i$ -th user session
$c$	Number of clusters
$\mathscr{V}$	Set of cluster centers
$g_{j}$	$j$ -th cluster
$v_{j}$	$j$ -th cluster center
$\mathscr{U}$	Set of URLs in the preprocessed web logs
$n$	Number of URLs
$u_{k}$	$k$ -th URL
$s_{i}^{k}$	$k$ -th URL feature value of $i$ -th user session
$v_{j}^{k}$	$k$ -th URL feature value of $j$ -th cluster center
$\mathscr{W}_{s_{i}}$	Fuzzy weight of user session $s_{i}$
$\mathscr{W}_{u_{k}}$	Fuzzy weight of $k$ -th URL
$\alpha_{l}$	Lower threshold on the session support count of a URL
$\alpha_{h}$	Upper threshold on the session support count of a URL
$\beta_{l}$	Lower threshold on the no. of URLs accessed in a session
$\beta_{h}$	Upper threshold on the no. of URLs accessed in a session
$\mathscr{X}_{u_{k}}$	Session support count of the URL $u_{k}$
$\mathscr{Y}_{s_{i}}$	Number of URLs accessed in session $s_{i}$
$\mathscr{M}$	$m\times c$ fuzzy partition matrix
$\mu_{ij}$	Membership grade of user session $s_{i}$ in cluster $g_{j}$
$q$	Fuzziness index
$\mathscr{E}^{2}(s_{i},v_{j})$	Squared Euclidean distance between session $s_{i}$ and center $v_{j}$
$\mathscr{E}(s_{i},v_{j})$	Manhattan distance between session $s_{i}$ and center $v_{j}$
$\mathscr{J}_{\textit{FCM}}$	Objective function of FCM
$\mathscr{J}_{\textit{FCMdd}}$	Objective function of FCMdd
$\mathscr{J}_{\textit{FCLMdn}}$	Objective function of FCLMdn
$\mathscr{D}_{i}$	Dissimilarity of $s_{i}$ with respect to all the cluster centers
$\mathscr{D}_{i:m}$	$i$ -th term after arranging all dissimilarity terms in ascending order
$s_{i:m}$	User session corresponding to the dissimilarity terms $\mathscr{D}_{i:m}$
$\mu_{ij:m}$	Membership value of user session $s_{i:m}$ in cluster $g_{j}$
$t$	Iteration number
$\eta$	Maximum no. of iterations
$\epsilon$	Error threshold
XB	Xie Beni cluster validity index
$\delta_{\min}^{2}$	Square of the minimum Euclidean distance between any two cluster centers
FS	Fukuyama Sugeno cluster validity index
SC	Zahid separation compaction cluster validity index

4. Fuzzy clustering based discovery of user session clusters

4.1 Fuzzy user session clustering data structures and distance computations

Let $\mathscr{V}=\{v_{j}|j=1\ldots c\}$ represent a set of $n$ -dimensional vectors representing the cluster center corresponding to each of the $c$ clusters in $\mathscr{G}=\{g_{j}|j=1\ldots c\}$ . $\mathscr{V}$ may be represented by the following matrix form

$\mathscr{V}[c\times n]=\begin{pmatrix}v_{1}^{1}&v_{1}^{2}&\ldots&v_{1}^{n}\\ \vdots&\vdots&\ddots&\vdots\\ v_{c}^{1}&v_{c}^{2}&\ldots&v_{c}^{n}\end{pmatrix}$ (8)

where $v_{j}^{k}$ represents the $k$-th URL feature value of the $j$-th cluster center. Let $\mu_{ij}$ represent the grade of membership of user session $s_{i}$ in cluster $g_{j}$ where

$\mu_{ij}\in[0,1],\;\forall i=1\ldots m\;\mathrm{and}\;\forall j=1\ldots c$ (9)

The $m\times c$ matrix $\mathscr{M}=[\mu_{ij}]$ is a fuzzy $c$ -partition matrix, which describes the degree of membership of user sessions to various clusters satisfying the following conditions

$\begin{split}\displaystyle\sum_{j=1}^{c}\mu_{ij}=1,~\forall i=1\ldots m\\ \displaystyle 0<\sum_{i=1}^{m}\mu_{ij}<m,~\forall j=1\ldots c\end{split}$ (10)

The partition matrix $\mathscr{M}$ takes the following form

$\mathscr{M}[m\times c]=\begin{pmatrix}\mu_{11}&\mu_{12}&\ldots&\mu_{1c}\\ \vdots&\vdots&\ddots&\vdots\\ \mu_{m1}&\mu_{m2}&\ldots&\mu_{mc}\end{pmatrix}$ (11)
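The constraints in Eqs (9) and (10) can be verified mechanically for any candidate partition matrix. The following Python sketch is one way to do so; the function name `is_fuzzy_partition` and the numerical tolerance are assumptions introduced for this example.

```python
def is_fuzzy_partition(M, tol=1e-9):
    """Check Eqs (9) and (10) for an m x c membership matrix given as a list of rows."""
    m, c = len(M), len(M[0])
    for row in M:
        # Eq. (9): every membership grade lies in [0, 1].
        if any(mu < 0 or mu > 1 for mu in row):
            return False
        # Eq. (10), first condition: memberships of one session sum to 1.
        if abs(sum(row) - 1.0) > tol:
            return False
    for j in range(c):
        # Eq. (10), second condition: no cluster is empty and none absorbs everything.
        column_sum = sum(row[j] for row in M)
        if not (0 < column_sum < m):
            return False
    return True


print(is_fuzzy_partition([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]))  # valid partition
print(is_fuzzy_partition([[1.0, 0.0], [1.0, 0.0]]))              # second cluster is empty
```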

Let $\mathscr{E}^{2}(s_{i},v_{j})$ be the squared Euclidean distance between the user session $s_{i}$ and the cluster center $v_{j}$. It is computed using:

$\mathscr{E}^{2}(s_{i},v_{j})\leftarrow\begin{cases}\sum_{k=1}^{n}|s_{i}^{k}-v_{j}^{k}|^{2},&\text{if no weights are assigned to user sessions and URLs}\\ \mathscr{W}_{s_{i}}\sum_{k=1}^{n}\mathscr{W}_{u_{k}}|s_{i}^{k}-v_{j}^{k}|^{2},&\text{if fuzzy weights are assigned to user sessions and URLs}\end{cases}$ (12)

where $\mathscr{W}_{s_{i}}$ is the weight assigned to the user session $s_{i}$ and $\mathscr{W}_{u_{k}}$ is the weight assigned to the $k$-th URL. The Manhattan distance $\mathscr{E}(s_{i},v_{j})$ between the user session $s_{i}$ and the cluster center $v_{j}$ is given by:

$\mathscr{E}(s_{i},v_{j})\leftarrow\begin{cases}\sum_{k=1}^{n}|s_{i}^{k}-v_{j}^{k}|,&\text{if no weights are assigned to user sessions and URLs}\\ \mathscr{W}_{s_{i}}\sum_{k=1}^{n}\mathscr{W}_{u_{k}}|s_{i}^{k}-v_{j}^{k}|,&\text{if fuzzy weights are assigned to user sessions and URLs}\end{cases}$ (13)
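Both distance measures in Eqs (12) and (13) reduce to a weighted sum over URL features. The sketch below shows one possible Python rendering; the function names and the convention that omitted weights default to 1 (which recovers the unweighted case) are assumptions of this example.

```python
def squared_euclidean(s, v, w_session=1.0, w_urls=None):
    """Eq. (12): (optionally weighted) squared Euclidean distance between s and v."""
    if w_urls is None:
        w_urls = [1.0] * len(s)  # unweighted case
    return w_session * sum(w * (a - b) ** 2 for w, a, b in zip(w_urls, s, v))


def manhattan(s, v, w_session=1.0, w_urls=None):
    """Eq. (13): (optionally weighted) Manhattan distance between s and v."""
    if w_urls is None:
        w_urls = [1.0] * len(s)  # unweighted case
    return w_session * sum(w * abs(a - b) for w, a, b in zip(w_urls, s, v))


session = [1, 2, 0]
center = [0, 0, 1]
print(squared_euclidean(session, center))                    # unweighted
print(manhattan(session, center))                            # unweighted
print(squared_euclidean(session, center, 0.5, [1, 0.5, 0]))  # with fuzzy weights
print(manhattan(session, center, 0.5, [1, 0.5, 0]))          # with fuzzy weights
```

A URL with weight 0 drops out of both sums, so low-support URLs no longer influence which cluster a session is drawn towards.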

The following subsections describe the algorithms for the discovery of user session clusters using the Fuzzy c-Means, Fuzzy c-Medoids and Fuzzy c-Least Medians clustering techniques.

4.2 Fuzzy c-Means user session clustering

The objective function $\mathscr{J}_{\textit{FCM}}$ of user session clustering using Fuzzy $c$-Means is the weighted sum of distances between the user sessions and their cluster centers, as given in Eq. (14)

$\mathscr{J}_{\textit{FCM}}\leftarrow\sum_{j=1}^{c}\sum_{i=1}^{m}\mu_{ij}^{q}\mathscr{E}^{2}(s_{i},v_{j})$ (14)

where $q\in(1,\infty)$ is the fuzziness index (the value $q=1$ is excluded because the membership update in Eq. (16) uses the exponent $1/(q-1)$). Minimization of the objective function $\mathscr{J}_{\textit{FCM}}$ is achieved by updating the grades of membership of user sessions in the various clusters and recalculating the cluster centers in an alternating fashion until convergence occurs. During each iteration, the cluster centers are updated using Eq. (15)

$v_{j}\leftarrow\frac{\overset{m}{\underset{i=1}{\sum}}\mathscr{W}_{s_{i}}\mu_{ij}^{q}s_{i}}{\overset{m}{\underset{i=1}{\sum}}\mu_{ij}^{q}}$ (15)

The membership matrix $\mathscr{M}$ is updated using Eq. (16)

$\mu_{ij}\leftarrow{\displaystyle\frac{\left({\displaystyle\frac{1}{\mathscr{E}^{2}(s_{i},v_{j})}}\right)^{1/(q-1)}}{\overset{c}{\underset{l=1}{\sum}}\left({\displaystyle\frac{1}{\mathscr{E}^{2}(s_{i},v_{l})}}\right)^{1/(q-1)}}}$ (16)

Algorithm 1 describes the steps involved in fuzzy c-Means clustering to discover the user session clusters.

Algorithm 1

Fuzzy $c$ -Means User Session Clustering

Input: $c$ , error threshold $\epsilon$ , maximum no. of iterations $\eta$ and user sessions matrix $\mathscr{S}$

Output: Cluster center matrix $\mathscr{V}$ and membership matrix $\mathscr{M}$

Initialize cluster center matrix $\mathscr{V}$ by randomly selecting $c$ user sessions from $\mathscr{S}$.
$t\leftarrow 1$
repeat
    Compute the membership matrix $\mathscr{M}$:
    for $i\leftarrow 1,m$ do
        for $j\leftarrow 1,c$ do
            Compute $\mu_{ij}$ using Eq. (16)
        end for
    end for
    Compute the new cluster centers in $\mathscr{V}$:
    for $j\leftarrow 1,c$ do
        Compute $v_{j}$ using Eq. (15)
    end for
    Compute the objective function $\mathscr{J}_{\textit{FCM}}(t)$ using Eq. (14)
    $t\leftarrow t+1$
until $|\mathscr{J}_{\textit{FCM}}(t)-\mathscr{J}_{\textit{FCM}}(t-1)|<\epsilon\parallel t=\eta$
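The alternating updates of Algorithm 1 can be sketched compactly in Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes unweighted distances (all session and URL weights equal to 1), gives full membership to a cluster whose center coincides with a session in order to avoid division by zero, and adds a hypothetical `init` parameter so that the starting centers can be fixed instead of drawn at random.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_memberships(S, V, q):
    """Eq. (16); a zero distance receives full membership (an assumption of this sketch)."""
    M = []
    for s in S:
        d = [sq_dist(s, v) for v in V]
        zeros = [j for j, dj in enumerate(d) if dj == 0]
        if zeros:
            row = [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(V))]
        else:
            inv = [(1.0 / dj) ** (1.0 / (q - 1)) for dj in d]
            total = sum(inv)
            row = [x / total for x in inv]
        M.append(row)
    return M


def update_centers(S, M, q):
    """Eq. (15) with every session weight set to 1."""
    m, n, c = len(S), len(S[0]), len(M[0])
    V = []
    for j in range(c):
        w = [M[i][j] ** q for i in range(m)]
        total = sum(w)
        V.append([sum(w[i] * S[i][k] for i in range(m)) / total for k in range(n)])
    return V


def objective(S, V, M, q):
    """Eq. (14)."""
    return sum(M[i][j] ** q * sq_dist(S[i], V[j])
               for i in range(len(S)) for j in range(len(V)))


def fcm(S, c, q=2.0, eps=1e-6, max_iter=100, init=None, seed=0):
    start = init if init is not None else random.Random(seed).sample(S, c)
    V = [list(v) for v in start]
    previous = None
    for _ in range(max_iter):
        M = update_memberships(S, V, q)
        V = update_centers(S, M, q)
        J = objective(S, V, M, q)
        if previous is not None and abs(J - previous) < eps:
            break
        previous = J
    return V, M


# Toy data: two well-separated groups of three sessions each.
S = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
V, M = fcm(S, c=2, init=[S[0], S[3]])
labels = [row.index(max(row)) for row in M]
print(labels)
```

Because every center is a membership-weighted mean over all sessions, a single distant session still pulls each center towards itself, which is the sensitivity to outliers discussed in this paper.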

4.3 Fuzzy c-Medoids user session clustering

The Fuzzy c-Medoids algorithm extends the Hard c-Medoids algorithm by incorporating the fuzzy set concept to produce fuzzy clusters. Each cluster is represented by a representative user session object, which is the medoid of that cluster. The Fuzzy c-Medoids algorithm tries to minimize the objective function:

$\mathscr{J}_{\textit{FCMdd}}\leftarrow\sum_{i=1}^{m}\sum_{j=1}^{c}\mu_{ij}^{q}\mathscr{E}(s_{i},v_{j})$ (17)

The membership matrix $\mathscr{M}$ can be populated using:

$\mu_{ij}\leftarrow{\displaystyle\frac{\left({\displaystyle\frac{1}{\mathscr{E}(s_{i},v_{j})}}\right)^{1/(q-1)}}{\overset{c}{\underset{l=1}{\sum}}\left({\displaystyle\frac{1}{\mathscr{E}(s_{i},v_{l})}}\right)^{1/(q-1)}}}$ (18)

Once the membership matrix $\mathscr{M}$ is populated the new medoids can be computed using

$p\leftarrow\underset{1\leqslant k\leqslant m}{\text{argmin}}\overset{m}{\underset{i=1}{\sum}}\mu_{ij}^{q}\mathscr{E}(s_{i},s_{k});\quad v_{j}\leftarrow s_{p}$ (19)

Algorithm 2 describes the steps involved in fuzzy c-Medoids clustering to discover the user session clusters.

Algorithm 2

Fuzzy $c$ -Medoids user session clustering

Input: $c$ , maximum no. of iterations $\eta$ and user sessions matrix $\mathscr{S}$

Output: $c$ cluster medoids in matrix $\mathscr{V}$ and membership matrix $\mathscr{M}$ .

Initialize the medoids matrix $\mathscr{V}(0)$ by randomly selecting $c$ user sessions from $\mathscr{S}$.
$t\leftarrow 1$
repeat
    Compute the membership matrix $\mathscr{M}$ entries:
    for $i\leftarrow 1,m$ do
        for $j\leftarrow 1,c$ do
            Compute $\mu_{ij}$ using Eq. (18)
        end for
    end for
    Compute the new medoids in $\mathscr{V}(t)$:
    for $j\leftarrow 1,c$ do
        Compute $v_{j}$ using Eq. (19)
    end for
    $t\leftarrow t+1$
until $\mathscr{V}(t)=\mathscr{V}(t-1)\parallel t=\eta$
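A minimal Python sketch of the medoid search in Algorithm 2 is given below. It is an illustration under stated assumptions, not the implementation evaluated in this paper: distances are unweighted Manhattan distances, a session that coincides with a medoid receives full membership in that cluster, and the initial medoids are passed in as session indices (the hypothetical `init_idx` argument) rather than chosen at random.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def update_memberships(S, V, q):
    """Eq. (18); a zero distance receives full membership (an assumption of this sketch)."""
    M = []
    for s in S:
        d = [manhattan(s, v) for v in V]
        zeros = [j for j, dj in enumerate(d) if dj == 0]
        if zeros:
            row = [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(V))]
        else:
            inv = [(1.0 / dj) ** (1.0 / (q - 1)) for dj in d]
            total = sum(inv)
            row = [x / total for x in inv]
        M.append(row)
    return M


def fcmdd(S, init_idx, q=2.0, max_iter=100):
    m, idx = len(S), list(init_idx)
    M = update_memberships(S, [S[p] for p in idx], q)
    for _ in range(max_iter):
        new_idx = []
        for j in range(len(idx)):
            # Eq. (19): the candidate session with the smallest weighted sum of distances wins.
            cost = [sum(M[i][j] ** q * manhattan(S[i], S[k]) for i in range(m))
                    for k in range(m)]
            new_idx.append(cost.index(min(cost)))
        if new_idx == idx:  # medoids unchanged: converged
            break
        idx = new_idx
        M = update_memberships(S, [S[p] for p in idx], q)
    return idx, M


# Toy data: two well-separated groups of three sessions each.
S = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
medoids, M = fcmdd(S, init_idx=[1, 4])
print(medoids)
```

Restricting the centers to actual sessions makes the result easier to interpret, but the search in Eq. (19) evaluates every session as a candidate for every cluster, so one iteration costs on the order of $c\cdot m^{2}$ distance evaluations.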

4.4 Proposed Fuzzy c-Least Medians user session clustering

Unlike the Fuzzy c-Medoids algorithm, which minimizes the sum of the absolute errors, the Fuzzy c-Least Medians (FCLMdn) algorithm tries to minimize the median of the absolute errors, i.e. the median of the Manhattan distances of the $m$ user sessions from their respective cluster medoids. Let $\mathscr{D}_{i}$ be the dissimilarity term representing the dissimilarity of user session $s_{i}$ with respect to all the cluster centers, as given below

$\mathscr{D}_{i}\leftarrow\sum_{j=1}^{c}\left(\mu_{ij}^{q}\mathscr{E}(s_{i},v_{j})\right)$ (20)

The membership $\mu_{ij}$ is calculated using Eq. (16). If the dissimilarity terms $\mathscr{D}_{i}$ are arranged in ascending order for $i=1\ldots m$, the $k$-th item is denoted as $\mathscr{D}_{k:m}$. The objective function $\mathscr{J}_{\textit{FCLMdn}}$ of the Fuzzy c-Least Medians algorithm is given below:

$\mathscr{J}_{\textit{FCLMdn}}\leftarrow\underset{1\leqslant k\leqslant m}{\text{median}}\left(\mathscr{D}_{k:m}\right)$ (21)

The new cluster centers that try to minimize the objective function $\mathscr{J}_{\textit{FCLMdn}}$ are computed using

$p\leftarrow\underset{1\leqslant k\leqslant m}{\text{argmin}}\;\underset{1\leqslant i\leqslant m}{\text{median}}\;\mu_{ij:m}^{q}\mathscr{E}(s_{i:m},s_{k:m});\quad v_{j}\leftarrow s_{p}$ (22)

where $s_{i:m}$ and $s_{k:m}$ represent the user sessions corresponding to the $i$-th and $k$-th dissimilarity terms after sorting, namely $\mathscr{D}_{i:m}$ and $\mathscr{D}_{k:m}$ respectively, and $\mu_{ij:m}$ represents the membership value in cluster $g_{j}$ of the user session corresponding to the $i$-th sorted dissimilarity term $\mathscr{D}_{i:m}$. Algorithm 3 describes the steps involved in Fuzzy c-Least Medians clustering to discover the user session clusters.

Algorithm 3

Proposed Fuzzy $c$ -Least Medians user session clustering

Input: $c$ , maximum no. of iterations $\eta$ and user sessions matrix $\mathscr{S}$

Output: $c$ cluster centers in matrix $\mathscr{V}$ and membership matrix $\mathscr{M}$ .

Initialize the cluster centers matrix $\mathscr{V}(0)$ by randomly selecting $c$ user sessions from $\mathscr{S}$.
$t\leftarrow 1$
repeat
    for $i\leftarrow 1,m$ do
        for $j\leftarrow 1,c$ do
            Compute membership value $\mu_{ij}$ using Eq. (16)
        end for
    end for
    for $i\leftarrow 1,m$ do
        Compute dissimilarity term $\mathscr{D}_{i}$ using Eq. (20)
    end for
    Sort the dissimilarity terms $\mathscr{D}_{i}$ in ascending order for $i=1\ldots m$
    for $j\leftarrow 1,c$ do
        Compute new cluster center $v_{j}$ using Eq. (22)
    end for
    $t\leftarrow t+1$
until $\mathscr{V}(t)=\mathscr{V}(t-1)\parallel t=\eta$
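One possible reading of Eqs (20) to (22) is sketched below in Python. Several choices here are assumptions of the sketch rather than statements about the evaluated implementation: memberships are computed from unweighted Manhattan distances (the distance used in Eq. (20)), a session coinciding with a center receives full membership, the explicit sorting step is left to `statistics.median`, which orders the terms internally, and the initial centers are supplied as session indices through the hypothetical `init_idx` argument.

```python
from statistics import median


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def update_memberships(S, V, q):
    """Inverse-distance memberships; a zero distance gets full membership (assumption)."""
    M = []
    for s in S:
        d = [manhattan(s, v) for v in V]
        zeros = [j for j, dj in enumerate(d) if dj == 0]
        if zeros:
            row = [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(V))]
        else:
            inv = [(1.0 / dj) ** (1.0 / (q - 1)) for dj in d]
            total = sum(inv)
            row = [x / total for x in inv]
        M.append(row)
    return M


def objective(S, V, M, q):
    """Eqs (20) and (21): median of the per-session dissimilarity terms D_i."""
    D = [sum(M[i][j] ** q * manhattan(S[i], V[j]) for j in range(len(V)))
         for i in range(len(S))]
    return median(D)


def update_medoid(S, mu_col, q):
    """Eq. (22): pick the session minimizing the median membership-weighted distance."""
    m = len(S)
    score = [median([mu_col[i] ** q * manhattan(S[i], S[k]) for i in range(m)])
             for k in range(m)]
    return score.index(min(score))


def fclmdn(S, init_idx, q=2.0, max_iter=100):
    idx = list(init_idx)
    M = update_memberships(S, [S[p] for p in idx], q)
    for _ in range(max_iter):
        new_idx = [update_medoid(S, [row[j] for row in M], q) for j in range(len(idx))]
        if new_idx == idx:
            break
        idx = new_idx
        M = update_memberships(S, [S[p] for p in idx], q)
    return idx, M


# One cluster, four one-dimensional sessions, the last one an outlier.
toy = [[0], [1], [2], [100]]
print(update_medoid(toy, [1.0, 1.0, 1.0, 1.0], 2.0))    # index of the selected medoid
print(objective(toy, [[1]], [[1.0]] * 4, 2.0))          # median of D = [1, 0, 1, 99]

S = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, M = fclmdn(S, init_idx=[0, 3])
```

In the one-cluster toy case the summed dissimilarity is 101 and is dominated by the outlier at 100, whereas the median of the same terms is 1, which illustrates why a median-based objective is less affected by a single noisy session.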

5. Assessment of fuzzy cluster validity

The term cluster validity refers to the process of evaluating the results of a clustering method. The purpose of a fuzzy user session clustering method is to identify the significant overlapping partitions present in the user session data set. Various measures for evaluating the quality of the discovered fuzzy clusters are given in [51, 52, 53]. In order to measure the quality of the fuzzy clusters, we used the Xie-Beni, Fukuyama Sugeno, Zahid separation compaction (SC), and error fuzzy cluster validity indices for the following reasons.

• The Xie-Beni index has been used extensively for evaluating the quality of FCM cluster partitions in a wide range of applications because it can validate fuzzy partitions by considering the geometrical features of the discovered clusters [54, 55, 9]. According to the Xie-Beni index, the cluster compactness and separateness are measured by intra-cluster deviations and inter-cluster distance, respectively [56, 57]. In the domain of web usage mining, the Xie-Beni index has been used to evaluate the quality of fuzzy web usage clusters [58, 1].
• The Fukuyama Sugeno index captures the difference between intra-cluster compaction and inter-cluster separation, and it has been used widely for validating the quality of fuzzy clusters [59, 60, 61]. In the field of web data mining, it has been applied successfully to the validation of fuzzy clusters [62, 63].
• The Zahid SC index measures the ratio of fuzzy separation relative to fuzzy compactness, where it computes the separation-compaction ratio based on the geometrical properties of the clustering data structure. It also computes the separation-compaction ratio using the fuzzy union and fuzzy intersection concepts of membership values [64]. The Zahid SC index has been used in various domains for validating cluster quality [65, 66].
• The error index measures the sum of the squared error, i.e., the objective function of FCM-based clustering [67]. We use the error index to measure the clustering error.

The following subsections provide further details of these validity measures. A clustering is considered optimal if it provides good results for all of these validity measures.

5.1 Xie-Beni index (XB)

The Xie-Beni validity index XB is the ratio of the intra-cluster compactness relative to the inter-cluster separation [56]. This index is based on the objective function $\mathscr{J}$ and the square of the minimum distance between the cluster centers. The XB validity index is defined as follows:

$\textit{XB}=\frac{\overset{c}{\underset{j=1}{\sum}}\overset{m}{\underset{i=1}{\sum}}\mu_{ij}^{q}\mathscr{E}^{2}(s_{i},v_{j})}{m\cdot\delta_{\min}^{2}},$ (23)

where $\delta_{\min}^{2}$ is the square of the minimum Euclidean distance between the cluster centers, which is given by:

$\delta_{\min}^{2}=\underset{l,k=1\ldots c\wedge l\neq k}{\min}\;\mathscr{E}^{2}(v_{l},v_{k}).$ (24)

If the clusters are highly separated, the value of $\delta_{\min}$ is large, which yields an XB index with a smaller value. Thus, a small XB index indicates a compact and well-separated cluster.
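As a worked illustration of Eqs (23) and (24), the following Python sketch evaluates the XB index on a small hand-made example. The function name, the unweighted squared Euclidean distance and the toy data are assumptions of this example; a crisp membership matrix is used only to keep the arithmetic easy to follow.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(S, V, M, q=2.0):
    """Eq. (23): fuzzy compactness divided by m times the minimum squared center separation."""
    m, c = len(S), len(V)
    compactness = sum(M[i][j] ** q * sq_dist(S[i], V[j])
                      for i in range(m) for j in range(c))
    # Eq. (24): squared distance between the two closest cluster centers.
    delta_min_sq = min(sq_dist(V[l], V[k])
                       for l in range(c) for k in range(c) if l != k)
    return compactness / (m * delta_min_sq)


# Four sessions around two centers that are 10 units apart.
S = [[0, 0], [0, 2], [10, 0], [10, 2]]
V = [[0, 1], [10, 1]]
M = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(xie_beni(S, V, M))
```

Here the compactness term is 4 (each session is at squared distance 1 from its own center) and $\delta_{\min}^{2}=100$, so the index evaluates to $4/(4\cdot 100)=0.01$.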

5.2 Fukuyama Sugeno index (FS)

The Fukuyama Sugeno Index FS is defined as:

$\textit{FS}=\overset{m}{\underset{i=1}{\sum}}\overset{c}{\underset{j=1}{\sum}}\mu_{ij}^{q}\mathscr{E}^{2}(s_{i},v_{j})-\overset{m}{\underset{i=1}{\sum}}\overset{c}{\underset{j=1}{\sum}}\mu_{ij}^{q}\mathscr{E}^{2}(v_{j},v),$ (25)

where $v$ is the mean of all the cluster center vectors:

$v=\overset{c}{\underset{j=1}{\sum}}{\displaystyle\frac{v_{j}}{c}}.$ (26)

The first term of Eq. (25) measures the cluster compactness and the second term measures the separation. Smaller values of FS are expected for compact and well-separated clusters.
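Eqs (25) and (26) can be evaluated in the same way. The sketch below is a minimal Python illustration using unweighted squared Euclidean distances; the function name and the toy data are assumptions of this example.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fukuyama_sugeno(S, V, M, q=2.0):
    """Eq. (25): fuzzy compactness minus fuzzy separation from the mean center."""
    m, c, n = len(S), len(V), len(V[0])
    # Eq. (26): mean of all cluster center vectors.
    v_bar = [sum(v[k] for v in V) / c for k in range(n)]
    return sum(M[i][j] ** q * (sq_dist(S[i], V[j]) - sq_dist(V[j], v_bar))
               for i in range(m) for j in range(c))


S = [[0, 0], [0, 2], [10, 0], [10, 2]]
V = [[0, 1], [10, 1]]
M = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(fukuyama_sugeno(S, V, M))
```

For this example the compactness term is 4 and the separation term is $4\cdot 25=100$, giving $\textit{FS}=-96$; moving the two centers further apart makes the index more negative.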

5.3 Zahid SC validity index

The Zahid SC index is defined as:

$\textit{SC}=\mathscr{T}_{1}-\mathscr{T}_{2},$ (27)

where

$\mathscr{T}_{1}={\displaystyle\frac{\overset{c}{\underset{j=1}{\sum}}\mathscr{E}^{2}(v_{j},v)/c}{\overset{c}{\underset{j=1}{\sum}}\left(\overset{m}{\underset{i=1}{\sum}}\mu_{ij}^{q}\mathscr{E}^{2}(s_{i},v_{j})\bigg/\overset{m}{\underset{i=1}{\sum}}\mu_{ij}\right)}}$ (28)

and

$\mathscr{T}_{2}={\displaystyle\frac{\overset{c-1}{\underset{j=1}{\sum}}\overset{c}{\underset{k=j+1}{\sum}}\left(\overset{m}{\underset{i=1}{\sum}}(\min(\mu_{ij},\mu_{ik}))^{2}\right)\bigg/\overset{m}{\underset{i=1}{\sum}}\min(\mu_{ij},\mu_{ik})}{\overset{m}{\underset{i=1}{\sum}}\left({\underset{1\leqslant j\leqslant c}{\max}}\mu_{ij}\right)^{2}\bigg/\overset{m}{\underset{i=1}{\sum}}{\underset{1\leqslant j\leqslant c}{\max}}\mu_{ij}}}.$ (29)

The term $\mathscr{T}_{1}$ measures the ratio of fuzzy separation relative to fuzzy compactness by considering the geometrical properties of the data structure and the membership functions. A large value of $\mathscr{T}_{1}$ indicates well-separated and compact clusters. The term $\mathscr{T}_{2}$ also measures the ratio of fuzzy separation relative to fuzzy compactness, but it only considers the fuzzy membership values. The term $\mathscr{T}_{2}$ utilizes a fuzzy union and a fuzzy intersection to obtain the fuzzy compactness and fuzzy separation, respectively. A small value in the numerator of $\mathscr{T}_{2}$ is desirable for well-separated fuzzy c-partitions, and a large value in the denominator of $\mathscr{T}_{2}$ indicates a compact fuzzy c-partition. Therefore, a low value of $\mathscr{T}_{2}$ denotes compact and well-separated fuzzy clusters. Finally, by considering Eq. (27), a large SC index indicates good intra-cluster cohesion and a small inter-cluster overlap.
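The two terms of the SC index can be computed directly from Eqs (27) to (29). The Python sketch below is one reading of those formulas, in which the ratio inside the double sum of Eq. (29) is taken per cluster pair; that reading, the guard for cluster pairs with no overlap, the unweighted distances and the toy data are all assumptions of this example.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def zahid_sc(S, V, M, q=2.0):
    m, c, n = len(S), len(V), len(V[0])
    v_bar = [sum(v[k] for v in V) / c for k in range(n)]

    # T1, Eq. (28): geometric separation over membership-normalized compactness.
    separation = sum(sq_dist(v, v_bar) for v in V) / c
    compactness = sum(
        sum(M[i][j] ** q * sq_dist(S[i], V[j]) for i in range(m))
        / sum(M[i][j] for i in range(m))
        for j in range(c)
    )
    t1 = separation / compactness

    # T2, Eq. (29): fuzzy intersection (min) over fuzzy union (max) of memberships.
    overlap = 0.0
    for j in range(c - 1):
        for k in range(j + 1, c):
            mins = [min(M[i][j], M[i][k]) for i in range(m)]
            if sum(mins) > 0:  # skip pairs with no overlap (guard added in this sketch)
                overlap += sum(x ** 2 for x in mins) / sum(mins)
    maxs = [max(row) for row in M]
    t2 = overlap / (sum(x ** 2 for x in maxs) / sum(maxs))

    return t1 - t2  # Eq. (27)


S = [[0, 0], [0, 2], [10, 0], [10, 2]]
V = [[0, 1], [10, 1]]
M = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(zahid_sc(S, V, M))
```

For this example $\mathscr{T}_{1}=25/6.5$ and $\mathscr{T}_{2}=(0.1/0.6)/(2.9/3.4)$, so the index is approximately 3.65.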

5.4 Error index

The error index represents the objective function of the fuzzy user session clustering, which is defined as follows:

$\mathscr{J}\leftarrow\sum_{j=1}^{c}\sum_{i=1}^{m}\mu_{ij}^{q}\mathscr{E}^{2}(s_{i},v_{j})$ (30)

where $\mathscr{E}^{2}(s_{i},v_{j})$ is the squared Euclidean distance between the user session vector $s_{i}$ and the cluster center vector $v_{j}$, and the membership value $\mu_{ij}$ is given by Eq. (16). A lower error index denotes better clustering results.
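Eq. (30) amounts to a membership-weighted sum of distances. A minimal Python/NumPy sketch is given below; the function name `error_index` is hypothetical, the memberships are taken as given (Eq. (16) is not re-derived here), and $\mathscr{E}^{2}$ is assumed to be the squared Euclidean distance.

```python
import numpy as np

def error_index(S, V, U, q=2.0):
    """Objective J of Eq. (30): sum over j, i of mu_ij^q * E^2(s_i, v_j).

    S: (m, n) session vectors; V: (c, n) cluster centers;
    U: (m, c) memberships; q: fuzziness index.
    """
    S, V, U = (np.asarray(a, dtype=float) for a in (S, V, U))
    # Squared Euclidean distance between every session and every center
    d2 = ((S[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return float((U ** q * d2).sum())
```

For example, two sessions at (0, 0) and (2, 0) with centers at the same two points, where the first session belongs fully to the first cluster and the second session is split equally between both, give $\mathscr{J}=0.5^{2}\cdot 4=1$ for $q=2$.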

6. Experimental results and discussion

The input user session data is extracted from web access logs taken from the proxy servers of a university campus. The input consists of 319 user sessions, which access 116 URLs of web resources. Perl scripts are used to assign the fuzzy weights to all user sessions and URLs.

6.1 Results of assigning weights to user sessions

The row dimensionality of the user session matrix $\mathscr{S}$ is reduced by evaluating the user sessions and eliminating the most insignificant ones. Each user session is evaluated by the number of URL items accessed in it. User session weights are calculated using the linear fuzzy membership function of Eq. (6) and the standard $S$ fuzzy membership function of Eq. (7). For the Google user sessions, the lower bound $\beta_{l}$ of the session URL count is set to 1 and the upper bound $\beta_{h}$ is set to 4. Table 2 shows the user session weights assigned to the Google user sessions by both membership functions.

Table 2
Results of user session weight assignment

Log data	Session support	No. of URLs	URL weight (linear function)	URL weight (S function)
Google	1	46	0	0
	2	20	0.2	0.08
	3	16	0.4	0.32
	4	4	0.6	0.68
	5	4	0.8	0.92
	5 $+$	72	1	1
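The two weighting schemes can be sketched in a few lines of Python. This is an illustration only: it assumes that the linear function is the usual ramp between a lower and an upper bound and that the standard $S$ function is Zadeh's S-function with its crossover at the midpoint of the bounds, and the names `linear_weight` and `s_weight` are hypothetical. The demonstration loop uses bounds 1 and 6, an illustrative choice that reproduces the linear and $S$ columns listed in Table 2; note that the text above states an upper bound of 4.

```python
def linear_weight(x, low, high):
    """Linear ramp: 0 at or below `low`, 1 at or above `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def s_weight(x, low, high):
    """Standard S-function (assumed form) with crossover at the midpoint."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    mid = (low + high) / 2.0
    if x <= mid:
        return 2.0 * ((x - low) / (high - low)) ** 2
    return 1.0 - 2.0 * ((x - high) / (high - low)) ** 2

# Illustrative bounds (assumption): low = 1, high = 6
for count in range(1, 7):
    lin = round(linear_weight(count, 1, 6), 2)
    s = round(s_weight(count, 1, 6), 2)
    print(count, lin, s)
```

Compared with the linear ramp, the $S$ function assigns lower weights below the midpoint and higher weights above it, so it separates weakly and strongly supported items more sharply.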

Table 3

Results of URL weight assignment

URLs are evaluated by their session support count, and their weights are calculated using the linear fuzzy membership function of Eq. (3) and the standard $S$ fuzzy membership function of Eq. (4). For the Google user sessions, the lower bound $\alpha_{l}$ is set to 1 and the upper bound $\alpha_{h}$ is set to 4. Table 3 shows the URL weights assigned to the Google URLs by both membership functions.

6.2 Experimental process for discovery of user session clusters

The experimental code is implemented in Java using the Eclipse Integrated Development Environment and run under Windows 7 on a machine with an AMD C-60 processor (1.33 GHz) and 4 GB of RAM. The experimental process is as follows:

• Multiple runs of the Fuzzy c-Means, Fuzzy c-Medoids and proposed Fuzzy c-Least Medians algorithms are conducted on the input user sessions (i) without any weight assignment, (ii) with linear fuzzy weights and (iii) with standard $S$ fuzzy weights. The number of clusters $c$ is varied from 2 to 50. The fuzziness index $q$, the error threshold $\epsilon$ and the maximum number of iterations $\eta$ are set to 2, 0.01 and 100, respectively.
• For all of the above cases, the following cluster validity measures are computed: (i) Xie Beni Index, (ii) Fukuyama Sugeno Index, (iii) Error Index and (iv) Zahid SC Index. These validity measures are described in Section 5.
• For each fuzzy clustering technique, the results for the different types of input user session weights are compared using these validity measures.
• The performance of the Fuzzy c-Means, Fuzzy c-Medoids and Fuzzy c-Least Medians algorithms is compared using the same validity measures.
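To make the role of the parameters $q$, $\epsilon$ and $\eta$ concrete, the following Python/NumPy sketch shows a plain Fuzzy c-Means loop with the settings listed above as defaults. It covers only the baseline FCM, not Fuzzy c-Medoids or the proposed Fuzzy c-Least Medians, and it is not the Java implementation used in the experiments. The function name `fuzzy_c_means`, the random initialization, and the stopping test on the largest membership change are assumptions made for this example.

```python
import numpy as np

def fuzzy_c_means(S, c, q=2.0, eps=0.01, max_iter=100, seed=0):
    """Plain FCM sketch; returns (centers V, memberships U, iterations used)."""
    S = np.asarray(S, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((S.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # each row sums to 1
    for it in range(1, max_iter + 1):
        W = U ** q
        V = (W.T @ S) / W.sum(axis=0)[:, None]   # membership-weighted means
        d2 = ((S[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)               # avoid division by zero
        inv = d2 ** (-1.0 / (q - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        shift = np.abs(U_new - U).max()          # assumed stopping measure
        U = U_new
        if shift < eps:
            break
    return V, U, it

if __name__ == "__main__":
    demo = [[0, 0], [0, 1], [10, 0], [10, 1]]
    V, U, used = fuzzy_c_means(demo, c=2)
    print(U.argmax(axis=1), used)
```

In the experimental setting described above, such a routine would be called once per value of $c$ from 2 to 50 and per weighting scheme, with the validity indices computed on each returned pair of centers and memberships.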

6.3 Results of fuzzy c-Means, c-Medoids, c-Least Medians clustering with and without fuzzy weights

Table 4 reports the values of Xie Beni Index, Fukuyama Sugeno (FS) Index, Zahid SC Index and Error Index for Fuzzy c-Means, c-Medoids and c-Least Medians clustering of Google user sessions with different types of weights assigned to them.

Table 4
Fuzzy clustering of Google user sessions

Index	$c$	Fuzzy c-Means			Fuzzy c-Medoids			Fuzzy c-Medians		
		None	LFMF	SFMF	None	LFMF	SFMF	None	LFMF	SFMF
XB	10	1.416	8.21e4	9.56e4	6.96e3	2.37e3	2.41e3	4.24e3	1.13e3	2.04e3
	20	4.10e6	1.70e6	2.66e6	1.06e4	7.01e3	8.01e3	5.49e3	4.21e3	4.58e3
	30	8.02e7	1.11e6	6.78e6	1.67e4	1.10e4	1.20e4	8.66e3	6.08e3	7.04e3
	40	9.03e7	5.13e6	9.44e6	2.03e4	1.28e4	1.38e4	9.24e3	6.75e3	7.78e3
	50	5.09e7	9.27e6	2.57e7	1.91e4	4.61e3	5.62e3	6.99e3	4.00e3	4.93e3
FS	10	24.35	17.49	18.84	22.47	14.90	17.70	20.17	14.57	16.59
	20	17.45	8.17	9.15	10.38	8.38	9.36	6.34	5.34	6.02
	30	20.35	5.41	6.76	13.82	4.78	7.76	8.17	4.78	5.92
	40	7.20	2.01	2.68	4.58	1.57	2.93	2.08	1.57	1.96
	50	7.63	2.35	3.48	3.03	1.74	2.04	1.90	1.73	1.79
SC	10	55.47	58.01	56.08	67.73	81.41	75.60	81.98	97.68	85.36
	20	37.46	59.81	51.18	43.36	65.40	54.09	45.76	85.96	58.13
	30	439.73	520.88	497.21	478.05	645.79	561.57	581.30	795.57	709.37
	40	434.66	515.16	489.38	564.55	632.33	608.55	632.12	662.66	650.25
	50	294.26	383.77	337.88	310.14	384.88	332.98	387.37	398.55	390.40
Error	10	25.05	21.80	22.83	24.19	15.68	17.76	23.38	15.32	17.48
	20	12.50	8.50	10.61	12.26	7.68	9.73	12.07	7.70	9.05
	30	7.78	5.83	6.26	7.54	5.18	6.27	7.46	5.11	6.29
	40	5.95	3.45	4.57	5.52	3.83	4.89	5.05	3.83	4.16
	50	4.82	2.72	3.76	4.10	3.10	3.78	4.08	3.07	3.51

Figure 2.

Fuzzy c-Means clustering of user sessions with and without fuzzy weights.

Figure 2 shows the validity index values for the Google user session clusters discovered by the Fuzzy c-Means algorithm as a function of the number of clusters. Figure 2(a) shows the Xie Beni Index scores computed using Eq. (23); a smaller value of this index represents better cluster quality. The figure shows that the fuzzy clusters formed from fuzzy weighted user sessions are of much better quality than those formed from user sessions without weights, and that user sessions with linear fuzzy weights give better results than those with standard $S$ fuzzy weights. Figure 2(b) shows the Fukuyama Sugeno (FS) index scores computed using Eq. (25). Since the FS index is the difference between within-cluster compactness and inter-cluster separation, a lower value represents better cluster quality. The figure shows that fuzzy weighted user sessions result in higher quality clusters in terms of the FS index, with linear fuzzy weights performing slightly better than standard $S$ weights. Figure 2(c) shows the Zahid Separation Compaction (SC) Index values computed using Eq. (27); a higher value of this index represents better fuzzy cluster quality. The figure indicates that fuzzy weighted user sessions result in higher quality clusters in terms of the SC index, and that linear fuzzy weights give better clusters than standard $S$ weights. Figure 2(d) shows the Fuzzy Error Index scores computed using Eq. (30); a smaller value represents less fuzzy clustering error and better cluster quality. The figure indicates that the fuzzy clustering error for fuzzy weighted user sessions is much lower than that for non-weighted user sessions, and that the cluster quality obtained with linear fuzzy weights and with standard $S$ fuzzy weights is comparable.

Figure 3 provides various validity index values for Google user session clusters discovered by applying Fuzzy c-Medoids algorithm as a function of number of clusters. Figures 3(a)–(d) show clearly (for the same reasons discussed for Fig. 2) that the quality of the fuzzy clusters discovered using the LFMF- and SFMF-based fuzzy weighted user sessions was better than that of the clusters formed by user sessions without any weights in terms of the XB, FS, SC, and Error validity indices. This indicates that fuzzy weight assignment reduced the adverse effects of insignificant user sessions and URLs.

Figure 3.

Fuzzy c-Medoids clustering of user sessions with and without fuzzy weights.

Figure 4 provides various validity index values for Google user session clusters discovered by applying Fuzzy c-Least Medians algorithm as a function of number of clusters. Figures 4(a)–(d) show clearly (for the same reasons discussed for Fig. 2) that the quality of the fuzzy clusters discovered using the LFMF- and SFMF-based fuzzy weighted user sessions was better than that of the clusters formed by user sessions without any weights in terms of the XB, FS, SC, and Error validity indices. This indicates that fuzzy weight assignment reduced the adverse effects of insignificant user sessions and URLs.

Figure 4.

Fuzzy c-Least Medians clustering of user sessions with and without fuzzy weights.

6.4 Comparison of Fuzzy c-Means, c-Medoids and c-Least Medians clustering results

In this subsection, the experimental results of Fuzzy c-Means, Fuzzy c-Medoids and Fuzzy c-Least Medians clustering of the Google user sessions are compared. The algorithms are compared on non-weighted user sessions as well as on user sessions with linear fuzzy weights, using the fuzzy cluster validity measures described in Section 5.

Figure 5.

Fuzzy clustering results with no session weights.

Figure 6.

Fuzzy clustering results with linear fuzzy weights.

Figure 5 shows the validity index values for the Google user session clusters discovered by the Fuzzy c-Means, Fuzzy c-Medoids and Fuzzy c-Least Medians algorithms with no weights assigned to the user sessions. Figure 5(a) shows the Xie Beni Index scores computed using Eq. (23) as a function of the number of clusters; a smaller value of this index represents better cluster quality. The figure shows that the fuzzy clusters discovered by the Fuzzy c-Least Medians algorithm are superior to those of the Fuzzy c-Means and Fuzzy c-Medoids algorithms. Fuzzy c-Means produces the worst clusters of the three methods in terms of the Xie Beni Index, while the results of Fuzzy c-Medoids are better than those of Fuzzy c-Means but slightly inferior to those of Fuzzy c-Least Medians. Figure 5(b) shows the Fukuyama Sugeno (FS) index scores computed using Eq. (25); since the FS index is the difference between within-cluster compactness and inter-cluster separation, a lower value represents better cluster quality. Figure 5(c) shows the Zahid Separation Compaction (SC) Index values computed using Eq. (27), for which a higher value represents better fuzzy cluster quality. Figure 5(d) shows the Fuzzy Error Index scores computed using Eq. (30), for which a smaller value represents less fuzzy clustering error and better cluster quality. In each of these three panels the ordering is the same as for the Xie Beni Index: Fuzzy c-Least Medians gives the best clusters, Fuzzy c-Medoids is better than Fuzzy c-Means but inferior to Fuzzy c-Least Medians, and Fuzzy c-Means gives the worst clusters.

Figure 6 shows the validity index values for the Google user session clusters discovered by the Fuzzy c-Means, Fuzzy c-Medoids and Fuzzy c-Least Medians algorithms with linear fuzzy weights assigned to the user sessions. Figures 6(a)–(d) show (for the same reasons given for Fig. 5) that the fuzzy clusters formed by the Fuzzy c-Least Medians algorithm are better than those formed by the Fuzzy c-Means and Fuzzy c-Medoids algorithms. Fuzzy c-Means produces the worst clusters of the three methods in terms of the Error Index. The results of Fuzzy c-Medoids are better than those of Fuzzy c-Means but inferior to those of Fuzzy c-Least Medians.

Table 5
Execution time (ms) for fuzzy clustering of Google user sessions

Session weight	Clusters ($c$)	Fuzzy c-Means	Fuzzy c-Medoids	Fuzzy c-Medians
No weight	10	1821	2234	30875
	20	8320	17860	126563
	30	10225	29812	264407
	40	15763	43125	396016
	50	19987	58234	521906
Fuzzy linear weight	10	1696	2028	27172
	20	7469	15925	120047
	30	9913	27828	253000
	40	14534	42938	378688
	50	17230	55406	509296

Figure 7.

Execution time of fuzzy clustering the Google user sessions with no weights.

Figure 8.

Execution time of fuzzy clustering the Google user sessions with fuzzy linear weights.

Figure 9.

Execution time of fuzzy clustering the Google user sessions with and without fuzzy weights.

Table 5 lists the execution times in milliseconds for Fuzzy c-Means, Fuzzy c-Medoids and Fuzzy c-Least Medians clustering of the Google user sessions with and without linear fuzzy weights.

Figures 7 and 8 plot execution time against the number of clusters for Fuzzy c-Means, Fuzzy c-Medoids and Fuzzy c-Least Medians clustering of the Google user sessions without any weights and with linear fuzzy weights, respectively. These figures show that the execution time required for Fuzzy c-Least Medians clustering is much higher than that of the other two algorithms. Fuzzy c-Means has the lowest execution time, and Fuzzy c-Medoids requires more time than Fuzzy c-Means.
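The size of the gap can be read directly from Table 5. The short Python sketch below computes, for the non-weighted sessions, how many times slower Fuzzy c-Medoids and Fuzzy c-Least Medians are than Fuzzy c-Means; the timings are copied from the table, while the names `times_ms` and `slowdown_vs_fcm` are hypothetical.

```python
# Execution times in milliseconds from Table 5 (no session weights):
# number of clusters -> (Fuzzy c-Means, Fuzzy c-Medoids, Fuzzy c-Least Medians)
times_ms = {
    10: (1821, 2234, 30875),
    20: (8320, 17860, 126563),
    30: (10225, 29812, 264407),
    40: (15763, 43125, 396016),
    50: (19987, 58234, 521906),
}

def slowdown_vs_fcm(row):
    """Return the (FCMdd / FCM, FCLMdn / FCM) execution time ratios."""
    fcm, fcmdd, fclmdn = row
    return fcmdd / fcm, fclmdn / fcm

for c, row in times_ms.items():
    r_mdd, r_lmdn = slowdown_vs_fcm(row)
    print(f"c={c}: FCMdd {r_mdd:.1f}x, FCLMdn {r_lmdn:.1f}x slower than FCM")
```

On these figures Fuzzy c-Least Medians takes more than ten times as long as Fuzzy c-Means for every tabulated number of clusters.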

Figure 9 plots execution time against the number of clusters for Fuzzy c-Means, Fuzzy c-Medoids and Fuzzy c-Least Medians clustering of the Google user sessions with and without linear fuzzy weights. The figure shows that, for all three fuzzy algorithms, the execution time required for clustering the Google user sessions with linear fuzzy weights is less than that for non-weighted user sessions.

7. Conclusion

In this study, to discover overlapping user session clusters from noise-contaminated user session data, the fuzzy set based Fuzzy c-Means and Fuzzy c-Medoids clustering algorithms are adapted and a Fuzzy c-Least Medians algorithm is proposed, implemented and tested. All of these algorithms use fuzzy membership functions to handle overlapping clusters. The contribution of this work to the discovery of web usage clusters is multifold, namely:

A fuzzy set theoretic approach for assigning fuzzy weights to the user sessions and associated URLs is proposed to deal with the high dimensional user session data mixed with noise and outliers in the form of insignificant sessions and URLs. Fuzzy weight assignment is performed using linear fuzzy and standard S membership functions. This approach is tested extensively using various Crisp, Fuzzy, Neural, Rough and Genetic clustering techniques.

For web usage pattern discovery, FCM and FCMdd based clustering techniques are explored.

In order to rectify the problems due to noise sensitivity of FCM and FCMdd methods, we propose a robust Fuzzy c-Least Medians (FCLMdn) clustering framework to deal with the user session data contaminated with noise and outlier user session objects, with the objective of improving the quality of the extracted patterns.

The following conclusions can be drawn from the experimental results presented above.

• The fuzzy weight assignments to user sessions and URL items are found to improve the fuzzy clustering performance and quality in terms of the Xie Beni Index, Fukuyama Sugeno Index, Zahid SC Index and Error Index. It is also observed that linear fuzzy weight assignment results in better fuzzy cluster formation than standard $S$ weight assignment. Weight assignment also reduces the dimensionality of the user session data and lowers the execution time of the fuzzy clustering algorithms.

• User session clustering with the Fuzzy c-Least Medians algorithm provides better fuzzy clustering performance than the Fuzzy c-Means and Fuzzy c-Medoids algorithms in terms of the Xie Beni Index, Fukuyama Sugeno Index, Zahid SC Index and Error Index, for user sessions with or without weight assignments. The fuzzy cluster quality of the Fuzzy c-Medoids algorithm is better than that of the Fuzzy c-Means algorithm.

• However, the execution time of the Fuzzy c-Least Medians algorithm is much higher than that of the other two algorithms. The Fuzzy c-Medoids algorithm requires a longer execution time than the Fuzzy c-Means algorithm but a shorter one than the Fuzzy c-Least Medians algorithm. Thus the Fuzzy c-Least Medians algorithm provides high quality fuzzy clusters at the cost of execution time.

