Abstract
Data clustering is the generic process of splitting a set of datums into a number of homogenous sets. Nevertheless, although a clustering process inputs datums as a set of separate mathematical objects, these entities are in fact correlated within a spatial context specific to the problem class in hand. For example, when the data acquisition process yields a 2D matrix of regularly sampled measurements, as it is the case with image sensors which utilize different modalities, adjacent datums are highly correlated. Hence, the clustering process must take into consideration the spatial context of the datums. A review of the literature, however, reveals that a significant majority of the well-established clustering techniques in the literature ignore spatial context. Other approaches, which do consider spatial context, however, either utilize pre- or post-processing operations or engineer into the cost function one or more regularization terms which reward spatial contiguity. We argue that employing cost functions and constraints based on heuristics and intuition is a hazardous approach from an epistemological perspective. This is in addition to the other shortcomings of those approaches. Instead, in this paper, we apply Bayesian inference on the clustering problem and construct a mathematical model for data clustering which is aware of the spatial context of the datums. This model utilizes a robust loss function and is independent of the notion of homogeneity relevant to any particular problem class. We then provide a solution strategy and assess experimental results generated by the proposed method in comparison with the literature and from the perspective of computational complexity and spatial contiguity.
Keywords
Introduction
Unsupervised grouping of datums into homogenous clusters is an important prerequisite in a vast number of signal and image processing applications. In essence, these applications require a stable process which inputs a set of datums, of a particular mathematical model, and divides them into an appropriate number of clusters, where the clusters comply with a particular notion of homogeneity. Additionally, this process is required to satisfy conditions such as being robust, allowing for datums with different priorities, and detecting outliers. Moreover, it is an important advantage for such a process to be independent of any particular datum and cluster model. When this condition is satisfied, reuse of the clustering module becomes possible and time and effort can be saved. Also, from the point of view of practical usability, a clustering process which depends on a few parameters with perceptual definitions is highly preferred to an alternative which requires the deliberate adjustment of a large set of incomprehensibleconfiguration variables.
Clustering algorithms are historically defined as operators which function on sets. It is important, however, to emphasize that the input set of datums to any clustering algorithm is inherently ordered. In other words, datums generally correspond to a physical phenomenon which is spatially correlated. This is important because a significant majority of the algorithms in the literature are indifferent to the order of the datums. Nevertheless, the spatial context of the datums is important because we generally expect spatial contiguity in the classification results. Additionally, the spatial context of a datum is an important component in clustering it and provides helpful clues when noise and cluster combination produce ambiguity for the clustering process.
In practice, the dominant approach for delivering a novel clustering algorithm is to translate a verbal description of the purpose of clustering into an objective function, one or more constraints, and a potential set of pruning procedures. In this framework, it is essentially the intuition of the researchers which drives the development of new clustering algorithms. As such, the inclusion of a new regularization term or a novel weight component in the objective function is commonly justified through verbal descriptions which are based on heuristics and metaphors. In other words, the development of fuzzy clustering algorithms is often output-based, where experimental results provide the “proof” that the heuristics of the researchers have been correct. Comparison of different fuzzy clustering algorithms, too, is often carried out based on the outputs of the respective formulations.
In this paper, we utilize Bayesian inference and directly derive the mathematical model for loss in a fuzzy-possibilistic clustering problem when spatial context is taken into consideration. This loss model is independent of any particular datum or cluster models and utilizes a generic notion of homogeneity. This model also utilizes a robust loss function and employs the concept of the relevance of each datum to the set. Moreover, the loss model developed in this paper is independent of any parameter which might have to be trained or tuned empirically or through repetition of any process. Hence, it is important to emphasize that the process utilized in this work develops the objective function organically and is, therefore, characteristically different from previous works in the field which are based on heuristics.
The rest of this paper is organized as follows. First, in Section 2, we review the related literature and then, in Section 3, we provide the developed method. Subsequently, in Section 4, we present experimental results produced by the developed method on three different problem classes and, in Section 5, we provide the concluding remarks.
Literature review
Notion of membership
The notion of membership is a key point of distinction between different clustering frameworks. Essentially, membership may be Hard or Fuzzy. Within the context of hard membership, each datum belongs to one cluster and is different from all other clusters. The fuzzy membership regime, however, maintains that each datum in fact belongs to all clusters, with the stipulation that the degree of membership to different clusters is different. K-means [104] and Hard C-means (HCM) [55] clustering algorithms, for example, utilize hard membership values. The reader is referred to [74] and the references therein for a history of K-means clustering and other methods closely related to it. Iterative Self-Organizing Data Clustering (ISODATA) [7] is a hard clustering algorithm as well.
With the introduction of Fuzzy Theory [161], many researchers incorporated this more natural notion into clustering algorithms [8, 155]. The premise for employing a fuzzy clustering algorithm is that fuzzy membership is more applicable in practical settings, where, generally, no distinct line of separation is present between the clusters. Additionally, from a practical perspective, it is observed that hard clustering techniques are extremely more prone to falling into local minima [40]. The reader is referred to [16, 18] for the wide array of fuzzy clustering methods developed in the past few decades.
Initial work on fuzzy clustering was done by Ruspini [136] and Dunn [41] and it was then generalized by Bezdek [16] into Fuzzy C-means (FCM). In FCM, datums, which are denoted as x1, ⋯ , x N , belong to and clusters, which are identified as ψ1, ⋯ , ψ C , are represented as points in . FCM makes the assumption that the number of clusters, C, is known through a separate process or expert opinion and minimizes the following objective function,
In (1), m > 1 is the fuzzifier (also called weighing exponent and fuzziness). The optimal choice for the value of the fuzzifier is a debated matter [94] and is suggested to be “an open question” [160]. Bezdek [145] suggests that 1 < m < 5 is a proper range and utilizes m = 2. The use of m = 2 is suggested by Dunn [41] in his early work on the topic as well and also by Frigui et al. [47], among others [79]. Bezdek [15] provided physical evidence for the choice of m = 2 and Pal et al. [116] suggested that the best choice for m is probably in the interval [1.5, 2.5]. Yu et al. [160] argue that the choices for the value of m are mainly empirical and lack a theoretical basis. They worked on providing such a basis and suggested that “a proper m depends on the data set itself” [160].
Recently, Zhou et al. [165] proposed a method for determining the optimal value of m in the context of FCM. They employed four Cluster Validity Index (CVI) models and utilized repeated clustering for m ∈ [1.1, 5] on four synthetic data sets as well as four real data sets adopted from the UCI Machine Learning Repository [99] (refer to [17] for a review of CVIs and [138] for coverage in the context of relational clustering). The range for m in that work is based on previous research [115] which provided lower and upper bounds on m. The investigation carried in [165] yields that m = 2.5 and m = 3 are optimal in many cases and that m = 2 may in fact not be appropriate for an arbitrary set of datums. This result is in line with other works which demonstrate that larger values of m provide more robustness against noise and the outliers. Nevertheless, significantly large values of m are known to push the convergence towards the sample mean, in the context of Euclidean clustering [160]. Wu [150] analyzes FCM and some of its variants in the context of robustness and recommends m = 4.
Rousseeuw et al. [135] suggested to replace with , for a known 0 < α < 1. Klawonn et al. [77, 78] suggested to generalize this effort and to replace with an increasing and differentiable function g (f nc ).
Pedrycz [122–124] suggested to modify (2) in favor of customized constraints for different values of n. That technique allows for the inclusion of a priori information into the clustering framework and is addressed as Conditional Fuzzy C-means (CFCM). The same modification is carried out in Credibilistic Fuzzy C-Means (CFCM) [27, 28], in which the “credibility” of datums is defined based on the distance between datums and clusters. Therefore, in that approach, (2) is modified in order to deflate the membership of outliers to the set of clusters (also see [156]). Customization of (2) is also carried out in Cluster Size Insensitive FCM (csiFCM) [112] in order to moderate the impact of datums in larger clusters on an smaller adjacent cluster. Leski [94] provides a generalized version of this approach in which is constrained.
It is a common assumption that the notion of homogeneity depends on datum-to-datum distances. This assumption is made implicitly when clusters are modeled as prototypical datums, also called clustroids or cluster centroids, as in FCM, for example. A prominent choice in these works is the use of the Euclidean distance function [89]. For example, the potential function approach considers datums as energy sources scattered in a multi-dimensional space and seeks peak values in the field [154] (also see [12, 133]). We argue, however, that the distance between the datums may not be either defined or meaningful and that what the clustering algorithm is to accomplish is the minimization of datum-to-cluster distances. For example, when datums are to be clustered into certain lower-dimensional subspaces, as is the case with Fuzzy C-Varieties (FCV) [95], the Euclidean distance between the datums is irrelevant.
Note that, prototype-based clustering does not necessarily require explicitly present prototypes. For example, in kernel-based clustering, it is assumed that a non-Euclidean distance can be defined between any two datums. The clustering algorithm then functions based on an FCM-style objective function and produces clustroids which are defined in the same feature space as the datums [146]. These cluster prototypes may not be explicitly represented in the datum space, but, nevertheless, they share the same mathematical model as the datums [151] (the reader is referred to a review of Kernel FCM (KFCM) and Multiple-Kernel FCM (MKFCM) in [24] and several variants of KFCM in [25]). Another example for an intrinsically prototype-based clustering approach in which the prototypes are not explicitly“visible” is the Fuzzy PCA-guided Robust k-means (FPR k-means) clustering algorithm [66] in which a centroid-less formulation [162] is adopted which, nevertheless, defines homogeneity as datum-to-datum proximity.
Relational clustering approaches constitute another class of algorithms which are intrinsically based on datum-to-datum distances (for example refer toRelational FCM (RFCM) [63] and its non-Euclidean extension Nerf C-means [60]). The goal of this class of algorithms is to group datums into self-similar bunches. Another algorithm in which the presence of prototypes may be less evident is Multiple Prototype Fuzzy Clustering Model (FCMP) [107], in which datums are described as a linear combination of a set of prototypes, which are, nevertheless, members of the same as the datums are. Additionally, some researchers utilize L r -norms, for r ≠ 2 [19, 72], or other datum-to-datum distance functions [75].
We argue that a successful departure from the assumption of prototypical clustering is achieved when clusters and datums have different mathematical models. For example, the Gustafson-Kessel algorithm [57] models a cluster as a pair of a point and a covariance matrix and utilizes the Mahalanobis distance between datums and clusters (also see the Gath-Geva algorithm [50]). Fuzzy shell clustering algorithms [79] utilize more generic geometrical structures. For example, the FCV [95] algorithm can detect lines, planes, and other hyper-planar forms, the Fuzzy C Ellipsoidal Shells (FCES) [46] algorithm searches for ellipses, ellipsoids, and hyperellipsoids, and the Fuzzy C Quadric Shells (FCQS) [79] and its variants seek quadric and hyperquadric clusters (also see Fuzzy C Plano-Quadric Shells (FCPQS) [88]).
Robustification
Dave et al. [36] argue that the function of membership values in FCM and the concept of weight functions in robust statistics are related. Based on this perspective, it is argued that the classical FCM in fact provides an indirect means for attempting robustness. Nevertheless, it is known that FCM and other least square methods are highly sensitive to noise [28]. Hence, there has been ongoing research on the possible modifications of FCM in order to provide a (more) robust clustering algorithm [43,93, 43,93]. Dave et al. [36] provide an extensive list of relevant works and outline the intrinsic similarities within a unified view (also see [33, 56]).
The first attempt to robustifying FCM, based on one account [36], is the Ohashi Algorithm [56, 114]. That work adds a noise class to FCM and writes the robustified objective function as,
The transformation from (1) to (3) was suggested independently by Dave [33, 34] when he developed the Noise Clustering (NC) algorithm as well. The core idea in NC is that there exists one additional imaginary prototype which is at a fixed distance from all of the datums and represents noise. That approach is similar to modeling approaches which perform consecutive identification and deletion of one cluster at a time [73, 166]. Those methods, however, are expensive to carry out and require reliable cluster validity measures.
Krishnapuram et al. [85] extended the idea behind NC and developed the Possibilistic C-means (PCM) algorithm by rewriting the objective function as,
We argue that the interlocking mechanism present in FCM, i.e. (2), is valuable in that, not only clusters seek homogenous sets, but that they are also forced into more optimal “positions” through forces applied by competing clusters. In other words, borrowing the language used in [89], in FCM clusters “seize” datums and it is disadvantageous for multiple clusters to claim high membership to the same datum. There is no phenomenon, however, in NC and PCM which corresponds to this internal factor. Additionally, it is likely that PCM clusters coincide and/or leave out portions of the data unclustered [10]. In fact, it is argued that the fact that at least some of the clusters generated through PCM are non-coincidental is because PCM gets trapped into local minimum [142]. PCM is also known to be more sensitive to initialization than other algorithms in its class [89].
It has been argued that both concepts of possibilistic degrees and membership values have positive contributions to the purpose of clustering [32]. Hence, Pal et al. [117] combined FCM and PCM and rewrote the optimization function of Fuzzy Possiblistic C-Means (FPCM) as minimizing,
Weight modeling is an alternative robustification technique and is exemplified in the algorithm developed by Keller [76], in which the objective function is rewritten as,
Frigui et al. [47] included a robust loss function in the objective function of FCM and developed Robust C-Prototypes (RCP),
The classical FCM and PCM, and many of their variants, are based on the assumption that the number of clusters is known (the reader is referred to [16, Chapter 4] and [71, Chapter 4] for reviews of this topic). While PCM-style formulations may appear to relax this requirement, the corresponding modification is carried out at the cost of yielding an ill-posed optimization problem [89]. Hence, provisions have been added to existing clustering algorithms in order to address this challenge. Repeating the clustering procedure for different numbers of clusters [50, 87] and progressive clustering are two of the approaches devised in the literature.
Among the many variants of progressive clustering are methods which start with a significantly large number of clusters and freeze “good” clusters [83, 88], approaches which combine compatible clusters [35, 88], and the technique of searching for one “good” cluster at a time until no more is found [73, 139]. Use of regularization terms in order to push the clustering results towards the “appropriate” number of clusters is another approach taken in the literature [48]. These regularization terms, however, generally involve additional parameters which are to be set carefully, and potentially per problem instance (for example see the mixed C-means clustering model proposed in [117]).
Dave et al. [36] conclude in their 1997 paper that the solution to the general problem of robust clustering, when the number of clusters is unknown, is “elusive” and that the techniques available in the literature each have their limitations. In this paper, we acknowledge that the problem of determining the appropriate number of clusters is hard to solve and even hard to formalize. Additionally, we argue that this challenge is equally applicable to many clustering problems independent of the particular clustering model utilized in the algorithms. Therefore, we designate this challenge as being outside the scope of this contribution and assume that either the appropriate number of clusters is known or that an exogenous means of cluster pruning is available which can be utilized within the context of the algorithm developed in this paper.
Weighted clustering
Many fuzzy and possibilistic clustering algorithms make the assumption that the datums are equally important. Weighted fuzzy clustering, however, works on inputs datums which have an associated positive weight [73]. This notion can be considered as a marginal case of clustering fuzzy data [42]. Other examples for this setting include clustering of a weighted set, clustering of sampled data, clustering in the presence of multiple classes of datums with different priorities [3], and a measure used in order to speed up the execution through data reduction [22, 157]. Nock et al. [109] formalize the case in which weights are manipulated in order to move the clustering results towards datums which are harder to include regularly. Chen et al. [23] utilize density motivated weights in order to reduce the impact of outliers (refer to [58] for different variants of this framework). Semi Supervised FCM (ssFCM) [13] uses weight factors based on Euclidean norm in order to balance the sizes of different hyper-spherical shaped clusters based on user intervention. Note that the extension of FCM on weighted sets has been developed under different names, including Density-Weighted FCM (WFCM) [62], Fuzzy Weighted C-means (FWCM) [96], and New Weighted FCM (NW-FCM) [70].
Spatial context
FCM effectively ignores spatial context [137]. In other words, in FCM, PCM, and many other variants of fuzzy clustering algorithms, datums are treated as separate realizations, where there is no a priori relationship between x n and xn+1. Nevertheless, in the physical world, datums are always correlated. Hence, the notion of spatial context suggests that a datum is defined in its context and must be classified while the context is taken into consideration. In words of the authors of [137], “usually, one pixel is too small to represent part of an image”.
It is important to emphasize that the notion of spatial context and the premise of NC are somewhat contradictory. While NC attempts to classify noisy datums into a separate cluster and to subsequently discard them, the thinking behind spatial context is that a noisy datum may be classified based on its context [137]. This is an important requirement in approaches such as image segmentation, in which a pixel identified as noise is in effect a discontinuity in the output.
While it is theoretically possible to include datum coordinates, if applicable, as additional features and then to carry out clustering [71, 84], that approach is not theoretically justifiable, because coordinate information and datum features are often inherently different and are defined in different scales. From a practical perspective, too, defining a notion of homogeneity which encompasses both datum homogeneity as well as spatial contiguity for datums is not a trivial task.
Another primitive approach to including the spatial context into the clustering process is to perform data pre-processing [5, 152]. Potential loss of details, however, is among the caveats of this technique. A marginally more appropriate approach is to perform post-processing on the membership maps [20, 152] or to execute clustering at different scales and to fuse the results afterwards [4, 144]. The reader is referred to [45, 68] for other variations of these approaches. It has been argued [101] that the incorporation of spatial context as a pre- or post-processing stage is easy to implement but lacks proper theoretical justification.
Spatial context can also be incorporated into the optimization process as additional information. In fact, Gibbs random fields have been used in order to model spatial context within the framework of K-means clustering [14, 92]. Also see [119] for an iterative process which utilizes sliding windows which shrink over time. Another, more recent, example for this approach is Geometrically Guided FCM (GG-FCM) [110, 111] in which the semi-supervised framework developed in [125] is modified in order to use neighborhood information as training for image pixels (also see Spatial FCM (sFCM) [29] and Bilateral FCM (bFCM) [103]).
GG-FCM and Geometrically Guided Conditional FCM (GGC-FCM) utilize a reject class and eliminate datums which are found to be spurious [110]. That approach is not desirable in applications which require every datum to be classified into a cluster. That deficiency, however, is addressed in Spatially Guided FCM (SG-FCM) [113], in which a geometrical shape descriptor is incorporated into the objective function. Nevertheless, in those works, spatial context is either utilized as a static input [125] or as information which is dynamically recalculated through the process [113]. The latter case, however, commonly depends on engineered measures of compliance to spatial contiguity and often additionally requires the proper setting of one or more regularization coefficients. For example, the approaches outlined in [113, 125] depend on the proper setting of the value of the parameter α.
Another example for the utilization of spatial context as a priori information is the Improved FCM (IFCM) [157], in which a histogram-based FCM deployment produces cluster prototypes and membership values and then the resulting crisp membership information is utilized in order to produce the p nc quantities for each datum and each cluster. Here, p nc denotes the ratio of the neighbors of x n which belong to ψ c . The second stage of that algorithm then finds f nc and ψ c which minimize the following modified objective function,
The aforementioned works belong to the general category of approaches which utilize engineered regularization terms. Many of the works in that category propose a superficially constructed term which penalizes excessive spatial variation. Additionally, as stated before, the regularization terms are generally multiplied by a constant, the proper setting of the value of which is an important prerequisite for the appropriateness of the outcomes of the algorithm (as an example containing both issues refer to [38]). An early examples of that approach is the Contig-k-means [131, 141] algorithm which updates the crisp k-means clustering objective function in order to incorporate spatial contiguity into it. That approach, however, requires the proper adjustment of the value of the parameter λ. In another example, in [126], a term is devised which penalizes correlation between membership values of adjacent datums to different clusters. Another example is presented in [163], in which the authors inject a regularization term into the objective function of a Kernel FCM (KFCM) algorithm [52] and generate the Spatially constrained Kernel FCM (SKFCM) algorithm. That approach of modeling the spatial context has precedence in the literature [22, 159] and is inspired by Neighborhood EM (NEM) [6].
In NEM, the proper value for the parameter α is to be set by the user and is suggested to be dependent on the signal to noise ratio (SNR) of the input image. Dependence on additional parameters is a concern in other works as well. For example, in [137] the authors develop the Improved FCM (IFCM) algorithm wherein “neighborhood attraction” is modeled as a combination of “feature attraction” and “distance attraction”. That model is a reminiscent of notions from bilateral filtering literature [120]. Nevertheless, the term which utilizes spatial context in that work depends on two parameters, the values of which are to be set by on an Artificial Neural Network (ANN) (also see [29]). Additionally, the attraction models utilized in [137] depend on datum-to-datum distances, which, as previously noted, are in fact irrelevant to a generic homogeneity model.
A more generalized approach to spatial context is utilized in Bias-Corrected FCM (BCFCM) [5], in which the following regularization term isemployed,
In [128] the authors develop the Adaptive FCM (AFCM) algorithm through incorporating a multiplier field into the objective function as follows (also see [100, 149]),
Fuzzy Local Information C-Means (FLICM) [82] is among the state-of-the-art in the field of image segmentation. The formulation of FLICM does not depend on any particular parameters and it uses fuzzy local similarity measures which incorporate both gray level as well as spatial closeness. We argue, however, that the notion of datum-to-datum comparison which is used in FLICM, is only applicable to a certain category of problem classes which includes gray scale image segmentation but excludes color image segmentation. Moreover, the primary concern with FLICM is that it engineers a new concept, i.e. the Fuzzy Factor G. In fact, that work is a prime example of the introduction of a new concept based on intuition, as it is outlined in the list of “characteristics” given in [82, Section III.A]. That concept is then heuristically composed as a mathematical formula [82, (17)]. That paper then follows with verbal justification of the appropriateness of the engineered factor [82, Section III.B]. While there are important epistemological questions regarding the construction of FLICM, that framework is further extended in subsequent frameworks such as RFLICM [53] and KWFLICM [54]. An extension of FLICM is given in FCM with Edge and Local Information (FELICM) [97], in which results at image edges are improved through separate treatment of boundary pixels.
In this section, we utilize Bayesian inference in order to develop the loss model for a fuzzy-possibilistic clustering algorithm which utilizes a generic datum model and a generic notion of cluster homogeneity and also employs a robust loss function. The main contribution of this section is the incorporation of spatial context into the loss model.
Model preliminaries
We assume that a problem class is given, within the context of which a datum model is known and denote a datum as x. We also assume that a particular cluster model is provided, which complies with the notion of homogeneity relevant to the problem class at hand, and denote a cluster as ψ.
In this work, we utilize a weighted set of datums, defined as,
We assume that the real-valued non-negative distance function φ (x, ψ) is defined on the datum x and the cluster representation ψ. We emphasize that this generalized assumption is a definite departure from prototype-based approaches. Those approaches assume that x and ψ have identical mathematical models, generally as members of , and adopt the Euclidean distance function. We also assume that the distance function is unbounded, i.e. for any cluster representation ψ and any positive value L, there exist infinite number of datums x for which φ (x, ψ) > L. As special cases, when the datum belongs to , the Euclidean Distance, any L r norm, and the Mahalanobis Distance are special cases of the notion of datum-to-cluster distance defined here. The corresponding cluster models in these cases would be , , and ψ identifying a pair of a member of and a k × k covariance matrix, respectively.
We assume that φ (x, ψ) is differentiable in terms of ψ and that for any non-empty weighted set
We assume that a function Ψ∘paran·, which may depend on
We assume that a robust loss function, uparan · : [0, ∞] → [0, U], is given. Here, U > 0 is the value which u (τ) converges to for τ→ ∞. We assume that uparan· is an increasing differentiable function which satisfies u (0) =0 and u (λ) = ontop12 for a known value of λ > 0, which we will address as the scale parameter (note the similarity with the cluster-specific weights in PCM [85]). In fact, λ has a similar role to that of scale in robust statistics [69] (also called the resolution parameter [12]) and the idea of distance to noise prototype in the Noise Clustering (NC) algorithm [33, 34]. Scale can also be considered as the controller of the boundary between inliers and outliers [36]. From a geometrical perspective, λ controls the radius of spherical clusters and the thickness of planar and shell clusters [89]. One may investigate the possibility of generalizing the unique scale factor λ into cluster-specific scale factors λ c in line with the η c parameters inPCM [85].
In this work, we utilize the rational robust loss function given below,
We acknowledge that one may consider the possibility of utilizing Tukey’s biweight [11], Hampel [59], and Andrews loss functions in this framework. Huber and Cauchy loss functions are not bounded and therefore are not applicable to this work. Refer to [36, Table I] for mathematical formulations.
Now, we provide a mathematical model for spatial context. We note that the notion of spatial context implies that datums which are spatially adjacent must interfere in the classification of each other. Here, we model the influence of x n 1 on x n 2 as ω n 1 n 2 ≥ 0. We utilize this model through a Bayesian inference framework, as follows,
In (17), we differentiate between E{Loss} and expectloss. The former is the expected loss associated to a particular setting while the spatial context is taken into consideration. In effect, the purpose of the clustering algorithm, derived here, is the minimization of expectLoss for the entire set of datums. In contrast, expectloss is the expected loss when the spatial context is ignored, i.e. the loss of the individual datum. As such, we write,
Here, we employ a Bayesian inference framework and model the aggregate loss which corresponds to an arbitrary solution to the clustering problem. This process results in an objective function which is organically derived from the utilized loss model.
We write,
In this model,c is the subset of which contains datums in ψ c . Therefore,
We continue the derivation of (21) by using (17) for and write,
Close assessment of (26) shows that this cost function contains an HCM-style hard formulation. It is known, however, that the use of the fuzzifier has important benefits, as outlined in Section 2:membership. Hence, we rewrite (26) as,
It is important to emphasize that (27) converges to the objective function for the classical FCM when the additional features built into it are eliminated. In order to observe this relationship, we first need to assume that ωnn′ is the identity operator, i.e. that ωnn′ is one iff n = n′. Additionally, we need to set pn ≡ 1, ∀ n, i.e. assume that all datums are inliers, and also set u (x) ≡ x, i.e. ignore the robust loss function. Then, (27) will convert into the conventional FCM formulation. In other words, the process developed in this paper can be considered as theoretical justification for the FCM model when the presence of the outliers and the significance of spatial context are ignored and robustification is not required.
Note the presence of the product of and uncn′c in (27). In other words, the distance between xn′ and ψ c is weighted by the membership of x n to this cluster. This notion of spatial correspondence is at the heart of the spatial context model which is utilized in the present work. A distantly similar approach is taken in [101], in which the Euclidean distance between the datums is coupled with their corresponding distances to the clusters. That framework is inherently based on the assumption that the Euclidean distance between the datums is relevant to the homogeneity of sets of datums (also see [91]).
A Note on the concept of membership in the present framework is necessary. In this work, f nc denotes the membership of x n to ψ c conditional to x n being an inlier. However, for convenience, we may simply refer to f nc as “membership of x n to ψ c ”. It is important to realize that this reference contains an implicit “subject to x n being an inlier” excluded from the sentence in order to facilitate the flow of the conversation and the derivations. In other words, the membership of x n to the ψ c , as it is commonly referenced in the data clustering literature, is in fact equal to pnfnc, as derived below,
We emphasize that the second term in (27) in fact acts as a regularization component and that it bears resemblance to the regularization term in PCM (see (3)). It is important to emphasize, however, that this term is derived organically through the mathematical modeling of the loss value. In comparison, a significant majority of the well-known methods in the literature utilize regularization components which are heuristically and based on the intuition of the researchers assumed to yield desired effects (for example see [49]). Those works then utilize experimental evidence in order to exhibit that the desired effects are in fact present and valid. In contrast, in this work, we start from a Bayesian model for the loss function and derive an objective function which yields a term that may be understood as a regularization term. We find this distinction important from an epistemological perspective.
In this section, we provide a solution strategy for the cost function given in (27). This cost function is to be minimized subject to (2).
First, we calculate ontop∂Δ∂p n and equate it to zero and derive,
An alternative method is to utilize the technique developed by Weiszfeld [148] and Miehle [105, 134] (similar techniques are cited under different names as well [30, 90]). The Weiszfeld technique utilizes the fixed point method in order to solve (14):prime when φpcdotp is not the Euclidean distance function (refer to [90] for details and to [39] for acceleration options). A weighted version of the Levenberg-Marquardt algorithm [106] may also be applicable for certain distance functions and loss functions.
For certain classes of problems, it is not a requirement that every datum must be assigned to a cluster. In other words, the problem class under consideration may prefer a solution in which clusters are not unnecessarily “bloated” in order to include outliers. We satisfy the requirements of such problem classes through returning the C disjoint sets c, c = 1, ⋯ , C, as the inliers, as well as
We suggest to utilize a Maximum Likelihood inference framework and to compare the probability values corresponding to x
n
∈ c, for c = 1, ⋯ , C as well as the probability that x
n
∈
This section, presents an outline of the algorithm developed in this paper, as follows, Generate the input set of datums . Call Ψ∘paran· to produce ψ
c
, for all c. Calculate uncnc, for all n and c. Calculate f
nc
, for all n and c, using (33). Calculate p
n
, for all n, using (32). Calculate Δ using (27). If change in Δ is negligible, exit the loop. Update ψ
c
, for all c, using (14):star. Call external pruning process, if applicable. Go to Line 3.
Note that, in Line 9, there is a place-holder for an external pruning process which may decide to remove clusters that it finds redundant or scarcely populated. In essence, in this context, any of the techniques listed in Section 2:validity can be utilized as external pruning procedures.
Implementation notes
The developed algorithm is implemented as a class named Calista in MATLAB Version 8.1 (R2013a). It takes use of the Image Processing Toolbox for minor image-related primitives. The major operations in this class are implemented as C/MEX dll’s. The results carried in this paper are collected on Windows 7, 64 bit, on an Intel Core i5-2400 CPU, 3.10 GHz, with 8.00 GB of RAM.
In this work, each problem class is implemented as a child class for Calista. The child classes implement a constructor which creates the weighted set X based on the input image, data file, etc. The child classes also implement the three functions φparan·, Ψ∘paran·, and Ψparan· and set the value of λ independent of
Complexity analysis
Analysis shows that the computational cost of the classical FCM can be estimated as,
Similar derivations indicate that the computational complexity of the developed algorithm can be estimated as,
Substituting typical values in (38) and (39), we derive that the developed algorithm is expected to require about 10 times more computation than FCM when a 3 × 3 kernel is employed. With a 5 × 5 kernel this ratio drives up to about 24.
In this section, we define three problem classes and then follow with the procedure required in order to deploy the developed method in the context of these problem classes. Here, we use the inputs sets of datums shown in Fig. 1.
Spatial context model
The problem classes utilized in this section all address datums in regularly sampled 2-D grids. Therefore, here, we define ω n 1 n 2 as a function of the Euclidean distance between the two points which correspond to x n 1 and x n 2 . We denote this distance as d n 1 n 2 and measure it in pixels. We then model the spatial context as a finite 2D kernel of size 2r + 1 ×2r + 1 and define ω n 1 n 2 = 0 outside of this neighborhood of x n 1 . Inside this square, we utilize,
In this section, we review the utilization of the developed method in the context of three problems classes which address 1D, 3D, and 3D datums, as described below.
Grayscale image multi-level thresholding
The problem of grayscale image multi-level thresholding defines datums as grayscale values and models a cluster as an interval on the grayscale axis centered at the scalar ψ c . The reader is referred to [21] for a review of the approaches to image segmentation with emphasis on Magnetic Resonance Images.
In order to produce the datums, we downsample the input image at the scale of 4 and produce the sample average and the sample standard deviation for each block. We address these entities as η n and σ n , respectively. Then, we utilize x n = η n and calculate ω n as,
In this problem class, distance between a datum and a cluster is defined as the square difference and the initial clusters are defined as uniformly-distributed points in the working range. The cluster fitting function in this problem class calculates the weighted sum of the input datums. This problem utilizes the scale of 25 gray levels and is defined as inclusive. The input images used in the experiments carried in this section are at the resolution of 512 × 512 pixels.
The input data in this problem class contains 3D points captured by a Kinect 2 sensor. The depth-maps used in these experiments are captured at the resolution of 424 × 512 pixels. Here, intrinsic parameters of the camera are acquired through the Kinect SDK and each datum in this problem class has the weight of one.
Clusters in this problem class are defined as planes which have a thickness, i.e. is a vector in . Here, we model a plane using the following mathematical representation,
The initial set of clusters in this problem class are very rough estimates of the three ideal planes which are expected to exist in the scene. Here, each plane passes through the corresponding point at the tenth percentile of the marginal distribution of the points on the corresponding axis and is parallel to the two other axes. The cluster fitting function, Ψpcdotp, for this problem class fits a plane to a number of 3-D weighted points using a weighted variant of Singular Value Decomposition (SVD). This problem class utilizes scale of 200 millimeters and is defined as non-inclusive.
The appropriateness of an Euclidean model for color homogeneity has been disputed in the literature. Klinker et al. [80] presented a new approach to measuring highlights from an arbitrary point of a dielectric object in 1988. Then, in 1990, they applied their theory to color image understanding [81]. About a decade later, Cheng et al. [26] used Principal Component Analysis (PCA) for color image processing. Then, in 2004,Nikolaev et al. [108] provided theoretical evidence for the appropriateness of linear models of color homogeneity in natural images. Later in 2008, Abadpour et al. [2] utilized a resulting cylindrical cluster model for color image segmentation for the purposes of compression and watermarking. Here, we utilize this model using the clustering algorithm developed in this paper in order to perform color image segmentation.
In order to produce the datums, we downsample the input image at the scale of 4 and utilize the same framework as (41) in order to produce the weights. Note that in this problem class we utilize 3σ instead of σ.
A cluster in this problem class is defined as the mean vector and the principal vector , i.e. . In this problem class, distance between and ψ c is defined as [1],
The clusters in this problem class are initialized to the PCA representations of a maximum of 6 homogenous 256 × 256 colored patches. These patches are extracted form images found on the web through searching for terms such as Red, Green, and Blue. This problem class utilizes a scale of 25 gray levels and it is solved as inclusive.
Grayscale image multi-level thresholding
Figure 2 compares the outputs of classical FCM and the developed algorithm on the input image shown in Fig. 1(a). Here, C equals 2. In this figure, the bottom row graphs present the input data as well as the output clusters and also membership of datums to the clusters in the histogram domain. In fact, in these visualizations, the horizontal axis identifies the datums and the gray curve denotes the weights, i.e. the histogram of the downscaled version of the input image. Here, each set of colored points identifies membership to one cluster, i.e. pnfnc, and black points denote values of p n .
Note that, as expected from the model, p n is always one in FCM. Moreover, the x n –p n relationship in the developed algorithm is not one-to-one. In other words, in the developed algorithm, the same grayscale value may be considered an inlier with different probabilities based on its context. Hence, the black set of points in Figure 2(b-bottom) do not constitute a single-valued curve. Similarly, the x n –f nc relationship is not one-to-one for the developed algorithm either. Thus, while in FCM the membership representation for each cluster is a single-valued curve, it is in fact a cloud of points for the developed algorithm. This observation also exhibits that in the developed algorithm membership to clusters depends on the context of the datum.
We note that in the outputs of FCM, shown in Figure 2(a-bottom), the membership curve for the rightmost cluster is at over 0.5 at the end of the range. This is essentially because the distances between these datums and both clusters are significantly large. We compare this unwanted situation with the geometry of the output of the developed method, in which, as observed in Figure 2(b-bottom), at a distance from any cluster it is not significantly important how far the datum is from that particular cluster. In other words, as seen in the left side of Figure 2(b-bottom) for x n ≤ 40, membership to both clusters drops without requiring a competing cluster. In FCM, however, a cluster centered between 60 and 80 assigns high membership values to x n valuesbelow 10.
Figure 2 also compares the output of FCM with that of the developed algorithm in the image domain. Here, in the top row, in order to visualize the classification results, x n is replaced with ψ c n . We observe that the developed algorithm produces stronger spatial contiguity. This is essentially the main goal of the approaches which perform post- or pre-processing operations in order to increase correlation of the clustering information for adjacent datums. The developed algorithm achieves this goal using the datum-to-datum relationship model described in Section 3:preliminaries. In fact, we observe that compared to Figure 2(a-top), there are less isolated pixels in Figure 2(b-top) and that the boundaries of the clusters in Figure 2(b-top) are smoother than those in Figure 2(a-top).
It took FCM and the proposed method 30 milliseconds and 250 milliseconds to produce the results shown in Figure 2(a) and (b), respectively.
Figure 3 shows another pair of results for grayscale image multi-level thresholding produced by FCM and the proposed method on the image shown in Figure 1(b) for C = 5 clusters. We observe in Figure 3 that, similar to the observations made in Figure 2, FCM clusters are wider while the proposed method produces more specific clusters. This point can be observed, for example, at the two ends of the spectrum in Figure 3(a-bottom), where the FCM clusters assign membership values over half to x n = 0 and x n = 255. Whereas, membership of these datums to the clusters in the output of the developed method falls down to close to zero, as seen in Figure 3(b-bottom). We also emphasize the higher spatial contiguity of the result shown in Figure 3(b-top) compared to the one shown in Figure 3(a-top). In this experiment, FCM converged after 91 milliseconds while the proposed method required 811 milliseconds to converge.
Plane finding in range data
Figure 4 shows the output of FCM and the developed algorithm for the input shown in Figure 1(c). This scene contains a room in which the floor and two walls are visible. Therefore, the purpose of this experiment is to examine whether the two algorithms can successfully recognize these three planes.
The top images in Figure 4 denote the classification of the datums to the detected clusters. Here, datums are painted according to the cluster they are assigned to. Outlier datums, i.e. datums which are found to not belong to any of the clusters, are painted in black. As expected form the models, in the FCM results every datum is assigned to a clusters. The proposed method, however, labels some of the datums as outliers. These datums are the ones which belong to other objects in the scene as well as to areas in which the inherent geometrical distortions caused by the sensor push datums unacceptably far from the planes.
The bottom images in Figure 4 show the relative positions of the converged clusters. Here, we observe that both algorithms have successfully detected the floor and the two walls. Nevertheless, examining the top images in Figure 4, we observe higher spatial contiguity and smoother edges for the clusters generated by the proposed method compared to the ones produced by FCM.
In this experiment, FCM converged after 2,194 milliseconds, whereas the proposed method required 19,526 milliseconds to converge. We observe that while the proposed method is computationally more expensive, as predicted in Section 3.7, the ratio between the computational costs of the two algorithms is in fact smaller than the estimate produced in Section 3.7.
The results shown in Figure 4 indicate a situation in which the proposed method and FCM perform similarly, with the added bonus that the proposed method performs outlier detection while the latter does not. The comparison shown in Figure 5, however, exhibits another case in which the clustering results generated by the two algorithms are in fact significantly different. This data belongs to the same room utilized in Figure 4 with the difference that the camera is moved to another corner of the room and three human bodies are added to the scene. In other words, the major difference between the data utilized in Figure 4 and the one used in Figure 5 is a visible increase in the portion of the data which does not belong to any of the clusters.
Under these conditions, as seen in the bottom row of Figure 5(a), FCM fails to produce meaningful clusters. In fact, we observe that the outliers in the scene pull the clusters into incorrect positions. This is also evident in the top image in Figure 5(a), i.e. the output of FCM. The proposed method, however, as seen in Figure 5(b), produces spatially contiguous clusters which also correctly correspond to the actual planes present in the scene.
Color image segmentation
Finally, we present two pairs of results generated by FCM and the proposed algorithm for color image segmentation. These results correspond to the two input images shown in Figure 1 and (f). As seen in Figure 6 and Figure 7, however, it is harder to discuss appropriateness of clustering results for color data, compared to what was carried out on the results corresponding to the two other problem classes discussed in this section. Nevertheless, We observe spatial contiguity in the outputs of the proposed method while the FCM results, as expected from the mathematical model, lack spatial context and contiguity.
Complexity analysis
Table 1 carries the elapsed times corresponding to the experimental results exhibited in this section. We observe that the computational cost of the proposed method is always less than 10 times that of FCM. This empirically calculated value must be compared with the estimation given in Section 3.7 which predicts that the proposed method is 24 times more expensive that FCM.
In fact, we repeatedly executed the clustering procedures, both FCM and the proposed method, for the purpose of grayscale image multi-level thresholding on 16 samples from the USC-SIPI database for 9 different values of C, between 2 and 10, and recorded the results. We observed that FCM in average required 557.17 milliseconds to converge while the proposed method converged in average after 2,854.32 milliseconds. Thus, the computational cost of the proposed method is in average 9.38 times that of FCM for this problem class
We also utilized 7 samples and examined the plane finding in range data problem class. For this problem class FCM in average converged after 4,430.14 milliseconds while the proposed method in average required 18,490.29 milliseconds to converge. Hence, the proposed method is in average 4.69 times more computationally expensive for this problem class.
Finally, we carried out a similar investigation for the color image segmentation problem class. In this investigation, we utilized 8 samples from the USC-SIPI database and the same range of C investigated for the grayscale image multi-level thresholding problem class. We observed that FCM and the proposed method in average converged after 490.67 and 1,925.56 milliseconds, respectively. Hence, for this problem class, the computational cost of the proposed method is in average about 4.80 times that of FCM.
These investigations verify the finding, based on the data presented in Table 1, that the developed method is less than 10 times more computationally expensive than FCM.
Spatial contiguity
Figure 8 shows magnified sections of the outputs of FCM and the proposed method for some of the experiments discussed in this section. The three pairs visualized in this figure each present classification results corresponding to one of the problem classes utilized in this section. Here, for each problem class, FCM and the proposed method are applied on the same set of input datums and the same sections of the output are magnified.
We observe that, independent of the specifics of the problem class, the proposed method produces spatially contiguous classification results. In the output of FCM, however, there exist datums which are assigned to clusters, to which none of their neighbors are assigned. In other words, as expected from the mathematical models, FCM disregards the expected correlation in the classification results and assigns datums to clusters with no regard for the spatial context in which the datums exist. The proposed method, however, produces patches of datums which are assigned to the same cluster and, therefore, generates smoother edges between theclusters.
Conclusions
In this work, we investigated a generic fuzzy-possibilistic clustering problem in which datums of an arbitrary mathematical model are classified into a number of clusters of an arbitrary model. This framework addresses the general case in which, due to the physical properties of the datums and the clusters, a particular notion of homogeneity is applicable to the problem class in hand. We showed that datum and cluster models and the notion of homogeneity can be abstracted out of the loss model, which we derived using Bayesian inference. As a result, we developed a loss model which employs a robust loss function and utilizes both concepts of fuzzy/probabilistic membership as well as the possibilistic estimate that a datum is in fact an inlier.
A key contribution of this paper is the incorporation of spatial context into the clustering process. We argued that while a significant majority of the works in the literature are indifferent to the order of the input datums, the input to any clustering problem is inherently correlated in the spatial domain. When algorithms ignore this relationship between the datums, however, they fail to produce spatially contiguous classification results. Additionally, spatial context is an important property of a datum in the presence of noise and when clusters meet. Thus, we modeled the loss for a datum as a function of the loss for the datums in its proximity. Then, we utilized a generic concept of datum-to-datum correlation and derived the loss function for the clustering problem. Subsequently, we developed a solution strategy for the developed clustering model.
We emphasize that the process developed in this paper avoids heuristically engineered terms which may be intuitively believed to lead to particular types of output, as it is generally practiced in the literature. In fact, in this paper, we based the entire model on direct derivation of loss using a construction process and avoided any parameters which may have to be set by the user or through other processes.
We used problem instances in three different problem classes and described the process of adopting the developed algorithm in each one. We exhibited experimental results and compared the outputs of the developed algorithm with those of FCM. We showed that the developed method is successful in performing outlier detection as well as finding the clusters present in the data. We also demonstrated the spatial contiguity of the classification results generated by the developed algorithm. While in theory the developed algorithm is estimated to be 24 times more computationally expensive than FCM, we observe that in practice the increase in computational complexity is less than 10 times that of FCM.
Acknowledgments
The author wishes to thank the management of Epson Edge for their help and support during the course of this research. We thank Masoud Mazloom and Bahar T. for their help in locating critical pieces of literature needed for this work. We wish to thank Mahsa Pezeshki for proofreading this manuscript.
