Visual feature fusion and its application to support unsupervised clustering tasks

Abstract

The concept of involving users in the loop of analytic workflows refers to the ability to replace heuristics with user input in machine learning and data mining tasks. For supervised tasks, user engagement generally occurs via the manipulation of training data. But for unsupervised tasks, user involvement is limited to changes in the algorithm parametrization or the input data representation, also known as features. Typically, different types of features can be extracted from raw data, and the careful selection of the extraction strategy allows users to have more control over unsupervised tasks. Nevertheless, since there is no perfect feature extractor, the combination of multiple sets of features has been explored through a process called feature fusion. Feature fusion can be readily performed when the machine learning or data mining algorithms have a cost function, such as accuracy for classification tasks. However, when such a function does not exist, user support needs to be provided, otherwise the process is impractical. In this article, we present a novel feature fusion approach that employs data samples and visualization to allow users to not only effortlessly control the combination of different feature sets but also understand the attained results. The effectiveness of our approach is confirmed by a comprehensive set of qualitative and quantitative experiments, opening up different possibilities for user-guided analytical scenarios. The ability of our approach to provide real-time feedback for feature fusion is exploited in the context of unsupervised clustering techniques, where users can perform an exploratory process to discover the best combination of features that reflects their individual perceptions about similarity.

Keywords

Feature fusion, dimensionality reduction, visual analytics, user interaction

Introduction

Machine learning and data mining techniques are, in general, split into supervised and unsupervised approaches or a combination of both. In supervised approaches, user knowledge can be input into the analytical process through sets of processed data instances. In unsupervised approaches, knowledge can be added by changing algorithm parameters or the data representation, also known as features. For unsupervised methods, therefore, the challenge is not only to define the most appropriate set of parameters but also to find the data representation that best expresses user knowledge or expectation.

Depending on the application domain (e.g. text or image), there exist several approaches to constructing features, each providing complementary information about the original or raw data. Since there is no perfect feature, the idea of joining different representations is straightforward. This process is called data or feature fusion,¹ and it can occur through combining the features’ vector representation or merging distances calculated from them. When the machine learning or data mining task involves a cost function, for instance, classification accuracy, it can be used to guide the combination. However, for tasks, like clustering^2,3 or multidimensional projection,^4,5 where such a function does not exist, support needs to be provided to allow users to build proper combinations. Otherwise, in practice, data fusion is impossible or useless given the abundance of possible combinations.

In this article, we present a novel feature fusion approach that allows users to control and understand the fusion of different feature sets. Starting with a small sample, users employ a simple widget to define the weights for the combination and observe in real-time the outcome through a scatterplot-based visualization. Once users find the weighted combination that best matches their point of view of similarity, the same weights can be used to combine the complete data set. In this way, we not only allow users to test different combinations easily but also enable the interpretation of the attained results.

In summary, the main contributions of this article are the following:

A novel feature fusion technique that allows users to explore and understand different combinations of features in real-time;

An approach to input user knowledge via controlling similarity relationships in unsupervised tasks with much more flexibility than parameter tweaking;

An interactive visualization-assisted tool for the clustering of image collections which allows for real-time tuning of the similarity among images to match user expectations.

Related work

The process of integrating information from multiple sources to produce a unified enhanced data model is called data fusion.¹ The goal is to combine different data representations into a single model aimed at incorporating properties from various sources. Data fusion can occur in different ways, including combining features, that is, the vectorial data representation, or merging distances calculated from the various sources.

The concept of merging features is called feature fusion. Feature fusion aims to generate a unified vectorial data representation based on different sets of features (vectorial representations).^6,7 The most straightforward approach is feature concatenation.^8,9 In concatenation, given the sets of features $F_{1}, F_{2}, \dots, F_{p}$, the unified representation is given by $[F_{1}, F_{2}, \dots, F_{p}]$. Despite its simplicity, concatenation is widely reported in the literature. In Wang et al.,¹⁰ Local Binary Pattern (LBP)¹¹ and Histogram of Oriented Gradients (HOG)¹² features are concatenated to improve performance in pedestrian detection. In Manshor et al.,¹³ Scale-Invariant Feature Transform (SIFT)¹⁴ and boundary-based shape¹⁵ features are concatenated to improve object recognition. In Chu et al.,¹⁶ high, low, and medium layer features of a deep neural network are united to support object detection, and in Chun et al.,¹⁷ color and texture features are progressively concatenated to reduce model complexity in a content-based retrieval framework. Feature concatenation has also been used in the text domain. In Loni et al.,¹⁸ the authors extract seven types of lexical, syntactical, and semantic features and combine subsets of them to improve text classification.

Weights can be used in the concatenation process to control the influence of the different features. In this process, the unified representation is given by $[α_{1} F_{1}, α_{2} F_{2}, \dots, α_{p} F_{p}]$,⁶ where $α_{1}, α_{2}, \dots, α_{p}$ are the weights. In Loni et al.,¹⁹ a weighted concatenation was used to improve text classification by combining lexical, syntactic, and semantic features. In Ma et al.,²⁰ a neural network was used to learn the weights of a concatenation, combining different image features, such as color, shape, and texture, to improve classification accuracy. In You and Tang,²¹ the authors use a saliency detection model to fuse color and texture features through a weighting strategy. First, color and texture features are transformed into saliency features, which are then combined linearly. Unlike the previous weighted techniques, the features are summed instead of concatenated, that is, the unified representation is given by $[α_{1} F_{1} + α_{2} F_{2} + \dots + α_{p} F_{p}]$. Such a combination is possible since the saliency representations have the same dimensionality.

In practice, the feature concatenation is not recommended since it may result in high-dimensional feature vectors leading to the curse of dimensionality problem.⁶ One solution is to apply a dimensionality reduction after the concatenation,²² or to perform a distance fusion. In the distance fusion, instead of combining the vectorial representations, the distances calculated from the representations are combined. If $Δ (F_{i})$ represents the distance matrix calculated from $F_{i}$ , the resulting distance matrix is given by $α_{1} Δ (F_{1}) + α_{2} Δ (F_{2}) + \dots + α_{p} Δ (F_{p})$ . In Degani et al.,²³ a simple normalized combination of distances computed from different types of features is used to cover song identification. The distance fusion can also be performed using weights. In Huang et al.,²⁴ weights are used to combine distances calculated from color and texture features to improve the results of a content-based image retrieval system. In Vadivel et al.,²⁵ distances calculated from color and texture features are also combined to support content-based image retrieval applications. Finally, in Liu et al.²⁶ and in Chu et al.,¹⁶ distances calculated from features extracted from different layers of a deep neural network are combined to improve retrieval tasks.
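The weighted distance fusion described above can be illustrated with a minimal sketch. It assumes Euclidean distances and uses hypothetical toy data and function names (`pairwise_distances`, `fuse_distances`); none of the cited works prescribes this exact implementation.

```python
import math


def pairwise_distances(vectors):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]


def fuse_distances(matrices, weights):
    """Weighted sum of distance matrices, one matrix per feature type."""
    n = len(matrices[0])
    return [[sum(w * m[r][s] for w, m in zip(weights, matrices))
             for s in range(n)] for r in range(n)]


# Hypothetical toy data: three items described by two feature types
# of different dimensionality.
color = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
texture = [[1.0], [1.0], [3.0]]

fused = fuse_distances(
    [pairwise_distances(color), pairwise_distances(texture)],
    weights=[0.5, 0.5],
)
```

Because only the distance matrices are combined, the feature sets may have different dimensionalities, which is what makes this option attractive compared with concatenation.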

Different from data fusion, model fusion combines computational models instead of data. Such a combination can be performed in two different ways: by combining different models (parametrizations) processing a single feature set (data set), or by combining different models processing different feature sets.²⁷ The former is called ensemble learning and has been extensively used in classification tasks. The idea is to combine the predictions from different models using a voting strategy to improve model diversity and classification accuracy.^28,29 Ensembles of classifiers typically outperform single classifiers³⁰ and have been used in different domains, including remote sensing, computer security, financial risk assessment, fraud detection, recommender systems, medical computer-aided diagnosis, and others.^27,31 Similarly, the latter also employs a (weighted) voting strategy to combine different models, but in this case, the models use different sets of features as input. Examples of applications include fruit classification⁹ and sentiment analysis.³²

Common to all these data and model fusion approaches is that combinations can only be performed when a loss function is available to guide the process, like in classification. If such a function does not exist, or there is a degree of subjectivity in the process, the combination without proper user support hampers its applicability in practice or real scenarios, and none of the mentioned approaches offer such support. In this article, we devise an approach to aid the process of feature combination to allow users to control the process to match individual expectations, enabling applications where user judgment is crucial.

Proposed methodology

Our feature fusion approach employs a two-phase strategy to help users define combinations that reflect a particular point of view regarding similarity. Consider an original or raw data set $D = \{d_{1}, d_{2}, \dots, d_{n}\}$ with $n$ elements (e.g. an image collection), and let $F_{1} = \{f_{1}^{1}, \dots, f_{n}^{1}\}, \dots, F_{p} = \{f_{1}^{p}, \dots, f_{n}^{p}\}$ be the different sets of feature vectors extracted from $D$. In the first phase of our approach, initial sample sets $E'_{1}, E'_{2}, \dots, E'_{p}$ are selected from each different set of feature vectors, and the indexes of these vectors are merged to compose a single list of indexes containing all elements captured by the different samplings. After that, new sample sets $E_{1}, E_{2}, \dots, E_{p}$ are built containing the different types of feature vectors. They represent the same elements given by this common list of indexes so that all sample sets have the same number of elements $q$, that is, $q = | E_{1} | = | E_{2} | = \dots = | E_{p} |$. Each sample $E_{i}$ is then mapped to a vectorial representation $U_{i} = \{u_{1}^{i}, u_{2}^{i}, \dots, u_{q}^{i}\} \in R^{m}$ preserving as much as possible the distance relationships, and these representations are combined to generate a single representation $\bar{U} = α_{1} U_{1} + α_{2} U_{2} + \dots + α_{p} U_{p}$, which is then visualized.

The user can then change the feature weights $α_{1}, α_{2}, \dots, α_{p}$ and observe the resulting combination in real-time. Once the sample visualization reflects the user expectations, that is, once proper feature weights are found, the second phase takes place, and these weights are used to combine the complete set of features $F_{1}, F_{2}, \dots, F_{p}$. In this process, the mapped sample representations $U_{1}, U_{2}, \dots, U_{p}$ and the samples $E_{1}, E_{2}, \dots, E_{p}$ are used to construct models to transform each set of features $F_{i}$ to a vectorial representation $V_{i} \in R^{m}$. Since these vectorial representations are embedded in the same space, they can be combined using the weights $α_{1}, α_{2}, \dots, α_{p}$, obtaining the final vectorial representation $\bar{V} = α_{1} V_{1} + α_{2} V_{2} + \dots + α_{p} V_{p}$ that seeks to match the user expectations defined by the sample visualization. Figure 1 outlines our approach showing the involved steps. Next, we detail these steps, starting with the sampling and the distance preservation mapping.

Figure 1.

Overview of our feature fusion process. Initially, a sample is extracted, combined, and visualized. Based on this, the user can test different weights to fuse the features and observe the outcome. Once a sample combination that reflects the user expectation is found, the same weights are used to combine the complete sets of features that can then be used on subsequent data mining tasks, such as clustering.
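The weighted combination used in both phases amounts to a row-wise weighted sum of embeddings that share the same number of rows and the same dimensionality $m$. The following minimal sketch illustrates it; the function name `fuse` and the toy coordinates are illustrative assumptions, not part of the original method description.

```python
def fuse(embeddings, weights):
    """Row-wise weighted sum of aligned embeddings: sum_i alpha_i * U_i."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights of a convex combination must sum to 1")
    fused = []
    for rows in zip(*embeddings):      # row r taken from every embedding
        fused.append([sum(w * x for w, x in zip(weights, coords))
                      for coords in zip(*rows)])  # one coordinate at a time
    return fused


# Hypothetical aligned embeddings of the same two elements (q = 2, m = 2).
u_color = [[0.0, 0.0], [2.0, 0.0]]
u_shape = [[0.0, 4.0], [2.0, 4.0]]

u_bar = fuse([u_color, u_shape], weights=[0.25, 0.75])
```

The same function applies unchanged to the complete embeddings $V_{1}, \dots, V_{p}$ once the weights have been chosen on the sample.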

Sampling and mapping

The first step of our process is sampling. Since users employ the sample visualization to guide the feature fusion process, it is essential to have all possible data patterns from the different features represented. Therefore, we recover samples from each different set of features so as to represent the distribution of each set faithfully.

In this process, we extract samples from each set $F_{1}, F_{2}, \dots, F_{p}$ independently using a cluster-based strategy. We employ the k-means algorithm to create $\sqrt{n}$ clusters, taking the medoid of each cluster as a sample, where $n$ is the number of instances in the raw data set $D$. We set the number of clusters to $\sqrt{n}$ since this is considered a useful heuristic for the upper-bound number of clusters in a data set.³³ As explained, after extracting the sample sets $E'_{1}, E'_{2}, \dots, E'_{p}$, we merge their indexes to define a unified set of indexes, and create the sets $E_{1}, E_{2}, \dots, E_{p}$ having feature vectors with the indexes contained in the unified set of indexes. This is an essential and mandatory step since the sample visualization is constructed based on the combination of all features, and this combination can only be computed if the samples contain vectors representing the same data elements. Also, this increases the chance of representing the different patterns contained in the different sets of features. Notice that the combined sample features $\bar{U}$ will have at most $\sqrt{n} \times p$ instances, enhancing the probability of having samples that represent the distribution and patterns of each set of features while not hampering the computational complexity of the overall process since $p \ll \sqrt{n}$.
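The sampling step can be sketched as follows. This is a minimal illustration under stated assumptions: a plain Lloyd's k-means with a fixed seed, the medoid taken as the cluster member with the smallest total distance to the other members, and hypothetical helper names (`kmeans_labels`, `medoid_indexes`, `unified_sample`). The original implementation may differ in all of these details.

```python
import math
import random


def kmeans_labels(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def medoid_indexes(points, labels):
    """Index of the medoid of every non-empty cluster."""
    chosen = set()
    for c in set(labels):
        members = [i for i, label in enumerate(labels) if label == c]
        cost = lambda i: sum(math.dist(points[i], points[j]) for j in members)
        chosen.add(min(members, key=cost))
    return chosen


def unified_sample(feature_sets):
    """Union of the medoid indexes sampled independently per feature set."""
    n = len(feature_sets[0])
    k = max(1, round(math.sqrt(n)))
    indexes = set()
    for features in feature_sets:
        indexes |= medoid_indexes(features, kmeans_labels(features, k))
    return sorted(indexes)


# Hypothetical data: 16 elements described by two feature types.
rng = random.Random(1)
color = [[rng.random() for _ in range(2)] for _ in range(16)]
shape = [[rng.random() for _ in range(3)] for _ in range(16)]

indexes = unified_sample([color, shape])
samples = [[features[j] for j in indexes] for features in (color, shape)]
```

Here `samples[0]` and `samples[1]` play the role of $E_{1}$ and $E_{2}$: they contain different feature vectors but refer to the same elements, and their size is bounded by $\sqrt{n} \times p$.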

After recovering the samples, we map them to a common $m$ -dimensional space, obtaining their vectorial representation $U_{1}, U_{2}, \dots, U_{p} \in R^{m}$ so that we can combine them to obtain $\bar{U} \in R^{m}$ (for the sample visualization). In this process, each set of samples $E_{i}$ is mapped to $R^{m}$ preserving as much as possible the distance relationships. We do this by minimizing

E_{st} (E_{i}) = \frac{1}{q^{2}} \sum_{r}^{q} \sum_{s}^{q} {(δ (e_{r}^{i}, e_{s}^{i}) - ‖ u_{r}^{i} - u_{s}^{i} ‖)}^{2}

(1)

where $e_{r}^{i}$ and $e_{s}^{i}$ are feature vectors in $E_{i}$ , $δ (e_{r}^{i}, e_{s}^{i})$ is the distance between them, and $u_{r}^{i}$ and $u_{s}^{i}$ are the vectorial representations in the $m$ -dimensional space of $e_{r}^{i}$ and $e_{s}^{i}$ , respectively.

Besides preserving distance relationships, our mapping process aims to align the vectorial representations so that $u_{r}^{i}$ is placed as close as possible to $u_{r}^{j} \forall j \in [1, p], r \in [1, q]$ without affecting the distance preservation of the individual mappings. This is necessary since the unified sample representation is calculated as a convex combination of these representations, that is, $\bar{U} = α_{1} U_{1} + α_{2} U_{2} + \dots + α_{p} U_{p}$, with $\sum α_{i} = 1$, and misalignments could result in meaningless unified representations. First, we calculate the average distance matrix $\bar{Δ} = (1 / p) \sum_{i}^{p} Δ (E_{i})$ by combining the distance matrices of all sets of samples, where $Δ (E_{i}) = (δ_{rs} = δ (e_{r}^{i}, e_{s}^{i}))$ is the distance matrix calculated from $E_{i}$. Then we map $\bar{Δ}$ to the $m$-dimensional space using equation (1) but replacing $δ (e_{r}^{i}, e_{s}^{i})$ by $δ_{rs}$, obtaining $\bar{H} = \{{\bar{h}}_{1}, {\bar{h}}_{2}, \dots, {\bar{h}}_{q}\} \in R^{m}$. The idea is to use this average vectorial representation as a guide to align the vectorial representations $U_{1}, U_{2}, \dots, U_{p}$ by minimizing

E_{al} (E_{i}) = \frac{1}{q^{2}} \sum_{r}^{q} \sum_{s}^{q} {(d ({\bar{h}}_{r}, {\bar{h}}_{s}) - ‖ {\bar{h}}_{r} - u_{s}^{i} ‖)}^{2}

(2)

where $d ({\bar{h}}_{r}, {\bar{h}}_{s})$ is the distance between two instances of the average vectorial representation.
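To make the two error terms concrete, the sketch below evaluates equations (1) and (2) for a small example. It assumes Euclidean distances for $d$; the function names `stress` and `alignment_error` and the toy coordinates are illustrative only.

```python
import math


def stress(delta, u):
    """Equation (1): mean squared gap between given and embedded distances."""
    q = len(u)
    return sum((delta[r][s] - math.dist(u[r], u[s])) ** 2
               for r in range(q) for s in range(q)) / q ** 2


def alignment_error(guide, u):
    """Equation (2): mean squared gap between d(h_r, h_s) and ||h_r - u_s||."""
    q = len(u)
    return sum((math.dist(guide[r], guide[s]) - math.dist(guide[r], u[s])) ** 2
               for r in range(q) for s in range(q)) / q ** 2


# Hypothetical guide embedding and a copy of it translated by (1, 0).
guide = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted = [[1.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
delta = [[math.dist(a, b) for b in guide] for a in guide]

stress_shifted = stress(delta, shifted)
alignment_shifted = alignment_error(guide, shifted)
```

The translated copy preserves every pairwise distance, so its stress is zero, yet its alignment error is positive. This is why stress alone cannot guarantee that embeddings of different feature sets can be meaningfully summed.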

Joining equations (1) and (2) renders the function we optimize in our mapping process seeking to preserve, as much as possible, the distance relationships of the sample set of features $E_{1}, E_{2}, \dots, E_{p}$ in the vectorial representations $U_{1}, U_{2}, \dots, U_{p} \in R^{m}$ while aligning them. This function is given by

E_{mapping} (E_{i}) = λ \cdot E_{st} (E_{i}) + (1 - λ) \cdot E_{al} (E_{i})

(3)

where $λ$ is used to control the importance of the distance preservation and the alignment to the produced vectorial representations. $λ$ is a hyperparameter and can be changed to define a good trade-off between distance preservation and alignment.

To minimize equation (3), we use a stochastic gradient descent approach with a polynomial decay learning rate. We set the initial learning rate to $γ_{0} = 0.1$ , the decay power to $κ = 0.95$ , and the number of iterations $Ω = 100$ , following common choices found in the literature.³⁴ Algorithm 1 outlines our mapping process. Function $RAND (t, k)$ randomly draws $k$ discrete values in the range $[1, 2, \dots, t]$ , and function $INIT ()$ initializes the mapping coordinates, also randomly. We tested a deterministic initialization using Fastmap,³⁵ but the gain in quality does not justify the computational overhead. Also, the first time the function $MAP (\dots)$ is called, $\bar{H}$ is not provided. In this case, $\nabla E_{al}$ is not computed and only $\nabla E_{st}$ is considered to calculate the mapping coordinates.

Notice that we normalize all features $f_{i} \in F_{j}, \forall i, j$ before this process, so that the Euclidean norm $\| f_{i} \| = 1$. Given the triangle inequality $(\| f_{i} - f_{j} \| \leq \| f_{i} \| + \| f_{j} \|)$, this guarantees an upper limit of 2 for the maximum pairwise distance between features. Therefore, the distances are in the same range regardless of the type of feature or its dimensionality, which avoids biasing the process toward the type of feature with the largest maximum distance. In addition, we define the dimensionality $m$ of the resulting mappings $U_{1}, U_{2}, \dots, U_{p} \in R^{m}$ as the largest intrinsic dimensionality of $F_{1}, F_{2}, \dots, F_{p}$, calculated using the maximum likelihood estimation.³⁶ Such dimensionality can also be defined by the user if the target dimensionality is known, such as $m \in \{1, 2, 3\}$ for visualization purposes.
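The effect of the unit-norm normalization can be checked with a short sketch. The helper name `unit_normalize` and the toy vectors are assumptions made for illustration.

```python
import math


def unit_normalize(vector):
    """Scale a feature vector to Euclidean norm 1 (zero vectors are kept)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Hypothetical raw features with very different magnitudes.
a = unit_normalize([300.0, 400.0])
b = unit_normalize([-0.003, -0.004])

norm_a = math.sqrt(sum(x * x for x in a))
distance_ab = math.dist(a, b)
```

The two vectors point in opposite directions, so after normalization their distance reaches the bound of 2 given by the triangle inequality; no pair of normalized features can be farther apart, whatever the feature type.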

Algorithm 1 Algorithm for mapping different feature sets to a common vectorial space.

$\bar{Δ} \leftarrow \frac{1}{p} \sum_{i}^{p} Δ (E_{i})$ ▹ compute the average distance matrix
$\bar{H} = \{{\bar{h}}_{1}, \dots, {\bar{h}}_{q}\} \leftarrow$ MAP $(\bar{Δ}, -, 1.0)$ ▹ map $\bar{Δ}$ into $R^{m}$
for $E_{i} \in E_{1}, E_{2}, \dots, E_{p}$ do
    $U_{i} \leftarrow$ MAP $(Δ (E_{i}), \bar{H}, λ)$
end for

function MAP $(Δ = (δ_{rs}), \bar{H}, λ)$
    $U = \{u_{1}, \dots, u_{q}\} \leftarrow$ INIT() ▹ initialize the mapping
    for $it = 0$ to $Ω$ do
        $γ \leftarrow γ_{0} \times {(1 - \frac{it}{Ω})}^{κ}$ ▹ learning rate
        $R \leftarrow$ RAND $(q, \sqrt{q})$ ▹ draw $\sqrt{q}$ random values
        for $r \in R$ do
            for $s \in [1, 2, \dots, q]$ do
                $\nabla E_{st} \leftarrow (δ_{rs} - \| u_{r} - u_{s} \|) \frac{(u_{r} - u_{s})}{\| u_{r} - u_{s} \|}$
                $\nabla E_{al} \leftarrow (d ({\bar{h}}_{r}, {\bar{h}}_{s}) - \| {\bar{h}}_{r} - u_{s} \|) \frac{({\bar{h}}_{r} - u_{s})}{\| {\bar{h}}_{r} - u_{s} \|}$
                $u_{s} \leftarrow u_{s} - γ (λ \cdot \nabla E_{st} + (1 - λ) \cdot \nabla E_{al})$
            end for
        end for
    end for
    return $U$
end function
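Algorithm 1 can be transcribed into a short runnable sketch. The version below follows the update rules of the listing with the stated defaults ($γ_{0} = 0.1$, $κ = 0.95$, $Ω = 100$), but several details are assumptions: the name `map_to_space`, the uniform random initialization in the unit hypercube, the fixed seeds, skipping pairs with zero distance, and the toy data. It is meant as an illustration rather than the reference implementation.

```python
import math
import random


def stress(delta, u):
    """Equation (1), used here only to check the result."""
    q = len(u)
    return sum((delta[r][s] - math.dist(u[r], u[s])) ** 2
               for r in range(q) for s in range(q)) / q ** 2


def map_to_space(delta, m, guide=None, lam=1.0, gamma0=0.1, kappa=0.95,
                 omega=100, seed=0):
    """Sketch of MAP: SGD on lam * E_st + (1 - lam) * E_al."""
    rng = random.Random(seed)
    q = len(delta)
    u = [[rng.random() for _ in range(m)] for _ in range(q)]   # INIT()
    batch = max(1, round(math.sqrt(q)))
    for it in range(omega):
        gamma = gamma0 * (1 - it / omega) ** kappa             # polynomial decay
        for r in rng.sample(range(q), batch):                  # RAND(q, sqrt(q))
            for s in range(q):
                if s == r:
                    continue
                step = [0.0] * m
                dist = math.dist(u[r], u[s])
                if dist > 0:                                   # stress term
                    coef = lam * (delta[r][s] - dist) / dist
                    step = [coef * (a - b) for a, b in zip(u[r], u[s])]
                if guide is not None:                          # alignment term
                    dist_h = math.dist(guide[r], u[s])
                    if dist_h > 0:
                        target = math.dist(guide[r], guide[s])
                        coef = (1 - lam) * (target - dist_h) / dist_h
                        step = [g + coef * (a - b)
                                for g, a, b in zip(step, guide[r], u[s])]
                u[s] = [x - gamma * g for x, g in zip(u[s], step)]
    return u


# Hypothetical sample: a 3 x 3 grid seen through two feature types,
# the second one stretching the horizontal axis.
grid = [[float(x), float(y)] for x in range(3) for y in range(3)]
wide = [[1.5 * x, y] for x, y in grid]
delta_a = [[math.dist(a, b) for b in grid] for a in grid]
delta_b = [[math.dist(a, b) for b in wide] for a in wide]
delta_avg = [[(a + b) / 2 for a, b in zip(ra, rb)]
             for ra, rb in zip(delta_a, delta_b)]

guide = map_to_space(delta_avg, m=2)                  # no guide, lambda = 1
u_a = map_to_space(delta_a, m=2, guide=guide, lam=0.5, seed=1)
u_b = map_to_space(delta_b, m=2, guide=guide, lam=0.5, seed=2)
start_a = map_to_space(delta_a, m=2, guide=guide, lam=0.5, seed=1, omega=0)
```

Calling the function with `omega=0` returns the random initialization, which gives a baseline for judging how much the optimization reduces the stress of `u_a`.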

Weighted feature combination

Given the sample vectorial representations $U_{1}, U_{2}, \dots, U_{p}$ , we build a set of functions using the process defined in Joia et al.³⁷ to map each feature set $F_{i}$ into its vectorial representation $V_{i} \in R^{m}$ preserving as much as possible the distance relationships while obeying the geometry defined in $U_{i}$ . In this process, each instance $f_{j}^{i} \in F_{i}$ is mapped to the $m$ -dimensional space through an orthogonal local affine transformation $T_{j}^{i} : R^{m^{i}} \to R^{m}$ , where $m^{i}$ is the dimensionality of $F_{i}$ .

The affine transformation $T_{j}^{i} (f) = fM + t$ associated with $f_{j}^{i}$ is defined so as to minimize

\sum_{k} β_{k} {‖ T_{j}^{i} (e_{k}^{i}) - u_{k}^{i} ‖}^{2}

(4)

where $β_{k} = \| e_{k}^{i} - f_{j}^{i} \|^{- 2}$, with $e_{k}^{i}$ the original feature representation of the $k$th sample in $E_{i}$.

Equation (4) can be re-written in the matrix form $∥ C (AM - B) ∥_{F}$ , where $∥ \cdot ∥_{F}$ denotes the Frobenius norm, $C$ is a diagonal matrix with entries $C_{ii} = \sqrt{β_{i}}$ , and $A$ and $B$ are matrices with the jth row given by the vectors

e_{j}^{i} - \frac{\sum_{k} β_{k} e_{k}^{i}}{\sum_{k} β_{k}} and u_{j}^{i} - \frac{\sum_{k} β_{k} u_{k}^{i}}{\sum_{k} β_{k}}, respectively.

Based on that, $M$ is computed as $M = SR$ , where $S$ and $R$ are obtained from the singular value decomposition of $A^{T} CCB = ST R^{T}$ . Then, the vectorial representation $v_{j}^{i}$ of $f_{j}^{i}$ is given by

v_{j}^{i} = (f_{j}^{i} - \frac{\sum_{k} β_{k} e_{k}^{i}}{\sum_{k} β_{k}}) M + \frac{\sum_{k} β_{k} u_{k}^{i}}{\sum_{k} β_{k}}

(5)

Equation (4) is subject to $M M^{⊤} = I$, which avoids scale and shearing effects and therefore preserves the distance relationships of the input features. Also, notice that the sample vectorial representations $U_{1}, U_{2}, \dots, U_{p}$ dictate the geometry of the embeddings $V_{1}, V_{2}, \dots, V_{p}$. Since they are aligned by the mapping process defined in the previous section, the linear combination $\bar{V} = α_{1} V_{1} + α_{2} V_{2} + \dots + α_{p} V_{p}$ can be performed to obtain the final embedding $\bar{V}$, which incorporates the patterns defined by each set of features, weighted according to the user’s point of view. For more information about this affine transformation and how the sample vectorial representation controls the final results, please refer to Joia et al.³⁷
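The role of the orthogonality constraint can be verified with a short derivation. It is a sketch under one simplifying assumption: both instances $f_{a}$ and $f_{b}$ are mapped by the same transformation $T(f) = (f - c) M + t$, where $c$ and $t$ are shorthand (not used in the original text) for the two weighted centroids in equation (5). Since each instance has its own local transformation in the method, preservation across instances holds only approximately.

```latex
\begin{aligned}
\| T(f_{a}) - T(f_{b}) \|^{2}
  &= \| (f_{a} - c) M + t - (f_{b} - c) M - t \|^{2} \\
  &= \| (f_{a} - f_{b}) M \|^{2} \\
  &= (f_{a} - f_{b}) \, M M^{\top} (f_{a} - f_{b})^{\top} \\
  &= \| f_{a} - f_{b} \|^{2} \quad \text{when } M M^{\top} = I .
\end{aligned}
```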

Feature combination widget

To support the feature sample combination, we create a widget inspired by the strategy presented in Pagliosa et al.³⁸ The idea is to position anchors (circles) representing each different set of features on a circumference, computing the weights $α_{1}, α_{2}, \dots, α_{p}$ according to their distances to a “dial,” which can be freely manipulated by the user, contained in the circumference. If ${\tilde{f}}_{i}$ are the coordinates of the anchor representing the feature $F_{i}$ and $\tilde{d}$ the coordinates of the “dial,” the weight $α_{i}$ related to $F_{i}$ is calculated as

α_{i} = \frac{1}{(\sum_{j}^{p} \frac{{(1 + ‖ {\tilde{f}}_{i} - \tilde{d} ‖)}^{2}}{{(1 + ‖ {\tilde{f}}_{j} - \tilde{d} ‖)}^{2}})}

(6)

Initially, the anchors are equally spaced on the circumference following a random order. However, users can freely move them to produce the desired combinations. Also, to help the perception of the weights, we set the transparency level of the anchors and fonts according to $α_{1}, α_{2}, \dots, α_{p}$ . Figure 2 shows the combination widget. In this example, the “dial” in orange is closer to the anchor representing the feature $F_{1}$ , so the corresponding anchor is more opaque than the others.

Figure 2.

Feature combination widget. Using the orange “dial,” users can control the contributions of the different types of features for the final feature combination.
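Equation (6) is equivalent to normalizing the inverse squared quantities $1 / (1 + \| {\tilde{f}}_{i} - \tilde{d} \|)^{2}$ so that they sum to one. The sketch below computes the weights in that form; the function name `dial_weights` and the anchor coordinates on a unit circle are illustrative assumptions.

```python
import math


def dial_weights(anchors, dial):
    """Equation (6): weights from the distances between anchors and dial."""
    inverse = [1.0 / (1.0 + math.dist(anchor, dial)) ** 2 for anchor in anchors]
    total = sum(inverse)
    return [value / total for value in inverse]


# Hypothetical layout: four anchors equally spaced on a unit circle.
anchors = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

centered = dial_weights(anchors, dial=(0.0, 0.0))
on_first = dial_weights(anchors, dial=(1.0, 0.0))
```

With the dial at the center, all features contribute equally. Placing the dial exactly on an anchor gives that feature the largest weight, but under this formula the other weights never reach zero, so every feature always retains some influence.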

In addition to this design, we have explored another option using multiple sliders, one per feature. Although sliders are commonly used in applications that involve setting multiple parameters, they proved not to be the best choice in our combination scenario. Given that changes in one slider affect the others (it is a convex combination), every user interaction requires adjusting several sliders at once. In our design, users need to manipulate only one dial (and optionally the anchors), providing a much faster exploratory process.

Results and evaluation

In this section, we evaluate our mapping and feature combination processes using different data sets in order to show that the sample manipulation effectively controls the complete feature fusion. Next, we describe the employed data sets, detail how we extract features, and present our quantitative and qualitative evaluation.

Data sets

We use five data sets in our tests, namely STL-10,³⁹ Animals,⁴⁰ Zappos,⁴¹ CIFAR-10,⁴² and Photographers.⁴³ These data sets come from a variety of different domains. STL-10 consists of 13,000 images split into 10 classes of different objects. Similarly, CIFAR-10 contains 60,000 images from 10 commonly seen object categories (e.g. animals, vehicles, and so on) in lower resolution. The Animals data set is more specific and is composed of 30,475 images of animals in 50 categories. Zappos is a data set for shoes with 50,025 images from Zappos.com split into 4 shoe categories. Finally, Photographers consists of 181,948 photos taken by 41 well-known photographers. Table 1 summarizes the data sets, showing the number of instances and classes.

Table 1.

Data sets employed in the evaluations: size and number of classes.

Name	Size	Classes
STL-10	13,000	10
Animals	30,475	50
Zappos	50,025	4
CIFAR-10	60,000	10
Photographer	181,948	41

Features

We use four distinct methods to extract features, representing low-level and high-level image components. Low-level means that the dimensions of the feature vector have no inherent meaning, but represent a basic understanding of the image such as edges or color. High-level features have semantic meaning. For example, they denote the presence of an object or not in the image.

For the low-level features, we represent (1) color using a LAB color histogram, (2) texture using Gabor filters⁴⁴ with 8 orientations and 4 scales, and (3) shape using the HOG technique¹² with a window size of 8. For the high-level features, we extract deep features from the pool5 layer of a pre-trained CaffeNet CNN.⁴⁵ This network was trained on approximately 1.3 million images to classify images into 1000 object categories.

We believe that these features are discriminative for our data sets. For example, we can differentiate a leopard from a panda using a texture extractor. Texture can identify spots on a leopard, as well as differentiate them from other animals. Similarly, color features can be helpful to recognize pandas, where the more common colors are black and white. Also, HOG is helpful to differentiate the type of animals by their shape, for example, quadrupeds from birds. Finally, object recognition can complement the HOG descriptor. These examples can be generalized to other data sets as well.

Quantitative evaluation

To confirm the quality of our approach, we quantitatively evaluate our mapping and feature combination processes. For the mapping process evaluation, the five data sets of Table 1 are randomly sampled 10 times, reducing them to 5% of their original sizes. We sample the data since we cannot execute the mapping process on large data sets given its memory footprint of $O (n^{2})$. Due to the random initialization (see Algorithm 1), we repeat the mapping process test 15 times. Moreover, to ensure a common dimensional space, we calculate the intrinsic dimensionality of each set of features and choose the smallest value. This value is used to do the mapping. The minimum values of intrinsic dimensionality are 57, 71, 91, 41, and 83 for STL-10, Animals, Zappos, CIFAR-10, and Photographer data sets, respectively.

We use stress and alignment error to evaluate the mapping process (see equations (1) and (2), respectively). We summarize our results in Figure 3 varying the value of $λ$ in the range $[0, 1]$ . The stress boxplots (in orange) decrease as $λ$ increases whereas the alignment boxplots (in blue) have the opposite behavior. This is the expected outcome since larger values of $λ$ preserve the distance relationships, and smaller values align the data.

Figure 3.

Comparing distance preservation versus alignment error with varying $λ$ . The best trade-off is achieved in the range $[0.45, 0.65]$ . The blue and orange lines connect the average values of the boxplots.

Setting $λ = 1$ preserves as much as possible the original distance relationships. This is reflected by the average stress $\bar{E_{st}} = 0.0009$, but it does not ensure any alignment $(\bar{E_{al}} = 2.0343)$. In contrast, $λ = 0$ delivers an almost perfect alignment $(\bar{E_{al}} = 0.0001)$, but it does not enforce the distance preservation $(\bar{E_{st}} = 0.0345)$. In this article, we are interested in the best trade-off between distance preservation and alignment so that the alignment is obtained without penalizing the overall distance preservation. According to our experiments, this is achieved in the range $λ \in [0.45, 0.65]$, where both stress and alignment errors are nearly 0 (see Figure 3).

For visual inspection, we map the samples setting the target dimensionality to two. We show the results for the STL-10, Zappos, and CIFAR-10 data sets in Figures 4–6, respectively. In these figures, the points are colored according to image class. The stress and alignment error values are shown at the bottom-left corner of each projection. To show the influence of $λ$ on the mapping process, we vary $λ$ from $1.0$ down to $0.2$. Notice that in the first column of all figures, the visual representations of each feature (color, texture, border, and object) are misaligned among themselves. That is, points belonging to the same class are placed in different regions on the different projections. For instance, in the first column of Figure 4, the points colored in green are positioned at the bottom in the projection of the color features, on the right in the projection of texture features, and on the left in the projections of the border and object features. The second column depicts results with $λ = 0.8$. The projections start to align, and points belonging to the same class are placed in similar regions across the different projections. We observe a small increase in the stress error, but the alignment error decreases considerably compared with the first column (see the $E_{al}$ measure at the bottom-left corner). The same behavior is observed in the remaining columns. As expected, as $λ$ decreases, the alignment improves (alignment error decreases), and the distance preservation degrades (stress error increases). However, the stress changes are minimal, showing that our approach is capable of aligning different feature spaces while preserving the distance relationships in them.

Figure 4.

Resulting 2D mapping process for the STL-10 data set. As $λ$ decreases, the features get more aligned (see last column). Bottom-left numbers correspond to stress and alignment error.

Figure 5.

Resulting 2D mapping process for the Zappos data set. As $λ$ decreases, the features get more aligned (see last column). Bottom-left numbers correspond to stress and alignment error.

Figure 6.

Resulting 2D mapping process for the CIFAR-10 data set. As $λ$ decreases, the features get more aligned (see last column). Bottom-left numbers correspond to stress and alignment error.

For the feature combination, we assess the degree to which the distance relationships of the sample are preserved in the feature fusion of the whole data set, in order to demonstrate the effect of the user's manipulation of the sample on the produced data set. In this evaluation, we first randomly generate $30$ different weight combinations, each summing to $1$ , and apply them to the sample data. Then, we reuse these weights for the fusion of the whole data set and measure whether the distance relationships induced by the weights on the sample are preserved in the whole data set. We use the Nearest Neighbor Measure (NNM)⁴⁶ to evaluate the degree of preservation.
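As an illustration of the first step of this evaluation, the sketch below draws random weight combinations that sum to 1. The sampling scheme (uniform draws followed by normalization), the function name `random_weight_combinations`, and the fixed seed are assumptions made for this example; the text does not specify how the weights were generated.

```python
import random

def random_weight_combinations(n_combinations, n_features, seed=0):
    """Draw weight vectors whose entries are positive and sum to 1."""
    rng = random.Random(seed)
    combinations = []
    for _ in range(n_combinations):
        # 1 - random() lies in (0, 1], so the total is never zero.
        raw = [1.0 - rng.random() for _ in range(n_features)]
        total = sum(raw)
        combinations.append([value / total for value in raw])
    return combinations

# 30 combinations over four feature sets (e.g. color, texture, border, object).
weights = random_weight_combinations(30, 4)
```

Each vector in `weights` would then be applied first to the sample and afterwards to the complete data set, so that the two fusions can be compared.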

The NNM quantifies distance preservation through the similarity between each instance in the whole data set and its nearest neighbor in the sampled data. The NNM is given by equation (7), where $D_{i}$ is the smallest distance between the $i$th instance in the complete data set and the instances in the sample, and $N$ denotes the number of instances in the complete data set. In the original article, the authors normalized each dimension of the data to the range $[0, 1]$ . However, this discards the relative magnitude of the dimensions, hampering our feature weighting process. Therefore, we replace the per-dimension normalization with a unit vector normalization per instance to avoid this effect. The output of NNM lies within the interval $[0, 1]$ , with larger values indicating better results.

$$NNM = 1.0 - \frac{\sum_{i=1}^{N} D_{i}}{N} \tag{7}$$
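A minimal sketch of equation (7) with the per-instance unit vector normalization is given below. The function names and the choice of Euclidean distance are assumptions of this example, since the distance function is not stated here, and the sketch does not by itself enforce the $[0, 1]$ output range.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def unit_normalize(vector):
    """Scale one instance to unit Euclidean length (zero vectors are kept)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def nnm(complete, sample):
    """NNM = 1 - (sum_i D_i) / N, where D_i is the distance from instance i
    of the complete data to its nearest neighbor in the sample."""
    if not complete or not sample:
        raise ValueError("both data sets must be non-empty")
    complete = [unit_normalize(x) for x in complete]
    sample = [unit_normalize(s) for s in sample]
    # Euclidean distance is an assumption of this sketch.
    total = sum(min(euclidean(x, s) for s in sample) for x in complete)
    return 1.0 - total / len(complete)
```

When every instance of the complete data set coincides (after normalization) with some sample instance, all $D_{i}$ are zero and the function returns 1.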

We compare the NNM values of our feature fusion with two baselines: feature concatenation and distance fusion (see section “Related Work”). The boxplots in Figure 7 show that our approach outperforms both baselines. The mean NNM of our method is $0.9365$ , whereas the baselines achieve $0.8877$ and $0.8958$ , respectively, a relative improvement of about $5 %$ . Hence, our method more accurately preserves the data patterns present in the sample, and their distribution, in the fusion of the whole data set.

Figure 7.

The NNM evaluation. We compare our user-guided feature fusion approach (light green box) with two baselines: feature concatenation and distance fusion. Our strategy achieves higher NNM values than both baselines, indicating that the similarity patterns observed in the sample data combination are preserved in the complete data set fusion.

Qualitative evaluation

Besides the quantitative evaluation, we also present a projection-based example for qualitative evaluation. The idea is to project the complete combined data set $(\bar{V})$ and show that the patterns observed in the sample projection $(\bar{U})$ are preserved in the complete projection. In this example, we use our approach to explore large photo collections considering different user perspectives on the similarity among images. We use the photographers data set. In addition to the features described in subsection “Features”, we create a new set of features to describe each photographer: we take the Wikipedia article about each photographer and construct a bag-of-words vector to represent it. Photos by the same photographer share the same feature vector, and the similarity among photos is defined as the similarity between the texts describing their photographers.
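To make the photographer features concrete, the following sketch builds bag-of-words vectors from short texts and compares them. The tokenization rule, the use of raw term counts, the cosine similarity, and the example texts are all assumptions for illustration; they are not taken from the actual Wikipedia articles or from the authors' pipeline.

```python
import math
import re
from collections import Counter

def bag_of_words(texts):
    """Map {name: text} to {name: term-count vector} over a shared vocabulary."""
    counts = {name: Counter(re.findall(r"[a-z]+", text.lower()))
              for name, text in texts.items()}
    vocabulary = sorted(set().union(*counts.values()))
    vectors = {name: [c[term] for term in vocabulary]
               for name, c in counts.items()}
    return vocabulary, vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

# Hypothetical one-line descriptions standing in for full articles.
texts = {"A": "portrait portrait studio", "B": "Portrait landscape"}
vocabulary, vectors = bag_of_words(texts)
similarity = cosine_similarity(vectors["A"], vectors["B"])
```

Every photo would then be assigned the vector of its photographer, so two photos by the same photographer have identical text features.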

As explained before, based on a sample and using our approach, users can combine different features by employing the combination widget (see Figure 2) until the sample visualization reflects a particular understanding of the similarity among photos. Figure 8 shows three different combinations. The first (Figure 8(a)) gives more importance to color and to the objects contained in the photos, with little importance given to information about photographers. The second (Figure 8(b)) follows the idea of photographic style from Thomas and Kovashka,⁴³ fusing object and Wikipedia features. Finally, the third (Figure 8(c)) shows the result of combining texture, borders, and a small amount of color.

Figure 8.

User-defined similarity configurations. Based on a small sample, users can interactively combine different features, seeking the combination that best matches a particular point of view. This combination is then propagated to the entire data set for a complete projection. The widget at the bottom-right helps to control the combination and indicates the importance of each feature. (a) The combination gives more importance to color and to the objects contained in the photos. (b) The combination gives more importance to the objects and to the information about the photographers. (c) The combination gives more importance to texture, border, and color features.

Once a feature combination has been defined that reflects the users’ point of view, a projection representing the complete photo collection is constructed. Figure 9 shows the produced layout using the weights established in Figure 8(a). In this figure, since the color is an important feature, we observe a clear separation between black-and-white and colorful images. Also, given the weight assigned to the feature representing objects, it is possible to notice a separation among photos of people, landscapes, and houses in certain regions of the figure. We zoom in on two small portions of the projection (at the top and at the right side) to show this effect. On the colored images (right), we observe images with sky and forest. On the gray images (top), we observe houses, sky, and forest.

Figure 9.

Photographers data set projection using the weight combination of Figure 8(a). Since a larger weight is assigned to the color feature, a clear global separation between black-and-white and colorful photos can be observed. This configuration also considers the presence of objects and photographer information.

Figure 10 depicts the final projection using the weights defined in Figure 8(b). In this figure, we zoom in on a region at the bottom-left, where we mainly find portrait images. Recall that in this weight combination, our goal was to represent photographic style. The selected photos are by two well-known photographers, Van Vechten and Curtis, who mostly worked with portraits and have similar styles.⁴³ These examples qualitatively attest that the similarity patterns observed in the sample projection are preserved in the complete projection, corroborating the quantitative results measured using the NNM index.

Figure 10.

Photographers data set projection using the weight combination of Figure 8(b). A larger weight is assigned to the object and photographer features. Photos with similar visual features are grouped together. The zoomed-in region (bottom-left) shows photos by well-known photographers who share a similar style (portrait photos).

Application: user-guided clustering

One of the most appealing application scenarios for our approach is to assist unsupervised data mining strategies, such as clustering techniques. Clustering techniques seek to split a set of data instances into groups so that instances belonging to the same group are more similar to each other than to those in other groups. Typically, clustering is a subjective task that depends on the way similarity is computed, which can vary from user to user. The ability to explicitly control and understand similarity is the benefit our approach offers.

Next, we present an example of using our approach to control the clustering results of a sample from the photographers data set containing 7800 instances. In this example, we define different weights for the features and observe how this influences the composed groups. In Figure 11, we analyze the transition between the color and Wikipedia features. The color weight starts at $1$ and decreases to $0$ as the Wikipedia weight increases from $0$ to $1$ . We generate new fused features in each intermediate state and, for each of them, compute clusters using the mean shift algorithm.⁴⁷ We opt for this algorithm because it does not require the number of clusters as input, so the produced results directly reflect the provided similarity (or combination of features).
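To illustrate why mean shift does not need the number of clusters, the sketch below implements a simplified flat-kernel variant in plain Python. The flat kernel, the `bandwidth` parameter, the rule for merging converged modes, and the toy points are assumptions of this example; it is not the implementation used in the experiments.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift; returns (labels, centers)."""
    modes = [list(p) for p in points]
    for _ in range(max_iter):
        largest_move = 0.0
        shifted = []
        for mode in modes:
            # Points inside the window centered at the current mode.
            window = [p for p in points if euclidean(mode, p) <= bandwidth]
            if not window:  # defensive guard against rounding issues
                shifted.append(mode)
                continue
            center = [sum(axis) / len(window) for axis in zip(*window)]
            largest_move = max(largest_move, euclidean(mode, center))
            shifted.append(center)
        modes = shifted
        if largest_move < tol:
            break
    # Modes that converged close to each other define one cluster.
    centers, labels = [], []
    for mode in modes:
        for index, center in enumerate(centers):
            if euclidean(mode, center) <= bandwidth / 2:
                labels.append(index)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return labels, centers

# Two well-separated toy groups; the number of clusters is not given.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
labels, centers = mean_shift(points, bandwidth=1.0)
```

The number of groups emerges from the density of the fused features and the bandwidth, which is why a change in the feature weights can change the number of clusters.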

Figure 11.

Using parallel sets to visualize cluster formation. The parallel sets visualization shows nine clustering results computed using the mean shift algorithm. Axes $C_{0}$ and $C_{8}$ represent the clustering results for the color and Wikipedia features, respectively. Intermediate clusterings denote combinations of these features.

We display the different clustering configurations (one for each combination) using parallel sets.⁴⁸ In the parallel sets, the vertical axes represent the different clusterings $C_{k}$ , where $k$ indicates a different weight combination of features. Each axis contains a set of groups, with different colors representing different groups. Curves between axes $k$ and $k + 1$ are colored using the colors of the groups in $C_{k}$ . This coloring scheme improves the perception of membership changes between different clustering results. To reduce clutter, we implement a simple filtering strategy to remove non-relevant curves. For each group in $C_{k}$ , we evaluate how many of its instances are redistributed to each group in $C_{k + 1}$ . If this quantity is less than a percentage threshold, the curves representing these instances are removed. This threshold is a user parameter and can be adjusted accordingly.
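The filtering strategy can be sketched as follows. The function name and the interpretation of the percentage threshold as a fraction of the source group's size are assumptions of this example.

```python
from collections import Counter

def filtered_flows(labels_k, labels_next, threshold=0.1):
    """Return {(group in C_k, group in C_{k+1}): count} for the curves kept."""
    if len(labels_k) != len(labels_next):
        raise ValueError("both clusterings must label the same instances")
    flows = Counter(zip(labels_k, labels_next))
    group_sizes = Counter(labels_k)
    # Keep a curve only if it carries at least `threshold` of its source group.
    return {
        (source, target): count
        for (source, target), count in flows.items()
        if count / group_sizes[source] >= threshold
    }

# Toy example: one of the ten instances of group 0 moves to group 1.
labels_k = [0] * 10 + [1] * 10
labels_next = [0] * 9 + [1] + [1] * 10
kept = filtered_flows(labels_k, labels_next, threshold=0.2)
```

In this toy case the curve from group 0 to group 1 carries only one tenth of its source group, so it is dropped at a threshold of 0.2 and kept at a threshold of 0.1.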

Figure 11 shows parallel sets with $9$ axes representing the clustering results $(C_{0}, C_{1}, . . ., C_{8})$ with a filtering threshold equal to 0.1. The $C_{0}$ axis represents the results for the color feature only (no Wikipedia feature is considered). It has two groups, one containing colorful photos and the other black-and-white photos. $C_{1}$ shows fused features with a weight of $0.79$ for color and $0.21$ for Wikipedia. Most of the two groups present in $C_{0}$ remain in $C_{1}$ , but some instances change their membership.

From $C_{4}$ onward, more groups are formed, and the division between colorful and black-and-white photos is lost. Finally, $C_{8}$ represents the clustering for the Wikipedia feature only (no color feature). Note that from clustering $C_{6}$ to $C_{8}$ , the groups are more stable; that is, most of the items in a particular group tend to be assigned to the same group as $k$ increases. To analyze the semantic meaning of the groups, we select the purple group $(g_{7})$ from $C_{8}$ and trace its corresponding instances backward. The photos in this group were taken by Brumfield, Gottscho, and Horydczak, who are three iconic American photographers. We map the data from $C_{8} (g_{7})$ to the visual space using the force-scheme technique.⁴⁹ Figure 12(d) shows the result, where each photo border is colored with its group color. As can be observed, the photos are similar in content and appearance, since the work of these photographers is focused on architectural photography. We also observe a mixture of colorful and black-and-white photos in this group, whereas clustering $C_{0}$ shows a clear separation between these two types of photos (Figure 12(a)). Following the sequence of curves from clustering $C_{8}$ back to $C_{0}$ , it is possible to analyze the $g_{7}$ group and see where the photos of this group come from. We highlight this path in the parallel sets with darker colors for easy navigation. From $C_{8}$ to $C_{4}$ , the groups are stable. Instances of these groups are also projected and depicted in Figure 12(d) and (c). In $C_{4}$ , $g_{1}$ is formed with instances from the $C_{3} (g_{1})$ and $C_{3} (g_{3})$ groups. The corresponding instances from $C_{3}$ are mapped in Figure 12(b). Note that in $C_{3}$ , colorful and black-and-white photos are mixed.

Figure 12.

Projections produced by the force-scheme technique for the purple group $(g_{7})$ of instances in $C_{8}$ . The color of the border indicates the group the photo belongs to. The purple group instances are selected from $C_{0}$ , $C_{3}$ , and $C_{4}$ and mapped in (a), (b), and (c), respectively. The projection of the purple group in $C_{8}$ is shown in (d). In (a), two groups are visible. In (b), these groups are less separate. Both in (c) and (d), there is only one group according to the clustering technique. However, in (d) there is a small group inside this group that distinguishes three photographers with similar styles.

Parallel sets are useful tools to show the differences between clustering results. However, they do not show the similarity relationships between instances. To explore clusters and the relationships between instances, we also visualize the pairwise dissimilarity matrix produced from a given feature combination as a heatmap. In our representation, similar items are rendered in brown, whereas dissimilar ones are rendered in pale orange. The order of the rows and columns is given by the position of the leaves in a dendrogram generated by average linkage hierarchical clustering.⁵⁰⁻⁵²
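The row and column ordering can be sketched with a naive average linkage procedure, shown below. The function names, the left-to-right concatenation of leaves at each merge, and the toy matrix are assumptions of this example; the quadratic search is meant for illustration only and would be too slow for large matrices.

```python
def average_linkage_order(dist):
    """Leaf order of a dendrogram built by naive average linkage."""
    if not dist:
        return []
    clusters = [[i] for i in range(len(dist))]

    def average_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Leaves of the merged node: left child followed by right child.
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0]

def reorder(dist, order):
    """Permute the rows and columns of a square matrix by the same order."""
    return [[dist[i][j] for j in order] for i in order]

# Toy dissimilarities: items 0 and 2 are close, and so are items 1 and 3.
dist = [[0, 9, 1, 8],
        [9, 0, 8, 2],
        [1, 8, 0, 9],
        [8, 2, 9, 0]]
order = average_linkage_order(dist)
heatmap = reorder(dist, order)
```

After reordering, similar items sit next to each other, so groups appear as dark blocks along the main diagonal of the heatmap.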

Figure 13 shows the dissimilarity matrices for the same weight combinations that generate $C_{0}$ , $C_{3}$ , $C_{4}$ , and $C_{8}$ in the parallel sets. In Figure 13(a), we can spot two groups (two dominant brown areas on the main diagonal). The colored margins indicate the groups assigned to the instances by the clustering algorithm. In Figure 13(b), the two major groups remain, but sub-groups can be noticed inside the larger ones. Figure 13(c) also shows two significant brown areas on the diagonal; however, these groups have the same size. In the previous matrices, one group is bigger than the other because the color feature has more weight and the data set has more black-and-white than colorful photos. In Figure 13(c), the Wikipedia feature begins to contribute more to the combination, forming clusters that group photos according to both style and color. Finally, in Figure 13(d), there are several groups on the main diagonal and two major groups that intersect. A possible explanation is that some photographers tend to shoot similar object categories but belong to different schools of thought.⁴³

Figure 13.

Dissimilarity matrices for four different weight combinations. (a) represents the color feature only, showing two groups (brown areas on the main diagonal). (b) and (c) represent different weights for the color and Wikipedia features, both displaying two major brown areas but with different sizes. (d) represents the Wikipedia feature only, with several small groups on the diagonal and two major intersecting groups. These visualizations show how the different weight combinations influence the similarity calculation between instances, matching the group formation presented by the parallel sets.

Discussion and limitations

The quantitative and qualitative results presented in section “Results and Evaluation” show that the patterns observed in the sample feature fusion $(\bar{U})$ are accurately represented in the final feature fusion $(\bar{V})$ , suggesting that similarity relationships can be successfully controlled through the manipulation of small portions of large data sets. However, as in any sampling process, we are susceptible to misrepresentations of patterns that could reduce the quality of the final result. We tried to address this by using a clustering technique to sample the data from the different features and then merging these samples. Nevertheless, it is not possible to guarantee that misrepresentation will not occur. This is the price of supporting a real-time approach, and real-time feedback is a critical element in an exploratory process that relies on user judgments to define the “correct” answer, that is, the combination of features that reflects a personal point of view about similarity. In this scenario, the ability to let users perform and check different feature combinations instantaneously is of vital importance. Moreover, allowing users to interact with small samples reduces the cognitive load imposed on them, especially when handling large data sets.

A critical aspect of our approach is that the resulting feature fusion is an $m$ -dimensional data set and is therefore not intended for visualization purposes (although it can be visualized like any multidimensional data set). Thus, the user is involved in our process through the feature combination widget and the analysis of the weight combination via the sample visualization (first phase of our approach). Although we have presented visual representations of the clustering results, they are meant only to show how effectively the feature combination can be used to control these results. Our intention was not to provide a new visualization for clustering methods. Moreover, the interaction is not with the clustering parameters but with the input space, so our approach could be employed with other subjective unsupervised techniques that rely on user expectations.

Finally, our approach supports unsupervised tasks where the result is subject to individual perception, such as the definition of similarity among images, instead of proposing a mechanism to improve a quality measure, such as accuracy for classification. For enhancing quality measures, especially in supervised classification, user interaction is usually not necessary, and deep learning approaches tend to produce state-of-the-art results, particularly when processing image collections. However, for subjective unsupervised methods, user expectations and interpretability are essential to controlling the final results and to understanding them (based on the meaning of the features). Supporting both is the core contribution of our approach.

Conclusion

In this article, we proposed a novel approach for feature fusion that allows users to control the fusion process. It is a two-step strategy where, starting from a small sample of the input data, users can quickly test different feature combinations and check the resulting similarity relationships in real time. Once a combination that matches the user's expectation is defined, it is propagated to the whole data set through an affine transformation. Our experiments show that the combination of the complete data set preserves the similarities of the sample configuration, making our approach a flexible mechanism to assist the feature fusion process.

We have applied the proposed feature fusion approach to allow users to control and understand the results of clustering techniques. Clustering is one of the most attractive application scenarios for our approach given the subjectivity involved in unsupervised tasks. Currently, visualization-assisted clustering techniques only allow users to control the results by changing technique parameters.⁵³⁻⁵⁶ Enabling users to guide the input feature configuration provides much more flexible control, since users can explicitly steer the semantics of the input data and the similarity relationships (e.g. images are similar because of their color vs images are similar because of the objects they contain). Therefore, the proposed methodology controls the cluster formation while still allowing for a natural interpretation of the composed groups.

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research has been funded by CAPES-Brazil and the Emerging Leaders in the Americas Program (ELAP) with the support of the Government of Canada.

ORCID iDs

Gladys M Hilasaca

Fernando V Paulovich

References

1. Bostrom, Andler, Brohede, et al. On the definition of information fusion as a field of research. Technical report, University of Skövde, Skövde, 2007.

2. Wunsch II. Survey of clustering algorithms. Trans Neur Netw 2005; 16(3): 645–678.

3. Tan, Steinbach, Kumar. Introduction to data mining. 1st ed. Boston, MA: Addison-Wesley Longman, 2005.

4. Nonato, Aupetit. Multidimensional projection for visual analytics: linking techniques with distortions, tasks, and layout enrichment. IEEE Trans Vis Comput Graph 2018; 25: 2650–2673.

5. Sacha, Zhang, Sedlmair, et al. Visual interaction with dimensionality reduction: a structured literature analysis. IEEE Trans Vis Comput Graph 2017; 23(1): 241–250.

6. Mangai, Samanta, Das, et al. A survey of decision fusion and feature fusion strategies for pattern classification. IETE Tech Rev 2010; 27(4): 293–307.

7. Sudha, Ramakrishna. Comparative study of features fusion techniques. In: Proceedings of the 2017 international conference on recent advances in electronics and communication technology (ICRAECT), Bangalore, India, 16–17 March 2017, pp. 235–239. New York: IEEE.

8. Anne, Kuchibhotla, Vankayalapati. Acoustic modeling for emotion recognition. Berlin: Springer, 2015.

9. Kuang, Chan, Liu, et al. Fruit classification based on weighted score-level feature fusion. J Electronic Imaging 2016; 25(1): 013009.

10. Wang, Han, Yan. An HOG-LBP human detector with partial occlusion handling. In: Proceedings of the 2009 IEEE 12th international conference on computer vision, Kyoto, Japan, 29 September–2 October 2009, pp. 32–39. New York: IEEE.

11. Ahonen, Hadid, Pietikäinen. Face recognition with local binary patterns. In: Proceedings of the 9th European conference on computer vision (Euro’15), Prague, 11–14 May 2004, pp. 469–481. Berlin: Springer.

12. Dalal, Triggs. Histograms of oriented gradients for human detection. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, 20–25 June 2005, pp. 886–893. New York: IEEE.

13. Manshor, Rahiman, Mandava, et al. Feature fusion in improving object class recognition. J Comput Sci 2012; 8: 1321–1328.

14. Lowe. Object recognition from local scale-invariant features. In: Proceedings of the 7th IEEE international conference on computer vision, Kerkyra, Greece, 20–27 September 1999, pp. 1150–1157. New York: IEEE.

15. Gonzalez, Woods, Eddins. Digital image processing using MATLAB. Upper Saddle River, NJ: Prentice Hall, 2003.

16. Chu, Guo, Leng. Object detection based on multi-layer convolution feature fusion and online hard example mining. IEEE Access 2018; 6: 19959–19967.

17. Chun, Kim, Jang. Content-based image retrieval using multiresolution color and texture features. IEEE T Multimedia 2008; 10(6): 1073–1084.

18. Loni, Khoshnevis, Wiggers. Latent semantic analysis for question classification with neural networks. In: Proceedings of the 2011 IEEE workshop on automatic speech recognition & understanding, Waikoloa, HI, 11–15 December 2011, pp. 437–442. New York: IEEE.

19. Loni, Van Tulder, Wiggers, et al. Question classification by weighted combination of lexical, syntactic and semantic features. In: Proceedings of the 14th international conference on text, speech and dialogue (TSD’ 11), Pilsen, 1–5 September 2011, pp. 243–250. Berlin; Heidelberg: Springer.

20. Yang, Zhang, et al. Multi-feature fusion deep networks. Neurocomput 2016; 218(C): 164–171.

21. You, Tang. Visual saliency detection based on adaptive fusion of color and texture features. In: Proceedings of the 2017 3rd IEEE international conference on computer and communications (ICCC), Chengdu, China, 13–16 December 2017, pp. 2034–2039. New York: IEEE.

22. Zhu. Quick retrieval method of massive face images based on global feature and local feature fusion. In: Proceedings of the 2017 10th international congress on image and signal processing, biomedical engineering and informatics (CISP-BMEI), Shanghai, China, 14–16 October 2017, pp. 1–6. New York: IEEE.

23. Degani, Dalai, Leonardi, et al. A heuristic for distance fusion in cover song identification. In: Proceedings of the 2013 14th international workshop on image analysis for multimedia interactive services (WIAMIS), Paris, 3–5 July 2013, pp. 1–4. New York: IEEE.

24. Huang, Chan PPK, WWY, et al. Content-based image retrieval using color moment and Gabor texture feature. In: Proceedings of the 2010 international conference on machine learning and cybernetics, vol. 2, Qingdao, China, 11–14 July 2010, pp. 719–724. New York: IEEE.

25. Vadivel, Majumdar, Sural. Characteristics of weighted feature vector in content-based image retrieval applications. In: Proceedings of the international conference on intelligent sensing and information processing, Chennai, India, 4–7 January 2004, pp. 127–132. New York: IEEE.

26. Liu, Guo, et al. Fusion of deep learning and compressed domain features for content-based image retrieval. IEEE T Image Process 2017; 26(12): 5706–5717.

27. Kim, Lin, Choi, et al. A design framework for hierarchical ensemble of multiple feature extractors and multiple classifiers. Pattern Recogn 2016; 52(C): 1–16.

28. Mendes-Moreira, Soares, Jorge, et al. Ensemble approaches for regression: a survey. ACM Comput Surv 2012; 45(1): 10:1–10:40.

29. Dietterich TG. Ensemble methods in machine learning. In: Proceedings of the first international workshop on multiple classifier systems (MCS ’00), Cagliari, 21–23 June 2000, pp. 1–15. London: Springer.

30. Schneider, Jäckle, Stoffel, et al. Visual integration of data and model space in ensemble learnings. In: Proceedings of the IEEE visualization in data science (VDS), Phoenix, AZ, 1 October 2017, pp. 75–87. New York: IEEE.

31. Woźniak, Graña, Corchado. A survey of multiple classifier systems as hybrid systems. Inf Fusion 2014; 16: 3–17.

32. Xia, Zong. Ensemble of feature sets and classification algorithms for sentiment classification. Inf Sci 2011; 181: 1138–1152.

33. Pal, Bezdek. On cluster validity for the fuzzy c-means model. IEEE T Fuzzy Syst 1995; 3(3): 370–379.

34. Wilson, Martinez TR. The need for small learning rates on large problems. In: Proceedings of the international joint conference on neural networks (IJCNN’01) (Cat. No.01CH37222), vol. 1, Washington, DC, 15–19 July 2001, pp. 115–119. New York: IEEE.

35. Faloutsos, Lin. FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. SIGMOD Rec 1995; 24(2): 163–174.

36. Levina, Bickel PJ. Maximum likelihood estimation of intrinsic dimension. In: Proceedings of the 17th international conference on neural information processing systems (NIPS’04), Vancouver, BC, Canada, 1 December 2004, pp. 777–784. Cambridge, MA: MIT Press.

37. Joia, Coimbra, Cuminato, et al. Local affine multidimensional projection. IEEE Trans Vis Comput Graph 2011; 17(12): 2563–2571.

38. Pagliosa, Paulovich, Minghim, et al. Projection inspector: assessment and synthesis of multidimensional projections. Neurocomputing 2015; 150(Part B): 599–610.

39. Coates, Lee. An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics (AISTATS 2011), Fort Lauderdale, FL, 11–13 April 2011, pp. 215–223, http://proceedings.mlr.press/v15/coates11a/coates11a.pdf

40. Lampert, Nickisch, Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In: Proceedings of the 2009 IEEE conference on computer vision and pattern recognition, Miami, FL, 28 November 2012, pp. 951–958. Tübingen: Max Planck Institute for Biological Cybernetics.

41. Grauman. Fine-grained visual comparisons with local learning. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition (CVPR ’14), Washington, DC, 23–28 June 2014, pp. 192–199. Washington, DC: IEEE Computer Society.

42. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, ON, Canada, 2009.

43. Thomas, Kovashka. Seeing behind the camera: identifying the authorship of a photograph. In: Proceedings of the 2016 IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 3494–3502, http://openaccess.thecvf.com/content_cvpr_2016/papers/Thomas_Seeing_Behind_the_CVPR_2016_paper.pdf

44. Chen, Zhang. Effects of different Gabor filters parameters on image retrieval by texture. In: Proceedings of the 10th international multimedia modelling conference (MMM ’04), Brisbane, QLD, Australia, 5–7 January 2004, p. 273. Washington, DC: IEEE Computer Society.

45. Jia, Shelhamer, Donahue, et al. Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on multimedia (MM ’14), Orlando, FL, 3–7 November 2014, pp. 675–678. New York: ACM.

46. Cui, Ward, Rundensteiner, et al. Measuring data abstraction quality in multiresolution visualizations. IEEE Trans Vis Comput Graph 2006; 12(5): 709–716.

47. Comaniciu, Meer. Mean shift: a robust approach toward feature space analysis. IEEE T Pattern Anal 2002; 24(5): 603–619.

48. Kosara, Bendix, Hauser. Parallel sets: interactive exploration and visual analysis of categorical data. IEEE Trans Vis Comput Graph 2006; 12: 558–568.

49. Tejada, Minghim, Nonato. On improved projection techniques to support visual exploration of multidimensional data sets. Inform Visual 2003; 2(4): 218–231.

50. Sokal, Michener. A statistical method for evaluating systematic relationships. U Kans Sci Bull 1958; 28: 1409–1438.

51. Day, Edelsbrunner. Efficient algorithms for agglomerative hierarchical clustering methods. J Classif 1984; 1(1): 7–24.

52. Sander, Qin, et al. Automatic extraction of clusters from hierarchical clustering representations. In: Proceedings of the 7th Pacific-Asia conference on knowledge discovery and data mining (PAKDD ’03), Seoul, South Korea, 30 April–2 May 2003, pp. 75–87. Berlin; Heidelberg: Springer.

53. Kwon, Eysenbach, Verma, et al. Clustervision: visual supervision of unsupervised clustering. IEEE Trans Vis Comput Graph 2018; 24(1): 142–151.

54. Kern, Lex, Gehlenborg, et al. Interactive visual exploration and refinement of cluster assignments. BMC Bioinform 2017; 18(1): 406.

55. Bruneau, Pinheiro, Broeksema, et al. Cluster sculptor, an interactive visual clustering system. Neurocomputing 2015; 150: 627–644.

56. Jentner, Sacha, Stoffel, et al. Making machine intelligence less scary for criminal analysts: reflections on designing a visual comparative case analysis tool. Vis Comput J 2018; 34: 1225–1241.