Abstract
Both visual analytics and interactive machine learning try to leverage the complementary strengths of humans and machines to solve complex data exploitation tasks. These fields overlap most significantly when training is involved: the visualization or machine learning tool improves over time by exploiting observations of the human–computer interaction. This article focuses on one aspect of the human–computer interaction that we call user-driven sampling strategies. Unlike relevance feedback and active learning sampling strategies, where the computer selects which data to label at each iteration, we investigate situations where the user selects which data are to be labeled at each iteration. User-driven sampling strategies can emerge in many visual analytics applications, but they have not been fully developed in machine learning. User-driven sampling strategies suggest new theoretical and practical research questions for both visualization science and machine learning. In this article, we identify and quantify the potential benefits of these strategies in a practical image analysis application. We find user-driven sampling strategies can sometimes provide significant performance gains by steering tools toward local minima that have lower error than tools trained with all of the data. In preliminary experiments, we find these performance gains are particularly pronounced when the user is experienced with the tool and application domain.
Introduction
Interactive machine learning is an emerging field of research that has similar aims to visual analytics: to leverage the complementary strengths of humans and machines to produce better solutions to data exploitation tasks. Perhaps the only real difference between interactive machine learning and visual analytics is historical: visual analytics has emerged from the visualization science community, 1 and interactive machine learning has emerged from the machine learning community. 2 Visualization science has traditionally focused on the user and has developed a number of tools and techniques that tailor user interfaces to the data exploitation problem, with the objective of maximizing user productivity. Machine learning has traditionally focused on the machine and has developed a number of tools and techniques that tailor the data processing tools to the problem at hand, with the objective of maximizing prediction accuracy.
One of the main areas where interactive machine learning and visual analytics overlap is training: examples of tool inputs and outputs are used to tailor the tool to the application. In traditional machine learning, training examples are obtained in any number of ways, but in interactive machine learning, training examples are obtained from end-users in the deployed environment as they interact with their data. This opens the door to a number of research questions for visualization science (e.g. how best to elicit training examples from end-users?) and for machine learning (e.g. how to best characterize user interactions in terms of training data?).
In section “Human–computer interaction in training,” we describe recent machine learning advances that enable new forms of user interaction to be captured and incorporated into training processes. There are two main research thrusts to these training advances: (1) advances in the training vocabulary enable users to provide more information than standard labels and (2) advances in the training dialog enable users to interact in a more iterative and intuitive way.
In section “User-driven sampling strategies,” we describe a new component of the training dialog which we call user-driven sampling strategies. These strategies emerge naturally in many interactive visual analytics applications, but they are yet to be formally developed in machine learning. In section “Evaluating and comparing sampling strategies,” we describe practical experiments that we use to quantify the potential benefits of user-driven sampling strategies. In section “Experimental results,” we present our experimental results and discussion before concluding in section “Summary.”
Human–computer interaction in training
There are two main technical components that determine how human–computer interaction is translated into training data to build better machine learning tools. These are illustrated as axes of an interactive machine learning design space in Figure 1.

Figure 1. The design space for interactive machine learning in terms of training interactions.
The training vocabulary
We call the horizontal axis the Training Vocabulary. In traditional machine learning, the vocabulary is based on simple labels. Over the last 10 years, however, learning by example has advanced rapidly to include a much richer class of data structures that can support a much richer set of user interactions.
A common application that exploits these more complex interactions is clustering. Typically, the interaction is formalized as equivalence constraints: pairs (or sets) of data that belong to the same cluster and/or pairs (or sets) that belong to different clusters. 3 These constraints can be obtained from the user through labeling interfaces or through drag-and-drop interfaces in which users who are visualizing clusters can drag subsets of data closer to other subsets. 4 For example, in Bayesian Visual Analytics (BaVA), 5 the interaction is based on two- or three-dimensional point clouds, and users visually pick points and drag them closer to other points; this information is then used to refine the prior.
In the most general case, training examples are generated by translating user interactions into general structures or graphs. 6 These structures encode labels associated with subsets of data, as well as relational (or semantic) links among different subsets of data. In general, structures are complex, and collecting examples from users in an interactive setting is non-trivial. In addition, structures are typically fixed in advance and only approximate reality, which means generating training examples is often not intuitive. However, recent work in interactive machine learning has started to adapt these methods to interactive settings,7–9 and we suggest further work in this area can help exploit the spatial reasoning and semantic interactions that are inherent in many visual analytics systems. 10
The training dialog
The vertical axis is the Training Dialog, and it is the main focus of this article. In traditional machine learning, training examples are collected upfront and provided to the training algorithm all at once (Batch learning). However, this is often not how users generate training data. Online learning methods relax the requirement that training examples be provided at the same time, but typically, online learning makes the same statistical assumptions as Batch learning: it assumes training samples are Independent and Identically Distributed (IID). This means (in principle) that the training data generated by users at time t + 1 should not be biased by the training samples provided previously or the output from the machine learning system at time t. Relaxing this requirement has motivated a number of iterative learning techniques which are often a better match for how users want to interact with data. A very common dialog in iterative settings is relevance feedback 11 which is summarized in the following pseudo code:
1. Start with a small number of examples and train a model/tool.
2. Apply the model/tool to unlabeled data and predict the most relevant subsets.
3. The user labels each predicted subset as relevant or not relevant.
4. Update the model/tool based on the new labels.
5. Go to step 2.
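The relevance feedback dialog above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation of any particular system: the data are one-dimensional values, the model is a stand-in pair of class means, and the hypothetical `oracle` function plays the role of the user who supplies labels. The seed examples are assumed to contain both classes.

```python
import random

def train_centroids(labeled):
    """Stand-in model (assumption): one mean per class over 1-D values."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def relevance(model, x):
    """Higher score: x is nearer the relevant-class mean than the other mean."""
    pos_mean, neg_mean = model
    return abs(x - neg_mean) - abs(x - pos_mean)

def relevance_feedback(pool, oracle, seed_items, rounds=5, batch=3):
    # Step 1: start with a small number of labeled examples and train.
    labeled = [(x, oracle(x)) for x in seed_items]
    unlabeled = [x for x in pool if x not in seed_items]
    model = train_centroids(labeled)
    for _ in range(rounds):
        # Step 2: the computer ranks unlabeled data by predicted relevance.
        unlabeled.sort(key=lambda x: relevance(model, x), reverse=True)
        picked, unlabeled = unlabeled[:batch], unlabeled[batch:]
        # Step 3: the user (here, the oracle) labels the predicted subset.
        labeled += [(x, oracle(x)) for x in picked]
        # Step 4: update the model; step 5 is the next pass of this loop.
        model = train_centroids(labeled)
    return model, labeled

# Hypothetical data: two overlapping groups of 1-D values.
random.seed(0)
pool = [random.gauss(0.0, 1.0) for _ in range(50)]
pool += [random.gauss(4.0, 1.0) for _ in range(50)]
oracle = lambda x: 1 if x > 2.0 else 0
model, labeled = relevance_feedback(pool, oracle, [min(pool), max(pool)])
```

Note that the examples chosen in step 2 depend on the current model, which is exactly why the labeled set cannot be treated as an IID sample.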
Active learning is very similar to relevance feedback, but it uses a different strategy for selecting examples (step 2). Active learning focuses on minimizing the number of labels required to obtain a given level of performance (the sample complexity). Note that with respect to the end-users’ application, active learning strategies may well select the most uninteresting samples in the dataset. In some interactive applications, this may not be a good match; however, in other applications, it can lead to better tools with less work (labeling) for the user in the long term.
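One common way to realize the active learning selection in step 2 is uncertainty sampling, which asks for labels on the samples nearest the current decision boundary. The sketch below assumes a linear classifier with hypothetical `weights` and `bias`; it is one heuristic among many, and the discussion above does not commit to a specific one.

```python
def most_uncertain(weights, bias, unlabeled, k=2):
    """Return the k samples whose linear score w.x + b is closest to zero."""
    def margin(x):
        return abs(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return sorted(unlabeled, key=margin)[:k]

# Hypothetical two-dimensional candidates.
candidates = [(3.0, 1.0), (0.1, 0.0), (-2.0, 0.5), (0.0, -0.2)]
picked = most_uncertain(weights=(1.0, 1.0), bias=0.0, unlabeled=candidates)
```

The two samples selected here lie almost on the boundary, which illustrates the point made above: they are the most informative for the learner, but not necessarily the most interesting for the end-user.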
A long-standing challenge for relevance feedback and active learning has been sampling bias. The samples that the computer selects in step 2 are not selected randomly, but the methods used in step 4 often assume they are. This means there are no guarantees that performance will get better as more labels are obtained, and in fact, it may get worse. Mitigating sampling bias has been a key topic of research, and a number of methods have been developed that provide safety guarantees and batch learning performance in the worst case. 12
User-driven sampling strategies
So far, we have discussed sampling strategies for step 2 where the computer determines which samples to label next based on the previous result. An alternative is to let the user choose which samples to label next. This approach is particularly relevant to visual analytics systems, since, in order for the user to choose which examples to label, they must be able to visualize (or browse) a larger subset of data. Empowering users to visualize and select the samples could have several potential advantages:
Users often know the most important aspects of the data and can choose examples appropriately.
The design criteria for the machine learning tools may be different to the application-level objectives, and the user may be able to direct, or steer, the tools with sample selection.
By enabling users to interact with the sample selection and with predictions of the tool, users can learn the strengths and weaknesses of the tool and then choose examples that can guide the tool to better solutions.
For a concrete example of how user-driven sampling strategies emerge in interactive applications, we will focus on its application to image exploitation and, specifically, the task of labeling pixels. This application is the basis of the Crayons interactive machine learning system 13 as well as our own Genie image exploitation system. 14 This is illustrated in Figure 2. These tools obtain training examples from users through paintbrush-like tools. Second from the left in Figure 2 shows a typical “mark-up” where the user has selected examples of the feature of interest (vegetation) in green and examples of the background in red. These labeled pixels are fed into a supervised learning method that produces a pixel-level classifier that can be applied to the entire image, as well as additional images. An example prediction from the pixel classifier is shown as a green/red overlay second from the right in Figure 2.

Figure 2. User-driven sampling strategies arise naturally in image exploitation applications. Users are able to visualize a large quantity of data (step 1) when deciding which data to annotate (step 2). They are also able to visualize a large quantity of predicted results (step 3) when deciding which additional data to annotate and correct (step 4).
This basic tool has a wide range of applications, and our tool has been used in remote sensing, biomedical image analysis, as well as material science. In many of these applications, we have observed that users prefer to use the tool incrementally. Instead of providing a complete mark-up upfront, users provide a small amount of training data, train the classifier, then provide additional training data (typically where the classifier made a mistake), and so on. Informally, we have also observed that this iterative process often leads to better tools than those developed with batch training. In this article, we try to quantify and understand this potential performance improvement.
Evaluating and comparing sampling strategies
In this article, we perform a number of experiments to better understand how user-driven sampling strategies affect classifier performance in the image exploitation application illustrated in Figure 2. In all experiments, the objective is to predict each pixel (

Figure 3. (Left) RGB of a 10-channel multi-spectral image, (second from left) ground-truth overlay showing desired features in green and background in red, and (right) example of how pixel predictions, in combination with morphological reconstruction with the ground-truth, lead to a more object-centric measure of performance.
In addition to the image data associated with each experiment, we also have a fully annotated labeling of the image that is used as ground-truth. This ground-truth image was annotated by a human, but for all intents and purposes, the ground-truth generation can be considered independent of the experiments performed in this article. Note, however, that we do show the ground-truth image to the user at the start of the experiments to help the user understand the task objective. Once the user starts the experiment, the ground-truth image is not available. An example of a ground-truth labeling overlaid on the original image is shown second from the left in Figure 3.
Classifier design
In all experiments, we will use a linear classifier to predict the labeling for each pixel
where
Fisher’s Linear Discriminant
The first learning algorithm is Fisher’s Linear Discriminant, 15 and it finds the weight vector that maximizes Fisher’s criterion for separation
where the numerator is the variance between pixels in the training set with different labels, and the denominator is the variance between pixels in the training set with the same label.
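As an illustration of this learning method, the sketch below computes a Fisher direction in plain Python. It relies on the textbook result that the ratio of between-class to within-class variance, J(w) = (w^T S_B w) / (w^T S_W w), is maximized by a vector proportional to S_W^{-1}(m_1 - m_0). The symbols, the function names, and the small `ridge` term added for numerical stability are our assumptions and need not match the notation or software used in the experiments; choosing a decision threshold is left out.

```python
def mean(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter(rows, mu):
    """Within-class scatter matrix: sum of outer products of (row - mu)."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [r[j] - mu[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                s[a][b] += diff[a] * diff[b]
    return s

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fisher_direction(feature_rows, background_rows, ridge=1e-6):
    """Weight vector proportional to S_W^{-1} (m_1 - m_0)."""
    mu1, mu0 = mean(feature_rows), mean(background_rows)
    s1, s0 = scatter(feature_rows, mu1), scatter(background_rows, mu0)
    d = len(mu1)
    s_w = [[s1[a][b] + s0[a][b] + (ridge if a == b else 0.0)
            for b in range(d)] for a in range(d)]
    return solve(s_w, [mu1[j] - mu0[j] for j in range(d)])

# Hypothetical two-channel pixels for the two classes.
feature = [(2.0, 0.0), (3.0, 1.0), (4.0, 0.5)]
background = [(0.0, 0.0), (-1.0, 1.0), (1.0, 0.5)]
w = fisher_direction(feature, background)
project = lambda row: sum(wi * xi for wi, xi in zip(w, row))
```

Projecting the class means onto `w` separates them in the expected order, with the feature mean receiving the larger score.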
Linear Support Vector Machine
The second learning algorithm is a linear Support Vector Machine (SVM) (we use LIBSVM 16 ), and it finds the weight vector that minimizes a regularized loss function related to the misclassification rate on the training set
where
Classifier evaluation
The two learning methods described in the previous section optimize well-established design criteria for classification. However, these design criteria are typically different to the criteria used to evaluate the classifiers. Also, in this article, we are particularly interested in investigating whether user-driven sampling strategies enable users to optimize different, perhaps more application-level, criteria. Therefore, we investigate two different criteria for evaluation.
Pixel error
Perhaps the most common criterion for evaluating classifiers is the error rate or the number of classification errors calculated on a hold-out set
where
Focus of attention
In the experiments, we design pixel-level classifiers as defined in equation (1). However, the features of interest in the experiments are much larger than a pixel and can be more accurately described as objects. For example, in Figure 3, the features correspond to airplanes. Humans are often very object-centric in their interpretation of image data, and it is therefore possible that user-driven sampling strategies optimize an object-centric design criterion. In previous work, we suggested a criterion for pixel-level classifiers that may be more appropriate for object detection problems.17,18 We called it the Focus of Attention (FOA), and it is based on the observation that a human user may be more interested in a classifier that correctly predicts one pixel on every object in the scene than in a classifier that correctly predicts all pixels on a subset of the objects in the scene. To quantify this idea, we apply a reconstruction post-processing step to the pixel predictions, as illustrated on the right in Figure 3. If the pixel classifier predicts at least one pixel correctly on the airplane, then the complete airplane is reconstructed, and the classifier has zero pixel error over that object. However, false alarms (associated with background pixels) are not affected by the procedure and therefore contribute to the pixel error as before.
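The difference between the two evaluation criteria can be made concrete with a small sketch. The code below counts pixel errors on binary label grids and then recomputes the count after a reconstruction step in which any ground-truth object with at least one correctly predicted pixel is filled in completely. Treating objects as 4-connected regions of the ground truth, and the function names themselves, are our assumptions for illustration; the reconstruction used in the experiments may differ in detail.

```python
from collections import deque

def pixel_errors(pred, truth):
    """Count pixels where the prediction disagrees with the ground truth."""
    return sum(p != t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))

def reconstruct(pred, truth):
    """Fill every ground-truth object that has at least one predicted pixel.

    Objects are 4-connected regions of 1s in `truth` (an assumption).
    False alarms on the background are left untouched.
    """
    rows, cols = len(truth), len(truth[0])
    out = [row[:] for row in pred]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if truth[r][c] != 1 or seen[r][c]:
                continue
            # Collect one ground-truth object with a breadth-first search.
            obj, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                i, j = queue.popleft()
                obj.append((i, j))
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < rows and 0 <= nj < cols
                            and truth[ni][nj] == 1 and not seen[ni][nj]):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            # A single hit anywhere on the object reconstructs all of it.
            if any(pred[i][j] == 1 for i, j in obj):
                for i, j in obj:
                    out[i][j] = 1
    return out

def foa_errors(pred, truth):
    """Pixel errors measured after object reconstruction."""
    return pixel_errors(reconstruct(pred, truth), truth)

# Hypothetical 4x4 scene with two objects and one false alarm.
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 1]]
pred = [[1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
```

In this toy scene, the prediction has six pixel errors. After reconstruction, the first object is fully recovered from its single hit, the second object is still missed entirely, and the false alarm remains, leaving three errors.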
IID sampling
The theory and algorithms for optimizing the two different design criteria represented by equations (2) and (3) assume IID training examples. This is the baseline against which we compare the user-driven sampling strategy. To generate IID training sets, we randomly select
Note that the

Figure 4. Performance of classifiers designed with IID sampling: (left) Fisher Linear Discriminant evaluated by pixel error, (middle) SVM evaluated by pixel error, and (right) SVM evaluated by FOA error. The black dashed line corresponds to the performance when the entire image was used as training data,
The black dashed lines in Figure 4 correspond to the performance when the entire image is used as the training set. In this limit, we are using the same examples to design and evaluate the classifiers, which, for the purposes of comparison, provides a lower bound on the error of the IID methods.
Comparing the left and middle plots in Figure 4, we observe that the Fisher classifier has higher error for this problem than the SVM, which we attribute to the fact that the SVM design criteria are closer to the pixel error evaluation criteria. Also, the Fisher classifier generally has lower variance in error with different training sets, particularly as the number of training examples increases. Contributing to this difference in performance is the fact that the SVM has an additional parameter,
User-driven sampling
In this sub-section, we describe how we evaluate the user-driven sampling strategy. In the first iteration, the user inspects the raw image and selects a number of examples of feature pixels and background pixels. A typical selection for the aircraft problem is shown on the left in Figure 5. These examples are used to train a classifier which produces the prediction second from the left in Figure 5. We calculate the error of the prediction compared to the ground-truth. The second (and subsequent) iterations proceed much like the first, except now the user has the previous prediction to help them choose examples. We overlay the prediction with the image and the training data mark-up with different colors so that the user can simultaneously see the current training data and the current prediction. This typically biases the user’s selection of samples toward misclassified pixels. This process continues until the user decides to stop (when they judge the prediction is no longer improving). In Figure 5, the user stopped after nine iterations. The final training data used to build the classifier in iteration 9 are shown in black and white in Figure 5. The final prediction with this training data is shown on the right in Figure 5. In the results presented in the next section, our user repeats this experiment several times (typically five trials) for each dataset.
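The protocol above can be mimicked in simulation, which may help clarify the bookkeeping. In the sketch below, a simulated user adds labels only on currently misclassified pixels, and the error against the ground truth is recorded at every iteration. Everything here is an assumption made for illustration: the one-dimensional pixel values, the stand-in threshold classifier, the fixed number of labels per iteration, and above all the simulated user, who consults the ground truth directly, whereas the real user in the experiments judged predictions visually and did not have the ground truth available.

```python
import random

def train(labeled):
    """Stand-in classifier (assumption): threshold midway between class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def predict(threshold, x):
    """Feature pixels are assumed brighter than background pixels."""
    return 1 if x > threshold else 0

def user_driven_trial(pixels, truth, first_markup, per_iter=4, max_iters=9, seed=0):
    rng = random.Random(seed)
    # Iteration 1: the initial mark-up must contain both classes.
    labeled = {i: truth[i] for i in first_markup}
    errors = []
    for _ in range(max_iters):
        threshold = train([(pixels[i], y) for i, y in labeled.items()])
        wrong = [i for i, x in enumerate(pixels)
                 if predict(threshold, x) != truth[i]]
        errors.append(len(wrong))  # error of this iteration's prediction
        # The simulated user only adds labels where the classifier is wrong.
        candidates = [i for i in wrong if i not in labeled]
        if not candidates:
            break
        for i in rng.sample(candidates, min(per_iter, len(candidates))):
            labeled[i] = truth[i]
    return errors, labeled

# Hypothetical image: 30 feature pixels and 170 background pixels.
rng = random.Random(1)
truth = [1] * 30 + [0] * 170
pixels = [rng.gauss(3.0, 1.0) if t else rng.gauss(0.0, 1.0) for t in truth]
errors, labeled = user_driven_trial(pixels, truth, first_markup=[0, 1, 30, 31])
```

The list `errors` plays the role of one colored curve in a performance plot: one value per iteration, for a training set that grows in a deliberately non-IID way.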

Figure 5. (Far left) The user selects training samples in iteration 1, (second from left) the resulting classifier, (second from right) the labeled pixels after nine iterations, and (far right) the final classifier produced with these pixels.
Experimental results
Classifying aircraft
Figure 6 summarizes the experimental results for the aircraft problem described in Figure 3. Perhaps most interesting are the results on the left in Figure 6 which correspond to the Fisher learning method. In the first iterations, the user’s selection of examples appears to produce higher error than IID sampling. This implies the user’s selection is biased compared to the final problem, as we might expect. However, in subsequent iterations, the user’s sample selection leads to consistently better performance than IID sampling. This error is significantly lower than the variability in performance observed from IID sampling. The lowest number of mistakes observed in IID sampling was approximately

Figure 6. Performance as a function of training samples (iterations) for five different trials of user-driven sampling (colored) in comparison with the lower bound classifiers (black dashed lines): (left) the Fisher classifier evaluated by pixel error, (middle) the SVM evaluated by pixel error, and (right) the SVM evaluated by FOA.
In the middle plot in Figure 6, we observe that the performance of user-driven sampling converges to an error which is much closer to the lower bound for the SVM. In some cases, the user obtained lower error, but this depended much more critically on when the user decided to stop iterating with the system. This general pattern appears amplified in the FOA results shown on the right in Figure 6. In four of the five trials, the user-driven sampling outperformed the lower bound, but the variance between iterations was much larger.
The results in Figure 6 are somewhat different to typical relevance feedback and active learning results. Active learning experiments typically quantify how many samples are required for algorithms to converge to the IID lower bound. The user-driven sampling results in Figure 6 tell a different story: user-driven sample selection can drive classifiers to minimize error criteria that are different to the criteria used to design the classifier. This is particularly pronounced for the Fisher design method. Note, in all cases, if the user continues to iterate they will eventually label the entire image, and the error curves in Figure 6 would return to the dashed line.
Classifying vehicles
To investigate this phenomenon further, we performed a second set of experiments where the task was to identify vehicles in a three-color image, as illustrated in Figure 7. The results from these experiments are shown in Figure 8. We observe a similar pattern of performance to the aircraft problem:
User-driven sampling significantly outperforms the Fisher lower bound, as measured by pixel error.
User-driven sampling converges to the SVM lower bound, as measured by pixel error.
User-driven sampling appears to outperform the SVM when measured with FOA, although the variance between iterations is high.

Figure 7. In this problem, the image has three channels (left), and the task is to delineate vehicles in an urban environment (right).

Figure 8. Performance as a function of training samples (iterations) for the vehicle detection problem: (left) the Fisher classifier evaluated by pixel error, (middle) the SVM evaluated by pixel error, and (right) the SVM evaluated by FOA. Note that data points are at discrete intervals as in Figure 6, but not shown in this case to avoid clutter.
To investigate the impact of contamination, and the accuracy of our lower bound in experiments, we evaluated the lower bound classifiers (black dashed lines), and the classifiers generated at each iteration of user-driven sampling, using two different images that were not used during the experiment. The results are summarized in Figure 9, and we see that in all cases, the user-driven sampling appears to perform better on these held-out images than the lower bound classifiers.

Figure 9. Out-of-sample performance evaluated using two different images (top row and bottom row) not used during the experiment. (Left) The image used to evaluate performance, (middle) SVM performance as measured by pixel error, and (right) SVM performance as measured by FOA.
Broad area features
The two experiments we have described so far are both examples of object-centric detection problems, in that the features of interest cover relatively small, compact regions within the image. In Figure 10, we show a different class of problem where the feature of interest (an urban land cover type) covers much larger, contiguous regions within the image. On the right in Figure 10, we show the performance of user-driven sampling for the Fisher classifier, as evaluated by pixel error. This configuration is where we saw the biggest difference between the sampling strategies in the previous experiments. In this experiment we also see a consistent improvement over the lower bound, but it is significantly less than in the previous experiments.

Figure 10. (Left) The image has 16 channels and the task (middle) is to delineate land use corresponding to high-density residential neighborhoods. (Right) The performance of the Fisher classifier as evaluated by pixel error.
Inexperienced user
In the experiments described so far, the user involved was very familiar with the tool and the image analysis applications. In fact, through many years of hands-on experience, this user had already reached the conclusion that incremental mark-up could obtain better results than batch mark-up for these applications. To better understand the role of expertise in this situation, we solicited a second user, with no prior experience with the tool or image analysis. For this experiment, we used the aircraft problem described in section “Classifying aircraft,” and results are summarized in Figure 11.

Figure 11. Performance over four trials on a subset of the aircraft delineation task: (left) results from the experienced user and (right) an inexperienced user on the same problem. Both plots show the results using the Fisher classifier evaluated with pixel error.
We observed that the inexperienced user was not able to do as well as the experienced user, but they were able to consistently outperform the lower bound. Note that we only show three trials for the inexperienced user. The first trial was used as a training run by the inexperienced user and was not included. Generally, we have found that inexperienced users quickly become familiar with the tool and the interaction required. Quantifying user performance as a function of experience would be an interesting direction for future work.
Summary
In this article, we identified a user-driven sampling strategy that emerges naturally in some visual analytics applications and leads to different machine learning tools than those produced by traditional IID sampling. In our experiments, we found that user-driven sampling leads to biased training sets, but showed that these biases can lead to improved performance. The key observation is that the classifier design criteria are often different to the criteria that the user may ultimately be interested in and the criteria we use to evaluate the classifier.
The improvement in performance was most significant when designing classifiers according to Fisher’s criterion, in which case the user-driven sampling was found to consistently lead to better performance as evaluated by pixel classification error. This effect appeared to be robust to the type of image application and the expertise of the user. However, the effect was most pronounced in classification problems involving object-type features and for experienced users.
We also saw a similar effect for classifiers designed with the SVM criteria, although the magnitude of improvement was much less. We attribute this to the fact that the SVM design criteria are much closer to pixel classification error. Consistent with this hypothesis is the fact that the performance improvement was most pronounced when comparing to out-of-sample classification error rates instead of the lower bound estimated from the training set. We also observed that, in general, the user-driven sampling led to improved performance with respect to the FOA evaluation criteria, but the performance between iterations had much higher variance. We attribute this to the sensitivity of the SVM classifier to the class proportions in the training set, and our user did report that the classifier results, and thereby the next iteration of training data, would often alternate between high false alarm rates and high missed detection rates.
The class representation in the training set is a key lever with which a user can steer classifier performance with user-driven sampling. We found that the number of samples selected by the user for each class in training was often very different to the class probabilities in the final prediction. Specifically, our object detection applications typically had a large background class, but in the samples selected by the user, we observed class probabilities that were much more equal. This suggests user-driven sample selection may be related to other methods in interactive machine learning, such as ManiMatrix, which enables users to interact with classifier design by adjusting weights on the classifier error matrix. 19 In the ManiMatrix system, users interact directly with classifier design parameters as they iterate with classifier predictions. In our approach, the interaction is indirect, but in some applications this could be more intuitive.
Finally, we note that the image classification problem is particularly well suited to user-driven sampling strategies because users can easily view a large amount of unlabeled data, together with predicted results, to choose the next iteration of examples. However, systems that support other applications and data types have also been developed, 20 and we hope the initial results we have presented in this article will motivate support of user-driven sampling strategies in other visual analytics domains. This article has introduced user-driven sampling and presented initial results, and in doing so, it has raised many open questions for machine learning and visualization science. The fact that the performance benefit lies in a local minimum, and that the user must know when to stop, suggests that methods from active learning that help control worst-case behavior and provide guarantees on performance will be required. The fact that our experienced user significantly outperformed our inexperienced user suggests that a large number of visualization, training, and human factors could be involved in producing good user-driven sampling strategies.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
