Abstract
Humans interact more and more with systems containing AI components. In this work, we focus on hand gestures such as handwriting and sketches serving as inputs to such systems. They are represented as a trajectory, i.e. sequence of points, that is altered to improve interaction with an AI model while keeping the model fixed. Optimized inputs are accompanied by instructions on how to create them. We aim to cut on effort for humans and recognition errors while limiting changes to original inputs. We derive multiple objectives and measures and propose continuous and discrete optimization methods embracing the AI model to improve samples in an iterative fashion by removing, shifting and reordering points of the gesture trajectory. Our quantitative and qualitative evaluation shows that mimicking generated proposals that differ only modestly from the original ones leads to lower error rates and requires less effort. Furthermore, our work can be easily adjusted for sketch abstraction improving on prior work.
Introduction
“Smart systems” containing machine learning components have already deeply penetrated our daily life and humans interact with multiple such systems daily: recommendation systems (on web pages), voice assistants, gesture recognition systems, and safety-critical systems such as driver-assisted cars, to name a few. However, interaction between humans and AI consumes a lot of resources. Furthermore, the time and economic loss due to errors in mis-communication are considerable. Even simple gestures for screen unlocking of smartphones are not detected with 100% accuracy in laboratory settings [51]. Given that billions of unlock attempts fail almost daily, improving accuracy by just 0.1% can lead to yearly global time savings in the order of millions of hours for such a simple gesture alone.1
Smartphone users worldwide 3.8 Billion (
By improving interaction with AI systems, humans might benefit from lower time to create inputs and fewer misunderstandings in interaction. While the idea of co-adaption is known in human-computer interaction [16,20], in the machine learning community, data originating from people is typically fixed, and the goal is to find the best possible model. This work provides first steps to helping humans to do so, focusing on hand-generated inputs. Humans are offered feedback allowing them to alter their inputs to AI systems. See Fig. 1 for a process overview. The human provides a sketch in the form a sequence of strokes, where each stroke is a sequence of points. The human receives the optimized sketch being a visualization of the optimized sequence with instructions in the form of stroke order and direction indicated by red arrows. As we show, if a human draws sketches to be more similar to the proposal taking into account stroke order, the sketches can be more reliably be detected by the AI and the human takes less time to create them (given some practice).

Left panel: humans change their inputs to AI based on feedback leading to time savings and better recognizability; right panel: password pattern input: improving recognition failures of password patterns alone by as little as 0.1% can save millions of wasted hours yearly.
Thus, there are good reasons why humans might want to improve on their (hand-generated) inputs. Humans, “copying” or adjusting to optimized human inputs, should benefit from lower errors in interaction and save time. At the same time, those changes should be easy and still preserve the personal style of the human to maintain human diversity. Unfortunately, fulfilling all of these objectives is difficult for multiple reasons. First, there is a trade-off between recognition accuracy and effort to create. Simplifying inputs too much will increase error rates. Making inputs more complex, e.g., adding redundant features to inputs, might reduce errors but increase the amount of effort to create them. Second, streamlining human-to-AI interaction is possibly more intricate than human-to-human interaction since AI systems process information differently and are sensitive to input changes that humans hardly notice. Using almost invisible, adversarial perturbations, classifiers might be “fooled” to misjudge samples that can be categorized without any problems by humans (e.g., [42]). From our perspective, this is both good and bad news. While it suggests that minor changes exist that might increase confidence in the correct class (rather than in an incorrect class as used in an adversarial setting), it also highlights that humans might not even be able to distinguish an improved from a non-improved sample if optimization is not done with care. Moreover, movements of (human) limbs and the vocal tract are subject to stochastic variation, making it impossible for humans to alter inputs in very subtle ways as done in adversarial settings like [42].
Designing adequate optimization objectives is non-trivial. Even if the proposed samples optimize a specific mathematical objective, it is unclear whether these samples are of practical value. That is, humans might deem the proposed inputs unnatural, and they might struggle to reproduce suggested changes that deviate from deeply rooted habits, beliefs and behaviors.
This work discusses improving human-to-AI interaction through optimization of hand-generated inputs by proposing to adjust the order of movements, removing input parts and altering trajectories of movements. The optimization aims at all of the aforementioned goals: reducing time to create inputs, improving recognition accuracy of inputs by the AI model, maintaining recognizability by humans as well as the diversity of samples. The problem of sketch abstraction [38,60], in particular as treated in prior work, can be seen as a special case of our work neglecting multiple of our objectives such as time to create inputs, diversity of samples, and providing instructions to create them. Our approach to change samples also differs compared to these works. We alter each sample through an iterative optimization procedure rather than by training a model that generates an improved version of an input without any optimization. That is, we gradually change an input in a trial-and-error manner, whereas prior work learns a model that directly transforms the input using patterns learnt from the training data of the model. Due to the huge diversity of samples, the training data is likely rather sparse and, therefore, a trained model cannot adjust to the unique characteristics of a training sample even though they might be relevant for the optimization.
Our work contributes as follows:
Defining high-level objectives and mathematically concise measures both for optimization and for evaluation. Our evaluation accounts for discrepancies due to innate variation in human movement between suggested inputs and those created by humans.
Proposing algorithms to optimize single hand-generated samples of an individual in an iterative, exploratory manner rather than using a separately trained model. While the latter might be computationally faster, it typically performs some sort of generalization, i.e., abstraction, leading to more uniform, less diverse samples. Our approach leads to highly personalized samples. These samples are faster to generate and classified with higher accuracy while preserving diversity among humans. Among our optimization operations, only removal of strokes has been studied extensively in prior work in the context of sketch abstraction. Our work outperforms the state-of-the-art with respect to recognizability of abstracted sketches by an AI.
Showing inputs in combination with how to create them. We utilize (hand) movement data, highlighting the order and direction of movements rather than exposing a human only to the suggested optimized inputs without instructions on how to create them efficiently.
Altering human inputs: Schneider introduced a human-to-AI coach based on an auto-encoder that given a picture of a digit outputs a digit that has lower classifier loss and potentially consists of fewer pixels [45]. In contrast, we optimize samples individually in an iterative manner. Additionally, we provide instructions on how to create samples and we evaluate on actual users. Our work also uses more complex datasets – see survey on AI and sketches [56]. Abstracting sketches using removal of stroke segments and entire strokes, while preserving semantics was studied in [38,43]. In these works an agent learns to select strokes relevant for a classifier to maintain the correct class using reinforcement learning (RL). The implementation using RL differs from our approach improving individual samples directly. The optimization procedure in [38,43] neglects all constraints and focuses on maintaining accuracy employing just one option for abstracting: Removal. In this paper, we also consider a gradual movement of points and the creation process, i.e., altering the order of strokes, while maintaining multiple constraints. Liu et al. used GANs to complete sketches corrupted through occlusion [33]. They achieved high-quality results comparable to methods such as image inpainting. Many works deal with creating human inputs, e.g., synthesizing hand writing [23].The focus of these works is typically on creativity [6], i.e. generating novel, realistically looking artifacts rather than altering given inputs.
Explainability: This paper has strong ties to explanations [24,37] and explanations in the field of human-AI interaction [27]. Counterfactual explanations seek to identify a modification of the input to obtain another class [15,22]. Dhurandhar et al. identify minimal changes to digits on a pixel level using perturbations [15]. Thus, in contrast to our work, they focus on misclassified samples only. Moreover, the suggested changes commonly involve adding or removing multiple pixels distributed across the digit. This is infeasible for humans, since they cannot reproduce such changes. A human might improve its interaction using a better understanding of AI, e.g., with the help of visualizations highlighting what aspects of an input are critical for classification [41,49]. Generally, explainability [37] aims at making AI models human understandable. In our setup, we do not explicitly aim at understandability but a human might understand AI decision-making by generalizing from multiple optimized inputs, i.e., a person might learn how samples can be altered without distorting their recognizability.
Human-AI interaction: Rzepka and Berger summarized the effects of user and AI system characteristics in general [44], while Martins et al. focused on digital AI assistants [36]. Interaction between AI and users was studied in various contexts including social robots [36,40]. The primary focus has been on desirable AI behavior, e.g., empathy, or strategies how AI can adapt to user behavior [12,21] with few exceptions. Bansal et al. explicitly investigated how users can alter the behavior [2], i.e., override decisions of the AI, by understanding the error boundary of a classifier, while Shneiderman provides general guidelines on human-centered AI [52]. In follow-up work, Bansal et al. [3] also investigated updates for AI systems interacting with humans.
Recommendations and personalization: While we also make recommendations to a user, there are only weak ties to recommender systems. Even for interpretable recommendation systems [19] users primarily seek to understand decisions but do not aim to alter their behavior to obtain better recommendations. Our work optimizes inputs of individuals through iterative processing, while other works learn a transformational model to change inputs in a more straightforward manner. Learning personalized models [47] might be a middle ground where a model is learnt (or adjusted) for each individual. Personalization of models and our optimizations might be combined.
Problem
This work focuses on object classification using data from human (physical) activity, i.e., hand movements as done for sketching, writing and gestures as sketched by billions of people daily on their smartphones and other devices. Each input X created by a human should be labeled by a classifier as a specific class Y. An input X is a sequence of points
We aim to provide some guidance for a user showing how a proposal can be created. That is, we show all operations on how to draw the optimized output
Objectives and measures
We consider three main objectives:
The length-wise difference loss
The loss is:
This objective is adequate if parts of the input are removed. In this case, the original and the modified sample are identical except that one is missing some parts.
The point-wise difference loss
It might be possible to improve on all three objectives simultaneously for some samples. But they commonly require trade-offs: Keeping changes minimal is at odds with the other objectives, which encourage changes to the input. Minimizing time to create samples is at odds with recognizability. Little time to create implies little information is contained in the outputs, which makes discriminating inputs harder. Thus, what is most preferred – time or recognizability or minimal changes – is a decision that is subjective and left to the end-user. She must state her preferences. We consider two mechanisms a user can state her inclinations: (i) Weighing objectives and (ii) providing constraints. Constraints are ensured not to be violated, while objectives are optimized though no guarantee is given to what extent an objective is fulfilled. To keep matters simple, we shall consider two (primary) objectives, i.e. either minimize “time” or maximize “accuracy” while constraining the maximal distortion of the original input. Additionally, changes are constrained so that the classifier still recognizes optimized inputs correctly or they become recognizable due to the optimization process.
Methodology
To solve the problem outlined in Section 3 we use a form of trial-and-error to identify which alterations of input samples are beneficial due to the lack of labeled data. We optimize each input in an iterative fashion. In each iteration, we investigate one option to change the current input. The strategy to choose the next option to “try” depends on the number of possible options. We employ three strategies: gradient descent, a greedy approach, and a brute-force approach. Gradient descent is employed for small movements of points. The greedy and brute-force approaches are used for discrete decisions, e.g., changing the order of strokes and removal of strokes. The brute-force approach simply tries all possible solutions. Therefore, it is guaranteed to find the best solution in contrast to a greedy approach performing short-sighted decisions by picking the change that improves the current input the most. Greedy solutions can be local optima only. Gradient descent can be said to be a form of greedy optimization.
Alternative to our approach, one might train a machine learning model to optimize subsequently provided inputs in one forward pass (e.g. [43,45]). A model is likely advantageous if labeled data is available or generalization across samples is helpful or computational load is a concern. But since detailed characteristics of the input should be preserved as much as possible and the diversity of inputs is large, generalization possibilities seem limited. Our approach of applying optimization techniques directly on individual samples seems preferable, as also confirmed by our experimental evaluation.
Optimizing individual samples: To change an input by a human to an optimized input fulfilling all desirdata, we solve a constraint optimization problem encompassing discrete operations, e.g., removal of stroke segments, and continuous operations, e.g., changing coordinates of points. We employ a general framework (Algorithm 1) that performs a fixed number of iterations

Discrete optimization
Generating solution candidates: Solution candidates are generated for each operation as follows: For removal of visible parts, we consider all stroke segments
For changing the direction of strokes and changing the order of strokes, we choose a solution candidate randomly. That is, for changing the direction of a stroke, we flip the direction of a randomly chosen stroke of
For continuous optimization, we use gradient descent with gradients obtained from the classifier. We maintain a solution candidate Z that is initialized with the original input X. It is updated in each iteration using a gradient descent step treating the classifier weights as fixed and the input X as variable. Otherwise, the procedure is identical to Algorithm 1. The loss function
We conducted our evaluation on multiple datasets and models employing qualitative and quantitative methods and a user study.
Qualitative evaluation
The qualitative evaluation aims to provide an understanding of how samples are altered and to what extent our formalization of objectives indeed leads to samples that are (i) still human recognizable, (ii) diverse, and (iii) fast to draw. To this end, we depict for each shown class multiple samples. Showing multiple samples per class allows assessing diversity, i.e., if samples are still diverse or if they are optimized towards a single sample that is “best recognizable and fastest to create”. In particular, it is interesting to understand the behavior of each of our proposed methods. Thus, we discuss optimized samples for each alteration operation separately, starting with the removal of segments. As seen in Fig. 2 removal of segments in the same order as creation (SO) or reverse order (RO) tends to remove entire strokes. This can lead to unnatural sketches, e.g., angels without heads. Random removal (RA) and classifier loss-based ordering (CL) increase the number of strokes, which might be undesirable for reproducing the optimized sample. Removing only endpoints based on classifier loss (CE) tends to produce well-recognizable samples with fewer artifacts and without an increase in strokes.

Original and generated samples for removal, minimizing creation effort.

Original and generated samples for continuous optimization for various
Figure 3 shows outcomes for continuous optimization. Changes appear more subtle than for removal, particularly when optimizing for accuracy (second row). Continuous optimization tends to shorten strokes, straighten them (best seen for wings of the first angel) and it might also rotate them – as done for clock hands. Changes seem to be the least noticeable when only optimizing for effort (last row), which should lead to the largest distortion. That is, the original and the optimized input appear most similar. Our quantitative analysis shows that they differ strongly, e.g., if the visible length of strokes is compared. Samples get scaled entirely in a more uniform manner, making changes harder to spot (best seen for the third image from the left in the last row).

Original and generated samples for stroke direction and order, optimizing for time.
Figure 4 shows outcomes for altering order and direction of strokes. For most samples, order and direction can be changed to improve both (required) effort and accuracy but differences are difficult to notice. For some samples it is more apparent that optimized samples due to changing order and direction of strokes lead to less overall hand movements. Mostly differences tend to be difficult to determine hinting that humans are already fairly good at efficiently drawing sketches. Still, our quantitative evaluation shows that significant reduction in movements is possible leading also to an improvement in classifier accuracy.
We use priorly described metrics related to (i) the classifier’s capability to recognize samples (
This happens if the x and y coordinate of a point is shifted by -10 once and another time by +10, yielding a total distance of
Results varying loss term weights
Results for removal
Using the classifier loss (CL or CE) for ordering removals yields best results in terms of accuracy. When being allowed to split strokes (CL) accuracy is larger than for removing parts at the end of strokes (CE), but these gains come at the expense of having more strokes. Furthermore, effort loss and similarity to original samples (
When optimizing for effort, only CL and CE achieve much better accuracy for noisy samples
For effort, removing entire strokes from the beginning (or end) yields benefits, since in the optimization process, we do not account for moving to the first point or from the last point to some starting point, and there is often a significant distance between the endpoint of one stroke and startpoint of the next stroke. This distance is also gained when removing entire strokes. In contrast, a transition between strokes remains when removing strokes (segments) in the middle, a transition between strokes remains.
In summary, CL is best for accuracy, CE is still good for accuracy with better effort loss, and RO leads to best effort loss but only slight gains in accuracy compared to the original samples.
Accuracy reduction for Sketch-a-Net architecture and ours when reducing visible elements; bold shows best
Results for permuting strokes (P), reversing direction (R) and doing both (B)
Results for applying multiple methods sequentially; (C)ontinuous point movement, (B) Reverse and permute, (D)eletion/Removal
Thus, qualitative behavior was consistent across datasets, networks and operations. Gains could vary per dataset, network and operation considered. For example, the highest accuracy gains relative to the original were achieved for Conv1D on Quickdraw being 13.4% compared to 8.3% for LSTM on QuickDraw.
We conducted an experiment to assess if humans can reproduce optimized samples and if these reproductions indeed yield gains according to the specified objective. While the prior numerical investigation is highly suggestive, optimized samples might be unnatural for humans. Thus, reproductions of those samples might actually be worse, i.e., they might take longer to create and deviate more strongly from the proposal than non-optimized samples, making a user study necessary. Thus, the study is a first step in assessing whether optimized samples can be reproduced “better” than non-optimized ones. The next step would be to demonstrate long-term learning from optimized samples by changing user behaviors over a prolonged period of time through possibly repeated training. We believe that this is a study in its own right.
We used generated samples for method “D-B” (Table 5) optimized towards accuracy for the QuickDraw dataset. The overall pool of sketches consisted of 10 original samples per class, where each sample consisted of up to 7 strokes to ensure good readability of instructions, i.e., numbering and arrows. Since we are particularly interested in the capability, whether errors in interaction can be mitigated, we chose 5 (of the 10) original samples per class that were misclassified. Each participant had to copy an optimized version of a human input and the original version for 5 randomly selected sketches, yielding 10 sketches per user. Thus, on average 2.5 of the presented, original versions are classified correctly. The optimized and original versions were shown in random order. Users were advised to draw strokes in the order and direction as indicated by the numbering and arrows in the shown sample (Fig. 5).

During the user study participants are shown a sketch with numbered strokes and stroke start indicated (left panel). They should reproduce it (right).
We recruited 200 English-speaking participants on Amazon Mechanical Turk. We removed reproduced sketches, that did not match the instructed number of strokes or took more than 60 s to create or had only the original or the corresponding optimized sample was drawn adequately, i.e., within 60 s and with the correct number of strokes. The (LSTM) classifier had an accuracy of 54% on sketches resembling the original and 68% on sketches based on the optimized sample. The differences are statistically significant using a t-test, yielding
Our experiment primarily confirmed that inputs to an AI following instructions leads to better outcomes. Ultimately, our goal is to show that people perform better without instructions, i.e., after having learnt from instructions to create optimized versions of their own inputs. Showing this might require a longer training phase and might also be accompanied by different measures than mimicking generated samples [18]. For example, the handwriting style of adolescents has been shown to improve to some extent without direct teaching of handwriting using consultative clinical reasoning [18]. Furthermore, the user-machine interaction process during the learning phase of a human also provides many opportunities and raises a sequence of questions, e.g., is actual reproducing of inputs the best mode for learning? Is it better to provide feedback be in real-time while humans are creating inputs?
Moreover, generating a “sequence of strokes” can be based on fairly different cognitive processes with different goals. For instance, the act of handwriting or unlocking a mobile phone is fairly automatic, i.e., a person does not need to think about how to write a specific letter (or word) or about the movements of hands to unlock the screen lock. She might even be able to do it blindly, whereas generating a sketch of an object for a person not very experienced in sketching is harder and it might follow a less automatic and task-specific reasoning process, e.g., from coarse to fine details [17]. Our optimization process might break the creation process from coarse to fine to encourage drawing with less movement, i.e., effort. When optimizing for effort it can be beneficial to draw all nearby “strokes” independent of their abstraction level to minimize hand movements. Such a change in the drawing process raises questions with respect to the quality of the sketches beyond our considered metrics, e.g., are sketches obtained using a process focusing on efficiency less diverse or creative? Sketching is often a process that is part of idea and concept formation where efficiency (and recognizability by an AI) might not be of utmost importance. We believe that our work is most beneficial for tasks that are based on frequent interaction between humans and AI using symbols or patterns produced fairly automatically by humans. In such situations, efficiency and recognition of inputs by the AI are of high importance.
This study contributes by (i) setting forth objectives and measures to optimize and evaluate hand-generated human inputs, (ii) providing algorithmic foundations, and (iii) an extensive evaluation including a user study showing gains in terms of recognition accuracy for optimized samples generated by humans based on automatically generated instructions.
We believe that our objectives are of interest to any interaction of humans and AI, i.e., inputs should be fast and effortless to create, recognizable by the AI model and humans, and diverse. A limitation of our work is that the set of proposed objectives is not complete. Objectives could be considered such as configuring suggestions to maximize preferences of a human and robustness of the suggested proposals to forgery and adversarial attacks. Our work is most beneficial for scenarios where interaction between a human and AI is frequent and relying on well-established patterns or symbols. In such cases, efficiency and recognizability of samples by the AI are of high importance. Humans and AI might also interact to jointly produce a “creative” solution, e.g., to jointly create a new logo for a sports team. In this case, it is unclear whether restricting a person’s creative thought process to improve the understanding of the AI is beneficial overall.
Our algorithms improve on multiple prior works focusing on sketch abstraction [38,60]. The improvements based on quantitative comparison can be attributed to our optimization approach. We chose to optimize samples individually, which allows to adjust the optimizations to individual samples to a very high degree, but it is more computationally intensive than training a model (as was done in prior work on sketch abstraction). For inputs resulting from complex and long interactions, the options to change inputs increase and, therefore, computational demands limit our methods’ applicability. Thus, to keep the optimization process short, our methods might have to be further tailored to reduce running time or parameters in our algorithm such as the number of incremental changes must be altered. We focused on a hand-generated inputs as occurring on touch-screen by billions of users every day. While a number of publications have focused exclusively on such types of problems, it is still of interest to assess to what extent our methods can be generalized. The framework outlined in Algorithm 1 can be applied to any classifier and other problems changing human inputs in the form of movements. In our problem statement, we used an indicator
Our user study shows that samples generated based on instructions from optimized samples lead to higher recognition accuracy than the original samples. Prior work on sketch abstraction [38,60] was commonly lacking user studies. A limitation of our user study is that it is only a first step towards showing our ultimate goal that humans can interact more efficiently and with higher accuracy after having learnt instructions to create optimized samples. We believe that further algorithmic improvements are possible and should be attempted before long-term user studies are undertaken investigating both the effort it takes humans to change as well as to train (first time) users to properly interact with an AI to minimize misunderstandings. Our instructions are easy to understand, but could be improved, e.g., by showing strokes in a sequential manner rather than all at once. Instructions are also strongly sample driven. Instead, one might also instruct humans by providing general strategies for improvements rather than providing detailed instructions on a per-sample basis.
Beyond user studies, more exploration of the field of human and AI interaction and co-adaption is needed to improve interaction on a semantic level as needed for interaction with chatbots beyond making chatbots more human [13,14], to consider other recognition problems such as speech recognition [58], to perform a joint optimization of human inputs and AI models, e.g., interactive modeling [55], to derive optimization algorithms that use inputs of a human to provide general rules as feedback, to assess additional concerns such as acceptance of technology by humans [54].
Conclusions
Humans interact more and more with AI. This leads to forms of co-adaption and questions like “How can an AI adapt to improve interaction? And how can humans do so?”. This paper relates to the second question for hand gestures. It provides first steps towards improving interaction by showing how human inputs to an AI can be optimized. Our approach, optimizing samples individually seems beneficial. For stroke removals, we improve prior work considerably. To the best of our knowledge, we are the first to investigate other operations for altering inputs such as stroke order and direction. Our evaluation indicates that optimized samples lead to less misclassifications, while still bearing similarity to the original input and not requiring more time to create.
