Optimizing human hand gestures for AI-systems

Abstract

Humans interact more and more with systems containing AI components. In this work, we focus on hand gestures such as handwriting and sketches serving as inputs to such systems. They are represented as a trajectory, i.e., a sequence of points, that is altered to improve interaction with an AI model while keeping the model fixed. Optimized inputs are accompanied by instructions on how to create them. We aim to cut down on human effort and recognition errors while limiting changes to the original inputs. We derive multiple objectives and measures and propose continuous and discrete optimization methods that leverage the AI model to improve samples iteratively by removing, shifting and reordering points of the gesture trajectory. Our quantitative and qualitative evaluation shows that mimicking generated proposals that differ only modestly from the original ones leads to lower error rates and requires less effort. Furthermore, our work can be easily adjusted for sketch abstraction, improving on prior work.

Keywords

Optimization, deep learning, algorithms, human-AI interaction, computer vision

1. Introduction

“Smart systems” containing machine learning components have already deeply penetrated our daily life, and humans interact with multiple such systems daily: recommendation systems (on web pages), voice assistants, gesture recognition systems, and safety-critical systems such as driver-assisted cars, to name a few. However, interaction between humans and AI consumes a lot of resources. Furthermore, the time and economic loss due to miscommunication errors is considerable. Even simple gestures for unlocking smartphone screens are not detected with 100% accuracy in laboratory settings [51]. Given that billions of unlock attempts fail almost daily, improving accuracy by just 0.1% can lead to yearly global time savings in the order of millions of hours for such a simple gesture alone.¹

¹ Smartphone users worldwide: 3.8 billion (https://www.statista.com/statistics/330695/number-of-smartphone-users-worldwide/); 30% use pattern passwords (e.g., gestures) [35], i.e., about 1 billion = $10^{9}$ users. Assuming 20 unlock attempts per phone and day (https://www.statista.com/statistics/1050339/average-unlocks-per-day-us-smartphone-users/), this gives $2 \times 10^{10}$ attempts per day in total. Assuming the time lost when an unlock fails is 3 s, the total time gained per day if accuracy improves by 0.1% is $2 \times 10^{10} \cdot 3 \cdot 0.001 = 6 \times 10^{7}$ seconds, and per year it is $2 \times 10^{10} \cdot 3 \cdot 0.001 \cdot 365 / 3600 \approx 6.08 \times 10^{6}$ hours, i.e., about 6.08 million hours.

By improving interaction with AI systems, humans might benefit from less time needed to create inputs and fewer misunderstandings in interaction. While the idea of co-adaptation is known in human-computer interaction [16,20], in the machine learning community, data originating from people is typically fixed, and the goal is to find the best possible model. This work provides first steps toward helping humans adapt their inputs, focusing on hand-generated inputs. Humans are offered feedback allowing them to alter their inputs to AI systems. See Fig. 1 for a process overview. The human provides a sketch in the form of a sequence of strokes, where each stroke is a sequence of points. The human receives the optimized sketch, i.e., a visualization of the optimized sequence with instructions in the form of stroke order and direction indicated by red arrows. As we show, if a human draws sketches to be more similar to the proposal, taking into account stroke order, the sketches can be detected more reliably by the AI and the human takes less time to create them (given some practice).

Fig. 1.

Left panel: humans change their inputs to AI based on feedback, leading to time savings and better recognizability; right panel: password pattern input: reducing recognition failures of password patterns alone by as little as 0.1% can save millions of wasted hours yearly.

Thus, there are good reasons why humans might want to improve on their (hand-generated) inputs. Humans “copying” or adjusting to optimized human inputs should benefit from fewer errors in interaction and save time. At the same time, those changes should be easy and still preserve the personal style of the human to maintain human diversity. Unfortunately, fulfilling all of these objectives is difficult for multiple reasons. First, there is a trade-off between recognition accuracy and effort to create. Simplifying inputs too much will increase error rates. Making inputs more complex, e.g., adding redundant features to inputs, might reduce errors but increase the amount of effort to create them. Second, streamlining human-to-AI interaction is possibly more intricate than human-to-human interaction since AI systems process information differently and are sensitive to input changes that humans hardly notice. Using almost invisible, adversarial perturbations, classifiers might be “fooled” into misjudging samples that can be categorized without any problems by humans (e.g., [42]). From our perspective, this is both good and bad news. While it suggests that minor changes exist that might increase confidence in the correct class (rather than in an incorrect class as used in an adversarial setting), it also highlights that humans might not even be able to distinguish an improved from a non-improved sample if optimization is not done with care. Moreover, movements of (human) limbs and the vocal tract are subject to stochastic variation, making it impossible for humans to alter inputs in very subtle ways as done in adversarial settings like [42].

Designing adequate optimization objectives is non-trivial. Even if the proposed samples optimize a specific mathematical objective, it is unclear whether these samples are of practical value. That is, humans might deem the proposed inputs unnatural, and they might struggle to reproduce suggested changes that deviate from deeply rooted habits, beliefs and behaviors.

This work discusses improving human-to-AI interaction through optimization of hand-generated inputs by proposing to adjust the order of movements, removing input parts and altering trajectories of movements. The optimization aims at all of the aforementioned goals: reducing time to create inputs, improving recognition accuracy of inputs by the AI model, maintaining recognizability by humans as well as the diversity of samples. The problem of sketch abstraction [38,60], in particular as treated in prior work, can be seen as a special case of our work neglecting multiple of our objectives such as time to create inputs, diversity of samples, and providing instructions to create them. Our approach to change samples also differs compared to these works. We alter each sample through an iterative optimization procedure rather than by training a model that generates an improved version of an input without any optimization. That is, we gradually change an input in a trial-and-error manner, whereas prior work learns a model that directly transforms the input using patterns learnt from the training data of the model. Due to the huge diversity of samples, the training data is likely rather sparse and, therefore, a trained model cannot adjust to the unique characteristics of a training sample even though they might be relevant for the optimization.

Our work contributes as follows:

Defining high-level objectives and mathematically concise measures both for optimization and for evaluation. Our evaluation accounts for discrepancies due to innate variation in human movement between suggested inputs and those created by humans.

Proposing algorithms to optimize single hand-generated samples of an individual in an iterative, exploratory manner rather than using a separately trained model. While the latter might be computationally faster, it typically performs some sort of generalization, i.e., abstraction, leading to more uniform, less diverse samples. Our approach leads to highly personalized samples. These samples are faster to generate and classified with higher accuracy while preserving diversity among humans. Among our optimization operations, only removal of strokes has been studied extensively in prior work in the context of sketch abstraction. Our work outperforms the state-of-the-art with respect to recognizability of abstracted sketches by an AI.

Showing inputs in combination with how to create them. We utilize (hand) movement data, highlighting the order and direction of movements rather than exposing a human only to the suggested optimized inputs without instructions on how to create them efficiently.

2. Related work

Altering human inputs: Schneider introduced a human-to-AI coach based on an auto-encoder that, given a picture of a digit, outputs a digit that has lower classifier loss and potentially consists of fewer pixels [45]. In contrast, we optimize samples individually in an iterative manner. Additionally, we provide instructions on how to create samples and we evaluate on actual users. Our work also uses more complex datasets (see the survey on AI and sketches [56]). Abstracting sketches by removing stroke segments and entire strokes while preserving semantics was studied in [38,43]. In these works, an agent learns to select strokes relevant for a classifier to maintain the correct class using reinforcement learning (RL). The implementation using RL differs from our approach of improving individual samples directly. The optimization procedure in [38,43] neglects all constraints and focuses on maintaining accuracy, employing just one option for abstracting: removal. In this paper, we also consider a gradual movement of points and the creation process, i.e., altering the order of strokes, while maintaining multiple constraints. Liu et al. used GANs to complete sketches corrupted through occlusion [33]. They achieved high-quality results comparable to methods such as image inpainting. Many works deal with creating human inputs, e.g., synthesizing handwriting [23]. The focus of these works is typically on creativity [6], i.e., generating novel, realistic-looking artifacts rather than altering given inputs.

Explainability: This paper has strong ties to explanations [24,37] and explanations in the field of human-AI interaction [27]. Counterfactual explanations seek to identify a modification of the input to obtain another class [15,22]. Dhurandhar et al. identify minimal changes to digits on a pixel level using perturbations [15]. Thus, in contrast to our work, they focus on misclassified samples only. Moreover, the suggested changes commonly involve adding or removing multiple pixels distributed across the digit. This is infeasible for humans, since they cannot reproduce such changes. A human might improve their interaction using a better understanding of AI, e.g., with the help of visualizations highlighting what aspects of an input are critical for classification [41,49]. Generally, explainability [37] aims at making AI models human-understandable. In our setup, we do not explicitly aim at understandability, but a human might understand AI decision-making by generalizing from multiple optimized inputs, i.e., a person might learn how samples can be altered without distorting their recognizability.

Human-AI interaction: Rzepka and Berger summarized the effects of user and AI system characteristics in general [44], while Martins et al. focused on digital AI assistants [36]. Interaction between AI and users was studied in various contexts including social robots [36,40]. With few exceptions, the primary focus has been on desirable AI behavior, e.g., empathy, or on strategies for how AI can adapt to user behavior [12,21]. Bansal et al. explicitly investigated how users can alter the behavior [2], i.e., override decisions of the AI, by understanding the error boundary of a classifier, while Shneiderman provides general guidelines on human-centered AI [52]. In follow-up work, Bansal et al. [3] also investigated updates for AI systems interacting with humans.

Recommendations and personalization: While we also make recommendations to a user, there are only weak ties to recommender systems. Even for interpretable recommendation systems [19] users primarily seek to understand decisions but do not aim to alter their behavior to obtain better recommendations. Our work optimizes inputs of individuals through iterative processing, while other works learn a transformational model to change inputs in a more straightforward manner. Learning personalized models [47] might be a middle ground where a model is learnt (or adjusted) for each individual. Personalization of models and our optimizations might be combined.

3. Problem

This work focuses on object classification using data from human (physical) activity, i.e., hand movements as performed for sketching, writing and gestures by billions of people daily on their smartphones and other devices. Each input X created by a human should be labeled by a classifier as a specific class Y. An input X is a sequence of points $X = ((x_{i}, y_{i}, I_{i}))$ ordered in the way the input was created, where $(x_{i}, y_{i})$ are the coordinates of the i-th point and $I_{i}$ is an indicator having one of three values $\{-1, 0, 1\}$. The value ‘1’ indicates that a line was drawn from the prior point $i - 1$ to point i. The value ‘0’ means that no line was drawn when moving to this point. The value ‘$-1$’ only applies to (array) data structures of fixed length if there are fewer points than the array length; it indicates that the (array) position is empty. The sequence of points can be split into contiguous line segments, i.e., strokes. A stroke denotes a sequence of connected points, so that no points before and after the sequence (if they exist) are connected to any point of the sequence. We denote two connected points of a stroke, i.e., $((x_{i}, y_{i}, I_{i}), (x_{i + 1}, y_{i + 1}, I_{i + 1}))$, as a stroke segment. For an input (sample) X with label Y, the classifier C should return class Y, i.e., ideally $C (X) = Y$. The classifier C processing human inputs is optimized using a known loss $L_{C}$ such as the cross-entropy loss. The model C is treated as unchangeable, but it can be used in the optimization process. Our human interaction optimization algorithm O is provided a sample X with its label Y. It computes an optimized input (or proposal) $\hat{X} := O (X, Y)$ that should improve on X according to one or several objectives. Note that there is no labeled data available, i.e., there are no optimal proposals that could be used to train a model in a straightforward supervised manner.
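To make the input representation concrete, the following minimal Python sketch stores a sample as a list of $(x_{i}, y_{i}, I_{i})$ triples and splits it into strokes. The function name `split_into_strokes` and the toy coordinates are illustrative assumptions, not part of the original work.

```python
from typing import List, Tuple

Point = Tuple[float, float, int]  # (x, y, I) with I in {-1, 0, 1}


def split_into_strokes(points: List[Point]) -> List[List[Point]]:
    """Group a point sequence into strokes using the indicator I.

    I == 1: a line was drawn from the previous point (continue the stroke).
    I == 0: no line was drawn to this point (a new stroke starts here).
    I == -1: empty slot of a fixed-length array (skipped).
    """
    strokes: List[List[Point]] = []
    for x, y, indicator in points:
        if indicator == -1:
            continue
        if indicator == 0 or not strokes:
            strokes.append([])
        strokes[-1].append((x, y, indicator))
    return strokes


# Hypothetical toy input: an L-shaped stroke, a short second stroke, one padding slot.
sketch = [(0, 0, 0), (10, 0, 1), (10, 10, 1), (30, 30, 0), (40, 30, 1), (0, 0, -1)]
strokes = split_into_strokes(sketch)
print([len(s) for s in strokes])  # prints [3, 2]
```

Each stroke segment then corresponds to two consecutive points within one of the returned lists.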

We aim to provide guidance for a user, showing how a proposal can be created. That is, we show all operations needed to draw the optimized output $\hat{X}$ starting from her input X. We consider the following alterations of the creation process of the original input: (i) changing the (drawing) direction of a stroke, (ii) changing the position of one or more strokes in the drawing order, (iii) moving a point, and (iv) deleting a stroke segment. If a deleted stroke segment does not consist of the first two or last two points of a stroke, the stroke is split into two shorter strokes.

3.1. Objectives and measures

We consider three main objectives:

Minimizing time to create inputs: The time to create an optimized or non-optimized sample is easily measured in practice, i.e., by simply observing how long a user takes to complete a sample. However, it is difficult to incorporate into a model as an optimization objective since the exact mechanics of human motion are hard to model. Therefore, the impact of any alteration of a human input on creation time is hard to estimate exactly. Our optimization algorithm uses a more tangible proxy loss to estimate the time to create an optimized sample, i.e., the total distance the hand has to move to create the sketch. We do not include the movement to the start point, but we account for movements of the hand between the endpoint of a stroke and the starting point of the next stroke. We denote this as effort loss:

$$L_{E} (X) := \sum_{i = 0}^{| X | - 2} \sqrt{(x_{i} - x_{i + 1})^{2} + (y_{i} - y_{i + 1})^{2}}$$
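As a minimal illustration, the effort loss can be computed as below. Skipping padding entries with $I_{i} = -1$ is an assumption made here for fixed-length arrays, and the sample values are hypothetical.

```python
import math


def effort_loss(points):
    """L_E: total hand travel over consecutive points, pen-up moves included.

    Assumption: padding entries (I == -1) are dropped before summing.
    """
    coords = [(x, y) for x, y, indicator in points if indicator != -1]
    return sum(
        math.hypot(x0 - x1, y0 - y1)
        for (x0, y0), (x1, y1) in zip(coords, coords[1:])
    )


# Two strokes of length 5 each, plus a pen-up move of length 6 between them.
sample = [(0, 0, 0), (3, 4, 1), (3, 10, 0), (6, 14, 1)]
print(effort_loss(sample))  # 5 + 6 + 5 = 16.0
```

The pen-up move between the two strokes contributes to the loss, which is why reordering or flipping strokes can reduce effort without changing the visible sketch.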

Minimizing misunderstanding while accounting for human variation: The amount of information wrongly extracted or interpreted by the AI should be kept as small as possible. Optimized inputs should lead to better task performance, i.e., higher recognition accuracy, when processed by the AI. Thus, a generated sample $\hat{X}$ minimizes the classifier loss $L_{C} (\hat{X})$ among other losses. Proposed samples should allow for variance in human behavior. Human behavior is characterized by unintentional variation. For example, a human is not able to reproduce even a single one of her strokes exactly. Since classifiers are known to be sensitive to small changes in the input, as witnessed by adversarial examples, the robustness of optimized samples should be evaluated, e.g., as done for adversarial samples using linear programs [7]. We model human variation by creating noisy samples of a proposed input $\hat{X}$. We measure accuracy ${Acc}_{Noise}$ on these noisy samples. There are multiple approaches to create noisy samples, e.g., using local and global deformations [57]. We employ a well-established, easy-to-comprehend approach: a noisy sample ${\hat{X}}^{'}$ is created by adding uniform noise to each coordinate. For a sequence $\hat{X}$, the noisy sequence ${\hat{X}}^{'}$ is

$${\hat{X}}^{'} := \big( ({\hat{x}}_{i} + \epsilon_{i, 0}, {\hat{y}}_{i} + \epsilon_{i, 1}, {\hat{I}}_{i}) \mid ({\hat{x}}_{i}, {\hat{y}}_{i}, {\hat{I}}_{i}) \in \hat{X} \big)$$

Each $\epsilon_{i, j} \in [- r, r]$ is chosen uniformly and independently at random within a fixed range. The range is bounded from above and below by a constant r that depends on the dataset, e.g., it might be 20 pixels or 1 cm.
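The noise model can be sketched in a few lines of Python. The `classify` argument is a hypothetical stand-in for the fixed classifier C, and the function names, default values and seed handling are assumptions for illustration only.

```python
import random


def noisy_sample(points, r, rng):
    """Add independent uniform noise from [-r, r] to every coordinate; keep I."""
    return [
        (x + rng.uniform(-r, r), y + rng.uniform(-r, r), indicator)
        for x, y, indicator in points
    ]


def noisy_accuracy(points, label, classify, r=10.0, n=10, seed=0):
    """Fraction of n noisy copies of one proposal that are classified as `label`."""
    rng = random.Random(seed)
    hits = sum(classify(noisy_sample(points, r, rng)) == label for _ in range(n))
    return hits / n


proposal = [(100.0, 100.0, 0), (120.0, 110.0, 1)]
perturbed = noisy_sample(proposal, r=10.0, rng=random.Random(1))
```

Averaging `noisy_accuracy` over all proposals would give a dataset-level estimate in the spirit of ${Acc}_{Noise}$.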

Minimize modifications of original samples: The proposed samples should bear large similarity to the original inputs. Preserving characteristics of the inputs as much as possible is aligned with the idea that inputs remain comprehensible for humans, diversity is maintained among humans, and changes are easy to comprehend and execute for humans. We use two loss objectives depending on what type of alteration is applied to an input.

The length-wise difference loss $L_{D}$ captures how much the original and suggested sample differ in visible parts. Length corresponds to the sum of lengths of all visible strokes. We do not account for ordering and direction of strokes, which impact distances to be moved but not the length of drawn strokes. The length of visible parts, i.e., strokes, of an input X is given by

$$V (X) := \sum_{\substack{i = 0 \\ I_{i + 1} \neq 0}}^{| X | - 2} \sqrt{(x_{i} - x_{i + 1})^{2} + (y_{i} - y_{i + 1})^{2}}$$

The loss is: $L_{D} (X, \hat{X}) : = | V (X) - V (\hat{X}) |$ .

This objective is adequate if parts of the input are removed. In this case, the original and the modified sample are identical except that one is missing some parts.
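A small Python sketch of the visible length and the length-wise difference loss follows. It assumes padding entries have already been stripped, so that $I_{i + 1} \neq 0$ means a drawn segment; the function names and toy values are illustrative.

```python
import math


def visible_length(points):
    """V(X): summed length of segments i -> i+1 whose indicator I_{i+1} != 0."""
    total = 0.0
    for (x0, y0, _), (x1, y1, indicator) in zip(points, points[1:]):
        if indicator != 0:
            total += math.hypot(x0 - x1, y0 - y1)
    return total


def length_difference_loss(original, proposal):
    """L_D: absolute difference of the visible lengths."""
    return abs(visible_length(original) - visible_length(proposal))


original = [(0, 0, 0), (3, 4, 1), (3, 10, 0), (6, 14, 1)]
proposal = original[:2]  # the second stroke was removed
loss = length_difference_loss(original, proposal)
relative = loss / visible_length(original)
print(loss, relative)  # 5.0 0.5
```

Dividing by the visible length of the original, as in `relative`, is one way to express the deviation as a percentage; whether the authors normalize in exactly this way is an assumption.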

The point-wise difference loss $L_{P}$ captures the displacement of individual points between the original and suggested sample. It is suitable when the positions of points are altered, i.e., points are moved but their order is kept:

$$L_{P} (X, \hat{X}) := \sum_{i = 0}^{| X | - 1} \sqrt{(x_{i} - {\hat{x}}_{i})^{2} + (y_{i} - {\hat{y}}_{i})^{2}}$$
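The point-wise loss can be sketched as follows. The length check is an added safeguard reflecting the requirement that points keep their order; the names and values are hypothetical.

```python
import math


def pointwise_loss(original, proposal):
    """L_P: summed Euclidean displacement of corresponding points."""
    if len(original) != len(proposal):
        raise ValueError("L_P needs the same number of points in the same order")
    return sum(
        math.hypot(x - xh, y - yh)
        for (x, y, _), (xh, yh, _) in zip(original, proposal)
    )


original = [(0, 0, 0), (10, 0, 1), (10, 10, 1)]
proposal = [(0, 0, 0), (13, 4, 1), (10, 10, 1)]  # middle point moved by (3, 4)
print(pointwise_loss(original, proposal))  # 5.0
```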

It might be possible to improve on all three objectives simultaneously for some samples. But they commonly require trade-offs: Keeping changes minimal is at odds with the other objectives, which encourage changes to the input. Minimizing time to create samples is at odds with recognizability. Little time to create implies that little information is contained in the outputs, which makes discriminating inputs harder. Thus, what is most preferred (time, recognizability or minimal changes) is a subjective decision that is left to the end-user. She must state her preferences. We consider two mechanisms by which a user can state her preferences: (i) weighing objectives and (ii) providing constraints. Constraints are guaranteed not to be violated, while objectives are optimized without any guarantee on the extent to which an objective is fulfilled. To keep matters simple, we shall consider two (primary) objectives, i.e., either minimize “time” or maximize “accuracy” while constraining the maximal distortion of the original input. Additionally, changes are constrained so that the classifier still recognizes optimized inputs correctly or they become recognizable due to the optimization process.

4. Methodology

To solve the problem outlined in Section 3, we use a form of trial-and-error to identify which alterations of input samples are beneficial, since labeled data is lacking. We optimize each input in an iterative fashion. In each iteration, we investigate one option to change the current input. The strategy to choose the next option to “try” depends on the number of possible options. We employ three strategies: gradient descent, a greedy approach, and a brute-force approach. Gradient descent is employed for small movements of points. The greedy and brute-force approaches are used for discrete decisions, e.g., changing the order of strokes and removal of strokes. The brute-force approach simply tries all possible solutions. Therefore, it is guaranteed to find the best solution, in contrast to a greedy approach, which makes short-sighted decisions by picking the change that improves the current input the most. Greedy solutions may be only local optima. Gradient descent can be seen as a form of greedy optimization.

As an alternative to our approach, one might train a machine learning model to optimize subsequently provided inputs in one forward pass (e.g., [43,45]). A model is likely advantageous if labeled data is available, generalization across samples is helpful, or computational load is a concern. But since detailed characteristics of the input should be preserved as much as possible and the diversity of inputs is large, generalization possibilities seem limited. Our approach of applying optimization techniques directly to individual samples seems preferable, as also confirmed by our experimental evaluation.

Optimizing individual samples: To change an input by a human to an optimized input fulfilling all desiderata, we solve a constrained optimization problem encompassing discrete operations, e.g., removal of stroke segments, and continuous operations, e.g., changing coordinates of points. We employ a general framework (Algorithm 1) that performs a fixed number of iterations $n_{iter}$ for our discrete optimization. The framework is adjusted to each alteration operation, i.e., removing strokes, reversing stroke direction or changing stroke order. Starting from the initial human input X, an operation is applied to the current best solution $\hat{X}$ to create a new solution candidate Z in each iteration. The new solution candidate Z is then evaluated. To become the new best solution $\hat{X}$, the candidate Z must improve on the objective $obj$, which is either time or accuracy, and it must fulfill the following two constraints: (i) It does not deviate too much from the original input. To this end, the user specifies the maximal allowed deviation d of the candidate Z from the original input X. The deviation is measured in terms of visible length, i.e., the constraint is fulfilled as long as the visible length of strokes is similar. (ii) The current sample is classified correctly or no best solution $\hat{X}$ has been classified correctly so far. To summarize, if a solution candidate Z originating from an alteration of the current best solution $\hat{X}$ fulfills both constraints and has a lower loss value according to the user-determined objective $obj$, the best solution $\hat{X}$ is set to the solution candidate Z.
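The acceptance rule described above can be sketched as a short Python loop. This is a reading of the prose description, not the authors' listing: all callables (`propose`, `objective`, `deviation`, `classify`) are hypothetical placeholders supplied by the caller, and the toy usage replaces a sketch by a plain list.

```python
from typing import Callable, Optional, TypeVar

S = TypeVar("S")  # any representation of a sample


def discrete_optimize(
    x: S,
    label: int,
    propose: Callable[[S], Optional[S]],
    objective: Callable[[S], float],
    deviation: Callable[[S, S], float],
    classify: Callable[[S], int],
    d: float,
    n_iter: int,
) -> S:
    """Keep a candidate only if it satisfies both constraints and lowers the objective."""
    best = x
    best_correct = classify(best) == label
    for _ in range(n_iter):
        z = propose(best)  # one alteration of the current best solution
        if z is None:
            continue
        if deviation(x, z) > d:  # constraint (i): stay close to the original
            continue
        z_correct = classify(z) == label
        if best_correct and not z_correct:  # constraint (ii)
            continue
        if objective(z) < objective(best):
            best, best_correct = z, z_correct
    return best


# Toy usage: "remove the last element", effort = length, at most 25% removed.
original = list(range(10))
result = discrete_optimize(
    original,
    label=1,
    propose=lambda s: s[:-1] if s else None,
    objective=len,
    deviation=lambda a, b: (len(a) - len(b)) / len(a),
    classify=lambda s: 1 if len(s) >= 3 else 0,
    d=0.25,
    n_iter=10,
)
print(len(result))  # 8
```

In the toy run, two removals are accepted and every further candidate violates the deviation bound, so the loop leaves the best solution unchanged for the remaining iterations.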

Algorithm 1

Discrete optimization

Generating solution candidates: Solution candidates are generated for each operation as follows. For removal of visible parts, we consider all stroke segments $s_{i} := ((x_{i}, y_{i}, I_{i}), (x_{i + 1}, y_{i + 1}, I_{i + 1}))$. A solution candidate Z based on $\hat{X}$ is obtained by removing point i from $\hat{X}$, i.e., $Z := \hat{X} \setminus s_{i}$. In case point $i + 1$ is the last point of a stroke, it is also removed. Otherwise, we set $I_{i + 1} = 0$ to indicate the start of a new stroke, i.e., no line is drawn to its first point. The order in which the solution candidates are created is important. That is, it matters which stroke segments are removed first. A natural choice (denoted as CL) is to remove the stroke segments with the smallest classifier loss $L_{C}$ first. The idea is that segments that are only weakly indicative of the given class Y can likely be removed without causing a misclassification. Increasing the number of strokes due to removal of stroke segments is not desirable. Therefore, we consider a variant (CE), where only stroke segments at the beginning or end of a stroke are removed. For comparison, we also consider removing segments in the reverse order in which points were created (RO) and in the same order (SO). The motivation is that this procedure consecutively removes one stroke after the other without splitting strokes. Furthermore, humans might first draw the most important high-level outline that might help in distinguishing objects and add more details over time.
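The following Python sketch enumerates removal candidates. It assumes the indicator convention of Section 3 ($I_{i + 1} = 1$ marks a drawn segment) and leaves the scoring of a segment to a caller-supplied function, since how exactly the classifier loss is attributed to a segment is not spelled out here. All names are hypothetical.

```python
def segments(points):
    """Indices i such that a line was drawn from point i to point i + 1."""
    return [i for i in range(len(points) - 1) if points[i + 1][2] == 1]


def end_segments(points):
    """CE variant: only segments at the beginning or end of a stroke."""
    result = []
    for i in segments(points):
        starts_stroke = i == 0 or points[i][2] != 1
        ends_stroke = i + 2 >= len(points) or points[i + 2][2] != 1
        if starts_stroke or ends_stroke:
            result.append(i)
    return result


def order_candidates(candidates, score):
    """Try candidates with the smallest score first (e.g. a classifier-loss score)."""
    return sorted(candidates, key=score)


# One stroke with three segments followed by a stroke with one segment.
sketch = [(0, 0, 0), (1, 0, 1), (2, 0, 1), (3, 0, 1), (9, 9, 0), (9, 8, 1)]
print(segments(sketch))      # [0, 1, 2, 4]
print(end_segments(sketch))  # [0, 2, 4]
```

Restricting candidates to `end_segments` never splits a stroke, which matches the motivation given for the CE variant.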

For changing the direction of strokes and changing the order of strokes, we choose a solution candidate randomly. That is, for changing the direction of a stroke, we flip the direction of a randomly chosen stroke of $\hat{X}$ by reversing the sequence of points of the stroke. To change the order of strokes, we perform a cut-and-paste operation. That is, we choose a random sequence consisting of subsequent strokes, remove it and insert it either after a random stroke or at the very beginning of the sequence. We also consider doing both, i.e., for a cut-and-paste operation we flip the direction randomly with 50% probability. This randomized approach is essentially equal to trying all options given the number of strokes is small.
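The two randomized operations can be sketched in Python on a list of strokes, where each stroke is a list of points. Which stroke is flipped after a cut-and-paste is not specified above; flipping one uniformly chosen stroke is an assumption of this sketch, as are all function names.

```python
import random


def flip_stroke(strokes, k):
    """Reverse the point order of stroke k; the input list is not modified."""
    out = [list(s) for s in strokes]
    out[k].reverse()
    return out


def cut_and_paste(strokes, start, length, insert_at):
    """Cut strokes[start:start + length] and insert it at index insert_at
    of the remaining strokes (insert_at == 0 means the very beginning)."""
    block = strokes[start:start + length]
    rest = strokes[:start] + strokes[start + length:]
    return rest[:insert_at] + block + rest[insert_at:]


def random_candidate(strokes, rng, flip_prob=0.5):
    """Random cut-and-paste, followed by a flip with probability flip_prob."""
    n = len(strokes)
    start = rng.randrange(n)
    length = rng.randint(1, n - start)
    insert_at = rng.randint(0, n - length)
    out = cut_and_paste(strokes, start, length, insert_at)
    if rng.random() < flip_prob:
        out = flip_stroke(out, rng.randrange(n))
    return out


strokes = [[(0, 0), (1, 0)], [(2, 2), (3, 3)], [(5, 5), (6, 6)], [(9, 9), (8, 8)]]
moved = cut_and_paste(strokes, start=1, length=2, insert_at=0)
candidate = random_candidate(strokes, random.Random(0))
```

Both operations leave the set of points untouched, so the visible sketch stays the same and only the travel between strokes, i.e., the effort, changes.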

For continuous optimization, we use gradient descent with gradients obtained from the classifier. We maintain a solution candidate Z that is initialized with the original input X. It is updated in each iteration using a gradient descent step, treating the classifier weights as fixed and the input X as variable. Otherwise, the procedure is identical to Algorithm 1. The loss function $L_{Tot}$ is a weighted combination of the classifier loss $L_{C}$, the effort loss $L_{E}$ and the point-wise difference loss $L_{P}$:

$$L_{Tot} (Z) := \beta_{C} L_{C} (Z) + \beta_{P} L_{P} (X, Z) + \beta_{E} L_{E} (Z)$$
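The gradient step described above can be written out as follows. This is a sketch under assumptions not stated in the text: a step size $\eta$, plain gradient descent without momentum, $n_{iter}$ iterations as in the discrete case, and updates applied to the coordinates only while the indicators $I_{i}$ stay fixed.

```latex
Z^{(0)} := X, \qquad
Z^{(t+1)} := Z^{(t)} - \eta \, \nabla_{Z} L_{Tot}\big(Z^{(t)}\big), \qquad t = 0, \dots, n_{iter} - 1
```

Only the input changes from step to step; the classifier weights enter $L_{C}$ as constants.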

5. Evaluation

We conducted our evaluation on multiple datasets and models employing qualitative and quantitative methods and a user study.

Datasets: We used 1 million samples of the QuickDraw [25] dataset distributed equally among the first 30 classes. It consists of human sketches of an object given its name. The data was created primarily using mouse and touch devices. The second dataset, called Homus, consists of musical notes drawn using a pen [11]. It consists of 32 classes but only 15200 samples. We padded or stripped sequences to be of fixed length, i.e., 104 points, which shortened less than 3% of the QuickDraw data and about 10% of the musical notes data. For evaluation, we optimized 256 samples for each class of the QuickDraw dataset that were not used for training, yielding 7680 test samples. For the Homus dataset, we used 20% of the data, corresponding to 3040 samples. Note that our optimization algorithms do not require any training data. They optimize test samples directly.

Models: For classifier C, we trained three instances of each of two classifier architectures for each dataset, yielding 12 classifiers. For each instance, we ran our optimization algorithms. We used PyTorch 1.6.0 and trained on an Nvidia 2080 Ti GPU. Our “LSTM” architecture is made of 3 Conv1D, 2 LSTM and 2 dense layers with dropout. The “Conv1D” network consists of 3 Conv1D layers with stride 2 and a dense layer. For comparison with prior work, we also implemented an architecture based on [57]. If not stated differently, we used $d = 20\%$, i.e., at most 20% can be deleted, and movements are limited to 20%.

Procedure: We conduct (i) a qualitative evaluation, illustrating created samples, (ii) a quantitative evaluation, discussing the impact of parameters and comparing various approaches, and (iii) a user study, where humans have to redraw original and optimized samples. Results for all combinations of models and datasets were very similar in nature. Therefore, we focus on just one scenario, i.e., the LSTM network on the QuickDraw dataset. Results for other setups are summarized at the end.

5.1. Qualitative evaluation

The qualitative evaluation aims to provide an understanding of how samples are altered and to what extent our formalization of objectives indeed leads to samples that are (i) still human recognizable, (ii) diverse, and (iii) fast to draw. To this end, we depict for each shown class multiple samples. Showing multiple samples per class allows assessing diversity, i.e., if samples are still diverse or if they are optimized towards a single sample that is “best recognizable and fastest to create”. In particular, it is interesting to understand the behavior of each of our proposed methods. Thus, we discuss optimized samples for each alteration operation separately, starting with the removal of segments. As seen in Fig. 2 removal of segments in the same order as creation (SO) or reverse order (RO) tends to remove entire strokes. This can lead to unnatural sketches, e.g., angels without heads. Random removal (RA) and classifier loss-based ordering (CL) increase the number of strokes, which might be undesirable for reproducing the optimized sample. Removing only endpoints based on classifier loss (CE) tends to produce well-recognizable samples with fewer artifacts and without an increase in strokes.

Fig. 2.

Original and generated samples for removal, minimizing creation effort.

Fig. 3.

Original and generated samples for continuous optimization for various $\beta_{C}$, $\beta_{P}$ and $\beta_{E}$.

Figure 3 shows outcomes for continuous optimization. Changes appear more subtle than for removal, particularly when optimizing for accuracy (second row). Continuous optimization tends to shorten strokes, straighten them (best seen for wings of the first angel) and it might also rotate them – as done for clock hands. Changes seem to be the least noticeable when only optimizing for effort (last row), which should lead to the largest distortion. That is, the original and the optimized input appear most similar. Our quantitative analysis shows that they differ strongly, e.g., if the visible length of strokes is compared. Samples get scaled entirely in a more uniform manner, making changes harder to spot (best seen for the third image from the left in the last row).

Fig. 4.

Original and generated samples for stroke direction and order, optimizing for time.

Figure 4 shows outcomes for altering order and direction of strokes. For most samples, order and direction can be changed to improve both (required) effort and accuracy, but differences are difficult to notice. For some samples, it is more apparent that changing the order and direction of strokes leads to less overall hand movement. Mostly, differences tend to be difficult to determine, hinting that humans are already fairly good at efficiently drawing sketches. Still, our quantitative evaluation shows that a significant reduction in movements is possible, which also leads to an improvement in classifier accuracy.

5.2. Quantitative evaluation

We use the previously described metrics related to (i) the classifier’s capability to recognize samples (accuracy $Acc$ and ${Acc}_{Noise}$), (ii) the effort of creating a sample, $L_{E}$, and (iii) the distortion of the original sample, $L_{D}$ and $L_{P}$. To compute ${Acc}_{Noise}$, we create 10 noisy samples for each input, where each $\epsilon_{i, j} \in [- 10, 10]$. For the QuickDraw dataset, where points are within a range of $[0, 255]$, this means that the maximal change in distance between two points due to the addition of $\epsilon_{i, j}$ to each coordinate is about 28 pixels, or about 10% of the canvas used for sketching.²

²
This happens if the x and y coordinate of a point is shifted by -10 once and another time by +10, yielding a total distance of $\sqrt{{(- 10 - 10)}^{2} + {(- 10 - 10)}^{2}} = 28.3$

We report the accuracy on all created samples, i.e. for the Quickdraw dataset with 7680 test samples, we report the accuracy computed using 76800 samples.
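The noise setup can be illustrated with a short sketch. It assumes that each offset is drawn uniformly and independently per coordinate (the distribution is an assumption here), and the names `noisy_copies` and `max_relative_shift` are illustrative only.

```python
import math
import random

NOISE = 10    # each coordinate offset lies in [-NOISE, NOISE]
COPIES = 10   # noisy copies created per input


def noisy_copies(points, rng, copies=COPIES, noise=NOISE):
    """Return `copies` perturbed versions of a list of (x, y) points.

    Assumption: offsets are uniform and independent per coordinate.
    """
    return [
        [(x + rng.uniform(-noise, noise), y + rng.uniform(-noise, noise))
         for x, y in points]
        for _ in range(copies)
    ]


def max_relative_shift(noise=NOISE):
    """Largest change between two points: one is shifted by -noise and
    the other by +noise, in both x and y."""
    return math.hypot(2 * noise, 2 * noise)


rng = random.Random(0)
sketch = [(0.0, 0.0), (100.0, 50.0), (255.0, 255.0)]
samples = noisy_copies(sketch, rng)

print(len(samples))                     # 10 noisy copies per input
print(round(max_relative_shift(), 1))   # 28.3, as in the footnote
print(7680 * COPIES)                    # 76800 evaluated samples
```

The last line reproduces the count of evaluated samples for the 7680 QuickDraw test samples.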

Continuous optimization: Table 1 shows all metrics for continuous optimization with varying loss weights β. All settings of β improve upon the original in terms of accuracy. This is expected, since no modification is allowed to change a correctly classified sample into an incorrectly classified one. When optimizing for effort, accuracy gains of optimized samples vary. For noisy samples, accuracy can even be lower than for the original. Reduced samples contain less information for classification than the original ones, making them somewhat more sensitive to noise. If only the effort loss is optimized ($β_{E} = 1$), the effort loss is not the lowest among all options. Including a classifier loss ($β_{C} > 0$) strongly improves accuracy and, interestingly, it also leads to the lowest effort loss. Without a classifier loss, all sketch parts are altered irrespective of whether they are relevant for classification. Thus, a significant increase in loss is incurred due to movements of points that are highly relevant for classification. This is avoided by using a classifier loss. In summary, among all investigated options, optimizing for accuracy using all loss terms, i.e., $β_{C} = 0.9998$, $β_{P} = 0.0001$, $β_{E} = 0.0001$, leads to a good compromise in terms of improved accuracy with and without noise and maintaining the original sample, while still reducing effort.
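To make the role of the weights concrete, the following sketch assumes that the three loss terms are combined as a weighted sum, which is a common choice but an assumption here. The function name `combined_loss` and the classifier loss value are hypothetical; $L_{P}$ and $L_{E}$ are the Table 1 means for the setting using all loss terms.

```python
def combined_loss(l_c, l_p, l_e, beta_c, beta_p, beta_e):
    """Weighted sum of the three loss terms (assumed form of the objective)."""
    return beta_c * l_c + beta_p * l_p + beta_e * l_e


# Weights of the setting that uses all loss terms (Table 1).
beta_c, beta_p, beta_e = 0.9998, 0.0001, 0.0001

# L_P and L_E are Table 1 means; the classifier loss is purely
# hypothetical because it is not reported in the table.
l_c_hypothetical = 0.1
l_p, l_e = 78.2, 1446.9

print(round(beta_p * l_p, 5))   # weighted distortion term
print(round(beta_e * l_e, 5))   # weighted effort term
print(round(combined_loss(l_c_hypothetical, l_p, l_e,
                          beta_c, beta_p, beta_e), 5))
```

Since $L_{E}$ is on the order of $10^{3}$, a weight of 0.0001 still yields a term of about 0.14, so a small weight does not by itself make a loss term negligible.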

Table 1

Results varying loss term weights $β_{C}$, $β_{P}$, $β_{E}$

Obj	$β_{C}$ , $β_{P}$ , $β_{E}$	$Acc$	${Acc}_{Noise}$	$L_{E}$	$L_{P}$
Original		0.897 ± 0.02	0.894 ± 0.02	1527.4 ± 161.4	0.0 ± 0.0
Acc	1.0, 0, 0	0.929 ± 0.04	0.923 ± 0.03	1539.6 ± 158.8	10.8 ± 11.5
	0, 0, 1.0	0.904 ± 0.02	0.9 ± 0.02	1485.8 ± 162.8	39.3 ± 9.1
	0.9999, 0, 0.0001	0.963 ± 0.01	0.954 ± 0.01	1461.9 ± 160.1	63.0 ± 10.2
	0.9998, 0.0001, 0.0001	0.966 ± 0.01	0.958 ± 0.01	1446.9 ± 159.8	78.2 ± 16.9
Eff	1.0, 0, 0	0.962 ± 0.01	0.927 ± 0.01	1525.3 ± 162.3	1.94 ± 1.1
	0, 0, 1.0	0.903 ± 0.02	0.89 ± 0.02	1407.1 ± 152.3	113.2 ± 13.4
	0.9999, 0, 0.0001	0.963 ± 0.01	0.934 ± 0.01	1422.0 ± 157.9	100.1 ± 9.1
	0.9998, 0.0001, 0.0001	0.966 ± 0.01	0.94 ± 0.01	1417.1 ± 160.4	107.6 ± 24.4

Table 2

Results for removal

Obj	Meth.	$Acc$	${Acc}_{Noise}$	$L_{E}$	$L_{D}$
Original		0.897 ± 0.02	0.895 ± 0.02	1527.4 ± 161.4	0.0 ± 0.0
Acc	CL	0.971 ± 0.0	0.967 ± 0.01	1444.2 ± 157.1	180.7 ± 15.7
	CE	0.956 ± 0.01	0.953 ± 0.01	1395.4 ± 151.7	123.7 ± 11.3
	RA	0.919 ± 0.01	0.916 ± 0.01	1503.2 ± 159.0	60.9 ± 8.7
	RO	0.908 ± 0.01	0.906 ± 0.01	1497.7 ± 160.7	24.6 ± 4.5
	SO	0.911 ± 0.01	0.909 ± 0.02	1499.0 ± 160.0	29.3 ± 3.8
Eff	CL	0.961 ± 0.01	0.955 ± 0.01	1346.2 ± 141.0	220.5 ± 20.0
	CE	0.947 ± 0.01	0.941 ± 0.01	1257.3 ± 131.0	219.3 ± 19.8
	RA	0.914 ± 0.01	0.881 ± 0.02	1381.8 ± 143.4	220.7 ± 19.8
	RO	0.912 ± 0.01	0.872 ± 0.02	1191.1 ± 122.9	219.1 ± 19.7
	SO	0.912 ± 0.01	0.868 ± 0.02	1288.6 ± 140.4	220.2 ± 20.0

Removal: The results for all removal strategies (Table 2) indicate that abandoning irrelevant stroke parts yields improvements for both effort and accuracy at the same time. Both optimization objectives also behave as expected: Using accuracy as objective leads to better accuracy but worse effort loss than using effort as objective.

Using the classifier loss (CL or CE) for ordering removals yields the best results in terms of accuracy. When strokes are allowed to be split (CL), accuracy is higher than when only parts at the end of strokes are removed (CE), but these gains come at the expense of having more strokes. Furthermore, effort loss and similarity to the original samples ($L_{D}$) are better for CE.

When optimizing for effort, only CL and CE achieve much better accuracy for noisy samples ${Acc}_{Noise}$ than the original samples. Removing stroke segments in sorted order (SO), i.e., as they were drawn, or in reverse order (RO) gives the lowest effort loss overall. Removing them in a random order (RA) is slightly better with respect to accuracy, since the optimization method has more options to select from, i.e., it also attempts to remove segments in the middle of a stroke leading to more strokes. But overall RA is clearly inferior to CL and CE.

For effort, removing entire strokes from the beginning (or end) yields benefits, since in the optimization process we do not account for moving to the first point or from the last point to some starting point, and there is often a significant distance between the endpoint of one stroke and the start point of the next stroke. This distance is also saved when removing entire strokes at the beginning or end. In contrast, a transition between strokes remains when removing strokes (segments) in the middle.
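The effect of transitions can be seen in a toy example. The sketch below assumes that effort is the drawn length plus the pen-up movement between consecutive strokes, with no movement counted before the first or after the last stroke. The function `effort` and the three strokes are illustrative and not taken from the dataset.

```python
import math


def effort(strokes):
    """Drawn length plus pen-up transitions between consecutive strokes.

    No movement is counted before the first or after the last stroke.
    """
    total = 0.0
    for k, stroke in enumerate(strokes):
        total += sum(math.dist(p, q) for p, q in zip(stroke, stroke[1:]))
        if k + 1 < len(strokes):
            total += math.dist(stroke[-1], strokes[k + 1][0])
    return total


# Three short strokes of length 10 each (hypothetical coordinates).
a = [(0, 0), (10, 0)]
b = [(50, 30), (60, 30)]
c = [(100, 0), (110, 0)]

print(effort([a, b, c]))   # 130.0: 30 drawn + two transitions of 50
print(effort([b, c]))      # 70.0: first stroke and its transition are gone
print(effort([a, c]))      # 110.0: a transition of 90 from a to c remains
print(effort([a, b]))      # 70.0: last stroke and its transition are gone
```

Removing the first or last stroke saves 60 units here, whereas removing the middle stroke saves only 20, because the hand still has to travel from the first to the last stroke.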

In summary, CL is best for accuracy, CE is still good for accuracy with better effort loss, and RO leads to best effort loss but only slight gains in accuracy compared to the original samples.

Table 3

Accuracy reduction for Sketch-a-Net architecture and ours when reducing visible elements; bold shows best

Method	Original Acc.	Δ if keep 50%	Δ if keep 25%
DQSN	0.92	−0.12	−0.27
GDSA	0.92	−0.06	−0.20
CE (This paper)	0.95	−0.02	−0.16
CL (This paper)	0.95	0.01	−0.14

Comparison to removal methods for sketch abstraction: Table 3 shows the accuracy reduction of the proposed optimization procedure if only a fixed percentage of the average stroke segments of a category is kept, as proposed and described for GDSA in [38] for a smaller subset of QuickDraw. In this setup, removal takes place even if it leads to erroneous classification. Our classifier-guided methods CE and CL achieve significantly higher accuracy than prior work (DQSN [60] and GDSA), which is based on training a model using reinforcement learning. We attribute this to the fact that we optimize samples individually in an iterative manner. Note that we only use stroke segment removal here. Even larger differences are expected if we add other optimization operations such as changing the direction of strokes.

Stroke Order and Direction: As shown in Table 4, accuracy gains are largest when focusing on accuracy as objective, but in this case effort loss might be slightly larger than for the original samples (by less than 5%). In contrast, effort gains when using effort as objective are beyond 25%, indicating that humans tend not to draw in an optimal order from an efficiency perspective. Both permuting strokes (P) and reversing directions (R) yield significant gains in terms of accuracy and also effort. With respect to accuracy both behave similarly, whereas reversing direction is preferable from an effort perspective. Doing both (B) yields the best outcomes.
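The three variants P, R and B can be illustrated for the effort objective on a toy input. The sketch below swaps in an exhaustive search over all orders and directions, which is feasible only for a handful of strokes and is not the iterative, classifier-guided procedure used in the evaluation. The names `effort` and `best_arrangement` and the coordinates are hypothetical.

```python
import itertools
import math


def effort(strokes):
    """Drawn length plus pen-up transitions between consecutive strokes."""
    total = 0.0
    for k, stroke in enumerate(strokes):
        total += sum(math.dist(p, q) for p, q in zip(stroke, stroke[1:]))
        if k + 1 < len(strokes):
            total += math.dist(stroke[-1], strokes[k + 1][0])
    return total


def best_arrangement(strokes, permute=True, reverse=True):
    """Exhaustively search stroke orders and/or directions for minimal effort."""
    n = len(strokes)
    orders = itertools.permutations(range(n)) if permute else [tuple(range(n))]
    flips = (list(itertools.product([False, True], repeat=n))
             if reverse else [(False,) * n])
    best_cost, best_strokes = None, None
    for order in orders:
        for flip in flips:
            candidate = [strokes[i][::-1] if f else strokes[i]
                         for i, f in zip(order, flip)]
            cost = effort(candidate)
            if best_cost is None or cost < best_cost:
                best_cost, best_strokes = cost, candidate
    return best_cost, best_strokes


# Hypothetical sketch: the second stroke is far away and drawn right to left.
strokes = [[(0, 0), (10, 0)], [(100, 0), (90, 0)], [(20, 0), (30, 0)]]

print(effort(strokes))                                  # 190.0 (as drawn)
print(best_arrangement(strokes, reverse=False)[0])      # 110.0 (P only)
print(best_arrangement(strokes, permute=False)[0])      # 180.0 (R only)
print(best_arrangement(strokes)[0])                     # 100.0 (B)
```

In this toy case permuting helps more than reversing, which need not match the aggregate behavior reported in Table 4. Doing both is never worse than either alone because its search space contains both.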

Table 4

Results for permuting strokes (P), reversing direction (R) and doing both (B)

Obj	Meth.	$Acc$	${Acc}_{Noise}$	$L_{E}$	$L_{D}$
Original		0.897 ± 0.02	0.895 ± 0.02	1527.4 ± 161.4	0.0 ± 0.0
Acc	P	0.952 ± 0.01	0.948 ± 0.01	1595.6 ± 162.1	0.096 ± 0.07
	R	0.951 ± 0.01	0.948 ± 0.01	1551.8 ± 159.5	0.06 ± 0.07
	B	0.961 ± 0.01	0.959 ± 0.01	1580.1 ± 159.6	0.094 ± 0.04
Eff	P	0.911 ± 0.02	0.892 ± 0.02	1393.9 ± 135.7	0.179 ± 0.09
	R	0.915 ± 0.01	0.888 ± 0.01	1306.6 ± 120.5	0.193 ± 0.1
	B	0.913 ± 0.01	0.88 ± 0.02	1295.2 ± 117.9	0.096 ± 0.1

Table 5

Results for applying multiple methods sequentially; (C)ontinuous point movement, (B) Reverse and permute, (D)eletion/Removal

Obj	Meth.	$Acc$	${Acc}_{Noise}$	$L_{E}$	$L_{D}$	$L_{P}$
Original		0.897 ± 0.02	0.895 ± 0.02	1527.4 ± 161.4	0.0 ± 0.0	0.0 ± 0.0
Acc	B-C	0.976 ± 0.0	0.972 ± 0.0	1518.4 ± 155.5	60.1 ± 13.7	69.1 ± 12.0
	C-D	0.975 ± 0.0	0.972 ± 0.01	1388.4 ± 150.0	167.2 ± 25.6	107.2 ± 18.5
	D-B	0.98 ± 0.01	0.979 ± 0.01	1486.4 ± 161.8	152.2 ± 32.2	0.0 ± 0.0
Eff	B-C	0.961 ± 0.01	0.925 ± 0.01	1203.3 ± 120.0	101.4 ± 22.2	131.1 ± 14.1
	C-D	0.973 ± 0.0	0.964 ± 0.01	1281.7 ± 132.8	228.7 ± 21.5	121.3 ± 18.5
	D-B	0.958 ± 0.01	0.911 ± 0.02	1086.1 ± 94.2	219.9 ± 19.2	0.0 ± 0.0

Combining methods: We also applied two or more methods sequentially (Table 5). For continuous optimization of points (C) we used $(β_{C}, β_{P}, β_{E}) = (0.9998, 0.0001, 0.0001)$. Applying multiple methods gives some further improvement. That is, the maximum accuracy is higher and the minimum effort loss is lower when applying multiple methods.

Other networks and datasets: We found that results were qualitatively identical, meaning that if there was a clear improvement for optimized samples for one dataset and one network type, this also held for others. More precisely, for removal, CL gave the best overall accuracy for both nets (Conv1D and LSTM) and both datasets, while RO yielded the lowest effort loss. For continuous optimization, using non-zero loss weights ($(β_{C}, β_{P}, β_{E}) = (0.9998, 0.0001, 0.0001)$) gave the best outcomes with respect to accuracy and could also improve on effort. Reversing and permuting strokes was the best strategy across networks and datasets.

Thus, qualitative behavior was consistent across datasets, networks and operations. Gains could vary per dataset, network and operation considered. For example, the highest accuracy gains relative to the original were achieved for Conv1D on QuickDraw, being 13.4% compared to 8.3% for LSTM on QuickDraw.

Overall summary: Continuous optimization, reversing and permuting, and stroke removal can significantly improve accuracy and effort. Combining them yields even further gains. Each strategy allows for multiple algorithms, e.g., multiple papers focus on stroke removal only. No algorithm performs best with respect to all considered metrics. However, some variants are clearly inferior to others from the perspective of Pareto optimality, i.e., they do not achieve a trade-off that could not be improved upon.
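The notion of Pareto optimality can be made concrete with the removal results. The sketch below uses the mean $Acc$ and $L_{E}$ values of Table 2 and, as a simplifying assumption, ignores the reported standard deviations and the remaining metrics (${Acc}_{Noise}$, $L_{D}$). The helper `dominates` is illustrative.

```python
# (objective-method, Acc, L_E) means copied from Table 2.
results = [
    ("Acc-CL", 0.971, 1444.2),
    ("Acc-CE", 0.956, 1395.4),
    ("Acc-RA", 0.919, 1503.2),
    ("Acc-RO", 0.908, 1497.7),
    ("Acc-SO", 0.911, 1499.0),
    ("Eff-CL", 0.961, 1346.2),
    ("Eff-CE", 0.947, 1257.3),
    ("Eff-RA", 0.914, 1381.8),
    ("Eff-RO", 0.912, 1191.1),
    ("Eff-SO", 0.912, 1288.6),
]


def dominates(a, b):
    """a dominates b if it is at least as good in both metrics
    (higher Acc, lower L_E) and strictly better in at least one."""
    _, acc_a, eff_a = a
    _, acc_b, eff_b = b
    return (acc_a >= acc_b and eff_a <= eff_b
            and (acc_a > acc_b or eff_a < eff_b))


pareto = [r[0] for r in results
          if not any(dominates(other, r) for other in results)]
print(pareto)   # ['Acc-CL', 'Eff-CL', 'Eff-CE', 'Eff-RO']
```

Under these two metrics only, the random and sorted-order variants are dominated, whereas CL, CE and RO each offer a trade-off that no other variant improves upon in both respects.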

6. User study

We conducted an experiment to assess if humans can reproduce optimized samples and if these reproductions indeed yield gains according to the specified objective. While the prior numerical investigation is highly suggestive, optimized samples might be unnatural for humans. Thus, reproductions of those samples might actually be worse, i.e., they might take longer to create and deviate more strongly from the proposal than non-optimized samples, making a user study necessary. Thus, the study is a first step in assessing whether optimized samples can be reproduced “better” than non-optimized ones. The next step would be to demonstrate long-term learning from optimized samples by changing user behaviors over a prolonged period of time through possibly repeated training. We believe that this is a study in its own right.

We used generated samples for method “D-B” (Table 5) optimized towards accuracy for the QuickDraw dataset. The overall pool of sketches consisted of 10 original samples per class, where each sample consisted of up to 7 strokes to ensure good readability of instructions, i.e., numbering and arrows. Since we are particularly interested in whether errors in interaction can be mitigated, we chose 5 (of the 10) original samples per class that were misclassified. Each participant had to copy an optimized version of a human input and the original version for 5 randomly selected sketches, yielding 10 sketches per user. Thus, on average 2.5 of the presented original versions are classified correctly. The optimized and original versions were shown in random order. Users were advised to draw strokes in the order and direction indicated by the numbering and arrows in the shown sample (Fig. 5).

Fig. 5.

During the user study participants are shown a sketch with numbered strokes and stroke start indicated (left panel). They should reproduce it (right).

We recruited 200 English-speaking participants on Amazon Mechanical Turk. We removed reproduced sketches that did not match the instructed number of strokes or took more than 60 s to create. We also removed sketches for which only the original or only the corresponding optimized sample was drawn adequately, i.e., within 60 s and with the correct number of strokes. The (LSTM) classifier had an accuracy of 54% on sketches resembling the original and 68% on sketches based on the optimized sample. The differences are statistically significant using a t-test, yielding $p < 0.02$. Participants took on average 23.2 s to (re)sketch an original sample. They were 1.7 s faster for optimized samples (though only with $p = 0.21$). Note that we used samples optimized towards accuracy, not effort. Still, even those samples have (mostly) less visible strokes ($L_{D}$), while overall hand movements are typically similar to original samples ($L_{E}$), see Table 5. Furthermore, errors in recognition can also lead to rework, e.g., if the unlocking pattern of a smartphone is not recognized correctly, a human has to redraw it. Thus, accuracy (and not creation time of a sample) can be the key determinant for overall efficiency.
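The argument that accuracy can dominate overall efficiency can be illustrated with a back-of-the-envelope calculation. The sketch below makes a strong simplifying assumption that is not tested in the study: a sample is redrawn until it is recognized, and each attempt succeeds independently with the measured accuracy and takes the measured mean time. The function name is hypothetical.

```python
def expected_time_until_recognized(time_per_attempt, accuracy):
    """Mean total drawing time if a sample is redrawn until recognized.

    Assumption: attempts are independent with constant success probability,
    so the number of attempts is geometric with mean 1 / accuracy.
    """
    return time_per_attempt / accuracy


original = expected_time_until_recognized(23.2, 0.54)
optimized = expected_time_until_recognized(23.2 - 1.7, 0.68)

print(round(original, 1), round(optimized, 1))   # 43.0 31.6
```

Under this assumption, most of the difference stems from the higher accuracy rather than from the 1.7 s difference in drawing time, which was not statistically significant.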

Our experiment primarily confirmed that creating inputs to an AI by following instructions leads to better outcomes. Ultimately, our goal is to show that people perform better without instructions, i.e., after having learnt from instructions to create optimized versions of their own inputs. Showing this might require a longer training phase and might also be accompanied by different measures than mimicking generated samples [18]. For example, the handwriting style of adolescents has been shown to improve to some extent without direct teaching of handwriting using consultative clinical reasoning [18]. Furthermore, the user-machine interaction process during the learning phase of a human also provides many opportunities and raises a sequence of questions, e.g., is actually reproducing inputs the best mode for learning? Is it better to provide feedback in real time while humans are creating inputs?

Moreover, generating a “sequence of strokes” can be based on fairly different cognitive processes with different goals. For instance, the act of handwriting or unlocking a mobile phone is fairly automatic, i.e., a person does not need to think about how to write a specific letter (or word) or about the movements of hands to unlock the screen lock. She might even be able to do it blindly, whereas generating a sketch of an object for a person not very experienced in sketching is harder and it might follow a less automatic and task-specific reasoning process, e.g., from coarse to fine details [17]. Our optimization process might break the creation process from coarse to fine to encourage drawing with less movement, i.e., effort. When optimizing for effort it can be beneficial to draw all nearby “strokes” independent of their abstraction level to minimize hand movements. Such a change in the drawing process raises questions with respect to the quality of the sketches beyond our considered metrics, e.g., are sketches obtained using a process focusing on efficiency less diverse or creative? Sketching is often a process that is part of idea and concept formation where efficiency (and recognizability by an AI) might not be of utmost importance. We believe that our work is most beneficial for tasks that are based on frequent interaction between humans and AI using symbols or patterns produced fairly automatically by humans. In such situations, efficiency and recognition of inputs by the AI are of high importance.

7. Discussion and future work

This study contributes by (i) setting forth objectives and measures to optimize and evaluate hand-generated human inputs, (ii) providing algorithmic foundations, and (iii) conducting an extensive evaluation, including a user study, that shows gains in terms of recognition accuracy for optimized samples generated by humans based on automatically generated instructions.

We believe that our objectives are of interest to any interaction of humans and AI, i.e., inputs should be fast and effortless to create, recognizable by the AI model and humans, and diverse. A limitation of our work is that the set of proposed objectives is not complete. Further objectives could be considered, such as configuring suggestions to maximize preferences of a human and robustness of the suggested proposals to forgery and adversarial attacks. Our work is most beneficial for scenarios where interaction between a human and AI is frequent and relies on well-established patterns or symbols. In such cases, efficiency and recognizability of samples by the AI are of high importance. Humans and AI might also interact to jointly produce a “creative” solution, e.g., to jointly create a new logo for a sports team. In this case, it is unclear whether restricting a person’s creative thought process to improve the understanding of the AI is beneficial overall.

Our algorithms improve on multiple prior works focusing on sketch abstraction [38,60]. The improvements in the quantitative comparison can be attributed to our optimization approach. We chose to optimize samples individually, which allows adjusting the optimization to individual samples to a very high degree, but it is more computationally intensive than training a model (as was done in prior work on sketch abstraction). For inputs resulting from complex and long interactions, the options to change inputs increase and, therefore, computational demands limit our methods’ applicability. Thus, to keep the optimization process short, our methods might have to be further tailored to reduce running time, or parameters of our algorithm, such as the number of incremental changes, must be altered. We focused on hand-generated inputs as they occur on touch screens used by billions of users every day. While a number of publications have focused exclusively on such types of problems, it is still of interest to assess to what extent our methods can be generalized. The framework outlined in Algorithm 1 can be applied to any classifier and to other problems that change human inputs in the form of movements. In our problem statement, we used an indicator $I_{i}$ to express whether the hand movement should generate a stroke. For gestures where the hand moves in 3D space without the concept of a stroke (or the hand touching a sensing device), we can consider an entire gesture, i.e., a sequence of points, as one stroke and apply our techniques. A human might also move multiple limbs simultaneously, e.g., both hands. Under the assumption that the limbs move independently, we can also apply our optimization techniques independently for each movement of a limb, i.e., the sequence of points for each limb.
However, if there are physical constraints, e.g., both feet cannot be in the air simultaneously for a prolonged time, further adjustments to the mathematical formulation of the loss function, i.e., our measures, are needed. One limitation of our measures is that they are only proxies for our higher-level objectives. For instance, to measure the time to create an input, we used the total distance the hand has to move. This is imprecise, since a long straight line might be drawn faster than a zig-zag line of the same total length. While our evaluation has shown that the resulting imprecisions are not severe in our application, this is not necessarily true for other applications. Thus, the required adjustments of our measures are problem-dependent. They could be very simple: For 3D hand gestures, simply adding a third coordinate z analogous to the existing x and y could suffice, i.e., the term ${(x_{i} - x_{i + 1})}^{2} + {(y_{i} - y_{i + 1})}^{2}$ in the effort loss $L_{E}$ is extended by adding ${(z_{i} - z_{i + 1})}^{2}$. For full-body movements, the formalization of objectives might have to be altered more extensively, and possibly accompanied by other constraints. Using naive objectives, the system might suggest moving limbs in ways that are physically impossible or very challenging. However, whether this holds or not is difficult to assess without actual experimentation.
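The extension to 3D gestures can be sketched in a few lines. The example assumes that the effort loss sums the Euclidean distances between consecutive points, which matches the squared-difference term mentioned above but is otherwise an assumption. The function name `effort_loss` is illustrative.

```python
import math


def effort_loss(points):
    """Sum of Euclidean distances between consecutive points.

    Assumed form of L_E; it works for 2D (x, y) and 3D (x, y, z) points
    because math.dist accepts points of any common dimension.
    """
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


gesture_2d = [(0, 0), (3, 4), (3, 10)]
gesture_3d = [(0, 0, 0), (3, 4, 12), (3, 4, 20)]

print(effort_loss(gesture_2d))   # 11.0 = 5 + 6
print(effort_loss(gesture_3d))   # 21.0 = 13 + 8
```

With this formulation only the point representation changes, while the loss itself stays the same. Physical constraints for full-body movements are not covered by such a sketch.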

Our user study shows that samples generated based on instructions from optimized samples lead to higher recognition accuracy than the original samples. Prior work on sketch abstraction [38,60] was commonly lacking user studies. A limitation of our user study is that it is only a first step towards showing our ultimate goal that humans can interact more efficiently and with higher accuracy after having learnt instructions to create optimized samples. We believe that further algorithmic improvements are possible and should be attempted before long-term user studies are undertaken investigating both the effort it takes humans to change as well as to train (first time) users to properly interact with an AI to minimize misunderstandings. Our instructions are easy to understand, but could be improved, e.g., by showing strokes in a sequential manner rather than all at once. Instructions are also strongly sample driven. Instead, one might also instruct humans by providing general strategies for improvements rather than providing detailed instructions on a per-sample basis.

Beyond user studies, more exploration of the field of human and AI interaction and co-adaption is needed: to improve interaction on a semantic level as needed for interaction with chatbots beyond making chatbots more human [13,14], to consider other recognition problems such as speech recognition [58], to perform a joint optimization of human inputs and AI models, e.g., interactive modeling [55], to derive optimization algorithms that use inputs of a human to provide general rules as feedback, and to assess additional concerns such as acceptance of technology by humans [54].

8. Conclusions

Humans interact more and more with AI. This leads to forms of co-adaption and questions like “How can an AI adapt to improve interaction? And how can humans do so?”. This paper relates to the second question for hand gestures. It provides first steps towards improving interaction by showing how human inputs to an AI can be optimized. Our approach of optimizing samples individually seems beneficial. For stroke removals, we improve on prior work considerably. To the best of our knowledge, we are the first to investigate other operations for altering inputs such as stroke order and direction. Our evaluation indicates that optimized samples lead to fewer misclassifications, while still bearing similarity to the original input and not requiring more time to create.

References

Amershi,

Weld,

Vorvoreanu,

Fourney,

Nushi,

Collisson,

Suh,

Iqbal,

P.N.

Bennett,

Inkpen et al., Guidelines for human-AI interaction, in: Proc. of the CHI Conference on Human Factors in Computing Systems, 2019.

Bansal,

Nushi,

Kamar,

W.S.

Lasecki,

D.S.

Weld and

Horvitz, Beyond accuracy: The role of mental models in human-AI team performance, in: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 2019.

Bansal,

Nushi,

Kamar,

D.S.

Weld,

W.S.

Lasecki and

Horvitz, Updates in human-AI teams: Understanding and addressing the performance/compatibility tradeoff, in: Proc. of the AAAI Conference on Artificial Intelligence, 2019.

Bao,

Chen,

Wen,

Li and

Hua, CVAE-GAN: Fine-grained image generation through asymmetric training, in: Proc. of the Int. Conf. on Computer Vision, 2017.

Bartneck and

Forlizzi, A design-centred framework for social human-robot interaction, in: Workshop on Robot and Human Interactive Communication, 2004.

Basalla,

Schneider and

vom Brocke, Creativity of deep learning: Conceptualization and assessment, in: Proceedings of the 14th International Conference on Agents and Artificial Intelligence, 2022.

Bastani,

Ioannou,

Lampropoulos,

Vytiniotis,

Nori and

Criminisi, Measuring neural net robustness with constraints, in: Advances in Neural Information Processing Systems, 2016.

Billard and

Dautenhahn, Grounding communication in situated, social robots, in: Proceedings Towards Intelligent Mobile Robots Conference, Report No. UMCS-97-9-1, Department of Computer Science, Manchester University, 1997.

Bisk,

Yuret and

Marcu, Natural language communication with robots, in: Proc. of Conf. of the North American Chapter of the Ass. for Computational Linguistics: Human Language Technologies, 2016.

10.

Breazeal,

C.D.

Kidd,

A.L.

Thomaz,

Hoffman and

Berlin, Effects of nonverbal communication on efficiency and robustness in human-robot teamwork, in: Int. Conf. on Intelligent Robots and Systems, 2005.

11.

Calvo-Zaragoza and

Oncina, Recognition of pen-based music notation: The HOMUS dataset, in: Int. Conf. on Pattern Recognition, 2014.

12.

Carroll,

Shah,

M.K.

Ho,

Griffiths,

Seshia,

Abbeel and

Dragan, On the utility of learning about humans for human-AI coordination, in: Adv. in Neural Information Processing Systems, 2019.

13.

A.P.

Chaves and

M.A.

Gerosa, How should my chatbot interact? A survey on human-chatbot interaction design, arXiv preprint, 2019. arXiv:1904.02743.

14.

Ciechanowski,

Przegalinska,

Magnuski and

Gloor, In the shades of the uncanny valley: An experimental study of human-chatbot interaction, Future Generation Computer Systems 92 (2019), 539–548. doi:10.1016/j.future.2018.01.055.

15.

Dhurandhar,

P.-Y.

Chen,

Luss,

C.-C.

Tu,

Ting,

Shanmugam and

Das, Explanations based on the missing: Towards contrastive explanations with pertinent negatives, in: Advances in Neural Information Processing Systems, 2018.

16.

S.K.

Ehrlich and

Cheng, Human-agent co-adaptation using error-related potentials, Journal of neural engineering 15(6) (2018), 066014. doi:10.1088/1741-2552/aae069.

17.

Eitz,

Hays and

Alexa, How do humans sketch objects?, ACM Transactions on graphics (TOG) 31(4) (2012), 1–10.

18.

R.P.

Erhardt and

Meade, Improving handwriting without teaching handwriting: The consultative clinical reasoning process, Australian Occupational Therapy Journal 52(3) (2005), 199–210. doi:10.1111/j.1440-1630.2005.00505.x.

19.

Fusco,

Vlachos,

Vasileiadis,

Wardatzky and

Schneider, Reconet: An interpretable neural architecture for recommender systems, in: Proc of Int. Joint Conf. on Artificial Intelligence (IJCAI), 2019.

20.

Gallina,

Bellotto and

Di Luca, Progressive co-adaptation in human-machine interaction, in: 2015 12th International Conference on Informatics in Control, Automation and Robotics (ICINCO), Vol. 2, 2015, pp. 362–368.

21.

Ghosh,

Tschiatschek,

Mahdavi and

Singla, Towards deployment of robust AI agents for human-machine partnerships, 2019.

22.

Goyal,

Wu,

Ernst,

Batra,

Parikh and

Lee, Counterfactual visual explanations, arXiv preprint, 2019. arXiv:1904.07451.

23.

Graves, Generating sequences with recurrent neural networks, arXiv preprint, 2013. arXiv:1308.0850.

24.

Guidotti,

Monreale,

Turini,

Pedreschi and

Giannotti, A survey of methods for explaining black box models, 2018. http://arxiv.org/abs/1802.01933.

25.

Ha and

Eck, A neural representation of sketch drawings, arXiv preprint, 2017. arXiv:1704.03477.

26.

Hadash,

Kermany,

Carmeli,

Lavi,

Kour and

Jacovi, Estimate and replace: A novel approach to integrating deep neural networks with existing applications, arXiv preprint, 2018. arXiv:1804.09028.

27.

Hois,

Theofanou-Fuelbier and

A.J.

Junk, How to achieve explainability and transparency in human AI interaction, in: Int. Conference on Human-Computer Interaction, 2019.

28.

M.M.

Hoy,

M.Y.

Egan and

K.P.

Feder, A systematic review of interventions to improve handwriting, Canadian Journal of Occupational Therapy 78(1) (2011), 13–25. doi:10.2182/cjot.2011.78.1.3.

29.

C.P.

Janssen,

S.F.

Donker,

D.P.

Brumby and

A.L.

Kun, History and future of human-automation interaction, International journal of human-computer studies 131 (2019), 99–107. doi:10.1016/j.ijhcs.2019.05.006.

30.

Kour and

Saabne, Real-time segmentation of on-line handwritten Arabic script, in: Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, IEEE, 2014, pp. 417–422. doi:10.1109/ICFHR.2014.76.

31.

Kour and

Saabne, Fast classification of handwritten on-line Arabic characters, in: Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of, IEEE, 2014, pp. 312–318. doi:10.1109/SOCPAR.2014.7008025.

32.

J.S.

Lansing, Complex adaptive systems, Annual review of anthropology 32(1) (2003), 183–204. doi:10.1146/annurev.anthro.32.061002.093440.

33.

Liu,

Deng,

Y.-K.

Lai,

Y.-J.

Liu,

Ma and

Wang, Sketchgan: Joint sketch completion and recognition with generative adversarial network, in: Proc. of the Conference on Computer Vision and Pattern Recognition, 2019.

34.

Maedche,

Legner,

Benlian,

Berger,

Gimpel,

Hess,

Hinz,

Morana and

Söllner, AI-based digital assistants, Business & Information Systems Engineering 61(4) (2019), 535–544. doi:10.1007/s12599-019-00600-8.

35.

Malkin,

Harbach,

De Luca and

Egelman, The anatomy of smartphone unlocking: Why and how Android users around the world lock their phones, in: GetMobile: Mobile Computing and Communications, 2017.

36.

G.S.

Martins,

Santos and

Dias, User-adaptive interaction in social robots: A survey focusing on non-physical interaction, International Journal of Social Robotics 11(1) (2019), 185–205. doi:10.1007/s12369-018-0485-4.

37.

Meske,

Bunde,

Schneider and

Gersch, Explainable artificial intelligence: Objectives, stakeholders, and future research opportunities, Information Systems Management 39 (2022), 53–63.

38.

U.R.

Muhammad,

Yang,

T.M.

Hospedales,

Xiang and

Y.-Z.

Song, Goal-driven sequential data abstraction, in: Proc. of the International Conference on Computer Vision, 2019.

39.

A.I.

Niculescu and

R.E.

Banchs, Strategies to cope with errors in human-machine spoken interactions: Using chatbots as back-off mechanism for task-oriented dialogues, in: Proc. Errors by Humans and Machines in Multimedia, Multimodal and Multilingual Data Processing (ERRARE), 2015.

40.

Nocentini,

Fiorini,

Acerbi,

Sorrentino,

Mancioppi and

Cavallo, A survey of behavioral models for social robots, Robotics 8(3) (2019), 54. doi:10.3390/robotics8030054.

41.

Olah,

Mordvintsev and

Schubert, Feature visualization, Distill (2017). https://distill.pub/2017/feature-visualization. doi:10.23915/distill.00007.

42.

Poursaeed,

Katsman,

Gao and

Belongie, Generative adversarial perturbations, in: Pro. of Conference on Computer Vision and Pattern Recognition, 2018.

43.

Riaz Muhammad,

Yang,

Y.-Z.

Song,

Xiang and

T.M.

Hospedales, Learning deep sketch abstraction, in: Proc. of the Conference on Computer Vision and Pattern Recognition, 2018.

44.

Rzepka and

Berger, User interaction with AI-enabled systems: A systematic review of IS research, in: Int. Conf. on Information Systems (ICIS), 2018.

45.

Schneider, Human-to-AI coach: Improving human inputs to AI systems, in: International Symposium on Intelligent Data Analysis, 2020.

46.

Schneider and

Handali, Personalized explanation in machine learning, in: European Conference on Information Systems (ECIS), 2019.

47.

Schneider and

Vlachos, Personalization of deep learning, in: Data Science – Analytics and Applications, 2020.

48.

Schneider and

Vlachos, Explaining neural networks by decoding layer activations, in: International Symposium on Intelligent Data Analysis, 2021, pp. 63–75.

49.

Schneider and

Vlachos, Explaining classifiers by constructing familiar concepts, Machine Learning (2022), 1–34.

50.

Schuetz and

Venkatesh, Research perspectives: The rise of human machines: How cognitive computing systems challenge assumptions of user-system interaction, Journal of the Association for Information Systems 21(2) (2020), 2.

51.

Shahzad,

A.X.

Liu and

Samuel, Secure unlocking of mobile touch screen devices by simple gestures: You can see it but you can not do it, in: Proceedings of the 19th Annual International Conference on Mobile Computing & Networking, 2013, pp. 39–50. doi:10.1145/2500423.2500434.

52.

Shneiderman, Human-centered artificial intelligence: Reliable, safe & trustworthy, International Journal of Human-Computer Interaction 36(6) (2020), 495–504. doi:10.1080/10447318.2020.1741118.

53. Su, D.V. Vargas and Sakurai, One pixel attack for fooling deep neural networks, IEEE Transactions on Evolutionary Computation (2019).

54. Venkatesh, M.G. Morris, G.B. Davis and F.D. Davis, User acceptance of information technology: Toward a unified view, in: MIS Quarterly, 2003.

55. Ware, Frank, Holmes, Hall and I.H. Witten, Interactive machine learning: Letting users build classifiers, International Journal of Human-Computer Studies 55(3) (2001), 281–292. doi:10.1006/ijhc.2001.0499.

56. Xu, Deep learning for free-hand sketch: A survey, arXiv preprint, 2020. arXiv:2001.02600.

57. Yu, Yang, Liu, Y.-Z. Song, Xiang and T.M. Hospedales, Sketch-a-Net: A deep neural network that beats humans, International Journal of Computer Vision 122(3) (2017), 411–425. doi:10.1007/s11263-016-0932-3.

58. Zhang, Geiger, Pohjalainen, A.E.-D. Mousa, Jin and Schuller, Deep learning for environmentally robust speech recognition: An overview of recent developments, ACM Transactions on Intelligent Systems and Technology (TIST) 9(5) (2018), 1–28. doi:10.1145/3178115.

59. Zhao, Humanoid social robots as a medium of communication, New Media & Society (2006).

60. Zhou, Xiang and Cavallaro, Video summarisation by classification with deep reinforcement learning, arXiv preprint, 2018. arXiv:1807.03089.

61. J.-Y. Zhu, Krähenbühl, Shechtman and Efros, Generative visual manipulation on the natural image manifold, in: European Conf. on Computer Vision, 2016.


²
This happens if the x and y coordinates of a point are each shifted by -10 once and by +10 another time, yielding a total distance of $\sqrt{{(- 10 - 10)}^{2} + {(- 10 - 10)}^{2}} \approx 28.3$.