Abstract
In this paper, we show that keystroke latencies used in continuous user authentication systems disclose application context, i.e., in which application user is entering text. Using keystroke data collected from 62 subjects, we show that an adversary can infer application context from keystroke latencies with 95.15% accuracy.
To prevent leakage from keystroke latencies, and prevent exposure of application context, we develop privacy-preserving authentication protocols in the outsourced authentication model. Our protocols implement two popular matching algorithms designed for keystroke authentication, called Absolute (“A”) and Relative (“R”). With our protocols, the client reveals no information to the server during authentication, besides the authentication result. Our experiments show that these protocols are fast in practice: with 100 keystroke features, authentication was completed in about one second with the “A” protocol, and in 595 ms with the “R” protocol. Further, because the asymptotic cost of our protocols is linear, they can scale to a large number of features. On the other hand, by leveraging application context we were able to reduce HTER from 14.7% with application-agnostic templates, to as low as 5.8% with application-specific templates.
Introduction
Commercial providers of keystroke-based behavioral authentication services, e.g., Behaviosec [11], often operate under the outsourced authentication model. In this model, the user’s device collects aggregated keystroke data (e.g., average inter-keystroke latencies computed from hundreds of keystrokes typed within an authentication window), and sends it to a remote server for authentication. It is therefore natural to ask what the authentication server can learn from such aggregated latencies.
Multiple studies have shown that aggregated keystroke latencies can be used to infer private information such as demographic and affective information [5,10,16]. One aspect that has received less attention is the leakage of application context through aggregated keystroke latencies. In this paper, we aim to address this gap. We demonstrate that access to aggregated latencies enables the authentication server to infer which application is being used by the user. We address this newly identified privacy leakage by designing two novel cryptographic protocols that allow the implementation of secure outsourced authentication without disclosing any information to the authentication server, other than the result of the authentication process.
While the primary contribution of our paper is to address the leakage of information from aggregated keystroke latencies, we also show that application context can be leveraged to improve authentication performance in scenarios where authentication is done locally on the device and does not involve sending latencies to third parties.
The key contributions of this paper can be summarized as follows.
We demonstrate that aggregated keystroke data, such as average key press latencies, can be used to predict application context with high (95.15%) accuracy. This result was achieved without using individual keystrokes typed by the user, and is quite worrisome because an authentication server with access to aggregated latencies can exploit them to learn about user’s routine, activities, and application usage. For instance, an adversarial server could determine, for each day of the week, how much time the user is spending on word processing, browsing, or working on spreadsheets. This represents an unnecessary violation of the user’s privacy, particularly because application usage information should not have been exposed to a third party if the goal was to authenticate users using keystrokes.
To address privacy concerns arising from the exposure of keystroke data, we develop novel cryptographic privacy-preserving protocols that prevent information leakage under the outsourced authentication model [1]. Specifically, we introduce two protocols which calculate Absolute (“A”) and Relative (“R”) measures [20] (defined in Section 5.2), and compare them with protocols that compute scaled Euclidean and scaled Manhattan distance from Govindarajan et al. [19]. Performance evaluation on a standard desktop computer shows that our protocols are practical, because their overhead is between 35 ms and 4.5 s.
We report the improvement in authentication performance that can be achieved using application-specific template matching. In our experiments, application-agnostic matching led to an half total error rate (HTER) of 14.7%, while we were able to achieve significantly better HTER-s with application-specific matching (5.8% for Microsoft Word, 5.9% with Outlook, and 13.8% with Internet Explorer). Further, authentication with application-specific templates significantly outperformed authentication with mismatched templates (i.e., when probe keystrokes from application X were matched against a template from a different application Y).
Organization. The rest of this paper is organized as follows. In Section 2 we review the related work. The dataset, matching algorithms, and fusion techniques used in our experiments are summarized in Section 3. Section 4 reports our results on leakage of application context from keystroke data. We introduce new privacy-preserving protocols for “A” and “R” metrics in Section 5. Authentication accuracy using application-specific templates is reported in Section 6. We conclude in Section 7.
Related work
Next, we briefly review the state of the art on information leakage from keystroke timing information, on privacy-preserving protocols for continuous authentication, and on the impact of context on behavioral authentication. For a general survey on keystroke-based authentication, we refer the reader to Banerjee et al. [2].
Information leakage from keystroke timing information
Song et al. [44] demonstrated a side-channel attack in which latencies between consecutive key press events were used to reconstruct what the user was typing. The authors used a Hidden Markov Model (HMM) to predict character sequences from keystroke latencies collected during SSH sessions, and validated their technique by successfully predicting 8-character passwords for 4 users.
Zhang and Wang [48] demonstrated that inter-key times for all local users can be obtained from the
There is growing evidence that keystroke latencies can be used to predict demographic information (e.g., gender, native vs. non-native English speakers), and affective information (e.g., mood, stress level, engagement, and cognitive load). For instance, Fairhurst et al. [16] achieved less than 5% error rates in predicting gender from press-press, release-release, press-release, and release-press times. In [5], Bixler demonstrated that keystroke timing features, combined with task appraisal (e.g., interest in a topic before performing a writing task on it) and information on the user’s stable traits (e.g., scholastic aptitude), successfully predicted affective states such as boredom and engagement during writing tasks.
In [10], Brizan et al. observed that keystroke timing information can be used to predict gender and other demographic information. Their results show that the transition from a punctuation symbol to the spacebar was faster for male subjects than for female subjects, while the times before and after common digraphs, such as ‘
The research presented in our paper extends the state of the art on keystroke authentication by showing that application context can be reliably inferred from keystroke timing information.
Privacy-preserving protocols for continuous authentication
There is extensive literature on the use of privacy-preserving protocols in the context of biometric authentication and identification. Bringer et al. [9] were the first to introduce a general security model for biometric user authentication. Their model assumes low trust between the involved parties, and formalizes privacy for biometric authentication. Furthermore, [9] introduces a privacy-preserving protocol that computes the Hamming distance of two bit strings representing a biometric sample and a template respectively. Barbosa et al. [3] extended the framework of Bringer et al. [9] with a classifier to improve authentication accuracy and propose an instantiation based on Support Vector Machine (SVM) using homomorphic encryption.
Schoenmakers and Tuyls [38] introduced a protocol for secure privacy-preserving iris matching. Their protocol is implemented using threshold ElGamal, and computes (encrypted) Hamming distance between two bit strings representing a template and a user sample, encoded using IrisCode. The result of the Hamming distance is then compared, in the encrypted domain, with a threshold.
Erkin et al. [15] introduced the first privacy-preserving face recognition protocol. Sadeghi et al. [36] subsequently improved the performance of the protocol of Erkin et al. [15]. More recently, Osadchy et al. [33] designed a new face recognition algorithm together with its privacy-preserving realization called SCiFI. SCiFI simultaneously improves robustness and efficiency of [15,36].
Blanton and Gasti [6] focused on privacy-preserving iris and fingerprint matching. The authors rely on a hybrid approach based on garbled circuits and homomorphic encryption for optimal performance.
Barni et al. [4] presented a privacy-preserving protocol for fingerprint identification using FingerCodes [23]. Their technique is not as discriminative as techniques based on location of minutiae points, but is particularly well suited for efficient privacy-preserving implementations.
Govindarajan et al. [19] introduced two protocols for continuous smartphone user authentication based on touchscreen features. Their protocols compute Scaled Manhattan and Scaled Euclidean verifiers, and are secure in the semi-honest model. Subsequently, Šeděnka et al. [40] improved on the work of Govindarajan et al. [19] by introducing new scaled Manhattan and scaled Euclidean protocols secure in the malicious model.
Safa et al. [37] presented a protocol for outsourcing continuous authentication with a Scaled Manhattan verifier on smartphones, and considered security against malicious clients. However, the security argument presented in their protocol does not take into account information disclosed by order-preserving encryption. (See, e.g., the analysis of Boldyreva et al. [7].) As such, there is a substantial amount of information leaked by the client during the protocol, even in the ciphertext-only scenario. In contrast, our schemes provably leak no information.
Techniques based on fuzzy commitments [25,26,39,46] are commonly used to provide template protection and to implement access control on encrypted documents. However, such techniques require biometric comparisons to be performed in a feature space different from that of the original biometrics, possibly increasing equal error rate (EER) [29]. In contrast, our protocols do not affect the EER of the underlying biometric modality, since the comparison between the user sample and the template is functionally the same as the comparison in the unencrypted domain.
While there is prior work on privacy-preserving biometric authentication, this paper is the first to introduce secure protocols for “A” and “R” metrics. These metrics have been shown to perform well in multiple keystroke authentication studies (see, for example, [21,35,47]).
Use of context on behavioral authentication
Several papers have leveraged different types of context to enhance the performance of keystroke authentication systems on desktops. Most popular among these contexts are (1) application context; (2) linguistic context (e.g., keystroke features collected from specific words); (3) cognitive context (e.g., keystroke features collected from typing bursts); and (4) motion context (i.e., keystroke features collected while walking or while sitting).
Application context. Dowland et al. [14] conjectured that the use of application-specific templates might impact authentication performance. By analyzing typing data from one user, the authors suggested that Messenger and Microsoft Word benefit from application-specific templates more than Internet Explorer and Microsoft PowerPoint.
In contrast to [14], our work uses data from 62 users, and therefore takes into account both false accept and false reject errors. Further, our work establishes that the use of application context improves authentication accuracy for most applications by comparing the use of application-specific templates with application-agnostic and mismatched templates.
Linguistic context. Sim and Janakiraman [42] demonstrated that the discriminability of digraph latencies increased when digraphs were associated with the word they came from. Their experiments, performed on a dataset collected from 22 users, showed that the inter-user Manahalobis distance of word-specific digraphs was substantially lower than that of digraphs without word context.
Goodkind et al. [18] showed that using word context decreased authentication errors. Specifically, when adding word context to digraphs, EER was reduced by about 1%.
Cognitive context. Locklear et al. [32] performed a study on typing data collected from 486 users. In their experiments, keystroke features outperformed cognitive features for continuous authentication. However fusion of the two classes of features led to the lowest authentication EER.
Body motion context. Sitova et al. [43] showed that body motion affects the performance of keystroke-based authentication on smartphones. Using data from 99 subjects, the authors demonstrated that a combination of signals from keystroke timings, gyroscope, and accelerometer sensors leads to higher authentication accuracy when users were walking, compared to sitting.
Dataset and authentication algorithms
Dataset. The dataset used in this work was collected from 62 subjects, who at the time of data collection were cadets of the United States Military Academy at West Point, New York.1
An IRB approval was obtained prior to performing data collection.
Feature categories. We extracted six different categories of features from our dataset. Each category is a collection of individual keystroke features. The categories are: (1) KHL: contains key hold latency (down-up time) of each key as features; (2) IK: contains key interval latency (up-down time) of each pair of consecutive keys; (3) KRL: contains key release latency (up-up time) of each pair of consecutive keys; (4) KeyHoldWithNextVKCode: contains key hold with next key context preserved (e.g., key hold latency when typing letter ‘
Prediction and matching algorithms. For predicting application context from aggregated keystroke information, we used Random Forest [8], Naïve Bayes [31], and Tree-augmented Naïve Bayes [49] classifiers.
For biometric verification, we used five different matching algorithms that have been used previously in keystroke authentication studies. They are: (1) Scaled Manhattan [28], (2) Scaled Euclidean [28], (3) “A” [20], (4) Similarity [24]; and (5) “R” [20].
Fusion. Decisions from all verifier-feature pairs were combined, or fused together, to make an overall authentication decision. We used a weighted decision fusion method to fuse the individual authentication results [22,30]. We implemented weighing of classifier decisions using the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm [45].
While it is straightforward to infer application context from individual keystrokes, it might appear that the same information cannot be inferred from aggregated keystroke timings (e.g., average inter-key latencies computed from hundreds of keystrokes typed within an authentication window). In this section, we show that aggregated keystroke latencies can be used to predict application context with high accuracy. We formulate this problem as a classification problem, where the underlying applications form the class labels, and aggregated keystroke data collected from each application was used to train the classification model. The goal of this task was to correctly identify the application, given a vector of keystroke statistics.
For training, we used all keystroke data from 31 users, while for testing we used all keystrokes from a separate group of 31 users who were not included in training. For each user, we divided the keystrokes from each application into two sets. Each set was used to construct one vector, as long as the set contained at least 8,800 keystrokes. This allowed us to build 206 training vectors from the first group of 31 users, and 206 testing vectors from the remaining users.
We computed the following three statistics from the individual keystroke features: standard deviation, mean, and normalized count (the number of occurrences of a feature, divided by the number of occurrences of all features). Using the following example, we illustrate how these statistics were computed. Let ‘
We used four classes: Microsoft Word, Microsoft Internet Explorer, Microsoft Outlook, and Wolfram Mathematica. Our results are presented in Tables 1 and 2.
Context identification accuracies, listed by feature categories. Darker cells corresponds to higher accuracy. We use RF to indicate Random Forests; NB for Naïve Bayesian; and TAN to indicate Tree-Augmented Naïve Bayesian
Context identification accuracies, listed by feature categories. Darker cells corresponds to higher accuracy. We use RF to indicate Random Forests; NB for Naïve Bayesian; and TAN to indicate Tree-Augmented Naïve Bayesian
Context identification performance, by feature statistics. Our results show that NORMALIZEDCOUNT features are the most discriminative for context detection. This feature alone leads to over 90% accuracy with 50–100 features, and over 80% with 10 features. When NORMALIZEDCOUNT features are removed, the context identification accuracies drop to less than 63%
In Table 1 we report our results for the 100, 50, and 10 most available features for each feature category. “All” means that the classifier was trained using all statistics computed on all feature categories. “All but IK” means the features in the IK category were excluded from the experiments. Classification accuracy was calculated as number of correctly classified sessions divided by the total number of sessions. Our results with Random Forest (the best performer among the tested classifiers) show that excluding a single feature category does not significantly affect performance (see Table 1). In fact, by removing each individual feature category, the drop in accuracy is always below 3% with the 100 and 50 most available features, and below 5% with the 10 most available features.
In Table 2, we report our results for the 100, 50, and 10 most available features for each statistic. For example, the normalized count row in Table 2 shows the application prediction accuracy achieved using the normalized counts of the 100, 50, and 10 most available features in each feature category.
We were able to achieve over 95% classification accuracy with the 100 most available features, using standard deviation and normalized count statistics with Random Forests. Normalized count is the statistic that contributes the most to the classification accuracy. As shown in the last row of Table 2 (MEAN, STDDEV), when this statistic was excluded, classification accuracy dropped to 62.1% or less.
Sub-contexts within an application. The results presented in this section were achieved without isolating overlapping sub-contexts within each application. For instance, it is possible that users composed emails and performed word processing tasks using Internet Explorer. The former task would overlap with Microsoft Outlook, while the second with Microsoft Word. Despite such possibility of overlap, we were able to reliably distinguish between application context.
To address application context leakage under the outsourced keystroke authentication model, we introduce new privacy-preserving protocols for “A” and “R” matching algorithms. Our protocols allow a client to authenticate its user by exchanging encrypted information with a server. The server stores an encrypted copy of the user’s template, and learns the outcome of the authentication process. Our protocols provably reveal no information about the keystroke data to the server, which is therefore unable to infer application context or any other ancillary information. Moreover, our protocols reveal no information on the template to the client and the server. Before we introduce our protocols, we briefly review cryptographic preliminaries.
Cryptographic preliminaries
Security model. We use the term adversary to refer to protocol participants (i.e., the client and the authentication server). Outside adversaries can be mitigated via standard network security techniques (e.g., TLS), and are therefore not considered in our analysis.
Our protocols are secure against semi-honest (or honest-but-curious) adversaries. Semi-honest participants are assumed to follow prescribed protocol behavior. However, they might try to learn additional information beyond the protocol output by analyzing messages exchanged during protocol execution. Formally [17]:
Let
Homomorphic encryption and comparison. In an additively homomorphic encryption scheme,
Our constructions can be instantiated using any semantically secure additively homomorphic encryption scheme (e.g., [12,13,34]). We instantiate our protocols using the homomorphic construction of Damgård et al. [12,13] (henceforth, DGK), because it is faster [6] (and produces smaller ciphertexts) than the well known Paillier cryptosystem [34]. In the rest of the paper, we use
Our privacy-preserving “A” and “R” protocols rely on the comparison protocol of Erkin et al. [15]. This interactive protocol allows the client to compare two values encrypted using a homomorphic scheme, without having access to the decryption key, and without learning anything besides the output of the protocol.
Symmetric encryption. Our protocols use a semantically secure symmetric encryption scheme for protecting the template. In our implementation, we use AES in counter (CTR) mode. We use
Our protocols are divided in two phases: enrollment, and verification. Enrollment is executed once per user (or when the user wants to updates her biometric profile), while verification is performed periodically (e.g., every 60–120 seconds).
In the enrollment phase, the client constructs the user’s template by extracting features from keystroke data. The template is encrypted with homomorphic encryption under the server’s public key, and then with symmetric encryption using a key known only to the client (e.g., a user-provided password). The resulting ciphertext is stored on the server, which cannot reconstruct the unencrypted template because it has no access to the symmetric encryption key.
The client constructs a user template
Next, we present the verification phase for each of our privacy-preserving protocols. Our protocols for computing “A” and “R” measures are illustrated in Algorithms 1 and 2, respectively.

Our privacy-preserving “A” measure protocol
Privacy-preserving “A” measure. Gunetti et al. [20] define the similarity between two n-graphs occurring in two typing samples as
Privacy-preserving “R” measure. In order to implement “R” distance, we reduce it to Manhattan distance as follows. The user’s template is composed of a set of digraphs, together with their rank (stored in encrypted form). During authentication, the user’s sample x determines the ranking of features for the session. Client ranks the elements from the encrypted template according to x. If x and the encrypted template are ranked identically, then Client obtains the encrypted list

Our privacy-preserving “R” measure protocol. Without loss of generality, the protocol is shown with the use of digraphs. It is easy to see how it can be easily extended to use arbitrary n-graphs
The computational complexity of “R” protocols is
We performed experiments on a Linux system with two Intel Xeon E5420 CPUs running at 2.5 GHz. Our prototype implementation is written in C, and relies on the GNU GMP library. Table 3, summarizes the performance results of our protocol implementation with 25–400 features. These results are obtained representing each feature using 10 bits. The cost of the “R” protocol is dominated by the computation of the privacy-preserving scaled Manhattan distance. Therefore, we report the performance of the two protocols combined.
Performance of our prototype implementation. Because the performance difference between scaled Manhattan and “R” is negligible, we report the two protocols together
Performance of our prototype implementation. Because the performance difference between scaled Manhattan and “R” is negligible, we report the two protocols together
Our client and server implementations are single core. This allows straightforward performance comparison with the related work. However, all protocols presented in this paper can greatly benefit from multiple cores, because most operations can be performed in parallel on different features.
The security of our protocols relies on the security of the underlying building blocks. In our analysis, we assume that
To show that our protocols are secure, next we describe how to simulate the view of each party using its inputs and outputs alone. Because these simulations are indistinguishable from the real execution of the protocol, for semi-honest parties this implies that the protocols do not reveal any information besides the protocol output to the client and server.
Privacy-preserving “R” verifier. The security of the privacy-preserving protocol implementing the “R” verifier reduces to the security of the privacy-preserving Manhattan distance protocol used in Step 3 of Algorithm 2. In fact, because of the semantic security of
Privacy-preserving “A” verifier. Since
The client’s view of the protocol consists in the server’s public key, the symmetric key for
Authentication accuracy using application-specific templates
When authentication is performed locally on the user’s device, disclosure of application context to the authentication system does not represent a privacy violation. Assuming this use case, we show that it is possible to use application-specific templates to improve authentication accuracy. Our results provide empirical evidence that leveraging application context improves authentication accuracies. We constructed application-specific templates by using all keystroke data from one application at a time. For instance, we constructed a Microsoft Word-specific template by computing the average of each feature value collected while the users were typing text in Word.
Table 4 shows the impact of application-specific matching in keystroke-based verification. We report accuracy results, expressed in terms of HTER, for templates constructed using 1,100 keystrokes. HTER is a standard accuracy measure defined as (FAR+FRR)/2, where FAR represents the percentage of vectors from an impostor incorrectly classified as vectors from the legitimate user (false accept rate), and FRR the percentage of vectors from the legitimate user incorrectly classified as impostor vectors (false reject rate) [41]. The diagonal (highlighted in bold) indicates the HTER-s obtained for application-specific matching (i.e., test vectors collected within an application are matched with the templates from the same application). The non-diagonal cells represent HTER-s obtained when the test vectors from an application are matched against a template from a different application (cross-template matching). Our results show that the HTER-s obtained for application-specific matching are considerably lower than HTER-s obtained with cross-template matching.
Performance of keystroke verification using fusion of A, R, SE, and SM verifiers on key hold, key press, digraph, key hold with next, key hold with previous. Templates are constructed using 1,100 keystrokes
Performance of keystroke verification using fusion of A, R, SE, and SM verifiers on key hold, key press, digraph, key hold with next, key hold with previous. Templates are constructed using 1,100 keystrokes
Number of users with at least 1,100 keystrokes for each combination of training and testing
On the other hand, the HTER of application-agnostic matching, i.e., the template and the test vectors were constructed using keystrokes from all applications, was 14.7%. This is therefore less accurate than application-specific matching – with the exception of Mathematica. However, further experiments show that the HTER increase associated with Mathematica can be attributed to the lower number of keystroke available in this application: as we increase the number of keystrokes used in the construction of application-specific user templates from 1,100 to 1,400, we observed a reduction in HTER to 10.6%, although with data from 26 users instead of 34. (Availability information for 1,100 keystrokes is presented in Table 5.)
In this paper, we showed that application context impacts both privacy and performance of keystroke authentication. We were able to measure application-specific variations from aggregated keystroke data, and successfully used them to identify application context with over 95% accuracy. Because the identification of application context raises significant privacy concerns when authentication is outsourced to a third-party server, we designed two new privacy-preserving protocols for continuous user authentication with keystroke data. Our protocols incur relatively small overhead (595 ms for “R” computation of distance, and 1.1 s for “A” distance, using 100 keystroke features), and disclose no information besides the outcome of the authentication process to the server.
We then determined whether application context can be used to improve authentication accuracy. Using application-specific templates we successfully reduced HTER from 14.7% (with application-agnostic matching) down to 5.8% for Microsoft Word, 5.9% for Microsoft Outlook, and 13.8% for Microsoft Internet Explorer. However, because of the low availability of keystroke samples, the HTER of Mathematica was higher than the baseline at 17.7%.
Footnotes
Acknowledgments
This work was supported in part by DARPA Active Authentication grant FA 8750-13-2-0274. The views, findings, recommendations, and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the sponsoring agencies or the U.S. Government.
The authors thank Colonel Ron Dodge (United States Military Academy, West Point, NY) and his team for leading the data collection effort.
