A side-channel attack on smartphones: Deciphering key taps using built-in microphones

Abstract

In recent years, concerns are increasingly being expressed about the threats of side-channel attacks that exploit acoustic emanations from electronic as well as mechanical devices of daily use. With the increased level of sophistication in both hardware and applications that run on mobile phones, the number of possible ways in which their vulnerabilities can be exploited is also on the rise. In this article, we demonstrate a novel attack which uses the sound emanating from a tap made on the touchscreen of a smartphone to decipher the text being typed. The audio signal captured by the pair of microphones typically embedded in a smartphone is first processed to determine a candidate set of keys. Filters are employed to make this step robust against ambient noise. Natural language processing techniques are then used to estimate the most probable words and sentences that can be constructed from a sequence of taps. It is shown that using even off-the-shelf tools, the typed text including passwords can be guessed with reasonably high accuracy. Besides raising awareness about this potential side-channel attack, we identify the causes that allow it to succeed and suggest countermeasures.

Keywords

Side-channel attack smartphone key tap noise filtering natural language processing

1. Introduction

Over the last few years, mobile phones have become ubiquitous in their presence across all countries in the world. It is expected that mobile phone penetration will reach close to $70 %$ of the global population by the year 2020 [26]. Along with the growth in the number of devices, several new and innovative applications are being made available on such phones, especially the smartphones. Usefulness of these applications is making people over-dependent on their smartphones, which are increasingly being used for personal and confidential information exchange. Tasks range from sending and receiving e-mails, editing and sharing documents, and accessing on-line banking sites, to surfing the net, using various applications (apps) like cab, flight and hotel booking, chatting and participating in online social networking. As a result, the level of security ensured by these devices has become a topic of serious concern.

It may be noted that, all the tasks mentioned above, involve text input using a keypad, which is typically touchscreen in nature. In this paper, mobile phones including smartphones, tablets and other similar devices, are hereinafter referred to as smartphones (or simply as phones) and they represent gadgets that have a touchscreen user interface and have features like Internet access, support for recording and storing media files, etc.

The smartphones that are popularly available today are equipped with at least two microphones. Being attached to the phone body, these microphones pick up even the slightest sound produced on the surface of the phone including the touchscreen. As a user taps on the touchscreen for entering information, a sound is produced which is often not quite audible to the human ear due to the presence of ambient noise and is ignored. These tap sounds can, however, be easily captured using the microphones of the smartphone itself. An unsuspecting user hardly ever denies microphone use or Internet access permission to an app while installing the same. An app is also not suspected to be malicious unless it asks for some unusual combination of permissions. Microphone and Internet access permissions are so common that they do not normally raise any suspicion. However, we show that even these two seemingly innocuous permissions can be judiciously exploited to decipher the text being typed on the smartphone, thus causing a security breach.

In the literature, there is some limited reference to work that recover text using acoustic emanations from mechanical keyboards [1,10,14,30]. External microphones are typically employed to capture the sound, which is then used to train a classifier for differentiating between sounds produced by different keys. It has also been reported that smartphones equipped with various forms of sensors like accelerometer and gyroscope can be used for text extraction [17,20,21]. However, the possibility of exploiting the audio signal captured by the built-in microphones to decipher the typed text has never been studied, to the best of our knowledge.

This paper presents a novel methodology for side-channel attack using acoustic emanations from the touchscreen keypad of a smartphone and that too using readily available tools. The approach makes use of phone geometry and key pad layout, while applying signal processing and natural language processing (NLP) techniques for word and message deciphering. We have significantly expanded our work recently presented at a conference [11] by introducing methods to suppress ambient noise and using the proposed methodology for password guessing, while carrying out more comprehensive experimental studies and strengthening of approaches for countering the attack.

The rest of the paper is organized as follows: Section 2 describes our methodology and the associated algorithms in detail. Specifically, we discuss noise reduction and identification of potential character sets being keyed in, word and sentence inferencing from the character sequences and usefulness in password guessing. Section 3 provides the complete workflow of an implemented system including possible alternative implementation strategies and their relative advantages. Section 4 presents the results of an extensive experimental study. Performance of the proposed methodology is reported in terms of accuracy, robustness against noise as well as execution time. Related work is reviewed in Section 5. Several preventive measures to counter the attack along with their feasibility are discussed in Section 6. Finally, Section 7 concludes the paper and identifies scope for future research.

2. Attack methodology

In this section, we describe the complete attack methodology, which can be broadly divided into two phases. The first phase involves processing of audio signals captured in the two microphones of a smartphone to determine the probability of entering a character in each tap. To achieve this, the nature of the two signals received at the microphones as well as the phone geometry and touchscreen keypad layout are exploited. This phase can be further divided into two stages, namely, noise filtering and tap key identification. In the second phase, NLP techniques are used to extract the most probable sequence of words corresponding to a tap sequence by processing the output of the first phase.

Of the two phases mentioned above, the activities of the first are partially carried out in the target phone itself. The rest of the processing is carried out on a remote server. A trojan app installed in the phone does the task of recording of the tap sound and sending it to the server. Once the data is available at the server end, relatively more computation intensive steps can be carried out at the convenience of the attacker. A detailed description of the complete implementation is provided later in Section 3.

2.1. Threat model

Any study on attacks and their prevention methods is based on an underlying threat model. Such a model is typically developed keeping in mind the operational conditions, user awareness, computational feasibility and attack goals. Our threat model includes the following components. First of all, as mentioned before, we consider that the target smartphone has stereo recording capability along with Internet connectivity. It is also assumed that the trojan app to be used for recording of the key taps can be installed in the phone without raising user’s suspicion. The goal of the attack is to read message texts and also passwords being typed in by the user. Even though the attack methodology is developed without making any assumption on the orientation of the phone, we consider for the sake of brevity that the target phone is held in landscape mode. Such assumptions and dependencies are discussed in greater detail in later sections along with supporting justifications.

2.2. Character sequence extraction using microphone data

Modern day smartphones almost invariably come with more than one, typically two, microphones providing stereo recording capability. Indeed, the iPhone 5 onwards, three microphones have been present in the iPhone. However, since the vast majority of smartphones have two microphones, in the rest of this work, we simply work with two, though the same methodology can be extended to three as well. When a user taps on the touchscreen to enter any text, a sound is produced for each tap. This tap sound is captured by the microphones, which are located near the touchscreen surface. As a result, and due to the fact that the microphones are solidly attached to the phone body, there is hardly any damping. However, there is a possibility of picking up ambient noise besides the tap sounds. In this subsection, we first explain the details of how the delay in arrival of the tap sound at the two microphones can be used to estimate the location of the tap and hence the tapped key and then show how the ambient noise can be eliminated to make the approach more robust.

2.2.1. Identification of key tap position

When a touchscreen key is tapped, there is a finite difference in the distances traveled by sound to reach the two microphones. As a result, from the captured key tap sound, the time difference of arrival (TDoA) [12] of audio peaks in these microphones can be computed. For each TDoA, the locus of a point of tap having constant difference in its distance from the two microphones (also approximated as points due to their small size with respect to the phone dimensions) forms a hyperbola, with the two microphones at its two foci. The sign of TDoA determines which half of the hyperbola (i.e., the semi-hyperbola) needs to be considered for a given key tap. The cross correlation based method proposed in [4] is used for computing the TDoA between the two audio signals recorded at the two microphones as explained below.

Fig. 1.

(a) Key pad layout superimposed on the chosen co-ordinate system (b) Hyperbola coverage on the left side of key pad with $k = 3$ (c) Hyperbola coverage on the right side of key pad with $k = 3$ (d) Hyperbola coverage on the left side of key pad with $k = 6$ (e) Hyperbola coverage on key pad in portrait layout with $k = 6$ .

We first define the co-ordinate system as follows. Holding it upright facing front, the origin coincides with the top left corner of the phone and the x-axis is parallel to the left edge. In such a co-ordinate system, one of the microphones (denoted as $m_{1}$ ) is located on the y-axis itself and the other (denoted as $m_{2}$ ) is positioned in the first quadrant. The coordinates of the microphones are denoted as $(0, p)$ and $(l, q)$ , respectively. The phone position can thus be modeled as shown in Fig. 1(a). The vertex of the hyperbola is also located in the first quadrant. We, therefore, use the general equation of a hyperbola with vertex at $(x_{0}, y_{0})$ . The hyperbola from the obtained TDoA can now be represented as follows: $\begin{matrix} (1) & \frac{{(x - x_{0})}^{2}}{a^{2}} - \frac{{(y - y_{0})}^{2}}{b^{2}} = 1 \end{matrix}$

In Equation (1), $(x, y)$ denotes a tap point, while a and b represent the parameters of the desired semi-hyperbola, which can be computed using the following equations in terms of the other known and measurable quantities. $\begin{matrix} (2) & a = \frac{Δ d}{2}, where Δ d = | d_{1} - d_{2} | \end{matrix}$ Here, $d_{i}$ is the distance of the ith microphone from the touchscreen point that was tapped, with $i \in {1, 2}$ . From Fig. 1, it is easy to derive that $\begin{matrix} (3) & c = \frac{\sqrt{{(l - 0)}^{2} + {(q - p)}^{2}}}{2} \end{matrix}$ l, p, q are phone parameters out of which l denotes the length (dimension of the longer edge). p and q are the distances of $m_{1}$ and $m_{2}$ , respectively, from the base of the phone. Thus, $\begin{array}{l} (4) & b = \sqrt{c^{2} - a^{2}} \\ (5) & x_{0} = c \\ (6) & y_{0} = min (p, q) \end{array}$

By substituting and solving for x in terms of y from Equation (1), we can write: $\begin{matrix} (7) & x = x_{0} \pm (\frac{a}{b} \sqrt{b^{2} + {(y - y_{0})}^{2}}) \end{matrix}$ This equation can be used to obtain x in terms of y, where $y \in [0, w]$ , w being the width of the phone. $Δ d$ is computed from the TDoA and the recorded audio signal parameters, namely, sampling frequency $f_{s}$ and sample offset $Δ s$ . Sample offset is the value by which one of the audio signals needs to be shifted so that the correlation between the two signals is maximum. The remaining parameters are computed using Equations (2) to (7). $Δ d$ is determined from $Δ s$ and $f_{s}$ using Equations (8) and (9) given below. $\begin{array}{l} (8) & TDoA = \frac{Δ s}{f_{s}} \\ (9) & Δ d = v_{sound} \cdot TDoA \end{array}$

Based on the above discussions, it is observed that by computing the value of TDoA, we can characterize the desired hyperbola. Thus, the problem of determining hyperbola parameters reduces to that of estimating the TDoA (or sample offset $Δ s$ ) between the audio signals received at the two microphones. The terms $TDoA$ and $Δ s$ are, therefore, used interchangeably since as Equation (8) shows, one can be computed from the other with $f_{s}$ remaining constant.

From the captured audio signals at the microphones, the first task is to determine the tap instants (detect peaks in the corresponding signal). Hence, we propose a scheme for automatic peak detection, the details of which are presented as the $Detect_Peak_Intervals$ algorithm (Algorithm 1). It takes two audio signals as input and returns a set of time intervals, each corresponding to a single tap. The algorithm is explained below in details.

Algorithm 1

$Detect_Peak_Intervals$

In Algorithm 1, the for loop of Lines 1–3 computes the energy levels of the signal by accumulating 10 sample points together. Next, in the for loop of Lines 4–10, for each sample point, the beginning and end points of the interval window are set in the variables $s_{k}$ and $e_{k}$ for the kth sample point. Lines 6–9 compute the maxima of the signal in this interval. If this value is greater than the threshold computed using θ, the corresponding interval is added to the set $tmpIntr$ . Finally, the computed time intervals are readjusted so that the peaks are positioned at the center of each interval.

It can easily be seen that Algorithm 1 runs in $O (n)$ time since there are two loops in the algorithm of complexity $O (n)$ and $O (k)$ , respectively. Here, n is the number of sample points of the audio signal, k is the number of taps in the recording and $n ≫ k$ .

As mentioned above, the output of Algorithm 1 is a set of time intervals, each containing a tap. These intervals are next processed to get the TDoA corresponding to each tap. The algorithm presented next (Algorithm 2– $(Compute_TDoA)$ ) is used to determine the sample offset $Δ s$ of the two signals in a given interval. It takes two audio signals ( $x_{1}$ and $x_{2}$ ) from the two microphones and for each interval returned by Algorithm 1, finds their cross-correlation. Then, it computes the median of the top p peak values.

Algorithm 2

$Compute_TDoA$

In this algorithm, the for loop of Lines 3–13 selects each interval from the set of intervals one at a time. Next, the for loop of Lines 7–11 selects the top p peaks (typically set to a value of 3) from the cross correlation values. Line 12 adds the computed median to the set $lagMedian$ . The final $lagMedian$ set is returned by the algorithm. The set of time lags thus obtained is used to determine the hyperbola corresponding to each tap as explained above. The locus of the valid semi-hyperbola computed using Equation (7), when overlayed on the keyboard layout, returns the shortest distance of each key from this semi-hyperbola.

The probability that a given key is tapped, i.e., an individual character entered by the user, is considered to be inversely proportional to the shortest distance of the key from the semi-hyperbola. Thus, we can obtain the top k most probable characters for each TDoA and hence, corresponding to each key tap. For example, Fig. 1(b) shows this situation when the left semi-hyperbola is selected with three letters in the landscape mode of the phone falling on the locus. Figure 1(c) depicts the same situation with the right semi-hyperbola selected. If we expand the locus to include six letters instead of three, and the left semi-hyperbola is selected as shown in Fig. 1(d), the keys q, w, e, a, s, z form the candidate set of keys that might have been tapped. Instead of holding the phone in landscape orientation, which usually makes text entry more convenient, if it is held in portrait mode, more number of keys get covered for the same width of the semi-hyperbola as shown in Fig. 1(e). However, this does not have any impact on the basic methodology for obtaining the candidate set. Further, whether the user uses one or both the thumbs to tap in the text does not affect the efficacy of the proposed approach except for the fact that the sounds corresponding to consecutive taps would arrive at the microphones faster.

The orientation in which a smartphone is held, i.e., portrait or landscape can be sensed and accordingly the candidate set is adaptively decided. It may be noted that, the trojan application to be used to carry out the attack, can send the smartphone information like its model number, based on which the geometric parameters can be determined either using a lookup table or through Internet search. Further, several of the operating system parameters, including currently active window information, are made available through appropriate Application Programming Interfaces (APIs) for user-developed apps to operate efficiently. From this information, the trojan can determine the type of app like messaging, gaming, browser, etc., that is currently being used. Accordingly, it decides whether to process the tap sounds or not. For example, if the currently active app is a video game, it would not be processing the audio signals corresponding to the taps any further. On the other hand, message entry and filling up of information in form fields through a browser are applications for which the audio files would be transmitted. It can also be tuned to process data only for one particular type of application.

2.2.2. Noise reduction

So far, we have considered an ideal setup for carrying out the attack with clean audio signals reaching the two microphones. However, in real-life situations, the recorded audio signals invariably contain noise. To take this factor into account, we first discuss how the performance of the methodology proposed in the last subsection is affected by the presence of noise. We then present a filtering technique that is used to improve its accuracy even when there is ambient noise.

To systematically study the impact of noise, white Gaussian noise of varying degree is added to the recorded audio signals. This results in a variation in the signal to noise ration (SNR). Higher the SNR, lower is the noise and clearer is the tap sound. Figures 2(a)–(c) show the effect of adding noise to a clean audio signal.

Fig. 2.

Time domain representaion of audio signal corresponding to text “dplhffr” (a) without adding noise and (b) after adding noise with $SNR = 35 dB$ (c) after adding noise with $SNR = 10 dB$ .

It is seen from the figure that increased addition of noise to a clean audio signal distorts it gradually. The original SNR for the example audio signal is 35 dB. At this level of noise, the methodology presented in the last subsection works well. However, as further noise is added and the SNR falls below 20 dB, its performance degrades. It is observed from the figures that when sufficient noise is added to bring down the SNR to 10 dB, the noise amplitude becomes almost equal to that of the actual tap sound and hence it is hard to locate the taps from the audio signal. In fact, with this level of noise, the peak-detection algorithm detects numerous peaks erroneously throughout the signal. As a result, the attack fails when the background noise SNR $⩽ 15 dB$ . In order to make the attack robust against noise, we propose a pre-processing technique for removal of noise and show how the performance improves even for a low value of SNR.

The first step towards noise reduction is to analyze the frequency spectrum of the clean audio with respect to the noisy signals. The frequency spectra are presented in Fig. 3.

Fig. 3.

Frequency domain representation of audio signal corresponding to the text “dplhffr” (a) without adding noise (b) after adding noise with $SNR = 35 dB$ and (c) after adding noise with $SNR = 10 dB$ .

It is seen from Fig. 3(a) that the tap sound is characterized by the presence of frequencies around 5 kHz and 8.3 kHz. Figure 3(b) is the frequency spectrum of the sound presented in Fig. 2(b), where it is observed that even though most of the characteristics of the spectrum are affected by noise, still the overall trend is maintained and hence, noise removal is actually not necessary. However, for a further degraded signal as presented in Fig. 3(c), noise removal becomes necessary for the attack to succeed.

It may be noted from the spectra that there is a particular frequency range for the key tap sounds and hence use of a frequency domain band-pass filter [22] is considered to be appropriate to remove the frequencies that are spurious and mostly contain noise. We have, therefore, designed a band-pass filter with the low and high cutoff frequencies set at 700 Hz and 12 KHz, respectively. When such a filter is used to process the noisy audio signal shown in Fig. 3(c), the resulting signal as shown in Fig. 4 is obtained.

Fig. 4.

(a) Time domain and (b) Frequency domain representation of audio signal corresponding to the text “dplhffr” after removing noise of $SNR = 10 dB$ .

It is seen from the time and frequency domain representation inf Fig. 4(a) and (b) that, on using the band-pass filter, the recovered audio signal has characteristics quite close to that of the original signal. Although not evident quite from the time-domain representation, the difference of this cleaned signal from the original signal can be clearly seen in the frequency spectrum. The cleaned audio retains its similarity with the original signal near frequencies 5 kHz and 8.3 kHz. When these band-pass filtered audio signals are subjected to the attack, the accuracy improves significantly. The detailed results are presented in Section 4.

2.3. Word and sentence inferencing from probable character sequences

In this subsection, we describe our proposed methodology for constructing words and sentences from the sequence of characters obtained using Algorithm 2 of Section 2.2.1.

In any given language, a valid sequence of n characters is called an n-gram. Each language including English has a well-studied set of n-grams and their probability of occurrence in written text. Once the k most probable characters are obtained using the algorithms presented in Section 2.2.1, we make use of the uni-gram, bi-gram and tri-gram probabilities of the language. We initially assume that the message that a user is tapping in, is written in English. It forms the basis for learning the language model that helps to reduce the search space and determine the most probable sentence. Other language models can also be similarly used without changing the basic steps. The character-level n-grams as well as word-level n-gram counts are learned from the widely used e-mail corpus1

¹
Enron Mail corpus: https://www.cs.cmu.edu/~./enron/.

as an offline process.

The probability of a particular n-gram is the count of that n-gram divided by the total number of n-grams in the language. Since the number of n-grams is usually a very large number, the probability value comes out to be too small. To compute the probability of occurrence of a string of characters, we need to multiply all the n-gram probabilities. If we use the actual probability values, the product might cause an underflow. On the other hand, taking the product of the count of n-grams results in an overflow. Hence, to avoid both underflow and overflow, instead of using the probabilities directly, we introduce the notion of $pseudo - probability$ , which is the log of the count of n-grams [15].The steps for computing pseudo-probability are shown in Algorithm 3 (Compute_pseudo-probability_of_a_string). In this algorithm, the outer $for loop$ of Lines 2–12 runs the inner loop considering i-grams in its ith iteration. The inner $for loop$ of Lines 7–10 picks up an i-gram from the string, computes its pseudo-probability, and adds it to the temporary variable ${prob}_{t}$ . Finally, Line 11 adds the temporary probability ( ${prob}_{t}$ ) to the total probability variable $prob$ of the string after multiplying it with the appropriate weight factor.

Algorithm 3

$Compute_pseudo - probability_of_a_string$

There are two loops in Algorithm 3. The outer loop runs n times if n-grams are used for the computation, while the inner loop runs in the order of the length of the string, denoted here as l. Hence, the overall time complexity is $O (n \cdot l)$

From the computed score of strings, we build an inference tree to determine the most probable text that was typed. The language model is based on n-gram and inference based tree [7,29]. The character-level n-grams are used to construct a word-level inference tree. At each level, we include the next set of probable characters and recompute the probabilities of the strings thus obtained. Only the top k nodes are retained and the rest are pruned. This prevents the tree from expanding exponentially. In the last level of the tree, we are left with k most probable words that can be constructed using the character sequences obtained from the algorithms presented in Section 2.2.1.

The node used to create the tree is a three-tuple represented as $node = ⟨ string, prob, children ⟩$ , where $string$ is the partial word that is stored in the node. $prob$ stores the pseudo-probability of this partial word computed using Algorithm 3. The children for this node are stored in the list $children$ . For any child node c, if p is its parent node, then $p . string$ is a prefix of $c . string$ (Property 1). Also, the tree is constructed in such a way that at any level l of the inference tree, the maximum number of nodes does not exceed n, where n is a user-defined input parameter (Property 2).

We propose an algorithm $(Generate_Word - level_inference_tree)$ that builds the word-level inference tree whose properties and node structure are defined in the previous paragraph. It is shown in Algorithm 4.

Algorithm 4

$Generate_Word - level_inference_tree$

Algorithm 5

$Truncate_leaves$

This algorithm builds the inference tree one level at a time. Each level of the tree adds to the tree, the set of probable characters corresponding to the next tap. The outermost $for loop$ of Lines 2–13 selects the set of characters corresponding to the ith tap in the ith iteration. Next, the leaf nodes in the tree are pruned so that the tree retains Property 2. The $for loop$ of Lines 5–12 selects each of the remaining leaves one at a time. Pruning of leaves is done using the $Truncate_leaves$ procedure shown in Algorithm 5. The $for loop$ of Lines 6–11 adds new child nodes to the current node and sets the $string$ so that they satisfy Property 1. The $prob$ value in new nodes (denoted as t) is computed using Algorithm 3. Algorithm 4 terminates when all the probable characters corresponding to all the key taps have been added to the tree.

There are three nested $for loops$ in Algorithm 4, with time complexity of $O (n)$ , $O (c)$ and $O (k)$ , respectively. Here, n is the total number of taps, k is the number of probable characters returned per tap and c is the number of allowed child nodes per level. Invocation of $Truncate_leaves ()$ contributes to $O (l)$ time, where l is the number of nodes at a level prior to pruning. So, the time complexity of Algorithm 4 is $O (n \cdot (O (l) + (c \cdot k)))$ . c and k being fixed parameters and independent of the input size n, the time complexity becomes $O (n \cdot l)$ .

A similar approach is used to build a sentence-level inference tree based on the most probable word sequence. As in the word-level inference tree, at each level, we prune all the nodes except the top k (based on the pseudo-probability value). In the last level, we are left with the top k most probable sentences that can be constructed from the set of probable words and hence from the set of probable characters.

Both the word-level as well as sentence-level inference trees use leaf pruning to prevent exponential expansion of the tree. The pruning algorithm $(Truncate_leaves)$ first takes the list of all children at the current level as shown in Algorithm 5. It then deletes all nodes except the top c nodes and returns the modified list. The time complexity of this algorithm is $O (n \cdot c)$ , where n is the number of leaves prior to pruning and c is the desired number of leaves after pruning. Since c is a constant, the overall time complexity is $O (n)$ .

2.4. Password guessing

The attack methodology as described above can not only be used to infer dictionary words but also non-English strings, such as passwords. The contemporary password standard is Payment Card Industry Data Security Standards ( $PCI DSS$ ). Some of the key points revealed by a recently conducted survey involving $\sim 5000$ PCI DSS passwords are given below.2

²
http://resources.infosecinstitute.com/beyond-password-length-complexity/#gref

Length of $61 %$ of the passwords is exactly equal to the minimum length limit with the average length being $= 9.6$ .

Approximately $86 %$ of the passwords have only one uppercase letter, which is the first letter of the string.

Lowercase letters dominate the majority of characters with about $59 %$ of passwords containing 5–7 lowercase letters.

Numbers, if present, only appear in one or two digits with only single digit occurrence in over $40 %$ of the passwords.

Significant similarity between default password and username.

These findings point to the fact that the passwords that are normally set, are not too long, not overly complex and have similarity with other information that can be obtained easily (e.g., default password, username, etc.). Thus, with the help of audio signals in reducing the password search space, dictionary attacks can be carried out very efficiently. This subsection presents ways in which passwords can be inferred using the TDoA computed from the recorded audio signal and phone geometry.

A brute force dictionary attack would iterate through all possible combinations of the allowed characters to find a password. Although it is very inefficient and takes a lot of resources, in the absence of any other information regarding the password, this is the only way to guess the password. The number of guesses to be tried is given by $S = n^{1} + n^{2} + n^{3} + \dots + n^{l}$ , where l is the upper length restriction for the password and n is the alphabet size. This is a geometric series with first term $= n$ and common ratio $= n$ . Thus, the sum $S = (n (n^{l} - 1)) / (n - 1)$ , which is $O (n^{l + 1})$ .

A basic improvement that can be made using the proposed attack methodology is to use the audio signal to detect only the number of taps a password entry needs. Thus, if the password length is determined to be q from the number of key taps detected, the attack will take $O (n^{q})$ time to guess the password, which is usually much smaller than the brute force approach.

The attack can be made significantly more efficient if the TDoA extracted from the audio signal to generate a candidate set for each tap is made use of. As discussed in the previous subsections, the candidate set contains the most probable characters corresponding to each tap, in decreasing order of probability. The collection of candidate sets corresponding to all the taps is then used as the search space for the passwords. Assuming that the length of the candidate set for each tap is k, the time complexity of search becomes $O (k^{q})$ , where q is the number of taps detected. This is a significant improvement over the previous method since the candidate set to search for the password is only a small fraction of the complete set of characters allowed in the passwords.

In this section, we have explained the two phases of the proposed attack methodology and the algorithms used to carry out the attack. The next section presents the details of implementation of a complete system for carrying out such an attack.

3. Implementation details

In this section, we describe a complete implementation of the attack flow as proposed in the last section, from recording of sound to extraction of estimated typed text. A trojan app installed in the potential victim’s smartphone generates audio files from the tap sounds received at the two microphones and sends them to the attacker’s server as shown in Fig. 5. The only privileges required are microphone access permission during installation and Internet access during execution for sending the audio files to the attacker’s server for further processing.

Fig. 5.

Complete block diagram of the attack flow.

The first module at the server end processing in the overall block diagram of Fig. 5 takes the raw audio file as input and runs the noise filtering step to remove ambient noise. The next step generates the list of time intervals that contain peaks (taps on the touchscreen). The list of peaks serves as input to the next module, which computes the cross correlation between the sound waves from the two microphones as explained in Algorithm 2. Figure 6(a) shows the energy spectra of the audio signals received at the two microphones for each tap. As seen in the figure, the taps produce sharp peaks, preceded and followed by low signal amplitude. This property of the microphone captured signal is used to extract peaks from the audio files. A noticeable shift exists between the two audio signals received at the two microphones as observed from their time domain representation in Fig. 6(b). An expanded view of the same is shown in Fig. 7(a). Cross correlation is next computed for these two signals, whose peak is the estimated value of their TDoA. The TDoA value thus computed serves as a parameter for defining the hyperbola that corresponds to the candidate region on the touchscreen where the tapping was done. Another important observation from Fig. 6(a)–(b) is that, one of the signals has higher amplitude than the other. This fact is used to select the valid semi-hyperbola, i.e., left or right as depicted in Fig. 1(b) and (c). Instead of amplitude difference, the sign of TDoA can also be used to arrive at the same decision. For instance, if microphone $m_{1}$ peak occurs before microphone $m_{2}$ peak, the tap must have been made in that half of the keypad which is closer to $m_{1}$ .

Fig. 6.

(a) Energy spectrum of the input signal corresponding to a single tap. (b) Input signals in time domain corresponding to a tap.

Fig. 7.

(a) Shift between two signals captured by the two microphones in time domain. (b) Overlay of hyperbola region on phone keyboard.

The TDoA magnitude is next used to determine the probable touchscreen region where the tap occurred, as shown in Fig. 7(b). In this example, the key a was tapped on the touchscreen and the computed TDoA is used to plot the correct semi-hyperbola while the other half is discarded. It is observed that the most probable region passes over the letter a. In a similar manner, for every tap, the most probable region on the touchscreen can be determined. To obtain the k most probable characters, the width of the band can be set in such a way that a total of k number of keys fits inside the semi-hyperbola band. The $Probable Character Extraction$ module of Fig. 5 returns the set of k characters that have the highest area of intersection with the plotted region.

The final step of text inferencing from the candidate set is done by building an inference tree and appropriately pruning leaves at each level as explained in Algorithms 4 and 5. The same approach is used for both word and sentence level inferencing of Fig. 5.

In Table 1, we show a sample input-output combination. The user had entered the text “hello world” by tapping the appropriate keys on the touchscreen. The output, which represents the possible characters corresponding to each key tap, keeping the thirteen top possibilities is shown next to the input. It is observed that all the correct letters are present in the output character set. The word and sentence inferencing steps easily identify the correct words as “hello” and “world” from the candidate characters. It is also observed that the characters returned for each tap are located in the same vicinity of a standard QWERTY key pad layout. While we show the result here for a situation where thirteen characters are retained for each key tap, for all except one of the letters (letter r), the first six options contain the correct letter. Thus, working with six characters would have sufficed with the error correction done using a spell checker.

Table 1

Sample input and corresponding output for the complete system

Input	Output candidate set	Input	Output candidate set
h	uhbnjyigvmkto	w	waszqedxrfctg
e	waszqedxrfctg	o	polikmjunhbyg
l	polikmjunhbyg	r	waszqedxrfctg
l	polikmjunhbyg	l	polikmjunhbyg
o	polikmjunhbyg	d	rfcxdtegvzsyw

It may be noted that, in our implementation, we have considered all the steps to be carried out in the server end (refer to Fig. 5). This enables the attacker to deploy all the computing power available at his disposal to decipher the text. However, it also implies that the sound files have to be transmitted to the attacker’s server using the Internet connectivity of the smartphone. An alternative to this approach is to carry out a part or all of the steps depicted in Fig. 5 in the smartphone itself and then send back the deciphered text to the attacker’s server. We next explore the implications of the two approaches.

Carrying out all the processing steps at the server end is apparently beneficial to the attacker. This is not only due to the possibility of deploying more computing power, but also the amount of time the attacker can spend in the process and the potential for deploying newer strategies as and when developed. However, the issue of transferring large audio files needs to be looked into. Table 2 provides the size of audio files in two formats, namely, .wav and .mp3 for several words (typically passwords) of varying length. The corresponding estimated time for transmitting the .mp3 files in 3G and 4G networks are also shown. It is observed that, if .mp3 file is used, the actual transmission time is negligible and is not likely to raise suspicion of the user. However, if the message size is large, the overall transmission time would be higher. For such cases, it would be more prudent to execute the code corresponding to Algorithms 1 and 2 in the trojan app embedded in the smartphone itself, each of which does not take much time to execute as explained later in Section 4. Our code for these two algorithms is now developed using Matlab libraries. However, they may be ported in Java and tested on Android platform for exploring the viability of this alternative approach.

Table 2

Variation of different audio file sizes with text length

Typed	.wav file size	.mp3 file size	.mp3 trans. time (3G)	.mp3 trans. time (4G)
Text	(in MB)	(in MB)	(in second)	(in second)
ayhy	1.396	0.176	1.29	0.77
byipbf	1.786	0.225	1.65	0.99
bfwjsx	1.947	0.245	1.80	1.08
cdtapq	1.951	0.246	1.81	1.09
bjhetdqmprzv	2.401	0.302	2.21	1.33
baxmcqptcaci	2.719	0.342	2.51	1.51

4. Experimental results

The system implementation described in Section 3 was used to carry out a set of extensive experiments. In this section, we highlight the important observations made and present quantitative results on accuracy, execution time, potential for password guessing and impact of noise.

4.1. Audio processing results

In Fig. 8, we plot various TDoAs measured in terms of number of sample points and the corresponding character sets. In one of the phones used in our experiments (layout shown in Fig. 1(b)), $TDoA \in [- 8, 10]$ . Each TDoA is plotted over the keypad layout keeping $k = 3$ as described in Section 3 and the character set corresponding to that TDoA is returned. A linear trend in the TDoA values is observed and the character sets also move from the left end of the keypad to the right. For instance, for $TDoA = 10$ , the character set is ${q, w, a}$ while $TDoA = - 8$ corresponds to the character set ${p, l, o}$ .

Fig. 8.

Character sets for different values of TDoA.

Fig. 9.

(a) Computed values of TDoA obtained from cross-correlation for 100 taps of a. (b) Distribution of the computed TDoA.

We next plot the distribution of the computed value of TDoA for a given letter (a in this case) in Fig. 9(a). The set of top k most probable characters contain the typed character with very high accuracy as seen in Fig. 9(b). In this example, using $k = 3$ , the tapped character is found to be present in the set of probable characters $90 %$ of the time. Even for the rest $10 %$ , the value of TDoA differs only by a value of 1. Thus, it can be concluded that TDoA is a robust feature for the proposed attack methodology.

4.2. Text extraction results

In Fig. 10(a), we show how word occurrence frequency and word level accuracy vary with change in word length. A test set of 30 messages of varying sizes was selected and the deciphered text was compared with the actual typed text. The twin vertical bars represent their occurrence frequency and correct recovery frequency. On the other hand, the piecewise continuous line represents accuracy. It is observed that accuracy becomes close to 1.0 for word length $= 1$ or 2. These are primarily stop words like $a, a n, b e, a t$ , etc., which occur quite frequently and hence, can be inferred fairly accurately. The next level of peaks are found at word lengths of $3, 4$ and 5. This is due to the fact that the average word length is 5.1 letters in the English language [6]. Since words with greater lengths occur only rarely, the number of such samples to train the language model is less. Hence, the accuracy decreases with increasing word length beyond 5. The average word level accuracy is $81 %$ .

Fig. 10.

(a) Accuracy for different word lengths (b) Variation in accuracy with message size.

Next, Fig. 10(b) shows how accuracy of deciphering a complete text varies with message size. The same data set considered in Fig. 10 is also used in this study. From the figure, it is seen that accuracy lies in the range of $57 %$ – $95 %$ , with an average of $88 %$ . It is also observed that accuracy improves with increase in the length of text. The NLP algorithm used in our work considers that the text is primarily composed of English words due to which such a trend is observed. This is because, with increase in message length, the ratio of valid English words to the total number of words increases. There are also a few local variations in the figure, a point that was further investigated in depth. It was found that contrary to the assumption made, the message often contains certain words like URLs, dates, proper nouns, etc., that are not present in the English dictionary, thereby causing some degradation in performance.

4.3. Execution time

While we have analyzed the time complexities of the algorithms in Section 2, we also report the actual execution time in this subsection. The algorithms were implemented and tested on an Intel i5 3.2 GHz CPU with 8 GB of RAM. Matlab libraries were used to implement the basic operations. For this configuration, the noise filtering step took 7 milliseconds while Algorithms 1 and 2, i.e., peak detection and TDoA computation, took 0.2 second and 8 milliseconds, respectively. It may be noted that these steps are carried out for each tap sound and hence, the results are independent of the actual length of the message being typed.

Table 3 presents the time taken by the text inferencing algorithm to get the most probable text that can be built using the candidate set provided by the signal processing units. In the table, the message length in number of words is shown along with execution time in seconds. It can be seen from the values of running time that it closely follows an $O (n log (n))$ trend validating the theoretical analysis of the algorithm done in Section 2.3. It may be noted that, this execution time for the text inferencing algorithm includes both the time to generate the inference tree from the candidate character sets as well as the time for searching through the tree to determine the most probable text. A small positive deviation from the strict $n log (n)$ complexity is observed, which can be attributed to implementation constructs, e.g., the time needed for memory allocation and object creation for each node of the tree.

Table 3
Execution time of the text inferencing algorithm for different message lengths

Message length (in words) Text inferencing time (sec) Message length (in words) Text inferencing time (sec) Message length (in words) Text inferencing time (sec)

5 1.34 22 19.28 34 40.02

7 1.99 26 22.65 34 43.20

12 5.63 26 22.42 36 41.06

14 8.33 27 25.31 39 54.07

16 9.23 31 41.37 41 62.00

20 14.82 32 37.84 41 59.11

22 18.73 33 39.10 53 106.23

Message length (in words)	Text inferencing time (sec)	Message length (in words)	Text inferencing time (sec)	Message length (in words)	Text inferencing time (sec)
5	1.34	22	19.28	34	40.02
7	1.99	26	22.65	34	43.20
12	5.63	26	22.42	36	41.06
14	8.33	27	25.31	39	54.07
16	9.23	31	41.37	41	62.00
20	14.82	32	37.84	41	59.11
22	18.73	33	39.10	53	106.23

4.4. Password guessing result

As discussed in Section 2.4, other than deciphering message text, the proposed attack methodology can also be used to guess passwords. We first present the details of the data set used to carry out the password guessing experiments and then report results on its accuracy. To test the password guessing performance, 374 passwords were generated that are random strings consisting of lowercase English letters. We consider the lowercase letters since as discussed in Section 2.4, most of the passwords contain only lowercase letters and those that have other characters also contain the lowercase characters in majority. The letter distribution of the password data set is depicted in Fig. 11. In the data set, there are passwords of different lengths ranging from 4 to 14 with equal number of passwords of each length.

Fig. 11.

Distribution of letters in the password data set.

The TDoA values obtained from the audio signals corresponding to each tap were used to guess individual letters of the password. To study the performance, different candidate set sizes were considered. The accuracy thus obtained is presented in Table 4. Each value in the table represents the accuracy with which letters were detected for passwords of a certain length (represented by the row). The columns represent the candidate set sizes which are considered for analysis. For instance, the accuracy with which letters can be detected for a password of length 9 taking the candidate size to be 10 can be found following the row corresponding to password length 9 and the column corresponding to the candidate set size 10. From the table, it is observed that using a candidate set size of 11 gives the most promising result. This is because the average password length is usually around 9 to 10 and the accuracy figures for that password length with a candidate set size of 11 are $\sim 87 %$ and $90 %$ , respectively.

Table 4

Variation of password guessing accuracy with respect to different password lengths and candidate set lengths

Password length	Candidate set size

	1	2	3	4	5	6	7	8	9	10	11	12	13	14
4	20.83	33.33	41.67	41.67	41.67	41.67	50.00	58.33	58.33	70.83	87.50	91.67	91.67	91.67
5	6.00	22.00	24.00	26.00	30.00	36.00	44.00	60.00	66.00	80.00	88.00	92.00	92.00	92.00
6	9.52	16.67	21.43	28.57	40.48	47.62	52.38	59.52	59.52	73.81	78.57	83.33	83.33	83.33
7	3.57	21.43	28.57	42.86	53.57	57.14	57.14	60.71	64.29	78.57	89.29	89.29	92.86	92.86
8	8.33	12.50	20.83	37.50	50.00	54.17	62.50	62.50	62.50	75.00	87.50	95.83	95.83	95.83
9	13.33	24.44	31.11	48.89	51.11	53.33	57.78	66.67	75.56	82.22	86.67	86.67	86.67	86.67
10	16.67	26.67	30.00	30.00	36.67	43.33	46.67	50.00	70.00	73.33	90.00	90.00	90.00	90.00
11	13.64	27.27	31.82	54.55	59.09	63.64	63.64	63.64	72.73	86.36	90.91	100.00	100.00	100.00
12	25.00	36.11	41.67	44.44	47.22	52.78	58.33	63.89	75.00	77.78	83.33	86.11	86.11	86.11
13	6.15	23.08	26.15	29.23	36.92	41.54	41.54	43.08	49.23	60.00	72.31	76.92	80.00	80.00

4.5. Results on noise filtering

Fig. 12.

Performance of the proposed approach for varying SNR and candidate size (CSS) (a) without noise filtering (b) with noise filtering.

As discussed previously, the attack methodology proposed in Section 2.2.1 is robust against ambient noise upto a certain extent. Figure 12(a) shows the variation in accuracy as the SNR is lowered from 100 dB to 0 dB through addition of white Gaussian noise. The results have been shown for varying lengths of candidate set size (CSS). It is observed that the accuracy of the basic methodology does not degrade till SNR falls to 20 dB, beyond which due to a high level of noise, its performance begins to suffer. Also, as expected, accuracy is higher for longer candidate set size as compared to smaller ones. Although a fall in SNR to 20 dB implies a substantial degree of ambient noise, we used the noise filtering step proposed in Section 2.2.2. The impact of this pre-processing step on accuracy is depicted in Fig. 12(b). It is seen that upto 5 dB of SNR, there is no degradation in performance. Thus, the methodology can work even in a very noisy environment. It also shows that if a user taps in gently making the tap sound less loud as compared to others who tend to use hard taps, the proposed methodology would work with high accuracy. This is due to the fact that, for the same level of ambient noise, although a light tap lowers the SNR, the proposed methodology would perform equally well.

5. Related work

While our recent work [11] is the first ever approach that exploits the built-in microphones of a smartphone to decipher text from touchscreen key taps, there is some related work in the literature that address similar problems. There are several possible ways of launching side-channel attacks using the various forms of emanations from electrical and mechanical devices, thereby revealing information associated with the source. The attack path heavily depends on the type of emanation and the ease of capturing the same [5,13,19,24]. In particular, threats that exploit acoustic emanations constitute an important field of research. This is due to the fact that almost all devices, including mechanical, electrical, electronic or computing devices, generate some form of sound either as they operate or when a user interacts with them. Some attempts have been made in recent years where methods were designed to recover information from acoustic emanations of devices.

Asonov et al. [1] show that the keyboard strokes can be detected from the sound generated by mechanical keyboards. It is argued that, although the variation in keystroke sound for different keys is imperceptible to human ear, proper signal processing techniques followed by a well-trained classifier can distinguish between them. They use Fast Fourier Transform (FFT) to extract a feature set from the received sound signal and multi-layer perceptron as a classifier. Their experiments show that on an average there are $2.5 %$ wrong detections when distinguishing between two keys. While experimenting with more keys, $12 %$ of the time the actual key was not one among those represented by the top three candidate keys.

Zhuang et al. [30] took a relook at the above issue and presented a novel technique for recovering even random keystrokes. Unlike the approach followed in [1], which needs labeled data for supervised learning, this methodology makes use of the statistical constraints of the English language for unsupervised learning. Instead of FFT, cepstrum features are used as they show higher discriminating power. Clustering is done by keeping the number of clusters slightly greater than the actual number of keys to be detected. This is done to account for the imprecision of the clustering algorithm. Next, Hidden Markov Model (HMM) is used to get the acoustic class that a stroke belongs to. To extract text, further spelling and grammar checking is carried out. A feedback based training methodology is followed using the corrected output from previous rounds of detection. It is shown that Gaussian mixtures and Linear classification perform better than neural networks. A maximum character level accuracy of $92 %$ was achieved.

With the growing popularity of mobile devices and the number and variety of sensors they come equipped with, the problem of exploiting emanations needs to be addressed in a more general way. Several attack techniques that tend to succeed with a moderately high accuracy have been proposed. Meng et al. [18] demonstrate the use of mobile phones to enhance the Kinect [27] experience. They propose that mobile sensors like gyroscope, accelerometer and microphone can be used to broaden its view as well as to improve its accuracy and hearing capability. Kinect typically suffers from the problem of occlusion, while the sensors of a phone can send data associated with the player even when he is out of the viewing range. It is shown that the quality of Kinect sensor output tends to improve when multiple sensor data is fused. A related work by Dagon et al. [8] presents an overview, including a nomenclature, of the various attacks that can be carried out using mobile phones. Preventive measures are also suggested for minimizing the risk from the identified threats.

Since the work reported in this paper aims at deciphering text being typed in smartphone touchscreen, we discuss some related work in this specific field. Miluzzo et al. [20] identify the possibility of using accelerometer and gyroscope sensors and present ways to recover the tap position along with the typed text. They argue that different values of linear and angular acceleration are produced from different points on the screen as they are tapped, which can be captured using the phone sensors. The extracted feature set comprises both time and frequency domain features. A labeled data is used to train a number of classifiers, namely, k-nearest neighbor, multinomial logistic regression, support vector machines, random forests and bagged decision tree. For inferencing, an ensemble technique is used, which finds consensus among the results returned by the individual classifiers. Experiments were carried out to infer the clicked icon as well as the key clicked while typing. In the icon tapping experiment, the average accuracy is reported to be $78.7 %$ and $67.1 %$ on iPhone and Nexus S, respectively. For the letter tapping experiment, an average accuracy of $65.11 %$ is reported. There are also ways to extract the typed keys using just the vibrations [9].

On a different note, Zhu et al. [28] use mobile phones as devices for recording the keystroke sounds generated by a mechanical keyboard. The mobile phones are placed in the vicinity of the keyboard being typed on. With a sufficiently large number of such external microphones, the keystroke region can be identified with good accuracy. For further generalizing the scope of the attack, the orientation of the keyboard is made unknown. Reconstruction of the keyboard orientation is framed as an optimization problem and the proposed algorithm is expected to return the most likely keyboard layout. The keystrokes are finally recovered from the sound captured by the mobile phones. It is reported that this methodology can recover more than $72.2 %$ of the keystrokes correctly.

There is, however, no known work beyond the initial attempt made by us [11], that can decipher the key taps from touchscreen mobile phones. We feel that this attack methodology works in a more realistic setting unlike the contrived situations considered in some of the existing approaches. For example, placing several mobile phones in the vicinity of a keyboard is likely to evoke suspicion. Similarly, use of special applications in mobile phones for launching an attack is often not feasible since the users take special care while using them. External microphones for recording tap sounds on smartphone touchscreen also cannot be meaningfully used due to two reasons. First of all, their position relative to the sound source can vary with time and secondly, these will be susceptible to ambient noise.

Since the proposed methodology uses the microphones already attached to the smartphone, they can silently record and transmit the audio signals to the attacker’s server for further processing. A limited amount of pre-processing like compression can also be carried out in the phone itself before transmission. It also does not require on-line connectivity when the data is being captured. The recorded sound files can be transmitted at a later time when the user connects to the Internet. Unlike some of the other methods for acoustic emanation based attack, which require supervised learning, the proposed methodology does not use any learning stage. The dependency on the phone layout in determining the candidate set of characters for each tap can be resolved by merely sending the phone model through the trojan app to the server, which can be easily obtained. This is made available to all apps as they optimize their performance by setting several of their parameters like aspect ratio, viewing format, etc., using this information.

6. Preventive measures

From the results reported in Section 4, it can be seen that the attack methodology proposed in this paper works with fairly high accuracy both for deciphering message text as well as passwords. The goal of this paper so far has been to raise awareness that such an attack can indeed be carried out. In this section, we present some preventive measures by which this attack can be thwarted or at least its impact can be minimized.

The first step would be to block evaluation of TDoA and a plausible approach is to deactivate one of the microphones. This would imply that no candidate set can be generated for each key tap. However, this also means that either the stereo recording capability remains unused or the user has to selectively turn the two on when such a requirement arises.

The second approach is broadly based on fudging the tapping sound. One possibility is for the user to keep the key-press tone ON. For each tap, an additional sound will be generated that can partially mask the actual sound created by the tap. The sound produced in this manner is generated by the speaker of the phone and its frequency is also constant for all the keys. Such a sound, even if recorded, will return a TDoA that points to the location of the speaker of the phone. Table 5 shows the candidate sets that are generated when the background key tap sound is ON corresponding to the same input string “hello world” used in Section 4. Rows in the table represent each input letter corresponding to the word “hello world”. The columns represent volume of the background system-generated sound for each key tap. The effect was studied for different volume settings for this background sound. It is observed that, when the volume is in the $25 %$ – $75 %$ range, most of the typed letters are present in the candidate sets. In the table, each correct occurrence of a letter is highlighted. For instance, corresponding to the letter e of the word $hello$ , when the background sound level is upto $75 %$ , the letter e is present in the candidate set generated from the tap sound. But as the background sound level is increased to $100 %$ , the letter e is no longer present in the generated candidate set. As the volume is increased further, the number of wrongly detected letters keeps increasing. When the volume is set at $100 %$ , the TDoA detection algorithm starts to detect extra peaks and hence it becomes almost impossible to infer any meaningful information out of it. This results in wrongly detecting higher number of candidate character sets than the number of taps actually made as seen in the last column of Table 5.

Table 5
Candidate sets for the input string “hello world” for different volume levels of background key tap sound

Typed letter (input) Volume of key tap sound

25 50 75 100

h p o l i k m j u n h b y g p o l i k m j u n h b y g p l o m k i n j u b h y v y h b v g t u j n c f r i

e s a q d w f b e n v g m c y u t i r o e p w q h j g s a q d w f b e n v g m c y h b v g t u j n c f r i

l p o l i k m u j n y h b t p o l i k m u j n y h b t p l o m k i n j u b h y v p o l i k m u j n y h b g

l p o l i k m u j n y h b t p o l i k m u j n y h b g y u t i r o e p w q h j g y u t i r o e p w q h j g

o o l k i m p j n u h b y g p o l i k m u j n h y b g q a w z s e x d r c f t v s a q d w f b e n v g m c

q a w s e z d r x f t c g

q a w s e z d r x f t c g

q a w s e z d r x f t c g

w q a w z s e x d r c f t v q a z w s x e d c r f v t r f c x d t e g v z s y w q w a e s r z d t x f c y

o o l k i m p j n u h b y g p o l i k m u j n y h b g p l o m k i n j b u h v y q w a e s r z d t x f c y

r q a w s e z d r x f t c g q a w s e z d r x f t c g s a q d w f b e n v g m c y u t i r o e p w q h j g

l p o l i k m u j n y h b t p o l i k m u j n y h b t i m k j n o u l h b p y g q a w s e z d r x f t c g

d r f c x d t e g v z s y w x r d f e c z t s g w v a s a q d w f b e n v g m c y h b v g t u j n c f r i

y u t i r o e p w q h j g

Typed letter (input)	Volume of key tap sound
h	p o l i k m j u n h b y g	p o l i k m j u n h b y g	p l o m k i n j u b h y v	y h b v g t u j n c f r i
e	s a q d w f b e n v g m c	y u t i r o e p w q h j g	s a q d w f b e n v g m c	y h b v g t u j n c f r i
l	p o l i k m u j n y h b t	p o l i k m u j n y h b t	p l o m k i n j u b h y v	p o l i k m u j n y h b g
l	p o l i k m u j n y h b t	p o l i k m u j n y h b g	y u t i r o e p w q h j g	y u t i r o e p w q h j g
o	o l k i m p j n u h b y g	p o l i k m u j n h y b g	q a w z s e x d r c f t v	s a q d w f b e n v g m c
				q a w s e z d r x f t c g
				q a w s e z d r x f t c g
				q a w s e z d r x f t c g
w	q a w z s e x d r c f t v	q a z w s x e d c r f v t	r f c x d t e g v z s y w	q w a e s r z d t x f c y
o	o l k i m p j n u h b y g	p o l i k m u j n y h b g	p l o m k i n j b u h v y	q w a e s r z d t x f c y
r	q a w s e z d r x f t c g	q a w s e z d r x f t c g	s a q d w f b e n v g m c	y u t i r o e p w q h j g
l	p o l i k m u j n y h b t	p o l i k m u j n y h b t	i m k j n o u l h b p y g	q a w s e z d r x f t c g
d	r f c x d t e g v z s y w	x r d f e c z t s g w v a	s a q d w f b e n v g m c	y h b v g t u j n c f r i
				y u t i r o e p w q h j g

Fig. 13.

Time domain signal of taps with background key tap tone volume set to (a) $25 %$ and (b) $100 %$ .

The presence of this background sound can be verified from the time domain audio signal shown in Fig. 13(a) and (b). It is observed in the figures that, near each peak that represents a tap, there is also another peak corresponding to the background key tap sound. When the volume of this sound is set to $25 %$ , the actual key tap can be easily distinguished from the background sound, but as the volume is increased to $100 %$ , it becomes difficult to distinguish between the actual tap sound and the background sound, causing the attack to perform poorly. A similar effect may be achieved if the vibration on each key-press is kept ON. However, both of these counter-measures are inconvenient for the user and they drain the battery as well. Due to these reasons, users normally prefer to keep them turned OFF.

A third direction to minimize the success of the attack is to ensure that the OS displays notifications (using icons) to alert the user about which of the devices are being used in the background. As a result, the user will be notified if the two microphones or any other available sensor is currently running. The OS can also be made to display the apps that are currently using the sensors. While this approach cannot be made to prevent the attack, it can act as a deterrent since the user becomes aware of background apps that are trying to eavesdrop and possibly terminate the same to prevent the attack from succeeding. Such measures, however, demand additional OS features as well as a certain level of maturity and knowledge on part of the user.

Another potential countermeasure could be to prompt about device usage or to deactivate the microphones. The user may also be notified when an attack as suggested in this paper becomes viable. This may be achieved by detecting when the user starts typing in while the microphones are still active. Since these two activities (stereo recording and typing) do not always seem to occur simultaneously for regular users, this can be a promising prevention measure under certain situations. However, it need not always be effective as the microphone is almost always ON while typing, trying to pick up keywords that activate voice commands, e.g., “Okay Google”. It is also possible that the user has installed certain custom keyboard apps having voice activated data entry features and most of the voice controlled functions require and use both microphone and Internet simultaneously. Thus, it would be difficult to judge the legitimacy of an app simply by looking at the devices that it is trying to use. What can be done instead is to develop a small footprint monitor that would run in the background to study the behavior of each app and assign a $trust_value$ score to them. This score can be later used in conjunction with the current behavior to alert the user of any potential malicious activity. Even proactive measures can be taken like temporarily switching to mono recording whenever it detects that the user is typing a password and another app with low $trust_value$ score is using both the microphone and the Internet.

7. Conclusions and future directions

The primary goal of this paper is to raise awareness about the vulnerabilities of touchscreen phones, which can be exploited by off-the-shelf tools and techniques to devise a side-channel attack methodology that succeeds with high probability. The unique strength of this attack is that, no form of encryption can help to prevent it since the tap sounds are recorded at the time of keying in itself. Extensive experiments show the efficacy of the approach in deciphering text of different lengths as well as in guessing passwords. It works even in the presence of a significant amount of ambient noise since frequency domain filtering can retain only the frequencies corresponding to the tapping sounds and remove the rest of the spectrum.

There are several possible directions for improving the proposed attack methodology. First of all, the signal processing unit can be enhanced. As discussed in Section 3, the signal processing unit consists of three units, the $Noise_Filtering$ module, $Peak_Detection$ module and the $TDoA_Computation$ module. The $Peak_Detection$ module currently converts the time domain signal into its energy spectrum and detects peaks in the accumulated energy. Alternative methods for peak detection can be suitably employed for identifying the tap intervals more accurately. We may even use better TDoA computation techniques [2,16].

The second area of improvement is in word and sentence inferencing. Other NLP techniques [3,23,25] may be explored for enhancing the performance of this step. This includes a better language model and use of a larger corpus for determining accurate n-gram frequencies. In Section 2.3, we have used the unigram, bigram and trigram $pseudo_probability$ as the language model. Some error correction steps may be included at the word level (e.g., by using edit distance) and even at the sentence level to improve the accuracy of the recovered text.

It would be an interesting exercise to try to implement the attack on the mobile device itself instead of sending the sound files over the Internet to another server for further processing. Given the rich processing capabilities of modern smartphone ARM processors and various tool APIs as well as open source DSP libraries available for development, it might be feasible to implement the processing of audio file to extract the TDoA in the smartphone. If such an implementation succeeds, it would make the attack stealthier. This will additionally evade the detection of attack by analyzing the network pattern in correlation with key strokes.

Finally, we plan to experiment further by including data from more number of sensors (both microphones as well as others) and see if fusion of information helps in improving the performance. For instance, there are phones equipped with more than two microphones and data captured by an increased number of microphones is expected to improve the accuracy of tap localization. One possible way to achieve this would be to generate candidate sets for each pair of microphones and take their set-intersection/union as the final candidate set. Moreover, sensors such as gyroscope and accelerometer are present in almost all popular smartphones. For each key tap, the accelerometer and gyroscope values would vary. Although this data is a bit more difficult to make use of as compared to the microphone data, still it is a possibility that is worth exploring to see how the attack performs.

Footnotes

Acknowledgments

Research reported in this publication was supported by the National Science Foundation under awards CNS-1564034 and CNS-1624503. The content is solely the responsibility of the authors and does not necessarily represent the official views of the agencies funding the research.

References

Asonov and

Agrawal, Keyboard acoustic emanations, in: Proceedings of IEEE Symposium on Security and Privacy, 2004, pp. 3–11, ISSN 1081-6011. doi:10.1109/SECPRI.2004.1301311.

Azaria and

Hertz, Time delay estimation by generalized cross correlation methods, IEEE Transactions on Acoustics Speech and Signal Processing 32(2) (1984), 280–285, ISSN 0096-3518. doi:10.1109/TASSP.1984.1164314.

L.R.

Bahl,

Brown,

P.V.

de Souza and

Mercer, A tree-based statistical language model for natural language speech recognition, IEEE Transactions on Acoustics Speech and Signal Processing 37(7) (1989), 1001–1008, ISSN 0096-3518. doi:10.1109/29.32278.

Benesty,

Chen and

Huang, Time-delay estimation via linear interpolation and cross correlation, IEEE Transactions on Speech and Audio Processing 12(5) (2004), 509–519. doi:10.1109/TSA.2004.833008.

Berger,

Wool and

Yeredor, Dictionary attacks using keyboard acoustic emanations, in: 13th ACM Conference on Computer and Communications Security, 2006, pp. 245–254.

V.V.

Bochkarev,

A.V.

Shevlyakova and

V.D.

Solovyev, Average word length dynamics as indicator of cultural changes in society, arXiv:1208.6109, 2012.

P.F.

Brown,

P.V.

Desouza,

R.L.

Mercer,

V.J.D.

Pietra and

J.C.

Lai, Class-based n-gram models of natural language, Computational linguistics 18(4) (1992), 467–479.

Dagon,

Martin and

Starner, Mobile phones as computing devices: The viruses are coming!, Pervasive Computing 3(4) (2004), 11–15, ISSN 1536-1268. doi:10.1109/MPRV.2004.21.

de Souza Faria and

H.Y.

Kim, Identification of pressed keys from mechanical vibrations, IEEE Transactions on Information Forensics and Security 8(7) (2013), 1221–1229, ISSN 1556-6013. doi:10.1109/TIFS.2013.2266775.

10.

de Souza Faria and

H.Y.

Kim, Identification of pressed keys by acoustic transfer function, in: Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on, IEEE, 2015, pp. 240–245.

11.

Gupta,

Sural,

Atluri and

Vaidya, Deciphering text from touchscreen key taps, in: Proceedings, Data and Applications Security and Privacy XXX – 30th Annual IFIP WG 11.3 Conference, DBSec 2016, Trento, Italy, July 18–20, 2016, 2016, pp. 3–18.

12.

Gustafsson and

Gunnarsson, Positioning using time-difference of arrival measurements, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 6, 2003, p. 553.

13.

Halevi and

Saxena, Acoustic eavesdropping attacks on constrained wireless device pairing, IEEE Transactions on Information Forensics and Security 8(3) (2013), 563–577. doi:10.1109/TIFS.2013.2247758.

14.

Hussain,

Al-Haiqi,

Zaidan,

M.M.

Kiah,

N.B.

Anuar and

Abdulnabi, The rise of keyloggers on smartphones: A survey and insight into motion-based tap inference attacks, Pervasive and Mobile Computing 25 (2015), 1–25.

15.

Jurafsky and

J.H.

Martin, Speech and Language Processing – An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2008.

16.

C.H.

Knapp and

G.C.

Carter, The generalized correlation method for estimation of time delay, IEEE Transactions on Acoustics, Speech and Signal Processing 24(4) (1976), 320–327. doi:10.1109/TASSP.1976.1162830.

17.

Liu,

Zhou,

Diao,

Li and

Zhang, When good becomes evil: Keystroke inference with smartwatch, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ACM, 2015, pp. 1273–1285.

18.

Meng,

Isenhower,

Qin and

Nelakuditi, Can smartphone sensors enhance kinect experience?, in: ACM International Symposium on Mobile Ad Hoc Networking and Computing, 2012, pp. 265–266. ISBN 978-1-4503-1281-3. doi:10.1145/2248371.2248417.

19.

Meng,

W.H.

Lee,

S.R.

Murali and

S.P.T.

Krishnan, Charging me and I know your secrets!: Towards juice filming attacks on smartphones, in: Proceedings of the 1st ACM Workshop on Cyber-Physical System Security, 2015, pp. 89–98. ISBN 978-1-4503-3448-8. doi:10.1145/2732198.2732205.

20.

Miluzzo,

Varshavsky,

Balakrishnan and

R.R.

Choudhury, Tapprints: Your finger taps have fingerprints, in: Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services, 2012, pp. 323–336. ISBN 978-1-4503-1301-8. doi:10.1145/2307636.2307666.

21.

Ping,

Sun and

Mao, TextLogger: Inferring longer inputs on touch screen using motion sensors, in: Proceedings of the 8th ACM Conference on Security & Privacy in Wireless and Mobile Networks, ACM, 2015, pp. 1–12.

22.

Rabiner and

R.W.

Schafer, Digital Processing of Speech Signals.

23.

M.D.

Riley, Some applications of tree-based modelling to speech and language, in: Workshop on Speech and Natural Language, 1989, pp. 339–352.

24.

Saxena,

J.-E.

Ekberg,

Kostiainen and

Asokan, Secure device pairing based on a visual channel: Design and usability study, IEEE Transactions on Information Forensics and Security 6(1) (2011), 28–38. doi:10.1109/TIFS.2010.2096217.

25.

Song and

W.B.

Croft, A general language model for information retrieval, in: Proceedings of the Eighth International Conference on Information and Knowledge Management, 1999, pp. 316–321.

26.

Williams, 7 in 10 of world’s population using smartphones by 2020, The Telegraph, UK (2015). http://www.telegraph.co.uk/technology/mobile-phones/11646593/7-in-10-of-worlds-population-using-smartphones-by-2020.html.

27.

Zhang, Microsoft kinect sensor and its effect, IEEE MultiMedia 19(2) (2012), 4–10, ISSN 1070-986X. doi:10.1109/MMUL.2012.24.

28.

Zhu,

Ma,

Zhang and

Liu, Context-free attacks using keyboard acoustic emanations, in: Proceedings of the ACM Conference on Computer and Communications Security, 2014, pp. 453–464.

29.

Zhu,

Bernhard and

Gurevych, A monolingual tree-based translation model for sentence simplification, in: Proceedings of the 23rd International Conference on Computational Linguistics, 2010, pp. 1353–1361.

30.

Zhuang,

Zhou and

J.D.

Tygar, Keyboard acoustic emanations revisited, ACM Transactions on Information and System Security 13(1) (2009), 1–26. doi:10.1145/1609956.1609959.