Abstract
Establishing secure voice, video and text over Internet (VoIP) communications is a crucial task necessary to prevent eavesdropping and man-in-the-middle attacks. The traditional means of secure session establishment (e.g., those relying upon PKI or KDC) require a dedicated infrastructure and may impose unwanted trust onto third-parties. “Crypto Phones” (popular instances such as PGPfone and Zfone), in contrast, provide a purely peer-to-peer user-centric secure mechanism claiming to completely address the problem of wiretapping. The secure association mechanism in Crypto Phones is based on cryptographic protocols employing Short Authenticated Strings (SAS) validated over the voice medium.
The security of Crypto Phones crucially relies on the assumption that the voice channel, over which SAS is validated, provides the properties of integrity and source authentication. In this paper, we challenge this assumption, and report on automated SAS voice imitation man-in-the-middle attacks that can compromise the security of Crypto Phones in both two-party and multi-party settings, even if users pay due diligence and even if automated software (a voice biometrics system) is used to detect voice manipulation.
The first attack, called the short voice reordering attack, builds arbitrary SAS strings in a victim’s voice by reordering previously eavesdropped SAS strings spoken by the victim. The second attack, called the short voice morphing attack, builds arbitrary SAS strings in a victim’s voice from a few previously eavesdropped sentences (less than 3 minutes) spoken by the victim. We design and implement our attacks using off-the-shelf speech recognition/synthesis tools, and comprehensively evaluate them with respect to both manual detection (based on a user study with 30 participants) and automated detection via a speaker verification tool. The results demonstrate the effectiveness of our attacks against three prominent forms of SAS encodings: numbers, PGP word lists and Madlib sentences. These attacks can be used by a wiretapper to compromise the confidentiality and privacy of Crypto Phones voice, video and text communications (plus authenticity in case of text conversations).
Keywords
Introduction
Voice, video and text over IP (VoIP) systems are booming and becoming one of the most popular means of communication over the Internet. Today, VoIP is a prominent communication medium used on a variety of devices including traditional computers, mobile devices and residential phones, enabled by applications and services such as Skype, Hangout, and Vonage, to name a few.
Given the open nature of the Internet architecture, unlike the traditional PSTN (public-switched telephone network), a natural concern with respect to VoIP is the security of underlying communications. This is a serious concern not only in the personal space but also in the industrial space, where a company’s confidential and sensitive information might be at stake. Attackers sniffing VoIP conversations for fun and profit (e.g., to learn credit card numbers, and passwords) as well as wiretapping and surveillance of communications by the government agencies [14,22] are well-recognized threats. Prior research also shows the feasibility of launching VoIP man-in-the-middle (MITM) attacks [61], which can allow for VoIP traffic sniffing, hijacking or tampering.
In light of these threats, establishing secure – authenticated and confidential – VoIP communications becomes a fundamental task necessary to prevent eavesdropping and MITM attacks. To bootstrap end to end secure communication sessions, the end parties need to agree upon shared authenticated cryptographic (session) keys. This key agreement process should itself be secure against an MITM attacker. However, the traditional means of establishing shared keys, such as those relying upon a Public Key Infrastructure (PKI) or Key Distribution Center (KDC), require a dedicated infrastructure and may impose unwanted trust onto third-parties. Such centralized infrastructure and third-party services might be difficult to manage and use in practice, and may themselves get compromised or be under the coercion of law-enforcement, thereby undermining end to end security guarantees.
In this paper, our central focus is on “Crypto Phones” (Cfones), a decentralized approach to secure VoIP communications. Cfones promise to offer a purely peer-to-peer user-centric mechanism for establishing secure VoIP connections. A prominent real-world instance of a Cfone is Zfone [38,60], invented by Phil Zimmermann, now being offered as a commercial product by Silent Circle [42]. A Cfone involves executing a SAS (Short Authenticated Strings) key exchange protocol, such as [52,62], between the end parties. The SAS protocol outputs a short (e.g., 20-bit) string per party; if the MITM adversary attempted to attack the protocol (e.g., inserted its own public key or random nonces), the two strings will not match. These strings are then displayed on the users’ devices, e.g., encoded into numbers or words [60]. The users verbally exchange and compare each other’s SAS values, and accordingly accept or reject the secure association attempt (i.e., detect the presence of an MITM attack). Figure 1 depicts a traditional MITM attack scenario against a Cfone.

A traditional MITM attack scenario for Cfone – attack is detected since SAS values do not match.
The security of Cfones crucially relies on the assumption that the human voice channel, over which the SAS is communicated and validated by Alice and Bob, provides the properties of integrity and source authentication. In other words, it is assumed that the attacker (Mallory) is not able to insert a new desired SAS value in Alice’s and/or Bob’s voice.
In this paper, we systematically investigate the validity of this assumption. Our hypothesis is that, although impersonating someone’s voice in face-to-face arbitrarily long conversations can be significantly challenging, impersonating short voices (saying short SAS) in a remote VoIP setting may not be. Note that SAS needs to be a short string for the human to be able to copy, read and/or compare it. Indeed, we undermine Cfones’ security assumption underlying SAS validation, and report on SAS voice imitation MITM attacks that can compromise the security of Cfones, even if users were asked to pay due diligence (as in the current implementation of Cfones), and even if automated software is incorporated to verify the speaker (a potential replacement for human-based speaker verification). Figure 2 depicts an example scenario for our short voice imitation MITM attacks against Cfone.
Our contributions are four-fold:
Our work shows that voice-based MITM attacks can be highly successful against Cfones whether human or machine speaker verification is used. These attacks may be deployed by a wiretapper to compromise the confidentiality and privacy of Cfones (plus authenticity in case of Cfones text conversations).
This paper is an extension of our ACM CCS 2014 paper [41], in which we investigated the security of Crypto Phones against automated voice imitation attacks with respect to human verifiers (i.e., human-based speaker verification). In this submission, we comprehensively extend this work to analyze the accuracy of machine-based speaker verification (voice biometrics) systems against this class of attacks. Using automated speaker verification systems in the context of Crypto Phones can be a natural near-future deployment scenario. Our work shows that even state-of-the-art machine-based speaker verification systems can fail badly at detecting voice imitation attacks, thereby compromising the security and privacy of Crypto Phones session communications.

Our short voice imitation MITM attack scenario for Cfone – attack succeeds because of voice impersonation.
Communication and threat model
A Cfone SAS protocol between Alice and Bob is based upon the following communication and adversarial model, adopted from [52]. The devices being associated are connected via a remote, point-to-point, high-bandwidth, bidirectional VoIP channel. An MITM adversary Mallory attacking the Cfone SAS protocol is assumed to have full control over this channel, namely, Mallory can eavesdrop on and tamper with the messages transmitted. However, an additional assumption is that Mallory cannot insert voice messages on this channel that mimic Alice’s or Bob’s voice. In other words, the voice channel (over which the SAS values are validated) is assumed to provide integrity and source authentication. The latter assumption is what we analyze and challenge in this paper by performing both manual and automated speaker verification of synthesized/fake voice.
SAS protocols
A number of SAS protocols exist in the literature [11,33,52] that a Cfone implementation may adopt. A SAS protocol is an authenticated key exchange protocol that allows Alice and Bob to agree upon a shared authenticated session key based on SAS validation over an auxiliary channel (such as a voice channel). The protocol results in a short (e.g., 20-bit) string per party: matching strings imply a successful secure association, whereas non-matching strings imply an MITM attack. These protocols limit the attack probability to $2^{-k}$ for a $k$-bit SAS.
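To make the role of the SAS length concrete, the following minimal sketch derives a short string from a hash of a session transcript and computes the chance that a single blind guess matches a uniformly random k-bit value. It is an illustration only: the function names, the use of SHA-256, and the transcript bytes are assumptions, and the actual SAS protocols [11,33,52] rely on commitments rather than a bare hash.

```python
import hashlib

def derive_sas(transcript: bytes, k: int = 20) -> int:
    """Toy derivation (assumed): keep the top k bits of SHA-256(transcript)."""
    digest = hashlib.sha256(transcript).digest()
    return int.from_bytes(digest, "big") >> (256 - k)

def attack_probability(k: int) -> float:
    """Chance that one blind guess equals a uniformly random k-bit SAS."""
    return 2.0 ** -k

# Hypothetical transcripts: both parties saw the same protocol messages.
alice_view = b"pkA||pkB||nonces"
bob_view = b"pkA||pkB||nonces"

# Identical views always yield identical SAS values.
print(derive_sas(alice_view) == derive_sas(bob_view))
# Guessing bound for a few SAS lengths.
for bits in (16, 20, 32):
    print(bits, attack_probability(bits))
```

A longer SAS lowers the guessing bound but is harder for users to read aloud and compare, which is why practical deployments keep it short.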
SAS validation mechanisms
Compare-Confirm and Copy-Confirm are the two popular SAS checksum comparison methods to securely associate two devices A and B, which encode the SAS data into decimal digits [51], PGP words [60] or Madlib phrases (grammatically correct Madlib sentences) [20]. In Compare-Confirm, the SAS checksum is displayed on each party’s screen, the parties verbally exchange their respective checksums, and both accept or reject the connection by comparing the displayed and spoken checksums. In Copy-Confirm, one party reads the encoded checksum to the other party, who types it onto his/her device and gets notified whether the checksum is correct or not. In this work, we study unidirectional Compare-Confirm checksum comparisons, given that this is the most commonly deployed approach on Cfones.
Our attacks & background
We first provide an overview of our Cfone voice imitation attacks. We then discuss why recognition of the identity of a speaker (especially from short speech) can be a complex task for human users. While human limitation in speaker verification may suggest its replacement with machine-based speaker verification, we further analyze the validity of such an automated speaker verification capability, which is introduced in this section.

High-level diagram of the attack.
Our short voice imitation attacks involve the following components (our higher-level attack is depicted in Fig. 3).
Attacking human-based speaker verification
In an MITM attack against the SAS protocol of a Cfone, Mallory can insert herself into a session and gain full access to the data being transferred between Alice and Bob. To do so, Mallory needs to hijack the session and impersonate each party. As discussed in Section 2.1, Cfone’s security assumption is that although Mallory has full control over the communication channel, she cannot insert voice messages that mimic Alice/Bob. Should this assumption be valid, the SAS value that is verbally exchanged on this channel can always authenticate Alice and Bob, foiling the MITM attack. A Cfone MITM attack seems relatively straightforward against data communication (i.e., the non-verbal messages of the SAS protocol) [61]. However, it is assumed that voice is unique to each individual, and therefore that it is impossible to impersonate. This assumption relies on special characteristics of speech that appear to make it difficult to impersonate.
Speech construction is a complex area. In simple terms, speech consists of words, each of which is a combination of speech sound units (phones). However, in reality, human voice is not as simple as this definition. Voice signal created at the vocal folds travels and gets filtered through vocal tract to produce vowels and consonants. Human body structure, vocal folds, articulators and human physiology and the style of speech provide each individual a potentially distinguished voice characteristic. Pitch, timbre and tone of speech are some of the features that may make a voice unique (for further information, we refer the reader to [4]). Therefore, the assumption that voice is unique, just like fingerprint or iris, does have some validity (although how much is a question explored in this paper).
Speech perception and recognition, the tasks that Cfone users have to perform while validating the SAS values, are even more complex than speech construction. There exists considerable literature on how speech is recognized [12,13,40]. Linguistics researchers have conducted various experiments and analyzed the capabilities of human speech recognition over different parameters, such as length of the samples, number of different samples, samples from familiar vs. famous people, and combinations thereof [40]. In an experiment conducted in [31], the participants were asked to identify a voice when the sample string presented to them was “hello”, which resulted in a correct recognition rate of only 31%. However, when a full sentence was presented to the participants, the recognition rate increased to 66%. In the study of [21], a 2.5 minute long passage was presented as a sample to the participants, resulting in an average recognition accuracy of 98%. Many other experiments have been performed over the years evaluating human users’ performance in voice recognition [30]. They show that the shorter the sentence, the more difficult it is to identify the source. Spoofing attacks against human listeners have been studied in the past [55] and were shown to be detected with about 28% accuracy for GMM-based voice conversion using Festvox. However, these studies focus on longer audio samples, whereas our application focuses on short samples (e.g., two words).
Based on this literature survey, it appears that the task of establishing the identity of a speaker may be challenging for human users, especially in the context of short SAS, and serves as a weak-link in the security of the Cfone SAS communication.
Attacking machine-based speaker verification
Machine-based speaker recognition is a biometric technique for identifying a person from their speech. A speaker recognition task can be categorized into speaker verification and speaker identification. Speaker verification is the biometric task of authenticating a claimed identity by analyzing a spoken sample of the claimant’s voice. In this paper, we perform automatic speaker verification in a Crypto Phone application. To recognize a known target speaker, a speaker verification system goes through a prior speaker enrollment phase. In the speaker enrollment phase, the system creates a target model of a speaker from his/her speech samples so that the speaker can be verified later, during the test phase.
A speaker verification system extracts certain spectral or prosodic features from a speech signal to enroll the model of the target speaker. The most common speaker modeling techniques include Template Matching, Nearest Neighbor, Neural Networks, Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs). To improve and optimize speaker model training, session variables and joint factor analysis based score computations have been included in the Gaussian mixture based speaker modeling framework [25,26].
With the emergence of advanced speech synthesis and voice conversion techniques, automatic speaker verification systems may be at risk like never before. In addition to classical spoofing attacks such as impersonation [8,32], spoofing techniques based on voice conversion, synthetic speech and artificial signals can be used to attack an automatic speaker verification system [56,57]. De Leon et al. have studied the vulnerabilities of advanced speaker verification systems to synthetic speech [15,16], and proposed possible defenses against such attacks. In [1], the authors have demonstrated the vulnerabilities of speaker verification systems against artificial signals. The authors of [28,58] have studied the vulnerabilities of text-independent speaker verification systems against voice conversion based on telephonic speech.
Notably, a key difference between our work and previous studies lies in the number/length and type of samples required to build a good voice conversion model. We use a very small number of training samples for voice conversion (e.g., 50–100 sentences of length 5 seconds each), collected using non-professional recording devices such as laptops and smartphones. That such short audio samples suffice to reproduce a victim’s voice sets a fundamental premise of how easily a person’s voice can be attacked or misused. While the prior papers do not seem to clearly specify the sizes of their voice conversion training data sets, they employ spectral conversion approaches which typically require a large amount of high-quality training data [24,48]. Another prominent difference from prior work [35] is that we study the ability of automated speaker verification systems to verify the speaker of the short speech that makes up SAS verbal exchanges. Because this speech is short, verifying the speaker may be more challenging than with the long arbitrary speech considered in prior research [35].
While voice biometrics is a rich area, achieving robust detection rates (i.e., low false negatives and low false positives) is still a challenging problem. To prevent voice spoofing attacks, a number of countermeasures have been proposed [18,56]. However, existing voice biometrics systems may not work well to thwart active voice impersonation and synthesis attacks [56]. Furthermore, given that the SAS challenge being authenticated in the Cfones application is short (only a few seconds’ worth of audio), it may not provide sufficient “knowledge” for the biometric system to extract the voice features on which detection relies.
We pursue a detailed analysis of the vulnerabilities of a speaker verification system against voice conversion. In this setting, the attack components are similar to human-based SAS verification (Section 3.1). The only difference is that we replace the human’s speaker verification role with a machine-based speaker verification system.
Attack design, implementation and performance evaluation
Attack design
An alternative is to use voice converters, which convert a source voice to a target voice by mapping features between the two voices. The voice conversion framework we used in our morphing attack is CMU’s Festvox [7] voice transformation. Festvox is trained on only a few sentences (less than three minutes) spoken in both the source voice (the attacker’s or a default Text-to-Speech, TTS, voice) and the target voice (the victim’s). Therefore, it requires much less effort than other synthesizers. Once trained, a synthesized voice is built based on the target system. It can either act as a TTS tool in the victim’s voice or be used to convert any utterance spoken by Mallory to the same utterance in the victim’s voice. Rather than converting properties of the voice, Festvox predicts the positions of the articulators from the speech signal, maps them between the speakers, and creates speech in the target voice.
We trained the system with less than 50 sentences from the victim and an attacker voice, and converted all the possible SAS atomic units from the attacker voice to the victim voice. Using this system, we built an offline dictionary of all possible SAS combinations even though the SAS atomic units have not been spoken by the victim before. The offline dictionary can be queried at the MITM attack time to insert new forged SAS.
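The offline dictionary can be pictured as a lookup table from every ordered combination of SAS atomic units to the concatenation of the corresponding per-unit clips. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, and each clip is represented by a short list of integers standing in for audio samples rather than a real waveform.

```python
from itertools import product

def build_sas_dictionary(unit_clips: dict[str, list[int]],
                         units_per_sas: int) -> dict[tuple[str, ...], list[int]]:
    """Map every ordered combination of atomic units to its concatenated clip."""
    table = {}
    for combo in product(sorted(unit_clips), repeat=units_per_sas):
        samples: list[int] = []
        for unit in combo:
            samples.extend(unit_clips[unit])  # append the clip for this unit
        table[combo] = samples
    return table

# Placeholder "clips" for three atomic units (stand-ins, not real audio).
clips = {"0": [0, 0], "1": [1, 1], "2": [2, 2]}
dictionary = build_sas_dictionary(clips, units_per_sas=2)

print(len(dictionary))          # number of combinations: 3 ** 2
print(dictionary[("2", "0")])   # clip for unit "2" followed by clip for unit "0"
```

The table grows exponentially with the number of units per SAS, so for longer encodings it may be preferable to concatenate the per-unit clips on demand instead of precomputing every combination.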
Attack implementation – Putting the pieces together
To evaluate the feasibility of our attack, we set up an RTP communication channel between Alice and Bob with Mallory acting as the router in the middle. We used the JMF framework to send audio captured from the built-in microphone of Alice’s computer to Mallory. Mallory receives the RTP stream and stores it in WAV-format audio files. The duration of each audio file is set to 3 seconds.
In the attack, Mallory’s initial goal is to search for the presence of a SAS in the regular conversation between Alice and Bob. To this end, after receiving the first audio file, our application on Mallory’s node calls CMU Sphinx keyword spotter to look up possible SAS in the captured audio files. We evaluated the performance of the attacking application with two different grammars. The first grammar looks up all possible SAS combinations. For example, a two word PGP word list SAS could be “dashboard liberty” which is included in our keyword spotter grammar. The second grammar just looks up some possible phrases that Alice and Bob might say just before confirming the SAS such as “SAS shown on my side is …”, or “My SAS is …”. The second grammar makes the keyword spotting faster but it is not completely predictable, as there are many ways that users can confirm their SASs.
We assume that SAS atomic units (e.g., digits or words) have already been forged and are stored on Mallory’s system for further use, following the morphing or reordering attack discussed earlier. The keyword spotter works in parallel with the RTP receiver; audio files are stored and processed in First-In-First-Out (FIFO) order. Files containing a SAS are replaced with a recording of the same size (bit-wise) carrying the SAS desired by the MITM, and files not containing a SAS are simply relayed to Bob.
Delay of the attack
The voice MITM attack naturally introduces a delay associated with the MITM attack on the non-voice, non-SAS channel communication, and with the voice impersonation on the SAS channel communication. Prior work [61] shows that the MITM attack on the non-voice channel can be performed efficiently, and therefore we focus on the delay related to the SAS voice impersonation. The dominating delay in voice impersonation is likely due to the keyword spotting procedure. Using simpler grammars (i.e., the SAS confirmation phrases) can speed up keyword spotting. In offline keyword spotting (such as the one that we used), the duration of each stored audio file can affect the performance, since we are running the RTP receiver and the keyword spotter in parallel. If the keyword spotter processes each stored audio file in less time than the file’s duration, it keeps pace with the receiver and introduces no accumulating delay. Moreover, real-time keyword spotters such as [19] might further improve the performance of the attack in such cases. Based on our preliminary analysis, the delay introduced by the attack would be negligible.
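The interaction between file duration and keyword spotting time can be illustrated with a small queueing sketch. It assumes a simplified model in which a file becomes available once it is fully recorded and a single spotter processes files one at a time in FIFO order. The function name and the processing times are hypothetical; only the 3-second file duration comes from our setup.

```python
def spotter_lags(chunk_duration: float, processing_time: float,
                 n_chunks: int) -> list[float]:
    """Lag between each file becoming available and the spotter finishing it."""
    lags = []
    free_at = 0.0  # time at which the spotter is next idle
    for i in range(n_chunks):
        available = (i + 1) * chunk_duration  # file i is complete at this time
        start = max(available, free_at)       # wait for the file and the spotter
        free_at = start + processing_time
        lags.append(free_at - available)
    return lags

# Spotter faster than real time: the lag stays constant.
print(spotter_lags(3.0, 1.0, 4))
# Spotter slower than real time: the lag grows with every file.
print(spotter_lags(3.0, 4.0, 4))
```

Under this model, the lag stays bounded by the processing time whenever the spotter handles a file faster than the file's duration, and it grows without bound otherwise.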
Evaluating the attack against humans
In order to measure the effectiveness of our attacks against human-based speaker verification, we conducted a user study. In this section, we present the evaluation’s setup and its results.
Setup
We conducted a survey, approved by our University’s IRB, and requested 30 participants to answer several multiple-choice and open-ended questions about the quality and (speaker) identity of certain recordings. We believe this number of participants is reasonable given that our hypothesis is a negative one (users cannot perfectly verify the benign and attacked voice samples). This is in contrast to a positive hypothesis (to indicate the security and usability of the system), for which a larger number of participants may be needed. However, a larger number of participants in our study could have supported a more conclusive statistical analysis. There were two categories of questions: one related to the quality of the forged SAS (9 questions) and another related to speaker identification (9 questions).
We collected different types of SAS recordings, including four 16-bit numerical SAS, eight PGP word lists and four Madlibs. We also presented four longer SASs, including 32-bit PGP word lists and 32-bit Madlibs. Generally, a 32-bit numeric SAS is not secure against the reordering attack (since a single transmission of the SAS may already contain all 10 distinct digits). Therefore, 32-bit numeric SAS was not included in the questions.
Results
Demographic Info: User Study
Mean (Std. Dev) ratings for original and attacked SAS
We can see that the difference between the ratings for morphed SAS and original SAS is generally more than the difference between the ratings for reordered SAS and original SAS (only exception is 32-bit Madlibs). This suggests that reordering attack might generally be harder to detect for the participants than morphing attack. Note that for PGP words, participants rated the reordered SAS higher than the original recording. This implies that if the attacker collects enough data to perform the reordering attack on PGP words, the quality of the forged SAS may even be perceived better than the original one. However, the same was not true for Madlibs. Madlibs have a correct grammatical structure and therefore people usually read them following a sentence flow, which may make it difficult for the attacker to split and remix.
Results of the human-based evaluation for different attacks and SAS types. Row 1, column 1 shows FNR related to benign samples (instances that are not successful in detecting a familiar voice). Row 1, column 2–4 show FPR indicating effectiveness of each attack (naive different voice attack; reordering and morphing attacks). Higher FPR shows more powerful attack
We would like to determine how well users perform at the task of speaker verification when listening to benign and attacked audio samples. We are interested in finding out how often users cannot correctly verify the voice of an original speaker reciting a short SAS. We use the False Negative Rate (FNR) metric, which represents the probability of rejecting such benign instances (not recognizing a legitimate familiar voice). The lower the FNR, the better the accuracy of the users in recognizing an original voice; higher values show that the Cfone system does not work well in the benign, non-attack setting. We are also interested in determining how often users accept a different speaker’s voice, a morphed voice, or a reordered voice. We use the False Positive Rate (FPR) metric, which denotes the probability of accepting such attacked instances (considering an attacked voice to be a familiar voice). False acceptance implies the success of the attacker and a compromise of the security of Cfone session communications. Higher values indicate that the attack is working and participants are not able to detect it.
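The two metrics can be computed directly from per-sample accept/reject decisions. The sketch below is a minimal illustration. The function names are our own, and the response lists are made-up placeholders, not data from the user study.

```python
def fnr(benign_accepted: list[bool]) -> float:
    """Fraction of benign (original-speaker) samples that were rejected."""
    return sum(not accepted for accepted in benign_accepted) / len(benign_accepted)

def fpr(attack_accepted: list[bool]) -> float:
    """Fraction of attacked samples that were accepted as the familiar voice."""
    return sum(attack_accepted) / len(attack_accepted)

# Placeholder decisions (True = listener accepted the voice as familiar).
benign_responses = [True, False, True, False]
attacked_responses = [True, True, False, False, True]

print(fnr(benign_responses))    # 2 of 4 benign samples rejected
print(fpr(attacked_responses))  # 3 of 5 attacked samples accepted
```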
Table 3 depicts our evaluation metrics corresponding to a SAS spoken by a “different” person (second column, representing the naive attack), a SAS generated by converting the attacker’s voice to the victim’s voice (third column), and a SAS spoken by the same person but reordered (fourth column). The results are shown for the different types of SAS. Also shown is the overall aggregated result among all three types of SAS (the last matrix).
First of all, the table illustrates a relatively low FPR for SASs played in a “different voice” (row 2, column 2), which means when a different voice is presented, people successfully detect the difference with a high chance (about 80%). This demonstrates that if the (naive) attacker just inserts a different voice, it would be detected by the users with a high probability. This provides an important quantitative benchmark to compare the performance of attacks with.
The effectiveness of our voice imitation (morphing and reordering) attacks is represented by the FPR (first-row results for columns 3 and 4). Although the FPR is not very high (around 50–60%), it is important to look at the corresponding FNR of the Cfone system, which shows that people are not accurate in recognizing a familiar voice saying a SAS even in benign scenarios: almost 50% of the time, participants judge an original voice in a different noise profile to be a fake voice. That is, even in the non-attack scenario, participants make almost random guesses when deciding whether the voice is real or fake. Thus, we can conclude that, under our attacks, users do not perform any better than a random guess in recognizing a forged SAS, and in fact the result is very similar to recognizing an original voice in a different noise profile. In short, people are about as successful at recognizing a forged SAS as they are at recognizing an original SAS in a different noise profile.
Similar to the quality test, the speaker identification test shows that reordering attack generally works better (e.g., has higher FPR) than the morphing attack. The performance metrics, however, do not indicate any significant differences in the way users may detect the attacks against different SAS types (numeric, PGP or Madlibs). They all seem almost equally prone to our attacks. In Section 3, we referred to the linguistic studies that demonstrate people are more successful in recognizing familiar voices when they are presented with long sentences rather than short sentences. Our experiment for short SAS confirms this insight.
In this section, we investigate to what extent a converted or morphed SAS (short authenticated string) spoken by an MITM attacker can be detected by an automatic speaker verification system.
Setup
The input to this system, a set of clips spoken by a number of speakers, is split into 3 sets namely: training set, development set (Dev set) and evaluation set (Eval set). The training set is used for background modeling. The development and evaluation sets are further divided into two subsets, namely, Enroll set (Dev.Enroll, Eval.Enroll) and Test set (Dev.Test, Eval.Test). Speaker modeling can be done using numerous modeling techniques, including, Universal Background Modeling in Gaussian Mixture Model (UBM-GMM) [39] and Inter-Session Variability (ISV) [53].
UBM-GMM is a modeling technique that uses spectral features and computes a log-likelihood under Gaussian Mixture Models for background modeling and speaker verification [10,46]. ISV is an improvement over UBM-GMM in which a speaker’s variability due to age, surroundings, emotional state, etc., is compensated for, giving better performance for the same user in different scenarios [43,53].
After the modeling phase, the system is tuned and tested using the Dev.Test and Eval.Test sets from the development and evaluation sets, respectively. All the audio files in the Dev.Test and Eval.Test sets are compared with each of the speaker models for the development and evaluation sets, respectively, and each file is given a similarity score with respect to each speaker in the corresponding set. The scores of the Dev.Test files are used to set a threshold value. The scores of the Eval.Test set are then normalized and compared with this threshold, and each file is assigned to a speaker model accordingly. If the audio file actually belongs to the speaker to whom it was assigned, the verification is successful; otherwise, it is not.
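The tune-then-test procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold is chosen on development scores by an equal-error-rate style criterion (one common choice, not necessarily the one used by the verification tool), higher scores mean greater similarity to the claimed speaker, and all scores below are made-up placeholders.

```python
def error_rates(genuine: list[float], impostor: list[float],
                threshold: float) -> tuple[float, float]:
    """Return (false acceptance rate, false rejection rate) at a threshold."""
    frr = sum(s < threshold for s in genuine) / len(genuine)
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return far, frr

def pick_threshold(dev_genuine: list[float], dev_impostor: list[float]) -> float:
    """Choose the candidate score where FAR and FRR are closest on the dev set."""
    candidates = sorted(set(dev_genuine) | set(dev_impostor))

    def gap(t: float) -> float:
        far, frr = error_rates(dev_genuine, dev_impostor, t)
        return abs(far - frr)

    return min(candidates, key=gap)

# Placeholder similarity scores (not experimental data).
dev_genuine = [2.0, 1.5, 0.5, 1.8]
dev_impostor = [-1.0, 0.2, 0.7, -0.5]
threshold = pick_threshold(dev_genuine, dev_impostor)

eval_genuine = [1.2, 0.6, 1.9]        # original speaker
eval_attack = [0.9, 0.1, 1.1, 0.8]    # converted (forged) voice
far, frr = error_rates(eval_genuine, eval_attack, threshold)
print(threshold, far, frr)
```

Fixing the threshold on the development set before scoring the evaluation set keeps the reported acceptance and rejection rates from being tuned on the same files they are measured on.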
Results of the SAS-based voice conversion attack against automated speaker verification system. Original Speaker column shows the rejection rate (in percentage) in the benign setting and the attack columns show the acceptance rate (in percentage) in the attack setting
Table 4 summarizes the results under the benign (original speaker) scenario and the conversion attack. We elaborate on and interpret the results more below.
Automated quality test design and results
In speech and speaker recognition systems, it is common to extract a multi-dimensional vector of components of the underlying audio to identify the linguistic features of the signal. We used Mel-Cepstral Distortion (MCD) to measure the similarity between a forged (converted) SAS and an original SAS by calculating the Euclidean distance between feature vector of the forged SAS and that of the original SAS. A similar strategy has been used in several speech conversion and synthesis systems [17,29] to measure the distance between a synthesized and an original version of the same utterance. If the difference between feature vector of the original SAS and the forged SAS is minimized, the forged voice is close to the original voice and detecting the attack would be inherently difficult. Lower MCD shows better conversion (a forged SAS is so similar to the original one that it is not easy to distinguish the two).
To compute MCD, we extract the features of the forged SAS (fSAS) produced by the attack engine (morpher) and the features of the original SAS (oSAS) spoken by the victim, and calculate the difference between the two. MCD computation is defined in Equation (1), where
Weighted average of the magnitudes of cepstral peaks.
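As an illustration of the distance computation described above, the following sketch uses a widely used form of MCD: a scaled Euclidean distance between two mel-cepstral vectors, expressed in dB, and averaged over frames. Whether Equation (1) takes exactly this form is an assumption. The sketch also assumes that the frames of the original and forged SAS have already been extracted and time-aligned (for example, by dynamic time warping); the function names are hypothetical.

```python
import math


def mcd_frame(orig_coeffs, forged_coeffs):
    """MCD (in dB) between two mel-cepstral vectors of the same frame:
    (10 / ln 10) * sqrt(2 * sum of squared coefficient differences)."""
    if len(orig_coeffs) != len(forged_coeffs):
        raise ValueError("feature vectors must have the same dimension")
    squared = sum((o - f) ** 2 for o, f in zip(orig_coeffs, forged_coeffs))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * squared)


def mcd(orig_frames, forged_frames):
    """Average frame-level MCD over two time-aligned frame sequences."""
    if not orig_frames or len(orig_frames) != len(forged_frames):
        raise ValueError("sequences must be non-empty and time-aligned")
    total = sum(mcd_frame(o, f) for o, f in zip(orig_frames, forged_frames))
    return total / len(orig_frames)


# Hypothetical two-frame, three-coefficient features for oSAS and fSAS.
o_sas = [[1.0, 0.5, 0.2], [0.9, 0.4, 0.1]]
f_sas = [[1.1, 0.4, 0.2], [0.9, 0.4, 0.3]]
distance = mcd(o_sas, f_sas)
```

Identical feature sequences give an MCD of zero, and the value grows as the forged features move away from the original ones, which matches the interpretation that a lower MCD indicates a better conversion.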
In our automated quality test, we trained the voice conversion engine to convert between pairs of 8 different male and female speakers from our voice dataset (Section 4), representing victims and attackers. In total, 20 different conversions were built. To first test the effect of the training set size on the conversion performance, we trained the system with 50, 100, and 200 sentences. We observed an average MCD improvement of only 1.6% when increasing the size of the training set from 50 to 100 sentences, and of only 1.1% when increasing it from 100 to 200 sentences. This means that increasing the training set size beyond 50 sentences does not significantly improve the performance of the conversion. As a result, in the rest of the experiments, we use the first 50 sentences of the Arctic dataset in the training phase. The average duration of each utterance of the training set is 5 seconds, and the average duration of the whole set of 50 sentences is 2 minutes and 30 seconds. This means that in order to train a system to speak in the victim’s voice, we are required to collect less than 3 minutes of her voice. This is quite short, and therefore training does not seem to pose a challenge for the attacker in the conversion process.
Table 5 presents the results of our objective evaluation. We present the results of only 4 conversions, covering both same-gender and different-gender speaker pairs. The other 16 conversions yielded similar results and are not reported here due to space constraints.
Quality test evaluation results for the morphing attack
To obtain a measure of how good the conversion process is, we first performed a conversion between utterances of a single speaker (the victim) spoken and recorded in two different noise profiles. We call this the “Single Speaker” conversion. Intuitively, such a conversion should yield the best achievable conversion result. Row 1 of Table 5 shows the MCD between the two sets of 50 recordings (in different noise profiles) of the victim before the conversion, averaged across all recordings, which can be used as a reference for what MCD values are acceptable. Row 2 of the table shows the result of the conversion. The Single Speaker conversion gives us a baseline MCD against which to measure the quality of the other conversions, from an attacker’s voice to the voice of the same victim.
Row 3 shows the MCD between an utterance in the training dataset spoken by the source and the same utterance spoken by the target before the conversion, averaged across all utterances. This parameter characterizes the actual similarity or dissimilarity between the attacker’s and the victim’s voices before conversion. Recall that we used 50 utterances spoken by the source and the target to train the system to convert from the attacker to the victim. We refer to this conversion as the “Source to Target” conversion. Row 4 shows the result of this conversion. By comparing rows 3 and 4, it can be seen that after conversion, the distance between the converted voice and the target voice becomes smaller than the initial distance between the source and the target. This demonstrates the effectiveness of the converter.
Comparing the results of the Single Speaker and Source to Target conversions (rows 2 and 4), we can observe that the MCD between the converted voice and the original voice is higher in the Source to Target conversion (which is the real attack scenario) than in the Single Speaker conversion (the optimum conversion result). This is intuitive. However, by comparing rows 1 and 4, it is interesting to note that the MCD values after conversion are slightly lower than the MCD values of the single speaker before conversion, which suggests that the Source to Target conversion produces a voice that is comparable to the voice of the victim in a different noise profile.
Subsequently, we tested the performance of the converter for the purpose of our attack (i.e., SAS morphing). Here, we used the trained system (described in the above two paragraphs) to convert 60 utterances from our potential attackers to victims, representing 20 each of short, medium and long SASs with sizes of 20 bits, 80 bits and 128 bits, respectively. The average duration of speaking a short, medium and long SAS is approximately 1.2, 2.1 and 4.4 seconds, respectively. Row 6 of Table 5 shows the distance between the converted SAS (resulting from the Single Speaker converter) and the original SAS (spoken by the victim). The last rows show the average distance between the forged SAS (resulting from the Source to Target converter) and the original SAS (spoken by the victim).
For all pairs of speakers, we see a clear pattern of increasing MCD with increasing SAS size. This suggests that the quality of SAS conversion degrades as the SAS size increases, which may make longer SASs more detectable than shorter ones. Comparing rows 6 and 7, we see that the quality of SAS conversion degrades only slightly when the source and the target are different speakers (similar to the case of non-SAS samples discussed above).
Unlike the morphing attack, which maps between the features of the attacker and those of the victim, reordering does not change the filtering characteristics of the speaker’s vocal cords. As the name suggests, reordering simply rearranges the order of words or digits. Therefore, unlike in the morphing attack, no new voice is generated in reordering, and the attacked voice in fact has the same features as the victim’s voice [9,50]. Hence, we did not conduct the automated quality test on the reordered SASs.
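The reordering step itself can be pictured with the following minimal sketch. It assumes that the eavesdropped recordings have already been segmented into per-word audio segments (for example, with a keyword spotting tool), and it represents each segment as a plain list of samples. The function name and the sample values are hypothetical, and issues such as smoothing the boundaries between segments are ignored.

```python
def build_reordered_sas(target_words, segments):
    """Concatenate previously recorded word segments in a new order.

    target_words: the SAS to forge, as a list of words or digits.
    segments: mapping from word to a list of audio samples of the victim
              speaking that word (assumed to be segmented beforehand).
    """
    missing = [word for word in target_words if word not in segments]
    if missing:
        raise KeyError(f"no eavesdropped recording for: {missing}")
    forged = []
    for word in target_words:
        # The victim's own samples are reused unchanged, so no new voice
        # is synthesized; only the order of the segments changes.
        forged.extend(segments[word])
    return forged


# Hypothetical segments cut from an eavesdropped SAS "1 2 3".
victim_segments = {"1": [10, 11], "2": [20, 21], "3": [30]}
forged_sas = build_reordered_sas(["3", "1", "3"], victim_segments)
```

Because every sample in the forged SAS originates from the victim, a feature-based distance measure computed per word would not be expected to separate it from genuine speech, which is consistent with omitting the quality test for this attack.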
The attack against machine-based speaker verification shows that although the machine is successful in detecting a different voice with an accuracy of around 99%, it cannot distinguish between a morphed voice and an original speaker’s voice (the average rate of accepting a morphed voice is around 99%). Our automated quality test shows that the distortion between the original SAS and the morphed SAS increases with the size of the SAS. This supports our hypothesis that short voice impersonation is easier for the attacker (harder for the users to detect). We also observed that when the attacker’s voice is similar to the victim’s voice, the result of the conversion is better.
As far as building training sets for the reordering attack is concerned, the difficulty depends on the underlying SAS encoding type. While eavesdropping all (10) digits for a numeric SAS is relatively easy (e.g., by waiting for the victim to speak phone numbers, zip codes, and other numeric utterances), learning all PGP words or Madlib words might be challenging, given that these words may not be commonly spoken in day-to-day conversations. However, it is possible for the attacker to use social engineering techniques to address this challenge. A number of possibilities exist to this end. For example, the attacker can create crowdsourcing tasks on online websites (e.g., freelancer or Amazon Mechanical Turk) that ask users to record themselves reading passages which contain all PGP words or Madlib phrases. Similarly, the attacker can create audio CAPTCHAs, and use them on its own websites or on other compromised sites, that challenge users to read aloud words from books (similar in spirit to the idea of reCAPTCHAs).
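To make the notion of a complete training set concrete, the sketch below checks which dictionary entries have not yet been heard in a set of eavesdropped utterances. It assumes that transcripts of the utterances are available (for example, from a speech recognizer) and that dictionary entries are single words; the function name and the example transcripts are hypothetical.

```python
def missing_units(dictionary, transcripts):
    """Return the dictionary words not yet spoken in any transcript."""
    heard = set()
    for transcript in transcripts:
        heard.update(transcript.lower().split())
    return sorted(set(dictionary) - heard)


# Numeric SAS: the dictionary is just the ten spoken digits.
digits = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Hypothetical eavesdropped utterances (a zip code and a phone number).
eavesdropped = [
    "my zip code is three five two nine four",
    "call me at five five five zero one one two",
]
still_needed = missing_units(digits, eavesdropped)
```

For a ten-digit dictionary, a few numeric utterances already cover most entries, whereas for a word-based encoding the same check would typically leave many entries uncovered, which is where the social engineering techniques mentioned above come in.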
Moving forward, we believe that our work also raises a broader and more general threat to “voice privacy.” Malicious actors may use various approaches to record someone’s voice samples and use these samples to compromise security and privacy in another application (such as Cfones or voice recognition systems). While people seem quite concerned about their “visual privacy” in today’s digital world (e.g., someone taking their picture), they may not consider their voice to be so sensitive (e.g., people often talk out loud in a restaurant and even talk to strangers). Given that audio sensors are very common and do not require explicit effort from an attacker to record audio (unlike a camera, for example), we believe that voice privacy can have several implications that may need careful attention.
Yet another potential solution to thwart the voice impersonation attacks against Cfones is to perform the SAS validation over an auxiliary channel that is more resistant to voice and packet manipulation. PSTN communication is believed to offer such properties and, when available, may be used to secure VoIP communication. For example, if the communicating devices support both VoIP capability (Internet connection) and PSTN connectivity (e.g., cellular connection), the non-SAS communication can take place over the former and the SAS validation can take place over the latter. This mechanism is suitable for mobile phones: the Cfone app switches to a PSTN call when the SAS comparison is performed by the user (Android, for example, allows making VoIP and cellular calls simultaneously). A limitation of this defense mechanism is that it is only applicable to devices which have PSTN capability (such as cell phones).
An independent defense could be to increase the dictionary size in order to make reordering difficult and to reduce the efficiency of automatic keyword spotting. Moreover, if the dictionary is not fixed, reordering will be impossible. An idea suggested in [2] is to choose words from a large dynamic space (e.g., the front pages of today’s newspapers). The dictionary can be chosen by the users, or programmatically during key exchange. However, the security and user experience of this approach need further investigation.
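The effect of the dictionary size can be illustrated with simple arithmetic. The sketch below computes how many words are needed to encode a SAS of a given bit length over a dictionary of a given size. The point of the comparison is that the attacker must have heard the victim speak potentially every word in the dictionary, while each SAS uses only a handful of them. The function name is hypothetical, and the 65536-word dictionary is an illustrative value rather than a proposal from the text.

```python
def words_per_sas(sas_bits, dictionary_size):
    """Smallest number of words n such that dictionary_size ** n
    covers all 2 ** sas_bits possible SAS values."""
    if dictionary_size < 2:
        raise ValueError("dictionary must contain at least two words")
    words, capacity = 0, 1
    while capacity < 2 ** sas_bits:
        capacity *= dictionary_size
        words += 1
    return words


# 20-bit SAS (the short SAS size used in the evaluation).
with_digits = words_per_sas(20, 10)       # ten spoken digits
with_256 = words_per_sas(20, 256)         # a 256-word list
with_large = words_per_sas(20, 65536)     # illustrative large dictionary
```

Under these assumptions, moving from 256 to 65536 words shortens a 20-bit SAS from three words to two, but multiplies by 256 the number of distinct words the attacker would need to have recorded in the victim’s voice.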
Another possibility is to employ the approach proposed in [3], which identifies and characterizes the network route traversed by the voice signal and creates a detailed fingerprint for the call source. For VoIP connections, this method is based on network characteristics, and therefore may only be effective if the attacker and the victim reside in different networks.
Conclusions
Crypto Phones aim to solve an important problem of establishing end-to-end secure communications on the Internet via a purely peer-to-peer mechanism. However, their security relies on the assumption that the voice channel, over which short checksums are validated by the users, provides integrity and authenticity. We challenged this assumption, and developed two forms of short voice impersonation attacks, reordering and morphing, that can compromise the security of Crypto Phones. Our evaluation demonstrates the effectiveness of these attacks under both human-based speaker verification (the current deployment scenario) and machine-based speaker verification (a future deployment scenario), when contrasted with a trivial attack in which the attacker impersonates the victim using a totally different speaker’s voice. We suggested potential ways to improve the security of Crypto Phones against voice MITM attacks and discussed the associated challenges. A comprehensive future investigation is needed to develop a viable mechanism to thwart such attacks. Currently, it seems that humans perform better than machines in detecting the attacks presented. It is possible that the machine capability to detect the attacks will improve with advances in speech processing, but it is unlikely that the human capability will improve.
