Statistical properties of sequences produced by pseudorandom number generators used in well-known stream ciphers

Abstract

This article statistically analyzes the PRNGs of several well-known high-speed stream ciphers. The study focuses on frequency detection, uniformity of distribution, and randomness in the generated sequences. The purpose of this work is to show whether these PRNGs leave a signature in the sequences they produce. In addition, the work compares these PRNGs to indicate which is the safest against statistical cryptanalysis.

Keywords

Stream cipher, statistical cryptanalysis, pseudorandom number generator, ciphertext signature, guessing attack

1. Introduction

The symmetric ciphers most commonly used today are block ciphers, which use iterations of a deterministic algorithm operating on a fixed-length group of plaintext bits, known as a block, at a time. Each iteration is called a round and uses a different subkey derived from the primary encryption key. Numerous operating modes have been developed for block ciphers to provide authenticity and confidentiality, and some modes also provide padding for the plaintext block. Padding simply adds bits to the plaintext when it is shorter than the block size. It should be noted that block ciphers have also been used in pseudorandom number generators (PRNGs) and universal hash functions. Today’s best-known block ciphers are gathered in [1].

A stream cipher, on the other hand, takes plaintext characters one at a time (1 bit or 1 byte) and XORs them with pseudorandom bits to produce the output. This unbounded sequence of pseudorandom bits acts as the key and is known as the keystream. The keystream is normally generated by a PRNG from the initial encryption key, called the seed. To remain secure, the PRNG of a stream cipher must be unpredictable. In addition, a stream cipher must never use the same keystream twice; otherwise the cipher may be broken. Stream ciphers were designed to approximate an ideal cipher, the One-Time Pad.

The One-Time Pad, which uses a truly random key at least as long as the plaintext, can achieve perfect secrecy, that is, complete safety against brute-force attacks. Nevertheless, such a cipher is too impractical to use: to encrypt and send a one-minute full HD video file, one would need a key of at least 144 megabytes (a calculated size), that is to say, a key about 1.125 gigabits long.

Figure 1.

General model of a synchronous stream cipher.

Since a key longer than the plaintext is impractical, stream ciphers fall short of perfect secrecy. However, changing the key for each use makes them difficult to break. In fact, if the keystream is at least as long as the plaintext being encrypted, i.e. the keystream is used only once before the seed is changed, then a block cipher in counter mode cannot be any more secure than the stream cipher. The vulnerabilities of the two are then the same, according to Albert Manfredi, Principal Engineer at Boeing Defense Systems.

Statistical cryptanalysis is an important tool in any cryptanalysis study. Even if a stream or block cipher is protected against today’s cryptanalytic attacks [2], the enormous amount of encrypted data can seriously endanger any cipher that exhibits a statistical bias.

Stream ciphers are widely used in today’s communications. Each human language carries a signature, namely the number of occurrences of each letter and of each word used in the language. This signature can be computed easily thanks to quantization [3], i.e. the binary vector representing the analog signal of a human voice is finite. Hence the great importance of statistical analysis for stream ciphers.

Consequently, this paper investigates the behavior of several stream ciphers under statistical cryptanalysis, namely frequency analysis, the goodness-of-fit test, the serial test, the runs test, the poker test and the autocorrelation test. The behavior of each stream cipher is directly determined by its pseudorandom number generator, which can easily be studied by encrypting a constant plaintext. Hence, studying a stream cipher is equivalent to studying its pseudorandom generator. Moreover, this work provides a statistical comparison between stream ciphers, based on experiments and a discussion of the results, in order to deduce the presence or absence of weaknesses and the level of safety obtainable with each cipher.

2. An Overview of symmetric ciphers

Stream ciphers generate successive characters of the keystream based on their internal state. There are two types of stream ciphers.

– Synchronous stream cipher [4]: The state is updated independently of the plaintext and ciphertext. The keystream bits are then combined with the plaintext for encryption or with the ciphertext for decryption, which implies that the encryption and decryption machines (Bob and Alice, respectively) must use the same information, hence the need for synchronization.

If synchronization is lost during the process, several approaches can be applied to resynchronize the two machines, among them systematically trying various offsets until synchronization is achieved, or tagging the ciphertext with markers at set points.

In this scheme, if a bit is corrupted in the data stream transmitted between Bob and Alice, then only a single bit is corrupted in the recovered plaintext and the error does not affect the rest of the data stream. Consequently, this mode is very useful when the communication channel has a high error rate. However, the scheme can be very susceptible to active attacks if a malicious attacker has access to the stream.

Figure 2.

General model of a self-synchronizing stream cipher.

The encryption process of a synchronous stream cipher can be described by Eq. (1):

$\displaystyle\begin{cases}x_{i+1}=f(x_{i},k)\\ y_{i}=g(x_{i},k)\\ c_{i}=h(y_{i},p_{i})\end{cases}$ (1)

where $x_{0}$ is the initial state and may be determined from the key $k$ , $f$ is the next-state function, $g$ is the function which produces the keystream $y_{i}$ , and $h$ is the output function that combines the keystream and plaintext $p_{i}$ to produce ciphertext $c_{i}$ . The encryption and decryption processes are depicted in Fig. 1.
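The three functions of Eq. (1) can be sketched in a few lines of Python. This is a toy illustration only: the hash-based choices for $f$ and $g$, and the names `next_state`, `keystream_byte` and `sync_stream_xor`, are assumptions made for the example and do not correspond to any cipher studied in this paper; $h$ is taken to be XOR.

```python
import hashlib

def next_state(x: bytes, k: bytes) -> bytes:
    """f: advance the internal state; it never sees plaintext or ciphertext."""
    return hashlib.sha256(b"state" + k + x).digest()

def keystream_byte(x: bytes, k: bytes) -> int:
    """g: derive one keystream byte y_i from the current state (toy choice)."""
    return hashlib.sha256(b"output" + k + x).digest()[0]

def sync_stream_xor(key: bytes, x0: bytes, data: bytes) -> bytes:
    """h is XOR, so the same routine both encrypts and decrypts."""
    x = x0
    out = bytearray()
    for p in data:
        out.append(p ^ keystream_byte(x, key))  # c_i = h(y_i, p_i)
        x = next_state(x, key)                  # x_{i+1} = f(x_i, k)
    return bytes(out)

key, x0 = b"demo key", bytes(32)
ciphertext = sync_stream_xor(key, x0, b"attack at dawn")
recovered = sync_stream_xor(key, x0, ciphertext)
```

Because the state evolves independently of the data, flipping one ciphertext bit changes only the corresponding plaintext bit on decryption.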

– Self-synchronizing stream ciphers [5]: The state is updated based on the previous ciphertext, i.e. the previous X ciphertext bits (for some fixed number X) are used to generate the keystream. This allows Bob and Alice to recover more easily if bits are added, deleted, or altered in the data stream. In this scheme, the overall effect of a one-bit error is limited.

Figure 3.

The CTR mode of operation.

Examples of self-synchronizing stream ciphers include RC4 [6] and a block cipher operating in CFB (cipher feedback) mode [7]. Other ciphers that use this technique are A5/1 [8], A5/2 [9], Helix [10], ISAAC [11], MUGI [12], Phelix [13] …

The encryption function of a self-synchronizing stream cipher can be described by Eq. (2):

$\displaystyle\begin{cases}x_{i}=(x_{i-t},x_{i-t+1}\ldots,x_{i-1})\\ y_{i}=g(x_{i},k)\\ c_{i}=h(y_{i},p_{i})\end{cases}$ (2)

where $x_{0}=(x_{-t},x_{-t+1},\ldots,x_{-1})$ is the non-secret initial state, $k$ is the key, $g$ is the function which produces the keystream $y_{i}$ , and $h$ is the output function that combines the keystream and plaintext $p_{i}$ to produce ciphertext $c_{i}$ . The encryption and decryption processes are depicted in Fig. 2.
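A minimal sketch of Eq. (2) follows, assuming a byte-oriented toy cipher whose state is the last `T` ciphertext bytes. The window size `T = 4`, the hash-based function standing in for $g$, and all function names are illustrative assumptions, not part of any studied cipher.

```python
import hashlib

T = 4  # number of previous ciphertext bytes kept as state (illustrative)

def _keystream_byte(window: bytes, key: bytes) -> int:
    """g: the keystream depends only on the key and the last T ciphertext bytes."""
    return hashlib.sha256(key + window).digest()[0]

def self_sync_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    window = bytes(iv)  # non-secret initial state of length T
    out = bytearray()
    for p in plaintext:
        c = p ^ _keystream_byte(window, key)
        out.append(c)
        window = window[1:] + bytes([c])  # slide the ciphertext window
    return bytes(out)

def self_sync_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    window = bytes(iv)
    out = bytearray()
    for c in ciphertext:
        out.append(c ^ _keystream_byte(window, key))
        window = window[1:] + bytes([c])  # same update, driven by ciphertext
    return bytes(out)
```

With this construction, a corrupted ciphertext byte can disturb at most the next `T` decrypted bytes; once it has left the window, decryption is correct again.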

Stream ciphers differ from block ciphers in design, which leads to differences in use. For instance, block ciphers require more memory to store the master key, the subkeys, the plaintext block and often additional data from previous blocks depending on the encryption mode [7], which can also combine confidentiality with an integrity check. Stream ciphers, in contrast, work on only a few bits at a time and have relatively low memory requirements, which makes them more suitable for embedded devices, firmware …

Block ciphers are more susceptible to transmission noise: if a bit is corrupted in the ciphertext, the rest of the block is probably unrecoverable. Stream ciphers, in contrast, encrypt bytes independently of one another.

Stream ciphers are usually faster than block ciphers, but they are often less secure and subject to weaknesses based on usage, because of the very strict requirements for the keystream.

Stream ciphers provide neither integrity nor authentication, whereas some block ciphers can provide integrity in addition to confidentiality (depending on the encryption mode). For all these reasons, stream ciphers are typically best suited to cases where the amount of data is either unknown or continuous, such as network streams and mobile radio communication, while block ciphers are more suitable when the amount of data is known or highly sensitive, such as file sharing and top-secret communication.

In this work, several high-speed stream ciphers are examined to observe the presence or absence of potential vulnerabilities to statistical analysis. The studied stream ciphers are ChaCha8/12/20 [14], HC128/256 [15, 16], Panama [17], Rabbit [18], RC4, Sosemanuk [19], Salsa20/XSalsa20 [20], SEAL [21] and WAKE [22]. Although RC4, SEAL and WAKE have been broken and are no longer secure, they are still used (e.g. RC4 is widely used in web encryption).

Figure 4.

The OFB mode of operation.

3. Pseudorandom number generator

It is also possible to build a PRNG from a block cipher operated in a particular mode, such as counter (CTR) mode or output feedback (OFB) mode [23].

In CTR mode, the output of the $i$th block is obtained by applying the block cipher, parameterized by the secret key $K$, to the initial value added (by XOR) to a counter $c_{i}$ (which usually takes the value $i$) (see Fig. 3).

In OFB mode, the block cipher parameterized by the secret key $K$ is applied to an initial value (IV). The resulting output block is added to the plaintext for encryption, and it also serves as the input value for the next block (see Fig. 4).

These modes have been standardized for many years and make it possible to generate a keystream from a secret key and an initial value. The OFB mode was initially defined in FIPS 81 [24]; CTR mode was added to the usual procedures in NIST Special Publication 800-38A [25]. Their specifications are included in the recent ISO/IEC 10116 [26].

Normally, a PRNG built from a block cipher that resists cryptanalytic attacks is considered reasonably secure. However, in theory, these constructions are not very strong because they are vulnerable to a distinguishing attack with a chosen initial value [23]. In CTR mode, the $i$th block of the sequence generated from the initial value IV is always equal to the $j$th block of the sequence generated from the initial value $IV\oplus c_{i}\oplus c_{j}$ . Similarly, in OFB mode, the sequence generated from the $i$th block onward with the initial value IV and a key K is always equal to the sequence generated from the $(i-1)$ th block onward with the same key K and the initial value obtained by encrypting IV under K.
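Both relations can be checked on a toy model. In the sketch below, `E` is a hypothetical stand-in for a block cipher (a truncated keyed hash, which is not invertible and not a real cipher); this is enough because CTR and OFB only use the forward direction. The helper names and the 16-byte block size are assumptions for the example.

```python
import hashlib

BLOCK = 16  # block size in bytes (illustrative)

def E(key: bytes, block: bytes) -> bytes:
    """Stand-in for block encryption under `key` (NOT a real block cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def counter(i: int) -> bytes:
    """c_i, here simply the integer i encoded on one block."""
    return i.to_bytes(BLOCK, "big")

def ctr_block(key: bytes, iv: bytes, i: int) -> bytes:
    """i-th CTR keystream block: E_K(IV xor c_i)."""
    return E(key, xor(iv, counter(i)))

def ofb_blocks(key: bytes, iv: bytes, n: int) -> list:
    """First n OFB keystream blocks: each output is the next input."""
    blocks, state = [], iv
    for _ in range(n):
        state = E(key, state)
        blocks.append(state)
    return blocks

key, iv = b"K" * 16, bytes(BLOCK)
i, j = 3, 7
related_iv = xor(iv, xor(counter(i), counter(j)))
# CTR: block i under IV equals block j under IV xor c_i xor c_j.
ctr_collision = ctr_block(key, iv, i) == ctr_block(key, related_iv, j)
# OFB: the stream from IV, shifted by one block, is the stream from E_K(IV).
ofb_shift = ofb_blocks(key, iv, 5)[1:] == ofb_blocks(key, E(key, iv), 4)
```

Both flags evaluate to `True` whatever function is substituted for `E`, which is why the relations yield a chosen-IV distinguisher independent of the strength of the block cipher.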

Obviously, these properties make it easier to distinguish the keystream produced by each of these modes. This weakness has recently led to modifications of each mode so that the keystream cannot be distinguished from a random sequence, provided the underlying block cipher has a similar property. Classifying the different types of PRNG is a delicate task. However, we can distinguish three major families [27] according to the type of transition function used by the PRNG:

• Linear transition function: A linear transition function is a preferred choice because of the simplicity of its implementation. Among the linear transition functions are those implemented with linear feedback shift registers (LFSR) [28]. These are favored for the low cost of their hardware implementation, for their statistical properties and for the large period of the sequences they produce. Stream ciphers using LFSR-based PRNGs include E0, deployed in the Bluetooth standard [29], A5/1, used to encrypt mobile communications in the GSM standard, and SNOW 2.0 [30], which targets software applications.
• Nonlinear transition function: To avoid weaknesses that may result from the linearity of the transition function, some designs favor a nonlinear function. However, the transition function must ensure that the internal states of the PRNG do not produce a sequence with a short period, regardless of the value of the initial state. Unlike for linear functions, it is relatively difficult to obtain such theoretical results for nonlinear functions. This difficulty can be circumvented if the size of the internal state of the generator is not limited by implementation constraints (unfortunately this is not the case), because it is very unlikely that an initial state generates a short-period sequence if the internal state is large enough. The typical example of this is RC4. However, in hardware applications, size constraints dictate that the internal state of the generator must not be too large, that is, its size should not exceed twice the length of the key. At present, we can mention some nonlinear feedback shift registers [31, 32] and generators based on T-functions [33, 34].
• Hybrid designs: In some PRNGs, the internal state is divided into two parts, one updated by a linear function and the other by a nonlinear function. When the linearly updated part is much larger, the PRNG is often classified as a linear transition generator with memory. This is the case of SNOW 2.0 and E0, for instance. However, there are also PRNGs whose linear and nonlinear parts are of similar size; this category includes the MUGI generator designed by Hitachi as well as SNOW 2.0.
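To make the linear family concrete, the following sketch implements a 4-bit LFSR with the recurrence $s_{n+4}=s_{n+1}\oplus s_{n}$ (feedback polynomial $x^{4}+x+1$). The polynomial, the register size and the function names are illustrative assumptions; they are far smaller than anything used in E0, A5/1 or SNOW 2.0.

```python
def lfsr_bits(seed, count):
    """Return `count` output bits of the LFSR s[n+4] = s[n+1] XOR s[n]."""
    state = list(seed)  # [s_n, s_{n+1}, s_{n+2}, s_{n+3}]
    out = []
    for _ in range(count):
        out.append(state[0])                         # emit the oldest bit
        state = state[1:] + [state[0] ^ state[1]]    # shift in the feedback bit
    return out

def lfsr_period(seed):
    """Number of steps before the register returns to its initial state."""
    start = tuple(seed)
    state = list(seed)
    steps = 0
    while True:
        state = state[1:] + [state[0] ^ state[1]]
        steps += 1
        if tuple(state) == start:
            return steps

sequence = lfsr_bits([0, 0, 0, 1], 15)
# sequence == [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]
period = lfsr_period([0, 0, 0, 1])  # 15 = 2**4 - 1, the maximum for 4 bits
```

Any nonzero seed visits all 15 nonzero states before repeating, which is the large-period property mentioned above; the all-zero seed stays at zero forever and must be excluded.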

4. Statistical properties of strong PRNG

To resist a distinguishing attack [35], the PRNG used by a stream cipher must have good statistical properties. A classic security criterion is that the output of the PRNG cannot be distinguished from a random sequence at a cost lower than $2^{n}$ , where $n$ is the number of bits of the secret key.

In a strong PRNG, knowledge of consecutive bits must not help to predict the value of the next bit with a probability substantially different from 1/2 [36]. Note that a bias in the keystream typically leaks information about the plaintext. For instance, let $x_{i}$ denote plaintext bit number $i$ , $y_{i}$ ciphertext bit number $i$ and $k_{i}$ keystream bit number $i$ .

Suppose the keystream is biased so that $Pr(k_{i}=0)=\frac{1}{2}+\varepsilon$ . Since $y_{i}=k_{i}\oplus x_{i}$ , it follows that $Pr(y_{i}=x_{i})=\frac{1}{2}+\varepsilon$ , so each ciphertext bit reveals information about the corresponding plaintext bit. For PRNGs used in stream ciphers, there is no theoretical proof that no polynomial-complexity distinguisher exists. However, there are a number of statistical tests that any PRNG must pass to be considered reasonably secure, even though passing them is not a sufficient condition for security. Among these probabilistic tests, we can distinguish the so-called tests of normality [37], which evaluate, for a sequence of $n$ bits, a quantity whose probability distribution is known for a random sequence. Besides the tests of normality, the so-called compression tests determine whether the sequence under test can be compressed without loss of information, which would distinguish it from a random sequence [38]. It should be noted that both the normality tests and the compression tests have since been improved by many researchers in order to obtain more efficient results and more dangerous attacks on stream ciphers.

Knuth [39] and Golomb [40] described the first statistical properties required of a pseudorandom sequence in the general context of a periodic sequence $k_{i}$ of period $T$. These statistical properties are known as Golomb’s postulates:

• In each period $T$, the numbers of 0s and 1s differ by no more than 1 (see Eq. (3)):

$\displaystyle\left|\sum_{i=0}^{T-1}(-1)^{k_{i}}\right|\leqslant 1$ (3)
• Define a “run” as a group of consecutive identical values in the sequence (e.g. 111, 0000), a “block” as a run of ones (e.g. 111, 11111) and a “gap” as a run of zeroes (e.g. 00, 0000). Then half the runs have length 1, one-fourth have length 2, one-eighth have length 3, etc., as long as the number of runs so indicated exceeds 1. Moreover, for each of these lengths, there are equally many gaps and blocks.
• The out-of-phase autocorrelation $C(\tau)$ of the sequence always has the same value, while the in-phase autocorrelation is equal to the length of the period (see Eq. (4)):

$\displaystyle C(\tau)=\sum_{i=0}^{T-1}(-1)^{k_{i}+k_{i+\tau}}$ (4)
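As a sanity check, the three postulates can be evaluated on a short periodic sequence. The 15-bit sequence below is one period of a maximal-length LFSR sequence, chosen purely as an illustration; the function names are assumptions. The run count treats the period as a straight line, which matches the cyclic count here only because this period starts with 0 and ends with 1.

```python
from collections import Counter
from itertools import groupby

# One period (T = 15) of the sequence s[n+4] = s[n+1] XOR s[n]; illustrative.
PERIOD = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]

def balance(seq):
    """Eq. (3) without the absolute value: sum of (-1)^{k_i} over one period."""
    return sum((-1) ** bit for bit in seq)

def run_lengths(seq):
    """Histogram {length: number of runs} over one period."""
    return Counter(len(list(group)) for _, group in groupby(seq))

def autocorrelation(seq, tau):
    """Eq. (4): the exponent k_i + k_{i+tau} has the same parity as the XOR."""
    T = len(seq)
    return sum((-1) ** (seq[i] ^ seq[(i + tau) % T]) for i in range(T))

first = abs(balance(PERIOD)) <= 1                         # postulate 1
second = run_lengths(PERIOD)                              # {1: 4, 2: 2, 3: 1, 4: 1}
third = {autocorrelation(PERIOD, t) for t in range(1, 15)}  # single value: {-1}
in_phase = autocorrelation(PERIOD, 0)                     # 15, the period length
```

Half of the eight runs have length 1 and a quarter have length 2, as the second postulate requires, and every out-of-phase shift gives the same autocorrelation value.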

Many statistical tests have been established since the work of Knuth and Golomb. Today, we usually use so-called statistical test libraries. The most widely used libraries are the NIST 800-22 test suite [41], DIEHARD [42] and Crypt-XS [43].

5. Statistical cryptanalysis

No study of cryptanalysis can bypass statistical analysis [44, 45], because even if there is no apparent connection between plaintext and ciphertext, statistical analysis, and frequency analysis in particular, can reveal important information to an attacker. In this section, we examine the tested stream ciphers to observe the presence or absence of a signature in their ciphertext that could reveal useful information about the plaintext. We first analyze the frequency of characters (letters) in English. Then, we evaluate the PRNG of each stream cipher from a statistical point of view and, finally, we study the distribution of the encrypted characters using the Chi-square statistical test, in order to obtain a global picture of the statistical resistance of each stream cipher.

5.1 Frequency analysis

With a long enough plaintext, each character occurs with a characteristic frequency. The most frequently used character in English is the letter E with a frequency of 12.7% followed by the letter T with a frequency of 9.1% [46] (see Table 1).

Table 1
The frequency of letters in English (L denotes the letter and F denotes the frequency in %)

L	F	L	F	L	F
A	8.2	J	0.2	S	6.3
B	1.5	K	0.8	T	9.1
C	2.8	L	4	U	2.8
D	4.3	M	2.4	V	1
E	12.7	N	6.7	W	2.4
F	2.2	O	7.5	X	0.2
G	2	P	1.9	Y	2
H	6.1	Q	0.1	Z	0.1
I	7	R	6
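Frequencies such as those in Table 1 are obtained by simple counting. The sketch below shows one way to compute them for an arbitrary text; the function name `letter_frequencies` and the choice to ignore everything except the 26 Latin letters are assumptions made for the example.

```python
from collections import Counter
import string

def letter_frequencies(text: str) -> dict:
    """Return the frequency in % of each letter A-Z, ignoring other characters."""
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: 100.0 * counts[ch] / total for ch in string.ascii_uppercase}

frequencies = letter_frequencies("Meet me at the entrance")
most_common = max(frequencies, key=frequencies.get)  # 'E' for this sample
```

The same counting applied to ciphertext bytes, instead of letters, gives the kind of distribution that is compared against the plaintext frequencies in a frequency analysis.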

Figure 5.

The percentage of each character in the plaintext.

Figure 6.

The percentage of each character in the ciphertext.

The frequency study leads naturally to a guessing attack, because it is reasonable to assume that the character with the highest frequency in the ciphertext is most likely to correspond to the character with the highest frequency in the plaintext. As a result, we define the probability of success of a guessing attack on each stream cipher as the ratio of the number of correctly guessed characters to the total number of characters. This ratio helps to compare the resistance of stream ciphers to this type of attack. It should also be mentioned that the probability of success of the guessing attack depends on the amount of ciphertext obtained: the more ciphertext is available, the higher the probability of success.

Typically, stream ciphers are mono-alphabetic ciphers, but the continuous change of the keystream allows them to act as polyalphabetic ciphers [47]. This complicates frequency cryptanalysis, because the search for the single possible decryption of each character becomes a search among several possible decryptions.

Furthermore, the probability of success of the guessing attack is also related to the probability of finding, in the ciphertext, the character that corresponds to a given plaintext character. We call this the mission candidate, i.e. we define the group of poly-decrypted characters as the possible candidates for each plaintext character. The candidate assignment is based on a binomial distribution; in this way, every candidate among the encrypted characters has a probability of being the corresponding character [48] (see Eq. (5)):

$\displaystyle P(\textit{character}=x)=\binom{n}{x}(P_{\textit{character}})^{x}(1-P_{\textit{character}})^{n-x}$ (5)

where $P_{\textit{character}}$ is the probability of the character searched for in the ciphertext (it is equal to the frequency of the character in the plaintext), $n$ is the total number of ciphertext characters and $x$ is the number of occurrences of the encrypted character whose probability is being calculated.
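Eq. (5) is the binomial probability mass function, which can be evaluated directly. In the sketch below the function name `candidate_probability` and the example values ($n=100$ ciphertext characters and the 12.7% frequency of the letter E) are illustrative assumptions, not figures from the experiments.

```python
from math import comb

def candidate_probability(n: int, x: int, p_character: float) -> float:
    """Binomial probability of exactly x occurrences among n characters."""
    return comb(n, x) * p_character ** x * (1 - p_character) ** (n - x)

# Probability that a candidate for 'E' (frequency 12.7%) occurs exactly
# 13 times in a ciphertext of 100 characters.
p_13 = candidate_probability(100, 13, 0.127)

# The probabilities over all possible occurrence counts sum to 1.
total = sum(candidate_probability(100, x, 0.127) for x in range(101))
```

A candidate whose observed count lies near $n\cdot P_{\textit{character}}$ receives the highest probability, which is what makes it the preferred guess.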

Figure 7.

The percentage of each character in the ciphertext.

The first observation from Fig. 6 is that some ciphertexts have a nearly uniform distribution, with character frequencies bounded between zero and 0.47%. This pseudo-uniform distribution of information in the ciphertext attests to the great difficulty of extracting information from the ciphertext to determine the plaintext, even when the plaintext character frequencies are clearly marked (see Fig. 5). Moreover, Fig. 6i and j show that one character accounts for 16% of the ciphertext while the frequencies of the remaining characters are bounded between zero and 0.746%. For HC128 and HC256, our guessing attacks showed success probabilities of 0.037934% and 0.048203%, respectively, for guessing one character. However, it is still difficult to push frequency analysis further because the rest of the data encrypted with HC128/256 has a nearly uniform distribution. As for RC4, Rabbit and WAKE in Fig. 6k–m, respectively, the frequency analysis showed a markedly poor character distribution in the ciphertext, which gives an attacker attractive information for breaking the cipher. In this work, our guessing attacks showed success probabilities of 0.081989%, 0.10249% and 0.10044% for guessing one character for Rabbit, RC4 and WAKE, respectively. Of note, this work does not try to break those ciphers, but to show that with only a few attempts, a guessing attack succeeds in guessing one character with a probability close to 0.1%. Therefore, we deduce that RC4, Rabbit and WAKE are potentially vulnerable to frequency analysis.

In general, the strength of a stream cipher rests on the unpredictability of its PRNG. Statistically, a good PRNG maps the characters of the domain uniformly onto the codomain. For instance, if Bob sends Alice a message that consists only of repetitions of a single character, then any bias in the encrypted message is a weakness. It is not necessary for all encrypted characters to have exactly the same frequency of appearance; what harms a cipher is an encrypted character that appears markedly more often than the others.

For that reason, we test the PRNGs of the studied stream ciphers by applying frequency analysis to a plaintext consisting of repetitions of one random character. The result is illustrated in Fig. 7.

Figure 7 shows the probability of occurrence of each ciphertext character for one plaintext character. We notice that Chacha8/12/20, Panama, Rabbit, WAKE and XSalsa show a good distribution of encrypted characters, which indicates good diffusion and confusion by their PRNGs. On the other hand, HC128/256, RC4, Salsa, SEAL, and Sosemanuk have a poor statistical distribution (e.g., one character encrypted with Salsa has a probability of occurrence of 2.4%, which is approximately seven times higher than it should be). Therefore, we deduce that their PRNGs are not statistically strong, and hence that their keystreams are likely to exhibit bias.

5.2 Chi-square goodness-of-fit test

The Chi-squared statistic [49] measures the degree of similarity between two categorical probability distributions. If the two distributions match, the Chi-squared statistic is zero; if they are very different, a larger value results. The Chi-square test is a general case of the statistical test of normality. The formula for the Chi-squared statistic is given in Eq. (6):

$\displaystyle\chi^{2}(P,C)=\sum_{i}\frac{(P_{i}-C_{i})^{2}}{C_{i}}$ (6)

where $P_{i}$ is the frequency of a character in the source file, and $C_{i}$ is the frequency in the corresponding encrypted file. The $\chi^{2}$ test can be used by an attacker to guess the encryption key: although in the worst case he must try all possible keys, which is equivalent to a brute-force attack, in the best case he can recover the key in fewer operations by computing $\chi^{2}$ for every candidate key and picking the minimum.

To resist the $\chi^{2}$ test, the cipher must distribute the encrypted characters uniformly. The uniformity of the distribution can be assessed quantitatively with the Pearson chi-square test [50]. The $\chi^{2}$ distribution is used to compare the goodness-of-fit of the observed frequencies of a sample measurement with the corresponding expected frequencies of the hypothesized distribution.
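The two uses of the statistic described above can be sketched as follows: `chi_squared` evaluates Eq. (6) between source and encrypted counts, and `pearson_uniform` measures the deviation of a byte string from a uniform distribution over 256 symbols. Both function names are assumptions, as is the decision to skip symbols with $C_{i}=0$ in Eq. (6) to avoid a division by zero; the paper does not state how that case is handled.

```python
from collections import Counter

def chi_squared(source: bytes, encrypted: bytes) -> float:
    """Eq. (6) with P_i, C_i taken as symbol counts; C_i = 0 terms are skipped."""
    p_counts, c_counts = Counter(source), Counter(encrypted)
    return sum(
        (p_counts.get(symbol, 0) - c) ** 2 / c
        for symbol, c in c_counts.items()
        if c > 0
    )

def pearson_uniform(data: bytes, alphabet: int = 256) -> float:
    """Pearson statistic of the byte counts against a uniform expectation."""
    expected = len(data) / alphabet
    counts = Counter(data)
    return sum(
        (counts.get(symbol, 0) - expected) ** 2 / expected
        for symbol in range(alphabet)
    )

same = chi_squared(b"aabbcc", b"ccbbaa")           # identical histograms -> 0.0
flat = pearson_uniform(bytes(range(256)) * 4)      # perfectly uniform -> 0.0
skewed = pearson_uniform(b"\x00" * 256)            # one symbol only -> 65280.0
```

A ciphertext whose Pearson statistic stays below the critical value for 255 degrees of freedom is consistent with uniformity; a much larger value signals the kind of bias discussed in the frequency analysis.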

Table 2

Pearson’s chi-square test

Stream cipher	Expected $\chi^{2}$	Tested $\chi^{2}$
Chacha8	717607	703094
Chacha12	717607	720545
Chacha20	717607	682797
HC128	886823	1273949
HC256	892714	1466110
Panama	717607	679200
Rabbit	288098	2743095
RC4	717607	3443095
Salsa	717607	673304
SEAL	717607	812491
Sosemanuk	717607	728598
WAKE	717607	1246379
XSalsa	717607	749167

The chi-square test values for the same ciphertexts used in the frequency analysis are listed in Table 2. For Chacha8/20, Panama and Salsa, the tested $\chi^{2}$ is smaller than the expected $\chi^{2}$ , implying that the null hypothesis is not rejected and the distribution of the ciphertext is uniform. In contrast, for Chacha12, HC128/256, Rabbit, RC4, SEAL, Sosemanuk, WAKE and XSalsa, the tested $\chi^{2}$ is larger than the expected $\chi^{2}$ , implying that the null hypothesis is rejected and the distribution of the ciphertext is not uniform.

6. Keystream randomization examination

A statistical test of a keystream sequence is used to test a null hypothesis (H0). In this paper, the null hypothesis H0 states that the sequence tested is random and the alternative hypothesis H1 states that the sequence tested is not random [51].

The significance level $\alpha$ of the test of a statistical hypothesis H0 is the probability of rejecting H0 when it is true. If the significance level $\alpha$ of a test of H0 is too high, then the test may reject sequences that were, in fact, random. Such an error is called a Type I error. On the other hand, if the significance level of a test of H0 is too low, then there is the danger that the test may accept sequences that were, in fact, not random. Such an error is called a Type II error. In practice, a significance level $\alpha$ between 0.001 and 0.05 is widely employed.

Actually, the probability of a Type II error may be completely independent of $\alpha$ . If the sequence is not random, the probability depends on the nature of the defects of the PRNG, and is usually difficult to determine in practice. For this reason, assuming that the probability of a Type II error is proportional to $\alpha$ is a useful intuitive guide when selecting an appropriate significance level for a test [36].

The NIST tests mentioned in the previous section can only be applied to sequences of $10^{6}$ bits in length. For short sequences, the NIST tests give misleading results. For this reason, we use one of the most important randomness test suites for short sequences, the Beker-Piper [52] statistical test suite, which provides a quantitative measure of randomness.

The Beker-Piper test suite consists of five tests [36, 38]: the frequency test or mono-bit test, the serial test or two-bits test, the poker test, the runs test, and the autocorrelation test. Note that if a sequence fails the mono-bit test, it is not necessary to apply the remaining four tests.

6.1 Mono-bit test

This test checks whether the numbers of 0s and 1s are approximately equal, as would be expected for a random sequence. If we denote by A, B and C the number of 0s, the number of 1s and the total number of bits in a sequence, respectively, the frequency test statistic is computed as in Eq. (7):

$\displaystyle\frac{(A-B)^{2}}{C}$ (7)

which approximately follows a $\chi^{2}$ distribution with 1 degree of freedom if $C\geqslant 10$ [53]. For a given significance level $\alpha$ , if the frequency test statistic is less than or equal to $\chi^{2}_{\alpha}(1)$ then the sequence passes the mono-bit test; otherwise it fails. For a significance level of $\alpha=0.05$ , the threshold value for this test is 3.8415.
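A minimal sketch of the mono-bit test at $\alpha=0.05$ follows. The 160-bit `SAMPLE` (a 40-bit pattern repeated four times) is a hypothetical input chosen for illustration and is not one of the keystreams studied here; the function names are assumptions as well.

```python
# Hypothetical 160-bit sample: a 40-bit pattern repeated four times.
PATTERN = "11100 01100 01000 10100 11101 11100 10010 01001".replace(" ", "")
SAMPLE = [int(ch) for ch in PATTERN * 4]

THRESHOLD_1DF = 3.8415  # chi-square critical value, 1 degree of freedom, alpha = 0.05

def monobit_statistic(bits) -> float:
    """Eq. (7): (A - B)^2 / C with A zeros, B ones and C bits in total."""
    c = len(bits)
    b = sum(bits)
    a = c - b
    return (a - b) ** 2 / c

def passes_monobit(bits) -> bool:
    return monobit_statistic(bits) <= THRESHOLD_1DF

statistic = monobit_statistic(SAMPLE)  # 84 zeros, 76 ones -> 64 / 160 = 0.4
verdict = passes_monobit(SAMPLE)       # True, since 0.4 <= 3.8415
```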

6.2 Two-bits test

The purpose of this test is to determine whether the number of occurrences of 00, 01, 10, and 11 as subsequences of keystream are approximately the same, as would be expected for a random sequence. Let A, B denote the number of 0 and 1 in the keystream, respectively, and let C, D, E, F denote the number of occurrences of 00, 01, 10, 11 in the keystream, respectively. For a keystream length equal to n, we have $C+D+E+F=n-1$ , since the subsequences are allowed to overlap. The two-bits test is computed as Eq. (8):

$\displaystyle\frac{4(C^{2}+D^{2}+E^{2}+F^{2})}{n-1}-\frac{2(A^{2}+B^{2})}{n}+1$ (8)

which approximately follows a $\chi^{2}$ distribution with 2 degrees of freedom if $n\geqslant 21$ [53]. For a given significance level $\alpha$ , if the serial test statistic is less than or equal to $\chi^{2}_{\alpha}(2)$ then the sequence passes the two-bits test; otherwise it fails. Here $\chi^{2}_{\alpha}(2)$ denotes the inverse of the $\chi^{2}$ cumulative distribution function with 2 degrees of freedom. For a significance level of $\alpha=0.05$ , the threshold value for this test is 5.9915.
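The serial statistic of Eq. (8) can be computed with one pass over the overlapping pairs. As before, the 160-bit `SAMPLE` is a hypothetical input and the function name is an assumption.

```python
# Hypothetical 160-bit sample: a 40-bit pattern repeated four times.
PATTERN = "11100 01100 01000 10100 11101 11100 10010 01001".replace(" ", "")
SAMPLE = [int(ch) for ch in PATTERN * 4]

THRESHOLD_2DF = 5.9915  # chi-square critical value, 2 degrees of freedom, alpha = 0.05

def serial_statistic(bits) -> float:
    """Eq. (8), with overlapping pairs so that C + D + E + F = n - 1."""
    n = len(bits)
    ones = sum(bits)
    zeros = n - ones
    pairs = [0, 0, 0, 0]  # counts of 00, 01, 10, 11
    for first, second in zip(bits, bits[1:]):
        pairs[2 * first + second] += 1
    return (
        4 / (n - 1) * sum(count ** 2 for count in pairs)
        - 2 / n * (zeros ** 2 + ones ** 2)
        + 1
    )

statistic = serial_statistic(SAMPLE)    # pair counts 44, 40, 40, 35 -> about 0.6252
verdict = statistic <= THRESHOLD_2DF    # True: the sample passes the two-bits test
```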

6.3 Runs test

This test checks whether the numbers of runs of various lengths in a sequence are as expected for a random sequence. The expected number of gaps (or blocks) of length m in a random sequence of length n is $L_{m}=(n-m+3)/2^{m+2}$ . Let K be equal to the largest integer m for which $L_{m}\geqslant 5$ and let $A_{m}$ , $B_{m}$ be the number of blocks and gaps, respectively, of length $m$ in a sequence. Summing over each $m$ such that $1\leqslant m\leqslant K$ , the runs test statistic is computed as in Eq. (9):

$\displaystyle\sum_{m=1}^{K}\frac{(A_{m}-L_{m})^{2}}{L_{m}}+\sum_{m=1}^{K}\frac{(B_{m}-L_{m})^{2}}{L_{m}}$ (9)

which approximately follows a $\chi^{2}$ distribution with $2K-2$ degrees of freedom [53]. For a given significance level $\alpha$ , if the runs test statistic is less than or equal to $\chi^{2}_{\alpha}(2K-2)$ then the sequence passes the runs test; otherwise it fails. For a significance level of $\alpha=0.05$ and $K=9$ , the threshold value for this test is 26.2962.
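The runs statistic can be sketched as below, using the squared deviation of each observed count of blocks and gaps from its expected value $L_{m}$. The 160-bit `SAMPLE` is a hypothetical input and the function names are assumptions; the critical value has to be looked up for the $2K-2$ degrees of freedom that the function returns alongside the statistic.

```python
from itertools import groupby

# Hypothetical 160-bit sample: a 40-bit pattern repeated four times.
PATTERN = "11100 01100 01000 10100 11101 11100 10010 01001".replace(" ", "")
SAMPLE = [int(ch) for ch in PATTERN * 4]

def runs_statistic(bits):
    """Return (statistic, K) for the runs test on a list of 0/1 values."""
    n = len(bits)

    def expected(m):  # L_m, expected number of blocks (or gaps) of length m
        return (n - m + 3) / 2 ** (m + 2)

    k = 0
    while expected(k + 1) >= 5:  # largest m with L_m >= 5 (L_m is decreasing)
        k += 1

    blocks = [0] * (k + 1)  # blocks[m]: runs of ones of length m
    gaps = [0] * (k + 1)    # gaps[m]: runs of zeros of length m
    for value, group in groupby(bits):
        length = len(list(group))
        if length <= k:
            (blocks if value == 1 else gaps)[length] += 1

    statistic = sum(
        (blocks[m] - expected(m)) ** 2 / expected(m)
        + (gaps[m] - expected(m)) ** 2 / expected(m)
        for m in range(1, k + 1)
    )
    return statistic, k

statistic, k = runs_statistic(SAMPLE)  # K = 3, statistic about 31.7913
```

For this sample $K=3$, so the statistic is compared with the $\chi^{2}$ critical value for 4 degrees of freedom (9.4877 at $\alpha=0.05$), and the sample fails the runs test.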

6.4 Poker test

This test checks whether each possible m-bit block appears approximately the same number of times in the sequence. Let n denote the length of the sequence and let m be a positive integer such that $\lfloor n/m\rfloor\geqslant 5\times 2^{m}$ . With $K=\lfloor n/m\rfloor$ , the sequence is divided into K non-overlapping parts, each of length m. If $n_{i}$ denotes the number of occurrences of the m-bit block with decimal value $i-1$ , the poker test statistic is computed as in Eq. (10):

$\displaystyle\frac{2^{m}}{K}\left(\sum_{i=1}^{2^{m}}n_{i}^{2}\right)-K$ (10)

which approximately follows a $\chi^{2}$ distribution with $2^{m}-1$ degrees of freedom [53]. For a given significance level $\alpha$ , if the poker test statistic is less than or equal to $\chi^{2}_{\alpha}(2^{m}-1)$ then the sequence passes the poker test; otherwise it fails. For a significance level of $\alpha=0.05$ and $m=4$ , the threshold value for this test is 24.9958. It must be noted that the poker test is a generalization of the frequency test, which corresponds to $m=1$ .
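Eq. (10) translates into the short sketch below. The 160-bit `SAMPLE` is a hypothetical input, the function name is an assumption, and $m=3$ is used because the sample is too short for $m=4$ (which requires $\lfloor n/m\rfloor\geqslant 80$).

```python
# Hypothetical 160-bit sample: a 40-bit pattern repeated four times.
PATTERN = "11100 01100 01000 10100 11101 11100 10010 01001".replace(" ", "")
SAMPLE = [int(ch) for ch in PATTERN * 4]

def poker_statistic(bits, m: int) -> float:
    """Eq. (10) over K = floor(n / m) non-overlapping m-bit blocks."""
    k = len(bits) // m
    if k < 5 * 2 ** m:
        raise ValueError("sequence too short for this block size")
    counts = [0] * (2 ** m)  # counts[v]: number of blocks with decimal value v
    for j in range(k):
        value = 0
        for bit in bits[j * m:(j + 1) * m]:
            value = (value << 1) | bit
        counts[value] += 1
    return (2 ** m / k) * sum(count ** 2 for count in counts) - k

statistic = poker_statistic(SAMPLE, 3)  # K = 53, statistic about 9.6415
```

With $m=3$ the statistic has 7 degrees of freedom; since 9.6415 is below the corresponding critical value at $\alpha=0.05$ (14.0671), this sample passes the poker test.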

6.5 Autocorrelation test

This test checks the degree of dependence between a sequence and a shifted version of itself. Let n denote the length of the sequence, $b_{i}$ its $i$th bit and m a positive integer such that $1\leqslant m\leqslant\lfloor n/2\rfloor$ . Then the number of bits in the sequence that differ from their m-shifts is $A(m)=\sum_{i=1}^{n-m}b_{i}\oplus b_{i+m}$ . The autocorrelation test statistic is computed as in Eq. (11):

$\displaystyle 2\left(A(m)-\frac{n-m}{2}\right)/\sqrt{n-m}$ (11)

which approximately follows an $N(0,1)$ distribution if $n-m\geqslant 10$ . Since small values of $A(m)$ are as unexpected as large values of $A(m)$ , a two-sided test should be used [53]. For a given significance level $\alpha$ , if the absolute value of the autocorrelation test statistic is less than or equal to $Z_{\alpha/2}$ then the sequence passes the autocorrelation test; otherwise the sequence does not pass the autocorrelation test. Here $Z_{\alpha/2}$ denotes the value of the inverse standard normal cumulative distribution function at $1-\alpha/2$ . For a significance level of $\alpha=0.05$ , the threshold value for this test is 1.96.
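A minimal sketch of the autocorrelation test follows, assuming the sequence is a list of 0/1 integers indexed from 0, so that the sum runs over all $n-m$ pairs of a bit and its $m$-shift. The function names are hypothetical, and the default bound 1.96 is the value quoted above for $\alpha=0.05$.

```python
import math


def autocorrelation_statistic(bits, m):
    """Autocorrelation statistic of Eq. (11) for shift m."""
    n = len(bits)
    if not 1 <= m <= n // 2:
        raise ValueError("shift m must satisfy 1 <= m <= floor(n / 2)")
    # A(m): number of positions whose bit differs from the bit m places later.
    a = sum(bits[i] ^ bits[i + m] for i in range(n - m))
    return 2 * (a - (n - m) / 2) / math.sqrt(n - m)


def passes_autocorrelation(bits, m, z=1.96):
    """Two-sided decision: pass if |statistic| <= z."""
    return abs(autocorrelation_statistic(bits, m)) <= z
```

For an alternating sequence of 101 bits and $m=1$, every bit differs from its successor, so $A(1)=100$ and the statistic equals 10, well outside the acceptance region.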

Table 3

The required interval for passing the Runs test according to length of blocks and gaps

Length of run	Required interval	Length of run	Required interval
1	2343–2657	4	251–373
2	1135–1365	5	111–201
3	542–708	6	111–201

Table 4

The passing ratio of each studied stream cipher for the Beker-Piper randomness tests

Stream cipher	Mono-bit test	Two-bit test	Runs test	Poker test	Autocorrelation test	Total passing ratio
Chacha8	95.6%	100%	94%	95%	95.25%	95.97%
Chacha12	96.3%	100%	94.4%	94.7%	94.5%	95.98%
Chacha20	94.3%	100%	93.25%	96%	94.2%	95.55%
HC128	98.9474%	100%	94.7368%	68.4211%	91.5789%	90.73684%
HC256	94.7368%	100%	92.6316%	47.3684%	87.3684%	84.42104%
Panama	94.9%	100%	93.4%	95.7%	90.75%	94.95%
Rabbit	95.3%	100%	96%	96.4%	94.8%	96.5%
RC4	95.7%	100%	92%	92.5%	93.4%	94.72%
Salsa	95.2%	100%	93.25%	95%	94%	95.49%
SEAL	94.1%	100%	94%	94.9%	94.5%	95.5%
Sosemanuk	95%	100%	95%	95.6%	93%	95.72%
WAKE	96.6%	100%	92%	94.8%	95%	95.68%
XSalsa	95.1%	100%	93.75%	95.5%	93%	95.47%

6.6 Results and discussion

Instead of making the user select appropriate significance levels for the Beker-Piper tests, FIPS 140-2 [54] provides explicit bounds that the computed values must respect in order to pass four tests. These bounds constitute the FIPS passing criteria. A single keystream of length 20000 bits is subjected to each of the following tests. If any of the tests fails, then the stream cipher fails the test. The FIPS 140-2 randomization tests are:

• Mono-bit test: The number of ones in the keystream sequence should belong to the interval $[9725, 10275]$ .
• Poker test: Eq. (10) is computed for $m=4$ . The poker test is passed if the result belongs to the interval $[2.16, 46.17]$ .
• Runs test: The numbers of blocks and gaps of length $i$ in the keystream sequence are counted for each $i$ , $1\leqslant i\leqslant 6$ (for the purpose of this test, runs of length greater than 6 are considered to be of length 6). The runs test is passed if the 12 counts of blocks and gaps are each within the corresponding interval specified in Table 3.
• Long runs test: The test is passed if there are no long runs. A long run is defined as a run of length 26 or more (of either zeros or ones).
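The four checks listed above can be sketched as follows. This is an illustrative sketch, not the code used in the study: the function name `fips_140_2` and the result keys are hypothetical, the keystream is assumed to be a list of exactly 20000 integers in {0, 1}, and the intervals are treated as inclusive, as they are written in the text.

```python
from itertools import groupby

# Required intervals from Table 3, keyed by run length (6 stands for 6 or more).
RUN_INTERVALS = {
    1: (2343, 2657), 2: (1135, 1365), 3: (542, 708),
    4: (251, 373), 5: (111, 201), 6: (111, 201),
}


def fips_140_2(bits):
    """Return a dict mapping each test name to True (pass) or False (fail)."""
    if len(bits) != 20000:
        raise ValueError("the tests are defined for 20000-bit keystreams")

    # Mono-bit test: count the ones.
    monobit = 9725 <= sum(bits) <= 10275

    # Poker test: Eq. (10) with m = 4, hence K = 5000 parts.
    counts = [0] * 16
    for i in range(0, 20000, 4):
        value = bits[i] * 8 + bits[i + 1] * 4 + bits[i + 2] * 2 + bits[i + 3]
        counts[value] += 1
    poker_value = 16 / 5000 * sum(c * c for c in counts) - 5000
    poker = 2.16 <= poker_value <= 46.17

    # Runs and long runs tests: tally blocks (ones) and gaps (zeros) by length.
    run_counts = {(v, k): 0 for v in (0, 1) for k in range(1, 7)}
    longest = 0
    for value, group in groupby(bits):
        length = sum(1 for _ in group)
        longest = max(longest, length)
        run_counts[(value, min(length, 6))] += 1
    runs = all(
        low <= run_counts[(v, k)] <= high
        for v in (0, 1)
        for k, (low, high) in RUN_INTERVALS.items()
    )
    long_runs = longest < 26

    return {"mono-bit": monobit, "poker": poker, "runs": runs, "long runs": long_runs}
```

A keystream passes overall only if `all(fips_140_2(bits).values())` is true, mirroring the rule that failing any single test fails the cipher.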

In this work, we use the FIPS 140-2 bounds as the required condition to pass the following three tests: the mono-bit test, the poker test and the runs test. For the remaining two tests, the two-bit test and the autocorrelation test, we set our significance level $\alpha$ to 0.05. Moreover, we set the passing criterion for each test at 95%, although results close to 95% are also considered acceptable: a test is considered failed only when its passing ratio is below 94%. However, if the Beker-Piper total passing ratio is below 95%, the keystream is judged not to be sufficiently random, i.e. H0 is rejected and H1 is accepted. The results of this experiment are tabulated in Table 4.
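The decision rule can be illustrated with a short sketch. It assumes that the total passing ratio is the arithmetic mean of the five per-test ratios; this is not stated explicitly in the text, but the totals tabulated in Table 4 are consistent with it. The function name `evaluate` and the dictionary keys are hypothetical.

```python
def evaluate(ratios, test_threshold=94.0, total_threshold=95.0):
    """Apply the passing criteria to per-test passing ratios given in percent.

    Returns (total, failed_tests, accepted): the mean ratio, the tests whose
    ratio is below test_threshold, and whether the total reaches
    total_threshold (H0 retained).
    """
    total = sum(ratios.values()) / len(ratios)
    failed = [name for name, ratio in ratios.items() if ratio < test_threshold]
    return total, failed, total >= total_threshold


# Two rows of Table 4 as example inputs.
rabbit = {"mono-bit": 95.3, "two-bit": 100.0, "runs": 96.0,
          "poker": 96.4, "autocorrelation": 94.8}
rc4 = {"mono-bit": 95.7, "two-bit": 100.0, "runs": 92.0,
       "poker": 92.5, "autocorrelation": 93.4}

for name, row in (("Rabbit", rabbit), ("RC4", rc4)):
    total, failed, accepted = evaluate(row)
    print(f"{name}: total {total:.2f}%, failed tests {failed}, accepted {accepted}")
```

With these inputs Rabbit reaches a total of 96.5% with no failed test, while RC4 falls to 94.72% with the runs, poker and autocorrelation tests below 94%.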

The Mono-bit and Two-bit test results in Table 4 are above 94% for all the studied stream ciphers. This confirms the good balance of 0 and 1, and of 00, 01, 10 and 11, respectively, in the generated sequences. As for the Poker test, the results satisfy the defined passing criterion for all the studied stream ciphers except HC128/256 and RC4.

The Autocorrelation test results of HC128/256, Panama, RC4, Sosemanuk and XSalsa indicate a correlation between the sequences and their shifted versions. Likewise, Chacha20, HC256, Panama, RC4, Salsa, WAKE and XSalsa fail the Runs test, which indicates dependency and correlation within the generated keystream, in particular for HC256, Panama, RC4 and XSalsa, which also fail the Autocorrelation test.

Moreover, since the total passing ratio is below the required 95% for HC128/256, Panama and RC4, H0 is rejected and H1 is accepted for these ciphers. Therefore, a statistical analysis of the keystream generated by HC128/256, Panama or RC4 is feasible with high probability.

In conclusion, Table 4 shows that Rabbit has the best keystream randomization among all the studied stream ciphers, followed by Chacha12/Chacha8, Sosemanuk, then WAKE, Chacha20, SEAL, Salsa/XSalsa, Panama and then RC4, while HC128/256 appears to have the worst keystream randomization.

7. Conclusions

This article presents a statistical cryptanalysis study of several high-speed stream ciphers, namely Chacha8/12/20, HC128/256, Panama, Rabbit, RC4, Salsa, SEAL, Sosemanuk, WAKE and XSalsa.

The aim of this work is to show the presence or absence of any statistical signature left in the ciphertext by the ciphers studied. It has been found that RC4, Rabbit and WAKE leave a considerable amount of exploitable information in the ciphertext, which provides frequency statistics that can be used to reduce the cost of a brute-force attack against them. The guessing attack applied in this paper managed to link a plaintext character to some characters of the ciphertext with a probability of success equal to 0.1%. This number means that, starting from $n$ characters in the ciphertext, the attack succeeds in cracking an encrypted character $n\times 0.001\times P_{\textit{character}}$ times. In addition, the study of the PRNGs showed that HC128/256, RC4, Salsa, SEAL and Sosemanuk have a poor statistical distribution for keystream generation. For example, the constant-input encryption for Salsa showed that one encrypted character has a probability of occurrence seven times greater than it would have under a uniform distribution. The chi-square test also showed that Chacha12, HC128/256, Rabbit, RC4, SEAL, Sosemanuk, WAKE and XSalsa have a non-uniform distribution in the ciphertext. In terms of randomization, HC128/256, Panama and RC4 failed the test, while the others showed a good degree of randomization, which makes statistical analysis more complex.

According to these tests, we deduce from our statistical analysis that Chacha8/12/20 are the best stream ciphers among those studied in this paper in terms of hiding statistical information in the ciphertext, followed by Rabbit, Salsa and SEAL, while RC4 appears to be the worst of them.

References

1. Harmouch and El Kouch, A fair comparison between several ciphers in characteristics, safety and speed test, Europe and MENA Cooperation Advances in Information and Communication Technologies, Springer, (2017), 535–547.
2. Banegas, Attacks in stream ciphers: A survey, IACR Cryptology ePrint Archive (2014), 677.
3. Gersho and Gray R.M., Vector quantization and signal compression (Vol. 159), Springer Science & Business Media.
4. Cryptanalysis and design of stream ciphers, (2008).
5. Daneshgar and Mohebbipoor, A secure self-synchronized stream cipher, arXiv preprint arXiv:1709.08613 (2017).
6. Eason et al., The RC4 encryption algorithm, RSA Data Security (1992).
7. NIST Computer Security Division’s Security Technology Group, Block cipher modes, Cryptographic Toolkit, NIST, Retrieved April 12, 2013.
8. Maximov et al., An improved correlation attack on A5/1, International Workshop on Selected Areas in Cryptography, Springer, (2004), 1–18.
9. Goldberg et al., The real-time cryptanalysis of A5/2, Rump session of Crypto (1999), 239–255.
10. Ferguson et al., Helix: Fast encryption and authentication in a single cryptographic primitive, International Workshop on Fast Software Encryption, Springer, (2003), 330–346.
11. Jenkins R.J., Isaac, International Workshop on Fast Software Encryption, Springer, (1996), 41–49.
12. Watanabe et al., A new keystream generator MUGI, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 87(1) (2004), 37–45.
13. Whiting et al., Fast encryption and authentication in a single cryptographic primitive, ECRYPT Stream Cipher Project Report 27(200) (2005), 5.
14. Bernstein D.J., ChaCha, a variant of Salsa20, Workshop Record of SASC 8 (2008), 3–5.
15. The stream cipher HC-128, New Stream Cipher Designs, Lecture Notes in Computer Science, Springer, 4986 (2008).
16. A new stream cipher HC-256, International Workshop on Fast Software Encryption, Springer, (2004), 226–244.
17. Daemen and Clapp, Fast hashing and stream encryption with PANAMA, International Workshop on Fast Software Encryption, Springer, (1998), 60–74.
18. Boesgaard et al., Rabbit: A new high-performance stream cipher, International Workshop on Fast Software Encryption, Springer, (2003), 307–329.
19. Berbain et al., Sosemanuk a fast software-oriented stream cipher, New Stream Cipher Designs, Springer, (2008), 98–118.
20. Bernstein D.J., The salsa20 family of stream ciphers, New Stream Cipher Designs, Springer, (2008), 84–97.
21. Rogaway and Coppersmith, A software-optimized encryption algorithm, Journal of Cryptology 11(4) (1998), 273–287.
22. Wheeler D.J., A bulk data encryption algorithm, International Workshop on Fast Software Encryption, Springer, (1993), 127–134.
23. Gilbert, The security of one-block-to-many modes of operation, International Workshop on Fast Software Encryption, Springer, (2003), 376–395.
24. FIPS 81, DES modes of operation, U.S. Federal Information Processing Standards Publication, Department of Commerce/National Bureau of Standards, (1980).
25. NIST SP 800-38A, Recommendation for block cipher modes of operation, NIST Special Publication 800-38A, (2001).
26. ISO/IEC 10116, Information technology-security techniques-modes of operation for an n-bit block cipher, International Organization for Standardization, (1997).
27. Koblitz, A course in number theory and cryptography, Springer Science & Business Media 114 (2012).
28. Balph and Semiconductor, LFSR counters implement binary polynomial generators, EDN 43(11) (1998), 155–160.
29. Bluetooth S.I.G., Specification of the bluetooth system, Version 1.1, (2001).
30. Ekdahl and Johansson, A new version of the stream cipher SNOW, International Workshop on Selected Areas in Cryptography, Springer, (2002), 47–61.
31. Arnault and Berger T.P., F-FCSR: Design of a new class of stream ciphers, International Workshop on Fast Software Encryption, Springer, (2005), 83–97.
32. Klapper and Goresky, Feedback shift registers 2-adic span and combiners with memory, Journal of Cryptology 10(2) (1997), 111–147.
33. Klimov and Shamir, A new class of invertible mappings, International Workshop on Cryptographic Hardware and Embedded Systems, Springer, (2002), 470–483.
34. Klimov and Shamir, Cryptographic applications of T-functions, International Workshop on Selected Areas in Cryptography, Springer, (2003), 248–261.
35. Kang et al., Distinguishing attack on SDDO-based block cipher BMD-128, Ubiquitous Information Technologies and Applications, Springer, (2014), 595–602.
36. Menezes A.J. et al., The Handbook of Applied Cryptography, Fifth Printing, CRC Press, 2001.
37. D’Agostino R.B., Tests for the normal distribution, Goodness-of-fit techniques, (1986), 367–419.
38. Maurer U.M., A universal statistical test for random bit generators, Journal of Cryptology 5(2) (1992), 89–105.
39. Knuth D.E., The art of computer programming, Semi Numerical Algorithms, Addison Wesley, 2 (1969).
40. Golomb S.W., Shift register sequences, Aegean Park Press, 1982.
41. NIST SP 800-22, A Statistical test suite for the Validation of random number generators and pseudo random number generators for cryptographic applications, (2000).
42. Marsaglia, The marsaglia random number CDROM including the diehard battery of tests of randomness, Florida State University, 1995.
43. Caelli et al., CRYPT-X statistical package manual-measuring the strength of stream and block ciphers, Queensland University of Technology, 1992.
44. Gérard, Cryptanalyses statistiques des algorithmes de chiffrement à clef secrète, Ph.D. Dissertation, Université Pierre et Marie Curie-Paris VI, 2010.
45. Junod, Statistical cryptanalysis of block ciphers, (2005).
46. Wiegold, Cipher systems: The protection of communications, (1983).
47. Klima et al., Applications of abstract algebra with Maple and MATLAB, CRC Press, (2006).
48. Cochran, For Whose Eyes Only? Cryptanalysis and Frequency Analysis, Department of Mathematics, US Military Academy.
49. Ganesan and Sherman A.T., Statistical techniques for language recognition: An introduction and guide for cryptanalysts, Cryptologia 17(4) (1993), 321–366.
50. L’Ecuyer and Simard, TestU01: A C library for empirical testing of random number generators, ACM Transactions on Mathematical Software 33(4) (2007), 22.
51. Harmouch and El Kouch, A statistical analysis for high-speed stream ciphers, International Conference on Innovations in Bio-Inspired Computing and Applications, Springer, (2017), 339–349.
52. Beker and Piper, Cipher systems: The protection of communications, Northwood Books, 1982.
53. Hao and Min, Statistical tests and chaotic synchronization based pseudorandom number generator for string bit sequences with application to image encryption, The European Physical Journal Special Topics 223(8) (2014), 1679–1697.
54. FIPS PUB 140-2, Security requirements for cryptographic modules, NIST, (2007).
