Synthesizing DNA molecules with identity-based digital signatures to prevent malicious tampering and enabling source attribution

Abstract

DNA molecules are increasingly being synthesized in the laboratory. A major concern in this domain is that a malicious actor can potentially tweak a benevolent synthesized DNA molecule and create a DNA molecule with harmful properties (Biodefense in the Age of Synthetic Biology (2018) The National Academies Press). To detect if a synthesized DNA molecule has been modified from the original version created in the laboratory, the authors in (In Proceedings of the New Security Paradigms Workshop (2018) ACM) proposed a digital signature protocol for creating a signed DNA molecule. However, several challenges arise in more complex molecules because of various forms of DNA mutations as well as size restrictions of the molecule that impact its properties. The current work extends (In Proceedings of the New Security Paradigms Workshop (2018) ACM) in several directions to address these problems. A second concern with synthesized DNA is that it is an intellectual property. In order to allow its use by third parties, an annotated document of the molecule needs to be distributed. However, since the molecule and document are two different entities, one being a physical product and the other being a digital one, ensuring that both are distributed correctly together without tampering is challenging. This work also addresess this problem by transforming the document into a DNA molecule and embedding it within the original molecule together with the signature.

Keywords

Cyber-bio security identity-based signatures Reed–Solomon codes pairing-based cryptography synthetic DNA

1. Introduction

Synthesizing DNA molecules in the laboratory is quite common these days. Such a synthetic DNA molecule is often a licensed intellectual property. DNA samples are shared between academic laboratories, ordered from DNA synthesis companies and manipulated for a variety of purposes, for example, to create new biochemicals, reduce the burden of diseases, improve agricultural yields or simply to study the DNA’s properties and improve upon them. There have also been instances of new biological organisms that do not exist in the natural world being created using synthesized DNA [17]. While the vast majority of such activities are pursued for beneficial purposes, there are concerns that malicious users can use the technology malevolently, for example, to make harmful biochemicals, or render existing bacteria more dangerous [17]. Recently, a DNA-based security exploit was demonstrated as a proof-of-concept, where a synthesized DNA was used to attack a DNA sequencer that has been deliberately modified with a vulnerability [18]. Preventing such malicious use of synthesized DNA is beyond the scope of this current work. However, attribution of a physical DNA sample and establishing proof of origin can contribute significantly to deter such malicious activities.

Following the anthrax attack of 2001, there is an increased urgency to employ microbial forensic techniques to trace and track agent inventories. For instance, it has been proposed that unique watermarks be inserted in the genome of infectious agents to increase their traceability [12]. The synthetic genomics community has demonstrated the feasibility of this approach by inserting short watermarks into DNA without introducing significant perturbation to genome function [6,8,15,24]. The use of watermarks has also been proposed in order to identify genetically modified organisms (GMOs) or proprietary strains. Heider et al. [7], for example describe DNA-based watermarks using DNA-Crypt algorithm. This technique is applicable to provide proof of origin to a DNA molecule. However, there is a major shortcoming with all watermark based approaches. The watermark in all these works is generated from an arbitrary binary data and added to the original sequence, and so is independent of the original sequence and provides no integrity of the actual DNA sequence.

To enable effective trace back and eliminate the limitation of watermark-based approaches, Kar et al. [13] had proposed a scheme to create digital signatures of DNA molecules in living cells. The main idea is as follows: Take a DNA molecule and sequence it. The result is a string over the alphabet A, C, G, and T, representing the four nucleotide building blocks of DNA. The output of the sequencer is stored in what is called a FASTA file. For interpretability reasons, the FASTA file is annotated by the researcher to create another file called the GenBank file. The authors then use Shamir’s identity-based signature scheme [27], Reed–Solomon error-correction codes [22,23] and the 16 digits Open Researcher and Contributor ID (ORCID – https://orcid.org) of the researcher to create a digital signature of the string in the FASTA file. The resulting signature is in the form of a DNA sequence which is now synthesized as a physical molecule. Finally, the signature molecule is inserted into the original DNA molecule using DNA editing tools to obtain a signed DNA molecule. When this signed molecule is shared, a receiver can sequence the signed molecule to verify that it was shared by an authentic sender and that the sequence of the original molecule has not been altered or tampered with.

However, there are significant challenges related to the placement of the signature within the molecule and various types of mutations in more complex molecules that Kar et al. do not address (discussed in more details in Section 2). The current work improves the previous scheme to address these problems (Sections 4 and 5). Moreover, we would like to shorten the size of the signature sequence as much as possible without impacting security. While biologists believe that the size of the DNA has a correlation with its properties within certain bounds, they still do not know by how much a DNA molecule can be expanded without changing the properties of interest. The current work explores other cryptographic algorithms towards this end as well as signature compression schemes (Section 6).

In order to verify a signature, the verifier needs to first identify which part of a signed message is the signature. There are three choices – before the actual message, after it or within it. If the signature is placed within the message, identifying the signature becomes problematic and needs to be addressed during the signature placement time. (The other two cases when the signature is placed before or after the message, are easily addressed by using a counting-length approach.) Unfortunately, when working with DNA molecules, it may not always be possible to place the signature at the end or the beginning of the message. This is because there can be a feature present at these locations. The possible places to place the signature are most likely to be somewhere within the original sequence. For this reason the file that documents the DNA molecule needs to be also accessible to the receiver. Now, these molecule documentation files are created using gene editors that maintain databases of DNA molecule properties. It is conceivable that the databases are updated with location of signature in the molecule. However, these databases may not be consistent across different editors in the sense that receiver’s gene editor may not have all the information about the same set of molecules that the sender’s gene editor has. In addition, there is no universal file format used by gene editors, unlike the FASTA file format.

The document is needed not only to identify and retrieve the signature but also to address the problem of sequence alignment. Thus, we recommend sending the document created by the creator of molecule be also sent to the receiver. This is described in [13]. The updated algorithm described in Section 4 takes into account for any circular or double helix permutations, and the verifiers sequence need not be in perfect alignment with the senders sequence. The purpose of sharing the digital file is not just limited to aligning sequences at the receivers end. The digital file contains descriptions about the physical DNA. There are gene editing software such as Snapgene which can be used to generate documentations or annotations about a sequence. A user can sequence a DNA, provide the FASTA file containing just the raw sequences to the software. The software searches its database for descriptions matching any subsequence within the FASTA file and generates a genbank file which contains the descriptions (from the database) and the original sequence. Refer to Fig. 6 and Fig. 5, the genbank file is generated from the FASTA file, the keyword “ORIGIN” separates the actual sequence and the descriptions. Although the software like Snapgene can generate documentation for some sequences, it cannot describe what is not present in its database. For this, the software have the flexibility of user addition, deletion and update. The user can add more descriptions, can delete wrong descriptions populated by Snapgene, and update a description to include or exclude information.

When a DNA sample is shared, the receiver can only generate the descriptions that the software can generate using its database. The extra descriptions written by the sender cannot be obtained by the receiver unless the digital file is shared. For this reason the digital file also needs to be shared with the receiver along with the physical sample. However, this file, being a separate entity from the DNA molecule, is not strongly tied to the molecule. Sending it separately from the molecule can result in the receiver using a different document file from the one that creator generated. Thus, we propose embedding the document in the DNA molecule itself and sending the whole signed and documented molecule to the receiver. Only this way would the receiver be obtain all the descriptions which the sender intended to share. However, note that placing the document within the molecule brings up the same issue of locating it as the signature. In this work, we develop the necessary techniques to generate, embed, retrieve and validate signatures as well as document in/from the molecule. The details of these procedures are explained in Sections 4, 5 and 8.

We would like to note here that there can sometimes be reasons why the originator may not want to share the entire document file. While the creator may be willing to divulge the properties of the synthesized DNA as a whole, s/he may not be willing to divulge properties of some sub-sequences because of reasons related to protecting intellectual properties or preventing mal-actors from exploiting some of the properties for nefarious purposes. Sending the entire document file may jeopardize these confidentiality needs. However, we do not address these concerns in the current work but is left as a future topic. We assume that if and when the creator is willing to share the document (may be by encrypting parts that cannot be revealed to the receiver) it is embedded in the molecule.

2. Limitations of earlier work and current contributions

2.1. Cyclic shifts and reverse complement

In [13], the signer is required to send the GenBank file along with the physical DNA sample to the receiver. This is because the GenBank file is needed to align the FASTA file (which is the output of a DNA sequencer) in the same order as during the signature generation. Plasmid DNA is cyclic and double-stranded. Following DNA sequencing, any cyclic permutation of the DNA structure is possible. A sequence represented in a FASTA file, and consequently the GenBank file, is thus one of several possible linear representations of a circular structure. For example, in a FASTA file if the sequence was “ACGGTAA”, and the same sample is sequenced again, the FASTA file might read as “TAAACGG”.

Moreover, since DNA is composed of two complementary, anti-parallel strands, a DNA sequencer can read a sample in both the “sense” or “antisense” direction. The sequence may be represented in a FASTA file in either direction. When the sample is sequenced again, the output might be in the other direction, or what is known as the reverse complement. The reverse complement of “A” is “T” and vice-versa, and the reverse complement of “C” is “G” and vice-versa. The DNA molecule has a polarity with one end represented as 5’ and the other represented as 3’. One strand adheres to its reverse complement in an anti-parallel fashion. So if the sequence is – “5’-ACGGTAA-3’”, the reverse complement is “3’-TGCCATT-5’”. The FASTA file will represent one strand of the DNA sequence in the 5’ to 3’ direction; so the FASTA file could read as “ACGGTAA” or “TTACCGT”. Thus, by combining these two properties, for a DNA that contains N number of bases, the possible number of correct representations of the same sample is $2 N$ : N cyclic permutations plus each reverse complement.

Let us now consider the implications of this characteristic of DNA on the signature generation and verification. The sender has a sequence say “ACCGTT”. The sender synthesizes the sequence and sends it to the receiver. The receiver after sequencing with an automated DNA sequencer may not have exactly “ACCGTT”. It can be “TTACCG” which is a cyclic permutations. The receiver can also get something like “AACGGT” which is the reverse complement of “ACCGTT”. Owing to such domain challenges, the signature verification procedure is not as simple as in digital messages.

Let us assume the signature sequence is “TTAA”. (The actual signature length is 512 base pairs). In [13], the authors had defined a start and an end tag which served as delimiters for the signature. Let “ACGC” and “GTAT” be the start and end tags. For this discussion, we will use the term message to denote some linear representation of the sequence generated by a DNA sequencer. There can be three cases for including the signature sequence in the DNA sequence:

Append the signature after the message: In this case, the sender’s message with the signature embedded looks like – “ACCGTT ACGC TTAA GTAT”. The receiver, after sequencing the signed DNA sample may get something like – “GTT ACGCTTAA GTAT ACC” or something else depending on which base position the sequencer considers as the beginning of the sequence. In the permutation, the DNA sequencer assumed the 4th base from the left as the start of the sequence. The message is split but the delimiters and signature are intact. The simplest way to extract the message and signature is to append the extracted sequence to itself. With the permutation, this becomes “GTT ACGC TTAA GTAT ACC || GTT ACGC TTAA GTAT ACC”. Now we can extract the message which will be contained between two “ACGC TTAA GTAT” when the string is wrapped around. The receiver reconstructs the message which is “ACCGTT”. The receiver can then invoke the verification. Note that this scheme works no matter which position the sequence considers as the start of the sequence.

Append the signature before the message: In this case, the sender’s message with signature looks like – “ACGC TTAA GTAT ACCGTT”. The receiver after sequencing the DNA might get something like – “AA GTAT ACCGTT ACGC TT”. We observe that this is the same as the previous case. We can append the extracted sequence to itself – “AA GTAT ACCGTT ACGC TT || AA GTAT ACCGTT ACGC TT. Thus we can extract the message using the same procedure as above and then invoke the verification.

Append the signature between the message: In this case, the sender’s message with signature might look like – “ACC ACGC TTAA GTAT GTT”. The receiver after sequencing the DNA might get something like “ACGC TTAA GTAT GTT ACC”. The problem occurs in this scenario. Even if we append the extracted sequence, we will not be able to recover the message. After appending the sequence we get “ACGC TTAA GTAT GTT ACC || ACGC TTAA GTAT GTT ACC”. We can observe that the sequence contained by the two “ACGC TTAA GTAT” is “GTTACC”. This is not the message the sender signed. The sender signed the message on “ACCGTT”. But the receiver has no way of knowing this and hence the verification will fail since the message is not the same even though there is no modification to either the message or the signature.

The problem of recovering the message only occurs when the signature is placed within the message. The other two cases when the signature is placed before or after the message works perfectly fine. However, when working with DNA molecules, it may not always be possible to place the signature at the end or the beginning of the message. This is because there can be a feature present at that location. The possible places to place the signature are most likely to be within the original sequence. For this reason the GenBank file needed to be shared. Only this way would the receiver be able to align the sequence in the same order that the sender had when he signed.

There are several reasons why we may not want to share the GenBank file. The GenBank file is created by the originator of the DNA molecule using a gene editor. Its only purpose is to annotate the DNA sequence. If the DNA is an intellectual property, then the creator of the DNA will be annotating the DNA’s GenBank file with different features of different subsequences of the DNA. While the creator may be willing to divulge the property of the synthesized DNA as a whole, s/he may not be willing to divulge properties of various subsequences. Sending the GenBank file jeopardizes the latter. Moreover, gene editors maintain databases of DNA molecule properties. However, these databases may not be consistent across different editors in the sense that receivers gene editor may not have all the information about the same set of molecules that the sender’s gene editor has. Finally, the GenBank file format is not the only format used by gene editors, unlike the FASTA file format. In order to not share the GenBank file with the receiver, we have changed the signature generation procedure in this work, such that the verification is not dependent on where the signer placed the signature. The details of the new signature generation procedure are explained in Section 4.

2.2. Mutations in identifying tags

In our previous work, we defined two identifying tags to demarcate the signature. The start tag was chosen as “ACGCTTCGCA” and the end tag as “GTATCCTATG”. These two delimiters were chosen not just randomly but for very specific reasons. First biologists typically have some idea about what DNA sequence will not occur in their specific project. Thus they can choose delimiters from these non-occurring sequence. Second, from these possible delimiters, they will choose the ones that are simple to synthesize and assemble since DNA synthesis is expensive. Finally, they will choose a sequence that are easy to identify visually, are unlikely to develop secondary structures and have a balanced number of “A, C, G and T”s. Our domain experts selected these delimiters for this project. We also used error correction code to tolerate mutations within the DNA. However, we assumed that the start and end tag do not mutate. If they do, our previous work will fail to locate the signature and consequently, it will not be possible to verify the signature.

To overcome this limitation, in this work we propose using partial matching techniques such that the start and end tag can be located approximately. This is used in conjunction with error correction codes. Note that since the start and end tags are fixed, we know what we are searching for in the DNA molecule. For example, we may want to look for strings similar to “ACGCTTCGCA” such as “GCGCTTCGCG”. The different techniques we use for achieving this are discussed in Section 5. It has to be noted that the error-correction code that is used can only tolerate substitution errors, whereas the partial matching technique can work for any type of errors.

2.3. Signature length

The length of the signature plays a very important role in this biology domain. Shorter signatures imply less cost of synthesizing the signature into a physical DNA molecule. Shorter signatures will also be less likely to impact the existing functionality and stability of the plasmid during signature embedding. Previously, we used 1024 bit keys and that resulted in 512 base-pair signature. However, 1024 bit keys are no longer considered very strong and not recommended in practice for digital signatures. Generally, 2048 bit keys are used. In our domain, this would result in a 1024 base pair signatures. This length has a higher probability of affecting the characteristics and stability of the plasmid. Furthermore, when synthesizing the signature, presently with a 512 base pair signature the cost is $46.08 – 512 base pairs at $0.09 per base pair. With a 1024 base pair signature, even if the plasmid remains stable and functional, the cost of synthesizing the signature would be $92.16. The new signature scheme with a shorter signature is described in Section 6.

3. Overview of the DNA signature workflow

We begin by providing a threat model for this work followed by an overview of the DNA signature workflow. Even though the approach is not limited to plasmid DNA, we use “plasmid” to refer to physical DNA molecules that are being created, used or modified in the laboratory. We use the term “sequence” to mean the digital representation of the DNA molecule as a series of nucleotides represented by the letters A, C, G, and T. Note that biologists use the term “sequence” to refer to the DNA molecule or to the data representing the DNA molecule. In the context of this work, we limit the term “sequence” to mean the digital representation to avoid any ambiguity. The documentation, digital signing and verification processes in this work are all conducted on the sequence. The document as well as signature are digital sequences that are converted back to physical molecules and embedded in the DNA molecule.

3.1. Threat model

Biological threats: The security of the biotechnology research enterprise depends on trustworthy supply chain transactions [21]. For example, a receiving party may be interested in getting a plasmid or other genetic material from a vendor, a client, or a collaborator. This physical entity will come as few micrograms of white powder or as a few microliters of an aqueous solution. It will be packaged in a small tube with a small label that will provide some information about the tube content possibly in the form of a barcode and sample name. This data points to another source of information such as a product datasheet or a publication for instance. This information is necessary to be able to use the genetic material.

The link between the physical sample and the data describing this sample is very tenuous. Spontaneous mutations can lead to random modifications of the genetic material not reflected on the label or data sheet. Labelling errors can lead to a gross misrepresentation of the tube content. There is no assurance that the information printed on a label will point to a detailed description of the genetic material or that the description of the genetic material will be accurate.

There are a number of common scenarios in which the receiving party may work with genetic material based on misleading information available on the tube label. In a research context, the most likely outcomes of this situation are wastes of time and laboratory processes resulting from experiments that don’t work. In a clinical environment, labelling errors can harm patients and labelling errors are one of the major causes of drug recalls https://www.fda.gov/drugs/drug-safety-and-availability/medication-errors-related-cder-regulated-drug-products. Beyond the accidental labelling error, it is also possible to imagine scenarios in which a nefarious actor takes advantage of the supply chain weaknesses to intentionally misrepresent the nature of a physical sample with the goal of inflicting a loss to the receiving party.

By encoding identifiers of the sender and the genetic material, this technology makes it possible to verify the authenticity and integrity of the genetic material without relying on the information that may be available on the label.

Cyber threats: The main goal of this work is to protect against the aforementioned biological threats. We do this by using the capabilities of a digital signature and error correction codes. For the cyber part of this work, we assume a polynomial-time adversary, Mallory, who is trying to forge the signature of a reputed synthesized DNA molecule creator, Alice. Alice is trying to protect her IP rights/reputation as she distributes DNA molecules synthesized by her to researcher Ellen. If the attacker, Mallory, is able to forge the signature of Alice then: (a) Mallory can replace the actual DNA created by Alice with her own but keep the signature intact. (b) Mallory can create her own DNA molecule and masquerade as Alice to sign it. (c) Mallory can modify parts of the signed DNA molecule created by Alice. Therefore, for our proposed signature scheme to be secure, we need to show that no polynomial-time adversary can forge a genuine signature without knowing the secret used to sign.

The workflow presented below is enabled by an identity-based signature (IBS) key authority, a trusted entity that creates and distributes keys needed for an identity-based signature scheme. We assume an out-of-band secure channel for such key distribution. We assume that the central authority is secure and trusted by all participants in the system. The description of this central authority is beyond the scope of this article.

Legal threats: There are also a number of legal threats that this approach may potentially be able to protect against. It may limit product liability. A vendor could use this technology to limit representations of functionality to products for which the signature can be validated as opposed to the looser criteria of similarity that are often used in biotechnology. This technology could also be used to help comply with material transfer agreements. Finally, this technology could be used to ensure deniability or facilitate attribution. For example, a research institution such as a government lab may decide to enforce a policy to systematically sign their genetic material. Compliance with this policy could help detect rogue agents who would try to avoid compliance with signature policies. It would also limit the risks of exposure to damaging investigations by law enforcement.

Limits of the technology: The technology does not prevent rogue agents from developing unsigned DNA sequences that may have damaging effects. The situation is similar to digital signature of software. There is no way to force every software developer to sign their codes but digital signature technologies of applications make it possible to develop a trustworthy ecosystem like the Apple App store. Widespread adoption of digital signature technologies limits the risks associated with anonymous applications without completely eliminating it. Similarly, the widespread adoption of digital signatures for DNA sequences would raise the security standards but cannot completely eliminate the inherent risks associated with synthetic DNA molecules.

Recently authors have proposed to use a machine learning approach for attributions of DNA sequences [19]. Digital signatures complement inferential methods by allowing to associate sequences with specific individuals instead of a research group. They also establish the relation between a sequence and an individual with certainty instead of simply providing a likelihood of association. Finally, digital signatures do not require the availability of large datasets. Two parties can securely exchange genetic material without relying a large dataset such as the Addgene repository. However, digital signatures require a minor modification of the genetic material and access to a central authentication authority.

3.2. Signature generation and validation workflow

The DNA signature and validation workflow is presented in Fig. 1. There are four major steps in the workflow involving Alice, the Originator, Ellen, the Researcher/User, a Doc creator service, a Signer service, a Doc reader service, and a Verifier service. Note that these services are applications that can either reside on the machines used by Alice and Ellen or can be on remote service providers.

Alice, the Originator, uses a computer software to produce a bioinformatics files that include both the plasmid sequence and its digital documentation. The digital documentation is converted into a DNA sequence by the Doc creator service. Then, the Signer service creates an identity-based digital signature of the sequence that combines the original plasmid sequence and the DNA-version of the documentation.

Ellen, the researcher/user, obtains a plasmid for her use. The plasmid is received via a out-of-band physical communication channel. Ellen uses a DNA sequencer instrument and bioinformatics tools to produce the sequence of the received plasmid.

Ellen imports this sequence into the Verifier service that validates the signature and returns Pass/Fail message. If Fail, the Verifier service also provides explanation of failure.

If the sequence has been validated by the Verifier service, it is passed to the Doc reader service that decrypts the documentation from a sequence and converts it to a standard bioinformatics format.

Fig. 1.

DNA sign–share–validate workflow.

Fig. 2.

View of sequenced pUC19 plasmid in SnapGene editor before and after signing.

The steps of the DNA sign–share–verify workflow is shown in Fig. 1. Alice creates a plasmid in her laboratory, submits her ORCID and the plasmid ID to the IBS Authority and gets a corresponding IBS token that is provided to the Signer service. She also creates a sequence for this plasmid (FASTA file) . Next she uses the Doc creator service to create a documentation of the sequence (a GENBANK file). She uses the Signer service, to create a DNA signature sequence that she will add to her design. This sequence is the digital signature and the documentation about the plasmid. It is generated using the signature algorithm described in Section 4 and the documentation algorithm described in Section 8. The digital signature and documentation is then inserted in the sequence between two demarcating sequences used to identify the signature from the rest of the original plasmid’s sequence. We rely on the biological properties of the DNA molecule to determine what is the best position within the molecule where the signature-documentation sequence can be inserted.

Biologists have observed that there are segments within a DNA molecule that do not contribute to the properties of the molecule that are of interest – the so called “junk DNA”. (In fact, one of the roles that the DNA documentation plays is to identify what are the segments that contribute to the biological properties.) Note that this is not to say that the “junk DNA” does not contribute any properties. It is just that those properties are not of interest to the biologist. The signature-documentation sequence is inserted around such segments. We would like to emphasize two points here: (i) No segment is removed from the original molecule. This means that when the signature-documentation sequence is inserted there is an expansion in the size of the DNA molecule (see Fig. 2). Biologists have experimentally observed that DNA molecules can tolerate some expansion in size without affecting the properties of interest. The tolerable fraction of increase is a function of the size of the original molecule. However, since we do not know the exact fraction of expansion that can be tolerated, a significant focus of this work is in reducing the size of the signature-docmentation sequence. Most of the effort has been spent on experimentally determining what works. We urge the reader to keep this context in mind while reading this paper. (ii) It is possible that some DNA molecule does not allow any expansion in the size. In such cases, no signature-documentation sequence can be inserted and this work is not relevant for such molecules. For the rest of the paper, we assume that we are guaranteed to find a spot within the molecule where the signature-documentation sequence can be inserted. If not, the process exits.

Once the signature-documentation sequence is inserted, the Signer service returns another sequence in the form of a GENBANK file which contains the original plasmid sequence combined with the signature and documentation sequence. Alice then creates a new plasmid (probably by getting the service from a gene synthesis company) corresponding to this signature and documentation sequence. Alice communicates about the original plasmid by using the plasmid ID used to identify the plasmid in the signature and claims ownership of it using her ORCID. When requested she sends the signed plasmid to the User through some physical means (in the biology domain, plasmids are shared. for example, as bio-solutions in test tubes).

Ellen gets the plasmid from some source (may or may not be directly from Alice). Ellen has limited confidence in the plasmid because it came in a hand-labeled tube. So, she decides to get it sequenced completely before doing anything with it. She uploads this sequence of the signed plasmid to the Verifier service to verify the plasmid signature. This uploaded file is a FASTA file. The Verifier service identifies the signature inserted between the two signature tags. The Verifier service will proceed with the signature validation as discussed in Section 4. If the signature is correct, indicated by a PASS response from the Verifier, Ellen will know that the plasmid was signed by Alice and that the physical sequence of the received plasmid corresponds exactly to Alice’s design. Also after successful validation, the Verifier service sends the GENBANK file containing the descriptions and signature that Alice embedded within the plasmid to the Doc reader service, which reveals the original sequence and the corresponding sequence (as allowed by Alice).

Alternatively, the validation service might have determined that the signature was invalid. Several hypotheses could lead to this situation. It is possible that Alice was sloppy and did not manage to assemble the plasmid corresponding to the sequence she had designed. It may have maliciously changed. One could also not rule out the possibility of spontaneous mutations or a labeling error. In this situation, Ellen may decide to proceed with the plasmid based on the similarity of the plasmid sequence and the information available as discussed in Section 5.

For the curious reader we show in Fig. 2 the two sequenced versions of the pUC19 plasmid that we used in our work, both before the plasmid being signed (Fig. 2(a)) and after (Fig. 2(b)). The figures were created via the SnapGene editor that used as input the sequenced versions of both the un-signed and signed molecules. Note in Fig. 2(b), the increased size of the DNA molecule because of the embedded signature.

4. DNA signature generation and verification procedure

To present the rest of the work incrementally, we start with the discussion of the core DNA sign–share–validate process. We consider the three major players: (i) The Signer will create the DNA signature and sign a DNA sequence. (ii) The Verifier will use the signature to verify whether the received DNA sequence was sent by the appropriate sender and was unchanged after signing. (iii) A trusted central authority, the IBC key authority, provides the signer with an encrypted token that is associated with the DNA originator’s identity. The token contains the originator’s private key.

Use of error correction in DNA signature: In the digital domain, the digital signature on a message can be used to detect integrity violations. If a violation is detected, the sender can always re-transmit the signed message without incurring much extra cost. However, in the DNA world, we are primarily shipping physical DNA samples. This implies that if a DNA signature identifies that there is an error in the signature validation, then the sample needs to be physically transported and/or synthesized again. This incurs significant cost. DNA mutation is a very natural and common phenomenon. Thus, there is a good likelihood that signature validation will fail. Moreover, associated with the problem of mutation lies the problem of sequencing. When the DNA is processed by an automated DNA sequencer, the output is not always one hundred percent correct. It is dependent on the depth of sequencing, and increased sequencing depth means higher costs. Sequencing a small plasmid to sufficient depth is relatively inexpensive, but for larger sequences, sequencing errors can be an issue. In order to overcome these limitations, we use block-based error correction codes, such as a Reed–Solomon code [23], together with signatures. The presence of error correction codes helps the receiver to locate a limited number of errors (as set by the signer) in the sequenced DNA as well as correct them. The position of the errors and the corrected values are conveyed to the verifier. The verifier can then decide if the errors are in any valuable feature of the DNA or not. If a valuable feature has been corrupted, the verifier can ask for a new shipment, else if the error was in a non-valuable area in the DNA, the verifier can disregard the error and continue to work with it. Generally, there are three types of mutations that occur within a DNA – substitution, insertion and deletion. The error-correction code that is used here only deals with substitution errors.

We now describe our new DNA signature scheme. The steps are shown in Algorithm 1 (for signing) and Algorithm 2 (for verification). To avoid confusion we use the following conventions. The term sample is used to indicate the physical DNA molecule. The term sequence is used to signify the digital counterpart of a DNA molecule. This is generated by sequencing a sample in a DNA sequencer. The raw sequence (output of sequencing) is stored in a FASTA file. The annotated sequence is stored in a GenBank file. The signer creates a physical DNA sample from the signed sequence and sends the sample (only) to the verifier. The verifier sequences this sample to get another sequence that is then verified.

For ease of understanding, we denote the sequence to be signed by the string SEQUENCE, the signature by SIN, the begin and end tags as BESN and EDSN and the error correction code as ECC. Each of these strings is really a sequence of bases that can be synthesized into a physical DNA molecule and embedded in the sample. Any location reference in SEQUENCE for subsequence discussion is specific to the location within the sequence. For instance, location 3 in the string contains character Q. However, in the real sequence, the subsequence denoted by Q may occur in position 350 (for example) depending on how many bases constitute S and E.

Algorithm 1:

DNA signature algorithm accommodating cyclic shifts, reverse complement and mutating tags

Algorithm 2:

New signature verification procedure

Signature generation: The signature generation procedure begins by scanning the GenBank file for the keyword ORIGIN and locating the actual DNA sequence. Let there exists a feature from location 1 to 3 in the sequence, which corresponds to SEQ. Next, the location of the signature placement specified by the signer is checked. If the location collides with a feature, the user is alerted to change the location. In our example, if the user had provided 2, the algorithm will alert the user that there is already a feature SEQ there and ask for a new location. If the user chooses 4 which is after Q, it will be allowed. Next, the ORCID and Plasmid ID (which are integers) are converted to the corresponding A C G T sequence by the following conversion method – [0 – AC, 1 – AG, 2 – AT, 3 – CA, 4 – CG, 5 – CT, 6 – GA, 7 – GC, 8 – GT, 9 – TA]. The reason for choosing this conversion type is that if any ORCID or Plasmid ID has repetitions e.g. if ORCID is 0000-0001-4578-9987, the converted sequence will not have a long run of a single base. Long runs of a single nucleotide can result in errors during sequencing. Let the converted ORCID and Plasmid ID sequences be ORCID and PID respectively.

To account for the problem of placing the signature within the sequence mentioned earlier in Section 2, the signature is generated on the hash of a tweaked version of the sequence. We left rotate a copy of the sequence by $n - 1$ where n is the location within the sequence where the signature needs to be placed. For this example, the sender wants to place the signature after Q. The sequence will be shifted as – UENCESEQ. The signature is generated on the hash of the left rotated sequence UENCESEQ. The signature bits are then converted to A C G T sequence. Let this signature sequence be SIN. Let the start tag be BESN and end tag be EDSN. The signature sequence is concatenated with ORCID and PID and then placed between the start and end tags as BESN ORCID PID SIN EDSN. This entire string is then placed at the position specified by the user. We chose 4 in our example. Hence, the signed sequence looks like – SEQ BESN ORCID PID SIN EDSN UENCE.

Next, this sequence is passed into the error correction encoder. According to the number of tolerable errors specified by the user, the error correcting parity bits are generated. These parity bits are then converted to some A C G T sequence. Let this sequence be ECC. When the encoder output is generated, the sequence would look like – SEQ BESN ORCID PID SIN EDSN UENCE ECC. Next, the ECC is separated and is placed before the signature and end tag. So the final output sequence is – SEQ BESN ORCID PID SIN ECC EDSN UENCE. Note that the error correction code is generated after generating the signature sequence and combining with original sequence. Hence any error in that string can be corrected provided it is within the tolerable limit. For instance, if we put 2 as our error tolerance limit, then any 2 errors within the string SEQ BESN ORCID PID SIN ECC EDSN UENCE can be tolerated. If there is 1 error in SEQ and 1 error in SIN, or 2 errors in SIN, or 1 error in SIN and 1 error in ECC, these can be corrected. But if there are more than two errors it cannot be corrected. The final output sequence – SEQ BESN ORCID PID SIN ECC EDSN UENCE is written into another GenBank file. The descriptions are updated i.e. the locations of the signature, start, end, ecc are added and if there were features after location 4 in the original DNA, the locations of these features are also updated. This GenBank file is for reference of the sender. It is not required for signature verification and there is no need to share it with the receiver unless there are other reasons. The output sequence is now synthesized into the signed DNA sample.

Signature verification: The verification of the signature embedded in a DNA molecule starts with sequencing the DNA molecule using a DNA sequencer. DNA sequencers produce short reads of 100 to 200 bases of random locations in the DNA molecule. Sequencing reads include errors in which 1 base over 100 to 1000 bases is misread. Since sequencers produce a lot of overlapping sequencing reads, each region of the DNA molecule is read multiple times. Large sets of overlapping reads are processed by specialized software, sequencing assemblers, to produce an assembled sequences corresponding to the consensus sequence. We have provided a more detailed description of the analysis of sequencing data in another paper written for a biology audience [5]. We were pleasantly surprised by the quality of the assembled sequences we were able to produce. The error rate of the assembled sequence is good enough to meet the stringent requirements of digital signature validation. The signature verification procedure applies to the FASTA file produced by the sequencing assembler. It proceeds as described in Algorithm 2.

The receiver sequences the shared DNA using an automated DNA sequencer. The sequence in the FASTA file might not be the in the same order when the sender signed it. That is, after sequencing the shared DNA, the FASTA file may look like – ORCID PID SIN ECC EDSN UENCE SEQ BESN which is a cyclic permutation of the sender’s sequence.

The first step in the verification procedure is to extract the BESN and EDSN tags. If they are not mutated they are retrieved directly. If the tags cannot be located directly, we use Algorithm 3 to retrieve their closest matches and use them as BESN and EDSN tags. We defer the discussion on Algorithm 3 to Section 2.2. The verification step now will concatenate the FASTA sequence – ORCID PID SIN ECC EDSN UENCE SEQ BESN + ORCID PID SIN ECC EDSN UENCE SEQ BESN.

Now, it looks for 2 BESN tags and extracts the content between them. After obtaining the start tag, 32 bases are counted, this is the ORCID sequence, next 12 bases are counted, this is the plasmid ID sequence, then 512 bases are counted, this is the signature sequence. Next the substring after this signature sequence to the EDSN tag is retrieved, this is the error correction sequence. Finally, the substring between EDSN and BESN is the message for signature verification.

Until this point, we have retrieved UENCESEQ, ORCID, PID, SIN, and ECC. The UENCESEQ, ORCID and SIN is used for signature verification. With our previous signature generation method, since the message signed by the sender was SEQUENCE and the message retrieved by the verifier is ENCESEQ the hashes will be different and the validation would fail. With the new procedure, we can see that the although the sender’s file contained the sequence SEQUENCE, the signature was actually generated on the shifted UENCESEQ. Due to this shift, the retrieved sequence and the sender’s sequence will always be the same under any rotations. We have shifted the message of the sender to make the signature placement at the start of the message. We call this new generation scheme as force shift 0.

If the FASTA file contains the reverse complement of the sender’s DNA sequence, the entire FASTA file is reverse complemented and then we look for the BESN and EDSN tags. If there is a match, we arrive at the conclusion that the FASTA file contains the reverse complement. Then we start the same verification steps on the reverse complemented FASTA sequence.

5. Allowing mutations in start and end tags

The approximate matching technique, shown in Algorithm 3, breaks the entire string in which we are looking for the result into substrings of the length of the input string. Each of the broken substring in the larger string is assigned a score based on how similar it is to the input string. A match is inferred using the highest score. Now in the real DNA, we are looking for sequences of A, C, G, and T. So there might be a case that there are multiple close matches which means that there are multiple starts (or end) tags. In those cases, we use the end tags (or start tags respectively) to narrow our results. The following steps describe how the approximate matching technique works. There can be a total of four scenarios:

Case 1: No mutation in either start or end tags. – In this case, we can find the exact locations of the tags and hence approximate matching techniques are not needed. There can be mutations in any other place which will be handled by the error correction code.

Case 2: Mutation in BESN tag only. – In this case, the EDSN tag is found directly. The algorithm looks for the closest match to BESN. If there is a single match with the highest score, then we can be quite certain that the BESN tag has been located correctly. However, there can be multiple matches with close scores, i.e., there is no single stand out high score. In that case, we use the EDSN tag for further elimination of choices. We already know that the content within the start tag and the end tag is more than 556 base pairs. Hence we choose only those potential BESN tags which are at distance of 556 base pairs/characters or more away from the EDSN tag. The logic is set to 556 or more because the length of the error correction can be 0 if the user chooses no error correction.

Case 3: Mutation in EDSN tag only. – In this case, the BESN tag is found directly. The tool looks for the closest match to EDSN. As in case 2, if there is a single match with the highest score then we can be quite certain that the EDSN tag has been located correctly. For multiple matches with close scores, we use the same logic as described in case 2 above, using the distance between the BESN and EDSN tags to be more than or equal to 556 base pairs.

Case 4: Mutation in both BESN and EDSN tags. – In this case, we try to locate the closest matches for both tags. If there is a single match with the highest score for both of them then we can be pretty certain that we have located them both correctly. Also, we invoke the criteria of length more than or equal to 556 between them for more certainty. In case of multiple potential BESN and EDSN tags, we employ the length counting criteria for each BESN and EDSN tag pair possible from the obtained results and narrow down the results.

Algorithm 3:

Approximate matching of tags

5.1. Experimentally determining most suitable distance measures for string matching

Various techniques exist to handle matching of similar strings. These methods measure the distance between strings using a distance equation. One of the most important works in this field is the Levenshtein distance [14]. Other notable algorithms are Damerau–Levenshtein [3,14,30], Optimal String Alignment variant of Damerau–Levenshtein (sometimes called the restricted edit distance) [30], Jaro–Winkler edit distance [11], and Jaccard index [9,10].

We used all these five algorithms for the approximate start and end tag matching. One of the reasons for using all of the above was that we wanted to find out which would be most suited to the DNA domain. For testing, the FASTA file is taken as input and the start and end tag within the FASTA file are manually changed. Next, we search for the location of the defined start and end tags within the mutated FASTA file. The results for each algorithm are summarized on a case by case basis in Fig. 3. As can be seen from the figure, the Jaro algorithm was fairly inaccurate with an average accuracy of only 35.12 %. The Jaccard algorithm fared much better but was still imperfect with an average accuracy of only 95.18 %. All of the three Levenshtein variants were perfectly accurate in their assessment. These results indicate that if accuracy was the chief concern, either of the three Levenshtein variants would be ideal choices.

Fig. 3.

Accuracy of algorithms per case as a percentage.

Another important consideration in algorithm selection was speed. While an algorithm may be perfectly accurate in its selection of the closest match to a string it would not help much in practice if the algorithm has an unacceptably long run time. Towards this end, the execution time of the algorithms were also compared. To accomplish this each method was used to compare a series of one million random strings of a set length. A graph of the time in milliseconds (ms) for each algorithm is given in Fig. 4.

Fig. 4.

Runtime analysis of various algorithms in milliseconds.

As can be seen from Fig. 4, the Jaro–Winkler and Optimal String Alignment algorithms were the quickest, each growing at very slow rates with Jaro–Winkler being slightly faster overall. Taking both of these factors into consideration, we chose the Optimal String Alignment variant of the Damerau–Levenshtein algorithm [30] as our preferred method for string matching.

6. New identity-based signature scheme with shorter signature size

There are several identity-based digital signature schemes using pairings. Some of the notable schemes are: Sakai–Kasahara [25], Sakai–Ohgishi–Kasahara [26], Paterson [20], Cha–Cheon [2], and Xun Yi [29]. The Sakai–Kasahara scheme described two types of identity-based signatures. One is El-Gamal type and the other is Schnorr type. To identify the most appropriate scheme we first implemented all the above schemes using the Java Pairing Based Cryptography library (jPBC) [4]. We then investigated the signature lengths based on different types of curves that can be used. The time to generate and validate a signature depends on the type of the curve used. We evaluated both aspects: time to sign and verify, and the size of the signature using this algorithm for all the different types of curves present in the jPBC library.

Based on the signature size and the computation cost of signature generation and verification, we identified the best scheme to be the Sakai–Kasahara Schnorr type. We now describe the Sakai–Kasahara Schnorr type identity-based signature scheme. It has four steps: setup, extract, sign and verify.

Setup: The setup generates the curve parameters. The different curves provided in the jPBC library can be used to load the parameters. Let $g_{1}$ be the generator of $G_{1}$ , $g_{2}$ be the generator of $G_{2}$ . A random $x \in Z_{n}^{*}$ is chosen to be the master secret. Two public keys $P_{1}$ and $P_{2}$ are calculated as – $P_{1} = x \cdot g_{1}$ and $P_{2} = x \cdot g_{2}$ . An embedding function H is chosen such that $H {(0, 1)}^{*} \to G_{1}$ .

Extract: Takes as input the curve parameters, the master secret key x, and a user’s identity and returns the users identity-based secret key. This step is performed by the central authority for each user A with identity ${ID}_{A}$ .

For an identity ${ID}_{A}$ , calculate $C_{A} = H ({ID}_{A})$ . That is map the identity string to an element of $G_{1}$ .

Calculate $V_{A} = x \cdot C_{A}$ .

User A’s secret key is

(C_{A}, V_{A})

and is sent to the user via a secure channel.

Sign: To sign a message m, a user A with the curve parameters and the secret key $(C_{A}, V_{A})$ does the following:

Choose a random $r \in Z_{n}^{*}$ . Compute $Z_{A} = r \cdot g_{2}$ .

Compute $e = e_{n} (C_{A}, Z_{A})$ , where $e_{n}$ is the pairing operation.

Compute $h = H_{1} (m ∥ e)$ , where $H_{1}$ is a secure cryptographic hash function such as SHA-256 and ∥ is the concatenation operation.

Compute $S = h V_{A} + r C_{A}$ .

A’s signature for the message m is – (

h, S

Verify: The verification procedure is as follows:

Compute $w = e_{n} (S, g_{2}) * e_{n} (C_{A}, - h P_{2})$ .

Check $H_{1} (m ∥ w) \overset{?}{=} h$ .

The above equation works because: $\begin{array}{l} e = e_{n} (C_{A}, Z_{A}) = e_{n} (C_{A}, r \cdot g_{2}) = e_{n} {(C_{A}, g_{2})}^{r}, \\ \begin{matrix} w & = e_{n} (S, g_{2}) * e_{n} (C_{A}, - h P_{2}) \\ = e_{n} (h V_{A} + r C_{A}, g_{2}) * e_{n} (C_{A}, - h x \cdot g_{2}) \\ = e_{n} (h x \cdot C_{A} + r C_{A}, g_{2}) * e_{n} {(C_{A}, g_{2})}^{- h x} \\ = e_{n} ((h x + r) \cdot C_{A}, g_{2}) * e_{n} {(C_{A}, g_{2})}^{- h x} \\ = e_{n} {(C_{A}, g_{2})}^{h x + r} * e_{n} {(C_{A}, g_{2})}^{- h x} \\ = e_{n} {(C_{A}, g_{2})}^{r} . \end{matrix} \end{array}$

Hence, $h = H_{1} (m ∥ e) = H_{1} (m ∥ w)$ .

The signature is a tuple $(h, S)$ where h is the result of a hash function and is dependent on the choice of the hash function. If h is SHA-1, then length is 20 bytes, if h is SHA-256, the length is 32 bytes. The value S is an element of the group $G_{1}$ . Hence its length will be dependent on the curve type and the length of the prime. There are six types of curves in the jPBC library namely – a, a1, d, e, f, and g. The different types of curves and their parameters are provided in the library as “properties” files. Table 1 summarizes the comparison of the signature length using the different curves.

Table 1
Signature size using different curves for the Sakai–Kasahara scheme

Curve name Signature size using SHA-1 (bytes) Signature size using SHA-256 (bytes)

a.properties (20, 128) = 148 (32, 128) = 160

a1.properties (20, 260) = 280 (32, 260) = 292

d159.properties (20, 40) = 60 (32, 40) = 72

d201.properties (20, 52) = 72 (32, 52) = 84

d224.properties (20, 56) = 76 (32, 56) = 88

e.properties (20, 256) = 276 (32, 256) = 288

f.properties (20, 40) = 60 (32, 40) = 72

g149.properties (20,38) = 58 (32, 38) = 70

Curve name	Signature size using SHA-1 (bytes)	Signature size using SHA-256 (bytes)
a.properties	(20, 128) = 148	(32, 128) = 160
a1.properties	(20, 260) = 280	(32, 260) = 292
d159.properties	(20, 40) = 60	(32, 40) = 72
d201.properties	(20, 52) = 72	(32, 52) = 84
d224.properties	(20, 56) = 76	(32, 56) = 88
e.properties	(20, 256) = 276	(32, 256) = 288
f.properties	(20, 40) = 60	(32, 40) = 72
g149.properties	(20,38) = 58	(32, 38) = 70

Based on the signature size, the best performance is provided by the d159, f, and g149 curves. However, the length of the primes are a bit different and also the embedding degree is different. In the d159 curve, the prime is 159 bits and the embedding degree is 6. In the f curve, the prime is 158 bits and the embedding degree is 12. In the g149 curve, the prime is 149 bits and the embedding degree is 10. Keeping in view the small difference in signature sizes and the security related to each type, the better choice is the f curve.

The time to generate the signature and verify also depends on the type of the curve because of their properties. Table 2 summarizes the time to sign and verify using the different types of curves.

Table 2

Average time taken to sign and verify for different types of curves for the Sakai–Kasahara scheme

Curve name	Average time to sign (ms)	Average time to verify (ms)
a.properties	56	60
a1.properties	594	448
d159.properties	102	98
d201.properties	121	138
d224.properties	129	131
e.properties	262	214
f.properties	133	251
g149.properties	170	219

From the speed perspective, the a type curve is the fastest for generating and verifying the signature. But the size of the signature is way larger. The short signature size generating curves i.e. d159, f and g149 take a bit more time. It is, therefore, a matter of priority – signature size over speed. If we need to sign and verify a lot of messages and not care about the signature size then type A curve is a good choice. However, if the size of the signature is more important than speed like in our application, the f type curve is a better option. Also, the f type curve offers the best security among the three as its embedding degree is higher. Using this Sakai–Kasahara scheme we have reduced the signature size from 512 base pairs to 288 base pairs. The only thing it affects in our earlier algorithms is determination of BESN and EDSN in Section 5 when these tags mutate and we need to rely on counting base pairs to locate those tags.

Security of scheme: Since we use well-known signature schemes that assume that no polynomial-time adversary can forge a genuine signature without knowing the secret used to sign, it trivially follows that our scheme is also secure.

7. Compressing the signature to reduce molecule size

The Sakai–Kasahara scheme mentioned above generates the shortest signature size among the known identity-based signature schemes. The signature is a tuple ( $h, S$ ), where h is a hash function and S is an element in the group $G_{1}$ of the curve.

Now, the elements in the group $G_{1}$ are points on the elliptic curve. They have two subcomponents the X-coordinate and the Y-coordinate which are packed together. In the JPBC library, the elements are assigned and computed using the class “Element”. The fields are generated from the pairing curves i.e. the properties files.

Now every elliptic curve and consequently every pairing curve is defined by an equation of the form $y^{2} = x^{3} + a x + b$ . Hence, the Y coordinate can be calculated from the X coordinate. Using this point compression technique, it can be observed that a signature which contains an element of $G_{1}$ and has a size of 2n bytes (assuming each X and Y are n bytes), can be compressed to a size of $(n + 1)$ bytes.

This technique can only be applied to signature schemes where the signature contains an element in $G_{1}$ . We are not sure if the elements of $G_{2}$ and $G_{T}$ can be compressed. So for all of the schemes that are implemented already and the schemes that we described above can utilize this compression. Also, the Sakai–Kasahara Schnorr analog scheme contains a hash value in one of the tuple. Recall that the Sakai–Kasahara signature was $(h, S)$ , where h is a hash function like SHA-256. Along with the point compression, the hash value can also be shortened using the techniques that are used to generate Ethereum [28] or Bitcoin [16] addresses. In Ethereum, the public key is hashed using keccak 256 hash algorithm. Then instead of taking the entire 32 bytes (64 bytes in hex representation), the Ethereum blockchain takes only the last 20 bytes (40 bytes in hex) and generates the wallet address. Each wallet address is 40 hex bytes but the hash generates 64 hex bytes. There is also another way of doing the same hash compression. When computing the verification step – $e_{n} (C_{A}, - h P_{2})$ . Here $- h$ is the negative hash value integer and we perform a scalar multiplication with the point $P_{2}$ . There is only one integer group involved and that is $Z_{r}$ where r is the order of the curve. So when performing that scalar multiplication $- h \cdot P_{2}$ , JPBC internally converts the hash value to an element of $Z_{r}$ by modulo r. Hence, instead of writing the signature as $(h, S)$ we can rewrite that as $(R, S)$ where $R = h mod r$ . This implies our signature is now of the form ( $R, S$ ), where R is an element of the group $Z_{r}$ . But this method will not be efficient for curves which have large orders like the elements in the type a1 curve are 128 bytes. Hence all curve types except a and $a 1$ can benefit from compression.

In order to recover the Y coordinate during verification we need to do point decompression i.e. retrieve Y using X and the extra byte. The X value is plugged in the equation and the square root modulo prime is calculated. The two values of Y are obtained. Then using the extra byte we know if we should keep the odd Y or the even Y. Note that every curve has a different equation so the decompression method needs to consider the curve and the coefficients a and b. Also, there exists a very easy way to compute the square root modulo prime when the prime modulo 4 equals 3. If prime ≡ 3 mod 4 – $\begin{matrix} Y coordinate = {(Y^{2})}^{\frac{prime + 1}{4}} mod prime . \end{matrix}$

However if prime ≡ 1 mod 4, then there is no one line way to calculate the square root. The f type curve will be faster in this approach as the prime in that curve parameters is 3 mod 4. The code for point decompression is provided in the Appendix 3. This point compression method is not present in the JPBC library by default.

Using these compression methods the signature size which was (32, 40) bytes totalling 72 bytes can be reduced to (20, 21) totalling 41 bytes. Hence the size can be reduced from 288 base pairs ( $72 * 4$ ) to 164 base pairs ( $41 * 4$ ).

8. Self documenting plasmids

In the context of DNA sharing, the receiver gets a physical DNA molecule, which he/she can pass through an automated DNA sequencer and obtain the sequences present within the molecule. The sequence is in the form of a FASTA file which can be used to verify the signature within the molecule. The FASTA file contains just the raw sequences refer to Fig. 5.

Fig. 5.

Sample fasta (.fasta) file.

However, the receiver has no description about the molecule. Sequence manipulation software such as SnapGene can be used to convert a GENBANK file to a FASTA file and vice versa. When a FASTA file is converted to a GENBANK file, the software searches its database for common annotations. The generated annotations may not be complete or correct every time. Hence, the user has the flexibility to manually add additional annotations that may be required to describe the sample sequence. These manually added annotations are only available to the creator. When the same sample is sent to others, they will sequence it and obtain the FASTA file but the GENBANK file will contain only those annotations that can be automatically generated. In order for the receiver to extract all the feature information for a given sample, the creator would need to share the GENBANK file containing the manually added annotations.

Hence in order to establish a strong tie between the shared GENBANK file and the shared physical sample, the use of dual signature was proposed in [13]. Using a dual signature, it can be verified that the shared GENBANK file containing the descriptions is indeed meant to describe the shared sample and not for any other sample. But having a large set of GENBANK files and a large set of physical samples, it will be difficult to match a sample with its related GENBANK description, i.e. all the samples needs to be tested for the dual signature match.

A sample GENBANK file is shown below in two parts Fig. 6 and Fig. 7. This file corresponds to the FASTA file shown above. The GENBANK file is displayed in two parts due to its length.

Fig. 6.

Sample GENBANK (.gb) file – part 1.

Fig. 7.

Sample GENBANK (.gb) file – part 2.

The keyword “ORIGIN” demarcates the annotations/descriptions about the molecule and the actual sequence. The keyword “FEATURES” describes the location of the different features that this sample contains and their location within the sequence. Also it denotes the total number of base pairs in the sample under “source”.

As we are already embedding a digital signature within a molecule, it is also possible to embed the descriptions within it as well. If the descriptions about the molecule can be embedded within itself then there is no need to share the GENBANK file and the physical sample is self documented. The descriptions are text and it can be easily converted to bytes and from bytes to ACGT. But there is a limit to how much information can be put inside a molecule such that it is stable and retains its original functionality. For this criteria, we explored lossless text compression techniques and encode the compressed bytes as ACGT instead of the original text bytes. While unpacking the original text can be recovered and the exact same GENBANK file can be reproduced. Some of the compression algorithms we explored were zip, bzip, lzma, lz4, snappy, and deflate. Our main focus was compression size rather than time. We observed that the best algorithm in terms of compression size was deflate with the parameter BEST_ COMPRESSION. (Deflater compressedtext = new Deflater(BEST_ COMPRESSION);)

For the sample GENBANK file, the length of the descriptions (i.e. all text before the keyword “ORIGIN” in Figs 6 and 7) is 3675 bytes. After compression it shrinks to 1493 bytes. Consequently it is a better choice to encode the compressed annotations instead of the actual text. At the time of inserting the annotations, the same challenges we faced for signature insertion and verification because of the circular and double stranded nature of DNA, also appears.

To explain the process, we consider a small example. Let the sample have the following sequence – AAA CCC GGG TTT. Also, let us consider that this sample has 2 features – AAA is feature X and GGG is feature Y. The GENBANK file for this sample will look somewhat the following:

LOCUS... .... FEATURES source 1 .. 12 featureX 1 .. 3 featureY 7 .. 9 ORIGIN 1 aaacccgggt tt

Let us assume that when the annotations are compressed and encoded into sequence, it is ACGCTC. As discussed before we need two identifiers. Let GCG and ATA be the identifiers for this example. This will be inserted into the original content. The location is chosen by the user and it cannot collide with an existing feature. Let us assume that the chosen location is 4 i.e. after AAA. The sequence now becomes AAA GCG ACGCTC ATA CCC GGG TTT.

When this is shared with the receiver and sequenced it might come out as ATA CCC GGG TTT AAA GCG ACGCTC. The identifiers will aid in locating the annotations. We can identify ACGCTC convert to bytes and decompress to get the actual text. Recall the annotations had featureX 1..3 and featureY 7..9. But as per this sequence AAA is not at 1..3 as also GGG is not at 7..9. The source can always be overwritten with the new count. But the problem is with locating the features and generating the GENBANK file with the actual location of the features. We can permute the sequence such that it always starts with the identifier after locating i.e. ATA CCC GGG TTT AAA GCG ACGCTC can be rewritten as GCG ACGCTC ATA CCC GGG TTT AAA. In this way, the original content which was shared can be located. But still there is a mismatch with the feature location. In the original AAA was in position 1..3, here it is CCC. As we mentioned, before this would not happen if the sender selects the first position or the last position of the sequence as the annotation insertion point. But there is no guarantee that there will be no feature at the start or end. Hence in order to generalize this, we changed the procedure to annotation placement similar to the signature placement strategy discussed in Section 4. To explain it briefly, before inserting the signature at the chosen location, the original sequence is shifted such that it becomes the dummy start. In the example, AAA CCC GGG TTT was the original sequence and 4 was the chosen location. The original sequence is shifted to CCC GGG TTT AAA and the new feature location are calculated based on this i.e the features are now updated as featureX 10..12 and featureY 4..6. Then this annotation is compressed and encoded to ACGT and then put into the location. Let this annotation sequence be CAGATA. Note that this is not the same as the annotations in the original file. Since we shifted the sequence internally, the compression sequence will not be the same as the previous example. The final sequence is therefore AAA GCG CAGATA ATA CCC GGG TTT, At the receiver’s end, the result might come up as GGG TTT AAA GCG CAGATA ATA CCC. Like before we can look for the delimiters and then permute the sequence as GCG CAGATA ATA CCC GGG TTT AAA. But the decompressed text now will be expanded as featureX 10..12 and featureY 4..6. This matches with the annotations precisely if we consider the start of the sequence as after the end delimiter ATA. The final generated GENBANK will contain an offset of the length of start delimiter compressed annotation length and the end delimiter length. So for this example, it will be updated as featureX – $10 + 12$ .. $12 + 12$ and featureY – $4 + 12$ .. $6 + 12$ . The length of offset – $len$ (GCGCAGATAATA) = 12. It can be verified that GCG CAGATA ATA CCC GGG TTT AAA contains the featureX which was AAA at position 22…24 and featureY which was GGG at position 16…18. The algorithms for generating a self documented plasmid and reading the plasmid to get back the annotations are described in Algorithm 4 and 5.

Algorithm 4:

A self documenting plasmid generation algorithm

Algorithm 5:

A self documented plasmid reading algorithm

8.1. Encrypting parts of the annotations for controlled dissemination

As mentioned earlier, the creator of the DNA molecule may not always be willing to share all information in the DNA document. One simple approach to do this is to encrypt some features of the annotations such that the receiver can only read parts of the annotations. Only the annotations that needs to be read by everyone will be kept in plaintext and encoded as mentioned above. For this approach, there can be several different ways each with its own advantages and disadvantages. Since we are already using identity-based signature, the obvious technique is to use its counterpart identity-based encryption. There are already existing schemes such as Boneh–Franklin IBE scheme [1], Sakai–Kasahara IBE scheme [25] etc., that can be used to accomplish this. However, note that all of the IBE schemes generate the encrypted message as a tuple. The size of the message is doubled. Unfortunately, in this domain we need to keep the message or annotations as short as possible. To what extent we can allow the expansion of a molecule without affecting properties is largely unknown. However, biologists believe that it depends on the size of the original DNA molecule where we would include the annotations. If that sample is large then it is very likely that IBE schemes can be used. One issue with IBE schemes is that it is meant to be decrypted by exactly one particular user. The user’s ORCID which is used to encrypt the annotation will only be able to read the annotations. In cases where we would like to have a selected group of users read an annotation IBE schemes cannot be used.

Another possible method is by using symmetric key encryption like AES. One advantage is that the message size will not expand. But the sender and receiver need to agree on a shared key. This key can be sent to other selected users who will be allowed to read the annotations. But this also implies that a different key is needed for each DNA. Attribute-based encryption techniques is worth exploring in this regard. As of now we have not yet implemented these schemes and is left as a future work.

9. Conclusion and future work

In this work, we improve the previous DNA signing scheme [13] in several directions. First, we remove the need to share the genbank file by eliminating the requirement of alignment at the sample receiver’s end. The new signature generation procedure is independent of where the signer wants to place the signature. Notwithstanding any cyclic shifts or reverse complements that the receiver may get during sequencing, the signature can still be verified. To account for DNA mutations, we use error correction codes in the signature protocol to correct errors within pre-specified tolerable limits. Our second improvement is a way to locate mutated tags using approximate string matching techniques. This allows us to overcome mutation in the identifying tags and hence we can correctly recover the error correction code. This was a major problem in previous scheme.

Our third improvement is the reduction of signature size. We used pairing based cryptography to improve the previous signature scheme which generated 512 base pair signature to the Sakai–Kasahara scheme which generates 288 base pair signature. Then we further compressed that to 164 base pairs. It will be worth exploring the lattice-based schemes to check if those can be used to generate a shorter signature size than 164 base pairs.

We also explored the possibility of encoding the documentation or annotation about the sample within the sample itself. For just the signature scheme we had removed the need to share any digital file along with the physical sample. But still the receiver after validating the correctness of the sample would have limited knowledge about the description of the sample and the features it has. Previously, we had to share the digital file but included a dual signature such that the receiver can strictly tie one digital file with exactly one physical sample. If we can encode the documentation about the sample also within it, there will be no need to share any extra information and the physical sample will the self validating and self documenting. As of now, the signature and the self documenting parts are implemented as two different modules. In future they will be combined to generate signatures and annotations together.

One of the future directions in this work would involve signing and verifying the same DNA molecule multiple times by different users. Alice signs and sends a DNA sample to Bob and Bob validate Alice’s DNA. Then Bob continues to modify it, then signs it and sends it to Eve. Can Eve only verify Bob’s signature, or is there a way for Eve to track the entire pathway starting from Alice? It would be interesting to see if the concept of aggregate signatures can be applied in these scenarios.

Another direction to be explored is signing a part of the DNA. In this work, we have only applied signatures on plasmids and the entire plasmid sequence. Plasmids are relatively smaller in size than genomes ranging from 2.5 – 25 kilobases, and hence sequencing a plasmid without any errors is feasible. But as we move from plasmids to genomes, the sheer size of the DNA makes it almost impossible to produce and error free sequencing. The error correction code can be used to tolerate some errors but it might not be the optimal solution. It can be a better solution to sign a part of the DNA rather than the entire DNA sequence for example a particular protein sequence. In that way, we can ensure that the protein sequence is unchanged although there might be some errors in rest of the DNA sequence.

Also, it would be interesting to see if we put a signature on top of an existing signature whether the characteristic of the DNA molecule changes or not. If it does not, how many signatures can be inserted before the characteristics of the original DNA molecule begin to change? If we cannot put multiple signatures within the same DNA molecule, how do we remove the signature that was present before signing it again? Does removing the signature also alter the property of the DNA molecule? These are some future directions that we plan to explore further.

We envision an extension of this technology to genomic sequences. Even simple genomes are orders of magnitude larger than the plasmid sequences considered in this article. In addition, their sequence include long repeats and other complex features that are difficult to sequence accurately. As a result, it would not be practical to apply this technique to whole genomes because of the cost of sequencing whole genomes with the high level of accuracy necessary to validate signatures. Instead, we are considering signing genome segments delimited by tags. We could use the tag to amplify these fragments by Polymerase Chain Reaction (PCR – a molecular biology method to rapidly make millions of copies of a DNA molecule) and validate the PCR amplification products. Strain engineers for instance could use this approach to sign the genes they modify in an existing genome.

Footnotes

Acknowledgments

This work was partly supported by the U.S. National Science Foundation’s award #1934573 “EAGER: Development of a tool-chain to write and read self-documenting plasmids”, and award #1832320 “EAGER: Modeling DNA Manufacturing Processes Using Extensible Attribute Grammars”, and by the Colorado State University’s Office of the Vice President for Research Catalyst for Innovative Partnerships Program. The work of Indrajit Ray was performed while serving as Program Director at the U.S. National Science Foundation (NSF) and supported by the foundation’s Independent Research and Development program for staff. Research findings presented here and opinions expressed are solely that of the authors, and in no way reflect the opinion of the NSF, other federal agencies or the Office of the Vice President for Research of Colorado State University.

References

Boneh and

Franklin, Identity-based encryption from the Weil pairing, in: Advances in Cryptology – CRYPTO 2001,

Kilian, ed., Lecture Notes in Computer Science, Vol. 2139, 2001.

J.C.

Choon and

Hee Cheon, An identity-based signature from gap Diffie–Hellman groups, in: Public Key Crytpography – PKC 2003,

Y.G.

Desmedt, ed., Lecture Notes in Computer Science, Vol. 2567, Springer, Berlin, 2003.

F.J.

Damerau, A technique for computer detection and correction of spelling errors, Communications of ACM 7(3) (1964), 171–176. doi:10.1145/363958.363994.

De Caro and

Iovino, jPBC: Java pairing based cryptography, in: Proceedings of the 16th IEEE Symposium on Computers and Communications, ISCC 2011, Kerkyra, Corfu, Greece, June 28–July 1, IEEE, 2011, pp. 850–855.

J.E.

Gallegos,

D.M.

Kar,

Ray,

Ray and

Peccoud, Securing the exchange of synthetic genetic constructs using digital signatures, bioRxiv, 2019. https://www.biorxiv.org/content/early/2019/09/26/750927. doi:10.1101/750927.

D.G.

Gibson,

J.I.

Glass,

Lartigue,

V.N.

Noskov,

R.-Y.

Chuang,

M.A.

Algire,

G.A.

Benders,

M.G.

Montague,

Ma,

M.M.

Moodie,

Merryman,

Vashee,

Krishnakumar,

Assad-Garcia,

Andrews-Pfannkoch,

E.A.

Denisova,

Young,

Z.-Q.

Qi,

T.H.

Segall-Shapiro,

C.H.

Calvey,

P.P.

Parmar,

C.A.

Hutchison,

H.O.

Smith and

J.C.

Venter, Creation of a bacterial cell controlled by a chemically synthesized genome, Science 329(5987) (2010), 52–56. doi:10.1126/science.1190719.

Heider and

Barnekow, DNA-based watermarks using the DNA-crypt algorithm, BMC Bioinformatics 8 (2007), 176. doi:10.1186/1471-2105-8-176.

C.A.

Hutchison,

R.-Y.

Chuang,

V.N.

Noskov,

Assad-Garcia,

T.J.

Deerinck,

M.H.

Ellisman,

Gill,

Kannan,

B.J.

Karas,

Ma,

J.F.

Pelletier,

Z.-Q.

Qi,

R.A.

Richter,

E.A.

Strychalski,

Sun,

Suzuki,

Tsvetanova,

K.S.

Wise,

H.O.

Smith,

J.I.

Glass,

Merryman,

D.G.

Gibson and

J.C.

Venter, Design and synthesis of a minimal bacterial genome, Science 351(6280) (2016), aad6253. doi:10.1126/science.aad6253.

Jaccard, Etude de la distribution florale dans une portion des alpes et du jura, Bulletin de la Societe Vaudoise des Sciences Naturelles 37(142) (1901), 547–579.

10.

Jaccard, Distribution de la Flore Alpine dans le Bassin des Dranses et dans quelques régions voisines., Bulletin de la Societe Vaudoise des Sciences Naturelles 37(140) (1901), 241–272.

11.

M.A.

Jaro, Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida, Journal of the American Statistical Association 84(406) (1989), 414–420. doi:10.1080/01621459.1989.10478785.

12.

D.C.

Jupiter, DNA watermarking of infectious agents: Progress and prospects, PLOS Pathogens 6(6) (2010), e1000950. doi:10.1371/journal.ppat.1000950.

13.

D.M.

Kar,

Ray,

Gallegos and

Peccoud, Digital signatures to ensure the authenticity and integrity of synthetic DNA molecules, in: Proceedings of the New Security Paradigms Workshop, Windsor, UK, ACM, 2018.

14.

V.I.

Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady 10(8) (1966), 707–710.

15.

Liss, Embedding permanent watermarks in synthetic genes, PLOS ONE 7(8) (2012), e42465. doi:10.1371/journal.pone.0042465.

16.

Nakamoto, Bitcoin: A peer-to-peer electronic cash system, Bitcoin.org Working Paper, 2008.

17.

National Academies of Sciences, Engineering, and Medicine, Biodefense in the Age of Synthetic Biology, The National Academies Press, Washington, DC, 2018. doi:10.17226/24890.

18.

Ney,

Koscher,

Organick,

Ceze and

Kohno, Computer security, privacy, and DNA sequencing: Compromising computers with synthesized DNA, privacy leaks, and more, in: Proceedings of the 26th USENIX Security Symposium, Vancouver, Canada, 2017.

19.

A.A.

Nielsen and

C.A.

Voigt, Deep learning to predict the lab-of-origin of engineered DNA, Nature Communications 9(1) (2018), 3135. doi:10.1038/s41467-018-05378-z.

20.

K.G.

Paterson, ID-based signatures from pairings on elliptic curves, Electronics Letters 38(18) (2002), 1025–1026. doi:10.1049/el:20020682.

21.

Peccoud,

J.E.

Gallegos,

Murch,

W.G.

Buchholz and

Raman, Cyberbiosecurity: From naive trust to risk awareness, Trends in Biotechnology 36(1) (2018), 4–7. doi:10.1016/j.tibtech.2017.10.012.

22.

J.S.

Plank, A tutorial on Reed–Solomon coding for fault-tolerance in RAID-like systems, Software Practice and Experience 27(9) (1997), 995–1012. doi:10.1002/(SICI)1097-024X(199709)27:9<995::AID-SPE111>3.0.CO;2-6.

23.

I.S.

Reed and

Solomon, Polynomial codes over certain finite fields, Journal of the Society for Industrial and Applied Mathematics 8(2) (1960), 300–304. doi:10.1137/0108018.

24.

S.M.

Richardson,

L.A.

Mitchell,

Stracquadanio,

Yang,

J.S.

Dymond,

J.E.

DiCarlo,

Lee,

C.L.V.

Huang,

Chandrasegaran,

Cai,

J.D.

Boeke and

J.S.

Bader, Design of a synthetic yeast genome, Science 355(6329) (2017), 1040–1044. doi:10.1126/science.aaf4557.

25.

Sakai and

Kasahara, ID based cryptosystems with pairing on elliptic curve, IACR Cryptology ePrint Archive, 2003.

26.

Sakai,

Ohgishi and

Kasahara, Cryptosystems based on pairing, in: Proceedings of the 2000 Symposium on Cryptography and Information Security, Okinawa, Japan, 2000.

27.

Shamir, Identity-based cryptosystems and signature schemes, in: Advances in Cryptology – CRYPTO 1984,

G.R.

Blakley and

Chaum, eds, Lecture Notes in Computer Science, Vol. 196, Springer, Berlin, 1984, pp. 47–53.

28.

Wood

et al., Ethereum: A secure decentralised generalised transaction ledger, Ethereum Project Yellow Paper 151 (2014), 1–32.

29.

Yi, An identity-based signature scheme from the Weil pairing, IEEE Communications Letters 7(2) (2003), 76–78. doi:10.1109/LCOMM.2002.808397.

30.

Zhao and

Sahni, String correction using the Damerau–Levenshtein distance, BMC Bioinformatics 20 (2019), 277. doi:10.1186/s12859-019-2819-0.