Abstract
DNA sequencing data analysis pipelines require significant computational resources, which makes cloud computing infrastructures a natural choice for this processing. However, the first practical difficulty in reaching cloud computing services is the transmission of the massive DNA sequencing data from where they are produced to where they will be processed. The daily practice begins with compressing the data in FASTQ file format and then sending them via fast data transmission protocols. In this study, we address the weaknesses in that daily practice and present a new system architecture that incorporates the computational resources available on the client side while dynamically adapting itself to the available bandwidth. Our proposal considers real-life scenarios, where the bandwidth of the connection between the parties may fluctuate and the computing power on the client side may range from moderate personal computers to powerful workstations. The proposed architecture aims to utilize both the communication bandwidth and the computing resources to satisfy the ultimate goal of reaching the results as early as possible. We present a prototype implementation of the proposed architecture and analyze several real-life cases, which provide useful insights for sequencing centers, especially on deciding when to use a cloud service and under what conditions.
1. Introduction
The high-throughput DNA sequencing revolution demands substantial computing resources to turn vast amounts of sequencing data into valuable knowledge (Auwera et al., 2013). Raw sequencing data are generally produced by sequencing centers located in life science research centers or clinics. Despite the deep experience that those centers have with the biological process of sequencing, the expertise required on the computational side is either missing or very hard to manage there. Thus, a natural choice is to outsource the computational tasks to a dedicated cloud computing service (Stein, 2010; Marx, 2013; O'Driscoll et al., 2013).
However, the bottleneck in this scenario is the transmission of data from where they are generated to where they will be processed. That is because the size of the DNA sequencing data produced by the high-throughput machinery is huge and it becomes time consuming to transmit over the internet.
The usual first attempt to tackle this challenge is to compress the genomic data to reduce their size, and then to send them in compressed form via a fast data transmission protocol (e.g., Aspera) to the cloud, where they are decompressed and analyzed. This approach introduces an overhead during compression and decompression, since standard compression techniques are not easy to integrate owing to the large size and structure of the genomic data (Numanagić et al., 2016). Moreover, compressing the DNA sequencing data here serves no further processing purpose other than reducing the size of the payload in transmission. In that sense, it is a conventional approach rather than compressive genomics (Loh et al., 2012), which aims to operate on the compressed data themselves, without decompression, to better handle the ever-increasing data size in genomics (Berger et al., 2016).
It is true that introducing a compression step before the analysis pipeline decreases the size of the data, which is valuable for the efficient transmission. On the other hand, the time and resources allocated for the compression and also the decompression of the data on the receiver side introduce an overhead. This overhead within the whole pipeline should be well managed for the efficiency of the overall process. Assuming the cloud has a theoretically infinite capacity, the three important parameters during this management are the computing resources available on the sender side, the data production rate of the sequencing facility, and the bandwidth between the parties. Table 1 summarizes these parameters according to some sample scenarios.
In any case, the sender should decide whether to send the raw data without any compression, to compress the data first and then send them, or to perform whatever analysis is required in-house with the available computing resources, without involving a cloud service. Raw data transmission requires high bandwidth. Compression introduces an overhead time that depends on the available computing power. Analyzing the data without cloud computing requires both high computing power and expertise in computational genomics. For large facilities with well-equipped computing resources and expertise, it might seem feasible to achieve any task internally. For middle-sized facilities, however, this optimization is especially hard to resolve.
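The three options can be compared with a back-of-the-envelope time model. The following is only a minimal sketch under simplifying assumptions that are not part of the study: the steps run one after another without overlap, decompression time is ignored, and all names and parameters (`option_times`, `compress_ratio`, `compress_bytes_per_s`, and so on) are hypothetical.

```python
def transfer_seconds(size_bytes: float, bandwidth_mbit_s: float) -> float:
    """Time to push size_bytes over a link of bandwidth_mbit_s (megabits per second)."""
    return size_bytes * 8 / (bandwidth_mbit_s * 1e6)


def option_times(raw_bytes, bandwidth_mbit_s, compress_ratio,
                 compress_bytes_per_s, local_align_seconds, cloud_align_seconds):
    """Estimate end-to-end seconds for the three sender options (sequential steps assumed)."""
    send_raw = transfer_seconds(raw_bytes, bandwidth_mbit_s) + cloud_align_seconds
    compress_then_send = (
        raw_bytes / compress_bytes_per_s                                   # compress locally
        + transfer_seconds(raw_bytes / compress_ratio, bandwidth_mbit_s)   # send smaller payload
        + cloud_align_seconds                                              # align on the cloud
    )
    in_house = local_align_seconds                                         # no transmission at all
    return {"send_raw": send_raw,
            "compress_then_send": compress_then_send,
            "in_house": in_house}


def best_option(**params) -> str:
    """Name of the option with the smallest estimated time."""
    times = option_times(**params)
    return min(times, key=times.get)
```

With illustrative (made-up) inputs, the preferred option shifts from in-house processing to compression and finally to raw transmission as the bandwidth grows, which is the trade-off described above.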
In this study, we consider this problem and propose a new system architecture that will automatically manage the efficient transmission for the DNA sequencing center by combining the compression and the alignment, which might be assumed the first step of the genomic analysis pipeline (Auwera et al., 2013). The motivation here is not to minimize the data size or maximize the speed of transmission. Surely, any progress in the compression step or in the transmission lines/protocols will improve the system. However, the main purpose of the study is to provide an adaptive system architecture that will optimize the time of the total procedure, including both the transmission and the alignment of the DNA sequencing data according to the available resources, mainly the computation power ready at the sender side and the bandwidth between the parties.
The outline of the article is as follows. Section 2 introduces the proposed system architecture. Section 3 describes a prototype implementation and presents experimental results for selected bandwidths. The discussions particularly addressing the cloud service engagement for several use cases are given in Section 4, which is followed by the final conclusions.
2. The System Architecture
The general concept of the proposed system architecture is depicted in Figure 1. The sender would like to transmit a FASTQ (Cock et al., 2010) file containing a large number of reads to the cloud service for DNA sequence analysis that typically starts with the alignment of the source data. Each read within the proposed system is transmitted as either a compressed raw read in FASTQ format or a compressed aligned read in SAM (Li et al., 2009) format. The compression step is achieved by a standard entropy coding algorithm such as Lempel-Ziv (Ziv and Lempel, 1978), arithmetic (Witten et al., 1987), or Huffman coding (Huffman, 1954). Similarly, the alignment block can include any aligner (Li and Homer, 2010), such as the BWA (Li and Durbin, 2010) or Bowtie (Langmead et al., 2009) that maps an input read in FASTQ format into an aligned read in SAM format.

Figure 1. The proposed system architecture combining the alignment and compression steps for efficient transmission of the DNA sequencing data.
The key point in this architecture is the select operation that decides whether to fetch from the raw read or aligned read lists. The select operation first fetches from the aligned reads list, and when there exist no more available aligned reads there, it begins fetching from the raw reads list. The mechanism that dynamically tunes this selection is based on analyzing the throughput of each block in the chain.
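The priority rule of the select operation can be sketched in a few lines. This is an illustrative sketch rather than the prototype's code; the function name `select_next` and the use of `collections.deque` for the two lists are assumptions.

```python
from collections import deque


def select_next(aligned_reads: deque, raw_reads: deque):
    """Return the next (kind, read) pair, preferring aligned reads.

    Returns None when both lists are empty.
    """
    if aligned_reads:
        return ("aligned", aligned_reads.popleft())
    if raw_reads:
        return ("raw", raw_reads.popleft())
    return None
```

Because aligned reads are always drained first, the share of aligned reads in the outgoing stream depends only on how much time the aligner had to fill its list before the next fetch.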
Each operational block (aligner, select, compressor, and transmitter) in Figure 1 accepts input from and writes its output to a first-in-first-out (FIFO) list. These FIFO lists are shown in dashed rectangles, where the raw reads list is initially filled by reading from the input FASTQ file. The rule of thumb throughout the proposed system is that when there is no space to write the output of an operation to its respective output list, the operation is suspended until a vacancy appears on that list.
For instance, the compressor can write to its output, which is the compressed raw and aligned reads FIFO list in the figure, only after the transmitter block has fetched and, thus, cleared the previous data in this buffer. The transmitter's performance in clearing this list depends on the available bandwidth. Thus, when the bandwidth is low relative to the throughput of the compressor, the output buffer of the compressor fills up and the compressor suspends, which reduces the rate at which it fetches from its input list, that is, the raw and aligned reads FIFO list. As a consequence, the select operation also suspends fetching from its inputs, since it cannot write to its output list while the compressor is stalled. This longer waiting period gives the aligner the chance to fill its output FIFO buffer further, since the alignment process is expected to be the slowest in this chain. As soon as a vacancy appears in the raw and aligned reads list, which is the output FIFO of the select operation, the select block fills this gap first with items from the aligned reads list, and if more space still exists after that, it fetches directly from the raw reads list.
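The suspension rule maps naturally onto bounded blocking queues. The sketch below uses Python's standard `queue.Queue` with a `maxsize` to imitate two of the FIFO lists; the stage functions, the `None` end-of-stream token, the `str.encode` stand-in for compression, and the artificial transmit delay are all assumptions made for illustration, not details of the prototype.

```python
import queue
import threading
import time


def run_stage(source, sink, work, stop_token=None):
    """Move items from source to sink, applying work to each one."""
    while True:
        item = source.get()
        if item is stop_token:
            sink.put(stop_token)   # forward end-of-stream and finish
            return
        sink.put(work(item))       # blocks here while the output list is full


def demo(n_items=20, capacity=2, transmit_delay=0.001):
    selected = queue.Queue(maxsize=capacity)     # output list of the select block
    compressed = queue.Queue(maxsize=capacity)   # output list of the compressor
    sent = []

    def transmitter():
        while True:
            item = compressed.get()
            if item is None:
                return
            time.sleep(transmit_delay)           # stand-in for a slow link
            sent.append(item)

    compressor = threading.Thread(
        target=run_stage, args=(selected, compressed, lambda r: r.encode()))
    tx = threading.Thread(target=transmitter)
    compressor.start()
    tx.start()
    for i in range(n_items):
        selected.put(f"read{i}")                 # blocks while the compressor is stalled
    selected.put(None)
    compressor.join()
    tx.join()
    return sent
```

A slow transmitter keeps `compressed` full, so the compressor blocks on `put`, which in turn keeps `selected` full and blocks the producer; no explicit flow-control logic is needed beyond the bounded capacities.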
Therefore, when the bandwidth is poor, the system tends to perform more alignment, and when the bandwidth is large, the raw reads are directly compressed (assuming a compressor is available with a throughput higher than the transmission capacity) and sent to the cloud. In the extreme case of a totally blocked transmission line, all reads are aligned on the sender side. At the other extreme, where the bandwidth is infinite, no alignment is done on the sender side, and compressed raw data are sent directly to the cloud, assuming that the throughput of the compressor keeps up with the transmission speed. Thus, when the bandwidth is large, the compressor becomes the bottleneck, and some compression efficiency can be sacrificed in exchange for speed.
It is noteworthy that an aligned read compresses better than a raw read, because it can be represented by storing only its differences from the range of the reference genome to which it is aligned (Fritz et al., 2011). For example, if the read is 100 bases long and aligned to a region of the reference from which it differs at only three positions, then it is enough to represent this read by storing these three positions and the corresponding bases along with the starting position of the alignment. Actually, the CIGAR string (with a slight modification, footnote 1) provides enough information for that. This implies that when the bandwidth is limited and the resulting congestion causes the system to perform more alignment on the sender side, the compression ratio automatically improves.
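The difference-based representation can be illustrated with a simplified sketch. It handles substitutions only (no insertions, deletions, or clipping, which a real CIGAR string covers), and it is not the modified CIGAR encoding used by the authors; the function names and the tuple layout are hypothetical.

```python
def encode_aligned(read: str, reference: str, pos: int):
    """Return (pos, length, [(offset, base), ...]) for a gap-free alignment at reference[pos:]."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    # Keep only the positions where the read disagrees with the reference.
    diffs = [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]
    return pos, len(read), diffs


def decode_aligned(record, reference: str) -> str:
    """Rebuild the read from the reference window and the stored differences."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)
```

For a read that matches the reference except at a few positions, the record holds a start position, a length, and a short list of mismatches instead of the full base sequence.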
Besides the bandwidth, the computing power of the sending party also plays an important role in this scenario. As far as the available memory and processors allow, more than one aligner or compressor can run in parallel at a time. This increases the speed at which they fill their respective output FIFO lists and helps them cope with the higher bandwidths that might occur. For instance, when the bandwidth is high at a certain time, the performance of the system depends on the compression and alignment speeds. Thus, when the computing power allows higher compression and alignment speeds, the overall process is enhanced. Yet another opportunity lies at the algorithmic level: replacing the compression and alignment blocks with better alternatives automatically improves the total performance, which can be defined as the total time it takes for the customer to reach the aligned data in SAM format. The reads that are aligned on the sender side do not require any further alignment on the cloud. Therefore, an increased number of aligned reads on the sender side improves the performance.
The cloud service as the receiver side exactly follows the steps in Figure 1 in reverse order such that it begins by decompressing the received data, and then feeds these data to the (de)selection block, where the aligned and unaligned reads in the received payload are identified. The unaligned reads are sent to the aligner on the cloud side. The already aligned reads and the output of the aligner are directed into the output SAM file.
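The receiver-side (de)selection step can be sketched as a simple router. The packet layout as a list of `(kind, payload)` pairs and the `align` callback are assumptions for illustration; the source does not specify how aligned and unaligned reads are tagged inside a packet.

```python
def deselect(packet, align):
    """Turn a decompressed packet of (kind, payload) pairs into SAM lines."""
    sam_lines = []
    for kind, payload in packet:
        if kind == "aligned":
            sam_lines.append(payload)          # already aligned on the sender side
        elif kind == "raw":
            sam_lines.append(align(payload))   # aligned now, on the cloud side
        else:
            raise ValueError(f"unknown record kind: {kind!r}")
    return sam_lines
```

Only the `raw` branch costs cloud computing time, so the cloud-side load shrinks as the sender aligns a larger share of the reads.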
3. Prototype Implementation and Experimental Results
We have implemented the proposed scheme by using the BWA (Li and Durbin, 2010) (in single-threaded mode) as the alignment tool and an arithmetic coding implementation (Said, 2004a, 2004b) 2 for the compression step.
While compressing a raw read, we basically consider its base sequence, quality scores (in a lossless fashion), and the label string. Separate arithmetic coders (of order two) for each of these types of data are initiated, since the context and alphabet issues per data type are distinct.
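The idea of routing each data type to its own coder can be sketched as follows. The sketch splits FASTQ records into label, base, and quality streams and compresses each stream separately; `zlib` is used purely as a stand-in for the order-two arithmetic coders of the prototype, and the function names are hypothetical.

```python
import zlib


def split_streams(fastq_text: str):
    """Split FASTQ text into three parallel lists: labels, base sequences, quality strings."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    labels, bases, qualities = [], [], []
    for i in range(0, len(lines), 4):
        label, seq, plus, qual = lines[i:i + 4]
        if not label.startswith("@") or not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError(f"malformed record starting at line {i + 1}")
        labels.append(label[1:])   # drop the leading '@'
        bases.append(seq)
        qualities.append(qual)
    return labels, bases, qualities


def compress_streams(streams):
    """Compress each stream on its own (zlib stands in for a per-type entropy coder)."""
    return tuple(zlib.compress("\n".join(s).encode()) for s in streams)
```

Keeping the streams apart lets each coder see a single alphabet and context structure, which is the motivation given above for initiating separate coders.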
While compressing the aligned reads, we considered the attributes that are sufficient to reproduce the line in SAM format on the receiver side. With this aim, the starting position of the read on the reference genome, the CIGAR string, and the information regarding the bases of the read are taken into account. Only the bases that differ from the region of the reference to which the read is aligned are included. This significantly reduces the number of bases to be transmitted and thus enhances the overall performance. As in the raw read compression, separate entropy coders are initiated to compress each distinct data type more effectively.
The machine that we use on the sender side has an Intel i7 2.6 GHz CPU with 16 GB memory running Windows 10. The publicly available FASTQ files SRR622461_1.fastq and SRR622461_2.fastq, which include a total of ∼90 million reads produced by Illumina sequencing machines, are used in the experiments.
We assume that the cloud reserves 64 cores per customer, and thus, the reported cloud processing times are computed accordingly. Surely, a larger allocation of resources will improve the elapsed time. However, we would like to simulate a real-life situation, where we believe that an allocation of 64 cores per customer is realistic. The cores used in the experiments have the same CPU as the local machine.
We focus on different scenarios to benchmark our proposed solution. In the absence of the proposed system, we consider four basic scenarios that the user might follow.
• SCENARIO S1: Most of the high-throughput sequencing machines have the ability to generate the FASTQ file already in the compressed format, for example, the Illumina sequencers output in gzip. Thus, the user may choose to send this compressed version directly to the cloud without spending any resource for the compression. When data are received on the cloud, the pipeline will start with unzipping the data, and then, the alignment process will be initiated.
• SCENARIO S2: For the plain FASTQ files, the user can apply a state-of-the-art compressor that is specifically tuned for this job. In such a case, the time required to compress the source file will introduce a delay, but on the other hand, the compression ratio is expected to be much better than the gzip, which will reduce the data size to be transmitted.
• SCENARIO S3: The user can align the input FASTQ file against the reference genome, for example, by using the BWA aligner that is being widely used. Assuming the aligner has the capability to produce the output in BAM (Li et al., 2009) format, which is the compressed version of plain SAM (Li et al., 2009) format, this BAM file can be sent to the cloud. The cost of this choice is the time spent for the alignment process. However, since the transmitted reads will be aligned, the processing pipeline on the cloud side will skip the alignment step.
• SCENARIO S4: Yet another recent option with the aligners is producing the output file in CRAM (Fritz et al., 2011) format, which outperforms the BAM compression. The user again performs the alignment, choosing to receive the output in CRAM format. In this case, the size of the data is expected to be smaller than the BAM file in scenario S3. Similar to S3, there will be no need to run alignment on the cloud side after receiving the file.
The scenario for the proposed solution can be stated yet as a fifth case.
• SCENARIO S5: The user applies the system architecture proposed in this study. Time elapsed on the sender side includes compressing both the raw and aligned reads and then transmitting them in a streaming fashion. On the cloud side, as soon as the packets arrive, the decompression takes place, and the raw reads in the received packet are fed into the aligner to accomplish the task. The whole process runs in a streaming fashion: the alignment of the raw reads in a received packet occurs immediately, assuming that the cloud computing power is enough to finish aligning the raw reads of one packet before the next packet arrives.
We have tested these five scenarios on different bandwidths of 0.1, 1, 10, 100, and 1000 Mbits/s. 3 We used the DSRC2 (Roguski and Deorowicz, 2014) FASTQ compressor during the tests for scenario S2. The results are shown in Table 2.
For the proposed system, it is not possible to report the processing and transmission times separately, since all steps occur jointly in a streaming fashion. On the cloud side, when a packet including both aligned and unaligned reads is received, it is decompressed and the unaligned reads are fed into the aligner on the cloud side. Owing to this streaming pipeline, there is no additional elapsed time on the cloud side as long as the cloud reserves enough cores to keep up with the transmission speed. In our experiments, 64 cores were enough for bandwidths of up to 10 Mbits/s, and thus, the cloud processing time is shown as 0 for those cases in the table.
For the 100 and 1000 Mbits/s cases, the bottleneck is the number of cores on the cloud side. For these cases, the total time column reports the results with 64 cores along with the possible timings when 160 and 1600 cores are reserved for the 100 and 1000 Mbits/s bandwidths, respectively. Please see the Discussions section for further explanation.
At the 100 Kbits/s bandwidth, the connection is so poor that it is in any case better for the user to run the pipeline in-house. Notice that the proposed scheme tuned itself to perform nearly all of the alignment on the sender side, since the transmission line is always busy.
When the bandwidth reaches 1 Mbit/s, the system tuned itself to perform 55% of the alignment on the sender side and to transmit the rest as raw reads. This is a case where the introduced hybrid architecture is genuinely useful.
When the bandwidth becomes 10 Mbits/s, the system prefers to send 94% of the data as raw reads. This is mainly an effect of the speed of the prototype implementation: in the time it takes to align and compress a single read, the large bandwidth lets many raw reads flow through the connection. Improving the alignment and compression techniques in the proposed system might help to enhance the results here.
Once the bandwidth reaches 100 Mbits/s, the main limitation becomes the number of dedicated cores on the cloud side. Since the connection is of high quality, the system prefers to send the compressed raw reads to the cloud, where the elapsed time depends on the resources available there. Although we assume a 64-core cloud resource, we observed that 160 cores keep up with the speed of transmission on a 100 Mbits/s connection, and similarly 1600 cores accommodate 1 Gbit/s.
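The two reported operating points (160 cores at 100 Mbits/s and 1600 cores at 1000 Mbits/s) share the same ratio of 1.6 cores per Mbit/s. The sketch below extrapolates linearly from that ratio; the linearity beyond the two measured points is an assumption, and the function name is hypothetical.

```python
import math

REF_CORES = 160            # cores reported to keep up with the link ...
REF_BANDWIDTH_MBIT_S = 100 # ... at this bandwidth (same ratio as 1600 cores at 1000 Mbits/s)


def cores_to_keep_up(bandwidth_mbit_s: float) -> int:
    """Linear estimate of the cloud cores needed so alignment keeps pace with the link."""
    return math.ceil(REF_CORES * bandwidth_mbit_s / REF_BANDWIDTH_MBIT_S)
```

Under this estimate, a 10 Mbits/s link needs far fewer than 64 cores, which is consistent with the observation that 64 cores were sufficient up to that bandwidth.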
4. Discussions
With a poor connection of less than 1 Mbit/s, there appears to be no way of benefiting from a cloud service, and thus, the only option seems to be running the whole processing in-house on the sender side. For those consumers who have no possibility of obtaining a better bandwidth, for example, institutions in a region with a very poor internet infrastructure, the best option would be to strengthen their own computing infrastructure.
The experimental results revealed an interesting situation for those centers that have the opportunity of a good bandwidth, say 100 Mbits/s or larger. In such a case, the bottleneck becomes the available resources on the cloud side, since the transmission is nearly instant but the process waits on the cloud side. For instance, in our 100 Mbits/s and 1 Gbit/s bandwidth experiments, the transmission of data takes little time as opposed to the 1875 seconds of cloud processing time with 64 dedicated cores. The total time can be reduced to 748 and 74 seconds when there are 160 or 1600 dedicated cores, respectively. Therefore, those consumers who have an excellent internet connection should make a special agreement with the cloud to have more dedicated resources.
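As a rough plausibility check on these figures, one can assume that the cloud processing time scales inversely with the number of cores. This ideal-scaling assumption is ours, not a claim of the study, and the function name is hypothetical.

```python
def scaled_seconds(ref_seconds: float, ref_cores: int, cores: int) -> float:
    """Processing time under ideal inverse scaling with the number of cores."""
    return ref_seconds * ref_cores / cores


# Reference point from the experiments: 1875 seconds with 64 cores.
estimate_160 = scaled_seconds(1875, 64, 160)    # 750.0
estimate_1600 = scaled_seconds(1875, 64, 1600)  # 75.0
```

The ideal-scaling estimates of 750 and 75 seconds are close to the reported totals of 748 and 74 seconds.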
This brings another dimension to think about. First of all, from a telecommunication perspective, guaranteeing a dedicated bandwidth of X costs roughly as much as a shared line of 10X; for example, a dedicated 100 Mbits/s connection costs about as much as a 1 Gbit/s shared line. This is because internet lines are shared, so the highest rate cannot be achieved all the time; in practice, only about one tenth of it is available. Surely, such a large bandwidth costs more, and in addition, the cost increases further owing to the higher number of cores that must be allocated on the cloud side. In terms of economic feasibility, such large data generators may then consider building an in-house computing solution instead of purchasing a cloud service. These customers may also think of purchasing a service not for data processing but for the management of their own computing facilities, for example, making a deal with the cloud provider not to dedicate some number of cores but to manage the self-owned in-house computing infrastructure.
For middle-bandwidth connections that are larger than 1 and smaller than 100 Mbits/s, all the parameters, including the available in-house computing power as well as the daily data generation throughput, become important. In this range, cloud service engagement with the proposed system architecture seems helpful and economically advantageous, as a reasonable number of dedicated cores is not very costly. In addition, moderate improvements of the in-house computing power or better engineering of the proposed system architecture, for example, by enhancing the data compression ratios, could significantly improve the total time.
5. Conclusions
We have presented a system architecture to efficiently handle the transmission and alignment of DNA sequencing data in a streaming fashion, where alignment is the first step of the downstream genomic data analysis. The system tunes itself automatically according to the computing resources available on the client and cloud sides as well as the available bandwidth, and dynamically changing bandwidths are accommodated instantly.
The experimental results showed that such an approach is meaningful, particularly for middle-sized customers who have a reasonable but not very high bandwidth and moderate computing facilities. We observed that for large sequencing data producers, it seems better to perform the analysis in-house instead of using a cloud service, because a larger bandwidth necessitates a large number of dedicated cores on the cloud side. Similarly, for small-scale sequencing data providers, the better choice appears to be performing the job without the cloud service. It should be noted that these cases are analyzed in their most general settings, and the real decision on engaging a cloud service in any particular case surely requires investigating the parameters of that case in detail.
Particularly for large bandwidths, enhanced alignment and compression techniques could improve the introduced system architecture. However, the improvements, especially in data compression, should treat speed as a more important parameter than compression ratio. Yet another direction that makes sense here is to focus on compression systems that not only reduce the data size but also produce information useful in further steps of the genomic analysis pipeline, in line with the compressive genomics approach, for example, as proposed in Adaş et al. (2015).
Footnotes
Acknowledgments
This work was partially supported by The Scientific and Technological Research Council of Turkey's TÜBİTAK-TEYDEB-1507 program grant number 7150616 with the title “Fast transmission of DNA sequencing data over a digital channel” and TÜBİTAK-ARDEB-1005 program grant number 114E293 with the title “Design and implementation of efficient storage and retrieval systems for high-throughput sequencing platforms.”
Author Disclosure Statement
No competing financial interests exist.
