Abstract
The intrinsic high-entropy sequence metadata, known as quality scores, is largely the cause of the substantial size of sequence data files. Yet, there is no consensus on a viable reduction of the resolution of the quality score scale, arguably because of collateral side effects. In this article, we leverage on the penalty functions of HISAT2 aligner to rebin the quality score scale in such a way as to avoid any impact on sequence alignment, identifying alongside a distortion threshold for “safe” quality score representation. We tested our findings on whole-genome and RNA-seq data, and contrasted the results with three methods for lossy compression of the quality scores.
1. Introduction
High-throughput sequencing technologies epitomize the challenges of data-intensive science, the current paradigm for scientific exploration (Hey et al., 2009). Massive data sets, diverse in variety, exhaustive in scope, and fine-grained in resolution, are representative descriptors of sequence data, the big data (Kitchin, 2014) in the biomedical sciences. The drop in sequencing costs (Wetterstrand, 2019) has democratized the production of sequence data and propelled its growth. Sequence data have historically doubled every 7 months,* a figure that is meticulously tracked and updated by the GenBank, the genetic sequence database of the National Institute of Health (NIH), every 2 months. The current storage requirements for sequence data are, conservatively, on par with the storage estimates for any of the other major Big Data producers: Astronomy (1 EB/year), Twitter (0.001–0.017 EB/year), and YouTube (1–2 EB/year) (Stephens et al., 2015). It has been projected an annual storage need between 2 and 40 EB per year for sequence data. However, we are at the onset of the sequence data flood. The promise of personalized medicine to revolutionize the diagnosis and treatment of diseases has triggered projects to sequence an important proportion of the human population. An attest to this is the 100,000 Genomes Project, a study launched in the United Kingdom that sequenced 100,000 human genomes from patients with rare diseases and their families, and patients with cancer. †
As per the estimates of the Global Alliance for Genomics and Health, more than 60 million patients will have their genome sequenced by 2025 (Birney et al., 2017), a projection that will be facilitated by the competition of private companies to offer genome sequencing services at a population scale with milestones to reduce sequencing costs, currently pushing the cost of 100 dollars per sequenced genome. The ubiquitous integration of personal genomic information into aspects of everyday life is around the corner, as we step into an era of the “social genome” (Martinez and Cifric, 2017).
The capacity to generate massive data sets of sequence data greatly outpaces our ability to analyze them, the notorious bottleneck in omic analyses. As per projections to the year 2025, more than 75% of the cost and complexity in omic workflows will be taken over by data analysis and storage (Davis-Turak et al., 2017). Data are accumulating very rapidly, and the immediate, pressing challenge in high-throughput sequencing is to reduce the size of sequence data files, the FASTQ files, for storage. Many solutions in the realm of data compression have emerged to reduce the size of such files, and much work has been devoted to explore methods to compress sequence data, and the associated metadata, without loss of information (Hosseini et al., 2016; Numanagić et al., 2016).
Substantial storage size is dedicated to sequence metadata in lossless compressed files compared with the storage devoted to sequence data. Conservatively speaking, this figure is more than 50% in lossless compressed files. There has been early evidence in the scientific literature to the size occupied by sequence metadata (Jones et al., 2012; Bonfield and Mahoney, 2013), and more recently such proportions have been brought back to attention (Roguski et al., 2018; El Allali and Arshad, 2019). Sequence metadata, also referred to as quality scores, is the bottleneck in the compression of FASTQ files.
The study of lossy quality score representation was initiated as a remedy to the large storage footprints of FASTQ files. The investigation of techniques for lossy compression of quality scores, along with quantification of their impact on the calling of genetic variants, has been well studied. It has been shown and confirmed at length that lossy approaches for quality score representation provide significant storage saving with negligible impact on variant calling (Hsi-Yang Fritz et al., 2011; Hach et al., 2012; Jones et al., 2012; Greenfield et al., 2016; Ochoa et al., 2016; Roguski et al., 2018), regardless of the idiosyncrasies of the lossy compression approach. The effect of lossy compression of quality scores has also been explored in differential gene expression with similar conclusions on the negligible effect of applying lossy representations (Hernandez-Lopez et al., 2018). Furthermore, recent advances in sequencing technologies are leading the production of longer genomic sequences with better accuracy and drastically reduced resolution for the quality scores (Illumina, 2017), supporting the claim that coarser representations are in principle suitable for omic analyses.
While the ultimate interest lies in assessing the impact of lossy quality score representation in a full bioinformatic pipeline, this approach may be ineffective for the purpose of understanding the role of lossy quality scores in the analysis. This is because sequence data is transformed continually as it is shepherded throughout the pipeline, and errors and associated uncertainties of the computational methods that operate on it are combined along with the data, obscuring the precise effect of lossy quality scores in the analysis.
Moreover, despite several efforts to evidence marginal impact in the application of lossy quality score representation, no consensus exists on the limits of a “safe” representation for compressing them lossy. In this context, this work focuses on read alignment to explore the effect of lossy quality score representation. In particular, we show that it is possible to compute a threshold value for transparent quality score distortion for sequence alignment, allowing the identification of a safe representation for the quality score scale. A result that aligns with current trends in sequencing technologies pushing for coarser resolutions to reduce the storage footprint of sequence data.
1.1. Read alignment
The challenge to represent lossy quality scores in the alignment of sequence reads lies in maintaining the read's original alignment location(s) with the new simplified representation. In quality-aware aligners, quality score values participate in the computation of suitable alignment locations for the reads. The way in which quality score values are factored in depends on the alignment technique, and their usage is not essential but clearly optional. Many aligners have been developed that do not rely on quality scores. This is readily noted in benchmark comparisons, which commonly include widely used aligners (Li and Homer, 2010; Fonseca et al., 2012; Hatem et al., 2013; Baruzzo et al., 2017; Thankaswamy-Kosalai et al., 2017).
Using quality scores can improve alignment accuracy because the information they provide, the probability of error in the calling of each sequence base, can be incorporated to determine which positions in a read are more important to map (Smith et al., 2008; Li and Homer, 2010). Quality scores can be used in very diverse ways among alignment tools, as the methods prioritize this metadata differently.
One of the most widely used reference-based aligners, the Burrows-Wheeler Alignment tool (BWA) (Li et al., 2008; Li and Durbin, 2009), incorporates quality scores in a measure for the reliability of alignments. The aligner does this by defining a mapping quality score that represents the error probability of each read alignment. Quality scores are not used in the BWA alignment algorithm but rather they are used to support alignment results. For example, alignments of high accuracy can be screened out using the mapping quality score. Moreover, this score is used by the aligner to estimate the insert size distribution in paired-end mapped reads.
In contrast, quality scores can be incorporated at the core of an aligner's algorithm to guide the alignment decision. This is the case for Novoalign (Novocraft, 2017), another well-known reference-based aligner, which consistently ranks well in alignment accuracy. Novoalign uses quality score information in its penalization system to score candidate alignment locations for each input sequence read.
Our purpose is to investigate the contribution of quality scores to alignment in HISAT2, which uses quality score metadata for the computation of alignment scores (ASs). Built over Bowtie2 (Langmead and Salzberg, 2012), HISAT2 is in fact the evolution of this very well-know and popular aligner. It has good adoption and performance (Li, 2013; Baruzzo et al., 2017), and has stood the test of time. Moreover, it is open source and has still been maintained. ‡ In addition, it was designed to map both DNA and RNA-seq reads.
The role of quality scores in alignment is framed within the HISAT scoring system, and understanding it will be the way through finding a simplified representation for the quality scores that circumvents undesirable effects on alignment. Concretely, the goal is to preserve alignment locations as if no modification to the values of quality scores was done before alignment, that is, we aim at varying the quality scores transparently.
We ask under what circumstances quality score values are, or become, informative for determining the alignment location of a read? To address this question, we look into how quality scores weigh in on HISAT2's scoring system.
2. Methods
Aligning sequences consists in lining up characters to reveal similarity. However, the aligner cannot always assign a read to its point of origin with high confidence, and thus, it makes an educated guess about its origin in the reference sequence.
HISAT2 quantifies how similar the sequence of a read is to the reference sequence it aligns to by computing an AS for the read, whose value is used by the aligner to classify reads as aligned or unaligned. Therefore, the AS can be seen as a proxy to measure the effect quality scores have on alignment.
The aligner starts with the assumption that no difference exists between the read sequence r and the segment of the reference sequence R, pointed to by the alignment location, it aligns to. If this condition is satisfied, the best possible AS is assigned to the read, which is zero. This the largest, non-negative value the AS can take. The concept of AS does not apply to unaligned reads, as such HISAT2 does not report a value nor the AS metric for these reads. As dissimilarities are found between r and R, HISAT2 penalizes each discrepant sequence character. Penalty values are always negative and are added together to compute the total AS for the read r. For an alignment to be considered good enough, or valid, it must have an AS with a value no less than the minimum score threshold t. The threshold is configurable and a function of the read length x, and its default value is t(x) = 0 − 0.2 × (x). Thus, valid alignments meet or exceed the minimum score theshold and are capped at zero. For example, aligned sequences of 100 base-pairs long will have valid ASs in the range −20 ≤ AS ≤0.
There are four types of penalizations, and each is scored differently:
Ambiguous characters (N). The penalty is set in positions where the reads, reference or both, contain an ambiguous character such as N. For each ambiguous character the penalization is 1. Gaps. Affine gaps in the read or the reference are penalized for their occurrence (gap opening, O), and for each position they span (gap extension E). The sum of both values defines the penalization for the gap. The penalty for a read gap of length n is O + n × E, and its default is 5 + n × 3. The same expression applies for a reference gap of length n. Soft-clips (sc). Reads can be aligned in a way such that they are trimmed at one or both extremes, because some of the characters at their ends do not match the reference. Omitted characters are trimmed or soft-clipped from the read to produce a valid alignment. Each character that is soft-clipped receives a penalty value defined by the penalty function
where MX = 2 and MN = 1 are the default values, and Q is the quality score for the soft-clipped character.
Mismatches (mm). These are discrepant characters between the read and the reference. Each mismatch is penalized using the penalty function for soft-clips. However, the parameter values change for mismatches, and they default to MX = 6, and MN = 2.
Let us note that quality scores participate only in the penalization for mismatches and soft-clips. By solving the penalty function above for the full quality score scale, and for both mismatches and soft-clips, we get Table 1.
Penalty Values for Mismatches and Soft-Clips
Penalties are always negative values. Pmm, penalization for mismatches; Psc, penalization for soft-clips.
The hypothesis is that sequence alignment is preserved when quality score distortion and AS invariance occur simultaneously. To test this, we start by grouping the quality score scale shown in Table 1, according to the penalty values for mismatches and soft-clips. The result is the table shown in Figure 1.

Rebinning of quality score scale.
With this rebinning, we can compute distortion rate baselines that represent lossy compression rates that can “at least” be applied to the quality scores of raw sequence files (FASTQ files) without compromising alignment. These baselines can be thought of as distortion thresholds, which rely on sequence files. Figure 2 shows the setup of our experiments. An input file with undistorted quality scores (

Experimentation setup.
To observe the effect that quality score distortion plays on alignment, we ran three lossy compressors: PRblock (Cánovas et al., 2014), QVZ (Malysa et al., 2015), and Quartz (Yu et al., 2015), and set their parameters such that the output files met as close as possible the value of the distortion threshold
3. Results
We experimented with synthetic and natural data and are reporting results for two natural data samples: T16M metastatic liver tumor (whole-genome sequence data), and gene expression data in skin fibroblast cells (rna-seq data). The files were sourced from the European Nucleotide Archive, § under the run accession numbers SRR089705 and SRR7093809, respectively.
Results are reported in the tables in Figure 3. The alignment report is presented as the percentage of reads grouped in one of three possible sets: reads that aligned zero times (Z), reads that aligned exactly one time (X), and reads that aligned more that one time (M).

Distortion rate and alignment percentages for wgs and rna-seq samples.
The tables summarize alignment information as the percentage of reads whose alignment coordinate changed as a consequence of quality score distortion. We call this read relocation, and can happen between alignment sets or within alignment set M (Fig. 4).

Read relocation between sets (top), and within set M (bottom).
For example, a read aligned before quality score distortion may be grouped in set Z, but if that same read is aligned after quality score distortion, it may be grouped in set X. This type of read relocation is between sets, or interset, and the percentage of reads relocated in this manner is shown under interset read relocation in Figure 3.
The second form of read relocation can occur within set M, when the quality scores of a read with multiple alignment locations are modified in a way such that the new alignment coordinate belongs to the set of its multiple candidate locations. The percentage of reads relocated within set M is shown under intraset read relocation in Figure 3. The percentages shown are relative to the total file and to the set of multireads (M).
Note that this type of read relocation occurs even in the rebinned file. This happens when the set M contains reads whose set of alignment coordinates have the same AS. HISAT2 will select one of the candidate coordinates for each read by computing a pseudorandom number generated from the read name, the sequence string, the quality score string, and an optional seed value. Thus, modifying the quality scores will trigger the HISAT2 intrinsic response toward multireads with equally likely alignment coordinates.
The graphs in Figure 5 report the effect of rebinning. In both graphs, the points to the far right show the lossless compression rate for each file. In this case, no changes to the quality scores are made, and therefore, all alignment coordinates are preserved.

Alignment coordinate changes for rna-seq and wgs samples.
If we then rebin the quality score scale by changing the values according to Figure 1, in five bins, the file can be compressed at a rate indicated by the big dots in both graphs in Figure 5. These dots identify distortion thresholds for AS invariance, and every rebinned file has a specific threshold value. Notice there is a percenage of alignment coordinate changes, which result as a consequence of intraset read relocation of multireads. In Figure 5, the points to the left of the big dots show the rebin of the quality score scale, but this time using three and two bins, instead of the standard five bins shown in Figure 1. Notice the abrupt raise in the percentage of affected reads as we move toward the left of the graph, a consequence of pushing for a coarser representation for the quality score scale.
Changes to the quality scores of read sequences will inevitably lead to changes in alignment coordinates, therefore impacting alignment. Assessing the significance of this impact will depend on the recipient application following sequence alignment. However, the impact of lossy quality scores on alignment can be eliminated by keeping the AS invariant. Although this is in principle true, we discovered that some idiosyncratic design decisions in the aligner weigh in unexpectedly and collaterally impact alignment locations; this is beyond our control.
Sequence reads will fall in one of the three sets after alignment, and an alignment location(s) will be reported afterward for each read. As seen at the top of Figure 6, each read U will be assigned an AS by the aligner, and this value will determine whether the read receives an alignment position or not. Aligned reads are grouped by the number of locations the aligner found for them; if that number is one, they group in set X, if more than one location are found for a read, they group in set M. An unaligned read has no alignment location and belongs to set Z.

Abstraction of the assignment of alignment location(s) for each input read. In the figure, alignment coordinates are depicted by notepads. A read aligned as a multiread is reported with a list of possible alignment locations, and one of these locations is randomly selected as primary alignment.
When changes are made to their quality scores of a read U*, and it is then aligned, the report of its AS may change (AS*), or keep the same value (AS). Refer to the bottom of Figure 6. A change in the value of the AS does not immediately yield a change in the alignment position for that read. However, it could be the case that it does (pos*), and thus, the alignment coordinate is tracked to record its displacement. Regardless of the outcome, the read will group in the Z, X, or M set.
The effect of rebinning reads, in accordance with Figure 1, and aligning them afterward are shown in Figure 7. Invariance of ASs is achieved for every input read U*, and the only reads that could potentially be affected by this new representation to their quality scores are the multireads. Therefore, changes to alignment coordinates can happen only within set M.

Abstraction of the assignment of alignment location(s) for rebinned reads. For multireads, the report of the primary alignment is randomly selected by the aligner. No guarantees can therefore be given with respect to their values.
A graphical summary of the process is presented in Figure 8. Input reads without changes, and with changes (rebinning), to their quality scores are aligned and compared side to side. The content of the three alignment sets (Z, X, and M) is preserved. As for the alignment coordinates, they are kept unchanged with no guarantees for those in the mutiread set.

Schematic comparing of the effect of rebinned reads on alignment.
4. Discussion
We investigated the penalty functions that drive the AS system for read sequence alignment in a well-known quality-aware aligner. We then derived a simplification in the assignment of penalty values that reduces quality score scale granularity while keeping ASs unaffected. Consequently, this coarser quality score scale reduces storage footprint of sequence files with the advantage of entirely preserving read mapping percentages. In other words, we distorted quality scores without collateral impact on alignment.
The aligner in question was HISAT2, the modern version of the popular aligner Bowtie2, suitable for mapping genome and exome sequence data. Compared with other quality-aware aligners such as Novoalign, HISAT2's approach to AS computation is straightforward and deterministic, making it a good candidate to explore the relationship and effect of quality scores and sequence alignment.
Simplifying the representation of quality scores is arguably a natural choice in the face of the sequence data explosion, and computational methods that approach the problem introduce collateral errors that are difficult to quantify. The assessment of quality score distortion has been attempted in some application domains (Alberti et al., 2016; Ochoa et al., 2016; Hernandez-Lopez et al., 2018) without clear consensus on the limits of safe lossy distortion levels. Meanwhile, the increasing complexity of genomic assays, data sets, and computational methods only adds to the difficulty of potential quantification.
Nevertheless, even uniform requantization of the quality scores is a suitable approximation for high-accuracy applications (Illumina, 2012), and we have shown that this approach can be extended further to rebin coarsely quality scores without impact in sequence alignment.
In the light of the fast-paced sequencing technology progress, the utility of quality scores is at stake, as they are arguably unnecessary for many omics applications. We must therefore advocate for a feasible and pertinent granularity that suits each host application.
Footnotes
Author Disclosure Statement
The authors declare they have no conflicting financial interests.
Funding Information
This work was generously supported by the Swiss National Science Foundation.
