GeNeo: A Bioinformatics Toolbox for Genomics-Guided Neoepitope Prediction

Abstract

High-throughput DNA and RNA sequencing are revolutionizing precision oncology, enabling personalized therapies such as cancer vaccines designed to target tumor-specific neoepitopes generated by somatic mutations expressed in cancer cells. Identification of these neoepitopes from next-generation sequencing data of clinical samples remains challenging and requires the use of complex bioinformatics pipelines. In this paper, we present GeNeo, a bioinformatics toolbox for genomics-guided neoepitope prediction. GeNeo includes a comprehensive set of tools for somatic variant calling and filtering, variant validation, and neoepitope prediction and filtering. For ease of use, GeNeo tools can be accessed via web-based interfaces deployed on a Galaxy portal publicly accessible at https://neo.engr.uconn.edu/. A virtual machine image for running GeNeo locally is also available to academic users upon request.

1. INTRODUCTION

Cancer-specific epitopes (referred to as neoepitopes), generated by nonsynonymous somatic single nucleotide variants (SNVs) carried by tumor cells, have become the main focus of personalized cancer immunotherapies. However, the identification of neoepitopes from next-generation sequencing (NGS) data in a clinical setting remains challenging and requires the use of complex bioinformatics pipelines. These pipelines must address a number of challenging computational problems, including accurate identification of somatic SNVs from short-read NGS data and predicting which mutated peptides get presented on the surface of tumor cells and can mediate antitumor immune responses.

Many of the existing bioinformatics tools focus on the epitope prediction step for a given set of somatic variants. Examples of such tools are CloudNeo (Bais et al, 2017), NeoEpitopePred (Chang et al, 2017), Neoepiscope (Wood et al, 2020), and Mupexie (Bjerregaard et al, 2017). Neopepsee (Kim et al, 2018) and NeoPredPipe (Schenck et al, 2019) also take somatic variants as input and predict T-cell receptor (TCR) recognition potential for the corresponding mutated peptides. There are only a few neoepitope prediction pipelines that integrate variant calling from matched tumor/normal NGS data, including the tumour-specific neoantigen detector (TSNAD) (Zhou et al, 2017), pTuneos (Zhou et al, 2019), DeepAntigen (Shi et al, 2020), ProTECT (Rao et al, 2020), Epidisco (Rubinsteyn et al, 2018), and OpenVax (Kodysh and Rubinsteyn, 2020).

In this paper, we present a new suite of tools, referred to as GeNeo, for predicting neoepitopes from matched tumor/normal exome sequencing data coupled with tumor RNA-Seq data. A distinguishing feature of GeNeo is that it can analyze multi-technology sequencing data generated using the Illumina and Ion Torrent platforms to take advantage of the complementary strengths of the two technologies (Boland et al, 2013). Combined with a consensus variant calling approach that integrates two different variant callers, Strelka (Saunders et al, 2012) and SNVQ (Duan et al, 2014), this results in highly accurate somatic variant calls. To facilitate use within a clinical setting, where best practices require validation of somatic SNVs by an orthogonal technology (Koboldt, 2020), GeNeo includes tools enabling high-throughput validation by targeted resequencing using the AccessArray platform. Finally, GeNeo includes tools for predicting neoepitopes encoded by validated somatic SNVs and prioritizing them using a variety of criteria including their major histocompatibility complex class I (MHC-I) binding affinity and differential agretopic index (DAI), which was previously shown to be a better predictor of tumor control in vaccination experiments in mouse tumor models (Duan et al, 2014).

The GeNeo tools have been used in several clinical studies, including phase I and II clinical trials. The tools have been deployed using a customized instance of the Galaxy framework (Afgan et al, 2018) and are publicly accessible through easy-to-use graphical user interfaces at https://neo.engr.uconn.edu/. A virtual machine image for running GeNeo locally is also available to academic users upon request.

2. METHODS

In this section, we describe the tools included in the GeNeo toolbox (Fig. 1). The tools are grouped into three categories: variant calling and filtering, variant validation, and neoepitope prediction.

FIG. 1.

The tools comprising the GeNeo toolbox are grouped into three categories. The variant calling and filtering category includes the CCCP tool along with tools for filtering SNVs based on different criteria, generating a summary report, and converting the (filtered) CCCP output to the standard VCF for analysis by external tools. The variant validation category includes tools for identifying possible sample mismatches based on mitochondrial haplogroups, designing primers for targeted resequencing of candidate SNVs using the AccessArray platform, demultiplexing AccessArray amplicon sequencing data and calling SNV genotypes, and extracting read alignments that cover candidate SNVs for manual review. The neoepitope prediction category includes tools for inferring HLA alleles from RNA-Seq data, predicting neoepitopes, and filtering them based on predicted IC₅₀ and DAI values. AA, AccessArray; CCCP, Consensus Caller Cross-Platform; DAI, Differential Agretopic Index; dbSNP, the Single Nucleotide Polymorphism database; HLA; IC₅₀, half maximal inhibitory concentration; SNVs, single nucleotide variants; VCF, variant calling format.

2.1. Variant calling and filtering tools

Variant calling is a critical first step of any immunogenomic analysis. Because the focus of these analyses is on the identification of neoepitopes encoded by somatic mutations present in the patient's tumor, variant calling is typically performed using deep whole-exome sequencing data generated from matched tumor and normal samples. Processing the large volume of sequencing data requires efficient algorithms and computing infrastructure able to deliver results in a timely manner. At the same time, clinical samples can undergo unpredictable degrees of degradation, resulting in a high patient-to-patient variation in the quality and quantity of sequencing data. This requires complex, multistep analysis pipelines that are always able to produce reliable variant calls. Filtering somatic variant calls is also essential for eliminating from consideration subclonal mutations present in a minority of tumor cells that may generate poor immunotherapy targets as well as reducing the number of false positives generated by library preparation artifacts and sequencing errors. GeNeo implements the recommended best practices for somatic variant calling and filtering (Koboldt, 2020) as described hereunder.

2.1.1. Consensus caller cross-platform

There are multiple NGS technologies commercially available, each with its own strengths and weaknesses (Goodwin et al, 2016). Although single-molecule sequencing technologies have rapidly advanced over the past few years, their higher error rate makes them less suited for variant calling. Indeed, clinical sequencing is almost universally performed using short-read NGS technologies. These technologies deliver at a very low cost and with a lower error rate of millions of reads of a few hundred bases, sufficiently long to be reliably mapped against the genome. Among commercially available short-read NGS technologies, the Illumina and Ion Torrent platforms have comparable costs. Both technologies perform sequencing by synthesis. Illumina uses fluorophore-labeled single nucleotides with the 3′-OH group blocked in the synthesis step.

Their cyclic reversible termination process ensures that at most one nucleotide is incorporated in the synthesized DNA molecules in each cycle, with incorporation events being detected by high-resolution microscopy. This results in a low rate of homopolymer sequencing errors at the cost of increased sequencing time of several days for their high-end instruments. In contrast, Ion Torrent uses natural nucleotides in each synthesis cycle (with no need for extra steps for removing the 3′ blocking groups) and detects nucleotide incorporation using field-effect transistors sensitive to pH changes. This dramatically reduces sequencing time to a few hours but also results in limited sequencing accuracy of homopolymer repeats.

The Consensus Caller Cross-Platform (CCCP) tool was designed to leverage the complementary strengths of the Illumina and Ion Torrent technologies. As shown in Figure 2 by default the tool takes as input Illumina paired-end and Ion Torrent (single-end) reads generated from matched tumor and normal exomes, along with Illumina paired-end and Ion Torrent RNA-Seq reads, all in fastq format. However, the tool also has the option of analyzing Illumina single-end reads along with Ion Torrent reads, as well as single technology data (Illumina paired-end only, Illumina single-end only, or Ion Torrent only). In addition, although primarily designed for human sequencing data, CCCP has the option of analyzing tumor model data from the commonly used C57BL/6 and BALB/c mouse strains. The tool generates an unfiltered list of candidate SNVs along with extensive information on the allele coverage in the Illumina and Ion Torrent exome and RNA-Seq datasets and calls made by two somatic SNV callers, Strelka (Saunders et al, 2012) and SNVQ (Duitama et al, 2012), with subtraction. The output of CCCP is used as input to the provided filtering tools to identify high-confidence somatic calls, which can then be validated by targeted resequencing and used to predict neoepitopes.

FIG. 2.

The Galaxy interface of the CCCP tool. By default, the tool takes as input Illumina paired-end and Ion Torrent (single-end) reads generated from matched tumor and normal exomes, along with Illumina paired-end and Ion Torrent RNA-Seq reads.

Figure 3 provides the main steps implemented by the CCCP tool. Following recommended best practices for somatic variant calling (Koboldt, 2020), CCCP uses FastUniq (Xu et al, 2012) to remove duplicate reads, which are likely to be artifacts of polymerase chain reaction (PCR) amplification. This is particularly important for Ion Torrent exome sequencing data, which uses an exome capture protocol based on highly multiplexed PCR, but PCR artifacts can be generated during Illumina library preparation too. Removing duplicated reads reduces the likelihood that a PCR error that gets amplified and sequenced multiple times results in a false-positive call. Following duplicate removal, Trimmomatic (Bolger et al, 2014) is used to remove reads with an average quality score of <20 and reads shorter than a user-specified length (default 50). Subsequently, all exome and RNA-Seq reads passing the FastUniq and Trimmomatic filters are mapped against the genome using hierarchical indexing for spliced alignment of transcripts version 2 (HISAT2) (Kim et al, 2015), with spliced alignments disabled for exome reads. Alignments of nonuniquely mapped reads and nonconcordant pairs are discarded.

FIG. 3.

A flowchart of the CCCP tool for calling somatic variants. The CCCP output includes an unfiltered aggregation of SNV calls from both platforms and both callers. SNVQ.

Candidate somatic SNVs are then called by running two somatic SNV callers [Strelka, Saunders et al (2012) and SNVQ, Duitama et al (2012) with subtraction] on the alignments generated from each of the two sequencing technologies. Strelka is independently run on the alignments generated from Illumina and Ion Torrent-matched exomes using the “no depth filter” option. The SNVQ subtraction method is also used to process independently the alignments generated from Illumina and Ion Torrent exomes. For each technology, SNVQ is run separately to call the SNVs from the normal and tumor exome alignments. A high-confidence SNV call is defined as a call made with an SNVQ genotype Phred score of at least 50, alternative allele support coverage of at least 3 reads, and alternative allele support on both strands.

An SNV call is identified as somatic by the SNVQ subtraction method if (1) is called with high confidence from the tumor exome alignments, (2) is absent from the unfiltered SNVQ calls made from the normal exome, and (3) has coverage of at least 5 reads in the normal exome. In addition, SNVs called with high confidence from the tumor exome and are called with a different genotype from the normal exome are also identified as somatic by the SNVQ subtraction method.

The output of the CCCP pipeline includes the union of all somatic calls made by Strelka from Illumina and Ion Torrent technologies along with all calls made by SNVQ in at least one of the four runs on normal and tumor exomes generated using either technology (with high confidence in the case of tumor exome calls). For each variant position in this list, the read support for each of the four nucleotides in each of the input sequencing datasets (Illumina and Ion Torrent tumor and normal exomes and tumor RNA-Seq) are computed from alignments files in sequence alignment map (SAM) format using SAMtools (Li et al, 2009) and bedtools (Quinlan, 2014). The calls made for each of these positions by Strelka and SNVQ, their status in the Single Nucleotide Polymorphism Database (dbSNP), and the number of reads supporting each nucleotide are integrated into the CCCP output file. This enables rapid filtering of predicted SNVs based on a variety of criteria including variant caller and coverage support in the exome and RNA-Seq datasets generated by the two technologies, with no need to rerun CCCP.

2.1.2. Somatic SNV filter

By applying this filter, the user can select for validation candidate somatic SNVs with different levels of support from the available sequencing data. Using one of the three implemented filtering options is mandatory because the raw CCCP output includes (likely germline) SNVs detected in both the normal and tumor exomes as well as SNVs detected only in the normal exomes. All filters generate output in the same format as the CCCP tool, enabling the application of additional filters that take dbSNP status and RNA-Seq coverage into account and compatibility with the other downstream GeNeo tools.

The default option for the somatic SNV filter tool is to apply the filter called 2 Calls Cross-Platform (2CP) filtering. This method retains candidate SNVs that are called somatic at least twice, with concordant somatic alternative alleles, of the four possible somatic calls generated when running CCCP (Strelka Illumina, Strelka Ion Torrent, SNVQ Illumina, and SNVQ Ion Torrent). In addition, 2CP requires that the SNV be called as somatic at least once from data generated by each sequencing technology. The only exceptions are SNVs that have no coverage from one of the sequencing platforms, in which case the SNV is retained only when both Strelka and SNVQ make concordant somatic calls based on the sequencing reads generated by the other platform.

As previously reported in Sherafat et al (2020), 2CP offers an excellent balance between precision and recall, matching and often outperforming the F1-scores of both Strelka and the SNVQ subtraction method. However, the somatic SNV filter tool provides both less and more stringent filtering options that can be used when 2CP generates a number of candidates that is too low to generate sufficient candidate neoepitopes or too large for the multiplexing capacity of the AccessArray platform used for variant validation. The less stringent filter retains all SNVs called somatic by at least one of the two callers implemented in CCCP based on sequencing data generated by at least one of the two sequencing technologies.

Although the list of candidate somatic SNVs generated by the 1 Call filter is likely to include more false positives than the 2CP list, validating the calls by resequencing will detect and remove these false positives. The more stringent filter retains all SNVs that are called somatic at least three times of the four possible combinations of calling method and sequencing technology.

2.1.3. dbSNP filter

This tool takes as input a list of SNV calls in CCCP output format and removes from the list the variants that are annotated as “common” in the dbSNP database (Sherry et al, 2001). Such filtering can reduce the number of false positives caused by the misclassification of germline variants as somatic. However, particularly when variant validation is performed by targeted resequencing, applying dbSNP filtering should be avoided because, as noted by Koboldt (2020), a number of known recurrent cancer mutations have been included in dbSNP.

2.1.4. RNA-Seq coverage filter

This tool takes as input a list of SNV calls in CCCP output format and keeps only variants that have a minimum coverage (by default 1) of the alternative allele in the RNA-Seq reads. The RNA coverage of the alternative allele is calculated as the sum of coverages for the alternative allele in both the Illumina and Ion Torrent RNA-Seq reads. This filter can be used to prioritize variants with direct evidence of RNA expression among large numbers of somatic candidates. As the dbSNP filter, the RNA-Seq coverage filter must be used with caution because it may remove somatic SNVs with low levels of messenger RNA (mRNA) expression that may nevertheless generate MHC-I neoepitopes presented on the surface of tumor cells.

2.1.5. Variant calling format generator

The variant calling format (VCF) generator takes as input the file generated by CCCP (or one of the above filtering tools) and converts it to the standard VCF. The VCF is the required input of the Primer Designer tool described in the next section, and also allows the user to seamlessly integrate the output of CCCP with tools external to GeNeo if needed.

2.1.6. Summary report generator

This tool collects from the CCCP log statistics on the numbers of input reads, clean reads (reads passing the deduplication and QC steps), aligned reads, and coverage levels for the analyzed exome and RNA-Seq datasets.

2.2. Variant validation tools

Best practices for somatic variant calling from clinical NGS data require validation by an orthogonal technology (Koboldt, 2020). Sanger sequencing has long been considered the gold standard validation technique (Mu et al, 2016). However, Sanger sequencing has low throughput and suffers from low sensitivity for samples with low tumor content and/or subclonal somatic SNVs present in only a fraction of tumor cells. As an alternative, validation can be performed using the AccessArray nanofluidics system, which allows simultaneous PCR amplification of up to 480 candidate somatic SNVs from upto 48 samples, in conjunction with deep amplicon sequencing using either the Illumina or Ion Torrent platforms. GeNeo includes several bioinformatics tools designed to automate the validation process and check for possible sample mislabeling.

2.2.1. Primer designer

This tool takes as input a list of candidate somatic SNVs in VCF and attempts to design primers for variant validation using either Sanger sequencing or the AccessArray platform. Following recommended best practices, for Sanger sequencing, the tool attempts to design two nested primer pairs, one for PCR amplification and one for initiating Sanger sequencing on each strand. For the AccessArray platform primers are designed to target the manufacturer's recommended amplicon length and are tagged with universal forward/reverse common sequences CS1/CS2 for preparing sequencing libraries compatible with the Illumina or Ion Torrent platforms. Both Sanger and AccessArray primers can be designed for targeted amplification of the candidate somatic SNV positions from either genomic DNA or complementary DNA (cDNA).

2.2.2. AccessArray genotype caller

This tool demultiplexes AccessArray amplicon sequencing data from (typically replicated) tumor and normal samples and calls genotypes for each targeted SNV and each sample. It takes as input the AccessArray barcoded amplicon sequencing reads in fastq format and the set of targeted SNVs in VCF. The tool starts by creating an alternative genome reference with the reference allele substituted by the alternative allele at each targeted SNV position. A HISAT2 index is created for both the reference and the alternative genomes. To speed-up indexing and read alignment, the indices are built using only windows of ±500 bp around each targeted SNV. Next the tool demultiplexing the reads based on the AccessArray barcodes using the barcode splitter tool from the FASTX-Toolkit (FASTX-Toolkit, 2022). This generates multiple fastq files, one per AccessArray well.

The barcodes and sequencing adapters are then removed from the reads. Optionally duplicate reads can be removed using FastUniq (Xu et al, 2012), although this is necessary only for very low input samples such as single cells. Similar to CCCP, reads with average quality <20 are filtered out using Trimmomatic (Bolger et al, 2014). The remaining reads are then mapped to each one of the two references using HISAT2 (Kim et al, 2015), and the read alignments are used by SNVQ to call variants. SNVQ reports called SNVs as well as the coverage of each allele.

Using the alternative reference allows SNVQ to call genotypes and compute allelic fractions for positions where the AmpliSeq data does not support the alternative allele because sequencing data supporting a homozygous reference genotype will result in an SNV call when using the alternative reference. The first output of the tool is a tab-separated file that reports SNV genotype calls obtained by combining the two SNVQ runs, along with genotype quality and strand-specific coverages for both the reference and alternative alleles. The second output is a zipped file including the alignment files for each well in SAM format.

2.2.3. Alignment extractor

When validation via targeted resequencing is not possible (e.g., because appropriate primers cannot be designed or they fail to amplify the desired region) or is inconclusive, the current best practice is to perform a visual review of the read alignments supporting clinically relevant variants (Koboldt, 2020). The standard operating procedure for performing visual review (Barnell et al, 2019) requires using a genome browser like the integrative genomics viewer (IGV) (Robinson et al, 2017). This enables the identification of false positives caused by various artifacts including misalignment around indels or misalignments of reads from paralogous regions not represented in the reference genome (Koboldt, 2020). The alignment extractor tool helps streamline this process by generating a reduced reference genome that includes 500 bases before and after each SNV position of interest and generating BAM files that include only read alignments covering these positions. The tool accepts as input the target SNVs in VCF along with fastq files for the sequencing data that needs to be visualized, including any combination of matched tumor-normal exomes, RNA-Seq, or AccessArray targeted resequencing data generated using either the Illumina or Ion Torrent platforms (see the Galaxy interface in Fig. 4).

FIG. 4.

Galaxy interface for the alignment extractor tool.

2.2.4. MitoMatch

Sample contamination and mislabeling have been long-standing problems in biomedical research, and unfortunately, their prevalence has increased with the advent of high-throughput multi-omics datasets (Kho, 2021; Boja et al, 2018). Contaminated or mislabeled samples can result in a substantial increase in the number of false-positive somatic SNV calls, compromising the clinical utility of the sequencing data. To assist in the identification of contaminated or mislabeled samples, the MitoMatch tool can be used to infer the mitochondrial haplogroups for the sequenced exome and RNA samples. The tool accepts as input the same types of sequencing datasets accepted by the CCCP tool. The reads are mapped against the Reconstructed Sapiens Reference Sequence (RSRS) mitochondrial sequence and then SNVQ is used to call mitochondrial DNA (mtDNA) variants for each dataset. The Jaccard similarity-based algorithm in Alqahtani and Măndoiu (2020) is used to match the set of mtDNA variants called from each sample against a database of variants associated with the 2897 leaf haplogroups from Build 17 of PhyloTree (PhyloTree.org, 2016) as well as the two-haplogroup mixtures that can be formed using them. As shown in Alqahtani and Măndoiu (2020), this approach can quickly and accurately identify the PhyloTree haplogroup (or mixture of haplogroups) present in a sample, thus providing a simple method to simultaneously detect with high probability both sample contamination and sample mislabeling.^†

2.3. Neoepitope prediction tools

Clinical use of personalized cancer vaccines requires predicting which mutated peptides, called neoepitopes, encoded by somatic cancer mutations get presented on the surface of tumor cells and are most likely to induce immune-mediated tumor control (reviewed in Lang et al, 2022). Most works in this area have focused on the prediction of neoepitopes presented on the surface of cancer cells by the MHC-I molecules, which can mediate the cytotoxic effects of CD8+ T cells. In this section, we describe the GeNeo tools designed to assist with neoepitope prediction and prioritization.

2.3.1. Seq2HLA

The MHC-I molecules are highly polymorphic, with three gene loci and thousands of alleles cataloged in the human population. Each individual carries up to 6 distinct alleles, one maternal and one paternal allele for each of the three loci. Because the repertoire of mutated peptides presented on the surface of cancer cells depends on the combination of alleles they express, the Seq2HLA tool uses the algorithm in Boegel et al (2013) to infer the MHC-I alleles from Illumina paired-end tumor RNA-Seq reads provided as input in fastq format. In addition, Seq2HLA also infers MHC class II alleles and the expression level for each MHC locus.

2.3.2. Neopitope finder

This tool uses NetMHC 4.0 (Andreatta and Nielsen, 2016) to predict neoepitopes encoded by somatic SNVs given as input in CCCP format. Using CCDS annotations (Pruitt et al, 2009), the tool extracts, for each input SNV, one or more cDNA sequences of 63 nucleotides (or shorter depending on the location of the SNV within the transcript) with the SNV in the middle. It then translates the cDNA into peptide sequences of 21 amino acids or shorter. If the SNV is nonsynonymous resulting in different wild-type and mutant long peptides, the tool calls NetMHC to compute the binding affinity (half maximal inhibitory concentration or IC₅₀) of each 8, 9, 10, and 11-mer peptide containing the mutated amino acid and their corresponding wild-type peptides (Fig. 5). The tool also computes for each mutated peptide the DAI previously introduced by us in Duan et al (2014).

FIG. 5.

Diagram illustrating the generation of all possible mutant and corresponding wild type 8-mer peptides spanning a somatic SNV position. The same strategy is used for generating 9, 10, and 11-mer mutant and wild-type peptides before passing them NetMHC for IC₅₀ prediction.

The DAI was originally defined in Duan et al (2014) as the logarithm of the ratio between the likelihood of the mutant and wild-type k-mers under a profile weight matrix (PWM) model trained for the MHC-I allele by NetMHC 3.0. Because newer versions of NetMHC and other prediction programs including NetMHCpan no longer include a PWM model and instead use neural networks to directly predict IC₅₀, the Neoepitope Finder tool uses a DAI approximation computed as the difference between the base 2 logarithms of the wild-type k-mer IC₅₀ and the mutant k-mer IC₅₀. In addition to the main output file, which includes all possible 8, 9, 10, and 11-mer neoepitopes along with their mutant and wild-type IC₅₀ and DAI scores, the neoepitope finder tool also outputs a summary file including a list of the input SNVs characterized as missense, nonsense, nonstop, or nonsynonymous. It also includes overall counts for each SNV type (Fig. 6).

FIG. 6.

The tail of a sample SNV statistics file generated as a secondary output of the Epitope Finder tool. HT, heterozygous.

2.3.3. IC₅₀ and DAI filters

Prioritizing neoepitopes based on their potential to mediate tumor control or rejection is currently a very active area of research, including consortium efforts such as Wells et al (2020). Most commonly neoepitopes are prioritized based on their predicted binding affinity (usually measured as the IC₅₀). However, immunization experiments in mouse models (Brennick et al, 2021; Duan et al, 2014) and retrospective analyses of clinical outcomes of TCGA patients (Ghorani et al, 2018; Rech et al, 2018) suggest that neoepitopes with high DAI may be more important for long-lasting tumor control or rejection. GeNeo includes two filtering tools, which allow selecting from the raw output generated by the neoepitope finding tool the mutant peptides with IC₅₀ values below a user-specified threshold (by default 500 nm), or DAI scores above a specified threshold (by default 1). If desired the two filters can be combined to generate lists of candidates that meet both maximum IC₅₀ and minimum DAI criteria.

2.4 Ethics approval and consent to participate

The sequencing data analyzed in this manuscript was generated as part of the Phase I clinical trial “Study of Oncoimmunome for the Treatment of Stage III/IV Ovarian Carcinoma” (ClinicalTrials.gov Identifier: NCT02933073), conducted in accordance with FDA and Human Subject Protection regulations under IRB approval from the UConn Health Center. Written informed consent was obtained from all subjects.

3. GeNeo APPLICATION IN A PHASE I CLINICAL TRIAL

The clinical trial “Study of Oncoimmunome for the Treatment of Stage III/IV Ovarian Carcinoma” (ClinicalTrials.gov Identifier: NCT02933073) was established to test the safety and feasibility of the personalized OncoImmunome vaccine in stage III/IV ovarian carcinoma patients. OncoImmunome is a tumor-specific vaccine consisting of a mixture of 7–10 synthetic long peptides each containing 17–18 amino acids and encoding neoepitopes predicted to elicit anti-tumor immune responses. Women with stage III/IV ovarian cancer who have received the standard of care treatment for their cancer and were in clinical remission were given this vaccine, which is produced specifically for each participant based on somatic mutations identified from NGS sequencing data.

3.1. Exome, RNA, and AccessArray amplicon sequencing

Surgically resected tumor tissue samples were flash-frozen in LN2. Genomic DNA (gDNA) and mRNA from the tumor samples were purified using AllPrep DNA/RNA Mini Kit (Qiagen). Blood samples were collected using ACD Vacutainer blood collection tubes. gDNA was extracted using DNeasy Blood and Tissue Kit (Qiagen). The exome and transcriptome of these samples were sequenced using Illumina and Ion Torrent sequencing platforms.

Illumina sequencing

gDNA samples were prepared for exome sequencing using the Nextera Rapid Capture Exome kit (Illumina) and sequenced on the Illumina NextSeq 500 System using Mid Output v2 300 cycle kit. The RNA sample was prepared for RNA sequencing using the Illumina Stranded mRNA Library Prep kit, Set A, and sequenced on the Illumina NextSeq 500 using Mid Output v2 300 cycle kit.

Ion Torrent sequencing

Whole-exome sequencing of the isolated gDNA was performed using Ion AmpliSeq Exome RDY kit (ThermoFisher Scientific) on the Ion Torrent Proton or S5 sequencers. RNA libraries were prepared using the Ion Total RNA–Seq kit v2 from Life Technologies and also sequenced on Ion Torrent Proton or S5 sequencers.

AccessArray sequencing

The Fluidigm AccessArray instrument was used to amplify candidate somatic SNVs and make sequencing libraries from bulk gDNA from flash-frozen tumor samples using pairs of primers designed using the GeNeo primer designer tool. The AccessArray library was then sequenced using an Ion Torrent Personal Genome Machine (PGM).

3.2. Somatic SNV calling and neoepitope prediction using GeNeo

The CCCP tool was used to call somatic SNVs using Illumina and Ion Torrent exome and RNA sequencing data from tumor tissue and blood (matching normal) samples collected for four patients. Sequencing and read mapping statistics are given in Table 1. The output of CCCP was filtered using the 2CP filter. The resulting candidate somatic SNVs were validated using AccessArray targeted sequencing performed on quadruplicate bulk tumor and normal samples using primers designed using the GeNeo primer design tool. The AccessArray genotype caller tool was applied to demultiplex the AccessArray sequencing reads and call genotypes for candidate SNVs in each replicate sample. Next, we used our Neopitope Finder tool to generate the mutant peptide sequences, compute IC₅₀ values for both mutant and corresponding wild type peptides, and calculate DAI scores.

Table 1.

Sequencing Reads Statistics for Four Ovarian Carcinoma Patients

Patient ID	Tissue	Sample type	Sequencing technology	Raw reads	Raw coverage	Cleaned reads, n (%)	Aligned reads, n (%)	Aligned coverage
P1	Normal	DNA	Illumina	351,218,042	966 ×	328,337,616 (93.5)	278,832,592 (84.9)	764 ×
	Normal	DNA	Ion Torrent	38,782,872	148 ×	23,972,156 (61.8)	20,457,755 (85.3)	79 ×
	Tumor	DNA	Illumina	252,177,256	696 ×	238,305,856 (94.5)	199,225,522 (83.6)	548 ×
	Tumor	DNA	Ion Torrent	46,212,787	175 ×	27,831,586 (60.2)	23,692,202 (85.1)	91 ×
	Tumor	RNA	Illumina	321,676,904	886 ×	211,751,088 (65.8)	178,735,476 (84.4)	494 ×
	Tumor	RNA	Ion Torrent	93,290,414	245 ×	56,519,086 (60.6)	30,873,371 (54.6)	81 ×
P2	Normal	DNA	Illumina	208,855,380	583 ×	168,085,034 (80.5)	140,325,404 (83.5)	390 ×
	Normal	DNA	Ion Torrent	44,057,553	165 ×	21,473,273 (48.7)	18,478,296 (86.1)	70 ×
	Tumor	DNA	Illumina	272,371,194	705 ×	253,891,850 (93.2)	221,186,558 (87.1)	571 ×
	Tumor	DNA	Ion Torrent	41,896,052	159 ×	21,646,956 (51.7)	18,447,014 (85.2)	70 ×
	Tumor	RNA	Illumina	74,312,412	217 ×	44,702,332 (60.2%)	35,006,544 (78.3%)	102 ×
	Tumor	RNA	Ion Torrent	83,318,317	246 ×	33,003,717 (39.6)	23,104,262 (70.0)	78 ×
P3	Normal	DNA	Illumina	237,457,528	582 ×	225,468,888 (95.0)	197,533,276 (87.6)	483 ×
	Normal	DNA	Ion Torrent	35,478,465	128 ×	19,459,867 (54.8)	16,681,248 (85.7)	62 ×
	Tumor	DNA	Illumina	228,834,342	598 ×	201,205,014 (87.9)	174,570,852 (86.8)	455 ×
	Tumor	DNA	Ion Torrent	51,626,920	192 ×	28,337,234 (54.9)	24,300,056 (85.8)	91 ×
	Tumor	RNA	Illumina	60,565,664	174 ×	39,107,600 (64.6)	32,129,292 (82.2)	92 ×
	Tumor	RNA	Ion Torrent	73,401,289	261 ×	45,547,583 (62.1)	29,349,114 (64.4)	105 ×
P4	Normal	DNA	Illumina	246,930,376	629 ×	230,506,542 (93.3)	203,079,580 (88.1)	516 ×
	Normal	DNA	Ion Torrent	49,919,517	190 ×	23,741,773 (47.6)	20,466,458 (86.2)	80 ×
	Tumor	DNA	Illumina	256,340,514	666 ×	237,744,570 (92.7)	210,591,324 (88.6)	545 ×
	Tumor	DNA	Ion Torrent	42,706,638	163 ×	20,889,697 (48.9)	18,076,502 (86.5)	71 ×
	Tumor	RNA	Illumina	109,077,798	301 ×	83,019,084 (76.1)	70,732,572 (85.2)	193 ×
	Tumor	RNA	Ion Torrent	85,170,542	299 ×	49,939,099 (58.6)	24,497,660 (49.1)	90 ×

For each patient, the table shows the raw and aligned read counts and the average exome coverage based on normal exome, tumor exome, and tumor RNA-Seq reads generated using Illumina and Ion Torrent platforms.

Statistics collected for various stages of GeNeo processing, including the number of somatic SNVs identified by 2CP and the number of SNVs predicted to yield neoepitopes at different IC₅₀ and DAI thresholds, are given in Table 2. For comparison, the number of somatic SNV calls predicted by the less stringent “1 Call” filter is also included. The results show that somatic SNVs predicted by 2CP have a consistently high validation rate, typically yielding sufficient putative neoepitopes for inclusion in the cancer vaccine. For a larger set of neoepitope choices the set of SNVs resequenced on the AccessArray can be expanded to include more candidates from the “1 Call” list. Because the full “1 Call” list is typically too large to fit on a single AccessArray IFC (which has multiplexing capacity limited to 480 SNVs), semi-supervised learning can be used to select from the “1 Call” SNVs a subset of 480 candidates that are most likely to be somatic (Sherafat et al, 2020).

Table 2.

Somatic Single Nucleotide Variant and Predicted Neoepitope Statistics for Four Ovarian Carcinoma Patients

Patient ID	CCCP 1Cal	CCCP 2CP	HT&NonSyn	Designed primers	Working primers	Validated	$I C_{50}$		DAI
Patient ID	CCCP 1Cal	CCCP 2CP	HT&NonSyn	Designed primers	Working primers	Validated	$\leq$ 500 nM	$\leq$ 50 nM	$\geq 1$	$\geq 3$
P1	829	65	48 (74)	48 (100)	36 (75)	34 (94)	8	1	16	4
P2	1012	187	140 (75)	138 (99)	129 (93)	100 (78)	56	24	68	24
P3	813	76	55 (72)	54 (98)	54 (100)	42 (78)	7	2	28	7
P4	742	67	44 (66)	43 (98)	43 (100)	32 (77)	12	2	19	4

CCCP 1Call and CCCP 2CP columns show the number of somatic SNVs that were called by CCCP using the respective filters. HT&NonSyn shows the number and percentage of the 2CP somatic calls that are both heterozygous in tumor samples and nonsynonymous. Validation results based on targeted resequencing using the AccessArray and Ion Torrent PGM sequencing are also included. The Designed Primers columns show the number and percentage of the HT&NonSyn SNVs for which primer design was successful. Working Primers shows the number and percentage of designed primers that generated sequencing reads. Validated shows the number and percentage of SNVs with working primers that were actually validated as somatic. The last four columns show the number of validated SNVs predicted to yield neoepitopes based on commonly used IC₅₀ and DAI thresholds.

2CP, 2 Calls Cross-Platform; CCCP, Consensus Caller Cross-Platform; DAI, Differential Agretopic Index; IC₅₀, half maximal inhibitory concentration.

4. CONCLUSION

For simplicity and ease of use, the GeNeo toolbox is available via a public Galaxy portal available at https://neo.engr.uconn.edu/. A virtual machine image for running GeNeo on user's premises is available for academic use upon request. The image runs a Galaxy server configured with all GeNeo tools, including their dependencies, genome references, and read mapping indices. The only exception is NetMHC, which has to be downloaded directly from https://services.healthtech.dtu.dk/services/NetMHC-4.0/ and installed by the user because distribution to third parties is prohibited under the terms of the academic license agreement.

Footnotes

AUTHOR DISCLOSURE STATEMENT

PKS and IIM are inventors of patent US-10501801, “Identification of tumor-protective epitopes for the treatment of cancers”, disclosing the use of DAI for predicting whether immunization with a particular epitope will be protective against the tumor. The other authors declare they have no conflicting financial interests.

FUNDING INFORMATION

This work was supported in part by the Neag Cancer Immunology Translational Research Fund and the Eversource Energy Chair in Experimental Oncology. ES was supported in part by a US Department of Education GAANN fellowships (Office of Postsecondary Education Grant No. P200A180092). IIM was supported in part by the US National Science Foundation Award 1564936. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Afgan

, Baker

, Batut

, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 Update. Nucleic Acids Res 2018:46(W1):W537–W544.

Alqahtani

, Măndoiu

. Mitochondrial haplogroup assignment for high-throughput sequencing data from single individual and mixed DNA samples. In: Bioinformatics Research and Applications. Springer: Cham, Switzerland; 2020; pp. 1–12.

Andreatta

, Nielsen

Gapped sequence alignment using artificial neural networks: Application to the MHC class I system. Bioinformatics, 2016; 32(4):511–517.

Bais

, Namburi

, Gatti

, et al. CloudNeo: A cloud pipeline for identifying patient-specific tumor neoantigens. Bioinformatics, 2017; 33(19):3110–3112; doi: 10.1093/bioinformatics/btx375

Barnell

, Ronning

, Campbell

, et al. Standard operating procedure for somatic variant refinement of sequencing data with paired tumor and normal samples. Genet Med, 2019; 21:972–981.

Bjerregaard

A-M

, Nielsen

, Hadrup

, et al. Mupexi: Prediction of neo-epitopes from tumor sequencing data. Cancer Immunol Immunother, 2017; 66(9):1123–1130; doi: 10.1007/s00262-017-2001-3

Boegel

, Löwer

, Schäfer

, et al. HLA typing from RNA-Seq sequence reads. Genome Med, 2013; 4(12):1–12.

Boja

, Težak Ž, Zhang

, et al. Right data for right patient—A precision FDA NCI–CPTAC multi-omics mislabeling challenge. Nat Med, 2018; 24:1301–1302.

Boland

, Chung

, Roberson

, et al. The new sequencer on the block: Comparison of life technology's proton sequencer to an illumina hiseq for whole-exome sequencing. Hum Genet, 2013; 132(10):1153–1163; doi: 10.1007/s00439-013-1321-4

10.

Bolger

, Lohse

, Usadel

Trimmomatic: A flexible trimmer for illumina sequence data. Bioinformatics, 2014; 30(15):2114–2120.

11.

Brennick

, George

, Moussa

, et al. An unbiased approach to defining bona fide cancer neoepitopes that elicit immune-mediated cancer rejection. J Clin Invest, 2021; 131(3):e142823.

12.

Chang

T-C

, Carter

, Li

, et al. The neoepitope landscape in pediatric cancers. Genome Med, 2017;9:78; doi: 10.1186/s13073-017-0468-3

13.

Duan

, Duitama

, Seesi

, et al. Genomic and bioinformatic profiling of mutational neo-epitopes reveals new rules to predict anti-cancer immunogenicity. J Exp Med, 2014; 211(11):2231–2248.

14.

Duitama

, Srivastava

, Mandoiu

. Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data. BMC Genom, 2012; 13(Suppl. 2):S6.

15.

FASTX-Toolkit. 2022. Available from: http://hannonlab.cshl.edu/fastx_toolkit [Last accessed: November 30, 2022].

16.

Ghorani

, Rosenthal

, McGranahan

, et al. Differential binding affinity of mutated peptides for MHC class I is a predictor of survival in advanced lung cancer and melanoma. Ann Oncol, 2018; 29(1):271–279.

17.

Goodwin

, McPherson

, McCombie

. Coming of age: Ten years of next-generation sequencing technologies. Nat Rev Genet, 2016; 17:333–351.

18.

Kho

SJ.

Sample mislabeling detection and correction in bioinformatics experimental data. CORE Scholar 2021. Available from: https://corescholar.libraries.wright.edu/etd_all/2443 [Last accessed: March 23, 2023].

19.

Kim

, Langmead

, Salzberg

. HISAT: A fast spliced aligner with low memory requirements. Nat Methods, 2015; 12:357–360.

20.

Kim

, Kim

, et al. Neopepsee: Accurate genome-level prediction of neoantigens by harnessing sequence and amino acid immunogenicity information. Ann Oncol, 2018; 29(4):1030–1036.

21.

Koboldt

DC.

Best practices for variant calling in clinical sequencing. Genome Med, 2020; 12(1):1–13.

22.

Kodysh

, Rubinsteyn

OpenVax: An open-source computational pipeline for cancer neoantigen prediction. Methods Mol Biol, 2020; 2120:147–160.

23.

Lang

, Schrörs

, Löwer

, et al. Identification of neoantigens for individualized therapeutic cancer vaccines. Nat Rev Drug Discov 2022.

24.

, Handsaker

, Wysoker

, et al. The sequence alignment/map format and samtools. Bioinformatics, 2009; 25(16):2078; doi: 10.1093/bioinformatics/btp352

25.

, Lu

H-M

, Chen

, et al. Sanger confirmation is required to achieve optimal sensitivity and specificity in next-generation sequencing panel testing. J Mol Diagn, 2016; 18(6):923–932.

26.

PhyloTree.org. PhyloTree.org—human mtDNA tree: Phylogeny, haplotree, haplogroups, mutations, RSRS, rCRS. 2016; Available from: https://www.phylotree.org [Last accessed: November 30, 2022].

27.

Pruitt

, Harrow

, Harte

, et al. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res, 2009; 19(7):1316–1323.

28.

Quinlan

AR.

Bedtools: The swiss-army tool for genome feature analysis. Curr Protocols Bioinform, 2014;47:11.12.1–11.12.34.

29.

Rao

, Madejska

, Pfeil

, et al. Protect—prediction of t-cpcell epitopes for cancer therapy. Front Immunol, 2020;11:83296; doi: 10.3389/fimmu.2020.483296

30.

Rech

, Balli

, Mantero

, et al. Tumor immunity and survival as a function of alternative neopeptides in human cancer. Cancer Immunol Res, 2018; 6(3):276–287.

31.

Robinson

, Thorvaldsdóttir

, Wenger

, et al. Variant review with the integrative genomics viewer. Cancer Res, 2017; 77(21):e31–e34.

32.

Rubinsteyn

, Kodysh

, Hodes

, et al. Computational pipeline for the pgv-001 neoantigen vaccine trial. Front Immunol, 2018;8:1807; doi: 10.3389/fimmu.2017.01807

33.

Saunders

, Wong

WSW

, Swamy

, et al. Strelka: Accurate somatic small-variant calling from sequenced tumorâ“normal sample pairs. Bioinformatics, 2012; 28(14):1811; doi: 10.1093/bioinformatics/bts271

34.

Schenck

, Lakatos

, Gatenbee

, et al. NeoPredPipe: High-throughput neoantigen prediction and recognition potential pipeline. BMC Bioinformatics, 2019; 20(1):264; doi: 10.1186/s12859-019-2876-4

35.

Sherafat

, Force

, Măndoiu

. Semi-supervised learning for somatic variant calling and peptide identification in personalized cancer immunotherapy. BMC Bioinf, 2020; 21(18):1–17.

36.

Sherry

, Ward

M-H

, Kholodov

, et al. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res, 2001; 29(1):308; doi: 10.1093/nar/29.1.308

37.

Shi

, Guo

, Su

, et al. Deepantigen: A novel method for neoantigen prioritization via 3d genome and deep sparse learning. Bioinformatics, 2020; 36(19):4894–4901; doi: 10.1093/bioinformatics/btaa596

38.

Wells

, van Buuren

, Dang

, et al. Key parameters of tumor epitope immunogenicity revealed through a consortium approach improve neoantigen prediction. Cell, 2020; 183(3):818–834.

39.

Wood

, Nguyen

, Struck

, et al. Neoepiscope improves neoepitope prediction with multivariant phasing. Bioinformatics, 2020; 36(3):713–720; doi: 10.1093/bioinformatics/btz653

40.

, Luo

, Qian

, et al. Fastuniq: A fast de novo duplicates removal tool for paired short reads. PLoS One, 2012; 7(12):e52249; doi: 10.1371/journal.pone.0052249

41.

Zhou

, Wei

, Zhang

, et al. pTuneos: Prioritizing tumor neoantigens from next-generation sequencing data. Genome Med, 2019; 11(1):1–17.

42.