Abstract
Human papilloma virus has a clearly demonstrated role in cervical and head and neck cancers, but viral etiology for other solid tumors is less well understood. To expand this area of research, we obtained and analyzed the immune receptor recombinations available from both blood and tumor samples, through mining of exome files produced from those sources, for 32 cancer types represented by The Cancer Genome Atlas (TCGA). Among TCGA data sets, the recovery frequency for antiviral complementarity determining region-3 sequences (CDR3s), for T cell receptor-alpha and T cell receptor-beta, ranged from 0% to 21% of the patients across the different cancer types, with breast, lung, pancreatic, and thymus cancers at the high end of that range, particularly for tumor tissue resident T cells. In several cases, recovery of the antiviral CDR3s was associated with distinct survival rates, and in all of these cases, the recovery of an antiviral CDR3 was associated with a worse survival rate.
Introduction
Two cancers in particular have long histories of extensive study of the role of viruses in cancer development: Epstein Barr virus (EBV) in B cell lymphoma and human papilloma virus (HPV) in cervical cancer (2,11). In addition, there are other liquid tumors known to have a viral etiology, and several cancers have tangential associations with viral infections, such as Kaposi's sarcoma and Kaposi's sarcoma-associated herpes virus (KSHV) in the immunocompromised state (3), and nasopharyngeal carcinoma and EBV (20). However, other than HPV and cervical cancer (and to a lesser extent, head and neck cancers), viral connections, if any, to the development of solid tumors have yet to be widely applied to therapeutic approaches. In particular, clarifying and better understanding virus–cancer associations holds great promise for reducing cancer incidence through prophylactic vaccinations, a goal now being aggressively pursued worldwide in the case of cervical cancer (9) and vaccinations against HPV infection.
There have been a variety of approaches to furthering the study of viral links to solid tumors, including analyses of viral DNA associated with tumor tissues. Polymerase chain reaction-based studies of cytomegalovirus (CMV) and EBV in breast cancer tissue have yielded conflicting results, and there has been no generally accepted determination that either virus has a causal role in breast cancer (22). Nevertheless, there have been reports elucidating possible effects of CMV on breast cancer progression. For example, Bishop et al. (1) have demonstrated that exposure to the CMV IL-10 orthologue, cmvIL-10 (a secreted viral protein with 27% sequence identity to human IL-10), can enhance proliferation and upregulation of matrix metallopeptidases in the MCF7 breast cancer cell line, which expresses the IL-10R. In addition, Richardson et al. (21) demonstrated that among women who are seropositive for CMV, there was a higher average anti-CMV IgG level in women with breast cancer than in normal controls. The latter result may indicate that a more recent exposure to CMV, perhaps an exposure later in life, may represent a risk factor for breast cancer.
A potential connection between viral exposure and cancer development has been very difficult to assess, possibly due to the very long incubation period that would be expected in most cases, and because a cancer that is clearly attributable to viral exposure is a relatively rare outcome. Thus, big data approaches, which could allow detection of rarer connections above background events, may offer new avenues for success. Cantalupo et al. (6) recently identified DNA sequences for many viruses in the RNASeq, whole exome sequence (WXS), and whole genome files for many different cancers represented by The Cancer Genome Atlas (TCGA). Thus, we sought to determine whether antiviral complementarity determining region-3 sequences (CDR3s) could be detected among the T cell receptor-alpha (TRA) and T cell receptor-beta (TRB) recombination reads present in the TCGA WXS files for both tumor tissue and blood. Results revealed an extensive collection of public antiviral CDR3s, and in all cases where an association was measurable, the recovery of antiviral CDR3s was associated with a worse survival rate.
Methods
Obtaining TRA and TRB V(D)J recombination reads from TCGA provisional exome files
WXS files of TCGA samples representing primary and metastatic tumors, along with blood-matched WXS files, were downloaded from the Genomic Data Commons (GDC) in binary alignment map (BAM) format, to the University of South Florida Research Computing using the GDC Data Transfer Tool with authorization through dbGaP approved project #6300 (supporting online material (SOM), Supplementary Table S1, example GDC download manifest). Immune receptor recombination reads were recovered from these WXS files as described previously (8,29,34,35) using original software provided at the end of the SOM.
Matching viral TRA/TRB
The final list of productive TRA and TRB V(D)J recombination reads was translated into a peptide sequence. The CDR3 domain was defined as starting with a cysteine (the 2nd cysteine in the V region), extending into the J region, and ending in the conserved (Phe/Trp)–Gly–X–Gly domain. Only CDR3 domains lacking stop codons or frame shift mutations were counted as productive and included in the analyses. In addition, the V and J sequences met standards described in Ref. (8). These CDR3 domains were then matched to previously identified antiviral CDR3s at
Statistical analyses
Overall survival (OS) data, for the TCGA provisional head and neck squamous cell carcinoma (HNSC), brain lower grade glioma (LGG), and breast invasive carcinoma (BRCA) data sets, were downloaded from the cBioPortal website (7,12). TCGA viral V(D)J recombination reads from the WXS files, derived as already indicated, were matched to OS and disease-free survival (DFS) data associated with specific TCGA case IDs (formerly TCGA barcodes, effectively TCGA patients). The Kaplan–Meier survival analyses were then conducted using GraphPad Prism software, exactly as described (8,34) and as further detailed in the Results section. (See also Table 4 for the multivariate analysis approach.)
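The survival comparisons in this report were run in GraphPad Prism. For readers who want to see the underlying calculation, the following is a minimal Python sketch of a Kaplan–Meier estimate and a two-group log-rank test. The function names and the toy follow-up times are hypothetical illustrations, not TCGA data, and the sketch is not claimed to reproduce the Prism output.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up time per case ID
    events : 1 if the event (death/recurrence) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all cases sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p value)."""
    times_a, times_b = list(times_a), list(times_b)
    events_a, events_b = list(events_a), list(events_b)
    all_pairs = list(zip(times_a + times_b, events_a + events_b))
    event_times = sorted({t for t, e in all_pairs if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance if variance > 0 else 0.0
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical toy data: months of follow-up and event indicators.
    with_cdr3 = ([5, 8, 12, 20], [1, 1, 1, 0])
    without_cdr3 = ([10, 25, 30, 42, 50], [1, 0, 1, 0, 0])
    print(kaplan_meier(*with_cdr3))
    print(logrank_test(*with_cdr3, *without_cdr3))
```

In this layout, the first group would hold the case IDs representing recovery of a given antiviral CDR3 and the second group would hold all remaining case IDs in the data set.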
Obtaining TRA and TRB V(D)J recombination reads from HIV exome files
The WXS files from the blood of human immunodeficiency virus (HIV) patients, generated in Ref. (17), were processed for recovery of immune receptor recombination reads as detailed in Refs. (29,32). The reads were then programmatically translated to amino acid (AA) sequences, to obtain the CDR3 domain.
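As an illustration of the translation step, the following minimal Python sketch translates a read in each forward reading frame and extracts a candidate CDR3 using the definition given under "Matching viral TRA/TRB" (a cysteine through the conserved (Phe/Trp)–Gly–X–Gly motif, with no stop codon). It is not the original software referenced in the SOM. Two simplifying assumptions are made here: the anchoring cysteine is taken to be the last cysteine upstream of the motif rather than being located by V gene alignment, and the returned sequence includes the full four-residue motif. The function names and the example read are hypothetical.

```python
import re

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Conserved J-region motif: (Phe/Trp)-Gly-X-Gly.
J_MOTIF = re.compile(r"[FW]G[A-Z]G")


def translate(nt, frame=0):
    """Translate a nucleotide string from the given frame; unknown codons -> X."""
    nt = nt.upper()
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def extract_cdr3(nt):
    """Return the first productive CDR3 found in any forward frame, else None."""
    for frame in range(3):
        protein = translate(nt, frame)
        for motif in J_MOTIF.finditer(protein):
            # Assumption: anchor on the last Cys upstream of the J motif.
            cys = protein.rfind("C", 0, motif.start())
            if cys == -1:
                continue
            cdr3 = protein[cys:motif.end()]
            # A stop codon between the Cys and the motif is non-productive.
            # A frame shift puts the Cys and the motif in different frames,
            # so no in-frame candidate is found in that case.
            if "*" not in cdr3:
                return cdr3
    return None


if __name__ == "__main__":
    # Hypothetical read, for illustration only.
    read = "ATACTTCTGTGCCAGCAGCTTGGGACAGGCCTACGAGCAGTACTTCGGGCCGGGCACC"
    print(extract_cdr3(read))
```

A production pipeline would also need to handle reverse-complement reads and confirm the V and J gene segment identities, which this sketch omits.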
Results
Identification of antiviral CDR3s in TCGA provisional data sets
WXS files were downloaded for 32 TCGA provisional data sets, and CDR3 regions of TRA and TRB were translated to an AA sequence and matched to previously identified antiviral CDR3s (26) (Methods section). The number of total antiviral CDR3s recovered in TCGA tumor samples, normalized to the number of patients (TCGA case IDs) in the indicated data sets, is indicated in Figure 1A. We repeated the same analysis for TCGA case-matched WXS files for the blood samples, across the same 32 TCGA provisional data sets (Fig. 1B). A more detailed table of counts for both tissue and blood samples is available in the SOM (Supplementary Table S5).
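The matching and per-patient normalization described above can be sketched as follows. This is a minimal illustration that assumes exact string matching of the translated CDR3 against a reference table of antiviral CDR3s; the reference entries, case IDs, cohort sizes, and function name are all hypothetical placeholders, not values from this study or from any database.

```python
from collections import defaultdict


def antiviral_recovery(recovered, reference, cohort_sizes):
    """Summarize antiviral CDR3 recoveries per data set.

    recovered    : iterable of (data_set, case_id, cdr3_amino_acids)
    reference    : dict mapping a known antiviral CDR3 to its virus label
    cohort_sizes : dict mapping data_set to its number of case IDs
    """
    matches = defaultdict(int)
    cases = defaultdict(set)
    by_virus = defaultdict(lambda: defaultdict(int))
    for data_set, case_id, cdr3 in recovered:
        virus = reference.get(cdr3)  # exact-match assumption
        if virus is None:
            continue
        matches[data_set] += 1
        cases[data_set].add(case_id)
        by_virus[data_set][virus] += 1
    summary = {}
    for data_set, n_cases in cohort_sizes.items():
        summary[data_set] = {
            "matches": matches[data_set],
            # Total antiviral CDR3s normalized to the number of case IDs.
            "per_patient": matches[data_set] / n_cases,
            # Fraction of case IDs with at least one antiviral CDR3.
            "fraction_of_cases": len(cases[data_set]) / n_cases,
            "by_virus": dict(by_virus[data_set]),
        }
    return summary


if __name__ == "__main__":
    # Hypothetical placeholder sequences and labels, for illustration only.
    reference = {"CASSAAAEQYF": "CMV", "CAVNNNKLIF": "influenza A"}
    recovered = [
        ("BRCA", "case-1", "CASSAAAEQYF"),
        ("BRCA", "case-1", "CAVNNNKLIF"),
        ("BRCA", "case-2", "CASSQQQEQYF"),
        ("LGG", "case-3", "CAVNNNKLIF"),
    ]
    print(antiviral_recovery(recovered, reference, {"BRCA": 4, "LGG": 2}))
```

The sketch keeps the two summaries separate because the total number of recoveries per patient and the fraction of patients with any recovery can differ when one case ID contributes several antiviral CDR3s.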

Total counts of antiviral CDR3 recoveries across 32 TCGA data sets.
Next, we determined which of the recovered antiviral CDR3s were well represented across the TCGA data sets. Across all TCGA data sets, there was recovery of CDR3s associated with CMV, dengue fever virus (DENV), EBV, hepatitis C virus (HCV), HIV-1, human T-lymphotropic virus 1 (HTLV-1), influenza A, and yellow fever virus (YFV) (Fig. 2). Among these results, the most prominent observations were (i) the recovery of anti-CMV and anti-influenza A CDR3s for both TRA and TRB was very common across TCGA, (ii) the tumor tissue and blood recoveries were similar, for example, both anti-CMV and anti-influenza A CDR3s were well represented in both tumor and blood, and (iii) the high ratio of TRA to TRB antiviral CDR3s reflects the ratio seen in numerous other analyses of TCR recombination reads recovered from WXS files, likely closely related to the fact that both TRA alleles recombine in developing T cells (18). (In addition, given the search algorithms used, the likelihood of identifying a TRB recombination read is slightly less than the likelihood of identifying a TRA recombination read due to the extra space occupied in the read by the TRB D region. This extra space reduces the space available for unequivocal identifications of V and J regions.)

Specific antiviral CDR3 recoveries across 32 TCGA data sets, based on TRA or TRB recombination and sample type (tissue or blood).
Cancer survival distinctions from distinct viral CDR3 recovery in blood
To assess potential survival rate distinctions for case IDs representing viral CDR3 recovery from blood WXS files, we first compared the OS rates for HNSC case IDs representing anti-influenza A CDR3 recovery, for either TRA or TRB CDR3s, to the OS rates for all remaining case IDs. Specific antiviral CDR3 counts for all HNSC case IDs are shown in Figure 3A. Results indicated that case IDs representing anti-influenza A CDR3 recovery also represented worse OS (Fig. 3B, p = 0.0308 and Supplementary Table S6). Next, we compared DFS of LGG case IDs representing anti-CMV CDR3 recovery in TRA to the DFS for all remaining LGG case IDs. Specific viral CDR3 counts for all LGG case IDs are shown in Figure 4A. Results indicated that the case IDs representing anti-CMV CDR3 recovery also represented worse DFS rates (Fig. 4B, p = 0.0371 and Supplementary Table S7).

Antiviral CDR3 distribution specific to HNSC blood samples and associated survival outcomes.

Antiviral CDR3 distribution specific to LGG blood samples and associated survival outcomes.
Specific antiviral CDR3 counts for all the BRCA case IDs are shown in Figure 5A. Given the high counts of anti-CMV recovery in TRB compared with other antiviral CDR3 recoveries, we first compared the DFS of BRCA case IDs representing anti-CMV TRB CDR3 recovery to the DFS for all remaining BRCA case IDs. Results indicated that case IDs representing anti-CMV TRB CDR3 recoveries also represented a worse DFS (Fig. 5B, p = 0.0304 and Supplementary Table S8). We next compared the DFS of BRCA case IDs representing anti-YFV CDR3 recovery for TRA recombination reads with the DFS for all remaining BRCA case IDs. Results indicated that the case IDs representing the anti-YFV TRA CDR3 recoveries also represented a worse DFS (Fig. 5C, p = 0.0182 and Supplementary Table S9). The latter result was replicated using the McPAS-TCR database for identifying anti-YFV CDR3s among the BRCA case set: BRCA cases with an anti-YFV CDR3 had a worse DFS (p = 0.003) (Fig. 5C).

Antiviral CDR3 distribution specific to BRCA blood samples and associated survival outcomes.
Identification of antiviral CDR3s using an alternative algorithm for identification of the TCR recombination reads in the WXS files
To reduce the possibility of any artifacts associated with the approach to identifying the V(D)J recombination reads among the TCGA data sets, we assessed the antiviral CDR3 occurrence in a series of ∼15,000 TRA and TRB recombination reads recovered from the WXS files of 10 such data sets using a variant algorithm for identifying the recombination reads. This second algorithm was used in Refs. (4,5,15,16,24,29–32), and differs from the first algorithm as follows. For the initial studies in this report, as already mentioned, the final determination of the V gene segment designation was based on sequence homology using all available germline nucleotides following the second cysteine in TRA and TRB, in addition to any nucleotides preceding the second cysteine needed to meet the minimum acceptable nucleotide match length, as described in detail in Refs. (8,34). In the second algorithm (Tables 1 and 2; Supplementary Table S10), the determination of the V gene segment was based exclusively on sequence homology preceding the second cysteine, in applying the minimum acceptable nucleotide match length standard. The results from the application of the second algorithm indicated that antiviral CDR3s can be recovered from the TCGA data sets, indicating that there is nothing about the recoveries in the preceding data that would represent a fundamental flaw leading to an artifactual identification of antiviral CDR3s.
Recovery of Antiviral Complementarity Determining Region-3 Sequences Using an Alternative Algorithm for Identification of T Cell Receptor Recombination Reads in The Cancer Genome Atlas Whole Exome Sequence Files
TRA, T cell receptor-alpha; TRB, T cell receptor-beta.
Viruses Represented by the Antiviral Complementarity Determining Region-3 Sequences Indicated in Table 1
CMV, cytomegalovirus; EBV, Epstein Barr virus; HIV, human immunodeficiency virus.
Identification of antiviral CDR3s in an additional independent set of WXS files
To determine whether antiviral CDR3s could also be recovered from a non-TCGA WXS file data set, we processed a series of 365 WXS files from the blood of HIV patients, prepared in Refs. (14,17). The WXS files were processed using the method of Refs. (4,5,15,16,24,29–32). As expected, many CDR3s representing viruses were also recovered from this TCGA-independent data set (Table 3, Supplementary Tables S11 and S12). Thus, the recoveries of antiviral CDR3s from the 32 TCGA cancer WXS data sets, reported above, were not likely an artifact of the preparation of the TCGA WXS files.
Viruses Represented by the Antiviral Complementarity Determining Region-3 Recoveries from the HIV Blood Exome Data Set (Supplementary Tables S11 and S12)
Other clinical variables
The opportunity to assess whether a large number of other clinical variables could account for survival associations was not available, due to limited clinical data. However, age and gender could be incorporated into a multivariate Cox regression analysis without overimputation of data. Results indicated that the survival associations with the recovery of antiviral CDR3s, as represented by Figures 3–5, were maintained in this multivariate analysis (Table 4), although in the case of LGG DFS, this multivariate analysis indicates an independent trend rather than a conventional standard of independent statistical significance.
Multivariate Cox Regression Analysis of the Association of Antiviral Complementarity Determining Region-3 Sequences with Reduced Survival
The analysis was conducted using the CoxPHFitter class in the lifelines package for Python.
ND, not done; YFV, yellow fever virus; OS, overall survival.
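For orientation, a Cox proportional hazards model with the three covariates named above (antiviral CDR3 recovery, age, and gender) can be written in the general textbook form below. The symbol names are assumptions introduced here for illustration and the covariate coding used in the actual analysis is not specified in the text.

```latex
h(t \mid x) = h_0(t)\,
  \exp\!\left(
    \beta_{\mathrm{CDR3}}\, x_{\mathrm{CDR3}}
    + \beta_{\mathrm{age}}\, x_{\mathrm{age}}
    + \beta_{\mathrm{gender}}\, x_{\mathrm{gender}}
  \right),
\qquad
\mathrm{HR}_{\mathrm{CDR3}} = \exp\!\left(\beta_{\mathrm{CDR3}}\right)
```

Here h_0(t) is the baseline hazard and x_CDR3 is 1 for case IDs with recovery of the antiviral CDR3 and 0 otherwise. A hazard ratio greater than 1 for the CDR3 term, with age and gender in the model, would correspond to the independent association with worse survival described above.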
Discussion
Before considering these results, it is important to note that immune receptor recombinations in general, when recovered from tumor WXS files, strongly correlate with many features of cancer development, and have been extensively benchmarked, in ways that are related to the cancer immune response. As just two examples, the presence of B cell receptor recombinations in pancreatic adenocarcinoma (PAAD) WXS files strongly correlates with a worse outcome (15), a finding that is consistent with a report that bacterial presence in PAAD correlates with a worse outcome (13); that is, the intratumoral bacteria would likely stimulate a B cell response. In addition, high levels of TCR recombination read recoveries from WXS files correlate with high levels of immune checkpoint proteins in melanoma (31). Thus, the data reported here indicate one more opportunity to exploit the common availability of tumor exomes, as part of large databases and retrospective studies, or in the future, as an increased opportunity for diagnostic value for the individual patient.
These data raise the question of the functional basis for the antiviral CDR3s at the site of the tumor tissue, particularly keeping in mind previous detections of likely bystander lymphocytes in tumor samples (25,27). This question takes on additional importance given the recovery of viral DNA sequences in the TCGA tumor tissue WXS files (6), and the observations by others, using completely distinct approaches, of tumor resident, virus-specific T cells (23). The answer to this question will likely depend on more sophisticated, more extensive, and more frequent monitoring of viral infections and cancer development, in contrast to the status quo, where only modest assessments are possible, due to the lack of long-term sampling and the lack of patient monitoring over long time periods. Certain empirical approaches may also yet shed light on this issue, although empirical approaches can be a challenge in human settings. Nevertheless, there has yet to be an exhaustive assessment of viral peptide or protein presence in isolated human tumors or thorough empirical assessments of potential viral “hit and run” impacts on human cell malignant transformation.
Regardless of the viral mechanism in human cancer development, and the antiviral human immune response related to cancer development, the data reported here do indicate a potential practical biomarker value in assessing the antiviral CDR3s in cancer patients. Specifically, recovery of antiviral CDR3s in the blood samples representing certain cancers correlated with a worse survival outcome. Returning to the issue of cause and effect, it cannot be ruled out that the association of the antiviral CDR3s with a worse survival could reflect only a generalized reduced state of health or a generalized reduced immunocompetency; that is, the presence of the antiviral CDR3s may simply reflect viral infections stemming from this reduced state of health. More specific connections to a worse outcome could include basic aspects of inflammation that are clearly associated with cancer development in several settings, the most prominent example being the association of colon cancer development with Crohn's disease (10,19,33). Regardless, the biomarker opportunity for prognosis, having been established for several cancers, is clearly available as a result of these data.
Footnotes
Acknowledgments
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No relevant funding.
Supplementary Material
Supplementary Table S1
Supplementary Table S2
Supplementary Table S3
Supplementary Table S4
Supplementary Table S5
Supplementary Table S6
Supplementary Table S7
Supplementary Table S8
Supplementary Table S9
Supplementary Table S10
Supplementary Table S11
Supplementary Table S12
References
