Abstract
Background:
Familial non-medullary thyroid cancer (NMTC) accounts for a relatively small proportion of thyroid cancer cases, but it displays strong genetic predisposition. So far, only a few NMTC susceptible genes and low-penetrance variants contributing to NMTC have been described. This study aimed to identify rare germline variants that may predispose individuals to NMTC by sequencing a cohort of 17 NMTC families.
Methods:
Whole-genome sequencing and genome-wide linkage analysis were performed in 17 NMTC families. MendelScan and BasePlayer were applied to screen germline variants followed by customized filtering. The remaining candidate variants were subsequently validated by Sanger sequencing. A panel of 277 known cancer predisposition genes was also screened in these families.
Results:
A total of 41 rare coding candidate variants in 40 genes identified by whole-genome sequencing are reported, including 24 missense, five frameshift, five splice change, and seven nonsense variants. Sanger sequencing confirmed all 41 rare variants and proved their co-segregation with NMTC in the extended pedigrees. In silico functional analysis of the candidate genes using Ingenuity Pathway Analysis showed that cancer was the top category of “Diseases and Disorders.” Additionally, a targeted search displayed six variants in known cancer predisposition genes, including one frameshift variant and five missense variants.
Conclusions:
The data identify rare germline variants that may play important roles in NMTC predisposition. It is proposed that in future research including functional characterization, these variants and genes be considered primary candidates for thyroid cancer predisposition.
Introduction
Thyroid cancer is recognized as the most common malignancy of the endocrine system. The risk of developing thyroid cancer is influenced by hereditary factors, environmental factors, and somatic replication errors (1). According to a report by the National Cancer Institute for 2018, the estimated number of new thyroid cancer cases in the United States was 53,990 (
Thyroid cancer can also be classified into two categories: high penetrance, often occurring in families, and low penetrance, mostly occurring in sporadic cases. Regarding high-penetrance thyroid cancers, <5% display syndromes, including Cowden syndrome, familial adenomatous polyposis (FAP) of the colorectum, Gardner syndrome (a subtype of FAP), Carney complex, and Werner syndrome (3). In the remaining 95%, a small number of genes have been detected or proposed (9 –15). Given the high heritability of thyroid cancer determined by case/control studies (16,17), it is likely that low-penetrance variants contribute to a substantial but as of yet undetermined proportion of all thyroid cancers. Recent studies of sizeable cohorts of patients using genome-wide association study (GWAS) techniques have indeed confirmed that such low-penetrance predisposition variants play important roles (18 –21).
The present study aimed to identify and characterize the traditional class of predisposing factors, namely rare high-penetrance gene variants. As mentioned above, several such variants have been identified and annotated. It is obvious that many more such variants must exist. However, it is notoriously difficult to prove the pathogenic nature of such variants, all of which are rare and some of which have been detected in just a single family. The existence of thyroid cancer in large families encompassing three or more affected individuals suggests regular Mendelian inheritance of the predisposition. These types of families fulfill the traditional requirement for successful linkage analysis and are amenable to the now available next-generation sequencing (NGS) approaches. This study identified 41 variants in 40 genes qualifying as primary candidates for high-risk thyroid cancer predisposition using whole-genome sequencing (WGS) and whole-genome linkage analysis. The results highlight several new potential predisposition genes for future study.
Methods
The study was approved by the Institutional Review Board at the Ohio State University, and all subjects gave written informed consent before participation.
Genomic DNA preparation
Genomic DNA was extracted from patient blood samples following standard inorganic isolation procedures (22). To remove RNA contamination, DNA was purified by the standard phenol-chloroform extraction procedure after RNase treatment using PureLink™ RNase A (Thermo Fisher Scientific). DNA concentration was determined with a Qubit DNA Assay Kit (Life Technologies) with a Gemini XPS Microplate Reader (Molecular Devices). Fragment distribution of the DNA library was measured using the Agilent Bioanalyzer 2100 system (Agilent Technologies).
WGS and variant calling
WGS (2 × 150 bp paired-end reads) of the 56 NMTC patient DNA samples was performed using the HiSeq X (Illumina) platform at Novogene Co. Ltd. Primary genome sequencing data analysis was performed using Churchill (23), which implements the GATK “best practices” workflow for alignment, variant discovery, and genotyping. Reads were mapped to the GRCh37 reference sequence (human_g1k_v37_decoy) using BWA v0.7.15. Duplicate reads were identified and removed using samblaster v.0.1.22. SNV/indel discovery and genotyping were performed on a per-family basis (joint calling) using GATK v3.7. SNPeff, ANNOVAR, and custom in-house scripts were used to annotate single nucleotide polymorphisms (SNPs)/insertions and deletions (INDELs) with gene, transcript, function class, damaging scores, and population allele frequencies.
Variant filtering and annotation for the identification of rare variants
Variants were subsequently analyzed using MendelScan v1.2.3 under a dominant disease model. The resulting variants were filtered to obtain those that segregate with disease (segregation score = 1.0), which means all the variants were shared by every WGS tested individual within each family. First, candidate variants were selected at <0.001 maximum allele frequency in any population of the Genome Aggregation Database (gnomAD) data. Customized cutoff thresholds were further applied to two different variant categories. Missense variants present in four or more individuals in gnomAD and truncating variants (nonsense, splice change, and frameshift) present in ≥30 individuals in gnomAD were removed. Variants were annotated using Variant Effect Predictor (VEP, Ensembl release 50) to isolate variants likely to be pathogenic. Truncating variants (nonsense, splice change, and frameshift) were automatically considered pathogenic. Missense variants were scored using 10 prediction software programs offered through VarSome (DANN, MutationTaster, FATHMM, FATHMM-MKL, MetaSVM, MetalR, LRT, MutationAssessor, SIFT, and Provean) and considered as pathogenic if at least 3/10 pathogenic scores were obtained. Missense variants in genes that (i) appear to be tolerant of missense variation (constraint Z-score <1.0), (ii) exhibit low thyroid expression in the GTEx data set, or (iii) encode proteins with unknown functions were excluded.
Additionally, to avoid missing important truncating variants caused by algorithm bias, variants (nonsense, splice change, and frameshift) from the same VCF files were further analyzed using BasePlayer v1.0.0. All the selected variants fit the same population frequency criteria (<30 individuals in gnomAD).
The variants with positive linkage scores (Z-score >0) were further selected, and they were confirmed by Sanger sequencing in all members with DNA available (including the samples used in WGS). These selected variants were evaluated for co-segregation with disease, that is, the variant must be carried by all the affected individuals and not be carried by any of the unaffected individuals. Only the individuals with diagnosed thyroid cancer were considered as affected. For the individuals with benign thyroid diseases, their genotypes were not considered in the selection criteria. The list of 41 rare variants was generated including the variants identified by both methods.
Genome-wide linkage analysis
Genome-wide linkage analysis was performed using genotypes obtained with HumanCytoSNP-12 BeadChip (Illumina) in a total of 107 samples, including all the 56 affected individuals used in WGS and other family members. Nonparametric linkage analysis was applied with MERLIN v1.1.2 (24), and Z-scores and LOD-scores were obtained. The Z-scores of the selected variants were approximated by the average value of Z-scores of two adjacent markers used in linkage analysis.
Polymerase chain reaction and Sanger sequencing validation
To validate the finding of the WGS variants after filtering, Sanger sequencing was performed on the candidate variants in all the family members with DNA available. Polymerase chain reaction (PCR) primers of the 41 variants with co-segregation are shown in Supplementary Table S1. PCR assays were performed using either an AmpliTaq Gold DNA polymerase kit (Thermo Fisher Scientific) or a HotStarTaq DNA Polymerase kit (Qiagen). PCR products were purified using ExoSAP-IT™ PCR Product Cleanup Reagent (Thermo Fisher Scientific) before being sent for sequencing using an ABI3730 DNA Analyzer at the Genomics Shared Facility, OSU.
Variant filtering and annotation in the 277 known cancer predisposition genes
Variants in the coding regions of 277 known cancer predisposition genes were annotated by ANNOVAR (25). Only the variants present in all the WGS patient samples of at least one NMTC family were selected. A cutoff value of 0.05 was used for the maximum allele frequency in any population of the gnomAD, Exome Sequencing Project v. 6500 (ESP6500), Exome Aggregation Consortium (ExAC), and 1000 Genome databases. After filtering, all the variants annotated as “synonymous” or “unknown” were removed. Predicted functional consequences of the missense variants were annotated using PredictSNP2 (26). Only the variants predicted as “deleterious” were kept. One variant each in AGK and RNF213 were excluded, since they had been identified in the 41 rare variants list. Only the variants with positive linkage Z-score were chosen (Z > 0). As a result, a total of six variants, including five missense variants and one frameshift variant, were obtained.
Databases and web resources
gnomAD:
VEP:
Varsome:
Hereditary Thyroid Cancer Panel of the University of Chicago:
PredictSNP2:
Data availability statement
WGS data of the 10 NMTC patients in this study are available at dbGAP (accession number phs001758.v1.p1). The sequence data of the rest 46 patients are not publicly available because the contain information that could compromise research privacy/consent.
Results
NMTC patient sample selection for WGS and linkage analysis
The study focused on NMTC families displaying a Mendelian-like mode of inheritance. The 17 families chosen for this study were of European descent living in the United States comprising three or more NMTC individuals from at least two generations. The number of affected individuals within each family ranged from 3 to 13 (Fig. 1). Individuals diagnosed with papillary thyroid carcinoma were classified as NMTC patients. Individuals with nodules, goiter, and hypothyroidism were classified as having benign thyroid disease. Individuals with other malignancies were not considered affected for the purpose of this study. A total of 56 samples, including at least three samples from affected individuals within each family, were selected for WGS. The main selection criteria for WGS were availability and quality of DNA, the inclusion of family members from different generations, and male sex. The ratio of female to male patients was 2.1:1 (38:18). This deviates from the female-to-male ratio of ∼3:1 in the entire series because affected males were preferentially chosen for WGS when available. This is done because in Mendelian disorders, the individuals belonging to the rarer sex are most likely to carry the sought-after variant. Notably, 60% (34/56) of the patients in this study were diagnosed with NMTC at relatively young age (<50 years old). Of the cases with pathology reports available for review, 23 were microcarcinomas (<1 cm; Supplementary Table S2). Additionally, genome-wide linkage analysis was performed within each family using a total of 107 samples, including 80 affected individuals. These 107 samples included all 56 WGS samples (Fig. 1).

Pedigrees of the 17 non-medullary thyroid cancer (NMTC) families used for whole-genome sequencing (WGS). Family structures are indicated by the pedigree maps. Males are indicated by squares, and females are indicated by circles. Generation is labeled using Roman numerals (I, II, III, etc.). Each alphabetic letter represents one family
WGS uncovers rare germline variants in NMTC families
To identify disease predisposition variants from these 17 NMTC families, WGS was conducted on 56 selected patient samples. The study aim to identify missense, splice change, nonsense, and frameshift variants caused by single nucleotide variation (SNV) or insertion/deletion events (INDELs).
The study generated ∼125 Gbp of uniquely mapped sequence data per sample (range 66–173 Gbp), yielding an average depth of coverage of 32.11 × (range 16.56–40.40 × ) for each individual. GATK v3.7 was used to perform SNV/INDEL detection on a per-family basis (joint calling). On average, WGS uncovered ∼7.64 million variants per family (range 7.10–8.11 million), of which 757,570 segregated with NMTC and 93,024 were also below the MAF threshold. A decision was made to focus on coding variants (∼82 per family), since variants in annotated regulatory non-coding regions (7874 per family on average) yield far too many candidates for further interpretation (Supplementary Table S3). Two different programs, MendelScan (27) and BasePlayer (28), were applied for the variant filtering and annotation steps after processing with GATK (Fig. 2).

Overview of the variant filtering scheme for the rare predisposing variant calls in NMTC families. To identify the rare NMTC predisposition variants, variants were called using both MendelScan and BasePlayer as indicated. Variants were then filtered and annotated. Ten programs were used for the missense variant prediction and ranking as described in the Methods. All the remaining variants were validated with Sanger sequencing in all the family individuals with DNA available. MAF, minor allele frequency.
A total of 41 candidate NMTC predisposition variants in 40 genes were identified, including 24 missense, five frameshift, five splice change, and seven nonsense variants. The 41 variants were from 65% (11/17) of the tested families. Thus, in six families, no candidate coding variant was discovered. Notably, 54% (22/41) of the selected variants have not been reported in the dbSNP database (Table 1 and Supplementary Table S4). In silico functional analysis of the 40 identified genes using Ingenuity Pathway Analysis showed that cancer was in the top category of “Diseases and Disorders” with the lowest p-value, which suggested the possible correlation between the candidate genes and NMTC carcinogenesis (Supplementary Table S5). Gene network analysis was also performed. Of the 40 candidate genes, 16 showed direct or indirect interactions within another 19 genes in the top network, and they are involved in cell death and survival, lipid metabolism, and molecular transport (Supplementary Fig. S1).
Details of the 41 Rare Candidate NMTC Predisposing Variants with Co-Segregation in NMTC Families
Value represents the average Z-score of the two adjacent single nucleotide polymorphism (SNP) markers in linkage analysis within each family.
Identification of candidate NMTC-associated germline variants in known cancer predisposition genes
In addition to the identification of the 41 rare variants, WGS variants were analyzed in known cancer predisposition genes. A total of 277 cancer predisposition genes identified in thyroid cancer and other cancers by functional studies, GWAS, and a commercially available hereditary thyroid cancer panel were selected (Supplementary Table S6) (9,11,13 –15,18–21,29 –39). For this analysis, the variants were selected in the coding regions of the known cancer genes sharing co-segregation within all WGS members per family. The filter applied with respect to allele frequency was <0.05 and linkage Z-score >0. A total of six candidate variants were identified, including one frameshift variant and five missense variants in the known cancer predisposition genes (Table 2 and Supplementary Table S7). All the missense variants were predicted as deleterious variants by PredictSNP2 (26). The six variants were identified from five families. One variant in the ATM gene was identified in more than one family (families M and N; Table 2). In these families, there are individuals with other malignancies, including kidney, lung, stomach, and prostate cancer (Supplementary Fig. S2).
Candidate Variants of the 277 Known Cancer Susceptibility Genes Identified in the 17 NMTC Families
Value represents the average Z-score of the two adjacent SNP markers in linkage analysis within each family.
Discussion
This study focused on NMTC, which accounts for ∼90% of all thyroid cancers. The aim was to use WGS to identify rare predisposing germline variants in NMTC families.
Genes that confer a risk of human disease when mutated are commonly designated as either rare with high penetrance or common with low penetrance. In reality, much of the genetic predisposition to disease is from genes that fall between the two extremes. The parameter commonly used to categorize variants by frequency is the minor allele frequency (MAF) that is derived from controls (“healthy individuals”) and varies between populations. A commonly used arbitrary cutoff between rare and common alleles is a MAF of 0.005 (5/1000) (40).
Research into rare high-penetrance loci has been possible (linkage, sequencing) for several decades and has uncovered many highly important disease-related loci in various cancers. Nevertheless, the proportion of all heritability caused by these loci is generally low. This has triggered increased research into the common low-penetrance loci, which has become accessible thanks to improved sequencing and automated genotyping, mainly in the past decade. However, importantly, even though the number of risk variants has skyrocketed, the proportion of all heritability caused by presently known genes is low (41,42). There is a clear need for more research to detect and categorize both rare and common variants. In the quest to find additional loci contributing to cancer risk in PTC, 0.001 was chosen as the MAF cutoff between common and rare in this study. This stringent MAF value was chosen because the strategy is based on studying families with several affected individuals. Such families typically harbor rare dominant risk variants that should be readily detectable not only by NGS but also by linkage. The advantage of focusing only on very rare alleles (MAF <0.001) is that “noise” from more common alleles that segregate with the phenotype by chance is reduced. It follows that the chance of missing risk alleles is increased.
In recent years, many rare disease-causing variants for recessively or dominantly inherited diseases have been reported using NGS platforms (43,44). WGS was chosen as the primary method in this study, and a number of rare germline variants in 11/17 NMTC families were identified. The criteria of “co-segregation” and “lack of co-segregation” are challenging in the context of high-penetrance NMTC variants with low-allele frequency. Importantly penetrance is age dependent and not usually complete. Thus, some unaffected individuals especially at a young age may carry the variant and thereby appear as not showing co-segregation with the disease. Collaborative genetic changes (e.g., in a low-penetrance common variant) leading to tumor formation may confound the traditional concept of co-segregation.
Besides the advantage of discovering rare variants, there are limitations to the current method in researching germline variants. In the present results, the 41 rare variants were from 11/17 NMTC families, and there was more than one candidate variant discovered in 8/17 families. Assuming there is one unique causative variant for each family, it is not easy to prioritize any variant based on functional prediction in families with multiple candidates. Functional validation and screening may be necessary to identify the causal variant(s) in the future. Of the 17 families, six did not appear to carry any risk variant. This could be because variants in non-coding genomic regions such as gene regulatory regions and long intergenic noncoding RNA genes were not studied. In addition, germline DNA copy number variations (CNVs) and genomic structural variations such as duplications, deletions, and inversions were not considered. CNVs have been reported to target genes that act in cancer-related pathways, resulting in a high cancer risk (45). Furthermore, a family may not have displayed a candidate variant in cases with imperfect segregation due to phenocopies.
Considering the six variants in known cancer genes identified in NMTC families and the important functions of these genes in tumorigenesis, the variants that were predicted as pathogenic or were able to alter gene structure may play a role in the causation of NMTC. For example, CHEK2 encodes a checkpoint kinase that interacts with cell-cycle regulators and DNA repair proteins. Multiple CHEK2 variants have been reported to be associated with increased risk in various cancer types such as breast cancer (46), colorectal cancer (47), and thyroid cancer (48). In one family, rs555607708 (CHEK2*1100delC) was identified, which is a well-established breast cancer risk variant that is most prevalent in European populations. Its estimated odds ratio (OR) for invasive breast cancer is 2.26 (49). Such an association of CHEK2*1100delC was also confirmed for prostate cancer (50). While it is not possible to explain the events underlying the multicancer involvement of CHEK2 variants, the thyroid is being adding here to the list of affected sites.
In conclusion, a practical method using WGS in searching for rare germline variants in known and previously undescribed cancer genes among NMTC families was applied. Using family-based analysis, a number of rare germline variants that may contribute to NMTC predisposition were identified. Future work including functional annotation and understanding of the molecular mechanisms of these candidate variants will hopefully improve early detection and management of NMTC.
Footnotes
Acknowledgments
We thank Isabella Hendrickson for help with PCR and Sanger sequencing. We also thank Jan Lockman and Barbara Fersch for administrative help. The analysis was done in part by an allocation of computing time from the Ohio Supercomputer Center. This work was supported by National Cancer Institute Grants P01CA124570 and P30CA16058, and the Jane and Aatos Erkko Foundation.
Author Disclosure Statement
The authors declare that no competing financial interests exist.
Supplementary Material
Supplementary Figure S1
Supplementary Figure S2
Supplementary Table S1
Supplementary Table S2
Supplementary Table S3
Supplementary Table S4
Supplementary Table S5
Supplementary Table S6
Supplementary Table S7
