Repurposing kinship coefficients as a sample integrity method for next generation sequencing data in a clinical setting

Abstract

BACKGROUND AND OBJECTIVES: Kinship coefficients measure relatedness between two individuals and have wide usage in genetic applications. In this study, we repurpose the kinship coefficient to directly facilitate sample tracking to identify potential sample swaps. Such sample integrity metrics are particularly important for the following two scenarios in large-scale clinical studies: First, multiple biological samples from the same individual were routinely processed as unique samples or technical replicates. Querying the relatedness of genomic data of two samples can identify sample swaps prior to inappropriate inclusion in data analysis. In the second scenario, different biological analytes from the same samples were run across multiple platforms and it is critical to establish the correct mapping for each individual sample, linking genomic information derived from multiple platforms to the same sample. For both cases, all downstream inferences rely on such correct mapping. Kinship coefficients can directly measure the mapping accuracy and ensure the required sample integrity.

MATERIALS AND METHODS:

We first describe the general concept of kinship coefficients and focus on the novel adaptations on feature (i.e. variants and/or SNPs) selection utilizing expressed variants to make it suitable for the clinical setting.

RESULTS:

We illustrate the adapted kinship coefficient estimates in two studies: one for lung fibrosis, where multiple samples were routinely collected from each patient, and one for thyroid cancers, where a cohort of samples was run on different platforms.

CONCLUSION:

We demonstrate the effectiveness of using kinship coefficients to improve sample integrity and discuss potential improvements in the methodology.

Keywords

Kinship, sample integrity, clinical, next generation sequencing

1. Introduction

Precision medicine is becoming an increasingly important paradigm in the clinical management of disease, particularly in oncology (Ashley, 2016). Next generation sequencing (NGS) allows sequencing of DNA, including whole genomes (Cirulli & Goldstein, 2010), exomes (Van Allen et al., 2014), and targeted genomic regions of interest (Frampton et al., 2013). NGS also allows the querying of the transcriptome through RNA-seq (Byron et al., 2016). These technologies can reveal disease-related genomic alterations, including single nucleotide variants (SNVs), copy number variants (CNVs), and genomic rearrangements. In addition to searching for disease-causing markers, numerous germline single nucleotide polymorphisms (SNPs) are queried as well (Cirulli & Goldstein, 2010).

In the clinical setting, samples are processed in high throughput and are handled from accessioning through final output and patient report. Because treatment decisions are based on the diagnostic test result, it is critical to ensure that the sample associated with the report is the sample that was derived from the patient (CLIA 42 CFR § 493.1232; Rehm et al., 2013). Additionally, during the development of novel clinical tests, sample swaps can affect stated test performance by introducing errors into the validation set. A Laboratory Information Management System (LIMS) is critical in ensuring that the patient receives the correct report, and it provides an audit trail on a per-sample basis (Sepulveda & Young, 2013). The use of automation can also prevent accidental sample swaps.

Because the correct mapping of patient to sequencing file is so critical, several methods have been proposed to confirm the identity of samples. These methods have been built primarily for DNA sequencing data (Pengelly et al., 2013), or require an orthogonal reference method such as a SNP microarray (Huang et al., 2013; Jun et al., 2012). Neither approach is applicable in our situation, where the samples are run on an RNA sequencing platform and corresponding reference data are not always available. Here, we investigated the novel use of kinship to verify sample identity.

In family-based association and linkage studies, inbreeding and kinship coefficients are utilized to determine relatedness of family members from the pedigree information (Boyce, 1983). Later, a number of methodologies were proposed to estimate relatedness using genome-wide genotype data without pedigree information (Choi et al., 2009; Kelmemi et al., 2015) and in more complex structures such as admixed populations (Thornton et al., 2012). These tools have successfully identified unknown relatedness and reduced false findings in genetic studies.

Herein, we repurpose kinship methodologies to confirm biological and technical replicates using variants called from RNA sequencing data of 699 lung tissue and 2,053 thyroid tissue samples. We examined the ability of kinship coefficients to detect biological and technical replicates from both lung and thyroid tissue. Finally, we utilized kinship to ensure that DNA and RNA from the same patient but analyzed with two distinct platforms were indeed from the same patient.

2. Materials and methods

2.1 Materials

2.1.1 RNA-seq

Libraries were prepared from 15 ng of total RNA using the Illumina RNA Access kit (Illumina, San Diego, CA) according to the manufacturer’s instructions on a Hamilton STAR robot. Pools of 16 samples were sequenced on the NextSeq 500 with a NextSeq v2 chemistry 150 cycle kit (Illumina, San Diego, CA) using paired-end 76 cycle sequencing chemistry.

2.1.2 Targeted DNA-seq

96 informative SNPs were chosen based on RNA-seq data from common dbSNP variants that were heterozygous in $>$ 10% of the sequenced population and had average coverage $>$ 200x. Of these, 87 SNPs gave functional amplicons in the Ion Torrent assay. 10 ng of genomic DNA extracted from fine-needle aspirated biopsies of thyroid nodules was used as the input for the ThermoFisher Ion AmpliSeq DNA assay according to the manufacturer’s instructions. The library inputs were quantitated with the ThermoFisher Taqman Quantitation kit. The libraries were pooled together to a final concentration of 50 pM for loading onto the Ion 540 chip in the ThermoFisher Ion Chef and sequenced on the ThermoFisher Ion S5XL. The Torrent Server Suite generated base calls, reads, aligned reads, and vcf files relative to the hg19 build of the human genome.

2.1.3 Lung transbronchial biopsy (TBB) samples

RNA-seq data were generated from 699 lung transbronchial biopsy (TBB) samples of 112 patients (Pankratz et al., 2017). Each patient has 3–5 different samples, which are referred to as biological replicates. Eight lung samples were chosen for quality control and sequenced repeatedly across 8 different batches; these are referred to as technical replicates. Among the 243,951 sample pairs, there are 1,893 pairs of biological replicates and 563 pairs of technical replicates.

2.1.4 Thyroid tissue samples

RNA-seq data were generated for 2,053 samples over 26 batches, giving a total of 2,106,159 pairs of independent samples, 49 pairs of biological replicates and 170 pairs of technical replicates. Additionally, DNA-seq data were generated for 609 samples and matched with 2,047 RNA-seq samples after removing low quality samples. There were 589 matched pairs between DNA-seq and RNA-seq data and 1,246,034 pairs of different samples.
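The pair counts quoted in Sections 2.1.3 and 2.1.4 can be reproduced from the sample counts alone. The short check below is only an illustration; the variable names are hypothetical and the numbers are taken from the text.

```python
from math import comb

# Lung TBB RNA-seq: all unordered pairs among 699 samples.
lung_pairs = comb(699, 2)
lung_unrelated = lung_pairs - 1893 - 563       # drop biological and technical replicate pairs

# Thyroid RNA-seq: all unordered pairs among 2,053 samples.
thyroid_pairs = comb(2053, 2)
thyroid_independent = thyroid_pairs - 49 - 170

# Thyroid DNA-seq vs RNA-seq: every DNA sample paired with every RNA sample.
cross_pairs = 609 * 2047
cross_different = cross_pairs - 589            # drop same-patient matched pairs

print(lung_pairs, lung_unrelated)              # 243951 241495
print(thyroid_independent, cross_different)    # 2106159 1246034
```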

2.2 Methods

2.2.1 Variant calling pipeline

A customized analysis pipeline was used to process the raw sequencing data from the NextSeq 500 and generate the variant calls that are the foundation of the kinship analysis described in this manuscript. The first step is de-multiplexing the raw sequencing reads and assigning reads to each sample. This is done with the Illumina public software, BCL2FASTQ. Customized scripts run before and after de-multiplexing to prepare the parameters for BCL2FASTQ, re-organize the FASTQ files, and check for contamination and FASTQ quality before the pipeline is invoked.

After de-multiplexing, the pipeline was designed by following the GATK Best Practices for variant calling on RNA-seq (Engström et al., 2013). It consists of two sections: pre-process and variant call. The first section includes STAR-based 2-pass alignment against the human genome build 37, mark duplicates and sort, detection of fusion by STAR Fusion, expression profiling via HTSeq-count, RNASeQC, and SAMTools QC. The second section, which is processed via GATK, is composed of six parts: splitting reads across exons and trimming N’s off, indel re-alignment, recalibration to adjust quality scores, variant call, variant filtration, and coverage at known sites.

2.2.2 Estimating IBD sharing and kinship coefficients

We adapt the concept of kinship coefficient to track technical/biological replicates, monitor sample integrity and potentially measure sample quality. We describe briefly identity by descent (IBD)-sharing probabilities, k-coefficients, and kinship coefficient, and then a method for kinship coefficient estimation.

The three k-coefficients, $k_{0ij}$ , $k_{1ij}$ and $k_{2ij}$ , are defined as the probabilities that a pair of non-inbred individuals $i$ and $j$ share 0, 1, and 2 alleles IBD, respectively, at a genetic locus. The kinship coefficient between individuals $i$ and $j$ , $\phi_{ij}$ , is defined as the probability that an allele selected randomly from individual $i$ and an allele selected randomly from the same autosomal locus of individual $j$ are IBD. The relationship between the k-coefficients and the kinship coefficient is $\phi_{ij}=0.5k_{2ij}+0.25k_{1ij}$ , with $0\leqslant\phi_{ij}\leqslant 0.5$ .
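As a small worked illustration of the relationship $\phi_{ij}=0.5k_{2ij}+0.25k_{1ij}$ , the sketch below evaluates it for a few textbook relationships. The function name and the example dictionary are hypothetical and serve only to make the formula concrete.

```python
def kinship_from_k(k0, k1, k2):
    """Kinship coefficient phi = 0.5*k2 + 0.25*k1 for valid k-coefficients."""
    if min(k0, k1, k2) < 0 or abs(k0 + k1 + k2 - 1.0) > 1e-9:
        raise ValueError("k-coefficients must be non-negative and sum to 1")
    return 0.5 * k2 + 0.25 * k1

# Expected (k0, k1, k2) for some standard relationships.
examples = {
    "unrelated": (1.0, 0.0, 0.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "same individual (replicates)": (0.0, 0.0, 1.0),
}
for label, k in examples.items():
    print(label, kinship_from_k(*k))
```

Two samples from the same individual sit at the upper bound of 0.5, which is the property exploited for sample tracking.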

An expectation-maximization (EM) algorithm using unlinked loci was previously proposed (Choi et al., 2009) to find maximum-likelihood estimators (MLEs) of the k-coefficients. The log-likelihood function for the k-coefficients is

$l\left(\bm{k}_{ij}\right)=\sum\limits_{s\in S_{ij}}\log\left\{\Pr\left(G_{i}^{s},G_{j}^{s}\mid k_{0ij},k_{1ij},k_{2ij}\right)\right\}$

where $G_{i}^{s}$ and $G_{j}^{s}$ are the genotypes at locus $s\in S_{ij}=\{1,2,\ldots,S\}$ for individuals $i$ and $j$ , respectively. The conditional probability of the genotypes, $\mathrm{Pr}\left(G_{i}^{s},G_{j}^{s}|k_{0ij},k_{1ij},k_{2ij}\right)$ , is computed under the assumption of Hardy-Weinberg Equilibrium. After finding the MLEs of the k-coefficients, $\hat{k}_{0ij}$ , $\hat{k}_{1ij}$ and $\hat{k}_{2ij}$ , the kinship coefficient can be estimated by $\hat{\phi}_{ij}^{\textit{MLE}}=0.5\hat{k}_{2ij}+0.25\hat{k}_{1ij}$ .
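To make the estimation step concrete, the following is a minimal sketch of an EM iteration for the k-coefficients, treating the number of alleles shared IBD at each locus as a latent class with mixing weights $(k_{0},k_{1},k_{2})$ . It is not the published implementation: the function names, the genotype coding (alternative-allele counts 0, 1, 2), the starting values and the stopping rule are all assumptions made for illustration, and allele frequencies are assumed to lie strictly between 0 and 1.

```python
import numpy as np

def ibd_likelihoods(gi, gj, p):
    """P(G_i, G_j | m alleles IBD) for m = 0, 1, 2 at each locus, under HWE.

    gi, gj: alternative-allele counts (0, 1, 2); p: alternative allele
    frequencies, assumed strictly between 0 and 1.
    """
    gi = np.asarray(gi, dtype=int)
    gj = np.asarray(gj, dtype=int)
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    idx = np.arange(p.size)
    hwe = np.stack([q**2, 2 * p * q, p**2], axis=1)   # P(G=0), P(G=1), P(G=2)
    pi, pj = hwe[idx, gi], hwe[idx, gj]
    l0 = pi * pj                                      # 0 alleles IBD: independent genotypes
    l2 = np.where(gi == gj, pi, 0.0)                  # 2 alleles IBD: identical genotypes
    t1 = np.zeros((p.size, 3, 3))                     # 1 allele IBD: joint genotype table
    t1[:, 0, 0] = q**3
    t1[:, 0, 1] = t1[:, 1, 0] = p * q**2
    t1[:, 1, 1] = p * q
    t1[:, 1, 2] = t1[:, 2, 1] = p**2 * q
    t1[:, 2, 2] = p**3
    l1 = t1[idx, gi, gj]
    return np.stack([l0, l1, l2], axis=1)

def em_k_coefficients(gi, gj, p, n_iter=500, tol=1e-8):
    """EM estimate of (k0, k1, k2), constrained to be non-negative and sum to 1."""
    lik = ibd_likelihoods(gi, gj, p)
    k = np.full(3, 1.0 / 3.0)
    for _ in range(n_iter):
        w = lik * k                                   # E-step: unnormalised posteriors
        w /= w.sum(axis=1, keepdims=True)
        k_new = w.mean(axis=0)                        # M-step: average posterior weight
        converged = np.max(np.abs(k_new - k)) < tol
        k = k_new
        if converged:
            break
    return k

def kinship_mle(gi, gj, p):
    k = em_k_coefficients(gi, gj, p)
    return 0.5 * k[2] + 0.25 * k[1]
```

Because the estimates are confined to the simplex, the resulting kinship value always falls in [0, 0.5], which is the restriction discussed later for the MLE.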

2.2.3 Comparison of kinship coefficients

The MLEs of kinship coefficients are compared to other methods, including the method of moment (MOM) (Manichaikul et al., 2010; Thornton et al., 2012) and the shared genotypes ratio (SGR) over total genotypes observed. The MOM is another way to estimate kinship coefficients directly from genotype data and is widely used in genetic studies. The MOM estimator of the kinship coefficient is defined as

$\hat{\phi}_{ij}^{\textit{MOM}}=\frac{1}{2}\frac{1}{\left|S_{ij}\right|}\sum\limits_{s\in S_{ij}}\frac{\left(g_{i}^{s}-\hat{p}_{s}\right)\left(g_{j}^{s}-\hat{p}_{s}\right)}{\frac{1}{2}\hat{p}_{s}\left(1-\hat{p}_{s}\right)}$

where $g_{i}^{s}=$ 0, 0.5, or 1 for genotype AA, Aa and aa, respectively, at locus $s\in S_{ij}=\{1,2,\ldots,S\}$ , and a is the alternative allele. $\hat{p}_{s}$ is defined as the mean observed frequency of the alternative allele a at locus $s$ . Since the MOM does not restrict the parameter space for $\phi_{ij}^{\textit{MOM}}$ , it sometimes results in estimated values outside the defined range [0, 0.5] of the kinship coefficient. The estimated kinship coefficients are therefore truncated to [0, 0.5].
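The MOM estimator and its truncation can be sketched in a few lines. The function name and the `truncate` flag are hypothetical; the genotype coding (0, 0.5, 1) follows the definition above, and allele frequencies are assumed to lie strictly between 0 and 1.

```python
import numpy as np

def kinship_mom(gi, gj, p, truncate=True):
    """Method-of-moment kinship estimate from genotypes coded 0, 0.5, 1."""
    gi = np.asarray(gi, dtype=float)
    gj = np.asarray(gj, dtype=float)
    p = np.asarray(p, dtype=float)
    per_locus = (gi - p) * (gj - p) / (0.5 * p * (1.0 - p))
    phi = 0.5 * per_locus.mean()
    if truncate:
        phi = min(max(phi, 0.0), 0.5)     # clip to the valid range [0, 0.5]
    return phi

# A single locus with p = 0.5 already shows why truncation is needed.
print(kinship_mom([1.0], [1.0], [0.5], truncate=False))   # 1.0, above 0.5
print(kinship_mom([1.0], [0.0], [0.5], truncate=False))   # -1.0, below 0
```

With many loci the per-locus terms average out, but with few or poorly covered loci the untruncated estimate can stray well outside [0, 0.5].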

To distinguish the pairs of technical or biological replicates from non-replicates, we also computed a simple statistic: the proportion of heterozygous (Aa) and homozygous alternative (aa) genotypes shared between two samples, over the total number of loci at which either genotype (Aa or aa) is observed in at least one of the two samples:

$\hat{\phi}_{ij}^{\textit{SGR}}=\frac{1}{2}\frac{1}{\left|T_{ij}\right|}\sum\limits_{s\in T_{ij}}{|G_{i}^{s}=G_{j}^{s}|}$

where $T_{ij}$ is the set of loci at which at least one individual has either a heterozygous genotype or a homozygous genotype of the alternative allele, $|T_{ij}|$ is the number of elements of $T_{ij}$ , and $|G_{i}^{s}=G_{j}^{s}|$ equals 1 when the two genotypes are identical and 0 otherwise. Note that $T_{ij}$ can be different for each pair of individuals $i$ and $j$ .
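A minimal sketch of the SGR computation follows, assuming genotypes are coded as alternative-allele counts (0 for AA, which is not reported in the vcf file, 1 for Aa, 2 for aa). The function name and the toy genotypes are hypothetical.

```python
import numpy as np

def kinship_sgr(gi, gj):
    """Shared genotype ratio, scaled by 1/2 so that identical samples give 0.5."""
    gi = np.asarray(gi)
    gj = np.asarray(gj)
    in_t = (gi > 0) | (gj > 0)            # T_ij: at least one sample carries Aa or aa
    if not in_t.any():
        return float("nan")               # no informative locus for this pair
    return 0.5 * float(np.mean(gi[in_t] == gj[in_t]))

# Five loci; the first is AA in both samples and is excluded from T_ij.
sample_i = [0, 1, 2, 1, 0]
sample_j = [0, 1, 2, 0, 2]
print(kinship_sgr(sample_i, sample_j))    # 2 shared out of 4 informative loci -> 0.25
```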

2.2.4 Variant selections and allele frequencies

To estimate kinship coefficients using variants detected through NGS, the first step is to identify the variant set. We focus on high quality variants identified by the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015) that also have sufficient coverage in our data set. Sufficient coverage is defined as having a read depth of more than 30 in the majority of samples, replicates included. Once the variant set is identified, we compile the genotype of each sample on this targeted set. This forms the basis for calculating kinship coefficients among any pairs. Instead of using allele frequencies provided by a public database, we derive the allele frequency of each variant from our own dataset. This calculation is based on unrelated samples only. When biological and/or technical replicates exist for a given patient, only one sample is selected randomly to be included in the calculation of allele frequencies. More details are described in Fig. 1.
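The two steps described above (coverage-based variant selection, then allele frequency estimation from one sample per patient) can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the function names, the matrix layout (samples in rows, variants in columns), and the reading of "majority" as more than half of the samples are all hypothetical choices.

```python
import numpy as np

def select_variants(depth, min_depth=30, min_fraction=0.5):
    """Boolean mask of variants with read depth > min_depth in a majority of samples."""
    depth = np.asarray(depth)
    return (depth > min_depth).mean(axis=0) > min_fraction

def allele_frequencies(genotypes, patient_ids, seed=0):
    """Alternative allele frequency per variant, using one random sample per patient.

    genotypes: (n_samples, n_variants) alternative-allele counts (0, 1, 2).
    """
    genotypes = np.asarray(genotypes)
    patient_ids = np.asarray(patient_ids)
    rng = np.random.default_rng(seed)
    chosen = [rng.choice(np.flatnonzero(patient_ids == pid))
              for pid in np.unique(patient_ids)]
    return genotypes[chosen].mean(axis=0) / 2.0

# Toy data: three samples, two variants; patient "A" has two replicate samples.
depth = [[50, 10], [40, 35], [31, 5]]
genotypes = [[2, 0], [2, 0], [0, 0]]
patients = ["A", "A", "B"]
print(select_variants(depth))                     # variant 0 kept, variant 1 dropped
print(allele_frequencies(genotypes, patients))    # one sample each from A and B
```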

Figure 1.

Flow chart of variant selection.

Figure 2.

(A) Mean vs standard deviation of read depth of 2,054 variants in thyroid samples after filtering (B) Observed allele frequencies for alternative allele in thyroid samples vs allele frequencies of the 2,021 matched variants in European population from Phase 3 1000 Genomes data (C) Mean read depth of 222,974 commonly detected variants on autosomal chromosomes in thyroid samples and lung TBB samples. Red points are the ones selected after filtering (D) Allele frequencies of 2,736 variants detected either in thyroid samples or lung samples after filtering. 1,332 variants are in common and 722 variants are detected in thyroid samples only and 682 variants are detected in lung samples only.

Figure 3.

Boxplots of pairwise estimated kinship coefficients in lung samples using maximum likelihood estimation (MLE), method of moment (MOM) and shared genotype ratio (SGR). In total, 2,014 variants and 699 samples are used to estimate kinship coefficients for 241,495 unrelated pairs, 563 technical replicates and 1,893 biological replicates.

Figure 4.

Estimated k0 vs. k1 for Thyroid (Panel A) and Lung (Panel B) samples.

Figure 5.

Boxplots of pairwise estimated kinship coefficients using 87 SNPs between 609 DNA-seq and 2,047 RNA-seq data of thyroid samples using maximum likelihood estimation (MLE), method of moment (MOM) and shared genotype ratio (SGR).

3. Results

We first determined the quality of variant/SNP calls from RNA-seq data. One challenge in using RNA-seq data is that coverage depends on expression levels, and different sample types have different expression signatures. Figure 2 illustrates the uneven coverage of RNA-seq data and how both variant coverage and allele frequency are highly dependent on the sample set. Panel A displays the mean versus standard deviation of read depth of 2,054 variants across 2,053 thyroid samples after variant filtering. The mean read depth is below 200x for most variants (90%) but covers a wide range from $\sim$ 30x to $\sim$ 700x. The standard deviation increases with the mean, implying higher variability with increased coverage. However, the correlation is only moderate (cor $=$ 0.677) and the relationship is highly variant dependent. Panel B displays observed alternative allele frequencies estimated using the in-house thyroid sample set versus the European population from the Phase 3 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015). While the overall correlation is high (cor $=$ 0.985) as expected, a substantial proportion of variants do show moderate differences: 15% of variants have more than 5% difference in allele frequency and 2% of variants have more than 10% difference in allele frequency. Since alternative allele frequency is a fundamental building block for the kinship analysis, it is critical to derive such estimates directly from the sample set of interest to avoid any deviations caused by discrepancies in allele frequency. Panels C and D compare the read depth and the alternative allele frequency calculated using the thyroid sample set with those calculated using the lung sample set, highlighting the substantial difference in these two key metrics between sample types. The overall correlation of read depth is only 0.669 (Panel C), and among the 2,736 variants detected in either thyroid or lung samples after filtering, only 1,332 variants (48.7%) are in common, while 722 (26.4%) are detected in thyroid samples only and 682 (24.9%) are detected in lung samples only. Overall, this highlights the importance of customized variant selection to ensure sufficient coverage. Equally important is sample-type specific allele-frequency estimation when calculating kinship coefficients.
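The variant overlap figures are internally consistent with the per-tissue variant counts reported elsewhere in the paper (2,054 for thyroid and 2,014 for lung), as the following arithmetic check shows. The variable names are hypothetical.

```python
common, thyroid_only, lung_only = 1332, 722, 682

total = common + thyroid_only + lung_only      # variants detected in either tissue
thyroid_set = common + thyroid_only            # variants passing filters in thyroid
lung_set = common + lung_only                  # variants passing filters in lung

shares = {
    "common": round(100 * common / total, 1),
    "thyroid only": round(100 * thyroid_only / total, 1),
    "lung only": round(100 * lung_only / total, 1),
}
print(total, thyroid_set, lung_set)            # 2736 2054 2014
print(shares)
```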

After filtering (as described in Fig. 1), the kinship is estimated using 2,014 detected variants on autosomal chromosomes for all pairs of 699 lung samples. Estimated kinship coefficients using three different methods (MLE, MOM, SGR) are compared and the result is shown in Fig. 3. The MOM kinship coefficients of several technical/biological replicates overlap with those of independent pairs, since the estimated values for independent pairs range from 0 to 0.496. However, both MLE and SGR separate biological/technical replicates from independent pairs cleanly. The maximum of the MLE kinship coefficients for independent pairs is 0.166 and the minimum values for biological and technical replicates are 0.194 and 0.212, respectively. The maximum of the SGR for independent pairs is 0.282 and the minimum values for biological and technical replicates are 0.308 and 0.309, respectively.
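One way to turn the reported separation into an operational rule is to place a cutoff inside the gap between the largest independent-pair value and the smallest replicate value. The sketch below uses the midpoint purely as an illustration; the study does not state a cutoff, and the function and dictionary names are hypothetical.

```python
# Reported extremes for lung samples (smallest replicate value taken over
# biological and technical replicates).
reported = {
    "MLE": {"max_independent": 0.166, "min_replicate": 0.194},
    "SGR": {"max_independent": 0.282, "min_replicate": 0.308},
}

def midpoint_cutoff(max_independent, min_replicate):
    """Midpoint of the gap, or None when the two groups overlap."""
    if min_replicate <= max_independent:
        return None
    return (max_independent + min_replicate) / 2.0

def is_replicate(phi, cutoff):
    return phi > cutoff

for method, values in reported.items():
    print(method, midpoint_cutoff(**values))
```

A cutoff chosen this way is specific to the data set and variant panel, so it would need to be re-derived for each new sample type.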

In addition to kinship coefficients, the MLE method allows us to estimate k-coefficients and to categorize pairs into more specific relationships. Unexpected patterns help us detect issues in experimental procedures and/or pipeline steps. For example, while the expected kinship coefficient is 0.25 for both full siblings and parent-offspring pairs, the underlying value of $(k_{0ij},k_{1ij},k_{2ij})$ is different: the k-coefficients are $(0.25,0.5,0.25)$ for full siblings and $(0,1,0)$ for parent-offspring pairs. In our application, the expected value of the k-coefficients is $(k_{0ij},k_{1ij},k_{2ij})=(1,0,0)$ for independent pairs and $(k_{0ij},k_{1ij},k_{2ij})=(0,0,1)$ for biological or technical replicates. The estimated k-coefficients for lung and thyroid samples are shown in Fig. 4. In thyroid samples, we identify several unusual independent pairs with $\hat{k}_{2ij}>\sim 0.35$ , whereas the expected value of $k_{2ij}$ is 0 since they are supposed to be independent (Fig. 4A). After further investigation, we found that for each such pair, at least one sample was missing critical information on several chromosomes due to a mapping pipeline failure that would otherwise have been hard to identify. In lung TBB samples, several pairs of technical/biological replicates cluster around $0.25<k_{0ij}<0.55$ , whereas the expected value of $k_{0ij}$ is 0 (Fig. 4B). The samples in those pairs have low-quality sequence data due to low total read counts or high duplicate rates, and they also match the pairs in the lower tail of Fig. 3 for technical/biological replicates.
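The idea of flagging pairs whose estimated k-coefficients fall far from every expected pattern can be sketched as a nearest-pattern lookup. The Euclidean distance, the tolerance of 0.2 and all names below are assumptions made for illustration and are not taken from the study.

```python
import numpy as np

# Expected (k0, k1, k2) vectors quoted in the text.
EXPECTED = {
    "independent": (1.0, 0.0, 0.0),
    "replicate": (0.0, 0.0, 1.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
}

def classify_pair(k_hat, tol=0.2):
    """Label of the closest expected pattern, or 'unexpected' if none is within tol."""
    k_hat = np.asarray(k_hat, dtype=float)
    label, dist = min(
        ((name, float(np.linalg.norm(k_hat - np.array(v)))) for name, v in EXPECTED.items()),
        key=lambda item: item[1],
    )
    return label if dist <= tol else "unexpected"

print(classify_pair((0.02, 0.03, 0.95)))   # close to (0, 0, 1)
print(classify_pair((0.60, 0.05, 0.35)))   # far from every expected pattern
```

Pairs labelled "unexpected" would be candidates for the kind of follow-up described above, such as checking for pipeline failures or low-quality sequence data.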

Figure 5 presents the kinship coefficients estimated by three models (MLE, MOM, and SGR) for pairs of 2,047 RNA-seq and 609 DNA-seq data sets from thyroid samples. Non-matched pairs are those in which the RNA-seq and DNA-seq data are from different patients, and matched pairs are those in which both the RNA-seq and DNA-seq data are from the same patient. For the matched pairs, all three models result in estimated kinship coefficients close to the expected value, 0.5, with mean (s.d.) of 0.482 (0.013), 0.448 (0.060) and 0.472 (0.021) for MLE, MOM and SGR, respectively. Similarly, for the non-matched pairs, the estimated kinship coefficients are close to the expected value, 0, with mean (s.d.) of 0.031 (0.041), 0.026 (0.044) and 0.080 (0.032) for MLE, MOM and SGR, respectively. However, the MOM results in a wide range of estimated kinship values, from 0 to 0.5, for non-matched pairs, with highly inflated values that make some of them indistinguishable from matched pairs. Overall, this result demonstrates that the estimated kinship coefficients could successfully match RNA-seq data with DNA-seq data of the same patient.
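In practice, a cross-platform check of this kind could compare the recorded DNA-to-RNA mapping against the best-scoring RNA sample for each DNA sample. The sketch below is hypothetical: the function name, the matrix layout and the toy kinship values are invented for illustration.

```python
import numpy as np

def check_mapping(kinship, recorded_rna_index):
    """Compare the recorded mapping with the highest-kinship RNA sample per DNA sample.

    kinship: (n_dna, n_rna) matrix of estimated kinship coefficients.
    recorded_rna_index[d]: RNA column recorded (e.g. in a LIMS) for DNA sample d.
    Returns the best-matching RNA index per DNA sample and the DNA indices that disagree.
    """
    kinship = np.asarray(kinship, dtype=float)
    recorded = np.asarray(recorded_rna_index)
    best = kinship.argmax(axis=1)
    mismatched = np.flatnonzero(best != recorded)
    return best, mismatched

# Toy example: RNA samples 1 and 2 appear to have been swapped.
toy_kinship = [
    [0.48, 0.03, 0.02],
    [0.04, 0.05, 0.47],
    [0.01, 0.49, 0.06],
]
best, mismatched = check_mapping(toy_kinship, [0, 1, 2])
print(best.tolist(), mismatched.tolist())   # [0, 2, 1] [1, 2]
```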

4. Discussion

In this study, we have shown that the kinship statistic can be repurposed for NGS-sample tracking in a clinical laboratory setting. A variety of tissue types can be tracked with this methodology, including lung and thyroid samples. Because RNA-seq data were used, where SNP coverage is influenced by expression levels, a distinct SNP set was required for each tissue type. Finally, we were able to select a small set of informative SNPs to ensure that DNA and RNA purified from the same patient are correctly linked and analyzed together.

While the estimated kinship coefficients distinguish technical/biological replicates from unrelated pairs, the results from both the lung and thyroid data sets show over-estimated values in unrelated pairs and under-estimated values in replicated pairs. The MLE restricts the parameter space of the k-coefficients to $0\leqslant k_{0},k_{1},k_{2}\leqslant 1$ and $k_{0}+k_{1}+k_{2}=1$ , and allows only kinship coefficients between 0 and 0.5, which may lead to inflated values for unrelated pairs and lower values for technical/biological replicates. The MOM without truncation is an unbiased estimator (Thornton & McPeek, 2010); however, we observed a wide range of estimated values, from $-$ 0.32 to 1.23, with extreme outliers. After truncating the estimated MOM kinship coefficients to [0, 0.5], they are no longer unbiased.

Uncertainty in variant calling from RNA sequencing could affect the accuracy of kinship estimation. The vcf file reports only heterozygous genotypes and homozygous genotypes of the alternative allele, and we assume that the genotype is homozygous for the reference allele wherever no variant is reported. This assumption may cause a genotype error if a variant is missing from the vcf file because of lack of coverage. The genotype error would be more severe when the sequence data are of low quality with low coverage. This explains the long lower tail in the estimated kinship coefficients of replicated lung samples in our results (Fig. 3).
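One possible refinement, which is not what was done in this study, would be to assign the homozygous reference genotype only where coverage supports it and to treat uncovered, unreported sites as missing. The sketch below illustrates that idea; the function name, the missing-value code and the depth threshold of 30 (borrowed from the variant selection criterion) are assumptions.

```python
import numpy as np

MISSING = -1   # hypothetical code for "genotype unknown"

def fill_unreported_genotypes(reported_gt, depth, min_depth=30):
    """Keep 0 (AA) for unreported sites only when read depth is adequate.

    reported_gt: 1 (Aa) or 2 (aa) where the vcf reports a variant, 0 elsewhere.
    depth: read depth at each site for the same sample.
    """
    gt = np.asarray(reported_gt).copy()
    depth = np.asarray(depth)
    unreported = gt == 0
    gt[unreported & (depth < min_depth)] = MISSING
    return gt

print(fill_unreported_genotypes([0, 1, 0, 2], [100, 80, 5, 60]).tolist())   # [0, 1, -1, 2]
```

Sites coded as missing would then be excluded from $S_{ij}$ for any pair involving that sample, at the cost of fewer informative loci.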

The absence of homozygous reference genotypes from the vcf file led us to consider using only heterozygous genotypes and homozygous genotypes of the alternative allele for SGR. Since our main goal is to distinguish technical/biological replicates from unrelated pairs in a clinical setting, SGR simply assumes that shared genotypes are IBD across different variants and that non-shared genotypes are non-IBD, even when one of the alleles is shared IBD (i.e. AA vs Aa). This approach over-weights shared genotypes between unrelated pairs and under-weights discrepant genotypes (due to variant calling error) between replicated samples. However, our results demonstrate that, even with these limitations, SGR has enough power to distinguish replicates from unrelated pairs in a clinical setting. As long as the proportions of heterozygous genotypes and homozygous genotypes of the alternative allele are well balanced for the chosen SNPs in the given data set, this approach would be powerful for sample tracking.

In addition to using SNPs for confirming sample identity, common SNPs may be useful for other critical QC functions in the clinical setting. One example is sample-to-sample contamination. Studies using DNA have shown that consistent increases in allele frequency across thousands of SNPs can reveal sample contamination (Jun et al., 2012). Analytical validation experiments analyze the ability of an assay to perform in the presence of contamination. SNP-based contamination metrics would help eliminate samples that are above a critical threshold. By using SNPs that come along for the ride during NGS-based clinical tests, we can add additional QC metrics to further ensure accurate reporting to patients.

References

Van Allen, E. M., Wagle, Stojanov, Perrin, D. L., Cibulskis, Marlow, Jane-Valbuena, Friedrich, D. C., Kryukov, Carter, S. L., McKenna, Sivachenko, Rosenberg, Kiezun, Voet, Lawrence, Lichtenstein, L. T., Gentry, J. G., Huang, F. W., Fostel, Farlow, Barbie, Gandhi, Lander, E. S., Gray, S. W., Joffe, Janne, Garber, MacConaill, Lindeman, Rollins, Kantoff, Fisher, S. A., Gabriel, Getz, & Garraway, L. A. (2014). Whole-exome sequencing and clinical interpretation of FFPE tumor samples to guide precision cancer medicine. Nature Medicine, 20, 682-688.

Ashley, E. A. (2016). Towards precision medicine. Nat Rev Genet, 17, 507-522.

Boyce, A. J. (1983). Computation of inbreeding and kinship coefficients on extended pedigrees. The Journal of Heredity, 74, 400-404.

Byron, S. A., Van Keuren-Jensen, K. R., Engelthaler, D. M., Carpten, J. D., & Craig, D. W. (2016). Translating RNA sequencing into clinical diagnostics: Opportunities and challenges. Nat Rev Genet, 17, 257-271.

Choi, Wijsman, E. M., & Weir, B. S. (2009). Case-control association testing in the presence of unknown relationships. Genet Epidemiol, 33, 668-678.

Cirulli, E. T., & Goldstein, D. B. (2010). Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet, 11, 415-425.

CLIA 42 CFR § 493.1232. https://www.gpo.gov/fdsys/pkg/CFR-2016-title42-vol5/pdf/CFR-2016-title42-vol5-part493.pdf.

Engström, P. G., Steijger, Sipos, Grant, G. R., Kahles, The RGASP Consortium, Rätsch, Goldman, Hubbard, T. J., Harrow, Guigó, & Bertone (2013). Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods, 10, 1185-1191.

Frampton, G. M., Fichtenholtz, Otto, G. A., Wang, Downing, S. R., Schnall-Levin, White, Sanford, E. M., Sun, Juhn, Brennan, Iwanik, Maillet, Buell, White, Zhao, Balasubramanian, Terzic, Richards, Banning, Garcia, Mahoney, Zwirko, Donahue, Beltran, Mosquera, J. M., Rubin, M. A., Dogan, Hedvat, C. V., Berger, M. F., Pusztai, Lechner, Boshoff, Jarosz, Vietz, Parker, Miller, V. A., Ross, J. S., Curran, Cronin, M. T., Stephens, P. J., Lipson, & Yelensky (2013). Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat Biotechnol, 31, 1023-1031.

Huang, Chen, Lathrop, & Liang (2013). A tool for RNA sequencing sample identity check. Bioinformatics, 29, 1463-1464.

Jun, Flickinger, Hetrick, K. N., Romm, J. M., Doheny, K. F., Abecasis, G. R., et al. (2012). Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. American Journal of Human Genetics, 91, 839-848.

Kelmemi, Teeuw, M. E., Bochdanovits, Ouburg, Jonker, M. A., Alkuraya, Hashem, Kayserili, van Haeringen, Sheridan, Masri, Cobben, J. M., Rizzu, Kostense, P. J., Dommering, C. J., Henneman, Bouhamed-Chaabouni, Heutink, Ten Kate, L. P., & Cornel, M. C. (2015). Determining the genome-wide kinship coefficient seems unhelpful in distinguishing consanguineous couples with a high versus low risk for adverse reproductive outcome. BMC Med Genet, 16, 50.

Manichaikul, Mychaleckyj, J. C., Rich, S. S., Daly, Sale, & Chen, W. M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26, 2867-2873.

The 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526, 68-74.

Pankratz, D. G., Choi, Imtiaz, Fedorowicz, G. M., Anderson, J. D., Colby, T. V., Myers, J. L., Lynch, D. A., Brown, K. K., Flaherty, K. R., Steele, M. P., Groshong, S. D., Raghu, Barth, N. M., Walsh, P. S., Huang, Kennedy, G. C., & Martinez, F. J. (2017). Usual interstitial pneumonia can be detected in transbronchial biopsies using machine learning. Annals of the American Thoracic Society, In press.

Pengelly, R. J., Gibson, Andreoletti, Collins, Mattocks, C. J., & Ennis (2013). A SNP profiling panel for sample tracking in whole-exome sequencing studies. Genome Medicine, 5, 89.

Rehm, H. L., Bale, S. J., Bayrak-Toydemir, Berg, J. S., Brown, K. K., Deignan, J. L., Friez, M. J., Funke, B. H., Hegde, M. R., & Lyon; Working Group of the American College of Medical Genetics and Genomics Laboratory Quality Assurance Committee (2013). ACMG clinical laboratory standards for next-generation sequencing. Genetics in Medicine, 15, 733-747.

Sepulveda, J. L., & Young, D. S. (2013). The ideal laboratory information system. Archives of Pathology & Laboratory Medicine, 137, 1129-1140.

Thornton, & McPeek, M. S. (2010). ROADTRIPS: Case-control association testing with partially or completely unknown population and pedigree structure. Am J Hum Genet, 86, 172-184.

Thornton, Tang, Hoffmann, T. J., Ochs-Balcom, H. M., Caan, B. J., & Risch (2012). Estimating kinship in admixed populations. Am J Hum Genet, 91, 122-138.