Ischemic Stroke: From Next Generation Sequencing and GWAS to Community Genomics?

Abstract

Stroke is a major cause of mortality and morbidity in both the developed and developing world. Next generation sequencing (NGS) and multi-omics integrative biology research offer new opportunities in the way we research and understand stroke. These biotechnologies also signal a shift from genetics to genomics of stroke, which is highlighted in this review. Stroke is a focal neurological deficit resulting from disruption of the cerebral blood supply. There are two main types of common stroke, ischemic stroke (IS), which comprises 80% of cases, and hemorrhagic stroke (HS) that accounts for about 20% of cases. IS is a complex multi-factorial disease with multiple environmental and genomic determinants. We discuss here IS from genomics and bioinformatics perspectives, including the highlights of the genome wide association studies (GWAS), NGS progress to date, and exome studies. While both ‘common variant, common disease’ and ‘rare variant, common disease’ approaches need to be assessed in tandem, future studies into IS omics should also consider pedigree and/or community based sampling to take account of the complex diversity of IS genetics. We conclude by presenting an example of such community genomics research from China in an extended pedigree sample, and the ways in which the intersection of genomics and global society can usefully inform our understanding of IS pathophysiology and potential preventive medicine interventions in the future.

Stroke and Public Health

According to the Global Burden of Diseases, Injuries, and Risk Factors Study 2010, deaths from non-communicable diseases rose by nearly 8 million between 1990 and 2010, and accounted for two of every three deaths (34.5 million) worldwide by 2010. Heart disease and stroke were collectively responsible for the deaths of 12.9 million people in 2010, or one in four deaths worldwide, compared with one in five in 1990 (Lozano et al., 2012).

Selected national statistics further highlight the growing problem that stroke is becoming for public health. In the USA, the National Center for Chronic Disease Prevention and Health Promotion reported that every year stroke is the cause of death of 130,000 Americans, with 610,000 Americans suffering a new stroke (The National Center for Chronic Disease Prevention and Health Promotion, 2013). In the UK, The Stroke Association reports that there are approximately 152,000 strokes every year, or about one every 5 minutes, with 1.1 million stroke survivors living in the UK (Stroke Association, 2013). In Australia, a Deloitte Access Economics report on the economic impact of stroke reported that more than 1000 Australians sustain a stroke every week, of whom 40% die within 12 months (Deloitte Access Economics, 2013) (Table 1).

Table 1.

Take Home Executive Points on Recent Advances in Genomics of Ischemic Stroke

Ischemic stroke background

• Major public health issue in developed and developing countries

• IS has multiple forms (TOAST categories)

• IS is a multi-factorial complex disease involving multiple genetic and environmental factors.

Mendelian IS syndromes

• CADASIL

• CARASIL

• MELAS

• All give clues to genetic aspects of IS but are very rare

GWAS Non-syndromic IS

• CVCD approach

• Large multi-center, multinational studies required to achieve genetic and statistical significance

• Millions of possible variations assayed. 14 major candidate variations listed in this review

• Multiple GWAS studies combined into meta-analysis studies to achieve greater statistical power and detect more candidate mutations.

• Common variants detected contribute very small risk to developing IS

Whole exome/Whole genome Next Generation Sequencing

• RVCD approach

• Potential to detect low frequency rare variations which contribute a higher risk to developing IS

• 3 candidate variations detected in this manner listed in this review

• Requires significant bioinformatic resources

Community genetics and ischemic stroke

• Many community specific variations for many rare and common diseases. Ischemic stroke is no exception

• Combine both CVCD and RVCD approaches

• A more targeted experimental design proposed combining sporadic case control and pedigree level approaches such as linkage analysis.

Multi-omics data-intensive life sciences offer new vistas for neurological and mental health disorders (Goldenberg et al., 2014; Longuespee et al., 2014; Podder and Latha, 2014). Among these, stroke has been considered a disease caused by lifestyle and dietary behaviors such as increased energy intake, fat intake, alcohol consumption, and cigarette smoking, and decreased physical activity. Constitutive factors such as genomics have recently gained increasing conceptual importance.

One striking example of this is highlighted in a World Health Organization (WHO)-funded project called ‘Incidence and trends of stroke and its subtypes in China’ (Jiang et al., 2006). In this study, researchers collected medical diagnoses, genetic information, and blood samples from 3015 ischemic stroke patients in Beijing, Shanghai, and Changsha, including sampling multiple individuals within 12 pedigrees of Han Chinese (Jiang et al., 2006). The researchers also observed that Chinese populations had proportionally more hemorrhagic strokes and fewer ISs, but that this pattern was changing, with the incidence of IS growing and that of HS falling.

A further notable finding was that the age-adjusted incidence of overall stroke in individuals over 55 years in these populations was generally higher than in Western populations and that the incidence of IS rose 50% in Beijing during the study period (1991–2000). The researchers concluded that the main cause of this rapid increase in IS incidence was the economic boom in China during the study period. They observed that Chinese populations were rapidly adopting Western lifestyle and dietary habits, increasing the incidence of obesity and hypercholesterolemia in China and thus increasing the incidence of IS.

The Genetics/Genomics of Ischemic Stroke

Another recognized risk for stroke is a genetic/genomics predisposition. Although the study and knowledge of the impact of environmental and modifiable risk factors are well advanced, the identification of genetic variants associated with genetic predisposition is still a work in progress (Markus, 2011). The major reason for the relative paucity of data on stroke predisposition is the complexity of stroke genetics itself. Stroke is defined as a focal neurological deficit resulting from disruption of the cerebral blood supply. There are two main types of common stroke, ischemic stroke (IS), which comprises ∼80% of cases, and hemorrhagic stroke (HS) that accounts for the remaining ∼20% of cases (Markus, 2011; Bevan and Markus, 2011). According to the TOAST (Trial of Org 10172 in Acute Stroke Treatment) system (Adams et al., 1993), IS has been classified into five subtypes: 1) large artery atherosclerosis, 2) cardioembolism, 3) small vessel occlusion, 4) stroke of other determined etiology, and 5) stroke of undetermined etiology.

Within these five subtypes there are some rare Mendelian stroke disorders and syndromes, for example, CADASIL and CARASIL (Markus, 2011) and MELAS (Pavlakis et al., 1985; Sproule et al., 2013). CADASIL is the most common genetic small vessel IS and is a dominantly inherited small artery disease caused by >190 known mutations of the NOTCH3 gene (Federico et al., 2012). CARASIL is a much rarer genetic form of ischemic, nonhypertensive, cerebral small vessel disease directly affecting the cerebral small blood vessels and caused by mutations in the HTRA1 gene encoding HtrA serine peptidase/protease (Fukutake, 2011). CARASIL is an early onset disease, usually occurring between the ages of 20 and 30 years, and most cases are of East Asian origin (Yamamoto et al., 2011). MELAS is an early onset syndrome in which IS is one symptom. One of the first IS-related genetic syndromes to be characterized, it is now known to be caused by mutations in at least 30 mitochondrial genes (Sproule et al., 2013). Further investigation into ischemic stroke and mitochondrial variation has led to several correlations between mtDNA haplogroups and IS (Cai et al., 2015), as well as between common OXPHOS gene variations and IS (Anderson et al., 2013). These associations are still putative and require further investigation.

The vast majority of ischemic stroke cases represent a multi-factorial complex disease involving multiple genetic and environmental factors. While in the previously discussed Chinese study environmental factors were overwhelmingly the prime suspect in increasing IS incidence, the differing patterns of stroke subtypes observed between Chinese and Western populations mean that a role for genetic effects could not be ruled out. In fact, it has been previously established that conventional environmental risk factors do not explain all IS risk (Sacco et al., 1989).

Evidence from studies of twins and family history suggests that genetic predisposition is important (Traylor et al., 2012). In twin studies in particular, while it is theoretically possible to estimate heritability and to differentiate environmental and genetic risk, performing such studies in stroke has proven challenging (Bevan et al., 2012). In twin cohorts, the number of stroke cases is small, and in studies performed to date, subtyping is not available and thus no estimates of heritability for the specific stroke subtypes are available (Bevan et al., 2012).

Genome Wide Association Studies

The exploration of possible genetic factors in the development of complex multigenic IS was first made feasible by the advent of genome wide association studies (GWAS). Causative genetic variations have traditionally been detected using classical Mendelian gene mapping and linkage studies. While very successful for rare genetic disorders, these studies are far less successful when applied to complex diseases. As the name implies, complex diseases, such as IS, are not caused by single mutations in a single causal gene, but are instead influenced by multiple subtle genetic factors. The hunt for these small effect causative genetic variants in complex disease started with GWAS-based investigations. This approach grew out of the theoretical argument that comparing the allele frequencies at variants across the genome between thousands of cases and controls would be well-powered to detect common alleles of small effect (Risch and Merikangas, 1996).
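The case-control comparison of allele frequencies described above can be sketched for a single variant as follows. This is a minimal illustration, assuming a plain Pearson chi-square test on a 2×2 table of allele counts; the function name and all counts are hypothetical, and published GWAS typically use more elaborate models that adjust for covariates such as ancestry.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 2000 cases and 2000 controls, i.e. 4000 alleles per group.
chi2, p = allelic_chi_square(case_alt=1300, case_ref=2700,
                             control_alt=1100, control_ref=2900)
odds_ratio = (1300 * 2900) / (2700 * 1100)
print(f"chi2={chi2:.2f}, p={p:.2e}, allelic OR={odds_ratio:.2f}")
```

With these invented counts the allele frequency differs by five percentage points between groups, yet the p-value is only on the order of 10⁻⁶. Because the same test is repeated across hundreds of thousands of variants, this illustrates why sample sizes in the thousands are needed to detect alleles of small effect.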

GWAS-based projects genotype a large number of variants (in the hundreds of thousands or even millions) in a large number of cases and controls, usually in the thousands, in order to minimize false positive findings. This approach does not depend on a prior hypothesis and novel associations can therefore be detected (Bevan et al., 2012). The ability to conduct such studies was dependent on the development of the microarray, initially developed by Affymetrix™. Their first commercial single nucleotide polymorphism (SNP) array was released in 1996 and targeted 1500 human SNPs. As of 2015, two predominant vendors provide technology for the collection of GWAS data, Affymetrix and Illumina, which use differing methodologies to assay over 1 million variants per sample.

In an Affymetrix chip, thousands of copies of oligonucleotide probes for each SNP and copy number variation (CNV), usually 25mers, are directly synthesized onto a silicon chip. Fluorescently labeled target DNA is then hybridized to the probes, with successful hybridization (dependent on the target DNA oligo containing the SNP contained in the probe) resulting in a fluorescent signal. Illumina, on the other hand, developed the BeadChip approach, in which the oligo probes are attached to beads (1000s of copies of the same probe on each bead). These beads are then deposited in wells on a glass slide, or BeadChip. Again, hybridization of target oligo to the probe is detected as a fluorescent signal. The latest Affymetrix genotyping array, the Genome-Wide Human SNP Array, contains 1.8 million genetic markers, including more than 906,600 SNPs and more than 946,000 probes for the detection of copy number variation. Illumina BeadChip technology has developed to such an extent that the latest genotyping chip, the HumanOmni5-Quad, contains ∼4.3 million markers with the ability to add 500,000 more. This rapid development of high-throughput genotyping technologies, in combination with the progressive drop in genotyping costs, has enabled accurate and reproducible genotyping.

Furthermore, genotypes for variants not directly assayed can be inferred through genotype imputation. Genotype imputation is a statistical methodology that works by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset (Marchini and Howie, 2008; 2010; Howie et al., 2009; 2011). In effect, it can be argued that imputation further increases the number of variants assayed to such a level that potentially all common variants involved in disease risk have been investigated.
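The idea of predicting untyped sites from reference haplotype patterns can be illustrated with a deliberately simplified toy. The sketch below assumes exact matching at the typed sites followed by a majority vote; the function name and the five-site haplotypes are invented for illustration, and real imputation software relies on probabilistic models rather than this shortcut.

```python
from collections import Counter

def impute_haplotype(study_hap, reference_panel):
    """Fill '?' sites using the reference haplotypes that best match the typed sites."""
    typed = [i for i, allele in enumerate(study_hap) if allele != "?"]

    def matches(ref):
        # Number of typed sites at which this reference haplotype agrees.
        return sum(ref[i] == study_hap[i] for i in typed)

    best = max(matches(ref) for ref in reference_panel)
    closest = [ref for ref in reference_panel if matches(ref) == best]

    imputed = []
    for i, allele in enumerate(study_hap):
        if allele != "?":
            imputed.append(allele)  # observed genotype is kept as-is
        else:
            votes = Counter(ref[i] for ref in closest)
            imputed.append(votes.most_common(1)[0][0])
    return "".join(imputed)

# Hypothetical reference panel of four haplotypes over five sites.
reference_panel = ["ACGTA", "ACGTA", "ATGCA", "GTGCC"]
# The study sample was typed at sites 0, 2 and 4 only.
result = impute_haplotype("A?G?A", reference_panel)
print(result)  # ACGTA
```

Three reference haplotypes agree with the sample at all typed sites, and two of those three carry C and T at the untyped positions, so these alleles are chosen.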

With the rapid progress of such high-density SNP microarrays and the development of genotype imputation, GWAS have been successfully undertaken for many common human diseases, revealing multiple SNPs with strong associations (minor allele frequency >5%, p<1×10⁻⁸) (Cirulli and Goldstein, 2010). This rapid development has resulted in an exponential growth in association data. For example, The Catalog of Published GWAS Studies, now hosted at European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) (http://www.ebi.ac.uk/gwas), has, as of June 2015, listed more than 15,000 SNPs involved in more than 8,000 associations with complex traits extracted from more than 2000 GWAS publications.

GWAS and Ischemic Stroke

Although initially developing at a much slower pace than GWAS of other traits, IS GWAS projects have revealed a growing number of associated SNPs (Lupski et al., 2011; Bevan et al., 2012; Foo et al., 2012; Hacke and Grond-Ginsbach, 2012). While an initial major international GWAS meta-analysis of both the Ischemic Stroke Genetic Study (ISGS) and the Bio-Repository of DNA in Stroke (BRAINS) dataset resulted in no SNP of genome-wide significance associated with IS (Meschia et al., 2011), many other IS GWAS meta-analysis projects have been successful in revealing many significant IS-associated SNPs (Yamada et al., 2009; Meschia et al., 2011; Sun et al., 2011; Arregui et al., 2012; Bellenguez et al., 2012; Holliday et al., 2012).

The International Stroke Genetics Consortium GWAS has revealed four variants associated with IS in European patients that exhibit evidence of heterogeneity of effect across different TOAST stroke subtypes (Bellenguez et al., 2012), suggesting multiple distinct stroke subtypes and their related conditions rather than a single ‘common stroke’ phenotype. This hypothesis is supported by studies showing SNP associations exclusive to specific ethnicities, such as the Japanese and Han Chinese (Yamada et al., 2009; Sun et al., 2011), as well as some SNP genetic associations with conditions such as coronary heart disease and atrial fibrillation that are also significantly associated with stroke (Markus, 2011; Lemmens et al., 2011; Arregui et al., 2012).

With the growth in the number of international GWAS projects, the next stage of the exploration of IS genetics is the combination of these studies through meta-analysis. Consortia such as Meta-Stroke (Traylor et al., 2012) and SiGN (Meschia et al., 2013) use a rationale of combining a massive amount of data gleaned from all the major international GWAS projects and sib-pair studies via meta-analysis in an effort to further define significant IS genetic markers. The amount of data available to these consortia is significant. For example, SiGN contains 14,549 cases from 24 genetic research centers: 13 from the United States and 11 from Europe. With such a large bank of population-based data, these consortia can theoretically establish subgroups of data for different stroke subtypes, considered crucial in uncovering IS genetic factors (Table 2).
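The gain in power from pooling studies can be sketched with a fixed-effect, inverse-variance meta-analysis of a single SNP. This is one common way of combining per-study estimates and is shown here only as an assumption; this review does not state which method the consortia use, and the function name, effect sizes, and standard errors below are hypothetical.

```python
import math

def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance weighted pooling of per-study log odds ratios."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value

# Hypothetical per-study estimates for one SNP from three cohorts.
betas = [0.20, 0.15, 0.25]
standard_errors = [0.08, 0.10, 0.12]
pooled_beta, pooled_se, z, p_value = fixed_effect_meta(betas, standard_errors)
print(f"beta={pooled_beta:.3f}, se={pooled_se:.3f}, z={z:.2f}, p={p_value:.1e}")
```

The pooled standard error is smaller than that of any single cohort, which is the sense in which combining studies increases statistical power.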

Table 2.

Highlights of Selected IS-Associated SNPs Identified in Previous GWAS, Exome, Case-Control and Meta-Analysis Studies

Chromosome	Location	Study Methods	Gene	Population(s)	dbSNP ID(s)	TOAST sub-type	References
1	1q24.2	Exome, Case-Control	c1orf156	Han Chinese	rs10489177		(Zhang et al., 2013)
3	3p21.3	Exome, Case-Control	XYLB	Han Chinese	rs17118		(Zhang et al., 2013)
4	4q25	GWAS, Meta-analysis	intergenic- MIR197 (Near PITX2)	European	rs6843082, rs1906599, rs2200733		(Gretarsdottir et al., 2008; Gudbjartsson et al., 2009; Traylor et al., 2012)
5	5p14-p13	Case-control	NPR3	European, Han Chinese	rs696831	–	(Liu et al., 2012; Rubattu et al., 2013)
6	6p21	GWAS	AIM1	European (USA)	rs783396	–	(Matarin et al., 2007)
6	6p21.1-p21.3	GWAS	intergenic	European (Australia)	rs556621, rs556512	LVD	(Holliday et al., 2012)
6	6p24	GWAS	PHACTR1	European	rs12526453	LVD	(Bevan et al., 2012)
7	7p21.1	GWAS, Meta-analysis	HDAC9	European, European (USA)	rs11984041	LVD	(Bellenguez et al., 2012; Traylor et al., 2012)
7	7p21.2	GWAS	intergenic	European (USA)	rs10486776	–	(Matarin et al., 2007)
7	7p21.3	Exome Study	PON1	European, African	rs61736513, rs146211440, rs80019660, rs141665531, rs61736513, rs80019660, rs146211440	–	(Kim et al., 2014)
9	9p21.3	GWAS, Case-control	CDKN2BAS	European (Norway), Han Chinese	rs2383206, rs2383207, rs10757274, rs10757278	–	(Wahlstrand et al., 2009; Hu et al., 2009; Zhang et al., 2012)
9	9q34.2	GWAS, Meta-analysis	ABO	European	rs505922	LVD/CE	(Williams et al., 2013)
12	12p13	GWAS, Case-control	NINJ2	European, Japanese	rs11833579	LVD/SVD	(Ikram et al., 2009; Matsushita et al., 2010)
12	12q24.2-q24.31	Case-control	NOS1	European (Portuguese)	rs2293050, rs2139733, rs7308402, rs1483757	–	(Manso et al., 2012)
14	14q23.1	GWAS, Meta-analysis	PRKCH	Japanese, Han Chinese	rs2230500	LVD	(Kubo et al., 2007)
16	16q22.3	GWAS, Meta-analysis	ZFHX3	European	rs7193343, rs879324	CE	(Gudbjartsson et al., 2009; Traylor et al., 2012)
17	17q25.1	GWAS	LLGL2	Japanese	rs1671021	–	(Yamada et al., 2009)
18	18p11.21	GWAS	IMPA2	European (USA)	rs7506045	–	(Matarin et al., 2007)
22	22q13.3	GWAS, Case-control	CELSR1	Japanese, European (Portuguese)	rs6007897, rs4044210	–	(Yamada et al., 2009; Gouveia et al., 2011)

CE, cardioembolism; LVD, large vessel disease; SNP, single nucleotide polymorphism; SVD, small vessel disease; TOAST, Trial of Org 10172 in Acute Stroke Treatment.

Limits of GWAS

While GWAS has allowed researchers to make major advances in understanding the complex genetic architecture of IS, these discoveries have in turn shown that most IS heritability remains unidentified. GWAS was specifically designed to address what is referred to as the Common Variant, Common Disease (CVCD) hypothesis (Bevan and Markus, 2011). The ‘missing heritability’ is therefore most likely to involve rare or low frequency variants within pedigrees and individuals. Under the Rare Variant, Common Disease (RVCD) hypothesis (Cirulli and Goldstein, 2010; Foo et al., 2012), multiple rare variants, sometimes specific to the individual, combine to confer a higher genetic risk of developing common diseases (Cirulli and Goldstein, 2010; Manry and Quintana-Murci, 2013). Family studies support the RVCD hypothesis, with different subtypes of stroke demonstrated within different families, suggesting alternative underlying rare variant genetic risk factors (Polychronopoulos et al., 2002; Jerrard-Dunne et al., 2003). It is doubtful whether many such risk factors would be detected via GWAS, which mainly detects more common SNPs (Markus, 2011).

Next Generation Sequencing Studies

The development of next generation sequencing (NGS) has enabled researchers to identify multiple causal variants, including rare variants, for disease that might otherwise be missed by SNP-chip technology (Kim et al., 2012). These target variants are either low frequency (typically defined as those with minor allele frequency (MAF) between 1% and 5%) or rare variants (those with MAF <1%). The most common use of NGS for detecting such low frequency and rare variants is whole exome sequencing, since the whole exome approach has been the most feasible both in economic cost and available bioinformatic resources.
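The frequency classes above can be expressed as a small classifier. This is a sketch under stated assumptions: the function names are hypothetical, and the treatment of the exact 1% and 5% boundaries (both assigned to the low frequency class here) is a choice made for illustration, since conventions differ between studies.

```python
def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the less common allele at a biallelic site."""
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)

def classify_variant(maf):
    """Assign a frequency class using the 1% and 5% cut-offs."""
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:
        return "low frequency"
    return "common"

# Hypothetical allele counts in 2000 diploid individuals (4000 alleles).
for alt_count in (30, 120, 3000):
    maf = minor_allele_frequency(alt_count, 4000)
    print(alt_count, f"{maf:.4f}", classify_variant(maf))
```

Note that an allele seen 3000 times out of 4000 is the major allele, so its minor allele frequency is 25% and the site is classed as common.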

The first target of exome sequencing was the identification of rare causative mutations that are responsible for Mendelian genetic disease. For example, new causative mutations for Kabuki syndrome (Ng et al., 2010), Joubert syndrome (Srour et al., 2012) and postaxial polydactyly type A (Kalsoom et al., 2013), to name just three, have been found via exome sequencing. Whole exome sequencing is now proving useful in pinpointing rare variations associated with complex diseases such as Alzheimer's disease (Cruchaga et al., 2014), autism (Yu et al., 2013), and Crohn's disease (Ellinghaus et al., 2013).

As with GWAS, exome studies into IS are not as advanced as those into other complex diseases. A pilot exome study of previously surveyed GWAS samples failed to identify excess rare variation in any of the IS candidate exomes evaluated, but did confirm a common variant discovered in previous GWAS results (Cole et al., 2012). In this study, 10 individuals were selected for exome sequencing from the GEOS and SWISS GWAS cohorts. These samples were limited to males with an early age of onset and a family history of stroke in order to maximize the genetic contribution to stroke risk.

In the analysis of these exomes for IS-associated SNPs, two broad approaches were taken: genotype-specific analysis and compound heterozygote analysis. The first approach took the top genetic variant hits for lacunar stroke reported in major studies from the literature on all IS subtypes, and assayed these variants in the 10 exomes. The compound heterozygote approach involved screening for genes in which every sample had at least one rare variant. From the first analysis, one gene previously identified in the GWAS study, CSN3, was identified as having a significant coding polymorphism and an excess of rare variants. From the second analysis, 48 genes were identified as having at least one rare variant.
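The second screen, as described here, reduces to a set intersection: keep only the genes that carry at least one rare variant in every sequenced sample. The sketch below assumes variants have already been called and filtered for rarity; the function name, sample IDs, gene names, and variant labels are all hypothetical placeholders, and this is not the pipeline used by Cole et al.

```python
def genes_hit_in_every_sample(rare_variants_by_sample):
    """Return genes carrying at least one rare variant in all samples.

    rare_variants_by_sample maps a sample ID to a list of
    (gene, variant_label) tuples for that sample's rare variants.
    """
    gene_sets = [
        {gene for gene, _variant in variants}
        for variants in rare_variants_by_sample.values()
    ]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)

# Hypothetical rare-variant calls for three exomes.
calls = {
    "sample_01": [("GENE_A", "v1"), ("GENE_B", "v2"), ("GENE_C", "v3")],
    "sample_02": [("GENE_A", "v4"), ("GENE_C", "v3")],
    "sample_03": [("GENE_A", "v1"), ("GENE_C", "v5"), ("GENE_D", "v6")],
}
shared = genes_hit_in_every_sample(calls)
print(sorted(shared))  # ['GENE_A', 'GENE_C']
```

The samples need not share the same variant, only the same gene, which is why GENE_A qualifies even though sample_02 carries a different variant from the other two.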

While Cole et al. (2012) took an approach that was effectively a re-evaluation of GWAS samples with whole exome sequencing, a later study applied an exome-only matched case-control approach to uncovering IS-associated variants (Zhang et al., 2013). Zhang et al. completed an ambitious study involving three stages: 1) exome sequencing and candidate SNP imputation from 100 cases and 100 matched controls from Shenzhen, China; 2) verification by Sequenom™ genotyping of candidate SNPs in a further 500 cases and 500 controls from Shenzhen; and 3) replication of these genotyping results using Taqman™ in a further 1277 cases and 1277 controls from Beijing and Shenzhen. Despite the large sample size and multiple verification studies, only two candidate novel SNPs were found that were associated with an increased risk of IS for the Han Chinese population (see Table 2).

One of the biggest exome studies is that of the National Heart, Lung, and Blood Institute (NHLBI) of the United States, entitled the NHLBI GO Exome Sequencing Project (ESP). In an effort to find variants contributing to heart, lung, and blood disorders, this project has found several significantly associated IS variants in the PON1 gene (Kim et al., 2014). This study involved detecting IS-associated variants in the exomes of 496 IS patients from a wider exome sample of 4204 unrelated participants. In total, 7 SNPs in PON1 were found to be associated with IS in participants of European and African ancestry (Table 2). Of these 7 SNPs, 2 were found only amongst patients of European ancestry and one was exclusive to patients of African ancestry.

All three studies discussed relied on unrelated samples and a case-control inference of association, and none was IS-subtype specific, a factor established by several GWAS investigations as crucial for assessing rare variant association and effect. Furthermore, as outlined by Cole et al., next generation sequencing studies have an inherent problem of producing a large number of false positives due to the nature of the read alignment and variation imputation process (Cole et al., 2012). This problem was underscored by a major article that reported significant discrepancies in SNP and indel calling between many of the currently available variant-calling pipelines when applied to the same set of Illumina sequence data with near-default software parameterizations (O'Rawe et al., 2013). It is also expected that many rare variants will have a very restricted geographic distribution, so that careful matching of case and control ancestries is likely to be extremely important due to the potential for false signals to be introduced by small differences in ancestry (Do et al., 2012).

Another problem is that exome studies are naturally limited to the coding regions of the genome and thus cannot be used to assess either non-coding regions or many structural variations. This could theoretically result in missing potential targets for IS genetic mapping. Recommendations for resolving issues in complex disease NGS strategies include international collaborations with thousands of samples in order to increase statistical power (much like current major GWAS projects) (Kilpinen and Barrett, 2013) and a renewed collection of exome and whole-genome sequencing of multi-generational families, to increase the overall accuracy of NGS studies (O'Rawe et al., 2013).

Therefore, the next logical step in exploring the genetic landscape of ischemic stroke is whole-genome sequencing. With whole-genome sequencing, not only SNPs and indels could be defined, but also larger genetic variant types such as CNVs and structural variants (SVs). In the past, compared to exome sequencing, whole-genome sequencing had a prohibitively large cost. However, the costs have been rapidly decreasing. For example, the introduction of the Illumina HiSeq X Ten promises US$1000 whole-genome sequencing, a similar price to current whole-exome sequencing protocols (Illumina, 2014). The rapid development of NGS technology has resulted in international projects proposing to generate and store whole genome sequencing data from thousands of individuals, for example, the Genomics England 100,000 genomes project (Branco, 2013). The availability of such resources for IS research would allow researchers to better pinpoint multiple variants that have a small but significant impact on stroke pathology.

Bioinformatics and IS

The type and amount of next generation sequencing data have been increasing rapidly. To be able to store and analyze this increasing amount of data, extremely high-performance computing and intensive bioinformatics support must be available (Zhao et al., 2012). It is argued that the development of bioinformatic infrastructure has not been as rapid as that of NGS and has created a bottleneck (Scholz et al., 2012; Schrijver et al., 2012). It can then be concluded that the full benefit of NGS will not be achieved until bioinformatics is able to maximally interpret and utilize these short-read sequences, including alignment, assembly, and annotation.

While sufficient computing power is becoming more available to researchers, the major bottleneck is still storage capacity. With the generation of hundreds of exomes and genomes for adequate IS genomic studies, there will be immense data storage requirements. The raw sequence data for each sample usually amount to hundreds of gigabytes and, as well as being available for immediate analysis, need to be stored for potential future analysis and interpretation when analytical algorithms improve. With such large amounts of data, transfer of data between research and clinical centers is a problem. It has become routine for data to be shipped on portable hard drives with at least 2 TB capacity. However, this makes ready access to the data problematic, and the cost of buying and shipping hard drives can add significantly to the cost of the sequencing project.
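A back-of-the-envelope calculation makes the scale of this storage problem concrete. The sketch below assumes a hypothetical study of 500 samples at 200 GB of raw data each and 2 TB drives; these figures and the function name are illustrative assumptions, not values reported in any study discussed here.

```python
import math

def storage_plan(n_samples, gb_per_sample, drive_capacity_tb=2.0):
    """Estimate total raw-data volume and the number of drives needed to ship it."""
    total_tb = n_samples * gb_per_sample / 1000.0  # decimal units: 1 TB = 1000 GB
    drives = math.ceil(total_tb / drive_capacity_tb)
    return total_tb, drives

# Hypothetical study: 500 samples, 200 GB of raw reads per sample.
total_tb, drives = storage_plan(n_samples=500, gb_per_sample=200)
print(f"{total_tb:.0f} TB of raw data, {drives} drives of 2 TB each")
```

Under these assumptions the raw reads alone occupy 100 TB, or fifty 2 TB drives, before any aligned or annotated files are counted.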

The current solution to compute and storage bottlenecks is the utilization of cloud computing services. Services such as DNAnexus, Illumina BaseSpace, and Seven Bridges Genomics allow scientists to sequence, analyze, and collaborate on data via Amazon Web Services (AWS) cloud infrastructure with constantly decreasing costs. Using such services, the researcher can sequence and filter raw read data, then align, annotate, and finally visualize the annotation results. One advantage of these systems is the ability to align and annotate sequence read data through web browser interfaces, allowing researchers with no command line Linux experience to perform sophisticated genomic analysis. Another advantage is flexible data storage and exchange. Results from these pipeline services are available in spreadsheet or text file (Variant Call Format, VCF) format, or even pre-formatted for GenBank and ENA submissions.
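Because VCF is a tab-delimited text format, results exported from such services can be inspected with very little code. The sketch below is a minimal reader for the eight fixed VCF columns, written for illustration only: the function name and the two example records are invented, genotype columns are ignored, and a maintained VCF library would be preferable for real analyses.

```python
def parse_vcf(text):
    """Parse the eight fixed columns of each data line in VCF-formatted text."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, '##' metadata and the '#CHROM' header
        chrom, pos, vid, ref, alt, qual, flt, info = line.split("\t")[:8]
        info_fields = {}
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_fields[key] = value if value else True  # flags carry no value
        records.append({
            "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt.split(","), "qual": qual, "filter": flt,
            "info": info_fields,
        })
    return records

# Invented example records; positions and annotations are placeholders.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["7", "12345", ".", "G", "A", "50", "PASS", "DP=30;AF=0.004"]),
    "\t".join(["7", "12400", ".", "C", "T,G", "12", "LowQual", "DP=8;DB"]),
])
records = parse_vcf(example_vcf)
passing = [r for r in records if r["filter"] == "PASS"]
print(len(records), "records,", len(passing), "passing filters")
```

Filtering on the FILTER column, as in the last lines, is one simple way to discard low-confidence calls before downstream analysis.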

The use of private cloud services also raises problems with data privacy. With the advent of cloud computing infrastructure, the data privacy of human subjects is increasingly difficult to control. This raises public concerns about genome data ownership that may inhibit individuals from consenting to participate in genetic studies (Strom et al., 2012; Regola and Chawla, 2013).

Finally, to be able to collect and process these data and perform the analysis, adequate numbers of appropriately trained personnel are required, something that is much harder to address. The development of the previously mentioned integrated cloud services does allow analysis by researchers with only a rudimentary knowledge of bioinformatic processes. But as rapidly improving NGS technologies allow more researchers to develop bigger datasets, expanding beyond genomics to transcriptomics, metagenomics, and proteogenomics, more bioinformaticians with the ability to develop bespoke bioinformatic solutions will become essential.

Community Genomics

An accelerator for Omics research on stroke

Evidence presented in this review shows that while the factors causing IS are largely environmental, there are enough data to suggest a role for heritability and genetics in the prevalence of IS. While not as advanced as in other complex diseases such as autism and schizophrenia, GWAS and NGS studies into genetic IS biomarkers are beginning to reveal some possible candidate variants. With the push for ever bigger genomics studies that are theoretically able to detect a larger range and type of disease-associated variants (Kilpinen and Barrett, 2013), more and more bioinformatics resources are required, resources that have not kept up with the rapid improvement and volume of genomic sequencing. In order to address these challenges in the near future, alternative genomic study designs may be required.

In the realm of Mendelian genetic disorders, it has long been shown that different deleterious alleles causing the same specific genetic disorder can coexist and be expressed in families within highly endogamous communities (Bittles, 2012). The effect of such restricted gene flow is also important in the realm of complex disease genetics, which theoretically involves multiple rare variants with minor effects. Rapidly increasing population migration and community-based endogamy are proving to be important factors in tracing the genetics of complex disease (Campbell et al., 2009). However, current studies of complex disease genetics have mainly concentrated on large international studies with combined multi-population big data sets in the range of thousands of samples. This has been somewhat successful for the detection of more common variants associated with various complex diseases, but still has not addressed the effect of different IS subtypes and population genetic structure required in the search for rarer variants.

There have been some previous population genetics-based approaches to the genetic factors of complex disease, predominantly featuring runs, or regions, of homozygosity (ROH) (Ku et al., 2011). An ROH is a continuous, uninterrupted stretch of DNA sequence without heterozygosity in the diploid state, that is, a segment in which both copies of the homologous DNA are identical. Typically, reliable ROH are >500 kb or even >1 Mb in length. ROH mapping utilizes the same SNP arrays developed for GWAS analysis, though this may change as NGS data become more widely available.
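The ROH definition above can be illustrated with a minimal sketch. It assumes genotypes on one chromosome coded as 0 (homozygous reference), 1 (heterozygous), and 2 (homozygous alternate), ordered by base-pair position; the function name `find_roh`, the coding scheme, and the `min_snps` parameter are illustrative assumptions and do not come from any of the cited studies. Dedicated tools additionally tolerate occasional heterozygous or missing calls, which this sketch does not.

```python
from typing import List, Tuple


def find_roh(
    positions: List[int],
    genotypes: List[int],
    min_length_bp: int = 500_000,  # the >500 kb threshold mentioned in the text
    min_snps: int = 3,             # hypothetical minimum marker count
) -> List[Tuple[int, int, int]]:
    """Return (start_bp, end_bp, n_snps) for each run of homozygous calls.

    Any call other than 0 or 2 (heterozygous or missing) ends the current run.
    """
    if len(positions) != len(genotypes):
        raise ValueError("positions and genotypes must have the same length")

    runs = []      # (first_index, last_index) of each homozygous run
    start = None   # index of the first SNP in the run being extended
    for i, call in enumerate(genotypes):
        homozygous = call in (0, 2)
        if homozygous and start is None:
            start = i
        elif not homozygous and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(genotypes) - 1))

    # Keep only runs that are long enough and supported by enough markers.
    result = []
    for first, last in runs:
        length_bp = positions[last] - positions[first]
        n_snps = last - first + 1
        if length_bp >= min_length_bp and n_snps >= min_snps:
            result.append((positions[first], positions[last], n_snps))
    return result


# Toy data: one 600 kb homozygous run, then a heterozygous call, then a short run.
toy_positions = [100_000, 300_000, 500_000, 700_000, 900_000, 1_100_000, 1_300_000]
toy_genotypes = [0, 2, 0, 2, 1, 0, 0]
toy_roh = find_roh(toy_positions, toy_genotypes)
```

In this toy example only the first run spans at least 500 kb, so the second, 200 kb run is discarded.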

Applying ROH detection in GWAS-based studies has already led to reports of significant differences in ROH content between cases and controls for schizophrenia (Lencz et al., 2007) and late-onset Alzheimer's disease (Nalls et al., 2009). The underlying idea of this homozygosity association approach is to uncover recessive variants contributing to complex phenotypes (Ku et al., 2011). In both the schizophrenia and Alzheimer's studies, SNPs in genes that are plausible biological candidates were found in or near ROH. Therefore, to take into account the effects of restricted gene flow and improve the detection rates of IS-associated variants, we propose combining a global sampling and genotyping strategy with a subpopulation-based sampling and genotyping strategy that includes ROH analysis. Implementing such a strategy requires population genetic approaches, not just case-control approaches, that take into account concepts such as population bottlenecks, founder effects, and endogamy in order to uncover both rare and common IS-associated variants.
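One simple way to compare ROH content between cases and controls is a permutation test on per-individual ROH burden. The sketch below is only an illustration of that idea and is not the method used by Lencz et al. (2007) or Nalls et al. (2009); the function name, the one-sided test, and the burden values (total ROH length per individual, in Mb) are all hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def roh_burden_permutation_test(
    case_burden: Sequence[float],
    control_burden: Sequence[float],
    n_perm: int = 2_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """One-sided permutation test for higher mean ROH burden in cases.

    Returns (observed difference in means, permutation p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(case_burden) - mean(control_burden)

    pooled = list(case_burden) + list(control_burden)
    n_cases = len(case_burden)

    extreme = 0
    for _ in range(n_perm):
        # Shuffle case/control labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_cases]) - mean(pooled[n_cases:])
        if diff >= observed:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical burdens in Mb, for illustration only.
toy_cases = [30.0, 28.0, 35.0, 40.0]
toy_controls = [10.0, 12.0, 9.0, 11.0]
toy_diff, toy_p = roh_burden_permutation_test(toy_cases, toy_controls)
```

A real analysis would also need to account for population structure, since ROH burden varies with ancestry and endogamy independently of disease status.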

One setting in which such an approach could be taken is the biobank created as part of the study by Jiang et al. (2006) exploring IS prevalence in China. As well as sporadic samples, the researchers also sampled from 12 pedigrees with evidence of IS heritability. Figure 1 shows one of these pedigrees, in which samples are available for several trios and sib-pairs with multiple occurrences of ischemic stroke. This pedigree includes several cousins, both male and female, who have suffered IS events. The eleven other pedigrees show similar patterns, all suggesting a strong genetic predisposition to IS. More importantly, this pedigree includes affected siblings, theoretically allowing an affected sib-pair approach. Such an approach has already been developed by the “Siblings With Ischemic Stroke Study” (SWISS) (Meschia et al., 2002). Samples from the available biobank could be added to these data.

FIG. 1.

An example of community genomics. Pedigree of a Han Chinese family with strong genetic predisposition to ischemic stroke (IS). Solid symbols indicate individuals known to have suffered at least one ischemic event. Diagonal lines indicate deceased individuals.

The ability to process pedigree data could also enable the use of classical linkage techniques in the analysis of next generation sequencing data for complex diseases such as IS. One such technique, linkdatagen, was developed for Mendelian disease variant detection (Smith et al., 2011). It combines classical linkage analysis (LOD score calculation) with next generation sequencing data to narrow down the field of candidate variants. This technique has been successful in the discovery of a 3 bp deletion in the gene NR5A1 that is causative for a newly described disorder of sexual development (Eggers et al., 2015). One major pitfall of a family-based approach is that, unlike for early-onset Mendelian diseases, parent–child samples are of limited availability because of the late onset of IS. Another is the effect of familial factors on stroke incidence within a family, for example, shared environment and shared diet, which could overwhelm the detectable signal from any heritable factors.
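To make the LOD score calculation mentioned above concrete, the following is a minimal sketch of the textbook two-point LOD score for phase-known, fully informative meioses: the log10 ratio of the likelihood at a recombination fraction theta to the likelihood under no linkage (theta = 0.5). It is not the linkdatagen implementation, and real pedigree analyses rely on dedicated linkage software that handles unknown phase, missing genotypes, and penetrance models; the function name and the meiosis counts are illustrative assumptions.

```python
import math


def lod_score(recombinants: int, non_recombinants: int, theta: float) -> float:
    """Two-point LOD score for phase-known meioses.

    LOD(theta) = log10( theta^r * (1 - theta)^(n - r) / 0.5^n ),
    where r is the number of recombinants and n the total number of meioses.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if recombinants < 0 or non_recombinants < 0:
        raise ValueError("counts must be non-negative")

    n = recombinants + non_recombinants
    if theta == 0.0:
        # Complete linkage is impossible once any recombinant is observed.
        if recombinants > 0:
            return float("-inf")
        log_likelihood = 0.0
    else:
        log_likelihood = (
            recombinants * math.log10(theta)
            + non_recombinants * math.log10(1.0 - theta)
        )
    # Subtract the log-likelihood under the null of free recombination.
    return log_likelihood - n * math.log10(0.5)


# Hypothetical counts: 1 recombinant among 10 informative meioses.
thetas = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
lods = [lod_score(1, 9, t) for t in thetas]
best_theta = thetas[lods.index(max(lods))]
```

With these counts the LOD score on the grid is highest at theta = 0.1, matching the observed recombinant fraction of 1/10, and it falls to zero at theta = 0.5.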

Conclusion

We recommend that future studies into IS genetics should focus on community-based sampling as well as case-control based sporadic patient samples in order to define and detect both ‘common’ and ‘rare’ ischemic stroke associated variants. Such targeted experimental design and sampling approaches also allow more efficient utilization of bioinformatics resources in processing data from GWAS and the rapidly improving NGS-based methodologies. IS omics research should consider community genomics, and the ways in which the intersection of genomics and global society can usefully inform our understanding of IS pathophysiology and potential preventive medicine interventions in the future.

Footnotes

Acknowledgments

This work was supported by grants from the National “12th Five-Year” Plan for Science and Technology Support, China (2012BAI37B03), National Natural Science Foundation of China (81001281, 81273170, 31070727), and the Australia-China Sciences Research Foundation (ACSRF06444). The authors would like to acknowledge the reviewers and editors of OMICS for their helpful advice and input. We would especially like to thank Editor-in-Chief Dr. Vural Özdemir for his helpful advice and suggestions in the completion of this review.

Author Disclosure Statement

The authors declare that there are no conflicting interests.

References

Adams Jr, Bendixen, Kappelle, et al. (1993). Classification of subtype of acute ischemic stroke. Definitions for use in a multicenter clinical trial. TOAST. Trial of Org 10172 in Acute Stroke Treatment. Stroke, 24, 35–41.

Anderson, Biffi, Nalls, et al. (2013). Common variants within oxidative phosphorylation genes influence risk of ischemic stroke and intracerebral hemorrhage. Stroke, 44, 612–619.

Arregui, Fisher, Knuppel, et al. (2012). Significant associations of the rs2943634 (2q36.3) genetic polymorphism with adiponectin, high density lipoprotein cholesterol and ischemic stroke. Gene, 494, 190–195.

Bellenguez, Bevan, Gschwendtner, et al. (2012). Genome-wide association study identifies a variant in HDAC9 associated with large vessel ischemic stroke. Nat Genet, 44, 328–333.

Bevan and Markus. (2011). Genetics of common polygenic ischaemic stroke: Current understanding and future challenges. Stroke Res Treat, 2011, 179061.

Bevan, Traylor, Adib-Samii, et al. (2012). Genetic heritability of ischemic stroke and the contribution of previously reported candidate gene and genomewide associations. Stroke, 43, 3161–3167.

Bittles. (2012). Consanguinity in context. Cambridge Studies in Biological and Evolutionary Anthropology, Vol 63. Cambridge University Press, Cambridge, UK; New York.

Branco. (2013). Bridging genomics technology and biology. Genome Biol, 14, 312.

Cai, Zhang, Liu, et al. (2015). Mitochondrial DNA haplogroups and short-term neurological outcomes of ischemic stroke. Sci Rep, 5, 9864.

Campbell, Rudan, Bittles, and Wright. (2009). Human population structure, genome autozygosity and human health. Genome Med, 1, 91.

Cirulli and Goldstein. (2010). Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nature Rev Genetics, 11, 415–425.

Cole, Stine, Liu, et al. (2012). Rare variants in ischemic stroke: An exome pilot study. PLoS One, 7, e35591.

Cruchaga, Karch, Jin, et al. (2014). Rare coding variants in the phospholipase D3 gene confer risk for Alzheimer's disease. Nature, 505, 550–554.

Deloitte Access Economics. (2013). The economic impact of stroke on Australia. National Stroke Foundation.

Do, Kathiresan, and Abecasis. (2012). Exome sequencing and complex disease: Practical aspects of rare variant association studies. Hum Mol Genet, 21, R1–9.

Eggers, Smith, Bahlo, et al. (2015). Whole exome sequencing combined with linkage analysis identifies a novel 3 bp deletion in NR5A1. Eur J Hum Genet, 23, 486–493.

Ellinghaus, Zhang, Zeissig, et al. (2013). Association between variants of PRDM1 and NDP52 and Crohn's disease, based on exome sequencing and functional studies. Gastroenterology, 145, 339–334.

Federico, Di Donato, Bianchi, Di Palma, Taglia, and Dotti. (2012). Hereditary cerebral small vessel diseases: A review. J Neuro Sci, 322, 25–30.

Foo, Liu, and Tan. (2012). Whole-genome and whole-exome sequencing in neurological diseases. Nat Rev Neuro, 8, 508–517.

Fukutake. (2011). Cerebral autosomal recessive arteriopathy with subcortical infarcts and leukoencephalopathy (CARASIL): From discovery to gene identification. J Stroke Cerebrovas Dis, 20, 85–93.

Goldenberg, Everett, Graham, Bernard, and Nowak-Gottl. (2014). Proteomic and other mass spectrometry based “omics” biomarker discovery and validation in pediatric venous thromboembolism and arterial ischemic stroke: Current state, unmet needs, and future directions. Proteomics Clin Appl, 8, 828–836.

Hacke and Grond-Ginsbach. (2012). Commentary on a GWAS: HDAC9 and the risk for ischaemic stroke. BMC Med, 10, 70.

Holliday, Maguire, Evans, et al. (2012). Common variants at 6p21.1 are associated with large artery atherosclerotic stroke. Nat Genet, 44, 1147–1151.

Howie, Marchini, and Stephens. (2011). Genotype imputation with thousands of genomes. G3 (Bethesda), 1, 457–470.

Howie, Donnelly, and Marchini. (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5, e1000529.

Illumina. (2014). Illumina Introduces the HiSeq X™ Ten Sequencing System. Illumina. http://investor.illumina.com/phoenix.zhtml?c=121127&p=irol-newsArticle&ID=1890696. Accessed on May 25, 2015.

Jerrard-Dunne, Cloud, Hassan, and Markus. (2003). Evaluating the genetic component of ischemic stroke subtypes: A family history study. Stroke, 34, 1364–1369.

Jiang, Wang, Chen, et al. (2006). Incidence and trends of stroke and its subtypes in China: Results from three large cities. Stroke, 37, 63–68.

Kalsoom, Klopocki, Wasif, et al. (2013). Whole exome sequencing identified a novel zinc-finger gene ZNF141 associated with autosomal recessive postaxial polydactyly type A. J Med Genet, 50, 47–53.

Kilpinen and Barrett. (2013). How next-generation sequencing is transforming complex disease genetics. Trends Genet, 29, 23–30.

Kim, Crosslin, Auer, et al. (2014). Rare coding variation in paraoxonase-1 is associated with ischemic stroke in the NHLBI Exome Sequencing Project. J Lipid Res, 55, 1173–1178.

Kim, Londono, Zhou, et al. (2012). Single-variant and multi-variant trend tests for genetic association with next-generation sequencing that are robust to sequencing error. Hum Hered, 74, 172–183.

Ku, Naidoo, Teo, and Pawitan. (2011). Regions of homozygosity and their impact on complex diseases and traits. Hum Genet, 129, 1–15.

Lemmens, Hermans, Nuyens, and Thijs. (2011). Genetics of atrial fibrillation and possible implications for ischemic stroke. Stroke Res Treat, 2011, 208694.

Lencz, Lambert, DeRosse, et al. (2007). Runs of homozygosity reveal highly penetrant recessive loci in schizophrenia. Proc Natl Acad Sci USA, 104, 19942–19947.

Longuespee, Fleron, Pottier, et al. (2014). Tissue proteomics for the next decade? Towards a molecular dimension in histology. Omics, 18, 539–552.

Lozano, Naghavi, Foreman, et al. (2012). Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: A systematic analysis for the Global Burden of Disease Study 2010. Lancet, 380, 2095–2128.

Lupski, Belmont, Boerwinkle, and Gibbs. (2011). Clan genomics and the complex architecture of human disease. Cell, 147, 32–43.

Manry and Quintana-Murci. (2013). A genome-wide perspective of human diversity and its implications in infectious disease. Cold Spring Harb Perspect Med, 3, a012450.

Marchini and Howie. (2008). Comparing algorithms for genotype imputation. Am J Hum Genet, 83, 535–539; author reply 539–540.

Marchini and Howie. (2010). Genotype imputation for genome-wide association studies. Nat Rev Genet, 11, 499–511.

Markus. (2011). Stroke genetics. Hum Mol Genet, 20, R124–131.

Meschia, Arnett, Ay, et al. (2013). Stroke Genetics Network (SiGN) study: Design and rationale for a genome-wide association study of ischemic stroke subtypes. Stroke, 44, 2694–2702.

Meschia, Brown Jr., Brott, Chukwudelunzu, Hardy, and Rich. (2002). The Siblings With Ischemic Stroke Study (SWISS) protocol. BMC Med Genet, 3, 1.

Meschia, Singleton, Nalls, et al. (2011). Genomic risk profiling of ischemic stroke: Results of an international genome-wide association meta-analysis. PLoS One, 6, e23161.

Nalls, Guerreiro, Simon-Sanchez, et al. (2009). Extended tracts of homozygosity identify novel candidate genes associated with late-onset Alzheimer's disease. Neurogenetics, 10, 183–190.

National Center for Chronic Disease Prevention and Health Promotion. (2013). Know The Facts About Stroke. United States.

Ng, Bigham, Buckingham, et al. (2010). Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet, 42, 790–793.

O'Rawe, Jiang, Sun, et al. (2013). Low concordance of multiple variant-calling pipelines: Practical implications for exome and genome sequencing. Genome Med, 5, 28.

Pavlakis, Phillips, Dimauro, Devivo, and Rowland. (1984). Mitochondrial encephalopathy, lactic acidosis, and strokelike syndrome (Melas). Ann Neurol, 18, 626–626.

Podder and Latha. (2014). New insights into schizophrenia disease genes interactome in the human brain: Emerging targets and therapeutic implications in the postgenomics era. Omics, 18, 754–766.

Polychronopoulos, Gioldasis, Ellul, et al. (2002). Family history of stroke in stroke types and subtypes. J Neurol Sci, 195, 117–122.

Regola and Chawla. (2013). Storing and using health data in a virtual private cloud. J Med Internet Res, 15, 3–14.

Risch and Merikangas. (1996). The future of genetic studies of complex human diseases. Science, 273, 1516–1517.

Sacco, Ellenberg, Mohr, et al. (1989). Infarcts of undetermined cause: The NINCDS Stroke Data Bank. Ann Neurol, 25, 382–390.

Scholz, Lo, and Chain. (2012). Next generation sequencing and bioinformatic bottlenecks: The current state of metagenomic data analysis. Curr Opin Biotechnol, 23, 9–15.

Schrijver, Aziz, Farkas, et al. (2012). Opportunities and challenges associated with clinical diagnostic genome sequencing: A report of the Association for Molecular Pathology. J Mol Diagn, 14, 525–540.

Smith, Bromhead, Hildebrand, et al. (2011). Reducing the exome search space for Mendelian diseases using genetic linkage analysis of exome genotypes. Genome Biol, 12, R85.

Sproule, Wong, Hirano, and Pavlakis. (2013). Stroke-like episodes in mitochondrial encephalopathy, lactic acidosis, and stroke-like episodes (MELAS). In: Sharma, Meschia (eds) Stroke Genetics. Springer, London, pp. 107–125.

Srour, Schwartzentruber, Hamdan, et al. (2012). Mutations in C5ORF42 cause Joubert syndrome in the French Canadian population. Am J Hum Genet, 90, 693–700.

Stroke Association. (2013). Stroke Statistics. London.

Strom, Strid, and Hammarstrom. (2012). Disruption of the alox5ap gene ameliorates focal ischemic stroke: Possible consequence of impaired leukotriene biosynthesis. BMC Neurosci, 13, 146.

Sun, Wu, Zhang, et al. (2011). A tagging SNP in ALOX5AP and risk of stroke: A haplotype-based analysis among eastern Chinese Han population. Mol Biol Rep, 38, 4731–4738.

Traylor, Farrall, Holliday, et al. (2012). Genetic risk factors for ischaemic stroke and its subtypes (the METASTROKE collaboration): A meta-analysis of genome-wide association studies. Lancet Neurol, 11, 951–962.

Yamada, Fuku, Tanaka, et al. (2009). Identification of CELSR1 as a susceptibility gene for ischemic stroke in Japanese individuals by a genome-wide association study. Atherosclerosis, 207, 144–149.

Yamamoto, Craggs, Baumann, Kalimo, and Kalaria. (2011). Review: Molecular genetics and pathology of hereditary small vessel diseases of the brain. Neuropathol Appl Neurobiol, 37, 94–113.

Yu, Chahrour, Coulter, et al. (2013). Using whole-exome sequencing to identify inherited causes of autism. Neuron, 72, 259–273.

Zhao, Wang, Xu, et al. (2012). Association of inflammatory response gene polymorphism with atherothrombotic stroke in Northern Han Chinese. Acta Biochim Biophys Sin (Shanghai), 44, 1023–1030.