Abstract

We thank the discussants for their kind comments and for their insightful analysis and discussion, which have substantially added to the contribution of this issue. Overall, it seems the discussants have affirmed many of our primary points and have also raised a number of other relevant and important issues that we did not emphasize in the article. Several common threads emerged from these discussions, including the importance of software development, appropriate dissemination, and close collaboration with biomedical scientists and technology experts in order to ensure our work is relevant and impactful. Each discussant also mentioned other areas of bioinformatics that have been impacted by statistical researchers that we did not highlight in the original article. In response, we will first summarize and discuss these general themes, then respond to specific comments of each discussant, and finally discuss the additional areas of bioinformatics impacted by statisticians that were mentioned by the discussants.
General themes
Missed opportunities
All of the discussants seemed to agree with our premise that, while statisticians have made some incredible impacts on molecular biology and medicine through bioinformatics, we have missed opportunities to establish ourselves as more central leaders in the field. Michael Newton, in his review, made an interesting point that in the mid- to late 1990s, as high-throughput genomics was beginning to take off, the already-established field of statistical genomics was well positioned to assume leadership in this arena but statisticians failed to ‘throw their hats into the ring’. The reasons for this are not clear: Were they already busy with the set of DNA-based quantitative trait locus (QTL) and disease-mapping problems that was their focus and were not interested in branching out into this new area, did they not think that this field would grow as it eventually did or did the technical preprocessing problems that dominated much of the early work turn them off? It is not clear why more people in the statistical community did not get involved on the front lines as this field expanded, but we hope that our statistical community will not miss such opportunities in the future as new paradigm-changing technologies emerge.
In reading the discussions, we see three key practical points emerge that can help us avoid these missed opportunities for statisticians to make a stronger impact as leaders of the field: the dissemination of freely available, implementable software; an increased focus on publishing our work in subject-area and practice-oriented journals that may be of higher impact and more visible to the biomedical community; and a commitment to work closely with biomedical collaborators and technology experts to guide our work towards the most important problems and ensure we provide relevant input into all steps of collecting and processing the data.
Software
As emphasized by Kechris and Ghosh, and by Ma, Song and Tseng, the availability and dissemination of implementable software are crucial for new statistical methods to make a practical impact on science. We thank them for bringing up this important point, a point with which we strongly agree but did not emphasize. Indeed, one major component of all of the examples we included of statistical impacts on bioinformatics was the development of freely available, usable software to implement the underlying methods. The development of the Bioconductor package for R has strongly enabled this process, as it established R, the preferred software for statistical methodology developers, as the most commonly used platform for molecular biologists and provided an open-source package onto which their contributions can be added and quickly disseminated.
However, not all statistical methods developed are accompanied by usable software. One reason for this is that the culture of the statistical methodology community has perhaps undervalued the importance of providing freely available, practically usable software when new methods are published. These software packages should not just work but be written in general enough terms so that they can be broadly applied to new problems, should be well documented to enable users to effectively apply and adapt them for their uses and should be computationally efficient enough to scale up to the big data settings for which they were intended.
Besides methodology, another area of software development to which statisticians as a group have not given enough thought is visualization. Given the complexity and volume of high-dimensional data, computationally efficient tools to visualize these data are a major need of practitioners, and are nowadays deeply valued and rewarded. For example, colleagues in the Department of Bioinformatics and Computational Biology at the University of Texas MD Anderson Cancer Center, led by Bradley Broom and John Weinstein, have developed next-generation clustered heatmaps (NG-CHM).
Unfortunately, the reward system in the statistical community has been more aligned towards innovation and mathematical depth than the provision of implementable software or visualization tools. As a result, there are many potentially impactful methods developed by statisticians appearing in top statistical journals that are not appreciated or being used by others because of the lack of such software. This has begun to change in recent years, as more journals are requesting or even requiring the sharing of usable software with published statistical methodology papers, and some journals publishing articles on data visualization approaches. Some are even requesting or requiring the sharing of data used in the article, which as emphasized by Houwing-Duistermaat, Uh and Gusnanto can also contribute to greater reproducibility and more impactful work. We need to keep this trend moving forward and purposefully choose to value the development of good software when evaluating methodological contributions of our colleagues and trainees. As this trend continues, there is a promise that the statistical community's work can make a broader and deeper impact on science.
Dissemination
Another aspect of our reward system in the statistical community contributing to our limited impact is the journals we value. Understandably, the statistical research community tends to most strongly value publication in the top-level statistical methodology journals. Perhaps this is appropriate, as we strive for these to include the best new methodological work our profession has to offer, and we should recognize what we as a field value as the strongest work. However, an unfortunate by-product of this emphasis on publishing in statistical journals is that we undervalue contributions to other journals, especially subject-matter journals. The publishing of new methods in subject-matter journals can be challenging in its own right and can have even greater potential to impact a scientific field than an article in a statistical journal. Valuing such contributions will require a change in culture and pedagogy, but this is already happening to some degree in multiple academic departments.
As the low impact factors of even the highest ranked statistical journals can attest, few researchers outside of the statistical community read or cite statistical papers. By contrast, papers in subject-area journals providing statistical guidance or introducing new statistical methods to a specific scientific field can have exceptionally high citation counts, in the hundreds or thousands, and can effect change in statistical practice within that field. Statisticians may or may not think the methods in these papers are the best, but the point remains that articles providing clear articulations of new methods in the subject-area journals, speaking the language of that field and addressing what they recognize as their key quantitative challenges, are heavily read by biological and medical researchers and effect change in the corresponding fields of science. More statisticians should be publishing articles like these in order to better disseminate our methods and provide guidance on best statistical practice, thus exercising quantitative leadership in these fields.
The discussions of Kechris and Ghosh and of Ma, Song and Tseng hit on this point. Kechris and Ghosh emphasize that statisticians tend to be slow to the game and that, when we do publish, our statistical journals are so slow that by the time the work appears, it can already be obsolete. By publishing only in statistical journals and not in the high-impact subject-area journals, many statisticians are forfeiting the opportunity for their methods to become more widely known and make an impact in science. If many of us in our profession continue to make this choice, we as a group will forfeit our leadership of statistical analysis practice within the scientific community and will miss opportunities to have a seat at the table of influence in science and society. In these cases, undoubtedly other quantitative scientists will step up in our place.
As emphasized by Kechris and Ghosh, there are practical steps we can take to make stronger impacts. We can publish the deeper mathematical principles and technical details in a statistical paper, but then publish more applied papers showing how a new method makes a difference in a subject-area journal or we can publish a software-focused paper in a bioinformatics journal. In this way, we can still publish the best methods in top statistical methodology journals yet still have our work disseminated in other venues that will reach a broader community of researchers. While it takes greater effort to learn how to write a paper that resonates with the broader audience of these subject-area journals, this effort will pay off in greater impact and position our field well for the future.
Interdisciplinary collaboration
Various discussants highlighted the importance of interdisciplinary collaboration and effective communication and coordination between biomedical scientists and statisticians if we are to make a practical impact on the science. Baggerly highlighted that it is crucial for statisticians to understand the biological goals of our collaborators, and that communication is key for us to be able to incorporate biological knowledge in an intelligent way into the analysis. Houwing-Duistermaat, Uh and Gusnanto mention their training programme ‘Innovative Methods for Future Datasets’ that seeks to educate both young statisticians and young chemists and biologists about experimental design, reproducible research and preprocessing of noisy omics data, and effectively engenders communications between them. Kechris and Ghosh emphasize that statisticians should directly work with biomedical investigators so that they can develop methods in context and provide results that are as interpretable as possible so that they resonate with biomedical scientists and are most likely to get used. In the context of mass spectrometry metabolomics and proteomics, Dowsey emphasizes the need for statisticians to be more heavily involved in the scientific process, working closely with other scientists so that they can influence the design and operation of the instrumentation, and even optimize the acquisition process itself. Our own experiences working in a premier research-oriented cancer center attest to this fact.
Newton's discussion contains a deep and interesting discussion of this topic. He introduces the terminology ‘contextual features’ to deal with the biological underpinnings of measurement technologies that must be understood for proper analysis. He emphasizes that statisticians tend to partition out these contextual issues in their attempt to distill the context-specific analytical problems into generic data problems, and that this practice can make it hard for the statistician to address the most important problems of the science. He characterizes bioinformatics as operating as a ‘three-legged stool’ of context, statistics and computational details, with the implication that all three are equally needed and work together in effective methods. We wholeheartedly agree with this well-conceived characterization, and with the insights mentioned by the discussants.
Response to specific discussants
Keith Baggerly
We thank Keith Baggerly (2017, KAB17 henceforth) for his discussion of our article. We consider Keith an international leader in reproducible research with a detailed eye and mind for the key fundamental issues often missed or glossed over by other statisticians, and with a no-nonsense approach to seek to uncover the true insights contained in these data while being extra careful to look out for spurious results. It is reassuring for us to see his agreement with our premise of the major role of statisticians in bioinformatics and his sympathy to the goal of making fuller use of the data. He also augments the discussion with several crucial points, on which we now briefly comment.
As we mentioned in our article, one of the unique perspectives we statisticians bring to the table is a deep understanding of variability and uncertainty, which allows us to provide inferential summaries and probability statements, not just point estimates. Some quantitative scientists focus on providing a point estimate that optimizes the data, without a rigorous sense of how likely the results are to characterize the underlying true process. Our inferential procedures allow us to make probability statements that account for multiple testing, such as experimentwise error rate and false discovery rate (FDR) methods, and thus have the potential to tell us when there is ‘nothing there’. Now, this idea of ‘uncertainty quantification’ has grown in prominence among other quantitative scientists, but we statisticians are experts in understanding these concepts and should be the ones to assert ourselves as leaders.
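As a concrete illustration of how an FDR procedure can report that there is 'nothing there', the following is a minimal sketch of the Benjamini-Hochberg step-up procedure. The function name and the toy p-values are hypothetical and chosen only for illustration; this is not code from any of the methods discussed in the article.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values containing two clear signals.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p_vals, alpha=0.05)

# Hypothetical p-values with no signal: the procedure rejects nothing.
null_flags = benjamini_hochberg([0.2, 0.5, 0.7, 0.9], alpha=0.05)
```

In the first toy set only the two smallest p-values are flagged, although five of them fall below 0.05 individually; in the second set no hypothesis is rejected at all.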
Baggerly also briefly mentions the importance of considering both statistical and practical significance in order to ensure that results reported are likely to be real and important. We strongly agree with this point and put this into practice in our Bayesian modelling whereby we flag results that have high probabilities of some minimum practical effect size, allowing us to account for both statistical and practical significance in our inference.
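One simple way to express this idea in code is to estimate, from posterior draws of an effect, the probability that its magnitude exceeds a minimum practical effect size, and to flag the feature only when that probability is high. The sketch below is a generic illustration under assumed names (`prob_exceeds`, `flag_feature`), an assumed cutoff of 0.95 and simulated draws; it is not the implementation used in our Bayesian models.

```python
import random


def prob_exceeds(posterior_draws, min_effect):
    """Monte Carlo estimate of P(|effect| >= min_effect | data)."""
    hits = sum(1 for d in posterior_draws if abs(d) >= min_effect)
    return hits / len(posterior_draws)


def flag_feature(posterior_draws, min_effect, prob_cutoff=0.95):
    """Flag a feature when a practically relevant effect is highly probable."""
    return prob_exceeds(posterior_draws, min_effect) >= prob_cutoff


rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Hypothetical posterior draws standing in for MCMC output.
# 'big': effect centred well above the practical threshold.
big = [rng.gauss(2.0, 0.3) for _ in range(5000)]
# 'tiny': effect estimated very precisely and clearly non-zero, but small.
tiny = [rng.gauss(0.2, 0.02) for _ in range(5000)]

min_effect = 1.0  # assumed minimum practically relevant effect size
big_flagged = flag_feature(big, min_effect)
tiny_flagged = flag_feature(tiny, min_effect)
```

The second feature would be declared significant by a test of a zero effect, since its posterior sits far from zero, yet it is not flagged here because the effect is almost certainly smaller than the practical threshold.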
For example, in a given context, certain genes may be known to be regulated by particular transcription factors, methylation or copy number changes. Integrative analyses allow us to investigate whether these relationships are indeed apparent and in the expected direction. Also, these inter-platform relationships allow us to reduce the model space by focusing on the most biologically relevant changes, thus reducing false positives and gaining power to detect significant differences. For example, if our goal is to find differentially methylated regions, rather than focusing on the entire set of 450k CpGs across the entire genome, we can first filter those CpGs whose methylation is associated with gene expression, and then only consider those. This reduces the number of multiple tests and focuses on those most likely to be biologically relevant. Cross-platform integrative modelling can effectively increase the sample size of available information and provide additional power for discoveries.
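The effect of such filtering on the multiple-testing burden can be seen in a small sketch. It uses a Bonferroni correction for simplicity, toy p-values and a hypothetical indicator `linked_to_expression` standing in for the result of a prior methylation-expression association step; all names and numbers are assumptions made for illustration. The sketch also assumes that the filter does not depend on the differential methylation test statistics under the null, which is needed for the filtered analysis to remain valid.

```python
def bonferroni_hits(p_values, alpha=0.05):
    """Indices whose p-value passes the Bonferroni threshold alpha / (number of tests)."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [i for i, p in enumerate(p_values) if p <= threshold]


# Hypothetical toy data: 1000 CpGs with differential-methylation p-values.
n_cpgs = 1000
p_diff_meth = [0.5] * n_cpgs
p_diff_meth[3] = 0.001    # modest signal at a CpG linked to expression
p_diff_meth[500] = 0.001  # same p-value at a CpG not linked to expression

# Assumed output of a prior step: the first 20 CpGs are associated with expression.
linked_to_expression = [i < 20 for i in range(n_cpgs)]

# Testing every CpG: threshold is 0.05 / 1000 = 0.00005, so nothing is detected.
all_hits = bonferroni_hits(p_diff_meth)

# Testing only the filtered CpGs: threshold is 0.05 / 20 = 0.0025.
kept = [i for i in range(n_cpgs) if linked_to_expression[i]]
filtered_hits = [kept[j] for j in bonferroni_hits([p_diff_meth[i] for i in kept])]
```

With all 1000 CpGs tested, the modest signal at CpG 3 is missed; after filtering to the 20 expression-linked CpGs, the same p-value clears the less stringent threshold.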
Michael A Newton
We thank Michael Newton (2017, MAN17 hereafter) for his insightful discussion of a number of key issues in the application of statistical methods to bioinformatics. His extensive experience and unique perspective lead him to provide some deep insights and viewpoints on these issues. We comment on some of them here.
Algorithmic approaches have dominated in a vast majority of problems in bioinformatics, with model-based methods underutilized and sometimes even viewed with skepticism by many practitioners. This field was one where the use of statistical models provided key insights not contained in the data themselves, namely the key patterns in the model-estimated probabilities. He mentions this key insight has made an enormous impact in many other areas of sequence analysis. We additionally note that this example also illustrates the benefits of unified modelling, learning probabilities across and within samples through a hierarchical model, to provide improved inference.
First, one of my most influential mentors was Emmanuel Parzen, one of the great pioneers in time series and kernel density estimation as well as the author of the seminal probability theory textbook that educated a generation of mathematical statisticians. One saying of his that stuck with me and shaped my understanding of statistics was (paraphrased by Morris), ‘When you get results from a statistical procedure, you need to ask yourself what part of the answer came from the data, and what part came from the model.’ This principle is not unique to model-based approaches, but is true of any statistical method, as all statistical methods have underlying assumptions that must be made and can have an impact on the resulting inference. This issue is most commonly raised to criticize model-based approaches, especially Bayesian ones where the prior is another element of the model that must be somehow specified, but it is also true of algorithmic or seemingly nonparametric approaches. There are always assumptions, and we need to test these assumptions and assess robustness of our results to them whenever possible.
We believe this is an essential practice that must be followed by careful statisticians, especially in the context of complex, high-dimensional data, since models may have to make assumptions about multiple aspects of the data and any of these assumptions may have a strong impact on results. We believe that various types of model checking should be routinely done, including graphical and statistical testing of assumptions as well as sensitivity analyses tweaking the model by varying certain assumptions about structural, distributional and prior characteristics as well as regularization parameters and then assessing persistence of the key results across these conditions.
In much of our own work on models for complex, high-dimensional data, we aim to build models that, while borrowing strength across multiple data elements to gain efficiency and reduce effective dimensionality, are flexible enough to adapt to the key features of the data. This approach can mitigate some of the inherent bias of model-based approaches that is the source of Newton's concern while still effectively discovering and accounting for structure in these data elements to yield increased efficiency over piecemeal elementwise approaches. Indeed, this is the key motivation behind the large set of works we have done on Bayesian functional mixed models for complex, high-dimensional functional data. These principles also extend to other complex modelling efforts such as integrative genomics and graph-based methods to discover associations.
One must assess when it is worthwhile to utilize a complex modelling approach and when simpler, perhaps algorithmic approaches are sufficient. The key factor is that, like almost anything, there are trade-offs that must be considered. When utilizing a complex model or statistical method, there is an inherent ‘complexity cost’ that is incurred. This cost takes various forms, from extra time to build the model and assess its fit, a greater level of expertise to understand the various components of the model, the greater difficulty explaining it and getting readers and reviewers to understand it, and the risk of a reviewer rejecting the paper because of a lack of understanding of the innovative model. In many cases, it is much easier to use the simplest possible method, especially if the simple method is what scientists in the given field are used to seeing. This complexity cost is real and should be taken into account.
However, at times, advanced methods, including complex, flexible modelling frameworks, can yield significant benefits that make them worthy of use. These benefits can come in the form of improved prediction, increased power for discoveries and reduced FDRs, and are provided by the data integration, dimension reduction and incorporation of scientific information undergirding effective unified modelling approaches. In cases where these benefits are realized, the upside of the modelling can justify the complexity cost. We believe that it is the bounden duty of the statistical modeller to assess the benefits of any more complex model against the inherent complexity cost, and that this should be a regular query we raise with authors of such methods in statistical journals.
Genevera I Allen
We thank Genevera Allen (2017, GIA17 henceforth) for an insightful discussion of our article, noting that our article is a ‘comprehensive and compelling review’ and, in particular, highlighting some major challenges and opportunities in the nascent field of data integration. We summarize and respond to some of these comments below.
Newton also expressed his dislike for this term ‘integromics’. Allen suggested ‘data integration’. We can sympathize with the overuse of the -omics suffix and are not wed to the term integromics. However, we believe that for these types of analyses combining information across multiple modalities, integration is the best word, so perhaps ‘integrative analysis’ could be used. Whatever the term, we believe this is among the most important areas of quantitative science, and we statisticians need to be on the front line of developing powerful methods to integrate information across multiple modalities to gain deeper knowledge of the underlying biomedical processes. We need to make sure that we do not miss the opportunity to establish ourselves as leaders in this area.
The first is to obtain the relevant data that can contain the information to answer important scientific questions. Various national/international consortium-level efforts have been made at this level, which include TCGA (cancergenome.nih.gov) and the International Cancer Genome Consortium (ICGC; icgc.org). Many data extraction pipelines have been developed to extract formatted and analysable data from these portals, especially TCGA. GIA17 mentioned TCGA2STAT, and some others in this domain include TCGAAssembler (Zhu et al., 2014) and a standardized data portal available publicly at
From an analysis perspective, a major issue is the existence of batch effects, since typically these data types are collected at different time points and locations, which might induce technical artefacts not attributable to any real biology. We alluded to this in our main article and would like to re-emphasize the importance of this step and point out that, with increasing types of data being generated, there exists a rich opportunity for methodological developments in this area. Another issue is missing data, which currently poses a major challenge for developing coherent statistical methods, especially model-based ones, since the ‘matching’ of both samples and platforms often leads to severe loss of available complete cases. We believe this can be a very fertile ground for developing new missing data methods, especially those where the biological knowledge about the interactions within and between platforms can be brought into the methodologies.
Moving beyond prediction, the goal of many multi-platform genomics studies is to generate new biological/scientific hypotheses from multi-view data that might have some potential clinical or translational relevance, an endeavour that could be called ‘discovery science’. Most of the recent advances in this area have been through exploratory learning methods, for example, data visualization, dimension reduction, pattern recognition, clustering, feature selection and network structure learning. There are few existing methods for integrative learning as most of these methods model in a latent space, such as lower-dimensional summaries provided by matrix decompositions, which may not capture all the dependencies. Also, many of these tools are only available for continuous data and hence not ideally suited to mixed data types. This might be another germane area for investigation.
We would also like to highlight a recent Bayesian treatment of this problem. Bhadra et al. (2017) proposed a unified Bayesian graphical modelling procedure for inferring dependence structure for mixed as well as non-normal data types using Gaussian scale mixtures. They introduce the concept of conditional sign independence, which captures stochastic independence in terms of signs for different variables as opposed to magnitude. This conditional sign independence metric is especially relevant to mixed data types as it has a very intuitive interpretation. For example, in a mixed-data context, it might not make sense to compare actual numeric values, e.g. mRNA, to binary data, e.g. mutation data, but rather one might be interested to evaluate if positive values indicating presence of mutation co-occur with upregulation of some gene conditional on the rest of the variables of interest. Conversely, one might also want to investigate if two arbitrarily coded binary deleterious mutations are likely to co-occur, accounting for the effect of the rest of the variables. This makes conditional sign independence a versatile tool for establishing networks between mixed data types.
Katerina Kechris and Debashis Ghosh
We thank Kechris and Ghosh (2017, KG17 hereafter) for their thoughtful and deeply illuminating review of our article. They have brought out several complementary points emphasizing the importance of dissemination and interpretability. We highlight and comment on some of the issues raised in their review.
They also point out that the constantly changing nature of high-throughput technologies, although ‘exciting and rewarding’, presents some challenges, for example, requiring the updating of courses. They also agree that there are some constant themes that emerge regardless of technology, so by carrying principles we have learned on older technologies forward to newer technologies, we avoid trying to ‘reinvent the wheel’. As we pointed out in our review, they reiterated the importance of the primacy of pre-processing, high-dimensional (p ≫ n) problems, and structured dependencies in the variables and data such as higher-order biological interactions.
Tianzhou Ma, Chi Song and George C Tseng
We thank Ma, Song and Tseng (2017, MST17 hereafter) for a very thorough and highly informative review of our article. MST17 have made several points and, as true statisticians, even backed their points with empirical justifications that are both very complementary to and in line with points we have made. We highlight some of these below.
MST17 also cover machine-learning areas in which statisticians have made a tremendous impact, particularly supervised and unsupervised learning. This area has taken off over the last couple of decades and has found wide applicability in bioinformatics. For example, several machine-learning approaches such as shrunken nearest centroid, random forests, support vector machines and multiple kernel learning are often top performers in classification problems with high-throughput data. In clustering, there are several methods that have found traction, such as frequentist and Bayesian Gaussian mixture models, weighted correlation networks, gap statistics and resampling-based methods for both single and multi-platform data, including methods from the Tseng lab. As MST17 point out, there is a great deal of overlap with computer scientists and applied mathematicians in these efforts, but statisticians should continue to play an important role and strive for leadership.
MST17 also cover the important area of meta-analyses, especially horizontal meta-analyses, which involve the integration of information across multiple single-platform studies. This integration of information can increase power by combining samples across studies and enable researchers to find more subtle signals. They describe some success stories in this area and also other works that conduct meta-analyses for specific purposes such as differential expression, quality control, pathway analyses, clustering and classification, to name a few. Finally, MST17 cover some evaluative and comparative studies that are instrumental in guiding practitioners to select the best method for a given task, for example, classification, clustering, missing-value imputation, microarray processing and GWAS meta-analysis, differential expression and fusion transcript detection in RNA-seq.
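The power gain from combining studies can be illustrated with a fixed-effect, inverse-variance weighted meta-analysis, one of the simplest pooling schemes. The sketch below uses hypothetical effect estimates and standard errors for a single gene from three studies; it is a generic illustration under these assumptions and not one of the specific methods reviewed by MST17.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical per-study effect estimates and standard errors for one gene.
estimates = [0.30, 0.25, 0.35]
std_errors = [0.20, 0.20, 0.20]

pooled, pooled_se = fixed_effect_meta(estimates, std_errors)

# z-statistics within each study and for the pooled estimate.
z_single = [e / se for e, se in zip(estimates, std_errors)]
z_pooled = pooled / pooled_se
```

In this toy example no single study reaches the conventional two-sided 5% threshold (|z| > 1.96), whereas the pooled estimate does, because the pooled standard error shrinks by a factor of the square root of three when three equally precise studies are combined.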
An additional point raised by MST17 that we find to be very pertinent is the importance of making data publicly available. There are public databases such as GEO for microarray data and the Sequence Read Archive (SRA) for sequencing data that serve as repositories for enhancing additional learning and replication of study results. They point out some issues that impede open data sharing, such as privacy issues that prevent sharing of raw patient-level sequencing data. Potential solutions to this problem include protected databases such as dbGaP, but these involve substantial administrative work and establishment of data sharing standards such as Minimum Information About a Microarray Experiment (MIAME). Protocols such as this describe useful and minimal information and also potentially protect patient privacy. This might be another area in which statisticians can contribute moving forward.
Jeanine J Houwing-Duistermaat, Hae Won Uh and Arief Gusnanto
We thank Houwing-Duistermaat, Uh and Gusnanto (2017, HDUG17 hereafter) for their compliments and for their thoughtful discourse on glycomics and how the issues we discussed relate to this emerging field. Their discussion explores various key challenges in the field of glycomics, discusses the current state of the art relative to statistical practice and raises open problems and issues for this field. We will respond to some of the issues they raised.
Andrew W Dowsey
Dowsey (2017, AWD17 hereafter) presents a detailed discussion of statistical issues in mass spectrometry proteomics and metabolomics, an important area that generates enormous structured datasets on the order of many gigabytes to even terabytes. He paints a picture of the current analytical practices, points out their limitations and raises interesting ideas for how to improve the status quo while connecting to the fundamental principles we raise in our article. He does so at a great level of detail and with strong insights coming from one of the international leaders in this area, and as a result his presentation poignantly touches on the points we are trying to make in this article. We summarize and comment on some of his points here.
Unfortunately, the quantitative problems raised by these challenges have not been adequately acknowledged by the field, which Dowsey mentioned has instead chosen to focus on improved instrumentation rather than better processing algorithms. We see this general pattern at work in numerous biotechnology fields, which look to overcome limitations by developing more sensitive instruments rather than by improving the data-processing algorithms that are in many cases the key factors preventing effective extraction of the full information from the current instruments. Unless the quantitative algorithms are optimized, the advantages provided by the new instrumentation can largely go to waste.
AWD17 points out that the typical MS-based bioinformatic tools are multi-step, and fail to propagate uncertainty or borrow information across steps, making them inefficient and propagating errors throughout the process. He raises some issues that make stochastic modelling of peptides especially challenging, more difficult than DNA or RNA, including the lack of uniqueness of peptide signals to specific proteins or processes and the complex missing data problems. These issues make the design of more rigorous statistical pipelines challenging but worth the effort as there is a great deal of information lost by the current error-prone strategies.
Other statistical contributions to bioinformatics
As we disclaimed, our article did not attempt to provide an exhaustive summary of areas in which statisticians have made a strong impact on bioinformatics. We sought to describe several key principles that we see as veins underlying many of the strong contributions statisticians have made to the field, and then highlight these areas through some examples of great impact and some other work that illustrates the principles even if not as impactful. By elucidating these principles, we hope we are able to help stimulate researchers to apply these principles and make strong contributions in other new technologies as they emerge. The various discussants have highlighted other areas of bioinformatics that have been strongly impacted by, and in some cases have been led by, statisticians, or are in need of greater statistical input.
Both Newton and Kechris and Ghosh remark that statistical approaches in bioinformatics date back to the early days of sequence databases in the early 1970s, and also mention likelihood-based methods for phylogenies. These phylogenetic approaches are useful for modelling single-cell sequencing data, an emerging area mentioned by both Newton and Ma, Song and Tseng as one needing significant statistical input. With the ever-increasing efficiency of sequencing platforms, sequence-based methods now dominate the DNA, RNA and epigenetics fields, and while not emphasized in our article, there are many contributions and problems in this area. Kechris and Ghosh mention extreme value distributions for scoring alignments, and along with Ma, Song and Tseng highlight technologies studying gene regulation including Hi-C, HITS-CLIP, ChIP-chip and ChIP-seq, and eQTL analysis. Newton described some details about contributions to transcription factor binding site detection, which serves as a nice example of a setting in which model-based methods were able to solve a problem that was not adequately solved by algorithmic approaches.
Kechris and Ghosh reference metabolomics and proteomics technologies based on mass spectrometry and aptamer-based proteomic technologies such as SOMAscan, and Dowsey provides an illuminating detailed discussion of statistical contributions and problems in LC-MS metabolomics and proteomics analysis. Houwing-Duistermaat, Uh and Gusnanto dissect various fundamental issues and statistical problems in glycomics and mention protein folding, an important subfield of proteomics with some intricate and challenging quantitative problems.
Ma, Song and Tseng discuss a number of fundamental contributions to early work in genomics not mentioned by us. They recognize methods for differential expression analysis including SAM, LIMMA, edgeR and DESeq that are all very highly cited, as well as methods for study design and power calculations that have also been extremely impactful.
Statistical machine learning has also made fundamental contributions to bioinformatics, as statistical leaders have taken some of the principles used by machine-learning computer scientists, framed them in a statistical framework and further developed them. Ma, Song and Tseng provide numerous examples of methods for classification, clustering and integrative multi-platform modelling problems at the interface of machine learning and statistics. They also mention horizontal meta-analysis methods that can increase power by combining information across studies, and describe a number of comparative/evaluative studies in which statisticians have compared various methods, including classification and clustering tools for microarrays, missing-value imputation methods for microarrays and GWAS, and differential expression and fusion detection methods for RNA-seq data. These comparative studies tend to be highly cited and provide strong guidance to practitioners.
It is encouraging to see so many areas in which statisticians have made an impact, and also to see a clear delineation of some of the key areas currently in need of greater statistical input.
Conclusions
Once again, we thank the various discussants for their insightful comments that add significantly to our discussion of statistical contributions to bioinformatics, and we thank the editors and publishers for devoting this issue to this topic. We believe that the quantitative problems raised by the complexity and high dimensionality of multi-platform genomics data are the most important problems in biomedical science, and that a lack of efficient and effective tools to solve these problems has the potential to be the bottleneck preventing the scientific community from gaining the deeper understanding of basic human biology that is needed to provide the next generation of medical strategies that can more effectively treat our most vexing medical ailments. Quantitative scientists of various ilks will certainly be involved in the front lines of science solving these problems. It is our hope that statisticians, equipped with deep understanding of variability and uncertainty and armed with a vast array of modelling and inferential techniques, will be leading this charge and through their deep impact be recognized as the leaders that they should be.
Acknowledgments
This work has been supported by grants from the National Cancer Institute (R01-CA178744, P30-CA016672, R01-CA160736, R01-CA194391) and the National Science Foundation (1550088, 1463233).
