Machine learning for precision medicine forecasts and challenges when incorporating non omics and omics data

Abstract

Precision medicine has emerged as a preventive, diagnostic, and treatment tool for approaching human diseases in a personalized manner. Because precision medicine incorporates omics data together with the knowledge held in personal health records, people who live in industrially polluted areas stand to benefit from it. Integrating non-omics data and related biological knowledge with omics data is now a reality. The heterogeneous characteristics of non-omics data and the high dimensionality of omics data make this integration challenging, but hard data analytics problems also create opportunities. This review cuts across machine learning models for the eventual development of a successful precision medicine forecast model, strategies for integrating non-omics and omics data, limitations and challenges in data integration, and future directions for precision medicine forecasts. It also discusses non-omics data, diseases associated with air pollutants, and omics data. This information gives insight into integrated data analytics and its application in future projects. The review intends to motivate researchers and developers of precision medicine forecast models to adopt a global, integrative analytical approach.

Keywords

Precision medicine, epidemiological studies, air pollution exposure, health impacts, machine learning, non-omics data, omics data, data integration

1. Introduction

The aim of this review is to survey the scope of data analytics research for integrated data and to evaluate the problems in constructing, evaluating, and using machine learning models in precision medicine. Advances in machine learning often come from the uniqueness of the dataset rather than from the algorithm. For example, Google developed Chinese-to-English translation using an old algorithm but data collected in 2015, which illustrates how critical the data is.

Precision medicine embraces all steps of new therapies, disease processes, and disease prevention. Treatment plans, medical interventions, medication, hospital care, specialized care, and healthcare services would differ according to the sociodemographic and environmental background of each individual. A large number of clinical tests performed on individuals to detect diseases can overwhelm healthcare professionals and lead to treatment delays. Accurately distinguishing individuals at high risk of developing specific diseases from those at low risk can alleviate this problem. Epidemiological studies in an industrially polluted region provide important clues about the most common non-communicable diseases in and around that area. Choices in epidemiological studies have become delicate because of the ample growth of knowledge in medicine and computational methods. What we need is wisdom. Choosing Wisely [1] is a global initiative in the medical field today. Its goal is to promote discussion of which medical interventions a given patient needs, and it encourages a good rapport between healthcare professionals and patients. The approach favours care that is supported by evidence and avoids duplicating tests that the patient has already received. This is clearly a part of personalised medicine.

Two types of modelling are used to relate environmental air pollution exposure to health outcomes. The first establishes a causal link between environmental air pollution and health effects [2, 3]. The other emphasizes the association between environmental air pollution and health issues [4, 5]. In the former case, it is unethical to deliberately expose humans to possible environmental air pollutants in order to study them. The latter relies on predictive modelling using empirical methods such as machine learning techniques and ensemble learning. In the causal link or fuzzy method, a causal inference or approximation of the parameters is made through linguistic terms such as very high, high, low, and very low [6, 7], and the final interpretation is obtained by mapping the identified parameters. In the empirical method, algorithmic identification of the parameters associated with an outcome of interest allows researchers to validate and interpret the results independently in subsequent studies [8, 9]. Ensemble models that combine outputs from several pretrained models usually outperform individual models [10, 11]. Hybrid models using deep structured networks also perform well on large heterogeneous datasets [12, 13].

Gene-environment interactions point to the existence of a confounder in the machine learning model. That is, a gene-environment interaction may act as an extraneous factor for the disease: it may not be the actual cause of the disease, but it can be a surrogate factor for it. Precision medicine is an integrated approach that is predictive, preventive, personalized, and attentive to patient satisfaction. This study examines the applicability of advanced machine learning techniques to precision medicine forecasts for individuals in an industrially polluted region by integrating non-omics data, such as sociodemographic characteristics, environmental data, and clinical data, with omics data. Figure 1 shows the strategy for a precision medicine forecast model.

Figure 1.

The strategy for a precision medicine forecast.

Section 2 discusses the different types of non-omics data in the epidemiological approach and the associations between pollution exposures and health outcomes. Section 3 discusses different types of omics technologies and omics factors for chronic diseases. Section 4 focuses on the strategies for data integration and the mathematical background of the machine learning algorithms used for integration. Section 5 goes through the machine learning models for precision medicine forecasts. Section 6 discusses the results of the integration strategies, and Section 7 covers the limitations and challenges of non-omics data, omics data, integration strategies, and precision medicine forecast models. Finally, Section 8 presents future directions and Section 9 concludes.

2. Non-omics data

In this part, we discuss the different types of non-omics data in epidemiological approaches for precision medicine forecasts in an industrially polluted region. Non-omics data consists of epidemiological data, environmental data, and clinical data. Epidemiological data is sensitive to the survey or interview mode and to how well the questions are standardized, both of which affect data quality and comparability. It includes sociodemographic characteristics, health problems, family history of diseases, air pollution exposure, the proportion of time exposed, etc. Environmental data comprises air pollutants, industrial emissions, air quality, etc. Clinical data includes health records and insurance records. Clinical parameters may be difficult to use because of the complexity of their medical terms. For instance, the results of a cancer diagnosis come from pathological descriptions and medical image reports. Figure 2 shows the outline of non-omics data. The heterogeneity of non-omics data creates a demand for its integration. As our aim is to focus on an integrated empirical model, there is no need to make prior assumptions about the data, its functional form, or its probability distributions.

Figure 2.

Overview of non-omics data.

Residents living in close proximity to an industrially polluted area are prone to several types of diseases. The World Health Organization (WHO) has termed air pollution a silent killer, since its impacts often go unnoticed or unmeasured. Numerous studies have found that the detrimental health effects of long-term air pollution exposure are related to the levels set by national air quality standards. Environmental factors are closely tied to the health issues of individuals. According to the objectives of the analysis, epidemiological approaches can be grouped into three categories, namely:

• Health profile of communities and association with local environmental risk factors.

• Associations between pollution exposures and health outcomes.

• Surveillance risk prediction.

Table 1

Summary of air pollution and identified health impacts

Reference	Exposure factor	Associated diseases
Teng et al. [14]	PM$_{2.5}$, NO$_{2}$	Allergic rhinitis
Bernatsky et al. [15]	PM$_{2.5}$	Autoimmune diseases
Prada et al. [16]	PM$_{2.5}$	Bone diseases
Hamra et al. [17], Turner et al. [18], Jenitz et al. [19], Lavinge et al. [20]	PM$_{2.5}$, NO$_{2}$, benzene	Cancers
An et al. [21], Wellenius et al. [22]	PM, CO, NO$_{2}$, SO$_{2}$	Cardiovascular diseases
Lam et al. [23], Liu et al. [24], Wilker et al. [25], Zanobetti et al. [26]	Particulate matter, polycyclic aromatic hydrocarbons, PM$_{2.5}$, CO, NO$_{2}$, O$_{3}$, SO$_{2}$	Cognitive function & neurological diseases
Honda et al. [27], Eze et al. [28], Calderon-Garciduenas et al. [29]	PM$_{2.5}$, NO$_{2}$, PM$_{10}$, O$_{3}$	Diabetes, obesity, & endocrine diseases
Chang et al. [30], Hwang et al. [31]	O$_{3}$, NO$_{2}$, PM$_{10}$, SO$_{2}$	Eye diseases
Wong et al. [32]	PM$_{2.5}$	Gastrointestinal diseases
Honda et al. [33]	Lead	Hematologic diseases
Pan et al. [34]	PM$_{2.5}$	Liver diseases
Yang et al. [35]	PM	Renal diseases
Cohen et al. [36], Gauderman et al. [37]	O$_{3}$, PM, NO$_{2}$, SO$_{2}$, CO	Respiratory diseases
Lee et al. [38]	Traffic related air pollutants, PM$_{2.5}$, PM$_{10}$, O$_{3}$	Skin diseases

2.1 Associations between pollution exposures and health outcomes

Since our study needs associations between air pollution exposures and health outcomes, we performed an unrestricted literature search and selected twenty-five articles with identified health impacts. A study in China found an upsurge in medical care utilization for allergic rhinitis as PM$_{2.5}$ and NO$_{2}$ levels rose [14]. A study in Canada found that the odds ratio for rheumatic disease increased with exposure to PM$_{2.5}$ [15]. An investigation of more than nine million US Medicare enrollees found that osteoporosis-related bone fractures are more prevalent in areas with higher PM$_{2.5}$ concentrations [16]. Studies of human exposure to PM$_{2.5}$ and PM$_{10}$ show a high risk of lung cancer [17]. A cancer prevention study of 623048 participants, followed for 22 years in America, found a strong association between PM$_{2.5}$ and death from kidney and bladder cancer, and between high levels of NO$_{2}$ and colorectal cancer mortality [18]. Prenatal and early childhood exposure to benzene has been linked with leukaemia [19]. Prenatal exposure to PM$_{2.5}$ may increase the risk of leukaemia and astrocytoma [20]. Studies have found an association between PM$_{2.5}$ and increased mortality from myocardial infarction, stroke, heart failure, and hypertension [21]. A study in US cities found a correlation between stroke and particulate matter concentrations, CO, NO$_{2}$, and SO$_{2}$ levels [22]. Many articles have been published on cognitive function disorders, such as the correlation between autism and air pollution [23]. Prenatal or early childhood exposure to polycyclic aromatic hydrocarbons, diesel exhaust, particulate matter, CO, NO$_{2}$, O$_{3}$, and SO$_{2}$ has been associated with autism [24].
Long-term exposure to PM$_{2.5}$ is associated with higher odds of subclinical stroke and a smaller brain size [25], and short-term exposure may lead to a high risk of hospitalization and mortality in Parkinson’s disease [26]. An association has been found between PM$_{2.5}$ and NO$_{2}$ and the prevalence of diabetes and increased glycosylated haemoglobin levels in adults [27]. Exposure to PM$_{10}$ may lead to an increased risk of metabolic syndrome [28]. Children living in environments with high concentrations of PM$_{2.5}$ and O$_{3}$ are at high risk of developing vitamin D deficiency and altered appetite-regulating peptides [29]. Conjunctivitis is mostly associated with increased concentrations of O$_{3}$, NO$_{2}$, PM$_{10}$, and SO$_{2}$ [30], and dry eyes often result from increased levels of O$_{3}$ and decreased humidity [31]. A study among senior citizens in China found that increased PM$_{2.5}$ concentrations raise the number of hospitalizations due to gastric ulcer [32]. Exposure to lead in air affects the formation of haemoglobin and thus results in anaemia in children as well as elderly people [33]. A study in Taiwan of 23820 people followed up for 16.9 years found an association between PM$_{2.5}$ and an increased risk of hepatocellular cancer, a liver disease [34]. An association between decreased renal function and exposure to particulate matter has also been reported [35]. Ambient exposure to air pollution causes deaths from chronic obstructive pulmonary disease in more than 800000 people and from lung cancer in more than 280000 people [36]. Exposure to pollutants increases the number of cases of reduced lung function in childhood [37].
An association between traffic related air pollutants, PM$_{2.5}$, PM$_{10}$, and O$_{3}$ and cases of eczema, particularly in children, has been found in a study [38]. Table 1 summarizes the identified health impacts of air pollution exposure.

3. Omics data

Omics refers to the collective technologies used to explore the roles, relationships, and actions of the molecules, such as genes, proteins, and small metabolites, that make up the cells of an organism. These technologies [39], denoted by the suffix “omics”, include, among others:

• Genomics: Study of genes and their functions.
• Epigenomics: Study of modifications that persistently alter gene expression without any change in the gene/DNA sequence.
• Proteomics: Study of proteins.
• Metabolomics: Study of the molecules involved in cellular metabolism.

High-dimensional biotechnological platforms generate different types of omics data with a huge number of raw parameters. A newer field, radiomics, generates high-dimensional image data. Sometimes omics data itself needs to be integrated, either because the same type of omics data appears in distinct studies [40] or because distinct types of omics data (multi-omics) are collected for the same study [41]. A collective analysis of multiple datasets with distinct study samples and overlapping omics features, under homogeneous and heterogeneous assumptions, is an example of integrating the same type of omics data [40]. Multi-omics data integration has been applied in many studies using cancer benchmark data [41]. Algorithms based on omics alone cannot explain traits in public health or give high predictive accuracy in epidemiological studies for precision medicine; social and environmental parameters should also be considered. Therefore, it is necessary to integrate omics data with non-omics data such as epidemiological, environmental, and clinical data.

Gene-environment interactions influence most chronic diseases. Versions of a gene are known as gene variants; they differ at particular locations in the DNA sequence that constitutes the gene. The combination of all the variants in the genome is known as the genotype, and the physical expression of the genotype is known as the phenotype. Figure 3 shows the outline of omics technologies. When a health outcome varies by genotype depending on the presence or absence of one or more environmental stimuli, it is said to result from a gene-environment interaction. Chronic ailments such as cancer, diabetes, and Parkinson’s disease are recognized as resulting from interactions between environmental exposures and gene variants.

Table 2

Summary of gene-environment interactions and diseases

Reference	Gene/genotype	Environmental factor	Associated diseases
Manthripragada et al. [44]	PON1	Organophosphates	Parkinson’s disease
Miller et al. [45]	GST, EPHX1	O$_{3}$, PM, NO$_{2}$, polycyclic aromatic hydrocarbons, cigarette smoke	Asthma
Makamure et al. [46]	CD14 CT/TT	NO$_{2}$, NO	Lung function
Kim et al. [47]	CDH13	PM$_{10}$	Lung function
Rava et al. [48]	PLA2G4A, PLA2R1, RELA, PRKD1, PRKCA	LMW irritants	Asthma

Figure 3.
Omics technologies.

3.1 Genes and environmental exposures

Diseases that involve gene-environment interactions develop from the combined input of genetics and environment [42]. Some gene-environment interactions and the associated diseases are discussed here. The PON1 gene encodes an enzyme that metabolizes organophosphates [43]. Individuals who carry a less effective variant of PON1 have an increased risk of developing Parkinson’s disease when exposed to organophosphates [44]. The glutathione S-transferase (GST) gene and the epoxide hydrolase (EPHX1) gene have variants associated with an elevated risk of developing asthma from oxidative stress [45]. Sources of oxidative stress include O$_{3}$, particulate matter (PM), NO$_{2}$, polycyclic aromatic hydrocarbons, and cigarette smoke. A study of children in Durban, KwaZulu-Natal, found a significant association between lung function and NO$_{2}$ and NO among participants carrying the CD14 CT/TT genotypes [46]. A study that modelled the interaction between the CDH13 gene and PM$_{10}$ on lung function at the gene level revealed that a genetic variant of CDH13 modified the relationship between PM$_{10}$ and reduced lung function in Korean men [47]. A study based on three population sets identified interactions between genetic variants of PLA2G4A, PLA2R1, RELA, PRKD1, and PRKCA and occupational exposure to low molecular weight (LMW) irritants for current adult-onset asthma [48]. Up-to-date gene-environment interaction and disease gene data are available in public databases. Table 2 summarizes these gene-environment interactions and diseases.

4. Strategies for integration of non-omics and omics data

The arrival of omics data and technological advancements has led to the evolution of dedicated tools for integrating diverse data types. The multidisciplinary field of precision medicine, which incorporates non-omics and omics data, demands that model accuracy increase when the data are integrated. This adds a layer of complexity to the integration strategy. Only a few published studies have accomplished real integration of omics and non-omics data. Most studies consider only one type of omics data for analysis, and few incorporate more than two types. Based on the challenges posed by the heterogeneous characteristics of non-omics data, the variety of omics data, the volume of omics and non-omics data, and the association between non-omics and omics data, the integration strategies are classified into three types. They are:

• Autonomous integration strategy.
• Decision based integration strategy.
• Collective integration strategy.

All three require variable selection, dimensionality reduction, or regularization because of the high-dimensional nature of the data.

4.1 Autonomous integration strategy

In this integration strategy, independent models for non-omics and omics data are built separately, and the learnt variables are then combined in the final model. In the first part, the non-omics data is modelled using the non-omics variables, established risk factors, or factors analysed in past efforts. In parallel, the omics data is modelled with the omics variables. Both models undergo either variable selection or dimensionality reduction. In the second part, the learnt variables from the non-omics and omics models are combined to build the required model. The accuracy or error rate of the final model is compared with that of the non-omics data model. Figure 4 shows the model for the autonomous integration strategy.

Figure 4.

Autonomous integration strategy.
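The two-part procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any cited study: variable selection is done here by ranking variables on the absolute Pearson correlation with the outcome, and the function names, variable names, and toy values are all hypothetical. Fitting and evaluating the final model on the combined variables is left out.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def select_top_k(block, outcome, k):
    """Assumed selection rule: keep the k variables most correlated with the outcome."""
    ranked = sorted(block, key=lambda name: abs(pearson(block[name], outcome)),
                    reverse=True)
    return ranked[:k]

def autonomous_integration(non_omics, omics, outcome, k_non_omics, k_omics):
    """Select variables in each block separately, then combine the survivors."""
    kept_non_omics = select_top_k(non_omics, outcome, k_non_omics)
    kept_omics = select_top_k(omics, outcome, k_omics)
    combined = {name: non_omics[name] for name in kept_non_omics}
    combined.update({name: omics[name] for name in kept_omics})
    return combined

# Toy data (hypothetical): six individuals, binary outcome.
outcome = [0, 0, 1, 1, 1, 0]
non_omics = {"age": [40, 45, 60, 65, 70, 42], "sex": [0, 1, 0, 1, 0, 1]}
omics = {
    "gene_a": [0.1, 0.2, 0.9, 0.8, 0.95, 0.15],
    "gene_b": [0.5, 0.4, 0.5, 0.6, 0.5, 0.5],
    "gene_c": [1.0, 0.9, 0.2, 0.1, 0.3, 0.8],
}
combined = autonomous_integration(non_omics, omics, outcome, 1, 2)
print(sorted(combined))
```

The combined dictionary would then feed the final model, whose accuracy is compared with that of the non-omics model alone.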

Figure 5.

Performance of Bayesian model (Autonomous).

A probabilistic model integrating clinical and microarray data used the concept of the Markov blanket to perform feature selection for predicting the prognosis of breast cancer [49]. The authors used the van’t Veer data [50], which is publicly available online or from ITTACA [51]. The performance measure, AUC, of the best Bayesian network model with the autonomous integration strategy is 0.845. A comparison of the average AUC over 100 iterations for the non-omics, omics, and integrated models is shown in Fig. 5. A comprehensive integration with a kernel fusion Cox model has been used to predict the survival time of cancer patients [52]. The study used multi-omics data such as gene expression, copy number alterations (CNA), SNPs, methylation, and microRNA (miRNA), of which gene expression is high dimensional and the most informative in terms of prognostic utility. Clinical pathological variables such as specific tumour scales, the Lauren classification in stomach adenocarcinoma, cancer subtype definition, the PAM50 signature for breast cancer, and MammaPrint are the non-omics data used in the study. The performance measure, C index, was the same for ovarian cancer and HNSC. The comparison of the C index for non-omics, omics, and integrated data on the ovarian and HNSC datasets is shown in Fig. 6. A deep learning framework based on autoencoders has been used to predict survival time in liver cancer [53]. The study uses RNA sequencing data for the LIRI-JP cohort and microarray gene expression data for the NCI cohort as omics data. Stage, grade, race, gender, age, and risk factor are the non-omics data. The C index is 0.74 for LIRI-JP and 0.65 for NCI. The comparison of the C index for non-omics, omics, and integrated data on the LIRI-JP and NCI datasets is shown in Fig. 7.

Figure 6.

Performance in ovarian and HNSC dataset.

Figure 7.

Performance in LIRI-JP & NCI dataset.
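The C index used above as a performance measure for survival models is commonly computed as Harrell's concordance index. The following sketch assumes that definition (the cited studies may use a variant); the function name and the toy values are hypothetical. A pair of subjects is treated as comparable when the subject with the shorter time had an observed event, and the pair is concordant when that subject also has the higher predicted risk.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style C index: fraction of comparable pairs ranked correctly.

    times:       observed follow-up times
    events:      1 if the event was observed, 0 if censored
    risk_scores: predicted risk (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable only if subject i had an observed event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example (hypothetical values): risk decreases as survival time increases.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.5, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.5, 0.7, 0.9])
print(perfect, reversed_order)
```

A value of 0.5 corresponds to random ranking, which gives a reference point for reading the reported values.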

4.2 Decision based integration strategy

In this integration strategy, an independent model for non-omics data is built first, and omics variables are then added to it. The omics variables to be added depend on the covariates of the non-omics model. Figure 8 shows the model for the decision based integration strategy.

Figure 8.

Decision based integration strategy.

The variables can be selected with a univariate procedure, in which each omics variable is tested by adding it to the non-omics model and is kept or discarded according to accuracy on a trial-and-error basis. Other methods are partial dimension reduction and least squares-partial least squares (LS-PLS). Cancer of the central nervous system was analysed using LS-PLS logistic regression to predict the treatment outcome as a categorical variable in two datasets [54]. The study used gene expression as omics data in dataset1 and somatic CNA in dataset2. Sex, age, chemo CX, and chemo VP are the non-omics data for dataset1, whereas grade, tumour stage, HER2 status, tumour size, and progesterone receptor status are the non-omics data for dataset2. The performance measure, AUC, was 0.82–0.90 and 0.93, respectively. A comparison of the AUC for non-omics, omics, and integrated data on dataset1 and dataset2 is shown in Fig. 9. A Bayesian network model based on decision integration achieved an AUC of 0.810 [49]. A comparison of the average AUC over 100 iterations for the non-omics, omics, and integrated models is shown in Fig. 10.

Figure 9.

Performance in dataset1 & dataset2.

Figure 10.

Performance of Bayesian model (Decision based).
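The univariate trial-and-error procedure of this subsection can be illustrated with a short sketch. It is not the implementation of any cited study: the scoring model is a toy nearest-centroid classifier evaluated by leave-one-out accuracy, chosen only so that the example is self-contained, and all function names, variable names, and data values are hypothetical. The sketch also assumes every class keeps at least one member when a single individual is held out.

```python
def loo_accuracy(columns, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (toy stand-in model)."""
    n = len(labels)
    correct = 0
    for held in range(n):
        centroids = {}
        for cls in set(labels):
            rows = [i for i in range(n) if i != held and labels[i] == cls]
            centroids[cls] = [sum(col[i] for i in rows) / len(rows) for col in columns]

        def dist(cls):
            # Squared Euclidean distance from the held-out row to a class centroid.
            return sum((col[held] - c) ** 2 for col, c in zip(columns, centroids[cls]))

        predicted = min(sorted(centroids), key=dist)
        correct += predicted == labels[held]
    return correct / n

def decision_based_integration(non_omics, omics, labels):
    """Start from the non-omics model; add an omics variable only if accuracy improves."""
    selected = dict(non_omics)
    best = loo_accuracy(list(selected.values()), labels)
    for name, values in omics.items():
        trial = loo_accuracy(list(selected.values()) + [values], labels)
        if trial > best:
            selected[name] = values
            best = trial
    return selected, best

# Toy data (hypothetical): three individuals per class.
labels = [0, 0, 0, 1, 1, 1]
non_omics = {"age_scaled": [0.2, 0.4, 0.6, 0.5, 0.7, 0.9]}
omics = {
    "gene_a": [0.1, 0.0, 0.2, 0.9, 1.0, 0.8],
    "gene_noise": [0.5, 0.1, 0.9, 0.4, 0.6, 0.2],
}
baseline = loo_accuracy(list(non_omics.values()), labels)
selected, best = decision_based_integration(non_omics, omics, labels)
print(baseline, list(selected), best)
```

In this toy setting the informative variable is retained and the uninformative one is rejected, because the second cannot improve on an accuracy that is already perfect.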

4.3 Collective integration strategy

In the collective integration strategy, non-omics and omics data are put together, treated as one single dataset, and analysed in a supervised or unsupervised manner. In this way, the final model can capture arbitrary types of relationships between the omics and non-omics variables. Depending on the objective of the analytical model, the collective integration strategy can be of two types [55]: multi-staged and meta-dimensional. In the multi-staged approach, the associations between distinct data types are analysed independently, whereas in the meta-dimensional approach, the different data types are analysed simultaneously. The meta-dimensional approach is further divided into three categories, namely concatenation, transformation, and model integration. Figure 11 shows the model of the collective integration strategy.

Figure 11.

Collective integration strategy.
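Of the three meta-dimensional categories, concatenation is the simplest to illustrate: the variables of every block are placed side by side in one matrix before any model is fitted. The sketch below is a minimal illustration under the assumption that each variable is standardized first so that blocks on different scales are comparable; the standardization step, the function names, and the toy values are not taken from the cited studies.

```python
import math

def standardize(column):
    """Centre a column and scale it to unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [0.0 if sd == 0 else (v - mean) / sd for v in column]

def concatenate_blocks(*blocks):
    """Merge several {name: column} blocks into one (names, rows) dataset."""
    names, columns = [], []
    for block in blocks:
        for name, column in block.items():
            names.append(name)
            columns.append(standardize(column))
    rows = [list(row) for row in zip(*columns)]  # one row per individual
    return names, rows

# Toy blocks (hypothetical): three individuals.
non_omics = {"age": [50, 60, 70]}
omics = {"snp_1": [0, 1, 2], "expr_1": [2.0, 2.0, 2.0]}
names, rows = concatenate_blocks(non_omics, omics)
print(names)
print(rows)
```

The resulting single matrix can then be passed to any supervised or unsupervised method, typically with regularization because the omics block contributes most of the columns.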

The collective integration strategy can be used for large-scale omics and non-omics data because it can capture the complexity of the data and accommodate the correlation structure between the omics and non-omics data. One study evaluated survival time in patients with glioblastoma, a cancer of the central nervous system, using multi-omics data such as high-dimensional gene expression, CNA, SNPs, and methylation, together with non-omics data such as sex and use of temozolomide [56]. The model was built with a multi layered Bayesian regression framework. It adopted transformation-based meta-dimensional analysis, which combines multiple datasets after transforming every data type into an intermediate form such as a graph or a kernel matrix. The same model was used in another study to evaluate survival time and censoring in breast cancer patients [57]. That study also assessed interactions between non-omics and omics factors with two types of omics data, namely gene expression and CNA, and used non-omics data such as age, cancer subtype, histological type, Nottingham prognostic index (NPI: tumour size, grade, and nodal involvement), and treatment. The AUC of the two models is 0.72 and 0.74–0.81, respectively. A comparison of the performance measure for both models is shown in Fig. 12. Another model predicted outcomes for bladder cancer by transforming each occurrence into a binary result, accounting for censoring and time but ignoring sub-phenotypes [58]. It uses only one type of omics data, SNPs, and a few non-omics variables such as area, gender, number of tumours, TSG, tumour size, and treatment. It is a concatenation-based integration implemented with a Bayesian LASSO coupled threshold model. The calculated AUC is 0.61. A comparison of the AUC for non-omics, omics, and integrated data on the SBC/EPICURO dataset is shown in Fig. 13.

Figure 12.

Performance of TCGA & METABRIC dataset.

Figure 13.

Performance of Bayesian LASSO coupled threshold model.

4.4 Mathematical background for the integration strategy

4.4.1 Bayesian network model

A Bayesian network model consists of two parts: a dependency structure and local probability models [59, 60]. The dependency structure determines how the variables are related to one another. Each variable depends on a possibly empty set of variables, called its parents. Eq. (1) shows this dependency:

$\displaystyle p(x_{1},\ldots,x_{n})=\prod_{i=1}^{n}p(x_{i}|P_{a}(x_{i}))$ (1)

where $P_{a}(x_{i})$ are the parents of $x_{i}$. An important notion in Bayesian networks is the Markov blanket of a variable. The Markov blanket of a variable $A$ is composed of its parents, its children, and the other parents of its children. In [49], structure learning was done with K2, a greedy search algorithm [61], combined with the Bayesian Dirichlet scoring metric, and parameter learning used a uniform Dirichlet prior and the corresponding posterior, as given in Eqs (2)–(4) respectively.
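Before turning to structure and parameter learning, the factorization in Eq. (1) and the Markov blanket can be made concrete with a small sketch. The three-node network, its variable names, and its probabilities are hypothetical and chosen only for illustration; all variables are binary.

```python
from itertools import product

# Hypothetical dependency structure: pollution -> disease <- gene.
parents = {"pollution": [], "gene": [], "disease": ["pollution", "gene"]}

# Local probability models: P(node = 1 | parent values). Toy numbers only.
p_one = {
    "pollution": {(): 0.3},
    "gene": {(): 0.1},
    "disease": {(0, 0): 0.01, (0, 1): 0.10, (1, 0): 0.05, (1, 1): 0.40},
}

def joint_probability(assignment):
    """Eq. (1): product over variables of p(x_i | parents of x_i)."""
    prob = 1.0
    for node, pa in parents.items():
        key = tuple(assignment[p] for p in pa)
        p1 = p_one[node][key]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob

def markov_blanket(node):
    """Parents, children, and the other parents of the children."""
    children = [c for c, pa in parents.items() if node in pa]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

p_all = joint_probability({"pollution": 1, "gene": 1, "disease": 1})
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)
print(p_all, total, markov_blanket("pollution"))
```

In this toy network the blanket of pollution contains gene even though the two are not directly connected, because both are parents of disease. This is why a Markov blanket can serve as a feature selection device.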

$\displaystyle p(S|D)\propto p(S)\prod_{i=1}^{n}\prod_{j=1}^{q_{i}}\left[\frac{\Gamma(N^{\prime}_{ij})}{\Gamma(N^{\prime}_{ij}+N_{ij})}\prod_{k=1}^{r_{i}}\frac{\Gamma(N^{\prime}_{ijk}+N_{ijk})}{\Gamma(N^{\prime}_{ijk})}\right]$ (2)

where $N_{ijk}$ is the number of items in the dataset $D$ having variable $i$ in state $k$ associated with the $j^{\text{th}}$ instantiation of its parents in the current structure $S$, $n$ is the total number of variables, $q_{i}$ is the number of instantiations of the parents of variable $i$, and $r_{i}$ is the number of states of variable $i$. $N_{ij}$ is calculated by summing over all states: $N_{ij}=\sum_{k=1}^{r_{i}}N_{ijk}$. The primed quantities are the parameters of the Dirichlet prior in Eq. (3). The prior probability of the structure is $p(S)$.

$\displaystyle p({\theta}_{ij}|S)=Dir({\theta}_{ij}|N^{\prime}_{ij1},\ldots,N^{\prime}_{ijk},\ldots,N^{\prime}_{ijr_{i}})$ (3)

where ${\theta}_{ij}$ is a parameter set in which $i$ refers to the variable and $j$ to the $j^{\text{th}}$ instantiation of the parents in the current structure. $Dir$ denotes the Dirichlet distribution with $(N^{\prime}_{ij1},\ldots,N^{\prime}_{ijk},\ldots,N^{\prime}_{ijr_{i}})$ as parameters.

$\displaystyle p({\theta}_{ij}|S,D)=Dir({\theta}_{ij}|N^{\prime}_{ij1}+N_{ij1},\ldots,N^{\prime}_{ijk}+N_{ijk},\ldots,N^{\prime}_{ijr_{i}}+N_{ijr_{i}})$ (4)
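The score in Eq. (2) and the posterior update in Eq. (4) can be evaluated numerically for a single variable. The sketch below works on the log scale with the standard library function `math.lgamma` to avoid overflow. It assumes, as is usual for this metric, that the prior total for each parent instantiation is the sum of the prior parameters over states; that assumption, the function names, and the toy counts are not taken from the source.

```python
import math

def log_bd_family_score(counts, prior):
    """Log of the bracketed product in Eq. (2) for one variable i.

    counts[j][k]: observed count for state k under parent instantiation j.
    prior[j][k]:  Dirichlet prior parameter for the same cell.
    Assumption: the prior total for instantiation j is the sum over k.
    """
    total = 0.0
    for n_j, a_j in zip(counts, prior):
        total += math.lgamma(sum(a_j)) - math.lgamma(sum(a_j) + sum(n_j))
        for n_jk, a_jk in zip(n_j, a_j):
            total += math.lgamma(a_jk + n_jk) - math.lgamma(a_jk)
    return total

def dirichlet_posterior(counts_j, prior_j):
    """Eq. (4): posterior Dirichlet parameters are prior plus observed counts."""
    return [a + n for a, n in zip(prior_j, counts_j)]

# Toy example: binary variable, no parents (one instantiation), uniform prior.
counts = [[3, 1]]
prior = [[1, 1]]
score = log_bd_family_score(counts, prior)
posterior = dirichlet_posterior(counts[0], prior[0])
posterior_mean = [a / sum(posterior) for a in posterior]
print(math.exp(score), posterior, posterior_mean)
```

With a uniform prior of ones, the score reduces to the K2 form, which for these counts is 1!/5! times 3! times 1!, that is 0.05.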

4.4.2 The kernel fusion Cox model

The kernel fusion Cox model uses an omics similarity matrix as the kernel [52]. Assume $M$ kinds of omics profiles. For the $m^{\text{th}}$ omics profile, the $P_{m}$ biomarkers for $n$ subjects are organized into an $n\times P_{m}$ matrix $Z_{m}$, with $Z^{\prime}_{m}$ denoting the transpose of $Z_{m}$ and $Z_{mj}$ its $j^{\text{th}}$ column. Eq. (5) defines the linear kernel, an $n\times n$ matrix, as:

$\displaystyle K_{m}=\frac{Z_{m}Z^{\prime}_{m}}{P_{m}}$ (5)
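The linear kernel of Eq. (5) is the matrix of inner products between subjects, divided by the number of biomarkers. A minimal sketch follows; the function name and the toy matrix are hypothetical.

```python
def linear_kernel(Z):
    """Eq. (5): K = Z Z' / P, where Z has one row per subject and P columns."""
    p = len(Z[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / p for row_j in Z]
        for row_i in Z
    ]

# Toy omics profile: n = 3 subjects, P = 4 biomarkers.
Z = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 2],
]
K = linear_kernel(Z)
for row in K:
    print(row)
```

Each entry measures how similar two subjects are in this omics profile, and the matrix is symmetric by construction.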

The Cox proportional hazard model and the prognostic score are given in Eqs (6) and (7):

$\displaystyle\lambda_{i}(t)=\lambda_{0}(t)\exp({\eta}_{i}),$ (6)

$\displaystyle\eta_{i}=b_{i}+g_{i},\quad i=1,2,\ldots,I$ (7)

where $\lambda_{i}(t)$ is the hazard function for the $i^{\text{th}}$ subject, $\lambda_{0}(t)$ is the baseline hazard function, and ${\eta}_{i}$ is the complete prognostic score, $b_{i}$ is the clinical prognostic score, and $g_{i}$ is the omics prognostic score. The model has been built using the Eq. (8).

$\displaystyle K=\begin{bmatrix}K_{VV}&K_{VT}\\ K_{TV}&K_{TT}\end{bmatrix}$ (8)

where $K_{VV}$ is the variance matrix for the validation set, $K_{VT}$ and $K_{TV}$ are the covariance matrices between the validation set and the training set, and $K_{TT}$ is the variance matrix for the training set.
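Two pieces of this subsection lend themselves to a short illustration: the hazard of Eqs (6) and (7) relative to the baseline, and the block partition of the kernel in Eq. (8). The sketch below only shows these two operations; it does not fit the model, and the function names, index sets, and toy values are hypothetical.

```python
import math

def prognostic_score(b_i, g_i):
    """Eq. (7): complete score = clinical score + omics score."""
    return b_i + g_i

def relative_hazard(eta_i):
    """Eq. (6) divided by the baseline hazard: lambda_i(t) / lambda_0(t)."""
    return math.exp(eta_i)

def partition_kernel(K, validation_idx, training_idx):
    """Eq. (8): split a kernel matrix into validation/training blocks."""
    def block(rows, cols):
        return [[K[r][c] for c in cols] for r in rows]
    return {
        "VV": block(validation_idx, validation_idx),
        "VT": block(validation_idx, training_idx),
        "TV": block(training_idx, validation_idx),
        "TT": block(training_idx, training_idx),
    }

# Toy kernel for three subjects: subject 0 is validation, 1 and 2 are training.
K = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
blocks = partition_kernel(K, [0], [1, 2])
neutral = relative_hazard(prognostic_score(0.2, -0.2))
higher = relative_hazard(prognostic_score(0.5, 0.3))
print(blocks["VT"], neutral, higher)
```

A complete score of zero leaves the hazard at the baseline, and a positive score raises it multiplicatively, which is the proportional hazards property.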

4.4.3 Deep learning: Autoencoder model

An autoencoder is an unsupervised, feed-forward, non-recurrent neural network. Given an input layer $x=(x_{1},\ldots,x_{n})$, an autoencoder has been used to reconstruct $x$ as the output $x^{\prime}$ in a study [53]. Eq. (9) is the autoencoder function for a given layer $i$, with $\tanh$ as the activation function between the input layer $x$ and the output layer $y$.

$\displaystyle y=f_{i}(x)=\tanh(W_{i}\cdot x+b_{i})$ (9)

where $x$ and $y$ are two vectors of size $d$ and $p$, respectively, $W_{i}$ is the weight matrix of size $p\times d$, and $b_{i}$ is an intercept vector of size $p$. Eq. (10) gives the autoencoder with $k$ layers.

$\displaystyle x^{\prime}=F_{1\to k}(x)=f_{1}\circ\ldots\circ f_{k-1}\circ f_{k}(x)$ (10)

where $f_{k-1}\circ f_{k}(x)=f_{k-1}(f_{k}(x))$ is the composition of $f_{k-1}$ with $f_{k}$.
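The layer function of Eq. (9) and the composition of Eq. (10) can be written out directly. The sketch below performs only the forward pass with fixed toy weights; training the weights to minimize reconstruction error is not shown, and the function names and numbers are hypothetical. Following Eq. (10), the rightmost layer is applied first.

```python
import math

def layer(W, b):
    """Eq. (9): x -> tanh(W . x + b), with W given as a list of rows."""
    def f(x):
        return [
            math.tanh(sum(w * v for w, v in zip(row, x)) + b_j)
            for row, b_j in zip(W, b)
        ]
    return f

def compose(*layers):
    """Eq. (10): compose(f1, ..., fk)(x) = f1(...f_{k-1}(f_k(x))...)."""
    def F(x):
        for f in reversed(layers):  # the last layer listed is applied first
            x = f(x)
        return x
    return F

# Toy two-layer autoencoder: f2 encodes 3 -> 2, f1 decodes 2 -> 3.
f2 = layer([[0.5, -0.5, 0.0], [0.0, 0.5, 0.5]], [0.0, 0.0])
f1 = layer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
autoencoder = compose(f1, f2)

x = [1.0, 1.0, 1.0]
code = f2(x)                 # low-dimensional representation
x_rec = autoencoder(x)       # reconstruction of the input
print(code, x_rec)
```

In applications such as the one described above, the low-dimensional code is typically the quantity of interest, since it serves as a compact set of features for the downstream model.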

4.4.4 LS-PLS regression model

A blend of least squares (LS) and partial least squares (PLS) was used in a study [54]. The vector $A$, of length $n$, contains continuous values, where $n$ is the number of observations. $B$ is an omics data matrix of size $n\times p$. $Z$ is the non-omics data matrix of size $n\times q$. $W$ is the weight matrix and $N$ is the number of selected components. LS-PLS proceeds in iterative steps: first, use ordinary least squares on $Z$ to predict $B$; then calculate the new residuals of the matrix $B$ on $Z$; repeat until convergence. This creates a new matrix, which is the projection of the matrix $A$ onto orthogonalized variants of $Z$. Standard PLS is then applied to the new matrix instead of $A$.
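The least squares step described above, in which the omics matrix is predicted from the non-omics data and replaced by its residuals, can be sketched as follows. This covers only that residualization step, not the full LS-PLS procedure of [54], and it assumes a single non-omics covariate plus an intercept ($q=1$) so that the fit has a closed form; the function names and data are hypothetical.

```python
def ols_residuals(z, b):
    """Residuals of a simple OLS fit of b on one covariate z plus an intercept."""
    n = len(z)
    z_mean, b_mean = sum(z) / n, sum(b) / n
    szz = sum((zi - z_mean) ** 2 for zi in z)
    szb = sum((zi - z_mean) * (bi - b_mean) for zi, bi in zip(z, b))
    slope = szb / szz
    intercept = b_mean - slope * z_mean
    return [bi - (intercept + slope * zi) for zi, bi in zip(z, b)]


def residualize(B, z):
    """Replace every column of the n x p matrix B by its residuals on z."""
    cols = [ols_residuals(z, [row[j] for row in B]) for j in range(len(B[0]))]
    return [list(row) for row in zip(*cols)]


# Toy data: 4 observations, one non-omics covariate, two omics features.
z = [1.0, 2.0, 3.0, 4.0]
B = [[2.0, 1.0], [4.1, 0.5], [5.9, 2.0], [8.2, 1.5]]
B_res = residualize(B, z)
```

The residual columns are orthogonal to the non-omics covariate, so a PLS fit on them captures only the variation in the omics data that the non-omics data do not already explain.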

4.4.5 Multi layered Bayesian framework

In the multi layered Bayesian framework [56], the logarithm of survival time was modelled as a function of the covariates of the baseline model and the omics data. With two omics data types, the regression takes the form of Eq. (11):

$\displaystyle y_{i}=\mu+\sum_{j}{x_{ij}}{\alpha}_{j}+\sum_{k}{z_{ik}}{\beta}_{k}+\sum_{l}{w_{il}}{\gamma}_{l}+{\varepsilon}_{i}$ (11)

where $y_{i}$ is the logarithm of survival time, $\mu$ is an intercept, and $\sum_{j}{x_{ij}}{\alpha}_{j}$ represents the linear regression on the covariates of the baseline model. $\sum_{k}{z_{ik}}{\beta}_{k}$ and $\sum_{l}w_{il}{\gamma}_{l}$ represent regression terms for two distinct omics, where $z_{ik}$ and $w_{il}$ are features in each of the omics sets, and ${\varepsilon}_{i}$ is an error term assumed to be independent and identically distributed as normal with zero mean and variance ${\sigma}^{2}_{\varepsilon}$.

4.4.6 Bayesian likelihood

For subjects with an observed survival time, the conditional distribution of the data given the parameters and covariates is normal with mean ${\eta}_{i}$ and variance ${\sigma}^{2}_{\varepsilon}$, as shown in Eq. (13). The mean of the distribution is given in Eq. (12):

$\displaystyle\eta_{i}=\mu+\sum_{j}{x_{ij}}{\alpha}_{j}+\sum_{k}{z_{ik}}{\beta}_{k}+\sum_{l}{w_{il}}{\gamma}_{l}$ (12)

$\displaystyle p(y_{i}|{\eta}_{i},{\sigma}^{2}_{\varepsilon})=N(y_{i}|{\eta}_{i},{\sigma}^{2}_{\varepsilon})$ (13)

Eq. (14) shows the likelihood function for right censored observations, whose survival time is unknown.

$\displaystyle p(y_{i}|{\eta}_{i},{\sigma}^{2}_{\varepsilon})=\phi\left(\frac{y_{i}-{\eta}_{i}}{{\sigma}_{\varepsilon}}\right)$ (14)

where $y_{i}$ is the follow up time and $\phi(\cdot)$ is the cumulative distribution function of a standard normal random variable. The collective Bayesian likelihood is shown in Eq. (15).

$\displaystyle p(y|\alpha,\beta,\gamma,\sigma^{2}_{\varepsilon})=\prod^{n}_{i=1}N{(y_{i}|{\eta}_{i},{\sigma}^{2}_{\varepsilon})}^{1-c_{i}}\phi\left(\frac{y_{i}-{\eta}_{i}}{{\sigma}_{\varepsilon}}\right)^{c_{i}}$ (15)

where $c_{i}$ is an indicator variable taking a value of 1 for right censored observations and 0 otherwise.
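The collective likelihood above can be evaluated numerically on the log scale. The sketch below is a minimal illustration, not the implementation used in [56]: the function names and toy values are hypothetical, and the censored term follows the form printed above (many survival analyses instead use the survival function, one minus the cumulative distribution function, for right censored observations).

```python
import math


def normal_logpdf(y, mean, sigma):
    """Log density of N(y | mean, sigma^2)."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (y - mean) ** 2 / (2.0 * sigma ** 2))


def std_normal_cdf(u):
    """Cumulative distribution function of a standard normal variable."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def log_likelihood(y, eta, censored, sigma):
    """Sum the log contributions: normal density if c_i = 0, CDF term if c_i = 1."""
    total = 0.0
    for yi, ei, ci in zip(y, eta, censored):
        if ci == 1:
            total += math.log(std_normal_cdf((yi - ei) / sigma))
        else:
            total += normal_logpdf(yi, ei, sigma)
    return total


# Toy example: three subjects, the last one right censored.
ll = log_likelihood(y=[2.1, 1.4, 3.0], eta=[2.0, 1.5, 2.5],
                    censored=[0, 0, 1], sigma=0.5)
```

Working with the sum of logarithms rather than the product of probabilities avoids numerical underflow when $n$ is large.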

4.4.7 Bayesian LASSO coupled threshold model

A sequential threshold model was used in a study [58] to analyse time to event data for predicting outcomes of bladder cancer. The model assumes that, for an observation of a patient to be present in a given period, the patient must have survived through all the previous time periods. The probability of not presenting the event of interest until interval $k$, conditional on the $k$th interval having been reached, is given in Eq. (16).

$\displaystyle P(y_{i}=k\mid y_{i}\geqslant k-1,\gamma,\beta)=\phi\left(\frac{{\gamma}_{i}-X'\beta}{{\sigma}_{\varepsilon}}\right)$ (16)

where $\gamma$ corresponds to the unordered cutoff points for each time interval, $X$ is the incidence matrix of the effects $\beta$, and ${\sigma}^{2}_{\varepsilon}$ is the residual variance.

5. Machine learning techniques for precision medicine forecast

Machine learning algorithms can achieve high accuracy in risk prediction owing to their ability to handle multidimensional data. Supervised machine learning algorithms with regression based methods use polynomial, parametric, or nonparametric procedures to find relationships within multidimensional data. The application of machine learning techniques in precision medicine tailors medical interventions for a community, or for individuals living in a community, based on sociodemographic characteristics, environmental background, and clinical records. These non-omics data, when integrated with omics data, have considerable power to assess the relationship between environmental exposures and related health effects. With advances in omics technology, precision medicine enhances disease prediction, prognosis, and prevention. Thus, an early warning system or risk prediction is possible through a precision medicine forecast system. Here we provide an overview of the performance of machine learning techniques for precision medicine forecasts in heterogeneous and high dimensional data environments, with or without omics data integration.

Machine learning methods are usually beneficial for large, often noisy, heterogeneous, and high dimensional datasets. These methods include Bayesian inference networks (BN), neural networks, Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR). Fatty liver disease is a prevalent clinical problem associated with considerable morbidity and mortality. Machine learning models such as RF, Naïve Bayes (NB), Artificial Neural Networks (ANN), and LR were developed in Taiwan to predict fatty liver disease, which can help physicians identify high risk patients and make novel diagnoses [62]. The data were collected from the New Taipei City Municipal Hospital, Banqiao branch, under a liver care project, and included demographic and clinical data of patients from electronic health records. A total of 577 patients were included, of whom 377 had fatty liver disease. With tenfold cross validation, the AUROC of RF, NB, ANN, and LR was 0.925, 0.888, 0.895, and 0.854, respectively. Among the models, RF performed best, with an accuracy of 87.48%, higher than the other machine learning models. The comparison of the machine learning models in terms of AUROC and accuracy is shown in Figs 14 and 15, respectively. The study suggests that using an RF model in the clinical setting will help physicians stratify fatty liver patients for immediate prevention, surveillance, early treatment, and management.
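The evaluation protocol described above, AUROC under tenfold cross validation, can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up labels and scores; it does not reproduce the data or the models of [62], and the contiguous fold assignment is an assumption (shuffled or stratified folds are common alternatives).

```python
def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Made-up predictions for four patients (1 = fatty liver, 0 = not).
example_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])

# Fold sizes for a cohort of 577 patients under tenfold cross validation.
folds = k_fold_indices(577, k=10)
```

Each fold is held out once for testing while the remaining nine are used for training, and the reported AUROC is typically the average over the ten held-out folds.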

Figure 14.

Comparison of ML models in terms of AUROC.

Figure 15.

Comparison of ML models in terms of accuracy.

Figure 16.

Comparison of ML models.

Figure 17.

Comparison of ML models.

Another study in Taiwan, with 593 fatty liver and 401 non-fatty liver patients, applied RF, SVM, ANN, and logistic regression (LR) [63]. On these data, logistic regression performed best, obtaining 70.70% accuracy with tenfold cross validation. Performance measures such as the area under the ROC curve (AUC), accuracy (AC), sensitivity (SN), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV) obtained for the models are shown in Fig. 16.

A similar study in China, with 2522 fatty liver and 7986 non-fatty liver patients, applied k-Nearest Neighbour (kNN), SVM, logistic regression (LR), NB, BN, C4.5, AdaBoost, bagging, RF, Hidden Naïve Bayes (HNB), and aggregating one-dependence estimators (AODE), and found that logistic regression obtained the best accuracy, 82.92%, with tenfold cross validation [64]. Performance measures such as accuracy, specificity, precision, sensitivity or recall, and F-measure compared for the models are shown in Fig. 17. The study concludes that a novel machine learning technique may drive improved clinical decisions, with a higher early diagnosis rate and fewer end stage complications.
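The performance measures named in these studies all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name and the counts are hypothetical and are not taken from [63] or [64].

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # precision, positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_measure": 2.0 * ppv * sensitivity / (ppv + sensitivity),
    }


# Made-up counts: 100 diseased and 100 non-diseased patients.
m = classification_metrics(tp=80, fp=30, tn=70, fn=20)
```

Accuracy alone can be misleading when the classes are imbalanced, as in the second cohort, which is why sensitivity, specificity, and the F-measure are reported alongside it.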

Figure 18.

Performance comparison of ML models.

Figure 19.

Sensitivity and specificity of ML models.

Figure 20.

Exposure factor and diseases.

Regression based models measure only baseline risk, rate, or hazard, and consistently produce a linear combination of covariates using an algorithm that overestimates the likelihood of the results. Bayesian networks are compact graphical representations of joint probability distributions that can be used for causal reasoning and risk prediction analysis. A study [65] described challenges associated with traditional risk prediction methods, together with the construction, application, and advantages of Bayesian networks in risk prediction, taking cancers and heart disease as examples. Machine learning approaches for constructing Bayesian networks include constraint based and score based learning. Constraint based approaches use conditional independencies in the data to extract the model structure. A score based approach searches for the model that maximizes the likelihood of the model given the data. The advantages of Bayesian networks include individual risk prediction and decision making under uncertainty. The main drawback is the inefficient handling of missing data, which can affect inference and learning of the causal structure.
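To make the idea of a Bayesian network as a factored joint distribution concrete, the sketch below builds a three-node network and answers risk queries by exact enumeration. Everything in it is hypothetical: the structure (smoking and pollution exposure as independent parents of disease), the variable names, and all probabilities are invented for illustration and are not taken from [65].

```python
from itertools import product

# Hypothetical network: S (smoking) -> D (disease) <- E (exposure).
P_S = {1: 0.3, 0: 0.7}                    # P(S)
P_E = {1: 0.4, 0: 0.6}                    # P(E)
P_D1 = {(1, 1): 0.5, (1, 0): 0.3,         # P(D = 1 | S, E)
        (0, 1): 0.2, (0, 0): 0.05}


def joint(s, e, d):
    """Joint probability P(S, E, D) as the product of the local tables."""
    p_d1 = P_D1[(s, e)]
    return P_S[s] * P_E[e] * (p_d1 if d == 1 else 1.0 - p_d1)


def query(target, evidence):
    """P(target | evidence) by summing the joint over consistent assignments."""
    num = den = 0.0
    for s, e, d in product([0, 1], repeat=3):
        world = {"S": s, "E": e, "D": d}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, e, d)
            den += p
            if all(world[k] == v for k, v in target.items()):
                num += p
    return num / den


risk_if_smoker = query({"D": 1}, {"S": 1})      # individual risk prediction
smoker_if_disease = query({"S": 1}, {"D": 1})   # reasoning from effect to cause
```

The same joint distribution answers both predictive and diagnostic queries, which is what distinguishes this approach from a regression model fitted for a single outcome. Enumeration is exponential in the number of variables, so realistic networks rely on more efficient inference algorithms.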

Fuster-Parra et al. [66] carried out an in depth study of cardiovascular disease (CVD) in an epidemiological system analysis with Bayesian network modelling. The analysis identified relationships among thirteen significant epidemiological features of the heart age domain in order to analyse cardiovascular lost years (CVLY), Cardiovascular Risk Score (CVRS), and Metabolic Syndrome (MetS). The Bayesian analysis produced interesting results. Among them, CVLY and MetS were strongly affected by smoking and by physical activity in men. The study compared the performance of Bayesian Network (BN), NB, Tree Augmented Naïve Bayes (TANB), Multi-Layer Perceptron (MLP), and Tree C4.5 algorithms using accuracy, sensitivity, and specificity for the three epidemiological features CVLY, CVRS, and MetS. The compared results are shown in Figs 18 and 19. The ability of Bayesian networks to predict new sequences of events when hypothetical information is introduced makes the model a useful machine learning tool in epidemiological studies.

6. Results and discussion

Associations between pollution exposures and diseases are discussed in Section 2 of this article. Although air pollution is detrimental to the lungs and airways, it may also contribute to cardiovascular disease, stroke, allergic sensitization and rhinitis, bone ailments, cognitive and neurologic ailments, diabetes, eye problems, hematologic disease, liver ailments, renal ailments, and skin ailments. The exposure factors and the diseases in which they appear are summarized in Fig. 20.

Figure 21.

Integration strategies and performance of models.

The data integration strategies and the performance of the models are discussed in Section 4. The papers reviewed in that section focused on determining the predictive ability of the integrated model. Most of the papers integrated only a single type of omics data. Only a few non-omics variables, both categorical and continuous, were integrated in each study. Large scale non-omics data were not incorporated in those models. The performance of the models from the reviewed articles is summarized in Fig. 21.

From the reviewed papers, it has been learned that although the autonomous integration strategy was the simplest, a decision based integration with the LS-PLS classification model showed the highest performance among the models. The collective integration strategy is suitable for integrating large scale non-omics and omics data since it must consider the correlation structure between those data types. For risk prediction and association testing, the decision based strategy is a better fit since it follows a multistage method. The autonomous integration strategy is suitable for low dimensional data.

7. Limitations and challenges

7.1 Non-omics data

Epidemiological data are produced from different sources with different methods. Because of this heterogeneity, the data must be integrated in a way that preserves uniformity and consistency. Another issue is reporting bias, caused by the complex elaboration of the data and the previous knowledge of the evaluator about the data.

7.2 Omics data

A common issue with omics data is the existence of many covariates. Other challenges include high correlation among factors, the distinct nature of the information, and its hierarchical dependence. The heterogeneous characteristics of omics data types, the lack of standards for homogeneous data from distinct omics platforms, and the high dimensionality of omics data make the data analysis complex.

7.3 Integration of omics and non-omics data

Even though autonomous integration strategy is an uncomplicated method, some critical issues arise. They are:

•
Predictive competence of omics data will be inflated since the same feature is passed down in the feature selection process.
•
The assumption of zero interaction or correlation between different types of data.

The autonomous integration strategy is the simplest and most popular strategy for integrating non-omics and omics data. The decision based integration strategy is more complex because of the variable selection in omics data and its computational intensity. The collective strategy is a sophisticated method, but fairness in the variable selection has to be ensured. The heterogeneous and subjective characteristics of non-omics data and the high dimensionality of omics data make the integration challenging.

7.4 Machine learning for precision medicine forecast

Machine learning models have their own challenges, which are often application specific. Careful modelling and evaluation are important to avoid erroneous interpretations. The number of features selected and the number of samples tested are always critical, and a good sample to feature ratio should be maintained to make the model robust. The relationship between the number of features describing the data and the number of samples on which these features are measured is the data dimensionality versus data sparsity problem. Similarly, training sets should be chosen from a sufficiently large and representative sample population to avoid overtraining the model. Machine learning for precision medicine forecasts offers ample choices and often yields models with high accuracy. The top challenges for current precision medicine forecast models include:

•
Lack of clarity of data.
•
Data integration.
•
Resources for the data.
•
Global approach with a broad understanding of ethical values.
•
Recognizing benefits, since the results of the model are not always obvious. For example, in epidemiological studies, if omics data are generated based on individuals’ exposure to pollution, then integration could induce additional correlation between non-omics and omics data.

7.5 Future directions

Understanding the benefits and limitations of non-omics data, omics data, their integration, and machine learning techniques is the first step towards building a precision medicine forecast model. Tackling these issues directly with a comprehensive machine learning technique will help to generate valuable outcomes and lead to real social change in the specific community, especially in the healthcare of individuals in industrial regions. Machine learning mechanisms have the capability to become tools that address challenges such as processing large volumes of data robustly, integrating heterogeneous and high dimensional data to capture features in their full complexity, and delivering results efficiently in the medical field.

8. Conclusion

Improving population health requires precision medicine approaches that identify high risk individuals in the general population, which promises better possibilities for disease prevention in specific groups, especially individuals living in industrial areas. The precision medicine path may offer some advantage for early disease identification in this setting. Biomarkers are more trustworthy in predicting diseases than preclinical markers and may be useful among individuals with a history of cancer or heart disease. They may also be useful in predicting outcomes of diseases, such as sudden death. Molecular testing and genetic sequencing can identify persons who are at high risk of disease, but they require highly specialized technology and are not economically feasible. From this study, it has been learned that gene-environment and disease-gene data, when integrated with non-omics data such as epidemiological, environmental, and clinical data, can be used to evaluate health effects associated with environmental exposure. Thus, developing a good precision medicine forecast model with these data to improve disease identification is a societal responsibility. Machine learning promises to assist medical practitioners in integrating medical knowledge and the environmental and sociodemographic background of patients into routine care.

References

1. ABIM Foundation. Choosing Wisely, http://abimfoundation.org/what-we-do/choosing-wisely (accessed 21 August 2019).

2. Hertz-Picciotto, Schmidt, Krakowiak. Understanding environmental contributions to autism: Causal concepts and the state of science. Autism Res. 2018; 11: 554-586.

3. Lee, Gino, et al. Polluted morality: Air pollution predicts criminal activity and unethical behavior. Psychol Sci. 2018; 29: 340-355.

4. Deng, Urman, Gilliland, et al. Understanding the importance of key risk factors in predicting chronic bronchitic symptoms using a machine learning approach. BMC Medical Research Methodology. 2019; 19: 70.

5. Stingone, Pandey, Claudio, et al. Using machine learning to identify air pollution exposure profiles associated with early cognitive skills among U.S. children. Environmental Pollution. 2017; 230: 730-740.

6. Demircioğlu, Ulukan. A novel hybrid approach based on intuitionistic fuzzy multi criteria group-decision making for environmental pollution problem. 2020; 1013-1025.

7. Joon, Chandra. Household air pollution from cooking fuels: An environmental and public health challenge. 2016; 33-39.

8. Jing, Shi. Analysis of related factors and disease costs of respiratory infection and environmental pollution in children. 2020; 355-360.

9. Aditya, Deepak, Nguyen Gia, Ashish, Babita, Prayag. Sound classification using convolutional neural network and tensor deep stacking network. IEEE Access. 2019. doi: 10.1109/ACCESS.2018.2888882.

10. Vikash, Sanjay Kumar, Aditya, Deepak, Prayag, Catarina, Robertas, Victor. A novel transfer learning based approach for pneumonia detection in chest X-ray images. Applied Sciences (MDPI). 2020; 10(2): 559.

11. Aditya, Aman, Divya, Deepak, Ashish, Arun, Joseph. A novel deep learning based multi-model ensemble methods for prediction of neuromuscular disorders. Neural Computing and Applications (Springer). 2018; doi: 10.1007/s00521-018-3896-0.

12. Aditya, Gurinder, Deepak, Ashish, Shrasti, Victor Hugo. Seasonal crops disease prediction and classification using deep convolutional encoder network. Circuits, Systems and Signal Processing (Springer). 2019; doi: 10.1007/s00034-019-01041-0.

13. Abdul, Muskan, Deepak, Ashish, Fadi, Plácido Rogerio. CovidGAN: Data augmentation using auxiliary classifier GAN for improved covid-19 detection. IEEE Access. doi: 10.1109/ACCESS.2020.2994762.

14. Teng, Zhang, et al. The association between ambient air pollution and allergic rhinitis: Further epidemiological evidence from Changchun, Northeastern China. Int J Environ Res Public Health. 2017; 14(3).

15. Bernatsky, Smargiassi, Barnabe, et al. Fine particulate air pollution and systemic autoimmune rheumatic disease in two Canadian provinces. Environ Res. 2016; 146: 85-91.

16. Prada, Zhong, Colicino, et al. Association of air particulate pollution with bone loss over time and bone fracture risk: Analysis of data from two independent studies. Lancet Planet Health. 2017; 1: e337-e347.

17. Hamra, Guha, Cohen, et al. Outdoor particulate matter exposure and lung cancer: Systematic review and meta-analysis. Environ Health Perspect. 2014; 122(9): 906-911.

18. Turner, Krewski, Diver, et al. Ambient air pollution and cancer mortality in the Cancer Prevention Study II. Environ Health Perspect. 2017; 125(8): 087013.

19. Janitz, Campbell, Magzamen, et al. Benzene and childhood acute leukemia in Oklahoma. Environ Res. 2017; 158: 167-173.

20. Lavinge, Belair, et al. Maternal exposure to ambient air pollution and risk of early childhood cancers: A population-based study in Ontario, Canada. Environ Int. 2017; 100: 139-147.

21. Jin, et al. Impact of particulate air pollution on cardiovascular health. Curr Allergy Asthma Rep. 2018; 18(3): 15.

22. Wellenius, Schwartz, Mittelman. Air pollution and hospital admissions for ischemic and haemorrhagic stroke among Medicare beneficiaries. Stroke. 2005; 36(12): 2549-2553.

23. Lam, Sutton, Kalkbrenner, et al. A systematic review and meta-analysis of multiple airborne pollutants and autism spectrum disorder. PLoS One. 2016; 11(9): e0161851.

24. Liu, Zhang, Rodzinka-Pasko, et al. Environmental risk factors for autism spectrum disorders. Nervenarzt. 2016; 87(suppl2): 55-61.

25. Wilker, Preis, Beiser, et al. Long-term exposure to fine particulate matter, residential proximity to major roads and measures of brain structure. Stroke. 2015; 46(5): 1161-1166.

26. Zanobetti, Dominici, Wang, et al. A national case-crossover analysis of the short-term effect of PM2.5 on hospitalizations and mortality in subjects with diabetes and neurological disorders. Environ Health. 2014; 13(1): 38.

27. Honda, Pun, Manjourides, et al. Associations between long-term exposure to air pollution, glycosylated hemoglobin and diabetes. Int J Hyg Environ Health. 2017; 220(7): 1124-1132.

28. Eze, Schaffner, Foraster, et al. Long-term exposure to ambient air pollution and metabolic syndrome in adults. PLoS One. 2015; 10(6): e0130337.

29. Calderon-Garciduenas, Franco-Lira, D’Anguilli, et al. Mexico City normal weight children exposed to high concentrations of ambient PM2.5 show high blood leptin and endothelin-1, vitamin D deficiency, and food reward hormone dysregulation versus low pollution controls. Relevance for obesity and Alzheimer disease. Environ Res. 2015; 140: 579-592.

30. Chang, Yang, Chang, et al. Relationship between air pollution and outpatient visits for nonspecific conjunctivitis. Invest Ophthalmol Vis Sci. 2012; 53(1): 429-433.

31. Hwang, Choi, Paik, et al. Potential importance of ozone in the association between outdoor air pollution and dry eye disease in South Korea. JAMA Ophthalmol. Epub ahead of print 10 March 2016. doi: 10.1001/jamaophthalmol.2016.0139.

32. Wong, Tsang, Lai, et al. STROBE – long-term exposure to ambient fine particulate air pollution and hospitalization due to peptic ulcers. Medicine (Baltimore). 2016; 95(18): e3543.

33. Honda, Pun, Manjourides, et al. Anemia prevalence and hemoglobin levels are associated with long-term exposure to air pollution in an older population. Environ Int. 2017; 101: 125-132.

34. Pan, Chen, et al. Fine particle pollution, alanine transaminase and liver cancer: A Taiwanese prospective cohort study (REVEAL-HBV). J Natl Cancer Inst. 2016; 108(3).

35. Yang, Chen, et al. Associations between long-term particulate matter exposure and adult renal function in the Taipei metropolis. Environ Health Perspect. 2017; 125(4): 602-607.

36. Cohen, Brauer, Burnett, et al. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: An analysis of data from the global burden of diseases study 2015. Lancet. 2017; 389(10082): 1907-1918.

37. Gauderman, Urman, Avol, et al. Association of improved air quality with lung development in children. N Engl J Med. 2015; 372(10): 905-913.

38. Lee, Sheu, et al. Traffic-related air pollution, climate, and prevalence of eczema in Taiwanese school children. J Invest Dermatol. 2008; 128(10): 2412-2420.

39. Emerging Technologies. Omics, Bioinformatics, Computational Biology, http://alttox.org/mapp/emerging-technologies/omics-bioinformatics-computational-biology (2014, accessed 25 August 2019).

40. Zhao, Shi, Huang, et al. Integrative analysis of '-omics' data using penalty functions. Wiley Interdiscip Rev Comput Stat. 2015; 7: 99-108.

41. Rappoport, Shamir. Multi-omics and multi-view clustering algorithms: Review and cancer benchmark. Nucl Acids Res. 2018; 46: 10546-10562.

42. National Institute of Environmental Health Sciences. Gene and Environment Interaction, https://www.niehs.nih.gov/health/topics/science/gene-env/index.cfm (accessed 25 August 2019).

43. Primo-Parmo, Sorensen, et al. The human serum Paraoxonase/Arylesterase Gene (PON1) is one member of a multigene family. Genomics. 1996; 33(3): 498-507.

44. Manthripragada, Costello, et al. Paraoxonase 1 (PON1), agricultural organophosphate exposure, and Parkinson disease. Epidemiology. 2010; 21(1): 87-94.

45. Miller, Schettler, Tencza, et al. A story of health. Agency for toxic substances and disease registry, commonweal, science and environmental health network, western states PESHU. 2019.

46. Makamure, Reddy, Chuturgoon, et al. Interaction between ambient pollutant exposure, CD14 (-159) polymorphism and respiratory outcomes among children in Kwazulu-Natal, Durban. Human and Experimental Toxicology. 2016; 1-9.

47. Kim, Min, et al. CDH13 gene-by-PM interaction effect on lung function decline in Korean men. Chemosphere. 2017; 168: 583-589.

48. Rava, Ahmed, Kogevinas, et al. Genes interacting with occupational exposures to low molecular weight agents and irritants on adult-onset asthma in three European studies. Environ Health Perspect. 2017; 125(2): 207-214.

49. Gevaert, De Smet, Timmerman, et al. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics. 2006; 22(14): e184-e190.

50. van’t Veer, Dai, van de Vijver, et al. Gene expression profiling predicts clinical outcome in breast cancer. Nature. 2002; 415: 530-536.

51. ITTACA, http://bioinfo-out.curie.fr/ittaca. 2006.

52. Zhu, Song, Shen, et al. Integrating clinical and multiple omics data for prognostic assessment across human cancers. Scientific Reports. 2017; 7: 16954.

53. Chaudhary, Poiron, et al. Deep learning based multi-omics integration robustly predicts survival in liver cancer. Clin Cancer Res. 2018; 24(6): 1248-1259.

54. Bazzoli, Lambert-Lacroix. Classification based on extensions of LS-PLS using logistic regression: Application to clinical and multiple genomics data. BMC Bioinformatics. 2018; 19: 314.

55. de Maturana, Alonso, Alarcón, et al. Challenges in the integration of omics and non-omics data. Genes. 2019; 10: 238.

56. Rubio YLB, González-Reymúndez, KHH, et al. Whole genome multi-omics study of survival in patients with glioblastoma multiforme. G3 (Bethesda). 2018; 8(11): 3627-3636.

57. González-Reymúndez, de los Campos, Gutiérrez, et al. Prediction of years of life after diagnosis of breast cancer using omics and omics-by-treatment interactions. Eur J Hum Genet. 2017; 25(5): 538-544.

58. de Maturana, Picornell, Masson-Lecomte, et al. Prediction of non-muscle invasive bladder cancer outcome assessed by innovative multimarker prognostic models. BMC Cancer. 2016; 16: 351.

59. Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, 1988. ISBN: 0-934613-73-77.

60. Neapolitan. Learning Bayesian Networks. Prentice Hall, Upper Saddle River, NJ. 2004.

61. Cooper, Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning. 1992; 9: 309-347.

62. Yeh, Hsu, et al. Prediction of fatty liver disease using machine learning algorithms. Comput Methods Programs Biomed. 2019; 170: 23-29.

63. Islam, Poly, et al. Applications of machine learning in fatty liver disease prediction. Stud Health Technol Inform. 2018; 247: 166-170.

64. Shen, et al. Application of machine learning techniques for clinical predictive modeling: A cross-sectional study on nonalcoholic fatty liver disease in China. BioMed Research International. 3 October 2018; 2018: Article ID 4304376.

65. Arora, Boyne, Slater, et al. Bayesian networks for risk prediction using real-world data: A tool for precision medicine. Value in Health. 2019; 22(4): 439-445.

66. Fuster-Parra, Tauler, Bennasar-Veny, et al. Bayesian network modelling: A case study of an epidemiologic system analysis of cardiovascular risk. Comput Methods Programs Biomed. 2016; 126: 128-142.
