Abstract
Emerging predictive health analytics holds great promise for reducing costs and improving health outcomes. However, most predictive models do not capture the environmental exposures that shape health risk patterns in several chronic diseases, such as asthma. This gap prompted the development of the exposome paradigm, which aims to improve health intervention and prevention by providing meaningful and understandable feedback on individuals’ collected data and by minimizing their exposure to health risks. The exposome paradigm focuses on the simultaneous monitoring of mobility behaviors and measurement of environmental conditions to capture their impact on human health. In this paper, we introduce the concept of exposome analytics, which complements predictive analytics in developing an effective health monitoring and management system. We present current analytical developments, including our ongoing project to manage the risk of asthma exacerbations as a case study. Our proposed approach uses a novel exposome assessment paradigm that exploits the spatio-temporal properties of the data in the model training process and thereby improves the accuracy of asthma prediction. The quality of the proposed approach is extensively evaluated using real patient and environmental datasets.
Introduction
The future of healthcare is being shaped by the data revolution and the application of sophisticated data analytics towards improving preventive medicine and achieving personalized healthcare. While aspects of health data analytics such as the use of machine learning and statistical models have existed for decades [47], the new generation of health analytics incorporates a range of advanced and innovative technologies and tools such as (i) wearable monitoring devices and mobile applications for capturing life-logging personal health data, (ii) bio-medical signal processing approaches for conditioning high-resolution data, (iii) “Big Data” paradigms for integrating heterogeneous data, and (iv) data presentation capabilities for gaining insights into large datasets [4,76]. This enables disease prediction and risk management in a ubiquitous, personalized and continuous manner. The top motivational factor driving analytics in healthcare is cost reduction: predictive models identify patients at risk of readmission so that healthcare providers can intervene early and make proactive decisions [11]. Prediction of adverse patient outcomes has a long history that has formed the basis for more recent innovations. For example, the study in [36] developed a chronic health evaluation system based on a multiple logistic regression model to predict death rates of patients in intensive care using data from medical records such as age, sex, indication for admission and severity of illness.
In 2015, the World Health Organization (WHO) reported that most non-communicable disease deaths among people under 70 years of age were caused by four chronic diseases: cardiovascular diseases, cancers, chronic respiratory diseases and diabetes mellitus [85]. Approximately 70%–90% of these chronic diseases, and the rise in their incidence rates, are attributed to environmental factors. It is also estimated that the increasing patterns of many diseases, such as asthma and degenerative diseases, are driven more by environmental exposures than by genetic risks [30,80,81]. For asthma in particular, it is not sufficient to depend on bio-sensors to predict an attack, because attacks result from mixtures of accumulated environmental exposure and personal triggers. It is necessary to analyze and discover the links between environmental exposures and asthma exacerbations. Exposure to environmental conditions varies across locations and time intervals. This spatio-temporal nature of the exposure data best describes the dynamic behavior of individuals and enables the shift from static monitoring to spatio-temporal assessment of individual exposure. Increased attention has therefore been focused on the discovery and understanding of environmental influences on disease prevention [1,77,79], and the next wave of predictive health analytics will involve the aggregation of personal data with environmental data.
The assessment of personal exposure in space and time is affected by many variables that interact in complex ways with each other and with the human system [9,49]. For example, small behavioral changes such as walking along different paths or staying farther away from traffic may cause significant changes in exposure to the air pollutants that exacerbate asthma. Thus, individual exposure assessment requires recording an individual’s time-activity patterns in addition to the common environmental exposure scores. This need for a holistic approach led to the introduction of the exposome paradigm as a new research area for assessing the impact of environmental exposures on human health [79].
The definition of the exposome as the totality of environmental exposures implies two conceptual dimensions with implications for measurement and analysis: (1) human health is affected by a mixture of exposures; and (2) exposures occur across the lifespan [10]. Many factors must therefore be considered, such as chemicals and pollutants, physical agents (e.g. noise), macro-level factors (e.g. public health, schools, population density), and lifestyle factors (e.g. physical activity, sleep habits). Accordingly, the exposome raises research issues in collecting, processing and analyzing exposure data to build a knowledge network for understanding exposures and the mechanisms underlying diseases. In this context, it is clear that using questionnaires to collect disease control data or to assess physical activity has become obsolete. In fact, several studies demonstrate that self-reported exposure data do not match real-time monitored data [68].
Thus, developing effective predictive health monitoring systems requires integrating exposome analytical capabilities that facilitate real-time knowledge discovery about the environmental conditions to which individuals are exposed, in order to detect trigger patterns and prevent environmental diseases. In this direction, a new line of interdisciplinary healthcare research and industry ventures has emerged to produce dynamic, real-time predictive health monitoring solutions [4,21,39,67]. While systems integrating exposome techniques with analytic capabilities hold substantial promise for tailoring care to individual patients, and thus for transforming the future of healthcare, the penetration of such analytics solutions into actual clinical practice is in its infancy [11,61]. They still face significant challenges in terms of data size, data scales, complex structures and relationships, uncertainty, and space and time constraints. Moreover, monitoring and assessing individual exposure pose many challenges, such as the limited resources and techniques available to collect, process and analyze exposure data. In addition, there is currently no single chip that can measure different types of exposures comprehensively enough to capture the “totality of environmental exposure”, even for a short time interval. Moving this practice beyond its infancy will require, among other advances, novel statistical techniques capable of handling the totality of exposures and diseases in the context of interrelated disease states across short and long lifespan windows.
In this paper, we present an overview of a framework for combining exposome studies with predictive health analytics, together with its underlying concepts and essential processes. We also present a case study of a health analytics project that captures environmental exposures to monitor and manage the risk of asthma exacerbations. Our work focuses on asthma, a common chronic respiratory disease affecting 235 million people of all ages in all parts of the world [85], since it is known to be highly affected by the environment. The development of asthma has been attributed to an inherent susceptibility to a combination of many environmental variables [5]. While the list of asthma triggers includes many variables (e.g. air pollution, weather, allergens, certain drugs or foods, stress, and so on), the case study focuses on variables related to air quality and weather conditions. We propose computational models for continuously learning the relationships between environmental exposures and health outcomes, predicting health risks, and issuing meaningful and understandable personalized health feedback.
The remainder of this paper is organized as follows. In Section 2, we discuss exposome and predictive analytics in healthcare and highlight the top open challenges. Section 3 discusses current methods of predictive analytics for managing asthma exacerbations and their limitations. Section 4 presents our proposed framework of predictive and exposome analytics for predicting health risks of asthma, given patient and environmental exposure datasets. In Section 5, we present our investigation of a number of commonly used prediction models and discuss the limitations of the existing models. We then propose a two-step approach based on regression models, which provides meaningful and comprehensive feedback for patients. Finally, Section 6 presents future work and provides some concluding remarks.
Analytics on healthcare
The term “Health Analytics” consolidates the various dimensions of “analysis” incorporating a range of technologies and tools: databases, electronic health record systems, data warehouses, web applications, clinical decision support systems and others that are integrated for interoperability and seamless processing of health data for insights [8,28,64]. The Healthcare Information and Management Systems Society, HIMSS (2013) defines health analytics as the “systematic use of data and related clinical and business insights developed through applied analytical disciplines such as statistical, contextual, quantitative, predictive, and cognitive spectrums to drive fact-based decision making for planning, management, measurement and learning” [29]. In a different and specific view, the term analytics in healthcare refers to a complete series of “integrated capabilities” that provide progressively deeper statistical insights into health-related information. These capabilities cover the areas of information access, quality, integration, storage, management, interpretation and governance [40]. Analytical capabilities in healthcare exist on a continuum of business-focused to clinical-focused solutions targeting three main objectives: improve organization performance, lower health care costs, and improve health or patient outcomes [14,40,64]. Generally, health analytics is perceived as a four-stage model composed of:
Descriptive Analytics: Categorize, characterize, aggregate and classify data, converting it into useful information in forms of visual data summaries such as meaningful charts and reports.
Predictive Analytics: Examine historical or summarized health data, detect patterns of relationships in these data, and then extrapolate these relationships to predict the future.
Prescriptive Analytics: Use health and medical knowledge in addition to the descriptive and predictive analytics to decide action plans.
Discovery Analytics: Utilize knowledge about knowledge, or wisdom, to discover new drugs, alternative treatment, previously unknown diseases, etc.
This discussion focuses primarily on the use of predictive analytical techniques to improve health outcomes and quality of life. It also establishes the ground for “exposome analytics” as a joint capability that aids the understanding of exposure-disease relationships and the prediction of future risks, as illustrated in Fig. 1. Predictive analytics uses historical health data and medical knowledge to detect hidden patterns and predict health outcomes. Exposome analytics, on the other hand, relies on the latest advances in technology to track an individual’s mobility patterns while simultaneously measuring the frequency and duration of exposure to environmental conditions, in order to draw conclusions about related health impacts. Incorporating these conclusions, or exposome knowledge discovery, into health predictive modeling improves prediction and risk management for diseases that are triggered by environmental exposures. To achieve such integration, we need to understand the conceptual background, processes, techniques and related issues of predictive and exposome analytics, as we discuss next.

Fig. 1. Integration of exposome and predictive analytics.
Predictive analytics emphasizes the importance of preventive healthcare by identifying behavioral factors that may be linked to developing certain diseases. By recognizing these behavioral factors at an early stage and changing them, healthcare organizations can prevent the onset of potential diseases. For example, a recent project presented in [88] used predictive visual analytics to manage the risk of cardiovascular diseases with the assistance of wearable sensors and visual components for risk factor visualization and risk analysis. Predictive analytics differs from traditional statistics (and from evidence-based medicine) in two ways: predictions are made for individuals and not for groups; and predictive analytics does not rely upon a normal (bell-shaped) curve [83]. The prediction modeling process uses techniques such as machine learning and artificial intelligence to create a prediction profile or algorithm from the past history of individuals. The model is then deployed so that a new individual can instantly get a prediction for whatever is needed, whether an accurate diagnosis or a future alert to prevent a potential risk.
Visual components of predictive analytics are designed to promote the use of big data and enhance knowledge discovery via effective visual paradigms and well-designed user interactions [88]. In such models, visualization becomes the platform for an interactive analytic process, where humans and computers cooperate using their respective distinct capabilities for data processing and visual recognition to achieve the most effective results. The visual representation of large and complex healthcare datasets in a near-real-time interactive interface conveys the intensive relationships (i.e. more multidimensional than entity relationships) among many elements in parallel, and provides analysts with directly observable memory. Within the analytic process, analysts dynamically operate between their cognitive models (i.e. experiences and knowledge) and the patterns and insights that evolve from data exploration. This allows them to generate hypotheses and then interpret specific and massive amounts of data to identify new knowledge or verify existing knowledge. Data exploration and interpretation during the visual step of the analytic process can then guide the selection and application of subsequent advanced analytical processes to validate or evolve the initial hypotheses and concepts [69].
The essential processes of any predictive analytics model include:
Data Acquisition: stream data from continuous physiological signal acquisition devices. These data captures are typically short and can be downloaded only with the proprietary software and data formats provided by the device manufacturers. This process also involves data storage and retrieval steps to gather and store large volumes of streaming data and patient information from clinical settings using sophisticated storage mechanisms.
Data Aggregation: transform different types of healthcare data into a data format that can be read by the data analysis platform. This involves integrating disparate sources of data, developing consistency in data, standardizing data from multiple sources, and improving confidence in data especially towards utilizing automated analytics.
Predictive Modeling: use mathematical and algorithmic-based data processing, data mining, natural language processing and machine learning to analyze and derive insights from data.
Visualization: enable investigative analysis and hypothesis generation by showing connections between entities to facilitate understanding and guide the selection and application of other analytics techniques. Visualization includes three key functionalities [78]: (1) generate summaries; (2) extrapolate meaning from external data and visualize the information (e.g., interactive dashboards and charts); and (3) provide real-time reporting (i.e. alerts and proactive notifications, real-time data navigation).
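The data aggregation step above can be illustrated with a minimal sketch. It assumes two hypothetical streams (heart rate and PM2.5) sampled at different times, brings both to a common hourly resolution by averaging, and joins them on the shared hours. The function names, the hourly resolution and the sample values are illustrative assumptions, not part of any system described in this paper.

```python
from collections import defaultdict
from datetime import datetime


def hourly_means(readings):
    """Average (iso_timestamp, value) readings within each clock hour."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        buckets[hour.isoformat()].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in buckets.items()}


def join_on_hour(a, b):
    """Keep only hours present in both sources and pair their values."""
    return {hour: (a[hour], b[hour]) for hour in sorted(a.keys() & b.keys())}


# Hypothetical sample data, for illustration only.
heart_rate = [("2024-01-01T08:05:00", 70), ("2024-01-01T08:35:00", 80),
              ("2024-01-01T09:10:00", 90)]
pm25 = [("2024-01-01T08:00:00", 12.0), ("2024-01-01T09:00:00", 30.0)]

aligned = join_on_hour(hourly_means(heart_rate), hourly_means(pm25))
```

Averaging within a fixed window is only one possible standardization rule; a real pipeline would also need to handle missing hours, unit conversion and source-specific quality flags.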
Prediction as a process is based on classifying a set of inputs into one of several classes. This requires constructing a classification model that approximates the true mapping from input data to the appropriate outputs, with the intent of generating predictions of outputs for new, previously unseen data. The most popular data mining methods used for predictive modeling are [7,24,53]:
Logistic Regression (LR): a statistical method that uses a set of independent variables to model a binary criterion variable that represents presence or absence of event of interest.
Clustering: an instance based method that uses attributes of already stored set of similar related instances/objects to recognize the class or cluster for a set of unclassified instances. The instances are clustered such that the intra-class similarities are maximized and the inter-class similarities are minimized based on some criteria. Once the clusters are identified, the objects are labeled with their matching clusters.
Support Vector Machine (SVM): a classification method with mathematical and statistical foundation. It uses a non-linear mapping to transfer the original data into a higher dimension. Within this new dimension, SVM searches for the optimal hyperplane (decision boundary) to separate classes.
Artificial Neural Network (ANN): a perceptron-based method that imitates the human brain in its organization of neurons and decision-making process. ANN is powered by its massive parallel structure and the ability to learn from experience. The knowledge gained by the learning experience (training) is stored in the form of connections and weights used to classify new input data into categories. The accuracy of classification depends on efficacy and depth of the training.
Bayesian Networks (BN): encode probabilistic dependencies between random variables or attributes to predict class membership probabilities. A BN represents the joint probability distribution over the set of variables through a directed acyclic graphical structure and a set of conditional probability tables. Each node in the directed acyclic graph represents a random variable, while arcs represent the probabilistic dependency between a node and its parents.
Decision Trees (DT): a logic-based method that uses recursive data partitioning to induce classification. DT works top-down, seeking at each level of the tree the attribute that best separates the classes. Decision rules of the form “IF condition based on attribute value THEN outcome class” are then constructed from the induced DT.
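As a concrete instance of one of the methods listed above, the following minimal sketch fits a logistic regression model by batch gradient descent and uses it to score new inputs. The two features (scaled PM2.5 and scaled humidity), the toy data, the learning rate and the number of epochs are hypothetical choices made for illustration; they are not taken from the case study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """Probability of the positive class; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Minimize the mean log-loss with batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi  # gradient of log-loss w.r.t. the logit
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


# Hypothetical toy data: [scaled PM2.5, scaled humidity] -> exacerbation (1) or not (0).
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.7, 0.8], [0.8, 0.6], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]

w = fit_logistic(X, y)
low_risk = predict_proba(w, [0.15, 0.15])
high_risk = predict_proba(w, [0.8, 0.85])
```

The fitted weights can be read directly as the direction and strength of each feature's association with the outcome, which is one reason this family of models is considered interpretable.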
The methods listed above employ diverse modeling approaches, each with its own advantages and drawbacks in terms of the representation of the classification model, treatment of different types of attributes, handling of missing data and noise, computation cost, interpretability and generalization. They can be used alone or in combination to construct predictive models. After predictive accuracy, the ability to produce explanations and interpretable results is what makes a predictive method preferred in clinical practice. This is because health professionals need to understand how the model works in order to gather knowledge and make decisions. Hence, understanding the main concepts and issues underlying these methods is crucial for developing effective models and for useful prediction of clinical outcomes. Clinically useful predictive models are characterized by a comprehensive and purposive approach to data analysis, explanatory capabilities, and the capability of using domain knowledge in the process of data analysis [7]. The explanatory capabilities include visual means that allow practitioners to integrate direct observations with theoretical models. When reconciling conceptual models with direct observations of evidence, people are better at discovering their world and at developing insights to explain observations that do not fully conform to existing theories [84]. Thus, user interaction in predictive analytics is the key to the user-centered data exploration that is highly demanded in clinical practice [83].
Exposome analytics
We introduce the concept of “Exposome Analytics” to refer to the measurement, collection, analysis and reporting of data about environmental exposures occurring across time intervals of heightened susceptibility, for the purpose of understanding their cumulative effects on health state and their links to specific diseases. We focus on individual exposure, which requires dynamic monitoring of the individual’s time-activity patterns as well as of the health-impairing agents in the environment to which the individual is exposed. According to [65], individual exposure can be described as the sum total of all environmental impacts that act on an individual within a certain amount of time, at a specific place and under certain circumstances (context). While it involves the same processes as conventional analytical modeling, exposome analytics presents a challenge in terms of data measurement, as individuals interface with a mixture of dynamic exposures that vary over very short time intervals. In addition, exposure to environmental conditions varies across locations, and so exposures also need to be considered in relation to their temporal variation to produce a particular individual profile of exposure at any given point in time. Thus, based on an initial analysis of the collected data, more precise timestamps defining the data measurement periods (snapshots) need to be established. This contributes to building continuous real-time profiling and monitoring of individuals. It is also important to consider behavioral factors, such as physical activity levels and time-activity patterns, that may affect how much individuals are exposed [9]. An individual’s movement has spatial and temporal components that can be described as a path. An individual’s path may be associated with other information, such as the transport mode used, to construct a time-activity pattern detailing the time spent at specific locations during daily activities [70]. 
For instance, a study of different transport modes indicates that individuals riding bicycles inhaled more than eight times as much air per minute as individuals driving cars; and half as much air as individuals who walked [17]. Hence, in order to monitor the health effects of the environment on a specific individual, exposome analytics seeks to estimate the exposure to one single or a mix of pollutants that an individual accumulates over time as a result of frequent trips between two given points (say in travel from home to work each morning).
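The accumulation described above can be sketched as follows. The sketch assumes a simple dose model in which the inhaled dose for each leg of a trip is the pollutant concentration multiplied by the ventilation rate and the time spent on that leg, summed over the trip and over repeated trips. The `Segment` type, the dose model and all numeric values are hypothetical and serve only to make the arithmetic explicit.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str            # free-text description of the leg
    minutes: float        # time spent on this leg
    concentration: float  # pollutant concentration, in µg per m³ of air
    ventilation: float    # air inhaled, in m³ per minute (depends on activity)


def inhaled_dose(segment):
    """Dose for one leg, in µg: concentration x ventilation rate x duration."""
    return segment.concentration * segment.ventilation * segment.minutes


def cumulative_dose(segments, repetitions=1):
    """Dose accumulated over a path that is travelled `repetitions` times."""
    return repetitions * sum(inhaled_dose(s) for s in segments)


# Hypothetical morning commute, for illustration only.
commute = [
    Segment("home to station", minutes=10, concentration=20.0, ventilation=0.02),
    Segment("station to work", minutes=20, concentration=35.0, ventilation=0.03),
]

one_trip = cumulative_dose(commute)                  # about 25 µg
work_week = cumulative_dose(commute, repetitions=5)  # about 125 µg
```

Keeping the ventilation rate as a per-segment value lets the same function cover different transport modes and activity levels without hard-coding any particular ratio between them.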
In this respect, the availability and capability of more sensitive and sophisticated environmental measurement tools have improved the ability to collect data in spatial, temporal and contextual detail. These tools range from satellite and surface-based measurement campaigns and ubiquitous sensing networks designed for special applications, to personal computing devices such as smartphones [10,79]. Smartphones, in particular, contain components that make them suitable and convenient for collecting information associated with environmental exposures. These components include Global Positioning System (GPS) sensors, wireless network (Wi-Fi) interfaces, ambient light meters, and accelerometers. Smartphones can also be interfaced with other devices such as SensPod [66] to collect exposure-related information. SensPod collects data on ozone (
As more measurements become available from sensor technologies, they can be used to develop analytical models that assess and predict associations between environmental exposures and biomarkers indicating the severity or presence of some disease state. Essentially, analytical modeling of the exposome requires coupling and integrating different types of measurements obtained by different sensors at different scales. This poses the challenge of developing multi-scale aggregation techniques that link different measurements across spatial and temporal scales, with the aim of leveraging the strengths and minimizing the limitations of each type of sensor [63]. For example, the work in [65] devises a method to fuse three different types of sensing technologies (i.e. stationary sensors, mobile sensors and satellite-based remote sensing) in order to predict a certain individual exposure. The authors examined the mechanisms underlying the observed patterns (i.e. their dependence on space, time and situational circumstances) and explained how the patterns can be aggregated according to a functional relationship of collective interaction, using smaller-scale units coupled with scale-dependent constraints. Another challenge concerns the accuracy and uncertainty of the collected data. Both environmental conditions and the locations of moving objects are uncertain in nature. An individual’s trajectory is represented as the space-time path of a moving object constrained to road networks, and it is modeled as discrete points that record locations in time. This representation results in uncertainty in the location values recorded in the database of the moving object [89]. An environmental condition (e.g., a humidity of 30%) is also a spatio-temporal measurement that can be modeled as a raster grid for discrete time stamps. Each cell in the grid represents the average value of the environmental factor at a given time for the area. 
This average value is associated with positional and temporal uncertainties because of the approximations and interpolations used in modeling [22]. Because maintaining a network of high-precision sensors is extremely expensive, the literature suggests numerically integrating low-cost sensing with high-quality measurements by using uncertainty analysis and global sensitivity methods [35,51]. This should be done with analysts engaged in the decision-making process to assess the relevance and quality of the data with respect to the analytical objectives [63].
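One simple way to combine a low-cost reading with a high-quality measurement of the same quantity is inverse-variance weighting, sketched below. This is a standard textbook technique used here as a stand-in; it is not necessarily the method of [35,51]. It assumes that the readings are unbiased, that their errors are independent, and that each reading comes with a known error variance. The sensor values are hypothetical.

```python
def fuse(readings):
    """Inverse-variance weighted fusion of (value, variance) pairs.

    Returns the fused estimate and its variance, assuming unbiased
    readings with independent errors.
    """
    weights = [1.0 / variance for _, variance in readings]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, readings)) / total
    return value, 1.0 / total


# Hypothetical PM2.5 readings for the same grid cell and time stamp:
low_cost_sensor = (42.0, 16.0)   # noisier: variance 16
reference_station = (36.0, 4.0)  # more precise: variance 4

fused_value, fused_variance = fuse([low_cost_sensor, reference_station])
# fused_value is about 37.2 and fused_variance about 3.2
```

The fused estimate sits closer to the more precise reading, and its variance is lower than that of either input, which is the practical motivation for combining cheap, dense sensing with sparse reference measurements.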
The analysis part of exposome analytics aims at estimating how much of a known environmental agent is getting into the human system and using calculations to assign health risk. This prompts the discovery of what biomarker patterns are prospective signs of the disease and retrospective signs of the exposure [55]. The exposome discovery analysis involves a wide range of data sampling and processing methods such as data mining algorithms and statistical techniques. For example, the researchers in [87] used adaptive neuro-fuzzy inference system (ANFIS), support vector machine (SVM) and back-propagation artificial neural network (BPANN) algorithms to predict the variations in ground-level concentration of
To move forward in integrating predictive analytics and exposome analytics in healthcare, existing studies with longitudinally measured environmental exposures need to be combined with sufficient historical and current healthcare data. The discussion above suggests that a number of challenges remain in developing a hybrid health-environment analytical infrastructure; we highlight the most important of these challenges in the next section.
Highlights on open challenges for predictive and exposome analytics
Data mining for health and environmental problems remains one of the most challenging areas of data mining research [34,86]. One of the biggest concerns today is how to aggregate distributed and decentralized data sources, e.g. sensors for patients’ biological signals and autonomous mobile sensor networks for monitoring environmental conditions. This concern extends to how to ensure the quality of the data and of the quality assurance practices used to validate data before applying analytics. This is crucial, as predictive modeling might have harmful effects if interventions are made based on a flawed data model [82]. In the following subsections, we provide an overview of the challenges across the main processes of developing a predictive-exposome analytical solution: data acquisition, data aggregation, data modeling and data visualization, as summarized in Table 1.
Table 1. Key challenges for predictive-exposome analytics
Today, continuous physiological signal acquisition devices provide access to live, high-speed streaming data. However, these data captures are typically short, resulting in sparse data sampling with uncertainty in the measured variables that reduces the utility of the data. In addition, there are governance challenges such as the lack of standards and data protocols, especially since captured data can be downloaded only with the proprietary software and data formats provided by the device manufacturers. Security and privacy are also important issues that may prevent full use of any online monitoring system. Sensitive and critical data should be treated with different levels of security during the data collection, data transfer and data processing phases. Another important aspect in this area is the social implications of the system and patients’ perception of its impact on their privacy. While some may reject the idea of having their activities monitored by camera, others may be comfortable carrying on with their activities while being recorded. A well-designed system should be able to accommodate different patients’ privacy preferences and provide options such as emergency backup methods for when the monitor is turned off. Patients should also be able to change their preferences in different situations and environments based on their current conditions.
On the other hand, there are challenges in developing and implementing systems that enable sensing of health and environmental data at large scale and minimum cost. Environmental sensing in particular is energy-consuming and cannot be performed frequently on mobile devices, while sensors for individual physiological variables are hard to integrate into a single device.
Data aggregation
Integration of data from heterogeneous sources is a crucial issue in the context of health-environment analytics. Publicly available environmental data is precise but of low-resolution and often with low sampling frequency (e.g.,
Data modeling
The integration of predictive and exposome analytics poses four major challenges for data modeling and analysis, as follows:
Uncertainty in location: determining the exposure-response relationships for optimal intervention will require time-space dynamic modeling of the movement of people and their individual activity space. Here, challenges exist due to the spatio-temporal uncertainty associated with an individual’s trajectories and the environmental conditions the individual is exposed to [2,89]. In practice, actual paths are approximated by discrete location reports. To estimate environmental exposure from these approximations, precise paths, or probability distributions over possible paths, are required.
Impact of behavior on exposure: as the number of possible paths grows exponentially with the gap between location reports, behavioral estimates play a crucial role in exposure estimation, and combining human behavioral models with exposure models poses a significant computational challenge. Extensive research has been carried out on modeling route choice behavior [44,56,62], which has also identified path selection factors other than travel time and distance. Yet most current behavior models have been designed through interviews and surveys, and have limited utility for outdoor exposure research, where exposure varies with location, time and en-route activities. Starting from these models, accurate modeling of path selection behaviors and their effects on exposure in the presence of spatio-temporal uncertainty remains one of the key challenges to be addressed when developing analytical solutions.
Analysis of spatial and spatio-temporal data: the data analysis required for accurate prediction of the effects of environmental conditions on human health is challenging because the data involved is composed of aggregated spatial and spatio-temporal data. State-of-the-art techniques, such as random forests and support vector machines, do not adequately support data that has both spatial and spatio-temporal features.
Integration method of clinical data: in order to investigate and understand the health impacts of environmental conditions, clinical data should be integrated properly according to the analysis method. For example, Bayesian networks can combine different data sources in three ways: full integration, decision integration and partial integration. The full integration method puts the different data sources together and treats them as one dataset. This means that both the clinical variables and the exposome variables are offered as one dataset to the Bayesian network learning algorithm. Thus, the developed predictive model can contain any type of relationship between the clinical variables and the exposome variables. The decision integration method amounts to learning separate models for the clinical and the exposome data. The predictions for the outcome are then amalgamated. The partial integration method performs the first step of the algorithm (i.e. structure learning) separately for each data source, then joins the two resulting structures into one structure with only one variable in common, the outcome (predicted) variable.
Data visualization
As with data acquisition, there is a need for scale-tolerant visual approaches that enable reasoning over large and diverse information spaces to facilitate complex analytics and uncertainty refinement [74]. This involves the capability to introduce a holistic information representation synthesizing all forms of data from sensors, stream data, historical data and models into inter-related knowledge structures. There are also further complexities inherent to these visual representations, such as the limits of human cognitive capacity and the techniques for maintaining privacy and security [54,74,84].
An overview of asthma exacerbation risk management
Asthma is a common chronic disease that affects some 235 million people of all ages in all parts of the world. While the factors responsible for increasing asthma rates are not fully understood, the environment and lifestyle play key roles in the large increase in the number of people with asthma worldwide. Asthma is characterized by recurrent attacks of wheezing and breathlessness, which vary in frequency and severity from person to person. Symptoms may occur several times in a day or week in affected individuals, and for some individuals become worse during physical activity or at night. During an asthma attack, the lining of the bronchial tubes swells, causing the airways to narrow and reducing the flow of air into and out of the lungs. Recurrent asthma symptoms frequently cause daytime fatigue, reduced activity levels, sleeplessness, and school and work absenteeism, all of which impair quality of life. Asthma is also a substantial fiscal burden, as it contributes significant costs to the health care system, particularly for emergency department (ED) visits and hospitalizations, which in many cases could be prevented. Indeed, admission to hospital during an asthma attack indicates either the first episode of the disease or a failure of preventive care for established asthma [85]. Thus, admission rates have been proposed as a target indicator for monitoring progress towards improved asthma care. This increases the demand for reducing asthma admission rates through a primary prevention system that reduces the level of exposure to common risk factors, particularly air pollution.
In the area of exposome analytics, research in [75] applied association rule mining to explore the effects of multiple air pollutants such as
Unfortunately, while research has identified several environmental factors associated with the development of asthma, none has proven to be a causative agent. Further complicating the area of asthma management is the fact that most people diagnosed with asthma have isolated attacks separated by symptom-free periods. Thus, the goal of much asthma treatment is to control the condition at the individual level by reducing the risk of an attack. However, research has also shown that health responses to environmental exposure vary significantly over the population [50] and depend on many individual-specific variables, which makes managing health at the individual level challenging.
There have been strong yet limited research efforts to develop monitoring and risk management systems that capture individual-based measurements of exposure to environmental factors in order to develop more accurate diagnoses of the causes of acute asthma episodes. In [43], a data mining alarm system is proposed to predict asthma attacks automatically by mining asthma physiological symptoms (bio-signal data) and environmental factors together. The system collects environmental pollutant data, climatic and atmospheric data and all disease related data through the Internet and telecommunication networks. The data is then analyzed by two methods: Pattern Based Decision Tree (PBDT) and Pattern Based Class-Association Rule (PBCAR). Both methods extract normal and high-risk sequential patterns as features, but then use different rule-mining techniques. The work in [46] developed a system for predicting asthma control deterioration one week ahead by using assessments from a self-monitoring tool, the Asthma Symptom Tracker (AST), in conjunction with patient attributes and environmental variables. The AST assessment reflects the patient’s asthma control level over the past week. Each patient’s AST assessments were collected for six months, while environmental exposure data matched by time and location were obtained from multiple regional monitoring stations. Demographics and clinical status were also obtained. The system consists of two components: (1) a predictive model based on the decision stump classifier, which makes a prediction based on a single independent variable; and (2) advanced predictive models built using multiple common classification algorithms and machine learning techniques. Similarly, the work in [19] used several existing data mining algorithms to predict asthma exacerbation from tele-monitoring data such as physiological data and medication usage, and identified important attributes in asthma exacerbation.
A cloud-based predictive modeling of asthma readmission is also proposed in [12] using a set of classification and regression methods that utilize medical record information for a cohort of case patients with asthma readmission and matching control patients. The predictive module consists of several stages. First, the cohort construction and feature construction stages are conducted. The feature construction module allows the user to specify a method of aggregation for each event with respect to its values for each patient; while multiple techniques are used for feature selection (i.e. raw features, the
In a different direction, several research works present pattern recognition models to identify consequential relations between environmental factors and asthma attacks. The work in [33] presented a pattern recognition model to identify consequential relations between environmental factors and asthma attacks. The model detects the prevalence of asthma by finding complex interrelations between air pollution, weather, and asthma exacerbation. Once a burst in asthma related message is detected, historical physical sensory data is analyzed, to find a set of complex risk factor patterns that might have resulted in the burst. A brute force algorithm is used to generate all possible combinations of events in a pattern and different time lags between them; while a finite-state automaton method is applied for each pattern to count the frequency of a set of candidate patterns. The results indicate that recognizing the time lags between different events (e.g. rain, temperature,
The work in [57] identifies the extent to which heterogeneous information contributes to the prediction of asthma, comparing linear (i.e., logistic regression) and non-linear (i.e., random forests) machine learning predictive models fitted with different feature combinations. The findings indicate that the random forests algorithm appears to be an effective tool to extract and integrate large numbers of heterogeneous predictors/attributes simultaneously for asthma attack predictive models. Regarding the importance of features, the findings confirm the important contribution of allergen sensitization (i.e. dust mite, dog, cat), along with lung function markers, in predicting asthma diagnoses or symptom patterns. A logistic regression based predictive model is also proposed in [48] using the Asthma Control Questionnaire (ACQ) score over time to predict risk of a future asthma exacerbation. The ACQ includes nighttime waking, symptoms on waking, activity limitation, shortness of breath, wheeze, rescue short-acting medication use and the score of the lung function test FEV1 (forced expiratory volume in one second). The association between baseline ACQ score and exacerbation was assessed using a Cox proportional hazards model [73] adjusted for treatment assignment. A second Cox proportional hazards model used all ACQ scores before exacerbation as time-dependent covariates. The time-dependent model was selected to enable the use of all the ACQ scores over time before the exacerbation occurred, rather than the single baseline or the last ACQ score. The model indicates that time-dependent values of attributes (i.e. AM-PEF vs PM-PEF) are differently associated with the risk of asthma exacerbation, highlighting that morning peak flow and diurnal variation of peak flow were significantly associated with the risk of asthma exacerbation. The study also demonstrates a hazard ratio (HR) trend suggesting that higher FEV1 values are associated with decreased risk of exacerbation.
Such findings are confirmed by another multivariate logistic regression predictive model proposed in [41]. The study also indicates that changes in lung function gave the highest sensitivity and specificity as a proxy for exacerbations, while a combination of a 20% decrease in PEF or a 20% increase in symptoms (on 2 consecutive days) defines a severe exacerbation that can serve as a signal for the patient to contact the healthcare provider.
A recent study estimating environmental effects on asthma exacerbations is presented in [42]. The study performed a cross-over analysis using weather and air pollution variables at 1-week intervals between cases and controls before and after ED visits. The study acquired the weather and air pollution variables over a 6-year period, as well as the data of patients who visited the ED with asthma exacerbations over the same period. The results indicate that high wind speed and low humidity were particularly associated with an increased risk of asthma exacerbations. More mathematical and statistical models to predict future risk of asthma exacerbation are presented in [13,72]. Research in [6] presents promising results in building continuous real-time profiling and monitoring of individuals. The proposed method builds a “Voronoi map” in which each Voronoi cell represents a region that has similar environmental conditions, and uses it to calculate each individual’s exposures. The work in [32] follows up by presenting a Bayesian framework for estimating path selection probabilities from extremely sparse GPS data for the purpose of estimating a “measurement of interest” that varies with path and travel time.
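As a rough illustration of the Voronoi idea, the sketch below assigns each location report to its nearest monitoring station, which is equivalent to finding the Voronoi cell that contains the point. The station names, coordinates, readings and the planar distance metric are hypothetical and are not taken from [6].

```python
import math

# Hypothetical monitoring stations: name -> ((x, y) in km, pollutant reading).
STATIONS = {
    "north": ((0.0, 10.0), 80.0),
    "south": ((0.0, 0.0), 40.0),
    "east": ((10.0, 5.0), 60.0),
}


def nearest_station(point):
    """Return the station whose Voronoi cell contains `point`."""
    return min(STATIONS, key=lambda name: math.dist(point, STATIONS[name][0]))


def mean_exposure(track):
    """Average the readings of the cells visited by a list of (x, y) reports."""
    readings = [STATIONS[nearest_station(p)][1] for p in track]
    return sum(readings) / len(readings)


# Hypothetical location reports for one individual.
track = [(1.0, 1.0), (1.0, 9.0), (9.0, 5.0)]
print([nearest_station(p) for p in track])
print(mean_exposure(track))
```

A time-weighted average would be a natural refinement when the time spent in each cell is known.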
In many of the above studies, one of the primary health indicators of lung function used in the management of respiratory disorders such as asthma is the peak expiratory flow (PEF) measurement. This measurement reflects the degree of obstruction in the airways and a patient’s ability to breathe out air. To interpret the significance of a PEF measurement, a standardized “normal” value is often used, which is obtained from a chart comparing the person with asthma to a general population without breathing problems. One such chart is the EU scale, which is based on an individual’s gender, age and height [15]. On this scale, a 30-year-old male who is 180 cm tall has a normal PEF of 635 L/min.

Kernel ridge regression.
A common approach to predict a patient’s next PEF value is to use ordinary linear regression of PEF values against recent environmental exposures. This approach aims to predict the average PEF value that would be seen over many days with the same or similar environmental exposure values. It often provides unsatisfactory predictions in the area of asthma management [8,12] where doctors and patients are more often interested in studying PEF values in the high-risk zone (lower PEF value zone). Moreover, as illustrated by Fig. 2 on page 537, which shows the results of Kernel Ridge regression with polynomial degree = 1 for one patient in our study, the outputs of linear regression models also tend to underestimate the true variance of a patient’s PEF readings.
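The variance point can be illustrated with a small experiment. For least squares with an intercept, the variance of the fitted values equals R² times the variance of the observations, so the predictions are never more spread out than the data. The sketch below uses plain one-variable least squares on synthetic numbers; it is not the kernel ridge model behind Fig. 2, and all values are hypothetical.

```python
import random
import statistics

random.seed(0)

# Synthetic data (hypothetical): one exposure variable and noisy PEF readings.
exposure = [random.uniform(0, 100) for _ in range(200)]
pef = [450 - 0.8 * x + random.gauss(0, 40) for x in exposure]

# Ordinary least squares with one predictor and an intercept.
mean_x = statistics.fmean(exposure)
mean_y = statistics.fmean(pef)
sxx = sum((x - mean_x) ** 2 for x in exposure)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(exposure, pef))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in exposure]

var_observed = statistics.pvariance(pef)
var_fitted = statistics.pvariance(fitted)
r_squared = var_fitted / var_observed  # identity for OLS with an intercept

print(f"observed variance: {var_observed:.1f}")
print(f"fitted variance:   {var_fitted:.1f}")
print(f"R^2:               {r_squared:.2f}")
```

Adding a ridge penalty shrinks the coefficients further, so a penalized linear fit would be expected to understate the spread at least as much.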

The framework of SCHAS incorporating predictive and exposome analytics to manage risks of asthma exacerbations.
Despite the progress mentioned above and many other asthma studies, the fact remains that while the majority of healthcare organizations are planning for or already implementing analytics, current efforts address mostly descriptive analytics rather than predictive analytics. Analytics budgets are spent mostly on IT infrastructure and systems designed to capture and process data, and few resources have been devoted to real-time data analytics applications. As mentioned above, well-known classification algorithms and machine learning techniques have been used in exposome and predictive analysis, but the outcomes from these systems have shortcomings in explaining the relationships among many variables. Hence challenges remain in practical use of this analysis.
In this section, we present our ongoing work on the Smart and Connected Health Alert System (SCHAS) project, which develops a health monitoring system for modeling individual exposure to environmental triggers of asthma, and our approaches to integrating exposome and predictive analytics.
Smart and connected health alert system
We first present a general framework of the proposed system shown in Fig. 3 on page 538. The proposed system calculates personal exposure assessment by measuring environmental factors (air quality) using sensors; and obtaining the location/time of the patient using his/her trajectory data obtained by the GPS. The air quality variables include: carbon monoxide (
However, two common challenges when dealing with GPS points are the correct mapping of these readings to the road network and the reconstruction of the actual path for the whole trip. Most map matching algorithms focus on high-frequency GPS sampling data, for which the path followed is short and there is usually a dominant path that starts from a fixed position [25,58,59]. When sampling rates are lower and GPS points are sparser, a large number of paths are possible between two points. In health monitoring applications, battery life considerations force long-term readings to be taken much less frequently, on the order of every 10 minutes or more. In this context, it is desirable to obtain not only the most likely route traversed, but also a collection of likely routes and the probability that each is the route taken given the elapsed time between consecutive readings. Additionally, the purpose of determining likely paths traversed is often to estimate the value of some variable of interest that varies by path. For example, we may be interested in daily accumulated exposure to high heat. Most current map matching algorithms return only the most likely path given the GPS data and are thus inadequate for our application. In our work, we devise a probabilistic model to generate a set of potential true paths for the different trips made by the patient over different travel times (i.e. traffic time), and associate a likelihood with each of them. Our path prediction model is centered at the individual level rather than the population level: it calculates the probabilities that a specific patient took certain routes given the observed data, rather than those probabilities for a randomly selected person from the greater population. Thus, it aims to minimize, to the extent possible, the reliance on population-based statistics or population-based models of traffic flow or decision-making.
It also focuses on travel time to predict the paths taken, in contrast to the usual path prediction models, which determine the specific route based heavily on the sequence of previous paths. This is mainly because the uncertainty in previous route selections compounds under the sparse GPS data in our context, which significantly degrades the utility of previous path predictions.
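To make the idea of travel-time-based path probabilities concrete, the following minimal sketch weights a few candidate paths by how well their expected travel time matches the elapsed time between two GPS reports, and then computes a probability-weighted exposure. The uniform prior, the Gaussian travel-time likelihood, the value of `sigma` and the candidate paths are all illustrative assumptions; this is not the probabilistic model used in the project.

```python
import math

# Hypothetical candidate paths between two GPS reports:
# (name, expected travel time in minutes, heat exposure along the path).
CANDIDATES = [
    ("highway", 8.0, 2.0),
    ("main_street", 11.0, 5.0),
    ("park_route", 16.0, 9.0),
]


def path_probabilities(elapsed_min, sigma=2.0):
    """Posterior over candidate paths given the elapsed time between reports.

    Assumes a uniform prior and a Gaussian travel-time likelihood with
    standard deviation `sigma` (both are illustrative assumptions).
    """
    weights = {
        name: math.exp(-0.5 * ((elapsed_min - t) / sigma) ** 2)
        for name, t, _ in CANDIDATES
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def expected_exposure(elapsed_min):
    """Probability-weighted exposure over all candidate paths."""
    probs = path_probabilities(elapsed_min)
    return sum(probs[name] * exposure for name, _, exposure in CANDIDATES)


probs = path_probabilities(10.0)
print(max(probs, key=probs.get))
print(round(expected_exposure(10.0), 2))
```

Keeping the full distribution, rather than only the most likely path, is what allows the exposure estimate to reflect the uncertainty in the route.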
The data acquisition in our proposed system is accomplished by a microprocessor-based system that interfaces the sensors with a mobile phone running the Android operating system. The real-time data collected via the GPS/GSM modules are sampled, transferred to servers and stored in XML format, which can later be easily converted to other formats and integrated with other GIS datasets. Our system also uses the mobile application to acquire data indicating the condition of lung function (daily AM/PM PEF values). The data is sent to the server along with the time and location of the reading. The main spatio-temporal datasets, environmental data and trajectories, are built into database management systems that allow encoding of complex spatio-temporal relationships.
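As an illustration of the storage step, the sketch below serializes one reading to XML and parses it back. The element names, attribute names and sample values are hypothetical, since the schema used by the system is not specified here.

```python
import xml.etree.ElementTree as ET


def reading_to_xml(patient_id, timestamp, lat, lon, pef, co_ppm):
    """Serialize one reading as an XML string (hypothetical schema)."""
    root = ET.Element("reading", patient=patient_id, time=timestamp)
    ET.SubElement(root, "location", lat=str(lat), lon=str(lon))
    ET.SubElement(root, "pef", unit="L/min").text = str(pef)
    ET.SubElement(root, "co", unit="ppm").text = str(co_ppm)
    return ET.tostring(root, encoding="unicode")


xml_text = reading_to_xml("p01", "2016-03-01T08:00:00", 37.5, 126.78, 410, 0.6)
print(xml_text)

# Round trip: parse the record back, e.g. before loading it into a database.
parsed = ET.fromstring(xml_text)
print(parsed.find("pef").text, parsed.find("location").get("lat"))
```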
Environmental exposure variables
Individual profiles and PEF datasets
PEF zoning methods
Our work focuses jointly on aiding the understanding of exposure-disease relationships and predicting future health risks. Our present analysis is based on static exposome analysis that uses the individuals’ reported daily activity routines to establish points of interest used in the exposome estimation. By incorporating the time and location of an individual’s points of interest (residence, work, school, social activities, etc.), in this study we estimate their exposures to certain environmental conditions using the methods presented in [31,34]. We note, however, that dynamic monitoring of individuals’ time-activity patterns using recent advanced sensor technologies can improve the proposed prediction models [32].
While the list of asthma triggers includes many variables (e.g. air pollutants, allergens, certain foods, stress, etc.), our study tackles outdoor environmental variables related to air quality and climate conditions to predict the probability of an individual’s asthma risk. We consider 5 air pollution variables (
Exposome analytics presents a challenge in terms of the data measurement and calculation as individuals interface with a mixture of dynamic exposures that vary in space and over both short and long time intervals. In our analysis, we study 24 exposure variables constructed by various combinations of environmental factor and aggregation method, as summarized in Table 2 on page 540. These include: (1) the maximum, the minimum and the total accumulation within the daily measurement period,
Predictive analytics
In our study, we analyzed 9 datasets of five asthma patients who have been consulted and monitored by doctors and medical practitioners at Soonchunhyang University Bucheon Hospital, South Korea. The PEF values of each individual were collected twice a day (in the morning and in the evening) and the durations of the data collection vary from 364 days to 689 days. Each individual’s medical profile and the AM and PM data are shown in Table 3 on page 540 together with the “normal” values from the EU scale [15].
Health risk zoning
In order to further classify the risk of respiratory distress, the American Lung Association classifies PEF values into 3 zones of measurement [3]: green, yellow, and red. These zones are called “Normal Zones” because they are established by studying PEF values over the entire population, and they are commonly used to develop management plans, as shown in Table 4 on page 540 (a). However, as Fig. 4 on page 541 reveals for three individuals in our study, this zoning can be problematic because the PEF values of some affected individuals are mostly in the red zone while the PEF values of other individuals are mainly in the yellow zone.
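A zoning rule of this kind can be sketched as a simple threshold function. The sketch assumes the commonly used cut points of 80% and 50% of the reference value; these defaults are an assumption for illustration, and the thresholds in Table 4 (a) take precedence if they differ.

```python
def normal_zone(pef, normal_pef, yellow_cut=0.8, red_cut=0.5):
    """Classify a PEF reading relative to a reference ("normal") value.

    The 80% and 50% cut points are assumed defaults for illustration.
    """
    ratio = pef / normal_pef
    if ratio >= yellow_cut:
        return "green"
    if ratio >= red_cut:
        return "yellow"
    return "red"


# Reference value of 635 L/min, as quoted earlier for a 30-year-old,
# 180 cm male on the EU scale; the readings themselves are hypothetical.
for reading in (600, 400, 250):
    print(reading, normal_zone(reading, 635))
```

Because the reference value comes from a population chart, a patient whose typical readings sit well below it would be classified as yellow or red on most days, which is the difficulty described above.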

PEF data distributions of three individuals based on the normal zoning [15].

PEF data distributions of three individuals based on the quantile zoning (
High variability within the population of individuals’ typical PEF ranges causes difficulty in using the Normal PEF zoning scale for predicting health risk, especially for at-risk individuals. Hence, we propose a new individual-based zoning of asthma health risks based on quantiles as described in Table 4 on page 540 (b), which we refer to as “Quantile Zoning”. Quantiles are cut point values dividing the entire data set of a patient’s PEF readings into zones based on the percent of readings above and below the cut point. In the proposed zoning, the patient and doctor select a percentage, a, and the readings representing the lowest
While the quantile zoning method seems to work better for asthma patients than other scales, challenges remain. In particular, (1) optimal cut-off values for the zones vary by individual and can be difficult to establish, and (2) the values close to the boundaries between two zones tend to be noisy and degrade model performance. Therefore, we recommend that the quantile zoning method be used as a guideline or starting point for doctors and patients in selecting individual-based cutoff values for the zones.
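A minimal sketch of individual-based quantile zoning is given below. It assumes that the lowest a% of a patient’s readings form the red zone and the highest a% form the green zone, with the remainder in the yellow zone; this symmetric reading, the inclusive percentile method and the sample history are illustrative assumptions.

```python
import statistics


def quantile_cutoffs(readings, a=20):
    """Return (red/yellow cutoff, yellow/green cutoff) for one patient.

    Assumes the lowest a% of readings form the red zone and the highest a%
    form the green zone (an illustrative reading of the quantile zoning).
    """
    percentiles = statistics.quantiles(readings, n=100, method="inclusive")
    return percentiles[a - 1], percentiles[100 - a - 1]


def quantile_zone(pef, cutoffs):
    """Map a PEF reading to a zone using the patient's own cutoffs."""
    low, high = cutoffs
    if pef < low:
        return "red"
    if pef > high:
        return "green"
    return "yellow"


# Hypothetical PEF history for one patient (L/min).
history = [300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400]
cutoffs = quantile_cutoffs(history, a=20)
print(cutoffs)
print([quantile_zone(v, cutoffs) for v in (305, 350, 395)])
```

In practice the cutoffs returned here would serve only as a starting point, to be adjusted by the doctor and patient as recommended above.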
One useful measure of health risk in medical patients is the probability that an individual’s health level drops below a threshold, such as the yellow-red cutoff established above. In the context of asthma management, we let
There can be significant difficulty in doctors and patients arriving at a practical and useful interpretation of the risk associated with a given probability. For example, if a system issues a warning message that my PEF measurement will be below my critical PEF value with
Central health management problems
Asthma patients are often most interested in predicting either what risk zone they will be in the next day or in predicting the risk that their health will deteriorate below a critical threshold level. The challenges can be formulated as two statistical problems that can guide work in predictive and exposome analytics of asthma risk management.
Determine the health risk level (red, green or yellow) of an individual’s next PEF measurement based on his/her recent exposures.
Estimate the probability that a patient will experience a critical health state given his or her recent exposures.
For the remainder of the paper, we survey several methods for addressing these problems and we propose a twofold process for solving the second that mitigates the prevalence of false warnings by providing readily understood gradations for the risk level.
Classification modeling
We now survey the most popular data mining methods [7] for solving the classification problem of predicting health risk zones and, in the context of asthma management, present the analysis results of five methods: (1) Decision Tree Gini (DT-G), (2) Decision Tree Entropy (DT-E), (3) Support Vector Machine (SVM), (4) Gradient Boosting (GB), and (5) Logistic Regression (LR). These methods employ diverse modeling approaches, each with its own set of advantages and drawbacks. Additionally, we discuss methods for mitigating the challenges presented by the fact that we often have much more data in the yellow zone than in the red or green zones.
Classifiers with imbalanced classes
One of the most common challenges faced when developing a classifier is the class imbalance problem. A dataset is considered imbalanced if the class of interest (low PEF values or high PEF values) is relatively rare as compared to the other classes [45]. As a result, the classifier can be highly biased toward the majority class. A number of sampling approaches, ranging from under-sampling to over-sampling, have been developed to solve the problem of class imbalance [31]. One challenge with sampling strategies is determining how much sampling to apply. An over-sampling level must be chosen so as to promote the minority class, while avoiding overfitting to the given data. Similarly, an under-sampling level must be chosen so as to retain as much information about the majority class as possible, while promoting a balanced class distribution [45].
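As one simple instance of such a strategy, the sketch below performs random over-sampling, duplicating minority-class samples until every class matches the size of the largest class. It is not necessarily the sampling method used in our experiments, and the labels and sample identifiers are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels


# Hypothetical imbalanced zone labels: many yellow, few red and green.
labels = ["yellow"] * 12 + ["red"] * 3 + ["green"] * 2
samples = list(range(len(labels)))  # stand-ins for exposure feature vectors
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Over-sampling of this kind is normally applied to the training portion only, so that duplicated samples do not leak into the data used for evaluation.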
In this paper, we explore the above classification methods to predict the quantile PEF zone of an individual based on his/her 24 exposure variables. Let
Confusion matrix
Model performance: average values of metrics (3 quantile zones using

Prediction confusion matrices (3 quantile zones using
The confusion matrix is a commonly used method for determining the quality of a classifier. In a binary confusion matrix, the quality of a classifier is evaluated based on the ability to distinguish “positive” examples from “negative” ones. The confusion matrix is shown in Table 5 on page 542, where
Average performance metrics of the five classification models using the quantile zoning are presented in Table 6 on page 543. The analysis results of the models show reasonable average evaluation scores for use in health risk prediction as compared to many models in health applications. For example, the average accuracy of the models was between 48% and 60%. These metrics, however, do not always represent the quality of the prediction models. This can be seen in the confusion matrix of each model presented in Fig. 6 on page 543. Although the proposed sampling method improved the models’ handling of the imbalanced data distribution, predicting the minority class may still result in many false risk predictions.
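The gap between average accuracy and the number of false warnings can be seen in a small worked example. The sketch below builds a binary confusion matrix for the red zone from hypothetical labels and derives accuracy, precision and recall from it.

```python
def binary_confusion(actual, predicted, positive="red"):
    """Count true/false positives/negatives for one class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall, with 0.0 when a ratio is undefined."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: the classifier raises three false "red" warnings.
actual = ["red", "yellow", "yellow", "yellow", "red", "yellow", "yellow", "yellow"]
predicted = ["red", "red", "yellow", "red", "yellow", "red", "yellow", "yellow"]
counts = binary_confusion(actual, predicted)
print(counts)
print(metrics(*counts))
```

In this toy case half of the predictions are correct, yet only one in four red-zone warnings is justified, which is the kind of behavior that a single accuracy figure hides.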
Binary model performance: accuracy (2 quantile zones using
On the other hand, for more practical use of the classification in health risk prediction, the target zone can be the lower PEF value zone (higher risk zone) and this reduces the number of the classes to two, the area below
Although the above performance metrics involve visualizations in terms of the confusion matrices that aid practitioners in integrating direct observations with theoretical models, the precise practical implications of the metrics can be difficult for non-specialists to understand. Indeed, metrics such as the
We present a two-step regression modeling process based on logistic regression and quantile regression that aims to deliver predictions that are easy to understand and use.
Logistic regression model
The logistic regression (LR) model is often used to estimate the probability of a binary response such as pass/fail or healthy/sick, based on one or more independent variables. It can also be used to analyze the effect that the presence of a risk factor has on the odds of a given outcome. Logistic regression is commonly used in medical fields to predict the risk of developing a given disease (e.g. diabetes or coronary heart disease), based on observed characteristics of the patient [20]. Logistic regression works by selecting the combination of coefficients for the linear combination
From a clinical standpoint, logistic regression has the advantage of having output metrics that are relatively easily understood. For example, a typical output would be “Based on your previous exposure, you have a
Quantile regression model
The Quantile regression (QR) model aims at estimating the conditional median or other quantiles of the response variable. In ordinary linear regression, one attempts to find the best linear equation
A two-step model
We propose a two-step prediction model combining logistic regression with quantile regression;

Moving-window.
While numerous statistical tests exist for assessing the goodness of fit of logistic and quantile regression models [26,38], they are generally quite technical and their results are difficult to convey to patients for use in managing their health. Thus, we propose two new evaluation metrics, which yield immediately understandable information on the model’s performance to patients and practitioners. Evaluation of the models in both steps of our process was conducted with a moving window approach. This approach addresses both the number W of previous environmental exposures and PEF readings used to train the model and the number m of future PEF values predicted by the model before it is retrained. With advanced computing power and memory, updating the model every day is possible, but in real world applications saving computing power and memory is recommended unless more frequent updates of the model are required to maintain acceptable prediction errors. Figure 7 on page 545 illustrates the use of the moving window (W and m) in the analysis process. One might wonder why we do not take a larger training dataset, or even use all of the available data for training. The reason for avoiding this strategy is that it may result in losing temporal information, since the relationship between health risks and environmental variables is dynamic and typically changes over time.
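The moving-window scheme can be sketched as a generator of training and prediction index ranges. The function name and the example sizes are hypothetical; only the roles of W and m follow the description above.

```python
def moving_windows(n_days, W, m):
    """Yield (train_range, test_range) index pairs.

    Each model is trained on the previous W days and used for the next m days
    before the window slides forward by m days and the model is retrained.
    """
    start = 0
    while start + W < n_days:
        train = range(start, start + W)
        test = range(start + W, min(start + W + m, n_days))
        yield train, test
        start += m


# Hypothetical sizes: 20 days of data, 10-day training window, retrain every 4 days.
for train, test in moving_windows(n_days=20, W=10, m=4):
    print(f"train days {train.start}-{train.stop - 1}, "
          f"predict days {test.start}-{test.stop - 1}")
```

Setting m = 1 corresponds to retraining every day, while larger values of m trade prediction accuracy for fewer model updates.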
For any particular application of a model, the optimal values of W and m depend on many problem-specific factors including the desired model accuracy, computing power available and specific nature of the data. Therefore, the optimal values must be established during a period of training and testing in which the set of feasible pairs
Logistic regression model evaluation
In this section, we discuss only the pairing
Logistic regression analysis (using critical PEF value based on the 20th percentile)
Assessing the quality of a model application by using a binning strategy can be difficult due to coarseness in the dataset not matching the bin cutoffs and also due to small bin sizes. Therefore, in addition to assessing the quality of the logistic approach on our real data, we also generated a synthetic data set of
To evaluate the quality of the QR model, we define a uniform measure of the relative error for each quantile τ and evaluate the model through extensive experiments on individuals’ datasets in a similar way to the method described for the logistic regression. For quantile regression, however, it is possible to define a bin-free measure of the relative error as follows. First we observe that for every combination of exposure values and for each quantile τ, the quantile regression model outputs the PEF value that is expected to be the location of the τ quantile for all PEF readings taken on days with the same or similar environmental exposure values. If the model is accurate, then we would hope that for any τ, approximately a fraction τ of the test days actually have a PEF value below that day’s predicted τ quantile PEF cutoff. Thus, a uniform measure of the errors can be calculated as follows:
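The coverage check described in this paragraph can be illustrated with a short sketch. It computes the observed fraction of test days whose PEF falls below the predicted τ-quantile cutoff and compares it with τ. The absolute deviation returned here is an assumed stand-in used only for illustration and should not be read as the exact error measure of this paper; the sample values are hypothetical.

```python
def coverage_error(actual_pef, predicted_cutoffs, tau):
    """Compare the observed fraction below the predicted cutoffs with tau.

    Returns (observed fraction, absolute deviation from tau). The absolute
    deviation is an assumed error measure used only for illustration.
    """
    below = sum(1 for y, q in zip(actual_pef, predicted_cutoffs) if y < q)
    observed = below / len(actual_pef)
    return observed, abs(observed - tau)


# Hypothetical test days: observed PEF and the model's 0.25-quantile cutoff.
actual_pef = [410, 380, 360, 420, 390, 350, 400, 430]
cutoff_25 = [385, 385, 370, 390, 380, 375, 385, 395]
observed, error = coverage_error(actual_pef, cutoff_25, tau=0.25)
print(observed, error)
```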
Our analysis shows that, as expected, smaller training windows (i.e.,

Comparisons of relative errors of QR for varied window sizes.

Comparisons of average relative errors of QR for varied moving-window sizes.
With an optimal window of the QR model for each individual, we show the average relative errors of each individual’s quantile regression analysis while varying the size of the moving window m, as demonstrated in Fig. 9 on page 548. The results show that more frequent updates of the model reduce its errors, but the errors are still reasonable when
To give a more intuitive understanding of the errors associated with our proposed method’s usage, we briefly imagine that patient
Once the patient knows his/her predicted probability of falling below the critical threshold, for small probabilities, the patient knows that in many such days he/she will be above the threshold, but sometimes he/she will fall below. In order to mitigate this uncertainty, the patient can turn to the results of the quantile regression and examine, say, the predicted 25th and 75th quantile cutoffs. Given the accuracy of the quantile model shown in Fig. 8 on page 547, these values would give the patient a fairly accurate picture of where to expect the middle
Conclusions and future work
The study of environmental exposures requires synchronization of environmental data, individuals’ moving trajectories, and the behaviors of individuals, such as physical activities and route selections. By incorporating these data sources to characterize and predict path selection behavior, one can measure an individual’s exposure to certain environmental conditions. The assessment of exposure enables analysis of the relations between negative health effects and levels of the environmental factors. In this paper, we introduced the concept of exposome analytics as a paradigm to capture the multitude of environmental exposures that impact human health and chronic diseases by using data mining techniques that identify relationships, associations and causality between environmental conditions and disease patterns. Integrating exposome analytics and predictive analytics with medical domain knowledge has the potential to predict and manage future risks, and to quantify the effects of prevention and intervention.
We presented a study of predictive and exposome analytics applied to managing the risks of asthma exacerbations. We discussed the conceptual framework of health risk modeling and environmental exposure assessment as a health monitoring platform for modeling individual exposure to environmental triggers of asthma exacerbation. Moreover, we utilized well-known data mining methods to develop a prediction model of the PEF risk zone and discussed the insights from the analysis. Restructuring the zones of the PEF values based on the individual’s data values can improve the practical usefulness of the PEF zone classification. Finally, we proposed a dual-regression model to predict the probability of a health risk. The proposed approach uses a novel exposome assessment paradigm that utilizes the spatio-temporal properties of the data in the model training process and hence improves the accuracy of prediction.
Research on health risk management that captures individual-based measurement of exposure remains open to future work on topical challenges such as measuring environmental conditions at a fine scale, treating uncertainties due to errors in device measurement and data sampling, integrating different representations of spatio-temporal data, and developing holistic visual representations for both structured and unstructured data from sensors, data structures and masses of streaming real-time data. With better data collection and exposome estimations, which can lead to more accurate regression models, the proposed framework has the potential to transform healthcare with a focus on self-management by issuing and monitoring individualized recommendations concerning patients’ daily routines. Therefore, one important future direction for work in exposome and predictive analysis is the spatio-temporal dynamic modeling of the exposome. Accurate modeling of an individual’s behaviors and those behaviors’ effects on exposure in the presence of spatio-temporal uncertainty will be a key challenge to overcome in the effort to measure individual exposome values accurately enough to allow significant improvements in the accuracy of individual health monitoring systems.
Open challenges also include the development of flexible and scalable data mining models to find accurate and sufficient information on moving objects’ patterns, behaviors and trends, along with associated issues such as the parallelization of the computing model and the availability of computing resources. For quantile regression modeling, we plan to develop a method to measure the quality of the model. We also seek to monitor more asthma triggers and study their interactions, as well as therapy adherence, to understand their total impact on the improvement of asthma. We hope that our work contributes to the diagnosis and prevention of asthma attacks and improves the quality of experience of patients and healthcare professionals.
Acknowledgements
The authors would like to thank the doctors and medical staff in the Division of Allergy and Respiratory Medicine at Soonchunhyang University Bucheon Hospital for providing the data. This material is based upon work supported in part by Hanyang University under Grant No 2016473 and in part by the Information and Communication Technology Fund of the United Arab Emirates under Grant No 21T42. The study was also funded by the Korean Ministry of Environment under Grant No HI16-00136-0603-0.
