Abstract
As the level of hospital informatization rises, huge amounts of physiological data can be obtained from bedside monitors and other medical instruments. The goal of this work is to recognize diseases from physiological data through the unique combinations of representative patterns that characterize each disease. The representative patterns are clustered from the original physiological time series data, e.g. pulse, respiration rate, blood pressure, heart rate and oxygen saturation rate. Within a disease, to compose the set of representative patterns into an interrelated structure, we use Allen’s interval relations to describe the temporal relation between every two neighboring patterns. We use the Chinese Restaurant Process (CRP) to model the uncertainty of each temporal relation that links two representative patterns. The two are combined into the probabilistic model used in this work. The experimental results suggest that our model has potential for recognizing diseases.
Introduction
At present, clinical data and physiological signal data of patients with a certain disease can be readily obtained from bedside monitors in the ICU (Intensive Care Unit). With this rich supply of clinical data, it becomes possible to recognize the diseases that patients have from their corresponding physiological records using machine learning methods.
Related work
In recent years, many machine learning approaches have been applied to predict diseases from clinical data. Zhang et al. presented an algorithm based on regression-related machine learning methods to predict stroke from time series physiological patterns [12]. Xie et al. extended regression-based machine learning methods to classification in order to predict chronic obstructive pulmonary disease [11]. Loglisci et al. presented a data mining method that uses temporal patterns to analyze the physiological events that result in a certain disease [6]. Sacchi et al. [10] proposed a knowledge-based model to mine rules in temporal biomedical data. J. He et al. [5] applied change detection to complex datasets including numerical data streams to achieve multivariate association rule mining. An approach that mines the rules of a certain disease from linguistic data using fuzzy inference and subtractive clustering was proposed in [9]. A fully data-driven approach based on Allen’s interval relations was presented to extract and represent the temporal relations of atomic patterns in clinical data streams and produce a textual output [2]. Knowledge-based models are limited by domain expert knowledge, and all of the above methods have limited ability to express the uncertainty associated with describing time dependence. An interval temporal Bayesian network, which combines probability with Allen’s relations to identify complex activities, is proposed in [13]. However, the problem of repeated atomic patterns occurring in one record of a disease remains unsolved.
Our approach
In this work, we refer to the centroid time series segments clustered from each signal as representative patterns. Thus, a sequence of physiological time series data points can be represented as a time-ordered sequence of representative patterns. A disease can be depicted as several unique occurrence sets of representative patterns together with unique probability sets of the temporal relations between two representative patterns. CRP is used to derive the different composition sets of representative patterns that belong to one disease. The probability of each temporal relation occurring between two neighboring patterns is described by a Multinomial distribution. Since every temporal relation has some probability of occurring between two representative patterns, our approach captures the uncertainty of the temporal dependencies between them, which describes the relations between representative patterns more naturally.
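As a rough illustration of describing relation probabilities with a Multinomial distribution, the sketch below estimates a smoothed probability for each of Allen's 13 relations from a list of observed relations between neighboring patterns. The function name, the additive smoothing constant and the toy observations are assumptions made for illustration; the estimator actually used in this work is not reproduced here.

```python
from collections import Counter

# The 13 Allen relations (names follow the usual convention).
ALLEN_RELATIONS = [
    "before", "meets", "overlaps", "starts", "during", "finishes", "equals",
    "after", "met-by", "overlapped-by", "started-by", "contains", "finished-by",
]


def relation_probabilities(observed, smoothing=1.0):
    """Estimate Multinomial parameters over Allen relations.

    `observed` is a list of relation names seen between neighboring
    patterns. `smoothing` is a hypothetical additive constant that keeps
    every relation at a non-zero probability.
    """
    unknown = set(observed) - set(ALLEN_RELATIONS)
    if unknown:
        raise ValueError(f"unknown relations: {sorted(unknown)}")
    counts = Counter(observed)
    total = len(observed) + smoothing * len(ALLEN_RELATIONS)
    return {r: (counts[r] + smoothing) / total for r in ALLEN_RELATIONS}


# Toy observations, not taken from the dataset.
probs = relation_probabilities(["before", "before", "meets", "overlaps"])
```

With smoothing, relations that were never observed still receive a small probability, which matches the idea that every temporal relation may occur between two representative patterns.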
Model description
Interval relation description
We choose Allen’s temporal relations to describe the temporal dependencies between patterns in this work. Allen defines 13 possible temporal relations between two intervals: {before, meets, overlaps, starts, contains, finished-by, equals, after, met-by, overlapped-by, started-by, during and finishes}, denoted as
The model used in this paper is the Disease Recognition Probabilistic Model with Allen’s temporal relations. After the original physiological time series data are preprocessed, each record of a subject is represented as five representative pattern sequences, one each for pulse, respiration rate, blood pressure, heart rate and oxygen saturation rate. The representative patterns within each sequence are ordered by their start time, e.g.
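To make the 13 relations concrete, the following minimal sketch classifies the relation of one interval relative to another. It assumes each pattern occurrence is given as a hypothetical `(start, end)` pair with `start < end`; the function name and this representation are illustrative and are not taken from the model definition.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a relative to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")

    # Disjoint or touching intervals.
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"

    # From here on the two intervals share some interior time.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped-by"
```

For example, `allen_relation((0, 7), (5, 10))` gives `"overlaps"`, while swapping the arguments gives the inverse relation `"overlapped-by"`.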
Temporal relations on edges
We use
Pattern generation
To depict the repeated patterns within one graph and to generate pattern sets for different examples of diseases, we introduce latent variables from the Chinese Restaurant Process and Latent Dirichlet Allocation. Assume a room has an infinite number of bags. Each bag holds a set of probabilities, one for each representative pattern occurring. The same pattern has different probabilities in different bags. We use balls to describe the process of generating patterns. We assume each representative pattern from each record is assigned to a ball, and each ball is attached to a bag. Suppose a ball tends to be put into one of the bags from a group in which its representative pattern has a higher probability of occurring. A ball chooses a bag as follows: A. The first ball always chooses the first bag, with probability 1. B. Each following ball chooses a bag according to the steps below:
In formula (3),
Given a ball is put into a bag
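The bag-choosing process can be sketched with the standard Chinese Restaurant Process prior. This sketch assumes the usual rule, in which a ball joins an existing bag with probability proportional to the number of balls already in it and opens a new bag with probability proportional to a concentration parameter `alpha`. The names `crp_assign`, `alpha` and `seed` are hypothetical, and the sketch deliberately leaves out the per-bag pattern probabilities that the full model also takes into account.

```python
import random


def crp_assign(num_balls, alpha=1.0, seed=0):
    """Assign balls to bags under a standard CRP prior.

    Returns (assignments, bag_sizes), where assignments[i] is the index
    of the bag chosen by ball i.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    bag_sizes = []      # number of balls currently in each bag
    assignments = []
    for _ in range(num_balls):
        # Existing bags are weighted by their size, a new bag by alpha.
        # For the first ball only the new-bag option exists, so it is
        # chosen with probability 1.
        weights = bag_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(bag_sizes):
            bag_sizes.append(1)          # open a new bag
        else:
            bag_sizes[choice] += 1       # join an existing bag
        assignments.append(choice)
    return assignments, bag_sizes


assignments, bag_sizes = crp_assign(num_balls=20, alpha=1.0, seed=42)
```

Larger values of `alpha` make new bags more likely, so the number of distinct pattern compositions grows with the data rather than being fixed in advance.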
Experimental metrics
Dataset
The database used in this paper is MIMIC II. The MIMIC II Waveform Database contains records of numeric time series of physiological signals and physiologic waveforms [8]. Each record can be traced to exactly one patient. In this work, we chose five physiological signals to analyze: HP (heart rate), ABP (arterial blood pressure), PULSE, RESP and SpO2 (oxygen saturation rate). The records we chose are continuous numeric physiological data sampled once a minute. The diseases considered are angina, heart failure and brain damage, and the chosen records belong to subjects diagnosed with one of these diseases. Information about the chosen records is listed in Table 1. We encountered some difficulties in using the physiological signal records in the MIMIC II database: most of the records are noisy and contain blanks or abnormal data. To solve this problem [7], data preprocessing is carried out before analysis. The preprocessing consists of deleting data points that fall outside the normal range and reducing noise in the data sequences.
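The two preprocessing steps can be sketched as follows. The range limits and the choice of a sliding median filter for noise reduction are assumptions made for illustration only; the actual thresholds and denoising method are not specified in this section.

```python
from statistics import median


def remove_out_of_range(values, low, high):
    """Drop blanks (None) and points outside the range [low, high]."""
    return [v for v in values if v is not None and low <= v <= high]


def median_smooth(values, window=3):
    """Reduce noise with a sliding median (an assumed denoising choice).

    `window` should be odd; windows are truncated at the series edges.
    """
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# Toy heart rate series with a blank and two abnormal readings.
# The limits 30 and 220 are hypothetical, not clinical guidance.
raw = [80, None, 300, 75, 0, 82]
clean = remove_out_of_range(raw, low=30, high=220)
smooth = median_smooth(clean, window=3)
```

Deleting points shortens the series, so in practice the timestamps of the remaining points would need to be kept alongside the values.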

Four atomic patterns of respiratory rate signal among records of all diseases.
Diseases and information of chosen records

Four atomic patterns of oxygen saturation signal among records of all diseases.
Segmentation
In order to relate each sequence of time series data points to a corresponding sequence of representative patterns [3], the first step after preprocessing is segmentation. Based on the results obtained with different segment lengths in the experiment, the length of each segment is set to 20. Each time series of HP, ABP, PULSE, RESP and SpO2 in every chosen record is segmented. In a segmented time series, the segments are placed head to tail. These consecutive segments are then clustered, and the atomic patterns of each physiological signal are obtained from the corresponding segment sequences.
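A minimal sketch of head-to-tail segmentation with a fixed length of 20 is given below. Dropping a trailing incomplete segment is an assumption; how leftover points are handled is not stated here.

```python
def segment(series, length=20):
    """Split a series into consecutive, non-overlapping segments.

    Points left over at the end that do not fill a whole segment are
    dropped (an assumed convention).
    """
    if length <= 0:
        raise ValueError("length must be positive")
    n_full = len(series) // length
    return [series[i * length:(i + 1) * length] for i in range(n_full)]


# 45 one-minute samples give two full segments; 5 samples are dropped.
segments = segment(list(range(45)), length=20)
```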

Four atomic patterns of blood pressure signal among records of all diseases.
In this work, K-means with Euclidean distance is used to cluster the obtained segments. The segmented sequence of each signal is clustered into a suitable number of cluster centers, which are treated as atomic patterns [4]. In the sequence of segments, each segment is replaced with its nearest cluster centroid, which yields the atomic pattern sequence corresponding to each time series. Figures 1–4 show some of the atomic patterns of the oxygen saturation rate, respiratory rate, blood pressure and heart rate signals among all diseases.
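The clustering step can be sketched as a plain Lloyd-style K-means over equal-length segments, after which each segment is replaced by the index of its nearest centroid. Initialising the centroids from randomly sampled segments, the iteration cap and the function names are assumptions for illustration; the number of clusters used in the experiments is not given in this section.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length segments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(segment, centroids):
    """Index of the centroid closest to the segment."""
    return min(range(len(centroids)),
               key=lambda j: euclidean(segment, centroids[j]))


def kmeans(segments, k, iterations=50, seed=0):
    """Cluster segments; return (centroids, labels)."""
    rng = random.Random(seed)
    # Assumed initialisation: k distinct segments chosen at random.
    centroids = [list(s) for s in rng.sample(segments, k)]
    for _ in range(iterations):
        labels = [nearest(s, centroids) for s in segments]
        updated = []
        for j in range(k):
            members = [s for s, lab in zip(segments, labels) if lab == j]
            if members:
                # New centroid is the point-wise mean of its members.
                updated.append([sum(col) / len(members)
                                for col in zip(*members)])
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(s, centroids) for s in segments]
    return centroids, labels


# Toy segments of length 2; real segments would have length 20.
toy = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids, pattern_sequence = kmeans(toy, k=2)
```

Here `pattern_sequence` plays the role of the atomic pattern sequence: one centroid index per segment, in time order.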

Four atomic patterns of heart rate signal among records of all diseases.
Table 2 shows the precision, recall and F1-measure results over 2-fold cross-validation. The average accuracy is 0.3571. The precision for the second disease, heart failure (HF), is relatively high, reaching 0.375. The method performs better for HF than for the other diseases, and the worst result appears when identifying brain damage.
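For reference, per-disease precision, recall and F1-measure can be computed from true and predicted labels as in the sketch below. The labels in the example are toy values chosen only to exercise the function; they are not the experimental data behind Table 2.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} for each class label."""
    metrics = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only.
truth = ["angina", "HF", "HF", "brain damage"]
predicted = ["HF", "HF", "angina", "brain damage"]
scores = per_class_metrics(truth, predicted, ["angina", "HF", "brain damage"])
```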
Experimental results
In this paper, we used a Disease Recognition Probabilistic Model with Allen’s temporal relations. The results indicate that the method has potential for disease recognition. The biggest factor that lowered our experimental results is clustering: the clustering method we used is better suited to low-dimensional data and is not very suitable for high-dimensional time series data. Our future work is to investigate better methods for clustering time series data.
Acknowledgements
This work was supported by grants from the Fundamental Research Funds for the Key Research Program of Chongqing Science & Technology Commission (grant no. cstc2017rgzn-zdyf0064), the Chongqing Provincial Human Resource and Social Security Department (grant no. cx2017092), and the Central Universities in China (grant nos. 2018CDXYRJ0030, CQU0225001104447).
