Abstract
This study proposes a framework for developing a sharable predictive model of diabetic nephropathy (DN) to improve the clinical efficiency of automatic DN detection in data-intensive clinical scenarios. Different classifiers have been developed for early detection, but the heterogeneity of data makes meaningful use of such models difficult. Decision tree (DT) and random forest (RF) classifiers were trained on a de-identified electronic medical record dataset from 6,745 patients with diabetes. After model construction, the classification rules obtained from the classifier were coded in a standard PMML file. A total of 39 clinical features from 2,159 labeled patients were included as risk factors in DN prediction after data preprocessing. The mean testing accuracy of the DT classifier was 0.8, which was comparable to that of the RF classifier (0.823). The DT classifier was chosen to be recoded as a set of operable rules in a PMML file that could be transferred and shared. This indicates that the proposed framework of constructing a sharable prediction model via PMML is feasible and will promote the interoperability of trained classifiers among different institutions, thus achieving meaningful use in clinical decision making. The framework will be applied at multiple sites to further verify its feasibility.
Introduction
Diabetic nephropathy (DN) refers to chronic kidney disease caused by diabetes. It is regarded as one of the main complications of diabetes worldwide, particularly in Asian regions, and the risk of incidence increases year by year as the disease progresses. DN has replaced chronic glomerulonephritis as the leading cause of chronic kidney disease in China, and a peak of uremia is expected in the next 10 to 20 years if no timely interventions or treatments are provided for the affected population [1]. DN is defined as a glomerular filtration rate (GFR) below 60 mL/min/1.73 m² and/or a urinary albumin/creatinine ratio (UACR) above 30 mg/g for more than 3 months [1]. Patients with DN have a wide range of clinical presentations. The disease is characterized by an increase in urinary albumin, a gradual decline in GFR [3], and an increase in arterial blood pressure [4], with microalbuminuria being the earliest clinically detectable indicator of DN risk. Multiple risk factors affecting the prognosis of DN have been identified, including age, long duration of diabetes, elevated blood pressure, poor glycemic control and the presence of retinopathy [5]. In addition, the onset of DN is relatively insidious, but the disease progresses rapidly once massive proteinuria appears. Renal function is often already severely impaired when patients with diabetes are diagnosed with DN, which eventually leads to renal failure and endangers life unless the patient is maintained on dialysis or receives a kidney transplant. The reported prevalence of DN among patients with type 2 diabetes in China is 10%–40%. Compared with diabetic patients without DN, patients with DN have higher mortality. Nevertheless, the awareness rate of DN among Chinese diabetic patients is less than 20%, and the treatment rate is less than 50% [2].
Early diagnosis, prevention and delay of DN through emerging machine learning methods are therefore of great significance for reducing the occurrence of macrovascular events, improving the survival rate and improving the quality of life [6–8].
Establishing scientific and effective predictive models is of great significance for the management of patients with chronic diseases and for the effective prevention and control of complications. The randomized controlled trial (RCT) is currently considered the gold standard for medical decision support [9]. As interventional studies, clinical trials are carried out in a pre-set environment with a preselected population, predefined clinical data entries and a prearranged observation window. However, outcomes extrapolated beyond the original scope of clinical trial data cannot precisely reveal the optimal pathways in terms of efficacy and effectiveness for real-world populations [10]. With the wide adoption of electronic health records (EHRs) in recent years, a growing number of studies have built accurate clinical prediction models by applying machine learning to real-world data, demonstrating the value of such data.
Traditional clinical prediction modeling includes two challenging tasks: model development and model deployment. However, in most existing work the research ends with the evaluation of model feasibility, and it is difficult to achieve further valuable outcomes without attempting to deploy these models in real-world health data analysis. In addition, the openness of health and medical data is limited by laws and regulations, social attitudes, economic conditions and other factors. Therefore, deploying a trained predictive model effectively in medical practice remains a great challenge. While pursuing the accuracy of predictive models, researchers are beginning to focus on the deployment and dissemination of predictive algorithms and analytical tools in routine clinical decision support to realize the true clinical value of prediction. For example, Khalilia et al. [11] developed a clinical predictive model of ICU mortality by standardizing the ICU and ExactData chronic disease outpatient datasets with the OMOP common data model (CDM) and deployed it through web services based on Fast Healthcare Interoperability Resources (FHIR). The OMOP CDM is a well-established generic model of medical data that ensures the standardization of data from the different sources involved in an analysis.
Because a model must be redeployed and re-encoded for each subsequent application environment after it is built, medical researchers have to work together with programmers to guarantee that the model is expressed completely. A standard structure that conveys the whole model and is easy to convert would greatly reduce the difficulty of model deployment. In this study, a standard Predictive Model Markup Language (PMML) file was generated from a predictive model of DN on the KNIME platform, and this file can be deployed in other development environments.
In this study, the DN prediction model built from patients’ laboratory examination data achieved good accuracy, and the proposed framework of constructing a sharable prediction model via PMML was feasible and is expected to promote the interoperability of trained classifiers among different institutions.
Materials and methods
Framework and toolset
The proposed framework contains four sections: information extraction, model construction, model optimization and PMML output (Fig. 1). In the information extraction section, the clinical records of the dataset were included. During cohort selection, in order to include as many research-worthy patients as possible, the researchers agreed, after consulting endocrinologists, that patients with DN or renal insufficiency recorded at least once in the diagnoses were selected as the case group, and diabetic patients without any complications recorded in the three diagnoses were selected as the control group. During model construction, classifiers were trained on the assigned features and predefined categories. After training, the classifier could predict the class of an unseen patient record. Such a model could be refined either by processing noisy and missing data or by selecting a robust classifier that can deal with these issues. Finally, the optimized model could be exported via PMML and executed on other platforms for meaningful use.

The flow chart of the framework of PMML-based shareable model.
The Konstanz Information Miner (KNIME) is an open-source data analysis platform based on Eclipse [12] that offers strong data integration capabilities and many analysis functions. It is compatible with various data forms such as text, databases and images, and can visualize analysis results. Because of these advantages, KNIME was selected as the toolset to implement the proposed framework.
The experimental dataset was obtained from the National Scientific Data Sharing Platform for Population and Health and originates from the General Hospital of the People’s Liberation Army (PLAGH). It contains the records of a total of 6,745 patients with diabetes who were hospitalized in 2009. In addition to the official guarantee of the authenticity and reliability of this dataset, a number of studies have confirmed its value for data analysis [13–15]. The dataset contains ten tables: diagnosis, laboratory tests, medical records, vital signs, routine urine examinations, biochemical examinations, glycosylated hemoglobin examinations, medical orders, bills and patient index. The detailed contents and quantities are shown in Table 1.
The general components of the dataset
As DN is one of the main microvascular complications of diabetes mellitus, periodic examination of the kidneys is indispensable for patients with diabetes. The degree of renal function impairment can be seen directly from the results of various laboratory test indicators. In the early stage of renal damage, patients show few obvious symptoms of DN. The main clinical manifestation is proteinuria, and routine urine examination is the main means of determining whether renal function is normal [16]. Therefore, the qualitative routine urine examination plays a significant role in the detection of diabetic proteinuria. Additionally, biochemical examination can detect the severity of kidney injury in time through abnormalities in biochemical indicators such as blood urea nitrogen (BUN) and serum creatinine (Cr) [17]. Hence, patients with biochemical, glucose and routine urine examination records were selected from the dataset, and for each patient the most recent examination records were selected according to the inspection time. The routine urine, biochemical and glycosylated hemoglobin examination records of the same patients were combined to construct the clinical cohort for DN prediction, as shown in Fig. 2. The different inspection records were read in by the three meta-nodes on the left and then preprocessed by data processing nodes such as Column Rename, Column Filter, String to Date/Time, Sorter and Group By. Finally, records with consistent fields and formats were aggregated by Joiner nodes.

The diagram of dataset integration in KNIME.
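The cohort assembly described above (keep the most recent record per patient in each examination table, then join the tables on the patient ID) can be sketched outside KNIME as follows. This is only a minimal illustration in plain Python, not the workflow used in the study; the field names `patient_id`, `exam_time`, `urine_protein`, `creatinine` and `hba1c`, the date format and all values are hypothetical.

```python
from datetime import datetime

def latest_per_patient(records):
    """Keep only the most recent record for each patient_id."""
    latest = {}
    for rec in records:
        pid = rec["patient_id"]
        when = datetime.strptime(rec["exam_time"], "%Y-%m-%d")
        if pid not in latest or when > latest[pid][0]:
            latest[pid] = (when, rec)
    return {pid: rec for pid, (when, rec) in latest.items()}

def join_on_patient(*tables):
    """Inner join: keep patients present in every table and merge their fields."""
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)
    merged = {}
    for pid in sorted(common):
        row = {"patient_id": pid}
        for table in tables:
            for key, value in table[pid].items():
                if key not in ("patient_id", "exam_time"):
                    row[key] = value
        merged[pid] = row
    return merged

# Hypothetical toy records standing in for the three examination tables.
urine = [
    {"patient_id": "P1", "exam_time": "2009-01-05", "urine_protein": "+"},
    {"patient_id": "P1", "exam_time": "2009-03-10", "urine_protein": "-"},
    {"patient_id": "P2", "exam_time": "2009-02-01", "urine_protein": "++"},
]
biochem = [
    {"patient_id": "P1", "exam_time": "2009-03-11", "creatinine": 88.0},
    {"patient_id": "P2", "exam_time": "2009-02-02", "creatinine": 140.0},
]
hba1c = [
    {"patient_id": "P1", "exam_time": "2009-03-11", "hba1c": 7.2},
]

cohort = join_on_patient(
    latest_per_patient(urine),
    latest_per_patient(biochem),
    latest_per_patient(hba1c),
)
print(cohort)
```

In this toy example only patient P1 has all three examinations, so P2 is dropped by the inner join, which mirrors the exclusion of patients without the necessary examination records.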
Careful selection of risk factors strengthens the rationality and accuracy of a DN risk model. In this study, real-world data were used to discover the underlying knowledge in DN. Unlike traditional hypothesis testing, including as many factors as possible helps improve the accuracy of the model. The integrated dataset mainly consisted of general patient data, diagnoses, laboratory examinations, HbA1c tests and routine urine test records with varying degrees of missing values. Missing numerical variables were filled with the mean value and missing nominal variables with the mode. The inspection indicators were discretized into intervals according to the normal range of each indicator. In feature selection, the total bilirubin, direct bilirubin, lipase, iron and unsaturated iron-binding capacity features were removed because more than 50% of their values were missing. The qualitative test of urinary bilirubin was also removed because its values were uniformly negative.
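The two preprocessing rules described above (drop features with more than 50% missing values, then fill numerical gaps with the mean and nominal gaps with the mode) can be illustrated with a short standard-library Python sketch. It is not the KNIME node configuration used in the study; the column names and values are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean, mode

def columns_to_keep(rows, max_missing=0.5):
    """Return the columns whose share of missing (None) values is at most max_missing."""
    kept = []
    for col in rows[0]:
        missing = sum(1 for r in rows if r[col] is None)
        if missing / len(rows) <= max_missing:
            kept.append(col)
    return kept

def impute(rows, numeric_cols, nominal_cols):
    """Fill numeric gaps with the column mean and nominal gaps with the column mode."""
    fill = {}
    for col in numeric_cols:
        fill[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in nominal_cols:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return [
        {col: (fill[col] if r[col] is None and col in fill else r[col]) for col in r}
        for r in rows
    ]

# Hypothetical toy data: lipase is missing in 3 of 4 rows and is therefore dropped.
rows = [
    {"creatinine": 80.0, "urine_protein": "-", "lipase": None},
    {"creatinine": None, "urine_protein": "-", "lipase": None},
    {"creatinine": 100.0, "urine_protein": None, "lipase": 30.0},
    {"creatinine": 120.0, "urine_protein": "+", "lipase": None},
]

kept = columns_to_keep(rows)
reduced = [{col: r[col] for col in kept} for r in rows]
filled = impute(reduced, numeric_cols=["creatinine"], nominal_cols=["urine_protein"])
print(kept)
print(filled)
```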
After the data of the three laboratory tests were integrated with the patient ID as the key, patients without the examination records necessary for analysis were excluded. The main purpose of this work was to classify patients with diabetes into those without and those with DN or potential kidney dysfunction. Before training, the patients’ diagnoses were used as the main inclusion criterion to form the case group with DN and the control group without complications of diabetes. Patients with unrelated diagnoses were excluded.
Random forest (RF) and decision tree (DT) classifiers were trained after data preprocessing, and their prediction results were compared. The preprocessed dataset was divided into 80% training samples and 20% testing samples by the Partitioning node, as shown in Fig. 3. DT is a supervised learning algorithm that splits the dataset according to a criterion that maximizes the separation of the data, resulting in a tree-like structure. RF is an ensemble approach that can be thought of as a form of nearest neighbor predictor [18]. The classification process of each algorithm was encapsulated in a learner node that performs classification training on the input data. Once training was completed, the trained model and the data to be predicted were connected to the predictor node for prediction, and performance was evaluated by connecting a Scorer node. In addition, the importance of the features involved in disease prediction was determined by the tree ensemble algorithm.

The diagram of model construction workflow in KNIME.
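To make the partitioning step and the notion of a split criterion concrete, the following plain Python sketch performs an 80/20 split and then searches for the single threshold that best separates two classes. It is not the KNIME Decision Tree Learner: the choice of Gini impurity as the criterion, the one-split "stump", the field names `creatinine` and `dn`, and the toy values are all assumptions made for illustration. A full decision tree applies the same search recursively to each resulting subset.

```python
import random

def partition(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_threshold(rows, feature, target):
    """Find the threshold on one numeric feature that minimizes weighted Gini impurity."""
    values = sorted({r[feature] for r in rows})
    best = (None, float("inf"))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r[target] for r in rows if r[feature] <= threshold]
        right = [r[target] for r in rows if r[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[1]:
            best = (threshold, score)
    return best

def majority(labels):
    """Most frequent label in a list."""
    return max(set(labels), key=labels.count)

# Hypothetical, perfectly separable toy data.
rows = [{"creatinine": v, "dn": "no"} for v in (60, 70, 80, 90)]
rows += [{"creatinine": v, "dn": "yes"} for v in (150, 160, 170, 180, 190, 200)]

train, test = partition(rows)
threshold, impurity = best_threshold(train, "creatinine", "dn")
left_label = majority([r["dn"] for r in train if r["creatinine"] <= threshold])
right_label = majority([r["dn"] for r in train if r["creatinine"] > threshold])

predicted = [left_label if r["creatinine"] <= threshold else right_label for r in test]
accuracy = sum(p == r["dn"] for p, r in zip(predicted, test)) / len(test)
print(threshold, impurity, accuracy)
```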
Another main purpose of this study was to construct an interpretable classification model that can be easily applied in a clinical setting. The DT classifier holds an advantage over RF here, as it classifies individuals across a range of numerical and categorical features by following explicit rules in a way that is similar to human decision making [19]. RF, on the other hand, is not as easy to interpret because its trees are randomly generated and added to the forest. Therefore, we focused on using DT for the subsequent generation of PMML files and the deployment of DN prediction.
The import and export of PMML documents in a model construction workflow were mainly implemented by the “PMML Reader” and “PMML Writer” nodes in the KNIME platform (Fig. 4). The “PMML Writer” node writes the PMML model from the PMML model port to a PMML v4.0 compliant file. A limitation of the current use of KNIME is that the data preprocessing steps cannot be written directly into the PMML file together with the model. However, complete model construction can be achieved through a merge approach, which both makes the workflow easier to understand and helps integrate the model build. Several KNIME nodes are useful for this. The “Empty PMML Creator” node creates an empty PMML document that serves as the starting point of a workflow that generates PMML documents in a modular fashion; transformations and models can then be added to the document. Finally, the “PMML Model Appender” node was used to combine the two PMML document fragments containing the model information and the data transformation steps when they were generated separately.

The diagram of workflow for PMML export in KNIME.
The characteristics of study cohorts
A total of 2,159 patients were included after the three examination records were assembled and screened, and 39 characteristic parameters were included as risk factors in DN prediction after data preprocessing. The descriptive statistics of the processed numerical variables are shown in Table 2. The qualitative features include gender, cost category, diagnosis, urobilinogen qualitative test, urine yeast cell, GLU, urine color, urine nitrite test, urine turbidity, UABT and urine protein qualitative test. According to the case extraction rules for the case and control groups, 342 patients without complications were identified as the control group from 2,197 patients with diabetes, while 193 patients had at least one related DN diagnosis record (Fig. 5). There were 130 males and 63 females in the case group, aged 35 to 87 years with an average age of 60 years. In the control group, there were 205 males and 137 females, aged 12 to 87 years with an average age of 56.3 years. Considering that DN is more common in the elderly, the difference in average age between the two groups is consistent with clinical reality.
Characteristics of numeric attributes included in the analysis

Flow chart of patient inclusion.
After the models were run in KNIME, the mean of the results of 10 runs was taken. Random forest showed better classification performance than the decision tree algorithm and was more suitable for the prediction of DN in this experiment, as shown in Table 3.
The comparison of DN prediction models
A total of 39 variables were retained for analysis after the data preprocessing stage; factors with a large proportion of missing values were excluded. The importance of a feature differs according to the prediction objective. Selecting the most valuable features from a large number of candidates reduces unnecessary computational cost. The “Tree Ensemble Learner” node in KNIME provides detailed information about feature importance, and it was used in this study to measure the importance of features. The important features in the prediction are shown in Table 4.
The importance of features in the model training
Among the 39 variables, the “Creatine kinase” and “Urinary ketone body experiments” attributes had little to do with the prediction of DN, and the ROC curve improved slightly after they were removed, because removing them reduced the interference of unrelated factors to some degree (Fig. 6).

The ROC of the DN model before and after feature extraction in KNIME.
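For readers who wish to quantify a change in the ROC curve such as the one reported above, the area under the curve (AUC) can be computed from predicted scores and true labels. The sketch below is a generic standard-library implementation using the trapezoidal rule; it is not how KNIME produced Fig. 6, and the scores and labels are hypothetical values, not results from this study.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical predicted probabilities of DN and true labels (1 = DN, 0 = no DN).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Comparing the AUC of a model trained on all features with that of a model trained after removing low-importance features gives a single number for the kind of before-and-after comparison shown in the figure.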
The PMML files were exported directly from the DT as a set of generated rules (Fig. 7). The structure of a PMML file includes the data dictionary, mining schema, data transformations, model definition, expected model output, post-processing steps after model output, model explanation, model verification, etc.

Main structure of PMML file generated by DN prediction model.
The data dictionary section defined all the information about the predictor and target variables covered in this analysis, including feature names, measurement levels and data types. For categorical variables, the values can be declared as valid, missing or invalid, and for continuous numerical variables one or more ranges of valid values can be specified. For example, because the patient’s gender in the dataset was a categorical variable with a character data type, the PMML file listed the two values “male” and “female”.
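As an illustration of how a consuming application might use the data dictionary, the sketch below parses a small DataDictionary fragment and checks whether a record uses only declared categorical values. The fragment is hand-written for this example and is not taken from the PMML file generated in the study; the field names, the interval bounds and the helper functions `allowed_values` and `is_valid` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; the bounds are hypothetical.
DATA_DICTIONARY = """
<DataDictionary numberOfFields="2">
  <DataField name="gender" optype="categorical" dataType="string">
    <Value value="male"/>
    <Value value="female"/>
  </DataField>
  <DataField name="creatinine" optype="continuous" dataType="double">
    <Interval closure="closedClosed" leftMargin="0" rightMargin="2000"/>
  </DataField>
</DataDictionary>
"""

def allowed_values(xml_text):
    """Map each categorical field name to the set of values declared for it."""
    root = ET.fromstring(xml_text)
    result = {}
    for field in root.findall("DataField"):
        if field.get("optype") == "categorical":
            result[field.get("name")] = {v.get("value") for v in field.findall("Value")}
    return result

def is_valid(record, allowed):
    """True if every categorical field of the record holds a declared value."""
    return all(record.get(name) in values for name, values in allowed.items())

allowed = allowed_values(DATA_DICTIONARY)
print(allowed)
print(is_valid({"gender": "male", "creatinine": 88.0}, allowed))
print(is_valid({"gender": "unknown", "creatinine": 88.0}, allowed))
```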
In different application environments, the same data processing method must be applied to convert the data to be predicted into the format specified by the model. PMML defines a variety of data transformation operations, such as data discretization, data standardization, feature selection and some others. For instance, the alanine aminotransferase values were discretized into the <40, 40–80 and >80 groups.
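The alanine aminotransferase example can be sketched as follows: one table of bins drives both the discretization in Python and the generation of a PMML `Discretize` transformation, so that the same rule travels with the model. This is a minimal sketch rather than the output of KNIME; the field names `ALT` and `ALT_group` are hypothetical, and assigning the boundary values 40 and 80 to the middle group is an assumption, since the text does not state how boundaries were handled.

```python
import xml.etree.ElementTree as ET

# (bin label, PMML closure, left margin, right margin); None means unbounded.
ALT_BINS = [
    ("<40", "openOpen", None, 40),
    ("40-80", "closedClosed", 40, 80),
    (">80", "openOpen", 80, None),
]

def discretize(value, bins=ALT_BINS):
    """Return the label of the bin that contains value, or None if no bin matches."""
    for label, closure, left, right in bins:
        above = left is None or value > left or (closure.startswith("closed") and value == left)
        below = right is None or value < right or (closure.endswith("Closed") and value == right)
        if above and below:
            return label
    return None

def to_pmml(field, derived_name, bins=ALT_BINS):
    """Build a PMML DerivedField element that expresses the same bins."""
    derived = ET.Element("DerivedField", name=derived_name,
                         optype="categorical", dataType="string")
    disc = ET.SubElement(derived, "Discretize", field=field)
    for label, closure, left, right in bins:
        bin_el = ET.SubElement(disc, "DiscretizeBin", binValue=label)
        attrs = {"closure": closure}
        if left is not None:
            attrs["leftMargin"] = str(left)
        if right is not None:
            attrs["rightMargin"] = str(right)
        ET.SubElement(bin_el, "Interval", attrs)
    return derived

derived = to_pmml("ALT", "ALT_group")
print([discretize(v) for v in (25, 40, 80, 81)])
print(ET.tostring(derived, encoding="unicode"))
```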
In the mining schema section, the predictors and target variables used by the model were declared, while the properties of each variable were defined in the data dictionary.
A TreeModel in PMML carries strongly hierarchical information with many nodes, and each node can contain several child nodes; predicates are defined as separate element types to keep the TreeModel from becoming complicated. Each node in a DT contains a logical predicate expression that determines how the DT branches. This function is provided by the PREDICATE group, which is a set of type definitions that includes SimplePredicate, CompoundPredicate and SimpleSetPredicate. SimplePredicate defines a simple Boolean expression with three parts: field, operator and value. The comparison operators of SimplePredicate are “equal”, “notEqual”, “lessThan”, “greaterThan”, “lessOrEqual” and “greaterOrEqual”.
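The way nested nodes and SimplePredicate elements encode decision rules can be shown with a small evaluator. The sketch below walks a hand-written TreeModel fragment and returns the score of the deepest matching node. The tree is hypothetical and is not the model produced in this study; it omits the MiningSchema that a complete TreeModel requires, and the evaluator handles only `True` and `SimplePredicate`, not CompoundPredicate, SimpleSetPredicate or missing-value strategies.

```python
import operator
import xml.etree.ElementTree as ET

# Hypothetical fragment: thresholds and field names are for illustration only.
TREE = """
<TreeModel functionName="classification">
  <Node id="0" score="no DN">
    <True/>
    <Node id="1" score="DN">
      <SimplePredicate field="creatinine" operator="greaterThan" value="133"/>
    </Node>
    <Node id="2" score="no DN">
      <SimplePredicate field="creatinine" operator="lessOrEqual" value="133"/>
      <Node id="3" score="DN">
        <SimplePredicate field="urine_protein" operator="equal" value="positive"/>
      </Node>
      <Node id="4" score="no DN">
        <SimplePredicate field="urine_protein" operator="notEqual" value="positive"/>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""

OPERATORS = {
    "equal": operator.eq, "notEqual": operator.ne,
    "lessThan": operator.lt, "lessOrEqual": operator.le,
    "greaterThan": operator.gt, "greaterOrEqual": operator.ge,
}

def matches(node, record):
    """Evaluate the node's own predicate (<True/> or <SimplePredicate>) on a record."""
    if node.find("True") is not None:
        return True
    pred = node.find("SimplePredicate")
    actual, expected = record[pred.get("field")], pred.get("value")
    if isinstance(actual, (int, float)):
        expected = float(expected)  # compare numeric fields as numbers
    return OPERATORS[pred.get("operator")](actual, expected)

def score(node, record):
    """Descend into the first matching child at each level; return the last node's score."""
    for child in node.findall("Node"):
        if matches(child, record):
            return score(child, record)
    return node.get("score")

root = ET.fromstring(TREE).find("Node")
print(score(root, {"creatinine": 150.0, "urine_protein": "negative"}))
print(score(root, {"creatinine": 90.0, "urine_protein": "positive"}))
print(score(root, {"creatinine": 90.0, "urine_protein": "negative"}))
```

Because every rule is carried by the XML itself, the same file can be scored by any PMML-aware tool without re-implementing the tree, which is the property the framework relies on.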
In this study, a DN prediction model was established on the KNIME platform from the laboratory inspection data of an open diabetes dataset from a Chinese hospital using random forest and decision tree classification algorithms, and it achieved good prediction performance. The trained model was then converted into a shareable model file based on PMML. Machine learning is a growing field concerned with the study of large, multivariate data. It grew out of the study of pattern recognition and computational learning theory in artificial intelligence and provides computational methods, algorithms and techniques for analysis and prediction. Machine learning techniques have shown success in the prediction and diagnosis of numerous critical diseases [20, 21]. Different approaches have been adopted to classify patients with or without DN at early stages. RF and DT models are the most widely used because of their simplicity and classification speed [22]. RF is a highly nonlinear classifier that works well with high-dimensional data and has shown great potential for disease classification with real-world patient data [23–25]. The advantages of the RF classifier are that its runtime is short and that it can deal with unbalanced and missing data. In this study, we found that the performance of DT was comparable with that of RF, a recognized model that works well for DN prediction. There have been many studies on the prediction of DN using real-world data extracted from EHRs. It is therefore imperative to identify the risk factors and predict the potential onset of DN in diabetic patients in a timely manner.
As shown in the results, urine protein, creatinine and urea were the most important features among the laboratory test results. In clinical practice, creatinine in biochemical tests usually refers to the serum creatinine test, which is an important indicator of whether kidney function is normal. A value exceeding 133 μmol/L means that renal insufficiency has developed and that the patient is at risk of renal failure. The urine protein content directly reflects damage to the patient’s kidney, so it is also one of the most relevant features in the prediction. In addition, renal insufficiency causes an increase in urea, the degree of which is directly proportional to the severity of the disease; this is of special value for the diagnosis of uremia in the late stage of DN. When blood glucose is poorly controlled, increased urine sugar makes the urinary tract a breeding ground for bacteria, which predisposes patients to urinary tract infection and increased urinary yeast cells. Electrolytes such as potassium, calcium and magnesium can also reflect the severity of renal injury to some extent. This study also found that the cost category of patients seeking medical treatment has a certain correlation with the development of kidney disease, which may be caused by differences in medical insurance reimbursement. There are many expense categories in China, such as “self-pay”, “new rural cooperative medical care”, “medical insurance” and “commercial insurance”. People with a high level of reimbursement probably spend more money and energy on their health, while patients with heavier financial burdens have difficulty maintaining medication compliance [26]; this needs further research that takes different reimbursement policies into account. On the other hand, the two lowest-ranked features were the “creatine kinase” and “urinary ketone body experiments” attributes. The determination of creatine kinase is mainly used to diagnose skeletal muscle and myocardial disease, and the urine ketone body test is mostly used to diagnose diabetic ketoacidosis, which is consistent with their low relevance to DN.
Feature selection affects the final efficiency of model training. The latest examination records of patients were selected in the data preprocessing step. When browsing the original dataset, we found that some patients had quite different results for the same examination at different times. For example, the urinary protein content of some patients with DN decreased from 75 mg/day to negative, which might reflect an improvement after medication. In general, the incidence of DN increases with the course of the disease: the longer the course, the higher the probability of DN. However, even patients with a similar disease course differ significantly in DN incidence because of hereditary differences [27]. Leung et al. [28] established a predictive model for genotype-phenotype risk pattern recognition, explored the potential association between patient genotypes and clinical phenotypes by combining genotypes with clinical information and classifying them with SVM and RF methods, and compared various machine learning and statistical methods. Including records of patients’ family history could greatly improve the classification performance of predictive models. In addition, the course of diabetes, BMI, medication status and many other factors have been shown to be directly related to the occurrence of DN [29–31]. Moreover, diagnosing renal lesions and predicting prognosis solely from clinical manifestations and biochemical indicators has limitations. Kidney biopsy, the “gold standard” for diagnosing kidney disease, is also believed to be applicable to the early diagnosis of DN [32].
In existing work, the research often ends with the evaluation of model feasibility, and it is difficult to produce further research value without attempting to deploy these models in real clinical scenarios. After the model was trained, the exported PMML file could be shared and applied on different platforms for further cooperation and verification. PMML is a standard language for expressing predictive analysis models, which allows predictive models to be shared among PMML-compatible applications without recoding [33]. In other words, the entire process of data transformation and classification of a predictive model can be expressed in a PMML file on one platform and deployed on another. For example, models trained in IBM SPSS Statistics could be exported as PMML files and moved to KNIME to carry out further prediction tasks, and the resulting increase in the amount of available data can help improve the accuracy of model prediction. PMML supports rapid deployment of complete predictive solutions, including data pre-processing, data post-processing and modeling techniques. This paper mainly shows the generation, data processing and model deployment of PMML model files on the KNIME platform. In KNIME, a new dataset can be scored directly by parsing the PMML file, and models can be refined according to individual requirements.
The support of multiple programming languages and common tools makes it feasible to use PMML as a standard language for models. Many packages and libraries in common data analysis tools, such as R, Python and SPSS, support generating PMML files for models built in those tools. For example, the MLlib module of Apache Spark, a mainstream big data analysis engine, supports PMML export: a model can be written in PMML form by calling its toPMML method. In Python, model files can be generated from scikit-learn models with sklearn2pmml. R supports PMML through the “XML” and “pmml” packages. Moreover, the Java library JPMML can be used to generate PMML files for models built with R, Spark MLlib, XGBoost and scikit-learn. Shareable prediction models based on unified standards will promote the sharing of research methods among institutions and the validation of results, and the increase in data will reveal more valuable knowledge.
The volume of health and medical data is huge, and its content is complex and diverse. Analysis on a single computer cannot meet the needs of practical applications. The Spark platform launched by Apache is one of the most widely applied platforms in big data analysis. Importing PMML models into Spark for big data analysis can greatly accelerate computation and allows the model to be improved by learning from new data. In KNIME, big data components such as Hadoop and Spark can be integrated through big data extensions to process medical and healthcare big data, and data processing workflows can be composed by connecting configured nodes, which makes data analysis more agile. RF was more suitable for the prediction in this study. As one of the most commonly used algorithms in current classification work, RF benefits not only from its better predictive performance but also from its ability to provide variable importance measures (VIM) while classifying, which makes it especially suitable for high-dimensional omics data research.
However, the random forest implementation in KNIME used in this paper omits some functions, such as the Gini index and the out-of-bag (OOB) error rate for calculating feature importance scores, in order to simplify the integrated algorithm for general use. As an alternative, the “Tree Ensemble Learner” node was adopted instead of RF to compute importance scores, which was the main limitation of this experiment. This study has other possible limitations. First, our DN risk model was developed in a Chinese population that is predominantly of Han nationality, so the model may not be applicable to all populations. Second, the study samples came from only one site. Further studies are necessary to validate our results in other ethnic groups and at other sites.
Conclusion
Different classifiers have been developed for the early detection of DN, but the heterogeneity of data makes meaningful use of these models difficult. In this study, decision tree (DT) and random forest (RF) classifiers were trained on a de-identified EMR dataset from 6,745 patients with diabetes hospitalized in 2009. A framework for the meaningful use of DN clinical decision making was implemented by training the DT model and transferring it into PMML via KNIME, which could be a valuable solution to the heterogeneity of data. With the encouragement of Chinese government policy, big data and artificial intelligence technologies have been widely used in medical research, especially in controlling the progression of chronic diseases such as diabetes. Future work will study how to improve the performance and interoperability of the models at other sites.
According to China’s Mid- and Long-term Plan for the Prevention and Treatment of Chronic Diseases (2017–2025), early diagnosis and prevention through prediction models are of great significance for reducing the occurrence of macrovascular events and improving the survival rate and quality of life of patients with DN. Compared with the traditional “hypothesis-driven” RCT, the “data-driven” research model based on big data brings new ideas in both breadth and depth. In addition, the requirements of the Assessment of Standardized Maturity for Healthcare Information Connectivity in China clarify the shareability of medical data, which will facilitate the migration of disease prediction models to multiple centers.
Declaration of competing interest
Declarations of interest: none.
Acknowledgments
This work was supported by grants from the National Key R&D Program of China (2018YFC1314900, 2018YFC1314902), Excellent Key Teachers in the “Qing Lan Project” of Jiangsu Colleges and Universities, the “226 Project” of Nantong, the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX19_2070) and the Jiangsu Students’ Platform for Innovation and Entrepreneurship Training Program (2020103040185E).
