Abstract
The Italian Hub of Population Biobanks (HIBP) includes both ongoing and completed studies that are heterogeneous in their purpose and in the specimens collected. This heterogeneity in starting conditions makes sharing study data very difficult because technical, ethical, and collection rights issues hamper collaboration and synergy. To overcome these difficulties and establish the “proof-of-concept” that sharing studies is achievable among Italian collections, HIBP members agreed to conduct a data-sharing pilot project. Participants agreed to the general methodology and signed a shared Data Transfer Agreement. The biobanks involved were: EURAC (Micros study), CIG (GEHA project), CNESPS (FINE, MATISS, MONICA, OEC1998, ITR [Italian Twin Register], and IPREA studies), and MOLIBANK (Moli-Sani project). Biobank data were uploaded into a common database using a dedicated informatics infrastructure. Demographic data and anthropometric and hematochemical parameters were shared for each record. Each biobank uploaded a dataset with a minimum of 1000 subjects into the common database, for a total of 5071 records. After a harmonization process, the final dataset included 3882 records. Subjects were grouped into three main geographic areas of Italy (North, Center, and South), and separate analyses were performed for men and women. The 3882 records were analyzed by multivariate logistic regression. Results were expressed as odds ratios with 95% confidence intervals. The results show several geographical differences in the lipidemic pattern, mostly regarding HDL cholesterol, which provide a strong basis for further, deeper sample-based studies. This HIBP pilot study aimed to prove the feasibility of such collaborations, and it provides a methodological prototype for future studies based on partnerships among well-established quality collections.
Introduction
The heterogeneity of sample and data management makes sharing data very difficult due to technical, ethical, and collection rights issues which hamper collaboration and synergy. 2 Thus, there is much to be gained from coordination, combination, and integration. With the objective of overcoming these difficulties, the Italian Hub of Population Biobanks (HIBP), a network of population collections including both on-going and completed studies, was established in 2011. 3 Although many collections and population biobanks are already part of European and International projects and are listed in the catalogue of Public Population Project in Genomics (p3g), 4 the number of collaborative projects based on existing collections could be increased considerably.
The aim of this study was to show the opportunities provided by the HIBP to share data of different population-based studies. Within the HIBP, biobanks that use similar approaches and methodologies or different solutions and operation workflows may coexist. The HIBP aims to encourage collaboration between these different biobanks, stimulate integration of these infrastructures, and increase the synergy between the study outcomes and Regional and National health plans. To foster interoperability and establish the “proof-of-concept” that “sharing” existing data may provide supplementary epidemiological information, the HIBP partners agreed to conduct a pilot project. The priority objective was to establish the feasibility of constructing a shared approach to the conduct of this type of project. A second objective was analysis of the data collected from different projects to find statistical associations. In addition to the ability to build such a study, the co-authors envisage that the reliability of the approach could be evaluated by comparing the associations found with current knowledge in the field.
The specific topic selected for the HIBP pilot was assessment of the associations between lipid profile data (hematochemical parameters) and the geographical area of residence, taking into account the possible effects of demographic and socio-cultural factors. Previous work has shown that the lipidemic parameters affecting cardiovascular risk are influenced by geographical gradient.5,6 One of the advantages of integrating data from different population studies is the use of regional data to cover the entire national territory.
Materials and Methods
The aims and limits of the pilot project were defined during extensive preliminary discussion among the partners, and each agreed to the methodologies to be implemented. This preliminary process required extensive work, as the group needed to achieve a degree of personal knowledge and reciprocal trust. Most of this preparatory work was done during the first phase of construction of the network and the complete publication of collection information on the network website. 3 The process was facilitated by a small nucleus of participants who agreed to implement a pilot project. Approximately 1 year after the creation of the HIBP, four meetings were dedicated to defining the terms and text of the Data Transfer Agreement (DTA). The DTA established collaboration rules, data definitions, modalities for sharing and analysis of data, and the time frame of the data to be included. The main features of the pilot project are reported in Table 1.
1Micros: Study of microisolates in South Tyrol (see Reference 8); 2GEHA: Genetics of Healthy Aging (see Reference 9); 3FINE: Finland Italy the Netherlands Elderly, Italian Cohorts (see Reference 10); 4MATISS: Malattie Aterosclerotiche Istituto Superiore di Sanità (see Reference 11); 5MONICA: Area Latina-Monitoring Cardiovascular Disease (see Reference 12); 6OEC 1998: Osservatorio Epidemiologico Cardiovascolare (see Reference 13).
The characteristics of the pilot project were assessed to establish that they were compliant with ethics policies and with the informed consents that had been collected for various studies involved in the HIBP. A basic tool that allowed comparison of consent forms was developed; it enabled identification of areas of work that complied with the original consent form information and those that did not. 7
On the basis of the DTA, four categories of data (general data, demographic data, physical parameters, and hematochemical parameters) were selected for use (Table 2). Almost all parameters were coded, and free text fields were reduced to a minimum. A website was created to allow secure submission of the data over HTTPS (hypertext transfer protocol over secure socket layer). Different degrees of computerization in the biobanks were taken into account by accepting three types of file formats: CSV (Comma-Separated Values), XLS/XLSX (Microsoft Excel), and XML (eXtensible Markup Language). A notification message generated by the website at the end of each upload confirmed the transfer and receipt of the data. Each partner participating in the pilot project uploaded data for their specific collection: EURAC (Micros study), 8 CIG (GEHA project), 9 CNESPS (FINE, 10 MATISS, 11 MONICA, 12 OEC1998, 13 Italian Twin Register, 14 and IPREA 15 studies), and MOLIBANK (Moli-Sani project). 16 CNESPS contributed six different collections; since each of these studies had been performed independently, the data had not been harmonized. Each record referred to a single subject. Additional information on these studies is available on the network website. 3
Harmonization of the final dataset for statistical analysis required considerable work and was independently performed by three people (Table 3). Units of measurement and spelling were made uniform, and the consistency of each record was carefully checked to verify coherence with the required data. The following exclusion criteria were adopted for the statistical analysis: lack of certainty on the collection of fasting blood, age ≤25 years, residence abroad, and missing data in one of the following fields: place of residence, 17 education, 18 or body mass index (BMI). 19 Moreover, underweight subjects (a pathological condition) were excluded to avoid biases in the reference group of BMI.
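The exclusion criteria listed above can be sketched as a simple record filter. This is a minimal illustration under stated assumptions, not the procedure actually used in the pilot: the field names (`fasting_confirmed`, `age`, `residence_area`, `education_class`, `bmi`) and the value `"Abroad"` are hypothetical, and the underweight cut-off of BMI < 18.5 kg/m² is assumed from the usual WHO classification rather than taken from the text.

```python
from typing import Optional

# Assumed WHO underweight cut-off (kg/m^2); the text does not state the value used.
UNDERWEIGHT_BMI = 18.5

# Hypothetical field names; the actual record layout is the one defined in Table 2.
REQUIRED_FIELDS = ("residence_area", "education_class", "bmi")


def exclusion_reason(record: dict) -> Optional[str]:
    """Return the first exclusion criterion met by a record, or None if it is kept."""
    # Lack of certainty on the collection of fasting blood
    if not record.get("fasting_confirmed", False):
        return "fasting status uncertain"
    # Age <= 25 years (every uploaded record is assumed to carry an age)
    if record["age"] <= 25:
        return "age <= 25 years"
    # Residence outside Italy
    if record.get("residence_area") == "Abroad":
        return "residence abroad"
    # Missing place of residence, education, or BMI
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            return f"missing {field}"
    # Underweight subjects are removed to keep the BMI reference group unbiased
    if record["bmi"] < UNDERWEIGHT_BMI:
        return "underweight"
    return None
```

Returning the reason rather than a plain yes/no makes it straightforward to tabulate excluded records per collection and per criterion, in the spirit of Table 3.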
Harmonized records have been used for statistical analysis.
Several different aspects had to be considered in harmonizing the education data 18 and hematochemical parameters.20–22 For the education data (Table 4), the original response recorded by the biobank was expressed in terms of years of schooling, with reference to the Italian educational system. 18 Education was analyzed as a continuous variable based on the ordinal harmonized classes (see Table 4), assuming increments of one unit between subsequent classes. For statistical analysis, two risk categories were determined for each hematochemical parameter, depending on the risk classification (Table 5). The association between “risk” category and place of residence, education, and BMI was estimated by logistic regression analyses, stratified by age range (26–40 years; 41–60 years; 61–79 years; ≥80 years) and gender, as per Table 3. Residence in North Italy and normal-overweight BMI (see Table 2) were chosen as reference categories in this analysis.
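The stratification and the ordinal coding of education described above can be sketched as follows. This is an illustrative sketch only: the function names, the record fields `gender` and `age`, and the idea of passing the harmonized classes as an ordered list are assumptions, and the actual classes are those of Table 4. The starting value of the education score (here 1) is arbitrary, since only the one-unit increments between subsequent classes matter for the regression.

```python
from collections import defaultdict


def age_stratum(age: float) -> str:
    """Map an age in years to one of the four age ranges used in the analysis."""
    if age <= 25:
        raise ValueError("ages of 25 years or less are excluded from the analysis")
    if age <= 40:
        return "26-40"
    if age <= 60:
        return "41-60"
    if age <= 79:
        return "61-79"
    return ">=80"


def education_score(education_class: str, ordered_classes: list) -> int:
    """Position of a harmonized class on the ordinal scale, one unit per class."""
    return ordered_classes.index(education_class) + 1


def stratify(records: list) -> dict:
    """Group records by (gender, age range), the strata of the logistic regressions."""
    strata = defaultdict(list)
    for record in records:
        key = (record["gender"], age_stratum(record["age"]))
        strata[key].append(record)
    return dict(strata)
```

Each group returned by `stratify` would then be fitted separately, with residence in the North and normal-overweight BMI as reference categories.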
ITR lacked education data and the records were excluded from statistical analysis in accord with the exclusion criteria (see Table 3).
According to Reference 22.
Univariate and multivariate analyses were carried out to obtain crude and adjusted odds ratios (ORs) with 95% confidence interval (95% CI). All statistical analyses were carried out with STATA X11.
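For readers less familiar with the measure, a crude (unadjusted) odds ratio and its 95% CI can be computed from a 2×2 table as sketched below. This is a generic textbook calculation using a Wald interval on the log-odds scale. It is an assumption that the crude estimates were obtained in an equivalent way, and it does not reproduce the adjusted ORs, which come from the multivariate logistic regression models. The function name and argument layout are hypothetical.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Crude odds ratio with a Wald confidence interval from a 2x2 table.

    a: subjects in the compared category (e.g., Center) in the "risk" class
    b: subjects in the compared category not in the "risk" class
    c: subjects in the reference category (e.g., North) in the "risk" class
    d: subjects in the reference category not in the "risk" class
    z: normal quantile; 1.96 gives an approximate 95% interval
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this estimator")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR), from the reciprocals of the four cell counts
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

An association is conventionally read as statistically significant at the 5% level when the interval does not include 1.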
Results
Preliminary work enabled this pilot project to be set so that it was compliant with the informed consent forms collected by each individual collection. 7 The main features of the pilot project, as reported in Table 1, include the terms addressed in the DTA. The DTA was signed by the principal investigators of the studies used in the pilot project. Each biobank uploaded data for a minimum of 1000 subjects, making a total of 5071 records (Table 3, before harmonization). All uploaded records included the age parameter. The first two age ranges (4–12 and 13–25 years) were excluded from statistical analysis because of the low number of subjects and the uneven distribution across participating biobanks. In fact, only two (Micros and ITR) out of nine collections participating in the study collected samples in the age range of 4–25 years. After the harmonization process and the adoption of the exclusion criteria reported in the Materials and Methods section, the statistical file included 3882 records (Table 3). Thus, 23% of the records were excluded, of which 16% was due to the elimination of the 4–12 and 13–25 age ranges and 84% was due to the other exclusion criteria. The summary of the excluded records for each collection is reported in Table 3.
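The share of excluded records follows directly from the two totals reported above. The short check below uses only those totals; the function name is hypothetical.

```python
def exclusion_summary(uploaded: int, analyzed: int):
    """Return the number of excluded records and their share of the uploaded records."""
    excluded = uploaded - analyzed
    return excluded, excluded / uploaded


# Totals reported in the text: 5071 uploaded records, 3882 in the statistical file
excluded, share = exclusion_summary(5071, 3882)
print(excluded, f"{share:.0%}")
```

Because the 16% and 84% figures are themselves rounded, applying them to the number of excluded records gives only approximate counts for each group of exclusion criteria; the exact counts per collection are those in Table 3.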
Results of multivariate logistic regression analysis, stratified by age and gender, expressed as adjusted ORs and respective confidence intervals, are reported in Table 6. In this table, statistically significant ORs are marked in bold. In women (Table 6B), the results of multivariate logistic regression analysis showed a consistent, significant inverse association (OR<1) between level of education and each of the considered hematochemical parameters in the age range 41–60.
As to the relationship between gender and the significant associations shown in Table 6, the comparison between panel A and B shows that, of the 17 significant associations in men and 22 in women, only 9 are common to both genders. Six of these concern HDL, and with the exclusion of subjects aged over 80 years, high HDL levels are associated with residence in the Center of Italy and with high BMI levels (just missing statistical significance in men in the age range 61–79 years). In subjects of the older age range (over 80 years), high HDL levels are associated with residence in the South of Italy in both men and women. The same positive association is also found in men over 41 years and in women in the age range 26–40 years.
Table 6 also shows that in women, higher levels of LDL are significantly associated with residence in Center Italy, except in the 61–79 age range, where the association only approaches statistical significance. In men, the same type of relationship is observed in the age ranges 41–60 and 61–79 years. Interestingly, women and men in the 61–79 age range living in the South present a decreased risk (OR<1) of having both high LDL and high total cholesterol levels. For total cholesterol in women only, the same association is also observed in the 26–40 age range.
A description of each association found by multivariate logistic regression analysis is reported in Table 6, and will not be discussed in detail. However, it is interesting to note the trend between triglyceride levels and BMI in men. The results indicate a strong positive correlation between increased triglyceride levels and BMI between 26 and 60 years. This association tends to decrease with increasing age (no statistically significant relationship was found between 61 and 79 years); in subjects over 80 years, BMI is negatively correlated with triglycerides.
Discussion
Population biobanks require expensive investments 23 and in Italy, as elsewhere, policy makers are interested in the potential of biobanks to use their tools for national health plans and biotechnological innovation.24–26 However, it is recognized that due to the lack of harmonized measures, few of the large-scale population cohorts can be used to take full advantage of the rich array of behavioral and other disease-related data that accompany many studies, as well as the environmental exposures that are potentially available. 27 Thus, it is not surprising that the project for the implementation of the HIBP, supported by the National Center for Disease Prevention and Control (Ccm), included the initiation of a pilot project that aimed to demonstrate that synergy among already-collected and available data can be used to address scientific questions on preventive and/or predictive medicine. The overall objective of building a first “proof-of-concept” pilot study (i.e., using pooled data from different cohorts with available tools and methods to provide important information for public health) has been achieved.
With respect to the specific topic addressed in this study, we found that the results of multivariate logistic regression analysis reflect some well-known relationships between lipid patterns. Thus, the consistency of the associations found with current knowledge supports the validity of the method developed and of the harmonization process, and also indicates the value of pooled data. The HIBP pilot study also shows some new, interesting associations. The results provide some evidence that it would be worthwhile to validate these findings by further experimentation.
Other studies have reported that education level has a positive effect on some markers related to cardiovascular risk and lipid profile.28,29 In Italy, before the sixties, education levels were very poor, especially for women. 17 Most of the people now aged between 40 and 61 years had the opportunity to access a higher level of education and general knowledge. Thus, the consistent association found in our study in women aged 41–60 years between education and a less atherogenic lipid pattern (lower total-cholesterol and LDL, higher HDL and lower triglycerides), allows the speculation that higher education increases awareness of the importance of lifestyle on health and aesthetic appearance. Moreover, the finding that this association is not present in younger and more educated women (26–40 age range) could be attributed to the protective estrogen effect during this age. In fact, according to current knowledge, women but not men have age-dependent changes in the lipid pattern related to hormonal status, 30 and cardiovascular risk increases only after menopause.31,32
In both genders, in the age ranges between 26 and 79 years, we observed an association between high HDL levels and place of residence in the Center of Italy. The same observation, but limited to men in the age range 26–60 years, was seen with BMI. In other words, as the majority of OR values are positive, these associations suggest that the population living in the Center of Italy with ages between 26 and 79 years has a higher risk of having lower HDL levels than the Northern population. In the same age ranges (just failing to reach significance in men aged 61–79 years) and independently from place of residence, obese individuals display a higher risk of having lower HDL levels. With respect to the South, the same positive correlation between place of residence and high HDL was observed in men aged over 41 years, and in women under 40 and over 80 years of age. Interestingly, all associations found between place of residence and high HDL were positive with respect to the North. LDLs also showed a different trend between the Center and the South of Italy as place of residence, in comparison to the North. Thus, our data suggest that in Italy, 34 as in other countries,28,29,33 there is a geographical gradient that affects cardiovascular risk, which is not yet completely understood and/or investigated.
The associations between LDL levels and place of residence for the oldest subjects generally differed from those observed in the age range 61–79 years. This may suggest that after 80 years of age, the lipid parameters are not affected by environmental differences such as the area of residence, and that any possible differences recorded at younger ages are no longer evident, probably because survival for people over 80 is more dependent on family/genetic factors than on the environment/lifestyle factors that act at earlier ages. These data support the results of a recent study on Italian nonagenarians, whose lipid parameters were largely normal, falling within the standard ranges valid for the adult population. 35
On the whole, many of the observed associations suggest that geographical differences affect the analyzed parameters, except for subjects older than 80 years, although the results of this study require further experimental validation. In fact, this prototype is a data-sharing study and, as such, it can only identify associations, but cannot provide insight on causation. In conclusion, the HIBP pilot experience demonstrates the feasibility of this type of collaboration and provides a methodological prototype for national epidemiological studies based on the existing data. Furthermore, the consistency of our findings with current knowledge suggests the accuracy of the approach and the validity of the associations found, including some that have not been previously reported. Thus, this study also represents the starting point for further comprehensive analyses that may include other phenotype variables, with the final aim of identifying the determinants of health status in each age group and gender, taking into account areas of residence and socio-demographic factors.
Acknowledgments
We acknowledge the collaboration of Isabel Fortier and p3g for their suggestions and helpful assistance. We are thankful to Prof. K.M. Botham for revision of the English language of the manuscript.
Author Disclosure Statement
The authors reported no conflicts of interest or financial disclosures. The study was supported by the National Center for Disease Prevention and Control (Ref. ISS IM65).
