Abstract
Macroinvertebrates are globally used in environmental monitoring and assessment. However, due to environmental and biological evolution, local adaptations of species might occur. This can contribute to uncertainties in the extrapolation of family-specific ecological models developed from one region to another. Thus, we aimed to determine if models can be extrapolated to other regions with similar climatic conditions and if a reliable model can be developed from a pooled dataset (consisting of data from different regions). The occurrence of five families was modelled based on physical–chemical water quality variables with classification trees using the data from three tropical river basins (Chaguana in Ecuador, Gilgel Gibe in Ethiopia and Cau in Vietnam). The relevance of each model was tested on complementary data from both the same and other river basins, to test specificity and universality. Furthermore, models with a pooled dataset were developed and tested. Model reliability was assessed based on chance-corrected agreement (Cohen’s kappa, κ) and percent agreement (correctly classified instances, CCI). Values higher than 0.4 (κ) and 70% (CCI) were used to classify models as good. Only the pollution-sensitive taxon (Leptophlebiidae) resulted in reliable models for most cases. In general, responses of macroinvertebrates towards pollution were different among countries except for the pollution-sensitive taxa. Thus, extrapolation of ecological models for sensitive taxa to another river basin with similar climatic and environmental conditions is possible. Nevertheless, this type of systematic analysis for all families is necessary to determine and minimize uncertainty in ecological assessment.
Introduction
Macroinvertebrates are globally used in environmental monitoring and assessment [35,37,45]. They integrate environmental changes in physical, chemical and ecological characteristics of their habitat over time and space [64]. Most aquatic invertebrates are relatively immobile and reside in the benthic habitat for at least part of their life. Due to their sensitivity, disturbances in the aquatic environment generally affect macroinvertebrate abundance, diversity and composition [56]. Thus, macroinvertebrates are a suitable bio-indicator of the ecological state of aquatic ecosystems.
Ecological modelling is an effective tool to investigate, describe and predict the ecological state of an aquatic ecosystem [40]. Techniques such as decision trees have been employed to analyse biological assemblages of macroinvertebrates in British and Slovenian rivers [24]. Decision trees can be divided into classification trees and regression trees. Classification trees are used when the predicted outcome is a categorical variable, while regression trees are used when the predicted outcome is a continuous variable. Everaert et al. [25] and Boets et al. [5] assessed the impact of alien species on the macroinvertebrate composition in Flanders, Belgium with classification trees. Furthermore, classification trees have been used to predict the distribution of an alien species, Azolla filiculoides (Lam.) at Selkeh Wildlife Refuge in Anzali wetland, northern Iran [68]. Decision trees accommodate the challenges posed by ecological data as these techniques are: flexible enough to handle continuous and discrete responses and explanatory variables; able to capture nonlinear relationships, complex interactions, and multiple structures; able to deal with missing values in response or explanatory variables; robust to outliers; invariant to monotonic transformation of the explanatory variables; and easy to construct and interpret [53].
According to some studies, responses of macroinvertebrates to changes in environmental conditions are similar in regions characterized by comparable climatic and environmental conditions. Bonada et al. [6] concluded that ecoregions with corresponding historical, climatic and environmental conditions are also inhabited by macroinvertebrates with an analogous assemblage structure. Furthermore, Dynesius et al. [23] reported that the responses of riparian plant species to river regulation were similar between two different continents (North America and Europe) and suggested that a general guideline for rehabilitation of degraded boreal rivers is a realistic goal. However, some studies reported that the relationship between taxonomic assemblage structure and land use intensity differed among ecoregions [18]. Taxonomic richness also varies greatly in response to environmental variability [62]. It remains unclear whether an aquatic macroinvertebrate taxon responds similarly to changes in environmental conditions across tropical regions, because few related studies have been conducted in the tropical climate zone.
Although numerous ecological models were developed to analyse and predict the assemblage condition in the tropics [21,39,51], these models were never extrapolated to other regions with similar climatic and environmental conditions. Furthermore, results from the development of ecological models with a pooled dataset (a dataset consisting of data from regions with similar climatic conditions) are limited. The question remains whether models developed for a specific region can be extrapolated to other regions with similar climatic conditions or whether a reliable model can be developed from a pooled dataset. Thus, this study explores the relationship of indicator taxa and key environmental variables for water quality assessment in a broader regional context and provides an insight into whether a general guideline for bioassessment in tropical countries can be implemented.
This study aims to determine if models can be extrapolated to other regions with similar climatic conditions and if a reliable model can be developed from a pooled dataset. We used classification trees to model the relationship between physical–chemical variables and the occurrences of five macroinvertebrate families, ranging from pollution tolerant to pollution sensitive, occurring in Ecuador, Ethiopia and Vietnam.
Materials and methods
Study areas
The Chaguana, Gilgel Gibe and Cau river basins are located in the southwest of Ecuador, the southwest of Ethiopia, and the north of Vietnam, respectively (Appendix Table 1). The three river basins are prone to similar pressures. The Vietnamese river basin is characterized by a high population density, industrialization and sand/gravel mining [57]. About 37% of the Ecuadorian river basin is in near-natural condition (humid forests and bushes, mangroves and uncultivated land), while the remaining parts are used for shrimp farming, banana plantations and human settlements [21]. Lastly, some streams of the Ethiopian river basin are still covered with natural vegetation, while the rest is altered by different activities such as grazing, ploughing, sand dredging, vegetation clearance and municipal waste discharge [4].
Data collection
Data were gathered from three different sampling campaigns in the Chaguana river basin, Gilgel Gibe river basin and Cau river basin (Appendix Figs 1–3). Multiple physical–chemical water quality parameters were measured and macroinvertebrate samples were collected in each survey (Appendix Table 1). The number of sampling sites varied for each river basin, but each sampling site was visited twice in each year: once in the wet season and once in the dry season. In total, 60, 104 and 306 samples were taken at 15, 29 and 47 sampling sites for 2, 2 and 6 years in the Vietnamese, Ecuadorian and Ethiopian river basins, respectively (Appendix Table 2) [26]. Water samples were analysed according to the ISO standards. Only the environmental variables that were monitored in all three river basins were retained for further analysis. These are conductivity (μS/cm), dissolved oxygen concentration (mg/L), pH (–), stream velocity (m/s) and water temperature (°C) (Appendix Table 2). Samples were collected during two different seasons (wet and dry); therefore seasonality was also included for further analysis. Benthic macroinvertebrates were sampled according to the method described in Gabriels et al. [32]. A standard hand net attached to a stick was used, with a frame of 20 × 30 cm and a mesh size of 300 μm. A stretch of 10–20 m was sampled with the kick sampling method for 5 minutes. Sampling effort was proportionally distributed over all aquatic habitats. The organisms were identified to family level according to the taxonomy in [19,29,67] for Ecuadorian taxa, [7] for Ethiopian taxa, and [15,22,50] for Vietnamese taxa. For a detailed overview of the sampling campaigns and descriptions of the locations, we refer to Dominguez-Granda et al. [21], Ambelu et al. [4], and Nguyen et al. [57].
Classification trees
Classification trees are a simple, nonlinear and nonparametric approach fitted by recursive partitioning of a multidimensional covariate space [42,71]. They were applied to model the occurrence of macroinvertebrate families based on physical–chemical water quality variables. Five macroinvertebrate families, present in the three river basins, were selected (Table 1). They ranged from pollution tolerant to pollution sensitive, according to the BMWP score list [12].
Selected macroinvertebrate families
Threefold cross validation was applied to indicate the robustness of the model. The original data were divided into three groups, stratified based on the presence or absence of the selected taxa. Thus, the proportion of presence/absence records of the selected taxa is the same in each group. Two of these groups were used to build the model (training sets); while the third subset was used to test the model (validation sets). The first series of classification tree models contained data from each country separately: Ecuador, Ethiopia, and Vietnam. The second series of classification tree models were developed based on pooled datasets. This resulted in 7 training and 7 validation sets as illustrated in Fig. 1.
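The splitting scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the function names, the random seed and the round-robin assignment are assumptions introduced here. It enumerates the 7 dataset combinations (3 single-country and 4 pooled) and builds three folds stratified on presence/absence.

```python
import random
from itertools import combinations

COUNTRIES = ("Ecuador", "Ethiopia", "Vietnam")


def dataset_combinations(countries=COUNTRIES):
    """All non-empty country combinations: 3 single + 3 pairs + 1 triple = 7."""
    return [combo
            for size in range(1, len(countries) + 1)
            for combo in combinations(countries, size)]


def stratified_folds(labels, n_folds=3, seed=0):
    """Assign record indices to folds so that each fold keeps (nearly) the
    same presence/absence proportion. Labels are 1 (present) or 0 (absent)."""
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled records of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_validation_splits(labels, n_folds=3, seed=0):
    """Yield (training, validation) index lists: each fold is the validation
    set once, and the remaining folds together form the training set."""
    folds = stratified_folds(labels, n_folds, seed)
    for k in range(n_folds):
        validation = sorted(folds[k])
        training = sorted(i for j, fold in enumerate(folds) if j != k
                          for i in fold)
        yield training, validation
```

With three folds, each training set holds two-thirds of the records and each validation set the remaining third, which matches the proportions stated in the text.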

Schematic outline of the model development and validation process. CT represents classification trees.
The models were developed based on tenfold cross validation to avoid model overfitting. The training set (two-thirds of the original dataset) was therefore divided into ten groups. Nine of these groups were used to build the model while the tenth subset was employed to test the model (referred to as the internal validation, which is part of the model development procedure) as illustrated in Fig. 2. Each developed model was validated with the 7 validation sets (referred to as the independent validation) as shown in Figs 1 and 2. The default settings from the machine learning package WEKA were applied using the J48 algorithm [36].
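J48 is WEKA's implementation of the C4.5 decision tree algorithm, which selects splits using entropy-based criteria. The sketch below shows only the core idea for a single continuous variable: choosing the threshold that maximises information gain. It is a simplified illustration and not the WEKA code; it omits gain ratio, pruning and missing-value handling, uses the midpoint between neighbouring values as the threshold (an assumption), and the conductivity values in the usage note are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of 1 (present) / 0 (absent) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_present = sum(labels) / n
    return -sum(q * log2(q) for q in (p_present, 1 - p_present) if q > 0)


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the binary split
    'value <= threshold' that maximises information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = ((low + high) / 2, gain)
    return best
```

For example, with hypothetical conductivity values `[50, 80, 120, 400, 650, 900]` and presences `[1, 1, 1, 0, 0, 0]`, the function picks a threshold between 120 and 400 because that split separates presences from absences completely. A full tree repeats this choice recursively on each resulting subset.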

Schematic presentation of internal and independent validation.
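Trained models were evaluated on the basis of two performance measures: the percentage of Correctly Classified Instances (CCI) [30] and Cohen’s kappa statistic (κ), a chance-corrected measure of agreement. Models with κ higher than 0.4 and CCI higher than 70% were classified as good.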
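Both measures can be computed from a 2 × 2 confusion matrix of predicted versus observed presence/absence. The sketch below assumes the standard definitions of percent agreement and Cohen's kappa; the function names and example counts are hypothetical, while the 0.4 and 70% thresholds are those used in this study.

```python
def cci_and_kappa(tp, fp, fn, tn):
    """Return (CCI in %, Cohen's kappa) from confusion-matrix counts:
    tp/tn = correctly predicted presences/absences,
    fp/fn = wrongly predicted presences/absences."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Agreement expected by chance from the marginal totals.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    if p_expected == 1:
        return 100 * p_observed, 0.0  # kappa is undefined; report 0 here
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return 100 * p_observed, kappa


def is_reliable(cci, kappa, min_cci=70.0, min_kappa=0.4):
    """A model counts as reliable only if both thresholds are exceeded."""
    return cci > min_cci and kappa > min_kappa
```

With hypothetical counts `tp=30, fp=5, fn=5, tn=10`, CCI is 80% and κ is about 0.52, so the model passes both thresholds. With `tp=0, fp=0, fn=5, tn=45` (a model that always predicts absence), CCI is 90% but κ is 0, which shows why CCI alone can be misleading when prevalence is low.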
Based on kappa and CCI statistics, reliable models for Leptophlebiidae were developed from the training sets that included Ecuador data (Table 2). Models for Hydroptilidae with a κ higher than 0.4 and CCI higher than 70% were developed from the pooled training set of Ethiopia & Vietnam (Table 3). The J48 algorithm was not able to develop reliable models (based on performance criteria) for Baetidae, Chironomidae and Libellulidae (Appendix Tables 3–12).
Internal validation of Leptophlebiidae predictive models. Results in bold indicate datasets providing reliable models based on performance criteria. Mean and standard deviation of CCI and κ were derived from threefold cross validation
Internal validation of Hydroptilidae predictive models. Results in bold indicate datasets providing reliable models based on performance criteria. Mean and standard deviation of CCI and κ were derived from the threefold cross validation
Based on independent validation, some models for Leptophlebiidae and Hydroptilidae were reliable for some validation sets (Tables 4–6). For instance, models for Leptophlebiidae developed based on the Ecuadorian training sets were valid in the pooled Ecuadorian & Vietnamese validation sets. These models almost exceeded a CCI of 70% and κ of 0.4 in the Ecuadorian and pooled Ecuadorian & Ethiopian validation sets. Furthermore, Leptophlebiidae models developed from the pooled datasets of Ecuador & Ethiopia, Ecuador & Vietnam, and Ecuador & Ethiopia & Vietnam were all valid in their validation set counterparts. Hydroptilidae models developed from the pooled dataset of Ecuador & Vietnam were reliable in the validation set comprising the Vietnamese data. However, both the CCI and κ were unstable, as reflected in their standard deviations. Conductivity and dissolved oxygen (DO) were the major variables determining the presence of Leptophlebiidae (Table 7; Appendix Figs 4–7), whereas season, pH and water temperature were the major variables determining the occurrence of Hydroptilidae (Table 8; Appendix Fig. 8).
Independent validation of Leptophlebiidae’s single country models. Results in bold indicate datasets providing reliable models based on performance criteria. Mean and standard deviation of CCI and κ were derived from threefold cross validation
Independent validation of Leptophlebiidae’s multiple country models. Results in bold indicate datasets providing reliable models based on performance criteria. Mean and standard deviation of CCI and κ were derived from threefold cross validation
Independent validation of Hydroptilidae’s multiple country models. Results in bold indicate datasets providing reliable models based on performance criteria. Mean and standard deviations of CCI and κ were derived from threefold cross validation
Major variables of classification trees for Leptophlebiidae and the times these variables were selected at each level in the tree. The variables selected in the topmost level of the tree represent a major variable influencing the occurrence of Leptophlebiidae
Selected twice.
Major variables of classification trees for Hydroptilidae and the times these variables were selected at each level in the tree. The variables selected in the topmost level of the tree represent a major variable influencing the occurrence of Hydroptilidae
Selected twice.
Model performance
Classification trees successfully modelled the occurrence of some macroinvertebrate families occurring in the three tropical countries. As the modelling technique is robust to outliers, all observations were included in the analysis. Furthermore, classification trees are nonparametric and nonlinear. Thus, there are no inherent assumptions that the underlying relationships between the predictor and target variable are linear [53]. Although classification trees can be quite unstable with respect to their predictive performances [33], some of our results were stable, particularly the models developed for Leptophlebiidae. This shows that the occurrence of Leptophlebiidae has consistent patterns in the datasets. Furthermore, classification trees are transparent and have the ability to deal with relatively small data sets. However, processing large data sets can lead to large and complex trees, so adequate parameterisation is crucial. Most decision trees are constructed in a purely data-driven manner, without incorporating expert knowledge. In this study, the dataset is relatively small and incorporating expert knowledge in the model would have been very hard, as the available ecological knowledge, especially for tropical countries, is minimal [60]. There are other modelling techniques such as fuzzy logic, Artificial Neural Networks (ANN), Generalised Linear Models (GLM), and Bayesian Belief Networks (BBN). Each of these modelling techniques has its own advantages and drawbacks, as elaborated by Van Echelpoel et al. [75]. Both fuzzy logic and BBN require knowledge-based rules and were therefore suboptimal for our study. GLM assumes that the underlying data are characterized by specific distributions. Some of our testing and validation sets do not comply with these distributions and therefore GLM cannot be applied in a standardized and efficient manner. Lastly, ANN can lead to highly reliable models, but the drawbacks and challenges of this method are the lack of guidelines for optimal design, low ecological relevance and limited explanatory power.
Most values of Cohen’s kappa statistic (κ) were lower than 0.4 and CCIs were lower than 70%. This indicates that most models did not make reliable predictions. However, some classification tree models could reliably predict some macroinvertebrate taxa (i.e. Leptophlebiidae and Hydroptilidae). The inability of models to yield reliable predictions may be due to various reasons. For instance, there were only 5 environmental variables common among the three river basins. Important variables other than those used in this study (e.g. physical habitat degradation) were not included during model development. In the literature, it has been shown that macroinvertebrate distribution patterns are determined by conductivity, pH, flow, substrate and marginal vegetation [59]; substratum and historical flow characteristics [16]; pH [63]; and nutrients, velocity and sediment type [3]. Environmental variables such as velocity, dissolved oxygen, chlorophyll, chemical oxygen demand, sediment type, land use and elevation [31] and oxygen level, temperature, ammonium concentration and conductivity [74] are known to influence the water quality index based on macroinvertebrates. Common variables reported in the literature (velocity, conductivity, dissolved oxygen, temperature, pH) were included in the development of our models. We nevertheless consider the 5 environmental variables included in the modelling exercise sufficient, as they cover a wide range of water quality issues. For instance, conductivity integrates minerals and inorganic pollutants [26]. Dissolved oxygen gives an indication of organic pollution whereas temperature may represent global warming [73].
Records of taxon absences may be due to various reasons. Taxa might not be present at a site because the habitat is unsuitable. However, incorrect records of absences could lead to incorrect model predictions. Sampling error could have caused a taxon that was present to be recorded as absent. The habitat might be suitable but the taxa could be temporarily absent. Furthermore, migration barriers could have caused the absences of these taxa despite the high suitability of the habitat [46]. These factors could have contributed to the lower accuracy of the models.
Models developed for most taxa were not reliable. Although the model developed for Hydroptilidae based on the combined Ecuadorian and Vietnamese dataset was reliably validated with the Vietnamese dataset, the validation results were quite unstable, as the standard deviation of κ was more than 75% of the mean κ. Moreover, the selected variables varied among the models developed from threefold cross validation (Table 8). Aside from predictive performance, model stability is also important [31]. Thus, the validation results of Hydroptilidae cannot be considered valid.
Overall, it was not possible to develop a reliable and applicable model to predict the occurrences of Chironomidae and Baetidae. It is reported that Baetidae is influenced by vegetation cover [51], pH and water temperature [9] and cobbles [2]. It can also be present in moderately disturbed rivers [11]. Chironomidae can occur under both disturbed and pristine conditions [47]. As both taxa can be present in a wide range of environmental conditions, developing reliable and valid models for these taxa can be challenging.
Models of Libellulidae could not be successfully developed, which could be due to the limited number of available input variables. Libellulidae need a continuous spectrum of stagnant, transitional and flowing water [38]. This requirement is not captured by the input variables. Furthermore, Libellulidae are reported as a pollution-tolerant family in Vietnam [58], while they are moderately tolerant to pollution in South America [66]. Thus, Libellulidae can be present over a wide range of environmental conditions in some tropical regions. As no extrapolated models were successfully developed and validated, these results indicate that Chironomidae, Baetidae, Hydroptilidae and Libellulidae may respond differently to pollution among the three river basins.
Reports in the literature vary on the dependence of Cohen’s kappa on prevalence. According to Feinstein and Cicchetti [27], κ is affected by prevalence, and they recommended designing research with well-balanced positive and negative constituents. However, Manel et al. [48] concluded that the effect of the frequency of occurrence of target organisms on Cohen’s kappa is negligible. Our results revealed that the model of Hydroptilidae developed from the training set of Vietnam (
Valid models based on pooled datasets were developed for Leptophlebiidae. A model containing the Ecuadorian dataset generally resulted in a valid and reliable model. The prevalence of Leptophlebiidae in Ecuador is almost 50% and the data may contain clear patterns of the occurrence of the taxon. The low prevalence of Leptophlebiidae in the Ethiopian or Vietnamese dataset could have resulted in an unclear pattern of its occurrence. However, when the models were developed in combination with the Ecuadorian data, the model performances improved. This indicates that the patterns of the occurrence of Leptophlebiidae were magnified when the Ecuadorian dataset was combined with either the Ethiopian or Vietnamese dataset. Hence, similar patterns are possibly present in the datasets of Ecuador and Vietnam and of Ecuador and Ethiopia. Leptophlebiidae occurring in these river basins possibly respond similarly to environmental pressures. On the other hand, no reliable models were developed from the training sets comprising the Ethiopian and Vietnamese data. The poor performance of these models could be due to the low prevalence of Leptophlebiidae in these datasets, to differences in the patterns of occurrence of Leptophlebiidae between these countries, or to both. Thus, the response of Leptophlebiidae towards environmental pressures between the Ethiopian and Vietnamese river basins remains unclear.
Study design
The altitude of the Vietnamese river basin is lower than that of the other river basins. Ecuador shares a common altitude range with Ethiopia, although some sites in Ethiopia are 1000 m lower than those in Ecuador. According to Rezende et al. [65] and Feio et al. [28], higher altitude was likely the primary variable that increased taxonomic richness and density of macroinvertebrate communities. However, the change in taxonomic diversity in response to elevation is often related to the impacts of human activities [44]. Macroinvertebrate assemblage composition may be affected by altitude, although such effects are often confounded with anthropogenic disturbance, which is often most extensive at the lowest altitudes in the tropics [55,76].
One of the strengths of our study design is the incorporation of seasonality in model development. Seasonal variations in subtropical countries may have an effect on macroinvertebrate assemblages [1,10]. The uncertainties due to seasonal variability were integrated into the models. This is confirmed in our results, wherein the occurrences of Leptophlebiidae and Hydroptilidae were influenced by seasonal changes. Leptophlebiidae favours the wet season, while Hydroptilidae can be present in both the wet and dry seasons.
Sample sizes differed between countries, resulting in an unbalanced dataset. This is one of the limitations of the study design. However, the inclusion of all cases is important for an optimal visualization of patterns in each dataset. A large number of observations or a large sample size yields better model accuracy and prediction [70,77]. However, Yu and Abdel-Aty [78] concluded that a smaller sample size enhances the model’s classification accuracy. Furthermore, large datasets have a disadvantage in classification tree models as they can result in complex models, which are hard to interpret. Despite the limited number of sampling sites (15–50), sampling occurred in a standardized way, giving a relevant dataset that allows the extraction of possible ecological patterns.
A balanced prevalence of each taxon could have improved model performance and the clarity of the results. However, each sampling campaign was conducted independently. A dataset with balanced prevalence of target taxa is ideal but very rare. Nevertheless, the modelling of representative taxa with varying degrees of tolerance to pollution gives valuable ecological insights from different regions with similar climatic conditions.
Monitoring and modelling of ecological data in tropical regions
This modelling exercise explores the possibility of developing a reliable model from a combined dataset or the extrapolation of models to other river basins with similar environmental conditions and pressures. Among the taxa analysed, the most sensitive taxon (Leptophlebiidae) was successfully modelled. Leptophlebiidae seems to share similar patterns between Ecuador and Vietnam and between Ecuador and Ethiopia. Thus, models developed in Ecuador can be extrapolated to either Vietnamese or Ethiopian river basins with similar environmental conditions and pressures. The extrapolation of models to other river basins allows fewer sites to be monitored. As a consequence, monitoring becomes cheaper, which is beneficial in developing countries. Furthermore, the results of our study open the possibility of implementing a general guideline for bioassessment in tropical countries based on pollution-sensitive taxa.
Results revealed that the major variables predicting Leptophlebiidae are conductivity and dissolved oxygen. Thus, a river basin manager can prioritize management activities that alleviate conductivity and dissolved oxygen issues in rivers. Information on the influence of environmental variables and distribution of this taxon is limited [60,69]. However, in one study, Leptophlebiidae was not found when water quality was reduced by the use of agricultural chemicals [41]. Leptophlebiidae also disappears in a river stretch receiving industrial effluent and domestic sewage [61]. Conductivity appeared to be an important driver of macroinvertebrate community structure within this river stretch. Conductivity can be related to effluents, sewage and agricultural chemicals, which is in line with the results of our study.
Although numerous data were collected, only a few variables were common to all river basins. The differences in biomonitoring programmes among the three countries left only a small number of variables available for model development. Thus,
Three main conclusions could be drawn from this modelling study. First, pollution-sensitive taxa are more easily modelled than pollution-tolerant taxa. Second, responses of macroinvertebrates towards pollution generally differed among countries, except for the pollution-sensitive taxa. Thus, extrapolation of ecological models for sensitive taxa to another river basin with similar climatic and environmental conditions is possible. Lastly, the implementation of a standardized biomonitoring programme in tropical countries is suggested. Nevertheless, this type of systematic analysis for all taxa is necessary to determine and minimize uncertainty in ecological assessment.
Acknowledgements
We would like to thank all the people who contributed to the sampling campaigns. Marie Anne Eurie Forio receives financial support from the special research fund of Ghent University to support the VLIR Ecuador Biodiversity Network. Luis Dominguez-Granda received financial support from the VLIR ESPOL IUC program in Ecuador and SENACYT. Argaw Ambelu was a recipient of an ICP-PhD scholarship from VLIR-UOS. Seid Tiku Mereta was a recipient of an IUC-PhD scholarship from VLIR-UOS (IUC JIMMA). Thu Huong Hoang received financial aid from the Belgian Technical Cooperation (BTC).
Independent validation of Chironomidae’s multiple country models. Mean and standard deviation of CCI and κ were derived from threefold cross validation
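| Validation sets | Ecuador & Ethiopia model: CCI (%) | Ecuador & Ethiopia model: κ | Ecuador & Vietnam model: CCI (%) | Ecuador & Vietnam model: κ | Ethiopia & Vietnam model: CCI (%) | Ethiopia & Vietnam model: κ | Ecuador, Ethiopia & Vietnam model: CCI (%) | Ecuador, Ethiopia & Vietnam model: κ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |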
| Ecuador | 91.3 ± 0.1 | 0.00 ± 0.00 | 91.3 ± 0.1 | 0.00 ± 0.00 | 90.4 ± 1.5 | −0.02 ± 0.02 | 91.3 ± 0.2 | 0.00 ± 0.00 |
| Ethiopia | 84.3 ± 0.7 | 0.00 ± 0.00 | 84.3 ± 0.7 | 0.00 ± 0.00 | 84.9 ± 0.4 | 0.06 ± 0.09 | 84.3 ± 0.7 | 0.00 ± 0.00 |
| Vietnam | 93.4 ± 2.1 | 0.00 ± 0.00 | 93.4 ± 2.1 | 0.00 ± 0.00 | 93.4 ± 2.1 | 0.00 ± 0.00 | 93.4 ± 2.1 | 0.00 ± 0.00 |
| Ecuador & Ethiopia | 86.1 ± 0.7 | 0.00 ± 0.00 | 86.1 ± 0.7 | 0.00 ± 0.00 | 86.8 ± 1.2 | 0.07 ± 0.11 | 86.1 ± 0.7 | 0.00 ± 0.00 |
| Ecuador & Vietnam | 92.1 ± 0.8 | 0.00 ± 0.00 | 90.6 ± 1.8 | 0.00 ± 0.00 | 91.5 ± 0.9 | −0.01 ± 0.01 | 92.1 ± 0.8 | 0.00 ± 0.00 |
| Ethiopia & Vietnam | 84.1 ± 1.4 | 0.00 ± 0.00 | 84.1 ± 1.3 | 0.00 ± 0.00 | 84.4 ± 1.2 | 0.03 ± 0.04 | 84.1 ± 1.4 | 0.00 ± 0.00 |
| Ecuador, Ethiopia & Vietnam | 86.4 ± 1.4 | 0.00 ± 0.00 | 86.4 ± 1.4 | 0.00 ± 0.00 | 87.1 ± 2.3 | 0.09 ± 0.12 | 86.4 ± 1.4 | 0.00 ± 0.00 |
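
The rows above pair a CCI of roughly 84–93% with κ = 0.00 ± 0.00 for most models. This pattern is consistent with, though not proof of, models that predict the same class for every sample. As a worked check, assume a 2 × 2 confusion matrix with a true positives, b false positives, c false negatives and d true negatives (symbols introduced here for illustration, not taken from the original text) and a model that always predicts presence, so that c = d = 0:

```latex
\begin{align*}
N      &= a + b + c + d = a + b, \\
p_o    &= \frac{a + d}{N} = \frac{a}{N}, \\
p_e    &= \frac{(a + b)(a + c) + (c + d)(b + d)}{N^2}
        = \frac{N \cdot a + 0}{N^2} = \frac{a}{N}, \\
\kappa &= \frac{p_o - p_e}{1 - p_e} = 0 \qquad (\text{for } b > 0).
\end{align*}
```

Under this assumption, CCI equals the prevalence of the predicted class, so a high CCI can coincide with no agreement beyond chance.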
