Abstract
Polymicrobial syndromes such as Bacterial Vaginosis (BV), where there is a great diversity of microorganisms and causal connotations, turn it into a disease with complex dynamics in the bacteria’s coexistence in groups of patients. The main aim of this study was to explore a dataset of patients with BV to determine a more informed number of groups to create for further analysis of bacteria’s coexistence. The Agglomerative Hierarchical Clustering (AHC) algorithm was applied to a BV dataset from an urban population in southeastern Mexico consisting of 201 patient records with 59 patient attributes and three classes (BV-positive, BV-negative, BV-indeterminate). In the clustering results obtained, it is possible to identify different remarkable groups of patients. The most prevalent coexisting bacteria among patients with BV were Atopobium
Introduction
BV is a polymicrobial clinical syndrome [1] that affects 15 to 50% [2] of women in childbearing age [3]. BV is a vaginal micro-floral dysbiosis, which usually occurs when Lactobacillus [4, 5, 6] decreases quantitatively before an overgrowth of mainly anaerobic bacteria [7, 8]. These organisms involved in this dysbiosis are present in both health and disease cases.
Women affected by BV may be symptomatic or asymptomatic. Symptom cases experience vaginal odour (fishy), grayish-white discharge, itching, and increased vaginal pH greater than 4.5 [9]. It is essential to perform a timely BV diagnosis to avoid gynaecological complications [10], such as endometritis, salpingitis, oophoritis [11], preterm premature rupture of membranes (PPROM), and chorioamnionitis [12].
The classical methods for diagnostics of BV are Amsel’s criteria [13], and Nuget score [14]. Real-time PCR has also been used to study BV, which consists of amplifying and quantifying or detecting the target DNA simultaneously [15]. BV has been addressed in multiple clinical and microbiology studies using these methods. However, the diversity of microorganisms found in the vaginal mucosa, along with their causal connotations, make BV a complex dynamic problem [16]. BV becomes even more complex when it is sought to identify groups of individuals with feature similarities.
Particularly, it is well known that a BV-positive case is a consequence of an imbalance state of bacteria; it is not a unique bacterium but coexistence of bacteria leading to a BV-positive condition. However, these bacteria may change from patient to patient. So it is of our interest to tackle this problem using a clustering approach. We aim to identify clusters that help understand patients with BV who presumably share coexisting bacteria in a grouped way.
From the ML standpoint, groups of objects are called clusters. Clustering refers to the segmentation of datasets into groups of similar objects. Each group consists of objects similar to one another and dissimilar to objects in other groups [17]. This technique can be extremely effective in determining the contexts of bacterial coexistence between groups of patients with the same diagnosis.
In this study, we address the problem of BV through the AHC algorithm to explore a BV dataset real from an urban population in southeastern Mexico consisting of 201 patient records with 59 patient attributes with three classes BV-positive, BV-negative, BV-indeterminate. The classes were hidden from the AHC to let it do its work. The aim was to obtain from a dendrogram the result of the AHC a more informed number of groups to create in the dataset for further analysis on bacteria coexistence that led to a BV-positive diagnosis. These groups must show similarity in the elements within the same group and dissimilarity among features from one group to those of the others regarding detecting coexistence context bacterial. Finally, the results were validated computationally and biologically.
The AHC algorithm is implemented using a linkage method and a distance metric. The linkage methods experimented with were Single Link, Complete Link, Group Average, Ward’s.D, and Ward’s. D2. There are several distance metrics for different purposes. In this study, the asymmetric binary similarity measure was applied. Two experiments were performed:
In the first, it the qualitative-asymmetric-binary data of Lactobacillus was used with qualitative-asymmetric binary data of microorganisms In the second, the quantitative data of Lactobacillus Cq with qualitative-asymmetric binary data of microorganisms was used.
In both experiments, the percentage of agglomeration (AP) was calculated for each linkage method. The results were used to identify the method delivering the highest AP value and build the dendrograms. After, dendrograms were created where different levels of data grouping can be read. Subsequently, clustering tables were constructed to investigate whether the elements were put in the right cluster according to the real class of elements. Posteriorly, the model was computationally validated using the following metrics: coefficient of cophenetic correlation (CCC), the optimal number of clusters through the gap-statistic, and the average silhouette. Significant clusters were evaluated through Ward.D2 method and asymmetric binary similarity measure. Thereafter, it was biologically validated by an expert in the field and finally the data visualizations were created.
After that, a comparative table was constructed of all linkage methods that obtained a low AP. The data were analysed, which allowed identifying whether the AHC method produced clusters according to two types of states (presence and absence) in the first experiment. In the second experiment, it was analysed whether the groups found were dissimilar among other groups but belonged to the same diagnosis.
It was also possible to identify different remarkable groups of patients. Differences from the group to group lie on the active bacterium that led to the condition of BV-Positive. This study contributes to the effort of providing insights into the identification of coexisting bacteria in groups of patients diagnosed as BV-positive.Also, the benefit of identifying groups translates into selecting specific treatments according to the coexistent bacteria in each grouping. Also, it becomes a support tool to obtain a priori knowledge of the contexts that may arise in clinical cases.
This paper is organized as follows: Section 2 describes the related works of machine learning in BV. Section 3 describes the dataset, linkage methods, similarity measure, cophenetic correlation coefficient, and validation metrics. Section 4 describes the steps of the AHC experimental design. Section 5 describes the results. Section 6 describes the discussion. Finally, our conclusions are in section seven.
There exist a few studies that have analyzed the VB disease using ML methods.
The purpose of Song et al.’s. [18] was to integrate Superpixel methods with Deep Learning methods based on convolutional neural network (CNN) for the automatic assisted diagnosis of BV. The experiment was based on 105 women (18–50 years old) from Shenzhen Hospital, from which 105 oil immersion images were obtained using Olympus BX43. The images were evaluated through a reference frame being processed in greyscale through superpixel computation to smooth the appearance. The Superpixel algorithm is used to group each segment of the cells and to classify the bacteria. This method allowing to obtain a description of the characteristics such as (size, colour, shape, texture, gradients) that BV may represent. The deep learning method used CNN to learn input and output mappings. It was evaluated through the methods of precision, sensitivity, accuracy, specificity, F1 measure, and the Zijdenbos similarity index. The performance of CNN was compared with other classifiers (Backward Propagation Neural Networks (PNN), Vector Support Machines (SVM), Vector Quantization Learning (LVQ), and Probabilistic Neural Networks).
Baker et al.’s [19] built a classification model by breaking down the groups of microbes based on their correlation. Likewise, it reduced the number of factors, increasing the interpretability of the classification models. The classifications were made using Genetic Programming, Random Forest, and Logistic Regression, and the precision of the models was evaluated using ROC curves. The precision obtained from the models was between 90% and 95% when they were classified using the dataset with the Nugent score. Unlike when the classification process was performed using the AMSEL criterion data, the precision was lower than that of the Nugent score. After the precision was calculated, the most representative attributes were determined based on the deconstruction of the models, and a ranking of attributes was made according to their model importance. They also identified that through Genetic Programming, there is a wide variation of models when they are phenotypically classified with the Nugent score.
In [20], Cruciani et al.’s designed a new phylogenetic microarray-based tool (VaginArray) that includes 17 probe sets specific for the most representative bacterial groups of the human vaginal ecosystem. The VaginArray was applied to evaluate the efficacy of rifaximin vaginal tablets for treating BV. The results showed the ability of rifaximin to reduce the growth of various BV-related bacteria (Atopobium vaginae, Prevotella, Megasphaera, Mobiluncus, and Sneathia spp.)
Ness et al.’s [21] applied exploratory factor analysis to investigate microflora measurements’ clustering and identify groups of associated microorganisms with pelvic inflammatory disease (PID). The initial sample consisted of 1628 women to whom exclusion criteria were applied (pregnant, married, virgin, and use of antibiotics at the beginning of the study), of which 1140 women were selected. They were interviewed and examined to acquire the vaginal floral samples and build the dataset. 20% of the data were missing, the reason it was necessary to complete the data. Multiple regression was performed to solve the missing values. Two independent groups of microorganisms were determined through exploratory factor analysis, obtaining a factor score from a linear combination of microorganisms and individual measurements. The interpretation of the results of the factor analysis (clustering) was shown in the sediment graphs. It was identified that women in the highest tertile presented growth of microorganisms associated with BV, more probable to experience pelvic inflammatory disease (PID).
The studies described above allow us to identify the most highlights aspects of scientific research using ML in BV. It is noticeable that most of the research approaches are directed towards the diagnosis and classification of a positive BV case. However, few studies use a clustering approach for determining a more informed number of clusters that allow to know the coexisting bacteria among groups of patients with a BV-positive condition. For example, research that applies a clustering approach are limited to the evaluation of a unique linking method of AHC algorithms and lack evaluation of metrics such as determination of the optimal number of groups, cophenetic correlation coefficient and significant clusters. Another limitation detected in clustering research is the absence of a visual tool to explore the characteristics shared between elements of the same grouping. Our approach encompasses the evaluation of different AHC methods, the evaluation of construct metrics that other studies do not consider, and a data visualization tool introduced to explore the contexts of bacterial coexistence among patients assigned to the same cluster.
Materials and methods
Dataset
The dataset used for this work was generated by the Laboratory of Research in Metabolic and Infectious Diseases at the Juarez Autonomous University of Tabasco. It was obtained as part of molecular epidemiological research on BV over the years 2016 to 2018 [22]. This dataset is complete with no missing values and was constructed by the biology expert.
The dataset contains 201 patient recorded with 59 attributes related to microorganisms associated with BV (BV–qPCR) and Human Papillomavirus (HPV). Our research is devoted to the BV condition only; therefore, 40 attributes related to HPV were discarded, and the 19 attributes related to BV microorganisms were kept, as shown in Table 1. All selected attributes are numerical and organized into three segments of attributes which were established and thoroughly described in [22]. The area expert described the segmentation as follows: Segment one (qualitative-asymmetric binary data of Lactobacillus) refers to the values of the presence and absence of Lactobacillus, segment two (quantitative data of Lactobacillus Cq) relates to the values of molecular diagnostics, segment three (qualitative-asymmetric binary data of Microorganisms) relates to the values of presence and absence of genital pathogens associated with BV.
Attributes selected from the VB dataset used in our experiments which were introduced in [22]
Attributes selected from the VB dataset used in our experiments which were introduced in [22]
The AHC algorithm allows for obtaining partitions according to each linkage method, with different criteria, including minimum distance, maximum distance, average group, and minimum variance. The AHC is associated with a distance matrix D
Each object is assigned a unique cluster. Calculate the (dis)similarity matrix between every pair of objects in the dataset. Determine the linkage criteria to merge the clusters. Repeat the process of steps 2 and 3 until all objects are in one cluster.
The asymmetric binary measure is a metric used to estimate the similarity between objects with asymmetric binary properties. An asymmetric attribute is a particular case of a nominal variable with two categories (1-Presence, 0-Absence); this means that one state of the attribute is more informative than the other. A clear example arises when we seek to identify the presence or absence of a disease according to its characteristics [24]. Faith, D. P. (1983), suggests the following measure of similarity,
This measure can be adjusted to be constrained between 0 and 1 as follows:
Where
In the AHC approach, each data point is defined in a group, and the existing groups are combined at each step. The different linkage methods for this approach are:
Single linkage method
Single linkage measures the proximity between two groups by calculating the distance between their objects and choosing the minimum distance as the merging distance between the two groups [25]. It is written mathematically as shown in Eq. (3) [23].
Where
Complete linkage measures the proximity between the two groups by calculating the distance between their objects, with a maximum distance to be merged. This method is sensitive to outliers [26]. It is calculated with Eq. (4) as follows [23]:
Where
The Average group measures the proximity between two groups by estimating the average distances between objects of both clusters or the average of the similarities between objects of both clusters. It is calculated with Eq. (5), as follows [27]:
Where
Ward’s minimum-variance method, the distance between two clusters, is the sum of squared deviations from point to centroid. It minimizes within clusters the sum of squares, mathematically written as shown in Eq. (6) [28]:
Where ESS is the error sum of squares.
There are two different methods found in the literature for the Ward method. Ward’s [29] and the Ward’s D2 [30] method. The difference between the methods is that Ward’s.D2 performs the squared dissimilarity calculation before updating the group [31].
Pearson correlation and Cophenetic correlation coefficient
Pearson correlation coefficient is a test that measures the statistical relationship between two continuous variables X and Y [32]. The formula for the correlation coefficient is [33]:
Where
The cophenetic correlation coefficient (CCC) is the result of computing Pearson’s correlation coefficient with the values in a cophenetic matrix [34]. The formula is defined as [35]:
The c-value or cophenetic correlation coefficient (CCC) indicates how faithfully the clustering model preserves the pairwise distances between the original unmodelled points and determines the quality of the solution. Its evaluation should be done through of the correlation magnitude strata suggested by Schober, Boer, and Schwarte [36], such as: insignificant (0.00–0.10), weak (0.10–039), moderate (0.40–0.69), strong (0.70–0.89), and very strong (0.90–1.00).
The gap statistic method determines the optimal number of groups in a dataset. This approach compares the total intragroup variation for different k values with their expected values under a null reference distribution of the data. The result will be the optimal cluster number that maximizes the gap statistic. Hence, it is defined [37]:
Where
The silhouette method estimates the mean of the observations for different values of K. The optimal number of clusters is the one that maximises the mean of the silhouette over a number of possible K values. The formula for the silhouette is [38]:
Where
The phases of this study’s construction are shown in Fig. 1, and described in this section. The phases of obtaining and dataset subselection are described in Section 3.1.
AHC experimental design.
It is important to note that the problem investigated in BV is to know which bacteria coexist in the positive cases. This is approached through an exploration of the dataset using clustering algorithms, where it is assumed that each group will be formed by elements that have in common with the bacteria that detonate under the condition. Two experiments were designed as follows:
Experiment 1. – The objective was focused on determining groups with the presence or absence of BV, that is, positive or negative diagnosis of BV. We considered segment one of the qualitative-asymmetric-binary data of Lactobacillus and segment three of the qualitative-asymmetric-binary data of microorganisms of the dataset described in Table 1.
Experiment 2. – The objective was designed to identify groups with dissimilarity, and at the same time being groups with elements showing the same diagnosis. That is, there might be more than one group with BV-positive elements. These cases allow for investigation of bacterial coexistence, which might be different bacteria between groups. We considered segment two of the quantitative data of Lactobacillus Cq and segment three qualitative-asymmetric-binary data of microorganisms from the dataset described in Table 1.
Note that we excluded the class variable in the clustering model’s construction. The class variable was used later to inspect the classes of the elements of the clusters.
Identification of metrics of distance and linkage methods
To build AHC, we explored the different linkage methods and distance measures found in the literature.
Regarding the linkage methods, we identified the most commonly used methods, which are Single Link, Complete Link, Group Average, Ward’s.D, and Ward’s.D2. These methods were selected to be applied in the study.
Thereafter, we explored the distance measures used in the AHC algorithm. For selecting the distance measure, we considered the characteristics of the attributes in the dataset. The BV dataset attributes are nominal variables with two categories (1-Presence and 2-Absence) defined as qualitative asymmetric binary data. Other attributes are quantitative variables. The distance metric was selected to apply Asymmetric binary similarity measure.
Hierarchical clustering
According to the literature, the AHC algorithm is the predominant method when seeking to understand the structure of a dataset for characterizing objects into groups for the first time [39]. AHC results in a dendrogram, which aids in identifying the proper number of groups for further analysis. This study leads to an analysis of bacterial coexistence in groups of patients with BV. To create the AHC model, the following construction stages were conducted.
Calculation of the agglomerative percentage (AP)
To decide which method to use to build your dendrogram. It is necessary to calculate the AP of each chosen linkage method. The AP is calculated using the function Agnes from the cluster package [40] and takes as arguments the dataset and linkage method. What the Agnes function does is it returns the value of AP of the method evaluated. This value allows estimating the level of clustering in the dataset.
AP is a dimensionless value in the range of 0 to 1. When AP is very low in the method evaluated, it can be interpreted that there isn’t a structure of grouping or that the data conform to only one group. In contrast, AP values close to 1 means that there is a clear structure of groping. This definition is suggested by Kaufman and Rousseeuw [41]. The method that obtains the highest AP in each experiment builds its AHC model, which is represented by a dendrogram.
Construction of the model
To generate a clustering model using the AHC algorithm. The hclust function of the stats package in the R language was used [42]. The arguments are the distance matrix and linkage method.
To estimate the distance matrix, the dist function of the stats package [42] is used. The input arguments of the dist function are the dataset and asymmetry binary similarity measures. What this function does is it returns the distance between all pairs of objects in the dataset. The linkage methods selected were those described in Subsection 4.3.1.
The hclust function creates a clustering model by taking the distance matrix of n observations that will be clustered following the linking method’s criteria.
Once the clustering model has been created, the resultant dendrogram with the AP chosen is plotted using a plot function from package graphics [43]. Afterwards, we analyzed the dendrogram structure to identify the optimal number of clusters. For this, we sought to distinguish the level where the difference between two consecutive fusions is the largest.
To make groups visible on the dendrogram, the hcoplot function was used [44]. The input arguments to this function were the clustering model, the distance matrix as obtained in Subsection 4.3.2, the number of groups identified in the analysis of the dendrogram structure given by argument
What the hcoplot function does is it cuts and orders the AHC model. To cut the dendrogram, the hcoplot function implicitly uses the cutree function. In this function is used the argument
Grouping table
A cluster table is a cross-frequency table between the real class variables and the group variable assigned by the algorithm. The column and row structures show the grouping of elements according to the group and the diagnosis assigned by the algorithm.
Computational and biological validation
Computational Validation
Computational validation involves measuring the quality of the clustering results. This phase comprises three points:
Determine the optimal number of groups, which consists of identifying the clusters based on the data through metrics that provide the optimal value of
The number of optimal clusters was validated using the Fviz_nbclust factoextra package [46] in the R language. For the validation of the optimal number of underlying clusters, the two most popular standard methods in the literature were selected; the silhouette method [38] used for experiment 1 and gap-statistics [37] for experiment 2. The purpose was to give certainty that the optimal number of clusters suggested in Subsection 4.3.3 was corresponding to the estimation through metrics. The evaluation of the methods was performed according to the objective of each experiment and the properties of the selected attributes are described in Subsection 4.1. Calculate the CCC, which consists of measuring the correlation between the initial distances taken from the original data and the final distances at which the objects have been merged in the clustering process.
The CCC is performed using the cophenetic function from the stats package [42]. The arguments required are the distance matrix and cophenetic distance.
The process of calculating the distance matrix is described in Subsection 4.3.2. The cophenetic distance is calculated using the cophenetic function from the stats package [42]. The input argument of the cophenetic function was the AHC model obtained in Subsection 4.3.2. Determine significant clusters, which consists of assessing the uncertainty of the AHC analysis. A significant cluster is a group that satisfies the condition of having a
This process was performed using the pvclust function of the pvclust package [47]. The input arguments for the two experiments were the Ward.D2 method as the linking method and the binary asymmetric similarity measure as the distance. What the pvclust function does is it returns p-values and bootstrap probability (BP) values that correspond to the frequency that the clustering appears. For more details see [48].
Biological Validation
The validation process involves verifying the biological significance of the underlying groups in the clustering models. For this purpose, the clusters were made available to an expert in the field. The expert explored each element of the underlying clusters and corroborated that all of them were put in the right cluster according to the real class of elements. That is, elements of class positive were grouped together, similarly elements of class negative and class indeterminate were grouped in its corresponding groups as shown in Tables 4 and 7 for experiments 1 and 2, respectively.
The data visualization highlights the characteristics of the clusters found in the AHC model to provide the domain expert with a screening tool to identify contexts of bacterial coexistence and validate the clusters biological significance.
Note that we performed all steps included in the experiment design for the two experiments.
In this work, the experiments were performed in RStudio Version 1.2.5019, from the stats package [42], which provides the functions hclust and dist that were used for building the Hierarchical Clustering model. Additional R packages were used for visualization, and analysis of the model such as purrr [49], cluster [40], graphics [43], factoextra [46], and pvclust [47]. These packages are pointed to in the sections where the processes that used them are described.
Results
To the best of our knowledge, at the time of this research, no other studies have been found in the literature that addresses the problem of BV to identify contexts of bacterial coexistence in groups of patients, using machine learning algorithms specifically using a clustering approach.
In this section, we show the results obtained from applying AHC method. To decide which method to apply for the construction of the dendrogram, it was necessary first to investigate the value of the AP. This value indicates the possible agglomeration level in the dataset. Four methods were explored, as shown in Table 2.
Percentage of agglomeration (AP) for experiments 1 and 2. The units of the results show the percentages obtained for each linkage method evaluated. The highest value is shown in bold
Percentage of agglomeration (AP) for experiments 1 and 2. The units of the results show the percentages obtained for each linkage method evaluated. The highest value is shown in bold
Ward’s D2 method reached the highest AP for the two experiments. The AP value obtained in the first experiment was 0.9661422 and 0.9772106 in the second one. Based on this information, we determined to build the dendrograms using the Ward’s.D2 method.
A dendrogram was created using segment one the of qualitative-asymmetric-binary data of Lactobacillus and segment three of qualitative-asymmetric-binary data of microorganisms, which is described in Subsection 5.1, and its validation process is described in Subsection 5.2. Similarly, a second dendrogram was created using the segment two of quantitative data of Lactobacillus Cq and segment three qualitative-asymmetric-binary data of microorganisms, which Subsection 5.3 describes, and its validation process is described in Subsection 5.4. Finally, Subsection 5.5 describes the clustering results of all linkage methods.
In this subsection, we show the results obtained from the construction, plot, and cut of the dendrogram of the first experiment described in Subsection 4.1.
Dendrogram
A dendrogram was created using Ward’s.D2 as the linkage method and Asymmetric Binary Similarity Measure as a distance metric. The dendrogram is pictured in Fig. 2a. The purpose of the dendrogram is to assist in the graphical identification of the potential number of clusters in the analysed dataset. We interpreted the two vertical lines in Fig. 2a. as two potential clusters. Therefore, we cut the dendrogram by setting
Quantification of the number of elements in each group resulted in 150 and 51 elements conforming to the cluster 1 and cluster 2, respectively. This is summarized in Table 3.
Element count in the underlying clusters in experiment 1
Element count in the underlying clusters in experiment 1
Dendrogram using the ward.D2 method in experiment 1; The Uncut dendrogram (a) and final dendrogram with cut-off 
Subsequently, we investigated whether the clusters found using Ward’s.D2 method have a meaningful interpretation in real BV diagnostic classes. We looked at the elements of each cluster C1, C2 which were delivered by the algorithm. The class of each element was verified and counted according to whether it was positive, negative and indeterminate. Totals are shown in Table 4.
Grouping table about the evaluation of the elements assigned in underlying clusters regarding the real classes from experiment 1
Table 4 illustrates that in group C1 were assigned elements of the BV-negative and BV-indeterminate class. Both are equivalent to 74.62 % of the dataset. In group C2 were put BV-positive instances only, which is 25.37% of the dataset. Finally, a T-SNE visualization is generated from the underlying groups, which allows to observe each element of the groups in a two-dimensional plane, as shown in Fig. 3.
Plot showing the distribution of elements of the underlying clusters in a two-dimensional plane in experiment 1.
In this subsection, we show the processes of computational validation and data visualization of the first experiment.
Optimal number of clusters
Validation of the optimal number of clusters found in the dendrogram structure was performed using the silhouette method. The result is shown in Fig. 4. It shows that the value of the highest silhouette was reached with
Plot showing the average silhouette width for clusters 
The cophenetic correlation coefficient (CCC) was calculated as described in Subsection 4.3.5. The purpose was to identify how faithfully the dendrogram preserves the pairwise distances of objects. The result of CCC obtained was 1. This CCC was interpreted through the absolute value of the Correlation Coefficient described in Subsection 3.5.1. We interpreted the CCC obtained as a very strong correlation in how faithfully the dendrogram preserves the pairwise distances between the original unmodeled data points. The correlation is shown in Fig. 5 with a Shepard-like diagram. The plot is interpreted as the relationship between a dissimilarity matrix on the x-axis and a cophenetic matrix on the y-axis. The visible points show the distance where the objects become members of the same group. The line shows the trend in the plot.
Shepard-like diagram showing the relationship between the similarity matrix and the cophenetic matrix in experiment 1. The visible points show the distance where the objects become members of the same group. The line shows the trend in the plot. 
It was evaluated the significant clusters as described in Subsection 4.3.5. The intention was to validate the groupings obtained in the AHC analysis. The dendrogram is shown in Fig. 6 where two clusters marked with a red rectangle are observed. The clusters marked satisfy the condition of having an AU value equal to or higher than 95%, which indicates that they have high reliability, as the data strongly support them.
Hierarchical clustering of the significant groups in experiment 1. The branch values are the AU p values (left) and the BP value (right). Rectangles indicate significant clusters with AU higher than 95.
Tools for exploring the underlying clustering contexts of the AHC model in experiment 1. The highlighted patients share the coexistence of pathogens in the BV-positive condition in their assigned cluster.
Data Visualization (DV) tool designed using Tableau [50] which is available online through [51]. This provides the results obtained from the AHC model and significant clusters. The purpose of the DV is to provide a support tool for exploring coexisting bacteria in groups with BV present and absent status emerging from the AHC process. Figure 7 shows a screenshot.
The DV is structured in three segments. There are three labels in the superior part: General View, Negative and Indeterminate Cases, and Positive Cases. The centre of the picture shows symbols representing the female patients that each can be hovered over to display the patient characteristics. On the right side, a number of filters are provided to adjust the patients being displayed.
Two patients from the positive group are shown in the graph. It can be seen that they share the presence of specific microorganisms in their diagnosis. Therefore, it can be determined that the algorithm was able to identify elements with the same diagnosis.
Results of the experiment 2
In this subsection, we show the results obtained from the construction, plot, and cut of the dendrogram of the second experiment described in Subsection 4.1.
Dendrogram
A dendrogram was created using Ward’s.D2 as the linkage method and Asymmetric Binary Similarity Measure as a distance metric. The dendrogram is pictured in Fig. 8a. The dendrogram has the purpose of supporting the graphical identification of the potential number of clusters in the analyzed dataset. We interpreted the dendrogram of Fig. 8b looking to distinguish the level where the difference between two consecutive fusions is largest.
Subsequently, we determined the k-value (number of clusters) in the dendrogram by performing different cut-off levels, which produced a number of clusters. Table 5 shows the cut-off height and the
Different
-values for ward.D2 method in experiment 2. The height indicates the cutting level, and K indicates the underlying groups. Group highlights the behavior of the clusters regarding the real class
Different
Six clusters were identified as the potential optimal number of groups in the AHC model. Therefore, we cut the dendrogram by setting
Quantification of the number of elements in each cluster found is distributed in the following way: Cluster 1 (28 elements), Cluster 2 (36 elements) Cluster 3 (64 elements), Cluster 4 (32 elements), Cluster 5 (29 elements), and Cluster 6 (12 elements). This is summarized in Table 6.
Element count in the underlying clusters from experiment 2
Grouping table about the evaluation of the elements assigned in underlying clusters regarding the real classes in experiment 2
Dendrograms using the ward.D2 method in experiment 2; The Uncut dendrogram (a) and final dendrogram with cut-off 
Thereafter, we investigated whether the clusters found in the AHC model have a meaningful interpretation in real BV diagnostic classes. We looked at the elements of each Cluster C1, C2, C3, C4, C5, C6 which were delivered by the algorithm. The class of each element was verified and counted according to being positive, Negative, and indeterminate. Totals are shown in Tabla 7.
Table 7 illustrates that groups C1, C2, C3 contain elements of the negative and indeterminate classes. Whereas, clusters C4 and C6 are conformed only to positive class elements. In cluster C5, there are elements of positive and negative. Finally, a T-SNE visualization is generated from the underlying groups, which allows to observe each element of the groups in a two-dimensional plane, as shown in Fig. 9.
Plot showing the distribution of elements of the underlying clusters in a two-dimensional plane in experiment 2.
In this subsection, we show the processes of computational validation, and data visualization of the second experiment.
Optimal number of clusters
The gap-statistic method used the gap statistic method to validate the optimal number of clusters found in the dendrogram structure analysis. This is shown in Fig. 10. It shows that the value of the highest gap-statistic was reached with
Plot showing the gap statistic for clusters 
The cophenetic correlation coefficient (CCC) was calculated as described in Subsection 4.3.5. The aim was to identify how faithfully the dendrogram preserves the pairwise distances of objects. The CCC obtained was 0.6618216. This CCC was interpreted through the absolute value of the Correlation Coefficient described in Subsection 3.5.1.
We interpreted the CCC obtained as a moderate correlation in how faithfully the dendrogram preserves the pairwise distances between the original unmodeled data points. The correlation is shown in Fig. 11 with a Shepard-like diagram. The plot is interpreted as the relationship between a dissimilarity matrix on the x-axis and a cophenetic matrix on the y-axis. The visible points show the distance where the objects become members of the same group. The line shows the trend in the plot.
Shepard-like diagram showing the relationship between the similarity matrix and the cophenetic matrix in experiment 2. The visible points show the distance where the objects become members of the same group. The line shows the trend in the plot.
Hierarchical clustering of the significant groups in experiment 2. The branch values are the AU 
It was evaluated the significant clusters as described in Subsection 4.3.5. The intention was to validate the groupings obtained in the AHC analysis. The dendrogram is shown in Fig. 12 where two clusters marked with a red rectangle are observed. The clusters marked satisfy the condition of having an AU value equal to or higher than 95%, which indicates that they have high reliability, as the data strongly support them.
Data visualization
Data Visualization (DV) tool designed using Tableau [50] which is available online through [52]. This provides the results obtained from the AHC model and significant clusters. The purpose of the DV in this second experiment is to provide a supporting tool to explore the coexisting bacteria in different groups belonging to the positive diagnosis of BV emerging from the AHC process. Figure 13 shows a screenshot.
The DV is structured in three segments. There are eight labels in the superior part: General View, Negative and Indeterminate Cases, Positive Cases, and significant clusters. The centre of the picture shows symbols representing the female patients that each can be hovered over to display the patient characteristics. On the right side, a number of filters are provided to adjust the patients being displayed.
Two patients from the positive group are shown in the graph. It can be seen that they share the presence of specific microorganisms in their diagnosis. Therefore, it can be determined that the algorithm was able to identify elements with the same diagnosis.
Tools for exploring the underlying clustering contexts of the AHC model in experiment 2. The highlighted patients share the coexistence of pathogens in the BV-positive condition in their assigned cluster.
For comparison purposes between the different methods with low and high AP obtained in the AHC model, were built the grouping tables shown in Table 8. This table is distributed in the following way: The first column shows the similarity measure used. The second column displays all the linkage methods. The third column shows the diagnostic results that can be obtained: positive, negative, and indeterminate. The fourth and five-column shows the results of the first experiment. Finally, the sixth and seventh columns show the results of the second experiment set.
Comparison of results from experiments 1 and 2. The grouping table shows the number of elements assigned to each group according to the actual classes and the CCC of each binding method
Comparison of results from experiments 1 and 2. The grouping table shows the number of elements assigned to each group according to the actual classes and the CCC of each binding method
Thereafter, we analyzed the grouping differences between each model of clustering. It was identified that in the first experiment, the linkage methods with low AP were achieved to create groups with states of presence and absence of BV in the same mode as the high AP methods.
For the second experiment was confirmed that only Ward’s.D2 method achieved identify clusters with dissimilarity between patients with the same diagnosis. The results obtained in the second experiment of the methods with low AP were interpreted in the following way:
The group-average method obtained 0.8542708% of agglomeration and a correlation of 0.7695198. Results of HC indicate that it is best to create groups for cases of BV-negative as well as the complete-link method. The single link method obtained an agglomeration of 0.7125233% and, correlation of 0.6575352. Results of HC indicate a mixed clustering in only one group. Fails to perform groups with features similar to those belonging to the same class. The complete link method obtained an agglomeration of 0.9087942% and, correlation of 0.7683422. Results of HC show the best grouping for the diagnosis of BV-negative. The ward method obtained an agglomeration of 0.9772106% and a correlation of 0.6618216. Results of HC show that grouping has an even distribution for different diagnoses. The best result of the hierarchical clustering was obtained through asymmetric binary similarity measures and the Ward.D2 method, which allowed the identification of clusters with elements of the same diagnostic class. The generated clusters are divided as follows: Three of the five clusters incorporated BV-negative elements in different proportions. In contrast, the fourth set includes many positive diagnoses, and finally, the sixth group comprises elements of different BV diagnostic classes.
Time and memory complexity analysis
This subsection shows the time and memory complexity evaluation of the ward.D2 method and the distance matrix obtained in Subsection 4.3.2. For this evaluation, the GuessCompx package [53] was used with the default parameters of maximum time in seconds allowed for each step of the analysis of 30 seconds and the number of replicated runs of the algorithm for a specific sample size of 2. The GuessCompx package was described and introduced by Agenis-Nevers, M. et al.’s [54]. The runtime and memory results are detailed in Table 9.
Time and memory complexity analysis values for experiments 1 and 2
Time and memory complexity analysis values for experiments 1 and 2
The complexity analysis of experiment 1 of the ward. D2 method showed a time complexity value of 0.03641079s. The results denote that the total time complexity for constructing the clustering model is low, which is called the best case complexity, as shown in Fig. 14a. The memory space complexity for experiment 1 is 16252 Mb with logarithmic trend over time, as shown in Fig. 14b.
Complexity fit against (a) run time and (b) memory usage for distance function on Experiment 1, suggests O(log(N)) as the best model for both evaluations. The complexity graph of the ward. D2 method indicates the best model by a yellow line for (a) and (b).
The complexity analysis of experiment 2 of the ward. D2 method showed a time complexity of NAs value for its measurement, that is to say, the value obtained in the runtime of experiment 2 is very close to 0, as shown in Fig. 15a. The results denote that the total time complexity for constructing the clustering model is low, which is called the best case complexity. The memory space complexity for experiment 1 is 16252 Mb with logarithmic trend over time, as shown in Fig. 15b.
Complexity fit against (a) run time and (b) memory usage for distance function on Experiment 1, suggests O(1) and O(log(N)) as best model, respectively. The complexity graph of the ward.D2 method indicates the best model by a red line (a) and a yellow line (b).
The convergence analysis was performed based on the results of Subsections 5.1.1 of experiment 1 and 5.3.1 of experiment 2, corresponding to the dendrogram analysis. Convergence in the ward.D2 method ensures that the underlying groups allow for further analysis of coexisting bacteria in clusters of BV-positive patients.
The convergence of the ward.D2 method in experiment 1 was achieved when exploring the value
The convergence behavior can be seen in Fig. 16, which shows that a low cut of the dendrogram height allows to underlie a certain number of groups with a high similarity of elements of the VB-positive condition.
The graph shows the convergence obtained from the different values explored in experiments 1 and 2.
This subsection shows the findings on the coexistence of bacteria in the clustering models found in the experiments 1 and 2. The findings on bacterial coexistence from the first experiment showed a prevalence of BV pathogens of 94.12%, 66.62%, 58.82%, and 37.5% for Atopobium, Gardnerella vaginalis, Megasphaera, Mycoplasma hominis, respectively. The contexts of coexistence bacteria identified in the 51 elements with BV-positive are shown in Table 3 were:
Atopobium Atopobium Atopobium Atopobium Atopobium
In the second experiment, two clusters of positive BV diagnosis emerge. In the first cluster, the results show a prevalence of BV pathogens of 90.62%, 78.12%, 56.25%, and 28.12% for Atopobium, Gardnerella vaginalis, Megasphaera, Mycoplasma hominis, respectively. The contexts of coexistence bacteria identified in the 32 elements with BV-positive shown in Table 8 were:
Atopobium Atopobium Gardnerella vaginalis Atopobium Atopobium Atopobium Atopobium
Cluster two of the second experiment shows BV pathogen prevalence of 100%, 58.33%, 58.33%, 58.33%, 66.66% for Atopobium, Gardnerella vaginalis, Megasphaera, Mycoplasma hominis, respectively. The contexts of coexistence bacteria identified in the 12 elements with BV-positive shown in Table 8 were:
Atopobium Atopobium Gardnerella vaginalis Atopobium Atopobium Atopobium
This research shows that linkage methods and the similarity measure contribute significantly to identifying the best hierarchical model and the optimal number of clusters for further analysis of bacteria coexisting between patient groups. A determining factor in this research was identifying the optimal number of clusters in the VB data, as clusters that share a high similarity of characteristics would emerge from this procedure. The study also highlighted that it is necessary to further investigate adaptive or random methods to automatically identify clusters with high similarity in the hierarchical structure. Therefore this is still an open research in the field.
On the other hand, it is essential to mention that until the time of the development of this study, there is no evidence of another similar approach to compare results. However, to support the results, they were subjected to biological validation by an expert using data visualizations of the models highlighting the bacterial coexistence contexts shared by the elements of each VB cluster. Due to the lack of literature on VB clustering, further experimentation with other methods is suggested to consolidate our findings.
Conclusion
In this paper, we aimed at creating an AHC model that would allow determining a more informed number of groups to create for further analysis of bacteria’s coexistence. This first effort provides an AHC model with the intention of supporting the study of bacterial coexistence contexts in groups of patients with BV. The single link, the complete link, group average, and Ward’s methods were evaluated in two experiments to identify the method with the highest AP and construct its dendrogram. Each AHC model obtained was validated through metrics. Finally, the clustering tables of the linkage methods were compared.
In the first experiment, the Ward’s.D2 method showed the best results in the metrics evaluated. The results obtained allow us to conclude that through AHC, it is possible to create groups that are distinguished by the presence and absence of BV. The Ward’s.D2 method showed the best results in the metrics evaluated in the second experiment. The results obtained allow us to conclude that through AHC, it is possible to create groups with dissimilarity and, at the same time being groups with elements showing the same diagnosis.
We consider that the proposed AHC model identifies the best method for each clustering case. Knowing which linkage methods are the best at clustering tasks in different experiments could serve as a basis for building an expert system that implements the best models. This system would facilitate the interpretation of bacterial coexistence context, which impacts the physician’s decision-making when choosing treatments for patients with BV.
The related work mentioned above focuses on the diagnosis, classification, and design of microarray-based phylogenetic tools, each contributing to the understanding of BV. This clustering study confirms that it is possible to determine a more informed number of patient groups to analyze coexisting bacteria. Furthermore, the results of the experiments indicate differences in the bacterial coexistence context between one group and the other with BV-positive. Thus, knowing the coexistence of bacteria present in clusters with similar characteristics allows physicians to establish appropriate treatment strategies to prevent the development of major gynaecological complications. This study evaluates a set of linkage methods and a similarity measure to build two clustering models for further analysis of coexisting bacteria in groups of patients with BV. This represents a contribution to medicine from a nonsupervised learning perspective with the aim of supporting physicians in understanding BV.
In future work, we will address other clustering methods and distance measures. We are also interested in obtaining the solutions of different clustering methods most prevalent in the literature and then based on that to perform comparative studies with recent clustering approaches such as SNK-AHC [55]. Finally, the models generated using all clustering methods can be integrated into expert systems to act as decision aids for specialists.
