An agglomerative hierarchical clustering approach to identify coexisting bacteria in groups of bacterial vaginosis patients

Abstract

Bacterial Vaginosis (BV) is a polymicrobial syndrome in which the great diversity of microorganisms, together with their causal connotations, gives rise to complex dynamics of bacterial coexistence across groups of patients. The main aim of this study was to explore a dataset of patients with BV to determine a more informed number of groups to create for further analysis of bacterial coexistence. The Agglomerative Hierarchical Clustering (AHC) algorithm was applied to a BV dataset from an urban population in southeastern Mexico consisting of 201 patient records with 59 patient attributes and three classes (BV-positive, BV-negative, BV-indeterminate). The clustering results make it possible to identify distinct groups of patients. In the first experiment, the most prevalent coexisting bacteria among patients with BV were Atopobium $+$ Gardnerella vaginalis (37.50%) and Atopobium $+$ Megasphaera (15.68%). In the second experiment, they were Atopobium $+$ Megasphaera $+$ Mycoplasma hominis (33.33%) and Atopobium $+$ Gardnerella vaginalis $+$ Mycoplasma hominis (25%). Finally, we provide evidence that the AHC algorithm makes it possible to identify an optimal number of clusters with high intra-cluster similarity and inter-cluster dissimilarity. Furthermore, this approach allowed us to create a clustering model that helps analyze the complex dynamics between bacteria in groups of patients with BV.

Keywords

Hierarchical clustering, bacterial vaginosis, data mining, coexisting bacteria

1. Introduction

BV is a polymicrobial clinical syndrome [1] that affects 15 to 50% [2] of women of childbearing age [3]. BV is a dysbiosis of the vaginal microflora, which usually occurs when Lactobacillus [4, 5, 6] decreases quantitatively before an overgrowth of mainly anaerobic bacteria [7, 8]. The organisms involved in this dysbiosis are present in both health and disease.

Women affected by BV may be symptomatic or asymptomatic. Symptomatic cases experience a fishy vaginal odour, greyish-white discharge, itching, and a vaginal pH above 4.5 [9]. It is essential to perform a timely BV diagnosis to avoid gynaecological complications [10], such as endometritis, salpingitis, oophoritis [11], preterm premature rupture of membranes (PPROM), and chorioamnionitis [12].

The classical methods for diagnosing BV are Amsel’s criteria [13] and the Nugent score [14]. Real-time PCR, which amplifies and simultaneously detects or quantifies the target DNA, has also been used to study BV [15]. BV has been addressed in multiple clinical and microbiology studies using these methods. However, the diversity of microorganisms found in the vaginal mucosa, along with their causal connotations, makes BV a complex dynamic problem [16]. The problem becomes even more complex when the goal is to identify groups of individuals with similar features.

In particular, it is well known that a BV-positive case is the consequence of a bacterial imbalance: it is not a single bacterium but the coexistence of several bacteria that leads to a BV-positive condition. However, these bacteria may change from patient to patient. We are therefore interested in tackling this problem using a clustering approach. We aim to identify clusters that help understand patients with BV who presumably share coexisting bacteria.

From the machine learning (ML) standpoint, groups of objects are called clusters. Clustering refers to the segmentation of datasets into groups of similar objects. Each group consists of objects similar to one another and dissimilar to objects in other groups [17]. This technique can be highly effective in determining the contexts of bacterial coexistence between groups of patients with the same diagnosis.

In this study, we address the problem of BV by applying the AHC algorithm to a real BV dataset from an urban population in southeastern Mexico, consisting of 201 patient records with 59 patient attributes and three classes (BV-positive, BV-negative, and BV-indeterminate). The classes were hidden from the AHC algorithm so that the grouping would be driven by the attributes alone. The aim was to obtain, from the dendrogram produced by the AHC, a more informed number of groups to create in the dataset for further analysis of the bacterial coexistence that leads to a BV-positive diagnosis. These groups must show similarity among the elements within the same group and dissimilarity with respect to the elements of other groups, so that contexts of bacterial coexistence can be detected. Finally, the results were validated computationally and biologically.

The AHC algorithm is implemented using a linkage method and a distance metric. The linkage methods tested were Single Link, Complete Link, Group Average, Ward’s.D, and Ward’s.D2. There are several distance metrics for different purposes; in this study, the asymmetric binary similarity measure was applied. Two experiments were performed:

(1) In the first, the qualitative-asymmetric-binary data of Lactobacillus were used with the qualitative-asymmetric-binary data of microorganisms.
(2) In the second, the quantitative data of Lactobacillus Cq were used with the qualitative-asymmetric-binary data of microorganisms.

In both experiments, the percentage of agglomeration (AP) was calculated for each linkage method. The results were used to identify the method delivering the highest AP value, which was then used to build the dendrograms, where different levels of data grouping can be read. Subsequently, clustering tables were constructed to investigate whether the elements were put in the right cluster according to their real class. The model was then computationally validated using the following metrics: the cophenetic correlation coefficient (CCC), the optimal number of clusters through the gap statistic, and the average silhouette. Significant clusters were evaluated using the Ward’s.D2 method and the asymmetric binary similarity measure. Thereafter, the model was biologically validated by an expert in the field, and finally the data visualizations were created.

After that, a comparative table was constructed of all linkage methods that obtained a low AP. The data were analysed, which allowed identifying whether the AHC method produced clusters according to two types of states (presence and absence) in the first experiment. In the second experiment, it was analysed whether the groups found were dissimilar among other groups but belonged to the same diagnosis.

It was also possible to identify distinct groups of patients. The differences from group to group lie in the active bacterium that led to the BV-positive condition. This study contributes to the effort of providing insights into the identification of coexisting bacteria in groups of patients diagnosed as BV-positive. The benefit of identifying such groups translates into selecting specific treatments according to the coexisting bacteria in each group. It also becomes a support tool for obtaining a priori knowledge of the contexts that may arise in clinical cases.

This paper is organized as follows: Section 2 describes related work on machine learning in BV. Section 3 describes the dataset, linkage methods, similarity measure, cophenetic correlation coefficient, and validation metrics. Section 4 describes the steps of the AHC experimental design. Section 5 presents the results. Section 6 presents the discussion. Finally, our conclusions are in Section 7.

2. Related works

A few studies have analyzed BV using ML methods.

The purpose of Song et al. [18] was to integrate superpixel methods with deep learning methods based on a convolutional neural network (CNN) for the automatic assisted diagnosis of BV. The experiment was based on 105 women (18–50 years old) from Shenzhen Hospital, from whom 105 oil immersion images were obtained using an Olympus BX43. The images were evaluated through a reference frame and processed in greyscale through superpixel computation to smooth their appearance. The superpixel algorithm was used to group each segment of the cells and to classify the bacteria. This method yields a description of the characteristics (size, colour, shape, texture, gradients) that BV may present. The deep learning method used the CNN to learn input and output mappings. It was evaluated using precision, sensitivity, accuracy, specificity, the F1 measure, and the Zijdenbos similarity index. The performance of the CNN was compared with that of other classifiers: backpropagation neural networks, support vector machines (SVM), learning vector quantization (LVQ), and probabilistic neural networks (PNN).

Baker et al. [19] built a classification model by breaking down the groups of microbes based on their correlation. This also reduced the number of factors, increasing the interpretability of the classification models. The classifications were made using Genetic Programming, Random Forest, and Logistic Regression, and the precision of the models was evaluated using ROC curves. The precision obtained was between 90% and 95% when the dataset was labelled with the Nugent score; when the classification was performed using the Amsel criteria, the precision was lower. After the precision was calculated, the most representative attributes were determined by deconstructing the models, and the attributes were ranked according to their importance in the models. The authors also found that, with Genetic Programming, the models vary widely when they are phenotypically classified with the Nugent score.

In [20], Cruciani et al. designed a new phylogenetic microarray-based tool (VaginArray) that includes 17 probe sets specific for the most representative bacterial groups of the human vaginal ecosystem. The VaginArray was applied to evaluate the efficacy of rifaximin vaginal tablets for treating BV. The results showed the ability of rifaximin to reduce the growth of various BV-related bacteria (Atopobium vaginae, Prevotella, Megasphaera, Mobiluncus, and Sneathia spp.).

Ness et al. [21] applied exploratory factor analysis to investigate the clustering of microflora measurements and identify groups of microorganisms associated with pelvic inflammatory disease (PID). The initial sample consisted of 1628 women, to whom exclusion criteria were applied (pregnant, married, virgin, and use of antibiotics at the beginning of the study), leaving 1140 women. They were interviewed and examined to acquire the vaginal flora samples and build the dataset. Because 20% of the data were missing, the missing values were imputed using multiple regression. Two independent groups of microorganisms were determined through exploratory factor analysis, obtaining a factor score from a linear combination of microorganisms and individual measurements. The results of the factor analysis (clustering) were interpreted using scree plots. Women in the highest tertile presented growth of microorganisms associated with BV and were more likely to experience PID.

The studies described above allow us to identify the most salient aspects of scientific research using ML in BV. Most of the research approaches are directed towards the diagnosis and classification of BV-positive cases. However, few studies use a clustering approach to determine a more informed number of clusters that reveals the coexisting bacteria among groups of patients with a BV-positive condition. For example, studies that apply a clustering approach are limited to evaluating only one linkage method of the AHC algorithm and do not evaluate metrics such as the optimal number of groups, the cophenetic correlation coefficient, and significant clusters. Another limitation detected in clustering research is the absence of a visual tool to explore the characteristics shared between elements of the same group. Our approach encompasses the evaluation of different AHC methods, the evaluation of metrics that other studies do not consider, and a data visualization tool introduced to explore the contexts of bacterial coexistence among patients assigned to the same cluster.

3. Materials and methods

3.1 Dataset

The dataset used for this work was generated by the Laboratory of Research in Metabolic and Infectious Diseases at the Juarez Autonomous University of Tabasco. It was obtained as part of molecular epidemiological research on BV over the years 2016 to 2018 [22]. This dataset is complete with no missing values and was constructed by the biology expert.

The dataset contains 201 patient records with 59 attributes related to microorganisms associated with BV (BV–qPCR) and Human Papillomavirus (HPV). Our research is devoted to the BV condition only; therefore, the 40 attributes related to HPV were discarded, and the 19 attributes related to BV microorganisms were kept, as shown in Table 1. All selected attributes are numerical and organized into three segments, which were established and thoroughly described in [22]. The area expert described the segmentation as follows: segment one (qualitative-asymmetric binary data of Lactobacillus) refers to the presence and absence of Lactobacillus; segment two (quantitative data of Lactobacillus Cq) relates to the values of molecular diagnostics; and segment three (qualitative-asymmetric binary data of microorganisms) relates to the presence and absence of genital pathogens associated with BV.

Table 1
Attributes selected from the BV dataset used in our experiments, which were introduced in [22]

Data	Acronyms	Legend	Values
Segment one	L. crispatus	Lactobacillus crispatus	1 or 2
	L. gasseri	Lactobacillus gasseri	
	L. iners	Lactobacillus iners	
	L. jensenii	Lactobacillus jensenii	
Segment two	CrispatusCq	Lactobacillus crispatus Cq* (Growth Value)	Between 0 and 37.46
	GasseriCq	Lactobacillus gasseri Cq* (Growth Value)	
	JenseniiCq	Lactobacillus jensenii Cq* (Growth Value)	
	InersCq	Lactobacillus iners Cq* (Growth Value)	
Segment three	PathogenComb	Pathogen combination	1 or 2
	Megasphaera Phylotipo1	Gram-negative anaerobic bacteria 16S RNA sequence	
	Atopobium	Atopobium	
	Gardnerella V.	Gardnerella vaginalis	
	CT	Chlamydia trachomatis	
	NG	Neisseria gonorrhoeae	
	HSV1&2	Herpes simplex type 1 and 2	
	MH	Mycoplasma hominis	
	MG	Mycoplasma genitalium	
	UP	Ureaplasma parvum	
	UU	Ureaplasma urealyticum	

3.2 Agglomerative hierarchical clustering (AHC)

The AHC algorithm yields partitions that depend on the linkage method, each of which uses a different criterion: minimum distance, maximum distance, group average, or minimum variance. The AHC is associated with a distance matrix $D_{n}$ of size $n\times n$. For more details, see [23]. AHC construction consists of the following steps.

(1) Assign each object to its own cluster.
(2) Calculate the (dis)similarity matrix between every pair of objects (clusters) in the dataset.
(3) Apply the linkage criterion to determine which clusters to merge, and merge them.
(4) Repeat steps 2 and 3 until all objects are in one cluster.
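
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this study (which relies on R): the function names, the toy distance matrix, and the choice of single linkage as the merging criterion are all hypothetical.

```python
from itertools import combinations


def single_link(a, b, dist):
    """Smallest distance between a member of cluster a and a member of b."""
    return min(dist[i][j] for i in a for j in b)


def agglomerative_merges(dist):
    """Naive AHC over a symmetric distance matrix given as a list of lists.

    Returns the merge history as (members_a, members_b, height) tuples.
    """
    # Step 1: every object starts in its own cluster.
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    # Step 4: repeat until all objects are in one cluster.
    while len(clusters) > 1:
        # Steps 2 and 3: evaluate every pair of current clusters and pick
        # the closest pair under the linkage criterion.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], dist),
        )
        merges.append((sorted(a), sorted(b), single_link(a, b, dist)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Toy distance matrix for four hypothetical objects.
D = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.6, 0.2, 0.0],
]
merge_history = agglomerative_merges(D)
for left, right, height in merge_history:
    print(left, right, height)
```

The recorded heights are what a dendrogram draws on its vertical axis. The sketch recomputes all pairwise cluster distances at every iteration, which is simple but slow; production implementations update the distance matrix incrementally.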

3.3 Asymmetric binary similarity measure

The asymmetric binary measure is a metric used to estimate the similarity between objects with asymmetric binary properties. An asymmetric attribute is a particular case of a nominal variable with two categories (1-Presence, 0-Absence); this means that one state of the attribute is more informative than the other. A clear example arises when we seek to identify the presence or absence of a disease according to its characteristics [24]. Faith, D. P. (1983), suggests the following measure of similarity, $S$ :

$\displaystyle S=(1\times a+0\times d-1\times U)/N=(a-U)/N$ (1)

This measure can be adjusted to be constrained between 0 and 1 as follows:

$\displaystyle c=((a-U)/N+1)/2=((a-U)/N+(a+U+d)/N)/2=(2a+d)/(2N)=(a+d/2)/N$ (2)

Where $U$ is the number of disagreements (either “1”–“0” or “0”–“1”), $a$ is the number of shared presences, $d$ is the number of shared absences, and $N$ is the number of characters, so that $N=a+U+d$.
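
As a worked illustration of Eqs. (1) and (2), the following Python sketch counts $a$, $d$ and $U$ for two binary vectors and evaluates both measures. It assumes the attributes are coded 1 for presence and 0 for absence; the function name and the two patient vectors are hypothetical.

```python
def faith_similarity(x, y):
    """Return (S, c) from Eqs. (1) and (2) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)  # shared presences
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)  # shared absences
    u = sum(1 for p, q in zip(x, y) if p != q)             # disagreements
    n = len(x)                                             # number of characters
    s = (a - u) / n        # Eq. (1), lies in [-1, 1]
    c = (a + d / 2) / n    # Eq. (2), lies in [0, 1]
    return s, c


# Two hypothetical patients described by six presence/absence attributes.
patient_a = [1, 1, 0, 0, 1, 0]
patient_b = [1, 0, 0, 0, 1, 1]
s, c = faith_similarity(patient_a, patient_b)
print(s, c)
```

Here $a=2$, $d=2$, $U=2$ and $N=6$, so a shared presence counts twice as much as a shared absence in $c$, which is the sense in which the measure is asymmetric.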

3.4 Linkage methods

In the AHC approach, each data point is defined in a group, and the existing groups are combined at each step. The different linkage methods for this approach are:

3.4.1 Single linkage method

Single linkage measures the proximity between two groups by calculating the distance between their objects and choosing the minimum distance as the merging distance between the two groups [25]. It is written mathematically as shown in Eq. (3) [23].

$\displaystyle\min_{i_{C_{1}},i_{C_{2}}}d(i_{C_{1}},i_{C_{2}}),\quad i_{C_{1}}\in C_{1},\ i_{C_{2}}\in C_{2}$ (3)

Where $i_{C_{1}}$ and $i_{C_{2}}$ are observations belonging to clusters $C_{1}$ and $C_{2}$, respectively, and $d(i_{C_{1}},i_{C_{2}})$ is the distance between them. The minimum of these distances over all such pairs is taken as the distance between the two clusters.

3.4.2 Complete linkage method

Complete linkage measures the proximity between two groups by calculating the distance between their objects and choosing the maximum distance as the merging distance between the two groups. This method is sensitive to outliers [26]. It is calculated with Eq. (4) as follows [23]:

$\displaystyle\max_{i_{C_{1}},i_{C_{2}}}d(i_{C_{1}},i_{C_{2}}),\quad i_{C_{1}}\in C_{1},\ i_{C_{2}}\in C_{2}$ (4)

Where $i_{C_{1}}$ and $i_{C_{2}}$ are observations belonging to clusters $C_{1}$ and $C_{2}$, respectively, and $d(i_{C_{1}},i_{C_{2}})$ is the distance between them. The maximum of these distances over all such pairs is taken as the distance between the two clusters.

3.4.3 Average group method

The Average group measures the proximity between two groups by estimating the average distances between objects of both clusters or the average of the similarities between objects of both clusters. It is calculated with Eq. (5), as follows [27]:

$\displaystyle D_{AV}(C_{i},C_{j})=\text{avg}_{x\in C_{i},\,y\in C_{j}}\,\text{dist}(x,y)$ (5)

Where $C_{i}$ and $C_{j}$ are two sets of objects (clusters), the notation avg over $x\in C_{i}$, $y\in C_{j}$ denotes the average over all pairs formed by one element of $C_{i}$ and one element of $C_{j}$, and $\text{dist}(x,y)$ denotes the distance between the vectors $x$ and $y$.
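
The three criteria in Eqs. (3) to (5) differ only in how they summarise the same set of between-cluster distances. The Python sketch below makes this explicit; the function names and the toy distance matrix are hypothetical and serve only as an illustration.

```python
from itertools import product


def pairwise(c1, c2, dist):
    """All distances d(i, j) with i in cluster c1 and j in cluster c2."""
    return [dist[i][j] for i, j in product(c1, c2)]


def single_link(c1, c2, dist):
    """Eq. (3): minimum pairwise distance."""
    return min(pairwise(c1, c2, dist))


def complete_link(c1, c2, dist):
    """Eq. (4): maximum pairwise distance."""
    return max(pairwise(c1, c2, dist))


def group_average(c1, c2, dist):
    """Eq. (5): mean pairwise distance."""
    values = pairwise(c1, c2, dist)
    return sum(values) / len(values)


# Toy distance matrix for four hypothetical objects.
D = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.6, 0.2, 0.0],
]
C1, C2 = [0, 1], [2, 3]
print(single_link(C1, C2, D), complete_link(C1, C2, D), group_average(C1, C2, D))
```

For any pair of clusters the single-link value is never larger than the group average, which in turn is never larger than the complete-link value.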

3.4.4 Ward’s and Ward’s.D2 minimum variance method

In Ward’s minimum-variance method, the distance between two clusters is defined through the sum of squared deviations from the points to the cluster centroid. The method minimizes the within-cluster sum of squares, written mathematically as shown in Eq. (6) [28]:

$\displaystyle\textit{ESS}=\sum_{x_{n}\in C}\|x_{n}-\bar{x}\|^{2}$ (6)

Where ESS is the error sum of squares, the sum runs over all data points $x_{n}$ belonging to cluster $C$, $\bar{x}$ is the mean value (centroid) of the cluster, and $\|\cdot\|^{2}$ is the squared norm of the deviation.

Two variants of the Ward method are found in the literature: Ward’s [29] and Ward’s.D2 [30]. The difference between them is that Ward’s.D2 squares the dissimilarities before updating the clusters [31].
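
To make Eq. (6) concrete, the following Python sketch computes the ESS of a cluster and the increase in total ESS that a merge would cause, which is the quantity a minimum-variance criterion seeks to keep small. It assumes points in Euclidean space; the function names and the two toy clusters are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def ess(points):
    """Eq. (6): sum of squared Euclidean deviations from the centroid."""
    centre = centroid(points)
    return sum(sum((v - m) ** 2 for v, m in zip(p, centre)) for p in points)


def merge_cost(c1, c2):
    """Increase in total ESS if clusters c1 and c2 were merged."""
    return ess(c1 + c2) - ess(c1) - ess(c2)


# Two hypothetical clusters of two-dimensional points.
A = [[0.0, 0.0], [0.0, 2.0]]
B = [[4.0, 0.0], [4.0, 2.0]]
print(ess(A), ess(B), merge_cost(A, B))
```

In this example each cluster has an ESS of 2, the merged cluster has an ESS of 20, and the merge cost is therefore 16.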

3.5 Validation metrics

3.5.1 Pearson correlation and Cophenetic correlation coefficient

Pearson correlation coefficient is a test that measures the statistical relationship between two continuous variables X and Y [32]. The formula for the correlation coefficient is [33]:

$\displaystyle r=\frac{\sum(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum(X_{i}-\bar{X})^{2}\sum(Y_{i}-\bar{Y})^{2}}}$ (7)

Where $r$ is the correlation coefficient, $X_{i}$ are the values of the variable $X$ in a sample, $\bar{X}$ is the mean of the values of the variable $X$, $Y_{i}$ are the values of the variable $Y$ in a sample, and $\bar{Y}$ is the mean of the values of the variable $Y$.

The cophenetic correlation coefficient (CCC) is the result of computing Pearson’s correlation coefficient with the values in a cophenetic matrix [34]. The formula is defined as [35]:

$\displaystyle c=\frac{\sum_{i<j}(Y_{ij}-y)(Z_{ij}-z)}{\sqrt{\sum_{i<j}(Y_{ij}-y)^{2}\sum_{i<j}(Z_{ij}-z)^{2}}}$ (8)

Where $Y_{ij}$ is the original distance between objects $i$ and $j$, $Z_{ij}$ is their cophenetic distance, that is, the height at which the two objects are first merged in the dendrogram, and $y$ and $z$ are the means of the $Y_{ij}$ and $Z_{ij}$, respectively. The c-value or cophenetic correlation coefficient (CCC) indicates how faithfully the clustering model preserves the pairwise distances between the original unmodelled points and determines the quality of the solution. It should be evaluated using the correlation magnitude strata suggested by Schober, Boer, and Schwarte [36]: insignificant (0.00–0.10), weak (0.10–0.39), moderate (0.40–0.69), strong (0.70–0.89), and very strong (0.90–1.00).
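
A minimal Python sketch of Eq. (8) follows. It derives the cophenetic distances from a merge history and correlates them with the original distances. The merge history format (members of the two merged clusters plus the merge height), the function names, and the toy data are assumptions made for illustration; they do not reproduce the R functions used in this study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, as in Eq. (7)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


def cophenetic_distances(merges, n):
    """Height at which each pair i < j is first joined, in row-wise order."""
    coph = {}
    for left, right, height in merges:
        for i in left:
            for j in right:
                coph[(min(i, j), max(i, j))] = height
    return [coph[(i, j)] for i in range(n) for j in range(i + 1, n)]


# Original distances for four hypothetical objects.
D = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.6, 0.2, 0.0],
]
# Hypothetical merge history of a dendrogram built on D.
merges = [([0], [1], 0.1), ([2], [3], 0.2), ([0, 1], [2, 3], 0.6)]

Y = [D[i][j] for i in range(4) for j in range(i + 1, 4)]
Z = cophenetic_distances(merges, 4)
ccc = pearson(Y, Z)
print(round(ccc, 3))
```

Every pair of objects that ends up on opposite sides of the last merge receives the same cophenetic distance, which is why the CCC falls below 1 whenever the original distances between such pairs differ.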

3.5.2 Gap statistic

The gap statistic method determines the optimal number of groups in a dataset. This approach compares the total intragroup variation for different k values with their expected values under a null reference distribution of the data. The result will be the optimal cluster number that maximizes the gap statistic. Hence, it is defined [37]:

$\displaystyle\textit{Gap}_{n}(k)=E_{n}^{*}\{\log(W_{k})\}-\log(W_{k})$ (9)

Where $E_{n}^{*}$ denotes expectation under a sample of size $n$ from the reference distribution, and $W_{k}$ is the pooled within-cluster dispersion (the total intragroup variation) obtained with $k$ clusters, so that $\log(W_{k})$ is its logarithm.
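
For context, the following is a brief sketch of how $W_{k}$ is usually defined in the gap statistic literature. The symbols $C_{r}$, $n_{r}$, $D_{r}$, $d_{ii'}$ and $s_{k+1}$ are not used elsewhere in this paper and are introduced here only for illustration. With $d_{ii'}$ the distance between observations $i$ and $i'$, $C_{r}$ the $r$-th cluster, and $n_{r}$ its size:

$\displaystyle D_{r}=\sum_{i,i'\in C_{r}}d_{ii'},\qquad W_{k}=\sum_{r=1}^{k}\frac{1}{2n_{r}}D_{r}$

A commonly used selection rule, which may differ from the rule applied by a given software package, picks the smallest $k$ such that

$\displaystyle\textit{Gap}_{n}(k)\geq\textit{Gap}_{n}(k+1)-s_{k+1}$

where $s_{k+1}$ is the simulation standard error of the reference term at $k+1$ clusters.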

3.5.3 Silhouette

The silhouette method estimates the mean silhouette of the observations for different values of $K$. The optimal number of clusters is the one that maximises the mean silhouette over the range of possible $K$ values. The formula for the silhouette of observation $i$ is [38]:

$\displaystyle s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}$ (10)

Where $a(i)$ is the average distance between $i$ and all other observations in the same cluster, and $b(i)$ is the average distance between $i$ and the observations in the nearest cluster.
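
Eq. (10) can be illustrated with the short Python sketch below, which computes $s(i)$ for every observation and then the average silhouette. It assumes that $a(i)$ and $b(i)$ are average distances, that at least two clusters exist, and that a singleton cluster receives $s(i)=0$ by convention; the function name, labels and distance matrix are hypothetical.

```python
def silhouette(i, labels, dist):
    """Eq. (10) for observation i, given cluster labels and a distance matrix."""
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    if not own:
        return 0.0  # convention for an observation alone in its cluster
    # a(i): average distance to the other members of its own cluster.
    a = sum(dist[i][j] for j in own) / len(own)
    # b(i): smallest average distance to the members of any other cluster.
    b = min(
        sum(dist[i][j] for j in members) / len(members)
        for members in (
            [j for j, lab in enumerate(labels) if lab == other]
            for other in set(labels) - {labels[i]}
        )
    )
    return (b - a) / max(a, b)


# Toy distance matrix and cluster assignment for four hypothetical objects.
D = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.6, 0.2, 0.0],
]
labels = [0, 0, 1, 1]
scores = [silhouette(i, labels, D) for i in range(len(labels))]
average_silhouette = sum(scores) / len(scores)
print(round(average_silhouette, 3))
```

Values close to 1 indicate that an observation sits well inside its cluster, values near 0 that it lies between two clusters, and negative values that it is probably assigned to the wrong cluster.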

4. Experimental design

The phases of this study are shown in Fig. 1 and described in this section. The phases of obtaining the dataset and selecting the subset of attributes are described in Section 3.1.

Figure 1.

AHC experimental design.

4.1 Description of the experiments

It is important to note that the problem investigated in BV is to know which bacteria coexist in the positive cases. This is approached by exploring the dataset with clustering algorithms, under the assumption that each group will be formed by elements that have in common the bacteria that trigger the condition. Two experiments were designed as follows:

Experiment 1. – The objective was focused on determining groups with the presence or absence of BV, that is, positive or negative diagnosis of BV. We considered segment one of the qualitative-asymmetric-binary data of Lactobacillus and segment three of the qualitative-asymmetric-binary data of microorganisms of the dataset described in Table 1.

Experiment 2. – The objective was to identify groups that are dissimilar from one another while containing elements with the same diagnosis. That is, there might be more than one group with BV-positive elements. These cases allow for investigation of bacterial coexistence, since the coexisting bacteria might differ between groups. We considered segment two (quantitative data of Lactobacillus Cq) and segment three (qualitative-asymmetric-binary data of microorganisms) from the dataset described in Table 1.

Note that we excluded the class variable in the clustering model’s construction. The class variable was used later to inspect the classes of the elements of the clusters.

4.2 Identification of metrics of distance and linkage methods

To build AHC, we explored the different linkage methods and distance measures found in the literature.

Regarding the linkage methods, we identified the most commonly used methods, which are Single Link, Complete Link, Group Average, Ward’s.D, and Ward’s.D2. These methods were selected to be applied in the study.

Thereafter, we explored the distance measures used in the AHC algorithm. To select the distance measure, we considered the characteristics of the attributes in the dataset. Some BV dataset attributes are nominal variables with two categories (1-Presence and 2-Absence), defined as qualitative asymmetric binary data; other attributes are quantitative variables. The asymmetric binary similarity measure was therefore selected as the distance metric.

4.3 Hierarchical clustering

According to the literature, the AHC algorithm is the predominant method when seeking to understand the structure of a dataset for characterizing objects into groups for the first time [39]. AHC results in a dendrogram, which aids in identifying the proper number of groups for further analysis. This study leads to an analysis of bacterial coexistence in groups of patients with BV. To create the AHC model, the following construction stages were conducted.

4.3.1 Calculation of the agglomerative percentage (AP)

To decide which method to use to build the dendrogram, it is necessary to calculate the AP of each chosen linkage method. The AP is calculated using the agnes function from the cluster package [40], which takes the dataset and the linkage method as arguments and returns the AP value of the method evaluated. This value allows estimating the level of clustering in the dataset.

AP is a dimensionless value in the range of 0 to 1. When the AP of the evaluated method is very low, it can be interpreted that there is no grouping structure or that the data form only one group. In contrast, AP values close to 1 mean that there is a clear grouping structure. This interpretation is suggested by Kaufman and Rousseeuw [41]. In each experiment, the method that obtains the highest AP is used to build the AHC model, which is represented by a dendrogram.
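
As an illustration of how such a value can be obtained, the Python sketch below computes the agglomerative coefficient in the sense of Kaufman and Rousseeuw from a merge history. It assumes that the AP reported in this study corresponds to that coefficient: for each observation, the height of its first merge is divided by the height of the final merge, and the coefficient is the average of one minus this ratio. The function name, the merge history format, and the toy data are hypothetical.

```python
def agglomerative_coefficient(merges, n):
    """Average of 1 - m(i), where m(i) is the height of the first merge
    involving observation i divided by the height of the final merge."""
    final_height = merges[-1][2]
    first_merge = {}
    for left, right, height in merges:
        for i in left + right:
            # setdefault keeps the earliest (first) merge height seen for i.
            first_merge.setdefault(i, height)
    return sum(1 - first_merge[i] / final_height for i in range(n)) / n


# Hypothetical merge history for four objects:
# (members of first cluster, members of second cluster, merge height).
merges = [([0], [1], 0.1), ([2], [3], 0.2), ([0, 1], [2, 3], 0.6)]
ap = agglomerative_coefficient(merges, 4)
print(round(ap, 3))
```

Observations that join a cluster at a height far below that of the final merge push the coefficient towards 1, which is the situation described above as a clear grouping structure.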

4.3.2 Construction of the model

To generate a clustering model using the AHC algorithm, the hclust function of the stats package in the R language was used [42]. Its arguments are the distance matrix and the linkage method.

• To estimate the distance matrix, the dist function of the stats package [42] was used. The input arguments of the dist function are the dataset and the asymmetric binary similarity measure. The function returns the distance between all pairs of objects in the dataset.
• The linkage methods selected were those described in Subsection 4.3.1.

The hclust function creates a clustering model by taking the distance matrix of the n observations, which are clustered following the criterion of the linkage method.

4.3.3 Plot and cut of the dendrogram

Once the clustering model has been created, the resulting dendrogram of the method with the highest AP is plotted using the plot function from the graphics package [43]. Afterwards, we analyzed the dendrogram structure to identify the optimal number of clusters. For this, we looked for the level at which the difference between two consecutive fusions is the largest.
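
The rule of cutting where the difference between two consecutive fusions is largest can be sketched as follows. This is an illustrative Python helper, not the procedure applied in this study, where the dendrogram was inspected visually; the function name and the example heights are hypothetical.

```python
def clusters_from_largest_gap(heights):
    """Number of clusters left when cutting the dendrogram inside the largest
    gap between consecutive merge heights (heights in ascending order)."""
    if len(heights) < 2:
        raise ValueError("at least two merges are needed")
    n = len(heights) + 1  # n objects produce n - 1 merges
    gaps = [heights[t + 1] - heights[t] for t in range(len(heights) - 1)]
    t = max(range(len(gaps)), key=gaps.__getitem__)
    # Cutting between merge t and merge t + 1 keeps the first t + 1 merges,
    # and every merge reduces the number of clusters by one.
    return n - (t + 1)


# Hypothetical merge heights of a dendrogram with six objects.
example_heights = [1.0, 1.2, 1.3, 5.0, 5.5]
k = clusters_from_largest_gap(example_heights)
print(k)
```

In the example, the largest jump is between the third and fourth merges, so the cut is made after three merges and three clusters remain.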

To make the groups visible on the dendrogram, the hcoplot function was used [44]. The input arguments to this function were the clustering model, the distance matrix obtained in Subsection 4.3.2, and the number of groups identified in the analysis of the dendrogram structure, given by the argument $k$.

The hcoplot function cuts and orders the AHC model. To cut the dendrogram, it implicitly uses the cutree function with the argument $k$, which is the number of clusters that will be visible in the dendrogram. Similarly, it uses the reorder.hclust function, which reorders the objects so that pairs of nearby objects are contiguous, the ordering criterion suggested by Gruvaeus and Wainer [45]. The result is a dendrogram with visible and reordered groups.

4.3.4 Grouping table

A grouping table is a cross-frequency table between the real class variable and the group variable assigned by the algorithm. Its rows and columns show how the elements are distributed across the groups assigned by the algorithm and the real diagnoses.
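
Such a cross-frequency table can be built with a few lines of Python, as in the sketch below. The function name, class labels, and cluster assignments are hypothetical and are not taken from the BV dataset.

```python
from collections import Counter


def grouping_table(true_classes, cluster_ids):
    """Cross-frequency table: rows are real classes, columns are cluster ids."""
    counts = Counter(zip(true_classes, cluster_ids))
    rows = sorted(set(true_classes))
    cols = sorted(set(cluster_ids))
    # Counter returns 0 for class/cluster combinations that never occur.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


# Hypothetical diagnoses and cluster assignments for six patients.
classes = ["positive", "positive", "negative", "negative", "indeterminate", "positive"]
clusters = [1, 1, 2, 2, 2, 3]
table = grouping_table(classes, clusters)
for diagnosis, row in table.items():
    print(diagnosis, row)
```

A cluster whose column is concentrated in a single row contains patients with the same diagnosis, while a column spread over several rows mixes diagnoses.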

4.3.5 Computational and biological validation

• Computational Validation

Computational validation involves measuring the quality of the clustering results. This phase comprises three points:

(1) Determine the optimal number of groups, which consists of identifying the clusters based on the data through metrics that provide the optimal value of $k$ clusters.

* The optimal number of clusters was validated using the fviz_nbclust function of the factoextra package [46] in the R language. To validate the optimal number of underlying clusters, the two most popular standard methods in the literature were selected: the silhouette method [38] for experiment 1 and the gap statistic [37] for experiment 2. The purpose was to confirm that the optimal number of clusters suggested in Subsection 4.3.3 corresponded to the estimate obtained through these metrics. The methods were evaluated according to the objective of each experiment and the properties of the selected attributes, which are described in Subsection 4.1.

(2) Calculate the CCC, which consists of measuring the correlation between the initial distances taken from the original data and the final distances at which the objects have been merged in the clustering process.

* The CCC is computed with the help of the cophenetic function from the stats package [42]. The two inputs required are the distance matrix and the cophenetic distance.

⋅ The process of calculating the distance matrix is described in Subsection 4.3.2.
⋅ The cophenetic distance is calculated using the cophenetic function from the stats package [42]. The input argument of the cophenetic function was the AHC model obtained in Subsection 4.3.2.

(3) Determine significant clusters, which consists of assessing the uncertainty of the AHC analysis. A significant cluster is a group with a $p$-value equal to or higher than 95%, which indicates that it is a group strongly supported by the data.

* This process was performed using the pvclust function of the pvclust package [47]. The input arguments for the two experiments were the Ward’s.D2 method as the linkage method and the asymmetric binary similarity measure as the distance. The pvclust function returns $p$-values and bootstrap probability (BP) values, the latter corresponding to the frequency with which a cluster appears in the bootstrap replicates. For more details, see [48].

• Biological Validation

The validation process involves verifying the biological significance of the underlying groups in the clustering models. For this purpose, the clusters were made available to an expert in the field. The expert explored each element of the underlying clusters and corroborated that all of them were put in the right cluster according to their real class. That is, elements of the positive class were grouped together, and elements of the negative and indeterminate classes were likewise grouped in their corresponding groups, as shown in Tables 4 and 7 for experiments 1 and 2, respectively.

4.3.6 Data visualization

The data visualization highlights the characteristics of the clusters found in the AHC model to provide the domain expert with a screening tool to identify contexts of bacterial coexistence and validate the clusters’ biological significance.

Note that we performed all steps included in the experiment design for the two experiments.

In this work, the experiments were performed in RStudio Version 1.2.5019. The stats package [42] provides the hclust and dist functions that were used to build the hierarchical clustering model. Additional R packages were used for visualization and analysis of the model, namely purrr [49], cluster [40], graphics [43], factoextra [46], and pvclust [47]. These packages are cited in the sections that describe the processes in which they were used.

5. Results

To the best of our knowledge, at the time of this research, no other study in the literature addresses the problem of identifying contexts of bacterial coexistence in groups of patients with BV using machine learning algorithms, specifically a clustering approach.

In this section, we show the results obtained from applying the AHC method. To decide which linkage method to use for the construction of the dendrogram, it was first necessary to investigate the value of the AP. This value indicates the level of agglomeration attainable in the dataset. Four methods were explored, as shown in Table 2.

Table 2
Percentage of agglomeration (AP) for experiments 1 and 2. The units of the results show the percentages obtained for each linkage method evaluated. The highest value is shown in bold

Method	Agglomeration experiment 1	Agglomeration experiment 2
Single Link	0.8412858	0.7125233
Complete Link	0.8937486	0.9087942
Group Average	0.8646396	0.8542708
Minimum Variance or Ward’s.D2 Method	0.9661422	0.9772106

Ward’s.D2 method reached the highest AP in both experiments: 0.9661422 in the first experiment and 0.9772106 in the second. Based on this information, we decided to build the dendrograms using Ward’s.D2 method.
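The following sketch shows how a coefficient of this kind can be computed, assuming that the AP corresponds to the agglomerative coefficient of Kaufman and Rousseeuw: for each observation, the height of its first merge is divided by the height of the final merge, and the coefficient is the average of one minus that ratio. For brevity the sketch uses complete linkage rather than Ward’s method, and the function name and toy data are illustrative assumptions.

```python
from itertools import combinations


def agglomerative_coefficient(dist):
    """Agglomerative coefficient under complete linkage (naive algorithm).

    `dist` is a square, symmetric distance matrix given as nested lists.
    Assumes the final merge height is greater than zero.
    """
    n = len(dist)
    clusters = [frozenset([i]) for i in range(n)]
    first_merge = {}  # height at which each observation is first merged
    height = 0.0
    while len(clusters) > 1:
        # Complete linkage: cluster distance is the largest pairwise distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: max(dist[i][j] for i in pair[0] for j in pair[1]),
        )
        height = max(dist[i][j] for i in a for j in b)
        for cluster in (a, b):
            if len(cluster) == 1:
                first_merge[next(iter(cluster))] = height
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    # After the loop, `height` holds the height of the final merge.
    return sum(1 - first_merge[i] / height for i in range(n)) / n


# Toy data: four points on a line forming two tight, well-separated pairs.
points = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(p - q) for q in points] for p in points]
ac = agglomerative_coefficient(dist)
```

Values close to 1 indicate that observations join tight clusters early relative to the final merge, which is the sense in which a higher value suggests a stronger clustering structure.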

A dendrogram was created using segment one of the qualitative-asymmetric-binary data of Lactobacillus and segment three of the qualitative-asymmetric-binary data of microorganisms; it is described in Subsection 5.1, and its validation process is described in Subsection 5.2. Similarly, a second dendrogram was created using segment two of the quantitative data of Lactobacillus Cq and segment three of the qualitative-asymmetric-binary data of microorganisms; it is described in Subsection 5.3, and its validation process is described in Subsection 5.4. Finally, Subsection 5.5 describes the clustering results of all linkage methods.

5.1 Results of experiment 1

In this subsection, we show the results obtained from the construction, plot, and cut of the dendrogram of the first experiment described in Subsection 4.1.

5.1.1 Dendrogram

A dendrogram was created using Ward’s.D2 as the linkage method and the Asymmetric Binary Similarity Measure as the distance metric. The dendrogram is pictured in Fig. 2a. Its purpose is to assist in the graphical identification of the potential number of clusters in the analysed dataset. We interpreted the two vertical lines in Fig. 2a as two potential clusters. Therefore, we cut the dendrogram by setting $k=$ 2 in the hcoplot function. The result was a dendrogram with the two clusters coloured in red and green, as shown in Fig. 2b.
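For reference, the asymmetric binary distance can be sketched as follows, assuming the definition used by the `binary` method of R’s dist function: among the attributes that are present in at least one of the two patients, the proportion that are present in only one of them. The function name and the example vectors are illustrative assumptions.

```python
def asymmetric_binary_distance(x, y):
    """Proportion of attributes 'on' in only one vector among those 'on' in at least one."""
    both_on = sum(1 for a, b in zip(x, y) if a and b)
    only_one_on = sum(1 for a, b in zip(x, y) if bool(a) != bool(b))
    considered = both_on + only_one_on
    if considered == 0:
        # Assumption of this sketch: two all-zero records are treated as identical.
        return 0.0
    return only_one_on / considered


# Hypothetical presence/absence profiles of two patients over five bacteria.
patient_a = [1, 1, 0, 0, 1]
patient_b = [1, 0, 0, 1, 1]

# Two bacteria are shared and two are present in only one patient.
# The shared absence in the third position is ignored.
d = asymmetric_binary_distance(patient_a, patient_b)
```

Joint absences do not contribute to the distance, which is why the measure is called asymmetric: only the presence of a microorganism is treated as informative.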

Counting the elements in each group gave 150 elements in cluster 1 and 51 elements in cluster 2. This is summarized in Table 3.

Table 3
Element count in the underlying clusters in experiment 1

Cluster	Number of elements
Cluster 1	150
Cluster 2	51
Total	201 Elements

Figure 2.

Dendrogram using the ward.D2 method in experiment 1: the uncut dendrogram (a) and the final dendrogram cut at $k=$ 2 clusters (b).

Subsequently, we investigated whether the clusters found using Ward’s.D2 method have a meaningful interpretation in terms of the real BV diagnostic classes. We examined the elements of each cluster (C1, C2) delivered by the algorithm. The class of each element was verified and counted according to whether it was positive, negative, or indeterminate. Totals are shown in Table 4.

Table 4

Grouping table evaluating the elements assigned to the underlying clusters against the real classes in experiment 1

Distance	Linkage method	Dx. Vaginosis	Group C1	Group C2
Asymmetric Binary Similarity Measure	Ward’s.D2	Positive	0	51
		Negative	134	0
		Indeterminate	16	0

Table 4 shows that group C1 contains only elements of the BV-negative and BV-indeterminate classes, which together account for 74.63% of the dataset (150/201). Group C2 contains only BV-positive instances, which account for 25.37% of the dataset (51/201). Finally, a t-SNE visualization was generated from the underlying groups, which allows each element of the groups to be observed in a two-dimensional plane, as shown in Fig. 3.
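The shares of the dataset held by each group can be checked directly from the counts in Table 4. The short sketch below reproduces that arithmetic; the variable names are illustrative.

```python
# Counts taken from Table 4 (experiment 1).
table4 = {
    "C1": {"Positive": 0, "Negative": 134, "Indeterminate": 16},
    "C2": {"Positive": 51, "Negative": 0, "Indeterminate": 0},
}

total = sum(sum(counts.values()) for counts in table4.values())

# Percentage of the dataset assigned to each cluster, rounded to two decimals.
shares = {
    name: round(100 * sum(counts.values()) / total, 2)
    for name, counts in table4.items()
}
```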

Figure 3.

Plot showing the distribution of elements of the underlying clusters in a two-dimensional plane in experiment 1.

5.2 Validations

In this subsection, we show the processes of computational validation and data visualization of the first experiment.

5.2.1 Optimal number of clusters

Validation of the optimal number of clusters found in the dendrogram structure was performed using the silhouette method. The result is shown in Fig. 4: among the values explored, from $k=$ 1 to $k=$ 10, the highest average silhouette width was reached at $k=$ 2. This procedure allowed us to corroborate the two clusters found in the previous step.
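The average silhouette width behind this kind of plot can be sketched as below. For each element, $a$ is its mean distance to the other members of its own cluster, $b$ is the smallest mean distance to the members of any other cluster, and the silhouette is $(b-a)/\max(a,b)$. The function name and toy data are assumptions for illustration, and the sketch requires at least two clusters.

```python
def average_silhouette(dist, labels):
    """Average silhouette width for a distance matrix and a cluster labelling."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            widths.append(0.0)  # convention: silhouette of a singleton is 0
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in groups.items()
            if other != label
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Toy data: four points on a line forming two tight pairs.
points = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(p - q) for q in points] for p in points]

good = average_silhouette(dist, [0, 0, 1, 1])  # pairs kept together
bad = average_silhouette(dist, [0, 1, 1, 0])   # pairs split apart
```

Repeating the calculation for each candidate $k$ and keeping the labelling with the largest average width is the selection rule the silhouette method applies.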

Figure 4.

Plot showing the average silhouette width for clusters $k=$ 1 to $k=$ 10. The highest silhouette value is obtained at $k=$ 2 in experiment 1.

5.2.2 Cophenetic correlation coefficient

The cophenetic correlation coefficient (CCC) was calculated as described in Subsection 4.3.5. The purpose was to identify how faithfully the dendrogram preserves the pairwise distances of objects. The result of CCC obtained was 1. This CCC was interpreted through the absolute value of the Correlation Coefficient described in Subsection 3.5.1. We interpreted the CCC obtained as a very strong correlation in how faithfully the dendrogram preserves the pairwise distances between the original unmodeled data points. The correlation is shown in Fig. 5 with a Shepard-like diagram. The plot is interpreted as the relationship between a dissimilarity matrix on the x-axis and a cophenetic matrix on the y-axis. The visible points show the distance where the objects become members of the same group. The line shows the trend in the plot.
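The CCC is the Pearson correlation between the original pairwise distances and the cophenetic distances, that is, the dendrogram heights at which each pair of objects is first joined. The sketch below illustrates the calculation on toy values; the two distance lists are hypothetical and are not derived from the BV dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Pairwise distances for four points on a line (0, 1, 5, 6), listed for the
# pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
original = [1.0, 5.0, 6.0, 4.0, 5.0, 1.0]

# Cophenetic distances for the same pairs under complete linkage: each tight
# pair joins at height 1 and the two pairs join at height 6.
cophenetic = [1.0, 6.0, 6.0, 6.0, 6.0, 1.0]

ccc = pearson(original, cophenetic)
```

A value of 1 would mean that the dendrogram heights are a perfect linear function of the original distances.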

Figure 5.

Shepard-like diagram showing the relationship between the dissimilarity matrix and the cophenetic matrix in experiment 1. The visible points show the distance where the objects become members of the same group. The line shows the trend in the plot.

5.2.3 Significant clusters

The significant clusters were evaluated as described in Subsection 4.3.5. The intention was to validate the groupings obtained in the AHC analysis. The dendrogram is shown in Fig. 6, where two clusters marked with red rectangles can be observed. The marked clusters satisfy the condition of having an AU value equal to or higher than 95%, which indicates that they are highly reliable, as the data strongly support them.

Figure 6.

Hierarchical clustering of the significant groups in experiment 1. The branch values are the AU $p$-values (left) and the BP values (right). Rectangles indicate significant clusters with AU equal to or higher than 95%.

Figure 7.

Tools for exploring the underlying clustering contexts of the AHC model in experiment 1. The highlighted patients share the coexistence of pathogens in the BV-positive condition in their assigned cluster.

5.2.4 Data Visualization

A Data Visualization (DV) tool was designed using Tableau [50] and is available online through [51]. It presents the results obtained from the AHC model and the significant clusters. The purpose of the DV is to provide a support tool for exploring coexisting bacteria in the groups with BV present and BV absent that emerge from the AHC process. Figure 7 shows a screenshot.

The DV is structured in three segments. There are three labels at the top: General View, Negative and Indeterminate Cases, and Positive Cases. The centre of the picture shows symbols representing the female patients, each of which can be hovered over to display the patient characteristics. On the right side, a number of filters are provided to adjust which patients are displayed.

Two patients from the positive group are shown in the graph. It can be seen that they share the presence of specific microorganisms in their diagnosis. Therefore, it can be determined that the algorithm was able to identify elements with the same diagnosis.

5.3 Results of experiment 2

In this subsection, we show the results obtained from the construction, plot, and cut of the dendrogram of the second experiment described in Subsection 4.1.

5.3.1 Dendrogram

A dendrogram was created using Ward’s.D2 as the linkage method and the Asymmetric Binary Similarity Measure as the distance metric. The dendrogram is pictured in Fig. 8a. Its purpose is to support the graphical identification of the potential number of clusters in the analyzed dataset. We interpreted the dendrogram of Fig. 8b by looking for the level where the difference between two consecutive fusions is largest.

Subsequently, we determined the $k$-value (number of clusters) by cutting the dendrogram at different heights, each of which produced a different number of clusters. Table 5 shows each cut-off height and the $k$-value that arises from it. The $k$-values explored experimentally with the Ward.D2 method were 2, 3, 4, 5, and 6. Instances in the dataset were assigned to different clusters, as shown in Table 5. This process shows that the best-performing clustering originates at the 0.44 cut-off level, where $k=$ 6 is obtained.

Table 5
Different $k$-values for the ward.D2 method in experiment 2. Height indicates the cut-off level, K indicates the number of underlying groups, and Groups shows the distribution of the real classes across the clusters

Height	K	Vaginosis DX	Groups
			C1	C2
0.85	2	Positive	12	39
		Negative	22	112
		Indeterminate	6	10
			C1	C2	C3
0.76	3	Positive	12	32	7
		Negative	22	32	80
		Indeterminate	6	4	6
			C1	C2	C3	C4
0.60	4	Positive	12	0	7	32
		Negative	22	32	75	5
		Indeterminate	6	4	6	0
			C1	C2	C3	C4	C5
0.5	5	Positive	12	0	0	32	7
		Negative	22	32	58	0	22
		Indeterminate	6	4	6	0	0
			C1	C2	C3	C4	C5	C6
0.44	6	Positive	0	0	0	32	7	12
		Negative	22	32	58	0	22	0
		Indeterminate	6	4	6	0	0	0

Six clusters were identified as the potential optimal number of groups in the AHC model. Therefore, we cut the dendrogram by setting $k=$ 6 in the hcoplot function. The result obtained was a dendrogram with visible clusters, as shown in Fig. 8b.
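One way to read Table 5 is to count, for each cut-off, how many BV-positive patients fall in clusters that contain only BV-positive patients. The sketch below applies this illustrative criterion, which is an assumption of this example and not a metric defined in the study, to the counts in Table 5.

```python
# Counts from Table 5: for each k, one (positive, negative, indeterminate)
# tuple per cluster, in the order C1, C2, ...
table5 = {
    2: [(12, 22, 6), (39, 112, 10)],
    3: [(12, 22, 6), (32, 32, 4), (7, 80, 6)],
    4: [(12, 22, 6), (0, 32, 4), (7, 75, 6), (32, 5, 0)],
    5: [(12, 22, 6), (0, 32, 4), (0, 58, 6), (32, 0, 0), (7, 22, 0)],
    6: [(0, 22, 6), (0, 32, 4), (0, 58, 6), (32, 0, 0), (7, 22, 0), (12, 0, 0)],
}


def isolated_positives(clusters):
    """Number of positive patients located in clusters that hold only positives."""
    return sum(
        pos for pos, neg, ind in clusters if pos > 0 and neg == 0 and ind == 0
    )


summary = {k: isolated_positives(clusters) for k, clusters in table5.items()}
```

Under this criterion no positive patient is isolated for $k=$ 2, 3, or 4, 32 are isolated at $k=$ 5, and 44 of the 51 are isolated at $k=$ 6, which is consistent with the choice of the 0.44 cut-off.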

The elements are distributed among the clusters as follows: Cluster 1 (28 elements), Cluster 2 (36 elements), Cluster 3 (64 elements), Cluster 4 (32 elements), Cluster 5 (29 elements), and Cluster 6 (12 elements). This is summarized in Table 6.

Table 6

Element count in the underlying clusters from experiment 2

Cluster	Number of elements
Cluster 1	28
Cluster 2	36
Cluster 3	64
Cluster 4	32
Cluster 5	29
Cluster 6	12
Total	201 Elements

Table 7

Grouping table evaluating the elements assigned to the underlying clusters against the real classes in experiment 2

Distance	Linkage method	Vaginosis Dx.	C1	C2	C3	C4	C5	C6
Asymmetric Binary Similarity Measure	Ward.D2	Positive	0	0	0	32	7	12
		Negative	22	32	58	0	22	0
		Indeterminate	6	4	6	0	0	0

Figure 8.

Dendrograms using the ward.D2 method in experiment 2: the uncut dendrogram (a) and the final dendrogram cut at $k=$ 6 clusters (b).

Thereafter, we investigated whether the clusters found in the AHC model have a meaningful interpretation in terms of the real BV diagnostic classes. We examined the elements of each cluster (C1, C2, C3, C4, C5, C6) delivered by the algorithm. The class of each element was verified and counted according to whether it was positive, negative, or indeterminate. Totals are shown in Table 7.

Table 7 shows that groups C1, C2, and C3 contain elements of the negative and indeterminate classes, whereas clusters C4 and C6 consist only of positive-class elements. Cluster C5 contains both positive and negative elements. Finally, a t-SNE visualization was generated from the underlying groups, which allows each element of the groups to be observed in a two-dimensional plane, as shown in Fig. 9.

Figure 9.

Plot showing the distribution of elements of the underlying clusters in a two-dimensional plane in experiment 2.

5.4 Validations

In this subsection, we show the processes of computational validation, and data visualization of the second experiment.

5.4.1 Optimal number of clusters

The gap statistic method was used to validate the optimal number of clusters found in the dendrogram structure analysis. The result is shown in Fig. 10: among the values explored, from $k=$ 1 to $k=$ 8, the highest gap statistic was reached at $k=$ 6. This procedure allowed us to corroborate the six clusters found in the previous step and gives strong support to the optimal number of clusters found in the analysis of structures.

Figure 10.

Plot showing the gap statistic for clusters $k=$ 1 to $k=$ 8. The highest gap statistic value is obtained at $k=$ 6 in experiment 2.

5.4.2 Cophenetic correlation coefficient

The cophenetic correlation coefficient (CCC) was calculated as described in Subsection 4.3.5. The aim was to identify how faithfully the dendrogram preserves the pairwise distances of objects. The CCC obtained was 0.6618216. This CCC was interpreted through the absolute value of the Correlation Coefficient described in Subsection 3.5.1.

We interpreted the CCC obtained as a moderate correlation in how faithfully the dendrogram preserves the pairwise distances between the original unmodeled data points. The correlation is shown in Fig. 11 with a Shepard-like diagram. The plot is interpreted as the relationship between a dissimilarity matrix on the x-axis and a cophenetic matrix on the y-axis. The visible points show the distance where the objects become members of the same group. The line shows the trend in the plot.

Figure 11.

Shepard-like diagram showing the relationship between the dissimilarity matrix and the cophenetic matrix in experiment 2. The visible points show the distance where the objects become members of the same group. The line shows the trend in the plot.

Figure 12.

Hierarchical clustering of the significant groups in experiment 2. The branch values are the AU $p$-values (left) and the BP values (right). Rectangles indicate significant clusters with AU equal to or higher than 95%.

5.4.3 Significant clusters

The significant clusters were evaluated as described in Subsection 4.3.5. The intention was to validate the groupings obtained in the AHC analysis. The dendrogram is shown in Fig. 12, where two clusters marked with red rectangles can be observed. The marked clusters satisfy the condition of having an AU value equal to or higher than 95%, which indicates that they are highly reliable, as the data strongly support them.

5.4.4 Data visualization

A Data Visualization (DV) tool was designed using Tableau [50] and is available online through [52]. It presents the results obtained from the AHC model and the significant clusters. The purpose of the DV in this second experiment is to provide a support tool for exploring the coexisting bacteria in the different BV-positive groups that emerge from the AHC process. Figure 13 shows a screenshot.

The DV is structured in three segments. There are eight labels at the top, among them General View, Negative and Indeterminate Cases, Positive Cases, and significant clusters. The centre of the picture shows symbols representing the female patients, each of which can be hovered over to display the patient characteristics. On the right side, a number of filters are provided to adjust which patients are displayed.

Figure 13.

Tools for exploring the underlying clustering contexts of the AHC model in experiment 2. The highlighted patients share the coexistence of pathogens in the BV-positive condition in their assigned cluster.

5.5 Comparison of all linkage methods

To compare the methods with low and high AP obtained in the AHC model, the grouping tables shown in Table 8 were built. The table is organized as follows: the first column shows the similarity measure used; the second column lists the linkage methods; the third column shows the possible diagnostic results (positive, negative, and indeterminate); the fourth and fifth columns show the results of the first experiment; and the sixth and seventh columns show the results of the second experiment.

Table 8
Comparison of results from experiments 1 and 2. The grouping table shows the number of elements assigned to each group according to the actual classes and the CCC of each linkage method

Distance	Method		Groups experiment 1		Cophenetic correlation experiment 1	Groups experiment 2						Cophenetic correlation experiment 2
Asymmetric Binary Similarity Measure	Ward.D2	Dx. Vaginosis	C1	C2	1	C1	C2	C3	C4	C5	C6	0.6618216
		Positive	0	51		0	0	0	32	7	12
		Negative	134	0		22	32	58	0	22	0
		Indeterminate	16	0		6	4	6	0	0	0
	Ward.D	Dx. Vaginosis	C1	C2	1	C1	C2	C3	C4	C5	C6	0.5315148
		Positive	0	51		0	0	3	41	0	7
		Negative	134	0		22	28	50	0	22	12
		Indeterminate	16	0		6	3	5	0	0	2
	Average	Dx. Vaginosis	C1	C2	1	C1	C2	C3	C4	C5	C6	0.7694198
		Positive	0	51		3	13	24	7	4	0
		Negative	134	0		15	41	66	8	0	4
		Indeterminate	16	0		2	7	5	2	0	0
	Single	Dx. Vaginosis	C1	C2	1	C1	C2	C3	C4	C5	C6	0.6575352
		Positive	0	51		0	48	0	3	0	0
		Negative	134	0		3	120	8	0	2	1
		Indeterminate	16	0		0	14	2	0	0	0
	Complete	Dx. Vaginosis	C1	C2	1	C1	C2	C3	C4	C5	C6	0.7683422
		Positive	0	51		3	34	3	7	4	0
		Negative	134	0		15	102	5	8	0	4
		Indeterminate	16	0		2	12	0	2	0	0

Thereafter, we analyzed the grouping differences between the clustering models. In the first experiment, the linkage methods with low AP created groups separating the presence and absence of BV in the same way as the high-AP methods.

For the second experiment, it was confirmed that only Ward’s.D2 method identified clusters with dissimilarity between patients with the same diagnosis. The results obtained by the methods in the second experiment were interpreted as follows:

(1)

The group-average method obtained an AP of 0.8542708 and a correlation of 0.7695198. The HC results indicate that it performs best at creating groups for BV-negative cases, as does the complete-link method.

(2)

The single-link method obtained an AP of 0.7125233 and a correlation of 0.6575352. The HC results show a mixed clustering concentrated in a single group; the method fails to form groups of elements with similar features belonging to the same class.

(3)

The complete-link method obtained an AP of 0.9087942 and a correlation of 0.7683422. The HC results show the best grouping for the BV-negative diagnosis.

(4)

The Ward.D2 method obtained an AP of 0.9772106 and a correlation of 0.6618216. The HC results show that the grouping has an even distribution across the different diagnoses.

(5)

The best result of the hierarchical clustering was obtained with the asymmetric binary similarity measure and the Ward.D2 method, which allowed the identification of clusters with elements of the same diagnostic class. The generated clusters are divided as follows: three of the six clusters incorporated BV-negative elements in different proportions. In contrast, the fourth cluster includes many positive diagnoses, and finally, the fifth cluster comprises elements of different BV diagnostic classes.

5.6 Time and memory complexity and convergence analysis

5.6.1 Time and memory complexity analysis

This subsection presents the time and memory complexity evaluation of the ward.D2 method and the distance matrix obtained in Subsection 4.3.2. For this evaluation, the GuessCompx package [53] was used with its default parameters: a maximum of 30 seconds allowed for each step of the analysis and 2 replicated runs of the algorithm for each sample size. The GuessCompx package was introduced and described by Agenis-Nevers et al. [54]. The runtime and memory results are detailed in Table 9.

Table 9
Time and memory complexity analysis values for experiments 1 and 2

	Complexity experiment 1		Complexity experiment 2
Method	Time	Memory	Time	Memory
Ward.D2	0.03641079 s	16252 Mb	NA s	16252 Mb

The complexity analysis of the ward.D2 method in experiment 1 gave a run time of 0.03641079 s. The results indicate that the total time complexity for constructing the clustering model is low, which is called the best-case complexity, as shown in Fig. 14a. The memory usage for experiment 1 is 16252 Mb, with a logarithmic trend over time, as shown in Fig. 14b.

Figure 14.

Complexity fit against (a) run time and (b) memory usage for the distance function in experiment 1, suggesting O(log(N)) as the best model for both evaluations. The complexity graph of the ward.D2 method indicates the best model with a yellow line in (a) and (b).

The complexity analysis of the ward.D2 method in experiment 2 returned NA s for the run time; that is to say, the run time measured in experiment 2 is very close to 0, as shown in Fig. 15a. The results indicate that the total time complexity for constructing the clustering model is low, which is called the best-case complexity. The memory usage for experiment 2 is 16252 Mb, with a logarithmic trend over time, as shown in Fig. 15b.

Figure 15.

Complexity fit against (a) run time and (b) memory usage for the distance function in experiment 2, suggesting O(1) and O(log(N)) as the best models, respectively. The complexity graph of the ward.D2 method indicates the best model with a red line in (a) and a yellow line in (b).

5.6.2 Convergence analysis

The convergence analysis was based on the results of Subsection 5.1.1 for experiment 1 and Subsection 5.3.1 for experiment 2, which correspond to the dendrogram analysis. Convergence of the ward.D2 method ensures that the underlying groups allow for further analysis of coexisting bacteria in clusters of BV-positive patients.

The ward.D2 method converged in experiment 1 when exploring the value $k=$ 2; that is to say, a perfect clustering of BV-positive patients was obtained, as shown in Table 4. In contrast, the convergence of the ward.D2 method in experiment 2 was determined by exploring the values $k=$ 2, 3, 4, 5, 6, with $k=$ 6 achieving the best clustering performance, as shown in Table 5.

The convergence behavior can be seen in Fig. 16, which shows that a low cut of the dendrogram height yields a number of groups with a high similarity among elements of the BV-positive condition.

Figure 16.

The graph shows the convergence obtained from the different values explored in experiments 1 and 2.

5.7 Bacterial coexistence context

This subsection presents the findings on the coexistence of bacteria in the clustering models found in experiments 1 and 2. In the first experiment, the prevalence of BV pathogens was 94.12%, 66.62%, 58.82%, and 37.5% for Atopobium, Gardnerella vaginalis, Megasphaera, and Mycoplasma hominis, respectively. The contexts of coexisting bacteria identified in the 51 BV-positive elements shown in Table 3 were:

•
Atopobium $+$ Gardnerella vaginalis $=$ 31.37% (16/51)
•
Atopobium $+$ Megasphaera $=$ 15.68% (8/51)
•
Atopobium $+$ Gardnerella vaginalis $+$ Mycoplasma hominis $=$ 9.80% (5/51)
•
Atopobium $+$ Gardnerella vaginalis $+$ Megasphaera $=$ 9.80% (5/51)
•
Atopobium $+$ Gardnerella vaginalis $+$ Megasphaera $+$ Mycoplasma hominis $=$ 9.80% (5/51)
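Coexistence contexts of this kind can be tallied by counting identical combinations of present bacteria across the patients of a cluster. The sketch below shows one way to do this, assuming each patient is a presence/absence record; the four records and the function name are hypothetical and do not come from the BV dataset.

```python
from collections import Counter


def coexistence_contexts(records):
    """Count each combination of two or more co-present bacteria.

    Returns {combination: (count, percentage of all records)}.
    """
    contexts = Counter(
        frozenset(name for name, present in record.items() if present)
        for record in records
    )
    total = len(records)
    return {
        " + ".join(sorted(context)): (count, round(100 * count / total, 2))
        for context, count in contexts.items()
        if len(context) >= 2
    }


# Hypothetical BV-positive patients (1 = detected, 0 = not detected).
records = [
    {"Atopobium": 1, "Gardnerella vaginalis": 1, "Megasphaera": 0},
    {"Atopobium": 1, "Gardnerella vaginalis": 1, "Megasphaera": 0},
    {"Atopobium": 1, "Gardnerella vaginalis": 0, "Megasphaera": 1},
    {"Atopobium": 1, "Gardnerella vaginalis": 0, "Megasphaera": 0},
]
result = coexistence_contexts(records)

# The same percentage arithmetic applied to a reported count: 16 of 51 patients.
share_16_of_51 = round(100 * 16 / 51, 2)
```

Each context here is an exact combination, so a patient with three bacteria is counted only under the three-way context and not under its two-way subsets.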

In the second experiment, two clusters of positive BV diagnosis emerge. In the first cluster, the results show a prevalence of BV pathogens of 90.62%, 78.12%, 56.25%, and 28.12% for Atopobium, Gardnerella vaginalis, Megasphaera, and Mycoplasma hominis, respectively. The contexts of coexisting bacteria identified in the 32 BV-positive elements shown in Table 8 were:

•
Atopobium $+$ Gardnerella vaginalis $=$ 37.50% (12/32)
•
Atopobium $+$ Megasphaera $=$ 15.68% (8/32)
•
Gardnerella vaginalis $+$ Megasphaera $=$ 9.37% (3/32)
•
Atopobium $+$ Gardnerella vaginalis $+$ Megasphaera $=$ 9.80% (5/32)
•
Atopobium $+$ Gardnerella vaginalis $+$ Mycoplasma hominis $=$ 6.25% (2/32)
•
Atopobium $+$ Megasphaera $+$ Mycoplasma hominis $=$ 9.37% (3/32)
•
Atopobium $+$ Gardnerella vaginalis $+$ Megasphaera $+$ Mycoplasma hominis $=$ 15.62% (5/32)

Cluster two of the second experiment shows BV pathogen prevalence of 100%, 58.33%, 58.33%, 58.33%, 66.66% for Atopobium, Gardnerella vaginalis, Megasphaera, Mycoplasma hominis, respectively. The contexts of coexistence bacteria identified in the 12 elements with BV-positive shown in Table 8 were:

•
Atopobium $+$ Gardnerella vaginalis $=$ 16.66% (2/12)
•
Atopobium $+$ Megasphaera $=$ 8.33% (1/12)
•
Gardnerella vaginalis $+$ Megasphaera $=$ 9.37% (3/32)
•
Atopobium $+$ Gardnerella vaginalis $+$ Megasphaera $=$ 16.66% (2/12)
•
Atopobium $+$ Gardnerella vaginalis $+$ Mycoplasma hominis $=$ 25% (3/12)
•
Atopobium $+$ Megasphaera $+$ Mycoplasma hominis $=$ 33.33% (4/12)

6. Discussion

This research shows that the linkage method and the similarity measure contribute significantly to identifying the best hierarchical model and the optimal number of clusters for further analysis of bacteria coexisting in patient groups. A determining factor in this research was identifying the optimal number of clusters in the BV data, since this procedure yields clusters whose elements share a high similarity of characteristics. The study also highlighted the need to further investigate adaptive or random methods that automatically identify clusters with high similarity in the hierarchical structure. This therefore remains an open research question in the field.

On the other hand, it is essential to mention that, at the time this study was developed, there was no evidence of another similar approach with which to compare results. To support the results, however, they were subjected to biological validation by an expert using data visualizations of the models that highlight the bacterial coexistence contexts shared by the elements of each BV cluster. Due to the lack of literature on BV clustering, further experimentation with other methods is suggested to consolidate our findings.

7. Conclusion

In this paper, we aimed at creating an AHC model that would allow determining a more informed number of groups to create for further analysis of bacteria’s coexistence. This first effort provides an AHC model with the intention of supporting the study of bacterial coexistence contexts in groups of patients with BV. The single link, the complete link, group average, and Ward’s methods were evaluated in two experiments to identify the method with the highest AP and construct its dendrogram. Each AHC model obtained was validated through metrics. Finally, the clustering tables of the linkage methods were compared.

In the first experiment, Ward’s.D2 method showed the best results in the metrics evaluated. The results allow us to conclude that, through AHC, it is possible to create groups distinguished by the presence and absence of BV. In the second experiment, Ward’s.D2 method also showed the best results in the metrics evaluated. The results allow us to conclude that, through AHC, it is possible to create groups that are dissimilar from one another while each contains elements with the same diagnosis.

We consider that the proposed AHC model identifies the best method for each clustering case. Knowing which linkage methods are the best at clustering tasks in different experiments could serve as a basis for building an expert system that implements the best models. This system would facilitate the interpretation of bacterial coexistence context, which impacts the physician’s decision-making when choosing treatments for patients with BV.

The related work mentioned above focuses on the diagnosis, classification, and design of microarray-based phylogenetic tools, each contributing to the understanding of BV. This clustering study confirms that it is possible to determine a more informed number of patient groups in which to analyze coexisting bacteria. Furthermore, the results of the experiments indicate differences in the bacterial coexistence context between the BV-positive groups. Thus, knowing which bacteria coexist in clusters with similar characteristics allows physicians to establish appropriate treatment strategies to prevent the development of major gynaecological complications. This study evaluates a set of linkage methods and a similarity measure to build two clustering models for further analysis of coexisting bacteria in groups of patients with BV. This represents a contribution to medicine from an unsupervised learning perspective, with the aim of supporting physicians in understanding BV.

In future work, we will address other clustering methods and distance measures. We are also interested in obtaining the solutions of the clustering methods most prevalent in the literature and, based on them, performing comparative studies with recent clustering approaches such as SNK-AHC [55]. Finally, the models generated using all clustering methods can be integrated into expert systems to act as decision aids for specialists.

References

Herrero

D.R.

and Domingo

A.A.

, Vaginosis bacteriana, Enfermedades Infecciosas y Microbiología Clínica 34 (2016), 14–18.

Cohen

C.R.

Wierzbicki

M.R.

French

A.L.

Morris

Newmann

Reno

Green

Miller

Powell

Parks

et al., Randomized trial of lactin-v to prevent recurrence of bacterial vaginosis, New England Journal of Medicine 382(20) (2020), 1906–1915.

Borgogna

J.-L.C.

Shardell

M.D.

Grace

S.G.

Santori

E.K.

Americus

Ulanov

Forney

Nelson

T.M.

Brotman

R.M.

et al., Biogenic amines increase the odds of bacterial vaginosis and affect the growth of and lactic acid production by vaginal lactobacillus spp, Applied and Environmental Microbiology 87(10) (2021), e03068–20.

Pavlova

Kilic

J.-S.

Nader-Macias

Simoes

and Tao

, Genetic diversity of vaginal lactobacilli from women in different countries based on 16s rrna gene sequences, Journal of Applied Microbiology 92(3) (2002), 451–459.

Zhou

Bent

S.J.

Schneider

M.G.

Davis

C.C.

Islam

M.R.

and Forney

L.J.

, Characterization of vaginal microbial communities in adult healthy women using cultivation-independent methods, Microbiology 150(8) (2004), 2565–2573.

Shi

Chen

Tong

and Xu

, Preliminary characterization of vaginal microbiota in healthy chinese women using cultivation-independent methods, Journal of Obstetrics and Gynaecology Research 35(3) (2009), 525–532.

Larsson

P.-G.

and Forsum

, Bacterial vaginosis – a disturbed bacterial flora and treatment enigma: Review article iv, Apmis 113(5) (2005), 305–316.

Redelinghuys

M.J.

Geldenhuys

Jung

and Kock

M.M.

, Bacterial vaginosis: Current diagnostic avenues and future opportunities, Frontiers in Cellular and Infection Microbiology 10 (2020), 354.

Coudray

M.S.

and Madhivanan

, Bacterial vaginosis – a brief synopsis of the literature, European Journal of Obstetrics & Gynecology and Reproductive Biology 245 (2020), 143–148.

10.

Peipert

J.F.

Montagno

A.B.

Cooper

A.S.

and Sung

C.J.

, Bacterial vaginosis as a risk factor for upper genital tract infection, American Journal of Obstetrics and Gynecology 177(5) (1997), 1184–1187.

11.

C. for Disease Control, Prevention, et al., Sexually transmitted diseases treatment guidelines, Morbid Mortal 42 (1993), 57–59.

12.

Debora

W.W.

and Kimberlin

, Bacterial vaginosis: Association with adverse pregnancy outcome, Seminars in Perinatology-Elsevier 22 (aug 1998), 242–250.

13.

Amsel

Totten

P.A.

Spiegel

C.A.

Chen

K.C.

Eschenbach

and Holmes

K.K.

, Nonspecific vaginitis: Diagnostic criteria and microbial and epidemiologic associations, The American Journal of Medicine 74(1) (1983), 14–22.

14.

Nugent R.P., Krohn M.A. and Hillier S.L., Reliability of diagnosing bacterial vaginosis is improved by a standardized method of gram stain interpretation, Journal of Clinical Microbiology 29(2) (1991), 297–301.

15.

Hussain, Yüce, Ullah and Budak, Bioconjugated nanomaterials for monitoring food contamination, in: Nanobiosensors, Elsevier, 2017, pp. 93–127.

16.

Kalra, Palcu C.T., Sobel J.D. and Akins, Bacterial vaginosis: Culture- and PCR-based characterizations of a complex polymicrobial disease’s pathobiology, Current Infectious Disease Reports 9(6) (2007), 485–500.

17.

Berkhin, A survey of clustering data mining techniques, in: Grouping Multidimensional Data, Springer, 2006, pp. 25–71.

18.

Song, Zeng, Chen, Lei and Wang, Automatic vaginal bacteria segmentation and classification based on superpixel and deep learning, Journal of Medical Imaging and Health Informatics 4(5) (2014), 781–786.

19.

Beck and Foster J.A., Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics, PloS One 9(2) (2014), e87830.

20.

Cruciani, Biagi, Severgnini, Consolandi, Calanni, Donders, Brigidi and Vitali, Development of a microarray-based tool to characterize vaginal bacterial fluctuations and application to a novel antibiotic treatment for bacterial vaginosis, Antimicrobial Agents and Chemotherapy 59(5) (2015), 2825–2834.

21.

Ness R.B., Kip K.E., Hillier S.L., Soper D.E., Stamm C.A., Sweet R.L., Rice and Richter H.E., A cluster analysis of bacterial vaginosis-associated microflora and pelvic inflammatory disease, American Journal of Epidemiology 162(6) (2005), 585–590.

22.

Sanchez Garcia E.K., Contreras Paredes, Martinez Abundis, Garcia Chan, Lizano and de la Cruz Hernandez, Molecular epidemiology of bacterial vaginosis and its association with genital microorganisms in asymptomatic women, Journal of Medical Microbiology 68(9) (2019), 1373–1382.

23.

Giordani, Ferraro M.B. and Martella, Hierarchical clustering, in: An Introduction to Clustering with R, Springer, 2020, pp. 14–15.

24.

Faith D.P., Asymmetric binary similarity measures, Oecologia 57(3) (1983), 287–290.

25.

Kassambara, Practical guide to cluster analysis in R: Unsupervised machine learning, volume 1, STHDA, 2017.

26.

Pacheco E.R., Unsupervised Learning with R, Packt Publishing Ltd, 2015.

27.

Aggarwal C.C. and Reddy C.K., Data Clustering: Algorithms and Applications, Chapman and Hall/CRC, 2018.

28.

Abu-Jamous and Nandi A.K., Integrative cluster analysis in bioinformatics, John Wiley & Sons, 2015, pp. 158–159.

29.

Ward J.H. Jr, Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58(301) (1963), 236–244.

30.

Murtagh and Legendre, Ward's hierarchical agglomerative clustering method: Which algorithms implement Ward's criterion? Journal of Classification 31(3) (2014), 274–295.

31.

Legendre and Legendre, Numerical ecology, Elsevier, 2012, p. 365.

32.

Boslaugh, Statistics in a nutshell: A desktop quick reference, O’Reilly Media, Inc., 2012.

33.

Berman J.J., Data simplification: taming information with open source tools, Morgan Kaufmann, 2016.

34.

Sokal R.R. and Rohlf F.J., The comparison of dendrograms by objective methods, Taxon, 1962, 33–40.

35.

Cluster-cophenetic correlation coefficient, Feb 2020.

36.

Schober, Boer and Schwarte L.A., Correlation coefficients: Appropriate use and interpretation, Anesthesia & Analgesia 126(5) (2018), 1763–1768.

37.

Tibshirani, Walther and Hastie, Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2) (2001), 411–423.

38.

Rousseeuw P.J., Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987), 53–65.

39.

Nielsen, Hierarchical clustering, in: Introduction to HPC with MPI for Data Science, Springer, 2016, pp. 195–211.

40.

et al. [R package cluster version 2.1.0], Jun 2019.

41.

Kaufman and Rousseeuw P.J., Finding groups in data: an introduction to cluster analysis, volume 344, John Wiley & Sons, 2009.

42.

Documentation for package stats version 4.2.0, Dec 2019.

43.

Lewin-Koh, CRAN task view, graphic displays, dynamic graphics, graphic devices, visualization, Jan 2015.

44.

Borcard, Gillet and Legendre, Numerical ecology with R, Springer, 2018.

45.

Gruvaeus and Wainer, Two additions to hierarchical cluster analysis, British Journal of Mathematical and Statistical Psychology 25(2) (1972), 200–206.

46.

Extract and visualize the results of multivariate data analyses [R package factoextra version 1.0.7], Apr 2020.

47.

Pvclust: Hierarchical clustering with $p$-values via multiscale bootstrap resampling 2019, Nov 2019.

48.

Suzuki and Shimodaira, Pvclust: An R package for assessing the uncertainty in hierarchical clustering, Bioinformatics 22(12) (2006), 1540–1542.

49.

Functional programming tools [R package purrr version 0.3.4], Apr 2020.

50.

Company T.S.-S., Tableau software.

51.

Gomez H.J.H., Data visualization-experiment 1. https://public.tableau.com/app/profile/henryphd/viz/Exp1AHC/Exp1.

52.

Gomez H.J.H., Data visualization-experiment 2. https://public.tableau.com/app/profile/henryphd/viz/Exp2AHC/Exp2.

53.

Package guesscompx, Jun 2019.

54.

Agenis-Nevers, Bokde N.D., Yaseen Z.M. and Shende, Guesscompx: An empirical complexity estimation in R, arXiv, 2019.

55.

Ah-Pine, An efficient and effective generic agglomerative hierarchical clustering approach, The Journal of Machine Learning Research 19(1) (2018), 1615–1658.

