Abstract
Inferring phylogenetic trees in human populations is a challenging task that has traditionally relied on genetic, linguistic, and geographic data. In this study, we explore the application of Deep Learning and facial embeddings for phylogenetic tree inference based solely on facial features. We use pre-trained ConvNets as image encoders to extract facial embeddings and apply hierarchical clustering algorithms to construct phylogenetic trees. Our methodology differs from previous approaches in that it does not rely on preconstructed phylogenetic trees, allowing for an independent assessment of the potential of facial embeddings to capture relationships between populations. We have evaluated our method with a dataset of 30 ethnic classes, obtained by web scraping and manual curation. Our results indicate that facial embeddings can capture phenotypic similarities between closely related populations; however, problems arise in cases of convergent evolution, leading to misclassifications of certain ethnic groups. We compare the performance of different models and algorithms, finding that using the model with ResNet50 backbone and the face recognition module yields the best overall results. Our results show the limitations of using only facial features to accurately infer a phylogenetic tree and highlight the need to integrate additional sources of information to improve the robustness of population classification.
Introduction
Phylogenetic analysis of human populations and other species traditionally relies on genetic sequences, including chromosomal DNA [5, 17], mitochondrial DNA (mtDNA) [19, 28], and other sequences such as amino-acid chains [39]. However, in cases where genome data is unavailable, phenotypic features are still used to derive phylogenetic trees. These features are typically determined by experts, who rely on measurements and the quality of the samples.
Convolutional Neural Networks (ConvNets) have emerged as state-of-the-art approaches in various vision-related tasks, including image classification [6, 10, 20, 23], image segmentation [22], inpainting [42], and more. These models have demonstrated exceptional performance and have been successfully applied in medical imaging, assisting experts in different tasks [20].
In this study, we explore the possibility of using ConvNets to approximate the phylogenetic tree of human populations from facial images. By taking advantage of the ability of ConvNets to extract meaningful features from images, we aim to create a model that can capture important information about population structure from facial features.
The motivation for this research is to investigate an alternative approach to infer genetic relationships and ancestry when genetic data are limited or unavailable. Facial images are readily available and, in the case of historical photographs, may be the only source of ethnicity information.
In this paper we present a methodology, experimental setup, and results. We compare the predicted phylogenetic tree generated by our ConvNet model with known genetic ancestry information obtained from the literature. We discuss possible applications and limitations of this approach.
Overall, our study aims to contribute to the field of population genetics by exploring the capabilities of Convolutional Neural Networks in approximating the phylogenetic tree of human populations based on facial images. By merging the domains of computer vision and phylogenetics, we hope to provide valuable insights for various fields, including forensic anthropology and human migration studies.
Related work
One of the primary applications of Deep Learning in the biosciences is phylogeny inference [2, 3, 13, 16, 31, 37, 43]. However, research conducted by Zaharias et al. [41] showed that ConvNets do not outperform well-established algorithms in comparative evaluations.
Rather than employing deep architectures for tree inference, our approach uses Deep Learning to extract the essential features that standard tree inference methods then take as input. However, literature on this specific approach is limited.
In another study,
To our knowledge, there is no existing work specifically addressing the application of ConvNets for phylogeny inference using human faces. However, it is worth noting that the related problem of ethnic classification has been explored using ConvNets. For instance, Masood et al. [21] utilized a ConvNet to categorize three primary ethnic groups, achieving superior results compared to traditional methods. Vo et al. [38] reported marginally improved outcomes in ethnic classification using a pre-trained VGG-16. In their study, the authors distinguished one ethnic group from others. Similarly, Wu et al. [40] employed VGG-16 to classify three principal ethnic categories. Das et al. [7] proposed a ConvNet for the simultaneous classification of gender, age, and ethnicity, aiming to reduce associated biases. Srinivas et al. [36] also undertook joint classification of age, gender, and ethnicity, with a specific focus on Asian populations.
Methodology
In our study, we adopted the methodology introduced by Kiel et al. [14] and Hunt and Pedersen [12]. We employed pre-trained ConvNets or image encoders to extract the essential features from the images. Subsequently, we applied hierarchical clustering to infer the phylogenetic tree. Additionally, we compared our results with established phylogenetic trees of human populations.
Ground truth
The phylogenetic trees of human populations have been extensively discussed; however, a clear consensus regarding the positioning of certain ethnic groups is yet to be established. In the work of Duda and Zrzavý, three potential phylogenetic tree models, labeled as Model A, Model B, and Model C (see Fig. 2 in [8]), were presented. We utilized these models as a basis for comparison in our study.
Data acquisition
Data acquisition was carried out through web scraping of images corresponding to the 55 human populations studied by Duda and Zrzavý. We downloaded 30 images for each population using the Python module
We employed the
This dataset is rather small. However, because we used pre-trained visual encoders (see Subsection 3.3), a larger corpus is not needed for this study, although one would be advisable to capture the variation between populations from the same ethnic group.
Visual encoders
Once the faces were selected, we employed three different image encoders:
We performed two subtasks:
Subtask A: Ethnic Classification. We allocated 33% of the dataset for testing purposes.
Subtask B: Tree Inference. We computed the mean embeddings per class and applied traditional tree inference algorithms, including UPGMA (unweighted pair group method with arithmetic mean) [35], Neighbor Joining (NJ) [30], and Hierarchical Agglomerative Clustering (HAC) provided by Scikit-Learn [27]. In all cases, we used Euclidean distance.
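The tree inference step of Subtask B can be illustrated with a minimal sketch: class-mean embeddings, a Euclidean distance matrix, and a plain UPGMA merge loop. This is not the code used in the study. The function names (`class_means`, `euclidean_matrix`, `upgma`), the population labels, and the two-dimensional toy embeddings are all hypothetical and chosen only to keep the example small.

```python
import numpy as np


def class_means(embeddings, labels):
    """Return the sorted class names and the mean embedding of each class."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    means = np.stack([embeddings[labels == n].mean(axis=0) for n in names])
    return names, means


def euclidean_matrix(points):
    """Pairwise Euclidean distances between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def upgma(names, dist):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix."""
    # Active clusters: id -> (subtree, number of leaves in the subtree).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): float(dist[i, j]) for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of active clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted average of the two old distances (the UPGMA rule).
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop every entry that involves the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        for c, value in merged.items():
            d[(c, next_id)] = value  # next_id is larger than any existing id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical toy data: two 2-D "embeddings" for each of four populations.
embeddings = np.array(
    [[0, 1], [0, -1], [1, 1], [1, -1], [10, 1], [10, -1], [12, 1], [12, -1]],
    dtype=float,
)
labels = ["pop_a", "pop_a", "pop_b", "pop_b", "pop_c", "pop_c", "pop_d", "pop_d"]

names, means = class_means(embeddings, labels)
tree = upgma(names, euclidean_matrix(means))
print(tree)  # (('pop_a', 'pop_b'), ('pop_c', 'pop_d'))
```

In practice a library routine would normally replace the hand-written loop; for instance, `scipy.cluster.hierarchy.linkage` with `method="average"` performs the same average-linkage clustering.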
For Subtask B, where we generated trees using the embeddings, we employed the following metrics to compare the generated trees with the theoretical trees: The first four compute the distance between phylogenetic trees, while the Shared Phylogenetic Information and Nye Similarity measure how closely related two trees are.
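As an illustration of a tree distance, the following sketch computes the unnormalized Robinson-Foulds distance, one of the metrics reported in the results, as the number of non-trivial bipartitions (splits) present in exactly one of two trees. It assumes trees encoded as nested tuples of leaf names. The helper names and the five-leaf example trees are hypothetical, and the implementation and normalization actually used in the study may differ.

```python
def leaves(tree):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)  # reference leaf used to pick a canonical side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Represent each split by the side that does not contain `ref`.
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits that separate a single leaf or nothing at all.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of splits that appear in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("both trees must have the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical five-leaf trees that disagree on one grouping.
reference = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))

print(robinson_foulds(reference, inferred))   # 2
print(robinson_foulds(reference, reference))  # 0
```

A distance of 0 means both trees induce the same splits; each conflicting grouping typically adds two, one split missing from each tree.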
In the following section, we present the results of our study. These findings provide insights into the performance of various pre-trained models in the complex task of ethnic classification. The results are divided into the two mentioned subtasks, each addressing a unique aspect of the problem. The detailed outcomes and their implications are discussed in the subsequent sections.
Subtask A
Ethnic classification with 30 classes proved to be challenging for the pre-trained models, with the highest achieved F1-macro score of 0.29043 obtained using the kNN algorithm with k = 3. The main results are summarized in Table 1. The difficulty of the task can be attributed to the presence of similar classes within the dataset.
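The evaluation protocol of Subtask A (33% held-out test split, kNN with k = 3, macro-averaged F1) can be sketched as follows. The data here are synthetic, well-separated Gaussian clusters standing in for real facial embeddings, so the score it prints says nothing about the results in Table 1. The function names, the random seed, and the cluster layout are assumptions made for the example.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=3):
    """Label each test point by majority vote among its k nearest neighbours."""
    train_y = np.asarray(train_y)
    preds = []
    for x in test_x:
        dists = np.linalg.norm(train_x - x, axis=1)  # Euclidean distances
        nearest = train_y[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[np.argmax(counts)])  # ties go to the smallest label
    return np.array(preds)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


# Synthetic stand-in for embeddings: 3 classes, 30 samples each, 8 dimensions.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10.0 * c, 1.0, size=(30, 8)) for c in range(3)])
y = np.repeat(np.arange(3), 30)

# Shuffle, then hold out 33% of the samples for testing.
order = rng.permutation(len(x))
n_test = round(0.33 * len(x))
test_idx, train_idx = order[:n_test], order[n_test:]

pred = knn_predict(x[train_idx], y[train_idx], x[test_idx], k=3)
score = macro_f1(y[test_idx], pred)
print(f"macro F1 = {score:.3f}")
```

Macro averaging gives every class the same weight regardless of its size, which is why visually similar classes that are confused with one another pull the overall score down quickly. The same quantity is available in Scikit-Learn as `sklearn.metrics.f1_score` with `average="macro"`.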
Measured metrics for the subtask A
The main results for Subtask B are presented in Tables 2–4. The results demonstrate that ResNet50 and the Face Recognition module performed better compared to VGG-16. Specifically, ResNet50 with UPGMA (see Figs. 1–3) achieved superior results in most metrics, except for Robinson-Foulds and Clustering Information Distance. On the other hand, the Face Recognition module produced better results in Robinson-Foulds (with UPGMA, see Fig. 4) and Clustering Information Distance (with Hierarchical Agglomerative Clustering).
Results on VGG-16
Results on ResNet50
Results on Face Recognition module

Phylogenetic tree generated with UPGMA and ResNet50 embeddings.

Distance matrix generated with UPGMA and ResNet50 embeddings.

Visualization of the Shared Phylogenetic Information of the tree generated with UPGMA and ResNet50 embeddings (b) compared to the ground truth (a).

Phylogenetic tree generated with UPGMA and Face Recognition embeddings.
Phylogenetic trees are primarily constructed using DNA samples from various individuals to capture differences and similarities among populations. These trees identify characteristics that can determine whether an individual belongs to a specific ethnic group and their relatedness to others. Biologists and experts often rely on DNA due to its complexity and reliability. However, obtaining DNA samples involves considerable time and effort as it requires human participation.
This research aims to develop a Machine Learning approach to estimate phylogenetic trees from an alternative data source: facial images. This could have a positive impact on both biology and genetics, as Deep Learning algorithms can help identify similarities among humans without actual genetic data. This research may also offer new insights into the relevance of phenotypical facial features and characteristics for analyzing related populations.
This work is based on the hypothesis that different genetic populations produce different phenotypic populations. While this is a general fact in biology, environmental factors can also influence phenotypic expression [1]. Recent statistical or Machine Learning-based studies provide evidence that it is possible to differentiate face shapes/images. For example, Hopman et al. [11] found that it is possible to differentiate two phylogenetically related populations (Dutch and English populations in this case) based solely on facial traits. If this holds true for all genetically-related populations, it suggests that Machine/Deep Learning approaches could be used to distinguish even similar populations based solely on images.
Other works aim to predict facial structure using genetic information (DNA-to-face approaches) and the inverse problem (face-to-DNA approaches) [1]. Mathematically, let
Is it possible to estimate the Phylogenetic Tree using only facial images? To answer this question, let us first make some assumptions:
A. It is possible to effectively estimate the Phylogenetic Tree in
B. It is possible to effectively estimate g with
Given assumptions A and B, it is reasonable to conclude that it is possible to approximate the Phylogenetic Tree using only the set of images
Empirically, this work suggests that ConvNets are somewhat capable of deriving Phylogenetic Trees, but they do present some significant limitations. Face embeddings serve as phenotypic features that preserve similarity between related groups in our study. The hierarchical clustering method performed well for closely related human populations, but like other phenotype-based phylogenetic tree induction approaches, it is not robust in cases of convergent evolution. While Kiel et al. (2021) [14] addressed this issue by considering a supertree approach using well-known phylogenetic trees, our objective was to compare the potential use of embeddings without relying on published trees.
Kiel et al. [14] also proposed the use of model ensembles, where predictions from different taxonomic models are combined. This approach requires re-training the networks to classify larger clusters of human populations, which makes it supervised and dependent on a previously constructed tree.
In our case, we do not depend on pre-built trees as we applied transfer learning with the VGGFaces dataset, which addresses individual face classification. Therefore, the ConvNets were trained to classify non-hierarchically clustered labeled classes to produce the necessary embeddings. An unsupervised approach, such as using an autoencoder architecture, could also be employed for this purpose [32].
However, while this approach is independent of pre-built phylogenetic trees, it sacrifices the ability to differentiate similar features arising from convergent evolution. Without additional information, this problem may be difficult to overcome. Although it might be useful to add other sources of information, the objective of this work was to rely solely on images.
In our research, the Australian and Makrani ethnic groups were consistently misclassified by most of the proposed approaches. Makrani, a Central Asian population, was often clustered with sub-Saharan African populations in the generated trees. A similar issue was observed with the Australian population.
Apart from these two groups, the methodology generally produced consistent results. In the best model (Fig. 2), we observe three primary clusters: African, Eurasian, and Eastern Asian-Oceanic-American populations. However, although the African and Eurasian clusters form a larger cluster, the Eastern Asian-Oceanic-American cluster appears to be significantly different, which might not be consistent with the models. In fact, the Uyghur-Hazara populations, related to other Central Asian populations, are clustered with the Yakut, another Turkic-speaking population, in this particular case.
Additionally, despite Model A generally being the closest model to the generated trees, our best tree supports the idea that the Cambodian-Lahu population is clustered with Polynesian groups (Model C), and it indicates the proximity between the Uyghur-Hazara peoples and the Yakut ethnic group, as shown in Models B and C.
Our study explored the application of Deep Learning and face embeddings for phylogenetic tree inference in human populations. We utilized pre-trained ConvNets and image encoders to extract facial features, and then applied hierarchical clustering algorithms to construct phylogenetic trees. Our research aimed to compare the potential use of embeddings without relying on pre-built phylogenetic trees.
We found that while face embeddings captured phenotypic similarities between closely related human populations, the methodology encountered challenges in cases of convergent evolution. The hierarchical clustering approach performed well overall, but misclassified certain ethnic groups, such as the Australian and Makrani populations. These misclassifications highlight the limitations of solely relying on facial features for accurate classification and tree inference.
Comparing the results of different models and algorithms, we observed that ResNet50 and the Face Recognition module produced better results in most metrics, demonstrating their effectiveness in capturing facial similarities.
Our study demonstrated that face embeddings, although capturing phenotypic similarities, have limitations in accurately representing complex evolutionary relationships. Future research could explore the integration of additional genetic, linguistic, and geographic information to improve the accuracy and robustness of phylogenetic tree inference in human populations.
