Multimodal biometric authentication system leveraging optimally trained ensemble classifier using feature-level fusion

Abstract

Objective

This study aims to enhance cybersecurity by implementing a robust biometric-based authentication approach. A Multimodal Biometric System (MBS) is proposed, utilizing feature-level fusion of human facial (physiological) and speech (behavioral) features to improve security, accuracy, and user convenience. The system addresses the limitations of traditional authentication methods, including unimodal biometrics and password-based security.

Background

In the modern digital landscape, human-computer interaction and digital platforms play a crucial role in daily life. With billions of users engaging in social media, financial transactions, and e-commerce, the demand for secure authentication mechanisms has intensified. However, the increasing sophistication of cyber threats poses significant risks, undermining trust, security, and confidence in digital systems.

Method: The proposed MBS incorporates improved techniques for feature extraction, a feature-level fusion strategy and an ensemble classification model combining Bi-LSTM and DCNN. To optimize performance, the system is enhanced using an improved bio-inspired Manta Ray Foraging Optimization (MRFO) algorithm.

Results

The system's performance was evaluated on two publicly available datasets, VoxCeleb1 and VidTIMIT, achieving accuracy rates of 98.23% and 97.92%, with Equal Error Rates (EERs) of 3.23% and 3.62%, respectively.

Conclusion

The proposed approach outperforms conventional optimization techniques and existing state-of-the-art MBS. As a contactless and non-intrusive authentication system, it enables seamless data acquisition through devices equipped with cameras and microphones, such as smartphones, ensuring real-time processing of biometric modalities.

Application: This contactless MBS presents a viable solution for secure and hygienic authentication in applications requiring high cyber resilience, including banking, e-commerce and other digital security domains.

Precis/Table of Contents: This research enhances cybersecurity by proposing a Multimodal Biometric System (MBS) that integrates feature-level fusion of facial (physiological) and speech (behavioral) traits. The approach improves security, accuracy, and user convenience while addressing hygiene concerns. It overcomes the limitations of traditional authentication methods, including unimodal biometrics and password-based security vulnerabilities.

Keywords

Ensemble classifier; face recognition; speaker recognition; feature level fusion; improved manta ray foraging optimization; multimodal biometric system; security and privacy

1. Introduction

The digital world has brought with it a plethora of interactive gadgets, ranging from smartphones to smart appliances, all of which offer vital services to their users. These gadgets can communicate with one another via the internet, forming the Internet of Things (IoT) network.^1,2 The personal information acquired by these devices is highly confidential, so protecting it from intruders without compromising the user experience is a top priority. The use of biometric-based authentication systems^2–4 is a promising strategy for both improving individualization in service delivery and bolstering the safety of sensitive information.

With the proliferation of Multimodal Biometric Systems (MBS),⁵ it is now possible to meet the most demanding security criteria without resorting to passwords or a Unimodal Biometric System (UBS).⁵ Moreover, the COVID-19 pandemic has accelerated the shift towards contactless biometric systems and away from those that rely on touch. Further major advantages of contactless MBS are high precision, speedy matching, adaptability and effortless integration with systems in the medical sector, law enforcement, educational institutions, and border and aviation security.⁵ To create a reliable MBS and lessen the chance of virus transmission, the speech (behavioral) and face (physiological) modalities emerged as viable options. The major benefit of the face modality over others is that data can be acquired without the user's cooperation or awareness, as in airports or other crowded public places. Similarly, compared with other modalities that require greater user cooperation and proximity to the device, speech-based recognition is convenient and reliable, especially for disabled people, children and the elderly, given the ease of accessing a speech-receiving device and its contactless nature. Also, to construct a robust MBS, the biometric cues or modalities should be selected so that the flaws of one can be compensated by the other, thereby improving the performance of the system. In a nutshell, under adverse conditions such as low illumination and a noisy environment, the data acquisition processes for the two modalities are independent of each other, i.e., neither hampers the other with respect to quality, data acquisition, etc. As a corollary, the face and speech modalities together form the most reliable and user-convenient pairing.

The extracted biometric characteristics are an abundant source of information. Combining features makes classes highly distinguishable, thereby enhancing the effectiveness and precision of the MBS. The selection of a suitable fusion strategy is therefore crucial for the development of a successful multimodal biometric authentication system. Five types of fusion strategy are frequently used: feature level, rank level, sensor level, match score level and decision level fusion.⁶ Among these, the advantage of feature-level (pre-classification) fusion is that it melds associated features derived from multiple biometric algorithms into a single robust template, which enables the identification of prominent feature sets and improves recognition accuracy. By integrating diverse biometric characteristics via a feature-level fusion strategy,⁶ the MBS can greatly decrease the overlap between the feature spaces of individuals (interclass similarities). Likewise, the unified feature vectors yield a resilient and trustworthy human recognition system that is challenging to fabricate or imitate. Hence, the fused features of the aforesaid modalities will yield an efficient and robust MBS.

Practical implications of proposed work:

Smart access control: Reliable authentication in secure buildings

Public surveillance: Real-time identification with liveness verification

E-banking: Tamper-resistant voice-face authentication for financial transactions

Healthcare CPS: Secure access to medical devices using contactless biometrics

The paper is structured as follows. Section 2 describes the related works on MBS, followed by the problem formulation for this research. The proposed OEMBAS model is discussed in detail in Section 3, including feature extraction, feature-level fusion and classification. Section 4 describes the experimental setup, datasets and performance metrics. It validates OEMBAS against other methodologies for training-data percentages of 60, 70, 80 and 90, against five traditional optimization approaches, and against different machine learning classifiers (Random Forest (RF), Gaussian Mixture Model (GMM), Support Vector Machines (SVM)) and deep learning (DL) classifiers (LSTM, Bi-LSTM, Bi-GRU, DCNN, Bi-LSTM-DCNN). The comparison uses ten performance metrics:³ positive measures (accuracy, sensitivity, precision and specificity),³ negative measures (False Negative Rate (FNR), False Positive Rate (FPR) and Equal Error Rate (EER)) and other measures (F-measure, Matthews Correlation Coefficient (MCC) and Negative Predictive Value (NPV)),³ along with the cost function and statistical measures. OEMBAS is also compared with a few similar state-of-the-art MBS. Section 5 concludes the work.

2. Related works

DL⁶ has achieved significant success, especially with ensemble classifiers, in the fields of pattern recognition and classification.⁶ Training classifiers with meta-heuristic algorithms offers a powerful optimization approach that can efficiently navigate complex search spaces. By leveraging meta-heuristic techniques, classifiers can adapt and evolve over iterations, leading to improved accuracy and robustness in various machine learning tasks, even in the presence of noisy or high-dimensional data. Moreover, the flexibility and scalability of meta-heuristic algorithms make them well suited to the challenging optimization problems encountered in classifier training across diverse domains, from pattern recognition to financial forecasting. Many researchers have constructed DL-based MBS⁶ that show promising performance, as documented in the literature. Among these efforts, Vekariya et al.,⁷ proposed a multi-biometric authentication technique that also involves feature-level fusion. The method uses a Binary Chimp Optimized Adaptive Kernel Support Vector Machine (BCO-AKSVM) to determine the most effective features, and its experimental results demonstrate an accuracy of 97%. Purohit et al.,⁸ introduced an MBS that uses fingerprint, ear and palm biometric data, with a novel grey wolf optimization approach to feature-level fusion that selects the optimal features. Wu et al.,⁹ developed LVID, a smartphone biometric authentication method that uses both voice and lip motions. For precise authentication, LVID combines two biometrics: a fine-grained estimate of lip movements and a pure speech sample obtained from the recorded signal. With 104 subjects tested, LVID was able to recognize 93.47% of threats and authenticate users with 95% accuracy. However, the assessment is limited to a small group of students; increasing the number of participants and broadening their age range would give a better understanding of the system's performance. Mustafa et al.,¹⁰ introduced a decision-level fusion technique that combines fingerprint and iris biometrics, using the Gray Level Co-occurrence Matrix (GLCM) for feature extraction with a KNN classifier. The final decision is made using an AND gate. Based on the results, the fusion strategy described in that study clearly outperformed UBS, achieving 95% efficiency in reaching the final outcome for 20 users.

Singh et al.,¹¹ introduced an MBS that utilizes feature-level fusion of facial and fingerprint information. The PCA approach was employed for facial feature extraction, while the Raymond Thai algorithm was utilized for minutiae extraction as a fingerprint feature. Using an SVM classifier, this approach significantly enhances efficiency and has a stated accuracy rate of 95.38%. Nainan et al.,¹² introduced a speaker recognition system that combines dynamic voice features with static features. Including this speaker information leads to a significant enhancement in the accuracy of automatic speaker recognition (ASR). The Fisher score approach was utilized to identify the most influential characteristics using a 1D-CNN, raising the accuracy of ASR to 94.77%. The multi-kernel SVM technique is employed for recognition. Abinaya et al.,¹³ devised an MBS that combines multiple behavioral modes, including keystroke (typing timings) and audio (speech) features. This system employs pretrained DL models to identify the individual. The features from the two modalities were merged using a weighted linear approach to feature-level fusion, and the merged features were then used to train a DL convolutional neural network (CNN) classifier. Similarly, Abdulbaqi et al.,¹⁴ created a prototype system that verifies users by analysing the distinctiveness of their ECG signal in conjunction with their facial features, utilizing Awica Wavelet Transform algorithms. However, a classification accuracy of only 94% was achieved. Recently, A. El_Rahman et al.,¹⁵ presented a technique for fusing fingerprints and ECG at the feature level, utilizing a CNN as the classifier. Based on the results of the experiments, the proposed approach has an accuracy of 94.5%.

2.1. Problem formulation

The reliability of biometric systems varies with the specific biometric modalities, features, classifier and optimization techniques employed. The UBS is susceptible to spoofing attacks and has interclass variability. In order to build a trustworthy MBS employing facial and vocal features, a thorough literature analysis was conducted. From the perspective of modality choice, the amount of research on MBS utilizing both face and voice is low in comparison to the amount of research into either modality alone. From the perspective of fusion level, several challenges arise in MBS: sensor modifications during sensor fusion, feature incompatibility during feature fusion, score normalization due to variation during score fusion, the serial versus parallel design process during rank fusion, and a lack of data during decision fusion. Developing protective methods that satisfy operational requirements, enhancing public trust in biometric technology and preserving personal data are essential for achieving higher user acceptability.

Therefore, to tackle the above problems, this paper proposes an MBS with following contributions:

For robust feature extraction - Three state-of-the-art speech feature extraction techniques, Power Normalized Cepstral Coefficient (PNCC), Constant Q Cepstral Coefficient (CQCC) and Spectral Flux (SF), are suggested and implemented. For the face modality, the state-of-the-art Eye Aspect Ratio (EAR) technique along with an Improved-Active Shape Model (I-ASM) is proposed and implemented.

Fusion of face with speech feature - Feature-level fusion melds features for better classification and reduces data dimensionality. Moreover, melding image and audio features culminates in a more potent and resilient recognition system. Thus, an improved mutual information-based feature-level fusion through concatenation is proposed.

Classifier - The state-of-the-art Bi-LSTM with DCNN (weighted) as an ensemble classifier is proposed.

Optimization - For optimal weight tuning and to minimize the error of the suggested classifier, the state-of-the-art MRFO technique with an improvement, termed SI-MRFO, is proposed and implemented. Ultimately, the goal of this research is to deliver a reliable authentication solution for cyber security applications such as an IoT setting by leveraging a high-quality face and speech feature set for an optimally trained ensemble model-based MBS.

3. Proposed scheme

In MBS, it is extremely challenging to spoof or forge numerous biometric features simultaneously for an authenticated user. As a consequence, it offers greater precision and more resilience to illegal access by an adversary than a UBS. Moreover, the MBS provides user data confidentiality and improves security. In this paper, we have proposed an Optimal trained Ensemble classifier based Multimodal Biometric Authentication System (OEMBAS) model for user authentication (verification) as given in Figure 1.

Figure 1. Overall layout of proposed OEMBAS model.

For the user authentication process, every MBS must go through two phases, training and testing.^3,4,16 In the training phase, the user takes the initiative to sign up. A brief video is acquired by a smart sensor camera. The audio/speech signal is extracted from the recorded video and preprocessed in order to obtain reliable speech attributes. The facial image frame is derived from the same video for effective feature extraction. The resulting feature vectors are concatenated in order to achieve feature-level fusion. The classifier is trained on these fused feature templates to build individual user models, which are then saved in a database for later use (testing). Testing is carried out in a manner analogous to training, except that the claimed user's fused biometric template is compared one-to-one with the N trained reference models already recorded in the database to determine whether or not the user is legitimate. The final decision is made according to how closely the query matches the reference data based on performance metrics. The OEMBAS has four processes, which are explained in detail below:

3.1. Preprocessing

For efficient user authentication,^4,16 preprocessing¹⁶ is the initial step, in which the input face image and the input speech signal are preprocessed. For input face image preprocessing,¹⁷ face detection is performed using the Viola Jones algorithm.¹⁸ After face detection, the Sato tubeness¹⁹ filter is applied to the input face image. Similarly, for input speech signal preprocessing, notch filtering²⁰ is carried out. A notch filter is a band-stop filter²⁰ that attenuates frequencies within a narrow band while passing all other frequencies unmodified. It is suitable for use in audio systems to remove disruptive frequencies such as powerline hum.
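The notch filtering step can be illustrated with a minimal, standard-library sketch of a second-order (biquad) band-stop filter. The source does not state the notch frequency, quality factor or filter design, so the 50 Hz hum frequency, `q = 5.0`, the 16 kHz sampling rate and the helper names below are assumptions chosen for illustration only.

```python
import math

def notch_coefficients(f0_hz, fs_hz, q=5.0):
    """Biquad notch coefficients (normalized so a[0] == 1).

    f0_hz is the frequency to reject; q controls how narrow the notch is.
    Both values are illustrative assumptions, not taken from the paper.
    """
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def apply_biquad(signal, b, a):
    """Filter a list of samples with a direct-form I biquad."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in signal:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

fs = 16000  # assumed sampling rate in Hz
hum = [math.sin(2 * math.pi * 50 * n / fs) for n in range(fs)]     # unwanted 50 Hz hum
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs)]  # speech-band tone

b, a = notch_coefficients(50.0, fs)
hum_out = apply_biquad(hum, b, a)
tone_out = apply_biquad(tone, b, a)

# After the start-up transient, the hum is suppressed and the tone is preserved.
hum_level = rms(hum_out[-4000:])
tone_level = rms(tone_out[-4000:])
```

In practice a library routine would normally replace the hand-written biquad; the sketch only shows why a notch removes one narrow band and leaves the rest of the speech spectrum essentially untouched.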

3.2. Feature extraction

Subsequent to face detection and signal denoising, distinctive features are extracted from each modality. The key features are chosen and improved for each modality-based feature extraction technique.²¹ The details of each feature extraction technique are given below:

3.2.1. Face: improved-active shape model with eye aspect ratio (I-ASM with EAR)

EAR²² together with an improved ASM is proposed for facial feature extraction, since these state-of-the-art techniques locate the facial feature points in the face image with high accuracy. Moreover, the six scalar values of the EAR²² technique capture eye information (the open and closed states of the eyes, i.e., for liveness detection) and are combined framewise for each sample. To find the best match between the model and the data in a new image, ASM²³ uses a prior model of what is expected in the image. A shape is described by a 2n-dimensional vector X = (x₁, y₁, …, x_n, y_n)^T made up of n points (x_i, y_i). The statistical shape model is built in the same coordinates as the training shapes S (S = X_i). The Generalized Procrustes Analysis (GPA) technique²⁴ aligns the training set S. The average of all aligned shape vectors is the mean shape $\bar{x}$ . In ASM, the mean shape specified in Equation (1) is generally computed before Principal Component Analysis (PCA)²⁵ is applied to the shape vectors. However, the standard mean makes it difficult to fully exploit the specific information in the image, because it is heavily influenced by the extreme values in the tails of the data. Therefore, the proposed I-ASM computes the trimmed mean given in Equation (2), which reduces this influence by discarding a number of values at both tails.

\begin{aligned} \bar{x} & = \frac{1}{S} \sum_{i = 1}^{S} x_{i} \end{aligned}

(1)

\begin{aligned} \bar{x} & = \frac{\sum_{i = p + 1}^{n - p} x (i)}{n - 2 p} \end{aligned}

(2)

where n depicts the number of samples and the variable p is conventionally generated at random in the range [0, 1]. This randomness strongly affects the performance of the features. Hence, in this work the skew tent map function defined in Equation (3) is used to calculate the variable p, which is adapted to encode the input preprocessed image $I_{p}^{F}$ .

\begin{aligned} x_{i + 1} = \left\{ \begin{array}{ll} x_{i} / p, & x_{i} \in [0, p] \\ (1 - x_{i}) / (1 - p), & x_{i} \in [p, 1] \end{array} \right. \quad x_{i} \in [0, 1], \; p \in [0, 1] \end{aligned}

(3)
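Equations (2) and (3) can be sketched together as follows. The skew tent step and the trimmed mean follow the formulas directly; how the chaotic value is converted into an integer trim count is not specified in the text, so `chaotic_trim_count` (including its `x0`, `steps` and `max_fraction` parameters) is a hypothetical mapping introduced only for illustration. The sketch also assumes 0 < p < 1 so that neither branch divides by zero.

```python
def skew_tent_step(x, p):
    """One iteration of the skew tent map in Equation (3); assumes 0 < p < 1."""
    if x <= p:
        return x / p
    return (1.0 - x) / (1.0 - p)

def trimmed_mean(values, k):
    """Equation (2): discard the k smallest and k largest values, then average."""
    n = len(values)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= 2k < n")
    kept = sorted(values)[k:n - k]
    return sum(kept) / len(kept)

def chaotic_trim_count(n, x0=0.37, p=0.4, steps=10, max_fraction=0.25):
    """Hypothetical mapping from a chaotic value in [0, 1] to a trim count.

    At most max_fraction of the samples are trimmed from each tail.
    """
    x = x0
    for _ in range(steps):
        x = skew_tent_step(x, p)
    return int(x * max_fraction * n)

# One landmark coordinate across eight aligned shapes, with a single outlier.
coords = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 25.0]
plain_mean = sum(coords) / len(coords)   # pulled upwards by the outlier
robust_mean = trimmed_mean(coords, 1)    # outlier and smallest value removed
k = chaotic_trim_count(len(coords))
```

The example shows the motivation given above: the plain mean is dragged towards the outlying value, while the trimmed mean stays close to the bulk of the coordinates.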

PCA²⁵ is conventionally applied to the covariance matrix J computed in Equation (4). In the proposed logic, the covariance is instead calculated with the improved computational formula in Equation (5), where W depicts normalized data.

\begin{aligned} J & = \frac{1}{m - 1} \sum_{i = 1}^{m} (x_{i} - \bar{x}) (x_{i} - \bar{x})^{T} \end{aligned}

(4)

\begin{aligned} J^{2} & = \frac{\sum x^{2} - \frac{{(\sum x)}^{2}}{N}}{N - 1} \end{aligned}

(5)

The eigenvalues of J² give the shape variance along the associated eigenvectors. The eigenvectors provide the modes of variation, that is, how the landmark points shift together as the shape changes.
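Read for a single coordinate, Equation (5) has the form of the computational (shortcut) expression for the sample variance, which is algebraically equal to the definitional form underlying Equation (4). The short check below illustrates that equivalence; applying it per coordinate is an interpretation made here, and the function names and sample data are illustrative.

```python
def variance_definitional(xs):
    """Sample variance as the mean-centred sum of squares over N - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_computational(xs):
    """Equation (5): (sum of x^2 - (sum of x)^2 / N) / (N - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
v_def = variance_definitional(samples)    # 32 / 7
v_comp = variance_computational(samples)  # (232 - 1600 / 8) / 7 = 32 / 7
```

The shortcut form needs only running sums of x and x², which avoids a second pass over the data, although it can lose precision when the mean is large relative to the spread.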

3.2.2. Speech: PNCC, CQCC & SF

To recognize a speaker,^3,16 distinct and robust speech features are extracted: the state-of-the-art PNCC, CQCC and SF features are chosen and concatenated framewise for each sample, since a larger number of features enhances the distinguishing ability of the classifier and the efficiency of the MBS. PNCC^3,16,21 is a cepstral-domain feature that is considered a state-of-the-art method for extracting noise-robust acoustic features. The formulation of PNCC was motivated by the fact that its feature set is more resistant to acoustical variations, performs well even when the speech signal is undistorted and has computational complexity comparable to that of Mel-Frequency Cepstral Coefficient (MFCC)^3,25 and Perceptual Linear Prediction (PLP).²¹ For the simulation of PNCC, we have used a sampling rate of 16 kHz, an FFT size of 512, a window length of 25 ms, a window overlap of 10 ms, 128 filters in the filter bank and 13 extracted features. The second feature, CQCC,²⁶ is extracted using a hybrid of the constant Q transform and cepstral analysis. In contrast to the majority of automatic speaker verification system frontends, CQCC offers a spectro-temporal resolution that accurately captures the tampering artifacts characteristic of spoofing attacks, thereby increasing the system's accuracy. We have extracted 19 CQCC features. Finally, the third feature, Spectral Flux,²⁷ is used to create a decision rule that aims to reduce the frequency of decision errors. As a result, this attribute is used to enhance the speech verification decision-making. These three features together form the feature set for the speech modality.
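Of the three speech features, Spectral Flux is simple enough to sketch in a few lines. The text does not say which variant is used, so the block below assumes one common definition: the sum of squared differences between the magnitude spectra of consecutive frames. A naive DFT keeps the sketch dependency-free; the frame length, hop and test signal are illustrative.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def frame_signal(signal, frame_len, hop):
    """Split a signal into frames of frame_len samples taken every hop samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def spectral_flux(prev_frame, frame):
    """Assumed definition: squared L2 distance between magnitude spectra."""
    prev_mag = magnitude_spectrum(prev_frame)
    mag = magnitude_spectrum(frame)
    return sum((m - p) ** 2 for m, p in zip(mag, prev_mag))

# Test signal: a steady tone that changes frequency halfway through.
first = [math.sin(2 * math.pi * 4 * t / 64) for t in range(128)]
second = [math.sin(2 * math.pi * 8 * t / 64) for t in range(128)]
frames = frame_signal(first + second, frame_len=64, hop=64)

flux_values = [spectral_flux(frames[i - 1], frames[i]) for i in range(1, len(frames))]
# The flux is near zero while the tone is steady and peaks at the change.
```

With the parameters quoted above (16 kHz, 25 ms window), a real frame would hold 400 samples and an FFT routine would replace the naive DFT.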

3.3. Feature level fusion of face and speech modalities

The above features from the face (I-ASM with EAR) and speech (PNCC, CQCC and SF) modalities are combined at the feature level^6,9,25,26 through concatenation into a single feature set. To extract the most effective components from the corresponding feature sets, we determine the ratio of inter-class to intra-class variance for each dimension of the speech and face features. The standard feature-level fusion FF is computed using Equations (6)-(8). However, the standard feature-level fusion produces a spatial distortion in the fused features because it does not take the class labels into account. This drawback is overcome in this work by including the Mutual Information (MI)²⁸ between the features and the labels, which gives the improved feature-level fusion of Equations (9)-(11).

\begin{aligned} F F & = \frac{C_{I n t e r}}{C_{I n t r a}} \end{aligned}

(6)

\begin{aligned} C_{I n t e r} & = \sum_{i = 1}^{N} (m_{i} - m_{a l l})^{2} \end{aligned}

(7)

Where, m_i is the mean feature vector of user i, m_all is the global mean across all users. High inter class variance ensures better separation between different users.

\begin{aligned} C_{I n t r a} = \sum_{i = 1}^{N} (\frac{1}{n} \sum_{j = 1}^{n} {(m_{j, i} - m_{i})}^{2}) \end{aligned}

(8)

Where, m_j,i is the feature vector of sample j for user i, m_i is the mean feature vector for that user, n is the number of samples per user. Low intra class ensures that features of the same user remain consistent.
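Equations (6)-(8) can be evaluated for one feature dimension as sketched below. The dictionary layout (user identifier mapped to that user's scalar feature values), the function name and the toy data are assumptions for illustration; the global mean is taken over all samples, which coincides with the mean of user means when every user has the same number of samples.

```python
def fusion_ratio(samples_by_user):
    """Ratio of inter-class to intra-class variance, Equations (6)-(8).

    samples_by_user maps a user id to a list of scalar values for a
    single feature dimension.
    """
    user_means = {u: sum(v) / len(v) for u, v in samples_by_user.items()}
    all_values = [x for v in samples_by_user.values() for x in v]
    global_mean = sum(all_values) / len(all_values)

    # Equation (7): spread of the user means around the global mean.
    c_inter = sum((m - global_mean) ** 2 for m in user_means.values())

    # Equation (8): average squared deviation of each user's samples
    # from that user's own mean, summed over users.
    c_intra = sum(
        sum((x - user_means[u]) ** 2 for x in v) / len(v)
        for u, v in samples_by_user.items()
    )
    if c_intra == 0.0:
        raise ValueError("intra-class variance is zero; ratio undefined")
    return c_inter / c_intra

# A discriminative dimension: users are far apart, samples are tight.
good = {"a": [1.0, 1.1, 0.9], "b": [5.0, 5.1, 4.9]}
# A poor dimension: both users share the same mean and are widely spread.
poor = {"a": [1.0, 5.0, 3.0], "b": [3.2, 0.8, 5.0]}

good_ratio = fusion_ratio(good)
poor_ratio = fusion_ratio(poor)
```

A large ratio marks a dimension that separates users well, which is why such dimensions are favoured when the fused template is built.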

Since the standard mutual information carries no information about the interaction between the features and the classifier, the proposed improved feature fusion is evaluated as Equation (9):

\begin{aligned} F F = \frac{C_{I n t e r}}{C_{I n t r a}} + I M I \end{aligned}

(9)

Where IMI is the Improved MI i.e., normalized MI defined in Equation (11) and conventional MI calculation is done in Equation (10) as:

\begin{aligned} M I (X, Y) = \sum_{X} \sum_{Y} P (X, Y) \log (\frac{P (X, Y)}{P (X) P (Y)}) \end{aligned}

(10)

Where P(X,Y) is the joint probability distribution of features X and labels Y, P(X) and P(Y) are the marginal distributions. Higher MI means better feature-label correlation, leading to improved classification. Since IMI normalizes MI, it accounts for variations in data distribution, making the authentication system more stable.

\begin{aligned} I M I = 2 \frac{M I (X, Y)}{W * H (X) + W * H (Y)} \end{aligned}

(11)

Where H(X) and H(Y) are the Shannon entropies of the features X and the labels Y, $W = \frac{1 - H_{i}}{N - \sum_{i = 1}^{N} H_{i}}$ is the weight factor, $(W * H (X) + W * H (Y))$ is the weighted entropy function and $N$ depicts the number of samples. IMI ensures the MI value is scale-invariant, making it more effective for fusion.
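A small sketch of Equations (10) and (11) for discrete (already quantized) feature values follows. Probabilities are estimated from empirical counts, and the weight W is passed in as a plain scalar parameter `w` rather than computed from the entropy-based expression given above; that simplification, the function names and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """Equation (10), with probabilities estimated from co-occurrence counts."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def improved_mi(xs, ys, w=1.0):
    """Equation (11): MI normalized by the weighted entropies.

    w stands in for the weight W and is treated as a given scalar here.
    """
    denom = w * entropy(xs) + w * entropy(ys)
    if denom == 0.0:
        return 0.0
    return 2.0 * mutual_information(xs, ys) / denom

labels = ["u1", "u1", "u2", "u2"]
informative = [0, 0, 1, 1]     # feature value determines the label
uninformative = [0, 1, 0, 1]   # feature value is independent of the label

imi_high = improved_mi(informative, labels)
imi_low = improved_mi(uninformative, labels)
```

With `w = 1.0` the expression reduces to the usual symmetric normalized MI, which lies between 0 (independent) and 1 (the feature fully determines the label).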

3.4. Optimally trained ensemble classifier

For the user authentication process, we employ an ensemble classification algorithm to obtain the final classified output (authorized or unauthorized person) by giving the fused feature $F F$ as an input to the optimized Bi-LSTM and DCNN classifiers. The Bi-LSTM and the DCNN are chosen for their specific characteristics. In particular, the Bi-LSTM classifier is chosen for its memory and its ability to solve fixed sequence-to-sequence detection. Likewise, the advantages of the DCNN are weight sharing, low computation compared to other neural networks and high accuracy on classification tasks. For the proposed work, the Bi-LSTM architecture consists of one input layer, two LSTM layers, two dropout layers and one output layer. We have used 128 LSTM units, SGD (Stochastic Gradient Descent) as the optimizer, and ReLU and Sigmoid as the activation functions, with dropout rates of 0.5 and 0.2, respectively. The DCNN classifier⁴ consists of three main layer types: convolutional, pooling and fully connected. To generate feature maps, a convolution layer is composed of many convolution kernels. The DCNN architecture consists of one input layer, one output layer, three convolutional layers, three pooling layers and one flatten layer. The convolutional layers have filter counts of 16, 32 and 64, respectively, with kernel size 1 × 1 and ReLU as the activation function. Each pooling layer has pool size 1 × 1 with a dropout rate of 0.2. Sparse categorical cross entropy is used as the loss function with the Adam optimizer and a batch size of 100.
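The text describes the ensemble as weighted but does not give the rule that combines the two classifier outputs. The sketch below therefore assumes one simple possibility, a convex combination of the class-probability vectors produced by the Bi-LSTM and the DCNN; the weight `w_bilstm`, the label order and the function names are all hypothetical.

```python
def weighted_ensemble(p_bilstm, p_dcnn, w_bilstm=0.5):
    """Convex combination of two class-probability vectors (assumed rule)."""
    if len(p_bilstm) != len(p_dcnn):
        raise ValueError("probability vectors must have the same length")
    if not 0.0 <= w_bilstm <= 1.0:
        raise ValueError("w_bilstm must lie in [0, 1]")
    w_dcnn = 1.0 - w_bilstm
    return [w_bilstm * a + w_dcnn * b for a, b in zip(p_bilstm, p_dcnn)]

def decide(probs, labels=("unauthorized", "authorized")):
    """Return the label with the highest fused probability."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return labels[best_index]

# Hypothetical outputs for one access attempt: [P(unauthorized), P(authorized)].
p_bilstm = [0.30, 0.70]
p_dcnn = [0.60, 0.40]

trust_bilstm = decide(weighted_ensemble(p_bilstm, p_dcnn, w_bilstm=0.7))
trust_dcnn = decide(weighted_ensemble(p_bilstm, p_dcnn, w_bilstm=0.2))
```

When the two classifiers disagree, the fused decision follows whichever model carries more weight, which is the kind of trade-off an optimizer can tune.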

3.4.1. Proposed Self Improved-Manta Ray Foraging Optimization (SI-MRFO) Algorithm for optimal tuning of the BiLSTM-DCNN as ensemble classifier

The Manta Ray Foraging Optimization (MRFO)²⁹ algorithm is a state-of-the-art bio-inspired optimization technique for global optimization problems. It is useful for feature selection for efficient classification, hyperparameter tuning and optimizing deep learning models for high system performance. The major objective of the optimization technique³⁰ is to minimize the error. In this work, to find the optimum solution in the search space, the proposed SI-MRFO simulates the foraging habits of manta rays (sea creatures) in the wild. SI-MRFO employs three foraging techniques: (proposed) chain foraging, (proposed) cyclone foraging, and somersault foraging. Chain foraging enhances local search ability and guides feature selection towards the best-performing feature subsets. Cyclone foraging enhances global search ability and ensures exploration by searching different feature combinations. Somersault foraging enhances local search ability and raises the convergence rate. The exploitation search is mostly aided by the chain foraging and somersault foraging behaviors, whereas the exploration search is primarily aided by cyclone foraging. The three foraging behaviors are used in combination with the following update processes to solve optimization problems using SI-MRFO.

In this phase, the weight function of the Bi-LSTM (∂) and DCNN (τ) classifiers, i.e., $ϑ \in \{\partial, τ\}$ , is first given as an input solution to the proposed SI-MRFO algorithm to obtain the optimized weight $ϑ_{b e s t}$ . SI-MRFO is used to fine-tune the hyperparameters of the ensemble classifier (Bi-LSTM-DCNN) and the fusion module. It integrates self-adaptive weight adjustment and feedback-driven mutation to maintain solution diversity and avoid premature convergence, thus outperforming standard metaheuristic techniques in convergence speed and global optimum detection.

Key hyperparameters tuned include:

Learning rate and batch size

Number of LSTM units and convolution filters

Dropout rate, fusion vector dimensions

Number of epochs and optimizer selection

This hyperparameter tuning strategy ensured global exploration and local exploitation, resulting in a well-generalized model with improved accuracy and stability.

Proposed chain foraging: The Manta rays’ chain foraging method is written as Equation (12):

\begin{aligned} ϑ_{i} (t + 1) = \left\{ \begin{array}{ll} ϑ_{i} (t) + r \times (ϑ_{best} (t) - ϑ_{i} (t)) + α \times (ϑ_{best} (t) - ϑ_{i} (t)), & i = 1 \\ ϑ_{i} (t) + r \times (ϑ_{i - 1} (t) - ϑ_{i} (t)) + α \times (ϑ_{best} (t) - ϑ_{i} (t)), & i = 2, \dots, N \end{array} \right. \end{aligned}

(12)

Here, $α = 2 \times r \times \sqrt{| \log (r) |}$ , where r is traditionally a random number in the range 0 to 1. In the proposed logic, r is instead calculated using the improved piecewise linear chaotic map (IPWLCM), a simple chaotic system widely employed to produce pseudo random numbers (PRN), as denoted in Equation (13):

\begin{aligned} r_{n + 1} = \frac{r_{n} - [r_{n} / q] \times q}{q} \end{aligned}

(13)

where $r_{n} \in (0, 1)$ and q refers to the control parameter. The chaotic map is used because the randomness of r affects the solution of each search agent.
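Equations (12) and (13) can be sketched as below. The bracket in Equation (13) is read as the floor function, which is an interpretation; the seed `r0 = 0.37`, the control parameter `q = 0.3` and the toy population are illustrative assumptions. All agents are updated from the positions of iteration t.

```python
import math

def ipwlcm_step(r, q):
    """Equation (13): r_next = (r - floor(r / q) * q) / q, with 0 < q < 1."""
    return (r - math.floor(r / q) * q) / q

def chain_foraging(population, best, r):
    """Equation (12): each agent follows the one in front of it and the best.

    population is a list of position vectors (lists of floats); the first
    agent follows the best position directly.
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    alpha = 2.0 * r * math.sqrt(abs(math.log(r)))
    updated = []
    for i, agent in enumerate(population):
        leader = best if i == 0 else population[i - 1]
        updated.append([
            x + r * (l - x) + alpha * (b - x)
            for x, l, b in zip(agent, leader, best)
        ])
    return updated

# Chaotic value replacing the uniform random number.
r = ipwlcm_step(0.37, 0.3)

population = [[0.0, 0.0], [2.0, 2.0]]
best = [1.0, 1.0]
moved = chain_foraging(population, best, r=0.5)
alpha_half = 2.0 * 0.5 * math.sqrt(abs(math.log(0.5)))
```

Because the chaotic map is deterministic, a run can be reproduced exactly from its seed, unlike a run driven by an unseeded random generator.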

Proposed cyclone foraging: The Manta rays’ cyclone foraging is written as Equation (14):

\begin{aligned} ϑ_{i} (t + 1) = \left\{ \begin{array}{ll} ϑ_{best} (t) + r \times (ϑ_{best} (t) - ϑ_{i} (t)) + β \times (ϑ_{best} (t) - ϑ_{i} (t)), & i = 1 \\ ϑ_{best} (t) + r \times (ϑ_{i - 1} (t) - ϑ_{i} (t)) + β \times (ϑ_{best} (t) - ϑ_{i} (t)), & i = 2, \dots, N \end{array} \right. \end{aligned}

(14)

Here, $β = 2 e^{r_{1} \frac{T - t + 1}{T}} \times \sin (2 π r_{1})$ and $ϑ_{rand} (t) = Lb + r \times (Ub - Lb)$ , where Lb and Ub refer to the lower and upper bounds. Each manta ray conducts its search at random, with the food serving as a guide. As a result, cyclone foraging makes good use of the region around the best solution found so far. Each agent can also be encouraged to seek a position distinct from the current best one by choosing a new random position within the entire search region as its reference position. This mechanism focuses primarily on exploration and allows SI-MRFO to carry out a wide global search. Therefore, the proposed logic is stated as Equation (15):

\begin{aligned} ϑ_{n e w} = \frac{(ϑ_{r a n d} - r \times ϑ_{i} (t)) + β \times ϑ_{r a n d} - β \times ϑ_{i} (t) + r \times ϑ_{r a n d} + r \times ϑ_{i - 1} (t)}{2} \end{aligned}

(15)

where $β = C * U b {(\frac{U b}{L b})}^{(\frac{1}{1 + (t / T)})}$ refers to the weight coefficient, $C \in (0, 1)$ refers to a random value, t refers to the current iteration and T refers to the maximum number of iterations.
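The cyclone foraging update of Equation (14) and the proposed position of Equation (15) can be sketched as follows. The random reference position is taken to lie inside the search bounds, the coefficient β is passed in as a plain number so that either definition given in the text can be supplied, and all function names and numeric values are illustrative assumptions.

```python
import math

def cyclone_beta(r1, t, T):
    """beta = 2 * exp(r1 * (T - t + 1) / T) * sin(2 * pi * r1), as for Equation (14)."""
    return 2.0 * math.exp(r1 * (T - t + 1) / T) * math.sin(2.0 * math.pi * r1)

def random_reference(lb, ub, r):
    """Random reference position inside the box bounded by lb and ub."""
    return [l + r * (u - l) for l, u in zip(lb, ub)]

def cyclone_foraging(population, best, r, beta):
    """Equation (14): agents spiral around the best position found so far."""
    updated = []
    for i, agent in enumerate(population):
        leader = best if i == 0 else population[i - 1]
        updated.append([
            b + r * (l - x) + beta * (b - x)
            for x, l, b in zip(agent, leader, best)
        ])
    return updated

def proposed_new_position(agent, prev_agent, rand_ref, r, beta):
    """Equation (15), evaluated component-wise."""
    return [
        ((z - r * x) + beta * z - beta * x + r * z + r * y) / 2.0
        for x, y, z in zip(agent, prev_agent, rand_ref)
    ]

beta_example = cyclone_beta(0.25, t=1, T=10)              # 2 * e^0.25 * sin(pi / 2)
reference = random_reference([0.0, 0.0], [10.0, 20.0], 0.5)
new_pos = proposed_new_position([1.0], [2.0], [3.0], r=0.5, beta=1.0)
settled = cyclone_foraging([[1.0, 1.0], [1.0, 1.0]], [1.0, 1.0], r=0.5, beta=beta_example)
```

Using the random reference instead of the best position pushes agents away from the current optimum, which is what gives this step its exploratory character.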

Finally, the manta rays' somersault foraging with somersault factor S is written as Equation (16):

\begin{aligned} ϑ_{i} (t + 1) = ϑ_{i} (t) + S \times (r_{2} \times ϑ_{b e s t} - r_{3} \times ϑ_{i} (t)), i = 1, \dots, N \end{aligned}

(16)
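Equation (16) translates almost directly into code. The sketch below assumes list-based positions and takes S, r₂ and r₃ as explicit arguments so that the update is deterministic; the text does not state the value of S used, and the function name `somersault_step` is hypothetical.

```python
def somersault_step(positions, best, s, r2, r3):
    """Equation (16): move each agent around the best position found so far.

    positions: list of agents, each a list of floats
    best:      best position, a list of floats
    s:         somersault factor S
    r2, r3:    random numbers, passed in explicitly for reproducibility
    """
    return [
        [xi + s * (r2 * b - r3 * xi) for xi, b in zip(x, best)]
        for x in positions
    ]


# Illustrative call with hand-picked values (not taken from the paper).
moved = somersault_step([[0.0, 2.0]], best=[1.0, 2.0], s=2.0, r2=0.5, r3=0.5)
```

When r₂ = r₃, an agent that already sits on the best position does not move, which is why the second coordinate in the example stays at 2.0.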

4. Result and discussion

In this section, the proposed OEMBAS is implemented in Python to examine its performance and compare it with conventional approaches. The MBSs are trained and tested on two standard audio-visual (speech and face) datasets, VoxCeleb1³¹ and VidTIMIT,²⁵ with the learning (training) share of the data set to 60%, 70%, 80% and 90% and the remaining 40%, 30%, 20% and 10% used for testing, respectively. From a computational perspective, the results reported below are based on 80% of the data for training and the remaining 20% for testing. The OEMBAS is compared with five state-of-the-art optimization techniques,²⁹ namely Henry Gas Solubility Optimization (HGSO),³² COOT Optimization,³³ Bald Eagle Search (BES),³⁴ Blue Monkey Optimization (BMO) and MRFO,²⁹ in terms of positive measures (accuracy, precision, sensitivity and specificity),³⁵ negative measures³⁵ (FPR, FNR and EER) and other measures³⁵ (F-measure, MCC and NPV), along with cost function and statistical analysis.
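The learning-percentage protocol described above can be sketched as a seeded random split. This is only an illustration: the text does not say whether the split is random, stratified by speaker, or fixed, and the sample count of 430 (43 speakers × 10 sentences, as in VidTIMIT) is used purely as an example. The function name `split_indices` is hypothetical.

```python
import random


def split_indices(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and split them into train and test parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie in (0, 1)")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


# Hypothetical example: 430 recordings split at each learning percentage.
splits = {
    fraction: split_indices(430, fraction)
    for fraction in (0.6, 0.7, 0.8, 0.9)
}
```

For an authentication task, a split that keeps samples of every enrolled speaker in both parts would usually be preferable to a purely random one.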

Dataset1 Description: VoxCeleb1³¹ is a large-scale audio-visual dataset comprising human voice samples obtained from YouTube videos. It covers 1251 celebrities, and each speaker is represented by 18 videos on average. The dataset includes facial bounding boxes. We first applied temporal filtering to the bounding boxes to account for variations in their size and position. We then enlarged them by a factor of 1.5 to ensure that the complete face is visible at all times.
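The bounding-box preprocessing can be illustrated with a short sketch. The text does not specify the temporal filter, so a centered moving average is assumed here, and boxes are assumed to be `(x, y, width, height)` tuples. Both function names are hypothetical.

```python
def smooth_boxes(boxes, window=5):
    """Centered moving average over time, applied to each box coordinate.

    Assumption: the paper's "temporal filtering" is approximated by a
    moving average; the actual filter is not specified in the text.
    """
    half = window // 2
    smoothed = []
    for i in range(len(boxes)):
        lo, hi = max(0, i - half), min(len(boxes), i + half + 1)
        chunk = boxes[lo:hi]
        smoothed.append(
            tuple(sum(b[k] for b in chunk) / len(chunk) for k in range(4))
        )
    return smoothed


def enlarge_box(box, scale=1.5):
    """Scale a box about its center so the whole face stays inside it."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale, h * scale
    return (cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h)


# Illustrative track of three frames (values are made up).
track = [(0, 0, 10, 10), (2, 2, 10, 10), (4, 4, 10, 10)]
prepared = [enlarge_box(b) for b in smooth_boxes(track, window=3)]
```

An enlarged box can extend past the image border, so a real pipeline would also clip it to the frame size before cropping.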

Dataset2 Description: The VidTIMIT²⁵ database contains audio and video recordings of 43 persons, each reading 10 short text-independent sentences. The recordings have an average length of 4.25 s, which corresponds to about 106 video frames per recording. The video of each person is stored as a sequence of JPEG images at a resolution of 384 × 512 pixels. The corresponding audio is stored as a mono, 16-bit, 32 kHz WAV file. The database is useful for research on automatic lip reading, multi-view FR, multi-modal SR and user authentication.

4.1. Comparison of OEMBAS (proposed approach) with different classifier based MBS

An ablation assessment of the OEMBAS model was carried out with 80% of the data used for training and 20% for testing. The OEMBAS is compared with similar MBSs (using the speech and face modalities) built on different classifiers, namely Random Forest (RF), Support Vector Machine (SVM), Bi-directional Gated Recurrent Unit (Bi-GRU), GMM,¹² Long Short-Term Memory (LSTM),²¹ BiLSTM,²¹ DCNN and Bi-LSTM-DCNN, using the performance measures reported in Table 1. The proposed OEMBAS model, which uses the optimally tuned hybrid Bi-LSTM-DCNN classifier, was found to be superior to the other MBSs. This improvement is attributed to the distinctive feature extraction, the improved mutual information based feature-level fusion of the speech and face modalities, and the optimization of the ensemble Bi-LSTM-DCNN classifier with the proposed bio-inspired SI-MRFO technique.

Table 1.
Assessment of proposed approach (in %) with various classifier based MBSs for both datasets.

Dataset 1
Evaluation metrics	RF	SVM	Bi-GRU	GMM	LSTM	Bi-LSTM	DCNN	OEMBAS (Proposed)
Accuracy	85.09	88.07	88.14	89.42	92.87	94.04	95.03	98.23
Sensitivity	88.26	87.63	90.93	88.81	89.27	87.29	86.44	99.20
Specificity	79.38	88.49	83.04	89.99	85.18	80.85	79.58	94.00
Precision	88.50	87.96	90.73	89.18	91.99	89.41	86.70	99.10
F-measure	88.38	87.79	90.83	88.99	90.61	88.34	86.57	99.00
MCC	67.56	78.63	74.04	78.81	73.53	67.50	70.71	89.09
NPV	78.98	88.17	83.37	89.63	80.62	77.45	79.16	94.50
FPR	20.61	11.50	16.95	10.00	14.8	19.14	15.64	5.50
FNR	11.73	12.37	9.06	11.19	10.72	12.70	8.78	0.96
EER	16.17	11.93	13.01	10.60	13.76	15.92	12.21	3.23
Dataset 2
Evaluation metrics	RF	SVM	Bi-GRU	GMM	LSTM	Bi-LSTM	DCNN	OEMBAS (Proposed)
Accuracy	85.93	81.19	84.81	83.52	90.28	92.90	94.59	97.92
Sensitivity	88.94	84.69	88.28	88.62	86.63	86.30	82.09	98.10
Specificity	80.47	75.00	78.62	74.90	79.93	79.47	75.01	93.00
Precision	89.18	85.72	88.03	85.65	88.85	88.56	82.35	97.90
F-measure	89.06	85.20	88.16	87.10	87.72	87.42	82.22	97.99
MCC	69.34	59.43	66.98	64.36	65.91	65.11	65.73	88.10
NPV	80.09	73.45	79.02	79.56	76.42	75.90	74.58	93.88
FPR	19.53	25.00	21.38	25.10	20.07	20.53	24.99	6.50
FNR	11.06	15.31	11.71	11.38	13.37	13.70	17.91	0.75
EER	15.29	20.15	16.54	18.24	17.71	17.11	21.44	3.62
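The measures in Table 1 all derive from the binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts are hypothetical. In Table 1 the reported EER values coincide with the mean of FPR and FNR (for example, (20.61 + 11.73) / 2 ≈ 16.17 for RF on Dataset 1), so the sketch includes that average as `eer_approx`. This is an observation from the tabulated values, not the usual threshold-sweep definition of the EER.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures, returned as fractions in [0, 1]."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    fpr = fp / (fp + tn)                  # false positive rate
    fnr = fn / (fn + tp)                  # false negative rate
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn) / mcc_den,
        "npv": npv,
        "fpr": fpr,
        "fnr": fnr,
        # Mean of FPR and FNR, matching how the EER column of Table 1 behaves.
        "eer_approx": (fpr + fnr) / 2,
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

The function does not guard against empty classes; any zero denominator would raise `ZeroDivisionError`.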

4.2. Convergence study on proposed OEMBAS using SI-MRFO with different conventional optimization techniques

Figure 2 compares the convergence behaviour, i.e., the cost function, of the proposed SI-MRFO with that of MBSs based on HGSO, COOT, BES, BMO and MRFO as the optimization iteration varies from 0 to 50. Both the proposed SI-MRFO and the conventional methods show a higher error rate at the initial (0th) iteration, and the error rate declines as the iterations progress. On Dataset1, the proposed SI-MRFO attains an error rate of 1.045 at the 15th iteration and reaches its lowest error rate of 1.016 at the 50th iteration, as indicated in Figure 2, whereas COOT reaches 1.06, BMO 1.041, HGSO 1.020 and MRFO 1.019. Overall, the results on both datasets indicate that the proposed SI-MRFO identifies and verifies users in the MBS with a lower error value than the conventional methodologies.

Figure 2.

Convergence analysis of proposed SI-MRFO with conventional optimization techniques.
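A convergence curve of this kind is commonly obtained by recording the best cost found so far after every iteration. The sketch below assumes that convention; the text does not state how the curve in Figure 2 was produced, and the cost values in the example are made up.

```python
from itertools import accumulate


def best_so_far(costs):
    """Running minimum of the per-iteration cost, i.e. a convergence curve."""
    return list(accumulate(costs, min))


# Hypothetical per-iteration costs of one optimizer run.
curve = best_so_far([1.20, 1.10, 1.15, 1.05])
```

Plotting one such curve per optimizer against the iteration index gives a figure in the style of Figure 2.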

4.3. Statistical evaluation of OEMBAS with different conventional optimization techniques with regard to error

The statistical evaluation of the OEMBAS against the HGSO, COOT, BES, BMO and MRFO based MBSs under different statistical measures is summarized in Table 2. A better MBS should attain a lower error rate, and the OEMBAS scored the lowest error value on almost all statistical measures for both datasets. For Dataset1, the median error of the OEMBAS is 1.021, compared with 1.026 for HGSO, 1.062 for COOT, 1.053 for BES, 1.041 for BMO and 1.025 for MRFO. For the maximum statistical measure, HGSO, COOT, BES, BMO and MRFO recorded error values of 1.128, 1.099, 1.227, 1.124 and 1.133, respectively, whereas the OEMBAS recorded the lowest value of 1.098. Similarly, for Dataset2 the OEMBAS attained the lowest mean, minimum and median error among the compared techniques. Hence, it can be inferred that the OEMBAS performs better than the other five conventional optimization techniques. This indicates that, owing to the improved MRFO (SI-MRFO), the proposed OEMBAS can be regarded as a more accurate and trustworthy authentication system.

Table 2.
Statistical assessment of OEMBAS model with traditional optimization approaches.

Dataset1
Methods	Standard Deviation	Mean	Minimum	Median	Maximum
HGSO	0.040	1.055	1.026	1.026	1.128
COOT	0.023	1.071	1.050	1.062	1.099
BES	0.043	1.066	1.025	1.053	1.227
BMO	0.028	1.061	1.041	1.041	1.124
MRFO	0.029	1.045	1.024	1.025	1.133
OEMBAS	0.023	1.035	1.021	1.021	1.098
Dataset2
Methods	Standard Deviation	Mean	Minimum	Median	Maximum
HGSO	0.014	1.055	1.046	1.051	1.108
COOT	0.024	1.087	1.068	1.068	1.116
BES	0.009	1.042	1.035	1.035	1.071
BMO	0.016	1.044	1.034	1.035	1.082
MRFO	0.014	1.046	1.039	1.047	1.052
OEMBAS	0.012	1.037	1.032	1.032	1.107
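The five statistics in Table 2 can be computed from a list of error (cost) values with the Python standard library. The sketch assumes the sample standard deviation; the text does not state which estimator was used or over which runs or iterations the values were collected. The function name `summarize` and the input values are hypothetical.

```python
import statistics


def summarize(costs):
    """Standard deviation, mean, minimum, median and maximum of cost values."""
    return {
        "std": statistics.stdev(costs),   # sample standard deviation (assumed)
        "mean": statistics.mean(costs),
        "min": min(costs),
        "median": statistics.median(costs),
        "max": max(costs),
    }


# Made-up error values for illustration only.
summary = summarize([1.0, 2.0, 3.0, 4.0])
```

Replacing `statistics.stdev` with `statistics.pstdev` gives the population standard deviation instead.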

4.4. Performance analysis of OEMBAS and the conventional methodologies with regard to positive, negative and other measures

In this section, the OEMBAS is compared with similar MBSs tuned by different optimization techniques, namely HGSO, COOT, BES, BMO and MRFO. The assessment uses positive measures (accuracy, sensitivity, precision and specificity), negative measures (FNR, FPR and EER) and other measures (F-measure, MCC and NPV), as illustrated in Figures 3 and 4 for Dataset1 and Dataset2, respectively. The OEMBAS was evaluated with the training share of the data varied from 60% to 90% and the testing share correspondingly from 40% to 10%. For Dataset1, with 80% of the data for training and 20% for testing, the OEMBAS recorded an accuracy of 98.23%, a specificity of 94.00%, a precision of 99.10% and a sensitivity of 99.20%, whereas HGSO, COOT, BES, BMO and MRFO produced lower values of these positive metrics. Similarly, for Dataset2, with 80% of the data for training and 20% for testing, the OEMBAS recorded an accuracy of 97.92%, a specificity of 93.00%, a precision of 97.90% and a sensitivity of 98.10%, whereas HGSO, COOT, BES, BMO and MRFO again produced lower values. The objective of the OEMBAS is to minimize the negative measures and maximize the positive ones, and the negative metric values of the OEMBAS decrease as the training share increases. The ablation results demonstrate that each component contributes meaningfully to the observed performance and that the fusion-based, optimized ensemble achieves the best results (accuracy: 98.23%, EER: 3.23%).

Figure 3.

Performance analysis of OEMBAS with conventional optimization techniques using Dataset1.

Figure 4.

Performance analysis of OEMBAS with conventional optimization techniques using Dataset2.

4.5. Discussion on the novelty and technical contribution of SI-MRFO

The proposed SI-MRFO algorithm introduces significant enhancements over the conventional MRFO framework, specifically tailored for optimizing complex deep learning-based MBS architectures. Its novelty lies in the integration of a self-adaptive inertia weight mechanism, which dynamically adjusts the balance between exploration and exploitation during the optimization process. This mechanism mitigates premature convergence and stagnation in local optima, which are common limitations of classical MRFO, especially when tuning deep models (Bi-LSTM-DCNN) in a high-dimensional fused feature space.

Moreover, SI-MRFO incorporates a feedback-driven mutation operator, inspired by differential evolution strategies, to increase diversity in the candidate solutions, thus improving the robustness and convergence speed of the optimization. These innovations contribute to superior performance in selecting optimal hyperparameters for the ensemble classifier and in fine-tuning the feature-level fusion strategy. The mathematical formulation, algorithmic flow and theoretical convergence characteristics have been discussed and benchmarked against the original MRFO and other optimizers (e.g., COOT and BMO) to substantiate its superiority both algorithmically and empirically.

4.6. Comparison with a few state-of-the-art methodologies

This study advances the MBS domain by investigating the combined use of face and speech for user authentication. Existing literature predominantly pairs face with modalities other than speech, or speech with modalities other than face.⁹⁻¹¹ For centuries, individuals have recognized others through facial or verbal cues rather than through attributes such as the ear, hand, signature or fingerprints. While previous studies offer valuable groundwork for MBS applications,³⁶,³⁷ research into combining face with speech remains relatively unexplored, which highlights the novelty and significance of this study's contribution. The performance comparison of the proposed OEMBAS with a few state-of-the-art MBSs in terms of classifier accuracy and EER³⁸ is presented in Table 3. In addition, Table 4 summarizes the ablation study conducted on both databases, reporting accuracy and EER. Based on these findings, we infer that the OEMBAS is more effective at user authentication than the compared systems.

Table 3.
Performance comparison of the proposed work with state-of-the-art methods.

Literature	Biometric Modality	Fusion level	Database	Accuracy (in %)	EER (in %)
Wu et al.,⁹ 2019	Lip movements & speech	Data	Self made	95.00	N/A
Mustafa et al.,¹⁰ 2020	Iris & finger-print	Decision	CASIA V1 & V2(iris) & FVC 2004 (fingerprint)	95.00	N/A
Singh et al.,¹¹ 2020	Face & finger-print	Feature	Live (face) & FVC 2004 (fingerprint)	95.38	4.61
Nainan et al.,¹² 2021	Speech	N/A	VidTIMIT	94.77	N/A
Purohit et al.,⁸ 2021	Fingerprint, ear & palm	Feature	IIT Delhi (ear), CASIA (fingerprint & palmprint)	91.60	N/A
Abinaya et al.,¹³ 2022	Speech & keystroke	Feature	BioChaves	91.50	N/A
Alagarsamy et al.,³⁹ 2022	Face & ear	Score	ORL (face) & IIT Delhi (ear)	96.24	N/A
Abdulbaqi et al.,¹⁴ 2023	Face & ECG	Decision	N/A	94.00	52.96
A. El_Rahman et al.,¹⁵ 2024	ECG & fingerprint	Feature	MIT-BIH (ECG) & FVC2004 (fingerprint)	94.50	N/A
Vekariya et al.,⁷ 2024	Face & finger-print	Feature	SDUMLA	97.00	N/A
OEMBAS (proposed)	Face & speech	Feature	Dataset1 / VoxCeleb1	98.23	3.23
OEMBAS (proposed)	Face & speech	Feature	Dataset2 / VidTIMIT	97.92	3.62

Table 4.

Summary of the above ablation study results.

		Dataset 1		Dataset 2
Sl. No.	Biometric System	Accuracy	EER	Accuracy	EER
1.	UBS using face	96.8%	4.2%	94.8%	6.2%
2.	UBS using speech	95.9%	5.4%	95.1%	6.1%
3.	MBS using MRFO	97.1%	4.1%	96.7%	5.3%
4.	MBS using SI-MRFO (proposed OEMBAS)	98.23%	3.23%	97.92%	3.62%

4.7. Real-World deployment challenges

In real-world settings, deploying a robust MBS entails addressing practical challenges beyond algorithmic performance.

Latency and Computational Efficiency: The proposed system has been profiled for training and inference times across multiple platforms, including CPU and GPU configurations. On an NVIDIA RTX 3080 GPU, the average inference time is 0.34 s per instance, making it feasible for near real-time applications.

User Acceptability and Usability: The system relies on non-intrusive, contactless modalities (face and speech), enhancing user comfort and hygiene—particularly relevant in post-pandemic authentication scenarios. A user study is planned for future work to assess user experience, satisfaction and interaction latency under varying acoustic and lighting conditions.

Environmental Variability: No real-user field testing has been conducted to date and this remains a limitation. While extensive evaluation was performed on benchmark datasets (VoxCeleb1 and VidTIMIT), future work aims to include in-situ validation through longitudinal data collection under real deployment conditions (e.g., surveillance, access control).

Security and Privacy Concerns: Adversarial robustness remains an area for future exploration.
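The per-instance inference time mentioned under latency can be measured with a simple wall-clock harness. The sketch below is generic and is not the profiling code used in the study; `predict` stands for any callable that runs the model on one sample, and the warm-up count is an arbitrary choice.

```python
import time


def mean_latency(predict, samples, warmup=3):
    """Average wall-clock seconds per call of predict over the given samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    for sample in samples[:warmup]:
        predict(sample)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for sample in samples:
        predict(sample)
    return (time.perf_counter() - start) / len(samples)


# Stand-in model for illustration: a function that just returns its input.
latency = mean_latency(lambda sample: sample, list(range(100)))
```

With a GPU model, frameworks that execute asynchronously would need an explicit device synchronization before the clock is read, otherwise the measured time can be too low.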

5. Conclusion

In this work, we have proposed a novel MBS, named OEMBAS, for user authentication using face and speech combined at the feature level. The ensemble classifier is optimally trained using an improved MRFO algorithm. On the two open-source datasets, VoxCeleb1 and VidTIMIT, the OEMBAS recorded higher positive metrics and lower negative metrics and cost function values than the other conventional optimization techniques and classifiers. For Dataset1 / VoxCeleb1, with 80% of the data used for training, the OEMBAS attained an accuracy of 98.23%, a specificity of 94.00%, a precision of 99.10% and a sensitivity of 99.20%. For Dataset2 / VidTIMIT, with 80% of the data used for training, the OEMBAS attained an accuracy of 97.92%, a specificity of 93.00%, a precision of 97.90% and a sensitivity of 98.10%. The EER values of the OEMBAS on the two datasets are 3.23% and 3.62%, respectively, which are the lowest among the compared works and approaches (HGSO, COOT, BES, BMO and MRFO), denoting the high performance of the model. In a nutshell, this improvement results from the selection of robust state-of-the-art and improved feature extraction techniques for the face and speech modalities, the improved mutual information based feature-level fusion of both modalities, and the optimization of the ensemble BiLSTM-DCNN classifier with the improved metaheuristic MRFO technique. Hence, the proposed contactless MBS approach to user authentication can be deemed an efficient, reliable and hygienic security solution. Future research may incorporate transformer-based architectures and cross-domain generalization with multi-accent and multi-ethnic datasets for better performance.

Key Points:

This study proposes a novel Optimally Trained Ensemble Multimodal Biometric Authentication System (OEMBAS), integrating proposed human facial (physiological) and speech (behavioral) features at the feature level. The system enhances authentication performance using an advanced proposed SI-MRFO algorithm for hyperparameter tuning and optimal training of the BiLSTM-DCNN ensemble classifier.

Experimental results on the VoxCeleb1 and VidTIMIT datasets demonstrate superior performance compared to conventional optimization techniques. At 80% training data, the system achieved 98.23% accuracy, 99.10% precision, and 99.20% sensitivity on VoxCeleb1, and 97.92% accuracy, 97.90% precision, and 98.10% sensitivity on VidTIMIT. The Equal Error Rates (EERs) of 3.23% and 3.62%, respectively, are the lowest among existing approaches.

The superior performance of OEMBAS is attributed to robust proposed feature extraction techniques, improved mutual information-based feature-level fusion and ensemble classification optimization using an enhanced metaheuristic SI-MRFO technique.

The proposed contactless multimodal biometric system ensures high security, reliability, and hygiene, making it an effective and scalable solution for user authentication in cybersecurity applications.

Footnotes

Ethical considerations

Not applicable, as publicly available data sources²⁵,³¹ are used throughout the study.

Author contributions

The authors confirm contribution to the paper as follows: Study conception, design, analysis and interpretation of results and draft manuscript preparation: K. Jha, supervision: A. Jain and S. Srivastava. All authors reviewed the results and approved the final version of the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

Publicly available online data sources²⁵,³¹ are used throughout the study.

ORCID iD

Khushboo Jha

References

1. Gupta, Maple, Crispo, et al. A survey of human-computer interaction (HCI) & natural habits-based behavioural biometric modalities for user recognition schemes. Pattern Recognit 2023 Jul 1; 139: 109453.

2. Jha, Jain, Srivastava. A secure biometric-based user authentication scheme for cyber-physical systems in healthcare. Int J Exp Res Rev 2024; 39: 154–169.

3. Jha, Srivastava, Jain. A novel speaker verification approach featuring multidomain acoustics based on the weighted city block Minkowski distance. ETRI J 2025; 47: 227–243.

4. Jha, Srivastava, Jain. Integrating global and local features for efficient face identification using deep CNN classifier. In: 2023 International conference on device intelligence, computing and communication technologies (DICCT), 2023 Mar 17. IEEE, pp. 532–536.

5. Oloyede, Hancke. Unimodal and multimodal biometric sensing systems: a review. IEEE Access 2016 Sep 30; 4: 7532–7555.

6. Qin, Zhao, Zhuang, et al. A survey of identity recognition via data fusion and feature learning. Inf Fusion 2023 Mar 1; 91: 694–712.

7. Vekariya, Joshi, Dikshit. Multi-biometric fusion for enhanced human authentication in information security. Measurement: Sensors 2024 Feb 1; 31: 100973.

8. Purohit, Ajmera. Optimal feature level fusion for secured human authentication in multimodal biometric system. Mach Vis Appl 2021 Jan; 32: 1–12.

9. Wu, Yang, Zhou, et al. LVID: a multimodal biometrics authentication system on smartphones. IEEE Trans Inf Forensics Secur 2019 Sep 27; 15: 1572–1585.

10. Mustafa, Abdulelah, Ahmed. Multimodal biometric system iris and fingerprint recognition based on fusion technique. Int J Adv Sci Technol 2020; 29: 7423–7432.

11. Singh, Khanna, Garg. Multimodal biometric based on fusion of ridge features with minutiae features and face features. Int J Inform Syst Model Des 2020 Jan 1; 11: 37–57.

12. Nainan, Kulkarni. Enhancement in speaker recognition for optimized speech features using GMM, SVM and 1-D CNN. Int J Speech Technol 2021 Dec; 24: 809–822.

13. Abinaya, Indira, Swarup Kumar. Multimodal biometric person identification system based on speech and keystroke dynamics. In: International conference on computing, communication, electrical and biomedical systems. Cham: Springer International Publishing, 2022 Feb 28, pp. 285–299.

14. Abdulbaqi, Turki, Obaid, et al. Spoof attacks detection based on authentication of multimodal biometrics face-ECG signals. In: Artificial intelligence for smart healthcare. Cham: Springer International Publishing, 2023 Jun 10, pp. 507–526.

15. A El_Rahman, Alluhaidan. Enhanced multimodal biometric recognition systems based on deep learning and traditional methods in smart environments. Plos One 2024 Feb 15; 19: e0291084.

16. Jha, Jain, Srivastava. An efficient speaker identification approach for biometric access control system. In: 2023 5th international conference on recent advances in information technology (RAIT), 2023 Mar 3. IEEE, pp. 1–5.

17. Jha, Srivastava, Jain. A novel texture based approach for facial liveness detection and authentication using deep learning classifier. Int J Comput Exp Sci Eng 2024; 10: 323–331.

18. Juneja, Rana. An extensive study on traditional-to-recent transformation on face recognition system. Wirel Pers Commun 2021 Jun; 118: 3075–3128.

19. Arora, Mittal, Kukreja, et al. An evaluation of denoising techniques and classification of biometric images based on deep learning. Multimed Tools Appl 2023 Mar; 82: 8287–8302.

20. Tan, Jiang. Digital signal processing: fundamentals and applications. 3rd ed. United Kingdom: Academic Press, 2018 Oct 2.

21. Jha, Jain, Srivastava. Analysis of human voice for speaker recognition: concepts and advancement. J Electr Syst 2024; 20: 582–599.

22. Hutamaputra, Utaminingrum, Budi, et al. Eyes gaze detection based on multiprocess of ratio parameters for smart wheelchair menu selection in different screen size. J Vis Commun Image Represent 2023 Mar 1; 91: 103756.

23. Fan, Chen, et al. A landmark-free approach for automatic, dense and robust correspondence of 3D faces. Pattern Recognit 2023 Jan 1; 133: 108971.

24. Dai, Pears, Huber, et al. 3D morphable models: the face, ear and head. In: Liu Y, Pears N, Rosin PL, et al. (eds) 3D imaging, analysis and applications. Cham: Springer, 2020, pp. 463–512.

25. Jha, Jain, Srivastava. A contactless speaker identification approach using feature-level fusion of speech and face cues with DCNN. Proc Eng Sci 2024; 6: 1047–1056.

26. Jha, Jain, Srivastava. Feature-level fusion of face and speech based multimodal biometric attendance system with liveness detection. AIP Adv 2024 Nov 1; 14: 1–10.

27. Jolad, Khanai. An approach for speech enhancement with dysarthric speech recognition using optimization based machine learning frameworks. Int J Speech Technol 2023 Jul; 26: 287–305.

28. Liu, Yang, You, et al. Mutual information regularized feature-level frankenstein for discriminative recognition. IEEE Trans Pattern Anal Mach Intell 2021 May 4; 44: 5243–5260.

29. Elaziz, Abualigah, Ewees, et al. Triangular mutation-based manta-ray foraging optimization and orthogonal learning for global optimization and engineering problems. Appl Intell 2023; 53: 7788–7817.

30. Rani, Jain, Garg. Study of real-world optimization problems using advanced Nature Inspired Algorithms (NIA) discovered from 2019 to 2022. Artificial Intelligence Review 2023. DOI: 10.21203/rs.3.rs-2769987/v1.

31. Nagrani, Chung, Zisserman. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. 2017 Jun 26.

32. Deepa, Rasi. FHGSO: flower Henry gas solubility optimization integrated deep convolutional neural network for image classification. Appl Intell 2023 Mar; 53: 7278–7297.

33. Jithendra, Khan, Basha, et al. A novel QoS prediction model for web services based on an adaptive neuro-fuzzy inference system using COOT optimization. IEEE Access 2024 Jan 8; 12: 6993–7008.

34. Al Mazroa, Ishak, Aljarbouh, et al. Improved bald eagle search optimization with deep learning-based cervical cancer detection and classification. IEEE Access 2023 Nov 27; 11: 135175–135184.

35. Luque, Carrasco, Martín, et al. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit 2019 Jul 1; 91: 216–231.

36. Rajasekar, Saracevic, Hassaballah, et al. Efficient multimodal biometric recognition for secure authentication based on deep learning approach. Int J Artif Intell Tools 2023 May 22; 32: 2340017.

37. Jha, Jain, Srivastava. A challenge-response based authentication approach for multimodal biometric system using deep learning techniques. Scalab Comput Pract Exp 2025 July; 26: 2118–2129.

38. Muniasamy. Revolutionizing health monitoring: integrating transformer models with multi-head attention for precise human activity recognition using wearable devices. Technol Health Care 2025 Jan; 33: 395–409.

39. Alagarsamy, Murugan. Multimodal of ear and face biometric recognition using adaptive approach Runge–Kutta threshold segmentation and classifier with score level fusion. Wirel Pers Commun 2022 May; 124: 1061–1080.