Abstract
The need for newer biometric traits is increasing, as conventional biometric systems have been found vulnerable to forgery. Tongue print is gaining importance as a biometric trait, especially in forensics. The tongue is a well-protected vital organ that exhibits rich structural patterns. The success of tongue print as a biometric tool depends on how well discriminating features are extracted from it. Advances in deep neural networks and the availability of high-end computing environments have enabled remarkable progress in image recognition. A convolutional neural network (CNN) learns hierarchically, extracting feature maps that characterize the training data. However, obtaining a tongue-print dataset large enough to train a CNN for recognition is a major challenge. Two alternative techniques make it possible to employ a CNN for recognition. The first is to fine-tune a pre-trained CNN model, used as a classifier, with the new input dataset and class labels. The second is to use a pre-trained CNN model as a feature extractor and then apply a state-of-the-art classifier to the extracted features. In this paper, we address three important factors regarding the deployment of tongue print as a biometric tool. Since no tongue-print dataset is publicly available, our first objective was to create a challenging tongue-print dataset. We then explored and evaluated different state-of-the-art CNN architectures for image recognition. These models vary in architecture and contain 5 million to 144 million parameters. Finally, we analyzed different approaches to using pre-trained CNN models for the tongue-print identification task.
Introduction
A biometric system should support identification, authentication and non-repudiation in information security. Conventional biometric systems fail to meet these requirements because they can be forged. Hence, tongue prints are gaining importance in biometric authentication as a new biometric trait. The tongue is a unique vital organ, and its characteristic features differ remarkably even between identical twins. In traditional Chinese medicine, the tongue has played an important role in diagnosing disease conditions through the observation of characteristics such as colour and shape. Few studies have been carried out in the field of tongue-print identification. Zhi Liu et al. [1] attempted to build a tongue database and, based on their analysis, concluded that the tongue print can be used for personal identification. Li Q et al. [2] and Manoj Diwakar et al. [3] continued to study tongue-print images to assess the possibility of using tongue prints as a biometric trait. They also proposed different methods for creating a tongue database.
Omer et al. [4] showed, using the tongue’s cross-section, that the tongue has different characteristics even in identical twins, and stated that the tongue-print image can be used as a new biometric trait for human identification. Radhika et al. [5] compared tongue prints with other biometric traits and highlighted their superiority over other biometric tools. Bob Zhang and Han Zhang [6] used geometric features extracted from the tongue-print images of both healthy and unhealthy humans to study a patient’s condition. Stefanescu et al. [7] reported a classification of tongues based on an analysis of morphological features.
Salim Lahmiri [8] used six statistical features for tongue-print verification, extracting textural features from the tongue-print image using the wavelet transform. Manoj Diwakar et al. [3] used histogram features extracted from tongue-print images for human identification. Ryszard S. Choras [9] proposed steerable filters combined with the Weber Law Descriptor feature for identification. Zhang et al. [10] used both shape and textural features for identification, with geometrical features for shape characterization and textural codes as the textural features. Jeddy et al. [11] explained the use of tongue print as a method of biometric authentication for personal identification. Sivakumar et al. [12] studied the textural patterns of the tongue by extracting Local Binary Pattern (LBP) features and trained a linear Support Vector Machine (SVM) on the extracted features for personal identification.
Tongue-print as biometric
For a number of reasons, tongue print can be considered a reliable biometric trait. First and foremost, the tongue print is fully formed during fetal development. Studies by Omer et al. [4] have shown that the tongue has different characteristics even in identical twins. Since the tongue is well protected inside the mouth, it is not affected by external factors, which makes it a reliable tool for forensic studies.
The study group comprises 180 randomly selected individuals of both genders, aged 17 to 35 years, who volunteered to participate after giving informed consent. The initial objective of our work is to build a challenging dataset of tongue-print images. Images of the dorsal tongue are captured under standardized lighting conditions using a SONY-WX 350 Compact Camera with 20× Optical Zoom (DSC-WX350), with fixed head position and tongue protrusion and a constant subject-to-camera distance. For each individual, 5 different images of the dorsal surface of the tongue are captured with different orientation, scale and shape. Our dataset therefore consists of a total of 900 images with resolution 4896 × 2752. To preserve generality, we have not pre-processed the images to segment the dorsal surface of the tongue. Table 1 describes the tongue-print dataset used for personal identification. Fig. 1 shows example tongue-print images from the Tongue DB dataset.
Description of the tongue-print datasets used for the proposed method

Examples from the Tongue DB dataset: eight tongue-print images from 4 different individuals, varying in scale and orientation.
The success of a tongue-print image as a biometric tool for human identification depends on how well the geometric outline and physiological texture of the human tongue capture its uniqueness [12]. Therefore, a study has been initiated to automatically extract the most suitable features from the dorsum of the tongue for the automatic identification of an individual.
A number of handcrafted feature extraction methods have been proposed in the literature, such as Gabor features, Histogram of Oriented Gradients (HoG), SURF features and the Visual Bag of Words (BoW) framework, and have proven effective in their own domains. However, these features are hand-engineered, and their low-level semantics may not be well suited to a specific domain like tongue-print identification. A method that learns features by itself is a good alternative. For instance, Convolutional Neural Network (CNN) based methods have proven able to learn to extract specific features automatically from the input image. A deep CNN architecture uses a stack of layers to learn features: the initial layers extract local features such as blobs and edges, and the final layers extract global features that are used for recognition [13]. Recently, a variety of CNN-based models that learn features automatically for image recognition have been reported in the literature.
The paper is organized as follows: Section 2 explains the methodology used in this paper. Section 3 describes the different approaches used to apply CNNs to tongue-print identification. The experimental setup, implementation details and results are provided in Section 4. Section 5 draws the conclusions.
Deep CNN methods can learn feature representations that discriminate between different objects in a recognition task. However, most popular deep CNN models are trained on a large-scale dataset such as ImageNet [23]. Many studies [19, 20] have shown that feature representations generated by pre-trained CNN models also achieve excellent performance in application domains such as classification, recognition and retrieval. Moreover, existing CNN models are computationally expensive and have not been tested on a tongue-print dataset. Therefore, we have applied pre-trained CNN models in different ways to investigate their performance on the tongue-print application.
Deep CNN models
We used four of the most popular deep neural network architectures, which vary in computational complexity, number of parameters, depth and representational power, to investigate how well features are extracted from tongue-print images for human identification. A general architecture of a CNN is shown in Fig. 2.

General architecture of a deep CNN model.
When it was proposed for object recognition, AlexNet was deeper and wider than earlier CNN models. It consists of a total of 8 layers, of which 5 are convolutional and the rest are fully connected. Its major contributions to the CNN model are response normalization and pooling. The first convolutional layer takes an input image of size 227 × 227 × 3 and filters it with 96 filters of size 11 × 11 × 3 with a stride of 4 pixels. The second layer performs the convolution using 256 filters of size 5 × 5 × 48, and the normalized, pooled output is fed as input to the third layer. The third, fourth and fifth convolutional layers are connected to one another without any normalization or pooling layers; they use 384, 384 and 256 filters of size 3 × 3, respectively. The fully connected layers use dropout and have 4096 neurons each, and are followed by a softmax layer for classification. The AlexNet architecture has over 60 million trainable parameters.
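The first-layer figures above can be checked with the standard convolution output-size formula, floor((W − F + 2P) / S) + 1. The following is a minimal Python sketch, assuming zero padding in the first layer (the padding is not stated in the text); the function names are illustrative.

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1


def conv_param_count(num_filters, filter_h, filter_w, in_channels):
    """Trainable parameters of a convolutional layer: weights plus one bias per filter."""
    return num_filters * (filter_h * filter_w * in_channels + 1)


# First AlexNet layer as described: 227 x 227 x 3 input, 96 filters of 11 x 11 x 3, stride 4.
first_layer_size = conv_output_size(227, 11, 4)
first_layer_params = conv_param_count(96, 11, 11, 3)
print(first_layer_size, first_layer_params)  # 55 34944
```

Under this assumption the first layer produces 55 × 55 feature maps and holds about 35 thousand parameters, a small fraction of the network total, most of which sits in the fully connected layers.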
VGG takes an input image of size 224 × 224 × 3. This image is processed by a stack of convolutional layers whose number depends on the variant used. The main change VGG introduced compared to AlexNet is the use of very small filters of size 3 × 3. The number of filters increases by a factor of 2, starting from 64 in the first layer until it reaches 512 in the last layers. The convolution stride is set to 1 pixel. VGG uses 5 max-pooling layers between some of the convolutional layers to perform spatial pooling. Three fully connected layers follow the stack of convolutional layers, and the final layer is a softmax layer. The 1000 output channels of the last fully connected layer are used for classification. VGG-16 has 138 million parameters, whereas VGG-19 has 144 million.
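The parameter counts quoted for VGG can be reproduced by summing the weights and biases layer by layer. The sketch below is a Python illustration; the per-layer filter lists are taken from the published VGG-16 and VGG-19 configurations rather than from this text, and the function name is illustrative.

```python
# Filter counts per 3 x 3 convolutional layer; "M" marks a 2 x 2 max-pooling layer.
VGG16 = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]
VGG19 = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
         512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]


def count_vgg_parameters(config, input_size=224, in_channels=3, num_classes=1000):
    """Count trainable parameters (weights and biases) of a VGG-style network."""
    total = 0
    channels = in_channels
    size = input_size
    for item in config:
        if item == "M":
            size //= 2  # each max-pooling layer halves the spatial resolution
        else:
            total += item * (3 * 3 * channels + 1)  # 3 x 3 filters plus one bias each
            channels = item
    flat = channels * size * size  # flattened input to the first fully connected layer
    for fc_in, fc_out in [(flat, 4096), (4096, 4096), (4096, num_classes)]:
        total += fc_in * fc_out + fc_out
    return total


vgg16_params = count_vgg_parameters(VGG16)
vgg19_params = count_vgg_parameters(VGG19)
print(vgg16_params, vgg19_params)  # 138357544 143667240
```

Passing `num_classes=180` gives the corresponding count when the last layer is replaced for the 180 subjects of the Tongue DB dataset.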
The idea of the Inception architecture is to drastically reduce the number of parameters required for convolution. This is done by replacing larger convolution filters of size n × n with a sequence of two smaller filters of size n × 1 and 1 × n. Overall, the GoogLeNet architecture consists of 2 convolutional layers, 2 pooling layers and 9 Inception layers. Each Inception layer is built from 6 convolutional layers and 1 pooling layer. Unlike the other models, GoogLeNet does not use fully connected layers; instead, it uses a global average pooling layer, and the activation values of the 1000 output channels are used for image classification.
The overall ResNet-50 architecture can be viewed as a chain of 5 residual groups, each with a convolution block and an identity block. Each convolution block has 3 convolutional layers, and each identity block also has 3 convolutional layers. Feature maps computed by different layers in the same group share the same resolution. ResNet-50 has over 25 million trainable parameters. Several advanced architectures combining Inception and residual units have also been proposed. The concept of an Inception block with residual connections is introduced in the Inception-v4 architecture [22].
Table 2 summarizes the key properties of the deep CNN models we considered for training. The output size field in Table 2 specifies the number of output channels in the last fully connected layer of the selected deep CNN models; in our case the output size is 1 × 180, corresponding to the 180 subjects of our Tongue DB dataset.
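The saving from this factorization can be illustrated by counting weights. The Python sketch below assumes, for simplicity, that biases are ignored and that the intermediate feature map between the n × 1 and 1 × n filters has as many channels as the output; these assumptions and the function names are not taken from the text.

```python
def full_filter_weights(n, in_channels, out_channels):
    """Weights of a single n x n convolution (biases ignored)."""
    return n * n * in_channels * out_channels


def factorized_filter_weights(n, in_channels, out_channels):
    """Weights of an n x 1 convolution followed by a 1 x n convolution (biases ignored)."""
    first = n * 1 * in_channels * out_channels
    second = 1 * n * out_channels * out_channels
    return first + second


for n in (3, 5, 7):
    full = full_filter_weights(n, 64, 64)
    factorized = factorized_filter_weights(n, 64, 64)
    print(n, full, factorized, round(factorized / full, 3))
```

With equal input and output channels the ratio is 2/n, so the saving grows with the filter size: one third of the weights are saved for n = 3 and about 71% for n = 7.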
Comparison of the characteristics of the selected deep CNN models
Because CNN models are computationally expensive and require a large-scale dataset for the recognition task, we propose two approaches to apply CNNs to tongue-print image recognition: 1) the first uses a pre-trained CNN model as a classifier, fine-tuned on our dataset; 2) the second uses the pre-trained model as a feature extractor, and the extracted features are then used to train an SVM classifier for the final classification.
Deep CNN models can be either learned from scratch or fine-tuned from pre-trained models. Training the millions of parameters of deep CNN models like AlexNet and GoogLeNet from scratch, and annotating the very large number of tongue-print images this would require, is an unmanageable challenge. Studies [19, 20] have shown that CNN models pre-trained on a dataset like ImageNet transfer well to other datasets with some degree of fine-tuning. The idea of the first approach is therefore to transfer the weights trained on the large-scale ImageNet dataset to make the tongue-print image recognition task more effective.
For this approach, we used the methodology cited in [19, 18], in which all CNN layers except the last one are fine-tuned at a learning rate 10 times smaller than the default. The last fully connected layer is replaced with a fresh layer that accommodates the 180 subject labels of our Tongue DB dataset. This layer is randomly initialised and trained afresh.
Our second approach is to use the pre-trained models as feature extractors. We considered the four pre-trained models described in Section 2.1. In this approach, we give our tongue-print image dataset as input and use the unchanged parameters of the pre-trained models to obtain the final activation from the fully connected layer. This activation serves as a global generic descriptor of the input image. The advantage of this approach is a shorter training time than the first approach, since the training phase does not modify the network parameters. The SVM training process is defined as follows.
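The fine-tuning setup described above can be sketched as a per-layer learning-rate plan plus a freshly initialised output layer. This is a minimal, framework-free Python sketch, not the MATLAB implementation used in the experiments. The layer names, the feature dimension of 4096, the Gaussian initialisation with standard deviation 0.01, and the choice to train the new layer at the default rate are all assumptions made for illustration.

```python
import random


def build_fine_tuning_plan(layer_names, default_lr, num_subjects=180,
                           feature_dim=4096, seed=0):
    """Return per-layer learning rates and a randomly initialised last layer."""
    # Pre-trained layers are fine-tuned at one tenth of the default learning rate.
    plan = {name: default_lr / 10 for name in layer_names[:-1]}
    # The replaced last layer is trained afresh (assumed here: at the default rate).
    plan[layer_names[-1]] = default_lr
    rng = random.Random(seed)
    # Fresh weights for the new layer: one row per subject label (assumed Gaussian init).
    new_weights = [[rng.gauss(0.0, 0.01) for _ in range(feature_dim)]
                   for _ in range(num_subjects)]
    new_biases = [0.0] * num_subjects
    return plan, new_weights, new_biases


# Hypothetical layer names for an AlexNet-like network.
layers = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]
plan, weights, biases = build_fine_tuning_plan(layers, default_lr=0.01)
print(plan["conv1"], plan["fc8"], len(weights), len(weights[0]))
```

In a deep learning framework the same plan would typically be expressed as per-layer learning-rate factors or optimizer parameter groups.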
Given a set of $N$ training samples $\{X_i, y_i\}$, $i = 1, \ldots, N$, where $X_i \in \mathbb{R}^n$ belongs to the binary class labeled by $y_i \in \{1, -1\}$, SVM implicitly maps the data into a higher dimensional feature space and finds a separating linear hyperplane, for the given data, with a maximum margin. For a new sample $X$, the SVM classifier uses the following function to decide its class.
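In the standard kernel formulation, which matches the $\alpha_i$ and kernel notation used in this section, the decision function takes the form below. The bias term $b$ and the kernel symbol $K$ are standard notation assumed here.

```latex
f(X) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i \, y_i \, K(X_i, X) + b \right)
```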
The training samples $X_i$ with $\alpha_i > 0$ are called support vectors, and SVM finds the separating hyperplane that maximizes the margin between the support vectors and the hyperplane. The most frequently used kernel functions are linear, polynomial and Radial Basis Function (RBF).
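The three kernels named above, together with the usual kernel SVM decision rule sign(Σ α_i y_i K(X_i, X) + b), can be sketched in plain Python. The kernel hyperparameters (`coef0`, `gamma`) and the toy support vectors are illustrative assumptions, not values from the experiments.

```python
import math


def linear_kernel(x, z):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel (x . z + coef0) ** degree."""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """Radial Basis Function kernel exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, labels, alphas, bias, kernel):
    """Binary SVM decision: the sign of the kernel expansion over the support vectors."""
    score = sum(alpha * y * kernel(sv, x)
                for sv, y, alpha in zip(support_vectors, labels, alphas)) + bias
    return 1 if score >= 0 else -1


# Toy example with two support vectors, one per class.
support_vectors = [[1.0, 0.0], [-1.0, 0.0]]
labels = [1, -1]
alphas = [1.0, 1.0]
print(svm_decision([2.0, 0.0], support_vectors, labels, alphas, 0.0, linear_kernel))
```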
As a maximum-margin classifier, SVM is designed to solve two-class problems, while tongue-print identification is a q-class problem, where q is the number of known individuals. Two approaches can be taken to solve the q-class problem. The first is to reformulate tongue-print identification as several separate two-class problems, each separating one class from all the others (one-vs-all). The second is to employ a set of pairwise SVMs to solve the generic q-class recognition problem (one-vs-one) [12]. In this paper, we used the one-vs-all technique, which trains binary classifiers to separate one class from all other classes and outputs the class with the largest posterior probability.
To test the proposed method, we used the Tongue DB dataset. Initially, we divided the dataset so that 60% of the images were used for training and the rest for testing the supervised classifiers. Since each CNN model used in the experiment requires a different input size, we programmatically resized the images to the input size required by each model during training. We investigated the performance of the deep CNN architectures described in Section 2.1 for the identification task. For the first approach, we used the default softmax layer for subject identification, while for the second approach we used the SVM classifier.
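The difference between the two multi-class schemes can be made concrete with a short Python sketch: one-vs-all needs q binary classifiers, one-vs-one needs q(q − 1)/2, and one-vs-all prediction selects the class whose classifier returns the largest score. The subject names and scores below are made up for illustration.

```python
def count_binary_classifiers(q):
    """Number of binary SVMs needed for a q-class problem under each scheme."""
    return {"one_vs_all": q, "one_vs_one": q * (q - 1) // 2}


def one_vs_all_predict(scores_by_class):
    """Return the class whose binary classifier gives the largest score."""
    return max(scores_by_class, key=scores_by_class.get)


classifier_counts = count_binary_classifiers(180)
print(classifier_counts)  # {'one_vs_all': 180, 'one_vs_one': 16110}

# Hypothetical posterior scores for one test image from three of the classifiers.
scores = {"subject_001": 0.10, "subject_002": 0.70, "subject_003": 0.20}
print(one_vs_all_predict(scores))  # subject_002
```

For the 180 subjects considered here, one-vs-all therefore trains 180 binary classifiers instead of 16110.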
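With 900 images, a 60/40 division gives 540 training and 360 test images. The Python sketch below assumes the split is made per subject (3 training and 2 test images out of each subject's 5), which the text does not state; the function name and the fixed random seed are illustrative.

```python
import random


def split_per_subject(num_subjects=180, images_per_subject=5,
                      train_fraction=0.6, seed=0):
    """Split (subject, image_index) pairs so every subject appears in both sets."""
    rng = random.Random(seed)
    n_train = round(images_per_subject * train_fraction)
    train, test = [], []
    for subject in range(num_subjects):
        indices = list(range(images_per_subject))
        rng.shuffle(indices)  # random choice of training images for this subject
        train += [(subject, i) for i in indices[:n_train]]
        test += [(subject, i) for i in indices[n_train:]]
    return train, test


train_set, test_set = split_per_subject()
print(len(train_set), len(test_set))  # 540 360
```

A per-subject split guarantees that every identity seen at test time also has training examples, which a purely random split over 900 images would not.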
Experimental setup
The proposed tongue-print identification method is implemented using the MATLAB R2019a Deep Learning Toolbox. We ran the toolbox in a Tesla K80 dual-GPU environment with NVIDIA GPU Boost technology and the widely used CUDA parallel computing model.
In the first approach, we fine-tune a pre-trained model using our tongue dataset as the training set, so that the network learns the new subject labels. The learning rate was set to a very low value (0.000003) to avoid large weight updates, preserving the useful features learned by the pre-trained model.
In the second approach, a pre-trained CNN model with its default weight parameters was used for feature extraction, with our tongue dataset as the input training set. The output of the last fully connected layer, with dimension 1 × 1000, is taken as the final feature vector for classification and given as input to the SVM classifier. We set the total number of support vectors $\alpha_i = 3$ and used 5-fold cross-validation for training the samples. We achieved the best accuracy when a polynomial kernel of degree 2 (quadratic) was used with the SVM.
Table 3 details the results of the two approaches used to investigate the performance of different CNN models in the automatic tongue-print identification system. In the identification task, we achieved the best accuracy of 98.61% with the first approach (fine-tuned pre-trained model) and 96.94% with the second approach (pre-trained model without fine-tuning), in both cases with the ResNet deep CNN model. Although the training time of the first approach is higher than that of the second, the results underline the superior performance of transfer learning for the identification task. The results also show that the ResNet deep CNN model outperformed the other models in both approaches.
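Assuming the test set holds the 360 images left after the 60% training split of 900 images, the two best accuracies correspond to whole numbers of correctly identified images. The counts in the Python sketch below are inferred from the reported percentages and are not stated in the text.

```python
def accuracy_percent(correct, total):
    """Identification accuracy in percent, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


total_images = 900
test_images = total_images - total_images * 60 // 100  # 360 images held out for testing

fine_tuned_accuracy = accuracy_percent(355, test_images)
feature_extractor_accuracy = accuracy_percent(349, test_images)
print(test_images, fine_tuned_accuracy, feature_extractor_accuracy)  # 360 98.61 96.94
```

Under this assumption, 98.61% means 5 misidentified test images for the fine-tuned ResNet and 96.94% means 11 for the feature-extraction approach.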
Comparison of the accuracy with the different deep CNN models used
To evaluate the performance of the proposed method, we compared its accuracy on the same dataset with that of the state-of-the-art techniques proposed in the literature by Salim Lahmiri [8], Zhang et al. [10] and Sivakumar et al. [12]. The comparison is shown in Table 4. The table shows that the proposed method performs better than the other state-of-the-art techniques.
Performance comparison with other works reported in the literature
To further evaluate the performance, the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are used. Every biometric system works within a range of operational values. The system accepts the user as genuine once it outputs a value within this range. A Receiver Operating Characteristic (ROC) curve plots the Genuine Acceptance Rate (1 − FRR) against the False Acceptance Rate for all system operational points. To compute the False Acceptance Rate

ROC plot of experimental results on Tongue DB dataset for two different methods.
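The quantities plotted in an ROC curve can be sketched in Python as follows. This assumes the usual definitions: FAR is the fraction of impostor attempts accepted, FRR is the fraction of genuine attempts rejected, and a score at or above the threshold counts as an acceptance. The match scores and thresholds are made up for illustration.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """False Acceptance Rate and False Rejection Rate at one operating threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


def roc_points(genuine_scores, impostor_scores, thresholds):
    """(FAR, GAR) pairs, with GAR = 1 - FRR, one per operating threshold."""
    points = []
    for threshold in thresholds:
        far, frr = far_frr(genuine_scores, impostor_scores, threshold)
        points.append((far, 1.0 - frr))
    return points


# Hypothetical match scores in [0, 1].
genuine = [0.9, 0.8, 0.4, 0.7]
impostor = [0.1, 0.5, 0.3, 0.2]
far, frr = far_frr(genuine, impostor, threshold=0.45)
print(far, frr)  # 0.25 0.25
print(roc_points(genuine, impostor, [0.0, 0.45, 1.0]))
```

Sweeping the threshold from the lowest to the highest score traces the curve from the accept-everything corner (FAR = 1, GAR = 1) to the reject-everything corner (FAR = 0, GAR = 0).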
The tongue is a well-protected vital organ within the oral cavity and hence not vulnerable to forgery. The dorsal surface of the tongue exhibits rich structural patterns, which makes it possible to use tongue prints as a novel biometric tool in forensic and biometric applications. In this paper, a framework for applying deep CNN models to automatic tongue-print identification is proposed. The framework uses two approaches for the identification task: the first uses a fine-tuned pre-trained deep CNN model as a classifier, while the second uses a pre-trained deep CNN model as a feature extractor. In the second approach, the features extracted by the CNN are used to train an SVM classifier for the final identification. Of the two approaches, the first, with the ResNet CNN model, achieves the best identification accuracy. The results show the superior performance of deep CNNs for personal identification using tongue prints, and thus that the tongue print can be used as a reliable biometric trait.
Acknowledgment
The authors acknowledge the University Grants Commission (UGC) for providing the funding to set up Massive Parallel Processing systems at the Department of Computer Science, Cochin University of Science and Technology (CUSAT), under the UGC XII plan (File No. PL.(UGC)1/SPG/2016-17 dated 08.07.2016).
The authors also acknowledge PMS College of Dental Science and Research, Thiruvananthapuram, Kerala, for the partial funding of the project.
