Multi-class classification using a new Bayesian method

Abstract

This paper proposes a new classification model based on the Bayesian method. The model determines the prior probability with the k-means algorithm, estimates the probability density function with a kernel function, and classifies objects into known populations. The proposed model is illustrated by an image classification experiment: we first use the gray-level co-occurrence matrix to extract the features of the images and then classify the resulting data set with the improved Bayesian method. In another application, we build the classification problem for the Algerian Forest Fires data set. The main advantages of the method are the adaptability of the kernel function, the ability to handle multi-class problems, and the reduction of computational cost. The experimental results also show the potential of the developed model.

Keywords

Bayesian method, kernel function, fuzzy c-mean, image classification

1. Introduction

Classification by the Bayesian method is a supervised learning technique that is widely used in machine learning. Classification has been applied in many fields, such as economics, medicine and technology, and many methods are currently available. The most popular are the Fisher method (Fisher, 1992), logistic regression (Hosmer et al., 1991), Linear Discriminant Analysis (LDA) (Izenman, 2013), Support Vector Machine (SVM) (Tanveer et al., 2019) and the Bayesian method (Murphy, 2006; Vo-Van et al., 2018). Each method takes a different approach. For example, the SVM algorithm constructs a separating hyperplane with the maximal margin. Since only support vectors are used for classification and many majority samples far from the decision boundary can be removed, SVM can be more accurate on moderately unbalanced data. However, SVM is sensitive to highly unbalanced data, since it tends to produce a classifier that is strongly biased towards the majority class and therefore gives poor accuracy for the minority class, as discussed by Tang et al. (2008). Furthermore, in many cases this method is limited because it relies on assumptions that are very difficult to satisfy in practice. The Bayes method does not impose such conditions on the data and can classify many populations (Tai et al., 2016; Vo-Van et al., 2018). Building on previous results and considering different distances, Vo-Van et al. (2018) introduced the $L^{1}$-distance into the classification problem. This distance was used to build the classification rule and to calculate the Bayes error. Their study addressed the computational problem and demonstrated advantages over previous methods through numerical examples. However, in many cases the error of this method is quite large.

In the Bayesian method, determining the prior probability is very important. Nguyen-Trang and Vo-Van (2017) proposed a method to determine the prior probability based on a fuzzy clustering algorithm. However, it improved only the prior probability and did not address the other important aspects of the classification problem; therefore, it did not obtain good results. Vo-Van et al. (2018) then made important contributions to the theory of Bayesian classification, but they considered only the case of two populations. Compared with discrete data, classifying images is more complex in terms of algorithms and computation (Zhao et al., 2019). In general, a typical image classification algorithm consists of two main phases: extracting the features of the image, and building the rule for classifying images from the extracted features. Feature extraction seeks the representative elements of an image that distinguish it from others. Depending on the classification problem, three prominent types of features are commonly used: colour, texture, and shape (Tang et al., 2008). Some studies, such as Zhao et al. (2019) and Vo-Van et al. (2018), have confirmed that no method yet reaches an optimal solution for all situations; the choice depends on the set of images and the purpose of the classification. This study focuses on extracting the features of images in order to classify them. In this approach, the grey-level co-occurrence matrix (GLCM) is determined first, and the texture features are then computed from it. Texture features retain valuable details about the fundamental arrangement of the surface, so for X-ray images they are suitable for representing the image.

From the above analysis, we see that multi-class classification using the Bayesian method, especially in its application to images, is still limited. In this study, we improve the important steps of the Bayesian method to obtain better results. Specifically, the following issues are addressed:

(i) Find the prior probability of each class based on the k-means algorithm.
(ii) Introduce a method to estimate the probability density function for multi-class data.
(iii) Improve the Bayesian method for the multi-population case. In addition, the proposed algorithm can classify many elements at the same time instead of one element at a time, as the existing algorithms do.
(iv) Calculate the fuzzy relationship between each population and the considered objects.
(v) Provide a computational program that reduces the cost and the running time.

The study also considers the method for extracting the features of images based on the kernel function. On the two data sets, the proposed model shows clear advantages over the existing algorithms. Moreover, this implementation can serve as a foundation for many important practical applications in different fields.

This paper is organized as follows. Section 2 introduces the basic concepts of the Bayes theorem and its estimation. Section 3 presents the method for extracting the features of images. Section 4 describes the proposed model. Numerical examples are illustrated in Section 5. The conclusion is given at the end of the paper.

2. Bayes theorem and its estimation

Consider $k$ populations ${w_{1}},{w_{2}},\ldots,{w_{k}}$, where ${q_{i}}\in(0,1)$ and ${f_{i}}(x)$ are the prior probability and the probability density function (pdf) of the $i^{th}$ population, respectively, $i=1,2,\ldots,k$.

Let the element to be classified be represented by a vector $x=(x_{1},x_{2},\ldots,x_{n})$ of $n$ variables. This element is assigned to the population $C_{k}$ with the probability $p({C_{k}}\mid{x_{1}},{x_{2}},\ldots,{x_{n}})$ as follows:

$\displaystyle p({C_{k}}\mid{x_{1}},{x_{2}},\ldots,{x_{n}})\propto p({C_{k}},{x_{1}},{x_{2}},\ldots,{x_{n}})$

When the variables are independent, we have

$\displaystyle p({C_{k}}\mid{x_{1}},{x_{2}},\ldots,{x_{n}})=\frac{P({C_{k}})\prod_{i=1}^{n}p({x_{i}}\mid{C_{k}})}{P(x)}=\frac{P({C_{k}})\prod_{i=1}^{n}p({x_{i}}\mid{C_{k}})}{\sum_{l}P({C_{l}})p(x\mid{C_{l}})}.$

Then, we obtain the following result:

$\displaystyle p({C_{k}},{x_{1}},{x_{2}},\ldots,{x_{n}})\propto P({C_{k}})p({x_{1}}\mid{C_{k}})\ldots p({x_{n}}\mid{C_{k}})\propto P({C_{k}})\prod_{i=1}^{n}p({x_{i}}\mid{C_{k}}).$
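
The rule above can be illustrated with a minimal Python sketch. The priors and per-variable likelihood values below are hypothetical numbers chosen only for illustration, and the function name `posterior` is an assumption of this sketch, not part of the paper. The joint term $P(C_{k})\prod_{i}p(x_{i}\mid C_{k})$ is computed for each population and then normalized over all populations.

```python
from math import prod

def posterior(priors, likelihoods):
    """Posterior P(C_k | x) under the independence assumption.

    priors[k]         -- P(C_k)
    likelihoods[k][i] -- p(x_i | C_k) for the i-th variable
    """
    # Unnormalized joint term: P(C_k) * prod_i p(x_i | C_k)
    joint = [q * prod(lk) for q, lk in zip(priors, likelihoods)]
    # P(x): sum of the joint terms over all populations
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical values for two populations and two variables
priors = [0.5, 0.5]
likelihoods = [[0.2, 0.5], [0.1, 0.3]]
post = posterior(priors, likelihoods)
print(post)
```

Because the denominator is the same for every population, only the joint terms matter for choosing the most probable population; the normalization is needed only when the posterior values themselves are reported.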

In practice, the data available for a classification problem are discrete observations, so the first step in applying the Bayesian method is to estimate the probability density function. There are many parametric and non-parametric estimation methods. In this paper, we use the kernel function, a method commonly used in real applications (Vo-Van et al., 2018; Vovan, 2017).

In the $n$-dimensional case, the probability density function of each population is estimated by the kernel method in the following form:

$\displaystyle p(x\mid{C_{k}})=f(x)=\frac{1}{N{h_{1}}{h_{2}}\ldots{h_{n}}}\sum_{i=1}^{N}\prod_{j=1}^{n}{K_{j}}\left(\frac{x_{j}-x_{ij}}{h_{j}}\right),$ (1)

where $N$ is the number of elements in $C_{k}$; $h_{j}$ is the smoothing parameter of the $j^{\text{th}}$ variable; $K_{j}(\cdot)$ is the kernel function of the $j^{\text{th}}$ variable (a smooth and symmetric function); and $x_{ij}$ is the $i^{\text{th}}$ value of the $j^{\text{th}}$ variable in $C_{k}$. There are many kernel functions and methods for choosing the smoothing parameter. In this study, the normal distribution is chosen as the kernel function, and the smoothing parameter is taken according to Vo-Van et al. (2018) and Vovan (2017).
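
A minimal Python sketch of the product-kernel estimate in Eq. (1) is given below, using the normal kernel. The bandwidth rule in `rule_of_thumb_bandwidths` is a Scott-type rule of thumb and is an assumption of this sketch; the paper takes its smoothing parameter from the cited works, which may differ. The function names and the sample values are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the kernel K_j."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def rule_of_thumb_bandwidths(sample):
    """Assumed bandwidth rule: h_j = sigma_j * N**(-1 / (n + 4)).

    Requires at least two observations and non-constant variables.
    """
    N, n = len(sample), len(sample[0])
    hs = []
    for j in range(n):
        col = [row[j] for row in sample]
        mean = sum(col) / N
        var = sum((v - mean) ** 2 for v in col) / (N - 1)
        hs.append(math.sqrt(var) * N ** (-1.0 / (n + 4)))
    return hs

def kde_product(x, sample, hs):
    """Eq. (1): average over the N observations of the product of n kernels."""
    total = 0.0
    for row in sample:
        term = 1.0
        for xj, xij, hj in zip(x, row, hs):
            term *= gaussian_kernel((xj - xij) / hj)
        total += term
    return total / (len(sample) * math.prod(hs))

# Hypothetical sample of one population with n = 2 variables
sample = [[1.0, 2.0], [1.5, 1.8], [0.8, 2.4]]
hs = rule_of_thumb_bandwidths(sample)
density = kde_product([1.0, 2.0], sample, hs)
print(density)
```

One estimate of this kind is built per population, so that each population has its own sample, its own bandwidths and its own density.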

3. Extracting the feature of images

The gray-level co-occurrence matrix (GLCM) of an image of size $M\times N$ is a matrix $P$ of size $g\times g$, where $g$ is the number of gray levels used to construct the matrix. Each element ${p_{d\theta}}(i,j)$ of $P$ gives the probability that the intensities $i$ and $j$ occur together at distance $d$ and orientation angle $\theta$. It is given by Eq. (2).

$\displaystyle p_{d\theta}(i,j)=\{((r,c),(r^{\prime},c^{\prime}))\in M\times N\mid d=|(r,c),(r^{\prime},c^{\prime})|,\theta=\Theta((r,c),(r^{\prime},c^{\prime})),I(r,c)=i,I(r^{\prime},c^{\prime})=j\}$ (2)

The descriptive structure of GLCM is given in Fig. 1.

Table 1

The formula for four features of each image

Feature	Formula
Entropy	$\sum\limits_{i,j}p({i,j})^{2}$
Contrast	$\sum\limits_{i,j}{\|i-j\|^{k}}{p^{l}}(i,j)$
Homogeneity	$\sum\limits_{i,j}\frac{p(i,j)}{1+\|i-j\|}$
Correlation	$\sum\limits_{i,j}\frac{(i-{\mu_{i}})(j-{\mu_{j}})p(i,j)}{\delta_{i}\delta_{j}}$

Figure 1.

GLCM with $d=1$ and $\theta=0$ .

Based on the GLCM, Haralick and Shapiro (1992) proposed equations for calculating 14 texture features. However, most recent studies use only three or four of these 14 features as representative features (Celebi & Alpkocak, 2000; Vovan et al., 2020). In this paper, the four main features used are Entropy, Contrast, Homogeneity and Correlation. They are shown in Table 1.

Each image has four features computed at distance $d=$ 1 and four angles $\theta=\{0,\frac{\pi}{2},\pi,2\pi\}$, which gives 16 features for each image.
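
The construction of the 16 features can be sketched in Python as follows. Several choices here are assumptions of the sketch rather than details taken from the paper: the four pixel offsets correspond to the conventional GLCM directions 0°, 45°, 90° and 135° at distance 1; the exponents in the Contrast formula are set to $k=2$ and $l=1$; and $\delta_{i}$, $\delta_{j}$ are read as standard deviations. The first feature is computed exactly as the formula in the Entropy row of Table 1 (a sum of squared entries, often called energy elsewhere). The 4 x 4 image is a hypothetical example with four gray levels.

```python
import math

def glcm(image, dr, dc, levels):
    """Normalized co-occurrence matrix for the pixel offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[v / total for v in row] for row in counts]

def texture_features(p, k=2, l=1):
    """The four Table 1 formulas; k and l are assumed exponents."""
    cells = [(i, j, p[i][j]) for i in range(len(p)) for j in range(len(p))]
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sd_i = math.sqrt(sum((i - mu_i) ** 2 * v for i, _, v in cells))
    sd_j = math.sqrt(sum((j - mu_j) ** 2 * v for _, j, v in cells))
    squared_sum = sum(v ** 2 for _, _, v in cells)  # "Entropy" row of Table 1
    contrast = sum(abs(i - j) ** k * v ** l for i, j, v in cells)
    homogeneity = sum(v / (1 + abs(i - j)) for i, j, v in cells)
    correlation = sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells) / (sd_i * sd_j)
    return [squared_sum, contrast, homogeneity, correlation]

# Hypothetical 4 x 4 image with gray levels 0..3
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]

# Assumed offsets (row step, column step) for 0, 45, 90 and 135 degrees, d = 1
offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]
features = [f for dr, dc in offsets
            for f in texture_features(glcm(image, dr, dc, levels=4))]
print(len(features))
```

Each direction contributes four values, so concatenating the four directions gives the 16-dimensional feature vector used to represent an image.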

4. The proposed model

The proposed model has six steps as follows:

Step 1. Initializing the input data, denoted as follows:

$\displaystyle I=\left\{{{I_{1}},{I_{2}},\ldots,{I_{N}}}\right\},$

where $I_{i}$ is the data representing the $i^{\text{th}}$ object.

Step 2. Calculating the prior probability of each population based on Eq. (3).

$\displaystyle P({C_{i}})={p_{i}}=\frac{{m_{i}}}{n},i=1,\ldots,k,$ (3)

where $m_{i}$ is the number of elements belonging to the $i^{\text{th}}$ population, $n$ is the number of input objects, and $k$ is the number of groups, so that the prior probabilities sum to one.

In this step, we use the k-means algorithm to determine the number of elements in each population.
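
As a minimal sketch of this step, the Python code below runs plain Lloyd iterations and then takes the prior of each population as its share of the objects, so that the priors sum to one (for instance, four of seven objects give about 0.571). The initialization with the first $k$ points, the function names and the one-dimensional data are assumptions of this sketch; the paper does not specify these details.

```python
from collections import Counter

def kmeans_labels(points, k, iters=100):
    """Plain Lloyd iterations; the first k points seed the centroids (assumed)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest centroid (squared Euclidean distance)
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def priors_from_labels(labels, k):
    """Prior of population i = (objects assigned to i) / (all objects)."""
    counts = Counter(labels)
    return [counts.get(i, 0) / len(labels) for i in range(k)]

# Hypothetical one-dimensional data forming two well-separated groups
points = [[0.0], [10.0], [0.2], [0.1], [0.3], [10.1], [9.9]]
labels = kmeans_labels(points, k=2)
priors = priors_from_labels(labels, k=2)
print(priors)
```

In a supervised setting with known class labels, `priors_from_labels` can be applied directly to those labels; the clustering step matters when the group sizes are to be determined from the data.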

Step 3. Estimating the probability density function of each population to obtain $f_{1},f_{2},\ldots,f_{k}$ according to Eq. (1).

Step 4. Calculating the values of the $F_{i}^{j}$ index:

$\displaystyle F_{1}^{1}={p_{1}}{f_{1}}({I_{1}}),F_{2}^{1}={p_{2}}{f_{2}}({I_{1}}),\ldots,F_{k}^{1}={p_{k}}{f_{k}}({I_{1}})$

$\displaystyle F_{1}^{2}={p_{1}}{f_{1}}({I_{2}}),F_{2}^{2}={p_{2}}{f_{2}}({I_{2}}),\ldots,F_{k}^{2}={p_{k}}{f_{k}}({I_{2}})$

$\displaystyle\ldots$

$\displaystyle F_{1}^{N}={p_{1}}{f_{1}}({I_{N}}),F_{2}^{N}={p_{2}}{f_{2}}({I_{N}}),\ldots,F_{k}^{N}={p_{k}}{f_{k}}({I_{N}}).$

Step 5. Using the rule to classify based on Eq. (4).

$\displaystyle{C^{j}}=\arg\max\limits_{i=\overline{1,k}}F_{i}^{j},\quad j=\overline{1,N}.$ (4)

Step 6. Calculating the fuzzy relationships between each object and population as follows:

$\displaystyle R_{i}^{j}=\frac{F_{i}^{j}}{\sum_{l=1}^{k}{F_{l}^{j}}},j=\overline{1,N},i=\overline{1,k}.$ (5)
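
Steps 4 to 6 can be sketched together in Python. In this sketch the densities are passed in as callables and all objects are processed in one pass; the function names, the normal densities and the numerical values are hypothetical and stand in for the kernel estimates of Eq. (1).

```python
import math

def classify(objects, priors, pdfs):
    """Return the F matrix, the class labels (1-based) and the fuzzy matrix R."""
    # Step 4: F[j][i] = p_i * f_i(I_j) for every object j and population i
    F = [[p * f(obj) for p, f in zip(priors, pdfs)] for obj in objects]
    # Step 5: each object goes to the population with the largest F value
    labels = [max(range(len(row)), key=row.__getitem__) + 1 for row in F]
    # Step 6: normalize each row so the relationships of an object sum to one
    # (assumes at least one population gives the object a nonzero density)
    R = [[v / sum(row) for v in row] for row in F]
    return F, labels, R

def normal_pdf(mean, sd):
    """Hypothetical stand-in for an estimated population density."""
    return lambda x: math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

objects = [0.1, 4.8, 2.4]
priors = [0.5, 0.5]
pdfs = [normal_pdf(0.0, 1.0), normal_pdf(5.0, 1.0)]
F, labels, R = classify(objects, priors, pdfs)
print(labels)
```

Computing the whole matrix at once reflects the idea that many elements are classified at the same time; the rows of `R` then give, for every object, its degree of membership in each population.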

5. Some numerical examples

5.1 Example 1

In this section, we apply the proposed model to classify seven images described in the research of Phamtoan and Vovan (2021). The data comprise two populations, Flamingo and Zebra, with four and three images, respectively. These images are shown in Fig. 2.

Figure 2.

The images of two populations: Flamingo and Zebra.

Step 1. Firstly, we need to extract the features of these images. Using 16 features of the GLCM, we have Fig. 3.

Figure 3.

Descriptive data of the 16 features of the seven images.

Step 2. Computing the prior probability for each group, we have

$\displaystyle{p_{1}}=0.571;{p_{2}}=0.429.$

Step 3. Estimating the representative pdfs of the two populations, we obtain Fig. 4.

Figure 4.

The estimated pdf of two populations: Flamingo ( $f_{1}$ ) and Zebra ( $f_{2}$ ).

Step 4. Calculate the $F_{i}^{j}$ value for each image. We obtained the matrix as follows:

$\displaystyle F=\left[\begin{array}[]{cc}{F_{1}^{1}}&{F_{2}^{1}}\\ {F_{1}^{2}}&{F_{2}^{2}}\\ {F_{1}^{3}}&{F_{2}^{3}}\\ {F_{1}^{4}}&{F_{2}^{4}}\\ {F_{1}^{5}}&{F_{2}^{5}}\\ {F_{1}^{6}}&{F_{2}^{6}}\\ {F_{1}^{7}}&{F_{2}^{7}}\end{array}\right]=\left[\begin{array}[]{cc}{0.160}&{0.000}\\ {0.165}&{0.000}\\ {0.166}&{0.000}\\ {0.165}&{0.000}\\ {0.163}&{3.345}\\ {0.164}&{2.391}\\ {0.162}&{2.850}\end{array}\right].$

Step 5. Classifying the objects into the two populations, we have

$\displaystyle C=\left[\begin{array}[]{ccccccc}1&1&1&1&2&2&2\end{array}\right].$

It means that all images have been properly classified.

Step 6. Calculate the fuzzy relationship of each image to each population.

$\displaystyle R=\left[\begin{array}[]{cc}{1.000}&{0.000}\\ {1.000}&{0.000}\\ {1.000}&{0.000}\\ {1.000}&{0.000}\\ {0.046}&{0.954}\\ {0.064}&{0.936}\\ {0.054}&{0.946}\end{array}\right].$

Figure 5.

The fuzzy probability of 7 images to two populations.

The fuzzy relationships, given by $R$ and shown in Fig. 5, indicate that the classification result is good, because the probability of belonging to the correct population is high for every image.
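
As a quick check of Eq. (5) against the matrices above, the fifth row of $R$ can be recomputed from the fifth row of $F$, up to rounding:

$\displaystyle R_{1}^{5}=\frac{F_{1}^{5}}{F_{1}^{5}+F_{2}^{5}}=\frac{0.163}{0.163+3.345}\approx 0.046,\qquad R_{2}^{5}=\frac{3.345}{0.163+3.345}\approx 0.954.$

The two values sum to one, and the larger one corresponds to the second population, in agreement with the fifth entry of $C$.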

5.2 Example 2

This experiment was conducted on the Normalized Brodatz Texture (NBT) database. It consists of 50 images from 4 populations. Some sample images are given in Fig. 6.

Figure 6.

Some sample images of the NBT data set.

Extracting the 16 features of images according to the GLCM, and estimating the probability density function for 4 populations, we have 4 probability density functions shown in Fig. 7.

Performing the steps of the proposed model and comparing it with other methods, we obtain Table 2.

Table 2

The error rate of classifier methods and CPU time for NBT data

Method	Error	CPU time (s)
Naive Bayes	0.064	317
LDA	0.069	244
Fisher	0.082	422
SVM	0.045	642
Logistic regression	0.095	581
Proposed	0.021	195

Figure 7.

The estimated pdf of four populations for NBT data set.

Figure 8.

The estimated pdf of the two populations: forest fire and no forest fire.

Table 3

The error rate of classifier methods and CPU time for Algeria data

Method	Error	CPU time
Naive Bayes	0.033	127
LDA	0.049	141
Fisher	0.057	272
SVM	0.025	643
Logistic regression	0.049	481
Proposed	0.011	695

Figure 9.

The probability to belong to two populations of 244 elements.

Table 2 shows the performance of the different classifiers, including Naive Bayes, LDA, Fisher, logistic regression and SVM. The proposed classifier performs better than the others: it gives a classification error of 0.021 and a CPU time of 195 seconds, both the smallest among the compared methods.

5.3 Example 3

This application classifies 244 instances that combine data from two regions of Algeria: the Bejaia region, located in the northeast of Algeria, and the Sidi Bel-abbes region, located in the northwest of Algeria. The data set is available at https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset.

The data have ten variables and are divided into two classes: forest fire and no forest fire. The variables are described as follows: Date, Temp (Celsius degrees), RH, Ws, Rain (FWI), Fine Fuel Moisture Code (FFMC), Duff Moisture Code (DMC), Drought Code (DC), Initial Spread Index (ISI), Buildup Index (BUI), Fire Weather Index (FWI).

• The prior probability: ${p_{1}}=$ 0.516; ${p_{2}}=$ 0.484.
• The probability density functions of the two groups are shown in Fig. 8.

The classification results of the methods are shown in Table 3.

From Table 3, we see that the errors of the Fisher, LDA and logistic regression methods are relatively high, at about 0.05 (0.057, 0.049 and 0.049, respectively). Naive Bayes and SVM give lower errors, at 0.033 and 0.025, respectively. The proposed method gives the lowest error, 0.011. At the same time, the calculation time of the proposed method is also lower than that of the considered methods.

In addition, we obtain the fuzzy probability of belonging to each population, shown in Fig. 9.

Fig. 9 shows that the fuzzy relationship of each observation to the two populations is quite high. These values also reflect the classification errors of the model. For example, at the peak between the $150^{\text{th}}$ and $200^{\text{th}}$ observations in the first population, there is an observation that has an almost 80% probability of belonging to the first population, although it in fact belongs to the second group.

6. Conclusion

This study makes a positive contribution to the field of machine learning. The proposed method combines several improvements to the k-means and Bayesian classifier methods. Another significant contribution of the proposed model is its application to image classification. Moreover, the illustrative and applied examples show the soundness of the proposed model and its advantages over the existing ones. This research also addresses the computational issues that arise in applying the proposed model, through an established Matlab procedure, which shows the potential of this study in practical applications. In the future, we will apply the proposed model to practical problems in medicine and security.

References

1. Celebi & Alpkocak (2000). Clustering of texture features for content-based image retrieval. In International Conference on Advances in Information Systems, Springer, Berlin, 216-225.

2. Haralick, R. M., & Shapiro, L. G. (1992). Computer and Robot Vision (vol. 1, pp. 578-588). Addison-Wesley, Reading.

3. Hemanth, D. J., & Anitha (2019). Modified genetic algorithm approaches for classification of abnormal magnetic resonance brain tumour images. Applied Soft Computing, 75, 21-28.

4. Fisher, R. A. (1992). Statistical methods for research workers. In Breakthroughs in Statistics (pp. 60-70), Springer.

5. Murphy, K. P. (2006). Naive Bayes classifiers. University of British Columbia, 18(60), 1-8.

6. Thao, N. T., & Vovan (2017). A new approach for determining the prior probabilities in the classification problem by Bayesian method. Advances in Data Analysis and Classification, 11(3), 629-643.

7. Izenman, A. J. (2013). Linear discriminant analysis. In Modern Multivariate Statistical Techniques (pp. 237-280), Springer, New York.

8. Phamtoan & Vovan (2021). Automatic fuzzy genetic algorithm in clustering for images based on the extracted intervals. Multimedia Tools and Applications, 80(28), 35193-35215.

9. Phamtoan, Vovan, Phamchau, A. T., Thao, N. T., & Hokieu (2019). A new binary adaptive elitist differential evolution based automatic k-medoids clustering for probability density functions. Mathematical Problems in Engineering, 1-25.

10. Phamgia, Turkkan, & Vovan (2008). Statistical discrimination analysis using the maximum function. Communications in Statistics-Simulation and Computation, 37(2), 320-336.

11. Tanveer, Tiwari, Choudhary, & Jalan (2019). Sparse pinball twin support vector machines. Applied Soft Computing, 78, 164-175.

12. Vovan, Thao, N. T., & Ha, C. N. (2016). The prior probability in classifying two populations by Bayesian method. Applied Mathematics Engineering and Reliability, 6, 35-40.

13. Tang, Zhang, Y. Q., Chawla, N. V., & Krasser (2008). SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 39(1), 281-288.

14. Vovan, Phamtoan, & Tranthituy (2019). Automatic genetic algorithm in clustering for discrete elements. Communications in Statistics-Simulation and Computation, 1-16.

15. Vovan, Chengoc, & Thao, N. T. (2018). Textural features selection for image classification by Bayesian method. In 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, 733-139.

16. Vovan (2017). L1-distance and classification problem by Bayesian method. Journal of Applied Statistics, 44(3), 385-401.

17. Vovan, T., Phamtoan, & Thao, N. T. (2020). An automatic clustering for interval data using the genetic algorithm. Annals of Operations Research, 3, 1-22.

18. Hosmer Jr, D. W., Lemeshow, & Sturdivant, R. X. (2013). Applied logistic regression. John Wiley & Sons.

19. Zhao, Liu, Zheng, & Lyu (2019). A reliable method for colorectal cancer prediction based on feature selection and support vector machine. Medical & Biological Engineering & Computing, 57(4), 901-912.