Customer segmentation research in marketing through clustering algorithm analysis

Abstract

In marketing, customer segmentation is a critical element. This paper focuses on clustering algorithms for customer segmentation. First, the commonly used K-means algorithm is introduced; it is then optimized using an improved Lion Swarm Optimization (ILSO) algorithm, which supplies the initial clustering centers, and the Calinski-Harabasz (CH) index, which determines the number of clusters. Experiments on UCI datasets showed that the CH index yielded the correct number of clusters and that the ILSO-K-means algorithm achieved higher clustering accuracy than traditional K-means, exceeding 90% on every dataset. Then, the customers of an enterprise were divided into four groups using the ILSO-K-means algorithm, and different marketing suggestions were given for each group. The experimental analysis supports the usability of the ILSO-K-means algorithm in customer segmentation, and the algorithm can be further applied in practice.

Keywords

Clustering algorithm, marketing, customer segmentation, lion swarm optimization algorithm, marketing methods

1 Introduction

With the rapid growth of the economy, marketing has become increasingly refined, and the demand for customer information has grown further [1]. To avoid losing customers [2], enterprises usually adopt precision marketing, personalized marketing, and similar methods, i.e., they achieve low-cost, efficient marketing through full communication with customers. Most current marketing is based on customer segmentation; therefore, research on customer segmentation has become a key element of business management and marketing [3].

Chen et al. [4] proposed an algorithm called PurTreeClust to analyze customer transaction datasets and verified the feasibility of the method by analyzing ten real transaction datasets. Motevali et al. [5] designed a wildebeest herd optimization algorithm to segment bank customers in four aspects (profitability, loyalty, cost, and creditworthiness) and proved the usability of the method. Barman et al. [6] proposed a customer segmentation method based on self-organizing mapping and minimum spanning tree and proved the accuracy of the method through evaluations on several datasets. Bhatnagar et al. [7] performed customer segmentation from the perspective of customer reviews through a long short-term memory (LSTM) model and found through experiments that the accuracy of the model reached 90.9%.

Clustering is a common approach in customer segmentation, and the optimization and improvement of clustering algorithms have been widely studied. Ping et al. [8] applied the K-means algorithm to order batch optimization, improved it by using the seed sorting idea, and obtained higher picking efficiency. Thamer et al. [9] improved the kernel K-means clustering algorithm using a pigeon optimization algorithm and experimentally found that the method achieved the best computation time. For text clustering, Gopal et al. [10] used the whale optimization algorithm to select the fuzzy C-means clustering centers and verified the performance of the algorithm on three datasets. Surono et al. [11] combined Minkowski distance and Chebyshev distance to optimize the fuzzy C-means clustering algorithm and obtained a high clustering accuracy.

This paper studies the K-means algorithm, improves it, and applies it to customer segmentation. Moreover, some marketing suggestions are given for different types of customers. This work provides a theoretical basis for the marketing of enterprises.

2 Marketing and Customer Segmentation

With the development of the market, precision marketing has received more and more attention [12]. In precision marketing, an enterprise develops precise marketing strategies on the basis of correct customer segmentation, and every step from content production to content delivery is personalized.

According to the concept of customer relationship management (CRM) [13], it is crucial for companies to understand the needs of customers [14]. Inadequate customer maintenance, distant customer relationships, and failure to respond promptly to customer needs all reduce customer satisfaction with the company and lead to the loss of customers. Companies therefore need to win new customers, retain existing customers, and raise customer value, and this process requires customer segmentation [15].

Customer segmentation can provide better services to target customers [16], and its purpose is to expand and maintain customers with high value for companies to guide their decision making [17] and increase business revenue [18]. Competition in the market is, in essence, a competition for customer resources, and for companies, maintaining good relationships with customers is a key focus of marketing.

For customer segmentation to be effective, the following principles need to be followed: (1) the purpose of segmentation should be clarified, and suitable indicators should be used for segmentation; (2) the results of segmentation should be differentiable; (3) the segmentation plan should be adjusted at any time according to changes in the market. With the development of data mining technology [19], factor analysis and neural networks are now very widely used in customer segmentation [20].

3 Customer Segmentation Method Based on Clustering Algorithm

3.1 K-means algorithm

Clustering algorithms classify data according to sample similarity and have a very wide range of applications in fields such as economics and biology [21]. Both conventional (hard) clustering algorithms and fuzzy clustering algorithms can divide the objects in a dataset into different groups; however, fuzzy clustering algorithms are more often used for data with ambiguity or uncertainty, such as in text classification and image segmentation, while conventional clustering algorithms are mostly used for exploratory data analysis. Therefore, conventional clustering algorithms are used for the customer segmentation problem studied in this paper.

The K-means algorithm is the most widely used one [22], and its basic steps are as follows.

(1) The number of clusters (k) is determined, and k samples are randomly selected as the initial central points.

(2) The distance from every sample to every central point is calculated: d_ij = ∥x_i - u_j∥, where x_i is the i-th sample and u_j is the j-th central point.

(3) Every sample is assigned to its closest cluster.

(4) The clustering centers are recalculated, and steps (2) to (4) are repeated until every central point becomes stable. Finally, the result of clustering is obtained.
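The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in the experiments; the function name `kmeans`, the optional `init_centers` argument, and the toy data are assumptions introduced for the example.

```python
import math
import random


def kmeans(samples, k, init_centers=None, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k samples at random as initial centers unless some are given.
    centers = [tuple(c) for c in (init_centers or rng.sample(samples, k))]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Steps 2-3: assign every sample to its closest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(x, centers[j]))
            for x in samples
        ]
        # Step 4: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # every center is stable: stop
            break
        centers = new_centers
    return centers, labels


# Toy data (hypothetical): two well-separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(points, k=2, init_centers=[points[0], points[3]])
print(labels)
```

Passing `init_centers` explicitly makes the dependence on the starting centers visible, which is the weakness addressed in Section 3.2.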

The K-means algorithm has a relatively simple structure and high computation speed, and it copes well with the large amount of customer data generated in marketing; however, it also has obvious disadvantages [23]: (1) it is sensitive to the initial clustering centers; (2) the k value is mostly chosen empirically, which may lead to locally optimal solutions.

3.2 Initial clustering center determination method

In this paper, an improved Lion Swarm Optimization (ILSO) algorithm is used to obtain the initial clustering centers of the K-means algorithm. The LSO algorithm simulates the behavior of a lion swarm [24]. It is assumed that, in a D-dimensional search space, the number of lions is N and the number of adult lions is nLeader, with $2 ⩽ nLeader ⩽ \frac{N}{2}$. The position of the i-th lion is written as x_i = (x_i1, x_i2, ⋯ , x_iD), and the number of adult lions is nLeader = [Nβ], where β is the scale factor.

In the process of hunting, the update formula of the lion king’s position is: $x_{i}^{k + 1} = g^{k} (1 + γ ∥ p_{i}^{k} - g^{k} ∥)$ (1) where g^k is the optimal position of the swarm in the k-th generation, $p_{i}^{k}$ is the optimal position of the i-th individual in the k-th generation, and γ is a random number in (0,1).

The equation for updating the position of the lioness is: $x_{i}^{k + 1} = \frac{p_{i}^{k} + p_{c}^{k}}{2} (1 + α_{f} γ)$ (2) where $p_{c}^{k}$ is the historical optimal position of the lioness collaborator, α_f is the perturbation factor, $α_{f} = step \times \exp {(- \frac{30 t}{T})}^{10}$ , t and T are the current number of iterations and the maximum number of iterations, and step is the maximum moving step length.

The formula for updating the position of lion cubs is: $x_{i}^{k + 1} = \begin{cases} \frac{p_{i}^{k} + g^{k}}{2} (1 + α_{c} γ), & 0 < q < \frac{1}{3} \\ \frac{p_{i}^{k} + p_{m}^{k}}{2} (1 + α_{c} γ), & \frac{1}{3} ⩽ q < \frac{2}{3} \\ \frac{p_{i}^{k} + {\bar{g}}^{k}}{2} (1 + α_{c} γ), & \frac{2}{3} ⩽ q < 1 \end{cases}$ (3) ${\bar{g}}^{k} = \bar{low} + \bar{high} - g^{k}$ (4) $α_{c} = step (\frac{T - t}{T})$ (5) where $p_{m}^{k}$ is the optimal position for the k-th generation of the lion cubs to follow the lioness, ${\bar{g}}^{k}$ is the position to which the cubs are driven, $\bar{low}$ is the mean of the minimum value in the lion activity range, $\bar{high}$ is the mean of the maximum value in the lion activity range, α_c is the perturbation factor, and q is the probability factor in [0,1].

To further improve the search capability of the algorithm, the initialization of the population positions is improved. The position of the lion swarm is initialized based on the sine chaotic mapping sequence: $x_{n + 1} = \sin (\frac{2}{x_{n}})$, with $x_{n} \in (- 1, 1)$ and $x_{n} \neq 0$, thus speeding up the convergence of LSO. The flow of the ILSO algorithm is as follows.

(1) The position of the lion swarm is initialized using the sine chaotic mapping.

(2) The numbers of lion kings, lionesses, and lion cubs are calculated, and the position of the lion king is regarded as the initial clustering center of the K-means algorithm.

(3) The positions of the lion king, lionesses, and cubs are updated.

(4) Whether the termination condition has been reached is determined. If not, the algorithm returns to step (3); if so, it goes to the next step.

(5) The optimal solution, i.e., the initial clustering center, is obtained.
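The chaotic initialization and the position updates of Eqs. (1) to (5) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the linear rescaling of the chaotic sequence from (-1, 1) onto the search range is an assumption (the text does not say how the sequence is mapped to positions), the bounds `low` and `high` are treated as scalars, and the random quantities γ and q are passed in by the caller.

```python
import math


def sin_chaos(x0, n):
    """Return n values of the map x_{m+1} = sin(2 / x_m), starting from x0."""
    values, x = [], x0
    for _ in range(n):
        x = math.sin(2.0 / x)
        if x == 0.0:  # the map is undefined at 0, so nudge away from it
            x = 1e-6
        values.append(x)
    return values


def init_swarm(n_lions, dim, low, high, x0=0.7):
    """Chaotic initial positions; rescaling (-1, 1) to [low, high] is an assumption."""
    seq = sin_chaos(x0, n_lions * dim)
    scaled = [low + (v + 1.0) / 2.0 * (high - low) for v in seq]
    return [scaled[i * dim:(i + 1) * dim] for i in range(n_lions)]


def update_king(g, p_i, gamma):
    """Eq. (1): x = g * (1 + gamma * ||p_i - g||)."""
    return [gj * (1.0 + gamma * math.dist(p_i, g)) for gj in g]


def update_lioness(p_i, p_c, gamma, t, T, step):
    """Eq. (2), with alpha_f = step * exp(-30 t / T) ** 10."""
    alpha_f = step * math.exp(-30.0 * t / T) ** 10
    return [(a + b) / 2.0 * (1.0 + alpha_f * gamma) for a, b in zip(p_i, p_c)]


def update_cub(p_i, g, p_m, gamma, q, t, T, step, low, high):
    """Eqs. (3)-(5): follow the king, follow the lioness, or be driven away."""
    alpha_c = step * (T - t) / T  # Eq. (5)
    if q < 1.0 / 3.0:
        target = g
    elif q < 2.0 / 3.0:
        target = p_m
    else:
        target = [low + high - gj for gj in g]  # Eq. (4): driven-away position
    return [(a + b) / 2.0 * (1.0 + alpha_c * gamma) for a, b in zip(p_i, target)]


swarm = init_swarm(n_lions=6, dim=2, low=0.0, high=10.0)
print(swarm[0])
```

In a full optimizer these updates would be applied once per generation inside a loop that also evaluates a fitness function and refreshes the best positions; the fitness function is not specified in the text and is therefore left out here.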

3.3 Optimal K value determination method

For the selection of the number of clusters, the Calinski-Harabasz (CH) index [25] is used to determine the optimal K value, and the corresponding calculation formula is: $CH = \frac{tr (B_{k})}{tr (W_{k})} \times \frac{N - K}{K - 1}$ (6) where N is the number of samples, K is the number of clusters, tr (B_k) is the trace of the intergroup dispersion matrix, and tr (W_k) is the trace of the intra-cluster dispersion matrix W_k: $W_{k} = \sum_{q = 1}^{k} \sum_{x \in c_{q}} (x - c_{q}) {(x - c_{q})}^{T}$ (7)

B_k is the intergroup dispersion matrix: $B_{k} = \sum_{q} n_{q} (c_{q} - c) {(c_{q} - c)}^{T}$ (8) where n_q is the number of samples in cluster q, c_q is the center of cluster q, and c denotes the average center of all cluster centers. The CH index is calculated for each candidate number of clusters, and the number of clusters (k) with the maximum CH value is used as the optimal K value of the K-means algorithm.
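Because only the traces of W_k and B_k are needed, Eq. (6) reduces to sums of squared distances. The sketch below computes the index this way; the function name `ch_index` is hypothetical, and c is taken as the mean of all samples, which equals the average of the cluster centers weighted by cluster size (an assumption about how the average center is formed).

```python
def ch_index(samples, labels):
    """Calinski-Harabasz index from traces of the dispersion matrices."""
    n = len(samples)
    clusters = sorted(set(labels))
    k = len(clusters)
    # Overall center c: mean of all samples.
    overall = [sum(col) / n for col in zip(*samples)]
    tr_w = 0.0  # trace of W_k: squared distances of samples to their own center
    tr_b = 0.0  # trace of B_k: size-weighted squared distances of centers to c
    for q in clusters:
        members = [x for x, lab in zip(samples, labels) if lab == q]
        center = [sum(col) / len(members) for col in zip(*members)]
        tr_w += sum(
            sum((xi - ci) ** 2 for xi, ci in zip(x, center)) for x in members
        )
        tr_b += len(members) * sum(
            (ci - oi) ** 2 for ci, oi in zip(center, overall)
        )
    return (tr_b / tr_w) * (n - k) / (k - 1)


# Toy data (hypothetical): a compact partition scores far higher than a mixed one.
data = [(0.0,), (2.0,), (10.0,), (12.0,)]
print(ch_index(data, [0, 0, 1, 1]))  # good partition
print(ch_index(data, [0, 1, 0, 1]))  # poor partition
```

To select K, the index would be evaluated on the clustering obtained for each candidate K, and the K with the largest value kept.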

The flow of the clustering-based customer segmentation method designed in this paper is shown in Fig. 1.

Fig. 1

The clustering-based customer segmentation method.

As shown in Fig. 1, the initial clustering centers are first obtained by the ILSO algorithm, and the optimal number of clusters is determined according to the CH index. These parameters are input into the K-means algorithm, which calculates the distance from each sample to every clustering center, assigns each sample to the closest cluster, and recalculates the clustering centers until the algorithm converges. The final clustering result, which is the result of customer segmentation, is then output.

4 Results and Analysis

4.1 Segmentation indicator selection

For customer segmentation in marketing, this paper presents a case study of an online maternal and infant products enterprise. The enterprise’s products include baby bottles, pacifiers, tableware, toys, etc. With the development of the market, however, the enterprise faces increasingly fierce competition, customer loss has become serious, and product sales remain low. Therefore, customer segmentation was used to understand the customer situation and to support the development of the enterprise’s future marketing plan. The selection of customer segmentation indicators is based on the RFM model [26]. The RFM model consists of three main indicators.

R (recency) refers to a customer’s most recent consumption. If the number of customers that consume recently is gradually increasing, it indicates that the company’s recent marketing strategy has obtained good results, and the trend of development is good.

F (frequency) refers to the consumption frequency of customers over a period of time. In marketing, the more frequent the consumption, the higher the business revenue brought by the customer to the company, i.e., the higher the value of the customer.

M (monetary) refers to the total spending of customers over a period of time. The higher the spending of customers, the higher the revenue for the company, and when there are limitations in resources and costs for the company, targeting high-spending customers for marketing can lead to greater revenue.

The traditional RFM model has two problems in customer analysis. First, it ignores the customer’s experience of the company. Second, the F and M indicators are collinear, because a customer who consumes more frequently also tends to have a higher total consumption amount. To address these two problems, the following optimizations were made.

Customer satisfaction indicator S was used to describe the customer’s sense of experience in the process of consumption. It is generally believed that the higher the customer’s satisfaction with the product and company, the stronger the inclination to continue to buy, and it will affect the decision of other consumers in the wait-and-see state.

The average customer spending (M/F) was used to replace the total spending to better reflect the customer’s spending power.

According to the RFMS model obtained above, the order data of the enterprise from March 1, 2022 to December 31, 2022 were collected. The statistics of different indicators are as follows.

R: the interval between the time of the latest consumption of the customer and the statistical cut-off time (December 31, 2022).

F: frequency of consumption of the customer from March 1, 2022 to December 31, 2022.

M: the average customer spending from March 1, 2022 to December 31, 2022.

S: the average value of the scores of all the orders of the customer from March 1, 2022 to December 31, 2022, and the range of the score was 0–5.
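The four indicators can be derived from raw order records as sketched below. The record layout `(customer_id, order_date, amount, score)` and the sample orders are hypothetical; only the indicator definitions and the cut-off date come from the text.

```python
from collections import defaultdict
from datetime import date

CUTOFF = date(2022, 12, 31)  # statistical cut-off time from the text

# Hypothetical order records: (customer_id, order_date, amount_yuan, score_0_to_5)
orders = [
    (1, date(2022, 3, 5), 120.0, 4.0),
    (1, date(2022, 12, 21), 80.0, 5.0),
    (2, date(2022, 6, 30), 300.0, 3.0),
]


def rfms(orders, cutoff=CUTOFF):
    """Return {customer_id: (R, F, M, S)} following the RFMS definitions."""
    by_customer = defaultdict(list)
    for cid, day, amount, score in orders:
        by_customer[cid].append((day, amount, score))
    table = {}
    for cid, rows in by_customer.items():
        f = len(rows)                                   # F: number of orders
        r = (cutoff - max(d for d, _, _ in rows)).days  # R: days since last order
        m = sum(a for _, a, _ in rows) / f              # M: average spending (total / F)
        s = sum(sc for _, _, sc in rows) / f            # S: average order score
        table[cid] = (r, f, m, s)
    return table


print(rfms(orders))
```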

The collected sample data were sorted. After abnormal and incomplete values were eliminated, records for a total of 41,258 customers were obtained, and some of the data are shown in Table 1.

Table 1
Some customer data

Customer number	R (days)	F (times)	M (yuan)	S (score, 0–5)
1	93	57	17821	4.6
2	42	55	1641	4.5
3	71	26	215	4.7
4	121	39	12	3.9
5	56	10	1580	4.5
6	20	21	64	4.7
7	116	55	591	4.8
8	32	45	1481	4.9
9	5	64	218	4.5
10	9	12	10	4.3

Table 1 shows that the value ranges of the indicators differed widely, so the data needed to be normalized. The smaller the R value, the higher the value of the customer, so R was normalized as: $X^{'} = \frac{X_{\max} - X}{X_{\max} - X_{\min}}$ (9)

The larger the values of F, M, and S, the higher the customer value, so these indicators were normalized as: $X^{'} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$ (10) where X is the original value, X′ is the normalized value, and X_max and X_min are the maximum and minimum values of the indicator.
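Eqs. (9) and (10) can be applied to the R and F columns of Table 1 as follows. The helper name `min_max` and its `reverse` flag are hypothetical; in practice X_max and X_min would be taken over all 41,258 customers rather than the ten rows shown.

```python
def min_max(values, reverse=False):
    """Min-max normalization; reverse=True applies Eq. (9), otherwise Eq. (10)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0] * len(values)
    if reverse:
        return [(hi - v) / (hi - lo) for v in values]
    return [(v - lo) / (hi - lo) for v in values]


# R and F columns of the ten customers listed in Table 1.
r_values = [93, 42, 71, 121, 56, 20, 116, 32, 5, 9]
f_values = [57, 55, 26, 39, 10, 21, 55, 45, 64, 12]

r_norm = min_max(r_values, reverse=True)  # smaller R gives a larger normalized value
f_norm = min_max(f_values)                # larger F gives a larger normalized value
print([round(v, 3) for v in r_norm])
```

After this step every indicator lies in [0, 1] and points in the same direction, so a larger normalized value always means a more valuable customer.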

4.2 Analysis of clustering algorithms

First, five datasets were selected from UCI for analysis [27], as shown in Table 2.

Table 2
UCI dataset

Dataset	Number of features	Number of categories	Number of samples
Iris	4	3	150
Wine	13	3	178
Glass	9	6	214
Cancer	9	2	683
Vowel	3	6	871

The optimal K value determined using the CH index was compared with the actual number of categories in each dataset, and the results are shown in Fig. 2.

Fig. 2

Analysis of the determination of the optimal number of clusters.

Fig. 2 shows that the optimal K value determined using the CH index exactly matched the actual number of categories in every dataset, indicating that the optimal number of clusters can be obtained through the CH index.

Then, the clustering accuracy of the traditional K-means algorithm and the ILSO-improved K-means algorithm was compared on the UCI datasets. Each algorithm was run ten times, and the average value was taken. The results are shown in Fig. 3.

Fig. 3

Accuracy analysis of clustering algorithms.

Fig. 3 shows that the accuracy of the traditional K-means algorithm was below 90% on all five datasets; on Wine, it was only 56.44%. After the ILSO algorithm was used to optimize the initial clustering centers, the accuracy exceeded 90% on all the datasets and reached 96.54% on Wine, 40.1 percentage points higher than that of the traditional K-means algorithm. This supports the reliability of the ILSO-K-means algorithm, which could therefore be used in customer segmentation.
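The text does not state how cluster labels were matched to the true categories when accuracy was computed. One common choice, shown here purely as an assumption, is to try every one-to-one mapping from clusters to categories and keep the best score; the function name `clustering_accuracy` and the toy labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels):
    """Best-match accuracy over one-to-one mappings from clusters to classes.

    Assumes the number of clusters does not exceed the number of classes,
    and is only practical for a small number of classes.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    best_hits = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(
            mapping[c] == t for t, c in zip(true_labels, cluster_labels)
        )
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Toy labels (hypothetical): cluster ids are arbitrary, one sample is misplaced.
truth = [0, 0, 1, 1, 2, 2]
found = [2, 2, 0, 0, 1, 0]
print(clustering_accuracy(truth, found))
```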

4.3 Customer segmentation results

The customer data of this enterprise were segmented using the ILSO-K-means algorithm. The number of clusters was determined as 4 by the CH index, and the specific results are shown in Table 3.

Table 3
Customer clustering results

	Customer group 1	Customer group 2	Customer group 3	Customer group 4
Number (percentage)	19658 (47.64%)	12879 (31.22%)	3056 (7.41%)	5665 (13.73%)
Average R value	32.29	54.66	21.08	153.94
Average F value	10.97	5.46	11.22	3.57
Average M value	4521.36	2856.33	13025.77	1429.61
Average S value	4.33	4.89	4.87	3.82
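A per-group summary of the kind reported in Table 3 can be produced from the clustering output as sketched below. The function name `summarize` is hypothetical, and the cluster labels in the example are invented for illustration; they are not the labels obtained in the experiment.

```python
def summarize(rows, labels):
    """Count, share, and mean (R, F, M, S) for each cluster."""
    total = len(rows)
    summary = {}
    for q in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == q]
        n = len(members)
        means = tuple(sum(col) / n for col in zip(*members))
        summary[q] = {
            "count": n,
            "percent": 100.0 * n / total,
            "mean_rfms": means,
        }
    return summary


# First four customers of Table 1 as (R, F, M, S); the labels are hypothetical.
rows = [(93, 57, 17821, 4.6), (42, 55, 1641, 4.5), (71, 26, 215, 4.7), (121, 39, 12, 3.9)]
labels = [0, 0, 1, 1]
for group, stats in summarize(rows, labels).items():
    print(group, stats)
```

The means are computed on the original, unnormalized indicators so that they can be read directly in days, times, yuan, and score points.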

Table 3 shows that customer group 1 was the largest group, accounting for 47.64% of customers. These customers had a short time since their latest consumption, a high consumption frequency, a relatively high average spending (4,521.36 yuan), and an average satisfaction score of 4.33, indicating that they were accustomed to and happy to spend money at the enterprise. Based on the analysis, such customers were considered the retained customers and the main consumer group of the enterprise.

In the marketing process, the enterprise should improve the loyalty and stickiness of such customers through a membership system, points exchange, and similar methods to create greater value for the enterprise. At the same time, it should further improve their satisfaction through birthday wishes, holiday greetings, and similar gestures to maintain a good reputation among such customers, which is conducive to increasing enterprise revenue.

Customer group 2 accounted for a large share of the four customer groups, 31.22%. For these customers, somewhat more time had passed since the last consumption, the consumption frequency was not particularly high, and the average spending was 2,856.33 yuan, but their satisfaction with the enterprise was high (4.89). These customers may be new customers who could become stable customers in the future. Based on the analysis, such customers were considered potential customers of the enterprise.

In the marketing process, the enterprise should attract such customers to consume through coupons and discount activities to further explore their value. At the same time, by surveying such customers, it can learn which products are popular among new customers, which can provide a reference for subsequent production.

Although customer group 3 accounted for the smallest proportion, only 7.41%, these customers had consumed recently and had a high consumption frequency. During this period, compared with the other customer groups, they had the highest average spending (13,025.77 yuan) and a high average satisfaction value (4.87). Based on the analysis, these customers were considered the high-value customers of the enterprise: they are highly satisfied with the enterprise and can bring it substantial revenue.

In the marketing process, for such customers, the enterprise should do a good job in customer management, make more efforts in marketing, try to avoid the loss of value customers, understand customer needs through various ways, provide them with high-quality services, and recommend new products as a priority to create more revenue for the enterprise.

Customer group 4 accounted for a small percentage, 13.73%. Specifically, the time since their last consumption was the longest (153.94 days), their consumption frequency was the lowest (3.57 times), and their average spending was low, only 1,429.61 yuan. Their satisfaction score was also the lowest (3.82), indicating dissatisfaction with the enterprise and its products, so there was a risk of losing them. Based on the analysis, such customers were considered customers that the enterprise needed to retain.

To retain such customers, the enterprise can investigate the reasons for their low consumption and poor satisfaction. Where appropriate, the marketing investment can then be kept low and limited to measures such as regular promotions and consumer coupons. At the same time, it is also necessary to segment such customers in more depth, retaining those with a certain value and abandoning those with low value.

5 Conclusion

This paper studied customer segmentation through clustering algorithms, improved the K-means algorithm to address its defects, and took the customer data of an enterprise as an example. The experiments showed that the improved K-means algorithm was reliable: it obtained the correct number of clusters and achieved higher accuracy. The enterprise’s customers were divided into four groups, and different marketing suggestions were made for the different groups, which provides a reference for the future development of the enterprise.

References

1. Mueller J.M., Pommeranz, Weisser and Voigt, Digital, Social Media, and Mobile Marketing in industrial buying: Still in need of customer segmentation? Empirical evidence from Poland and Germany, Industrial Marketing Management 73 (2018), 70–83.

2. Abdurrahman, C. Agarwal and Lokesh K.R., Architecture for Evaluating Customer Retention Strategies, ECS Transactions 107 (2022).

3. Nakano and Kondo F.N., Customer segmentation with purchase channels and media touchpoints using single source panel data, Journal of Retailing and Consumer Services 41 (2018), 142–152.

4. Chen, Fang, Yang, Nie, Zhao and Huang J.Z., PurTreeClust: A Clustering Algorithm for Customer Segmentation from Massive Customer Transaction Data, IEEE Transactions on Knowledge and Data Engineering 30 (2018), 559–572.

5. Motevali M.M., Shanghooshabad A.M., Aram R.Z. and Keshavarz, WHO: A New Evolutionary Algorithm Bio-Inspired by Wildebeests with a Case Study on Bank Customer Segmentation, International Journal of Pattern Recognition & Artificial Intelligence 33 (2019), 1–32.

6. Barman and Chowdhury, A Novel Approach for the Customer Segmentation Using Clustering Through Self-Organizing Map, International Journal of Business Analytics (IJBAN) 6 (2019), 23–45.

7. Bhatnagar and Bhatia, A Sentiment Analysis Based Approach for Customer Segmentation, Recent Patents on Engineering 16 (2022), 32–42.

8. Ping and Zhou, Order Batch Optimization Strategy Based on Improved K-Means Clustering Algorithm, 2020 2nd International Conference on Information Technology and Computer Application (ITCA) (2020), 217–221.

9. Thamer M.K., Algamal Z.Y. and Zine, Enhancement of Kernel Clustering Based on Pigeon Optimization Algorithm, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 31 (2023), 121–133.

10. Gopal and Brunda, Text Clustering Algorithm Using Fuzzy Whale Optimization Algorithm, International Journal of Intelligent Engineering and Systems 12 (2019), 278–286.

11. Surono and Putri, Optimization of Fuzzy C-Means Clustering Algorithm with Combination of Minkowski and Chebyshev Distance Using Principal Component Analysis, International Journal of Fuzzy Systems 23 (2021), 139–144.

12. Liu, E-Commerce Precision Marketing Model Based on Convolutional Neural Network, Scientific Programming 2022 (2022), 1–11.

13. Zerbino, Aloini, Dulmin and Mininno, Big Data-enabled Customer Relationship Management: A holistic approach, Information Processing & Management 54 (2018), 818–846.

14. Bezerra, Souza E.M.D. and Correia A.R., Passenger Expectations and Airport Service Quality: Exploring Customer Segmentation, Transportation Research Record 2675 (2021), 604–615.

15. Saini, Sharma, Sarangi P.K., Singh and Rani, Customer Segmentation using K-Means Clustering, 2022 10th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO) (2022), 1–5.

16. Othayoth S.P. and Muthalagu, Customer segmentation using various machine learning techniques, International Journal of Business Intelligence and Data Mining 20 (2022), 480–496.

17. Wong and Yan, Customer online shopping experience data analytics: Integrated customer segmentation and customised services prediction model, International Journal of Retail & Distribution Management 46 (2018), 406–420.

18. Maulina N.R., Surjandari and Rus A.M.M., Data Mining Approach for Customer Segmentation in B2B Settings using Centroid-Based Clustering, 2019 16th International Conference on Service Systems and Service Management (ICSSSM) (2019), 1–6.

19. Silva, Varela, López L.A.B. and Millán R.H.R., Association Rules Extraction for Customer Segmentation in the SMEs Sector Using the Apriori Algorithm, Procedia Computer Science 151 (2019), 1207–1212.

20. Fitriani M.A. and Febrianto D.C., Data Mining for Potential Customer Segmentation in the Marketing Bank Dataset, JUITA Jurnal Informatika 9 (2021), 25.

21. Pritpal, Marcin, Anna, Magdalena, Koryna, Tadeusz, Barbara S.W. and Paweł, Analysis of fMRI Signals from Working Memory Tasks and Resting-State of Brain: Neutrosophic-Entropy-Based Clustering Algorithm, International Journal of Neural Systems 32 (2022), 1–20.

22. Bigdeli, Maghsoudi and Ghezelbash, Application of self-organizing map (SOM) and K-means clustering algorithms for portraying geochemical anomaly patterns in Moalleman district, NE Iran, Journal of Geochemical Exploration: Journal of the Association of Exploration Geochemists (2022), 233.

23. Huang and Cheng, Optimization of K-means Algorithm Base on Map Reduce, Journal of Physics: Conference Series 1881 (2021), 1–12.

24. Zhang and Jiang, Parallel discrete lion swarm optimization algorithm for solving traveling salesman problem, Journal of Systems Engineering and Electronics 31 (2020), 751–760.

25. Lima S.P. and Cruz M.D., A genetic algorithm using Calinski-Harabasz index for automatic clustering problem, Revista Brasileira de Computação Aplicada 12 (2020), 97–106.

26. Ernawati, Baharin and Kasmin, A review of data mining methods in RFM-based customer segmentation, Journal of Physics: Conference Series 1869 (2021), 1–8.

27. http://archive.ics.uci.edu/ml/index.php.