Geospatial Insights for Retail Recommendation Using Similarity Measures

Abstract

Recommending a retail business for a particular location of interest is nontrivial. Such a recommendation requires careful study of demographics, trade area characteristics, sales performance, traffic, and environmental features. The process is not only taxing in human effort but also often introduces inconsistency because expert opinions are subjective. It becomes more challenging when no sales data are available to support a recommendation. To overcome these challenges, this study used a machine learning approach that relies on similarity measures to perform the recommendation. Two challenges required careful attention: (1) how to prepare a feature set that can commonly represent different types of retail business and (2) which similarity measure approach produces the best recommendation accuracy. The data sets used in this study consist of points of interest, population, property, job type, and education level. Empirical studies were conducted to investigate (1) the overall accuracy of the proposed similarity measure approaches to retail business recommendation and (2) whether the proposed approaches are biased toward certain retail categories. In summary, the findings suggested that the proposed similarity-based techniques elicited an accuracy of above 70% and demonstrated higher accuracy when the recommendation was made within a set of similar retail businesses.

Introduction

Geospatial analytics has widely been used to address site selection challenges in retail businesses,^1–13 mainly for the following reasons: (1) to determine the next site for business expansion, (2) to estimate the monthly sales, (3) to investigate the coexistence between two or more business types, and (4) to reallocate existing nonprofitable retail outlets. A good retail site has always been the key to a store's success because it attracts consumers by offering them easy accessibility to products or services, which significantly influences market share and profitability.^3,14–16 To optimize business site selection outcomes, location theorists have proposed three methods, namely analogue, regression, and gravity.¹⁷ In addition to the theoretical approaches, retailers have also applied the Geographic Information System (GIS) that provides a visual way to analyze different sources of data on a map. To ease the process of site selection, researchers have also attempted the Analytic Hierarchy Process that integrates human knowledge into automatic decision-making.^5,14,18–22 The above literature has shown that the existing works on retail site selection would either require active intervention from human experts or the existence of sales data, before a recommendation could be made.

While site selection for retail business is itself a difficult problem, recommending the right retail business for a given location presents a greater challenge. The challenge stems largely from the absence of any formula for determining a suitable retail business given a location. In addition, sales data are difficult to acquire for every retail business because of privacy and confidentiality issues. Sales data, however, are an important criterion for inferring the suitability of a retail business at a particular location. Without sales data, no such inference can be performed, and other estimation methods may carry unmanageable risk. In this light, this study proposed an approach in which the recommendation of a retail business does not require sales data of any retail business. That is, the suitability of a retail business at a particular location can be perceived as a function of location characteristics, including demographic information, surrounding retail businesses, traffic flow, and road characteristics (e.g., facing main road junctions). The research work had the following two main objectives:

(1) To construct an analytics data set from different raw data sets suitable for recommending retail businesses.

(2) To propose algorithms for retail business recommendation through the computation of similarity scores between locations.

In this research work, three fundamental assumptions formed the basis for analytics work discussed in the Proposed Methods for Retail Recommendation section.

Assumption 1.1. A retail business is considered sustainable at a location if it has been operating there for at least a predefined threshold of n months. It can then be inferred that the location suits that particular retail business. Retail businesses that have not been in operation for n months are likely to be terminated or relocated.

Assumption 1.2. A location has a good match for a specific retail business if that location matches most of the geospatial characteristics of other outlets of the same retail business.

Assumption 1.3. Recommendation of retail business does not apply to new trade areas. A trade area is deemed new when it shows insufficient required information (e.g., lack of surrounding retail businesses) as the input for making recommendations.

The following sections discuss the techniques proposed in this research work to address these challenges, taking the above assumptions as a basis for the predictive modeling work. The discussion covers the preparation of the analytics data set and the identification of appropriate similarity measures for similarity scoring. Because the design and development of the retail recommendation system rely on similarity measures, the next section first provides an overview of similarity measure techniques commonly applied by researchers, before the detailed design of the recommendation system is discussed in the Proposed Methods for Retail Recommendation section.

Similarity and Distance Measures

A variety of studies have compared similarity measure techniques in different domains and knowledge areas. The main objective of a similarity measure is to determine the likeness or dissimilarity among a given set of objects (items). For continuous data, Shirkhorshidi et al.²³ conducted a comparison study to investigate the behavior of different similarity or distance measures on low- and high-dimensional data. They benchmarked 12 distance measures on 15 publicly available data sets: Euclidean distance, average distance, chord distance, cosine measure, Mahalanobis distance, Manhattan distance, mean character difference, index of association, Canberra metric, Czekanowski coefficient, coefficient of divergence, and Pearson coefficient. The research concluded overall that the average distance is the most accurate and fastest distance measure across all the clustering algorithms tested. For categorical data, in contrast, Boriah et al.²⁴ carried out a comparison study in which they reviewed, compared, and benchmarked binary-based similarity measures on categorical data. For a specific knowledge area, for instance, genetic interaction data sets, Deshpande et al.²⁵ concluded that the dot product is consistently among the best measures in different circumstances. In another study, Kanza et al.²⁶ discussed four traditional distance measures, namely, Hausdorff distance, center of mass distance, link distance/Earth Mover's distance, and nearest-neighbor distance, and proposed two novel distance measures, that is, mutually nearest distance and quad-tree distance, to detect geosocial similarity based on the locations of users' online activities. The study showed that the two novel distance measures outperformed the existing distance measures.

Similarity measure techniques were investigated to tackle the challenges and assumptions in this research work. In this work, only four widely used distance measure methods were considered, namely Euclidean Distance, Manhattan Distance, Jaccard Distance, and Gower Distance.

Euclidean distance

The distance, $d(x, y)$, between two vectors x and y in p-dimensional space is defined by $d(x, y) = \sqrt{\sum_{i=1}^{p} (x_{i} - y_{i})^{2}}$ (1)

Euclidean distance is a special case of the Minkowski distance.^27,28 Euclidean distance performs well when deployed on data sets that include compact or isolated clusters. This method can only be applied to numeric data. Therefore, in this work, only numeric variables were fed into this equation.

Manhattan distance

The distance is calculated as the sum of the absolute differences between the two observations. Unlike Euclidean distance, Manhattan distance only considers the horizontal and vertical distances. Manhattan distance is also known as city block distance or taxicab metric. $d(x, y) = \sum_{i=1}^{p} |x_{i} - y_{i}|$ (2)
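To make the difference between the two numeric measures concrete, the following is a minimal sketch in Python. The function names and the two sample vectors are illustrative assumptions and do not come from the study.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical 2-dimensional observations, for illustration only.
x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))  # 5.0 (straight-line distance)
print(manhattan(x, y))  # 7.0 (3 horizontal + 4 vertical)
```

For the same pair of points, the Manhattan distance is never smaller than the Euclidean distance, because it sums the horizontal and vertical legs instead of taking the diagonal.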

Jaccard distance

This distance measure has been widely used to calculate the distance between categorical variables. A contingency table is created to count the matches and mismatches among the observations. $d(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$ (3)
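As a small illustration of Equation (3), the sketch below treats each observation as the set of business names found around a location. The function name, the sample sets, and the convention for two empty sets are assumptions made for this example.

```python
def jaccard_distance(x, y):
    """1 - |x ∩ y| / |x ∪ y| for two collections treated as sets."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1.0 - len(x & y) / len(union)

# Hypothetical sets of surrounding businesses at two locations.
a = {"Starbucks", "KFC", "Citibank"}
b = {"Starbucks", "KFC", "Burger King"}
print(jaccard_distance(a, b))  # 0.5 (2 shared names out of 4 distinct names)
```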

Gower distance

When the measurements are mixed (numeric and categorical) variables, the similarity coefficient suggested by Gower can be applied. It computes a per-variable similarity between the observations according to the variable type and then takes the weighted mean over the variables. Each continuous variable is scaled to the range [0,1]. $d(x, y) = 1 - \frac{\sum_{k=1}^{p} s_{xyk}}{\sum_{k=1}^{p} w_{xyk}}$ (4)

where $s_{xyk}$ denotes the similarity between observations x and y on variable k, and $w_{xyk}$ is a binary weight indicating whether variable k is included in the comparison.

For continuous variables: $s_{xyk} = \begin{cases} 1 - \frac{|x_{k} - y_{k}|}{\max(x_{k}) - \min(x_{k})}, & \text{if } w_{xyk} = 1 \\ 0, & \text{if } w_{xyk} = 0 \end{cases}$ (5)

For binary variables: $s_{xyk} = \begin{cases} 1, & \text{if } x_{k} = y_{k} = 1 \\ 0, & \text{otherwise} \end{cases}$ (6)

When variables are binary, $w_{xyk} = 1$ unless $x_{k} = y_{k} = 0$.
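Equations (4) to (6) can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used in the study: the function name, the `kinds` and `ranges` arguments, and the sample records are hypothetical, and only continuous and binary variables are handled.

```python
def gower_distance(x, y, kinds, ranges):
    """Gower distance between two records with binary and continuous variables.

    kinds[k]  -- "bin" for a binary variable, "num" for a continuous one
    ranges[k] -- max - min of variable k over the data set (unused for "bin")
    """
    s_sum = 0.0  # numerator of Equation (4)
    w_sum = 0.0  # denominator of Equation (4)
    for xk, yk, kind, rng in zip(x, y, kinds, ranges):
        if kind == "bin":
            if xk == 0 and yk == 0:
                continue  # joint absence: weight is 0, variable is skipped
            s = 1.0 if xk == yk else 0.0  # both present -> similarity 1
        else:
            # Range-scaled similarity; a constant variable is treated as a match.
            s = 1.0 - abs(xk - yk) / rng if rng > 0 else 1.0
        s_sum += s
        w_sum += 1.0
    if w_sum == 0:
        return 0.0  # assumed convention when no variable is comparable
    return 1.0 - s_sum / w_sum

# Hypothetical records: three binary "nearby business" flags and one count.
kinds = ("bin", "bin", "bin", "num")
ranges = (None, None, None, 10000.0)
x = (1, 0, 0, 5000.0)
y = (1, 1, 0, 8000.0)
print(gower_distance(x, y, kinds, ranges))
```

In this example the third variable is absent at both locations and is ignored, so the distance is one minus the mean of the similarities 1, 0, and 0.7.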

Analytical Data Set Construction

This section discusses the proposed solution to tackle the first challenge in this research work: “How to construct an analytics data set for retail businesses suitable for similarity and distance measures?” It first discusses the structures and components of five raw data sets used in this study. Subsequently, it highlights the data aggregation and transformation process needed to form an analytics data set, a requirement for use of different similarity measure techniques.

Raw data set

Table 1 presents the five data sets and the corresponding variables used in this study. Let $D_{poi}$ denote the points of interest (POIs) data set with 532,249 entries containing shops and places of interest (e.g., schools, shops, clinics, and restaurants) from 14 states in Malaysia. All the POIs can be categorized into 1243 subcategories. For example, Skechers and Swatch can be categorized under “consumer shopping,” and further grouped into the “shoes” and “watch” subcategories, respectively. In addition to the business name and the main and subcategories, $D_{poi}$ contains other detailed information such as latitude, longitude, city, and state. $D_{poi}$ is used for nearby POI extraction given a location of interest. The detailed process for utilizing the POIs is discussed in Algorithm 1 in the Extracting Surrounding Location Features section.

Table 1.

Feature list in each data set

Data set	Features	Count
Point of interest ( $D_{p o i}$ )	Business name, building name, address, branch name, category code, latitude, longitude, city, and state	532,249
Population ( $D_{p o p}$ )	State, administrative district, subdistrict, local authority area, Malay, Chinese, Indian, Other Bumiputera, others, and non-Malaysian	2577
Job type ( $D_{j o b}$ )	Type of Job: manager, professional, associate technician and professional, clerical support workers, service and sales workers, skilled workers of agriculture, forestry and fisheries, skilled workers or carpenters, operation and installer for plant and machine, basic jobs.	144
	Industrial Field: agriculture, forestry and fisheries, mining and quarrying, manufacturing, electricity, gas, steam, and air conditioning, water supply, sewerage, waste management and recovery activities, construction, wholesale and retail trade, repair of motor vehicles and motorcycles, transport and storage, accommodation and service activities for Food and Beverage, information and communication, financial and insurance activities/Takaful, real estate activities, professional, scientific, and technical activities, administrative and support services activities, public administration and defense; compulsory social security activities, education, human health, and social work activities, arts, entertainment and recreation, other service activities, household activity as an employer, offshore bodies and organizations.
	Employment Status: employer, employee, self-employed, family workers without salary
Education ( $D_{e d u}$ )	Schooling Status: still schooling, graduate, have not attended school, never go to school.	144
	Education Level: preschool education, primary education, low secondary education, upper secondary education, preuniversity, special and technical skill certificate program, first-level tertiary education at the certificate/diploma level, first-level tertiary education at the bachelor level.
	Education Certificate: no certificate, primary school evaluation test or equivalent, lower secondary assessment or equivalent, Malaysian certificate of education or equivalent, Malaysian high school certificate or equivalent, certificate of special or technical skills, certificate of polytechnic/university/bodies that gives recognition or equivalent, diploma/advanced diploma in specialized or technical skills, diploma in polytechnic/university or equivalent, diploma/advanced diploma
Property ( $D_{p p t}$ )	Detached House, semidetached house, link house (single/double), low cost, squatter house, Kampong house, low-cost flat, condominiums, apartments, one-storey shop house, two-storey shop house, three-storey shop house, four-storey shop house, five-storey shop house, Rumah estate, bungalow, two-storey shop house (high density), three-storey shop house (high density), four-storey shop house (high density), town house, chalet, long house	5,221,734

The second data set is the Malaysian population data set ( $D_{p o p}$ ) that can be obtained from the Department of Statistics Malaysia website. The population data set includes the demographics at administrative district level, local authority area, and subdistricts' level. There are 201 administrative districts, 1178 subdistricts, and 1198 local authority areas recorded in $D_{p o p}$ . For each administrative district in Malaysia, extra information about the population is obtained.

The third data set, $D_{j o b}$ , stores information about 9 different job types from 21 different industries with 4 types of employment status. The fourth data set, $D_{e d u}$ , presents the 8 different educational levels with 10 types of certificates, ranging from high school to bachelor degree, for all the districts in Malaysia. $D_{j o b}$ can be described via type of job, industrial field, and employment status, while $D_{e d u}$ has schooling status, educational level, and education certificate to provide a detailed description of the population. Both the data sets were supplied by Telekom Malaysia.

The fifth data set, $D_{p p t}$ , consists of data about different Malaysian residential property types at street level. There are 5,221,734 residential properties categorized into 28 different property types. The examples of the property types are detached house, apartments, condominiums, bungalow, townhouse, one-storey shop, and many more. The common residential property types found in Malaysia are terrace, apartment, single storey, and bungalow. Data sets provided by the respective authorities were screened to remove all customer-level information.

In this study, all five data sets are linked together via $D_{p o i}$ . That is, given a POI, relevant information about the POI from $D_{p o p}$ , $D_{j o b}$ , $D_{e d u}$ , and $D_{p p t}$ is extracted and then presented as a record in the analytical data set. A detailed discussion about the transformation from the raw to the analytical data set is given in the Transforming Raw to Analytics Data Set section.

Extracting surrounding location features

Algorithm 1 extract-surroundingLocationFeatures
Input: retails {A, B, C}
Output: $F_{A}^{15}, F_{B}^{15}, F_{C}^{15}$
1: for R in {A, B, C} do
2: $F_{R} \leftarrow \text{get-Features}(R, 100\,\text{m})$
3: $F_{R}^{15} \leftarrow \text{rankDesc-Features}(F_{R}, 15)$
4: end for

Before any recommendation of a retail business can be performed, the retail business recommender requires several inputs for data preprocessing. The first input is, for a given retail business, a list of surrounding businesses within a certain distance (e.g., 100 m). Algorithm 1 shows the steps for extracting the surrounding retail businesses given a list of retail businesses of interest. For simplicity of explanation, three retail businesses of interest (i.e., A, B, and C) serve as the input to Algorithm 1. The output of the algorithm is, for each retail business of interest, the 15 most frequently found surrounding retail businesses. A feature set, F, can be defined as the list of surrounding retail businesses for a particular retail business of interest at a certain location.

To further elaborate the purpose of Algorithm 1, in Figure 1, for instance, all retail businesses within a 100 m radius of business A are extracted. If business A has outlets at other locations, then the surrounding retail businesses of all other outlets are also extracted and stored as F_A. Subsequently, the frequency of each feature stored in F_A is calculated, and only the top 15 features are considered important and stored as $F_{A}^{15}$. In this study, 100 m was used as the standard distance because two shops more than 100 m apart are considered far from each other and beyond convenient walking distance, especially in the hot weather of Malaysia. We also believe that strong dependencies exist between retail businesses within 100 m of each other. For instance, a self-service laundromat would prefer to be located near a 24/7 convenience store or a coffee shop.
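The extraction and ranking steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under assumptions: the study does not state how distances between POIs were computed, so the haversine formula is used here, and the function names, the `(name, latitude, longitude)` tuple layout, and the sample coordinates are hypothetical.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def top_surrounding_features(retail, pois, radius_m=100, top_n=15):
    """Most frequent businesses found within radius_m of any outlet of `retail`.

    pois is a list of (business_name, latitude, longitude) tuples.
    """
    outlets = [p for p in pois if p[0] == retail]
    counts = Counter()
    for _, olat, olon in outlets:
        for name, lat, lon in pois:
            if name != retail and haversine_m(olat, olon, lat, lon) <= radius_m:
                counts[name] += 1  # pooled over all outlets of the retail
    return [name for name, _ in counts.most_common(top_n)]

# Hypothetical POIs: two outlets of "A" with a few neighbours each.
pois = [
    ("A", 2.9200, 101.6500),
    ("Starbucks", 2.9203, 101.6502),  # roughly 40 m from the first outlet
    ("KFC", 2.9300, 101.6500),        # roughly 1.1 km away, outside the radius
    ("A", 3.0000, 101.7000),
    ("Starbucks", 3.0002, 101.7001),
    ("7-Eleven", 3.0001, 101.7003),
]
print(top_surrounding_features("A", pois))  # ['Starbucks', '7-Eleven']
```

The nested loop is quadratic in the number of POIs; for a data set with several hundred thousand entries, a spatial index would be the more practical choice.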

FIG. 1.

Sample location profile extraction.

Transforming raw to analytics data set

The raw data set obtained in the previous section cannot be used directly. It was transformed into an analytics data set before subsequent analytics tasks could be performed. Specifically, the feature sets (i.e., $F_{A}^{15}, F_{B}^{15}, F_{C}^{15}$) were aggregated into one feature set, $F_{union}$, through the relational algebra UNION operation. In our findings, the three feature sets are not disjoint; features overlap across the feature sets. Therefore, the number of features fulfills the condition: $|F_{union}| \leq |F_{A}^{15}| + |F_{B}^{15}| + |F_{C}^{15}|$

The next step is to construct a matrix, $D_{Loc_{A-B-C}}$, that maps all branches of the three businesses to $F_{union}$. This is performed in line 3 of Algorithm 2. For each location in $D_{Loc_{A-B-C}}$, extra information such as population ( $D_{Loc_{A-B-C}}^{pop}$ ), education ( $D_{Loc_{A-B-C}}^{edu}$ ), job ( $D_{Loc_{A-B-C}}^{job}$ ), and property types ( $D_{Loc_{A-B-C}}^{ppt}$ ) was joined to the matrix. The final data set suitable for the analytics task ( $D_{analytics}$ ) was then constructed (Table 2).

Algorithm 2 construct-AnalyticsDataset
Input: $F_{A}^{15}, F_{B}^{15}, F_{C}^{15}$
Output: $F_{union}, D_{analytics}$
1: $F_{union} \leftarrow F_{A}^{15} \cup F_{B}^{15} \cup F_{C}^{15}$
2: $Loc_{A-B-C} \leftarrow \text{all-sites}(A, B, C)$
3: $D_{Loc_{A-B-C}} \leftarrow \text{construct-Matrix}(Loc_{A-B-C}, F_{union})$
4: $D_{analytics} \leftarrow D_{Loc_{A-B-C}} \bowtie D_{Loc_{A-B-C}}^{pop} \bowtie D_{Loc_{A-B-C}}^{edu} \bowtie D_{Loc_{A-B-C}}^{job} \bowtie D_{Loc_{A-B-C}}^{ppt}$

Table 2.

Sample analytics data set

Retail business	Outlet site	Starbucks	7–11	Burger King	KFC	Citibank	…
A	Cyberjaya	Yes	No	Yes	Yes	No	…
A	Dengkil	No	No	Yes	No	Yes	…
$⋮$	$⋮$	$⋮$	$⋮$	$⋮$	$⋮$	$⋮$	$⋮$
B	Kuching	No	No	Yes	Yes	Yes	…
$⋮$	$⋮$	$⋮$	$⋮$	$⋮$	$⋮$	$⋮$	…
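The construction of $F_{union}$ and of the outlet-by-feature matrix in Algorithm 2 can be sketched as follows. The helper names, the tuple layout of an outlet, and the population figure are hypothetical; the yes/no entries mirror the sample rows of Table 2, and the join is simplified to a lookup by outlet site.

```python
def build_union(*feature_sets):
    """Order-preserving union of the per-retail top feature lists."""
    union = []
    for features in feature_sets:
        for f in features:
            if f not in union:
                union.append(f)
    return union

def construct_matrix(outlets, f_union):
    """One row per outlet with a 0/1 flag for every feature in f_union.

    outlets is a list of (retail, site, set_of_nearby_business_names).
    """
    rows = []
    for retail, site, nearby in outlets:
        row = {"retail": retail, "site": site}
        row.update({f: int(f in nearby) for f in f_union})
        rows.append(row)
    return rows

def join_on_site(rows, extra_by_site):
    """Attach area-level attributes to each row; unmatched rows are kept as-is."""
    return [{**row, **extra_by_site.get(row["site"], {})} for row in rows]

# Hypothetical top feature lists (shortened from 15 entries for readability).
f_a = ["Starbucks", "KFC"]
f_b = ["KFC", "Citibank"]
f_c = ["Burger King", "Starbucks"]
f_union = build_union(f_a, f_b, f_c)
assert len(f_union) <= len(f_a) + len(f_b) + len(f_c)

outlets = [
    ("A", "Cyberjaya", {"Starbucks", "Burger King", "KFC"}),
    ("A", "Dengkil", {"Burger King", "Citibank"}),
    ("B", "Kuching", {"Burger King", "KFC", "Citibank"}),
]
matrix = construct_matrix(outlets, f_union)
analytics = join_on_site(matrix, {"Cyberjaya": {"population": 1000}})  # made-up value
print(f_union)
print(analytics[0])
```

Because overlapping features are stored only once, the union here has four columns although the three input lists hold six entries in total.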

Proposed Methods for Retail Recommendation

In this section, three proposed methods for retail location matching are discussed. Given any new location, the first method (Algorithm 3) scans through the geospatial features of all outlets of the three retail businesses of interest (i.e., A, B, C) and selects the retail business of the single outlet that exhibits the highest geospatial similarity with the new location. The second method (Algorithm 4) instead computes the average geospatial similarity index for each retail business; the matched retail business is the one with the highest average similarity index. The third method (Algorithm 5) uses cluster centers to calculate the similarity index. Each retail business is treated as a cluster with two cluster centers. The highest similarity index corresponds to the smallest mean distance between the location of interest and the cluster centers. The retail business for which the smallest distance is obtained is the optimal retail business for the new location.

Algorithm 3 one-Min
Input: $D_{analytics}, r_{new}, r_{\{A, B, C\}}$
Output: $dist_{r}^{min}, r_{m}$
1: for r in $D_{analytics}$ do
2: $dist_{r} \leftarrow \text{gowerDist}(r, r_{new})$
3: end for
4: $dist_{r}^{min} \leftarrow \min(dist_{r})$
5: $r_{m} \leftarrow \text{retrieveRetail}(dist_{r}^{min}, r_{\{A, B, C\}})$

Let $ℳ_{one-Min}$ denote the first method used in this study to solve the challenges in retail recommendation (see Algorithm 3). The algorithm takes $D_{analytics}$ as input and returns the optimal retail (r_m) as output. In the algorithm, the Gower distance is used to calculate the similarity between the new location and the existing outlets of retails A, B, and C (line 2). The Gower method was used in this study because it can measure distance on $D_{analytics}$, which contains both numerical and nominal data types.
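A compact sketch of the one-Min idea is given below. It is an illustration rather than the study's implementation: the `gower` helper covers only binary and continuous variables, `ranges[k]` is assumed to be `None` for a binary variable and a positive range otherwise, and the training records are hypothetical.

```python
def gower(x, y, ranges):
    """Gower distance; ranges[k] is None for binary variables, else max - min (> 0)."""
    s_sum = w_sum = 0.0
    for xk, yk, rng in zip(x, y, ranges):
        if rng is None:
            if xk == 0 and yk == 0:
                continue  # joint absence carries no weight
            s_sum += 1.0 if xk == yk else 0.0
        else:
            s_sum += 1.0 - abs(xk - yk) / rng
        w_sum += 1.0
    return 1.0 - s_sum / w_sum if w_sum else 0.0

def one_min(train, r_new, ranges):
    """Return (smallest distance, retail label of the closest single outlet).

    train is a list of (retail_label, feature_vector).
    """
    best_label, best_dist = None, float("inf")
    for label, vec in train:
        d = gower(vec, r_new, ranges)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_dist, best_label

# Hypothetical outlets: three binary flags and one continuous variable.
ranges = (None, None, None, 100.0)
train = [
    ("A", (1, 0, 1, 20.0)),
    ("A", (1, 1, 0, 30.0)),
    ("B", (0, 1, 1, 80.0)),
    ("C", (0, 0, 1, 60.0)),
]
dist_min, r_m = one_min(train, (1, 0, 1, 25.0), ranges)
print(r_m, round(dist_min, 4))  # A 0.0167
```

Because the decision rests on one nearest outlet, this variant behaves like a 1-nearest-neighbour classifier over the existing outlets.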

Algorithm 4 retail-Average
Input: $D_{analytics}, r_{new}, r_{\{A, B, C\}}$
Output: $dist_{r}^{min(avg)}, r_{m}$
1: for r in $D_{analytics}$ do
2: $dist_{r} \leftarrow \text{gowerDist}(r, r_{new})$
3: end for
4: $dist_{r_{\{A, B, C\}}}^{avg} \leftarrow \text{average}(dist_{r}, r_{\{A, B, C\}})$
5: $dist_{r}^{min(avg)} \leftarrow \min(dist_{r_{\{A, B, C\}}}^{avg})$
6: $r_{m} \leftarrow \text{retrieveRetail}(dist_{r}^{min(avg)})$

The second proposed method for retail recommendation in this study, $ℳ_{retail-Average}$, is described in Algorithm 4. The main difference between $ℳ_{retail-Average}$ and $ℳ_{one-Min}$ is that in $ℳ_{retail-Average}$, the average distance is used for the recommendation of a retail business. The average of the distances for each of the three retail businesses is first acquired (line 4); the minimum average distance is then obtained (line 5), and from it the corresponding retail business is determined (line 6). The optimal retail business is denoted by r_m.
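The averaging step that distinguishes retail-Average from one-Min can be sketched as follows. As before, this is an illustration under assumptions: the `gower` helper handles only binary and continuous variables (with `ranges[k]` set to `None` for binary ones), and the training records are hypothetical.

```python
from collections import defaultdict

def gower(x, y, ranges):
    """Gower distance; ranges[k] is None for binary variables, else max - min (> 0)."""
    s_sum = w_sum = 0.0
    for xk, yk, rng in zip(x, y, ranges):
        if rng is None:
            if xk == 0 and yk == 0:
                continue  # joint absence carries no weight
            s_sum += 1.0 if xk == yk else 0.0
        else:
            s_sum += 1.0 - abs(xk - yk) / rng
        w_sum += 1.0
    return 1.0 - s_sum / w_sum if w_sum else 0.0

def retail_average(train, r_new, ranges):
    """Return (smallest average distance, retail label achieving it)."""
    per_retail = defaultdict(list)
    for label, vec in train:
        per_retail[label].append(gower(vec, r_new, ranges))
    avg = {label: sum(d) / len(d) for label, d in per_retail.items()}
    best = min(avg, key=avg.get)
    return avg[best], best

# Hypothetical outlets: three binary flags and one continuous variable.
ranges = (None, None, None, 100.0)
train = [
    ("A", (1, 0, 1, 20.0)),
    ("A", (1, 1, 0, 30.0)),
    ("B", (0, 1, 1, 80.0)),
    ("C", (0, 0, 1, 60.0)),
]
avg_min, r_m = retail_average(train, (1, 0, 1, 25.0), ranges)
print(r_m, round(avg_min, 4))
```

Averaging over every outlet makes the score less dependent on a single lucky match, at the cost of letting atypical outlets pull the average up.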

Algorithm 5 cluster-Mean
Input: $D_{analytics}, r_{new}, r_{\{A, B, C\}}$
Output: $dist_{r}^{cm}, r_{m}$
1: $D_{r_{\{A, B, C\}}}^{clus} \leftarrow \text{createCentroid}(D_{analytics}, r_{\{A, B, C\}}, 2)$
2: for r in $D_{r_{\{A, B, C\}}}^{clus}$ do
3: $dist_{r} \leftarrow \text{gowerDist}(r, r_{new})$
4: end for
5: $dist_{r}^{avgClus} \leftarrow \text{averageClus}(r_{\{A, B, C\}}, dist_{r})$
6: $dist_{r}^{min(avgClus)} \leftarrow \min(dist_{r}^{avgClus})$
7: $r_{m} \leftarrow \text{retrieveRetail}(dist_{r}^{min(avgClus)})$

The design of $ℳ_{retail-Average}$ takes the average distance over all outlets of each retail business (see Algorithm 4). Such an approach, however, does not separate outliers from the normal data, so the average distance can be distorted when outliers exist. The challenge is to identify which data points should be considered outliers. In this light, to minimize the impact of outliers, $ℳ_{cluster-Mean}$ (see Algorithm 5) modifies $ℳ_{retail-Average}$ to calculate the average distance to the cluster centroids of each retail business (lines 5–6). The optimal retail (r_m) can then be determined as the retail business that returns the minimal distance from the new location.
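The cluster-based variant can be sketched as follows. The study does not describe how the two cluster centers per retail business are created, so this sketch makes a loudly stated assumption: each pair of centers is chosen as the two medoids (actual outlets) that minimize the total Gower distance of the outlets to their nearest center, found by exhaustive search. The helper names and data are hypothetical.

```python
from itertools import combinations

def gower(x, y, ranges):
    """Gower distance; ranges[k] is None for binary variables, else max - min (> 0)."""
    s_sum = w_sum = 0.0
    for xk, yk, rng in zip(x, y, ranges):
        if rng is None:
            if xk == 0 and yk == 0:
                continue  # joint absence carries no weight
            s_sum += 1.0 if xk == yk else 0.0
        else:
            s_sum += 1.0 - abs(xk - yk) / rng
        w_sum += 1.0
    return 1.0 - s_sum / w_sum if w_sum else 0.0

def two_medoids(vectors, ranges):
    """Assumed stand-in for createCentroid: best pair of medoids by brute force."""
    if len(vectors) <= 2:
        return list(vectors)
    def cost(pair):
        return sum(min(gower(v, m, ranges) for m in pair) for v in vectors)
    return list(min(combinations(vectors, 2), key=cost))

def cluster_mean(train, r_new, ranges):
    """Return (smallest mean distance to a retail's centers, that retail's label)."""
    by_retail = {}
    for label, vec in train:
        by_retail.setdefault(label, []).append(vec)
    avg = {}
    for label, vecs in by_retail.items():
        centers = two_medoids(vecs, ranges)
        avg[label] = sum(gower(c, r_new, ranges) for c in centers) / len(centers)
    best = min(avg, key=avg.get)
    return avg[best], best

# Hypothetical outlets; retail A has one atypical outlet.
ranges = (None, None, None, 100.0)
train = [
    ("A", (1, 0, 1, 20.0)),
    ("A", (1, 0, 1, 22.0)),
    ("A", (0, 1, 0, 90.0)),
    ("B", (0, 1, 1, 80.0)),
    ("B", (0, 1, 1, 70.0)),
]
dist_cm, r_m = cluster_mean(train, (1, 0, 1, 25.0), ranges)
print(r_m, round(dist_cm, 4))
```

The brute-force search is only practical for toy data; for retails with hundreds of outlets a standard k-medoids or k-means routine would replace `two_medoids`.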

Evaluation approach

In this study, the three proposed approaches for retail recommendation were validated by randomly selecting three retail businesses from three different retail categories (Table 3). The three categories are Beverages, Food, and Food and Beverage. As shown in Table 4, nine experiments were conducted, some with retail businesses from the same category and some with retail businesses from different categories. The purpose was to assess the performance of the proposed algorithms on both similar and different categories of retails.

Table 3.

Retail stores' count

Business	No. of stores
CoolBlog	303
Chatime	141
Blackball	24
McDonald's	283
Domino's Pizza	152
Pizza Hut	255
KFC	647
A&W	35
Marrybrown	87
Boost Juice	66
Starbucks	190
TheLibrary	16
Old Town	209

Table 4.

Categories and retails

Combination	Business 1	Business 2	Business 3
Beverages	CoolBlog	Chatime	Blackball
Beverages	CoolBlog	Chatime	Boost Juice
Beverages	Boost Juice	TheLibrary	Starbucks
Food	KFC	A&W	Marrybrown
Food	KFC	Domino's Pizza	Marrybrown
Food	KFC	McDonald's	A&W
Beverage and food	CoolBlog	McDonald's	Domino's Pizza
Beverage and food	CoolBlog	McDonald's	Pizza Hut
Beverage and food	CoolBlog	KFC	Old Town

In the evaluation phase, 10 sets of experiments with an 80%–20% split of $D_{analytics}$ were performed. In each set, 20% of the data were randomly selected to serve as a test data set. The proposed algorithm takes each record from the test data set and performs proximity measures against the remaining 80% of the data, from which the recommended retail type is derived. The matching accuracy of the algorithm is calculated from the number of matches between the recommended and the actual retail types. The process was repeated 10 times to obtain the average matching accuracy, calculated using the formula below: $\text{AverageMatchingAccuracy} = \frac{1}{10} \sum_{j=1}^{10} \left(\frac{n_{matched}^{(j)}}{N_{20}} \times 100\%\right)$ (7)

$n_{matched}^{(j)}$ denotes the number of test records in run j for which the predicted retail matches the actual retail, and $N_{20}$ denotes the number of records in the 20% test data set.
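The repeated 80%–20% evaluation can be sketched as follows. This is an illustrative harness under assumptions: the function names, the fixed random seed, the placeholder nearest-record recommender, and the toy one-dimensional data are not taken from the study.

```python
import random

def average_matching_accuracy(data, recommend, runs=10, test_frac=0.2, seed=0):
    """Mean accuracy (%) over `runs` random train/test splits.

    data is a list of (retail_label, feature_vector);
    recommend(train, vector) returns a retail label.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        shuffled = list(data)
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_frac))
        test, train = shuffled[:n_test], shuffled[n_test:]
        n_matched = sum(recommend(train, vec) == label for label, vec in test)
        accuracies.append(100.0 * n_matched / n_test)
    return sum(accuracies) / runs

def nearest_label(train, vec):
    """Placeholder recommender: label of the closest training record."""
    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))
    return min(train, key=lambda rec: dist(rec[1], vec))[0]

# Two well-separated hypothetical retails, 10 outlets each.
data = [("A", (i / 10,)) for i in range(10)] + [("B", (10 + i / 10,)) for i in range(10)]
print(average_matching_accuracy(data, nearest_label))  # 100.0
```

Any of the three proposed methods could be passed in place of `nearest_label`, provided it is wrapped to accept the training records and a new location and to return a retail label.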

Results and Discussion

In this study, a total of nine experiments were conducted to investigate the performance of the proposed methods for retail recommendation. The three retail businesses selected for each experiment are shown in Table 4. The three businesses can belong to the same business type or to different types (Table 5). This study focused on three business type combinations, namely, Beverages only, Food only, and Food and Beverage. The three proposed methods were assessed in each experiment.

Table 5.

Experimental results for retail recommendation

Category	Business 1	Business 2	Business 3	$ℳ_{one-Min}$	$ℳ_{retail-Avg}$	$ℳ_{cluster-Mean}$	Mean
Beverages	CoolBlog	Chatime	Blackball	91.30	82.61	86.96	86.96
Beverages	CoolBlog	Chatime	Boost Juice	78.00	72.00	82.00	77.33
Beverages	Boost Juice	TheLibrary	Starbucks	68.00	60.00	60.00	62.67
			Mean accuracy	79.10	71.54	76.32	75.65
Food	KFC	A&W	Marrybrown	86.67	72.00	78.67	79.11
Food	KFC	Domino's Pizza	Marrybrown	80.46	66.67	70.11	72.41
Food	KFC	McDonald's	A&W	66.32	69.47	74.74	70.18
			Mean accuracy	77.82	69.38	74.51	73.90
Food and Beverage	CoolBlog	McDonald's	Domino's Pizza	57.53	52.05	64.38	57.99
Food and Beverage	CoolBlog	McDonald's	Pizza Hut	54.22	49.40	51.81	51.81
Food and Beverage	CoolBlog	KFC	Old Town	74.56	71.93	78.95	75.15
			Mean accuracy	62.10	57.79	65.05	61.65
			Overall average accuracy	73.01	66.24	71.96	70.40

Table 5 shows the accuracy (in %) of the proposed methods; each mean accuracy row is the mean of the three experiments listed above it. For the Beverages category, the highest mean accuracy of 79.10% was obtained through $ℳ_{one-Min}$. It is followed by $ℳ_{cluster-Mean}$, which minimizes the average distance to the cluster centers of each business and scored 76.32%, 2.78 percentage points lower than $ℳ_{one-Min}$, while $ℳ_{retail-Avg}$ scored the lowest at 71.54%. On average, the mean accuracy obtained for the Beverages category was 75.65%. As for the Food category, the highest accuracy achieved was 77.82% when $ℳ_{one-Min}$ was deployed, while $ℳ_{cluster-Mean}$ ranked second with an accuracy of 74.51%. $ℳ_{retail-Avg}$ scored the lowest at 69.38%. The overall average accuracy for this category was 73.90%. As for the Food and Beverage category, the highest accuracy obtained was 65.05% via $ℳ_{cluster-Mean}$, followed by $ℳ_{one-Min}$ with an accuracy of 62.10%. As in the other categories, $ℳ_{retail-Avg}$ ranked lowest in accuracy (57.79%). The overall mean accuracy for this category is 61.65%, the lowest among the three categories. In summary, across the nine experiments, $ℳ_{one-Min}$ showed the highest accuracy of 73.01%, suggesting that retail recommendation could be performed by using the least complex distance measurement algorithm. In terms of overall average accuracy when comparing the three categories, the Food and Beverage category scored the lowest (61.65%), suggesting that similarity measure techniques are best applied to recommendation for retails within the same category. Last, the overall accuracy obtained was 70.40%, which implies that distance measure techniques are a workable basis for recommending a retail business.

In this study, further statistical analysis was conducted to investigate the performance of the proposed distance measure methods. A t-test was conducted to study whether the performance of the distance measure methods is affected by business category. The t-test gave t = −3.0612, df = 13.579, p = 0.008716; the null hypothesis that the true difference in means equals 0 was rejected at α = 0.05, suggesting a significant difference in mean accuracy among the business categories. That is, the proposed distance measure methods performed better within the same business category (i.e., Beverages = 75.65% and Food = 73.90%) than with a mixture of business categories (i.e., Food and Beverage = 61.65%).

In addition, one-way ANOVA (analysis of variance) was conducted to investigate the differences between the business categories. The ANOVA test gave p = 0.0129, indicating a significant difference among the three business categories (Table 6). Moreover, Tukey's honest significant difference test was performed to investigate the pairwise comparisons between the means of the categories (Table 7). With α = 0.05 and the adjusted p-values, there was a significant difference between the Beverage and Food group and either the Beverages group or the Food group alone. However, there was no significant difference between the Food group and the Beverages group. The mean accuracy of Beverages (75.7%) is the highest, and the mean accuracy of Food (73.9%) is higher than that of the combination of Beverage and Food (61.6%). The mean accuracy obtained within the same category of businesses was above 70%, while the mean accuracy obtained across different business categories only slightly exceeded 60%.

Table 6.

Analysis of variance results for category

	df	Sum of squares	Mean square	F	Pr (>F)
Category	2	0.1048	0.05240	5.249	0.0129*
Residuals	24	0.2396	0.00998

*Significant at the 5% level (0.01 ≤ p < 0.05).

Table 7.

Tukey multiple comparisons of means

Comparison	Difference	Lower	Upper	Adjusted p-value
Food and Beverage − Beverages	−0.14004288	−0.257669493	−0.02241628	0.0175150
Food − Beverages	−0.01751111	−0.135137720	0.10011550	0.9268747
Food − Food and Beverage	0.12253177	0.004905165	0.24015838	0.0400516
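The intervals in Table 7 can be related back to the residual mean square in Table 6. The Python sketch below takes the three rows of Table 7 in the order they appear and recovers the half-width of each interval and the critical value it implies. It assumes a balanced design with nine observations per category (27 observations in total, in line with the 24 residual degrees of freedom), which the source does not state explicitly; the labels and variable names are illustrative.

```python
from math import sqrt

# Values transcribed from Tables 6 and 7 (difference, lower, upper).
ss_residual, df_residual = 0.2396, 24
tukey_rows = {
    "Food and Beverage vs Beverages": (-0.14004288, -0.257669493, -0.02241628),
    "Food vs Beverages": (-0.01751111, -0.135137720, 0.10011550),
    "Food vs Food and Beverage": (0.12253177, 0.004905165, 0.24015838),
}

# Assumption: balanced design, 9 observations per category.
n_per_group = 9

mse = ss_residual / df_residual
standard_error = sqrt(mse / n_per_group)

# Half-width of each interval is (upper - lower) / 2.
half_widths = [(upper - lower) / 2 for _, lower, upper in tukey_rows.values()]

# Critical value implied by each interval: half-width / standard error.
implied_q = [hw / standard_error for hw in half_widths]

# A pair is significant when its interval excludes zero.
significant = {
    pair: (lower > 0 or upper < 0)
    for pair, (_, lower, upper) in tukey_rows.items()
}

print([round(q, 3) for q in implied_q])
print(significant)
```

Under the stated assumption, the implied critical value is the same for all three rows and lies close to the tabulated 5% studentized-range value for three groups and 24 residual degrees of freedom (about 3.53). The two intervals that exclude zero are the two comparisons involving the Food and Beverage category, matching the adjusted p-values below 0.05.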

ANOVA was also used to investigate the differences among the three proposed methods. The p-value was 0.422, so the null hypothesis was not rejected at α = 0.05 (Table 8). This finding indicates that no significant difference was detected among the methods used.

Table 8.

Analysis of variance results for method used

	df	Sum of squares	Mean square	F	Pr (>F)
Approach	2	0.0239	0.01195	0.895	0.422
Residuals	24	0.3205	0.01335
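The F statistics in Tables 6 and 8 follow directly from the sums of squares and degrees of freedom listed there, since each mean square is a sum of squares divided by its degrees of freedom and F is the ratio of the effect mean square to the residual mean square. The Python sketch below recomputes both values from the tabulated figures; the function name `f_statistic` is illustrative.

```python
def f_statistic(ss_effect, df_effect, ss_residual, df_residual):
    """Return F = (effect mean square) / (residual mean square)."""
    mean_square_effect = ss_effect / df_effect
    mean_square_residual = ss_residual / df_residual
    return mean_square_effect / mean_square_residual


# Sums of squares and degrees of freedom transcribed from Tables 6 and 8.
f_category = f_statistic(0.1048, 2, 0.2396, 24)  # Table 6: category effect
f_method = f_statistic(0.0239, 2, 0.3205, 24)    # Table 8: method effect

# Both analyses partition the same total variation in accuracy.
total_ss_category = 0.1048 + 0.2396
total_ss_method = 0.0239 + 0.3205

print(round(f_category, 3), round(f_method, 3))
print(round(total_ss_category, 4), round(total_ss_method, 4))
```

The recomputed values agree with the tabulated F statistics (5.249 and 0.895) to three decimal places. The two tables also share the same total sum of squares (0.3444), as expected when the same set of accuracy observations is partitioned once by category and once by method.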

Conclusion

Recommending a suitable retail business for a given location is not a trivial task. Not only are there many different layers with a large number of variables to consider, but, more challenging still, sales data, a crucial reference in site selection, are often absent.²⁹ Obtaining sales data for retail businesses is generally not possible owing to privacy and confidentiality issues. Because of these challenges, retail business recommendation often falls back on human knowledge and experience. Some practitioners use GIS to extract insights and patterns about the profile of a location. GIS, however, has its limitations when too many layers exist: visual representation and inspection can be very taxing and misleading when several layers overlap. More importantly, with the GIS approach it is laborious to identify, from each layer, the important variables needed for retail business recommendation. In this light, the work presented in this article addressed the challenge of retail recommendation, particularly in the absence of sales data.

There are two main contributions in this research work. The first is a discussion of how different data sets can be structured to form an analytics data set suitable for retail business recommendation. The second centers on using similarity measure methods to recommend the most appropriate retail business. Nine sets of experiments were conducted on the three proposed methods. The findings suggested that the third proposed similarity measure method ($ℳ_{\text{cluster-Mean}}$) tended to perform best compared with $ℳ_{\text{one-Min}}$ and $ℳ_{\text{retail-Avg}}$, ranking first in two of the three categories, although $ℳ_{\text{one-Min}}$ achieved the highest overall accuracy (73.01%). However, the difference in accuracy was less than 5.5% and statistically insignificant. The experimental results showed that the three proposed methods performed better within the same category of businesses than across a mixture of business categories.

While the findings discussed in this study have shown the positive impact of applying a machine learning approach to retail recommendation, some limitations remain. First, the current work applies only to recommendation for retail outlets in landed trade areas; no recommendation can be made within multistorey shopping malls. Second, the recommendation engine does not apply to a closed, specialized compound such as an academic institution or a school, because most academic institutions have regulations on the selection of retail outlets. Third, the recommendation system does not apply to new residential areas. That is, no recommendation can be made where the population is too small and there are very few adjacent retail outlets, because the recommendation system currently requires adjacent retail outlets for correlation analysis.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received.

Cite this article as: Ting C-Y, Ho CC, Yee HJ (2020) Geospatial insights for retail recommendation using similarity measures. Big Data 8:6, 519–527, DOI: 10.1089/big.2020.0028.

References

1. Merino, Ramirez-Nafarrate. Estimation of retail sales under competitive location in Mexico. J Bus Res. 2016; 69:445–451.
2. Ailawadi, Farris. Managing multi- and omni-channel distribution: Metrics and research directions. J Retail. 2017; 93:120–135.
3. Bradlow, Gangwar, Kopalle, et al. The role of big data and predictive analytics in retailing. J Retail. 2017; 93:79–95.
4. Dekimpe. Retailing and retailing research in the age of big data analytics. Int J Res Market. 2020; 37:3–14.
5. Erbıyık, Özcan, Karaboğa. Retail store location selection problem with multiple analytical hierarchy process of decision making an application in Turkey. Procedia Soc Behav Sci. 2012; 58:1405–1414.
6. Fong, Fang, Luo. Geo-conquesting: Competitive locational targeting of mobile promotions. J Market Res. 2015; 52:726–735.
7. Grewal, Roggeveen, Nordfält. The future of retailing. J Retail. 2017; 93:1–6.
8. Larson, Bradlow, Fader. An exploratory look at supermarket shopping paths. Int J Res Market. 2005; 22:395–414.
9. Mulky. Distribution challenges and workable solutions. IIMB Manage Rev. 2013; 25:179–195.
10. Liu, Wang, Li, et al. Elan: An efficient location-aware analytics system. Big Data Res. 2016; 5:16–21.
11. Roig-Tierno, Baviera-Puig, Buitrago-Vera, et al. The retail site location decision process using GIS and the analytical hierarchy process. Appl Geogr. 2013; 40:191–198.
12. Kabir, Sumi. Power substation location selection using fuzzy analytic hierarchy process and promethee: A case study from Bangladesh. Energy. 2014; 72:717–730.
13. Trivedi, Singh. A hybrid multi-objective decision model for emergency shelter location-relocation projects using fuzzy analytic hierarchy process and goal programming approach. Int J Project Manage. 2017; 35:827–840.
14. Garcia, Alvarado, Blanco, et al. Multi-attribute evaluation and selection of sites for agricultural product warehouses based on an analytic hierarchy process. Comput Electron Agric. 2014; 100:60–69.
15. Rao, Goh, Zhao, et al. Location selection of city logistics centers under sustainability. Transport Res D Transport Environ. 2015; 36:29–44.
16. Turhan, Akalın, Zehir. Literature review on selection criteria of store location based on performance measures. Procedia Soc Behav Sci. 2013; 99:391–402.
17. Anderson, Volker, Phillips. Converse's breaking-point model revised. J Manage Market Res. 2010; 2:1–10.
18. Chavez, Berentsen, Lansink. Assessment of criteria and farming activities for tobacco diversification using the analytical hierarchical process (AHP) technique. Agric Syst. 2012; 111:53–62.
19. …, Ma. The state-of-the-art integrations and applications of the analytic hierarchy process. Eur J Oper Res. 2018; 267:399–414.
20. Vasileiou, Loukogeorgaki, Vagiona. GIS-based multi-criteria decision analysis for site selection of hybrid offshore wind and wave energy systems in Greece. Renew Sustain Energy Rev. 2017; 73:745–757.
21. Shaheen, Khan. A method of data mining for selection of site for wind turbines. Renew Sustain Energy Rev. 2016; 55:1225–1233.
22. Ekmekçioğlu, Kaya, Kahraman. Fuzzy multicriteria disposal method and site selection for municipal solid waste. Waste Manage. 2010; 30:1729–1736.
23. Shirkhorshidi, Aghabozorgi, Wah. A comparison study on similarity and dissimilarity measures in clustering continuous data. PLoS One. 2015; 10:e0144059.
24. Boriah, Chandola, Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, SIAM, Atlanta, GA, USA. 2008. pp. 243–254.
25. Deshpande, VanderSluis, Myers. Comparison of profile similarity measures for genetic interaction networks. PLoS One. 2013; 8:e68664.
26. Kanza, Kravi, Safra, et al. Location-based distance measures for geosocial similarity. ACM Trans Web. 2017; 11:17.
27. Mao, Jain. A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Trans Neural Netw. 1996; 7:16–29.
28. Jain, Murty, Flynn. Data clustering: A review. ACM Comput Surv. 1999; 31:264–323.
29. Ting C-Y, Ho, Jia Yee, et al. Geospatial analytics in retail site selection and sales prediction. Big Data. 2018; 6:42–52.