Location prediction for facility placement by incorporating multi-characteristic information

Abstract

When recommending locations for new facilities in urban planning or commercial programming, location prediction identifies the optimal candidates, namely those that maximize the number of served customers or minimize customer inconvenience and therefore bring the maximum profit. Most existing studies use only spatial-temporal features to evaluate location popularity and ignore the social relationships of customers, which are significant factors in popularity assessment. Current research also fails to take the capacities and categories of the facilities into consideration. To overcome these drawbacks, we introduce a novel model, Multi-characteristic Information based Top- $k$ Location Prediction (MITLP). It captures the spatio-temporal behaviors of customers from historical trajectories, exploits the social relevancy in their friend relationships, and examines the category competitiveness of specific facilities. Subsequently, drawing on the feature evaluation and popularity quantization, MITLP is implemented within a hybrid B-tree-like recommending framework, the Constrained Location and Social-Trajectory Clustered forest (CLSTC-forest), which not only performs better in practice but also addresses the facility service constraints. Finally, extensive experiments conducted on real-world datasets demonstrate the efficiency and effectiveness of the proposed model.

Keywords

Location recommendation; facility placement; spatio-temporal trajectory; geo-social relationship; capacity constraint

1. Introduction

Location prediction aims to identify appropriate locations for placing new commercial or public facilities from a range of available candidates. In this context, a location is a specific site in a road network where a facility is set up to provide a given service to customers. This kind of prediction has been widely applied in recommendation systems for urban planning and commercial facility siting, e.g., planning new vehicle charging stations, mobile toilets, retail stores, or restaurants in a city. One of the most important factors in prediction is the profit that a candidate location yields: different locations bring different benefits as a result of their geographical situations, service periods, customer flows, service capacities, and so on. Therefore, to predict location popularity, which intuitively reflects profits or earnings, the number of potentially served customers is estimated, and the locations with the highest evaluated popularity are recommended.

With the development of GPS and mobile devices in recent years, people’s daily trajectories have been widely recorded and analyzed [1, 2, 3], where a trajectory is a sequence of successive check-in locations with timestamps that depicts a motion path precisely. Consequently, an increasing number of studies on location recommendation or location prediction focus on trajectories [4, 5, 6, 7, 8, 9, 10, 11]. However, these studies evaluate the popularity of candidates either solely by the spatial-temporal information between customers and locations [12, 13, 5, 6, 7] or with only a few other characteristics taken into consideration [14, 8, 9, 10]. As a result, they fail to capture the effects of social friend relationships, temporal regions, and facility categories on the assessment of location popularity, which renders the final recommendations inaccurate and, to some degree, uneconomical. The three overlooked features are detailed next.

Social relevancy. As advances in social networks facilitate interpersonal communication, friend-based recommendation is becoming more and more essential in recommendation practice [15]. For instance, in modern promotional shopping, customers occasionally receive electronic red envelopes or coupons [16, 17] and then share them with friends in smartphone apps such as WeChat, Alipay, Yelp, or UberEATS. With these shared red envelopes or coupons, their friends get a sizable discount when consuming at the same places afterwards. The following down-to-earth example illustrates why social friend relationships should be acknowledged for reciprocal recommendation.

Figure 1.

An example for candidate location prediction.

Example 1. As shown in Fig. 1, there are four candidate locations $l_{1}$ to $l_{4}$ , from which sites for vehicle charging stations will be chosen, and six vehicle trajectories $T_{1}$ to $T_{6}$ corresponding to six customers $u_{1}$ to $u_{6}$ . To predict the optimal location for serving $u_{6}$ , a method that considers only spatial distance, as some existing methods do, selects a random location among $l_{2}$ , $l_{3}$ and $l_{4}$ , since the shortest spatial distances from $T_{6}$ to $l_{2}$ , $l_{3}$ and $l_{4}$ are all equal. Nevertheless, $u_{5}$ and $u_{6}$ are close friends and would readily share electronic red envelopes or coupons. Therefore, if $u_{5}$ has been served by $l_{4}$ (as $u_{5}$ passes $l_{4}$ directly), there is a high probability that $u_{6}$ will also be served by $l_{4}$ .

Temporal conflict. Temporal conflict, which arises from facility service constraints, is another critical aspect of popularity prediction. Commercial facilities such as restaurants, supermarkets, or gas stations experience traffic congestion or crowding during certain periods, such as weekend evenings or discount seasons. During these periods, customers wait a long time to be served or to pay for goods because of limited capacity. In this circumstance, facility constraints and the temporal activities of customers should be exploited thoroughly. For example, if the activities of several customers are concentrated in similar time periods, the probability of recommending the same facility to all of them should be reduced.

Categorical effect. Different categories of service facilities may have different effects on both customers and locations. On the one hand, customers are sensitive to spatial distance for public facilities such as mobile toilets or parking lots, whereas for commercial facilities such as supermarkets or shopping malls they care more about practical discounts or a better shopping experience. On the other hand, location selection is affected by nearby facilities of the same category as a result of homogeneous competition; that is, the evaluation of a candidate location should be more circumspect if other facilities of its kind exist in the surrounding area.

In this paper, to overcome the aforementioned deficiencies, we formulate the location prediction problem, which aims to identify the optimal $k$ locations for setting up new facilities. In this process, we incorporate large numbers of spatio-temporal trajectories and social relationships into a novel model, MITLP, with respect to facility capacity constraints and categories. Furthermore, to determine whether a candidate location should be recommended, the location popularity is predicted from four significant characteristics: spatial commute distance, social relevancy, temporal conflict, and facility category competitiveness. Meanwhile, with the help of location popularity and capacities, we develop a new hybrid B-tree-like framework, named the CLSTC predicting forest, which organizes customers, spatio-temporal trajectories, and locations into CLSTC-trees and a CLSTC-forest efficiently. On top of this framework, an effective query approach is presented that uses a simple selection strategy to obtain the final results. Our key contributions are summarized as follows.

• This study proposes the location prediction problem within a novel model, MITLP, which comprehensively recognizes several kinds of relevant characteristic information and describes the quantization of features and the evaluation of location popularity.

• We develop a new recommending framework, CLSTC-forest, which efficiently incorporates candidate locations, customers, and spatio-temporal trajectories, together with an effective query algorithm.

• Extensive experiments on real datasets offer insight into the efficiency and effectiveness of the proposed model and framework; moreover, a case study is conducted to visualize the effectiveness.

The rest of the paper is organized as follows. Section 2 reviews the related work, and Section 3 presents the problem definition and the framework. Section 4 formulates the model with respect to feature extraction, popularity quantization, and popularity evaluation. Section 5 describes the implementation of the model and the corresponding query method. The experimental evaluation and a case study are reported in Section 6. The last section concludes the paper.

2. Related work

Existing work falls into two groups: next optimal location prediction and optimal $k$ locations prediction.

2.1 Next optimal location prediction

Quite a few studies [18, 19, 20, 9, 21, 22, 12, 5, 13] focus on the next optimal location prediction problem using various metrics and objective functions. The setting is that some facilities already exist in the spatial area, and the next optimal location is recommended for placement. Sun et al. [19] acknowledge the service capacity of each facility, which limits the number of served customers; however, their metric considers only the spatial distances between customers and locations. Li et al. [20] query a location for establishing a facility on the basis of the optimal segment query of trajectories; they assign a score to each segment as in [5] without recognizing the candidate location as a specific position on a road network. Besides metric models, probability model-based prediction is another important line of work on the location recommendation problem. Yao et al. [9] use a Semantics-Enriched Recurrent Model (SERM) for next optimal location prediction in semantic trajectories, where both spatio-temporal transitions and relevant textual information posted by customers are considered to improve precision; nonetheless, they concentrate on mining semantic information from trajectories. Wang et al. [21] integrate different characteristic information of customers into a simple regression model to query the optimal location, but they neglect the effects of facility categories and consider only restaurant placement. Karamshuk et al. [18] also propose a probability model, Geo-Spotting, for identifying the optimal area (location) for a new retail store, in which the predictive power of various machine learning features for facility popularity is studied; however, they investigate only a few geographic and user mobility features and ignore the regional characteristics of metropolises.

2.2 Optimal $k$ locations prediction

Recent research has concentrated on the problem of optimal $k$ location recommendation from either POI check-ins or trajectories [15, 23, 4, 11, 6, 7, 14, 8, 10], and a large portion of these studies predict the optimal locations from trajectories in particular. In more detail, Liu et al. [8] introduce a systematic POI demand modeling framework, RPDI (Region POI Demand Identification), which models the POI demands of specified locations by exploiting the daily needs of customers; the fundamental features examined mainly include POI profiles and customer travel trajectories. Although the road network is split into regions in advance according to the corresponding POI profiles and demographic features, they do not acknowledge the correlation of adjacent regions. Li et al. [11] mine the most influential $k$ -location by evaluating the maximum number of unique trajectories that traverse a location in a given spatial region; the general practicability of this work is therefore greatly restricted by the traversal limitation. Mitra et al. [4, 6, 7] focus on the top- $k$ location query problem with respect to trajectories only. In [6] they propose an index framework, NetClus, for the TOPS (Trajectory-aware Optimal Placement of Services) query, assuming that each candidate location has a maximum service range $\tau$ given as a spatial distance. As a result, the construction of NetClus leads to many index instances that are calculated and stored for different values of $\tau$ and cluster radii. In [7] they further extend TOPS to TIPS (Trajectory-aware Inconvenience-minimizing Placement of Services), which attempts to minimize the maximum or the average customer inconvenience, but only the effects of user trajectories are considered in these studies. Hsieh et al. [10] extend the feature set proposed by Geo-Spotting [18], propose the location popularity prediction problem for placing retail stores, and develop the API (Affinity-based Popularity Inference) model based on Gaussian random fields; however, there are several limitations on the service ranges of locations, and they also fail to take advantage of friend relationship-based reciprocal recommendation.

Despite the great contributions of existing studies, no work yet addresses reciprocal recommendation, category competitiveness, and historical trajectories in the spatio-temporal domain simultaneously for candidate location prediction. Consequently, we propose the novel model MITLP to address this issue; furthermore, a tree-like framework is introduced in accordance with the structure of the road network.

3. Problem statement

In this section, we formally introduce the problem of candidate location prediction for facility placement and give an overview of the entire problem framework. Table 1 summarizes the notations frequently used throughout this paper.

Table 1
Notation and the corresponding description

Notation	Description
${G}$	a road network
${L}$ , ${C}$	set of locations and service capacities
${\rm I}$	set of facility categories
${U}$	set of customers
${R}$	set of social friend relationships
${\Gamma}$	set of spatio-temporal trajectories
${F_{s}}({u_{j}})$	social friend set of ${u_{j}}$
${P_{b}}({l_{i}})$	set of customers already evaluated as served by ${l_{i}}$
${f_{d}}({l_{i}},{u_{j}})$	spatial commute distance of ${l_{i}}$ and ${u_{j}}$
${f_{s}}({l_{i}},{u_{j}})$	social relevancy of ${l_{i}}$ and ${u_{j}}$
${f_{t}}({l_{i}},{u_{j}})$	temporal conflict of ${l_{i}}$ and ${u_{j}}$
${f_{c}}({l_{i}})$	category competitiveness of ${l_{i}}$
${\varsigma}({l_{i}},{u_{j}})$	service utility of ${l_{i}}$ and ${u_{j}}$
$\phi({l_{i}})$	popularity of ${l_{i}}$
${\Im_{k}}$	set of optimal $k$ locations
${\Phi_{\Im}}$	total popularity of ${\Im_{k}}$

3.1 Problem formulation

Consider a setting where candidate locations and trajectories are located in a road network and the social relationships of the customers associated with the trajectories can be captured. The road network is modeled as a directed weighted graph $G=\{{V_{g}},{E_{g}}\}$ , where ${V_{g}}$ denotes the set of vertices (road intersections) and ${E_{g}}$ denotes the set of directed edges (road segments); the weight of a directed edge denotes its actual spatial distance.

Definition 1 (Candidate location). A candidate location $l$ is a place for establishing a certain facility or service, such as a vehicle charging station or a mobile toilet, where the facility or service belongs to a specified category ${\vartheta}$ , ${\vartheta\in I}$ .

The choice of $L=\{{l_{1}},\ldots,{l_{m}}\},{l_{i}}\in{V_{g}},1\leqslant i\leqslant m$ , is affected by various factors and is beyond the scope of this research; we therefore simply take $L$ as an input of our problem. For simplicity, candidate locations are assumed to be situated at road intersections; if a location is not, it is assigned to a new vertex, and the corresponding edge is split into two new ones.

Definition 2 (Service capacity). Given a candidate location $l_{i}$ , the service capacity of $l_{i}$ is defined as $c_{l_{i}}$ , meaning that the number of served customers cannot exceed $c_{l_{i}}$ in a real serving application.

For example, if we plan to set up a charging station with 30 power poles for new energy vehicles (NEVs) at a candidate location, the service capacity of this facility (location) is set to 30.

Definition 3 (Trajectory). A trajectory $T$ is represented in the sequential form $\{({v_{1}},{t_{1}}),\ldots,({v_{\zeta}},{t_{\zeta}})\}$ , where each ${v_{\zeta}}\in{V_{g}}$ is a vertex in $G$ and ${t_{\zeta}}$ denotes the timestamp at which the trajectory crosses ${v_{\zeta}}$ .

Definition 4 (Social relationship). The social relationships of customers are simply modeled as an undirected and unweighted graph $R=\{{V_{u}},{E_{u}}\}$ , where ${V_{u}}$ is the set of nodes that represent customers, and ${E_{u}}$ is the set of edges, each edge $({u_{i}},{u_{j}})\in{E_{u}}$ denotes that there is a close friend relationship between ${u_{i}}$ and ${u_{j}}$ .

Note that each trajectory corresponds to exactly one customer, denoted ${{T_{i}}{\leftrightarrow}}{u_{i}}$ , so $|\Gamma|=|U|=n$ . Formally, such a trajectory is referred to as a spatio-temporal-social trajectory, in which the corresponding social relationships are also presented, as shown in Fig. 1.
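The inputs of Definitions 1 to 4 can be held in very small data structures. The following Python sketch is only an illustration under our own assumptions: the names `RoadNetwork`, `Trajectory`, `split_edge` and `add_friendship` are hypothetical and do not come from the paper. It shows a directed weighted graph, the edge split used when a candidate location does not lie on an intersection, one trajectory per customer, and a symmetric friend relation.

```python
from dataclasses import dataclass, field


@dataclass
class RoadNetwork:
    """Directed weighted graph G = (V_g, E_g); edge weights are road distances."""
    adj: dict = field(default_factory=dict)  # vertex -> {neighbor: distance}

    def add_edge(self, u, v, dist):
        self.adj.setdefault(u, {})[v] = dist
        self.adj.setdefault(v, {})

    def split_edge(self, u, v, new_vertex, dist_from_u):
        """Turn a location lying on edge (u, v) into a vertex by splitting the edge."""
        total = self.adj[u].pop(v)
        self.add_edge(u, new_vertex, dist_from_u)
        self.add_edge(new_vertex, v, total - dist_from_u)


@dataclass
class Trajectory:
    """Sequence of (vertex, timestamp) pairs that belongs to exactly one customer."""
    customer: str
    points: list  # [(vertex, timestamp), ...]


def add_friendship(friends, a, b):
    """Social graph R is undirected and unweighted: store both directions."""
    friends.setdefault(a, set()).add(b)
    friends.setdefault(b, set()).add(a)


# Illustrative data (hypothetical values).
g = RoadNetwork()
g.add_edge("v1", "v2", 10.0)
g.split_edge("v1", "v2", "l1", 4.0)  # candidate l1 sits 4 units after v1

friends = {}
add_friendship(friends, "u5", "u6")
t6 = Trajectory("u6", [("v1", 0.0), ("v2", 60.0)])
```

Storing the friend relation as symmetric sets keeps the lookup of ${F_{s}}({u_{j}})$ a single dictionary access, which is convenient for the social features of Section 4.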

Problem definition. Given a road network $G$ , a set of customers ${U}$ with corresponding trajectories set ${\Gamma}$ , a set of social friend relationships $R$ , as well as a set of candidate locations ${L}$ , the multi-characteristic information-based location prediction problem seeks to select the optimal location set ${\Im}$ for establishing new facilities with a specified category, which maximizes the number of potential customers without exceeding the capacity constraints.

3.2 Framework overview

An overview of the proposed location prediction framework is illustrated in Fig. 2; it consists of four logical parts. First, using the multi-characteristic information of the inputs, the framework extracts the significant features of spatial commute distance, social relevancy, temporal conflict, and category competitiveness. Next, candidate locations are mapped to the road network in advance, and, to predict the correlations between locations and their corresponding customers precisely, location popularity is quantified and then evaluated with the help of service categories, capacity constraints, and the four extracted features. Subsequently, by repeating the interactive processes of popularity evaluation and parameter learning on test data, the CLSTC-trees and the CLSTC-forest are constructed by following several combination strategies of model learning. Finally, the candidate locations are organized and ranked by CLSTC-forest level marks, and the final results are obtained through a simple selection approach for consultants or decision-makers in urban planning, public facility siting, commercial facility siting, and so on.

Figure 2.

An overview of the framework.

4. Proposed model

In this section, the prediction features are described for each characteristic, the popularity quantization and popularity evaluation for model learning are presented, and the proposed model is briefly demonstrated with an example.

4.1 Prediction features

Feature extraction from multiple inputs is critical to popularity prediction and model learning. We not only recognize the mutual correlations between locations and customers but also capture the relevant characteristics of the specific facilities to be established. The associated features are developed at length below.

Commute distance. This is the fundamental characteristic in feature evaluation. In this study, the shortest spatial commute distance between a location and a trajectory (customer) is defined as:

$\displaystyle{d_{s}}({l_{i}},{T_{j}})={\min_{\forall{v_{jk}},{v_{j\iota}}\in{T_{j}}}}\{{d_{s}}({v_{jk}},{l_{i}})+{d_{s}}({l_{i}},{v_{j\iota}})-{d_{a}}({v_{jk}},{v_{j\iota}})\}$ (1)

where ${d_{s}}({v_{jk}},{l_{i}})$ is the shortest commute distance from ${v_{jk}}$ to ${l_{i}}$ on $G$ , and ${d_{a}}({v_{jk}},{v_{j\iota}})$ is the spatial distance from ${v_{jk}}$ to ${v_{j\iota}}$ along ${T_{j}}$ . To illustrate, a customer ${u_{j}}$ deviates from her/his usual trajectory ${T_{j}}$ at ${v_{jk}}$ to reach ${l_{i}}$ when seeking a particular service, and returns to ${T_{j}}$ at ${v_{j\iota}}$ after being served. Hence ${d_{s}}({l_{i}},{T_{j}})$ is the additional spatial distance incurred when ${u_{j}}$ (with trajectory ${T_{j}}$ ) is served by ${l_{i}}$ . Note that ${d_{s}}({v_{i}},{v_{j}})$ is not always equal to ${d_{s}}({v_{j}},{v_{i}})$ in a directed road network. The spatial commute distance characteristic is then defined as:

$\displaystyle{f_{d}}({l_{i}},{u_{j}})=\frac{{\log({d_{s}}({l_{i}},{T_{j}}))}-{\rho_{\min}}}{{\rho_{\max}}-{\rho_{\min}}}$ (2)

where ${\rho_{\max}}=\max(\log({d_{s}}({l_{i}},{T_{j}})))$ , ${\rho_{\min}}=\min(\log({d_{s}}({l_{i}},{T_{j}})))$ , $i\in[1,m]$ , $j\in[1,n]$ . We assume that a route always exists between each location and trajectory in $G$ ; if not, ${\rho_{\max}}$ is set to a default value. Moreover, the road distance between origin and destination is used instead of the Euclidean length, because customers are more sensitive to the actual routes and the distance actually traveled.
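Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that the return vertex is at or after the leaving vertex on the trajectory, that consecutive trajectory vertices are joined by an edge, and that all distances passed to the normalization are positive (the zero-detour case is treated separately in Eq. (10)). The helper names `dijkstra`, `reverse_graph`, `detour_distance` and `normalized_log` are our own.

```python
import heapq
import math


def dijkstra(adj, src):
    """Shortest road distances from src to every reachable vertex."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def reverse_graph(adj):
    """Flip every directed edge, so distances *to* a vertex can be computed."""
    rev = {}
    for u, nbrs in adj.items():
        rev.setdefault(u, {})
        for v, w in nbrs.items():
            rev.setdefault(v, {})[u] = w
    return rev


def detour_distance(adj, loc, traj):
    """Eq. (1): extra distance for leaving traj at v_k, visiting loc, rejoining at v_i."""
    from_loc = dijkstra(adj, loc)               # d_s(loc, v)
    to_loc = dijkstra(reverse_graph(adj), loc)  # d_s(v, loc) in the directed graph
    cum = [0.0]                                 # cumulative distance along traj
    for a, b in zip(traj, traj[1:]):
        cum.append(cum[-1] + adj[a][b])
    best = math.inf
    for k, vk in enumerate(traj):
        for i in range(k, len(traj)):           # assumption: rejoin at or after v_k
            cost = (to_loc.get(vk, math.inf)
                    + from_loc.get(traj[i], math.inf)
                    - (cum[i] - cum[k]))
            best = min(best, cost)
    return best


def normalized_log(distances):
    """Eq. (2): min-max normalization of log distances (all distances > 0)."""
    logs = [math.log(d) for d in distances]
    lo, hi = min(logs), max(logs)
    if hi == lo:                                # assumption: degenerate case maps to 0
        return [0.0 for _ in logs]
    return [(x - lo) / (hi - lo) for x in logs]


# Hypothetical network: trajectory a -> b -> c, location l reachable from b.
example_adj = {"a": {"b": 1.0}, "b": {"c": 1.0, "l": 2.0}, "l": {"c": 2.0}, "c": {}}
example_detour = detour_distance(example_adj, "l", ["a", "b", "c"])
```

In this toy network the customer leaves at b, drives 2 units to l and 2 units on to c instead of the direct 1 unit from b to c, so the detour is 3 units. Running Dijkstra once on the graph and once on its reverse respects the asymmetry of ${d_{s}}$ noted above.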

Social relevancy. To better assess the correlation between ${l_{i}}$ and ${u_{j}}$ in terms of social relevancy, three related factors are evaluated in turn.

The first is the attractiveness effect of the candidate location. Let ${{F_{s}}(u_{j})}$ be the set of customers who have social friend relationships with ${u_{j}}$ , ${u_{j}}\in R$ , and let ${{P_{b}}(l_{i})}$ be the set of customers that have already been evaluated as served by ${l_{i}}$ . The attractiveness between ${u_{j}}$ and ${l_{i}}$ is then defined as:

$\displaystyle{\psi_{l}}({l_{i}},{u_{j}})=\frac{{|\{{u_{k}}|{u_{k}}\in{F_{s}}({u_{j}})\wedge{u_{k}}\in{P_{b}}({l_{i}})\}|+1}}{{|{P_{b}}({l_{i}})|+{\lambda_{l}}}}$ (3)

where ${\lambda_{l}}$ denotes the Laplace smoothing coefficient. The intuition behind the location attractiveness is that the more friends of ${u_{j}}$ have been evaluated as served by ${l_{i}}$ , the more likely ${u_{j}}$ is to be served by ${l_{i}}$ , as a result of friend recommendation through shared electronic red envelopes or coupons.
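As a small worked illustration of Eq. (3), the sketch below revisits the situation of Example 1 with made-up served sets; the function name `location_attractiveness` and the value ${\lambda_{l}}=1$ are assumptions for illustration only.

```python
def location_attractiveness(friends_of_u, served_by_l, laplace=1.0):
    """Eq. (3): smoothed share of already-served customers who are friends of u."""
    shared = len(friends_of_u & served_by_l)
    return (shared + 1) / (len(served_by_l) + laplace)


# Hypothetical data: u6 has one friend, u5, who is already served by l4.
friends_u6 = {"u5"}
psi_l4 = location_attractiveness(friends_u6, {"u5", "u2"})  # one friend among two served
psi_l2 = location_attractiveness(friends_u6, {"u1", "u2"})  # no friend among two served
```

With these sets, $l_{4}$ scores $2/3$ and $l_{2}$ scores $1/3$ , so the tie between equally distant locations in Example 1 is broken in favor of the location that already serves a friend.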

The second is the customer check-in effect with respect to the category of facility planned for this location. Let the specific category be ${\vartheta_{k}}$ , let $\Omega({u_{j}})$ be the set of historical check-in records of ${u_{j}}$ at all POIs in the road network, and let $\Omega({u_{j}},{\vartheta_{k}})$ denote the check-in records posted by ${u_{j}}$ at POIs of category ${\vartheta_{k}}$ . The check-in effect is then defined as:

$\displaystyle{\psi_{u}}({u_{j}},{\vartheta_{k}})=\frac{{|\Omega({u_{j}},{\vartheta_{k}})|}}{{|\Omega({u_{j}})|+{\lambda_{c}}}}$ (4)

where ${\lambda_{c}}$ is also a Laplace smoothing coefficient. The check-in effect captures that a customer is more likely to be interested in a location of category ${\vartheta_{k}}$ if she/he has posted a large number of check-ins at facilities of the same kind.

The last is the attractiveness of the customer. To begin with, the double Pareto lognormal (DPLN) distribution [24], $\textit{DPLN}({\alpha_{r}},{\beta_{r}},{\nu_{r}},{\tau_{r}})$ , is fitted to the overall distribution of social friend relationships over customers. For each ${F_{s}}({u_{j}})$ , a z-score-like standardization is applied as ${F^{\prime}_{s}}({u_{j}})=({F_{s}}({u_{j}})-{\nu_{r}})/{\tau_{r}}$ , and the effect is then defined as:

$\displaystyle{\psi_{u}}({u_{j}})=\frac{{\alpha_{r}}{\beta_{r}}}{{\alpha_{r}}+{\beta_{r}}}*\frac{{F^{\prime}_{s}}({u_{j}})-\min({F^{\prime}_{s}})}{\max({F^{\prime}_{s}})-\min({F^{\prime}_{s}})}$ (5)

where ${\max({F^{\prime}_{s}})}$ and ${\min({F^{\prime}_{s}})}$ denote the maximum and minimum values of ${F^{\prime}_{s}}({u_{j}})$ , respectively, $j\in[1,n]$ . The rationale for the third effect is that if ${u_{j}}$ has more friends in real life, she/he will be more attractive to other customers as a result of influencer marketing.

From these points of view, the overall social relevancy among ${u_{j}}$ , ${l_{i}}$ and the corresponding customers ${{P_{b}}(l_{i})}$ is defined as a simple linear combination of the three sub-features:

$\displaystyle{f_{s}}({l_{i}},{u_{j}})={\alpha_{s}}*{\psi_{l}}({l_{i}},{u_{j}})+{\beta_{s}}*{\psi_{u}}({u_{j}},{\vartheta_{k}})+{\gamma_{s}}*{\psi_{u}}({u_{j}})$ (6)

where ${\alpha_{s}}$ , ${\beta_{s}}$ , and ${\gamma_{s}}$ are factor weights, and ${\alpha_{s}}+{\beta_{s}}+{\gamma_{s}}=1$ .
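The remaining two sub-features and their combination in Eq. (6) can be sketched as follows. This is a hedged illustration: the DPLN parameters are assumed to be fitted beforehand and are passed in as plain numbers, the friend counts stand in for the sizes of ${F_{s}}({u_{j}})$ , the customer attractiveness min-max scales the standardized counts (our reading of Eq. (5)), and the weights and all function names are hypothetical.

```python
def checkin_effect(n_category_checkins, n_all_checkins, laplace=1.0):
    """Eq. (4): smoothed share of a customer's check-ins that fall in the category."""
    return n_category_checkins / (n_all_checkins + laplace)


def customer_attractiveness(friend_counts, customer, alpha_r, beta_r, nu_r, tau_r):
    """Eq. (5), as read here: min-max scaling of z-score-like friend counts."""
    z = {u: (c - nu_r) / tau_r for u, c in friend_counts.items()}
    lo, hi = min(z.values()), max(z.values())
    scaled = 0.0 if hi == lo else (z[customer] - lo) / (hi - lo)
    return alpha_r * beta_r / (alpha_r + beta_r) * scaled


def social_relevancy(psi_l, psi_checkin, psi_customer, weights=(0.5, 0.3, 0.2)):
    """Eq. (6): linear combination; the three weights must sum to 1."""
    a, b, c = weights
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return a * psi_l + b * psi_checkin + c * psi_customer


# Hypothetical inputs.
psi_check = checkin_effect(3, 9)                        # 3 of 9 check-ins in the category
psi_cust = customer_attractiveness(
    {"u1": 1, "u2": 3, "u3": 5}, "u2",
    alpha_r=2.0, beta_r=2.0, nu_r=3.0, tau_r=2.0)       # assumed pre-fitted DPLN values
f_s_example = social_relevancy(2 / 3, psi_check, psi_cust)
```

Because the min-max step is invariant to the shift ${\nu_{r}}$ and the scale ${\tau_{r}}$ , only the factor ${\alpha_{r}}{\beta_{r}}/({\alpha_{r}}+{\beta_{r}})$ carries the DPLN fit into the final value in this reading.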

Temporal conflict. Suppose that ${u_{j}}$ departs from her/his trajectory ${T_{j}}$ toward ${l_{i}}$ at timestamp ${t_{jk}}$ , as in Eq. (1), and arrives at ${l_{i}}$ at timestamp ${\chi_{a}}({u_{j}},{l_{i}})={t_{jk}}+{d_{s}}({v_{jk}},{l_{i}})/{\nu_{j}}$ , where ${\nu_{j}}={d_{s}}({T_{j}})/({t_{\zeta}}-{t_{1}})$ is the average velocity of ${u_{j}}$ along ${T_{j}}$ . Let ${\chi_{d}}({u_{j}},{l_{i}})$ denote the duration for which ${u_{j}}$ is served at ${l_{i}}$ ; it is modeled by a Gaussian distribution ${\mathcal{N}}({\mu_{d}},{\sigma_{d}}^{2})$ , and each ${\chi_{d}}$ is sampled randomly from this distribution according to the facility category. The time interval during which ${u_{j}}$ stays at ${l_{i}}$ is therefore:

$\displaystyle{\varpi_{s}}({u_{j}},{l_{i}})=[{\chi_{a}}({u_{j}},{l_{i}}),{\chi_{a}}({u_{j}},{l_{i}})+{\chi_{d}}({u_{j}},{l_{i}})]$ (7)

Subsequently, considering all the customers that have already been evaluated as served by ${l_{i}}$ , the temporal conflict characteristic is modeled as:

$\displaystyle{f_{t}}({l_{i}},{u_{j}})=\frac{{\sum\limits_{{u_{k}}\in{P_{b}}({l_{i}})}{{\pi_{c}}({\varpi_{s}}({u_{j}},{l_{i}}),{\varpi_{s}}({u_{k}},{l_{i}}))}+\max({\chi_{d}})}}{{|{P_{b}}({l_{i}})|*\max({\chi_{d}})+{\lambda_{t}}}}$ (8)

where ${\pi_{c}}({\varpi_{s}}({u_{j}},{l_{i}}),{\varpi_{s}}({u_{k}},{l_{i}}))$ is the overlap between the time intervals of ${u_{j}}$ and ${u_{k}}$ , and ${\lambda_{t}}$ is also a Laplace smoothing coefficient. The rationale for considering the temporal region is that a customer who staggers her/his serving time with others already evaluated at the same candidate facility will receive more comfortable service.
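The arrival time, the stay interval of Eq. (7), and the conflict score of Eq. (8) can be illustrated with the sketch below. It is an assumption-laden example: service durations are given as fixed numbers instead of being sampled from ${\mathcal{N}}({\mu_{d}},{\sigma_{d}}^{2})$ so that the result is reproducible, ${\lambda_{t}}=1$ is an arbitrary choice, and the function names are ours.

```python
def arrival_time(t_leave, dist_to_loc, traj_length, t_first, t_last):
    """chi_a: leaving timestamp plus travel time at the customer's average speed."""
    speed = traj_length / (t_last - t_first)
    return t_leave + dist_to_loc / speed


def overlap(a, b):
    """pi_c: length of the intersection of two closed time intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def temporal_conflict(interval_u, served_intervals, max_duration, laplace=1.0):
    """Eq. (8): smoothed total overlap with customers already evaluated at the location."""
    total = sum(overlap(interval_u, iv) for iv in served_intervals)
    return (total + max_duration) / (len(served_intervals) * max_duration + laplace)


# Hypothetical numbers: 1000-unit trajectory travelled between t=0 and t=200.
chi_a = arrival_time(t_leave=100.0, dist_to_loc=50.0,
                     traj_length=1000.0, t_first=0.0, t_last=200.0)
stay_u = (chi_a, chi_a + 20.0)                       # Eq. (7) with chi_d = 20
served = [(100.0, 120.0), (125.0, 140.0), (200.0, 220.0)]
f_t_example = temporal_conflict(stay_u, served, max_duration=30.0)
```

Here the customer arrives at time 110, overlaps 10 and 5 time units with two of the three served customers, and obtains a conflict score of $45/91$ .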

Categorical competitiveness. Let ${{\rm B}_{\vartheta}}$ denote the set of brands of the same category as the specified facility; for instance, if the category ${\vartheta}$ is fast food restaurant, then ${{\rm B}_{\vartheta}}$ may contain KFC, McDonald’s, Pizza Hut, and Yon ho, to name a few. There is no need to consider all facilities of the same category in the entire road network, so a global radius parameter ${r_{b}}$ defines the spatial area in which facilities are evaluated, and ${\Lambda_{r}}({b_{\partial}})$ is the set of facilities of brand ${b_{\partial}}\in{{\rm B}_{\vartheta}}$ that already exist or have been established in this area. The effect of categorical competitiveness is then defined as:

$\displaystyle{f_{c}}({l_{i}})=\sum_{{b_{\partial}}\in{{\rm B}_{\vartheta}}}{\nu_{\partial}}\sum_{{\mathchar 22\mskip-10.0mu \lambda_{j}}\in{\Lambda_{r}}({b_{\partial}})}\frac{{r_{b}}-{d_{E}}({l_{i}},{\mathchar 22\mskip-10.0mu \lambda_{j}})}{{r_{b}}}$ (9)

where ${\mathchar 22\mskip-10.0mu \lambda_{j}}$ is a specific existing facility of category ${\vartheta}$ , ${d_{E}}({l_{i}},{\mathchar 22\mskip-10.0mu \lambda_{j}})$ indicates the Euclidean distance between ${l_{i}}$ and ${\mathchar 22\mskip-10.0mu \lambda_{j}}$ , and each brand ${b_{\partial}}$ is associated with a weight ${\nu_{\partial}}$ .

The reason for acknowledging categorical competitiveness is that nearby existing facilities of the same category affect the selection of a candidate location through homogeneous competition. Moreover, different sub-categories (brands) may have different effects; for example, if we plan to establish a KFC store at ${l_{i}}$ , the competitive effect of an adjacent McDonald’s will be far larger than that of a nearby Yon ho.
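Eq. (9) amounts to a brand-weighted sum of linearly decaying proximity scores. The following sketch assumes planar coordinates for the Euclidean distance and treats facilities farther than ${r_{b}}$ as outside ${\Lambda_{r}}$ ; the brand names, weights and coordinates are invented for illustration, and `category_competitiveness` is a hypothetical helper.

```python
import math


def category_competitiveness(loc, facilities_by_brand, brand_weights, radius):
    """Eq. (9): brand-weighted, distance-decayed count of competitors within radius."""
    score = 0.0
    for brand, sites in facilities_by_brand.items():
        weight = brand_weights[brand]
        for site in sites:
            d = math.dist(loc, site)          # Euclidean distance d_E
            if d <= radius:                   # only facilities inside the area count
                score += weight * (radius - d) / radius
    return score


# Hypothetical competitors around a candidate at the origin, radius r_b = 10.
competitors = {"brand_a": [(3.0, 4.0)], "brand_b": [(0.0, 8.0), (30.0, 0.0)]}
weights = {"brand_a": 0.8, "brand_b": 0.2}
f_c_example = category_competitiveness((0.0, 0.0), competitors, weights, radius=10.0)
```

The close, heavily weighted competitor contributes $0.8\times 0.5$ , the lighter one contributes $0.2\times 0.2$ , and the facility 30 units away is ignored, giving 0.44 in total.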

4.2 Popularity quantization

The objective of optimal or top- $k$ location prediction is to determine one or more popular locations that optimize an objective function or certain metrics (location popularity), such as gaining the maximum profit or attracting the maximum number of customers. From this point of view, location popularity indicates the attractiveness of a location to customers. Consequently, using the characteristic features introduced above, the location popularity is quantified as the total service utility of the candidate location to its potentially served customers, where the service utility is a weighted combination of the numeric scores returned by the features.

Definition 5 (Service utility). The service utility between ${l_{i}}$ and ${u_{j}}$ is depicted as:

$\displaystyle{{\varsigma}_{\vartheta}}({l_{i}},{u_{j}})=\left\{\begin{array}{ll}1-{\alpha_{\vartheta}}*{\displaystyle\frac{\min({d_{a}}({v_{j1}},{l_{i}}),{d_{a}}({v_{j\zeta}},{l_{i}}))}{\max({d_{a}}(\Gamma))}},&{d_{s}}({l_{i}},{T_{j}})=0\\ -{\alpha_{\vartheta}}*{f_{d}}({l_{i}},{u_{j}})+{\beta_{\vartheta}}*{f_{s}}({l_{i}},{u_{j}})-{\gamma_{\vartheta}}*{f_{t}}({l_{i}},{u_{j}}),&\text{otherwise}\end{array}\right.$ (10)

where ${\alpha_{\vartheta}}$ , ${\beta_{\vartheta}}$ and ${\gamma_{\vartheta}}$ are characteristic weights with ${\alpha_{\vartheta}}+{\beta_{\vartheta}}+{\gamma_{\vartheta}}=1$ , ${\vartheta\in I}$ denotes the specified category of the facility planned for ${l_{i}}$ , and ${\max({d_{a}}(\Gamma))}$ is the road length of the longest trajectory in $\Gamma$ .

Note that there is a special case in the evaluation of service utility: when ${T_{j}}$ (of ${u_{j}}$ ) passes directly through ${l_{i}}$ , ${d_{s}}({l_{i}},{T_{j}})=0$ and the spatial distance feature cannot discriminate between locations. More specifically, if two locations ${l_{i}}$ and ${l_{j}}$ both lie on a trajectory ${T_{k}}$ , with ${l_{i}}$ in the middle of ${T_{k}}$ and ${l_{j}}$ at its departure position, then ${{\varsigma}_{\vartheta}}({l_{j}},{u_{k}})$ is likely to be greater than ${{\varsigma}_{\vartheta}}({l_{i}},{u_{k}})$ . The reason is that a customer is not noticeably disturbed or interrupted if she/he chooses a facility near the departure or arrival position of her/his route, for example when refueling a vehicle or buying food. Therefore, when ${d_{s}}({l_{i}},{T_{j}})=0$ , the service utility is calculated from the spatial distance along the trajectory between the location and the departure or arrival vertex, and the assessment of ${f_{s}}$ , ${f_{t}}$ and ${f_{c}}$ is neglected.
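The two branches of Eq. (10) can be made concrete with a short sketch. The weights (0.4, 0.4, 0.2), the feature values and the function name `service_utility` are assumptions chosen for illustration; the feature scores ${f_{d}}$ , ${f_{s}}$ and ${f_{t}}$ are taken as already computed.

```python
def service_utility(detour, f_d, f_s, f_t,
                    dist_to_start, dist_to_end, longest_traj,
                    weights=(0.4, 0.4, 0.2)):
    """Eq. (10): utility of a location for a customer under category weights."""
    alpha, beta, gamma = weights
    if detour == 0:
        # Location lies on the trajectory: only its along-trajectory distance
        # to the nearer endpoint matters.
        return 1.0 - alpha * min(dist_to_start, dist_to_end) / longest_traj
    return -alpha * f_d + beta * f_s - gamma * f_t


# Hypothetical trajectory of length 80; the longest trajectory in the data is 100.
at_departure = service_utility(0, 0, 0, 0, dist_to_start=0.0, dist_to_end=80.0,
                               longest_traj=100.0)
at_middle = service_utility(0, 0, 0, 0, dist_to_start=40.0, dist_to_end=40.0,
                            longest_traj=100.0)
off_route = service_utility(3.0, f_d=0.5, f_s=0.6, f_t=0.3,
                            dist_to_start=0.0, dist_to_end=0.0, longest_traj=100.0)
```

As the text argues, the location at the departure vertex (utility 1.0) outranks the one in the middle of the route (0.84), and both outrank the off-route location, whose utility here is slightly negative.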

In the process of service utility assessment, a location serves, within its capacity, the customers who perceive it as their most closely associated facility, while each customer is evaluated against all candidate locations and is eventually served by only one of them. A greater value of ${{{\varsigma}_{\vartheta}}({l_{i}},{u_{j}})}$ indicates that ${l_{i}}$ is more attractive to ${u_{j}}$ than the other candidate locations, i.e., there is a closer connection between ${l_{i}}$ and ${u_{j}}$ and a higher probability that ${u_{j}}$ will be served by ${l_{i}}$, and vice versa. To sum up, popularity is defined from the perspective of the candidate location as follows.

Definition 6 (Location popularity). Location popularity is defined as a linear combination of the total service utility of all customers evaluated as served and the categorical competitiveness, subject to a service capacity that cannot be exceeded.

$\displaystyle{\phi_{\vartheta}}({l_{i}})={\alpha_{\phi}}*\sum_{{u_{j}}\in U^{\prime}}{{\varsigma}_{\vartheta}}({l_{i}},{u_{j}})+{\beta_{\phi}}*{f_{c}}({l_{i}}),\quad U^{\prime}\subseteq U,\ |U^{\prime}|\leqslant{C_{i}}$ (11)

where $\forall{u_{\upsilon}}\in U^{\prime},\forall{u_{l}}\in U-U^{\prime},{{\varsigma}_{\vartheta}}({l_{i}},{u_{\upsilon}})\geqslant{{\varsigma}_{\vartheta}}({l_{i}},{u_{l}})$, i.e., $U^{\prime}$ contains the customers with the largest service utilities for ${l_{i}}$, and ${\alpha_{\phi}}$ and ${\beta_{\phi}}$ are weights with ${\alpha_{\phi}}+{\beta_{\phi}}=1$.
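As an illustration of Eq. (11), the following Python sketch selects the served set $U^{\prime}$ as the customers with the largest service utilities up to the capacity, and then forms the weighted sum. The function name, the dictionary layout and the numeric values are assumptions made only for this example; filling the location up to its capacity whenever enough customers exist is also an assumption of the sketch.

```python
def location_popularity(utilities, capacity, f_c, alpha_phi=0.5, beta_phi=0.5):
    """Popularity of one location following Eq. (11).

    utilities maps a customer id to its service utility for this location.
    Returns the popularity and the served customers U'.
    """
    # U' holds the `capacity` customers with the largest utilities.
    served = sorted(utilities, key=utilities.get, reverse=True)[:capacity]
    total_utility = sum(utilities[u] for u in served)
    return alpha_phi * total_utility + beta_phi * f_c, served


# Illustrative values only.
phi, served = location_popularity({"u1": 0.9, "u2": 0.4, "u3": 0.7},
                                  capacity=2, f_c=0.2, alpha_phi=1.0, beta_phi=0.0)
print(phi, served)
```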

4.3 Formulation of MITLP

With the definitions of ${{\varsigma_{\vartheta}}({l_{i}},{u_{j}})}$ and ${\phi_{\vartheta}}({l_{i}})$, the model of Multi-characteristic Information based Top-$k$ Location Prediction (MITLP) is formally stated.

Definition 7 (MITLP). Given a query with parameters ${\vartheta}$ and $k$, and the sets ${G}$, ${U}$, ${\Gamma}$, ${L}$, ${C}$, ${\rm I}$ and $R$, MITLP seeks the optimal candidate location set ${\Im_{k}}$, ${\Im_{k}}\subseteq L,|{\Im_{k}}|=k$, that has the maximum location popularity, i.e., ${\Phi_{\Im}}=\arg\max\sum_{i=1}^{k}{{\phi_{\vartheta}}({l_{i}})},{l_{i}}\in{\Im_{k}}$. In other words, MITLP can also be regarded as the top-$k$ location recommendation problem for facility placement.

Example 2. Fig. 1 shows the candidate locations $L=\{{l_{1}},{l_{2}},{l_{3}},{l_{4}}\}$, the set of customers $U=\{{u_{1}},{u_{2}},{u_{3}},{u_{4}},{u_{5}},{u_{6}}\}$, the set of corresponding trajectories $\Gamma=\{{T_{1}},{T_{2}},{T_{3}},{T_{4}},{T_{5}},{T_{6}}\}$, and the friend relationships. Suppose the set of location capacities is $C=\{4,4,3,3\}$, ${\alpha_{s}}={\beta_{s}}={\gamma_{s}}={\alpha_{\vartheta}}={\beta_{\vartheta}}={\gamma_{\vartheta}}=1/3$, ${\alpha_{\phi}}=1$ and ${\beta_{\phi}}=0$, each edge in the road network is 500 meters long, and all six trajectories depart at 7:00 a.m. If $k=2$, MITLP returns ${\Im_{2}}=\{{l_{2}},{l_{4}}\}$, since ${l_{2}}$ and ${l_{4}}$ have the maximum service utilities in terms of serving customers $\{{u_{1}},{u_{3}},{u_{4}}\}$ and $\{{u_{2}},{u_{5}},{u_{6}}\}$ respectively. It can also be seen that ${u_{3}}$ will be served by ${l_{1}}$ since it is located close to the starting position of ${u_{3}}$.

5. Model implementation

Similar to the set cover problem [25], a general implementation of MITLP is time-consuming and difficult to learn. Therefore, it is inapplicable to real metropolitan road networks or to large ${U}$ and ${L}$, especially for dynamic queries with varying $k$ or $\vartheta$. Nevertheless, we observe that candidate locations in close proximity tend to serve many of the same customers if they have sufficient serving capacities. When $k\ll m$ and ${c_{i}}\ll n,i\in[1,m]$, the facility locations recommended by MITLP therefore keep a certain spatial distance from each other in real urban planning scenarios. This observation inspires us to design a B-tree-like hybrid framework, named CLSTC-forest, which organizes candidate locations, customers and their corresponding spatio-temporal trajectories according to service utilities, location popularity and capacity constraints. With this framework, prediction can be conducted easily.

5.1 CLSTC-forest (tree)

The CLSTC-tree and the CLSTC-forest are introduced as follows.

Definition 8 (CLSTC-tree). The CLSTC-tree is a B-tree-like hybrid framework that integrates candidate locations, spatio-temporal trajectories and customers based on the corresponding location popularity and capacity constraints. Each tree node ${o}$ holds a candidate location ${l_{i}}$ and a set of predicted customers ${U({l_{i}})}$. In addition, a tree node stores the node label (representation) ${l_{i}}$, the location popularity ${\phi_{\vartheta}}({l_{i}})$, and the set of corresponding service utilities ${\varsigma_{\vartheta}}({l_{i}},{u_{j}}),{u_{j}}\in{U({l_{i}})}$. It can also be regarded as a full binary tree, because each non-leaf node has exactly two children.

Specifically, a node in a CLSTC-tree reflects the fact that a particular sub-area of the road network is represented by its label ${l_{i}}$. Similar to clusters in $k$-medoids, each location and each customer appears only once among all leaf nodes. However, the non-leaf nodes of a CLSTC-tree differ from leaf nodes: the label of a non-leaf node is the same as the label of one of its two child nodes, and its customers are a subset of the union of the customers of its two child nodes, within its service capacity. Moreover, its location popularity and the corresponding service utilities are updated because the customer set changes.

Definition 9 (CLSTC-forest). The CLSTC-forest is the set of separate CLSTC-trees generated from a large spatial area. Note that a CLSTC-forest corresponds to only one specified facility category.

Figure 3.

A simple CLSTC-tree.

Example 3. The candidate locations, trajectories and customers, as well as their social relationships, are shown in Fig. 1. According to Definition 8 and Example 2, a representative CLSTC-tree is shown in Fig. 3, where $\phi$ and $\varsigma$ of each tree node are omitted. It can be observed that the root node ${l_{2}}$ represents the entire spatial area within its service capacity of ${c_{2}}=4$.

5.2 Generation processes

A bottom-up approach is employed to generate the CLSTC-forest (trees). The overall generation process is presented in Fig. 4 and includes two steps: 1) leaf node generating; and 2) non-leaf node combining. A CLSTC-tree is complete when its root node cannot be combined any further.

5.2.1 Leaf node generating

Suppose that a customer is served by no more than one candidate location (facility) at a time. Under this condition, we bundle candidate locations and their corresponding customers into leaf nodes by means of a constrained $k$-medoids-like clustering algorithm.

More specifically, the service utility ${\varsigma_{\vartheta}}({l_{i}},{u_{j}})$ of each pair of location and customer is first pre-calculated through Eq. (10). Then, taking the candidate locations as ‘medoids’ and the service utility, together with categorical competitiveness, as the metric, the constrained $k$-medoids-like clustering method iteratively assigns each customer to its nearest ‘medoid’. Furthermore, the sizes of the final clusters are constrained by the facility capacities ${C}$. For instance, if the candidate location with the maximum value of ${\varsigma_{\vartheta}}$ for a customer has already reached its capacity limit, it cannot accommodate (serve) this customer, and the customer is re-evaluated against the remaining locations with available capacity until the one with the largest service utility is identified.

After this process, the resulting cluster set $Ln$ ($|Ln|=m$) is taken as the leaf nodes of the CLSTC-trees; the ‘medoid’ of each cluster is the tree node label, and the location popularity is calculated by Eq. (11). If $\sum_{i=1}^{m}{{c_{i}}}<n$, at least $n-\sum_{i=1}^{m}{{c_{i}}}$ customers are not covered by any candidate location; this case is allowed in MITLP.
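The capacity-constrained assignment described above can be sketched in Python as follows. This is a simplified greedy illustration rather than our full clustering procedure: it assumes the utilities of Eq. (10) are already stored in a dictionary, and it visits location-customer pairs from the highest utility downwards, which is an ordering chosen for this example only. All names are hypothetical.

```python
def generate_leaf_nodes(utility, capacity):
    """Assign each customer to at most one location without exceeding capacities.

    utility maps (location, customer) to the service utility of Eq. (10);
    capacity maps a location to its capacity c_i.
    Returns a dict: location -> list of served customers (one leaf node each).
    """
    leaves = {loc: [] for loc in capacity}
    assigned = set()
    # Visit pairs from the highest utility downwards (an assumption of this sketch).
    ranked = sorted(utility.items(), key=lambda item: item[1], reverse=True)
    for (loc, cust), _ in ranked:
        if cust in assigned or len(leaves[loc]) >= capacity[loc]:
            continue  # customer already served, or location already full
        leaves[loc].append(cust)
        assigned.add(cust)
    return leaves


# Illustrative utilities: l1 is preferred by everyone but can serve only one customer.
utility = {("l1", "u1"): 0.9, ("l1", "u2"): 0.8, ("l1", "u3"): 0.7,
           ("l2", "u1"): 0.5, ("l2", "u2"): 0.3, ("l2", "u3"): 0.1}
leaves = generate_leaf_nodes(utility, {"l1": 1, "l2": 2})
print(leaves)
```

Customers that find their preferred location full fall back to the best remaining location, and customers are left unassigned when the total capacity is smaller than the number of customers.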

Figure 4.

The flowchart of generation processes (model learning).

5.2.2 Non-leaf node combining

In this step, the CLSTC-trees are constructed from the bottom leaf nodes to the top root nodes through a sequence of merging operations, after which the CLSTC-forest is formed automatically.

Each leaf node generated in the former step is itself a simple CLSTC-tree with a single root node. From this point of view, two CLSTC-trees can be merged into one by using the information of their root nodes. The process of forming one parent node from two child nodes is referred to as node combining, which is a crucial step in generating CLSTC-trees (forest). First of all, the correlation coefficients of tree nodes and of CLSTC-trees are defined.

Definition 10 (Correlation coefficient). The correlation coefficient between ${o_{i}}$ and ${o_{j}}$ is the gain obtained by combining the two nodes under a new parent node. In other words, it is the greater of the two location popularities obtained when each of the two candidate locations of the tree nodes is evaluated over the customers served by both nodes, up to its capacity. It is formally given as follows.

$\displaystyle{\eta_{n}}({o_{i}},{o_{j}})=\max({\phi_{\vartheta}}({l_{oi}}),{\phi_{\vartheta}}({l_{oj}})),\quad{U_{\eta}}\subseteq({U_{oi}}\cup{U_{oj}})$ (12)

where ${l_{oi}}$ represents the label (candidate location) of tree node ${o_{i}}$, and ${U_{oi}}$ is the set of customers evaluated for ${l_{oi}}$.

In this context, the correlation coefficient ${\eta_{t}}$ between two CLSTC-trees equals that of their two root nodes. A greater value indicates a closer relevancy between the two CLSTC-trees, such as intimate social relationships between the two sets of customers, or a short spatial distance between the two candidate locations in the road network. As a result, two CLSTC-trees can be combined according to their correlation coefficient. For the newly created parent node, the representation is the candidate location that attains the larger value in Eq. (12), the corresponding customer set is ${U^{\prime}_{n}}$, which is a subset of ${U_{\eta}}$ that does not exceed the capacity limit, and the popularity is equal to ${\eta_{t}}$. A new CLSTC-tree is then formed, in which the new parent node is the root node and the two merged CLSTC-trees are its sub-trees.
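The combination of two root nodes can be sketched in Python as below. The sketch evaluates Eq. (11) for both labels over the pooled customers and keeps the label with the larger popularity, as in Eq. (12). The `Node` class, the dictionary layout and the example values are hypothetical, and any rule that forbids a merge because of capacity limits is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    label: str          # candidate location that represents the node
    customers: list     # customers served by this node
    popularity: float   # location popularity of the label
    children: tuple = ()


def popularity(label, pool, utility, capacity, f_c, alpha_phi, beta_phi):
    """Eq. (11) for `label` over the pooled customers, limited by its capacity."""
    ranked = sorted(pool, key=lambda u: utility[(label, u)], reverse=True)
    served = ranked[:capacity[label]]
    score = alpha_phi * sum(utility[(label, u)] for u in served) + beta_phi * f_c[label]
    return score, served


def combine(a, b, utility, capacity, f_c, alpha_phi=0.5, beta_phi=0.5):
    """Create the parent of nodes a and b; its popularity follows Eq. (12)."""
    pool = sorted(set(a.customers) | set(b.customers))
    candidates = [popularity(lbl, pool, utility, capacity, f_c, alpha_phi, beta_phi) + (lbl,)
                  for lbl in (a.label, b.label)]
    score, served, label = max(candidates, key=lambda c: c[0])
    return Node(label, served, score, (a, b))


# Illustrative values only.
utility = {("l1", "u1"): 0.9, ("l1", "u2"): 0.8, ("l1", "u3"): 0.7,
           ("l2", "u1"): 0.5, ("l2", "u2"): 0.6, ("l2", "u3"): 0.4}
parent = combine(Node("l1", ["u1"], 0.9), Node("l2", ["u2", "u3"], 1.0),
                 utility, capacity={"l1": 1, "l2": 3}, f_c={"l1": 0.0, "l2": 0.0},
                 alpha_phi=1.0, beta_phi=0.0)
print(parent.label, parent.customers, parent.popularity)
```

In this example ${l_{2}}$ becomes the parent label because its larger capacity lets it serve all three pooled customers.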

Nevertheless, there is no need to assess the correlation of every pair of CLSTC-trees during combining, since two candidate locations that are far apart are unlikely to share many closely related customers. From this point of view, the combination threshold between two tree nodes is set to ${\Delta_{n}}=\frac{{\max({d_{E}}({l_{oi}},{l_{oj}}))}}{2}$, where ${\Delta_{n}}$ is the upper bound on the distance at which two nodes (CLSTC-trees) can be combined; thus we only calculate the correlations of the pairs with ${d_{E}}({l_{oi}},{l_{oj}})\leqslant{\Delta_{n}}$. For simplicity, a global combination threshold is used, i.e., all CLSTC-trees share the same ${\Delta_{n}}$.
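The threshold-based filtering can be illustrated with the short Python sketch below. It assumes that each root label has planar coordinates in meters and that at least two labels are given; the function name and coordinates are hypothetical. Note that `math.dist` requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist


def candidate_pairs(coords):
    """Return the global threshold and the label pairs within it.

    coords maps a root label to its (x, y) position in meters.
    """
    distances = {(a, b): dist(coords[a], coords[b])
                 for a, b in combinations(sorted(coords), 2)}
    delta = max(distances.values()) / 2  # half of the largest pairwise distance
    pairs = [pair for pair, d in distances.items() if d <= delta]
    return delta, pairs


delta, pairs = candidate_pairs({"l1": (0, 0), "l2": (300, 0), "l3": (1000, 0)})
print(delta, pairs)
```

Only the pairs returned here would have their correlation coefficients evaluated in a combining round.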

After one round of node combining, new root (parent) nodes have been created and new CLSTC-trees have been formed. The newly formed CLSTC-trees are then taken as the inputs to the next round of combining, and the process is repeated until no CLSTC-trees remain to be combined, at which point the CLSTC-forest is established.

5.3 Model learning

According to Eq. (11) and Definition 8, the output score ${\phi_{\vartheta}}({l_{i}})$ is a linear combination of inputs from multiple characteristics; accordingly, several kinds of regression algorithms could be deployed to learn the parameters. In this study, a linear regression with regularization is used, whose goal is to minimize the error ${e_{M}}$ between the ground-truth facility locations and the recommended locations returned by the CLSTC-forest. Suppose that the overall parameters are denoted as ${\theta^{\vartheta}_{M}}({\alpha_{s}},{\beta_{s}},{\gamma_{s}},{\nu_{\partial}},{\alpha_{\vartheta}},{\beta_{\vartheta}},{\gamma_{\vartheta}},{\alpha_{\phi}},{\beta_{\phi}})$; then the corresponding objective function is defined as:

$\displaystyle\min_{{\theta^{\vartheta}_{M}}}\sum_{i=1}^{k}{d_{E}}^{2}({l_{pi}},{l_{ri}})+{\gamma_{e}}\|{\theta^{\vartheta}_{M}}\|^{2}$ (13)

where ${\gamma_{e}}$ is the regularization parameter, set to ${10^{-8}}$ as in [18], ${l_{pi}}$ is the predicted location, and ${l_{ri}}$ is the corresponding ground-truth (testing) location.
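For concreteness, the value of the objective in Eq. (13) for one parameter vector can be computed as in the Python sketch below. It assumes that predicted and ground-truth locations are already paired and given as planar coordinates in meters; the function name and the sample values are hypothetical, and the optimization procedure itself is not shown.

```python
from math import dist


def objective(predicted, ground_truth, theta, gamma_e=1e-8):
    """Value of Eq. (13): squared Euclidean errors plus an L2 penalty on theta."""
    squared_error = sum(dist(p, r) ** 2 for p, r in zip(predicted, ground_truth))
    penalty = gamma_e * sum(t * t for t in theta)
    return squared_error + penalty


# Illustrative values; a large gamma_e is used only to make the penalty visible.
value = objective([(0, 0), (3, 4)], [(0, 0), (0, 0)], theta=[1.0, 2.0], gamma_e=0.5)
print(value)
```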

We can see that ${\theta^{\vartheta}_{M}}$ affects the whole construction of the framework, in leaf node generating and non-leaf node combining according to Eqs (11) and (12) respectively, while different parameters have different effects on the corresponding characteristics. For instance, ${\alpha_{\vartheta}}$ controls the spatial commute distance between customers and locations: as it increases, customers are more likely to be served by nearby locations. ${\beta_{\vartheta}}$ controls the importance of social relevancy and ${\gamma_{\vartheta}}$ controls the magnitude of temporal regions: as ${\beta_{\vartheta}}$ rises, customers are more strongly influenced by their close friends and tend to be served by more diverse facilities (not necessarily the nearest ones), and ${\gamma_{\vartheta}}$ has an analogous effect in terms of service periods. Besides, ${\varsigma_{\vartheta}}$ is negatively related to ${f_{d}}$, because a larger distance generally means that a location is less attractive to a customer, and ${f_{t}}$ has the same effect. The details, including the other parameters, are demonstrated in the experiments.

Note that actual facilities are almost never established exactly at the centers of locations (road intersections); that is, the two rarely share the same geographic position. Therefore, we regard ${l_{pi}}$ and ${l_{ri}}$ as identical when ${d_{E}}({l_{pi}},{l_{ri}})\leqslant{\tau_{E}}$, where ${\tau_{E}}$ is set to 200 meters as in most other models.

5.4 Location prediction

After model implementation, each CLSTC-tree is well organized, and its tree nodes represent both the candidate locations and the corresponding customers of a spatial area. It is important to note that the higher the level of a node in the CLSTC-tree, the more customers its location serves and the larger the spatial area it covers. Moreover, candidate locations with their corresponding customers in different CLSTC-trees are apart from each other in terms of spatial distance, social relevancy, temporal region, and so on. On this basis, the top-$k$ locations can be selected easily from the tree nodes at the top levels of the CLSTC-forest in most prediction circumstances, as $k\ll m$.

To ease the explanation of querying, the tree levels of the CLSTC-forest are marked and put into a set $Ls$ in a top-down fashion. In more detail, assume that the highest level among all CLSTC-trees is $\hbar$, where $\hbar\leqslant\log(\lfloor{|Ln|}\rfloor+1)$. First, all root nodes of the CLSTC-trees are marked as ${ls_{\hbar}}$, and then each child of a root node (if it exists) is marked as ${ls_{\hbar-1}}$. In general, when the tree nodes of one level are marked as ${ls_{i}}$, their child nodes are all marked as ${ls_{i-1}}$ and their parent nodes as ${ls_{i+1}}$. The process is repeated until no node is left (i.e., down to ${ls_{1}}$), and then all tree level marks with their corresponding tree nodes are inserted into the mark set $Ls$. For two adjacent tree level sets ${ls_{i}}$ and ${ls_{i-1}}$, the numbers of tree nodes satisfy $|{ls_{i-1}}|\in[2,2*|{ls_{i}}|]$ owing to the structure of the CLSTC-forest. A straightforward example of tree level marking is shown in Fig. 3.
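The top-down level marking can be sketched in Python as follows. The forest is assumed to be given as a list of root node identifiers and a dictionary that maps a node to its children; the identifiers such as `l2@3` are hypothetical names that combine a label with a level so that nodes sharing a label stay distinguishable.

```python
def mark_levels(roots, children):
    """Mark tree levels top-down; returns a dict: level mark -> list of nodes."""
    def height(node):
        return 1 + max((height(c) for c in children.get(node, ())), default=0)

    mark = max(height(r) for r in roots)  # the highest level among all trees
    levels, frontier = {}, list(roots)
    while frontier:
        levels[mark] = frontier
        # Children of the nodes marked ls_i are marked ls_(i-1).
        frontier = [c for node in frontier for c in children.get(node, ())]
        mark -= 1
    return levels


# A single three-level tree with four leaves (illustrative structure).
children = {"l2@3": ("l2@2", "l4@2"),
            "l2@2": ("l1@1", "l2@1"),
            "l4@2": ("l3@1", "l4@1")}
levels = mark_levels(["l2@3"], children)
print(levels)
```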

Algorithm 1. Top-$k$ locations predicting algorithm
Input: query parameters $k,{\vartheta}$
Output: $k$ candidate location set ${\Im_{k}}$
select the corresponding ${Ls}$ by ${\vartheta}$
for all ${ls_{i}}\in{Ls}$ do
    if there exists $|{ls_{i}}|=k$ then
        insert all candidate locations in ${ls_{i}}$ into ${\Im_{k}}$ and break
    else
        choose two mark sets where $|{ls_{i}}|<k<|{ls_{i-1}}|$
        select $k$ distinct candidate locations that meet $\max_{i=1}^{k}{\phi_{\vartheta}}({l_{i}})$
        insert them into ${\Im_{k}}$ and break
return ${\Im_{k}}$

Based on the CLSTC-forest and $Ls$, the querying process of constrained top-$k$ candidate location prediction (recommendation) is carried out once $k$ and ${\vartheta}$ are given. To begin with, given the parameter ${\vartheta}$, the corresponding mark set ${Ls_{\vartheta}}$ is chosen, since one facility category is related to exactly one CLSTC-forest. For the specified $k$, the tree level mark with $|{ls_{i}}|=k$ is selected if it exists, and the $k$ candidate locations in the nodes of ${ls_{i}}$ are the results. If the target mark does not exist, the two marks ${ls_{i}}$ and ${ls_{i-1}}$ with $|{ls_{i}}|<k<|{ls_{i-1}}|$ are chosen, and the $k$ distinct candidate locations with the maximum sum of location popularity are selected from the nodes in both $ls_{i}$ and $ls_{i-1}$ by simple exhaustive search. Note that a node in level $ls_{i}$ and one of its two child nodes (in level $ls_{i-1}$) of the same CLSTC-tree share one label (candidate location), so they should not be chosen into the results together. The pseudo-code of the prediction is given in Algorithm 1.
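A minimal Python sketch of this querying process is given below. It assumes that each level mark maps to a list of (label, popularity) pairs, and it enforces only the distinct-label requirement during the exhaustive search; any stricter rule on which nodes may be selected together is not modeled. The function name and the popularity values are hypothetical.

```python
from itertools import combinations


def top_k(levels, k):
    """Return k candidate locations from the level marks of one CLSTC-forest.

    levels maps a level mark to a list of (label, popularity) pairs;
    a larger mark is closer to the root nodes.
    """
    # Case 1: some level contains exactly k nodes.
    for nodes in levels.values():
        if len(nodes) == k:
            return sorted(label for label, _ in nodes)
    # Case 2: exhaustive search over two adjacent levels that bracket k.
    for i in sorted(levels, reverse=True):
        if i - 1 in levels and len(levels[i]) < k < len(levels[i - 1]):
            pool = levels[i] + levels[i - 1]
            valid = (c for c in combinations(pool, k)
                     if len({label for label, _ in c}) == k)  # distinct labels only
            best = max(valid, key=lambda c: sum(p for _, p in c))
            return sorted(label for label, _ in best)
    return []  # no suitable level exists for this k


# Illustrative popularity values only.
levels = {3: [("l2", 3.0)],
          2: [("l2", 2.0), ("l4", 1.8)],
          1: [("l1", 0.9), ("l2", 1.1), ("l3", 0.8), ("l4", 1.0)]}
print(top_k(levels, 2), top_k(levels, 3))
```

The exhaustive step enumerates all $k$-combinations of the two levels, which stays cheap only because adjacent levels near the top of the forest are small when $k\ll m$.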

Example 4. As described in Fig. 3, there is only one CLSTC-tree (which is also a CLSTC-forest). If ${c_{2}}=3$, the two leaf nodes ${l_{1}}$ and ${l_{2}}$ at Level 1 are still merged into a new node, but ${l_{2}}$ and ${l_{4}}$ at Level 2 cannot be further merged into a new parent node (presented as Level 3) due to the capacity constraints of ${c_{2}}$ and ${c_{4}}$. Therefore, the sub-trees become two separate CLSTC-trees, which are illustrated as two dotted ellipses. For querying, ${\Im_{k}}=\{{l_{2}}\}$ is obtained when $k=1$, and ${\Im_{k}}=\{{l_{2}},{l_{4}}\}$ is returned when $k=2$. If $k=3$, the set with the greater value of ${\Phi_{\Im}}$ between $\{{l_{2}},{l_{3}},{l_{4}}\}$ and $\{{l_{1}},{l_{2}},{l_{4}}\}$ is the final result.

6. Experimental evaluation

In this section, we experimentally evaluate the efficiency and effectiveness of MITLP. We first describe the datasets and environment, then list the baseline methods and experiment settings. Next, the top-$k$ querying efficiency and the CLSTC-forest structure are reported. Lastly, the effectiveness is examined and a case study is carried out.

6.1 Datasets and environment

6.1.1 Datasets

Widely used datasets of Shanghai and Beijing, the two largest cities in China, are employed in this study. Taking the Beijing datasets as an example, the details of candidate locations, trajectories, customers with their corresponding social relationships, POI check-ins and ground-truth facilities are introduced as follows.

The intersections ${V_{g}}$ in the urban road network are used to denote the candidate locations for facility placement, and trajectories are collected from automobile traces. To better conduct the experiments, only trajectories that contain at least 10 GPS check-in points with traveling intervals from 15 minutes to 120 minutes are selected. In addition, customers with their corresponding social friend relationships and POI check-ins are extracted from Sina WeiBo (https://www.weibo.com/), where two customers following each other indicates that they are close friends.

However, since it is difficult in reality to capture a large number of customers together with their daily trajectories, we adopt the algorithm for discovering Popular Routes in [26] to simulate the generation of customers with respect to trajectories from existing real datasets. The reason is that historical traveling experiences indicate how customers usually choose routes between spatial locations. The datasets of Shanghai are prepared in the same way, and Table 2 gives an overview of the two datasets.

Table 2

Statistics of two metropolises

	Shanghai	Beijing
# of intersections	333,766	171,186
# of road segments	440,922	226,237
# of customers	230,303	412,032
# of trajectories	230,303	412,032
# of POI check-ins	1,233,700	4,068,215
# of candidate locations	333,766	171,186
# of social relationships	13,687,459	22,139,861

We collect the POI (facility) categories from the open LBS platform of AutoNavi (https://lbs.amap.com/), which consist of 23 main items such as automobile service, food & beverages and transportation service, with a total of 264 mid-categories and 868 sub-categories. Besides, four popular types of existing facilities (sub-categories) as of June 30, 2019, together with their geographical positions, are prepared for both model training and model testing in Shanghai and Beijing: Leisure Food Restaurant (LFR), Baby Service Place (BSP), China Unicom Service Hall (CUSH), and fast Vehicle Charging Station (VCS). The corresponding summaries are reported in Table 3.

Table 3

Summaries of four facilities

	LFR	BSP	CUSH	VCS
# in Shanghai	313	460	535	868
# in Beijing	245	499	686	869
Average capacities	100	30	50	50
Service durations (minutes), $({\mu_{d}},{\sigma_{d}}^{2})$	(85, $19^{2}$)	(42, $7^{2}$)	(16, $5^{2}$)	(117, $23^{2}$)

6.1.2 Experiment environment

All approaches are implemented in Python (version 3.6.0), and all experiments are performed on a machine with a 64-bit Intel(R) Xeon(R) E5-2630 v2 2.60 GHz CPU (24 cores), 256 GB RAM and a 1 TB hard disk, running CentOS Linux release 7.4.1708 (Core).

6.2 Baselines

To the best of our knowledge, no existing study has addressed constrained top-$k$ candidate location prediction by exploiting customer trajectories and social relationships in a city-scale road network. Therefore, in this study, several competitive methods are adopted as benchmarks with minor modifications.

• $k$-Medoids. A special form of $k$-Medoids in which the medoids are represented by candidate locations and Eq. (11) is used as the metric. It is similar to the process of leaf node generation in MITLP, but without parameter learning.
• SERM. SERM [9] is a recurrent model for next optimal location prediction in semantic trajectories, where the timestamps, spatial positions and contents of trajectories are integrated into a recurrent layer. In this study, we extract the most POI-related textual contents posted by customers on Sina WeiBo, and then extend the algorithm to top-$k$ location prediction.
• API. The API model [10] predicts location popularity based on Gaussian random fields. This approach assumes that some facilities already exist, and its main input is a set of POI check-ins rather than trajectories. For comparison, the assumption of existing facilities is dropped and the corresponding feature is re-evaluated.
• NetClus. The NetClus framework [6] is the implementation of the TOPS model, which only considers the spatial distances between locations and customer trajectories. This work is the most closely related to ours, but the remaining principal factors, such as facility categories or customer friend relationships, are not considered at all; therefore, we compare with it directly.

Moreover, verifying the effectiveness by actually establishing facilities of various categories in metropolises is impractical. As a result, to simulate the process of facility placement, we carry out all experiments with respect to several popular existing facilities, as adopted by most recent studies [18, 14, 8, 9, 10, 6, 21].

6.3 Basic settings

To conduct model learning and experimental evaluation efficiently, we first initialize a series of primary parameters. The average capacities of the four existing facilities, listed in Table 3, are obtained from field investigations in both evaluated cities, and the three Laplace smoothing coefficients ${\lambda_{l}}$, ${\lambda_{c}}$ and ${\lambda_{t}}$ are all set to 1. Besides, the parameters of $\textit{DPLN}({\alpha_{r}},{\beta_{r}},{\nu_{r}},{\tau_{r}})$ are obtained from $U$ and $R$ as 0.63, 0.59, 5.02, and 0.01 respectively, the service durations ${{\mathcal{N}}}({\mu_{d}},{\sigma_{d}}^{2})$ of each category are collected through on-the-spot investigation and are also given in Table 3, and ${r_{b}}$ is set to 1,000 meters by default, following the urban planning community. With regard to the weight parameters, the default values of ${\alpha_{\phi}}$ and ${\beta_{\phi}}$ are 0.5, and ${\alpha_{s}},{\beta_{s}},{\gamma_{s}},{\alpha_{\vartheta}},{\beta_{\vartheta}}$, and ${\gamma_{\vartheta}}$ are all equal to $1/3$ at the beginning of model learning and in $k$-Medoids as well. Meanwhile, $k$ is set to 20, 50, 80, and 100 separately for querying.

For each specified facility category in a city, the corresponding datasets (trajectories, customers and social relationships) are randomly divided into a training part and a testing part, which consist of 80% and 20% of the whole datasets respectively. Furthermore, each experiment is repeated 10 times and the average results are reported. Multi-process programming is used to accelerate the evaluations, with a total of 20 CPU cores.

6.4 Performance in efficiency

6.4.1 Evaluation metrics

We evaluate efficiency by the running time of model learning (or index construction) for all competitors. In addition, to assess facility capacities and visualize the structures of the CLSTC-forest, the numbers of non-full/full leaf and root CLSTC-tree nodes and the maximum tree heights are reported.

6.4.2 Experimental results

The running time of the five models for the different facility categories and the two urban datasets is depicted in Fig. 5, in which Fig. 5a–d show the results for Shanghai and Fig. 5e–h those for Beijing. It can be observed that time consumption generally shows an upward trend. NetClus, SERM and $k$-Medoids spend less time on model learning than API and MITLP, because API and MITLP consider more characteristics than the others, which makes the models more sophisticated and more time-consuming to learn. Furthermore, although MITLP addresses the capacity constraint and the facility category for the first time, its implementation is lightweight and efficient compared to API. In contrast, NetClus only examines the spatial distances between trajectories and candidate locations, which causes a large number of indexes to be constructed. $k$-Medoids does not learn the parameters at all, so its time cost is relatively low. Furthermore, the Beijing dataset is larger than the Shanghai dataset in terms of trajectories (customers) and social relationships, but the number of candidate locations in Beijing is smaller than that in Shanghai. The two factors jointly affect the efficiency; therefore, the learning time on Shanghai is in general slightly less than that on Beijing.

Table 4

CLSTC-forest collections in Shanghai

Facilities	Heights	Leaf nodes (non-full)	Leaf nodes (full)	Root nodes (non-full)	Root nodes (full)
BSP	8	59,413	3,371	0	13,443
CUSH	8	67,179	1,587	0	11,207
VCS	9	68,395	1,650	0	10,091
LFR	10	77,940	569	0	6,293

Table 5

CLSTC-forest collections in Beijing

Facilities	Heights	Leaf nodes (non-full)	Leaf nodes (full)	Root nodes (non-full)	Root nodes (full)
BSP	7	39,072	2,361	0	7,916
CUSH	8	44,139	1,187	0	6,214
VCS	8	45,903	968	0	6,001
LFR	9	51,065	283	0	4,927

Figure 5.

Performance on model learning (hours). (a)–(d) In Shanghai, (e)–(h) in Beijing.

Tables 4 and 5 give a brief insight into the maximum tree heights, as well as the numbers of non-full/full tree nodes in the bottom level (leaf nodes) and the top level (root nodes) of the CLSTC-forest for the four specific facilities. Since different facilities have distinct capacities, the maximum height of the CLSTC-forest rises as the capacity increases, while the number of full tree nodes decreases in both leaf nodes and root nodes. The reason is that candidate locations (facilities) in tree nodes fill up more quickly when the capacity is small, so the number of CLSTC-trees (root nodes) also grows, and vice versa. Meanwhile, the height of every CLSTC-tree is also positively correlated with the facility capacities.

6.5 Performance in effectiveness

6.5.1 Evaluation metrics

Two metrics, Precision and Root Mean Square Error (RMSE), are adopted for effectiveness evaluation. On one hand, precision is one of the most important traditional evaluation metrics in top-$N$ recommendation. Suppose that ${\Im^{\vartheta}}$ is the set of top-$k$ querying results for a specific category $\vartheta$ obtained from the testing data, and $L^{\vartheta}$ is the set of corresponding facilities of the same category that already exist in the road network; then the precision is given as follows:

$\displaystyle{{\rm P}_{k}}=\frac{\sum\nolimits_{i=1}^{k}\textit{hit}({\Im^{\vartheta}_{i}},L^{\vartheta}_{i})}{k}$ (14)

where $\textit{hit}({\Im^{\vartheta}_{i}},L^{\vartheta}_{i})=1$ indicates that there is a corresponding facility $L^{\vartheta}_{i}$ that satisfies ${d_{E}}({\Im^{\vartheta}_{i}},L^{\vartheta}_{i})\leqslant{\tau_{E}}$, with ${\Im^{\vartheta}_{i}}\in{\Im^{\vartheta}}$ and $L^{\vartheta}_{i}\in L^{\vartheta}$. Note that a recommended location is mapped to its nearest facility in a one-to-one manner; therefore, ${L^{\vartheta}}$ is also regarded as the validation dataset.

On the other hand, to better verify the effectiveness of candidate location prediction, RMSE is also adopted to measure the deviations between the recommended locations and the ground-truth facilities, because facilities are almost never established exactly at the candidate locations. RMSE is defined as:

$\displaystyle\textit{RMSE}_{k}=\sqrt{\frac{\sum\nolimits_{i=1}^{k}\min{({d_{E}}({\Im^{\vartheta}_{i}},L^{\vartheta}_{i}))}^{2}}{k}}$ (15)

where every pair of ${\Im^{\vartheta}_{i}}$ and ${L^{\vartheta}_{i}}$ is also handled only once and there is no constraint on ${d_{E}}({\Im^{\vartheta}_{i}},L^{\vartheta}_{i})$ .

A larger precision and a smaller RMSE imply that the recommended locations are in better accordance with the actual existing facilities, and hence that the model performs better in effectiveness.
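Both metrics can be computed as in the following Python sketch. It assumes planar coordinates in meters and at least as many facilities as recommended locations, and it realizes the one-to-one mapping by greedily matching each recommended location, in order, to its nearest unused facility; this greedy matching and all names are assumptions made for illustration.

```python
from math import dist, sqrt


def match_nearest(recommended, facilities):
    """Map each recommended location to its nearest unused facility."""
    unused = list(facilities)
    distances = []
    for loc in recommended:
        nearest = min(unused, key=lambda f: dist(loc, f))
        unused.remove(nearest)  # keep the mapping one-to-one
        distances.append(dist(loc, nearest))
    return distances


def precision_and_rmse(recommended, facilities, tau_e=200.0):
    """Precision of Eq. (14) and RMSE of Eq. (15) for one top-k result."""
    d = match_nearest(recommended, facilities)
    k = len(d)
    precision = sum(x <= tau_e for x in d) / k  # hits within tau_E
    rmse = sqrt(sum(x * x for x in d) / k)      # no distance constraint here
    return precision, rmse


# Illustrative coordinates only.
p, rmse = precision_and_rmse([(0, 0), (1000, 0)],
                             [(0, 150), (1000, 400), (5000, 5000)])
print(p, rmse)
```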

6.5.2 Experimental results

Figure 6. Performance in terms of precision on Shanghai.

Figure 7. Performance in terms of precision on Beijing.

The precision for varying $k$ in different facility categories is illustrated in Figs 6 and 7, evaluated on the Shanghai and Beijing datasets respectively. It is evident that the proposed MITLP significantly outperforms the other methods under all circumstances, and several aspects account for this. We not only consider the effects of customers' historical trajectories on facility placement, but also take advantage of friend relationship-based reciprocal recommendation, which has been adopted by a large number of businesses for sales promotion in recent years. Furthermore, the explicit consideration of facility category and service capacity improves the accuracy of the prediction as well. Meanwhile, the performance on BSP and LFR is superior to that on CUSH and VCS, because the majority of customers are more prone to be influenced by red envelopes or positive comments posted by their close friends when they are about to consume at BSP or LFR. Moreover, the precision declines slightly as $k$ increases: the larger the value of $k$ , the more locations participate in the hit process with the corresponding ground-truth facilities, and the hitting accuracy then experiences diminishing returns.

Figure 8. Performance in terms of RMSE on Shanghai.

Figure 9. Performance in terms of RMSE on Beijing.

Subsequently, the RMSE performance under the same conditions is presented in Figs 8 and 9. We can see that the proposed MITLP model achieves better results than all competitors, and that the trend is the reverse of that of the precision. The reason can be derived from the definitions: a higher precision indicates a lower deviation between the recommended locations and the validated facilities, and vice versa. In other words, the larger the precision, the better the recommended candidate locations are represented by the corresponding facilities in the road network; conversely, a larger RMSE indicates worse prediction performance.

Figure 10. Weights learned by MITLP. (a)–(c) In Shanghai, (d)–(f) in Beijing.

Figure 11. Visualization in VCS ( $k=20$ ). (a) Predicted locations, (b) ground-truth facilities.

To offer insights into how each characteristic contributes to candidate location prediction, we further investigate the three groups of feature weights with respect to Eqs (6), (10) and (11), shown together in Fig. 10. For ${\alpha_{\phi}}$ and ${\beta_{\phi}}$ , it can be observed that the weight of service utility is significantly larger than that of categorical competitiveness, because quite a few features are relevant to the customer and the candidate location. As for ${\alpha_{\vartheta}}$ , ${\beta_{\vartheta}}$ , and ${\gamma_{\vartheta}}$ , social relevancy is a principal factor in the evaluation of location popularity, especially in BSP and LFR, whereas spatial commute distance and temporal conflict play relatively important roles in CUSH and VCS in general; the reason is the same as that for the precision differences among facility categories. Besides, the three related sub-features of social relevancy are also examined. We can see that the attractiveness effect of location and the customer effect within facility category account for a large proportion of the effectiveness improvement achieved by the proposed MITLP; the reasons are detailed in Section 4.

6.5.3 Case study

A simple case study on VCS is presented to visualize both the recommended locations and the ground-truth facilities in the central area of Beijing. As shown in Fig. 11, the top-20 candidate locations and the facilities they hit are depicted respectively. We can see that the recommended locations in Fig. 11a are geographically close to the corresponding ground-truth facilities in Fig. 11b; more specifically, the precision is 88% and the RMSE is 89 meters, as reported in Figs 7 and 9 respectively. This case study shows that our query results are accurate and reliable for facility location prediction.

7. Conclusions

In this paper, we have proposed a novel model, MITLP, for the problem of location prediction for facility placement with multi-characteristic information. We not only account for the spatio-temporal behaviors and social relationships of customers, but also take the capacity limitations and categories of the specified facilities into consideration. Moreover, the location popularity is quantized and evaluated with respect to these features. To achieve efficient top- $k$ candidate location querying, the CLSTC-forest, which combines candidate locations and customers, is described in detail, followed by a straightforward querying method. Finally, extensive experiments on real datasets are performed to offer insights into the efficiency and effectiveness of the querying approach, and the effectiveness is further verified via a case study. In the future, we are interested in exploring updates of either the multi-characteristic information or the candidate locations.

