Multimodal interaction aware embedding for location-based social networks

Abstract

Location-based social networks (LBSNs) have greatly promoted the development of human mobility mining. However, the sparse, multimodal and heterogeneous nature of user check-in data remains a major obstacle to learning high-quality representations of users and other entities, especially for downstream tasks such as point-of-interest (POI) recommendation. Most existing methods focus on user preference modeling based on sequential POI tags without exploring the interactions between different modalities (e.g., user-user, user-timestamp and user-POI interactions). To this end, we introduce a multimodal interaction aware embedding framework to generate reliable entity embeddings on the heterogeneous socio-spatial network. At its core, multi-modal interaction sub-graph sampling techniques are first designed to capture the heterogeneous contexts; then, a self-supervised contrastive learning technique is leveraged to extract intra-modality and inter-modality interactions in a lightweight way. We conduct experiments on next-POI recommendation tasks based on three real-world datasets. Experimental results demonstrate the superiority of our model over state-of-the-art embedding learning algorithms.

Keywords

POI recommendation; embedding representation; heterogeneous network; graph convolutional neural network

1. Introduction

With the rapid development of the mobile Internet, numerous location-based social network (LBSN) application services have emerged, such as Yelp, Twitter and Uber. In an LBSN, users share digital footprints of daily life with their friends by checking in at a point of interest (POI), which provides fine-grained user mobility and social network information. Consequently, various location-aware data mining tasks, e.g., next-POI demand modeling [2,31] and friend recommendation [28,32], benefit from such fine-grained information. However, the multimodality (e.g., user, POI and time) and heterogeneity (e.g., user-user interactions, user-POI interactions and user-POI-timestamp interactions) of the data make it challenging to learn good entity embeddings for data-driven human mobility [31] and social network analysis [28].

Currently, to exploit the complete semantic context information, most existing works [2,3,5,7,15,26,29] regard it as a heterogeneous information network mining problem [25,33]. However, they usually split heterogeneous information networks into bipartite graphs to model multimodal interactions. This split discards part of the multimodal interaction information and focuses only on the interaction between the user and the item for embedding representation learning.

Multimodal interaction modeling is a realistic requirement in POI recommendation scenarios [34]. For example, Fig. 1 illustrates the hierarchical relationships among multiple modalities of user check-in activities. Along the friendship dimension, Emma's relevance score with Jack is lower than Cora's with Jack, and traditional social relationship-based recommendation systems rely on exactly this ranking. However, this ranking is unreasonable because Jack and Emma have similar activity schedules (both are active in the first and third time periods, as shown in the time period modal of Fig. 1). Therefore, it makes sense to recommend Emma's favorite POIs to Jack. As a result, the interaction between modalities is significant for POI recommendation, and we need to incorporate as much multimodal information as possible into embedding representation learning.

Fig. 1.

A multi-modal information network contains the time period modal, the POI modal, the user modal, and other modalities. Solid lines indicate relationships between nodes in the same modal domain, and dashed lines indicate relationships between different modal domains.

In the process of fusing complex and variable multimodal information into an embedding representation, several key problems remain unsolved. ❶ The relationship between users' behavioral relevance and the interaction of multiple modalities has not been sufficiently investigated. ❷ Traditional methods do not consider the interaction effects of multi-modal data, which results in the loss of a large volume of multimodal interaction information. ❸ Multiple modal data are independent, and previous graph convolutional embedding learning models fuse a large amount of information from neighboring nodes, which leads to over-smoothing and inefficient embeddings.

To tackle the aforementioned problems, we develop a sampling-aggregation POI recommendation embedding representation learning framework (SAPRec) based on multi-modal sub-graph sampling and heterogeneous information aggregator. SAPRec aims to capture the interaction between different modalities and generate efficient embedding representations [20,28]. For problem ❶, we explored the relationship between the interactions of different modals and designed SAPRec based on the analysis of the correlations in our paper. For problem ❷, SAPRec focuses on the most characteristic information in the multi-modal interest subgraphs of users, which is exploited to learn the embedding representation for each node in the heterogeneous information network. For problem ❸, SAPRec generates a multi-modal interaction sub-graph of users based on their social connections and recent check-ins. Then, we put the multi-modal interaction subgraphs into the aggregator for feature extraction.

The main contributions of this paper are summarized as follows:

We study the distribution properties of multimodal data on real data sets and analyze the influence of different modal information on each other.

We propose three graph sampling methods based on the interaction features between different modalities, which improve the representation capability of the embeddings generated by the model while reducing the graph learning cost.

We propose a graph information aggregation method named light deep graph infomax network (lightDGI). lightDGI improves the embedding representation’s learning performance while preserving the vital information in the original graph and extracting mutual information from multimodal data.

Extensive experiments were conducted on datasets of three cities to prove our model’s excellent performance on multimodal interaction aware embedding.

2. Related work

2.1. Graph neural networks

Graph neural networks have recently become a critical research method in recommender systems. Li et al. propose a few-shot learning framework, which encodes geographical neighborhood information using graphs and models the dependence relationship among businesses using graph convolutional networks [11]. Wang et al. propose NGCF, which exploits the user-item graph structure by propagating embeddings [22]. NGCF enables expressive modeling of high-order connectivity in the user-item graph, explicitly injecting the collaborative signal into the embedding process. Ying et al. propose PinSage, which paves the way for a new generation of web-scale recommender systems based on graph convolutional architectures [30]. In this paper, we partition the heterogeneous information network (HIN) into subgraphs using three sampling methods in order to better capture the personalized information of users and the attribute features of items. Compared with previous work, SAPRec focuses more on the local features of the HIN, i.e., the personalized information of the user.

2.2. Graph neural embedding learning

Most current research addresses generic graph embedding. Zeng et al. propose a graph sampling-based inductive learning method named GraphSAINT, which constructs mini-batches by sampling the training graph, rather than the nodes or edges across GCN layers. GraphSAINT demonstrates superior performance in both accuracy and training time on five large graphs [33]. The ActiveHNE framework [1] contains a novel semi-supervised heterogeneous network embedding method based on GCN. Wang et al. propose a heterogeneous graph neural network based on hierarchical attention, including node-level and semantic-level attention [23]. Inspired by Deep Graph Infomax [20] and LightGCN [6], this paper adds a lightDGI module to previous work on graph neural embedding learning, which improves the generalization ability of the embeddings.

2.3. POI recommendations embedding learning

Few studies in recent years have focused on embedding representations for multi-modal POI recommendation. LBSN2Vec++ provides a reliable idea for learning multi-modal embedding representations [28]. Nevertheless, it does not consider the topological relationship between the embedding nodes. Wang et al. study the problem of mobile user profiling with POI check-in data and propose a deep adversarial sub-structured learning framework. However, it overlooks the temporal and semantic information of check-ins [21]. Wang et al. propose PPE, which learns user embeddings from a user-tag bipartite graph by minimizing a supervised loss to preserve the similarity of users visiting analogous places [26]. Huynh et al. propose a model in which these holistic interactions can be learned and transferred into node embeddings derived from a hypergraph representation and a persona decomposition process [8]. Wang et al. propose DyHNE, which introduces meta-path based first- and second-order proximities to preserve structure and semantics in HINs [24].

3. Preliminary analysis

In this section, we first describe the features of the different modalities in user check-in records. Then we explore the distribution preferences of user check-in records under different modalities. The symbols used in this paper are listed in Table 1.

Table 1
Symbol definition descriptions

Symbol	Description
u	User ID
t	Timestamp
p	POI ID
U	User ID set
T	Timestamp set
P	POI ID set
C	Check-in set
$U_{U}$	Set of user social relations
$U_{P}$	Set of POIs visited by each user
$U_{T}$	Set of activity time period of each user
$P_{P}$	Set of correlation relations (e.g., spatial distance, prevalence correlation) between POIs
$P_{U}$	Set of historical interactive users of each POI
$P_{T}$	Set of active time slots for POI
$T_{T}$	Set of relationships between time periods (modeled chronologically in this paper)
$T_{U}$	Set of historical interactive users of each time period
$T_{P}$	Set of POIs that are highly correlated in time period
G	Graph of heterogeneous information network for POI recommendation
V	Node set of G
E	Edge set of G
$m_{i, j}$	The correlation status between node i and node j
D	The maximum distance threshold (hyperparameter)
$M_{t, s}$	The correlation matrix between modal t and modal s
$T M$	Types of modalities
$T N_{i}$	Number of nodes under modal i
M	Multimodal correlation matrix
$G_{sub}$	Sub-graph obtained by the sampler
$V_{sub}$	Node set of $G_{sub}$
$E_{sub}$	Edge set of $G_{sub}$
S	Set of sub-graph
P	Positive sample set
N	Negative sample set

3.1. Data description

In the POI recommendation scenario, users share their locations with their friends by checking in at POIs. Based on the users' check-in records, we denote U as the user set, P as the POI set, T as the time period set, and $⟨ u, t, p ⟩$ as a check-in, which means that user u visited POI p at time period t. We denote the check-in set as C, and each user has a check-in sequence defined as $C_{u} = (⟨ u, t_{1}, p_{1} ⟩, ⟨ u, t_{2}, p_{2} ⟩, \dots, ⟨ u, t_{n}, p_{n} ⟩), u \in U, t_{i} \in T, p_{i} \in P$ .

Notably, human movement patterns are very intricate, and various factors have been studied in POI recommendation tasks [21]. In this paper, we only consider mixed interactions of three modalities, i.e., users, POIs, and time periods, which have been shown effective in next-POI recommendation [13].

User Modal Interaction. In user modal interactions, there are social interactions between users, interactions between users and POIs and interactions between users and time periods, which we define respectively as $⟨ u_{i}, u_{j} ⟩ \in U_{U}$ , $⟨ u_{i}, p_{j} ⟩ \in U_{P}$ and $⟨ u_{i}, t_{j} ⟩ \in U_{T}$ .

POI Modal Interaction. In POI modal interactions, there are geographic distance interactions between POIs, collaborative filtering interactions between POIs and users, and preferences between POIs and time periods, which we define respectively as $⟨ p_{i}, p_{j} ⟩ \in P_{P}$ , $⟨ p_{i}, u_{j} ⟩ \in P_{U}$ and $⟨ p_{i}, t_{j} ⟩ \in P_{T}$ .

Time Period Modal Interaction. In this paper, we divide a week into 168 time periods. In time period modal interactions, there are correlations between time periods, preferences of the public in a specific time period, and the popularity of POIs under different time periods, which we define respectively as $⟨ t_{i}, t_{j} ⟩ \in T_{T}$ , $⟨ t_{i}, u_{j} ⟩ \in T_{U}$ and $⟨ t_{i}, p_{j} ⟩ \in T_{P}$ .

Heterogeneous Information Network for POI Recommendation (HINPR). Unlike LBSN, HINPR is an open, heterogeneous network structure. Unlike knowledge graphs, edges in HINPR only represent the existence of connections between nodes, and there is no entity relationship. The features of HINPR allow us to focus on the correlation relationships between data of different modalities and provide a unified framework to represent data of different modalities. Combining user modal, POI modal, and time period modal, we get a HINPR $G = (V, E)$ , where $V = (U \cup P \cup T)$ , $E = (U_{U} \cup U_{P} \cup U_{T} \cup P_{P} \cup P_{U} \cup P_{T} \cup T_{T} \cup T_{U} \cup T_{P})$ , $u_{i} \in U$ , $p_{i} \in P$ and $t_{i} \in T$ .
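As a rough illustration of this construction, the following Python sketch builds most of the edge sets of G from raw check-ins. The function names `time_period` and `build_hinpr`, the typed-tuple node keys, the choice of Monday 00:00 as slot 0, and the wrap-around link between the last and the first slot are assumptions made for this example, not details given in the paper. $P_{P}$ edges are left out because they require POI coordinates and a distance threshold.

```python
from collections import defaultdict
from datetime import datetime


def time_period(ts: datetime) -> int:
    """Map a timestamp to one of 168 hour-of-week slots.

    Assumption: slot 0 is Monday 00:00-00:59.
    """
    return ts.weekday() * 24 + ts.hour


def build_hinpr(checkins, friendships):
    """Build an undirected adjacency dict over typed nodes.

    checkins:    iterable of (user, datetime, poi) triples
    friendships: iterable of (user, user) pairs
    Nodes are keyed as ("u", id), ("p", id) or ("t", slot).
    """
    adj = defaultdict(set)

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for u, v in friendships:                 # U_U
        link(("u", u), ("u", v))
    for u, ts, p in checkins:
        t = time_period(ts)
        link(("u", u), ("p", p))             # U_P and P_U
        link(("u", u), ("t", t))             # U_T and T_U
        link(("p", p), ("t", t))             # P_T and T_P
    for t in range(168):                     # T_T: chronologically adjacent slots
        link(("t", t), ("t", (t + 1) % 168))  # wrap-around is an assumption
    return adj
```

Storing each undirected edge once per endpoint keeps the nine edge sets implicit in the node types of the two endpoints.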

3.2. Data observation

We observed the user’s mobility behavior and the correlation between different modal data on three real-world datasets: New York City, Jakarta, and Tokyo [10].

Figure 2 shows the size of POI intersections between users and their friends in three cities. Figure 2(a) shows the intersection size in New York City is generally less than 5. However, in Jakarta and Tokyo, the intersection size mostly ranges from 1 to 20. The differences in the POI intersection size show that user POI preferences in Jakarta and Tokyo are more likely to be influenced by social relationships compared to New York City.

Figure 3 shows the size of time period intersections between users and their friends in three cities. Users in Tokyo have more similar time periods of activities to their friends, because the intersection size of time periods between the user and their friends in Tokyo is around 1-40. On the contrary, Fig. 3(a) and Fig. 3(b) show that users in New York City and Jakarta have less similar time periods of activities to their friends.

Fig. 2.

Number of POI intersections between the user’s friends and the user.

Fig. 3.

The number of time period intersections between the user’s friends and the user.

Combining the observations in Fig. 2 and Fig. 3, we can infer that New York users’ check-in preferences are less influenced by their friends, while Tokyo users’ check-in preferences are easily influenced by their friends. Although Jakarta users’ check-in location preferences are easily influenced by their friends, the time period of their activities is personalized. Therefore, it is necessary to introduce the influence of different modal factors on users’ check-in preferences.

By analyzing the check-in dataset, we can conclude that implicit affiliations between users’ different modal check-in data exist, and these implicit affiliations often show a power-law distribution. Many studies have also proposed that user check-in data have power-law distribution properties, but these studies only focus on the power-law distribution of check-in frequency and POI transitions [2,16]. Therefore, the study of embedding representation based on multi-modal check-in data is significant.

4. The SAPRec framework

In this section, we first introduce the proposed multi-modal sub-graph sampling method. Then we present a method named light deep graph infomax to learn the embedding representation for each node in the heterogeneous information network. The model structure is shown in Fig. 4.

Fig. 4.

Sampling-aggregated POI recommendation embedding representation learning framework.

4.1. Multimodal-aware sub-graph sampling (MSS)

Graph sampling-based methods are motivated by scalability challenges (in terms of model depth and graph size) [33]. The more modalities are considered, the larger the constructed HINPR (the heterogeneous information network for POI recommendation) becomes. In this context, the solution time and memory requirements of traditional methods, such as GCN, increase exponentially [9]. Therefore, we introduce graph sampling into HINPR, which effectively reduces the time and space complexity of the model. Based on the features of heterogeneous information networks, we introduce three modality-aware sub-graph sampling techniques.

Multi-modal sub-graph sampling can be applied to a part of HINPR. In this paper, we only consider the interactions between three modalities: user, POI, and time period. The HINPR matrix $M_{t, s} = \{m_{i, j} \mid i, j \in V\}$ is defined according to the interactions between modalities, where
$$m_{i, j} = \begin{cases} 1, & \text{if interaction } (i, j) \text{ is observed;} \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$
As shown in Equation (1), a value of 1 for $m_{i, j}$ indicates that there is an interaction between node i and node j. For POI modal data, when POI $p_{i}$ and POI $p_{j}$ are accessed successively in a user's check-in history and $dis (p_{i}, p_{j}) < D$ , we set $m_{p_{i}, p_{j}} = 1$ , where $dis (\cdot)$ is the distance function between $p_{i}$ and $p_{j}$ and D is the maximum distance threshold. For temporal modal data, when the two time periods $t_{i}$ and $t_{j}$ are successive in time, we set $m_{t_{i}, t_{j}} = 1$ .

With Equation (1), we can generate an adjacency matrix for each modal interaction, as shown in Equation (2):
$$M_{t, s} = \begin{bmatrix} m_{1, 1} & \dots & m_{1, j} \\ \vdots & \ddots & \vdots \\ m_{i, 1} & \dots & m_{i, j} \end{bmatrix}. \tag{2}$$

In Equation (2), $M_{t, s}$ is the adjacency matrix of the multi-modal interaction, $t, s \in T M$ , and $T M$ is the set of modality types; in this paper, $T M = \{u, p, t\}$ . Finally, we combine all $M_{t, s}$ into M for sampling, as shown in Equation (3):
$$M = \begin{bmatrix} M_{u, u} & M_{u, p} & M_{u, t} \\ M_{p, u} & M_{p, p} & M_{p, t} \\ M_{t, u} & M_{t, p} & M_{t, t} \end{bmatrix}. \tag{3}$$
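To make the block structure concrete, the sketch below stacks per-modality-pair 0/1 matrices into one combined matrix M. The helper name `block_matrix`, the nested-list representation, and the modality order (u, p, t) are illustrative assumptions.

```python
def block_matrix(blocks, order=("u", "p", "t")):
    """Stack blocks[(a, b)], the 0/1 matrix M_{a,b} given as nested lists,
    into one combined matrix with rows and columns ordered by `order`."""
    rows = []
    for a in order:
        n_rows = len(blocks[(a, order[0])])
        for r in range(n_rows):
            row = []
            for b in order:
                row.extend(blocks[(a, b)][r])
            rows.append(row)
    return rows


# Toy example: 2 users, 1 POI, 1 time period.
blocks = {
    ("u", "u"): [[0, 1], [1, 0]],
    ("u", "p"): [[1], [0]],
    ("u", "t"): [[1], [1]],
    ("p", "u"): [[1, 0]],
    ("p", "p"): [[0]],
    ("p", "t"): [[1]],
    ("t", "u"): [[1, 1]],
    ("t", "p"): [[1]],
    ("t", "t"): [[0]],
}
M = block_matrix(blocks)
```

In this toy input each off-diagonal block is the transpose of its mirror block, so the combined matrix is symmetric.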

We propose three sampling strategies to obtain sub-graphs: node-level hop sampler, random walk sampler, and user-based random walk sampler.

Node-level Hop Sampler. For each sub-graph $G_{sub} = (V_{sub}, E_{sub})$ , we first sample a node $n_{i}$ from V uniformly at random and put it into $V_{sub}$ . Then, we obtain the neighbor set $S_{n}$ of node $n_{i}$ . For each node $n_{j}$ in $S_{n}$ , we randomly obtain a set of h of its neighbor nodes and put $n_{j}$ into $V_{sub}$ . We repeat this process until $| V_{sub} | > k$ , where k is the size of the sub-graph.

Using the Node-level Hop Sampler, the probability of nodes with different modalities being sampled is as follows:
$$p (n_{j} \in T M \mid n_{i}) \propto m_{i, j} \times T N_{j} \times h \times \deg (n_{i}). \tag{4}$$
In Equation (4), $T M$ is the set of modality types, $T N_{j}$ is the number of nodes with the same modality as node j, h is the number of neighbor nodes sampled for node i, and $\deg (n_{i})$ is the degree of node i. The sampling probability for different kinds of modalities depends on the network topology, the number of nodes in that modality, and the degree of the previous node. Thus, the Node-level Hop Sampler prefers to sample users' existing interactions and has difficulty in mining users' potential interest preferences. Details can be found in Algorithm 1.

Algorithm 1

Node-level hop sampler
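One possible reading of the node-level hop sampler described above is sketched below in Python. It is not a transcription of Algorithm 1: the function name `node_hop_sample`, the adjacency-dict input, the restart from a fresh random root when a frontier is exhausted, and the hard cap at exactly k nodes are all assumptions of this sketch.

```python
import random


def node_hop_sample(adj, k, h, rng=random):
    """Sample up to k nodes by hop-wise expansion from random roots.

    adj: dict mapping each node to an iterable of its neighbours
    k:   target sub-graph size
    h:   number of neighbours drawn per expanded node
    """
    nodes = list(adj)
    k = min(k, len(nodes))
    v_sub = set()
    while len(v_sub) < k:
        root = rng.choice(nodes)              # uniform root selection
        v_sub.add(root)
        frontier = [root]
        while frontier and len(v_sub) < k:
            next_frontier = []
            for n in frontier:
                neighbours = list(adj[n])
                for m in rng.sample(neighbours, min(h, len(neighbours))):
                    if m not in v_sub and len(v_sub) < k:
                        v_sub.add(m)
                        next_frontier.append(m)
            frontier = next_frontier
    return v_sub
```

Passing a seeded `random.Random` instance as `rng` makes the sampled sub-graphs reproducible.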

Random Walk Sampler. There are numerous random walk based samplers proposed in the literature [33]. In our experiments, we implement a regular random walk sampler (with r root nodes selected uniformly at random, and each walker goes h hops).

Using the Random Walk Sampler, the probability of nodes with different modalities being sampled is as follows:
$$p (n_{j} \in T M \mid n_{i}) \propto m_{i, j} \times T N_{j} \times r \times h \times \deg (n_{i}). \tag{5}$$
In Equation (5), r is the number of root nodes and h is the number of walk steps. The value of h in Equation (5) is usually larger than in Equation (4). Therefore, the Random Walk Sampler prefers to explore the connections between users' different modal interactions and to recommend novel points of interest to users.
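A minimal sketch of such a regular random walk sampler is given below, assuming the graph is available as an adjacency dict. The name `random_walk_sample` and the decision to draw the r roots without replacement are assumptions of this example.

```python
import random


def random_walk_sample(adj, r, h, rng=random):
    """Collect the nodes visited by r random walks of h hops each.

    adj: dict mapping each node to an iterable of its neighbours
    """
    nodes = list(adj)
    roots = rng.sample(nodes, min(r, len(nodes)))   # uniform roots
    v_sub = set(roots)
    for node in roots:
        for _ in range(h):
            neighbours = list(adj[node])
            if not neighbours:                      # dead end: stop this walk
                break
            node = rng.choice(neighbours)
            v_sub.add(node)
    return v_sub
```

The sub-graph contains at most r × (h + 1) nodes, so r and h together bound the sub-graph size.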

User-based Random Walk Sampler. In Section 3.2, we observed a correlation between user social relationships, POI check-ins, and time periods. Therefore, we propose a user-based random walk scheme to sample friendships, POI check-ins, and time periods jointly. For each sub-graph $G_{sub}^{i} = (V_{sub}^{i}, E_{sub}^{i})$ , we run h random walks of depth d through the social network of user $u_{i}$ to obtain the set $S_{friend}^{i}$ of the user's friend nodes. Subsequently, for each user $u_{j}$ in the friend set $S_{friend}^{i}$ , we sample c check-in records of that user according to a normal distribution, where each check-in record contains a POI $p_{j}$ and a time period $t_{j}$ , and we put $u_{j}$ , $p_{j}$ and $t_{j}$ into $V_{sub}$ . Compared with the first two sampling algorithms, the User-based Random Walk Sampler pays more attention to the impact of users' social relationships on POIs and active time periods. Details can be found in Algorithm 2.

Algorithm 2

User-based random walk sampler
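The following Python sketch shows one way the user-based random walk sampler could be realised; it is not a transcription of Algorithm 2. Several details are assumptions of this sketch: the function name `user_rw_sample`, including the seed user in the friend set, and interpreting "according to a normal distribution" as a half-normal offset counted back from the user's most recent check-in.

```python
import random


def user_rw_sample(friends, checkins, user, d, h, c, rng=random):
    """Jointly sample users, POIs and time periods around `user`.

    friends:  dict user -> list of friends
    checkins: dict user -> chronological list of (poi, time_slot)
    d, h:     walk depth and number of walks on the social graph
    c:        check-in records drawn per sampled user
    """
    s_friend = {user}                    # assumption: seed user is included
    for _ in range(h):
        node = user
        for _ in range(d):
            if not friends.get(node):
                break
            node = rng.choice(friends[node])
            s_friend.add(node)

    v_sub = set()
    for u in s_friend:
        v_sub.add(("u", u))
        records = checkins.get(u, [])
        for _ in range(c if records else 0):
            # Assumption: half-normal offset from the latest record,
            # so recent check-ins are drawn more often.
            offset = int(abs(rng.gauss(0, len(records) / 3)))
            offset = min(offset, len(records) - 1)
            poi, slot = records[-1 - offset]
            v_sub.update({("p", poi), ("t", slot)})
    return v_sub
```

Replacing the Gaussian draw with `rng.choice(records)` gives a uniform variant if recency should not matter.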

We construct each edge $e_{i, j} \in E_{sub}$ of the sub-graph $G_{sub} = (V_{sub}, E_{sub})$ based on the adjacency of the nodes $n_{i} \in V_{sub}$ and $n_{j} \in V_{sub}$ in HINPR, where
$$e_{i, j} = \begin{cases} 1, & \text{if } m_{i, j} = 1; \\ 0, & \text{otherwise.} \end{cases} \tag{6}$$
Zeng et al. propose that, in order to learn the global features of large graphs (such as HINPR) correctly, the total number of nodes K in the set $S = (G_{sub}^{1}, G_{sub}^{2}, \dots, G_{sub}^{n})$ needs to satisfy $| K | \gg | V |$ , where $| V |$ is the number of nodes in HINPR [33].
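In code, the edge rule amounts to taking the sub-graph induced by the sampled nodes. A short sketch, assuming the full graph is stored as an adjacency dict and using the hypothetical helper name `induced_edges`:

```python
def induced_edges(adj, v_sub):
    """Return directed pairs (a, b) with both endpoints in v_sub
    that are adjacent in the full graph."""
    return {(a, b) for a in v_sub for b in adj.get(a, ()) if b in v_sub}
```

For an undirected graph each edge appears in both directions, which matches a symmetric adjacency matrix.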

4.2. Light deep graph infomax (lightDGI)

After obtaining the sub-graph set S, we need to generate node embeddings $e_{i} \in E^{| V | \times R}$ , where E is the embedding set of HINPR's nodes and $R$ is the embedding dimension. Numerous studies have focused on unsupervised graph embedding methods, such as GraphSAGE, LINE, and DeepWalk [4,14,18].

From the viewpoint of graph convolution, the convolution process already fuses its neighbor information, and after several such patch-level fusions, the neighboring nodes already have similar expression vectors. Thus, in the loss function part, we should pay more attention to the variability among neighboring nodes to generate a more efficient node embedding representation.

To this end, Velickovic et al. propose the DGI method to maximize mutual information in graph convolutional neural networks [20]. GCN was initially designed for graph node classification tasks, in which nodes have rich attributes that can be used as input features. However, each node (user, POI or time period) in HINPR is only described by an ID, which has no concrete semantics besides being an identifier. In such a case, feature transformation brings no benefit and instead increases the difficulty of model training [6]. Therefore, we propose lightDGI to improve model performance and reduce the number of parameters on HINPR.

In each embedding learning batch of lightDGI, we need to maximize local mutual information as much as possible. In Section 4.1, we obtained the sub-graph set S of HINPR. Moreover, for each sub-graph $G_{sub}^{i}$ , we seek to obtain node representations that capture the entire sub-graph’s global information content, represented by a summary vector $\vec{s}$ .

In order to obtain the sub-graph summary vector $\vec{s}$ , we define a graph readout function $R (E_{sub}) = \mathrm{mean} (e_{i})$ with $e_{i} \in E_{sub}$ and $E_{sub} \subseteq E$ , and use it to fuse the local representation embeddings into a sub-graph representation, i.e., $\vec{s} = R (E_{sub})$ .

Since lightDGI is based on an unsupervised graph embedding method, we need to obtain a negative sample $G_{neg}^{i}$ for each sub-graph $G_{sub}^{i}$ . Starting from the positive sub-graph $G_{sub}^{i}$ , we keep the adjacency of the sub-graph edges $E_{sub}$ unchanged and use a normal distribution $N$ to generate a node index in the range 0 to $| V | - 1$ for each node in $V_{sub}$ . The negative graph $G_{neg}^{i}$ of $G_{sub}^{i}$ is defined as
$$G_{neg}^{i} = G (N (V_{sub}^{i}), E_{sub}^{i}), \tag{7}$$
where $G (V, E)$ is a graph generation function based on node set V and edge set E. $G_{neg}^{i}$ has the same adjacency structure as $G_{sub}^{i}$ , but every node embedding changes.

Next, we let information flow between the nodes, using a lite GCN to aggregate it. We define the lite GCN as
$$\hat{A} = D^{- \frac{1}{2}} (A + I) D^{- \frac{1}{2}}, \tag{8}$$
$$H^{(k + 1)} = \hat{A} H^{(k)}. \tag{9}$$
In Equation (8), D is the diagonal degree matrix of sub-graph $G_{sub}^{i}$ , I is the identity matrix, and A is the adjacency matrix of sub-graph $G_{sub}^{i}$ . In Equation (9), $H^{(0)}$ is the node embedding matrix of sub-graph $G_{sub}^{i}$ , and $\hat{A}$ is the normalized adjacency matrix of sub-graph $G_{sub}^{i}$ .

Inspired by contrastive learning, we define a discriminator function to calculate the probability score of each node with respect to the summary vector $\vec{s}$ :
$$\vec{s} = \mathrm{mean} (G_{sub}^{i}), \tag{10}$$
$$D (\vec{s}, H^{(k + 1)}) = E (\vec{s}) W H^{(k + 1)}. \tag{11}$$
In Equation (10), $G_{sub}^{i}$ is the positive sample in contrastive learning. The encoded positive graph $G_{sub}^{i}$ acts as a teacher that teaches the discriminator to distinguish positive graphs from negative graphs. In Equation (11), E is an expansion function that expands $\vec{s}$ to the same size as $H^{(k + 1)}$ , and W is the weight matrix used to calculate the probability score.

For the loss function, we maximize a Jensen–Shannon divergence-based objective [20] to evaluate the mutual information probability scores of positive and negative samples. It is defined as
$$\mathrm{Loss} = \frac{1}{P + N} \left( \sum_{i = 1}^{P} \mathbb{E}_{(V, E)} \left[ \log D (\vec{s}, H^{(k + 1)}) \right] + \sum_{j = 1}^{N} \mathbb{E}_{(\overline{V}, \overline{E})} \left[ \log \left( 1 - D (\vec{s}, H^{(k + 1)}) \right) \right] \right). \tag{12}$$
In Equation (12), $(V, E)$ is the positive set, $(\overline{V}, \overline{E})$ is the negative set, P is the size of the positive set and N is the size of the negative set.
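The pieces of lightDGI (normalized propagation without feature transformation, mean readout, bilinear discriminator, and the contrastive objective) can be sketched for a single sub-graph in plain Python. This is an illustrative sketch, not the authors' implementation: all function names are hypothetical, D is taken to be the degree matrix of A + I as in standard GCN normalization, and a sigmoid is applied to the bilinear score so that it can be read as a probability, following the original DGI formulation. The negative embeddings `H_neg` are assumed to be the rows looked up for the re-drawn node indices.

```python
import math


def normalize_adj(A):
    """D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def propagate(A_hat, H, layers):
    """Repeated H <- A_hat H: no weights and no nonlinearity."""
    for _ in range(layers):
        H = matmul(A_hat, H)
    return H


def readout(H):
    """Mean of the node embeddings: the summary vector s."""
    return [sum(col) / len(H) for col in zip(*H)]


def discriminator(s, H, W):
    """Per-node score sigmoid(h_i^T W s); the sigmoid is an assumption."""
    Ws = [sum(w * x for w, x in zip(row, s)) for row in W]
    return [1.0 / (1.0 + math.exp(-sum(h * v for h, v in zip(row, Ws))))
            for row in H]


def objective(A, H_pos, H_neg, W, layers=2):
    """Average log-score over positive and negative nodes (to be maximized)."""
    A_hat = normalize_adj(A)
    Hp = propagate(A_hat, H_pos, layers)
    Hn = propagate(A_hat, H_neg, layers)   # same adjacency, other embeddings
    s = readout(Hp)                        # summary of the positive graph
    pos = discriminator(s, Hp, W)
    neg = discriminator(s, Hn, W)
    total = (sum(math.log(p) for p in pos)
             + sum(math.log(1.0 - q) for q in neg))
    return total / (len(pos) + len(neg))
```

In a training loop, the negated value of `objective` would serve as the loss to minimize, and only the embeddings and W carry gradients because the propagation step has no parameters.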

5. Experiments

In this section, we evaluate SAPRec on location/activity prediction tasks. First, we introduce the experimental setup and discuss the advantages of SAPRec over other baseline models. Then we explore the performance of the SAPRec under different samplers. Finally, we analyze the information embedding effect of the SAPRec under different parameter settings.

5.1. Experimental setup

We used a large-scale, long-interval LBSN dataset collected by [27]. Because cultural differences and social preferences vary across regions, we select three cities: Tokyo (TKY), New York (NY), and Jakarta (JK). Details of the datasets are shown in Table 2.

Table 2
Statistics of three city datasets

Datasets	POIs	Users	Check-ins	Friendships
TKY	3628	4024	105961	8723
NY	10856	7232	699324	37480
JK	8826	6395	378559	11207

The hardware configuration of our experimental platform is as follows: AMD Ryzen 7 3700X 8-Core Processor, 64 GB DDR4 memory and GeForce RTX 2070 (8 GB). The software configuration is as follows: Python 3.6.12, PyTorch 1.7.0, NumPy 1.19.2 and Pandas 1.1.3.

We use Xavier uniform initialization for the embedding parameters and set the initial learning rate to 0.001. We set the embedding size to 128, the sub-graph size to 200, the number of sub-graphs to 2000, the node-hop to 3, the number of root nodes to 32, the length of the random walk to 16, the friend sample number to 8, the friend window size to 2, and the check-in sample number to 3. We divide the dataset into training, validation and test sets with a ratio of 8:1:1.

To avoid variability caused by different model metrics, we score all compared models in cosine space. The scoring formula is
$$\mathrm{score} (u, p, t) = \frac{u \cdot p}{\| u \| \times \| p \|} + \frac{t \cdot p}{\| t \| \times \| p \|}. \tag{13}$$
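The scoring rule is the sum of two cosine similarities, which can be written in a few lines of Python. The function names `cosine` and `score` are chosen for this sketch, and embeddings are assumed to be plain lists of floats with nonzero norm.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score(u, p, t):
    """User-POI similarity plus time-POI similarity."""
    return cosine(u, p) + cosine(t, p)
```

Because each term lies in [-1, 1], the score lies in [-2, 2], and candidate POIs can be ranked by it for a given user and time period.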

We use Precision@K as the evaluation metric, defined as
$$\mathrm{Precision@}K = \frac{\sum_{i \in Test} | R_{k} \cap T_{i} |}{| Test |}. \tag{14}$$
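A small sketch of this metric follows, under the assumption that the numerator counts, for each test case, the overlap between the top-K recommended POIs and that case's ground-truth POIs. The function name `precision_at_k` and the list-of-rankings input format are hypothetical.

```python
def precision_at_k(ranked_lists, ground_truth, k):
    """Total top-k hits divided by the number of test cases.

    ranked_lists: one ranked list of candidate POIs per test case
    ground_truth: one set of true POIs per test case
    """
    hits = sum(len(set(ranked[:k]) & truth)
               for ranked, truth in zip(ranked_lists, ground_truth))
    return hits / len(ranked_lists)
```

With a single ground-truth POI per test case, this reduces to the fraction of test cases whose true POI appears among the top K recommendations.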

5.2. Comparison against state-of-the-art methods

We compare our method to the following state-of-the-art graph information embedding methods:

NetMF [17] derives the closed form of DeepWalk’s implicit matrix and factorizes this matrix to output node embeddings.

DHNE [19] proposes a Deep Hyper-Network Embedding (DHNE) model to embed hyper-networks with indecomposable hyperedges.

LBSN2Vec++ [28] proposes a POI recommendation embedding method and learns the embedding representation of nodes in cosine space.

IMP-GCN [12] performs high-order graph convolution inside subgraphs to identify users with common interests by generating embedding of user features and graph structure. IMP-GCN outperforms the state-of-the-art GCN-based recommendation models significantly.

SAPRec is the model we propose, and we denote SAPRec with three different sampling strategies. ${SAPRec}^{u r w}$ indicates user-based random walk sampler, ${SAPRec}^{r w}$ indicates random walk sampler and ${SAPRec}^{n h}$ indicates node-level hop sampler.

Table 3
Precision@K comparison results of different models on the three city datasets

Model	TKY Pre@10	TKY Pre@20	NY Pre@10	NY Pre@20	JK Pre@10	JK Pre@20
NetMF	0.009479	0.011294	0.014435	0.017551	0.011757	0.015455
DHNE	0.011358	0.013332	0.012312	0.015474	0.014927	0.018532
LBSN2Vec++	0.019152	0.020283	0.013539	0.016228	0.015593	0.030621
IMP-GCN	0.025180	0.030331	0.016878	0.021254	0.025075	0.030589
${SAPRec}^{u r w}$	0.025231	0.027627	0.014105	0.018352	0.023748	0.032227
${SAPRec}^{r w}$	0.030407	0.032581	0.017313	0.024627	0.021993	0.027908
${SAPRec}^{n h}$	0.025210	0.027677	0.013963	0.023448	0.025412	0.031251

As shown in Table 3, our model achieves the best embedding results on the datasets of all three cities. The analysis of the datasets in Fig. 2 and Fig. 3 shows that Tokyo users have the most frequent interactions between different modalities. Accordingly, our model performs best on the Tokyo dataset, which indicates that it can exploit the interactions between different modalities and generate effective embedding representations. Furthermore, the three sampling strategies we propose perform differently across the three cities' datasets, and the random walk strategy has the best average performance.
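The comparison of average performance can be reproduced from the values in Table 3. The snippet below is only an arithmetic check; the dictionary keys are shorthand for the three SAPRec variants, and averaging with equal weight over the six columns is an assumption about how the average is taken.

```python
# Precision values copied from Table 3 (TKY, NY, JK; Pre@10 and Pre@20 each).
TABLE3 = {
    "SAPRec-urw": [0.025231, 0.027627, 0.014105, 0.018352, 0.023748, 0.032227],
    "SAPRec-rw": [0.030407, 0.032581, 0.017313, 0.024627, 0.021993, 0.027908],
    "SAPRec-nh": [0.025210, 0.027677, 0.013963, 0.023448, 0.025412, 0.031251],
}

# Unweighted mean over the six reported columns for each sampler.
averages = {name: sum(vals) / len(vals) for name, vals in TABLE3.items()}
best = max(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name}: {avg:.6f}")
print("best on average:", best)
```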

NetMF is a matrix factorization-based graph embedding model that learns the representation of each node by casting network embedding as matrix factorization. However, NetMF considers neither the heterogeneous information in the LBSN graph nor the correlation of information between different modalities; therefore, it performs poorly in this experiment.

DHNE proposes a deep hyper-network embedding model to embed hyper-networks with indecomposable hyper-edges. DHNE does not decompose the connections between nodes into bipartite graphs as in previous graph embedding models but learns the embedding directly on the heterogeneous network, thus preserving the rich structural information retained in the heterogeneous network. However, DHNE is a graph embedding model without a targeted design for LBSN heterogeneous graphs and thus has poor performance on the three city datasets.

LBSN2Vec++ is a heterogeneous hypergraph embedding approach explicitly designed for automatic feature learning on LBSN data. To handle the features of the LBSN heterogeneous graph, LBSN2Vec++ samples nodes by random walk and embeds the multimodal node information into the same embedding space using a linear transformation, so that heterogeneous information can be compared under a homogeneous metric. However, the embedding learning method of LBSN2Vec++ does not consider the adjacency relationships between heterogeneous graph nodes, so its performance still lags behind SAPRec even though it considers the different modalities in LBSNs.

IMP-GCN performs high-order graph convolution inside subgraphs to identify users with common interests by exploiting user features and graph structure. IMP-GCN achieves excellent performance in traditional recommendation scenarios by relying on user-item interactions, so it also achieves the second-best performance on LBSN-based POI recommendation. However, like DHNE, IMP-GCN is not explicitly designed for LBSN heterogeneous graphs, so a small gap to SAPRec remains.

5.3. Effect of different sampling strategies

To investigate the effect of three sampling strategies on the performance of the embedding representation of SAPRec, we experiment on the dataset of Tokyo.

Fig. 5 shows that the random walk strategy yields the best embedding performance on the Tokyo dataset. This is because the random walk can reach deeper neighboring nodes than the node-level hop and user-based random walk samplers, so more mutual information can be learned in lightDGI.
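The difference in reach can be illustrated with a toy sketch. It is not the SAPRec sampler: `random_walk_subgraph` and `k_hop_neighbors` are hypothetical helpers on a plain adjacency dictionary, written under the assumption that the random walk sampler collects the nodes visited by fixed-length walks from randomly chosen roots. A 16-step walk can in principle reach nodes up to 16 edges from its root, whereas a 3-hop sampler is capped at 3.

```python
import random


def random_walk_subgraph(adj, num_roots, walk_length, seed=0):
    """Collect the nodes visited by fixed-length random walks from random roots."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    roots = rng.sample(nodes, min(num_roots, len(nodes)))
    visited = set(roots)
    for root in roots:
        current = root
        for _ in range(walk_length):
            neighbors = adj[current]
            if not neighbors:
                break  # dead end: stop this walk
            current = rng.choice(neighbors)
            visited.add(current)
    return visited


def k_hop_neighbors(adj, root, hops):
    """Nodes reachable from root within a fixed number of hops (BFS)."""
    frontier, seen = {root}, {root}
    for _ in range(hops):
        frontier = {n for node in frontier for n in adj[node]} - seen
        seen |= frontier
    return seen


# Toy path graph 0 - 1 - 2 - ... - 19.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 20] for i in range(20)}
print(sorted(k_hop_neighbors(adj, 0, 3)))  # nodes within 3 hops of node 0
print(sorted(random_walk_subgraph(adj, num_roots=1, walk_length=16)))
```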

Fig. 5.

Effect of three sampling strategies on the performance of the embedding representation of SAPRec.

5.4. Ablation experiments

In this section, we verify the effectiveness of lightDGI through ablation experiments and explore different parameter settings of SAPRec.

5.4.1. Ablation experiments for lightDGI

In Table 4, we compare the embedding performance of SAPRec with and without the feature transformation layer (FTL) on the Tokyo dataset. As we can see, SAPRec without the FTL achieves a substantial performance boost, with an average improvement of up to 35%. This result supports the effectiveness of lightDGI.

Table 4
Light deep graph infomax (lightDGI) with and without the feature transformation layer (FTL) on the Tokyo dataset

Pre@K	Without FTL	With FTL
Pre@5	0.020963	0.016315
Pre@10	0.025231	0.017238
Pre@15	0.026447	0.017924
Pre@20	0.027627	0.018789
Pre@30	0.029035	0.020090
Pre@50	0.031266	0.023029
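A relative improvement can be computed from the values in Table 4 as sketched below. Taking the With FTL column as the baseline is an assumption; dividing by the Without FTL column instead gives smaller percentages, so the exact figure depends on that choice.

```python
# Values copied from Table 4 (Tokyo dataset), keyed by K.
WITHOUT_FTL = {5: 0.020963, 10: 0.025231, 15: 0.026447,
               20: 0.027627, 30: 0.029035, 50: 0.031266}
WITH_FTL = {5: 0.016315, 10: 0.017238, 15: 0.017924,
            20: 0.018789, 30: 0.020090, 50: 0.023029}

# Relative gain of removing the FTL, measured against the With FTL baseline.
gains = {k: (WITHOUT_FTL[k] - WITH_FTL[k]) / WITH_FTL[k] for k in WITHOUT_FTL}

for k in sorted(gains):
    print(f"Pre@{k}: relative gain over With FTL = {gains[k]:.1%}")
```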

5.4.2. Ablation experiments for different modalities

To verify the effect of different modalities on the performance of our model, and to demonstrate that multimodal data improves the embedding representation, we conduct ablation experiments on the multimodal data. γ indicates that all modalities are considered; α indicates that the effect of the time period on POIs is not considered; β indicates that the effect of the time period on both POIs and users is not considered; θ indicates that only social influence and the interactions between users and POIs are considered. As shown in Fig. 6, on the Tokyo dataset, the time period has a strong influence on user check-in preferences. Moreover, the embedding performance decreases whenever part of the modal data is removed, which indicates that SAPRec generates effective embedding representations from multimodal information.

Fig. 6.

Embedding representation performance of SAPRec with different modalities.

5.5. Embedding size for SAPRec

As shown in Fig. 7, parametric experiments on SAPRec show that the model obtains the best embedding performance at an embedding dimension of 128. When the embedding dimension is too small, the fusion of multimodal information is hindered, which results in information loss. When the embedding dimension is too large, the model has difficulty converging, which degrades its performance.

Fig. 7.

Embedding representation performance of SAPRec with different embedding dimensions.

6. Conclusion

In this work, we propose a multimodal interaction aware embedding learning framework named SAPRec, a sampling-aggregated POI recommendation embedding representation learning framework. SAPRec generates reliable embeddings on the heterogeneous socio-spatial network through modality-aware sub-graph sampling and self-supervised contrastive learning. Our experiments demonstrate the strengths of SAPRec: better information fusion capabilities and more effective embedding representations. In the future, we will explore the integration of SAPRec with downstream recommender systems and further improve the accuracy and robustness of POI recommendations.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (62072094), the LiaoNing Revitalization Talents Program (XLYC2005001), and the Key Research and Development Project of Liaoning Province (2020JH2/10100046).

References

1. Chen, Yu, Wang, Domeniconi, Li and Zhang, ActiveHNE: Active heterogeneous network embedding, in: IJCAI 2019, 2019, pp. 2123–2129. doi:10.24963/ijcai.2019/294.

2. Feng, L.V. Tran, Cong, Chen, Li and Li, HME: A hyperbolic metric embedding approach for next-POI recommendation, in: SIGIR 2020, 2020, pp. 1429–1438.

3. Guo, Sun, Zhang and Y.-L. Theng, An attentional recurrent neural network for personalized next location recommendation, in: AAAI 2020, Vol. 34, 2020, pp. 83–90.

4. W.L. Hamilton, Ying and Leskovec, Inductive representation learning on large graphs, in: NeurIPS 2017, 2017, pp. 1024–1034.

5. Han, Li, Liu, Zhao, Li, Wang and Shang, Contextualized point-of-interest recommendation, in: IJCAI 2020, 2020, pp. 2484–2490. doi:10.24963/ijcai.2020/344.

6. He, Deng, Wang, Li, Zhang and Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, in: SIGIR 2020, 2020, pp. 639–648.

7. Huang, Zhang, Rong and Huang, Adaptive sampling towards fast graph representation learning, in: NeurIPS 2018, 2018, pp. 4563–4572.

8. T.T. Huynh, V.V. Tong, T.T. Nguyen, Jo, Yin and Q.V.H. Nguyen, Learning Holistic Interactions in LBSNs with High-order, Dynamic, and Multi-role Contexts, IEEE Transactions on Knowledge and Data Engineering (2022), 1–1. doi:10.1109/TKDE.2022.3150792.

9. T.N. Kipf and Welling, Semi-supervised classification with graph convolutional networks, in: ICLR 2017, 2017.

10. Li, Westerholt, Fan and Zipf, Assessing spatiotemporal predictability of LBSN: A case study of three Foursquare datasets, GeoInformatica 22(3) (2018), 541–561. doi:10.1007/s10707-016-0279-5.

11. Li, Wu, Wu and Wang, Few-shot learning for new user recommendation in location-based social networks, in: WWW 2020, 2020, pp. 2472–2478.

12. Liu, Cheng, Zhu, Gao and Nie, Interest-aware message-passing GCN for recommendation, in: WWW 2021, 2021, pp. 1296–1305.

13. Lu and Huang, GLR: A graph-based latent representation model for successive POI recommendation, Future Gener. Comput. Syst. 102 (2020), 230–244. doi:10.1016/j.future.2019.07.074.

14. Perozzi, Al-Rfou and Skiena, DeepWalk: Online learning of social representations, in: KDD 2014, 2014, pp. 701–710.

15. Qi, Hu, Zhang, M.R. Khosravi, Sharma, Pang and Wang, Privacy-aware data fusion and prediction with spatial-temporal context for smart city industrial environment, IEEE Trans. Ind. Informatics 17(6) (2021), 4159–4167. doi:10.1109/TII.2020.3012157.

16. Qiao, Luo, Li, Tian and Ma, Heterogeneous graph-based joint representation learning for users and POIs in location-based social network, Inf. Process. Manag. 57(2) (2020), 102151. doi:10.1016/j.ipm.2019.102151.

17. Qiu, Dong, Ma, Li, Wang and Tang, Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec, in: WSDM 2018, 2018, pp. 459–467. doi:10.1145/3159652.3159706.

18. Tang, Qu, Wang, Zhang, Yan and Mei, LINE: Large-scale information network embedding, in: WWW 2015, 2015, pp. 1067–1077.

19. Tu, Cui, Wang, Wang and Zhu, Structural deep embedding for hyper-networks, in: AAAI 2018, Vol. 32, 2018.

20. Velickovic, Fedus, W.L. Hamilton, Liò, Bengio and R.D. Hjelm, Deep graph infomax, in: ICLR 2019, 2019.

21. Wang, Fu, Xiong and Li, Adversarial substructured representation learning for mobile user profiling, in: KDD 2019, 2019, pp. 130–138.

22. Wang, He, Wang, Feng and T.-S. Chua, Neural graph collaborative filtering, in: SIGIR 2019, 2019, pp. 165–174.

23. Wang, Ji, Shi, Wang, Ye, Cui and P.S. Yu, Heterogeneous graph attention network, in: The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13–17, 2019, 2019, pp. 2022–2032.

24. Wang, Lu, Shi, Wang, Cui and Mou, Dynamic heterogeneous information network embedding with meta-path based proximity, IEEE Transactions on Knowledge and Data Engineering 34(3) (2022), 1117–1132. doi:10.1109/TKDE.2020.2993870.

25. Wang, Wang and Ling, Attention-guide walk model in heterogeneous information network for multi-style recommendation explanation, in: AAAI 2020, Vol. 34, 2020, pp. 6275–6282.

26. Wang, Qin, Pang, Zhang and Xin, Semantic annotation for places in LBSN through graph embedding, in: CIKM 2017, 2017, pp. 2343–2346.

27. Yang, Qu, Yang and Cudré-Mauroux, Revisiting user mobility and social relationships in LBSNs: A hypergraph embedding approach, in: WWW 2019, 2019, pp. 2147–2157.

28. Yang, Qu, Yang and Cudré-Mauroux, LBSN2Vec++: Heterogeneous Hypergraph Embedding for Location-Based Social Networks, IEEE Transactions on Knowledge and Data Engineering (2020).

29. Yang and Zhu, Next POI recommendation via graph embedding representation from H-deepwalk on hybrid network, IEEE Access 7 (2019), 171105–171113. doi:10.1109/ACCESS.2019.2956138.

30. Ying, He, Chen, Eksombatchai, W.L. Hamilton and Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: SIGKDD 2018, 2018, pp. 974–983.

31. Yu, Ye and Li, RePiDeM: A refined POI demand modeling based on multi-source data, in: INFOCOM 2020, 2020, pp. 964–973.

32. Yu, Ye, Wang, Zhang, A.M. Oguti, Li, Jin and Kurdahi, CFFNN: Cross Feature Fusion Neural Network for Collaborative Filtering, IEEE Transactions on Knowledge and Data Engineering (2021), 1–1. doi:10.1109/TKDE.2020.3048788.

33. Zeng, Zhou, Srivastava, Kannan and V.K. Prasanna, GraphSAINT: Graph sampling based inductive learning method, in: ICLR 2020, 2020.

34. Zhang, Sun, Zhang, Lei, Li, Wu, Kloeden and Klanner, An interactive multi-task learning framework for next POI recommendation with uncertain check-ins, in: IJCAI 2020, Bessiere, ed., 2020, pp. 3551–3557. doi:10.24963/ijcai.2020/491.
