Abstract
Abnormal collusive behavior, widely existing in various fields with concealment and synergy, is particularly harmful in user-generated online reviews and hard to detect by traditional methods. With the development of network science, this problem can be solved by analyzing structure features. As a graph-based anomaly detection method, the Markov random field (MRF)-based model has been widely used to identify the collusive anomalies and shown its effectiveness. However, existing methods are mostly unable to highlight the primary synergy relationship among nodes and consider much irrelevant information, which caused poor detectability. Therefore, this paper proposes a novel MRF-based method (ACEagle), considering node-level and community-level behavior features. Our method has several advantages: (1) based on the analysis of the nodes’ local structure, the community-level behavioral features are combined to calculate the nodes’ prior probability to close the ground truth, (2) it measured the behavior’s collaborative intensity between nodes by time and weight, constructing MRF by the synergic relationship exceeding the threshold to filter irrelevant structural information, (3) it operates in a completely unsupervised fashion requiring no labeled data, while still incorporating side information if available. Through experiments in user-reviewed datasets where abnormal collusive behavior is most typical, the results show that ACEagle is significantly outperforming state-of-the-art baselines in collusive anomalies detection.
Keywords
Introduction
With the development of anti-anomaly detection, anomalies adopting various camouflage methods [1] to avoid identification, present diversification, and the mainstream variation trend is from isolated to collusive, such as opinion spammer [2, 3], network intrusion [4], and telecommunications fraud [5, 6]. At present, it is still an urgent problem to be solved how to effectively detect hidden and variable abnormal behaviors, especially in user-generated online reviews, in which abundant spammers will severely affect users’ decisions. To tackle the abnormality detection problem, many features, text content feature [7], user attribute feature [8], traffic feature [9], and structure feature [10, 11] have been considered in the past decades. Among them, structures feature acquired by abstracting behavior as a graph, which is difficult to forge and explanatory [11, 12, 13], has been becoming the main focus and developed a series of methods [14, 15, 16, 17].
The graph-based anomaly detection methods abstract the behavior between entities, transforming the problems that anomaly detection into identifying abnormal structures in a complex network. According to the different definitions of anomaly structure, graph anomaly detection dividing into four categories: structure-based, probability-based, community-based, and decomposition-based [18]. There have been in-depth studies on detection through isolated anomaly structure, Ref. [19] is based on the egonet of nodes and builds feature space to find abnormal nodes by analyzing its relationship. Ref. [20, 21] further considered the directional characteristics. However, the above methods are not ideal in detecting collusion behavior.
Compared to supervised methods, unsupervised methods have received widespread attention in research in recent years, with the advantages laying in the low dependence on historical anomalous features, high precision, and robustness for new exceptions. Recently, the Markov random field (MRF)-based method such as FraudEagle [22] and SpEagle [23], as a classic unsupervised model, which utilizes the original network structure to distinguish anomalies behaviors, has shown superior to other ranking methods. ColluEagle [24] extends the model by reconstructing MRF further deliberate both network effects and time effects. Nevertheless, prior probability calculation of the above techniques ignores anomaly cooperative behavior characteristics, making it difficult to distinguish between spammers and benign communities. In this paper, considering both the features of individual and group cooperation, we propose a novel MRF-based model named ACEagle that can detect both collusive spammers and efficiently avoid involving ordinary communities. We summarize the contributions of this work as follows.
We propose ACEagle, a novel pairwise-MRF model, which elegantly embeds the collusive phenomenon to formulate it as a reorganized network Considering the diversity of nodal local structure abstracted by entities behavior, we design node-level features which extract variable entity’s behavior characteristics to illustrate this discrepancy. Then, the review system was selected as the specific scenario, used to analyze the strange properties of spammer reviews and design corresponding structural features. To specifically quantify this abnormity, we adopt two normalizations, namely cumulative distribution function (CDF) and logarithmic function (Log), to highlight numerical differences or deviation from the overall distribution of different behavioral characteristics. To accurately calculate the anomaly prior probability of nodes, we design community-level features which combine network effects and node-level behavior features, to measure the suspicion of collusive behaviors among nodes. The anomaly prior probability of nodes is adjusted by analyzing the structural characteristics of network
The remainder of this paper is organized as follows. Section 2 discusses the related work of collusive anomalies detection and includes the MRF model and its inference method. In Section 3, the description of ACEagle is divided into three parts, namely the feature extraction and aggregation, constructing collaborative network
Akoglu and Tong first systematically summarized the graph-based anomaly detection methods and studying the application scenarios of collusion behavior [14]. Pourhabibi further surveys each specific scene in detail and outlines the relevant techniques in recent years [18]. This issue aroused extensive interest and many works in this research field to solve it [25, 26, 27, 28, 29].
Xu et al. [30] first proposed SCAN algorithm based community clustering on the undirected graph, and Bryan further considered behavioral directivity [31], which detect anomaly behaviors by expressing as a community. However, it limits detectability that how to distinguish the normal community resembles abnormal’s which has similar behavior habits, preferences, and others [32]. Ye et al. [33] quantify the potential of nodes as targets for aberrant collusive behavior from the local-structure information, analyzing the overlapping neighbors of the node with higher target outlier to find the strange group. Jiang et al. [34] believe rarity and synchronization exist in abnormal collaborative behavior, quantifying behavior with neighbor behavior and mapping into feature space to find deviant collaborative groups. However, it still fails to distinguish the normal cooperative behaviors which similar to abnormal accurately.
Assuming that abnormal individuals have a solid cooperative relationship while normal’s close but relatively irregular, Ref. [22] proposed FraudEagle taking the original network as MRF and applying LBP to detect anomalies by inferring the anomaly probability under the premise of random labeling of nodes. Nonetheless, the accuracy of FraudEagle is inadequate without prior knowledge, SpEagle [23] designed a series of abnormal features for specific scenes to calculate the initial prior probability of nodes, improving the detection capability significantly and proving that LBP is effective and extensible. ColluEagle [24] reconstructs a collaborative network as MRF composed of the synergy relationship considering the effects of time and network, further improving detection efficiency. The above MRF-based methods though the interconnection potential function of closer cooperative relationship among abnormal nodes converge the likelihood of anomalies to higher anomaly probability interval, while the benign individuals lower. Its strong universality and low computational complexity make the MRF-based methods applied to a variety of abnormal scenarios and widely used in the field of anomaly detection in recent years [35, 36, 37]. Its specific concepts will introduce as follows.
Markov random field
As one of the classical models of probability graphs, the Markov random field, regarded as a Markov undirected network composed of multiple Markov chains. Therefore, it also applies to the Markov process, namely the implicit state of any node is determined jointly by the neighbor nodes’, using to solve reasoning problems in uncertain data by modeling the implicit states of nodes and their relations [38, 39]. As shown in Fig. 1, each node in MRF has arbitrary possible states expressed as a hollow node named hidden node constituting an undirected network structure. For a hidden node
Schematic diagram of Markov random field model. 
Based on the above concepts, for network
For the network formed by
The belief propagation method infers based on the prior probability evaluated from the anomaly feature, mainly depending on the depiction of symbiosis in MRF construction and potential function design. At present, most of the MRF-based methods are less focused on the construction of MRF, which taking the original network as MRF or reconstructs the MRF by simply considering the Spatio-temporal synchronization relationship [24]. The inaccurate information received by neighbors may fail to highlight principal correlations between individuals lead to the low accuracy of anomaly detection through inference. Hence, we propose ACEagle, which is an entirely unsupervised fashion and can easily accommodate labels when available, analyzing node-level and community-level behavior features to detect collusive anomalies.
In Section 3.1, we first introduce the node-level and community-level behavior features, aggregating the factors to evaluate nodes’ prior probability. The reconstruction of MRF and the design of potential function described in Section 3.2. The process of ACEagle is shown in Fig. 2.
The anomaly detection process of ACEagle. Network 
Abnormal collusive behavior is close to normal whereas having apparent Spatio-temporal synchronization. This trait is particularly evident in the spam review of user-generated online reviews. The first step of our model needs to extract behavior features to calculate the nodes’ prior probability. Therefore, we select two real commodity review datasets with spurious reviews to extract features, YelpZip and YelpNYC, which were acquired by a user-generated online reviews system, describing the scenario that spammers originate collusive review behavior to disturb products’ rating. These datasets can be abstracted into corresponding graphs, where the nodes exist in two categories: product and user. In addition, edges represent the user’s reviews, which timestamp and rating represent the edge’s appearance time and the weight respectively.
Node-level behavior feature
To preliminarily outline the anomaly degree of nodes in
Description of abnormal feature of node-level behavior
Description of abnormal feature of node-level behavior
Meanwhile, partial features are only applicable to nodes with corresponding local structures owing to differences, such as SRD, ERD, and WRF, which not suitable for nodes with a single edge, failing to distinguish them will interfere with the measurement of anomaly probability. Therefore, In Table 1, the appropriate feature types are divided into
Since the above abnormal probability obtained merely from the individual local structure is bound to have a certain degree of deviation, the magnitude of such variation will directly affect the final result of LBP. To narrow the deviation as possible to improve accuracy, the network structure used from the overall perspective measures the suspicion of the collaborative relationship among nodes to enhance the anomaly probability of anomalies and suppress ordinary respectively.
Edge-edge suspicious collusion score
Given node pair
As shown in Eq. (7),
Node-node suspicious collusion score
There may be multiple common neighbors between
Through the above analysis, most behaviors can be preliminarily distinguished, but it is still possible for the cooperative relation between benign node pair
To include all possible suspicious relations as far as possible, setting the threshold
Description of abnormal feature of community-level behavior
By extracting the egonet of node
To explain the statistical process of abnormal cooperative behavior characteristics clearly, we using the specific example given in Fig. 3 to illustrate. Firstly, towards the original network
Therefore, we consider the feature both of individual and cooperative behavior when measuring the anomaly prior probability by aggregating them using the following formula, which aims to improve the anomaly likelihood of actual abnormal nodes while suppressing normal nodes.
The process of 
Since merely considering suspicious synergies in the construction of abnormal synergies graph
The potential function design of ACEagle
The potential function design of ACEagle
[b] ACEagle
Collusiveness of node pair
Suppose there is a common neighbor node
Based on the above analysis, we can construct a collaborative graph
According to the analysis in Eq. (15), the potential function of node pair
Assuming network
Evaluation and comparison
Compared to classified detection methods, ACEagle order nodes in a ranked list based on their anomaly marginal probability to quantify the abnormal degree in fine granularity. The advantage of this approach lies in important exception nodes in the exception community can be identified, while the algorithm detection capabilities of classification methods limited on account of equally treat the node with different degrees of importance of deviant groups. Therefore, We use Precision@k and NDCG@k [22, 23] to evaluate the top-ranked detectability of ACEagle compare with other state-of-the-art ranking-based approaches while use AUC and AP to measure the overall recognition ability of ACEagle.
As for the selection of comparison methods, we choose the most related works to ACEagle, such as FraudEagle [22], SpEagle [23], and ColluEagle [24], which also rank nodes by abnormal probability obtained by inference based on MRF. In addition, we also selected the classic classified community-based method SCAN to compare with various sorting methods to indicate the effect of sorting methods on the accuracy of identifying abnormal nodes. To highlight the effectiveness of each step in ACEagle, Firstly, we measure the anomaly detection capability based on the prior probability calculated by individual behavior features introduced in Section 3.1.1 and denote it as PRIOR
Validation of real dataset
To reveal the actual detection ability of the method in this paper, we select YelpZip and YelpNYC, two real commodity review datasets with fake collaborative reviews, widely used in the verification of abnormal group identification methods [22, 23, 35, 36] and the detailed description is as follows. We extract the comments in these datasets to abstract as links, and the users and products as nodes, so a bipartite network can be used to represent the connection between products and users. Evidently, The collusion anomaly behavior of spammers for multiple targets can be regarded as a collusion anomaly subgraph structure.
A description of the dataset with actual exception
A description of the dataset with actual exception
Curves evaluated by NDCG@k and Precision@k for each algorithm in the range 0–2000. In FraudEagle, the prior probabilities of nodes are set as (0.5, 0.5). In SpEagle and ColluEagle, the prior probabilities calculated using the feature designed by themselves, and the threshold 
As shown in Fig. 4, we compared the performance of Prior
The above ranking-based approaches taking the anomaly probability of nodes as scores for sorting, measure the precision of anomaly detection by the actual proportion of anomalies in the top-k ranked list. However, it is not comprehensive for evaluating the detectability of methods. To further access the overall classification ability, AP and AUC metrics are utilized in the entire ranked list. In addition, the experiment select the classic graph clustering-based method SCAN to attest ACEagle is superior to the classified-based method, where
Results of ACEagle and baselines measured by AP, AUC, Precision@k
The value of threshold
Since ACEagle is an MRF-based abnormal subgraph identification method, the prior probability of nodes calculated by the features of cooperative behavior and individual behavior. Then the probability propagation according to the synergistic relationship until convergence. The inference result of the method depends on consistency between the nodes’ prior probability and the actual situation. Therefore, ACEagle can also be seen as a semi-supervised approach when we know some actual labels of networks. Next, we investigate how much the detection performance can be improved by semi-supervision when setting varying amounts of labeled data. As shown in Table 6, we measure the detection capability by AUC, AP and Precision@k.
Under partial labeled data, results of ACEagle measured by AP, AUC, Precision@k
A sensitivity test that observes the effect of changes in 
Different labeling abnormal nodes proportions in network be set to measure the detection capability enhancement of ACEagle by using NDCG@k and Precision@k, {0.5%, 1.5%, 3%} for YelpNYC, and {0.5%, 1.5%, 3%} for YelpZip. To reflect the change of top-k sorting ability, we use Precision and NDCG@k to demonstrate how many actual abnormal nodes are included in the top-k list and whether actual abnormal nodes are ranked in the top region of the top-k list, respectively. As shown in Fig. 6, even only a few nodes are labeled, the accuracy of ACEagle has been largely improved. Visible in the experiment result, the model can be an unsupervised anomaly detection method without prior knowledge and can be a semi-supervised method for anomaly collusive subgraph identification when known a few abnormal node labels. Moreover, knowing the more abnormal label, the detection accuracy increases non-linearly.
The enhancement of detection capability in ACEagle under the condition of labeling abnormal nodes with different proportions in network. NDCG@k and Precision@k are used to measure the influence and compare to ACEagle without label. 
ACEagle proposed in our work abstracts the communication relationship between entities as complex network, analyzing node-level and community-level behavior features to identify the camouflage and represent synergistic relationships. In the process of detecting anomalies, nodes’ anomalous prior probability is measured by node-level features extracted from time and weight properties of behaviors. Then, the anomalous prior probability is adjusted to ground-truth by considering community-level behavior features extracted from the network constructed by suspicious cooperative behaviors. Experiments on YelpZip and YelpNYC show that ACEagle is superior to baseline methods under various evaluation indicators. For instance, when ACEagle experiments under unsupervised conditions, which AUC metric 15% higher than FraudEagle, about 5% higher than SpEagle and ColluEagle. While ACEagle experiments under semi-supervised conditions, which AUC metric improves 40% than unsupervised conditions. Future work will study the discrepancy in linkage motivation of benign and anomalous groups and the characteristics of anomalous communities evolving on complex networks.
Footnotes
Acknowledgments
This research was supported by the National Natural Science Foundation of China (No. 61803384).
