Reducing the synchronizing communication overhead for distributed graph-parallel computing

Abstract

A number of graph-parallel computing abstractions have been proposed to address the needs of solving complex and large-scale graph computing problems. However, unnecessary and excessive communication and state sharing between nodes in these frameworks not only reduce network efficiency but may also degrade runtime performance. In this paper, we propose a mechanism called LightGraph, which reduces the synchronizing communication overhead for distributed graph-parallel computing abstractions. Besides identifying and eliminating the redundant synchronizing communications in existing systems, LightGraph also proposes an edge direction-aware graph partitioning strategy to minimize the required synchronizing communications. This new graph partitioning strategy optimally isolates the outgoing edges from the incoming edges of a vertex. We have conducted extensive experiments using real-world data, and our results verify the effectiveness of LightGraph. For example, compared to PowerGraph, LightGraph not only reduces the synchronizing communication overhead for intra-graph synchronizations by up to 31.5%, but also cuts the runtime of PageRank on the LiveJournal dataset by up to 16.3%.

Keywords

Graph-parallel computing, big data, light communication

1. Introduction

Due to recent advances in high-throughput techniques, big data analytics is increasingly used in many fields, such as systems biology [28, 24], the government sector [10], and mobile communication [27]. Complex networks such as social, biological, and technological networks can be mathematically modelled as graphs. These real-world networks are often large, consisting of millions or even billions of vertices and hundreds of billions of edges. Making sense of large real-world networks, ranging from social networks of friends and links between web pages in the World Wide Web to gene regulatory networks, is an increasingly important problem. Thus, designing effective and scalable computing systems for analyzing and processing huge real-world graphs has gained significant attention and effort.

To alleviate the communication overhead and accelerate the graph-structured application execution, we propose a mechanism that identifies and eliminates the avoidable communication during synchronization in existing distributed graph structured computing abstractions. We implemented our method and created LightGraph: a light communication distributed graph-parallel computation system. Furthermore, to minimize the required intra-graph synchronizations for PageRank-like applications, LightGraph also employs an edge direction-aware graph partitioning strategy, which optimally isolates the outgoing edges from the incoming edges of a vertex when creating and distributing replicas among different machines.

The rest of the paper is organized as follows. Section 2 introduces the related works. LightGraph is detailed in Section 3. Section 4 details the experiment design and result. Section 5 concludes this paper.

2. Related works

A number of distributed graph-parallel abstractions have emerged in the literature. Pregel [13] explores graph parallelization through the use of a bulk synchronous distributed message-passing system. Several other systems, such as GPS [19] and Giraph [1], are similar to Pregel. PGX [8], developed by Oracle, can process large-scale graphs under either a single-machine shared-memory or a distributed computing model. Stutz et al. propose the Signal-Collect [22] framework to concisely specify and execute a number of computations that are typical for the Semantic Web. Naiad [15, 16] is able to conduct incremental iterative computation. However, it adopts traditional synchronous checkpointing for fault tolerance and cannot respond to stragglers [26]. Distributed GraphLab [12] and its successors, PowerGraph [6] and Ligraph [29], exhibit better performance than the others, with a better graph processing rate and higher scalability [30, 7, 4]. Cyclops [3] is also a vertex-oriented graph-parallel framework. However, compared with LightGraph (written in C++), its Java implementation based on Hama [21] drags its runtime performance down. Work [14] uses vertex-cut graph partitioning that considers both diverse vertex traffic and heterogeneous network costs. However, unlike the EDAP partitioning strategy proposed in this work, its partitioning method does not take the application characteristics into account. Work [20] proposes a lightweight processing framework called Frog with a hybrid coloring model. However, Frog only supports the asynchronous computing model. In general, the asynchronous computing model introduces much more communication than the synchronous computing model. On the other hand, works [23, 25] only support the synchronous computing model, whereas this work supports both the synchronous and asynchronous computing models.
Compared with our preliminary work [31], this work details the proposed algorithms, conducts more extensive experiments verifying the effectiveness of LightGraph on processing various data sets using different algorithms, and reports that LightGraph outperforms more recent mechanisms.

Much effort has also been made to deal with communication overhead, the inherent problem in distributed computing systems. In traditional message passing abstractions, such as Pregel [13], Giraph [1], and GPS [19], all vertex-programs run simultaneously in a sequence of super-steps. In each super-step, each program instance receives all messages sent by its neighbors in the previous super-step and sends messages to its neighbors for the next super-step [6]. In order to reduce the number of communication messages, Pregel introduces a commutative associative message combiner, which merges messages destined for the same vertex [13]. Work [18] proposes asynchronous broadcast and reduction operations to reduce communication associated with high-degree vertices. Works [6, 29] and LightGraph, proposed in this work, employ the GAS (Gather, Apply, and Scatter) graph computing model and ensure that changes made to the vertex or edge data are automatically visible to adjacent vertices. Thus, LightGraph eliminates the messages transferred between adjacent vertices.

3. LightGraph: Lighten communication in distributed graph-parallel abstractions

3.1 Challenges of communication and synchronization

In order to process a large-scale graph, a distributed graph-parallel computing system needs to partition the graph into smaller sub-graphs and distribute the sub-graphs to different machines. GraphLab [12] uses an edge-cut approach while PowerGraph [6] adopts a vertex-cut strategy. Nevertheless, replicas (ghosts) have to be created for the vertices and edges across the cutting-line. And through synchronizing these replicas, computation states and data can traverse through the sub-graphs placed on different machines.

In both GraphLab and PowerGraph, communication occurs during synchronizations of replicas and the volume is proportional to the number of ghosts. One prominent problem in GraphLab is that when partitioning a power-law graph, it has to resort to a hashed (random) vertex placement algorithm that cuts across most of the edges and creates many unnecessary mirrors [6]. The communication overhead then increases substantially and seriously impacts the execution efficiency of graph-structured applications for power-law graphs. On the other hand, under PowerGraph, the vertex-cut partitioning process stores each edge exactly once, which eliminates the need for edge mirrors, so data updates on edges do not need to be communicated to other sub-graphs [6].

In other words, with PowerGraph, only vertices are replicated and only vertex data need to be synchronized. One of the vertex replicas is randomly chosen as the master and the remaining ghosts are denoted as mirrors. In a typical vertex-program, the master runs the apply function and sends the updated vertex data to all mirrors. Although PowerGraph reduces the communication overhead significantly compared to GraphLab, it still suffers from very high communication overhead, which limits its performance and scalability [6].

3.2 Communication overhead in PowerGraph

Existing distributed graph-parallel computing systems, including PowerGraph, blindly synchronize all replicas of a vertex or edge when there is a data change in one of the replicas. However, for certain graph algorithms such as PageRank [17], the data on some of the replicas will never be accessed in future computation iterations. Therefore, the communication for synchronizing these mirrors can be avoided.

Through experimenting with PowerGraph, we found that in certain graph computing applications/algorithms, the direction of information or data flow is consistent with the direction of the edges. More specifically, in a directed edge, the data on the target vertex is not needed by that edge or that edge's source vertex during computation. Thus, in a distributed graph, the data on a vertex replica that has no outgoing edge will never be accessed by any other edges or vertices in future computation iterations, and such replicas do not need to be synchronized. PageRank [17], HITS [9] and SALSA [11] all fall into this category. We define this category of applications/algorithms as PageRank-like applications/algorithms as follows:

Definition 1 (PageRank-like Algorithm). An algorithm is a PageRank-like algorithm if

$\displaystyle\forall\textit{edge}(u\rightarrow v)\in E.$ (1)

The computations on $u$ and $\textit{edge}(u\rightarrow v)$ are subject to

$\displaystyle D_{u}=f(\textit{args}[0],\textit{args}[1],\ldots,\textit{args}[i]).$ (2)

and

$\displaystyle D_{u\rightarrow v}=f(\textit{args}[0]^{\prime},\textit{args}[1]^{\prime},\ldots,\textit{args}[j]^{\prime}).$ (3)

in which

$\displaystyle D_{v}\notin([\textit{args}[0],\textit{args}[1],\ldots,\textit{args}[i]]\cup[\textit{args}[0]^{\prime},\textit{args}[1]^{\prime},\ldots,\textit{args}[j]^{\prime}]).$ (4)

where $E$ is the set of all edges; $D_{v}$ denotes the data associated with vertex $v$; $D_{u\rightarrow v}$ denotes the data associated with $\textit{edge}(u\rightarrow v)$; and $f$ is the update function of vertex $v$ or $\textit{edge}(u\rightarrow v)$.

We demonstrate our observation through a simplified example of running PageRank in PowerGraph with a sample graph shown in Fig. 1. Conceptually, the PageRank score of a node is the long-term probability that a random web surfer is at that node at a particular time step. The computation of the PageRank score of a webpage $v$ is an iterative process where the PageRank algorithm recursively computes the rank $R_{v}$ considering the scores of web pages ( $u$ ) that are connected to $v$ , defined as:

$\displaystyle R(v)=(1-\alpha)\sum_{u\,\textit{links to}\,v}w_{u,v}\times R(u)+\frac{\alpha}{n}.$ (5)

where $\alpha$ is the damping factor (see [12, 17] for detailed description about PageRank).
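
As a minimal illustration of Eq. (5), the following Python sketch iterates the rank update on a small graph. The three-vertex cycle, the uniform initial ranks, the value of $\alpha$, and the choice $w_{u,v}=1/\text{outdegree}(u)$ are assumptions made for this example only. Note that the update of $R(v)$ reads only the ranks of the vertices linking to $v$, which is the property that Definition 1 captures.

```python
def pagerank(in_links, weights, alpha=0.15, iterations=20):
    """Iterate Eq. (5): R(v) = (1 - alpha) * sum_{u links to v} w[u, v] * R(u) + alpha / n."""
    vertices = list(in_links)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}  # assumed uniform starting ranks
    for _ in range(iterations):
        # The comprehension reads the previous ranks and builds the new ones in one pass.
        rank = {
            v: (1 - alpha) * sum(weights[(u, v)] * rank[u] for u in in_links[v]) + alpha / n
            for v in vertices
        }
    return rank


# Hypothetical cycle a -> b -> c -> a; every vertex has out-degree 1, so each weight is 1.
in_links = {"a": ["c"], "b": ["a"], "c": ["b"]}
weights = {("c", "a"): 1.0, ("a", "b"): 1.0, ("b", "c"): 1.0}
ranks = pagerank(in_links, weights)
```

On this symmetric example every vertex keeps the rank $1/3$, which gives a quick sanity check of the update rule.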

Figure 1.

A partial sample graph for PageRank algorithm.

Figure 2 shows an example of a 4-way vertex-cut of the graph based on PowerGraph’s partitioning algorithm. Let us assume that we are computing the PageRank score of vertex $A$ , and the replica of vertex $A$ located on machine 2 is nominated as the master. In PowerGraph, first, the gather function runs locally on each machine to calculate the partial PageRank score of vertex $A$ based on the local sub-graph and then the partial value is sent from each mirror to the master. Second, the master runs the apply function to compute the new vertex value of $A$ and then sends the updated vertex data to all mirrors of $A$ . Finally, the scatter phase is run in parallel on all mirrors of $A$ and writes the new value back to the data graph for the next computing iteration. As shown in Fig. 2, the mirror of vertex $A$ located on machine 4 has no outgoing edges, which indicates that the data (PageRank score of vertex $A$ ) on this mirror will not be used by any other vertices within the sub-graph on machine 4 in the future computations. Thus, the master of $A$ does not need to synchronize this mirror upon updates; and therefore, the communication between the master (machine 2) and this mirror (machine 4) can be avoided. Moreover, the scatter phase on this mirror can also be eliminated.

Figure 2.

Graph placement under PowerGraph.

3.3 The LightGraph abstraction

Based on the above observation and analysis, we propose a mechanism that identifies and eliminates these avoidable communications when synchronizing the master and its mirrors. We implemented our method on PowerGraph and created LightGraph: a light communication distributed graph-parallel computation system that aims to alleviate the communication overhead for PageRank-like algorithms. In particular, to achieve a light communication distributed graph-parallel computing system, we propose two novel methods in LightGraph: (1) a streamlined synchronization process that eliminates the unnecessary communications; and (2) an edge direction-aware vertex-cut partitioning strategy that maximizes the proportion of mirrors with no outgoing edges and further reduces the communication overhead.

Figure 3.

The communication pattern of PowerGraph and LightGraph when master communicates with a mirror with no outgoing edges in PageRank-like algorithm. In LightGraph the synchronization and scatter phases of this mirror are eliminated.

Figure 4.

An example of reduced communication in LightGraph compared to PowerGraph during synchronization.

3.3.1 Streamline the synchronization process

Under PowerGraph, as illustrated above, the data on mirrors without outgoing edges will not be accessed in future computations for PageRank-like algorithms. Thus, even if the data on the master of a vertex has been updated, there is no need for the master to synchronize these mirrors. LightGraph identifies these mirrors during the initial partitioning process by checking their out-degree and then eliminates the synchronizing operations pertaining to these mirrors during the execution stage of the graph application, as shown in Fig. 3. Consequently, LightGraph is able to reduce the overall communication required for PageRank-like algorithms. Figure 4 compares the synchronization process in LightGraph with that under PowerGraph and shows the reduced communication workload. The pseudocode for vertex computing in PowerGraph and LightGraph is shown in Algorithms 1 and 2, respectively.

Algorithm 1: Vertex Computing in PowerGraph
  for each mirror of vertex A do
    Gather();
    Send partial sum to master;
  end for
  Apply();
  for each mirror of vertex A do
    vdata_exchange.send(mirror, new data of A);
    Scatter();
  end for

Algorithm 2: Vertex Computing in LightGraph
  for each mirror of vertex A do
    Gather();
    Send partial sum to master;
  end for
  Apply();
  for each mirror of vertex A do
    if mirror has outgoing edges then
      vdata_exchange.send(mirror, new data of A);
      Scatter();
    end if
  end for
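
The difference between the two synchronization loops can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PowerGraph or LightGraph implementation: the function `sync_targets`, the dictionary layout, and the out-degree values of the mirrors are hypothetical, with the master assumed to be on machine 2 as in the example above.

```python
def sync_targets(mirrors, skip_no_outgoing):
    """Return the machines to which the master sends the updated vertex data.

    mirrors maps a machine id to the number of outgoing edges that the
    vertex has in the local sub-graph on that machine.
    """
    return [
        machine
        for machine, out_degree in sorted(mirrors.items())
        if not (skip_no_outgoing and out_degree == 0)
    ]


# Hypothetical mirrors of vertex A on machines 1, 3 and 4 (master on machine 2).
# The mirror on machine 4 has no outgoing edges.
mirrors_of_a = {1: 1, 3: 1, 4: 0}

powergraph_targets = sync_targets(mirrors_of_a, skip_no_outgoing=False)
lightgraph_targets = sync_targets(mirrors_of_a, skip_no_outgoing=True)
```

With these assumed values the PowerGraph-style loop sends three messages per update of A, whereas the LightGraph-style loop sends two and also skips the scatter phase on machine 4.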

3.3.2 Edge direction-aware graph partition

As shown above, for PageRank-like algorithms, the synchronization to the mirrors without any outgoing edges can be eliminated. Naturally, we can reason that, assuming the same (or a similar) number of overall mirrors, the more mirrors that are created without any outgoing edges, the more synchronizing communication overhead can be avoided. Therefore, we propose a new graph placement method in LightGraph that takes into account the direction of edges during the initial graph partitioning phase, namely the edge direction-aware partition (EDAP) strategy. First, during the graph partitioning process, for a particular vertex, EDAP tries to assign edges with the same direction (inbound or outbound edges of a vertex) to the same machine. By this design, EDAP maximizes the proportion of replicas that have no outgoing edges among all vertex replicas. Second, instead of randomly appointing one of the vertex replicas as the master, EDAP chooses the master from the replicas that do have outgoing edges, since the synchronization communication only occurs from the master to mirrors. EDAP optimally isolates the outgoing edges from the incoming edges of a vertex among different machines while maintaining the other partitioning mechanisms used by PowerGraph to guarantee good work balance and a low number of vertex replicas.

Figure 5 illustrates the new 4-way placement of the sample graph shown in Fig. 1 achieved by EDAP. Distinct from existing vertex-cut approaches (as demonstrated in Fig. 2), the inbound edges of vertex $A$, edge ($H\rightarrow A$) and edge ($G\rightarrow A$), are assigned to the same machine (machine 3) and the outbound edges of vertex $A$, edge ($A\rightarrow I$) and edge ($A\rightarrow J$), are placed together on machine 1. Consequently, in addition to the mirror on machine 4, the mirror on machine 3 is now also a no-outgoing-edge mirror of vertex $A$. Under this placement, once the data of vertex $A$ is updated, the master of $A$ only needs to synchronize the mirror on machine 1. Figure 6 shows the corresponding synchronizing scenario under the new graph placement achieved by EDAP. Compared with the synchronizing scenario using existing vertex-cut approaches as shown in Fig. 4, the synchronizing communication is further reduced.

Figure 5.

An example of graph placement using the proposed edge direction-aware partitioning strategy.

Figure 6.

An example of synchronizing communications under EDAP-based graph placement.

We implemented the proposed EDAP approach based on two existing partitioning strategies in PowerGraph: Random and Oblivious [30]. The Random strategy uses a hash function that randomly distributes edges to machines. It is fully data-parallel during the partitioning process and can achieve a nearly perfectly balanced workload distribution on large graphs. On the other hand, the Oblivious partitioning strategy uses a sequential greedy heuristic whose goal is to place subsequent edges on appropriate machines to minimize the conditional expected replication factor. As defined in [30], the replication factor is the ratio of the number of overall replicas in the distributed graph to the number of vertices in the original input graph. In a $p$-way vertex-cut placement scenario, assuming each vertex ($v$) of the original input graph spans the set $A(v)$ of machines that contain its adjacent edges, the replication factor can be formally defined as:

$\displaystyle\textit{Replication Factor}=\frac{1}{|V|}\sum_{v\in V}|{A(v)}|.$ (6)

Therefore, the objective of the Oblivious strategy is to place the ($i+1$)th edge, after having placed the previous $i$ edges, so that:

$\displaystyle\arg\min_{j}\mathbb{E}\left[\sum_{v\in V}|A(v)|\,\middle|\,A_{e_{1}},\ldots,A_{e_{i}},A_{e_{i+1}}=j\right]$ (7)

where $A_{e_{i}}$ is the assignment for the $i$ th edge, $j$ is the ID of a machine in the distributed system.

Oblivious runs the greedy heuristic independently on each machine without additional communication and it has the best performance of all the partitioning strategies implemented in PowerGraph [30].
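
To make Eq. (6) concrete, the sketch below computes the replication factor of a given edge placement in Python. The function name and the input format are assumptions for illustration. The sample placement uses only the four edges of vertex $A$ named in the EDAP example above, not the full sample graph, and the sketch counts only vertices that have at least one edge.

```python
from collections import defaultdict


def replication_factor(edge_to_machine):
    """Eq. (6): average number of machines spanned by a vertex."""
    spans = defaultdict(set)  # vertex -> A(v), the machines holding its edges
    for (source, target), machine in edge_to_machine.items():
        spans[source].add(machine)
        spans[target].add(machine)
    return sum(len(machines) for machines in spans.values()) / len(spans)


# Inbound edges of A on machine 3, outbound edges of A on machine 1.
placement = {("H", "A"): 3, ("G", "A"): 3, ("A", "I"): 1, ("A", "J"): 1}
rf = replication_factor(placement)
```

Here only $A$ spans two machines while the other four vertices span one each, so the replication factor is $6/5=1.2$.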

In LightGraph, we extended the Random and Oblivious partition strategies with edge direction awareness and created two partition methods, EDAP_Random and EDAP_Oblivious, respectively. In particular, based on the Random and Oblivious implementations, we added a heuristic step that isolates the outgoing edges from the incoming edges of a vertex among different machines. In detail, following the placement of the previous $i$ edges, to place the ($i+1$)th edge (source, target) the new strategies first select the machines satisfying the following conditions:

$\displaystyle\textit{In\_Dre}(\textit{machine},\textit{source})=0$ (8)

and

$\displaystyle\textit{Out\_Dre}(\textit{machine},\textit{target})=0$ (9)

where $\textit{In\_Dre}/\textit{Out\_Dre}(\textit{Machine }A,\textit{Vertex }v)$ denotes the in-degree/out-degree of vertex $v$ on the local graph located on machine $A$ .

During the machine selection process, we give higher priority to the machines on which

$\displaystyle\textit{Out\_Dre}(\textit{machine},\textit{source})>0$ (10)

$\displaystyle\textit{In\_Dre}(\textit{machine},\textit{target})>0$ (11)

By doing so, we maximize the chances of creating mirrors with only incoming or outgoing edges.

Algorithms 3–6 give the pseudocode of the four partition strategies mentioned above, respectively.

Algorithm 3: Random Partition in PowerGraph
  // Randomly assign edge (source, target) to a machine p in {0, …, numprocs-1}
  INPUT: Source, Target, # of processes
  OUTPUT: process/machine ID
  Return p = hash_edge(source, target) in all machines;

Algorithm 4: Oblivious Partition in PowerGraph
  // Greedily assign edge (source, target) to a machine p in {0, …, numprocs-1}
  INPUT: Source, Target, # of processes
  OUTPUT: process/machine ID
  for each machine i do
    balance = (maxedges - proc_num_edges[i]) / (epsilon + maxedges - minedges);
    proc_score[i] = balance + a credit if source is already on i + a credit if target is already on i;
  end for
  for each machine i do
    if |proc_score[i] - maxscore| < 1e-5 then
      Put i into the candidate set;
    end if
  end for
  Return p = hash_edge(source, target) in the candidate set;
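
The greedy scoring of the Oblivious strategy can be sketched in Python as follows. This is a hedged illustration rather than PowerGraph's code: the credit value of 1, the default `epsilon`, the `vertices_on` bookkeeping, and the stand-in `edge_hash` function are all assumptions, and `maxscore` is taken to be the largest score over all machines.

```python
def edge_hash(source, target):
    """Hypothetical deterministic stand-in for hash_edge (integer vertex ids)."""
    return source * 1000003 + target


def oblivious_assign(source, target, proc_num_edges, vertices_on,
                     epsilon=1.0, credit=1.0):
    """Pick a machine for edge (source, target) by balance plus locality credits."""
    max_edges, min_edges = max(proc_num_edges), min(proc_num_edges)
    scores = []
    for i, num_edges in enumerate(proc_num_edges):
        balance = (max_edges - num_edges) / (epsilon + max_edges - min_edges)
        score = balance
        if source in vertices_on[i]:
            score += credit  # source already has a replica on machine i
        if target in vertices_on[i]:
            score += credit  # target already has a replica on machine i
        scores.append(score)
    max_score = max(scores)
    candidates = [i for i, s in enumerate(scores) if abs(s - max_score) < 1e-5]
    # Ties are broken by hashing the edge into the candidate set.
    return candidates[edge_hash(source, target) % len(candidates)]


# Machine 0 is the most loaded but already holds both endpoints of edge (1, 2).
chosen = oblivious_assign(1, 2, proc_num_edges=[5, 2, 2],
                          vertices_on=[{1, 2}, {1}, set()])
```

In this example machine 0 scores 2.0 against 1.75 and 0.75 for the others, so the two locality credits outweigh its worse balance term and the edge goes to machine 0.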

Algorithm 5: EDAP_Random in LightGraph
  // Edge direction-aware random assignment of edge (source, target) to a machine p in {0, …, numprocs-1}
  // proc_dst_vertex[i].get(vertex ID) and proc_src_vertex[i].get(vertex ID) indicate whether a vertex has already been a target or a source of another edge on machine i
  INPUT: Source, Target, # of processes
  OUTPUT: process/machine ID
  for each machine i do
    if ((proc_dst_vertex[i].get(source) != 1) and (proc_src_vertex[i].get(source) == 1)) or ((proc_dst_vertex[i].get(target) == 1) and (proc_src_vertex[i].get(target) != 1)) then
      Put i into the candidate set;
    end if
  end for
  if size of candidate set == 0 then
    for each machine i do
      if (proc_dst_vertex[i].get(source) != 1) and (proc_src_vertex[i].get(target) != 1) then
        Put i into the candidate set;
      end if
    end for
  end if
  if size of candidate set == 0 then
    for each machine i do
      Put i into the candidate set;
    end for
  end if
  Return p = hash_edge(source, target) in the candidate set;

Table 1

Notations

Symbol	Description
$n$	The number of iterations in a graph computing job
$\text{\o}(v,i)$	The flag indicating whether $v$ is active or not in the $i$ th iteration
$v\_m_{0}$	The number of $v$ ’s mirrors with both incoming and outgoing edge
$v\_m_{1}$	The number of $v$ ’s mirrors with no incoming edge
$v\_m_{2}$	The number of $v$ ’s mirrors with no outgoing edge
$N\textit{\_sync\_comm\_}v$	The number of synchronizing messages happening on vertex $v$
$N\textit{\_total\_comm}$	The number of total synchronizing communication messages happening in a graph computing job
$P\textit{\_reduced\_comm}$	Percentage of reduced synchronizing communication messages in a graph computing job

Algorithm 6: EDAP_Oblivious in LightGraph
  // Edge direction-aware greedy assignment of edge (source, target) to a machine p in {0, …, numprocs-1}
  INPUT: Source, Target, # of processes
  OUTPUT: process/machine ID
  for each machine i do
    balance = (maxedges - proc_num_edges[i]) / (epsilon + maxedges - minedges);
    proc_score[i] = balance + a credit if source is already on i + a credit if target is already on i;
  end for
  for each machine i do
    if (((proc_dst_vertex[i].get(source) != 1) and (proc_src_vertex[i].get(source) == 1)) or ((proc_dst_vertex[i].get(target) == 1) and (proc_src_vertex[i].get(target) != 1))) and |proc_score[i] - maxscore| < 1e-5 then
      Put i into the candidate set;
    end if
  end for
  if size of candidate set == 0 then
    for each machine i do
      if (proc_dst_vertex[i].get(source) != 1) and (proc_src_vertex[i].get(target) != 1) and |proc_score[i] - maxscore| < 1e-5 then
        Put i into the candidate set;
      end if
    end for
  end if
  if size of candidate set == 0 then
    for each machine i do
      if |proc_score[i] - maxscore| < 1e-5 then
        Put i into the candidate set;
      end if
    end for
  end if
  Return p = hash_edge(source, target) in the candidate set;
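
The three-tier candidate selection shared by the EDAP strategies can be sketched in Python as below. The sketch follows the conditions of EDAP_Random. The set-based bookkeeping (`src_on`, `dst_on`) is an assumed replacement for `proc_src_vertex` and `proc_dst_vertex`, and the final hash-based pick within the candidate set is omitted.

```python
def edap_candidates(source, target, src_on, dst_on):
    """Return the candidate machines for edge (source, target).

    src_on[i] / dst_on[i] hold the vertices that are already the source /
    target of some edge placed on machine i.
    """
    machines = range(len(src_on))
    # Tier 1: source has only outgoing edges on i, or target has only incoming edges on i.
    tier1 = [
        i for i in machines
        if (source not in dst_on[i] and source in src_on[i])
        or (target in dst_on[i] and target not in src_on[i])
    ]
    if tier1:
        return tier1
    # Tier 2: source has no incoming and target has no outgoing edge on i (Eqs. 8 and 9).
    tier2 = [
        i for i in machines
        if source not in dst_on[i] and target not in src_on[i]
    ]
    if tier2:
        return tier2
    # Tier 3: fall back to all machines.
    return list(machines)


# Hypothetical state: A is a source on machine 0 and a target on machine 1.
src_on = [{"A"}, set(), {"B"}]
dst_on = [{"I"}, {"A"}, set()]
outgoing = edap_candidates("A", "J", src_on, dst_on)
incoming = edap_candidates("I", "A", src_on, dst_on)
fresh = edap_candidates("X", "Y", src_on, dst_on)
```

In this assumed state a new outgoing edge of A is steered to machine 0 and a new incoming edge of A to machine 1, which keeps the two directions of A on separate machines. An edge between two unseen vertices may go to any machine.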

3.4 Volume of synchronizing communications analysis

In this section, we analyze the volume of synchronizing communications. We look inside the distribution structure of a graph and explore its relationship with the volume of synchronizing communications. Table 1 explains the related notations.

3.4.1 General analysis

Given a vertex, $v$ , according to the graph partition process in PowerGraph all mirrors of $v$ have edges. Thus $v$ ’s mirrors can be classified into the following three classes:

1. mirrors with both incoming and outgoing edges;
2. mirrors with outgoing edges and without incoming edges;
3. mirrors with incoming edges and without outgoing edges.

We also introduce a flag, ø$(v,i)$, which is defined as

$\displaystyle\text{\o}(v,i)=\left\{\begin{array}{ll}0&v\text{ is not active in the }i\text{th iteration}\\ 1&v\text{ is active in the }i\text{th iteration}\end{array}\right.$

In PowerGraph

After the Apply phase is done, data on master is updated. Then the master will synchronize all its mirrors with the new data.

All $v$ ’s mirrors need to be synchronized by $v$ ’s master. Thus, the number of synchronizing messages happening on vertex $v$ is:

$\displaystyle N\textit{\_sync\_comm\_}v=\sum_{i=0}^{n-1}\left[\text{\o}(v,i)\sum_{j=0}^{2}v\_m_{j}\right]$ (12)

There is no duplicated communication between any two different vertices. Consequently, the total number of synchronizing communication messages happening in the whole graph computing job is:

$\displaystyle N\textit{\_total\_comm}=\sum_{i=0}^{|V|-1}N\textit{\_sync\_comm\_}v_{i}=\sum_{i=0}^{|V|-1}\sum_{j=0}^{n-1}\left[\text{\o}(v_{i},j)\sum_{k=0}^{2}v_{i}\_m_{k}\right]$ (13)

In LightGraph

LightGraph only eliminates the unnecessary synchronizing communications between mirrors and the master. Thus, LightGraph has no impact on the intermediate computing results of a vertex. Therefore, for each vertex $v$, $v\_p_{i}$ is the same as under PowerGraph.

In LightGraph, $v$ ’s mirror with no outgoing edge does not need to be synchronized by $v$ ’s master. Thus, the number of synchronizing messages happening on vertex $v$ is:

$\displaystyle N\textit{\_sync\_comm\_}v=\sum_{i=0}^{n-1}\left[\text{\o}(v,i)(v\_m_{0}+v\_m_{1})\right]$ (14)

Consequently, the number of total synchronizing communication messages happening in the whole graph is:

$\displaystyle N\textit{\_total\_comm}=\sum_{i=0}^{|V|-1}N\textit{\_sync\_comm\_}v_{i}=\sum_{i=0}^{|V|-1}\sum_{j=0}^{n-1}\left[\text{\o}(v_{i},j)(v_{i}\_m_{0}+v_{i}\_m_{1})\right]$ (15)

Thus the number of reduced synchronizing communication messages achieved by LightGraph over PowerGraph is:

$\displaystyle N\textit{\_total\_comm}_{\textit{Reduced}}=\sum_{i=0}^{|V|-1}\sum_{j=0}^{n-1}\left[\text{\o}(v_{i},j)\sum_{k=0}^{2}v_{i}\_m_{k}\right]-\sum_{i=0}^{|V|-1}\sum_{j=0}^{n-1}\left[\text{\o}(v_{i},j)(v_{i}\_m_{0}+v_{i}\_m_{1})\right]=\sum_{i=0}^{|V|-1}\sum_{j=0}^{n-1}\left[\text{\o}(v_{i},j)v_{i}\_m_{2}\right]$ (16)

Thus,

$\displaystyle P\textit{\_reduced\_comm}=\frac{\sum_{i=0}^{|V|-1}\sum_{j=0}^{n-1}\left[\text{\o}(v_{i},j)v_{i}\_m_{2}\right]}{\sum_{i=0}^{|V|-1}\sum_{j=0}^{n-1}\left[\text{\o}(v_{i},j)\sum_{k=0}^{2}v_{i}\_m_{k}\right]}\times 100$ (17)

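
Equations (13), (15) and (17) can be evaluated directly once the activity flags and the per-class mirror counts are known. The Python sketch below does so for two hypothetical vertices. The function name, the input layout, and all numbers are assumptions for illustration, and the sketch assumes at least one synchronizing message under PowerGraph so that the percentage is defined.

```python
def sync_volume(active, mirrors):
    """Return (PowerGraph messages, LightGraph messages, reduced percentage).

    active[v][i] is the flag of v in iteration i (1 if active, else 0).
    mirrors[v] is the tuple (m0, m1, m2) of mirror counts per class.
    """
    # Eq. (13): every mirror is synchronized in every active iteration.
    powergraph = sum(sum(active[v]) * sum(mirrors[v]) for v in mirrors)
    # Eq. (15): only mirrors with outgoing edges (classes 0 and 1) are synchronized.
    lightgraph = sum(sum(active[v]) * (mirrors[v][0] + mirrors[v][1]) for v in mirrors)
    # Eq. (17): share of messages removed, in percent.
    reduced_pct = 100.0 * (powergraph - lightgraph) / powergraph
    return powergraph, lightgraph, reduced_pct


# Hypothetical job with n = 3 iterations and two vertices.
active = {"A": [1, 1, 1], "B": [1, 0, 0]}
mirrors = {"A": (1, 1, 2), "B": (0, 2, 1)}
pg_msgs, lg_msgs, pct = sync_volume(active, mirrors)
```

For these assumed inputs PowerGraph sends $3\times4+1\times3=15$ messages and LightGraph sends $3\times2+1\times2=8$, a reduction of $7/15$, or about 46.7%.
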
3.4.2 Optimal analysis

In LightGraph, it is possible that all outgoing edges of a vertex $v$ are aggregated in a small number of $v$ ’s mirrors. Thus, in the optimal scenario it is possible that

$\displaystyle\lim\left[\sum_{i=0}^{|V|-1}v_{i}\_m_{2}\right]=\sum_{i=0}^{|V|-1}\sum_{k=0}^{2}v_{i}\_m_{k}$ (18)

Then,

$\displaystyle\lim{[P\textit{\_reduced\_comm}]}=\frac{\sum_{i=0}^{|V|-1}\sum_{j=0}^{n-1}\left[\text{\o}(v_{i},j)v_{i}\_m_{2}\right]}{\sum_{i=0}^{|V|-1}\sum_{j=0}^{n-1}\left[\text{\o}(v_{i},j)\sum_{k=0}^{2}v_{i}\_m_{k}\right]}\times 100=100$ (19)

Thus, in the optimal case, compared with PowerGraph, a considerable proportion of the synchronizing communication overhead can be eliminated by LightGraph.

Table 2 compares the key characteristics of LightGraph with three state-of-the-art graph parallel abstractions.

Table 2

LightGraph vs existing systems

Metrics	Pregel	Distributed GraphLab	PowerGraph	LightGraph
Required Sync communication	$\propto$ # of ghosts	$\propto$ # of ghosts	$\propto$ # of mirrors	$\propto$ # of mirrors with outgoing edges
Graph partitioning method	Edge-cut	Edge-cut	Vertex-cut	Edge direction-aware vertex-cut
Computation model	Synchronous	Synchronous & Asynchronous	Synchronous & Asynchronous	Synchronous & Asynchronous

4. Experimental evaluation

In this section, we show the effectiveness of LightGraph compared with PowerGraph from various aspects through experiments.

4.1 Experiment environment

Our experiments were conducted on a 65-node (528 processors) Linux-based cluster. The cluster consists of one front-end node that runs the TORQUE resource manager and the Moab scheduler and 64 computing (worker) nodes. Each computing node has 16 GB of RAM and 2 quad-core Intel Xeon 2.66 GHz CPUs. The /home directory is shared among all nodes through NFS.

4.2 Benchmarking application and dataset

We selected PageRank and SSSP (single-source shortest path) as benchmarking applications and processed the three data sets listed in Table 3. These data sets are all large-scale graphs. We selected graphs extracted from real-world use with diverse characteristics and different scales in size. LightGraph is only applicable to directed graphs. Thus, all selected graphs are directed.

Table 3

Summary of data sets

Name	Description	# of vertices	# of edges	Graph density ($\times 10^{-5}$)	Average degree
soc-LiveJournal [2]	LiveJournal friendship social network	4,847,571	68,993,773	0.59	14
Twitter [2]	Social news website	11,316,811	85,331,846	0.13	8
BFS1 [5]	Facebook social networks	61,876,615	336,776,269	0.02	5

Figure 7.

The number of synchronizing communication messages vs the number of mirrors needing to be synchronized for PageRank on Livejournal.

4.3 Experiment design and results

We conducted tests by running the selected benchmarking algorithms on the data sets. Each presented result is the average of at least three runs. We first ran the PageRank application on the LiveJournal data set. Figure 7 plots the number of synchronizing communication messages vs the number of mirrors needing to be synchronized under synchronous mode using 16 machines. In Fig. 7, the numbers of mirrors needing to be synchronized under LightGraph (Random) and LightGraph (Oblivious) are the numbers of mirrors with outgoing edges under the Random and Oblivious partitions, respectively. As Fig. 7 shows, 86.8% and 89.6% of all mirrors have outgoing edges under the Random and Oblivious placements, respectively. That is, 13.2% and 10.4% of all mirrors have no outgoing edges under the Random and Oblivious placements, respectively. PowerGraph needs to synchronize all mirrors, whereas LightGraph only synchronizes 86.8% and 89.6% of all mirrors under Random and Oblivious, respectively. Thus, the numbers of synchronizing communication messages are reduced by 14.7% and 10.8% by LightGraph (Random) and LightGraph (Oblivious) compared with PowerGraph (Random) and PowerGraph (Oblivious), respectively. The EDAP partition strategy tries to further isolate the incoming and outgoing edges of each vertex. As expected, under EDAP_Random and EDAP_Oblivious the percentages of mirrors with outgoing edges are reduced to 71.5% and 82.8%, respectively. Consequently, the numbers of synchronizing communication messages are reduced by 26.4% and 16.5% by LightGraph (EDAP_Random) and LightGraph (EDAP_Oblivious) compared with PowerGraph (Random) and PowerGraph (Oblivious), respectively.

Figure 8.

Comparing the volume of synchronizing communication under synchronous computation mode for PageRank on Livejournal.

Figure 9.

Comparing the volume of synchronizing communication under asynchronous computation mode for PageRank on Livejournal.

Figure 10.

PageRank runtime under synchronous mode on Livejournal.

Figure 11.

PageRank runtime under asynchronous mode on Livejournal.

Figure 12.

Comparing the volume of synchronizing communication under synchronous computation mode for SSSP on BFS1.

Figure 13.

SSSP runtime under synchronous mode while computing BFS1.

Figure 14.

Comparing the volume of synchronizing communication under asynchronous computing mode for SSSP on Twitter.

Figure 15.

SSSP runtime under asynchronous mode while computing Twitter.

Figure 16.

PageRank runtime on Livejournal for LightGraph and several representative existing systems.

Figures 8 and 9 show the volume of synchronizing communication for PageRank on the Livejournal data set in different execution environments under synchronous and asynchronous computation mode, respectively. As expected, LightGraph and its edge direction-aware partitioning strategies eliminate a substantial amount of avoidable synchronizing communication during PageRank execution under both modes. As the figures show, LightGraph reduces the synchronizing communication overhead by up to 31.5% compared with PowerGraph. Moreover, as the number of machines increases, LightGraph saves more synchronizing communication.

Figures 10 and 11 show the PageRank runtime on the Livejournal data set using different partitioning strategies in LightGraph and PowerGraph under synchronous mode and asynchronous mode, respectively. As the figures show, by eliminating the avoidable synchronizing communication, LightGraph shortens the execution time of PageRank under both synchronous and asynchronous modes. Moreover, the proposed edge direction-aware partitioning strategies further improve the performance of the PageRank application. LightGraph shortens the runtime by up to 16.3% compared with PowerGraph. Furthermore, the performance gain that LightGraph obtains from reducing synchronizing communication grows as the number of machines in the cluster increases.

Figures 12–15 show selected results of SSSP running on Twitter and BFS1. The reported results across diverse benchmarking applications and datasets demonstrate the generality of LightGraph.

Fig. 16 compares the PageRank runtime on the Livejournal data set between LightGraph and several representative existing systems. In our experiments, Giraph, GPS, and GraphX all use their default graph partitioning strategy, random edge-cut partitioning. Because none of these three systems supports asynchronous graph computing, they all run under synchronous computing mode. As the results demonstrate, LightGraph outperforms the other systems in runtime. For example, the runtimes of Giraph and GPS are 122.3 s and 110.6 s, respectively, whereas the runtime of LightGraph (Sync, EDAP_Random) is only 59.4 s.
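To put the runtimes quoted above into relative terms, the following short sketch computes the speedup and the percentage runtime reduction of LightGraph (Sync, EDAP_Random) over Giraph and GPS. Only the three runtimes come from the text; the helper names `speedup` and `reduction_percent` are hypothetical.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def reduction_percent(baseline_seconds, candidate_seconds):
    """Return the runtime reduction of the candidate relative to the baseline, in percent."""
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


# Runtimes in seconds, as reported for PageRank on Livejournal.
lightgraph = 59.4
baselines = {"Giraph": 122.3, "GPS": 110.6}

for name, runtime in baselines.items():
    print(
        f"{name}: speedup {speedup(runtime, lightgraph):.2f}x, "
        f"runtime reduction {reduction_percent(runtime, lightgraph):.1f}%"
    )
```

With these figures, LightGraph is roughly 2.06 times faster than Giraph and 1.86 times faster than GPS, which corresponds to runtime reductions of about 51.4% and 46.3%.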

5. Conclusion

Although distributed graph-parallel computing systems such as PowerGraph provide high computational capability and scalability for large-scale graph-structured computation, they often suffer from heavy communication overhead. In this paper, we proposed LightGraph, which eliminates the avoidable communication during the synchronization of mirrors in existing distributed graph-structured computing abstractions. Through extensive experiments with real-world network data, we showed that LightGraph not only reduces synchronizing communication significantly but also improves the runtime performance of PageRank-like applications.

References

1. Apache incubator giraph, http://incubator.apache.org/giraph/.

2. Snap, http://snap.stanford.edu/data/, 2006.

3. Chen, Ding, Wang, Chen, Zang and Guan, Computation and communication efficient graph processing with distributed immutable view, in: Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, ACM, 2014, pp. 215–226.

4. Elser and Montresor, An evaluation study of bigdata frameworks for graph processing, in: Big Data, 2013 IEEE International Conference on, IEEE, 2013, pp. 60–67.

5. Gjoka, Kurant, Butts C.T. and Markopoulou, Walking in facebook: A case study of unbiased sampling of osns, in: INFOCOM, 2010 Proceedings IEEE, IEEE, 2010, pp. 1–9.

6. Gonzalez J.E., Low, Bickson and Guestrin, Powergraph: Distributed graph-parallel computation on natural graphs, in: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, Berkeley, CA, USA, USENIX Association, 2012, pp. 17–30.

7. Guo, Biczak, Varbanescu A.L., Iosup, Martella and Willke T.L., How well do graph-processing platforms perform? an empirical performance evaluation and analysis. IPDPS, 2013.

8. Hong, Depner, Manhardt, Van Der Lugt, Verstraaten and Chafi, Pgx. d: a fast distributed graph processing engine, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, 2015, p. 58.

9. Hung B.Q., Otsubo, Hijikata and Nishida, Hits algorithm improvement using semantic text portion, Web Intelligence and Agent Systems 8(2) (2010), 149–164.

10. Kim G.-H., Trimi and Chung J.-H., Big-data applications in the government sector, Communications of the ACM 57(3) (2014), 78–85.

11. Lempel and Moran, Rank-stability and rank-similarity of link-based web ranking algorithms in authority-connected graphs, Inf. Retr. 8(2) (2005), 245–264.

12. Low, Bickson, Gonzalez, Guestrin, Kyrola and Hellerstein J.M., Distributed graphlab: A framework for machine learning and data mining in the cloud, Proc. VLDB Endow. 5(8) (Apr. 2012), 716–727.

13. Malewicz, Austern M.H., Bik A.J., Dehnert J.C., Horn, Leiser and Czajkowski, Pregel: A system for large-scale graph processing, in: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, New York, NY, USA, ACM, 2010, pp. 135–146.

14. Mayer, Tariq M.A., Mayer and Rothermel, Graph: Traffic-aware graph processing, IEEE Transactions on Parallel and Distributed Systems, 2018.

15. McSherry F.D., Isaacs, Isard M.A. and Murray D.G., Differential dataflow, May 10 2012. US Patent App. 13/468,726.

16. Murray D.G., McSherry, Isaacs, Isard, Barham and Abadi, Naiad: a timely dataflow system, in: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ACM, 2013, pp. 439–455.

17. Page, Brin, Motwani and Winograd, The pagerank citation ranking: Bringing order to the web, 1999.

18. Pearce, Gokhale and Amato N.M., Faster parallel traversal of scale free graphs at extreme scale with vertex delegates, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Press, 2014, pp. 549–559.

19. Salihoglu and Widom, Gps: A graph processing system, in: Proceedings of the 25th International Conference on Scientific and Statistical Database Management, SSDBM, New York, NY, USA, ACM, 2013, pp. 22:1–22:12.

20. Shi, Luo, Liang, Zhao and Jin, Frog: Asynchronous graph processing on gpu with hybrid coloring model, IEEE Transactions on Knowledge and Data Engineering 30(1) (2018), 29–42.

21. Siddique, Akhtar, Yoon E.J., Jeong Y.-S., Dasgupta and Kim, Apache hama: an emerging bulk synchronous parallel computing framework for big data applications, IEEE Access 4 (2016), 8879–8887.

22. Stutz, Bernstein and Cohen, Signal/collect: Graph algorithms for the (semantic) web, in: Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume Part I, ISWC’10, Berlin, Heidelberg, Springer-Verlag, 2010, pp. 764–780.

23. Xiao, Xue, Miao, Chen and Zhou, Tux2: Distributed graph computation for machine learning, in: NSDI, 2017, pp. 669–682.

24. Xue, Chen J.-X., Zhao, Medvar and Knepper M.A., Data integration in physiology using bayes’ rule and minimum bayes’ factors: deubiquitylating enzymes in the renal collecting duct, Physiological Genomics 49(3) (2016), 151–159.

25. Yan, Huang, Liu, Chen, Cheng and Zhang, Graphd: Distributed vertex-centric graph processing beyond the memory limit, IEEE Transactions on Parallel and Distributed Systems 29(1) (2018), 99–114.

26. Zaharia, Das, Hunter, Shenker and Stoica, Discretized streams: Fault-tolerant streaming computation at scale, in: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ACM, 2013, pp. 423–438.

27. Zhao, Liu Y.-H., X.-G., H.-Y. and Mei, A method for mobile path prediction based on data mining, in: Education Technology and Training, 2008. and 2008 International Workshop on Geoscience and Remote Sensing. ETT and GRS 2008. International Workshop on, IEEE, Vol. 1, 2008, pp. 691–695.

28. Zhao, Yang C.-R., Raghuram, Parulekar and Knepper M.A., Big: a large-scale data integration tool for renal physiology, American Journal of Physiology-Renal Physiology 311(4) (2016), F787–F792.

29. Zhao, Yoshigoe, Bian, Xie, Xue and Feng, A distributed graph-parallel computing system with lightweight communication overhead, IEEE Transactions on Big Data 2(3) (2016), 204–218.

30. Zhao, Yoshigoe, Xie, Zhou, Seker and Bian, Evaluation and analysis of distributed graph-parallel processing frameworks, Journal of Cyber Security 3, 289–316.

31. Zhao, Yoshigoe, Xie, Zhou, Seker and Bian, Lightgraph: Lighten communication in distributed graph-parallel processing, in: Big Data (BigData Congress), 2014 IEEE International Congress on, IEEE, 2014, pp. 717–724.


Table 3
Summary of data sets