Optimizing data transmission and access of the incremental clustering algorithm using CUDA: A case study

Abstract

Incremental clustering algorithms find wide application in real-time streaming data processing and massive data analysis. Such algorithms need to load data continuously, and thus data transmission and access can induce non-negligible time overhead. In previous work, we proposed two algorithms that exploit high data parallelism for incremental clustering on CUDA-enabled GPGPUs: the Top-down (TD) algorithm and the Moderate-granularity (MG) algorithm. In this paper, we adopt the TD and MG algorithms as a case study to optimize data transmission and access based on CUDA. First, we reinterpret the two algorithms from the viewpoint of overlapping read/write and computing operations at the CUDA-warp level. Second, we adjust the flow of the TD and MG algorithms to enhance data locality, so that shared memory can be utilized more fully. Third, we reorder input data points to raise the data rate of global memory through coalesced memory access. Fourth, we hide part of the data transmission latency by running multiple CUDA streams. Experimental results validate the efficiency of our optimizations.

Keywords

CUDA, incremental clustering, pipeline pattern, data reordering

1. Introduction

1.1 Background

Incremental clustering algorithms are essential for many application scenarios, such as time-limited applications, memory-limited applications and redundant data elimination. These scenarios place high demands on the computing power of the hardware platform. Parallel computing is a common method to satisfy this demand. The General-Purpose Graphics Processing Unit (GPGPU) is a promising parallel computing device. Its advantages include but are not limited to the following.

• Dramatic computing power: A GPGPU integrates a large number of processing cores into a single processor and has significantly stronger computing power than a single CPU. Moreover, the computing power of the GPGPU is growing noticeably faster than that of the CPU.
• Affordability and cost-effectiveness: GPGPUs are common devices in modern desktops and laptops and thus provide high performance computing without extra financial burden. In addition, the GPGPU is more cost effective than the CPU in terms of floating-point computation power per unit cost.
• Programmability: Compute Unified Device Architecture (CUDA) provides C/C++-like programming tools to parallel computing developers. CUDA offers better programmability than specialized acceleration devices such as FPGAs.

Our previous work proposed two GPGPU-friendly incremental clustering algorithms [5, 6], namely, the Top-down algorithm (TD algorithm) and the Moderate-granularity algorithm (MG algorithm). These two algorithms focus on enhancing data parallelism. Consequently, the computing power of GPGPU cores can be utilized more fully.

Nevertheless, high data parallelism alone is not enough if we aim to take full advantage of GPGPU computing power. Even in the latest research on high performance computing, decreasing data transmission and access latency is still indispensable [1, 2, 16, 19]. Data transmission and access latency may significantly slow the execution of an incremental clustering algorithm, because the algorithm needs to continuously load newly arrived data.

1.2 Related works

The work of [16] enhances the task and data parallelism of H.264/AVC encoding on a CPU-GPGPU hybrid platform. Their method furnishes efficient task scheduling and load balancing without extra computational overhead. Experiments show that this method achieves significant speedup compared to existing GPGPU-only methods. The method of [16] is similar to ours in the sense that both emphasize data parallelism and leverage a CPU-GPGPU hybrid platform. Nevertheless, our method focuses on incremental clustering.

Michael et al. improve the DBSCAN algorithm and maximize the throughput of spatial-data oriented clustering algorithms on a CPU-GPU hybrid system [11]. This method identifies the eps-near neighbors of every input data point and thus resolves a neighbor table. The neighbor table builds indexes for the input data, and thus data locality is enhanced. The data that are prefetched into shared memory tend to be accessed repeatedly. Consequently, the high bandwidth of shared memory is fully utilized and low-bandwidth global memory access is minimized. Additionally, CUDA kernel execution and host-GPGPU data transfer are overlapped by leveraging CUDA streams. This method significantly increases the throughput of the GPGPU-powered DBSCAN algorithm. However, this improved DBSCAN algorithm cannot cluster data incrementally because it requires access to the entire input data to build indexes. By contrast, the original DBSCAN can incrementally cluster input data.

Zhao et al. point out that high performance computing (HPC) systems are entering a data-centric era because many emerging workloads are data intensive. They argue that existing distributed file systems (such as the Google File System, GFS, and the Hadoop Distributed File System, HDFS) cannot accommodate metadata-intensive and write-intensive workloads for two reasons. First, the storage granularity of GFS or HDFS is excessively large, while a metadata file is typically small. Oversized granularity leads to inefficient metadata storage and transmission. Second, existing fault tolerance mechanisms rely on data replication rather than checkpoints. Consequently, such mechanisms generate few write operations and thus bypass write optimization. However, data-centric HPC systems adopt checkpoints and inevitably induce many write operations, so the optimization of write operations needs to be reconsidered. Zhao et al. propose a file system to overcome these two drawbacks and validate its efficiency through simulation. The work of [9] designs a general-purpose file system to accommodate data-intensive HPC applications, whereas our work focuses on optimizing data transmission and access between the functional units of a CPU-GPGPU hybrid platform.

Jo et al. propose a framework to integrate the CPU, GPGPU and Intelligent Solid State Device (iSSD) into a collaborative computing system [18]. The CPU is suitable for executing sequential tasks of high time complexity; tasks with high data parallelism fit readily onto GPGPUs; the iSSD is capable of in-storage parallel processing and thus significantly reduces the latency of data transmission. Diverse computing tasks are scheduled among the three computing devices based on the synchronous dataflow model (SDF) and best imaginary level scheduling (BIL). Jo et al. apply their framework to data-intensive algorithms (k-means, Page Rank and Sim Rank) and validate its efficiency. However, this framework mainly focuses on the integration of heterogeneous devices and does not address how to maximize the utilization of GPGPU computing power in the context of incremental clustering.

1.3 Main contribution

We optimize the data transmission and access of both the TD algorithm and the MG algorithm based on CUDA. Our optimizations are as follows. First, we reinterpret our previous work (the TD and MG algorithms) from the viewpoint of warp-level data access. Second, we adjust the flow of the two algorithms and compare the speedups they achieve by utilizing shared memory. Third, we reorder the input data to coalesce global memory access. Fourth, we overlap data I/O operations and computation operations among CUDA streams.

The rest of this paper is organized as follows: Section 2 introduces CUDA features relevant to our methods and outlines our optimizations. Section 3 describes our optimization methods in detail. Section 4 presents and analyzes the experiment results. Section 5 concludes this paper and points out the future work.

2. CUDA features relevant to the optimization methods

2.1 Execution mode of CUDA threads

A GPGPU contains a certain number of Streaming Multiprocessors (SM). Inside an SM, the basic unit of thread scheduling is a warp. Threads within the same warp are scheduled together and execute the same instruction (but the operands can be different). One SM contains a warp scheduler, an instruction dispatch unit, a data read/write unit, a load/store unit, processor cores, and a special function unit.

Taking the NVIDIA Fermi architecture as an example, a warp contains 32 threads; an SM contains 2 warp schedulers, 2 instruction dispatch units, 16 data read/write units, 32 processor cores, and 4 special function units.

Under Fermi, an SM can schedule two warps simultaneously. Each of the two warp schedulers selects a warp in the ready state; afterwards, the selected warps are executed. Computing operations of one warp can be overlapped with read/write operations of another warp, provided that two or more warps run simultaneously on one SM. That is, the number of threads running on a GPGPU should be at least twice the total number of processing cores on the GPGPU. The situation is similar for other NVIDIA GPGPU models such as the GTX 660.
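The rule of thumb above can be turned into a small host-side helper. The following is only a sketch in plain C++: the function names are hypothetical, and the SM count used in the usage note is an illustrative assumption rather than a figure from our experiments.

```cpp
#include <cassert>
#include <cstddef>

// Rule of thumb from the text: keep at least `factor` times as many threads
// in flight as there are processing cores (factor = 2 in the text).
std::size_t min_threads_for_overlap(std::size_t num_sms,
                                    std::size_t cores_per_sm,
                                    std::size_t factor = 2) {
    assert(num_sms > 0 && cores_per_sm > 0 && factor > 0);
    return factor * num_sms * cores_per_sm;
}

// Number of warps needed to hold `threads` threads (rounded up).
std::size_t warps_for_threads(std::size_t threads, std::size_t warp_size = 32) {
    assert(warp_size > 0);
    return (threads + warp_size - 1) / warp_size;
}
```

For a hypothetical Fermi-class device with 14 SMs and 32 cores per SM, `min_threads_for_overlap(14, 32)` gives 896 threads, i.e. 28 warps, or two warps per SM.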

2.2 Data transmission concurrency in a GPGPU-CPU hybrid system

In CUDA, GPGPU is a co-processor of CPU. The serial code and parallel code (i.e. CUDA kernel) run on CPU (host) and GPGPU (device), respectively. It is a requirement to load the input data of a kernel into GPGPU global memory before the kernel can be launched.

Figure 1 shows the data flows in a GPGPU-CPU hybrid system and summarizes our optimization methods. The data are transferred from the external storage (disk) to GPGPU’s global memory through two consecutive transmissions: disk-to-host transmission and host-to-device transmission. Subsequently, read/write units read data from global memory into registers of processing cores (for computing) or shared memory (for prefetching).

Figure 1.

Data flow in a GPU-CPU hybrid system and our optimization methods.

The input data volume to an incremental clustering algorithm can be huge. However, the output data volume is generally small. Consequently, we focus on the input data transmission, i.e. disk-to-host and host-to-device transmissions in Fig. 1. CPU and GPGPU have separate I/O units. Consequently, the two transmissions are parallelizable. Moreover, CUDA stream [8] can leverage this parallelism. A single CUDA stream contains a series of serial instructions. Multiple CUDA streams can run concurrently. Access latency of shared memory is dramatically lower than that of global memory.

2.3 Access pattern of GPGPU’s global memory

Although a programmer-visible cache such as shared memory is available in CUDA, its capacity is limited. GPGPU global memory has a large capacity, but its access latency is comparatively high (400 $\sim$ 800 clock cycles) [7]. Memory coalescing is therefore necessary to raise the data transfer rate. A common solution is to reorder the data.

After the reordering, threads of a CUDA warp always read valid data from consecutive memory addresses, that is, the memory access is coalesced. Without memory coalescing, more read operations are required to copy the same amount of data. The principles of CUDA data reordering are amply illustrated in reference [8]. More details related to our data reordering method are discussed in Subsection 3.4.

3. Optimization methods

3.1 Basic terminologies

Fundamental terminology was introduced in detail in our previous works [5, 6]. For completeness, we explain some basic terms as follows.

Definition 1: Incremental clustering

${\bm{x}}_{\bm{1}},{\bm{x}}_{\bm{2}},{\bm{x}}_{\bm{3}},...,{\bm{x}}_{\bm{T}}$ is a series of data points ( ${\bm{x}}_{i}\in R^{d},1\leqslant i\leqslant T$ ). The data points are partitioned into $T1$ sets: ${\bm{X}}_{1},{\bm{X}}_{2},{\bm{X}}_{3},...,{\bm{X}}_{T1}$ . The partition satisfies the following conditions:

1) ${\bm{X}}_{j}\neq\emptyset(1\leqslant j\leqslant T1)$ ;
2) If ${\bm{x}}_{i1}\in{\bm{X}}_{j1},{\bm{x}}_{i2}\in{\bm{X}}_{j2}(i1\neq i2,j1\neq j2)$ , then $j2>j1\Leftrightarrow i2>i1$ .

A data analysis task adopts a discrete-time system. Time stamps are labeled as 1, 2, 3, … This task is incremental clustering if and only if:

1) When $t=$ 1, the task receives ${\bm{X}}_{1}$ . In time interval [1, 2], the task partitions ${\bm{X}}_{1}$ into clusters and ${C}_{1}$ is the set of these clusters. The only input to the task is ${\bm{X}}_{1}$ .
2) When $t=j$ ( $j=$ 2, 3, 4, …), the task receives ${\bm{X}}_{t}$ . In time interval [ $t$ , $t+$ 1], the task resolves the set of clusters ${C}_{t}$ such that $\forall{\bm{x}}_{j}\in\bigcup_{1\leqslant k\leqslant t}{{\bm{X}}_{k}}$ can find its affiliated cluster in ${C}_{t}$ . The inputs to the task are ${\bm{X}}_{t}$ and ${C}_{t-1}$ . ${C}_{t}$ represents the clustering result of all historic data up to time $t$ : $\bigcup_{1\leqslant k\leqslant t}{{\bm{X}}_{k}}$ .

Definition 2: The $t$ th step

An algorithm intended for incremental clustering is an incremental clustering algorithm. The time interval [ $t$ , $t+$ 1] ( $t=$ 1, 2, 3, …) (explained in Definition 1) is the $t$ th step of an incremental clustering algorithm.

Definition 3: Batch clustering part and the incremental part of step $t$

Some incremental clustering algorithms divide step $t$ ( $t=$ 1, 2, 3, …) into two parts [3, 4, 13]:

In the first part, ${\bm{X}}_{t}$ is partitioned into clusters (or micro-clusters) according to certain similarity metrics. ${C}_{t,\textit{new}}$ is the set of these clusters (or micro-clusters). In the second part, ${C}_{t}$ is resolved based on ${C}_{t,\textit{new}}$ and ${C}_{t-1}$ .

The first part can be accomplished by a batch-mode clustering algorithm, and it is the batch clustering part of step $t$ . The second part is the incremental part of step $t$ .

3.2 Exploiting high data-parallelism incremental clustering algorithms on GPGPU

We exploit high data parallelism for incremental clustering according to the hardware structure of NVIDIA GPGPUs [5, 6].

Pursuant to Subsection 2.1, we need to run thousands of threads in single-instruction-multiple-data (SIMD) mode if we aim to fully utilize every core of GPGPU as well as hide the read/write latency of warps.

However, our previous works [5, 6] find that incremental algorithms face an accuracy-parallelism dilemma on GPGPU. On one hand, block-wise algorithms first cluster ${\bm{X}}_{t}$ in batch mode to obtain ${C}_{t,\textit{new}}$ (batch clustering part), then find and merge similar cluster pairs between ${C}_{t-1}$ and ${C}_{t,\textit{new}}$ . Batch-mode clustering can be accomplished by GPGPU-friendly algorithms. The time overhead of identifying and merging similar cluster pairs is negligible compared to the GPGPU-powered batch clustering part, even if the identifying and merging operations are executed sequentially. As a result, block-wise algorithms are GPGPU-friendly. However, such algorithms suffer from low accuracy due to coarse evolution granularity. On the other hand, point-wise algorithms achieve high accuracy due to fine evolution granularity. Nevertheless, such algorithms are difficult to execute with thousands of threads in SIMD mode owing to the strong data dependency between two adjacent steps.

Figure 2.

Flows of two high data-parallelism incremental clustering algorithms.

Our previous works propose two incremental clustering algorithms to seek balance between accuracy and parallelism: the Top-down algorithm (TD) and Moderate-granularity algorithm (MG). Figure 2 illustrates the flows of both algorithms. We summarize similarities and differences between these two algorithms in Table 1.

Table 1

Comparison between Top-down (TD) and Moderate-granularity (MG) algorithm

	Characteristics	TD algorithm	MG algorithm
Similarities	Design objective	Seek balance between accuracy and data parallelism.
	Overall processing pattern	Block-wise (Fig. 2).
	Inner flow within a single step	Each step is divided into two parts: batch clustering part and incremental part (Definition 3).
	Basic idea of enhancing data parallelism	Batch clustering part undertakes most of the computation, and this part possesses high data parallelism (GPGPU-friendly).
Differences	Batch clustering part	Expectation-Maximization (EM) algorithm of Gaussian Mixture Model (GMM).	Mean-shift algorithm.
	Incremental part	Identify and merge similar cluster pairs between $C_{t-1}$ and $C_{t,\textit{new}}$ .	Absorb micro-clusters of $C_{t,\textit{new}}$ into a neural network.
	Basic idea of achieving high accuracy	Improve conventional block-wise algorithms. In each step, pre-estimate the cluster number of $C_{t-1}$ before executing the incremental part, and thus avoid producing excessive unnecessary clusters in $C_{t}$ .	Integrate advantages of conventional block-wise and point-wise algorithms. Let $C_{t}$ evolve in the granularity of micro-clusters to reduce inaccuracy.
	Application scope	TD algorithm is parametric.	MG algorithm is non-parametric.

Figure 3.

Adjusted flow of batch clustering part of TD algorithm.

3.3 Fully utilize shared memory

GPGPU shared memory enjoys significantly higher bandwidth than global memory. However, capacity of shared memory is much smaller than that of global memory. In CUDA, shared memory is a program-visible cache; pre-reading data into shared memory and using them properly can effectively reduce data access latency of CUDA threads.

Assume that a CUDA thread takes time $t_{G}$ to read a data point ${\bm{x}}_{s}$ from global memory into shared memory, and time $t_{S}$ to read ${\bm{x}}_{s}$ directly from shared memory. The time overhead is $t^{\prime}_{G}$ if the thread reads ${\bm{x}}_{s}$ from global memory into a register. Since shared memory and registers both induce extremely small access latency, we suppose that $t^{\prime}_{G}=t_{G}$ . Obviously, $t_{G}\gg t_{S}$ . In one iteration of an algorithm, let $N_{G}$ represent the total number of times that ${\bm{x}}_{s}$ is read from global memory into shared memory, and let $N_{S}$ denote the total number of times that ${\bm{x}}_{s}$ is read from shared memory into a register for computation. The time overhead saved by prefetching can be calculated as:

$\displaystyle\Delta=N_{S}\cdot(t^{\prime}_{G}-t_{S})-N_{G}\cdot t_{G}=(N_{S}-N_{G})\cdot t_{G}-N_{S}\cdot t_{S}$ (1)

Prefetching reduces data access latency if $\Delta$ is positive. Otherwise, prefetching increases total access latency. That is, every data point prefetched into shared memory should be read by CUDA threads as many times as possible if we intend to reduce access latency through prefetching.
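Eq. (1) is easy to evaluate numerically. The sketch below is plain C++ with hypothetical function names; the latency values in the usage note are illustrative assumptions (400 cycles lies in the global-memory range quoted in Subsection 2.3, while the 4-cycle shared-memory figure is purely a placeholder).

```cpp
#include <cassert>

// Eq. (1): time saved by prefetching one data point, assuming t'_G = t_G.
// n_s: reads from shared memory, n_g: loads from global into shared memory.
double prefetch_gain(double n_s, double n_g, double t_g, double t_s) {
    assert(t_g >= 0.0 && t_s >= 0.0);
    return (n_s - n_g) * t_g - n_s * t_s;
}

// Reuse count N_S above which Delta becomes positive for a given N_G:
// Delta > 0  <=>  N_S > N_G * t_G / (t_G - t_S).
double break_even_reads(double n_g, double t_g, double t_s) {
    assert(t_g > t_s && t_s >= 0.0);
    return n_g * t_g / (t_g - t_s);
}
```

With $t_{G}=400$ and $t_{S}=4$, a point that is loaded once and read once gives `prefetch_gain(1, 1, 400, 4)` = -4 (prefetching costs time), whereas ten reads per load give `prefetch_gain(10, 1, 400, 4)` = 3560 cycles saved.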

We adjust the flow of the batch-mode parts of both the TD and MG algorithms to fully utilize shared memory. Taking the TD algorithm as an example, the algorithm runs the EM algorithm of GMM in the batch-mode part of each step. Suppose the TD algorithm receives $N$ data points: ${\bm{x}}_{s}(s=1,2,3,...,N)$ in step $t$ ; and the GMM contains $K$ clusters: $k(k=1,2,3,...,K)$ . The EM algorithm runs iteratively. The conventional flow of each iteration contains five parts summarized as follows.

(1) Calculate the probability density of ${\bm{x}}_{s}$ under cluster $k$ , denoted as $pd_{sk}^{(i)}$ . The superscript $(i)$ denotes the $i$ th iteration.

(2) Calculate the probability that ${\bm{x}}_{s}$ belongs to cluster $k$ , denoted as

$\displaystyle p_{sk}^{(i)}=pd_{sk}^{(i)}\cdot\pi_{k}^{(i-1)}\left/\left(\sum\limits_{k=1}^{K}pd_{sk}^{(i)}\cdot\pi_{k}^{(i-1)}\right)\right.,$

where $\pi_{k}^{(i-1)}$ is the weight of cluster $k$ when the last iteration is completed.

(3) Update the weights $\pi_{k}^{(i)}=\left(\sum\limits_{s=1}^{N}p_{sk}^{(i)}\right)/N$ .

(4) Update mean vectors ${\bm{\mu}}_{k}^{(i)}=\left(\sum\limits_{s=1}^{N}p_{sk}^{(i)}\cdot{\bm{x}}_{s}\right)\left/\left(\sum\limits_{s=1}^{N}p_{sk}^{(i)}\right)\right.$ .

(5) Update covariance matrices

$\displaystyle\Sigma_{k}^{(i)}=(\delta_{uj})_{d\times d},\quad\delta_{uj}=\left(\sum\limits_{s=1}^{N}p_{sk}^{(i)}\cdot(x_{s,u}-\mu_{k,u}^{(i)})\cdot(x_{s,j}-\mu_{k,j}^{(i)})\right)\left/\left(\sum\limits_{s=1}^{N}p_{sk}^{(i)}\right)\right.$

where $x_{s,u}$ is the $u$ th component of ${\bm{x}}_{s}$ , $\mu_{k,u}^{(i)}$ is the $u$ th component of ${\bm{\mu}}_{k}^{(i)}$ ; $s=1,2,3,...,N$ ; $k=1,2,3,...,K$ ; $u,j=1,2,3,...,d$ .

Under the conventional flow, data dependence exists between two successive parts, and thus each part can start only after the previous part completes. As a result, data points prefetched into shared memory during the current part cannot be reused by the subsequent parts. Subsequent parts inevitably reload data points from global memory into shared memory if they are to utilize shared memory. Pursuant to Eq. (1), $N_{G}$ increases and $\Delta$ tends to drop.

As illustrated by Fig. 3, we adjust the flow to decrease $N_{G}$ .

$\displaystyle\left\{\begin{array}{l}\textit{numer}\_\pi_{k}^{(i)}=\left(\sum\limits_{b=1}^{B}\textit{sub}\_\pi_{k,b}^{(i)}\right)\\ \pi_{k}^{(i)}=\textit{numer}\_\pi_{k}^{(i)}/N\end{array}\right.$ (2)

$\displaystyle\left\{\begin{array}{l}\textbf{\textit{numer}}\_\bm{\mu}_{k}^{(i)}=\left(\sum\limits_{b=1}^{B}\textbf{\textit{sub}}\_\bm{\mu}_{k,b}^{(i)}\right)\\ \bm{\mu}_{k}^{(i)}=\textbf{\textit{numer}}\_\bm{\mu}_{k}^{(i)}/\textit{numer}\_\pi_{k}^{(i)}\end{array}\right.$ (3)

$\displaystyle\left\{\begin{array}{l}\textbf{\textit{numer}}\_\bm{sq}_{k}^{(i)}=\left(\sum\limits_{b=1}^{B}\textbf{\textit{sub}}\_\bm{sq}_{k,b}^{(i)}\right)\\ \bm{sq}_{k}^{(i)}=\textbf{\textit{numer}}\_\bm{sq}_{k}^{(i)}/\textit{numer}\_\pi_{k}^{(i)}\\ \Sigma_{k}^{(i)}=\bm{sq}_{k}^{(i)}-\bm{\mu}_{k}^{(i)}\cdot(\bm{\mu}_{k}^{(i)})^{T}\end{array}\right.$ (4)

Let $B$ denote the number of SMs in a GPGPU. Every column in Fig. 3 corresponds to an SM. In step $t$ , the TD algorithm receives the data block ${\bm{X}}_{t}=\{{\bm{x}}_{1},{\bm{x}}_{2},...,{\bm{x}}_{M}\}$ and divides it into $B$ subsets: $\textit{sub}\bm{X}_{t,j}(j=1,2,3,...,B)$ and $\textit{sub}{\bm{X}}_{t,j}=\{{\bm{x}}_{b_{(j-1)}},{\bm{x}}_{b_{(j-1)}+1},...,{\bm{x}}_{b_{j}}\}(b_{0}=1,b_{B}=M,b_{j}=b_{j-1}+|\textit{sub}\bm{X}_{t,j}|)$ . The $j$ -th SM prefetches $\textit{sub}\bm{X}_{t,j}$ into its own shared memory and then calculates the three partial sums $\textit{sub}\_\pi_{k,j},\textbf{\textit{sub}}\_\bm{\mu}_{k,j},\textbf{\textit{sub}}\_\bm{sq}_{k,j}$ . The CUDA code for these calculations resides in a single CUDA kernel. During the calculation, CUDA threads on the $j$ -th SM read data points directly from the corresponding shared memory. After each SM has calculated its own partial sums, $\pi_{k}^{(i)}$ , $\bm{\mu}_{k}^{(i)}$ , $\Sigma_{k}^{(i)}$ can be resolved pursuant to Eqs (2)–(4).

The adjusted process flow enhances data locality. We optimize the shared memory usage of the MG algorithm in a similar manner.
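The merge of per-SM partial sums in Eqs (2)–(4) can be illustrated on the host side. The following is a minimal sequential sketch in plain C++ for one cluster $k$ and one-dimensional data ( $d=1$ ), so that $\bm{sq}_{k}$ and $\Sigma_{k}$ reduce to scalars. The type and function names (`PartialSums`, `accumulate`, `combine`) are hypothetical, and the loop over a subset stands in for the work that one SM performs on its shared-memory copy; it is not our CUDA kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Partial sums computed from one subset subX_{t,j} for a single cluster k.
struct PartialSums {
    double sub_pi = 0.0;  // sum of p_sk
    double sub_mu = 0.0;  // sum of p_sk * x_s
    double sub_sq = 0.0;  // sum of p_sk * x_s^2
};

struct ClusterParams { double pi, mu, sigma2; };

// One pass over the points in [begin, end): each point is read once and
// contributes to all three partial sums.
PartialSums accumulate(const std::vector<double>& x, const std::vector<double>& p,
                       std::size_t begin, std::size_t end) {
    assert(x.size() == p.size() && begin <= end && end <= x.size());
    PartialSums s;
    for (std::size_t i = begin; i < end; ++i) {
        s.sub_pi += p[i];
        s.sub_mu += p[i] * x[i];
        s.sub_sq += p[i] * x[i] * x[i];
    }
    return s;
}

// Eqs (2)-(4) for d = 1: merge the partial sums of all subsets.
ClusterParams combine(const std::vector<PartialSums>& parts, std::size_t n_points) {
    assert(n_points > 0);
    double numer_pi = 0.0, numer_mu = 0.0, numer_sq = 0.0;
    for (const PartialSums& s : parts) {
        numer_pi += s.sub_pi;
        numer_mu += s.sub_mu;
        numer_sq += s.sub_sq;
    }
    assert(numer_pi > 0.0);
    const double mu = numer_mu / numer_pi;
    const double sq = numer_sq / numer_pi;
    return {numer_pi / n_points, mu, sq - mu * mu};
}
```

For the points 1, 2, 3, 4 with all responsibilities equal to 1 and two subsets of two points each, `combine` yields weight 1, mean 2.5 and variance 1.25, which matches the result of the conventional five-part flow on the same data.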

3.4 Data reordering

In the $t$ th step of an incremental clustering algorithm (either TD algorithm or MG algorithm), the newly arrived data block contains $M$ data points:

$\displaystyle{\bm{a}}_{j}=({a_{j,1}},{a_{j,2}},...,{a_{j,d}})^{T}\in{R^{d}}(j=1,2,3,...,M)$

Typically, all these data are stored in a one-dimensional array. $d$ dimensions of a data point occupy consecutive memory addresses. The length of this array is $M\cdot d$ , if the entire data block is directly loaded into GPGPU’s global memory.

Figure 4.

Data reordering in GPGPU global memory: (a) memory access pattern before reordering, (b) memory access pattern after reordering.

Figure 4 illustrates the effect of data reordering. Let $B$ represent the number of threads within a CUDA thread block. Due to hardware limitations, a warp is the minimum unit of CUDA thread scheduling. Moreover, memory access is coalesced only if threads of the same warp read consecutive memory addresses rather than scattered addresses. During the algorithm execution, every CUDA thread needs to successively read the $d$ dimensions of a data point before processing the data point. As shown by Fig. 4a, threads of a CUDA warp inevitably access scattered memory addresses in the absence of reordering. This non-coalesced access pattern decreases the data rate. We first assume that $M$ is divisible by $B$ ; if it is not, the length of the array should be increased to $(M+B-M{\%}B)\cdot d$ . The array elements are divided into $\lceil M/B\rceil$ groups. The length of each group is $B\cdot d$ . The first $B$ elements of the first group are ${a_{1,1}},{a_{2,1}},{a_{3,1}},...,{a_{B,1}}$ ; followed by ${a_{1,2}},{a_{2,2}},{a_{3,2}},...,{a_{B,2}}$ ; and so on. As shown by Fig. 4b, threads within a warp can access consecutive memory addresses after reordering. The coalesced memory access pattern achieves a high data transfer rate.
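The reordered layout can be expressed as an index mapping. The sketch below is a host-side illustration in plain C++ with 0-based indices; the function names and the choice of 0.0 as the padding value are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of component u of point j in the reordered array (0-based).
// Points are grouped B at a time; inside a group, the B values of one
// dimension are stored next to each other.
std::size_t reordered_index(std::size_t j, std::size_t u,
                            std::size_t d, std::size_t B) {
    assert(B > 0 && u < d);
    return (j / B) * B * d + u * B + (j % B);
}

// Convert a point-major array (d consecutive components per point) into the
// grouped layout. The last group is padded when M is not a multiple of B.
std::vector<double> reorder_points(const std::vector<double>& points,
                                   std::size_t M, std::size_t d, std::size_t B,
                                   double pad = 0.0) {
    assert(B > 0 && d > 0 && points.size() == M * d);
    const std::size_t groups = (M + B - 1) / B;  // ceil(M / B)
    std::vector<double> out(groups * B * d, pad);
    for (std::size_t j = 0; j < M; ++j)
        for (std::size_t u = 0; u < d; ++u)
            out[reordered_index(j, u, d, B)] = points[j * d + u];
    return out;
}
```

With this mapping, thread `j % B` of a block reads address `base + u * B + (j % B)` for dimension `u`, so neighbouring threads touch neighbouring addresses. For example, three 2-D points (10, 11), (20, 21), (30, 31) with $B=2$ become 10, 20, 11, 21, 30, pad, 31, pad.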

3.5 Hiding data transmission latency through the pipeline pattern

As illustrated in Fig. 5, part of the data transmission latency can be hidden, if the following conditions are satisfied: First, multiple CUDA streams execute concurrently. Second, every CUDA stream covers disk-to-host transmission, host-to-device transmission, and execution of CUDA kernels. Relevant symbols are explained as follows:

$T_{1,t}$ : time overhead of disk-to-host transmission (in the $t$ th step of the incremental clustering algorithm); $T_{2,t}$ : time overhead of host-to-device transmission and data pre-processing such as reordering (in the $t$ th step of the incremental clustering algorithm); $T_{3,t}$ : time overhead of CUDA kernel execution (in the $t$ th step of the incremental clustering algorithm).

$T_{1},T_{2},T_{3}$ represent the averages of $T_{1,t}$ , $T_{2,t}$ , $T_{3,t}$ ( $t=$ 1, 2, 3, …), respectively. $P$ is the number of stages in the pipeline.

Figure 5.

Comparison between non-pipeline and pipeline patterns: (a) Non-pipelined pattern; (b) pipelined pattern.

We consider the speedup of pipelined pattern up to step $t$ . For convenient calculation, we approximate $T_{1,v}$ , $T_{2,v}$ and $T_{3,v}$ with $T_{1}$ , $T_{2}$ and $T_{3}$ , respectively $(v=1,2,3,\ldots,t)$ .

Assume that $T_{1}=T_{2}=T_{3}=T_{0}$ . Then the pipelined execution time up to step $t$ is:

$\displaystyle T_{\textit{pipeline}}=P\cdot T_{0}+(t-1)\cdot T_{0},$

Under the non-pipelined pattern, algorithm execution time up to step $t$ is:

$\displaystyle T_{\textit{non}}=t\cdot P\cdot T_{0},$

Speedup of pipelined pattern is:

$\displaystyle\textit{Speedup}_{\textit{pipeline}}=T_{\textit{non}}/T_{\textit{pipeline}}=t\cdot P/[P+(t-1)]$ (5)

Theoretically, $\mathop{\lim}\limits_{t\to\infty}\textit{Speedup}_{\textit{pipeline}}=P$ . According to Eq. (5), the number of CUDA streams has no influence on speedup. In practice, values of $T_{1,v}$ , $T_{2,v}$ and $T_{3,v}$ can be different. However, Eq. (5) can quantitatively explain the principle of our pipelined pattern.
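To see how quickly Eq. (5) approaches its limit, the formula can be evaluated directly. This is a small sketch in plain C++; the function name is hypothetical.

```cpp
#include <cassert>

// Eq. (5): speedup of the pipelined pattern after t steps with P equally
// long stages.
double pipeline_speedup_equal_stages(unsigned t, unsigned P) {
    assert(t >= 1 && P >= 1);
    return static_cast<double>(t) * P / (P + (t - 1.0));
}
```

For the three stages considered here ( $P=3$ ), a single step gives no speedup (3/3 = 1), three steps give 9/5 = 1.8, and the value stays below 3 for any finite $t$ .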

If we assume that $T_{1}\neq T_{2}\neq T_{3}$ , we can estimate the speedup by considering throughputs.

Under the non-pipelined pattern, the throughput is:

$\displaystyle\textit{Throughput}_{\textit{non}}=\frac{1}{T_{1}+T_{2}+T_{3}}.$

Under the pipelined pattern, the steady-state throughput is:

$\displaystyle\textit{Throughput}_{\textit{pipeline}}=\frac{1}{\max(T_{1},T_{2},T_{3})}.$

As a result, speedup of pipelined pattern is:

$\displaystyle\textit{Speedup}^{\prime}_{\textit{pipeline}}=\frac{\textit{Throughput}_{\textit{pipeline}}}{\textit{Throughput}_{\textit{non}}}=\frac{T_{1}+T_{2}+T_{3}}{\max(T_{1},T_{2},T_{3})}$ (6)
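Eq. (6) shows that the achievable speedup depends on how balanced the three stages are. The sketch below evaluates it in plain C++; the function name is hypothetical and the stage times in the usage note are made-up values chosen only to show the two extremes.

```cpp
#include <algorithm>
#include <cassert>

// Eq. (6): steady-state speedup of the pipelined pattern for average stage
// times t1 (disk-to-host), t2 (host-to-device) and t3 (kernel execution).
double pipeline_speedup_steady_state(double t1, double t2, double t3) {
    assert(t1 > 0.0 && t2 > 0.0 && t3 > 0.0);
    return (t1 + t2 + t3) / std::max({t1, t2, t3});
}
```

Balanced stages, e.g. (1, 1, 1), reach the upper bound of 3. When one stage dominates, e.g. (1, 1, 8), the speedup drops to 10/8 = 1.25, because the transmissions that the pipeline hides are only a small share of the total time.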

4. Experiments

4.1 Hardware and software environment

Hardware and software parameters are shown in Table 2. All experimental codes are written in C++.

The double-precision floating-point data type was adopted for both CPU and GPGPU codes.

A single step of both the TD and MG algorithms includes two parts: the batch clustering part and the incremental part. The batch clustering part and the incremental part are executed on the GPGPU and the CPU, respectively. In the batch clustering part of the TD algorithm, the Bayesian Information Criterion (BIC) was adopted to identify the number of clusters contained in a single data block. The search range of BIC is 50 $\sim$ 100. In the batch clustering part of the MG algorithm, we used mean-shift with the uniform kernel. Calculating BIC requires running the standard EM algorithm of GMM to convergence. Details of the GPGPU-powered GMM EM algorithm are elaborated in reference [10]. CUDA-accelerated mean-shift is discussed in detail in reference [12].

Table 2
Hardware and software parameters

Parameter	Value	Parameter	Value	Parameter	Value
CPU (number)	Intel® Core2™ E7500 (6)	Host RAM size	8 GB	CUDA version	5.0
GPGPU (number)	NVIDIA GTX 660 (1)	OS version	Linux kernel version 2.6.18-194.el5	CUDA grid size	45
Hard disk (number)	Maxtor 6L160M0 (1)	GCC version	4.1.2	CUDA block size	64

4.2 Input data

As shown in Tables 3 and 4, 256-level grayscale images from the USC-SIPI data set [17] were used as inputs. A data point includes three components: grayscale, X-coordinate and Y-coordinate. Clustering results are incrementally segmented images. Clustering accuracy is measured by the Rand Index [15]. We need a benchmark to calculate the Rand Index value of a clustering result. The clustering results of the conventional batch-mode GMM EM algorithm and mean-shift algorithm are adopted as benchmarks for the incremental clustering results of the TD and MG algorithms, respectively. The value of the Rand Index is within the range [0, 1]. A larger Rand Index means higher accuracy.
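For reference, the Rand Index between a clustering result and its benchmark is the fraction of point pairs on which the two labelings agree (both place the pair in the same cluster, or both place it in different clusters). The following plain C++ sketch implements this textbook pairwise definition; the function name is hypothetical, and the quadratic loop is meant for illustration on small inputs rather than as the evaluation code used for full images.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rand Index of two labelings of the same points: agreeing pairs / all pairs.
double rand_index(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size() && a.size() >= 2);
    std::size_t agree = 0, pairs = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = i + 1; j < a.size(); ++j) {
            const bool same_a = (a[i] == a[j]);
            const bool same_b = (b[i] == b[j]);
            if (same_a == same_b) ++agree;
            ++pairs;
        }
    }
    return static_cast<double>(agree) / static_cast<double>(pairs);
}
```

The measure ignores how clusters are named: labelings {0, 0, 1, 1} and {1, 1, 0, 0} give 1.0, whereas {0, 0, 1, 1} against {0, 1, 1, 1} agrees on 3 of 6 pairs and gives 0.5.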

Table 3
Accuracy of TD algorithm on the datasets

Human and animal images		Aerial geographical images
Name	Rand index	Name	Rand index	Name	Rand index
Azu	0.9345	usc2.2.01	0.9371	usc2.2.15	0.9285
Baboon	0.9564 ${}^{\text{max}}$	usc2.2.02	0.9028	usc2.2.16	0.929
Barbara	0.952	usc2.2.03	0.9526 ${}^{\text{max}}$	usc2.2.17	0.9263
Couple	0.9109	usc2.2.04	0.8362	usc2.2.18	0.9458
Crowd	0.937	usc2.2.05	0.9059	usc2.2.19	0.8916
Elaine	0.8854	usc2.2.07	0.9337	usc2.2.20	0.9496
Lena	0.9305	usc2.2.08	0.7667	usc2.2.21	0.9322
Man	0.938	usc2.2.11	0.9129	usc2.2.22	0.9303
Tiffany	0.9363	usc2.2.13	0.7981	usc2.2.24	0.9398
Mat	0.9345	usc2.2.14	0.8808	usc2.2.25	0.9012

Tables 3 and 4 show the clustering accuracy of the TD and MG algorithms, respectively. With regard to the MG algorithm, the bandwidth parameter of the batch clustering part is (4, 2); based on experience, this is a balance point between accuracy and parallelism [6]. Pursuant to these two tables, we select Baboon and usc2.2.03 as inputs to validate the acceleration performance of the TD algorithm. Additionally, we choose Baboon and usc2.2.17 to validate the acceleration performance of the MG algorithm. In other words, we choose the images with the highest accuracy. Our optimization methods exhibit similar speedups on the other images of Tables 3 and 4. As a result, we only adopt the images with the best accuracy to avoid repetition.

Table 4

Accuracy of MG algorithm on the datasets ${}^{1}$

Human and animal images		Aerial geographical images
Name	Rand index	Name	Rand index	Name	Rand index
Azu	0.9505	usc2.2.01	0.961	usc2.2.15	0.6849
Baboon	0.9946 ${}^{\text{max}}$	usc2.2.02	0.5025	usc2.2.16	0.8575
Barbara	0.9557	usc2.2.03	0.9612	usc2.2.17	0.9983 ${}^{\text{max}}$
Couple	0.939	usc2.2.04	0.9717	usc2.2.18	0.991
Crowd	0.9314	usc2.2.05	0.996	usc2.2.19	0.9838
Elaine	0.9806	usc2.2.07	0.9022	usc2.2.20	0.9699
Lena	0.9123	usc2.2.08	0.9624	usc2.2.21	0.9841
Man	0.9796	usc2.2.11	0.9923	usc2.2.22	0.8716
Tiffany	0.9346	usc2.2.13	0.9816	usc2.2.24	0.9703
Woman	0.739	usc2.2.14	0.9657	usc2.2.25	0.9746

${}^{1}$ The bandwidth parameter of the mean-shift algorithm is (4, 2).

Every input image was divided into equal-size data blocks. These blocks were processed consecutively. “Data block size” (the fourth column of Table 5) describes the number of data points within a single data block.

Table 5

Input data ${}^{1}$

Algorithm	Image name	Image size	Data block size
TD	Baboon	512 $\times$ 512	128 $\times$ 128
	usc2.2.03	1024 $\times$ 1024	128 $\times$ 128
MG	Baboon	512 $\times$ 512	128 $\times$ 128
	usc2.2.17	1024 $\times$ 1024	128 $\times$ 128

${}^{1}$ Image size or data block size is in the format “width $\times$ height”, measured in pixels.
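
As a quick check of Table 5, the following Python sketch counts how many data blocks each input image yields and how many data points each block holds. It assumes that the blocks tile the image exactly, with no overlap or padding, which the text implies but does not state; the function name `block_layout` is ours.

```python
def block_layout(image_size, block_size):
    """Return (number of blocks, data points per block) for a tiled image.

    Assumes non-overlapping blocks that tile the image exactly.
    """
    img_w, img_h = image_size
    blk_w, blk_h = block_size
    if img_w % blk_w or img_h % blk_h:
        raise ValueError("block size must divide image size evenly")
    num_blocks = (img_w // blk_w) * (img_h // blk_h)
    points_per_block = blk_w * blk_h
    return num_blocks, points_per_block


# Image and block sizes from Table 5 (width x height, in pixels).
baboon = block_layout((512, 512), (128, 128))
aerial = block_layout((1024, 1024), (128, 128))
print(baboon)  # (16, 16384)
print(aerial)  # (64, 16384)
```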

4.3 Results and analysis

Table 6 illustrates the optimization effects on TD algorithm.

Table 6
Optimization methods’ effects on TD algorithm

Image name	Speedup (shared memory)	Speedup (pipelined pattern)	Speedup (data reordering)
Baboon	1.02	Lower than 0.1%	2.33
usc2.2.03	1.05	Lower than 0.1%	4.87

Figure 6.

MG algorithm: Speedup gained by pipeline pattern. Ten CUDA streams are used.

With regard to the MG algorithm, Figs 6 and 7 show the speedup achieved by the pipeline pattern and data reordering, respectively.

For the TD algorithm, the speedup achieved by data prefetching is not negligible (Table 6). By contrast, the acceleration obtained through prefetching is negligibly small for MG. The reasons are as follows. During step $t$ , the batch clustering part of TD (the batch-mode GMM EM algorithm) needs to access every data point of ${\bm{X}}_{t}$ in each iteration. As a result, data locality is strong and data points in shared memory tend to be accessed many times. In contrast, the batch clustering part of MG (the batch-mode mean-shift algorithm) only needs to access data points lying within a certain distance of the current cluster centers. Thus data locality is comparatively weak and data points in shared memory tend to be accessed fewer times.

As is shown by Table 6 and Fig. 6, MG algorithm achieves much higher speedup through pipelined pattern than TD algorithm. The time complexities of EM and mean-shift algorithms are $O(|\bm{X}_{t}|\cdot N_{t}\cdot I_{t})$ and $O(|\bm{X}_{t}|^{2}\cdot I_{t})$ , respectively; and the symbols are explained as follows:

•

$\left|{{\bm{X}}_{t}}\right|$ : the number of data points in ${\bm{X}}_{t}$ ;

•

$N_{t}$ : the number of clusters in $C_{t,\textit{new}}$ ;

•

$I_{t}$ : the total iteration number of the batch clustering algorithm in the $t$ th step.

Although mean-shift exhibits quadratic complexity in terms of $\left|{{\bm{X}}_{t}}\right|$ , the batch clustering part of the TD algorithm incurs much more time overhead than that of the MG algorithm. The reasons are as follows. First, the TD algorithm needs to run standard EM (50 $+$ 51 $+$ 52 $+$ … $+$ 100) $=$ 51 $\times$ (50 $+$ 100)/2 $=$ 3825 times (the search range of the cluster number is 50 $\sim$ 100). Second, the TD algorithm generates significantly fewer clusters than the MG algorithm [5, 6], and generating fewer clusters means many more iterations.
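
The number of standard EM runs quoted above is an arithmetic series over the candidate cluster numbers 50 to 100. A short Python check, under the assumption that both endpoints are included (51 terms), gives the total:

```python
# Candidate cluster numbers searched by TD, endpoints included (assumed).
k_min, k_max = 50, 100
num_terms = k_max - k_min + 1

# Direct summation and the closed form for an arithmetic series.
direct = sum(range(k_min, k_max + 1))
closed_form = num_terms * (k_min + k_max) // 2

print(num_terms, direct, closed_form)  # 51 3825 3825
```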

In the case of this paper, TD algorithm is computation-intensive. In contrast, MG algorithm is much more data-intensive. As a result, MG algorithm obviously benefits from the pipeline pattern (Fig. 6). Nevertheless, the pipeline pattern can exert almost no influence on TD algorithm (Table 6).
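
The contrast between the two algorithms can be illustrated with an idealized three-stage pipeline model (disk-to-host copy, host-to-device copy, kernel execution). This is a sketch rather than the measurement method used in the experiments: the stage times below are hypothetical, every data block is assumed to cost the same, and each stage is assumed to handle one block at a time.

```python
def pipeline_speedup(stage_times, num_blocks):
    """Speedup of an ideal pipeline over strictly sequential processing.

    stage_times: per-block time of each stage (arbitrary units).
    Sequential time: every block passes through all stages back to back.
    Pipelined time: fill the pipeline once, then the slowest stage
    dictates the interval between finished blocks.
    """
    per_block = sum(stage_times)
    sequential = num_blocks * per_block
    pipelined = per_block + (num_blocks - 1) * max(stage_times)
    return sequential / pipelined


# Hypothetical stage times: (disk-to-host, host-to-device, kernel).
compute_bound = pipeline_speedup((1.0, 1.0, 1000.0), num_blocks=16)
data_bound = pipeline_speedup((1.0, 1.0, 2.0), num_blocks=16)

print(f"compute-intensive: {compute_bound:.3f}")  # 1.002
print(f"data-intensive:    {data_bound:.3f}")     # 1.882
```

In this model the gain shrinks toward 1 as kernel time dominates, which agrees qualitatively with the behavior reported for TD and MG.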

Figure 7.

Overall speedup of GPGPU-powered MG algorithm: Effects of data reordering.

Figure 8.

Incremental image segmentation results.

Table 6 and Fig. 7 demonstrate that both TD and MG algorithms achieve significant speedup through data reordering. The reasons are as follows. First, global memory is the primary storage unit of the GPGPU, and both algorithms necessarily read a large amount of data from it. Second, both algorithms place most of their computation in the batch clustering part, and the batch clustering parts of both algorithms possess high data parallelism (GPGPU-friendly). Third, high data parallelism is a vital precondition for data reordering to work successfully.
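
To illustrate why reordering can raise the data rate of global memory, the Python sketch below compares the element offsets touched by consecutive threads under two layouts. It assumes a generic point-major (array-of-structures) to feature-major (structure-of-arrays) transformation; the reordering scheme actually used for TD and MG may differ, and the names `aos_offset` and `soa_offset` are illustrative.

```python
def aos_offset(point, feature, num_features):
    """Offset of one feature when each point's features are stored together."""
    return point * num_features + feature


def soa_offset(point, feature, num_points):
    """Offset of one feature when each feature is stored as its own array."""
    return feature * num_points + point


num_points, num_features = 8, 3   # hypothetical sizes
feature = 0                       # all threads read the same feature

# Thread i reads feature 0 of point i.
aos = [aos_offset(i, feature, num_features) for i in range(num_points)]
soa = [soa_offset(i, feature, num_points) for i in range(num_points)]

print(aos)  # [0, 3, 6, 9, 12, 15, 18, 21]  strided: poorly coalesced
print(soa)  # [0, 1, 2, 3, 4, 5, 6, 7]      contiguous: coalesced
```

When consecutive threads of a warp touch consecutive addresses, the hardware can serve them with fewer memory transactions.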

For the MG algorithm alone, a larger bandwidth means fewer iterations of mean-shift (a uniform kernel is used). Consequently, the total time overhead of the algorithm is smaller, while the data transmission time remains unchanged. As a result, a larger bandwidth makes the MG algorithm more data-intensive.

From our previous work [6], we know that a larger bandwidth means that a higher percentage of the MG algorithm’s computation runs on the GPGPU. Hence the speedup obtained by data reordering rises with increasing bandwidth (Fig. 7).

In the disk-to-host transmission, we copied each whole data block all at once. If the data points of a data block had been copied one by one, the pipeline pattern would have achieved much higher speedup.

Representative incremental segmentation results are shown in Fig. 8.

Figure 9 compares the overall speedup of finely optimized multithreaded CPU algorithms and GPGPU-powered algorithms. The multithreaded versions are implemented using POSIX threads; they run 12 threads concurrently and are optimized as follows: (1) compile with the compiler option -xo4; (2) use mmap() calls to map files into the address space rather than calling read(); (3) reorder data points to enhance data locality; (4) identify fine-grained locking through performance tuning; (5) use asynchronous I/O to overlap computing and I/O operations. Our GPGPU-powered methods obtain significant speedup over the multithreaded implementation.

Figure 9.

Speedup of multithread-CPU and GPGPU-powered algorithm over single-thread implementation.

5. Conclusion

Our methods exploit multi-level optimizations (especially optimizations relevant to data access/transmission) in a CPU-GPGPU hybrid system to accelerate the incremental clustering algorithm. First, warp level: the basic idea of the TD and MG algorithms requires that most of the computation of an incremental clustering algorithm possess high data parallelism. As a result, a large number of threads can execute simultaneously to overlap computing operations and read/write operations. Second, shared-memory level: prefetching data into shared memory can reduce data access latency if the data are accessed repeatedly. The precondition for this optimization to work is strong data locality. Third, global-memory level: memory coalescing through data reordering is a widely applicable method for optimizing global memory access. In addition, coalescing tends to achieve higher speedup if data parallelism is higher. Fourth, data I/O level: CUDA streams can overlap disk-to-host transmission, host-to-device transmission and kernel execution. A larger proportion of data transmission time means higher speedup achieved by CUDA streams.

Our future work will focus on data transmission and access optimization for GPGPU-powered clusters.

Acknowledgments

This work was financially supported by the following research funds:

References

1. Shambayati and Chien A.A., A Data Layout Transformation (DLT) accelerator: Architectural support for data movement optimization in accelerated-centric heterogeneous systems, in: Europe Conference & Exhibition in Design, Automation & Test (DATE), 2016, pp. 1489–1492.

2. Peña A.J. and Balaji, A data-oriented profiler to assist in data partitioning and distribution for heterogeneous memory in HPC, Parallel Computing 51 (2016), 46–55.

3. Zhou, Cao, Qian and Jin, Tracking clusters in evolving data streams over sliding windows, Knowledge and Information Systems 8 (2007), 181–214.

4. Aggarwal C.C., Han, Wang and Yu P.S., A framework for clustering evolving data streams, in: the 29th International Conference on Very Large Data Bases (VLDB’03), 2003, pp. 81–92.

5. Chen, Zhang and Hu, Incremental Learning of Gaussian Mixture Model: A Top-down Algorithm, Journal of Computational Information Systems 9(17) (2013), 6971–6980.

6. Chen, Zhang and Hu, Nonparametric Incremental Clustering: A Moderate-grained Algorithm, Journal of Computational Information Systems 10(3) (2014), 1183–1193.

7. Computing with CUDA Lecture 3 – Efficient Shared Memory Use. http://www.bu.edu/pasi/files/2011/07/Lecture31.pdf.

8. CUDA Programming Guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz30xFGR1Ve.

9. Zhao, Liu, Kimpe et al., Towards Exploring Data-Intensive Scientific Applications at Extreme Scales through Systems and Simulations, IEEE Transactions on Parallel & Distributed Systems 27(6) (2016), 1824–1837.

10. Machlica, Vanek and Zajic, Fast Estimation of Gaussian Mixture Model Parameters on GPU Using CUDA, in: International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2011, pp. 167–172.

11. Gowanlock, Rude C.M., Blair D.M. et al., Clustering Throughput Optimization on the GPU, in: Parallel and Distributed Processing Symposium (IPDPS), 2017, pp. 832–841.

12. Huang, Men and Lai, Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms, in: International Workshop on Modern Accelerator Technologies for GIScience (MAT4GIScience), 2012, no page number.

13. Song and Wang, Highly efficient incremental estimation of Gaussian mixture models for online data stream clustering, SPIE05 (2005), 174–183.

14. Hua, Zhong and Zheng, Enabling low latency at large-scale data center and high-performance computing interconnect networks using fine-grained all-optical switching technology, in: International Conference on Optical Network Design and Modeling (ONDM), 2017, no page number.

15. Vinh N.X., Epps and Bailey, Information theoretic measures for clusterings comparison: is a correction for chance necessary? in: International Conference on Machine Learning, 2009, pp. 1073–1080.

16. Momcilovic, Roma and Sousa, Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms, Journal of Real-Time Image Processing 11(3) (2016), 571–587.

17. USC-SIPI. http://sipi.usc.edu/database/.

18. Y.Y., Cho S.W., Kim S.W. et al., Collaborative processing of data-intensive algorithms with CPU, intelligent SSD, and GPU, in: ACM Symposium on Applied Computing (SAC), 2016, pp. 1865–1870.

19. Zhang, Feng, Liu, Tong and Fang, A Novel ReRAM-based Main Memory Structure for Optimizing Access Latency and Reliability, in: the 54th Annual Design Automation Conference, 2017, Article No.82.
