Abstract
Software solutions are not effective to be used in network applications because of their low throughput. By employing hardware implementation on FPGA, not only sufficient flexibility is achieved but also the throughput is increased considerably. In this paper, two multi-core architectures are proposed for Bloom filter and CRC as two main network processing core functions. These architectures called multi-core architecture with shared queue and multi-core architecture with private queue. The proposed architectures are implemented for 1, 2, 4, 8 and 16 cores. Experimental results show that multi-core architecture with private queue achieves higher throughput In comparison to the other one. As compared to Bloom filter, CRC application leads to less computational load and consequently more throughput. Moreover, Bloom filter is implemented on GPU and CPU and the results are compared with each other. When number of packets in GPU memory is 16384, the speedup achieved by GPU implementations using CUDA is about 274 times compared with CPU implementations. However, FPGA results outperform GPU, so that the throughput of the first architecture (shared queue) and second architecture (private queue) with 16 cores are almost 5.5 and 7.1 times higher than GPU throughput, respectively.
Keywords
Introduction
In recent years, the role of parallel architectures to achieve higher performance and speed-up in various programs has become more and more important. According to Moore’s law, the number of transistors on a chip has become double every 18 to 24 months. This growth in the number of transistors has increased the efficiency of on-chip hardware. However, this law is just about hardware while development in software area is far behind than hardware developments [20]. The purpose of advances in processor developments is not mainly because of optimizing the serial performance of general-purpose processors. On the contrary, the parallelism and employing more cores in processors are the main goals. This is due to the fact that increasing the frequency of a single-core CPU leads to more power consumption which is the main reason of moving toward multi-core technology [6,29].
Multi-core processors are used in various aspects of computer field such as signal processing, image processing, embedded systems, desktops and etc. These processors can provide the required processing power with a reasonable power consumption level. Moreover, the Internet and other computer networks are growing rapidly. This growth is because of increasing number of different services such as Firewalls, Quality of Service (QoS), Virtual Private Networks (VPN) and network security. Therefore, the bigger number of users and services becomes, the more amount of bandwidth is required. Furthermore, hardware and software of the next generation network devices, especially routers, must be capable of supporting the aforementioned services [11,15]. Consequently, the need for routers with high speed and high throughput which are capable of fast and accurate packet processing is essential.
Packet processing is widely used in network devices such as routers, switches and firewalls. The main purpose of packet processing is to provide different services such as Quality of Service (QoS), Virtual Private Networks (VPNs), network security and policy-based routing [2]. To provide these services, routers have to classify incoming packets according to the routing information of one or more fields in the packet header. These fields can be source/destination addresses, protocol field, and source/destination port numbers. Packets are classified based on the rule-sets. Consequently, for each incoming packet the process of classification based on the pre-determined fields is performed. This means for each packet a search is performed in the rule-set in order to find the proper operation for that packet, but considering the huge number of packets and rules in a rule-set, this operation is extremely time-consuming. Former solutions of this problem were based on hash functions to accelerate this operation while faster and more convenient solution for reducing the duration of this process is based on Bloom filters [1,8]. Bloom filter is an optimal and space efficient data structure which can be used for membership checking of an input element (query) in a data-set.
In heavy network processing tasks such as packet classification, custom cores (e.g. Bloom filter and CRC cores) can be applied in order to achieve higher performance. Field-Programmable Gate Arrays (FPGAs) as reconfigurable platforms are very suitable cases to implement aforementioned cores in order to optimize memory usage and accelerate them when they are used as network processors. On the other hand, high arrival rate of network packets as well as the potential of parallel processing of individual packets are interesting motivations to use multi/many-core architectures. In this paper, high-performance multicore architectures for network processing applications are proposed. The cores in this architecture can be selected among different network processing applications. In this case, Bloom filter and CRC cores are selected to implement on this architecture. Moreover, the Bloom filter is also implemented on GPU as a many-core platform. Furthermore, hardware implantation of Bloom filter-based processor on FPGA has been compared with the GPU implementation as many-core architecture.
The contributions of this paper are as follows:
The proposal of two high-performance multi-core architectures (with shared and private queues) for network processing.
Hardware implementation of Bloom filter and CRC as network processing cores on these architectures using FPGA.
Software Implementation of Bloom filter on GPU as many-core architecture to compare with the proposed architectures.
The rest of this paper is organized as follows: related work is summarized in Section 2. An overview of Bloom filter, CRC and GPGPU is given in Section 3. In Section 4, the proposed architectures are explained. Implementation results are presented in Section 5. Finally, Section 6 concludes the paper.
Related work
Because of current improvements of the multi-core systems, parallel packet processing has gained more population. The following contains some of the many different works in the field of network processing especially packet processing. In [25], a template matching algorithm is proposed with optimized algorithm for network processing dedicated multicore platforms. Their method is implemented on a platform of 16 MIPS cores. In [24], a method is presented for packet classification which is optimized for the multi-core network processors. The evaluation has been done on the Intel IXP 2850 network processor. Moreover, some researches have been done using multicore processors in the network security where high speed of packet classification is needed [18,28,31]. High-throughput and tremendous performance of the multi-core processors make them an excellent candidate. Bloom filter is also used in different applications of computer networks [8] specially in the network security [13] (For example it has been shown that Bloom filter can accelerate a software router to speedup IP lookups [4]). On the other hand, a lot of works have been done in the field of cyclic redundancy codes such as parallel implementations of CRC [10,16]. In [30], packet classification task deployed in the virtual switches that exploits Bloom filter searches is implemented on GPU. The authors try to build GSwitch as a GPU-accelerated software switch. In [26], a 2-dimensional pipelined architecture for packet classification on FPGA; this architecture achieves high throughput while supporting dynamic updates. In this architecture, modular Processing Elements (PEs) are arranged in a 2-dimensional array. Each PE accesses its designated memory locally, and supports prefix match and exact match efficiently.
In [14], authors proposed a GPU-based multiple-pattern matching algorithm for filtering malicious packets by using a Bloom filter to inspect the packet payload by leveraging the high parallelism computing power of GPU. They compared the proposed algorithm with different GPU-implemented technologies to sequence the Bloom filter algorithm on different platforms and achieved 58× speedup in comparison to a single thread platform.
In [3], a two layer NIDS to accelerate the performance and processing capacity of Snort NIDS is proposed. To accelerate Snort NIDS, they offload dynamically and on the fly the preprocessor and the detection engine functions of the most frequently triggered rules to NetFPGA. They implemented Bloom filter technique on NetFPGA to match the incoming packets against the offloaded rules.
In [5], a hybrid computing architecture which enables the communication between the Android OS and a traffic analysis hardware accelerator, coexisting on the same chip is proposed. At this aim, the proposed architecture is hosted by new FPGA chip family, the Xilinx’s Zynq, a SoPC based on dual-core ARM.
In [33], a distributed and parallel data statistical modeling algorithm is implemented within the MapReduce framework. Based on that the big data in a certain unit block can be assigned into several distributed compute nodes. A statistic combination strategy is induced so that the intermediate results from each block can be combined into the global result of the entire dataset.
In [23], a novel design and implementation for the MILC compression algorithm, denoted as “Parallel MILC”, which is able to exploit the power and the capabilities of the parallel computing paradigm is proposed. By doing this, the novel algorithm we propose can be executed over several heterogeneous device types supporting the OpenCL framework, as for example CPU, GPU, FPGA and many others. They redesigned the MILC compression strategy according to the OpenCL framework. The speedup achieved by the proposed algorithm ranges from 4 up to 36 times faster than MILC.
In [32], a high-level overview of the existing parallel data processing systems categorized by the data input as batch processing, stream processing, graph processing, and machine learning processing and introduce representative projects in each category is proposed. Then, they surveyed other batch-processing systems, including general-purpose systems Dryad, Nephele/PACT and Spark. SQL-like systems involved in this paper are Hive, Shark, SCOPE, AsterixDB, and Dremel. For stream processing systems, Storm and S4 are introduced as representatives. Scalability is one of the ML algorithms bottlenecks. We then discussed how graph-centric systems like Pregel and GraphLab, and ML-centric systems like Petuum, parallelize the graph and ML model, as well as their distinctive characteristics.
In the standard Bloom Filter, there are different types of hash functions (e.g. CRC32 and CRC32C are used in this paper) to generate k indexes to be set in Bloom filter bit-array. The functions also can be processed in parallel. Ma et al. [19] implemented the Bloom Filter on GPU, specifically targeting a genome bio-sequence alignment application. In their design, long queries are split into multiple sub-queries and the sub-queries are processed independently by the threads. The sub-query size influences two performance indicators, throughput and false positive rate. There is an inherent tradeoff between them and changing sub-query size in order to decrease false positive rate, reducing the throughput as well. The number of indexes and the number of sub-queries are limited. Furthermore, increasing number of sub-queries increases false positive. Therefore, the previous models cannot exploit capacity of parallel execution of a large number of threads offered by modern GPUs. In our previous work [12], two high-performance architectures with shared and private queue is presented, in this work, the results and details of these architectures are highlighted.
Background fundamentals
In this section, Bloom filter and CRC which are used primarily in the processing cores are explained. Afterwards, a brief introduction to General-purpose computing on GPU (GPGPU) is presented.
Bloom filter
Bloom filter is a space efficient randomized data structure for representing a data-set in order to support membership queries. Burton Bloom introduced Bloom filters in the 1970s [7]. A set
In this equation, n represents the number of elements, m represents the number of bits in the bit array and k represents the number of hashing functions. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is as follows:
Also hash functions can be computed from two CRC values according to the following equation [17]:
In this equation
Cyclic redundancy check (CRC)
One of the popular hash functions in computer networks is Cyclic Redundancy Check (CRC). It has been widely used for error detection in computer networks and data storage applications [22,27]. The major advantages of this hash function are straight-forward hardware implementation, simple mathematical analysis, the high probability of error detection.
In telecommunication systems, sender and receiver in order to detect errors, transmit 2 signals, i.e., the Message and the Key. Let message
General-purpose computing on GPU (GPGPU)
General-Purpose computing on Graphical Process Unit (GPGPU) [21] refers to the fact that the GPU can be used as general purpose processor so that it is specialized for highly parallel and computationally intensive processes. To apply the parallel computing capabilities for solving many complex computational problems, NVIDIA Corporation introduced Compute Unified Device Architecture (CUDA) as a general purpose parallel computing architecture with a new parallel programming model and instruction set architecture [9]. In-fact CUDA is an extended model of standard C language for parallel computing that allows the user to program algorithms on GPU easily. Comprehensive information about Parallel programming with CUDA can be found in [9].
The GPU includes a set of multiprocessor cores while each multiprocessor includes several scalar processors that are capable of executing a large number of threads concurrently. Therefore, each multiprocessor includes Single Instruction, Multiple Threads (SIMT) architecture, and consequently a large number of threads run the same instruction, operating on different data elements in parallel.
In a CUDA program, a function that must be executed on GPU is called kernel. The same process on N different data elements is written in forms of a kernel to execute by N threads in parallel. The kernel is invoked by the host (CPU side) to run on the device (GPU side). Programmer determines the number and structure of threads based on the data elements. However, the programmer should consider constraints of GPU model to achieve high performance.
The proposed architecture
The proposed architecture based on Bloom Filter as well as its implementation on FPGA and GPU is described in this section. Finally, the CRC multi-core architectures are explained in details.
Bloom filter on FPGA
Our proposed architecture consists of
The polynomials of the CRC hash functions are as follow:

Block diagram of a single worker core.
These hash functions are calculated simultaneously in a single clock. The 6 locations of the related bit-array are also set in one clock. Therefore, the programming phase requires 7 clocks. This phase is done only once in order to insert rule-set fields into the Bloom filter. Afterwards, the querying phase is performed whenever a packet enters in each worker core.
In querying phase, whenever a packet enters from input line, the two CRC32 and CRC32C hash functions are calculated in parallel in one clock. Afterwards, the resulting values are used in order to generate 4 more hash values using Eq. (3). If all these
This proposed design is described as a processing core using VHDL language to implement on FPGA. Some of the advantages of the proposed method are hardware implementation, the customized design and the usage of high-performance hash functions that significantly decreases the processing time of the packets. Moreover, this architecture is flexible because of its implementation on FPGA, and consequently it is easy to change the design according to the new requirements.
Figure 1 only shows how a single Bloom filter core works. But in order to gain high-performance network processing capability, larger number of such cores is needed. Consequently, in a multi-core architecture a control unit is needed to manage these cores. Two different architectures are proposed for control unit as follows.
As mentioned previously, a single Bloom filter core only includes computational capability and since to create a multi-core architecture, in order to get these cores together, a control unit is required. One of the possible architecture to this control unit (core manager) is using shared queues as depicted in Fig. 2. The manager controls the cores and distributes arrived packets among them. All received packets are firstly put into the shared FIFO core and then they will be distributed among different processing cores. All cores can access the shared queue such that whenever a core was able to process another packet, sends its request to the manager and the manager delivers the packet from queue (if any) to it.

Block diagram of a shared queue manager.
As Fig. 2 shows, the shared queue architecture consists of two major parts: 1) the shared queue and 2) the manager. The input packet can be received using the 32-bits input line. The “valid_in” signal clarifies the validity of input value. The shared queue is capable of receiving packets continuously. On the other hand, the processing cores can request packets through request line whenever they want and receive them through output line. The control unit processes core requests and if a packet exists, it delivers the packet through output line to the destined core while setting the “valid_out” to clarify the validity of the output.
This was only a simplified scheme of shared queue architecture in order to give a simple explanation. The complete architecture when Bloom filter based cores (explained in the previous sub-section) are assembled is depicted in Fig. 3.

Bloom Filter based multi-core with shared queue.
Figure 3 shows the complete architecture of a multi-core shared queue Bloom filter based solution. In this architecture, the cores are Bloom filter based and the “Master” unit is the manager or the control unit. The only difference in this scheme is that when the two or more cores send their requests simultaneously, the manager selects the core with lowest ID using “sel” signal. Since there is only one output in the shared queue architecture, we have put two de-multiplexers in order to send the packet to the destined Bloom filter core and set the validity signal.
As it is obvious, the major drawback of this architecture is that it can only deliver a single packet in one clock. Therefore, this may cause a speed bottleneck when number of request is enormous. Therefore, a scheme with private queue is proposed in order to conquer this problem.
This scheme is similar to shared queue architecture including the queue and control unit but in this scheme each core includes a private queue. The simplified scheme of this architecture focusing on the core manager is depicted in Fig. 4. In this scheme, the numbers of queues are equal to the number of cores. The length of each queue is set to

Block diagram of a private queue manager.
As mentioned previously, the major drawback of the shared queue architecture was the inability to send multiple packets to different cores in one clock. This problem has been resolved in this scheme, and different packets can be sent to different cores in a single clock. The complete architecture of private queue scheme is depicted in Fig. 5.

Bloom Filter based multi-core with private queue.
As Fig. 3 and Fig. 5 show, there are two signal clocks named “clk1” and “clk2” which are responsible for writing and reading signals, respectively. That is because of ISE IP-core requirements which need two different frequencies to be applied.
Unlike aforementioned related work [19], the queries (packets) are processed in parallel in our implementation. The large number of incoming packets is a motivation to exploit thousands of device threads, such that each thread is assigned to one packet for processing. Therefore, in the kernel, CRC32 and CRC32C hash functions are performed for a packet to check its membership. If the number of threads considered by the programmer exceeds the total number of the threads which can run simultaneously on GPU, then extra threads should run serially. This matter is considered in performance considerations.
Since the GPU implementation of Bloom filter is applied widely for network processing, a PC with programmable GPU can be used as high-performance router. The incoming packets from network interface are stored in host memory. Afterwards, when the device memory is fully filled with the received packets, the packets are transferred to it. Thus, more packets are processed concurrently by GPU.
CRC multi-core architectures
In order to evaluate the Bloom filter efficiency and proposal of a new faster architecture, the worker cores are implemented using CRC hash functions alone. Therefore, the worker cores only compute CRC of packets in a single clock. Consequently, the speed and resource usage are expected to raise and decrease, respectively. The resulting architectures are similar to Bloom filter equivalent but worker cores are replaced with CRC cores instead of Bloom cores.

CRC based multi-core with shared queue.
Figure 6 shows the multi-core architecture with shared queue based on CRC worker cores. It is obvious that the only change compared with Bloom filter based architecture Fig. 3 is the worker cores replacement.

CRC based multi-core with private queue.
Moreover, Fig. 7 shows the multi-core architecture with private queues using CRC worker cores. The same situation is repeated for this scheme too. The worker cores are replaced with CRC instead of Bloom filter. All CRC cores compute the hash in a single clock compared with 7 clocks in Bloom filter cores. Therefore, a considerable speedup is expected as shown in the next section.
VHDL description of the proposed architectures has been synthesized for implementation on Xilinx Virtex-II Pro FPGA. The results of Bloom filter based architectures are discussed firstly, and the CRC based results are explained in details secondly.
Resources and throughput of Bloom filter evaluation
The required resources in order to implement Bloom filter based on Shared queue architecture is depicted in Table 1.
Resources used for Shared Queue Architecture (Bloom)
Resources used for Shared Queue Architecture (Bloom)
Also Table 2 shows resources used to implement Bloom filter based on Private queue architecture.
Resources used for Private Queue Architecture (Bloom)
In order to measure the throughput of the two implemented architectures, the following equation is used:
As mentioned before, the required clock of Bloom filter is equal to 7 and for the CRC is equal to 1. Table 3 depicts the frequency of master and workers of both shared and private queue architectures for different number of cores.
Shared and private queue architectures frequency for different number of cores
Based on the Eq. (4) and clock period in Table 3 the theoretical throughput of the architectures is computed. Given that, the throughput of the two architectures is calculated in Table 4.
Throughput comparison of the proposed architectures (Bloom)
These results show that the increasing number of cores leads to higher and better throughput. Figure 6 depicts the results of Bloom filter cores throughput in three different columns (i.e. Shared Queue, Private Queue, and Theoretical value).

Bloom filter throughput.
The Fig. 8 clearly shows that with increasing number of cores, all schemes, i.e. shared, private and theoretical results increase.
On the other hand, it must be determined that by increasing throughput what costs (resources consumption) are followed. We have shown the ratio of throughput to LUTs and registers for Bloom filter schemes in the following charts.

The ratio of Bloom filter throughput to registers.

The ratio of Bloom filter throughput to LUTs.
Both Figs 9 and 10 show that the ratio of throughput to LUTs and Registers decreases with increasing number of cores. This degradation is more evident for 16 cores. This reduction is due to increased number of consuming resources which makes these resources to be distributed in the FPGA, and decreases the required frequency significantly.
To sum up, it can be noted that the shared queue architecture achieves its highest efficiency using 4 Bloom filter cores. However, its efficiency drops when more cores are employed. Hence, the private queue architecture in most cases shows almost efficient behavior.
In this subsection, we propose a faster multi-core architecture using CRC working cores instead of Bloom filters. These cores can compute the hash function in only one clock instead of 7 clocks in Bloom filter.
Similar to Bloom filter architectures, Table 5 shows the required resources in order to implement CRC based Shared queue architecture.
Resources used for Shared Queue Architecture (CRC)
Resources used for Shared Queue Architecture (CRC)
Moreover, required resources employed to implement CRC based Private queue architecture is shown in Table 6.
Resources used for Private Queue Architecture (CRC)
In order to measure the throughput of the two implemented architectures Eq. (4) is used. As mentioned before, the required clock of CRC core is only a single clock. Given that, the throughput of the two architectures is calculated in Table 7.
Throughput comparison of the proposed architectures (CRC)
These results lead to similar conclusion that the increasing number of cores leads to higher and better throughput. Figure 11 depicts the results of Bloom filter cores throughput in three different columns (i.e. Shared Queue, Private Queue, and Theoretical value).

CRC throughput.
Figure 11 clearly shows that with increasing number of cores, all schemes, i.e. shared, private and theoretical results increase. In the first architecture, if all cores are idle, in each clock the packet can be delivered to only one of them. Hence, in the private queue architecture each idle core can receive a packet in a clock because of exclusive queue for the cores. This difference between the two schemes, leads to higher throughput in the latter one. In other words, in first scheme bigger number of cores causes throughput bottleneck while in the latter one leads to higher throughput. On the other hand, it must be determined that by increasing throughput what costs (resources consumption) are followed. We have shown the ratio of throughput to LUTs and registers for CRC schemes similar to Bloom filter in the following charts.

The ratio of CRC throughput to registers.
Figure 12 shows that by increasing number of cores, in second scheme (private queue) the ratio is almost equal to the expected theoretical value, and increased slightly with the growing number of cores. This shows that the number of registers used almost in proportion to the increased throughput. The first architecture (shared queue) has also increased with the growing number of cores up to 8 cores. However, this ratio has declined slightly in the 16 cores in which shows that despite the increasing number of cores, increasing resource consumption have been greater than the increase in throughput.

The ratio of CRC throughput to LUTs.
Figure 13 shows that by increasing the number of processing cores in private queue architecture, the behavior is not very precise and has a variable behavior. The shared queue scheme shows good performance up to 4 cores. But then it takes a downward trend because of its inability to respond to all cores. In other words, the queue itself becomes a throughput bottleneck with the increasing number of cores.
Ultimately, in the case of resource consumption is not considered, the CRC based private queue architecture outperforms shared queue scheme in terms of throughput. Hence, the aforementioned resource consumption charts showed that the ratio of LUTS and Registers consumption for both schemes is almost the same. In other words, it is true that the second scheme has higher throughput compared with shared queue architecture, but it uses much more resources. The first scheme shows fewer throughputs and uses fewer resources.
In this subsection, in order to compare CRC and Bloom filter as network processing approaches, a throughput comparison among them for different number of cores is presented. The results are depicted in Fig. 14.

Throughput ratio of CRC architectures to Bloom (shared and private queue).
In Fig. 14 the x axis shows the number of cores for each scenario and the y axis shows the throughput ratio of CRC architecture to Bloom architecture (shared and private queue). In the first scheme, the proportion remained relatively constant. Of course, the amount of precipitation at the beginning of the graph is due to the using of single-core in the CRC and Bloom architectures. In shared-queue architecture for both CRC and Bloom cores, the ratio remains constant. However, for private queue architecture, for more than 8 cores, the ratio increases almost linearly. This is due to considerable operating frequency drop for Bloom filter based architecture for more than 8 cores while CRC based approach does not get affected by this problem.
As mentioned before, the Bloom filter was also implemented on NVIDIA GeForce GT 540M GPU using CUDA-C. Furthermore, the entire program is written on the host side using C language to compare the execution time.
Figure 15 indicates the throughput of GPU version for various numbers of packets which are transferred to device memory.

GPU throughput.
As it can be seen in Fig. 15, by increasing the number of packets the throughput increases as well. The time of data transfer is considered for throughput measurement. As the number of packets increases, we increase the number of concurrent threads for processing packets, fixing the execution time of the kernel. Consequently, with storing bigger number of packets in device memory for parallel execution, the throughput significantly rises. Figure 16 shows GPU speedup over CPU.

GPU speedup over CPU.
As depicted in Fig. 16, the maximum achieved speedup is 274x. It is notable that rising throughput by increasing the number of packets stops at a certain point due to two limiting factors. First one is the total number of threads that can really run by GPU in parallel and the second one is limitation of available memory by device for storing packets.
Finally, we have measured the throughput of all three proposed architectures in the best case. The best case for GPU implementation is when number of packets residing on the memory is equal to 16384 (as explained in the previous subsection). The best result for FPGA implementations is also when number of cores is at the highest value i.e. 16 cores. The results are as follows:
Throughput of proposed architectures in the best case (Packets/Seconds)
Throughput of proposed architectures in the best case (Packets/Seconds)
Based on the Table 8, FPGA outperforms GPU implementation, so that the throughput of first architecture (shared queue) and second architecture (private queue) with 16 cores are respectively almost 5.5 and 7.1 times higher than GPU throughput.

Throughput comparison of the three architectures.
To sum up, Fig. 17 depicts the discussed results of throughput comparison of 3 different network processing architectures in which Private Queue scheme implemented on FPGA outperforms the other two.
This paper presented an implementation of Bloom filter and CRC cores on FPGA and GPU that can be used for network processing and their results were compared. In case of FPGA which is a platform for hardware implementation of Bloom filter and CRC, we designed a custom core which achieved a throughput of 234 million packets per second. Hence, we showed that CRC based architectures outperform Bloom filter based architectures significantly. Moreover, considering inherent parallelism in the querying phase of the Bloom Filter, we implemented our architecture on the GPU using CUDA. Experimental results showed that higher speedup is achieved when bigger number of input packets is simultaneously transmitted to the device memory. This is due to the fact that there is a lower communication overhead per packet in transferring input packets to device memory where the number of packets is very large. The throughput of Bloom filter using CUDA-enabled GPU was achieved approximately up to 32.9 million packets per second.
