Abstract
The bandwidth growth of networks increased almost exponentially in the recent years and is expected to continue so for years to come. This has been fuelled by emerging new technologies that are capable of achieving higher bandwidths. Consequently, new applications are being developed that take advantage of the new capabilities. These complex applications and heavier traffic have emerged demands on more powerful devices. Furthermore, due to dynamic nature of network traffic, totally scalable methods for packet processing and implementation platforms that are able to cope with this dynamicity are needed. Field-Programmable Gate Arrays (FPGAs) are efficient candidates as they contain hardware programmable logic units that can be reconfigured based on the application requirements. ρ-VEX is a reconfigurable soft-core Very Long Instruction Word (VLIW) processor on an FPGA and it is able to use the inherit Instruction Level Parallelism (ILP) of applications. ρ-VEX has all the requirements of processing network applications. For the purpose of scalability and economical reasonability, in this paper, an implementation of CRC called one-clock CRC (OC-CRC) as a customized instruction in ρ-VEX is presented. Using this customization, one of packet classification algorithms that have taken advantage of Bloom filter e.g. tuple pruning using Bloom filter is implemented. CRC is selected as a custom instruction because of its importance in Bloom filter as hashing function and error detection in the most of the layer of common protocol stacks. The results show the huge performance gap between General Purpose (GPP) processors and our customized ρ-VEX such that our customized ρ-VEX provided on average approximately 8× speedup.
Introduction
Internet at its beginning was only text emailing and webs but it encountered to drastic changes. On one hand the increasing of the number of applications e.g. distributed databases, videoconferencing, multiplayer gaming and the number of network users causes more traffic growth. The new traffic means many new added operations on each network packet. The result is more processing requirements per packet in comparison to early network devices such as firewall, antivirus and intrusion detection/prevention applications. For this complex applications and heavier traffic than early days of networks, demands on more powerful devices, which are able to handle this new situation, have been emerged. Due to dynamic nature of network traffic, the implementation platforms that are able to cope with this dynamicity is needed [25,26]. Field Programmable Gate Arrays (FPGAs) are efficient candidates as they contain hardware programmable logic units that can be reconfigured based on the application requirements, sometimes FPGAs referred to as reconfigurable platforms [26]. Application Specific Integrated Circuit (ASIC) is the other perfect candidate, but considering time to market as an important parameter in hardware production, ASIC is less appropriate platform than FPGA due to the fact that FPGA does not need time to market. FPGAs have become a widely used tool for rapid prototyping, providing software like flexibility and hardware-like performance [31].
To exploit Instruction Level Parallelism (ILP), a Very Long Instruction Word (VLIW) processor can be utilized to increase the performance beyond a single issue-width processor. Single issue-width processors can only take advantage of temporal parallelism (by utilizing pipelining) while VLIW architectures can additionally take advantage of the spatial parallelism by utilizing multiple Functional Units (FUs) to execute several operations simultaneously [12,13]. In addition, VLIW processors are simpler in design when compared to their more complex Reduced Instruction Set Computer (RISC) counterparts. In many cases, the utilization of FPGAs implies a large reduction in development costs or an enormous speedup of the implemented algorithm. VLIW processors exploit ILP by means of a compiler that is completely aware of the target processor architecture [1]. When a VLIW processor is implemented in an Application Specific Integrated Circuit (ASIC), its issue-width is fixed at design time. Therefore, the issue-width of the processor cannot be adjusted after fabrication to suit a different set of applications. A soft-core VLIW processor can be implemented in an FPGA, and its organization can be changed easily by loading a new bit-stream [33]. In literature, all of the available soft-core VLIW processors only provide some form of design-time and not run-time reconfigurability. In order to change the issue-width or any parameter of a processor, a new full bit-stream is to be downloaded to configure the whole FPGA and all other circuits in the FPGA have to be stopped. ρ-VEX is a parameterized and extensible soft-core VLIW design-time reconfigurable processor in an FPGA [1,2,32,33]. In this processor, parameters such as type and the number of functional units (FUs), supported instructions, memory bandwidth and register file size can vary based on the variation of the applications and available FPGA resources. It supports run-time partial dynamic reconfiguration. After the processor is implemented in an FPGA, it can adjust its issue-width while other circuits in the FPGA keep running. Only a small partial bit-stream is loaded to adjust the issue-slots. It has a freely available tool-chain and a well-defined development framework. This VLIW processor is able to use the inherit Instruction Level Parallelism (ILP) of applications. Its architecture is based on the VEX Instruction Set Architecture (ISA). A software development tool-chain for VEX is made freely available by Hewlett-Packard. As the bandwidth and traffic is increasing and varying a lot in networks, reconfigurable processors as a powerful, parametrizable and flexible platform is needed. ρ-VEX has all the requirements of processing network applications. One of the main tasks in network devices is packet classification. Packet classification is the process of classifying different packets into different flows (classes) based on applying rules on their headers. Despite the progress made in recent years on solutions to the packet classification problem, existing routers need to be more powerful and faster due to mentioned increases in networks traffic and bandwidth. Packet classification enables routers to support various value-added services, such as blocking traffic from insecure sites, giving preferential treatment to premium traffic, routing based on traffic type and source, firewall packet filtering, policy routing, Virtual Private Network (VPN) implementations, traffic billing, and Quality of Service (QoS) applications [18,24,34]. Routers classify arriving packets by comparing them to a set of predefined rules and finding the highest priority rule or the rule that best matches the packet header fields (the best matching rule, or BMR). A rule consists of a set of fields made up of the IP source prefix, the IP destination prefix, the source port range, the destination port range, and the protocol type and flags. The difficulty of packet classification is in performing multiple field lookups at wire speed for every incoming packet, given that the packet arrival rate can be several million packets per second. The speed of a packet classification algorithm is measured by the times of memory access. Other performance metrics for a packet classification algorithm include memory storage requirements, update speed, and scalability to both the number of the rules and the number of fields included in the rules, among possible others. Many packet classification algorithms, such as tuple space pruning, cross-producing and coarse-grained tuple space perform a separate lookup on each field to narrow the search space. Hence, these algorithms cause off-chip memory accesses for both the individual field lookups and the final combined lookup. With the mention increases and dynamicity in traffic and bandwidth faster, more efficient and totally scalable algorithms is needed. For the purpose of scalability and economical reasonability, we gave more attention to methods that have taken advantage of Bloom filter. In [16,23] tuple pruning using Bloom filters for packet classification was proposed to replace each field lookup with an on-chip Bloom filter and add a new tuple Bloom filter to further reduce unnecessary off-chip memory accesses. This algorithm has very small memory requirements. It scales well from two-dimensional classifiers to high-dimensional ones. In particular, this algorithm is inherently parallel. It is easy to exploit the parallel mechanism in the hardware. Therefore, in this paper, an option of ρ-VEX processor to create a customized instruction is used. Due to the application of CRC in various network processing tasks e.g. packet classification, CRC calculation and error detection a one-clock CRC (OC-CRC) instruction is proposed and designed. One-clock CRC is a combinational circuit that calculates the CRC at one-clock and is added as a custom instruction into ρ-VEX instruction set. Subsequently, one of the packet classification algorithm called tuple pruning packet classification is mapped on the ρ-VEX with and without customized instruction and the achieved results is compared to the GPP processors. As results show proposed approach provides on average approximately 8× speedup over General Purpose (GPP) processors. The main contribution of this paper is in the following:
Proposal of the customized one-clock CRC instruction in ρ-VEX.
Mapping of the tuple pruning packet classification on the ρ-VEX and customized one.
Achieving the higher performance using ρ-VEX with customized instruction in comparison to the ρ-VEX without customization and GPP.
The organization of the paper is as follows. In Section 2, the related research is presented. In Section 3, a background on network processing platforms, reconfigurable-based general purpose computing and Bloom filter is described. In Section 4, our one-clock CRC approach is proposed. The implementation results in presented in Section 5. Finally, Section 6 concludes the paper.
Related work
In this section, previous researches related to the work presented in this paper is described. In general, there have been two major solutions for packet classification: architectural, and algorithmic solutions. A few pioneering groups of researcher offered a collection of algorithmic solutions [4,10,17,21,27,29]. There are some limitation to meet a good performance in high speed link in algorithmic solutions, therefore, architectural solution to the problem were proposed.
The rapid growth of the network link rate, as well as the rule-set size, made multi-field packet classification one of the fundamental challenges to designing high speed routers. For example, the current link rate has been pushed beyond the OC-1920 rate, which requires processing a packet every 3 ns in the worst case (where the packets are of minimum size i.e. 40 bytes). In other words such throughput is impossible to achieve using existing software-based solutions [17]. OC-1920 supports rates of 100 gigabits per second (Gbps) on a fiber optic carrier. New dense wavelength division multiplexing (DWDM) systems are now in development to run at 10 trillion bits per second (10 Tbps) per fiber. This rapid growth has led to a great demand for the Internet to provide higher bandwidth, lower latency and better security. In order to achieve these goals, network devices need to identify the data or packets being transmitted at high speed. It is a challenge for packet classification to meet such extremely high bandwidth. And as the numbers of new applications and protocols arise, the rule-sets are becoming increasingly large and complex. Large-scale enterprise networks need fine-grained flow control over thousands of physical and virtual machines, while new network protocols such as OpenFlow [25] employ complex rule specifications containing more than 15 header fields. The predominant industrial solutions today are based on multi-core or Ternary Content Addressable Memory (TCAM) platforms. Multi-core based solutions are cheap but they are limited in performance due to memory I/O bottlenecks [34]. Software classification solutions commonly used on multi-core network processors for high performance packet classification with flexibility and programmability, but it inherently lacks high parallelism and abundant on-chip memory. As a result, even the best software classification algorithms on multi-core network processors can only achieve 10 Gbps [24,35]. This is much less than the widely deployed 100 Gbps links used in large ISP (Internet Service Provider) backbones and DC (Data Center) networks [9]. TCAM-based solutions are widely used because of their ability to process packets at high speed. However, TCAMs are expensive, and are not scalable with respect to clock rate, power consumption, or circuit area, compared to Static Random Access Memories (SRAMs). TCAMs only supports prefix matching which means a rule represented in range must be converted into prefixes. Such range-to-prefix conversion may exponentially increase the number of rules. Therefore, supporting large and complex rule-sets with them is difficult [22]. Some other proposed algorithms have used high bandwidth and a small on-chip memory, while locating the rule database in a slower and higher capacity off-chip memory. Many other multi-field packet classification algorithms perform a separate lookup on each field to narrow the search space. All of these algorithms cause off-chip memory accesses for their lookup. Fast search and low implementation complexity is provided by Tuple Space Pruning (TSP) multi-field packet classification [23] and reduced the search space to a subset of tuples determined by individual field lookups that each lookup will cause at least 4 to 5 off-chip memory accesses. To meet the throughput requirement, other researches in this area seek to combine algorithmic and architectural approaches, like various hashing schemes such as Bloom filters [11,26].
As mentioned a Bloom filter composed of two operations: programming and querying. Bloom Filters have become popular due to their
Beside that, ρ-VEX [1,2,32,33] recently proposed to map the general purpose applications on the FPGA using a multi-issues VLIW processor. To the best of our knowledge, it is for the first time that a network processing task e.g. packet classification is mapped on a VEX like architectures. In addition, a custom instruction that called one-clock CRC is designed and implemented on the ρ-VEX to accelerate the network processing task. The one-clock CRC is an implementation of CRC that calculates the CRC in one clock. The CRC is selected because of it popularity and importance role in the networking tasks e.g. packet CRC check and hashing functions of Bloom filters.
Background
In this section, the network processing and related platforms, reconfigurable-based general purpose computing including VLIW-based ρ-VEX, and basic principles of standard Bloom filter is presented.
Network processing and related platforms
Network processing is the acted operations on packets by network devices such as switches and routers. Packets have one or more headers that contain information such as source/destination addresses, source/destination port number, protocol type, checksum, expiration timer, etc. TCP/IP is a protocol stack, which is in use in different networks. Every layer of TCP/IP adds its own header to the packet. Network processing is classified into two categories, Header processing and Payload processing.
The most common header processing task is packet classification. It enables routers to provide differential services such as routing based on the traffic type and source, blocking packets from a predefined source and allowing some others. Routers classify arriving packets by a comparison against a set of rules (filters), and find the best matching rule (BMR) or the highest priority one. A rule with d fields is a d-dimensional rule. 5-dimensional rules consist of source/destination IP prefix, source/destination port range and the protocol type. The difficulty of packet classification is in classifying packets at wire speed for all the incoming packets (this arrival rate can be several millions packets per second) [11,23,28].
A platform is a set of hardware and software modules that can be used as a building framework for the development and the execution of specific set of applications. Different platforms can be chosen for network processing, but each of them has some advantages and disadvantages. In the following, different platforms are considered and a comparison of them is presented.
General-Purpose Processors (GPP): processors such as Intel P3, IBM PowerPC 750× and strong ARM SA-110 are designed to cover a broad range of applications. The early day network devices were mainly implemented using GPP. The workloads of early day networks were not so heavy and GPPs were quite efficient. The main positive factor of GPP is its flexibility.
Application-Specific Integrated Circuits (ASIC): offer high performance but low flexibility due to the facts that they are circuits, which are customized for special tasks or applications.
Network Processors (NP): are System on Chip (SoC) platforms that consist of a number of Application-Specific Instruction Processors (ASIPs) optimized for network processing applications. In addition, NPs incorporate network and memory interfaces and some specialized co-processors (e.g., encryption accelerators).
Field-Programmable Gate Arrays (FPGA): their reprogramming ability (at hardware level) makes them quite flexible. Because of this feature they support more than one type of task and beside this, the hardware parallelism can be exploited in them to accelerate the processing. They are widely used for rapid system prototyping since their development time is shorter than ASICs.
Today’s FPGAs are powerful platforms that consists not only reconfigurable logic but also hard/soft core processors, embedded memory, network and memory interfaces and specialized modules such as small DSP units. Furthermore, some FPGA devices can also be partially reconfigured during execution time or even at run time [1,2,32]. Therefore, FPGAs are qualified candidates for network processing platforms, mainly because of the combination of performance and flexibility that they provide.
Packet classification
Traditionally, routers forward packets based on the destination address in the packet. The support of many different services such as Quality of Service (QoS), Virtual Private Network (VPN), policy-based routing, traffic shaping, firewalls, and network security, increases the importance of packet classification [11,17,34]. In order to provide these services, the router must categorize the incoming packets according to different criteria. These criteria are determined based on one or more fields in the packet header. Packet header fields include destination and source IP addresses, the protocol type, and the destination and source port numbers.
Packet classification can be seen as the categorization of incoming packets based on their headers according to specific criteria that examine specific fields within a packet header. The criteria are comprised of a set of rules that specify the content of specific packet header fields to result in a match. A packet classifier can be implemented in either software or hardware. An example of a real classifier in four dimension is presented in Table 1. In this table, the third column ‘eq’ and ‘gt’ keywords are operations that mean equal and greater than.
Sample classifier rules
Sample classifier rules
∗ means wildcard, that shows any port value or protocol can be used.
In this classifier, the first rule R1 has the highest priority and rule R4 has the lowest priority. An examples of packet classification is presented in Table 2.
Example of packet classification
To achieve higher performances within the application-specific domain, [31] and [33] they combined several technologically proven paradigms to provide a new architectural solution for a general-purpose computing machine. Field-Programmable Gate Arrays (FPGAs) are constantly improving and provide a technology platform that allows fast and complex reconfigurable designs. In many cases, the utilization of FPGAs implies a large reduction in development costs, an enormous speedup of the implemented algorithm, or both. Nowadays, a broad spectrum of reconfigurable architectures is used for applications that would have been implemented in Application-Specific Integrated Circuit (ASIC) technologies or as software for a general-purpose processor. The MOLEN polymorphic processor [31] provides the possibility of executing an application-specific core in a custom generated hardware unit, which resides inside a reconfigurable fabric. The general-purpose processor within a MOLEN machine takes care of general-purpose calculations, concurrently with application-specific calculations by the custom unit. Very Long Instruction Word (VLIW) processors are efficient machines for calculations that contain a lot of Instruction Level parallelism (ILP) that can be exposed by a good compiler. Applications in the networking domain happen to contain a lot of ILP, because they typically consist of many independent repetitive calculations. One VLIW co-processor is able to perform a large number of different calculations within a fixed area footprint, whilst a custom hardware unit is probably only able to perform one type of calculation on a fixed area footprint. Most processor Instruction Set Architectures (ISAs) defines many atomic operations. However, in many applications a custom operation would result in an increase of the performance (and a decrease in power dissipation). So processors which support reconfigurable operations are needed. ρ-VEX is an embedded reconfigurable and extensible open source VLIW processor, accompanied by a development framework which is called ρ-VEX [1,3,6,33]. This processor architecture is based on the VLIW Example (VEX) ISA, as introduced in [33]. The VEX ISA offers a scalable technology platform for embedded VLIW processors that allows variation in many aspects including instruction issue-width, organization of Functional Units (FUs), and instruction set. A software development compiler toolchain for VEX is made publicly available by Hewlett-Packard. The reasons to choose the VEX ISA for this work was announced its extensibility and the quality of the available compiler. The design provides mechanisms that allow parametric extensibility of the new processor. Both reconfigurable operations, as well as the versatility of VEX machine models are supported by ρ-VEX.
VLIW-based VEX specifications
The VEX (VLIW Example) ISA [13] is loosely modelled on the ISA of the HP/ST Lx [12] family of VLIW embedded cores. The VEX ISA supports multi-cluster machines, where each cluster provides a separate VEX implementation. Each cluster has the ability to issue multiple operations in the same instruction (that is, each cluster acts as a separate VLIW core). A VEX multi-cluster processor shares one instruction fetch unit and one memory controller. The extensibility of the instruction set enables the definitions of special-purpose instructions in an organized way. VEX does not support floating point operations. Each cluster has support for multi-issue widths. The extensibility of the instruction set enables the definitions of special purpose instructions in an organized way. By default, a VEX cluster has 4 ALU units, 2 multiplier (MUL) units, 1 branch control (CTRL) unit, 1 memory access (MEM) unit, 64 32-bit general-purpose registers (GR) and 8 1-bit branch registers (BR) per cluster. Also, an instruction- and data-memory cache of 32 kB is present. A VEX instruction consists of one or more syllables, depending on the issue-width. A syllable can be seen as a single “RISC-style” instruction. A publicly available VEX software toolchain is provided by Hewlett-Packard Laboratories, which offers a VEX C compiler and a VEX simulator. The compiler allows the user/designer to easily adjust parameters of the VEX processor (like the number of clusters and the issue-width). The VEX simulator offers an architecture-level simulator. The simulator is able to output many statistical run-time data of simulated applications. Both tools are parametric by means of machine models. The VEX C compiler is a derivation of the Lx/ST200 C compiler, itself a descendant of the Multiflow C compiler. It uses trace scheduling as its main scheduling method. Trace scheduling implies that operations will be restructured in order for large “traces” to appear without branches. The VEX simulator is a compiled simulator which translates the target executable binary code to a binary executable that can run on the host system. Figure 1 depicts the organization of a 4-issue ρ-VEX processor.

ρ-VEX organization (4-issues) [33].
The actual operations take place in either the execute unit, or in one of the parallel CTRL or MEM units. ALU and MUL operations are performed in the execute stage. This stage is implemented parametric, so that the number of ALU and MUL functional units could be adapted. All jump and branch operations are handled by the CTRL unit, and all data memory load and store operations are handled by the MEM unit. All write activities are performed in the writeback unit, to ensure that all targets (GR, BR, Program Counter (PC) or external memory) are written back at the same time. To determine the write targets per syllable, a target signal is assigned in the decode unit for each syllable. The different write targets could be the GR register file, BR register file, data memory or PC.
The standard set of VEX operations consists of 73 operations. Opcodes for the inter-cluster operations described by the VEX ISA are reserved in ρ-VEX. This default set of operations has been complemented with two extra operations: STOP and LONG IMM. The former operation tells ρ-VEX when to stop fetching instructions from the instruction memory. The latter is used when long immediate operands are handled. An instruction consists of four syllables by default. The VEX standard defines the use of three types of immediate operands: short immediate operands, branch offset immediate operands and long immediate operands. Every syllable has an immediate switch field consisting of 2 bits that describe the type of immediate operand within the syllable. ρ-VEX syllables include two bits with syllable meta-data, the L and F bits. The L bit denotes whether a syllable is the last syllable in an instruction and the F bit denotes whether it is the first syllable in an instruction. All logical and select ALU operations can have a GR register or a BR register as their destination operand. When the GR destination address equals $r0, the BR destination address is used to store the result of the operation. Four ALU operations operate on three source operands: two GR register operands, and one BR register operand. Because some of these operations are also able to operate on a GR register or BR register as destination, a new location for the BR source register address should be assigned. Special opcodes for these operations were assigned in ρ-VEX, so that the 4 most significant bits are unique. The least significant bits of the opcode field are used in this case to pack the BR source address.
As mentioned custom instruction is provided in ρ-VEX. ρ-OPS provides a mechanism to execute custom instructions via pragmas inside the application code in an ρ-VEX processor. One of the 24 available ρ-OPS opcodes should be chosen, and a template VHDL function should be extended with the custom functionality. ρ-OPS are not restricted combinatorial operators; sequential ρ-OPS consisting of multiple atomic operations are also allowed, as long as the design gets properly synthesized and routed. Currently, the following properties of ρ-VEX are parametric:
Syllable issue-width.
Number of ALU and MUL units.
Number of GR and BR registers.
A standard Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries [5]. Bloom filters may yield a small rate of false positives in membership queries; that is, an element might be incorrectly recognized as member of the set. Although Bloom filters allow false positives, for many applications the space savings and locating time constantly outweigh this drawback when the probability of false positive can be made sufficiently small.
The idea of SBF is to allocate vector A of m bits, initially all set to zero, for representing a set
A particular position in the vector might be set to one multiple times, but only the first time has an effect. In the querying phase, to query for an element y, the bits at positions
The percentage of false positive of a Bloom filter can be minimized by tuning the three parameters: (i) Number of elements (n) added to generate the Bloom filter. In most cases, this parameter is defined by the application and, thus, cannot be controlled. (ii) Number of bits used in a Bloom filter (m). m can be used in order to minimize false positives but obviously the larger the value of m the less compact representation. (iii) Number of hashing functions (k) used to create the Bloom filter. The larger k the higher processing overhead (CPU usage) especially if hashing functions perform complex operations.
There is a trade-off between the probability of false positive and the length m of the Bloom filter array [5,14]. It has been proven that the probability of false positive (
Now it is clear that the optimal number of hashing functions k, which minimize
In the last decade, a number of extensions of the SBF have been developed by researchers to address its limitations. Some of these variants include counting Bloom filter, compressed Bloom filter, dynamic Bloom filter, spectral Bloom filter, scalable Bloom filter, attenuated Bloom filter and space-code Bloom filter [7,14].
Proposed approach
Because of the useful characteristics of Bloom filter and the importance role of multi-field packet classification in network devices, and also due to the dynamicity and constant growing traffic of networks, in this paper, to implement a Bloom filter multi-field packet classification algorithm in ρ-VEX which is an extensible, open-source and soft-core in an FPGA processor is chosen. In the following, our implementation of multi-field packet classification and the other related topics is presented.
Packet classifier engine
A modified 2D version of the proposed algorithm in [23] has been chosen for our packet classification engine because of its memory efficiency, extensibility and Bloom filter usage. This algorithm can be extended for more or less fields easily. Figure 2 shows the overall architecture of this algorithm which replaces individual field lookups with Bloom filters (a source Bloom filter (src-BF), a destination Bloom filter (dst-BF) and also a tuple Bloom filter (tuple-BF)).

Architecture of packet classifier using tuple space pruning and Bloom filters.
The Bloom filter’s size and the number of hash functions affect its performance, which is determined by the false positive rate. Any arbitrary hash generator can serve as a hash function. Our required Bloom filter must accommodate prefixes of arbitrary lengths in a single Bloom filter. So a hash generator that produces hash indices for variable-length inputs is required. A cyclic redundancy check (CRC) generator suits our goal. Two of the most common CRC-16 generators (CRC-16-IBM and CRC-16-CCITT) is chosen. This is because of the utilization of CRC in the error detection in the most layer of protocol stacks, the lower required clock cycles in compared to other hashing functions, lower average search length when it is used in hash table [15] and simplicity in hardware implementation. It should be noted the Bloom filter needs k independent hashing functions, but It has proven that using just two random hashing functions and generating a linear combination of their outputs works similar to the case of using k independent hashing functions [20]. Therefore, only two hashing functions
CRC computation is the most important action of our packet classification application based on its high repetition. Serial CRCs because of their serial nature need lots of clocks in each CRC computation. So the implementation of serial CRC in our application will make CRC computation the bottleneck of it. To cope with this situation two solutions is provided.
Parallel CRC. Our proposed one-clock CRC.
The parallel CRC [8] is a form of CRC that uses a recursive formula to generate the CRC codes. The number of bits processed in parallel can be different from the degree of the polynomial generator.
If it is assume that the degree of polynomial generator (m) and the length of the message to be processed (k) are both multiples of the number of bits to be processed in parallel (w), then the required clock periods to compute the parallel CRC is
Logic utilization of parallel CRC-16 and parallel CRC-32 in ρ-VEX
Logic utilization of parallel CRC-16 and parallel CRC-32 in ρ-VEX
In addition, one parallel CRC-16 as a customized instruction on ρ-VEX needs approximately 20% of LUTs which means we cannot have two parallel CRC-16 on ρ-VEX as customized operations. Furthermore, from the Table 3, it can be observed that the customization of CRC-32 parallel on ρ-VEX is impossible due to lack of resources. Therefore, in order to gain the desired suitable CRC with high scalability, a new CRC which only consume 2% of FPGA LUTs is proposed. Our one-clock CRC (OC-CRC) is only consists of combinational circuits and is scalable enough.

OC-CRC circuit.
The circuit of our one-clock CRC is depicted in Fig. 3. Our OC-CRC as a combinational circuit with d stages, where d is the number of bits in input data is designed. The process of CRC code computation in OC-CRC is as follows. The input data will be padded with 16 zeros. Also the CRC polynomial will be padded with
As it can be seen in Fig. 3 there is no register in this design to provide input for the other levels so it can be done in one clock. Its only limitation is the period of the clock which is totally to the maximum delay of the critical path in the circuit. Considering the advancements in FPGA technology, this delay will be less than what we computed in Virtex II pro. The maximum delay of our one-clock CRC is 31.7 ns so maximum clock which can be acquired in our FPGA platform is 31.5 MHz. All the other delays of customized ρ-VEX processor implementation will be covered by this delay. As a result of this the maximum clock of our customized ρ-VEX with two one-clock CRCs is 31.5 MHz. This implementation needs less than 80% of FPGA LUTs and will provide multi Gbps throughput of packet classifying. It is notable to say that by using newer versions of FPGAs not only we could pipeline our engine but also we can run it in less period time of clock. Both of these advancements mean more throughput.
Process of mapping ρ-VEX extensibility is provided by two mechanisms, namely ρ-OPS and VEX machine models. ρ-OPS are custom reconfigurable operations that can be easily added to the execution model of ρ-VEX. VEX machine models define among others the configuration of FUs, register files, and the issue-width. The VEX software toolchain supports the use of custom instructions via pragmas inside the application code, with ρ-OPS. So for mapping our packet classification application, first the application kernel described in C code is compiled. Then the generated assembly code is converted into an executable program for ρ-VEX. An assembler/instruction ROM generator ρ-ASM translates VEX assembly applications into an executable application within the ρ-VEX instruction memory. ISE software to synthesize ρ-VEX and other generated VHDL codes is used.
Experimental results
In this section, the packet classification benchmark, the implementation results of packet classifier with and without customization on ρ-VEX and GPP is presented.
Packet classification benchmark
We performed simulations for rule-sets created by Classbench [30] which is widely used in evaluating the performance of multi-field packet classification algorithms. We just used source and destination prefix fields, the remaining fields, including the port ranges are stored in an off-chip hash table, and compared all fields with a given input when the hash entry was accessed. We used three types of 5D rule-sets access control list (ACL), IP chain (IPC), and firewall (FW), with five sets for each type. The number of rules and input traces are presented in Table 4.
Rule-set database and packet trace specification
Rule-set database and packet trace specification
The obtained results of simulating and implementing our application on ρ-VEX processor and a GPP (Pentium 4 with clock frequency of 3.06 GHz) are presented. A comparison between the results of FPGA based implementations and the mentioned GPP that has similar characteristics to ρ-VEX. These simulations are based on Modelsim simulator, and ISE tool. We simulated ρ-VEX processor and our application using Modelsim and synthesize it by ISE. Further the consumed amount of resources after mapping our packet classification application on ρ-VEX and the number of required cycles of running our application in both platforms has been discussed. To compile a program with the provided VEX compiler of HP in ρ-VEX package, C codes are in demand. So we wrote the C code of our application. For a desired rule-set this application will first program the Bloom filters and the hash table, then for appropriate packet traces it will search through the rules and defines the BMR. To evaluate the performance of this application we first ran it on ρ-VEX processor. We compiled C code of our application with the VEX compiler. Then, the generated assembly code assembled by ρ-ASM and generated a VHDL instruction ROM file. It is notable to say that the current version of ρ-ASM and VEX compiler are executable in Linux operating system. The ROM file is then synthesized with the rest of the processor VHDL design files by using ISE 8.1 because of its compatibility with our model of FPGA. The model of our FPGA is Virtex II-Pro Xc2vp30 of Xilinx. We adapted the 4-issue version of ρ-VEX to customize OC-CRC on it.
Investigation of ρ-VEX packet classifier performance
Figure 4 depicts the execution time and speedup of the our packet classification application for different rule-sets on the GPP and ρ-VEX without customization.

Performance of packet classifier on ρ-VEX. (a) Execution time of packet classifier for different rule-sets on GPP and non-customized ρ-VEX. (b) Speedup of ρ-VEX to GPP for packet classifier with different rule-sets.
It can be observed that in Fig. 4(a) and (b), an improvement in processing time and speedup between 4 and 12 times approximately for different rule-sets. This improvement is achieved in processing time and speedup for following reasons.
The result of concurrent processing of multiple issues in VLIW processors and the similarity of VLIW architecture to RICS (when it is 1-issue width in worst case) which is much better than CISC in term of clocks and execution time of instructions. Utilization of FPGA technology because of mapping the application on hardware that can accelerate the execution time in comparison to the GPP platforms.
In Fig. 4(b), the fluctuation of speedup in different rule-sets is the result of differences in distribution of bits among the different rule-sets. Furthermore, as it is shown in Fig. 4(a) in both platforms the execution time raises as the number of rules become more and more. This is expected due to the increase of operations. In addition, as Fig. 4(b) depicts the decreasing in the speedup when the number of rules is growth. As an instance, the achieved speedup for FW1-1k rule-set is 12 times while this speedup for FW1-10k rule-set is 5 times.
Figure 5 depicts the speedup of packet classifier without and with customization on ρ-VEX using OC-CRC.

Speedup of packet classifier on ρ-VEX without and with customization using OC-CRC.
As it can be observed in Fig. 5, the speedup increases at most by 4× and at least by 2×. It should also take into consideration that as rule-sets (FW1-10k, ACL1-10k, and IPC1-10k) become larger and more complex this improvement decreases. This decrement is the result of more collisions in hash indices and more sequential searches. This speedup can be changed for other applications depend on the utilization of OC-CRC as custom instruction. In the other hands, in packet classification that used Bloom filters with the CRC as its hashing functions, different hashing functions must be calculated using OC-CRC that improve the execution time and speedup. Our results show that using the OC-CRC as custom instruction in the mentioned packet classifier engine achieves the average processing time 5.6 ns for one packet per each rule and the average obtained throughput 33.9 Gbps.
The resource utilization of our OC-CRC is depicted in Table 5. We adapted the 4-issue code of ρ-VEX to use the ρ-OPS. All configurations were synthesized to run at the clock speed of 89.44 MHz.
Logic utilization of OC-CRC
Logic utilization of OC-CRC
This low utilization provides lots of benefits for our engine such as extensibility and higher throughput.
The resulting numbers of executed clock cycles without custom instruction and with adding our two OC-CRC codes as custom instructions are presented in Table 6. It should be noted that in Table 5, the utilized resource of ρ-VEX with customized CRC (OC-CRC) shows the consumed resource by a ρ-VEX and two OC-CRC. From Tables 3 and 5, it can be observed that the resource utilization by OC-CRC is extremely low. This means that it is possible to add more OC-CRC if required.
ρ-VEX without and with customized CRC
As figures and tables in the previous sections depicts, the customization of ρ-VEX as a reconfigurable and VLIW processor provides proper speedup. This improvement then caused better performance of our engine on ρ-VEX in comparison to Pentium 4 as a GPP platform. The provided throughput of our engine and its amount of resources utilization makes it an appropriate platform for network processing. With rapid growth of network traffic and number of applications which are related to networks this throughput makes our engine one of the best existing platforms in this field. Also we should mention that the memory utilization of our engine is extremely low due to existence of Bloom filter in its architecture. As we mentioned CRC computation is the most important action of our application based on its high repetition. The implementation of serial CRC in our application will make CRC computation a bottleneck. To cope with this situation we have provided two solutions: Parallel CRC and our proposed one-clock CRC. The presented results shows that customization of parallel CRC-32 on ρ-VEX as a ρ-OPS is impossible due to its more than enough slice consumption. Also one parallel CRC-16 as a customized instruction on ρ-VEX needs approximately 20% of LUTs which means we cannot have two parallel CRC-16 on ρ-VEX as customized operations. Therefore, in order to gain the desired suitable CRC with high scalability we customized two of our OC-CRC which only consumes 4% of FPGA LUTs. Our one-clock CRC (OC-CRC) is only consisting of combinational circuits and it is scalable.
Generally speaking we can summarize the advantages and disadvantages of our engine as below. The advantages are: low time to market, performance improvement in comparison to the GPP platforms e.g. Pentium 4 processor, low memory requirement, low resource utilization and scalability. The disadvantages are: reduced clock frequency and high delay.
Conclusions
In this paper, ρ-VEX as an embedded processor was customized for one of the most important network applications, packet classification. The results have been depicted the huge performance gap between GPP processors and ρ-VEX. The customized ρ-VEX provided on average approximately 8× speedup. This improvement makes ρ-VEX as a qualified processor in network applications due to the rapid growth in network bandwidth and applications. Although this performance increase is at the cost of long delay and clock period increment, they are not at all major problems considering the rapid advancement in FPGA technology. In addition, it should be noted that the OC-CRC customized instruction in ρ-VEX can be used and extended not only in the Bloom filter based packet classification but also in all applications that utilize the Bloom filters and CRC codes. Furthermore, the customization of ρ-VEX can be extended for other applications that need to customize an instruction for achieving higher performance.
