A new architecture for online error detection and isolation in network on chip

Abstract

The Network-on-Chip (NoC) has become a promising communication infrastructure for Multiprocessors-System-on-Chip (MPSoC). Reliability is a main concern in NoC and performance is degraded when NoC is susceptible to faults. A fault can be determined as a cause of deviation from the desired operation of the system (error). To deal with these reliability challenges, this work propose OFDIM (Online Fault Detection and Isolation Mechanism),a novel combined methodology to tolerate multiple permanent and transient faults. The new router architecture uses two modules to assure highly reliable and low-cost fault-tolerant strategy. In contrast to existing works, our architecture presents less area, more fault tolerance, and high reliability. The reliability comparison using Silicon Protection Factor (SPF), shows 22-time improvement and that additional circuitry incurs an area overhead of 27%, which is better than state-of-the-art reliable router architectures. Also, the results show that the throughput decreases only by 5.19% and minor increase in average latency 2.40% while providing high reliability.

Keywords

Network on chip online test fault tolerance error reliability

1. Introduction

The Network-on-Chip (NoC) has become a promising communication infrastructure for Multiprocessors-System-on-Chip (MPSoC). The NoC architectures have been proposed to replace traditional global interconnects. However, a complex system such as NoC based System-on-Chip (SoC) consists of billions of transistors which makes it vulnerable to faults; one single transistor failure in one router or link can even break down the entire system or degrade the system performance.. The performance of a NoC is expressed in terms of bandwidth, latency, power dissipation and reliability. The latter has become one of the most important metrics. Therfore, it is vital to develop reliable and efficient Fault-tolerant NoC designs.

Typically, faults refer to circuit malfunctioning or data errors, which can occur due to transistor aging, crosstalk noise, fluxes of neutron and alpha particles, power noise, energetic particle strikes, signal glitch, and skin effect [23,43]. To increase the yield and maintain performance, fault tolerance and reliability are becoming more important concerns in the future SoC design.

During manufacture and utilization, NoC cannot always be ideal. If the faults happen in the NoCs, router cannot get necessary data, or the received data is not correct. Therefore, it is necessary to ensure the reliability of NoCs through effective fault-tolerant designs. Fault-tolerance is an important issue in the field of NoCs [13,31,35]. Fault-tolerant NoCs should enhance the reliability of NoCs as far as possible. On the other hand, fault-tolerant NoCs should reduce the area overhead, power consumption, and preserve performance.

Fig. 1.

Transient fault, intermittent fault, and permanent fault.

A fault in the circuit may produce a general failure of the system. According to [43], faults can be classified into transient errors, intermittent errors, and permanent errors as shown in Fig. 1.

Permanent faults or hard faults, apparent in the NoC chip for a lifetime after their occurrence. They affect the functionality of the chip on all occasions. They are traditionally caused by several facors like: open/short circuits in links, time-dependent-dielectric-breakdown (TDDB), electro-migration (EM), stress migration, negative-bias-temperature-instability (NBTI), hot carrier injection and thermal cycling. On the other hand, transient faults, also known as soft errors, occur only for one or two clock cycles (see Fig. 1). Also, they are traditionally caused by internal or external factors such as:electrical noise, electrostatic discharge, electrical power drops, overheating, mechanical shocks, process variation, and external radiations like alpha particles and cosmic rays.

The majority of failures (80%) are produced by transient faults, while the rest of them originates mainly from permanent and intermittent faults [50]. Faults in the router architecture conduct to misrouting, deadlock, traffic hot spots, packet loss, and increased latency. Thus, reliable router architecture is necessary to avoid these undesirable fault scenarios [50]. Making the chip capable of autonomously detecting and tolerating its faults is a problem of extreme difficulty,

Fault diagnosis methods can detect and determine the reason and location of faults in NoC. The design of fault diagnosis methods involves methods and test strategies. The test methods operate and capture the faults by test vectors. Only a few works have been published on test strategies. Many works proposed test methods for NoCs especially or proposed methodologies to design test vectors for integrated circuits. Test strategies goal is reducing the impact on system performance from the test by scheduling and managing test procedures. Fault diagnosis methods can be classified into active detection methods if test vectors are injected into circuits under test, otherwise, they are called passive detection methods.

Active detection methods inject well-designed test vectors into the circuit under test to operate and capture faults. Built-in self-test (BIST) is one of the typical active detection methods. Due to the high fault coverage and fine-gain fault diagnosis, BIST is more suitable to detect and diagnose intermittent and permanent faults. Most proposed BIST methods for NoCs do not specify the test strategy [28]. The normal operation of NoCs has to be stopped definitely, which leads to a significant impact on the system performance.

Two typical passive detection methods are reported in the litterature: the error detection code (EDC) and the checker. The first method is used in the data path. Differently, the second method is used in the control paths. The parity check code (PCC), Hamming code, and cyclic redundancy check (CRC) are the most commonly used coding methods in NoCs [14].

Therefore, the test strategies purpose should be minimizing the impacts introduced by the test and increasing the test frequency as well. In reference [29] the authors reduced the impact on system performance by combining the off-line test and online test.

Both passive and active fault detection methods can detect the faults in one router and the entire network. Also, both passive and active fault detection methods can cover the faults in the data path and the control path. But, it costs much longer test procedure and introduces significant area overhead and power consumption. This method must isolate the circuit under test from the network which damages the integrity of the system resources and decreases the performance of the network [19].

So, in this work we proposed a reliable NoC router architecture called “OFDIM” which can detect and tolerate multiple permanent, transient faults and ensure continuous network operation. In contrast to existing reliable NoC architectures, the proposed solution can tolerate faults in all components of the router and maintain a gracefully degraded performance under heavy network traffic.

The main contributions of this paper can be summarized as follows:

Firstly, we propose a reliable architecture router that provide fault tolerance in all components of router (Buffer, Cross bar, etc) and a novelty that includes transient and permanent faults compared to existing works which treat one type errors only. Secondly, the reliability aware design-time can be performed in order to produce solutions that can be used to cure faults at runtime and reduce the chances of system failure. So the advantages of our solution are: to achieve low latency and reliability analysis using SPF metric. In addition to that, the results show that the proposed architecture achieves higher reliability by incurring smaller overhead as compared to an existing architecture. The proposed solution allows to a NoC to achieve its work even if an error is encountered. Thus, the performance and the reliability of the NoC can be improved.

The rest of this paper is organized as follows. In Section 2, related work is presented. In Section 3, the proposed router architecture is discussed. In Section 4, the components of the OFDIM are introduced. The analytical and experimental results are reported in Section 5 while the summary and conclusion are given in the last section.

2. Related work

One of the biggest concerns raised by the VLSI community regarding recent and future Networks-on-Chip (NoCs) designs is the continuous decrease in feature size and its impact on reliability. Also, most recent proposals in the area of NoCs favor the use of adaptive routing for fault-tolerance, requiring more sophisticated routing decisions [7]. It is therefore mandatory to provide some level of protection against errors. The problem of error detection is thoroughly studied in the context of NoC, in this section we present some works.

In reference [39], the authors propose improving the reliability by adding minimum extra circuitry and exploiting temporal parallelism. They protect all pipeline stages and are capable of coping with two faults at the RC, SA, and XB stages, and four faults at the VA stage. However, with random faults this design can tolerate up to 27 faults, in the best case. Their design has an area overhead of 31%, which they claim to lead to a SPF of 11 – they use the mean of these two numbers, $(2 + 28) / 2 = 15$ , which is arbitrary and probably unfair.

In reference [46], the authors have proposed a scheme based on a dynamic resource sharing approach that can tolerate soft errors in multiplexers, demultiplexers, and VCs of the input port unit. Poluri et al. [40] proposed the solution of tackling both transient and permanent faults in the pipeline of the router. The fault-tolerant techniques employed in the pipeline stages of the router are spatial redundancy, exploitation of idle cycles, bypassing faulty resources, and selective hardening. This solution results in better performance but results in larger area overhead. As the area increases, the fault probability. of the circuit also increases. If two faults occur in the RC stage, then the router fails its operation.

The work in reference [11] propose a comprehensive online and real-time fault detection and localization mechanism. This solution focuses on the detection and localization of faults in any components of the NoC’s control logic. Based on the notion of invariance checking. An illegal output (invariance) is specified here as an operational decision that violates the structured correctness rule(s) of a particular component. The checkers operate concurrently with normal NoC operation, thus eliminating the need for periodic, or triggered-based, self-testing. NoCAlert can detect (a) single permanent and transient faults.

The HPR [48] has designed permanent fault protection techniques for an NoC router. Error-correcting codes are utilized for the protection of single-bit faults in the flits. They presented a concept of a double routing technique for the protection of RC faults. The VA in case of faults uses the default winning technique. The SA used the runtime arbiter selection approach for tolerating the faults. The XB design used a bypass path technique for tolerating the faults in this stage. The overall design achieved higher reliability but still, the design was not able to tolerate the multiple faults occurring in the multiplexers and demultiplexers of the input ports. The router fails if permanent faults occur in these components.

In reference [34], authors consider faults in the state field of the VC. It splits the state field into two groups and gives spare registers for every group for fault tolerance. For fault detection in the state fields It uses built-in-self-test (BIST). The status register indicates the VC as faulty on the occurrence of most than one fault in a group. Afterward the VC closing strategy informs the upstream router to avoid sending flits to that downstream VC. It only take care of fault in the input port VC, leaving other router components vulnerable.

In reference [42], author proposes three different fault-tolerant architectures namely single spare, turn priority and stay alive. An additional unit is available to replace faulty units at each level in the single spare architecture. Priority-based selection is used in the case if two units at the same level get a fault. Exploiting regularity of the router, turn priority architecture uses the available resources by assembling as much input and output channels as possible. Different priorities are assigned to input and output channels to ensure router operation even in the worst case. Stay alive architecture combines these two previous architectures to enhance tolerance. These architectures incur high area overhead, i.e., single spare incurs 42%, turn priority incurs 77% and stay alive incurs 115% area overhead.

In reference [51], author uses channel slicing along with on-demand triple modular redundancy (TMR) for fault tolerance. Each node in the network contains three identical router slices that split internal paths. It supports three modes of operation. First is the parallel mode, in which divers router slices shares their control logic with each other in case of a fault in the control logic of a slice. Secondly, the separate mode, in which different router slices shares internal resources like buffers and XB’s in the case of fault in the buffer or XB of a slice. Finally the TMR mode, in which all three slices working in parallel on the same data and control signals. It uses majority voting to select the correct result at the output.

The work presented in [47] is a built-in-self-test (BIST) strategy for the data path and control path of NoC architecture. It uses free time slots of data path components for testing. It uses idle cycles of inter-router links for data path testing. But for control path testing, it isolates the components with the help of test wrappers. In the test mode, to ensure the secure connection of core with the network it connects the local port with the east or west port. To favor this setting, it uses the adaptive routing algorithm. This testing approach needs complex control and routing mechanism. The area overhead is 9.88% and power overhead is 4.63% for this fault detection strategy.

Besides the permanent faults that could occur within the router, the links between them are also a source of faults [9,52], Faulty links could cause wrongly transmitted bits between routers, possibly breaking the connection between them and cannot guarantee anymore service.

In reference [45], the authors propose an online diagnosis mechanism that reads error symptoms to locate faulty links, while ECC is used to correct the data errors. The diagnosis resolution of this method is limited to the links between routers and does not directly consider faults inside the routers.

Under the realistic traffic, the exploitation rate of links is not high so that there are many slots without flits to deliver. In reference [33], the author’s proposed to use these free slots for tests. This method does not require to isolate links away from the network and avoid blocking the normal communication. Consequently, the impacts on the network are reduced.

DiTomaso et al. [18] provided 4 links between routers. The direction of all links can be modified to address different traffic and fault distributions. This work also provided a backup ring path to preserve the communications under massive faults. In another work [1], a test method is presented to try to detect and locate faults in links. The work take on four routers in the test process. This test method needs four test rounds to complete the fault diagnosis, which imposes a large overhead.

Table 1 summarizes some previous works. All these works aim to provide higher reliability with lower hardware consumption, but they neglect the performance degradation during the process of tolerating faults. Consequently, we propose a simple and effective method that allows detection of precise location of the fault in router/link.

Table 1
Some previous works

Reference Works Authors Year Test Object

[39] PFTR Poluri and Louri 2013 Router

[41] – Poluri and Louri 2014 Router

[30] – Killian C et al. 2014 Router

[18] QORE D. DiTomaso et al. 2015 Link

[37] – Pengzhan Yan et al. 2015 Router

[38] – Pengzhan Yan et al. 2015 Router

[10] – X. Chen et al. 2016 Link

[40] SHIELD Poluri and Louri 2016 Router

[46] DRS Valinataj and Shahiri 2016 Router

[51] – Wang, L et al. 2017 Router

[47] EsyTest Wang et al. 2019 Router

Reference	Works	Authors	Year	Test Object
[39]	PFTR	Poluri and Louri	2013	Router
[41]	–	Poluri and Louri	2014	Router
[30]	–	Killian C et al.	2014	Router
[18]	QORE	D. DiTomaso et al.	2015	Link
[37]	–	Pengzhan Yan et al.	2015	Router
[38]	–	Pengzhan Yan et al.	2015	Router
[10]	–	X. Chen et al.	2016	Link
[40]	SHIELD	Poluri and Louri	2016	Router
[46]	DRS	Valinataj and Shahiri	2016	Router
[51]	–	Wang, L et al.	2017	Router
[47]	EsyTest	Wang et al.	2019	Router

3. The proposed router architecture

Fig. 2.

The Multi/Many-Processor Systems-on-Chip (MPSoCs) scheme interconnected using network on chip (NoC).

A Multiprocessors System-on-Chip (MPSoC) is widely used in embedded systems to increase their performance. It is composed of many processing elements (PEs), a memory hierarchy, and I/O devices. All these various components are connected to each other through a NoC. Figure 2 shows a graphical depiction of a MPSoC composed of several PEs, memories, and I/O devices.

Fig. 3.

The architecture of baseeline 4-stage NoC router.

The Router is the main component in the NoC [25]. It is the communication backbone in the NoC-based interconnect architecture. It receives incoming data packets and analyzes their routing information to select the best path, based on the routing function. A typical router is composed of two main components, the Datapath, where the packets go through from an input port to an output port, and the Controlpath to lead the flow of the data packets in the router depicted in Fig. 3. The control path includes the buffer control logic, pipeline control logic, arbiters, and so on.

A traditional NoC consists of a set of elements, called IPs, which can be processor cores, built-in memory, separate specialized hardware modules, I/O interfaces. In general, each IP has one router to provide packet routing. Intermediate layer, which is a NI (Network Interface), connects routers with IP nodes. All these components are interconnected by connecting lines [5]. For the communication infrastructure to dynamically pinpoint components with permanent or transient faults and to autonomously self-reconfigure, the router needs to exhibit some inner intelligence.

Fault diagnosis includes three tasks, that is, fault localization or detection, fault isolation, and fault identification.

The technique to locate the faults is the most basic task of fault diagnostic. Fault locating is also essential in fault tolerance in order to pinpoint faults on router or link [20,50]. So, to check whether there is malfunction or faulty component in the system, it is preferrable to know and localize the fault on NoC and his type to allow the implementation of the suitable approach.

Fig. 4.

The proposed router.

The schematic of fault-tolerant control is depicted in Fig. 4, which shows that fault diagnosis is integrated with fault-tolerant control. Real-time fault diagnosis cannot whether the system is faulty and inform where the fault occurs. From this valuable information, the control system can thus take appropriate fault tolerant actions. The fault information known by routers has a direct relation with the fault-tolerant capacity and network performance.

A Router is an entity that facilate communication between PEs cores in Networks on chip. We introduce in the section below the proposed router architecture and its main components. The router is the back-bone component of the NoC design. The two necessary components which are added to the basic router architecture are Fault Register (FR) and Test Module (TM) are shown in Fig. 4. Each end node has a maximum number of 4 input and 4 output ports. One input/output port is used to connect the switch to the NI, in the perspective to eliminate the extra area overhead and the power consumption. Each router can be connected to a maximum of four adjacent routers as well as the local intellectual property (IP) across NI.

The router architecture of OFDIM is shown in Fig. 4, sharing a similar structure as the traditional NoC router [25] as shown in Fig. 3. This method need special control circuitry and elements to be built for the system, which are usually very complicated because of their design, but offers a good reliability in terms of detecting permanent and soft errors in parallel with small area overhead as compared to static fault tolerance approach. It is considered as a promising method for networks-on-chip. Each step of the sequence is explained below.

4. Online fault detection and isolation mechanism-OFDIM

During the first step, we will focus on the design and implementation of a robust router resistant to a given number and type of faults. We will identify the router’s most sensitive parts that have to be protected. This information will be used to customize the fault tolerance techniques for each block of the router (input and output buffer, switch, controller, etc), as shown in Fig. 5. This approach having as a goal the minimization of the hardware overhead and the overall energy consumption of the router and the NoC. However, some router components may be very expensive to protect or they can tolerate a given number of errors (e.g. buffers using a correction code limited to two errors).

4.1. Detection mechanism

In this subsection, we have utilized the concept of online fault detection. It is introduced as an additional circuitry to detect single stuck-at faults (stuck-at 0 and stuck-at 1) during the operation of the system [2,15,16,22].

The basic detection mechanisms are either the end-to-end (E2E) [36] or switch-to-switch (S2S) [35]. E2E checks packets for error at destinations only, whereas (S2S) performs error detection at intermediate routers [44].

Fig. 5.

ECC achieved fault detection.

In terms of fault detection in on-chip network, built-in-sift-test (BIST) [14] is a traditional solution in integrated circuits which can be appropriated as a fault-detection scheme to handle the faults in routers, links and network interfaces. This method is a fault detection scheme. But this technique needs external circuits to perform error detection, and time to test which can reduce the performance of the NoC in term of latency. Another drawback the disconnection of the system for test that is required in BIST and the original traffic flow is affected which make dependability and performance hardly balanced. So, for minimizing the impact during fault detection, more online fault detection schemes are under investigation [3,4].

The fault detection can be implemented in any one of the following forms [24].

periodic tests: Tests are performed occasionally and output is verified.

self-checking circuits: Duplication in terms of circuit or time with comparison for output verification.

watchdog timers: It checks the timer value and resets when its value has exceeded. It is commonly used in multiple processor systems.

However, most fault detection methods focus only on permanent faults [3,6,8,32], ignoring the transient and intermittent faults. Transient faults have a various characteristic that they occur frequently in fixed positions while having a certain randomness on the time of occurrence. One solution to detect transient faults is increasing the test frequency.

The main goal is to keep the logic of our circuit as simple as possible, thus, it would be desirable that the total area of a checker would not exceed the area of the router it is checking.

To develop our approach, we have opted for the Cyclic Redundancy Cycle (CRC), a solution that can be used to perform error detection by comparing an input and output of each router to detect wrong packets and faulty components in a NoC. This approach is inspired by [19], with some modifications. Each router is equipped with a Test Module (TM). The goal of this mechanism is to enable online fault detection. The CRC polynomial used is $g = 1 + x + x 4$ , so we add 4 bits to the header of each packet. So, when an error is detected anywhere in the router or link, we know the error exists but we cannot correct it with CRC decoding.

Fig. 6.

Internal architecture of the router.

To test and diagnose the communication infrastructure in the NoC. First of all, it is necessary to detect the fault, secondly how to identify the faulty components: routers, links or cores. Finally, we analyze how to use efficiently the remaining components that are fault free. In the router, the fault can occur in all the components of the router (crossbar, buffers and others). At this high core density in networks on chip, considering faults only in the routers or links do not provide the optimal safety. Other components such as input-buffers and crossbar should be given prominent attention to ensure the fault tolerance and improve the system reliability.

To achieve this goal, we present in this section the Test Module (TM) added in our router (depicted in Fig. 6). The TM is one of the main components of our system, because it manages the diagnosis from all kinds of faults in four main components: inter-router links, input-buffers, crossbar, and header flit. We start in this subsection to describe the main functionalities of the Test-Module (TM) and its important role in orchestrating the different processes inside the router. Test Module detects and locates the faulty routers or links. Therefore, only the adjacent routers can communicate this information. In order to keep the fault occurrence in the input-buffers and the crossbar and header flits, this information is always sent to the FR (Fault Register). The aim is to be used during the selection of the next port. We used a simple fault detection mechanism based on a Cyclic- Redundancy-Check (CRC) in each output component of the router that reads the incoming flit to detect any error. Depending on this verification, the Test-Module (TM) sends a single-bit signal to the TM of upstream node that can be either 0 or 1, for valid or faulty, respectively, as shown in Fig. 5. Each router sends the collected information corresponding to its own fault status to all the neighboring nodes. This information is represented in a 3-bit (see Table 2) signal representing the router/link status in each direction (North, East, South, and West). Depending on these states, our router (OFDIM) reads the fault status of the next nodes/links received from the TM and checks the number of possible safety directions. For example, when a fault is detected in the buffer, a signal is sent to inform the TM module about the fault presence. The same case is applied for the other components. When receiving this signal, the TM disables the entire output-port and the faulty router to save the dynamic power. At the same time, TM updates the FR-status array by flagging the link connected to the faulty router.

Table 2

Codification of different states of links and routers

State	Value	Description
1	00	Port East unsafe
1	01	Port North unsafe
1	10	Port South unsafe
1	11	Fail Router
0	00	Port East safe
0	01	Port North safe
0	10	Port South safe
0	11	Safe Router

4.1.1. Permanent fault detection

The permanent errors cannot be eliminated by a simple reset of the circuit. The routers adjacent to a router with a permanent fault are informed about its state. This is to prevent any traffic to this defective router (for example, they can disable the output ports leading to this router). The same case is applied for defective links.

4.1.2. Transient fault detection

The transient errors occur typically during a very short time and are not destructive. So, after k attempts, the test module can consider the (temporary) dynamic faults as permanent. So we want to tolerate several on-line faults without resetting the circuit.

4.2. Fault location

After fault detection, we also need to keep the fault ocurence. Upon the detection of errors, fault diagnosis can only be performed with the information collected from fault detection. This information is constantly sent to all the TMs for all neighboring nodes, as shown in Fig. 7.

Fig. 7.

Communication between two neighboring routers.

4.3. Isolate fault

When receiving this signal, the TM disables the entire output-port and the faulty router to save the dynamic power. At the same time, the TM updates the FR-status array by flagging the link connected to the faulty router. Finally, to keep the faults information in routers or links, the TM interacts with the FR unit to exchange fault information and control signals, as shown in Fig. 8.

Fig. 8.

Communication between TM and FR.

5. Performance analysis

We devote this section to assess and analyze the performance of the proposed solution and discuss the performance variation. We have selected Random traffic pattern to evaluate the performance of the OFDIM router. In order to compare the performance of our reliable router design, we modify the booksim simulator [26] to support these two kinds of fault tolerant routers. We compare the latency caused by the add module for achieving fault tolerance. All the simulations are conducted on a $8 * 8$ 2D mesh network,with 4VCs per port and 16 flits per VC. Faults are injected randomly. To be fair, two kinds of reliable routers are injected with the same faults which can be handled properly in each design. NoC. The XY routing is used for this network, which routes packets along the X-axis first, and then Y-axis. The performance measures include system reliability, throughput and latency.

5.1. Reliability analysis

The reliability of a NoC router Rr(t) [49] is the probability that a router performs its functionalities correctly from time 0 to time t. It is decided by the failure rate, λ, which is measured by the number of failures per time unit. $\begin{matrix} (1) & R t (t) = e^{λ * t} \end{matrix}$

The reliability model for the traditional $N \times N$ mesh are given below: $\begin{array}{l} (2) & R s y s = {(R t)}^{n * n} \end{array}$ Normally the failure rate of the router is $λ = 0.00315$ (times/year) [17] Our design will reduce the λ. The reliability of different mesh size over 1 to 10 years is shown in Fig. 8. When a router fails, we need to decide which neighboring router should be connected to the corresponding IP.

The reliability curve of traditional $10 * 10$ NoC drops from 0.74 to 0.05 over 1 to 10 years. However, the reliability of our design only drops down to 0.91 in the 10th year. Also, we can see from Fig. 9(c) that the reliability gain increases exponentially with the year especially for NoC with large mesh size.

Fig. 9.

Reliability analysis of various size of mesh from 1-10 years.

5.1.1. Mean defects to failure (MDTF) analysis

To estimate the reliability improvement of our router as compared to the existing reliable routers, we use the MDTF metric. MDTF of a system is given by: $\begin{array}{l} (3) & MDTF = \frac{Maximum faults to cause failure + Minimum faults to cause faiure}{2} \end{array}$

To calculate the MDTF of our solution, we first calculate the maximum and the minimum number of faults tolerates. If all four primary links become faulty, the output ports become unreachable. Therefore, a minimum of 4 faults causes failure.

Thus, the OFDIM can tolerate a maximum 53 faults. To calculate MDTF, we need a maximum number of faults to cause failure. Thus, for the proposed architectre, the maximum number of faults to cause failure are $53 + 1 = 54$ . Now by putting these values in Equation (3), the MDTF is calculated as: $\begin{array}{l} (4) & MDTF = \frac{4 + 54}{2} = 29 \end{array}$

Thus, an average of 29 faults cause the failure. We compare MDTF of our solution with the existing reliable routers such as SHIELD [40]and HPR [48].

5.1.2. Silicon protection factor (SPF) analysis

To estimate the reliability improvement of our architecture as compared to the basic router, we use SPF [50]. It considers the fault-tolerant capability of the architecture with respect to the hardware cost. It is given as: $\begin{array}{l} (5) & SPF = \frac{Mean number of faults}{Area overhead} \end{array}$ The term Mean number of faults to cause the failure can be defined in Equation (3). Router microarchitecture can tolerate a maximum of 53. One more fault, the entire router becomes faulty and stops working. The maximum faults to cause failure are $53 + 1 = 54$ . Thus, the mean number of faults by using Equation (3) to cause failure is $(54 + 4) / 2 = 29$ faults. The proposed architecture resulted in an area overhead of up to 27%. According to the definition of SPF Equation (5), the proposed design SPF can be calculated as $29 / 1.27 = 22.8$ (See equation (6)). $\begin{array}{l} (6) & SPF = \frac{29}{1.27} \end{array}$

The router achieves the highest mean number of faults, resulting in a higher SPF value, as shown in Table 2. According to the definition of SPF, a higher SPF value means better reliability. The router achieves the highest means number of faults and SPF as compared to existing architectures.

The results are shown in Table 3.

Table 3
Comparison of area overhead for Shield, Wang and our proposed router

Architecture Area Number of Faults causing Failure SPF

Poluri Router [40] 34% 15 11.19

Wang Router [48] 30% 28.5 21.92

Proposed Router 27% 29 22.8

Architecture	Area	Number of Faults causing Failure	SPF
Poluri Router [40]	34%	15	11.19
Wang Router [48]	30%	28.5	21.92
Proposed Router	27%	29	22.8

Table 3 shows the SPF comparison of our solution with state-of-the-art reliable router architectures. It is evident that the proposed architecture is more economical than the existing reliable router architectures.

5.2. Area analysis

The area overhead [27] plays an important role to determine the fault tolerance capacity of router architecture design. A comparison of area overhead with state-of-the-art fault-tolerant router architecture is shown in Fig. 10. It is shown by comparing the baseline router and proposed solution that the area overhead of the proposed design is 27%. For fault detection, we utilized checkers design mehanism. The results show (see Fig. 10) that the proposed architecture achieves higher reliability by incurring smaller overhead as compared to HPR and SHIELD architecture.

Fig. 10.

Comparison of area overhead.

A comparison of the area overhead with some related works for fault-tolerant router architecture is shown in Fig. 10. It is shown by comparing the baseline and the proposed OFDIM (which performs 27% of the area overhead). For fault detection, we used checkers with a fault detection mechanism. The results show that the proposed solution achieves higher reliability by incurring smaller overhead as compared to an existing architecture.

5.3. Throughput analysis

In the simulation of throughput, 100 different fault patterns are randomly generated. In this evaluation, all PEs are assumed to be healthy. As revealed, an error-free network has the highest throughput. The throughput decreases as the number of faulty routers increases. However, in an $8 * 8$ network with only a few faulty routers (15%), our proposed design has the advantage of keeping the throughput almost the same as an error-free network. This is due to the fact that the OFDIM routers can replace some faulty routers, thus maintaining the functionality of the network and the throughput as shown in Fig. 11.

Fig. 11.

Average throughput for a network of 64 PEs.

5.4. Latency analysis

We have simulated the NoCs with various sizes and various numbers of faulty routers and links. Only the latency for the $8 * 8$ NoC is shown in Fig. 12, which demonstrates the typical results for most simulation cases. In the simulation, all the PEs inject packets into the network. Figure 12 shows the simulation results for a $8 \times 8$ mesh with 15% of faulty router. When the number of faulty routers increases, the maximum throughput decreases and the average latency increases. Compared with the error-free NoC (blue), the throughput only decreases by 5.19% (see Fig. 12) and the average latency increases by 2.40%. During the degraded modes, the router is still used, but has reduced performance. The cost evaluation of the degraded mode and the self-healing will have to be performed dynamically, as it is application dependent. It also requires a new diagnostic phase to verify if the fault has been mitigated, otherwise a degraded mode has to be applied.

Fig. 12.

Latency for $8 * 8$ mesh with zero and single faulty routers. The dots denote the simulation data. A solid line is the fitting curve of the discrete data.

To evaluate the performance of the proposed router architecture, we modify the baseline router architecture. All simulation is evaluated on simulating 64 nodes in a 2D mesh network. The mesh network is evaluated on synthetic traffic patterns. The uniform random traffic patterns is injected in the networks on a varying injection rate of 0.1 to 0.5. The proposed router architecture can tolerate the faulty links and all components of the router. The fault is inserted during the runtime simulation and added randomly in the network. Initially, one fault is injected, and the total number of faults is injected until the router reaches a stage where it can tolerate the maximum number of faults. It is observed that, as the number of faults increases, the latency of the router also increases, but it continues working.

Fig. 13.

Comparison of average latency with the faults injected in our design and HPR. Dashed lines show the result of HPR design. Real lines show the result of our design.

We analyze the impact of average latency under the uniform random traffic pattern with the faults injected. Figure 13 compares the average latency of HPR reliable NoC router with our design when faults are injected to different components. The result of our proposed router is shown in real lines. Fault-tolerant strategies for the input buffer, the RC, the VA and the SA bring minor performance decrease.

From the evaluation results, we could observe how the proposed router architecture was able to satisfy these three important requirements: 1) 100% functionality regardless of the location of faults 2) graceful performance degradation 3) low area overhead.

6. Conclusion

In this paper, we proposed a reliable fault tolerant router architecture, called OFDIM, endorsed with two mechanisms; detection and isolation. The proposed architecture manages to avoid the system failure at the presence of a large number of faults while ensuring graceful high reliability and performance degradation while minimizing the additional hardware complexity. Despite the good results obtained with the proposed architecture, some points should be fixed to enhance its performance and reliability. The first one is Local congestion and second global congestion caused by faults in router. Completing the fault-detection mechanism and implementing thermal-power-aware techniques [21] aim to improve the reliability of our solution and maintain the obtained good performance.

References

Aghaei,

Khademzadeh,

Reshadi and

Badie, A new bistbased test approach with the fault location capability for communication channels in network-on-chip, Journal of Electronic Testing: Theory and Applications (JETTA) 33(4) (2017), 501–513. doi:10.1007/s10836-017-5666-9.

Amory,

Briao,

Cota,

Lubaszewski and

Moraes, A scalable test strategy for network-on-chip routers, in: Proc. IEEE International Test Conference (ITC), Austin, TX, 2005, pp. 551–559.

Bhowmik,

J.K.

Deka and

Biswas, On-line analysis of stuck-at faults in on chip interconnects, Journal of Circuits, Systems and Computers 13 (2018), 1.

Bhowmik,

J.K.

Deka,

Biswas and

B.B.

Bhattacharya, Performance-aware test scheduling for diagnosing coexistent channel faults in topology-agnostic networks-on-chip, ACM Trans. on the Autom. Electron. Syst. 24 (2019), 17.

Bjerregaard and

Mahadevan, A survey of research and practices of network-on-chip, ACM Comput. Surv. 38 (2006). doi:10.1145/1132952.1132953.

Y.-C.

Chang,

C.-T.

Chiu,

S.-Y.

Lin and

C.-K.

Liu, On the design and analysis of fault tolerant NoC architecture using spare routers, in: Proc. 16th Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, 2011, pp. 431–436.

Charif,

N.E.

Zergainoh and

Nicolaidis, Addressing transient routing errors in fault-tolerant networks-on-chips, in: 21th IEEE European Test Symposium (ETS), 2016, 1–6.

Chatterjee,

Paul and

Chattopadhyay, Fault-tolerant dynamic task mapping and scheduling for network-on-chip-based multicore platform, ACM Trans. Embed. Comput. Syst. 16 (2017), 108.

Chen,

Lu,

Lei,

Wang and

Chen, Multi-bit transient fault control for noc links using 2d fault coding method, in: Networks-on-Chip (NOCS), Tenth IEEE/ACM International Symposium on, IEEE, 2016, pp. 1–8.

10.

Chen,

Lu,

Lei,

Wang and

Chen, Multi-bit transient fault control for noc links using 2d fault coding method, in: Networkson-Chip (NOCS), 2016 Tenth IEEE/ACM International Symposium on, IEEE, 2016, pp. 1–8.

11.

Chrysanthou

et al., An online and real-time fault detection and localization mechanism for network-on-chip architectures, ACM Trans. on Architecture and Code Opt. (TACO) 13(2) (2016), 22.

12.

Chrysanthou,

Englezakis,

Prodromou et al., An online and real-time fault detection and localization mechanism for network-on-chip architectures, ACM Trans. on Architecture and Code Optimization 13(2) (2016), 22. doi:10.1145/2930670.

13.

J.H.

Collet, A brief overview of the challenges of the multicore roadmap, in: Proc. of the 21st Intl. Conf. on Mixed Design of Integrated Circuits and Systems, Lublin, 2014 pp. 22–29.

14.

Cota,

de Morais Amory and

M.S.

Lubaszewski, Reliability, Availability and Serviceability of Networks-on-Chip, Springer, 2011.

15.

Dalirsani,

Holst,

Elm and

H.-J.

Wunderlich, Structural test and diagnosis for graceful degradation of NoC switches, Journal of Electronic Testing: Theory and Applications (JETTA) 28(6) (2012), 831–841. doi:10.1007/s10836-012-5329-9.

16.

Dalirsani,

Hosseinabady and

Navabi, An analytical model for reliability evaluation of NoC architectures, in: Proc. 13th IEEE International on-Line Testing Symposium (IOLTS), Crete, Greece, 2007, pp. 49–56. doi:10.1109/IOLTS.2007.13.

17.

W.J.

Dally and

B.P.

Towles, Principles and Practices of Interconnection Networks, Elsevier, 2004.

18.

DiTomaso,

A.K.

Kodi,

Louri and

Bunescu, Resilient and power-efficient multi-function channel buffers in network-on-chip architectures, IEEE Trans. on Computers 64(12) (2015), 3555–3568. doi:10.1109/TC.2015.2401013.

19.

Fochi,

Wächter,

Erichsen,

A.M.

Amory and

F.G.

Moraes, An integrated method for implementing online fault detection in NoC-based MPSoCs, in: Proc. of 2015 IEEE Intl. Symposium on Circuits and Systems, Lisbon, 2015, pp. 1562–1565. doi:10.1109/ISCAS.2015.7168945.

20.

Frank, Fault diagnosis in dynamic systems using analytical and knowledge-based redundancy – a survey and some new results, Automatica 26(3) (1990), 459–474. doi:10.1016/0005-1098(90)90018-D.

21.

Gellert,

Florea,

Fiore,

Zanetti and

Vintan, Performance and energy optimisation in CPUs through fuzzy knowledge representation, Information Sciences 476 (2019), 375–391. doi:10.1016/j.ins.2018.03.029.

22.

Ghofrani,

Parikh,

Shamshiri,

DeOrio,

K.-T.

Cheng and

Bertacco, Comprehensive online defect diagnosis in onchip networks, in: Proc. 30th IEEE VLSI Test Symposium (VTS), Hyatt Maui, HI, 2012, pp. 44–49.

23.

Grecu,

Ivanov,

Saleh,

E.S.

Sogomonyan and

P.P.

Pande, On-line fault detection and location for NoC interconnects, in: Proc. of the 12 th IEEE Intl. on-Line Testing Symposium, Lake Como, 2006.

24.

Hervé,

Almeida,

F.L.

Kastensmidt,

Cota and

Concurrent, Test of network-on-chip interconnects and routers, in: 11th Latin American Test Workshop (LATW), 2010, pp. 1–6.

25.

Ibrahim,

N.K.

Baloch,

Anjum,

Y.B.

Zikria and

S.W.

Kim, An energy efficient and low overhead fault mitigation technique for internet of thing edge devices reliable on-chip communication. Software: Practice and Experience, Available online https://onlinelibrary-wiley-com-443.web.bisu.edu.cn/doi/pdf/10.1002/spe.2796, 4 Feb 2020.

26.

Jiang,

D.U.

Becker,

Michelogiannakis,

Balfour,

Towles,

Kim and

W.J.

Dally, A detailed and flexible cycle-accurate network-on-chip simulator, in: ISPASS’13, 2013.

27.

A.B.

Kahng,

Li,

L.S.

Peh and

Samadi, ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration, in: 2009 Design, Automation & Test in Europe Conference & Exhibition, IEEE, 2009, pp. 423–428. doi:10.1109/DATE.2009.5090700.

28.

M.R.

Kakoee,

Bertacco and

Benini, At-speed distributed functional testing to detect logic and delay faults in NoCs, IEEE Trans. on Computers 63(3) (2014), 703–717. doi:10.1109/TC.2013.202.

29.

Killian,

Tanougast and

Dandache, Hybrid fault detection for adaptive NoC, IEEE Embedded Systems Letters 5(4) (2013), 69–72. doi:10.1109/LES.2013.2283740.

30.

Killian,

Tanougast,

Monteiro and

Dandache, Smart reliable network-on-chip, IEEE Trans Very Large Scale Integr VLSI Syst 22(2) (2014), 242–255.

31.

Lehtonen, On fault tolerance methods for networks-on-chip, Turku, Finland, 2009.

32.

Lehtonen,

Wolpert,

Liljeberg,

Plosila and

Ampadu, Self-adaptive system for addressing permanent errors in on-chip interconnects, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18(4) (2010), 527–540. doi:10.1109/TVLSI.2009.2013711.

33.

J.X.

Liu,

Harkin,

Y.-H.

Li and

Maguire, Online traffic-aware fault detection for networks-on-chip, Journal of Parallel and Distributed Computing 74(1) (2014), 1984–1993. doi:10.1016/j.jpdc.2013.09.001.

34.

H.J.

Mohammed,

W.N.

Flayyih and

F.Z.

Rokhani, Tolerating permanent faults in the input port of the network on chip router, J. Low Power Electron. 9 (2019).

35.

J.M.

Montanana,

de Andres and

Tirado, Fault tolerance on NoCs, IEEE, 2013, pp. 138–143.

36.

Palesi,

Ascia,

Fazzino and

Catania, Data encoding schemes in networks on chip, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30(5) (2011), 774–786. doi:10.1109/TCAD.2010.2098590.

37.

Pengzhan,

Shixiong and

Ramalingam, A high throughput router with a novel switch allocator for Network on Chip, in: 2015 28th IEEE International System-on-Chip Conference (SOCC), IEEE, 2015, pp. 160–163.

38.

Pengzhan,

Shixiong and

Ramalingam, A novel fault-tolerant router architecture for Network-on-Chip reconfiguration, in: System-on-Chip Conference (SOCC), 2015 28th IEEE International, IEEE, 2015, pp. 292–297.

39.

Poluri and

Louri, Tackling permanent faults in the network-on-chip router pipeline, SBAC-PAD’13, IEEE, 2013, pp. 49–56.

40.

Poluri and

Louri, Shield: A reliable network-on-chip router architecture for chip multiprocessors, IEEE Trans. Parallel Distrib. Syst. 27 (2016), 3058–3070. doi:10.1109/TPDS.2016.2521641.

41.

Poluri and

Louri, An improved router design for reliable on-chip networks, in: Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Phoenix, AZ, May 19–23, 2014, pp. 283–292.

42.

Putkaradze,

S.P.

Azad,

Niazmand,

Raik and

Jervan, Fault-resilient NoC router with transparent resource allocation, in: Proceedings of the 12th International Symposium on Reconfigurable Communication-Centric Systems-on-Chip (ReCoSoC), Madrid, Spain, 2017, p. 1–8.

43.

Radetzki,

Feng,

Zhao and

Jantsch, Methods for fault tolerance in networks on chip, ACM Computing Surveys 44 (2013), 1–35.

44.

Schley,

Batzolis and

Radetzki, Fault localizing end-to-end flow control protocol for Networks-on-Chip, in: 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Belfast, Northen Ireland, 2013, pp. 454–461.

45.

Shamshiri,

Ghofrani and

K.T.

Cheng, End-to-end error correction and online diagnosis for on-chip networks, in: Proc. ITC, 2011.

46.

Valinataj and

Shahiri, A low-cost, fault-tolerant and high-performance router architecture for on-chip networks, Microprocess. Microsyst. 45 (2016), 151–163. doi:10.1016/j.micpro.2016.04.009.

47.

Wang,

Ebrahimi,

Huang,

Xie,

Li and

Li,

Jantsch, Efficient design-for-test approach for networks-on-chip, IEEE Trans. Comput. 68 (2019), 198–213.

48.

Wang,

Ma,

Li,

Chen and

Wang, A high performance reliable NoC router, Integration 58 (2017), 583–592. doi:10.1016/j.vlsi.2016.10.016.

49.

L.-T.

Wang,

C.E.

Stroud and

N.A.

Touba, System-on-Chip Test Architectures: Nanometer Design for Testability, Morgan Kaufmann, 2010.

50.

Werner,

Navaridas and

Lujan, A survey on design approaches to circumvent permanent faults in networks-on-chip, ACM Computing Surveys (CSUR) 48(4) (2016), 59. doi:10.1145/2886781.

51.

Yang and

Ampadu, An energy-efficient NoC router with adaptive fault-tolerance using channel slicing and on-demand TMR, IEEE Trans. Emerg. Top. Comput. 6 (2018), 538–550. doi:10.1109/TETC.2016.2597799.

52.

Yu and

Ampadu, A dual-layer method for transient and permanent error comanagement in noc links, IEEE Transactions on Circuits and Systems II: Express Briefs 58(1) (2011), 36–40. doi:10.1109/TCSII.2010.2092817.

A new architecture for online error detection and isolation in network on chip

Abstract

Keywords

1. Introduction

4.1. Detection mechanism

4.1.2. Transient fault detection

4.2. Fault location

5.1. Reliability analysis

5.1.2. Silicon protection factor (SPF) analysis

Table 3 Comparison of area overhead for Shield, Wang and our proposed router Architecture Area Number of Faults causing Failure SPF Poluri Router [40] 34% 15 11.19 Wang Router [48] 30% 28.5 21.92 Proposed Router 27% 29 22.8

References

Table 3
Comparison of area overhead for Shield, Wang and our proposed router

Architecture Area Number of Faults causing Failure SPF

Poluri Router [40] 34% 15 11.19

Wang Router [48] 30% 28.5 21.92

Proposed Router 27% 29 22.8