# Network-on-Chip Link Analysis under Power and Performance Constraints

Manho Kim, Daewook Kim and Gerald E. Sobelman Department of Electrical and Computer Engineering University of Minnesota, Minneapolis, MN 55455 USA Email: {mhkim,daewook,sobelman}@ece.umn.edu

*Abstract*— This paper analyzes the behavior of interconnects in the highly structured environment of a network-on-chip (NoC). Two distinct classes of wires are considered, namely links between adjacent routers and links between a router and an attached processing element (PE). Analytical models for global router-to-router links and semi-global router-to-PE links are studied. Power and performance optimizations are obtained for each of these two classes of interconnections.

## I. INTRODUCTION

In a network-on-chip (NoC) platform, there are two major signal paths: router-to-router and router-to-processing element (PE). Compared to the short local wires inside PEs and routers, the global wires between routers and the semi-global wires between a router and a PE pose many challenges that must be addressed, particularly as semiconductor technology scales [1].



Fig. 1. NoC Communication Platform.

An NoC is an embodiment of a layered design approach [2]. This methodology considers on-chip communication and its abstraction as a micro-network consisting of particular layers, i.e. physical, data link, network, transport and application, each one having its own functions. Physical layer design should find a compromise between competing quality metrics and provide a clean and complete abstraction of the channel characteristics to the other layers. The data-link layer abstracts the physical layer as an unreliable digital link, where the probability of bit upsets is non-zero and increasing as technology scales down. Furthermore, reliability can be traded off against energy [3].

In this paper, we consider the delay, power, noise, area and throughput of inter-router and router-to-PE physical and link layer connections in an NoC platform as shown in Fig. 1. We also propose guidelines for designing a reliable and power efficient communication environment for both of these types of links.

## II. LINK CONFIGURATION OF ROUTER-TO-ROUTER AND ROUTER-TO-PE



Fig. 2. Various NoC interface configurations. (a) n parallel (b) reduced m parallel (c) source-synchronous serial.

Fig. 2 shows several possible configurations of NoC links between routers or from router-to-PE. Fig. 2 (a) represents a parallel wire configuration. For example, a 128 or 256 bit packet can be sent in parallel. Fig. 2 (b) shows an *n*-bit packet or flit divided into a smaller number of bits (*m* bits) so that the transmission is only partially parallel. Fig. 2 (c) shows a source-synchronous serial communication strategy.

#### A. Single Wire Delay Model

An optimal set of repeaters may be inserted into each wire in a cascaded structure [4] [5]. The total delay of a wire can be modeled by adding the first stage cascaded driver delay and the optimal repeater inserted wire delay.

The delay for the first stage cascade driver is given as

$$\tau_{driver,0.5} = 0.7eR_0C_0 \ln\left(\frac{C_{int} + C_L}{C_0}\right) + 0.4R_{int}C_{int} + 0.7R_{int}C_L$$



Fig. 3. (a) Optimal repeaters with cascaded first stage (b) Cascaded drivers (c) Optimal repeaters.

where *e* is the base of the natural logarithm,  $C_0$  and  $R_0$  are the input capacitance and output resistance of a minimum-size inverter,  $C_L$  is the load capacitance, and  $R_{int}$  and  $C_{int}$  are the resistance and capacitance of the interconnection. The latter are given by  $R_{int} = rL$  and  $C_{int} = cL$ , where *r* and *c* are the wire resistance and capacitance per unit length and *L* is the wire length.

The delay for an optimal repeater inserted into the wire is given as

$$\tau_{0.5} = k[0.7\frac{R_0}{h}(\frac{C_{int}}{k} + hC_0) + \frac{R_{int}}{k}(0.4\frac{C_{int}}{k} + 0.7hC_0)]$$

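Here *k* is the number of repeaters and *h* is the repeater size relative to a minimum-size inverter. Optimal values for *k* and *h* can be obtained by setting  $\partial\tau_{0.5}/\partial k$  and  $\partial\tau_{0.5}/\partial h$  to zero: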

$$k = \sqrt{rac{0.4R_{int}C_{int}}{0.7R_0C_0}}, h = \sqrt{rac{R_0C_{int}}{R_{int}C_0}}$$

With these values, the delay becomes  $\tau_{0.5} = 2.5\sqrt{R_0C_0R_{int}C_{int}}$ , and the total delay of a single wire is  $\tau_0 = \tau_{driver,0.5} + \tau_{0.5}$ .
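The closed-form optimum can be checked numerically. The following is a minimal sketch, not the authors' tool flow: the function names and the device and wire values are hypothetical and chosen only for illustration. It evaluates the repeater delay expression at the optimal *k* and *h* and reports the coefficient multiplying  $\sqrt{R_0C_0R_{int}C_{int}}$ , which comes out at about 2.46 and is rounded to 2.5 in the text.

```python
import math

def repeater_delay(k, h, R0, C0, R_int, C_int):
    """50% delay of a wire split into k segments driven by repeaters of size h."""
    return k * (0.7 * (R0 / h) * (C_int / k + h * C0)
                + (R_int / k) * (0.4 * C_int / k + 0.7 * h * C0))

def optimal_repeaters(R0, C0, R_int, C_int):
    """Return (k, h, delay) at the closed-form optimum."""
    k = math.sqrt(0.4 * R_int * C_int / (0.7 * R0 * C0))
    h = math.sqrt(R0 * C_int / (R_int * C0))
    return k, h, repeater_delay(k, h, R0, C0, R_int, C_int)

# Hypothetical values for illustration only (not taken from the paper).
R0, C0 = 10e3, 1e-15             # ohm, farad: minimum-size inverter
r, c, L = 50.0, 200e-15, 10.0    # ohm/mm, farad/mm, mm
R_int, C_int = r * L, c * L

k, h, delay = optimal_repeaters(R0, C0, R_int, C_int)
coeff = delay / math.sqrt(R0 * C0 * R_int * C_int)
print(f"k = {k:.2f}, h = {h:.1f}, delay = {delay * 1e12:.1f} ps, coefficient = {coeff:.3f}")
```

In practice *k* must be an integer, so a design would round it and re-evaluate `repeater_delay` for the neighboring integer values.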

In this single wire case, the clock period  $T_c$  can be set to  $T_c \ge \tau_0$ . However, the above equation can only be applied to Fig. 2 (c) (i.e., a serial wire). It is not directly applicable to parallel wires because it neglects the capacitance between adjacent wires.

#### B. DSM Parallel Wire Delay Model

We consider an *n*-bit parallel set of wires within a single metal layer. We assume that the rise time of the drivers and the loss in the interconnects are such that inductance can be ignored. Such a deep submicron (DSM) wire can be modeled as a distributed RC network with a coupling capacitance between adjacent wires. The delay of the  $l$-th wire of the bus is given as follows:

$$T_{l} = \begin{cases} \tau_{0} \left[ (1+\lambda)\Delta_{1}^{2} - \lambda\Delta_{1}\Delta_{2} \right] &, l = 1\\ \tau_{0} \left[ (1+2\lambda)\Delta_{l}^{2} - \lambda\Delta_{l}(\Delta_{l-1} + \Delta_{l+1}) \right] &, 1 < l < n\\ \tau_{0} \left[ (1+\lambda)\Delta_{n}^{2} - \lambda\Delta_{n}\Delta_{n-1} \right] &, l = n \end{cases}$$

 $\tau_0$  is the delay of a crosstalk-free wire,  $\lambda$  is the ratio of coupling capacitance to bulk capacitance and  $\Delta_l$  is the transition occurring on wire *l*, where

$$\Delta_l = \begin{cases} 1 & \text{for a rising transition} \\ -1 & \text{for a falling transition} \\ 0 & \text{for no transition} \end{cases}$$

In the parallel wire case above, the clock period  $T_c$  must be large enough for every transition on the bus to complete. The worst case occurs when a wire switches in the opposite direction to both of its neighbors, which gives  $T_c \ge \eta \cdot (1+4\lambda)$ , where  $\eta$  is a technology parameter [6].
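The delay model above can be evaluated directly for any transition pattern. The sketch below is illustrative only: the function name and the value of  $\lambda$  are assumptions. It returns the normalized delay  $T_l/\tau_0$  for every wire and confirms by exhaustive search that the largest factor is  $1+4\lambda$ .

```python
from itertools import product

def delay_factors(deltas, lam):
    """Return T_l / tau_0 for each wire, given transitions in {-1, 0, +1}."""
    n = len(deltas)
    factors = []
    for i, d in enumerate(deltas):
        # Edge wires have one neighbour, inner wires have two.
        neighbours = [deltas[j] for j in (i - 1, i + 1) if 0 <= j < n]
        factors.append((1 + lam * len(neighbours)) * d * d
                       - lam * d * sum(neighbours))
    return factors

LAM = 2.0  # hypothetical ratio of coupling to bulk capacitance

print(delay_factors([1, 1, 1], LAM))    # all wires rise together
print(delay_factors([-1, 1, -1], LAM))  # middle wire opposes both neighbours

# Exhaustive search over every pattern on a 4-wire bus.
worst = max(max(delay_factors(p, LAM)) for p in product((-1, 0, 1), repeat=4))
print(worst, 1 + 4 * LAM)
```

When all wires switch in the same direction the coupling terms cancel and each wire sees only the crosstalk-free delay, which is why the clock period is set by the opposing-neighbor pattern.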

#### C. Throughput

The throughput of an NoC link is given by:

$$T_L = \frac{N}{T_c}$$

where  $T_L$  is the total throughput of the link,  $T_c$  is the minimum pulse width and N is the number of signal wires. The above expression should be modified to  $T_L = \delta \frac{N}{T_c}$  if pipelining is used, where  $\delta$  is the pipelining effect factor. When we consider a parallel *n*-bit wide bus structure, the control signal overhead can be modeled as follows:

$$T_L = \frac{1}{T_c} \cdot \frac{N_{data}}{N_{(data + control)}}$$

#### D. Power Consumption on Wires

The dynamic power dissipated in a set of  $N_{wires}$  wires is:

$$P_W = \alpha \cdot C \cdot f \cdot V_{dd}^2 \cdot N_{wires}$$

where  $\alpha$  is the switching probability, C is the wire capacitance, *f* is the signal frequency and  $V_{dd}$  is the supply voltage.

The power consumption on the link between two routers is as follows:

$$P_{link} = (P_{driver} + P_{repeaters} + P_{wire}) \cdot N_{wires}$$

where  $N_{wires}$  is equal to the number of parallel wires of the link.

## III. NOC LINK INTERFACE MODEL

Fig. 4 (a) shows a general link configuration with an *n*-bit wide parallel wire structure, and Fig. 4 (b) shows an improved link obtained by adding special features such as error correction [7], TDMA [8] or CDMA [9] techniques.



Fig. 4. NoC Link wire model (a) n wires (b) m wires with an encoder and decoder pair.

#### A. Power consumption of a parallel wire structure

Let two router interfaces (IFs), Router IF<sub>1</sub> and Router IF<sub>2</sub>, be connected using *n* parallel wires of length *l* [2]. The total power consumption of the link is given by [10]

$$P = f \cdot V_{dd}^2 \cdot (\alpha_1(C_{out_1} + l \cdot n \cdot C_w + C_{in_2}) + \alpha_2 C_{out_2})$$

where  $C_{out_1}$  and  $C_{out_2}$  are the intrinsic output capacitances of Router<sub>1</sub> and Router<sub>2</sub>, respectively,  $C_{in_2}$  is the input capacitance of Router<sub>2</sub>,  $C_w$  is the capacitance of a wire per unit length, *f* is the clock frequency, and  $\alpha_1$  and  $\alpha_2$  are the switching probabilities of Router<sub>1</sub> and Router<sub>2</sub>, respectively.

Suppose we insert an n-to-m encoder and an m-to-n decoder between Router<sub>1</sub> and Router<sub>2</sub> and allow them to communicate via a reduced number m of wires. We can show that the total power consumption of such a link system is given by:

$$P_m = f \cdot V_{dd}^2 \cdot [\alpha_1(C_{out_1} + C_{in_e}) + \alpha_e(C_{out_e} + l \cdot n_{ed} \cdot C_w + C_{in_d}) + \alpha_d(C_{out_d} + C_{in_2}) + \alpha_2 C_{out_2}]$$

where  $n_{ed}$  is the number of wires between the encoder and the decoder,  $C_{in_e}, C_{in_d}, C_{out_e}$ , and  $C_{out_d}$  are the input and output capacitances of the encoder and decoder and  $\alpha_e$  and  $\alpha_d$  are the switching probabilities of the encoder and the decoder, respectively.

Clearly, a reduction in the number of wires between Router<sub>1</sub> and Router<sub>2</sub> comes at the expense of additional hardware, which contributes to the area, delay and power of the whole system. We must ensure that the cost of this additional hardware does not exceed the improvement which it brings.

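Let  $P_a$  denote the power of the direct *n*-wire link of Fig. 4 (a) and  $P_b$  that of the encoded link of Fig. 4 (b), where  $f'$  and  $l'$  are the signaling frequency and wire length used in the encoded configuration:

$$\begin{aligned} P_a &= \alpha_1 V_{dd}^2 f \cdot C_{out_1} + \alpha_1 V_{dd}^2 f \cdot l \cdot n \cdot C_w + \alpha_1 V_{dd}^2 f \cdot C_{in_2} + \alpha_2 V_{dd}^2 f \cdot C_{out_2} \\ P_b &= \alpha_1 V_{dd}^2 f (C_{out_1} + C_{in_e}) + \alpha_e V_{dd}^2 f' (C_{out_e} + l' \cdot n_{ed} \cdot C_w + C_{in_d}) + \alpha_d V_{dd}^2 f' (C_{out_d} + C_{in_2}) + \alpha_2 V_{dd}^2 f \cdot C_{out_2} \end{aligned}$$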

In order for this scheme to have a beneficial trade-off, the condition  $P_b \leq P_a$  must be satisfied.
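The break-even condition can be explored with a small calculator. This is a sketch under stated assumptions, not the authors' evaluation: all numeric values and the function names are hypothetical, a single frequency and wire length are used for both configurations, and the encoder is taken to drive the decoder input capacitance  $C_{in_d}$ .

```python
def direct_link_power(p):
    """Power of the n-wire link between the two routers."""
    return p["f"] * p["vdd"] ** 2 * (
        p["a1"] * (p["c_out1"] + p["l"] * p["n"] * p["c_w"] + p["c_in2"])
        + p["a2"] * p["c_out2"]
    )

def encoded_link_power(p):
    """Power of the link with an encoder/decoder pair and n_ed wires."""
    return p["f"] * p["vdd"] ** 2 * (
        p["a1"] * (p["c_out1"] + p["c_in_e"])
        + p["a_e"] * (p["c_out_e"] + p["l"] * p["n_ed"] * p["c_w"] + p["c_in_d"])
        + p["a_d"] * (p["c_out_d"] + p["c_in2"])
        + p["a2"] * p["c_out2"]
    )

def encoding_pays_off(p):
    """True when the encoded link consumes no more power than the direct one."""
    return encoded_link_power(p) <= direct_link_power(p)

# Hypothetical parameters: Hz, V, mm, F/mm, F. None are taken from the paper.
params = dict(
    f=1e9, vdd=1.8, l=13.0, c_w=208e-15, n=16, n_ed=8,
    a1=0.2, a2=0.2, a_e=0.3, a_d=0.2,
    c_out1=50e-15, c_in2=50e-15, c_out2=50e-15,
    c_in_e=50e-15, c_out_e=50e-15, c_in_d=50e-15, c_out_d=50e-15,
)

print(encoding_pays_off(params))                   # encoder wires toggle moderately
print(encoding_pays_off({**params, "a_e": 0.4}))   # encoder wires toggle more often
```

With these illustrative numbers, halving the wire count only helps while the switching probability on the shared wires stays below roughly twice that of the original wires, which is the kind of trade-off the condition on  $P_b$  captures.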

#### B. Signal reliability in DSM interconnect

An interconnect wire, at a high level of abstraction, can be modeled as a noisy communication channel over which bit streams are transmitted. The error performance of such a channel has been analyzed in [11]. Because the interconnect is not reliable, the upper layer of the NoC protocol handles these errors. Error correction [12] and retransmission are two popular techniques for this unreliable channel.

#### C. Noise reduction in parallel wires

Crosstalk between adjacent lines may cause a link to become unreliable, leading to effects such as skew, jitter, signal errors, hold-time violations and setup-time violations. Such problems will become more dominant as the technology scales. In this subsection we propose efficient techniques for reducing these deleterious effects.

1) Semi-global Links: Instead of transmitting on every wire in every clock cycle, the odd-numbered wires may transmit on odd clock cycles and the even-numbered wires on even clock cycles, as shown in Fig. 5 (a). The worst-case delay of this scheme is  $\eta \cdot (1+2\lambda)$ , which is lower than the  $\eta \cdot (1+4\lambda)$  of the fully parallel case because the neighbors of a switching wire are held stable. As an example, suppose we have a 16-bit packet in the buffer. At time t0, the even-numbered bits in the buffer (b0, b2, etc.) are transmitted over the even-numbered wires (b<sub>0</sub>, b<sub>2</sub>, etc.). At time t1, the odd-numbered bits in the buffer (b1, b3, etc.) are transmitted over the odd-numbered wires (b<sub>1</sub>, b<sub>3</sub>, etc.).
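The reduction of the worst case from  $1+4\lambda$  to  $1+2\lambda$  can be confirmed by enumerating the allowed transition patterns with the delay model of Section II. The sketch below is an illustration under assumptions: the function names, the 5-wire bus and the value of  $\lambda$  are hypothetical, and wires that are not scheduled in a cycle are modeled as making no transition.

```python
from itertools import product

def delay_factor(deltas, i, lam):
    """T_i / tau_0 for wire i under the coupled-RC delay model."""
    d = deltas[i]
    neighbours = [deltas[j] for j in (i - 1, i + 1) if 0 <= j < len(deltas)]
    return (1 + lam * len(neighbours)) * d * d - lam * d * sum(neighbours)

def worst_case_factor(n, lam, alternate):
    """Largest delay factor over all allowed transition patterns on n wires."""
    # With alternation, only wires of one parity may switch in a given cycle.
    groups = [range(0, n, 2), range(1, n, 2)] if alternate else [range(n)]
    worst = 0.0
    for active in groups:
        for values in product((-1, 0, 1), repeat=len(active)):
            deltas = [0] * n
            for i, v in zip(active, values):
                deltas[i] = v
            worst = max(worst, max(delay_factor(deltas, i, lam) for i in range(n)))
    return worst

LAM = 1.5  # hypothetical ratio of coupling to bulk capacitance

print(worst_case_factor(5, LAM, alternate=False), 1 + 4 * LAM)
print(worst_case_factor(5, LAM, alternate=True), 1 + 2 * LAM)
```

The price of the shorter clock period is that each wire carries a bit only every other cycle, so the comparison with the fully parallel bus has to be made at equal throughput.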



Fig. 5. Signal Illustration (a) Multiplexing wire (b) TDMA wire share.

2) Global Links: In the global interconnect (router-to-router) environment, an upper-level metal layer is normally used. Compared to lower- or middle-level metal layers, these wires are thick and have a wide pitch. Therefore, they consume more area and power than lower-level metal. We suggest a multiplexing scheme with a much larger wire spacing instead of shielding each individual wire. The link configuration between two routers is similar to Fig. 4 (b). By adding an encoder-decoder pair, the number of wires between encoder and decoder (*m*) can be reduced below *n*. Fig. 5 (b) gives an example. Suppose there is a 16-bit packet in the buffer. Then, b0 and b1 share wire  $b_0$ , b2 and b3 share wire  $b_2$ , etc.
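The wire sharing of Fig. 5 (b) amounts to a 2-to-1 time multiplexer at the sender and a matching demultiplexer at the receiver. The following is a minimal behavioral sketch, assuming that the even-numbered bit of each pair is sent in the first time slot and the odd-numbered bit in the second; the slot order and the function names are hypothetical and not specified in the text.

```python
def tdma_encode(bits):
    """Map 2m buffer bits onto m shared wires over two time slots."""
    if len(bits) % 2:
        raise ValueError("expected an even number of bits")
    slot0 = bits[0::2]  # b0, b2, ... sent in the first slot
    slot1 = bits[1::2]  # b1, b3, ... sent on the same wires in the second slot
    return slot0, slot1

def tdma_decode(slot0, slot1):
    """Rebuild the original bit order from the two received slots."""
    bits = [0] * (2 * len(slot0))
    bits[0::2] = slot0
    bits[1::2] = slot1
    return bits

packet = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]  # 16-bit example
slot0, slot1 = tdma_encode(packet)
print(len(slot0), "wires carry", len(packet), "bits in two slots")
print(tdma_decode(slot0, slot1) == packet)
```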

## IV. CASE STUDY

We modeled the inter-router and router-to-PE links in the NoC platform of the previous sections. Global link lengths were taken from [13]: 13 mm for 0.18  $\mu$m, 9.3 mm for 0.13  $\mu$m and 7.1 mm for 90 nm. The predictive technology model (PTM) [14] with BSIM3 models was used to obtain the interconnect and device parameters shown in Table I.

TABLE I Process parameters

| Technology | Width [µm] | Space [µm] | $V_{dd}$ [V] | $C_w$ [fF/mm] |
|---------|------------|------------|-----|---------------|
| 90 nm   | 0.5        | 0.5        | 1.2 | 331           |
| 0.13 µm | 0.6        | 0.6        | 1.5 | 268           |
| 0.18 µm | 0.8        | 0.8        | 1.8 | 208           |

Table II shows the analytical model performance results using the above parameters. The maximum link length was assumed to be the same as the maximum synchronous NoC resource size, which can be approximated from the FO4 delay of the technology. The maximum link delay can be obtained after finding the optimum number of repeaters (*k*) and the repeater size (*h*) using the closed-form formulas of Section II.

The average power consumed in the interconnect can be approximated by

$$P_{link} = \frac{1}{2}(C_w + C_{rep})V_{dd}^2 \cdot f \cdot \alpha \cdot N_w = \frac{1}{2}(C_w + hkC_0)V_{dd}^2 \cdot f \cdot \alpha \cdot N_w$$

where

$$k = \sqrt{\frac{0.4R_{int}C_{int}}{0.7R_0C_0}}, \quad h = \sqrt{\frac{R_0C_{int}}{R_{int}C_0}}$$

Therefore,  $P_{link} = 0.875C_wV_{dd}^2 \cdot f \cdot \alpha \cdot N_w$ . We assumed  $\alpha = 0.2$  for the power results.
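The factor 0.875 can be traced by substituting the optimal *k* and *h* into the repeater capacitance. The short derivation below assumes that  $C_w$  in this expression denotes the total wire capacitance, i.e.  $C_w = C_{int}$ ; this identification is not stated explicitly in the text.

$$hkC_0 = \sqrt{\frac{R_0C_{int}}{R_{int}C_0}} \cdot \sqrt{\frac{0.4R_{int}C_{int}}{0.7R_0C_0}} \cdot C_0 = \sqrt{\frac{0.4}{0.7}}\,C_{int} \approx 0.756\,C_{int}$$

$$\frac{1}{2}(C_w + hkC_0) \approx \frac{1}{2}(1 + 0.756)\,C_w \approx 0.878\,C_w$$

Taking the repeater capacitance as approximately  $0.75\,C_w$  gives the rounded coefficient 0.875.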

TABLE II NOC LINK PERFORMANCE

| Link width = 256             | 0.18 µm | 0.13 µm | 90 nm  |
|------------------------------|---------|---------|--------|
| Max. link length [mm]        | 13      | 9.3     | 7.1    |
| Link delay of max. link [ps] | 851.5   | 663     | 539.6  |
| Max. frequency [GHz]         | 1.2     | 1.51    | 1.85   |
| Link area [mm<sup>2</sup>]   | 5.325   | 2.856   | 1.817  |
| Link power consumption [W]   | 0.7945  | 0.3793  | 0.1762 |
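As a cross-check, the 0.13 µm column of Table II can be approximately reproduced from the Table I parameters. The sketch below rests on three assumptions that are not stated in the text: the maximum frequency is taken as the reciprocal of the link delay, the link area as the number of wires times the wire pitch (width plus space) times the link length, and the wire capacitance in the power formula as  $C_w$  per unit length times the link length. The function and constant names are hypothetical.

```python
# 0.13 um column: parameters from Table I and Table II.
N_WIRES = 256
LENGTH_MM = 9.3
LINK_DELAY_PS = 663.0
WIDTH_UM, SPACE_UM = 0.6, 0.6
VDD_V = 1.5
CW_F_PER_MM = 268e-15
ALPHA = 0.2
F_TABLE_HZ = 1.51e9  # tabulated maximum frequency

def max_frequency_ghz(delay_ps):
    """Assumed: one transfer per link delay, so f_max = 1 / delay."""
    return 1e3 / delay_ps

def link_area_mm2(n_wires, width_um, space_um, length_mm):
    """Assumed: area = wires x pitch x length, with pitch = width + space."""
    return n_wires * (width_um + space_um) * 1e-3 * length_mm

def link_power_w(cw_f_per_mm, length_mm, vdd, f_hz, alpha, n_wires):
    """P_link = 0.875 * C_w * Vdd^2 * f * alpha * N_w, with C_w the total wire capacitance."""
    return 0.875 * cw_f_per_mm * length_mm * vdd ** 2 * f_hz * alpha * n_wires

f_max = max_frequency_ghz(LINK_DELAY_PS)
area = link_area_mm2(N_WIRES, WIDTH_UM, SPACE_UM, LENGTH_MM)
power = link_power_w(CW_F_PER_MM, LENGTH_MM, VDD_V, F_TABLE_HZ, ALPHA, N_WIRES)
print(f"f_max = {f_max:.2f} GHz, area = {area:.3f} mm^2, power = {power:.4f} W")
```

Under these assumptions the computed values agree with the tabulated 1.51 GHz, 2.856 mm<sup>2</sup> and 0.3793 W to within rounding.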



Fig. 6. Link Throughput.

Fig. 7. Link power consumption.

Fig. 6 shows that the throughput of the link increases with bit width. Fig. 7 shows that the power consumption also increases with the bit width of the link. As can be seen from the above figures, using more parallel wires gives better link throughput. However, we cannot increase the signal frequency to its maximum in a parallel wire structure due to noise, skew and jitter. Therefore, we have to trade off high throughput against reliability, power consumption and area. The various design options are shown in Table III.

TABLE III NOC LINK PERFORMANCE COMPARISON

| 0.18 µm technology        | Width | Freq. [GHz] | Area [mm<sup>2</sup>] |
|---------------------------|-------|------------|-------------------------|
| Conv. parallel Fig. 2 (a) | 16    | 0.6        | 0.66                    |
| Fig. 2 (b) w/ Fig. 5 (a)  | 16    | 1.2        | 0.43                    |
| Fig. 2 (b) w/ Fig. 5 (b)  | 8     | 1.2        | 0.26                    |
| Conv. serial Fig. 2 (c)   | 1     | 9.6        | 0.02                    |

Table III shows the inter-router/router-to-PE link comparison results for various configurations. If the link throughput requirement is 9.6 Gbps, we have four possible link configurations. For consistency with the previous figures, we have chosen a parallel 16-bit wide bus as the base configuration. The first configuration is a conventional parallel bus with every wire shielded, as shown in Fig. 2 (a). The second is the alternate wire scheme shown in Fig. 5 (a). The third scheme shares one wire between two adjacent bits, as shown in Fig. 5 (b). The final configuration is a conventional serial line, as shown in Fig. 2 (c). We assume that the optimal number of repeaters is inserted into each wire using the wire delay model of Section II. All of the above configurations can achieve a 9.6 Gbps throughput.
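The equal-throughput claim can be checked from the width and frequency columns of Table III. The sketch below uses the throughput expression of Section II with an added utilization factor, which is an assumption: each wire of the alternate scheme of Fig. 5 (a) is taken to carry a bit only every other cycle (factor 0.5), while all other configurations use every cycle. The names are hypothetical.

```python
# (configuration, width in wires, frequency in GHz, assumed wire utilization)
CONFIGS = [
    ("Conv. parallel, Fig. 2 (a)", 16, 0.6, 1.0),
    ("Fig. 2 (b) with Fig. 5 (a)", 16, 1.2, 0.5),  # wires alternate cycles
    ("Fig. 2 (b) with Fig. 5 (b)", 8, 1.2, 1.0),
    ("Conv. serial, Fig. 2 (c)", 1, 9.6, 1.0),
]

def throughput_gbps(width, freq_ghz, utilization):
    """Link throughput N / T_c, scaled by the fraction of cycles each wire is used."""
    return width * freq_ghz * utilization

for name, width, freq, util in CONFIGS:
    print(f"{name}: {throughput_gbps(width, freq, util):.1f} Gbps")
```

All four rows evaluate to 9.6 Gbps, so the configurations differ only in clock frequency and area.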

## V. CONCLUSIONS

In the highly structured NoC platform, the power and noise characteristics of inter-router links and router-to-PE links have been systematically investigated. We have used analytical models for these structures to determine the trade-offs that must be made between power, performance and reliability. Based on this analysis, we have proposed multiplexed link structures which provide a beneficial design point for these types of links. A case study was used to obtain design metrics for these structures and to validate the proposed methodology.

#### ACKNOWLEDGMENTS

The authors would like to thank Sang Woo Rhim, Bumhak Lee and Euiseok Kim of Samsung Advanced Institute of Technology (SAIT) for their help with this manuscript. This research work is supported by a grant from SAIT.

#### REFERENCES

- [1] K. Lee, S.-J. Lee, and H.-J. Yoo, "SILENT: serialized low energy transmission coding for on-chip interconnection networks," in *IEEE/ACM International Conference on Computer Aided Design*, 7-11 Nov. 2004, pp. 448–451.
- [2] L. Benini and G. D. Micheli, "Networks on chips: a new SoC paradigm," *IEEE Computer*, vol. 35, no. 1, pp. 70–78, Jan. 2002.
- [3] L. Benini and G. D. Micheli, "Powering networks on chips," in *The 14th International Symposium on System Synthesis*, 2001, pp. 33–38.
- [4] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, *Digital Integrated Circuits: A Design Perspective*, 2nd ed. Prentice Hall, Dec 2002.
- [5] H. B. Bakoglu, *Circuits, Interconnections, and Packaging for VLSI*. Addison-Wesley, 1990.
- [6] P. P. P. Sotiriadis, "Interconnect modeling and optimization in deep sub-micron technologies," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.
- [7] S. R. Sridhara and N. R. Shanbhag, "Coding for system-on-chip networks: a unified framework," *IEEE Transactions on VLSI Systems*, vol. 13, no. 6, pp. 655–667, Jun. 2005.
- [8] A. Joshi and J. Davis, "A 2-slot time-division multiplexing (TDM) interconnect network for gigascale integration (GSI)," in *Proc. of the 2004 International Workshop on System Level Interconnect Prediction*, 2004, pp. 64–68.
- [9] D. Kim, M. Kim, and G. E. Sobelman, "CDMA-based network-on-chip architecture," in *Proc. of the IEEE Asia-Pacific Conference on Circuits and Systems*, vol. 1, 6-9 Dec. 2004, pp. 137–140.
- [10] I. Dhaou, E. Dubrova, and H. Tenhunen, "Power Efficient Inter-Module Communication for Digit-Serial DSP Architecture In Deep-Submicron Technology," in *Proc. of the 31st IEEE International Symposium on Multiple-Valued Logic*, 22-24 May 2001, pp. 61–66.
- [11] V. Raghunathan, M. Srivastava, and R. Gupta, "A Survey of Techniques for Energy Efficient On-Chip Communication," in *Proc. of the Design Automation Conference*, 2-6 Jun. 2003, pp. 900–905.
- [12] N. Shanbhag, "Reliable and efficient system-on-chip design," *Computer*, vol. 37, no. 3, pp. 42–50, Mar. 2004.
- [13] J. Liu, L.-R. Zheng, D. Pamunuwa, and H. Tenhunen, "A global wire planning scheme for Network-on-Chip," in *Proc. of the International Symposium on Circuits and Systems*, vol. 4, 25-28 May 2003, pp. 25–28.
- [14] Predictive technology model (PTM), www.eas.asu.edu/ptm.
- [15] A. Naeemi, R. Venkatesan, and J. Meindl, "Optimal global interconnects for GSI," *IEEE Transactions on Electron Devices*, vol. 50, no. 4, pp. 980–987, Apr. 2003.