# Noise Model Analysis of Optimized Mixed-Radix Structures for Pulsed OFDM

Kai-Chuan Chang and Gerald E. Sobelman Department of Electrical and Computer Engineering University of Minnesota, Minneapolis, MN 55455 USA {kaichang, sobelman}@umn.edu

#### Abstract

Pulsed OFDM (P-OFDM) is a proposed enhancement to Multi-Band Orthogonal Frequency Division Multiplexing which reduces the power and complexity of Ultra Wideband transceivers without sacrificing performance. In this paper, the effects of finite precision arithmetic in the mixed-radix Fast Fourier Transforms of P-OFDM architectures are analyzed using a noise model. The results of this analysis lead to the selection of optimal values for the wordlengths of the data and coefficients. Synthesis results based on these optimal wordlengths are presented for a Xilinx Virtex<sup>TM</sup>-4 FPGA implementation.

## 1. Introduction

The Pulsed-OFDM (P-OFDM) system has been proposed as an enhancement to the multi-band OFDM (MB-OFDM) standard having reduced complexity and power consumption [1][2][3]. It preserves the band planning of the MB-OFDM system but replaces normal OFDM symbols in each sub-band with a pulsed OFDM symbol. The P-OFDM signal is generated by upsampling a normal OFDM signal by a factor of K. The upsampling process spreads the spectrum of the original signal by repeating the spectrum of the modulated signal in the frequency domain and effectively produces a frequency repetition OFDM scheme that provides transmitter diversity for each sub-carrier at the receiver. Figure 1 shows the diagram of a P-OFDM transceiver using K = 4.

This paper examines the finite wordlength effects in 32-point mixed-radix FFT processors, which are key elements of P-OFDM transceivers. A radix-2 FFT structure will be used as a baseline model for comparison purposes. Errors due to arithmetic round-off are determined and the actual synthesized FPGA resource requirements are given so that a system designer can make informed performance/area trade-offs.

## 2. FFT Structures for P-OFDM Systems

Several pipelined 32-point FFT architectures have been proposed for the P-OFDM system [1][2]. In the Buffered Radix-2 Multi-path Delay Commutator (BR2MDC) structure, input data are stored in a 4-by-16 sample buffer RAM and then passed on to the processor for computation. A 32-point BR2MDC has 5 stages of pipelined processing for each 32-point FFT computation. A block diagram of this architecture is shown in Figure 2. The block contains radix-2 butterflies (R2BFs), complex multipliers (WA0, WA1, and WA2), commutators (SA0, SA1, SA2, and SA3, each consisting of 16 2:1 MUXs) and a control unit.

The number of stages in this design can be reduced by using a mixed-radix FFT architecture. In the Buffered Mixed Radix Multi-path Delay Commutator (BMRMDC442 for a 32-point FFT), two radix-4 butterflies (R4BFs) are used in the first two stages, followed by a pair of radix-2 butterflies in the last stage. Figure 3 shows the block diagram of a 32-point BMRMDC442 FFT, which contains radix-4 butterflies, radix-2 butterflies, complex multipliers (WB0 and WB1), commutators (SB0 and SB1, consisting of 8 4:1 MUXs and 8 2:1 MUXs, respectively) and a control unit.



Figure 1: P-OFDM transceiver structure.



Figure 2: 32-point BR2MDC FFT.



Figure 3: 32-point BMRMDC442 FFT.





A third possible 32-point buffered, mixed-radix FFT architecture uses a radix-8 butterfly (R8BF) and two radix-4 butterflies, as shown in Figure 4 (BMRMDC84 FFT). Input data are stored in an 8-by-8 sample buffer RAM and then passed on to three processing elements (R8BF, complex multiplier WC0 and R4BFs) for the FFT computation. In addition, it contains 16 4:1 MUXs in the SC0 commutator block and a control unit.

## 3. Noise Propagation Models

In the following subsections, we analyze the finite precision effects of each processing element and then assemble them into an overall noise propagation model for each of these mixed-radix FFT architectures.

#### 3.1. Processing Element Noise Models

Figure 5 shows a single-path noise model of the radix-2 butterfly (R2BF); similar paths can be obtained for the other three outputs of the R2BF. Two $B_i$-bit-wide input signals $\{a_r, b_r\}$, along with input noise sources $\{n_{a_r}, n_{b_r}\}$, are passed into an adder. The output wordlength of the adder is $B_i + 1$ to avoid overflow or saturation. Rounding can be applied at the output of the adder to keep the output wordlength the same as that of each input. In order to analyze noise in the R2BF, all input components $\{a_r, b_r, n_{a_r}, n_{b_r}\}$ are modeled as zero-mean uncorrelated random variables [4][5][6]. The input signal powers are assumed to be equal for each path, as are the input noise powers ($\sigma_{a_r}^2 = \sigma_{b_r}^2 = \sigma_s^2$ and $\sigma_{n_{a_r}}^2 = \sigma_{n_{b_r}}^2 = \sigma_n^2$). The rounding noise $n_r$ is modeled as uniformly distributed white noise [7][8][9][10]. For unbiased rounding of the LSB, the power of $n_r$ is $\sigma_{n_r}^2 = \frac{\Delta_r^2}{12} = \frac{2^{-2B_i + 2L_a - 2}}{3}$, where $\Delta_r$ is the rounding step size. Because the inputs are uncorrelated, the signal powers and the noise powers each add at the adder, and the rounding noise adds to the result. The output signal and noise powers for an R2BF are therefore

$$\sigma_{y}^{2} = 2\sigma_{s}^{2}, \quad \sigma_{n_{y}}^{2} = 2\sigma_{n}^{2} + \frac{2^{-2B_{i}+2L_{a}-2}}{3}$$
(1)
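The $\Delta_r^2/12$ rounding-noise power used above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes round-half-up quantization of uniformly distributed inputs, and the function name `rounding_noise_power`, the step size and the sample count are illustrative choices.

```python
import math
import random


def rounding_noise_power(step, samples=200_000, seed=1):
    """Estimate the variance of the error made by rounding to multiples of `step`."""
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        rounded = math.floor(x / step + 0.5) * step  # round half up
        err = rounded - x
        total += err
        total_sq += err * err
    mean = total / samples
    return total_sq / samples - mean * mean


step = 2.0 ** -8                  # assumed rounding step (illustrative)
measured = rounding_noise_power(step)
predicted = step ** 2 / 12        # uniform white-noise model
print(f"measured {measured:.3e}, predicted {predicted:.3e}")
```

The measured variance should agree with the model to within the sampling error of the estimate, which is well under one percent for this sample count.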







Figure 6: Non-trivial Multiplier

For the non-trivial multiplier block in an FFT structure, the noise model of the real (in-phase) output path is shown in Figure 6. A similar model can be obtained for the imaginary (quadrature) output. The $B_x$-bit-wide inputs are multiplied by the corresponding twiddle factors ($e^{-j2\pi k/N}$), which are rounded to $B_w$ bits with coefficient rounding noise $\{n_{w1}, n_{w2}\}$. The output signal and noise powers are as follows:

$$\begin{aligned} \sigma_z^2 &= \sigma_s^2 \\ \sigma_{n_z}^2 &= \sigma_{n_x}^2 + 2\left(\sigma_s^2 + \sigma_{n_x}^2\right)\frac{2^{-2B_w}}{3} + \frac{2^{2(L_m - B_x - B_w)}}{3} \end{aligned}$$
(2)
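The coefficient term in Equation (2) can be motivated as follows. This is a sketch under assumptions that are not stated explicitly in the text: the twiddle factors are taken to be stored as $B_w$-bit two's complement fractions in $[-1, 1)$, so that the coefficient step size is $\Delta_w = 2^{-(B_w - 1)}$, and the coefficient errors $\{n_{w1}, n_{w2}\}$ are treated as uniformly distributed and uncorrelated with the inputs. The symbols $\Delta_w$, $\sigma_{n_w}^2$ and $\sigma_{\text{coef}}^2$ are introduced here for illustration only.

$$\begin{aligned} \sigma_{n_w}^2 &= \frac{\Delta_w^2}{12} = \frac{2^{-2B_w + 2}}{12} = \frac{2^{-2B_w}}{3} \\ \sigma_{\text{coef}}^2 &= 2\left(\sigma_s^2 + \sigma_{n_x}^2\right)\sigma_{n_w}^2 = 2\left(\sigma_s^2 + \sigma_{n_x}^2\right)\frac{2^{-2B_w}}{3} \end{aligned}$$

The factor of 2 arises because each real output path contains two real products, and in each one a coefficient error multiplies an input of total power $\sigma_s^2 + \sigma_{n_x}^2$.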

For a radix-4 butterfly, four  $B_i$  bit wide inputs are added within the block and rounding can be applied at the output node to reduce the hardware required. The output signal and noise powers are as follows:

$$\sigma_{y}^{2} = 4\sigma_{s}^{2}, \quad \sigma_{n_{y}}^{2} = 4\sigma_{n}^{2} + \frac{2^{-2B_{i}+2L_{a}-4}}{3}$$
(3)

There are two noise models for the last processing element, a radix-8 butterfly. The average output signal and noise powers for the upper 4 complex output paths of the R8BF can be modeled as in Equation (4):

$$\sigma_{y}^{2} = 8\sigma_{s}^{2}, \sigma_{n_{y}}^{2} = 8\sigma_{n}^{2} + \frac{2^{-2B_{i}+2L_{c}-6}}{3}$$
(4)

A second model is used for the lower 4 complex output paths of the R8BF, as shown in Figure 8. An internal constant multiplier ($\alpha = \pm \sqrt{2}/2$) is required, where $\alpha_r$ is the constant rounding error between $\alpha$ and its corresponding $B_w$-bit-wide two's complement representation. The output signal and noise powers for this model of the R8BF are:

$$\begin{aligned} \sigma_{y}^{2} &= 8\sigma_{s}^{2} \\ \sigma_{n_{y}}^{2} &= 8\sigma_{n}^{2} + \alpha_{r}^{2} \left(8\sigma_{s}^{2} + 8\sigma_{n}^{2}\right) + 2\sigma_{r1}^{2} + \sigma_{r2}^{2} \end{aligned}$$
(5)

The overall average noise model for the R8BF can be obtained from these two models and is shown in equation (6):

$$\begin{aligned} \sigma_{y}^{2} &= 8\sigma_{s}^{2} \\ \sigma_{n_{y}}^{2} &= 8\sigma_{n}^{2} + \alpha_{r}^{2} \left(4\sigma_{s}^{2} + 4\sigma_{n}^{2}\right) + \sigma_{r1}^{2} + \frac{\sigma_{r}^{2} + \sigma_{r2}^{2}}{2} \end{aligned}$$
(6)
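Equation (6) can be checked by averaging the noise powers of Equations (4) and (5) with equal weight, since each model covers four of the eight complex outputs. The step below assumes that $\sigma_r^2$ denotes the rounding term of Equation (4), $\sigma_r^2 = \frac{2^{-2B_i + 2L_c - 6}}{3}$; this identification is inferred from context rather than stated in the text.

$$\begin{aligned} \sigma_{n_y}^2 &= \frac{1}{2}\left[8\sigma_n^2 + \sigma_r^2\right] + \frac{1}{2}\left[8\sigma_n^2 + \alpha_r^2\left(8\sigma_s^2 + 8\sigma_n^2\right) + 2\sigma_{r1}^2 + \sigma_{r2}^2\right] \\ &= 8\sigma_n^2 + \alpha_r^2\left(4\sigma_s^2 + 4\sigma_n^2\right) + \sigma_{r1}^2 + \frac{\sigma_r^2 + \sigma_{r2}^2}{2} \end{aligned}$$

The signal power is $8\sigma_s^2$ in both models, so its average is unchanged.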

#### 3.2. 32-Point FFT Noise Models

The overall noise model for each FFT architecture is obtained by cascading the noise models for the corresponding processing elements. As there is no noise associated with trivial multiplications, the noise due to non-trivial multiplications is weighted by a factor equal to the fraction of non-trivial multiplications at each stage k. We denote this fraction as  $\rho_k$ .

For the 32-point BR2MDC architecture, the overall noise model is constructed by cascading 8 processing elements. The fraction of non-trivial multipliers in stage k for this case is given in Equation (7).

$$\rho_{k} = \begin{cases} \frac{1}{2} - \frac{2^{k+1}}{N} & ;\ k = 0, 1, \ldots, \lceil \log_{2} N \rceil - 2 \\ 0 & ;\ k \ge \lceil \log_{2} N \rceil - 1 \end{cases}$$
(7)

For the BMRMDC442 FFT, the overall noise model is shown in Figure 7, which cascades five stages of processing per output path. The fraction of non-trivial multipliers in a radix-4 stage is given by the following equation:

$$\rho_{k} = \begin{cases} \frac{3}{4} - \frac{4^{k+1}}{N} & ;\ k = 0, 1, \ldots, \lceil \log_{4} N \rceil - 2 \\ 0 & ;\ k \ge \lceil \log_{4} N \rceil - 1 \end{cases}$$
(8)
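The five-stage cascade for the BMRMDC442 can be sketched in a few lines of Python. This is an illustrative reading of the model, not the authors' implementation. It assumes unit input signal power, identical wordlengths at both multipliers, no rounding at the butterfly outputs, and `l_m = b_w` (the product is rounded back to the input wordlength). The function names `butterfly`, `multiplier` and `bmrmdc442_output_snr_db` are hypothetical.

```python
import math


def butterfly(sig, noise, radix, round_noise=0.0):
    """Radix-r butterfly: signal and noise powers scale by r, plus optional rounding noise."""
    return radix * sig, radix * noise + round_noise


def multiplier(sig, noise, rho, b_x, b_w, l_m):
    """Eq. (2), with the noise added by the multiplier weighted by rho."""
    coeff = 2.0 * (sig + noise) * 2.0 ** (-2 * b_w) / 3.0   # coefficient rounding
    rounding = 2.0 ** (2 * (l_m - b_x - b_w)) / 3.0         # product rounding
    return sig, noise + rho * (coeff + rounding)


def bmrmdc442_output_snr_db(input_snr_db, b_x, b_w, l_m, n_points=32, sig_in=1.0):
    """Cascade R4BF -> multiplier -> R4BF -> multiplier -> R2BF."""
    sig = sig_in                                   # assumed input signal power
    noise = sig / 10.0 ** (input_snr_db / 10.0)
    # Eq. (8) for k = 0, 1 with N = 32 gives 5/8 and 1/4.
    rho = [0.75 - 4 ** (k + 1) / n_points for k in range(2)]
    sig, noise = butterfly(sig, noise, 4)
    sig, noise = multiplier(sig, noise, rho[0], b_x, b_w, l_m)
    sig, noise = butterfly(sig, noise, 4)
    sig, noise = multiplier(sig, noise, rho[1], b_x, b_w, l_m)
    sig, noise = butterfly(sig, noise, 2)
    return 10.0 * math.log10(sig / noise)


for snr_in in (10, 30, 50, 70):
    snr_out = bmrmdc442_output_snr_db(snr_in, b_x=10, b_w=12, l_m=12)
    print(f"input {snr_in} dB -> output {snr_out:.2f} dB")
```

Under these assumptions the butterflies leave the SNR unchanged, so all of the degradation comes from the two weighted multiplier stages, and it becomes visible only once the input noise falls to the level of the arithmetic noise.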

Finally, the noise model for the BMRMDC84 FFT architecture is shown in Figure 8, which cascades three stages of processing elements. The fraction of non-trivial multipliers in a radix-8 stage is obtained from the following equation:

$$\rho_{k} = \begin{cases} \frac{7}{8} - \frac{8^{k+1}}{N} & ;\ k = 0, 1, \ldots, \lceil \log_{8} N \rceil - 2 \\ 0 & ;\ k \ge \lceil \log_{8} N \rceil - 1 \end{cases}$$
(9)
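The closed forms in Equations (7) to (9) can be cross-checked for $N = 32$ by counting twiddle factors directly. The sketch below assumes a decimation-in-frequency ordering in which stage $k$ of a radix-$r$ FFT applies twiddles $W_M^{nq}$ with $M = N/r^k$, $n = 0, \ldots, M/r - 1$ and $q = 0, \ldots, r - 1$, and treats a twiddle as trivial when it equals $\pm 1$ or $\pm j$. The function names are hypothetical.

```python
from fractions import Fraction


def rho_formula(radix, k, n_points):
    """Closed form used in Eqs. (7)-(9): (r - 1)/r - r**(k + 1)/N."""
    return Fraction(radix - 1, radix) - Fraction(radix ** (k + 1), n_points)


def rho_counted(radix, k, n_points):
    """Fraction of twiddles W_M^(n*q) that are not one of +1, -1, +j, -j.

    Assumes sub-transforms of length M = N / r**k at stage k, with M
    divisible by both 4 and the radix.
    """
    m = n_points // radix ** k
    quarter = m // 4                      # exponent spacing of trivial twiddles
    exponents = [n * q for n in range(m // radix) for q in range(radix)]
    nontrivial = sum(1 for e in exponents if e % quarter != 0)
    return Fraction(nontrivial, len(exponents))


N = 32
STAGES = {2: range(4), 4: range(2), 8: range(1)}  # stage indices checked here
for radix, stage_indices in STAGES.items():
    for k in stage_indices:
        print(radix, k, rho_formula(radix, k, N), rho_counted(radix, k, N))
```

For these stages the counted fraction agrees with the closed form, which supports reading $\rho_k$ as the share of multiplier operations that contribute noise.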



Figure 7: BMRMDC442 noise model



## 4. Results

In this section, the noise propagation models are used to determine an optimal internal wordlength for each of the processing elements in order to conserve hardware. The output of each non-trivial multiplier is rounded to the same wordlength as its input, and no rounding is applied to the outputs of the butterflies. Figure 9 shows the output SNR of the FFT computation as a function of the input SNR. At lower input SNR, the BMRMDC442 has the best output SNR and the BMRMDC84 has the worst. As the input SNR increases, the number of required stages becomes the dominant factor: at high input SNR, architectures with more stages have lower output SNR. The BMRMDC442 architecture appears to have the best overall performance and was therefore selected for further analysis.

The computational errors due to finite precision arithmetic were determined using the actual synthesized hardware on a Xilinx Virtex<sup>TM</sup>-4 FPGA. Each error value is computed to be the average error over 1 million randomly generated frames.

In Figure 10, each curve represents a different selection of twiddle factor wordlength. The curves reach a computational error floor when the input wordlength approaches the twiddle factor wordlength. Furthermore, the computational error floor drops faster than linearly as the twiddle factor wordlength increases (drops of 5.25, 6 and 6.75 dB as the twiddle factor wordlength increases from 6 to 8, from 8 to 10 and from 10 to 12, respectively).

In Figure 11, each curve represents a different selection of input wordlength for various twiddle factor wordlengths. The computational error floor increases linearly with linearly increasing twiddle factor wordlength (approximately 3 dB per increment of input wordlength). From these two plots, it is evident that the twiddle factor wordlength has more influence than the input wordlength on the performance of the 32-point BMRMDC442 architecture.







Figure 10: Total computation Errors of BMRMDC442 with varying input wordlength



Figure 11: Computation Errors of BMRMDC442 with varying twiddle factor wordlength

Table 1 shows the synthesized hardware requirements of the three FFT architectures on a Xilinx Virtex<sup>TM</sup>-4 xc4vsx35-10ff668 FPGA. The input data wordlength is set to 10 with a twiddle factor wordlength of 12 for all three architectures. The table shows the required number of slices, flip-flops, look-up tables (including those used for logic, route-through and shift registers), DSP48s and the total equivalent gate count of an ASIC implementation. As expected, the number of required resources increases going from the BR2MDC to the BMRMDC442 to the BMRMDC84 (except for flip-flops and shift-register LUTs, which decrease slightly from the BMRMDC442 to the BMRMDC84). Due to the large size of the R8BF, the increases in required resources and total equivalent gates are significant in going from the BMRMDC442 to the BMRMDC84.

| Architecture | Slices | F.F. | LUTs (Logic) | LUTs (Route-through) | LUTs (Shift registers) | DSP48s | Equivalent gate count |
|--------------|--------|------|--------------|----------------------|------------------------|--------|-----------------------|
| BR2MDC       | 553    | 242  | 664          | 62                   | 162                    | 12     | 18820                 |
| BMRMDC442    | 1047   | 407  | 1435         | 124                  | 252                    | 24     | 33811                 |
| BMRMDC84     | 1873   | 368  | 3053         | 164                  | 240                    | 44     | 51187                 |

Table 1: Synthesized hardware requirements for the FFTs.

## 5. Conclusions

The noise models described in this paper can be used as a design tool for analyzing the performance of FFT structures as a function of the data and coefficient wordlengths. The models consider the rounding errors of the complex twiddle factors and the butterfly elements and average the propagated noise by appropriately weighting between trivial and non-trivial multipliers. In particular, the models have been applied to analyze the performance of mixed-radix FFT architectures used in Pulsed-OFDM transceivers. Actual hardware resource requirements were also presented and simulation results were given for the synthesized design. The 32-point BMRMDC442 architecture was found to have a good balance between its performance and its hardware requirements and is therefore suitable for use in P-OFDM systems.

## 6. Acknowledgements

This research was supported by NSF Grant No. CCR-0313224 and by an equipment grant from Intel Corporation.

## 7. References

[1] Kai-Chuan Chang, G.E. Sobelman, E. Saberinia, and A.H. Tewfik, "Transmitter architecture for pulsed OFDM," IEEE APCCAS, Dec 2004, pp. 693-696.

[2] Kai-Chuan Chang, G.E. Sobelman, E. Saberinia, and A.H. Tewfik, "Implementation of a Multi-band Pulsed-OFDM Transceiver," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 2006.

[3] E. Saberinia, J. Tang, A.H. Tewfik and K. Parhi, "Design and Implementation of Multi-band Pulsed-OFDM system for wireless personal area networks," IEEE ICC, June 2004, pp. 862-866.

[4] A.V. Oppenheim and R.W. Schafer, *Discrete-Time Signal Processing*, Prentice Hall, 1989.

[5] S.K. Mitra, *Digital Signal Processing*, 2nd edition, 2002.

[6] Randall B. Perlow and Tracy C. Denk, "Finite Wordlength Design for VLSI FFT Processors," IEEE Trans on Signals, Systems and Computers, pp. 1227-1231, Nov. 2001.

[7] K.K. Parhi, *VLSI Digital Signal Processing Systems Design and Implementation*, Wiley Inter-Science, 1999.

[8] L. Wanhammar, *DSP Integrated Circuits*, Academic Press, 1999.

[9] H.G. Rey and C. Galarza, "Finite Word Length Analysis of the Radix-2<sup>2</sup> FFT," European Signal Processing Conference, September 2004.

[10] V. Ivanovic, L. Stankovic, and D. Petranovic, "Finite Word-Length Effects in Implementation of Distributions for Time-Frequency Signal Analysis," IEEE Trans on Signal Processing, vol. 46, no. 7, July 1998.