## UCLA

## FPGA Polyphase Filter Bank Study \& Implementation

Raghu Rao<br>Matthieu Tisserand<br>Mike Severa<br>Prof. John Villasenor

Image Communications/Reconfigurable Computing Lab. Electrical Engineering Dept. UCLA

## Introduction

This document describes a polyphase filter bank and summarizes the results of a feasibility study of its implementation on FPGA-based architectures with respect to size, timing and bandwidth requirements

Under the SLAAC program, UCLA and Los Alamos National Labs have collaborated in mapping new adaptive algorithms to configurable computing platforms

## Project Goals

The current portion of the collaboration has involved the feasibilty and implementation of a Polyphase Filter bank using various FPGAs and hardware architectures.

The Polyphase implementation is a multi-rate filter structure combined with a DFT designed to extract subbands from an input signal

It is an optimization of the standard approaches and offers increased efficiency in both size and speed, aspects that are well suited to reconfigurable computing

Task heretofore implemented only in ASIC; offers a good opportunity as an example of migration from ASIC to an Adaptive platform

## Basic Project Parameters

- 128 Megasamples/sec input signal
- 16 distinct subband outputs
- Implement using Polyphase filter and DFT structure



## Polyphase Filter Architecture



## Polyphase Filter Architecture

Commutator:

- distributes signal to $n$ lines
- reduces clock speed by factor of $n$

Polyphase Filter bank:

- 32 1-input, 1-output polyphase filters or
- 16 1-input, 2-output optimized polyphase filters

FFT:

- $2 n$-point real FFT
- $n$-point complex FFT


## System Requirements



- 128 MHz system speed
- Note: 4 samples at 32 MHz equivalent to one sample at 128 MHz
- All lines are buses equal to sample precision (from 8 to 16 bits)
- Precision has been implemented as a generic in VHDL
- makes precision configurable
- allows easy assessment of precision's affect on feasibility


## What Happens Inside?

- Data will be sent to 32 filters...
- i.e., need to be latched and further demuxed by factor of 8
- Clock speed reduced also by factor of 8 to 4 MHz



## And then?

- Some work gets done
- Polyphase filtering, DFT: @ 4MHz
- note: using resource-sharing filter structures, initial decimation only by factor of 4 , smaller filter bank, work gets done @ 8 MHz (slides on this method later)



## And finally?

- 16 samples at 4 MHz are available to the remuxing logic
- 16 samples are required for system
- Re-Mux runs at 16 MHz and samples 4 DFT outputs at a time
- Results data has latency of a minimum of 12 clock cycles due to demux/remux (plus polyphase/DFT latency)



## Polyphase Filter Banks

The following slides describe the regular polyphase filter bank, the transpose form FIR filter, and optimizations based on symmetry

This is a symmetric FIR filter, i.e., the first $n / 2$ and the last $n / 2$ coeffs are the same, albeit in reverse order.

We can exploit this symmetry to implement an optimal form of the filter bank, using resource sharing.

We also describe two methods of exploiting resource sharing. The advantage of these schemes is the reduction in the size of both the filter bank and the commutator.

## The Polyphase Filter bank Design

- First step is to design a prototype low-pass, FIR filter $h(n)$ with the desired filter parameters
- The $I$ polyphase filters $p_{k}$, each of integer length $K=M / I$ are derived from the length $M$ FIR filter $h(n)$ via
- $p_{k}(n)=h(k+n I), k=0 . . I-1, n=0 . . K-1$
- ( $M$ is selected to be a multiple of $I$ )


## Our Polyphase Design

- $K=M / I: M=128, K=4, I=32$
- $p_{k}(n)=h(k+n I), k=0 . . I-1, n=0 . . K-1:$
- $p_{0}=h(0+0), h(0+32), h(0+64), h(0+96)$
- $p_{l}=h(1+0), h(1+32), h(1+64), h(1+96)$
- $p_{31}=h(31+0), h(31+32), h(31+64), h(31+96)$


## Our Polyphase Design

$$
\begin{array}{ll}
p_{0} & \mathbf{h}(0), \mathbf{h}(32), \mathbf{h}(64), \mathbf{h}(96) \\
p_{1} & \mathbf{h}(1), \mathbf{h}(33), \mathbf{h}(65), \mathbf{h}(97) \\
p_{2} & \mathbf{h}(2), \mathbf{h}(34), \mathbf{h}(66), \mathbf{h}(98)
\end{array}
$$

$p_{31} \quad \mathbf{h ( 3 1 ) , h ( 6 3 ) , h ( 9 5 ) , h ( 1 2 7 )}$

Decimate by 32


# Symmetry - how is it useful? 

## Symmetric filter bank



## Symmetry - how is it useful?

- Given an $n$-tap filter with coefficients $h(0 . . n)$
h0 h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13 h14 h15
- In a symmetric filter of $n$ taps, coefficient $h(i)=h(n-1-i)$, i.e., we can re-label the above filter coefficients as
h0 h1 h2 h3 h4 h5 h6 h7 h7 h6 h5 h4 h3 h2 h1 h0
- What does this mean for our polyphase structure?


## Symmetry - how is it useful?

- What does this mean for our polyphase structure?

| $h 0 \quad h 8$ |
| :---: |
| $h 1 \quad h 9$ |
| $h 2 h 10$ |
| $h 3 \quad h 11$ |
| $h 4 h 12$ |
| $h 5 h 13$ |
| $h 6 h 14$ |
| $h 7 h 15$ |



## Symmetry - how is it useful?

- What does this mean for a polyphase structure?
- We can reduce number of coefficient multipliers



## UCLA <br> Symmetry - how is it useful?

- What does this mean for a polyphase structure?

$$
\begin{aligned}
& . . x 15 . . x 7 \longrightarrow \text { h0 h7 } \longleftarrow x 0 \text {.. X8 .. } \\
& . . x 14 . . x 6 \rightleftharpoons \text { h1 h6 }<x 1 . . X 9 \text {.. } \\
& . . x 13 . . x 5 \longrightarrow \text { h2 h5 } \longrightarrow \text { x2.. X10 .. } \\
& . . x 12 . . x 4 \rightleftharpoons \text { h3 h4 } \longrightarrow \text { x11 .. }
\end{aligned}
$$

## UCLA

## The Commutator

The commutator is half the size for this architecture. After feeding 8 filters, it reverses direction.


It goes to filters $1,2,3,4,5, \ldots 8$ and then $8,7,6,5,4$,
3,2,1 and reverses direction again.

## Symmetry - how is it useful?

## Hardware Implementation

## Transpose Form of the FIR filter


$\square$ register
[/] 24

## Resource Sharing Optimization - Scheme 1



Clocked for even samples
Clocked for odd samples


## Comparison of Schemes

|  | Multipliers | Adders | Registers | Multiplexer | Demultiplxr |
| :--- | :---: | :---: | :---: | :---: | :--- |
| Regular | 128 | 96 | 128 | 0 | 0 |
| Scheme 1 | 64 | 96 | 128 | 0 | 0 |
| Scheme2 | 64 | 48 | 256 | 112 | 16 |

NOTE: schemes 1 and 2, also reduce the size of the commutator. With these schemes only a $\mathrm{N} / 2$ commutator is needed (decimate by 16 ).

## UCLA

Polyphase filter bank with resource shared filters


Since each filter convolves alternate samples, giving two outputs, one a convolution of even samples and the other a convolution of odd samples, it also acts to decimate by 2 . So, the initial decimator needs to decimate only by 16 .

## The Commutator

The commutator is half the size for this architecture. After feeding 16 filters, it reverses direction.


It goes to filters $1,2,3,4,5, \ldots 16$ and then $16 \ldots 6,5$,
$4,3,2,1$ and reverses direction again.

## UCLA

A clocking scheme to enable flipflops alternately
The flipflops in different colors need to be latch data alternately. When blue is on, green is off. This can be accomplished by a 2 phase clocking scheme.


Positive edge DFF Negative edge DFF


Clock divider circuit

## Alternate scheme using enable flipflops



Instead of positive and negative DFFs, enable FFs can be used to convolve alternate samples. This clock enable also can be used as the select line to the muxes and demuxes.

## Initial Studies

The initial work involved approaching the topic from a theoretical standpoint

- understand polyphase theory
- implement polyphase structure simulation
- DSP Canvas
- MatLab
- create filter based on design specs from Fiore's paper
- generate initial size estimates based on knowledge of the size of components and number of CLB's necessary to implement them on an FPGA


## Feasibility Experiments

These experiments evaluated the feasibility of implementing the polyphase filter bank on an Altera Flex10K250A (part EPF10K250AGC599-1) a Xilinx XC40150 (part XC40150XV-09-BG560) and a Xilinx VirtexXCV1000 (part XCV1000-4-BG560)

All experiments were synthesized using Synplify 5.1.4 and placed and routed with Maxplus2 9.1

The filter bank consisted of a decimator at the input, feeding a bank of either 16 or 32, 4 tap filters (filters optimized for symmetry have 2 outputs). The outputs of the filter bank feed a commutator that "re-muxes" data onto 4 lines that will feed a DFT (assumption that the DFT is on another chip).

## UCLA <br> Results for non-symmetry optimized filter bank

## Flex10K250A, part EPF10k250AGC599-1, does not fit.

The critical resource on an Altera Flex 10K is the carry chain (fast interconnect) routing.

32 filters, with 1 output each, not optimized for symmetry

Results for non-symmetry optimized filter bank

## Xilinx Virtex, part XCV1000-4-BG560

|  | AREA | MAX FREQ |
| :--- | :--- | :--- |
| $D=8, C=13$ | 2130 of 12288 CLB <br> slices, $17 \%$ of chip | 66.578 MHz |
| $D=13, C=13$ | 3740 of 12288 CLB <br> slices, $30 \%$$\quad 64.583 \mathrm{MHz}$ chip |  |

This has 32 filters, with 1 output each, not optimized for symmetry

$$
\begin{aligned}
& \text { D - data precision } \\
& \text { C - coeff precision }
\end{aligned}
$$

## UCLA <br> Results for symmetry optimized filter bank

## Flex10K250A, part EPF10k250AGC599-1

|  | AREA | MAX FREQ |
| :---: | :---: | :---: |
| $D=8, C=13$ | 4932 LEs $40 \%$ of chip | 22.47 MHz |
| $D=13, C=13$ | $\begin{aligned} & 8109 \text { LEs } \\ & 66 \% \text { of chip } \end{aligned}$ | 22.42 MHz |
| $D=14, C=13$ | $\begin{aligned} & \text { 8523 LEs } \\ & 70 \% \text { of chip } \end{aligned}$ | 22.02 MHz |
| $D=15, C=13$ | 8975 LEs. $76 \%$ of chip. No Fit, routing resources | No Fit. |

This has 16 filters, with 2 outputs each, optimized for symmetry

# Xilinx XC40150XV-09-BG560 

| AREA | MAX FREQ |  |
| :--- | :--- | :--- |
| $\mathbf{D = 8 , C = 1 3}$ | 2946 of 5184 CLB <br> slices, $56 \%$ of chip | 31.448 MHz |

This has 16 filters, with 2 outputs each, optimized for symmetry

## Xilinx Virtex XCV1000-4-BG560



This has 16 filters, with 2 outputs each, optimized for symmetry

## FFT Implementation

The following slides describe some optimizations of the FFT and how its inclusion into the system logic affects size and speed.

- Goal of system is 16 distinct positive frequency bins
- An N-point FFT produces N/2+1 distinct bins
- Our input sequence is real
- The FFT of a real valued sequence of 2 N points can be computed efficiently by employing an N-point complex FFT


## 32-point Real FFT Implementation

$\mathrm{X}(\mathrm{n})$, the 2 N point real sequence is divided into 2 , N -point sequences as follows:
$h(k)=x(2 k), k=0,1, \ldots ., N-1$
$g(k)=x(2 k+1), k=0,1, \ldots, N-1$
i.e.. The function $h(k)$ is equal to the even-numbered samples of $x(k)$, and $g(k)$ is equal to the odd-numbered samples.

A N-point complex valued sequence $y(k)$ can be written as
$\mathbf{y}(\mathbf{k})=\mathbf{h}(\mathbf{k})+\mathbf{j} \mathbf{g}(\mathbf{k})$
The DFT of $y(k)$ is then computed.

## FFT cont'd.

$$
\begin{aligned}
& \mathbf{Y}(\mathbf{k})=\mathbf{H}(\mathbf{k})+\mathbf{W}^{\mathbf{k}_{2 N}} \mathbf{G}(\mathbf{k}), \quad \mathrm{k}=0,1, \ldots, \mathrm{~N}-1 \\
& \mathbf{Y}(\mathbf{k}+\mathbf{n})=\mathbf{H}(\mathbf{k})-\mathbf{W}^{\mathbf{k}}{ }_{2 \mathrm{~N}} \mathbf{G}(\mathbf{k}), \quad \mathrm{k}=0,1, \ldots, \mathrm{~N}-1
\end{aligned}
$$

To compute the real and imag. parts of the output, $\mathrm{H}(\mathrm{k})$ and $\mathrm{G}(\mathrm{k})$ can be expressed in terms of even and odd components.
$\mathbf{H}(\mathbf{k})=\mathbf{R}_{\mathbf{e}}(\mathbf{k})+\mathbf{j} \mathbf{I}_{\mathbf{0}}(\mathbf{k})$
$G(k)=I_{e}(k)-j R_{0}(k)$

Substituting this in $\mathrm{Y}(\mathrm{k})$, we get,
$\mathbf{Y}(\mathbf{k})=\mathbf{Y}_{\mathbf{r}}(\mathbf{k})+\mathbf{j} \mathbf{Y}_{\mathbf{i}}(\mathbf{k})$, where

FFT of a 2 N point real sequence from a N point complex FFT


## UCLA

Area and delay numbers for the 32-point real FFT

| Data precision | Timing | Area | Percentage of chip |
| :---: | :---: | :---: | :---: |
| 13 | $\mathbf{1 6 . 0 3 ~ M H z}$ | $\mathbf{2 1 6 5}$ | $\mathbf{1 7 \%}$ of chip |
|  | 62.3 ns |  |  |
| 14 | $\mathbf{1 6 . 4 7 \mathrm { MHz }}$ | $\mathbf{2 4 1 3}$ | $\mathbf{1 9 \%}$ of chip |
|  | 60.7 ns |  |  |
| 15 | $\mathbf{8 . 8 9} \mathrm{MHz}$ | $\mathbf{2 6 5 4}$ | $\mathbf{2 1 \%}$ of chip |
|  | $\mathbf{1 1 2 . 4} \mathrm{ns}$ |  |  |
| 16 | $\mathbf{1 5 . 3 1 \mathrm { MHz }}$ | $\mathbf{2 8 7 6}$ | $\mathbf{2 3 \%}$ of chip |
|  | 65.3 ns |  |  |

## Altera Flex10K-250A GC599-1

Xilinx xc40150-09-bg560: Area 2530 out of $5184(48 \%$ of chip), 20.001 MHz . Xilinx Virtex xcv1000-4-bg560: Area 1754 out of $12288(14 \%$ of chip), 48.96 MHz . (virtex precision $13 \& 13$, XC40150: 8 \& 13)

## Full System Estimates

The entire polyphase filter bank along with the FFT does not fit on an Altera Flex device. But it does fit on the Xilinx XC40150 and Virtex.

Decimation factor $=32$, 17 positive frequency bins
Data precision $=13$, Coeff precision $=13$
Xilinx xc40150-09-bg560 ( $\mathrm{D}=8, \mathrm{C}=13$ )
4581 CLBs out of 5184-88\% of chip.
Freq: 20.492 MHz
Xilinx Virtex - xcv1000-4-bg560
7156 CLB slices out of 12288 - $58 \%$ of chip.(11631 LUTs).
Freq: 56.715 MHz .

Polyphase filter bank on a Xilinx XC40150XV-09-BG560

| DESIGN | XILINX <br> XC40150XV-09-BG560 <br> 8 point data precision <br> 13 point coeff precision <br> AREA |  | XILINX VIRTEX XCV1000-4-BG560 <br> 8 point data precision 13 point coeff precision |  | ALTERAFLEX10K250AGC599-18 point data precision13 point coeff precision |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fir filterbank | $\begin{gathered} 2946 / 5184 \\ \text { CLBs 56\% } \\ \text { of chip } \end{gathered}$ | $\begin{gathered} \text { Freq } \\ 31.448 \\ \mathrm{MHz} \end{gathered}$ |  |  | 4932/12160 <br> LEs $40 \%$ of chip | Freq 22.47 <br> MHz |
| 32 point real FFT | $\begin{gathered} 2530 / 5184 \\ 48 \% \text { of } \\ \text { chip } \end{gathered}$ | $\begin{gathered} \text { Freq } \\ 20.001 \\ \mathrm{MHz} \end{gathered}$ |  |  | 2092/12160 <br> LEs $17 \%$ of chip | $\begin{aligned} & \text { Freq } \\ & 19.08 \\ & \mathrm{MHz} \end{aligned}$ |
| Full design (polyphase filter bank+fft) | 4581/5184 $88 \%$ of chip | $\begin{gathered} \text { Freq } \\ 20.492 \\ \mathrm{MHz} \end{gathered}$ | $\begin{aligned} & 4195 / 12288 \\ & 34 \% \text { of chip } \end{aligned}$ | $\begin{gathered} \text { Freq } \\ 50.594 \\ \mathrm{MHz} \end{gathered}$ | No Fit |  |

## Area and delay estimation flow



RTL level schematics and design browser from Synplify



## Future Work

Simulate and test polyphase VHDL implementation using LANL test vectors

Work together with LANL to facilitate possible demo of polyphase work

Implement Scheme 2 of resource sharing symmetrical filter bank

Study the advantages and disadvantages with regards to system goals of FFT replacing the FFT with a DCT

Look into adaptive filtering techniques
Modifying our current polyphase design to accommodate configurable or even programmable rate

## Conclusions

Very productive intitial phase of collaboration between UCLA and LANL

Our work has resulted in some innovations at the algorithmic level

Task migration from ASIC to FPGA
This study has provided useful sizing information for the Altera Flex and Xilinx Virtex families as well as some initial benchmarks of basic DSP methods used in UWB

