White Paper

Accelerating 2D FFTs

Using HBM2 and oneAPI on Stratix 10 MX

Overview

FPGAs with HBM2 memory are faster than GPU chips when algorithms include “corner turns”. This is because the FPGA can overlap the corner turn operation with computation, making it “free” from a latency perspective. We demonstrate the technique using a 2D FFT and provide the source code below. To keep things “fair”, that code uses a data type which GPUs also support. FPGAs can easily process any data type.

BittWare previously created a 2D FFT kernel for FPGAs using Intel’s OpenCL compiler. We have now rewritten that code to leverage Intel’s oneAPI programming model, specifically its DPC++ programming language. We achieved the similar performance on our 520N-MX card. Intel is promoting oneAPI as the best toolchain for applications that require GPU or FPGA acceleration.

Why did we implement a 2D FFT ?

The 2D FFT is often found in FPGA IP libraries and thus is not something programmers generally implement themselves. However, a common implementation strategy for 2D FFT on parallel hardware involves a “corner turn” or “data transposition” step that poses a major performance bottleneck on CPUs and GPUs.

Free Corner Turn

The insight we highlight in this demonstration is that an FPGA implementation can perform data transposition in parallel with computation, making it almost “free” from a latency perspective.

A GPU cannot do the same because GPU architectures do not have enough memory inside the GPU to pipeline intermediate results without touching HBM2/GDDR6 memory.

Using HBM2 on FPGAs using oneAPI

The Stratix 10 MX has 32 pseudo HBM2 memory channels. Our 2D FFT implementation uses half those channels.

We can place two 2D FFT kernels in the chip. One kernel uses all the pseudo channel interfaces on the north side of the chip. The other uses the channels on the south side.
Trying to use all 32 channels in a single 2D FFT creates routing challenges we did not want to deal with in demonstration code.

Each 2D FFT kernel consumes about 30% of the DSP resources of the MX2100 FPGA we used.

One Row per FPGA HBM2 Channel

The 2D FFT is implemented using many 1D FFTs. A single 1D FFT saturates the HBM2 channel it uses. With 16 HBM2 channels, our 2D FFT implementation processes 16 rows of the 1024 row matrix in parallel. A high-end GPU can probably process all 1024 rows in a single pass. The FPGA requires 64 passes. However, number of passes does not matter as HBM2/GDDR6 bandwidth is the limiting factor in both cases.

400 MHz Fmax is Enough

We need a 400 MHz OpenCL clock frequency to saturate each HBM2 channel doing the optimal 32-byte bursts. Going faster has no value. Going slower is not optimal. Intel’s current implementation of DPC++ leverages BittWare’s existing OpenCL board support package. Thus both implementations look like OpenCL to the FPGA. Both implementation require the same tweaks to achieve 400 MHz. These tweaks are just expressed differently in source code.

1D FFT Output

The 1D FFT does not output results directly into HBM2. Instead it writes into FPGA internal memory. Additional FPGA logic collects the results of multiple 1D FFTs, corner turns those results, and writes them back to HBM2, all in parallel with computation.

1D FFT Delayed Start

We don’t start the 16 1D FFTs on the same clock cycle. Instead we stagger their starting points by one complex element each. Doing this allows us to pipeline the 1D FFT results into the corner turn logic.

32-byte Bursts

We need to wait for the output of four 1D FFTs in order to be able to write the transposed data to HBM in a 32-byte chunk.

Ease of Programming Using oneAPI

BittWare’s development team uses RTL (Verilog and VHDL), HLS (mostly C++), OpenCL, and now DPC++. We previously published a white paper comparing RTL with HLS. This project was our first opportunity to compare OpenCL with DPC++.

We are pleased to report that moving from OpenCL code for FPGA to DPC++ for FPGA was easy. The adaptations specific to FPGAs and the HBM2 on-package memory were very similar to the OpenCL version. Data movement on and off the host computer was simple. We even called our 2D FFT kernel from NumPy which we never attempted using the OpenCL version.

Overview video of oneAPI for FPGA Acceleration

Additional Details

Internal Memory to Spare

We are using 512 M20K memories (12.5% of an MX2100).

2D FFT Parameters

The length of FFTs implemented for FPGAs is typically set at bitstream compile time.

Our demonstration uses a fixed 1024 by 1024 matrix of complex, 32-bit floating point numbers.
Our FFT implementation interleaves the components of the complex number (as opposed to using separate real and imaginary matrices).
The alignment does not really matter as the data is copied into HBM memory where it will become 32-byte aligned.

1D FFT Implementation

We implemented our own 1D FFT (as opposed to using somebody’s FPGA IP library). The 1D FFT is radix 2 and fully pipelined. It is not in-place.

Performance

2D FFT performance will always be limited by the HBM2 bandwidth. Thus our focus is to find an algorithm that allows us to maximize HBM2 bandwidth. That also means we can use HBM2 bandwidth benchmarks to exactly estimate FFT performance.

The peak HBM2 performance for a batch 1 implementation, with two independent kernels in the same device, is 291 GBytes/Sec. When pipelining/batching a peak bandwidth of 337 GBytes/Sec is possible.

Conclusion

We were pleased to be able to quickly port our 2D FFT from an OpenCL version to oneAPI. While the work to optimize the approach for an HBM2-enabled FPGA card was already done, it still gives OpenCL-based users an excellent path to move to oneAPI on BittWare hardware with minimal impact.

Working with HBM2 on FPGAs allows for some significant advantages in certain applications such as we have seen here. If you are seeking more information on the code we presented, please get in touch using the form below to inquire about availability, or to learn more about the 520N-MX FPGA card from BittWare.

Request the Source Code

You can request the open-source TAR file by filling out this form. The code you will receive has open source legends.

How to use the code:

We compile the OneAPI 2D FFT into a Linux dynamically linked, shared object library. Just type “cmake .” to create that library.

We call that library from a simple NumPy script which creates some input data and then either validates or plots the output. Type “python runfft.py” to see a graph.

The software above was tested on Centos 8 with the Intel OneAPI tools Beta 10 release using a BittWare 520N-MX card which hosts an MX2100 FPGA. The released OpenCL board support package for that card also allows OneAPI to work

"*" indicates required fields

Notice: Most of our downloads require end-user verification, which may require additional information. We will reach out to you as required for this information.

Name*

First Last

Corporate Email

Email*

Please use a corporate email address as we will often need to validate your eligibility based on company email for the download.

Notice: Please use a corporate email to avoid delays validating your download request.

Phone*

Company/Organization*

Company Division

Address and City*

Street Address Address Line 2 City State / Province / Region ZIP / Postal Code Country

Where to next?

Comparing RTL to HLS

Read about another BittWare comparison of two development approaches

Learn More

520N-MX

Learn more about the card we used

Learn More

Back to Resources

Learn More

Powered by the latest FPGAs and SoCs from Achronix, AMD, and Altera, our cards are designed and manufactured in-house for enterprise-class performance.

Browse by Silicon Technology:

Early Access Program

BittWare is launching new 3U VPX solutions! Request to join our early access program to engage with experts on our plans.

Card-Level Products

Our RFX family of cards leverage the AMD RFSoC chip with our own analog amplification/filtering and more.

WaveBox RF Servers and Enclosures

Taking RFX in an integrated, modular approach with up to 12 analog in/out in a 1U server.

Finding the right server and configuring it can be time-consuming. We've designed TeraBox servers to be ready to go from the start, saving you time if you're developing and for deployment: a robust solution suitable for the toughest challenges.

WaveBox RF

Looking for analog + digital in one box? Our direct RF-focused WaveBox servers and enclosures are a perfect fit!

BittWare Partners and internal projects gives you an easy way to get started quickly.

Data Movers + DMA + RDMA

Network Offload

Precision Time

More Network Acceleration

Open Source + Free

These solutions don't need FPGA programmers, rather they are software configurable and built on BittWare hardware!

Data/Packet Capture + Record

P2P and Storage

Financial Services/Fintech

LMS has created ÜberNIC, pre-programmed with the entire network stack in hardware.

AI + Machine Learning

Vendor Tools we Suport

From RFX PCIe cards to our modular integrated WaveBox RF soutions, we have your RFSoC needs covered!

Browse our RF Products:

Early Access Program

BittWare is launching new VPX and VNX+ solutions! Request to join our early access program to engage with experts on our plans.

The ability to tailor the application to the silicon is a major win for FPGAs in the HPC space. We've also seen AI/ML use cases where these programmable devices can run more efficiently than even competing GPUs.

AI/ML Partner Solutions:

Networking covers a wide range of use cases, which is why we also have a large portfolio of solutions!

Offload Engines

Need TCP/UDP offload? Our partners offer premium IP cores ready to integrate into your project.

MACsec + IPsec

Featuring Xiphera's IP running these popular security protocols in hardware has never been easier!

RDMA

RDMA over Converged Ethernet (RoCE v2) system implementation and integration, from Grovf.

SmartNIC

LMS has created ÜberNIC, pre-programmed with the entire network stack in hardware.

Open Source + Free

Whether you're after ultra-low-latency trade performance or simply need a high-performance NIC optimized for fintech, BittWare has a suite of solutions from experts like LMS and Exegy.

More Fintech Solutions:

Building on accelerators like FPGAs is a smart way to get more from your investment. Broadcast video is moving away from legacy pre-configured pipelines to software-defined architecture but with hardware doing the heavy lifting.

Move your algorithm to the data, not data to the algorithm with our Storage and P2P partner Eideticom.

Jump Directly to Eideticom NoLoad Solutions:

Storage Webinars On Demand:

White Paper

Accelerating 2D FFTs

Using HBM2 and oneAPI on Stratix 10 MX

Overview

Why did we implement a 2D FFT ?

Free Corner Turn

Using HBM2 on FPGAs using oneAPI

One Row per FPGA HBM2 Channel

400 MHz Fmax is Enough

1D FFT Output

1D FFT Delayed Start

32-byte Bursts

Ease of Programming Using oneAPI

Overview video of oneAPI for FPGA Acceleration

Additional Details

Internal Memory to Spare

2D FFT Parameters

1D FFT Implementation

Performance

Conclusion

Request the Source Code

How to use the code:

Notice: Most of our downloads require end-user verification, which may require additional information. We will reach out to you as required for this information.

Corporate Email

Notice: Please use a corporate email to avoid delays validating your download request.

Where to next?

Comparing RTL to HLS

520N-MX

Powered by the latest FPGAs and SoCs from Achronix, AMD, and Altera, our cards are designed and manufactured in-house for enterprise-class performance.

Browse by Accelerator Manufacturer:

Early Access Program

BittWare is launching new 3U VPX solutions! Request to join our early access program to engage with experts on our plans.

Card-Level Products

Our RFX family of cards leverage the AMD RFSoC chip with our own analog amplification/filtering and more.