BittWare
On-Demand Webinar

Using Intel® oneAPI™ to Achieve High-Performance Compute Acceleration with FPGAs

Join BittWare and Intel as we look at oneAPI™ with a focus on FPGAs. We will look at a real-world 2D FFT acceleration example which utilizes the Intel® Stratix® 10 MX including HBM2 memory on BittWare’s 520N-MX card.

Video Transcript

(Marcus)

Thanks for joining our webinar covering Intel’s oneAPI with a focus on high-performance compute acceleration using FPGAs. Let me start by briefly introducing our presenters and what they will cover.

First up is Craig Petrie with BittWare. He started his career as an FPGA engineer at Nallatech, and currently serves as marketing VP for BittWare. He’ll explain how FPGAs fit in with HPC, plus some of the BittWare cards and our support for oneAPI.      

Next, we’ll hear from David Clarke at Intel. He’s a technical sales specialist and will give an overview of the ecosystem, software, and platforms at Intel for better harnessing the power of FPGAs, especially by bringing in an easier software-driven development flow.

Then Maurizio Paolini will explain why Intel is introducing oneAPI. He’s a field application engineer with over twenty years at Altera, moving to focus on Cloud and Enterprise acceleration after the purchase of Altera by Intel. He’ll explain how oneAPI is solving programming challenges by enabling easier development for acceleration across multiple architectures.         

Finally, we will hear from Richard Chamberlain, principal systems engineer at BittWare. Richard created the original 2D FFT demonstration in OpenCL for BittWare’s FPGA cards and has now ported it to oneAPI, so he’ll share details on the results.

Note that the 2D FFT code is available, by request, on the BittWare website in the resources section.

Now if you’re watching this during one of our live slots, you’ll want to get ready for the Q&A session with our panelists after the main presentation. To ask a question, either now or during the Q&A, look for the questions panel and type out your question. I will put it to our panel during the question and answer session in about 30 minutes.

Alright, let’s start with an HPC overview from Craig Petrie.

(Craig)

Okay thanks Marcus.

For those of you who are unfamiliar with BittWare, we’re actually a product of two acquisitions made by Molex.

The first was the Nallatech group in 2016, with locations in California and the UK. The second was BittWare in 2018, with headquarters in Concord, New Hampshire. Both companies were merged during 2018 and rebranded BittWare, a Molex Company.

Combined, we have thirty years of FPGA expertise across various markets. We are part of the Intel Partner Alliance program and have developed high-end FPGA accelerators featuring every generation of Altera and Intel FPGAs over the last twenty years.

The other unique selling point worth emphasizing is the fact that we are part of the large Molex group of companies with in-house manufacturing and global logistics.

Specifically, we are a division of the Datacom and Specialty Solutions Group or DSS.

This allows us to combine our FPGA expertise and advanced products with the scale and reach of the Molex global brand.

BittWare is therefore the only FPGA card and system product supplier of critical mass able to address enterprise class qualification, validation, life cycle, and support requirements.

BittWare supports many Intel FPGA devices. However, for the purposes of this technical workshop, we are going to focus on two flagship BittWare products.

The first is the 520 series, which has support for Intel’s Stratix 10 GX, NX, and MX.

The MX variant, featuring 16GB of on-package HBM2 memory, is a card we’ll be targeting for today’s demo, which is using oneAPI.

It is natively supported in OpenCL with our new board support package, boasting support for sixteen lanes of PCI Express Gen 3, all sixteen gigabytes of HBM2 memory, and even I/O pipes, allowing applications to be scaled using low-latency, high-bandwidth, deterministic card-to-card serial connections via the QSFP28 network ports.

The 520 series of OpenCL programmable accelerators from BittWare have been successfully deployed in two supercomputers in the public domain.

Please check out these articles and videos to learn more.

The second product we will be targeting is our new flagship accelerator featuring the Intel Agilex FPGAs.

The BittWare IA-840f boasts the largest Agilex FPGA currently available from Intel. First units are scheduled to ship in July.

Initial tool flows will be based upon VHDL and Verilog. However, we will be introducing support for oneAPI later in the year.

The IA-840f shares the same enterprise-class DNA as the 520N series. However, we have improved all the main interfaces, including support for PCIe Gen 4 x16 via the Agilex P-Tile hard IP. The IA-840f is a GPU-sized PCIe card with the option of active, passive, and even liquid cooling. It’s compatible with almost all server and edge platforms featuring GPUs.

BittWare’s FPGA accelerated products are designed to address four main application areas: compute, network, storage, and sensor processing, which span data center, cloud infrastructure, and edge-of-network spaces.

The FPGA value proposition for workloads within these application areas is strong and getting stronger with each year, with many customers now employing AI and machine learning inference techniques in order to bolster performance while reducing the total cost of ownership.

What we would like to do now is provide some color as to the compute intensive workloads that can be accelerated on Intel-based FPGA accelerators from BittWare.

Specifically, those which are using high level tools such as OpenCL and, in this case, oneAPI.

For this, we’ll make use of the BittWare website which has dedicated landing pages for each of the main application areas.

Each page has a wealth of information, including customer case studies, white papers, reference designs, and videos.

Okay. So, let’s jump into the compute landing page, first of all. So, I’ll just click here.

So, in our experience, customers who are using FPGAs to accelerate high performance computing workloads are almost always building a heterogeneous platform. That is a platform featuring a mix of complementary technologies working together in concert. Examples include x86-based CPUs, GPGPUs, and SmartNICs.

FPGAs contribute to the overall application performance, but they also provide workload flexibility and energy-efficient operation.

For those who are less familiar with FPGA technology or perhaps come from a software-oriented background, it can be quite daunting to determine whether or not your application will benefit from FPGA acceleration.

What we’ve tried to do on this landing page is provide some guidance for customers, to help them make that determination.

There is a lot of content here.

So, if you have the opportunity after the webinar, then I recommend that you take some time to explore the landing page and the content in more detail.

Okay. So, for customers who can boil their application down to a distinct set of characteristics, then it is possible to make a quick determination as to whether or not you should be using FPGAs for your application.

We’ve distilled it down to five application properties where FPGAs are a good match.

So, if you can tick the boxes on one or more of these characteristics, then you should probably explore using FPGAs in more detail.

Okay. So, the first characteristic we see is when applications are highly parallel in nature. That is to say that you can perform a number of compute calculations simultaneously.

FPGAs are a good fit there, given the granularity and the parallel nature of the device.

The second is when the memory access patterns are not cache friendly. If you’re using a microprocessor or a GPGPU, they will have a fixed memory hierarchy. FPGAs, on the other hand, allow customers to construct completely custom memory hierarchies. So, they’re a better option if your application needs something nonstandard.

The third property is when you’re using data types that are not natively supported in CPUs or GPGPUs.

So, if customers are using applications that are programmed in, say, double-precision floating point, then they should probably just use a GPGPU to solve that kind of problem.

If, on the other hand, you’re operating at lower levels of arithmetic such as the bit level, pattern matching, or some kind of unusual integer calculation or perhaps even a transcendental single-precision floating point, then the FPGA is a strong candidate for you to consider.

The fourth is a bit more obvious: where your application requires low latency or deterministic operation then an FPGA is an ideal candidate in that situation.

The fifth one we have listed here is a use case where we see a customer needing an interface to some kind of external I/O. That can come in many shapes and sizes. It can include protocols or industry standards such as Ethernet or NVMe.

FPGAs can also communicate to custom or proprietary interfaces. In those situations, the FPGA I/O is highly customizable since most of the FPGA pins are protocol agnostic. And so, an FPGA is a very good candidate in those situations.

Okay. With that overview, I’d now like to hand you over to David and Maurizio to provide an introduction to the Intel company and the oneAPI tools. David, over to you.

(David)

Thank you Craig. Hi, my name is David Clarke. I am a technology sales specialist for Intel, responsible for driving FPGA acceleration into Cloud and Enterprise.

Intel have a long history in compute and data processing across many markets from the edge into the datacenter, spanning the financial services industries, artificial intelligence, machine learning, scientific research, and high performance compute to name a few.

Known traditionally for their CPU pedigree, Intel purchased Altera to take advantage of FPGA as a technology.

High performance compute, as we mentioned, is a market that readily utilizes not only CPU but has also adopted FPGA for its unique benefits, whether that be for deep pipelined compute, real-time in-line low latency deterministic processing, or maybe the acceleration of highly parallel mathematical functions while delivering a performance-per-watt TCO advantage.

But adoption of a new technology into a market is difficult, and it’s dependent on one of three criteria:

  • The new technology is the only solution to a given problem that can’t be solved by any standard technology today.
  • The benefits are so large the investment, or hiring, or re-training of resource is tolerated.
  • Smaller incremental benefits are attractive when they are very, very easy to achieve.

Until now, FPGA has been a technology only accessible to certain applications and markets, as traditionally it required a development flow using specialist RTL languages or specialist resource.

The ultimate objective has always been to be able to harness the advantages of FPGA—in cloud and enterprise—through a development flow that requires no knowledge of FPGA itself. A flow that could allow resource already in house to target FPGA without investment or retraining. Essentially enabling software engineers to design hardware—making FPGA a technology available to developers, partners, system integrators and customers spanning non-traditional FPGA markets and applications.

Intel and its eco-system of partners such as BittWare provide a range of market-ready platforms which, coupled with the emergence of Intel oneAPI, deliver a methodology that allows non-FPGA users to embrace FPGA to take advantage of its unique abilities to solve compute challenges and deliver game-changing acceleration, finally using a flow that can be adopted by non-hardware development resources.

To learn more, I’m now going to hand you over to Maurizio, one of Intel’s acceleration specialist engineers, who will tell you more about oneAPI. Thank you.

(Maurizio)

Thanks David. Why is Intel introducing oneAPI?

In today’s HPC landscape, several hardware architectures are available for running workloads: CPUs, GPUs, FPGAs, and specialized accelerators. The push to architecture diversity comes from workload diversity. No single architecture is best for every workload, so a mix of architectures is required to maximize performance in all possible scenarios.

However, using heterogeneous architectures comes with a significant burden. First of all, each kind of data-centric hardware needs to be programmed using different languages and tools. That means separate code bases need to be maintained for different platforms, and re-using code across platforms becomes impossible. Besides, each platform comes with its own set of tools for compiling, analyzing and debugging code.

That means that developing software for each platform requires a separate investment, with little to no ability to reuse that work to target a different architecture.

oneAPI has been designed to solve this problem. It delivers a unified programming model that simplifies development across diverse architectures.

With the oneAPI programming model, developers can target different hardware platforms with the same language and libraries and can develop and optimize code on different platforms using the same set of debug and performance analysis tools. For instance, they can get run-time data across their host and accelerators through the VTune profiler.

Using the same language across platforms and hardware architectures makes source code easier to re-use. Even if platform-specific optimization is still needed when code is moved to a different platform, no code translation is required anymore. And using a common language and set of tools results in faster training for new developers, faster debug, and, in the end, higher productivity.

The oneAPI language is Data Parallel C++. This is a high-level language designed for parallel programming productivity, and based on the C++ language for broad compatibility.

DPC++ is not a proprietary language. Its development is driven by an open cross-industry initiative. Its starting point is SYCL, which is being developed under the Khronos Group industry consortium.

Language enhancements are being driven by a community project to which Intel actively contributes addressing gaps in language through extensions.

The Intel oneAPI product includes a DPC++ compiler based on LLVM compiler technology and taking advantage of Intel’s years of experience in compiler development.

It also includes a source code to source code compatibility tool to assist with CUDA translation to DPC++.

Now, one of the main problems when compiling code for FPGA is compile time. The backend compile process required for translating DPC++ code into a timing-closed FPGA design implementing the hardware architecture specified by that code can take hours to complete. So, the FPGA development flow has been tailored to minimize full compile runs.

This slide illustrates the FPGA development flow. 

The first step in the flow is functional validation, where code is checked for correctness using a test bench. This is done using emulation on the development platform, where the code targeting the FPGA is compiled and executed on the CPU. That allows for a much faster turnaround time when a bug is found and needs to be fixed. A standard CPU debugger (such as the Intel Distribution for GDB) can be used for that purpose.

Once functional validation is completed, static performance analysis is performed through compiler-generated reports. Reports include all the information required for identifying memory, data flow, and other performance bottlenecks in the design, as well as suggestions for optimization techniques to resolve said bottlenecks. Besides, they provide area and timing estimates of the designs for the target FPGA.

Only once the results of static analysis are satisfactory should a full compile take place. Note that the compiler can, on request, insert profiling logic into the generated hardware. Profiling logic generates dynamic profiling data for memory and pipe accesses that can later be used by the VTune performance analyzer to identify data-pattern-dependent bottlenecks that cannot be spotted in any other way.

To start working with oneAPI on FPGA, three components are needed:

  • The Intel oneAPI Base Toolkit
  • The Intel FPGA Add-on for oneAPI Base Toolkit
  • And a Board Support Package (or BSP) for the card being used

The oneAPI components can be downloaded from the Intel site, while the BSP is provided by the card vendor.

Plenty of resources are available for whoever wants to learn more about oneAPI on FPGA. First of all, there are the Intel oneAPI DPC++ specifications, which can be found on the Intel website together with the Intel oneAPI Programming Guide and the Intel oneAPI DPC++ FPGA Optimization Guide.

Then, a rich set of tutorials can be downloaded from GitHub. They cover several features of the language and come with code snippets that can be used in the oneAPI toolchain.

There are also full-blown, high-performance reference designs for several focus areas such as finance, database acceleration, compression, and so on.

Next, there are trainings provided by Intel. These are instructor-led, all-day and half-day trainings that are scheduled across different GEOs.

Intel also provides Jupyter notebook modules that connect to the Jupyter lab on the Intel DevCloud for those who want to learn at their own pace and play around with code in a controlled environment.

Finally, there is a book that can be found on the Springer website and downloaded for free in PDF format or purchased in paperback.

And now, over to Richard.

(Richard)

This 2D FFT case study targets the BittWare 520N-MX accelerator card. The 520N-MX is a PCIe board featuring Intel’s Stratix 10 MX FPGA with integrated HBM2 memory. The high bandwidth of HBM2 enables acceleration of memory-bound applications. This presentation illustrates how to maximize the performance of a 2D FFT using the on-chip HBM memory and the oneAPI tool flow.

The 2D FFT was chosen for this use case because it is a memory-bound problem. In particular, the 2D FFT was sized to be too large to fit in local FPGA memory, which forces the calculation to depend on HBM bandwidth.

In this presentation we illustrate how to best optimize the HBM memory access patterns to be as efficient as possible.

The Stratix 10 MX has 16 HBM memories, with 2 pseudo-ports each, all independently addressable. There are 16 banks of memory at the top of the device and 16 at the bottom. The maximum bandwidth of each port is 12.8 GB/s with a total theoretical performance of 409 GB/s on the speed grade -2 device used here.
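As a quick sanity check, the aggregate figure quoted above follows from multiplying the per-port bandwidth by the number of pseudo-ports. The short Python sketch below only restates that arithmetic; the constant names are illustrative and not taken from any BittWare or Intel tool.

```python
# Back-of-envelope check of the HBM2 figures quoted in the talk.
HBM_MEMORIES = 16             # independently addressable HBM memories
PSEUDO_PORTS_PER_MEMORY = 2   # pseudo-ports per memory
PORT_BANDWIDTH_GBPS = 12.8    # GB/s per pseudo-port (speed grade -2)

total_ports = HBM_MEMORIES * PSEUDO_PORTS_PER_MEMORY
total_bandwidth = total_ports * PORT_BANDWIDTH_GBPS

print(f"{total_ports} ports, {total_bandwidth:.1f} GB/s theoretical peak")
```

The result is 32 ports (16 at the top of the device and 16 at the bottom), with the total rounding to the 409 GB/s quoted.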

A 2D FFT can be computed using a series of 1D FFTs on the rows of an image, followed by 1D FFTs on the columns of an image.
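The row-column decomposition described here can be sketched in a few lines of plain Python. This is a minimal illustration on a tiny 4x4 image, not the FPGA implementation: the function names (`fft1d`, `fft2d`, `dft2d_direct`) are made up for this example, and the brute-force DFT is included only as a reference to check against.

```python
import cmath

def fft1d(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def fft2d(image):
    """2D FFT as 1D FFTs over the rows, then 1D FFTs over the columns."""
    rows_done = [fft1d(row) for row in image]                  # pass 1: rows
    cols_done = [fft1d(col) for col in transpose(rows_done)]   # pass 2: columns
    return transpose(cols_done)

def dft2d_direct(image):
    """Brute-force 2D DFT, used only as a reference for checking."""
    n_rows, n_cols = len(image), len(image[0])
    out = [[0j] * n_cols for _ in range(n_rows)]
    for u in range(n_rows):
        for v in range(n_cols):
            acc = 0j
            for r in range(n_rows):
                for c in range(n_cols):
                    angle = -2 * cmath.pi * (u * r / n_rows + v * c / n_cols)
                    acc += image[r][c] * cmath.exp(1j * angle)
            out[u][v] = acc
    return out

image = [[complex(r * 4 + c, 0) for c in range(4)] for r in range(4)]
fast = fft2d(image)
slow = dft2d_direct(image)
max_err = max(abs(fast[u][v] - slow[u][v]) for u in range(4) for v in range(4))
print(f"max difference vs direct DFT: {max_err:.2e}")
```

The two `transpose` calls are where the memory access pattern changes from rows to columns, which is the step the rest of the talk concentrates on.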

To generate the transform, two passes are required over the original image. All SDRAM memories lose performance when memory accesses are not consecutive. Moving from row addressing to column addressing, sometimes referred to as a corner turn, requires jumping rows in SDRAM memory to the detriment of overall performance.
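To make the corner turn concrete, the sketch below lists the address steps for reading one row and one column of an image. It assumes a simple row-major layout (the talk does not spell out the layout) and uses a tiny 8x8 image, and the helper names are hypothetical.

```python
def row_addresses(row, width):
    """Linear addresses touched when reading one row of a row-major image."""
    return [row * width + col for col in range(width)]

def column_addresses(col, width, height):
    """Linear addresses touched when reading one column of the same image."""
    return [row * width + col for row in range(height)]

def strides(addresses):
    """Distance between each pair of successive addresses."""
    return [b - a for a, b in zip(addresses, addresses[1:])]

WIDTH = HEIGHT = 8  # tiny image for illustration only
print("row 2 strides:   ", strides(row_addresses(2, WIDTH)))
print("column 2 strides:", strides(column_addresses(2, WIDTH, HEIGHT)))
```

Row reads step through memory one word at a time, while column reads jump a full row width on every access, which is the non-consecutive pattern that hurts SDRAM performance.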

To measure the impact of the corner turn on performance, a simple test application was created.

This graph illustrates the effect of burst size on the average read/write bandwidth, for the data access patterns observed in this 2D FFT example. The graph represents the performance of a single HBM pseudo-channel.

There are two possible memory configurations to consider, each with their own merits. Reading and writing to the same HBM allows a batch of one FFT to be implemented but comes with a small cost in overall performance. This is represented by the blue line.

Alternatively, using separate input and output HBM banks, illustrated by the red line, significantly improves bandwidth, but requires batching or pipelining of multiple 2D FFTs to fully utilize the bandwidth available.

For this example, a burst size of 512 bytes was chosen, as it provides near-optimum performance whilst requiring less local FPGA memory for the caching of intermediate results.

The 2D FFT example was originally programmed using the Intel FPGA OpenCL compiler and was ported to oneAPI for this demo. The kernel code did not significantly change between the two software flows, with only basic changes to optimization pragmas required. Pipelining and parallelisation techniques remain the same.

DPC++ is built upon SYCL and uses SYCL constructs to infer separation between host and kernel code, rather than using two distinct flows for host and kernel binaries. Instead of compiling host code and kernel code with different compilers, a single compilation is performed.

Another change from OpenCL is the creation of a fat binary, where the FPGA image is now included within the executable.

In the following section, we briefly discuss some of the DPC++ and SYCL constructs used in this application port.

Devices, which can be FPGAs, CPUs, or GPUs, are accessed via selectors which give an application a handle to a particular accelerator type. This handle is then used to create a queue, used by SYCL for communication to the target device.

This code example illustrates how to target either the FPGA hardware or an emulation of the device. Emulation is important as compilation for FPGAs can take many hours.

DPC++ uses SYCL buffers and accessors to describe memories that are accessible to both host and kernel applications. This example code copies data from an array into a buffer type, which is then accessed using SYCL accessor types.

The location of the buffer is provided through an accessor property. Here we are pointing to HBM memory 0 of 32 using the buffer location property. The property ID refers to the order in which each memory bank appears in the device’s board support package.

A queue connects a host program to a single device. Using the generated handle, a SYCL task (in this case a single-task kernel) can be submitted to the queue. This is also where any interface-specific attributes can be added.

In this design we stripe the input data over multiple HBM memories, with a 1D FFT dedicated to each HBM interface. This creates a good balance between FPGA compute and memory bandwidth. Each FFT was designed to be fully pipelined, with a peak throughput of approximately 40 GFlops/sec.
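One way to picture the striping is sketched below. The talk does not say how rows are assigned to memories, so the round-robin policy here is an assumption, as are the function name and the 64-row image. The bank count of 16 matches the 16 parallel FFTs mentioned later in the talk.

```python
def stripe_rows(num_rows, num_banks):
    """Assign image rows to banks round-robin (an assumed policy)."""
    banks = [[] for _ in range(num_banks)]
    for row in range(num_rows):
        banks[row % num_banks].append(row)
    return banks

layout = stripe_rows(num_rows=64, num_banks=16)
for bank_id, rows in enumerate(layout[:3]):
    print(f"bank {bank_id}: rows {rows}")
```

With a layout like this, each bank feeds its own 1D FFT engine and all engines can read a fresh row at the same time.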

The outputs of the 1D FFTs are placed into a local buffer. Here we store enough rows of data to create bursts to the HBM that are large enough to maintain good throughput. This simplified animation illustrates how the corner turn is achieved in this design. It shows how 8 rows of data, temporarily stored in local on-chip memory, are used to enable a burst of 8 words when writing back to the HBM. In practice, 16 FFTs are performed in parallel and 64 rows of results are buffered.
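The buffering idea can be modelled in software as follows. This is a simplified sketch, not the kernel code: it assumes the transposed image is stored row-major in one flat memory, and `corner_turn` and its parameters are names invented for this example.

```python
def corner_turn(image, rows_buffered):
    """Write a row-major image out transposed, in bursts of `rows_buffered` words."""
    height, width = len(image), len(image[0])
    assert height % rows_buffered == 0
    memory = [None] * (width * height)   # transposed image, stored row-major
    bursts = 0
    for r0 in range(0, height, rows_buffered):
        local = image[r0:r0 + rows_buffered]   # rows held in on-chip memory
        for col in range(width):
            burst = [local[r][col] for r in range(rows_buffered)]
            start = col * height + r0          # burst lands on consecutive addresses
            memory[start:start + rows_buffered] = burst
            bursts += 1
    return memory, bursts

image = [[r * 16 + c for c in range(16)] for r in range(16)]
for rows_buffered in (1, 8):
    memory, bursts = corner_turn(image, rows_buffered)
    transposed = [image[r][c] for c in range(16) for r in range(16)]
    assert memory == transposed
    print(f"{rows_buffered} row(s) buffered: {bursts} bursts of {rows_buffered} word(s)")
```

Buffering more rows trades on-chip memory for fewer, longer bursts. If each sample were an 8-byte single-precision complex value (an assumption, since the talk does not state the sample size), 64 buffered rows would correspond to the 512-byte burst size chosen earlier.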

The outputs of the multiple 1D FFTs need to be transposed for the second stage. Rather than performing this on the entire image, it is done as part of the main pipeline. Outputs from each FFT must be written to all other HBM banks. This would require multiple accesses from each output buffer to all HBM memories, causing bubbles in the pipeline and reducing performance. As the accesses are linear, we can make use of the abundance of registers in the FPGA to create a dedicated shift register.

This is illustrated in the code example, which creates a fully pipelined sliding window using just FPGA registers. Each column output is delayed in order to create an update over each HBM only once every clock cycle. Finally, the whole process is repeated again, with the final image now striped across multiple HBMs in the same orientation as the original image.
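The slide code itself is not reproduced in this transcript. As a rough software analogy only, the sketch below models one common skewing scheme: every lane wants the same bank on the same cycle, and delaying lane i by i cycles through a small shift register spreads those writes so that each bank sees one write per cycle. The scheme, the function name, and the lane count are all assumptions for illustration and may differ from the actual kernel.

```python
from collections import deque

def skewed_schedule(num_lanes, num_cycles):
    """Return, per output cycle, the bank each lane writes to after skewing."""
    # Lane i gets an i-stage shift register, i.e. a delay of i cycles.
    delays = [deque([None] * lane) for lane in range(num_lanes)]
    schedule = []
    for cycle in range(num_cycles):
        targets = []
        for lane in range(num_lanes):
            wanted_bank = cycle % num_lanes   # every lane wants this bank now
            delays[lane].append(wanted_bank)
            targets.append(delays[lane].popleft())
        schedule.append(targets)
    return schedule

for cycle, targets in enumerate(skewed_schedule(num_lanes=4, num_cycles=8)):
    print(cycle, targets)
```

Once the shift registers have filled (from cycle 3 onward with 4 lanes), every row of the printed schedule contains each bank exactly once, so no two lanes compete for the same bank.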

After porting the code to DPC++ the first thing we should do is test that the code is functionally correct.

All Intel and BittWare examples can be built for emulation, which allows the kernel code to be run on the local processor. For the 2D FFT example we simply type “make fpga_emu” (“emu” for FPGA emulation). Running the emulation executable allows the functionality of the design to be verified in seconds.

Here is an example output of the 2D FFT using Python to display input and transformed images. Once the functionality of the design is verified, the next task is to check for pipelining efficiency and resource usage.

This is done by targeting the FPGA, but pausing once the initial design analysis is complete. To build the reports we type “make report.” This will take a few minutes to complete so here’s one we ran earlier.

First, we check if the pipelining of the kernel was successful. We are looking for an initiation interval of 1, or in other words a fully pipelined kernel. Next, we check the resources required. If everything looks good, we can then compile for the FPGA. This will take several hours.
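To see why an initiation interval (II) of 1 matters, the sketch below uses the usual textbook estimate for a pipelined loop: the pipeline fill latency plus one new iteration launched every II cycles. The iteration count and pipeline depth are hypothetical values chosen for illustration, not figures from the compiler report.

```python
def pipeline_cycles(iterations, initiation_interval, pipeline_depth):
    """Cycles to run a pipelined loop: fill latency plus one launch per II."""
    return pipeline_depth + initiation_interval * (iterations - 1)

ITERATIONS = 1_000_000
DEPTH = 200  # hypothetical pipeline depth in cycles
for ii in (1, 2, 4):
    cycles = pipeline_cycles(ITERATIONS, ii, DEPTH)
    print(f"II={ii}: {cycles} cycles")
```

For long loops the fill latency is negligible, so doubling the II roughly doubles the run time, which is why the report is checked for an II of 1 before committing to a multi-hour compile.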

A new executable is created containing the FPGA binary. Running this executable, we can now see the time taken to run the FFT kernel on the FPGA.

To enable VTune to profile the FPGA kernel, simply compile the oneAPI FPGA executable with the -Xsprofile flag. This automatically adds performance counters to the FPGA design that VTune can then use to retrieve statistics after running an application.

To profile the FPGA, you choose the correct accelerator configuration by selecting the CPU/FPGA Interaction option. Select the host executable FAT binary containing the FPGA design and start the executable. This runs the design and collects profiling information from the host and the FPGA.

On the summary page, we can see the overall time of the application and an approximation of the time taken to run the FPGA kernel.

Selecting the Platform gives a schedule of events that show how the offloaded FPGA kernel fits within the profile of the application.

Zooming in on the FPGA, you can see the profile information for all the internal and external memory interfaces of the FPGA kernel. This provides crucial information regarding the interaction of the kernel with external memory. Hovering over one of the memory profiles provides information regarding the global memory bandwidth, the number of stores, how often a kernel is starved of data, the occupancy, the percentage of time the kernel is processing that data, and idle time—the percentage of time the kernel is doing nothing.

From these reports, we can quickly identify which memory accesses require further attention.

The approach taken here is one of many valid implementations for this algorithm. It is why, in my opinion, high level tools are so important for modern FPGA development.

The ability to quickly experiment with different designs is made possible with Intel’s oneAPI design flow, allowing software engineers to quickly gauge performance impacts of offloading codes to the FPGA.

(Marcus)

Alright. Thank you, Richard, for that demo of oneAPI. This is Marcus again getting us ready for the Q and A.

I wanted to briefly conclude the main presentation by, again, noting we are pleased to be supporting, as BittWare, oneAPI with cards like our 520N-MX with HBM2. If you want more information on the 2D FFT demo that Richard just showed you, we do have a white paper on the BittWare website, and that’s available in the Resources section.

There’s also a request form on that white paper to get the source code for this demo.

Okay. So, if you have a question, some questions have already come in, but, if you have a question now for our panelists, there should be a question icon or text panel to type out your questions, and I’ll put them to the panel. So, you’re going to type out your questions and I’ll voice them to the panel, and they’ll reply over voice.

So, I’m going to shift to my Q and A screen here. Alright. Craig, can you see my Q and A screen and hear my…?

(Craig)

Yes I can.

(Marcus)

Alright. So, first question.