FPGA Accelerated NVMe Storage Solutions
White Paper FPGA-Accelerated NVMe Storage Solutions Using the BittWare 250 series accelerators Overview In recent years, the migration towards NAND flash-based storage and the introduction
BittWare’s Loopback example demonstrates several things:
The Loopback's functionality is not the principal focus of this example; the goal is to demonstrate the items listed above. The Loopback is nonetheless useful in its own right: BittWare uses it to validate DAC cable settings when connected to third-party devices such as NICs and switches.
The Loopback contains an L2 filter that selects frames to process. If those frames contain IPv4 packets, the Loopback swaps source and destination addresses at both the MAC and IP layers. The Loopback can respond to ARP packets. This was added to eliminate any requirement for specialized configuration of third-party devices.
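The address swap described above can be modeled in a few lines of host-side Python. This is a minimal sketch, not the HLS implementation: it assumes untagged Ethernet frames (no VLAN header), and it assumes that frames which do not carry IPv4 are returned unchanged, which the text above does not specify. The function name `swap_addresses` is hypothetical.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETH_HEADER_LEN = 14          # destination MAC, source MAC, EtherType
IPV4_SRC_OFFSET = 12         # source address offset inside the IPv4 header
IPV4_DST_OFFSET = 16         # destination address offset inside the IPv4 header


def swap_addresses(frame: bytes) -> bytes:
    """Return a copy of an untagged IPv4 frame with MAC and IP addresses swapped.

    Assumption: frames that are too short or do not carry IPv4 pass through
    unchanged. The real Loopback's behavior for such frames may differ.
    """
    if len(frame) < ETH_HEADER_LEN + 20:
        return frame
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4:
        return frame

    out = bytearray(frame)
    # MAC layer: destination occupies bytes 0..5, source bytes 6..11.
    out[0:6], out[6:12] = frame[6:12], frame[0:6]
    # IP layer: source and destination sit at fixed offsets in the header.
    src = ETH_HEADER_LEN + IPV4_SRC_OFFSET
    dst = ETH_HEADER_LEN + IPV4_DST_OFFSET
    out[src:src + 4] = frame[dst:dst + 4]
    out[dst:dst + 4] = frame[src:src + 4]
    return bytes(out)
```

Because the IPv4 header checksum is a ones' complement sum over 16-bit words, exchanging the two address fields leaves it unchanged, so the sketch does not recompute it.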
The Loopback operates on a single QSFP cage, looping packets from input to output. Any additional QSFP cages are not used.
This Loopback was designed and tested on a BittWare XUP-P3R board containing a Xilinx VU9P chip, speed grade 2. The Loopback does not use any external memory and should port to any BittWare board whose Xilinx UltraScale+ chip contains a CMAC.
The Loopback’s FPGA bitstream contains several components. Each component has an AXI4-Stream interface on both input and output; together these interfaces form the data plane. The bitstream’s control plane uses AXI4-Lite interfaces connected to the physical PCIe interface.
The Loopback is supplied as a Xilinx IP Integrator project. Several of the components are written in Verilog. Three are written using the Xilinx HLS flow, which emits Verilog.
The current implementation groups the three components written with HLS into a single component from the perspective of IP Integrator. However, those three components are separately documented here. In fact they are documented as four distinct components. This is because the HLS components share a common “Parser Library” which we document separately to avoid repetition.
As a design philosophy, all components come out of reset enabled, but in a mode that does the “least harm”. Software must then configure the components before the Loopback operates correctly.
Each component also exposes statistics registers to help users debug hardware or software. A snapshot signal captures all of the statistics register values at the same instant, so that they are synchronized in time.
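To see why a snapshot matters, consider the following software model. It is only an illustration of the idea, with a hypothetical class name and method names; it does not describe the actual register interface. Live counters keep incrementing, while reads return the values latched at the most recent snapshot.

```python
class StatsBlock:
    """Toy model of statistics counters with a snapshot latch (hypothetical)."""

    def __init__(self, names):
        self._live = {name: 0 for name in names}   # free-running counters
        self._latched = dict(self._live)           # values visible to software

    def count(self, name, increment=1):
        """Model a hardware event that bumps a live counter."""
        self._live[name] += increment

    def snapshot(self):
        """Latch every live counter at the same instant."""
        self._latched = dict(self._live)

    def read(self, name):
        """Return the latched value, unaffected by events since the snapshot."""
        return self._latched[name]
```

Without the latch, software reading an input counter and then an output counter could see traffic that arrived between the two reads, and the difference would look like dropped frames.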
The bitstream’s interface widths and clock speed were selected to host 100 Gigabit Ethernet traffic. The data plane’s AXI4-Stream interface is 512 bits wide. Except where it touches the CMAC, the interface is clocked at 300 MHz. Frame metadata travels on a separate bus, the AXI TUSER bits, with valid data when AXI TLAST is high.
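The numbers above can be checked with a short model. The sketch below splits a frame into 64-byte beats and attaches metadata only to the final beat, mirroring the statement that TUSER is valid when TLAST is high. The `Beat` class and `frame_to_beats` function are hypothetical names, and the use of a TKEEP byte mask is an assumption based on common AXI4-Stream practice rather than something stated in this document.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

BUS_BITS = 512
BUS_BYTES = BUS_BITS // 8        # 64 bytes per beat
CLOCK_HZ = 300_000_000           # data plane clock away from the CMAC


@dataclass
class Beat:
    tdata: bytes                 # always BUS_BYTES long, zero padded
    tkeep: int                   # one bit per valid byte (assumed signal)
    tlast: bool                  # True on the final beat of a frame
    tuser: Optional[int]         # metadata, only meaningful when tlast is True


def frame_to_beats(frame: bytes, metadata: int) -> Iterator[Beat]:
    """Yield the beats needed to carry one frame across the 512-bit bus."""
    for offset in range(0, len(frame), BUS_BYTES):
        chunk = frame[offset:offset + BUS_BYTES]
        last = offset + BUS_BYTES >= len(frame)
        yield Beat(
            tdata=chunk.ljust(BUS_BYTES, b"\x00"),
            tkeep=(1 << len(chunk)) - 1,
            tlast=last,
            tuser=metadata if last else None,
        )


# Raw bus capacity: 512 bits x 300 MHz = 153.6 Gbit/s.
RAW_BANDWIDTH_GBPS = BUS_BITS * CLOCK_HZ / 1e9
```

The raw capacity exceeds 100 Gbit/s, which leaves headroom for frames whose length is not a multiple of 64 bytes: a 65-byte frame, for example, occupies two beats even though the second carries a single valid byte.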
The metadata format is not uniform across the bitstream. The documentation for each component therefore describes the metadata that component expects on input and the metadata it forwards on output.
The bitstream control interfaces are AXI4-Lite slaves, 32 bits wide. All reads and writes are 32 bits. In cases where byte order matters, the Loopback expects its control registers to hold data in network (big-endian) byte order.
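Byte order is easy to get wrong from a little-endian host, so a small illustration may help. The sketch below uses only the Python standard library and assumes that "network byte order" means the most significant octet sits at the lowest byte address of the register; the exact convention for each register should be confirmed against the register documentation. The helper names are hypothetical.

```python
import ipaddress
import struct


def ipv4_network_bytes(addr: str) -> bytes:
    """Return the four octets of an IPv4 address in network (big-endian) order."""
    return ipaddress.IPv4Address(addr).packed


def as_host_word(network_bytes: bytes, host_order: str = "little") -> int:
    """Integer a host with the given byte order must store so that memory
    holds network_bytes in ascending address order."""
    return int.from_bytes(network_bytes, host_order)


octets = ipv4_network_bytes("10.0.0.1")
big_endian_value = struct.unpack(">I", octets)[0]    # 0x0A000001
little_endian_word = as_host_word(octets)            # 0x0100000A
```

The two integers differ only in byte order. Which one the host must write depends on how the register access path treats bytes, which is why the assumption is stated explicitly here.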
We document the component control plane registers in a single place, separately from the descriptions of the components themselves. Cross references exist to help users navigate between the two locations. The memory map used for the Loopback’s control registers is heavily influenced by the requirements of the AXI4-Lite interface implementation in the Xilinx HLS tool chain.
The formal definition of AXI used is from the Xilinx “AXI Reference Guide” available here: https://www.xilinx.com/support/documentation/ip_documentation/ug761_axi_reference_guide.pdf
The BittWare Loopback Example runs on a PCIe card inserted into a host computer. BittWare supplies software for that host computer to control the example’s functionality. The control software uses Python 3 running on the host.
The Example’s software builds on BittWare’s BittWorks II Toolkit. More specifically, it adds Python bindings to the BwHIL and BmcLib libraries. It then uses those bindings inside a collection of Python components that manipulate the registers the Example’s bitstream exposes in PCIe address space.
In addition, the Loopback Example bitstream translates some hardware events into PCIe interrupts. To support this, the Loopback’s software translates those interrupts into Python calls.
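The general pattern of turning interrupts into Python calls can be sketched as a callback registry. This is an illustration only: the class, its methods, and the integer "vector" identifier are hypothetical and are not taken from BittWare's software.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[int], None]


class InterruptDispatcher:
    """Toy registry that maps interrupt vectors to Python callbacks."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[int, List[Handler]] = defaultdict(list)

    def register(self, vector: int, handler: Handler) -> None:
        """Arrange for handler(vector) to run when the vector fires."""
        self._handlers[vector].append(handler)

    def dispatch(self, vector: int) -> int:
        """Invoke every handler for the vector; return how many ran."""
        handlers = self._handlers.get(vector, [])
        for handler in handlers:
            handler(vector)
        return len(handlers)
```

In a real system, `dispatch` would be driven by whatever mechanism the driver layer uses to report an interrupt to user space; here it is called directly.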
The following session illustrates a very basic interaction with the Loopback Example bitstream:
$ # First map the PCIe card using the Toolkit's command line or use the GUI
$ bwconfig --add=usb # mapped over USB first as device 0
$ bwconfig --add=pci # same card mapped over PCIe as device 1
$ python3 # Invoke python3
>>> from components.hildev import *
>>> card = Card(1)
>>> card.show_stats() # Shows all stats from all components
>>> # Show stats from just the first CMAC component with some options
>>> card.cmac[0].show_stats(showall=False, doTick=False)
>>> help()
>>> exit()
All of the Python components support a common collection of low-level methods. Note that our Python implementation does not have the PCIe memory map hard-coded. Instead Python reads a JSON database that defines the available FPGA bitstream components, their registers, and where the registers are located in PCIe address space. That JSON database is automatically generated from the Loopback Example’s documentation.
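A data-driven register map of this kind might be resolved as in the sketch below. The JSON layout, component name, register names, and addresses are all invented for illustration; the actual schema of BittWare's generated database is not described here.

```python
import json

# Hypothetical database contents; the real file is generated from documentation.
REGISTER_DB_TEXT = """
{
  "components": [
    {
      "name": "cmac0",
      "base": "0x00010000",
      "registers": [
        {"name": "rx_frames", "offset": "0x10"},
        {"name": "tx_frames", "offset": "0x14"}
      ]
    }
  ]
}
"""


def build_address_map(db: dict) -> dict:
    """Map (component, register) pairs to absolute addresses."""
    address_map = {}
    for component in db["components"]:
        base = int(component["base"], 16)
        for register in component["registers"]:
            key = (component["name"], register["name"])
            address_map[key] = base + int(register["offset"], 16)
    return address_map


ADDRESS_MAP = build_address_map(json.loads(REGISTER_DB_TEXT))
```

Keeping the map in data rather than in code means a rebuilt bitstream with a different layout needs only a regenerated JSON file, not changes to the Python components.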
The full Python API documentation is available on BittWare’s Developer Site.
The low-level methods include:
The higher-level methods available depend on the specific component. However, a few methods are relatively common:
This page is the introduction to the Loopback example. The full App Note, which covers the Loopback in much more detail, is available from BittWare as a free PDF download.