TeraBox 1400B Family of 1U FPGA Servers, Four Cards, U.2 Storage and AMD or Intel CPU
FPGA Server TeraBox 1400B Family Extreme Density Standard-Depth FPGA Servers Choice of AMD EPYC 7002 series or Intel 3rd Gen Xeon CPU Overview At the
One of the features of both BittWare’s SmartNIC Shell and BittWare’s Loopback Example is a packet parser/classifier that extracts protocol fields from packets. With this white paper, we not only wanted to describe our Parser, but explain how using HLS to build and configure it has resulted in a better implementation than using the P4 language. The Parser code is available on the BittWare developer website for free to AMD UltraScale+ owners as part of our Loopback Example (January 2020 availability)
Today the Parser component of BittWare’s SmartNIC Shell is built using the AMD HLS C++ development environment. But an earlier revision of BittWare’s SmartNIC Shell used the P4 language though the AMD SDNet tool.
One reason to use P4 is that it’s an emerging standard popular among people embracing software-defined networking (SDN) on commodity Intel servers. However, AMD later restricted the availability of SDNet. Our use of P4 was specifically for end-users of SmartNIC Shell, so this restriction caused us to search for a more open solution. Following the success of our RSS implementation using HLS, we were motivated to re-implement the SmartNIC Shell parser using this same HLS approach (specifically the Xilinx HLS C++ environment).
The protocols used over Ethernet are challenging for hardware to leverage. This challenge exists because the protocols have many optional fields. Those options make it complicated to find, for example, the start of an IP header. Why? In the IP header case, there can be zero, one, or two VLAN tags in front of it. There can also be MPLS tags. Thus hardware needs to understand the protocol just enough to find the IP header. Hardware needs the IP header in order to find IP addresses which are often used in hardware filters and tables. Similar problems exist at the next level as the IP header itself has optional fields.
BittWare’s HLS C++ packet parser can deal with:
It assumes port IDs are found in these IP protocols: TCP, UDP, DCCP, and STCP
Having essentially created two versions of a packet parser, we noted some differences between using P4 versus HLS C++. Overall, the HLS flow is less abstract than P4, but the tool is far more mature.
Details of resource usage are in the table:
Characteristic | P4/SDNet | HLS C++ |
---|---|---|
CLBs | 3,185 | 3,391 |
BRAM | 22 | 0 |
Registers | 10,361 | 5,975 |
Lines of Code | 206 | 1,154 |
You can see that across all FPGA resources, HLS is either similar or better. While the source code does require more lines, part of that is impacted by comments and formatting. However, it is true that an HLS C++ implementation is always going to need more lines of code than P4. That’s for a packet parser/classifier though, which falls under the scope of what P4 can describe—HLS C++ can do more. HLS is very general purpose and can pretty much do anything. P4 is very specialized.
Even better, now that the HLS implementation exists, any follow-on effort to modify it to digest an Ethernet protocol variation is roughly the same as doing the modification in the P4 language. This is because our HLS C++ implementation is structured as a sequence of calls to low-level parser functions that we created. This approach is analogous to directly manipulating the runtime that sits under P4.
As noted, the source code for the Loopback Example, including its Parser block, is available for free to Ultrascale++ owners through the BittWare Developer site. It is an excellent illustration of how to use AXI interfaces within HLS C++ code. Want to see it but don’t have a BittWare FPGA card? Get in touch with us for where to buy.
The protocols used over Ethernet are challenging for hardware to leverage. This challenge exists because the protocols have many optional fields. Those options make it complicated to find, for example, the start of an IP header. Why? In the IP header case, there can be zero, one, or two VLAN tags in front of it. There can also be MPLS tags. Thus hardware needs to understand the protocol just enough to find the IP header. Hardware needs the IP header in order to find IP addresses which are often used in hardware filters and tables. Similar problems exist at the next level as the IP header itself has optional fields.
BittWare’s HLS C++ packet parser can deal with:
It assumes port IDs are found in these IP protocols: TCP, UDP, DCCP, and STCP
The P4 language was created to define a “packet forwarding data plane” (or network switch) using software. The language is particularly associated with hardware vendor Barefoot Networks. The P4 language is distinct from something called “P4 Runtime” which Google helps promote. P4 Runtime presents a standard runtime API that enables manipulating the control plane of solutions compiled by P4.
P4 does make it easy to define a packet classifier/parser for a new protocol. P4 also specifies a complete set of table lookup functions, and it can rewrite packets that flow through, eliminating VLAN tags, for example.
Does this mean that the flexibility of P4 will lead to adoption for FPGAs? There are several reasons we see against this happening.
Commercial options to provide a subset of P4 on FPGA hardware exist, however they are currently limited in scope. Furthermore, as noted earlier, the commercial terms make it difficult for BittWare to leverage these to create an example program that we can provide free with our products.
It’s important to note that no real-world FPGA application can be exclusively written in P4. For example, the Receiver Side Scaling (RSS) block that follows our Parser in some examples cannot be authored in P4. However, HLS C++ can be used to author either block, or even a single block that combines the two functions.
Also, the P4 table lookup functions are basically a wrapper on hardware-specific runtime libraries written in RTL or HLS C++. Programmers can call such runtimes directly from HLS C++ with no penalty.
The bottom line is that after using both P4 and HLS C++ to implement a parser, we actually favor the HLS C++ approach. It isn’t clear that demand for P4 on FPGAs will grow large enough to support a mature tool. HLS C++ can do more and is more mature.
We hope the explanation of two implementations of a packet parser on an FPGA, one in the P4 language and then using HLS C++, are helpful in evaluating the right approach for you.
One final note is regarding portability between our FPGA cards. Between our AMD FPGA-based cards, HLS provides an easy method with few, if any, changes needed. For moving to an Intel-based card, such as our 520N-MX, source code changes will be required, particularly with respect to compliler pragmas. However, the basic concepts are identical. In both cases we are structuring C++ based upon our knowledge of FPGA translation challenges. Arbitrary C++ code will run very poorly inside an FPGA. However, C++ code restructured and anointed with pragmas works very well. The changes required for AMD or Intel are very similar but just expressed a little differently.
As part of BittWare’s SmartNIC Shell, our Parser helps teams get up to speed quickly for building network packet processing applications on our FPGA cards. Learn more about SmartNIC for our cards or get in touch with us to talk about your application needs.
BittWare’s Loopback example redeploys a subset of the SmartNIC shell that we can offer at no charge. That subset includes our Parser library.
What you see on this page is a basic overview of the Parser, part of SmartNIC Shell. There’s a lot more detail in the full App Note for SmartNIC Shell, and best of all it’s FREE to download! Fill in the form to request access to a PDF version of the full App Note.
"*" indicates required fields
FPGA Server TeraBox 1400B Family Extreme Density Standard-Depth FPGA Servers Choice of AMD EPYC 7002 series or Intel 3rd Gen Xeon CPU Overview At the
White Paper Introduction to BittWare’s SmartNIC Shell for Network Packet Processing Overview SmartNIC Shell is a complete working NIC that is implemented on a BittWare
Article Homomorphic Encryption Acceleration FPGA acceleration enables this unique solution that allows compute on encrypted data without decrypting or sharing keys Traditional Encryption Limits Encrypting
Go Back to IP & Solutions UDP Offload Engine UOE IP Core for 10/25/50/100GbE Atomic Rules UDP Offload Engine (UOE) is a UDP FPGA IP