Arkville PCIe Gen4 Data Mover Using Intel® Agilex™ FPGAs Webinar
The Arkville IP from Atomic Rules was recently updated to support Intel Agilex FPGAs including those on BittWare’s latest IA-series of products. Arkville moves data at up to 220 Gb/s over PCIe Gen4 x16.
In this webinar, you’ll hear from Jeff Milrod at BittWare covering products supporting Intel Agilex FPGAs, and the use of data mover IP in a variety of markets. Tom Schulte from Intel will give perspective on the Agilex product line including future features such as PCIe Gen5 support.
We’ll conclude with Shep Siegel at Atomic Rules to give a demo and explain the performance achieved in a short time with the Arkville data mover IP on Agilex FPGAs. He will provide insight into how Arkville reduces time-to-market and makes development easier without sacrificing performance.
Watch the webinar below, including the recorded Q&A session from the live event.
Welcome to our webinar: Arkville PCIe Gen4 Data Movement Using Intel FPGAs. I’m Marcus with BittWare.
Let me briefly introduce our presenters and what they’ll cover.
First up is Jeff Milrod, Chief Technical and Strategy Officer at BittWare. He’ll talk through the BittWare product line featuring Agilex FPGAs, plus speak a little bit about IP partners and solutions at BittWare.
Next, Tom Schulte from Intel will focus on Intel Agilex FPGAs, including some of the upcoming features for moving data.
We’ll conclude with Shep Siegel from Atomic Rules. Shep will walk us through Arkville, including a demo of Gen4 x16 data movement running on BittWare’s IA-840F card featuring an Intel Agilex FPGA.
I’ll be back on afterwards to take your questions live.
Now over to Jeff to get us started!
Jeff Milrod | Chief Technical and Strategy Officer, BittWare
Hi, all. Thanks for joining us today. As Marcus just stated, my name is Jeff Milrod and I’m the chief technical and strategy officer here at BittWare. For those of you who are unfamiliar with us, BittWare is part of Molex. Specifically, we are a business unit within the Datacom and Specialty Solutions Group. As part of Molex, as you can see here on this slide, we have access to in-house manufacturing and global logistics capabilities.
BittWare now has over 30 years of FPGA experience and expertise across a variety of markets. During that time, we’ve broadened our offerings to include not just enterprise-class FPGA hardware platforms, but also system integration, tool support, reference designs and application IP that enable our customers to deploy their solutions quickly and with low risk. We’re part of the Intel Partner Alliance program and have developed high-end FPGA accelerators and boards featuring every generation of Altera and Intel FPGAs for the last 20 years.
Our blend of heritage, expertise and global reach via Molex uniquely qualifies BittWare to enable customers to leverage and deploy FPGA technology to address their demanding applications and workloads. In the broadest terms, BittWare’s acceleration platforms target four different applications and market areas shown here: compute, network, storage and sensor processing. Each of these applications and markets are complex and cover a wide variety of workloads; we show some examples here.
Personally, I’ve been at BittWare for decades, and during that time we’ve focused on riding the leading-edge wave of FPGA accelerators by delivering solid hardware platforms that enable people to develop and deploy accelerated solutions. The Agilex will be our seventh generation of Altera/Intel-based FPGA solutions, and in that time I can’t remember ever being so excited about a new technology generation as I am about Agilex.
These are really capable engines that I think have taken a leap in performance capabilities and will allow us to empower our users to accelerate far more applications and workloads than we have in the past. The first wave of Agilex FPGAs from Intel is the F-series which BittWare has leveraged to produce our aptly-named F-series product family shown here. We’ll be coming out with I-series and M-series parts in the future…we’ll talk about that in a minute.
Our flagship product is the IA-840F shown on the left. This features the largest Agilex currently available: the AGF027. It’s a GPU-sized card that has PCIe Gen4 x16. So, we have the largest bandwidth from the FPGA to the host available anywhere in the world at this point.
We have three QSFP-DDs on the front that allow us to implement six 100GbE ports; they can be used for other formats as well. There are four banks of DDR4, and we have expansion ports out the back (16 lanes) that allow us to connect to things like storage arrays and other devices, and that can be used board-to-board to expand communications.
We have our board management controller that is a key part of the value add that we bring to the hardware platform. And of course, we have support for Intel’s world-class tools, including oneAPI.
On the right you’ll see two boards that are more targeted for specialty applications. The IA-420F is a NIC-sized card that can be used for SmartNICs among other things…computational storage arrays, computational storage processing, radio access networks…there’s a variety of applications for this smaller-sized card.
And on the far right, we have our IA-220-U2, which is in a U.2 SSD drive format. So, it’s particularly targeting computational storage processing applications.
All of these boards—and all of BittWare’s hardware boards—are built on our enterprise-class foundation. By that we mean a well-defined, stable and trusted platform that reduces and mitigates risk.
That’s because we’ve taken the time to be extremely comprehensive and thorough: full compliance and certifications, rigorous management and controls on the configurations, clear and concise documentation, working demos, software tools to access the hardware, and extensive support capabilities. In the enterprise-class category (shown on the right) each one of these is just a higher level that has multiple checklists underneath, which we implement and rigorously validate prior to releasing full production-quality boards that are trusted and stable.
Our application solution enablement team is hard at work, continually developing higher-level abstractions on top of the hardware to deal with a lot of the details and specifics that need to be implemented within the FPGA, which I call gateware development, and the software on the host to communicate to those gateware elements.
This is an example here of some of the white papers, case studies, examples and reference designs that we have available from our website. There’s more if you take a look up there.
This IP roadmap for platform enablement is critical to the value that BittWare’s adding to the Agilex FPGAs. Our overall goal is to supply PCIe, Ethernet and NVMe infrastructures that customers struggle with and take those problems away.
We partner with key third parties as well as our internal development to ship world-class specific implementations such as Arkville and the DPDK data mover that Atomic Rules will talk about in a minute. This is a listing of all the things we’re working on currently that will be out in the next year as we mature our Agilex platforms.
Years ago we used to sell what we affectionately call blank FPGAs, now sometimes just called bare metal. Our customers would then take a lot of time to develop customer application IP that would expand to consume the whole FPGA. And we still think about it that way…people tend to think there’s all this work on a kernel or workload that has to get dropped onto the FPGA. But we’re now finding, as FPGAs increase in size, complexity, sophistication and performance, that the lower-level implementations that connect the memories, sensors, networks and expansion I/O to the host communications, the board management controller and system integration are consuming more and more design resources and capabilities.
I call this plumbing, and our gateware plumbing is a key part of the value-add that we’re bringing with this application solutions enablement on these sophisticated FPGAs, so that our customers don’t have to dive into all the gory details of the hardware on the FPGA, the I/O, the last micron, as I call it: the board-level implementation of the BittWare hardware.
Of course, we provide all that (if customers want to do that themselves, that’s fine) but we have it all done and proven as part of our enterprise-class platform, such that our customers’ unique application IP is more about their special secret sauce and the unique value added for that workload and application, rather than dealing with all the additional complications of getting from that kernel to the memory…to the host…from the network…all of these things we take care of for you.
One of the prime examples of that is Atomic Rules and their Arkville DPDK data mover to the host, where we’ve now seen just absolutely top-notch performance and CPU offload with this engine. We’ll talk about that in a minute. Before we let Shep get into that, though, I think it’d be appropriate to provide a better foundation of this Agilex FPGA and the offerings that Intel is bringing to the table. So, with that, I’ll hand it off to Tom. Thank you all very much.
Thomas M. Schulte | Product Line Mgr., FPGA Products, Intel® Programmable Solutions Group
Thanks, Jeff. As Jeff mentioned a few slides ago, the new production-quality BittWare accelerator cards are based on Intel’s newest Agilex FPGAs. Highlighted here are some of the more significant features offered within the Agilex family.
The devices are built on the second-generation Intel Hyperflex™ architecture and Intel’s 10nm SuperFin process technology, both of which have demonstrated significant performance improvements and power savings when compared to the prior generation of Intel FPGAs, but also when compared to the competitor’s 7nm FPGAs.
I’m not going to review all of these features in this webinar, but instead I’ll focus on the new CPU interface protocols that are available, the PCI Express Gen5 and Compute Express Link, commonly abbreviated as CXL.
Selected members of the Intel Agilex family support the full bandwidth of PCI Express Gen5, configured up to x16 lanes per port, providing twice the bandwidth of equivalently configured Gen4 devices.
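As a rough check on that doubling, the raw per-direction bandwidth of a PCIe link can be estimated from the per-lane transfer rate and the 128b/130b line encoding. The sketch below uses the standard per-lane rates (16 GT/s for Gen4, 32 GT/s for Gen5); the function and variable names are illustrative only, and the results are raw link rates before any packet overhead, not achievable payload throughput.

```python
# Raw per-direction PCIe link bandwidth, before TLP/DLLP protocol overhead.
GT_PER_S = {"gen3": 8.0, "gen4": 16.0, "gen5": 32.0}  # per-lane transfer rates
ENCODING = 128 / 130                                   # 128b/130b line encoding


def raw_gbps(gen, lanes=16):
    """Raw link rate in Gb/s for one direction of a PCIe link."""
    return GT_PER_S[gen] * lanes * ENCODING


gen4 = raw_gbps("gen4")
gen5 = raw_gbps("gen5")
print(f"Gen4 x16: {gen4:.1f} Gb/s raw")
print(f"Gen5 x16: {gen5:.1f} Gb/s raw")
print(f"Gen5/Gen4 ratio: {gen5 / gen4:.1f}x")
```

Gen4 x16 comes out at roughly 252 Gb/s raw in each direction, which is the ceiling that any Gen4 x16 data mover works beneath.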
For accelerated use cases and workloads which do not need a coherent interface connection back to the host CPU, PCI Express is the industry-wide standard for high performance applications and will continue to be a key building block for FPGA-based accelerators.
Selected members of the Intel Agilex FPGA family also support the full bandwidth of the CXL protocol. Again, up to x16 lanes per port, providing a lower-latency and coherent interface when compared to the PCI Express protocol. For accelerated use cases and workloads, especially those that are heavily dependent on memory-related transactions, the CXL interface looks to be utilized for many high-performance applications.
I say this based on two key indicators. First, the over 100 companies that have joined the Compute Express Link Consortium. And secondly, the number of customers planning to offer CXL-based products and solutions.
While the details of PCI Express Gen5 and CXL are exciting, actually getting silicon and running these interfaces is even better. Various Intel teams continue to test, characterize and ship engineering samples of the hardware and software necessary to enable new platforms based on the next-generation Intel CPUs codenamed Sapphire Rapids.
In addition to those platforms, many customers have also already received engineering samples of the Agilex FPGAs supporting these two new interfaces.
In fact, some of the internal hardware used to test each and every Sapphire Rapids CPU is based on Intel Agilex FPGAs.
Agilex FPGAs are categorized into three different series, each targeted at different applications. The F-series devices bring together transceiver support up to 58 gigabits per second, increased DSP capabilities and high system integration that target a wide range of applications from the data center, networking, edge, embedded, industrial, military and even wireless. This is considered, sort of, the general-purpose category of devices in the Agilex family.
In the I-series, we’ve got a bunch of devices that are optimized for high-performance processor interfaces and bandwidth-intensive applications. This series will offer options for the new CXL protocol, PCI Express Gen5 and transceiver bandwidth up to 116G. The I-series FPGAs are a compelling choice for applications which demand massive interface bandwidth and high performance.
And finally, the M-series devices. These are optimized for compute- and memory-intensive applications. This series will offer additional features not available in the F- and I-series: things such as DDR5, LPDDR and integrated HBM2 stacks. The Agilex M-series FPGAs are optimized for data-intensive use cases such as high-performance computing applications, which generally need massive amounts of memory in addition to high bandwidth.
To learn more about Intel Agilex FPGAs use the URL shown at the top right of this page…but enough about Intel Agilex FPGAs. Let’s hear from Shep at Atomic Rules about their Arkville data mover IP, which can be used with BittWare’s new accelerator cards based on Agilex F-series production-quality FPGAs. Over to you, Shep!
Shepard Siegel | CTO, Atomic Rules
Thank you, Tom, for that introduction—that’s great. Hi, I’m Shep Siegel with Atomic Rules, and thank you for coming to this webinar today, we’re really excited and I’m really excited to tell you about Arkville on Agilex. It’s something that’s been a long time in the making and this webinar is the rolling-out party…so here we go.
Arkville on Agilex: it’s Gen4 data motion for FPGAs that just works. But first, a little bit about Atomic Rules. We’ve been doing this for some time. Our business model is to do fewer things better. We have a few key core products that we make: Arkville, of course, which we’ll be talking about today; a UDP offload engine, which does UDP in hardware; and TimeServo and TimeServo PTP, which provide a coherent system time clock across a fleet of FPGA devices in the datacenter.
We are an Intel Gold Partner Certified Solution provider and we’re quite proud of that. And, for over a decade, we’ve been contributing to open source projects and engineering enterprise-grade solutions…really focusing on compute and communication in terms of networking IP. And we’re really proud and thankful to have the small, select and returning set of customers that allow us to grow.
All right, let’s jump into Arkville. So, Arkville is a DPDK packet conduit. What I mean by that is it’s a way to interconnect FPGA data flows to host memory buffers, and the other way around: it allows data that might be in a host memory pool to become a stream on the FPGA, and the other way around.
And it facilitates this data motion as streams of data moving across PCI Express. We talk about it as a conduit because all of the complexity from the API on the software side down through PCI Express over to the FPGA, out to the AXI streams where the data is produced and consumed is abstracted away…which means quicker time to market, quicker time to solution using standard APIs from DPDK (which is part of the Linux Foundation) and FPGA hardware such as Intel Agilex devices. Intel also might make some processors which you can use to run on the host side, but we’ll save that for another webinar.
So where is Arkville used? Arkville is used whenever there’s the need to efficiently move data between the host and an FPGA device or the other way around. It’s a building block component that abstracts away many of the complexities of that data motion so that the users of Arkville can get on with building products such as SmartNIC devices, network appliances or DPDK accelerators.
Why DPDK…I heard that’s just for networking? Well, for networking, DPDK does have a really strong story, but it’s deeper than that. DPDK is a trusted API that’s been around for a long time. It’s been under the stewardship of the Linux Foundation recently. It’s community vetted, it’s routinely tested and it’s an open-source, standardized solution and set of APIs not just for networking but also for bulk data movement.
By designing Arkville to use DPDK, it frees up host processor cycles to perform more useful work. It’s a kernel bypass means for sure (that is, the kernel’s out of the way, and that means, out of the gate, higher throughput and lower latency to the application). But Arkville is also DPDK aware (we’ll get into that in a later slide): by pushing the business logic of DPDK’s data motion into FPGA gates, Arkville can have both higher throughput and lower latency, resulting in reduced general-purpose processor cache pollution, which in turn results in higher host-core performance.
So, DPDK makes terrific sense if you have workloads that will be empowered by their API, which certainly could be networking…but also could be simple bulk data movement between an FPGA device and the host.
The key point of Arkville is that Arkville implements the low-level inner loops of the DPDK spec in FPGA hardware…essentially turning the DPDK spec into RTL gates. Every other DPDK solution, including merchant ASIC NICs, pushes some, or all, of this work onto the host processor cores. We designed Arkville from the beginning to do one thing and to do one thing well: manipulate DPDK mbuf data structures in hardware so that the processor cores don’t have to do that. And by doing that in hardware, we have this unique advantage of simultaneously achieving high throughput and deterministic low latency. And in doing that as well, there’s almost no host core utilization, as we’ll see in the following slide. The other point behind Arkville is it’s a complete solution for data motion: it works out of the box.
The software engineers are using standard APIs to produce and consume data buffers. The hardware engineers are connecting to AXI interfaces. Compare and contrast that “get going the same day” story to a “roll your own” solution which could take weeks or even months to simulate let alone stand up on real-world hardware.
So, Arkville is delivered as a combination of software and gateware. There is a DPDK poll-mode driver that is fully open source and available today at DPDK.org, and then there’s the RTL component that fits inside your Intel Agilex FPGA, which Atomic Rules licenses. There’s a named-project and a multi-project license, but basically it’s a licensed piece of IP that goes inside your Intel FPGA. The two work together to provide this data mover conduit that I’ve been talking about, to allow data to flow from the FPGA to the host and the other way around.
This eye-chart cartoon shows the host processor on the left and the FPGA on the right, and shows some of the sub-modules of how the host processor, typically a Xeon workstation or server, and the FPGA, typically an Intel Agilex device, might be split up and where the different components are. It’s not to any kind of scale in terms of area or complexity, but the green and red boxes down on the bottom represent the sources and sinks for device-to-host and host-to-device data motion that are essentially the destination or source or the producer or the consumer for the currency that Arkville is carrying through its conduit.
Here we see a chart that shows the throughput of Arkville as a function of packet size. Now, immediately you can see that the throughput is less for smaller packet sizes, and that’s just a fact of the world with the overhead that PCI Express has on smaller packets. But we also see, if we focus on the right side of the chart, that the blue and the red lines, which represent the device-to-host and host-to-device transport speed, approach the theoretical limit at the top, which is at 220 Gb/s and even a little bit higher. We’ll see more of that in a moment with the demo.
Arkville also has exceptionally low latency: not high-frequency fintech trading latency, which is expected to be sub-microsecond, but unit-microsecond latency all the time between the FPGA and the host. And the lack of a long tail, particularly on long packets and under high load, is of value to vRAN, O-RAN and 5G applications that can’t tolerate a missed deadline.
In addition to low latency, Arkville also has essentially no latency jitter. How is that? Well, by not being a standard DMA engine with caches and other dynamic means to support some large number of queues, Arkville has deterministic latency from the time a packet arrives, for example, to the time it lands in host memory (or the other way around).
This specificity on just doing DPDK and moving mbufs allows Arkville’s latency jitter to be essentially zero. Arkville also has no memcpy, which is to say the host processors spend no work, none, zero cycles, moving packet data from one spot to another. Arkville’s RTL hardware on the FPGA ensures that data lands exactly in the mbuf where it should, so that the host doesn’t have to move that data, leaving more CPU cycles for your application.
This graph shows how there’s less than 20 nanoseconds per packet spent in the Arkville PMD for packet sizes that fit in a single mbuf. In this case, the mbuf is just a two-kilobyte mbuf. If we expanded the mbuf size, this flatness would continue right off the right side of the page.
Arkville has zero packets dropped forever—for always. Unless, for example, the system is hit by a rock.
The flow control on Arkville is full front-back hardware-software all the way across all domains. Under no circumstance is data admitted on one end that can’t be safely transported to the other end, and vice versa. Other data movers will drop packets if they can’t keep up or if there’s distress or retransmission. We have hardware and software fully flow-controlled indications that provide for zero packets dropped under all conditions.
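The talk does not describe how Arkville implements its flow control, so the following is only a generic illustration of the principle stated above: a sender that may transmit only while it holds a credit for a free receiver slot can never overrun the receiver, so nothing needs to be dropped. The class and method names are hypothetical and are not part of Arkville or DPDK.

```python
from collections import deque


class CreditLink:
    """Toy credit-based link: the sender transmits only while it holds credits."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # one credit per free receiver slot
        self.rx_buffer = deque()

    def try_send(self, packet):
        if self.credits == 0:
            return False              # backpressure: the sender keeps the packet
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def consume(self):
        if not self.rx_buffer:
            return None
        packet = self.rx_buffer.popleft()
        self.credits += 1             # slot freed, credit returned to the sender
        return packet


link = CreditLink(buffer_slots=4)
pending = deque(range(20))            # packets waiting at the sender
received = []
step = 0
while pending or link.rx_buffer:
    # The sender offers packets on every step and stops when credits run out.
    while pending and link.try_send(pending[0]):
        pending.popleft()
    # A deliberately slow receiver drains one packet every third step.
    if step % 3 == 0:
        packet = link.consume()
        if packet is not None:
            received.append(packet)
    step += 1

print(f"sent 20, received {len(received)}, in order: {received == list(range(20))}")
```

Even though the receiver is three times slower than the sender, every packet arrives and arrives in order; the cost of a slow consumer is waiting at the source rather than loss in the middle.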
Now, after all that, let’s jump to a prerecorded demo (recorded a few days ago) that shows how Arkville is installed, shows it running on a Xeon server, and then after that, we’ll jump over to the questions and answers—see you there.
Hi, Shep Siegel here, and this is a prerecorded demo that we’re doing on Friday, December 10th, a few days ahead of the Intel/BittWare/Atomic Rules Arkville on Agilex webinar. I’m going to go through what we’re going to demonstrate here and then we’ll see the demo.
There’s an Intel Xeon processor that is being used as the host system and an Intel Agilex FPGA, which is being used as the device under test. If you look carefully at this slide in the lower left, we can see the user-land processor memory (basically the DRAM from which data will move to and from) and in the lower right-hand corner we’ll see FPGA fabric memory again, from which data will move to and from.
In between we have Gen4 x16 PCI Express connecting the Agilex device to the Xeon host. The parts that we’re using in this demo are a Dell R750 server with Xeon 6346 processors (those are Gen4 x16 PCIe capable), a BittWare IA-840F with an Intel Agilex F device and, of course, Atomic Rules’ own Arkville: our 21.11 release, which just shipped earlier this week.
So, the first thing we’re going to show in the demo is the Arkville script. It’s going to bring in all the needed libraries and download and compile DPDK and take care of what we need on the host system side. The next thing we’re going to do (if Quartus Prime Pro isn’t installed) is install it, and then we’re going to compile a bitstream for the Agilex device from RTL by running make Agilex.
Once the bitstream is ready, we load it into the FPGA and just do a sudo reboot. There’s no need to make the bitstream persist in the Agilex device’s flash memory.
There are about a dozen DPDK applications that are distributed with the Arkville distribution, but we’re going to focus on TX (or downstream) and RX (or upstream) throughput specifically in this demo. Last, at the end of the demonstration, the demo data is placed in a performance log and we’re going to plot that data out.
All right. So, we’ll start out here in our projects directory, and the first thing we’re going to do is we’re going to expand the Arkville release from the supplied tarball. There we have it. The tarball has been expanded. Next thing we’re going to do is we’re going to run the Atomic Rules Arkville installer script, which will bring in necessary libraries as well as download and compile DPDK. So, we see the libraries tearing on down. And now at this point we’ll download DPDK from DPDK.org.
With DPDK downloaded, we can kick off the Meson Ninja compile system.
This part of the demo is actually showing it in real time, which is really fast (except for test string…it always stops there for a little bit) and DPDK will be done in a minute. And great: we’ve got DPDK installed and we’re ready to move on.
At this point we need to build our bitstreams for Arkville on Agilex F. So, we’ll shoot over to the hardware targets directory and simply type make Agilex in order to build all the Intel Agilex targets.
I’ll first check to make sure that we have Quartus 21.3 installed. Yep, that’s it, let’s go!
In this part, we certainly have truncated a bit. It takes about 30 minutes to an hour to run through the entire tool flow to build the bitstream (depending on the size of the design). We have six different designs here, so we’re only caring about one of them at this point.
With the bitstream completed, we’ll download it to our BittWare IA-840F card inside the Dell server. So, we download the bitstream and do a sudo reboot to bring the system back up.
We see now, after the system has rebooted from lspci, the Arkville device is visible in the server. It happens to be in slot C-A (Charlie Alpha).
We can go and use some extended lspci verbosity in order to look at some of the capabilities that the device is trained for. And here, what we’re looking for—if we can manage the scroll bar without the screen going back and forth—is not just the original lspci that we saw at the beginning, but also seeing that the device is indeed Gen4 x16 capable. That’s the link capability line highlighted here—and that we’ve actually achieved Gen4 x16 link status, meaning we trained to that.
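The same check can be scripted. The sketch below parses the LnkCap and LnkSta lines that `lspci -vv` prints for a PCIe device and compares the negotiated link against the advertised capability. The sample text is an illustrative excerpt in the usual lspci style, not output captured from the demo system, and the helper name is hypothetical.

```python
import re

# Illustrative excerpt in the style of `lspci -vv` output (not from the demo).
SAMPLE = """\
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
        LnkSta: Speed 16GT/s (ok), Width x16 (ok)
"""

# Per-lane transfer rate in GT/s mapped to PCIe generation.
GEN_BY_SPEED = {"2.5": 1, "5": 2, "8": 3, "16": 4, "32": 5}

LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")


def parse_link(text):
    """Return {'LnkCap': (gen, width), 'LnkSta': (gen, width)} for lines found."""
    result = {}
    for match in LINK_RE.finditer(text):
        field, speed, width = match.groups()
        result[field] = (GEN_BY_SPEED[speed], int(width))
    return result


link = parse_link(SAMPLE)
trained_ok = link["LnkCap"] == link["LnkSta"] == (4, 16)
print(link)
print("trained to Gen4 x16:", trained_ok)
```

On a real system the text would come from running lspci with elevated privileges against the card’s bus address; a LnkSta that reports a lower speed or narrower width than LnkCap usually points at the slot, riser or BIOS settings rather than at the FPGA design.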
So that’s a good sign that we’re off to a good start. So, with this in place, we can move on now to any one of the dozen or so DPDK applications that are distributed. We’re going to use the Arkville Duplex Performance Test, which independently runs a suite of tests that measure the ingress and egress performance and also the full-duplex performance of the system.
There are many dimensions to this test, and in various iterations it could run seconds, minutes, hours or days—so we’re going to take the data from the set of these tests and grab that into a file called performance.log. We’re going to take the data from this performance file and bring it up into a Google Sheets document where we can plot it and look at it in detail.
There’s the plot data log and here are the results. So the yellow line at the top, the skyline, if you will, represents the theoretical limit of this configuration of hardware and software, while the blue and red lines show the device-to-host and host-to-device throughput, respectively. You can see the y-axis has the useful throughput in gigabits per second.
So over on the left side of the graph (where we’re zooming in or zooming away from right now) the performance is not quite so good because of the smaller packet size, largely owing to PCIe’s 512-byte MPS. But as we pan over here to the right and look at packet sizes of 512 bytes or a kilobyte and above, we can see that the throughput grows to well over 200 gigabits per second for the upstream direction and approaches 180 gigabits per second here for the downstream direction.
So, quite laudable performance in terms of upstream and a little bit of room for improvement that we can see in the asymptotic performance over on the right-hand side.
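The shape of that curve can be approximated with a simple model: every packet is carried in one or more TLPs of at most MPS bytes, and each TLP pays a fixed overhead on the wire. The sketch below is such a model, not the one used to draw the demo’s theoretical line. The 24-byte per-TLP overhead (framing, header and link CRC) is an assumption chosen for illustration, and the model ignores DLLP traffic, descriptor and completion traffic, and host-side effects, so measured throughput will sit below these bounds.

```python
import math

RAW_GBPS = 16.0 * 16 * 128 / 130  # Gen4 x16 raw rate per direction, about 252 Gb/s
MPS = 512                         # max payload size in bytes, as quoted in the demo
TLP_OVERHEAD = 24                 # ASSUMED bytes per TLP: framing + header + LCRC


def payload_gbps(packet_bytes, mps=MPS, overhead=TLP_OVERHEAD):
    """Upper bound on payload throughput for back-to-back packets of one size."""
    tlps = math.ceil(packet_bytes / mps)          # TLPs needed for one packet
    wire_bytes = packet_bytes + tlps * overhead   # payload plus per-TLP overhead
    return RAW_GBPS * packet_bytes / wire_bytes


for size in (64, 128, 256, 512, 1024, 4096):
    print(f"{size:5d} B packets -> at most {payload_gbps(size):6.1f} Gb/s")
```

The fixed overhead is amortized over fewer payload bytes for small packets, which is why the left side of the plot is lower; once packets reach a multiple of the MPS, the bound stops improving and the curve flattens.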
But overall, we hope it’s clear from looking at this graph that, out of the gate, Arkville on Agilex F does an admirable job of approaching the theoretical performance.
Thank you for taking your time to watch this demo. We’re going to cut back now to real time, where Marcus will lead the question and answers that I’m sure some of you will have.
And once again, thank you and happy holidays.
I wanted to share a few words before we get into our Q&A time.
So, today’s webinar featured Arkville from Atomic Rules running on the BittWare IA-840F card, which has an Intel Agilex FPGA. For more information on any of these, please visit the BittWare, Intel or Atomic Rules websites.
So, with that, let’s begin with some questions.
Let’s see, the first one is for Shep—we just saw the graph. So, what’s the likely final performance numbers for Arkville?
Thanks, Marcus. So, did you say what’s the likely final performance numbers—as in—the end?
Well, yeah, because I think you had presented some performance numbers and you noted maybe some updates or something. So maybe that’s what they intended.
Sure, got it, understood. So, performance at a system level, involving general-purpose processors, FPGAs and interconnection networks like PCI Express, is difficult. We can simulate all we want, but in the real world…things happen. We’re confident in putting the 220 gigabit per second number out there, mainly because over the past couple of months, early on…on consumer or workstation-grade Rocket Lake systems, we’ve seen that so reliably. Maybe some of the keener eyes out there noticed, in looking at the demo that we ran, that the ultimate performance on the downstream side of the high-powered Dell server, with the big Xeons, wasn’t as good as the Rocket Lake in the end. That owes perhaps to NUMA issues, QPI issues and things of that sort. So, in the end, the best benchmark, we think, with regard to throughput (since this hardware is available from BittWare and from Intel and the IP is available from Atomic Rules) is to get it in your own shop and to do that as soon as possible.
The demos and tools that we showed in the prerecorded demo will quickly show you what your own system is capable of. In that way, rather than taking a number that we see as a nominal performance number, you can see in your own application…in your own system…in your own special set of circumstances what, for example, the sustained throughput number is.
All right, yeah, thanks for that answer. And another question for you, that has a pretty simple answer, I suppose: How would a Quartus user utilize the Arkville IP…is it Platform Designer/Qsys compatible?
It is. We support both the Platform Designer flow (or, to those who’ve been using the tool for a while, the Qsys flow) and a full, straight-up SystemVerilog flow. Because of the concise nature of SystemVerilog interfaces and the support that Quartus provides for SystemVerilog, either a standard RTL flow using SystemVerilog or Platform Designer is supported. So, Arkville is instantiated in your Agilex device like any other core.
So probably for Shep: What’s the roadmap for Arkville RTL IP for supporting PCIe Gen5 and then CXL, and what could be the performance numbers?
Well, that’s a great question. So, the performance that we’re showing today, of course, is with Agilex F and Gen4 x16…and there’s been such a pent-up demand to double the throughput over, say, Gen3 x16 that we’re happy that we’re able to make this first step.
The question, however, is what lies ahead for Gen5? It’s our expectation that we will be able to double or more than double the performance again when we move to Gen5 x16. We’ve been working with closely with Intel on this for some time now, and a key portion of this has to do not with…you know frequency scaling has stopped long time ago…a lot of this has to do with architectural innovation…and one of the things, to tout our engineering team’s own horn a little bit here (but we couldn’t have done it without Intel’s enablement), is that Agilex, both in the current version and in future versions that will support Gen5, allow multiple PCI TLPs to move per clock cycle. Today, with Agilex F-Series, we’re moving up to a billion—I’m sorry correction—two billion TLPs per clock cycle: two on ingress, two on egress, at 500 MHz.
With Gen5, and the I-Series R-Tile, we’ll be able to double that again. Now, doubling the number of TLPs doesn’t necessarily double the bandwidth, but it allows our Arkville IP to be smarter about…sorry, I go on and ramble too much. Short answer on Gen5: the bandwidth and throughput are going to double again, without any significant increase, and possibly with a reduction, in latency.
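As a rough check on the TLP arithmetic quoted above, the sketch below multiplies TLPs moved per clock by the clock rate. The function name is illustrative only; the inputs (two TLPs on ingress, two on egress, 500 MHz) are the figures from the talk, and nothing here is taken from Atomic Rules documentation.

```python
def tlps_per_second(ingress_per_clk: int, egress_per_clk: int, clock_hz: float) -> float:
    """Peak TLP rate: TLPs moved per clock in both directions times clock frequency."""
    return (ingress_per_clk + egress_per_clk) * clock_hz


# Figures quoted in the talk for Agilex F-Series: two in, two out, 500 MHz.
f_series_rate = tlps_per_second(2, 2, 500e6)
print(f"F-Series peak: {f_series_rate / 1e9:.1f} billion TLPs per second")
```

Four TLPs per clock at 500 MHz gives two billion TLPs per second, which is a peak packet rate rather than a bandwidth figure.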
Now, the question also touched on CXL. CXL is a different beast entirely. Just as the world has discovered over the past few decades that heterogeneous compute is a good idea…you know…heterogeneous communication is a great idea. And there is a place for bulk data movement and packet data movement, which PCIe addresses, and there’s also a place for CXL.
Arkville’s position (and its interconnection to Intel’s technology: the underlying P-Tile and R-Tile technology that’s inside Agilex) is such that we do not preclude working alongside a CXL solution in the future…but we’re getting ahead of ourselves. After five years of shipping Arkville, and with all this pent-up demand and desire to get to Gen4 x16, we’re happy that we’re here today, and I hope we can enjoy that for a little bit before we jump on and start pounding on Gen5 x16 and CXL.
No Shep, this is Jeff—you’re not going to get much rest. We have the F-Series parts out now that are the Gen4, as we talked about earlier. And I mentioned the I-Series and M-Series parts that are coming down the road that Tom talked about a little bit. We will have I-Series boards coming out with Gen5 targeting the middle of next year.
And we’ll be right there with you.
I’ve got a question about the H2D latency for Shep—I know you spoke to that, but maybe you could elaborate a bit.
Sure. Well, again, as I said earlier, the best way to investigate any performance parameter (power, throughput, latency) is to measure it yourself. Despite the component shortages, this hardware, software and IP are all available, so let’s get it into your shop and measure it under your conditions. Under our test conditions, as I said, this isn’t a fintech-style cut-through IP. It’s store-and-forward, and quite an intelligent store-and-forward, but there’s no latency jitter whatsoever.
So, we’re on the order of 1 to 3 microseconds of nominal latency, with no long tail. The calling card here, and the differentiator in a store-and-forward architecture, is what could put the packet at the head of the line (whether it’s going upstream or downstream) on hold before it gets to move. A standard DMA engine has pinning of pages, scatter-gather…all the stuff that we haven’t mentioned at all in this conversation up to this point. Arkville has none of that. It’s completely deterministic. So, for example, a chunk of data arriving at the FPGA on the way to a userland host memory buffer is fire-and-forget, and that latency’s going to be on the order of single-digit microseconds.
I’ve got a question here…oh yeah, so, the graph had different upstream and downstream transfer rates, so why is that?
That’s a great question. I thought I touched on it a little bit before, but I’ll repeat it because maybe I wasn’t clear enough. Well, actually, there were several graphs shown that had different upstream and downstream performance.
Let’s see…why is the downstream or egress performance less? I would say, in general, there’s more room for Amdahl’s-law-style serialization delays somewhere in the system, whether that’s in software or in hardware (which could include Atomic Rules hardware and the like). In particular, for those paying really close attention: on the Xeon servers, as I said, we saw significantly less downstream performance than we did on Rocket Lake, and we believe that’s owing to our own programming of our demo and which NUMA zone we were pulling the downstream data from.
We believe, and our team is still looking at that example in test, that the data that was coming downstream on the demo that we showed was actually coming from the processor-attached memory on the other processor—on the other QPI side—of the NUMA zone.
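For readers who want to check this kind of NUMA placement on their own Linux host, here is a minimal sketch. It reads the `numa_node` attribute that Linux exposes under sysfs for each PCI device; the helper names and the device address are hypothetical, and this is not part of the Arkville tooling or the demo shown in the webinar.

```python
from pathlib import Path


def pci_numa_node(bdf: str, sysfs_root: str = "/sys") -> int:
    """Return the NUMA node Linux reports for a PCI device, or -1 if unknown."""
    node_file = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "numa_node"
    try:
        return int(node_file.read_text().strip())
    except (OSError, ValueError):
        # Device not present, or the attribute could not be parsed.
        return -1


def is_local(bdf: str, buffer_node: int, sysfs_root: str = "/sys") -> bool:
    """True when the host buffer's NUMA node matches the device's node."""
    device_node = pci_numa_node(bdf, sysfs_root)
    return device_node >= 0 and device_node == buffer_node


bdf = "0000:3b:00.0"  # hypothetical address; substitute your card's
print(bdf, "reports NUMA node", pci_numa_node(bdf))
```

If the buffer that feeds downstream traffic lives on a different node than the card, reads have to cross the inter-processor link, which is the situation described above.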
We’ll refine that as we go on. In general, you know, a posted write, which is fire-and-forget, is going to move data upstream really easily if the memory system can retire it (and both the Xeons and the Rocket Lakes did that really well). The reads, no matter how many read requests we have outstanding and how careful we are about trying to be nice to the memory controller, sometimes have completions that just take longer to come back.
It’s a plus that we can handle multiple completions within a single clock cycle; again, that goes back to the architectural advantages that the Agilex interfaces give us. But, in general, reads, which have completions, are more open to issues than writes, which can be posted and are fire-and-forget.
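To illustrate why read completion latency matters for downstream throughput, the sketch below applies Little's law: the data that must be in flight equals throughput times round-trip latency. The 220 Gb/s target is the Arkville figure quoted for this webinar; the 512-byte read request size and the round-trip latencies are assumptions chosen purely for illustration, not measured values.

```python
import math


def reads_in_flight(target_gbps: float, round_trip_us: float, read_size_bytes: int) -> int:
    """Little's law: bytes in flight = throughput x latency, divided by request size."""
    bytes_in_flight = (target_gbps * 1e9 / 8) * (round_trip_us * 1e-6)
    return math.ceil(bytes_in_flight / read_size_bytes)


# Assumed figures for illustration only: 512-byte reads, varying round-trip latency.
for latency_us in (0.5, 1.0, 2.0):
    needed = reads_in_flight(220, latency_us, 512)
    print(f"{latency_us} us round trip: about {needed} reads outstanding")
```

Under these assumptions, doubling the completion latency doubles the number of reads that must be kept outstanding to hold the same rate, while posted writes carry no such requirement.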
Hey Marcus? Did you want me to talk a little bit more about the Gen5 and CXL?
So yeah, sure, if you got a chance.
Yeah. So, I think my audio dropped off, so maybe I missed that earlier, but just to let people know: I think Jeff already mentioned that they are planning to do additional cards that are based on Agilex, but the Agilex I-Series. And the I-Series is the device with the chiplet that we call the R-Tile, which does support PCI Express Gen5 and CXL.
And at the chip level, we are sampling those devices today, and we’ve been to PCI-SIG workshops already. We are getting the full bandwidth out of our device and R-Tile. So, we’re doing full-bandwidth PCI Express Gen5 x16. And, if you compare that to this particular board, you know, it’s essentially double the bandwidth just from a PCI Express point of view.
All right, thanks for that additional information. We’ve got time for a couple more questions. So, this one, again, is for Shep. How is the Arkville DPDK different from the Intel FPGA multi-queue DMA DPDK support?
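The "double the bandwidth" comparison can be sketched from the published per-lane signaling rates: 16 GT/s for Gen4 and 32 GT/s for Gen5, both with 128b/130b encoding. The function name is illustrative, and the result is the raw per-direction line rate before TLP and DLLP protocol overhead, so it is an upper bound rather than an achievable throughput.

```python
def raw_link_gbps(gt_per_s: float, lanes: int) -> float:
    """Per-direction line rate after 128b/130b encoding, before protocol overhead."""
    return gt_per_s * lanes * 128 / 130


gen4 = raw_link_gbps(16.0, 16)
gen5 = raw_link_gbps(32.0, 16)
print(f"Gen4 x16: {gen4:.0f} Gb/s, Gen5 x16: {gen5:.0f} Gb/s, ratio {gen5 / gen4:.1f}x")
```

The Gen4 x16 figure comes out near 252 Gb/s per direction, which puts the 220 Gb/s quoted for Arkville in context.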
Awesome—that’s a great question, Marcus. But you know what? Put that on hold for a second—push that out a sec, because I want to touch on the Gen5 point that Tom and Jeff both mentioned. So, almost 100% of our Arkville customers (which is our preeminent IP) depend on throughput. That’s why Arkville on Agilex today at Gen4 x16 is so important: Customers and applications are being empowered.
When Gen5 x16 devices and boards come along, Atomic Rules is committed to being there, or we don’t have a business. I just want to underscore, without prematurely announcing Arkville support for Gen5—obviously, we have our eye on that very closely.
Okay, so over to Arkville versus a roll-your-own such as Intel’s excellent multi…I think it’s called multi-queue or Multi Channel DMA (MCDMA). Well, MCDMA is an excellent free IP available from Intel, built into Quartus with example designs. It really is a kitchen-sink DMA, not so much a data mover, in my opinion.
It supports virtually every possible role that you would want to use a data mover for: streams, messages, caching, CXL…you name it, MCDMA does that. It does use about twice as much memory resource on the FPGA as Arkville does. But, then again, Intel’s in the business of selling bigger FPGAs, so maybe there’s a method to that madness. And it’s going to require some work. I mean, it’s going to take RTL to use that IP. It’s going to require some RTL simulation and hooking up, and it’s going to require some software on the other side, even though I’m sure Intel will provide that.
If you have specific needs that Arkville doesn’t address spot-on, I say go and run after that. We’re really not competing against that. If you have a data motion problem, where either it’s bulk data or it’s DPDK and it’s networking, we have something that’s going to get you going literally that day. So, I think that’s the differentiator in a build-versus-buy sense.
I guess, last, you know, another piece is CPU offload because, you know, MCDMA is going to use scatter-gather lists…it’s going to use host cores in order to do that. So, if you have a boatload of cores sitting on the host to participate in your DMA, go for it. Arkville takes that work on itself and leaves those cores for your application.
Sorry Marcus, there is one thing I want to add on to that. I can verify one of the points that Shep’s making there: our customers who have used Arkville and other Atomic Rules IP have been up and running remarkably quickly. Shep and his team do a great job of providing an out-of-the-box, ready-for-use deployment, rather than, you know, roll your own…here are the basic components, put it all together. So that is one difference I see when we’re deploying that with our customers.
A question you can see here…I’ve got…yeah. On the IP: portability to other Intel FPGA devices. I don’t know if they mean, perhaps, Stratix 10 or other Agilex. How easy is it to port to other devices?
I guess that’s for me. So, we haven’t had any demand to use the Stratix 10 devices with Arkville. However, all of our other IP—TimeServo, TimeServo PTP and our UDP Offload Engine—are supported on Stratix 10 and even earlier Intel devices as well as other FPGAs. Although there’s a lot of desire and value in things like Quartus, Platform Designer (formerly Qsys), we’ve been moving towards a SystemVerilog representation for all of our cores, which means, you know, a dozen lines of text represents the instantiation.
If someone genuinely has interest in using Arkville on Stratix 10—get in touch with us.
I think one of the answers is that the customer/user does not do any porting of Arkville. Atomic Rules will do the porting. And if you want to build, you know, with the S10 or the Agilex I-Series, it’s just a different core from Atomic Rules that loads right in and works seamlessly. There’s no additional user work required.
And the Arkville interface signature (I apologize: I’m speaking now to the RTL designer gals and guys out there) is a dozen lines of SystemVerilog. Done…a handful of interfaces…drop that into your design and go. And that doesn’t change between any FPGA device; it’s the same thing.
Okay, last one or two questions here. This one is, well, I’ll just read it out here: Is Arkville provided as an encrypted netlist or obfuscated HDL, and if the latter, then what language? Again, for Shep.
Ahh, trick question. So, we provide Arkville to our licensed customers as an unencrypted IEEE Verilog netlist. However, that unencrypted Verilog isn’t the source code. We use our own Atomic Rules-based functional programming language to generate that Verilog by machine, and that’s how we do our formal verification. So, to be specific, the delivered asset that you’re simulating and compiling against is an un-obfuscated, un-encrypted Verilog netlist.
Thank you all for watching and have a good day. This ends the webinar.