Data Capture and Record from a COTS Perspective
There’s a growing need for 200 Gb/s and higher data capture and recording in hardware. Many options tend toward proprietary solutions, but is there a way to use commercial off-the-shelf (COTS) components? We’ll explore the options with IP provider Atomic Rules, COTS FPGA card provider BittWare, and system provider One Stop Systems.
Thank you for joining us today. I’m Bryan DeLuca, along with Nicolette Emmino, and we’ll be your hosts for today’s live chat, Data Capture and Record from a COTS Perspective, sponsored by Mouser Electronics and BittWare.
We have some great panelists, and this is a live chat, so make sure you ask your questions in the Q&A at the bottom of your screen. And now to Nicolette.
Hello, and like Bryan said, welcome to our panel discussion on Data Capture and Record from a COTS Perspective. To lead us today in this discussion, we’re joined by Chad Hamilton, VP of Products at BittWare, Shep Siegel, CTO over at Atomic Rules, and Jim Ison, Chief Product Officer at One Stop Systems.
But, before we dive into specific questions, I want to give each of our panelists a moment to introduce themselves and their companies because these are three very different companies that have come together to provide a cohesive solution.
So, Chad, why don’t you kick us off? Let’s each share a brief overview of your company’s core expertise and particularly how it relates to COTS and data capture solutions.
Sure, thank you for having me. Again—Chad Hamilton—been with the company for almost 16 years now. BittWare has actually been in business for, I think, about 34 years. We provide enterprise-class FPGA hardware accelerators in the compute, network, storage, and sensor processing space with products from high-end FPGA companies like Intel, AMD, and Achronix.
We have one of the largest, maybe the largest, COTS FPGA portfolios in the market, and our customers can develop and deploy their applications quickly and cost-effectively. In cases where COTS isn’t a solution for a customer, we do customize where possible. We’ll entertain it if it makes good business sense: anything from a slight variant of our existing products to a full custom solution. It just has to make sense for all parties.
We do also have a lineup of TeraBox certified servers for both development and deployment, and along with our partners such as One Stop, you know, we can provide fully integrated server and card solutions.
And lastly, we’ve…I think everybody in the world is talking about AI and machine learning at this point. So, we’ve begun to partner up with a few FPGA- and ASIC-based companies to cover that, you know, datacenter out to the edge.
Awesome! Shep, how about you?
Sure. Hi, I’m Shep Siegel. I’m the CTO and founder of Atomic Rules. I started Atomic Rules back in 2008, and when we started, it was a services shop that was basically just me. Over the years, we’ve brought in about a dozen really talented engineers, and around 2012 or 2013 we started making IP cores in addition to offering IP design services. As it would turn out, that business of making cores would ultimately lead us to the COTS and turnkey solutions we’re talking about with TK242 today (more on that later).
Atomic Rules’ DNA is complex concurrency. We solve difficult RTL problems with lots of moving parts, and we have specific tools and languages that we use to handle complex concurrency well. Our preeminent product, Arkville, our brand of DMA for moving data between host memory and FPGA in either direction, is the highest-performance DMA engine today on the latest standards like PCIe Gen 5 x16: 60 gigabytes a second. That’s our calling card.
Most people know Atomic Rules from our IP cores, and specifically Arkville, our brand of DMA. But as we’ll get to in this call, we’ve been trying something new in the COTS space. You asked about COTS, and we’ve composed some of our IP greatest hits into something that’s useful for packet capture. I hope we can get into that more in this call.
(Nicolette) How about you, Jim?
Yep, Jim Ison, Chief Product Officer at One Stop Systems. One Stop has been around for 25 years; I’ve been here 19 of those and seen a good transformation. We do rugged systems that go to the edge. Typically we take data-center-class componentry and put it at the edge very quickly, time to market: things like GPUs, FPGA cards, NVMe drives, all the hardware you’re used to using in the Amazon cloud or in the desktop workstation you just bought.
But we are able to bring that into ruggedized systems at the edge, and then we do that at scale. So we’re also PCI Express experts that take, say, a server that has five slots and we can scale it up to 16, 32, 128 slots—so you can really deploy very high-end systems at the edge—at scale.
Alright, guys, thank you so much for taking a moment to really explain what each company does. Let’s start at the beginning. Can anyone give us…and I know most of us probably know the answer to this…but can anyone give us a brief overview of what commercial off-the-shelf (or COTS) means in the context of data capture and recording technologies? And kind of…how does it differ from custom-built solutions? Whoever would like to field that. Chad?
(Jim) Chad you’re on mute.
(Chad) Sorry. I said, “Why don’t I take that?”
(Nicolette) I said anyone but I really meant you, Chad. (laughter)
(Bryan) Right, right.
So, commercial off-the-shelf products are ready-made for the general public in standard, well-established form factors: PCIe, for example, for a lot of the cards we’re doing today. But we’ve also done U.2, VPX, CompactPCI. There’s a whole variety of standards and form factors out there, and people know they can buy these products off the shelf and plug them into a system that’s quickly deployable.
They don’t need to go design a new custom backplane, for example, for these types of products. Compare that to fully custom solutions, where somebody might come to BittWare and ask for a different format of card, maybe without the standard form factors. That’s a more costly investment, because we’re then going to develop a card for, typically, that one customer. Which is fine, like I said earlier, if the business case makes sense.
From a data capture and recording standpoint, on these cards (let’s stick with PCIe for now, because those are the most prevalent cards we have in the market right now) we add I/O. So, for example, Ethernet: we can do 400-gigabit Ethernet on these cards through standard QSFP connectors. PCIe Gen 5 x16 is a COTS form factor. We’ll have external memory and other types of interfaces on the card, and that allows a company like Atomic Rules to implement their design on a platform that’s easily accessible in the market. They can take that and put in the secret sauce needed for this type of solution.
So, how has the availability of COTS components impacted innovation and time to market for new data capture and recording solutions?
(Shep) Me, me, me!
(Bryan) Okay, there you go! (laughter) It’s all yours Shep.
Right. Yeah, so everything…certainly I agree with what Chad said about COTS, everything there.
But you know, COTS isn’t just about hardware and edge and systems and heavy iron. It’s also about software. To me, COTS means buying instead of building and getting something prototyped quickly instead of going down the long road of development.
I mentioned earlier how Atomic Rules has this history of IP cores, DMA engines, packet processing, and so on and so forth. A couple of years back, we started hearing a growing drumbeat of the need for packet capture. We saw the limitations of merchant ASIC-based NICs for doing packet capture: they would drop packets, they’re not performant in the way that packet capture solutions require.
And we started hearing from multiple customers saying, “Could we put Atomic Rules cores together with OSS disk drives and BittWare boards to make a solution?”
And yeah, they certainly could, but that’s still a lot of assembly required: expert FPGA competencies, expert system-level competencies, writing software. The drumbeat of the need to capture Ethernet packets became so loud that we said, “What the heck are we doing here? Instead of producing essentially throwaway examples (which we usually provide with our IP cores to get people started), why don’t we do a turnkey example?” The “TK” in TK242, which we’ll talk about, stands for turnkey.
Turnkey, like COTS, probably means different things to different people. But the idea is employing COTS boards (FPGA boards off the shelf from companies like BittWare), COTS systems (edge PCIe systems from companies like OSS and others), and, most importantly from our perspective, our IP. We could produce a set of software and a bitstream that transforms a COTS card from BittWare into a packet capture solution capable of solving the most basic of packet capture problems: essentially the intersection in the Venn diagram of all the requests we heard for what people wanted to do.
Can I add a little more to that without stealing too much time from everyone, Bryan?
(Bryan) Ha, yeah, of course, you grab a little more time!
(Shep) I want to go just a little bit further. I said a moment ago we wanted to do the things that the merchant NICs can’t do. Because obviously, if you can buy a $1,000 NIC off the shelf, plug it into Ethernet, write some software, and be done, what does the FPGA add? What value are we adding there?
(Bryan) Right.
(Shep) The kinds of requests we were getting were for line rates (Ethernet speeds) above what merchant NICs could capture flawlessly without dropping packets. And for the most part (there are exceptions), people just can’t drop packets. Dropping packets is like money leaking out of your wallet or failing a test; you just can’t do that.
It turned out, in that Venn diagram of trying to find the sweet spot (again, turn the clock back 18 months or so), the sweet spot was recording both sides of a 100-gigabit conversation for any packet size, whether 60-byte tinygrams or 9-kilobyte jumbo packets. Recording the worst case for 100 gigabits per second going in both directions comes to about 200 gigabits per second, so that 200-gigabit number (about 25 gigabytes per second) was the sweet spot we saw quite clearly.
You cannot get an off-the-shelf NIC from Intel or NVIDIA Mellanox to do that. It will drop packets when the packet size gets small. It doesn’t know what…it can’t handle that. It won’t do that.
However, an FPGA application with our DMA engine and our PCAP hardware to put those pieces together can do that. So, we put that together.
The other thing that fueled 200 Gbps, and why 200 Gbps is kind of magic, is that 18 months ago there was no Gen 5 PCIe. Gen 5 was in development; the spec was written, but it didn’t exist. So that 25-gigabyte-per-second, 200-gigabit number was a nice fit to PCIe Gen 4 x16 18 months ago. It turns out it’s also a great fit to Gen 5 x8 today (more on that later).
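For readers who want the arithmetic behind that fit, here is a quick sketch. The signaling rates and 128b/130b encoding are the standard PCIe figures; the roughly 80% usable fraction is a rule of thumb for TLP and flow-control overhead, not a measured TK242 number.

```python
# Rough PCIe link-bandwidth arithmetic behind the 200 Gb/s "sweet spot".
GT_PER_LANE = {"gen4": 16.0, "gen5": 32.0}  # GT/s per lane (PCIe spec rates)
ENCODING = 128 / 130                         # 128b/130b line coding (Gen 3+)

def raw_gbytes_per_s(gen: str, lanes: int) -> float:
    """Raw link bandwidth in GB/s, before protocol overhead."""
    return GT_PER_LANE[gen] * lanes * ENCODING / 8

for gen, lanes in [("gen4", 16), ("gen5", 8)]:
    raw = raw_gbytes_per_s(gen, lanes)
    usable = raw * 0.80  # ~80% after TLP/flow-control overhead (rule of thumb)
    print(f"{gen} x{lanes}: {raw:.1f} GB/s raw, ~{usable:.0f} GB/s usable "
          f"(~{usable * 8:.0f} Gb/s)")

# Both links land around 31.5 GB/s raw and roughly 25 GB/s usable, which is
# why 200 Gb/s (25 GB/s) fits Gen 4 x16 and Gen 5 x8 equally well.
```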
Recognizing that this stuff isn’t easy, we had to get to work on it. We were at work 18 months ago aiming at this 200-gigabit solution, not because that’s the be-all and end-all, but that covered the large set of customers that we heard coming in.
To be clear, there are people who come in and say, “We don’t need 200 gigabit.” Maybe they’re recording something less than that, and some bandwidth in reserve is seldom a bad thing in this case.
To us, COTS is about Atomic Rules being able to flip over its IP model, which had required expert FPGA competencies, expert software engineering competencies, and expert competencies in disk systems and hardware. Flipping it over and saying, “You know what? Atomic Rules is going to have (and excuse me if we’re using this the wrong way) what we call a turnkey solution to the problem of packet capture, where we’ve done the engineering. Our bitstream loads onto a BittWare card and transforms its personality from an FPGA card that could do anything into a very specific packet capture solution.”
You bring your own hardware, in the sense that you bring the board that you get from BittWare, you bring the disk system that you get from OSS or wherever you choose to bring the pieces together, and you’re off to the races. We’re having a lot of fun with that.
Hey, I just want to jump in real quick here to address the whole COTS time-to-market aspect as well. As I mentioned earlier, we’re putting the latest and greatest FPGA technology from these vendors on our cards, and those devices don’t arrive as production silicon first. BittWare will get a really large head start designing these cards before even what they call engineering silicon is available. That allows us to get the cards, get what we call early-access units, out to customers and partners like Atomic Rules, and they can start working on these cards well before production silicon is available.
So, these are complex designs that need to be wrung out and tested and simulated, over and over, in an iterative process to optimize the IP. I think that’s one of the great aspects of having commercial off-the-shelf: when production silicon is ready, BittWare cards are there with production silicon ready to go on the card.
And that’s really how you’re adapting. I mean, I was going to ask you, as the FPGAs are growing more complex, how is BittWare adapting to accelerate this time to market for applications that use these components? I think you’ve addressed some of that just there, Chad.
Yeah, and it’s not just that. Another thing is we’ve got 34 years of extensive knowledge base built up.
You know, one of the hardest things to do with these cards, surprisingly, is the PCIe design. It used to be much easier, but with the signaling rates going across these PCBs now and the power requirements, it is very difficult to build a PCIe form-factor card (which has a limited width so it plugs into a slot) and get it to not, basically, overheat. We’ve got all that knowledge built into these cards. We have our BMC, which monitors the health of the card and shuts it down when it needs to.
But there is a lot of complexity that goes into designing these cards nowadays at the speeds and feeds we’re putting on them.
So Chad, why does BittWare partner with companies like Atomic Rules and One Stop Systems to offer solutions such as TK242, as opposed to…taking it all in house?
Sure, I think the easiest answer is it’s hard, right? (Laughs) I think Shep was mentioning a little bit earlier that the expertise needed to go develop IP for these cards is different from developing the actual hardware themselves.
Now, we could certainly invest in more resources and develop our own solutions, but you have to hit the right target often, or that team of engineers may not be the best return on investment.
So, by partnering with IP vendors like Atomic Rules (and there are several others you can go on our website and check out), we’re picking out the best-in-class IP in the market, partnering up with those teams, and asking them to, basically, get their solution on our best-in-class hardware products.
And then we’re able to work with One Stop to get systems that can be customized for whatever the end application is. We can provide that whole system as a solution now, as opposed to giving somebody a blank FPGA card that they’ve got to go design all that themselves.
(Jim) Instead of customized, I’d say maybe configurable off-the-shelf.
(Chad) Yeah, that’s a much better word.
(Jim) For us, the configurable off-the-shelf piece is a big part of being able to select the servers, and the expansion, and then the BittWare cards and putting the right software on, like from Atomic Rules, to make that solution happen.
(Bryan) We do have questions coming in, so we’re going to hold off on answering some of those until we get further into the conversation.
Specifically, for you, Jim, so we’re going to hold off! (laughs) Because we want to… Shep, we’re talking about TK242: could you give us a brief overview of TK242 and some of its “no programming required” features, and how it benefits users in the context of COTS solutions?
Thank you, Nicolette, I’d love to jump all over that, and if I go too long, just throw things at me virtually.
(Nicolette) I don’t want to break my screen though, Shep! (laughs)
One thing before you go, Shep: this is one of the reasons why we partner with companies like you, because you spend countless days, months, years developing this IP, right?
Right, we have.
So, I’m going to throw up a block diagram that I’ll talk to a little when we get into the nuts and bolts. But before I start moving around to talk to the blocks on the screen, I’ll speak a little about TK242. By the way, “TK,” as I mentioned earlier, stands for “turnkey.” The magic of the 242 number is that there are two 100-gig paths. We wanted a four in there because, as mentioned earlier, this was tuned for Gen 4 x16, hence the 200-gigabit number. And who doesn’t like the number 42 in a product? (laughter)
And so by “turnkey,” (and we’ll come back to this over and over): no FPGA programming. This is a bitstream that we deliver completely that runs on the board, so we don’t have to talk about FPGA vendor tools…same is true with software, and we’ll get to that as well.
While yes, there are C, C++, and Python APIs that can be used, we deliver, as open-source code, a full Linux service for TK242 where, once the service is installed, literally all you do is turn your system on and, for eternity, every packet that’s captured, up to 200 Gbps, is stored into an infinite buffer of .pcap files on the host. It just doesn’t get simpler than that.
Let me dive in a little so we can go through the machinery of what goes on. TK242 here is really an overlay on a BittWare card. We want to show a picture of it somewhere; it’s a half-height, half-length board. We’re not locked to this specific board, but we really think the bang for the buck on this card in particular is just astounding because, well, it’s off the shelf from Mouser. Shameless plug: if people want to try TK242 and get it going tomorrow, pick out a box, get the card there, download our install package (it’s one script to install everything), plug in your connector, and your packets are streaming to disk ad infinitum.
Let me quickly tour through some of the pieces inside the FPGA, because people are probably wondering, “Well, how do you do that packet capture?” I mentioned how we’re doing things that a merchant NIC can’t. Let me talk a little bit about them.
So, we have two parallel 100-gigabit acquisition channels. It comes in on a QSFP-DD connector, so if you’re using, say, 100GBASE-CR4, you would split that out with a breakout cable. It can work with a DAC cable or an active optical cable; it doesn’t matter.
The entire data path for TK242 is provisioned for 300 million packets per second and 200 gigabits. As I mentioned earlier, TK242 is in some ways an Atomic Rules greatest hits of IP, without you having to integrate it, because, of course, we put it together.
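That 300-million-packets-per-second provisioning figure follows from worst-case Ethernet framing arithmetic. A sketch (the 8-byte preamble/start delimiter and 12-byte inter-frame gap are standard Ethernet wire overhead; 64 bytes is the minimum frame size):

```python
# Worst-case packet rate for two fully subscribed 100 GbE channels.
LINE_RATE_BPS = 100e9     # one channel
WIRE_OVERHEAD = 8 + 12    # preamble/SFD + inter-frame gap, in bytes
MIN_FRAME = 64            # minimum Ethernet frame, including FCS, in bytes

bits_per_packet = (MIN_FRAME + WIRE_OVERHEAD) * 8      # 672 bits on the wire
pps_per_channel = LINE_RATE_BPS / bits_per_packet      # ~148.8 Mpps
print(f"~{2 * pps_per_channel / 1e6:.0f} Mpps total")  # -> ~298 Mpps

# Two channels of minimum-size frames is just under 300 Mpps, hence a data
# path provisioned for 300 million packets per second at 200 Gb/s.
```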
Our TimeServo system timer has nanosecond-resolution time and feeds it to the MACs. Every arriving L2 packet is stamped to nanosecond resolution. As packets arrive, we merge them into a single stream so they are in monotonically increasing order of arrival.
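As a mental model of that merge (purely illustrative; in TK242 this happens in FPGA hardware, not Python), think of two channels of timestamped records combined into one stream ordered by arrival time:

```python
import heapq

def merge_channels(ch0, ch1):
    """Merge two timestamp-ordered packet streams into one stream whose
    timestamps are monotonically non-decreasing."""
    yield from heapq.merge(ch0, ch1, key=lambda rec: rec[0])

# Hypothetical records: (timestamp_ns, packet_bytes), ordered per channel.
ch0 = [(100, b"pkt-a"), (250, b"pkt-c")]
ch1 = [(120, b"pkt-b"), (260, b"pkt-d")]
for ts_ns, pkt in merge_channels(ch0, ch1):
    print(ts_ns, pkt)
```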
I won’t go into it here (but we can if people want): we have a deep VXLAN RSS packet processor and a flow table with 64,000 entries, where we could (if we wanted to) split this merged stream of 200 gigabits per second into four different streams. Say we wanted to filter out certain packets and send certain packets to different PCAP files. It turns out (we found after doing all this work) that the vast majority of our users really want one or two PCAP files, not many. But the hardware is provisioned to run four at a time, and if you want to funnel all 200 gigabits per second, or all 300 million packets per second, to one PCAP file, we will do so (and we won’t drop any packets regardless of the packet size).
So that’s the P2PCAP engine, where we’re basically making an industry-standard, byte-true PCAP file in hardware, so the host CPU (the Linux processors) has zero touch on the actual data. From there it’s off to our Arkville DMA engine and into host memory, where it bounces: the subsequent NVMe storage system’s writes to disk are actually reads from host memory. And all of this happens without a hitch at rates of up to 200 gigabits per second.
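Because the engine emits a byte-true, industry-standard PCAP image, the resulting files can be walked with nothing more than the public PCAP layout. A minimal reader sketch follows; the filename is hypothetical, and the nanosecond-magic branch assumes TK242 writes nanosecond-resolution PCAP (consistent with the TimeServo timestamps), which you should verify against your own files.

```python
import struct

PCAP_MAGIC_US = 0xA1B2C3D4  # classic pcap, microsecond timestamps
PCAP_MAGIC_NS = 0xA1B23C4D  # classic pcap, nanosecond timestamps

def walk_pcap(path):
    """Yield (ts_sec, ts_frac, unit, orig_len, frame) from a classic pcap file."""
    with open(path, "rb") as f:
        # 24-byte global header: magic, version, tz, sigfigs, snaplen, linktype
        magic, _vmaj, _vmin, _tz, _sig, _snap, _link = struct.unpack(
            "<IHHiIII", f.read(24))
        assert magic in (PCAP_MAGIC_US, PCAP_MAGIC_NS), "not little-endian pcap"
        unit = "ns" if magic == PCAP_MAGIC_NS else "us"
        # Each record: 16-byte header (ts_sec, ts_frac, captured, original)
        while hdr := f.read(16):
            ts_sec, ts_frac, incl, orig = struct.unpack("<IIII", hdr)
            yield ts_sec, ts_frac, unit, orig, f.read(incl)

for ts_sec, ts_frac, unit, orig, frame in walk_pcap("capture_000.pcap"):
    print(ts_sec, ts_frac, unit, orig, len(frame))
    break  # just peek at the first record
```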
That’s the story and obviously there’s a lot of dance to get there along the way.
There’s a whole separate conversation we’ll get into with Jim and Chad in a little bit about what storage system is right for each user. One of the things we’ve learned in this odyssey, about a year in, is that moving 200 gigabits (or more than that; we do 400 or 500 gigabits per second today with Gen 5 x16, while on the Gen 4 x16 interface we’re limited to about 25 gigabytes per second) into host memory is one thing, but getting that same throughput through to disk is another challenge altogether. That’s what our TK242 software service works to do.
I wouldn’t say there’s magic there. There’s coding; all the pieces are put together, hung together, to make it work. But it is certainly not a “gimme” to have a disk system that’s going to flawlessly swallow 200 gigabits per second all day long. For certain, that one little M.2 chiclet your Ubuntu OS is on is not enough. You’re going to need some kind of RAID solution, and (here’s the fun part) that’s different for just about every application. We leave the door open for you to bring your own right-sized hardware to tackle that piece of storage. What’s the persistence, what’s the capacity, what’s the reliability? You determine that; it’s not Atomic Rules or BittWare or OSS telling you what you have to do. Let me pause there because I don’t want to monopolize…
I have a question. You mentioned that TK242 can handle up to 200 gigabits per second. Can you elaborate on how this capability fits various data capture needs, from smaller to larger bandwidth requirements?
Thanks, Bryan. Briefly: the fact that there are two 100-gig MACs (and their wire-line-rate subscription can asymptotically approach 100 gigabit each) is where the 200 came from. It’s also not entirely a coincidence that that’s about the right-sized bandwidth of Gen 4 x16, which is the sweet spot we designed this for, and why this is so commercially affordable today.
You’re still paying a premium today for Gen 5 technology, but Gen 4 is actually starting to come out in volume, and, as Chad mentioned earlier, a lot of work went into bringing Gen 4 to maturity (which it has now reached, so it’s a good time for users to harvest it). But not everyone needs to capture both sides of a 100-gig conversation.
In the Ethernet world, many conversations aren’t fully subscribing both sides of the line, so the actual rate being some number less than 200 (although it might go to 200 instantaneously, or for some length of time) is fine. Having that extra bandwidth really doesn’t cost you that much, so provisioning for 200 isn’t terrible.
But it turns out that some people have much lower requirements in terms of pure, sustained throughput. In a pure Ethernet sense…maybe you only have a 40 gigabit Ethernet link or 25 gigabit Ethernet—so obviously it’s proportionately less—so…great, that’s icing on the cake.
There are some TK242 users today in fact, who specifically asked for a 10 gigabit link as opposed to 100 gigabit—they’re taking it down by a whole order of magnitude. So instead of 200 it’s only 20 and they are commercial users of this product, and they find it useful in that sense. So not everyone has to go up against that limit.
The other piece (and this is, again, opening the door for more talk here) is that we’ve just been thrilled at the uptake of TK242 used to record digital radio, digital intermediate frequencies: VITA 49, DIFI. These are essentially I/Q streams out of A-to-D converters in 5G, O-RAN, and radio, where the traffic isn’t necessarily TCP chitchat or UDP going one way or another (although sending it over UDP is an option). The throughput is not dictated by the line rate of the Ethernet connection; it’s dictated by the precision and sample rate of the A-to-D converters capturing that spectrum, which we in turn capture.
So, there’s this wonderful population of TK242 users who are essentially slapping a packetizer on the isochronous stream coming out of their A-to-D converter, their spectrum analyzer, their downconverter, whatever the continuous-time source is. It’s essentially an infinite stream of packets they want to capture. They turn it into packets at a lower rate, and for the most part we’re seeing that number fall significantly below 200 gigabits, maybe somewhere between 100 and 200 gigabits per second.
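The arithmetic for sizing such a stream is simple; the converter parameters below are hypothetical, just to show the shape of the calculation:

```python
# Throughput of a packetized I/Q stream is set by the converter,
# not by the Ethernet line rate.
SAMPLE_RATE = 4e9       # samples/second (hypothetical digitizer)
BITS_PER_SAMPLE = 14    # per component (hypothetical)
COMPONENTS = 2          # I and Q

payload_bps = SAMPLE_RATE * BITS_PER_SAMPLE * COMPONENTS
print(f"{payload_bps / 1e9:.0f} Gb/s of payload before packet headers")
# -> 112 Gb/s: comfortably inside TK242's 200 Gb/s ceiling, but far beyond
#    what a commodity 10/25G capture NIC could swallow.
```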
So, while we have been preoccupied, from an 802.x compliance point of view, for over a year with “Oh my goodness, is there anything Ethernet we can’t capture?”, a great many users are saying, “We’re never going to do that. Our converter is so many gigasamples per second and so many bits per sample, and it’s going to be an isochronous, packetized stream at that rate.” Chad, you might want to embellish that.
I want to bring it back to some products, because we actually have some more questions coming in about product-related features. So Chad, go for it.
Sure, so we’ve got a couple of different products here that Shep is working on, and we’re working on, to satisfy some of the RF stuff, so I’ll get into that.
But in the basic case of a network product with our Intel Agilex cards right now, Atomic Rules is implementing TK242, and we have multiple 400-gig links (which obviously can’t be consumed yet by the host over Arkville). In the case of our IA-420f, for example, with two 200-gig streams, the block diagram he just had up exactly fits that product.
So, there’s the IA-420f, which has the Gen 4 x16. We’re now working on…well, we’re shipping our IA-440i, which is an I-Series card with Gen 5 x16, and those products take the next step and can potentially double the bandwidth there.
But in the RF space, we have a product, the RFX-8440, based on AMD’s Zynq UltraScale+ RFSoC chip with ADCs and DACs built onto the chip. As Shep was mentioning, we digitize that data and then send it over standard QSFP ports that connect directly to the two cards I just mentioned. Actually, not just those two, but those are two of the low-profile cards we offer, and we can have a full end-to-end solution there, which is fantastic.
So we actually…we have a question from a user, “Does TK242 on a BittWare card do any form of offloading from the CPU that a standard NIC would not, and if so what and how?”
Oh, perfect question! That almost sounds like a question I would ask. So I would guess…
(Bryan) (Laughs)
(Chad) You would ask yourself?
(Bryan) That’s how he figures things out, he asks himself questions and then… (laughter)
(Chad) That’s the right thing to do.
(Shep) I’ll bring my screen back again real quick, one moment. So, there is specifically one thing. You may have heard me say earlier, “…doing things that a merchant ASIC-based NIC cannot.” To provide the fundamental capability of dropless packet capture at 200 gigabit, there was one key piece of offload, in hardware, that a NIC does not do and we do: the online conversion (in the hardware, in the FPGA) of packet streams into PCAP files.
There’s no way that individual packets moving over the PCIe bus (with the overheads of TLPs) would let you bring 200-gigabit capture to bear if, for example, a parade of 64-byte tinygrams came in.
What we engineered is circuitry in the FPGA to aggregate the collection of data that would be going to a single PCAP file (doing that offload in hardware), and to have our DMA engine move a byte-true data stream (in other words, byte for byte, it’s identical to the PCAP file as you want to see it on disk) and land it in main memory.
I can’t emphasize this one point enough. The host processor (the x86 CPU choreographing all of this) never touches the individual data. Not to reorganize it for the NVMe drive, not to shift it around, not to add a header or take something off or align it so it can be stored or read properly.
Because we’ve done all of this in hardware, we’ve not only unburdened the host CPU with this offload, but we’ve also streamlined the storage system so that whether you’re running HFS, NTFS, XFS, whatever file system you want, or raw data on the back end, the actual NVMe request queue (where the storage system is essentially reading the data from memory and writing it to solid-state cells) doesn’t require any reorganization.
Now the counterpoint, because that was sort of the “good news.” The “bad news” (well, it’s not really bad news, but the counterpoint, just to be clear) is that TK242 is a fixed bitstream. It does what it does. It’s an overlay, and it makes a BittWare card, for example, have this packet capture capability.
It is not a SmartNIC. It is not an FPGA bag of bits where you could go in and say, “I want to do a TCP decode,” or “I want to do some compression,” or “I want to do some encryption.” (By the way, those are all things we’re terrifically eager to discuss with any customer who wants to do them.) But that is not TK242, which is that COTS turnkey: “here it is, and this is what it does.” Thank you so much for that question.
I have a question. How does Atomic Rules validate the throughput performance of TK242, especially against specific hardware requirements?
Okay. Briefly, we break it into pieces—it’s divide and conquer. (closing graphic) We don’t need to…stop that right there.
In a simple sense we split the verification problem into achieving throughput from the FPGA card to main memory (and then, in the storage system, from main memory onto disk), and then we do holistic testing.
Before any of that begins, everything starts with CI/CD (continuous integration, continuous delivery). We have an elaborate and extensive Jenkins bench (which isn’t Jenkins for CI/CD in the conventional sense most users are familiar with): about two dozen servers from Intel and AMD, with boards from BittWare, Intel, AMD, Nvidia, and others, where we don’t just run the standard Jenkins pipeline on all of our software. We’re actually compiling the TK242 bitstream and running the application in hardware, over and over.
So, we’ve been running dozens of systems 24/7 (at great expense, on-prem) for over a year now to, for example, prove out the DMA engine and give hard, objective evidence to anyone who’s curious: “How can you prove to me you never drop a packet?” By the way, we can also do so by inspection (by looking at the code and the way the flow control works, and so forth).
That gets us to main memory. But getting to main memory alone does not a packet capture solution make, and anyone who’s gone down this road knows this all too well. We also do a similar set of testing (which I’ll say is a little newer to us; it’s much more in the wheelhouse of a company like OSS) validating that the throughput from main memory out to the storage system is equally performant, or at least suitably performant, to meet the goals.
Only when you have the conjunction of satisfactory movement from the FPGA to main memory and from main memory to disk should you actually analyze end-to-end to make sure the two combined are still performant in that way.
Our backs are scarred with the hard reality that we could move 240 gigabits per second into main memory and go, “Oh, aren’t we great?” And we could, using standard Linux tools like fio to benchmark the burst performance from main memory to disk, go, “Oh look, another 220 into that 12-drive striped array of disks; we should be fine, right?” (makes buzzer sound) Wrong, no!
A lot of effort, mostly in the Linux service we developed and supply with TK242, went into squaring that and providing validation tools. If you get a card and load TK242 on it, one of the first things we ask you to do is run a test suite to help validate that performance. It wasn’t shown on those block diagrams, but we have internal packet generators with shapable traffic flow that can ramp up to 256 gigabits per second. We run shaping sweeps up and down to measure throughput to main memory and throughput all the way to disk, on your system, your motherboard, your disk system. So you yourself (not Atomic Rules, not OSS, not BittWare) are going to have a hard, objective number: “Oh look, I guess that’s my performance.”
Can we guarantee what that performance is? Absolutely not; we don’t know what your architecture is. But we are keenly aware that it is not a “gimme.” You bring the wrong disk system, you don’t stuff enough memory DIMMs into the system, your cat spills a bunch of water on the processor: you’re probably not going to get that 200 gigabits under those conditions. But we can measure.
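On the storage side of that validation, the trap is the gap between burst and sustained throughput. Here is a minimal sketch of a sustained-write probe (not the TK242 test suite): Linux-only, with a hypothetical RAID mount point, using O_DIRECT so the page cache can't flatter the number the way a short burst benchmark might.

```python
import mmap
import os
import time

BLOCK = 1 << 20                   # 1 MiB per write
TARGET = "/mnt/raid/probe.bin"    # hypothetical capture-RAID mount point
SECONDS = 60                      # measure a long-run average, not a burst

buf = mmap.mmap(-1, BLOCK)        # anonymous mmap: page-aligned, as O_DIRECT requires
buf.write(os.urandom(BLOCK))      # incompressible payload

fd = os.open(TARGET, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
written, start = 0, time.monotonic()
while time.monotonic() - start < SECONDS:
    written += os.write(fd, buf)  # O_DIRECT bypasses the page cache
os.close(fd)

elapsed = time.monotonic() - start
print(f"sustained: {written / elapsed / 1e9:.2f} GB/s "
      f"({written * 8 / elapsed / 1e9:.0f} Gb/s)")
```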
All right, so I want to get Jim in here because we have a couple questions for Jim and, you know what Jim, you mentioned One Stop before, so this is a perfect time. Jim, why don’t you describe for us what AI transportable means in the context of COTS data recorders, and how this impacts performance and usability? We’ll start there.
You’ve been looking at our website where we have this term AI transportable…
(Bryan) (laughs) That’s doing your research!
(Jim) …and I know Chad mentioned AI earlier and it’s great hearing Shep saying how these speeds and feeds and everything gets into the server and into the systems, because that’s really what we do at OSS.
AI transportables to us is…we didn’t want to say something like mobile AI, which makes people think of cell phones and things like that. We’re really putting data-center-class hardware at the edge, so we can get these kinds of speeds and feeds from the BittWare hardware and the Atomic Rules software.
So, AI transportables: we see the entire AI workflow as somewhere OSS brings a lot of value. There are millions of sensors out there, and the BittWare card is the way to get those sensors into a data set you’re going to store, and Atomic Rules makes that turnkey and easy. That sensor ingest, the data recorder piece (data loggers if you’re talking about autonomous vehicles), is the head end of the AI workflow.
So, we provide these high-end hardware systems, like our SDS server (one of the servers Shep’s been hammering on at Atomic Rules headquarters), to get all that data in.
Then the next thing is you have to run inference on that data. The type of scale I talked about before (and I’ll get to a question on PCIe lanes in a minute) means that, on the same system, you could even process that data using GPUs to get it into usable form for visualization or anything like that, depending on what sensor data you’re bringing in. Or make decisions based on it, which is your AI inference. That’s the next piece of the puzzle, and we bring (because of our Nvidia relationship) a big piece of that into the AI workflow.
So that’s the overview of AI transportable: getting all those pieces, but doing it all at the edge. Not in the Amazon cloud, not waiting hours to send all the data you want (even a 100-gigabit link to the internet is pretty expensive these days), though we do have ways to transport that data to larger clouds if need be. Really, we’re talking about data-center-class processing power that you can put right on a vehicle, on an aircraft, in a submarine, things like that.
Well there you go—you answered my next question. The different applications where your data recorders—your servers can be deployed, right? So, you’re looking at a lot of vehicles, aircraft?
Yeah, so, a little more on that: on the commercial side we go into things like autonomous trucks with these data logger systems. You figure an autonomous long-haul truck can go coast to coast in the US in two days, and it takes a long-haul driver four or five to do the same thing. That’s really where the value is going to be created in autonomous trucks. But there’s a tremendous amount of data being captured there. A lot of those sensors are Ethernet-based, and the solution we’re talking about here can bring that data in.
Other ones: we mentioned aircraft. Even in military applications, we have both helicopters and large-format systems in places like the P-8 aircraft, where we’re doing data ingest of all the sensors: the sonobuoys, the surroundings if you’re doing visualization inputs from those types of sensors. So that’s another application.
And I mentioned even submarines because we’re doing sonar processing—data ingest and processing—in submarines, both in autonomous and in manned submarines.
So, those are pretty harsh environments, and that’s where the OSS hardware allows you to take the same products you’re working on at your workstation, at your desk, and actually put them into these vehicles. With most other real edge-type applications you always have to compromise: maybe you’re still on Gen 3 PCI Express, or you’re using a low-voltage processor that’s really compromising the performance you want. You want to see the same performance you just had at your workstation, but in the vehicle. That’s really where we add value to this solution.
Well, you know, it’s pretty crucial to address this challenge of transferring these large data volumes, right? Like 500 terabytes? Can we talk a little more about how your solution tackles this?
Yeah, the biggest problem is, now you’ve got all this sensor data. You might get a petabyte of data on an aircraft flight from London to New York that you’ve just collected with TK242 at these extreme rates, doing it for seven, eight hours straight, and now you have it on a disk or a set of disks.
So, one of the solutions we have (I talked about how it could take weeks to send that to the cloud to get processed) is that most of our systems have what we call data canisters. We have two data packs on the SDS server this solution’s been tested on, and now, with 60-terabyte drives, we’re close to the petabyte range. We can pull those two drive packs out and send them by FedEx overnight anywhere in the world, instead of taking weeks to get that petabyte’s worth of data over a wire.
So, this data pack concept makes it very transportable: land an aircraft, pull out the drive packs, plug them into your data center or the data hub you have there at the airport, and you’ve now uploaded all your data so you can get use of it really fast.
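The back-of-the-envelope math makes the case for shipping drives. The link rates below are illustrative; shared WAN links typically sustain far less than their nominal line rate:

```python
# How long does a petabyte of capture take over a wire?
DATA_BITS = 1e15 * 8  # 1 PB
for label, gbps in [("100 GbE at full line rate", 100),
                    ("10 Gb/s sustained WAN", 10)]:
    seconds = DATA_BITS / (gbps * 1e9)
    print(f"{label}: {seconds / 3600:.0f} h ({seconds / 86400:.1f} days)")
# -> ~22 hours even on a flawless, dedicated 100 Gb/s pipe, and ~9 days at a
#    more realistic 10 Gb/s sustained; an overnight-shipped data pack wins.
```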
So, we have a question for you Jim, from Sergey. How do you scale PCIe lanes to 128 lanes when, for example, only eight is available? What extenders are you using?
So that was a good question, because there might be some confusion between links and lanes.
With PCI Express we might have, say, four or five x16 or x8 slots that we put the BittWare cards into in order to ingest the data. But if you need to scale that to more cards, we use PCI Express switching (which also supports all the DMA that Atomic Rules is doing at very fast rates, line rates, with about 150 nanoseconds of latency, so it’s hardly even noticeable; it’s not even buffering frames or anything like that). The PCI Express switch allows us to go to a second chassis (what we call our expanders) and expand to more slots, to add even more cards, or add your GPUs, or add more NVMe drives if you have an even larger data set than we were talking about from a single SDS server.
So, it’s more the PCI Express fan-out that you get from the switching that I was referring to when we talk about how we scale our solutions.
(Bryan) So noise can also be a significant… oh, sorry.
(Nicolette) No, no, go ahead, go ahead.
(Bryan) …can be a significant concern in vehicles, right, with high-performance applications? What innovations or measures has One Stop Systems implemented to address this?
Yeah, so, when you talk about a submarine-type application, first of all, you want to be quiet in the submarine, especially if it’s a military submarine and there are people there. If you go into a server room these days, it’s screaming loud at 85 decibels and above (everybody needs ear protection and all that), and when you’re trying to be stealthy underwater you really can’t handle that kind of noise.
So, the SDS server we talked about has options for self-contained liquid cooling, where we use the better efficiency of liquid cooling with the heat exchangers right there in the server. It still stays in this short-depth package to fit those tight applications and the vehicles we’re talking about, but it reduces the noise level from, say, 85 dB down to 60 to 65, which is more like an office-chatter environment. So you don’t go crazy from the constant drone of high-end servers.
So, we’ve taken cooling and power as our key ways of getting these data-center-type products into these vehicle applications, all the way to adding liquid immersion cooling to our repertoire. We can dunk all this in a liquid-cooled tank and let it run for three years out there, gathering and recording data, without ever having to touch it, because it’s all at a constant temperature, and there’s really no noise at that level.
All right, we have a couple more questions—I know we’re coming down to the last few minutes. Shep, let me see…yeah, let’s give you this one, Shep.
So, since it was mentioned that TK242 offloads the PCAP formatting, does that mean libpcap does not work on a Linux system, just to clarify? And please tell me if I pronounced those things correctly.
(laughter) You know it’s alphabet soup and you said everything just right.
Let me address the confusion (there shouldn’t be any): libpcap is wonderful. It’s a software API; it runs on Linux, and it probably runs on Windows as well. Whether it’s making a PCAP file or decoding one, it’s a software API. It runs in software, it’s going to use cycles, and it touches every single byte of the data stream on the way in and on the way out.
Precisely to avoid that touch, so there is no host involvement at any stage of the data on the way in, we do it in hardware, in an offload, so there is nothing for the host to do.
Could we have instead removed the P2PCAP engine in TK242 and just DMA’d data to the host the way a NIC does? Absolutely, and people do use our IP that way. But you wouldn’t get 200-gigabit performance. Even the fastest AMD and Intel processors, with a god-awful number of cores, would choke at that rate and would have all the software jitter associated with it.
So again, in summary: libpcap works terrifically. It’s software, and it has its place. It has no place in a real-time capture system, where touching the data potentially means data being dropped.
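To make that division of labor concrete: the real-time path is hardware-only, but the .pcap files TK242 leaves on disk are ordinary libpcap-format files, so standard software tooling is fine offline. A sketch using the third-party scapy library (the filename is hypothetical; any libpcap-based reader would do):

```python
from scapy.all import PcapReader  # third-party: pip install scapy

# Offline, after-the-fact analysis: exactly the kind of per-byte software
# touch that is fine here but fatal in the real-time capture path.
with PcapReader("capture_000.pcap") as reader:
    for i, pkt in enumerate(reader):
        print(float(pkt.time), len(pkt), pkt.summary())
        if i >= 9:  # peek at the first ten packets
            break
```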
All right, we do have another question.
(Bryan) …and it looks like Chad did answer it directly, but…
(Nicolette) No, no, I have another question.
(Bryan) Okay, you have another one, okay! (laughs)
(Nicolette) So, it refers…I’m going to go back to bare metal for a second. We hear and we read this term, “bare metal,” when associated with FPGAs, and I was wondering if you could just take a moment to explain that concept for us.
Yeah, sure. So, bare metal is mostly how it sounds, though things have changed over the years. FPGAs used to have literally just logic cells, and somebody would have to implement everything on their own.
These days they have hard IP blocks for PCIe, DDR controllers, and Ethernet. And at BittWare, to speed up development time and to test out our hardware, we have several FPGA designers who parameterize those blocks properly and know how to communicate between them, implementing examples or card tests in the empty logic cells in between.
But the fact that it’s bare in between those hard IP blocks allows Atomic Rules to leverage the vast resources inside the FPGA (those empty logic cells) to implement a solution that is highly customized for exactly what we want it to do. That’s why FPGA cards are so versatile in many different markets: they can be configured in many different ways to do, honestly, many different applications.
Thanks Chad. All right guys, we’ve gotten a lot of good questions from the audience. Bryan and I have had some questions for you. Is there anything that we didn’t ask that you think we should have? (pause) I think Shep can think of something!
(Bryan) (laughter) I see Shep there processing, there.
Well, Chad and Jim and anyone else online, of course, a question can still come in. But let me piggyback on Chad’s comment about bare metal, because that’s worth a thought.
With TK242 as a turnkey solution, we are as far away from bare metal as we can possibly be, in the sense that the marketing view of TK242 is, “Phooey, FPGA! There’s no RTL, no lookup tables, none of that!”
We load our identity, this bitstream, on top of a card from BittWare, and it takes on this persona that does one thing really well.
And to what we think is a non-zero set of people interested in that capability: “Hallelujah!” Instead of all that R&D and everything else, you get all the value of COTS, and you’re good to go.
However, OSS, BittWare, Atomic Rules: we all have the other side (as I’ve said multiple times on this call). TK242, in some ways, is like Atomic Rules’ greatest hits in terms of all the IP.
Chad’s point about bare metal: we recognize the people on this call. You’re sharp; you’re looking inside here going, “Wow! I bet we could put our capability inside. We have a secret sauce for compression, or for encryption, or for downconversion of RF signals, or any of the litany of packet processing…”
TK242 is not intended to do that. But you’re sure talking to the right group of people here, with BittWare, OSS, and Atomic Rules, as partners who could get the job done by flipping that image around and saying, “You bet we could put your secret sauce in there!” But again (and we’re grateful for all the coverage we’ve had today about our turnkey design), that is not a turnkey design; that’s putting components together from various pieces and getting to market quicker with our component IP.
So hopefully Chad I didn’t make a mess of what you were saying about bare metal.
You know, look, the team here at Atomic Rules loves the canvas we have to paint on with an FPGA. And with the rich set of heterogeneous processors we have today, between GPUs from Nvidia and others and the host processors, system software and RTL now, more than ever, go hand in glove.
It’s not just an FPGA problem or a system problem anymore, which is why TK242 isn’t just a bitstream. TK242 is more the Linux service that does the work of packet capture than the bitstream (again, depending on who you talk to at Atomic Rules).
So, bare metal, I think, is always there for people who want to get down into it. Hey, if your volume’s high enough let’s start talking ASICs and let’s really take off the gloves (and I think everyone knows that’s available).
But the marquee element of the talk today that I’d like to finish on is how the COTS availability of these pieces from all the vendors here today democratizes the packet capture process, so that anyone wanting to get going isn’t facing an enormous timeline or an enormous economic hurdle just to see if they can bring their value-add (capturing that data and doing something with it) to market.
Yeah, I think you hit it on the head, Shep. Look, it’s companies like Atomic Rules that are able to provide, sometimes, IP blocks that someone else has to bolt onto the back end to do whatever they want. Or we can sell these cards (in particular this solution) to customers who have no idea how to program an FPGA. And they don’t need to know, because it’s a canned solution that is ready. “Turnkey”: like he said, it’s in the name.
It just depends on the end use case and the different IP from our different partners that we’re trying to deploy. If someone’s trying to do something no one else has done before, then they’ll want something bare metal so they can actually go program it. But you hit it right on the head.
Well thank you for joining us today for this live chat. Thank you to our sponsors: Mouser Electronics and BittWare, and our amazing panelists. Have a great day everyone.
(everyone) Thank you!