BittWare Webinar

Introducing VectorPath S7t-VG6 Accelerator Card

Now available on demand:

In this webinar, Achronix® and BittWare will discuss the growing trend of using PCIe-based FPGA accelerator cards in high-performance applications. You will learn about the new S7t-VG6 VectorPath™ accelerator card produced in collaboration between BittWare and Achronix. The S7t-VG6 VectorPath accelerator card is the first card that leverages advanced 7nm TSMC process technology and advanced connectivity features such as 400G Ethernet, GDDR6 memory and flexible OCuLink expansion ports. After attending the webinar you will understand the market trends driving the adoption of PCIe accelerator cards, see how you can leverage this technology in your application and get the details on the new high-performance VectorPath accelerator card from BittWare and Achronix.

S7t-VG6 PCIe card photo

Speakers

Steve Mensor | VP Sales and Marketing, Achronix
Craig Petrie | VP Marketing, BittWare

Introduction – Marcus

Welcome everyone to the webinar—thank you for joining.

Our topic today is FPGA accelerator cards, and specifically the new VectorPath S7t-VG6. I’m Marcus Weddle with BittWare.

During the main presentation, if you think of a question, please ask it in the questions control panel. We have staff online that can answer you in chat during the presentation. Of course, you can also wait until the Q&A session at the end where we will answer your questions live with our presenters. So, let’s get started.

Our first presenter is Steve Mensor, VP Sales and Marketing for Achronix. Steve has 25 years in the FPGA industry, and you can see a range of positions he held at Altera, which of course is now part of Intel.

Steve will be joined by Craig Petrie, VP Marketing for BittWare, also with decades of experience in FPGAs. Craig began at Nallatech, which most of you probably know was merged with the BittWare brand in 2018.

All right, I’m going to turn over now to Steve to lead us into the main presentation, and I’ll be back afterwards for the live Q&A.

 

Steve Mensor – Introduction to Achronix

Thank you, Marcus. We’re excited to be able to have this opportunity to talk at this webinar about the recent announcement for the VectorPath S7t-VG6 Accelerator Card. We’ll go through some of the details of the card, and then allow for questions at the end. First, an introduction to make sure you know both companies that are working on this product together.

Achronix: We are a company that focuses on high-end FPGAs and best-in-class tools to support them. We are unique in that we are the only FPGA company that focuses on high-end standalone FPGA products as well as selling embedded FPGA IP technology that can be integrated into ASICs or SoCs. We’re focusing on our latest product family, called Speedster7t. It’s a product family focused on high-bandwidth and AI applications.

Craig Petrie – Introduction to BittWare

Thanks Steve, and hi, everybody. This is Craig Petrie from BittWare and, for those of you who are unfamiliar with BittWare, we are part of the Molex group of companies, and we focus on FPGA-based acceleration for applications including compute, network, storage and sensor processing. These programmable FPGA products increase application performance and energy efficiency whilst reducing the total cost of ownership for our customers.

At BittWare we have a 30-year track record of successfully designing and deploying advanced FPGA accelerator products. As part of the Molex group of companies, we are the only FPGA card supplier of critical mass able to address enterprise-class qualification, validation, lifecycle and support requirements. The collaboration with Achronix is very exciting for BittWare, and together we feel that we’re greater than the sum of our parts.

Steve Mensor

Diverse Workloads Driving the Need for Data Accelerators (02:57)

Before we get into the details of the card, we want to talk a little bit about the background. There are many forecasts that predict the market for silicon-based hardware accelerators is going to be in excess of $20 billion by as soon as 2023. And the fundamental reason for that is that the traditional methodology of using an array of servers based on Xeon processors cannot address certain workload problems sufficiently for the performance requirements or the energy requirements needed.

I’m going to go into a little bit of detail about that. If we break up application segments into Compute, Networking, Storage and Sensor Processing, what we’ve shown at the bottom are many of the various workloads that have to be done on data. So historically, data would ultimately have to be moved from point A to point B.

Now we’re in a domain where there’s work that has to be done on that data before it’s moved. Some simple examples are compression and encryption. And some other examples, of course, are the very fast-growing market segment in AI/ML. In fact, you’ll see at the bottom here, we’ve highlighted and noted the different workloads that are AI/ML oriented. What you’ll see though is, across this large array of workloads, is that it’s very difficult to create one technology that will address all these different workloads as a hardened function or as an ASIC function.

Ultimately, if there were a way to do that, there would be a fixed-function chip or chipset that would solve this problem. But again, from a CPU perspective, there are inefficiencies. And so we see that either GPUs or FPGAs are the programmable solutions. And one of the benefits of FPGAs in particular is that they have the ability to create orders of magnitude acceleration, and they can address all these different workload types.

Craig Petrie

Market Demand for PCIe Accelerator Cards (05:06)

The purpose of this slide is to highlight the trends we see in the marketplace and, together with Achronix, explain how we’re helping to address them. Over the course of the last 5 to 10 years, we have seen the success of GP-GPU technology from Nvidia. This has changed attitudes and generated an acceptance of acceleration technology as a means of increasing application performance.

More and more customers are using heterogeneous architectures which provide a mix of different technologies, each with their own pros and cons and, when used in concert, these platforms provide an overall benefit for a spectrum of applications. As GP-GPUs deliver diminishing returns, customers are now looking for where the next wave of performance improvements and energy efficiencies will come from.

There have been some high-profile FPGA success stories in the market over the last three years. Examples include the Microsoft Catapult program, which used Altera (now Intel) Arria 10 FPGAs to accelerate the Bing search engine. Microsoft has continued to invest in applications using FPGAs and now uses the Intel Stratix 10 FPGA for its Brainwave program, which runs persistent AI neural networks.

Most recently Amazon’s AWS F1 cloud instance has been augmented with Xilinx UltraScale+ FPGAs. These are all good examples of what we call chip-down designs, where hyperscale customers invest significant time, money and people resources to create a special implementation to solve one or two application requirements. What we see is that, as the FPGA rides the classic technology adoption curve, the consumption model changes. Tier two hyperscale and enterprise-class customers have a different, very broad set of application problems to solve and cannot justify the investment of a chip-down model.

They want to purchase off-the-shelf products at the card and the server level. This is why we now see server vendors such as Dell, HPE and IBM selling FPGA cards in some of their popular server platforms. The two main FPGA vendors, Intel and Xilinx, have also recognized this market trend and have launched their own ranges of FPGA PCIe accelerator cards, the PAC and Alveo ranges respectively.

Data Centers System and Business Requirements (07:30)

In the next few slides, we will explain how the S7t card and design tools will help address some of the detailed technical requirements we see from our target customers. We’re also trying to ensure that we are addressing many of the business requirements that our customers experience. They’re under significant pressure to improve energy efficiency in order to reduce total cost of ownership.

The reconfigurable nature of the Achronix Speedster7t FPGA allows our customers to achieve ASIC-like performance whilst reacting quickly to new application requirements. The fact that we are able to deliver the Achronix S7t FPGA as an enterprise-class off-the-shelf PCIe card or server platform helps customers prove out their application, then ramp quickly and cost effectively. The hardware they use for their proof of concept is also their production ready hardware ready to scale. All these factors together improve time to market and reduce risk.

Steve Mensor

Introducing VectorPath Accelerator Card (08:29)

Okay, so now we’re going to go into the details of the VectorPath accelerator card. First, in terms of the card form factor, this is a full height card that’s double-width. This is the same form factor as GPU-size cards. Very importantly, there are multiple cooling options, passive, active, as well as liquid. And this is built to be enterprise-class in terms of its overall quality.

Craig Petrie

Our goal in this collaboration is to deliver enterprise-class products, and that obviously means quality hardware. But it also means design tools and utilities. The bundle which we’re providing with Achronix includes the hardware, but also includes the Achronix ACE design tools and a toolkit from BittWare. The toolkit leverages 30 years of FPGA card experience. Its headline features include a sophisticated board management controller (or BMC) that allows customers to monitor the health of the cards under application load: for example, power consumption, voltage, current and temperature, plus various other parameters that are important to the customer. The toolkit is supported on the latest versions of Linux, which most of our customers use. But for some customers who have legacy application complications, we also support Windows as an optional extra. There is an extensive API and PCIe driver set available, plus multiple application example designs demonstrating how to move data through each of the main FPGA peripherals: PCI Express, memory and network ports. These are delivered with source code. Finally, there’s a diagnostic self-test, which is the baseline of our technical support and warranty services. This test is part of our production test regime and is used by our customers before starting their application to ensure that the card has survived transport, handling and installation. It is a golden image useful for debug, which verifies that all features are working to maximum performance.

High Speed Data Interfaces (10:35)

The purpose of this slide is to summarize the headline features of the Speedster7t FPGA accelerator card. We tried to make sure that the Speedster7t device features and IP are exposed to customers at the card and the server level. First up is a PCI Express connection, which is a full 16 lanes. The card will initially support PCI Express Gen3, which is where the market is today. However, the card has also been designed to support Gen4. Our goal is to qualify Gen4 and upgrade the product specification over time. We’re very lucky in that the Achronix Speedster7t FPGA has hard IP supporting 16 lane PCI Express Gen5. When the time is right, we will look at the signal integrity of the card and make any necessary tweaks to ensure compatibility.
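
To put the Gen3, Gen4 and Gen5 figures in perspective, here is a rough sketch of the per-direction bandwidth of a 16-lane link. The per-lane transfer rates and 128b/130b encoding come from the PCI Express specifications rather than from the talk, and packet and protocol overhead is ignored, so real application throughput will be lower.

```python
# Approximate one-direction bandwidth of a x16 PCI Express link.
# Transfer rates (GT/s per lane) and 128b/130b line encoding are taken from
# the PCIe specifications; TLP headers and flow-control overhead are ignored.
GT_PER_S = {"Gen3": 8.0, "Gen4": 16.0, "Gen5": 32.0}
ENCODING_EFFICIENCY = 128 / 130
LANES = 16


def link_bandwidth_gbytes(gen: str, lanes: int = LANES) -> float:
    """Return the raw payload bandwidth in GB/s for one direction."""
    gbits_per_s = GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes
    return gbits_per_s / 8


pcie_bandwidth = {gen: round(link_bandwidth_gbytes(gen), 2) for gen in GT_PER_S}

for gen, gbytes in pcie_bandwidth.items():
    print(f"{gen} x16: about {gbytes} GB/s per direction")
```

Each generation doubles the per-lane rate, which is why moving the same 16-lane card from Gen3 to Gen4 roughly doubles the host bandwidth.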

The Speedster7t FPGA is rich with an array of multi-line-rate SERDES, MAC and FEC IP, supporting everything from 1 to 400 gigabits per second. We have leveraged technology from BittWare’s parent company, Molex, to design in two types of network cages, QSFP56 and QSFP-DD (DD meaning double density). The QSFP56 is the 1 x 200 gigabit Ethernet port. The QSFP-DD is the 1 x 400 gigabit Ethernet port. Both network ports can be broken down into multiples of 10, 25, 40 and 50 gigabit Ethernet connections through use of breakout cables from Molex.

Memory Interfaces (12:02)

The other headline features of the accelerator card include the innovative memory architecture. Instead of using expensive HBM2 integrated memories on their FPGA, Achronix has included GDDR6 hard IP. The BittWare card supports eight independent banks with a capacity of eight gigabytes. This produces four terabits per second of external memory bandwidth, which is in line with the performance achieved using HBM2, but without the cost. GDDR6 memories are used extensively on GPUs and are multi-sourced. This architecture allows BittWare to customize the card for volume applications. A customer may say,

 “Look, I’m only using half the memory banks from my application. Can we depopulate four of the banks in order to reduce price and power consumption?”

With this architecture that is a straightforward option. If we were using an HBM2-enabled device, it would require a different FPGA and potentially a new PCB. In this case, we can simply depopulate the unused banks, reducing cost and power consumption for the customer. We’ve also included a single bank of DDR4 as a buffer, a cache-level memory option for customers who need that level of application buffering.
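
The depopulation trade-off can be sketched with a little arithmetic. The eight banks and four terabits per second are the figures quoted above; the assumption that bandwidth divides evenly across banks is ours, made only for illustration.

```python
# Back-of-envelope GDDR6 bandwidth for a fully or partially populated card.
# Eight banks and 4 Tb/s total are the figures quoted in the talk; an even
# split of bandwidth across banks is an assumption made for illustration.
TOTAL_BANDWIDTH_TBPS = 4.0
BANKS = 8


def memory_bandwidth(populated_banks: int) -> dict:
    """Return per-bank and total bandwidth for the given number of banks."""
    if not 0 <= populated_banks <= BANKS:
        raise ValueError("populated_banks must be between 0 and 8")
    per_bank_gbps = TOTAL_BANDWIDTH_TBPS * 1000 / BANKS
    total_gbps = per_bank_gbps * populated_banks
    return {
        "per_bank_gbps": per_bank_gbps,
        "total_tbps": total_gbps / 1000,
        "total_gbytes_per_s": total_gbps / 8,
    }


full_card = memory_bandwidth(8)
half_card = memory_bandwidth(4)
print("All eight banks:", full_card)
print("Four banks depopulated:", half_card)
```

Under these assumptions, a customer who depopulates four banks keeps half of the external memory bandwidth while saving the cost and power of the removed devices.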

Clock and Expansion Interfaces (13:15)

This slide summarizes some of the more subtle but hugely important features needed for applications to run efficiently. BittWare has drawn on all 30 years of its experience to ensure that this product can cope with the various requirements of compute, network, storage and sensor processing workloads, all in one card. You won’t see many of these features on other FPGA cards in the market.

First up are the SMA connectors on the front panel to the left-hand side. These are clock inputs, including 1 PPS (one pulse per second) and 10 megahertz, which allow customers to synchronize multiple cards for timing-critical applications. Without these, it is extremely difficult for customers to scale network-enabled applications. At the right side of the card, we have a general-purpose digital IO header. This is a relatively low-tech connector running eight single-ended pins from the Speedster7t FPGA through voltage buffers to convert from 1.2 to 3.3 volts. This header is extremely important when customers are integrating new acceleration technologies into legacy systems which need simple digital triggers or control loops.
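
As a generic illustration of why a shared 1 PPS and 10 megahertz reference helps, the sketch below builds a timestamp from a count of PPS pulses plus a count of 10 MHz clock cycles since the last pulse. This is a common timing technique, not a description of BittWare’s implementation, and the function name is hypothetical.

```python
# Generic timestamping from a 1 PPS pulse and a 10 MHz reference clock.
# Illustrative only: timestamp_ns is a hypothetical helper, not a BittWare API.
CLOCK_HZ = 10_000_000          # 10 MHz reference
NS_PER_SECOND = 1_000_000_000
NS_PER_CYCLE = NS_PER_SECOND // CLOCK_HZ   # 100 ns of resolution


def timestamp_ns(pps_count: int, cycles_since_pps: int) -> int:
    """Combine whole seconds (PPS pulses) with a sub-second cycle count."""
    if not 0 <= cycles_since_pps < CLOCK_HZ:
        raise ValueError("cycle count must be less than one second of cycles")
    return pps_count * NS_PER_SECOND + cycles_since_pps * NS_PER_CYCLE


# Two cards fed from the same references observe the same counts for an event.
card_a = timestamp_ns(pps_count=3, cycles_since_pps=25)
card_b = timestamp_ns(pps_count=3, cycles_since_pps=25)
print(card_a, card_b)
```

Because both cards count the same pulses and the same clock edges, their timestamps agree to within one 100 nanosecond cycle without any traffic over the host.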

Expansion Interface Use Cases (14:30)

The final card feature we want to walk you through today in detail is an expansion port using the industry standard OCuLink Edge connector. OCuLink stands for Optical Copper Link and is a PCI Express interconnect system. There are four main use cases for this port on the S7t accelerator card.

The first allows the FPGA to interface directly with NVMe flash arrays. This allows data captured through the network ports to be pre-processed and saved directly to NVMe storage, or retrieved by the FPGA and processed as part of a database acceleration or big data application. The second use case we see at BittWare is a need to scale FPGA applications across multiple devices. As good as PCIe is, when you try to scale via the host interface, chipset, driver and operating system, you inevitably will kill performance and increase system-level jitter. Using the expansion ports as a simple SERDES, i.e. a non-PCIe protocol, customers can interconnect Speedster7t devices directly through a simple cable assembly that can be provided by Molex.

Customers can interconnect using whatever topology best suits their application: daisy chain, ring or mesh. Using this technique, customers will have a low-latency, high-bandwidth and, most importantly of all, deterministic interface between FPGAs in their system. The third use case is that, as I/O-rich as this card is, there are always some customers who want just a bit more in terms of network connectivity. In order to help cater for those customers, an adapter can be used to bring in even more network ports from the front panel into the Speedster7t FPGA via the expansion port.

The last use case, as was the case with the GPIO header on previous slides, the OCuLink expansion ports, which are protocol agnostic, can be used to integrate new technology into older systems.

Steve Mensor

Speedster7t 2D Network-on-Chip (NoC) (16:30)

Now I’m going to talk about some of the features of the Speedster7t device. So, as we click forward here, I’ll highlight the NoC, or Network-on-Chip, and it’s a two-dimensional implementation. You can see it highlighted here. I’m not going to go into all the details of the NoC, but basically the NoC is very high bandwidth: 512 gigabits per second for each horizontal row and vertical column, as well as connectivity to the high-speed interfaces, which include the Ethernet, the PCI Express and the GDDR6 ports as well as DDR4.

There are multiple different modes for communicating with the NoC. The most common one is an AXI interface, which is very standard in terms of any design engineer understanding how the interface works. The points of connectivity are wherever the horizontal rows and vertical columns cross over each other. At each of those points there is what we call a NAP, or NoC access point, and you can get either on or off the Network-on-Chip at any one of them. That’s a very powerful feature because it means that the NoC is distributed across the FPGA fabric for doing real-world implementations.

Speedster7t NoC: A New Design Paradigm (17:55)

So let me give you an example here. What I’m showing here is a cartoon of an FPGA with the desire to create two different accelerator functions, accelerator one and accelerator two. And you can see the different GDDR6 ports. So, if I wanted to connect to any of these ports, let’s just say two of these ports, I have to build functionality to make that happen. Ultimately, as a user, I really only care about the accelerator functions. But because it’s an FPGA, I have to design all the functionality for it to talk to the outside world myself. If we click forward, you’ll see the functionality that has to be built. So, first of all, because I’m talking to GDDR6 ports, I have clocking from the outside world and I need to synchronize that to the clocking inside the FPGA. So, I go through a standard structure like a FIFO. That’s fairly straightforward.

What is more complex is, because the different accelerators need to talk to both the different memories, there will be a shared memory space. There has to be a control mechanism, or really a switch function that does the addressing, decode and routing. There needs to be controls and back pressure to make sure that the two accelerators don’t talk to the same memory location at the same time. And then all that has to be done in the reverse direction also.
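
The switch function described here can be sketched as a toy software model: decode each address to a memory port, grant one accelerator per port, and stall the rest. The class, the fixed-priority arbitration and the per-port granularity are all hypothetical simplifications for illustration, not how any particular FPGA design implements it.

```python
# Toy model of the switch between accelerators and memory ports.
# Hypothetical simplification: arbitration is per memory port with a fixed
# priority order, and each port owns one contiguous address range.
class MemorySwitch:
    def __init__(self, num_ports: int, port_size: int):
        self.num_ports = num_ports
        self.port_size = port_size

    def decode(self, address: int) -> tuple:
        """Address decode: map a shared address to (port, offset)."""
        port, offset = divmod(address, self.port_size)
        if not 0 <= port < self.num_ports:
            raise ValueError(f"address {address} is outside the memory map")
        return port, offset

    def arbitrate(self, requests: dict) -> tuple:
        """Route one cycle of requests; return (granted, stalled)."""
        granted, stalled, claimed = {}, [], set()
        for accelerator in sorted(requests):      # fixed priority by name
            port, offset = self.decode(requests[accelerator])
            if port in claimed:
                stalled.append(accelerator)       # back pressure: retry later
            else:
                claimed.add(port)
                granted[accelerator] = (port, offset)
        return granted, stalled


switch = MemorySwitch(num_ports=2, port_size=1024)
print(switch.arbitrate({"acc1": 100, "acc2": 200}))    # both target port 0
print(switch.arbitrate({"acc1": 100, "acc2": 1500}))   # different ports
```

Even this toy version needs decode, routing and stall logic for every port, and all of it has to be mirrored for the return path.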

So, if we click again, you’ll see that the area that’s highlighted in red here is really of no value, as I said earlier, to the customer, except obviously it allows the accelerator functions to talk to the outside world, so it has to be done. It’s a necessary evil.

There are a couple of challenges associated with this. One, as you can see at the bottom, we’ve highlighted that this area in red grows at a quadratic rate. Every time you add a new port to the switch function, whether it’s another memory port, a high-speed interface or another accelerator, it grows at a very fast rate, to the point where much more of your circuitry is this red functionality (the connectivity to the outside world) than the functionality you would otherwise want to focus on, which is your accelerators.
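
The quadratic growth can be made concrete with a small sketch. It assumes the switch is a full crossbar in which every port can reach every other port, and compares that with a NoC where each endpoint needs a single access point; both cost models are simplified assumptions rather than figures from the talk.

```python
# Simplified cost models, assumed for illustration:
#  - a full crossbar needs a directed path from each port to every other port
#  - a NoC needs one access point (NAP) per endpoint
def crossbar_paths(ports: int) -> int:
    return ports * (ports - 1)


def noc_access_points(ports: int) -> int:
    return ports


growth = {n: (crossbar_paths(n), noc_access_points(n)) for n in (2, 4, 8, 16)}

for n, (paths, naps) in growth.items():
    print(f"{n:2d} ports: {paths:3d} crossbar paths vs {naps:2d} access points")
```

Doubling the port count from 8 to 16 takes the crossbar from 56 to 240 paths, while the number of access points only doubles.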

Probably more importantly is that it makes the design and timing closure very difficult. In reality, when you are talking to very high-performance interfaces, what will happen from a place and route point of view is that the circuitry is going to try to be pulled towards the port of connectivity. And what that means is you’re going to be stretching between your accelerator and the various different ports, so it becomes very difficult to close timing.

Speedster7t is Software Friendly Hardware (20:42)

What’s different in Speedster7t FPGAs is that we have this 2D NoC with NoC access points, or NAPs, distributed across the FPGA fabric at every point where the horizontal rows and vertical columns cross over each other. This means all you have to do as an engineer is design your accelerator RTL, using whatever expertise you have, and simply make an AXI connection. So, you would create an instantiation of a NAP and hook it up. From there the Achronix software tools, called ACE, take care of all the routing between your accelerators and the different memory ports or high-speed ports like Ethernet, PCI Express, etc. This is very important because the ecosystem will be critical for this product, as there will be many different types of accelerator solutions built to handle those workloads we talked about before.

Because this environment is so much easier for engineers to design with, the ecosystem will be able to thrive. Ecosystem companies that have a specialty, or a type of IP to offer, will be able to create an end-to-end solution. The way that works is that the IP provider simply creates the functionality for whatever their value proposition is. It could be encryption, acceleration or any type of AI/ML application. All they’ll have to do is instantiate a NAP, and then they’ll be talking to the outside world. We currently have a fairly decent-sized ecosystem. It is growing very quickly, and we expect it to continue to grow very rapidly because this is a very easy environment for both designers and ecosystem companies to build their IP into an end-solution product.

Speedster7t Optimized for Compute Intensive AI/ML Applications (22:47)

The other thing I wanted to talk about is that inside the Speedster7t device are MLPs, or machine learning processors. These are what would traditionally be called DSP blocks in standard FPGAs, but they’ve been optimized specifically for AI/ML applications. I’m not going to go through the full functional description of them, but I want to give you some of the highlights.

First of all, with them being distributed throughout the device, you can get over 40,000 INT8 MAC operations on this 7t card, and they’ll run at 750 megahertz. That equates to over 80 tera operations per second of functionality. Another benchmark that’s often cited is ResNet-50, and in that benchmark it delivers 8,600 images per second.

I do want to comment on GPUs, which have otherwise been the focus for AI/ML, mostly in the training area. One of the challenges with GPUs is that, because they have many different engines with a cache structure and are largely designed similarly to a sequential implementation on a CPU, you get into memory delivery or data transfer challenges where the actual compute engines are really only being utilized about 10 to 20% of the time.

So whereas you would hear a GPU quoted at, let’s say, 130 tera operations per second, the actual result when you’re measuring images per second, for example, is something closer to 15 to 20 tera operations per second. Because of the Speedster7t high-bandwidth memory implemented as GDDR6, plus the Network-on-Chip, plus these highly optimized machine learning processor blocks, we’re seeing that we can deliver up to 80% efficiency for AI/ML implementations.
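
The utilization figures can be checked with a short calculation. It uses only the numbers quoted in the talk, and for simplicity treats the “over 80” peak figure for the Speedster7t card as exactly 80, which is an assumption.

```python
# Utilization = achieved throughput / quoted peak throughput.
# All inputs are figures quoted in the talk; the FPGA peak of "over 80"
# tera operations per second is rounded down to 80 for simplicity.
def utilization(achieved_tops: float, peak_tops: float) -> float:
    return achieved_tops / peak_tops


GPU_PEAK_TOPS = 130.0
gpu_low = utilization(15.0, GPU_PEAK_TOPS)
gpu_high = utilization(20.0, GPU_PEAK_TOPS)

FPGA_PEAK_TOPS = 80.0
FPGA_EFFICIENCY = 0.80          # "up to 80%" from the talk
fpga_effective_tops = FPGA_PEAK_TOPS * FPGA_EFFICIENCY

print(f"GPU utilization: {gpu_low:.1%} to {gpu_high:.1%}")
print(f"FPGA effective throughput: up to {fpga_effective_tops:.0f} TOPS")
```

On these numbers the GPU example lands at roughly 12 to 15% utilization, within the 10 to 20% range mentioned earlier, while the FPGA would deliver up to about 64 effective tera operations per second.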

Craig Petrie

Performance and Support (24:53)

As we explained earlier in the webinar, as FPGAs ride the technology adoption curve, we see the consumption model changing from chip-down design to card to integrated server platform.

For customers who wish to purchase at the server level, the S7t card is available as part of BittWare’s TeraBox range of products. Over the course of the last few years BittWare has seen an increase in design wins and revenues from the TeraBox range, which we relaunched earlier this year. FPGA cards are delivered to customers as pre-integrated, pre-tested Dell or HPE systems with the mechanicals and the thermals already figured out, plus a comprehensive warranty covering both the server and the FPGA card.

The operating system, BittWare toolkit and Achronix ACE design tools are all pre-loaded and ready to go. Customers just need to apply power, log in and run the diagnostic test to verify that everything is good.

TeraBox Server Platform offering (25:55)

The TeraBox product range was created to address two specific types of customer:

The first is a developer who is trying to get their proof of concept ready for their customer deadline or the boss’s demo. Time to market and minimizing setup and hassle are really important, so the TeraBox range allows the customer to place a single purchase order for a single line item and get everything delivered in one package.

Once a proof-of-concept phase is complete, the next customer type we tend to meet during the deployment phase is a program manager or an IT lead. These customers are usually unfamiliar with the details of the FPGA technology and instead care a lot about application deployment and management. They want to understand service level agreements, warranties, technical support, utilities for monitoring products in the field, and upgrade and maintenance schedules. As part of the Molex group, BittWare has a global supply chain and infrastructure which it can uniquely leverage. Furthermore, BittWare is also part of the Dell and HPE OEM programs. This means that customers, if they prefer, can purchase TeraBoxes directly from Dell or HPE under their current contracts.

Steve Mensor

Availability and Pricing (27:12)

So, in terms of availability, the VectorPath S7t board will be available at the beginning of Q2 2020, and the one-unit list price is 7,500 USD. But beforehand, folks can start designing the FPGA functionality using the Achronix ACE design tools. These tools are available now, so folks can start evaluating the software or preparing their designs for the card’s availability.

Summary (27:47)

So let me just do a quick wrap up and then we’ll transition to some questions.

First of all, it is a very high growth market. The idea of using accelerator cards was proven out in terms of GPUs, and we’ve seen that drive into data center applications and it’s obviously valuable across many different application segments.

Achronix and BittWare working together is an exciting partnership that is delivering some unique functionality. The card is powered by the Speedster7t FPGA, and we talked about the Network-on-Chip as well as the MLP. And then Craig talked about some really interesting, innovative features on the card that come from BittWare’s many years of experience delivering this technology. But ultimately this collaboration has been focused on trying to deliver a complete, enterprise-class solution: a low-risk product that you can use for your production applications.

 

Questions & Answers Section (28:43)

 

Marcus Weddle

All right. This is Marcus here. I just want to make sure that our panelists can hear me okay.

[Steve and Craig confirm]

Our first question is more about the HBM2 FPGA cards that are on the market. So obviously the S7t has GDDR6, but HBM2 is also on the market. I know BittWare has some cards, and so do some others. How does the S7t compare to them?

So, if you can speak to that, maybe Steve.

Craig Petrie

So let me start on this one actually. As a card question, it’s a good question.

I think differentiating the S7t card is very important as we go to market, and hopefully that’s come across on the product webpage, in the collateral we’ve shared so far, and also in the webinar. Compared to FPGA cards in the market from other vendors which have HBM2 memory, I think what we’ve tried to show here in the webinar by highlighting some key features is that we’re providing quite a bit of differentiation. This one particular card we think has a very balanced architecture, which lends itself well to a wide array of workload types. So, we’re trying to cover the Compute, Network, Storage and Sensor Processing examples which we cited in the webinar. In order to do that, you really do need a rich mix of IO. That includes things like clock inputs, digital triggers and expansion ports as well. And I think that comes across on the S7t card, whereas those features are almost entirely missing from other HBM2 cards in the market. In summary, we’ve got a very high level of flexibility which we can offer our customers.

But there are some other things that we’ve tried to highlight to help differentiate the card compared to the others in the market.

The Achronix Speedster7t device itself, we believe, has some unique features which we’re trying to expose at the card and also the system level. That includes things like the very high line rate Ethernet ports via the QSFP56 and the QSFP-DD (Double Density) network ports. And of course, when those are hooked up to the GDDR6, which is very high bandwidth, at the same bandwidth as HBM2, and also the NoC, then you have a very interesting architecture which is not currently shipping in the market today.

When we ship this product early Q2 next year, we think it will be ahead of other cards in the market. Other things we’ve added here to provide further differentiation, and ultimately to make life easier for our customers, include those 30 years of BittWare experience and IP that we’ve got stored up. That comes in many forms, including driver and API support for both Linux and Windows, a built-in diagnostic self-test, example designs with source code, etc.

And then finally for customers who want to use this technology, not just at a card level, but they want to purchase at the server level, we of course have the TeraBox range. So being able to market, sell and support this type of product through the Molex global supply chain is an incredible advantage that we have at BittWare, and we’ve tried to make sure all of that is poured into the value proposition of the card and gives it differentiation compared to what’s out there. Hopefully that helps answer the question.

Marcus Weddle (32:50)

Thanks, Craig. And I do have one for Steve this time. The question is, does the software provide HLS tools? If not, can we use Mentor Catapult?

Steve Mensor

Yeah, that’s a good question. In terms of OpenCL, we don’t have any plans to support OpenCL, but we do have plans for HLS. We have worked with Mentor Catapult and the Catapult product does support our previous generation family, and we will be working with Mentor for Catapult to support our Speedster7t family. That will be in the future and there’ll be an announcement when that is available.

Marcus Weddle

Great. Thanks for that. We have quite a few questions. I just wanted to make a note that we’ll stay on the line as long as everybody would like to keep asking questions. We do have quite a bit of opportunity to answer some great questions, so keep them coming.

So the next question probably again for Steve, what is the fMAX for this new FPGA?

Steve Mensor

So FMAX is always an interesting question with FPGAs. The FPGA, in terms of clocking, supports up to 750 MHz. In most FPGAs your actual FMAX will be determined as part of your timing closure, so it will depend on the complexity of your designs, the levels of logic, etc. There are things that are unique in Speedster7t that we talked about in terms of the NoC that greatly reduce the normal congestion that in other FPGAs would cause timing closure challenges. So it’s design dependent. As an example of what you can get, we have an example design of a 2D convolution that uses 94% of the FPGA and it’s running right around 750 MHz; it’s like 749.1 MHz. So 750 is the max, but it will depend on your actual implementation.
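
To illustrate why FMAX depends on levels of logic, here is a minimal sketch that turns a critical path delay into an achievable clock frequency, capped at the 750 MHz maximum mentioned above. The per-level delay and fixed overhead are hypothetical numbers chosen only for the example, not Speedster7t timing data.

```python
# FMAX is the reciprocal of the slowest register-to-register path delay.
# The 0.2 ns per logic level and 0.1 ns fixed overhead used below are
# hypothetical values for illustration, not Speedster7t timing data.
DEVICE_MAX_MHZ = 750.0


def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def achievable_fmax_mhz(logic_levels: int, delay_per_level_ns: float,
                        overhead_ns: float) -> float:
    """Estimate FMAX from the critical path, capped at the device maximum."""
    critical_path_ns = logic_levels * delay_per_level_ns + overhead_ns
    return min(1000.0 / critical_path_ns, DEVICE_MAX_MHZ)


max_period = period_ns(DEVICE_MAX_MHZ)
shallow_design = achievable_fmax_mhz(4, 0.2, 0.1)
deep_design = achievable_fmax_mhz(10, 0.2, 0.1)

print(f"Period at 750 MHz: {max_period:.3f} ns")
print(f"4 levels of logic:  {shallow_design:.1f} MHz")
print(f"10 levels of logic: {deep_design:.1f} MHz")
```

At 750 MHz every path has to fit in about 1.33 nanoseconds, so each extra level of logic or long route eats directly into the achievable frequency.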

Marcus Weddle

Okay. There’s a question about the card’s price. I believe it was on the slide. The list price for a single unit S7t is 7,500 USD. But Craig Petrie, maybe you can just elaborate on that a bit.

Craig Petrie

Sure. Yeah. So just to clarify, the low volume List Price is 7,500 USD. Just to be very clear that is a bundled price. So that’s not just the card. It obviously includes the card, but you’re also getting a license for the Achronix ACE Design Tools included in that price. You’re also getting the comprehensive BittWare Toolkit as well. So that includes the drivers and the APIs for Linux. You can get Windows, that’s an optional extra beyond the List price. But the toolkit is bundled for Linux, that includes the diagnostic self-test, source code example designs, all that good stuff, the board management controller. So we’ve tried to make the purchasing experience very straightforward. It’s just one quotation, one price, and you get the hardware, the firmware, the software, the drivers, APIs, the tools; everything you need to get started.

And for many customers, they’re starting with a new technology like this and exploring what’s possible with the product. And so we want to make sure that everything’s included in one bundle. There are no hidden costs and there’s no complication about having to buy tools from different places or hardware from other places, it’s all available as one price. So hopefully that will simplify the experience for our customers. In terms of volumes, hopefully one thing that came through in the webinar is the fact that, for volume applications, we have the ability to customize the product. The customization is in varying degrees. One of the straightforward customizations we can make is to depopulate components which are not being used in the volume application. The obvious benefit there is we can reduce the cost of the units for our customers. We can also reduce power consumption and really give customers an optimized solution. Not everyone will need all the features on the product. We recognize that, so there’s a plan in place to give customers a choice to get the unit price down as far as they can.

If there are more significant customizations required where maybe there’s a change in mechanicals or even PCB, that is also an option too. So really this is just a very flexible model and we’re giving customers a lot of choice in terms of how they go from one unit proving their application all the way through to a volume deployment. And one thing we think will kick in is the TeraBox range that we have as well. In addition to dealing with the card, we can also deal with the server level requirements, and give the customer a one-stop-shop to get the unit price down as low as possible. The $7,500 starting price is just for low volume. We think we can get this down significantly once volumes kick into hundreds or even thousands of units.

Marcus Weddle (38:33)

All right, thanks, Craig. I’ve got a couple of questions that came in from a couple of folks regarding, I believe it’s pronounced “C six” or “C-C-I-X”, but I think it’s CCIX. Steve is probably a good one to reply. The question was regarding support for PCIe Gen5, with some details: is there support for CCIX, and are there plans to support CXL for coherency? So, if you wouldn’t mind taking that.

Steve Mensor

Yeah, good question. Our PCI Express on the particular device on this card does not have support for CCIX or CXL. Our follow-on devices at Achronix will have support for CXL.

Marcus Weddle

Okay. Again, this is probably a Steve question. Where can you download the prototyping software kit?

Steve Mensor

Yeah. We have a methodology that’s pretty straightforward. If you go to the Achronix website, there is a registration form. If you either do a search or go to the bottom of any of the webpages, you’ll see where it says registration. From that you will start a process where you will get login credentials for our portal. Once you get to the portal, all the steps to download the software and get an evaluation license are explained there. We typically give an evaluation license on the order of about two months. So, it’s all on our website, for your reference.

Marcus Weddle (40:13)

Okay. And for Craig, really on our card level, do you have any power consumption numbers? Somebody mentioned liquid cooling. This person’s curious about power delivery and whether there are additional power connectors needed.

Craig Petrie

Okay. Yeah, good question. So just to provide some clarity, I think there are some pictures in the datasheet, which is available on the BittWare and Achronix web pages, if you want to get all the detail on the card’s features. We’ve got two 12-volt auxiliary power connectors on the card, which is similar to what you get on a GP-GPU. Most people on this call are familiar with FPGAs, so you’ll understand that the power consumption of the card will be application dependent. If someone is running a low-speed, small design, then the card’s power consumption is likely to be very low, and most of the power could just be supplied through the PCIe bus, which is rated to 75 watts.

We’re expecting customers to take advantage of the Speedster features, push the card to its limits and try to reach those 750-megahertz FMAX numbers which Steve shared earlier. For that, we think most customers will require the higher power capability which we’ve engineered into the card. The card is, I think, rated to 225 watts as standard, which is a kind of high typical power consumption for an advanced application.
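The two power figures mentioned here (75 watts from the PCIe slot, 225 watts for the card as a whole) can be turned into a quick budgeting check. This is a minimal sketch under stated assumptions: the thresholds are the numbers quoted in the webinar, the function name and category labels are hypothetical, and whether a real card will operate without the auxiliary connectors attached is not something the webinar states.

```python
PCIE_SLOT_LIMIT_W = 75.0  # PCIe slot power quoted in the webinar
CARD_RATED_W = 225.0      # card rating quoted (tentatively) in the webinar


def power_category(estimated_w: float) -> str:
    """Classify an estimated application power draw against the quoted limits."""
    if estimated_w < 0:
        raise ValueError("power estimate must be non-negative")
    if estimated_w <= PCIE_SLOT_LIMIT_W:
        return "within slot budget"
    if estimated_w <= CARD_RATED_W:
        return "needs auxiliary power"
    return "exceeds card rating"


if __name__ == "__main__":
    # Illustrative estimates only; real numbers come from the design's power report.
    for watts in (40.0, 150.0, 240.0):
        print(f"{watts:6.1f} W -> {power_category(watts)}")
```

A small, slow design lands in the first category, while a design pushing the fabric toward its maximum clock rate would be expected to land in the second.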

And then you’ve got everything in between that. So, all the power is taken care of via the connectors on the card. We’ve engineered the card to be a GPU-size card. So that means it will be compatible with a vast array of 1, 2 and 4U servers in the market. So, that’s where we’re playing. Our experience at BittWare in recent times is that customers are looking for choice in terms of how they’re cooling the product.

Although FPGAs are very energy efficient from a performance per watt perspective, the power density of FPGAs is still increasing. This is a seven-nanometer device so it’s very energy efficient, but it’s very high performance as well. We’ve made sure that we’ve taken care of the mechanicals and the thermals on this product. What we’ve done there is provide three options to best meet the customer’s end requirements.

The standard product will be the passive-cooled heatsink option. So there are no embedded fans, it’s all passive. In our experience many customers going to volume deployment will prefer the passive heatsink because there are no moving parts; it’s a simpler design. Typically there’s good airflow in these server platforms and therefore you can get very good MTBF figures for reliability. Customers who are in lab environments conducting proofs of concept, or who just have a preference, can use a product with an active fansink. It’s not shown in the picture here, but there’s an active fansink option, which we’ve successfully deployed on other high-powered FPGA cards in the BittWare portfolio. This will cool the card adequately as well.

And then more recently, we have been experimenting with liquid cooling. We’re not talking about immersion liquid cooling here, we’re talking about liquid cooling where you’re piping water or liquid into the card and back out again. That’s a relatively new area for FPGA cards. There are not any liquid-cooled cards of this class on the market from other vendors at the moment. So, as luck would have it, next week is Supercomputing in Denver. I’m sure many people on the call are going to be going there and seeing what’s happening. At the BittWare and Achronix booths we’ve actually got mechanical samples of the S7t card, populated with liquid cooling cold plate technology from a Canadian company called CoolIT. CoolIT is the market leader in liquid cooling; they are designed in to Dell, HPE and other rack technologies, and they’re used today to cool very high-powered GP-GPU cards. We are leveraging their technology and we’re going to have samples on display. For customers who have a preference for liquid cooling, this is going to be a really good option. We are trying to give customers choice here and this will hopefully take care of all the requirements which we see in the market.

Marcus Weddle (45:40)

Thanks, Craig. I’ve got a couple of folks out there having some audio difficulties. I’ve tried checking it out. I’m not having any cut outs, are you? Steve, can you hear Craig okay?

Steve Mensor

Yeah, I can. It’s good.

Marcus Weddle

Okay. Yeah, it might be, unfortunately, individual connections. Let’s see here. I’ve got a couple that were somewhat related to the cooling. Okay, I’m going to read this one. This is about function acceleration. Can you elaborate on the process of function acceleration? Does this have to be done using RTL or is there support for high-level tools such as Python? Also, if you can elaborate on the implementation of ResNet used in the earlier slide, it would be a great example of machine learning. And I think just to add to that, there was another question on whether OpenCL was supported. I forget whether that was answered when we talked about HLS, but if you could kind of do a wrap up of all that.

Steve Mensor

Yeah. Okay. We’ll start with OpenCL. Achronix does not have any plans to support OpenCL, either directly or by working with an ecosystem partner. It seems to not be as popular as other solutions. We are working on HLS solutions, but nothing has been announced yet. We expect to have an announcement sometime this year with some details with partner companies. In terms of AI, the question on ResNet50, for example, there are multiple ways to solve those problems. What we do is basically work either at an RTL level, or we have low-level libraries that can ultimately be worked on at a network level. We supply those libraries, and we’re going to be making various announcements with companies that will be supporting different network implementations for AI.

In terms of the ResNet50 number, what we have right now, maxing out the Speedster7t 1500, is about 8600 frames per second, which is I think one of the largest numbers in the semiconductor space, not just FPGA. Now that design implementation has not been released yet. It’ll be released later this year or early next year in terms of folks using that for their own purposes. And then there will be other implementations as well, for example, for YOLOv2, and those, as I said, will be announced at a later date.
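As a rough way to read the 8600 frames per second figure, the sketch below converts it into an average time per frame and into a count of real-time video streams. The 30 frames per second stream rate is an assumption chosen purely for illustration, and the per-frame figure is an average interval derived from throughput, not a measured inference latency.

```python
import math

RESNET50_FPS = 8600.0  # throughput quoted in the webinar
STREAM_FPS = 30.0      # assumed per-camera frame rate, for illustration only


def avg_interval_us(frames_per_second: float) -> float:
    """Average time between completed frames, in microseconds."""
    return 1_000_000.0 / frames_per_second


def streams_supported(total_fps: float, stream_fps: float) -> int:
    """Whole number of streams that fit within the total throughput."""
    return math.floor(total_fps / stream_fps)


if __name__ == "__main__":
    print(f"Average interval per frame: {avg_interval_us(RESNET50_FPS):.1f} us")
    print(f"Streams at {STREAM_FPS:.0f} fps: {streams_supported(RESNET50_FPS, STREAM_FPS)}")
```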

Marcus Weddle (48:25)

Okay. This question is about kernel bypass. Are there going to be drivers available for host CPU communication with the FPGAs? And I guess maybe that would go to Craig.

Craig Petrie

Yeah. There’s no support initially for kernel bypass with the drivers. That’s something we’re looking at. We’re already getting good feedback from customers on additional features that they would want to see supported. Some of those are hardware features and some of those are firmware, software type features. The Speedster device is a whole new architecture. It’s kind of a leap forward in terms of how customers will move data in there. So, we are speaking with some partners in this space who specialize in network IP to understand what we can also offer to customers to get maximum performance. So given this is brand new technology, our focus is just to get the standard product shipping on time with the main features customers expect, and then look at upgrades and improvements over time.

So, I’d welcome and encourage everyone with an opinion or a requirement that they feel is not being met to feed that back to us and let us factor that requirement into future plans.

Marcus Weddle (49:53)

All right. And we still have some questions that we’ll get to. We are somewhat catching up with all the questions. If anybody still has some questions that they haven’t posted yet, please do that. As people have to drop off, I did want to mention that we’ve got a couple of things for you we’ll send after the webinar, one of which is about this question which I’m about to pose to Steve, but it’s about latency. So, there is actually a write up already on latency, which I’d like to send around, which is going to help. But anyway, I’ll ask the question right now.

We’re looking for FPGA cards for HFT (High Frequency Trading). Does your card come with network IP cores, like MAC, UDP, TCP/IP cores? How does the latency compare with competitors such as UltraScale, speed grade 3, that sort of thing?

And then there is another question I think along the same lines, but that answer should be for both.

Steve Mensor

Sure. Okay. First of all, at the silicon level, the MAC for the Ethernet in the various different forms is all hardened. If you want to have TCP/IP on top of that, that would be soft IP. There are a variety of third-party companies, ecosystem partners of Achronix, that offer that. If you contact Achronix, we can put you in touch with one of those companies that can help out in that regard. In terms of latency, we do have a datasheet. The title of it is “Minimizing Latency in Speedster7t and Speedcore FPGA products”, and it goes through the calculations. These are obviously focused on 10 Gigabit Ethernet; whether you’re running at a 16-bit interface or a 32-bit interface, it goes through and gives the details there.

The latency on this Speedster7t device is over 20 nanoseconds in 16-bit mode. Future devices will have an additional SerDes structure, some extra short-reach SerDes, and those implementations will offer the 16-bit interface running at something around 11.5 nanoseconds. So, a significant drop. But those will be on future Speedster7t devices.
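A short sketch of the arithmetic behind those latency figures follows. The 20 ns and 11.5 ns values come from the answer above (the first is quoted as "over 20", so it is treated as a lower bound). The 10.3125 Gb/s serial line rate is the standard 10GBASE-R rate and is an assumption here, since the webinar does not state it; the function names are illustrative.

```python
CURRENT_LATENCY_NS = 20.0  # "over 20 ns" in 16-bit mode, treated as a lower bound
FUTURE_LATENCY_NS = 11.5   # figure quoted for future devices
LINE_RATE_GBPS = 10.3125   # assumed: standard 10GBASE-R serial line rate


def reduction_percent(before_ns: float, after_ns: float) -> float:
    """Percentage reduction in latency going from before_ns to after_ns."""
    return (before_ns - after_ns) / before_ns * 100.0


def parallel_clock_mhz(line_rate_gbps: float, width_bits: int) -> float:
    """Clock rate of the parallel interface for a given SerDes data width."""
    return line_rate_gbps * 1000.0 / width_bits


if __name__ == "__main__":
    drop = reduction_percent(CURRENT_LATENCY_NS, FUTURE_LATENCY_NS)
    print(f"Latency reduction: at least {drop:.1f}%")
    for width in (16, 32):
        print(f"{width}-bit interface clock: {parallel_clock_mhz(LINE_RATE_GBPS, width):.2f} MHz")
```

The narrower 16-bit interface runs at twice the clock rate of the 32-bit one, which is why the interface width matters when every nanosecond counts.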

 

Marcus Weddle

All right. And in an earlier question it may have been covered, but the question is really more on a card level, I think.

What’s the process for getting user guides, architecture documents, trial tools? So, there are obviously Achronix tools for the chip, but on a card level, Craig, if you could speak to that and the developer site and that sort of thing.

Craig Petrie

Sure. Yeah, absolutely. So, we have quite detailed hardware information available to customers. That takes various forms and I think there’s good information on the product web page at the moment. If you go to the BittWare website, the Speedster7t device is featured in the main banner; you can click there and download the datasheet, which provides some good detail.

We have a developer’s site for BittWare products. So, customers who are considering purchasing product or have already purchased product can access the developer site, and you can download from there more detailed product information to give you all the detail you need to make a decision whether or not to proceed with the purchase, or if you’re mid-development, solve the problem you’re trying to crack.

Information’s available and I’d ask you to please get in touch and let us know what information you’re looking for. And I’d be very happy to provide that.

Marcus Weddle (53:44)

All right. This one’s for Steve, regarding the chip, I think. In particular, how are security functions handled on the card?

Steve Mensor

I assume the question is related to bitstream security; there is also data security like MACsec and IPsec. We do not have anything hardened for data security, although that will be considered for future devices. For bitstream security, there’s an array of different security measures, which we think is equivalent to or better than best in class in terms of authentication. It’s a verification authentication, physically, [    ] functionality. We have a document on that. I believe it’s the user guide. You can contact Achronix and we can give you the details on that. It goes through and explains exactly all the security measures that are hardened into the FPGA.

Marcus Weddle

That was a good point. I was thinking about security within the chip, but the security applications too. So, I appreciate that answer. Another one for you, Steve, on the chip – the Ethernet FEC modes, which ones are supported with hard IP?

Steve Mensor

That’s a good question and I should know that answer. It is documented. It is in our documentation, and I don’t know offhand.

Marcus Weddle

Yeah, that’s fair. We can get back to that person who asked the question. Obviously, it’s in the documentation too. Let’s see, the question is on the on-chip memory, the monolithic memory. So, what’s the largest size monolithic memory that can be configured internally using the 300 plus megabits of internal memory? I’m not sure if that figure is accurate, but if you could answer that Steve.

Steve Mensor

Yeah, I think what we’re talking about is within the Speedster7t family, so there’s a family of devices up on the Achronix website. This card has the 7t 1500, which has 190 megabits of memory in the form of BRAMs and LRAMs. The BRAMs are 72 Kbits each, and the LRAMs are much smaller, for register-file types of functionality; those are 2 Kbits each.
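Given the block sizes quoted above, a first estimate of how many blocks a memory needs is a simple ceiling division. The sketch below assumes 1 Kbit means 1024 bits and ignores the width and depth configurations the blocks actually support, so the result is a lower bound rather than what the tools would report. The function name is illustrative.

```python
BRAM_BITS = 72 * 1024  # 72 Kbit BRAM, assuming 1 Kbit = 1024 bits
LRAM_BITS = 2 * 1024   # 2 Kbit LRAM, same assumption


def blocks_needed(total_bits: int, block_bits: int) -> int:
    """Minimum number of blocks needed to hold total_bits (ceiling division)."""
    if total_bits < 0 or block_bits <= 0:
        raise ValueError("sizes must be positive")
    return -(-total_bits // block_bits)


if __name__ == "__main__":
    one_mbit_buffer = 1024 * 1024  # illustrative 1 Mbit buffer
    small_regfile = 32 * 64        # illustrative 32-entry, 64-bit register file
    print(f"1 Mbit buffer: {blocks_needed(one_mbit_buffer, BRAM_BITS)} BRAMs")
    print(f"32x64 register file: {blocks_needed(small_regfile, LRAM_BITS)} LRAM(s)")
```

A single logical memory larger than one block is built by combining blocks in the fabric, so the practical upper limit depends on routing and timing as well as on the raw block count.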

Marcus Weddle

Yes, good point. Sorry, I’m thinking of the existing card and what’s on there, but there are other devices as well coming down the road. Sorry, I’m just looking through the questions to make sure I’m catching the earlier ones that have been asked. There’s a question on TensorFlow and a couple of other ML frameworks. I guess you kind of answered that already.

Steve Mensor

I think those are great. That is one of the focus capabilities in the Speedster7t devices. So, what I would suggest if people want more details is to contact Achronix directly. There are many different frameworks, you know, TensorFlow, Caffe and a variety of other network types of applications. So, I would suggest you contact Achronix and we can go through and identify what your requirements are and tell you how we are going to address that.

Marcus Weddle (57:27)

Okay. We do still have at least a couple of questions to put out there, but we’re getting near the end. So if anybody has any more questions, ask them soon, because we are close to running out. For Craig Petrie: do we have any plans to release low profile cards in the near future?

Craig Petrie

Yep, good question. We started with the GPU class of device and we’re packing quite a lot into that. We do recognize the need for other form factors. Half height, half-length is of course a popular one, especially in SmartNIC types of applications. So yeah, we are looking at that and we do have other half height, half-length cards in our portfolio today. So, we’ve got form in being able to deliver those.

What we’re really trying to figure out is which product variant to do next. We’ve had a number of requests for different features, and Achronix has got four different Speedster7t devices publicly announced, of different sizes and with different features. And then there are some additional ones which are not yet in the public domain.

So, yeah, we’re looking at that now. Again, we’d love to hear some feedback on the detail of the requirements which would be needed. Depending on the use case, you might want a subset of the full card we have here in GPU class. Some customers don’t care about external memory for their applications. Some people will prefer SRAM instead of GDDR6.

So, perfect time to give us that feedback and help us make a good decision which ensures that your requirements are covered. So please get in touch.

Marcus Weddle (59:27)

Okay. Let’s see a couple more here, one of which is going to be a competitor comparison. I don’t know how much we want to get into detail about that, but certainly I can pose it here. Compared to the next generation Xilinx device, the Versal chip, what’s the benefit of an S7t? So, if you could speak to that, maybe both Craig and Steve, do you have a take on that?

Steve Mensor

Yeah, that’s a good question for both of us, because there are answers at the chip level and at the board level. So let me take the chip level. Versal is a heterogeneous architecture trying to solve the AI problem with vector engines. It has four major constructs: high speed IOs, FPGA fabric, Vector Engines and a CPU complex.

It’s interesting. They’ll have to prove out how data will be moved across those different constructs in a usable system development format to ultimately address different types of applications. Speedster7t, on the other hand, has a couple of things. Number one, it is more of a traditional FPGA architecture where the AI functionality is addressed in the FPGA fabric with what I talked about earlier, the MLP blocks, which offer very high AI performance within the FPGA, so data doesn’t have to be formatted and then moved to a different part of the device. Number two is the 2D NoC that I talked about. That is a very intriguing architectural feature where folks won’t be designing a lot of the circuitry that they historically did, which is that connectivity: how do I talk to Ethernet, how do I talk to my memory interfaces, etc.? Instead, you’ll be using a standard AXI interface at each of the NoC access points that are distributed throughout the device. And at that point, the 2D NoC takes care of everything. So, it enables much higher performance because you don’t have that logic congestion and timing closure challenge. And ultimately it saves a lot of those valuable FPGA resources, so you get a much higher density FPGA than you’d otherwise get with a traditional FPGA.

Marcus Weddle

And then Craig, your take on it as well.

Craig Petrie

Yeah. So, I guess there will naturally be comparisons to the other FPGA vendors. From Intel there’s Agilex. It’s more of a traditional type of FPGA; it doesn’t have a NoC, for example. It’s very much like Stratix 10 in many ways. In terms of Xilinx, they’ve announced the Versal chip, which does have a NoC, and Steve’s mentioned some of the features of the Speedster device which you can draw comparisons against. At the card level, there’s actually very little information out there for customers to draw direct comparisons between cards. If you look at Intel with Agilex, I think there’s a Dev Kit which is going to be available. It’s a kind of lab environment type of card, we’re led to believe, for on-the-bench use, with lots of different connectors on it; certainly not an enterprise-class product for deployment. That will probably come later.

From Xilinx, what they’ve announced, and some folks on the call may have been at the XDF events in San Jose and Europe just this week, is that Versal is available at the card level. But again, it’s just a Dev Kit. So, it’s very much just for low volumes doing PoC stuff in the lab. What we’ve got here is a card which of course absolutely can be used in the lab for getting demos going and doing proofs of concept. But this card has been designed from scratch to be deployment ready. So, it’s an enterprise-class card. It’s cost-effective at low and high volumes, and we’ve designed it to make sure it has very high quality and reliability built in. That’s why, in terms of the marketing message from Achronix and BittWare, we’re positioning this card as an enterprise-class product, and over time there will be some competition there. But we think that the Speedster7t device, the ACE Tools and the features at the card level, and also server level, will give customers really good performance and a lot of flexibility to let the card be used for different application types.

Marcus Weddle (1:04:27)

Thank you. And just a quick plug for next week. If anybody is at the Supercomputing Show, we will have the S7t board to take a look at. Just a showboard, obviously, but make sure to stop by the Achronix or BittWare/Molex booth.

Let’s see, we’ve got a question. I guess I’m going to ask these together, for Craig probably. One is about JTAG and one is about board schematics. Question 1: Does the board already have a JTAG adapter for programming and debugging, and how long is the upload process for loading bitstreams using JTAG over USB? And then question 2: Will the card come with reference schematics?

Craig Petrie

Okay. So maybe I can start Steve, just at the card level and you can jump in. So, it’s a good question. The JTAG access and debug is lowish tech, right? It’s not super high speed, but it’s absolutely critical to the user experience and how you access some of the card features.

So, this is where all the benefits of BittWare’s BMC (Board Management Controller) kick in. So built into the card we’ve got JTAG access simply through a USB cable, so plug in a USB cable to the card, you can tap into the BMC which gives you access to JTAG and there’s an FPGA UART there as well.

So, through that capability you can program your bitstream, your executable for the FPGA. Most customers will probably prefer to do that through the host API and driver, so you can program the card just from software through the PCIe bus. But if you’re using a JTAG cable for programming, that’s also an option. And through the BMC, through the PCIe host and the JTAG adapter, you can also read back the onboard card parameters: power, voltage, current and various others.

So, all of that is built in and hopefully that will give customers a good user experience during development of the application and also during the monitoring of the application, when it’s actually running. Steve, I don’t know if you’ve anything to add to that.

Steve Mensor

No, that’s a perfect description.

Marcus Weddle

Okay. Just one quick follow up question then that came in. Aside from USB to JTAG and FPGA UART, what are the other features that the BMC provides? I guess that was somewhat answered but if you want to elaborate a little bit Craig, that would be good.

Craig Petrie

Yeah, I just realized I didn’t fully answer the previous question. There was a question saying will schematics be available. So, schematics are not provided as a default deliverable on the product.  There are only certain circumstances under which a customer would need schematic information. There is a lot of detail in the hardware reference guide which we provide, so hopefully that’s got all the information that customers need to develop their bitstream.

If a customer has a reason why they may need to get schematic information, and possibilities there include someone making use of maybe the digital IO header or an expansion port and wanting to understand a bit more about how the card works, then that’s no problem. We do share schematic information under NDA. It’s intellectual property owned by BittWare so we just want to make sure we control that carefully. So, customers with questions will definitely get answers. And that may include sharing some snippets of our schematics to make sure that they’ve got all the information they need on that one.

In terms of the other question on the BMC, yeah, I think I probably covered most of that. We do have a lot of good information on the website which goes into the detail of the BMC and how it works, including examples. I think what we’ll do (Marcus is taking an action) is, as part of the summary of the webinar, provide the link to that information from the website.

Marcus Weddle

All right. Yes, that sounds good. That said, I think we’ve certainly gone into a lot of Q&A and I appreciate that. To everyone on the call, again, we’re here for you. Send us more questions offline. Obviously, the best way to do that is to go to our respective websites, achronix.com and bittware.com.

We would love to hear more questions, talk one on one, set up some time to chat. That would be great. But otherwise, hopefully this webinar gave you a good taste of this new card. Like I said, we’re going to be at the show next week – SC19. Visit our booth there and we can speak further.

So again, I want to thank my fellow panelists for joining us. Thank you to everyone on the call for doing a lot of Q&A, hopefully it was helpful. More of these webinars will be coming up, check out our respective websites for that and certainly keep in touch and thank you.
