
Multiprocessing DSP Systems

Ron Huizen
VP of Product Development, BittWare, Inc.

While the lure of the telecommunications and consumer markets draws the focus of many DSP chip and board vendors toward low-cost, fixed-point, single-chip solutions, the need for traditional high-end multiprocessor DSP systems remains strong. As DSPs become more powerful, traditional applications eagerly gobble up all of the additional processing they can get. Further, recent increases in processing power allow for new and innovative applications that were not previously feasible.

This article explores typical applications, architectures, tools, and design issues of multiprocessing DSP systems. The focus is on systems designed for OEM products rather than on scientific or research-oriented systems.

Typical Applications

Traditional applications for multiprocessor DSP systems include medical imaging, sonar, radar, and inspection systems, while new applications are emerging in areas such as digital video processing and specific segments of telecommunications. These systems demand a large amount of compute power coupled with extremely high data bandwidth. Multiprocessor DSP applications are almost always real time: they deal with real-world signals and must perform within predictable, bounded time constraints. Given the complexities of these systems, floating-point processors have tended to dominate this market, as the volumes do not usually justify the additional effort required to use lower-cost fixed-point devices.

Typical Systems

A typical system consists of a host Single Board Computer (SBC) controlling multiple DSP boards and I/O. While industrial PCI and VME bus formats used to split this market, the maturing of CompactPCI has provided for a more standard platform. [The switch to CompactPCI that has been so long predicted and anticipated is finally occurring. Traditional industrial PCI users are switching for the mechanical strengths of CompactPCI, while VME users can keep the mechanical advantages they enjoyed with VME while gaining the architectural advantages of PCI. The fact that almost all current VME single board computers use a local PCI bus onboard and then have to bridge to the VME bus makes it an easy migration to just drop the VME bus altogether.] By far the most popular form factor for implementing multiprocessor DSP systems is 6U, as it allows for a large amount of processing per board.

The host, or system controller, is usually based on a standard compute platform such as Pentium, Sparc, Alpha, or PowerPC. The operating system can range from the many flavors of Windows, to Linux, Solaris, and Digital Unix, to RTOSs such as VxWorks. The role of the host is to provide the overall control of the system, managing and monitoring the system activity, as well as providing the user's interface to the system, whether this be a GUI or a remote network connection. The host may also take an active role in the actual processing of the system, although typically the hard real time aspects are left to the DSP subsystem.

The input subsystem provides the real world data to be processed and is typically either some form of high speed analog capture board, an imaging board or a high speed digital interface. The data flows can be a stream of continuous samples or frames of captured data. The output subsystem takes the results of the processing and provides them back to the world in some meaningful way. This is often some form of user interface or control system, or it may be some combination of DACs and digital outputs.

The DSPs are the heart of the system. They take the inputs from the I/O subsystem, perform the required processing, and deliver the results to the output subsystem. They tend to work in a continuous, data-flow-driven state, often referred to as stream processing, keeping up with the input subsystem and driving the outputs as required. There are two main types of system I/O configurations: one in which each DSP board or group of boards is matched with an I/O device, and another in which a single I/O device feeds multiple DSP boards. In the first type of system, a DSP subsystem is sized to provide all the required processing for its I/O device, and the system scales by adding further sets of I/O and DSP. In the second type of system, a single input provides enough data for multiple DSP subsystems, so the system scales by adding more DSPs.
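One common way to organize stream processing is with two alternating ("ping-pong") buffers, so that one block can be loaded while the previous one is processed. The article does not name this technique; the sketch below is only an illustration in portable C. The functions `fake_dma_fill` and `process_block`, and the block length, are hypothetical stand-ins, and the "transfer" here runs sequentially rather than truly in the background.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_LEN 4  /* samples per block; illustrative value */

/* Hypothetical stand-in for a background transfer from the input
 * subsystem: copies block number `block_index` of `stream` into `dst`. */
void fake_dma_fill(const int *stream, size_t block_index, int *dst)
{
    for (size_t i = 0; i < BLOCK_LEN; i++)
        dst[i] = stream[block_index * BLOCK_LEN + i];
}

/* Hypothetical stand-in for the signal processing step: sums one block. */
long process_block(const int *block)
{
    long sum = 0;
    for (size_t i = 0; i < BLOCK_LEN; i++)
        sum += block[i];
    return sum;
}

/* Processes `num_blocks` blocks with two alternating buffers. The next
 * block is loaded into the idle buffer before the current one is
 * processed; on real hardware the load would overlap the computation.
 * Writes one result per block and returns the number of results. */
size_t stream_process(const int *stream, size_t num_blocks, long *results)
{
    int buf[2][BLOCK_LEN];
    size_t cur = 0;

    assert(stream != NULL && results != NULL);
    if (num_blocks == 0)
        return 0;

    fake_dma_fill(stream, 0, buf[cur]);  /* prime the first buffer */
    for (size_t n = 0; n < num_blocks; n++) {
        size_t next = cur ^ 1;
        if (n + 1 < num_blocks)
            fake_dma_fill(stream, n + 1, buf[next]);
        results[n] = process_block(buf[cur]);
        cur = next;  /* swap roles of the two buffers */
    }
    return num_blocks;
}
```

Calling `stream_process` on the eight samples 1 through 8 with `num_blocks` set to 2 yields one sum per block. The point of the structure is that the processing step never touches the buffer that is currently being filled.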

Critical Design Issues

In designing multiprocessor DSP systems, two primary requirements are highly interdependent: processing power and data bandwidth. Often designers put too much emphasis on the processing requirements and too little on the data bandwidth needs. DSP chip vendors advertise processing performance numbers based on theoretical best cases, which tend to assume that one has the data sitting in some fast internal memory ready to be processed. They tend to ignore the fact that getting that data into the DSP and then getting the results out of the DSP may actually take more time than the processing itself. A DSP that is not architected to balance data flows and processing power does not lend itself well to multiprocessing DSP systems with large data flow requirements. System designers often find themselves struggling to achieve fractions of the promised performance when the DSPs must run in a continuous manner. Insiders in the DSP industry have been working (so far unsuccessfully) towards performance numbers from the chip vendors that reflect continuous mode performance, including the time it takes to get the data into and out of the DSP.
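The gap between peak and continuous performance can be estimated with simple arithmetic. The sketch below is a rough model, not a vendor benchmark: the structure `block_timing`, the function names, and the assumption that input and output transfers share a single path are all illustrative choices.

```c
#include <assert.h>

/* Time spent on one block of data, in microseconds (illustrative). */
typedef struct {
    unsigned input_us;    /* moving the block into the DSP */
    unsigned compute_us;  /* processing it from fast internal memory */
    unsigned output_us;   /* moving the results out of the DSP */
} block_timing;

/* Serial case: the core sits idle while data moves in and out. */
unsigned serial_period_us(block_timing t)
{
    return t.input_us + t.compute_us + t.output_us;
}

/* Overlapped case: transfers run in parallel with the computation.
 * Assumes input and output share one transfer path, so the period is
 * set by whichever is longer: the compute time or the total transfer
 * time. */
unsigned overlapped_period_us(block_timing t)
{
    unsigned transfer_us = t.input_us + t.output_us;
    return t.compute_us > transfer_us ? t.compute_us : transfer_us;
}

/* Share of each period the core spends computing, in whole percent
 * (rounded down). */
unsigned core_utilization_pct(unsigned compute_us, unsigned period_us)
{
    assert(period_us > 0);
    return compute_us * 100u / period_us;
}
```

With 40 µs of input, 100 µs of compute, and 20 µs of output per block, the serial period is 160 µs and the core is busy about 62% of the time. With overlap the period drops to 100 µs and the core is fully used. If the transfers took longer than the computation, the overlapped period would be set by the transfers, which is the case where data bandwidth, not processing power, limits the system.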

Multiprocessor Design Issues

In addition to looking at the capabilities of each individual DSP, you must carefully examine how these DSPs interact in a multiprocessor system. The two standard approaches to connecting multiple DSPs are clustered and chained. In a clustered architecture, DSPs share an external bus, which allows them to share external devices, such as system memory or I/O, and communicate directly with each other. Chained designs typically make use of dedicated processor peripheral ports, with a series of DSPs chained together over proprietary high-speed interconnects.

A significant advantage of a clustered architecture is that the shared bus makes interprocessor communication and data sharing easier; in a chained architecture, each DSP has a private bus for its external devices. The main disadvantage of a clustered design is that the DSPs must share the bandwidth of the external bus, while chained systems tend to be harder to program and manage.

Another issue in multiprocessor architectures is that adding DSPs tends to increase the data bandwidth requirements, because the multiple DSPs require additional communication and data flows. Take care to include these data flows when considering scaling requirements.
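To make the scaling effect concrete, the following sketch adds up the traffic on a shared cluster bus under one simple, assumed model: the work is split into a pipeline, the external input and output each cross the bus once, and every hand-off between consecutive DSPs crosses it once. The model, the function names, and the rates are illustrative; a real application may distribute its data quite differently.

```c
#include <assert.h>

/* Total demand on a shared bus, in MByte/s, for a pipeline of
 * `num_dsps` processors (assumed model, see the text above).
 * There are (num_dsps - 1) hand-offs between consecutive DSPs. */
unsigned shared_bus_demand(unsigned num_dsps,
                           unsigned input_rate,
                           unsigned output_rate,
                           unsigned handoff_rate)
{
    assert(num_dsps >= 1);
    return input_rate + output_rate + (num_dsps - 1) * handoff_rate;
}

/* Returns 1 if the demand fits within the bus's peak bandwidth. */
int bus_is_sufficient(unsigned demand, unsigned bus_peak)
{
    return demand <= bus_peak;
}
```

With an input of 100 MByte/s, an output of 20 MByte/s, and 80 MByte/s handed from each DSP to the next, one DSP needs 120 MByte/s of bus bandwidth, four need 360 MByte/s, and six need 520 MByte/s. Against an illustrative 400 MByte/s bus, the four-DSP case fits and the six-DSP case does not, even though the external data rate never changed.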

Chip-Level DSP Architectures

Some DSP architectures, such as the SHARC® from Analog Devices, have been specifically designed to provide a balance of data bandwidth and compute power. The SHARCs have a large dual ported internal memory that the processing core can access at full speed while the integrated I/O processor is accessing it simultaneously. This means that sustained performance can very nearly approach peak performance, since the task of moving data on and off chip can be done in parallel by DMA engines operating on an independent bus into the internal dual ported memory. The SHARC takes advantage of this high internal memory bandwidth by providing a fast external bus and high speed peripherals. For example, the ADSP-21160, the latest SHARC processor, has a 64-bit, 50 MHz external bus and six 100 MByte/second link ports, each with dedicated DMA engines that can access the 64-bit wide, 100 MHz internal dual ported RAM. The combination of multiple high data bandwidth buses and the required peripherals to take advantage of them allows the ADSP-21160 to outperform DSPs that claim higher peak processing performance, but fail to live up to those numbers in real world applications.
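The bandwidth figures quoted above follow from bus width and clock rate. The helper functions below are a small sketch of that arithmetic; their names are illustrative, and they assume one transfer per clock cycle and decimal megabytes, so the results are peak values rather than sustained ones.

```c
#include <assert.h>

/* Peak bandwidth of a parallel bus in MByte/s:
 * (width in bytes) x (clock in MHz), assuming one transfer per clock. */
unsigned bus_peak_mbytes_per_s(unsigned width_bits, unsigned clock_mhz)
{
    assert(width_bits % 8 == 0);
    return (width_bits / 8) * clock_mhz;
}

/* Combined peak bandwidth of several identical links in MByte/s. */
unsigned links_peak_mbytes_per_s(unsigned num_links,
                                 unsigned per_link_mbytes_per_s)
{
    return num_links * per_link_mbytes_per_s;
}
```

For the ADSP-21160 numbers given in the text, the 64-bit, 50 MHz external bus works out to 400 MByte/s, the six 100 MByte/s link ports to 600 MByte/s combined, and the 64-bit, 100 MHz internal memory port to 800 MByte/s.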

The SHARC also lends itself well to architecting multiprocessor systems, providing both the link ports for use in chained systems and a multiprocessor interface for clustering up to six DSPs. On the cluster bus each SHARC has full access to the internal memory and I/O registers of the other DSPs, and can also generate vectored interrupts into other processors in the cluster.

Balanced DSP Boards

Even though a DSP chip architecture provides balanced processing power and data bandwidth, a multiprocessor DSP board built around that DSP does not necessarily extend that balance to the system designer. Great care must be taken in architecting a multiprocessor DSP board so that it delivers the power of the DSPs flexibly enough for users to apply it efficiently within their application. The board vendor's role is not to dictate how the system designers must implement their algorithm, but to provide a flexible and powerful platform that allows the system designers to do what they do best. It is amazing what system engineers can, and will, create when given resources that empower innovation.

A multiprocessor DSP board should not only provide a balance of compute power and data bandwidth, but should also balance the advantages and disadvantages of clustered and chained multiprocessor architectures. A board that has data bottlenecks inherent in its design will throw away what the DSP chip architect has worked so hard to provide. Likewise, a board that tries too hard to maximize data bandwidth may give up flexibility and ease of programming, forcing the system designer into a predetermined implementation approach.

BittWare's Hammerhead DSP board architecture [see Figure 1] was designed to strike this balance of power, data bandwidth, and flexibility. The design centers on a cluster of four Analog Devices ADSP-21160 SHARC processors, which share a large SDRAM (up to 512 Mbytes), a 64-bit 66 MHz PCI interface, Flash, and UART. In addition, the four ADSP-21160s have a double ring link port interconnection topology to support the advantages of chained designs. While the ADSP-21160 allows for a cluster of six DSPs, we chose four as a balance between clustered and chained designs. For high-speed I/O, the board provides a high-performance PCI interface, a dedicated PMC I/O site, and external link ports into every DSP.
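For a ring of four DSPs, neighbor selection and hop counts reduce to modular arithmetic. The sketch below assumes each DSP is linked to the next and the previous DSP in the ring and numbers the DSPs 0 to 3; the article does not spell out the exact link port wiring, so the numbering and function names are illustrative only.

```c
#include <assert.h>

#define NUM_DSPS 4  /* DSPs in the ring, as in the cluster described */

/* Next DSP around the ring in the forward direction. */
unsigned ring_next(unsigned id)
{
    assert(id < NUM_DSPS);
    return (id + 1) % NUM_DSPS;
}

/* Previous DSP around the ring (the reverse direction). */
unsigned ring_prev(unsigned id)
{
    assert(id < NUM_DSPS);
    return (id + NUM_DSPS - 1) % NUM_DSPS;
}

/* Fewest link hops between two DSPs, taking the shorter way round. */
unsigned ring_hops(unsigned from, unsigned to)
{
    assert(from < NUM_DSPS && to < NUM_DSPS);
    unsigned forward = (to + NUM_DSPS - from) % NUM_DSPS;
    unsigned backward = NUM_DSPS - forward;
    return forward < backward ? forward : backward;
}
```

Under this assumption, any DSP in a four-processor ring reaches its two neighbors in one hop and the opposite DSP in two, so no transfer over the links needs more than one intermediate processor.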

Figure 1: Hammerhead-PCI Board Block Diagram

This combination of clustered and chained design, with the high-speed 64-bit 66 MHz PCI bus, provides an extremely flexible platform whose compute power is backed up by extensive I/O bandwidth. Careful use of bridges on the PCI bus provides segmentation: data flowing between the PMC site and its DSP cluster stays on a local bus segment. On the Hammerhead 6U CompactPCI board, which has two of the Hammerhead designs on it, each cluster of DSPs can use the full bandwidth of its local 64/66 PCI bus to its PMC site without interfering with the other cluster. The bridges also minimize backplane traffic because all local data flows stay onboard.
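The effect of segmentation can be sketched by tagging each data flow with the bus segment it starts and ends on. In the simplified model below, which is an assumption for illustration and not a description of the board's bridge configuration, a flow that stays within one segment loads only that segment, while a flow between segments loads both segments and the backplane. The type `pci_flow` and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One data flow between two bus segments, in MByte/s (illustrative). */
typedef struct {
    unsigned src_segment;
    unsigned dst_segment;
    unsigned mbytes_per_s;
} pci_flow;

/* Traffic that must cross the backplane: flows whose source and
 * destination lie on different segments. */
unsigned backplane_load(const pci_flow *flows, size_t count)
{
    unsigned total = 0;
    assert(flows != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (flows[i].src_segment != flows[i].dst_segment)
            total += flows[i].mbytes_per_s;
    return total;
}

/* Traffic seen on one local segment: every flow that starts or ends
 * there, counted once. */
unsigned segment_load(const pci_flow *flows, size_t count, unsigned segment)
{
    unsigned total = 0;
    assert(flows != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (flows[i].src_segment == segment ||
            flows[i].dst_segment == segment)
            total += flows[i].mbytes_per_s;
    return total;
}
```

For example, with 300 MByte/s flowing within segment 0, 250 MByte/s within segment 1, and 40 MByte/s between them, each segment carries its own local traffic plus the 40 MByte/s it shares, while the backplane sees only the 40 MByte/s.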

Application Development Tools

Once you have your system, how do you go about implementing your application on it? Currently, you face a wide range of choices. For programming, writing in assembler, which used to be the norm in DSP, is seldom done with these large DSPs because the C compilers are good and continually improving. C++ is a fairly recent addition for DSPs and hence will undergo the same lag time as it did for traditional programming. The majority of developers today are writing their applications primarily in C and buying optimized math libraries, such as BittWare's SpeedDSP, that have been hand optimized for the DSP. Only when the system is nearly complete and the last bit of performance is being squeezed out do they tend to revert to assembler for critical code sections. VisualDSP, the integrated development environment for SHARCs, supports multiprocessor builds, which allow you to share symbols between modules on different DSPs, assuming that they are within the same cluster of DSPs.

At the other end of the programming spectrum, tools such as BittWare's SharcLAB allow you to develop within The Mathworks Simulink® environment. Using code generation, these tools will generate C code from your Simulink model, build it with the DSP compiler, and generate an executable for your DSP. This executable can then be run standalone or within the Simulink environment, allowing you to tune parameters on the fly and view your results. A current limitation of these tools is their lack of support for multiprocessor applications. Designers must currently work on a single DSP at a time as there is not an easy method by which to split a Simulink model over multiple DSPs. Methods for removing this limitation are currently under development.

As for operating systems, solutions range from none, to a basic kernel, to a true multiprocessing DSP operating system. Developers who opt not to use an OS are typically designing OEM products that are well specified and stable; that is, they know what they want to do and how to do it, and they do not need the features provided by an OS. In the world of kernels, many of the standard small open source RTOSs support the large DSPs; however, few are actually designed for DSP or multiprocessor systems. Analog Devices currently has a small kernel in the works for the SHARCs, which was announced at the recent Embedded Systems Conference.

Virtuoso, recently acquired by Wind River Systems, is a true multiprocessing DSP operating system. It has been designed from the ground up for DSP and provides a "Virtual Single Processor" model whereby one designs tasks and can assign them to DSPs at build time. The OS takes care of intertask communication, using a configuration file that tells it how your DSPs are connected. It prefers to use link ports for interprocessor communications, though it can also use shared memory.

Debugging tools tend to be tied to the development tools. Using emulators and software based targets is common, while higher level tools such as Virtuoso and Simulink provide their own tools to aid in debugging your system. The VisualDSP debugger, whether using an emulator or a software target, allows multiprocessor debug sessions in which you can work with multiple DSPs within the same debug session.

Finally, board-vendor-specific tools and utilities, such as BittWare's DSP21k Toolkit, allow you to develop applications on the system controller that interact with the DSPs. This toolkit, consisting of drivers, libraries, and utilities, lets a program running on the system controller download code to the DSPs, send and retrieve data, start and stop the DSPs, and exchange interrupts with them. It also allows for allocating buffers in host memory that the DSPs can directly DMA data into and out of over the PCI bus.

Summary

  • The market for multiprocessing DSP systems continues to be strong, with CompactPCI emerging as a common platform.
  • A typical system consists of a system controller, the DSPs, and I/O, with the DSPs looking after the real time data processing, and the system controller providing overall control, monitoring, and the user interfaces.
  • The critical design issues in multiprocessor DSP systems are compute power and data bandwidth, with data bandwidth often not getting the attention it deserves.
  • A multiprocessor DSP board must provide the processing power and data bandwidth in a flexible way to allow system designers the freedom to innovate.
  • To maximize continuous real world performance, a DSP architecture, both at the chip and board level, must provide a means to run the data flows in parallel with the computations.
  • Careful use of bridging techniques on shared buses combined with dedicated DSP interconnects can provide a system that makes for straightforward programming and control, while providing the critical power and data bandwidth needed.
  • The wide range of development tools available for multiprocessor DSP allows you to tailor your development environment to your methodology and needs.
