networkZONE Products for the week of February 19, 2007


Stream Processors Says…
Storm-1 Family of Data-Parallel Digital Signal Processors
Advanced digital signal processor architecture enables easy development of efficient applications for parallel processing

Stream Processors, Inc. (SPI), has announced the first members of its Storm-1 family of data-parallel digital signal processors (DSPs) -- the SP16-G160 and the SP8-G80. Based on the company's breakthrough SPI Stream Processor Architecture, the Storm-1 family combines the ability to scale to 80 giga multiply-accumulate operations per second (GMACs/s) with a simple, predictable and efficient C programming model. The Storm-1 family is suitable for a wide range of demanding signal processing applications, such as H.264 HD video encoding, transcoding, analytics, image processing and video surveillance, with processing headroom for customer-specific enhancements.

"Data bandwidth and ease of programming are what matter most in modern computer systems," said Chip Stearns, president and CEO of SPI. "The level of performance offered by the Storm-1 family will enable a new wave of innovation in markets that have been begging for an easy-to-use path to higher performance without sacrificing software programmability."

The SP16-G160 and SP8-G80: A New Class of DSPs
The Storm-1 SP16-G160 device is a DSP that offers 160 giga operations per second (GOPS) and 80 GMACs/s of performance by featuring a high-performance data-parallel unit (DPU) with 16 parallel lanes of five ALUs each. Each ALU contains a MAC unit and is capable of four 8-bit, two 16-bit or one 32-bit operation per cycle. Input and output data for each lane are stored in on-chip lane register files that are allocated by the compiler to maximize data bandwidth. Each device includes a MIPS32 4KEc CPU core for system tasks, and a second MIPS32 4KEc that is dedicated to handling main DSP threads and making kernel function calls to the DPU for acceleration. A rich set of I/O includes Gigabit Ethernet, PCI, and high-speed data ports for video and communications. The architecture is designed to make parallel performance easily accessible to programmers, and a key feature is its compiler-managed memory hierarchy and single-threaded approach. A simple C programming model allows specification of compute-intensive kernel functions that process data records, enabling the compiler and hardware to efficiently manage on-chip memory and synchronize runtime direct-memory access (DMA). Kernel functions process stream data in a data-parallel fashion across all of the lanes. Unlike with traditional DSPs, there is no need to spend time manually choreographing caches, synchronizing DMA, or load-balancing cores, which greatly increases predictability and simplifies the overall programming task.

The Storm-1 SP8-G80 device leverages the SPI Stream Processor Architecture in an eight-lane flavor offering 80 GOPS of performance. Additional information about the Storm-1 family can be found at http://www.streamprocessors.com/.

Development Tools
SPI's RapiDev tool suite supports an industry standard development and debug flow using C language tools running on a Windows/Cygwin or Linux platform. The RapiDev tool suite includes easy-to-use functional and cycle-accurate simulators and leverages the predictability of SPI's Stream Processor Architecture to provide a fast, linear path to production code. Source code compatibility is maintained across devices with different numbers of lanes and ALUs, providing greater scalability and portability.

The Storm-1 Development Kit supports evaluation and software development on SPI hardware, and has I/O options to support multiple video sources and formats, including HD and D1.

analogZONE Says . . .

While multi-core processors are all the rage in everything from laptops to basestations, there are many subtle and not-so-subtle reasons why they often fail to deliver the linear performance boost that one would expect from throwing two, four or even eight more cores at an application. But from what I have learned about the Stream Processors DSPs, it seems that their new approach to multi-core architecture may have found a way around many of these pitfalls -- at least for computing tasks that lend themselves to a highly-parallel approach.

Since Stream Processors has already done a pretty good job of explaining most of the basic technology behind their product, I'll use this review to drill down into the architecture a bit and explore a few of the basic concepts that set the Stream Processors design apart. I'll also take a shot at explaining why I think it has a fighting chance of becoming commercially viable when most other oddball computing arrays have never managed to catch on.

Before we go any further, I should explain that the term stream processing is SPI's term for a combination of a programming model and a hardware architecture designed for highly-parallel operations which, the company claims, overcome the performance problems that occur in clusters of conventional CISC, RISC and VLIW CPUs, not to mention the painful programming characteristics that plague most multi-processor designs.

The data-parallel approach combines the attributes of parallel processors and reconfigurable arrays -- but with a twist. Every one of the 16 multi-ALU elements (SPI calls them lanes) in the processor array runs the same program at any given time on different pieces of a data set (see Fig. 1). This ability to digest huge chunks of related data in a handful of clock cycles makes it a very powerful tool for processing large arrays such as video streams, or medical images.
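To make the lane model a little more concrete, here is a minimal Python sketch of the idea. It is not SPI's programming interface (their tools are C-based and no code was shown in the announcement), and the names `run_on_lanes` and `brighten` are hypothetical. All it illustrates is every lane applying the same kernel to its own share of a data set; the lanes are simulated one after another rather than in parallel.

```python
from typing import Callable, List, Sequence

NUM_LANES = 16  # lane count of the SP16-G160, per the article


def run_on_lanes(kernel: Callable[[int], int], data: Sequence[int],
                 num_lanes: int = NUM_LANES) -> List[int]:
    """Apply one kernel to every record, dealing records out to lanes.

    Record i is handled by lane i % num_lanes. Every lane runs the same
    kernel on its own records; the lanes are simulated sequentially here.
    """
    out = [0] * len(data)
    for lane in range(num_lanes):
        # Records owned by this lane: lane, lane + num_lanes, ...
        for i in range(lane, len(data), num_lanes):
            out[i] = kernel(data[i])
    return out


def brighten(pixel: int) -> int:
    """Hypothetical kernel: add 16 to an 8-bit pixel, saturating at 255."""
    return min(pixel + 16, 255)


pixels = list(range(0, 256, 8))          # 32 sample pixel values
result = run_on_lanes(brighten, pixels)  # same kernel, different data per lane
```

The point of the sketch is that the kernel itself contains no notion of lanes; distributing the records is somebody else's job, which in SPI's case is said to fall to the compiler and hardware.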

Of course there have been many multi-core processors, and even several reconfigurable compute fabrics (Lord knows, I've covered several of both flavors) which made similar claims and only delivered on a fraction of their promises. In some cases, manufacturers were unable to deliver programming tools that allowed software engineers to unlock the true potential of the chip without laborious bit-level coding techniques. In other cases, unexpected hardware bottlenecks did not allow the chip to yield anything close to the theoretical arithmetical summing of CPU power it was expected to deliver. I'd argue that products such as Stretch's software-configurable processors and Cavium's Nitrox and Octeon families of multi-core processors have managed to enjoy some well-deserved market success but, even then, internal choke points and programming challenges can keep these otherwise excellent devices from achieving their full theoretical performance levels.

So what is there about SPI's ambitious little engine that might allow it to escape a similar fate?

My long tech briefing with SPI revealed lots of interesting things about the Stream Processors architecture that set it apart from most of the other compute arrays I've seen to date, but I think I can boil down my impressions to three significant developments:

  1. First (and probably most important), the folks at SPI reversed the normal processor design methodology by letting programming requirements drive the silicon instead of vice-versa. To this end, they created a compute architecture around a programming model that allows most, or all, of the software to be developed in straightforward C language which can be efficiently parsed and translated to code for the native silicon. The result is that you get programming tools which let you quickly produce applications that take full advantage of the chip's parallelism without having to worry too much about bit-twiddling within the chip's innards.
  2. The Stream Processors architecture does an excellent job of removing most of the bottlenecks that can have multi-processor systems spending nearly as much time competing with each other for bus and memory bandwidth as they do getting the actual work done. Besides using lots of distributed memory (details below), the design makes extensive use of meshed interconnect buses and other clever architectural tricks that allow most, or all, of the processing cores to run full-bore most of the time.
  3. SPI has taken a distributed approach to on-chip memory, placing small chunks of it right where it's needed instead of in big centralized pools. For example, each of the machine's many ALUs has its own local register files and memory as well as another set of registers that are shared among its small cluster of related compute elements. Putting the memory as close as possible to where it's needed is one of several reasons that this chip is able to clock its processors at 700+ MHz using standard 0.13-micron CMOS.

As promised earlier, I'll try to confine this part of the review to a few interesting observations about the processor's compute array and the tools used to program it.

The Stream Processors compute array (known as the data-parallel unit, or DPU) is fed by a standard MIPS RISC core that manages the I/O and configuration of the other processors on the chip. A second MIPS processor manages the DPU and coordinates its work flow among the 16 lanes inside the DPU. Each of the DPU lanes has five ALUs running in parallel as a VLIW engine. The 32-bit ALUs look a lot like a DSP MAC and can perform four 8-bit operations (add, shift, multiply, etc.) per clock cycle. Each ALU within a given lane can execute a different program, but all lanes must be running the same set of programs.
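One way to picture four 8-bit operations per clock on a 32-bit ALU is the packed (subword) arithmetic sketched below. This is a generic software illustration of the technique, not a description of SPI's instruction set, and the names `pack_u8`, `unpack_u8` and `packed_add_u8` are hypothetical.

```python
def pack_u8(b3: int, b2: int, b1: int, b0: int) -> int:
    """Pack four unsigned 8-bit values into one 32-bit word (b3 is the top byte)."""
    return ((b3 & 0xFF) << 24) | ((b2 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b0 & 0xFF)


def unpack_u8(word: int) -> tuple:
    """Split a 32-bit word back into its four bytes, top byte first."""
    return ((word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF)


def packed_add_u8(a: int, b: int) -> int:
    """Add four 8-bit fields at once, each wrapping modulo 256.

    The low 7 bits of every byte are summed in one 32-bit addition; masking
    off the top bit of each byte first guarantees no carry can cross into
    the neighbouring byte. The top bit of each byte is then fixed up with XOR.
    """
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    return (low ^ ((a ^ b) & 0x80808080)) & 0xFFFFFFFF


x = pack_u8(1, 255, 127, 128)
y = pack_u8(1, 1, 1, 128)
print(unpack_u8(packed_add_u8(x, y)))  # (2, 0, 128, 0)
```

A hardware ALU would simply cut the carry chain at the byte boundaries instead of masking, but the effect on the data is the same: one 32-bit datapath does the work of four narrow ones.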

Each of the 16 lanes can communicate with other lanes to pass interim results or completed data on each clock cycle using the device's fully-meshed inter-lane switch. Eliminating the common bottlenecks that plague most multi-core designs should enable linear scaling of performance by simply adding lanes and/or ALUs with almost no overhead penalty -- at least until the complexity of the meshed switch becomes too overwhelming to practically implement.
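To get a feel for why a meshed switch eventually becomes the limiting factor, consider the worst case in which every pair of lanes has its own dedicated link. That is purely an assumption for illustration, since SPI did not describe how the inter-lane switch is actually built, and `full_mesh_links` is a hypothetical helper. Under that assumption the lane count grows linearly while the link count grows quadratically:

```python
def full_mesh_links(lanes: int) -> int:
    """Number of point-to-point links needed to connect every pair of lanes."""
    if lanes < 0:
        raise ValueError("lane count must be non-negative")
    return lanes * (lanes - 1) // 2


for lanes in (8, 16, 32, 64):
    print(f"{lanes:3d} lanes -> {full_mesh_links(lanes):5d} links")
```

Doubling from 16 to 32 lanes doubles the compute but roughly quadruples the wiring (120 links versus 496), which is the kind of growth that would eventually cap a fully meshed design.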

As I mentioned earlier, the Stream Processors architecture also owes much of its efficiency and ease of use to its compiler, which helps the multi-lane DPU look like a single-threaded processor to a programmer. The compiler breaks tasks down into kernels, the basic collections of functions that are executed by one or more of the 16 processing lanes. It is also responsible for determining a program's memory requirements and allocating storage as needed during configuration to run a particular kernel. Keeping this process invisible to the programmer improves productivity -- and probably does wonders for their sanity as well.
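As a rough illustration of the bookkeeping the compiler is said to hide, the sketch below splits a long stream into strips small enough for an on-chip buffer and runs a kernel on each strip in turn. The strip size, the function names, and the simple load/compute/store schedule are all assumptions made for illustration; SPI's actual tools and memory sizes were not described to me in this detail.

```python
from typing import Callable, Iterator, List, Sequence


def strips(stream: Sequence[int], strip_len: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of the stream, each at most strip_len records."""
    if strip_len <= 0:
        raise ValueError("strip_len must be positive")
    for start in range(0, len(stream), strip_len):
        yield stream[start:start + strip_len]


def run_kernel_over_stream(kernel: Callable[[List[int]], List[int]],
                           stream: Sequence[int], strip_len: int) -> List[int]:
    """Process a stream strip by strip, as a hand-written DMA loop might."""
    out: List[int] = []
    for strip in strips(stream, strip_len):
        on_chip = list(strip)      # stands in for a DMA load into on-chip memory
        processed = kernel(on_chip)
        out.extend(processed)      # stands in for a DMA store back to DRAM
    return out


def scale_by_two(records: List[int]) -> List[int]:
    """Hypothetical kernel: double every record in the strip."""
    return [2 * r for r in records]


samples = list(range(10))
doubled = run_kernel_over_stream(scale_by_two, samples, strip_len=4)
```

On a conventional DSP the programmer would write and tune the outer loop by hand; the claim here is that the programmer writes only the equivalent of `scale_by_two` and the tools take care of the rest.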

The first versions of the Stream Processor are available in either eight-lane (the SP8-G80) or 16-lane (the SP16-G160) configurations, which supply 80 and 160 GOPS worth of compute power respectively when clocked at 500 MHz. This translates to roughly 8x to 16x the performance of a high-end TI C6 DSP, which draws only a little less than the 10 to 11 W that the 16-lane SPI device consumes. While they have not released a formal roadmap, SPI has assured me that the inter-lane switch should allow the basic design to scale easily to much larger configurations if there are enough applications that demand more than 160 GOPS worth of compute power.
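The headline numbers fall straight out of the architecture figures quoted above: lanes times ALUs per lane times operations per ALU per cycle times clock rate. The short sketch below does the arithmetic. The function name `peak_gops` is hypothetical, and the assumption that the 80 GMACs/s figure corresponds to two 16-bit MACs per ALU per cycle is mine; it matches the numbers but is not something SPI stated outright.

```python
def peak_gops(lanes: int, alus_per_lane: int = 5, ops_per_alu: int = 4,
              clock_mhz: int = 500) -> float:
    """Peak throughput in giga-operations per second.

    lanes * alus_per_lane * ops_per_alu gives operations per clock cycle;
    multiplying by the clock in MHz and dividing by 1000 converts to GOPS.
    """
    return lanes * alus_per_lane * ops_per_alu * clock_mhz / 1000.0


print(peak_gops(16))                 # SP16-G160, 8-bit ops: 160.0
print(peak_gops(8))                  # SP8-G80, 8-bit ops: 80.0
print(peak_gops(16, ops_per_alu=2))  # assumed two 16-bit MACs per ALU: 80.0
```

These are peak figures, of course; how close real code gets to them depends on keeping all five ALUs in every lane busy on every cycle.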

Until now, one of the only ways to get this sort of power/performance ratio in a programmable solution was to offload repetitive or highly-parallel tasks from the DSP to an FPGA. I'd guess that these hybrid DSP/FPGA designs will remain a great solution for certain classes of problems (especially the enhancement of legacy products), but SPI products represent one of the first credible threats to this approach that I've seen. And if your design requires the faster Virtex/Stratix-class FPGAs, the aggressive pricing of the SPI 16- and 8-lane processors ($99/$59 in 10-k piece lots) will definitely give them a significant cost advantage.

I'm not enough of a DSP guru to say exactly where the SPI devices could enjoy a hands-down advantage over traditional architectures, but I'd venture a guess that they'll really shine at crunching large data arrays, such as video frames, digital radar signals and medical images. SPI says that their target applications include wireless infrastructure and broadcast (HD encoding/transcoding). The sub-$60 price for their 8-lane device even puts them within reach of higher-end consumer media/IPTV/DTV products.

Despite all my enthusiastic mumblings, I'd like to offer a cautionary observation: the same things that make these devices so ripping fast as array or signal processors may be the same things that make them really mediocre at other tasks. By this I mean that I'm not sure which sorts of non-DSP tasks can be handled by the on-chip MIPS32 RISC front-end and which ones must be offloaded to a more powerful general-purpose processor.

While many more conventional DSPs can be coaxed to behave like reasonably good RISC engines (most notably for protocol processing and control-plane tasks), I wonder if trying to do this with a Stream Processor might be like trying to use a Ferrari to plow a corn field. This is not necessarily bad; it just means that you'll need to segment your tasks properly and, if necessary, have a sufficiently powerful host processor within close reach for tasks that outstrip the capacity of the chip's MIPS RISC engine. Of course, I may be wrong about this since SPI hinted that they have already developed IPv4 forwarding, security and encryption software for their engine -- although they did not say what bit rates they can support or how much each layer of protocol encapsulation affects throughput. Unfortunately these questions will probably remain unanswered for some time since SPI has wisely decided to focus on markets such as video processing and baseband processing where their chip should be a shark among minnows.

The SP16-G160 and SP8-G80 are sampling with production slated for Q2 2007 in PBGA.

Lee's Saltshaker Rating


analogZONE
(c) 2007. All rights reserved.