networkZONE Products for the week of April 15, 2002


Internet Machines Says . . .
Internet Machines' NPE10 Programmable Multi-CPU Network Processor Uses Single-Threaded Programming Model, Delivers 20 Gbit/s Performance

Internet Machines disclosed details of its NPE10 Protocol Independent Network Processing Engine architecture. The NPE10 is a single chip solution that delivers 20 Gbps (full-duplex 10 Gbps) data rates. Based on 64 standard RISC cores working in partnership with highly efficient specialized co-processor engines, the NPE10 is fully programmable without sacrificing any performance, and leverages innovative approaches to packet processing tasks. Significant headroom is available under common networking applications to allow for future support of new protocols and services. A single-threaded programming model makes programming the device very straightforward, and sample application code is included for a variety of popular applications, shortening development time.

The NPE10 uses standard glueless look-aside interfaces to commercially available memory subsystems or specialized coprocessors, and standard SPI 4.2 interfaces on ingress and egress. Key highlights of the architecture include:

Frame Processing Engines
The NPE10 consists of a massively parallel array of 64 Frame Processing Engines (FPEs). These are built around 32-bit standard RISC cores, executing at 333 MHz clock speed, and can be programmed in C or assembly language to provide application-specific fast-path data packet processing at wire-speed. Each FPE includes locally dedicated instruction memory and locally dedicated register space for packet header and data manipulations by the fast-path code.

Single-Threaded, Single Processor Model
The NPE10 implements a single-threaded, single-processor programming model with deterministic packet processing performance. Each FPE executes single-threaded code in a non-time-shared manner, with no dependency on events, in order to handle wire-rate packet processing at full-duplex 10 Gbps rates.

Parallelism And Pipelining
To maximize performance, conserve valuable cycles, and ensure deterministic processing under all traffic load conditions, the NPE10 offers a unique approach to parallelism and pipelining to achieve non-blocking access to shared resources. Each FPE performs resource-independent operation and works on one packet at a time. Sixty-four packets may be worked on simultaneously by 64 FPEs in a non-time-shared manner. Consequently, no cycles or instructions are wasted on task or thread switching, and no complex low-level programming model is imposed on designers.

Highly Efficient Co-Processing Engines
Augmenting the FPEs are specialized co-processors. The high-level application pipelining within a packet (classification, policing, modification, etc.) is enabled by a DMA-based co-processor invocation mechanism between FPEs and co-processor engines. This key feature accommodates multiple protocol processing in single-threaded code.

High Availability And Fault Tolerance
The NPE10 supports extensive high availability and fault tolerance features including:

Performance Metrics
Application-specific performance is demonstrable for a variety of popular applications, including IP routing, MPLS LER, and Draft Martini for Metro Ethernet. All have been shown to be sustainable at 10 Gbps full-duplex rates (20 Gbps aggregate traffic) down to the minimum packet size expected for those applications. Significant headroom is available across all applications for additional application support, with less than 60% of the FPEs in use for IP routing, and 60-70% in use for MPLS LER and Draft Martini.

In addition, the NPE10 has a full-featured software development kit, the Development Workbench, which comprises a clock cycle-accurate simulator of the NPE10, along with a complete suite of development, debugging, and performance analysis tools. The Development Workbench has been available since April 2001, and version 2.1 is currently shipping.

"The NPE10's combination of 64 processors with specialized co-processors make it a powerful, flexible network processor," said Linley Gwennap, principal analyst of The Linley Group, an analyst firm focused on network processors. "As more system vendors adopt network processors, the NPE10 will provide a strong solution for multi-protocol, multi-service systems."

Chris Hoogenboom, Internet Machines' president, CEO, and founder, said, "This architecture is proof that a fully programmable network processor can deliver wire-speed performance and functionality previously only found in ASICs." He continued, "But unlike ASICs, the NPE10 provides headroom for feature expansion, and flexibility for feature change."

Manufactured using a 0.13 micron process, the NPE10 is being fabricated by Taiwan Semiconductor Manufacturing Company (TSMC), and will be available for sampling in a 1716-pin organic flip chip BGA package during Q2 of 2002.

analogZONE Says . . .

Divide and Conquer - Internet Machines' 20 Gbit/s Packet Processor Employs Novel And Efficient Multi-CPU Architecture

It almost sounds too good to be true. In a move akin to figuring out a way to let nine women produce a single baby in a month, Internet Machines (IM) appears to have found a way to ride herd on 64 (count 'em, 64) ripping-fast, programmable packet processors and allow them to be programmed without many of the headaches associated with multi-processor designs.

The NPE-10 is the first element of IM's three-chip OC-192 architecture, which I reviewed last September. The rest of the family consists of a traffic manager chip that connects to the 64-port switch element (multiple 2.5 Gbit/s links via a SERDES backplane). A software development kit that allows cycle-accurate development of fast-path code has been available for a year.

They claim tape-out began last month, and that prototype chips are expected in late 2Q '02.

It's nice to see IP from two of my favorite IP providers appearing in this chip. The first example is the set of SPI-4.2, DDR-SDRAM, QDR-SRAM, and PCI-X interfaces supplied by TriCN, a really good source of fast, reliable, and easy-to-integrate analog cores. I'm so impressed by their commitment to solving really nasty analog I/O problems that their SPI-4.2 interface is the Product of The Week in my inaugural edition of the i/oZONE section of this web site.

I'm also pleased to see the ARC RISC core turning up so abundantly in such a powerful piece of silicon. Internet Machines has used the ARC's customizable features to create a compact 333-MHz core, known as a frame-processing engine (FPE), that has been optimized for parsing and manipulating packets. Their NPE-10 has managed to shoehorn 64 of these FPEs into an intelligent array, along with several specialized processing cores (see Fig. 1).

They claim that all processors have nearly-non-blocking access to resources on the chip. From conversations with IM, and the block diagram below, I'm inferring that they use some clever DMA arbitration to allow efficient sharing of resources in a deterministic manner.

Traffic enters the chip's frame memory via an SPI-4 bus and is classified by the flow policy co-processor, and its headers are passed to one of the 64 FPEs via two Splice, Reorder, and Fragmentation (SRF) engines. Internal sequence control keeps the data stream in order. Each FPE has local instruction space, data storage, and auxiliary register space that permit it to work on its own most of the time (see Fig. 2). An arbitrated DMA pipeline bus moves traffic to and from one of the specialized engines, which include two database tables for flow information and a policer.
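The in-order guarantee can be illustrated with a small sketch. Nothing in the material above says how the NPE-10's sequence control is actually implemented, so the Python below is only an assumed model: each packet gets a sequence number on arrival, the processing engines may finish in any order, and a hypothetical `ReorderBuffer` (the class and function names are invented for illustration) releases results strictly in arrival order.

```python
class ReorderBuffer:
    """Hypothetical model: releases results in sequence-number order."""

    def __init__(self):
        self.next_seq = 0   # next sequence number allowed to leave
        self.pending = {}   # seq -> finished packet waiting for its turn

    def complete(self, seq, packet):
        """Record a finished packet; return every packet now releasable."""
        self.pending[seq] = packet
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


def process_out_of_order(packets, completion_order):
    """Simulate engines finishing packets in an arbitrary order.

    `packets` is indexed by sequence number; `completion_order` lists
    the sequence numbers in the order the engines happen to finish.
    """
    rob = ReorderBuffer()
    egress = []
    for seq in completion_order:
        egress.extend(rob.complete(seq, packets[seq]))
    return egress
```

For instance, `process_out_of_order(["a", "b", "c", "d"], [2, 0, 3, 1])` holds packet 2 back until packets 0 and 1 have finished, so the egress order matches the arrival order.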

This arrangement offers several advantages, not the least of which is that each FPE is blissfully oblivious to the existence of any of its 63 other neighbors. The other most obvious advantage is that the code for the FPEs can be written as if for a single machine (which in truth it is), thus avoiding much of the mental gymnastics and wasted resources involved with managing traditional multi-processor architectures. This allows processing resources to be allocated at will to ingress and egress functions, or dedicated to a single flow and a single direction. They can even be set up to run different protocols - all without the usual losses from context switching.

Another benefit of totally independent processors is that their code can be modified on a per-CPU basis. This means that you can make "hitless" upgrades, where software on one processor can be upgraded on the fly while the others continue running. Although this makes me a tad nervous to think about, I imagine that this will be great for high-availability, telco-grade apps.
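A toy model makes the idea of a per-CPU upgrade concrete. This is a sketch under assumptions, not IM's actual procedure: it supposes that exactly one engine is taken out of service at a time while its code is replaced, and the function name `rolling_upgrade` and the version labels are hypothetical.

```python
def rolling_upgrade(versions, new_version):
    """Upgrade one engine at a time; yield a snapshot after each step.

    `versions` holds one code-version label per engine. Each yielded
    tuple is (engine index, engines still in service, versions so far).
    """
    versions = list(versions)  # work on a copy, leave the input untouched
    for i in range(len(versions)):
        # Assumption: engine i is idle while it is reloaded;
        # every other engine keeps forwarding traffic.
        in_service = len(versions) - 1
        versions[i] = new_version
        yield i, in_service, list(versions)


steps = list(rolling_upgrade(["v1"] * 64, "v2"))
print(min(in_service for _, in_service, _ in steps))  # engines never below this
```

In this model the array never drops below 63 of its 64 engines, which is the sense in which such an upgrade could be "hitless" as long as there is spare processing capacity.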

Speaking of which, high availability is built into the Internet Machines architecture. For example, all data paths, major data busses, and memory interfaces contain error correction and parity detection capabilities.

At least on paper, the result is a machine that performs all fast-path tasks under software control, with hardware assist. If the NPE-10 actually works as advertised, its independent processors should allow the chip to modify packet headers at wire rate while supporting multiple protocols, including IP, MPLS, ATM, TDM and more. Estimated capacity for a worst-case traffic scenario is 50 million 40-byte packets (for packet-over-SONET apps).
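As a rough back-of-the-envelope check, the figures quoted in this article can be turned into a per-packet cycle budget. The sketch below assumes that "50 million" means packets per second and that all 64 FPEs at 333 MHz are available; it ignores co-processor time, memory stalls, and framing overhead, so treat the result as an upper bound rather than a measured number. The function names are illustrative only.

```python
NUM_FPES = 64            # from the article
FPE_CLOCK_HZ = 333e6     # from the article
PACKET_RATE_PPS = 50e6   # assumption: "50 million" is a per-second rate


def cycles_per_packet(num_fpes, clock_hz, packets_per_second):
    """Aggregate FPE clock cycles available for each packet."""
    return num_fpes * clock_hz / packets_per_second


def raw_packet_rate(line_rate_bps, packet_bytes):
    """Packets per second at a line rate, ignoring framing overhead."""
    return line_rate_bps / (packet_bytes * 8)


budget = cycles_per_packet(NUM_FPES, FPE_CLOCK_HZ, PACKET_RATE_PPS)
print(f"Cycle budget per packet: {budget:.0f}")
print(f"Raw 40-byte rate at 10 Gbit/s: {raw_packet_rate(10e9, 40) / 1e6:.2f} Mpps")
```

Under these assumptions the array has roughly 426 FPE cycles to spend on each packet, and a single 10 Gbit/s direction carries 31.25 million bare 40-byte packets per second.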

This review is already too long, so I won't bore you further by re-listing the claims being made for what the chip will be able to do. Just for the record, some of the more interesting things the NPE-10 is supposed to do at wire-speed are:

- as well as extremely flexible header editing and modification (including packet fragmentation.)

One thing that gives me a warmer feeling than usual about the claims made for this very complex chip is the extent to which the design was verified using cycle-accurate hardware simulation. Besides proving the design works, the tests seemed to indicate that the NPE-10 has a bunch of extra capacity. Using the IKOS FPGA-based hardware simulation, IM engineers found that the processor is only taxed at below 60% while performing standard IP routing, and 60-70% for MPLS LER and draft Martini tasks. IM was confident enough in their verification efforts that they didn't feel the need for prototype chips.

I did have some minor concerns about Internet Machines' decision to integrate classification and network processing on the same chip, since overly-complex "does it all" chips often involve compromises in performance or flexibility. The manufacturer, however, claims just the opposite. They explain that what they mean by "one-chip" is a single chip that processes traffic in both directions. The argument they make is that most applications tax the ingress processing functions more heavily than egress (especially for carrier boxes) and that two-chip solutions cannot allow for efficient partitioning of the workload. Running both flows on the same chip allows you to partition memory and resources asymmetrically, and put the resources in the direction needed.

IM also counters that they are not really a "one-chip solution" since the chip's basic classification capabilities can easily be augmented by a TCAM or other classification engine via a high-speed look-aside interface. Given the excellent engineering that these folks have displayed in their design, I guess I'll take their word for the moment.

Despite its complexity, I think that the NPE-10 will sample within a month or so of the promised 2Q '02 date. This is in good part due to the ARC core's reputation for trouble-free designs, and in part because of the impressive design and management team Internet Machines has assembled. I still wonder if the chip will deliver on all the impressive promises it makes, simply because of the ambitious nature of the design. I think that the basic architecture will eventually deliver the goods; it might just take a spin or two to do so. For this reason, the Vapor Index Rating remains at 2.5 saltshakers. I hope to be wrong about this one, because I really want it to work.

Finally, just for comparison's sake, there is one competitor (Silicon Access) with a similar, but not identical, architecture. For details on the 32-CPU, five-chip set, see my past review.

Lee's Saltshaker Rating

 







analogZONE
(c) 2002. All rights reserved.