networkZONE Products for the week of April 15, 2002
The NPE10 uses standard glueless look-aside interfaces to commercially available memory subsystems or specialized coprocessors, and standard SPI-4.2 interfaces on ingress and egress. Key highlights of the architecture include:
Frame Processing Engines
The NPE10 consists of a massively parallel array of 64 Frame Processing
Engines (FPEs). These are built around 32-bit standard RISC cores, executing
at 333 MHz clock speed, and can be programmed in C or assembly language
to provide application-specific fast-path data packet processing at wire-speed.
Each FPE includes locally dedicated instruction memory and locally dedicated
register space for packet header and data manipulations by the fast-path
code.
Single-Threaded, Single Processor Model
The NPE10 implements a single-threaded single-processor programming model
with deterministic packet processing performance. Each FPE executes
single-threaded code in a non-time-shared manner, with no dependency on
events, in order to handle wire-rate packet processing at full-duplex 10 Gbps rates.
Parallelism And Pipelining
To maximize performance, conserve valuable cycles and ensure deterministic
processing under all traffic load conditions, the NPE10 offers a unique
approach to parallelism and pipelining to achieve non-blocking access to
shared resources. Each FPE performs resource-independent operations and works
on one packet at a time. Sixty-four packets may be worked on simultaneously
by 64 FPEs in a non-time-shared manner. Consequently, no cycles or instructions
are wasted on task or thread switching, and no complex low-level programming
model is imposed on designers.
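To illustrate what single-threaded, one-packet-at-a-time fast-path code can look like, here is a minimal sketch in C. It is not taken from the NPE10 Development Workbench: the `verdict` enum and the `handle_packet` and `ipv4_checksum` names are hypothetical, and the IPv4 TTL-decrement task is simply an assumed example of per-packet header manipulation. The point is that the handler runs to completion on one header with no locks, yields, or thread state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts; not part of any NPE10 SDK. */
enum verdict { VERDICT_FORWARD, VERDICT_DROP };

/* Standard IPv4 header checksum: one's-complement sum of 16-bit
 * big-endian words, folded to 16 bits and inverted. */
static uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Run-to-completion handler: one call processes one packet header
 * from start to finish, with no shared state and no context switch. */
static enum verdict handle_packet(uint8_t *hdr, size_t len)
{
    assert(hdr != NULL);

    /* Sanity-check version and header length. */
    if (len < 20 || (hdr[0] >> 4) != 4)
        return VERDICT_DROP;
    size_t ihl = (size_t)(hdr[0] & 0x0F) * 4;
    if (ihl < 20 || ihl > len)
        return VERDICT_DROP;

    /* Drop packets whose TTL would expire at this hop. */
    if (hdr[8] <= 1)
        return VERDICT_DROP;
    hdr[8]--;

    /* Recompute the header checksum over the modified header. */
    hdr[10] = 0;
    hdr[11] = 0;
    uint16_t csum = ipv4_checksum(hdr, ihl);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xFF);
    return VERDICT_FORWARD;
}
```

In a design like the one described, each of the 64 engines would run its own copy of such a handler on its own packet, which is why the code can be written as if for a single processor.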
Highly Efficient Co-Processing Engines
Augmenting the FPEs are specialized co-processors. The high-level application
pipelining within a packet (classification, policing, modification, etc.)
is enabled by a DMA-based co-processor invocation mechanism between FPEs
and co-processor engines. This key feature accommodates multiple protocol
processing in single-threaded code.
High Availability And Fault Tolerance
The NPE10 supports extensive high availability and fault tolerance features
including:
Performance Metrics
Application-specific performance is demonstrable for a variety of popular
applications, including IP routing, MPLS LER, and Draft Martini for Metro
Ethernet. All have been shown to be sustainable at 10 Gbps full-duplex rates
(20 Gbps aggregate traffic) down to the minimum packet size expected for
those applications. Significant headroom is available across all applications
for additional application support, with less than 60% of the FPEs in use
for IP routing, and 60-70% in use for MPLS LER and Draft Martini.
In addition, the NPE10 has a full-featured software development kit, the Development Workbench, which comprises a clock cycle-accurate simulator of the NPE10, along with a complete suite of development, debugging, and performance analysis tools. The Development Workbench has been available since April 2001, and version 2.1 is currently shipping.
"The NPE10's combination of 64 processors with specialized co-processors make it a powerful, flexible network processor," said Linley Gwennap, principal analyst of The Linley Group, an analyst firm focused on network processors. "As more system vendors adopt network processors, the NPE10 will provide a strong solution for multi-protocol, multi-service systems."
Chris Hoogenboom, Internet Machines' president, CEO, and founder, said, "This architecture is proof that a fully programmable network processor can deliver wire-speed performance and functionality previously only found in ASICs." He continued, "But unlike ASICs, the NPE10 provides headroom for feature expansion, and flexibility for feature change."
Manufactured using a 0.13 micron process, the NPE10 is being fabricated
by Taiwan Semiconductor Manufacturing Company (TSMC), and will be available
for sampling in a 1716-pin organic flip chip BGA package during Q2 of 2002.
analogZONE Says . . .
Divide and Conquer - Internet Machines' 20 Gbit/s Packet Processor Employs Novel And Efficient Multi-CPU Architecture
It almost sounds too good to be true. In a move akin to figuring a way to let nine women produce a single baby in a month, Internet Machines (IM) appears to have found a way to ride herd on 64 (count'em, 64) ripping-fast, programmable packet processors and allow them to be programmed without many of the headaches associated with multi-processor designs.
The NPE-10 is the first element of IM's three-chip OC-192 architecture, which I reviewed last September. The rest of the family consists of a traffic manager chip and the 64-port switch element it connects to (over multiple 2.5 Gbit/s links via a SERDES backplane). A software development kit that allows cycle-accurate development of fast-path code has been available for a year.
They claim tape-out began last month, and that prototype chips are expected in late 2Q '02.
It's nice to see IP from two of my favorite IP providers appearing in this chip. The first example is the set of SPI-4.2, DDR-SDRAM, QDR-SRAM, and PCI-X interfaces supplied by TriCN, a really good source of fast, reliable, and easy-to-integrate analog cores. I'm so impressed by their commitment to solving really nasty analog I/O problems that their SPI-4.2 interface is the Product of The Week in my inaugural edition of the i/oZONE section of this web site.
I'm also pleased to see the ARC RISC core turning up so abundantly in such a powerful piece of silicon. Internet Machines has used the ARC's customizable features to create a compact 333-MHz core, known as a frame-processing engine (FPE), that has been optimized for parsing and manipulating packets. Their NPE-10 has managed to shoehorn 64 of these FPEs into an intelligent array, along with several specialized processing cores (see Fig. 1).
They claim that all processors have nearly-non-blocking access to resources on the chip. From conversations with IM, and the block diagram below, I'm inferring that they use some clever DMA arbitration to allow efficient sharing of resources in a deterministic manner.
Traffic enters the chip's frame memory via an SPI-4 bus and is classified by the flow policy co-processor, and its headers are passed to one of the 64 FPEs via two Splice, Reorder, and Fragmentation (SRF) engines. Internal sequence control keeps the data stream in order. Each FPE has local instruction space, data storage, and auxiliary register space that permit it to work on its own most of the time (see Fig. 2). An arbitrated DMA pipeline bus moves traffic to and from one of the specialized engines, including two database tables for flow information and a policer.
This arrangement offers several advantages, not the least of which is that each FPE is blissfully oblivious to the existence of any of its 63 other neighbors. The other most obvious advantage is that the code for the FPEs can be written as if for a single machine (which in truth it is), thus avoiding much of the mental gymnastics and wasted resources involved with managing traditional multi-processor architectures. This allows processing resources to be allocated at will to ingress and egress functions, or dedicated to a single flow and a single direction. They can even be set up to run different protocols - all without the usual losses from context switching.
Another benefit of totally independent processors is that their code can be modified on a per-CPU basis. This means that you can make "hitless" upgrades, where software can be upgraded on-the-fly as the others continue running. Although this makes me a tad nervous to think about, I imagine that this will be great for high-availability, telco-grade apps.
Speaking of which, high availability is built into the Internet Machines architecture. For example, all major data paths, data busses, and memory interfaces contain error correction and parity detection capabilities.
At least on paper, the result is a machine that performs all fast-path tasks under software control, with hardware assist. If the NPE-10 actually works as advertised, its independent processors should allow the chip to modify packet headers at wire rate while supporting multiple protocols, including IP, MPLS, ATM, TDM and more. Estimated capacity for a worst-case traffic scenario is 50 million 40-byte packets per second (for packet-over-SONET apps).
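As a rough sanity check on that capacity figure, the aggregate cycle budget per packet can be estimated from numbers quoted in this review: 64 FPEs at 333 MHz and 50 million packets. The sketch below is back-of-envelope arithmetic only. The function names are hypothetical, reading the 50 million figure as a per-second rate is an assumption, and the estimate ignores co-processor time and any stalls on shared resources.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the article. */
#define NUM_FPES        64ULL
#define FPE_CLOCK_HZ    333000000ULL
/* Assumption: the 50 million figure is a packets-per-second rate. */
#define PACKETS_PER_SEC 50000000ULL

/* Aggregate FPE clock cycles available per packet, rounded down. */
static uint64_t cycles_per_packet(uint64_t fpes, uint64_t clock_hz,
                                  uint64_t pps)
{
    assert(pps > 0);
    return (fpes * clock_hz) / pps;
}

/* Wall-clock time, in nanoseconds, that a single FPE may spend on one
 * packet if packets are spread evenly across all FPEs. */
static uint64_t ns_per_packet_per_fpe(uint64_t fpes, uint64_t pps)
{
    assert(pps > 0);
    return (fpes * 1000000000ULL) / pps;
}
```

With the quoted figures, `cycles_per_packet(NUM_FPES, FPE_CLOCK_HZ, PACKETS_PER_SEC)` gives 426 cycles per packet, and `ns_per_packet_per_fpe(NUM_FPES, PACKETS_PER_SEC)` gives 1280 ns. Under these assumptions, each engine has a few hundred instructions' worth of time per worst-case packet before co-processor work is counted.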
This review is already too long, so I won't bore you further by re-listing the claims being made for what the chip will be able to do. Just for the record, some of the more interesting things the NPE-10 is supposed to do at wire-speed are:
- as well as extremely flexible header editing and modification (including packet fragmentation.)
One thing that gives me a warmer feeling than usual
about the claims made for this very complex chip is the extent to which
the design was verified using cycle-accurate hardware simulation. Besides
proving the design works, the tests seemed to indicate that the NPE-10 has
a bunch of extra capacity. Using the IKOS FPGA-based hardware simulation,
IM engineers found that the processor is taxed at less than 60% while performing
standard IP routing, and at 60-70% for MPLS LER and Draft Martini tasks. IM
was confident enough in their verification efforts that they didn't feel
the need for prototype chips.
I did have some minor concerns about Internet Machines' decision to integrate
classification and network processing on the same chip, since overly-complex
"does it all" chips often involve compromises in performance or
flexibility. The manufacturer, however, claims just the opposite. They explain
that what they mean by "one-chip" is a single chip that processes
traffic in both directions. The argument they make is that most applications
tax the ingress processing functions more heavily than egress (especially
for carrier boxes) and that two-chip solutions cannot allow for efficient
partitioning of the workload. Running both flows on the same chip allows
you to partition memory and resources asymmetrically, and put the resources
in the direction needed.
IM also counters that they are not really a "one-chip solution" since the chip's basic classification capabilities can easily be augmented by a TCAM or other classification engine via a high-speed look-aside interface. Given the excellent engineering that these folks have displayed in their design, I guess I'll take their word for the moment.
Despite its complexity, I think that the NPE-10 will sample within a month or so of the promised 2Q '02 date. This is in good part due to the ARC core's reputation for trouble-free designs, and in part because of the impressive design and management team they have assembled. I still wonder if the chip will deliver on all the impressive promises it makes, simply because of the ambitious nature of the design. I think that the basic architecture will eventually deliver the goods, but it might take a spin or two to do so. For this reason, the Vapor Index Rating remains at 2.5 saltshakers. I hope to be wrong about this one, because I really want it to work.
Finally, just for comparison's sake, there is one competitor (Silicon Access) with a similar, but not identical, architecture. For details on the 32-CPU, five-chip set, see my past review.