networkZONE Products for the week of July 1, 2002


Xelerated Says . . .
A Disciplined Approach - Deterministic pipelined architecture lets Xelerated's NP run 10 Gbit/s IPv6 packets at wire-speed

Xelerated recently demonstrated IPv6 at 10 Gbps wire-speed on an evaluation board. The company's fully programmable network processors (X10s, X10d and X10q) have been designed to handle IPv6 along with any other network protocol combination, targeting vendors of metro, edge and core networking equipment. The network processors (NPUs) can easily be programmed to implement full IPv6 functionality, including multicasting, forwarding, multi-field classification for traffic conditioning, and extension-header parsing - all at wire-speed.

Xelerated demonstrated IPv6, IPv4 and MPLS applications running at 10 Gbps wire-speed over SPI-4.2 ports on an FPGA. With a system clock of only 50 MHz, the FPGA contains the essential functionality of the soon-to-be-released Xelerator NPUs. The demo both verifies interoperability with surrounding circuitry and validates Xelerated's programmable PISC (Packet Instruction Set Computer) architecture.

"It's prime time for large-scale IPv6 deployment, taking IPv6 from pre-deployment to carrier-grade production. New generation technology pioneers like Xelerated, with its high-class programmable network processors, is well positioned to drive this evolution," said Latif Ladid, President IPv6 Forum.

In service-provider and enterprise networks, deployment of IPv6 has already begun. The Gartner Group predicts that by 2006, 50% of all carriers in the Asia/Pacific region will be running IPv6 in their networks.

"We have seen a growing demand for IPv6 solutions, driven by Asian customers in particular. Our ability to do full IPv6, IPv4 and MPLS processing simultaneously at wire-speed, positions us uniquely on the market," said Thomas Eklund, Director of Business Development and founder of Xelerated.

The Xelerator X10 NPU family meets all the requirements imposed by IPv6, thanks to its sequential processing model and its capacity to perform very wide look-ups. This is done without compromising the wire-speed performance inherent in the NPU. New formats for extension headers are easily accommodated since the Xelerator X10 NPUs do not employ hard-coded packet formats. Example IPv6 application code together with IPv4 and MPLS code are included in Xelerated's development tools kit, which has been shipping to customers since January.

analogZONE Says . . .

Editor's Note: I was a tad skeptical of Xelerated's ambitious claims when they announced their architecture last year, and declined to give them much coverage. Now that I've taken a closer look at their architecture, and learned that they have a full-speed FPGA-based demonstrator working at the NPC East conference this week, I regret my decision. I'm taking the occasion of their proof-of-concept demonstration to make amends and shed a little light on this unique chip, which was announced earlier this month.

Calvinism has never been my first choice for a personal theology, but in the wild world of network processors, its deterministic principles can be very desirable. One of the tough truths we're learning as the network processor market matures is that programmability and deterministic behavior are often mutually exclusive, forcing us to make trade-offs between the flexibility of RISC-based engines and the certainty of fixed-function state machine logic. Xelerated joins a handful of packet processor designs that attempt to get around this paradox by using configurable processing arrays that deliver the predictable performance of a state machine, but can be re-programmed as the need arises.

As we shall see, Xelerated's chips employ a multi-stage pipeline processor architecture that can be programmed to perform multiple classification and packet modification operations. While it is programmable, it is not as flexible as traditional solutions and is not at its best when trying to do recursive operations of indeterminate length. For this reason (among others) Xelerated has wisely narrowed its target market to the metro WAN and layer 2-4 processing. It may be a slight case of "sour grapes", but I feel that Xelerated's argument that a single chip can span all layers efficiently has some merit (I know EZchip will have something to say about this). They say that most manufacturers are finding that you need a co-processor to handle higher-layer processing at OC-48 and above, or risk uncontrolled packet discard when they get an unexpectedly high number of packets that need extra attention.

Xelerated's unique architecture gets around the painful programmable performance paradox with a programmable pipelined state machine. The current processing engine used in the chip's core can perform over 200 discrete packet manipulations and classifications in a long multi-stage pipeline. Each packet is cascaded from stage-to-stage along with a header tag that triggers a specific operation to occur. The pipeline's processing elements can access on-chip (TCAM, meter, counter, hashing engine) and off-chip resources (TCAMs, SRAM and co-processors) via a fast cross-connect multiplexer.
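To make the tag-driven cascade a little more concrete, here is a minimal Python sketch of the general idea. It is not Xelerated's instruction set or tooling; the `Packet` class, the two example operations and the two-stage pipeline are all hypothetical. The point it illustrates is that every packet visits every stage, so the number of steps is the same whatever the packet contains.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Packet:
    tag: str  # header tag that travels with the packet (hypothetical encoding)
    fields: Dict[str, int] = field(default_factory=dict)


# A stage maps a tag to the operation it triggers; other tags pass through untouched.
Stage = Dict[str, Callable[[Packet], None]]


def pop_label(pkt: Packet) -> None:
    """Strip the MPLS label and re-tag the packet so later stages treat it as IPv4."""
    pkt.fields.pop("label", None)
    pkt.tag = "ipv4"


def decrement_ttl(pkt: Packet) -> None:
    pkt.fields["ttl"] -= 1


def run_pipeline(stages: List[Stage], pkt: Packet) -> int:
    """Pass the packet through every stage in order and return the stages visited."""
    visited = 0
    for stage in stages:
        op = stage.get(pkt.tag)
        if op is not None:
            op(pkt)
        visited += 1  # one step per stage, whether or not an operation fired
    return visited


stages: List[Stage] = [{"mpls": pop_label}, {"ipv4": decrement_ttl}]
pkt = Packet(tag="mpls", fields={"label": 17, "ttl": 64})
cycles = run_pipeline(stages, pkt)
print(cycles, pkt.tag, pkt.fields)
```

An MPLS packet and a plain IPv4 packet both take two steps through this toy pipeline; only the operations that fire differ.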

The X10's multi-stage architecture allows for processing of multiple encapsulations without deep search trees and other complex classification methods. It handles wide address fields and extension headers easily. MPLS? No problem. The same goes for IPv6. Its over-length packets are handled by cutting them into two parts and passing them to two adjacent stages simultaneously. This cuts the number of operations per packet in half, but maintains a uniform processing time for each packet.

Using their chip's current clock speed of 200 MHz, the Xelerated processor architecture scales from 1 Gbit/s through 16X OC-48. Combined with its multiple-port interface, the X10 can be used to efficiently load a switch fabric by aggregating multiple streams and filling its ports to maximum capacity.

Xelerated will offer its processor in three flavors. The S model is a single-port unit that supports a single (simplex) 10-Gigabit connection; the D is a dual-port 10-Gigabit simplex (or single duplex) chip. At the top of the heap is the Q, a quad-port device that can support either 40 Gbit/s of simplex traffic or 20 Gbit/s of full-duplex. Interestingly, they all use the same pipeline core with different numbers of SPI-4.2 interfaces. One interesting application for the Q model is to have the traffic make multiple passes through the chip for additional processing steps, and perhaps even thread the stream through a specialized in-line processor.

Of course, Calvinist doctrine (and the laws of nature) says that you can't get this kind of performance without somehow paying the piper. The one potentially serious bottleneck I can see in this chip is contention when several pipeline processing elements try to access the same TCAM or other outboard logic unit at the same time. While the chip actually permits several (the actual number is classified) processors to do this, it's pretty obvious that you'd get an ugly traffic jam if all 200 of them tried it at once.

Xelerated gets around this by using its development software to arbitrate resource conflicts ahead of time while the code is being written. The software does not allow violations to occur, and instead makes trade-offs in the number of operations permitted based on traffic rate and number of connections to the chip. This preserves the determinism that is essential in QoS-sensitive tasks, and allows designers to optimize trade-offs per their particular application.
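The kind of ahead-of-time check described above can be sketched in a few lines of Python. This is only an illustration of static resource budgeting, not Xelerated's development software: the function names, the resource throughput figure and the packet rates are all made-up assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (stage index, resource name) for each lookup a fast-path program performs per packet.
Access = Tuple[int, str]


def accesses_per_packet(resource_ops_per_sec: int, packets_per_sec: int) -> int:
    """How many times each packet may touch a resource at a given packet rate."""
    return resource_ops_per_sec // packets_per_sec


def check_budget(accesses: List[Access], budget: Dict[str, int]) -> List[str]:
    """Return one message per resource that the program uses more often than allowed."""
    used = Counter(resource for _, resource in accesses)
    return [
        f"{res}: {count} accesses per packet, budget {budget.get(res, 0)}"
        for res, count in sorted(used.items())
        if count > budget.get(res, 0)
    ]


# Hypothetical program: five TCAM lookups spread over the pipeline.
program: List[Access] = [(3, "tcam"), (10, "tcam"), (42, "tcam"), (90, "tcam"), (150, "tcam")]

TCAM_OPS_PER_SEC = 100_000_000  # assumed figure, for illustration only

fast = {"tcam": accesses_per_packet(TCAM_OPS_PER_SEC, 25_000_000)}  # 4 per packet
slow = {"tcam": accesses_per_packet(TCAM_OPS_PER_SEC, 12_500_000)}  # 8 per packet

print(check_budget(program, fast))  # rejected at the higher packet rate
print(check_budget(program, slow))  # accepted at the lower packet rate
```

The trade-off is visible in the two budgets: halve the packet rate and the same resource can serve twice as many operations per packet, so a program rejected at one rate may be accepted at another.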

The pipeline processors carry another hidden bonus for software development: the processor has no need to run an operating system. This means much less code has to be written and debugged, and the pre-developed code is much more stable.

The other price for determinism is latency. The 200+ stages of the NPU each consume one clock cycle, creating a constant 2 µs of delay between input and output. Actually, this is not bad compared to conventional processors, especially when they are heavily loaded with multiply-encapsulated packets. You also have the luxury of getting to deal with a fixed time delay. Xelerated's traffic manager (see below) adds much less fixed delay because of its shorter pipeline, but then adds the normal buffering-related latencies.
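The arithmetic behind a fixed pipeline delay is simple enough to sketch. The Python below assumes one clock per stage and the 200 MHz chip clock quoted earlier; the stage counts are illustrative only, since "200+" does not pin down the exact depth. Under this simple model, a 2 µs delay corresponds to roughly 400 stage-cycles.

```python
def pipeline_latency_us(stage_cycles: int, clock_mhz: float) -> float:
    """Fixed latency in microseconds: a clock of N MHz delivers N cycles per microsecond."""
    return stage_cycles / clock_mhz


CLOCK_MHZ = 200  # chip clock quoted earlier in the article

# Illustrative cycle counts; the real pipeline depth is an assumption here.
for cycles in (200, 400):
    print(cycles, "cycles ->", pipeline_latency_us(cycles, CLOCK_MHZ), "us")
```

Whatever the real depth, the delay is the same for every packet, which is the property that matters for QoS.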

Like most NPUs, the X10 sports a separate traffic manager. Unlike its peers, it re-uses much of the pipeline architecture from the NPU. It also adds a SAR to handle segmentation for the switch fabric, plus a multicast copier. Its flexible header processing capability can generate tagging and control flags for any NP or switch fabric that can hook up to a SPI-4.2-compliant port. This means that it will nicely handle the new NPF streaming interface spec, which will employ a CSIX logical interface and a SPI-4.2 electrical spec.

The chip's queue manager maintains a constant delay for priority streams, even on 40-byte packets at wire speed. It is configurable and can run different algorithms and different weights in each flow. The manager can schedule on three different levels: flow, aggregated flow, or port. This helps retain DiffServ functionality.
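Why 40-byte packets are the stress case is easy to see with a little arithmetic. The Python below is a back-of-the-envelope sketch that ignores framing overhead (preambles, inter-packet gaps and the like), so the real figures on any given link will differ somewhat.

```python
def packets_per_second(line_rate_bps: int, packet_bytes: int) -> float:
    """Packet rate when the link is filled with back-to-back packets of one size."""
    return line_rate_bps / (packet_bytes * 8)


def packet_time_ns(line_rate_bps: int, packet_bytes: int) -> float:
    """Time one packet occupies the link, in nanoseconds."""
    return packet_bytes * 8 * 1e9 / line_rate_bps


TEN_GBPS = 10_000_000_000

for size in (40, 1500):
    print(size, "bytes:",
          packets_per_second(TEN_GBPS, size), "packets/s,",
          packet_time_ns(TEN_GBPS, size), "ns each")
```

At 10 Gbit/s a stream of 40-byte packets works out to 31.25 million packets per second, or one packet every 32 ns, which is the per-packet budget a scheduler has to meet to keep its delay constant.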

Of course, the chips come with a set of development tools. In fact, they have been in customers' hands for a good fraction of a year. Their rather expensive but effective FPGA-based test bed allows development now, with validation of interfaces as well as fast path code. It also verifies the relationship between fast path code and the control plane code's ability to load and program FP code. This is especially important since the control plane can also update CAM parameters. Xelerated's extensive software development tools can run on either the simulator or the chip, something that keeps code development "real," with no transition difficulties.

While I cannot vouch as to whether their architecture is as significant as they say it is, I will give Xelerated a vote of confidence on its ability to deliver the chip. This is in good part because they have a full-speed mockup running in a Xilinx-based FPGA test bed. I'll also take their word that the fixed-pipeline architecture requires fewer transistors and results in a somewhat smaller chip than many of the bulky, design tool-busting, 64-CPU behemoths that share Xelerated's performance class.

My only real concern about hitting the promised sample date (Xelerated expects to sample their X10 network processor in Q3 '02, and their T10 traffic manager during Q1 '03) is the extremely low body count for such a massive undertaking. I'd expect at least half again the 55 employees (with around 38 people actually doing the engineering and design work) for a project this demanding. But then again, I have not been directly involved with an ASIC design in over 20 years, so I may be missing something here. I've always said that a small group of talented, motivated individuals can outflank a much larger standing army of drones, and perhaps the powerful pedigrees of the design staff (Cisco, Synopsys, SwitchCore, and Zettacom, to name a few) can move their design into tape-out and fab as quickly as they say they can.

Data Sheet

Lee's Saltshaker Rating

 Delivery of working silicon

On-time sampling






analogZONE
(c) 2002. All rights reserved.