networkZONE Products for the week of April 22, 2002
analogZONE Says . . .
Architecture Shoot-Out - Bay's 10-Gbit Power Pipeline Promises to Punish Poly-Processor Packet Pushers by Delivering Deterministic Data
After staying up till 2 AM several nights last week to get the i/oZONE ready for press, I realized I could not write a fourth product review. But when I got hold of this story about Bay Microsystems' new chip, which incorporates a network processor and a traffic manager, I was sorely tempted. Bay's novel and potentially powerful architecture provides an excellent counterpoint to the arguments made for Internet Machines' 64-CPU RISC architecture, which I reviewed last week. Having these two well-conceived network processors released so close together gives us a great opportunity to compare the chips and the radically different architectural philosophies behind them.
The first way Bay diverges from Internet Machines (IM), and for that matter, much of the NP industry, is that it chooses not to rely on the vagaries of an array of programmable RISC engines for its critical, time-bounded packet-processing functions. Bay says it chose its deterministic pipeline architecture because it was the only way to guarantee a sustained line rate at 16 Gbit/s for all traffic patterns and conditions.
Bay also says that its pipeline architecture uses far fewer gates than a RISC array of similar processing power. They assert that, besides taking up less real estate, the pipeline architecture eliminates the need for complex arbitration logic to juggle tasks and keep data aligned between CPUs. They say that the savings leave them enough room on the chip to implement a full-blown traffic manager.
Bay has defined the classes of network processing tasks it performs as Classification, Transformation, and Traffic Management. Passing packets and their associated headers through a fixed number of stages provides deterministic performance - the certainty that a given task will be performed within a specific time frame. Displaying ambition that seems extreme even by Silicon Valley standards, they have incorporated five wire-speed functions on the chip - a classifier, a packet editor, a SAR, a queue manager, and a traffic manager. Now that they have some working alpha silicon, they say they are confident that all functions will run at wire speed - including the SAR.
Time, space, and the limits of my intellectual capacity prevent me from giving you a fully detailed account of the Bay Montego architecture, but I'll do my best to touch on a few highlights. You can refer to the simplified block diagram as we take a quick trip through this formidable chunk of packet-processing silicon.
The pipeline design breaks processing tasks up and assigns them to separate dedicated engines. Each engine has its own instruction set and an assigned set of states that it works through while operating on a packet. Bay stresses that the execute and dwell time for a packet does not vary, regardless of the task being performed on it.
Incoming packet headers are extracted and passed internally between engines via the control bus with no buffering. The packets themselves are buffered only once, in a bank of external SD or FC DRAM that sits between the policy engine and the forwarding engine. The traffic manager handles flow IDs and queue IDs, while the actual traffic resides in the payload buffer until the forwarding engine calls for it.
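The fixed dwell time can be pictured with a minimal sketch. The stage names and cycle counts below are illustrative assumptions, not Bay's actual figures; the only point is that the time a header spends in the pipeline depends on the stages, never on the header's contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    cycles: int  # fixed dwell time in clock cycles (made-up value)


# Hypothetical pipeline; names and cycle counts are assumptions for illustration.
PIPELINE = [
    Stage("classifier", 4),
    Stage("policy_engine", 4),
    Stage("forwarding_engine", 4),
]


def latency_cycles(pipeline):
    """Total cycles any header spends in the pipeline."""
    return sum(stage.cycles for stage in pipeline)


def process(header, pipeline):
    """Walk a header through every stage and record the elapsed cycle count.

    The header is deliberately ignored when computing timing: each stage
    always consumes its fixed number of cycles.
    """
    elapsed = 0
    trace = []
    for stage in pipeline:
        elapsed += stage.cycles
        trace.append((stage.name, elapsed))
    return trace


if __name__ == "__main__":
    small = {"length": 40, "flow_id": 1}
    large = {"length": 1500, "flow_id": 2}
    print(process(small, PIPELINE))
    print(process(large, PIPELINE) == process(small, PIPELINE))
```

A RISC array, by contrast, would make the elapsed time a function of which code path each packet takes, which is the variability Bay says it wants to avoid.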
Separate external memories are used to provide CAM for the classifier, and instructions for the policy engine, forwarding engine, and traffic manager. While slightly more costly, the separate memories eliminate any possibility of bus contention between processing engines.
The resulting architecture is fast enough to handle the equivalent of a full-duplex 10G Ethernet connection - except that it is unidirectional. An interesting side note is that Bay has a design that turns the chip into a bi-directional 10 Gbit switch - inquire with them for further details.
The result of all this is that you get some extraordinary performance from a single chip design, while retaining a fair amount of flexibility - although not as much as with Internet Machines, or other RISC-based designs.
Rather than editorialize further, I'll simply pass on the following specs and features that Bay claims for its alpha silicon:
Classification
Policy Engine
Of course, you're probably wondering how you'd program a little monster like this, which uses five separate flavors of microcode. It is interesting to note that while Bay has a superscalar architecture with multiple pipelines, they employ a single-threaded programming model similar to that of their rival, Internet Machines. An internal traffic flow manager coordinates the multiple pipelines automatically, allowing the developer to write code for the parallel pipelines as if they were a single, fast processing element. They say that their software development tools allow you to program it like a router on a per-flow basis. This package has been working on simulations in their lab for months, and on their alpha chips for weeks.
Another religious argument made by Bay, and other members of the pro-pipeline camp, is that multi-threaded architectures can be difficult to program, and that their hard performance limits are difficult to identify. I would argue that Internet Machines' efforts in developing cycle-accurate simulation and analysis tools have negated many of these objections. Nevertheless, I do admit that, unlike Bay with its clearly defined performance parameters, Internet Machines does leave the task of finding its processor's limits for specific applications as an exercise for the customer.
Taking yet another shot at the Motorolas, Cognigines, and Internet Machines of the NP universe, Bay also questions the scalability of most multi-CPU architectures. They point out that adding more processors represents an N² complexity issue for managing the flows, while widening their pipeline is only a linear increase in complexity. I think that there are some ingenious ways for multi-RISC architectures to get around this scaling problem, but I do think that the pipelined approach is more efficient in terms of the amount of silicon required for a given task. Of course, this assumes you're willing to give up the flexibility that a fully programmable solution offers.
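One way to read the scaling argument, sketched here under an assumption of my own: if every processor in an array may need to coordinate with every other, the number of coordination paths grows with the square of the processor count, while a pipeline widened by extra lanes grows only in proportion to its width. The function names and the one-unit-per-lane cost are hypothetical.

```python
def pairwise_paths(n_processors):
    """Coordination paths if every processor may talk to every other: n(n-1)/2."""
    return n_processors * (n_processors - 1) // 2


def pipeline_units(width, units_per_lane=1):
    """Cost of a pipeline that scales by adding identical lanes (assumed linear)."""
    return width * units_per_lane


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, pairwise_paths(n), pipeline_units(n))
```

Real multi-CPU designs can cut the quadratic term with hierarchical or shared interconnects, which is the kind of workaround alluded to above.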
Quibble as I might with Bay, they do substantiate many of their performance claims with working silicon (alpha silicon is shipping to a customer TODAY - after having been fully tested for 3 weeks). It's also interesting to note that, as fast as it is, the chip's bottlenecks are not in the processing elements, which can actually run 42+ million packets/s. Right now, the long pole in the tent is the memory interface at the payload buffer, which limits it to around 31.25 M packets/s. (I suggested that they look into TriCN's memory interfaces for their next design, and they declined to comment.)
Working silicon and some reasonable proof that they are making good on many of their claims earn Bay a very low Vapor Index Rating for such an ambitious and complex product.
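To put those packet rates in bandwidth terms, here is a quick back-of-the-envelope conversion. The 40-byte packet size is my assumption (a common worst-case minimum for IP traffic), and framing overhead is ignored; Bay did not state the packet size behind its figures.

```python
MIN_PACKET_BYTES = 40  # assumed worst-case packet size, not a figure from Bay


def line_rate_gbps(packets_per_second, packet_bytes=MIN_PACKET_BYTES):
    """Convert a packet rate into Gbit/s for a given packet size."""
    return packets_per_second * packet_bytes * 8 / 1e9


if __name__ == "__main__":
    # Payload-buffer limit quoted above
    print(line_rate_gbps(31.25e6))
    # Processing-element limit quoted above
    print(line_rate_gbps(42e6))
```

Under that assumption, the 31.25 M packets/s memory limit works out to 10 Gbit/s of minimum-size packets, and the 42 M packets/s processing limit to roughly 13.4 Gbit/s.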
Montego is sampling now in a BGA-1600 epoxy flip-chip package and will be priced at less than $1200 in volume. Volume production will begin in Q3 of this year.