networkZONE Products for the week of August 22, 2005
Telairity Semiconductor Says
Telairity-1: Real-Time H.264 HD Video Architecture
Delivers Over 55 GOPs
Harnessing multiple independent vector/scalar processors, the multicore Telairity-1 architecture is specifically designed to handle the demanding computational requirements of the H.264 (MPEG-4 Part 10) HD codec. H.264 is set to supersede MPEG-2 as the standard by which HD video is compressed in the professional broadcast environment for transmission, storage, and editing, where the new standard will deliver the same or better picture quality with a lower bit rate.
Beginning with the T1P2000 multicore video processor, the first system-on-chip (SoC) to be built using the new architecture, H.264 encoding solutions implemented with Telairity-1 will offer the smallest footprint and lowest cost for broadcast-quality H.264 video compression, typically requiring less than one quarter the number of chips of general-purpose DSP solutions.
"The best processor benchmark is the customer's application, and it is in this type of environment that we've designed and measured the capabilities of Telairity-1 to deliver the highest level of video processing available in a single chip," said Howard Sachs, founder, president and CEO of Telairity Semiconductor. "Lower prices for HD equipment, ramping sales of HDTV receivers and monitors, and the availability of HD-DVD and Blu-Ray DVD players mean that HDTV has arrived. Now the industry is positioned to ensure that the reality of HDTV will live up to the audience's expectations. Encoders designed with Telairity-1 processors will play a major role in making this happen."
A very powerful processor is required to implement the H.264 algorithm. An H.264 compression engine requires 4 to 6 times the computational power of an MPEG-2 compression engine.
The programmable Telairity-1 architecture delivers this power by combining five independent vector/scalar processors, a video controller, and a DRAM controller supporting an I/O bandwidth up to 5.3 Gbps in a single multicore SoC. Each vector/scalar processor features four vector pipes with independent hardware, an independent scalar unit, 128 Kbytes of on-chip vector SRAM, a 4 Kbyte vector SRAM data cache, an 8 Kbyte scalar scratchpad memory, and a 32-Kbyte instruction cache. As a fully programmable chip, Telairity-1 will allow customers to modify or add new algorithms to customize or improve the encoder over time.
At a clock rate of 668.25 MHz, or nine times the 74.25-MHz 20-bit video standard, the T1P2000, the first product to be built on the new architecture, achieves a total sustained chip performance of 55.5 GOP/s (giga-operations per second).
Used at the heart of professional broadcast encoding solutions, the Telairity-1 T1P2000 will allow designers to build high-quality encoders with fewer chips at the board level, which translates into better reliability and lower production costs for OEMs. Where a real-time H.264 encoder built from general-purpose 600-MHz to 1-GHz DSPs would require 18 to 32 DSPs and six or more FPGAs, the Telairity solution requires only four to eight Telairity video processor chips and one small FPGA to achieve equivalent bit rates.
Flexibility is another bottom-line benefit Telairity-1 delivers. Devices built on this architecture can be used for many different video encoding applications, allowing OEMs to use the same platform to deliver a range of functional capabilities. Beyond professional broadcast applications, Telairity-1 processors will be used to enable HD video applications in video conferencing, security and surveillance, and medical imaging systems.
analogZONE Says . . .
Much like a sniper rifle, Telairity's little silicon monster is a specialized machine designed to do a limited number of tasks, and do them much better than a piece of general-purpose hardware could ever dream of doing. In this case the task at hand is real-time encoding and compression (with a heavy emphasis on H.264) at the rates typically found in the so-called "professional" equipment used by cable, satellite and DSL operators. The chip is designed to address a growing need for the MPEG-4 Part 10 (H.264) compression required to support HDTV, which cuts HDTV transmission bandwidth requirements by around 50% (from 20 Mbit/s to 10 Mbit/s). The challenge here is that MPEG-4's AVC compression algorithm uses a more complex entropy encoding, which makes it 4x to 6x more compute-intensive than MPEG-2. While you could enlist the help of a stack of power-hungry Pentiums or other general-purpose processors, the Telairity-1 processor attacks the problem with a specialized architecture that's fine-tuned to the tasks associated with video compression. This allows it, at least according to Telairity, to deliver the highest throughput per watt or per unit of silicon area of any solution on the market today.
To get a sense of why they're making such bold claims, we can look at the Telairity-1's architecture and see that it consists of five identical custom-designed processor blocks (see Fig. 1), each with its own vector and scalar processing elements. The five processors share a common video controller and a DRAM controller block. Very high-speed connections link the processors to their associated on-chip vector and scratch memories, and also to their off-chip DDR SDRAM.
Each of the five cores has four vector pipes (see Fig. 2). For each tick of the system clock, each pipe can support up to four reads, two writes, two loads, and a single store (plus one scalar operation in a separate queue). The four pipes share 128 kbyte worth of common vector memory with a random backoff algorithm to handle conflicts and "fair" bandwidth sharing. Resource conflicts between processors are handled by programmable priority circuitry that can be adjusted to give priority to certain processors. I'm told that priority management can also be done in software, but did not get all the details.
Much of the chip's performance is due to its powerful vector-based architecture, but careful attention to efficient inter-processor connections is also critical to keeping things blazing along at warp speed. For example, spreading a process across five non-deterministic processors introduces issues like latency variation due to resource conflicts. This is handled using an on-chip "scoreboard" system adapted from Cray's mainframe vector processor machines. This hardware block takes care of tracking pending operations, balancing traffic within the pipe, and assembling blocks in sequence at the end of the pipeline. This avoids having to do all these operations in software, as CISC machines must.
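To make the scoreboard idea concrete, here is a minimal Python sketch of one thing such a block does: tracking pending operations and releasing results in issue order even when they finish out of order. It is a toy software model, not Telairity's or Cray's hardware design. The `Scoreboard` class and its `issue`, `complete`, `pending`, and `retire` methods are hypothetical names invented for this illustration, and traffic balancing is left out entirely.

```python
class Scoreboard:
    """Toy model (hypothetical): track pending ops, release results in issue order."""

    def __init__(self):
        self._next_issue = 0    # sequence number for the next issued operation
        self._next_retire = 0   # oldest operation not yet released
        self._done = {}         # seq -> result for finished but unreleased ops

    def issue(self):
        """Register a new operation and return its sequence number."""
        seq = self._next_issue
        self._next_issue += 1
        return seq

    def complete(self, seq, result):
        """Record that operation `seq` has finished, in any order."""
        in_flight = self._next_retire <= seq < self._next_issue
        if not in_flight or seq in self._done:
            raise ValueError(f"operation {seq} is not pending")
        self._done[seq] = result

    def pending(self):
        """Number of issued operations that have not finished yet."""
        return self._next_issue - self._next_retire - len(self._done)

    def retire(self):
        """Release every finished result that is next in issue order."""
        released = []
        while self._next_retire in self._done:
            released.append(self._done.pop(self._next_retire))
            self._next_retire += 1
        return released


# Three blocks are issued in order but finish out of order.
sb = Scoreboard()
a, b, c = sb.issue(), sb.issue(), sb.issue()

sb.complete(c, "block2")
first = sb.retire()     # nothing released: block0 is still pending

sb.complete(a, "block0")
second = sb.retire()    # block0 can go; block1 still holds back block2

sb.complete(b, "block1")
third = sb.retire()     # block1 and block2 released together, in order
```

The point of the sketch is that the consumer at the end of the pipeline never has to sort anything: variable latency upstream is absorbed by the bookkeeping, which is the job the article says the hardware block takes off the software's hands.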
Telairity also uses strategically placed blocks of on-chip vector memory to help minimize latency in the processor/memory connection, something which greatly reduces the chances of pipeline stalls and minimizes the difference between the chip's peak and sustained throughput figures. Putting aside all the fancy talk, the bottom line is that a processor running at its rated 668 MHz core speed can deliver a sustained rate of 55 GOP/s.
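As a rough sanity check on those numbers, the short Python calculation below works out how many operations per clock the sustained figure implies. The inputs (74.25 MHz times nine, 55.5 GOP/s, five processors, four vector pipes each) come from the press release above. Splitting the total evenly across processors and pipes is my own simplifying assumption, since Telairity does not say how the operations are distributed between the vector pipes and the scalar units.

```python
# Figures quoted in the press release; everything derived is back-of-envelope.
VIDEO_CLOCK_MHZ = 74.25       # 20-bit video standard clock
CLOCK_MULTIPLE = 9            # core clock is nine times the video clock
SUSTAINED_GOPS = 55.5         # sustained giga-operations per second
NUM_PROCESSORS = 5            # vector/scalar processors per chip
PIPES_PER_PROCESSOR = 4       # vector pipes per processor

core_clock_mhz = VIDEO_CLOCK_MHZ * CLOCK_MULTIPLE

# Operations completed per core clock cycle across the whole chip.
ops_per_cycle_chip = (SUSTAINED_GOPS * 1e9) / (core_clock_mhz * 1e6)

# Assumption: work is spread evenly over processors and vector pipes.
ops_per_cycle_processor = ops_per_cycle_chip / NUM_PROCESSORS
ops_per_cycle_pipe = ops_per_cycle_processor / PIPES_PER_PROCESSOR

print(f"core clock: {core_clock_mhz} MHz")
print(f"ops/cycle per chip: {ops_per_cycle_chip:.2f}")
print(f"ops/cycle per processor: {ops_per_cycle_processor:.2f}")
print(f"ops/cycle per pipe: {ops_per_cycle_pipe:.2f}")
```

The sustained figure works out to about 83 operations per clock for the whole chip, or roughly 16.6 per processor, which gives a feel for how much parallel work each vector/scalar block has to keep in flight.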
Depending on quality and bit rate desired, it takes between four and eight Telairity-1 chips to perform real-time H.264 encoding. To put this into context, it would take 36 - 40 32-bit RISC processors, 18 - 30 DSPs, or six high-end Pentiums (plus lots of FPGA glue) to do what four Telairity devices can. And at a bit over $400 apiece and 15 W (typical) per chip, their claim that it delivers a minimum of 2x cost and power savings over conventional solutions seems pretty reasonable.
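For a feel for what that means at the board level, the sketch below totals the silicon cost and typical power for the four-chip and eight-chip cases. It uses the $425 price in 10k lots quoted at the end of this article and the 15 W typical figure. It covers the Telairity chips only: the FPGA, DRAM, and the rest of the board are not included, and no DSP pricing is assumed because none was given.

```python
CHIP_PRICE_USD = 425            # quoted price in 10k-piece lots
CHIP_POWER_W = 15               # quoted typical power per chip
CHIPS_MIN, CHIPS_MAX = 4, 8     # chips needed for real-time H.264 encoding
DSPS_MIN, DSPS_MAX = 18, 30     # DSPs quoted as matching four Telairity chips


def encoder_budget(chips):
    """Return (silicon cost in USD, typical power in W) for `chips` devices."""
    return chips * CHIP_PRICE_USD, chips * CHIP_POWER_W


low = encoder_budget(CHIPS_MIN)     # four-chip encoder
high = encoder_budget(CHIPS_MAX)    # eight-chip encoder

# How many DSPs each Telairity chip stands in for in the four-chip case.
chip_ratio = (DSPS_MIN / CHIPS_MIN, DSPS_MAX / CHIPS_MIN)

print(f"4 chips: ${low[0]}, {low[1]} W")
print(f"8 chips: ${high[0]}, {high[1]} W")
print(f"DSPs replaced per chip: {chip_ratio[0]} to {chip_ratio[1]}")
```

On these numbers a four-chip encoder carries $1,700 of Telairity silicon at 60 W typical, and each chip replaces between 4.5 and 7.5 DSPs, which squares with the press release's "less than one quarter the number of chips" claim.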
As with any custom processor (especially ones with really weird instruction sets) I'm always concerned about whether its performance advantage outweighs the time and cost involved with writing the specialized code for your particular application. And like every other custom processor manufacturer, Telairity insists that its C-based development environment, which includes a compiler, a vectorizer, and GNU-based assembler/linker and debug tools, hides nearly all of the chip's architectural peculiarities from the programmer. This, at least in theory, should allow your code crew to concentrate on the application at hand instead of worrying about which register has which chunk of block coding data. They also claim that their library of C-based applications, including a set of pre-written H.264 functions, will greatly shorten most development efforts.
And while I'm still a bit skeptical that any development effort for such a complex chip will be free of painful moments, I think that as long as you stick to the tightly-defined application areas that Telairity has targeted for its chip, you will find that much of the "heavy lifting" has already been done for you. Where I expect you'll really find out exactly how good Telairity's tools are (as well as your software skills) is when you decide to use the chip for some other sort of signal processing, or to run your own compression algorithm on their silicon.
Between the fact that I did not make a point of getting a really detailed briefing on their tools, and that I have not written much code in the last 20 years, I'm undecided as to whether Telairity's tools and resources will be enough to overcome the resistance most processors with radical architectures encounter. I'd welcome any readers' insights on this or, better yet, any hands-on experiences with programming the chip that you'd care to share.
Telairity says it's sampling the chip now, along with a general-purpose evaluation platform. They will be introducing a second development platform which is dedicated to video coding applications in late Q3 of this year. Given the chip's complexity, I would not be surprised to see a minor design re-spin between now and when the processor goes into full production at the end of this year to correct any unexpected timing issues that are likely to crop up in a design of this complexity operating at these speeds.
Given the anticipated transition rate to H.264 for telcos and satellite of around 80% by 2010 (with cable lagging way behind) there should be a large demand for cable and satellite head-end equipment that supports this compression algorithm, as well as for boxes used at production facilities. And from what I could extract from my conversations with Telairity there is also a growing demand for these kinds of functions in other applications such as servers, storage and authoring systems, and perhaps even video conferencing and security as well. Many of these applications will demand equipment at much lower price points than current technology can support, making specialized engines like the Telairity-1 an extremely attractive option.
Samples of the first Telairity-1 product, the 668.25-MHz T1P2000, are available now in an FCBGA-1156 package, with production in Q4 2005. The chip is priced at $425 in 10k-piece lots.