RV770 architecture
800 stream processors and new texture units
At the heart of ATI’s new TeraScale graphics engine are 800 stream processors. The chip boasts a new SIMD layout, with the stream processors arranged into 10 SIMD cores. Each SIMD core contains 80 stream processors (for a total of 800) and 16KB of shared local cache, which is used for communication amongst the SIMD cores. In addition to that shared cache, each core has four texture units (for a total of 40), its own L1 cache and dedicated flow control logic.
ATI’s ultra-threaded dispatch processor is responsible for tracking and distributing thousands of threads simultaneously across the Radeon HD 4800’s stream processors. A thread that is already executing can be bumped at any time if a higher-priority thread is pulled from the command queues; its temporary data is saved so that it can be resumed later. If a thread is forced to wait for data, it is suspended and a new thread begins executing immediately. Suspended threads remain in the command queue until their requested data arrives. According to ATI, hundreds of threads can be queued up to make sure the SIMD arrays are never sitting idle.
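The chip-wide totals quoted for the SIMD layout are simple products of the per-core figures. The short Python sketch below works through that arithmetic; the constant and function names are illustrative only and are not ATI terminology.

```python
# Per-core figures quoted in the text; all names here are illustrative.
SIMD_CORES = 10
STREAM_PROCESSORS_PER_CORE = 80
TEXTURE_UNITS_PER_CORE = 4
SHARED_LOCAL_CACHE_KB_PER_CORE = 16


def chip_totals():
    """Multiply each per-core resource by the number of SIMD cores."""
    return {
        "stream_processors": SIMD_CORES * STREAM_PROCESSORS_PER_CORE,
        "texture_units": SIMD_CORES * TEXTURE_UNITS_PER_CORE,
        "shared_local_cache_kb": SIMD_CORES * SHARED_LOCAL_CACHE_KB_PER_CORE,
    }


if __name__ == "__main__":
    for name, value in chip_totals().items():
        print(f"{name}: {value}")
```

The first two results match the 800 stream processors and 40 texture units cited above; the third is the combined size of the ten 16KB shared local caches, a figure the text implies but does not state.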
Like R600, RV770 continues to support double-precision floating point, and its stream processing units use more aggressive clock gating to reduce power consumption. ATI has also added integer bit-shift operations to every stream processing unit, a 12.5X improvement over RV670, in which only a few stream processors supported bit-shift operations.
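The 12.5X figure can be read back into a unit count. The Python sketch below assumes that the improvement is nothing more than the ratio of shift-capable stream processors on the two chips; that is an assumption, since the text does not say how ATI arrived at the number, and the names used are illustrative.

```python
# Figures quoted in the text.
RV770_SHIFT_CAPABLE_UNITS = 800  # every stream processor supports bit shifts
QUOTED_IMPROVEMENT = 12.5        # ATI's stated gain over RV670


def implied_rv670_shift_units():
    """Shift-capable units on RV670, ASSUMING the 12.5X figure is a
    plain ratio of unit counts (not confirmed by the text)."""
    return RV770_SHIFT_CAPABLE_UNITS / QUOTED_IMPROVEMENT


if __name__ == "__main__":
    print(f"Implied RV670 shift-capable units: {implied_rv670_shift_units():.0f}")
```

Under that assumption, the quoted improvement works out to 64 shift-capable stream processors on RV670.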
ATI has also addressed one of RV670/R600’s chief shortcomings: a lack of texture units. As we stated earlier, RV770 boasts 40 texture units; that’s 2.5 times as many as the previous architecture, which had just 16. The texture units have also been redesigned and deliver up to twice the texture cache bandwidth of RV670. To improve AA performance, the ROPs have been optimized as well. While the number of units has remained the same, ATI has doubled the AA fill rate for 32-bit and 64-bit color from 8 pixels/clock to 16 pixels/clock under 2x/4x MSAA, and from 4 pixels/clock to 8 pixels/clock with 8x MSAA. ATI also doubled the peak rate for depth/stencil operations to 64 per clock.
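To put the per-clock figures above side by side, the Python sketch below computes each RV770-to-RV670 ratio. The numbers come from the text, except the previous depth/stencil rate of 32 per clock, which is inferred from the statement that the rate was doubled to 64; the dictionary keys are illustrative labels.

```python
# (previous generation, RV770) figures per clock, as quoted in the text.
FIGURES = {
    "texture units": (16, 40),
    "AA fill rate, 2x/4x MSAA (pixels/clock)": (8, 16),
    "AA fill rate, 8x MSAA (pixels/clock)": (4, 8),
    # 32 is inferred from "doubled ... to 64 per clock"; it is not quoted.
    "depth/stencil operations per clock": (32, 64),
}


def improvement_ratios():
    """Return the RV770 / previous-generation ratio for every figure."""
    return {name: new / old for name, (old, new) in FIGURES.items()}


if __name__ == "__main__":
    for name, ratio in improvement_ratios().items():
        print(f"{name}: {ratio:.1f}x")
```

These are per-clock ratios only; real-world throughput also depends on clock speeds, which this sketch deliberately ignores.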
256-bit memory interface
ATI has also incorporated a new memory interface into RV770. ATI has ditched the ring-bus architecture in favor of a new distributed controller design, with four 64-bit memory controllers (256 bits in total) spread along the edges of the die, directly adjacent to the ROPs. Each memory controller has its own L2 cache, and the controllers are linked to a central hub which handles duties such as inter-chip communication, PCI Express, display controllers and the CrossFire interconnect. The memory controllers support both GDDR3 and GDDR5 memory; GDDR5 is the memory type used on the Radeon HD 4870 and ATI’s upcoming dual-GPU Radeon HD 4870 X2.
GDDR5 offers much greater bandwidth than GDDR3 at lower power. For example, the Radeon HD 4870 boasts up to 115GB/sec of peak memory bandwidth over its 256-bit interface. This is quite impressive when you consider that the 512-bit memory interface employed on GeForce GTX 280, which is twice as wide, offers up to 141.7GB/sec of peak bandwidth.
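The two peak bandwidth figures can be reproduced with the usual formula: bus width in bytes multiplied by the per-pin data rate. The Python sketch below is a rough check, not a vendor calculation. The per-pin data rates (3.6Gbps for the Radeon HD 4870's GDDR5 and 2.214Gbps for the GeForce GTX 280's GDDR3) are not given in the text; they are assumed here from the cards' shipping memory clocks, and the function name is illustrative.

```python
def peak_bandwidth_gb_per_sec(bus_width_bits, data_rate_gbps_per_pin):
    """Peak bandwidth = (bus width in bytes) x (per-pin data rate in Gbps)."""
    return bus_width_bits / 8 * data_rate_gbps_per_pin


# Bus widths come from the text; RV770's is four 64-bit controllers.
RV770_BUS_WIDTH_BITS = 4 * 64
GTX_280_BUS_WIDTH_BITS = 512

# ASSUMED per-pin data rates (not stated in the text).
HD_4870_GDDR5_GBPS = 3.6
GTX_280_GDDR3_GBPS = 2.214

if __name__ == "__main__":
    hd_4870 = peak_bandwidth_gb_per_sec(RV770_BUS_WIDTH_BITS, HD_4870_GDDR5_GBPS)
    gtx_280 = peak_bandwidth_gb_per_sec(GTX_280_BUS_WIDTH_BITS, GTX_280_GDDR3_GBPS)
    print(f"Radeon HD 4870:  {hd_4870:.1f} GB/sec")
    print(f"GeForce GTX 280: {gtx_280:.1f} GB/sec")
```

With those assumed data rates, the results round to the 115GB/sec and 141.7GB/sec quoted above, which shows how GDDR5's higher per-pin rate makes up for most of the narrower bus.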