Jeff Dean (@JeffDean)


TPU 8i is co-designed with our Gemini research team to support low latency inference. Among the attributes that support this are large amounts of on-chip SRAM, enabling more computations to be done on chip without having to go to HBM for weights or KVCache state as often. The boardfly network topology (see image below) offers a much lower diameter network to connect all 1152 chips in an 8i pod, by fully connecting all 4 chips on the board together, fully connecting groups of 8 boards together, and then fully connecting 36 groups of 8 boards together. In addition, there is specialized Collectives Acceleration Engine (CAE) circuitry on each chip to offload various kinds of reductions and other global operations from the main computational portion of each chip, reducing on-chip latency by up to 5x. Together, these features will offer very high throughput for large-scale models (including MoEs, which often require mapping onto many chips for inference), and will do so at very low latency. This will make agentic workloads and interactive usage really shine on the TPU 8i.
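The pod arithmetic above can be checked directly, and the three-level full connectivity suggests a small worst-case hop count. A minimal sketch of that reasoning (the chip/board/group counts come from the post; the hop-count logic is my assumption about how "fully connected at each level" composes, not a published spec):

```python
# Pod scale described in the post: 4 chips per board, groups of 8 boards,
# 36 groups per 8i pod.
CHIPS_PER_BOARD = 4
BOARDS_PER_GROUP = 8
GROUPS_PER_POD = 36

pod_chips = CHIPS_PER_BOARD * BOARDS_PER_GROUP * GROUPS_PER_POD
print(pod_chips)  # 1152, matching the pod size in the post

# Assumption (not stated in the post): if every chip participates directly
# in the full connectivity at each of the three levels, the worst-case path
# between two chips in different groups is at most one hop per level.
max_hops = 3
print(max_hops)
```
A low network diameter like this is what keeps cross-pod collectives cheap for latency-sensitive inference.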

![Diagram of the boardfly network topology in an 8i pod](https://x.com/JeffDean/status/2047407537566495033/photo/1)
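To make the collectives-offload point concrete, here is a minimal sketch of the kind of global operation such circuitry handles, a tree reduction followed by a broadcast (an all-reduce). This illustrates the operation itself, not the CAE hardware design, and the function name and list-based framing are my own for illustration:

```python
def tree_allreduce(values):
    """All-reduce over a list standing in for one value per chip:
    pairwise-combine up a binary tree, then broadcast the result."""
    # Reduction phase: each round halves the number of partial sums.
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    total = level[0]
    # Broadcast phase: every participant ends with the reduced value.
    return [total] * len(values)

print(tree_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```
Offloading loops like this to dedicated circuitry frees the main compute units, which is where the quoted up-to-5x on-chip latency reduction comes from.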
