Google just set a new high-water mark for shared-memory AI systems. Its Ironwood TPU platform, unveiled at Hot Chips 2025 after a preview at Google Cloud Next ’25, scales to 9,216 chips and exposes a stunning 1.77PB of directly addressable HBM. That makes Ironwood not only Google’s most powerful supercomputer to date, but a signal flare for where large-scale AI inference is headed.
Key Takeaways
- Inference-first design: Ironwood is Google’s seventh-generation TPU and the first primarily tuned for large-scale inference rather than training.
- Per-chip muscle: Dual-die architecture delivering 4,614 TFLOPS (FP8), with eight HBM3e stacks providing 192GB of capacity and 7.3TB/s of bandwidth per chip.
- Pod-scale throughput: Up to 9,216 chips per pod, reaching about 42.5 exaflops (FP8).
- Record shared memory: 1.77PB of directly addressable HBM across a pod, linked with optical circuit switches.
- Massive interconnect: 1.2TB/s of I/O bandwidth enabling scale-out without glue logic.
- Reliability & security: On-chip root of trust, built-in self-test, mitigation for silent data corruption, checkpointing, and automatic reconfiguration around failed nodes.
- Operational efficiency: Enhanced cooling and AI-assisted design features targeted at efficient inference workloads.
Why Ironwood Matters Now
The center of gravity in AI is shifting from headline-grabbing training runs to always-on inference that powers products. That flips the bottlenecks: the constraint is no longer just raw compute but also memory capacity (to keep giant models resident) and memory bandwidth (to keep them fed), all while latency and cost stay under control.
Ironwood tackles those constraints head-on. By pooling petabyte-scale HBM behind an ultra-fast fabric, Google can serve larger models and longer contexts without excessive sharding or cross-node chatter. In practical terms: higher throughput at steadier latency, more efficient batching, and the headroom to roll out new features without constantly rearchitecting serving stacks.
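To make the compute-versus-bandwidth trade-off concrete, here is a minimal roofline-style sketch using the per-chip figures quoted in this article. The decode cost model (2 FLOPs per parameter per sequence, weights read from HBM once per step, KV-cache traffic ignored) and the function names are illustrative assumptions, not anything Google has published about Ironwood.

```python
# Per-chip figures quoted in the article.
PEAK_FP8_FLOPS = 4_614e12   # FLOP/s
HBM_BANDWIDTH = 7.3e12      # bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs per byte at which a chip shifts from memory-bound to compute-bound."""
    return peak_flops / bandwidth


def decode_intensity(batch_size: int, bytes_per_param: float = 1.0) -> float:
    """Approximate FLOPs per byte moved for one decode step.

    Assumption: each parameter costs 2 FLOPs per sequence in the batch and is
    read from HBM once per step; KV-cache traffic is ignored.
    """
    return 2.0 * batch_size / bytes_per_param


def is_memory_bound(batch_size: int) -> bool:
    """True when the assumed decode intensity falls below the ridge point."""
    return decode_intensity(batch_size) < ridge_point(PEAK_FP8_FLOPS, HBM_BANDWIDTH)


if __name__ == "__main__":
    print(f"ridge point: {ridge_point(PEAK_FP8_FLOPS, HBM_BANDWIDTH):.0f} FLOP/byte")
    for batch in (8, 64, 512):
        label = "memory-bound" if is_memory_bound(batch) else "compute-bound"
        print(f"batch {batch}: {label}")
```

Under these simplified assumptions the ridge point lands at roughly 632 FLOPs per byte, which suggests that small-batch FP8 decoding would be limited by HBM bandwidth rather than by peak compute. Real serving behavior depends on KV-cache traffic, kernel efficiency, and interconnect effects that this sketch leaves out.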
Under the Hood: Architecture at a Glance
- Dual-die compute per chip optimizes FP8 throughput for inference.
- HBM3e (192GB/chip, 7.3TB/s) is meant to keep the memory wall from becoming the limiting factor.
- Optical circuit switches knit racks together, exposing 1.77PB of shared HBM as a single, addressable memory space.
- 1.2TB/s I/O sustains pod-scale traffic without extra glue logic, simplifying system design.
- Resilience stack: root of trust, BIST, SDC mitigations, checkpoint/restore, and topology-aware failover.
What’s Different vs. Traditional AI Clusters
Most inference clusters lean on model and tensor sharding, which adds orchestration overhead and network tax. Ironwood’s approach — treat memory as a first-class, shared resource — reduces fragmentation and helps keep working sets hot. That’s especially relevant for LLMs with expanding context windows and for multimodal inference where memory footprints balloon with images, video, and embeddings.
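As a rough illustration of why capacity drives sharding, the sketch below estimates the minimum number of chips a model must span just to fit in HBM. The 192GB figure comes from this article; the `usable_fraction` headroom, the function name, and the example model sizes are hypothetical.

```python
import math

HBM_PER_CHIP_GB = 192  # per-chip HBM capacity quoted in the article


def min_chips(weights_gb: float, kv_cache_gb: float = 0.0,
              usable_fraction: float = 0.9) -> int:
    """Smallest chip count whose combined usable HBM holds weights plus KV cache.

    Assumption: usable_fraction reserves headroom for activations and runtime
    buffers; 0.9 is an illustrative guess, not a measured value.
    """
    usable_gb = HBM_PER_CHIP_GB * usable_fraction
    return math.ceil((weights_gb + kv_cache_gb) / usable_gb)


if __name__ == "__main__":
    # Hypothetical 1-trillion-parameter model stored in FP8 (about 1,000 GB).
    print(min_chips(1000))        # weights only
    print(min_chips(1000, 500))   # weights plus a 500 GB KV cache
```

This gives only a capacity floor. The actual chip count for a deployment would also depend on throughput targets, latency budgets, and how the serving stack partitions the model.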
There’s also a trust story: at hyperscale, a tiny rate of silent errors can snowball. Ironwood’s built-in checks and on-chip root of trust aim to make inference outputs more repeatable and verifiable — crucial for enterprise workloads and safety-critical applications.
Bigger Picture: Trends to Watch
- FP8 everywhere: FP8 is becoming the default precision for high-throughput inference. Expect growing toolchain support and quantization-aware training to keep improving model quality at lower bit-widths.
- Memory is the new moat: As context lengths and retrieval-augmented workflows grow, shared HBM capacity may be as decisive as raw FLOPs for user-perceived performance.
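To see how context length translates into memory demand, here is a small sketch of the standard KV-cache size estimate for a transformer decoder. The model shape used in the example (80 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration and does not describe any model Google serves on Ironwood.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 1) -> int:
    """Bytes needed to cache keys and values for every token in the batch.

    The leading factor of 2 covers the separate key and value tensors;
    bytes_per_elem=1 corresponds to storing the cache in FP8.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape; a single sequence at several context lengths.
    for context in (8_192, 131_072, 1_048_576):
        size_gb = kv_cache_bytes(80, 8, 128, context) / 1e9
        print(f"{context:>9} tokens: {size_gb:8.2f} GB")
```

For this hypothetical shape, one 131,072-token sequence needs about 21.5GB of cache in FP8, and the cost grows linearly with both context length and batch size, which is why pooled HBM capacity matters for long-context serving.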
Specs Snapshot (for the nerds)
- Generation: TPU v7 “Ironwood” (inference-focused)
- Perf (per chip): 4,614 TFLOPS FP8
- Memory (per chip): 192GB HBM3e @ 7.3TB/s
- Scale: Up to 9,216 chips per pod
- Pod Perf: ~42.5 EFLOPS (FP8)
- Shared Memory (pod): 1.77PB directly addressable HBM
- Interconnect: 1.2TB/s I/O; optical circuit switches
- Resilience: Root of trust, BIST, SDC mitigation, checkpoint/restore, self-healing topology
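The pod-level figures follow directly from the per-chip numbers. This quick sanity check multiplies them out, assuming simple linear scaling across all 9,216 chips and decimal units (1 PB = 1,000,000 GB). The variable names are illustrative.

```python
CHIPS_PER_POD = 9_216
TFLOPS_PER_CHIP = 4_614   # FP8, per chip
HBM_GB_PER_CHIP = 192     # per chip

# Linear scaling assumption: pod totals are per-chip values times chip count.
pod_eflops = CHIPS_PER_POD * TFLOPS_PER_CHIP / 1e6   # 1 EFLOPS = 1e6 TFLOPS
pod_hbm_pb = CHIPS_PER_POD * HBM_GB_PER_CHIP / 1e6   # 1 PB = 1e6 GB (decimal)

print(f"pod compute: {pod_eflops:.2f} EFLOPS (FP8)")
print(f"pod HBM:     {pod_hbm_pb:.2f} PB")
```

Both results round to the headline figures: about 42.5 EFLOPS and 1.77PB. These are peak aggregates, so sustained throughput on real workloads would be lower.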
What It Means for Builders
If you’re shipping AI features globally, Ironwood-class systems could shrink tail latency and stabilize throughput during peak traffic. That enables richer prompts, longer contexts, and larger multi-tool pipelines without blowing up serving complexity — or the bill.
Join the Conversation
Your turn: What’s the bigger unlock for real-world AI this year — petabyte-scale shared memory or ever-cheaper FP8 compute? Share your take and tag a teammate who obsesses over inference latency.




