TorchTPU: Running PyTorch Natively on TPUs at Google Scale

The challenges of building for modern AI infrastructure have fundamentally shifted. The machine learning frontier now demands distributed systems that scale across thousands of accelerators. As models grow to run on clusters of around 100,000 chips, the software that supports them must meet rising requirements for performance, hardware portability, and reliability.

At Google, our Tensor Processing Units (TPUs) form the foundation of our supercomputing infrastructure. These custom ASICs are used to train and run inference for both Google’s own AI platforms, such as Gemini and Veo, and the large-scale workloads of our Cloud customers.

The whole AI community deserves seamless access to the full power of TPUs. Since many potential users develop their models in PyTorch, a native, high-performance integration between PyTorch and TPUs is essential. That’s where TorchTPU comes in. Our engineering team’s goal was to create a technology stack that prioritizes usability, portability, and outstanding performance. We aimed to let developers move their existing PyTorch workloads to TPUs with as little code modification as possible, while providing the APIs and tools needed to fully harness the compute power of our hardware.

Here’s a behind-the-scenes look at the engineering principles behind TorchTPU, the technical architecture we’ve developed, and our 2026 roadmap.

Architecting for Usability, Portability, and Performance

  Google Developers Blog