March 13, 2024
Maximizing training throughput using PyTorch FSDP
In this blog, we demonstrate the scalability of FSDP with a pre-training exemplar, a 7B model trained for 2T tokens, and share various techniques we used to achieve a rapid training speed of 3,700 tokens/sec/GPU, or 40B tokens/day on 128 A100 GPUs. This translates to a model FLOPS utilization (MFU) and hardware FLOPS utilization (HFU) of 57%. Additionally, we have observed near linear scaling of FSDP to 512 GPUs, implying that training a 7B model on 512 GPUs to 2T tokens using this method wou...
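The headline throughput figure is easy to sanity-check from the numbers quoted above; the following minimal sketch (plain arithmetic, using only the per-GPU rate and GPU count from the excerpt) reproduces the ~40B tokens/day estimate:

    # Back-of-the-envelope check of the throughput quoted above.
    tokens_per_sec_per_gpu = 3_700
    num_gpus = 128
    seconds_per_day = 24 * 60 * 60

    tokens_per_day = tokens_per_sec_per_gpu * num_gpus * seconds_per_day
    print(f"{tokens_per_day / 1e9:.1f}B tokens/day")  # ~40.9B tokens/day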
February 06, 2024
PyTorch 2 paper and tutorial @ ASPLOS 2024
The PyTorch team is excited to share that our paper on PyTorch 2 has been accepted for presentation at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), scheduled to take place from April 27 to May 1, 2024, in San Diego, CA, USA.
February 01, 2024
What's New in PyTorch Documentation
Greetings to the PyTorch community! Here is a quick update on PyTorch docs.
January 30, 2024
PyTorch 2.2: FlashAttention-v2 integration, AOTInductor
We are excited to announce the release of PyTorch® 2.2 (release note)! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-Python server-side deployments.
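As a quick illustration, here is a minimal sketch of calling torch.nn.functional.scaled_dot_product_attention, the API that the FlashAttention-v2 integration accelerates; the tensor shapes and dtype are illustrative assumptions, not the release's benchmark setup:

    import torch
    import torch.nn.functional as F

    # (batch, heads, sequence length, head dim); fp16 on GPU is where the
    # fused FlashAttention-v2 kernel applies.
    q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
    k = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
    v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

    # PyTorch dispatches to the fastest available backend (FlashAttention-v2,
    # memory-efficient attention, or a math fallback) automatically.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)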
January 23, 2024
Accelerating Generative AI with PyTorch IV: Seamless M4T, fast
This post is the fourth part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. To skip to the code, check out our GitHub (seamless_communication, fairseq2). We are excited to share a breadth of newly released PyTorch performance features alongside practical examples to see how far we can push PyTorch native performance. In part one, we showed how to accelerate Segment Anything over 8x using only pure, native PyTorch. In part two, we showed how...