Uniform FP4: New 4-Bit Training Method for LLMs Reduces Systematic Errors

Bottom line: Uniform 4-bit formats avoid the systematic shrinkage bias that E2M1 introduces in FP4 LLM training, and they converged consistently better at all three model sizes tested.

FP4 quantization saves memory and compute when pretraining large language models, but E2M1, the format used so far, introduces systematic rounding errors. Researchers have developed UFP4, an alternative training method that uses uniformly spaced 4-bit grids and shows consistently better convergence.

The problem lies in the grid geometry of E2M1, the current standard format on GPUs such as NVIDIA Blackwell/Rubin and AMD MI350. Non-uniform formats like E2M1 suffer from a systematic shrinkage bias: a negative rounding error that arises because the representable values are unevenly spaced. This error compounds multiplicatively across network layers and is further amplified by the Random Hadamard Transform (RHT), an effect that destabilizes existing E2M1-based FP4 training.
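To make the difference in grid geometry concrete, the following is a minimal sketch that rounds Gaussian data to the nearest point on the E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) and on an evenly spaced INT4-style grid (0 to 7), then reports the mean change in magnitude. The function names and the per-tensor absmax scaling are assumptions made for illustration; this is not the paper's scaling scheme or its exact definition of shrinkage bias.

```python
import numpy as np

# Representable magnitudes; the sign is handled separately.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # spacing grows with magnitude
UNIFORM_GRID = np.arange(8, dtype=np.float64)                   # INT4-style, constant spacing


def round_to_nearest(x, grid):
    """Round each element of x to the nearest grid magnitude, keeping its sign."""
    mags = np.abs(x)
    idx = np.abs(mags[..., None] - grid).argmin(axis=-1)
    return np.sign(x) * grid[idx]


def quantize_absmax(x, grid):
    """Scale x so its largest magnitude hits the top grid point, round, rescale.

    Per-tensor absmax scaling is an illustrative assumption.
    """
    scale = np.abs(x).max() / grid[-1]
    return round_to_nearest(x / scale, grid) * scale


def magnitude_bias(x, grid):
    """Mean signed change in magnitude; negative values mean shrinkage."""
    q = quantize_absmax(x, grid)
    return float(np.mean(np.abs(q) - np.abs(x)))


rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
print("E2M1 magnitude bias:   ", magnitude_bias(x, E2M1_GRID))
print("uniform magnitude bias:", magnitude_bias(x, UNIFORM_GRID))
```

Running the sketch on other input distributions or scaling schemes shows how strongly the measured bias depends on how the data fills the grid's buckets.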

The researchers propose UFP4, a uniform 4-bit training method that applies RHT to all three training GEMMs and restricts stochastic rounding to the output gradients (dY). Unlike E2M1, uniform grids such as E1M2 and INT4 avoid this grid-geometry error. The method was evaluated at three model sizes: a 1.5B dense model and 7.9B and 124B mixture-of-experts (MoE) models. In all three cases, UFP4 showed less loss degradation relative to the BF16 reference than established E2M1 baselines.
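The two building blocks named above can be sketched in a few lines of NumPy. The example builds a random Hadamard rotation (a normalized Hadamard matrix times a random sign flip), checks that rotating both GEMM operands leaves the product unchanged before quantization, and implements unbiased stochastic rounding onto an integer grid. All names are hypothetical and the code is a sketch under these assumptions, not the authors' implementation; in particular it does not reproduce how UFP4 wires these pieces into the forward, input-gradient, and weight-gradient GEMMs.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def random_hadamard(n, rng):
    """Return H @ diag(s) for random signs s; the result is orthogonal."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * signs  # broadcasting scales column j by signs[j]


def stochastic_round(x, rng):
    """Round to a neighbouring integer, rounding up with probability frac(x).

    The expected value of the result equals x, so the rounding is unbiased.
    """
    lo = np.floor(x)
    return lo + (rng.random(x.shape) < (x - lo))


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16))   # activations
W = rng.standard_normal((16, 8))   # weights
R = random_hadamard(16, rng)

# Rotating both operands along the shared dimension cancels out: (X R)(R^T W) = X W.
# A quantized GEMM would quantize X @ R and R.T @ W before multiplying.
print(np.allclose((X @ R) @ (R.T @ W), X @ W))
```

Because the rotation is orthogonal, any difference between the rotated and unrotated products comes only from the quantization step applied after it.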

According to the authors' scaling-law and ablation results, uniform grids convert the improved bucket utilization achieved by RHT into higher quantization quality more effectively than E2M1 does. The result has implications for hardware design: the authors argue that future accelerators should support E1M2/INT4-style uniform 4-bit grids as training primitives on an equal footing with E2M1.

Source: arxiv.org · Published 17 June 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrasing and classification via Lumi News Pipeline v1.7.1.
