In a nutshell: End-to-end training of the tokenizer and generator with dual codebook selection speeds up ImageNet gFID convergence by up to 10x compared to LlamaGen-REPA.
Researchers present GEAR, a method for jointly training a VQ tokenizer and an autoregressive generator for image generation. The key obstacle, the non-differentiability of VQ indices, is solved through dual codebook selection, which allows the generator to guide the tokenizer’s training.
Visual generative models are typically trained in two separate phases: first, a tokenizer for reconstruction is trained and then frozen; next, a generator is trained on its discrete indices or continuous latent vectors. This decoupling causes the tokenizer to be unaware of which structures the generator can model easily.
GEAR (Guided End-to-end AutoRegression) solves this problem by training a VQ tokenizer and an autoregressive generator jointly and end to end, guided by representation alignment. The core challenge: the VQ index passed to the AR model is non-differentiable, so gradients normally do not reach the tokenizer, and a straight-through estimator collapses. GEAR addresses this with dual codebook selection: a hard branch with one-hot encoding trains the AR model on next-token prediction, while a differentiable soft branch carries a representation alignment loss whose gradients flow back to the tokenizer. The AR generator thus guides its tokenizer toward an index distribution that the generator itself can predict more easily.
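To make the hard/soft distinction concrete, here is a minimal sketch in plain Python. It is not GEAR's actual implementation: the summary does not say how the soft branch is computed, so the softmax over negative squared distances, the temperature `tau`, the toy codebook, and all function names are assumptions chosen for illustration. The sketch shows only why the hard one-hot assignment passes no gradient signal (it does not move under a small change of the encoder output) while a soft assignment does.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_select(z, codebook):
    """Hard branch (illustrative): one-hot vector for the nearest code."""
    dists = [sq_dist(z, c) for c in codebook]
    k = dists.index(min(dists))
    return [1.0 if i == k else 0.0 for i in range(len(codebook))]

def soft_select(z, codebook, tau=1.0):
    """Soft branch (assumed form): softmax over negative distances."""
    logits = [-sq_dist(z, c) / tau for c in codebook]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy codebook with three 2-D codes and one encoder output (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z = [0.8, 0.1]
z_shifted = [z[0] + 1e-3, z[1]]  # tiny perturbation of the encoder output

hard = hard_select(z, codebook)
soft = soft_select(z, codebook)

# The one-hot assignment is piecewise constant, so it ignores the perturbation;
# the soft assignment responds to it and can therefore carry a gradient.
hard_changed = hard_select(z_shifted, codebook) != hard
soft_changed = soft_select(z_shifted, codebook) != soft
print(hard, hard_changed)
print([round(p, 4) for p in soft], soft_changed)
```

In a real training loop, the one-hot output would feed the next-token prediction loss, and the soft weights would feed the alignment loss through an autodiff framework; the finite perturbation here only stands in for that gradient.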
This reversal of alignment focus leads to asymmetric feature properties: the tokenizer’s features become less DINOv2-like, while those of the AR generator become more DINOv2-like – the opposite of diffusion approaches that make the latent vector itself semantic.
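One way to read "DINOv2-like" is as a patch-wise similarity between a model's features and reference features. The sketch below uses mean cosine similarity, the form of alignment objective known from REPA; whether GEAR measures or optimizes alignment in exactly this way is an assumption, and the function names and toy feature vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_score(features, targets):
    """Mean patch-wise cosine similarity; higher means more target-like."""
    sims = [cosine(f, t) for f, t in zip(features, targets)]
    return sum(sims) / len(sims)

def alignment_loss(features, targets):
    """Negative alignment score, so minimizing it increases similarity."""
    return -alignment_score(features, targets)

# Two patches with 2-D features (toy stand-ins for reference features).
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]      # same directions, different scale
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to the targets

score_aligned = alignment_score(aligned, targets)
score_misaligned = alignment_score(misaligned, targets)
print(score_aligned, score_misaligned)
```

Because cosine similarity ignores scale, the score compares feature directions only, which is why the rescaled `aligned` features still count as fully aligned.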
In experiments, GEAR accelerates ImageNet gFID convergence up to 10-fold compared to the LlamaGen-REPA baseline and learns significantly better patch-level and spatially coherent features. The method also generalizes across different quantizers (VQVAE, LFQ, IBQ) and can be applied to text-to-image generation.
Source: arxiv.org · Published June 29, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.2.