The bottom line: CausalMix uses causal modeling instead of static assumptions to find optimal data mix ratios that generalize across different data pool sizes and model scales.
Researchers propose CausalMix, an approach that formulates the optimization of training data mixtures for large language models as a causal inference problem. The method generalizes across different model sizes without requiring expensive retraining.
The mixing ratio of data sources used to train a large language model significantly influences final model performance. Previous methods optimize mixture weights through proxy models but assume that the data distribution stays fixed. If the underlying data pool shifts, these methods must restart training from scratch, an expensive process that makes scaling from small to larger data volumes and model sizes practically impossible.
CausalMix addresses this problem through causal inference: the statistical features of the data pool are modeled as covariates, and the domain mixture as the treatment. A causal inference pipeline is first trained on 512 runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE), meaning the expected change in model performance caused by a change in mixture, given the state of the data pool. The method then extrapolates the optimal mixture for an 800K-sized data pool and applies it to training a 7-billion-parameter model. The framework was also successfully generalized to long chain-of-thought data with Qwen3-4B-Base.
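To make the covariate/treatment framing concrete, here is a minimal sketch of CATE estimation with a T-learner (one outcome model per treatment arm). It is not the CausalMix pipeline: the single pool statistic `x`, the two candidate mixtures "A" and "B", the simulated scores, and all function names are assumptions chosen for illustration, and the treatment is assigned at random, so no confounder adjustment is needed here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_t_learner(runs):
    """runs: list of (x, t, y) with t in {0, 1}. Fits one outcome model per arm."""
    models = {}
    for arm in (0, 1):
        xs = [x for x, t, _ in runs if t == arm]
        ys = [y for _, t, y in runs if t == arm]
        models[arm] = fit_line(xs, ys)
    return models


def cate(models, x):
    """Estimated effect of mixture B (t=1) over mixture A (t=0) at pool state x."""
    a0, b0 = models[0]
    a1, b1 = models[1]
    return (a1 + b1 * x) - (a0 + b0 * x)


def simulate_runs(n=512, seed=0):
    """Hypothetical small-model runs: (pool statistic, mixture arm, benchmark score)."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)   # assumed pool statistic, scaled to [0, 1]
        t = rng.randint(0, 1)       # randomly assigned mixture: 0 = A, 1 = B
        # Assumed ground truth: B beats A only when x > 0.5.
        y = 0.5 + 0.1 * x + t * (0.4 * x - 0.2) + rng.gauss(0.0, 0.01)
        runs.append((x, t, y))
    return runs


models = fit_t_learner(simulate_runs())
for x in (0.2, 0.8):
    effect = cate(models, x)
    choice = "B" if effect > 0 else "A"
    print(f"pool statistic {x:.1f}: estimated CATE {effect:+.3f}, pick mixture {choice}")
```

Because the estimated effect depends on `x`, the preferred mixture changes with the state of the data pool, which is the sense in which a CATE-based recommendation is state-dependent. A real setting would involve many covariates, a continuous mixture vector rather than two arms, and explicit handling of confounding.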
The decisive advantage: through causal modeling, confounders are isolated, so CausalMix derives state-dependent optimal data mixtures without requiring retraining. In extensive experiments, mixtures guided by CausalMix consistently outperformed baseline methods such as RegMix across multiple downstream tasks. The framework also provides visual insights into the learned mixing strategies via a CATE interpreter.
Source: arxiv.org · Published June 30, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.2.