Steerability of Language Models Can Be Predicted Early

The Bottom Line: A classifier trained on early hidden states predicts whether activation steering will succeed, reaching a macro-F1 score of about 0.7 without requiring complete generations.

Activation steering enables control of language model behavior at runtime – but whether it works depends heavily on prompt, concept, and model. Researchers have developed a method that can predict after just a few generated tokens whether a steering attempt will succeed.

Activation steering is a lightweight method for controlling language model behavior during inference. Finding a good steering configuration, however, requires resource-intensive optimization runs and evaluation of complete generations, because success or failure depends heavily on which prompt, concept, model, and steering strength are used.

The new study “ASTEER” examines whether steerability can be predicted from a model’s internal states at the beginning of generation, specifically after the first few tokens. For this purpose, the authors constructed a test bed of 1.4 million steering operations covering 150 concepts, with each steering intervention marked as successful or failed. From the early decoding dynamics they extracted features that compare hidden states before and after steering across all layers and the first decoding steps. These features reveal how steering effects propagate through the model, which is the key information for prediction.
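To make the idea of comparing hidden states before and after steering concrete, here is a minimal sketch. The summary does not say which features the study actually uses, so the two per-layer, per-step quantities below (the L2 norm of the shift and the cosine similarity) are assumptions chosen for illustration, as are the function names and the toy data.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def early_features(base, steered):
    """Flatten per-step, per-layer comparisons into one feature vector.

    base, steered: nested lists indexed [step][layer][hidden_dim], holding
    the hidden states without and with steering. The feature choice is an
    illustrative assumption, not the study's documented feature set.
    """
    feats = []
    for step_base, step_steered in zip(base, steered):
        for h_base, h_steered in zip(step_base, step_steered):
            delta = [s - b for b, s in zip(h_base, h_steered)]
            feats.append(math.sqrt(sum(d * d for d in delta)))  # shift magnitude
            feats.append(cosine(h_base, h_steered))             # direction change
    return feats

# Toy data: 2 decoding steps, 2 layers, hidden size 2 (hypothetical values).
base = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]]
steered = [[[1.0, 0.0], [3.0, 5.0]], [[1.0, 1.0], [0.0, 2.0]]]
features = early_features(base, steered)
```

In this toy case the first layer is unchanged at both steps, while the second layer shifts noticeably. A pattern like that across layers and steps is the kind of signal a downstream classifier could pick up.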

A gradient-boosted decision tree (GBDT) classifier was trained on these features to predict whether an intervention results in under-steering, successful steering, or over-steering, without performing the complete autoregressive rollout. The classifier achieved a macro-F1 score of approximately 0.7 on unseen concepts, demonstrating that early hidden states contain substantial, structured information about eventual steering effectiveness.
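The reported metric can be read as follows: macro-F1 averages the per-class F1 scores of the three outcomes with equal weight, and "unseen concepts" implies that test concepts are excluded from training. The sketch below shows both steps with the standard library only. The label strings, record layout, and concept names are hypothetical, and the GBDT model itself is not reproduced here.

```python
LABELS = ("under", "success", "over")  # hypothetical label names

def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def split_by_concept(records, held_out):
    """Keep every record of a held-out concept out of the training set."""
    train = [r for r in records if r["concept"] not in held_out]
    test = [r for r in records if r["concept"] in held_out]
    return train, test

# Toy records; concept names are made up for illustration.
records = [
    {"concept": "formal", "label": "success"},
    {"concept": "pirate", "label": "over"},
    {"concept": "formal", "label": "under"},
]
train, test = split_by_concept(records, held_out={"pirate"})

y_true = ["under", "under", "success", "success", "over", "over"]
y_pred = ["under", "success", "success", "success", "over", "under"]
score = macro_f1(y_true, y_pred)
```

Because each class counts equally, a classifier that only ever predicts the most common outcome scores poorly on macro-F1, which makes the metric a reasonable fit for a three-way outcome with uneven class sizes.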

For CTOs and systems architects, this is particularly relevant: the steerability predictor can serve as a guide for optimizing steering strength and delivers near-optimal performance with a fraction of the computational load. This significantly shortens tuning cycles and reduces costs for production workloads where steering is deployed.
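One way such a predictor could guide the choice of steering strength is sketched below, under stated assumptions. The function `pick_strength`, the candidate grid, and the stand-in `toy_predict_proba` are all hypothetical. In practice the probabilities would come from a trained classifier applied to the early hidden states of a short probe, not from a formula.

```python
def pick_strength(candidates, predict_proba):
    """Return the candidate strength with the highest predicted success probability.

    predict_proba(strength) is assumed to return a dict with the keys
    "under", "success", and "over". Each call is meant to need only the
    first few decoding steps rather than a full generation.
    """
    return max(candidates, key=lambda s: predict_proba(s)["success"])

def toy_predict_proba(strength):
    """Stand-in predictor (illustration only): success peaks at strength 4."""
    p_success = max(0.0, 1.0 - abs(strength - 4.0) / 4.0)
    rest = 1.0 - p_success
    if strength < 4.0:
        return {"under": rest, "success": p_success, "over": 0.0}
    return {"under": 0.0, "success": p_success, "over": rest}

candidates = [1.0, 2.0, 4.0, 8.0]  # hypothetical strength grid
chosen = pick_strength(candidates, toy_predict_proba)
```

The saving comes from what each probe costs: scoring a candidate strength needs only a few decoded tokens, so a whole grid can be screened before any full generation is run.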


Source: arxiv.org · Published June 9, 2026
Lumi AI News — AI-assisted curation in accordance with Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.
