ICALens: Interpretability Method for Language Models Without Training Additional Autoencoders

At a glance: ICA-based analysis enables rapid exploration of interpretable directions in language models without expensive training of additional autoencoders.

Researchers propose Independent Component Analysis (ICA) as an efficient alternative to Sparse Autoencoders for interpreting language models. The new method, ICALens, identifies interpretable directions in model activations without training large neural dictionaries.

Sparse Autoencoders (SAEs) have become the standard method for finding interpretable directions in language models, but they require expensive training and storage of large overcomplete dictionaries. This requirement becomes a bottleneck for rapid exploration and raises the question of how much interpretable structure is already visible in the activation geometry before any new neural dictionary is trained.

ICALens leverages Independent Component Analysis, a classical statistical method for identifying non-Gaussian directions. The tool combines an optimized GPU-parallel FastICA pipeline with specialized stability measures for language models and improved diagnostic procedures. This combination enables efficient and reliable per-layer analysis without gradient-based per-layer training. The system was evaluated with GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base.
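To make the idea of "identifying non-Gaussian directions" concrete, here is a minimal toy sketch in plain Python. It is not the ICALens pipeline and not FastICA: it whitens a 2-D mixture of two synthetic sources and then grid-searches the rotation that maximizes absolute excess kurtosis, a simple non-Gaussianity contrast. The function names (`excess_kurtosis`, `whiten_2d`, `find_components`, `run_demo`), the synthetic sources, and the mixing weights are all assumptions made for illustration.

```python
import math
import random


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    fourth = sum((x - mean) ** 4 for x in xs) / n
    return fourth / var ** 2 - 3.0


def whiten_2d(points):
    """Center 2-D points and rescale them to identity covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    d = sum(y * y for _, y in centered) / n
    # Closed-form eigendecomposition of the covariance [[a, b], [b, d]].
    theta = 0.5 * math.atan2(2 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    lam1 = a * c * c + 2 * b * c * s + d * s * s
    lam2 = a * s * s - 2 * b * c * s + d * c * c
    return [((c * x + s * y) / math.sqrt(lam1),
             (-s * x + c * y) / math.sqrt(lam2)) for x, y in centered]


def rotate(white, angle):
    """Project whitened points onto two orthogonal directions."""
    c, s = math.cos(angle), math.sin(angle)
    u = [c * x + s * y for x, y in white]
    v = [-s * x + c * y for x, y in white]
    return u, v


def find_components(white, steps=180):
    """Grid-search the rotation whose projections are least Gaussian."""
    best_angle, best_score = 0.0, -1.0
    for k in range(steps):
        # A quarter turn covers all solutions up to order and sign.
        angle = (math.pi / 2) * k / steps
        u, v = rotate(white, angle)
        score = abs(excess_kurtosis(u)) + abs(excess_kurtosis(v))
        if score > best_score:
            best_angle, best_score = angle, score
    return rotate(white, best_angle)


def abs_corr(a, b):
    """Absolute Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return abs(cov) / math.sqrt(va * vb)


def run_demo(seed=0, n=4000):
    """Mix two synthetic sources, unmix them, report recovery quality."""
    rng = random.Random(seed)
    # Hypothetical sources: heavy-tailed (Laplace) and light-tailed (uniform).
    s1 = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(n)]
    s2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Arbitrary illustrative mixing weights.
    mixed = [(p + 0.6 * q, 0.4 * p + q) for p, q in zip(s1, s2)]
    u, v = find_components(whiten_2d(mixed))
    # ICA recovers sources only up to order and sign, so take the best match.
    return [max(abs_corr(src, u), abs_corr(src, v)) for src in (s1, s2)]


if __name__ == "__main__":
    print(run_demo())
```

Calling `run_demo()` returns, for each synthetic source, its absolute correlation with the best-matching recovered component; values close to 1 indicate that the non-Gaussian directions were found. A real per-layer analysis would work on high-dimensional activations with a fixed-point method such as FastICA; the exhaustive angle search here is only practical in two dimensions.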

On the SAEBench benchmark, ICA is competitive with existing public SAEs on sparse probing tasks and outperforms them on targeted probe perturbation under small to medium budgets. The results suggest that ICA should be understood not as a weak baseline but as an efficient, complementary first-pass method for exploring language model representations.


Source: arxiv.org · Published June 9, 2026
Lumi AI News — AI-assisted curation in accordance with Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.