The Qwen-Scope paper marks an interesting shift in how we handle mechanistic interpretability. The core idea is moving sparse autoencoders from a post-hoc inspection tool to an actual interface for building and fixing language models. The team open-sourced 14 groups of SAEs for the Qwen3 and Qwen3.5 architectures and demonstrated four practical ways to use them directly in the development pipeline.
First up is inference steering. Instead of just looking at which features activate when a model messes up, you can actively suppress or amplify those latent features to fix the output on the fly, without updating any model weights. In one example, suppressing a specific Chinese-language feature stopped the model from randomly mixing languages in responses to English prompts. They also showed you can trigger a classical literary style transfer just by turning up the right feature direction.
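To make the mechanics concrete, here is a minimal sketch of what such a steering intervention could look like, assuming a simple ReLU SAE and a PyTorch forward hook on the layer the SAE was trained on. The layer, the dimensions, and `FEATURE_IDX` are placeholders for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

d_model, d_sae = 16, 64  # toy sizes; real SAEs are far wider

class TinySAE(nn.Module):
    """Minimal ReLU SAE: encode -> ReLU, with an explicit decoder matrix."""
    def __init__(self):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.1)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.1)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))

    def encode(self, h):
        return torch.relu(h @ self.W_enc + self.b_enc)

sae = TinySAE()
FEATURE_IDX = 7   # hypothetical "Chinese language" latent
ALPHA = 0.0       # 0.0 = fully suppress the feature; >1.0 = amplify it

def steering_hook(module, inputs, output):
    # Read the feature's current activation, then shift the residual
    # stream along that feature's decoder direction to rescale it.
    h = output
    acts = sae.encode(h)                          # (batch, seq, d_sae)
    current = acts[..., FEATURE_IDX].unsqueeze(-1)
    direction = sae.W_dec[FEATURE_IDX]            # (d_model,)
    return h + (ALPHA - 1.0) * current * direction

layer = nn.Linear(d_model, d_model)  # stand-in for a transformer sublayer
handle = layer.register_forward_hook(steering_hook)

x = torch.randn(2, 5, d_model)
steered = layer(x)  # residual stream now has the feature suppressed
handle.remove()
```

Setting `ALPHA` above 1.0 amplifies the feature instead of suppressing it, which is the same mechanism as the style-transfer demo, just pointed at a different latent.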
The evaluation finding is probably the most immediately useful for saving compute. Tracking the footprint of SAE features activated by a benchmark turns out to be a highly accurate proxy for dataset redundancy. If a bunch of reasoning problems activate the same micro-capability features, you can sample a small subset of the benchmark and still recover the same model ranking. Measuring feature overlap is also a reliable way to check whether two different benchmarks are really testing the same capabilities before you waste time running full evaluations.
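As an illustration of the idea rather than the paper's actual algorithm, the sketch below measures pairwise feature-footprint overlap and greedily picks a small covering subset. The footprints here are made-up stand-ins for the sets of SAE latents that fire when the model answers each item.

```python
from itertools import combinations

# Hypothetical per-item "feature footprints": the SAE latents that fire
# above threshold when the model works through each benchmark item.
footprints = {
    "item_01": {3, 17, 42, 88},
    "item_02": {3, 17, 42, 90},   # near-duplicate of item_01
    "item_03": {5, 61, 77},
    "item_04": {5, 61, 77, 42},
    "item_05": {12, 19},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Redundancy check: high average pairwise overlap suggests the benchmark
# keeps re-testing the same micro-capabilities.
pairs = list(combinations(footprints, 2))
avg = sum(jaccard(footprints[i], footprints[j]) for i, j in pairs) / len(pairs)
print(f"mean pairwise feature overlap: {avg:.2f}")

# Greedy subset selection: repeatedly add the item covering the most
# not-yet-seen features until the full footprint union is covered.
remaining = set().union(*footprints.values())
subset = []
while remaining:
    best = max(footprints, key=lambda k: len(footprints[k] & remaining))
    subset.append(best)
    remaining -= footprints[best]
print("representative subset:", subset)
```

The same `jaccard` measure applied to the union footprints of two whole benchmarks gives the cross-benchmark similarity check mentioned above.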
On the data curation side, they showed you do not even need to train a classification head for things like toxicity. A simple logical rule over a few toxic-biased SAE features acts as a classifier and achieves an F1 score above 0.90, and these toxic features, discovered in English, transfer quite well to other European languages. They also used this representation-level view for synthetic data generation: identify safety features missing from the training distribution, then prompt the model to generate examples that specifically trigger those missing internal directions.
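A hedged sketch of what such a rule-based classifier might look like; the feature indices, per-feature thresholds, and the two-of-three voting rule are illustrative assumptions, not the paper's published rule.

```python
import torch

# Hypothetical indices of SAE latents flagged as toxic-biased, plus
# per-feature firing thresholds (both would come from the analysis).
TOXIC_FEATURES = [102, 517, 2048]
THRESHOLDS = {102: 0.5, 517: 0.3, 2048: 0.8}

def is_toxic(feature_acts: torch.Tensor, min_hits: int = 2) -> bool:
    """Logical rule: flag a document if at least `min_hits` of the
    toxic-biased latents fire above threshold anywhere in the sequence.
    feature_acts: (seq_len, d_sae) SAE activations for one document."""
    hits = sum(
        bool((feature_acts[:, f] > THRESHOLDS[f]).any())
        for f in TOXIC_FEATURES
    )
    return hits >= min_hits

# Random activations standing in for real SAE outputs.
acts = torch.rand(32, 4096)
print(is_toxic(acts))
```

Because the rule reads internal directions rather than surface tokens, this is also why it transfers across languages better than a keyword filter would.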
Finally, they integrated these latent features directly into supervised fine-tuning and reinforcement learning. In the fine-tuning stage they added an auxiliary loss that suppresses language-specific features, which heavily reduced unexpected code-switching. For reinforcement learning they intentionally amplified repetition features to force the policy model into endless loops, giving the RL pipeline rare negative samples that are otherwise very hard to encounter naturally and an explicit training signal against repetitive generation.
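Here is one plausible shape for the fine-tuning side, assuming SAE activations at the hooked layer are available during the forward pass. The feature indices, the L1 penalty, and `LAMBDA_AUX` are all hypothetical choices, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical indices of language-specific latents to keep quiet
# during English-language training examples.
SUPPRESS = torch.tensor([11, 93, 407])
LAMBDA_AUX = 0.1

def sft_loss_with_feature_penalty(logits, labels, sae_acts):
    """Standard cross-entropy plus an L1 penalty on unwanted SAE latents.
    sae_acts: (batch, seq, d_sae) activations at the hooked layer."""
    ce = F.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), ignore_index=-100
    )
    aux = sae_acts[..., SUPPRESS].abs().mean()
    return ce + LAMBDA_AUX * aux

# Shapes standing in for a real forward pass.
logits = torch.randn(2, 8, 1000, requires_grad=True)
labels = torch.randint(0, 1000, (2, 8))
sae_acts = torch.rand(2, 8, 4096, requires_grad=True)
loss = sft_loss_with_feature_penalty(logits, labels, sae_acts)
loss.backward()
```

The RL side would not need new machinery: reusing the steering hook from the first sketch with `ALPHA > 1.0` on a repetition feature is one way to manufacture the degenerate loops used as negative samples.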