Can we build mechanistic interpretability for trillion-parameter models?

Status: Queued

Recent work (Anthropic, OpenAI, DeepMind) has reverse-engineered small circuits in transformer models — induction heads, indirect object identification, modular arithmetic. Whether such methods scale to frontier models is unclear.

Open problems include superposition, polysemanticity, and the search for high-level features that map cleanly onto human concepts in models with hundreds of billions of parameters.

Sources

Anthropic interpretability research
Wikipedia: Mechanistic interpretability

Runs

No runs yet — this question is queued.