Can we build mechanistic interpretability for trillion-parameter models?
Category: Computer Science
Status: Queued
Recent work (Anthropic, OpenAI, DeepMind) has reverse-engineered small circuits in transformer models — induction heads, indirect object identification, modular arithmetic. Whether such methods scale to frontier models is unclear.
Open problems include superposition, polysemanticity, and the search for high-level features that map cleanly onto human concepts in models with hundreds of billions of parameters.
Sources
Runs
No runs yet — this question is queued.