Alpha. curiolab.science is in early alpha testing — expect rough edges, broken links, and content that may change without notice.

Can we build mechanistic interpretability for trillion-parameter models?

Category: Computer Science

Status: Queued

Recent work (Anthropic, OpenAI, DeepMind) has reverse-engineered small circuits in transformer models — induction heads, indirect object identification, modular arithmetic. Whether such methods scale to frontier models is unclear.

Open problems include superposition, polysemanticity, and the search for high-level features that map cleanly onto human concepts in models with hundreds of billions of parameters.

Sources

Runs

No runs yet — this question is queued.