2025.11.21
BlockCert: Certified Blockwise Extraction of Transformer Mechanisms
Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining.