Replication of "The Hydra Effect: Emergent Self-repair in Language Model Computations" (link)
WORK IN PROGRESS
This is a self-contained Jupyter notebook containing the original paper's contents, along with code that implements all of its methods.
Once this is done, another copy will be made to extend the results, studying how the Hydra effect emerges with model size and over training, as well as the effects of multiple ablations and other phenomena.
To Do (Section 2):
- Boilerplate pip installs
- All latex equations and text
- All figures
- Insert codeblocks and pseudocode for all methods
- Section 2.1: Print model config
- Section 2.2: Load and munge CounterFact dataset
- General equivalence classes to handle all CounterFact prompt formats
- Ordinary resampling ablation
- Averaged resampling ablation
- $\text{do}(\cdot)$
- Multiple layer ablations
- Section 2.3: $u$ unembedding, $\hat{\pi}_t$, $\Delta_{\text{unembed},l}$
- Section 2.4: $\Delta_{\text{ablate},l}$
- Section 2.5: Produce the plots in figure 2
- Implement $\tilde{\Delta}_{\text{unembed},l}^k$
- Produce plots in figure 1, 3
- Derive same or different conclusions across model suite
- Pythia models are too weak for the given task of "What is the twin city of {}? It is"
- Try a different recall task
- Try a k-shot prompt to encourage the model to play nice
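
The ordinary, averaged, and multiple-layer resampling ablations listed above can be illustrated on a toy model before wiring them into a real transformer. What follows is a minimal sketch under stated assumptions, not the paper's implementation: the "model" is a hypothetical scalar residual stack, `run_model`, `resample_ablate`, and `ablation_impact` are illustrative names, and the impact is taken as clean output minus ablated output.

```python
from statistics import mean


def run_model(x, layers, patch=None):
    """Run a toy scalar residual stack (an assumption, not a real LM).

    `patch` maps a layer index to a value that overwrites that layer's
    output, i.e. do(output_l = value). Several layers can be patched at once.
    """
    patch = patch or {}
    resid = x
    outputs = []
    for idx, layer in enumerate(layers):
        out = patch.get(idx, layer(resid))
        outputs.append(out)
        resid = resid + out  # residual connection
    return resid, outputs


def resample_ablate(x, x_resample, layers, layer_idx):
    """Ordinary resampling ablation: overwrite one layer's output with the
    output the same layer produced on a different input."""
    _, donor_outputs = run_model(x_resample, layers)
    ablated, _ = run_model(x, layers, patch={layer_idx: donor_outputs[layer_idx]})
    return ablated


def ablation_impact(x, resamples, layers, layer_idx):
    """Averaged variant: mean of (clean - ablated) over several resamples."""
    clean, _ = run_model(x, layers)
    return mean(clean - resample_ablate(x, r, layers, layer_idx) for r in resamples)


# Two hypothetical layers; the second partly undoes whatever the first writes.
layers = [lambda r: 2.0 * r, lambda r: -0.5 * r]

single = ablation_impact(1.0, [2.0], layers, layer_idx=0)         # one resample
averaged = ablation_impact(1.0, [2.0, 3.0], layers, layer_idx=0)  # mean over two
both_zeroed, _ = run_model(1.0, layers, patch={0: 0.0, 1: 0.0})   # multi-layer patch
```

With these toy layers, `single` is `-1.0` and `averaged` is `-1.5`. In the notebook itself the same patching would presumably be done with forward hooks on the real model rather than a `patch` dictionary.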
To Do (Sections 3 & 4):
- Total effect
- Direct effect
- Indirect effect
- Generalized: $Y(x_{\leq t} \mid \text{do}(Z = z'))$
- Not really worth doing
- Determine whether "unrolled" RMS math in Section 3.1 applies to our LayerNormPre (it should, if we center first?)
- Section 4: Compensatory effect
- Generate CE/DE dataset
- Produce plots in figure 7, 8
- Derive same or different conclusions across model suite
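
To make the total, direct, and indirect effects listed above concrete, here is a hedged sketch on a toy scalar residual stack. It follows the standard causal-mediation definitions rather than the paper's exact formulation. The names `forward` and `effects`, the additive readout (final output = input plus the sum of layer outputs), and the sign convention (clean minus intervened) are all assumptions made for illustration.

```python
def forward(x, layers):
    """Clean run of a toy scalar residual stack; returns each layer's output."""
    resid, outputs = x, []
    for layer in layers:
        out = layer(resid)
        outputs.append(out)
        resid = resid + out
    return outputs


def effects(x, layers, layer_idx, patched_out):
    """Total, direct, and indirect effect of do(output[layer_idx] = patched_out)."""
    clean = forward(x, layers)
    y_clean = x + sum(clean)

    # Fully ablated run: patch the layer and let every later layer recompute.
    resid, ablated = x, []
    for idx, layer in enumerate(layers):
        out = patched_out if idx == layer_idx else layer(resid)
        ablated.append(out)
        resid = resid + out
    y_total = resid

    # Direct: only the patched layer's own contribution to the readout changes;
    # downstream layers are held at their clean outputs.
    y_direct = y_clean - clean[layer_idx] + patched_out

    # Indirect: the patched layer's own contribution stays clean, but downstream
    # layers take the outputs they produced in the fully ablated run.
    y_indirect = x + sum(clean[: layer_idx + 1]) + sum(ablated[layer_idx + 1:])

    return {
        "total": y_clean - y_total,
        "direct": y_clean - y_direct,
        "indirect": y_clean - y_indirect,
    }


# Hypothetical layers: the second layer partly cancels what the first writes.
layers = [lambda r: 2.0 * r, lambda r: -0.5 * r]
result = effects(1.0, layers, layer_idx=0, patched_out=4.0)
```

Here `result` is `{"total": -1.0, "direct": -2.0, "indirect": 1.0}`: the downstream layer offsets half of the direct effect, which is the kind of compensation the Section 4 items aim to measure. The exact identity total = direct + indirect holds only because this toy readout is additive; with a final normalization layer it should not be expected to hold in general.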