Pitfalls of Interpretability
3 times when interpretability wasn't interpretable.
I’m Head of Research at Apollo Research, where we work to mitigate the risks from scheming AI models, i.e. models that covertly pursue misaligned goals. Our research team conducts fundamental research on the emergence of scheming and evaluates frontier AI models for signs of scheming and deception.
These posts are from 2022–2023 and mostly reflect my PhD-era work (adversarial robustness, interpretability, and related topics).
3 times when interpretability wasn't interpretable.
Does your AI know when it doesn't know?
How I won an ML Security Evasion Competition.
How it works and how it doesn't.
Is AI as biased as the headlines suggest?
Should you prepare your ML model to be compliant?
Host your ML models for a few cents a month.
Finding adversarial samples in industrial-grade AI.