Neel Nanda
13 papers · 5132 citations
Most-cited papers
- Training A Helpful And Harmless Assistant With Reinforcement Learning From Human Feedback · 2022 · 3872 citations
- Refusal In Language Models Is Mediated By A Single Direction · 2024 · 599 citations
- Linear Representations Of Sentiment In Large Language Models · 2023 · 147 citations
- Improving Dictionary Learning With Gated Sparse Autoencoders · 2024 · 145 citations
- Transcoders Find Interpretable LLM Feature Circuits · 2024 · 126 citations
Topics