Refusal in LLMs is mediated by a single direction
(www.lesswrong.com)
Oh for sure, and that was the main point, but I just find LLMs that refuse to do anything at all hilarious.
I wonder how much work it'd be to use this to jailbreak llama3. I only started playing with local LLMs recently. The post isn't exactly a step-by-step guide, but it gives you all the datasets you need and the general procedure. There's a bit of "draw the rest of the owl," but not too much.