Refusal in LLMs is mediated by a single direction
(www.lesswrong.com)
It works in reverse too. You can make any LLM “forget” that it is even able to refuse anything.
Oh for sure, and that was the main point, but I just find LLMs that refuse to do anything at all hilarious.
I wonder how much work it'd be to use this to jailbreak Llama 3. I only started playing with local LLMs recently. It's not exactly a step-by-step guide, but it gives you all the datasets you need and the general procedure. There's a bit of "draw the rest of the owl," but not too much.
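For anyone trying to picture the "general procedure," here is a rough toy sketch of the core idea as I understand it: take the difference between the mean activation on harmful prompts and the mean on harmless prompts, normalize it, then project that direction out of an activation (or add it back in to get the refuse-everything behavior). Everything here is an assumption for illustration: the function names are made up, the vectors are tiny hand-written lists rather than real residual-stream activations, and a real run would need hooks into an actual model at a chosen layer and token position.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

def add_direction(activation, direction, scale):
    """The reverse operation: push the activation along `direction`."""
    return [a + scale * d for a, d in zip(activation, direction)]

# Toy stand-ins for cached activations (hypothetical values, not model output).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 2.0, 7.0], r_hat)
steered = add_direction([5.0, 2.0, 7.0], r_hat, 3.0)
```

The owl-drawing part is everything this skips: collecting the activations from a real model, picking which layer's direction to use, and applying the projection at every layer during generation.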