You can make top LLMs break their own rules with gibberish (www.theregister.com)

submitted 1 year ago* (last edited 1 year ago) by Elephant0991@lemmy.bleh.au to c/technology@beehaw.org

18 comments fedilink

Paper & Examples

"Universal and Transferable Adversarial Attacks on Aligned Language Models." (https://llm-attacks.org/)

Summary

Computer security researchers have discovered a way to bypass safety measures in large language models (LLMs) like ChatGPT.
Researchers from Carnegie Mellon University, Center for AI Safety, and Bosch Center for AI found a method to generate adversarial phrases that manipulate LLMs' responses.
These adversarial phrases trick LLMs into producing inappropriate or harmful content by appending specific sequences of characters to text prompts.
Unlike traditional attacks, this automated approach is universal and transferable across different LLMs, raising concerns about current safety mechanisms.
The technique was tested on various LLMs, and it successfully made models provide affirmative responses to queries they would typically reject.
Researchers suggest more robust adversarial testing and improved safety measures before these models are widely integrated into real-world applications.

Lemdro.id

1,777 readers
25 users here now

Our Mission 🚀

Lemdro.id strives to be a fully open source instance with incredible transparency. Visit our GitHub for the nuts and bolts that make this instance soar and our Matrix Space to chat with our team and access the read-only backroom admin chat.

Community Guidelines

We believe in maintaining a respectful and inclusive environment for all members. We encourage open discussion, but we do not tolerate spam, harassment, or disrespectful behaviour. Let's keep it civil!

Get Involved

Are you an experienced moderator, interested in bringing your subreddit to the Fediverse, or a Lemmy app developer looking for a home community? We'd be happy to host you! Get in touch!

Quick Links

Lemdro.id Interfaces 🪟

Our Communities 🌐

Lemmy App List 📱

See thread

Chat and More 💬

Instance Updates

!lemdroid@lemdro.id

founded 1 year ago

ADMINS

cole

ijeff

Paradox