1
81
submitted 1 year ago* (last edited 1 year ago) by Elephant0991@lemmy.bleh.au to c/technology@beehaw.org

Paper & Examples

"Universal and Transferable Adversarial Attacks on Aligned Language Models." (https://llm-attacks.org/)

Summary

  • Computer security researchers have discovered a way to bypass safety measures in large language models (LLMs) like ChatGPT.
  • Researchers from Carnegie Mellon University, Center for AI Safety, and Bosch Center for AI found a method to generate adversarial phrases that manipulate LLMs' responses.
  • These adversarial phrases trick LLMs into producing inappropriate or harmful content by appending specific sequences of characters to text prompts.
  • Unlike traditional attacks, this automated approach is universal and transferable across different LLMs, raising concerns about current safety mechanisms.
  • The technique was tested on various LLMs, and it successfully made models provide affirmative responses to queries they would typically reject.
  • Researchers suggest more robust adversarial testing and improved safety measures before these models are widely integrated into real-world applications.
view more: next โ€บ

Lemdro.id

1,777 readers
25 users here now

Our Mission ๐Ÿš€

Lemdro.id strives to be a fully open source instance with incredible transparency. Visit our GitHub for the nuts and bolts that make this instance soar and our Matrix Space to chat with our team and access the read-only backroom admin chat.

Community Guidelines

We believe in maintaining a respectful and inclusive environment for all members. We encourage open discussion, but we do not tolerate spam, harassment, or disrespectful behaviour. Let's keep it civil!

Get Involved

Are you an experienced moderator, interested in bringing your subreddit to the Fediverse, or a Lemmy app developer looking for a home community? We'd be happy to host you! Get in touch!

Quick Links

Lemdro.id Interfaces ๐ŸชŸ

Our Communities ๐ŸŒ

Lemmy App List ๐Ÿ“ฑ

Chat and More ๐Ÿ’ฌ

Instance Updates

!lemdroid@lemdro.id

founded 1 year ago
ADMINS