this post was submitted on 29 Nov 2023
434 points (97.4% liked)

Technology

ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

[–] KingRandomGuy@lemmy.world 16 points 2 years ago (7 children)

Not sure what other people were claiming, but normally the point being made is that it's not possible for a network to memorize a significant portion of its training data. It can definitely memorize significant portions of individual copyrighted works (as shown here), but the whole dataset is far too large compared to the model's weights to be memorized.
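To put rough numbers on the size argument, here is a back-of-the-envelope sketch. Every figure in it is a made-up placeholder (parameter count, bytes per parameter, token count, bytes per token), not a measurement of ChatGPT or any real model, and `bytes_ratio` is just a hypothetical helper name.

```python
def bytes_ratio(n_params, bytes_per_param, n_tokens, bytes_per_token):
    """Return (training text size) / (weight size), both in bytes."""
    weight_bytes = n_params * bytes_per_param
    data_bytes = n_tokens * bytes_per_token
    return data_bytes / weight_bytes

# Hypothetical figures for illustration only, not real model statistics.
ratio = bytes_ratio(
    n_params=7e9,        # assumed parameter count
    bytes_per_param=2,   # assumed 16-bit weights
    n_tokens=1e12,       # assumed training tokens
    bytes_per_token=4,   # assumed average bytes of text per token
)
print(f"training text is about {ratio:.0f}x larger than the weights")
```

With placeholder numbers like these the text outweighs the weights by a couple of orders of magnitude, which is the gap the comment is pointing at. Plug in your own figures to see how the ratio moves.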

[–] ayaya 16 points 2 years ago* (last edited 2 years ago) (6 children)

And even then there is no "database" that contains portions of works. The network only stores weights, which encode how likely tokens (words or pieces of words) are to follow one another. So if it is able to replicate anything verbatim, it is just overfitted on that text. Ironically, the solution is to feed it even more works so it is less likely to be able to reproduce any single one.
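A toy way to see the overfitting point: the sketch below is a word-level bigram counter, which is far simpler than a real LLM and only stands in for the "how likely is this to come next" idea. The sentences and the names `train` and `generate` are invented for the example.

```python
from collections import Counter, defaultdict

def train(texts):
    """Count, for each word, which words follow it across all texts."""
    counts = defaultdict(Counter)
    for text in texts:
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_words):
    """Greedy decoding: always pick the most frequent next word."""
    out = [start]
    while len(out) < max_words and out[-1] in counts:
        out.append(counts[out[-1]].most_common(1)[0][0])
    return " ".join(out)

# Invented example sentences, not taken from any real dataset.
work = "the cat sat on a mat"
others = ["the dog sat by the door", "the dog ran on the road"]

alone = generate(train([work]), "the", max_words=6)
mixed = generate(train([work] + others), "the", max_words=6)

print(alone == work)  # trained on one work only: it comes back verbatim
print(mixed == work)  # trained on more text: "the" is now followed by "dog"
```

Trained on the single sentence, the counter has exactly one continuation for every word, so it can only replay that sentence. Once other text is mixed in, the most likely continuation after "the" changes and the original sentence no longer comes out word for word.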

[–] Kbin_space_program@kbin.social 2 points 2 years ago* (last edited 2 years ago) (5 children)

That's a bald-faced lie.

and it can produce copyrighted works.
E.g. I can ask it what a Mindflayer is and it gives a verbatim description from copyrighted material.

I can ask Dall-E "Angua Von Uberwald" and it gives a drawing of a blonde female werewolf. Oops, that's a copyrighted character.

[–] ayaya 7 points 2 years ago

I think you are confused. How does any of that make what I said a lie?
