submitted 1 year ago by dl007@lemmy.ml to c/technology@lemmy.ml
[-] Zetaphor@zemmy.cc 14 points 1 year ago* (last edited 1 year ago)

Quoting this comment from the HN thread:

On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.

While it strikes me as perfectly plausible that the Books2 dataset contains Silverman's book, this quote from the complaint seems obviously false.

First, even if the model never saw a single word of the book's text during training, it could still learn to summarize it from reading other summaries which are publicly available, such as the book's Wikipedia page.

Second, it's not even clear to me that a model which saw only the text of a book during training, but no descriptions or summaries of it, would be particularly good at producing a summary.

We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT's training data) but for which there is little discussion online. If the ability to summarize comes from having the book itself in the training data, the model should be just as able to summarize the rare book as Silverman's.

I chose "The Ruby of Kishmoor" at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn't even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn't know anything about the story and it isn't part of its training data.

If ChatGPT's ability to summarize Silverman's book comes from the book itself being part of the training data, why can it not do the same for other books?

As the commenter points out, I could recreate this result using a smaller offline model and an excerpt from the Wikipedia page for the book.
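As a toy illustration of the first point, a summary can be produced from publicly available description text without the book itself. Everything below is fabricated for the example (the "excerpt" is a made-up stand-in for a Wikipedia paragraph), and a trivial frequency-scoring script stands in for a real model:

```python
import re
from collections import Counter

def extractive_summary(text, max_sentences=2):
    """Pick the highest-scoring sentences, where a sentence's score is the
    summed corpus frequency of its words. A crude stand-in for a summarizer."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(sentences,
                    key=lambda s: -sum(freq[w] for w in re.findall(r"\w+", s.lower())))
    top = set(scored[:max_sentences])
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)

# Placeholder description text, NOT a real Wikipedia excerpt.
excerpt = (
    "The Ruby of Kishmoor is a pirate adventure story. "
    "A timid Quaker merchant comes into possession of a mysterious jewel. "
    "Pirates pursue the merchant through the night to recover the jewel. "
    "In the end the jewel changes his fortunes entirely."
)
print(extractive_summary(excerpt, max_sentences=2))
```

The point is only that the input here is a public description, not the book's text, which is exactly the situation the quoted comment describes.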

[-] patatahooligan@lemmy.world 8 points 1 year ago

You are treating publicly available information as free from copyright, which is not the case. Wikipedia content is covered by the Creative Commons Attribution-ShareAlike License 4.0. Images might be covered by different licenses. Online articles about the book are also covered by copyright unless explicitly stated otherwise.

[-] Zetaphor@zemmy.cc 3 points 1 year ago* (last edited 1 year ago)

My understanding is that the copyright applies to reproductions of the work, which this is not. If I provide a summary of a copyrighted summary of a copyrighted work, am I in violation of either copyright because I created a new derivative summary?

[-] patatahooligan@lemmy.world 4 points 1 year ago

Not a lawyer so I can't be sure. To my understanding a summary of a work is not a violation of copyright because the summary is transformative (serves a completely different purpose to the original work). But you probably can't copy someone else's summary, because now you are making a derivative that serves the same purpose as the original.

So here are the issues with LLMs in this regard:

  • LLMs have been shown to produce verbatim or almost-verbatim copies of their training data
  • LLMs can't trace where their output came from, so they can't tell the user whether the output closely matches an existing work, and if it does, what license it is distributed under
  • You can argue that by its nature, an LLM is only ever producing derivative works of its training data, even if they are not the verbatim or almost-verbatim copies I already mentioned
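The first bullet can at least be checked mechanically. A common (assumed here, not taken from the thread) heuristic is to look for long word sequences shared between a model's output and a training document; all the strings below are made up for the example:

```python
def shared_ngrams(output, source, n=8):
    """Return every n-word sequence that appears in both texts.
    A long shared n-gram is a hint of near-verbatim reproduction."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(output) & ngrams(source)

# Fabricated example texts for illustration only.
training_text = ("it was the best of times it was the worst of times "
                 "it was the age of wisdom")
model_output = ("the model wrote: it was the best of times "
                "it was the worst of times indeed")

overlaps = shared_ngrams(model_output, training_text, n=8)
print(len(overlaps) > 0)  # an 8-word run is shared, so this prints True
```

Note the third bullet is harder: derivation without verbatim overlap is exactly what this kind of n-gram check cannot detect.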
[-] barsoap@lemm.ee 4 points 1 year ago* (last edited 1 year ago)

LLMs have been shown to produce verbatim or almost-verbatim copies of their training data

That's either overfitting, which means the training went wrong, or plain chance. Gazillions of bonkers court cases over "did the artist at some point in their life hear a particular melody" come to mind. Great. Now that that's flanked with allegations of eidetic memory, we have reached peak capitalism.

[-] crackgammon@lemmy.world 1 points 1 year ago

Don't all three of those points apply to humans?

[-] Banzai51@midwest.social 4 points 1 year ago

Aren't summaries and reviews covered under fair use? Otherwise, newspapers have been violating copyright for hundreds of years.

[-] barsoap@lemm.ee -1 points 1 year ago

Second, it’s not even clear to me that a model which saw only the text of a book during training, but no descriptions or summaries of it, would be particularly good at producing a summary.

Summarising stuff is literally all ML models do. It's their bread and butter: see what's out there and categorise it into a (ridiculously) high-dimensional semantic space. Put a bit flippantly: you shouldn't be surprised if it gives you the same synopsis for both Dances with Wolves and Avatar, because they are indeed very similar stories, occupying roughly the same position in that space. If you ask not for a summary but a full screenplay, it's going to invent random details to fill in what it ignored while categorising; the results will again look similar if you squint right because, at their core, they're the same story.

It's not even really necessary for those models to learn the concept of "summary" -- only that, in a prompt, it means "write a 200-word output instead of a 20,000-word one". The model produces a longer or shorter description of that position in space, hallucinating more or fewer details. It's no different from police interviewing you as a witness to a car accident: they have to be careful not to prompt you wrong, e.g. by assuming you saw certain things, or you, too, will come up with random bullshit (and believe it). It's all a reconstructive process, generating a concrete thing from an abstract representation. There's no art to summarising; it's inherent in how semantic abstraction works.
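The "same position in semantic space" idea can be sketched with cosine similarity. The three-number "embeddings" below are fabricated stand-ins for real high-dimensional vectors, chosen only to illustrate two similar stories sitting close together and an unrelated one sitting far away:

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, not real model embeddings.
dances_with_wolves = [0.90, 0.80, 0.10]
avatar             = [0.88, 0.82, 0.12]
unrelated_story    = [0.10, 0.20, 0.90]

print(cosine(dances_with_wolves, avatar))          # close to 1.0
print(cosine(dances_with_wolves, unrelated_story)) # much lower
```

Two works at nearly the same position get nearly the same "synopsis" read back out, which is the effect described above.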

this post was submitted on 10 Jul 2023
421 points (94.7% liked)
