Reddit has a new AI training deal to sell user content
(www.theverge.com)
It's all but guaranteed. Reminds me of this Computerphile video: https://youtu.be/WO2X3oZEJOA?t=874 TL;DW: there were "glitch tokens" in GPT (and therefore ChatGPT) that undeniably came from Reddit usernames.
Note, there's no proof that these Reddit usernames were in the training data (and there are even reasons to assume that they weren't; watch the video for context), but there's no doubt that OpenAI had already scraped Reddit data at some point prior to training, probably mixed in with the rest of their text data. I see no reason to assume they completely removed all Reddit text before training. The video suggests reasons and evidence that they removed certain subreddits, not all of Reddit.