this post was submitted on 14 Sep 2025
In diffusion, this has already been done. Most models made after SD1.5 have a "handpicked" input dataset. I guess it's because most of SD1.5's input was garbage quality, which carried over to the output.
I'll have to check that out at some point. Models like Gemini and GPT take up all the space in the room, and it's easy to forget that there are others.
The other person was talking about image generation models, not LLMs. I think the only LLMs with super curated input sets are tiny and less useful. Unfortunately, training an LLM takes a lot of data, so it's hard to find enough good-quality data if you're curating it.