It's certainly possible. It wouldn't even be hard, since early crawlers ran on radically smaller computers and the technology involved is now freely available as open source.
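Just to make the point concrete, a toy crawler is a few dozen lines of Python with `requests` and `BeautifulSoup`. The seed list and page cap here are placeholders, and a real crawler would also respect robots.txt and rate limits:

```python
# Toy breadth-first crawler: fetch a page, pull out its links, repeat.
# Assumes `requests` and `beautifulsoup4` are installed; seeds are placeholders.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    seen, queue = set(seed_urls), deque(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10, headers={"User-Agent": "toy-crawler"})
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The hard parts are scale and spam handling, not the fetching itself.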
If you limit it to only curated domains, you'll run into limited content and have trouble discovering novel information.
If you need information on Peruvian sand art, you can only find it if you've already added a site covering it to the index.
What you might consider is starting with a set of "seed" sites that you trust, and fanning out from there. Use something like PageRank to rank encountered sites, and augment that ranking with distance from a known-good domain (rough sketch below). A site with a lot of link activity that's also referenced by a site you find credible is probably better than one that's four steps removed.
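Roughly what I mean, using `networkx` over a domain-level link graph. The decay factor and the fallback distance are arbitrary numbers picked for illustration, not anything tuned:

```python
import networkx as nx

def rank_with_seed_bias(link_graph, seed_domains, decay=0.5):
    """Combine PageRank with link-distance from trusted seed domains.

    link_graph: nx.DiGraph whose nodes are domains and edges are links.
    seed_domains: the hand-picked domains you already trust.
    """
    base = nx.pagerank(link_graph)

    # Shortest hop count from any seed, following out-links.
    dist = {}
    for seed in seed_domains:
        if seed not in link_graph:
            continue
        for node, d in nx.single_source_shortest_path_length(link_graph, seed).items():
            dist[node] = min(d, dist.get(node, d))

    # A site four hops from anything trusted gets a much smaller boost
    # than one linked directly by a seed; unreachable sites get almost none.
    return {
        node: base[node] * (1 + decay ** dist.get(node, 10))
        for node in link_graph
    }
```

An alternative with the same flavor is personalized PageRank, where you pass the seed set as the `personalization` vector so the random walk keeps restarting from sites you trust.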
Human review of sites as they cross some ranking threshold is plausible, since it's easier to look at a list of sites that seem consistently okay and check whether they're slop than to enumerate the non-slop sites from scratch.
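The review step can be as simple as a queue of high-scoring sites nobody has eyeballed yet; the threshold here is a made-up number:

```python
def review_queue(scores, reviewed, threshold=0.01):
    """Sites whose combined score crossed the threshold but haven't been
    checked yet, highest score first; a human marks each one keep/slop."""
    return sorted(
        (site for site, score in scores.items()
         if score >= threshold and site not in reviewed),
        key=scores.get, reverse=True,
    )
```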
One of the better ways to learn which results your users actually find helpful is, ironically, a non-LLM neural net. Understanding which kinds of queries lead users to which domains helps you guide curation and rank the trustworthy sites people consistently pick higher.
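A rough sketch of that idea with scikit-learn; the click-log format and the hashed bag-of-words features are just assumptions for illustration, and any small learning-to-rank setup would do:

```python
# Small query->domain helpfulness model trained on click logs.
# Hypothetical sketch: assumes a log of (query, domain, clicked) rows.
from sklearn.feature_extraction import FeatureHasher
from sklearn.neural_network import MLPClassifier

def train_click_model(click_log):
    """click_log: iterable of (query, domain, clicked) tuples, clicked in {0, 1}."""
    hasher = FeatureHasher(n_features=2**16, input_type="string")
    X = hasher.transform(
        query.lower().split() + [f"domain={domain}"]
        for query, domain, _ in click_log
    )
    y = [clicked for _, _, clicked in click_log]
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=50)
    model.fit(X, y)
    return hasher, model
```

Feed the predicted helpfulness back into the ranking, and surface the domains people keep choosing as candidates for your curated list.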