Archive link

Google is now the only search engine that can surface results from Reddit, making one of the web’s most valuable repositories of user-generated content exclusive to the internet’s already dominant search engine.

If you use Bing, DuckDuckGo, Mojeek, Qwant or any other alternative search engine that doesn’t rely on Google’s indexing and search Reddit by using “site:reddit.com,” you will not see any results from the last week. DuckDuckGo is currently turning up seven links when searching Reddit, but provides no data on where the links go or why, instead only saying that “We would like to show you a description here but the site won't allow us.” Older results will still show up, but these search engines are no longer able to “crawl” Reddit, meaning that Google is the only search engine that will turn up results from Reddit going forward. Searching for Reddit still works on Kagi, an independent, paid search engine that buys part of its search index from Google.

The news shows how Google’s near monopoly on search is now actively hindering other companies’ ability to compete at a time when Google is facing increasing criticism over the quality of its search results. And while neither Reddit nor Google responded to a request for comment, it appears that the exclusion of other search engines is the result of a multimillion-dollar deal that gives Google the right to scrape Reddit for data to train its AI products.

“They’re [Reddit] killing everything for search but Google,” Colin Hayhurst, CEO of the search engine Mojeek told me on a call. Hayhurst tried contacting Reddit via email when Mojeek noticed it was blocked from crawling the site in early June, but said he has not heard back.

“It's never happened to us before,” he said. “Because this happens to us, we get blocked, usually because of ignorance or stupidity or whatever, and when we contact the site you certainly can get that resolved, but we've never had no reply from anybody before.”

As Jason wrote yesterday, there’s been a huge increase in the number of websites that are trying to block bots that AI companies use to scrape them for training data by updating their robots.txt file. Robots.txt is a text file which instructs bots whether they are or are not allowed to access a website. Googlebot, for example, is the crawler or “spider” that Google uses to index the web for search results. Websites with a robots.txt file can make an exception to give Googlebot access, and not other bots, so they can appear in search results that can generate a lot of traffic. Recently Google also introduced Google-Extended, a bot which crawls the web specifically to improve its Gemini apps, so websites can allow Googlebot to crawl but block the crawler Google uses to power its generative AI products.
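To make the exception mechanism concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The rules file, the example.com URL and the list of crawler names are illustrative assumptions, not any real site's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow Google's search crawler, block its
# AI-training user agent, and block every other bot by default.
EXAMPLE_RULES = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_RULES.splitlines())

# Placeholder URL, used only to have a path to test against.
url = "https://example.com/r/technology/"

# Ask the parser what each crawler is permitted to fetch.
results = {
    agent: parser.can_fetch(agent, url)
    for agent in ("Googlebot", "Google-Extended", "MojeekBot")
}

for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

The parser only reports what the file asks for. Whether a crawler honors the answer is up to the crawler.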

Robots.txt files are just instructions, which crawlers can and have ignored, but according to Hayhurst Reddit is also actively blocking its crawler.

Reddit has been upset about AI companies scraping the site to train large language models, and has taken public and aggressive steps to stop them from continuing to do so. Last year, Reddit broke a lot of third-party apps beloved by the Reddit community when it started charging to access its API, making many of those third-party apps too expensive to operate. Earlier this year, Reddit announced that it signed a $60 million deal with Google, allowing it to license Reddit content to train its AI products.

Reddit’s robots.txt used to include a bunch of jokes, like forbidding the robot Bender from Futurama from scraping it (User-Agent: bender, Disallow: /my_shiny_metal_ass) and specific pages that search engines are and are not allowed to access. “/r*.rss/” was allowed, while “/login” was not allowed.

Today, Reddit’s robots.txt is much simpler and more strict. In addition to a few links to Reddit’s new “public content policies,” the file simply includes the following instruction:

User-agent: *
Disallow: /

Which basically means: no user-agent (bot) should scrape any part of the site. “Reddit believes in an open internet, but not the misuse of public content,” the updated robots.txt file says.
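As a rough check of that reading, the two-line rule quoted above can be fed to Python's standard-library parser. This is only a sketch: the crawler names and the URL are assumptions chosen for illustration, and the rule is copied from the article, not fetched from Reddit.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule as quoted in the article.
BLANKET_RULES = [
    "User-agent: *",
    "Disallow: /",
]

def is_allowed(agent: str, url: str) -> bool:
    """Return True if the blanket rule would let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(BLANKET_RULES)
    return parser.can_fetch(agent, url)

# Illustrative crawler names; any user agent gets the same answer.
crawlers = ["Googlebot", "bingbot", "DuckDuckBot", "MojeekBot"]
url = "https://www.reddit.com/r/technology/"

verdicts = {agent: is_allowed(agent, url) for agent in crawlers}
for agent, allowed in verdicts.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

Read literally, the file disallows every crawler, Googlebot included, so any access a particular search engine keeps has to come from an arrangement outside robots.txt.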

“Unfortunately, we’ve seen an uptick in obviously commercial entities who scrape Reddit and argue that they are not bound by our terms or policies,” Reddit said in June. “Worse, they hide behind robots.txt and say that they can use Reddit content for any use case they want. While we will continue to do what we can to find and proactively block these bad actors, we need to do more to protect Redditors’ contributions. In the next few weeks, we’ll be updating our robots.txt instructions to be as clear as possible: if you are using an automated agent to access Reddit, you need to abide by our terms and policies, and you need to talk to us.”

Reddit appears to have updated its robots.txt file around June 25, after Mojeek’s Hayhurst noticed its crawler was getting blocked. That announcement said that “good faith actors – like researchers and organizations such as the Internet Archive – will continue to have access to Reddit content for non-commercial use,” and that “We are selective about who we work with and trust with large-scale access to Reddit content.” It also links to a guide on accessing Reddit data which plainly states Reddit considers “Search or website ads” as a “commercial purpose” and that no one can use Reddit data without permission or paying a fee.

Google did not respond to a request for comment, but its announcement of the company’s deal with Reddit points out not only how valuable Reddit is for training AI, but what many of us already know: As Google Search gets increasingly worse in turning up relevant search results, one of the best ways to still get them is to add “Reddit” to your search queries, directing Google to a site where real humans have been writing advice and recommendations for almost two decades. There are a lot of ways to illustrate how useful Reddit can be, but I’m not going to do better than this video:

https://www.youtube.com/watch?v=tcJcw55zIcc

The fact that Google is the only search engine that leads users to that information now, and that it is apparently the result of a $60 million deal around AI training data, is another example of the unintended consequences of the indiscriminate scraping of the entire internet in order to power generative AI tools.

“We've always crawled respectfully and we've done it for 20 years. We're verified on Cloudflare, we don't train AI, we're like genuine, traditional genuine searching, we don't do ‘answer engine’ stuff,” Hayhurst said. “Answer engine” is Perplexity’s name for its AI-powered search engine. “The whole point about Mojeek, our proposition is that we don't do any tracking. But people also use us because we provide a completely different set of results.”

Reddit’s deal with Google, Hayhurst said, makes it harder to offer these alternative ways of searching the web.

“It's part of a wider trend, isn't it?” he said. “It concerns us greatly. The web has been gradually killed and eroded. I don't want to make too much of a generalization, but this didn't help the small guys.”

monke-beepboop

top 47 comments
[-] dannoffs@hexbear.net 44 points 3 months ago

Crazy that google paid for rights after reddit's gone to shit. 3 years ago it would have been great, but these days I can't remember the last time I clicked on a reddit link for an answer.

[-] mkultrawide@hexbear.net 54 points 3 months ago

It's because the Google algo is broken, and a good percentage of Internet users search "thing you want to know reddit" for everything now.

[-] Ideology@hexbear.net 38 points 3 months ago

Would be neat if lemmy could form similar niche communities where people talk about their hobbies and scientific interests.

[-] comrade_pibb@hexbear.net 27 points 3 months ago

pog poop balls lemmy

hit search

[-] Robert_Kennedy_Jr@hexbear.net 35 points 3 months ago* (last edited 3 months ago)

It's a little unnerving how many Hexbear posts/users show up using Yandex.

[-] EllenKelly@hexbear.net 27 points 3 months ago

Yandex is legitimately great, especially image searching. It simply breaks Pinterest and the like so you can actually find high-res copies of what you're looking for

[-] coolusername@lemmy.ml 4 points 3 months ago

And it's only like that because of the CIA. Around 2016 google started prioritizing MSM

[-] Ideology@hexbear.net 26 points 3 months ago

I'm starting to wonder where people even get information anymore. Are libraries making a comeback?

[-] Owl@hexbear.net 18 points 3 months ago

Discord, and I'm not happy about it.

[-] mathemachristian@hexbear.net 10 points 3 months ago

Wtf how do you search discord?

[-] Owl@hexbear.net 14 points 3 months ago

You search for the topic you're interested in, find out there's a discord about it, then open their discord, then check all their pins and try discord's search function.

As I said, I'm not happy about it.

[-] FunkyStuff@hexbear.net 7 points 3 months ago

Public discord servers should obviously not be a thing. It can be nice to have a place for people who are just getting started to casually be able to talk with more experienced people in some domain, I've definitely found it useful, but that's something that can easily be achieved with a web forum. So public Discord servers just come with all the downsides of a forum: no expectation of privacy, power tripping moderators, bad search function, etc; then with none of the upsides: searchable by an external search engine, decentralized, familiar UI, configurability. And that's without getting into all the issues with Discord as a company.

[-] InevitableSwing@hexbear.net 17 points 3 months ago

I'm starting to wonder where people even get information anymore.

Facebook, Twitter, and other cesspools.

[-] glans@hexbear.net 16 points 3 months ago

i guess we can all start using our individual imaginations again

[-] Thordros@hexbear.net 14 points 3 months ago

Facebook, Twitter, and other cesspools.

i guess we can all start using our individual imaginations again

You're both correct! same-picture

[-] Antiwork@hexbear.net 7 points 3 months ago

The search engine was the imaginations we made along the way

[-] BynarsAreOk@hexbear.net 9 points 3 months ago

YouTube is a primary source now. A lot of people take information from some [insert cracker specialist on random topic here], especially about social/economic topics but really everything.

And to be fair there are also some actually decent and informative channels, which makes it even worse, because it's not like everything on the internet/YT is false or wrong, it's just you picking idiot grifters as your source.

[-] glans@hexbear.net 40 points 3 months ago

we are really in need of a viable FLOSS search engine. One that can do its own indexing instead of repackaging google results like searx does. Maybe the spidering could be distributed somehow so the small self hosters could benefit from it while also being able to apply their own standards, priorities, sorting etc.

[-] hypercracker@hexbear.net 32 points 3 months ago* (last edited 3 months ago)

unfortunately search is expensive in a way that FLOSS does not solve, it requires a lot of hosting infrastructure and boring volunteer labor to fine-tune results to combat spam (and spam might even benefit from looking at the FLOSS rules that filter it)

[-] glans@hexbear.net 4 points 3 months ago

(and spam might even benefit from looking at the FLOSS rules that filter it)

idk i feel like that might be a problem which is created or at least greatly exaggerated by monopolies. if there was a diversity of search engines it would be much more difficult to do shitty SEO on all of them at the same time. You'd need a whole team combing through repo hosting sites and mailing lists to figure it out.

[-] Owl@hexbear.net 14 points 3 months ago

Crawling could be distributed and shared, but indexing is a bigger problem.

All the things you'd want to be different on some sort of federated search platform (standards, priorities, and sorting, as you say) are things that require different indexing. But the index is the big expensive part that would most need to be shared.

[-] thethirdgracchi@hexbear.net 12 points 3 months ago

https://search.marginalia.nu/ is open source, extremely good, and insanely resource efficient.

[-] christian@hexbear.net 5 points 3 months ago

Thanks for this recommendation, I hadn't heard of this. Looks promising.

[-] glans@hexbear.net 3 points 3 months ago

Oh that looks cool. Do you run an install of it or are you using the install on their main page? Are there other instances?

The hw requirements aren't prohibitive. I mean it's not nothing, maybe a few hundred upfront and then the connection. Well within reach especially with support of an existing organization who'd be willing to physically house it. I guess SSDs would be the largest part of the cost.

an x86-64 machine, have at least 16GB of RAM, and at least 4 cores. It is designed to run on physical hardware, and will likely be very expensive to run in the cloud.

Crawling requires a decent network connection, ideally at least 1 Gbps. 100 Mbps will work, but will be slower.

Storage requirements are highly dependent on the size of the index, and the number of documents being indexed. For 100,000 documents, you can probably get away with 2 TB of SSD storage, and 4 TB of mechanical storage for the crawl data.

I don't know how far 100k documents gets you. It doesn't sound like much if you are going for the whole internet but if you are curating a more narrow subset it could be enough.

This page has their philosophy and towards the bottom of the page, links to similar projects.

[-] thethirdgracchi@hexbear.net 2 points 3 months ago

I just use their search page, not trying to run it on my own. But yeah it's actually like possible for you to run, as opposed to something like Google.

[-] emizeko@hexbear.net 34 points 3 months ago

death to google
death to reddit
death to america
a century of humiliation upon the first world

[-] VeganicTankie@lemmygrad.ml 29 points 3 months ago

Wonders of capitalist innovation

[-] Plasma@lemmy.ml 21 points 3 months ago* (last edited 3 months ago)

Couldn't a crawler just add a bypass for reddit's robots.txt file?

[-] comrade_pibb@hexbear.net 15 points 3 months ago

Yeah, can't they just ignore it?

[-] fox@hexbear.net 13 points 3 months ago

If you're only making a few requests, yes, but it's very easy to detect something like an indexing crawler.

[-] O__O@hexbear.net 5 points 3 months ago

Or just do what bing did when it started and copy Google’s results directly

[-] Plasma@lemmy.ml 2 points 3 months ago

Or maybe they could make a browser extension like Bring Back YouTube Dislikes that will send the relevant metadata of the page back for indexing.

[-] Grandpa_garbagio@hexbear.net 16 points 3 months ago* (last edited 3 months ago)

Anyone notice that Google Reddit searches got way worse like in the past few months? Used to be able to get away with "search query" but now it often gives whatever it considers synonyms even when in quotes.

Often the first few results are just completely unrelated. Like recently it considered a proper name of a city the synonym of the word district for me lol

[-] Ideology@hexbear.net 13 points 3 months ago

@yogthos@lemmygrad.ml this looks like something you would post.

[-] yogthos@lemmygrad.ml 8 points 3 months ago

that's very much up my alley, thanks for tagging :)

[-] InevitableSwing@hexbear.net 10 points 3 months ago

Google is now the only search engine that can surface results from Reddit...

I'm terrible at proofreading so if I can nearly immediately find a mistake - the website isn't even trying.

[-] PointAndClique@hexbear.net 5 points 3 months ago* (last edited 3 months ago)

What's the mistake? Is it surface/service? Because I've seen surface used increasingly to mean 'reveal/show up/present' (i.e. 'bring to the surface')

[-] MedicareForSome@hexbear.net 4 points 3 months ago

What is the mistake?

[-] hexaflexagonbear@hexbear.net 9 points 3 months ago

Lol terrible

[-] Titou@hexbear.net 7 points 3 months ago

All my homies hate reddit.

[-] MedicareForSome@hexbear.net 6 points 3 months ago

Searx is an alternative meta-search engine that can include google results. If you don't want to host an instance you can find one here: https://searx.space/

I like https://searx.work/ personally.

[-] Tomorrow_Farewell@hexbear.net 1 points 3 months ago

A thing I found annoying about searx is that its instances sometimes stop working, so I have to switch to other ones.

[-] Lemmygradwontallowme@hexbear.net 5 points 3 months ago

Ah putain...

[-] HexReplyBot@hexbear.net 3 points 3 months ago

I found a YouTube link in your post. Here are links to the same video on alternative frontends that protect your privacy:

[-] AndJusticeForAll@hexbear.net 2 points 3 months ago

Internet slowly bundling together.

this post was submitted on 25 Jul 2024
102 points (100.0% liked)

technology
