this post was submitted on 09 Jul 2025
168 points (97.7% liked)

top 29 comments
[–] Pika@sh.itjust.works 36 points 2 weeks ago (3 children)

I mean, with a company as large as Cloudflare, I think they could /easily/ strong-arm this move by making blocking Google crawlers a default setting on websites. The traffic drop alone from that would make Google think twice about the whole ordeal. People who care about the Google search indexers can turn them on again, which will allow indexing again. But a default block would cause a lot of disruption on Google's side, and I don't think many people would go in and fix the setting until later on down the road.

[–] sbv@sh.itjust.works 25 points 2 weeks ago (2 children)

Cloudflare's customers probably wouldn't be on board with that. Google's properties provide a tonne of traffic to businesses. Doing anything to put that in jeopardy would probably have many of Cloudflare's customers looking for a new provider.

[–] Glitchvid@lemmy.world 22 points 2 weeks ago (1 children)

Google used to provide a ton of traffic; they hoard it all themselves now through AI and summaries of content. Eventually the balance of cost/benefit will shift and Google will suddenly see itself rejected from scraping, furthering the product death spiral.

[–] pupbiru@aussie.zone 10 points 2 weeks ago (1 children)

content is only 1 category of website

ecommerce drives all the advertising that funds content… it’s a much bigger market, and they don’t care about content scraping as long as you buy their product

[–] Glitchvid@lemmy.world 5 points 2 weeks ago

And the long-term plan there is to strangle sites and take 100% of the adrev spend for themselves, since users won't ever leave the Google site. Either way, Google as a search engine enters a death spiral; it's already bleeding users.

[–] Pika@sh.itjust.works 5 points 2 weeks ago (1 children)

Yeah, it would need to be advertised as a change and be a setting that has to be set. Just have it block abusive trackers by default, with Googlebot or whatever its crawler name is on that list, and a toggle to allow it again.

[–] osaerisxero@kbin.melroy.org 2 points 2 weeks ago (1 children)

Alternatively, you use the cloudflare money to sue the monopoly to decouple search and all other products, since blocking the AI trawlers shouldn't have any measurable impact in search rankings

[–] Pika@sh.itjust.works 1 points 2 weeks ago

I agree, but I think that hurting the company's bottom line is more effective than waiting on an archaic court system to do something. Just look at how slow the /current/ monopoly case on Google is going.

[–] 3dcadmin@lemmy.relayeasy.com 2 points 2 weeks ago

I disagree. Search is not Google's main focus these days. Blocking their crawlers will just make the AI searches better, which is exactly what Google wants.

[–] sugarfoot00@lemmy.ca 1 points 2 weeks ago

Exactly. It's not like google is the only indexer out there. And if this cuts into their search dominance, so much the better.

[–] Outwit1294@lemmy.today 20 points 2 weeks ago (2 children)

If I had to choose my favourite corporation, it would be Cloudflare. They at least do something good.

[–] r00ty@kbin.life 11 points 2 weeks ago (3 children)

There seems to be a line, so far as I can tell. If everything you need sits on the free tier, they're really good (well tbh their R2 storage is reasonably priced too). But once you stray into needing a paid tier, it apparently (I'm not there) quickly gets expensive as you're lured into ever higher tiers.

But yes, in general I don't mind cloudflare so much and do use their free (and R2 paid) services.

[–] DreamlandLividity@lemmy.world 3 points 2 weeks ago (1 children)

It would be fine if it was just "lured", but this made me very sceptical of cloudflare: https://robindev.substack.com/p/cloudflare-took-down-our-website

[–] r00ty@kbin.life 1 points 2 weeks ago

This was actually the story I had in mind when I wrote my comment. In my case, I'm using cloudflare for this mbin instance, another unrelated low traffic site, and R2 for the media on the instance. It's so small that it will never really escape their free tier.

But yeah, if you're doing something that is scaling up this is definitely something you need to be aware of.

[–] Outwit1294@lemmy.today 1 points 2 weeks ago

They have not given me any reason to hate them which is a win in my books. Apple was my favourite but their policy of region specific features is getting annoying.

[–] r00ty@kbin.life 1 points 1 week ago

I thought I'd reply to this with an update, because I saw how far the free tier goes, and it's pretty far.

I had another huuuuge influx of AI/other bots scraping my instance at top speed. Hundreds of requests per second, and it was putting some load on the postgres server.

What I found was a mixture of traffic. Some was coming from a handful of AS numbers (each hosting hundreds of large IP blocks) controlled by the same small handful of names. Those I was able to block outright by AS number.
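For anyone curious what "block by AS number" amounts to, here is a rough Python sketch of the idea. It is not Cloudflare's implementation or rule syntax. The prefix table, the AS numbers (private-use values) and the function names are all made up for illustration; a real setup would get the prefix-to-AS mapping from routing data.

```python
import ipaddress
from typing import Optional

# Hypothetical prefix-to-AS table. These are documentation prefixes and
# private-use AS numbers, not real scraper networks.
PREFIX_TO_ASN = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
    ipaddress.ip_network("203.0.113.0/24"): 64502,
}

# AS numbers to block outright (assumed values).
BLOCKED_ASNS = {64500, 64501}


def asn_for(ip: str) -> Optional[int]:
    """Return the AS number for this address, or None if it is not in the table."""
    addr = ipaddress.ip_address(ip)
    for network, asn in PREFIX_TO_ASN.items():
        if addr in network:
            return asn
    return None


def is_blocked(ip: str) -> bool:
    """Block a request when its source address belongs to a blocked AS."""
    return asn_for(ip) in BLOCKED_ASNS
```

The point is that one AS number covers every IP block that network announces, so a single entry replaces hundreds of individual range rules.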

But then I found a very large number of random requests coming in bursts (and definitely not humans), all on mobile or ISP customer blocks. I assume it's some kind of botnet being used? But they were all valid requests for posts and comments.

I looked at the custom ruleset on Cloudflare, and it's quite powerful. I settled on the following.

1: Allow known fediverse software by user agent (yes, the bots could eventually spoof these. But right now they are not).
2: Allow known instances by IP blocks.
3: Allow access to the fediverse inbox specifically. Which is where most inter-instance traffic goes.
4: Allow access to LOCAL users and well known services/other standard ActivityPub urls.
5: Everything else, for everyone else. Managed challenge.
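The ordering above can be sketched as a first-match rule chain. This is only a Python model of the logic, not Cloudflare's rule expression syntax, and every concrete value in it (user agent substrings, IP block, path prefixes, the `Request` type and `decide` function) is an assumption for illustration rather than the actual configuration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Request:
    ip: str
    user_agent: str
    path: str


# All values below are assumed placeholders, not the real rule contents.
FEDIVERSE_AGENTS = ("lemmy", "mbin", "mastodon")            # rule 1
KNOWN_INSTANCE_NETS = [ipaddress.ip_network("192.0.2.0/24")]  # rule 2
LOCAL_PREFIXES = ("/u/", "/.well-known/", "/nodeinfo")       # rule 4


def decide(req: Request) -> str:
    """Return "allow" on the first matching rule, else "managed_challenge"."""
    # 1: known fediverse software by user agent (spoofable, as noted).
    ua = req.user_agent.lower()
    if any(agent in ua for agent in FEDIVERSE_AGENTS):
        return "allow"
    # 2: known instances by IP block.
    addr = ipaddress.ip_address(req.ip)
    if any(addr in net for net in KNOWN_INSTANCE_NETS):
        return "allow"
    # 3: the inbox, where most inter-instance traffic goes.
    if req.path.rstrip("/").endswith("/inbox"):
        return "allow"
    # 4: local users and well-known/standard ActivityPub URLs.
    if req.path.startswith(LOCAL_PREFIXES):
        return "allow"
    # 5: everything else gets the challenge page.
    return "managed_challenge"
```

Because the allow rules run first, federation traffic never reaches the challenge, while an ordinary browser or bot requesting a post falls through to rule 5.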

The bot traffic just completely stopped dead. The fediverse traffic continued unfettered, and the traffic still coming in was legitimate (it's mostly me and a handful of others, so very little traffic).

All it adds is the interstitial page with the "are you human?" checkbox, which for most people checks automatically. The user moves on fine and can interact normally with the site. So for people it's a very minor inconvenience, and it stops bot traffic completely in its tracks.

What is annoying is that I could make this MUCH better with regex matches. But not only do they not allow free accounts to use regex (I understand), "Pro" users cannot either. It's only for Business or above... Business accounts are eye-wateringly expensive for a hobbyist!

[–] DreamlandLividity@lemmy.world 7 points 2 weeks ago (1 children)
[–] Outwit1294@lemmy.today 4 points 1 week ago (1 children)

I still think they are good. Isolated incidents like this are going to happen when you are doing business at such scales.

[–] DreamlandLividity@lemmy.world 2 points 1 week ago* (last edited 1 week ago) (1 children)

This did not sound like an isolated incident at all. You don't get sales responding to a legal/engineering issue by accident. It may have been unintended by leadership, if they put too much pressure on sales not realizing how it was corrupting the company, or the leadership may have tacitly approved of this. Hard to tell.

[–] Outwit1294@lemmy.today 2 points 1 week ago (1 children)

Are there reports from others about similar things?

[–] Flipper@feddit.org 2 points 1 week ago (1 children)

The article links to 4 incidents that are reported on Hacker News. So yes. At least 4.

[–] Outwit1294@lemmy.today 1 points 1 week ago

That is still isolated because they do at least a million times more business

[–] fubarx@lemmy.world 11 points 2 weeks ago (2 children)

Totally understandable.

If they're scanning to help send traffic to your website, that's cool. If they're scanning to generate summaries that won't send any traffic your way, no bueno.

Ultimately, it should be whatever most benefits users.

[–] acosmichippo@lemmy.world 6 points 2 weeks ago* (last edited 2 weeks ago)

but there also needs to be incentive for sites to host content. if it all gets hijacked by search engines that isn't sustainable.

[–] Outwit1294@lemmy.today 3 points 2 weeks ago

No, things should not benefit users, they should benefit the creator of the original content.

[–] BaroqueInMind@piefed.social 9 points 2 weeks ago* (last edited 2 weeks ago)

Additionally, Cloudflare's initiative faces criticism from those who "worry that academic research, security scans, and other types of benign web crawling will get elbowed out of websites as barriers are built around more sites" through Cloudflare's blocks and paywalls, the WSJ reported.

The fuck? Since when is a bot designed to enumerate your network weaknesses to sell to Russian/Chinese/US hacking groups a bad thing to block? Fuck the WSJ for even putting that dumb as fuck take on the internet for other idiots to think about.

NO, it's not a good fucking idea to allow the equivalent of an incessant door-to-door salesman into your home to take notes of everything you own and sell them to a random motherfucker somewhere else you don't know.

That behavior is fucking weird and shouldn't be tolerated. Cloudflare arbitrarily blocking that network traffic for you is a good thing.

[–] Zwuzelmaus@feddit.org 5 points 2 weeks ago

Google has existed long enough now. They are allowed to disappear.

[–] vxx@lemmy.world 4 points 1 week ago

Why did they copy the soundcloud logo?

[–] 2xsaiko@discuss.tchncs.de 0 points 2 weeks ago

Google’s bot is fine in my book, their crawler doesn’t absolutely blast your server with web requests like other AI crawlers do. (Speaking of, I need to update my list of netblocks and UAs to get iocaine-holed.)

That said, two evil megacorps potentially fighting? I hope they kill each other.