this post was submitted on 29 Oct 2025
25 points (93.1% liked)

PieFed Meta

1941 readers
53 users here now

Discuss PieFed project direction, provide feedback, ask questions, suggest improvements, and engage in conversations related to the platform organization, policies, features, and community dynamics.

Wiki

founded 2 years ago
MODERATORS
 

cross-posted from: https://poptalk.scrubbles.tech/post/3263324

Sorry for the alarming title, but admins, for real, go set up Anubis.

For context, Anubis is essentially a gatekeeper/rate limiter for small services. From them:

(Anubis) is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.

It puts forward a challenge that must be solved in order to gain access, and judges how trustworthy a connection is. The vast majority of real users will never notice it, or will only see a small delay the first time they access your site. Even smaller scrapers may get by relatively easily.

Big scrapers though - the AI companies and model trainers - get hit with computational problems that waste their compute before they're let in. (Trust me, I worked for a company that did "scrape the internet", and compute is expensive and a constant worry for them, so win-win for us!)

Anubis ended up taking maybe 10 minutes to set up. For Lemmy hosters you literally just point your UI proxy at Anubis and point Anubis to Lemmy UI. Very easy and slots right in, minimal setup.
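Roughly, that means your nginx location for the web UI points at Anubis instead of lemmy-ui, and Anubis itself is told to pass traffic on to lemmy-ui. A minimal sketch of the nginx side (anubis:8080 matches the config I paste further down this thread; Anubis' own target setting is what points at your lemmy-ui container):

    # browser traffic goes to Anubis; Anubis is configured to proxy
    # whatever passes its challenge on to the lemmy-ui container
    location / {
        proxy_pass http://anubis:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }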

These graphs are since I turned it on less than an hour ago. I have a small instance, only a few people, and immediately my CPU usage has gone down and my requests per minute have gone down. I have already had thousands of requests challenged, I had no idea I was being scraped this much! You can see they're backing off in the charts.

(FYI, this only stops the web requests, so it does nothing to the API or federation. Those are proxied elsewhere, so it really does only target web scrapers).

top 35 comments
[–] rimu@piefed.social 15 points 1 month ago* (last edited 1 month ago) (2 children)

I tried 2 times, for hours each time, to set this up in a way that does not break federation, the api, or the web ui. It's not as easy as the OP thinks and problems aren't always obvious immediately.

Any PieFed instance owners out there who are getting flooded with requests from bots - send me a PM and I'll let you know how I solved it for piefed.social.

I'd rather not post it publicly because it could be circumvented if 'they' knew how I'm doing it.

[–] scrubbles@poptalk.scrubbles.tech 6 points 1 month ago (1 children)

It isn't supposed to sit in front of everything, only the webui. The API and federation endpoints should pass through your proxy as they always have. If you are following the standard Lemmy setup it should be a one line change in your nginx conf.
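Concretely, it's the one proxy_pass line for the web UI that changes - just a sketch, with placeholder container names and ports:

    # before: browser traffic goes straight to the Lemmy UI container
    # proxy_pass http://lemmy-ui:1234;
    # after: browser traffic goes through Anubis, which passes it on to lemmy-ui
    proxy_pass http://anubis:8080;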

[–] rimu@piefed.social 4 points 1 month ago (2 children)
[–] scrubbles@poptalk.scrubbles.tech 1 points 1 month ago (1 children)

Yeah I guess I shouldn't have bothered showing people things that worked well for me

[–] wjs018@piefed.wjs018.xyz 6 points 1 month ago (1 children)

I'm confused why you are talking about Lemmy containers in a PieFed-focused community. PieFed doesn't split the web UI out into a separate container, by design. So the solutions and the difficulty of implementing them are very different.

[–] scrubbles@poptalk.scrubbles.tech 1 points 1 month ago* (last edited 1 month ago) (1 children)

Not terribly. I posted this because I think it would help any fediverse site. It's just a proxy that sits in front of whatever traffic you choose to send to it. So whatever routes go to the web client (probably /), you just forward to Anubis, which forwards on to PieFed. Lemmy, PieFed, Mastodon - any web-based app, that's how you would do it. You can be granular and go route by route, or do all of it. It's not hard-coded for any site.

My puny site was getting hundreds of heavy requests per minute from bots before I set this up. I can't imagine what all fediverse sites are dealing with. I wanted to let fediverse admins know because I'm going to see a noticeable reduction in the bills I pay to host my instance, and I believe it would help other admins too, which in turn will make the fediverse stronger.

[–] rimu@piefed.social 6 points 1 month ago (2 children)

Yes, we know.

This is a blog post about Anubis I wrote a few months back, when I thought I had it working https://join.piefed.social/2025/07/09/an-anubis-config-for-piefed/

After writing that post I took another approach and moved the logic into nginx instead, as the Anubis configuration language was impossible to debug. That seemed fine, but then a different weird Anubis bug - one that had been languishing in their issue queue for months with no solution - hit me, and I gave up.

It's just not ready. It's bad software. Their documentation is very misleading. I hope no one else loses as many days as I did.

See, this message would have been better at the beginning of this thread - it could have made for a much better dialogue between us.

I see in your script you're doing the filtering at Anubis:

request.path.startsWith("/api/")

I took the opposite approach: I filter at my proxy/nginx and only send web traffic to Anubis. With Lemmy, since it's two containers for web/API, it looks like this:

    set $proxpass "http://anubis:8080/"; # this was the webui; now it handles web traffic, passing into lemmy downstream
    if ($http_accept ~ "^application/.*$") {
        set $proxpass "http://lemmy:8536/"; # api
    }
    if ($request_method = POST) {
        set $proxpass "http://lemmy:8536/"; # api
    }

This way everything that goes to Anubis is 100% okay for it to handle. Then, if there are endpoints that may not work behind it (someone called out the OAuth flow), you can filter those out to go directly to the UI.
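Fully written out, the location block ends up looking something like this - a sketch, not my config verbatim, and the /oauth/ location at the end is just a hypothetical example of an endpoint you might route straight to the UI:

    location / {
        set $proxpass "http://anubis:8080/"; # web traffic goes through Anubis first
        if ($http_accept ~ "^application/.*$") {
            set $proxpass "http://lemmy:8536/"; # JSON/API clients skip Anubis
        }
        if ($request_method = POST) {
            set $proxpass "http://lemmy:8536/"; # POSTs (federation etc.) skip Anubis
        }
        # note: proxy_pass with a variable may need a "resolver" directive
        # (e.g. Docker's embedded DNS at 127.0.0.11)
        proxy_pass $proxpass;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # hypothetical example: send an endpoint that breaks behind Anubis straight to the UI
    location /oauth/ {
        proxy_pass http://lemmy-ui:1234/; # your UI container/port here
    }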

For PieFed, even if you don't have a proxy in front now (which honestly would surprise me), I think it'd be better to add one and then filter at that level. Let Anubis do what it does best, and let Traefik/nginx/Caddy/whatever do what it does best and route traffic.

For safety you could do the reverse - allow everything and cut endpoints one by one.
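As a rough sketch of what that could look like for PieFed, assuming the shared federation inbox lives at /inbox and the API under /api/, with piefed:5000 standing in for wherever your PieFed app actually listens:

    # federation and API traffic bypasses Anubis entirely
    location /inbox {
        proxy_pass http://piefed:5000;
    }
    location /api/ {
        proxy_pass http://piefed:5000;
    }

    # everything else (the web pages scrapers actually hit) goes through Anubis,
    # and Anubis proxies whatever passes the challenge on to PieFed
    location / {
        proxy_pass http://anubis:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }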

[–] Blaze@piefed.zip 1 points 1 month ago

It’s just not ready. It’s bad software. Their documentation is very misleading. I hope no one else loses as many days as I did.

That's unfortunate

[–] Blaze@piefed.zip 1 points 1 month ago

It's not really a Wendy

[–] sga@piefed.social 2 points 1 month ago

If it helps, I think the slrpnk Lemmy instance did set up Anubis and they still federate. Not sure if there are Lemmy-specific problems though.

[–] Blaze@piefed.zip 5 points 1 month ago (1 children)

I'm probably going to save a good chunk of my hosting costs because of this, I'm happy to share it out. And happy to make big tech pay a bit more to access our content.

[–] sga@piefed.social 3 points 1 month ago (1 children)

I am pro setting up Anubis, but only if it does not break some "intended" behaviour (federation and RSS, for example).

I have it in front of Lemmy, but if PieFed is similar at all, you can proxy just the web requests through Anubis, which is where all of the scraping traffic is. The API and federation would not be affected - that's how I set mine up.

[–] hendrik@palaver.p3x.de 3 points 1 month ago* (last edited 1 month ago) (2 children)

To be honest, I'm not really a fan of fighting the AI scraper war. It's ultimately the same dynamics which turned the open internet into what it is today... 10 years ago I could just look up stuff... google whether my Thinkpad supports more or faster RAM than what the specs said, and I'd find some nice Reddit thread. And now it's all walled off, information isn't shared freely anymore, and I'm having a hard time. And we do the same thing.

But... We do need to fight. I just wish there was some solution which doesn't add to the enshittification of the internet. I've been hit by them as well and the database load made the entire server grind to a halt. I've added some firewall rules and deny lists to my reverse proxy to specifically block the AI companies and so far it's looking good. I'll try to postpone solutions like Anubis until it's unavoidable. And I guess it won't work for everything. We need machine-readable information, APIs, servers to talk to each other...
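The deny-list part can be fairly simple if you only go after the crawlers that identify themselves - a rough sketch of the idea (not my exact rules; the user agents are just common examples of AI crawlers, and both blocks go at the http level of the nginx config):

    # flag well-known AI crawler user agents...
    map $http_user_agent $ai_scraper {
        default        0;
        ~*GPTBot       1;
        ~*ClaudeBot    1;
        ~*Bytespider   1;
        ~*Amazonbot    1;
    }

    server {
        # ...and refuse them before they reach the application
        if ($ai_scraper) {
            return 403;
        }
        # ... rest of the server config
    }

Of course that only catches the polite ones; as discussed further down, plenty of scrapers fake their user agents.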

But that's just my 2 cents and I'm going to change my opinion and countermeasures once necessary.

[–] scrubbles@poptalk.scrubbles.tech 5 points 1 month ago (1 children)

This still lets scrapers through, but it's more of an artificial throttle. There are several knobs and dials it looks like I can turn to make it more or less permissive, but I have to say that over 90% of my traffic was coming from bot farms, and I, a small instance admin, was paying for all of that in database queries and compute. I think this is a good middle ground. You can still access it, if you're willing to really work for it. From what I see, most scrapers give up immediately rather than spend the compute.

[–] hendrik@palaver.p3x.de 2 points 1 month ago (1 children)

Hmmm. I mean, my instance is small; almost all of my traffic comes from federation. There will be like 50 requests in the log to forward some upvotes, and then one or two from a user or crawler. Most of them seem to behave; it was just Alibaba and Tencent who did proper DDoS attacks on me. I've blocked most of their ASNs and so far it seems to do 100% of what I was trying to do. I suppose the minor stream of requests from South America and other places of the world isn't humans either, but they mostly read articles, and that's a fairly cheap request on PieFed and it's not a lot. I'll keep an eye on it - maybe that's an alternative solution. Though currently it's a manual process to look up the address ranges and write the firewall rules. It should probably be automated in some way before random people adopt it.
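If someone would rather do the same at the nginx level instead of the firewall, a geo block works - here's a sketch with RFC 5737 documentation ranges as placeholders where the real CIDRs for the ASNs would go:

    # substitute the actual address ranges you looked up for each ASN
    geo $blocked_range {
        default          0;
        203.0.113.0/24   1;   # placeholder
        198.51.100.0/24  1;   # placeholder
    }

    server {
        if ($blocked_range) {
            return 444;  # drop the connection without a response
        }
        # ... rest of the server config
    }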

[–] scrubbles@poptalk.scrubbles.tech 3 points 1 month ago (1 children)

I thought so too with mine, but divide your traffic into API vs web requests. If you have a small instance, most of your traffic should be federation traffic hitting the API endpoints, plus POSTs. Scraping traffic will be GETs requesting web content, not API content. That's what I noticed: the vast, vast majority of my traffic wasn't federation at all, but web scrapes. Granted, my instance has been around for almost 3 years now and I'm sure most of the bot farms know I exist.

[–] hendrik@palaver.p3x.de 1 points 1 month ago* (last edited 1 month ago) (1 children)

Interesting. For me on a 16 month old PieFed instance, mostly used by me, 95% of the total incoming traffic is federation. With some countermeasures activated. In the last 3 days I got 450k requests from Lemmy instances, 7k from PieFed, 10k from Mastodon, 30k from PeerTube and 30k have a browser User Agent string so those are users or crawlers. That's roughly 5%. 5k of those have bot or crawler in the User agent, that's 1% of my requests. And I did 1k requests from my home IP address. Fun fact: 52% of all incoming traffic is just to federate with lemmy.world.

[–] scrubbles@poptalk.scrubbles.tech 2 points 1 month ago (1 children)

I'm not surprised, but I have noticed a lot of the bots use fake user agents as well. For example, I know for a fact no one else was using my instance when I was testing this, and I kept getting requests with Safari user agents. I'd be surprised if my users were using Apple anyway, but knowing they weren't even online was a huge tell. I want to dive in and see how Anubis knows this, or whether those clients were just challenged and failed or didn't bother to complete the challenge. So I'm curious what your traffic looks like when you remove POSTs and API/federation endpoint calls.

[–] hendrik@palaver.p3x.de 2 points 1 month ago* (last edited 1 month ago) (1 children)

I don't know how you're reading this, but in case you didn't know, a lot of browsers have the word "Safari" in their user agent string.

My Vanadium browser identifies as: "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Mobile Safari/537.36"
So there's Safari in it, but it's a Chromium-based browser on Android. And none of the information is correct: it's Android 16, not 10, and neither Vanadium nor Chromium is even mentioned. It also doesn't use Mozilla's engine or AppleWebKit. It is version 142-something, however.

My Firefox (LibreWolf) identifies as: "Mozilla/5.0 (X11; Linux x86_64; rv:144.0) Gecko/20100101 Firefox/144.0" that doesn't include anything else.

But the desktop Chromium does the same thing again: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"

[–] scrubbles@poptalk.scrubbles.tech 3 points 1 month ago (1 children)

That's absolutely true, looking at my logs there are definitely some weird ones:

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b7) Gecko/20100101 Firefox/4.0b7

Which... if that's Firefox, it's very out of date, and Windows NT 6.1 is Windows 7. I could believe people are still posting from Windows 7 rather than 10 or 11, but it's sus. A lot of them are kind of weird combinations. Anubis auto-flagged that one as bot/huawei-cloud, but I'm really curious as to how or why it did. It's not authentic, that's for sure, but how does it know it isn't?

Another one is Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47, which is completely valid, but from an older Windows and Edge version, as if it were a snapshot from a few years ago. Idk, it's all very interesting. That one was also flagged as bot/huawei-cloud.

[–] hendrik@palaver.p3x.de 2 points 1 month ago* (last edited 1 month ago) (1 children)

Well, pretty much all browsers have been faking their user agent strings for a long time now. That's for privacy reasons: malicious actors can single you out with all that information - the combination of IP, operating system and version, and the exact version of the browser and its libraries. They also don't want to advertise that you skimped on updates and are vulnerable to exploits. And they fake more than that, because via JavaScript a website can get your screen size, resolution and all kinds of details. That's business as usual. These pieces of information still serve some purpose, occasionally... But they're more a relic from the distant past, when the internet was an entirely different place without an advertisement and surveillance economy.

And sure, shady bots fake them as well. It's rare to see correct information with anything. The server to server communication in the Fediverse for example advertises the correct name and version number. Maybe some apps as well if they're not concerned with servers being malicious.

[–] scrubbles@poptalk.scrubbles.tech 3 points 1 month ago (1 children)

Fair, all of that. You know, as someone who really likes pristine data, I hate that bad actors turned such basic, clean data against all of us. What should have been a simple compatibility check was completely co-opted and now serves essentially no purpose.

I know Anubis looks at much more than that - you've made me curious to go read through their code. It's working for my instance, but I can see it warrants more research into how it decides.

[–] hendrik@palaver.p3x.de 1 points 1 month ago* (last edited 1 month ago) (1 children)

I think the main point is to load a JavaScript and make the client perform a proof-of-work. That's also the mechanism they advertise with.

The additional (default) heuristics seem to be here: it looks like they match some known user agents and header strings, and there are some IP address ranges in there. And some exceptions to allow-list important stuff.

[–] scrubbles@poptalk.scrubbles.tech 2 points 1 month ago (1 children)

Yeah, I'm seeing that too, which is what I thought, but it didn't say much about the blocklists, which is interesting. Overall it's a very simple concept to me: if you want to access the site, then great, prove that you're willing to work for it. I've worked for large scraping farms before, and the vast majority would rather give up than keep doing that over and over. Compute for them is expensive. What takes a few seconds on our machines is tons of wasted compute for them, which I think is why I get so giddy over it - I love having them waste their money.

[–] hendrik@palaver.p3x.de 1 points 1 month ago* (last edited 1 month ago) (1 children)

Sure. It's a clever idea. I was mainly concerned with the side-effects, like removing all information from Google as well - the modern dynamics Cory Doctorow calls the "enshittification" of the internet. And I'm having a hard time. I do development work and occasionally archive content, download videos, or transform the stupid website with local events into an RSS feed. I'm getting rate-limited, excluded and blocked left and right. Doing automatic things with websites has turned from 10 lines of Python into an entire ordeal: loading a headless Chromium that takes a gigabyte or more of RAM, having some auto-clickers dismiss the cookie banners and overlays, doing the proof-of-work...

I think it turns the internet from an open market of information into something where information isn't indexed, is walled off and generally unavailable. It's also short-lived and can't be archived or innovated upon any more, or at least only with lots of limitations. I think that's the flipside. But it has two sides. It's a clever approach, and it works well for what it's intended to do. And it's a welcome alternative to everyone doing the same thing with Cloudflare as the central service provider.

And I guess it depends on what exactly people do. Limiting a Fediverse instance isn't the same as doing it with other platforms and websites. They have another channel to spread information, at least amongst themselves. So I guess a lot of my criticism doesn't apply as harshly as I've worded it. But I'm generally a bit sad about the general trend.

[–] scrubbles@poptalk.scrubbles.tech 2 points 1 month ago (1 children)

Yeah, I'm in that boat with you. I also have done light scraping, and it kills me that I have to purposely make my site harder to scrape - but at the same time big tech is not playing fairly, scraping my site over and over and over again for any and all changes, overloading my database and driving requests way too high. It's not fair at all. I'd like to think that if someone reached out and asked to scrape, I would let them - but honestly I'd rather just give them read API access.

[–] hendrik@palaver.p3x.de 1 points 1 month ago* (last edited 1 month ago)

Yes, they're really nasty. Back in the day, Googlebot, BingBot and all of them would show up in my logs and fetch a bit of content every now and then. What I've seen with AI scrapers is mad. Alibaba and Tencent did tens of requests per second, ignoring all robots.txt, and they did that from several large IP ranges - probably to circumvent rate-limiting and simpler countermeasures. It's completely unsustainable. My entire server became unresponsive due to the massive database load. I don't think any single server with dynamic content can handle this. It's just a massive DDoS attack. I wonder if this pays off for them. I'd bet in a year they're blocked from most of the web, and that doesn't seem to me like it's in their interest. They're probably aware of this? We certainly have no other option.

[–] sga@piefed.social 2 points 1 month ago

Earlier, scrapers were not harming a website much. Maybe one or two search engine hits a day, and just a handful of search engines, meant that they would not consume many resources. They would scrape your data, but just so search would work (otherwise all search would just run on your website's title). This did not harm the website's business model (if it had any).

It went downhill when Google started giving direct "answers" to search queries - mostly based on wikis, but occasionally other stuff like Reddit. This was often just the relevant bit from a "high match" website, and while it reduced some traffic to those websites, it was not as bad as it is today, where almost all traffic is from scraping bots and many people have simply stopped visiting sites directly.

[–] wjs018@piefed.wjs018.xyz 3 points 1 month ago

For PieFed admins out there, a simpler way to basically kill scrapers dead is to enable the Private instance option in the Misc tab of the admin settings. This prevents non-logged-in users from browsing your instance, without blocking the /inbox route for federation or the API routes.

[–] julian@activitypub.space 2 points 1 month ago (1 children)

Mm, this pushed me toward installing Anubis for activitypub.space.

I set it up, and reloaded the nginx config... And I have no idea whether it's working or not lol

I'm not being challenged 🤔

[–] scrubbles@poptalk.scrubbles.tech 2 points 1 month ago (1 children)

First step is to check the logs in the Anubis container - if you see logs, then it's intercepting requests! If you're still able to access your site, then congrats, you have it set up!

If you'd like, I'm happy to look at your proxy config and let you know my thoughts. Either here or on Matrix - I'm scrubbles@halflings.chat

[–] julian@activitypub.space 1 points 1 month ago

I installed Anubis via the .deb file so it's not in a container. I just don't know where the logs are being saved.

Journalctl maybe...