Over the past few weeks, our little site has been slammed by web scraping bots.
There is a lot of headroom on the server so the increased load has not led to any noticeable degradation, but the numbers are pretty wild. Here are some stats for the past 2 weeks:



We may be popular, but we're not THAT popular.
The typical solution to something like this is to embrace services like those provided by Cloudflare. I absolutely do NOT want to do this, unlike many of our Fediverse peers. I believe surrendering autonomy and privacy in exchange for security is incompatible with what we're trying to do here.
Instead, I have been working on a proof-of-work (PoW) challenge system, inspired by projects like Anubis (https://github.com/TecharoHQ/anubis).
The idea is simple. Clients need to solve a set of simple cryptographic challenges before they are allowed access to the site content. This rules out the vast majority of simple scrapers, and makes it computationally expensive for the more sophisticated ones. I'm calling it Tollbat because why not.
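For anyone curious what a proof-of-work challenge looks like in general, here is a minimal hashcash-style sketch. It is not Tollbat's actual code: the function names, the use of SHA-256, the `challenge:nonce` format and the difficulty value are all assumptions made for illustration. The server hands out a random string, the client brute-forces a number that makes the hash start with enough zero bits, and the server checks the answer with a single hash.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random challenge string (hypothetical format)."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # bit_length() is the position of the highest set bit in this byte.
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty: int) -> int:
    """Client side: try nonces until the hash has `difficulty` leading zero bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking a submitted answer costs one hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty=16)  # assumed difficulty, roughly 65k hashes on average
    print(challenge, nonce, verify(challenge, nonce, 16))
```

The asymmetry is the point: solving takes many hashes on average (about 2^difficulty), while verifying takes one. A single visitor barely notices the cost, but a scraper requesting thousands of pages has to pay it over and over.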
The trade-off is that you may now see a Tollbat challenge screen before accessing HC. This could mean a 5-10 second load time every so often. I will continue to tune it to make it as light as possible for real users.
Federation, third-party apps like Jerboa, and legitimate bots are not affected. So far things seem to be going well, but let me know if you notice any weird behavior.
The load reduction is significant to say the least. Here are more graphs for those who like them - see if you can spot when Tollbat was turned on:




Feel free to ask any questions, I'm happy to answer them.
Holy cow dude wtf? This is a dumb question, but why are we getting scraped by bots so much?
Probably to mine data for training AI. They're going to scrape anything with user content, not just us.
Have any other admins of other instances reported similar?
My first thought was more sinister, considering how many other instances seem to hate this instance over political disagreements, and how some of them overlap with the types who dox people.
We don't really talk, but I think they all struggle with this in some way. Many instances use Cloudflare for example.
The bots seem to be trying to pull many random posts (e.g. posts that were federated across from other instances).
My guess is they're trying to vacuum up the content. It could also be a low-level DoS attack but it feels too light and inefficient for that.
I am noticing some issues on the third-party app Voyager. I am currently accessing via my phone's browser, and I saw the Tollbat prompt for a second and then everything loaded up. But on the Voyager app, it loads for about 10 seconds and then fails. This is what it shows after that: https://imgur.com/a/6oRTUVg. Trying to pull down and load again results in additional failures.
This just started within the last hour. I'm not sure about before that, as the last time I was successfully on was when I made the previous comment, and I didn't have any issues then. I'm also on the same network as before, so I don't believe my network is causing this.
Thanks for letting me know. I was making some tweaks and broke something. It should be back to normal now, please retry!
Thank you! It's working now, I wrote this on Voyager!
Good to hear!