Help with archiving profiles on Loyalfans (lemmy.ml)

submitted 2 months ago by furycd001@lemmy.ml to c/datahoarder@lemmy.ml

Hey everyone,

I’m working on archiving a few profiles from Loyalfans, but I’ve hit a wall with their CDN (CloudFront) security and rate-limiting. I’m looking to grab all media (high-res images, GIFs, videos, video thumbnails & audio), but the platform seems particularly hostile to bulk downloading. Has anyone successfully scraped or downloaded a profile on Loyalfans? If so, how?

The site uses signed URLs with Expires, Signature, and Key-Pair-Id parameters. These seem to be session-bound or very short-lived.
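
For reference, in CloudFront's canned-policy signed URLs the Expires value is a Unix timestamp in seconds, so the remaining lifetime of a link can be read straight from the query string. A minimal sketch of that check follows; the function name and the example URL are made up for illustration, and it says nothing about whether the links are also bound to a session.

```python
import time
from urllib.parse import urlsplit, parse_qs


def seconds_until_expiry(url, now=None):
    """Return seconds left before the URL's Expires timestamp, or None if absent."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("Expires")
    if not values:
        # No Expires parameter (e.g. a custom-policy URL, which carries Policy instead).
        return None
    now = time.time() if now is None else now
    return int(values[0]) - int(now)


# Hypothetical URL, for illustration only.
example = (
    "https://cdn.example.com/media/abc.jpg"
    "?Expires=1700000600&Signature=xyz&Key-Pair-Id=K123"
)
print(seconds_until_expiry(example, now=1700000000))  # prints 600
```

A negative result means the link has already expired, which would be consistent with downloads failing when a script runs too long after the page was loaded.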

What I’ve tried so far:

Manual "Save As" (Shift + Right Click): Result: Works for the first 10-15 files, then falls apart. The Issue: I’m running into what looks like a cache collision or rate limit. After a few downloads, the browser starts saving random, previously downloaded images instead of the new one. It only resolves if I wait 30+ minutes and try again, and then the cycle repeats.
HAR Extraction & Shell Scripting: Result: Partially successful but extremely finicky. The Issue: I’ve been saving .har files from the network tab, then using grep to grab the CDN links. The problem is that the HAR often picks up thumbnails (_md.jpg, _sm.jpg) or pre-fetched neighbor images. Furthermore, if I don't run the wget/curl script quickly enough, the signatures expire.
Selenium-based Python Script: Result: Identical to the manual method. The Issue: Even with headless browsing and random delays, the CDN eventually detects the automated behavior and starts serving 403s or throttling the connection, resulting in the same "duplicate image" cache bug.
Vergil9000's Loyalfans Downloader: Link: https://github.com/Vergil9000/LoyalFans Result: Failed completely. I can load a list of profiles I follow, but the actual scraping/downloading logic seems broken or outdated for the current site architecture.
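
As a rough illustration of the HAR step above, the grep pass could be replaced by a short Python filter that reads the HAR as JSON, skips the thumbnail suffixes mentioned (_md.jpg, _sm.jpg), and keeps one URL per path. This is only a sketch: the `host_hint` value assumes the CDN hostname contains "cloudfront", which may not hold if a custom domain is used, and the function names and sample data are hypothetical. It does not distinguish pre-fetched neighbor images from the one being viewed.

```python
import json
from urllib.parse import urlsplit

THUMB_SUFFIXES = ("_md.jpg", "_sm.jpg")  # thumbnail suffixes named in the post


def load_har(path):
    """Load a .har file saved from the browser's network tab."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def full_size_urls(har, host_hint="cloudfront"):
    """Return CDN URLs from a HAR dict, skipping thumbnails and repeated paths."""
    seen_paths = set()
    urls = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        parts = urlsplit(url)
        if host_hint not in parts.netloc:
            continue  # not a CDN request (assumed hostname pattern)
        if parts.path.endswith(THUMB_SUFFIXES):
            continue  # thumbnail variant
        if parts.path in seen_paths:
            continue  # same file requested again with a fresh signature
        seen_paths.add(parts.path)
        urls.append(url)
    return urls


# Tiny made-up HAR fragment, for illustration only.
sample = {"log": {"entries": [
    {"request": {"url": "https://d111.cloudfront.net/media/a.jpg?Expires=1&Signature=x"}},
    {"request": {"url": "https://d111.cloudfront.net/media/a_md.jpg?Expires=1&Signature=x"}},
    {"request": {"url": "https://d111.cloudfront.net/media/a.jpg?Expires=2&Signature=y"}},
    {"request": {"url": "https://www.example.com/app.js"}},
]}}
print(full_size_urls(sample))
```

With a real capture, `full_size_urls(load_har("session.har"))` would give the filtered list, where "session.har" stands in for whatever file was saved.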

Many thanks for taking the time to read my post. Any help would be greatly appreciated.

top 3 comments

[–] Maxy@lemmy.blahaj.zone 2 points 2 months ago (1 children)

The GitHub project you linked appears to have forks with more recent commits; you could try those. I don't use the site myself, though, so I haven't tested any of them.

Overview of forks: https://github.com/Vergil9000/LoyalFans/network/members

[–] furycd001@lemmy.ml 2 points 2 months ago

I did not know there were forks. Thank you very much for pointing this out :)

[–] hexagonwin@lemmy.sdf.org 2 points 2 months ago* (last edited 2 months ago)

Sorry, I can't really help you much, but many folks at Archive Team are extremely talented and might be able to help you.