this post was submitted on 13 Jan 2026
696 points (98.7% liked)

Selfhosted


Reddit's API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn't touch Reddit's servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
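To make the "dump in, static HTML out" idea concrete, here is a minimal sketch and not the tool's actual code. It assumes the .zst dump has already been decompressed to newline-delimited JSON with the usual Reddit submission fields (`title`, `author`, `subreddit`, `selftext`, `created_utc`). The `PAGE` template and `render_post` function are hypothetical names, and the sketch uses plain string formatting from the standard library rather than Jinja2.

```python
import html
import json
from datetime import datetime, timezone

# Hypothetical page template: no JavaScript, no external requests.
PAGE = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>{title}</title></head>
<body>
<h1>{title}</h1>
<p>r/{subreddit} | u/{author} | {date}</p>
<pre>{body}</pre>
</body>
</html>
"""


def render_post(line: str) -> str:
    """Render one NDJSON submission record as a standalone HTML page."""
    post = json.loads(line)
    # created_utc is a Unix timestamp; it may arrive as an int, float, or string.
    created = datetime.fromtimestamp(
        int(float(post.get("created_utc") or 0)), tz=timezone.utc
    )
    # Escape every user-supplied field so archived content cannot inject markup.
    return PAGE.format(
        title=html.escape(post.get("title") or ""),
        subreddit=html.escape(post.get("subreddit") or ""),
        author=html.escape(post.get("author") or ""),
        date=created.strftime("%Y-%m-%d"),
        body=html.escape(post.get("selftext") or ""),
    )


if __name__ == "__main__":
    sample = json.dumps({
        "title": "My homelab <setup>",
        "author": "example_user",
        "subreddit": "selfhosted",
        "selftext": "Pi 4 & a USB drive.",
        "created_utc": 1700000000,
    })
    print(render_post(sample))
```

Writing each returned string to its own `.html` file gives a folder that can be opened straight from disk.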

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN
  • Tor hidden service (2 commands, no port forwarding needed)
  • VPS with HTTPS
  • GitHub Pages for small archives

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B-post dataset, run multiple instances split by topic.
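One way to prepare a "one instance per topic" split is to route records into per-topic files before importing them. This is only a sketch under assumptions: the `TOPICS` mapping, the `misc` fallback, and the function names are made up for illustration and are not part of the tool, and the input is assumed to be decompressed newline-delimited JSON with a `subreddit` field.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Iterable

# Hypothetical mapping from subreddit name to topic bucket.
TOPICS = {
    "selfhosted": "tech",
    "homelab": "tech",
    "cooking": "food",
}


def topic_for(line: str, default: str = "misc") -> str:
    """Return the topic bucket for one NDJSON record."""
    subreddit = json.loads(line).get("subreddit") or ""
    return TOPICS.get(subreddit.lower(), default)


def split_by_topic(lines: Iterable[str], out_dir: Path) -> dict[str, int]:
    """Stream records into out_dir/<topic>.ndjson and return per-topic counts.

    Lines are processed one at a time, so memory use does not grow with
    the size of the input.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}
    counts: dict[str, int] = defaultdict(int)
    try:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            topic = topic_for(line)
            if topic not in handles:
                handles[topic] = open(
                    out_dir / f"{topic}.ndjson", "w", encoding="utf-8"
                )
            handles[topic].write(line.rstrip("\n") + "\n")
            counts[topic] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return dict(counts)
```

Each resulting file can then be fed to its own instance.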

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is "trust but verify" – it accelerates the boring parts but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/

GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

top 50 comments
[–] ICastFist@programming.dev 5 points 19 hours ago (1 children)

What's the size difference when you remove the porn stuff from the torrent?

[–] spicehoarder@lemmy.zip 11 points 18 hours ago

Willing to bet a 90% size reduction

[–] Butterphinger@lemmy.zip 5 points 1 day ago

grabs external

[–] inspxtr@lemmy.world 3 points 23 hours ago

Very cool! Do you know how your project compares with Arctic Shift? For those more interested in research with Reddit data, is there a benefit of one vs. the other?

[–] offspec@lemmy.world 60 points 2 days ago (3 children)

It would be neat for someone to migrate this data set to a Lemmy instance

[–] JackbyDev@programming.dev 3 points 20 hours ago

Lemmit already existed and was annoying as hell. It was the first account I remember blocking.

[–] TeddE@lemmy.world 22 points 1 day ago (2 children)

It would be inviting a lawsuit for sure. I like the essence of the idea, but it's probably more trouble than it's worth for all but the most fanatic.

[–] floquant@lemmy.dbzer0.com 9 points 1 day ago* (last edited 1 day ago) (3 children)

Is it though? That is (or was, and should be again) publicly accessible information that was created over the years by random internet users. I refuse the notion that an American company can "own it" just because they ran the servers. Sure they can hold copyright for their frontend and backend code, name and whatever. But posts and comments, no way.

Of course it would be dumb for someone under US jurisdiction but we'll see how much an international DMCA claim is worth considering the current relations anyway.

[–] Olgratin_Magmatoe@slrpnk.net 6 points 1 day ago* (last edited 1 day ago) (3 children)

Might be easiest to set up an instance in a country that doesn't give a fuck about western IP law, then others can federate to it.

So yeah, fanatic levels of effort.

[–] fennesz12@feddit.dk 6 points 1 day ago (1 children)

Brb, setting up a Lemmy server in Red Star OS

[–] MonkeMischief@lemmy.today 5 points 1 day ago* (last edited 1 day ago) (1 children)
[–] A_Random_Idiot@lemmy.world 3 points 1 day ago

The chances are pretty high that that's Kim's computer, aren't they?

[–] floquant@lemmy.dbzer0.com 3 points 1 day ago* (last edited 1 day ago) (2 children)

Posts and comments are not Reddit's IP anyway :3

[–] Buddahriffic@lemmy.world 3 points 1 day ago* (last edited 1 day ago)

They might have set up the user agreement for it. Stackexchange did and their whole business model was about catching businesses where some worker copy/pasted code from a stackexchange answer and getting a settlement out of it.

I agree with you in principle (hell, I'd even take it further and think only trademarks should be protected, other than maybe a short period for copyright and patent protection, like a few years), but the legal system might disagree.

Edit: I'd also make trademarks non-transferable and apply to individuals rather than corporations, so they can go back to representing quality rather than business decisions. Especially when some new entity that never had any relation to the original trademark user just throws some money at them or their estate to buy the trust associated with the trademark.

[–] 19_84@lemmy.dbzer0.com 3 points 1 day ago

this is one reason i support tor deployment out of the box 😋

[–] cyberpunk007@lemmy.ca 12 points 2 days ago

Now this is a good idea.

[–] vane@lemmy.world 6 points 1 day ago* (last edited 1 day ago) (1 children)

How long does it take to download this 3TB torrent?

[–] Mubelotix@jlai.lu 1 points 23 hours ago

I do not consent to this

[–] lautan@lemmy.ca 14 points 1 day ago

Thanks. This is great for mining data and urls.

[–] HugeNerd@lemmy.ca -2 points 19 hours ago

Boring. I want the Kuro5hin site. That was actually good and hysterically funny at the best times. ASCII reenactment players of Michael Crawford anyone?

[–] breakingcups@lemmy.world 124 points 2 days ago (2 children)

Just so you're aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.

Not to detract from your project, which looks cool!

[–] 19_84@lemmy.dbzer0.com 142 points 2 days ago (18 children)

Yes I used AI, English is not my first language. Thank you for the kind words!

[–] a1studmuffin@aussie.zone 55 points 2 days ago

This seems especially handy for anyone who wants a snapshot of Reddit from the pre-enshittification, pre-AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.

so kinda like kiwix but for reddit. That is so cool
