afaik, archive.org isn't open source. I'd recommend something like archivebox.io
ArchiveBox is a piece of software, and the Internet Archive is an organization that is focused on predicting the content on the internet.
The Internet Archive has PBs worth of data. I doubt any home user could manage that.
archive predicting?
Protecting
They're beating the algorithm
I don't think OP is looking to mirror archive.org. My take was that they wanted something like archive.org, but self-hosted and for personal / small-scale use.
Exactly. I'm already running a local wiki, but I don't want stuff I link to in my wiki to result in 404 in a few years. Or worse, to some AI-ridden ad-infested dumpster fire.
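For the wiki use case described above, here is a minimal sketch of how the outbound links could be collected before handing them to whatever archiver you end up with. The function name `extract_urls` and the regex are my own assumptions for illustration, not part of ArchiveBox or any wiki engine, and the pattern is deliberately simple, so it may miss or mangle unusual URLs.

```python
import re
import sys

# Assumed pattern: an http(s) URL runs until whitespace or a character that
# usually closes wiki/markdown link syntax. Not a full URL parser.
URL_PATTERN = re.compile(r"https?://[^\s<>\"'\]\)|]+")


def extract_urls(text):
    """Return the unique http(s) URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that tends to stick to the end of a link.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Read wiki page text from stdin and print one URL per line.
    for url in extract_urls(sys.stdin.read()):
        print(url)
```

Saved under a made-up name like `wiki_links.py`, the output is one URL per line, which is the kind of list most archiving tools accept. As far as I know ArchiveBox's `add` command can read URLs from stdin, but check its docs before relying on that.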
You can use something as simple as a browser extension like SingleFile, which can automatically download complete, self-contained copies of every page you bookmark, or only of certain URLs.
Oh yes, this looks like a winner. Thanks!
It seems like it's written in Python too, which means I can maintain it if need be.
Oh boy I wish I had set this up many years ago. I wouldn't have to resort to scouring !antiquememesroadshow@lemmy.world for the top quality memes of the past when I need them...
On a far side of the moon note, I wonder if ActivityPub could be used to federate multiple archiveboxes to create a more resilient Internet Archive alternative. 🤔 Then integrate that with Lemmy to autoarchive links from posts. Aaand lemmy.world ran out of disk space. 🤣
A network between networks to make them more resilient? I think you've just invented the ARPANET.
+1 for ArchiveBox
I believe they used Heritrix at one point. The important bit is that they use a special archive format which is a standard. There are several tools that support it (both capturing to it and viewing it). It allows for capturing a website in a 'working' condition, with history or something. I'm a bit fuzzy on it since it's been some time since I looked into it.
It seems like all of their software is in the parent account of Heritrix: https://github.com/orgs/internetarchive/repositories?type=all
Does Linkwarden fit your intended use?
Kind of. Linkwarden seems to save as PDF. That's better than nothing, but preserving a functional copy of the pages would be better. ArchiveBox seems to do this.
I don't know for certain but I'm sure they run lots of different software. They have PBs of data.
New Lemmy Post: What software does the Internet Archive run? (https://lemmy.world/post/10891304)