this post was submitted on 31 Jan 2026

244 points (99.6% liked)

datahoarder

10234 readers

14 users here now

Who are we?

We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

We are one. We are legion. And we're trying really hard not to forget.

-- 5-4-3-2-1-bang from this thread

founded 6 years ago

MODERATORS

archivist@lemmy.ml

244

Epstein Files Jan 30, 2026 Release - Archived from Justice.gov (lemmy.world)

submitted 1 month ago* (last edited 3 weeks ago) by xodoh74984@lemmy.world to c/datahoarder@lemmy.ml

221 comments fedilink hide all child comments

Epstein Files Jan 30, 2026

Data hoarders on reddit have been hard at work archiving the latest Epstein Files release from the U.S. Department of Justice. Below is a compilation of their work with download links.

Please seed all torrent files to distribute and preserve this data.

Ref: https://old.reddit.com/r/DataHoarder/comments/1qrk3qk/epstein_files_datasets_9_10_11_300_gb_lets_keep/

Epstein Files Data Sets 1-8: INTERNET ARCHIVE LINK

Epstein Files Data Set 1 (2.47 GB): TORRENT MAGNET LINK
Epstein Files Data Set 2 (631.6 MB): TORRENT MAGNET LINK
Epstein Files Data Set 3 (599.4 MB): TORRENT MAGNET LINK
Epstein Files Data Set 4 (358.4 MB): TORRENT MAGNET LINK
Epstein Files Data Set 5: (61.5 MB) TORRENT MAGNET LINK
Epstein Files Data Set 6 (53.0 MB): TORRENT MAGNET LINK
Epstein Files Data Set 7 (98.2 MB): TORRENT MAGNET LINK
Epstein Files Data Set 8 (10.67 GB): TORRENT MAGNET LINK

Epstein Files Data Set 9 (Incomplete). Only contains 49 GB of 180 GB. Multiple reports of cutoff from DOJ server at offset 48995762176.

ORIGINAL JUSTICE DEPARTMENT LINK

TORRENT MAGNET LINK (removed due to reports of CSAM)

/u/susadmin's More Complete Data Set 9 (96.25 GB)
De-duplicated merger of (45.63 GB + 86.74 GB) versions

TORRENT MAGNET LINK (removed due to reports of CSAM)

Epstein Files Data Set 10 (78.64GB)

ORIGINAL JUSTICE DEPARTMENT LINK

TORRENT MAGNET LINK (removed due to reports of CSAM)
INTERNET ARCHIVE FOLDER (removed due to reports of CSAM)
INTERNET ARCHIVE DIRECT LINK (removed due to reports of CSAM)

Epstein Files Data Set 11 (25.55GB)

ORIGINAL JUSTICE DEPARTMENT LINK

TORRENT MAGNET LINK

SHA1: 574950c0f86765e897268834ac6ef38b370cad2a

Epstein Files Data Set 12 (114.1 MB)

ORIGINAL JUSTICE DEPARTMENT LINK

TORRENT MAGNET LINK
INTERNET ARCHIVE FOLDER LINK

SHA1: 20f804ab55687c957fd249cd0d417d5fe7438281
MD5: b1206186332bb1af021e86d68468f9fe
SHA256: b5314b7efca98e25d8b35e4b7fac3ebb3ca2e6cfd0937aa2300ca8b71543bbe2

This list will be edited as more data becomes available, particularly with regard to Data Set 9 (EDIT: NOT ANYMORE)

EDIT [2026-02-02]: After being made aware of potential CSAM in the original Data Set 9 releases and seeing confirmation in the New York Times, I will no longer support any effort to maintain links to archives of it. There is suspicion of CSAM in Data Set 10 as well. I am removing links to both archives.

Some in this thread may be upset by this action. It is right to be distrustful of a government that has not shown signs of integrity. However, I do trust journalists who hold the government accountable.

I am abandoning this project and removing any links to content that commenters here and on reddit have suggested may contain CSAM.

Ref 1: https://www.nytimes.com/2026/02/01/us/nude-photos-epstein-files.html
Ref 2: https://www.404media.co/doj-released-unredacted-nude-images-in-epstein-files

top 50 comments

sorted by: hot top controversial new old

[–] susadmin@lemmy.world 39 points 4 weeks ago* (last edited 4 weeks ago) (13 children)

I'm in the process of downloading both dataset 9 torrents (45.63 GB + 86.74 GB). I will then compare the filenames in both versions (the 45.63GB version has 201,358 files alone), note any duplicates, and merge all unique files into one folder. I'll upload that as a torrent once it's done so we can get closer to a complete dataset 9 as one file.

Edit 31Jan2026 816pm EST - Making progress. I finished downloading both dataset 9s (45.6 GB and the 86.74 GB). The 45.6GB set is 200,000 files and the 86GB set is 500,000 files. I have a .csv of the filenames and sizes of all files in the 45.6GB version. I'm creating the same .csv for the 86GB version now.

Edit 31Jan2026 845pm EST -
- dataset 9 (45.63 GB) = 201357 files
- dataset 9 (86.74 GB) = 531257 files
I did an exact filename combined with an exact file size comparison between the two dataset9 versions. I also did an exact filename combined with a fuzzy file size comparison (tolerance of +/- 1KB) between the two dataset9 versions. There were:
- 201330 exact matches
- 201330 fuzzy matches (+/- 1KB)
Meaning there are 201330 duplicate files between the two dataset9 versions.

These matches were written to a duplicates file. Then, from each dataset9 version, all files/sizes matching the file and size listed in the duplicates file will be moved to a subfolder. Then I'll merge both parent folders into one enormous folder containing all unique files and a folder of duplicates. Finally, compress it, make a torrent, and upload it.

Edit 31Jan2026 945pm EST -

Still moving duplicates into subfolders.

Edit 31Jan2026 1027pm EST -

Going off of xodoh74984's comment (https://lemmy.world/post/42440468/21884588), I'm increasing the rigor of my determination of whether the files that share a filename and size between both version of dataset9 are in fact duplicates. This will be identical to rsync --checksum to verify bit-for-bit that the files are the same by calculating their MD5 hash. This will take a while but is the best way.

Edit 01Feb2026 1227am EST -

Checksum comparison complete. 73 files found that have the same file name and size but different content. Total number of duplicate files = 201257. Merging both dataset versions now, while keeping one subfolder of the duplicates, so nothing is deleted.

Edit 01Feb2026 1258am EST -

Creating the .tar.zst file now. 531285 total files, which includes all unique files between dataset9 (45.6GB) and dataset9 (86.7GB), as well as a subfolder containing the files that were found in both dataset9 versions.

Edit 01Feb2026 215am EST -

I was using wayyyy to high a compression value for no reason (ztsd --ultra --22). Restarted the .tar.zst file creation (with ztsd -12) and it's going 100x faster now. Should be finished ~~within the hour~~

Edit 01Feb2026 311am EST -

.tar.zst file creation is taking very long. I'm going to let it run overnight - will check back in a few hours. I'm tired boss.

EDIT 01Feb2026 831am EST -

COMPLETE!

And then I doxxed myself in the torrent. One moment please while I fix that....

Final magnet link is HERE. GO GO GOOOOOO

I'm seeding @ 55 MB/s. I'm also trying to get into the new r/EpsteinPublicDatasets subreddit to share the torrent there.

[–] epstein_files_guy@lemmy.world 8 points 4 weeks ago (1 children)

looking forward to your torrent, will seed.

I have several incomplete sets of files from dataset 9 that I downloaded with a scraped set of urls - should I try to get them to you to compare as well?

[–] susadmin@lemmy.world 4 points 4 weeks ago (1 children)

Yes! I'm not sure the best way to do that - upload them to MEGA and message me a download link?

[–] epstein_files_guy@lemmy.world 6 points 4 weeks ago (1 children)

maybe archive.org? that way they can be torrented if others want to attempt their own merging techniques? either way it will be a long upload, my speed is not especially good. I'm still churning through one set of urls that is 1.2M lines, most are failing but I have 65k from that batch so far.

[–] susadmin@lemmy.world 5 points 4 weeks ago (1 children)

archive.org is a great idea. Post the link here when you can!

[–] epstein_files_guy@lemmy.world 5 points 4 weeks ago* (last edited 4 weeks ago)

I'll get the first set (42k files in 31G) uploading as soon as I get it zipped up. it's the one least likely to have any new files in it since I started at the beginning like others but it's worth a shot

edit 01FEB2026 1208AM EST - 6.4/30gb uploaded to archive.org

edit 01FEB2026 0430AM EST - 13/30gb uploaded to archive.org; scrape using a different url set going backwards is currently at 75.4k files

edit 01FEB2026 1233PM EST - had an internet outage overnight and lost all progress on the archive.org upload, currently back to 11/30gb. the scrape using a previous url set seems to be getting very few new files now, sitting at 77.9k at the moment

[–] thetrekkersparky@startrek.website 8 points 4 weeks ago

I'm downloading 8-11 now, I'm seeding 1-7+12 now. I've tried checking up on reddit, but every other time i check in the post is nuked or something. My home server never goes down and I'm outside USA. I'm working on the 100GB+ #9 right now and I'll seed whatever you can get up here too.

[–] helpingidiot@lemmy.world 5 points 4 weeks ago

Have a good night. I'll be waiting to download it, seed it, make hardcopies and redistribute it.

Please check back in with us

[–] donmega@lemmy.world 5 points 4 weeks ago

Thank you so much for keeping us updated!!

load more comments (8 replies)

[–] Boomtown1873@lemmy.world 20 points 4 weeks ago (1 children)

Funny how a rag-tag ad-hoc group can seed data so much better than the DOJ. Beautiful to see in action.

load more comments (1 replies)

[+] Sundoen@lemmy.world 16 points 4 weeks ago* (last edited 4 weeks ago) (1 children)

[deleted]

[–] bay400@thelemmy.club 5 points 4 weeks ago* (last edited 4 weeks ago) (6 children)

Magnet:

magnet:?xt=urn:btih:acb9cb1741502c7dc09460e4fb7b44eac8022906&dn=DataSet_9.tar.xz&xl=93143408940&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=http%3A%2F%2Fopen.tracker.cl%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Ftracker.theoks.net%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.srv00.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.qu.ax%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.filemail.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.dler.org%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.alaskantf.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker-udp.gbitt.info%3A80%2Fannounce&tr=udp%3A%2F%2Ft.overflow.biz%3A6969%2Fannounce&tr=udp%3A%2F%2Fopentracker.io%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.dstud.io%3A6969%2Fannounce&tr=udp%3A%2F%2Fmartin-gebhardt.eu%3A25%2Fannounce&tr=udp%3A%2F%2Fevan.im%3A6969%2Fannounce&tr=udp%3A%2F%2Fd40969.acod.regrucolo.ru%3A6969%2Fannounce&tr=udp%3A%2F%2F6ahddutb1ucc3cp.ru%3A6969%2Fannounce&tr=https%3A%2F%2Ftracker.zhuqiy.com%3A443%2Fannounce

load more comments (6 replies)

[–] jankscripts@lemmy.world 15 points 4 weeks ago (2 children)

Heads up that the DOJ site is a tar pit, it's going to return 50 files on the page regardless of the page number your on seems like somewhere between 2k-5k pages it just wraps around right now.

Testing page 2000... ✓ 50 new files (out of 50)
Testing page 5000... ○ 0 new files - all duplicates
Testing page 10000... ○ 0 new files - all duplicates
Testing page 20000... ○ 0 new files - all duplicates
Testing page 50000... ○ 0 new files - all duplicates
Testing page 100000... ○ 0 new files - all duplicates

[–] WorldlyBasis9838@lemmy.world 9 points 4 weeks ago

I saw this too; yesterday I tried manually accessing the page to explore just how many there are. Seems like some of the pages are duplicates (I was simply comparing the last listed file name and content between some of the first 10 pages, and even had 1-2 duplications.)

Far as maximum page number goes, if you use the query parameter ?page=200000000 it will still resolve a list of files. — actually crazy.

https://www.justice.gov/epstein/doj-disclosures/data-set-9-files?page=200000000

load more comments (1 replies)

[–] hYcG68caGB7WvLX67@lemmy.world 14 points 4 weeks ago (1 children)

I was quick to download dataset 12 after it was discovered to exist, and apparently my dataset 12 contains some files that were later removed. Uploaded to IA in case it contains anything that later archivists missed. https://archive.org/details/data-set-12_202602

Specifically doc number 2731361 and others around it were at some point later removed from DoJ, but are still within this early-download DS12. Maybe more, unsure

[–] susadmin@lemmy.world 8 points 4 weeks ago* (last edited 4 weeks ago)

The files in this (early) dataset 12 are identical to the dataset 12 here, which is the link in the OP. The MD5 hashes are identical.

I shared a .csv file of the calculated MD5 hashes here

[–] bile@lemmy.world 11 points 4 weeks ago (2 children)

reposting a full magnet list (besides 9) of all the datasets that was on reddit with healthy seeds:

Dataset 1 (2.47GB)

magnet:?xt=urn:btih:4e2fd3707919bebc3177e85498d67cb7474bfd96&dn=DataSet+1&xl=2658494752&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 2 (631.6MB)

magnet:?xt=urn:btih:d3ec6b3ea50ddbcf8b6f404f419adc584964418a&dn=DataSet+2&xl=662334369&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 3 (599.4MB)

magnet:?xt=urn:btih:27704fe736090510aa9f314f5854691d905d1ff3&dn=DataSet+3&xl=628519331&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 4 (358.4MB)

magnet:?xt=urn:btih:4be48044be0e10f719d0de341b7a47ea3e8c3c1a&dn=DataSet+4&xl=375905556&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 5 (61.5MB)

magnet:?xt=urn:btih:1deb0669aca054c313493d5f3bf48eed89907470&dn=DataSet+5&xl=64579973&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 6 (53.0MB)

magnet:?xt=urn:btih:05e7b8aefd91cefcbe28a8788d3ad4a0db47d5e2&dn=DataSet+6&xl=55600717&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 7 (98.2MB)

magnet:?xt=urn:btih:bcd8ec2e697b446661921a729b8c92b689df0360&dn=DataSet+7&xl=103060624&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 8 (10.67GB)

magnet:?xt=urn:btih:c3a522d6810ee717a2c7e2ef705163e297d34b72&dn=DataSet%208&xl=11465535175&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce

Dataset 10 (78.64GB)

magnet:?xt=urn:btih:d509cc4ca1a415a9ba3b6cb920f67c44aed7fe1f&dn=DataSet%2010.zip&xl=84439381640

Dataset 11 (25.55GB)

magnet:?xt=urn:btih:59975667f8bdd5baf9945b0e2db8a57d52d32957&xt=urn:btmh:12200ab9e7614c13695fe17c71baedec717b6294a34dfa243a614602b87ec06453ad&dn=DataSet%2011.zip&xl=27441913130&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=http%3A%2F%2Fopen.tracker.cl%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.srv00.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.filemail.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.dler.org%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker-udp.gbitt.info%3A80%2Fannounce&tr=udp%3A%2F%2Frun.publictracker.xyz%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.dstud.io%3A6969%2Fannounce&tr=udp%3A%2F%2Fleet-tracker.moe%3A1337%2Fannounce&tr=https%3A%2F%2Ftracker.zhuqiy.com%3A443%2Fannounce&tr=https%3A%2F%2Ftracker.pmman.tech%3A443%2Fannounce&tr=https%3A%2F%2Ftracker.moeblog.cn%3A443%2Fannounce&tr=https%3A%2F%2Ftracker.alaskantf.com%3A443%2Fannounce&tr=https%3A%2F%2Fshahidrazi.online%3A443%2Fannounce&tr=http%3A%2F%2Fwww.torrentsnipe.info%3A2701%2Fannounce&tr=http%3A%2F%2Fwww.genesis-sp.org%3A2710%2Fannounce

Dataset 12 (114.0MB)

magnet:?xt=urn:btih:EE6D2CE5B222B028173E4DEDC6F74F08AFBBB7A3&dn=DataSet%2012.zip&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce

[–] xodoh74984@lemmy.world 8 points 4 weeks ago

Thank you for this!

I've added all magnet links for sets 1-8 to the original post. Magnet links for 9-11 match OP. Magnet link for 12 is different, but we've identified that there are at least two versions. DOJ removed files before the second version was downloaded. OP contains the early version of data set 12.

load more comments (1 replies)

[–] dessalines@lemmy.ml 10 points 4 weeks ago

Thx for posting, seed if you can ppl.

[–] ModernSimian@lemmy.world 9 points 4 weeks ago* (last edited 4 weeks ago) (2 children)

Does anyone have an index of filenames/links from the DOJ website scraped?

Edit, specifically for DataSet 9.

[–] Kindly_District9380@lemmy.world 6 points 4 weeks ago

coming up with that right now, check my comment below.

I made this script and slowly about to finish the crawl, it's close to 20k+ pages.

https://pastebin.com/zbF0Rmfx

It should be done in less than 1-2 hour, and I will upload it to Archive. org

[–] epstein_files_guy@lemmy.world 6 points 4 weeks ago (4 children)

I'm waiting for /u/Kindly_District9380 's version but I've been slowly working backwards on this in the meantime https://archive.org/details/dataset9_url_list

load more comments (3 replies)

[–] donmega@lemmy.world 8 points 4 weeks ago

DOJ Just Removed Epstein Flight Log + Contact book in the last 30 minutes

MIRRORS:

This should be the flight log: https://epsteinfilez.com/?doc=f2f8c7628ddc9cbc1cf6a7532b847ae45eec1165cb400fccea19213982956d3d.pdf&p=1

and

https://epsteinfilez.com/?doc=bccc4a7953c0834228a422f1c3fede4d361dc8240c191d2e2b4616259ccdb4e9.pdf&p=1

and this is the contact book: https://epsteinfilez.com/?doc=b395c578ed9394206eaae4f724f99b094d81a5fce45006b247150433b38016c6.pdf&p=1

[–] Wild_Cow_5769@lemmy.world 7 points 4 weeks ago (1 children)

As far as CSAM and the “don’t go looking for data set 9”…

Look I’ll be straight up.

If I find any CSAM it gets deleted…

But if you believe for 1 second that DOJ didn’t remove delete relevant files because they are protecting people then I have a time share to sell you at a cheap price on a beautiful scenic swamp in Florida…

load more comments (1 replies)

[–] Nomad64@lemmy.world 6 points 4 weeks ago* (last edited 4 weeks ago)

I am seeding sets 1-8, 10-12, and the larger set 9. Seedbox is outside the US and has a very fast connection.

I will keep an eye on this post for other sets. 👍

[–] PeoplesElbow@lemmy.world 6 points 3 weeks ago (11 children)

Ok everyone, I have done a complete indexing of the first 13,000 pages of the DOJ Data Set 9.

KEY FINDING: 3 files are listed but INACCESSIBLE

These appear in DOJ pagination but return error pages - potential evidence of removal:

EFTA00326497

EFTA00326501

EFTA00534391

You can try them yourself (they all fail):

https://www.justice.gov/epstein/files/DataSet%209/EFTA00326497.pdf

The 86GB torrent is 7x more complete than DOJ website

DOJ website exposes: 77,766 files

Torrent contains: 531,256 files

Page Range Min EFTA Max EFTA New Files

0-499 EFTA00039025 EFTA00267311 21,842

500-999 EFTA00267314 EFTA00337032 18,983

1000-1499 EFTA00067524 EFTA00380774 14,396

1500-1999 EFTA00092963 EFTA00413050 2,709

2000-2499 EFTA00083599 EFTA00426736 4,432

2500-2999 EFTA00218527 EFTA00423620 4,515

3000-3499 EFTA00203975 EFTA00539216 2,692

3500-3999 EFTA00137295 EFTA00313715 329

4000-4499 EFTA00078217 EFTA00338754 706

4500-4999 EFTA00338134 EFTA00384534 2,825

5000-5499 EFTA00377742 EFTA00415182 1,353

5500-5999 EFTA00416356 EFTA00432673 1,214

6000-6499 EFTA00213187 EFTA00270156 501

6500-6999 EFTA00068280 EFTA00281003 554

7000-7499 EFTA00154989 EFTA00425720 106

7500-7999 (no new files - all wraps/redundant)

8000-8499 (no new files - all wraps/redundant)

8500-8999 EFTA00168409 EFTA00169291 10

9000-9499 EFTA00154873 EFTA00154974 35

9500-9999 EFTA00139661 EFTA00377759 324

10000-10499 EFTA00140897 EFTA01262781 240

10500-12999 (no new files - all wraps/redundant)

TOTAL UNIQUE FILES: 77,766

Pagination limit discovered: page 184,467,440,737,095,516 (2^64/100)

I searched random pages between 13k and this limit - NO new documents found. The pagination is an infinite loop. All work at: https://github.com/degenai/Dataset9

load more comments (11 replies)

[–] o_derr889@lemmy.world 5 points 3 weeks ago (4 children)

Here is the download link for a text file that has all the original URL's https://wormhole.app/PpjJ3P#SFfAOKm1bnCyi-h2YroRyA The link will only last for 24 hours.

load more comments (4 replies)

[–] TheBobverse@lemmy.world 5 points 4 weeks ago (1 children)

Is there any grunt work that needs to be done? I would like to help out but I'm not sure how to make sure my work isn't redundant. I mean like looking through individual files etc. Is there an organized effort to comb through everything?

[–] berf@lemmy.world 4 points 4 weeks ago (1 children)

I’ve been working on a structured inventory of the datasets with a slightly different angle: rather than maximizing scrape coverage, I’m focusing on understanding what’s present vs. what appears to be structurally missing based on filename patterns, numeric continuity, file sizes, and anchor adjacency.

For Dataset 9 specifically, collapsing hundreds of thousands of files down into a small number of high-confidence “missing blocks” has been useful for auditing completeness once large merged sets (like yours) exist. The goal isn’t to assume missing content, but to identify ranges where the structure strongly suggests attachments or exhibits likely existed.

If anyone else here is doing similar inventory or diff work, I’d be interested in comparing methodology and sanity-checking assumptions. No requests for files (yet) Just notes on structure and verification

[–] jankscripts@lemmy.world 4 points 4 weeks ago (2 children)

Keep in mind when looking at the file names the File name is the name of the first page of the document each page in the document is part of the numbering scheme.

EFTA00039025.pdf

EFTA00039026 ...

... EFTA00039152

load more comments (2 replies)

[–] Arthas@lemmy.world 4 points 3 weeks ago (18 children)

I am downloading dataset 9 and should have the full 180gb zip done in a day. To confirm, the link on DOJ to the dataset 9 zip is now updated to be clean of CSAM or not? As much as I wish to help the cause, I do not want any of that type of material on my server unless permission has been given to host it for credible researchers only that need access to all files for their investigation, but I have no way of understanding what’s within legal rights to assist with redistributing the files to legitimate investigators and thus my plans to help create a torrent may be squashed. Please let me know.

load more comments (18 replies)

[–] bile@lemmy.world 4 points 4 weeks ago* (last edited 4 weeks ago) (2 children)

just advising you that there is confirmed csam in dataset9-more-complete.tar.zst and probably the other partial dataset9s

load more comments (2 replies)

load more comments