Kindly_District9380

joined 1 day ago
[–] Kindly_District9380@lemmy.world 5 points 1 day ago* (last edited 1 day ago) (1 children)

Hey, sorry, I got super distracted building a data mapper, but I have the version here. The .gov site just stopped responding to my requests, even though I was requesting the pages quite gracefully:

UPDATE DATASET 9 Files List:

Progress:

  • Scraped 529,334 file URLs from justice.gov (pages 0-18333, ~89% of index)
  • Downloading individual files: 30K files / 41GB so far
  • Also grabbed the 86GB DataSet_9.tar.xz torrent (~500K files) - extracting now
  • Uploaded my URL index to Archive.org - 529K file URLs in JSON format if anyone wants to help download the remaining files.

link: https://archive.org/details/epstein-dataset9-index

The link is live and shows the 75.7MB JSON file available for download.

coming up with that right now, check my comment below.

I made this script and it is slowly getting close to finishing the crawl; it's 20k+ pages.

https://pastebin.com/zbF0Rmfx

It should be done in 1-2 hours at most, and then I will upload it to Archive.org.

[–] Kindly_District9380@lemmy.world 13 points 1 day ago* (last edited 1 day ago) (1 children)

Superb, I have 1-8, 11-12.

Only 10 remains to complete the set (downloading it from Archive.org now).

Dataset 9 is the biggest. I ended up writing a parser to go through every page on justice.gov and make an index list.
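For anyone who wants to build a similar index, here is a minimal sketch of the link-extraction step only. It is not the script linked in this thread; `FileLinkParser`, `extract_file_links`, and the `example.gov` URL are hypothetical names used for illustration, and the real listing pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FileLinkParser(HTMLParser):
    """Collect <a href> targets that end with one of the wanted extensions."""

    def __init__(self, base_url, extensions=(".pdf",)):
        super().__init__()
        self.base_url = base_url
        self.extensions = tuple(e.lower() for e in extensions)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare case-insensitively, but keep the original URL spelling.
        if href and href.lower().endswith(self.extensions):
            self.links.append(urljoin(self.base_url, href))


def extract_file_links(html, base_url, extensions=(".pdf",)):
    """Return absolute file URLs found in one listing page (assumed layout)."""
    parser = FileLinkParser(base_url, extensions)
    parser.feed(html)
    return parser.links
```

Calling `extract_file_links(page_html, page_url)` once per listing page and appending the results to one list gives an index of the kind described above.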

Current estimate for the file list:

  • ~1,022,500 files (50 files/page × 20,450 pages)
  • My scraped index so far: 528,586 files / 634,573 URLs
  • Currently downloading individual files: 24,371 files (29GB)
  • Download rate ~1 file/sec to avoid getting blocked = ~12 days continuous for full set
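The arithmetic behind these estimates can be checked with a few lines. The inputs are the figures quoted in the list; the function and parameter names are just illustrative.

```python
SECONDS_PER_DAY = 86_400


def estimate(files_per_page=50, total_pages=20_450, seconds_per_file=1.0):
    """Return (estimated total files, days of continuous downloading)."""
    total_files = files_per_page * total_pages
    days = total_files * seconds_per_file / SECONDS_PER_DAY
    return total_files, days


total_files, days = estimate()
print(f"{total_files:,} files, about {days:.1f} days at 1 file/sec")
```

At 1 file/sec the full set works out to a little under 12 days, which matches the estimate above.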

Your merged 45GB + 86GB torrents (~500K-700K files) would be a huge help. Happy to cross-reference with my scraped URL list to find any gaps.
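A rough sketch of how that cross-reference could work, assuming files on disk keep the last path segment of their URL as the file name (that naming is an assumption, as are the helper names `url_filename` and `find_missing`):

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def url_filename(url):
    """Last path segment of a URL, percent-decoded."""
    return unquote(PurePosixPath(urlparse(url).path).name)


def find_missing(urls, local_dir):
    """Return the URLs whose file name is not present anywhere under local_dir."""
    have = {p.name for p in Path(local_dir).rglob("*") if p.is_file()}
    return [u for u in urls if url_filename(u) not in have]
```

Matching on file name alone will not notice truncated or corrupted downloads, so a size or hash check would be a sensible follow-up where that information is available.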


UPDATE DATASET 9 Files List:

Progress:

  • Scraped 529,334 file URLs from justice.gov (pages 0-18333, ~89% of index)
  • Downloading individual files: 30K files / 41GB so far
  • Also grabbed the 86GB DataSet_9.tar.xz torrent (~500K files) - extracting now

Uploaded my URL index to Archive.org - 529K file URLs in JSON format if anyone wants to help download the remaining files.

link: https://archive.org/details/epstein-dataset9-index

The link is live and shows the 75.7MB JSON file available for download.
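If you want to help with the remaining files, a polite download loop could look roughly like this. It assumes the index is a flat JSON list of URL strings, which may not match the actual file layout, so adjust `load_urls` accordingly. `load_urls`, `fetch_bytes`, and `download_all` are hypothetical helper names.

```python
import json
import time
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def load_urls(index_path):
    """Load URLs from the index (assumed: a flat JSON list of strings)."""
    with open(index_path, encoding="utf-8") as fh:
        data = json.load(fh)
    return [u for u in data if isinstance(u, str)]


def fetch_bytes(url, timeout=60):
    """Download one URL and return its body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, out_dir, delay=1.0, fetch=fetch_bytes, sleep=time.sleep):
    """Save each URL into out_dir, skipping files already present.

    Sleeps `delay` seconds after every request (~1 file/sec by default)
    to stay gentle on the server.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = out / unquote(PurePosixPath(urlparse(url).path).name)
        if target.exists():
            continue  # already downloaded on an earlier run
        target.write_bytes(fetch(url))
        saved.append(target)
        sleep(delay)
    return saved
```

Because existing files are skipped, the loop can be stopped and restarted without downloading anything twice. It has no retry or error handling, so a failed request will stop the run.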


UPDATE Dataset Size Sanity Check:

Dataset Report
Generated: 2026-01-31T23:28:29.198691
Base Path: /mnt/epstein-doj-2026-01-30

Summary

Dataset                   Files      Extracted size  ZIP size   File types
DataSet_1                 6,326      2.48 GB         1.23 GB    .pdf, .opt, .dat
DataSet_1_incomplete      3,158      1.24 GB         N/A        .pdf, .opt, .dat
DataSet_2                 577        631.66 MB       630.79 MB  .pdf, .dat, .opt
DataSet_3                 69         598.51 MB       595.00 MB  .pdf, .dat, .opt
DataSet_4                 154        358.43 MB       351.52 MB  .pdf, .opt, .dat
DataSet_5                 122        61.60 MB        61.48 MB   .pdf, .dat, .opt
DataSet_6                 15         53.02 MB        51.28 MB   .pdf, .opt, .dat
DataSet_7                 19         98.29 MB        96.98 MB   .pdf, .dat, .opt
DataSet_8                 11,042     10.68 GB        9.95 GB    .pdf, .mp4, .xlsx
DataSet_9_files           35,480     40.44 GB        45.63 GB   .pdf, .mp4, .m4a
DataSet_9_45GB_unique     28         84.18 MB        N/A        .pdf, .dat, .opt
DataSet_9_extracted       531,256    94.51 GB        N/A        .pdf
DataSet_9_45GB_extracted  201,357    47.45 GB        N/A        .pdf, .dat, .opt
DataSet_10_extracted      504,030    81.15 GB        78.64 GB   .pdf, .mp4, .mov
DataSet_11                14,045     1.17 GB         25.56 GB   .pdf
DataSet_12                154        119.89 MB       114.09 MB  .pdf, .dat, .opt
TOTAL                     1,307,832  281.07 GB       162.87 GB

https://pastebin.com/zdHbsCwH

Here is a little script that can generate the above report if your directory is laid out something like this:

 # Minimum working example:
  my_directory/
  ├── DataSet_1/
  │   └── (any files)
  ├── DataSet_2/
  │   └── (any files)
  └── DataSet 2.zip  (optional - will be matched)
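
For reference, the core of such a report can be sketched as below. This is not the pastebin script, just a minimal illustration of the same idea under the layout above; the name-matching rule between a folder and its optional ZIP (ignore case, spaces, and underscores) is an assumption, and `human_size`, `normalize`, and `summarize` are hypothetical names.

```python
from collections import Counter
from pathlib import Path


def human_size(num_bytes):
    """Format a byte count using 1024-based units, e.g. 1536 -> '1.50 KB'."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{size:.2f} {unit}"
        size /= 1024


def normalize(name):
    """Make 'DataSet 2' and 'DataSet_2' compare equal (assumed rule)."""
    return name.lower().replace(" ", "").replace("_", "")


def summarize(base_path):
    """Return one row per sub-directory: file count, size, ZIP size, top types."""
    base = Path(base_path)
    zips = {normalize(z.stem): z for z in base.glob("*.zip")}
    rows = []
    for folder in sorted(p for p in base.iterdir() if p.is_dir()):
        files = [f for f in folder.rglob("*") if f.is_file()]
        ext_counts = Counter(f.suffix.lower() for f in files if f.suffix)
        zip_path = zips.get(normalize(folder.name))
        rows.append({
            "dataset": folder.name,
            "files": len(files),
            "extracted_bytes": sum(f.stat().st_size for f in files),
            # None means no matching ZIP was found (shown as N/A in the report).
            "zip_bytes": zip_path.stat().st_size if zip_path else None,
            "types": [ext for ext, _ in ext_counts.most_common(3)],
        })
    return rows


def print_report(base_path):
    """Print the rows as a simple aligned table."""
    for row in summarize(base_path):
        zip_size = "N/A" if row["zip_bytes"] is None else human_size(row["zip_bytes"])
        print(f'{row["dataset"]:<26}{row["files"]:<11,}'
              f'{human_size(row["extracted_bytes"]):<16}{zip_size:<11}'
              f'{", ".join(row["types"])}')
```

Running `print_report("my_directory")` prints one line per `DataSet_*` folder.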