yea lmk
kongstrong
DM me your matrix account, we're looking to get more people to uncover what's missing from dataset 9, see https://lemmy.world/post/42440468/21884671
not familiar with it but sure i can set something up, will DM the 3 of you a link in a minute
I can also take up some of these. Do you happen to have more of those gaps?
Also, are you guys using some chat channel for this? Might be a little more accessible
E: other users that run into this thread, DM me and I can add you to an element group to coordinate all this
nice. Kinda feels like we can never be sure our URL lists are exhaustive, and the DOJ might just let a large part of the dataset go dark
Awesome, I don't really understand what's happening but I'm also running it (also doing it for the presumably exact same 48GB torrent, but I'm supposed to do that right?)
I've been checking your URLs but it seems you've got a lot without a downloadable document attached?
Would love to help still from my PC on dataset 9 specifically. Any way we can exchange progress so I won't start with downloading files you already have downloaded?
E: just started scraping from page 18330 (as you mentioned you ended around 18333), hoping I can fill in the remaining 4000-ish pages
Update 2 (17:15 UTC): just finished scraping up to the page 20500 limit you set in the code. There are 0 new files in the 18330-20500 range compared to the ones you already found. So unless I did something wrong, either your list is complete or the DOJ has been scrambling their shit (considering the large number of duplicate pages, I'm going with the second explanation).
Either way, I'm gonna extract the 48GB and 100GB torrent directories now and try to mark down which of the files already exist within those torrents, so we can make an (intermediate) list of which files are still missing from them
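For anyone who wants to run the same cross-check, here's a rough Python sketch of how it could be done. It's not the code anyone in this thread is actually using. It assumes you have the scraped URLs as a list of strings, that each file's name is the last path component of its URL, and that the torrents are extracted into local directories. The function names are made up for illustration. It matches on filename only (case-insensitive), so it won't catch files that were renamed or that share a name but differ in content.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse, unquote


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, decoded and lowercased.

    Assumption: the file name on disk matches the last URL path segment.
    Query strings and fragments are ignored.
    """
    path = urlparse(url.strip()).path
    return unquote(PurePosixPath(path).name).lower()


def files_in_dirs(dirs) -> set:
    """Collect the lowercased names of all files under the given directories."""
    names = set()
    for d in dirs:
        for p in Path(d).rglob("*"):
            if p.is_file():
                names.add(p.name.lower())
    return names


def missing_urls(urls, dirs) -> list:
    """Return the sorted, de-duplicated URLs whose file is in none of the dirs."""
    have = files_in_dirs(dirs)
    unique = {u.strip() for u in urls if u.strip()}
    return sorted(u for u in unique if filename_from_url(u) not in have)
```

You'd call it with something like `missing_urls(open("urls.txt").read().splitlines(), ["torrent_48gb", "torrent_100gb"])`, where the file and directory names are placeholders for wherever your list and extracted torrents live. Whatever comes back is the (intermediate) list of files still missing from the torrents.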
we're on element