datahoarder

Who are we?

We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

We are one. We are legion. And we're trying really hard not to forget.

-- 5-4-3-2-1-bang from this thread

founded 6 years ago

MODERATORS

archivist@lemmy.ml

Mass download course PDFs? (lemmy.ml)

submitted 2 months ago* (last edited 2 months ago) by tdTrX@lemmy.ml to c/datahoarder@lemmy.ml

It's a Spayee/Graphy course.

The webpage has a sidebar with categories and sub-categories, and each entry just opens a PDF.

The PDF files are stored at URLs like this: https://randomlettersandnumbers.cloudfront.net/w/o/randomLettersAndNumbers/v/randomLettersAndNumbers/u/randomLettersAndNumbers/p/assets/pdfs/2021/01/13/randomLettersAndNumbers/file.pdf

[–] hexagonwin@lemmy.today 0 points 2 months ago (1 children)

can you share the url? in the worst case you'll need to write a custom crawler that works by automating a web browser.

[–] tdTrX@lemmy.ml 1 points 2 months ago (1 children)

www . acadboost . com/courses/11th-JEE-MainAdvanced-Notes

[–] hexagonwin@lemmy.today 1 points 2 months ago (2 children)

ok, i figured out how to download them :) i used the firefox devtools, it might be slightly different on chromium.

after going to a pdf page (like this), open devtools on the network tab, select XHR, filter with the string /preview/url and refresh. you'll get one request whose JSON response contains 'url' and 'p'. as you also found, the pdf is password protected.

the site defines a JS function named parseJData. you can call it like parseJData(p, !0) (!0 is just minified JS for true), where p is the p value from the xhr response, e.g. parseJData("9dd1bbb2b96776b603b2666fb3173133x8Y+a7Fx0tdy2ntJSUCmLFQQW+BMJFz+UGUrdSyaNz2FpFx2fSJvzEJ8JdWXGbeH16ac82d92bc66da09f044fe9faebaaa9", !0). the return value is your pdf password.

you can totally automate this, but there don't seem to be that many PDFs (if you're only going for that one lecture). I'd just keep devtools open, check "persist logs" (click the options button to find it), browse through all the PDF pages, save everything as a HAR file, and write a one-off script to extract all the url and p values.
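a rough sketch of that one-off script, in Python. it assumes the standard HAR layout (log.entries, each with request.url and response.content.text) and that the matching responses are JSON with 'url' and 'p' keys like the one described above. the function names and the "/preview/url" default are just my choices for illustration, and it only collects the values, it does not decode p.

```python
import base64
import json


def load_har(path):
    """Read a HAR export (plain JSON) from disk; tolerates a UTF-8 BOM."""
    with open(path, encoding="utf-8-sig") as fh:
        return json.load(fh)


def extract_preview_items(har, needle="/preview/url"):
    """Return (url, p) pairs from entries whose request URL contains needle."""
    items = []
    for entry in har.get("log", {}).get("entries", []):
        if needle not in entry.get("request", {}).get("url", ""):
            continue
        content = entry.get("response", {}).get("content", {})
        text = content.get("text") or ""
        # Some exporters store response bodies base64-encoded.
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        try:
            body = json.loads(text)
        except json.JSONDecodeError:
            continue  # body missing or not JSON, skip this entry
        if isinstance(body, dict) and "url" in body and "p" in body:
            items.append((body["url"], body["p"]))
    return items
```

usage would be something like `for url, p in extract_preview_items(load_har("session.har")): print(url, p)`, where session.har is whatever name you saved the HAR file under.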

[–] tdTrX@lemmy.ml 1 points 2 months ago

thanks a ton

[–] tdTrX@lemmy.ml 1 points 2 months ago* (last edited 2 months ago) (1 children)

Which part is the password?

{"response":true,"url":"https://d2a5xnk4s7n8a6.cloudfront.net/w/o/5de58cc5e4b06eaef9799a5e/v/5eadee8b0cf250d48d95a674/u/69a29c32cf33e87f41f96eb5/p/assets/pdfs/2020/05/02/5eadee8b0cf250d48d95a674/file.pdf","p":"085b4cff79ec580e9687ecf41d77672feDjUk8jWDf2mBGaRtmLWv/bkykxiE4t16pD/ZQJvvuLn1AFNM35N67fA61ORomhx99cbac6aada470ee7d48d35ebc98d09d","allowDownload":false,"allowWatermark":false}

Also, there is an allowDownload field. How do I make it true?

Why are there 2, sometimes 3, .pdf requests in the network tab when there is only 1 PDF on the page?

[–] hexagonwin@lemmy.today 1 points 2 months ago

sent you a pm, hope it helps

[–] sga@piefed.social 0 points 2 months ago (1 children)

try something along the lines of

wget -r -np -k -p "website to archive recursive download"

this may work, but in case it does not, i would download the page html, filter out all the pdf links (some regex or grep magic), and then just give that list to wget or some other file downloader.
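as a sketch of the "filter out all pdf links" step, here is one way to do it in Python with only the standard library instead of regex/grep. it assumes the links appear as href or src attributes in the saved html, which will not be the case if the page builds them with javascript. PdfLinkParser and find_pdf_links are names i made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect href/src attribute values whose URL path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("href", "src") or not value:
                continue
            # Resolve relative links against the page URL.
            absolute = urljoin(self.base_url, value)
            is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
            if is_pdf and absolute not in self.links:
                self.links.append(absolute)


def find_pdf_links(html, base_url):
    """Return de-duplicated absolute PDF URLs in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

write the resulting list to a text file, one url per line, and you can hand it to wget with -i.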

if you can give the url, we can get a bit more specific.

[–] tdTrX@lemmy.ml 1 points 2 months ago* (last edited 2 months ago) (1 children)

The website needs a login.

I downloaded some PDFs manually via F12 and they are password protected. How do I unlock them or get the password?

[–] sga@piefed.social 1 points 2 months ago* (last edited 2 months ago) (1 children)

in this case, try to fetch a list of the pdf urls, then grab your cookies from the browser, and use curl and scripting to fetch the files.

[–] sga@piefed.social 1 points 2 months ago

for cookies, open devtools, go to the network tab, find the pdf request, right click it, and you will find an option along the lines of 'copy as cURL'. copy that and paste it somewhere. repeat the exercise for some other file. comparing the two should show you the pattern for making a request. it most likely just needs a bearer token in a header, a session cookie, or something like that.
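if you would rather script it than run curl by hand, here is a minimal Python sketch that pulls the url and headers out of a copied cURL command and reuses them for another file. it assumes the POSIX-style output (single-quoted arguments, url right after curl, headers passed with -H, cookies optionally with -b), so other copy formats may need tweaks. parse_curl and request_for are hypothetical helper names.

```python
import shlex
import urllib.request


def parse_curl(command):
    """Extract the URL and headers from a 'copy as cURL' command string."""
    # Join backslash-continued lines before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    url, headers = None, {}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        elif tok in ("-b", "--cookie") and i + 1 < len(tokens):
            headers["Cookie"] = tokens[i + 1]
            i += 2
        else:
            # First bare argument is taken as the URL; other flags are ignored.
            if url is None and not tok.startswith("-"):
                url = tok
            i += 1
    return url, headers


def request_for(url, headers):
    """Build a GET request for another URL that reuses the captured headers."""
    return urllib.request.Request(url, headers=headers)
```

pass the result of request_for to urllib.request.urlopen and write the response bytes to a file. whether the captured cookies or tokens stay valid for other files depends on the site.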