You seem like the kind of people that read. How do you convert to epub? (hexbear.net)

submitted 2 months ago by fort_burp@feddit.nl to c/askchapo@hexbear.net

How on planet Earth can I change this pdf to epub? I tried everything I could think of in Calibre, but the problem is that the pdf has 2 columns of text per page, plus footnotes on each page. When it converts to epub it just prints each line of each text column as its own line of text, which makes it totally lose its meaning. Footnotes are also just added as regular text, as part of a supremely incoherent story with aggressive punctuation.

Has anybody been able to solve this before?

[–] stupid_asshole69@hexbear.net 1 points 2 months ago (1 children)

PDFs can be set up in a lot of different ways.

One way is where text is encoded into the document as actual characters, as if it were aligned and sized just right for one of those typewriters with the white-out ribbon. Text encoded into the PDF this way can be selected, edited and copied just like in any other kind of document.

Another way is where text is embedded into the document as an image, like a picture of a newspaper article pasted onto a piece of paper. Text in the PDF like this can’t be manipulated or selected and is the kind you’re having problems with.

The way to get around that kind of text is optical character recognition. OCR software analyzes images of text and figures out which characters they correspond to. Just chase down some free OCR package and feed it your PDF.
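
If you're not sure which kind of PDF you have, one quick check is to try pulling text out of it and see whether anything comes back. A minimal sketch in Python, assuming poppler's `pdftotext` is installed; the helper names and the 20-character threshold are made up for illustration, not anything from a particular tool:

```python
import shutil
import subprocess
import sys


def looks_like_scanned(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess that a PDF is image-only if almost no real text came out of it.

    min_chars is an arbitrary threshold picked for this sketch.
    """
    # str.split() with no arguments drops all whitespace, including the
    # form feeds that pdftotext puts between pages.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def extract_text(pdf_path: str) -> str:
    """Run poppler's pdftotext and return what it prints ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only does anything when given exactly one path and pdftotext exists.
    if len(sys.argv) == 2 and shutil.which("pdftotext"):
        text = extract_text(sys.argv[1])
        print("needs OCR" if looks_like_scanned(text) else "has a text layer")
```

If almost nothing comes out, the pages are most likely images and OCR is the way to go.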

[–] fort_burp@feddit.nl 1 points 2 months ago (1 children)

Cool, thank you very much. I got k2pdf (courtesy of another dope-ass bear) to get the two columns + footnotes in the original pdf into a pdf that is just one column with footnotes clearly distinguishable. Now I need just what you're saying because the result of the k2pdf conversion is an image that I can't select text from (but the words are all in the right order, which is good).

Tesseract seems like a popular choice, I'll give that a try.

[–] Edie@hexbear.net 2 points 2 months ago* (last edited 2 months ago) (1 children)

Tesseract doesn't support PDF input, so you'll need some other program like ocrmypdf (which I have used; it uses Tesseract under the hood), or you can extract each page to its own image (which I have also done, but I forget how right now).
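
Both routes can be sketched roughly like this in Python. This is one possible way to do it, not necessarily how anyone in this thread did it: it assumes `ocrmypdf`, `tesseract`, and poppler's `pdftoppm` are on your PATH, and the function names, the `page` prefix, and the 300 DPI default are just choices made for this example.

```python
import subprocess
from pathlib import Path


def ocrmypdf_command(src: str, dst: str, language: str = "eng") -> list[str]:
    """Command line for ocrmypdf: adds a searchable text layer to src, writes dst."""
    return ["ocrmypdf", "-l", language, src, dst]


def page_image_command(src: str, prefix: str, dpi: int = 300) -> list[str]:
    """Command line for pdftoppm: one PNG per page, file names start with prefix."""
    return ["pdftoppm", "-r", str(dpi), "-png", src, prefix]


def tesseract_command(image: str, out_base: str, language: str = "eng") -> list[str]:
    """Command line for tesseract: reads one image, writes out_base + '.txt'."""
    return ["tesseract", image, out_base, "-l", language]


def ocr_page_by_page(src: str, workdir: str) -> None:
    """Fallback route: split the PDF into page images, then OCR each image."""
    prefix = str(Path(workdir) / "page")
    subprocess.run(page_image_command(src, prefix), check=True)
    # pdftoppm names its output like page-1.png or page-01.png, zero-padded,
    # so a plain sort keeps the pages in order.
    for image in sorted(Path(workdir).glob("page-*.png")):
        out_base = str(image.with_suffix(""))
        subprocess.run(tesseract_command(str(image), out_base), check=True)
```

The one-step route is `subprocess.run(ocrmypdf_command("in.pdf", "out.pdf"), check=True)`; the page-by-page route leaves one `.txt` file per page in `workdir`.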

[–] fort_burp@feddit.nl 2 points 2 months ago

Thanks again! You're the best :)

This looks like exactly what I need. After getting the formatting right with k2pdf, I can use ocrmypdf to get it back to text form, then just Ctrl+A, copy into Writer, and export as epub, since the pdf size is like 15x the epub size.
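
As a possible alternative to the copy-and-paste step, ocrmypdf can also write the recognized text to a plain text file via its `--sidecar` option, and Calibre's `ebook-convert` command-line tool can turn a text file into an epub. A rough sketch in Python, assuming both tools are installed; the function names and file names are placeholders for illustration, and the text will likely still need cleanup (footnotes, line breaks) before it reads well:

```python
import shutil
import subprocess


def sidecar_command(src: str, dst_pdf: str, text_out: str, language: str = "eng") -> list[str]:
    """ocrmypdf call that writes the OCR'd PDF plus a plain-text copy of the text."""
    return ["ocrmypdf", "-l", language, "--sidecar", text_out, src, dst_pdf]


def epub_command(text_in: str, epub_out: str) -> list[str]:
    """Calibre ebook-convert call: the output format comes from the file extension."""
    return ["ebook-convert", text_in, epub_out]


def pdf_to_epub(src: str, dst_pdf: str, text_out: str, epub_out: str) -> bool:
    """Run both steps; returns False without doing anything if a tool is missing."""
    if not (shutil.which("ocrmypdf") and shutil.which("ebook-convert")):
        return False
    subprocess.run(sidecar_command(src, dst_pdf, text_out), check=True)
    subprocess.run(epub_command(text_out, epub_out), check=True)
    return True
```

Something like `pdf_to_epub("onecolumn.pdf", "ocr.pdf", "book.txt", "book.epub")` would run the whole chain on the single-column PDF.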