this post was submitted on 09 Dec 2025
39 points (100.0% liked)
askchapo
23225 readers
255 users here now
Ask Hexbear is the place to ask and answer ~~thought-provoking~~ questions.
Rules:
-
Posts must ask a question.
-
If the question asked is serious, answer seriously.
-
Questions where you want to learn more about socialism are allowed, but questions in bad faith are not.
-
Try !feedback@hexbear.net if you're having questions about regarding moderation, site policy, the site itself, development, volunteering or the mod team.
founded 5 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
Pdfs an be set up in a lot of different ways.
One way is where text is encoded into the document like if text were aligned and sized just right for one of those typewriters with the white out ribbon. Text encoded into the pdf in this way can be selected, edited and copied just like any other kind of document.
Another way is where text is embedded into the document, like a picture of a newspaper article pasted onto a piece of paper. Text in the pdf like this can’t be manipulated or selected and is the kind you’re having problems with.
The way to get around that kind of text is optical character recognition. OCR software analyzes images of text and figures out what characters it corresponds to. Just chase down some free ocr package and input your pdf.
Cool, thank you very much. I got k2pdf (courtesy of another dope-ass bear) to get the two columns + footnotes in the original pdf into a pdf that is just one column with footnotes clearly distinguishable. Now I need just what you're saying because the result of the k2pdf conversion is an image that I can't select text from (but the words are all in the right order, which is good).
Tesseract seems like a popular choice, I'll give that a try.
Tesseract doesn't support PDF input, you'll need some other program like ocrmypdf (which I have used. It uses tesseract), or extract each page to it's own image (which I have also done but I forget how right now.)
ⓘ This user is suspected of being a cat. Please report any suspicious behavior.
Thanks again! You're the best :)
This looks like exactly what I need. After getting the formatting right with k2pdf I can then use ocrmypdf to get it back to text form and then just ctrl + a copy to writer and export as epub, since the pdf size is like 15x the epub size.