For me, the most time-consuming part of making an ebook has been transcribing the text. OCR certainly helps immensely, but there are still errors and problems. If you have some time and want to help, you can transcribe PDFs into HTML.
Step 1: Find a PDF
The first step is to look for the best possible PDF.
A "true text" PDF (as opposed to a PDF made of page images with an OCR text layer) is the best-case scenario, as it can be converted into an EPUB using Calibre.^[I enable "Do not split on page breaks" and set "Split files larger than" to 0 (disabled) in EPUB Output, as otherwise Calibre can split the book in weird places.] You will likely still need to find a scan of the book to compare against.
If you can’t find a "true text" PDF, look for the highest-quality scan. If the PDF doesn’t contain OCR text, or the OCR is of bad quality, you’ll have to do the OCR yourself. I usually use ocrmypdf for this. I haven’t gotten Calibre to create an EPUB from the OCRed PDF it outputs, so I use ocrmypdf for its --sidecar option, which gives me plain text.^[In this case ocrmypdf is essentially just a front-end for tesseract (the actual OCR program), because tesseract can’t take PDFs as input.] You can send me a message with a link to the PDF asking me to do the OCR or EPUB-ification, and I’ll send back the result. If you don’t need my help, do still send me a message saying you intend to transcribe the book, and do include links.
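If you would rather script the OCR step, here is a minimal sketch, assuming ocrmypdf is installed and on your PATH. The function names and the `_ocr.pdf` / `.txt` output names are made up for illustration; only the `--sidecar` option comes from the text above.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(scan: Path, ocr_pdf: Path, sidecar: Path) -> list[str]:
    """Return an ocrmypdf command line that also writes the recognised text to `sidecar`."""
    # CLI shape: ocrmypdf [options] input.pdf output.pdf
    return ["ocrmypdf", "--sidecar", str(sidecar), str(scan), str(ocr_pdf)]


def run_ocr(scan: Path) -> Path:
    """OCR `scan` and return the path of the plain-text sidecar file."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # Hypothetical naming scheme: book.pdf -> book_ocr.pdf and book.txt
    ocr_pdf = scan.with_name(scan.stem + "_ocr.pdf")
    sidecar = scan.with_suffix(".txt")
    subprocess.run(build_ocr_command(scan, ocr_pdf, sidecar), check=True)
    return sidecar
```

Calling `run_ocr(Path("book.pdf"))` would leave a `book.txt` next to the scan, which is the plain text you then proofread.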
Step 2: Transcribe
Once you have your EPUB or plain text, you’ll start on the actual "transcription."
The EPUB can be opened in Calibre’s ebook editor (simply click "Edit book" in the UI), or you can open it with a ZIP manager, extract the HTML file, and edit it with your favourite editor. The plain text can be opened in any kind of editor^[Maybe also something like MS Word or LO Writer, although I don’t know whether they can export HTML or how good the result would be. I’d rather you use something else.], even Notepad, but you may want to use a more capable editor like Notepad++ or Visual Studio Code.
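Since an EPUB is a ZIP archive, the "ZIP manager" route can also be done with a few lines of Python. This is only a sketch: the function names and the list of file extensions are assumptions, and your EPUB may name its content files differently.

```python
import zipfile
from pathlib import Path

# Extensions that usually hold the book text inside an EPUB (an assumption).
HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def list_html_members(epub) -> list[str]:
    """Return the names of the HTML/XHTML files stored inside the EPUB."""
    with zipfile.ZipFile(epub) as archive:
        return [n for n in archive.namelist() if n.lower().endswith(HTML_SUFFIXES)]


def extract_html(epub, destination) -> list[Path]:
    """Extract only the HTML/XHTML files into `destination` and return their paths."""
    with zipfile.ZipFile(epub) as archive:
        names = [n for n in archive.namelist() if n.lower().endswith(HTML_SUFFIXES)]
        return [Path(archive.extract(n, path=destination)) for n in names]
```

For example, `extract_html("book.epub", "book_html")` would copy the content files into a `book_html` folder, ready to open in your editor.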
Then your job is to:
- Proofread. OCR is good, but not perfect. The true-text PDFs are usually great, but may still have minor errors, so you should proofread no matter what source you have.
- Ensure headers/titles are inside heading elements, and that they are at the correct level. Do not use a heading for its font size, use it for its semantic meaning! The book’s title is h1, so the chapter titles are likely to be h2, unless they are inside a part, in which case the part title is h2 and the chapter titles are h3. Also see the Standard Ebooks’ SEMOS for more information.
- Ensure paragraphs are inside `<p>` elements. For EPUBs, that means ensuring there aren’t any incorrect breaks in the `<p>`s; this especially happens at page breaks. For plain text, it means adding `<p>` and `</p>` around each paragraph.
- Ensure italicized text is inside `<i>`^[You may see on this MDN link that `<i>` is not always the correct element, and that there are other "more appropriate" elements for some things, and this is true. It would be nice if you did use the appropriate element, otherwise I will have to change them, but you don’t have to. See the Standard Ebooks’ SEMOS for when an element is appropriate.] elements.
- Re-create tables with `<table>`
- Re-create lists with `<ul>` or `<ol>`, and `<li>` elements
- Ensure footnote/endnote numbers are linked with `<a>`. The actual footnote should be moved to the end of the chapter or section, or to the end of the book. The paragraph element of the footnote should be given an id so that the `<a>` can reference it with `href="#insert_id_here"` (the hash sign is necessary; replace only the text after it with the id).
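To illustrate the footnote linking described in the list, here is a small sketch. It assumes your working text marks notes as bracketed numbers like `[3]`, which may not match your source, and the `note-` id prefix and function names are invented for the example.

```python
import re


def link_note_markers(markup: str, prefix: str = "note-") -> str:
    """Turn bracketed markers such as [3] into links to the element with id "note-3"."""
    return re.sub(
        r"\[(\d+)\]",
        lambda m: f'<a href="#{prefix}{m.group(1)}">{m.group(1)}</a>',
        markup,
    )


def note_paragraph(number: int, text: str, prefix: str = "note-") -> str:
    """Build the footnote paragraph carrying the id that the link points at."""
    return f'<p id="{prefix}{number}">{text}</p>'
```

Note that the `#` appears only in the `href`; the `id` on the paragraph is written without it.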
(TODO: Anything else?)
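If you start from plain text, wrapping the paragraphs by hand gets tedious. A rough sketch of automating it follows, assuming paragraphs in your text file are separated by blank lines (OCR output does not always do this reliably, so check the result against the scan). The function name is hypothetical.

```python
import html
import re


def paragraphs_to_html(text: str) -> str:
    """Wrap each blank-line-separated block of text in a <p> element."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Rejoin lines that were only broken to fit the printed page width.
        joined = " ".join(line.strip() for line in block.splitlines() if line.strip())
        if joined:
            # Escape &, < and > so the text cannot be mistaken for markup.
            paragraphs.append(f"<p>{html.escape(joined, quote=False)}</p>")
    return "\n".join(paragraphs)
```

This does not rejoin words hyphenated across line ends or paragraphs split by a page break, so those still need fixing while proofreading.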
For EPUBs, Calibre likes to add a bunch of unnecessary stuff, like `<span>`s and `class="calibre1"`. So before you start working on it, you may want to remove them.
If you are using Calibre’s ebook editor (or some other editor with regex capabilities), you can remove them by opening find and replace, selecting "Regex" as the mode (or however that works in your editor), entering `class=".*?"` in Find and nothing in Replace, and clicking Replace all. Then enter `<\/?span[^>]*>` in Find (still with nothing in Replace) and click Replace all again; this pattern matches both the opening and the closing span tags, so none are left behind.
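The same cleanup can be done outside an editor. The sketch below is one possible way to do it, not Calibre’s own method: it drops every `class` attribute and then every span tag, opening and closing. The pattern and function names are chosen for illustration.

```python
import re

# Any class attribute together with the whitespace in front of it.
CLASS_ATTR = re.compile(r'\s+class="[^"]*"')
# Opening and closing span tags, with or without attributes.
SPAN_TAG = re.compile(r"</?span\b[^>]*>")


def strip_calibre_markup(markup: str) -> str:
    """Remove class attributes and span tags while keeping the text inside them."""
    without_classes = CLASS_ATTR.sub("", markup)
    return SPAN_TAG.sub("", without_classes)
```

Keep in mind that this removes every span, including any that carry meaning, so look over the result before relying on it.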
Also, if you are using Calibre’s ebook editor, it has a preview on the right, and it may not look good. But don’t worry: your job is to ensure the text and HTML are correct, and I’ll make the ebook and ensure it looks good.
Step 3: ???
Once done, send the finished HTML (or EPUB, if you’ve been working inside one) over to me. If you are instead making an EPUB for yourself and not just transcribing for ComLib, I suggest you look at Standard Ebooks’ Step by Step guide.
Step 4: Profit!
Your EPUB is now ready
pls do, maybe i was dum dum