For me, the most time-consuming part of making an ebook has been transcribing the text. OCR helps immensely, but it still leaves errors and problems. If you have some time and want to help, you can transcribe PDFs into HTML.
Step 1: Find a PDF
The first step is to look for the best possible PDF.
A "true text" PDF (as opposed to a PDF of page images with an OCR text layer) is the best-case scenario, as it can be converted into an EPUB using Calibre.^[I enable "Do not split on page breaks" and set "Split files larger than" to 0 (disabled) under EPUB Output, as otherwise Calibre can split the book in weird places.] You will likely still need to find a scan of the book to compare against.
If you can’t find a "true text" PDF, look for the highest-quality scan. If the PDF doesn’t contain OCR text, or the OCR is of bad quality, you’ll have to run OCR yourself. I usually use ocrmypdf for this. I haven’t gotten Calibre to create an EPUB from the OCR’d PDF it outputs, so I use ocrmypdf for its --sidecar option instead, which gives me plain text.^[Here ocrmypdf is essentially just a front end for tesseract (the actual OCR program), because tesseract can’t take PDFs as input.] You can send me a message with a link to the PDF asking me to do the OCR or EPUB-ification, and I’ll send back the result. If you don’t need my help, still send me a message saying you intend to transcribe the book, and include links.
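If you want to script the OCR step, here is a minimal Python sketch of calling ocrmypdf with its --sidecar option. The file names and both helper functions are made up for illustration, and it assumes ocrmypdf is installed and on your PATH; it is not necessarily how I run it myself.

```python
import shutil
import subprocess


def build_ocr_command(input_pdf, output_pdf, sidecar_txt):
    """Build an ocrmypdf command line that also writes a plain-text sidecar file."""
    # --sidecar takes the path of the text file to write next to the OCR'd PDF.
    return ["ocrmypdf", "--sidecar", sidecar_txt, input_pdf, output_pdf]


def run_ocr(input_pdf, output_pdf, sidecar_txt):
    """Run ocrmypdf; raise if it is not installed or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(build_ocr_command(input_pdf, output_pdf, sidecar_txt), check=True)
```

Calling `run_ocr("scan.pdf", "scan-ocr.pdf", "scan.txt")` would leave the plain text in `scan.txt`, which is the file you would then transcribe from.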
Step 2: Transcribe
Once you have your EPUB or plain text, you’ll start on the actual "transcription."
The EPUB can be opened in Calibre’s ebook editor (simply click "Edit book" in the UI), or you can open it with a ZIP manager, extract the HTML file, and edit it with your favourite editor. The plain text can be opened in any kind of editor^[Maybe also something like MS Word or LO Writer, although I don’t know whether they can export HTML or how good the result would be. I’d rather you use something else.], even Notepad, but you may want a more capable editor like Notepad++ or Visual Studio Code.
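Since an EPUB is just a ZIP archive, the "ZIP manager" route can also be scripted. A small sketch using only Python’s standard library; the function name and the list of file extensions are my own assumptions, and an EPUB may well contain several HTML files rather than one.

```python
import zipfile


def extract_html_files(epub_path, out_dir):
    """Extract every HTML/XHTML file from an EPUB and return their archive names."""
    extracted = []
    with zipfile.ZipFile(epub_path) as epub:
        for name in epub.namelist():
            # Assumed extensions; adjust if your EPUB names its files differently.
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                epub.extract(name, out_dir)
                extracted.append(name)
    return extracted
```

The extracted files keep their paths from inside the archive, so they end up in subfolders of `out_dir`.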
Then your job is to:
- Proofread. OCR is good, but not perfect. The true-text PDFs are usually great, but may still have minor errors, so you should proofread no matter what source you have.
- Ensure headers/titles are inside heading elements, and that they are at the correct level. Do not use a heading for its font size; use it for its semantic meaning! The book’s title is h1, so the chapter titles are likely to be h2, unless they are inside a part, in which case the part title is h2 and the chapter titles are h3. Also see the Standard Ebooks’ SEMOS for more information.
- Ensure paragraphs are inside <p> elements. For EPUBs, that means ensuring there aren’t incorrect breaks in the <p>s, which especially happens at page breaks. For plain text, it means adding <p> and </p> around each paragraph.
- Ensure italicized text is inside <em>^[You may see on this MDN link that <em> is not always the correct element, and that there are other "more appropriate" elements for some things, and this is true. It would be nice if you used the appropriate element, as otherwise I will have to change them, but you don’t have to. See the Standard Ebooks’ SEMOS for when each element is appropriate.] elements.
- Re-create tables with <table>
- Re-create lists with <ul>/<ol> and <li> elements
- Ensure footnote/endnote numbers are linked with <a>. The actual footnote should be moved to the end of the chapter or section, or to the end of the book. The paragraph element of the footnote should be given an id so that the <a> can reference it with href="#insert_id_here" (the # is necessary: keep it and replace only insert_id_here with the id).
(TODO: Anything else?)
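For the plain-text route, the paragraph and footnote items above can be partly automated. This is only a rough Python sketch under my own assumptions: paragraphs are separated by blank lines, footnote markers look like [1] in the text, and the note-N id scheme is made up. It does not rejoin words hyphenated across lines, and you still need to proofread the result.

```python
import html
import re


def wrap_paragraphs(text):
    """Wrap each blank-line-separated block of plain text in a <p> element."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue
        # Join hard-wrapped lines so one paragraph becomes one line of HTML.
        joined = " ".join(line.strip() for line in block.splitlines())
        # Escape &, < and > so the plain text stays valid inside HTML.
        paragraphs.append(f"<p>{html.escape(joined, quote=False)}</p>")
    return "\n".join(paragraphs)


def link_footnote_markers(body):
    """Turn markers such as [1] into links that point at the id note-1."""
    return re.sub(r"\[(\d+)\]", r'<a href="#note-\1">\1</a>', body)


def footnote_paragraph(number, text):
    """Build the footnote paragraph carrying the id that the link references."""
    return f'<p id="note-{number}">{number}. {html.escape(text, quote=False)}</p>'
```

Note how the href keeps the # in front of the id, while the id attribute on the footnote paragraph does not have it.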
For EPUBs, Calibre likes to add a bunch of unnecessary stuff, like <span>s and class="calibre1". So before you start working on it, you may want to remove them.
If you are using Calibre’s ebook editor (or some other editor with regex capabilities), you can remove them by opening find and replace, setting the mode to "Regex" (or the equivalent in your editor), entering class=".*?" in Find and nothing in Replace, and clicking Replace all. Then enter <\/?span\s*> in Find (still with nothing in Replace) and click Replace all again; the optional \/ lets the pattern match both opening and closing tags, so no unmatched <span> is left behind.
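If you are working on an extracted HTML file outside Calibre, the same cleanup can be sketched in Python. The function name is hypothetical and the patterns are slightly adapted from the ones described here: the first also swallows the space before class, and the second removes every span tag, including any that carry other attributes, so check that you are not throwing away something meaningful.

```python
import re


def strip_calibre_markup(html_text):
    """Remove class attributes and all <span> tags, keeping the text inside them."""
    # Drop class="..." together with the whitespace in front of it.
    without_classes = re.sub(r'\s+class=".*?"', "", html_text)
    # Drop opening and closing span tags; their contents are left in place.
    return re.sub(r"</?span\b[^>]*>", "", without_classes)
```

For example, `<p class="calibre1"><span class="calibre2">Hi</span> there</p>` comes out as `<p>Hi there</p>`.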
Also, if you are using Calibre’s ebook editor, it has a preview on the right, and it may not look good. Don’t worry: your job is to ensure the text and HTML are correct, and I’ll make the ebook and ensure it looks good.
Step 3: ???
Once done, send the finished HTML (or EPUB, if you’ve been working inside one) over to me. If you are instead making an EPUB for yourself and not just transcribing for ComLib, I suggest you look at Standard Ebooks’ Step by Step guide.
Step 4: Profit!
Your EPUB is now ready
How does one deal with pages that are images or that are tables which are too difficult to transcribe into text format and which are best left as images?
Similarly, what should one do with creating a cover image for the book (assuming one is available)?
Is there a way to submit a book which has been mostly completed but where you got stuck on something tricky or need feedback, e.g. on your first attempt?
I would probably extract the page from the PDF, crop it to just the image, then insert that (as an <img>). Tables that can't be transcribed into simple <table>s should probably be recreated as SVGs; see Russian Justice (chapter 10) for an example.
The cover can either be extracted from the PDF, if it has one, or you can make one yourself, like how I make the ones for mine. If the transcription is submitted to ComLib, I'll make a cover, so you don't have to be concerned about it in that case.
If you have any questions or need feedback, simply reach out to me. If you know you can't finish it but have a lot done, reach out to me and send it over.