MobileRead Forums - View Single Post

retiredbiker · 11-03-2023, 10:44 AM

Quote:

Originally Posted by Slash

The tools I'm using for OCR work on PDF only; I didn't find a way to OCR my original JPEG files in Calibre.

If you are using Linux, a cool way to do OCR on JPEG images is OCRFeeder as a front end for Tesseract. It gives you fine control and handles paragraphing and end-of-line hyphens very well. It also lets you deal with double-column pages and other ugly layouts.
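If you would rather script it than use a GUI, Tesseract can also be run directly on the JPEG files from the command line. Here is a rough Python sketch of that, not how OCRFeeder does it internally. It assumes the `tesseract` program is installed and on your PATH; the function names, the `eng` language code and the page segmentation mode are just my picks for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, lang="eng", psm=3):
    """Build a Tesseract command line that prints the OCR text to stdout."""
    # Using "stdout" as the output base tells Tesseract to print the text
    # instead of writing a .txt file. --psm 3 is fully automatic page
    # segmentation (Tesseract's default).
    return ["tesseract", str(image), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_folder(folder, lang="eng"):
    """OCR every .jpg/.jpeg file in a folder, in filename order."""
    images = sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )
    pages = []
    for image in images:
        result = subprocess.run(
            tesseract_command(image, lang),
            capture_output=True,
            text=True,
            check=True,  # raise an error if Tesseract fails on a page
        )
        pages.append(result.stdout)
    # Separate pages with a blank line
    return "\n\n".join(pages)
```

Something like `Path("book.txt").write_text(ocr_folder("scans"))` would then give you one text file for the whole book (the `scans` folder name is hypothetical). You lose the fine control the GUI gives you, so this is really only for simple single-column pages.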

If you have a PDF with OCR text in it, Calibre will use the pdftohtml tool to extract the text. Sometimes this does not work, for some reason; if so, try using the pdftotext tool outside Calibre. That will give you a text file, but you are on your own for paragraphing and formatting...as always with PDF.
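As a starting point for the paragraphing, here is a minimal Python sketch that runs pdftotext and then joins the hard-wrapped lines back into paragraphs. It assumes `pdftotext` (from poppler-utils) is on your PATH and that paragraphs in the output are separated by blank lines, which is not always the case. The function names are my own, and the hyphen rule is a simple guess, not anything Calibre does.

```python
import re
import subprocess


def pdftotext_command(pdf_path):
    """Build a pdftotext command line that prints the text to stdout."""
    # "-" as the output file name sends the extracted text to stdout.
    return ["pdftotext", str(pdf_path), "-"]


def reflow(text):
    """Join hard-wrapped lines into paragraphs.

    Blank lines are treated as paragraph breaks. A line ending in "-"
    followed by a line starting with a lowercase letter is assumed to be
    one word split across the line break.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-") and line[:1].islower():
                joined = joined[:-1] + line  # rejoin the split word
            else:
                joined += " " + line
        paragraphs.append(joined)
    return "\n\n".join(paragraphs)


def extract(pdf_path):
    """Run pdftotext on a PDF and return the reflowed text."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise an error if pdftotext fails
    )
    return reflow(result.stdout)
```

The hyphen rule will also glue together real hyphenated words that happen to break at the end of a line, so the result still needs a read-through.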

Anything OCR'd needs proofing and editing, and that is usually the hardest part of the project.