Previously-OCR'd PDFs

Daan_Jansen · May 16, 2025, 9:57am

I might be completely misusing Leo here, but I’ve been working with a few Early Modern printed works I plucked off Google Books and the biodiversity library. These books have already been OCR’d, but usually quite poorly, so I wanted to see if Leo could do a better job. When I upload these PDFs, each page is uploaded in duplicate: the ‘background layer’ and the ‘text layer’. It might be useful to be able to combine both layers into a single page image, as a PDF reader does.

Jon · May 20, 2025, 3:42pm

Thanks Daan. We’ll have a think about how to do this. Let me know if you spot any other types of PDFs where Leo can’t extract the image successfully.

Jon · March 27, 2026, 12:16pm

This issue should be fixed in the latest release (v0.3.1), where you can choose whether to extract the individual images from a PDF or to extract each page as a whole as a single image.