Issues with pdfs from newspaper databases

Timothy_Alborn · June 7, 2025, 1:34pm

With print sources, up to now I was only transcribing screenshots I took from newspaper databases. I thought I’d see what Leo did with pdf files that I downloaded directly from the database: mixed results. When I upload the file, it creates two documents automatically: one with the company that operates the database, and the other with the actual document.

And then, unlike with all the jpegs I’ve uploaded, it didn’t make it through the first column before repeating the same word in a long row.

The second pdf was better – no problem transcribing — but had the same issue converting one pdf file into two documents (this seems to be a feature, not a bug). Since I imagine many people will be trying to transcribe directly from downloaded pdf files of this sort (these get produced in the same fasion, with two separate pages, for every newspaper database I’ve ever used), it might be worth trying to train Leo to ignore the “header” page. For now I’ll just go back to taking screen shots from my pdf files…

Jon · June 8, 2025, 1:36am

Thanks for this! I think the hallucination is probably unrelated to the image being extracted from a PDF. It looks like this particular image is very large which tends to gives the model some difficulties—see here.

I agree that extracting these header images is pointless, though I’m hesitant about trying to make Leo ignore anything from PDFs. We’ll think about how to make this a smoother experience for the user. In the mean time, we’ve tried to make it easy to allow users to delete multiple images by adding check boxes and batch edit options (under “…”) in the image list. Does that help at all?

Timothy_Alborn · June 8, 2025, 8:09pm

Yes, agreed re the hallucination: the image was also pretty smudgy (though not larger than many jpegs I’ve fed Leo). It’s good to know about large documents – makes perfect sense, and as I’m sure you know, a full newspaper page from the 19th century can hold 10,000 + words! So no worries on that score.

Regarding the extra page – it was easy enough to adapt to once I figured out what was going on. I do wonder if the user will be “charged” for two downloaded documents for this – or is the user only charged for the transcription? I’m sure that’s in the terms and conditions somewhere but I don’t remember which it is. If the latter this is a fairly minor issue. If the former, it would be a bummer to have to pay double for pdfs.

Jon · June 8, 2025, 9:29pm

We’ve discussed ways to get images of very large manuscripts like this working more reliably, which might involve cropping them up in some way from our end before feeding them into the model. It’s definitely helpful to know when people encounter the problem, so we can work on a fix sooner rather than later.

The header image would count toward storage limits, but these are supposed to be non-prohibitive for normal usage (between 10,000 and 100,000 images for the premium subscriptions). But only transcribed images are charged a credit, so hopefully it wouldn’t be too big a problem!

Timothy_Alborn · June 8, 2025, 10:17pm

Cool – happy to run a few huge files through to see what the success rate is.

Timothy_Alborn · June 17, 2025, 7:26pm

I’ve been working through some large-ish pdfs (2500-3000 words), with mixed results (as expected), and no real rhyme or reason. Leo has produce a near-perfect transcription a third of the time, a third of the time he stumbles halfway through and starts repeating the same paragraph, and a third of the time the wheel spins interminably (not sure if this counts as a transcription?). So for now, I think I’ll just take jpegs of smaller parts of the file and go from there.

Jon · June 18, 2025, 2:49pm

Thanks for this - very helpful feedback! I suspect the issue with hallucinations relates to the size of the images, cf:

But the third issue seems separate. Do you mean that the images are stuck interminably on the image uploading wheel (on the left) or the transcribing wheel (on the right)?

Timothy_Alborn · June 18, 2025, 8:17pm

Transcribing wheel. Might have been a connection issue and I wasn’t being patient enough, but I did give it a few minutes.

Jon · June 20, 2025, 1:04am

We’ll look into it. If you could you send a link to the relevant item, that would be great. Only Jack and I will be able to access it. Thanks!

Timothy_Alborn · June 20, 2025, 9:10pm

Sorry – deleted it. Hope you got good feedback from everyone! I’ll be sure to fill in the exit poll soon.

Jon · March 27, 2026, 12:15pm

This issue is fixed in the latest release (v0.3.1), where you can choose whether to extract individual images or pages as a whole as images from PDFs.