Reading highlighted text in printed documents

Timothy_Alborn · May 31, 2025, 6:17pm

I’m jumping into printed sources, which I generated by taking pngs of newspaper articles that I found online. First off, I’m so glad you finally fixed this – every single document has been transcribed. And best of all, Leo seems to be flawless at reading through one column then starting the next one – I was worried about that, since many OCR programs read across columns as often as not. The transcriptions are also (mostly) flawless.

So, my one question concerns highlighted text, i.e., the term I was searching for in the newspaper database. Leo does one of two things with these: mainly it adds a strike-through on the word or an adjacent word or phrase (see screenshot), and less often it puts the word in brackets (< and >). Is there a way to train Leo to recognize highlighted text and ignore the highlighting?

Timothy_Alborn · June 2, 2025, 3:47pm

Minor update regarding reading columns: I uploaded two different documents, each with two columns, and in both cases the right-hand column was closer to the top of the page than the left. In the first case, Leo transcribed the right-hand column first, then the left-hand column (no biggie). In the second case, Leo again transcribed the right-hand column first, but ignored the left-hand column. Can you train him to start on the left?

Jon · June 4, 2025, 2:14pm

Thanks, Tim. It’s helpful to know what the issues with the latest version of the model are.

Leo has no common sense so it’ll simply be a question of feeding the model training data where highlighted words are transcribed correctly. I don’t think we’ll need many examples for it to learn this but it’s a relatively niche issue.

Fortunately, we do have a plan for solving the potentially infinite number of small problems where Leo encounters something that is “out of distribution” (i.e., not reflected in the training data). The idea is to introduce the ability for users to fine-tune the Leo model with their own set of training data. So you’d be able to solve this problem for yourself by correcting a few transcripts, using these to fine-tune the base model, and then using that specialised model to transcribe the rest of these newspapers. We would then be able to use that data to train the base model, so you’d be solving the problem for other users too.
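To make the "correct a few transcripts, then fine-tune" idea a bit more concrete, here is a rough sketch of how corrected pages might be gathered into training records. This is only an illustration: the `Page` class, the field names, and the JSONL layout are hypothetical and do not describe Leo's actual data format or any existing fine-tuning interface.

```python
import json
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical container for one transcribed page."""
    image_path: str      # path to the page image
    model_text: str      # what the model produced
    corrected_text: str  # what the user saved after editing


def build_finetune_records(pages):
    """Keep only pages the user actually corrected (non-empty and changed)."""
    records = []
    for page in pages:
        corrected = page.corrected_text.strip()
        if corrected and corrected != page.model_text.strip():
            records.append({"image": page.image_path, "text": corrected})
    return records


def to_jsonl(records):
    """Serialise records as one JSON object per line (assumed format)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    pages = [
        Page("page1.png", "gold and ~~alchymy~~", "gold and alchymy"),
        Page("page2.png", "no changes needed", "no changes needed"),
    ]
    print(to_jsonl(build_finetune_records(pages)))
```

Filtering to changed pages is just one possible choice; pages the model already got right could equally be kept as positive examples.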

Jon · June 4, 2025, 2:18pm

The issue here probably relates to either inaccuracies in the training data or under-training (i.e., we haven’t trained the model for as long as we could due to the computing costs involved). We’ll be able to address the latter when we have more resources. The plan for resolving the former is twofold. First, to do more data cleaning. And second, when we add image manipulation tools (e.g., rotation), to have a cropping feature where the user can select any text that has been missed. Then, that can be added to the final transcript and used to retrain the model.

Timothy_Alborn · June 4, 2025, 4:48pm

That would be great if users can do their own training (assuming the interface is easy enough to figure out). I’m not completely sure how niche this will be: if enough users are working with print sources, I would think that highlighted text will show up fairly often. Anyway, it’s not a huge issue; I just need to remove all strike-throughs after I’ve pasted the document into my Word file.

Also, an amusing twist just now (Leo is being quite literal here!): “and alchymy, [in a yellow box].”
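For anyone who would rather clean up the strike-throughs and angle brackets in plain text than in a word processor, a minimal sketch follows. It assumes the strike-through arrives as Markdown (`~~word~~`), HTML (`<s>`, `<del>`, `<strike>`), or Unicode combining stroke characters; the thread does not say which form Leo's output actually uses, so treat these patterns and the function name as assumptions to adjust.

```python
import re

# Assumed encodings of strike-through; adjust to match the real output.
MARKDOWN_STRIKE = re.compile(r"~~(.+?)~~", re.DOTALL)
HTML_STRIKE = re.compile(r"<(s|del|strike)>(.*?)</\1>", re.DOTALL | re.IGNORECASE)
# Unicode combining short and long stroke overlays, mapped to None to delete them.
COMBINING_STROKES = {0x0335: None, 0x0336: None}
# A word or short phrase wrapped directly in angle brackets, e.g. <alchymy>.
ANGLE_WRAPPED = re.compile(r"<(\w(?:[\w' -]*\w)?)>")


def strip_highlight_artifacts(text: str) -> str:
    """Remove strike-through markup and angle brackets, keeping the words."""
    text = MARKDOWN_STRIKE.sub(r"\1", text)
    text = HTML_STRIKE.sub(r"\2", text)      # must run before the bracket step
    text = text.translate(COMBINING_STROKES)
    text = ANGLE_WRAPPED.sub(r"\1", text)
    return text


if __name__ == "__main__":
    sample = "and ~~alchymy~~, the <philosopher's> stone"
    print(strip_highlight_artifacts(sample))
```

The words themselves are kept on purpose, since the highlighted term is real text and only the formatting around it is spurious.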

Jon · June 4, 2025, 5:47pm

Yes, our job is somehow to incentivize you to make those edits in the web-app so you can teach Leo, hence the fine-tuning!

Did Leo really say “[in a yellow box]”?!

Timothy_Alborn · June 5, 2025, 1:28am

Yes! Twice! (There was only one (capacious) yellow box).

Timothy_Alborn · June 5, 2025, 2:14am

Oh also – good luck with that! (incentivizing users…). It’s much easier for me (and I assume others) to cut and paste into my notes and do all the editing there—e.g. eliminating all strike-throughs at one go instead of eliminating each one in the app before I cut and paste… Even getting a badge for “top new user” may not be sufficient!

Jon · June 6, 2025, 6:19pm

Haha, yes, I suspect a merit-based honours system would be insufficient. We’re hoping that the improved output quality users get from fine-tuning the model on corrected transcripts will provide a more concrete incentive.