Skipping parts of printed documents

Jon · July 10, 2025, 2:40pm

Hi Nell. Sorry for the delay in responding here. This relates to a known problem: when multiple lines start with similar text, the text between them is sometimes skipped. The underlying issue is a symmetry problem: when many inputs look the same, the model has no incentive to keep them separate. In practice, this means the model treats two similar passages as the same one and ignores everything in between.

This issue should go away as we train the model for longer. In the meantime, we’ve reviewed it in more detail and think we have a shorter-term solution: adding some randomness to the transcription process, e.g. imperceptible noise added to the top of each image. We’re currently working out the UI for this, which would allow users to flag a faulty transcript and retry the transcription. If you need transcripts fixed while this is being developed, feel free to email me links to the relevant images and we can do this manually. Thank you!
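
For anyone curious what "noise on the top of the image, then retry" could look like, here is a rough Python sketch. It is only an illustration under stated assumptions, not the actual implementation: the image is assumed to be a grayscale list of pixel rows, and `add_top_noise`, `transcribe_with_retry`, and the `transcribe` and `looks_faulty` callables are all hypothetical names.

```python
import random


def add_top_noise(pixels, rows=2, amplitude=1, seed=None):
    """Return a copy of a grayscale image with tiny random noise on its top rows.

    pixels: list of rows, each a list of ints in 0..255 (an assumed format).
    """
    rng = random.Random(seed)
    noisy = [list(row) for row in pixels]  # copy so the original is untouched
    for y in range(min(rows, len(noisy))):
        for x, value in enumerate(noisy[y]):
            delta = rng.randint(-amplitude, amplitude)
            noisy[y][x] = max(0, min(255, value + delta))  # clamp to valid range
    return noisy


def transcribe_with_retry(pixels, transcribe, looks_faulty, max_attempts=3):
    """Retry on freshly perturbed copies while the transcript looks faulty.

    transcribe and looks_faulty are hypothetical callables supplied by the caller.
    """
    text = transcribe(pixels)
    for attempt in range(1, max_attempts):
        if not looks_faulty(text):
            break
        # A different seed per attempt gives a different perturbation each time.
        text = transcribe(add_top_noise(pixels, seed=attempt))
    return text
```

With `amplitude=1` each affected pixel changes by at most one intensity level, so the change should not be visible, but it gives the model a slightly different input on each retry.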