Skipping parts of printed documents

I’ve noticed that it sometimes skips paragraphs of printed documents. For example, the transcription has the petition of Hugh Penfold immediately followed by the petition of Lawrence Carter, completely skipping the petition of Thomas Raby and the petition of the Dowager Viscountess. Most pages don’t have this problem, and I can’t see anything different between the original images of the pages that do and those that don’t, so I’m not sure where it’s coming from.


Hi Nell. Sorry for the delay in responding here. This relates to a known problem: sometimes, when multiple lines start with similar text, the text in between gets skipped. The underlying technical issue is a symmetry problem: when many inputs look the same, the model has no incentive to keep them separate. In practice, this means the model treats two similar passages as the same one and ignores everything between them.

This issue should go away as we train the model for longer. In the meantime, we’ve reviewed it in more detail and think we have a shorter-term solution: adding some randomness to the transcription process, e.g. imperceptible noise added to the top of each image. We’re currently working out the UI for this, which would allow users to flag a faulty transcript and retry the transcription. If you need transcripts fixed while this is being developed, feel free to email me links to the relevant images and we can do this manually. Thank you!
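
For anyone curious what "imperceptible noise added to the top of each image" could look like, here is a rough sketch. It is not the project's actual code: the function names, the `rows` and `amplitude` parameters, and the `transcribe` callable are all hypothetical, and the image is assumed to be a plain grayscale grid of 0-255 values.

```python
import random


def add_top_noise(image, rows=2, amplitude=1, seed=None):
    """Return a copy of a grayscale image (list of rows of 0-255 ints)
    with a small random offset added to each pixel in the top `rows` rows.

    Hypothetical helper: `rows` and `amplitude` are illustrative defaults.
    """
    rng = random.Random(seed)
    noisy = [list(row) for row in image]  # copy so the input is not modified
    for y in range(min(rows, len(noisy))):
        for x in range(len(noisy[y])):
            delta = rng.randint(-amplitude, amplitude)
            # Clamp to the valid 8-bit range.
            noisy[y][x] = max(0, min(255, noisy[y][x] + delta))
    return noisy


def retry_transcription(image, transcribe, attempts=3):
    """Run a caller-supplied `transcribe` function on several differently
    perturbed copies of the image and return every resulting transcript."""
    return [transcribe(add_top_noise(image, seed=i)) for i in range(attempts)]
```

The idea is that each retry sees a slightly different input, so a transcription that skipped a passage on one attempt may not skip it on the next, while a change of one grey level in the top rows should be invisible to a reader.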
