Hey both! Segmentation (dividing the input image into meaningful units like columns, tables, lines, words, and characters) is one of the most challenging problems in HTR (handwritten text recognition). Some HTR services use an initial computer-vision step to create bounding boxes in the way you describe, before then “reading” the content, for instance line by line or even character by character. Leo’s model architecture delivers more accurate transcriptions by interpreting the image holistically, but one downside is that it doesn’t currently preserve location information (i.e., coordinates for text).
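For anyone curious, here’s a rough sketch (Python, purely illustrative, and not our actual code) of that two-stage segment-then-read approach; `detect_lines` and `recognize_line` are hypothetical stand-ins for whatever layout-analysis and line-recognition models a given service uses:

```python
# Illustrative sketch of the two-stage approach: detect line bounding
# boxes first, then recognize each line separately. Because every line is
# located before it is read, its coordinates come along for free.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class LineRegion:
    x: int          # left edge of the bounding box, in pixels
    y: int          # top edge
    w: int          # width
    h: int          # height
    text: str = ""  # filled in by the recognizer

def transcribe_two_stage(image,
                         detect_lines: Callable,
                         recognize_line: Callable) -> List[LineRegion]:
    """Segment-then-read: every transcribed line keeps its coordinates."""
    regions = detect_lines(image)                 # layout-analysis step
    for region in regions:
        crop = image.crop((region.x, region.y,
                           region.x + region.w,
                           region.y + region.h))  # cut out just this line
        region.text = recognize_line(crop)        # read the cropped line
    return regions
```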
Our first aim is to ensure that the model always transcribes all text on the page, so that no text in complex layout structures is ignored. We’re making speedy progress here, and an updated version of the model that better preserves all text should be out in the next week or two. After that, we want to find a way to match parts of the image with the transcript, so that hovering over a passage in the transcription highlights the corresponding region of the image. This is a much more complex task, but it’s a development priority in the medium term.
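To give a feel for what that matching could look like (this is just a sketch of one possible design, not a commitment to the final implementation), the model would need to emit transcript spans paired with image coordinates, which the viewer could then look up on hover:

```python
# Sketch of one possible way to link transcript text to page coordinates.
# Names and structure here are assumptions, not our final design.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SpanRegion:
    start: int                      # character offset in the transcript (inclusive)
    end: int                        # character offset (exclusive)
    box: Tuple[int, int, int, int]  # (x, y, width, height) on the page image

def region_for_cursor(spans: List[SpanRegion],
                      cursor: int) -> Optional[Tuple[int, int, int, int]]:
    """Return the image box for the transcript position the user is hovering over."""
    for span in spans:
        if span.start <= cursor < span.end:
            return span.box
    return None

# e.g. hovering at character 137 of the transcript:
#   region_for_cursor(spans, 137) -> the box to highlight on the page image
```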
As a shorter-term solution, we’re currently developing image-manipulation tools (including cropping) in the web app, so that if the model does miss some text, you’ll easily be able to select that portion and transcribe it.