(How) does Leo learn from users' images and their corrections?

I was struck by the example pasted below, in which Leo transcribed an identically written final “a” three different ways in three successive words. I have some inkling of how machine learning works, so I’m not sure how helpful it is to provide such an example. (Hopefully, somewhat!)

But this example relates to a basic question I’ve been wondering about while experimenting with Leo: to what extent does Leo already, or might Leo in the future, be able to learn from a specific manuscript or scribe and apply that to reading other works in the same manuscript or by the same scribe? For instance, in the example below, if Leo were confident that the final letter of pericula was a, but less certain about the final letters of damna and exilia (transcribed as piliae), and recognized that the three words had the same final letter, would it be able to improve its accuracy? And does Leo learn from the corrections that users make? And if it knows that other images are written in the same hand, can it apply such knowledge selectively, rather than letting it become part of its universal method of reading all manuscripts?

[Image: manuscript detail in which the final letter of pericula, damna, and exilia is written identically but transcribed three different ways]
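To make the intuition concrete, here is a toy Python sketch of the kind of joint reading I am imagining. It assumes, purely hypothetically, that the model exposes a probability distribution over candidate letters for each glyph, and that some separate judgment has already decided the three final glyphs are written identically; I have no idea whether Leo works anything like this.

```python
import math

def pool_identical_glyphs(distributions):
    """Jointly read glyphs believed to be written identically.

    Each distribution maps candidate letters to probabilities for one glyph.
    Multiplying the probabilities (summing log-probabilities) lets a confident
    reading of one glyph outweigh uncertain readings of the others.
    """
    letters = set().union(*distributions)
    pooled = {}
    for letter in letters:
        # Treat a letter missing from one distribution as very unlikely,
        # not impossible, so a single omission cannot veto it outright.
        pooled[letter] = sum(math.log(d.get(letter, 1e-9)) for d in distributions)
    # Renormalize the joint scores back into a probability distribution.
    total = math.log(sum(math.exp(score) for score in pooled.values()))
    return {letter: math.exp(score - total) for letter, score in pooled.items()}

# Invented confidences for the final glyph of each word in the image above.
final_glyphs = [
    {"a": 0.90, "e": 0.05, "ae": 0.05},  # pericula: confidently 'a'
    {"a": 0.40, "e": 0.35, "ae": 0.25},  # damna: uncertain
    {"a": 0.30, "e": 0.45, "ae": 0.25},  # exilia: read as 'e' on its own
]

print(pool_identical_glyphs(final_glyphs))
# 'a' wins decisively once the three glyphs are forced to agree.
```

Multiplying rather than averaging means one confident glyph can dominate, which is exactly the pericula case; of course, the hard part would be reliably deciding that the glyphs are identical in the first place.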


Thank you for this, Daniel! Leo does not currently learn from corrections made to transcriptions. However, this will be changing soon. The plan is:

  1. Leo will learn from corrections that users make to transcriptions. It won’t learn in real time, but in training cycles tied to each release of the main model.
  2. To encourage users to correct machine-generated transcriptions, we’ll let users fine-tune the base model on their corrections. I discuss this more here and here. The hope is that this will set a data flywheel in motion, in which transcription accuracy increases through a positive feedback loop (points 1 and 2 are sketched in code after this list).
  3. It will also be possible for users to benefit from collaborating on and correcting each other’s transcriptions.
  4. Finally, we’re planning to introduce a “Retry transcription” modal that will harness stochasticity (like what you suggest here) to attempt to generate a better transcription. In addition, as part of this modal, we’ll ask the user to provide the opening text for that particular image as in-context learning, which may improve output (see the second sketch below).