I definitely think using more context clues, both from other images in the same document and from item-level metadata, as you suggested elsewhere, is very smart, and I’ll talk to Jack about how we can do it. One challenge is that it’d be more expensive due to the extra GPU time, so we might need to introduce some kind of supercharged transcription setting (like ChatGPT’s “Deep Research”) that costs a little extra.
A couple of other points:
- We plan to surface token-level confidence scores to help users review transcripts. So if Leo isn’t sure about “Probbdignaggian”, that token will be highlighted for review. @Laura_Nelson helpfully suggested this a while ago here.
- Something else we’re currently working on, which would address this issue for larger datasets, is the ability for users to fine-tune Leo by correcting transcriptions. The current plan depends on users correcting whole transcripts, as explained below. But ideally, in the future, Leo would also learn from single-word or smaller corrections too.
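To make the confidence-highlighting idea concrete, here’s a minimal sketch of how low-confidence tokens could be flagged for review. This is purely illustrative: it assumes the model exposes per-token confidence scores as (word, score) pairs, and the threshold value and function names are hypothetical, not Leo’s actual implementation.

```python
# Hypothetical sketch: flag tokens below a confidence threshold for review.
# Assumes the transcription model returns (word, confidence) pairs;
# the 0.85 threshold is an arbitrary illustrative value.

CONFIDENCE_THRESHOLD = 0.85

def flag_for_review(tokens, threshold=CONFIDENCE_THRESHOLD):
    """Wrap low-confidence tokens in review markers, leaving the rest as-is."""
    out = []
    for word, confidence in tokens:
        if confidence < threshold:
            out.append(f"[review:{word}]")
        else:
            out.append(word)
    return " ".join(out)

transcript = [("the", 0.99), ("Probbdignaggian", 0.41), ("castle", 0.97)]
print(flag_for_review(transcript))
# -> the [review:Probbdignaggian] castle
```

In a real interface the markers would presumably map to visual highlighting rather than inline brackets, but the underlying logic, thresholding per-token scores, would look much the same.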
This feedback is really helpful for thinking through these problems, so thank you!