Who is AI Transcription for?

I’m a bit late to getting started here, but have now begun experimenting with Leo after returning from a few weeks in the archives. Obviously, it’s extraordinarily helpful in many ways, so my sincere thanks for all the work in creating it and getting it running. Its knowledge of secretary hand is really impressive.

But my main question (for anyone) is this: Who are the intended users of AI transcription software? Insofar as it’s people with training in paleography (or the relevant skillset) who are simply using it to get a baseline transcription and then go back and manually check for errors, it’s clearly great and can save a TON of time. But my worry is about people (whether inside or outside the academy) using it without the training/knowledge to check for accuracy, etc. Are undergraduates going to try to start incorporating (likely digitized) manuscripts into term papers/senior theses? Is the College Board going to start including digitized manuscripts in DBQs for AP courses (in the U.S.)? Are incoming graduate students simply going to decide they don’t need to learn paleography and try to write a manuscript-based dissertation?

A few years ago, I did a summer intensive course in paleography (mainly secretary hand), which proved to be one of the most helpful things I’ve ever done. But I worry about the perceived need for courses like that and the fate of paleography training at large in an AI world. So, I don’t mean to sound like I’m scaremongering or a complete Luddite here, but I also have a lot of questions about how these platforms will be used in the broader society. (With the knowledge that Leo isn’t the first and these already exist in various forms.) But once they’re out there, can anyone really control how they’re used? And who they’re used by?

Obviously, I know that the point of software like this is to get a basic transcription - rather than a flawless one - and minor errors are to be expected. But to take one example from the attached screenshot, Leo read the word “yeres” as “xerces,” the Roman numerals “xxx” as “xxv,” and the word “whitsun” as “whatson.” I’m not bothered by small errors like these for my own purposes. But I do wonder what they suggest about putting a tool like this into the hands of the broader public. Then again, that goes back to the “Who are these tools for?” question. And if it’s just trained paleographers trying to save some time and energy, then all is fine.


Thank you for raising these questions, Jenny. I hope others will address them in this topic. Reactions to Leo among historians run from excitement to alarm, and both instincts are valid. The answer to your question, at least from my perspective, is that Leo is for everyone, including those who may not have advanced paleography training: undergraduates, home genealogists, casual hobbyists. There is no intention to restrict access to the platform to those who already have credentials or training. I think one of the major promises of automated HTR (handwritten text recognition) is that it can make manuscript material accessible to a much broader audience than has been possible before. Although transcription services have always been available, they’ve traditionally been very expensive; Leo’s transcriptions cost roughly 1-2% of what a trained paleographer charges.

That said, of course, the transcriptions are not yet as reliable as those of an experienced human. It’s true that if people without the requisite training use Leo to generate transcriptions that they then rely on without checking for accuracy, they may be misled by errors. But that’s not a problem specific to Leo. Indeed, people seeking automated transcriptions, unless they knew about specialist HTR platforms, would be likely to turn to the general-purpose large language models (ChatGPT, Claude, Gemini, etc.). Leo is significantly less likely to mislead than these models because of the way it works. When it does make a mistake, the mistake is usually conspicuous (in context, one would presumably expect “yeres” in your example, rather than the Persian king). By contrast, when the major LLMs make errors, they tend to be much more alluring: hallucinations take the form of a very plausible prediction of what “should” come next, irrespective of what the actual ensuing text says.

Jack and I have discussed possible ways to address the risk of Leo’s transcriptions misleading users. The first step is probably to include a caption in the transcription box, as the interfaces for the big LLMs do, that says something like “Leo can make mistakes. Be sure to check for errors.” We’re also planning to show the user the model’s confidence for each part of the transcript, as part of a larger overhaul that will also allow users to hover over a part of the transcript to see which part of the image it corresponds to. (This should help in the process of checking over transcripts.) I hope @Brian_DeLay doesn’t mind me mentioning that, as he pointed out in the final feedback survey, historians are likely to be more comfortable with Leo when it is as transparent as possible about its limitations. I think this is currently the best way we have of going about that.
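To make the confidence-display idea concrete, here is a minimal sketch of how per-token confidence scores could be surfaced to a reader. The function name, the `[?…?]` markers, and the 0.85 threshold are all illustrative assumptions of mine, not Leo’s actual interface or API:

```python
# Hypothetical sketch: flag low-confidence tokens in an HTR transcript
# so a reader knows which words most need manual checking.

def flag_low_confidence(tokens, threshold=0.85):
    """Render a transcript, wrapping uncertain tokens in [?...?] markers.

    tokens: list of (word, confidence) pairs, confidence in [0, 1].
    """
    rendered = []
    for word, conf in tokens:
        if conf < threshold:
            rendered.append(f"[?{word}?]")  # draw the eye to weak spots
        else:
            rendered.append(word)
    return " ".join(rendered)

# A misreading like "yeres" -> "xerces" would ideally carry a low score
# and be flagged for review (scores here are invented for illustration):
line = [("for", 0.97), ("the", 0.99), ("space", 0.93),
        ("of", 0.98), ("xerces", 0.41)]
print(flag_low_confidence(line))
# -> for the space of [?xerces?]
```

Even a crude marker like this shifts the reader’s default from trusting the whole line to spot-checking the weakest words, which is the behaviour one wants from untrained users.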

It’s still possible that someone with no relevant knowledge of paleography might come to Leo and mistake an error (even one with a low confidence score) for the truth. It might help to introduce a system for crowdsourcing transcription corrections, or some kind of internal mechanism for sending automated transcriptions to a trained, professional paleographer for verification. Obviously, the former relies on volunteer work and the latter would only be available to those who can afford it. We’ve also discussed building a kind of Duolingo for paleography as an offshoot project, where Leo hides the correct (human-made) transcription until the user submits their own attempt, though that’s pretty far down the road. Ultimately, there’s no completely fail-safe way around this problem.
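The “Duolingo for paleography” idea boils down to a word-level comparison between the learner’s attempt and the hidden reference transcription. As a rough sketch (the function and scoring are my own illustration, not a planned design), Python’s standard-library `difflib` is enough for a first pass:

```python
import difflib

def score_attempt(reference, attempt):
    """Compare a learner's transcription attempt against the hidden
    human-made reference, word by word.

    Returns (accuracy, mismatches), where mismatches is a list of
    (reference_words, attempted_words) pairs that differ.
    """
    ref_words = reference.split()
    att_words = attempt.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=att_words)
    mismatches = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # "replace", "delete", or "insert"
            mismatches.append((" ".join(ref_words[i1:i2]),
                               " ".join(att_words[j1:j2])))
    accuracy = matcher.ratio()  # 0.0-1.0 similarity over the whole line
    return accuracy, mismatches

# Using the misreadings from the thread as a toy example:
acc, errs = score_attempt("xxx yeres at whitsun", "xxv yeres at whatson")
# errs -> [("xxx", "xxv"), ("whitsun", "whatson")]
```

Showing the learner exactly which words diverged from the reference, rather than a bare right/wrong verdict, is what would make such a trainer pedagogically useful.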

Building a tool like this is unlike the work I’m used to doing as a historian, in that I have much more limited control over what others do with the output. People can misinterpret your writing, of course, but the scope for people to use Leo in unforeseen or unintended ways is much greater. Mostly, that’s exciting. Already, people are using the platform in ways I hadn’t anticipated. For instance, scholars with disabilities who have not previously been able to work with manuscript material find that it dramatically improves their ability to do so. But there are risks. Some researchers are indeed likely to misuse or over-rely on HTR in their work.

As with all technologies of this kind, the promise and peril are co-constitutive; to refuse one altogether would be to forfeit the other. The best we can do is be pragmatic: implement safeguards for transparency, pedagogy, and stewardship that maximize the benefit while minimizing the risk. I’d be keen to hear any ideas that you or others have for how we can do this.

Thanks for your thoughts here, Jon. (And unrelated, sorry to have missed your IHR presentation on Monday. I had it marked on my calendar as one I wanted to attend but ended up driving down to the Huntington that day).

I completely agree that none of this - including the potential for error - is unique to Leo. And it’s significant that Leo is more accurate than most generic LLMs. I also like some of the ideas you raise for dealing with inevitable errors, like displaying the model’s confidence level and maintaining general transparency about the model’s strengths and limitations. I think those are all good.

I think where we see things differently relates to a more general question about the relationship between specialized skillsets and the “public” (a category in which I’d include, for example, not only casual hobbyists but also undergraduates), and to the question of whether the goal should be to make something like manuscripts as widely accessible as possible. It’s a big question, obviously, and one that those in academia will no doubt take a range of stances on – as you say, everything from excitement to alarm.

I hope you had (or are having) a good time at the Huntington! You raise some important questions about the tradeoffs between access and rigor, and I completely agree that we need to develop Leo in a way that supports rather than replaces expert knowledge. If you have ideas, I’d be keen to hear more about how you think we could address those concerns in practice. Is there anything that would make you feel more confident about how Leo will be used in the long run?

Just returning to this after a few days. Thanks for your thoughts. I don’t necessarily have particular ideas, and I may well be in the minority here anyway, but the main thing that would make me feel more confident has to do with access, which is difficult - if not impossible - to fully control. My own perspective is that Leo (or other HTR platforms) would best operate as a convenient tool for trained paleographers double-checking thorny excerpts of manuscripts, rather than as a widely accessible platform for the general public. But again, access and use aren’t easy, or sometimes even possible, to control.