I’ve noticed a minor error in virtually all of the records I’ve input into Leo: the word “ye” (stylized as a y with an e written above it) is immediately corrected to “the.” This makes sense, given that the two essentially mean the same thing in context, but it’s still a consistent error.
Thanks, Logan. What’s going on here is that the transcription convention which predominates in the training data renders þ (the thorn, pronounced “th”), which appeared in early modern English manuscripts in a form written like a y, as an italicized “th”: hence “ye” comes out as “the”.
It’d be good to get feedback from people experienced with these kinds of manuscripts on whether this is desirable behavior. What do you think @NoahMillstone @Adam_B_Forsyth @Declan_Noble or others?
It might be a bit tricky to change, because “ye” is ambiguous: it is also the second-person plural pronoun. So if we cleaned the data, we’d need a way to differentiate between these two usages, perhaps starting with something like the rough heuristic below.
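To make that concrete, here’s the kind of crude first pass I have in mind (purely a sketch; the helper and cue list are invented, and nothing like this exists in Leo today). In a flattened transcript both usages come out as plain “ye”, so the only signal left is context: the pronoun is almost always followed by a verb (“ye know”, “ye shall”), while the thorn-article precedes a noun phrase.

```python
# Hypothetical helper, not part of Leo: classify a flattened "ye" by the
# word that follows it. Pronoun "ye" is almost always followed by a verb;
# thorn-"ye" is an article, so it should precede a noun phrase.
PRONOUN_CUES = {"are", "be", "have", "know", "shall", "will", "may",
                "must", "do", "did", "were", "can", "cannot"}

def classify_ye(tokens: list[str], i: int) -> str:
    """Return 'pronoun', 'article', or 'ambiguous' for tokens[i] == 'ye'."""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    if nxt in PRONOUN_CUES:
        return "pronoun"
    if nxt.isalpha():
        return "article"  # default guess: most "ye" in these texts are thorns
    return "ambiguous"    # end of line, punctuation, etc.: flag for review

words = "ye know that ye kinge came to ye citie".split()
print([(i, classify_ye(words, i)) for i, w in enumerate(words) if w == "ye"])
# [(0, 'pronoun'), (3, 'article'), (7, 'article')]
```

Anything ambiguous would get flagged for hand review; a spelling-tolerant POS tagger would do better than a cue list, but the shape of the problem is the same.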
Hi Jon, I think it really depends on what you think users are going to use the transcriptions for. If they are mostly going to read the text or search it for particular words, a semi-diplomatic transcription would probably do. On MPESE we did semi-diplomatic (by hand, of course): Manuscript Pamphleteering in Early Stuart England
On the other hand, I think (1) certain sorts of literary critics and (2) people trying to do digital humanities work with big data sets of text will want transcriptions that are TOTALLY FAITHFUL to the text in a way that historians mostly don’t need. At MPESE we marked up in XML according to TEI [https://tei-c.org] conventions in a pretty light-touch way, but the people working on e.g. the John Donne project used to tag EVERYTHING. If you want Leo to create transcripts that could serve as a basis for TEI tagging, you need it to preserve as many of the features of the page as possible, thorns included.
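For concreteness, the light-touch pattern I mean pairs the original form with a regularised reading via TEI’s choice/orig/reg elements. Here’s a throwaway Python sketch that builds the markup for a thorn-“ye” (illustrative only, not MPESE’s actual encoding):

```python
# Throwaway sketch (not MPESE's actual encoding): wrap a thorn-"ye" so both
# the diplomatic form and a regularised reading survive, using TEI's
# <choice>/<orig>/<reg> elements, with <hi rend="superscript"> for the
# raised e.
import xml.etree.ElementTree as ET

def tag_thorn_ye() -> ET.Element:
    choice = ET.Element("choice")
    orig = ET.SubElement(choice, "orig")
    orig.text = "y"
    hi = ET.SubElement(orig, "hi", rend="superscript")
    hi.text = "e"
    reg = ET.SubElement(choice, "reg")
    reg.text = "the"
    return choice

print(ET.tostring(tag_thorn_ye(), encoding="unicode"))
# <choice><orig>y<hi rend="superscript">e</hi></orig><reg>the</reg></choice>
```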
I, personally, do not care about thorns and am happy with “the” or “[th]e” or whatever. But textual editors will care a lot.
This is very helpful to know! We’ve been experimenting with tagging as a way to deal with inconsistencies in the training data: if a particular set of transcriptions doesn’t, say, preserve strikethroughs or interlinear additions, we can tell the model what to expect. It makes me think we could build on that to get Leo to produce (its best attempt at) both diplomatic and semi-diplomatic transcripts.
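To sketch what I mean (all the names here are invented, this isn’t Leo’s real interface): each training transcript would carry flags describing what its transcriber preserved, and the same flags at inference time would steer the model toward one convention or the other.

```python
# Invented names, not Leo's real interface: flags describing what a given
# transcript preserves, serialised as tags the model sees at training and
# inference time. The flag values below are illustrative, not a definition
# of either convention.
from dataclasses import dataclass

@dataclass
class Conventions:
    expand_thorns: bool        # render thorn-"ye" as "the"
    keep_strikethroughs: bool  # preserve deleted text
    keep_interlinear: bool     # preserve additions above the line

DIPLOMATIC = Conventions(expand_thorns=False, keep_strikethroughs=True,
                         keep_interlinear=True)
SEMI_DIPLOMATIC = Conventions(expand_thorns=True, keep_strikethroughs=False,
                              keep_interlinear=True)

def prompt_tags(c: Conventions) -> str:
    """Prefix for the model prompt describing the expected conventions."""
    return (f"<thorns:{'expanded' if c.expand_thorns else 'preserved'}>"
            f"<strikethrough:{'kept' if c.keep_strikethroughs else 'dropped'}>"
            f"<interlinear:{'kept' if c.keep_interlinear else 'dropped'}>")

print(prompt_tags(SEMI_DIPLOMATIC))
# <thorns:expanded><strikethrough:dropped><interlinear:kept>
```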