Language models - Italian

JHSinclair · March 17, 2025, 12:48pm

So far I am finding the transcription of Italian documents (c. 1600) very good, but it consistently fails on common abbreviations. For example, S[ua] M[aestà] is filled in as S[erenissimo] M[onsieu]r or M[ajest]a (an interesting combination of Italian and French). It doesn’t recognise no[n], co[n], or the -mente suffix, all of which are very common. I’m sure this is just a matter of familiarity, and correcting it will help the learning process, but I wanted to make you aware.

Jack · March 17, 2025, 1:05pm

Thanks for raising this – you’re right that this likely reflects the lack of Italian manuscripts in our training data, and it will get better over time as we retrain our models. Please do let us know if you see any other bad behaviours; it’s helpful for us when evaluating new models.

NoahMillstone · March 17, 2025, 11:27pm

I found similar problems in Italian and French. Generally pretty good, and it often recognizes that there is an abbreviation, but it fails totally to guess what the abbreviation is. I’d provide examples but my workspace is down…