Diacritics in search queries

Jon · May 7, 2025, 2:54pm

Thanks for this, Conor. I definitely see the issue here. The search function is currently just a simple, case-insensitive string search, like Ctrl+F, so a query only matches text that contains the same characters, and an accented letter does not match its unaccented form. We had to make a decision in the winter about whether to begin with this or to develop a slightly more elaborate search function (PostgreSQL’s full-text search), which would handle word variations and partial matches and would rank results by relevance. The problem with the latter option was that it would take a while to build, yet would quickly become redundant given where we want to take search.

Our goal for Leo is to have true semantic searching, which means searching for a meaningful concept rather than for exact token overlap, so that any given query also finds results for roughly the same idea expressed in other lexical forms, such as synonyms or paraphrases. So if you searched for, say, the “glorious revolution” in eighteenth-century British pamphlets, you would also get hits for “the bloodless revolution”, “the happy revolution”, “the restoration of our liberties”, or “the great deliverance” when these referred to the same event. Doing this will take time and investment. We’ll need to create quality embeddings and a vector-search layer, make it work with other metadata filters, build evaluation tooling, and integrate this system into Leo’s user interface. Of course, once it’s done, the problem with diacritics will disappear altogether.
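To make the idea concrete, here is a toy sketch of what a vector search combined with a metadata filter could look like. Everything in it is an assumption for illustration: the `Passage` type, the `semantic_search` function, the year filter, and above all the three-number embeddings, which are written by hand here. A real system would get its embeddings from a model and would use a proper vector index rather than a linear scan.

```python
import math
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    year: int                # metadata used for filtering
    embedding: list[float]   # hand-written toy vector, not from a real model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_embedding, passages, year_range, top_k=3, min_score=0.8):
    """Filter by metadata first, then rank the survivors by similarity."""
    first, last = year_range
    candidates = [p for p in passages if first <= p.year <= last]
    scored = [(cosine(query_embedding, p.embedding), p) for p in candidates]
    scored = [(score, p) for score, p in scored if score >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p.text for _, p in scored[:top_k]]


# Hypothetical data: similar ideas get nearby vectors, unrelated ones do not.
passages = [
    Passage("the bloodless revolution", 1712, [0.9, 0.1, 0.0]),
    Passage("the great deliverance", 1734, [0.8, 0.3, 0.1]),
    Passage("the price of corn", 1740, [0.0, 0.2, 0.9]),
    Passage("the happy revolution", 1820, [0.9, 0.2, 0.0]),  # outside the date filter
]
glorious_revolution = [1.0, 0.2, 0.0]  # stand-in for the embedded query

results = semantic_search(glorious_revolution, passages, year_range=(1700, 1799))
print(results)  # ['the bloodless revolution', 'the great deliverance']
```

Neither hit shares the word “glorious” with the query; they are returned because their vectors point in nearly the same direction as the query vector, and the 1820 passage is excluded by the date filter before any scoring happens.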

It’ll be a little while until this is fully put together, but I’ll talk with Jack and see whether there are any small things we can do in the meantime to improve the search feature and tackle this and similar issues with near matches.
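One small thing of that kind might be accent folding, where both the query and the text are stripped of diacritics before comparing. Here is a minimal sketch, assuming Python and its standard `unicodedata` module. The function names and sample titles are made up for illustration and this is not Leo’s actual code.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip combining marks and case so 'Fénelon' and 'fenelon' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)  # 'é' -> 'e' + combining accent
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def naive_search(query: str, documents: list[str]) -> list[str]:
    """Case-insensitive substring search, roughly what Ctrl+F does."""
    q = query.lower()
    return [d for d in documents if q in d.lower()]


def folded_search(query: str, documents: list[str]) -> list[str]:
    """Same search, but with diacritics folded on both sides."""
    q = fold(query)
    return [d for d in documents if q in fold(d)]


# Hypothetical sample titles.
docs = [
    "Lettre de M. Fénelon au roi",
    "A letter from Fenelon",
    "Mémoires de Saint-Simon",
]

print(naive_search("fenelon", docs))   # ['A letter from Fenelon']
print(folded_search("fenelon", docs))  # ['Lettre de M. Fénelon au roi', 'A letter from Fenelon']
```

This only handles letters that Unicode decomposes into a base letter plus a combining mark. Characters such as “ø” or “đ” have no such decomposition and would need an explicit mapping.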