Problems with marginalia

Hi Mabel! I suspect the issue in this case relates to the size of the image. Leo currently struggles with images of very large manuscripts like this one. The Transformer-based models that power our system become very resource-intensive when processing high-resolution images. The technical reason is that they divide the image into small segments, or “patches,” and analyze the relationships between all patches simultaneously using a self-attention mechanism. The computational and memory cost of this operation grows quadratically with the number of patches, so for very large images it becomes prohibitive, which is why we downscale images before processing. At present, we resize inputs to a maximum of 4,194,304 pixels in total (equivalent to 2048x2048 if square). While text in large manuscripts is sometimes still legible at that resolution, it becomes significantly harder to interpret, which makes hallucinations such as this one more likely. As a workaround for now, you could try cropping the image so that the part you care about keeps more of its original resolution. Let me know if that helps!
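
In case it helps to see the numbers, here is a rough Python sketch of the arithmetic described above. It is only an illustration, not Leo's actual code: the function names are made up, the 16-pixel patch size is an assumption (the real value may differ), and the only figure taken from this post is the cap equivalent to a 2048x2048 square.

```python
import math

# Pixel cap from the post: equivalent to a 2048x2048 square image.
MAX_PIXELS = 2048 * 2048


def downscaled_size(width, height, max_pixels=MAX_PIXELS):
    """Return (w, h) shrunk to fit within max_pixels, keeping the aspect ratio.

    Images already under the cap are returned unchanged.
    """
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def patch_count(width, height, patch=16):
    """Number of patches for a given image size.

    patch=16 is an assumed, hypothetical patch size used for illustration.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


def attention_pairs(width, height, patch=16):
    """Patch-to-patch pairs compared by self-attention: quadratic in patches."""
    n = patch_count(width, height, patch)
    return n * n


if __name__ == "__main__":
    # A hypothetical 8000x6000 manuscript scan, before and after downscaling.
    original = (8000, 6000)
    resized = downscaled_size(*original)
    print("resized to:", resized)
    print("pairs at full size:", attention_pairs(*original))
    print("pairs after resize:", attention_pairs(*resized))
```

The thing to notice is that doubling each side of an image gives four times as many patches and sixteen times as many pairs, which is why the cap exists. It also shows why cropping helps: a crop that is already under the cap is not shrunk at all, so the handwriting keeps its original pixel size.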
