Doubling Pages as I Upload

One More Question, Sorry…

I have 47 credits left so I planned to upload a 44-page PDF into the system to complete my Beta Testing. However, after I uploaded the 44-page PDF, the Dashboard shows that there are 88 pages to be selected:

I never quite kept track of my credits so this is new to me–it’s the first time I am carefully calculating and comparing the number of pages I upload and the number of pages shown in the system. I am just wondering, where does this doubling of pages come from? Thank you! Sorry for my technological ignorance.
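The mismatch described above can be written as a quick sanity check to run before spending credits. This is only a minimal sketch: the function name `check_upload` is hypothetical, and it assumes one credit per listed image, which the thread implies but does not state outright.

```python
def check_upload(pdf_pages: int, listed_images: int, credits: int) -> dict:
    """Compare a PDF's page count with the number of images the dashboard lists."""
    if pdf_pages <= 0:
        raise ValueError("pdf_pages must be positive")
    return {
        # How many dashboard images each PDF page turned into.
        "images_per_page": listed_images / pdf_pages,
        # True when every page appears to have been split into two images.
        "looks_doubled": listed_images == 2 * pdf_pages,
        # ASSUMPTION: one credit is charged per transcribed image.
        "credits_needed": listed_images,
        "enough_credits": credits >= listed_images,
    }


# Numbers from the post: 44-page PDF, 88 images listed, 47 credits left.
result = check_upload(pdf_pages=44, listed_images=88, credits=47)
print(result)
```

With the numbers from this post, the check reports two images per page and not enough credits for the full set, whereas 44 images would have fit within 47 credits.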

No problem, we like questions. Please could you untick “Apply automated transcription” (so you won’t be charged credits), upload the images, then send a link to the item to me (only Jack and I will be able to see it), and I’ll have a look at what’s going on here.

Thank you Jon. Do I untick “Apply automated transcription” and then tick all 88 of the images in the list and then hit “Next”? Or how should I proceed? Please advise. Thank you!

Untick it, make sure none of the images are selected, then press next. The images will then upload but the model won’t transcribe them so you won’t spend any credits.

Just did! Where should I find the “link to the item” that you referred to, or are you looking for the original PDF that I uploaded for comparison? I can email you that file if that works.

Click on the item (make sure you’re in item view), then copy the URL from the address bar in your browser and paste it as a reply here please! :slight_smile:

I didn’t realize there was an item view function! Here is the URL:

https://www.tryleo.ai/document/63fa7e8b-e806-4484-ab87-5e1a107d78e5?imageId=4f491d2b-8e7b-4a4e-af95-acc50d3a54d4

Thanks!

Oh wow! We have to make the item view more obvious then… presumably this will change your workflow with Leo?

Thanks for sending along the link. This relates to a known issue with the PDF image extraction process, where some pages are turned into transparent images. Sorry for the inconvenience! We’re looking into what exactly causes this. For now you could try converting the PDF into single JPG files using another service, e.g., Adobe Acrobat Pro, or if you Google “PDF to JPG” you’ll find free options online.

Thank you Jon. I actually don’t know to what extent this has happened–probably it has been happening all along and I did not notice, as I wasn’t keeping track of credits! Hope this feedback helps, especially since it relates to credit consumption for users–no one wants to be charged double after all!

Absolutely. We’ll make sure to fix it!

Feel free to go back and check whether this affected your other uploads (single images being duplicated, with one part transparent and the other part showing the background and only some text). If it did, we can offer you some more credits to compensate.

Thank you Jon!

I did take a quick look through the files and there wasn’t any doubling of pages in the files I had exported. However, when I ran a total count of pages, there were some differences in the numbers.

When I ran into the “insufficient credit” message, I was stuck halfway through a file, with 35 pages completed out of 88. If we count these 35 pages, then I had a total of 1036 pages transcribed. If we do not count these 35, it was 1001 pages transcribed. It was certainly over the 1000 mark. But it was not the 1198 number that you provided.

This is definitely not an indictment of any sort! I don’t mean to litigate for credits–that’s not my intention. I just want to point out, like Professor Cheney mentioned, maybe there have been some inconsistencies with the page counts by the Leo system, and that would be something to fix for a more mature version of the program. Users would not be happy if there are unpredictable counts (don’t we all experience that with telecom providers!), so I think it would be an important issue to keep an eye on. Hope this helps!

Also, I wonder whether some of these inconsistencies might have come from exactly this kind of page-doubling, which the system probably then eliminated when exporting files?

Jon, I realized that this problem has only come from one set of documents from a particular archive. And it is hard to simply delete the more transparent pages, as some of the transparent pages also have substantial text on them. I don’t know whether Leo will be able to transcribe this strange set of documents–I was not allowed to scan the documents myself; they were scanned for me by the archival staff using their scanning machine, so maybe it is something with that particular machine?

I private messaged you a CSV (spreadsheet) with all of your completed transcriptions, which comes to the 1,145 number mentioned here. I’m reasonably sure this is accurate (it’s exported straight from Leo’s backend) but please do take a look and let me know if you spot any issues. This is, of course, something that we want to make sure is working correctly!

Yes, one other user had this issue and it was also with professionally scanned documents. So that’s definitely a clue for what might be going on here!

Let me know if you manage to get the pipeline for extracting JPG images from PDFs working using Adobe Acrobat Pro or some other service. If not, I can help out with that!

I tried Adobe Acrobat to extract images, which produced 44 images, but when I uploaded them into Leo it was again 88 images, I think. I also tried “Export to PDF” in Preview, and print to PDF, but the problem persisted. I guess I’ll have to eye-OCR then!

You’d have to extract the PDF into individual JPG files, rather than another PDF. If you send me the problematic files I can do this for you :slightly_smiling_face:
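For anyone who would rather script this step, one possible approach is to render each PDF page to its own JPG with Poppler's `pdftoppm` command-line tool. This is a sketch under assumptions, not the method used by the Leo team: it assumes Poppler is installed and on the PATH, the helper names `build_pdftoppm_command` and `pdf_to_jpgs` are made up for illustration, and it has not been tested against the problematic scans from this thread.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm call that renders every page to a JPG file."""
    # -jpeg selects JPEG output; -r sets the rendering resolution in DPI.
    return ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def pdf_to_jpgs(pdf_path, out_dir, dpi=300):
    """Render each page of pdf_path into out_dir and return the JPG paths."""
    # ASSUMPTION: Poppler is installed; stop early with a clear message if not.
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; install Poppler first")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "page"
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    # pdftoppm names its output <prefix>-<page number>.jpg, zero-padded.
    return sorted(out_dir.glob("page-*.jpg"))
```

Calling `pdf_to_jpgs("scan.pdf", "scan_pages")` (both names are placeholders) should leave one JPG per page in the output folder. Because each page is rendered as a whole rather than having its embedded images pulled out, a 44-page PDF should give 44 files, which can then be uploaded as individual images.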