Go Back   MobileRead Forums > E-Book Formats > ePub

Old 05-28-2021, 04:33 AM   #1
Ghitulescu
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
Archive.org ePub

I noticed that a lot (I could probably almost safely say 100%) of ePubs from archive, in the last 2-3 years or so, appear to hang ADE and give various errors in the Calibre 4.xx I use. They do not load in ADE (just now I closed an instance of ADE that had been running for 3 hours trying to load such an ePub, unresponsive the whole time).

It doesn't matter whether the book is old or new (by printing year); the books I used years ago work OK (they still have some quirks, but rather harmless ones, solved by reconverting to epub in calibre) and the recent ones don't.

I spent some hours reading the various complaints about archive. It appeared that many are discontented with the PDFs, but the bad quality (incompatibility) of ePubs has not been raised that often (there is one single thread about it, here).

The ADE is 4.5.11.something and has no problems whatsoever displaying good ePubs I have (mine, made with sigil and calibre, or from other trusted sources).
I did not dare load them on my hardware eReaders for fear of bricking them.

Has anyone encountered this issue?
Old 05-28-2021, 06:52 AM   #2
Quoth
the rook, bossing Never.
Posts: 11,565
Karma: 87456643
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
I find that many archive dot org epub or mobi or PDF are poor unproofed OCR of scans.
Also they don't care about copyright. Likely that's why you have a problem. Their ebooks are often rubbish quality.
So I no longer download from there, only using it to find archives of defunct websites.

Use sites such as gutenberg.org and here that have human curated and proofed genuine public domain sites, or buy on Smashwords, kobo, Amazon etc.
Old 05-28-2021, 09:36 PM   #3
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Ghitulescu View Post
I noticed that a lot (I could probably almost safely say 100%) of ePubs from archive, in the last 2-3 years or so, appear to hang ADE and give various errors in the Calibre 4.xx I use.
Can you link to some that are broken?

Quote:
Originally Posted by Ghitulescu View Post
I spent some hours reading the various complaints about archive. It appeared that many are discontented with the PDFs
??? First I'm hearing about this. What are the problems with their PDFs?

Quote:
Originally Posted by Ghitulescu View Post
but the bad quality (incompatibility) of ePubs has not been raised that often (there is one single thread about it, here).
Because the EPUBs (and MOBI, TXT, [...]) are auto-converted by OCR based on the PDF scans.

You'd get better and more accurate results by downloading the PDFs and running your own OCR.

I wrote about some of that here:

"Optimize PDFs from archive.org for E-Ink devices"

and just last month:

"Tutorial-from Paper Book to Ebook PDF - 400 pages in 4 hours"

I wouldn't touch Archive.org "EPUBs" with a ten foot pole though. To call those actual EPUBs is a travesty.

Quote:
Originally Posted by Quoth View Post
Also they don't care about copyright. Likely that's why you have a problem.
What? This is complete hogwash.

Last edited by Tex2002ans; 05-28-2021 at 09:42 PM.
Old 05-29-2021, 08:38 AM   #4
Quoth
the rook, bossing Never.
Posts: 11,565
Karma: 87456643
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
I mean the bad OCRed scans are the source of the problem, not copyright. The Open Library and other copyright shenanigans at Archive have nothing to do with the ghastly mobi/epub quality. They have been scanning paper books themselves for about 12 years, as well as sourcing from Google, Microsoft and uploaders. The problem is that none of it is human curated or proofed. It's automated.

I just set up Linux box with a 20 year old Epson Perfection1200 on SCSI and Tesseract and gocr* last night. The newish funky colour laser printer-copier-scanner is not obviously better and is also downstairs.
I have some 1890s to 1920s books, but likely I'm more interested in OCR of PD PDFs already scanned elsewhere.
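
For anyone wanting to script the same kind of setup, here is a minimal sketch of batch OCR with the Tesseract command-line tool driven from Python. It assumes `tesseract` is installed and on the PATH, and that the scans are already saved as one PNG per page; the function names and the `*.png` pattern are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the command line for OCR-ing one page image to plain text.

    Tesseract appends ".txt" to the output base itself, so the output base
    is simply the image path with its suffix removed.
    """
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_pages(scan_dir: Path, pattern: str = "*.png") -> None:
    """Run Tesseract over every matching page image in scan_dir, in name order."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    for image in sorted(scan_dir.glob(pattern)):
        subprocess.run(tesseract_command(image), check=True)
```

Calling `ocr_pages(Path("scans"))` would leave a `.txt` file next to each page image. Proofreading the result is still a manual job.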

Yes, I know about ABBYY FineReader. But I don't have it.

I couldn't find any sort of SCSI adaptor for the laptop. I used to have a PCMCIA card and a laptop that could take them.

[* Xsane seems to want gocr, but 15 years ago I would have saved the scans, adjusted in PaintShopPro and used the OCR on files. I can't imagine why I do it from inside Xsane, even though I have a sheetfeeder]

Last edited by Quoth; 05-29-2021 at 08:44 AM.
Old 05-29-2021, 10:07 AM   #5
salamanderjuice
Guru
Posts: 727
Karma: 10215666
Join Date: Jul 2017
Device: Boox Nova 2
One issue I've had with their PDFs is they don't do any sort of correction for yellowed pages so on a B&W eReader they can look like serious junk with banding in the background. Other than that it's fine.

I also can't really blame them for automating this stuff, they just have way too much content and often it's the only place to get it on the web. I needed a chapter from some 70 year old niche book recently and they had it, only other option was a university library 6 hours away that was closed anyways due to COVID.
Old 05-29-2021, 07:52 PM   #6
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by salamanderjuice View Post
One issue I've had with their PDFs is they don't do any sort of correction for yellowed pages so on a B&W eReader they can look like serious junk with banding in the background. Other than that it's fine.
Yep, their automatic Color->B&W doesn't work well for all books. (Though most do perfectly fine.)

But the great thing about Archive.org is they release all the source files.

So if you have problems with the B&W PDF, then instead download the:
  • Color PDF
  • Original source images [JPEG2000]

If you check out Post #4+#6 in that Tutorial thread, I showed the why/how.

You can then use Scan Tailor Advanced in order to correct "yellowed pages" -> B&W. Using that allows you to tweak all the variables to get a much better/cleaner B&W image.
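
To illustrate why a per-book threshold helps with yellowed pages, here is a rough pure-Python sketch of Otsu's method, a standard way of picking a black/white cut-off from the grey levels of a page. This is a swapped-in textbook technique shown for illustration only; it is not a description of what Scan Tailor Advanced or Archive.org actually run. The input is assumed to be a flat list of 0-255 grey values.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the grey level (0-255) that best separates ink from paper.

    Pixels at or below the returned level are treated as ink.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_level, best_score = 0, -1.0
    count_dark, sum_dark = 0, 0
    for level in range(256):
        count_dark += histogram[level]
        sum_dark += level * histogram[level]
        count_light = total - count_dark
        if count_dark == 0:
            continue  # no pixels this dark yet
        if count_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        # Between-class variance (up to a constant factor).
        score = count_dark * count_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_level, best_score = level, score
    return best_level


def to_black_and_white(pixels: list[int], threshold: int) -> list[int]:
    """Map each grey value to pure black (0) or pure white (255)."""
    return [0 if value <= threshold else 255 for value in pixels]
```

On a yellowed page the paper sits well below pure white, so a fixed cut-off can turn the background into bands, while a cut-off computed from the page itself lands between the ink and the paper.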

* * *

And they're always tweaking their workflows.

Like in December 2020, they rescanned/rereleased the entire "Computerworld" magazine from microfilm:

https://blog.archive.org/2020/12/30/...age-microfilm/

Microfilm scanning technology has gotten much better since it was first digitized, so now a much higher quality release is available.

Quote:
Originally Posted by salamanderjuice View Post
I also can't really blame them for automating this stuff, they just have way too much content and often it's the only place to get it on the web. I needed a chapter from some 70 year old niche book recently and they had it, only other option was a university library 6 hours away that was closed anyways due to COVID.


Like GrannyGrump's conversion of the original Sweeney Todd story: "The String of Pearls":

https://www.mobileread.com/forums/sh...d.php?t=299744
https://archive.org/details/stringof...e/n13/mode/2up

I think that book was locked away in Oxford University, one of the only copies left in the world, and it's not even available to the public.

Now because of Archive.org, the entire world can read it.

Quote:
Originally Posted by Quoth View Post
The problem is that none of it is human curated or proofed. It's automated.
Yeah, but the scale is on a completely different level.

99.9999% accuracy on a few hundred (maybe thousands of) books per year on Gutenberg.

vs.

99% OCR accuracy on millions of books. (And all original source files are available.)
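
To put those percentages in per-book terms, here is a small back-of-the-envelope calculation. The page size (2,000 characters) and book length (300 pages) are assumptions picked purely for illustration, and "accuracy" is read as the share of characters recognised correctly.

```python
CHARS_PER_PAGE = 2000  # assumed typical page, for illustration only
PAGES_PER_BOOK = 300   # assumed book length, for illustration only


def expected_errors(chars: int, accuracy: float) -> float:
    """Expected number of wrong characters at a given per-character accuracy."""
    return chars * (1.0 - accuracy)


if __name__ == "__main__":
    book_chars = CHARS_PER_PAGE * PAGES_PER_BOOK
    for label, accuracy in (("99%", 0.99), ("99.9999%", 0.999999)):
        per_page = expected_errors(CHARS_PER_PAGE, accuracy)
        per_book = expected_errors(book_chars, accuracy)
        print(f"{label}: about {per_page:.3f} errors/page, {per_book:.1f} errors/book")
```

Under those assumptions, 99% works out to roughly 20 wrong characters per page, while 99.9999% is under one per book. That is the trade-off being described: proofed quality on a small catalogue versus searchable text on millions of scans.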

And the scope is different too:

Sure, you get the nice ebooks (I really wish Gutenberg released the original PDFs though)...

But Archive.org is actually about making the works available/searchable. (NOT automating perfect ebooks. Those converted formats are just a side addition.)

Last edited by Tex2002ans; 05-29-2021 at 07:57 PM.
Old 05-31-2021, 05:31 AM   #7
Ghitulescu
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
So, this is one of the offenders (but almost every single epub locks my ADE)
https://archive.org/details/russoturkishwari01hozi

Concerning PDFs from archive.org, here is a short list (most relevant) of the threads I have consulted before posting this question:
https://www.mobileread.com/forums/sh...ht=archive.org
https://www.mobileread.com/forums/sh...ht=archive.org

The errors in calibre are many, and while some repeat across epubs ("stock" errors), a good number are new (non-repetitive, "guests"). The book is salvageable if the PDF was OCRed reasonably well, which unfortunately is mostly not the case.

It's not the copyright, not the DRM, not the PDF but rather the defective format of the epub.
Old 05-31-2021, 08:21 AM   #8
Quoth
the rook, bossing Never.
Posts: 11,565
Karma: 87456643
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
Quote:
Originally Posted by Ghitulescu View Post
not the PDF but rather the defective format of the epub.
It's because the epub is from a scan with bad OCR.
Use the PDF, or the image and if need be do your own OCR. The epub/mobi on archive org are rubbish.
Old 05-31-2021, 08:42 AM   #9
jhowell
Grand Sorcerer
Posts: 6,552
Karma: 84810789
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
Quote:
Originally Posted by Ghitulescu View Post
So, this is one of the offenders (but almost every single epub locks my ADE)
https://archive.org/details/russoturkishwari01hozi
While it does contain numerous OCR errors, that book appears to be a properly structured EPUB 3. It passes EpubCheck with no errors.

You don’t say what version of ADE you are using and on which platform. I suspect that the problem is the result of using outdated software on a modern file.
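
As a rough illustration of what "properly structured" means at the container level, here is a sketch that opens an EPUB as a zip archive, checks the `mimetype` entry and `META-INF/container.xml`, and reports the version declared in the package document. It covers only a tiny fraction of what EpubCheck validates and is no substitute for it; the function name is made up for this example.

```python
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def epub_package_version(epub_path) -> str:
    """Return the version attribute of the EPUB's package (.opf) document.

    Accepts a path or a file-like object. Raises ValueError if the basic
    container layout is not what an EPUB reader expects.
    """
    with zipfile.ZipFile(epub_path) as archive:
        names = archive.namelist()
        if not names or names[0] != "mimetype":
            raise ValueError("'mimetype' must be the first entry in the archive")
        # Lenient on surrounding whitespace; a real validator is stricter.
        if archive.read("mimetype").strip() != b"application/epub+zip":
            raise ValueError("unexpected mimetype content")

        try:
            container_xml = archive.read("META-INF/container.xml")
        except KeyError:
            raise ValueError("META-INF/container.xml is missing") from None

        container = ET.fromstring(container_xml)
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml names no rootfile")

        package = ET.fromstring(archive.read(rootfile.attrib["full-path"]))
        return package.attrib.get("version", "")
```

A result of `"3.0"` would match an EPUB 3 file, and `"2.0"` an EPUB 2 one.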
Old 05-31-2021, 08:52 AM   #10
jhowell
Grand Sorcerer
Posts: 6,552
Karma: 84810789
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
I tried the book using ADE 2.0.1 under Windows 10. It didn’t lock up but it did fail to work properly. Paging forward through the book caused it to skip around in the content, frequently jumping back to the beginning.

The book content is in a single fairly large HTML file. That might be too large for the old ADE version to process.



Added: I used calibre to convert from EPUB to EPUB and tested the resulting file in ADE 2.0.1. When I disabled splitting of large HTML files the resulting EPUB failed in ADE the same as the original EPUB. Enabling splitting resulted in seven smaller HTML files in the EPUB and that worked properly with ADE. This confirms the large HTML file (996KB) in the original EPUB causing a problem for ADE 2.0.1.
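
The same EPUB-to-EPUB conversion can be scripted with calibre's `ebook-convert` command-line tool. The sketch below builds that command from Python; it assumes calibre is installed with `ebook-convert` on the PATH, and it relies on the EPUB output option `--flow-size` (split HTML files larger than the given size in KB, with 0 disabling size-based splitting), which to my knowledge defaults to 260. The function names are made up for illustration.

```python
import subprocess


def split_epub_command(source: str, dest: str, max_html_kb: int = 260) -> list[str]:
    """Build an ebook-convert command that splits HTML files above max_html_kb."""
    return ["ebook-convert", source, dest, "--flow-size", str(max_html_kb)]


def convert_with_splitting(source: str, dest: str, max_html_kb: int = 260) -> None:
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    subprocess.run(split_epub_command(source, dest, max_html_kb), check=True)
```

Passing `max_html_kb=0` should reproduce the "splitting disabled" case from the test above, which is handy for confirming that file size is really the trigger.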

Last edited by jhowell; 05-31-2021 at 10:24 AM. Reason: Add more info
Old 05-31-2021, 11:37 AM   #11
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Ghitulescu View Post
So, this is one of the offenders (but almost every single epub locks my ADE)
https://archive.org/details/russoturkishwari01hozi
Yep, you most likely figured it out.

I'm betting the problem is the monolithic HTML file: ~900 KBs. If you have an older ereader, that would crash (can only handle files ~300 KBs).

Like you also figured out, a simple Calibre EPUB->EPUB with file splitting should take care of that issue.
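
To check whether a given EPUB has this problem before loading it on a device, something like the following sketch lists the HTML files inside it that exceed a size limit. The 300 KB default mirrors the rough figure above and is not a documented limit of any particular reader; the function name is invented for this example.

```python
import zipfile

HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def oversized_html(epub_path, limit_kb: int = 300) -> list[tuple[str, int]]:
    """Return (file name, uncompressed size in KB) for HTML files over limit_kb.

    Accepts a path or a file-like object, since an EPUB is just a zip archive.
    """
    with zipfile.ZipFile(epub_path) as archive:
        return [
            (info.filename, info.file_size // 1024)
            for info in archive.infolist()
            if info.filename.lower().endswith(HTML_SUFFIXES)
            and info.file_size > limit_kb * 1024
        ]
```

An empty list means no single HTML file is over the limit. Anything it does return is a candidate for the EPUB->EPUB split.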

Also, the book is laid out in two-column format. Usually, that's incredibly hard to OCR correctly. OCR might think both columns are a single line, so you get half-left/half-right sentences, making the ebook completely unreadable.
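
Here is a toy sketch of that failure mode. It assumes the OCR stage yields words with x/y positions (the `Word` tuple below is invented for the example) and compares a naive top-to-bottom, left-to-right ordering against one that splits the page at its midpoint first. Real OCR engines do far more elaborate layout analysis; the midpoint split is only an illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    x: float  # left edge of the word on the page
    y: float  # top edge of the line the word sits on
    text: str


def naive_order(words: list[Word]) -> str:
    """Read straight across the page, as if it were a single column."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_order(words: list[Word], page_width: float) -> str:
    """Read the whole left column first, then the whole right column."""
    middle = page_width / 2
    left = sorted((w for w in words if w.x < middle), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= middle), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in left + right)
```

With two short columns, the naive ordering glues the first line of the left column onto the first line of the right column, which is exactly the half-left/half-right effect.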

According to the metadata, looks like they ran it through FineReader 8.0.

I ran it through FineReader 12 for you, then created a very rough EPUB. This one should be more accurate + will at least not have all the headers/footers clogging up the text.

Note: This book's font also had very low-hanging+round 'g's. OCR thought they were 'O's on their own line, so you'll see lots of those randomly appearing within the EPUB.
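
If those stray lines get in the way, a small clean-up pass over the extracted text can drop them. The sketch below removes any line that consists of nothing but a capital "O". It does not put the missing 'g' back into the neighbouring word, and it would also remove a legitimate line containing only "O", so treat it as a starting point rather than a safe default.

```python
def drop_stray_o_lines(text: str) -> str:
    """Remove lines whose only content is a single capital 'O'.

    Line breaks are normalised to '\n' and a trailing newline is not kept.
    """
    kept = [line for line in text.splitlines() if line.strip() != "O"]
    return "\n".join(kept)
```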

Last edited by Tex2002ans; 05-31-2021 at 11:45 AM.
Old 05-31-2021, 11:53 AM   #12
Quoth
the rook, bossing Never.
Posts: 11,565
Karma: 87456643
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
Quote:
Originally Posted by jhowell View Post
The book content is in a single fairly large HTML file. That might be too large for the old ADE version to process.
* * *
Added: I used calibre to convert from EPUB to EPUB and tested the resulting file in ADE 2.0.1. When I disabled splitting of large HTML files the resulting EPUB failed in ADE the same as the original EPUB. Enabling splitting resulted in seven smaller HTML files in the EPUB and that worked properly with ADE. This confirms the large HTML file (996KB) in the original EPUB causing a problem for ADE 2.0.1.
Because it's simply the automated OCR layer automatically converted to epub with no rules to find breaks and create separate files. If I can't find a real ebook of a PD text on Archive I download the PDF. The 7.8" Mars with the autocrop on margins is better for PDFs than 9.7" DXG, kindle PW3 or Kobo Libra. Much faster too.
I feel I wasted a lot of time and download cap trying to read epubs & mobi from Archive before I realised what they are at (automatic on demand from unproofed PDF OCR layer).
If it's too big (like a multicolumn magazine) I'd use the 10" Lenovo tablet or even the laptop if it's not too many pages.
Old 06-01-2021, 02:55 AM   #13
Ghitulescu
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
Quote:
Originally Posted by jhowell View Post
You don’t say what version of ADE you are using and on which platform. I suspect that the problem is the result of using outdated software on a modern file.
Quote:
Originally Posted by Ghitulescu View Post
The ADE is 4.5.11.something and has no problems whatsoever displaying good ePubs I have (mine, made with sigil and calibre, or from other trusted sources).
ADE 4.5.11.187212 (strangely, I cannot directly copy this information and had to write it down by hand as before Gutenberg).

Quote:
Originally Posted by Tex2002ans View Post
I'm betting the problem is the monolithic HTML file: ~900 KBs. If you have an older ereader, that would crash (can only handle files ~300 KBs).

Like you also figured out, a simple Calibre EPUB->EPUB with file splitting should take care of that issue.

[...]

I ran it through FineReader 12 for you, then created a very rough EPUB. This one should be more accurate + will at least not have all the headers/footers clogging up the text.
Well, there is a night'n day difference, I would say. Thank you for your effort.

The ADE 4.5.11.187212 I have reads it perfectly, as far as I see it.

Tags
archive.org, epub, error



All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.