Pāḷi Tesseract OCR

Explore the ancient language of the Tipitaka and Theravāda commentaries
cittaanurakkho
Posts: 66
Joined: Fri Dec 18, 2009 4:12 pm

Pāḷi Tesseract OCR

Post by cittaanurakkho »

I am presenting Tesseract traineddata files for OCR of Pāḷi text.
The original files are from https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST.
I modified them to use Pāḷi vocabulary. DevaIAST.pli.traineddata seems to work
better than PuranaIAST.pli.traineddata.

Like most OCR, this traineddata is not perfect.
"Namo tassa" sometimes becomes "Namo fassa".
ṃ is the dominant niggahīta; ṁ is also recognized, but only sometimes.

https://github.com/anotatta/tesstrain-S ... sdata_fast

May it be useful
Assaji
Posts: 2106
Joined: Thu Jan 01, 2009 7:24 pm

Re: Pāḷi Tesseract OCR

Post by Assaji »

Thank you, Cittaanurakkho!
Eko Care
Posts: 1107
Joined: Mon Mar 18, 2019 7:13 am

Re: Pāḷi Tesseract OCR

Post by Eko Care »

cittaanurakkho wrote: Wed Apr 14, 2021 5:33 am May it be useful
May you gain merits!
and
May it be useful!
frank k
Posts: 2247
Joined: Sat Jan 01, 2011 4:55 pm

Re: Pāḷi Tesseract OCR

Post by frank k »

cittaanurakkho wrote: Wed Apr 14, 2021 5:33 am ...
May it be useful
How do you use it? How do you get it from GitHub into a working program that runs on a PC? Looking at your links, I don't see anything that looks like an installation program or a working executable.
www.lucid24.org/sted : ☸Lucid24.org🐘 STED definitions
www.audtip.org/audtip: 🎙️🔊Audio Tales in Pāli: ☸Dharma and Vinaya in many languages
cittaanurakkho
Posts: 66
Joined: Fri Dec 18, 2009 4:12 pm

Re: Pāḷi Tesseract OCR

Post by cittaanurakkho »

frank k wrote: Thu Apr 15, 2021 3:37 pm How do you use it? How do you get it from GitHub into a working program that runs on a PC? Looking at your links, I don't see anything that looks like an installation program or a working executable.
  • Open a command prompt: press Win+R, then type

    cmd
  • At the command prompt, check the installed languages:

    tesseract --list-langs
  • If everything is OK, you should see DevaIAST.pli listed among the other languages.
  • To run the OCR:

    tesseract -l DevaIAST.pli inputPictureFileHere.tif output
  • The result is in the file output.txt. For Tesseract help:

    tesseract --help-extra
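
For anyone who would rather script the steps above than type them, here is a minimal Python sketch. It assumes Tesseract is already installed and on your PATH, and that DevaIAST.pli.traineddata has been downloaded into a tessdata folder. The helper names build_tesseract_command and run_ocr are made up for illustration; --tessdata-dir is the Tesseract option for pointing at a folder of traineddata files, and you only need it if the file is not in the default tessdata location.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: str = "DevaIAST.pli",
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list for one Tesseract run (hypothetical helper)."""
    # Documented order: image, output base, then options.
    cmd = ["tesseract", image, output_base, "-l", lang]
    if tessdata_dir is not None:
        # Assumption: the traineddata file was saved in this folder.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(
    image: str,
    output_base: str = "output",
    lang: str = "DevaIAST.pli",
    tessdata_dir: Optional[str] = None,
) -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image, output_base, lang, tessdata_dir),
        check=True,  # raise if Tesseract exits with an error
    )
    # By default Tesseract writes plain text to <output_base>.txt.
    return Path(output_base + ".txt").read_text(encoding="utf-8")
```

Calling run_ocr("inputPictureFileHere.tif") should then do the same thing as the command line shown above and hand back the contents of output.txt.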
frank k
Posts: 2247
Joined: Sat Jan 01, 2011 4:55 pm

Re: Pāḷi Tesseract OCR

Post by frank k »

cittaanurakkho wrote: Fri Apr 16, 2021 6:09 am ...
Thanks!
frank k
Posts: 2247
Joined: Sat Jan 01, 2011 4:55 pm

Re: Pāḷi Tesseract OCR

Post by frank k »

I haven't had time yet to get the OCR installed on my PC.
Could someone who has it installed run it on this PDF article, "clarifications on feelings"?

https://archive.org/details/clairificat ... n_jip_2005

That page also includes derived OCR versions made with Google Drive, as OpenDocument, HTML, and EPUB.
Google Drive didn't do a great job with the diacritics.
cittaanurakkho
Posts: 66
Joined: Fri Dec 18, 2009 4:12 pm

Re: Pāḷi Tesseract OCR

Post by cittaanurakkho »

frank k wrote: Wed May 05, 2021 2:40 pm I haven't had time yet to get the OCR installed on my PC.
Could someone who has it installed run it on this PDF article, "clarifications on feelings"?

https://archive.org/details/clairificat ... n_jip_2005

That page also includes derived OCR versions made with Google Drive, as OpenDocument, HTML, and EPUB.
Google Drive didn't do a great job with the diacritics.
Don't bother trying to OCR this kind of PDF file. The OCR result will be worse than what you have now.

Why? Because that file is the source PDF containing the original text; it is not a PDF of images (scanned or drawn). No OCR software can give fewer errors than the original source file.

Then why no diacritics? Because the Pāḷi characters in that file are not encoded in Unicode. For example, "ṃ" might be encoded as "m." (m plus a dot). The PDF software just draws an m and puts a dot under it to get ṃ.

How do you get the diacritics, then? You have to do it manually:
1. Open the PDF file in Acrobat or Sumatra, then save it as a text file.
2. Open that text file in a heavy-duty text editor such as Notepad++.
3. Find where the Pāḷi characters (āīūṭḍḷṃṇñṅ and the capitals ĀĪŪṬḌÑṆ) should be, and check which character comes before or after that spot. For example, you might find m. where ṃ should be, or �a where ā should be.
4. Replace all m. with ṃ and all �a with ā, and do the same for every other Pāḷi character.
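
If the file is long, steps 3 and 4 can be scripted once you know what the stand-in sequences are. The following is only a sketch: the mapping is hypothetical ("m." is the example given above, and "¯a" is an assumed placeholder for whatever actually precedes the a in your file), and the function names are invented for illustration. You still have to work out the real mapping by looking at your own text file.

```python
from pathlib import Path
from typing import Dict

# HYPOTHETICAL mapping: inspect your own text file to find the real sequences.
LEGACY_TO_UNICODE: Dict[str, str] = {
    "m.": "ṃ",  # example from the steps above
    "¯a": "ā",  # assumed stand-in; your file may use something else
}


def replace_legacy(text: str, mapping: Dict[str, str]) -> str:
    """Replace every legacy sequence with its Unicode Pāḷi character."""
    # Longest keys first, so a short key never cuts a longer one in half.
    for legacy in sorted(mapping, key=len, reverse=True):
        text = text.replace(legacy, mapping[legacy])
    return text


def convert_file(src: str, dst: str, mapping: Dict[str, str]) -> None:
    """Read src as UTF-8, apply the mapping, and write the result to dst."""
    raw = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(replace_legacy(raw, mapping), encoding="utf-8")
```

A blind replace like this has the same weakness as doing it in the editor: "m." at the end of an English sentence ("them.") would also be turned into ṃ, so the result still needs proofreading.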
frank k
Posts: 2247
Joined: Sat Jan 01, 2011 4:55 pm

Re: Pāḷi Tesseract OCR

Post by frank k »

Ah yes, I see your point. I didn't realize it wasn't an image file.
But isn't there a way to automatically convert that type of PDF, with its non-Unicode text and odd font, into higher-resolution images, without having to, for example, take screenshots on your PC one page at a time?

It seems we should be able to convert it into an image file, and then your OCR should
work perfectly.
cittaanurakkho wrote: Fri May 07, 2021 10:24 am ...
cittaanurakkho
Posts: 66
Joined: Fri Dec 18, 2009 4:12 pm

Re: Pāḷi Tesseract OCR

Post by cittaanurakkho »

frank k wrote: Fri May 07, 2021 2:52 pm Ah yes, I see your point. I didn't realize it wasn't an image file.
But isn't there a way to automatically convert that type of PDF, with its non-Unicode text and odd font, into higher-resolution images, without having to, for example, take screenshots on your PC one page at a time?
Yes, there are many free programs that do that. For scanned images, higher resolution helps, but only up to a point. For this text/font-based PDF, the limitation is not the quality or resolution of the image but the OCR software itself, which performs recognition at the resolution it was trained at (my guess is 150 dpi).
It seems we should be able to convert it into an image file, and then your OCR should
work perfectly.
Before the original post I had already tested that with a ChattaSangayana file. No, it won't work "perfectly" (that is, produce a recognized file whose characters are identical to the original), as already mentioned in the original post. Somebody made this OCR model for Devanagari characters, then somebody else extended it to recognize Sanskrit in Roman characters (IAST), and I just took that and inserted Pāḷi vocabulary in place of the Sanskrit. Once in a while, at random, a Devanagari character pops up between Roman characters, or a Roman t becomes an f.

When you have the time, you can verify that yourself.
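
One possible way to do that check, sketched in Python. It assumes you have the original text and the OCR output as two strings; the function names are invented for illustration, and the range 0x0900 to 0x097F is the Unicode Devanagari block. The first function flags stray Devanagari characters, the second lists substitutions such as t becoming f.

```python
import difflib
from typing import List, Tuple

# Unicode Devanagari block.
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def find_devanagari(text: str) -> List[Tuple[int, str]]:
    """Return (position, character) for every Devanagari character in text."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END
    ]


def list_substitutions(reference: str, ocr_text: str) -> List[Tuple[str, str]]:
    """Return (expected, recognized) pairs where the OCR text differs by replacement."""
    matcher = difflib.SequenceMatcher(None, reference, ocr_text, autojunk=False)
    return [
        (reference[i1:i2], ocr_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]
```

For example, list_substitutions("Namo tassa", "Namo fassa") reports the single pair ("t", "f"). Insertions and deletions are ignored here to keep the sketch short.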
cittaanurakkho
Posts: 66
Joined: Fri Dec 18, 2009 4:12 pm

Re: Pāḷi Tesseract OCR

Post by cittaanurakkho »

BurlLam wrote: Fri Mar 17, 2023 11:14 am I totally understand that OCRs are not perfect and can have some errors. Have you ever tried using Smart Engines OCR SDK? I heard it has some pretty impressive capabilities in recognizing different scripts and characters accurately. Just a suggestion! Anyway, thanks for sharing your work on training tesseract for Pāḷi text.
Thanks. It looks like a paid SDK for businesses and institutions.