I am presenting Tesseract traineddata files for OCR of Pali text.
The original files came from https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST.
I modified the files to use Pali vocabulary. DevaIAST.pli.traineddata seems to work
better than PuranaIAST.pli.traineddata.
Like most OCR, this traineddata is not perfect.
"Namo tassa" sometimes becomes "Namo fassa".
ṃ is the dominant niggahita; ṁ is also recognized, sometimes.
https://github.com/anotatta/tesstrain-S ... sdata_fast
May it be useful
Pāḷi Tesseract OCR
Re: Pāḷi Tesseract OCR
Thank you, Cittaanurakkho!
Re: Pāḷi Tesseract OCR
May you gain merits!
and
May it be useful!
Re: Pāḷi Tesseract OCR
How do you use it? How do you get it from github into a working program that runs on a pc? Looking at your links, I don't see anything that looks like an installation program or working executable.
www.lucid24.org/sted : ☸Lucid24.org
STED definitions
www.audtip.org/audtip:
Audio Tales in Pāli: ☸Dharma and Vinaya in many languages
www.audtip.org/audtip:
-
- Posts: 56
- Joined: Fri Dec 18, 2009 4:12 pm
Re: Pāḷi Tesseract OCR
- Install Tesseract from https://github.com/UB-Mannheim/tesseract/wiki
- Go to https://github.com/anotatta/tesstrain-S ... sdata_fast, click DevaIAST.pli.traineddata (or PuranaIAST.pli.traineddata), then download the file into the tessdata folder of the Tesseract installation directory, usually something like C:\Program Files\Tesseract-OCR\tessdata
- Open a command prompt: press Win+R, then type
Code:
cmd
- At the command prompt, check the installed languages:
Code:
tesseract --list-langs
- If everything is OK, you should see DevaIAST.pli listed among the other languages
- To do OCR:
Code:
tesseract -l DevaIAST.pli inputPictureFileHere.tif output
- The result is in the file output.txt. For Tesseract help:
Code:
tesseract --help-extra
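For anyone who would rather script these steps than type them at the prompt, here is a minimal Python sketch. It assumes the tesseract executable is on the PATH and that DevaIAST.pli.traineddata is already installed. The function names are made up for illustration, and only the -l and --list-langs options from the commands above are used.

```python
import shutil
import subprocess


def build_ocr_command(image_path, output_base, lang="DevaIAST.pli"):
    """Return the same command line as the one typed at the prompt above."""
    return ["tesseract", "-l", lang, image_path, output_base]


def installed_languages():
    """Return the language names reported by `tesseract --list-langs`."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Look at both streams, and skip the header line, which ends with a colon.
    lines = (result.stdout + result.stderr).splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().endswith(":")]


def ocr_image(image_path, output_base, lang="DevaIAST.pli"):
    """Run Tesseract on one image and return the name of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    subprocess.run(build_ocr_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"
```

Calling ocr_image("inputPictureFileHere.tif", "output") should leave the recognized text in output.txt, the same as the manual command.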
Re: Pāḷi Tesseract OCR
Thanks!
Re: Pāḷi Tesseract OCR
I haven't had time yet to try to get the OCR installed on my PC.
Can someone who has it installed run it on this PDF article, "clarifications on feelings"?
https://archive.org/details/clairificat ... n_jip_2005
That page also includes derived OCR versions (HTML and EPUB) made by opening the document with Google Drive.
Google Drive didn't do such a great job with the diacriticals.
Re: Pāḷi Tesseract OCR
frank k wrote: ↑Wed May 05, 2021 2:40 pm
Can someone who does, run it on this PDF article "clarifications on feelings"?
https://archive.org/details/clairificat ... n_jip_2005
Don't bother trying to OCR this kind of PDF file. The OCR result will be worse than what you get now.
Why? Because that file is the source PDF containing the original text; it is not a PDF containing images (scanned or drawn). No OCR software can produce fewer errors than the original source file.
Then why no diacritics? Because the Pali characters in that file were not encoded in Unicode. For example, "ṃ" might be encoded as "m." (m plus a dot). The PDF software just draws an m and puts a dot under it to get ṃ.
How do you get the diacritics then? You have to do it manually:
1. Open the PDF file using Acrobat/Sumatra, then save it as a text file.
2. Open that text file in a heavy-duty text editor such as Notepad++.
3. Find where the Pali characters (āīūṭḍḷṃṇñṅ and the capitals ĀĪŪṬḌÑṆ) should be and check which characters appear there instead. For example, you might find m. where ṃ should be, or �a where ā should be.
4. Replace all m. with ṃ and �a with ā, and do the same for all the other Pali characters.
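The find-and-replace in steps 3 and 4 can also be scripted once you know what the file contains. Below is a minimal Python sketch. The mapping is hypothetical: it holds only the two examples mentioned above, and the real sequences have to be found by inspecting the extracted text, since they depend on the font used in the PDF.

```python
# Hypothetical mapping built from the two examples in the steps above.
# Extend it with whatever sequences the extracted text really contains.
LEGACY_TO_UNICODE = {
    "m.": "ṃ",
    "\ufffda": "ā",  # "\ufffd" is the replacement character shown as � above
}


def restore_diacritics(text, mapping=LEGACY_TO_UNICODE):
    """Replace each legacy sequence with its Unicode Pali character."""
    # Handle longer sequences first so a short key cannot split a longer one.
    for legacy in sorted(mapping, key=len, reverse=True):
        text = text.replace(legacy, mapping[legacy])
    return text


print(restore_diacritics("sam.s\ufffdara"))  # prints saṃsāra
```

A blind replacement like this also changes any ordinary word that ends a sentence with "m." (for example "them." becomes "theṃ"), so the result still needs proofreading, just as with the manual method.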
Re: Pāḷi Tesseract OCR
Ah yes, I see your point. I didn't realize it wasn't an image file.
But isn't there a way to automatically convert that type of PDF, with its non-Unicode text and odd font, into higher-resolution images, without having to, for example, take screenshots on your PC a page at a time?
Seems like we should be able to convert it into an image file, and your OCR should work perfectly.
cittaanurakkho wrote: ↑Fri May 07, 2021 10:24 am
Don't bother trying to OCR this kind of pdf file. The ocr result will be worse than what you get now.
Re: Pāḷi Tesseract OCR
frank k wrote: ↑Fri May 07, 2021 2:52 pm
But isn't there a way to automatically convert that type of pdf with non-unicode and weird font into higher resolution images, without having to for example use your PC to take screen shots a page at a time?
Seems like we should be able to convert it into an image file, and your OCR should work perfectly.
Yes, there are many free programs that do that. For scanned images, higher resolution helps, but only up to a point. For this text/font-based PDF, though, the limitation is not the quality or resolution of the image but the OCR software itself, which performs recognition at the resolution it was trained at (my guess is 150 dpi).
Before the original posting I already tested that with a ChattaSangayana file. No, it won't work "perfectly" (that is, producing a recognized file whose characters are identical to the original), as already mentioned in the original post. Somebody made this OCR model for Devanagari characters, then somebody else augmented it to recognize Sanskrit in Roman characters (IAST), and I just took that and inserted Pali vocabulary in place of the Sanskrit. Once in a while, at random, a Devanagari character pops up between Roman characters, or a Roman t becomes f.
When you have the time you can verify that yourself.
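For anyone who wants to try that verification, here is a minimal Python sketch of the PDF-to-image-to-OCR route. The choice of pdftoppm (from Poppler) is an assumption, since no particular program is named in this thread; it must be on the PATH along with tesseract. The function names and the 300 dpi default are also just illustrative choices.

```python
import glob
import subprocess


def build_render_command(pdf_path, image_prefix, dpi=300):
    """Command that renders every page of the PDF to <image_prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, image_prefix]


def render_pdf(pdf_path, image_prefix, dpi=300):
    """Render the PDF and return the page images in page order."""
    subprocess.run(build_render_command(pdf_path, image_prefix, dpi), check=True)
    # pdftoppm zero-pads page numbers, so a plain sort keeps the pages in order.
    return sorted(glob.glob(glob.escape(image_prefix) + "-*.png"))


def ocr_pages(image_paths, lang="DevaIAST.pli"):
    """Run Tesseract on each page image; writes <image name without .png>.txt."""
    text_files = []
    for image_path in image_paths:
        output_base = image_path[: -len(".png")]
        subprocess.run(["tesseract", "-l", lang, image_path, output_base], check=True)
        text_files.append(output_base + ".txt")
    return text_files
```

Running ocr_pages(render_pdf("article.pdf", "page")) on a text-based PDF lets you compare the OCR output with the text already embedded in the file. As noted above, expect occasional stray Devanagari characters and t/f confusions rather than a perfect result.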
Re: Pāḷi Tesseract OCR
Hey there! As someone who has some experience with OCR, I can appreciate the effort put into training tesseract for Pāḷi text. It's great to see that the DevaIAST.pli.traineddata works better than PuranaIAST.pli.traineddata.
Re: Pāḷi Tesseract OCR
I totally understand that OCRs are not perfect and can have some errors. Have you ever tried using Smart Engines OCR SDK? I heard it has some pretty impressive capabilities in recognizing different scripts and characters accurately. Just a suggestion! Anyway, thanks for sharing your work on training tesseract for Pāḷi text.
Re: Pāḷi Tesseract OCR
BurlLam wrote: ↑Fri Mar 17, 2023 11:14 am
Have you ever tried using Smart Engines OCR SDK?
Thanks. It looks like a paid SDK for businesses and institutions.