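Text extraction with tesseract - Knowledge Is a Way of Life

tesseract is an open-source OCR library backed by Google. It extracts text from images and, once the pages are converted to images, from PDF files, with relatively high accuracy.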

1. Download and install

On Ubuntu 18:
apt install tesseract-ocr
apt install tesseract-ocr-chi-sim

Many language packs are available; list them with:
apt search tesseract-ocr-

apt install imagemagick

Development package:
sudo apt install libtesseract-dev

2. Usage and configuration

Path to the engine data (tessdata):
/usr/share/tesseract-ocr/4.00/tessdata


To recognize an image that contains Chinese text, run:
tesseract --tessdata-dir /usr/share/tesseract-ocr/4.00/tessdata /home/pdf/1.png /home/pdf/text2 -l chi_sim

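Where:
--tessdata-dir: directory holding the engine's language data
1.png: source file
text2: target file (base name; tesseract appends .txt)
-l chi_sim: language to use, default eng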
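
The same invocation can be repeated over a whole directory of images. The following is a minimal sketch, not part of tesseract itself: `ocr_dir` is a hypothetical helper name, it assumes the images are PNG files, and it relies on the default tessdata location instead of passing --tessdata-dir.

```shell
# Hypothetical helper: OCR every PNG in a directory.
# Usage: ocr_dir DIRECTORY [LANGUAGE]   (language defaults to chi_sim)
ocr_dir() {
    local dir="$1"
    local lang="${2:-chi_sim}"
    local img
    for img in "$dir"/*.png; do
        # Skip the unexpanded pattern when the directory has no PNG files.
        [ -e "$img" ] || continue
        # Output base is the image path without .png; tesseract appends .txt.
        tesseract "$img" "${img%.png}" -l "$lang"
    done
}
```

For example, `ocr_dir /home/pdf chi_sim` would write /home/pdf/1.txt next to /home/pdf/1.png.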



Recommended GUIs: https://github.com/tesseract-ocr/tesseract/wiki/User-Projects-%E2%80%93-3rdParty

Programming-language-level API: https://github.com/tesseract-ocr/tesseract/wiki/APIExample

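3. Other issues

Using it right after installation fails because ImageMagick cannot read or write PDF files; its policy file needs to be modified: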
nano /etc/ImageMagick-6/policy.xml
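
The exact change is not shown here. On Ubuntu the usual cause is a policy line of the form `<policy domain="coder" rights="none" pattern="PDF" />`, so the sketch below assumes that line is present and switches it to `read|write`. `allow_pdf_policy` is a hypothetical helper name, and it keeps a backup of the file first.

```shell
# Hypothetical helper: let ImageMagick read and write PDF files.
# Assumes the policy file contains the line
#   <policy domain="coder" rights="none" pattern="PDF" />
# Usage: allow_pdf_policy [POLICY_FILE]
allow_pdf_policy() {
    local policy_file="${1:-/etc/ImageMagick-6/policy.xml}"
    # Keep a backup, then rewrite the file from that backup.
    cp "$policy_file" "$policy_file.bak"
    sed 's/rights="none" pattern="PDF"/rights="read|write" pattern="PDF"/' \
        "$policy_file.bak" > "$policy_file"
}
```

Editing the system file requires root, for example by running the function from a root shell. Loosening this policy has security implications if untrusted PDFs are processed.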




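tesseract cannot recognize a PDF directly; a tool is needed to split the PDF into images, which are then recognized and the results combined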
https://python.freelycode.com/contribution/detail/344
https://qinghua.github.io/tesseract/


apt install imagemagick ghostscript

Split each page into its own image, stored in the current directory:
convert -density 100 -trim input.pdf output%04d.jpg
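
Once the pages exist as images, the recognize-and-combine step can be sketched as below. `ocr_pages` is a hypothetical helper name and the `OCR_LANG` variable is an assumption of this sketch; it uses tesseract's `stdout` output base so that the text of each page is appended to a single file.

```shell
# Hypothetical helper: OCR page images in order and combine the text.
# Usage: ocr_pages OUTPUT_FILE PAGE_IMAGE...
ocr_pages() {
    local out="$1"
    shift
    local page
    # Start from an empty output file.
    : > "$out"
    for page in "$@"; do
        # "stdout" as the output base makes tesseract print the text
        # instead of writing one .txt file per page.
        tesseract "$page" stdout -l "${OCR_LANG:-chi_sim}" >> "$out"
    done
}
```

For example, `ocr_pages input.txt output*.jpg`; the zero-padded %04d page numbers keep the shell glob in page order.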

How to recognize a PDF with tesseract: https://blog.eson.org/pub/1ff2cbf2/



The PDF can also be converted to TIFF for recognition:
convert test.pdf test.tiff

With LZW compression, both the clarity and the size of the file are good: [Important]
convert -density 300 -depth 8 -background white -alpha Off -compress LZW 201903tongji.pdf 201903tongji_lzw.tiff

# If an error occurs (can't handle bpp > 32), convert to an 8-bit file, then run OCR on it
convert test.pdf -depth 8 test.tiff

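Recognize the TIFF (the second argument is the output base name; tesseract appends .txt):
tesseract test.tiff test -l [lang]

The default output format is txt: [Important]
tesseract -l chi_sim 201903tongji.tiff 201903tongji
The output can also be hOCR or PDF; just append hocr or pdf as the last argument.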
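
As a small illustration of choosing the output format, here is a sketch built around the command above. `ocr_to` is a hypothetical wrapper, not a tesseract command; it only accepts the three formats mentioned in this post.

```shell
# Hypothetical wrapper: run tesseract with a chosen output format.
# Usage: ocr_to FORMAT INPUT_IMAGE OUTPUT_BASE   (FORMAT: txt, hocr or pdf)
ocr_to() {
    local fmt="$1"
    local input="$2"
    local outbase="$3"
    case "$fmt" in
        # Plain text is the default, so no extra argument is needed.
        txt) tesseract -l chi_sim "$input" "$outbase" ;;
        # The format name goes last; the result is OUTPUT_BASE.hocr or OUTPUT_BASE.pdf.
        hocr|pdf) tesseract -l chi_sim "$input" "$outbase" "$fmt" ;;
        *) echo "unsupported format: $fmt" >&2; return 1 ;;
    esac
}
```

For example, `ocr_to pdf 201903tongji.tiff 201903tongji` would produce 201903tongji.pdf.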

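Python implementation: http://blog.topspeedsnail.com/archives/3571

When the file is very large or has many pages, convert may run out of memory. For the fix, see the reference link; it involves editing the same policy file: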

nano /etc/ImageMagick-6/policy.xml
'''
 # modify



 # modify
 # modify
'''
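
Which entries were changed is not recoverable from the listing above. A reasonable guess is ImageMagick's resource limits, lines of the form `<policy domain="resource" name="memory" value="256MiB"/>`. The sketch below assumes that format; `set_im_resource` is a hypothetical helper name, and the resource names and values are illustrative only.

```shell
# Hypothetical helper: change one resource limit in an ImageMagick policy file.
# Assumes lines of the form
#   <policy domain="resource" name="memory" value="256MiB"/>
# Usage: set_im_resource POLICY_FILE RESOURCE_NAME NEW_VALUE
set_im_resource() {
    local policy_file="$1"
    local name="$2"
    local value="$3"
    # Keep a backup, then rewrite the file from that backup.
    cp "$policy_file" "$policy_file.bak"
    # Replace only the value of the named resource entry.
    sed "s|\(<policy domain=\"resource\" name=\"$name\" value=\"\)[^\"]*\"|\1$value\"|" \
        "$policy_file.bak" > "$policy_file"
}
```

For example, `set_im_resource /etc/ImageMagick-6/policy.xml memory 2GiB`, run as root. The limits currently in effect can be checked with `identify -list resource`.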

Some links:
tesseract API link
 tesseract API example links, including recognition of table-formatted data
pytesseract wiki link
 tesseract training link