Linux command
tesseract command
Text
After copying, replace the file names, directories, or parameters as needed.
Common examples
Extract text from image
tesseract [image.png] [output]
Extract to stdout
tesseract [image.png] stdout
Specify language
tesseract -l [deu] [image.png] [output]
Multiple languages
tesseract -l [eng+fra] [image.png] [output]
Output as PDF
tesseract [image.png] [output] pdf
Output as hOCR
tesseract [image.png] [output] hocr
Output as TSV
tesseract [image.png] [output] tsv
List available languages
tesseract --list-langs
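The single-image examples above extend naturally to a whole directory. The following is a minimal sketch, not an official recipe: `ocr_dir` is a hypothetical helper name, and `TESSERACT` is an assumed environment variable (not read by Tesseract itself) that lets you substitute `echo` for a dry run. It relies only on the `tesseract image outputbase -l lang` form shown above.

```shell
#!/bin/sh
# Sketch: OCR every PNG in a directory.
# ocr_dir and the TESSERACT override are hypothetical names for illustration.
# Usage: ocr_dir DIR [LANG]      (LANG defaults to eng)
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for img in "$dir"/*.png; do
        # If the glob matched nothing it stays literal; skip it.
        [ -e "$img" ] || continue
        # Output base is the image path without ".png";
        # tesseract appends ".txt" to it for plain-text output.
        base=${img%.png}
        "${TESSERACT:-tesseract}" "$img" "$base" -l "$lang"
    done
}
```

Running `TESSERACT=echo ocr_dir ./scans deu` prints the arguments that would be passed for each image without invoking the OCR engine.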
Description
Tesseract is an open-source OCR (Optical Character Recognition) engine. It extracts text from images, supporting over 100 languages.
The LSTM neural network engine (default) provides better accuracy than the legacy engine for most text. Engine mode selection (--oem) enables switching or combining engines. Page segmentation modes (--psm) tell Tesseract what to expect: single character, word, line, block, or full page. Correct mode selection improves accuracy significantly.
Output formats include plain text, searchable PDF (text layer over image), hOCR (HTML with bounding boxes), TSV (detailed per-word data), and ALTO (XML archival format).
Image quality greatly affects results. Best results come from high resolution (300+ DPI), good contrast, straight alignment, and minimal noise. Preprocessing with ImageMagick or similar can help.
Language data files (traineddata) must be installed separately. Custom training can create models for specific fonts, historical documents, or specialized text.
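To make the page segmentation choice concrete, here is a small sketch that maps the layouts named above to `--psm` numbers. The numbers are the ones listed by `tesseract --help-psm` in Tesseract 4 and 5 (3 = automatic full page, 6 = single uniform block, 7 = single line, 8 = single word, 10 = single character); check them against your installed version. `psm_for` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: translate a layout name into a --psm value.
# psm_for is a hypothetical helper, not part of tesseract.
psm_for() {
    case $1 in
        page)  echo 3 ;;   # fully automatic page segmentation (the default)
        block) echo 6 ;;   # a single uniform block of text
        line)  echo 7 ;;   # a single text line
        word)  echo 8 ;;   # a single word
        char)  echo 10 ;;  # a single character
        *)     echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

# Example use (needs tesseract and an image named line.png):
#   tesseract line.png stdout --psm "$(psm_for line)" --oem 1
```

Passing `--oem 1` in the example selects the LSTM engine explicitly; leaving it out keeps the default engine.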
Parameters
- -l _LANG_
- Language(s) for OCR (eng, deu, fra, etc.).
- --psm _NUM_
- Page segmentation mode (0-13).
- --oem _NUM_
- OCR Engine mode (0=legacy only, 1=LSTM only, 2=legacy+LSTM, 3=default, based on what is available).
- --dpi _NUM_
- Override image DPI.
- -c _VAR=VALUE_
- Set config variable.
- --tessdata-dir _PATH_
- Location of language data.
- --user-words _FILE_
- User word list.
- --user-patterns _FILE_
- User patterns file.
- --list-langs
- List available languages.
- --print-parameters
- Print config parameters.
- pdf
- Output searchable PDF.
- hocr
- Output HTML with coordinates.
- tsv
- Output tab-separated values.
- alto
- Output ALTO XML.
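Several of these options are often combined in one call. The sketch below wraps such a combination for an image expected to hold one line of digits. `ocr_digits` and the `TESSERACT` override are hypothetical names for illustration. `tessedit_char_whitelist` is an existing Tesseract config variable, but how well the LSTM engine honors it varies between releases, so treat the result as something to verify on your version.

```shell
#!/bin/sh
# Sketch: OCR a single line of digits and print the result to stdout.
# ocr_digits and the TESSERACT override are hypothetical names.
ocr_digits() {
    # --psm 7  : treat the image as a single text line
    # --dpi 300: override the DPI recorded in the image (assumed value)
    # -c ...   : restrict recognition to the characters 0-9
    "${TESSERACT:-tesseract}" "$1" stdout --psm 7 --dpi 300 \
        -c tessedit_char_whitelist=0123456789
}
```

Setting `TESSERACT=echo` before the call shows the argument list without running OCR, which is a convenient way to check the flags.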
FAQ
What is the tesseract command used for?
Tesseract is an open-source OCR (Optical Character Recognition) engine that extracts text from images and supports over 100 languages. It can write plain text, searchable PDF, hOCR, TSV, or ALTO output.
How do I run a basic tesseract example?
Run `tesseract [image.png] [output]` in a terminal, replacing the bracketed placeholders with your own image file and output base name. Tesseract appends `.txt` to the output base name for plain-text output.
What does -l _LANG_ do in tesseract?
It sets the language(s) used for OCR (eng, deu, fra, etc.). Join several codes with + to recognize more than one language, as in eng+fra.
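When the language list comes from a script rather than being typed by hand, the `+`-separated value for `-l` can be assembled as in this minimal sketch. `join_langs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: join language codes with "+" for tesseract's -l option.
# join_langs is a hypothetical helper, not part of tesseract.
join_langs() {
    # "$*" joins the arguments with the first character of IFS;
    # the subshell keeps the IFS change from leaking to the caller.
    ( IFS=+; echo "$*" )
}

# Example use (needs tesseract plus the eng and fra language data):
#   tesseract scan.png out -l "$(join_langs eng fra)"
```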