tesseract Command: Examples, Options, and Usage

Common Examples

Extract text from an image into [output].txt

tesseract [image.png] [output]

Extract to stdout

tesseract [image.png] stdout

Specify language

tesseract -l [deu] [image.png] [output]

Multiple languages

tesseract -l [eng+fra] [image.png] [output]

Output as PDF

tesseract [image.png] [output] pdf

Output as hOCR

tesseract [image.png] [output] hocr

Output as TSV

tesseract [image.png] [output] tsv
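
The TSV output lends itself to filtering by confidence. The following is a minimal sketch, assuming the usual 12-column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) with a header row, where level 5 marks word rows. The function name `words_above_conf`, the threshold, and the file names are illustrative and not part of Tesseract.

```shell
#!/bin/sh
# words_above_conf MIN_CONF < file.tsv
# Prints "conf<TAB>text" for word-level rows (level 5) whose confidence
# is at least MIN_CONF. Assumes the 12-column TSV layout with a header row.
words_above_conf() {
    awk -F '\t' -v min="$1" '
        NR == 1 { next }    # skip the header row
        $1 == 5 && $11 + 0 >= min + 0 && $12 != "" {
            printf "%s\t%s\n", $11, $12
        }
    '
}

# Typical use (hypothetical file names):
#   tesseract scan.png scan tsv
#   words_above_conf 60 < scan.tsv
```

Reading from standard input keeps the filter independent of how the TSV file was produced.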

List available languages

tesseract --list-langs
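
Before running OCR in a script, it can help to check that a language is actually installed. A small sketch, assuming a release that prints the list on standard output as a header line followed by one language code per line; the helper name `lang_listed` is hypothetical.

```shell
#!/bin/sh
# lang_listed LANG: succeed if LANG appears as a whole line on stdin.
# Intended for the output of `tesseract --list-langs`: the header line
# never equals a bare language code, so it is ignored automatically.
lang_listed() {
    grep -qxF -- "$1"
}

# Typical use:
#   tesseract --list-langs | lang_listed deu || echo "deu is not installed" >&2
```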

Description

Tesseract is an open-source OCR (Optical Character Recognition) engine. It extracts text from images, supporting over 100 languages. The LSTM neural network engine (default) provides better accuracy than the legacy engine for most text. Engine mode selection (--oem) enables switching or combining engines.

Page segmentation modes (--psm) tell Tesseract what to expect: single character, word, line, block, or full page. Correct mode selection improves accuracy significantly.

Output formats include plain text, searchable PDF (text layer over image), hOCR (HTML with bounding boxes), TSV (detailed per-word data), and ALTO (XML archival format).

Image quality greatly affects results. Best results come from high resolution (300+ DPI), good contrast, straight alignment, and minimal noise. Preprocessing with ImageMagick or a similar tool can help.

Language data files (traineddata) must be installed separately. Custom training can create models for specific fonts, historical documents, or specialized text.
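
To make the page segmentation choice concrete, here is a hedged sketch that maps the layout kinds named above to --psm numbers. The helper `psm_for` and its layout keywords are illustrative inventions; only the numeric modes themselves come from Tesseract.

```shell
#!/bin/sh
# psm_for KIND: print the --psm number for a layout kind.
psm_for() {
    case "$1" in
        page)  echo 3 ;;   # fully automatic page segmentation (the default)
        block) echo 6 ;;   # single uniform block of text
        line)  echo 7 ;;   # single text line
        word)  echo 8 ;;   # single word
        char)  echo 10 ;;  # single character
        *)     echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

# Typical use (hypothetical file name), forcing the LSTM engine:
#   tesseract label.png stdout --oem 1 --psm "$(psm_for line)"
```

Other modes exist (for example sparse text or orientation detection); this table covers only the cases the description mentions.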

Options

-l _LANG_: Language(s) for OCR (eng, deu, fra, etc.).
--psm _NUM_: Page segmentation mode (0-13).
--oem _NUM_: OCR Engine mode (0=legacy, 1=LSTM, 2=both, 3=default, based on what is available).
--dpi _NUM_: Override image DPI.
-c _VAR=VALUE_: Set config variable.
--tessdata-dir _PATH_: Location of language data.
--user-words _FILE_: User word list.
--user-patterns _FILE_: User patterns file.
--list-langs: List available languages.
--print-parameters: Print config parameters.
pdf: Output searchable PDF.
hocr: Output HTML with coordinates.
tsv: Output tab-separated values.
alto: Output ALTO XML.
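
Several of these options are often combined. The sketch below wraps one such combination for reading a single line of digits; the function name `ocr_digits`, the `DRY_RUN` switch, and the 300 DPI value are assumptions made for illustration. `tessedit_char_whitelist` is a Tesseract config variable, though some releases may not honor it with the LSTM engine.

```shell
#!/bin/sh
# ocr_digits IMAGE: recognize a single line of digits and print it to stdout.
# Set DRY_RUN=1 to print the command instead of running it.
ocr_digits() {
    # Rebuild the positional parameters as the full tesseract command line.
    set -- tesseract "$1" stdout --psm 7 --dpi 300 \
        -c tessedit_char_whitelist=0123456789
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Typical use (hypothetical file name):
#   ocr_digits meter.png
```

Printing the command first is a cheap way to confirm the flags before spending time on a large batch.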

FAQ

What is the tesseract command used for?

It runs the Tesseract OCR engine to extract text from images. It supports over 100 languages and can write plain text, searchable PDF, hOCR, TSV, or ALTO output.

How do I run a basic tesseract example?

Run `tesseract [image.png] [output]` in a terminal, replacing the placeholders with your own image path and output base name. Tesseract writes the recognized text to `[output].txt`.

What does -l _LANG_ do in tesseract?

It sets the language(s) used for OCR (eng, deu, fra, etc.). Join several languages with `+`, as in `eng+fra`.
