Linux command

tesseract command

Text

After copying a command, replace the file names, directories, or parameters as needed.

Common examples

Extract text from image

tesseract [image.png] [output]

Extract to stdout

tesseract [image.png] stdout

Specify language

tesseract -l [deu] [image.png] [output]

Multiple languages

tesseract -l [eng+fra] [image.png] [output]

Output as PDF

tesseract [image.png] [output] pdf

Output as hOCR

tesseract [image.png] [output] hocr

Output as TSV

tesseract [image.png] [output] tsv

List available languages

tesseract --list-langs
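
Before passing a code to -l, it can help to confirm that the language is actually installed. The following is a minimal sketch, assuming the list output consists of one header line followed by one language code per line; the helper name has_lang is hypothetical and not part of Tesseract.

```sh
#!/bin/sh
# has_lang LANG: read a language list on stdin and succeed if LANG is in it.
# Assumption: the list has one language code per line (plus a header line),
# which is the typical shape of `tesseract --list-langs` output.
has_lang() {
    # -x matches whole lines only, -F treats the code as a literal string.
    grep -qxF "$1"
}

# Usage: runs only when tesseract is on PATH.
# 2>&1 is used because some versions may print the list on stderr.
if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | has_lang deu; then
        echo "deu is installed"
    else
        echo "deu is missing; install its traineddata file first"
    fi
fi
```

Matching whole lines avoids false positives when one language code is a prefix of another.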

Description

Tesseract is an open-source OCR (Optical Character Recognition) engine that extracts text from images and supports over 100 languages. The LSTM neural network engine (the default) is more accurate than the legacy engine for most text; --oem switches between the engines or combines them.

Page segmentation modes (--psm) tell Tesseract what layout to expect: a single character, word, line, block, or full page. Choosing the mode that matches the image can improve accuracy significantly.

Output formats include plain text, searchable PDF (text layer over image), hOCR (HTML with bounding boxes), TSV (detailed per-word data), and ALTO (XML archival format).

Image quality greatly affects results. The best results come from high resolution (300+ DPI), good contrast, straight alignment, and minimal noise; preprocessing with ImageMagick or a similar tool can help.

Language data files (traineddata) must be installed separately. Custom training can create models for specific fonts, historical documents, or specialized text.
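
The page segmentation and engine options described above can be combined in a small wrapper. This is only a sketch: the function name ocr_with_mode and the default language eng are illustrative choices, and the mode numbers in the comments are the ones listed by `tesseract --help-psm`.

```sh
#!/bin/sh
# ocr_with_mode IMAGE PSM [LANG]: print the recognized text of IMAGE on stdout.
# Commonly used PSM values (see `tesseract --help-psm`):
#   6 = single uniform block of text
#   7 = single text line
#   8 = single word
#  10 = single character
# The function name and the default language "eng" are illustrative only.
ocr_with_mode() {
    image=$1
    psm=$2
    lang=${3:-eng}
    # --oem 1 selects the LSTM engine only.
    tesseract "$image" stdout -l "$lang" --psm "$psm" --oem 1
}
```

For example, `ocr_with_mode line.png 7` treats a hypothetical line.png as a single line of English text, and `ocr_with_mode word.png 8 deu` treats word.png as a single German word.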

Parameters

-l _LANG_
Language(s) for OCR (eng, deu, fra, etc.).
--psm _NUM_
Page segmentation mode (0-13).
--oem _NUM_
OCR Engine mode (0=legacy, 1=LSTM, 2=both, 3=default, based on what is available).
--dpi _NUM_
Override image DPI.
-c _VAR=VALUE_
Set config variable.
--tessdata-dir _PATH_
Location of language data.
--user-words _FILE_
User word list.
--user-patterns _FILE_
User patterns file.
--list-langs
List available languages.
--print-parameters
Print config parameters.
pdf
Output searchable PDF.
hocr
Output HTML with coordinates.
tsv
Output tab-separated values.
alto
Output ALTO XML.
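
The tsv output carries a confidence value for each word, which makes it easy to drop uncertain words. The sketch below assumes the usual Tesseract 4/5 column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), with level 5 marking word rows; the helper name confident_words is hypothetical.

```sh
#!/bin/sh
# confident_words MIN_CONF: read Tesseract TSV on stdin and print every word
# whose confidence is at least MIN_CONF, one word per line.
# Assumed columns: 1=level ... 11=conf, 12=text; level 5 rows are words.
confident_words() {
    awk -F '\t' -v min="$1" '
        NR == 1 { next }    # skip the header row
        $1 == 5 && $12 != "" && ($11 + 0) >= (min + 0) { print $12 }
    '
}
```

A typical call would be `tesseract [image.png] stdout tsv | confident_words 60`, where 60 is an arbitrary threshold to tune for your images.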

FAQ

What is the tesseract command used for?

tesseract runs the open-source Tesseract OCR (Optical Character Recognition) engine, which extracts text from images in over 100 languages. It can write the result as plain text, searchable PDF, hOCR, TSV, or ALTO XML.

How do I run a basic tesseract example?

Run `tesseract [image.png] [output]` in a terminal; the recognized text is written to `[output].txt`. Replace the bracketed placeholders with your own file names and add flags as needed.

What does -l _LANG_ do in tesseract?

It sets the language(s) used for OCR, such as eng, deu, or fra. Join several languages with + (for example eng+fra).