ocrmypdf Command: Examples, Options, and Usage

Common Examples

Add OCR layer to PDF

ocrmypdf [input.pdf] [output.pdf]

OCR and correct skew

ocrmypdf --deskew [input.pdf] [output.pdf]

OCR and clean background

ocrmypdf --clean [input.pdf] [output.pdf]

Specify language

ocrmypdf -l [deu] [input.pdf] [output.pdf]

Multiple languages

ocrmypdf -l [eng+fra] [input.pdf] [output.pdf]

Force OCR even if text exists

ocrmypdf --force-ocr [input.pdf] [output.pdf]

Skip pages with text

ocrmypdf --skip-text [input.pdf] [output.pdf]

Optimize and reduce size

ocrmypdf --optimize [3] [input.pdf] [output.pdf]
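
The --force-ocr and --skip-text examples above matter because a plain run can fail on input that already contains text. The following is a minimal sketch of a fallback, not an official recipe: `ocr_or_skip` is a hypothetical helper name, and the exit code 6 ("already has OCR") is taken from the ocrmypdf documentation as remembered here, so check it against your installed version.

```shell
# Hypothetical wrapper: try a plain run first; if ocrmypdf reports that the
# input already has text (ASSUMED exit code 6), retry with --skip-text.
ocr_or_skip() {
  status=0
  ocrmypdf "$1" "$2" || status=$?
  if [ "$status" -eq 6 ]; then
    status=0
    ocrmypdf --skip-text "$1" "$2" || status=$?
  fi
  # Any other failure is passed through unchanged.
  return "$status"
}

# Usage:
#   ocr_or_skip input.pdf output.pdf
```

Retrying with --skip-text leaves existing text pages alone; swap in --redo-ocr or --force-ocr if that suits the input better.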

Description

ocrmypdf adds an OCR text layer to scanned PDFs, making them searchable and selectable. It uses Tesseract OCR and outputs PDF/A by default, a format intended for long-term archiving. The tool preserves the original visual appearance and places invisible text behind the scanned images, so the file looks the same but its text can be copied, searched, and indexed.

Image preprocessing can improve OCR accuracy: --deskew corrects tilted scans, --clean removes noise and artifacts before OCR, and --rotate-pages fixes page orientation. These options help most on poor-quality scans.

Multiple languages can be combined with + (for example eng+fra+deu); the matching Tesseract language packs must be installed. If the input already contains text, ocrmypdf stops with an error by default; use --skip-text, --redo-ocr, or --force-ocr to choose how such pages are handled.

Optimization levels reduce file size through image recompression: level 1 is lossless, while levels 2 and 3 allow lossy recompression, with level 3 the most aggressive. Parallel processing (--jobs) speeds up multi-page documents, a progress bar is shown by default, and --sidecar writes the recognized text to a separate file for external processing.
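
Because a missing language pack only shows up once OCR starts, it can help to check the requested codes first. The sketch below assumes a POSIX shell; `missing_langs` is a hypothetical helper that reads installed codes from standard input, so it does not itself depend on Tesseract. The header line and output stream of `tesseract --list-langs` may differ between Tesseract versions, which is why the real call appears only as a commented usage line.

```shell
# missing_langs SPEC: print each language code in SPEC (e.g. "eng+fra")
# that does not appear in the list of installed codes read from stdin
# (one code per line).
missing_langs() {
  installed=$(cat)
  # Split the spec on "+" by turning each "+" into a newline.
  printf '%s\n' "$1" | tr '+' '\n' | while IFS= read -r code; do
    # grep -x matches whole lines only; -F treats the code as a fixed string.
    printf '%s\n' "$installed" | grep -qxF -- "$code" || printf '%s\n' "$code"
  done
}

# Usage (assumes tesseract is on PATH and prints one header line first):
#   tesseract --list-langs | tail -n +2 | missing_langs "eng+fra"
```

An empty result means every requested code was found in the list.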

Parameters

-l _LANG_, --language _LANG_: OCR language (Tesseract language codes).
--deskew: Correct page skew before OCR.
--clean: Clean page background before OCR.
--clean-final: Clean and keep cleaned image in output.
--rotate-pages: Rotate pages to correct orientation.
--remove-background: Remove background from pages.
--force-ocr: OCR all pages, replacing existing text.
--skip-text: Skip pages that already have text.
--redo-ocr: Redo OCR on pages with existing text.
--optimize _LEVEL_: Optimize output size (0 = off, 1 = lossless, 2-3 = increasingly aggressive lossy).
--output-type _TYPE_: Output type: pdf, pdfa, pdfa-1, pdfa-2, pdfa-3.
--pdfa-image-compression _TYPE_: Compression: jpeg, lossless.
-j _NUM_, --jobs _NUM_: Number of parallel jobs.
--image-dpi _DPI_: DPI for images without metadata.
-q, --quiet: Suppress output.
-v, --verbose _LEVEL_: Verbose output (0-2).
--sidecar _FILE_: Write OCR text to sidecar file.
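
Several of the options above are often used together when processing a folder of scans. The following is a rough sketch under stated assumptions: `ocr_dir` and the `DRY_RUN` variable are hypothetical names introduced here, the language (eng) and job count (2) are placeholder choices, and only flags from the list above are used.

```shell
# Hypothetical helper: OCR every PDF in SRC into DST, writing one sidecar
# .txt per file. Set DRY_RUN=1 to print the commands instead of running them.
ocr_dir() {
  src=$1
  dst=$2
  # Only create the output directory when commands will really run.
  [ "${DRY_RUN:-0}" = 1 ] || mkdir -p "$dst" || return 1
  for pdf in "$src"/*.pdf; do
    [ -e "$pdf" ] || continue   # no matches: the glob stays literal
    name=$(basename "$pdf" .pdf)
    # Build the command as positional parameters so paths with spaces survive.
    set -- ocrmypdf --skip-text --deskew -l eng --jobs 2 \
      --sidecar "$dst/$name.txt" "$pdf" "$dst/$name.pdf"
    if [ "${DRY_RUN:-0}" = 1 ]; then
      printf '%s\n' "$*"
    else
      "$@" || echo "failed: $pdf" >&2
    fi
  done
}

# Usage:
#   DRY_RUN=1; ocr_dir ./scans ./searchable   # preview the commands
#   DRY_RUN=0; ocr_dir ./scans ./searchable   # run them
```

Using --skip-text keeps the loop going when some files already contain text; a failure on one file is reported on standard error and the remaining files are still processed.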

FAQ

What is the ocrmypdf command used for?

ocrmypdf adds an invisible OCR text layer to scanned PDFs so that their text can be searched, selected, and copied while the pages look the same. It uses Tesseract for recognition and writes PDF/A by default. Optional preprocessing (--deskew, --clean, --rotate-pages), multi-language recognition (for example eng+fra), size optimization, and sidecar text output are described above.

How do I run a basic ocrmypdf example?

Run `ocrmypdf [input.pdf] [output.pdf]` in a terminal, replacing the bracketed placeholders with your own input and output paths. Add flags such as --deskew or -l as needed.

What does -l _LANG_, --language _LANG_ do in ocrmypdf?

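It sets the language or languages Tesseract uses for OCR, given as Tesseract language codes such as deu, or several codes joined with + such as eng+fra. The corresponding Tesseract language packs must be installed.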
