Linux command
ocrmypdf command
After copying a command, replace the file names, directories, or parameters as needed.
Common examples
Add OCR layer to PDF
ocrmypdf [input.pdf] [output.pdf]
OCR and correct skew
ocrmypdf --deskew [input.pdf] [output.pdf]
OCR and clean background
ocrmypdf --clean [input.pdf] [output.pdf]
Specify language
ocrmypdf -l [deu] [input.pdf] [output.pdf]
Multiple languages
ocrmypdf -l [eng+fra] [input.pdf] [output.pdf]
Force OCR even if text exists
ocrmypdf --force-ocr [input.pdf] [output.pdf]
Skip pages with text
ocrmypdf --skip-text [input.pdf] [output.pdf]
Optimize and reduce size
ocrmypdf --optimize [3] [input.pdf] [output.pdf]
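The options above can be combined in a single call. The following is a minimal sketch, not a recommended preset: the file names `scan.pdf` and `scan_ocr.pdf`, the language string, and the helper name `ocr_command` are placeholders chosen for illustration. It prints the command first and only runs it if ocrmypdf is installed and the input file exists.

```shell
#!/bin/sh
# Placeholders: adjust file names and languages for your documents.
input="scan.pdf"
output="scan_ocr.pdf"
langs="eng+fra"

# Hypothetical helper: print the full command line so it can be reviewed.
ocr_command() {
    echo "ocrmypdf --deskew -l $langs --skip-text --optimize 1 $input $output"
}
ocr_command

# Run it only when the tool is installed and the input file exists.
if command -v ocrmypdf >/dev/null 2>&1 && [ -f "$input" ]; then
    ocrmypdf --deskew -l "$langs" --skip-text --optimize 1 "$input" "$output"
fi
```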
Description
ocrmypdf adds an OCR text layer to scanned PDFs, making them searchable and selectable. It uses Tesseract for OCR and outputs PDF/A by default, a format designed for long-term archiving. The original visual appearance is preserved: the recognized text is placed as an invisible layer behind the scanned images, so the file looks identical but its text can be copied, searched, and indexed.
Image preprocessing can improve OCR accuracy, especially on poor-quality scans: --deskew corrects tilted scans, --clean removes noise and scanning artifacts, and --rotate-pages fixes page orientation.
Multiple languages can be combined with + (for example eng+fra+deu). The matching Tesseract language packs must be installed. The tool detects existing text to avoid double-processing; --skip-text, --redo-ocr, and --force-ocr control how such pages are handled.
Optimization levels reduce file size through image recompression. Higher levels are more aggressive and may be lossy, with level 3 the most aggressive.
Parallel processing (--jobs) speeds up multi-page documents. Progress is shown by default. Sidecar output (--sidecar) writes just the recognized text to a separate file for external processing.
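As an illustration of sidecar output across many files, here is a rough batch sketch. The directory names `scans` and `ocr` and the helpers `out_pdf` and `out_txt` are assumptions made for this example; they are not part of ocrmypdf. Pages that already contain text are skipped rather than processed again.

```shell
#!/bin/sh
# Hypothetical helpers: derive output names from an input path.
# "scans/a.pdf" -> "ocr/a.pdf" (searchable PDF) and "ocr/a.txt" (sidecar text)
out_pdf() { printf 'ocr/%s\n' "$(basename "$1")"; }
out_txt() { printf 'ocr/%s.txt\n' "$(basename "$1" .pdf)"; }

for f in scans/*.pdf; do
    [ -f "$f" ] || continue      # skip when the glob matches nothing
    mkdir -p ocr                 # create the output directory on first use
    ocrmypdf --skip-text --sidecar "$(out_txt "$f")" "$f" "$(out_pdf "$f")"
done
```

Each text file can then be passed to tools such as grep or an indexer without opening the PDF.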
Parameters
- -l _LANG_, --language _LANG_: OCR language as Tesseract language codes; join several with + (eng+fra).
- --deskew: Correct page skew before OCR.
- --clean: Clean scanning artifacts from pages before OCR; the cleaned image is used for OCR only and is not kept in the output.
- --clean-final: Clean pages as with --clean and keep the cleaned image in the output.
- --rotate-pages: Automatically rotate pages to the correct orientation.
- --remove-background: Attempt to remove the background from pages.
- --force-ocr: Rasterize every page and run OCR on it, replacing any existing text.
- --skip-text: Skip OCR on pages that already contain text; the pages are still included in the output.
- --redo-ocr: Remove an existing hidden OCR layer and run OCR again, leaving visible text unchanged.
- --optimize _LEVEL_: Optimize output size (0 = off; 1 to 3 = increasingly aggressive).
- --output-type _TYPE_: Output type: pdf, pdfa, pdfa-1, pdfa-2, pdfa-3.
- --pdfa-image-compression _TYPE_: Image compression for PDF/A output: jpeg or lossless.
- -j _NUM_, --jobs _NUM_: Number of parallel jobs.
- --image-dpi _DPI_: DPI to assume for input images whose resolution metadata is missing.
- -q, --quiet: Suppress output.
- -v, --verbose _LEVEL_: Verbose output (0-2).
- --sidecar _FILE_: Write the recognized text to a sidecar text file.
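The three flags that deal with existing text (--skip-text, --force-ocr, --redo-ocr) are alternatives, and ocrmypdf expects at most one of them per run. One way to keep that choice explicit in a script is sketched below; the function name `text_flag` and the policy words `skip`, `force`, and `redo` are invented for this example. The script only prints the resulting command.

```shell
#!/bin/sh
# Hypothetical helper: map a policy word to exactly one of the three flags.
text_flag() {
    case "$1" in
        skip)  echo "--skip-text" ;;
        force) echo "--force-ocr" ;;
        redo)  echo "--redo-ocr" ;;
        *)     echo "unknown policy: $1" >&2; return 1 ;;
    esac
}

# Pick one policy, then show the command that would be run.
flag=$(text_flag skip) || exit 1
echo "ocrmypdf $flag -j 2 --output-type pdf input.pdf output.pdf"
```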
FAQ
What is the ocrmypdf command used for?
ocrmypdf adds an OCR text layer to scanned PDFs, making them searchable and selectable while preserving the original visual appearance. It uses Tesseract for OCR and outputs PDF/A by default. Options cover image preprocessing (deskew, clean, rotate-pages), multiple languages, file-size optimization, parallel processing, and sidecar text output.
How do I run a basic ocrmypdf example?
Run `ocrmypdf [input.pdf] [output.pdf]` in a terminal, replacing the bracketed placeholders with your own input and output file names, then add flags such as `--deskew` or `-l` as needed.
What does -l _LANG_, --language _LANG_ do in ocrmypdf?
It sets the OCR language using Tesseract language codes (for example deu). Several languages can be joined with +, as in eng+fra.