Linux command
pdfgrep 命令
文件
复制后可按需替换文件名、目录或参数。
常用示例
Search for pattern in PDF
pdfgrep "[pattern]" [file.pdf]
Case-insensitive search showing page numbers
pdfgrep -in "[pattern]" [file.pdf]
Search recursively in directory
pdfgrep -r "[pattern]" [/path/to/pdfs/]
Count matches per file
pdfgrep -c "[pattern]" [*.pdf]
Print only filenames with matches
pdfgrep -l "[pattern]" [*.pdf]
Search with multiple patterns
pdfgrep -e "[pattern1]" -e "[pattern2]" [file.pdf]
Limit search to a page range
pdfgrep --page-range=[1-10] "[pattern]" [file.pdf]
Print only the matched text
pdfgrep -o "[pattern]" [file.pdf]
说明
pdfgrep searches for text patterns in PDF files using the Poppler library for text extraction. It provides a familiar grep-like interface for PDF documents. Text is extracted from each page and matched against the given regular expression. By default pdfgrep uses PCRE2 for pattern matching. Fixed-string matching is available via -F. Page number output (-n) helps locate matches within a document. Restricting the search to a page range (--page-range) speeds up searches on large files. Context lines (-C) show surrounding text to aid understanding of a match. Recursive search (-r) processes entire directory trees. Combined with --include and --exclude, this enables targeted searches across document collections. Multiple patterns can be specified with repeated -e options or read from a file with -f. The --unac option is useful when PDFs use typographic ligatures or accented characters that differ from the search term. The --cache option stores extracted text to accelerate repeated searches.
参数
- -e _PATTERN_, --regexp=_PATTERN_
- Specify a search pattern. Can be used multiple times to match any of several patterns.
- -f _FILE_, --file=_FILE_
- Read patterns from a file, one per line.
- -i, --ignore-case
- Case-insensitive matching.
- -F, --fixed-strings
- Treat the pattern as a fixed string (no regular expression interpretation).
- -P, --perl-regexp
- Use Perl-compatible regular expressions (PCRE2).
- -n, --page-number=_TYPE_
- Prefix each match with its page number. _TYPE_ is `index` (default) or `label`.
- -c, --count
- Print match count per file instead of matched lines.
- -p, --page-count
- Print match count per page (implies -n).
- -l, --files-with-matches
- Print only filenames that contain a match.
- -L, --files-without-match
- Print only filenames that contain no match.
- -o, --only-matching
- Print only the matched portion of each line.
- -H, --with-filename
- Print the filename with each match (default when searching multiple files).
- -h, --no-filename
- Suppress filename prefix in output.
- -Z, --null
- Use a null byte instead of a colon to separate the filename from the rest of the output line. Useful for filenames containing colons or spaces.
- --match-prefix-separator _SEP_
- Use _SEP_ as the separator between the match prefix (filename, page number) and the matched line, instead of the default colon.
- -r, --recursive
- Search all PDF files under each directory recursively. Symlinks are followed only when specified on the command line.
- -R, --dereference-recursive
- Like -r, but follow all symlinks.
- --include=_GLOB_
- Only search files whose names match _GLOB_ (default: `*.pdf`).
- --exclude=_GLOB_
- Skip files whose names match _GLOB_.
- -A _NUM_, --after-context=_NUM_
- Print _NUM_ lines of context after each match.
- -B _NUM_, --before-context=_NUM_
- Print _NUM_ lines of context before each match.
- -C _NUM_, --context=_NUM_
- Print _NUM_ lines of context before and after each match.
- --page-range=_RANGE_
- Limit the search to the specified page range (e.g., `1-10,15`).
- -m _NUM_, --max-count=_NUM_
- Stop after _NUM_ matches per file.
FAQ
What is the pdfgrep command used for?
pdfgrep searches for text patterns in PDF files using the Poppler library for text extraction. It provides a familiar grep-like interface for PDF documents. Text is extracted from each page and matched against the given regular expression. By default pdfgrep uses PCRE2 for pattern matching. Fixed-string matching is available via -F. Page number output (-n) helps locate matches within a document. Restricting the search to a page range (--page-range) speeds up searches on large files. Context lines (-C) show surrounding text to aid understanding of a match. Recursive search (-r) processes entire directory trees. Combined with --include and --exclude, this enables targeted searches across document collections. Multiple patterns can be specified with repeated -e options or read from a file with -f. The --unac option is useful when PDFs use typographic ligatures or accented characters that differ from the search term. The --cache option stores extracted text to accelerate repeated searches.
How do I run a basic pdfgrep example?
Run `pdfgrep "[pattern]" [file.pdf]` in a terminal, then adjust file names, paths, flags, or remote targets for your system.
What does -e _PATTERN_, --regexp=_PATTERN_ do in pdfgrep?
Specify a search pattern. Can be used multiple times to match any of several patterns.