Linux command
pdftohtml command
Files
After copying a command, replace the file names, directories, or options as needed.
Common examples
Convert a PDF file to HTML
pdftohtml [path/to/file.pdf] [path/to/output_file.html]
Ignore images
pdftohtml -i [path/to/file.pdf] [path/to/output_file.html]
Generate a single HTML file containing all pages
pdftohtml -s [path/to/file.pdf] [path/to/output_file.html]
Convert a PDF file to XML
pdftohtml -xml [path/to/file.pdf] [path/to/output_file.xml]
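The single-file example above extends naturally to a whole directory. The following is a minimal sketch, not part of pdftohtml itself: convert_dir and the PDFTOHTML override variable are hypothetical names introduced here, and the output names are simply the input names with the extension swapped.

```shell
#!/bin/sh
# Hypothetical helper: run "pdftohtml -s" on every PDF in a directory.
# PDFTOHTML may be set to another command name to preview or stub the calls.
convert_dir() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue        # skip when the glob matches nothing
        # -s asks for one HTML document containing all pages
        "${PDFTOHTML:-pdftohtml}" -s "$pdf" "${pdf%.pdf}.html"
    done
}
```

Calling convert_dir path/to/directory converts each PDF in turn; quoting "$pdf" keeps file names with spaces intact.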
Description
pdftohtml converts PDF files to HTML or XML, writing any extracted images as separate files. Part of the poppler-utils package, it attempts to preserve the visual layout of PDF pages in the resulting HTML output. By default, it generates a frameset (a top-level HTML file, a page index, and the converted content); in complex mode (-c), each page is written to its own HTML file. The -s option creates a single file containing all pages. Images are extracted as separate image files (typically PNG or JPEG) unless -i is specified. The XML output mode records the position and font of each piece of text, which is useful for further processing or text extraction.
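The XML mode described above can feed simple text extraction. This is a minimal sketch, assuming each text run appears on its own line as a <text ...> element inside a <page> element. The sample line in the comment is illustrative, not captured output, and xml_text_lines is a hypothetical helper name. XML entities such as &amp; are left undecoded.

```shell
#!/bin/sh
# Hypothetical helper: print the text content of a pdftohtml -xml file,
# one text run per line, with all markup stripped.
# Assumed input shape (illustrative):
#   <text top="109" left="149" width="372" height="21" font="0"><b>Title</b></text>
xml_text_lines() {
    # keep only lines holding a <text ...> element, then delete every tag
    grep '<text ' "$1" | sed -e 's/<[^>]*>//g'
}
```

For anything beyond quick inspection, a real XML parser is the safer choice, since this line-based approach depends on the layout of the generated file.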
Options
- -i
- Ignore images
- -s
- Generate single HTML file for all pages
- -xml
- Output as XML instead of HTML
- -c
- Generate complex output (more accurate layout)
- -hidden
- Force extraction of hidden text
- -f _n_
- First page to convert
- -l _n_
- Last page to convert
- -zoom _factor_
- Zoom factor (default: 1.5)
- -noframes
- Generate output without frames (not supported in complex output mode)
- -enc _encoding_
- Output encoding (default: UTF-8)
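Several of the options listed above are commonly combined. The following is a rough sketch of a wrapper that converts a page range without images or frames. The name convert_range and the DRY_RUN variable are hypothetical, introduced only for illustration, and only flags from the list above are used.

```shell
#!/bin/sh
# Hypothetical helper: convert pages FIRST..LAST of a PDF to frameless HTML
# without images. Set DRY_RUN=1 to print the command instead of running it.
convert_range() {
    pdf=$1; first=$2; last=$3; out=$4
    # -f/-l select the page range, -i ignores images, -noframes drops the frameset
    set -- pdftohtml -f "$first" -l "$last" -i -noframes "$pdf" "$out"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"               # show the command line only
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 convert_range report.pdf 2 5 report.html shows the command that would run, which is a convenient check before converting a large document.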
FAQ
What is the pdftohtml command used for?
It converts PDF files to HTML or XML, extracting images as separate files, while trying to preserve the visual layout of each page. It is part of the poppler-utils package.
How do I run a basic pdftohtml example?
Run `pdftohtml [path/to/file.pdf] [path/to/output_file.html]` in a terminal, replacing the bracketed placeholders with your own file paths and adding options as needed.
What does -i do in pdftohtml?
It tells pdftohtml to ignore images, so none are extracted alongside the converted output.