Linux command
tabula command
After copying a command, replace the file names, directories, or parameters as needed.
Common examples
Extract tables from a PDF (only page 1 is processed unless pages are specified)
tabula [document.pdf]
Write the output (CSV by default) to a file
tabula -o [output.csv] [document.pdf]
Specific pages
tabula -p [1,2,3] [document.pdf]
JSON output
tabula -f JSON [document.pdf]
All pages
tabula -p all [document.pdf]
With area
tabula -a [0,0,100,100] [document.pdf]
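The area argument is easier to build from page measurements than to type by hand. The sketch below assumes the tabula-java convention that `-a` takes `top,left,bottom,right` in PDF points measured from the top-left corner of the page, with 72 points per inch. The helper name `area_from_inches` is hypothetical and not part of tabula.

```shell
#!/bin/sh
# Hypothetical helper: convert a region measured in inches into the
# top,left,bottom,right string (in points) assumed for tabula's -a option.
area_from_inches() {
    awk -v t="$1" -v l="$2" -v b="$3" -v r="$4" \
        'BEGIN { printf "%.2f,%.2f,%.2f,%.2f\n", t*72, l*72, b*72, r*72 }'
}

# Usage sketch (document.pdf is a placeholder):
#   tabula -a "$(area_from_inches 1 0.5 5 8)" document.pdf
```

Measuring the region in a PDF viewer and converting it this way keeps the coordinates readable in scripts.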
Description
tabula extracts tabular data from PDF documents and converts it into structured formats such as CSV, JSON, or TSV. It is designed for liberating data trapped in PDFs, where tables are rendered visually but not stored as actual data structures.
The tool offers two extraction modes: lattice mode detects tables from the ruling lines between cells, while stream mode uses whitespace and text alignment to infer column boundaries. When no mode is given, tabula picks one automatically, but selecting a mode manually often improves accuracy for specific document layouts. An area option restricts extraction to a region of the page when only part of it contains the desired table.
Tabula runs as a Java application and can process specific pages or entire documents. It was originally created as a web application for journalists who needed to extract data from government reports and financial disclosures; the command-line version provides the same extraction engine for scripting and automation workflows.
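When it is unclear which mode suits a document, one practical approach is to run both on the same page and compare the results. The sketch below assumes a `tabula` launcher on the PATH that accepts the tabula-java long options `--lattice`, `--stream`, `--pages`, and `--outfile`. The function name `compare_modes` and the `TABULA` override variable are hypothetical.

```shell
#!/bin/sh
# Command used to invoke tabula. Override it if tabula is started differently
# on your system, e.g. TABULA="java -jar tabula.jar" (the jar name is a placeholder).
TABULA=${TABULA:-tabula}

# Hypothetical helper: extract the same pages once per mode and write
# <name>-lattice.csv and <name>-stream.csv next to the input file.
compare_modes() {
    pdf=$1
    pages=${2:-1}
    base=${pdf%.pdf}
    for mode in lattice stream; do
        # $TABULA is left unquoted on purpose so a multi-word override still works.
        $TABULA "--$mode" --pages "$pages" --outfile "$base-$mode.csv" "$pdf"
    done
}

# Usage sketch: compare_modes document.pdf 2
```

Opening the two CSV files side by side usually shows quickly whether ruling lines or whitespace give cleaner columns for that layout.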
Options
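- -p _PAGES_
- Pages to extract: a comma-separated list of pages or ranges (for example 1-3,5), or all. Defaults to page 1.
- -o _FILE_
- Write output to FILE instead of standard output.
- -f _FORMAT_
- Output format (CSV, JSON, TSV). Defaults to CSV.
- -a _AREA_
- Portion of the page to analyze, given as top,left,bottom,right.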
- -g
- Guess the table area on each page.
- -l
- Force lattice mode (tables with ruling lines).
- -t
- Force stream mode (tables without ruling lines). In tabula-java, -s sets the document password rather than the mode.
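The options above can be combined in a script to process many files. The sketch below assumes the tabula-java long option names (`--pages`, `--guess`, `--format`, `--outfile`) and loops over a directory of PDFs, writing one CSV per input. The function name `extract_dir` and the `TABULA` override variable are hypothetical.

```shell
#!/bin/sh
# Command used to invoke tabula. Override it if your installation differs.
TABULA=${TABULA:-tabula}

# Hypothetical helper: extract every page of each PDF in a directory,
# letting tabula guess the table areas, and write <name>.csv beside <name>.pdf.
extract_dir() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        # When no PDF matches, the pattern stays unexpanded; skip it.
        [ -e "$pdf" ] || continue
        $TABULA --pages all --guess --format CSV \
            --outfile "${pdf%.pdf}.csv" "$pdf"
    done
}

# Usage sketch: extract_dir ./reports
```

Writing each result with an explicit output file keeps the tables from different PDFs separate instead of mixing them on standard output.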
FAQ
What is the tabula command used for?
tabula extracts tables from PDF documents and converts them into structured formats such as CSV, JSON, or TSV. It locates cells either from ruling lines (lattice mode) or from whitespace and text alignment (stream mode), and can be limited to specific pages or page regions.
How do I run a basic tabula example?
Run `tabula [document.pdf]` in a terminal, then adjust the file name, path, and flags for your system.
What does -p _PAGES_ do in tabula?
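It selects the pages to extract from: a comma-separated list of pages or ranges such as `1,2,3`, or `all` for the whole document. Without it, only page 1 is processed.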