Linux command

tabula command

After copying a command, replace the bracketed file names, directories, or parameters as needed.

Common examples

Extract tables from PDF

tabula [document.pdf]

Write the output to a CSV file

tabula -o [output.csv] [document.pdf]

Specific pages

tabula -p [1,2,3] [document.pdf]

JSON output

tabula -f JSON [document.pdf]

All pages

tabula -p all [document.pdf]

Restrict extraction to an area of the page

tabula -a [0,0,100,100] [document.pdf]

Description

tabula extracts tabular data from PDF documents and converts it to CSV, JSON, or TSV. It targets PDFs in which tables are drawn visually but not stored as data structures.

There are two extraction modes. Lattice mode detects tables from the ruling lines between cells; stream mode infers column boundaries from whitespace and text alignment. When neither mode is forced, tabula chooses one automatically, but selecting the mode by hand often improves accuracy for a given layout. An area option restricts extraction to one region of the page when only part of it contains the table you want.

Tabula runs as a Java application and can process selected pages or a whole document. It began as a web application for journalists who needed to pull data out of government reports and financial disclosures; the command-line version exposes the same extraction engine for scripting and automation.
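Because the better mode depends on the layout, it can help to run both on the same page and compare the results. The following is a minimal sketch, not part of tabula itself: `compare_modes` and the `TABULA` override variable are hypothetical names, and the long options `--lattice`, `--stream`, `--pages`, and `--outfile` are assumed to be accepted by the installed build, as they are in tabula-java's command-line interface.

```shell
# compare_modes is a hypothetical helper, not something tabula ships.
# TABULA is an assumed override for the executable name (default: tabula).
compare_modes() {
    pdf=$1
    page=$2
    for mode in lattice stream; do
        # Each run writes its own file next to the PDF,
        # e.g. report.lattice.csv and report.stream.csv.
        ${TABULA:-tabula} "--$mode" --pages "$page" \
            --outfile "${pdf%.pdf}.$mode.csv" "$pdf"
    done
}
```

Calling `compare_modes report.pdf 2` would leave two CSV files side by side, which can then be inspected by eye or with a tool such as `diff`.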

Parameters

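-p _PAGES_
Pages to extract: a comma-separated list of pages or ranges (for example 1,2,3 or 1-3), or all for every page. In tabula-java the default is page 1.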
-o _FILE_
Output file.
-f _FORMAT_
Output format (CSV, JSON, TSV).
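-a _AREA_
Page region to extract from, given in tabula-java as top,left,bottom,right coordinates. The default is the entire page.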
-g
Guess table areas.
-l
Force lattice mode (tables with ruling lines).
-t
Force stream mode (tables without ruling lines). In tabula-java, -s is reserved for the PDF password, not the extraction mode.
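The flags above can be combined in a single call. As a rough sketch, the loop below converts every PDF in a directory to a CSV file beside it, using only `-p`, `-f`, and `-o` as listed here. `extract_dir` and the `TABULA` override variable are hypothetical names introduced for illustration.

```shell
# extract_dir is a hypothetical helper, not something tabula ships.
# TABULA is an assumed override for the executable name (default: tabula).
extract_dir() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$pdf" ] || continue
        # All pages, CSV format, output written to <name>.csv beside the PDF.
        ${TABULA:-tabula} -p all -f CSV -o "${pdf%.pdf}.csv" "$pdf"
    done
}
```

For instance, `extract_dir ./reports` would process each `*.pdf` in `./reports` in turn and do nothing if the directory contains no PDFs.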

FAQ

What is the tabula command used for?

It extracts tables from PDF documents and writes them out as CSV, JSON, or TSV, locating cells either by ruling lines (lattice mode) or by whitespace and text alignment (stream mode).

How do I run a basic tabula example?

Run `tabula [document.pdf]` in a terminal, replacing the placeholder with the path to your PDF, then add flags such as `-p` or `-o` as needed.

What does -p _PAGES_ do in tabula?

It selects which pages to extract from: a comma-separated list such as `1,2,3`, or `all` for every page.
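When each page holds a separate table, one option is to call tabula once per page so that every table lands in its own file. The sketch below assumes that workflow; `extract_pages` and the `TABULA` override variable are hypothetical names, and only the `-p` and `-o` flags listed above are used.

```shell
# extract_pages is a hypothetical helper, not something tabula ships.
# TABULA is an assumed override for the executable name (default: tabula).
extract_pages() {
    pdf=$1
    shift
    # The remaining arguments are page numbers; one output file per page.
    for page in "$@"; do
        ${TABULA:-tabula} -p "$page" -o "${pdf%.pdf}-p${page}.csv" "$pdf"
    done
}
```

Calling `extract_pages report.pdf 1 3` would write `report-p1.csv` and `report-p3.csv`.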