trafilatura Command: Examples, Options, and Usage

Common Examples

Extract text from URL

trafilatura -u [https://example.com]

Extract from local HTML file

trafilatura < [page.html]

Output as JSON with metadata

trafilatura -u [https://example.com] --json --with-metadata

Extract without comments or tables

trafilatura -u [https://example.com] --no-comments --no-tables

Batch-process URLs from a list file into an output directory

trafilatura -i [urls.txt] -o [output_dir]

Favor precision over recall

trafilatura -u [https://example.com] --precision

Description

trafilatura extracts the main text content from web pages, automatically removing navigation, ads, headers, footers, and other boilerplate elements. It can fetch pages from URLs directly or process local HTML files. Output is available in plain text, CSV, JSON, HTML, Markdown, XML, or XML-TEI formats. The tool also extracts metadata such as publication dates, authors, and page titles. Batch processing handles multiple URLs from a list file, making it suitable for web scraping and corpus building. Link discovery via feeds, sitemaps, and crawling is built in.
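The batch workflow described above can be sketched as a short shell script. This is a minimal sketch, not a definitive recipe: the URLs, the `urls.txt` file name, and the `output_dir` directory are placeholders, and the extraction step only runs if `trafilatura` is found on the PATH.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# URL list: one URL per line. These addresses are placeholders.
cat > urls.txt <<'EOF'
https://example.com
https://example.org
EOF

mkdir -p output_dir

# Run the batch extraction only when trafilatura is installed.
if command -v trafilatura >/dev/null 2>&1; then
    trafilatura -i urls.txt -o output_dir \
        || echo "trafilatura reported an error" >&2
else
    echo "trafilatura not found on PATH; skipping extraction" >&2
fi

echo "Results, if any, are in $workdir/output_dir"
```

Keeping the list in a plain text file makes it easy to rerun the same batch later or to split a large list across several runs.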

Options

-u, --URL _URL_: Fetch and process a URL.
-i, --input-file _FILE_: Input file listing URLs for batch processing, one URL per line.
-o, --output-dir _DIR_: Write results to specified directory.
--output-format _FORMAT_: Output format: txt, csv, json, html, markdown, xml, xmltei.
--json: JSON output shorthand.
--xml: XML output shorthand.
--csv: CSV output shorthand.
--no-comments: Exclude comments from extraction.
--no-tables: Exclude table elements from extraction.
--with-metadata: Extract and include metadata in output.
--precision: Favor extraction precision (less noise, less text).
--recall: Favor extraction recall (more text, possibly more noise).
-f, --fast: Fast extraction without fallback detection.
--formatting: Include text formatting (bold, italic, etc.).
--links: Include links with targets in output.
--deduplicate: Filter out duplicate documents and sections.
--feed _URL_: Look for feeds on the given site, or pass a feed URL directly as input.
--sitemap _URL_: Look for sitemaps on the given site, or pass a sitemap URL directly.
--parallel _N_: Number of cores/threads for downloads and processing.
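
Several of the options above can be combined in one call. The following sketch runs offline against a small sample page piped in on standard input, which the upstream documentation describes as the way to process a single local file. The HTML content, the `page` and `opts` variable names, and the chosen flag combination are illustrative assumptions; the output is not shown because it depends on the installed version.

```shell
#!/bin/sh
# Create a tiny sample page in a scratch directory (placeholder content).
workdir=$(mktemp -d)
page="$workdir/page.html"

cat > "$page" <<'EOF'
<html><head><title>Sample page</title></head>
<body><article><h1>Sample page</h1>
<p>This paragraph stands in for the main text of a real article.</p>
</article></body></html>
EOF

# Flags taken from the option list: JSON output with metadata,
# no comments or tables, and precision favored over recall.
opts="--json --with-metadata --no-comments --no-tables --precision"
echo "trafilatura $opts < $page"

if command -v trafilatura >/dev/null 2>&1; then
    # $opts is left unquoted on purpose so the shell splits it into flags.
    trafilatura $opts < "$page" \
        || echo "trafilatura reported an error" >&2
else
    echo "trafilatura not found on PATH; skipping extraction" >&2
fi
```

A page this short may yield little or no extracted text; swap in a saved article to see realistic output.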

FAQ

How do I run a basic trafilatura example?

Run `trafilatura -u [https://example.com]` in a terminal, replacing the bracketed placeholder with the page you want to extract, then add any of the options listed above as needed.

What does -u, --URL _URL_ do in trafilatura?

Fetch and process a URL.
