Linux command
trafilatura command
Network
After copying, replace file names, directories, or options as needed.
Common examples
Extract text from URL
trafilatura -u [https://example.com]
Extract from a local HTML file (read from standard input)
trafilatura < [page.html]
Output as JSON with metadata
trafilatura -u [https://example.com] --json --with-metadata
Extract without comments or tables
trafilatura -u [https://example.com] --no-comments --no-tables
Batch process URLs from a file to output directory
trafilatura -i [urls.txt] -o [output_dir]
Favor precision over recall
trafilatura -u [https://example.com] --precision
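The batch example above can be turned into a small script. This is a minimal sketch, not a definitive workflow: the file name `urls.txt`, the directory `output_dir`, and the two placeholder URLs are assumptions made for illustration, and the extraction step runs only if `trafilatura` is found on the PATH.

```shell
#!/bin/sh
# Hypothetical file and directory names; adjust them for your system.
urls_file="urls.txt"
out_dir="output_dir"

# One URL per line; these addresses are placeholders.
printf '%s\n' "https://example.com" "https://example.org" > "$urls_file"
mkdir -p "$out_dir"

# Run the batch extraction only when trafilatura is installed.
if command -v trafilatura >/dev/null 2>&1; then
    trafilatura -i "$urls_file" -o "$out_dir" || echo "extraction failed" >&2
else
    echo "trafilatura not found; install it before running this step" >&2
fi
```

Keeping the URL list in a file makes it easy to re-run the same batch later with different options, such as `--precision` or another output format.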
Description
trafilatura extracts the main text content from web pages, automatically removing navigation, ads, headers, footers, and other boilerplate elements. It can fetch pages from URLs directly or process local HTML files. Output is available in plain text, CSV, JSON, HTML, Markdown, XML, or XML-TEI formats. The tool also extracts metadata such as publication dates, authors, and page titles. Batch processing handles multiple URLs from a list file, making it suitable for web scraping and corpus building. Link discovery via feeds, sitemaps, and crawling is built in.
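To see boilerplate removal without touching the network, a made-up page can be written to disk and fed to the tool. This sketch assumes the command reads HTML from standard input when no URL is given; the file name `sample_page.html` and its contents are invented for illustration.

```shell
#!/bin/sh
# Hypothetical file name; the HTML below is a made-up sample page.
page="sample_page.html"

cat > "$page" <<'EOF'
<html>
  <head><title>Sample article</title></head>
  <body>
    <nav>Home | About | Contact</nav>
    <article>
      <h1>Sample article</h1>
      <p>This paragraph stands in for the main text of a page, which is the part the extractor is meant to keep.</p>
    </article>
    <footer>Copyright notice and other boilerplate</footer>
  </body>
</html>
EOF

# Assumption: trafilatura reads HTML from standard input when no URL is given.
if command -v trafilatura >/dev/null 2>&1; then
    trafilatura --json --with-metadata < "$page" || echo "extraction failed" >&2
else
    echo "trafilatura not found; install it before running this step" >&2
fi
```

Very short pages may yield little or no output, so a longer real page is a better test once the basic invocation works.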
Options
- -u, --URL _URL_
- Fetch and process a URL.
- -i, --input-file _FILE_
- Input file containing a list of URLs for batch processing (one URL per line).
- -o, --output-dir _DIR_
- Write results to specified directory.
- --output-format _FORMAT_
- Output format: txt, csv, json, html, markdown, xml, xmltei.
- --json
- JSON output shorthand.
- --xml
- XML output shorthand.
- --csv
- CSV output shorthand.
- --no-comments
- Exclude comments from extraction.
- --no-tables
- Exclude table elements from extraction.
- --with-metadata
- Extract and include metadata in output.
- --precision
- Favor extraction precision (less noise, less text).
- --recall
- Favor extraction recall (more text, possibly more noise).
- -f, --fast
- Fast extraction without fallback detection.
- --formatting
- Include text formatting (bold, italic, etc.).
- --links
- Include links with targets in output.
- --deduplicate
- Filter out duplicate documents and sections.
- --feed _URL_
- Look for feeds or pass feed URL as input.
- --sitemap _URL_
- Look for sitemaps or enter sitemap URL.
- --parallel _N_
- Number of cores/threads for downloads and processing.
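Several of the options listed above can be combined in one call. The sketch below builds the argument list first so it can be inspected before anything is fetched; the URL is a placeholder and the particular combination of flags is only one plausible choice, not a recommended default.

```shell
#!/bin/sh
# Placeholder target and output format; change both as needed.
url="https://example.com"
format="markdown"

# Collect the options as positional parameters so quoting stays intact.
set -- -u "$url" --output-format "$format" --no-comments --links --formatting

# Show the command that is about to run.
echo "trafilatura $*"

# Run it only when trafilatura is installed.
if command -v trafilatura >/dev/null 2>&1; then
    trafilatura "$@" || echo "extraction failed" >&2
else
    echo "trafilatura not found; install it before running this step" >&2
fi
```

Storing the options with `set --` avoids word-splitting problems that appear when a command line is assembled in a single string variable.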
FAQ
How do I run a basic trafilatura example?
Run `trafilatura -u [https://example.com]` in a terminal, then adjust file names, paths, flags, or remote targets for your system.