Linux command

scrapy command

Files

After copying a command, replace file names, directories, or arguments as needed.

Common examples

Create new project

scrapy startproject [project_name]

Generate spider

scrapy genspider [spider_name] [domain.com]

Run spider

scrapy crawl [spider_name]

Run spider and save to file

scrapy crawl [spider_name] -o [output.json]

Interactive shell for testing

scrapy shell "[https://example.com]"

Check spider contracts

scrapy check [spider_name]

List available spiders

scrapy list

Fetch URL and show response

scrapy fetch "[https://example.com]"

Description

Scrapy is a Python framework for web scraping and crawling. It schedules requests, downloads pages, and extracts data, with built-in support for following links, handling cookies, and respecting robots.txt (controlled by the ROBOTSTXT_OBEY setting).

A project contains spiders: classes that define how to scrape a site. A spider specifies its start URLs, parses each response with CSS or XPath selectors, and yields items or further requests.

The shell command opens an interactive session for trying selectors on a live page before writing spider code. The response object in the shell is the same kind of object a spider callback receives, so expressions that work there can be pasted into a spider.

Items define the structure of the scraped data. Item pipelines process each scraped item: validation, cleaning, and storage to databases or files. Feed exports write items in several formats, including JSON, JSON Lines, CSV, and XML.

Middleware hooks into request and response processing: user agents, proxies, retries, cookies. Settings control behavior such as concurrency, download delays, and download timeouts. Extensions add functionality such as stats collection, AutoThrottle, and custom hooks tied to Scrapy signals.

Parameters

startproject _NAME_
Create new Scrapy project.
genspider _NAME_ _DOMAIN_
Generate spider from template.
crawl _SPIDER_
Run a spider.
shell _URL_
Interactive shell for testing.
list
List available spiders.
check _SPIDER_
Run contract checks.
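fetch _URL_
Fetch a URL with the Scrapy downloader and print the response body.
view _URL_
Open the URL in a browser, as Scrapy sees it.
parse _URL_
Fetch a URL and parse it with the spider that handles it.
runspider _FILE_
Run a spider from a standalone Python file, without creating a project.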
-o _FILE_, --output _FILE_
Append crawled items to a file. Format is inferred from the extension (json, jsonl, csv, xml). Appending to an existing json file produces invalid JSON; prefer jsonl when appending.
-O _FILE_, --overwrite-output _FILE_
Same as -o but overwrites any existing file.
-s _NAME=VALUE_, --set _NAME=VALUE_
Override a setting (e.g. -s LOG_LEVEL=INFO).
-a _NAME=VALUE_
Pass an argument to the spider. By default it becomes the attribute self.<NAME>, and its value is always a string. The option can be repeated.
-t _FORMAT_, --output-format _FORMAT_
Explicitly set the output format when the file name does not indicate it. Deprecated in recent Scrapy releases in favor of the -o FILE:FORMAT syntax.
--nolog
Disable logging entirely.
--loglevel _LEVEL_, -L _LEVEL_
Set log level: DEBUG, INFO, WARNING, ERROR, CRITICAL.
--logfile _FILE_
Write log output to a file.
--profile _FILE_
Write Python cProfile stats to file.

FAQ

What is the scrapy command used for?

The scrapy command is the command-line tool for Scrapy, a Python framework for web scraping and crawling. It creates projects and spiders, runs crawls, exports scraped items to files, and opens an interactive shell for testing selectors. Scraped items can pass through item pipelines for validation, cleaning, and storage. See the description above for details.
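The item pipeline step mentioned above can be sketched with plain Python: a dataclass item and a pipeline class exposing `process_item`. The `QuoteItem` and `CleanQuotePipeline` names and the cleaning rules are hypothetical; a real pipeline would typically also validate and raise `scrapy.exceptions.DropItem` for bad records.

```python
from dataclasses import dataclass


@dataclass
class QuoteItem:
    """Hypothetical item: the fields a spider would fill in."""

    text: str
    author: str


class CleanQuotePipeline:
    """Hypothetical pipeline that tidies whitespace and quote marks."""

    def process_item(self, item, spider=None):
        # Collapse runs of whitespace, then strip surrounding curly quotes.
        item.text = " ".join(item.text.split()).strip("\u201c\u201d")
        item.author = item.author.strip()
        # Returning the item passes it on to the next pipeline stage.
        return item


if __name__ == "__main__":
    raw = QuoteItem(text="  \u201cHello   world\u201d ", author=" A. Writer ")
    print(CleanQuotePipeline().process_item(raw))
```

In a project, a pipeline like this is switched on through the ITEM_PIPELINES setting, for example `{"myproject.pipelines.CleanQuotePipeline": 300}`, where the module path is an assumption. The sketch uses attribute access, so it fits dataclass items rather than plain dicts.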

How do I run a basic scrapy example?

Run `scrapy startproject [project_name]` in a terminal and change into the new directory. Then generate a spider with `scrapy genspider [spider_name] [domain.com]` and run it with `scrapy crawl [spider_name] -o [output.json]`.

What does startproject _NAME_ do in scrapy?

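It creates a new Scrapy project in a directory called NAME, containing scrapy.cfg and a Python package with settings.py, items.py, pipelines.py, middlewares.py, and a spiders directory.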