scrapy Command: Examples, Options, and Usage

Common examples

Create new project

scrapy startproject [project_name]

Generate spider

scrapy genspider [spider_name] [domain.com]

Run spider

scrapy crawl [spider_name]

Run spider and save to file

scrapy crawl [spider_name] -o [output.json]

Interactive shell for testing

scrapy shell "[https://example.com]"

Check spider contracts

scrapy check [spider_name]

List available spiders

scrapy list

Fetch URL and show response

scrapy fetch [https://example.com]

Description

Scrapy is a Python framework for web scraping and crawling. It handles requests, parsing, and data extraction, with built-in support for following links, handling cookies, and respecting robots.txt.

Projects contain spiders: classes that define how to scrape a site. A spider specifies start URLs, parses each response with CSS or XPath selectors, and yields items or further requests. The shell command opens an interactive session where you can experiment with selectors on live pages before writing spider code; the response object there offers the same methods as in a spider callback.

Items define the structure of the scraped data. Item pipelines process scraped items: validation, cleaning, and storage to databases or files. Feed exports support multiple output formats (json, jsonl, csv, xml). Middleware handles request and response processing: user agents, proxies, retries, cookies. Settings control behavior such as concurrency, delays, and download timeouts. Extensions add functionality such as stats collection, automatic throttling (AutoThrottle), and custom callbacks.

Options

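startproject _NAME_: Create a new Scrapy project.
genspider _NAME_ _DOMAIN_: Generate a spider from a template.
crawl _SPIDER_: Run a spider.
shell _URL_: Start an interactive shell for testing selectors against the given URL.
list: List the spiders available in the current project.
check _SPIDER_: Run the spider's contract checks.
fetch _URL_: Fetch a URL with the Scrapy downloader and print the response to standard output.
view _URL_: Open a URL in a browser, as Scrapy sees it.
parse _URL_: Fetch a URL and parse it with the spider that handles it.
runspider _FILE_: Run a spider from a standalone Python file, without creating a project.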
-o _FILE_, --output _FILE_: Append crawled items to a file. Format is inferred from the extension (json, jsonl, csv, xml).
-O _FILE_, --overwrite-output _FILE_: Same as -o but overwrites any existing file.
-s _NAME=VALUE_, --set _NAME=VALUE_: Override a setting (e.g. -s LOG_LEVEL=INFO).
-a _NAME=VALUE_: Pass an argument to the spider (read via self.<NAME>).
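-t _FORMAT_, --output-format _FORMAT_: Explicitly set the output format when the filename does not indicate it. Deprecated in Scrapy 2.x in favor of giving the format in the output URI, as in -o _FILE_:_FORMAT_.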
--nolog: Disable logging entirely.
--loglevel _LEVEL_, -L _LEVEL_: Set log level: DEBUG, INFO, WARNING, ERROR, CRITICAL.
--logfile _FILE_: Write log output to a file.
--profile _FILE_: Write Python cProfile stats to file.

FAQ

What is the scrapy command used for?

It is the command-line tool for Scrapy, a Python framework for web scraping and crawling. It creates projects, generates and runs spiders, exports the scraped items to files, and offers an interactive shell for testing selectors on live pages.

How do I run a basic scrapy example?

Run `scrapy startproject [project_name]` in a terminal to create a project. Inside the project directory, generate a spider with `scrapy genspider [spider_name] [domain.com]` and run it with `scrapy crawl [spider_name]`.

What does startproject _NAME_ do in scrapy?

It creates a new Scrapy project named _NAME_, by default in a new directory of the same name.
