Linux command
scrapy command
Files
After copying, replace file names, directories, or arguments as needed.
Common examples
Create new project
scrapy startproject [project_name]
Generate spider
scrapy genspider [spider_name] [domain.com]
Run spider
scrapy crawl [spider_name]
Run spider and save to file
scrapy crawl [spider_name] -o [output.json]
Interactive shell for testing
scrapy shell "[https://example.com]"
Check spider contracts
scrapy check [spider_name]
List available spiders
scrapy list
Fetch URL and show response
scrapy fetch [https://example.com]
Description
Scrapy is a Python framework for web scraping and crawling. It handles requests, parsing, and data extraction, with built-in support for following links, handling cookies, and obeying robots.txt (controlled by the ROBOTSTXT_OBEY setting, which the default project template turns on).
Projects contain spiders: classes that define how to scrape a site. A spider specifies its start URLs, parses each response with CSS or XPath selectors, and yields items or further requests.
The shell command opens an interactive session for testing. You can try selectors against a live page before writing spider code; the response object in the shell offers the same methods, such as css() and xpath(), as the one passed to a spider callback.
Items define the structure of the scraped data. Item pipelines process each scraped item: validation, cleaning, and storage in databases or files. Items can be exported in several formats, including JSON, JSON Lines, CSV, and XML.
Middleware hooks into request and response processing: user agents, proxies, retries, cookies. Settings control behavior such as concurrency, delays, and download timeouts. Extensions add functionality such as stats collection, automatic throttling (AutoThrottle), and custom callbacks.
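The pipeline, middleware, and settings pieces described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Scrapy's own code: the class names CleanTitlePipeline and FixedUserAgentMiddleware, the title field, the example-bot/0.1 user agent string, the setting values, and the myproject module path are all hypothetical. In a real project the three parts would normally live in separate files (pipelines.py, middlewares.py, settings.py); they are shown together only for brevity. The method signatures follow the classic process_item(item, spider) and process_request(request, spider) forms, which should be checked against the Scrapy version in use.

```python
# --- pipelines.py (hypothetical) ---
class CleanTitlePipeline:
    """Item pipeline sketch: collapse whitespace in a 'title' field.

    Assumes items are dicts or dict-like objects.
    """

    def process_item(self, item, spider):
        title = item.get("title")
        if title is not None:
            # " ".join(s.split()) trims the ends and collapses inner whitespace runs.
            item["title"] = " ".join(title.split())
        # Returning the item hands it to the next pipeline in ITEM_PIPELINES.
        return item


# --- middlewares.py (hypothetical) ---
class FixedUserAgentMiddleware:
    """Downloader middleware sketch: set one User-Agent on every request."""

    def __init__(self, user_agent="example-bot/0.1"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.user_agent
        # Returning None lets Scrapy keep processing the request normally.
        return None


# --- settings.py (hypothetical values) ---
ROBOTSTXT_OBEY = True          # honor robots.txt
CONCURRENT_REQUESTS = 8        # maximum parallel requests
DOWNLOAD_DELAY = 1.0           # seconds between requests to the same site
DOWNLOAD_TIMEOUT = 30          # seconds before a download is abandoned
AUTOTHROTTLE_ENABLED = True    # adapt the delay to server latency
ITEM_PIPELINES = {"myproject.pipelines.CleanTitlePipeline": 300}
DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.FixedUserAgentMiddleware": 543}
```

The integers in ITEM_PIPELINES and DOWNLOADER_MIDDLEWARES are ordering priorities; the values 300 and 543 are only the conventional placeholders from Scrapy's project template.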
Commands and options
- startproject _NAME_
- Create a new Scrapy project in a directory named NAME.
- genspider _NAME_ _DOMAIN_
- Generate spider from template.
- crawl _SPIDER_
- Run a spider.
- shell _URL_
- Start an interactive shell with the response for URL already loaded, for testing selectors.
- list
- List available spiders.
- check _SPIDER_
- Run contract checks.
- fetch _URL_
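- Fetch URL with the Scrapy downloader and print the response body.
- view _URL_
- Fetch URL and open the response in a browser, as Scrapy sees it.
- parse _URL_
- Fetch URL and parse it with the spider that handles it.
- runspider _FILE_
- Run a spider defined in a standalone Python file, without creating a project.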
- -o _FILE_, --output _FILE_
- Append scraped items to a file. Format is inferred from the extension (json, jsonl, csv, xml). Appending to an existing json file produces invalid JSON, so prefer jsonl when appending.
- -O _FILE_, --overwrite-output _FILE_
- Same as -o but overwrites any existing file.
- -s _NAME=VALUE_, --set _NAME=VALUE_
- Override a setting (e.g. -s LOG_LEVEL=INFO).
- -a _NAME=VALUE_
- Pass an argument to the spider (read via self.<NAME>).
- -t _FORMAT_, --output-format _FORMAT_
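- Explicitly set the output format when the filename does not indicate it. Deprecated in recent Scrapy releases in favor of -o FILE:FORMAT, and may be absent from the newest ones.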
- --nolog
- Disable logging entirely.
- --loglevel _LEVEL_, -L _LEVEL_
- Set log level: DEBUG, INFO, WARNING, ERROR, CRITICAL.
- --logfile _FILE_
- Write log output to a file.
- --profile _FILE_
- Write Python cProfile stats to file.
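When a crawl is launched from a script, the -o/-O, -s, and -a options listed above can be assembled into an argument list. The following is a minimal sketch: build_crawl_command is a hypothetical helper, not part of Scrapy, and the spider name, file name, and values in the usage lines are placeholders. It only builds the command; running it (for example with subprocess.run) requires Scrapy to be installed and a project directory to run in.

```python
import shlex


def build_crawl_command(spider, output=None, overwrite=False,
                        settings=None, spider_args=None):
    """Return an argv list for `scrapy crawl` (hypothetical helper)."""
    cmd = ["scrapy", "crawl", spider]
    if output is not None:
        # -O overwrites an existing file, -o appends to it.
        cmd += ["-O" if overwrite else "-o", output]
    for name, value in (settings or {}).items():
        # -s NAME=VALUE overrides a single setting for this run.
        cmd += ["-s", f"{name}={value}"]
    for name, value in (spider_args or {}).items():
        # -a NAME=VALUE is passed to the spider as an argument.
        cmd += ["-a", f"{name}={value}"]
    return cmd


if __name__ == "__main__":
    cmd = build_crawl_command(
        "quotes",
        output="items.jsonl",
        overwrite=True,
        settings={"LOG_LEVEL": "INFO"},
        spider_args={"category": "books"},
    )
    # Prints: scrapy crawl quotes -O items.jsonl -s LOG_LEVEL=INFO -a category=books
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument intact when it is handed to subprocess.run, so values containing spaces need no manual quoting.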
FAQ
What is the scrapy command used for?
scrapy is the command-line tool for Scrapy, a Python framework for web scraping and crawling. It creates projects, generates and runs spiders, exports scraped items to files, opens an interactive shell for testing selectors, and fetches or parses individual URLs.
How do I run a basic scrapy example?
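Run `scrapy startproject [project_name]` in a terminal, replacing the placeholder with your own project name. Then, from inside the new project directory, `scrapy genspider [spider_name] [domain.com]` creates a first spider and `scrapy crawl [spider_name]` runs it.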
What does startproject _NAME_ do in scrapy?
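It creates a new Scrapy project: a directory named NAME containing the project skeleton (settings, items, pipelines, middlewares, and a spiders package).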