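Scrapling Extract Command Guide
Web scraping from the terminal, no programming required!
The scrapling extract command lets you download websites and extract their content directly from the terminal without writing any code. It is well suited to beginners, researchers, and anyone who wants to pull data from web pages quickly.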
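What is the extract command group?
The extract commands are a set of simple terminal tools that can:
- Download web pages and save their content to a file.
- Convert HTML to a more readable format such as Markdown, keep the raw HTML, or extract only the page text.
- Accept custom CSS selectors so that only specific parts of a page are extracted.
- Fetch pages with plain HTTP requests or with a browser.
- Be customized extensively with custom headers, cookies, proxies, and many other options. Most parameters available in code are also available on the command line.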
- Basic page download
  Save a page's text content as clean, readable plain text:
      scrapling extract get "https://example.com" page_content.txt
  This sends an HTTP GET request and saves the page's text content to page_content.txt.
- Saving in different formats
  Choose the output format by changing the output file's extension:
      # Convert the HTML content to Markdown, then save it (handy for archiving documentation)
      scrapling extract get "https://blog.example.com" article.md
      # Save the HTML content as-is
      scrapling extract get "https://example.com" page.html
      # Save the page's clean text content to a file
      scrapling extract get "https://example.com" content.txt
      # The Docker image works too, for example:
      docker run -v $(pwd)/output:/output scrapling extract get "https://blog.example.com" /output/article.md
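- Extracting specific content
  Every command accepts --css-selector (or -s) to extract only specific parts of a page; the examples later in this guide show how to write it.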
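As a minimal sketch of the extension-based format selection described above, the shell function below saves one page in all three formats, optionally narrowed by a CSS selector. The function name save_all_formats and the SCRAPLING_BIN variable are hypothetical helpers introduced here, not part of Scrapling; SCRAPLING_BIN exists only so the function can be dry-run with echo instead of the real CLI.

```shell
# Save one page as HTML, Markdown, and plain text by varying the file extension.
# SCRAPLING_BIN is a hypothetical override (not a Scrapling feature); it
# defaults to the real `scrapling` executable.
save_all_formats() {
    url="$1"
    stem="$2"
    selector="${3:-}"   # optional CSS selector

    for ext in html md txt; do
        # Rebuild the argument list for each output file.
        set -- extract get "$url" "${stem}.${ext}"
        if [ -n "$selector" ]; then
            set -- "$@" --css-selector "$selector"
        fi
        "${SCRAPLING_BIN:-scrapling}" "$@" || return 1
    done
}
```

Calling save_all_formats "https://example.com" page would write page.html, page.md, and page.txt. Note that this sketch sends one request per format, so the page is downloaded three times.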
Run scrapling extract --help to list the available commands:
    Usage: scrapling extract [OPTIONS] COMMAND [ARGS]...

      Fetch web pages using various fetchers and extract full/selected HTML
      content as HTML, Markdown, or extract text content.

    Options:
      --help  Show this message and exit.

    Commands:
      get             Perform a GET request and save the content to a file.
      post            Perform a POST request and save the content to a file.
      put             Perform a PUT request and save the content to a file.
      delete          Perform a DELETE request and save the content to a file.
      fetch           Use DynamicFetcher to fetch content with browser...
      stealthy-fetch  Use StealthyFetcher to fetch content with advanced...
Each command is described in detail below.
HTTP requests
- GET requests
  The most commonly used command for downloading page content:
      scrapling extract get [URL] [OUTPUT_FILE] [OPTIONS]
  Examples:
      # Basic download
      scrapling extract get "https://news.site.com" news.md
      # Custom timeout
      scrapling extract get "https://example.com" content.txt --timeout 60
      # Extract only specific content with a CSS selector
      scrapling extract get "https://blog.example.com" articles.md --css-selector "article"
      # Send the request with cookies
      scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john"
      # Add a User-Agent
      scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0"
      # Add multiple headers
      scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"
  Run scrapling extract get --help to see the options this command accepts:
      Usage: scrapling extract get [OPTIONS] URL OUTPUT_FILE

        Perform a GET request and save the content to a file.

        The output file path can be an HTML file, a Markdown file of the HTML
        content, or the text content itself. Use file extensions
        (`.html`/`.md`/`.txt`) respectively.

      Options:
        -H, --headers TEXT       HTTP headers in format "Key: Value" (can be used multiple times)
        --cookies TEXT           Cookies string in format "name1=value1;name2=value2"
        --timeout INTEGER        Request timeout in seconds (default: 30)
        --proxy TEXT             Proxy URL in format "http://username:***@host:port"
        -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
        -p, --params TEXT        Query parameters in format "key=value" (can be used multiple times)
        --follow-redirects / --no-follow-redirects
                                 Whether to follow redirects (default: True)
        --verify / --no-verify   Whether to verify SSL certificates (default: True)
        --impersonate TEXT       Browser to impersonate (e.g., chrome, firefox).
        --stealthy-headers / --no-stealthy-headers
                                 Use stealthy browser headers (default: True)
        --ai-targeted            Extract only main content and sanitize hidden elements for AI consumption (default: False)
        --help                   Show this message and exit.
  These options behave essentially the same way in the other request commands, so they are not explained again below.
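The GET examples above do not show the repeatable -p option, so here is a small sketch that combines query parameters, a header, and a timeout in one call. The URL https://example.com/search, the parameter names q and page, the function name fetch_search_page, and the SCRAPLING_BIN dry-run override are all assumptions made for illustration. Whether the CLI URL-encodes parameter values is not stated in the help text, so the sketch sticks to values that need no encoding.

```shell
# Fetch one page of (hypothetical) search results with query parameters.
# SCRAPLING_BIN is a hypothetical override used for dry runs; it defaults
# to the real `scrapling` executable.
fetch_search_page() {
    query="$1"
    page="$2"
    out="$3"

    # -p and -H may each be given multiple times, per the help output.
    # --timeout is in seconds for the HTTP commands.
    "${SCRAPLING_BIN:-scrapling}" extract get "https://example.com/search" "$out" \
        -p "q=$query" \
        -p "page=$page" \
        -H "Accept-Language: en-US" \
        --timeout 60
}
```

For example, fetch_search_page python 2 results.md would request the second results page and save it as Markdown.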
- POST requests
      scrapling extract post [URL] [OUTPUT_FILE] [OPTIONS]
  Examples:
      # Submit form data
      scrapling extract post "https://api.site.com/search" results.html --data "query=python&type=tutorial"
      # Send JSON data
      scrapling extract post "https://api.site.com" response.json --json '{"username": "test", "action": "search"}'
  Run scrapling extract post --help to see the options this command accepts:
      Usage: scrapling extract post [OPTIONS] URL OUTPUT_FILE

        Perform a POST request and save the content to a file.

        The output file path can be an HTML file, a Markdown file of the HTML
        content, or the text content itself. Use file extensions
        (`.html`/`.md`/`.txt`) respectively.

      Options:
        -d, --data TEXT          Form data to include in the request body (as string, ex: "param1=value1&param2=value2")
        -j, --json TEXT          JSON data to include in the request body (as string)
        -H, --headers TEXT       HTTP headers in format "Key: Value" (can be used multiple times)
        --cookies TEXT           Cookies string in format "name1=value1;name2=value2"
        --timeout INTEGER        Request timeout in seconds (default: 30)
        --proxy TEXT             Proxy URL in format "http://username:***@host:port"
        -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
        -p, --params TEXT        Query parameters in format "key=value" (can be used multiple times)
        --follow-redirects / --no-follow-redirects
                                 Whether to follow redirects (default: True)
        --verify / --no-verify   Whether to verify SSL certificates (default: True)
        --impersonate TEXT       Browser to impersonate (e.g., chrome, firefox).
        --stealthy-headers / --no-stealthy-headers
                                 Use stealthy browser headers (default: True)
        --ai-targeted            Extract only main content and sanitize hidden elements for AI consumption (default: False)
        --help                   Show this message and exit.
- PUT requests
      scrapling extract put [URL] [OUTPUT_FILE] [OPTIONS]
  Examples:
      # Send form data
      scrapling extract put "https://scrapling.requestcatcher.com/put" results.html --data "update=info" --impersonate "firefox"
      # Send JSON data
      scrapling extract put "https://scrapling.requestcatcher.com/put" response.json --json '{"username": "test", "action": "search"}'
  Run scrapling extract put --help to see the options this command accepts:
      Usage: scrapling extract put [OPTIONS] URL OUTPUT_FILE

        Perform a PUT request and save the content to a file.

        The output file path can be an HTML file, a Markdown file of the HTML
        content, or the text content itself. Use file extensions
        (`.html`/`.md`/`.txt`) respectively.

      Options:
        -d, --data TEXT          Form data to include in the request body
        -j, --json TEXT          JSON data to include in the request body (as string)
        -H, --headers TEXT       HTTP headers in format "Key: Value" (can be used multiple times)
        --cookies TEXT           Cookies string in format "name1=value1;name2=value2"
        --timeout INTEGER        Request timeout in seconds (default: 30)
        --proxy TEXT             Proxy URL in format "http://username:***@host:port"
        -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
        -p, --params TEXT        Query parameters in format "key=value" (can be used multiple times)
        --follow-redirects / --no-follow-redirects
                                 Whether to follow redirects (default: True)
        --verify / --no-verify   Whether to verify SSL certificates (default: True)
        --impersonate TEXT       Browser to impersonate (e.g., chrome, firefox).
        --stealthy-headers / --no-stealthy-headers
                                 Use stealthy browser headers (default: True)
        --ai-targeted            Extract only main content and sanitize hidden elements for AI consumption (default: False)
        --help                   Show this message and exit.
- DELETE requests
      scrapling extract delete [URL] [OUTPUT_FILE] [OPTIONS]
  Examples:
      # Send a DELETE request
      scrapling extract delete "https://scrapling.requestcatcher.com/delete" results.html
      # Impersonate a specific browser
      scrapling extract delete "https://scrapling.requestcatcher.com/" response.txt --impersonate "chrome"
  Run scrapling extract delete --help to see the options this command accepts:
      Usage: scrapling extract delete [OPTIONS] URL OUTPUT_FILE

        Perform a DELETE request and save the content to a file.

        The output file path can be an HTML file, a Markdown file of the HTML
        content, or the text content itself. Use file extensions
        (`.html`/`.md`/`.txt`) respectively.

      Options:
        -H, --headers TEXT       HTTP headers in format "Key: Value" (can be used multiple times)
        --cookies TEXT           Cookies string in format "name1=value1;name2=value2"
        --timeout INTEGER        Request timeout in seconds (default: 30)
        --proxy TEXT             Proxy URL in format "http://username:***@host:port"
        -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
        -p, --params TEXT        Query parameters in format "key=value" (can be used multiple times)
        --follow-redirects / --no-follow-redirects
                                 Whether to follow redirects (default: True)
        --verify / --no-verify   Whether to verify SSL certificates (default: True)
        --impersonate TEXT       Browser to impersonate (e.g., chrome, firefox).
        --stealthy-headers / --no-stealthy-headers
                                 Use stealthy browser headers (default: True)
        --ai-targeted            Extract only main content and sanitize hidden elements for AI consumption (default: False)
        --help                   Show this message and exit.
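For the POST and PUT commands above, quoting a long JSON body inline quickly becomes error-prone. One possible approach, sketched below, is to keep the body in a file and pass its contents to --json. The function name post_json_file and the SCRAPLING_BIN dry-run override are hypothetical; the sketch does not validate the JSON, it only checks that the file is readable.

```shell
# Send the contents of a JSON file as the body of a POST request.
# SCRAPLING_BIN is a hypothetical override used for dry runs; it defaults
# to the real `scrapling` executable.
post_json_file() {
    url="$1"
    payload_file="$2"
    out="$3"

    if [ ! -r "$payload_file" ]; then
        echo "cannot read payload file: $payload_file" >&2
        return 1
    fi

    # --json takes the JSON document as a single string argument.
    "${SCRAPLING_BIN:-scrapling}" extract post "$url" "$out" \
        --json "$(cat "$payload_file")"
}
```

A call might look like post_json_file "https://api.site.com" payload.json response.txt. The same pattern should carry over to put, since its help output lists the same --json option.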
- fetch: handling dynamic content
  For sites that load content dynamically with JavaScript, or that have only light protection:
      scrapling extract fetch [URL] [OUTPUT_FILE] [OPTIONS]
  Examples:
      # Wait for JavaScript to load the content and for network activity to settle
      scrapling extract fetch "https://scrapling.requestcatcher.com/" content.md --network-idle
      # Wait for specific content to appear
      scrapling extract fetch "https://scrapling.requestcatcher.com/" data.txt --wait-selector ".content-loaded"
      # Run with a visible browser window (useful for debugging)
      scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources
  Run scrapling extract fetch --help to see the options this command accepts:
      Usage: scrapling extract fetch [OPTIONS] URL OUTPUT_FILE

        Use DynamicFetcher to fetch content with browser automation.

        The output file path can be an HTML file, a Markdown file of the HTML
        content, or the text content itself. Use file extensions
        (`.html`/`.md`/`.txt`) respectively.

      Options:
        --headless / --no-headless
                                 Run browser in headless mode (default: True)
        --disable-resources / --enable-resources
                                 Drop unnecessary resources for speed boost (default: False)
        --network-idle / --no-network-idle
                                 Wait for network idle (default: False)
        --timeout INTEGER        Timeout in milliseconds (default: 30000)
        --wait INTEGER           Additional wait time in milliseconds after page load (default: 0)
        -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
        --wait-selector TEXT     CSS selector to wait for before proceeding
        --locale TEXT            Specify user locale. Defaults to the system default locale.
        --real-chrome/--no-real-chrome
                                 If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
        --proxy TEXT             Proxy URL in format "http://username:***@host:port"
        -H, --extra-headers TEXT Extra headers in format "Key: Value" (can be used multiple times)
        --dns-over-https / --no-dns-over-https
                                 Route DNS through Cloudflare's DoH to prevent DNS leaks when using proxies (default: False)
        --block-ads / --no-block-ads
                                 Block requests to known ad and tracker domains (default: False)
        --ai-targeted            Extract only main content and sanitize hidden elements for AI consumption (default: False)
        --help                   Show this message and exit.
- stealthy-fetch: bypassing protection
  For sites with anti-bot mechanisms or Cloudflare protection:
      scrapling extract stealthy-fetch [URL] [OUTPUT_FILE] [OPTIONS]
  Examples:
      # Bypass basic protection
      scrapling extract stealthy-fetch "https://scrapling.requestcatcher.com" content.md
      # Solve Cloudflare challenges
      scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"
      # Use a proxy for better anonymity
      scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"
  Run scrapling extract stealthy-fetch --help to see the options this command accepts:
      Usage: scrapling extract stealthy-fetch [OPTIONS] URL OUTPUT_FILE

        Use StealthyFetcher to fetch content with advanced stealth features.

        The output file path can be an HTML file, a Markdown file of the HTML
        content, or the text content itself. Use file extensions
        (`.html`/`.md`/`.txt`) respectively.

      Options:
        --headless / --no-headless
                                 Run browser in headless mode (default: True)
        --disable-resources / --enable-resources
                                 Drop unnecessary resources for speed boost (default: False)
        --block-webrtc / --allow-webrtc
                                 Block WebRTC entirely (default: False)
        --solve-cloudflare / --no-solve-cloudflare
                                 Solve Cloudflare challenges (default: False)
        --allow-webgl / --block-webgl
                                 Allow WebGL (default: True)
        --network-idle / --no-network-idle
                                 Wait for network idle (default: False)
        --real-chrome/--no-real-chrome
                                 If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
        --timeout INTEGER        Timeout in milliseconds (default: 30000)
        --wait INTEGER           Additional wait time in milliseconds after page load (default: 0)
        -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
        --wait-selector TEXT     CSS selector to wait for before proceeding
        --hide-canvas / --show-canvas
                                 Add noise to canvas operations (default: False)
        --proxy TEXT             Proxy URL in format "http://username:***@host:port"
        -H, --extra-headers TEXT Extra headers in format "Key: Value" (can be used multiple times)
        --dns-over-https / --no-dns-over-https
                                 Route DNS through Cloudflare's DoH to prevent DNS leaks when using proxies (default: False)
        --block-ads / --no-block-ads
                                 Block requests to known ad and tracker domains (default: False)
        --ai-targeted            Extract only main content and sanitize hidden elements for AI consumption (default: False)
        --help                   Show this message and exit.
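One detail in the help output above is easy to miss: the browser-based commands take --timeout and --wait in milliseconds, while the HTTP commands take --timeout in seconds. The sketch below combines the waiting options that both fetch and stealthy-fetch list in their help text. The function name render_page, the chosen wait values, and the SCRAPLING_BIN dry-run override are assumptions for illustration only.

```shell
# Render a JavaScript-heavy page with either browser-based command and wait
# for a given element before saving.
# SCRAPLING_BIN is a hypothetical override used for dry runs; it defaults
# to the real `scrapling` executable.
render_page() {
    cmd="$1"        # fetch or stealthy-fetch
    url="$2"
    out="$3"
    selector="$4"   # element that signals the content has loaded

    case "$cmd" in
        fetch|stealthy-fetch) ;;
        *)
            echo "expected fetch or stealthy-fetch, got: $cmd" >&2
            return 1
            ;;
    esac

    # Values below are in milliseconds: wait 2 s after load, give up after 60 s.
    "${SCRAPLING_BIN:-scrapling}" extract "$cmd" "$url" "$out" \
        --network-idle \
        --wait-selector "$selector" \
        --wait 2000 \
        --timeout 60000
}
```

For instance, render_page fetch "https://scrapling.requestcatcher.com/" content.md ".content-loaded" mirrors the earlier examples with a longer timeout and an extra settling delay.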
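Which command should you use?
If you are not yet a web scraping expert and are unsure which approach to pick, start with this rule of thumb:
- Simple websites, blogs, and news articles: use get
- Modern web apps or sites with dynamic content: use fetch
- Sites with protection mechanisms, Cloudflare, or anti-bot systems: use stealthy-fetch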
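The rule of thumb above can be written down as a small lookup, which may be handy in scripts that process a mixed list of sites. The category names simple, dynamic, and protected, the function names pick_command and extract_page, and the SCRAPLING_BIN dry-run override are all hypothetical labels chosen for this sketch; deciding which category a site belongs to is still up to you.

```shell
# Map a rough site category to the extract subcommand suggested above.
pick_command() {
    case "$1" in
        simple)    echo "get" ;;             # simple sites, blogs, news articles
        dynamic)   echo "fetch" ;;           # modern web apps, dynamic content
        protected) echo "stealthy-fetch" ;;  # Cloudflare or anti-bot systems
        *)
            echo "unknown site type: $1" >&2
            return 1
            ;;
    esac
}

# Run the matching subcommand for a URL.
# SCRAPLING_BIN is a hypothetical override used for dry runs; it defaults
# to the real `scrapling` executable.
extract_page() {
    site_type="$1"
    url="$2"
    out="$3"

    cmd="$(pick_command "$site_type")" || return 1
    "${SCRAPLING_BIN:-scrapling}" extract "$cmd" "$url" "$out"
}
```

With this in place, extract_page dynamic "https://example.com" page.md would run the fetch subcommand.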
Legal and ethical considerations
Happy scraping! Always respect website policies and comply with all applicable laws and regulations.