HTTP Requests

Learn about Scrapling's Fetcher and FetcherSession: the shared arguments, the common HTTP methods, and session management.

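The Fetcher class is built on the high-performance curl_cffi library and provides fast, lightweight HTTP requests with a large set of stealth features.

Basic Usage

Import the Fetcher as shown below; every fetcher is imported the same way.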

from scrapling.fetchers import Fetcher

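How to configure the parsing options is described elsewhere in the documentation.

Shared Arguments

All request methods share a common set of arguments, so they are covered together first.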

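url: The target URL.
stealthy_headers: When enabled (the default), generates and attaches realistic browser headers, and also sets a Google referer header.
follow_redirects: Controls redirect behavior. The default is "safe", which follows redirects but refuses to redirect to internal / private IP addresses (SSRF protection). Pass True to follow all redirects without restriction, or False to disable redirects entirely.
timeout: Number of seconds to wait for each request to complete. Defaults to 30.
retries: Number of times the fetcher retries a failed request. Defaults to 3.
retry_delay: Number of seconds to wait between retries. Defaults to 1.
impersonate: Impersonates the TLS fingerprint of a specific browser. Accepts a browser string or a list of strings: for example "chrome110", "firefox102", or "safari15_5" to pin a specific version, or "chrome", "firefox", "safari", or "edge" to automatically use the latest version available. This makes your requests look like they come from a real browser at the TLS level. If a list is given, one entry is picked at random for each request. Defaults to the latest available Chrome version.
http3: Send the request over HTTP/3. Defaults to False. May cause problems when combined with impersonate.
cookies: Cookies to send with the request, either a name→value dictionary or a list of dictionaries.
proxy: The proxy for this request, used to route all traffic (HTTP and HTTPS). Supported formats include http://username:***@localhost:8030.
proxy_auth: HTTP Basic Auth credentials for the proxy, as a (username, password) tuple.
proxies: A dictionary of proxies in the form {"http": proxy_url, "https": proxy_url}.
proxy_rotator: A ProxyRotator instance for automatic proxy rotation. Cannot be combined with proxy or proxies.
headers: Headers to include in the request. These can override any header generated by stealthy_headers.
max_redirects: Maximum number of redirects. Defaults to 30; pass -1 for no limit.
verify: Whether to verify HTTPS certificates. Defaults to True.
cert: A (cert, key) tuple of client certificate file names.
selector_config: A dictionary of custom parsing arguments used when creating the final Selector / Response class.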
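
To make the "safe" redirect mode more concrete, the sketch below shows one way a client could decide whether a redirect target points at an internal address. It is only an illustration built on Python's standard ipaddress module; is_internal_target is a hypothetical helper, not Scrapling's actual implementation, and it only recognizes hosts that are literal IP addresses or "localhost" (a real check would also have to resolve hostnames).

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_target(url: str) -> bool:
    """Return True if the URL's host is a loopback, private, or link-local address.

    Hypothetical helper for illustration; hostnames other than "localhost"
    are not resolved and are treated as external.
    """
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP address; a real check would resolve it via DNS first.
        return False
    return address.is_private or address.is_loopback or address.is_link_local
```

A redirect policy in this spirit would follow a redirect only when the check returns False for the new location.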

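Beyond these, if you need deeper customization, you can pass any argument that curl_cffi supports to any of the methods, even if the method does not expose it directly yet.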
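
To pull the shared arguments together, here is a minimal sketch that collects the defaults stated in the list above and merges per-call overrides on top of them. DEFAULTS, build_request_kwargs, and fetch are hypothetical names introduced for illustration; the only library call used is Fetcher.get with arguments from the list, and passing the defaults explicitly is redundant in practice.

```python
# Defaults as stated in the shared-arguments list.
DEFAULTS = {
    "stealthy_headers": True,
    "follow_redirects": "safe",
    "timeout": 30,
    "retries": 3,
    "retry_delay": 1,
    "http3": False,
    "max_redirects": 30,
    "verify": True,
}


def build_request_kwargs(**overrides):
    """Merge caller overrides on top of the documented defaults (hypothetical helper)."""
    uses_rotator = overrides.get("proxy_rotator") is not None
    uses_fixed_proxy = (
        overrides.get("proxy") is not None or overrides.get("proxies") is not None
    )
    if uses_rotator and uses_fixed_proxy:
        # Mirrors the rule that proxy_rotator cannot be combined with proxy/proxies.
        raise ValueError("proxy_rotator cannot be combined with proxy or proxies")
    return {**DEFAULTS, **overrides}


def fetch(url, **overrides):
    """Send a GET request with the merged arguments."""
    from scrapling.fetchers import Fetcher  # imported lazily; requires Scrapling

    return Fetcher.get(url, **build_request_kwargs(**overrides))
```

For example, fetch('https://example.com', timeout=10, headers={'User-Agent': 'Custom/1.0'}) would keep every default except the timeout and add one custom header.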

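HTTP Methods

Each HTTP method also takes a few extra arguments of its own, such as params for GET requests and data / json for POST, PUT, and DELETE requests.

Examples explain this best.

Note: the OPTIONS and HEAD methods are not supported at the moment.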
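
As a rough picture of where the method-specific arguments end up, the sketch below uses only the standard library to show the query string that params produces and the two body encodings that data and json stand for. The real encoding is done by the HTTP layer; describe_payload and its json_body parameter are hypothetical names used for illustration.

```python
import json
from urllib.parse import urlencode


def describe_payload(url, params=None, data=None, json_body=None):
    """Show roughly what a request would carry for the given arguments."""
    # params are appended to the URL as a query string
    full_url = f"{url}?{urlencode(params)}" if params else url
    if json_body is not None:
        body, content_type = json.dumps(json_body), "application/json"
    elif data is not None:
        body, content_type = urlencode(data), "application/x-www-form-urlencoded"
    else:
        body, content_type = None, None
    return {"url": full_url, "body": body, "content_type": content_type}


print(describe_payload("https://example.com/search", params={"q": "query"})["url"])
# prints: https://example.com/search?q=query
```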

GET

from scrapling.fetchers import Fetcher
# Basic GET
page = Fetcher.get('https://example.com')
page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True)
page = Fetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:***@localhost:8030')
# With parameters
page = Fetcher.get('https://example.com/search', params={'q': 'query'})

# With headers
page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
# Basic HTTP authentication
page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
# Browser impersonation
page = Fetcher.get('https://example.com', impersonate='chrome')
# HTTP/3 support
page = Fetcher.get('https://example.com', http3=True)

Async requests need only a small adjustment:

from scrapling.fetchers import AsyncFetcher
# Basic GET
page = await AsyncFetcher.get('https://example.com')
page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True)
page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:***@localhost:8030')
# With parameters
page = await AsyncFetcher.get('https://example.com/search', params={'q': 'query'})

# With headers
page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
# Basic HTTP authentication
page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
# Browser impersonation
page = await AsyncFetcher.get('https://example.com', impersonate='chrome110')
# HTTP/3 support
page = await AsyncFetcher.get('https://example.com', http3=True)
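
The await calls above have to run inside a coroutine. Here is a minimal sketch of a complete script that fetches several URLs one after another. fetch_titles is a hypothetical helper that accepts any object with an awaitable get method, so in real use you would pass AsyncFetcher, as main does.

```python
import asyncio


async def fetch_titles(fetcher, urls):
    """Fetch each URL in turn and return its <title> text (hypothetical helper)."""
    titles = []
    for url in urls:
        page = await fetcher.get(url)
        titles.append(page.css('title::text').get())
    return titles


async def main():
    from scrapling.fetchers import AsyncFetcher  # requires Scrapling and network access

    print(await fetch_titles(AsyncFetcher, ['https://example.com']))

# To run the script, call: asyncio.run(main())
```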

Needless to say, the page object in all of the cases above is a Response object, which is itself a Selector, so you can use it directly like this:

page.css('.something.something')

>>> page = Fetcher.get('https://api.github.com/events')
>>> page.json()
[{'id': '<redacted>',
  'type': 'PushEvent',
  'actor': {'id': '<redacted>',
   'login': '<redacted>',
   'display_login': '<redacted>',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/<redacted>',
   'avatar_url': 'https://avatars.githubusercontent.com/u/<redacted>'},
  'repo': {'id': '<redacted>',
...
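
Since page.json() returns ordinary Python lists and dictionaries, the result can be processed with the standard library. As a small sketch, count_event_types (a hypothetical helper) tallies the type field of each event; the sample list stands in for a real response.

```python
from collections import Counter


def count_event_types(events):
    """Count how often each event type appears in a list of event dicts."""
    return Counter(event.get('type', 'unknown') for event in events)


# In real use: count_event_types(page.json())
sample = [{'type': 'PushEvent'}, {'type': 'WatchEvent'}, {'type': 'PushEvent'}]
print(count_event_types(sample).most_common(1))  # prints: [('PushEvent', 2)]
```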

POST

from scrapling.fetchers import Fetcher
# Basic POST
page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, params={'q': 'query'})
page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True)
page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:***@localhost:8030', impersonate="chrome")
# Another example of form-encoded data
page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
# JSON data
page = Fetcher.post('https://example.com/api', json={'key': 'value'})

The async version again needs only a small adjustment:

from scrapling.fetchers import AsyncFetcher
# Basic POST
page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True)
page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:***@localhost:8030', impersonate="chrome")
# Another example of form-encoded data
page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
# JSON data
page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})

PUT

from scrapling.fetchers import Fetcher
# Basic PUT
page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, impersonate="chrome")
page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:***@localhost:8030')
# Another example of form-encoded data
page = Fetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})

Async version:

from scrapling.fetchers import AsyncFetcher
# Basic PUT
page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, impersonate="chrome")
page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:***@localhost:8030')
# Another example of form-encoded data
page = await AsyncFetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})

DELETE

from scrapling.fetchers import Fetcher
page = Fetcher.delete('https://example.com/resource/123')
page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, impersonate="chrome")
page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:***@localhost:8030')

Async version:

from scrapling.fetchers import AsyncFetcher
page = await AsyncFetcher.delete('https://example.com/resource/123')
page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, impersonate="chrome")
page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:***@localhost:8030')

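Session Management

If you want to send multiple requests with the same configuration, use the FetcherSession class. It works in both synchronous and asynchronous code: the class detects the context and switches the session type automatically, so no separate import is needed.

FetcherSession accepts nearly all of the arguments that the request methods support, so you can set defaults for the whole session and then override them for an individual request, as the examples below show.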

from scrapling.fetchers import FetcherSession

# Create a session with default configuration
with FetcherSession(
    impersonate='chrome',
    http3=True,
    stealthy_headers=True,
    timeout=30,
    retries=3
) as session:
    # Make multiple requests with the same settings and the same cookies
    page1 = session.get('https://scrapling.requestcatcher.com/get')
    page2 = session.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
    page3 = session.get('https://api.github.com/events')

    # All requests share the same session and connection pool

You can also combine a ProxyRotator with FetcherSession to rotate proxies automatically across requests:

from scrapling.fetchers import FetcherSession, ProxyRotator

rotator = ProxyRotator([
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
])

with FetcherSession(proxy_rotator=rotator, impersonate='chrome') as session:
    # Each request automatically uses the next proxy in rotation
    page1 = session.get('https://example.com/page1')
    page2 = session.get('https://example.com/page2')

    # You can check which proxy was used via the response metadata
    print(page1.meta['proxy'])
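
Conceptually, rotation means each request takes the next proxy from a repeating sequence. The sketch below illustrates that idea with itertools.cycle. RoundRobin is a hypothetical class written for this illustration, not the ProxyRotator implementation, whose selection strategy may differ.

```python
from itertools import cycle


class RoundRobin:
    """Hand out proxies in a fixed, repeating order (illustrative only)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(list(proxies))

    def next_proxy(self):
        return next(self._pool)


rotation = RoundRobin(['http://proxy1:8080', 'http://proxy2:8080'])
print([rotation.next_proxy() for _ in range(3)])
# prints: ['http://proxy1:8080', 'http://proxy2:8080', 'http://proxy1:8080']
```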

You can also override the session-level proxy (or rotator) by passing proxy= directly to a single request:

with FetcherSession(proxy='http://default-proxy:8080') as session:
    # Uses the session proxy
    page1 = session.get('https://example.com/page1')

    # Override the proxy for this specific request
    page2 = session.get('https://example.com/page2', proxy='http://special-proxy:9090')

Async example:

async with FetcherSession(impersonate='firefox', http3=True) as session:
    # All standard HTTP methods available
    response = await session.get('https://example.com')
    response = await session.post('https://scrapling.requestcatcher.com/post', json={'data': 'value'})
    response = await session.put('https://scrapling.requestcatcher.com/put', data={'update': 'info'})
    response = await session.delete('https://scrapling.requestcatcher.com/delete')

Or, better:

import asyncio
from scrapling.fetchers import FetcherSession

# Async session usage
async def main():
    async with FetcherSession(impersonate="safari") as session:
        urls = ['https://example.com/page1', 'https://example.com/page2']

        # Create the request coroutines first, then run them concurrently
        tasks = [
            session.get(url) for url in urls
        ]

        pages = await asyncio.gather(*tasks)
    return pages

asyncio.run(main())
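
When the URL list is long, starting every request at once can overload the target site or your proxy. Here is a minimal sketch of capping concurrency with asyncio.Semaphore. fetch_all and its limit parameter are hypothetical, and session is assumed to be an async FetcherSession (or any object with an awaitable get method).

```python
import asyncio


async def fetch_all(session, urls, limit=5):
    """Fetch all URLs with at most `limit` requests in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url):
        # Wait for a free slot before sending the request
        async with semaphore:
            return await session.get(url)

    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

Inside an async with FetcherSession(...) block, this would be called as pages = await fetch_all(session, urls, limit=3).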

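The Fetcher class itself works by creating a temporary FetcherSession each time you make a request.

Session Benefits

Much faster: 10 times faster than creating a new session for every request.
Cookie persistence: cookies are handled automatically across requests.
More resource-efficient: better memory and CPU usage when making many requests.
Centralized configuration: request settings are managed in one place.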

Examples

Below are some more complete examples to help those new to web scraping.

Basic HTTP Request

from scrapling.fetchers import Fetcher

# Make a request
page = Fetcher.get('https://example.com')

# Check the status
if page.status == 200:
    # Extract title
    title = page.css('title::text').get()
    print(f"Page title: {title}")

    # Extract all links
    links = page.css('a::attr(href)').getall()
    print(f"Found {len(links)} links")
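
The href values extracted above are often relative. As a small follow-up sketch, absolutize (a hypothetical helper) resolves them against the page URL with the standard library's urljoin and drops empty values and in-page anchors.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve hrefs against base_url, skipping empty values and in-page anchors."""
    return [
        urljoin(base_url, href)
        for href in hrefs
        if href and not href.startswith('#')
    ]


# In real use: absolutize('https://example.com', page.css('a::attr(href)').getall())
print(absolutize('https://example.com/docs/', ['/about', 'guide.html', '#top']))
# prints: ['https://example.com/about', 'https://example.com/docs/guide.html']
```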

Product Scraping

from scrapling.fetchers import Fetcher

def scrape_products():
    page = Fetcher.get('https://example.com/products')

    # Find all product elements
    products = page.css('.product')

    results = []
    for product in products:
        results.append({
            'title': product.css('.title::text').get(),
            'price': product.css('.price::text').re_first(r'\d+\.\d{2}'),
            'description': product.css('.description::text').get(),
            'in_stock': product.has_class('in-stock')
        })

    return results
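
The price in the example above is still a string. A minimal sketch of turning such text into a number follows, using the same pattern with Python's re module and Decimal to avoid floating-point rounding. parse_price is a hypothetical helper, and the pattern does not handle thousands separators or prices without decimals.

```python
import re
from decimal import Decimal

PRICE_PATTERN = re.compile(r'\d+\.\d{2}')


def parse_price(text):
    """Return the first price-like number in text as a Decimal, or None if absent."""
    match = PRICE_PATTERN.search(text or '')
    return Decimal(match.group()) if match else None


print(parse_price('Price: $19.99'))  # prints: 19.99
```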

Downloading Files

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/main_cover.png')
with open(file='main_cover.png', mode='wb') as f:
    f.write(page.body)

Handling Pagination

from scrapling.fetchers import Fetcher

def scrape_all_pages():
    base_url = 'https://example.com/products?page={}'
    page_num = 1
    all_products = []

    while True:
        # Get current page
        page = Fetcher.get(base_url.format(page_num))

        # Find products
        products = page.css('.product')
        if not products:
            break

        # Process products
        for product in products:
            all_products.append({
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get()
            })

        # Next page
        page_num += 1

    return all_products
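
The loop above stops only when a page has no products. Some sites return the last page again for any out-of-range page number, in which case the loop would never end. Here is a sketch of the same loop with an upper bound. scrape_pages, fetch_products, and max_pages are hypothetical names; fetch_products is assumed to be a callable that takes a page number and returns that page's list of products.

```python
def scrape_pages(fetch_products, max_pages=50):
    """Collect products page by page until a page is empty or max_pages is reached."""
    all_products = []
    for page_num in range(1, max_pages + 1):
        products = fetch_products(page_num)
        if not products:
            break  # an empty page means there is nothing left to collect
        all_products.extend(products)
    return all_products
```

In real use, fetch_products would wrap the request and the page.css('.product') extraction from the example above, ideally through a FetcherSession so the connection is reused across pages.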

Form Submission

from scrapling.fetchers import Fetcher

# Submit login form
response = Fetcher.post(
    'https://example.com/login',
    data={
        'username': 'user@example.com',
        'password': 'password123'
    }
)

# Check login success
if response.status == 200:
    # Extract user info
    user_name = response.css('.user-name::text').get()
    print(f"Logged in as: {user_name}")
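
Because Fetcher creates a temporary session for each request, cookies set by the login response are not carried into later Fetcher calls. Here is a sketch of logging in and then requesting a protected page through one session, so the login cookies persist. login_then_get, the /account URL, and the credentials are hypothetical, and session is assumed to be a FetcherSession opened with a with block.

```python
def login_then_get(session, login_url, credentials, protected_url):
    """Submit the login form, then fetch a page that requires the login cookies."""
    login_response = session.post(login_url, data=credentials)
    if login_response.status != 200:
        raise RuntimeError(f"login failed with status {login_response.status}")
    # The session keeps the cookies from the login response for this request
    return session.get(protected_url)


# In real use (requires Scrapling and network access):
# with FetcherSession() as session:
#     page = login_then_get(
#         session,
#         'https://example.com/login',
#         {'username': 'user@example.com', 'password': 'password123'},
#         'https://example.com/account',
#     )
```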

Table Extraction

from scrapling.fetchers import Fetcher

def extract_table():
    page = Fetcher.get('https://example.com/data')

    # Find table
    table = page.css('table')[0]

    # Extract headers
    headers = [
        th.text for th in table.css('thead th')
    ]

    # Extract rows
    rows = []
    for row in table.css('tbody tr'):
        cells = [td.text for td in row.css('td')]
        rows.append(dict(zip(headers, cells)))

    return rows
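
The rows returned above are plain dictionaries, so they can be written out with the standard csv module. Here is a minimal sketch, assuming every row has the same keys; rows_to_csv is a hypothetical helper.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of row dicts (same keys in every row) to CSV text."""
    if not rows:
        return ''
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# In real use: rows_to_csv(extract_table())
print(rows_to_csv([{'Name': 'Widget', 'Price': '9.99'}]))
```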

Navigation Menu

from scrapling.fetchers import Fetcher

def extract_menu():
    page = Fetcher.get('https://example.com')

    # Find navigation
    nav = page.css('nav')[0]

    menu = {}
    for item in nav.css('li'):
        links = item.css('a')
        if links:
            link = links[0]
            menu[link.text] = {
                'url': link['href'],
                'has_submenu': bool(item.css('.submenu'))
            }

    return menu

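When to Use

Use Fetcher when:

You need fast HTTP requests.
You want as little overhead as possible.
You don't need JavaScript execution (the target site can be scraped with plain requests).
You need some stealth features (for example, the target site has protection but doesn't use JavaScript challenges).

Use FetcherSession when:

You are making multiple requests to the same site or to different sites.
You need to keep cookies / authentication state across requests.
You want connection pooling for better performance.
You need consistent configuration across requests.
You are working with APIs that depend on session state.

Use the other fetchers when:

You need browser automation.
You need more advanced anti-bot / stealth capabilities.
You need JavaScript support or have to interact with dynamic content.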
