Proxy Management and Block Handling

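Learn how to rotate proxies in a Scrapling spider, override the proxy for individual requests, and handle blocked requests with the built-in block detection and retry mechanism.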


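Introduction

When scraping at scale, you usually need to rotate across multiple proxies to avoid rate limits and blocks. Scrapling's ProxyRotator makes this straightforward. It works with every session type and integrates with the spider's blocked-request retry system.

If you don't know what a proxy is, or how to choose a suitable one, see this guide.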

ProxyRotator

The ProxyRotator class manages a list of proxies and rotates through them automatically. Pass it to any session type via the proxy_rotator parameter:

from scrapling.spiders import Spider, Response
from scrapling.fetchers import FetcherSession, ProxyRotator

class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    def configure_sessions(self, manager):
        rotator = ProxyRotator([
            "http://proxy1:8080",
            "http://proxy2:8080",
            "http://user:***@proxy3:8080",
        ])
        manager.add("default", FetcherSession(proxy_rotator=rotator))

    async def parse(self, response: Response):
        # Check which proxy this request actually used
        print(f"Proxy used: {response.meta.get('proxy')}")
        yield {"title": response.css("title::text").get("")}

Each request automatically gets the next proxy in the rotation. The proxy that was used is stored in response.meta["proxy"], so you can track which proxy fetched which page.
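Because the proxy is recorded on every response, keeping a per-proxy tally takes only a few lines. The following is a minimal sketch in plain Python: ProxyStats and its record() method are hypothetical helpers, not part of Scrapling, and the dictionaries in the loop stand in for response.meta. It assumes that a dict-form proxy, if one shows up in the meta, carries a "server" key.

```python
from collections import Counter


class ProxyStats:
    """Hypothetical helper (not part of Scrapling) that counts pages per proxy."""

    def __init__(self):
        self.pages_per_proxy = Counter()

    def record(self, meta):
        # `meta` stands in for response.meta
        proxy = meta.get("proxy")
        if isinstance(proxy, dict):
            # Assumption: dict-form proxies are reduced to their "server" value
            proxy = proxy.get("server")
        # Requests sent without a proxy are grouped under "direct"
        self.pages_per_proxy[proxy or "direct"] += 1


stats = ProxyStats()
for meta in (
    {"proxy": "http://proxy1:8080"},
    {"proxy": "http://proxy2:8080"},
    {"proxy": "http://proxy1:8080"},
    {},
):
    stats.record(meta)

print(dict(stats.pages_per_proxy))
```

In a spider, the same call would sit inside parse(), for example stats.record(response.meta).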

Browser sessions need a few corresponding adjustments, for example:

from scrapling.fetchers import AsyncDynamicSession, AsyncStealthySession, ProxyRotator

# String proxies work with every session type
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])

# Dict proxies (Playwright format) work with browser sessions
rotator = ProxyRotator([
    {"server": "http://proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2:8080"},
])

# Then use it inside the spider

def configure_sessions(self, manager):
    rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
    manager.add("browser", AsyncStealthySession(proxy_rotator=rotator))

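Custom Rotation Strategies

By default, ProxyRotator uses a cyclic rotation strategy: it walks through the proxies in order and wraps around to the first one after the last.

To change this behavior, pass a custom strategy function. It must match the following signature: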

from scrapling.core._types import ProxyType

def my_strategy(proxies: list, current_index: int) -> tuple[ProxyType, int]:
    ...

It receives the list of proxies and the current index, and must return the chosen proxy along with the index to use next.
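To make the contract concrete, here is a sketch of what a round-robin strategy could look like when written against this signature. It is an illustration only, not ProxyRotator's actual implementation. It assumes the rotator feeds the returned index back in as current_index on the next call. The driver loop below plays the rotator's role.

```python
def cyclic_strategy(proxies, current_index):
    # Wrap the index so it stays valid even if the caller passes len(proxies)
    idx = current_index % len(proxies)
    # Return the chosen proxy and the index to use on the next call
    return proxies[idx], (idx + 1) % len(proxies)


proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]
index = 0
picked = []
for _ in range(4):
    # Stand-in for the rotator: pass the previous "next index" back in
    proxy, index = cyclic_strategy(proxies, index)
    picked.append(proxy)

print(picked)
```

The fourth pick wraps around to the first proxy, which is the cyclic behavior described above.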

Below are some custom rotation strategies you can use as-is.

Random rotation

import random
from scrapling.fetchers import ProxyRotator

def random_strategy(proxies, current_index):
    idx = random.randint(0, len(proxies) - 1)
    return proxies[idx], idx

rotator = ProxyRotator(
    ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"],
    strategy=random_strategy,
)

Weighted rotation

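import random
from scrapling.fetchers import ProxyRotator

def weighted_strategy(proxies, current_index):
    # With a single proxy there is nothing to weight (this also avoids dividing by zero below)
    if len(proxies) == 1:
        return proxies[0], current_index
    # The first proxy gets 60% of the traffic; the others split the remaining 40% evenly
    others = len(proxies) - 1
    weights = [60] + [40 / others] * others
    proxy = random.choices(proxies, weights=weights, k=1)[0]
    return proxy, current_index  # The index is irrelevant for weighted rotation

proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]
rotator = ProxyRotator(proxies, strategy=weighted_strategy)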
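A strategy is an ordinary function, so it can be exercised without any network traffic before it is handed to a rotator. The sketch below calls a weighted strategy of the same shape as the one above many times and tallies the picks. The proxy URLs are placeholders, the sample size of 10,000 is arbitrary, and the function assumes at least two proxies.

```python
import random
from collections import Counter


def weighted_strategy(proxies, current_index):
    # 60% for the first proxy, the rest shared evenly (needs two or more proxies)
    others = len(proxies) - 1
    weights = [60] + [40 / others] * others
    return random.choices(proxies, weights=weights, k=1)[0], current_index


random.seed(0)  # Fixed seed so the run is repeatable
proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]
counts = Counter(weighted_strategy(proxies, 0)[0] for _ in range(10_000))

# Print the observed share of traffic for each proxy
for proxy in proxies:
    print(proxy, f"{counts[proxy] / 10_000:.1%}")
```

The first proxy's share should land near 60%, with the remainder split roughly evenly between the other two.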

Per-Request Proxy Override

You can override the rotator for a single request by passing proxy= as a keyword argument:

async def parse(self, response: Response):
    # This request uses the next proxy from the rotator
    yield response.follow("/page1", callback=self.parse_page)

    # This request uses the given proxy and bypasses the rotator
    yield response.follow(
        "/special-page",
        callback=self.parse_page,
        proxy="http://special-proxy:8080",
    )

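This is especially useful when certain pages must go through a specific proxy, for example a proxy in a particular location to reach region-restricted content.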
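One way to organize such overrides is a small lookup that maps URL paths to region-specific proxies. The sketch below is plain Python and makes no Scrapling calls. REGION_PROXIES, proxy_for(), the path prefixes, and the proxy hosts are all hypothetical names chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL path prefix to a region-specific proxy
REGION_PROXIES = {
    "/de/": "http://de-proxy:8080",
    "/jp/": "http://jp-proxy:8080",
}


def proxy_for(url):
    """Return the proxy pinned to this URL's region, or None if there is none."""
    path = urlparse(url).path
    for prefix, proxy in REGION_PROXIES.items():
        if path.startswith(prefix):
            return proxy
    return None


print(proxy_for("https://example.com/de/offers"))
print(proxy_for("https://example.com/about"))
```

In a callback, the result can decide whether to pass proxy= at all. When proxy_for() returns None, leave the keyword out so that the rotator keeps choosing the proxy.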

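Blocked Request Handling

The spider has built-in detection and retry logic for blocked requests. By default, it treats the following HTTP status codes as blocked: 401, 403, 407, 429, 444, 500, 502, 503, 504.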

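The retry system works as follows:

1. When a response arrives, the spider calls is_blocked(response).
2. If the response is judged blocked, the spider copies the request and calls retry_blocked_request() so you can modify it before the retry.
3. The retry is re-queued with dont_filter=True (bypassing deduplication) and a lower priority, so it is not retried immediately.
4. This repeats up to max_blocked_retries times (default: 3).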
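The steps above can be modeled in a few lines of plain Python. This is a simplified toy, not Scrapling's internal code. ToyRequest, handle_response(), and the plain list used as a queue are hypothetical stand-ins. The model also assumes that a smaller priority number means lower priority, and it skips the retry_blocked_request() hook.

```python
import copy
from dataclasses import dataclass

# Status codes listed above as blocked by default
BLOCKED_CODES = {401, 403, 407, 429, 444, 500, 502, 503, 504}
MAX_BLOCKED_RETRIES = 3


@dataclass
class ToyRequest:
    url: str
    priority: int = 0
    dont_filter: bool = False
    retries: int = 0


def handle_response(request, status, queue):
    """Return True if the response is accepted, False if it was blocked."""
    if status not in BLOCKED_CODES:  # Step 1: block detection
        return True
    if request.retries >= MAX_BLOCKED_RETRIES:  # Step 4: retry budget used up
        return False
    retry = copy.copy(request)  # Step 2: copy the request
    retry.retries += 1
    retry.dont_filter = True  # Step 3: bypass deduplication
    retry.priority -= 1  # Step 3: lower priority (toy convention)
    queue.append(retry)
    return False


queue = [ToyRequest("https://example.com")]
attempts = 0
while queue:
    request = queue.pop(0)
    attempts += 1
    handle_response(request, 403, queue)  # Pretend every attempt is blocked

print(attempts)
```

With every attempt blocked, the model makes one original attempt plus three retries and then gives up.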

Custom Block Detection

Override is_blocked() to add your own detection logic:

class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    async def is_blocked(self, response: Response) -> bool:
        # Check status codes (a subset of the default list)
        if response.status in {403, 429, 503}:
            return True

        # Check the response body
        body = response.body.decode("utf-8", errors="ignore")
        if "access denied" in body.lower() or "rate limit" in body.lower():
            return True

        return False

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}

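Custom Retry Behavior

You can also override retry_blocked_request() to modify the request before it is retried. The max_blocked_retries attribute controls how many times a blocked request is retried (default: 3):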

from scrapling.spiders import Spider, SessionManager, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    max_blocked_retries = 5

    def configure_sessions(self, manager: SessionManager) -> None:
        manager.add('requests', FetcherSession(impersonate=['chrome', 'firefox', 'safari']))
        manager.add('stealth', AsyncStealthySession(block_webrtc=True), lazy=True)

    async def retry_blocked_request(self, request: Request, response: Response) -> Request:
        request.sid = "stealth"
        self.logger.info(f"Retrying blocked request: {request.url}")
        return request

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}

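In essence, this keeps the default block detection and has the spider use the requests session first; once a request is blocked, it is retried through the stealth browser session.

Putting it all together: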

from scrapling.spiders import Spider, SessionManager, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession, ProxyRotator

cheap_proxies = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])

# Proxy format accepted by browser sessions
expensive_proxies = ProxyRotator([
    {"server": "http://residential_proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://residential_proxy2:8080", "username": "user", "password": "pass"},
    {"server": "http://mobile_proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://mobile_proxy2:8080", "username": "user", "password": "pass"},
])

class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    max_blocked_retries = 5

    def configure_sessions(self, manager: SessionManager) -> None:
        manager.add('requests', FetcherSession(impersonate=['chrome', 'firefox', 'safari'], proxy_rotator=cheap_proxies))
        manager.add('stealth', AsyncStealthySession(block_webrtc=True, proxy_rotator=expensive_proxies), lazy=True)

    async def retry_blocked_request(self, request: Request, response: Response) -> Request:
        request.sid = "stealth"
        self.logger.info(f"Retrying blocked request: {request.url}")
        return request

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}

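In other words: requests start on cheap proxies (for example, datacenter proxies); once blocked, they are retried through higher-quality proxies (for example, residential or mobile proxies).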
