
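Learn about Scrapling's Fetcher and FetcherSession: the shared parameters, the common HTTP methods, and how sessions are managed.

The Fetcher class is built on the high-performance curl_cffi library. It provides fast, lightweight HTTP requests and comes with many stealth features.

You will mostly import this Fetcher as shown below; all fetchers are imported the same way.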

from scrapling.fetchers import Fetcher

For how to configure the parsing options, see here.

All of the request methods here share a set of parameters, so those are explained first.

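  • url: the target URL.
  • stealthy_headers: when enabled (the default), generates and attaches real browser headers. It also sets a Google referer header.
  • follow_redirects: controls redirect behavior. The default is "safe", which follows redirects but refuses to jump to internal or private IP addresses (SSRF protection). Pass True to follow all redirects without restriction, or False to disable redirects entirely.
  • timeout: the number of seconds to wait for each request to complete. Defaults to 30 seconds.
  • retries: the number of times the fetcher retries a failed request. Defaults to 3.
  • retry_delay: the number of seconds to wait between retries. Defaults to 1 second.
  • impersonate: impersonates the TLS fingerprint of a specific browser. Accepts a browser string or a list of strings, for example "chrome110", "firefox102", or "safari15_5" to pin a specific version, or "chrome", "firefox", "safari", or "edge" to use the latest available version automatically. This makes your requests look, at the TLS layer, as if a real browser sent them. If you pass a list of strings, one is chosen at random for each request. Defaults to the latest available Chrome version.
  • http3: makes the request over HTTP/3. Defaults to False. It may cause problems when used together with impersonate.
  • cookies: the cookies to use in the request. Either a dictionary of name→value pairs or a list of dictionaries.
  • proxy: as the name suggests, the proxy used for this request; all traffic (HTTP and HTTPS) is routed through it. Supported formats include http://username:***@localhost:8030
  • proxy_auth: HTTP Basic Auth for the proxy, as a (username, password) tuple.
  • proxies: a dictionary of proxies, in the format {"http": proxy_url, "https": proxy_url}
  • proxy_rotator: a ProxyRotator instance used to rotate proxies automatically. It cannot be combined with proxy or proxies.
  • headers: headers to attach to the request. They can override any header generated by stealthy_headers.
  • max_redirects: the maximum number of redirects. Defaults to 30; pass -1 for no limit.
  • verify: whether to verify HTTPS certificates. Defaults to True.
  • cert: a tuple of client certificate file names, (cert, key).
  • selector_config: a dictionary of custom parsing arguments passed in when the final Selector / Response class is created.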
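To make the "safe" mode of follow_redirects more concrete, the sketch below shows one way a redirect target could be screened for internal or private addresses using only the Python standard library. It illustrates the idea and is not Scrapling's actual implementation: the function name is_safe_redirect_target is hypothetical, and the check only covers literal IP hosts and localhost (a real guard would also have to resolve DNS names).

```python
import ipaddress
from urllib.parse import urlsplit


def is_safe_redirect_target(url: str) -> bool:
    """Return False when the URL's host is a literal internal/private address."""
    host = urlsplit(url).hostname
    if host is None:
        return False  # No host to inspect, so refuse to follow
    if host.lower() == "localhost":
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP (e.g. a DNS name); this sketch does not resolve it
        return True
    return not (
        address.is_private
        or address.is_loopback
        or address.is_link_local
        or address.is_reserved
    )


print(is_safe_redirect_target("https://example.com/next"))
print(is_safe_redirect_target("http://127.0.0.1:8000/admin"))
print(is_safe_redirect_target("http://10.0.0.5/internal"))
```

The first call is accepted and the other two are rejected, which is the kind of decision a "safe" redirect policy has to make before following a Location header.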

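Beyond these, if you need deeper customization, you can also pass any of the methods arguments that curl_cffi supports but that the method does not yet expose directly.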

Each HTTP method also has a few extra parameters of its own, such as params for GET requests and data / json for POST / PUT / DELETE requests.
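As a rough illustration of how these three kinds of payload usually differ on the wire, the following standard-library sketch encodes each one by hand. It is not Scrapling or curl_cffi code, and the exact encoding those libraries produce may differ in details such as key order or spacing.

```python
import json
from urllib.parse import urlencode

# `params` end up in the URL's query string
params = {"q": "query", "page": 2}
url = "https://example.com/search?" + urlencode(params)

# `data` is typically sent form-encoded in the request body
form_body = urlencode({"username": "user", "password": "pass"})

# `json` is serialized to a JSON document in the request body
json_body = json.dumps({"key": "value"})

print(url)
print(form_body)
print(json_body)
```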

Examples explain this best.

Note: the OPTIONS and HEAD methods are not currently supported.

from scrapling.fetchers import Fetcher
# Basic GET
page = Fetcher.get('https://example.com')
page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True)
page = Fetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:***@localhost:8030')
# With parameters
page = Fetcher.get('https://example.com/search', params={'q': 'query'})
# With headers
page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
# Basic HTTP authentication
page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
# Browser impersonation
page = Fetcher.get('https://example.com', impersonate='chrome')
# HTTP/3 support
page = Fetcher.get('https://example.com', http3=True)

Asynchronous requests need only a small change:

from scrapling.fetchers import AsyncFetcher
# Basic GET
page = await AsyncFetcher.get('https://example.com')
page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True)
page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:***@localhost:8030')
# With parameters
page = await AsyncFetcher.get('https://example.com/search', params={'q': 'query'})
# With headers
page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
# Basic HTTP authentication
page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
# Browser impersonation
page = await AsyncFetcher.get('https://example.com', impersonate='chrome110')
# HTTP/3 support
page = await AsyncFetcher.get('https://example.com', http3=True)

Needless to say, the page object in all of the cases above is a Response object, which is itself a Selector, so you can use it directly like this:

page.css('.something.something')
page = Fetcher.get('https://api.github.com/events')
>>> page.json()
[{'id': '<redacted>',
'type': 'PushEvent',
'actor': {'id': '<redacted>',
'login': '<redacted>',
'display_login': '<redacted>',
'gravatar_id': '',
'url': 'https://api.github.com/users/<redacted>',
'avatar_url': 'https://avatars.githubusercontent.com/u/<redacted>'},
'repo': {'id': '<redacted>',
...
from scrapling.fetchers import Fetcher
# Basic POST
page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, params={'q': 'query'})
page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True)
page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:***@localhost:8030', impersonate="chrome")
# Another example of form-encoded data
page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
# JSON data
page = Fetcher.post('https://example.com/api', json={'key': 'value'})

The async version, again, needs only a small change:

from scrapling.fetchers import AsyncFetcher
# Basic POST
page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True)
page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:***@localhost:8030', impersonate="chrome")
# Another example of form-encoded data
page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
# JSON data
page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
from scrapling.fetchers import Fetcher
# Basic PUT
page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, impersonate="chrome")
page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:***@localhost:8030')
# Another example of form-encoded data
page = Fetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})

Async version:

from scrapling.fetchers import AsyncFetcher
# Basic PUT
page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, impersonate="chrome")
page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:***@localhost:8030')
# Another example of form-encoded data
page = await AsyncFetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})
from scrapling.fetchers import Fetcher
page = Fetcher.delete('https://example.com/resource/123')
page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, impersonate="chrome")
page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:***@localhost:8030')

Async version:

from scrapling.fetchers import AsyncFetcher
page = await AsyncFetcher.delete('https://example.com/resource/123')
page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, impersonate="chrome")
page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:***@localhost:8030')

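If you are going to make multiple requests with the same configuration, use the FetcherSession class. It works in both synchronous and asynchronous code; the class detects the context and switches session type automatically, so no separate import is needed.

The FetcherSession class accepts nearly all of the arguments that these methods support, so you can set defaults for the whole session and then easily override them for a single request, as the examples below show.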
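The "session defaults, per-request overrides" idea can be modelled in a few lines. The sketch below is only a mental model under the assumption that per-request values take precedence over session-level ones; ConfigSketch and resolve are hypothetical names and do not exist in Scrapling.

```python
from typing import Any, Dict


class ConfigSketch:
    """Hypothetical stand-in that only models how settings are combined."""

    def __init__(self, **defaults: Any) -> None:
        self.defaults = defaults

    def resolve(self, **overrides: Any) -> Dict[str, Any]:
        # Per-request values win over session-level defaults;
        # the defaults themselves are left untouched
        return {**self.defaults, **overrides}


session_config = ConfigSketch(impersonate="chrome", timeout=30, retries=3)
print(session_config.resolve())           # session defaults only
print(session_config.resolve(timeout=5))  # one request with a shorter timeout
```

Because resolve builds a new dictionary each time, an override applies to one request only and the next request falls back to the session defaults.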

from scrapling.fetchers import FetcherSession
# Create a session with default configuration
with FetcherSession(
    impersonate='chrome',
    http3=True,
    stealthy_headers=True,
    timeout=30,
    retries=3
) as session:
    # Make multiple requests with the same settings and the same cookies
    page1 = session.get('https://scrapling.requestcatcher.com/get')
    page2 = session.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
    page3 = session.get('https://api.github.com/events')
    # All requests share the same session and connection pool

You can also combine ProxyRotator with FetcherSession to rotate proxies automatically across requests:

from scrapling.fetchers import FetcherSession, ProxyRotator
rotator = ProxyRotator([
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
])
with FetcherSession(proxy_rotator=rotator, impersonate='chrome') as session:
    # Each request automatically uses the next proxy in rotation
    page1 = session.get('https://example.com/page1')
    page2 = session.get('https://example.com/page2')
    # You can check which proxy was used via the response metadata
    print(page1.meta['proxy'])
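For readers who want a picture of what "the next proxy in rotation" means, here is a minimal round-robin sketch built on itertools.cycle. It assumes a plain round-robin order, which may not match how ProxyRotator actually picks proxies; RoundRobinSketch and next_proxy are hypothetical names used only for illustration.

```python
from itertools import cycle
from typing import List


class RoundRobinSketch:
    """Hypothetical rotator that hands out proxies in a fixed, repeating order."""

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self) -> str:
        # Each call advances the rotation by one position
        return next(self._pool)


rotator_sketch = RoundRobinSketch([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])
used = [rotator_sketch.next_proxy() for _ in range(4)]
print(used)  # the fourth request wraps around to the first proxy
```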

You can also override the session-level proxy (or rotator) by passing proxy= directly on a single request:

with FetcherSession(proxy='http://default-proxy:8080') as session:
    # Uses the session proxy
    page1 = session.get('https://example.com/page1')
    # Override the proxy for this specific request
    page2 = session.get('https://example.com/page2', proxy='http://special-proxy:9090')

Async example:

async with FetcherSession(impersonate='firefox', http3=True) as session:
    # All standard HTTP methods available
    response = await session.get('https://example.com')
    response = await session.post('https://scrapling.requestcatcher.com/post', json={'data': 'value'})
    response = await session.put('https://scrapling.requestcatcher.com/put', data={'update': 'info'})
    response = await session.delete('https://scrapling.requestcatcher.com/delete')

Or, better:

import asyncio
from scrapling.fetchers import FetcherSession
# Async session usage
async with FetcherSession(impersonate="safari") as session:
    urls = ['https://example.com/page1', 'https://example.com/page2']
    tasks = [
        session.get(url) for url in urls
    ]
    pages = await asyncio.gather(*tasks)
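When the URL list grows, gathering every request at once can put a lot of load on the target site. One common pattern, sketched below, caps the number of requests in flight with asyncio.Semaphore. The coroutine fake_get is a hypothetical placeholder for session.get(url) so that the example runs without any network access; fetch_all and its limit parameter are likewise illustrative names.

```python
import asyncio
from typing import List


async def fake_get(url: str) -> str:
    """Hypothetical placeholder for `session.get(url)`; performs no network I/O."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


async def fetch_all(urls: List[str], limit: int = 2) -> List[str]:
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with semaphore:  # At most `limit` requests run at once
            return await fake_get(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))


page_urls = [f"https://example.com/page{i}" for i in range(1, 6)]
fetched = asyncio.run(fetch_all(page_urls))
print(len(fetched), fetched[0])
```

In real code, the body of bounded would await the session's own request method instead of fake_get.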

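The Fetcher class itself works by creating a temporary FetcherSession each time you make a request. Reusing a single session has these benefits:

  • Much faster: 10 times faster than creating a new session for every request.
  • Cookie persistence: cookies are handled automatically across requests.
  • More resource-efficient: better memory and CPU usage when making many requests.
  • Centralized configuration: request settings are managed in one place.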

Below are some more complete examples to help readers who are new to web scraping.

from scrapling.fetchers import Fetcher
# Make a request
page = Fetcher.get('https://example.com')
# Check the status
if page.status == 200:
    # Extract title
    title = page.css('title::text').get()
    print(f"Page title: {title}")
    # Extract all links
    links = page.css('a::attr(href)').getall()
    print(f"Found {len(links)} links")

from scrapling.fetchers import Fetcher
def scrape_products():
    page = Fetcher.get('https://example.com/products')
    # Find all product elements
    products = page.css('.product')
    results = []
    for product in products:
        results.append({
            'title': product.css('.title::text').get(),
            'price': product.css('.price::text').re_first(r'\d+\.\d{2}'),
            'description': product.css('.description::text').get(),
            'in_stock': product.has_class('in-stock')
        })
    return results
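The price pattern used above, \d+\.\d{2}, is worth trying on a few sample strings before relying on it. The sketch below does that with the standard re module, under the assumption that re_first returns the first regex match in the text; first_price is a hypothetical helper, not part of Scrapling.

```python
import re
from typing import Optional

PRICE_PATTERN = re.compile(r"\d+\.\d{2}")


def first_price(text: str) -> Optional[str]:
    """Return the first substring that looks like a price, or None."""
    match = PRICE_PATTERN.search(text)
    return match.group(0) if match else None


print(first_price("Price: $19.99"))  # '19.99'
print(first_price("$1,299.00"))      # '299.00': the thousands separator splits the number
print(first_price("Free"))           # None
```

If the target site formats prices with thousands separators, the pattern needs to allow for them, or the text needs to be cleaned first.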
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/main_cover.png')
with open(file='main_cover.png', mode='wb') as f:
    f.write(page.body)

from scrapling.fetchers import Fetcher
def scrape_all_pages():
    base_url = 'https://example.com/products?page={}'
    page_num = 1
    all_products = []
    while True:
        # Get current page
        page = Fetcher.get(base_url.format(page_num))
        # Find products
        products = page.css('.product')
        if not products:
            break
        # Process products
        for product in products:
            all_products.append({
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get()
            })
        # Next page
        page_num += 1
    return all_products
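The while True loop above stops only when a page comes back with no products. Some sites keep serving the last page for any higher page number, so it can be worth adding an upper bound. The sketch below shows that variation with a fake in-memory data source so it runs offline; collect_pages, fake_fetch_items, FAKE_SITE, and max_pages are all hypothetical names, and in real code fetch_items would wrap the Fetcher.get call and the .product selector.

```python
from typing import Callable, Dict, List


def collect_pages(fetch_items: Callable[[int], List[str]], max_pages: int = 100) -> List[str]:
    """Walk numbered pages until one is empty or `max_pages` is reached."""
    collected: List[str] = []
    for page_num in range(1, max_pages + 1):
        items = fetch_items(page_num)
        if not items:
            break  # An empty page marks the end of the listing
        collected.extend(items)
    return collected


# Fake data source standing in for real HTTP requests
FAKE_SITE: Dict[int, List[str]] = {1: ["a", "b"], 2: ["c"]}


def fake_fetch_items(page_num: int) -> List[str]:
    return FAKE_SITE.get(page_num, [])


print(collect_pages(fake_fetch_items))               # ['a', 'b', 'c']
print(collect_pages(fake_fetch_items, max_pages=1))  # ['a', 'b']
```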
from scrapling.fetchers import Fetcher
# Submit login form
response = Fetcher.post(
    'https://example.com/login',
    data={
        'username': 'user@example.com',
        'password': 'password123'
    }
)
# Check login success
if response.status == 200:
    # Extract user info
    user_name = response.css('.user-name::text').get()
    print(f"Logged in as: {user_name}")

from scrapling.fetchers import Fetcher
def extract_table():
    page = Fetcher.get('https://example.com/data')
    # Find table
    table = page.css('table')[0]
    # Extract headers
    headers = [
        th.text for th in table.css('thead th')
    ]
    # Extract rows
    rows = []
    for row in table.css('tbody tr'):
        cells = [td.text for td in row.css('td')]
        rows.append(dict(zip(headers, cells)))
    return rows
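One detail of the dict(zip(headers, cells)) idiom used above: zip stops at the shorter input, so a row with fewer cells than headers silently loses the missing columns. The standard-library sketch below contrasts that with itertools.zip_longest, which pads the row instead. The sample headers and cells are made up for illustration.

```python
from itertools import zip_longest

table_headers = ["name", "price", "stock"]
row_cells = ["Widget", "9.99"]  # Row with a missing trailing cell

# zip() drops the header that has no matching cell
truncated = dict(zip(table_headers, row_cells))

# zip_longest() keeps every header and fills the gap with None
padded = dict(zip_longest(table_headers, row_cells, fillvalue=None))

print(truncated)  # {'name': 'Widget', 'price': '9.99'}
print(padded)     # {'name': 'Widget', 'price': '9.99', 'stock': None}
```

Note that zip_longest also pads the header side, so a row with more cells than headers would end up with a None key; which behavior is preferable depends on the table being scraped.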
from scrapling.fetchers import Fetcher
def extract_menu():
    page = Fetcher.get('https://example.com')
    # Find navigation
    nav = page.css('nav')[0]
    menu = {}
    for item in nav.css('li'):
        links = item.css('a')
        if links:
            link = links[0]
            menu[link.text] = {
                'url': link['href'],
                'has_submenu': bool(item.css('.submenu'))
            }
    return menu

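Use Fetcher when:

  • You need fast HTTP requests.
  • You want as little overhead as possible.
  • You do not need to execute JavaScript (the target site can be scraped through plain requests).
  • You need some stealth features (for example, the target site has protection but does not use a JavaScript challenge).

Use FetcherSession when:

  • You are making multiple requests to the same site or to different sites.
  • You need to keep cookies / authentication state across requests.
  • You want connection pooling for better performance.
  • You need consistent configuration across requests.
  • You are working with an API that depends on session state.

Use the other fetchers when:

  • You need browser automation.
  • You need more advanced anti-bot / stealth capabilities.
  • You need JavaScript support, or need to interact with dynamic content.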